FreeBSD Remote Process Plugin on Non-x86 Architectures

By Michał Górny

February 4, 2021


Moritz Systems have started a new contract with the FreeBSD Foundation to continue our work on modernizing the LLDB debugger’s support for FreeBSD. During the previous contract, we introduced a FreeBSD Remote Process Plugin utilizing the more modern client-server layout of LLDB.

We have achieved feature parity with the original FreeBSD plugin on the x86 architecture. However, as of today, other architectures still use the original plugin. During the next two months, we are going to work on bringing the remaining previously supported architectures to the new layout, with a special focus on providing first-class support for the ARM64 (also known as AArch64) architecture. Afterwards, we are going to continue improving FreeBSD support in LLDB.

The complete Project Schedule is divided into four milestones, each taking approximately one month:

M1 Switch all the non-x86 CPUs to the LLDB FreeBSD Remote-Process-Plugin.
M2 Iteration over regression tests on ARM64 and fixing known bugs, marking the non-trivial ones for future work. Remove the old local-only Process-Plugin.
M3 Implement follow-fork and follow-vfork operations on par with the GNU GDB support. Cover the functionality with LLDB regression tests.
M4 Implement SaveCore functionality for FreeBSD and enhance the regression testing of core files in LLDB. Update the FreeBSD manual.

Cross-compiling LLDB to other FreeBSD architectures

A short introduction to cross-compilation

Cross-compilation is a technique that permits using a compiler running on one platform to create executables for another platform. It can be used to build software for another CPU architecture (e.g. ARM64 executables on an x86 system), for another operating system (e.g. FreeBSD packages from Linux), or both.

The most common use case for cross-compilation is to use a single development environment to produce executables for multiple target platforms. More importantly for our case, it permits building software much faster than running a native compiler via an emulator or on hardware that is much less performant than modern x86 PCs (e.g. commonly available ARM boards).

An important limitation of cross-compilation is that the resulting executables cannot be executed on the platform running the compiler. This means that the tools needed at build time need to be built separately for the build platform; depending on the build system, this either needs to be done manually or is done automatically as part of cross-compilation. This also means that the build scripts cannot perform checks that require running a test program.

There are two main prerequisites to cross-compilation:

A cross-toolchain, i.e. the compiler and link editor capable of producing executables for the target platform.
A sysroot, i.e. the system libraries and dependencies compiled for the target platform.

Preparing the cross-compiler and sysroot

Ordinarily, to obtain a cross-toolchain, you need to build the compiler for a specific target. However, the Clang compiler that is used by default on FreeBSD has integrated cross-compilation support. Rather than rebuilding the whole compiler for each target, it is sufficient to ensure that appropriate target support is enabled at build time. Therefore, the standard Clang builds on FreeBSD are sufficient to cross-build for ARM and ARM64.
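To make this concrete, here is a minimal sketch of how a single Clang binary can be pointed at another platform. The helper name `cross_compile_command` and the file names `hello.c` and `hello` are our own, chosen for illustration only; the target triple and sysroot path are the ones used later in this article.

```python
import shlex


def cross_compile_command(source, output, triple, sysroot, compiler="clang"):
    """Return a Clang argument list that builds `source` for `triple`.

    The headers and libraries are taken from `sysroot` rather than from
    the system running the compiler.
    """
    return [
        compiler,
        "-target", triple,      # platform we are building for
        "--sysroot", sysroot,   # where the target's headers and libraries live
        "-o", output,
        source,
    ]


cmd = cross_compile_command(
    "hello.c", "hello", "aarch64-unknown-freebsd13.0", "/sysroot/arm64"
)
# Print the command in a form that can be pasted into a shell.
print(shlex.join(cmd))
```

The resulting executable would have to be copied to an ARM64 system (or a VM) to be run, since it cannot be executed on the build host.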

Cross-compiling a FreeBSD sysroot is very similar to building it natively. The only difference is the necessity of passing a TARGET_ARCH variable specifying the target architecture. For example, to build an arm64 sysroot, we run the following commands in /usr/src:

make -j$(sysctl -n hw.ncpu) buildworld TARGET_ARCH=aarch64
make -j$(sysctl -n hw.ncpu) installworld TARGET_ARCH=aarch64 \
    DESTDIR=/sysroot/arm64
make -j$(sysctl -n hw.ncpu) distribution TARGET_ARCH=aarch64 \
    DESTDIR=/sysroot/arm64

Cross-compiling LLVM

LLVM uses the CMake build system. CMake facilitates part of the cross-compilation setup itself, while the rest is handled in LLVM-specific CMake files.

The first step towards cross-compiling CMake-based projects is to create a toolchain file. This file is used to set some internal CMake variables that cannot be directly overridden via the command line. One toolchain file can be shared between multiple projects, so it is also a convenient place to set standard cross-related CMake options.

For our purpose, toolchain-arm64.cmake contained the following variables:

# Since we are using clang as the compiler and it is the default
# on FreeBSD, we do not need to override the compiler.  However,
# we do need to pass a correct -target indicating the platform we're
# building for and the path to our sysroot.
set(CMAKE_C_FLAGS
    "-target aarch64-unknown-freebsd13.0 --sysroot /sysroot/arm64")
set(CMAKE_CXX_FLAGS
    "-target aarch64-unknown-freebsd13.0 --sysroot /sysroot/arm64")

# While this may seem redundant, setting it explicitly (even to the same
# value) actually causes CMake to consider itself to be cross-compiling.
# This is important since LLVM relies on CMAKE_CROSSCOMPILING being set.
set(CMAKE_SYSTEM_NAME "FreeBSD")

# Force library search functions to use our sysroot.  Make sure it never
# uses programs from the sysroot (since we can't execute them).  Headers
# and libraries have to be taken from sysroot, on the other hand.
set(CMAKE_FIND_ROOT_PATH "/sysroot/arm64")
set(CMAKE_FIND_ROOT_PATH_MODE_PROGRAM NEVER)
set(CMAKE_FIND_ROOT_PATH_MODE_LIBRARY ONLY)
set(CMAKE_FIND_ROOT_PATH_MODE_INCLUDE ONLY)

In addition, a few more options needed to be passed to CMake. The following snippet explains them, in the form of a Bash array:

mkdir build.arm64
cd build.arm64

args=(
    # Path to the source directory
    ../llvm

    # Use the Ninja generator since it's faster and has cleaner output
    # than Makefiles.
    -G Ninja

    # -Os builds reduce space usage while maintaining good performance.
    # LLVM uses complex C++ that normally has a tendency towards
    # creating huge object files.
    -DCMAKE_BUILD_TYPE=MinSizeRel

    # Enable assertions to aid debugging.
    -DLLVM_ENABLE_ASSERTIONS=ON

    # Build LLVM, Clang and LLDB.
    -DLLVM_ENABLE_PROJECTS='llvm;clang;lldb'

    # Set the toolchain to build for aarch64 by default.
    -DLLVM_DEFAULT_TARGET_TRIPLE=aarch64-unknown-freebsd13.0
    -DLLVM_HOST_TRIPLE=aarch64-unknown-freebsd13.0
    -DLLVM_TARGET_ARCH=AArch64
    -DLLVM_TARGETS_TO_BUILD=AArch64

    # Use shared libs to speed up linking and avoid huge interim static
    # libraries.
    -DBUILD_SHARED_LIBS=ON

    # Use our toolchain file.
    -DCMAKE_TOOLCHAIN_FILE="${HOME}"/toolchain-arm64.cmake
)

cmake "${args[@]}"

Once LLVM is configured this way, the regular ninja calls can be used to build the project. The build system will automatically configure a NATIVE subdirectory containing utilities that need to be executed during the build, while the rest of the project will be built for ARM64.

Working on additional architectures

Architecture-specific code in LLDB

While a significant part of the ptrace(2) API used by debuggers, and therefore of the debugger itself, is architecture-agnostic, it is practically impossible to debug programs without specific support for the processor in question. For example, the debugger benefits from support for disassembling the code, generating function calls, inspecting registers, etc.

A large part of the architecture support is generic and shared between different operating systems. Moreover, a part of FreeBSD-specific architecture support is shared between the legacy and new plugins. Therefore, in order to extend the new plugin to support additional architectures, we mostly needed to introduce the ‘glue’ binding the new plugin to the architecture support. However, we also had to implement support for inspecting and modifying registers (in a more modern form than used by the legacy plugin), choose opcodes for software breakpoints, and fix existing bugs in platform support.

In the following subsections, we briefly discuss the architectures we worked on and the specifics that we needed to research in order to proceed.

ptrace(2) register groups

Traditionally, the ptrace(2) API defines three pairs of requests for getting and setting registers, effectively splitting the registers visible to userland programs into three groups: General-Purpose Registers (GPRs), Floating-Point Unit Registers (FPRs) and Debug Registers (DRs).

General-Purpose Registers are the baseline set of the processor’s registers exposed to userland programs. This group includes generic registers that can be used by the program to store arbitrary data (usually integers or memory addresses) and special CPU registers with a predefined meaning. Two common examples of special registers are the Program Counter, which holds the memory address of the code currently being executed, and the flag register, which exposes part of the CPU state and the boolean results of executed instructions.

The non-special registers also often have functions predefined by the platform’s ABI. For example, often one of the registers is dedicated to be the Stack Pointer. While the program could technically use it on some CPUs for another purpose, it must preserve the original value for interoperability.

Floating-Point Unit Registers are the registers used to store and perform computation on floating-point numbers. This class of registers is only present on processors implementing hardware floating-point support.

Finally, Debug Registers are special CPU registers used to aid debuggers. Usually, they are used to enable hardware-assisted breakpoints and watchpoints.
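The grouping can be modelled roughly as follows. This is only an illustrative sketch: the `RegisterGroup` type is our own invention and not part of LLDB or FreeBSD, while the request names are the standard ones that appear in the per-architecture descriptions below.

```python
from enum import Enum


class RegisterGroup(Enum):
    """The three traditional ptrace(2) register groups.

    Each member carries the names of the (get, set) request pair used
    to read and write the whole group at once.
    """

    GPR = ("PT_GETREGS", "PT_SETREGS")        # General-Purpose Registers
    FPR = ("PT_GETFPREGS", "PT_SETFPREGS")    # Floating-Point Unit Registers
    DR = ("PT_GETDBREGS", "PT_SETDBREGS")     # Debug Registers

    @property
    def get_request(self):
        return self.value[0]

    @property
    def set_request(self):
        return self.value[1]


for group in RegisterGroup:
    print(f"{group.name}: read via {group.get_request}, "
          f"write via {group.set_request}")
```

A debugger that wants to change a single register typically reads the whole group, modifies the one value and writes the whole group back.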

ARM (and AArch64)

Figure: Raspberry Pi Zero

ARM is a family of CPU architectures maintained by ARM Ltd., primarily used on embedded platforms. The processors up to ARMv7 were pure 32-bit. ARMv8 introduced an optional (though present in the vast majority of ARMv8 processors) 64-bit architecture, often called AArch64 or ARM64.

All 32-bit ARM processors feature 15 general purpose registers, a program counter (that is the 16th register) and a flag register. All of them are 32 bits wide.

The 32-bit ARM architecture did not originally feature hardware floating-point support. There are two extensions remedying this: VFP (Vector Floating Point) and Neon.

VFP is the earlier extension, existing in a number of versions. For our purposes, it is sufficient to say that it includes either 16 (in the -D16 versions of the VFP extension) or 32 (in the -D32 versions) FPU registers, each 64 bits wide.

Neon (also called Advanced SIMD) is focused on media and signal processing. It shares the registers with VFP while introducing the possibility of using them as 128-bit wide registers (with half the number).
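The register sharing can be illustrated with a short sketch. On 32-bit ARM, each 128-bit Neon register (Qn) overlays a consecutive pair of 64-bit registers (D2n and D2n+1), which is why there are half as many of them. The function name below is our own, and the sketch assumes the 32-register (-D32) layout.

```python
D_REGISTER_COUNT = 32                      # 64-bit registers, -D32 layout
Q_REGISTER_COUNT = D_REGISTER_COUNT // 2   # 128-bit views of the same storage


def d_registers_for_q(q_index):
    """Return the indices of the two 64-bit D registers aliased by Qn."""
    if not 0 <= q_index < Q_REGISTER_COUNT:
        raise ValueError(f"no such register: q{q_index}")
    return (2 * q_index, 2 * q_index + 1)


for q in (0, 1, 15):
    low, high = d_registers_for_q(q)
    print(f"q{q} = d{low} (low half) + d{high} (high half)")
```

In practice this means that a debugger writing to a Q register also changes the values seen through the corresponding D registers.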

The AArch64 architecture features 32 general purpose registers and a program counter (that is the 33rd register), all of them 64-bit, plus a 32-bit flag register. AArch64 also features 32 VFP/NEON-compatible 128-bit registers.

There is also the recent SVE (Scalable Vector Extension) that provides for variable vector widths from 128 bits to 2048 bits. It is not supported by FreeBSD at the moment.

FreeBSD/ARM supports a number of ARM and ARM64 boards. The FreeBSD wiki provides convenient instructions on running an AArch64 VM and AArch64 VM images, as well as instructions on running ARM via QEMU. We were able to successfully use an AArch64 VM, but we were not able to boot one for ARMv7.

The ptrace(2) API for 32-bit ARM uses PT_GETREGS and PT_SETREGS requests to work on general-purpose registers, and PT_GETVFPREGS and PT_SETVFPREGS machine-dependent requests to work on floating-point (VFP/NEON) registers. PT_GETFPREGS, PT_SETFPREGS, PT_GETDBREGS and PT_SETDBREGS are stubs.

One interesting property of ARM is that it defines two Instruction Set Architectures. The original ARM ISA encodes instructions in 32-bit words. Newer processors also support the Thumb ISA that uses a more compact but less flexible 16-bit encoding. The processor needs to be explicitly switched between these two encodings. In order to insert a software breakpoint, the debugger needs to know explicitly whether the code is encoded using the ARM or the Thumb ISA, and use an appropriate opcode.
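The following sketch shows why the distinction matters when patching code. For illustration it uses the architectural BKPT #0 encodings (0xE1200070 for ARM, 0xBE00 for Thumb); this is an assumption made for the example, and the trap instructions actually used by LLDB and expected by the kernel may well differ. The helper names and the `bytearray` standing in for the debugged program's memory are also ours.

```python
# Architectural BKPT #0 encodings, stored little-endian as they would
# appear in the memory of a little-endian ARM process.
ARM_BKPT = (0xE1200070).to_bytes(4, "little")
THUMB_BKPT = (0xBE00).to_bytes(2, "little")


def breakpoint_opcode(is_thumb):
    """Pick the trap opcode matching the encoding of the patched code."""
    return THUMB_BKPT if is_thumb else ARM_BKPT


def insert_breakpoint(memory, offset, is_thumb):
    """Overwrite the instruction at `offset` and return the saved bytes.

    The saved bytes have to be written back when the breakpoint
    is removed.
    """
    opcode = breakpoint_opcode(is_thumb)
    saved = bytes(memory[offset:offset + len(opcode)])
    memory[offset:offset + len(opcode)] = opcode
    return saved


code = bytearray(8)  # stand-in for the debugged program's memory
saved = insert_breakpoint(code, 0, is_thumb=True)
print(code.hex(), saved.hex())
```

Writing the 4-byte ARM opcode over Thumb code would clobber the following instruction as well, while writing the 2-byte Thumb opcode over ARM code would produce a corrupted 32-bit instruction.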

The ptrace(2) API for AArch64 is covered by the standard ptrace(2) requests: PT_GETREGS and PT_SETREGS for the general-purpose registers, PT_GETFPREGS and PT_SETFPREGS for floating-point (VFP/NEON) registers and PT_GETDBREGS and PT_SETDBREGS for debug registers (limited to hardware breakpoints at the moment).

MIPS

Figure: Silicon Graphics Octane

MIPS is yet another architecture primarily used for embedded products. MIPS I and II architectures were 32-bit, MIPS III through V were 64-bit architectures (with 32-bit backwards compatibility). Modern MIPS architectures are called MIPS32/MIPS64 since the specification permits both pure 32-bit and 64-bit processors.

One of the more curious features of the 64-bit MIPS architecture is the popularity of the N32 ABI. This ABI combines 64-bit code with 32-bit pointers. This makes it possible to reduce the program’s memory footprint at the cost of limiting it to 4 GiB of memory. Given that embedded platforms often have less memory than that, it can be quite useful. For comparison, the similar X32 ABI for x86 is barely known.

MIPS features 32 general-purpose registers that are either 32-bit or 64-bit depending on the architecture. Of these, only 31 are actually generally usable, while register $0 has a constant value of zero. Additionally, there are special HI/LO registers used to store multiplication results, PC (Program Counter), Status Register, Cause Register, Bad Virtual Address Register and more. MIPS also features 32 FPU registers, each of them 64 bits wide.

Vector operations are provided by the MSA (MIPS SIMD Architecture) extension that introduces 32 128-bit vector registers that are shared with the FPU registers. FreeBSD’s ptrace(2) API does not support MSA.

FreeBSD does not provide prebuilt MIPS images. However, the QEMU recipes page provides convenient instructions for building and booting a VM based on the Malta Development Board.

The General Purpose Registers and FPU Registers are accessed via the standard ptrace(2) requests. Support for hardware-assisted breakpoints, watchpoints or even single-stepping is not available. MIPS support in LLDB specifically requires a software single-stepping implementation.
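Software single-stepping roughly means that the debugger works out where execution can go next, places temporary breakpoints there and resumes the program. The sketch below is a heavily simplified model of that idea and not LLDB's actual implementation. The `DecodedInsn` type is hypothetical, branch targets are assumed to be already resolved by the caller, and a branch is stepped together with its delay slot (the instruction following a MIPS branch, which is executed before the branch takes effect).

```python
from dataclasses import dataclass
from typing import Optional

MIPS_INSN_SIZE = 4  # classic MIPS instructions are 32 bits wide


@dataclass
class DecodedInsn:
    """Hypothetical result of decoding one instruction (not an LLDB type)."""
    is_branch: bool = False
    is_conditional: bool = False
    target: Optional[int] = None  # resolved branch target, if any


def step_breakpoint_addresses(pc, insn):
    """Return addresses where temporary breakpoints are needed to step."""
    if not insn.is_branch:
        # Plain instruction: execution continues with the next one.
        return {pc + MIPS_INSN_SIZE}
    if insn.target is None:
        raise ValueError("branch target must be resolved by the caller")
    addresses = {insn.target}
    if insn.is_conditional:
        # Branch not taken: execution resumes after the delay slot.
        addresses.add(pc + 2 * MIPS_INSN_SIZE)
    return addresses


print(sorted(hex(a) for a in step_breakpoint_addresses(
    0x1000, DecodedInsn(is_branch=True, is_conditional=True, target=0x2000))))
```

Once one of the temporary breakpoints is hit, the debugger removes all of them and reports the step as complete.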

PowerPC

Figure: IBM Power Systems E870

The PowerPC (in later versions called Power) architecture is found in a wide range of products ranging from (older) gaming consoles and Macintosh computers to servers used for HPC (High-Performance Computing). It includes both 32-bit and 64-bit processors (PPC64).

PowerPC features 32 General Purpose Registers (32-bit for PPC, 64-bit for PPC64) plus a few special purpose registers:

the Link Register (LR) providing the branch target address for bclr* instructions
the Condition Register (CR) that can store up to 8 results of comparison/arithmetic operations
the XER Register (XER) used to indicate overflows and carry conditions in integer arithmetic
the Count Register (CTR) that is used as a loop counter
the Program Counter Register (PC)

PowerPC also features 32 64-bit Floating Point Registers along with a Floating-point Status and Control Register (FPSCR). Additionally, the AltiVec extension provides 32 128-bit vector registers, and later VSX extensions increase their number to 64.

FreeBSD provides install images for PPC and PPC64 hardware. However, FreeBSD does not work on qemu-system-ppc, and we have not managed to run it on qemu-system-ppc64 either, although it is apparently supposed to work.

Access to General Purpose Registers and FPU Registers is provided via the standard ptrace(2) requests. The AltiVec Registers can be accessed via the machine-specific PT_GETVRREGS and PT_SETVRREGS requests. The additional VSX Registers are exposed via PT_GETVSRREGS and PT_SETVSRREGS.
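The per-architecture differences described above can be gathered into a small lookup table. This is merely a summary sketch of the text, not code taken from LLDB; the dictionary layout, the register set keys and the `requests_for` helper are our own naming. Register sets whose requests are stubs or unavailable on a given architecture are simply left out.

```python
STANDARD = {
    "gpr": ("PT_GETREGS", "PT_SETREGS"),
    "fpr": ("PT_GETFPREGS", "PT_SETFPREGS"),
    "dbg": ("PT_GETDBREGS", "PT_SETDBREGS"),
}

# (get, set) request names per architecture and register set.
REGISTER_REQUESTS = {
    # 32-bit ARM: the FP and debug register requests are stubs,
    # VFP/NEON registers use machine-dependent requests instead.
    "arm": {
        "gpr": STANDARD["gpr"],
        "vfp": ("PT_GETVFPREGS", "PT_SETVFPREGS"),
    },
    # AArch64: fully covered by the standard requests.
    "arm64": dict(STANDARD),
    # MIPS: no hardware-assisted debugging support.
    "mips": {
        "gpr": STANDARD["gpr"],
        "fpr": STANDARD["fpr"],
    },
    # PowerPC: standard requests plus machine-specific vector requests.
    "powerpc": {
        "gpr": STANDARD["gpr"],
        "fpr": STANDARD["fpr"],
        "altivec": ("PT_GETVRREGS", "PT_SETVRREGS"),
        "vsx": ("PT_GETVSRREGS", "PT_SETVSRREGS"),
    },
}


def requests_for(arch, register_set):
    """Return the (get, set) request names, or None if not available."""
    return REGISTER_REQUESTS.get(arch, {}).get(register_set)


print(requests_for("arm", "vfp"))
print(requests_for("mips", "dbg"))
```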

Register summary

The following table summarizes General-Purpose and Floating-Point Unit Registers on the discussed architectures, and compares them to x86.

Comparison of GPRs (minus special CPU registers) and FPRs on discussed architectures
Registers		i386	amd64	arm	arm64	mips	mips64	ppc	ppc64
GPR	Num	8	16	15	32	32	32	32	32
GPR	Bits	32	64	32	64	32	64	32	64
FPR	Has?	yes	yes	opt.	yes	yes	yes	yes	yes
FPR	Num	8	8(x87) 16(SSE) 32(AVX-512)	16(-D16) 32(-D32)	32	32	32	32	32
FPR	Bits	80(x87) 64(SSE)		64	64	64	64	64	64
Vec.	Num	8	16(SSE) 32(AVX-512)	16(NEON) 32(SVE)		32		32(AltiVec) 64(VSX)
Vec.	Bits	128(SSE) 256(AVX) 512(AVX-512)		128(NEON) up to 2048(SVE)		128		128

Of the discussed architectures, the vast majority feature 32 General Purpose Registers (32-bit or 64-bit). The only exceptions are 32-bit ARM processors that feature 15 GPRs (the 16th is used as the Program Counter) and… x86 variants, with 8 GPRs in 32-bit processors and 16 GPRs in amd64.
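As a quick worked example, the GPR rows of the table can be turned into the amount of raw storage the registers occupy. The numbers are copied from the table; the helper name is ours. Note that this is only a lower bound for the size of the actual register structures exchanged via ptrace(2), since those also contain special registers and possibly padding.

```python
# (number of GPRs, width in bits), as listed in the table above.
GPRS = {
    "i386": (8, 32),
    "amd64": (16, 64),
    "arm": (15, 32),
    "arm64": (32, 64),
    "mips": (32, 32),
    "mips64": (32, 64),
    "ppc": (32, 32),
    "ppc64": (32, 64),
}


def gpr_bytes(arch):
    """Raw storage taken by the GPRs alone, in bytes."""
    count, bits = GPRS[arch]
    return count * bits // 8


for arch in GPRS:
    print(f"{arch:7} {gpr_bytes(arch):4} bytes")

with_32_gprs = [arch for arch, (count, _) in GPRS.items() if count == 32]
print(f"{len(with_32_gprs)} of {len(GPRS)} architectures have 32 GPRs")
```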

All architectures except for old ARM processors provide a hardware Floating Point Unit. Modern x86 systems provide two sets of FPRs: 8 80-bit x87 registers, and 8 to 32 64-bit registers provided by the SSE/AVX/AVX-512 extensions. On i386, only the first 8 of those registers are visible. All other architectures provide 32 64-bit FPRs, except for some ARM processors with the VFP-D16 variant, which provides only 16 registers.

All the architectures also feature an extension for vector operations. The baseline is 128-bit registers, provided by the SSE extension on x86 (8 on i386, 16 on amd64; they can alternatively be used for floating-point operations), NEON on ARM (16 128-bit registers shared with the FPRs), MSA on MIPS (32 128-bit registers shared with the FPRs) or AltiVec on PPC (32 128-bit registers).

The AVX extension for x86 widens them to 16 256-bit registers, and AVX-512 to 32 512-bit registers (only the first 8 available in 32-bit mode). The VSX extension on PPC increases the number of registers to 64 without extending their size. The SVE extension on AArch64 provides 32 registers of implementation-defined size, up to 2048 bits.

Changes merged upstream

Future plans

The goal of the first milestone was to reach feature parity for non-x86 architecture support in the FreeBSD LLDB plugin, that is, to implement support for ARM, ARM64, MIPS64 and PowerPC targets with capabilities matching the legacy plugin. Once all the patches are reviewed and merged, we’re going to remove the legacy plugin along with obsolete code from LLDB.

Once the legacy plugin is gone, the way will be open for further enhancements to the platform support. The potential enhancements include:

support for hardware breakpoints and watchpoints on ARM platforms (pending kernel support on 32-bit ARM, and ptrace(2) improvements on ARM64)
support for Floating-Point Unit Registers on MIPS (blocked by the necessity of breaking changes to the code shared between both plugins)
support for Vector Registers on PowerPC

We are also starting to work on the second milestone, that is, running the LLDB test suite on ARM64 and fixing test failures. The goal is to provide first-class support for the ARM64 platform.