How Debuggers Work: Getting and Setting x86 Registers, Part 1

By Michał Górny

October 22, 2020 - 15 minutes read - 3001 words


Context Switch

In this article, I would like to briefly describe the methods used to dump and restore the different kinds of registers on 32-bit and 64-bit x86 CPUs. The first part focuses on General Purpose Registers, Debug Registers and Floating-Point Registers up to the XMM registers provided by the SSE extension. I will also explain how their values can be obtained via the ptrace(2) interface.

The ptrace(2) API is commonly used in all modern BSD systems and Linux, as all of them inherit it from historical UNIX (ptrace first appeared in Version 6 AT&T UNIX and was carried over to the BSD line, including 4.3BSD). The primary focus in this article is on the FreeBSD and NetBSD systems. Nevertheless, users of other Operating Systems such as OpenBSD, DragonFly BSD or Linux can still benefit from this article, as the basic principles are the same and the code examples are intended to be easily adapted to other platforms.

A single CPU (on modern hardware, a CPU core, or a hardware thread if hyperthreading is available) can execute only one program thread at a time. In order to run multiple processes and threads quasi-simultaneously, the Operating System must perform context switching, that is, periodically suspend the currently running thread, save its state, restore the saved state of another thread and resume it. Saving and restoring the values of the processor's registers plays an important part in context switching. It is important that this process is fully transparent to the thread being switched: in a properly implemented kernel, there should be no side effects perceptible to the program.

The debugger may need to examine the register sets of the debugged program for a number of reasons. By inspecting the Program Counter, it can determine the location in the source code at which execution will continue, and by altering it, it can control the execution. The Stack Pointer is necessary to introspect variables stored on the stack, while the remaining registers can hold variables themselves.

The Debug Registers are a special set of x86 registers. They are not accessible to the program itself (only privileged kernel code can access them); however, the debugger can read and write them through the kernel. They allow setting hardware-assisted breakpoints (traps on instruction execution) on the code being executed, and watchpoints (traps on read and/or write operations) on variables.

General Purpose Registers (GPR)

Copying GPRs

The term ‘General Purpose Registers’ is a bit ambiguous. In the narrower sense, it means the few (8 on i386, 16 on amd64) baseline registers that can be used to store arbitrary data (usually integers or pointers). In the wider sense, it means all baseline registers in the processor architecture, historically excluding floating-point registers and special kinds of registers. On x86, this includes the ‘narrower sense’ general-purpose registers, the Program Counter (EIP/RIP), segment registers and the flag register.

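The majority of the General Purpose Registers can be copied directly, e.g. using the MOV instruction, or pushed onto the stack via PUSH. On amd64, the RIP register can be copied using the LEA instruction with RIP-relative addressing; i386 has no EIP-relative addressing, so the usual trick there is a CALL to the immediately following instruction followed by a POP of the pushed return address. In both cases, the program counter can be restored via JMP. The flag register can be pushed onto the stack via PUSHFD/PUSHFQ, and afterwards popped from it via POPFD/POPFQ.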

The listing below demonstrates an amd64 program that grabs the values of all GPRs at an arbitrary point during execution and prints them after leaving the assembly block.


#include <stdio.h>
#include <stdint.h>

enum {
    R_RAX, R_RBX, R_RCX, R_RDX, R_RSI, R_RDI, R_RBP, R_RSP,
    R_R8, R_R9, R_R10, R_R11, R_R12, R_R13, R_R14, R_R15,
    R_RIP, R_RFLAGS,
    R_LENGTH
};

enum {
    S_CS, S_DS, S_ES, S_FS, S_GS, S_SS,
    S_LENGTH
};

int main()
{
    uint64_t gpr[R_LENGTH];
    uint16_t seg[S_LENGTH];

    asm volatile (
        /* fill registers with arbitrary, easily recognizable values */
        "mov $0x0102030405060708, %%rax\n\t"
        "mov $0x1112131415161718, %%rbx\n\t"
        "mov $0x2122232425262728, %%rcx\n\t"
        "mov $0x3132333435363738, %%rdx\n\t"
        "mov $0x4142434445464748, %%rsi\n\t"
        "mov $0x5152535455565758, %%rdi\n\t"
        /* RBP may serve as the frame pointer and RSP is the stack pointer,
           so both are left untouched */
        "mov $0x8182838485868788, %%r8\n\t"
        "mov $0x9192939495969798, %%r9\n\t"
        "mov $0xa1a2a3a4a5a6a7a8, %%r10\n\t"
        "mov $0xb1b2b3b4b5b6b7b8, %%r11\n\t"
        "mov $0xc1c2c3c4c5c6c7c8, %%r12\n\t"
        "mov $0xd1d2d3d4d5d6d7d8, %%r13\n\t"
        "mov $0xe1e2e3e4e5e6e7e8, %%r14\n\t"
        "mov $0xf1f2f3f4f5f6f7f8, %%r15\n\t"

        /* dump GPRs */
        "mov %%rax, %[rax]\n\t"
        "mov %%rbx, %[rbx]\n\t"
        "mov %%rcx, %[rcx]\n\t"
        "mov %%rdx, %[rdx]\n\t"
        "mov %%rsi, %[rsi]\n\t"
        "mov %%rdi, %[rdi]\n\t"
        "mov %%rbp, %[rbp]\n\t"
        "mov %%rsp, %[rsp]\n\t"
        "mov %%r8, %[r8]\n\t"
        "mov %%r9, %[r9]\n\t"
        "mov %%r10, %[r10]\n\t"
        "mov %%r11, %[r11]\n\t"
        "mov %%r12, %[r12]\n\t"
        "mov %%r13, %[r13]\n\t"
        "mov %%r14, %[r14]\n\t"
        "mov %%r15, %[r15]\n\t"
        /* dump RIP */
        "lea (%%rip), %%rbx\n\t"
        "mov %%rbx, %[rip]\n\t"
        "mov %[rbx], %%rbx\n\t"
        /* dump segment registers */
        "mov %%cs, %[cs]\n\t"
        "mov %%ds, %[ds]\n\t"
        "mov %%es, %[es]\n\t"
        "mov %%fs, %[fs]\n\t"
        "mov %%gs, %[gs]\n\t"
        "mov %%ss, %[ss]\n\t"
        /* dump RFLAGS */
        "pushfq\n\t"
        "popq %[rflags]\n\t"

        : [rax] "=m"(gpr[R_RAX]), [rbx] "=m"(gpr[R_RBX]),
          [rcx] "=m"(gpr[R_RCX]), [rdx] "=m"(gpr[R_RDX]),
          [rsi] "=m"(gpr[R_RSI]), [rdi] "=m"(gpr[R_RDI]),
          [rbp] "=m"(gpr[R_RBP]), [rsp] "=m"(gpr[R_RSP]),
           [r8] "=m"(gpr[ R_R8]), [ r9] "=m"(gpr[ R_R9]),
          [r10] "=m"(gpr[R_R10]), [r11] "=m"(gpr[R_R11]),
          [r12] "=m"(gpr[R_R12]), [r13] "=m"(gpr[R_R13]),
          [r14] "=m"(gpr[R_R14]), [r15] "=m"(gpr[R_R15]),
          [rip] "=m"(gpr[R_RIP]), [rflags] "=m"(gpr[R_RFLAGS]),
          [cs] "=m"(seg[S_CS]), [ds] "=m"(seg[S_DS]),
          [es] "=m"(seg[S_ES]), [fs] "=m"(seg[S_FS]),
          [gs] "=m"(seg[S_GS]), [ss] "=m"(seg[S_SS])
        :
        : "%rax", "%rbx", "%rcx", "%rdx", "%rsi", "%rdi",
          "%r8", "%r9", "%r10", "%r11", "%r12", "%r13", "%r14", "%r15",
          "memory"
    );

    printf("rax = 0x%016lx\n", gpr[R_RAX]);
    printf("rbx = 0x%016lx\n", gpr[R_RBX]);
    printf("rcx = 0x%016lx\n", gpr[R_RCX]);
    printf("rdx = 0x%016lx\n", gpr[R_RDX]);
    printf("rsi = 0x%016lx\n", gpr[R_RSI]);
    printf("rdi = 0x%016lx\n", gpr[R_RDI]);
    printf("rbp = 0x%016lx\n", gpr[R_RBP]);
    printf("rsp = 0x%016lx\n", gpr[R_RSP]);
    printf(" r8 = 0x%016lx\n", gpr[R_R8]);
    printf(" r9 = 0x%016lx\n", gpr[R_R9]);
    printf("r10 = 0x%016lx\n", gpr[R_R10]);
    printf("r11 = 0x%016lx\n", gpr[R_R11]);
    printf("r12 = 0x%016lx\n", gpr[R_R12]);
    printf("r13 = 0x%016lx\n", gpr[R_R13]);
    printf("r14 = 0x%016lx\n", gpr[R_R14]);
    printf("r15 = 0x%016lx\n", gpr[R_R15]);
    printf("rip = 0x%016lx\n", gpr[R_RIP]);
    printf("cs = 0x%04x\n", seg[S_CS]);
    printf("ds = 0x%04x\n", seg[S_DS]);
    printf("es = 0x%04x\n", seg[S_ES]);
    printf("fs = 0x%04x\n", seg[S_FS]);
    printf("gs = 0x%04x\n", seg[S_GS]);
    printf("ss = 0x%04x\n", seg[S_SS]);
    printf("rflags = 0x%016lx\n", gpr[R_RFLAGS]);

    return 0;
}
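The RFLAGS value printed at the end of the listing is a bit field rather than a number. As a small companion sketch (the helper `describe_rflags` is hypothetical, not part of any system API), the status and control flags can be decoded by their architectural bit positions:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Bit positions of the commonly used RFLAGS/EFLAGS bits. */
static const struct {
    unsigned bit;
    const char *name;
} rflags_bits[] = {
    { 0, "CF" },   /* carry */
    { 2, "PF" },   /* parity */
    { 4, "AF" },   /* auxiliary carry */
    { 6, "ZF" },   /* zero */
    { 7, "SF" },   /* sign */
    { 8, "TF" },   /* trap (single-step) */
    { 9, "IF" },   /* interrupt enable */
    { 10, "DF" },  /* direction */
    { 11, "OF" },  /* overflow */
};

/* Write the names of the set flags, separated by spaces, into buf.
 * Returns buf; the list is cut short if bufsize is too small. */
static char *describe_rflags(uint64_t rflags, char *buf, size_t bufsize)
{
    size_t used = 0;

    if (bufsize == 0)
        return buf;
    buf[0] = '\0';
    for (size_t i = 0; i < sizeof(rflags_bits) / sizeof(rflags_bits[0]); i++) {
        if (!(rflags & (UINT64_C(1) << rflags_bits[i].bit)))
            continue;
        int n = snprintf(buf + used, bufsize - used, "%s%s",
                         used ? " " : "", rflags_bits[i].name);
        if (n < 0 || (size_t)n >= bufsize - used)
            break;  /* out of space; buf is still NUL-terminated */
        used += (size_t)n;
    }
    return buf;
}
```

For example, 0x246 decodes to "PF ZF IF"; bit 1 is reserved and always reads as 1, so it is deliberately not listed.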

The GPR ptrace(2) API

Both FreeBSD and NetBSD use the PT_GETREGS request to get the values of GPRs from the program, and PT_SETREGS to update them. The requests take a pointer to struct reg as an argument.

On FreeBSD, both i386 and amd64 have the individual registers listed as fields of the struct. On NetBSD, i386 uses a regular structure, while amd64 puts all values into an array whose indices are defined in the headers as constants.

The listing below compares the structures used on FreeBSD and NetBSD. Note that NetBSD/amd64 uses a special macro, _FRAME_REG, to define the register indices. For example, greg(rdi, RDI, 0) defines _REG_RDI with the value 0.


/* FreeBSD/i386 */               /* NetBSD/i386 */

struct __reg32 {                 struct reg {
    __uint32_t  r_fs;               int r_eax;
    __uint32_t  r_es;               int r_ecx;
    __uint32_t  r_ds;               int r_edx;
    __uint32_t  r_edi;              int r_ebx;
    __uint32_t  r_esi;              int r_esp;
    __uint32_t  r_ebp;              int r_ebp;
    __uint32_t  r_isp;              int r_esi;
    __uint32_t  r_ebx;              int r_edi;
    __uint32_t  r_edx;              int r_eip;
    __uint32_t  r_ecx;              int r_eflags;
    __uint32_t  r_eax;              int r_cs;
    __uint32_t  r_trapno;           int r_ss;
    __uint32_t  r_err;              int r_ds;
    __uint32_t  r_eip;              int r_es;
    __uint32_t  r_cs;               int r_fs;
    __uint32_t  r_eflags;           int r_gs;
    __uint32_t  r_esp;           };
    __uint32_t  r_ss;
    __uint32_t  r_gs;
};


/* FreeBSD/amd64 */              /* NetBSD/amd64 */

struct __reg64 {                 #define _FRAME_REG(greg, freg) \
    __int64_t   r_r15;               greg(rdi, RDI, 0) \
    __int64_t   r_r14;               greg(rsi, RSI, 1) \
    __int64_t   r_r13;               greg(rdx, RDX, 2) \
    __int64_t   r_r12;               greg(r10, R10, 6) \
    __int64_t   r_r11;               greg(r8,  R8,  4) \
    __int64_t   r_r10;               greg(r9,  R9,  5) \
    __int64_t   r_r9;                /* ... */ \
    __int64_t   r_r8;                greg(rcx, RCX, 3) \
    __int64_t   r_rdi;               greg(r11, R11, 7) \
    __int64_t   r_rsi;               greg(r12, R12, 8) \
    __int64_t   r_rbp;               greg(r13, R13, 9) \
    __int64_t   r_rbx;               greg(r14, R14, 10) \
    __int64_t   r_rdx;               greg(r15, R15, 11) \
    __int64_t   r_rcx;               greg(rbp, RBP, 12) \
    __int64_t   r_rax;               greg(rbx, RBX, 13) \
    __uint32_t  r_trapno;            greg(rax, RAX, 14) \
    __uint16_t  r_fs;                greg(gs,  GS,  15) \
    __uint16_t  r_gs;                greg(fs,  FS,  16) \
    __uint32_t  r_err;               greg(es,  ES,  17) \
    __uint16_t  r_es;                greg(ds,  DS,  18) \
    __uint16_t  r_ds;                greg(trapno, TRAPNO, 19) \
    __int64_t   r_rip;               greg(err, ERR, 20) \
    __int64_t   r_cs;                greg(rip, RIP, 21) \
    __int64_t   r_rflags;            greg(cs,  CS,  22) \
    __int64_t   r_rsp;               greg(rflags, RFLAGS, 23) \
    __int64_t   r_ss;                greg(rsp, RSP, 24) \
};                                   greg(ss,  SS,  25)

                                 struct reg {
                                     long    regs[_NGREG];
                                 };
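The `_FRAME_REG` definition is an X-macro: a single register list that is expanded with different `greg` (and `freg`) macros to produce different declarations, such as the `_REG_*` constants. The sketch below reproduces the technique with a shortened list and illustrative names (`MY_FRAME_REG`, `MY_REG_*` and `my_reg_names` are hypothetical, not NetBSD identifiers, and the `freg` argument is left out); the indices are the NetBSD/amd64 ones shown above.

```c
#include <stddef.h>

/* A shortened, hypothetical copy of the register list, keeping the
 * NetBSD/amd64 indices for the entries it includes. */
#define MY_FRAME_REG(greg) \
    greg(rdi, RDI, 0)      \
    greg(rsi, RSI, 1)      \
    greg(rdx, RDX, 2)      \
    greg(rcx, RCX, 3)      \
    greg(rax, RAX, 14)     \
    greg(rip, RIP, 21)     \
    greg(rsp, RSP, 24)

/* First expansion: one enum constant per register, e.g. MY_REG_RDI = 0. */
#define MY_ENUM_ENTRY(lower, upper, index) MY_REG_##upper = (index),
enum { MY_FRAME_REG(MY_ENUM_ENTRY) MY_NGREG = 26 };
#undef MY_ENUM_ENTRY

/* Second expansion: a table mapping index -> register name.
 * Indices not present in the shortened list stay NULL. */
#define MY_NAME_ENTRY(lower, upper, index) [index] = #lower,
static const char *const my_reg_names[MY_NGREG] = { MY_FRAME_REG(MY_NAME_ENTRY) };
#undef MY_NAME_ENTRY
```

With the real headers, the same pattern is what makes an expression such as `gpr.regs[_REG_RIP]` work.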

Floating-Point Registers

Dumping via FSAVE and FXSAVE

Floating-Point Registers is the term used to indicate registers whose primary purpose was handling floating-point numbers. The traditional separation between General Purpose Registers and Floating-Point Registers is reflected in the ptrace(2) CPU-specific calls that allow setting or getting either the GPR or FPU registers in a single operation. Some CPU architectures include additional sets of registers, e.g. x86 exposes the Debug Registers separately.

A few architectures do not implement the Floating-Point Register getters and setters at all, as they do not feature a hardware FPU (this is often the case in low-power embedded devices).

The earliest FPRs on x86 were the x87 registers: eight 80-bit extended-precision data registers, accessed as a stack ST(0) through ST(7), and a few control and status registers.

The contents of these registers can be dumped using the FSAVE instruction and restored using the FRSTOR instruction. FSAVE takes a pointer to a 108-byte memory buffer (the size used in 32-bit and 64-bit mode), stores the current values of the control registers and the ST(i) registers in it, and then reinitializes the FPU.

The FSAVE mnemonic implicitly inserts an additional FWAIT instruction before the save, so that any pending unmasked floating-point exceptions are handled first. If you wish to capture the FPU state in the middle of exception handling, FNSAVE should be used instead, as it captures the immediate FPU state without waiting.
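To see the format in practice, the sketch below runs FNINIT and FLD1 and then dumps the state with FNSAVE. It is x86-only GCC/Clang inline assembly, and the names `fsave_after_fld1` and `struct fsave_summary` are hypothetical. It assumes the 108-byte 32-bit protected-mode format (also used in 64-bit mode), with FCW at byte 0, FSW at byte 4, the full FTW at byte 8 and ST(0) starting at byte 28. Note that it also resets the x87 control word to its default value.

```c
#include <stdint.h>
#include <string.h>

struct fsave_summary {
    uint16_t fcw;       /* control word, bytes 0-1 */
    uint16_t fsw;       /* status word, bytes 4-5 */
    uint16_t ftw;       /* full tag word, bytes 8-9 */
    uint64_t st0_mant;  /* ST(0) mantissa, bytes 28-35 */
    uint16_t st0_exp;   /* ST(0) sign and exponent, bytes 36-37 */
};

/* x86 only: push 1.0 onto a freshly initialized x87 stack and dump it. */
static struct fsave_summary fsave_after_fld1(void)
{
    unsigned char buf[108];   /* FSAVE area, 32-bit protected-mode format */
    struct fsave_summary s;

    __asm__ __volatile__ (
        "fninit\n\t"          /* reset the x87 unit to a known state */
        "fld1\n\t"            /* push +1.0: TOP becomes 7, ST(0) = R7 = 1.0 */
        "fnsave (%0)\n\t"     /* dump the state, then reinitialize the FPU */
        :
        : "r"(buf)
        : "st", "memory"
    );

    /* Copy the fields out with memcpy to avoid unaligned accesses. */
    memcpy(&s.fcw, buf + 0, sizeof s.fcw);
    memcpy(&s.fsw, buf + 4, sizeof s.fsw);
    memcpy(&s.ftw, buf + 8, sizeof s.ftw);
    memcpy(&s.st0_mant, buf + 28, sizeof s.st0_mant);
    memcpy(&s.st0_exp, buf + 36, sizeof s.st0_exp);
    return s;
}
```

After FLD1, TOP is 7, so only physical register R7 is tagged as valid and the full tag word should read 0x3FFF, while ST(0) holds 1.0 (exponent 0x3FFF, mantissa 0x8000000000000000).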

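The 64-bit MMi registers introduced as part of the MMX instruction set overlap with the x87 registers: each MMi occupies the 64-bit mantissa of the corresponding x87 data register (and MMX instructions set TOP to 0, so MMi corresponds to ST(i)). As a result, no new dumping instruction is necessary, and if the MMi registers are in use, FSAVE dumps them as part of the ST(i) slots.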
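The aliasing can be observed directly. The x86-only sketch below (GCC/Clang inline assembly; `mmx_slot_after_movq` is a hypothetical name) loads a 64-bit value into MM0 and dumps the state with FNSAVE. Since an MMX instruction sets TOP to 0 and fills bits 64 to 79 of the written register with ones, the ST(0) slot at byte 28 of the FSAVE area is expected to hold the value as its mantissa and 0xFFFF in its sign/exponent field.

```c
#include <stdint.h>
#include <string.h>

/* x86 only: load `value` into MM0, save the x87/MMX state with FNSAVE
 * and return the 10-byte ST(0) slot split into its two parts. */
static void mmx_slot_after_movq(uint64_t value, uint64_t *mant, uint16_t *exp_sign)
{
    unsigned char buf[108];   /* FSAVE area, 32-bit protected-mode format */

    __asm__ __volatile__ (
        "movq %1, %%mm0\n\t"  /* MMX write: TOP := 0, MM0 aliases R0 */
        "fnsave (%0)\n\t"     /* dump the state; the implied reinitialization
                                 marks all registers empty, as EMMS would */
        :
        : "r"(buf), "m"(value)
        : "mm0", "memory"
    );

    memcpy(mant, buf + 28, sizeof *mant);          /* ST(0) mantissa */
    memcpy(exp_sign, buf + 36, sizeof *exp_sign);  /* ST(0) sign/exponent */
}
```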

`FSAVE` data layout (32-bit protected-mode format, 108 bytes)
Byte offset	Size	Contents
0	2	FCW
2	2	unused
4	2	FSW
6	2	unused
8	2	FTW (full)
10	2	unused
12	4	FIP
16	2	FCS
18	2	FOP
20	4	FDP
24	2	FDS
26	2	unused
28	10	ST(0) / MM0
38	10	ST(1) / MM1
48	10	ST(2) / MM2
58	10	ST(3) / MM3
68	10	ST(4) / MM4
78	10	ST(5) / MM5
88	10	ST(6) / MM6
98	10	ST(7) / MM7

The SSE register set introduced 8 new 128-bit registers, XMM0 through XMM7, and a control/status register, MXCSR. Along with them, a new dumping instruction FXSAVE and its restoring counterpart FXRSTOR were introduced. They use a 512-byte memory buffer aligned on a 16-byte boundary, with a layout different from that of FSAVE.

The obvious difference between FSAVE and FXSAVE is that the latter saves SSE registers. On i386, the registers XMM0..XMM7 are stored, and the remaining part of the buffer is left reserved/unused. On amd64, a major part of the reserved space is used to store XMM8..XMM15.

The other difference, which is frequently missed, is that FSAVE stores the FPU Tag Word (FTW) in its full form, while FXSAVE stores it in an abridged form. The full form uses two bits per register to indicate what kind of value each of the eight x87 data registers contains: valid (a normalized number), zero, special (NaN, infinity, denormal or unsupported format) or empty. The abridged form uses a single bit per register and only indicates whether the register is empty or not.

To access the registers introduced by further processor extensions such as AVX, the XSAVE instruction needs to be used. Unlike the instructions described here, it has been designed to be extensible. XSAVE is a wide topic, and it will be the subject of the second part of this article.

FXSAVE vs FXSAVE64

The traditional variants of the FXSAVE/FXRSTOR instructions store the FIP (address of the last x87 instruction, e.g. the one that caused an exception) and FDP (address of its memory operand) pointers as pairs of 16-bit segment selectors (FCS and FDS, respectively) and 32-bit offsets (FIP and FDP). This is a problem for amd64 programs, since the original 64-bit pointers are truncated to 32 bits.

To resolve this, the additional mnemonics FXSAVE64 and FXRSTOR64 are provided. They encode the respective instruction with a REX.W=1 prefix, which switches the FIP and FDP fields to full 64-bit pointers. Their drawback is that the segment selectors are no longer saved; however, newer processors deprecate FCS and FDS anyway and save them as zero.
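As a quick cross-check of the layout, the sketch below (x86-64 only, GCC/Clang inline assembly; the names are hypothetical) dumps the state with FXSAVE64 into a 16-byte aligned 512-byte buffer and compares the FCW field at byte 0 and the MXCSR field at byte 24 against the values returned by the dedicated FNSTCW and STMXCSR instructions.

```c
#include <stdint.h>
#include <string.h>

struct fxsave_check {
    uint16_t fcw_fx, fcw_direct;      /* FCW from FXSAVE64 and from FNSTCW */
    uint32_t mxcsr_fx, mxcsr_direct;  /* MXCSR from FXSAVE64 and from STMXCSR */
};

/* x86-64 only: FXSAVE64 is the REX.W-prefixed form of FXSAVE. */
static struct fxsave_check fxsave64_check(void)
{
    /* FXSAVE requires a 512-byte area aligned on a 16-byte boundary. */
    _Alignas(16) unsigned char area[512];
    struct fxsave_check c;

    __asm__ __volatile__ (
        "fxsave64 (%2)\n\t"   /* full x87 + SSE dump, 64-bit FIP/FDP */
        "fnstcw %0\n\t"       /* x87 control word, read directly */
        "stmxcsr %1\n\t"      /* SSE control/status register, read directly */
        : "=m"(c.fcw_direct), "=m"(c.mxcsr_direct)
        : "r"(area)
        : "memory"
    );

    memcpy(&c.fcw_fx, area + 0, sizeof c.fcw_fx);
    memcpy(&c.mxcsr_fx, area + 24, sizeof c.mxcsr_fx);
    return c;
}
```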

`FXSAVE` variants data layout (512 bytes)
Byte offset	Size	Contents
0	2	FCW
2	2	FSW
4	1	FTW (abridged)
5	1	reserved
6	2	FOP
8	4	FIP (FXSAVE) / bits 0-31 of FIP (FXSAVE64)
12	2	FCS (FXSAVE) / bits 32-47 of FIP (FXSAVE64)
14	2	reserved (FXSAVE) / bits 48-63 of FIP (FXSAVE64)
16	4	FDP (FXSAVE) / bits 0-31 of FDP (FXSAVE64)
20	2	FDS (FXSAVE) / bits 32-47 of FDP (FXSAVE64)
22	2	reserved (FXSAVE) / bits 48-63 of FDP (FXSAVE64)
24	4	MXCSR
28	4	MXCSR_MASK
32	8 × 16	ST(0) / MM0 through ST(7) / MM7 (10 bytes of data followed by 6 reserved bytes each)
160	8 × 16	XMM0 through XMM7
288	8 × 16	XMM8 through XMM15 (amd64 only)
416	48	reserved
464	48	unused
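The same layout can also be written down as a C structure. The definition below is a sketch for illustration only: `struct fxsave_area`, `struct fx_st_slot` and their field names are hypothetical (the FreeBSD and NetBSD headers ship their own definitions under different names), and it follows the FXSAVE64 variant of the first 32 bytes. The static assertions pin down the offsets from the table.

```c
#include <stddef.h>
#include <stdint.h>

/* One 16-byte slot holding an 80-bit x87 register (or a 64-bit MMX register). */
struct fx_st_slot {
    uint64_t mantissa;      /* bits 0-63; MMi lives here */
    uint16_t exp_sign;      /* bits 64-79: sign bit and 15-bit exponent */
    uint8_t  reserved[6];
};

/* FXSAVE64 layout: 64-bit FIP and FDP, no FCS/FDS. */
struct fxsave_area {
    uint16_t fcw;                 /* byte 0 */
    uint16_t fsw;                 /* byte 2 */
    uint8_t  ftw_abridged;        /* byte 4: one bit per physical register */
    uint8_t  reserved1;           /* byte 5 */
    uint16_t fop;                 /* byte 6 */
    uint64_t fip;                 /* bytes 8-15 */
    uint64_t fdp;                 /* bytes 16-23 */
    uint32_t mxcsr;               /* byte 24 */
    uint32_t mxcsr_mask;          /* byte 28 */
    struct fx_st_slot st[8];      /* bytes 32-159 */
    uint8_t  xmm[16][16];         /* bytes 160-415; XMM8-15 on amd64 only */
    uint8_t  reserved2[48];       /* bytes 416-463 */
    uint8_t  unused[48];          /* bytes 464-511 */
};

_Static_assert(sizeof(struct fx_st_slot) == 16, "ST slot size");
_Static_assert(offsetof(struct fxsave_area, fip) == 8, "FIP offset");
_Static_assert(offsetof(struct fxsave_area, mxcsr) == 24, "MXCSR offset");
_Static_assert(offsetof(struct fxsave_area, st) == 32, "ST(0) offset");
_Static_assert(offsetof(struct fxsave_area, xmm) == 160, "XMM0 offset");
_Static_assert(sizeof(struct fxsave_area) == 512, "FXSAVE area size");
```

For the 32-bit FXSAVE variant, `fip` and `fdp` would instead be split into 32-bit offsets followed by 16-bit FCS/FDS selectors and 2 reserved bytes each.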

The ptrace(2) API

Both FreeBSD and NetBSD share roughly the same API for getting and setting the baseline floating-point registers: the PT_GETFPREGS and PT_SETFPREGS requests, both operating on a struct fpreg. While the visible fields of this struct differ between FreeBSD and NetBSD, their underlying layout is the same.

For historical reasons, struct fpreg on i386 follows the FSAVE layout. This has two important implications. Firstly, it does not include the SSE registers. Secondly, it includes a full FPU Tag Word (FTW) register. The latter implies that a kernel using FXSAVE or a newer instruction internally needs to reconstruct the full value.

Until a few days ago, the reconstruction performed by both the FreeBSD and NetBSD kernels was incomplete: all non-empty registers were reported as valid (normalized) values, regardless of their actual contents. I have recently fixed this in both kernels. You can read more about the problem in the FreeBSD Remote plugin report.
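The reconstruction can be sketched as follows. The code is a hypothetical illustration, not the FreeBSD or NetBSD implementation: it rebuilds the full 16-bit tag word from the abridged FTW, the TOP field of FSW and the eight ST(i) slots stored by FXSAVE. Two details matter. The tag bits are indexed by physical register Ri, while FXSAVE stores the registers in stack order, so physical register i is found in slot ST((i - TOP) mod 8). A non-empty register is then classified by its exponent and by the explicit integer (J) bit of its mantissa.

```c
#include <stdint.h>

/* One x87 register as stored in an FXSAVE slot (padding omitted). */
struct x87_reg {
    uint64_t mantissa;   /* bit 63 is the explicit integer (J) bit */
    uint16_t exp_sign;   /* bit 15: sign, bits 0-14: exponent */
};

enum { TAG_VALID = 0, TAG_ZERO = 1, TAG_SPECIAL = 2, TAG_EMPTY = 3 };

/* Classify a non-empty register the way the full FTW does. */
static unsigned x87_tag(const struct x87_reg *r)
{
    uint16_t exp = r->exp_sign & 0x7fff;

    if (exp == 0x7fff)
        return TAG_SPECIAL;                                /* infinity or NaN */
    if (exp == 0)
        return r->mantissa == 0 ? TAG_ZERO : TAG_SPECIAL;  /* zero or denormal */
    return (r->mantissa >> 63) ? TAG_VALID : TAG_SPECIAL;  /* normal or unnormal */
}

/* Rebuild the full tag word from FXSAVE data.
 * abridged: FTW byte from FXSAVE (bit i set = physical register i non-empty)
 * fsw:      status word (TOP in bits 11-13)
 * st:       ST(0)..ST(7) in stack order, as stored by FXSAVE */
static uint16_t full_ftw(uint8_t abridged, uint16_t fsw, const struct x87_reg st[8])
{
    unsigned top = (fsw >> 11) & 7;
    uint16_t ftw = 0;

    for (unsigned i = 0; i < 8; i++) {
        unsigned tag = TAG_EMPTY;
        if (abridged & (1u << i))
            tag = x87_tag(&st[(i - top) & 7]);  /* physical i -> ST(i - TOP) */
        ftw |= (uint16_t)(tag << (2 * i));
    }
    return ftw;
}
```

With the state created by the NetBSD example program later in this article (FINIT, FLDZ, FLD1), TOP is 6 and the abridged tag byte is 0xC0; the function then yields 0x4FFF: R6 valid (1.0), R7 zero, R0 to R5 empty.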

Resolving the lack of SSE registers required another pair of requests: PT_GETXMMREGS and PT_SETXMMREGS. Both use a struct whose underlying layout matches FXSAVE. They are implemented as machine-dependent requests (while the other requests listed here are defined on all architectures, even if not actually used everywhere). They are available to compat32 programs (e.g. a 32-bit debugger running on a 64-bit system) in the upcoming NetBSD 10 release. You can read more about that in the XSAVE and compat32 kernel work report.

On amd64, PT_GETFPREGS and PT_SETFPREGS already use the FXSAVE-based structure, so no separate requests are needed to access the XMM registers.

Debug Registers

The i386 architecture includes 8 Debug Registers, while amd64 has 16 of them. In practice, only a subset of these registers is usable, namely DR0 through DR3, DR6 and DR7. DR0 through DR3 hold the memory addresses of breakpoints or watchpoints, DR6 serves as the status register and DR7 as the control register. The remaining DRs are reserved.

Debug registers
reg.	purpose
DR0	bp./wp. #0 address
DR1	bp./wp. #1 address
DR2	bp./wp. #2 address
DR3	bp./wp. #3 address
DR4	reserved (obsolete alias to DR6)
DR5	reserved (obsolete alias to DR7)
DR6	debug status register
DR7	debug control register
DR8 through DR15	reserved (amd64 only)

The Debug Registers can be copied using MOV, but only at privilege level 0 (thus, inside the kernel). Access to them is exposed to the debugger via PT_GETDBREGS and PT_SETDBREGS, which take a struct dbreg argument. The structure is the same on FreeBSD and NetBSD.
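DR7 packs the per-breakpoint settings into bit fields: for breakpoint n (0 to 3), the local enable bit Ln is bit 2n, the global enable bit Gn is bit 2n+1, the two-bit R/W condition field starts at bit 16+4n and the two-bit LEN field at bit 18+4n. The sketch below (helper names are hypothetical) computes a DR7 value for a single breakpoint and tests the B0 to B3 bits of DR6, which report the breakpoints that were hit. A debugger would place the address in DRn and the computed value in DR7 before issuing PT_SETDBREGS; it is assumed here that struct dbreg exposes the registers as a `dr` array, which is worth checking against <machine/reg.h>.

```c
#include <stdbool.h>
#include <stdint.h>

/* R/W condition field values (DR7 bits 16+4n and 17+4n). */
enum dr7_cond {
    DR7_EXEC  = 0,  /* instruction execution */
    DR7_WRITE = 1,  /* data writes */
    DR7_IO    = 2,  /* I/O reads or writes (requires CR4.DE) */
    DR7_RW    = 3,  /* data reads or writes */
};

/* LEN field values (DR7 bits 18+4n and 19+4n). */
enum dr7_len {
    DR7_LEN1 = 0,   /* 1 byte (required for execution breakpoints) */
    DR7_LEN2 = 1,   /* 2 bytes */
    DR7_LEN8 = 2,   /* 8 bytes (amd64) */
    DR7_LEN4 = 3,   /* 4 bytes */
};

/* Return `dr7` with breakpoint n (0-3) locally enabled for the given
 * condition and length; previous settings of breakpoint n are replaced. */
static uint64_t dr7_set(uint64_t dr7, unsigned n, enum dr7_cond cond, enum dr7_len len)
{
    unsigned shift = 16 + 4 * n;

    dr7 &= ~((UINT64_C(0xf) << shift) | (UINT64_C(3) << (2 * n)));
    dr7 |= (uint64_t)cond << shift;
    dr7 |= (uint64_t)len << (shift + 2);
    dr7 |= UINT64_C(1) << (2 * n);   /* Ln: local enable */
    return dr7;
}

/* DR6 bits B0-B3 report which breakpoint conditions were detected. */
static bool dr6_hit(uint64_t dr6, unsigned n)
{
    return (dr6 >> n) & 1;
}
```

For example, a 4-byte write watchpoint in slot 1 gives `dr7_set(0, 1, DR7_WRITE, DR7_LEN4) == 0xD00004`.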

Example code

The following listing demonstrates a NetBSD/amd64 program that reads the General Purpose Registers and Floating-Point Registers of a child process via ptrace(2), prints some of them and then writes the modified GPRs back.


#include <sys/types.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <machine/reg.h>

#include <assert.h>
#include <inttypes.h>
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main() {
    int ret;
    pid_t pid = fork();
    assert(pid != -1);

    if (pid == 0) {
        uint64_t rax = 0x0001020304050607;
        printf("RAX in child before trap: 0x%016" PRIx64 "\n", rax);

        /* child -- debugged program */
        /* request tracing */
        ret = ptrace(PT_TRACE_ME, 0, NULL, 0);
        assert(ret != -1);

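        __asm__ __volatile__ (
            "finit\n\t"
            "fldz\n\t"
            "fld1\n\t"
            "int3\n\t"
            /* pop both values so that the x87 stack is empty again */
            "fstp %%st(0)\n\t"
            "fstp %%st(0)\n\t"
            : "+a"(rax)
            :
            : "st", "st(1)"
        );

        printf("RAX in child after trap: 0x%016" PRIx64 "\n", rax);
        /* _exit() does not flush stdio buffers, so flush them explicitly */
        fflush(stdout);
        _exit(0);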
    }

    /* parent -- the debugger */
    /* wait for the child to become ready for tracing */
    pid_t waited = waitpid(pid, &ret, 0);
    assert(waited == pid);
    assert(WIFSTOPPED(ret));
    assert(WSTOPSIG(ret) == SIGTRAP);

    struct reg gpr;
    struct fpreg fpr;

    /* get GPRs and FPRs */
    ret = ptrace(PT_GETREGS, pid, &gpr, 0);
    assert (ret == 0);
    ret = ptrace(PT_GETFPREGS, pid, &fpr, 0);
    assert (ret == 0);

    printf("RAX from PT_GETREGS: 0x%016" PRIx64 "\n",
            gpr.regs[_REG_RAX]);
    printf("ST(0) (raw) from PT_GETFPREGS: 0x%04" PRIx16
            "%016" PRIx64 "\n",
            fpr.fxstate.fx_87_ac[0].r.f87_exp_sign,
            fpr.fxstate.fx_87_ac[0].r.f87_mantissa);
    printf("ST(1) (raw) from PT_GETFPREGS: 0x%04" PRIx16
            "%016" PRIx64 "\n",
            fpr.fxstate.fx_87_ac[1].r.f87_exp_sign,
            fpr.fxstate.fx_87_ac[1].r.f87_mantissa);
    gpr.regs[_REG_RAX] = 0x0f0e0d0c0b0a0908;
    printf("RAX set via PT_SETREGS: 0x%016" PRIx64 "\n",
            gpr.regs[_REG_RAX]);

    /* set GPRs and resume the program */
    ret = ptrace(PT_SETREGS, pid, &gpr, 0);
    assert (ret == 0);
    ret = ptrace(PT_CONTINUE, pid, (void*)1, 0);
    assert(ret == 0);

    /* wait for the child to exit */
    waited = waitpid(pid, &ret, 0);
    assert(waited == pid);
    assert(WIFEXITED(ret));
    assert(WEXITSTATUS(ret) == 0);

    return 0;
}

When compiled and run on NetBSD, the program outputs:

RAX in child before trap: 0x0001020304050607
RAX from PT_GETREGS: 0x0001020304050607
ST(0) (raw) from PT_GETFPREGS: 0x3fff8000000000000000
ST(1) (raw) from PT_GETFPREGS: 0x00000000000000000000
RAX set via PT_SETREGS: 0x0f0e0d0c0b0a0908
RAX in child after trap: 0x0f0e0d0c0b0a0908
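The raw ST(i) values in the output are 80-bit extended-precision numbers: a sign bit, a 15-bit exponent biased by 16383 and a 64-bit mantissa with an explicit integer bit. The sketch below (the helper name is hypothetical; only zero and normal numbers are handled) converts the two printed fields into a double. 0x3fff8000000000000000 decodes to 1.0, the value pushed by FLD1, and the all-zero value to 0.0, pushed by FLDZ.

```c
#include <stdint.h>

/* Convert an x87 80-bit value, given as its sign/exponent word and its
 * 64-bit mantissa, to double. Only zero and normal numbers are handled;
 * check for specials (exponent 0x7fff, or exponent 0 with a non-zero
 * mantissa) before calling. Mantissa bits beyond double precision are
 * rounded away. */
static double x87_to_double(uint16_t exp_sign, uint64_t mantissa)
{
    int negative = (exp_sign >> 15) & 1;
    /* Unbias the exponent; the mantissa is a 1.63 fixed-point number. */
    int e = (exp_sign & 0x7fff) - 16383 - 63;
    double value = (double)mantissa;

    /* Scale by 2^e one step at a time, avoiding a dependency on libm. */
    while (e > 0 && value != 0.0) { value *= 2.0; e--; }
    while (e < 0 && value != 0.0) { value /= 2.0; e++; }
    return negative ? -value : value;
}
```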

Summary

Saving and restoring registers plays a crucial part in context switching. The General Purpose Registers can normally be copied directly, or with some trivial tricks. On the other hand, the x87 Floating-Point Registers cannot be copied with plain MOV instructions and are dumped using dedicated instructions.

The two base instructions for dumping FPRs are FSAVE and FXSAVE. FSAVE stores the x87 state into a 108-byte memory area, while FXSAVE stores the x87 and SSE state into a 512-byte memory area. Both of these instructions also implicitly include the MMX state, since the MMi registers overlap with ST(i).

Debuggers can access the saved registers of stopped debugged processes via the ptrace(2) API. The vast majority of targets on both NetBSD and FreeBSD define PT_GETREGS and PT_SETREGS to work with General Purpose Registers, and PT_GETFPREGS and PT_SETFPREGS to work with Floating-Point Registers. For historical reasons, the latter two requests use the limited FSAVE layout on i386. This deficiency is addressed by an additional pair of requests, PT_GETXMMREGS and PT_SETXMMREGS.

Listing of ptrace(2) register-related requests
Request	Data type	Register group
`PT_GETREGS PT_SETREGS`	`struct reg`	General Purpose Registers
`PT_GETFPREGS PT_SETFPREGS`	`struct fpreg`	Floating-Point Registers (`FSAVE` on i386, `FXSAVE` on amd64)
`PT_GETXMMREGS PT_SETXMMREGS`	`struct xmmregs`	Floating-Point Registers (i386 only, `FXSAVE`)
`PT_GETDBREGS PT_SETDBREGS`	`struct dbreg`	Debug Registers (x86 only)

Just as FXSAVE is insufficient to dump all the registers on modern (e.g. AVX-enabled) CPUs, the aforementioned requests cannot expose the additional values either. Newer x86 CPUs introduce the XSAVE family of instructions, providing a forward-extensible dump format, and the Operating Systems have followed by providing a future-extensible API to work with these dumps. This will be the topic of the second part of this article.