Recently as part of a sandboxing runtime project at work I developed an unconventional debugging technique using signal handlers and some archaic bits in the x86 FLAGS register. This post goes over some of the failed attempts while developing this technique and what I eventually settled on.

This post assumes a decent understanding of systems programming concepts, but only a vague familiarity with POSIX signals. Basically, signals are asynchronous notifications that programs can respond to with signal handler functions. Signal handlers run in userspace but in a different context than normal execution, since the kernel invokes these functions when a thread needs to handle a signal. Handlers usually poke the process state as needed, and when the function returns the kernel resumes the thread in its normal execution context.

Sandbox basics

The start of this whole rabbit hole was a runtime for retrofitting sandboxes onto existing C codebases. The idea here was to limit the “blast radius” of unintentional, spatial-memory vulnerabilities in code with an initial focus on dynamically-linked programs on Linux. We (the team at Immunant) isolated memory belonging to different dynamically-linked libraries within a single process and used the relatively new x86-64 Memory Protection Keys (MPK) feature to protect these regions with pkey_mprotect. By protect I mean that each library could only access its corresponding memory regions unless the variable was specifically annotated as shared between libraries. These regions include the stack, heap, statically-allocated variables and thread-local storage and we inserted code to change the thread-specific permissions each time a thread switched libraries. If a library tried to read or write to something it didn’t have permission to, MPK would terminate the process with a segfault.

With this setup the sandboxing process went something like:

  1. integrate our sandbox’s instrumentation with the codebase’s build system
  2. annotate variables that obviously need to be shared or refactor code to avoid sharing
  3. run the program to find other variables that may need to be shared during normal execution

In theory delineating sandbox boundaries along dynamically-linked libraries sounds great because they’re loaded into separate parts of the address space, but in reality they still typically require annotating a decent number of variables as shared. Since MPK kills the process (by raising a SIGSEGV) after a sandbox violation, the third step would find at most one bad memory access per run, which makes for a pretty frustrating development process. Ideally we wanted some sort of permissive mode where the program ignores sandbox violations and just logs them for developers to look at later. However, the MPK check happens in hardware, so there are no assembly instructions corresponding to the checks. This means we can’t just throw in a printf somewhere, right?

Testing for segfaults

Actually to test the sandbox we basically did just add a printf right after the SIGSEGV. We did this by catching the segfault with a signal handler. This signal handler would check some flags to see if the segfault was expected or not and write to stdout (literally write since printf is not async-signal-safe). This worked fine for automated testing, but what we wanted for developers was to allow catching the segfault, logging it, then returning to their regularly scheduled program execution. Well what really happens during an MPK violation is that the memory access doesn’t finish, so the permissive mode handler had to allow permission to all MPK sandboxes, retry the instruction that caused the segfault, log it, then restore MPK permissions and continue executing from the next instruction.
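To make the testing setup a bit more concrete, here is a minimal sketch of that kind of test handler. It is not the code from our runtime: the names, the exit code and the use of a PROT_NONE page in place of a real MPK violation are all assumptions for illustration. The child process expects a segfault, reports it with write and exits with a known status that the parent checks.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define EXPECTED_SEGV_EXIT 42 /* arbitrary status chosen for this sketch */

/* Set by the test right before the access that should fault */
static volatile sig_atomic_t segv_expected;

static void test_segv_handler(int sig) {
    (void)sig;
    if (!segv_expected)
        _exit(1); /* a segfault we did not plan for */
    static const char msg[] = "caught expected segfault\n";
    /* write(2) is async-signal-safe, printf is not */
    ssize_t unused = write(STDOUT_FILENO, msg, sizeof(msg) - 1);
    (void)unused;
    _exit(EXPECTED_SEGV_EXIT);
}

/* Returns the child's exit status, or -1 if something else went wrong */
static int run_expected_segv_test(void) {
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = test_segv_handler;
        sigemptyset(&sa.sa_mask);
        if (sigaction(SIGSEGV, &sa, NULL) != 0)
            _exit(2);
        /* Hypothetical stand-in for memory owned by another sandbox */
        char *page = mmap(NULL, 4096, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (page == MAP_FAILED)
            _exit(3);
        segv_expected = 1;
        *(volatile char *)page = 1; /* this write faults */
        _exit(4); /* only reached if the write unexpectedly succeeded */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

This is enough for a pass/fail test, but note that the handler never returns to the faulting code, which is exactly what permissive mode needs to do.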

It turns out that signal handlers can access a lot of information about the processor state at the moment the signal was raised. This can be accessed by setting SA_SIGINFO in sa_flags that’s passed to sigaction so that Linux passes in a ucontext_t * when it invokes the signal handler. This ucontext_t has an mcontext_t field with machine-specific state (i.e. CPU registers). The handler for automated tests didn’t need this, but after a few frustrating days of trying to sandbox a moderately large codebase I decided to try to use it to implement permissive mode. Note that it might’ve also been possible to implement permissive mode with a gdb script or ptrace, but those options seemed a bit more cumbersome when a sandboxed process forks itself. In any case it didn’t seem that different than how we were already using signal handlers for testing so it seemed like a reasonable idea at the time.
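As a rough illustration of what SA_SIGINFO gives you, the sketch below records the faulting address and the saved instruction pointer, then makes the page accessible so the retried instruction succeeds once the handler returns. The page, the globals and the function names are hypothetical, and mprotect stands in for the MPK permission change since it runs on any x86-64 Linux machine.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <ucontext.h>
#include <unistd.h>

static char *probe_page;          /* hypothetical protected memory */
static size_t probe_page_size;
static volatile uintptr_t fault_addr; /* address the program tried to touch */
static volatile uintptr_t fault_rip;  /* instruction that touched it */

static void inspect_handler(int sig, siginfo_t *info, void *ctxt) {
    (void)sig;
    ucontext_t *uctxt = (ucontext_t *)ctxt;
    char *addr = (char *)info->si_addr;
    if (addr < probe_page || addr >= probe_page + probe_page_size)
        _exit(1); /* some other segfault, give up */
    fault_addr = (uintptr_t)addr;
    /* Machine state saved by the kernel when the signal was raised */
    fault_rip = (uintptr_t)uctxt->uc_mcontext.gregs[REG_RIP];
    /* Open up the page so the retried instruction can finish */
    if (mprotect(probe_page, probe_page_size, PROT_READ | PROT_WRITE) != 0)
        _exit(2);
    /* Returning resumes at the saved RIP, i.e. retries the access */
}

/* Returns 1 if the fault was observed and the retried write landed */
static int run_inspect_demo(void) {
    probe_page_size = (size_t)sysconf(_SC_PAGESIZE);
    probe_page = mmap(NULL, probe_page_size, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (probe_page == MAP_FAILED)
        return -1;

    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = inspect_handler;
    sa.sa_flags = SA_SIGINFO; /* ask for siginfo_t and ucontext_t */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0)
        return -1;

    *(volatile char *)probe_page = 'x'; /* faults once, then succeeds */

    return fault_addr == (uintptr_t)probe_page && fault_rip != 0 &&
           probe_page[0] == 'x';
}
```

The catch is that nothing here ever re-protects the page, which is the problem the rest of this post tries to solve.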

MPK and signal handlers

The intro mentioned that handlers run in a different context than normal execution so let’s talk about the interplay between that and MPK. MPK provides userspace code with a thread-local register named PKRU which determines memory access permissions. When Linux invokes a signal handler it sets the thread’s PKRU to the minimum set of permissions. With our sandboxes this meant that the PKRU might even forbid writing to the stack depending on where the segfault happened, so our SIGSEGV handler for testing was actually a small assembly stub that uses the wrpkru instruction to set the PKRU to allow all permissions before jumping to the real handler function. In the beginning I mentioned that the kernel restores the normal execution context after the handler returns and this includes the PKRU. That meant that a wrpkru in the handler would have no effect on the PKRU after it returns and I couldn’t use that to retry the instruction that caused the segfault.
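For readers who haven’t looked at the PKRU layout: it is a 32-bit register with two bits for each of the 16 protection keys, an access-disable bit at position 2*key and a write-disable bit at position 2*key+1. The helpers below are a small sketch of how values for it can be computed. The macro and function names are made up for this post, and I’m assuming PKRU_ALL_ACCESS is simply zero (no restrictions on any key).

```c
#include <stdint.h>

/* Per-key permission bits, named here for illustration */
#define PKRU_ACCESS_DISABLE 0x1u /* no reads or writes */
#define PKRU_WRITE_DISABLE  0x2u /* reads allowed, writes not */
#define PKRU_ALL_ACCESS     0x0u /* assumed: every key unrestricted */
#define PKRU_DENY_ALL       0x55555555u /* access-disable set for all 16 keys */

/* Return pkru with the two bits for `key` replaced by `rights` */
static uint32_t pkru_set_key(uint32_t pkru, unsigned key, uint32_t rights) {
    unsigned shift = 2 * (key & 15u);
    pkru &= ~(0x3u << shift);
    pkru |= (rights & 0x3u) << shift;
    return pkru;
}

/* Return a PKRU value that denies every key except `key` */
static uint32_t pkru_only_key(unsigned key) {
    return pkru_set_key(PKRU_DENY_ALL, key, 0);
}
```

For example, pkru_only_key(0) gives 0x55555554, a value that only leaves key 0 usable. Writing such a value with wrpkru is what the trampoline needs to do before the real handler can safely touch memory.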

Fortunately, the PKRU can also be modified with xsave/xrstor. While this isn’t a great idea from a security perspective, it did come in handy for changing the PKRU outside the signal handler context. Basically I needed to compute an offset into the XSAVE buffer with cpuid, then use it to write the PKRU I wanted after the handler return to the fpregs field of the mcontext_t. The details of XSAVE or what Linux does behind the scenes to change the PKRU aren’t too important here, but at this point my first draft of permissive mode looked roughly like this.
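The cpuid part can be sketched as follows. This is my guess at what a helper like calc_pkru_offset_from_cpuid needs to do rather than the exact code we shipped: leaf 7 reports whether the OS has enabled protection keys, and leaf 0xD with sub-leaf 9 (the PKRU state component) reports that component’s byte offset in the standard-format XSAVE area. The function name pkru_xsave_offset is hypothetical.

```c
#include <cpuid.h>

#define CPUID_LEAF_FEATURES 0x07u
#define CPUID_LEAF_XSAVE    0x0du
#define XSAVE_PKRU_SUBLEAF  9u          /* PKRU is XSAVE state component 9 */
#define CPUID_7_ECX_OSPKE   (1u << 4)   /* OS has enabled protection keys */

/*
 * Returns the byte offset of the PKRU state within a standard-format
 * XSAVE area, or -1 if protection keys are not available.
 */
static int pkru_xsave_offset(void) {
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    if (!__get_cpuid_count(CPUID_LEAF_FEATURES, 0, &eax, &ebx, &ecx, &edx))
        return -1;
    if (!(ecx & CPUID_7_ECX_OSPKE))
        return -1;

    if (!__get_cpuid_count(CPUID_LEAF_XSAVE, XSAVE_PKRU_SUBLEAF,
                           &eax, &ebx, &ecx, &edx))
        return -1;
    /* EAX is the size of the component, EBX its offset */
    if (ebx == 0)
        return -1;
    return (int)ebx;
}
```

The offset always lands past the 512-byte legacy region and the 64-byte XSAVE header, so anything below 576 would mean something went wrong.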

__asm__(
    "permissive_mode_trampoline:\n"
        /* Set registers to allow access to all sandboxes */
        ...
        /* Change the PKRU and go to the real signal handler */
        "wrpkru\n"
        "jmp permissive_mode_handler\n"
);

void permissive_mode_handler(int sig, siginfo_t *info, void *ctxt) {
    if (info->si_code != SEGV_PKUERR) {
        /* Give up on segfaults not caused by the sandbox */
        abort();
    }
    ucontext_t *uctxt = (ucontext_t *)ctxt;

    int offset = calc_pkru_offset_from_cpuid();
    /*
     * This points to the PKRU value saved
     * by the kernel when the segfault happened
     */
    /* fpregs is a struct pointer, so apply the byte offset through a char * */
    uint32_t *pkru = (uint32_t *)((char *)uctxt->uc_mcontext.fpregs + offset);

    /* Save the value of the old PKRU somewhere */
    static uint32_t old_pkru = 0;
    old_pkru = *pkru;

    /*
     * Tell the kernel to allow access to
     * everything when restoring the PKRU
     */
    *pkru = PKRU_ALL_ACCESS;

    /* Return to the instruction that caused the segfault */
    return;

    /* Restore PKRU to old_pkru somehow after retrying the instruction??? */
}

Ignoring the nonsense where we’re modifying the PKRU through the floating-point registers field fpregs, this was pretty straightforward so far. The only question remaining was how to restore the PKRU after retrying the instruction that caused the segfault. This was a pretty crucial part of the permissive mode idea, because otherwise it’d be limited to only finding one bad memory access per run, just like when MPK kills the process. So essentially I needed to return from the handler above, execute one instruction, then invoke another handler to switch the PKRU back. This second handler invocation could not be triggered by an MPK segfault, but Linux provides plenty of other signals that seemed like they might work. My first thought was to use one of the alarm signals to reset the PKRU a short amount of time after the MPK segfault. That should work, even if it leaves a small window where permissive mode will miss bad memory accesses. With that as a backup option, I kept looking for options that wouldn’t have this disadvantage.

INT3

According to Wikipedia:

“The INT3 instruction is a one-byte-instruction defined for use by debuggers to temporarily replace an instruction in a running program in order to set a code breakpoint”

If you squint hard enough, permissive mode kind of looks like a debugger so that sounded promising enough. The idea with this approach was to replace an instruction (or its first byte) with a 0xcc byte so that executing it raises a SIGTRAP. Overwriting instructions from the first signal handler (i.e. the process’s own context) requires making the executable segment we want to change writeable, but this was just for debugging so I wasn’t really concerned about security there.

To get a bit more concrete let’s look at some code that might cause an MPK segfault.

/* Assume x is defined in another library */
extern int x;
int foo(void) {
    /* Next line will cause sandbox segfault */
    x = 4;
    return x;
}
$ gcc -shared -fPIC foo.c; objdump -d a.out
00000000000010e9 <foo>:
    10e9:	55                   	push   %rbp
    10ea:	48 89 e5             	mov    %rsp,%rbp
    10ed:	48 8b 05 d4 2e 00 00 	mov    0x2ed4(%rip),%rax
    # Next instruction will cause sandbox segfault
    10f4:	c7 00 04 00 00 00    	movl   $0x4,(%rax)
    10fa:	48 8b 05 c7 2e 00 00 	mov    0x2ec7(%rip),%rax
    1101:	8b 00                	mov    (%rax),%eax
    1103:	5d                   	pop    %rbp
    1104:	c3                   	ret

To handle this with INT3, permissive mode would have to overwrite the byte at offset 0x10fa (i.e. the 0x48 in mov 0x2ec7(%rip),%rax). This way the first handler (triggered by the sandbox) returns to movl $0x4,(%rax), does the write successfully, then executes the 0xcc, which triggers the second handler for SIGTRAP. There we can restore the PKRU to the sandbox value, replace the 0xcc we added with the original 0x48 and return to 0x10fa.
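The SIGTRAP half of this is easy to try out in isolation. The sketch below doesn’t patch any code. It just executes an int3 written in inline assembly and checks from the handler that the saved instruction pointer sits one byte past the 0xcc, which is why a real implementation would have to back up by one byte after restoring the original instruction. All names here are hypothetical.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <ucontext.h>

static volatile sig_atomic_t trap_seen;
static volatile sig_atomic_t prev_byte_is_int3;

static void trap_handler(int sig, siginfo_t *info, void *ctxt) {
    (void)sig;
    (void)info;
    ucontext_t *uctxt = (ucontext_t *)ctxt;
    /* For a breakpoint the saved RIP points just after the int3 */
    const unsigned char *rip =
        (const unsigned char *)uctxt->uc_mcontext.gregs[REG_RIP];
    trap_seen = 1;
    prev_byte_is_int3 = (rip[-1] == 0xcc);
    /* Returning continues with the instruction after the int3 */
}

/* Returns 1 if the breakpoint was delivered as a SIGTRAP as described */
static int run_int3_demo(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = trap_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGTRAP, &sa, NULL) != 0)
        return -1;

    __asm__ volatile("int3"); /* assembles to the single byte 0xcc */

    return trap_seen && prev_byte_is_int3;
}
```

Getting the trap is the easy part. Knowing where to put the 0xcc is what makes this approach painful.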

Of course the code won’t necessarily be loaded at these offsets, but figuring out roughly where the code is loaded is easy to do from a signal handler. Going back to the mcontext_t, we can access gregs[REG_RIP] to get the instruction pointer %rip where the segfault occurred (corresponding to the 0x10f4 offset). Figuring out exactly where the next instruction was loaded is much harder though. x86-64 instructions are between 1 and 15 bytes long, but the ucontext_t has no concept of “current instruction length” or “next instruction address”. It seemed so close to working, but I needed the length of any instruction that could cause a sandbox segfault so the handler could figure out where to write the INT3. After considering various approaches based on decoding the machine code to figure out instruction lengths, I decided against INT3. As much as I like writing emulators, I really didn’t want to decode an instruction set like x86, even if it was a more precise backup option than alarm signals.

User signals

I did like the idea of triggering one signal handler from the initial SIGSEGV handler though. From a performance standpoint this isn’t great because context switching between the kernel and userspace is costly, but this is for debugging and handling a sandbox segfault already costs at least one context switch, so I kept on trying. Fortunately POSIX sets aside not just one, but two signals for userspace code. The creatively-named SIGUSR1 and SIGUSR2 are specifically reserved for userspace programs to play around with, so it seemed like there was a way out. Unfortunately, after messing around with these signals in the SIGSEGV handler and reading a bit about signal disposition, I concluded that you can’t raise one of these signals right after an instruction (i.e. the one that caused the segfault) executes correctly. And so it was back to the drawing board…

x86 Trap flag

While casually perusing Intel’s Software Development Manual Vol. 3, I stumbled on the Trap flag in the EFLAGS register. This bit predates the x86-64 architecture by quite a bit and can even be found in Intel’s 8086 from 1978. From the description in the manual this seemed quite promising for raising a SIGTRAP while avoiding the INT3 problem.

TF Trap (bit 8) – Set to enable single-step mode for debugging. … This allows the execution state of a program to be inspected after each instruction.

This causes Linux to invoke the SIGTRAP handler after executing each instruction in the normal execution context. So basically a sandbox segfault triggers the SIGSEGV handler, which sets the Trap flag in EFLAGS and the permissive PKRU for the normal context through the mcontext_t. The instruction that caused the segfault then executes successfully and the processor raises a trap, which Linux delivers as a SIGTRAP to its handler. This SIGTRAP handler would then log the sandbox segfault, restore the PKRU and EFLAGS and return to executing the sandboxed program normally. Note that even though the SIGTRAP handler runs in the userspace context, the kernel also blocks further SIGTRAP signals when it invokes the handler, so only code outside the handler is single-stepped. It’s effectively like the program is single-stepping itself when sandbox violations occur. In the end it looked something like this.

void permissive_mode_handler(int sig, siginfo_t *info, void *ctxt) {
    bool handle_pkuerr = sig == SIGSEGV && info->si_code == SEGV_PKUERR;
    bool handle_trap = sig == SIGTRAP;
    if (!handle_pkuerr && !handle_trap) {
        /* I don't know what's going on here */
        abort();
    }
    ucontext_t *uctxt = (ucontext_t *)ctxt;

    int offset = calc_pkru_offset_from_cpuid();
    /*
     * This points to the PKRU value saved
     * by the kernel when the segfault happened
     */
    /* fpregs is a struct pointer, so apply the byte offset through a char * */
    uint32_t *pkru = (uint32_t *)((char *)uctxt->uc_mcontext.fpregs + offset);
    uint64_t *eflags = (uint64_t *)(&uctxt->uc_mcontext.gregs[REG_EFL]);

    static uint32_t old_pkru = 0;

    if (handle_pkuerr) {
        /* Enable single-step mode when the SIGSEGV handler returns */
        *eflags |= TRAP_BIT;

        /* Save the value of the old PKRU somewhere */
        old_pkru = *pkru;

        *pkru = PKRU_ALL_ACCESS;

        /*
         * Async-signal-safety limits what we can do here
         * so just push info about the sandbox segfault onto
         * a queue for asynchronous logging
         */
        push_logging_queue(info, ctxt);
    } else if (handle_trap) {
        /* Restore the old PKRU */
        *pkru = old_pkru;

        /* Disable single-step mode when the SIGTRAP handler returns */
        *eflags &= ~TRAP_BIT;
    }
}
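If you want to play with the technique without MPK hardware, here is a self-contained sketch of the same control flow that swaps the PKRU for plain mprotect. That swap is my simplification for this post, not how our runtime works, and every name below is hypothetical. The SIGSEGV branch opens up the page and sets the Trap flag, and the SIGTRAP branch closes the page again and clears the flag. Strictly speaking mprotect isn’t on the list of async-signal-safe functions, so treat this as a demo only.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <ucontext.h>
#include <unistd.h>

#define TRAP_BIT 0x100 /* TF is bit 8 of EFLAGS */

static char *guarded_page; /* stand-in for another sandbox's memory */
static size_t page_size;
static volatile sig_atomic_t violations; /* how many accesses we logged */

static void demo_handler(int sig, siginfo_t *info, void *ctxt) {
    ucontext_t *uctxt = (ucontext_t *)ctxt;
    greg_t *eflags = &uctxt->uc_mcontext.gregs[REG_EFL];

    if (sig == SIGSEGV) {
        char *addr = (char *)info->si_addr;
        if (addr < guarded_page || addr >= guarded_page + page_size)
            abort(); /* not a violation of our fake sandbox */
        violations++;
        /* Let the retried instruction through... */
        mprotect(guarded_page, page_size, PROT_READ | PROT_WRITE);
        /* ...and single-step it once the handler returns */
        *eflags |= TRAP_BIT;
    } else if (sig == SIGTRAP) {
        /* One instruction has run, so lock the page again */
        mprotect(guarded_page, page_size, PROT_NONE);
        *eflags &= ~TRAP_BIT;
    }
}

/* Returns the number of violations logged, or -1 on setup failure */
static int run_permissive_demo(void) {
    page_size = (size_t)sysconf(_SC_PAGESIZE);
    guarded_page = mmap(NULL, page_size, PROT_NONE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (guarded_page == MAP_FAILED)
        return -1;

    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = demo_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0 ||
        sigaction(SIGTRAP, &sa, NULL) != 0)
        return -1;

    volatile char *p = guarded_page;
    p[0] = 'a'; /* first violation: logged, allowed, page re-protected */
    p[1] = 'b'; /* second violation is caught too */

    return (int)violations;
}
```

Both writes get counted, which is the whole point: unlike the first draft, the protection is back in place by the time the second access happens.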

Wrap up

After adding a few more bells and whistles to try to get as much ELF symbol information as possible for each MPK violation, I finally had a working permissive mode. It also ended up being quite ergonomic from a developer point of view, since enabling it simply requires #includeing a header, and I used a constructor which calls pthread_atfork with itself to make it persist when a process forks. In retrospect, while a ptrace-based solution may have been a bit more standard (i.e. less hacky), this was nonetheless an interesting journey in learning about signal handlers and the different ways you could get a program to single-step itself.