exploiting the linux kernel via intel's sysret implementation

Exploiting the Linux Kernel via Intel's SYSRET Implementation

Niko@FluxFingers

Outline

●Syscalls and Context Switches
●Canonical Addresses
●SYSRET #GP Triggering
●Step by Step Exploitation and Rooting

Linux x86_64 Syscalls

●Old x86 processors use int $0x80, with the syscall number in %eax and parameters in %ebx, %ecx, etc.
  ○ However it’s super slow and got replaced with Intel’s SYSENTER mechanism
●x86_64 uses AMD’s SYSCALL, with parameters in %rdi, %rsi, %rdx, %r10, ...
  ○ Faster to handle than the whole interrupt path
  ○ Intel CPUs adopted SYSCALL according to AMD’s specs since it became the standard syscall mechanism
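As a rough illustration of the SYSCALL path, the sketch below issues a syscall without going through a libc wrapper. raw_syscall0 is a made-up helper name; it only handles syscalls without parameters and assumes Linux with GCC or Clang inline assembly. On anything other than x86_64 it simply falls back to syscall(2).

```c
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: issue a zero-argument syscall.
 * On x86_64 Linux this executes the SYSCALL instruction directly;
 * elsewhere it falls back to libc's syscall(). */
static long raw_syscall0(long nr)
{
#if defined(__x86_64__) && defined(__linux__)
    long ret;
    /* Syscall number goes in %rax, result comes back in %rax.
     * SYSCALL stores the return %rip in %rcx and %rflags in %r11,
     * so both are listed as clobbered. */
    __asm__ volatile("syscall"
                     : "=a"(ret)
                     : "a"(nr)
                     : "rcx", "r11", "memory");
    return ret;
#else
    return syscall(nr);
#endif
}
```

Calling raw_syscall0(SYS_getpid) should give the same value as getpid(). The clobber list is the interesting part: the return address ends up in %rcx, which is the register SYSRET later reads it back from.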

SYSCALL/SYSRET

●Whenever a syscall is invoked via SYSCALL, a context switch to kernel mode takes place
  ○ When leaving the syscall the kernel needs to restore specific userland registers
  ○ And transfer back to ring3 with SYSRET
●SYSRET is fast since it “only” needs to:
  ○ Load the saved %rip from %rcx
  ○ Swap %cs back to ring3 mode
●The kernel itself has to make sure to restore all other userland registers before executing SYSRET

SYSCALL/SYSRET

[Figure: virtual address space of a process (/bin/cat), low to high addresses, shown twice: once with a SYSCALL transition from the process into kernel memory, once with a SYSRET transition back]

0x0000000000000000
0x0000000000400000    Process (/bin/cat): .text, .data, .bss, Heap
0x00000000006XXXXX
0x00007fXXXXXXXXXX    Shared Libraries
0x00007ffffXXXXXXX    Stack
0xffffffff80000000    Kernel Memory
0xffffffffff600000    VSYSCALL

How Linux handles SYSRET

●arch/x86/kernel/entry_64.S:

    ret_from_sys_call:
        movl $_TIF_ALLWORK_MASK,%edi
        ...
    sysret_check:
        ...
        movq RIP-ARGOFFSET(%rsp),%rcx
        CFI_REGISTER rip,rcx
        RESTORE_ARGS 1,-ARG_SKIP,0
        movq PER_CPU_VAR(old_rsp), %rsp
        USERGS_SYSRET64

●The kernel makes sure to restore %rsp, %gs, etc. and executes SYSRET at the end

Canonical Addresses

●On x86_64 registers are 64 bits wide
●The instruction pointer (%rip) can only use 48 bits
  ○ 48 bits == balanced value for page-tables/accessible memory
●Leftover upper bits are reserved for CPU-specific tricks
  ○ like the NX bit at position 63 of a page-table entry
●Meaning the value of %rip has to be “canonical”, aka between
  ○ 0x0000000000000000 -> 0x00007FFFFFFFFFFF
  ○ 0xFFFF800000000000 -> 0xFFFFFFFFFFFFFFFF
●(Bits 48 .. 63 have to be copies of bit 47)
●Non-canonical values in %rip are not allowed and will trigger exceptions in certain cases
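The "bits 48 .. 63 are copies of bit 47" rule can be written as a small check. This is a minimal sketch that assumes 48-bit virtual addressing as described on this slide; is_canonical is just an illustrative name.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if bits 48..63 of addr are all copies of bit 47
 * (assumes 48-bit virtual addresses). */
static bool is_canonical(uint64_t addr)
{
    uint64_t top = addr >> 47;          /* bits 47..63, 17 bits in total */
    return top == 0 || top == 0x1FFFF;  /* all clear or all set */
}
```

With this check, 0x00007FFFFFFFFFFF and 0xFFFF800000000000 are canonical, while 0x0000800000000000 (the first address past the lower half) is not.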

Non-canonical addresses and SYSRET

●Whenever a SYSRET is executed and the CPU sees a non-canonical value in %rcx it triggers a #GP

●AMD specs however never defined when the #GP will actually happen

●Clever researchers at Xen found out that AMD CPUs trigger the #GP once back in user mode

●Not so on Intel ...

Intel’s Version of SYSRET

●AMD’s specs omitted the check for non-canonical values in %rcx / %rip

● Intel decided to check for non-canonical values before the privilege level is changed

Intel’s Version of SYSRET

●Triggering a #GP from kernel mode has consequences on Linux

●Recall that prior to executing SYSRET Linux restores the userland %rsp and swaps %gs

● Intel’s SYSRET will #GP on the userland stack while still being in ring0
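The difference between the two vendors comes down to the order of "check %rcx" and "drop to ring3". The toy model below only encodes the behaviour described on these slides; it is not taken from either vendor's manual, and all names in it are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

enum cpu_vendor { CPU_AMD, CPU_INTEL };

struct sysret_result {
    bool gp_fault;   /* did SYSRET end in a #GP? */
    int  fault_cpl;  /* privilege level the #GP is taken in, -1 if none */
};

/* Same canonical check as for %rip: bits 48..63 must copy bit 47. */
static bool addr_is_canonical(uint64_t addr)
{
    uint64_t top = addr >> 47;
    return top == 0 || top == 0x1FFFF;
}

/* Toy model of SYSRET with the return address taken from %rcx. */
static struct sysret_result model_sysret(enum cpu_vendor vendor, uint64_t rcx)
{
    struct sysret_result r = { false, -1 };

    if (addr_is_canonical(rcx))
        return r;                 /* normal return to userland */

    r.gp_fault = true;
    /* Intel: checked before the privilege switch, so still ring0.
     * AMD:   faults only after the switch, so already ring3. */
    r.fault_cpl = (vendor == CPU_INTEL) ? 0 : 3;
    return r;
}
```

For a non-canonical %rcx such as 0x0000800000000000 the model reports a #GP in ring0 for CPU_INTEL and in ring3 for CPU_AMD, which is the whole bug in one line.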

#GP on userland %rsp

●#GP is an exception reached via an IDT entry (arch/x86/kernel/traps.c):

    set_intr_gate(X86_TRAP_GP, general_protection);

●Where general_protection resolves to an errorentry macro in arch/x86/kernel/entry_64.S:

    .macro errorentry sym do_sym
    ENTRY(\sym)
        XCPT_FRAME
        ASM_CLAC
        PARAVIRT_ADJUST_EXCEPTION_FRAME
        subq $ORIG_RAX-R15, %rsp
        CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15
        call error_entry
        ...


●error_entry sets up an exception stack and backs up all registers:

    ENTRY(error_entry)
        XCPT_FRAME
        CFI_ADJUST_CFA_OFFSET 15*8
        cld
        movq_cfi rdi, RDI+8
        movq_cfi rsi, RSI+8
        movq_cfi rdx, RDX+8
        …

●where movq_cfi is defined as

    .macro movq_cfi reg offset=0
        movq %\reg, \offset(%rsp)
        CFI_REL_OFFSET \reg, \offset
    .endm


●When setting up the stack frame in error_entry, all (general) registers are saved to x(%rsp) / [rsp+x]
●The kernel restored the userland %rsp and registers before SYSRET
●=> Arbitrary memory write while in ring0
●Classic possibility for privilege escalation
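Why this is a write primitive can be seen in a userland simulation: if both the base (%rsp) and the values (the registers) are caller-controlled, the save sequence stores chosen data at a chosen place. The sketch below models memory as a byte array; save_regs and load_u64 are hypothetical helpers, and the slot layout (8 bytes skipped for the return address of "call error_entry", then 15 consecutive slots) is a simplification of the real RDI+8, RSI+8, ... offsets.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_SAVED_REGS 15   /* mirrors the 15*8 adjustment in error_entry */

/* Toy stand-in for the movq_cfi sequence: store each register value
 * 8 + 8*i bytes above rsp. mem models memory, rsp is an offset into it. */
static void save_regs(uint8_t *mem, size_t rsp,
                      const uint64_t regs[NUM_SAVED_REGS])
{
    for (size_t i = 0; i < NUM_SAVED_REGS; i++)
        memcpy(mem + rsp + 8 + 8 * i, &regs[i], sizeof regs[i]);
}

/* Read back one 64-bit slot of the modelled memory. */
static uint64_t load_u64(const uint8_t *mem, size_t off)
{
    uint64_t v;
    memcpy(&v, mem + off, sizeof v);
    return v;
}
```

Whoever picks rsp and regs decides which 120 bytes of mem get overwritten and with what; in the real bug, mem is kernel-accessible memory and the writer runs in ring0.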

Linux’ Protection against n/c %rip

●This behaviour already bit Linux in 2006 (CVE-2006-0744)

●To make sure no code ends up in non-canonical address space (or right before it), a guard page was introduced

●mmap(0x7ffffffff000, 4096, PROT_READ … will return ENOMEM

●This way SYSRET “shouldn’t” return to any n/c address

Linux’ Protection against n/c %rip

●Another possibility is using a “safe” IRET path for returning back to ring3
  ○ IRET requires a ring3 backup on the stack to return to user code
  ○ It is slower than SYSRET

●The ptrace interface sets an IRET path most of the time

●However some syscalls use a SYSRET path albeit being ptraced

●One example is fork() since it signals with ptrace_event() that does not force IRET

Crash PoC

●fork() a child
●Child sets PTRACE_TRACEME
●Child raises SIGSTOP
●Parent sets PTRACE_O_TRACEFORK
●Child fork()s again
●Parent catches this fork
●And uses PTRACE_SETREGS to set %rip to a non-canonical address
●Pivots %rsp to an arbitrary place
●And continues the child with PTRACE_CONT
●fork() will return via SYSRET with a non-canonical %rcx
●CPU will #GP, Pagefault, Doublefault and Panic

How to get root

The plan

●We need to get Kernel Code Execution between the #GP and Panic

●Then restore the damage we have done
●Set credentials of current process to 0
●Return back to userland
●And open a shell

The target

●Since #GP will always trigger a Pagefault and Doublefault we can pivot %rsp back to the IDT
●And set 2 specific registers to craft a fake IDT gate
●That will be placed instead of the original Page- or Doublefault handler

IDT Layout

●We can read IDTR with the sidt instruction

IDT Gate Entry

●And set up a new gate with modified “Offsets”
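For reference, a 64-bit IDT gate is 16 bytes and splits the handler address into three "Offset" pieces, which is why two 64-bit registers are enough to describe one. The struct below follows the generic long-mode gate layout; the field and function names are illustrative and not the kernel's own definitions.

```c
#include <stdint.h>

/* 16-byte interrupt gate descriptor in 64-bit mode
 * (field names are illustrative). */
struct idt_gate64 {
    uint16_t offset_low;    /* handler address bits 0..15  */
    uint16_t selector;      /* code segment selector       */
    uint8_t  ist;           /* bits 0..2: IST index        */
    uint8_t  type_attr;     /* gate type, DPL, present bit */
    uint16_t offset_mid;    /* handler address bits 16..31 */
    uint32_t offset_high;   /* handler address bits 32..63 */
    uint32_t reserved;
} __attribute__((packed));

/* Reassemble the handler address from its three pieces. */
static uint64_t gate_offset(const struct idt_gate64 *g)
{
    return (uint64_t)g->offset_low |
           ((uint64_t)g->offset_mid << 16) |
           ((uint64_t)g->offset_high << 32);
}

/* Split a handler address into the three offset fields. */
static void gate_set_offset(struct idt_gate64 *g, uint64_t addr)
{
    g->offset_low  = (uint16_t)(addr & 0xFFFF);
    g->offset_mid  = (uint16_t)((addr >> 16) & 0xFFFF);
    g->offset_high = (uint32_t)(addr >> 32);
}
```

The first 8 bytes hold offset_low, selector, ist, type_attr and offset_mid; the second 8 bytes hold offset_high and the reserved field.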

The target

●Before we trigger #GP we can allocate a Landing Area in Userland

●Where we copy code that will be executed
●Craft a fake IDT gate that points to this area
●Triggering #GP will then overwrite e.g. the Doublefault gate with the fake gate
●And the kernel will jump to userland and execute our code with kernel privileges

Kernel Shellcode

● Inside this code we will have to swapgs in order to access kernel structures

●Then we carefully rebuild all IDT entries that were trashed in the overwrite process

●Then we can raise process credentials

Process structures

●Each process in userland has an associated kernel structure (thread_union) that builds the kernel stack:

[Figure: thread_union, consisting of the kernel stack and thread_info]

Process structures

●thread_info itself has an element that points to task_struct

[Figure: thread_info with members *task_struct, *exec_domain, …]

Process structures < 2.6.29

●task_struct contains lots of info about the running task

●and its credentials

[Figure: task_struct with members state, stack, usage, ..., uid, gid, caps, ...]

Process structures < 2.6.29

[Figure: thread_union (kernel stack, thread_info) -> thread_info (*task_struct, *exec_domain, …) -> task_struct (state, stack, usage, ..., uid, gid, caps, ...)]

Kernel Shellcode

●On < 2.6.29 raising process credentials is a matter of finding uid, gid and caps in task_struct
●And patching them to 0
●Luckily %gs in kernel mode contains the offset to x8664_pda (/include/asm-x86/pda.h)

    /* Per processor datastructure. %gs points to it while the kernel runs */
    struct x8664_pda {
        struct task_struct *pcurrent;  /* 0  Current process */
        unsigned long data_offset;     /* 8  Per cpu data offset from linker address */
        unsigned long kernelstack;     /* 16 top of kernel stack for current */
        unsigned long oldrsp;          /* 24 user rsp for system call */
        int irqcount;                  /* 32 Irq nesting counter. Starts with -1 */
        int cpunumber;                 /* 36 Logical CPU number */
    #ifdef CONFIG_CC_STACKPROTECTOR
        unsigned long stack_canary;
        ...

Kernel Shellcode

●%gs:0 will point to task_struct
●So we can simply:

    asm("movq %%gs:0, %0" : "=r"(ptr));
    cred = (uint32_t *)ptr;
    for (i = 0; i < 1000; i++, cred++) {
        if (cred[0] == uid && cred[1] == uid && cred[2] == uid && cred[3] == uid &&
            cred[4] == gid && cred[5] == gid && cred[6] == gid && cred[7] == gid) {
            cred[0] = cred[1] = cred[2] = cred[3] = 0;
            cred[4] = cred[5] = cred[6] = cred[7] = 0;
        }
    }

●Where uid/gid are getuid() and getgid()
●And our process will be root

Kernel Shellcode

●On > 2.6.29 x8664_pda is removed
●And task_struct contains a new member called cred (credential records)
●If %rsp wasn’t modified we could walk back to the top of the stack to find thread_info
●And do heuristic scanning to find thread_info->task_struct->cred->uid/gid
●However with credential records come two new functions
●prepare_kernel_cred / commit_creds
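The "walk back to find thread_info" idea is plain pointer arithmetic: thread_info sits at the lowest address of the stack area, so masking off the low bits of any pointer into the kernel stack gives its address. The sketch below assumes 8 KiB kernel stacks, which was the x86_64 default on kernels of that era; both the constant and the function name are illustrative.

```c
#include <stdint.h>

/* Assumption: 8 KiB kernel stacks (x86_64 kernels of that era). */
#define ASSUMED_THREAD_SIZE 8192ULL

/* Round an in-stack pointer down to the start of the stack area,
 * which is where thread_info is located. */
static uint64_t thread_info_base(uint64_t kernel_rsp)
{
    return kernel_rsp & ~(ASSUMED_THREAD_SIZE - 1);
}
```

For a stack pointer of 0xffff880012345f00 this gives 0xffff880012344000. The catch named above still applies: it only works if %rsp still points into the kernel stack, which it does not after the pivot.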

Kernel Shellcode

●prepare_kernel_cred creates a new clean credentials structure

●commit_creds installs the new cred to the current task

●Both symbols are exported through /proc/kallsyms or /boot/System.map

●Kernel shellcode just needs to call:

    commit_creds(prepare_kernel_cred(0));

●And we’re root again
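Both /proc/kallsyms and System.map use one "address type name" line per symbol, so looking up a symbol is a matter of matching lines. A minimal sketch of that matching step follows; match_symbol_line is a hypothetical helper, and reading the file line by line (e.g. with fgets) is left out.

```c
#include <stdio.h>
#include <string.h>

/* Parse one "<hex address> <type> <name>" line. If the name equals
 * wanted, store the address in *addr and return 1; otherwise return 0. */
static int match_symbol_line(const char *line, const char *wanted,
                             unsigned long long *addr)
{
    unsigned long long a;
    char type;
    char name[256];

    if (sscanf(line, "%llx %c %255s", &a, &type, name) != 3)
        return 0;               /* not a symbol line */
    if (strcmp(name, wanted) != 0)
        return 0;               /* some other symbol */
    *addr = a;
    return 1;
}
```

Fed a line such as "ffffffff81000000 T commit_creds" (a made-up address), it returns 1 and stores 0xffffffff81000000.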

Kernel Shellcode

●Next we will have to cleanly return back to userland

●Easiest method is to use IRET:

    __asm__ __volatile__(
        "movq %0, 0x20(%%rsp);"
        "movq %1, 0x18(%%rsp);"
        "movq %2, 0x10(%%rsp);"
        "movq %3, 0x08(%%rsp);"
        "movq %4, 0x00(%%rsp);"
        "swapgs;"
        "iretq;"
        :: "i"(USER_SS), "i"(user_stack), "i"(USER_FL), "i"(USER_CS), "i"(user_code)
    );

●Where user_code points to memory in userland that should be executed when kernel exits

Popping uid=0(root)

●user_code can do anything now since it runs as root

●So we can simply execve(/bin/sh) from there
●However that happens inside the child so we have to bring the rootshell back to the parent
●Or we just chmod() or setxattr() to drop a rootshell

Demo Time

Limitations

●These techniques work well with 2.6.18 - 3.9.X
●3.10 mitigates the IDT attack by remapping it to rodata (arch/x86/kernel/traps.c)

    __set_fixmap(FIX_RO_IDT, __pa_symbol(idt_table), PAGE_KERNEL_RO);
    idt_descr.address = fix_to_virt(FIX_RO_IDT);

●CPUs with SMAP/SMEP will detect accessing userland code while still being in ring0

●Grsecurity provides a handful of protections that make this bug a pain to exploit
  ○ GRKERNSEC_RANDSTRUCT
  ○ PAX_MEMORY_UDEREF
  ○ GRKERNSEC_HIDESYM
  ○ ...

Further thoughts

●Linux fix is weird (“only” forces ptrace_stop() to use IRET)

●Syscalls can still return via SYSRET
●Also the bug within SYSRET is still present
●Since it’s a hardware issue it might be present in other OSes in different variations (OHAI 2006)
●Anyone wanna check FreeBSD …?

Questions?
