Kernel memory randomization and trampoline page tables
In the past few months, I have been working on adding memory randomization to the Linux kernel for x86_64. Coding low-level and early boot features can introduce strange bugs: you usually don't have a call stack or any other information, the machine just reboots. Fixing these bugs is often more like solving a mystery than anything else. This post is about an interesting issue I ran into while trying to complete the original proposal.
If you are interested in improving Linux kernel security, a good first step is to join the Kernel Self Protection Project.
KASLR and memory randomization
Kernel Address Space Layout Randomization (or KASLR) has been part of the Linux kernel since 2013-2014, thanks to the work of Kees Cook. The original implementation randomizes the base address of the kernel and its modules. My changes randomize the three main memory sections (physical mapping, vmalloc and vmemmap) to prevent guessing the location of critical structures without a direct leak.
This type of attack was described by Nicolas A. Economou & Enrique E. Nissim in their “Getting Physical” talk. I have done similar research internally at Google with a different approach but a similar conclusion. The timeline was perfect because I could reference Nicolas & Enrique's paper, thanks guys!
Randomization at the PUD level
The memory section randomization is done by generating virtual addresses early at boot time. The non-randomized virtual addresses were aligned on the top page table level (PGD). The new randomized addresses are aligned on the next level down (PUD). With randomization enabled, the PUD offset might be different from zero, as shown on this schema:
For example, the physical mapping section base address was 0xffff880000000000 with 64TB reserved. With memory randomization, the virtual address starts at the same base but is randomized using this mask: 0x0000ffffc0000000 (30-bit shift), based on the memory available and the placement of the other memory sections.
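To make the alignment concrete, here is a minimal user-space sketch of the idea (my own simplified illustration, not the kernel code; the constants, entropy value and randomize_base() helper are all hypothetical). A random offset is masked to PUD granularity before being added to the base, so the result stays 1GB-aligned:

#include <stdio.h>

/* Simplified x86_64 4-level paging constants. */
#define PUD_SHIFT        30                     /* each PUD entry maps 1GB */
#define PUD_SIZE         (1UL << PUD_SHIFT)
#define PUD_MASK         (~(PUD_SIZE - 1))

#define PAGE_OFFSET_BASE 0xffff880000000000UL   /* non-randomized base */

/* Hypothetical helper: keep only the PUD-aligned bits of the random value. */
static unsigned long randomize_base(unsigned long base, unsigned long rand,
                                    unsigned long max_offset)
{
        return base + ((rand % max_offset) & PUD_MASK);
}

int main(void)
{
        /* Pretend we drew this random value with 1TB of room for the region. */
        unsigned long addr = randomize_base(PAGE_OFFSET_BASE,
                                            0x123456789abcdefUL, 1UL << 40);

        printf("randomized base: 0x%016lx (PUD offset: %lu)\n",
               addr, (addr >> PUD_SHIFT) & 0x1ff);
        return 0;
}

The real logic lives in kernel_randomize_memory() (arch/x86/mm/kaslr.c in the merged series) and also splits the remaining entropy between the three regions.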
You can learn more about page table management on Linux here. Also, this commit changes how the kernel maps the physical memory section to support PUD-level virtual addresses, and this commit sets up everything needed for memory randomization.
It only boots with one processor
After completing the first prototype, I tested different configurations. With a second processor enabled, the machine rebooted directly without any information, which usually means something went terribly wrong.
I tracked down the exact place of the crash using earlyprintk and kernel debugging. It happened when the second processor started and accessed incorrect virtual addresses. I also modified the randomization code to test different virtual addresses: the crash only happened if the physical memory section was not aligned on the PGD level, i.e., when it was randomized at the PUD level.
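For reference, this kind of early boot output can be enabled with the earlyprintk kernel parameter; a common form for the first serial port (the baud rate here is just an example) is:

earlyprintk=serial,ttyS0,115200,keep

The keep option prevents early output from being disabled once the real console takes over.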
The crash happens soon after this page table switch on processor startup:
# Setup trampoline 4 level pagetables
movl $pa_trampoline_pgd, %eax
movl %eax, %cr3
The following comment is on top of the trampoline assembly file:
Entry: CS:IP point to the start of our code, we are
in real mode with no stack, but the rest of the
trampoline page to make our stack and everything else
is a mystery.

On entry to trampoline_start, the processor is in real mode
with 16-bit addressing and 16-bit data. CS has some value
and IP is zero. Thus, data addresses need to be absolute
(no relocation) and are taken with regard to r_base.

With the addition of trampoline_level4_pgt this code can
now enter a 64bit kernel that lives at arbitrary 64bit
physical addresses.
I found this code initializing trampoline_pgd in setup_real_mode:
trampoline_pgd = (u64 *) __va(real_mode_header->trampoline_pgd);
trampoline_pgd[0] = init_level4_pgt[pgd_index(__PAGE_OFFSET)].pgd;
trampoline_pgd[511] = init_level4_pgt[511].pgd;
The physical memory mapping (__PAGE_OFFSET) PGD entry is copied from the current page table to the trampoline page table. The PGD offset is different though: the trampoline puts this mapping at PGD offset zero.
The kernel uses the trampoline page table after exiting real-mode and before fully moving to 64-bit mode. The page table has to reflect what happens in real-mode, with the first physical page at the lowest virtual address (explaining trampoline_pgd[0]).
The crash happened because the assembly startup code was trying to access global variables at the wrong virtual address, leading to an access violation. Copying the PGD entry only preserves the identity mapping if __PAGE_OFFSET is PGD-aligned; once it is randomized at the PUD level, low physical addresses resolve through the wrong PUD entry. Nothing is caught at this stage, the machine just reboots.
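To see the mismatch, here is a small sketch of the index math (my own simplified illustration, assuming 4-level paging; the example addresses are made up):

#include <stdio.h>

#define PUD_SHIFT       30              /* each PUD entry maps 1GB */
#define PTRS_MASK       0x1ffUL         /* 512 entries per table */

static unsigned long pud_index(unsigned long vaddr)
{
        return (vaddr >> PUD_SHIFT) & PTRS_MASK;
}

int main(void)
{
        unsigned long paddr = 0x100000;                  /* a low physical page */
        unsigned long aligned = 0xffff880000000000UL;    /* PGD-aligned base */
        unsigned long randomized = 0xffff8800c0000000UL; /* PUD-aligned only */

        /* The trampoline identity-maps paddr, so its walk uses pud_index(paddr). */
        printf("trampoline looks up PUD index %lu\n", pud_index(paddr));
        /* The copied PGD entry points at the kernel PUD page, where the entry
           for paddr really sits at pud_index(__PAGE_OFFSET + paddr). */
        printf("PGD-aligned base has it at    %lu\n", pud_index(aligned + paddr));
        printf("randomized base has it at     %lu\n", pud_index(randomized + paddr));
        return 0;
}

With the PGD-aligned base both indices match (0), so copying the PGD entry works. With the randomized base the mapping sits three PUD entries away, and the identity walk lands on an unrelated (or empty) entry.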
The easiest way to fix this issue is to have a correct trampoline page table (aligned on PUD level). The following code generates the right trampoline page table layout:
void __meminit init_trampoline(void)
{
        unsigned long paddr, paddr_next;
        pgd_t *pgd;
        pud_t *pud_page, *pud_page_tramp;
        int i;

        /* Without memory randomization, copying the PGD entry is enough. */
        if (!kaslr_memory_enabled()) {
                init_trampoline_default();
                return;
        }

        /* Allocate a dedicated PUD page for the trampoline. */
        pud_page_tramp = alloc_low_page();

        /* Walk the kernel PUD page covering physical address zero. */
        paddr = 0;
        pgd = pgd_offset_k((unsigned long)__va(paddr));
        pud_page = (pud_t *) pgd_page_vaddr(*pgd);

        for (i = pud_index(paddr); i < PTRS_PER_PUD; i++, paddr = paddr_next) {
                pud_t *pud, *pud_tramp;
                unsigned long vaddr = (unsigned long)__va(paddr);

                /* Copy each PUD entry to its identity-mapped slot. */
                pud_tramp = pud_page_tramp + pud_index(paddr);
                pud = pud_page + pud_index(vaddr);
                paddr_next = (paddr & PUD_MASK) + PUD_SIZE;

                *pud_tramp = *pud;
        }

        /* Point the trampoline PGD entry at the new PUD page. */
        set_pgd(&trampoline_pgd_entry,
                __pgd(_KERNPG_TABLE | __pa(pud_page_tramp)));
}
You can see that another PUD page is allocated for the trampoline and that each entry is copied at its identity-mapped offset instead of the randomized one. Only the first PUD page is needed; nothing else is accessed during this transition.
Another way to fix this issue might have been to keep the PUD offset in the real-mode header and shift each global variable accessed, though I am not sure it would have worked with the 64-bit mode jump.
KASLR is a work in progress
As a side note, I wanted to mention that KASLR is still a work in progress. My goal is to contribute and make it better. I know it has multiple weaknesses and leaks, but I still think it is useful and going in the right direction, especially for remote (including Cloud / KVM) and sandboxed environments.