Linux kernel memory management Part 3.

Introduction to the kmemcheck in the Linux kernel

This is the third part of the chapter which describes memory management in the Linux kernel and in the previous part of this chapter we met two memory management related concepts:

  • Fix-Mapped Addresses;
  • ioremap.

The first concept represents special area in virtual memory, whose corresponding physical mapping is calculated in compile-time. The second concept provides ability to map input/output related memory to virtual memory.

For example if you will look at the output of the /proc/iomem:

  1. $ sudo cat /proc/iomem
  2. 00000000-00000fff : reserved
  3. 00001000-0009d7ff : System RAM
  4. 0009d800-0009ffff : reserved
  5. 000a0000-000bffff : PCI Bus 0000:00
  6. 000c0000-000cffff : Video ROM
  7. 000d0000-000d3fff : PCI Bus 0000:00
  8. 000d4000-000d7fff : PCI Bus 0000:00
  9. 000d8000-000dbfff : PCI Bus 0000:00
  10. 000dc000-000dffff : PCI Bus 0000:00
  11. 000e0000-000fffff : reserved
  12. ...
  13. ...
  14. ...

you will see map of the system’s memory for each physical device. Here the first column displays the memory registers used by each of the different types of memory. The second column lists the kind of memory located within those registers. Or for example:

  1. $ sudo cat /proc/ioports
  2. 0000-0cf7 : PCI Bus 0000:00
  3. 0000-001f : dma1
  4. 0020-0021 : pic1
  5. 0040-0043 : timer0
  6. 0050-0053 : timer1
  7. 0060-0060 : keyboard
  8. 0064-0064 : keyboard
  9. 0070-0077 : rtc0
  10. 0080-008f : dma page reg
  11. 00a0-00a1 : pic2
  12. 00c0-00df : dma2
  13. 00f0-00ff : fpu
  14. 00f0-00f0 : PNP0C04:00
  15. 03c0-03df : vga+
  16. 03f8-03ff : serial
  17. 04d0-04d1 : pnp 00:06
  18. 0800-087f : pnp 00:01
  19. 0a00-0a0f : pnp 00:04
  20. 0a20-0a2f : pnp 00:04
  21. 0a30-0a3f : pnp 00:04
  22. ...
  23. ...
  24. ...

can show us lists of currently registered port regions used for input or output communication with a device. All memory-mapped I/O addresses are not used by the kernel directly. So, before the Linux kernel can use such memory, it must map it to the virtual memory space which is the main purpose of the ioremap mechanism. Note that we saw only early ioremap in the previous part. Soon we will look at the implementation of the non-early ioremap function. But before this we must learn other things, like a different types of memory allocators and etc., because in other way it will be very difficult to understand it.

So, before we will move on to the non-early memory management of the Linux kernel, we will see some mechanisms which provide special abilities for debugging, check of memory leaks, memory control and etc. It will be easier to understand how memory management arranged in the Linux kernel after learning of all of these things.

As you already may guess from the title of this part, we will start to consider memory mechanisms from the kmemcheck. As we always did in other chapters, we will start to consider from theoretical side and will learn what is kmemcheck mechanism in general and only after this, we will see how it is implemented in the Linux kernel.

So let’s start. What is it kmemcheck in the Linux kernel? As you may guess from the name of this mechanism, the kmemcheck checks memory. That’s true. Main point of the kmemcheck mechanism is to check that some kernel code accesses uninitialized memory. Let’s take following simple C program:

  1. #include <stdlib.h>
  2. #include <stdio.h>
  3. struct A {
  4. int a;
  5. };
  6. int main(int argc, char **argv) {
  7. struct A *a = malloc(sizeof(struct A));
  8. printf("a->a = %d\n", a->a);
  9. return 0;
  10. }

Here we allocate memory for the A structure and tries to print value of the a field. If we will compile this program without additional options:

  1. gcc test.c -o test

The compiler will not show us warning that a filed is not unitialized. But if we will run this program with valgrind tool, we will see the following output:

  1. ~$ valgrind --leak-check=yes ./test
  2. ==28469== Memcheck, a memory error detector
  3. ==28469== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
  4. ==28469== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
  5. ==28469== Command: ./test
  6. ==28469==
  7. ==28469== Conditional jump or move depends on uninitialised value(s)
  8. ==28469== at 0x4E820EA: vfprintf (in /usr/lib64/libc-2.22.so)
  9. ==28469== by 0x4E88D48: printf (in /usr/lib64/libc-2.22.so)
  10. ==28469== by 0x4005B9: main (in /home/alex/test)
  11. ==28469==
  12. ==28469== Use of uninitialised value of size 8
  13. ==28469== at 0x4E7E0BB: _itoa_word (in /usr/lib64/libc-2.22.so)
  14. ==28469== by 0x4E8262F: vfprintf (in /usr/lib64/libc-2.22.so)
  15. ==28469== by 0x4E88D48: printf (in /usr/lib64/libc-2.22.so)
  16. ==28469== by 0x4005B9: main (in /home/alex/test)
  17. ...
  18. ...
  19. ...

Actually the kmemcheck mechanism does the same for the kernel, what the valgrind does for userspace programs. It check uninitialized memory.

To enable this mechanism in the Linux kernel, you need to enable the CONFIG_KMEMCHECK kernel configuration option in the:

  1. Kernel hacking
  2. -> Memory Debugging

menu of the Linux kernel configuration:

kernel configuration menu

We may not only enable support of the kmemcheck mechanism in the Linux kernel, but it also provides some configuration options for us. We will see all of these options in the next paragraph of this part. Last note before we will consider how does the kmemcheck check memory. Now this mechanism is implemented only for the x86_64 architecture. You can be sure if you will look in the arch/x86/Kconfig x86 related kernel configuration file, you will see following lines:

  1. config X86
  2. ...
  3. ...
  4. ...
  5. select HAVE_ARCH_KMEMCHECK
  6. ...
  7. ...
  8. ...

So, there is no anything which is specific for other architectures.

Ok, so we know that kmemcheck provides mechanism to check usage of uninitialized memory in the Linux kernel and how to enable it. How it does these checks? When the Linux kernel tries to allocate some memory i.e. something is called like this:

  1. struct my_struct *my_struct = kmalloc(sizeof(struct my_struct), GFP_KERNEL);

or in other words somebody wants to access a page, a page fault exception is generated. This is achieved by the fact that the kmemcheck marks memory pages as non-present (more about this you can read in the special part which is devoted to paging). If a page fault exception is occurred, the exception handler knows about it and in a case when the kmemcheck is enabled it transfers control to it. After the kmemcheck will finish its checks, the page will be marked as present and the interrupted code will be able to continue execution. There is little subtlety in this chain. When the first instruction of interrupted code will be executed, the kmemcheck will mark the page as non-present again. In this way next access to memory will be caught again.

We just considered the kmemcheck mechanism from theoretical side. Now let’s consider how it is implemented in the Linux kernel.

Implementation of the kmemcheck mechanism in the Linux kernel

So, now we know what is it kmemcheck and what it does in the Linux kernel. Time to see at its implementation in the Linux kernel. Implementation of the kmemcheck is split in two parts. The first is generic part is located in the mm/kmemcheck.c source code file and the second x86_64 architecture-specific part is located in the arch/x86/mm/kmemcheck directory.

Let’s start from the initialization of this mechanism. We already know that to enable the kmemcheck mechanism in the Linux kernel, we must enable the CONFIG_KMEMCHECK kernel configuration option. But besides this, we need to pass one of following parameters:

  • kmemcheck=0 (disabled)
  • kmemcheck=1 (enabled)
  • kmemcheck=2 (one-shot mode)

to the Linux kernel command line. The first two are clear, but the last needs a little explanation. This option switches the kmemcheck in a special mode when it will be turned off after detecting the first use of uninitialized memory. Actually this mode is enabled by default in the Linux kernel:

kernel configuration menu

We know from the seventh part of the chapter which describes initialization of the Linux kernel that the kernel command line is parsed during initialization of the Linux kernel in do_initcall_level, do_early_param functions. Actually the kmemcheck subsystem consists from two stages. The first stage is early. If we will look at the mm/kmemcheck.c source code file, we will see the param_kmemcheck function which is will be called during early command line parsing:

  1. static int __init param_kmemcheck(char *str)
  2. {
  3. int val;
  4. int ret;
  5. if (!str)
  6. return -EINVAL;
  7. ret = kstrtoint(str, 0, &val);
  8. if (ret)
  9. return ret;
  10. kmemcheck_enabled = val;
  11. return 0;
  12. }
  13. early_param("kmemcheck", param_kmemcheck);

As we already saw, the param_kmemcheck may have one of the following values: 0 (enabled), 1 (disabled) or 2 (one-shot). The implementation of the param_kmemcheck is pretty simple. We just convert string value of the kmemcheck command line option to integer representation and set it to the kmemcheck_enabled variable.

The second stage will be executed during initialization of the Linux kernel, rather during initialization of early initcalls. The second stage is represented by the kmemcheck_init:

  1. int __init kmemcheck_init(void)
  2. {
  3. ...
  4. ...
  5. ...
  6. }
  7. early_initcall(kmemcheck_init);

Main goal of the kmemcheck_init function is to call the kmemcheck_selftest function and check its result:

  1. if (!kmemcheck_selftest()) {
  2. printk(KERN_INFO "kmemcheck: self-tests failed; disabling\n");
  3. kmemcheck_enabled = 0;
  4. return -EINVAL;
  5. }
  6. printk(KERN_INFO "kmemcheck: Initialized\n");

and return with the EINVAL if this check is failed. The kmemcheck_selftest function checks sizes of different memory access related opcodes like rep movsb, movzwq and etc. If sizes of opcodes are equal to expected sizes, the kmemcheck_selftest will return true and false in other way.

So when the somebody will call:

  1. struct my_struct *my_struct = kmalloc(sizeof(struct my_struct), GFP_KERNEL);

through a series of different function calls the kmem_getpages function will be called. This function is defined in the mm/slab.c source code file and main goal of this function tries to allocate pages with the given flags. In the end of this function we can see following code:

  1. if (kmemcheck_enabled && !(cachep->flags & SLAB_NOTRACK)) {
  2. kmemcheck_alloc_shadow(page, cachep->gfporder, flags, nodeid);
  3. if (cachep->ctor)
  4. kmemcheck_mark_uninitialized_pages(page, nr_pages);
  5. else
  6. kmemcheck_mark_unallocated_pages(page, nr_pages);
  7. }

So, here we check that the if kmemcheck is enabled and the SLAB_NOTRACK bit is not set in flags we set non-present bit for the just allocated page. The SLAB_NOTRACK bit tell us to not track uninitialized memory. Additionally we check if a cache object has constructor (details will be considered in next parts) we mark allocated page as uninitialized or unallocated in other way. The kmemcheck_alloc_shadow function is defined in the mm/kmemcheck.c source code file and does following things:

  1. void kmemcheck_alloc_shadow(struct page *page, int order, gfp_t flags, int node)
  2. {
  3. struct page *shadow;
  4. shadow = alloc_pages_node(node, flags | __GFP_NOTRACK, order);
  5. for(i = 0; i < pages; ++i)
  6. page[i].shadow = page_address(&shadow[i]);
  7. kmemcheck_hide_pages(page, pages);
  8. }

First of all it allocates memory space for the shadow bits. If this bit is set in a page, this means that this page is tracked by the kmemcheck. After we allocated space for the shadow bit, we fill all allocated pages with this bit. In the end we just call the kmemcheck_hide_pages function with the pointer to the allocated page and number of these pages. The kmemcheck_hide_pages is architecture-specific function, so its implementation is located in the arch/x86/mm/kmemcheck/kmemcheck.c source code file. The main goal of this function is to set non-present bit in given pages. Let’s look at the implementation of this function:

  1. void kmemcheck_hide_pages(struct page *p, unsigned int n)
  2. {
  3. unsigned int i;
  4. for (i = 0; i < n; ++i) {
  5. unsigned long address;
  6. pte_t *pte;
  7. unsigned int level;
  8. address = (unsigned long) page_address(&p[i]);
  9. pte = lookup_address(address, &level);
  10. BUG_ON(!pte);
  11. BUG_ON(level != PG_LEVEL_4K);
  12. set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_PRESENT));
  13. set_pte(pte, __pte(pte_val(*pte) | _PAGE_HIDDEN));
  14. __flush_tlb_one(address);
  15. }
  16. }

Here we go through all pages and and tries to get page table entry for each page. If this operation was successful, we unset present bit and set hidden bit in each page. In the end we flush translation lookaside buffer, because some pages was changed. From this point allocated pages are tracked by the kmemcheck. Now, as present bit is unset, the page fault execution will be occurred right after the kmalloc will return pointer to allocated space and a code will try to access this memory.

As you may remember from the second part of the Linux kernel initialization chapter, the page fault handler is located in the arch/x86/mm/fault.c source code file and represented by the do_page_fault function. We can see following check from the beginning of the do_page_fault function:

  1. static noinline void
  2. __do_page_fault(struct pt_regs *regs, unsigned long error_code,
  3. unsigned long address)
  4. {
  5. ...
  6. ...
  7. ...
  8. if (kmemcheck_active(regs))
  9. kmemcheck_hide(regs);
  10. ...
  11. ...
  12. ...
  13. }

The kmemcheck_active gets kmemcheck_context per-cpu structure and return the result of comparison of the balance field of this structure with zero:

  1. bool kmemcheck_active(struct pt_regs *regs)
  2. {
  3. struct kmemcheck_context *data = this_cpu_ptr(&kmemcheck_context);
  4. return data->balance > 0;
  5. }

The kmemcheck_context is structure which describes current state of the kmemcheck mechanism. It stored uninitialized addresses, number of such addresses and etc. The balance field of this structure represents current state of the kmemcheck or in other words it can tell us did kmemcheck already hid pages or not yet. If the data->balance is greater than zero, the kmemcheck_hide function will be called. This means than kmemecheck already set present bit for given pages and now we need to hide pages again to cause next step to page fault. This function will hide addresses of pages again by unsetting of present bit. This means that one session of kmemcheck already finished and new page fault occurred. At the first step the kmemcheck_active will return false as the data->balance is zero for the start and the kmemcheck_hide will not be called. Next, we may see following line of code in the do_page_fault:

  1. if (kmemcheck_fault(regs, address, error_code))
  2. return;

First of all the kmemcheck_fault function checks that the fault was occured by the correct reason. At first we check the flags register and check that we are in normal kernel mode:

  1. if (regs->flags & X86_VM_MASK)
  2. return false;
  3. if (regs->cs != __KERNEL_CS)
  4. return false;

If these checks wasn’t successful we return from the kmemcheck_fault function as it was not kmemcheck related page fault. After this we try to lookup a page table entry related to the faulted address and if we can’t find it we return:

  1. pte = kmemcheck_pte_lookup(address);
  2. if (!pte)
  3. return false;

Last two steps of the kmemcheck_fault function is to call the kmemcheck_access function which check access to the given page and show addresses again by setting present bit in the given page. The kmemcheck_access function does all main job. It check current instruction which caused a page fault. If it will find an error, the context of this error will be saved by kmemcheck to the ring queue:

  1. static struct kmemcheck_error error_fifo[CONFIG_KMEMCHECK_QUEUE_SIZE];

The kmemcheck mechanism declares special tasklet:

  1. static DECLARE_TASKLET(kmemcheck_tasklet, &do_wakeup, 0);

which runs the do_wakeup function from the arch/x86/mm/kmemcheck/error.c source code file when it will be scheduled to run.

The do_wakeup function will call the kmemcheck_error_recall function which will print errors collected by kmemcheck. As we already saw the:

  1. kmemcheck_show(regs);

function will be called in the end of the kmemcheck_fault function. This function will set present bit for the given pages again:

  1. if (unlikely(data->balance != 0)) {
  2. kmemcheck_show_all();
  3. kmemcheck_error_save_bug(regs);
  4. data->balance = 0;
  5. return;
  6. }

Where the kmemcheck_show_all function calls the kmemcheck_show_addr for each address:

  1. static unsigned int kmemcheck_show_all(void)
  2. {
  3. struct kmemcheck_context *data = this_cpu_ptr(&kmemcheck_context);
  4. unsigned int i;
  5. unsigned int n;
  6. n = 0;
  7. for (i = 0; i < data->n_addrs; ++i)
  8. n += kmemcheck_show_addr(data->addr[i]);
  9. return n;
  10. }

by the call of the kmemcheck_show_addr:

  1. int kmemcheck_show_addr(unsigned long address)
  2. {
  3. pte_t *pte;
  4. pte = kmemcheck_pte_lookup(address);
  5. if (!pte)
  6. return 0;
  7. set_pte(pte, __pte(pte_val(*pte) | _PAGE_PRESENT));
  8. __flush_tlb_one(address);
  9. return 1;
  10. }

In the end of the kmemcheck_show function we set the TF flag if it wasn’t set:

  1. if (!(regs->flags & X86_EFLAGS_TF))
  2. data->flags = regs->flags;

We need to do it because we need to hide pages again after first executed instruction after a page fault will be handled. In a case when the TF flag, so the processor will switch into single-step mode after the first instruction will be executed. In this case debug exception will occurred. From this moment pages will be hidden again and execution will be continued. As pages hidden from this moment, page fault exception will occur again and kmemcheck continue to check/collect errors again and print them from time to time.

That’s all.

Conclusion

This is the end of the third part about linux kernel memory management. If you have questions or suggestions, ping me on twitter 0xAX, drop me an anotherworldofworld@gmail.com">email or just create an issue. In the next part we will see yet another memory debugging related tool - kmemleak.

Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me a PR to linux-insides.