Kernel initialization. Part 3.

Last preparations before the kernel entry point

This is the third part of the Linux kernel initialization process series. In the previous part we saw early interrupt and exception handling, and in this part we will continue to dive into the Linux kernel initialization process. Our next point is the ‘kernel entry point’: the start_kernel function from the init/main.c source code file. Yes, technically it is not the kernel’s entry point but the start of the generic kernel code, which does not depend on a particular architecture. But before we call the start_kernel function, we must make some preparations. So let’s continue.

boot_params again

In the previous part we stopped at setting up the Interrupt Descriptor Table and loading it into the IDTR register. The next step after this is a call to the copy_bootdata function:

    copy_bootdata(__va(real_mode_data));

This function takes one argument: the virtual address of real_mode_data. Remember that we passed the address of the boot_params structure from arch/x86/include/uapi/asm/bootparam.h to the x86_64_start_kernel function as the first argument in arch/x86/kernel/head_64.S:

    /* rsi is pointer to real mode structure with interesting info.
       pass it to C */
    movq %rsi, %rdi

Now let’s look at the __va macro. This macro is defined in arch/x86/include/asm/page.h:

    #define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET))

where PAGE_OFFSET is __PAGE_OFFSET, which is 0xffff880000000000, the base virtual address of the direct mapping of all physical memory. So we get the virtual address of the boot_params structure and pass it to the copy_bootdata function, where we copy real_mode_data to boot_params, which is declared in arch/x86/include/asm/setup.h:

    extern struct boot_params boot_params;
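To make the arithmetic behind __va concrete, here is a minimal user-space sketch. It treats addresses as plain 64-bit numbers instead of pointers. The names PAGE_OFFSET_SKETCH, va_of and pa_of are hypothetical helpers invented for illustration, not kernel APIs; only the constant 0xffff880000000000 comes from the text above.

```c
#include <assert.h>
#include <stdint.h>

/* Base of the direct mapping of physical memory, as quoted in the text. */
#define PAGE_OFFSET_SKETCH UINT64_C(0xffff880000000000)

/* Hypothetical stand-in for __va(): physical address -> direct-map virtual address. */
static uint64_t va_of(uint64_t phys)
{
    return phys + PAGE_OFFSET_SKETCH;
}

/* Hypothetical inverse, valid only for addresses inside the direct mapping. */
static uint64_t pa_of(uint64_t virt)
{
    return virt - PAGE_OFFSET_SKETCH;
}
```

With these helpers, va_of(0x1000) yields 0xffff880000001000, and pa_of applied to that value gives 0x1000 back.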

Let’s look at the copy_bootdata implementation:

    static void __init copy_bootdata(char *real_mode_data)
    {
        char * command_line;
        unsigned long cmd_line_ptr;

        memcpy(&boot_params, real_mode_data, sizeof boot_params);
        sanitize_boot_params(&boot_params);
        cmd_line_ptr = get_cmd_line_ptr();
        if (cmd_line_ptr) {
            command_line = __va(cmd_line_ptr);
            memcpy(boot_command_line, command_line, COMMAND_LINE_SIZE);
        }
    }

First of all, note that this function is declared with the __init prefix. It means that this function is used only during initialization and the memory it occupies will be freed afterwards.

We can see the declaration of two variables for the kernel command line and the copying of real_mode_data to boot_params with the memcpy function. Next comes the call to the sanitize_boot_params function, which zeroes some fields of the boot_params structure, such as ext_ramdisk_image, to protect against bootloaders which fail to initialize unknown fields in boot_params to zero. After this we get the address of the command line with a call to the get_cmd_line_ptr function:

    unsigned long cmd_line_ptr = boot_params.hdr.cmd_line_ptr;
    cmd_line_ptr |= (u64)boot_params.ext_cmd_line_ptr << 32;
    return cmd_line_ptr;

which builds the 64-bit address of the command line from two 32-bit fields, cmd_line_ptr in the kernel boot header and ext_cmd_line_ptr, and returns it. In the last step we check cmd_line_ptr, get its virtual address and copy the command line to boot_command_line, which is just an array of bytes:

    extern char __initdata boot_command_line[];
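The way get_cmd_line_ptr glues two 32-bit fields into one 64-bit address can be tried in isolation. The following is only a sketch: struct cmdline_fields and full_cmd_line_ptr are hypothetical names, and the structure keeps just the two fields mentioned above rather than the real boot_params layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reduced view of the two fields used by get_cmd_line_ptr. */
struct cmdline_fields {
    uint32_t cmd_line_ptr;      /* low 32 bits, from the boot header */
    uint32_t ext_cmd_line_ptr;  /* high 32 bits */
};

/* Combine the low and high halves into a single 64-bit physical address. */
static uint64_t full_cmd_line_ptr(const struct cmdline_fields *f)
{
    uint64_t ptr = f->cmd_line_ptr;

    /* The cast must happen before the shift, otherwise the high bits are lost. */
    ptr |= (uint64_t)f->ext_cmd_line_ptr << 32;
    return ptr;
}
```

For a low part of 0x00020000 and a high part of 0x1 the result is 0x100020000. When both fields are zero the result is zero, which is the case copy_bootdata skips.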

After this we have copies of the kernel command line and the boot_params structure. In the next step we can see a call to the load_ucode_bsp function, which loads processor microcode, but we will not look at it here.

After the microcode is loaded we can see the check of console_loglevel and the early_printk function, which prints the Kernel Alive string. But you’ll never see this output because early_printk is not initialized yet. It is a minor bug in the kernel; I sent the patch (commit) and you will see it in the mainline soon. So you can skip this code.

Move on init pages

In the next step, as we have copied the boot_params structure, we need to move from the early page tables to the page tables for the initialization process. We already set up early page tables for the switchover and then dropped them all in the reset_early_page_tables function, keeping only the kernel high mapping (you can read about both in the previous part). After this we call the clear_page function:

    clear_page(init_level4_pgt);

and pass it init_level4_pgt, which is also defined in arch/x86/kernel/head_64.S and looks like this:

    NEXT_PAGE(init_level4_pgt)
        .quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
        .org    init_level4_pgt + L4_PAGE_OFFSET*8, 0
        .quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
        .org    init_level4_pgt + L4_START_KERNEL*8, 0
        .quad   level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE

which maps the first 2 gigabytes and 512 megabytes for the kernel code, data and bss. The clear_page function is defined in arch/x86/lib/clear_page_64.S. Let’s look at this function:

    ENTRY(clear_page)
        CFI_STARTPROC
        xorl %eax,%eax
        movl $4096/64,%ecx
        .p2align 4
    .Lloop:
        decl %ecx
    #define PUT(x) movq %rax,x*8(%rdi)
        movq %rax,(%rdi)
        PUT(1)
        PUT(2)
        PUT(3)
        PUT(4)
        PUT(5)
        PUT(6)
        PUT(7)
        leaq 64(%rdi),%rdi
        jnz .Lloop
        nop
        ret
        CFI_ENDPROC
    .Lclear_page_end:
    ENDPROC(clear_page)

As you can understand from the function name, it clears a page, that is, fills it with zeros; here the page holds page table entries. First of all note that this function starts with CFI_STARTPROC and ends with CFI_ENDPROC, which expand to GNU assembler directives:

    #define CFI_STARTPROC .cfi_startproc
    #define CFI_ENDPROC .cfi_endproc

and are used for debugging. After the CFI_STARTPROC macro we zero out the eax register and put 64 into ecx (it will be the counter). Next comes the loop, which starts at the .Lloop label and begins by decrementing ecx. After that we store zero from the rax register at the address in rdi, which currently contains the base address of init_level4_pgt, and do the same seven more times, each time with the offset from rdi increased by 8. After this the first 64 bytes of init_level4_pgt are filled with zeros. In the next step we advance rdi by 64 bytes with leaq and repeat all operations until ecx reaches zero. Since neither movq nor leaq changes the flags, jnz still tests the result of the decl at the top of the loop. In the end init_level4_pgt is filled with zeros.
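The same loop structure can be written in C to see the counting more clearly. This is a user-space sketch of the idea, not the kernel’s implementation: clear_page_sketch and SKETCH_PAGE_SIZE are hypothetical names, and a compiler is free to generate quite different instructions from it.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096

/* Zero one 4096-byte page in 64 iterations of eight 8-byte stores each. */
static void clear_page_sketch(void *page)
{
    uint64_t *p = page;
    int count = SKETCH_PAGE_SIZE / 64;  /* 64, like the value loaded into ecx */

    do {
        /* Eight stores at offsets 0, 8, ..., 56: the movq plus PUT(1)..PUT(7). */
        for (int i = 0; i < 8; i++)
            p[i] = 0;
        p += 8;                         /* advance by 64 bytes, like leaq 64(%rdi),%rdi */
    } while (--count != 0);             /* like decl %ecx followed by jnz .Lloop */
}
```

The outer loop runs 64 times and each pass clears 64 bytes, which together cover the whole 4096-byte page.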

Now that init_level4_pgt is filled with zeros, we set its last entry to the kernel high mapping with:

    init_level4_pgt[511] = early_level4_pgt[511];

Remember that we dropped all early_level4_pgt entries in the reset_early_page_tables function and kept only the kernel high mapping there.
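Why index 511? A short sketch can show where a virtual address lands in the top-level table. It assumes 4-level paging, where each top-level entry covers 2^39 bytes and the table has 512 entries, and it assumes that __START_KERNEL_map is 0xffffffff80000000; that value comes from the kernel sources of that era, not from the text above. The names pgd_index_sketch, PGDIR_SHIFT_SKETCH and PTRS_PER_PGD_SKETCH are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PGDIR_SHIFT_SKETCH  39   /* assumed: each top-level entry maps 2^39 bytes */
#define PTRS_PER_PGD_SKETCH 512  /* assumed: 512 entries in the top-level table */

/* Index of the top-level page table entry that covers a virtual address. */
static unsigned pgd_index_sketch(uint64_t vaddr)
{
    return (unsigned)((vaddr >> PGDIR_SHIFT_SKETCH) & (PTRS_PER_PGD_SKETCH - 1));
}
```

Under these assumptions pgd_index_sketch(0xffffffff80000000) is 511, the last entry, and pgd_index_sketch(0xffff880000000000), the direct mapping base, is 272.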

The last step in the x86_64_start_kernel function is the call to:

    x86_64_start_reservations(real_mode_data);

with real_mode_data as its argument. The x86_64_start_reservations function is defined in the same source code file as the x86_64_start_kernel function and looks like this:

    void __init x86_64_start_reservations(char *real_mode_data)
    {
        if (!boot_params.hdr.version)
            copy_bootdata(__va(real_mode_data));

        reserve_ebda_region();

        start_kernel();
    }

You can see that it is the last function before we reach the kernel entry point, the start_kernel function. Let’s look at what it does and how it works.

Last step before kernel entry point

First of all we can see in the x86_64_start_reservations function the check for boot_params.hdr.version:

    if (!boot_params.hdr.version)
        copy_bootdata(__va(real_mode_data));

and if it is zero we call the copy_bootdata function again with the virtual address of real_mode_data (see its implementation above).

In the next step we can see the call to the reserve_ebda_region function, which is defined in arch/x86/kernel/head.c. This function reserves a memory block for the EBDA, or Extended BIOS Data Area. The Extended BIOS Data Area is located at the top of conventional memory and contains data about ports, disk parameters, etc.

Let’s look at the reserve_ebda_region function. It starts by checking whether paravirtualization is enabled:

    if (paravirt_enabled())
        return;

We exit from the reserve_ebda_region function if paravirtualization is enabled, because in that case the extended BIOS data area is absent. In the next step we need to get the end of low memory:

    lowmem = *(unsigned short *)__va(BIOS_LOWMEM_KILOBYTES);
    lowmem <<= 10;

We take the virtual address of BIOS_LOWMEM_KILOBYTES, read from it the 16-bit size of low memory in kilobytes, and convert it to bytes by shifting it left by 10 (multiplying by 1024, in other words). After this we need to get the address of the extended BIOS data area with:

    ebda_addr = get_bios_ebda();

where the get_bios_ebda function is defined in arch/x86/include/asm/bios_ebda.h and looks like this:

    static inline unsigned int get_bios_ebda(void)
    {
        unsigned int address = *(unsigned short *)phys_to_virt(0x40E);
        address <<= 4;
        return address;
    }

Let’s try to understand how it works. Here we convert the physical address 0x40E to a virtual one; 0x0040:0x000e is the real-mode segment:offset address of the word which contains the segment of the extended BIOS data area. Don’t worry that we use the phys_to_virt function to convert a physical address to a virtual address. Previously we used the __va macro for the same purpose, but phys_to_virt is the same:

    static inline void *phys_to_virt(phys_addr_t address)
    {
        return __va(address);
    }

with only one difference: we pass the argument as a phys_addr_t, which depends on CONFIG_PHYS_ADDR_T_64BIT:

    #ifdef CONFIG_PHYS_ADDR_T_64BIT
    typedef u64 phys_addr_t;
    #else
    typedef u32 phys_addr_t;
    #endif

On x86_64 CONFIG_PHYS_ADDR_T_64BIT is enabled, so phys_addr_t is u64. After we get the virtual address of the word which stores the segment of the extended BIOS data area, we read it, shift it left by 4 and return the result. After this the ebda_addr variable contains the base address of the extended BIOS data area.
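The three conversions used so far (segment:offset to a linear address, a real-mode segment to a byte address, and kilobytes to bytes) are simple shifts. Here is a small sketch of them; the function names are hypothetical, and the sample values in the note below are typical illustrations rather than values read from a real BIOS.

```c
#include <assert.h>
#include <stdint.h>

/* Real-mode segment:offset pair -> linear (physical) address. */
static uint32_t linear_from_seg_off(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}

/* EBDA segment word -> EBDA base address in bytes (the "address <<= 4" step). */
static uint32_t ebda_addr_from_segment(uint16_t segment)
{
    return (uint32_t)segment << 4;
}

/* Low memory size in kilobytes -> bytes (the "lowmem <<= 10" step). */
static uint32_t lowmem_bytes_from_kb(uint16_t kilobytes)
{
    return (uint32_t)kilobytes << 10;
}
```

For instance, 0x0040:0x000e becomes 0x40E, a hypothetical EBDA segment of 0x9FC0 becomes 0x9FC00, and 639 kilobytes of low memory also becomes 0x9FC00 bytes.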

In the next step we check whether the address of the extended BIOS data area or the end of low memory is less than the INSANE_CUTOFF macro, and if so replace it with LOWMEM_CAP:

    if (ebda_addr < INSANE_CUTOFF)
        ebda_addr = LOWMEM_CAP;

    if (lowmem < INSANE_CUTOFF)
        lowmem = LOWMEM_CAP;

where INSANE_CUTOFF is:

    #define INSANE_CUTOFF 0x20000U

or 128 kilobytes. In the last step we take the lower of the low memory end and the extended BIOS data area address, and call the memblock_reserve function, which reserves the memory region between that address and the one megabyte mark:

    lowmem = min(lowmem, ebda_addr);
    lowmem = min(lowmem, LOWMEM_CAP);
    memblock_reserve(lowmem, 0x100000 - lowmem);
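The clamping logic above can be collected into one small function to see which range ends up reserved. This is a user-space sketch under stated assumptions: struct reserved_range and ebda_reservation are hypothetical names, and the value 0x9f000 for LOWMEM_CAP is taken from the kernel sources of that era rather than from the text above.

```c
#include <assert.h>
#include <stdint.h>

#define INSANE_CUTOFF 0x20000U   /* 128 kilobytes, as in the text */
#define LOWMEM_CAP    0x9f000U   /* assumed value, not quoted in the text */
#define ONE_MEGABYTE  0x100000U

/* Hypothetical result type: the range that would be passed to memblock_reserve. */
struct reserved_range {
    uint32_t base;
    uint32_t size;
};

static uint32_t min_u32(uint32_t a, uint32_t b)
{
    return a < b ? a : b;
}

/* Mirror the sanity checks and min() steps of reserve_ebda_region. */
static struct reserved_range ebda_reservation(uint32_t lowmem, uint32_t ebda_addr)
{
    struct reserved_range r;

    /* Implausibly small values are replaced with the cap. */
    if (ebda_addr < INSANE_CUTOFF)
        ebda_addr = LOWMEM_CAP;
    if (lowmem < INSANE_CUTOFF)
        lowmem = LOWMEM_CAP;

    /* Start at the lowest of the three candidates. */
    lowmem = min_u32(lowmem, ebda_addr);
    lowmem = min_u32(lowmem, LOWMEM_CAP);

    r.base = lowmem;
    r.size = ONE_MEGABYTE - lowmem;  /* everything up to the 1 megabyte mark */
    return r;
}
```

With both inputs equal to 0x9FC00 the range starts at 0x9f000 and is 0x61000 bytes long; with a zero lowmem and an EBDA at 0x80000 it starts at 0x80000 and is 0x80000 bytes long.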

The memblock_reserve function is defined in mm/memblock.c and takes two parameters:

  • base physical address;
  • region size.

and reserves the memory region with the given base address and size. memblock_reserve is the first function in this book from the Linux kernel memory manager framework. We will take a closer look at the memory manager soon, but for now let’s look at its implementation.

First touch of the linux kernel memory manager framework

In the previous paragraph we stopped at the call to the memblock_reserve function and, as I said before, it is the first function from the memory manager framework. Let’s try to understand how it works. The memblock_reserve function just calls the:

    memblock_reserve_region(base, size, MAX_NUMNODES, 0);

function and passes 4 parameters to it:

  • physical base address of the memory region;
  • size of the memory region;
  • maximum number of numa nodes;
  • flags.

At the start of the memblock_reserve_region body we can see the definition of a pointer to the memblock_type structure:

    struct memblock_type *_rgn = &memblock.reserved;

which represents the type of the memory block and looks like this:

    struct memblock_type {
        unsigned long cnt;
        unsigned long max;
        phys_addr_t total_size;
        struct memblock_region *regions;
    };

As we need to reserve a memory block for the extended BIOS data area, the type of the current memory region is reserved, where the memblock structure is:

    struct memblock {
        bool bottom_up;
        phys_addr_t current_limit;
        struct memblock_type memory;
        struct memblock_type reserved;
    #ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
        struct memblock_type physmem;
    #endif
    };

and describes a generic memory block. You can see that we initialize _rgn by assigning it the address of memblock.reserved. memblock is a global variable which looks like this:

    struct memblock memblock __initdata_memblock = {
        .memory.regions = memblock_memory_init_regions,
        .memory.cnt = 1,
        .memory.max = INIT_MEMBLOCK_REGIONS,
        .reserved.regions = memblock_reserved_init_regions,
        .reserved.cnt = 1,
        .reserved.max = INIT_MEMBLOCK_REGIONS,
    #ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
        .physmem.regions = memblock_physmem_init_regions,
        .physmem.cnt = 1,
        .physmem.max = INIT_PHYSMEM_REGIONS,
    #endif
        .bottom_up = false,
        .current_limit = MEMBLOCK_ALLOC_ANYWHERE,
    };

We will not dive into the details of this variable here, but we will see all of them in the parts about the memory manager. Just note that the memblock variable is defined with __initdata_memblock, which is:

    #define __initdata_memblock __meminitdata

and __meminitdata is:

    #define __meminitdata __section(.meminit.data)

From this we can conclude that all memory blocks will be in the .meminit.data section. After we define _rgn we print information about it with the memblock_dbg macro. You can enable this output by passing memblock=debug on the kernel command line.

After the debugging lines are printed, next is the call to the following function:

    memblock_add_range(_rgn, base, size, nid, flags);

which adds a new memory block region to the .meminit.data section. As we did not initialize _rgn with anything other than &memblock.reserved, we just fill the first region of the passed _rgn with the base address of the extended BIOS data area region, the size of this region and the flags:

    if (type->regions[0].size == 0) {
        WARN_ON(type->cnt != 1 || type->total_size);
        type->regions[0].base = base;
        type->regions[0].size = size;
        type->regions[0].flags = flags;
        memblock_set_region_node(&type->regions[0], nid);
        type->total_size = size;
        return 0;
    }

After we fill our region we can see the call to the memblock_set_region_node function with two parameters:

  • address of the filled memory region;
  • NUMA node id.

where our regions are represented by the memblock_region structure:

    struct memblock_region {
        phys_addr_t base;
        phys_addr_t size;
        unsigned long flags;
    #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
        int nid;
    #endif
    };
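To see the "first region" path in action outside the kernel, the structures above can be reduced to a small user-space model. This is only a sketch: region_sketch, type_sketch and add_first_region are hypothetical names, phys_addr_t is replaced with uint64_t, and every case other than an empty first region is simply rejected, unlike the real memblock_add_range.

```c
#include <assert.h>
#include <stdint.h>

/* Reduced stand-in for struct memblock_region. */
struct region_sketch {
    uint64_t base;
    uint64_t size;
    unsigned long flags;
    int nid;
};

/* Reduced stand-in for struct memblock_type. */
struct type_sketch {
    unsigned long cnt;
    unsigned long max;
    uint64_t total_size;
    struct region_sketch *regions;
};

/* Fill the empty first slot; return -1 for any case this sketch does not model. */
static int add_first_region(struct type_sketch *type, uint64_t base,
                            uint64_t size, int nid, unsigned long flags)
{
    if (type->regions[0].size != 0)
        return -1;

    type->regions[0].base = base;
    type->regions[0].size = size;
    type->regions[0].flags = flags;
    type->regions[0].nid = nid;   /* the memblock_set_region_node step */
    type->total_size = size;
    return 0;                     /* cnt stays 1: the placeholder slot is reused */
}
```

Note that cnt is not incremented here: the initial value of 1 already counts the empty placeholder slot that the first reservation fills in.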

The NUMA node id depends on the MAX_NUMNODES macro, which is defined in include/linux/numa.h:

    #define MAX_NUMNODES (1 << NODES_SHIFT)

where NODES_SHIFT depends on the CONFIG_NODES_SHIFT configuration parameter and is defined as:

    #ifdef CONFIG_NODES_SHIFT
    #define NODES_SHIFT CONFIG_NODES_SHIFT
    #else
    #define NODES_SHIFT 0
    #endif

The memblock_set_region_node function just fills the nid field of memblock_region with the given value:

    static inline void memblock_set_region_node(struct memblock_region *r, int nid)
    {
        r->nid = nid;
    }

After this we have the first reserved memblock for the extended BIOS data area in the .meminit.data section. The reserve_ebda_region function has finished its work at this step and we can go back to arch/x86/kernel/head64.c.

We have finished all preparations before the kernel entry point! The last step in the x86_64_start_reservations function is the call to the:

    start_kernel()

function from the init/main.c file.

That’s all for this part.

Conclusion

It is the end of the third part about Linux kernel insides. In the next part we will see the first initialization steps in the kernel entry point, the start_kernel function. It will be the first step before we see the launch of the first init process.

If you have any questions or suggestions, write me a comment or ping me on twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-insides.