Misc - Linkers - 《Linux内核揭秘(Linux Inside 英文版)》

Introduction
Linking process
Relocation
GNU linker
Useful command line options of the GNU linker
Control Language linker
Conclusion
Links

Introduction

During the writing of the linux-insides book I have received many emails with questions related to the linker script and linker-related subjects. So I’ve decided to write this to cover some aspects of the linker and the linking of object files.

If we open the Linker page on Wikipedia, we will see following definition:

In computer science, a linker or link editor is a computer program that takes one or more object files generated by a compiler and combines them into a single executable file, library file, or another object file.

If you’ve written at least one program on C in your life, you will have seen files with the *.o extension. These files are object files. Object files are blocks of machine code and data with placeholder addresses that reference data and functions in other object files or libraries, as well as a list of its own functions and data. The main purpose of the linker is collect/handle the code and data of each object file, turning it into the final executable file or library. In this post we will try to go through all aspects of this process. Let’s start.

Linking process

Let’s create a simple project with the following structure:

*-linkers
*--main.c
*--lib.c
*--lib.h

Our main.c source code file contains:

#include <stdio.h>
#include "lib.h"
int main(int argc, char **argv) {
    printf("factorial of 5 is: %d\n", factorial(5));
    return 0;
}

The lib.c file contains:

int factorial(int base) {
    int res,i = 1;
    if (base == 0) {
        return 1;
    }
    while (i <= base) {
        res *= i;
        i++;
    }
    return res;
}

And the lib.h file contains:

#ifndef LIB_H
#define LIB_H
int factorial(int base);
#endif

Now let’s compile only the main.c source code file with:

$ gcc -c main.c

If we look inside the outputted object file with the nm util, we will see the
following output:

$ nm -A main.o
main.o:                 U factorial
main.o:0000000000000000 T main
main.o:                 U printf

The nm util allows us to see the list of symbols from the given object file. It consists of three columns: the first is the name of the given object file and the address of any resolved symbols. The second column contains a character that represents the status of the given symbol. In this case the U means undefined and the T denotes that the symbols are placed in the .text section of the object. The nm utility shows us here that we have three symbols in the main.c source code file:

factorial - the factorial function defined in the lib.c source code file. It is marked as undefined here because we compiled only the main.c source code file, and it does not know anything about code from the lib.c file for now;
main - the main function;
printf - the function from the glibc library. main.c does not know anything about it for now either.

What can we understand from the output of nm so far? The main.o object file contains the local symbol main at address 0000000000000000 (it will be filled with correct address after is is linked), and two unresolved symbols. We can see all of this information in the disassembly output of the main.o object file:

$ objdump -S main.o
main.o:     file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <main>:
   0:    55                       push   %rbp
   1:    48 89 e5                 mov    %rsp,%rbp
   4:    48 83 ec 10              sub    $0x10,%rsp
   8:    89 7d fc                 mov    %edi,-0x4(%rbp)
   b:    48 89 75 f0              mov    %rsi,-0x10(%rbp)
   f:    bf 05 00 00 00           mov    $0x5,%edi
  14:    e8 00 00 00 00           callq  19 <main+0x19>
  19:    89 c6                    mov    %eax,%esi
  1b:    bf 00 00 00 00           mov    $0x0,%edi
  20:    b8 00 00 00 00           mov    $0x0,%eax
  25:    e8 00 00 00 00           callq  2a <main+0x2a>
  2a:    b8 00 00 00 00           mov    $0x0,%eax
  2f:    c9                       leaveq 
  30:    c3                       retq

Here we are interested only in the two callq operations. The two callq operations contain linker stubs, or the function name and offset from it to the next instruction. These stubs will be updated to the real addresses of the functions. We can see these functions’ names with in the following objdump output:

$ objdump -S -r main.o
...
  14:    e8 00 00 00 00           callq  19 <main+0x19>
  15: R_X86_64_PC32                   factorial-0x4
  19:    89 c6                    mov    %eax,%esi
...
  25:    e8 00 00 00 00           callq  2a <main+0x2a>
  26:   R_X86_64_PC32                   printf-0x4
  2a:    b8 00 00 00 00           mov    $0x0,%eax
...

The -r or --reloc flags of the objdump util print the relocation entries of the file. Now let’s look in more detail at the relocation process.

Relocation

Relocation is the process of connecting symbolic references with symbolic definitions. Let’s look at the previous snippet from the objdump output:

  14:    e8 00 00 00 00           callq  19 <main+0x19>
  15:   R_X86_64_PC32                   factorial-0x4
  19:    89 c6                    mov    %eax,%esi

Note the e8 00 00 00 00 on the first line. The e8 is the opcode of the call, and the remainder of the line is a relative offset. So the e8 00 00 00 00 contains a one-byte operation code followed by a four-byte address. Note that the 00 00 00 00 is 4-bytes. Why only 4-bytes if an address can be 8-bytes in a x86_64 (64-bit) machine? Actually we compiled the main.c source code file with the -mcmodel=small! From the gcc man page:

-mcmodel=small
Generate code for the small code model: the program and its symbols must be linked in the lower 2 GB of the address space. Pointers are 64 bits. Programs can be statically or dynamically linked. This is the default code model.

Of course we didn’t pass this option to the gcc when we compiled the main.c, but it is the default. We know that our program will be linked in the lower 2 GB of the address space from the gcc manual extract above. Four bytes is therefore enough for this. So we have opcode of the call instruction and an unknown address. When we compile main.c with all its dependencies to an executable file, and then look at the factorial call we see:

$ gcc main.c lib.c -o factorial | objdump -S factorial | grep factorial
factorial:     file format elf64-x86-64
...
...
0000000000400506 <main>:
    40051a:    e8 18 00 00 00           callq  400537 <factorial>
...
...
0000000000400537 <factorial>:
    400550:    75 07                    jne    400559 <factorial+0x22>
    400557:    eb 1b                    jmp    400574 <factorial+0x3d>
    400559:    eb 0e                    jmp    400569 <factorial+0x32>
    40056f:    7e ea                    jle    40055b <factorial+0x24>
...
...

As we can see in the previous output, the address of the main function is 0x0000000000400506. Why it does not start from 0x0? You may already know that standard C programs are linked with the glibc C standard library (assuming the -nostdlib was not passed to the gcc). The compiled code for a program includes constructor functions to initialize data in the program when the program is started. These functions need to be called before the program is started, or in another words before the main function is called. To make the initialization and termination functions work, the compiler must output something in the assembler code to cause those functions to be called at the appropriate time. Execution of this program will start from the code placed in the special .init section. We can see this in the beginning of the objdump output:

objdump -S factorial | less
factorial:     file format elf64-x86-64
Disassembly of section .init:
00000000004003a8 <_init>:
  4003a8:       48 83 ec 08             sub    $0x8,%rsp
  4003ac:       48 8b 05 a5 05 20 00    mov    0x2005a5(%rip),%rax        # 600958 <_DYNAMIC+0x1d0>

Not that it starts at the 0x00000000004003a8 address relative to the glibc code. We can check it also in the ELF output by running readelf:

$ readelf -d factorial | grep \(INIT\)
 0x000000000000000c (INIT)               0x4003a8

So, the address of the main function is 0000000000400506 and is offset from the .init section. As we can see from the output, the address of the factorial function is 0x0000000000400537 and binary code for the call of the factorial function now is e8 18 00 00 00. We already know that e8 is opcode for the call instruction, the next 18 00 00 00 (note that address represented as little endian for x86_64, so it is 00 00 00 18) is the offset from the callq to the factorial function:

>>> hex(0x40051a + 0x18 + 0x5) == hex(0x400537)
True

So we add 0x18 and 0x5 to the address of the call instruction. The offset is measured from the address of the following instruction. Our call instruction is 5-bytes long (e8 18 00 00 00) and the 0x18 is the offset of the call after the factorial function. A compiler generally creates each object file with the program addresses starting at zero. But if a program is created from multiple object files, these will overlap.

What we have seen in this section is the relocation process. This process assigns load addresses to the various parts of the program, adjusting the code and data in the program to reflect the assigned addresses.

Ok, now that we know a little about linkers and relocation it is time to learn more about linkers by linking our object files.

GNU linker

As you can understand from the title, I will use GNU linker or just ld in this post. Of course we can use gcc to link our factorial project:

$ gcc main.c lib.o -o factorial

and after it we will get executable file - factorial as a result:

./factorial 
factorial of 5 is: 120

But gcc does not link object files. Instead it uses collect2 which is just wrapper for the GNU ld linker:

~$ /usr/lib/gcc/x86_64-linux-gnu/4.9/collect2 --version
collect2 version 4.9.3
/usr/bin/ld --version
GNU ld (GNU Binutils for Debian) 2.25
...
...
...

Ok, we can use gcc and it will produce executable file of our program for us. But let’s look how to use GNU ld linker for the same purpose. First of all let’s try to link these object files with the following example:

ld main.o lib.o -o factorial

Try to do it and you will get following error:

$ ld main.o lib.o -o factorial
ld: warning: cannot find entry symbol _start; defaulting to 00000000004000b0
main.o: In function `main':
main.c:(.text+0x26): undefined reference to `printf'

Here we can see two problems:

Linker can’t find _start symbol;
Linker does not know anything about printf function.

First of all let’s try to understand what is this _start entry symbol that appears to be required for our program to run? When I started to learn programming I learned that the main function is the entry point of the program. I think you learned this too :) But it actually isn’t the entry point, it’s _start instead. The _start symbol is defined in the crt1.o object file. We can find it with the following command:

$ objdump -S /usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crt1.o
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crt1.o:     file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <_start>:
   0:    31 ed                    xor    %ebp,%ebp
   2:    49 89 d1                 mov    %rdx,%r9
   ...
   ...
   ...

We pass this object file to the ld command as its first argument (see above). Now let’s try to link it and will look on result:

ld /usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crt1.o \
main.o lib.o -o factorial
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crt1.o: In function `_start':
/tmp/buildd/glibc-2.19/csu/../sysdeps/x86_64/start.S:115: undefined reference to `__libc_csu_fini'
/tmp/buildd/glibc-2.19/csu/../sysdeps/x86_64/start.S:116: undefined reference to `__libc_csu_init'
/tmp/buildd/glibc-2.19/csu/../sysdeps/x86_64/start.S:122: undefined reference to `__libc_start_main'
main.o: In function `main':
main.c:(.text+0x26): undefined reference to `printf'

Unfortunately we will see even more errors. We can see here old error about undefined printf and yet another three undefined references:

__libc_csu_fini
__libc_csu_init
__libc_start_main

The _start symbol is defined in the sysdeps/x86_64/start.S assembly file in the glibc source code. We can find following assembly code lines there:

mov $__libc_csu_fini, %R8_LP
mov $__libc_csu_init, %RCX_LP
...
call __libc_start_main

Here we pass address of the entry point to the .init and .fini section that contain code that starts to execute when the program is ran and the code that executes when program terminates. And in the end we see the call of the main function from our program. These three symbols are defined in the csu/elf-init.c source code file. The following two object files:

crtn.o;
crti.o.

define the function prologs/epilogs for the .init and .fini sections (with the _init and _fini symbols respectively).

The crtn.o object file contains these .init and .fini sections:

$ objdump -S /usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crtn.o
0000000000000000 <.init>:
   0:    48 83 c4 08              add    $0x8,%rsp
   4:    c3                       retq   
Disassembly of section .fini:
0000000000000000 <.fini>:
   0:    48 83 c4 08              add    $0x8,%rsp
   4:    c3                       retq

And the crti.o object file contains the _init and _fini symbols. Let’s try to link again with these two object files:

$ ld \
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crt1.o \
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crti.o \
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crtn.o main.o lib.o \
-o factorial

And anyway we will get the same errors. Now we need to pass -lc option to the ld. This option will search for the standard library in the paths present in the $LD_LIBRARY_PATH environment variable. Let’s try to link again wit the -lc option:

$ ld \
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crt1.o \
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crti.o \
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crtn.o main.o lib.o -lc \
-o factorial

Finally we get an executable file, but if we try to run it, we will get strange results:

$ ./factorial 
bash: ./factorial: No such file or directory

What’s the problem here? Let’s look on the executable file with the readelf util:

$ readelf -l factorial 
Elf file type is EXEC (Executable file)
Entry point 0x4003c0
There are 7 program headers, starting at offset 64
Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  PHDR           0x0000000000000040 0x0000000000400040 0x0000000000400040
                 0x0000000000000188 0x0000000000000188  R E    8
  INTERP         0x00000000000001c8 0x00000000004001c8 0x00000000004001c8
                 0x000000000000001c 0x000000000000001c  R      1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
  LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
                 0x0000000000000610 0x0000000000000610  R E    200000
  LOAD           0x0000000000000610 0x0000000000600610 0x0000000000600610
                 0x00000000000001cc 0x00000000000001cc  RW     200000
  DYNAMIC        0x0000000000000610 0x0000000000600610 0x0000000000600610
                 0x0000000000000190 0x0000000000000190  RW     8
  NOTE           0x00000000000001e4 0x00000000004001e4 0x00000000004001e4
                 0x0000000000000020 0x0000000000000020  R      4
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     10
 Section to Segment mapping:
  Segment Sections...
   00     
   01     .interp 
   02     .interp .note.ABI-tag .hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt .init .plt .text .fini .rodata .eh_frame 
   03     .dynamic .got .got.plt .data 
   04     .dynamic 
   05     .note.ABI-tag 
   06

Note on the strange line:

  INTERP         0x00000000000001c8 0x00000000004001c8 0x00000000004001c8
                 0x000000000000001c 0x000000000000001c  R      1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]

The .interp section in the elf file holds the path name of a program interpreter or in another words the .interp section simply contains an ascii string that is the name of the dynamic linker. The dynamic linker is the part of Linux that loads and links shared libraries needed by an executable when it is executed, by copying the content of libraries from disk to RAM. As we can see in the output of the readelf command it is placed in the /lib64/ld-linux-x86-64.so.2 file for the x86_64 architecture. Now let’s add the -dynamic-linker option with the path of ld-linux-x86-64.so.2 to the ld call and will see the following results:

$ gcc -c main.c lib.c
$ ld \
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crt1.o \
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crti.o \
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crtn.o main.o lib.o \
-dynamic-linker /lib64/ld-linux-x86-64.so.2 \
-lc -o factorial

Now we can run it as normal executable file:

$ ./factorial
factorial of 5 is: 120

It works! With the first line we compile the main.c and the lib.c source code files to object files. We will get the main.o and the lib.o after execution of the gcc:

$ file lib.o main.o
lib.o:  ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), not stripped
main.o: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), not stripped

and after this we link object files of our program with the needed system object files and libraries. We just saw a simple example of how to compile and link a C program with the gcc compiler and GNU ld linker. In this example we have used a couple command line options of the GNU linker, but it supports much more command line options than -o, -dynamic-linker, etc… Moreover GNU ld has its own language that allows to control the linking process. In the next two paragraphs we will look into it.

Useful command line options of the GNU linker

As I already wrote and as you can see in the manual of the GNU linker, it has big set of the command line options. We’ve seen a couple of options in this post: -o <output> - that tells ld to produce an output file called output as the result of linking, -l<name> that adds the archive or object file specified by the name, -dynamic-linker that specifies the name of the dynamic linker. Of course ld supports much more command line options, let’s look at some of them.

The first useful command line option is @file. In this case the file specifies filename where command line options will be read. For example we can create file with the name linker.ld, put there our command line arguments from the previous example and execute it with:

$ ld @linker.ld

The next command line option is -b or --format. This command line option specifies format of the input object files ELF, DJGPP/COFF and etc. There is a command line option for the same purpose but for the output file: --oformat=output-format.

The next command line option is --defsym. Full format of this command line option is the --defsym=symbol=expression. It allows to create global symbol in the output file containing the absolute address given by expression. We can find following case where this command line option can be useful: in the Linux kernel source code and more precisely in the Makefile that is related to the kernel decompression for the ARM architecture - arch/arm/boot/compressed/Makefile, we can find following definition:

LDFLAGS_vmlinux = --defsym _kernel_bss_size=$(KBSS_SZ)

As we already know, it defines the _kernel_bss_size symbol with the size of the .bss section in the output file. This symbol will be used in the first assembly file that will be executed during kernel decompressing:

ldr r5, =_kernel_bss_size

The next command line options is the -shared that allows us to create shared library. The -M or -map <filename> command line option prints the linking map with the information about symbols. In our case:

$ ld -M @linker.ld
...
...
...
.text           0x00000000004003c0      0x112
 *(.text.unlikely .text.*_unlikely .text.unlikely.*)
 *(.text.exit .text.exit.*)
 *(.text.startup .text.startup.*)
 *(.text.hot .text.hot.*)
 *(.text .stub .text.* .gnu.linkonce.t.*)
 .text          0x00000000004003c0       0x2a /usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crt1.o
...
...
...
 .text          0x00000000004003ea       0x31 main.o
                0x00000000004003ea                main
 .text          0x000000000040041b       0x3f lib.o
                0x000000000040041b                factorial

Of course the GNU linker support standard command line options: --help and --version that print common help of the usage of the ld and its version. That’s all about command line options of the GNU linker. Of course it is not the full set of command line options supported by the ld util. You can find the complete documentation of the ld util in the manual.

Control Language linker

As I wrote previously, ld has support for its own language. It accepts Linker Command Language files written in a superset of AT&T’s Link Editor Command Language syntax, to provide explicit and total control over the linking process. Let’s look on its details.

With the linker language we can control:

input files;
output files;
file formats
addresses of sections;
etc…

Commands written in the linker control language are usually placed in a file called linker script. We can pass it to ld with the -T command line option. The main command in a linker script is the SECTIONS command. Each linker script must contain this command and it determines the map of the output file. The special variable . contains current position of the output. Let’s write a simple assembly program and we will look at how we can use a linker script to control linking of this program. We will take a hello world program for this example:

.data
        msg:    .ascii  "hello, world!\n"
.text
.global _start
_start:
        mov    $1,%rax
        mov    $1,%rdi
        mov    $msg,%rsi
        mov    $14,%rdx
        syscall
        mov    $60,%rax
        mov    $0,%rdi
        syscall

We can compile and link it with the following commands:

$ as -o hello.o hello.asm
$ ld -o hello hello.o

Our program consists from two sections: .text contains code of the program and .data contains initialized variables. Let’s write simple linker script and try to link our hello.asm assembly file with it. Our script is:

/*
 * Linker script for the factorial
 */
OUTPUT(hello) 
OUTPUT_FORMAT("elf64-x86-64")
INPUT(hello.o)
SECTIONS
{
    . = 0x200000;
    .text : {
          *(.text)
    }
    . = 0x400000;
    .data : {
          *(.data)
    }
}

On the first three lines you can see a comment written in C style. After it the OUTPUT and the OUTPUT_FORMAT commands specify the name of our executable file and its format. The next command, INPUT, specifies the input file to the ld linker. Then, we can see the main SECTIONS command, which, as I already wrote, must be present in every linker script. The SECTIONS command represents the set and order of the sections which will be in the output file. At the beginning of the SECTIONS command we can see following line . = 0x200000. I already wrote above that . command points to the current position of the output. This line says that the code should be loaded at address 0x200000 and the line . = 0x400000 says that data section should be loaded at address 0x400000. The second line after the . = 0x200000 defines .text as an output section. We can see *(.text) expression inside it. The * symbol is wildcard that matches any file name. In other words, the *(.text) expression says all .text input sections in all input files. We can rewrite it as hello.o(.text) for our example. After the following location counter . = 0x400000, we can see definition of the data section.

We can compile and link it with the following command:

$ as -o hello.o hello.S && ld -T linker.script && ./hello
hello, world!

If we look inside it with the objdump util, we can see that .text section starts from the address 0x200000 and the .data sections starts from the address 0x400000:

$ objdump -D hello
Disassembly of section .text:
0000000000200000 <_start>:
  200000:    48 c7 c0 01 00 00 00     mov    $0x1,%rax
  ...
Disassembly of section .data:
0000000000400000 <msg>:
  400000:    68 65 6c 6c 6f           pushq  $0x6f6c6c65
  ...

Apart from the commands we have already seen, there are a few others. The first is the ASSERT(exp, message) that ensures that given expression is not zero. If it is zero, then exit the linker with an error code and print the given error message. If you’ve read about Linux kernel booting process in the linux-insides book, you may know that the setup header of the Linux kernel has offset 0x1f1. In the linker script of the Linux kernel we can find a check for this:

. = ASSERT(hdr == 0x1f1, "The setup header has the wrong offset!");

The INCLUDE filename command allows to include external linker script symbols in the current one. In a linker script we can assign a value to a symbol. ld supports a couple of assignment operators:

symbol = expression ;
symbol += expression ;
symbol -= expression ;
symbol *= expression ;
symbol /= expression ;
symbol <<= expression ;
symbol >>= expression ;
symbol &= expression ;
symbol |= expression ;

As you can note all operators are C assignment operators. For example we can use it in our linker script as:

START_ADDRESS = 0x200000;
DATA_OFFSET   = 0x200000;
SECTIONS
{
    . = START_ADDRESS;
    .text : {
          *(.text)
    }
    . = START_ADDRESS + DATA_OFFSET;
    .data : {
          *(.data)
    }
}

As you already may noted the syntax for expressions in the linker script language is identical to that of C expressions. Besides this the control language of the linking supports following builtin functions:

ABSOLUTE - returns absolute value of the given expression;
ADDR - takes the section and returns its address;
ALIGN - returns the value of the location counter (. operator) that aligned by the boundary of the next expression after the given expression;
DEFINED - returns 1 if the given symbol placed in the global symbol table and 0 in other way;
MAX and MIN - return maximum and minimum of the two given expressions;
NEXT - returns the next unallocated address that is a multiple of the give expression;
SIZEOF - returns the size in bytes of the given named section.

That’s all.

Conclusion

This is the end of the post about linkers. We learned many things about linkers in this post, such as what is a linker and why it is needed, how to use it, etc..

If you have any questions or suggestions, write me an kuleshovmail@gmail.com">email or ping me on twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please let me know via email or send a PR.

Linkers