Tools for examining different phases of compiling and running a C program

This page provides information about the different phases of compiling and running a C (or C++) program and tools that can be used to examine the results of these different phases.

The different phases examined below are:

  1. the preprocessor (expands # directives such as #include and #define)
  2. the compiler (produces .o files)
  3. the link editor (produces a.out files)
  4. the runtime linker (loads and links shared libraries used by a.out)

The tools used below to examine compiler output include file, cpp, gcc -E, objdump, gdb, nm, and ldd.

More information and examples using some of these tools to examine .o and a.out files (hexdump, strings, objdump, gdb).


The following program is used as an example below (it is also available in ~newhall/public/cs75/compilecycle/ with a Makefile for building .o and executable files):
// simple.c:
#include <stdio.h>    // sprintf
#include <string.h>   // strlen
#include <unistd.h>   // write, STDOUT_FILENO

#define MAX  10

int foo(int y);

int main(void) {

  int x, i;
  char buf[10];

  for(i=0; i < MAX; i++) {
    x = foo(i);
    // a crazy way to print to stdout
    sprintf(buf, "%d", x);
    write(STDOUT_FILENO, buf, strlen(buf));
    buf[0] = '\n';
    write(STDOUT_FILENO, buf, 1);
  }

  return 0;
}

int foo(int y) {
  return y*y;
}

The Unix file command can be used to find out information about the type of a file. For example:
# the C source file:
#
$ file simple.c
  simple.c: ASCII C program text

# the object file: produces relocatable machine code
#   ELF: stands for Executable and Linking Format, and is the format for
#        .o, a.out, and .so files produced by gcc.  The format is necessary
#        so that programs that process these files, and the OS, know how
#        to find different parts of the code and data in this file
#   Intel 80386: is the target architecture 
#   not stripped: means that this .o file includes a symbol table
#
$ file simple.o
  simple.o: ELF 32-bit LSB relocatable, Intel 80386, version 1 (SYSV), not stripped

# the executable file: 
#
$ file simple
  simple: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), for 
  GNU/Linux 2.6.8, dynamically linked (uses shared libs), not stripped

# a shared object file (dynamically linked library): 
#  
$ file /lib/libc-2.7.so 
  /lib/libc-2.7.so: ELF 32-bit LSB shared object, Intel 80386, version 1 (SYSV),
  for GNU/Linux 2.6.8, stripped
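
The fields that file reports for ELF files come from the first few bytes of the file. The following is a minimal sketch of that idea, not the implementation of file itself: describe_elf is a hypothetical helper that looks only at the magic number, the class byte, the byte-order byte, and the 16-bit type field of an ELF header already read into memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of file): describe an ELF header held in
 * memory. hdr must hold at least the first 18 bytes of the file.
 * Writes a short description into out and returns 0, or returns -1 if the
 * bytes do not start with the ELF magic number. */
int describe_elf(const unsigned char *hdr, size_t len, char *out, size_t outlen)
{
    assert(hdr != NULL && out != NULL && outlen > 0);

    /* every ELF file starts with the bytes 0x7f 'E' 'L' 'F' */
    if (len < 18 || memcmp(hdr, "\177ELF", 4) != 0)
        return -1;

    /* byte 4: 1 = 32-bit objects, 2 = 64-bit objects */
    const char *cls = hdr[4] == 1 ? "32-bit"
                    : hdr[4] == 2 ? "64-bit" : "unknown-class";

    /* byte 5: 1 = least significant byte first, 2 = most significant first */
    const char *order = hdr[5] == 1 ? "LSB"
                      : hdr[5] == 2 ? "MSB" : "unknown-order";

    /* the type is a 16-bit field at offset 16, stored in the file's byte order */
    unsigned type = hdr[5] == 2 ? (unsigned)(hdr[16] << 8 | hdr[17])
                                : (unsigned)(hdr[17] << 8 | hdr[16]);
    const char *kind = type == 1 ? "relocatable"
                     : type == 2 ? "executable"
                     : type == 3 ? "shared object" : "other";

    snprintf(out, outlen, "ELF %s %s %s", cls, order, kind);
    return 0;
}
```

To try it on simple.o, read the first 18 bytes of the file with fread and pass them in. Note that on newer systems position-independent executables carry the shared object type value, so this sketch cannot tell them apart from .so files the way file can.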

  1. The preprocessor

    The C preprocessor (cpp) is the first part of the compiler to run. It expands #include directives (replacing them with the contents of the named .h files), #define directives (replacing each use of a macro or constant with its definition), and #if directives (deciding which code is conditionally included). You can run cpp directly on simple.c, or run gcc with the -E flag to run just the preprocessor part of the compiler:
    # run cpp: 
    $ cpp simple.c | less
    
    # run just the preprocessor part of gcc:
    $ gcc -E simple.c  | less
    
    # look at the output to see what happens to #includes and #defines from simple.c
    
    here is a very detailed reference about the pre-processor.
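
As a small illustration of the three kinds of expansion, here is a sketch that can be saved to a file and run through gcc -E. MAX is taken from simple.c; SQUARE, STEP, and sum_squares are hypothetical names introduced only for this example.

```c
#include <assert.h>

#define MAX 10                      /* object-like macro, as in simple.c */
#define SQUARE(v) ((v) * (v))       /* function-like macro (hypothetical) */

#if MAX > 5                         /* evaluated by the preprocessor */
#define STEP 2
#else
#define STEP 1
#endif

/* After preprocessing, the loop header below reads
 *     for (int i = 0; i < 10 && i < n; i += 2)
 * and the loop body reads
 *     total += ((i) * (i));
 * so the compiler proper never sees the names MAX, STEP, or SQUARE. */
int sum_squares(int n)
{
    assert(n >= 0);
    int total = 0;
    for (int i = 0; i < MAX && i < n; i += STEP)
        total += SQUARE(i);
    return total;
}
```

In the gcc -E output, the #include line is replaced by the contents of assert.h, and the assert call itself is also expanded, because assert is a macro.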
  2. The core compiler

    The core of the C compiler creates a .o file (relocatable machine code for module simple.c) from the output of the preprocessor. Use the -c option to gcc to produce a .o file:
    $ gcc -c simple.c
    
    You can see the assembly code in simple.o using either objdump or gdb (all addresses are listed in hexadecimal (base 16)):
    $ objdump -d simple.o
      ...
      00000000 <main>:
         0:   8d 4c 24 04             lea    0x4(%esp),%ecx
         4:   83 e4 f0                and    $0xfffffff0,%esp
         7:   ff 71 fc                pushl  -0x4(%ecx)
         a:   55                      push   %ebp
         b:   89 e5                   mov    %esp,%ebp
         d:   51                      push   %ecx
         e:   83 ec 34                sub    $0x34,%esp
        11:   65 a1 14 00 00 00       mov    %gs:0x14,%eax
        17:   89 45 f8                mov    %eax,-0x8(%ebp)
        1a:   31 c0                   xor    %eax,%eax
        ...
    $ gdb simple.o
     (gdb) disass main
     (gdb) disass foo
     (gdb) quit
    
  3. The link editor

    The link editor creates an executable file (a.out file) from one or more .o files and .a or .so files (static or dynamic libraries):
    # create an executable file from simple.o and some standard libraries that gcc automatically links in:
    $ gcc -o simple simple.o
    

    Disassembling Executable Code:

    You can use objdump (or in gdb the disass command) to disassemble the code in the executable (simple) to see how it differs from the code in simple.o (compare the instruction addresses and the targets of the call instructions):
    $ objdump -d simple
     ...
    08048434 <main>:
     8048434:   8d 4c 24 04             lea    0x4(%esp),%ecx
     8048438:   83 e4 f0                and    $0xfffffff0,%esp
     804843b:   ff 71 fc                pushl  -0x4(%ecx)
     804843e:   55                      push   %ebp
     804843f:   89 e5                   mov    %esp,%ebp
     8048441:   51                      push   %ecx
     8048442:   83 ec 34                sub    $0x34,%esp
     8048445:   65 a1 14 00 00 00       mov    %gs:0x14,%eax
     804844b:   89 45 f8                mov    %eax,-0x8(%ebp)
     804844e:   31 c0                   xor    %eax,%eax
     8048450:   c7 45 e4 00 00 00 00    movl   $0x0,-0x1c(%ebp)
     8048457:   eb 6d                   jmp    80484c6
     ...
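
The jump and call targets that objdump prints are not stored in the instruction as absolute addresses. The encoding holds a displacement relative to the address of the next instruction, which is one reason final targets can only be filled in once the link editor has assigned addresses. The sketch below shows the arithmetic for the jmp at 8048457 in the listing above; branch_target is a hypothetical helper that handles only three opcodes, not a disassembler.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: compute the target of an x86 relative branch.
 *   addr  is the address of the first byte of the instruction
 *   bytes points at the instruction's encoding
 * Handles short jmp (eb rel8), near jmp (e9 rel32), and near call (e8 rel32).
 * Returns 0 for any other opcode. */
uint32_t branch_target(uint32_t addr, const uint8_t *bytes)
{
    assert(bytes != 0);

    if (bytes[0] == 0xeb) {
        /* 2-byte instruction: opcode plus a signed 8-bit displacement */
        int8_t rel = (int8_t)bytes[1];
        return addr + 2 + (uint32_t)(int32_t)rel;
    }

    if (bytes[0] == 0xe8 || bytes[0] == 0xe9) {
        /* 5-byte instruction: opcode plus a little-endian 32-bit displacement */
        uint32_t rel = (uint32_t)bytes[1]
                     | (uint32_t)bytes[2] << 8
                     | (uint32_t)bytes[3] << 16
                     | (uint32_t)bytes[4] << 24;
        /* unsigned wrap-around makes negative displacements work */
        return addr + 5 + rel;
    }

    return 0;
}
```

For the bytes eb 6d at address 0x8048457 this gives 0x8048457 + 2 + 0x6d = 0x80484c6, the target objdump shows. The same arithmetic applies to the call e8 c9 fe ff ff at 0x804849e that appears in section 4: 0x804849e + 5 - 0x137 = 0x804836c.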

    Viewing the Symbol Table:

    Use nm (or objdump -t) to list the symbol table from an a.out or .so file
    $ nm --format sysv simple	# system V format is easier to read than bsd format which is the default
    
    Name                  Value   Class        Type         Size     Line  Section
    
    ...
    foo                 |080484e6|   T  |              FUNC|0000000c|     |.text
    frame_dummy         |08048410|   t  |              FUNC|        |     |.text
    main                |08048434|   T  |              FUNC|000000b2|     |.text
    p.5841              |080496dc|   d  |            OBJECT|        |     |.data
    sprintf@@GLIBC_2.0  |        |   U  |              FUNC|00000034|     |*UND*
    strlen@@GLIBC_2.0   |        |   U  |              FUNC|000000af|     |*UND*
    write@@GLIBC_2.0    |        |   U  |              FUNC|00000076|     |*UND*
    
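    Section *UND* means the symbol is undefined in this file; here these are
    functions from .so files that will be loaded at run-time. Section .text
    means the symbol is in the .text section of the executable file (the code
    section). Class T and t are symbols in the code section (functions), D and
    d are symbols in the initialized data section (global variables), R is
    read-only data, and U is undefined. An uppercase letter marks a global
    (externally visible) symbol and a lowercase letter a local one. The Value
    column gives the address of the function or data.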
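
To see where the class letters come from, compile a small file with gcc -c and run nm on the resulting .o file. The following sketch uses hypothetical names throughout. The letters in the comments are what nm typically reports on an ELF system; they can vary with the toolchain and flags (for example, an optimizing build may inline helper and drop its symbol).

```c
#include <assert.h>

int global_counter = 1;        /* initialized global data: typically D */
static int file_local = 2;     /* initialized, visible in this file only: typically d */
const int read_only = 3;       /* constant data: typically R */
int zeroed;                    /* uninitialized global: typically B (or C) */

/* static function: local symbol in the code section, typically t */
static int helper(int v)
{
    return v + file_local;
}

/* externally visible function: global symbol in the code section, typically T */
int public_api(int v)
{
    assert(v >= 0);            /* with glibc this adds an undefined (U) symbol */
    global_counter++;
    return helper(v) * read_only + zeroed;
}
```

The U entry produced by assert plays the same role as sprintf, strlen, and write in the listing above: the symbol is used here but defined elsewhere.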
    
  4. The runtime linker and dynamically linked libraries:

    The runtime linker sets entries in the PLT (procedure linkage table) and/or the GOT (global offset table) at runtime to bind variables and functions to their locations in shared objects (dynamically linked libraries) that are loaded at runtime. What exactly is done depends on the format of the a.out file and the underlying OS/Arch.

    If you do objdump -d simple you can see that the call to write in main is a call into the .plt section of the a.out (which contains the PLT):

    08048434 <main>:
     ...
     804849e:   e8 c9 fe ff ff          call   804836c <write@plt>
     ...
    Disassembly of section .plt:
     ...
    0804836c <write@plt>:
     804836c:   ff 25 c4 96 04 08       jmp    *0x80496c4
     8048372:   68 10 00 00 00          push   $0x10
     8048377:   e9 c0 ff ff ff          jmp    804833c <_init+0x30>

    The jmp *0x80496c4 instruction jumps to the address stored at 0x80496c4, which is an entry in the Global Offset Table (GOT); the GOT itself begins at address 0x80496b0. The value in this GOT entry is set at runtime by the dynamic linker.

    To see what this value is set to at runtime, disassemble instructions in gdb:

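    1. set a breakpoint at the call to write in main, run to it, then continue once so that the first call to write has completed (with lazy binding, the usual default, the GOT entry is only filled in during the first call):
      $ gdb simple
      (gdb) break *0x0804849e
      (gdb) run
      (gdb) cont
      (gdb) disass main
      ...
      0x0804849e <main+106>:  call   0x804836c <write@plt>
      ...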
      
    2. disassemble the PLT entry that is called from main (the call to write in libc.so):
      
      
      (gdb) disass 0x804836c
      Dump of assembler code for function write@plt:
      0x0804836c <write@plt+0>:       jmp    *0x80496c4
      0x08048372 <write@plt+6>:       push   $0x10
      0x08048377 <write@plt+11>:      jmp    0x804833c <_init+48>
      
      
    3. disassemble instructions around 0x80496c4 just to see that the jmp target is stored in a location in the GOT (ignore the disassembled "instructions" in the GOT: the GOT stores jump target addresses not instructions, so the disassembled target addresses have no meaning):
      (gdb) disass 0x80496c4
      Dump of assembler code for function _GLOBAL_OFFSET_TABLE_:
      0x080496b0 <_GLOBAL_OFFSET_TABLE_+0>:   fcoml  0x66680804(%ebp)
      0x080496b6 <_GLOBAL_OFFSET_TABLE_+6>:   icebp  
      0x080496b7 <_GLOBAL_OFFSET_TABLE_+7>:   mov    $0x30,%bh
      0x080496b9 <_GLOBAL_OFFSET_TABLE_+9>:   fdiv   %st,%st(0)
      0x080496bb <_GLOBAL_OFFSET_TABLE_+11>:  mov    $0xb0,%bh
      0x080496bd <_GLOBAL_OFFSET_TABLE_+13>:  xchg   %eax,%ebx
      0x080496be <_GLOBAL_OFFSET_TABLE_+14>:  fnsave 0x8048362(%edi)
      0x080496c4 <_GLOBAL_OFFSET_TABLE_+20>:  rclb   -0x7c8f481b(%edx)
      0x080496ca <_GLOBAL_OFFSET_TABLE_+26>:  fidivl -0x481fbdb0(%edi)
      0x080496d0 <_GLOBAL_OFFSET_TABLE_+32>:  mov    %al,0x80483
      
    4. print out the value stored in the GOT entry for write (the entry is at address 0x80496c4 and it contains the address of the write function, 0xb7e592d0):
      (gdb) print/x *0x80496c4
      $2 = 0xb7e592d0
      
    5. now try disassembling code around address 0xb7e592d0 to see code from the write function from libc.so:
      (gdb) disass 0xb7e592d0
      Dump of assembler code for function write:
      0xb7e592d0 <write+0>:   cmpl   $0x0,%gs:0xc
      0xb7e592d8 <write+8>:   jne    0xb7e592fc <write+44>
      0xb7e592da <write+10>:  push   %ebx
      0xb7e592db <write+11>:  mov    0x10(%esp),%edx
      0xb7e592df <write+15>:  mov    0xc(%esp),%ecx
      0xb7e592e3 <write+19>:  mov    0x8(%esp),%ebx
      ...
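
The steps above show a GOT entry that already holds the address of write. The sketch below is a plain C model of how such an entry can be filled in lazily on the first call. It is a simulation with hypothetical names, using an ordinary function pointer in place of a GOT slot; it is not what the dynamic linker actually executes.

```c
#include <assert.h>

typedef int (*func_ptr)(int);

static int resolver_stub(int y);            /* stands in for the PLT's fallback path */

static func_ptr got_entry = resolver_stub;  /* stands in for one GOT slot */
static int resolve_count = 0;               /* how many times the slot was resolved */

/* stands in for the real function living in a shared object */
static int real_square(int y)
{
    return y * y;
}

/* The first call lands here: look up the real address, patch the slot,
 * then forward the call to the real function. */
static int resolver_stub(int y)
{
    resolve_count++;
    got_entry = real_square;
    return got_entry(y);
}

/* stands in for the PLT entry: an indirect jump through the GOT slot */
int call_through_plt(int y)
{
    assert(got_entry != 0);
    return got_entry(y);
}

int times_resolved(void)
{
    return resolve_count;
}
```

Calling call_through_plt twice resolves the slot only once: the first call goes through resolver_stub, and every later call jumps straight to real_square, much as jmp *0x80496c4 goes straight to write once the GOT entry holds 0xb7e592d0.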
      

    Viewing shared object dependencies and the dynamic symbol table:

    ldd lists the shared object dependencies of an a.out or .so file (i.e. which shared objects need to be loaded at runtime to run the a.out or to load the .so):
    $ ldd simple
            linux-gate.so.1 =>  (0xb7ef2000)
            libc.so.6 => /lib/tls/i686/cmov/libc.so.6 (0xb7d8a000)
            /lib/ld-linux.so.2 (0xb7ef3000)
    
    Use objdump -T to see dynamic symbol table entries from a .so file (here we are just finding the one for write):
    $ objdump -T /lib/libc.so.6 | grep write     
    
    000b6ab0  w   DF .text  00000076  GLIBC_2.0   write
    

    Here is some more information about readelf, objdump, and other tools: