Compilation Example

Write code

C code uses curly braces, semi-colons, and type names before variable names in declarations.

Example:

#define three (1+2)

int printf(const char *format, ...);

int main() {
    // this is a comment
    double x;
    int y = three;
    /*
    this is also a comment
    */
    x = y>>1;
    printf("%g\n", x);
    return 0;
}

Preprocess

The C pre-processor removes comments, handles any lines that begin with # (such as #define and #include), and expands macros elsewhere in the file.

You can invoke the C pre-processor with the command cpp. Its output will include many lines beginning with #; these exist only to help subsequent error messages make sense, and we can ignore them.

Example:

int printf(const char *format, ...);

int main() {
    double x;
    int y = (1+2);
    x = y>>1;
    printf("%g\n", x);
    return 0;
}

The preprocessing can generate errors, such as for unmatched /* and */ or invalid # lines.

Lex

A lexer (also called a “tokenizer”) identifies the “tokens” in the file. For example, it knows that int is an “identifier”-type token, ( is a “left parenthesis”-type token, >> is a “binary operator”-type token, and so on.

Lexing disambiguates some things, such as deciding that >> is one shift token, not two greater-than tokens and that 1.2 is a single token but x.y is three tokens.

Lexing also removes most non-semantic aspects of style, like spacing.

Example: int printf ( const char * format , ... ) ; int main ( ) { double x ; int y = ( 1 + 2 ) ; x = y >> 1 ; printf ( "%g\n" , x ) ; return 0 ; }

Lexing can generate errors if you type a symbol that’s not part of C.

Lexers are generally built as an extension of finite automata, a topic you’ll learn about in CS 3120.

Lex - An implementation perspective

The following documentation on Lex is derived from this link. Please read the Lex and Yacc theory section of that document before proceeding. Additionally, this playlist has a tutorial on Lex and Yacc (about 1 hour in total). The PDF and the YouTube playlist are what we used to design this homework. Reading through the examples will help you understand Lex and Yacc.

Let’s talk about Lex now; we will come back to Yacc and code generation later. Given the rules for tokenization, Lex generates a C program that reads our source program and tokenizes it. Below is an actual Lex program that we created for a custom language.

 1 | %{
 2 | // This section contains C style includes and declarations
 3 | #include <stdlib.h>
 4 | #include <string.h>
 5 | #include "../inc/includes.h"
 6 | #include "y.tab.h"
 7 | void yyerror(char *);
 8 | int yylineno;
 9 | %}
10 |
11 | %%
12 |
13 | [a-z]       {
14 |                 yylval.sIndex = *yytext - 'a';
15 |                 return VARIABLE;
16 |             }
17 |
18 | [0]+           {
19 |                 yylval.iValue = atoi(yytext);
20 |                 return INTEGER;
21 |             }
22 |
23 | [1-9][0-9]* {
24 |                 yylval.iValue = atoi(yytext);
25 |                 return INTEGER;
26 |             }
27 |
28 | [-()<>=+*/;{}.] {
29 |                 return *yytext;
30 |             }
31 |
32 | ">="            return GE;
33 | "<="            return LE;
34 | "=="            return EQ;
35 | "!="            return NE;
36 | "while"         return WHILE;
37 | "if"            return IF;
38 | "else"          return ELSE;
39 | "printc"        return PRINTC;
40 | 
41 |
42 | \n              { yylineno++; }
43 | #(.*)        ;       /* ignore comments */
44 | [ \t]+          ;       /* ignore whitespace */
45 |
46 | .              {
47 |                 char msg[64];
48 |                 snprintf(msg, sizeof(msg), "Unknown character in line %d", yylineno);
49 |                 yyerror(msg);
50 |                }
51 | %%
52 | int yywrap(void) {
53 |     return 1;
54 | }

The Lex file shown above (Lex files use the .l extension) contains three sections.

... definitions ...
%%
... rules ...
%%
... subroutines ...

Input to Lex is divided into three sections, with %% dividing the sections. The definitions section consists of token declarations and C code bracketed by %{ and %}.

Additionally, all variable and function names starting with yy are internal to Lex and Yacc. You can find their significance in the lex (flex) and yacc (bison) manual pages. The first section contains C-style definitions and includes, as shown. The second section contains the rules that tell Lex how to tokenize. Each rule in this section consists of a regular expression followed by the action associated with that regular expression.

In the Lex file shown, take a look at the rules section. The first rule handles the characters a to z, returning the token VARIABLE while also storing the "index" of the variable. We will discuss why we store the index later in this article. Similarly, each rule returns the value or index, in addition to the token (VARIABLE, INTEGER), of the characters that match its regular expression. The second rule handles zeros. Why a separate rule for zero? Why is it not part of the other integer-handling rule (the third rule in the Lex file)? There are two reasons:

  1. The language we are compiling for considers 007 to be illegal, so the general integer rule must not accept a leading zero.
  2. 0000 and 0 are the same integer, so there is no need for different rules to detect them.

Finally, in the remaining rules, we use regular expressions to identify the other TOKENS that we allow in our language. We define these token types in the YACC file.

In the third section, we can define the functions we declared in the first section. But since we will be defining our functions in the YACC file, we only define the yywrap() function here. Based on this answer on stackoverflow: when the scanner receives an end-of-file indication from YY_INPUT (in simple words, the input to the Lex program), it calls the yywrap() function. If yywrap() returns false (zero), it is assumed that the function has set up yyin to point to another input file, and scanning continues. If it returns true (non-zero), the scanner terminates, returning 0 to its caller. This means that our compiler can only read one program file; it cannot read two.

Let us consider the statement a = z + 5;. Lex would tokenize this into ` VARIABLE = VARIABLE + INTEGER ; `. It also stores the index of each VARIABLE and the value of each INTEGER.

Parse

A parser combines the tokens into a tree structure called an “abstract syntax tree” or AST.

Example: There are several ways we could make an AST, but one of them is

  • function declaration
    • int
    • printf
    • arguments
      1. declaration
        • type
          • const
          • char
          • *
        • name
          • format
      2. ...
  • function definition
    • int
    • main
    • arguments
    • body
      1. definition
        • double
        • x
      2. initialization
        • int
        • y
        • +
          • 1
          • 2
      3. assignment
        • x
        • >>
          • y
          • 1
      4. invocation
        • printf
        • arguments
          1. "%g\n"
          2. x
      5. return
        • 0

Note that not all tokens make it into the AST; some, like parentheses and semi-colons, help organize the tree but can then be abstracted away (hence the word “abstract” in “abstract syntax tree”).

Parsers can generate errors if you write something that doesn’t look like C code.

Parsers are related to, but generally not actually implemented as, pushdown automata, a topic you’ll learn about in CS 3120. The actual design of parsers is discussed in CS 4610 and CS 4620.

Yacc (A parser) - An implementation perspective

The following documentation on Yacc is derived from the same link and playlist referenced in the Lex section above. As before, reading the Lex and Yacc theory section of that document and working through its examples will help you follow along.

As mentioned earlier, a critical step in compilation is converting the source program into an intermediate representation that can be translated into machine language (x86 assembly) easily. Yacc will convert our program in the custom language into an AST after Lex tokenizes it. Unlike the previous section on Lex, this section assumes that you have read through and understood the Lex and Yacc theory sections in that document.

Each node in the AST is built from tokens, and our language supports 3 types of tokens, namely VARIABLE, INTEGER, and OPERATOR. In the Lex file, you have already encountered INTEGER and VARIABLE. The OPERATOR tokens were returned either as *yytext or as a named token like WHILE, PRINTC, and so on. Yacc consumes these tokens and uses them to create an AST. The structure of the nodes is defined in the includes file shown below. You don’t have to understand the includes file to follow the rest of this section.

 1 | /* Creating an enumerated type */
 2 | /* for the tokens types we support. Namely, Constant(Integer), Variable(Identifier) and Operations(Operator). */
 3 | typedef enum { typeCon, typeId, typeOpr } nodeEnum;
 4 |
 5 | /* constants */
 6 | typedef struct {
 7 |     int value;                  /* value of constant */
 8 | } conNodeType;
 9 |
10 | /* identifiers */
11 | typedef struct {
12 |     int i;                      /* subscript to sym array */
13 | } idNodeType;
14 |
15 | /* operators */
16 | typedef struct {
17 |     int oper;                   /* operator */
18 |     int nops;                   /* number of operands */
19 |     struct nodeTypeTag *op[1];     /* operands, extended at runtime */
20 | } oprNodeType;
21 |
22 | /* Node in our AST */
23 | typedef struct nodeTypeTag {
24 |     nodeEnum type;              /* type of node */
25 |
26 |     union {                     /* union because any node can be only one of the three types */
27 |         conNodeType con;        /* constants */
28 |         idNodeType id;          /* identifiers */
29 |         oprNodeType opr;        /* operators */
30 |     };
31 | } nodeType;
32 |
33 | /* We implement variables as a sequence of 26 memory spaces, each of 8 byte size. */
34 | extern int sym[26]; /* These are our identifiers/variables */

The following code snippet is the actual Yacc file.

  1 | %{
  2 | #include <stdio.h>
  3 | #include <stdlib.h>
  4 | #include <stdarg.h>
  5 | #include <string.h>
  6 | #include "../inc/includes.h"
  7 |
  8 | /* input output handling */
  9 | extern FILE *yyin;
 10 | extern FILE *yyout;
 11 |
 12 | /* prototypes */
 13 | nodeType *opr(int oper, int nops, ...);
 14 | nodeType *id(int i);
 15 | nodeType *con(int value);
 16 | void freeNode(nodeType *p);
 17 | int ex(nodeType *p, FILE *yyout);
 18 | int yylex(void);
 19 | int epilogue(FILE *yyout);
 20 |
 21 | void yyerror(char *s);
 22 | int sym[26];                    /* symbol table */
 23 |
 24 | %}
 25 |
 26 | %union {
 27 |     int iValue;                 /* integer value */
 28 |     char sIndex;                /* symbol table index */
 29 |     nodeType *nPtr;             /* node pointer */
 30 | };
 31 |
 32 | %token <iValue> INTEGER
 33 | %token <sIndex> VARIABLE
 34 | %token WHILE IF PRINTC
 35 | %nonassoc IFX
 36 | %nonassoc ELSE
 37 |
 38 | %left GE LE EQ NE '>' '<'
 39 | %left '+' '-'
 40 | %left '*' '/'
 41 | %nonassoc UMINUS
 42 |
 43 | %type <nPtr> stmt expr stmt_list
 44 |
 45 | %%
 46 |
 47 | program:
 48 |         function                {;}
 49 |         ;
 50 |
 51 | function:
 52 |           function stmt         { ex($2, yyout); freeNode($2); }
 53 |         | /* NULL */
 54 |         ;
 55 |
 56 | stmt:
 57 |           ';'                            { $$ = opr(';', 2, NULL, NULL); }
 58 |         | expr ';'                       { $$ = $1; }
 59 |         | PRINTC expr ';'                { $$ = opr(PRINTC, 1, $2); }
 60 |
 61 |         | VARIABLE '=' expr ';'          { $$ = opr('=', 2, id($1), $3); }
 62 |         | WHILE '(' expr ')' stmt        { $$ = opr(WHILE, 2, $3, $5); }
 63 |         | IF '(' expr ')' stmt %prec IFX { $$ = opr(IF, 2, $3, $5); }
 64 |         | IF '(' expr ')' stmt ELSE stmt { $$ = opr(IF, 3, $3, $5, $7); }
 65 |         | '{' stmt_list '}'              { $$ = $2; }
 66 |         ;
 67 |
 68 | stmt_list:
 69 |           stmt                  { $$ = $1; }
 70 |         | stmt_list stmt        { $$ = opr(';', 2, $1, $2); }
 71 |         ;
 72 |
 73 | expr:
 74 |           INTEGER               { $$ = con($1); }
 75 |         | VARIABLE              { $$ = id($1); }
 76 |         | '-' expr %prec UMINUS { $$ = opr(UMINUS, 1, $2); }
 77 |         | expr '+' expr         { $$ = opr('+', 2, $1, $3); }
 78 |         | expr '-' expr         { $$ = opr('-', 2, $1, $3); }
 79 |         | expr '*' expr         { $$ = opr('*', 2, $1, $3); }
 80 |         | expr '/' expr         { $$ = opr('/', 2, $1, $3); }
 81 |         | expr '<' expr         { $$ = opr('<', 2, $1, $3); }
 82 |         | expr '>' expr         { $$ = opr('>', 2, $1, $3); }
 83 |         | expr GE expr          { $$ = opr(GE, 2, $1, $3); }
 84 |         | expr LE expr          { $$ = opr(LE, 2, $1, $3); }
 85 |         | expr NE expr          { $$ = opr(NE, 2, $1, $3); }
 86 |         | expr EQ expr          { $$ = opr(EQ, 2, $1, $3); }
 87 |         | '(' expr ')'          { $$ = $2; }
 88 |         ;
 89 |
 90 | %%
 91 |
 92 | nodeType *con(int value) {
... |   /* code to handle constants here */
104 | }
105 |
106 | nodeType *id(int i) {
... |   /* code to handle identifiers here */
118 | }
119 |
120 | nodeType *opr(int oper, int nops, ...) {
... |   /* code to handle operators here */
138 | }
139 |
140 | void freeNode(nodeType *p) {
... |   /* code to free memory here */
149 | }
150 |
151 | void yyerror(char *s) {
152 |     fprintf(stderr, "%s\n", s);
153 |
154 | 
155 | }
156 |
... |
164 |
165 | int main(int argc, char *argv[]) {
166 |     ++argv, --argc;
167 |     if (argc > 0) {
168 |         yyin = fopen(argv[0], "r");
169 |     } else {
170 |         yyin = stdin;
171 |     }
172 |
173 |     yyout = fopen("out.s", "w");
174 |
... |
181 |     yyparse();
182 |
183 |     return 0;
184 | }

Like a Lex file, a Yacc file also has three sections.

... definitions ...
%%
... rules ...
%%
... subroutines ...

Input to Yacc is divided into three sections. The definitions section consists of token declarations and C code bracketed by %{ and %}. The BNF grammar is placed in the rules section, and user subroutines are added in the subroutines section.

In the definitions section, we include the libraries that we need, followed by the includes file that was shown earlier. We declare the input and output files, and then declare the functions we will use to create the AST nodes, free them, and walk the resulting tree. We also declare our tokens; %nonassoc and %left define the precedence and associativity of the tokens.

The rules in the next section recursively break down the stream of tokens generated by Lex into statements and convert them into an AST. The working of the rules is discussed in the theory section of the document referred to earlier.

In the subroutines section, we define the functions that we declared earlier. Several of them create AST nodes. The freeNode function frees the memory allocated for the AST once assembly generation is complete. We redefine the yyerror function to display errors on stderr. The epilogue function is invoked every time we compile a program, adding a few lines at the end of the generated assembly that call sys_exit to properly terminate the code.

The main function reads the input ToyG program, compiles it by calling yyparse(), and terminates once compilation is successful.

The statement a = z + 5;, which Lex tokenized into ` VARIABLE = VARIABLE + INTEGER ; `, would be converted into the following AST by Yacc. Tracing Yacc’s flow as it converts the tokens into this AST is left as an exercise for the reader.

       [=]
        |
   |-------|
   |       |
 id(A)    [+]
           |
         |----|
         |    |
       id(Z) c(5)

In the above AST, id() refers to identifiers and c() refers to constants. Operators are shown in square brackets.

Type Checking

A type checker annotates the AST with the type of each expression and variable, adding implicit type-casts where needed.

Example:

  • function declaration
    • int
    • printf
    • arguments
      1. declaration
        • type
          • const
          • char
          • *
        • name
          • format
      2. ...
  • function definition
    • int
    • main
    • arguments
    • body
      1. definition
        • double
        • x (double)
      2. initialization
        • int
        • y (int)
        • + (int)
          • 1 (int)
          • 2 (int)
      3. assignment
        • x (double)
        • (cast to double)
          • >> (int)
            • y (int)
            • 1 (int)
      4. invocation
        • (int) printf (const char *, …)
        • arguments
          1. "%g\n" (const char *)
          2. x (double)
      5. return
        • 0 (int)

Type checking can generate errors if the types of values and operators do not match, if the declared type does not match usage, or if you failed to declare something.

Type checking in C is done in a single pass, top to bottom, which means that declarations must precede usage.

Example: This is valid C

int f();
int g() {
    return f();
}

But this is not valid C; in particular, it will fail during type-checking with an “undefined identifier” error on the first use of f.

int g() {
    return f();
}
int f();

After type checking, the declarations can all be removed from the AST as they’ve been copied to where they were used.

Code Generation and Optimization

Code generation is the process of turning an annotated AST into assembly. Typically, this is done with several intermediate steps, varying by compiler, but usually looking something like

  1. Turn AST into an assembly-like language with infinite registers
  2. Apply various optimizations, such as
    • removing code that doesn’t have a lasting impact on the program
    • replacing variables that have a constant value with literals
    • re-ordering code to reduce the number of jumps
    • using the same “register” for different values
    • ISA-specific tricks like replacing 5*x with x + x*4 which can be implemented with a lea instruction
  3. Allocate registers and memory locations
    • a simple (non-optimal) way places all variables in memory
    • optimizations try to find ways to have some variables only in registers
  4. Assemble the results

Code generation never generates errors.

The level of optimization can be controlled with various flags to the compiler. The most common are

  • -O0 meaning no optimization at all: every variable is written to memory and every line of C code has one or more lines of machine code
  • -O1 meaning basic optimization: use registers where possible, but every line of C code has one or more lines of machine code
  • -O2 meaning normal optimizations: change code in ways that always make it faster
  • -O3 meaning aggressive optimizations: change code in ways that usually makes it faster, but might also make it bigger or slower in certain circumstances
  • -Os meaning optimize for program size, not program speed

Each of those -O flags enables a group of optimizations which can be individually enabled or disabled with -f... flags like -finline-functions (which will replace some calls with the entire code of the called function) and -fno-inline-functions (which disables -finline-functions).

Code generation, together with all previous steps of compilation, is performed by running clang -c myfile.c (or gcc -c myfile.c). To stop before assembling, use -S instead of -c.

Example: The assembly generated by clang -c -Os on the file used above is

50                   	push   %rax
f2 0f 10 05 00 00 00 	movsd  0x0(%rip),%xmm0
00 
bf 00 00 00 00       	mov    $0x0,%edi
b0 01                	mov    $0x1,%al
e8 00 00 00 00       	callq  15
31 c0                	xor    %eax,%eax
59                   	pop    %rcx
c3                   	retq   

Note the optimization: the math ((1+2)>>1) has been evaluated at compile time to 1, and its floating-point equivalent 1.0 is placed in a memory location to be loaded by the movsd instruction. (The mov $0x1,%al is not that value; it is part of the calling convention for variadic functions like printf, recording how many vector registers the arguments use.)

Three of those instructions (movsd, the first mov, and callq) have placeholder addresses of all 00 bytes, which will be filled in during linking.

The output of code generation is not the finished assembly or machine code we’ll run. Rather, it is something called a “relocatable object file” that contains

  • The machine code, with placeholders for addresses so that multiple files can be merged without address collision
  • Constants and global variables
  • The names of functions and variables that other programs can use
  • If compiled with -g, a rough mapping between source code and the resulting machine code to aid in debugging

Linking

Most code depends on code written by others. There are two ways to connect code together; one is called “static linking” or often simply “linking”.

Static linking combines several relocatable object files into a single executable. It places each object file’s assembly into memory, updates all jump and call targets to those new locations, and links up uses from one file with definitions in other files.

Static linking can generate errors if two of the object files both define the same name.

Dynamic linking is like deferred static linking. It has the same general goal: to connect multiple files. But it picks some of those files and says “don’t bundle this with the binary I’m creating now; instead, load it from a separate file each time I run the program.” Dynamic linking saves disk space and reduces memory usage, and it can support fixing a bug in a library for all programs without recompiling them; but it can also lead to errors when two programs expect different versions of the same library.

If you run clang or gcc without the -c or -S flags, it will perform linking. You can also run the linker directly as ld, though we’ll never have you do that.

Example: The assembly generated by clang -Os on the file used above is about 150 assembly instructions, including a statically-linked printf that invokes a dynamically-linked printf@GLIBC_2.2.5, where GLIBC_2.2.5 names a library file to be loaded at runtime. Linking also resolves the addresses in main, giving

50                   	push   %rax
f2 0f 10 05 d7 0e 00 	movsd  0xed7(%rip),%xmm0
00 
bf 10 20 40 00       	mov    $0x402010,%edi
b0 01                	mov    $0x1,%al
e8 f3 fe ff ff       	callq  401030
31 c0                	xor    %eax,%eax
59                   	pop    %rcx
c3                   	retq   

Note that xmm0 now has the address of the floating-point literal, edi now has the address of the string literal, and callq now goes to the address of the statically-linked printf function.

Loading

Loading takes a program in a file and puts it into memory so it can run. It resolves dynamic links and may move some memory around to reduce the chance that certain security vulnerabilities will work.

Loading is performed by the operating system, and most OSes provide several ways to do it, such as

  • clicking on an icon in your operating system
  • typing ./myprogram in a terminal window
  • if the directory containing myprogram is in “the path”, typing myprogram in a terminal window
  • using the execve system call or its relatives

Copyright © 2023 Daniel Graham, John Hott and Luther Tychonievich.
Released under the CC-BY-NC-SA 4.0 license.