Creating a simplified Pascal compiler in C++

Introduction : Why create a compiler ?

Every developer has to use a compiler. However, few really know how it works. This article and the following ones are therefore here, on the one hand to present the different stages of compilation, and on the other hand to explain how to create a program that will carry out all of these stages.

What is a compiler ?

Before the advent of programming languages, programmers had to know by heart the architecture of the machine on which they were working, and speak directly to it using a binary language. Very quickly, a language using textual shortcuts was invented: assembly.

This new language had several advantages: first of all, it was easier to read text using meaningful words, but above all a basic check was made in case of a typing error. However, this language had many flaws. It imposed few constraints on the developer, which could be advantageous, but was impractical for productivity and code robustness. Since type checking did not exist, many bugs resulted from it, and these were particularly difficult to detect.

Finally, the level of abstraction was low, the language being very close to machine language. To overcome this, so-called high-level programming languages were invented. The main characteristic of these was a high level of abstraction. That is to say that the language resembled a human language more than machine instructions. Thus, programming became less esoteric, and it was easier for a neophyte to embark on learning this field. In addition, these languages imposed fictitious constraints, such as strong data typing, which allowed for better robustness, or the automatic handling of certain concepts.

For example in Pascal, it is possible to concatenate two character strings using the + operator. In assembly, it would have been necessary to allocate memory for the new string, and to copy each of the characters of these two strings one by one. The advantage of a compiler is undeniable, and apart from a few specific areas, the vast majority of developers do not program in assembly. Many programmers have never even seen it!

How does a compiler work ?

There are many compilers. Since their operation is specific to a language, we will obviously not review them all. Instead, we will define what our language should look like. From there, we will establish what our compiler will be able to achieve.

The language

Creating a compiler is extremely long and can be particularly difficult. Since the goal of this project is to create a basic compiler in an acceptable amount of time, we will use a very simple language. The chosen language will have a syntax close to Pascal.

Let’s now define what the language will be able to do. First of all, it will be typed, that is to say that a variable can only contain a value of a compatible type. Three basic types will be sufficient: integer, boolean, and character strings.

Functions and recursion, in addition to being practical, are pedagogically very interesting to implement. Let’s include them, as well as the return keyword, which will allow to exit a function by returning a value. A function will be obliged to return something. The presence of functions therefore implies that the scope of variables becomes necessary. A variable can therefore be local or global.

A language without control structures would not be very interesting, so let’s add the possibility of making a choice (if), and of performing an operation in a loop (while).

Finally, for the sake of simplification, there will be no ambiguity on the evaluation of an expression since we will force the user to put parentheses explicitly, which avoids any priority problems. We will not deal with more complicated notions, such as arrays, memory management, pointers or the notion of object.

Finally, let’s include three built-ins, that is to say three pre-existing functions, read, print and exit, which will respectively allow to retrieve information entered by the user, to display a value on the screen and to exit the program before the end of its execution.

An example of a program that asks the user for a value and displays its factorial:

var a : integer;
var s : string;

function facto (nb : integer) : integer;
var res : integer;
begin
   if nb <= 1 then
   begin
      return 1;
   end
   res = nb * facto(nb - 1);
   return res;
end

begin
   print("Enter a number: ");
   read(a);
   res = facto(a);
   s = "The factorial of ";
   print(s);
   print(a);
   print(" is ");
   print(res);
   print("\n");
   exit(0);
   print("Ignore");
end

The grammar of the language

Now that we are set on the functionalities of our language, let’s write its grammar. You should know that any programming language, just like a real language, has writing rules. You can’t write just anything, the order of the words of the language is important. This is where this famous grammar comes in. It allows to specify in a clear and exhaustive way, how to arrange the words of the language among themselves. It is customary when writing it, to use a universal formalism. The one we will use will be the BNF (Backus-Naur Form) notation.

If you have no knowledge in the field of language theory, there is a good chance that reading this grammar will be a little complicated. Here are some basic rules to understand its meaning:

A terminal is an existing word of the language, and which as its name indicates, does not refer to another rule.
A non-terminal is a rule, which is not explicitly present in the language, and which is composed of other rules. We surround it with the signs < and > to differentiate it from terminals.
A rule is preceded by the sign ::= which indicates how it is constituted.
The | represents an OR, that is to say a choice between several rules. A rule or a terminal between [ and ] means that this element is optional.

A small example to illustrate this with the grammar rule of an if in C++:

<if> ::= if ( <expression> ) { <expression> } [ else { <expression>} ]

which could also be written:

<if> ::= if ( <expression> ) { <expression> }
       | if ( <expression> ) { <expression> } else { <expression>}

Here is the one in our language:

<program> ::= [<declarations>] [<funcs>] [<compound_instruction>]
<declarations> := <declaration>
                | <declaration> <declarations>

<declaration> ::= var <declaration_body> ;

<declaration_body> ::= <ids> : <type>

<ids> ::= <id>
        | <id> , <ids>

<funcs> ::= <func>
          | <func> <funcs>

<func> ::= <header_func> <compound_instruction>
         | <header_func> <declarations> <compound_instruction>

<header_func> ::= function <idfunc> ( [<arguments>] ) : <type> ;

<arguments> ::= <argument>
              | <argument> ; <arguments>

<argument> ::= <declaration_body>
             | <declaration_body> ; <argument>

<compound_instruction> ::= begin end
                         | begin <instructions> <end>

<instructions> ::= <instruction>
                 | <instruction> <instructions>

<instruction> ::= <affect> ;
                | <call_func> ;
                | <compound_instruction>
                | <if>
                | <while>
                | <return> ;
                | <exit> ;
                | <builtin_print> ;
                | <builtin_read> ;

<affect> ::= <id> = <expression>

<return> ::= return <expression>
<exit> ::= exit <expression>

<builtin_print> ::= print ( <expression> )

<builtin_read> ::= read ( <id> )

<if> ::= if <expression> then <compound_instruction>
       | if <expression> then <compound_instruction> else <compound_instruction>

<while> ::= while <cond> do <compound_instruction>

<call_func> ::= <idfunc> ( )
              | <idfunc> ( <expressions> )

<expressions> ::= <expression>
                | <expression>, <expressions>

<expression> ::= <operation>
/* This node may seem useless, but it is present for a future extension
   of the language, which I will not present in this article */

<operation> ::= <factor> <symb> <factor>
                | <factor>   /* operation without right factor */
                | - <factor> /* left factor = 0 */

<factor> ::= <id>
           | <call_func>
           | <number>
           | <stringexpr>
           | <boolean>
           | ( <expression> )

<id> ::= [a-zA-Z]([a-zA-Z0-9_]*[a-zA-Z0-9])?
<type> ::= integer | string | boolean
<symb> ::= > | < | >= | <= | != | == | + | - | * | /
<number> ::= [0-9]+
<stringexpr> ::= .*
<boolean> ::= true | false

The different stages of compilation

When a compilation is launched, several stages follow one another to lead to the creation of an executable. First of all, the majority of these stages are used to check the coherence of the written code. Here is a non-exhaustive presentation of the different stages that we are going to carry out. The technical details will be discussed in future articles dedicated each time to a specific phase.

The first stage, called “lexing”, consists of reading the file and retrieving tokens. A token is the smallest atomic piece defined in a grammar. If we take an analogy with the English language, a token would represent a word. So the following sentence: “Hello to everybody” would be considered valid, while: “Hello xvcfbjnb everybody” would cause a lexing error. Indeed, the word “Hello” and the word “everybody” are indeed present in the English dictionary, but the word “xvcfbjnb” is not. This stage will therefore check that certain impossible words are not present in the source code. To cause a lexing error in our language, we could write : if @_@ then. The word @_@ not being an acceptable token, the code is considered invalid.

The second stage, called “parsing”, will, from the previously retrieved tokens, check that the arrangement is correct. Indeed, it is not because we have not detected any invalid tokens, that they are necessarily in the right order. Let’s take an analogy with the language of Shakespear. If I say to you: “Hello, how are you?” you should understand me, whereas with “you ? here, Hello are” communication would be somewhat complicated. Well with our language, it’s the same thing. If the user enters “if then true if if”, then we are indeed in the presence of a parsing error. The arrangement of each of these tokens, do not respect the grammar that we have defined.

The next stage is called “binding”. Here we will check that each of the variables and functions are correctly linked between their declarations and their calls. To summarize, this phase is content to detect that a variable or a function is indeed declared once and only once.

Example:

var a : integer;
begin
  b = 50;
end

Here we can see that the variable b is not declared. So the compiler must warn the user, and refuse to create an executable.

Similarly:

var b : integer;
var b : integer;
begin
  b = 50;
end

Here the variable b is defined twice, so it is also a “binding” error. The same check is applied to functions, also taking into account that the number of arguments described, and the number of arguments actually passed, correspond.

Finally, the last verification stage is the one called “type checking”. This one checks that the typing of the expressions is correct and that each variable receives a value whose type is compatible.

For example:

var b : integer;
begin
  b = 50;
end

Here b, which is of integer type, will receive an integer value, so compatibility is assured, there is no error.

Whereas in the following example:

var b : integer;
begin
  b = "hello";
end

The variable b which is of integer type, receives a character string. This is obviously not possible, and it is this type of problem that will cause a type checking error.

Conclusion

Once all these stages have been validated, we consider our source code to be valid. All that remains is to interpret it, or to convert it into assembly. These two stages will be detailed later. This first article ends here. I hope that this first approach will have made you want to go a little further in understanding compilation. The following articles will be a little more technical.

Creating a compiler (part 1)

Creating a compiler step by step