Simplified Pascal Compiler
Cubs is a compiler and interpreter for a language based on Pascal and C. The language itself isn’t very advanced:
- 3 types (integer, boolean, string).
- 2 control structures (if, while)
- 3 builtins (read, print, exit)
- No pointers
- No arrays.
- Functions and recursive functions, with the return keyword.
- 5 calculation operators +, -, /, *, %
- 6 comparison operators ==, !=, >, >=, <, <=
- No and or statements. (Not yet, but planned).
- Parentheses are required on compound expressions. Ex: 1 + 2 + 3 doesn’t work, you have to write: 1 + (2 + 3)
- Possibility of creating global or local variables.
The primary purpose of this project was to present the steps involved in compiling source code. This is why the compiled language is not complicated. Each step can be displayed, with options allowing you to see the desired steps. To briefly review the steps:
- Lexing (-L option): We retrieve all the tokens in a file. A token is the smallest atomic piece defined in a grammar. For example, in a sentence, a token is a word. This step also checks that certain impossible words are not present. For example: if @ then will generate a lexing error, because @ is not a valid token.
- Parsing (-P option): We check that all the retrieved tokens are arranged in the order defined by the language grammar. This step checks for misplaced tokens. For example: then if while is a parsing error, since this arrangement is incorrect. Create an AST, reused by all subsequent steps, using the visitor design pattern.
- Binding (-B option): Checks that all variables and functions are properly defined and not redefined. For example: a = 10; will raise a binding error since a is not declared. Similarly: var a : integer; var a : integer; will also raise a binding error since a is declared twice.
- Type Checking (-T option): Checks that the variable typing is respected. For example: var a : integer; then a = 10; is correct, while var a : string; then a = 10 is incorrect.
Once all these steps are completed, our source code is considered valid. We can then perform the following steps:
- Execution (-X option): Executes the code and displays the execution path. Default option.
- Compilation (-S option): Generates assembler, which only needs to be assembled with nasm as follows:
./minicompil -S prog.mmc > prog.asm
nasm -o prog.o -f elf -d ELF_TYPE prog.asm
THEN
ld -s --dynamic-linker /lib/ld-linux.so.2 -lc prog.o -o prog
OR
./auto_compile.sh prog.asm
Bonus Steps
Conversion to C++ (-C option): Converts the code to C++. Compilable by typing:
./minicompil -C prog.mmc > prog.cc
THEN
g++ -W -Wall prog.cc -o prog
OR
./auto_compile.sh prog.cc
- Debugging (-D option): Acts as a debugger. Launches an execution, showing the state of all variables at each instruction.
- All (-V option): Launches and displays all steps up to execution.
- Grammar generator (-G option): Generates a valid grammar at random.
- Dotty tree (-O option): Generates output for dotty from the created AST. (Dotty is a program that allows you to draw graphs and trees, among other things; see screenshot). An image can be created by typing:
./minicompil -O prog.mmc > prog.dot
THEN
dot -Tpng prog.dot > prog.png
OR
./auto_compile.sh prog.dot
Criticisms
- The generated assembler is not optimized. Assembler purists will find some routines written inelegantly. Furthermore, I pass a lot of things through the stack.
- Some pieces of code could be optimized; I’ll work on them.
- The AST could be improved, for example, by creating only one node during a constant expression. E.g., 1 + (2 + 3) generates many nodes, whereas a single node containing 6 would be sufficient. * Overloading, inling, or and and instructions, and the ability to split the code into multiple files are not supported.
- Not tested on Windows, but should work easily, since I’ve made it a point to have no dependencies on either flex/bison or boost…
- The tree structure isn’t great (everything in src).
- The generated assembler uses libc functions. I wish I didn’t have any dependencies on that.
Documentation
Technical documentation (33MB): http://0217021.free.fr/cubs/refman.pdf
Notes
Coded entirely under Linux with Emacs.
Successfully compiled under g++ 4.2.4.
Assembler assembled under NASM version 2.06rc2.
Comments in Doxygen format.
To compile: ./configure && make
To compile in debug mode (-g + -lefence + assert): ./configure –with-debugmax && make
To generate the documentation: make doc
To run the tests: make check
The latest version is freely available here: http://my-svn.assembla.com/svn/Cubs2
Under Unix: svn co http://my-svn.assembla.com/svn/Cubs2
Under Windows, there is tortoise svn which allows you to do this.
Another copy was made here: https://github.com/cptpingu/Cubs
Bibliography
Dr. Paul Carter: http://www.drpaulcarter.com/pcasm/, from whom I borrowed the only part of this project I didn’t create alone: the read_int, read_char, print_int, and print_string routines.
My lectures on regular languages and compilation, which I can’t share here.
Bugs
Please contact me if you find a bug: Contact me All constructive feedback is welcome.
Binaries
If you don’t want to compile the project, I’ve already created the binaries here:
Linux: http://0217021.free.fr/cubs/cubs
Windows: http://0217021.free.fr/cubs/cubs.exe
Detailed Explanations
For more detailed explanations, I invite you to consult the articles I wrote on this subject: