Cubs

Pseudo-pascal compiler

Simplified Pascal Compiler

Cubs is a compiler and interpreter for a language based on Pascal and C. The language itself isn’t very advanced:

  • 3 types (integer, boolean, string).
  • 2 control structures (if, while)
  • 3 builtins (read, print, exit)
  • No pointers
  • No arrays.
  • Functions and recursive functions, with the return keyword.
  • 5 calculation operators +, -, /, *, %
  • 6 comparison operators ==, !=, >, >=, <, <=
  • No and or statements. (Not yet, but planned).
  • Parentheses are required on compound expressions. Ex: 1 + 2 + 3 doesn’t work, you have to write: 1 + (2 + 3)
  • Possibility of creating global or local variables.

The primary purpose of this project was to present the steps involved in compiling source code. This is why the compiled language is not complicated. Each step can be displayed, with options allowing you to see the desired steps. To briefly review the steps:

  • Lexing (-L option): We retrieve all the tokens in a file. A token is the smallest atomic piece defined in a grammar. For example, in a sentence, a token is a word. This step also checks that certain impossible words are not present. For example: if @ then will generate a lexing error, because @ is not a valid token.
  • Parsing (-P option): We check that all the retrieved tokens are arranged in the order defined by the language grammar. This step checks for misplaced tokens. For example: then if while is a parsing error, since this arrangement is incorrect. Create an AST, reused by all subsequent steps, using the visitor design pattern.
  • Binding (-B option): Checks that all variables and functions are properly defined and not redefined. For example: a = 10; will raise a binding error since a is not declared. Similarly: var a : integer; var a : integer; will also raise a binding error since a is declared twice.
  • Type Checking (-T option): Checks that the variable typing is respected. For example: var a : integer; then a = 10; is correct, while var a : string; then a = 10 is incorrect.

Once all these steps are completed, our source code is considered valid. We can then perform the following steps:

  • Execution (-X option): Executes the code and displays the execution path. Default option.
  • Compilation (-S option): Generates assembler, which only needs to be assembled with nasm as follows:
./minicompil -S prog.mmc > prog.asm
nasm -o prog.o -f elf -d ELF_TYPE prog.asm

THEN

ld -s --dynamic-linker /lib/ld-linux.so.2 -lc prog.o -o prog

OR

./auto_compile.sh prog.asm

Bonus Steps

Conversion to C++ (-C option): Converts the code to C++. Compilable by typing:

./minicompil -C prog.mmc > prog.cc

THEN

g++ -W -Wall prog.cc -o prog

OR

./auto_compile.sh prog.cc
  • Debugging (-D option): Acts as a debugger. Launches an execution, showing the state of all variables at each instruction.
  • All (-V option): Launches and displays all steps up to execution.
  • Grammar generator (-G option): Generates a valid grammar at random.
  • Dotty tree (-O option): Generates output for dotty from the created AST. (Dotty is a program that allows you to draw graphs and trees, among other things; see screenshot). An image can be created by typing:
./minicompil -O prog.mmc > prog.dot

THEN

dot -Tpng prog.dot > prog.png

OR

./auto_compile.sh prog.dot

Criticisms

  • The generated assembler is not optimized. Assembler purists will find some routines written inelegantly. Furthermore, I pass a lot of things through the stack.
  • Some pieces of code could be optimized; I’ll work on them.
  • The AST could be improved, for example, by creating only one node during a constant expression. E.g., 1 + (2 + 3) generates many nodes, whereas a single node containing 6 would be sufficient. * Overloading, inling, or and and instructions, and the ability to split the code into multiple files are not supported.
  • Not tested on Windows, but should work easily, since I’ve made it a point to have no dependencies on either flex/bison or boost…
  • The tree structure isn’t great (everything in src).
  • The generated assembler uses libc functions. I wish I didn’t have any dependencies on that.

Documentation

Technical documentation (33MB): http://0217021.free.fr/cubs/refman.pdf

Notes

Coded entirely under Linux with Emacs.

Successfully compiled under g++ 4.2.4.

Assembler assembled under NASM version 2.06rc2.

Comments in Doxygen format.

To compile: ./configure && make

To compile in debug mode (-g + -lefence + assert): ./configure –with-debugmax && make

To generate the documentation: make doc

To run the tests: make check

The latest version is freely available here: http://my-svn.assembla.com/svn/Cubs2

Under Unix: svn co http://my-svn.assembla.com/svn/Cubs2

Under Windows, there is tortoise svn which allows you to do this.

Another copy was made here: https://github.com/cptpingu/Cubs

Bibliography

Dr. Paul Carter: http://www.drpaulcarter.com/pcasm/, from whom I borrowed the only part of this project I didn’t create alone: ​​the read_int, read_char, print_int, and print_string routines.

My lectures on regular languages ​​and compilation, which I can’t share here.

Bugs

Please contact me if you find a bug: Contact me All constructive feedback is welcome.

Binaries

If you don’t want to compile the project, I’ve already created the binaries here:

Linux: http://0217021.free.fr/cubs/cubs

Windows: http://0217021.free.fr/cubs/cubs.exe

Detailed Explanations

For more detailed explanations, I invite you to consult the articles I wrote on this subject:

Building a Pseudo-Pascal Compiler in C++, Part 1

Building a Compiler, Part 2, Lexing.