Creating a compiler: Part 2, lexing

In the previous article we briefly saw the different stages that allowed us to create an executable. Let’s now discuss the first phase in more detail. In this stage, we will not only have to detect and extract the tokens from the source file, but also verify them. Finally, all of these tokens will be saved since they are essential for carrying out the following stages.

Some good practices

Before going any further, it is important when tackling a project of this size, to set some good practices. Here is a list, which may seem obvious to some, of habits to adopt:

Set a writing standard, that is to say choose the way you will indent your code, and the writing style you will adopt (pascal case, camel case, C case, etc…).
Try as much as possible to make your code as close to a spoken language as possible. To do this, use explicit identifier names and break your code down into a series of small functions and classes. A function should not exceed an average of fifty lines.
Avoid putting code in a header file. For templates and inline functions, prefer the use of an external file included in your header file (the extension is generally .hxx under Unix).
Document your code. A comment should be written before starting to write the behavior of a function to which it is attached.
Try to have only one class per file.
Avoid using namespace std;. For example, prefer a simple using std::cout;, if your goal is just not to rewrite std:: before a cout. See: On the proper use of using namespace
Use namespaces to modularize your code.
Use and abuse of assert ! Assert is a macro function, which allows you to ensure that a predicate is true. In release mode, asserts disappear purely and simply, so there is no additional cost.

Design patterns

A design pattern is an elegant and proven solution to a recurring problem. Many design patterns exist and I will not detail them all. This project uses three: the singleton, the flyweight and the visitor. Each of these design patterns will be explained as we go along.

Symbols and expressions

In our language, there are words and symbols. In order to easily modify them later, we will start by creating a class that will be responsible for externalizing the list of valid words. We will call this class Configuration. This class representing a unique configuration must not be duplicated, and be accessible anywhere in our code. The singleton design pattern is therefore perfectly suited.

Explanation of the singleton design pattern

The purpose of this design pattern is to ensure the uniqueness of a target object. The code necessary for its realization is not very complicated. We will create a class as generic as possible. This class will be templated by the object whose uniqueness must be ensured. First of all, we will make it impossible to construct such an object. To do this, we will establish the visibility level of the default constructor as being inaccessible. It is now impossible to instantiate the class directly. To instantiate such a class, we will create a public, and especially static method (that is to say that can be used without instantiating the class where it is located), and manage the construction of this class through this method. The method will ensure that it is always the same object that will be returned, and not a duplication. To achieve this, it is sufficient to use a static local variable, which by definition, will be initialized only once. Thus, it is impossible to make a copy of this same object.

Here is what our class looks like (This is a simplified and compacted version):

template<typename T>
class Singleton
{
protected:
  Singleton() {}
public:
  static T& getInstance()
  {
    static T object;
    return object;
  }
};

The Configuration class will contain a key-value association container (std::map), which will associate a type of word with the representation we want to make of it. For example, I can associate the word str with ", to indicate the symbol that allows the opening and closing of a character string. You will find all of these symbols in the Configuration.hxx file. Now, for our configuration class to be a singleton, it is sufficient to inherit from this class as follows:

class Configuration : public Singleton<Configuration>
{
  friend class Singleton<Configuration>;
private:
  Configuration();
  std::map<std::string, std::string> _keyWords;
};

You will notice that the constructor of this class is set to private visibility, in order to prevent the construction of this said-class. In addition, the presence of the friend keyword may seem strange here. Although it is optional, it will allow to lighten the writing when calling an instance. Indeed, to be able to retrieve the instance you have to write:

Configuration& cfg = Singleton<Configuration>::getInstance();

Which is not very elegant. By indicating that Configuration is a friend of the singleton, it is possible to simplify the writing to:

Configuration& cfg = Configuration::getInstance();

This writing being all the more intuitive.

The Error class

Whatever the stage, in case of invalid source code, it will be necessary to report the error to the user. To help him in this process, it is of course necessary to indicate the type of error, where it is located and to accompany it with an explanatory message. All this information will be grouped within this class, whose simplified code is as follows:

class Error
{
public:
  Error(const Error::type type, const std::string msg, const int line)
   : _type(type), _msg(msg), _line(line)
  {
  }
private:
  const type _type;
  const std::string _msg;
  const int _line;
};

The Error Handler class

When an error occurs, we generally do not stop immediately. We try to detect as many as possible. This is why even if a source code is wrong, we will continue to analyze the file in order to present the user with the maximum number of errors at once. This class is not very complicated to understand and will essentially contain a list of errors (std::vector<Error*> _errors;).

The Symbol class

Each time we retrieve a token, we will not only have to remember the textual symbol, but also the information associated with it, such as the line where it was present, as well as the type of token (for example, if it is a simple expression, an operator, a character string, etc…). The Symbol class will be responsible, depending on the textual symbol given to the constructor, for deducing the type of token. By delegating this responsibility to this class, we will greatly simplify the one that will be in charge of lexing.

This class will have three constructors, the first will ask for all the information necessary to create a symbol (token, line and type), the second will only ask for two pieces of information, and will deduce the type of token from the token, and finally the third will construct a new symbol from another. Finally, you will note the presence of a check() method which will allow to verify, in the case where the token is of identifier type (that is to say if it is a variable or function name), if this name is correct. For a name to be valid it must be composed only of letters, numbers, and _ characters, and start with a letter. Finally, the << operator is redefined in order to easily display a symbol on the screen.

Our Symbol class still has a flaw. If a token appears several times, then in the list of symbols, we will have the character string representing it in memory several times. For optimization purposes, we will ensure the uniqueness of each string encountered. To do this, we will now study the flyweight design pattern, which meets this need.

The Flyweight design pattern

Imagine the following situation: you are trying to code a basic word processing software. Each typed letter has a visual representation depending on its size, font and color. You will therefore naturally create a Letter class. Thus, for each typed letter, a new instance of Letter is created. The number of typed letters can quickly become large, and memory will be saturated very quickly. An intelligent approach is to simply check when creating a letter, that it is not already present with the same characteristics. If this is the case, you might as well use the one that already exists. This trick will save a lot of memory space.

In C++, to create a generic flyweight class, you need very little. First of all, let’s use a container common to all instances of the same class. This container must ensure the uniqueness of each element, so we will use a set. Then it is sufficient to check, when creating a new element, whether it is present or not. If this is not the case, we add this element. We then return the address of the element, which will necessarily exist. In addition, we will redefine all the assignment and comparison operators in order to make our class as easy to use as possible. Finally, static clear() and size() methods will allow us to respectively empty and obtain the size of the shared container.

Here is an excerpt from the simplified class:

template <typename T>
class Flyweight
{
public:
  Flyweight(const T& elt);
  static void clear();
  static unsigned int size();
  const Flyweight& operator=(const T& s);
  bool operator==(const T& elt);
  bool operator!=(const T& elt);
protected:
  static std::set<T> set;
  const T* _elt;
};

The SharedString class

Now that we have created this design pattern, we will apply it to our project. To do this we will create the SharedString class which will allow to share in memory the instances of similar character strings. The creation of this class is trivial since it is sufficient to inherit from a specialized flyweight. Some additional methods are present, such as the possibility of retrieving a character at a given position, obtaining the size of the string, as well as defined interactions with character strings in the form of char*. This allows this class to be used exactly like a std::string, which makes its use very intuitive.

Here is an excerpt from the simplified class:

class SharedString : public Flyweight<std::string>
{
public:
  SharedString(const SharedString& s);
  SharedString(const std::string& s);
  SharedString(const char* s);
  const SharedString& operator=(const char* s);
  bool operator==(const char* s);
  bool operator!=(const char* s);
  char& operator[](const unsigned int i);
  unsigned int length() const;
};

The Lexer class

Now that we have created everything we need, let’s get to the main part: the lexing. Our Lexer class is not very difficult to understand. It has a list of errors (it encapsulates an Error Handler class), and a list of symbols. You just have to retrieve each of the tokens and update the symbol table as you go. In case of an error, we ignore the current token, add a new error to the error list, and continue trying to extract the following tokens.

Let’s now detail the method used to extract the tokens. First, we check that the current token is not a comment opening token, in which case, we go directly to the next line, since a comment is by definition ignored. We also check that the current token is not a string opening token. If this is the case, then we just find the closing character of the string and save the string thus recovered as a token in its own right.

It is important to differentiate keywords from atomic symbols. A keyword must be written in a strict way, that is to say without it being attached to another token, which is not the case for atomic symbols. For example: begin end are two keywords, but beginend will be considered a simple expression. On the other hand, if we take the atomic symbol >=, we realize that 4 >= 5 and 4>=5 are two identical expressions.

We will start by detecting atomic symbols with two characters. Indeed, you should always look for the largest tokens first to avoid any ambiguity. For example, it is important when you encounter the following string: <=, to detect the atomic symbol <= and not two atomic symbols that follow each other. If the expression being analyzed is neither an atomic symbol nor a keyword, then all that remains is to check whether it is a value or a variable. If the current expression is composed only of digits, then it is a value. Otherwise, if it is none of the previously described token types, then it can only be a variable.

Implementation

Now that our Lexer is finished, we will just set up an option system. This is trivial, you just have to consider that the first argument of the command line (argv[1]), represents the option, and the following arguments, the files passed to our binary. We will therefore create the following two options: -l, which will launch the lexing on the files given in arguments, that is to say which will display any lexing errors, and -L, which in addition to doing what the first option does, will also display information on all the recovered tokens, that is to say the type of token, the line where they were extracted, and their textual symbols. This implementation is simple enough that I will not detail it.

Conclusion

In the next article, we will discuss parsing, but for the most impatient among you, the complete project can be retrieved at this address: https://github.com/cptpingu/Cubs/tree/master/cubs

Creating a compiler (part 2)

Creating a compiler step by step