Creating a compiler: Part 2, lexing
In the previous article we briefly saw the different stages that allowed us to create an executable. Let’s now discuss the first phase in more detail. In this stage, we will not only have to detect and extract the tokens from the source file, but also verify them. Finally, all of these tokens will be saved since they are essential for carrying out the following stages.
Some good practices
Before going any further, it is important when tackling a project of this size, to set some good practices. Here is a list, which may seem obvious to some, of habits to adopt:
- Set a writing standard, that is to say choose the way you will indent your code, and the writing style you will adopt (pascal case, camel case, C case, etc…).
- Try as much as possible to make your code as close to a spoken language as possible. To do this, use explicit identifier names and break your code down into a series of small functions and classes. A function should not exceed an average of fifty lines.
- Avoid putting code in a header file. For templates and inline functions, prefer the use of an external file included in your header file (the extension is generally .hxx under Unix).
- Document your code. A comment should be written before starting to write the behavior of a function to which it is attached.
- Try to have only one class per file.
- Avoid
using namespace std;. For example, prefer a simpleusing std::cout;, if your goal is just not to rewritestd::before acout. See: On the proper use of using namespace - Use namespaces to modularize your code.
- Use and abuse of assert !
Assertis a macro function, which allows you to ensure that a predicate is true. In release mode, asserts disappear purely and simply, so there is no additional cost.
Design patterns
A design pattern is an elegant and proven solution to a recurring problem. Many design patterns exist and I will not detail them all. This project uses three: the singleton, the flyweight and the visitor. Each of these design patterns will be explained as we go along.
Symbols and expressions
In our language, there are words and symbols. In order to easily modify them
later, we will start by creating a class that will be responsible for
externalizing the list of valid words. We will call this class Configuration.
This class representing a unique configuration must not be duplicated, and be
accessible anywhere in our code. The singleton design pattern is therefore
perfectly suited.
Explanation of the singleton design pattern
The purpose of this design pattern is to ensure the uniqueness of a target object. The code necessary for its realization is not very complicated. We will create a class as generic as possible. This class will be templated by the object whose uniqueness must be ensured. First of all, we will make it impossible to construct such an object. To do this, we will establish the visibility level of the default constructor as being inaccessible. It is now impossible to instantiate the class directly. To instantiate such a class, we will create a public, and especially static method (that is to say that can be used without instantiating the class where it is located), and manage the construction of this class through this method. The method will ensure that it is always the same object that will be returned, and not a duplication. To achieve this, it is sufficient to use a static local variable, which by definition, will be initialized only once. Thus, it is impossible to make a copy of this same object.
Here is what our class looks like (This is a simplified and compacted version):
template<typename T>
class Singleton
{
protected:
Singleton() {}
public:
static T& getInstance()
{
static T object;
return object;
}
};
The Configuration class will contain a key-value association container
(std::map), which will associate a type of word with the representation we want
to make of it. For example, I can associate the word str with ", to indicate
the symbol that allows the opening and closing of a character string. You will
find all of these symbols in the Configuration.hxx file. Now, for our
configuration class to be a singleton, it is sufficient to inherit from this
class as follows:
class Configuration : public Singleton<Configuration>
{
friend class Singleton<Configuration>;
private:
Configuration();
std::map<std::string, std::string> _keyWords;
};
You will notice that the constructor of this class is set to private visibility,
in order to prevent the construction of this said-class. In addition, the
presence of the friend keyword may seem strange here. Although it is optional,
it will allow to lighten the writing when calling an instance. Indeed, to be
able to retrieve the instance you have to write:
Configuration& cfg = Singleton<Configuration>::getInstance();
Which is not very elegant. By indicating that Configuration is a friend of the
singleton, it is possible to simplify the writing to:
Configuration& cfg = Configuration::getInstance();
This writing being all the more intuitive.
The Error class
Whatever the stage, in case of invalid source code, it will be necessary to report the error to the user. To help him in this process, it is of course necessary to indicate the type of error, where it is located and to accompany it with an explanatory message. All this information will be grouped within this class, whose simplified code is as follows:
class Error
{
public:
Error(const Error::type type, const std::string msg, const int line)
: _type(type), _msg(msg), _line(line)
{
}
private:
const type _type;
const std::string _msg;
const int _line;
};
The Error Handler class
When an error occurs, we generally do not stop immediately. We try to detect as
many as possible. This is why even if a source code is wrong, we will continue
to analyze the file in order to present the user with the maximum number of
errors at once. This class is not very complicated to understand and will
essentially contain a list of errors (std::vector<Error*> _errors;).
The Symbol class
Each time we retrieve a token, we will not only have to remember the textual
symbol, but also the information associated with it, such as the line where it
was present, as well as the type of token (for example, if it is a simple
expression, an operator, a character string, etc…). The Symbol class will be
responsible, depending on the textual symbol given to the constructor, for
deducing the type of token. By delegating this responsibility to this class, we
will greatly simplify the one that will be in charge of lexing.
This class will have three constructors, the first will ask for all the
information necessary to create a symbol (token, line and type), the second will
only ask for two pieces of information, and will deduce the type of token from
the token, and finally the third will construct a new symbol from another.
Finally, you will note the presence of a check() method which will allow to
verify, in the case where the token is of identifier type (that is to say if it
is a variable or function name), if this name is correct. For a name to be valid
it must be composed only of letters, numbers, and _ characters, and start with
a letter. Finally, the << operator is redefined in order to easily display a
symbol on the screen.
Our Symbol class still has a flaw. If a token appears several times, then in
the list of symbols, we will have the character string representing it in memory
several times. For optimization purposes, we will ensure the uniqueness of each
string encountered. To do this, we will now study the flyweight design pattern,
which meets this need.
The Flyweight design pattern
Imagine the following situation: you are trying to code a basic word processing
software. Each typed letter has a visual representation depending on its size,
font and color. You will therefore naturally create a Letter class. Thus, for
each typed letter, a new instance of Letter is created. The number of typed
letters can quickly become large, and memory will be saturated very quickly. An
intelligent approach is to simply check when creating a letter, that it is not
already present with the same characteristics. If this is the case, you might as
well use the one that already exists. This trick will save a lot of memory
space.
In C++, to create a generic flyweight class, you need very little. First of all,
let’s use a container common to all instances of the same class. This container
must ensure the uniqueness of each element, so we will use a set. Then it is
sufficient to check, when creating a new element, whether it is present or not.
If this is not the case, we add this element. We then return the address of the
element, which will necessarily exist. In addition, we will redefine all the
assignment and comparison operators in order to make our class as easy to use as
possible. Finally, static clear() and size() methods will allow us to
respectively empty and obtain the size of the shared container.
Here is an excerpt from the simplified class:
template <typename T>
class Flyweight
{
public:
Flyweight(const T& elt);
static void clear();
static unsigned int size();
const Flyweight& operator=(const T& s);
bool operator==(const T& elt);
bool operator!=(const T& elt);
protected:
static std::set<T> set;
const T* _elt;
};
The SharedString class
Now that we have created this design pattern, we will apply it to our project.
To do this we will create the SharedString class which will allow to share in
memory the instances of similar character strings. The creation of this class is
trivial since it is sufficient to inherit from a specialized flyweight. Some
additional methods are present, such as the possibility of retrieving a
character at a given position, obtaining the size of the string, as well as
defined interactions with character strings in the form of char*. This allows
this class to be used exactly like a std::string, which makes its use very
intuitive.
Here is an excerpt from the simplified class:
class SharedString : public Flyweight<std::string>
{
public:
SharedString(const SharedString& s);
SharedString(const std::string& s);
SharedString(const char* s);
const SharedString& operator=(const char* s);
bool operator==(const char* s);
bool operator!=(const char* s);
char& operator[](const unsigned int i);
unsigned int length() const;
};
The Lexer class
Now that we have created everything we need, let’s get to the main part: the
lexing. Our Lexer class is not very difficult to understand. It has a list of
errors (it encapsulates an Error Handler class), and a list of symbols. You
just have to retrieve each of the tokens and update the symbol table as you go.
In case of an error, we ignore the current token, add a new error to the error
list, and continue trying to extract the following tokens.
Let’s now detail the method used to extract the tokens. First, we check that the current token is not a comment opening token, in which case, we go directly to the next line, since a comment is by definition ignored. We also check that the current token is not a string opening token. If this is the case, then we just find the closing character of the string and save the string thus recovered as a token in its own right.
It is important to differentiate keywords from atomic symbols. A keyword must be
written in a strict way, that is to say without it being attached to another
token, which is not the case for atomic symbols. For example: begin end are
two keywords, but beginend will be considered a simple expression. On the
other hand, if we take the atomic symbol >=, we realize that 4 >= 5 and
4>=5 are two identical expressions.
We will start by detecting atomic symbols with two characters. Indeed, you
should always look for the largest tokens first to avoid any ambiguity. For
example, it is important when you encounter the following string: <=, to
detect the atomic symbol <= and not two atomic symbols that follow each other.
If the expression being analyzed is neither an atomic symbol nor a keyword, then
all that remains is to check whether it is a value or a variable. If the current
expression is composed only of digits, then it is a value. Otherwise, if it is
none of the previously described token types, then it can only be a variable.
Implementation
Now that our Lexer is finished, we will just set up an option system. This is
trivial, you just have to consider that the first argument of the command line
(argv[1]), represents the option, and the following arguments, the files
passed to our binary. We will therefore create the following two options: -l,
which will launch the lexing on the files given in arguments, that is to say
which will display any lexing errors, and -L, which in addition to doing what
the first option does, will also display information on all the recovered
tokens, that is to say the type of token, the line where they were extracted,
and their textual symbols. This implementation is simple enough that I will not
detail it.
Conclusion
In the next article, we will discuss parsing, but for the most impatient among you, the complete project can be retrieved at this address: https://github.com/cptpingu/Cubs/tree/master/cubs