A compiler from the user’s perspective is a software that reads input source files and compiles them. The output of the compiler is usually one main executable file and some auxiliary files. The compiler should be fast and should generate optimized code.
But for the compiler designer a compiler is a beautiful balance between data structures and algorithms. Both are needed to quickly scan source files, to parse the tokens, to generate intermediate code, to optimize it and to link modules. Each compiler stage needs the data in some format. Even highly optimized algorithm would be inefficient if the data would not be stored efficiently. One of the most important data structures in every compiler is symbol table.
Symbol table is a special data structure that holds all symbols, from identifiers to internally generated nodes. Compiler symbol table must contain data structures that will hold string values for symbol names, integer values for data pointers, bit values for boolean flags and fields for special purposes. The organization of the symbol table must be such that it is possible to quickly search for a symbol, to quickly move to the next one. to easily add a new symbol at any position, to easily move data from one place to another and not to use much memory. When you try to combine all the requirements you will find that it is not so easy to decide in which form the data should be stored. One of the compromises is to use different symbol tables for different kind of data.
For example, the symbol table that stores the identifiers needs to efficiently store strings of variable length with associated attributes. One of the functions most frequently called during the scanning of the source file is to check whether the identifier is already in the symbol table. The brute force method to check all identifiers would be very inefficient. Therefore, a better method has to be found. A common approach is to use hash tables. There is a hash function that for each identifier calculates some integer value. This value should only depend on the identifier name. This value needs to be power of 2 and few bits is enough. For each hash value there is a separate linked list of identifiers. So the hash function determines in which list the identifier will be stored. This way we can minimize the search count.
Another example is symbol table that, for example, holds the nodes of program control flow. You need to be able to quickly move in both directions starting from any node. This requirement implies use of two-way linked lists.
The best way to learn about symbol tables, hash functions, linked lists and algorithms is to examine the code of some compiler. You will need some time to become familiar with the functions and the data used but then you will have an overview of the whole picture. Every compiler is a symphony of data structures and algorithms.
Leave a Reply