First the specifics, then the curiosity

This text was first written in Polish, then machine-translated to English.

Bottom-up thinking - Moving from the visible effect to the mechanisms underneath. People call this bottom-up thinking or mechanical curiosity. For example, I see C++ turning into binary, and then I wonder what's really going on under the hood. There's another way to understand it: bottom-up thinking = proof, top-down thinking = hypothesis.

I was fascinated by what computers could do, and when I watched "Hackers" as a kid, I simply wanted to understand how these machines worked.

I put together a few simple Delphi programs. They worked just fine; some of them are still on my blog. Back then, I was incredibly excited to understand physics, math, and chemistry. I dreamed of a great career, and honestly, I still have some of those dreams.

The first program that actually did anything was Ryzyk Fizyk. It was used to spout out formulas for simple physics problems. I showed it to the school, and the principal immediately sensed some potential. And that's how I ended up with her invitation to the Junior High School Olympiad in Informatics.

I solved a few problems in an ancient language called Object Pascal. The fact that the language was object-oriented didn't hinder my problem-solving. What did was the syntax and lack of wider acceptance in the computer science community.

In 2008, during the summer, I went on vacation with some computer scientists. I was then introduced to C++. I started wondering how code is translated from a high-level language like C++ into binary code, meaning it's understood by a computer.

Then the idea of creating my own compiler was born. There were quite a few people there who wanted to create something like that, or were already involved in it. One idea I really liked was GLI - Goovno Language Interpreter.

Compiler Operation

A compiler, like many other programs, has input and output. A program in a high-level programming language is fed to the program's standard input, and the output is usually machine code. There are also transpilers, programs that translate code from one programming language to another.

Today, we'll delve into how the compilation process works. We'll take a simple program and write a simple compiler that will hopefully spit out a finished executable file. We'll learn how a text file is decomposed into a tree, and then we'll transform that tree into an executable file.

Decomposing a File

The first step in building a compiler is tokenization. Normally, we'd do this using a program like flex, but here we'll simplify things and move straight to building a syntax tree.

AST

The basis of a compiler's operation is the AST (Abstract Syntax Tree). It describes how a source file is constructed. By traversing this tree, we can perform various tasks using algorithms (DFS, checking for well-declared variables, generating assembly code, checking data consistency, etc.).

We'll build a simple syntax tree using the Recursive Descent algorithm. There are more advanced tree building methods (LALR) that allow for the creation of syntax trees in more advanced programming languages, but we'll focus on the simplest variant and implement the simplest non-trivial subset of the C language.

Our compiler will scan the data stream from the beginning of the file to the end. (In the meantime, it will likely use a stack and a few other data structures.)

Implementation

We are going to create a class in Pythnon, which inherits from list. This way, we have all benefits of a list. We can sort it, traverse, it is ordered etc. We will add an argument: name. This will be the name of the function in source code (like print(), foo(), bar(), etc).

class RecursiveDescentParser(list):
    def __init__(self, name):
        super().__init__()
        self.name = name

We need to parse the text from line 0 to last character, one character at a time. We will do this the following way:

[text starts] ------------ > :c -------------------> [text ends]

We are going to progress through the text stream and when there is an opening bracked, call the parser recursively. On closing bracket, we return from the recursive call (this is where Recursive Descent Parser took its name).

On return, we append the returned object to self. This is where our list comes to use: an object in our parser is a list of code blocks, so for example:

function1 () {
}

function2 () {
}

Will each be treated as separate object on our list. We can then iterate recursively through these objects and obtain somewhat simplified syntax tree.

What we will do is the following: Scan text character by character, if we encounter pattern string (string) { or string () {, we call Recursive Parser recursively and return an object, which finally is appended to parent object.

We are going to use a regular expression to match

import re

class RecursiveDescentParser(list):
    def __init__(self, name):
        super().__init__()
        self.name = name

    def recurse(self, text, i=0):
        while i < len(text):
            m = re.match(r"(\w+)\s*\(([^)]*)\)\s*{", text[i:])

            if m:
                name = m.group(1)
                i += len(m.group(0))

                child = RecursiveDescentParser(name)
                i = child.recurse(text, i)
                self.append(child)

            elif text[i] == '}':
                return i + 1

            else:
                i += 1

        return i

if __name__ == "__main__":
    dp = RecursiveDescentParser("default")

    dp.recurse("""
    main() {
        function() {
            inner_function()
        }
    }
    """)

We should add a function to print the tree, so we can actually see if it's correct or not. We add the following to_string function. It will take indentation as argument, and traverse the syntax tree using depth-first search, in simple words meaning that it will go through each node inside self and call to_string recursively.

def to_string(self, indent="  "):
    res = f"{indent}{self.name} {{\n"

    for child in self:
        res += child.to_string(indent + "  ")

    res += f"{indent}}}\n"
    return res

Let's summarize our effort and show the entire parser in one piece of code:

import re

class RecursiveDescentParser(list):
    def __init__(self, name):
        super().__init__()
        self.name = name

    def to_string(self, indent="  "):
        res = f"{indent}{self.name} {{\n"

        for child in self:
            res += child.to_string(indent + "  ")

        res += f"{indent}}}\n"
        return res

    def recurse(self, text, i=0):
        while i < len(text):
            m = re.match(r"(\w+)\s*\(([^)]*)\)\s*{", text[i:])

            if m:
                name = m.group(1)
                i += len(m.group(0))

                child = RecursiveDescentParser(name)
                i = child.recurse(text, i)
                self.append(child)

            elif text[i] == '}':
                return i + 1

            else:
                i += 1

        return i

if __name__ == "__main__":
    dp = RecursiveDescentParser("default")

    dp.recurse("""
    main() {
        function() {
            inner_function()
        }
    }
    """)

    print("Parsed tree: ")
    print(dp.to_string())

This program returns:

Parsed tree:
  default {
    main {
      function {
      }
    }
  }