Compiler

A compiler has always seemed to me to be one of the supreme examples of a serious computer program. It operates on other programs, translating them from one language to another. It must simultaneously work with high-level concepts and manipulate low-level data. Although basically a batch program, it has many stages and numerous opportunities for optional improvements.

For a long time, it has been my ambition to write one.

Unlike many compiler writers, my goal isn’t to implement a new language, or explore new compilation techniques, or write a better compiler. I’m aiming simply to write a compiler, for fun and learning. I do not expect to invent anything new or better, except in the subjective senses that it will be new to me, and better for my understanding of compilers.

Source language

The source language is called E, and has no formal definition. It’s an imperative, C-like language. Sometimes when you describe a language as C-like, people assume it has all the warts of C. In my case, in a discussion of control structures, it was assumed that E had gotos. Well, it doesn’t. It doesn’t even have for-loops (yet).

E is C-like in that it uses { and } for blocks and ; as a statement separator. It has declarations of the form int x or int f(int y, int z) { ... }.

But it has no pointers, no typedefs, and no preprocessor. Instead, it has tuples, function types, and closures.  Here’s an actual working example, a factorial function using continuation passing style:

int factorial_cps(int x, (int -> int) c)
{
    if (x <= 1)
    {
        return c(x);
    }
    else
    {
        int old_x = x;
        (int -> int) old_c = c;
        return factorial_cps(x - 1, int lambda(int y) {
            return y * old_c(old_x);
        });
    }
}

public int factorial(int x)
{
    return factorial_cps(x, int lambda(int x) { return x; });
}

Why is it mucking around with old_x and old_c? That’s a workaround for a compiler bug, which I will describe, since I’m quite proud of the fact that my compiler is advanced enough to have interesting bugs.

  1. The recursive call to factorial_cps in the else branch is a tail call; the caller returns directly after making it.
  2. The compiler optimises this by inserting an assignment of the call’s arguments to the function’s parameters x and c, followed by a restart instruction. This instruction translates to an edge in the CFG leading back to the top of the function.
  3. Since the parameter list to the function is just a tuple, the assignment (without the workaround) would look like: (x, c) = (x - 1, int lambda(int y) {...});
  4. Unfortunately the source expressions in the tuple overlap the destination expressions; in particular the second argument (the closure) makes use of x and c. Depending on what mood it’s in, the compiler could happily generate the code to overwrite one of those variables before it generates the closure creation.
  5. Consequently the closure will lose track of the variables’ original values; it won’t operate on the correct value of x, and it could (conceivably) call itself rather than the passed-in closure c! (A sketch of the naive and safe lowerings follows this list.)
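
To make the hazard concrete, here is a sketch of the two lowerings in C-style pseudocode (the temporaries t1 and t2 and the make_closure call are illustrative names, not the compiler’s real internals):

/* Naive lowering of (x, c) = (x - 1, int lambda(int y) {...}):
   x is overwritten before the closure is created. */
x = x - 1;
c = make_closure(lambda_body, x, c);    /* captures the NEW x and c: wrong */

/* Safe lowering: evaluate every source before writing any destination. */
t1 = x - 1;
t2 = make_closure(lambda_body, x, c);   /* captures the original x and c */
x = t1;
c = t2;

This is the classic parallel-move problem; the old_x and old_c variables in the example above perform the same sequencing by hand.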

I have a plan to fix it, as described in the issue tracker (my procrastination list).

Parsing

The first major stage of compilation is parsing. Parsing, in most compilers, is the translation of a human-readable text-based program into a symbolic representation of the program: an abstract syntax tree.

Parsing has historically been considered a major topic in compilation. The parser generator YACC is “Yet Another Compiler Compiler”. The cover of the famous Dragon Book shows a knight, armed with an “LALR parser generator”, battling a dragon labelled “Complexity of Compiler Construction”.

Yet I have discovered that parsing is only a small and relatively easy bit of writing a compiler! Admittedly, I use Bison (a GNU version of YACC), so perhaps it’s easy because YACC has done the hard work for me.

There were still a few interesting problems:

  1. Proper precedence of operators. YACC can do most of this for you using built-in precedence declarations, but I opted to do it using a hierarchy of nonterminals, each of which corresponds to a precedence level.  For example, to have a * b + c * d parsed using the standard convention, the grammar defines a sum nonterminal as either a product, or a sum + a product. product itself is defined similarly in terms of atom, which is either a single name or constant, or an expression in brackets.  (See the grammar sketch after this list.)
  2. Nesting of if/else statements. I used the technique from this excellent page.
  3. Differentiating declarations from expressions.
  4. Recovering from syntax errors so that other errors in the input can be picked up and reported too.
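
For item 1, the hierarchy looks something like this as Bison rules (a minimal sketch; the token names NAME and CONSTANT are illustrative):

%token NAME CONSTANT
%%

/* Each precedence level gets its own nonterminal; left recursion
   gives left associativity. */
sum     : product
        | sum '+' product
        ;

product : atom
        | product '*' atom
        ;

atom    : NAME
        | CONSTANT
        | '(' sum ')'
        ;

With these rules, a * b + c * d can only parse as (a * b) + (c * d), because a sum is built from products, while an unbracketed product cannot contain a sum.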

I’ve not yet solved the last two.  For item 3, my understanding of the usual technique is that the lexer will identify a name as a type name and return a terminal to that effect, rather than a terminal for a variable name.
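
That technique is sometimes called the “lexer hack”. A minimal sketch in C, assuming a hypothetical symtab_is_type query and illustrative token codes:

/* Token codes as the parser might define them (illustrative values). */
enum { IDENTIFIER = 258, TYPE_NAME = 259 };

/* Hypothetical symbol-table query: is this name a known type? */
extern int symtab_is_type(const char *name);

/* Called by the lexer once it has scanned a name, to choose which
   terminal to hand to the parser. */
int classify_name(const char *text)
{
    if (symtab_is_type(text))
        return TYPE_NAME;    /* lets "foo x;" parse as a declaration */
    return IDENTIFIER;       /* an ordinary variable or function name */
}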

Regarding item 4, the compiler would benefit from a general mechanism for collecting errors and reporting them.  Another tricky part of this is storing the input location (line number) of a problematic node — sometimes it may not be obvious because code can move around significantly during compilation.
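
One possible shape for that mechanism, sketched in C (the fixed-size array and the names are assumptions for illustration):

#include <stdarg.h>
#include <stdio.h>

#define MAX_ERRORS 100

struct error {
    int line;             /* input location of the problematic node */
    char message[256];
};

static struct error errors[MAX_ERRORS];
static int num_errors = 0;

/* Record an error but keep compiling, so later errors are found too. */
void report_error(int line, const char *fmt, ...)
{
    if (num_errors < MAX_ERRORS) {
        va_list args;
        errors[num_errors].line = line;
        va_start(args, fmt);
        vsnprintf(errors[num_errors].message,
                  sizeof(errors[num_errors].message), fmt, args);
        va_end(args);
        num_errors++;
    }
}

The hard part, as noted, is making sure every node carries a meaningful line to pass in here.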

Current status

The compiler is currently broken!  I started making changes to the register allocator and then got sidetracked for a few months.  I have lots of ideas though (see the issue list).

Choosing C for the compiler’s implementation language felt natural at the time.  I don’t worry about garbage collection much — I have vague plans to use Boehm at some stage — but even so pointer errors are not uncommon.  The main programming difficulty is remembering to maintain the numerous fields that nodes can contain.  For instance, an expression node might be created in a compiler stage to store a temporary.  What is the line number of that temporary?  It needs to be copied from whichever expression impels its creation.  What about the type?  That needs to be set too, otherwise something will crash later on.
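
For illustration, a helper that centralises that bookkeeping might look like this (a sketch; the node type, new_node and EXPR_TEMP are stand-ins, not the compiler’s real names):

/* Illustrative node type; the real nodes have many more fields. */
struct node {
    int kind;
    int line;            /* input location, for error messages */
    struct type *type;   /* must be set before later stages run */
};

/* Create a temporary standing in for an existing expression.  Copying
   the position and type here means no stage can forget to do it. */
struct node *make_temp_for(struct node *expr)
{
    struct node *temp = new_node(EXPR_TEMP);   /* hypothetical constructor */
    temp->line = expr->line;   /* so error messages point somewhere sensible */
    temp->type = expr->type;   /* unset types cause crashes in later stages */
    return temp;
}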

I have often forgotten things like that and had to track down obscure bugs.  I wonder whether a language with stronger types and more static analysis would help?

Interesting things I might blog about in future (even though they don’t entirely work):

  • Some interesting conundrums with implementing closures in a low-level language.
  • How the register allocator works (or is supposed to).  (See my StackOverflow question Register allocation and spilling, the easy way for some insights into the direction I was taking it.)
  • Various CFG operations: building it from the AST; inlining CFGs; optimising tail-calls; emitting it as assembly language.
  • The DFA implementation and its current (and possible) uses.
  • The node class hierarchy and its convoluted implementation in C.

I can’t hope to contribute much to compiler theory, or even to provide much practical knowledge for other compiler writers, but hopefully I can describe some of the interesting problems that arise from it, and impart some enthusiasm for the topic.  After all, compilers occupy a special niche in computer science: they are programs whose inputs are programs, and whose outputs are programs.  Understanding them is essential to a complete understanding of computers.


7 Responses to Compiler

  1. erentz says:

    Nice :) when can we see The E Programming Language book?

    • ejrh says:

      You have to remember that the compiler comes first — I’m implementing what’s fun to implement, then making up an input language for it. Also E is already taken; I’ll have to call mine E# instead. Or E++.

  2. Neil Leslie says:

    Hi Edmund
    why is the type of cpsfac int -> (int -> int) -> int?
    surely it should be int -> (int -> alpha) -> alpha?
    and shouldn’t the new continuation be \y -> c_old (x_old * y)?

    in a more lambda-ish notation
    cps_fac(0, k) => k(1)
    cps_fac(s(n), k) => cps_fac(n, \y->k(s(n) * y))

    with a trace:
    cps_fac(3, k)
    => cps_fac(2, \y -> k(3 * y))
    => cps_fac(1, \x -> ((\y -> k(3*y)) (2*x)))
    => cps_fac(0, \z -> ( \x -> ((\y -> k(3*y)) (2*x))) (1 * z))
    => (\z -> ( \x -> ((\y -> k(3*y)) (2*x))) (1 * z)) 1
    => ( \x -> ((\y -> k(3*y)) (2*x))) (1 * 1)
    => (\y -> k(3*y)) (2 * (1*1))
    => k (3*(2 * (1*1)))

    I used to know a little bit about programming CPS style :-)

    • ejrh says:

      I’m starting from the “how do I do this imperatively on a typical real-world CPU” end of the puzzle, and gradually (glacially) working my way towards implementing what I suspect you’re implying is the ideal language. To paraphrase Freud, the goal of all programming language design is Haskell!

      It’s possible I’ll figure out int -> (int -> alpha) -> alpha one day, and I do now see that it is required for general continuations (thanks). My type system is ints and stuff you can build out of ints using tuples and lambdas — it’s quite constructivist really, you’ll be glad to notice — but no type variables yet.

      Regarding the factorial continuation: hmm. I’m not sure. I think mine worked but I see yours has a tail call, which is an aesthetic point in its favour, at least. Is it more correct? Unfortunately all the examples on Wikipedia are in Scheme so I guess we’ll never know! ;-)

      I mentally ran a similar example when I wrote the post. In full syntactic glory:

          factorial(3)
          = factorial_cps(3, int lambda(int x) { return x; })
          = factorial_cps(2, int lambda(int y) { return y * int lambda(int x) { return x; }(3); })    /* Since x > 1 */
          = factorial_cps(1, int lambda(int y) { return y * int lambda(int y) { return y * int lambda(int x) { return x; }(3); }(2); })   /* Since x > 1 */
          = int lambda(int y) { return y * int lambda(int y) { return y * int lambda(int x) { return x; }(3); }(2); }(1)   /* Since x <= 1 */
          = 1 * int lambda(int y) { return y * int lambda(int x) { return x; }(3); }(2)   /* Call the outermost closure with its argument 1. */
          = 1 * 2 * int lambda(int x) { return x; }(3)   /* Call the next closure with its argument 2. */
          = 1 * 2 * 3   /* Call the last closure with its argument 3. */
          = 6
    • ejrh says:

      Argh. Yes you are right. The continuation should be called last, because it’s what we continue with after all our work is done. This ties in with the correct type of the continuation being int -> alpha. My example above uses the wrong type, and consequently places unnecessary restrictions on the type of the passed-in continuation, such as demanding that it returns an int.

      Another demonstration of how getting the types right helps to get the program right! :-)

      • Neil Leslie says:

        When I said I used to know a bit about continuations, I should have said: “My PhD thesis was on continuations in Martin-Lof’s Type Theory” :-)

  3. Pingback: Some practical uses of goto | EJRH
