Now coding: parser generator

Thinking of great design is one thing, making it happen is a step fewer people are able to make. So here I’ll describe what have been done so far, what is currently being implemented, and why my husband chose these steps to begin with. The Spirit parser has been lost and wouldn’t be of any use anyway, so he’s making a fresh start. The first step into making a compiler, would be making a parser, which would transform code into an abstract syntax tree. But the first step into making C3 is to make a parser generator.

Currently used language compilers are often built around a context-free grammar. This is a concept directly taken from the Chomsky hierarchy, which describes the different types of formal languages. In the 2000s, Brian Ford gave a name to a concept that has already been floating for a while: the parsing expression grammar, or PEG. The Chomsky hierarchy describes how the sentence (or program) should be composed, while the PEG is going reverse, describing how to recognize a correct sentence of the language.

A context-free grammar usually leads to a bottom-up parser which is very complex. Two well known bottom-up parser generators are YACC (yet another compiler compiler) and Bison (GNU version of Yacc). They achieve a level of complexity that makes the generated code hard to understand and almost impossible to debug. On the other hand, there isn’t many top-down parser generators, probably because it’s a lot easier to directly write the parser which would be a recursive descent parser. PEGs are formal descriptions for this kind of parser. Spirit was described by its author as using modified context-free grammars but in fact it was using PEGs before anyone heard of them!

There are many advantages to use PEGs. The context-free grammar have to consider all possibilities at parsing time, and when facing two possible solutions, it cannot resolve itself without a hack. This is called a “conflict”. For example, many subsequent “if” are in the code, then comes an “else”. The context-free grammar doesn’t know with which “if” to associate the “else”, they are all possible solutions in the grammar definition. A hack is necessary to associate this “else” with the correct “if”. The PEG use a greedy algorithm with a priority to the different possibilities. It doesn’t need to be told what to do in all conflicting situation, it just take the highest priority. This makes the algorithm a lot simpler, and possibly more efficient. This is ideal for programming languages for which the main objective is to be parsed into something else, and the language can be designed based on this assumption. Chomsky’s theory of language was made to describe existing (natural) languages, it is normal that it may not be the most fitting tool to create a new one.

Another thing: when creating a context-free grammar parser, we need to have both a parser and a lexer (also called scanner). The lexer do a top-down operation on small pieces (numbers, variables, etc.) in order to let the parser handle the bigger chunks (instructions, functions, etc.) in a bottom-up way. When parsing top-down, no need for two tools, this is all handled in a single operation. The main reason why this can be done with PEG is that repetition is handled without recursivity. The more I learn about lexer, the more they feel like a big hack, good riddance, hi hi!

A funny thing is that Bjarne Stroustrup himself now regrets his choice of a bottom-up parser. At that time he was counseled by language and grammar specialists that this was the only way to go. But the implementation was so complex, he now thinks that a simple recursive descent parser would have done the job after all.

The type of parser the parser-generator is going to create is a packrat parser, which is a type of recursive descent parser. Recursive descent parsers have backtracking. That means that if the parser realize it took the wrong path because of its greedy nature, it will go backward an try the next solution in the priority table. With packrat parsing, if a subpart of the bad parsing finished correctly, it will be stored in a table to be reused in the following treatment if possible. This optimization is named memoisation.

So at this time, my husband is writing a PEG-packrat parser generator. This isn’t a necessary task, he could have just jumped into coding the parser itself. But having the parser generator will make things more flexible, and it’s also fun to implement (for him at least). There wasn’t any satisfying generator known when he started coding it, including Spirit which have poor compile-time performance and too much complexity coming from C++ templates techniques (after all, he wants to reimplement everything in C3 at some point). He now has a “PEG PEG”, which parse files of PEG description and is working toward PEG parser generator.

There are many concepts in this post that you may never have heard of. This is totally normal. I encourage you to follow the links and read about them. I don’t know if it will make you a better person, but this is all very interesting and essential to understand the fundamentals of programming languages.

Leave a Reply

Your email address will not be published. Required fields are marked *