Tag Archives: Parser

C3’s first sprint backlog

c3wifefirstbacklogLots of our friends, coworkers and cousins had a baby recently. Each of them have this picture of the baby at the hospital with a very cute hat written “My first hat” on it. There is this obssession with babies and the first time they get something done: first step, first word, first day at school, first whatever! C3 having no head or feet can’t have what other children have, but still have it’s first something. We have been talking about it for a while, but today mark the beginning of the first sprint and this comes with a very nice backlog. This is a task-list based on the scrum method, which is gaining popularity in the technology field. Obviously, since my husband is doing this part time we won’t put an end date to the sprint, but at the end, we intent to release something for you my fellow reader, in order to get your comments.

It was hard for me at first to define how to list this. In my husband head, it took the form of a recursive algorithm, while a backlog should be linear (without ifs or gotos) So here I’m unfolding this for you:

Sprint 1: Build a PEG parser with semantic actions that will be able able to regenerate ifself
-Parse PEG description file (without semantic actions) – Done
-Display concrete syntactic tree with the console – Done
-Build the abtract syntactic tree – In progress (Done but not tested)
-Display abstract syntactic tree with the console
-Create a generator of all previous steps
-Parse PEG description file with semantic actions
-Build the abstract syntactic tree with semantic actions
-Adapt the generator with semantic actions

This is a proof of concept. If the PEG parser generator can regenerate itself, (and again, and again) this will be a proof that the algorithm is working. After that he’ll concentrate on the next step: Update the C3 grammar with all the new thing he though about in the last few years!

Double-sided discovery

Yesterday my husband found out about YARD, yet another recursive descent. If you read the description in the article he send me this morning, you’ll find out it does mostly what I described in my latest post (PEG Parser). The problem is, he already have coded most of the features that exist in this library and there is no point redoing everything once again. Inspiration on the other hand is a very good thing.

The main problem is that YARD doesn’t fully support semantic actions and pegtl (another library based on YARD) doesn’t work with many compilers because it uses unreleased feature of C++0x.

This is another proof that the world needs a unified way to share code. YARD have been existing as early as 2004 and yet, he didn’t stumble upon it while searching with google in december. It took an article on Dr. Dobb’s to learn about it. The date: january 2009; five year after the first publication.

Funny thing, the author, Christopher Diggins, lives in the same province, not that far from our place. Almost looks like a trend in our area!

Now coding: parser generator

Thinking of great design is one thing, making it happen is a step fewer people are able to make. So here I’ll describe what have been done so far, what is currently being implemented, and why my husband chose these steps to begin with. The Spirit parser has been lost and wouldn’t be of any use anyway, so he’s making a fresh start. The first step into making a compiler, would be making a parser, which would transform code into an abstract syntax tree. But the first step into making C3 is to make a parser generator.

Currently used language compilers are often built around a context-free grammar. This is a concept directly taken from the Chomsky hierarchy, which describes the different types of formal languages. In the 2000s, Brian Ford gave a name to a concept that has already been floating for a while: the parsing expression grammar, or PEG. The Chomsky hierarchy describes how the sentence (or program) should be composed, while the PEG is going reverse, describing how to recognize a correct sentence of the language.

A context-free grammar usually leads to a bottom-up parser which is very complex. Two well known bottom-up parser generators are YACC (yet another compiler compiler) and Bison (GNU version of Yacc). They achieve a level of complexity that makes the generated code hard to understand and almost impossible to debug. On the other hand, there isn’t many top-down parser generators, probably because it’s a lot easier to directly write the parser which would be a recursive descent parser. PEGs are formal descriptions for this kind of parser. Spirit was described by its author as using modified context-free grammars but in fact it was using PEGs before anyone heard of them!

There are many advantages to use PEGs. The context-free grammar have to consider all possibilities at parsing time, and when facing two possible solutions, it cannot resolve itself without a hack. This is called a “conflict”. For example, many subsequent “if” are in the code, then comes an “else”. The context-free grammar doesn’t know with which “if” to associate the “else”, they are all possible solutions in the grammar definition. A hack is necessary to associate this “else” with the correct “if”. The PEG use a greedy algorithm with a priority to the different possibilities. It doesn’t need to be told what to do in all conflicting situation, it just take the highest priority. This makes the algorithm a lot simpler, and possibly more efficient. This is ideal for programming languages for which the main objective is to be parsed into something else, and the language can be designed based on this assumption. Chomsky’s theory of language was made to describe existing (natural) languages, it is normal that it may not be the most fitting tool to create a new one.

Another thing: when creating a context-free grammar parser, we need to have both a parser and a lexer (also called scanner). The lexer do a top-down operation on small pieces (numbers, variables, etc.) in order to let the parser handle the bigger chunks (instructions, functions, etc.) in a bottom-up way. When parsing top-down, no need for two tools, this is all handled in a single operation. The main reason why this can be done with PEG is that repetition is handled without recursivity. The more I learn about lexer, the more they feel like a big hack, good riddance, hi hi!

A funny thing is that Bjarne Stroustrup himself now regrets his choice of a bottom-up parser. At that time he was counseled by language and grammar specialists that this was the only way to go. But the implementation was so complex, he now thinks that a simple recursive descent parser would have done the job after all.

The type of parser the parser-generator is going to create is a packrat parser, which is a type of recursive descent parser. Recursive descent parsers have backtracking. That means that if the parser realize it took the wrong path because of its greedy nature, it will go backward an try the next solution in the priority table. With packrat parsing, if a subpart of the bad parsing finished correctly, it will be stored in a table to be reused in the following treatment if possible. This optimization is named memoisation.

So at this time, my husband is writing a PEG-packrat parser generator. This isn’t a necessary task, he could have just jumped into coding the parser itself. But having the parser generator will make things more flexible, and it’s also fun to implement (for him at least). There wasn’t any satisfying generator known when he started coding it, including Spirit which have poor compile-time performance and too much complexity coming from C++ templates techniques (after all, he wants to reimplement everything in C3 at some point). He now has a “PEG PEG”, which parse files of PEG description and is working toward PEG parser generator.

There are many concepts in this post that you may never have heard of. This is totally normal. I encourage you to follow the links and read about them. I don’t know if it will make you a better person, but this is all very interesting and essential to understand the fundamentals of programming languages.

The early days

The first ideas about C3 emerged while my husband was at the university. He was a first class student with very high marks even though he wasn’t the most studious. These led him to internship in major companies, but the inspiration didn’t really came from there, the dream was already in him before entering the university in year 2000. At that time, he wanted to created a language as simple as Visual Basic, but as powerful as C++. Learning more about C++ made the idea evolve into something else, the more he coded, the more ideas he had about a new language that would get the name C3 three years later.

The first step into reality was a C3 parser made with Spirit, which is a C++ compile-time parser generator based on expression templates. This parser takes a C3 file, creates the abstract syntax tree and outputs it in a file in XML format. This is somewhat of a proof of concept, a demonstration that the grammar works on practical examples. He acknowledge that this isn’t a full proof, because he haven’t shown that it works for all possibilities. This isn’t a priority anyway, many languages haven’t proved that either. This experience proved two important things. First, the grammar is well designed and there is no need for hacks to generate the abstract syntax tree (which is the case for many languages, including C++). And, he is a good enough programmer to handle such a project! I wouldn’t have married him if I wasn’t sure of that. Ah ah! At least, I wouldn’t be writing this blog.

There is a big void after this, he talked about it quite often, probably thought a lot about it, things continued to grow in this head, but no coding happened for a while. That’s when I met him, I don’t feel I was the distraction however. The company we worked for was very small, and even though we had fun, we worked many hours which didn’t leave much inspiration for coding at home. This would last from 2004 to 2008. The company changed a lot in this period, there are now more than 10 times the number of people than there was when we were hired. At the end of 2008, my soon to be husband completed a very important project and took a few weeks off, connecting with the Christmas vacation, which would make a full month of time to spare. In the second week, he started coding again, like this essential need to create C3, buried for too long, found the surface once again. He hasn’t stopped since, and now that he transfered to a more stable department, I feel confident he won’t stop. I hope that my vulgarization work will help him keep his motivation, this is my main goal in writing this blog (and maybe entertain you all a little, ah!)