Parser Generator Based on LPeg(Label)

I worked together with LabLua during Google Summer of Code 2017 to develop a parser generator that allows an easy and efficient description of grammars in Lua. The final product and all the development commits can be found on GitHub.

Goals of the project:

The goal of this project is to build a parser generator on top of LPegLabel. This new tool should make the description of common idioms easier and error reporting more automatic.

Checklist:
  • A parser generator tool
  • The (re)writing of at least two parsers, including lua-parser, with the new tool. Describing the parsers with the new tool should be easier than with plain LPeg(Label).
  • Proper documentation
  • As a secondary result, improvements to LPegLabel itself, based on the difficulties encountered during the project.

Motivation

Describing Parsing Expression Grammars in Lua is currently rather complicated compared to other tools: the user has to handle spaces between tokens explicitly, manually create and use error labels for each missing token, and write recovery grammars by hand. Parser-gen handles these tasks for the user, making Lua-based parsers easier to implement and efficient to use.

Progress

Before working on the project, my mentor suggested that I write a few example parsers, unrelated to the final product. I wrote an LPegLabel-based Tiny parser, which proved to be a very good exercise in understanding how LPegLabel works. The final code can be found here.

After I became accustomed to the LPegLabel interface, we started discussing the goals of the project. 

1. Parser generator tool


The first step in developing the tool was deciding on its syntax. My mentor suggested that I use either ANTLR or re syntax, and after some discussion we decided that a bit of both would be necessary to achieve the functionality we wanted. First, we wanted the tool to handle space and newline characters automatically (very useful for most programming languages, though it can be disabled). Secondly, we wanted it to generate error labels where possible, since otherwise the parser cannot tell where parsing failed. And lastly, the tool should have a familiar interface, similar to other tools available such as re or ANTLR.
We kept discussing all design choices along the way, but here's how the development of the tool went.

Tiny parser

Firstly, I wrote a parser using the parser-gen syntax so that we could test if the final tool was working. The tiny language is very basic, so it was easy to develop this test grammar to check the basic functionality of the tool; more advanced features were later tested using a Lua parser. The code for this parser can be found here.
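
To give a flavour of the style, here is a minimal sketch of a Tiny-like grammar in the tool's syntax. The module name and the compile/parse calls reflect my reading of the repository documentation and may differ slightly, so treat this as an illustration rather than the canonical example:

  local pg = require "parser-gen"    -- module name assumed from the repository

  -- Illustrative fragment: lower-case rules get automatic space handling,
  -- capitalized rules are token rules and are matched without skipping.
  local grammar = pg.compile([[
    program     <- stmt (';' stmt)* !.
    stmt        <- 'read' IDENTIFIER / 'write' exp / IDENTIFIER ':=' exp
    exp         <- NUMBER / IDENTIFIER
    NUMBER      <- [0-9]+
    IDENTIFIER  <- [a-z]+
  ]])

  -- parse() is assumed here to return the match result and a list of errors
  local result, errors = pg.parse("read x; x := 10; write x", grammar)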

PEG-parser

The first task the tool needed to perform was to parse an input grammar and build an AST from it. Other tools, such as the re module, build a new grammar straight from the old one, but we needed an intermediate step so that we could add space handling to the AST and later generate FIRST/FOLLOW sets from it to add error labels. I used the relabel module to write the grammar for this parser, as it was easier for me to understand at the beginning of the project. Now I think it could be converted to plain LPegLabel syntax to drop a dependency; however, the relabel module is included in the LPegLabel package, so this is a very minor issue. A small sketch of the idea is shown below.
Some of the issues we had to consider were design choices: how many children an AST node should have, whether the concatenation operator should bind to the right or to the left, and so forth. Another feature we decided to implement, inspired by ANTLR grammars, is token non-terminals, identified by capitalized names. These rules do not get automatic space handling, as it is sometimes necessary to parse tokens without skipping. The code for the PEG-parser can be found here.
After this parser was complete, I wrote a test file that validates ASTs for each grammar rule.
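
To illustrate the kind of analysis the PEG-parser performs, here is a sketch of my own (not parser-gen's actual grammar) that uses relabel to turn a rule of the form "name <- body" into a small table node:

  local re = require "relabel"   -- relabel is bundled with LPegLabel

  -- Minimal sketch: capture the rule name and its body as fields of a table.
  local rulepat = re.compile([[
    rule <- {| {:rulename: {name} :} %s* '<-' %s* {:body: {.+} :} |}
    name <- [A-Za-z_] [A-Za-z0-9_]*
  ]])

  local node = rulepat:match("WORD <- LETTER+")
  print(node.rulename, node.body)   --> WORD    LETTER+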

Parser generator

Once we had the AST of the grammar, it was possible to turn it into an LPegLabel grammar. For that I traversed the tree using a recursive post-order traversal algorithm. This part was quite easy to accomplish, as the structure of the AST was still fresh in my head. After this was done, I wrote a test file that exercises these grammars and checks that they are the same as ones generated using relabel (the syntax is the same).
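
The core of that traversal can be sketched roughly as follows; the node tags ("t", "nt", "seq", "ord") are illustrative and not the ones parser-gen actually uses:

  local m = require "lpeglabel"

  -- Post-order sketch: build the patterns for the children first, then
  -- combine them into the pattern for the parent node.
  local function topattern(node)
    if node.tag == "t" then          -- terminal (literal string)
      return m.P(node.value)
    elseif node.tag == "nt" then     -- non-terminal (reference to a rule)
      return m.V(node.value)
    elseif node.tag == "seq" then    -- concatenation
      local p = topattern(node[1])
      for i = 2, #node do p = p * topattern(node[i]) end
      return p
    elseif node.tag == "ord" then    -- ordered choice
      local p = topattern(node[1])
      for i = 2, #node do p = p + topattern(node[i]) end
      return p
    end
  end

  -- the expression ('a' / 'b') 'c' written as such a tree:
  local node = { tag = "seq",
    { tag = "ord", { tag = "t", value = "a" }, { tag = "t", value = "b" } },
    { tag = "t", value = "c" } }
  assert(topattern(node):match("bc") == 3)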

Space character handling

We discussed the possible approaches to handling space characters automatically. I decided it was best to have a special grammar rule called "SKIP", which contains any characters the user might want to skip. Spaces are not the only characters worth skipping: for most programming languages, comments can also be handled here.
The algorithm adds an optional skip check at the beginning of the input and after each terminal in non-token rules, consuming the characters if any are found (the effect is sketched after this paragraph).
This created an issue: if a terminal was used inside a capture, the skipped characters would be captured together with the terminal. It took me a long time to come up with an efficient solution. I tried changing the LPegLabel source code to add a function that parses input without capturing it, but that proved more difficult than expected. In the end I solved it with a stack that counts the terminals found inside a capture; if there were any, the skipped characters are consumed only after the capture is completed.
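
In plain LPegLabel terms, the effect of the skip insertion looks roughly like this (illustrative only; parser-gen derives the skip pattern from the user's SKIP rule rather than hard-coding it):

  local m = require "lpeglabel"

  local SKIP = m.S(" \t\n")^0          -- stands in for the user-defined SKIP rule

  -- every terminal in a non-token rule is followed by the skip pattern
  local function terminal(literal)
    return m.P(literal) * SKIP
  end

  local IDENTIFIER = m.R("az")^1
  local readstmt = terminal("read") * IDENTIFIER
  print(readstmt:match("read   foo"))  --> 11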

Setting custom errors

The relabel module expects the user to use numbered error labels, which have to be set before compiling the grammar. I thought it better to let the user work only with named error labels and have the code generate the numbers automatically, so the user now just provides a table with every error name and its description, and optionally a custom recovery expression for each error.
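
The shape of that table can be sketched as follows; the rule fragment and the error names are made up for illustration, and the exact way the table and the recovery expressions are passed to the tool is described in the documentation:

  -- descriptions for the named labels used in the grammar
  local errors = {
    missingThen = "expected 'then' after the condition",
    missingEnd  = "expected 'end' to close the if statement",
  }

  -- inside the grammar, a label is thrown with the "^" operator, e.g.:
  --   ifstmt <- 'if' exp 'then'^missingThen block 'end'^missingEnd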

Error generation

Generating error labels turned out to be the hardest part of this project. My mentor and I kept coming up with different ways to add error labels to grammars accurately and conservatively, since a label placed in the wrong position can stop the grammar from working. It took us a few weeks to settle on the final approach: find the FOLLOW set for each token and only add an error label to the pattern following the token if exactly one pattern can follow it. That is, for a rule "A B", we only add an error label to B if FOLLOW(A) contains exactly one element.
Finding the FOLLOW set of a PEG grammar was quite a difficult task: I did not find any approaches online, other than rewriting the grammar in a different form. We wanted to avoid this, so I had to come up with an algorithm that computes the FOLLOW set of a PEG grammar. We tested the final approach on the Tiny parser and it worked, but it did not work for the Lua parser.

We found that this approach only works for LL(1) grammars, so we left it as an optional feature of the tool. We did not achieve the goal of generating error labels automatically during GSoC; however, this feature might be implemented with more time, as it is far more difficult than any of the other tasks.
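
The decision rule itself is simple and can be sketched like this (the set representation is illustrative):

  -- For a concatenation "A B", an error label is attached to B only when
  -- FOLLOW(A) contains exactly one element.
  local function can_label(follow_A)
    local n = 0
    for _ in pairs(follow_A) do n = n + 1 end
    return n == 1
  end

  print(can_label({ ["then"] = true }))                   --> true
  print(can_label({ NUMBER = true, IDENTIFIER = true }))  --> false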

Recovery generation

All error labels can have a custom recovery expression, but a general one is also needed for automatically generated labels and for labels without a custom expression. I added a special "SYNC" rule to the grammar which lets the user specify this recovery expression. This feature can be disabled as well, so that the parser stops at the first error, but it is useful when parsing a large file, for example the source code of a program.
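
For example, a SYNC rule for a statement-based language might skip ahead to the next semicolon or line break so that parsing can resume at the following statement (an illustrative rule of my own, in relabel-style notation):

  SYNC <- (!(';' / %nl) .)*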

2. Lua parser


The next step was a working Lua parser that makes use of the tool. As a reference I took the Lua parser written in ANTLR, which has a PEG-like syntax, but I had to make some modifications to remove left recursion and adjust prioritization (a typical rewrite is sketched below), since LPegLabel uses a stack-based parsing machine, unlike the LL(*) algorithm used by ANTLR.
The parser was tested using the Lua test suite for Lua 5.3.4. Testing it allowed me to discover a myriad of bugs in parser-gen, which were promptly fixed. The Lua parser can be found on GitHub.
This parser, however, did not have any captures, so it did not build an AST. Instead of writing an AST specifically for Lua, I decided to take a more general approach: generate ASTs automatically for all grammars (see Section 3).
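
A typical rewrite of the kind mentioned above, shown on an example of my own rather than a line from the actual grammar, replaces a left-recursive rule with a repetition:

  -- left-recursive rule (fine for ANTLR, but a PEG would loop forever):
  --   exp <- exp '+' term / term
  -- the same language expressed with a repetition, which a PEG can handle:
  exp <- term ('+' term)*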

Comparison with the LPegLabel Lua parser

Since the goal was a tool that makes parsers easier to write and understand, it is useful to compare the Lua parser written with parser-gen to one written using plain LPegLabel, for example https://github.com/andremm/lua-parser.
To me, the one written with parser-gen is much closer to a textbook version of a grammar. In the image below you can see part of the parser written using parser-gen.
I think that for a person with some experience with grammars this grammar is easy to understand, since the operators are quite similar to those used in non-PEG grammars.
In comparison, the one written with LPegLabel is hard to follow unless you know LPegLabel's syntax:
For a user who has never used LPegLabel before, it may not be obvious what the "+", "*" or "^-1" operators mean, whereas a forward slash or a question mark is more intuitive. The full parser-gen documentation, of course, explains what all these operators mean.
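
To make the difference concrete, here is an illustrative sketch (not an excerpt from either parser) of the same rule written both ways; kw is a hypothetical helper that matches a keyword and skips the spaces after it, the kind of boilerplate parser-gen generates for you:

  -- parser-gen style: '?' marks an optional part
  --   ifstmt <- 'if' exp 'then' block ('else' block)? 'end'

  -- roughly the same rule written directly in LPegLabel:
  local m = require "lpeglabel"
  local function kw(word) return m.P(word) * m.S(" \t\n")^0 end
  local V = m.V
  local ifstmt = kw("if") * V"exp" * kw("then") * V"block"
               * (kw("else") * V"block")^-1 * kw("end")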

Also, the LPegLabel parser needs a lot of custom function definitions, such as tagC(), expect(), sym() and so on. The parser-gen code does not need these, because spaces are handled automatically and error labels can be thrown with the "^" operator. As a result, the parser written with parser-gen ends up much shorter: in terms of lines of code, around 200 lines shorter for this parser.

My parser, however, lacks a validator that checks whether all functions and variables are valid, as the other one has; this feature could easily be implemented by traversing the automatically generated AST.

3. Building ASTs


I noticed that it would be quite easy to build ASTs automatically by tagging each rule with a "rulename" capture and adding captures to each token. I implemented this approach even though it was not planned initially: it proved useful for the Lua parser and made the tool much more usable, since every grammar can now generate an AST automatically. Sometimes it is preferable to build a custom AST, which can be much simpler than the automatically generated one, so this feature can be disabled when compiling the grammar.

When building the ASTs I noticed that sometimes we want to avoid building certain branches. For example, a grammar like
WORD <- LETTER+
LETTER <- [A-Za-z]
when used to parse "ab" would build a tree like
        WORD
       /    \
  LETTER    LETTER
     |         |
    "a"       "b"

When using the AST to write an interpreter, it is far more useful to have a tree like:
      WORD
        |
      "ab"
For this reason we implemented the "fragment" syntax used in ANTLR; marking the LETTER rule as a fragment produces the compact AST above.
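
As far as I recall, the notation simply prefixes the rule with the keyword, mirroring ANTLR (the exact spelling is described in the documentation):

  WORD <- LETTER+
  fragment LETTER <- [A-Za-z]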

4. Fixing bugs and writing documentation


Writing test files for each function was very handy, as it allowed me to notice bugs as soon as they came up and to re-test the code after each modification to ensure nothing went wrong. As a result, most bugs were fixed as soon as they were found, but some may still remain; GitHub issues are a convenient way to report them so they can be fixed quickly.

It was quite difficult for me to write thorough documentation, as I had never written any before. Having spent so much time making this tool, I might wrongly assume that something is "obvious" when to a user it is not at all. I am trying to explain even the most trivial things in the documentation, so that the user can find each feature described there thoroughly, though this may take some time to get right.

My experience

This project has been an absolute blast and very different from anything I have programmed before. Usually when writing a program I immediately know what approach to take, so I spend most of the time just writing and testing the code. For this project, programming was far more interesting, as I would spend hours just trying to figure out algorithms that would achieve the necessary task.
I learned a great deal about grammars during this project; before it, I had only worked with basic BNF grammars.
I want to thank LabLua for being very supportive and responsive during GSoC, and especially my mentor Sérgio Medeiros for helping me understand grammars and develop approaches for each stage of the project, and for supporting my efforts along the way.
