How to Build a LL Parser for Math Expressions
Greetings, friends! This is my twelfth post in the math expression parser series. In this tutorial, we'll learn how to use the Syntax tool to build a math expression LL parser using the LL(1) parsing algorithm! Let's begin!
Why create an LL Parser?
In this tutorial, we'll be building a math expression parser using the Syntax tool's LL1 mode. The parser will be quite dumb though. It won't be able to evaluate math expressions or create abstract syntax trees (AST). It will simply say whether the parse is accepted (succeeds) or display an error if the parse fails. The reason for this is due to the fact that the Syntax tool does not support semantic actions for LL parsers. It only supports validation.
You're probably wandering why you're here reading this article then. We just built a fully functional math expression LR parser in the last tutorial, so why do we an LL parser?
The reason is for educational purposes! Yes, that's right! As we learned in Part 9 of this tutorial series, LR parsers can handle many more grammars than LL parsers. If you're comfortable with the parser we built in the previous tutorial, then by all means, continue using it or building on top of it.
However! It's good to understand the pitfalls we may run into when creating grammars for LL parsers. This knowledge will help us in a later tutorial when we build a recursive descent parser and pratt parser from scratch. Understanding LL parsers will also help us if we ever decide to use a different parser generator tool in the future that only supports the LL(1) parsing algorithm.
Historically, LL parsers have been much easier to construct than LR parsers, so if you ever want to make your own parser generator, implementing the LL(1) parsing algorithm is a good starting point! Even Python was originally created using an LL(1) parser generator called pgen! This changed in Python 3.9, however, when it was replaced by a PEG-based parser generator called pegen.
Using LL1 Mode
As mentioned in a previous article, the Syntax tool supports multiple parsing algorithms. We've been using the LALR1 parsing mode. Now, it's time to use the LL1 parsing mode. Let's create a small grammar file called calc.g
that supports addition.
%lex
%%
\s+ /* skip whitespace */
\d+ return 'NUMBER'
/lex
%%
e
: e '+' e
| NUMBER
;
Normally, we'd run the following command to parse a math expression in LALR1 mode.
syntax-cli -g calc.g -m lalr1 -p '1 + 2'
Let's see what happens when we use the LL1 mode instead:
syntax-cli -g calc.g -m ll1 -p '1 + 2'
We get an error!
SyntaxError: Found conflict in state e:NUMBER. Predicted productions: 1/2
Why does the LL1 parsing mode fail though? What makes it different from LALR1? Let's inspect the parsing table to find out.
We can use the -t
option to view the LL(1) parsing table generated by the Syntax tool.
syntax-cli -g calc.g -m ll1 -t
We should see the following output in our terminal.
Parsing mode: LL(1).
Grammar:
1. e -> e '+' e
2. | NUMBER
LL(1) parsing table:
┌───┬─────┬────────┬───┐
│ │ '+' │ NUMBER │ $ │
├───┼─────┼────────┼───┤
│ e │ │ 1/2 │ │
└───┴─────┴────────┴───┘
As we can see in the table above, there's a conflict, noted by 1/2
in the NUMBER
column. The parser doesn't know whether to use grammar rule 1 or grammar rule 2. This specific conflict is known as a FIRST/FIRST conflict. More specifically, this conflict is caused by the infamous left recursion problem.
Left Recursion
Left recursion is the main reason why we can't pass a grammar, meant for LR parsers, to a LL parser. An LR parser builds a parse tree from "bottom-up" while an LL parser builds a parse tree from "top-down". The LR parser builds up states in a parsing table to determine all the possible states the parser can be in as it scans input.
A LL(1) parser doesn't have as much information as a LR parser. When a LL(1) parser encounters a symbol it recognizes from the grammar, it will look at the next token to determine which action it should take. This is one reason why LL(0) parsers don't exist yet LR(0) parsers can exist. Remember, the number in the parentheses i.e. the 1
in LL(1), corresponds to the number of "lookahead" tokens we use in the parser.
Let's look at our grammar again:
%lex
%%
\s+ /* skip whitespace */
\d+ return 'NUMBER'
/lex
%%
e
: e '+' e
| NUMBER
;
When we feed this grammar into the Syntax parser generator using LL1 mode, it causes a conflict due to left recursion. Essentially, an LL(1) parser is trying to do the following when it sees e: e + e
:
function e() {
e();
consumeToken('+');
e();
}
As you can see, this will lead to infinite recursion. Now, the internals of LL(1) parsing algorithm isn't exactly like this, but the idea is that the parser won't work because it ends up in an infinite loop.
In reality, the parser generator will build parsing tables for the LL(1) parser, and the parser can't decide which production rule to choose. Should it choose e + e
or NUMBER
? LL(1) parsers don't know. LR parsers don't suffer from left recursion problems due to the nature of their design.
Resolving Left Recursion
Hope is not lost! We can resolve left recursion by rewriting our grammar using substitutions. Replace the contents of the calc.g
file with the following.
%lex
%%
\s+ /* skip whitespace */
\d+ return 'NUMBER'
/lex
%%
expression
: primary expression'
;
expression'
: "+" primary expression'
| /* epsilon */
;
primary
: NUMBER
;
Instead of using e
everywhere, we're going to use names for each grammar rule. In future tutorials, we'll use similar names in other types of parsers. The grammar rule, primary
, refers to a "primary expression." Think of it as the final part of our parse tree. A primary expression usually resolves into simple values such as numbers, unary expressions, or parenthesized expressions.
Next, run the following command with the Syntax tool:
syntax-cli -g calc.g -m ll1 -p '1 + 2'
It should pass now! We can also run the same grammar using the LALR1 parsing mode, and it should pass as well.
syntax-cli -g calc.g -m lalr1 -p '1 + 2'
Right away, you may have noticed that our new grammar has grown and become more complicated. That's one of the disadvantages of LL parsers. Remember, LR parsers are generally better for making math expression parsers, but this tutorial is meant to educate everyone on how LL parsers work in case we encounter them in the future.
Notice in our new grammar that we never have a nonterminal symbol such as expression
appear directly before and after the colon :
for each production rule i.e. e: e '+' e
. That's one trick for avoiding left recursion. We have to think differently about constructing grammars for LL parsers compared to LR parsers.
Understanding Epsilon
You might be wondering about the following production rule:
| /* epsilon */
The comment, /* epsilon */
, actually isn't necessary, but we still need an empty production rule. We could have done the following instead:
expression'
: "+" primary expression'
|
;
The Syntax tool uses empty production rules to denote "end of input". The expression'
(read as "expression-prime") symbol can derive into "+" primary expression'
or ε
, the greek letter, "epsilon". As a common convention, parser generators and parsing algorithms use the ε
symbol to mean that the entire input has been scanned. You may recall that in Part 5 of this tutorial series, we used the ε
symbol in one our tokenizer implementations. We used it to represent the same thing: "end of input".
LL(1) Conflicts, Sets, and Tables
LL parsers can have either "FIRST/FIRST" conflicts or "FIRST/FOLLOW" conflicts. The terms, "FIRST" and "FOLLOW", refer to the "first sets" and "follow sets" created internally by the parser generator. In fact, we can even see these sets by using the --sets
or -s
option with the Syntax CLI tool.
syntax-cli -g calc.g -m ll1 -s all
The all
argument will let us see the "first", "follow", and "predict" sets.
First set:
┌─────────────┬───────────┐
│ Symbol │ First set │
├─────────────┼───────────┤
│ expression │ NUMBER │
├─────────────┼───────────┤
│ primary │ NUMBER │
├─────────────┼───────────┤
│ expression' │ "+", ε │
└─────────────┴───────────┘
Follow set:
┌─────────────┬────────────┐
│ Symbol │ Follow set │
├─────────────┼────────────┤
│ expression │ $ │
├─────────────┼────────────┤
│ expression' │ $ │
├─────────────┼────────────┤
│ primary │ "+", $ │
└─────────────┴────────────┘
Predict set:
┌───────────────────────────────────────────┬─────────────┐
│ Production │ Predict set │
├───────────────────────────────────────────┼─────────────┤
│ 1. expression -> primary expression' │ NUMBER │
├───────────────────────────────────────────┼─────────────┤
│ 2. expression' -> "+" primary expression' │ "+" │
├───────────────────────────────────────────┼─────────────┤
│ 4. primary -> NUMBER │ NUMBER │
└───────────────────────────────────────────┴─────────────┘
If this doesn't make sense right now, that's okay. Learning about the LL(1) parsing algorithm is beyond the scope of this tutorial series, but Wikipedia has a lot of useful information about the algorithms used to build LL parsers. Internally, the Syntax tool takes care of all the complex stuff for us.
Just keep in mind that the first, follow, and predict sets help us build a parsing table. Then, the parser can use the parsing table to parse input. The construction of LL parsing tables is different from LR parsing tables. The tables convey different meanings because the tables for LL parsers store different information than LR parsers.
We can use the --table
or -t
option to inspect the parsing table for our grammar using the LL1 parsing mode.
syntax-cli -g calc.g -m ll1 -t
This should result in the following tables:
Parsing mode: LL(1).
Grammar:
1. expression -> primary expression'
2. expression' -> "+" primary expression'
3. | ε
4. primary -> NUMBER
LL(1) parsing table:
┌─────────────┬─────┬────────┬───┐
│ │ "+" │ NUMBER │ $ │
├─────────────┼─────┼────────┼───┤
│ expression │ │ 1 │ │
├─────────────┼─────┼────────┼───┤
│ expression' │ 2 │ │ 3 │
├─────────────┼─────┼────────┼───┤
│ primary │ │ 4 │ │
└─────────────┴─────┴────────┴───┘
In parsing tables, the $
represents the end of input instead of ε
. In the text above the table, the numbers represent the numbered production rules. A good description of how a LL(1) parser uses a parsing table can be found on Wikipedia.
Basically, the LL(1) parser will scan an input such as 1 + 2
and push symbols to a stack using the parsing table. Eventually, we get the following derivation.
expression → primary expression'
→ NUMBER expression'
→ NUMBER (+ primary expression')
→ NUMBER (+ NUMBER expression')
→ NUMBER (+ NUMBER ε)
→ NUMBER + NUMBER ε
→ 1 + 2 ε
→ 1 + 2
Parse accepted.
Handling Multiplication
When we were developing a math expression parser in the previous tutorial, we added multiplication such that our grammar looked like the following.
%lex
%%
\s+ /* skip whitespace */
\d+ return 'NUMBER'
/lex
%%
e
: e '+' e
| e '*' e
| NUMBER
;
However, this grammar won't work for LL(1) parsers. We need to restructure the grammar to remove left recursion.
%lex
%%
\s+ /* skip whitespace */
\d+ return 'NUMBER'
/lex
%%
expression
: term expression'
;
expression'
: "+" term expression'
| /* epsilon */
;
term
: primary term'
;
term'
: "*" primary term'
| /* epsilon */
;
primary
: NUMBER
;
Let's test the input, 1 + 2 * 3
using this grammar and LL1 mode to make sure the parse succeeds.
syntax-cli -g calc.g -m ll1 -p '1 + 2 * 3'
Lots of work to make the LL(1) grammar, huh? LR parsers definitely seem superior when it comes to building math expression parsers. If you look closely at our grammar, then you may notice a pattern. As we go down the grammar, we're dealing with operators of higher precedence.
A math expression behaves like a recursive structure in the grammar. The derivation process for the input, 1 + 2 * 3
, might be similar to the following.
expression → term expression'
→ (primary term') expression'
→ (NUMBER term') expression'
→ (NUMBER ε) expression'
→ (NUMBER ε) (+ term expression')
→ (NUMBER ε) (+ (primary term') expression')
→ (NUMBER ε) (+ (NUMBER term') expression')
→ (NUMBER ε) (+ (NUMBER (* primary term')) expression')
→ (NUMBER ε) (+ (NUMBER (* primary ε)) expression')
→ (NUMBER ε) (+ (NUMBER (* NUMBER ε)) expression')
→ (NUMBER ε) (+ (NUMBER (* NUMBER ε)) ε)
→ (NUMBER) (+ (NUMBER (* NUMBER)))
→ NUMBER + NUMBER * NUMBER
→ 1 + 2 * 3
Do you see why the empty production rules are important now? They help detect the end of input for the "mini-expressions" inside the entire expression, 1 + 2 * 3
. As mentioned earlier in this tutorial, blank production rules are treated as ε
(epsilon) by the Syntax tool.
Handling Parentheses
Adding parentheses to our grammar is super simple. We just add a new production rule in the primary
section.
primary
: NUMBER
| "(" expression ")"
;
Notice how the expression
nonterminal symbol is in between parentheses. Since expression
is the start of our grammar, we should recursively parse the expression between parentheses. It's similar to the "(" e ")"
production rule we used in the previous tutorial.
Our completed grammar should look like the following so far.
%lex
%%
\s+ /* skip whitespace */
\d+ return 'NUMBER'
/lex
%%
expression
: term expression'
;
expression'
: "+" term expression'
| /* epsilon */
;
term
: primary term'
;
term'
: "*" primary term'
| /* epsilon */
;
primary
: NUMBER
| "(" expression ")"
;
Run the following command to test the input, (1 + 2) * 3
, to make sure the parser works correctly.
syntax-cli -g calc.g -m ll1 -p '(1 + 2) * 3'
Handling Subtraction and Division
Adding subtraction to our grammar isn't too bad. Since addition and subtraction have the same precedence, we can add another production rule underneath the line for addition.
expression'
: "+" term expression'
| "-" term expression'
| /* epsilon */
;
Similarly, we can add a line for division underneath the line for multiplication.
term'
: "*" primary term'
| "/" primary term'
| /* epsilon */
;
Now, the grammar should look like the following.
%lex
%%
\s+ /* skip whitespace */
\d+ return 'NUMBER'
/lex
%%
expression
: term expression'
;
expression'
: "+" term expression'
| "-" term expression'
| /* epsilon */
;
term
: primary term'
;
term'
: "*" primary term'
| "/" primary term'
| /* epsilon */
;
primary
: NUMBER
| "(" expression ")"
;
Run the following command to test the input, (5 - 1) / 2
, to make sure the parser works correctly.
syntax-cli -g calc.g -m ll1 -p '(5 - 1) / 2'
Handling Exponentiation
Continuing the pattern, we can add another set of operators to handle exponentiation.
%lex
%%
\s+ /* skip whitespace */
\d+ return 'NUMBER'
/lex
%%
expression
: term expression'
;
expression'
: "+" term expression'
| "-" term expression'
| /* epsilon */
;
term
: factor term'
;
term'
: "*" factor term'
| "/" factor term'
| /* epsilon */
;
factor
: power factor'
;
factor'
: "^" power factor'
| /* epsilon */
;
power
: primary
;
primary
: NUMBER
| "(" expression ")"
;
Since exponentiation has higher precedence than multiplication, we adjust the term
production rule to use factor
instead of primary
. The factor
rule will then resolve to an exponential expression called power
.
Run the following commands to make sure the exponentiation operator works correctly with the other mathematical operators.
syntax-cli -g calc.g -m ll1 -p '2 ^ 2'
syntax-cli -g calc.g -m ll1 -p '2 ^ 2 ^ 3'
syntax-cli -g calc.g -m ll1 -p '(1 + 3) ^ 2'
syntax-cli -g calc.g -m ll1 -p '2 ^ 2 ^ 3 + 1'
Handling Unary Operations and Functions
To handle negative numbers, we can add unary operations underneath the primary
production rule, similar to parentheses.
primary
: NUMBER
| "(" expression ")"
| "-" power
;
The reason why we do "-" power
instead of "-" expression
is to ensure that the exponential operator has right associativity and higher precedence than the unary operator.
-2 ^ 2 + 1
is an example of a math expression where exponentiation should occur first before the unary operation is applied.
For handling functions, we can add another line underneath the primary
rule. This will help the parser understand functions with one parameter such as cos(0)
.
primary
: NUMBER
| "(" expression ")"
| "-" power
| ID "(" NUMBER ")"
;
We also need to add the \w+
regular expression in the lex
section of our grammar, so we can return an ID
token. Finally, our completed grammar looks like the following.
%lex
%%
\s+ /* skip whitespace */
\d+ return 'NUMBER'
\w+ return 'ID'
/lex
%%
expression
: term expression'
;
expression'
: "+" term expression'
| "-" term expression'
| /* epsilon */
;
term
: factor term'
;
term'
: "*" factor term'
| "/" factor term'
| /* epsilon */
;
factor
: power factor'
;
factor'
: "^" power factor'
| /* epsilon */
;
power
: primary
;
primary
: NUMBER
| "-" power
| "(" expression ")"
| ID "(" expression ")"
;
Run the following commands to test to make sure the unary operators and functions work with the other mathematical operators.
syntax-cli -g calc.g -m ll1 -p 'sin(3) - -5'
syntax-cli -g calc.g -m ll1 -p '2 ^ 2 ^ 3 + -1'
syntax-cli -g calc.g -m ll1 -p '2 ^ -3'
syntax-cli -g calc.g -m ll1 -p ' -2 ^ 2 + 1'
Remember, your terminal might complain if you try running a negative number right after the first single quote '
. Therefore, you need to add an extra space before -1
like in the example below.
syntax-cli -g calc.g -m ll1 -p ' -1 + 2'
We now have a math expression parser that behaves similar to the LR parser we made in the previous tutorial! This LL(1) parser is quite dumb and only returns true
if we imported as a JavaScript module, but we learned a lot and expanded our knowledge of parser theory.
Conclusion
In this tutorial, we learned how to build a math expression grammar that works with the LL1 parsing mode in the Syntax parser generator. We learned a great deal about parser theory, including how LL parsers work and how to avoid left recursion.
In the next tutorial, we'll make another math expression parser using a parser generator called Peggy, formerly called PEG.js. This tool can generate PEG-based parsers that accept a new kind of grammar, written in a completely different notation than what we did for the Syntax tool.
I hope you've learned a lot so far! Until next time, happy coding! 😊