How to Build a PEG-based Parser for Math Expressions with Peggy

Published: Monday, March 20, 2023

Greetings, friends! This is my thirteenth post in the math expression parser series. In this tutorial, we'll learn how to use the Peggy parser generator to build a math expression using parsing expression grammars. Let's get started!

What is Peggy?

Peggy is a parser generator that supports parsing expression grammars. It is a modern fork of the PEG.js library. Since PEG.js is no longer being actively maintained, a group of developers decided to make Peggy to support newer JavaScript features.

Peggy has been used by a variety of tools including Kibana, a data visualization dashboard software for Elasticsearch, a powerful search engine tool.

The predecessor to Peggy, PEG.js, has been used in a variety of tools as well including Markdoc, a markdown-based syntax and toolchain for creating custom documentation sites.

For the most part, grammars made in PEG.js should still be backwards compatible with Peggy. Searching for questions related to PEG.js is still helpful for solving issues related to building grammars for Peggy. There might be a few syntax changes we may need to make when going between PEG.js and Peggy.

In this tutorial, we'll develop our own parsing expression grammar so that Peggy can generate a math expression parser for us.

Installing Peggy

Let's install the Peggy parser generator globally on our computer, since this software is meant to be used as a CLI tool.

shell

Copied! ⭐️

npm i -g peggy

Next, try running the following command to make sure the program was installed correctly.

shell

Copied! ⭐️

peggy --help

The --help option will display a list of all the other options we can pass to the Peggy CLI tool. We'll use the -t option throughout this tutorial to test various inputs with a generated parser.

Online Peggy Playground

In this tutorial, we'll be using the Peggy npm package we just installed, but I'd like to mention another tool in case you find it helpful. The online version of the Peggy tool can be used to test out grammars quickly using a user interface (UI). This tool also lets us download a parser, generated from Peggy, using the grammar we provide in the UI.

Context-free Grammars

In the previous two tutorials, we've been creating grammars that are similar to Context-free grammars (CFG). Most parser generators that support CFG-like syntax, including the Syntax tool, provide some kind of mechanism for storing context in the grammar, making them context-sensitive. Technically, we could have created parsers that tokenized values depending on context.

There are multiple programming languages that can't be considered context-free due to how complicated their syntax is. The parser might rely on context as it scans through a file coded in some kind of programming language. The JavaScript language is a good example.

In JavaScript, the + symbol has multiple meanings depending on how many occur in an expression and whether it appears next to an identifier. For instance, the prefix ++id, postfix id++, and addition 1 + 2 operations all use the + symbol. What would happen if we used them all together?

Copied! ⭐️

let num = 0;

console.log(++num+num+++1); // 3
console.log(num); // 2

Looks weird, huh? The parser has to be able to handle tricky situations such as this. The + symbol is ambiguous until we have enough information before and after it. Using context is one solution to handle this "ambiguity" problem. However, clever computer scientists came up with the notion of parsing expression grammars to help remove ambiguities.

Parsing Expression Grammars

When developing production rules in our grammars, we've been using a notation similar to Backus–Naur form. Below is a small grammar we fed into a LR parser in Part 11 using the Syntax tool.

text

Copied! ⭐️

%lex

%%

\s+             /* skip whitespace */
\d+             return 'NUMBER'

/lex

%left '+'
%left '*'

%%

e
  : e '+' e    { $$ = $1 + $3 }
  | e '*' e    { $$ = $1 * $3 }
  | NUMBER     { $$ = Number($1) }
  ;

We saw that if we didn't add special %left modifiers, our parser would complain about a shift-reduce conflict caused by an ambiguity. For the input, 1 + 2 * 3, the parser didn't know whether it should perform addition or resolve an expression as a number.

In parsing expression grammars, most ambiguity is resolved automatically by how we write the grammar. In the Peggy tool, we need to learn a completely new notation for creating grammar files. In parsing expression grammars, we keep trying production rules in order and stop when a rule is matched. This contrasts with the BNF-like notation used in the grammars we wrote in previous tutorials.

Let's create a grammar for both the Syntax tool and Peggy to see how they differ. First, create a small grammar file called pizza.g that can be used with the Syntax tool.

text

Copied! ⭐️

%%

food
  : 'pizza'
  | 'pizza' 'donut'
  ;

The above grammar can accept either pizza or pizzadonut (without whitespaces). We can use the Syntax tool to prove these inputs are valid.

shell

Copied! ⭐️

syntax-cli -g pizza.g -m lalr1 -p 'pizza'
# Parsed value: pizza

syntax-cli -g pizza.g -m lalr1 -p 'pizzadonut'
# Parsed value: pizzadonut

Here's a parsing expression grammar (PEG) that can be accepted by Peggy.

Copied! ⭐️

Food
  = 'pizza'
  / 'pizza' 'donut'

Now, we'll run a few tests in Peggy to see if the same inputs are valid.

shell

Copied! ⭐️

peggy pizza.pegjs -t 'pizza'
# Output: 'pizza'

peggy pizza.pegjs -t 'pizzadonut'
# Error running test
# Error: Expected end of input but "d" found.
#  --> command line:1:6
#   |
# 1 | pizzadonut
#   |      ^

Why did pizzadonut fail? It's because the / symbol in a PEG represents a choice operator. If the first choice succeeds, then the parser stops and never reaches the other choices. In our pizza.pegjs grammar file, pizza is our first choice for Food (isn't it everyone's? 😋) which means the parser stops, so it never sees the pizzadonut option.

Packrat Parsing

Peggy uses a special kind of parsing algorithm known as packrat parsing. Using memoization (caching), packrat parsers can achieve linear time performance, making them really fast. However, this means that packrat parsers consume more memory (due to its caching mechanism).

There's a great series by Guido van Rossum, the creator of Python, where he talks about the process of building a PEG-based parser using packrat parsing and memoization.

The standard implementation of the default Python interpreter, CPython, used to parse Python by a LL(1) parser called pgen. Ever since Python 3.9, the new PEG-based parser called pegen parses Python instead. I must say, "pegen" is a great pun 😂

There's a great discussion on how the new pegen parser works in the Python's Developer Guide. This guide also talks a bit about how PEG-based parsers work in general.

Left Recursion

We can think of PEG-based parsers (packrat parsers) as a sibling of LL(*) parsers where the * refers to an infinite lookahead. In LL(1) parsers, we only used one lookahead token, but a packrat parser can essentially have infinite lookahead. This may seem like the parser would get slower as we use more lookaheads, but memoization helps improve performance.

However, packrat parsers can still suffer from the dreadful left recursion problem we mentioned in the previous tutorial when we discussed LL parsers. There are ways of mitigating this problem in packrat parsers using some tricks, but for Peggy (and PEG.js), left recursion should be avoided. If left recursion appears in our grammar, then we may get the following error when testing a generated parser.

text

Copied! ⭐️

Error parsing grammar
Maximum call stack size exceeded

Yuck! We'd like to avoid that!

The Peggy VS Code Language Extension

When developing a grammar, it'd be nice to get syntax highlighting. We can think of a grammar as its own little language. Sometimes, a grammar file that's used to build a parser for parsing the syntax used in that same grammar is called a "metagrammar".

Some very thoughtful developers have made a VS Code extension for syntax highlighting the Peggy grammar language. It's super useful. It provides not only syntax highlighting but also error reporting, IntelliSense, Symbols, and a language server. Truly amazing!

I highly recommend installing this extension if you're using VS Code. It automatically detects left recursion and helps us find references for when our grammar becomes really long (which definitely happens when people are building a programming language).

Starting our PEG-based Math Expression Parser

Let's start building a grammar for a PEG-based math expression parser. As always, we'll start with addition. Create a new file called calc.pegjs that contains the following contents.

Copied! ⭐️

Start
  = Expression

Expression
  = Primary "+" Expression
  / Primary

Primary
  = [0-9]+

The grammar rule, Primary, refers to a "primary expression." Like the previous tutorial, we can think of it as the final part of our parse tree. A primary expression usually resolves into simple values such as numbers, unary expressions, or parenthesized expressions.

We can run the following commands to test our grammar using Peggy.

shell

Copied! ⭐️

peggy calc.pegjs -t '1'
# Output: [ '1' ]

peggy calc.pegjs -t '1+2'
# Output: [ [ '1' ], '+', [ '2' ] ]

Keep in mind that we haven't implemented whitespaces just yet in our grammar, so the input, 1 + 2 won't work, but 1+2 will work. We also haven't implemented semantic actions, so Peggy will output an array of characters by default.

By convention, we name our grammar files with a .pegjs extension. This convention originated from Peggy's predecessor, PEG.js. You can technically use whatever extension you'd like, but it's helpful to use the .pegjs filename extension to get syntax highlighting quickly in VS Code with the Peggy VS Code extension.

Notice how similar our grammar looks compared to the grammar we built for a LL parser in the previous tutorial.

text

Copied! ⭐️

%lex

%%

\s+             /* skip whitespace */
\d+             return 'NUMBER'

/lex

%%

expression
  : primary expression'
  ;

expression'
  : "+" primary expression'
  | /* epsilon */
  ;

primary
  : NUMBER
  ;

Both the BNF-like grammar used for the Syntax tool and the parsing expression grammar (PEG) used for Peggy are written such that they avoid left recursion. As mentioned earlier in this tutorial, a PEG-based parser reads the PEG file and stops when it matches one of the choices separated by /.

Let's analyze what's happening in our calc.pegjs file.

Copied! ⭐️

Start
  = Expression

Expression
  = Primary "+" Expression
  / Primary

Primary
  = [0-9]+

The / symbol is like saying "or". The Expression symbol can be either Primary "+" Expression OR be Primary. If Primary "+" Expression is matched, then don't proceed to check the Primary choice after the / symbol.

Suppose we were parsing the expression, 1+2 (without whitespaces). The starting condition, Start, is our starting point. The derivation process will be similar to the following:

text

Copied! ⭐️

1. Start → Expression
2.       → Primary "+" Expression / Primary
3.       → Primary "+" Expression
4.       → Primary "+" (Primary "+" Expression / Primary)
5.       → Primary "+" Primary
6.       → 1 + 2

In step 1, Start derives to Expression. Technically, we don't really need the Start grammar rule. We could have simply started with Expression instead.

Copied! ⭐️

Expression
  = Primary "+" Expression
  / Primary

Primary
  = [0-9]+

In step 2, we are trying to derive Expression, but we're left with a choice: Primary "+" Expression or Primary. In Step 3, the parser figures out that Primary "+" Expression must be the choice because there's a + sign coming up in the expression, 1+2. In Step 4, the Expression symbol is resolved to another choice between Primary "+" Expression or Primary. This time, the parser sees a 2 coming up and nothing else after that, so Primary is selected as the choice in Step 5. Finally, the parse completes in Step 6.

Now, the whole process isn't exactly like that when Peggy parses an input, but I hope you get the idea. The / determines choices, which means order is very important in a parsing expression grammar.

Semantic Actions

Most parser generators support the ability to use semantic actions when a parser has completed parsing a portion of an input. Similar to the Syntax tool, we can insert JavaScript code inside curly braces { }. A semantic action is referred to as a "rule action" in the Peggy documentation. Let's create a semantic action to add numbers together. Inside calc.pegs, replace the contents of the file with the following.

Copied! ⭐️

Expression
  = left:Primary "+" right:Expression { return left + right; }
  / Primary

Primary
  = digits:[0-9]+ { return parseInt(digits.join(""), 10); }

Run the following command to make sure the parser works correctly.

shell

Copied! ⭐️

peggy calc.pegjs -t '1'
# Output: 1

peggy calc.pegjs -t '1+2'
# Output: 3

Notice how we added names in our grammar rules such as left:Primary. Why did we do that? In Peggy, we need to assign "labels" to the left of "rule names" in order to use them as variables in a corresponding semantic action. Let's take a closer look at the Expression grammar rule:

Copied! ⭐️

Expression
  = left:Primary "+" right:Expression { return left + right; }
  / Primary

In the grammar rule above, left is a label. Primary and Expression are rule names. In a semantic action (aka rule action), we use the labels to serve as variables in the JavaScript code. We then return left + right as a number, since the semantic action corresponding to digits:[0-9]+ returns a number.

Next, let's take a closer look at what's going on in the semantic action for the grammar rule containing digits:[0-9]+.

Copied! ⭐️

Primary
  = digits:[0-9]+ { return parseInt(digits.join(""), 10); }

In the grammar rule above, we are using a "character match" as seen in the documentation for parsing expression types. Character matches behave similar to the square bracket [ ] notation used in regular expressions. The [0-9]+ notation means that we should match one or more integers where each integer is a digit between 0 (zero) and 9. Therefore, we can match multi-digit numbers such as 7, 10, 123, etc.

In the semantic action for digits:[0-9]+, we are using some JavaScript to return a multi-digit number with the JavaScript type of number. This is done by using the following code:

Copied! ⭐️

return parseInt(digits.join(""), 10);

Why is this necessary though? It's due to how Peggy returns values as it parses an input. By default, Peggy will return an array of matched characters. We can see this happen if we remove the semantic actions from our grammar in calc.pegjs.

Copied! ⭐️

Expression
  = left:Primary "+" right:Expression
  / Primary

Primary
  = digits:[0-9]+

Let's run the following command using the grammar above.

shell

Copied! ⭐️

peggy calc.pegjs -t '12'
# Output: [ '1', '2' ]

As we can see, we get back an array of characters from the input, 12. Let's now change our grammar in calc.pegjs again to only have a semantic action for the Expression rule.

Copied! ⭐️

Expression
  = left:Primary "+" right:Expression { return left + right; }
  / Primary

Primary
  = digits:[0-9]+

Run the following command using the grammar above.

shell

Copied! ⭐️

peggy calc.pegjs -t '12+34'
# Output: '1,23,4'

That's a strange output. What happened? It turns out that this is just regular good ol' fashioned JavaScript quirkiness. During the parse, something similar to the following is happening.

Copied! ⭐️

function parse() {
  const left = '12'.split(''); // ['1', '2']
  const right = '34'.split(''); // ['3', '4']

  // ['1', '2'] + ['3', '4']
  return left + right;
}

const result = parse();
console.log(result); // '1,23,4'

As we can see, adding arrays of strings is perfectly valid in JavaScript. Therefore, we end up with the result, 1,23,4 when we parse 12+34. To fix this, we use the following line:

Copied! ⭐️

return parseInt(digits.join(""), 10);

Then, the parser does something similar to the following:

Copied! ⭐️

function parse() {
  let digits;

  digits = '12'.split('')
  const left = parseInt(digits.join(''), 10); // 12

  digits = '34'.split('')
  const right = parseInt(digits.join(''), 10); // 34

  // 12 + 34
  return left + right;
}

const result = parse();
console.log(result); // 46

We can return our grammar back to normal and confirm that 12+34 equals 46. Update the calc.pegjs grammar to have the following contents.

Copied! ⭐️

Expression
  = left:Primary "+" right:Expression { return left + right; }
  / Primary

Primary
  = digits:[0-9]+ { return parseInt(digits.join(""), 10); }

Then, run the following command, and we should see an output of 46.

shell

Copied! ⭐️

peggy calc.pegjs -t '12+34'
# Output: 46

Action Execution Environment

In Peggy, semantic actions run in an action execution environment where they have access to various variables and functions provided to us by Peggy. One of these functions happens to be the text() function which returns the source text between the start and end of matched text.

We can use the text() function to rewrite our grammar rule for handling digits. This will make the grammar look less verbose.

Copied! ⭐️

Expression
  = left:Primary "+" right:Expression { return left + right; }
  / Primary

Primary
  = [0-9]+ { return parseInt(text(), 10); }

We should get the same result as before.

shell

Copied! ⭐️

peggy calc.pegjs -t '12+34'
# Output: 46

Peggy stores location information about characters from a given input. For example, the string, '12+34', contains five characters. The parser knows that '12' starts at position 0 of the string and ends at position 1. Similarly, the parser knowns that '34' starts at position 3 and ends at position 4.

The text() function will return the source text between the start and end of a token. Therefore, the text() function will automatically return the strings, '12' and '34'. We can then use parseInt to convert these strings into numbers.

Handling Whitespaces

Currently, Peggy throws an error if we try to use whitespaces when running the following command:

shell

Copied! ⭐️

peggy calc.pegjs -t '12 + 34'
# Error running test
# Error: Expected "+", [0-9], or end of input but " " found.
#  --> command line:1:3
#   |
# 1 | 12 + 34
#   |   ^

We need to add whitespaces to our grammar, but it won't be as straightforward as in previous tutorials because we need to specify everywhere a whitespace is allowed. We don't have access to a tokenizer that says, "Oh! A whitespace! I'll skip that for you." Inside the grammar files we pass to Peggy, we need to adjust each rule to include the possibility of whitespaces occurring.

Copied! ⭐️

Expression
  = left:Primary _ "+" _ right:Expression { return left + right; }
  / Primary

Primary
  = _ [0-9]+ _ { return parseInt(text(), 10); }

_ "whitespace"
  = [ \t\n\r]*

In the grammar above, we recognize that it's possible to have a whitespace appear between numbers and the + operator. It's also possible for whitespaces to occur before or after numbers.

The underscore _ is a common convention you'll see in .pegjs files when people define a rule name for whitespaces. In the grammar, we use _ as a rule name, but we could have called it ws or something else. An underscore is allowed by Peggy because it is considered a valid identifier.

The JavaScript string, "whitespace" to the right of the underscore is known as a human-readable name in the documentation. This name is used in error messages as well as symbols in the Peggy VS Code extension. We can make a human-readable name for any of our grammar rule names (Start, Expression, Primary, etc.).

Let's run the following commands to make sure the parser understands how to deal with whitespaces.

shell

Copied! ⭐️

peggy calc.pegjs -t '12'
# Output: 12

peggy calc.pegjs -t ' 12 '
# Output: 12

peggy calc.pegjs -t '12 + 34'
# Output: 46

peggy calc.pegjs -t ' 12 + 34 '
# Output: 46

Handling Parentheses

Adding parentheses to our grammar rules is super simple. We can add a new expression choice at the beginning of the Primary grammar rule.

Copied! ⭐️

Expression
  = left:Primary _ "+" _ right:Expression { return left + right; }
  / Primary

Primary
  = _ "(" _ expr:Expression _ ")" _ { return expr; }
  / _ [0-9]+ _ { return parseInt(text(), 10); }

_ "whitespace"
  = [ \t\n\r]*

We can technically make the parentheses grammar rule cleaner by using the @ operator, known as the "pluck" operator in the documentation which lets us pluck a value out of a grammar rule instead of returning the entire rule. Instead of writing the following:

Copied! ⭐️

Primary
  = _ "(" _ expr:Expression _ ")" _ { return expr; }
  / _ [0-9]+ _ { return parseInt(text(), 10); }

We can simplify the grammar rule using the @ pluck operator:

Copied! ⭐️

Primary
  = _ "(" _ @Expression _ ")" _ 
  / _ [0-9]+ _ { return parseInt(text(), 10); }

This will implicitly return the Expression instead of having to write a label and semantic action.

By inserting the parentheses logic inside the Primary rule, we can guarantee higher priority than addition. Remember, we need to add underscores _ to represent every possible occurrence of a whitespace.

For the semantic action, we simply return whatever expression is inside the parentheses and let recursion handle going through all the rules and other semantic actions until we calculate a final result.

Run the following command to confirm that the parentheses work correctly.

shell

Copied! ⭐️

peggy calc.pegjs -t '(1)'
# Output: 1

peggy calc.pegjs -t '( 1 )'
# Output: 1

peggy calc.pegjs -t ' (1) '
# Output: 1

peggy calc.pegjs -t '(1 + 2)'
# Output: 3

peggy calc.pegjs -t ' (1 + 2) '
# Output: 3

peggy calc.pegjs -t '(1 + 2) + 3'
# Output: 6

You may have noticed that we're duplicating some information in our grammar inside the Primary grammar rule.

Copied! ⭐️

Primary
  = _ "(" _ @Expression _ ")" _ 
  / _ [0-9]+ _ { return parseInt(text(), 10); }

In both choices, we have a _ before and after the main expression. We can split up the Primary rule into multiples rules to make our grammar look cleaner. In the updated grammar below, Expression rule now derives to a new rule called Group (as in a "whitespace-padded" group) instead of Primary.

Copied! ⭐️

Expression
  = left:Group _ "+" _ right:Expression { return left + right; }
  / Group

Group
  = _ @Primary _

Primary
  = "(" _ @Expression _ ")"
  / [0-9]+ { return parseInt(text(), 10); }

_ "whitespace"
  = [ \t\n\r]*

All of the commands we ran earlier should still pass. When building parsers, it's always a good idea to run test cases frequently. Parsers are fragile creatures. One small change could cause unexpected parse results.

Handling Multiplication

In the previous tutorial, you may recall that we had to add new grammar rules to our LL parser as we added more mathematical operations. In a similar fashion, we need to add new rules and shift our existing rules as we add new mathematical operators with higher precedence.

Copied! ⭐️

Expression
  = left:Term _ "+" _ right:Expression { return left + right; }
  / Term

Term
  = left:Group _ "*" _ right:Term { return left * right; }
  / Group

Group
  = _ @Primary _

Primary
  = "(" _ @Expression _ ")"
  / [0-9]+ { return parseInt(text(), 10); }

_ "whitespace"
  = [ \t\n\r]*

Have you noticed the pattern? The Term rule was inserted between Expression and Group because multiplication has a higher priority than addition and lower priority than parentheses. The starting condition is generally the lowest priority operation in both LL parsers and PEG-based parsers.

Let's run a few commands to make sure addition, multiplication, parentheses, and whitespaces behave correctly.

shell

Copied! ⭐️

peggy calc.pegjs -t '1 * 2'
# Output: 2

peggy calc.pegjs -t '1 + 2 * 3'
# Output: 7

peggy calc.pegjs -t '2 * 3 + 1'
# Output: 7

peggy calc.pegjs -t '(1 + 2) * 3'
# Output: 9

peggy calc.pegjs -t '3 * (1 + 2)'
# Output: 9

peggy calc.pegjs -t ' (1 + 2) * 3 '
# Output: 9

Everything should pass! Next, let's add subtraction and division.

Issues with Subtraction and Division

Implementing subtraction and division in our grammar is a bit tricky with Peggy. Since subtraction has the same precedence as addition and multiplication has the same precedence as division, you would think it would be as straightforward as the following.

Copied! ⭐️

Expression
  = left:Term _ "+" _ right:Expression { return left + right; }
  / left:Term _ "-" _ right:Expression { return left - right; }
  / Term

Term
  = left:Group _ "*" _ right:Term { return left * right; }
  / left:Group _ "/" _ right:Term { return left / right; }
  / Group

Group
  = _ @Primary _

Primary
  = "(" _ @Expression _ ")"
  / [0-9]+ { return parseInt(text(), 10); }

_ "whitespace"
  = [ \t\n\r]*

However, there are issues with this grammar. Let's see what happens when we run the following commands:

shell

Copied! ⭐️

peggy calc.pegjs -t '5 - 2 - 1'
# Output: 4

peggy calc.pegjs -t '12 / 2 / 3'
# Output: 18

I'm no math genius, but I can tell that 5 - 2 - 1 should equal 2 and 12 / 2 / 3 should equal 2. Due to how our grammar is arranged, the following arithmetic happens instead.

Copied! ⭐️

5 - 2 - 1 → 5 - 2 + 1 = 3 + 1 = 4

12 / 2 / 3 → 12 / 2 * 3 = 6 * 3 = 18

In order to resolve these issues, we need to update our grammar to use parsing expression types.

Parsing Expression Types

In the Peggy documentation, you will find information about various parsing expression types. Peggy supports pattern matching by using special symbols and notations. Some notations may even look similar to regular expressions.

As a side note, there's a grammar notation known as Extended Backus–Naur form (EBNF) that uses "regular expression"-like syntax as well. Some parser generators will allow EBNF syntax because it's less verbose than the BNF-like syntax we used when designing grammars with the Syntax tool. Peggy lets us use a similar syntax, but it's not the same as EBNF notation (though it might have been inspired by it).

An important parsing expression type is the (expression)* syntax to match zero or more occurrences of a grammar rule. Let's look at an example.

Create a new grammar file called exp-types.pegjs that has the following contents.

Copied! ⭐️

Food
  = ( Pizza / Donut )*

Pizza
  = 'pizza'

Donut
  = 'donut'

This grammar will accept zero or more occurrences of either the Pizza grammar rule or Donut grammar rule. Let's run a few commands to see how this will work. Remember, Peggy will return an array of parsed tokens by default when there are no semantic actions.

shell

Copied! ⭐️

peggy exp-types.pegjs -t ''
# Output: []

peggy exp-types.pegjs -t 'pizza'
# Output: [ 'pizza' ]

peggy exp-types.pegjs -t 'donut'
# Output: [ 'donut' ]

peggy exp-types.pegjs -t 'pizzapizza'
# Output: [ 'pizza', 'pizza' ]

peggy exp-types.pegjs -t 'pizzadonut'
# Output: [ 'pizza', 'donut' ]

peggy exp-types.pegjs -t 'donutdonut'
# Output: [ 'donut', 'donut' ]

peggy exp-types.pegjs -t 'pizzadonutpizzadonut'
# Output: [ 'pizza', 'donut', 'pizza', 'donut' ]

We can repeat pizza and donut infinitely because we're using the (Pizza/Donut)* parsing expression type.

This knowledge can help us support mathematical operators that have the same precedence such as addition/subtraction and multiplication/division.

Handling Subtraction and Division

Now that we understand parsing expression types a bit more, let's use them to handle subtraction and division in our calc.pegjs grammar file.

Copied! ⭐️

Expression
  = head:Term tail:(_ ("+" / "-") _ Term)* {
      return tail.reduce(function (result, element) {
        if (element[1] === "+") { return result + element[3]; }
        if (element[1] === "-") { return result - element[3]; }
      }, head);
    }

Term
  = head:Group tail:(_ ("*" / "/") _ Group)* {
      return tail.reduce(function (result, element) {
        if (element[1] === "*") { return result * element[3]; }
        if (element[1] === "/") { return result / element[3]; }
      }, head);
    }

Group
  = _ @Primary _

Primary
  = "(" _ @Expression _ ")"
  / [0-9]+ { return parseInt(text(), 10); }

_ "whitespace"
  = [ \t\n\r]*

This grammar may look overwhelming, so let's analyze what's happening inside the Expression rule.

Copied! ⭐️

Expression
  = head:Term tail:(_ ("+" / "-") _ Term)* {
      return tail.reduce(function (result, element) {
        if (element[1] === "+") { return result + element[3]; }
        if (element[1] === "-") { return result - element[3]; }
      }, head);
    }

We're creating a label called head that corresponds to the Term rule name. Then, we're creating a label called tail that corresponds to an expression that repeats zero or more times. To understand the tail.reduce stuff, it might be helpful to insert a couple console.log statements inside the semantic action.

Copied! ⭐️

Expression
  = head:Term tail:(_ ("+" / "-") _ Term)* {
      console.log('head:', head);
      console.log('tail:', tail);
      return tail.reduce(function (result, element) {
        if (element[1] === "+") { return result + element[3]; }
        if (element[1] === "-") { return result - element[3]; }
      }, head);
    }

When we run our calc.pegjs grammar against the Peggy tool, we should see the following arrays for the input, 1 + 2.

shell

Copied! ⭐️

head: 1
tail: [ [ [ ' ' ], '+', [ ' ' ], 2 ] ]

The head is equal to 1, and the tail is equal to the rest of the tokens found in the input. There are arrays that contain ' ' because of the whitespaces. We can use the native reduce method in JavaScript to add or subtract depending on the operator found in the array.

If the grammar rules for Expression still look confusing, we can take away the labels, subexpressions, and semantic actions to obtain the following.

Copied! ⭐️

Expression
  = Term (_ ("+" / "-") _ Term)*

Basically, we're creating a subexpression, ("+" / "-"), within a larger expression (_ ("+" / "-") _ Term)*. The larger expression is repeated zero or more times, similar to what happened in our "pizza-donut" grammar we made earlier. By using a parsing expression type like this in our grammar, we can be sure that addition is only chosen when the next operator is a + symbol and subtraction is only chosen when the next operator is a - symbol.

Using our new grammar, let's rerun the following commands we tried earlier.

shell

Copied! ⭐️

peggy calc.pegjs -t '5 - 2 - 1'
# Output: 2

peggy calc.pegjs -t '12 / 2 / 3'
# Output: 2

Looks like everything is working now!

But wait! There's more! Notice how the reducer function passed into the tail.reduce() method is similar for each math operation. We can create a function called operatorReducer that determines which math operation to apply using simple switch-case statements.

Copied! ⭐️

{{
  function operatorReducer (result, element) {
    const left = result;
    const right = element[3];
    const op = element[1];

    switch (op) {
      case '+':
        return left + right;
      case '-':
        return left - right;
      case '*':
        return left * right;
      case '/':
        return left / right;
      default:
        throw new Error(`Invalid operation: ${op}`);
    }
  }
}}

Expression
  = head:Term tail:(_ ("+" / "-") _ Term)* {
      return tail.reduce(operatorReducer, head);
    }

Term
  = head:Group tail:(_ ("*" / "/") _ Group)* {
      return tail.reduce(operatorReducer, head);
    }

Group
  = _ @Primary _

Primary
  = "(" _ @Expression _ ")"
  / [0-9]+ { return parseInt(text(), 10); }

_ "whitespace"
  = [ \t\n\r]*

Our grammar looks so much cleaner now! 🤩

Peggy lets us inject JavaScript at the top of the grammar file in between a set of double curly braces {{ }}. In the documentation, this is referred to as a "global initializer". The global initializer is executed only once when a generated parser is loaded such as through the command line or with a require or import statement.

Handling Exponentiation

Let's now implement exponentiation in our grammar. This process is similar to multiplication. We create a new rule called Factor and shift the rules for Term such that it points to Factor instead of Group. We also add a new ^ operator inside our operatorReducer function.

Copied! ⭐️

{{
  function operatorReducer (result, element) {
    const left = result;
    const right = element[3];
    const op = element[1];

    switch (op) {
      case '+':
        return left + right;
      case '-':
        return left - right;
      case '*':
        return left * right;
      case '/':
        return left / right;
      case '^':
        return left ** right;
      default:
        throw new Error(`Invalid operation: ${op}`);
    }
  }
}}

Expression
  = head:Term tail:(_ ("+" / "-") _ Term)* {
      return tail.reduce(operatorReducer, head);
    }

Term
  = head:Factor tail:(_ ("*" / "/") _ Factor)* {
      return tail.reduce(operatorReducer, head);
    }

Factor
  = head:Group tail:(_ "^" _ Factor)* {
      return tail.reduce(operatorReducer, head);
    }

Group
  = _ @Primary _

Primary
  = "(" _ @Expression _ ")"
  / [0-9]+ { return parseInt(text(), 10); }

_ "whitespace"
  = [ \t\n\r]*

Notice the tail label refers to Factor within the Factor rule. Therefore, the rule for exponentiation becomes:

text

Copied! ⭐️

Factor = Group ("^" Factor)*

This will ensure that the exponentiation operation is right associative which is important in expressions such as 2 ^ 2 ^ 3, which should be equal to 256.

We can test this grammar by running the following commands:

shell

Copied! ⭐️

peggy calc.pegjs -t '2 ^ 3'
# Output: 8

peggy calc.pegjs -t '2 ^ 2 ^ 3'
# Output: 256

peggy calc.pegjs -t '(2 ^ 2) ^ 3'
# Output: 64

peggy calc.pegjs -t '2 ^ 3 * 2'
# Output: 16

Handling Unary Operations

Adding unary operations is super simple in a PEG-based grammar. Similar to the previous tutorial, we just need to add an extra option in our Primary grammar rule.

Copied! ⭐️

Primary
  = "(" _ @Expression _ ")"
  / "-" expr:Factor { return -expr; }
  / [0-9]+ { return parseInt(text(), 10); }

For the semantic action, we create a new label called expr and return the negative version of it. Thus, we end up with the following grammar:

Copied! ⭐️

{{
  function operatorReducer (result, element) {
    const left = result;
    const right = element[3];
    const op = element[1];

    switch (op) {
      case '+':
        return left + right;
      case '-':
        return left - right;
      case '*':
        return left * right;
      case '/':
        return left / right;
      case '^':
        return left ** right;
      default:
        throw new Error(`Invalid operation: ${op}`);
    }
  }
}}

Expression
  = head:Term tail:(_ ("+" / "-") _ Term)* {
      return tail.reduce(operatorReducer, head);
    }

Term
  = head:Factor tail:(_ ("*" / "/") _ Factor)* {
      return tail.reduce(operatorReducer, head);
    }

Factor
  = head:Group tail:(_ "^" _ Factor)* {
      return tail.reduce(operatorReducer, head);
    }

Group
  = _ @Primary _

Primary
  = "(" _ @Expression _ ")"
  / "-" expr:Factor { return -expr; }
  / [0-9]+ { return parseInt(text(), 10); }

_ "whitespace"
  = [ \t\n\r]*

Similar to the previous tutorial, we use "-" Factor instead of "-" Expression to ensure that the exponential operator has higher precedence than the unary operator.

-2 ^ 2 + 1 is an example of a math expression where exponentiation should occur first before the unary operation is applied.

As always, let's test this grammar by running some commands.

shell

Copied! ⭐️

peggy calc.pegjs -t '-2'
# Output: -2

peggy calc.pegjs -t '-(3)'
# Output: -3

peggy calc.pegjs -t '-1 * 2 ^ 3'
# Output: -8

peggy calc.pegjs -t '-2 ^ 2'
# Output: -4

peggy calc.pegjs -t '(-2) ^ 2'
# Output: 4

peggy calc.pegjs -t '2 ^ 2 + 1'
# Output: 5

peggy calc.pegjs -t '-2 ^ 2 + 1'
# Output: -3

Awesome! Feel free to test each of these expressions in WolframAlpha if you want to verify all of the outputs.

Handling Trigonometric Functions

Next, it's time to handle trigonometric functions. Since we don't have access to a tokenizer in PEGs, we have to create a grammar rule for handling "identifiers" or words such as sin, cos, and tan.

If we search the Peggy documentation for [characters], we'll see that Peggy supports character sets, similar to regular expressions. We've actually been using character sets for handling integer values i.e. [0-9]*. Now, we'd like to use character sets for letters.

Copied! ⭐️

ID = [a-z]+ { return text(); }

This grammar rule says an id is equal to one or more letters. The + means that we should match one or more letters. By using the text() function provided to us by Peggy, we can extract out the whole group of matched text.

We'll then use the ID grammar rule inside the Primary grammar rule.

Copied! ⭐️

ID
  = [a-z]+ { return text(); }

Primary
  = "(" _ @Expression _ ")"
  / "-" expr:Factor { return -expr; }
  / id:ID "(" _ expr:Expression _ ")" { return trig(id, expr); }
  / [0-9]+ { return parseInt(text(), 10); }

You may have noticed that we're making a function call: trig(id, exp). Let's create our trig function and add it to the top section of the grammar file.

Copied! ⭐️

{{
  function trig(id, value) {
    switch (id) {
      case 'sin':
        return Math.sin(value);
      case 'cos':
        return Math.cos(value);
      case 'tan':
        return Math.tan(value);
      default:
        throw new Error(`Invalid operation: ${id}`);
    }
  }
}}

A keen reader will notice that this trig function is identical to the one we used in Part 11 of this tutorial series. The Syntax tool let us inject code inside our grammar file as well.

Once we add the trig function, the calc.pegjs should look like the following.

Copied! ⭐️

{{
  function operatorReducer (result, element) {
    const left = result;
    const right = element[3];
    const op = element[1];

    switch (op) {
      case '+':
        return left + right;
      case '-':
        return left - right;
      case '*':
        return left * right;
      case '/':
        return left / right;
      case '^':
        return left ** right;
      default:
        throw new Error(`Invalid operation: ${op}`);
    }
  }

  function trig(id, value) {
    switch (id) {
      case 'sin':
        return Math.sin(value);
      case 'cos':
        return Math.cos(value);
      case 'tan':
        return Math.tan(value);
      default:
        throw new Error(`Invalid operation: ${id}`);
    }
  }
}}

Expression
  = head:Term tail:(_ ("+" / "-") _ Term)* {
      return tail.reduce(operatorReducer, head);
    }

Term
  = head:Factor tail:(_ ("*" / "/") _ Factor)* {
      return tail.reduce(operatorReducer, head);
    }

Factor
  = head:Group tail:(_ "^" _ Factor)* {
      return tail.reduce(operatorReducer, head);
    }

Group
  = _ @Primary _

ID
  = [a-z]+ { return text(); }

Primary
  = "(" _ @Expression _ ")"
  / "-" expr:Factor { return -expr; }
  / id:ID "(" _ expr:Expression _ ")" { return trig(id, expr); }
  / [0-9]+ { return parseInt(text(), 10); }

_ "whitespace"
  = [ \t\n\r]*

Let's make sure our trigonometric functions behave properly and we handle situations where the user enters an invalid function name.

shell

Copied! ⭐️

peggy calc.pegjs -t 'cos(0)'
# Output: 1

peggy calc.pegjs -t 'sin(0)'
# Output: 0

peggy calc.pegjs -t 'tan(0)'
# Output: 0

peggy calc.pegjs -t 'pineapple(0)'
# Output: Invalid operation: pineapple

Handling Decimal Values

Similar to Part 11, we would like our math parser to support decimal numbers. We have to think about this problem differently than when we created an LR parser with the Syntax tool.

Since we don't have access to a tokenizer that reads from regular expressions, we'll need to rely on some special symbols that Peggy allows. We can use the ? symbol to make a specific token or grammar rule optional. Let's see how we can incorporate decimal values using the ? symbol.

Copied! ⭐️

Decimal
  = [0-9]* ("." [0-9]*)?

Primary
  = "(" _ @Expression _ ")"
  / "-" expr:Factor { return -expr; }
  / id:ID "(" _ expr:Expression _ ")" { return trig(id, expr); }
  / Decimal { return parseFloat(text(), 10); }

We replaced [0-9]* with a new rule called Decimal. Inside this rule, we are saying a decimal value can be zero or more integers followed by an optional group composed of a decimal point . and more integer values. Therefore, the following values should be valid decimal numbers:

Copied! ⭐️

2
2.
2.5
10.50
.5
.50

Since these are all valid decimal values in JavaScript, our semantic actions should work properly when performing arithmetic. Keep in mind that we also changed parseInt to parseFloat in the Primary rule so that we parse the string we get back from the parser as a JavaScript float.

Copied! ⭐️

{{
  function operatorReducer (result, element) {
    const left = result;
    const right = element[3];
    const op = element[1];

    switch (op) {
      case '+':
        return left + right;
      case '-':
        return left - right;
      case '*':
        return left * right;
      case '/':
        return left / right;
      case '^':
        return left ** right;
      default:
        throw new Error(`Invalid operation: ${op}`);
    }
  }

  function trig(id, value) {
    switch (id) {
      case 'sin':
        return Math.sin(value);
      case 'cos':
        return Math.cos(value);
      case 'tan':
        return Math.tan(value);
      default:
        throw new Error(`Invalid operation: ${id}`);
    }
  }
}}

Expression
  = head:Term tail:(_ ("+" / "-") _ Term)* {
      return tail.reduce(operatorReducer, head);
    }

Term
  = head:Factor tail:(_ ("*" / "/") _ Factor)* {
      return tail.reduce(operatorReducer, head);
    }

Factor
  = head:Group tail:(_ "^" _ Factor)* {
      return tail.reduce(operatorReducer, head);
    }

Group
  = _ @Primary _

ID
  = [a-z]+ { return text(); }

Decimal
  = [0-9]* ("." [0-9]*)?

Primary
  = "(" _ @Expression _ ")"
  / "-" expr:Factor { return -expr; }
  / id:ID "(" _ expr:Expression _ ")" { return trig(id, expr); }
  / Decimal { return parseFloat(text(), 10); }

_ "whitespace"
  = [ \t\n\r]*

We can run the following commands to test decimal values with various syntax. The output result might have a different syntax than the input due to JavaScript evaluating the passed string using parseFloat in the semantic action.

shell

Copied! ⭐️

peggy calc.pegjs -t '2'
# Output: 2

peggy calc.pegjs -t '2.'
# Output: 2

peggy calc.pegjs -t '2.5'
# Output: 2.5

peggy calc.pegjs -t '10.50'
# Output: 10.5

peggy calc.pegjs -t '.5'
# Output: 0.5

peggy calc.pegjs -t '.50'
# Output: 0.5

Testing Various Math Expressions

When building parsers, it's always a good idea to come up with as many math expressions as possible to make sure they behave as expected. Test-driven development (TDD) can be super helpful when building parsers. As you add more features to a grammar, you can keep testing various inputs.

For now, we'll just run a bunch of commands to make sure the expressions return the correct output, but I highly encourage everyone to use a testing library such as Jest or Vitest to test as many math expressions as possible!

Let's make sure our grammar still parses a variety of math expression correctly. Remember to test, test, test!

shell

Copied! ⭐️

peggy calc.pegjs -t '1'
# Output: 1

peggy calc.pegjs -t ' 2 '
# Output: 2

peggy calc.pegjs -t '1 + 2'
# Output: 3

peggy calc.pegjs -t ' 1 + 2 '
# Output: 3

peggy calc.pegjs -t '1 + 2 * 3'
# Output: 7

peggy calc.pegjs -t '(1 + 2) * 3'
# Output: 9

peggy calc.pegjs -t '5 - 2'
# Output: 3

peggy calc.pegjs -t '5 - 2 - 1'
# Output: 2

peggy calc.pegjs -t '12 / 2 / 3'
# Output: 2

peggy calc.pegjs -t '2 ^ 3 + 1'
# Output: 9

peggy calc.pegjs -t '-2 ^ 2'
# Output: -4

peggy calc.pegjs -t '(-2) ^ 2'
# Output: 4

peggy calc.pegjs -t '-2 ^ 2 + 1'
# Output: -3

peggy calc.pegjs -t 'cos(0) + 3 * -4 / -2 ^ 2'
# Output: 4

If you copy and paste this into a terminal, everything should parse correctly! 🎉

Generating Parsers as JavaScript Modules

Peggy provides multiple ways of generating a parser as a JavaScript module or JavaScript function. We can either use the command line or call peggy.generate() using Peggy's JavaScript API. In this tutorial, we'll keep it simple and use the command line to generate a parser.

To generate a parser with Peggy, using our completed grammar, run the following command:

shell

Copied! ⭐️

peggy calc.pegjs -o calc-parser.js

This command will use our calc.pegjs grammar file to output a JavaScript file called calc-parser.js that will contain our generated parser.

Then, we can create a new JavaScript file called example.js to import our parser and use it to parse mathematical expressions.

Copied! ⭐️

const parser = require('./calc-parser.js');

const result = parser.parse('1 + 2');

console.log(result); // 3

We can execute this script using Node.js by running the following command:

shell

Copied! ⭐️

node example

The output should be result of the math expression. Congratulations! You made a math expression parser using parsing expression grammars (PEG) and the Peggy parser generator! 🎉

Building Abstract Syntax Trees

In Part 11 of this tutorial series, we updated the semantic actions near each grammar rule to output an AST node instead of computing a result from math expressions. We can accomplish the same task using Peggy. Create a new grammar file called calc-ast.pegjs with the following contents.

Copied! ⭐️

{{
  const functionList = ['sin', 'cos', 'tan'];

  function FunctionExpression(id, value) {
    if (functionList.includes(id)) {
      return {
        type: 'Function',
        name: id,
        value
      }
    }
  }

  function BinaryExpression(operator, left, right) {
    return {
      type: 'BinaryExpression',
      operator,
      left,
      right
    }
  }

  function UnaryExpression(value) {
    return {
      type: 'UnaryExpression',
      value
    }
  }

  function NumericLiteral(value) {
    return {
      type: 'Number',
      value: Number(value)
    }
  }

  function operatorReducer (result, element) {
    const left = result;
    const right = element[3];
    const op = element[1];

    return BinaryExpression(op, left, right);
  }
}}

Expression
  = head:Term tail:(_ ("+" / "-") _ Term)* {
      return tail.reduce(operatorReducer, head);
    }

Term
  = head:Factor tail:(_ ("*" / "/") _ Factor)* {
      return tail.reduce(operatorReducer, head);
    }

Factor
  = head:Group tail:(_ ("^") _ Factor)* {
      return tail.reduce(operatorReducer, head);
    }

Group
  = _ @Primary _

Primary
  = ParenthesizedExpression
  / UnaryExpression
  / Function
  / Decimal

Decimal
  = _ [0-9]* ('.' [0-9]*)? _ { return NumericLiteral(text()) }

UnaryExpression
  = '-' expr:Factor { return UnaryExpression(expr) }

ParenthesizedExpression
  = '(' _ @Expression _ ')'

ID
  = [a-z]+ { return text(); }

Function
  = id:ID expr:ParenthesizedExpression { return FunctionExpression(id, expr) }

_ 'whitespace'
  = [ \t\n\r]*

Run following command to see the AST for the math expression, 1 + 2 * 3.

shell

Copied! ⭐️

peggy calc-ast.pegjs -t '1 + 2 * 3'

We should see the following AST:

json

Copied! ⭐️

{
  "type": "BinaryExpression",
  "operator": "+",
  "left": { "type": "Number", "value": 1 },
  "right": {
    "type": "BinaryExpression",
    "operator": "*",
    "left": { "type": "Number", "value": 2 },
    "right": { "type": "Number", "value": 3 }
  }
}

Next, we can build a JavaScript module using the following command.

shell

Copied! ⭐️

peggy calc-ast.pegjs -o calc-ast-parser.js

Then, we can create a new JavaScript file called calc-ast-example.js and import our newly created parser.

Copied! ⭐️

const parser = require('./calc-ast-parser');

const result = parser.parse('1 + 2 * 3');

console.dir(result, { depth: null }); // log nested objects

We can execute this script using node calc-ast-example in the command line to see the AST for 1 + 2 * 3.

Copied! ⭐️

{
  type: 'BinaryExpression',
  operator: '+',
  left: { type: 'Number', value: 1 },
  right: {
    type: 'BinaryExpression',
    operator: '*',
    left: { type: 'Number', value: 2 },
    right: { type: 'Number', value: 3 }
  }
}

Evaluating the AST

Since our AST is identical to the one we created in Part 8 of this tutorial series, we can reuse our AST evaluator from that tutorial.

Let's create a new file called calc-ast-evaluator.js and add the following contents.

Copied! ⭐️

const parser = require('./calc-ast-parser');

class NodeVisitor {
  visit(node) {
    switch (node.type) {
      case 'Number':
        return this.visitNumber(node);
      case 'BinaryExpression':
        return this.visitBinaryExpression(node);
      case 'UnaryExpression':
        return this.visitUnaryExpression(node);
      case 'Function':
        return this.visitFunction(node);
    }
  }

  visitNumber(node) {
    return node.value;
  }

  visitBinaryExpression(node) {
    switch (node.operator) {
      case '+':
        return this.visit(node.left) + this.visit(node.right);
      case '-':
        return this.visit(node.left) - this.visit(node.right);
      case '*':
        return this.visit(node.left) * this.visit(node.right);
      case '/':
        return this.visit(node.left) / this.visit(node.right);
      case '^':
        return this.visit(node.left) ** this.visit(node.right);
      default:
        throw new Error(`Invalid operation: ${node.operator}`);
    }
  }

  visitUnaryExpression(node) {
    return -this.visit(node.value);
  }

  visitFunction(node) {
    switch (node.name) {
      case 'sin':
        return Math.sin(this.visit(node.value));
      case 'cos':
        return Math.cos(this.visit(node.value));
      case 'tan':
        return Math.tan(this.visit(node.value));
      default:
        throw new Error(`Invalid function: ${node.name}`);
    }
  }
}

const input = '1 + 2 * 3';
const ast = parser.parse(input);
const visitor = new NodeVisitor();
const result = visitor.visit(ast);
console.log(result); // 7

After running node calc-ast-evaluator, we should get the result of 7. The power of this approach is that it lets the user, who imports the library, decide how they want to use the AST. Inside each "visitor" function (i.e. visitNumber, visitBinaryExpression, etc.), we could have done something else besides evaluating the math expression. It gives more freedom to the user on how they want to use the AST.

Conclusion

In this tutorial, we built a math expression parser from a parsing expression grammar (PEG) using the Peggy tool. We compared the similarities and differences between a BNF-like grammar used by the Syntax tool and a PEG used by Peggy.

PEG-based parsers (packrat parsers) can help solve problems that LL and LR parsers suffer from. As mentioned earlier, programming languages such as Python may find them suitable for parsing certain syntax faster. Even problems related to parsing nested parentheses might be easier with PEG-based parsers.

The parsing expression grammars we made in this tutorial can actually help serve as a model for the way developers write recursive descent parsers. In the next tutorial, we'll learn how to build our own recursive descent parser from scratch. No more cheating by using parser generators. It's time to manually create a parser again using JavaScript! See you there! ⭐