How to Build a Small Programming Language with a Pratt Parser using JavaScript
Greetings, friends! This is my eighteenth post in the math expression parser series. This is a special bonus tutorial! ⭐
In this tutorial, we'll learn how to build a build a small programming language called "Tiny Math Language" or TML for short. This custom language will support the ability to define and use variables in math expressions. We're still building a math expression parser, but it'll be a bit different from all the previous tutorials. As a base, we'll use the Pratt parser we made in Part 16 of this tutorial series, so it shouldn't be too much extra work 😅
Let's get started!
Introducing the TML Programming Language!
Welcome all to the greatest programming language on the planet! Tiny Math Language (TML)!
Before we start building a programming language, let's set a foundation for what the language grammar should like. As the name implies, TML will be super tiny. The main difference between the TML parser and the Pratt parser we built in Part 16 is that the TML parser will support the declaration of variables. We'll also provide a way to print or log the results of multiple math expressions.
Here's a small code snippet of TML:
const x = 1 + 2 * 3
const y = 2 ^ 2 ^ 3
const z = 3
print(x + y + z)
Yep! That's it! Super basic, right? There are three main features we will add to our math expression parser:
- The
var
keyword to declare variables - The
print
keyword to log the results of an expression - The semicolon
;
to indicate the end of a statement
Although our language will be tiny, it'll be a great learning experience! The main purpose of this tutorial is to take a small step into the world of programming language design.
Grammar Blueprint for TML
When designing a parser, it's always a good idea to create a grammar "blueprint" of our language. Let's look at the completed grammar for TML.
Program
= StatementList
StatementList
= Statement+
Statement
= VariableStatement
/ PrintStatement
/ ExpressionStatement
VariableStatement
= "var" IDENTIFIER "=" Expression ";"
PrintStatement
= "print" ParenthesizedExpression ";"
ExpressionStatement
= Expression ";"
Expression
= Prefix (Infix)*
Prefix
= ParenthesizedExpression
/ UnaryExpression
/ VariableOrFunctionExpression
/ NUMBER
Infix
= ("+" / "-" / "*" / "/" / "^") Expression
ParenthesizedExpression
= "(" Expression ")"
UnaryExpression
= "-" Expression
VariableOrFunctionExpression
= Variable
/ FunctionExpression
Variable
= IDENTIFIER
FunctionExpression
= IDENTIFIER ParenthesizedExpression
I hope this grammar doesn't look too scary 😅 If you've been following along my tutorial series, this should hopefully make sense. A good portion of the above grammar is actually similar to the grammar we created for the Pratt parser in Part 16 of this tutorial series.
Compared to our Pratt parser from earlier, the grammar for TML has a few new additions:
Program
= StatementList
StatementList
= Statement+
Statement
= VariableStatement
/ PrintStatement
/ ExpressionStatement
VariableStatement
= "var" IDENTIFIER "=" Expression ";"
PrintStatement
= "print" ParenthesizedExpression ";"
ExpressionStatement
= Expression ";"
VariableOrFunctionExpression
= Variable
/ FunctionExpression
Variable
= IDENTIFIER
The TML parser will start at Program
instead of Expression
. Then, we'll proceed to go to a StatementList
, but wait...what is a statement?
A statement is an entire group of code that performs a specific action or side effect. In TML, it will be a group of code followed by one semicolon ;
.
It might be helpful to look at the difference between a statement and expression. Let's look at an example of a VariableStatement
.
// VariableStatement
const x = 1 + 2
In the code above, var x = 1 + 2;
is a statement. More specifically, it's a "variable statement". 1 + 2
is an expression inside the "variable statement".
Let's look at an example of a PrintStatement
. Note that this type of grammar rule is specific to TML, so you may not see it in actual programming languages 😅
// PrintStatement
print(1 + 2)
In the code above, print(1 + 2);
is a statement and 1 + 2
is an expression inside that statement.
Next, let's look at an example of an ExpressionStatement
. This type of grammar rule is common among actual programming languages. But wait...how can it be both an expression and statement? An "expression statement" is simply a statement that only includes an expression.
// ExpressionStatement
1 + 2
Notice how an "expression statement" looks very similar to an expression. We'll use the ExpressionStatement
grammar rule to serve as a "container" around an expression. All expressions will lie inside a statement. In TML, we'll use a semicolon ;
to mark the end of a statement to make it easier to parse.
As seen in the grammar above, there will be three different kinds of statements in TML.
Statement
= VariableStatement
/ PrintStatement
/ ExpressionStatement
The TML program will start with StatementList
which will just be an array of one or more statements as indicated by Statement+
.
Program
= StatementList
StatementList
= Statement+
Our parser will go through TML code line-by-line, or rather, statement-by-statement. It'll collect a list of statements and store them in array. When we build an abstract syntax tree (AST) in the next tutorial, it'll make more sense why we store the statements in an array. In this tutorial, however, we'll evaluate each statement immediately.
body
AST node which contains a list of statements.Building a Tokenizer for TML
In Part 5 of this tutorial series, we learned how to build a tokenizer for math expression parsers. It was then updated in Part 6. We've been using this updated version when building parsers by hand including the recursive descent parser in Part 14 and the Pratt parser in Part 16.
Since we're adding new keywords, we need to update our tokenizer once again. We will add the following token types:
{
VAR: 'VAR',
PRINT: 'PRINT',
SEMICOLON: ';',
ASSIGNMENT: '=',
}
We'll also add the following token specifications:
[
[/^\bvar\b/, TokenTypes.VAR],
[/^\bprint\b/, TokenTypes.PRINT],
[/^;/, TokenTypes.SEMICOLON],
[/^\=/, TokenTypes.ASSIGNMENT],
]
Keep in mind that the order of where we place the token specifications matters! Our tokenizer will run through each token spec in order. Create a new file called Tokenizer.js
and insert the following contents to complete our tokenizer.
const TokenTypes = {
VAR: 'VAR',
PRINT: 'PRINT',
NUMBER: 'NUMBER',
IDENTIFIER: 'IDENTIFIER',
ADDITION: '+',
SUBTRACTION: '-',
MULTIPLICATION: '*',
DIVISION: '/',
EXPONENTIATION: '^',
PARENTHESIS_LEFT: '(',
PARENTHESIS_RIGHT: ')',
SEMICOLON: ';',
ASSIGNMENT: '='
}
const TokenSpec = [
[/^\s+/, null],
[/^\bvar\b/, TokenTypes.VAR],
[/^\bprint\b/, TokenTypes.PRINT],
[/^(?:\d+(?:\.\d*)?|\.\d+)/, TokenTypes.NUMBER],
[/^[a-z]+/, TokenTypes.IDENTIFIER],
[/^\+/, TokenTypes.ADDITION],
[/^\-/, TokenTypes.SUBTRACTION],
[/^\*/, TokenTypes.MULTIPLICATION],
[/^\//, TokenTypes.DIVISION],
[/^\^/, TokenTypes.EXPONENTIATION],
[/^\(/, TokenTypes.PARENTHESIS_LEFT],
[/^\)/, TokenTypes.PARENTHESIS_RIGHT],
[/^;/, TokenTypes.SEMICOLON],
[/^\=/, TokenTypes.ASSIGNMENT]
]
class Tokenizer {
constructor(input) {
this.input = input
this.cursor = 0
}
hasMoreTokens() {
return this.cursor < this.input.length
}
match(regex, inputSlice) {
const matched = regex.exec(inputSlice)
if (matched === null)
return null
this.cursor += matched[0].length
return matched[0]
}
getNextToken() {
if (!this.hasMoreTokens())
return null
const inputSlice = this.input.slice(this.cursor)
for (const [regex, type] of TokenSpec) {
const tokenValue = this.match(regex, inputSlice)
// No rule was matched!
if (tokenValue === null)
continue
// Skip whitespace!
if (type === null)
return this.getNextToken()
return {
type,
value: tokenValue
}
}
throw new SyntaxError(`Unexpected token: "${inputSlice[0]}"`)
}
printAllTokens() {
let token
while ((token = this.getNextToken()))
console.log(token)
}
}
module.exports = {
TokenTypes,
Tokenizer
}
Starting our TML Parser
We will use the Pratt parser we built in Part 16 as a base for our TML parser. Below is the finished parser from that tutorial. We'll name this file Parser.js
.
const { TokenTypes, Tokenizer } = require('./Tokenizer')
class Parser {
parse(input) {
this.input = input
this.tokenizer = new Tokenizer(input)
this.lookahead = this.tokenizer.getNextToken()
this.operators = {
'^': 5,
'unary': 4,
'*': 3,
'/': 3,
'+': 2,
'-': 2
}
return this.Expression()
}
eat(tokenType) {
const token = this.lookahead
if (token == null)
throw new SyntaxError(`Unexpected end of input, expected "${tokenType}"`)
if (token.type !== tokenType) {
throw new SyntaxError(
`Unexpected token: "${token.value}", expected "${tokenType}"`
)
}
this.lookahead = this.tokenizer.getNextToken()
return token
}
getPrecedence(token) {
if (token === 'unary')
return this.operators.unary
if (token?.type)
return this.operators[token.value]
return 0
}
trig(id, value) {
switch (id) {
case 'sin':
return Math.sin(value)
case 'cos':
return Math.cos(value)
case 'tan':
return Math.tan(value)
default:
throw new Error(`Invalid operation: ${id}`)
}
}
/**
* Expression
* = Prefix (Infix)*
*/
Expression(prec = 0) {
let left = this.Prefix()
while (prec < this.getPrecedence(this.lookahead))
left = this.Infix(left, this.lookahead?.type)
return left
}
/**
* Prefix
* = ParenthesizedExpression
* / UnaryExpression
* / FunctionExpression
* / NUMBER
*/
Prefix() {
if (this.lookahead.type === TokenTypes.PARENTHESIS_LEFT)
return this.ParenthesizedExpression()
if (this.lookahead.type === TokenTypes.SUBTRACTION)
return this.UnaryExpression()
if (this.lookahead.type === TokenTypes.IDENTIFIER)
return this.FunctionExpression()
const token = this.eat(TokenTypes.NUMBER)
return Number(token.value)
}
/**
* Infix
* = ("+" / "-" / "*" / "/" / "^") Expression
*/
Infix(left, operatorType) {
const token = this.eat(operatorType)
const newPrec = this.operators[token.value]
switch (token.type) {
case TokenTypes.ADDITION:
return left + this.Expression(newPrec)
case TokenTypes.SUBTRACTION:
return left - this.Expression(newPrec)
case TokenTypes.MULTIPLICATION:
return left * this.Expression(newPrec)
case TokenTypes.DIVISION:
return left / this.Expression(newPrec)
case TokenTypes.EXPONENTIATION:
return left ** this.Expression(newPrec - 1)
}
}
/**
* ParenthesizedExpression
* = "(" Expression ")"
*/
ParenthesizedExpression() {
this.eat(TokenTypes.PARENTHESIS_LEFT)
const expression = this.Expression()
this.eat(TokenTypes.PARENTHESIS_RIGHT)
return expression
}
/**
* UnaryExpression
* = "-" Expression
*/
UnaryExpression() {
this.eat(TokenTypes.SUBTRACTION)
return -this.Expression(this.getPrecedence('unary'))
}
/**
* FunctionExpression
* = IDENTIFIER ParenthesizedExpression
*/
FunctionExpression() {
const id = this.eat(TokenTypes.IDENTIFIER).value
const expression = this.ParenthesizedExpression()
return this.trig(id, expression)
}
}
module.exports = { Parser }
This parser can only accept expressions, so it's time to add statements!
Handling StatementList
The first thing we need to change in our parser is the starting point. Currently, the parser
method returns this.Expression
. We will change it to this.Program
instead.
parse(input) {
this.input = input
this.tokenizer = new Tokenizer(input)
this.lookahead = this.tokenizer.getNextToken()
this.operators = {
'^': 5,
unary: 4,
'*': 3,
'/': 3,
'+': 2,
'-': 2
}
return this.Program() // Our new starting point!
}
Next, we will add a Program
method to our Parser
class:
/**
* Program
* = StatementList
*/
Program() {
return this.StatementList()
}
The Program
method will then call the StatementList
which will begin the recursive descent. Let's add the StatementList
method to the Parser
class:
/**
* StatementList
* = Statement+
*/
StatementList() {
const statementList = [this.Statement()]
while (this.lookahead !== null) {
statementList.push(this.Statement())
}
return statementList
}
The StatementList
grammar rule derives to a list of one or statements as indicated by the +
sign in Statement+
. Any time we're dealing repetition, we can use a loop such as a while
loop.
We will initialize the statementList
variable to the first Statement
in the program. Then, we'll keep pushing new statements into the statementList
array until we've reached a null
token. That is, we continue until we reach the end of input. The input will be our small TML program.
Handling Statement and ExpressionStatement
Currently, we can't run our code successfully yet. We're missing an implementation of the Statement
method. Let's add the following code inside the Parser
class:
/**
* Statement
* = ExpressionStatement
*/
Statement() {
return this.ExpressionStatement()
}
For now, we'll only support the ExpressionStatement
. This statement is the simplest, so it's a good starting point. We'll add support for the PrintStatement
and VariableStatement
methods soon.
Let's create a method for ExpressionStatement
:
/**
* ExpressionStatement
* = Expression ";"
*/
ExpressionStatement() {
const expression = this.Expression()
this.eat(TokenTypes.SEMICOLON)
return expression
}
This statement is just an Expression
followed by a semicolon ;
. Yes, I know. The infamous semicolon. When testing the parser, don't forget to include a semicolon at the end of an expression!
Our parser should have the following code so far:
const { TokenTypes, Tokenizer } = require('./Tokenizer')
class Parser {
parse(input) {
this.input = input
this.tokenizer = new Tokenizer(input)
this.lookahead = this.tokenizer.getNextToken()
this.operators = {
'^': 5,
'unary': 4,
'*': 3,
'/': 3,
'+': 2,
'-': 2
}
return this.Program()
}
eat(tokenType) {
const token = this.lookahead
if (token == null)
throw new SyntaxError(`Unexpected end of input, expected "${tokenType}"`)
if (token.type !== tokenType) {
throw new SyntaxError(
`Unexpected token: "${token.value}", expected "${tokenType}"`
)
}
this.lookahead = this.tokenizer.getNextToken()
return token
}
getPrecedence(token) {
if (token === 'unary')
return this.operators.unary
if (token && token.type != TokenTypes.PARENTHESIS_RIGHT)
return this.operators[token.value]
return 0
}
trig(id, value) {
switch (id) {
case 'sin':
return Math.sin(value)
case 'cos':
return Math.cos(value)
case 'tan':
return Math.tan(value)
default:
throw new Error(`Invalid operation: ${id}`)
}
}
/**
* Program
* = StatementList
*/
Program() {
return this.StatementList()
}
/**
* StatementList
* = Statement+
*/
StatementList() {
const statementList = [this.Statement()]
while (this.lookahead !== null)
statementList.push(this.Statement())
return statementList
}
/**
* Statement
* = ExpressionStatement
*/
Statement() {
return this.ExpressionStatement()
}
/**
* ExpressionStatement
* = Expression ";"
*/
ExpressionStatement() {
const expression = this.Expression()
this.eat(TokenTypes.SEMICOLON)
return expression
}
/**
* Expression
* = Prefix (Infix)*
*/
Expression(prec = 0) {
let left = this.Prefix()
while (prec < this.getPrecedence(this.lookahead))
left = this.Infix(left, this.lookahead?.type)
return left
}
/**
* Prefix
* = ParenthesizedExpression
* / UnaryExpression
* / FunctionExpression
* / NUMBER
*/
Prefix() {
if (this.lookahead.type === TokenTypes.PARENTHESIS_LEFT)
return this.ParenthesizedExpression()
if (this.lookahead.type === TokenTypes.SUBTRACTION)
return this.UnaryExpression()
if (this.lookahead.type === TokenTypes.IDENTIFIER)
return this.FunctionExpression()
const token = this.eat(TokenTypes.NUMBER)
return Number(token.value)
}
/**
* Infix
* = ("+" / "-" / "*" / "/" / "^") Expression
*/
Infix(left, operatorType) {
const token = this.eat(operatorType)
const newPrec = this.operators[token.value]
switch (token.type) {
case TokenTypes.ADDITION:
return left + this.Expression(newPrec)
case TokenTypes.SUBTRACTION:
return left - this.Expression(newPrec)
case TokenTypes.MULTIPLICATION:
return left * this.Expression(newPrec)
case TokenTypes.DIVISION:
return left / this.Expression(newPrec)
case TokenTypes.EXPONENTIATION:
return left ** this.Expression(newPrec - 1)
}
}
/**
* ParenthesizedExpression
* = "(" Expression ")"
*/
ParenthesizedExpression() {
this.eat(TokenTypes.PARENTHESIS_LEFT)
const expression = this.Expression()
this.eat(TokenTypes.PARENTHESIS_RIGHT)
return expression
}
/**
* UnaryExpression
* = "-" Expression
*/
UnaryExpression() {
this.eat(TokenTypes.SUBTRACTION)
return -this.Expression(this.getPrecedence('unary'))
}
/**
* FunctionExpression
* = IDENTIFIER ParenthesizedExpression
*/
FunctionExpression() {
const id = this.eat(TokenTypes.IDENTIFIER).value
const expression = this.ParenthesizedExpression()
return this.trig(id, expression)
}
}
module.exports = { Parser }
Let's add the following code at the end of our Parser.js
file.
const parser = new Parser()
console.log(parser.parse('1 + 1; 2 * 3;')) // Don't forget the semicolons!
We can execute our code by running the following command:
node Parser
# Output: [ 2, 6 ]
After executing this code, we should get the following array: [2, 6]
. Why do we get an array? As you may recall, we're building an array of statements in our TML parser. We're no longer printing a single result to the console anymore.
The StatementList
method will return an array of statements and each statement will return an evaluated expression. In the next tutorial, we'll build an AST where each statement returns an AST node representing the expression instead of an evaluated value.
Handling PrintStatement
For convenience, we would like to let the user print results declaratively instead of displaying a weird array of numbers. That is where the "print statement" comes in handy. It will act like a console.log
function to display the results of an expression.
Let's adjust our Statement
method to call PrintStatement
when the lookahead token is a print
token.
/**
* Statement
* = PrintStatement
* / ExpressionStatement
*/
Statement() {
if (this.lookahead.type === TokenTypes.PRINT) {
return this.PrintStatement()
}
return this.ExpressionStatement()
}
Then, we'll add the PrintStatement
method to the Parser
class.
/**
* PrintStatement
* = "print" ParenthesizedExpression ";"
*/
PrintStatement() {
this.eat(TokenTypes.PRINT)
const expression = this.ParenthesizedExpression()
this.eat(TokenTypes.SEMICOLON)
console.log(expression) // Print the result of the expression!
}
The console.log
line is the main purpose of the PrintStatement
method. Since our parser is built using JavaScript, it will use JavaScript as its runtime environment language. We've been using Node.js to execute our parser. Node.js is built using the V8 engine which uses a C++ runtime to execute JavaScript. For our TML language, we're using Node.js as our runtime, so it'll take care of logging information to the console for us.
Our parser should now have the following code:
const { TokenTypes, Tokenizer } = require('./Tokenizer')
class Parser {
parse(input) {
this.input = input
this.tokenizer = new Tokenizer(input)
this.lookahead = this.tokenizer.getNextToken()
this.operators = {
'^': 5,
'unary': 4,
'*': 3,
'/': 3,
'+': 2,
'-': 2
}
return this.Program()
}
eat(tokenType) {
const token = this.lookahead
if (token == null)
throw new SyntaxError(`Unexpected end of input, expected "${tokenType}"`)
if (token.type !== tokenType) {
throw new SyntaxError(
`Unexpected token: "${token.value}", expected "${tokenType}"`
)
}
this.lookahead = this.tokenizer.getNextToken()
return token
}
getPrecedence(token) {
if (token === 'unary')
return this.operators.unary
if (token && token.type != TokenTypes.PARENTHESIS_RIGHT)
return this.operators[token.value]
return 0
}
trig(id, value) {
switch (id) {
case 'sin':
return Math.sin(value)
case 'cos':
return Math.cos(value)
case 'tan':
return Math.tan(value)
default:
throw new Error(`Invalid operation: ${id}`)
}
}
/**
* Program
* = StatementList
*/
Program() {
return this.StatementList()
}
/**
* StatementList
* = Statement+
*/
StatementList() {
const statementList = [this.Statement()]
while (this.lookahead !== null)
statementList.push(this.Statement())
return statementList
}
/**
* Statement
* = PrintStatement
* / ExpressionStatement
*/
Statement() {
if (this.lookahead.type === TokenTypes.PRINT)
return this.PrintStatement()
return this.ExpressionStatement()
}
/**
* PrintStatement
* = "print" ParenthesizedExpression ";"
*/
PrintStatement() {
this.eat(TokenTypes.PRINT)
const expression = this.ParenthesizedExpression()
this.eat(TokenTypes.SEMICOLON)
console.log(expression)
}
/**
* ExpressionStatement
* = Expression ";"
*/
ExpressionStatement() {
const expression = this.Expression()
this.eat(TokenTypes.SEMICOLON)
return expression
}
/**
* Expression
* = Prefix (Infix)*
*/
Expression(prec = 0) {
let left = this.Prefix()
while (prec < this.getPrecedence(this.lookahead))
left = this.Infix(left, this.lookahead?.type)
return left
}
/**
* Prefix
* = ParenthesizedExpression
* / UnaryExpression
* / FunctionExpression
* / NUMBER
*/
Prefix() {
if (this.lookahead.type === TokenTypes.PARENTHESIS_LEFT)
return this.ParenthesizedExpression()
if (this.lookahead.type === TokenTypes.SUBTRACTION)
return this.UnaryExpression()
if (this.lookahead.type === TokenTypes.IDENTIFIER)
return this.FunctionExpression()
const token = this.eat(TokenTypes.NUMBER)
return Number(token.value)
}
/**
* Infix
* = ("+" / "-" / "*" / "/" / "^") Expression
*/
Infix(left, operatorType) {
const token = this.eat(operatorType)
const newPrec = this.operators[token.value]
switch (token.type) {
case TokenTypes.ADDITION:
return left + this.Expression(newPrec)
case TokenTypes.SUBTRACTION:
return left - this.Expression(newPrec)
case TokenTypes.MULTIPLICATION:
return left * this.Expression(newPrec)
case TokenTypes.DIVISION:
return left / this.Expression(newPrec)
case TokenTypes.EXPONENTIATION:
return left ** this.Expression(newPrec - 1)
}
}
/**
* ParenthesizedExpression
* = "(" Expression ")"
*/
ParenthesizedExpression() {
this.eat(TokenTypes.PARENTHESIS_LEFT)
const expression = this.Expression()
this.eat(TokenTypes.PARENTHESIS_RIGHT)
return expression
}
/**
* UnaryExpression
* = "-" Expression
*/
UnaryExpression() {
this.eat(TokenTypes.SUBTRACTION)
return -this.Expression(this.getPrecedence('unary'))
}
/**
* FunctionExpression
* = IDENTIFIER ParenthesizedExpression
*/
FunctionExpression() {
const id = this.eat(TokenTypes.IDENTIFIER).value
const expression = this.ParenthesizedExpression()
return this.trig(id, expression)
}
}
module.exports = { Parser }
Let's add the following code at the end of our Parser.js
file.
const parser = new Parser()
parser.parse('print(1 + 2 * 3);')
We can execute our code by running the following command:
node Parser
# Output: 7
After executing this code, we should see only one value printed to the console: 7
. Now our parser handles printing to the console for us whenever we use a "print statement". We can use the print
statement to help debug as well.
Handling VariableStatement
The final statement we need to implement is VariableStatement
. This statement is responsible for handling variable declarations in TML. Whenever a programming language needs to store values, it uses a data structure called a symbol table.
A common data structure used to implement a symbol table is a hash table (aka hash map). In JavaScript, we can use regular objects to store values like a hash table or hash map.
Inside our Parser.js
file, let's create a new class called SymbolTable
and insert the following contents:
class SymbolTable {
constructor() {
this.symbolTable = {}
}
set(name, value) {
this.symbolTable[name] = value
}
get(name) {
if (this.has(name))
return this.symbolTable[name]
}
has(name) {
if (this.symbolTable.hasOwnProperty(name))
return true
else throw new Error(`${name} has not been declared.`)
}
}
Suppose we had the following TML program:
const x = 2
print(x)
When encountering a "variable statement", our parser can use the set
method to store a variable named x
with a value of 2
:
const symbolTable = new SymbolTable()
symbolTable.set(x, 2)
Later, we can retrieve the value stored in variable x
using the get
method:
symbolTable.get(x)
When we run print(x)
in our TML program, the get
method in our SymbolTable
class will retrieve x
and return a value of 2
. If the variable was never defined, then the get
method will throw an error.
Inside the parse
method of our Parser
class, let's instantiate a new symbol table:
parse(input) {
this.input = input
this.tokenizer = new Tokenizer(input)
this.lookahead = this.tokenizer.getNextToken()
this.operators = {
'^': 5,
unary: 4,
'*': 3,
'/': 3,
'+': 2,
'-': 2
}
// Instantiate a new symbol table!
this.symbolTable = new SymbolTable()
return this.Program()
}
Then, we'll update our Statement
method to call the VariableStatement
method when we encounter a var
token:
/**
* Statement
* = VariableStatement
* / PrintStatement
* / ExpressionStatement
*/
Statement() {
if (this.lookahead.type === TokenTypes.VAR) {
return this.VariableStatement()
}
if (this.lookahead.type === TokenTypes.PRINT) {
return this.PrintStatement()
}
return this.ExpressionStatement()
}
Let's add the VariableStatement
method to the Parser
class:
/**
* VariableStatement
* = "var" IDENTIFIER "=" Expression ";"
*/
VariableStatement() {
this.eat(TokenTypes.VAR)
const name = this.eat(TokenTypes.IDENTIFIER).value
this.eat(TokenTypes.ASSIGNMENT)
const value = this.Expression()
this.eat(TokenTypes.SEMICOLON)
this.symbolTable.set(name, value)
}
This method will consume an IDENTIFIER
token such as myVar
, x
, y
, etc. and use it as the name
parameter in the SymbolTable.set
method. The value
will be equal to the result of the expression after the =
token.
Retrieving Variable Values
We can now store variables when they're declared in TML code, but there's still the issue of retrieving them. When we use a variable in an expression such as x + 1
, our parser needs to know how to retrieve the value stored in that variable, so it can evaluate the expression.
In the Pratt parser tutorials, our Prefix
rule looked like the following.
Prefix
= ParenthesizedExpression
/ UnaryExpression
/ FunctionExpression
/ NUMBER
However, we now have a conflict between a variable and function in TML. Our Prefix
method currently looks like the following:
Prefix() {
if (this.lookahead.type === TokenTypes.PARENTHESIS_LEFT) {
return this.ParenthesizedExpression()
}
if (this.lookahead.type === TokenTypes.SUBTRACTION) {
return this.UnaryExpression()
}
if (this.lookahead.type === TokenTypes.IDENTIFIER) {
return this.FunctionExpression()
}
const token = this.eat(TokenTypes.NUMBER)
return Number(token.value)
}
Suppose we had the identifier, myVar
. How would our parser know if myVar
is a function name or a variable name? We can determine if the identifier belongs to a function name by checking the next token. If the next token is a left parenthesis (
after the IDENTIFIER
token, then we can say that myVar
belongs to a function name.
In the example below, we have three statements in TML code.
const myVar = 0
print(myVar)
print(cos(myVar))
The first statement is declaring a variable named myVar
and setting it equal to zero. The second statement is printing the value stored in myVar
. Notice how myVar
is not followed by a left parenthesis (
. The third statement is printing the expression, cos(myVar)
. Since cos
is an identifier that precedes a left parenthesis, the parser knows that it is a function and not a variable.
In the grammar blueprint we defined at the beginning of this tutorial, we had to make an adjustment to the Prefix
rule.
Prefix
= ParenthesizedExpression
/ UnaryExpression
/ VariableOrFunctionExpression
/ NUMBER
VariableOrFunctionExpression
= Variable
/ FunctionExpression
Variable
= IDENTIFIER
FunctionExpression
= IDENTIFIER ParenthesizedExpression
A Prefix
can resolve to a VariableOrFunctionExpression
rule which then can resolve to either Variable
or FunctionExpression
. Let's implement all these methods in our parser.
/**
* Prefix
* = ParenthesizedExpression
* / UnaryExpression
* / VariableOrFunctionExpression
* / NUMBER
*/
Prefix() {
if (this.lookahead.type === TokenTypes.PARENTHESIS_LEFT) {
return this.ParenthesizedExpression()
}
if (this.lookahead.type === TokenTypes.SUBTRACTION) {
return this.UnaryExpression()
}
if (this.lookahead.type === TokenTypes.IDENTIFIER) {
return this.VariableOrFunctionExpression()
}
const token = this.eat(TokenTypes.NUMBER)
return Number(token.value)
}
/**
* VariableOrFunctionExpression
* = FunctionExpression
* / Variable
*/
VariableOrFunctionExpression() {
const id = this.eat(TokenTypes.IDENTIFIER).value
if (this.lookahead.type === TokenTypes.PARENTHESIS_LEFT) {
return this.FunctionExpression(id)
}
return this.Variable(id)
}
/**
* Variable
* = IDENTIFIER
*/
Variable(id) {
return this.symbolTable.get(id)
}
/**
* FunctionExpression
* = IDENTIFIER ParenthesizedExpression
*/
FunctionExpression(id) {
const expression = this.ParenthesizedExpression()
return this.trig(id, expression)
}
Inside the Variable
method, we can pass in an identifier, id
, which will be the name of the variable. Then, we can use SymbolTable.get
to retrieve the value stored in that variable.
Finally, our TML parser is fully complete and should look like the following:
const { TokenTypes, Tokenizer } = require('./Tokenizer')
class SymbolTable {
constructor() {
this.symbolTable = {}
}
set(name, value) {
this.symbolTable[name] = value
}
get(name) {
if (this.has(name))
return this.symbolTable[name]
}
has(name) {
if (this.symbolTable.hasOwnProperty(name))
return true
else throw new Error(`${name} has not been declared.`)
}
}
class Parser {
parse(input) {
this.input = input
this.tokenizer = new Tokenizer(input)
this.lookahead = this.tokenizer.getNextToken()
this.operators = {
'^': 5,
'unary': 4,
'*': 3,
'/': 3,
'+': 2,
'-': 2
}
this.symbolTable = new SymbolTable()
return this.Program()
}
eat(tokenType) {
const token = this.lookahead
if (token == null)
throw new SyntaxError(`Unexpected end of input, expected "${tokenType}"`)
if (token.type !== tokenType) {
throw new SyntaxError(
`Unexpected token: "${token.value}", expected "${tokenType}"`
)
}
this.lookahead = this.tokenizer.getNextToken()
return token
}
getPrecedence(token) {
if (token === 'unary')
return this.operators.unary
if (token && token.type != TokenTypes.PARENTHESIS_RIGHT)
return this.operators[token.value]
return 0
}
trig(id, value) {
switch (id) {
case 'sin':
return Math.sin(value)
case 'cos':
return Math.cos(value)
case 'tan':
return Math.tan(value)
default:
throw new Error(`Invalid operation: ${id}`)
}
}
/**
* Program
* = StatementList
*/
Program() {
return this.StatementList()
}
/**
* StatementList
* = Statement+
*/
StatementList() {
const statementList = [this.Statement()]
while (this.lookahead !== null)
statementList.push(this.Statement())
return statementList
}
/**
* Statement
* = VariableStatement
* / PrintStatement
* / ExpressionStatement
*/
Statement() {
if (this.lookahead.type === TokenTypes.VAR)
return this.VariableStatement()
if (this.lookahead.type === TokenTypes.PRINT)
return this.PrintStatement()
return this.ExpressionStatement()
}
/**
* VariableStatement
* = "var" IDENTIFIER "=" Expression ";"
*/
VariableStatement() {
this.eat(TokenTypes.VAR)
const name = this.eat(TokenTypes.IDENTIFIER).value
this.eat(TokenTypes.ASSIGNMENT)
const value = this.Expression()
this.eat(TokenTypes.SEMICOLON)
this.symbolTable.set(name, value)
}
/**
* PrintStatement
* = "print" ParenthesizedExpression ";"
*/
PrintStatement() {
this.eat(TokenTypes.PRINT)
const expression = this.ParenthesizedExpression()
this.eat(TokenTypes.SEMICOLON)
console.log(expression)
}
/**
* ExpressionStatement
* = Expression ";"
*/
ExpressionStatement() {
const expression = this.Expression()
this.eat(TokenTypes.SEMICOLON)
return expression
}
/**
* Expression
* = Prefix (Infix)*
*/
Expression(prec = 0) {
let left = this.Prefix()
while (prec < this.getPrecedence(this.lookahead))
left = this.Infix(left, this.lookahead?.type)
return left
}
/**
* Prefix
* = ParenthesizedExpression
* / UnaryExpression
* / VariableOrFunctionExpression
* / NUMBER
*/
Prefix() {
if (this.lookahead.type === TokenTypes.PARENTHESIS_LEFT)
return this.ParenthesizedExpression()
if (this.lookahead.type === TokenTypes.SUBTRACTION)
return this.UnaryExpression()
if (this.lookahead.type === TokenTypes.IDENTIFIER)
return this.VariableOrFunctionExpression()
const token = this.eat(TokenTypes.NUMBER)
return Number(token.value)
}
/**
* Infix
* = ("+" / "-" / "*" / "/" / "^") Expression
*/
Infix(left, operatorType) {
const token = this.eat(operatorType)
const newPrec = this.operators[token.value]
switch (token.type) {
case TokenTypes.ADDITION:
return left + this.Expression(newPrec)
case TokenTypes.SUBTRACTION:
return left - this.Expression(newPrec)
case TokenTypes.MULTIPLICATION:
return left * this.Expression(newPrec)
case TokenTypes.DIVISION:
return left / this.Expression(newPrec)
case TokenTypes.EXPONENTIATION:
return left ** this.Expression(newPrec - 1)
}
}
/**
* ParenthesizedExpression
* = "(" Expression ")"
*/
ParenthesizedExpression() {
this.eat(TokenTypes.PARENTHESIS_LEFT)
const expression = this.Expression()
this.eat(TokenTypes.PARENTHESIS_RIGHT)
return expression
}
/**
* UnaryExpression
* = "-" Expression
*/
UnaryExpression() {
this.eat(TokenTypes.SUBTRACTION)
return -this.Expression(this.getPrecedence('unary'))
}
/**
* VariableOrFunctionExpression
* = FunctionExpression
* / Variable
*/
VariableOrFunctionExpression() {
const id = this.eat(TokenTypes.IDENTIFIER).value
if (this.lookahead.type === TokenTypes.PARENTHESIS_LEFT)
return this.FunctionExpression(id)
return this.Variable(id)
}
/**
* Variable
* = IDENTIFIER
*/
Variable(id) {
return this.symbolTable.get(id)
}
/**
* FunctionExpression
* = IDENTIFIER ParenthesizedExpression
*/
FunctionExpression(id) {
const expression = this.ParenthesizedExpression()
return this.trig(id, expression)
}
}
module.exports = { Parser }
We can test to make sure variables can be retrieved correctly using the following code:
const parser = new Parser()
parser.parse(`
var x = cos(0);
print(x);
print(1 + 2 * x);
`)
/* OUTPUT:
1
3
*/
Testing our Parser
Similar to the previous tutorial, we'll create a new file called tests.js
for running tests. This time, we'll test various statements instead of math expressions. We need to make sure our programming language works correctly with all kinds of combinations of tokens, so it's a good idea to test as many combinations as possible.
Insert the following contents inside the tests.js
file.
const assert = require('node:assert').strict
const { Parser } = require('./Parser')
const mathTests = [
// Expression statements
['1;', [1]],
[' 2 ;', [2]],
['1 + 2;', [3]],
[' 1 + 2 ;', [3]],
['1 + 2 * 3;', [7]],
['(1 + 2) * 3;', [9]],
['5 - 2;', [3]],
['5 - 2 - 1;', [2]],
['12 / 2 / 3;', [2]],
['2 ^ 3 + 1;', [9]],
['-2 ^ 2;', [-4]],
['(-2) ^ 2;', [4]],
['-2 ^ 2 + 1;', [-3]],
['cos(0) + 3 * -4 / -2 ^ 2;', [4]],
// Variable Statements
['var x = 1; x + 1;', [undefined, 2]],
['var x = (1 + 2) * 3; x + 1;', [undefined, 10]],
['var x = 1; var y = 2; x + y;', [undefined, undefined, 3]]
]
const parser = new Parser()
let result
for (const [expression, expected] of mathTests) {
try {
result = parser.parse(expression)
assert.deepEqual(result, expected)
}
catch (err) {
const lines = '-'.repeat(process.stdout.columns)
console.log(lines)
console.log(`Expression failed: "${expression}"`)
console.log(`Expected result: ${expected}`)
console.log(`Actual result: ${result}`)
console.log(lines)
throw err
}
}
console.log('All tests passed! 🎉')
We can execute this file using the following command:
node tests
After running this command, all the tests should pass! Notice how the expected result is now an array of values. We're using assert.deepEqual
to compare arrays instead of the assert.equal
method we used in previous tutorials.
The Program
method is where our parser starts. This method returns a "statement list" which is an array of statements. Since we're evaluating statements on the fly, our arrays should contain either undefined
or a numeric value. Remember, our VariableStatement
method sets the value of a variable and stores it in a symbol table. Then, it returns undefined
, so we should expect nothing to be returned from "variable statements".
Running .tml Files
Time for some fun! Suppose we want to run .tml
files using our TML parser from the command line. The .tml
files will contain only TML code.
Create a new file called example.tml
with the following contents.
const x = 1
const y = 2
const z = 3
const average = (x + y + z) / 3
print(average)
Don't forget the semicolons! The code above will compute the average between three numbers. Let's write a Node.js script called tml-runner
(with no file extension) that will import our TML parser and execute a file passed to it.
#!/usr/bin/env node
const fs = require('node:fs/promises')
const { Parser } = require('./Parser')
const parser = new Parser()
async function run() {
try {
const code = await fs.readFile(process.argv[2], { encoding: 'utf8' })
parser.parse(code)
}
catch (err) {
console.log(err)
}
}
run()
The line at the top is called a shebang and lets us run tml-runner
as an executable.
#!/usr/bin/env node
Next, we need to change permissions of this file so that we're allowed to run it as an executable on our computer. In MacOS and Linux operating systems, we can run the following command to make tml-runner
an executable.
chmod +x tml-runner
Now, we can run tml-runner
as an executable and pass in a .tml
file:
./tml-runner example.tml
After executing the above command, a value of 2
should display in the console. Congrats! You built your very first interpreter! 🎉
We can even "pipe" the output of the example.tml
file into a JavaScript program. Create a new file called display.js
with the following contents:
let data = ''
process.stdin.on('data', (chunk) => {
data += chunk
})
process.stdin.on('end', () => {
console.log(`The average is: ${data} 🎉🎉🎉`)
})
Next, run the following command to pipe the output of example.tml
into the input of display.js
.
./tml-runner example.tml | node display.js
After running the above command, we should see the following in our terminal:
The average is: 2
🎉🎉🎉
The result from example.tml
gets piped into a normal JavaScript file where we can use the result in any way we want. I know this example seems silly, but it showcases the power of using our own language together with another language. In our case, we're able to use TML together with JavaScript!
Conclusion
In this tutorial, we created a small programming language called "Tiny Math Language" or TML for short. We took the math expression Pratt parser from a previous tutorial and restructured it so that it could parse statements. We used statements to add support for storing variables in a math expression parser.
Then, we created a file to test various statements and math expressions. We also learned how to make our very own interpreter for the TML language using a Node.js runtime environment. By creating an interpreter for the TML language that could be run via the command line, we could pipe a result into a regular JavaScript file to allow communication between a .tml
file and .js
file. Pretty cool! 😎
In this tutorial, we were evaluating statements and math expressions on the fly. In the next tutorial, we'll learn how to make a brand new abstract syntax tree (AST) that supports the statements we added to our parser. See you there!