How to Build a Math Expression Tokenizer in JavaScript

Published: Tuesday, February 14, 2023

Updated: Thursday, February 16, 2023

Greetings, friends! This is my fifth post in the math expression parser series. In this tutorial, we'll learn how to build a tokenizer for the math expression evaluator we created in the previous tutorial!

Why do we need Tokenizers?

Tokenizers are very important in parsing because it lets us group together characters into meaningful symbols which we refer to as tokens. When we built a math expression evaluator using the Shunting yard algorithm in the previous tutorial, we were only able to evaluate expressions that contained one-digit numbers such as 1, 2, 3, etc. Now we want to use multi-digit numbers such as 10, 100, 1234, etc.

When building compilers and interpreters for programming languages, it's important to use tokenizers to create tokens for common keywords such as if, else, var, etc. Tokenizers also help us identify what group of characters count as strings or function names. For the math expression evaluator in this tutorial, we'll only focus on creating tokens for multi-digit numbers. Later, we'll create tokens for function names such as sin, cos, and more!

Building a Tokenizer with Loops

Before we begin building our tokenizer, let's think about the thought process on how we might go about creating one.

Suppose we had the following string:

Copied! ⭐️

const input = '10 + 20'

We should immediately be able to identify three tokens: 10, +, and 20. How can we create these tokens using code? We first need to iterate through the original input string character by character.

What's the best way to iterate through a string? Well, if you read my tutorial on iterators, iterables, and generators, then you know that strings in JavaScript are considered iterable. This means we can iterate through a string using a for...of loop.

Copied! ⭐️

const input = '10 + 20'

for (const character of input)
  console.log(character)

/* OUTPUT:
1
0
(whitespace)
+
(whitespace)
2
0
*/

The downside of using a for-of loop is that it's tricky to detect when we're at the end of the input string. Since we're building a math expression tokenizer, we can add a symbol to the end of the string that we don't plan on using anywhere in the math expression. Let's use the epsilon (ε) symbol to denote the end of input has been reached.

Copied! ⭐️

const input = '10 + 20ε'

for (const character of input) {
  console.log(character)

  if (character === 'ε')
    console.log('END OF INPUT!')
}

/* OUTPUT:
1
0
(whitespace)
+
(whitespace)
2
0
ε
END OF INPUT!
*/

We can check if a character is a number by iterating through the input string until we reach a whitespace or an addition (+) symbol. We just keep appending to a string which we'll call buffer that will hold each character of the entire numeric token. Once we reach a whitespace or addition symbol, we'll add whatever is in buffer to an array called tokens. Then, we clear the buffer and keep going.

Below is a function called tokenizer that accepts a string as input and returns an array of tokens.

Copied! ⭐️

function tokenizer(baseInput) {
  const input = `${baseInput}ε`
  const tokens = []
  let buffer = ''

  for (const character of input) {
    // Skip this character if it's a whitespace
    if (character === ' ') {
      // If the buffer contains any characters, push
      // them to the tokens array and clear the buffer
      if (buffer.length > 0) {
        tokens.push(buffer)
        buffer = ''
      }

      continue
    }

    // Check if character is a number
    if (!isNaN(Number.parseFloat(character))) {
      buffer += character
      continue
    }

    // Check if character is an operator
    if (character === '+') {
      if (buffer.length > 0) {
        tokens.push(buffer)
        buffer = ''
      }

      tokens.push(character)
      continue
    }

    // If we reach the end of the string,
    // push whatever is in the buffer
    if (character === 'ε') {
      if (buffer.length > 0) {
        tokens.push(buffer)
        buffer = ''
      }
    }
  }
  return tokens
}

const input = '10 + 20'
const result = tokenizer(input)
console.log(result) // ['10', '+', '20']

You may have noticed some similar logic in regards to the buffer. We can make a small helper function called flushBuffer to make our code a bit cleaner.

Copied! ⭐️

function tokenizer(baseInput) {
  const input = `${baseInput}ε`
  const tokens = []
  let buffer = ''

  const flushBuffer = () => {
    if (buffer.length > 0) {
      tokens.push(buffer)
      buffer = ''
    }
  }

  for (const character of input) {
    if (character === ' ') {
      flushBuffer()
      continue
    }

    if (!isNaN(Number.parseFloat(character))) {
      buffer += character
      continue
    }

    if (character === '+') {
      flushBuffer()
      tokens.push(character)
      continue
    }

    if (character === 'ε')
      flushBuffer()
  }
  return tokens
}

const input = '10 + 20'
const result = tokenizer(input)
console.log(result) // ['10', '+', '20']

If we don't want to append the epsilon (ε) character at the end, or if we want to use regular for loops instead of for...of loops, we can use the following code instead. Notice that we have to check if the end of the string is reached when detecting a numeric character to make sure that we push the buffer to the tokens array when the last character in the input is a number.

Copied! ⭐️

function tokenizer(input) {
  const tokens = []
  let buffer = ''

  const flushBuffer = () => {
    if (buffer.length > 0) {
      tokens.push(buffer)
      buffer = ''
    }
  }

  for (let i = 0; i < input.length; i++) {
    const character = input[i]

    if (character === ' ') {
      flushBuffer()

      continue
    }

    if (!isNaN(Number.parseFloat(character))) {
      buffer += character

      if (i === input.length - 1)
        flushBuffer()

      continue
    }

    if (character === '+') {
      flushBuffer()

      tokens.push(character)
      continue
    }

    if (i === input.length - 1)
      flushBuffer()
  }

  return tokens
}

const input = '10 + 20'
const result = tokenizer(input)
console.log(result) // ['10', '+', '20']

Remember! There's usually more than one way to solve a problem in software development! 🙂

So far, we have coded a very basic tokenizer. Currently, our tokenizer only understands the following tokens: numbers and the addition (+) symbol. We want it to understand all the other mathematical operations we used in our math expression evaluation from the previous tutorial. We also need to understand parentheses, since those are valid tokens.

Therefore, the following tokens should be recognized:

Multi-digit numbers
+ symbol (for addition)
- symbol (for subtraction)
* symbol (for multiplication)
/ symbol (for division)
^ symbol (for exponentiation)
( symbol
) symbol

As we add more operators and more functionality to our tokenizer, it tends to get a bit messy when dealing with for loops. It would be nice if we could create a specification for our tokens and use this specification in our tokenizer so that we can easily add new types of tokens.

Let's now create a tokenizer using regular expressions and build a Tokenizer class to create a more modular solution that makes it simple to add new token types.

Building a Tokenizer with Regular Expressions

Regular expressions are great for matching text inside a long string such as a math expression provided by a user. However, it's quite possible that tokenizer implementations that use regular expressions are slower than using regular for loops. This definitely depends on a case-by-case basis, but it's something to keep in mind if you're trying to build the fastest tokenizer in the world 😅

In this tutorial, we don't plan on building the most performant tokenizer. It's good to understand the fundamentals first and optimize things later. Regular expressions make it super easy to detect whether part of the string is a number, whitespace, letters, and more, which makes them the perfect tools for building a tokenizer.

tip

If you need help testing out regular expressions expression, Regex101 is a great resource!

Alright, let's get started! First, we first declare some constants that we will use inside our tokenizer.

Copied! ⭐️

const TokenTypes = {
  NUMBER: 'NUMBER',
  ADDITION: '+',
  SUBTRACTION: '-',
  MULTIPLICATION: '*',
  DIVISION: '/',
  EXPONENTIATION: '^',
  PARENTHESIS_LEFT: '(',
  PARENTHESIS_RIGHT: ')',
}

const TokenSpec = [
  [/^\s+/, null],
  [/^\d+/, TokenTypes.NUMBER],
  [/^\+/, TokenTypes.ADDITION],
  [/^\-/, TokenTypes.SUBTRACTION],
  [/^\*/, TokenTypes.MULTIPLICATION],
  [/^\//, TokenTypes.DIVISION],
  [/^\^/, TokenTypes.EXPONENTIATION],
  [/^\(/, TokenTypes.PARENTHESIS_LEFT],
  [/^\)/, TokenTypes.PARENTHESIS_RIGHT],
]

We now have a variable called TokenTypes that contains a list of all the possible types a token can be. We use null to correspond to whitespaces, so that it's easy to skip them as we iterate through the input string.

Then, we create an array called TokenSpec that defines our token specification. Each element in this specification is a two-element array. The first element of the two-element array is a regular expression. The second element is the type of token that should be returned when the corresponding regular expression is matched.

By defining a token specification using regular expressions, it's super easy to add new tokens that our tokenizer will understand!

Let's now construct a Tokenizer class that will use our defined token types and regular expressions.

Copied! ⭐️

class Tokenizer {
  constructor(input) {
    this.input = input
    this.cursor = 0
  }
}

First, we define a constructor for the Tokenizer class that will accept our input string and assign it to an instance variable. We then define a cursor property that will keep track of where we are in the math expression.

If we are given an input of 98 + 76, then we can expect to the cursor to start at 0, right before 9. When we discover that 98 is a number token, the cursor should be at position 2, right after the 8.

We'll then add a method called hasMoreTokens for determining when the entire input has been read, so we know that we're done tokenizing. If the cursor position is equal to the length of input or greater, then we know we're done scanning the string.

Copied! ⭐️

class Tokenizer {
  constructor(input) {
    this.input = input
    this.cursor = 0
  }

  hasMoreTokens() {
    return this.cursor < this.input.length
  }
}

So far, so good! Next, we'll add two more methods that work together: match and getNextToken.

Copied! ⭐️

class Tokenizer {
  constructor(input) {
    this.input = input
    this.cursor = 0
  }

  hasMoreTokens() {
    return this.cursor < this.input.length
  }

  match(regex, inputSlice) {
    const matched = regex.exec(inputSlice)
    if (matched === null)
      return null

    this.cursor += matched[0].length
    return matched[0]
  }

  getNextToken() {
    if (!this.hasMoreTokens())
      return null

    const inputSlice = this.input.slice(this.cursor)

    for (const [regex, type] of TokenSpec) {
      const tokenValue = this.match(regex, inputSlice)

      // No rule was matched!
      if (tokenValue === null)
        continue

      // Skip whitespace!
      if (type === null)
        return this.getNextToken()

      return {
        type,
        value: tokenValue,
      }
    }

    throw new SyntaxError(`Unexpected token: "${inputSlice[0]}"`)
  }
}

The getNextToken method is used to retrieve the next token from the input string automatically by using regular expressions specified in our token specification. It will iterate through every two-element array in TokenSpec using a for...of loop and array destructuring.

Notice that the getNextToken returns an object with a type and value field. This is a very common convention used when building a tokenizer for a parser. Tokens are just objects with a type and value!

The match method will test a regular expression against a slice of the original input. If there was a successful match, it will move the cursor to a position after the matched slice and return the match.

Let's take a look at the finished code for our tokenizer!

Copied! ⭐️

const TokenTypes = {
  NUMBER: 'NUMBER',
  ADDITION: '+',
  SUBTRACTION: '-',
  MULTIPLICATION: '*',
  DIVISION: '/',
  EXPONENTIATION: '^',
  PARENTHESIS_LEFT: '(',
  PARENTHESIS_RIGHT: ')'
}

const TokenSpec = [
  [/^\s+/, null],
  [/^\d+/, TokenTypes.NUMBER],
  [/^\+/, TokenTypes.ADDITION],
  [/^\-/, TokenTypes.SUBTRACTION],
  [/^\*/, TokenTypes.MULTIPLICATION],
  [/^\//, TokenTypes.DIVISION],
  [/^\^/, TokenTypes.EXPONENTIATION],
  [/^\(/, TokenTypes.PARENTHESIS_LEFT],
  [/^\)/, TokenTypes.PARENTHESIS_RIGHT]
]

class Tokenizer {
  constructor(input) {
    this.input = input
    this.cursor = 0
  }

  hasMoreTokens() {
    return this.cursor < this.input.length
  }

  match(regex, inputSlice) {
    const matched = regex.exec(inputSlice)
    if (matched === null)
      return null

    this.cursor += matched[0].length
    return matched[0]
  }

  getNextToken() {
    if (!this.hasMoreTokens())
      return null

    const inputSlice = this.input.slice(this.cursor)

    for (const [regex, type] of TokenSpec) {
      const tokenValue = this.match(regex, inputSlice)

      // No rule was matched!
      if (tokenValue === null)
        continue

      // Skip whitespace!
      if (type === null)
        return this.getNextToken()

      return {
        type,
        value: tokenValue,
      }
    }

    throw new SyntaxError(`Unexpected token: "${inputSlice[0]}"`)
  }
}

It's beautiful. Simple yet elegant 🥲

Using the Tokenizer

Now that we have a fancy tokenizer built, let's use it!

Copied! ⭐️

const input = '10 + 20 * 30 - 40'
const tokenizer = new Tokenizer(input)

console.log(tokenizer.getNextToken()) // {type: 'NUMBER', value: '10'}
console.log(tokenizer.getNextToken()) // {type: '+', value: '+'}
console.log(tokenizer.getNextToken()) // {type: 'NUMBER', value: '20'}
console.log(tokenizer.getNextToken()) // {type: '*', value: '*'}
console.log(tokenizer.getNextToken()) // {type: 'NUMBER', value: '30'}
console.log(tokenizer.getNextToken()) // {type: '-', value: '-'}
console.log(tokenizer.getNextToken()) // {type: 'NUMBER', value: '40'}
console.log(tokenizer.getNextToken()) // 'null' because end of input reached!

You might have noticed right away that this looks like a JavaScript generator. Indeed, it does look that way! We essentially created our own version of the iterator pattern.

I like to create a helpful utility method that returns a full list of tokens so that we don't have to keep calling the getNextToken over and over. Keep in mind that once our tokenizer instance has reached the end of the input string and iterated over every token, it returns null. Therefore, it's easy to create a simple while loop that logs a new token until the token is a falsy value.

Copied! ⭐️

function printAllTokens(input) {
  const tokenizer = new Tokenizer(input)

  let token
  while ((token = tokenizer.getNextToken()))
    console.log(token)
}

const input = '10 + 20 * 30 - 40'
printAllTokens(input)

/* OUTPUT:
{type: 'NUMBER', value: '10'}
{type: '+', value: '+'}
{type: 'NUMBER', value: '20'}
{type: '*', value: '*'}
{type: 'NUMBER', value: '30'}
{type: '-', value: '-'}
{type: 'NUMBER', value: '40'}
*/

Conclusion

In this tutorial, we have created a modular and reusable tokenizer that can be used for math expressions. We saw how to create a small tokenizer using multiple techniques. By using regular expressions, we can easily create new token types without having to change much code.

The tokenizer we built in this tutorial can be used in other environments, not just for building a math expression evaluator! We can reuse the same tokenizer for building a small programming language if we wanted to!

In the next tutorial, we'll learn how to add decimal numbers and function keywords to our tokenizer. Then, we'll learn how to use the tokenizer with the math expression evaluator we built using the Shunting yard algorithm. By the end, you'll be able to create a fully functional calculator without having to use the evil eval function ever again! 🙂