Using the Shunting Yard Algorithm with a Tokenizer

Published: Thursday, February 16, 2023

Greetings, friends! This is my sixth post in the math expression parser series. We learned how to build a math expression evaluator using the shunting yard algorithm in Part 4. However, we were limited to only single digit numbers because we didn't have a proper tokenizer. Luckily, in the previous tutorial, we learned how to make one! In this tutorial, we'll learn how to use them together. We'll also learn how to adjust our tokenizer, so we can use decimal numbers and functions in our calculator 😃

Handling Tokens with the Shunting Yard Algorithm

Let's review the finished math expression evaluator from Part 4 that we implemented using the shunting yard algorithm.

js
Copied! ⭐️
const operators = {
  '^': {
    prec: 4,
    assoc: 'right',
  },
  '*': {
    prec: 3,
    assoc: 'left',
  },
  '/': {
    prec: 3,
    assoc: 'left',
  },
  '+': {
    prec: 2,
    assoc: 'left',
  },
  '-': {
    prec: 2,
    assoc: 'left',
  },
};

const assert = (predicate) => {
  if (predicate) return;
  throw new Error('Assertion failed due to invalid token');
};

const evaluate = (input) => {
  const opSymbols = Object.keys(operators);
  const stack = [];
  let output = [];

  const peek = () => {
    return stack.at(-1);
  };

  const addToOutput = (token) => {
    output.push(token);
  };

  const handlePop = () => {
    const op = stack.pop();

    if (op === '(') return;

    const right = parseFloat(output.pop());
    const left = parseFloat(output.pop());

    switch (op) {
      case '+':
        return left + right;
      case '-':
        return left - right;
      case '*':
        return left * right;
      case '/':
        return left / right;
      case '^':
        return left ** right;
      default:
        throw new Error(`Invalid operation: ${op}`);
    }
  };

  const handleToken = (token) => {
    switch (true) {
      case !isNaN(parseFloat(token)):
        addToOutput(token);
        break;
      case opSymbols.includes(token):
        const o1 = token;
        let o2 = peek();

        while (
          o2 !== undefined &&
          o2 !== '(' &&
          (operators[o2].prec > operators[o1].prec ||
            (operators[o2].prec === operators[o1].prec &&
              operators[o1].assoc === 'left'))
        ) {
          addToOutput(handlePop());
          o2 = peek();
        }
        stack.push(o1);
        break;
      case token === '(':
        stack.push(token);
        break;
      case token === ')':
        let topOfStack = peek();
        while (topOfStack !== '(') {
          assert(stack.length !== 0);
          addToOutput(handlePop());
          topOfStack = peek();
        }
        assert(peek() === '(');
        handlePop();
        break;
      default:
        throw new Error(`Invalid token: ${token}`);
    }
  };

  for (let i of input) {
    if (i === ' ') continue;

    handleToken(i);
  }

  while (stack.length !== 0) {
    assert(peek() !== '(');
    addToOutput(handlePop());
  }

  return output[0];
};

As we can see, there's already a handleToken function that handles numbers and operators such as +, -, *, /, and ^. However, we weren't using a proper tokenizer. We were just iterating through each character of the input using a for...of loop.

js
Copied! ⭐️
for (let i of input) {
  if (i === ' ') continue;

  handleToken(i);
}

We need to replace this chunk of code with a loop that will iterate through every token, not just a single character. As you may recall from the end of the previous tutorial, we can use a while loop to iterate through every token.

js
Copied! ⭐️
const tokenizer = new Tokenizer(input);
let token;
while ((token = tokenizer.getNextToken())) {
  handleToken(token.value);
}

We simply call the handleToken function upon every iteration of the loop. Don't forget that our tokens are objects with a type and value property. Therefore, we should pass in token.value into the handleToken function.

We can then place our tokenizer code before the evaluate function. It's up to you if you want to put the tokenizer in a separate module. That would probably be the cleaner approach 😄

Our finished code, including the tokenizer, should look like the following:

js
Copied! ⭐️
const TokenTypes = {
  NUMBER: 'NUMBER',
  ADDITION: '+',
  SUBTRACTION: '-',
  MULTIPLICATION: '*',
  DIVISION: '/',
  EXPONENTIATION: '^',
  PARENTHESIS_LEFT: '(',
  PARENTHESIS_RIGHT: ')'
};

const TokenSpec = [
  [/^\s+/, null],
  [/^\d+/, TokenTypes.NUMBER],
  [/^\+/, TokenTypes.ADDITION],
  [/^\-/, TokenTypes.SUBTRACTION],
  [/^\*/, TokenTypes.MULTIPLICATION],
  [/^\//, TokenTypes.DIVISION],
  [/^\^/, TokenTypes.EXPONENTIATION],
  [/^\(/, TokenTypes.PARENTHESIS_LEFT],
  [/^\)/, TokenTypes.PARENTHESIS_RIGHT]
];

class Tokenizer {
  constructor(input) {
    this.input = input;
    this.cursor = 0;
  }

  hasMoreTokens() {
    return this.cursor < this.input.length;
  }

  match(regex, inputSlice) {
    const matched = regex.exec(inputSlice);
    if (matched === null) {
      return null;
    }

    this.cursor += matched[0].length;
    return matched[0];
  }

  getNextToken() {
    if (!this.hasMoreTokens()) {
      return null;
    }

    const inputSlice = this.input.slice(this.cursor);

    for (let [regex, type] of TokenSpec) {
      const tokenValue = this.match(regex, inputSlice);

      if (tokenValue === null) {
        continue;
      }

      if (type === null) {
        return this.getNextToken();
      }

      return {
        type,
        value: tokenValue,
      };
    }

    throw new SyntaxError(`Unexpected token: "${inputSlice[0]}"`);
  }
}

const operators = {
  '^': {
    prec: 4,
    assoc: 'right',
  },
  '*': {
    prec: 3,
    assoc: 'left',
  },
  '/': {
    prec: 3,
    assoc: 'left',
  },
  '+': {
    prec: 2,
    assoc: 'left',
  },
  '-': {
    prec: 2,
    assoc: 'left',
  },
};

const assert = (predicate) => {
  if (predicate) return;
  throw new Error('Assertion failed due to invalid token');
};

const evaluate = (input) => {
  const opSymbols = Object.keys(operators);
  const stack = [];
  let output = [];

  const peek = () => {
    return stack.at(-1);
  };

  const addToOutput = (token) => {
    output.push(token);
  };

  const handlePop = () => {
    const op = stack.pop();

    if (op === '(') return;

    const right = parseFloat(output.pop());
    const left = parseFloat(output.pop());

    switch (op) {
      case '+':
        return left + right;
      case '-':
        return left - right;
      case '*':
        return left * right;
      case '/':
        return left / right;
      case '^':
        return left ** right;
      default:
        throw new Error(`Invalid operation: ${op}`);
    }
  };

  const handleToken = (token) => {
    switch (true) {
      case !isNaN(parseFloat(token)):
        addToOutput(token);
        break;
      case opSymbols.includes(token):
        const o1 = token;
        let o2 = peek();

        while (
          o2 !== undefined &&
          o2 !== '(' &&
          (operators[o2].prec > operators[o1].prec ||
            (operators[o2].prec === operators[o1].prec &&
              operators[o1].assoc === 'left'))
        ) {
          addToOutput(handlePop());
          o2 = peek();
        }
        stack.push(o1);
        break;
      case token === '(':
        stack.push(token);
        break;
      case token === ')':
        let topOfStack = peek();
        while (topOfStack !== '(') {
          assert(stack.length !== 0);
          addToOutput(handlePop());
          topOfStack = peek();
        }
        assert(peek() === '(');
        handlePop();
        break;
      default:
        throw new Error(`Invalid token: ${token}`);
    }
  };

  const tokenizer = new Tokenizer(input);
  let token;
  while ((token = tokenizer.getNextToken())) {
    handleToken(token.value);
  }

  while (stack.length !== 0) {
    assert(peek() !== '(');
    addToOutput(handlePop());
  }

  return output[0];
};

We can then test to see if it works!

js
Copied! ⭐️
const input = '10 + 20 * 30 - 40';
const result = evaluate(input);
console.log(result); // 570

Pretty simple change! You now have an amazing calculator built entirely in JavaScript that works with multi-digit numbers! But... we can make it even better!

Adding Support for Decimal Values

So far, our calculator only supports integer values. Let's adjust our tokenizer specification to handle decimal values!

As mentioned in the previous tutorial, the main advantage of designing a tokenizer with regular expressions is that we can easily add new tokens or change the regular expression associated with each token.

To support decimal values, we just to need to change the regular expression corresponding to the NUMBER token.

js
Copied! ⭐️
const TokenSpec = [
  [/^\s+/, null],
  [/^(?:\d+(?:\.\d*)?|\.\d+)/, TokenTypes.NUMBER], // Now we support both integers and decimals!
  [/^\+/, TokenTypes.ADDITION],
  [/^\-/, TokenTypes.SUBTRACTION],
  [/^\*/, TokenTypes.MULTIPLICATION],
  [/^\//, TokenTypes.DIVISION],
  [/^\^/, TokenTypes.EXPONENTIATION],
  [/^\(/, TokenTypes.PARENTHESIS_LEFT],
  [/^\)/, TokenTypes.PARENTHESIS_RIGHT]
];

The regular expression for decimal values may look complex, but it lets us consider 1, 1., 0.1, and .1 as valid tokens. These are also considered valid tokens in JavaScript.

Let's try running our math expression evaluator again.

js
Copied! ⭐️
const input = '(9.5 + 10.5) * 30 - 40';
const result = evaluate(input);
console.log(result); // 560

It works! Amazing! 🤩

Adding Support for Functions

Let's now add support for functions. It's common to let people use trigonometric functions in math expressions.

Suppose we wanted to let users type out the following in a calculator:

js
Copied! ⭐️
sin(3.14) * 1000 - 1

We now have to think about a new kind of token for sin. However, we want to support other types of functions like cos and tan. Should we add a new regular expression for each of these functions? We could, but an easier way is to create a new token called an IDENTIFIER.

By convention, it's common to call tokens that have names, such as variables or functions, "identifiers" because we can reuse this token for multiple purposes in programming languages. We'll follow the same convention in this tutorial.

Let's adjust our tokenizer specification to handle a brand new token. We'll then use a regular expression to match any number of letters, since we don't use them anywhere in our math expression evaluator so far.

js
Copied! ⭐️
const TokenSpec = [
  [/^\s+/, null],
  [/^(?:\d+(?:\.\d*)?|\.\d+)/, TokenTypes.NUMBER],
  [/^[a-z]+/, TokenTypes.IDENTIFIER], // Now we can understand letters!
  [/^\+/, TokenTypes.ADDITION],
  [/^\-/, TokenTypes.SUBTRACTION],
  [/^\*/, TokenTypes.MULTIPLICATION],
  [/^\//, TokenTypes.DIVISION],
  [/^\^/, TokenTypes.EXPONENTIATION],
  [/^\(/, TokenTypes.PARENTHESIS_LEFT],
  [/^\)/, TokenTypes.PARENTHESIS_RIGHT]
];

It's important to recognize the order by which the regular expressions are inserted into the token specification, especially if we decide to add more types of tokens later. Inside the tokenizer, we our iterating through each two-element array in order as we scan a math expression, so the order definitely matters.

Make sure the new token type is added to TokenTypes as well.

js
Copied! ⭐️
const TokenTypes = {
  NUMBER: 'NUMBER',
  IDENTIFIER: 'IDENTIFIER', // New token type!
  ADDITION: '+',
  SUBTRACTION: '-',
  MULTIPLICATION: '*',
  DIVISION: '/',
  EXPONENTIATION: '^',
  PARENTHESIS_LEFT: '(',
  PARENTHESIS_RIGHT: ')'
};

Let's also add a new variable called functionList that contain an array of all the possible function names. We can place this anywhere outside the evaluate function.

js
Copied! ⭐️
const functionList = ['sin', 'cos', 'tan'];

We'll create a utility function that checks if the token is a valid function. When we read IDENTIFIER tokens, it's possible that we get something like stuff or sine which are not valid function names. We won't care about the casing.

js
Copied! ⭐️
const isFunction = (token) => {
  return functionList.includes(token.toLowerCase());
};

Next, we need to adjust the handleToken function in our math expression evaluator to support the use of our function names.

When we first created our implementation of the shunting yard algorithm in Part 3, we closely followed pseudocode mentioned on Wikipedia. This algorithm actually included instructions on how to use functions.

js
Copied! ⭐️
while there are tokens to be read:
    read a token
    if the token is:
    - a number:
        put it into the output queue
    - a function:
        push it onto the operator stack

Therefore, we need to add another case statement to our handleToken function to support our list of functions.

js
Copied! ⭐️
const handleToken = (token) => {
  switch (true) {
    case !isNaN(parseFloat(token)):
      addToOutput(token);
      break;
    // Check if the token is a function/identifier!
    case isFunction(token):
      stack.push(token);
      break;
    case opSymbols.includes(token):
      const o1 = token;
      let o2 = peek();

      while (
        o2 !== undefined &&
        o2 !== '(' &&
        (operators[o2].prec > operators[o1].prec ||
          (operators[o2].prec === operators[o1].prec &&
            operators[o1].assoc === 'left'))
      ) {
        addToOutput(handlePop());
        o2 = peek();
      }
      stack.push(o1);
      break;
    case token === '(':
      stack.push(token);
      break;
    case token === ')':
      let topOfStack = peek();
      while (topOfStack !== '(') {
        assert(stack.length !== 0);
        addToOutput(handlePop());
        topOfStack = peek();
      }
      assert(peek() === '(');
      handlePop();
      // Handle the function and compute a value using handlePop
      if (isFunction(peek())) {
        addToOutput(handlePop());
      }
      break;
    default:
      throw new Error(`Invalid token: ${token}`);
  }
};

You may noticed that we also added a small bit of code near the end of the handleToken function inside the token === ')' case statement. Since function calls always use parentheses, we need to handle functions when a right parenthesis is encountered.

Imagine if the input to our math expression evaluator is sin(3.14). We would get the following tokens:

js
Copied! ⭐️
['sin', '(', '3.14', ')']

We would push sin onto the stack. Then, we would push a left parenthesis ( and 3.14 onto the stack. Once we reach the right parenthesis ), we would start popping values off the stack, including the sin keyword. The handlePop function is where we handle mathematical operations. Now, we need to handle function operations.

Our handlePop function should be adjusted to handle all three trigonometric functions.

js
Copied! ⭐️
const handlePop = () => {
  const op = stack.pop();

  if (op === '(') return;

  if (isFunction(op)) {
    const poppedValue = output.pop();
    switch (op) {
      case 'sin':
        return Math.sin(poppedValue);
      case 'cos':
        return Math.cos(poppedValue);
      case 'tan':
        return Math.tan(poppedValue);
    }
  }

  const right = parseFloat(output.pop());
  const left = parseFloat(output.pop());

  switch (op) {
    case '+':
      return left + right;
    case '-':
      return left - right;
    case '*':
      return left * right;
    case '/':
      return left / right;
    case '^':
      return left ** right;
    default:
      throw new Error(`Invalid operation: ${op}`);
  }
};

Notice how the normal mathematical operations require two values to be popped off the stack. That is because they are considered binary operations. The trigonometric functions, however, only need to pop one value off the stack because they only accept one parameter.

The downside with using the shunting yard algorithm is that it's tricky to add functions that have a variable number of parameters. It's also very tricky to deal with negative numbers. However, we can always make utility methods that help solve these problems.

Finished Code

Let's now look at the finished code for our math expression evaluator that can now handle decimal numbers and functions.

js
Copied! ⭐️
const TokenTypes = {
  NUMBER: 'NUMBER',
  IDENTIFIER: 'IDENTIFIER', // New token type!
  ADDITION: '+',
  SUBTRACTION: '-',
  MULTIPLICATION: '*',
  DIVISION: '/',
  EXPONENTIATION: '^',
  PARENTHESIS_LEFT: '(',
  PARENTHESIS_RIGHT: ')'
};

const TokenSpec = [
  [/^\s+/, null],
  [/^(?:\d+(?:\.\d*)?|\.\d+)/, TokenTypes.NUMBER],
  [/^[a-z]+/, TokenTypes.IDENTIFIER], // Now we can understand letters!
  [/^\+/, TokenTypes.ADDITION],
  [/^\-/, TokenTypes.SUBTRACTION],
  [/^\*/, TokenTypes.MULTIPLICATION],
  [/^\//, TokenTypes.DIVISION],
  [/^\^/, TokenTypes.EXPONENTIATION],
  [/^\(/, TokenTypes.PARENTHESIS_LEFT],
  [/^\)/, TokenTypes.PARENTHESIS_RIGHT]
];

class Tokenizer {
  constructor(input) {
    this.input = input;
    this.cursor = 0;
  }

  hasMoreTokens() {
    return this.cursor < this.input.length;
  }

  match(regex, inputSlice) {
    const matched = regex.exec(inputSlice);
    if (matched === null) {
      return null;
    }

    this.cursor += matched[0].length;
    return matched[0];
  }

  getNextToken() {
    if (!this.hasMoreTokens()) {
      return null;
    }

    const inputSlice = this.input.slice(this.cursor);

    for (let [regex, type] of TokenSpec) {
      const tokenValue = this.match(regex, inputSlice);

      if (tokenValue === null) {
        continue;
      }

      if (type === null) {
        return this.getNextToken();
      }

      return {
        type,
        value: tokenValue,
      };
    }

    throw new SyntaxError(`Unexpected token: "${inputSlice[0]}"`);
  }
}

const operators = {
  '^': {
    prec: 4,
    assoc: 'right',
  },
  '*': {
    prec: 3,
    assoc: 'left',
  },
  '/': {
    prec: 3,
    assoc: 'left',
  },
  '+': {
    prec: 2,
    assoc: 'left',
  },
  '-': {
    prec: 2,
    assoc: 'left',
  },
};

const functionList = ['sin', 'cos', 'tan'];

const isFunction = (token) => {
  return functionList.includes(token.toLowerCase());
};

const assert = (predicate) => {
  if (predicate) return;
  throw new Error('Assertion failed due to invalid token');
};

const evaluate = (input) => {
  const opSymbols = Object.keys(operators);
  const stack = [];
  let output = [];

  const peek = () => {
    return stack.at(-1);
  };

  const addToOutput = (token) => {
    output.push(token);
  };

  const handlePop = () => {
    const op = stack.pop();

    if (op === '(') return;

    if (isFunction(op)) {
      const poppedValue = output.pop();
      switch (op) {
        case 'sin':
          return Math.sin(poppedValue);
        case 'cos':
          return Math.cos(poppedValue);
        case 'tan':
          return Math.tan(poppedValue);
      }
    }

    const right = parseFloat(output.pop());
    const left = parseFloat(output.pop());

    switch (op) {
      case '+':
        return left + right;
      case '-':
        return left - right;
      case '*':
        return left * right;
      case '/':
        return left / right;
      case '^':
        return left ** right;
      default:
        throw new Error(`Invalid operation: ${op}`);
    }
  };

  const handleToken = (token) => {
    switch (true) {
      case !isNaN(parseFloat(token)):
        addToOutput(token);
        break;
      case isFunction(token):
        stack.push(token);
        break;
      case opSymbols.includes(token):
        const o1 = token;
        let o2 = peek();

        while (
          o2 !== undefined &&
          o2 !== '(' &&
          (operators[o2].prec > operators[o1].prec ||
            (operators[o2].prec === operators[o1].prec &&
              operators[o1].assoc === 'left'))
        ) {
          addToOutput(handlePop());
          o2 = peek();
        }
        stack.push(o1);
        break;
      case token === '(':
        stack.push(token);
        break;
      case token === ')':
        let topOfStack = peek();
        while (topOfStack !== '(') {
          assert(stack.length !== 0);
          addToOutput(handlePop());
          topOfStack = peek();
        }
        assert(peek() === '(');
        handlePop();
        if (isFunction(peek())) {
          addToOutput(handlePop());
        }
        break;
      default:
        throw new Error(`Invalid token: ${token}`);
    }
  };

  const tokenizer = new Tokenizer(input);
  let token;
  while ((token = tokenizer.getNextToken())) {
    handleToken(token.value);
  }

  while (stack.length !== 0) {
    assert(peek() !== '(');
    addToOutput(handlePop());
  }

  return output[0];
};

const input = 'sin(3.14) * 1000 - 1';
const result = evaluate(input);
console.log(result); // 0.5926529164868282

Conclusion

Congrats! You made it to the end! 🎉

I know...there's quite quite a lot of code that I stuffed into this article 😅. This is only code for building a simple math expression parser and evaluator. Imagine building an entire programming language!

Obviously, there's room for improvement in the code that was presented. The most important thing is that you understand a bit about how parsing works. First, we use a tokenizer to scan an input and create tokens. Then, we perform "semantic actions" that take those tokens, perform various operations, and return a meaningful result.

In our math expression evaluator, the semantic actions would be everything inside the handlePop function. We push tokens to a stack, pop them off to perform mathematical operations, and then push the results back onto the stack. Eventually, we return the only item left on the stack which is the final result.

We now have the knowledge and skills necessary to build a calculator without using the evil eval function, but there's a small problem... We currently can't handle negative numbers using the shunting yard algorithm. In the next tutorial, I'll discuss a small trick for dealing with them. Stretch, take a break, get some water, and I'll see you in the next tutorial 🙂