Nov 15, 2014 Shunting Yard, Javascript

Shunting Yard (Part 3)

Let’s continue adding to our Shunting Yard code base.

In Part 2, we created operator precedence. This allowed us to delay application of operators indefinitely, using a stack. Next, we’ll focus on adding parentheses, so that the end-user can manually override precedence.

Series: Part 1, Part 2, Part 3, …

Recursion

The secret sauce that will allow us to handle parentheses correctly is recursion.

And if you think about it, it makes sense. Imagine we have a function that can correctly parse an expression, to include parentheses, and returns the root node of the AST. Inside that function, if we were to come across a parenthesis, then we would want to parse that inner expression, and use the result (the root node of the inner expression’s AST) as a node in the larger AST.

There are two things we’ll need to keep in mind:

Our function must be able to parse an expression starting at the current location in the token stream. We can’t assume that we’re starting at the first token anymore, because when we call ourselves recursively, we’ll be in the middle of the expression.
Our function must exit when we run out of tokens (old behavior), but also when we see a ). Why? Because the function must exit when it is done reading an inner expression, which ends with ).

New Symbols

First, we would normally modify our lexer to correctly handle parentheses. Since this series is only focused on the parser, we’ll assume the lexer transforms a string like this:

2 * (3 + 4) - 10 / 2

Into the following token stream:

var tokens = [
  { type: 'num', value:  2  },
  { type: 'sym', value: '*' },
  { type: 'sym', value: '(' },
  { type: 'num', value:  3  },
  { type: 'sym', value: '+' },
  { type: 'num', value:  4  },
  { type: 'sym', value: ')' },
  { type: 'sym', value: '-' },
  { type: 'num', value: 10  },
  { type: 'sym', value: '/' },
  { type: 'num', value:  2  }
];

The Code

Here’s the code, which we’ll inspect in further detail below:

function ParenShuntingYard(tokens){
  // helper function to get next token and validate its type
  var tokenIndex = 0;
  function NextToken(){
    var args = [].slice.call(arguments);
    if (tokenIndex >= tokens.length)
      throw 'invalid expression: missing token';
    var tok = tokens[tokenIndex];
    tokenIndex++;
    if (args.indexOf(tok.type) < 0)
      throw 'invalid expression: expecting ' + args.join(' ');
    return tok;
  }

  // helper function to apply binary op based on symbol
  function ApplyBinOp(symbol, left, right){
    var sym_to_type = {
      '+': 'add',
      '-': 'sub',
      '*': 'mul',
      '/': 'div'
    };
    if (!sym_to_type[symbol])
      throw 'invalid expression: unknown operator ' + symbol;
    return {
      type: sym_to_type[symbol],
      parms: [left, right]
    };
  }

  // determines if we apply left operator before right
  function ApplyLeftFirst(leftSymbol, rightSymbol){
    var sym_to_prec = { // map of symbol to precedence
      '+': 0,
      '-': 0,
      '*': 1,
      '/': 1
    };
    return sym_to_prec[leftSymbol] >= sym_to_prec[rightSymbol];
  }

  // parse expression at current location and return root node
  function ParseExpr(){
    var tok, nextValue;
    var opStack = [];
    var valStack = [];

    function ApplyTopOp(){
      var op = opStack.shift();
      var rightValue = valStack.shift();
      var leftValue = valStack.shift();
      valStack.unshift(ApplyBinOp(op, leftValue, rightValue));
    }

    while (1){
      // grab a terminal OR a left paren
      tok = NextToken('num', 'sym');
      if (tok.type == 'num'){
        // create the terminal node
        nextValue = {
            type: 'val',
            value: tok.value
          };
      }
      else{ // tok.type is 'sym'
        if (tok.value == '('){
          // if we have left paren, call ourselves recursively
          nextValue = ParseExpr();
          // ensure we have a right paren
          tok = NextToken('sym');
          if (tok.value != ')')
            throw 'missing right parenthesis';
        }
        else
          throw 'invalid expression';
      }

      // save the value to be operated on
      valStack.unshift(nextValue);

      // if we're out of tokens, then exit
      if (tokenIndex >= tokens.length)
        break;
      // if our next token is a right paren, then exit
      if (tokens[tokenIndex].type == 'sym' &&
        tokens[tokenIndex].value == ')')
        break;
      // otherwise, we must have a binary operator

      // grab an operator
      tok = NextToken('sym');

      // make sure it isn't an open paren
      if (tok.value == '(')
        throw 'invalid expression';

      // check to see if we have previous operations to apply
      while (opStack.length > 0 &&
        ApplyLeftFirst(opStack[0], tok.value))
        ApplyTopOp();

      // save the new operator to be applied later
      opStack.unshift(tok.value);
    }

    // apply any left over operators
    while (opStack.length > 0)
      ApplyTopOp();

    // all done, so return the top of the tree
    return valStack[0];
  }

  // kick everything off
  var ast = ParseExpr();

  // make sure we consumed the entire token stream
  if (tokenIndex < tokens.length)
    throw 'invalid expression';

  // return the parsed expression
  return ast;
}

This seems like a hairy beast at first, but if you look at our previous code, you’ll see there aren’t many changes.

Let’s get the small stuff out of the way:

NextToken

The NextToken function now allows a variable number of arguments to be passed to it. It validates that the token type is one of the arguments.

Don’t be scared. This line looks strange if you’ve never seen it:

var args = [].slice.call(arguments);

This is nothing more than converting arguments into an array (called args). This is necessary for JavaScript, because the arguments variable behaves really goofy according to the specification, which is a design flaw that we’re stuck with. Oh well.

Other than that, we simply validate that tok.type is inside the args array, and return the token.

Kicking off ParseExpr

The next biggest change is that the main loop is now inside a function, ParseExpr. This is the piece that we will call recursively… but before we get to that, let’s jump to the end:

// kick off everything
var ast = ParseExpr();

// make sure we consumed the entire token stream
if (tokenIndex < tokens.length)
  throw 'invalid expression';

// return the parsed expression
return ast;

We parse the expression by simply calling our magic ParseExpr function, and storing the result in ast. Simple.

We also have an extra check to make sure that we’ve consumed the entire token stream. The reason for this is because ParseExpr will exit when there is a right parenthesis – so if a user has extra right parentheses, then ParseExpr will return before consuming everything. Without the check, ParenShuntingYard would succeed for expressions like 2 + 3) 1 2 3 4, which would be bad.

ParseExpr Body

Now for the meat.

Calling Ourselves

An inner expression can exist anywhere a terminal can exist. Thankfully, there is only one location where we read terminals – at the top of the main loop. Therefore, the first major change is to accept numbers or symbols:

var tok = NextToken('num', 'sym');

If tok.type is 'num', then we’ve received a terminal, and our previous code works the same:

if (tok.type == 'num'){
  // create the terminal node
  nextValue = {
    type: 'val',
    value: tok.value
  };
}
...

However, if tok.type is 'sym', and tok.value is '(', then we have an inner expression.

...
else{ // tok.type is 'sym'
  if (tok.value == '('){
    // if we have left paren, call ourselves recursively
    nextValue = ParseExpr();
    // ensure we have a right paren
    tok = NextToken('sym');
    if (tok.value != ')')
      throw 'missing right parenthesis';
  }
  else
    throw 'invalid expression';
}

Notice that this code expects ParseExpr to return before consuming the right parenthesis. That means this piece needs to consume and confirm it exists, and error otherwise.

At the end of this section of code, our node is stored in nextValue, whether from a terminal, or inner expression.

Exiting Correctly

Next, we exit the main loop if we are out of tokens – but now we also have to exit if we find a right parenthesis:

// if we're out of tokens, then exit
if (tokenIndex >= tokens.length)
  break;
// if our next token is a right paren, then exit
if (tokens[tokenIndex].type == 'sym' &&
  tokens[tokenIndex].value == ')')
  break;
// otherwise, we must have a binary operator

Remember that we need to exit if the next token is a right parenthesis, but we shouldn’t consume the token with NextToken, because the previous section of code is responsible for that. That simply means that we peek ahead using tokens[tokenIndex], without increasing tokenIndex.

Checking for Errors

Lastly, because both operators and parenthesis share a token type of 'sym', we can no longer assume that NextToken('sym') is guaranteed to read an operator. We add some extra error checking to ensure we really read an operator:

// make sure it isn't an open paren
if (tok.value == '(')
  throw 'invalid expression';

We don’t need to check if it’s a right parenthesis, because we already did that while peeking above.

All Done!

And… that’s it! Not so bad, really.

Does it work?

var tokens = [
  { type: 'num', value:  2  },
  { type: 'sym', value: '*' },
  { type: 'sym', value: '(' },
  { type: 'num', value:  3  },
  { type: 'sym', value: '+' },
  { type: 'num', value:  4  },
  { type: 'sym', value: ')' },
  { type: 'sym', value: '-' },
  { type: 'num', value: 10  },
  { type: 'sym', value: '/' },
  { type: 'num', value:  2  }
];
ParenShuntingYard(tokens);
=> {
  type: 'sub',
  parms: [{
      type: 'mul',
      parms: [{
          type: 'val',
          value: 2
        }, {
          type: 'add',
          parms: [{
              type: 'val',
              value: 3
            }, {
              type: 'val',
              value: 4
            }
          ]
        }
      ]
    }, {
      type: 'div',
      parms: [{
          type: 'val',
          value: 10
        }, {
          type: 'val',
          value: 2
        }
      ]
    }
  ]
}

This is really great!

We are really close to having a fully featured expression parser. We can’t support unary operators yet, and ternary is a little ways off, but it won’t be long now. Stay tuned for the next article in the series!