Shunting Yard, Javascript
Shunting Yard (Part 3)
Let’s continue adding to our Shunting Yard code base.
In Part 2, we created operator precedence. This allowed us to delay application of operators indefinitely, using a stack. Next, we’ll focus on adding parentheses, so that the end-user can manually override precedence.
Series: Part 1, Part 2, Part 3, …
Recursion
The secret sauce that will allow us to handle parentheses correctly is recursion.
And if you think about it, it makes sense. Imagine we have a function that can correctly parse an expression, to include parentheses, and returns the root node of the AST. Inside that function, if we were to come across a parenthesis, then we would want to parse that inner expression, and use the result (the root node of the inner expression’s AST) as a node in the larger AST.
There are two things we’ll need to keep in mind:
- Our function must be able to parse an expression starting at the current location in the token stream. We can’t assume that we’re starting at the first token anymore, because when we call ourselves recursively, we’ll be in the middle of the expression.
- Our function must exit when we run out of tokens (old behavior), but also when we see a
)
. Why? Because the function must exit when it is done reading an inner expression, which ends with)
.
New Symbols
First, we would normally modify our lexer to correctly handle parentheses. Since this series is only focused on the parser, we’ll assume the lexer transforms a string like this:
2 * (3 + 4) - 10 / 2
Into the following token stream:
var tokens = [
{ type: 'num', value: 2 },
{ type: 'sym', value: '*' },
{ type: 'sym', value: '(' },
{ type: 'num', value: 3 },
{ type: 'sym', value: '+' },
{ type: 'num', value: 4 },
{ type: 'sym', value: ')' },
{ type: 'sym', value: '-' },
{ type: 'num', value: 10 },
{ type: 'sym', value: '/' },
{ type: 'num', value: 2 }
];
The Code
Here’s the code, which we’ll inspect in further detail below:
function ParenShuntingYard(tokens){
// helper function to get next token and validate its type
var tokenIndex = 0;
function NextToken(){
var args = [].slice.call(arguments);
if (tokenIndex >= tokens.length)
throw 'invalid expression: missing token';
var tok = tokens[tokenIndex];
tokenIndex++;
if (args.indexOf(tok.type) < 0)
throw 'invalid expression: expecting ' + args.join(' ');
return tok;
}
// helper function to apply binary op based on symbol
function ApplyBinOp(symbol, left, right){
var sym_to_type = {
'+': 'add',
'-': 'sub',
'*': 'mul',
'/': 'div'
};
if (!sym_to_type[symbol])
throw 'invalid expression: unknown operator ' + symbol;
return {
type: sym_to_type[symbol],
parms: [left, right]
};
}
// determines if we apply left operator before right
function ApplyLeftFirst(leftSymbol, rightSymbol){
var sym_to_prec = { // map of symbol to precedence
'+': 0,
'-': 0,
'*': 1,
'/': 1
};
return sym_to_prec[leftSymbol] >= sym_to_prec[rightSymbol];
}
// parse expression at current location and return root node
function ParseExpr(){
var tok, nextValue;
var opStack = [];
var valStack = [];
function ApplyTopOp(){
var op = opStack.shift();
var rightValue = valStack.shift();
var leftValue = valStack.shift();
valStack.unshift(ApplyBinOp(op, leftValue, rightValue));
}
while (1){
// grab a terminal OR a left paren
tok = NextToken('num', 'sym');
if (tok.type == 'num'){
// create the terminal node
nextValue = {
type: 'val',
value: tok.value
};
}
else{ // tok.type is 'sym'
if (tok.value == '('){
// if we have left paren, call ourselves recursively
nextValue = ParseExpr();
// ensure we have a right paren
tok = NextToken('sym');
if (tok.value != ')')
throw 'missing right parenthesis';
}
else
throw 'invalid expression';
}
// save the value to be operated on
valStack.unshift(nextValue);
// if we're out of tokens, then exit
if (tokenIndex >= tokens.length)
break;
// if our next token is a right paren, then exit
if (tokens[tokenIndex].type == 'sym' &&
tokens[tokenIndex].value == ')')
break;
// otherwise, we must have a binary operator
// grab an operator
tok = NextToken('sym');
// make sure it isn't an open paren
if (tok.value == '(')
throw 'invalid expression';
// check to see if we have previous operations to apply
while (opStack.length > 0 &&
ApplyLeftFirst(opStack[0], tok.value))
ApplyTopOp();
// save the new operator to be applied later
opStack.unshift(tok.value);
}
// apply any left over operators
while (opStack.length > 0)
ApplyTopOp();
// all done, so return the top of the tree
return valStack[0];
}
// kick everything off
var ast = ParseExpr();
// make sure we consumed the entire token stream
if (tokenIndex < tokens.length)
throw 'invalid expression';
// return the parsed expression
return ast;
}
This seems like a hairy beast at first, but if you look at our previous code, you’ll see there aren’t many changes.
Let’s get the small stuff out of the way:
NextToken
The NextToken
function now allows a variable number of arguments to be passed to it. It validates
that the token type is one of the arguments.
Don’t be scared. This line looks strange if you’ve never seen it:
var args = [].slice.call(arguments);
This is nothing more than converting arguments
into an array (called args
). This is necessary
for JavaScript, because the arguments
variable behaves really goofy according to the
specification, which is a design flaw that we’re stuck with. Oh well.
Other than that, we simply validate that tok.type
is inside the args
array, and return the
token.
Kicking off ParseExpr
The next biggest change is that the main loop is now inside a function, ParseExpr
. This is the
piece that we will call recursively… but before we get to that, let’s jump to the end:
// kick off everything
var ast = ParseExpr();
// make sure we consumed the entire token stream
if (tokenIndex < tokens.length)
throw 'invalid expression';
// return the parsed expression
return ast;
We parse the expression by simply calling our magic ParseExpr
function, and storing the result in
ast
. Simple.
We also have an extra check to make sure that we’ve consumed the entire token stream. The reason
for this is because ParseExpr
will exit when there is a right parenthesis – so if a user has
extra right parentheses, then ParseExpr
will return before consuming everything. Without the
check, ParenShuntingYard
would succeed for expressions like 2 + 3) 1 2 3 4
, which would be bad.
ParseExpr Body
Now for the meat.
Calling Ourselves
An inner expression can exist anywhere a terminal can exist. Thankfully, there is only one location where we read terminals – at the top of the main loop. Therefore, the first major change is to accept numbers or symbols:
var tok = NextToken('num', 'sym');
If tok.type
is 'num'
, then we’ve received a terminal, and our previous code works the same:
if (tok.type == 'num'){
// create the terminal node
nextValue = {
type: 'val',
value: tok.value
};
}
...
However, if tok.type
is 'sym'
, and tok.value
is '('
, then we have an inner expression.
...
else{ // tok.type is 'sym'
if (tok.value == '('){
// if we have left paren, call ourselves recursively
nextValue = ParseExpr();
// ensure we have a right paren
tok = NextToken('sym');
if (tok.value != ')')
throw 'missing right parenthesis';
}
else
throw 'invalid expression';
}
Notice that this code expects ParseExpr
to return before consuming the right parenthesis. That
means this piece needs to consume and confirm it exists, and error otherwise.
At the end of this section of code, our node is stored in nextValue
, whether from a terminal, or
inner expression.
Exiting Correctly
Next, we exit the main loop if we are out of tokens – but now we also have to exit if we find a right parenthesis:
// if we're out of tokens, then exit
if (tokenIndex >= tokens.length)
break;
// if our next token is a right paren, then exit
if (tokens[tokenIndex].type == 'sym' &&
tokens[tokenIndex].value == ')')
break;
// otherwise, we must have a binary operator
Remember that we need to exit if the next token is a right parenthesis, but we shouldn’t consume
the token with NextToken
, because the previous section of code is responsible for that. That
simply means that we peek ahead using tokens[tokenIndex]
, without increasing tokenIndex
.
Checking for Errors
Lastly, because both operators and parenthesis share a token type of 'sym'
, we can no longer
assume that NextToken('sym')
is guaranteed to read an operator. We add some extra error checking
to ensure we really read an operator:
// make sure it isn't an open paren
if (tok.value == '(')
throw 'invalid expression';
We don’t need to check if it’s a right parenthesis, because we already did that while peeking above.
All Done!
And… that’s it! Not so bad, really.
Does it work?
var tokens = [
{ type: 'num', value: 2 },
{ type: 'sym', value: '*' },
{ type: 'sym', value: '(' },
{ type: 'num', value: 3 },
{ type: 'sym', value: '+' },
{ type: 'num', value: 4 },
{ type: 'sym', value: ')' },
{ type: 'sym', value: '-' },
{ type: 'num', value: 10 },
{ type: 'sym', value: '/' },
{ type: 'num', value: 2 }
];
ParenShuntingYard(tokens);
=> {
type: 'sub',
parms: [{
type: 'mul',
parms: [{
type: 'val',
value: 2
}, {
type: 'add',
parms: [{
type: 'val',
value: 3
}, {
type: 'val',
value: 4
}
]
}
]
}, {
type: 'div',
parms: [{
type: 'val',
value: 10
}, {
type: 'val',
value: 2
}
]
}
]
}
This is really great!
We are really close to having a fully featured expression parser. We can’t support unary operators yet, and ternary is a little ways off, but it won’t be long now. Stay tuned for the next article in the series!