Erik Eidt

The Double-E Infix Expression Parsing Method

Introduction

This article describes the Double-E Infix Expression Parsing Method.

This is a method for parsing infix expressions such as we find in most languages like C, having many levels of precedence, and a rich set of operators with static operator precedence.  It is a bottoms-up approach to parsing similar to general purpose shift-reduce parsing, though dedicated to and simplified for infix expressions — meaning parse state tables are replaced by two states, each represented by their own code section.  It is as simple as the Shunting Yard algorithm, though is also industrial strength, without the well-known limitations of the former (e.g. operator support and error detection).

The Double-E method uses two states and two stacks.

The states are simple: Unary and Binary.  To oversimplify, the Unary State says we’re looking for a unary operator or an operand, and the Binary State says we’re looking for a binary operator or the end of an expression.

One of the two stacks is for Operators and the other for Operands.  These stacks are intermediate storage used to hold operators and operands before their proper precedence and binding is known.

An element of the Operator Stack is a simple enum describing an operator.  The enum differentiates between unary negation and substraction, for example, so, the operator stack is a stack of true operators in the expression language, rather than input tokens.

An element of the Operand Stack is a tree node, which represents an expression.  Tree nodes can represent constants, identifiers, addition of two operands, etc.  Such a tree node can be called an Abstract Syntax Tree, AST, so the operand stack is a stack of ASTs.

When proper binding of operators to operands is discovered, those operators and their operands are removed from these stacks and replaced.  Each replacement is a single tree representing the operators bound to their operands, and this replacement is pushed onto the operand stack.

When the parse successfully completes, then one AST representing the entire expression is left on the operand stack.

This method is:

I developed this algorithm in 1978.  The closest thing I’ve see to this is the Shunting Yard algorithm, but that is lacking significant capability.  We might describe the Double-E algorithm as gluing two Shunting Yard algorithms together (one for each of the two states, while also capturing more parse state in the stacks).

Notes

When starting the parse of an infix expression, the initial state is the unary state, and the stacks are both empty.

In the unary state, we’re expecting either a unary operator or an operand or grouping parenthesis (or others).  If we see:

In the binary state, we are expecting binary operators, or close parenthesis (or open paren).  If we see:

That’s the gist of it.

Example

Let’s parse: -a + b*c;

 u   u   b   u   b   u   b		States
  \ / \ / \ / \ / \ / \ / \     	\ for current state ready for input, / for transitions to next state
   -   a   +   b   *   c   ;		Sample input
   1   2   3   4   5   6   7		Action (comment #), given current state & input

initial state: unary
operator stack: empty
operand stack: empty

input: minus sign token (-)
1: push unary negation enum onto operator stack; stay in unary state

state: unary
operator stack: unary negation
operand stack: empty

input: identifier a
2: create AST for identifier a, and push that onto operand stack; switch to binary state

state: binary
operator stack: unary negation
operand stack: identifier a

input: plus sign token (+)
3: binary addition operator: reduce until precedence of addition is higher than that on the operator stack
   that means constructing a tree for unary negation of identifier a,
   popping unary negation off the operator stack, popping identifier a off the operand stack
   and pushing the AST for unary negation of identifier a onto the operand stack
   push operator binary addition onto the operator stack
   switch to unary state

state: unary
operator stack: binary addition
operand stack: AST for -a

input: identifier b
4: push operand identifier b onto the operand stack; switch to binary state

state: binary
operator stack: binary addition
operand stack: AST for -a, AST for b

input: asterisk token (*)
5: reduce (nothing to do: binary multiplication has higher precedence than binary addition)
   push binary multiplication onto operator stack; switch to unary state

state: unary
operator stack: binary addition, binary multiplication
operand stack: AST for -a, AST for b

input: identifier c
6: push operand c; switch to binary state

state: binary
operator stack: binary addition, binary multiplication
operand stack: AST for -a, AST for b, AST for c

input: semi colon token (;)
7: end of parse (semi colon unrecognized, so not consumed).
   (it is legal to end parse when in binary state)
   Reduce until single operand. Parse error if cannot reduce to 1, if unmatch paren, etc..
   
   Reducing will first remove the binary multiplication from the stacks, resulting in:

operator stack: binary addition
operand stack: AST for -a, AST for b * c

   Next the addition is replaced by an AST representing the addition.
   
operator stack: empty
operand stack: AST for -a + b * c

    exit the parse, yielding the one operand remaining on the operand stack:

      +
    /   \
   -     *  
   |    / \
   a   b   c