C311 Parsers

Dana Vrajitoru
C311 Programming Languages

Parsers

Parsing

The most important part of any compiler.
It relies on a context-free grammar.
The input consists of code tokens.
If n is the number of tokens in the code, then a typical parser can compile a program in O(n³) time.
Some particular grammars can be parsed in linear time: LL (left to right, left-most derivation) and LR (left to right, right-most derivation).
LL parsers are top-down (predictive), while LR parsers are bottom-up (constructive).

LL and LR Grammars

An LL grammar is a grammar that can be recognized by an LL parser.
Notation: LL(k) is an LL grammar for which the parser needs to scan ahead k terminal symbols in the string to recognize. A similar notation is used for LR(k).
LL grammars are more general than LR. There exist grammars that are not LL(k) for any given k.

Example

----- A: LL(1)--------
P -> ( P )
P -> R
R -> [ R ]
R -> e

----- B: non-LL--------
P -> ( P ) 
P -> P P
P -> e

----- C: LL(2)--------
R -> [ R ]
R -> [ ]
R -> P
P -> ( P )
P -> ( R )
P -> e

Languages Accepted by the Above Grammars

A: (ⁿ [^m ]^m )ⁿ, where n, m >= 0.
In other words, a sequence of parentheses followed by a sequence of brackets, followed by the same number of closed brackets and parantheses in reverse order.
Example: ( ( ( [ [ ] ] ) ) )
B: a₁ a₂ a₃ ... a_n where
a_i is either ( or ),
for any i<n, count{a_k = (, for k <= i} >= count{a_k = ), for k <= i}, (or the count of left parentheses seen so far should be larger than or equal to the count of right parenteses at any intermediate point in the string) and
count{a_k = (, for k <= n} = count{a_k = ), for k <= n} (or the total count of left parentheses in the whole string should be equal to the total count of right parenteses in the string).
In other words, a well formed sequence of parentheses where we never close more parentheses than we have opened and where all opened parentheses are closed by the end of the string.
Example: ( ( ( ) ) ( ) ) ( ( ) )
C: a₁ a₂ a₃ ... a_n b_n b_n-1 ... b₂ b₁ where n>=1 and
a_i is either ( or [ , b_i is either ) or ] and
if a_i = (, then b_i = ) otherwise b_i = ] (in other words, the parentheses and brackets should be closed in reverse order than they were open).
In other words, a sequence of any combination of opened parentheses and brackets followed by the same number of closed parentheses and brackets in reverse order.
Example: [ ( ( [ ( [ ] ) ] ) ) ]

LL Parsing

The parser starts with the root of the tree.
At every step, it will try to expand the left-most non-terminal active symbol (let's say S).
The parser analyzes the next terminal symbol in the string (a) and tries to make a prediction about the rule to be applied.
S =>a b c T
It must match all terminal symbols to the left of the non-terminal in the production rule (the string must contain a b c). The active non-terminal symbol is replaced with the one to the right (T).
The process stops when the parsed string is composed only of terminal symbols.

More Examples LL Grammars

An LL(3) grammar:
```
N => d e f M S
N =>d e g N
...
```
Parsed string: a b c N
Input string: a b c d e f
The choice above is deterministic if we examine 3 terminals.
A second example where the rules to expand N are not enough to decide what the grammar is.
```
N => d e M
N => d e S
M => f N T
S => g S
```
The same parsed and input strings.
The choice is also deterministic if we examine 3 terminals, but we need to look further than the rules expanding N.

A Non-LL Grammar
This grammar is non-LL because to choose between the two rules expanding UN one must find if there is a preiod in the string or not. For that the parser must search forward through all the digits. No matter what constant k we choose, it is possible to have a number having more than k digits to the left of the period, in which case the parser would fail to make a decision. So the grammar is not LL(k) for any constant k, and so it is not an LL grammar.

N => { + | - | e } UN
UN => UI . UI
UN => UI
UI => digit UI
UI => digit

LL - LR Parsing

An example of LL(1) and LR(2) grammars to recognize a list of identifiers in C++:

LL:

id_list => id id_list_tail
id_list_tail => , id id_list_tail
id_list_tail=> ;

LR:

id_list => id_list_prefix ;
id_list_prefix => id_list_prefix , id
id_list_prefix => id

Example

Parsing the following list: n, m, size, result;
Starting with the id_list.
Applying the first rule
id_list =>n id_list_tail
Then we find a "," in the string which means that we apply the second rule:
n id_list_tail =>n, m id_list_tail
And so on until we find the ";" which prompts us to apply the third rule:
n, m, size, result id_list_tail =>n, m, size, result;

Parse Tree

LR Parsing

A LR parser is bottom-up.
It starts by scanning the string and storing partially-completed subtrees in a list.
As soon as it notices that the right hand side of the list corresponds to sequence to the right hand side of a rule, it replaces that sequence with the non-terminal symbol to the left of the rule.
The algorithm continues until we arrive at the starting non-terminal, when the tree becomes complete.

Example Bottom-Up

It starts from the left side and asks what subtreen could be part of. Since it’s an id, it marks it as an IDLP(id(n)).
Then it examines the next terminal, the comma, and asks what rule could follow an IDLP with a comma. So it builds another subtree IDLP(IDLP(id(n)) , id(m)).
It continues this way adding things on top of the partial tree that was constructed this far until the end of the input.
It is called the right-most derivation because the root of the tree starts with the terminals to the right. The processing of the input string is still left to right.

Example LR

Starting with the processing list
( n , m , size , result ; )
We identify n as an id that can be a subtree of IDLP (is_list_prefix):
( (IDLP n) , m , size , result ; )
Then we notice that the combination of IDLP , m is the subtree of another IDLP based on the second rule:
( (IDLP ( IDLP n) , m) , size , result ; )
After repeating these steps the final result is:
(IDL (IDLP (IDLP (IDLP (IDLP n) , m) , size) , result) ; )

Parse Tree

Backus-Naur Notation

BN notation used to describe the language Algol (1960).
The Backus-Naur Form of a grammar is a convenient way to describe it.
It condenses the rules by separating the alternatives with "|". id_list_tail => , id id_list_tail | ;
Kleene star or plus are used to denote repetition:
id_list => id ( , id) * ; UN => (digit)+
* means that the content of the parenthesis is repeated 0 or any number of times.
+ means that the content of the parenthesis is repeated one or more times.

Arithmetic Expression

expr => term term_tail				1
term_tail => add_op term term_tail | e		2|3
term => factor factor_tail				4
factor_tail => mult_op factor factor_tail | e	5|6
factor => ( expr ) | id | number			7|8|9
add_op => + | -					10|11
mult_op => * | /					12|13

Discriminating Terminals

expr : no need to check a terminal
term_tail :
+ or - -> the first rule,
anything else -> the second
term : one rule, no need to check
factor_tail :
* or / -> the first rule
anything else -> the second
factor :
( -> the first rule
id or number -> the two others
anything else -> error
add_op, mult_op: +, -, *, / ok, anything else error
This is LL(1).

Parsing Process

2 + a * ( 5 – b )
E ->T  TT ->F  FT  TT -> (F 2) FT  TT -> (F 2) TT ->
(F 2) AO  T  TT -> (F 2) (AO +) T  TT -> (F 2) (AO +) F  FT  TT ->
(F 2) (AO +) (F a) FT  TT -> (F 2) (AO +) (F a) MO  F  FT  TT ->
(F 2) (AO +) (F a) (MO *)  F  FT  TT ->
(F 2) (AO +) (F a) (MO *)  (E)  FT  TT ->
(F 2) (AO +) (F a) (MO *)  (T TT )  FT  TT ->
(F 2) (AO +) (F a) (MO *)  (F FT TT )  FT  TT ->
(F 2) (AO +) (F a) (MO *)  ( (F 5) FT TT )  FT  TT ->
(F 2) (AO +) (F a) (MO *)  ( (F 5) TT)  FT  TT ->
(F 2) (AO +) (F a) (MO *)  ( (F 5) AO T TT )  FT  TT ->
(F 2) (AO +) (F a) (MO *)  ( (F 5) (AO -) T TT )  FT  TT ->
(F 2) (AO +) (F a) (MO *)  ( (F 5) (AO -) F FT TT )  FT  TT ->
(F 2) (AO +) (F a) (MO *)  ( (F 5) (AO -) (F b) FT TT ) FT TT ->
(F 2) (AO +) (F a) (MO *)  ( (F 5) (AO -) (F b) TT) FT TT ->
(F 2) (AO +) (F a) (MO *)  ( (F 5) (AO -) (F b)  )

Parse Tree