C311 Regular Grammars, Scanners, DFA

Dana Vrajitoru
C311 Programming Languages

Regular Grammars, Scanners, DFA

DFA versus NFA

DFA - deterministic finite state automaton.
A state transition machine where from each state we can move to one of the others based on the current terminal symbol.
There at most one outgoing branch for each possible symbol.
NFA - non-deterministic finite state automaton.
The difference with a DFA is that there can be more than one outgoing branch with the same symbol.
Regular grammars can be translated as NFAs.

NFA to DFA

To write a scanner more easily, we can transform an NFA into a DFA.
For this we construct a DFA where the states are combinations of states from the NFA that represent alternatives that can be reached from any given state with the same symbol.
For example in the first grammar we would add a state called TV:

Example: NFA (top) and equivalent DFA (bottom)

Regular Expressions

Recognized by regular grammars.
They can be parsed by a linear algorithm that doesn't need to store the global structure of the parsing tree.
Only one non-terminal symbol is active at any moment.
This type of parsing algorithm is called a scanner.
The first compilation step, the lexical analysis, is done by a scanner.

Examples

An identifier in Lisp:

I => letter J
J => letter J
J => digit J
J => _ J
J => - J
J => e

An integer number:

N  => UN
N  => - UN
N  => + UN
UN => digit UN
UN => digit

Finite States Machines

Real Number

N => {-|+|e} UN
UN => UI . UI 
UI => digit UI
UI => digit

The second rule above is not a regular rule. It needs a parser and not a scanner to be processed.

We can transform it the following way:

UN => digit UN
UN => digit
UN => . UF
UF => digit UF
UF => digit

Scanner / Parser

The lexical analysis of a program is done by a scanner.
A scanner is based on a regular grammar where the end of any chain of rules (a non-terminal converted to a single terminal) produces a token as a result.
The parsing process calls the scanner during its execution to retrieve every new token, then processes it before invoking the scanner again.
They are not two separate phases: the scanner is called repeatedly from the parser.

Scanner in Context

A scanner that works in the context of a program in which it recognizes individual units.

It can be written to recognize the termination of a unit by a character that is not part of it.

We can replace:

UN => digit UN
UN => digit

with:

UN => digit UN
UN => e

Table Driven Scanner
If the grammar is equivalent to a DFA, then we can organize the rules in a table:

I => letter J
I => _ J
J => letter J
J => digit J
J => _ J
J => - J
J => e

Table: