chapter 3 chang chi-chung 2007.4.12. the role of the lexical analyzer lexical analyzer parser source...
Post on 30-Dec-2015
245 Views
Preview:
TRANSCRIPT
Chapter 3
Chang Chi-Chung
2007.4.12
The Role of the Lexical Analyzer
LexicalAnalyzer
ParserSource
Program
Token
Symbol Table
getNextToken
error error
The Reason for Using the Lexical Analyzer Simplifies the design of the compiler
A parser that had to deal with comments and white space as syntactic units would be more complex.
If lexical analysis is not separated from parser, then LL(1) or LR(1) parsing with 1 token lookahead would not be possible (multiple characters/tokens to match)
Compiler efficiency is improved Systematic techniques to implement lexical analyzers by
hand or automatically from specifications Stream buffering methods to scan input
Compiler portability is enhanced Input-device-specific peculiarities can be restricted to the
lexical analyzer.
Lexical Analyzer Lexical analyzer are divided into a cascade of
two process. Scanning
Consists of the simple processes that do not require tokenization of the input. Deletion of comments. Compaction of consecutive whitespace characters into
one.
Lexical analysis The scanner produces the sequence of tokens as
output.
Tokens, Patterns, and Lexemes Token (符號單元 )
A pair consisting of a token name and optional arrtibute value.
Example: num, id Pattern (樣本 )
A description of the form for the lexemes of a token. Example: “non-empty sequence of digits”, “letter followed by
letters and digits” Lexeme ( 詞 )
A sequence of characters that matches the pattern for a token.
Example: 123, abc
Examples: Tokens, Patterns, and Lexemes
Token Pattern Lexeme
if characters i f if
else characters e l s e else
comparison < or > or <= or >= or == or != <=, !=
id letter followed by letters and digits
pi, score, D2
number any numeric constant 3.14, 0, 6.23
literal anything but “, surrounded by “’s “core dump”
An Example
E = M * C ** 2 A sequence of pairs by lexical analyzer<id, pointer to symbol-table entry for E>
<assign_op>
<id, pointer to symbol-table entry for M>
<mult_op>
<id, pointer to symbol-table entry for C>
<exp_op>
<number, integer value 2>
Input Buffering
E = M * C * * 2 eof eof
eof
lexemeBegin forward
Sentinels
Lookahead Code with Sentinels switch (*forward++) { case eof: if (forward is at end of first buffer) { reload second buffer; forward = beginning of second buffer; } else if (forward is at end of second buffer) { reload first buffer; forward = beginning of first buffer; } else /* eof within a buffer marks the end of inout */ terminate lexical anaysis; break; cases for the other characters;}
Strings and Languages
Alphabet An alphabet is a finite set of symbols (characters)
String A string is a finite sequence of symbols from
s denotes the length of string s denotes the empty string, thus = 0
Language A language is a countable set of strings over some fixed
alphabet Abstract Language Φ {ε}
String Operations Concatenation (連接 )
The concatenation of two strings x and y is denoted by xy Identity (單位元素 )
The empty string is the identity under concatenation. s = s = s
Exponentiation Define
s0 = si = si-1s for i > 0
By Define
s1 = s
s2 = ss
Language Operations Union
L M = { s s L or s M } Concatenation
L M = { xy x L and y M} Exponentiation
L0 = { }
Li = Li-1L Kleene closure (封閉包 )
L* = ∪i=0,…, Li
Positive closureL+ = ∪i=1,…, Li
Regular Expressions Regular Expressions
A convenient means of specifying certain simple sets of strings.
We use regular expressions to define structures of tokens.
Tokens are built from symbols of a finite vocabulary. Regular Sets
The sets of strings defined by regular expressions.
Regular Expressions Basis symbols:
is a regular expression denoting language L() = {} a is a regular expression denoting L(a) = {a}
If r and s are regular expressions denoting languages L(r) and M(s) respectively, then rs is a regular expression denoting L(r) M(s) rs is a regular expression denoting L(r)M(s) r* is a regular expression denoting L(r)*
(r) is a regular expression denoting L(r) A language defined by a regular expression is called
a regular set.
Operator Precedence
Operator Precedence Associative
* highest left
concatenation Second left
| lowest left
Algebraic Laws for Regular ExpressionsLaw Description
r | s = s | r | is commutative
r | ( s | t ) = ( r | s ) | t | is associative
r(st) = (rs)t concatenation is associative
r(s|t) = rs | rt
(s|t)r = sr | trconcatenation distributes over |
εr = rε = r ε is the identity for concatenation
r* = ( r |ε)* ε is guaranteed in a closure
r** = r* * is idempotent
Regular Definitions If Σ is an alphabet of basic symbols, then a regular
definitions is a sequence of definitions of the form: d1 r1
d2 r2
…
dn rn Each di is a new symbol, not in Σ and not the same as any
other of d’s. Each ri is a regular expression over the alphabet
{d1, d2, …, di-1 }
Any dj in ri can be textually substituted in ri to obtain an equivalent set of definitions
Example: Regular Definitions Regular Definitions letter_ A | B | … | Z | a | b | … | z | _ digit 0 | 1 | … | 9 id letter_ ( letter_ | digit )*
Regular definitions are not recursive
digits digit digits digit wrong
Extensions of Regular Definitions One or more instance
r+ = rr* = r*r r* = r+ | ε
Zero or one instance r? = r |ε
Character classes [a-z] = abc…z [A-Za-z] = A|B|…|Z|a|…|z
Example digit [0-9] num digit+ (. digit+)? ( E (+-)? digit+ )?
Regular Definitions and GrammarsContext-Free Grammars
stmt if expr then stmt if expr then stmt else stmt
expr term relop term termterm id num
Regular Definitions digit [0-9]letter [A-Za-z] if if then then else elserelop < <= <> > >= = id letter ( letter | digit )*
num digit+ (. digit+)? ( E (+ | -)? digit+ )?
ws ( blank | tab | newline )+
LEXEMES TOKEN NAME ATTRIBUTE VALUE
Any ws - -
if if -
then then -
else else -
Any id id Pointer to table entry
Any number number Pointer to table entry
< relop LT
<= relop LE
= relop EQ
<> relop NE
> relop GT
>= relop GE
Transition Diagrams
0 21
6
3
4
5
7
8
return(relop, LE)
return(relop, NE)
return(relop, LT)
return(relop, EQ)
return(relop, GE)
return(relop, GT)
start <
=
>
=
>
=
other
other
*
*
relop < <= <> > >= =
Transition Diagrams
9start letter
10 11*other
letter or digit
return (getToken(), installID() )
id letter ( letter | digit )*
An Example: Implement of RELOP TOKEN getRelop(){ TOKEN retToken = new(RELOP); while (1) { case 0: c = nextChar(); if (c == ‘<‘) state = 1; else if (c == ‘=‘) state= 5; else if (c == ‘>‘) state= 6; else fail(); break; case 1: ... ... case 8: retract(); retToken.attribute = GT; return(retTOKEN); }}
Finite Automata Finite Automata are recognizers.
FA simply say “Yes” or “No” about each possible input string.
A FA can be used to recognize the tokens specified by a regular expression
Use FA to design of a Lexical Analyzer Generator Two kind of the Finite Automata
Nondeterministic finite automata (NFA) Deterministic finite automata (DFA)
Both DFA and NFA are capable of recognizing the same languages.
NFA Definitions NFA = { S, , , s0, F }
A finite set of states S A set of input symbols Σ
input alphabet, ε is not in Σ A transition function
: S S A special start state s0
A set of final states F, F S (accepting states)
Transition Graph for FA
is a state
is a transition
is a the start state
is a final state
Example
0 1 2 3a b c
c
a
This machine accepts abccabc, but it rejects abcab.
This machine accepts (abc+)+.
Transition Table
0start a 1 32b b
a
b
STATE a b ε
0 {0, 1} {0} -
1 - {2} -
2 - {3} -
3 - - -
The mapping of an NFA can be represented in a transition table
(0, a) = {0,1}(0, b) = {0}(1, b) = {2}(2, b) = {3}
DFA DFA is a special case of an NFA
There are no moves on input ε For each state s and input symbol a, there is
exactly one edge out of s labeled a. Both DFA and NFA are capable of
recognizing the same languages.
Simulating a DFA
Input An input string x terminated by
an end-of-file character eof. A DFA D with start state s0, accepting states F, and transition function move.
Output Answer “yes” if D accepts x;
“no” otherwise.
s = s0
c = nextChar();
while ( c != eof ) {
s = move(s, c);
c = nextChar();
}
if (s is in F )
return “yes”;
else
return “no”;
NFA vs DFA
0start a 1 32b b
a
b
S = {0,1,2,3} = {a, b}s0 = 0F = {3}
0 1 2 3a b b
b
a
a
a
(a | b)*abb
The Regular Language The regular language defined by an NFA is the
set of input strings it accepts. Example: (ab)*abb for the example NFA
An NFA accepts an input string x if and only if there is some path with edges labeled with symbols
from x in sequence from the start state to some accepting state in the transition graph
A state transition from one state to another on the path is called a move.
Theorem The followings are equivalent
Regular Expression NFA DFA Regular Language Regular Grammar
Convert Concept
Regular Expression
Nondeterministic Finite Automata
Deterministic Finite Automata
MinimizationDeterministic
Finite Automata
Construction of an NFA from a Regular Expression
Use Thompson’s Construction
s | t
N(s)
N(t)
s t N(s) N(t)
s* N(s)
a
a
ε
Example ( a | b )* a b b
r11
r8
r10
r7
r9
r6r5
*r4 a
b
b
( r3 )
r2r1
a b
|r3 = r4
0start a1 10
2
b
b
a
b
3
4 5
6 7 8 9
( a | b )* a b b
Example
Conversion of an NFA to a DFA The subset construction algorithm converts an NFA
into a DFA using the following operation.
Operation Description
ε- closure(s)Set of NFA states reachable from NFA state s on ε-transitions alone.
ε- closure(T)
Set of NFA states reachable from some NFA state s in set T on ε-transitions alone.
= ∪s in T ε- closure(s)
move(T, a)Set of NFA states to which there is a transition on input symbol a from some state s in T
Subset Construction(1)
Initially, -closure(s0) is the only state in Dstates and it is unmarked;
while (there is an unmarked state T in Dstates) {mark T;for (each input symbol a ) { U = -closure( move(T, a) ); if (U is not in Dstates) add U as an unmarked state to Dstates Dtran[T, a] = U}
}
Computing ε- closure(T)
Astart
B
C
D E
b
b
b
b
b
aa
a
a
a
0start a1 10
2
b
b
a
b
3
4 5
6 7 8 9
NFA State DFA State a b
{0,1,2,4,7} A B C
{1,2,3,4,6,7,8} B B D
{1,2,4,5,6,7} C B C
{1,2,4,5,6,7,9} D B E
{1,2,3,5,6,7,10} E B C
Example
( a | b )* a b b
Example2a1
6a3 4 5b b
8b7
a b0
start
DstatesA = {0,1,3,7}B = {2,4,7}C = {8}D = {7}E = {5,8}F = {6,8}
a abb a*b+
0137 247
68
7
8 58
a
b
b
b b
b
a
b
Simulation of an NFA
Input An input string x terminated by an
end-of-file character eof. An NFA N with start state s0, accepting states F, and transition function move.
Output Answer “yes” if N accepts x; “no”
otherwise.
S = ε-closure(s0)c = nextChar();while ( c != eof ) {
S = ε-closure(s0) c = nextChar();}if (S∩F != ψ) return “yes”;else return “no”;
Minimizing the DFA Step 1
Start with an initial partition II with two group: F and S-F (aceepting and nonaccepting)
Step 2 Split Procedure
Step 3 If ( IInew = II )
IIfinal = II and continue step 4 else
II = IInew and go to step 2 Step 4
Construct the minimum-state DFA by IIfinal group. Delete the dead state
Split Procedure
Initially, let IInew = II for ( each group G of II ) { Partition G into subgroup such that two states s and t are in the same subgroup if and only if for all input symbol a, states s and t have
transition on a to states in the same group of II.
/* at worst, a state will be in a subgroup by itself */
replace G in IInew by the set of all subgroup formed}
Example
initially, two sets {1, 2, 3, 5, 6}, {4, 7}. {1, 2, 3, 5, 6} splits {1, 2, 5}, {3, 6} on c. {1, 2, 5} splits {1}, {2, 5} on b.
Minimizing the DFA
Major operation: partition states into equivalent classes according to final / non-final states transition functions
( A B C D E )( A B C D ) ( E )( A B C ) ( D ) ( E )( A C ) ( B ) ( D ) ( E )
a bA B CB B DC B CD B EE B C
a bA C B A CB B DD B EE B A C
Important States of an NFA The “important states” of an NFA are those
without an -transition, that is if move({s}, a) for some a then s is an
important state The subset construction algorithm uses only
the important states when it determines-closure ( move(T, a) )
Augment the regular expression r with a special end symbol # to make accepting states important: the new expression is r#
Converting a RE Directly to a DFA Construct a syntax tree for (r)# Traverse the tree to construct functions
nullable, firstpos, lastpos, and followpos Construct DFA D by algorithm 3.62
Function Computed From the Syntax Tree nullable(n)
The subtree at node n generates languages including the empty string
firstpos(n) The set of positions that can match the first symbol of a
string generated by the subtree at node n lastpos(n)
The set of positions that can match the last symbol of a string generated be the subtree at node n
followpos(i) The set of positions that can follow position i in the tree
Rules for Computing the Function
Node n nullable(n) firstpos(n) lastpos(n)
A leaf labeled by
true
A leaf with position i
false {i} {i}
n = c1 | c2
nullable(c1)or
nullable(c2)firstpos(c1) firstpos(c2) lastpos(c1) lastpos(c2)
n = c1 c2
nullable(c1) and
nullable(c2)
if ( nullable(c1) )
firstpos(c1) firstpos(c2)else firstpos(c1)
if ( nullable(c2) ) lastpos(c1) lastpos(c2)
else lastpos(c2)
n = c1* true firstpos(c1) lastpos(c1)
Computing followpos
for (each node n in the tree){ //n is a cat-node with left child c1 and right child c2 if ( n == c1. c2) for (each i in lastpos(c1) )
followpos(i) = followpos(i) firstpos(c2); else if (n is a star-node)
for ( each i in lastpos(n) ) followpos(i) = followpos(i) firstpos(n);
}
Converting a RE Directly to a DFAInitialize Dstates to contain only the unmarked state firstpos(n0), where n0 is the root of syntax tree T for (r)#;
while ( there is an unmarked state S in Dstates ) {
mark S; for ( each input symbol a ) {
let U be the union of followpos(p)
for all p in S that correspond to a;if (U is not in Dstates )
add U as an unmarked state to DstatesDtran[S,a] = U;
}
}
Example
( a | b )* a b b #
nullable(n) = false
firstpos(n) = { 1, 2, 3 }
lastpos(n) = { 3 }
followpos(1) = {1, 2, 3 }
○
b
#
○
○
b○
a*4
5
6
|
a b
3
21
n
n = ( a | b )* a
Example{6}{1, 2, 3}
{5}{1, 2, 3}
{4}{1, 2, 3}
{3}{1, 2, 3}
{1, 2}{1, 2} *
{1, 2}{1, 2} |
{1}{1} a {2}{2} b
{3}{3} a
{4}{4} b
{5}{5} b
{6}{6} #
nullable
firstpos lastpos
1 2
3
4
5
6
( a | b )* a b b #
Example
1,2,3 a 1,2,3,4
1,2,3,61,2,3,5
b b
b b
a
a
a
Node followpos
1 {1, 2, 3}
2 {1, 2, 3}
3 {4}
4 {5}
5 {6}
6 -
1
2
3 4 5 6
( a | b )* a b b #
Time and Space Complexity
AutomatonSpace
(worst case)Time
(worst case)
NFA O(r) O(rx)
DFA O(2|r|) O(x)
top related