Lexical Analysis
Dragon Book: chapter 3
Compiler structure
Source program → Lexical analyzer → Syntax analyzer → Semantic analyzer → Intermediate code generator → Code optimizer → Code generator → Target program
Used by the phases: Symbol table, Error handling
Compiler structure
Source program → Lexical analyzer → Syntax analyzer
Syntax analyzer to lexical analyzer: "Get next token". Lexical analyzer to syntax analyzer: token.
Used by both: Symbol table, Error handling
Tokens in programming languages
Token   Sample instances                            Description
if      if                                          keyword
rel     <, <=, <>, >=, >                            relation
id      count, length, point2                       variable
num     3.1415927, 7, 145e-3                        numerical constant
str     “abc”, “some space”, “\7\” is a char”       constant string
Tokens may be difficult to recognize
Fortran: DO 5 I=1.25 versus DO 5 I=1,25 (spaces do not count).
PL/I: IF THEN THEN THEN=ELSE; ELSE ELSE=THEN; (no reserved keywords).
PL/I: PR1(2, 7, 18, D*3, 175.14)=3 (procedure call or array reference).
Strings, languages.
A string is a sequence of characters over some alphabet, e.g., 0100110 over {0, 1}.
In computers, the alphabet is usually ASCII or EBCDIC.
Length of a string: its number of characters.
Empty string: ε (length 0).
Concatenation: putting one string after another. X=dog, Y=house, XY=doghouse (also written X.Y).
Prefix: ban is a prefix of banana.
Suffix: ana is a suffix of banana.
Language: a set of strings.
The alphabet is a language: L={A, B, …, Z, a, b, …, z}.
Constant languages: X={ab, ba}, Y={a}.
Concatenation: X.Y = {aba, baa}, Y.X = {aab, aba}.
Union: X∪Y = X+Y = X|Y = {ab, ba, a}.
Exponentiation: X^3 = X.X.X
Star: X* = zero or more occurrences.
L* = all words with letters from L.
L+ = all words with one or more letters from L.
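The language operations above can be tried out on small finite sets. The following is a minimal sketch in Python; the helper names (concat, power, star_up_to) are made up for this illustration, and since X* is infinite, star_up_to only collects the powers up to a chosen bound.

```python
from itertools import product

def concat(X, Y):
    """X.Y: every string of X followed by every string of Y."""
    return {x + y for x, y in product(X, Y)}

def power(X, n):
    """X^n: X concatenated with itself n times; X^0 = {""}."""
    result = {""}
    for _ in range(n):
        result = concat(result, X)
    return result

def star_up_to(X, max_power):
    """Finite approximation of X*: the union of X^0 .. X^max_power."""
    words = set()
    for i in range(max_power + 1):
        words |= power(X, i)
    return words

X = {"ab", "ba"}
Y = {"a"}
print(sorted(concat(X, Y)))      # ['aba', 'baa']
print(sorted(concat(Y, X)))      # ['aab', 'aba']
print(sorted(X | Y))             # ['a', 'ab', 'ba']
print(sorted(star_up_to(Y, 3)))  # ['', 'a', 'aa', 'aaa']
```

The empty Python string plays the role of ε.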
Regular expressions
X|Y = X∪Y = { s | s∈X or s∈Y }.
X.Y = { x.y | x∈X and y∈Y }.
X* = ∪_{i=0}^{∞} X^i.
X+ = ∪_{i=1}^{∞} X^i.
Examples
a|b = {a, b}.
(a|b).(a|b) = {aa, ab, ba, bb}.
a* = { ε, a, aa, aaa, … }.
(a|b)* = { ε, a, b, ab, ba, aa, aba, … }
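These examples can be checked mechanically. The sketch below uses Python's re module and lists every word up to a small length that a pattern matches in full. The function name language and the length bound are choices made for this illustration. Note that Python writes concatenation by juxtaposition, since "." means "any character" in its regex syntax.

```python
import re
from itertools import product

def language(pattern, alphabet="ab", max_len=2):
    """All words up to max_len over the alphabet that the pattern matches entirely."""
    words = []
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            word = "".join(letters)
            if re.fullmatch(pattern, word):
                words.append(word)
    return words

print(language(r"a|b"))         # ['a', 'b']
print(language(r"(a|b)(a|b)"))  # ['aa', 'ab', 'ba', 'bb']
print(language(r"a*"))          # ['', 'a', 'aa']
print(language(r"(a|b)*"))      # ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']
```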
Defining tokens
digit → [0-9]
digits → digit+
fraction → . digits | ε
exponent → E ( + | - | ε ) digits | ε
const → digits fraction exponent
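As a rough illustration, the definition of const can be transcribed into a Python regular expression, one named piece per definition. This is a sketch that follows the definitions literally: the optional parts stand for the ε alternatives, and only an uppercase E is accepted because that is what the definition uses.

```python
import re

DIGIT = r"[0-9]"
DIGITS = rf"{DIGIT}+"
FRACTION = rf"(\.{DIGITS})?"        # . digits | ε
EXPONENT = rf"(E[+-]?{DIGITS})?"    # E ( + | - | ε ) digits | ε
CONST = re.compile(DIGITS + FRACTION + EXPONENT)

for text in ["3.1415927", "7", "145E-3", "1.", ".5"]:
    print(text, bool(CONST.fullmatch(text)))
# 3.1415927 True
# 7 True
# 145E-3 True
# 1. False
# .5 False
```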
Not everything is regular!
All the words of the form w c w, where w is a word and c a letter.
The syntax of a program, e.g., the recursive definition of if-then-else:
stmt → if expr then stmt else stmt.
Reading the input
Sometimes need to “look ahead”. For example: identifying the variable done.
May need to “unread” a character.
[Figure: input buffer "If a>8 then goto nextloop else begin while z>8 do" with two pointers, "Token starts here" and "Last character read".]
Returning: token + attributes.
Input: if xyz > 11 then
if, keyword.
id, value=xyz.
op, value=“>”.
const, value=11.
then, keyword.
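A hand-written scanner makes the two pointers and the lookahead concrete. The following is a minimal sketch, not the course's implementation: the keyword set, the token names and the (token, attribute) tuple format are assumptions for this illustration. Here start marks where the token starts and pos is the next unread character; stopping the loop without consuming the character that ended the lexeme plays the role of "unreading" it.

```python
KEYWORDS = {"if", "then", "else", "do"}  # assumed keyword set for this sketch

def tokenize(text):
    """Return a list of (token, attribute) pairs."""
    tokens = []
    pos = 0  # index of the next unread character
    while pos < len(text):
        ch = text[pos]
        if ch.isspace():
            pos += 1
            continue
        start = pos  # the token starts here
        if ch.isalpha():
            # Keep reading while the lexeme can still grow. The character that
            # stops the loop is only looked at, so it stays unread.
            while pos < len(text) and text[pos].isalnum():
                pos += 1
            word = text[start:pos]
            tokens.append((word, None) if word in KEYWORDS else ("id", word))
        elif ch.isdigit():
            while pos < len(text) and text[pos].isdigit():
                pos += 1
            tokens.append(("const", int(text[start:pos])))
        elif ch in "<>=":
            pos += 1
            # One character of lookahead separates < from <= and <>.
            if pos < len(text) and text[start:pos + 1] in ("<=", ">=", "<>"):
                pos += 1
            tokens.append(("op", text[start:pos]))
        else:
            raise ValueError(f"unexpected character {ch!r} at position {pos}")
    return tokens

print(tokenize("if xyz > 11 then"))
# [('if', None), ('id', 'xyz'), ('op', '>'), ('const', 11), ('then', None)]
print(tokenize("done"))
# [('id', 'done')]
```

The second call shows why lookahead matters: after reading "do" the scanner cannot yet decide between the keyword do and an identifier such as done.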
Finite Automata
[Figure: automaton with states s1 to s5 and transitions labeled a, b and c.]
Includes:
States {s1,s2,…,s5}.
Initial states {s1}.
Accepting states {s3,s5}.
Alphabet {a, b, c}.
Transitions:
{(s1,a,s2), (s2, a, s3), …}.
Deterministic?
Automaton. What is the language?
[Figure: two-state automaton with states s0 and s1 and transitions labeled a and b.]
Formally:
An input is a word over the alphabet Σ.
A run over a word is an alternating sequence of states and letters, starting from the initial state.
Accepting run: ends with an accepting state.
Example
[Figure: two-state automaton with states s0 and s1 and transitions labeled a and b.]
Input: aabbb
Run: s0 a s0 a s0 b s1 b s1 b s1. Accepts.
Input: aba
Run: s0 a s0 b s1 a s0. Does not accept.
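The two runs can be reproduced with a table-driven simulation. In this sketch the transition table is read off the runs shown above, and taking s1 as the only accepting state is an assumption inferred from which run accepts. The names DELTA, run and accepts are chosen for this illustration.

```python
# Transitions as they appear in the two example runs.
DELTA = {
    ("s0", "a"): "s0",
    ("s0", "b"): "s1",
    ("s1", "a"): "s0",
    ("s1", "b"): "s1",
}
INITIAL = "s0"
ACCEPTING = {"s1"}  # assumption: inferred from the accepting run

def run(word):
    """Return the run as an alternating list of states and letters."""
    state = INITIAL
    trace = [state]
    for letter in word:
        state = DELTA[(state, letter)]
        trace += [letter, state]
    return trace

def accepts(word):
    """A run is accepting when it ends in an accepting state."""
    return run(word)[-1] in ACCEPTING

print(" ".join(run("aabbb")), accepts("aabbb"))  # s0 a s0 a s0 b s1 b s1 b s1 True
print(" ".join(run("aba")), accepts("aba"))      # s0 a s0 b s1 a s0 False
```

Under these assumptions the automaton accepts exactly the words that end with b.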
Automaton. What is the language?
[Figure: two-state automaton with states s0 and s1 and transitions labeled a and b.]
Automaton. What is the language?
[Figure: two-state automaton with states s1 and s0 and transitions labeled a and b.]
Identifying tokens
[Figure: automaton with one path per keyword (IF, THEN, ELSE) and a path for identifiers: letter (letter | digit)*.]
Non deterministic automata
Allows more than a single transition from a state with the same label.
There does not have to be a transition from every state with every label.
Allows multiple initial states.
Allows ε transitions.
[Figure: automaton with states s0, s1, s2, s3 and transitions labeled 0,1 / 1 / 0,1 / 0,1.]
Nondeterministic runs
Input: 0100
Run 1: s0 0 s0 1 s0 0 s0 0 s0. Does not accept.
Run 2: s0 0 s0 1 s1 0 s2 0 s3. Accepts.
Accepts when there exists an accepting run.
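Instead of guessing a run, a simulator can track every state the automaton could be in after each letter. The sketch below does this for the example. The transition table is an assumption reconstructed from the two runs and the figure labels (s0 loops on 0 and 1 and may move to s1 on 1; s1 and s2 advance on either letter), with s3 taken as the accepting state.

```python
# Assumed reconstruction of the example automaton.
NFA = {
    ("s0", "0"): {"s0"},
    ("s0", "1"): {"s0", "s1"},
    ("s1", "0"): {"s2"},
    ("s1", "1"): {"s2"},
    ("s2", "0"): {"s3"},
    ("s2", "1"): {"s3"},
}
INITIAL = {"s0"}
ACCEPTING = {"s3"}

def nfa_accepts(word):
    """True when at least one run over the word ends in an accepting state."""
    current = set(INITIAL)
    for letter in word:
        # Follow every possible transition; a missing entry means no move.
        current = {t for s in current for t in NFA.get((s, letter), set())}
    return bool(current & ACCEPTING)

print(nfa_accepts("0100"))  # True
print(nfa_accepts("0010"))  # False
```

With this table the automaton accepts the words whose third letter from the end is 1.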
Determinizing Automata
Each state of D is a set of the states of N.
S -a-> T when T = { t | s∈S and s -a-> t }.
The initial state of D includes all the initial states of N.
Accepting states in D include at least one accepting state of N.
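The rules above amount to the subset construction. The following sketch applies them to the example NFA; the transition table is an assumption reconstructed from the slides, and the function name determinize and its return shape are choices made for this illustration. Only the sets reachable from the initial set are built.

```python
from collections import deque

# Assumed reconstruction of the example NFA N.
NFA = {
    ("s0", "0"): {"s0"},
    ("s0", "1"): {"s0", "s1"},
    ("s1", "0"): {"s2"},
    ("s1", "1"): {"s2"},
    ("s2", "0"): {"s3"},
    ("s2", "1"): {"s3"},
}

def determinize(nfa, initial, accepting, alphabet):
    """Build D: each state of D is a frozenset of states of N."""
    start = frozenset(initial)
    delta = {}
    seen = {start}
    queue = deque([start])
    while queue:
        S = queue.popleft()
        for a in alphabet:
            # T = { t | s in S and s -a-> t }
            T = frozenset(t for s in S for t in nfa.get((s, a), ()))
            delta[(S, a)] = T
            if T not in seen:
                seen.add(T)
                queue.append(T)
    # A set is accepting when it holds at least one accepting state of N.
    final = {S for S in seen if S & accepting}
    return start, delta, final, seen

start, delta, final, states = determinize(NFA, {"s0"}, {"s3"}, "01")
print(len(states))                      # 8
print(len(final))                       # 4
print(sorted(delta[(start, "1")]))      # ['s0', 's1']
```

Every reachable set contains s0, so D has one state for each subset of {s1, s2, s3}.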
Determinization
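[Figure: the nondeterministic automaton with states s0 to s3, and its determinization with states {s0}, {s0,s1}, {s0,s2}, {s0,s3}, {s0,s1,s2}, {s0,s1,s3}, {s0,s2,s3}, {s0,s1,s2,s3} and transitions labeled 0 and 1.]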
Determinization
[Figure: the same deterministic automaton with its states labeled 000, 001, 010, 011, 100, 101, 110, 111 and transitions labeled 0 and 1.]
Translating regular expressions into automata
[Figure: constructions that combine automata for L1 and L2 into automata for L1∪L2, L1.L2 and L*.]
Automatic translation
(a|b).(a|b) = (a∪b)(a∪b) = (a+b).(a+b) = …
[Figure: automata built for this expression, with branches labeled a and b.]
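Determinization with ε transitions.
[Figure: automaton with states s0 to s11; s1 goes to s3 on a, s2 goes to s4 on b, s7 goes to s9 on a, s8 goes to s10 on b; the other transitions are ε transitions.]
Add to each set the states reachable using ε transitions.
[Figure: resulting deterministic automaton with states {s0,s1,s2}, {s3,s5,s6,s7,s8}, {s4,s5,s6,s7,s8}, {s9,s11}, {s10,s11} and transitions labeled a and b.]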
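Adding the states reachable by ε transitions is usually called taking the ε-closure. The sketch below computes it for the example. The ε edges in EPS are an assumption, reconstructed so that they reproduce the sets listed on the slide; the labeled moves in MOVES follow the figure. The names eps_closure and step are chosen for this illustration.

```python
# Assumed ε edges of the example automaton for (a|b).(a|b).
EPS = {
    "s0": {"s1", "s2"},
    "s3": {"s5"},
    "s4": {"s5"},
    "s5": {"s6"},
    "s6": {"s7", "s8"},
    "s9": {"s11"},
    "s10": {"s11"},
}
# Labeled transitions from the figure.
MOVES = {
    ("s1", "a"): {"s3"},
    ("s2", "b"): {"s4"},
    ("s7", "a"): {"s9"},
    ("s8", "b"): {"s10"},
}

def eps_closure(states):
    """All states reachable from the given states using only ε transitions."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in EPS.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def step(S, letter):
    """One transition of the determinized automaton: move, then close under ε."""
    moved = {t for s in S for t in MOVES.get((s, letter), ())}
    return eps_closure(moved)

start = eps_closure({"s0"})
print(sorted(start))                        # ['s0', 's1', 's2']
print(sorted(step(start, "a")))             # ['s3', 's5', 's6', 's7', 's8']
print(sorted(step(step(start, "a"), "b")))  # ['s10', 's11']
```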
Minimization
Group all the states together.
Separate states according to available exit transitions.
Split a set in two if from some of its states one can reach another set and from the others one cannot.
Repeat until no set can be split.
[Figure: automaton with states p0, p1, p2, p3, p4 and transitions labeled a and b.]
Minimization
Group all the states together.
{p0, p1, p2, p3, p4}.
Minimization
Separate states according to available exit transitions.
Minimization
Split a set in two if from some of its states one can reach another set and from the others one cannot.
Repeat until no set can be split.
Can minimize now
[Figure: the automaton before and after minimization, with transitions labeled a and b.]
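The splitting procedure can be sketched as partition refinement. Two things here are assumptions rather than content of the slides: the transition table (reconstructed from the figure labels as p0 to p1/p2, and p1, p2 to p3/p4) and the accepting states p3 and p4. The sketch also starts from the usual accepting / non-accepting split rather than a single group, so that it stays correct for automata where an accepting and a non-accepting state have the same exit transitions.

```python
# Assumed reconstruction of the example automaton.
DFA = {
    ("p0", "a"): "p1", ("p0", "b"): "p2",
    ("p1", "a"): "p3", ("p1", "b"): "p4",
    ("p2", "a"): "p3", ("p2", "b"): "p4",
}
STATES = ["p0", "p1", "p2", "p3", "p4"]
ACCEPTING = {"p3", "p4"}  # assumption

def minimize(states, alphabet, delta, accepting):
    """Return the final partition; each block becomes one state."""
    non_accepting = set(states) - set(accepting)
    partition = [block for block in (non_accepting, set(accepting)) if block]
    while True:
        # Index of the block each state currently belongs to.
        block_of = {s: i for i, block in enumerate(partition) for s in block}
        refined = []
        for block in partition:
            # States stay together only if, for every letter, they all lack a
            # transition or all lead into the same block.
            groups = {}
            for s in sorted(block):
                signature = tuple(block_of.get(delta.get((s, a))) for a in alphabet)
                groups.setdefault(signature, set()).add(s)
            refined.extend(groups.values())
        if len(refined) == len(partition):  # nothing was split
            return refined
        partition = refined

blocks = minimize(STATES, "ab", DFA, ACCEPTING)
print([sorted(block) for block in blocks])
# [['p0'], ['p1', 'p2'], ['p3', 'p4']]
```

Under these assumptions the five states collapse to three: p1 and p2 merge, and so do p3 and p4.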
Lex
Declarations
%%
Translation rules
%%
Auxiliary procedures
Lex behavior
Lex source program lex.l → Lex program → lex.yy.c
lex.yy.c → C compiler → a.out
Input stream → a.out → Output tokens
Lex behavior
Translates the definitions into an automaton.
The automaton looks for the longest matching string.
Either returns some value to the reading program (parser), or looks for the next token.
Lookahead operator: x/y allows the token x only if y follows it (but y is not part of the token).
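The longest-match rule and the lookahead operator can be imitated in a few lines of Python. This is only a sketch of the behavior, not how Lex works internally (Lex compiles the rules into an automaton): the rule set RULES and the function scan are hypothetical, each rule is tried separately, and Lex's x/y is approximated by a regex lookahead assertion x(?=y). When two rules match the same length, the earlier rule wins, as in Lex.

```python
import re

# Hypothetical rule set; in Lex these would be the translation rules.
RULES = [
    ("IF", re.compile(r"if")),
    ("ID", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("NUM", re.compile(r"[0-9]+")),
]

def scan(text):
    """Split text into (rule name, lexeme) pairs using the longest match."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best_name, best_end = None, pos
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Strictly longer wins; on a tie the earlier rule is kept.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens

print(scan("if ifx 42"))  # [('IF', 'if'), ('ID', 'ifx'), ('NUM', '42')]

# x/y as a lookahead: a number counts only if ".." follows, and ".." is not consumed.
m = re.compile(r"[0-9]+(?=\.\.)").match("1..10")
print(m.group(), m.end())  # 1 1
```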
Lex Project
Project collection date: Feb 11th.
Work in pairs (singles).
Use lex to take a text and check whether the number of open parentheses of any kind is equal to the number of closed parentheses.
Exception: Inside quotes. \” is not a closing quote.