EDAN65: Compilers, Lecture 02
Regular expressions and scanning
Görel Hedin
Revised: 2017-08-29

Course overview
[Figure: the compiler pipeline. Source code (text) → lexical analyzer (scanner) → tokens → syntactic analyzer (parser) → AST (abstract syntax tree) → semantic analyzer → attributed AST → intermediate code generator → intermediate code → optimizer → intermediate code → target code generator → target code. The scanner is specified by regular expressions, the parser by a context-free grammar, and the semantic analyzer by an attribute grammar. Also shown: interpreter, virtual machine, machine, and the runtime system (code and data, stack with activation records, heap with objects, garbage collection). Part of the pipeline is marked "This lecture".]

Analyzing program text
[Figure: the program text "sum = sum + k" is turned into the tokens ID EQ ID PLUS ID, which in turn are turned into a parse tree with the nodes AssignStmt, Exp, Add, Exp, Exp. The step from program text to tokens is marked "This lecture".]

Recall: Generating the compiler
[Figure: each phase is generated from a specification. Regular expressions → scanner generator → lexical analyzer (scanner). Context-free grammar → parser generator → syntactic analyzer (parser). Attribute grammar → attribute evaluator generator → semantic analyzer. Data between the phases: text, tokens, tree.]
We will use a scanner generator called JFlex.

Some typical tokens

Kind                        Token    Example lexemes         Regular expression (JFlex syntax)
Reserved words (keywords)   IF       if                      "if"
                            THEN     then                    "then"
                            FOR      for                     "for"
Identifiers                 ID       B  alpha  k10           [A-Za-z][A-Za-z0-9]*
Literals                    INT      123  0  99  2016        [0-9]+
                            FLOAT    3.1416  0.2             [0-9]+ "." [0-9]+
                            STRING   "Hello"  ""  "100%"     \" [^\"]* \"
                            CHAR     'A'  'c'  '%'           \' [^\'] \'
Operators                   PLUS     +                       "+"
                            INCR     ++                      "++"
                            NE       !=                      "!="
Separators                  SEMI     ;                       ";"
                            COMMA    ,                       ","
                            LPAREN   (                       "("

Formal languages
• An alphabet, Σ, is a set of symbols (nonempty and finite).
• A string is a sequence of symbols (each string is finite).
• A formal language, L, is a set of strings (can be infinite).
• We would like to have rules or algorithms for defining a language: deciding if a certain string over the alphabet belongs to the language or not.

Example: Languages over binary numbers
Suppose we have the alphabet Σ = {0, 1}
Example languages:
• The set of all possible combinations of zeros and ones:
  L0 = {0, 1, 00, 01, 10, 11, 000, ...}
• All binary numbers without unnecessary leading zeros:
  L1 = {0, 1, 10, 11, 100, 101, 110, 111, 1000, ...}
• All binary numbers with two digits:
  L2 = {00, 01, 10, 11}
• ...
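
A minimal sketch of what "deciding if a string belongs to a language" can look like for L0, L1 and L2. The three patterns are assumptions written for this illustration (the slides only list example members), and java.util.regex stands in for whatever rule formalism is used; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class BinaryLanguages {
    // Assumed regular expressions for the three example languages (not from the slides).
    static final Pattern L0 = Pattern.compile("[01]+");    // any nonempty string of zeros and ones
    static final Pattern L1 = Pattern.compile("0|1[01]*"); // no unnecessary leading zeros
    static final Pattern L2 = Pattern.compile("[01][01]"); // exactly two digits

    // Decides whether the whole string s belongs to the given language.
    static boolean member(Pattern language, String s) {
        return language.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"0", "01", "10", "1000", "2"}) {
            System.out.println(s + ": L0=" + member(L0, s)
                    + " L1=" + member(L1, s) + " L2=" + member(L2, s));
        }
    }
}
```

Here "01" belongs to L0 and L2 but not to L1, while "2" is not a string over the alphabet at all, so it belongs to none of them.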

Example: Languages over UNICODE
Here, the alphabet Σ is the set of UNICODE characters
Example languages:
• All possible Java keywords: {"class", "import", "public", ...}
• All possible lexemes corresponding to Java tokens.
• All possible lexemes corresponding to Java whitespace.
• All binary numbers
• ...

Example: Languages over Java tokens
Here, the alphabet Σ is the set of Java tokens
Example languages:
• All syntactically correct Java programs
• All that are syntactically incorrect
• All that are compile-time correct
• All that terminate
  (But this language cannot be computed: termination is undecidable. It is not possible to construct an algorithm that decides, for any string, if it is a terminating program or not.)
• ...

Defining languages using rules
Increasingly powerful:
• Regular expressions (for tokens)
• Context-free grammars (for syntax trees)
• Attribute grammars (context-free grammar + extra rules for further restricting the language)

Regular expressions (core notation)

RE       read                 is called
a        a                    symbol
M | N    M or N               alternative
M N      M followed by N      concatenation
∊        the empty string     epsilon
M*       zero or more M       repetition (Kleene star)
(M)

where a is a symbol in the alphabet (e.g., {0, 1} or UNICODE) and M and N are regular expressions.
Each regular expression defines a language over the alphabet (a set of strings that belong to the language).
Priorities: M | N P* means M | (N (P*))

Example
a | b c*
means
{a, b, bc, bcc, bccc, ...}
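
The priorities can be checked mechanically. The sketch below assumes that java.util.regex uses the same priorities for |, concatenation and * as the core notation (it does for these operators); the class name and the contrasting pattern are made up for the illustration.

```java
import java.util.regex.Pattern;

final class PrecedenceDemo {
    // a | b c*  read with the stated priorities, i.e. a | (b (c*))
    static final Pattern INTENDED = Pattern.compile("a|bc*");
    // A deliberately different grouping, (a | b) c*, shown only for contrast
    static final Pattern REGROUPED = Pattern.compile("(a|b)c*");

    // Decides whether the whole string s is in the language of p.
    static boolean inLanguage(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"a", "b", "bcc", "ac"}) {
            System.out.println(s + ": intended=" + inLanguage(INTENDED, s)
                    + " regrouped=" + inLanguage(REGROUPED, s));
        }
    }
}
```

The string "ac" separates the two readings: it is not in the language of a | b c*, but it would be if the expression were grouped as (a | b) c*.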

Regular expressions (extended notation)

Core RE       read                             is called
a             a                                symbol
M | N         M or N                           alternative
M N           M followed by N                  concatenation
∊             the empty string                 epsilon
M*            zero or more M                   repetition (Kleene star)
(M)

Extended RE   read                             means
M+            at least one ...                 M M*
M?            optional ...                     ∊ | M
[aou]         one of ... (a character class)   a | o | u
[a-zA-Z]                                       a | b | ... | z | A | B | ... | Z
[^0-9]        not ...                          one character, but not any one of those listed
              (Appel notation: ~[0-9])
"a+b"         the string ...                   a\+b

Exercise
Write a regular expression that defines the language of all decimal numbers, like
  3.14  0.75  4711  0  ...
But not numbers lacking an integer part. And not numbers with a decimal point but lacking a fractional part. So not numbers like
  17.  .236  .
Leading and trailing zeros are allowed. So the following are ok:
  007  008.00  0.0  1.700
a) Use the extended notation.
b) Then translate the expression to the core notation.
c) Then write an expression that disallows unnecessary leading zeros (in the extended notation).

Solution
a) [0-9]+ ("." [0-9]+)?
b) (0 | ... | 9)(0 | ... | 9)* (∊ | ("." ((0 | ... | 9)(0 | ... | 9)*)))
c) (0 | [1-9] [0-9]*) ("." [0-9]+)?
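
The three solutions can be tried against the exercise's examples. This is a sketch using java.util.regex rather than JFlex, so "." is written as \. and ∊ as an empty alternative; the class name and the DIGIT helper are assumptions for the illustration.

```java
import java.util.regex.Pattern;

final class DecimalNumbers {
    // Core-notation stand-in for [0-9]
    static final String DIGIT = "(0|1|2|3|4|5|6|7|8|9)";

    // a) extended notation
    static final Pattern A = Pattern.compile("[0-9]+(\\.[0-9]+)?");
    // b) core notation: only alternation, concatenation, star and the empty alternative
    static final Pattern B = Pattern.compile(DIGIT + DIGIT + "*(|\\." + DIGIT + DIGIT + "*)");
    // c) extended notation without unnecessary leading zeros
    static final Pattern C = Pattern.compile("(0|[1-9][0-9]*)(\\.[0-9]+)?");

    // Decides whether the whole string s is in the language of p.
    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"3.14", "0.75", "4711", "0", "17.", ".236", ".", "007", "008.00", "1.700"};
        for (String s : samples) {
            System.out.println(s + ": a=" + matches(A, s) + " b=" + matches(B, s) + " c=" + matches(C, s));
        }
    }
}
```

Solutions a) and b) should agree on every input, since b) is only a translation of a); c) additionally rejects 007 and 008.00.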

Escaped characters
Use backslash to escape metacharacters and non-printing control characters.
Metacharacters
\+
\*
\(
\)
\|
\\
...
Non-printing control characters
\n newline
\r return
\t tab
\f formfeed
...

Some typical tokens

Kind                        Name     Example lexemes       Regular expression
Reserved words (keywords)   IF       if                    "if"
                            THEN     then                  "then"
                            FOR      for                   "for"
Identifiers                 ID       B  alpha  k10         [A-Za-z]([A-Za-z0-9])*
Literals                    INT      123  0  99            [0-9]+
                            FLOAT    3.1416  0.2           [0-9]+ "." [0-9]+
                            CHAR     'A'  'c'              \' [^\'] \'
                            STRING   "Hello"  ""  "j"      \" [^\"]* \"
Operators                   PLUS     +                     "+"
                            INCR     ++                    "++"
                            NE       !=                    "!="
Separators                  SEMI     ;                     ";"
                            COMMA    ,                     ","
                            LPAREN   (                     "("

Some typical non-tokens

Non-token          Example lexemes              Regular expression (JFlex syntax)
WHITESPACE         blank tab newline return     " " | \t | \n | \r
ENDOFLINECOMMENT   // comment                   "//" [^\n\r]* ([\n\r])?

Non-tokens are also recognized by the scanner, just like tokens. But they are not sent on to the parser.
(The newline/return ending an end-of-line comment is optional in order to allow a file to end with an end-of-line comment, without an extra newline/return.)

JFlex: A scanner generator
Generating a scanner for a language lang
[Figure: the scanner specification with regular expressions, lang.jflex, is given to the scanner generator, jflex.jar, which produces LangScanner.java. LangScanner.java reads the characters of Program.lang and delivers tokens to LangParser.java.]

A JFlex specification

  package lang; // the generated scanner will belong to the package lang
  import lang.Token; // Our own class for tokens
  ...
  // ignore whitespace
  " " | \t | \n | \r | \f { /* ignore */ }

  // tokens
  "if"      { return new Token("IF"); }
  "="       { return new Token("ASSIGN"); }
  "<"       { return new Token("LT"); }
  "<="      { return new Token("LE"); }
  [a-zA-Z]+ { return new Token("ID", yytext()); }
  ...

Rules and lexical actions
Each rule has the form:
  regular-expression { lexical action }
The lexical action consists of arbitrary Java code. It is run when a regular expression is matched. The method yytext() returns the lexeme (the token value).

What rules are used when scanning "a<b"?

Ambiguities?

  package lang; // the generated scanner will belong to the package lang
  import lang.Token; // Class for tokens
  ...
  // ignore whitespace
  " " | \t | \n | \r | \f { /* ignore */ }

  // tokens
  "if"      { return new Token("IF"); }
  "="       { return new Token("ASSIGN"); }
  "<"       { return new Token("LT"); }
  "<="      { return new Token("LE"); }
  [a-zA-Z]+ { return new Token("ID", yytext()); }
  ...

Are the token definitions ambiguous?
Which rules match "<="?
Which rules match "if"?
Which rules match "ifff"?
Which rules match "xyz"?

Extra rules for resolving ambiguities

Longest match
If one rule can be used to match a token, but there is another rule that will match a longer token, the latter rule will be chosen. This way, the scanner will match the longest token possible.

Rule priority
If two rules can be used to match the same sequence of characters, the first one takes priority.
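
The two rules can be sketched directly in Java, without a scanner generator. This is not how JFlex works internally; it is a small illustration that tries every rule at the current position with java.util.regex, keeps the longest match, and lets the earlier rule win on equal length. The class and method names are hypothetical, and whitespace rules are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RuleResolutionDemo {
    // Rules in priority order, mirroring the token rules of the JFlex example.
    static final String[] KINDS = {"IF", "ASSIGN", "LT", "LE", "ID"};
    static final Pattern[] RULES = {
        Pattern.compile("if"), Pattern.compile("="), Pattern.compile("<"),
        Pattern.compile("<="), Pattern.compile("[a-zA-Z]+")
    };

    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int bestRule = -1;
            int bestEnd = pos;
            for (int r = 0; r < RULES.length; r++) {
                Matcher m = RULES[r].matcher(input).region(pos, input.length());
                // Only a strictly longer match replaces the current best (longest match);
                // on equal length the earlier rule is kept (rule priority).
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestRule = r;
                    bestEnd = m.end();
                }
            }
            if (bestRule < 0) {
                // No rule matches here: report one erroneous character and move on.
                tokens.add("ERROR(" + input.charAt(pos) + ")");
                pos++;
            } else {
                tokens.add(KINDS[bestRule] + "(" + input.substring(pos, bestEnd) + ")");
                pos = bestEnd;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"<=", "if", "ifff", "xyz", "a<b"}) {
            System.out.println(s + " -> " + scan(s));
        }
    }
}
```

With these rules "if" is scanned as IF because IF and ID match the same two characters and IF comes first, while "ifff" is scanned as a single ID because the ID rule matches a longer lexeme.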

Implementation of scanners
Observation:
Regular expressions are equivalent to finite automata (finite-state machines). (They can recognize the same class of formal languages: the regular languages.)
Overall approach:
• Translate each token regular expression to a finite automaton. Label the final state with the token.
• Merge all the automata.
• The resulting automaton will in general be nondeterministic.
• Translate the nondeterministic automaton to a deterministic automaton.
• Implement the deterministic automaton, either using switch statements or a table.
A scanner generator automates this process.

Construct an automaton for each token regexp
[Figure: a legend showing a state, a transition labeled a, the start state, and a final state, followed by one automaton per token.
"if": a transition on i followed by a transition on f, ending in a final state labeled IF.
[0-9]+: a transition on 0-9 to a final state labeled INT, which loops on 0-9.
" " | \n | \t: one transition each on " ", \n and \t, each to a final state labeled WHITESPACE.
[a-zA-Z]+: a transition on a-zA-Z to a final state labeled ID, which loops on a-zA-Z.]

Merge the start states of the automata
[Figure: the four automata now share one start state. From it there are transitions on i (followed by f to IF), on 0-9 (to INT, looping on 0-9), on " ", \n and \t (to WHITESPACE), and on a-zA-Z (to ID, looping on a-zA-Z).]
Is the new automaton deterministic?

Deterministic finite automata
In a deterministic finite automaton each transition is uniquely determined by the input.
[Figure: three small automata.]
• State 1 has two transitions labeled a, one to state 2 and one to state 3: nondeterministic, since if we read a when in state 1, we don't know if we should go to state 2 or 3.
• State 1 has an ε transition to state 2: nondeterministic, since when we are in state 1, we don't know if we should stay there, or go to state 2 without reading any input. (Epsilon denotes the empty string.)
• State 1 has a transition labeled a to state 2 and a transition labeled b to state 3: deterministic, since from state 1, the next input determines if we go to state 2 or 3.

DFA versus NFA

Deterministic Finite Automaton (DFA)
A finite automaton is deterministic if
– all outgoing edges from any given state have disjoint character sets
– there are no epsilon edges
Can be implemented efficiently

Non-deterministic Finite Automaton (NFA)
An NFA may have
– two outgoing edges with overlapping character sets
– epsilon edges

Every DFA is also an NFA. Every NFA can be translated to an equivalent DFA.

Translating an NFA to a DFA

Simulate the NFA
– keep track of a set of current NFA states
– follow ε edges to extend the current set (take the closure)

Construct the corresponding DFA
– Each such set of NFA states corresponds to one DFA state
– If any of the NFA states is final, the DFA state is also final, and is marked with the corresponding token.
– If there is more than one token to choose from, select the token that is defined first (rule priority).

(Minimize the DFA for efficiency)
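
The translation can be sketched as a small subset construction in Java. The representation (maps from states to edges, the character '\0' as the ε label, final states registered in rule order) and all names are assumptions made for this illustration; minimization is left out. The example NFA is a merged automaton for "if" and [a-z]+ with states 1 to 4.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class SubsetConstruction {
    static final char EPS = '\0'; // assumed label for epsilon edges
    static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz";

    // NFA edges: state -> (symbol -> target states)
    private final Map<Integer, Map<Character, Set<Integer>>> edges = new HashMap<>();
    // Final NFA states in rule order: earlier entries have higher priority.
    private final Map<Integer, String> finals = new LinkedHashMap<>();

    void addEdge(int from, char symbol, int to) {
        edges.computeIfAbsent(from, k -> new HashMap<>())
             .computeIfAbsent(symbol, k -> new TreeSet<>())
             .add(to);
    }

    void addRange(int from, char lo, char hi, int to) {
        for (char c = lo; c <= hi; c++) addEdge(from, c, to);
    }

    void markFinal(int state, String token) { finals.put(state, token); }

    private Set<Integer> targets(int state, char symbol) {
        Map<Character, Set<Integer>> out = edges.get(state);
        if (out == null) return Collections.emptySet();
        Set<Integer> t = out.get(symbol);
        return t == null ? Collections.<Integer>emptySet() : t;
    }

    // Extends a set of NFA states with everything reachable over epsilon edges.
    Set<Integer> closure(Set<Integer> states) {
        Set<Integer> result = new TreeSet<>(states);
        Deque<Integer> work = new ArrayDeque<>(states);
        while (!work.isEmpty()) {
            for (int t : targets(work.pop(), EPS)) {
                if (result.add(t)) work.push(t);
            }
        }
        return result;
    }

    // Token of a DFA state: the first-declared token among its final NFA states, or null.
    String tokenOf(Set<Integer> dfaState) {
        for (Map.Entry<Integer, String> f : finals.entrySet()) {
            if (dfaState.contains(f.getKey())) return f.getValue();
        }
        return null;
    }

    // Each DFA state is a set of NFA states; a missing edge means "go to the dead state".
    Map<Set<Integer>, Map<Character, Set<Integer>>> toDfa(int start, String alphabet) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new LinkedHashMap<>();
        Deque<Set<Integer>> work = new ArrayDeque<>();
        Set<Integer> startState = closure(Collections.singleton(start));
        dfa.put(startState, new HashMap<>());
        work.push(startState);
        while (!work.isEmpty()) {
            Set<Integer> current = work.pop();
            for (char c : alphabet.toCharArray()) {
                Set<Integer> moved = new TreeSet<>();
                for (int s : current) moved.addAll(targets(s, c));
                if (moved.isEmpty()) continue;
                Set<Integer> next = closure(moved);
                if (!dfa.containsKey(next)) {
                    dfa.put(next, new HashMap<>());
                    work.push(next);
                }
                dfa.get(current).put(c, next);
            }
        }
        return dfa;
    }

    // Merged NFA for "if" (final state 3) and [a-z]+ (final state 4).
    static SubsetConstruction example() {
        SubsetConstruction nfa = new SubsetConstruction();
        nfa.addEdge(1, 'i', 2);
        nfa.addEdge(2, 'f', 3);
        nfa.addRange(1, 'a', 'z', 4);
        nfa.addRange(4, 'a', 'z', 4);
        nfa.markFinal(3, "IF"); // registered first, so IF has priority over ID
        nfa.markFinal(4, "ID");
        return nfa;
    }
}
```

Calling example().toDfa(1, ALPHABET) yields four DFA states, {1}, {2,4}, {3,4} and {4}; the state {3,4} contains final states for both IF and ID and is marked IF by rule priority.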

Example
[Figure: an NFA and the corresponding DFA.
NFA: states 1 to 4. Edges: 1 to 2 on i, 2 to 3 on f, 1 to 4 on a-z, 4 to 4 on a-z. State 3 is final with token IF, state 4 is final with token ID.
DFA: states 1, {2,4} (ID), {3,4} (IF) and 4 (ID). Edges: 1 to {2,4} on i, 1 to 4 on a-h and j-z, {2,4} to {3,4} on f, {2,4} to 4 on a-e and g-z, {3,4} to 4 on a-z, 4 to 4 on a-z.]

Error handling
• Add a "dead state" (state 0), corresponding to erroneous input.
• Add transitions to the "dead state" for all erroneous input.
• Generate an "ERROR token" when the dead state is reached.
[Figure: the DFA for IF and ID with states 1, 2 (ID), 3 (IF) and 4 (ID), extended with state 0 (ERROR). Each of the states 1 to 4 has a transition on ^a-z (any character other than a-z) to state 0.]

Implementation alternatives for DFAs

Table-driven
– Represent the automaton by a table
– Additional table to keep track of final states and token kinds
– A global variable keeps track of the current state

Switch statements
– Each state is implemented as a switch statement
– Each case implements a state transition as a jump (to another switch statement)
– The current state is represented by the program counter.
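
A minimal sketch of the switch-statement alternative for a DFA that recognizes IF and ID over a-z (states 1 to 4). Java has no goto, so the jump to another switch statement is approximated here by a call to the method that implements the target state; that choice, the class name, and the fact that the sketch classifies a whole string instead of scanning a token stream are assumptions made for the illustration.

```java
final class SwitchDfa {
    private final String input;
    private int pos = 0;

    private SwitchDfa(String input) { this.input = input; }

    // Runs the DFA over the whole input and returns IF, ID or ERROR.
    static String classify(String input) {
        return new SwitchDfa(input).state1();
    }

    // Returns the next character, or -1 at the end of the input.
    private int next() {
        return pos < input.length() ? input.charAt(pos++) : -1;
    }

    private static boolean isLetter(int c) {
        return c >= 'a' && c <= 'z';
    }

    private String state1() { // start state, not final
        int c = next();
        switch (c) {
            case 'i': return state2();
            default:  return isLetter(c) ? state4() : "ERROR";
        }
    }

    private String state2() { // final: ID
        int c = next();
        switch (c) {
            case -1:  return "ID";
            case 'f': return state3();
            default:  return isLetter(c) ? state4() : "ERROR";
        }
    }

    private String state3() { // final: IF
        int c = next();
        switch (c) {
            case -1: return "IF";
            default: return isLetter(c) ? state4() : "ERROR";
        }
    }

    private String state4() { // final: ID, loops on a-z
        while (true) {
            int c = next();
            switch (c) {
                case -1: return "ID";
                default: if (!isLetter(c)) return "ERROR"; // otherwise stay in state 4
            }
        }
    }
}
```

There is no state variable: which method is currently executing is the current state, which is the sense in which the program counter represents it.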

Table-driven implementation

state  ...  +  ...  a  ...  e  f  g  ...  h  i  j  ...  z  ...  final  kind
0      0    0  0    0  0    0  0  0  0    0  0  0  0    0  0    true   ERROR
1      0    5  0    4  4    4  4  4  4    4  2  4  4    4  0    false
2      0    0  0    4  4    4  3  4  4    4  4  4  4    4  0    true   ID
3      0    0  0    4  4    4  4  4  4    4  4  4  4    4  0    true   IF
4      0    0  0    4  4    4  4  4  4    4  4  4  4    4  0    true   ID
5      0    0  0    0  0    0  0  0  0    0  0  0  0    0  0    true   PLUS

[Figure: the DFA that the table represents. Edges: 1 to 2 on i, 1 to 4 on a-h and j-z, 1 to 5 on +, 2 to 3 on f, 2 to 4 on a-e and g-z, 3 to 4 on a-z, 4 to 4 on a-z. Final states: 2 (ID), 3 (IF), 4 (ID), 5 (PLUS).]
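
The table can be written down in Java roughly as follows. This is a sketch: restricting the columns to the 128 ASCII characters, filling the table in a static initializer, and classifying a whole string rather than scanning a token stream are assumptions made for the illustration, as are the class and field names.

```java
final class TableDfa {
    static final int DEAD = 0;
    static final int START = 1;

    // EDGES[state][character]; entries default to 0, the dead state.
    static final int[][] EDGES = new int[6][128];
    static final boolean[] IS_FINAL = {true, false, true, true, true, true};
    static final String[] KIND = {"ERROR", null, "ID", "IF", "ID", "PLUS"};

    static {
        // States 1 to 4 go to state 4 on any letter a-z ...
        for (char c = 'a'; c <= 'z'; c++) {
            for (int s = 1; s <= 4; s++) EDGES[s][c] = 4;
        }
        // ... except for the edges that spell "if", plus the edge for "+".
        EDGES[1]['i'] = 2;
        EDGES[2]['f'] = 3;
        EDGES[1]['+'] = 5;
    }

    // Runs the DFA over the whole string and reports the kind of the state it ends in.
    static String classify(String s) {
        int state = START;
        for (int i = 0; i < s.length() && state != DEAD; i++) {
            char ch = s.charAt(i);
            state = ch < 128 ? EDGES[state][ch] : DEAD;
        }
        // Ending in the non-final start state (empty input) is also reported as ERROR.
        return IS_FINAL[state] ? KIND[state] : "ERROR";
    }
}
```

For example, classify("if") ends in state 3 and classify("iff") ends in state 4, matching rows 2 and 3 of the table.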

Scanner implementation, design
[Figure: the Parser calls Token nextToken() on the Scanner, and the Scanner calls char nextChar() on the File. A Token has the methods int kind() and String value().]

Scanner implementation, sketch

Idea: Scan the next token by
• starting in the start state
• scanning characters until we reach a final state
• returning a new token

  Token nextToken() {
    state = 1; // start state
    while (!isFinal[state]) {
      ch = file.readChar();
      state = edges[state][ch];
    }
    return new Token(kind[state]);
  }

Needs to be extended with handling of:
• longest match
• end of file
• non-tokens (like whitespace)
• token values (like the identifier name)

Extend to longest match, design
[Figure: the Parser calls Token nextToken() on the Scanner. The Scanner calls char readChar() and void pushback(String) on a PushbackFile, which in turn calls char readChar() on the File. A Token has the methods int kind() and String value().]
Idea:
• When a token is matched, don't stop scanning.
• When the error state is reached, return the last token matched.
• Push read characters that are unused back into the file, so they can be scanned again.
• Use a PushbackFile to accomplish this.

Extend to handle longest match, sketch

  Token nextToken() {
    state = 1;
    str = "";
    lastFinalState = 0; lastTokenValue = "";
    while (state != 0) {
      ch = pushbackfile.readChar();
      str = str + ch;
      state = edges[state][ch];
      if (state != 0 && isFinal[state]) { // the error state must not count as a match
        lastFinalState = state;
        lastTokenValue = str;
      }
    }
    pushbackfile.pushback(str.substring(lastTokenValue.length()));
    return new Token(kind[lastFinalState], lastTokenValue);
  }

  // In Java, StringBuilder would be more efficient

• When a token is matched (a final state reached), don't stop scanning.
• Keep track of the currently scanned string, str.
• Keep track of the latest matched token (lastFinalState, lastTokenValue).
• Continue scanning until we reach the error state.
• Restore the input stream using PushbackFile.
• Return the latest matched token.
• (or return the ERROR token if there was no latest matched token)
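
A runnable variant of the sketch, under several assumptions: java.io.PushbackReader plays the role of the PushbackFile, the DFA is a small one for IF, ID and PLUS with dead state 0, tokens are rendered as strings of the form KIND(lexeme) instead of Token objects, and one character is consumed when no token matches so that the scanner always makes progress. None of these choices come from the slides.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class LongestMatchScanner {
    // Token kind per state; null marks the non-final start state.
    static final String[] KIND = {"ERROR", null, "ID", "IF", "ID", "PLUS"};
    private final PushbackReader in;

    LongestMatchScanner(String text) {
        // The pushback buffer must be able to hold everything read for one token.
        in = new PushbackReader(new StringReader(text), Math.max(1, text.length()));
    }

    // Transition function of a DFA with states 0..5; state 0 is the dead state.
    static int edge(int state, int ch) {
        boolean letter = ch >= 'a' && ch <= 'z';
        switch (state) {
            case 1:  return ch == 'i' ? 2 : letter ? 4 : ch == '+' ? 5 : 0;
            case 2:  return ch == 'f' ? 3 : letter ? 4 : 0;
            case 3:
            case 4:  return letter ? 4 : 0;
            default: return 0;
        }
    }

    // Returns the next token rendered as KIND(lexeme), or "EOF" at the end of the input.
    String nextToken() {
        try {
            int state = 1, lastFinalState = 0, lastLength = 0;
            StringBuilder str = new StringBuilder();
            while (state != 0) {
                int ch = in.read();
                if (ch == -1) break;                      // end of input
                str.append((char) ch);
                state = edge(state, ch);
                if (state != 0 && KIND[state] != null) {  // a final state other than the dead state
                    lastFinalState = state;
                    lastLength = str.length();
                }
            }
            if (str.length() == 0) return "EOF";
            if (lastFinalState == 0) lastLength = 1;      // assumption: consume one character on error
            // Push back the characters read beyond the longest match.
            in.unread(str.substring(lastLength).toCharArray());
            return KIND[lastFinalState] + "(" + str.substring(0, lastLength) + ")";
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

On the input "iffy+if", the first call passes through the IF state after "if" but keeps scanning, so it returns ID(iffy) and pushes back the "+" that led to the dead state.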

Handling End-of-file (EOF) and non-tokens

EOF
– construct an explicit EOF token when the EOF character is read

Non-tokens (Whitespace & Comments)
– view as tokens of a special kind
– scan them as normal tokens, but don't create token objects for them
– loop in next() until a real token has been found

Errors
– construct an explicit ERROR token to be returned when no valid token can be found.
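
The loop in next() can be sketched as a filter over the raw scanner output. Everything here is an assumption for the illustration: raw tokens are {kind, value} string pairs taken from a list, the non-token kinds are called WHITESPACE and COMMENT, and unlike the description above the raw tokens for non-tokens are created and then skipped, since the point is only to show the loop.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class NonTokenFilter {
    // Assumed raw scanner output: {kind, value} pairs, including non-tokens.
    private final Iterator<String[]> raw;

    NonTokenFilter(List<String[]> rawTokens) {
        this.raw = rawTokens.iterator();
    }

    static boolean isNonToken(String kind) {
        return kind.equals("WHITESPACE") || kind.equals("COMMENT");
    }

    // Loops until a real token has been found; returns an explicit EOF token
    // once the input is exhausted. ERROR tokens are passed on like real tokens.
    String[] next() {
        while (raw.hasNext()) {
            String[] token = raw.next();
            if (!isNonToken(token[0])) return token;
        }
        return new String[] {"EOF", ""};
    }

    public static void main(String[] args) {
        NonTokenFilter f = new NonTokenFilter(Arrays.asList(
                new String[] {"ID", "a"}, new String[] {"WHITESPACE", " "},
                new String[] {"COMMENT", "// c"}, new String[] {"ERROR", "?"}));
        for (String[] t = f.next(); !t[0].equals("EOF"); t = f.next()) {
            System.out.println(t[0] + " " + t[1]);
        }
    }
}
```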

Specifying EOF and ERROR in JFlex

  package lang; // the generated scanner will belong to the package lang
  import lang.Token; // Class for tokens
  ...
  // ignore whitespace
  " " | \t | \n | \r | \f { /* ignore */ }

  // tokens
  "if"      { return new Token("IF"); }
  "="       { return new Token("ASSIGN"); }
  "<"       { return new Token("LT"); }
  "<="      { return new Token("LE"); }
  [a-zA-Z]+ { return new Token("ID", yytext()); }
  ...
  <<EOF>>   { return new Token("EOF"); }
  [^]       { return new Token("ERROR"); }

<<EOF>> is a special regular expression in JFlex, matching end of file.
[^] means any character. Due to rule priority, this will match any character not matched by previous rules.

Example scanner generators

tool                author                generates
lex                 Schmidt, Lesk. 1975   C code
flex ("fast lex")   Paxson. 1987          C code
jlex                                      Java code
jflex                                     Java code
...

Limitations of regular expressions for scanning
• Nested comments?
• Layout-sensitive syntax?
• Context-sensitive token definitions? For example, multi-language documents.
• Two mechanisms in scanner generators for workarounds:
  – Lexical actions: do more than create a token, e.g., count nesting levels of comments.
  – Lexical states: switch between different sets of token definitions.

Lexical states
• Some tokens are difficult or impossible to define with regular expressions.
• Lexical states (sets of token rules) give the possibility to switch token sets (DFAs) during scanning.
• Useful for multi-line comments, HTML, scanning multi-language documents, etc.
• Supported by many scanner generators (including JFlex)
[Figure: two sets of token rules. LEXSTATE1 contains T1, T2, T3, T4, ... and LEXSTATE2 contains T5, T6, T7, ...]

Example: multi-line comments
Would like to scan the complete comment as one token:

  /*
  int m() {
    return 15 / 3 * 4 * 2;
  }
  */

Writing an ordinary regular expression for this is difficult:
  "/*"((\*+[^/*])|([^*]))*\**"*/"

Can be solved easily with lexical states:
[Figure: the default token set contains ID, "if", "then", "/*", ...; the token set used inside a comment contains "*/" and [^].]

However, some scanner generators, like JFlex, have the special operator upto (~) that can be used instead:
  "/*" ~"*/" { /* Comment */ }
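
The ordinary regular expression can be tried out with java.util.regex. In this sketch the JFlex string literals "/*" and "*/" are written as escaped characters, which is the only change to the expression; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CommentRegex {
    // "/*" ((\*+[^/*]) | ([^*]))* \** "*/"  in java.util.regex syntax
    static final Pattern COMMENT =
            Pattern.compile("/\\*((\\*+[^/*])|([^*]))*\\**\\*/");

    // Returns the length of the comment at the start of text, or -1 if there is none.
    static int commentLength(String text) {
        Matcher m = COMMENT.matcher(text);
        return m.lookingAt() ? m.end() : -1;
    }

    public static void main(String[] args) {
        String comment = "/*\nint m() {\n  return 15 / 3 * 4 * 2;\n}\n*/";
        System.out.println(commentLength(comment + " int x;") == comment.length());
    }
}
```

The body of the expression can never consume the two characters */ in sequence, so the match ends at the first */ even when more text with another */ follows.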

Course overview
[Figure: the compiler pipeline from the course overview at the start of the lecture, now with one part marked "This lecture" and another marked "Next lecture". The label A1 appears twice in the figure.]

Summary questions
• What is a formal language?
• What is a regular expression?
• What is meant by an ambiguous lexical definition?
• Give some typical examples of ambiguities and how they may be resolved.
• What is a lexical action?
• Give an example of how to construct an NFA for a given lexical definition.
• Give an example of how to construct a DFA for a given NFA.
• What is the difference between a DFA and an NFA?
• Give an example of how to implement a DFA in Java.
• How is rule priority handled in the implementation? Longest match? EOF? Whitespace? Errors?
• What are lexical states? When are they useful?

You can start on Assignment 1 now. But you will have to wait until the next lecture for the parts about parsing.