03-60-214 lexical analysis s a, b b ajlu.myweb.cs.uwindsor.ca/214/214lexer2019.pdf · 7 regular...
TRANSCRIPT
1
03-60-214Lexicalanalysis S
a, b
a B
2
Lexicalanalysisinperspec5ve
• LEXICALANALYZER– ScanInput– RemoveWhiteSpace,NewLine,…– Iden5fyTokens– CreateSymbolTable– InsertTokensintoSymbolTable– GenerateErrors– SendTokenstoParser
lexical analyzer parser
symbol table
source program
token
get next token
• PARSER– PerformSyntaxAnalysis
– Ac5onsDictatedbyTokenOrder
– UpdateSymbolTableEntries
– CreateAbstractRepresenta5onofSource
– GenerateErrors
• LEXICALANALYZER:Transformscharacterstreamtotokenstream– Alsocalledscanner,lexer,linearanalyzer
3
Whereweare
Total=price+tax;
Total = price + tax ;
Lexical analyzer
Parser
id+id
Expr
assignment
=id
Regular expression
4
Basicterminologiesinlexicalanalysis
• Token– Aclassifica5onforacommonsetofstrings– Examples:if,<identifier>, <number> …
• Pa[ern– Theruleswhichcharacterizethesetofstringsforatoken– RecallfileandOSwildcards(*.java)
• Lexeme– Actualsequenceofcharactersthatmatchespa[ernandisclassifiedby
atoken– Iden5fiers:if, price, 10.00,etc…
Regular expression
if (price + gst – rebate <= 10.00) gift := false
5
Examplesoftoken,lexemeandpa[ernif (price + gst – rebate <= 10.00) gift := false
Token lexeme Informaldescrip5onofpa[ern
if if If
Lparen ( (
Iden5fier price Stringconsistsofle[ersandnumbersandstartswithale[er
operator + +
iden5fier gst Stringconsistsofle[ersandnumbersandstartswithale[er
operator - -
iden5fier rebate Stringconsistsofle[ersandnumbersandstartswithale[er
operator <= Lessthanorequalto
number 10.00 Anynumericconstant
rparen ) )
iden5fier gid Stringconsistsofle[ersandnumbersandstartswithale[er
operator := Assignmentsymbol
iden5fier false Stringconsistsofle[ersandnumbersandstartswithale[er
Regular expression
6
Regularexpression• Scannerisbasedonregularexpression.• Rememberlanguageisasetofstrings.• Examplesofregularexpression
– Le[era a– Keywordif if– Allthele[ers a|b|c|...|z|A|B|C...|Z– Allthedigits 0|1|2|3|4|5|6|7|8|9– AlltheIden5fiers le[er(le[er|digit)*
• Basicopera5ons:– Setunion,e.g.,a|b– Concatena5on,e.g,ab– Kleeneclosure,e.g.,a*
Regular expression
7
Regularexpression
• Regularexpression:construc5ngsequencesofsymbols(strings)fromanalphabet.
• LetΣbeanalphabet,raregularexpressionthenL(r)isthelanguagethatischaracterizedbytherulesofr
• Defini5onofregularexpression– εisaregularexpressionthatdenotesthelanguage{ε}
• Notethatitisnot{}– IfaisinΣ,aisaregularexpressionthatdenotes{a}– LetrandsberegularexpressionswithlanguagesL(r)andL(s).Then
• r|sisaregularexpressionàL(r)∪L(s)• rsisaregularexpressionàL(r)L(s)• r*isaregularexpressionà(L(r))*
• Itisaninduc5vedefini5on!• Dis5nc5onbetweenregularlanguageandregularexpression
Regular expression
8
Formallanguageopera5ons
Opera5on Nota5on Defini5on ExampleL={a,b}M={0,1}
unionofLandM L∪M
L∪M={s|sisinLorsisinM}
{a,b,0,1}
concatena)onofLandM
LM
LM={st|sisinLandtisinM}
{a0,a1,b0,b1}
KleeneclosureofL
L*
L*denoteszeroormoreconcatena5onsofL
Allthestringsconsistsof“a”and“b”,plustheemptystring.{ε, a,aa,bb,ab,ba,aaa,…}
posi)veclosure L+
L+denotes“oneormoreconcatena5onsof“L
Allthestringsconsistsof“a”and“b”.
Regular expression
9
Regularexpressionexamplerevisited
• Examplesofregularexpression– le[eràa|b|c|...|z|A|B|C...|Z– digità0|1|2|3|4|5|6|7|8|9– Iden5fieràle[er(le[er|digit)*
• Exercise:whyisitaregularexpression?
Regular expression
10
Precedenceofoperators
• Can the following RE be simplified? (a) | ((b)*(c))
• *isofthehighestprecedence;• ts(Concatena5on)comesnext;• |lowest.
• Example– (a) | ((b)*(c))isequivalenttoa|b*c
Regular expression
11
Proper5esofregularexpressions
Property Descrip5on
r|s=s|r |iscommuta5ve
r|(s|t)=(r|s)|t |isassocia5ve
(rs)t=r(st) Concatena5onisassocia5ve
r(s|t)=rs|rt(s|t)r=sr|tr
Concatena5ondistributesover|
......
What is why we can write
• either a|b or b|a
• a|b|c, or (a|b)|c
• abc, or (ab)c
Regular expression
12
Nota5onalshorthandofregularexpression
• Oneormoreinstance– L+=LL*– L*=L+|ε – Example
• digitsà digit digit* • digitsàdigit+
• Zerooroneinstance– L?=L|ε– Example:
• Op5onal_frac5onà.digits|ε • op5onal_frac5onà(.digits)?
• Characterclasses– [abc]=a|b|c– [a-z]=a|b|c...|z
Regular expression
REexample
• Stringsoflength5• Supposetheonlycharactersareaandb• Solu5on
(a|b)(a|b)(a|b)(a|b)(a|b)
• Simplifica5on(a|b){5}
13
REexample
• Stringscontainingatmostonea b*ab* | b* Or b*(a|ε) b* | b*
• Moreconciserepresenta5onb*a?b*
14
REforemailaddress
• [email protected]@[email protected]…le[er=[a-zA-Z]digit=[0-9]w=le[er|digitw+(.w+)*@w+.w+(.w+)*
15
REforoddnumbers
Isthefollowingcorrect?[13579]+33343443[0-9]*[13579]
16
17
Moreregularexpressionexample• REforrepresen5ngmonths
– Exampleoflegalinputs• Febcanberepresentedas02or2• Novemberisrepresentedas11
• Firsttry:(0|1)?[0-9]– Matchesalllegalinputs?Yes
• 1,2,11,12,01,02,...– Matchesnoillegalinputs?No
• 13,14,..etc• Secondtry:
(0|1)?[0-9]=(ε|(0|1))[0-9]=[0-9]|(0|1)[0-9]=[0-9]|(0[0-9]|1[0-9][0-9]|(0[0-9]|1[0-2]– Matchesalllegalinputs?Yes
• 1,2,11,12,01,02,...– Matchesnoillegalinputs?No
• 0,00
Regular expression
18
Deriveregularexpressions
• Solu5on:[1-9]|(0[1-9])|(1[012])– Either1-9,or0followedby1to9,or1followedby0,1,or2.– Matchesalllegalinputs– Matchesnoillegalinputs
• Moreconcisesolu5on:0?[1-9]|1[012]– Isitequalto[1-9]|(0[1-9])|(1[012])?0?[1-9]|1[012]=(ε|0)[1-9]|1[012](byshorthandnota5on)=ε[1-9]|0[1-9]|1[012](bydistribu5onover|)=[1-9]|0[1-9]|1[012]
Regular expression
19
Regularexpressionexample(realnumber)
• Realnumbersuchas0,1,2,3.14– Digit:[0-9]– Integer:[0-9]+– Firsttry:[0-9]+(.[0-9]+)?
• Wanttoallow“.25”aslegalinput?
– Secondtry:[0-9]+ | ([0-9]*.[0-9]+)
• Op5onalunaryminus:-?([0-9]+|([0-9]*.[0-9]+))-?(\d+|(\d*.\d+))
Regular expression
20
Regularexpressionexercises
• Canthestringbaabecreatedfromtheregularexpressiona*b*a*b*?
• Describethelanguage(inwords)representedby(a*a)b|ab.a+b
• Writetheregularexpressionthatrepresents:– AllstringsoverΣ={a,b}thatendina.[ab]*a
(a|b)*a– AllstringsoverΣ={0,1}ofevenlength. ((0|1)(0|1))*(00|01|10|11)*
Regular expression
21
Regulargrammarandregularexpression• Theyareequivalent
– Everyregularexpressioncanbeexpressedbyregulargrammar– Everyregulargrammarcanbeexpressedbyregularexpression– Differentwaystoexpressthesamething
• Whytwonota5ons– Grammarisamoregeneralconcept– REismoreconcise
• Howtotranslatebetweenthem– Useautomata– Willintroducelater
Regular expression
22
Whatwelearntlastclass
• Defini5onofregularexpression– εisaregularexpressionthatdenotesthelanguage{ε}
• Notethatitisnot{}– IfaisinΣ,aisaregularexpressionthatdenotes{a}– LetrandsberegularexpressionswithlanguagesL(r)andL(s).Then
• (r)|(s)isaregularexpressionàL(r)∪L(s)• (r)(s)isaregularexpressionàL(r)L(s)• (r)*isaregularexpressionà(L(r))*
Regular expression
23
Applica5onsofregularexpression
• InWindows– InwindowsyoucanuseREtosearchforfilesortextsinafile
• Inunix,therearemanyRErelevanttools,suchasGrep– StandsforGlobalRegularExpressionsandPrint(orGlobalRegular
ExpressionandParser…);– UsefulUNIXcommandtofindpa[ernsofcharactersinatextfile;
• XMLDTDcontentmodel– <!ELEMENTstudent(name,(phone|cell)*,address,course+)>
<student><name>Jianguo</name><phone>1234567</phone><phone>2345678</phone><address>401sunsetave</address><course>214</course></student>
• JavaCoreAPIhasregexpackage!• Scannergenera5on
Regular expression
24
• REinXMLSchema<xsd:simpleTypename="TelephoneNumber"><xsd:restric5onbase="xsd:string"><xsd:lengthvalue="8"/><xsd:pa[ernvalue="\d{3}-\d{4}"/></xsd:restric5on></xsd:simpleType>
Regular expression
25
RegularexpressionsusedinScanner,Stringetc
• Asampleproblem– Developaprogramthat,givenasinputthreepointsP1,P2,andP3onthecartesian
coordinateplane,reportswhetherP3liesonthelinecontainingP1andP2.– Inordertoinputthethreepoints,theuserenterssixintegersx1,y1,x2,y2,x3,andy3.
ThethreepointsareP1=(x1,y1),P2=(x2,y2),andP3=(x3,y3).– Theprogramshouldrepeatun5ltheuser'sinputissuchthatP1andP2arethesame
point.• Sampleinput
– Enterx1:0– Entery1:0– Enterx2:2– Entery2:5– Enterx3:1– Entery3:3
• Output– Thepoint(1,3)ISNOTonthelineconstructedfrompoints(2,5)and(0,0).
Regular expression in Java
26
Howtoreadandprocesstheinput
• FirstTryScannersc=newScanner(System.in);intx1=sc.nextInt();java.u5l.InputMismatchExcep5on
• Wewanttocapturethevaluesforxandy,anddiscardeverythingelse
• DescribeeverythingelseasdelimitersScannersc=newScanner(System.in).useDelimiter(“…”);
• Thedelimiterscanbeanyregularexpression
SampleinputEnterx1:2Entery1:8Enterx2:0Entery2:0Enterx3:1Entery3:4
Regular expression in Java
27
useDelimiterinScannerclass
• useDelimiter(“Enterx1:”)– Thiswillthrowaway“Enterx1:”only.Todiscard“Enterx2:”aswell,
youmaywanttoaddthefollowing
• useDelimiter(“Enterx1:|Enterx2:”)– Ver5calbarmeans“OR”—either“Enter x1:”OR“Enter x2:” – ItiscalledRegularexpression;– Nowyouknowhowtoexpandtothecaseforx3--
• useDelimiter(“Enterx1:|Enterx2:|Enterx3”)– Wecansimplifiedtheaboveusingothernota5onsinREGUALR
EXPRESSION.
Regular expression in Java
28
Useregularexpressiontocapturetheinput
• useDelimiter(“Enterx\\d”)– Where\\dmeansanydigit.Nowhowtoreadinthevaluesforyaxis?
• useDelimiter(“Enter\\w\\d”)– \\wmeansanyle[erordigit
• useDelimiter(“Enter\\w{2}”)– Whatifthereareleadingandtrailingspacesaroundthesampleinput?
• useDelimiter(“(|\\t)Enter\\w{2}(|\\t)”)• useDelimiter(“\\sEnter\\w{2}\\s”)
– \\sstandsforallkindsofwhitespace.• useDelimiter(\\s*Enter\\w{2}:\\s*)
– *meansthatzeroormorespacescanoccur.
Regular expression in Java
29
Codefragmenttoreaddata
//readfromfilefortes5ngpurposeScannersc=newScanner(newFile("online.txt")). useDelimiter("\\s*Enter\\w{2}:\\s*");
intx1=sc.nextInt();inty1=sc.nextInt();intx2=sc.nextInt();inty2=sc.nextInt();intx3=sc.nextInt();inty3=sc.nextInt();
Regular expression in Java
30
RegexpackageinJava
• Javahasregularpackagejava.u5l.regex,• Asimpleexample:
– Pickoutthevaliddatesinastring– E.g.inthestring“finalexam2008-04-22,or2008-4-22,butnot
2008-22-04”– Validdates:2008-04-22,2008-4-22
• Firstweneedtowritetheregularexpressions.\d{4}-(0?[1-9]|1[012])-\d{2}
Regular expression in Java
31
Regexpackage
• First,youmustcompilethepa[ernimport java.util.regex.*;Pattern p = Pattern.compile(“\\d{4}-(0?[1-9]|1[012])-\
\d{2}");– Notethatinjavayouneedtowrite\\dinsteadof\d
• Next,youmustcreateamatcherforaspecificpieceoftextbysendingamessagetoyourpa[ern– Matcherm=p.matcher(“…yourtextgoeshere….");
• Pointstono5ce:– Pa[ernandMatcherarebothinjava.u5l.regex– NeitherPa[ernnorMatcherhasapublicconstructor;youcreate
thesebyusingmethodsinthePa[ernclass– Thematchercontainsinforma5onaboutboththepa[erntouseand
thetexttowhichitwillbeapplied
Regular expression in Java
32
Regexinjava
• Nowthatwehaveamatcherm,– m.matches()returnstrueifthepa[ernmatchestheen5retext
string,andfalseotherwise– m.lookingAt()returnstrueifthepa[ernmatchesatthe
beginningofthetextstring,andfalseotherwise– m.find()returnstrueifthepa[ernmatchesanypartofthetext
string,andfalseotherwise• Ifcalledagain,m.find()willstartsearchingfromwherethelastmatchwasfound
• m.find()willreturntrueforasmanymatchesasthereareinthestring;aderthat,itwillreturnfalse
• Whenm.find()returnsfalse,matchermwillberesettothebeginningofthetextstring(andmaybeusedagain)
Regular expression in java
33
Regexexampleimportjava.u5l.regex.*;publicclassRegexTest{publicstaCcvoidmain(Stringargs[]){
Stringpa[ern="\\d{4}-(0?[1-9]|1[012])-\\d{2}";Stringtext="finalexam2008-04-22,or2008-4-22,butnot 2008-22-04";
Pa[ernp=Pa[ern.compile(pa[ern);Matcherm=p.matcher(text);while(m.find()){ System.out.println("validdate:"+text.substring(m.start(),m.end()));}}
}
Printout:– validdate:2008-04-22– validdate:2008-4-22
Regular expression in java
34
Moreshorthandnota5oninspecifictools,likeregexpackageinJava
• Differentsodwaretoolshaveslightlydifferentnota5ons(e.g.regex,grep,JLEX);
• Shorthandnota5onsfromregexpackage– . anyonecharacterexceptalineterminator– \d adigit:[0-9]– \D anon-digit:[^0-9]– \s awhitespacecharacter:[\t\n\r]– \S anon-whitespacecharacter:[^\s]– \w awordcharacter:[a-zA-Z_0-9]– \W anon-wordcharacter:[^\w]
• GetfamiliarwithregularexpressionusingtheregexTesterApplet.
• NotethatStringclasssinceJava1.4providessimilarmethodsforregularexpression
Regular expression in java
35
TryRegexTester
• Runningatcoursewebsiteasanapplet;– h[p://cs.uwindsor.ca/~jlu/214/regex_tester.htm
• Writeregularexpressionsandtrythematch(),find()methods;
• Whatif3q14insteadof3.14isthestringtobematched?Why?• groupsarenumberedbycoun5ngtheiropeningparenthesesfromledtoright.
– ((A)(B(C)))hasfourgroups:– 1((A)(B(C)))– 2(A)– 3(B(C))– 4(C)
Regular expression
36
Prac5ceregularexpressionusinggrep
Usegreptosearchforcertainpa[erninhtmlfiles;– SearchforCanadianpostalcodeinatextfile;– SearchforOntariocarplatenumberinatextfile.
– usetcsh.Type• %tcsh
– Preparetextfile,say“test”,thatconsistsofsamplepostalcodeetc.– Type
• grep‘[a-z][0-9][a-z][0-9][a-z][0-9]’test
• grep–i‘[a-z][0-9][a-z][0-9][a-z][0-9]’test
Regular expression in unix
37
Prac5cethefollowinggrepcommands:grep'cat'grepTest-youwillfindboth"cat"and"vaca5on”grep'\<cat\>'grepTest
--wordboundarygrep-i'\<cat\>'grepTest
– ignorethecasegrep'\<ega\.a[\.com\>'grepTest
-metacharactergrep'"[^"]*"'grepTest
--findquotedstringegrep'[a-z][0-9][a-z]?[0-9][a-z][0-9]'grepTest
– findpostalcode,onlyifitisinlowercaseegrep-i'[a-z][0-9][a-z]?[0-9][a-z][0-9]'grepTest
– ignorethecasegrep'q[^u]'grepTest
--won'tfind"iraq",butwillfind"iraqi"grep'^cat'grepTest
findonlylinesstartwithcat• egrepissimilartogrep,butsomepa[ernsonlyworkinegrep.
Regular expression in unix
megawatt computing ega.att.com Line with uppercase Cat and Dog. 214 cat vacation Cat Dog vacation only capital Cat the cat likes the dog "quoted string " iraq iraqi Qantas ADBB 123 ADBB12 N9B 3P5 N9B3P5 01-16-2004 14-14-2004
38
Unixmachineaccount
• Applyforaunixaccount:– [email protected]
• Accessunixmachinesathome:– YouneedtouseSSH;– Oneplacetodownload:
• www.uwindsor.ca/its-->services/downloads
39
REandFinitestateAutomaton(FA)• Regularexpressionisadeclara5vewaytodescribethetokens
– Itdescribeswhatisatoken,butnothowtorecognizethetoken.• FAisusedtodescribehowthetokenisrecognized
– FAiseasytobesimulatedbycomputerprograms;
• Thereisa1-1correspondencebetweenFAandregularexpression– Scannergenerator(suchasJLex)bridgesthegapbetweenregularexpression
andFA.
Scanner generator
FA RE Java scanner program
String stream
Tokens
40
Insidescannergenerator
• Maincomponentsofscannergenera5on– REtoNFA– NFAtoDFA– Minimiza5on– DFAsimula5on
RE
NFA
DFA
Minimized DFA
Program
Thompson construction
Subset construction
DFA simulation Scanner generator
Minimization
41
Finiteautomata
• FAalsocalledFiniteStateMachine(FSM)– Abstractmodelofacompu5ngen5ty;– Decideswhethertoacceptorrejectastring.
• TwotypesofFA:– Non-determinis5c(NFA):Hasmorethanonealterna5veac5on
forthesameinputsymbol.– Determinis5c(DFA):Hasatmostoneac5onforagiveninput
symbol.• Example:howdowewriteaprogramtorecognizejavaiden5fiers?
Start letter
letter
digit
s0 s1 S0:
if (getChar() is letter) goto S1;
S1: if (getChar() is letter or digit) goto S1;
Autom
ata
S
a, b a
B
Automaton:amechanismthatisrela5velyself-opera5ng--webster
42
Non-determinis5cFiniteAutomata(FA)
• NFA(Non-determinis5cFiniteAutomaton)isa5-tuple(S,Σ, δ, S0, F):– S:asetofstates;– Σ:thesymbolsoftheinputalphabet;– δ:atransi5onfunc5on;
• move(state,symbol)àasetofstates
– S0:s0∈S,thestartstate;– F:F⊆S,asetoffinaloraccep5ngstates.
• Non-determinis5c--astateandsymbolpaircanbemappedtoasetofstates.– Itisdeterminis5ciftheresultoftransi5onconsistsofonlyonestate.
• Finite—thenumberofstatesisfinite.
Autom
ata
Start letter
letter
digit
s0 s1
43
Transi5onDiagram
• FAcanberepresentedusingtransi5ondiagram.• CorrespondingtoFAdefini5on,atransi5ondiagramhas:
– States:Representedbycircles;– Σ: Alphabet,representedbylabelsonedges; – Moves:Representedbylabeleddirectededgesbetweenstates.Thelabel
istheinputsymbol;– StartState:arrowhead;– FinalState(s):representedbydoublecircles.
• Exampletransi5ondiagramtorecognize(a|b)*abb
q0 q3 a
q1 b
a, b
q2 b
Autom
ata
44
SimpleexamplesofFA
• Epsilon
• a
• a*
• a+
• (a|b)*
start
a
0
start
a
1 a
0
start
a
0
b
start
a, b
0
start
1
a 0
start
1
ε 0
Autom
ata
45
ProceduresofdefiningaDFA/NFA
• Defineinputalphabetandini5alstate• Drawthetransi5ondiagram• Check
– allstateshaveout-goingarcslabeledwithalltheinputsymbols(DFA).– Arethereanymissingfinalstates?– Arethereanyduplicatestates?– allstringsinthelanguagecanbeaccepted.– allstringsnotinthelanguagecannotbeaccepted.
• Nameallthestates• Define(S,Σ,δ,q0,F)
Autom
ata
46
Exampleofconstruc5ngaFA
• ConstructaDFAthatacceptsalanguageLoverΣ={0,1}suchthatListhesetofallstringswithanynumberof“0”sfollowedbyanynumberof“1”s.
• Regularexpression:0*1*• Σ={0,1}• Drawini5alstateofthetransi5ondiagram
Start
Autom
ata
47
Exampleofconstruc5ngaFA(cont.)
• Dradthetransi5ondiagram
Start 1
0 1
0
• Is“111”accepted?• Theledmoststatehasmissedanarcwithinput“1”
Start 1
0 1
0
1
Autom
ata
• Is“00”accepted?
48
Exampleofconstruc5ngaFA(cont.)
• Is“00”accepted?• Theledmosttwostatesarealsofinalstates
– Firststatefromtheled:εisalsoaccepted– Secondstatefromtheled:
stringswith“0”sonlyarealsoaccepted
Start 1 0 1
0
1
Autom
ata
49
Exampleofconstruc5ngaFA(cont.)
• Theledmosttwostatesareduplicate– theirarcspointtothesamestateswiththesamesymbols
Start 1
0 1
• Checkthattheyarecorrect– Allstringsinthelanguagecanbeaccepted
• εisaccepted• stringswith“0”s/“1”sonlyareaccepted
– Allstringsnotbelongedtothelanguagecannotbeaccepted• Nameallthestates
Start 1
0 1
q0 q1
Autom
ata
50
HowdoesFAwork
• NFAdefini5onfor(a|b)*abb– S={q0,q1,q2,q3}– Σ={a,b}– Transi5ons:move(q0,a)={q0,q1},move(q0,b)={q0},....– s0=q0– F={q3}
• Transi5ondiagramrepresenta5on– Non-determinism:
• exi5ngfromonestatetherearemul5pleedgeslabeledwithsamesymbol,or• Thereareepsilonedges.
– HowdoesFAwork?Input:ababb
move(q0,a)=q1move(q1,b)=q2move(q2,a)=?(undefined)REJECT!
move(q0, a) = q0 move(q0, b) = q0 move(q0, a) = q1 move(q1, b) = q2 move(q2, b) = q3 ACCEPT !
q0 q3 a
q1 b
a,b
q2 b
Autom
ata
51
FAfor(a|b)*abb
– WhatdoesitmeanthatastringisacceptedbyaFA?• AnFAacceptsaninputstringxiffthereisapathfromthestartstatetoafinalstate,suchthattheedgelabelsalongthispathspelloutx;
– Apathfor“aabb”:q0àaq0àaq1àbq2àbq3– Is“aab”acceptable?
q0àaq0àaq1àbq2q0àaq0àaq0àbq0
• Theanswerisno;• Finalstatemustbereached;• Ingeneral,therecouldbeseveralpaths.
– Is“aabbb”acceptable?q0àaq0àaq1àbq2àbq3
• Theanswerisno.• Labelsonthepathmustspellouttheen5restring.
q0 q3 a
q1 b
a,b
q2 b
Autom
ata
52
ExampleofNFAwithepsilonsymbol
• NFAaccep5ngaa*|bb* – Is“aaab”acceptable?– Is“aaa”acceptable?
0
1
3 b
b
4 ε
ε
a
a
2
Autom
ata
Simula5nganNFA
• Keeptrackofasetofstates,ini5allythestartstateandeverythingreachablebyε-moves.
• Foreachcharacterintheinput:– Maintainasetofnextstates,ini5allyempty.
• Foreachcurrentstate:– Followalltransi5onslabeledwiththecurrentle[er.– Addthesestatestothesetofnewstates.
• Addeverystatereachablebyanε-movetothesetofnextstates.
• Complexity:O(mn2)forstringsoflengthmandautomatawithnstates
53
q0 q3 a
q1 b
a,b
q2 b
54
Transi5ontable
• Itisoneofthewaystoimplementthetransi5onfunc5on– Thereisarowforeachstate;– Thereisacolumnforeachsymbol;– Entryin(states,symbola)isthesetofstatescanbereachedfromstateson
inputa.
• Nondeterminis5c:– Theentriesaresetsinsteadofasinglestate
States Input
a b
>q0 {q0,q1} {q0}
q1 {q2}
q2 {q3}
*q3
Autom
ata
q0 q3 a
q1 b
a,b
q2 b
55
DFA(Determinis5cFiniteAutomaton)• AspecialcaseofNFA
– Thetransi5onfunc5onmapsthepair(state,symbol)toonestate.• Whenrepresentedbytransi5ondiagram,foreachstateSandsymbola,there
isatmostoneedgelabeledaleavingS;• Whenrepresentedtransi5ontable,eachentryinthetableisasinglestate.
– Thereisnoε-transi5on• Example:DFAfor(a|b)*abb
• RecalltheNFA:
States Input
a b
Q0 Q1 q0
Q1 Q1 Q2
Q2 Q1 Q3
Q3 Q1 Q0
q0
q3
a
a
q1 q2
a b b
b
a b
q0 q3 a
q1 b
a,b
q2 b
Autom
ata
56
DFAtoprogram
• NFAismoreconcise,butnoteasytoimplement;
• InDFA,sincetransi5ontablesdon’thaveanyalterna5veop5ons,DFAsareeasilysimulatedviaanalgorithm.
RE
NFA
DFA
Minimized DFA
Program
Thompson construction
Subset construction
DFA simulation Scanner generator
Minimization
Autom
ata simulation
57
SimulateaDFA
• AlgorithmtosimulateDFAInput:Stringx,DFAD.
• Transi5onfunc5onismove(s,c);• StartstateisS0;• FinalstatesareF.
Output:“yes”ifDacceptsx;“no”otherwise;Algorithm:
currentState ← s0 currentChar ← nextchar; while currentChar ≠ eof { currentState ← move(currentState, currentChar); currentChar ← nextchar; } if currentState is in F then return “yes” else return “no”
• RuntheFAsimulator!• Writeasimulator.
Autom
ata simulation
58
NFAtoDFA
• Whereweare:wearegoingtodiscussthetransla5onfromNFAtoDFA.
• Theorem:AlanguageLisacceptedbyanNFAiffitisacceptedbyaDFA
• Subsetconstruc5onisusedtoperformthetransla5onfromNFAtoDFA.
RE
NFA
DFA
Minimized DFA
Program
Thompson construction
Subset construction
DFA simulation Scanner generator
Minimization
NFA to D
FA
59
Summarize
• Wehavecoveredmanyconcepts– RE,Regulargrammar,FA(NFA,DFA),
Transi5onDiagram,Transi5onTable.• Whatistherela5onshipbetweenthem?
– RE,Regulargrammar,NFA,DFA,Transi5onDiagramareallofthesameexpressivepower;
– REisadeclara5vedescrip5on,henceeasierforustowrite;
– DFAisclosertomachine;– Transi5onDiagramisagraphic
representa5onofFA;– Transi5onTableisoneofthemethodsto
implementthetransi5onfunc5onsinFA.• Whataboutregulargrammar?
– Wewillseeitsrelevanceinsyntaxanalysis.• Anotherpath:howtoderiveREfromDFA?
RE
NFA
DFA
Minimized DFA
Program
Thompson construction
Subset construction
DFA simulation
60
Regularexpressionandregulargrammar
• REtoregulargrammar:– DrawanautomatafortheRE– Constructregulargrammarfromautomaton– Example
S0àle[erS1S1àle[erS1S1àdigitS1S1àε
Start letter
letter
digit
s0 s1
61
Conver5ngDFAstoREs
1. Combineseriallinksbyconcatena5on2. Combineparallellinksbyalterna5on3. Removeself-loopsbyKleeneclosure4. Selectanode(otherthanini5alorfinal)forremoval.Replace
itwithasetofequivalentlinkswhosepathexpressionscorrespondtotheinandoutlinks
5. Repeatsteps1-4un5lthegraphconsistsofasinglelinkbetweentheentryandexitnodes.
62
Example
0 1 2
6
4 3 d
a
b c
d
7
5 a
b d
d
b
c
0 1 2
6
4 3 d a|b|c d
7
5 a
b d
d
b|c
0 4 3 d(a|b|c)d
5 a d
b(b|c)d
63
Example(cont.)
0 4 3 d(a|b|c)d
5 a d
b(b|c)da
0 4 3 d(a|b|c)d
5 a (b(b|c)da)*d
0 d(a|b|c)da(b(b|c)da)*d
5
64
Issuesnotcovered
• RegularexpressiontoDFAdirectly;
RE
NFA
DFA
Minimized DFA
Program
Thompson construction
Subset construction
DFA simulation
65
LexicalacceptorsandLexicalanalyzers
• DFA/NFAacceptsorrejectsastring;– Theyarecalledlexicalacceptors;
• Butthepurposeofalexicalanalyzerisnotjusttoacceptorrejectstring.Thereareseveralissues:– Mul)plematches:Oneregularexpressionmaymatchseveral
substrings.• e.g.,ID=le[er+,String=“abc”,IDcanmatchwitha, ab, abc.• Weshouldfindthelongestmatches,i.e.,longestsubstringoftheinputthatmatchestheregularexpression;
– Mul)pleREs:WhatifonestringcanmatchseveralREs?• e.g.,ID=le[er+,INT=“int”,• String“int”canbebothareservedwordINT,andaniden5fier.Howcanwedecideitisareservedwordinsteadanusualiden5fier?
– Ac)ons:Onceatokenisrecognized,wewanttoperformdifferenttasksonthem,insteadofsimplyreturnthestringrecognized.
66
Longestmatch
• WhenseveralsubstringscanmatchthesameRE,weshouldreturnthelongestone.– e.g.,ID=le[er+,String=“abc”,IDcanmatchwitha,ab,abc.
• Problem:whatifalexergoespastafinalstateofashortertoken,butthendoesn’tfindanyothermatchingtokenlater?
• Example:ConsiderR=00|10|0011andinputw=0010.
• WereachstateCwithnotransi5ononinput0.• Solu5on:Keepingtrackofthelongestmatchjustmeansrememberingthelast5metheDFAwasinafinalstate;
S 0 1
0
0
1
1A B
F
C
E
D
67
Longestmatch(cont.)
• Thisisdonebyintroducingthefollowingvariables:– LastFinal:finalstatemostrecentlyencountered;– InpputPosi5onAtLastFinal:mostrecentposi5onintheinputstringinwhich
theexecu5onoftheDFAwasinafinalstate;– Yytext:Textofthetokenbeingmatched,i.e.,substringbetween
ini5alInputPosi5onandinputPosi5onAtLastFinal.• Thiswayalongestmatchisrecognizedwhentheexecu5onoftheDFAreachesadead-end,i.e.,astatewithnotransi5ons.
• Each5meatokenisrecognized,theexecu5onoftheDFAresumesintheini5alstatetorecognizethenexttoken.
• Ingeneral,whenatokenisrecognized,currentInputPosi5onmaybefarbeyondinputPosi5onAtLastFinal.
S 0 1
0
0
1
1A B
F
C
E
D
68
Handlingmul5pleREs
• CombinetheNFAsofalltheREsintoasinglefiniteautomaton.• WhatiftwoREsmatchesthesamestring?
– E.g.,forastring“abb”,bothREs“a*bb”and“ab*”matchesthestring.WhichREisintended?
– Itisimportantbecausedifferentac5onsmaytakedependingontheREbeingmatched;
– Solu5on:OrderREs:theREprecedeswillmatchfirst.
• Howaboutreservedwords?– Forstring“int”,shouldwereturntokenINTortokenID?– Twosolu5ons:
• Constructareservedwordtableandlookupthetableevery5meaniden5fierisencountered;
• Put“int”asanRE,andputthatREbeforetheiden5fierRE.Sowheneverthestring“int”ismet,RE“int”willbematchedfirstandthetokenINTwillbereturned(insteadofthetokenID).
69
Ac5ons
• Ac5onscanbeaddedforfinalstates;• Ac5onscanbedescribedinausualprogramminglanguage.InJLex,ac5onisdescribedinJava.
70
Buildascannerforasimplelanguage
• Thelanguageofassignmentstatements: LHS=RHS intLHS=RHS
…– led-handsideofassignmentisaniden5fier,withop5onaltype
declara5on;– Iden5fierisale[erfollowedbyoneormorele[ersordigits– right-handsideisoneofthefollowing:
• ID+ID• ID*ID• ID==ID
• Examplestatementintx3=x1+x2
71
Step1:Definetokens
Ourlanguagehassixtypeoftokens.– theycanbedefinedbysixregularexpressions:
token Regularexpression
ASSIGN "="
ID le[er(le[er|digit)*
INT “int”
PLUS "+"
TIMES "*"
EQUALS "=="
72
Step2:ConvertREstoNFAs
“=”
letter
“+”
“*”
“=”
ASSIGN:
ID:
PLUS:
TIMES:
EQUALS:
Letter, digit
“=”
Step 3: Combine the NFAs, Convert NFAs to DFAs, minimize the DFAs
INT: “n” “i” “t”
ε
73
Step4:ExtendtheDFA• ModifytheDFAsothatafinalstatecanhave
– anassociatedac5on,suchas:"putbackonecharacter"or"returntokenXXX“.
• Forexample,theDFAthatrecognizesiden5fierscanbemodifiedasfollows– recallthatscanneriscalledbyaparser(onetokenisreturnedpereach
call)– henceac5onreturnputsthescannerintostateS
letter
any char except letter or digit
letter | digit action:
• put back 1 char • return ID
S
74
Step5:CombinedFAforourlanguage• combine the DFAs for all of the tokens in to a single FA.
letter any char except letter or digit
letter | digit put back 1 char; return ID
S “*” ID
return EQUALS “=”
any char except “=”
put back 1 char; return ASSIGN
return TIMES
“+”
return PLUS
“=”
F1
F4
F3
F2
TMP F5
“n” “i”
“t”
SP return INT, put back one char
F6
I1 I2
I3
• It is not a DFA. Just for illustration purpose.
F7 SP
75
Exampletracefor“intx3=x1+x2”
Input Lastfinalstate
Currentstate
InputposiConatlastFinalState
CurrentinputposiCon
IniCalinputposiCon
acCon
intx3=x1+x2 0 S 0 0 0
ntx3=x1+x2 0 I1 1 1 0
tx3=x1+x2 0 I2 2 2 0
[sp]x3=x1+x2 0 I3 3 3 0
x3=x1+x2 0 F6 4 4 0 Ac5on1
putBackOneChar();Yytext=substring(0,4-1);in5alInputPosi5on=3;currentSate=S;
[sp]x3=x1+x2 0 S 3 3 3
x3=x1+x2 0 SP 4 4 3 Ac5on2
Yytext=substring(3,3);ini5alInputPosi5on=4;currentState=S;
x3=x1+x2 0 S 4 4 4
3=x1+x2 0 ID 5 5 4
=x1+x2 0 ID 6 6 4
x1+x2 0 F2 7 7 4 Ac5on3
putBackOneChar;Yytext=substring(4,7-1);ini5alInputPosi5on=6;currentState=S;
=x1+x2 0 S 6 6 6
......
76
Scannergenerator:history
• LEX– Alexicalanalyzergenerator,wri[enbyLeskandSchmidtatBellLabs
in1975fortheUNIXopera5ngsystem;– Itnowexistsformanyopera5ngsystems;– LEXproducesascannerwhichisaCprogram;– LEXacceptsregularexpressionsandallowsac5ons(i.e.,codeto
executed)tobeassociatedwitheachregularexpression.
• JLex– Lexthatgeneratesascannerwri[eninJava;– ItselfisalsoimplementedinJava.
• Therearemanysimilartools,formostprogramminglanguages
JLex
77
Overallpicture
Tokens
Scanner generator
NFA RE Java scanner program
String stream
DFA
Minimize DFA
Simulate DFA
JLex
78
Insidelexicalanalyzergenerator
• Howdoesalexicalanalyzerwork?– Getinputfromuserwhodefines
tokensintheformthatisequivalenttoregulargrammar
– TurntheregulargrammarintoaNFA
– ConverttheNFAintoDFA– Generatethecodethatsimulates
theDFA
Classes in JLex:
CAccept CAcceptAnchor CAlloc CBunch CDfa CDTrans CEmit CError CInput CLexGen CMakeNfa CMinimize CNfa CNfa2Dfa CNfaPair CSet CSimplifyNfa CSpec CUtility Main SparseBitSet ucsb
JLex
79
Howscannergeneratorisused
• Writethescannerspecifica5on;• Generatethescannerprogramusingscannergenerator;• Compilethescannerprogram;• Runthescannerprogramoninputstreams,andproducesequencesoftokens.
Scanner generator (e.g., JLex)
Scanner definition (e.g., JLex spec)
Scanner program, e.g., Scanner.java
Java compiler
Scanner program (e.g., Scanner.java)
Scanner.class
Scanner Scanner.class Input stream Sequence of
tokens
JLex
80
JLexspecifica5on
• JLexspecifica5onconsistsofthreeparts,separatedby“%%”
UserJavacode,tobecopiedverba5mintothescannerprogram,placedbeforethelexerclass;
%%JLexdirec5ves,macrodefini5ons,commonlyusedtospecifyle[ers,digits,whitespace;%%Regularexpressionsandac5ons:
• Specifyhowtodivideinputintotokens;• Regularexpressionsarefollowedbyac5ons;
– Printerrormessages;returntokencodes;
JLex
81
FirstJLexexamplesimple.lex
• Recognizeintandiden5fiers.1. %%2. %{publicsta5cvoidmain(Stringargv[])throwsjava.io.IOExcep5on{3. MyLexeryy=newMyLexer(System.in);4. while(true){5. yy.yylex();6. }7. }8. %}9. %notunix10. %typevoid11. %classMyLexer12. %eofval{return;13. %eofval}
14. IDENTIFIER=[a-zA-Z][a-zA-Z0-9]*
15. %%
16. "int"{System.out.println("INTrecognized");}17. {IDENTIFIER}{System.out.println("IDis..."+yytext());}18. \r|\n{}19. .{}
JLex
82
Codegeneratedwillbeinsimple.lex.java
classMyLexer{publicsta5cvoidmain(Stringargv[])throwsjava.io.IOExcep5on{
MyLexeryy=newMyLexer(System.in);while(true){yy.yylex();
}}publicvoidyylex(){......case5:{System.out.println("INTrecognized");}case7:{System.out.println("IDis..."+yytext());}......}}
Copied from internal code directive
JLex
83
RunningtheJLexexample• StepstoruntheJLex
D:\214>javaJLex.Mainsimple.lexProcessingfirstsec5on--usercode.Processingsecondsec5on--JLexdeclara5ons.Processingthirdsec5on--lexicalrules.Crea5ngNFAmachinerepresenta5on.NFAcomprisedof22states.Workingoncharacterclasses.::::::::.NFAhas10dis5nctcharacterclasses.Crea5ngDFAtransi5ontable.WorkingonDFAstates...........MinimizingDFAtransi5ontable.9statesaderremovalofredundantstates.Outpu�nglexicalanalyzercode.D:\214>movesimple.lex.javaMyLexer.javaD:\214>javacMyLexer.javaD:\214>javaMyLexer//itiswai5ngforkeyboardinputintmyid0INTrecognizedIDis...myid0
JLex
84
Exercises
• TrytomodifyJLexdirec5vesinthepreviousJLexspec,andobservewhetheritiss5llworking.Ifitisnotworking,trytounderstandthereason.– Remove“%notunix”direc5ve;– Change“return;”to“returnnull;”;– Remove“%typevoid”;– ......
• MovetheIden5fierregularexpressionbeforethe“int”RE.Whatwillhappentotheinput“int”?
• Whatifyouremovethelastline(line19,“.{}”)?
JLex
85
Changesimple.lex:readinputfromfile
1. importjava.io.*;2. %%3. %{publicsta5cvoidmain(Stringargv[])throwsjava.io.IOExcep5on{4. MyLexeryy=newMyLexer(newFileReader(“input”));5. while(yy.yylex()>=0);6. }7. %}8. %integer9. %classMyLexer
10. %%
11. "int"{System.out.println("INTrecognized");}12. [a-zA-Z_][a-zA-Z0-9_]*{System.out.println("IDis..."+yytext());}13. \r|\n|.{}
• %integer:tomakethereturningtypeofyylex()asint.
JLex
86
Extendtheexample:addreturninganduseclasses– Whenatokenisrecognized,inmostofthecasewewanttoreturnatokenobject,sothat
otherprogramscanuseit.classUseLexer{publicsta5cvoidmain(String[]args)throwsjava.io.IOExcep5on{Tokent;MyLexer2lexer=newMyLexer2(System.in);while((t=lexer.yylex())!=null)System.out.println(t.toString());}}classToken{Stringtype;Stringtext;intline;Token(Stringt,Stringtxt,intl){type=t;text=txt;line=l;}publicStringtoString(){returntext+""+type+""+line;}}%%%notunix%line%typeToken%classMyLexer2%eofval{returnnull;%eofval}IDENTIFIER=[a-zA-Z_][a-zA-Z0-9_]*%%"int"{return(newToken("INT",yytext(),yyline));}{IDENTIFIER}{return(newToken("ID",yytext(),yyline));}\r|\n{}.{}
JLex
87
Codegeneratedfrommylexer2.lex
classUseLexer{publicsta5cvoidmain(String[]args)throwsjava.io.IOExcep5on{Tokent;MyLexer2lexer=newMyLexer2(System.in);while((t=lexer.yylex())!=null)System.out.println(t.toString());}}classToken{Stringtype;Stringtext;intline;Token(Stringt,Stringtxt,intl){type=t;text=txt;line=l;}publicStringtoString(){returntext+""+type+""+line;}}ClassMyLexer2{publicTokenyylex(){......case5:{return(newToken("INT",yytext(),yyline));}case7:{return(newToken("ID",yytext(),yyline));}
......}}
JLex
88
Runningtheextendedlexspecifica5onmylexer2.lex
D:\214>javaJLex.Mainmylexer2.lexProcessingfirstsec5on--usercode.Processingsecondsec5on--JLexdeclara5ons.Processingthirdsec5on--lexicalrules.Crea5ngNFAmachinerepresenta5on.NFAcomprisedof22states.Workingoncharacterclasses.::::::::.NFAhas10dis5nctcharacterclasses.Crea5ngDFAtransi5ontable.WorkingonDFAstates...........MinimizingDFAtransi5ontable.9statesaderremovalofredundantstates.Outpu�nglexicalanalyzercode.D:\214>movemylexer2.lex.javaMyLexer2.javaD:\214>javacMyLexer2.javaD:\214>javaUseLexerintintINT0x1x1ID1
JLex
89
Anotherexample
1importjava.io.IOExcep5on;2%%3%public4%classNumbers_15%typevoid6%eofval{return;8%eofval}910%line11%{publicsta5cvoidmain(Stringargs[]){12Numbers_1num=newNumbers_1(System.in);13try{14num.yylex();15}catch(IOExcep5one){System.err.println(e);}16}17%}1819%%20\r\n{System.out.println("---"+(yyline+1));}22.*\r\n{System.out.print("+++"+(yyline+1)+"\t"+yytext());}
JLex
90
Usercode(firstsec5onofJLex)
• Usercodeiscopiedverba5mintothelexicalanalyzersourcefilethatJLexoutputs,atthetopofthefile.– Packagedeclara5ons;– Importsofanexternalclass– Classdefini5ons
• Generatedcodepackage declarations; import packages; Class definitions; class Yylex { ... ... }
• Yylexclassisthedefaultlexerclassname.Itcanbechangedtootherclassnameusing%classdirec5ve.
JLex
91
JLexdirec5ves(Secondsec5on)
• Internalcodetolexicalanalyzerclass• Marcodefini5on• Statedeclara5on• Character/linecoun5ng• Lexicalanalyzercomponent5tle• Specifyingthereturnvalueonend-of-file• Specifyinganinterfacetoimplement
JLex
92
InternalCodetoLexicalAnalyzerClass
• %{ …. %} direc5vepermitsthedeclara5onofvariablesandfunc5onsinternaltothegeneratedlexicalanalyzer
• Generalform:%{<code>%}
• Effect:<code>willbecopiedintotheLexerclass,suchasMyLexer.classMyLexer{…..<code>……}
• Examplepublicsta5cvoidmain(Stringargv[])throwsjava.io.IOExcep5on{
MyLexeryy=newMyLexer(System.in);while(true){yy.yylex();}}
• Differencewiththeusercodesec5on– Itiscopiedinsidethelexerclass(e.g.,theMyLexerclass)
JLex
93
MacroDefini5on• Purpose:defineonceandusedseveral5mes;
– Amustwhenwewritelargelexspecifica5on.• Generalformofmacrodefini5on:
– <name>=<defini5on>– shouldbecontainedonasingleline– Macronameshouldbevalididen5fiers– Macrodefini5onshouldbevalidregularexpressions– Macrodefini5oncancontainothermacroexpansions,inthestandard
{<name>}formatformacroswithinregularexpressions.• Example
– Defini5on(inthesecondpartofJLexspec):IDENTIFIER=[a-zA-Z_][a-zA-Z0-9_]*ALPHA=[A-Za-z_]DIGIT=[0-9]ALPHA_NUMERIC={ALPHA}|{DIGIT}
– Use(inthethirdpart):{IDENTIFIER}{returnnewToken(ID,yytext());}
JLex
94
Statedirec5ve• Samestringcouldbematchedbydifferentregularexpressions,accordingtoitssurroundingenvironment.– String“int”insidecommentshouldnotberecognizedasareservedword,not
evenasaniden5fier.
• Par5cularlyusefulwhenyouneedtoanalyzemixedlanguages;• Forexample,inJSP,JavaprogramscanbeimbeddedinsideHTMLblocks.OnceyouareinsideJavablock,youfollowtheJavasyntax.ButwhenyouareoutoftheJavablock,youneedtofollowtheHTMLsyntax.– Injava“int”shouldberecognizedasareservedword;– InHTML“int”shouldberecognizedjustasausualstring.
• StatesinsideJLex<HTMLState>%{{yybegin(JavaState);}<HTMLState>“int”{returnstring;}<JavaState>%}{yybegin(HTMLState);}<JavaState>“int”{returnkeyword;}
JLex
95
StateDirec5ve(cont.)• MechanismtomixFAstatesandREs• Declaringasetof“startstates”(inthesecondpartofJLexspec)
%statestate0[,state1,state2,….]• Howtousethestate(inthethirdpartofJLexspec):
– REcanbeprefixedbythesetofstartstatesinwhichitisvalid;• Wecanmakeatransi5onfromonestatetoanotherwithinputRE
– yybegin(STATE)isthecommandtomaketransi5ontoSTATE;• YYINITIAL:implicitstartstateofyylex();
– Butwecanchangethestartstate;• Example(fromthesampleinJLexspec):
%state COMMENT %% <YYINITIAL>if {return new tok(sym.IF,”IF”);} <YYINITIAL>[a-z]+ {return new tok(sym.ID, yytext());} <YYINITIAL>”/*” {yybegin(COMMENT);} <COMMENT>”*/” {yybegin(YYINITIAL);} <COMMENT>. {}
JLex
96
Characterandlinecoun5ng
• Some5mesitisusefultoknowwhereexactlythetokenisinthetext.Tokenposi5onisimplementedusinglinecoun5ngandcharcoun5ng.
• Charactercoun5ngisturnedoffbydefault,ac5vatedwiththedirec5ve“%char”– Createaninstancevariableyycharinthescanner;– zero-basedcharacterindexofthefirstcharacteronthematchedregionof
text.
• Linecoun5ngisturnedoffbydefault,ac5vatedwiththedirec5ve“%line”– Createaninstancevariableyylineinthescanner;– zero-basedlineindexatthebeginningofthematchedregionoftext.
• Example“int”{return(newYytoken(4,yytext(),yyline,yychar,yychar+3));}
JLex
97
Lexicalanalyzercomponent5tles
• Changethenameofgenerated– lexicalanalyzerclass %class<name>– thetokenizingfunc5on %func5on<name>– thetokenreturntype %type<name>
• DefaultnamesclassYylex{/*lexicalanalyzerclass*/publicYytoken/*thetokenreturntype*/yylex(){…}/*thetokenizingfunc5on*/==>Yylex.yylex() returns Yytoken type
JLex
98
SpecifyinganInterfacetoimplement
• Form:%implements<InterfaceName>• AllowstheusertospecifyaninterfacewhichtheYylexoryourlexerclasswillimplement.
• Thegeneratedparserclassdeclara5onwilllooklike:class MyLexer implementsInterfaceName{…….}
JLex
99
Regularexpressionrules
• Generalform:regularExpression{ac5on}
• Example:{IDENTIFIER}{System.out.println("IDis..."+yytext());}
• Interpreta5on:Pa[entobematchedcodetobeexecutedwhenthepa[ernismatched
• CodegeneratedinMyLexer:“case2: {System.out.println("IDis..."+yytext());}“
JLex
100
RegularExpressionRules• Specifiesrulesforbreakingtheinputstreamintotokens• RegularExpression+Ac5ons(javacode)[<states>]<expression>{<ac5on>}• Whenmatchedwithmorethanonerule,
– choosetherulethatisgivenfirstintheJlexspec.• Referthe“int”andIDENTIFIERexample.
• TherulesgiveninaJLexspecifica5onshouldmatchallpossibleinput.• Anerrorwillberaisedifthegeneratedlexerreceivesinputthatdoesnotmatchanyofitsrules
– E.g.,therulesonlylistedthecaseforIden5fiers,andsaidnothingaboutnumbers,butyourinputhasnumbers.
– Thisisthemostcommonerror(morethan50%)– putthefollowingruleatthebo[omofREspec
.{java.lang.System.out.println(“Error:” + yytext());} dot(.)willmatchanyinputexceptforthenewline.
JLex
101
Availablelexicalvalueswithinac5oncode
• java.lang.Stringyytext()– matchespor5onofthecharacterinputstream;– alwaysac5ve.
• Intyychar– Zero-basedcharacterindexofthefirstcharacterinthematched
por5onoftheinputstream;– ac5vatedby%chardirec5ve.
• Intyyline– Zero-basedlinenumberofthestartofthematchedpor5onofthe
inputstream;– ac5vatedby%linedirec5ve.
JLex
102
RegularexpressioninJLex•Specialcharacters:?+|()ˆ$/;.=<>[]{}”\andblank
– Ader\thespecialcharacterslosetheirspecialmeaning.– Example:\+
• Betweendoublequotes”allspecialcharactersbut\and”losetheirspecialmeaning.– Example:”+”
• Thefollowingescapesequencesarerecognized:\b\n\t\f\r.• With[]wecandescribesetsofcharacters.
– [abc]isthesameas(a|b|c).Notethatitisnotequivalenttoabc– With[ˆ]wecandescribesetsofcharacters.– [ˆ\n\”]meansanythingbutanewlineorquotes– [ˆa–z]meansanythingbutONElower-casele[er
• Wecanuse.asashortcutfor[ˆ\n]• $: denotestheendofaline.If$endsaregularexpression,theexpressionmatchedonlyattheendofaline.
JLex
103
Concludingremarks• FocusedonLexicalAnalysisProcess,Including
– RegularExpressions– FiniteAutomaton– Conversion– Lex
• Regulargrammar=regularexpression• RegularexpressionàNFAàDFAàlexer• Thenextstepinthecompila5onprocessisParsing:
– Contextfreegrammar;– Top-downparsingandbo[omupparsing.
JLex