Lexical Analysis, I
COMP 412, Fall 2017

Copyright 2017, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at Rice University have explicit permission to make copies of these materials for their personal use. Faculty from other educational institutions may use these materials for nonprofit educational purposes, provided this copyright notice is preserved.

[Figure: source code → Front End → IR → Optimizer → IR → Back End → target code]

Chapter 2 in EaC2e

Adjusted Calendar

Lab 1, Adjusted Schedule
Code Check 1         Monday, September 11, 2017
Code Check 2         Monday, September 18, 2017
Due Date for Code    Monday, September 25, 2017
Last Day for Code    Monday, October 2, 2017

Midterm Exam         Wednesday, October 18 @ 7 PM (unchanged)

Lab 3, Adjusted Schedule
Lab 3 Available      Friday, October 20, 2017
Due Date for Code    Wednesday, November 15, 2017
Last Day for Code    Wednesday, November 22, 2017

The Front End

Scanner looks at every character
• Converts stream of chars to stream of classified words:
  – <category, lexeme>
  – Sometimes call this pair a “token”
• Efficiency & scalability matter

Parser looks at every token
• Determines if the stream of tokens forms a sentence in the source language
• Fits tokens to some syntactic model, or grammar, for the source language

[Figure: Front End. A stream of characters enters the Scanner (microsyntax), which passes a stream of tokens to the Parser (syntax), followed by Semantic Elaboration. Output is labeled "IR annotations".]

The Front End

Why separate scanning & parsing?
• Primary rationale is efficiency
• Scanner identifies & classifies words by their spelling
  – Abstracts spelling into category
• Parser constructs derivations
• Parsing is harder than scanning

Modern view (less widely held)
• Scanner-less parsers are gaining popularity, because they eliminate one more set of tools
  – Maybe we can afford the overhead
  – A little more involved (SGLR parsers)

Implementation Strategies

How do we automate the construction of scanners & parsers?

Scanner
• Specify syntax with regular expressions (REs)
• Construct finite automaton & scanner from the RE

Parser
• Specify syntax with context-free grammars (CFGs)
• Construct push-down automaton & parser from the CFG

How Does Class Relate to Regex Libraries?

• You will learn how to “compile” REs to a DFA & implement a DFA
  – Execution cost is guaranteed O(1) per input character, independent of the expression
• You will have deeper understanding of their power & their use

Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module. …

Regular expression patterns are compiled into a series of bytecodes which are then executed by a matching engine written in C. For advanced use, it may be necessary to pay careful attention to how the engine will execute a given RE, and write the RE in a certain way in order to produce bytecode that runs faster. Optimization isn’t covered in this document, because it requires that you have a good understanding of the matching engine’s internals.

The regular expression language is relatively small and restricted, so not all possible string processing tasks can be done using regular expressions. There are also tasks that can be done with regular expressions, but the expressions turn out to be very complicated. In these cases, you may be better off writing Python code to do the processing; while Python code will be slower than an elaborate regular expression, it will also probably be more understandable.

From Python 2.7.10 documentation, emphasis added

Big Picture

In Lecture 2, we saw some ambiguity in defining “positive integer”
• Is 001 a positive integer? What about 00?
• The automata are precise specifications, but the words are not

We need a better notation for specifying microsyntax than these transition diagrams.
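
[Figure: Tasteful Positive Integer (forbids 001). DFA with states s0, s2, s3 and error state se; edge labels 0, 1…9, and 0…9. Transitions to se on any character are implicit from every state.]

[Figure: Tasteless Positive Integer (allows 001). DFA with states s0, s2 and error state se; edge labels 0…9 and 0…9. Transitions to se on any character are implicit from every state.]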
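
The two recognizers can be written down as explicit DFAs, which makes the difference between them checkable. This is a minimal sketch, not the course's code: the edge assignments for the tasteful DFA (s0 to s3 on 0, s0 to s2 on 1…9) are an assumption chosen to agree with the RE 0|[1…9][0…9]*, and the dictionary encoding and the names `make_delta` and `accepts` are hypothetical.

```python
# Sketch: the two "positive integer" recognizers as explicit DFAs.
# State names follow the slide; the edge layout is an assumption.
DIGITS = "0123456789"


def make_delta(edges):
    """Expand {(state, chars): target} into {(state, char): target}."""
    return {(s, c): t for (s, chars), t in edges.items() for c in chars}


# Tasteless: one or more digits, leading zeros allowed.
TASTELESS = (
    make_delta({("s0", DIGITS): "s2", ("s2", DIGITS): "s2"}),
    {"s2"},
)

# Tasteful: the single digit 0, or a nonzero digit followed by digits.
TASTEFUL = (
    make_delta({
        ("s0", "0"): "s3",
        ("s0", "123456789"): "s2",
        ("s2", DIGITS): "s2",
    }),
    {"s2", "s3"},
)


def accepts(dfa, word):
    """Run the DFA from s0; accept iff it ends in a final state."""
    delta, final = dfa
    state = "s0"
    for ch in word:
        # Any transition not listed goes to the implicit error state se,
        # and se has no outgoing entries, so the DFA stays there.
        state = delta.get((state, ch), "se")
    return state in final
```

With these definitions, `accepts(TASTELESS, "001")` holds while `accepts(TASTEFUL, "001")` and `accepts(TASTEFUL, "00")` do not.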

Regular Expressions

We need a better notation for specifying microsyntax

Regular Expressions over an Alphabet Σ
• If x ∈ Σ, then x is an RE denoting the set {x} or the language L = {x}
• If x and y are REs then
  – xy is an RE denoting L(x)L(y) = { pq | p ∈ L(x) and q ∈ L(y) }
  – x | y is an RE denoting L(x) ∪ L(y)
  – x* is an RE denoting L(x)* = ∪_{0 ≤ k < ∞} L(x)^k (Kleene Closure)
    ➝ Set of all strings that are zero or more concatenations of x
  – x+ is an RE denoting L(x)+ = ∪_{1 ≤ k < ∞} L(x)^k (Positive Closure)
    ➝ Set of all strings that are one or more concatenations of x (or xx*)
• ε is an RE denoting the set {ε}, which contains only the empty string

“better” ⇒ both formal and constructive

Many RE-based systems support additional notation and operators. Those added features build on alternation, concatenation, and closure, plus, perhaps, logical complement or negation. Complement is easy and efficient, if we think of the underlying DFA. (We will revisit this issue.)
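
The three operators are ordinary set operations on languages, so small cases can be computed directly. The sketch below is illustrative only: the function names are hypothetical, and because the true closures are infinite, `closure` computes a finite slice bounded by `max_k`.

```python
from itertools import product


def concat(A, B):
    """L(x)L(y): every string of A followed by every string of B."""
    return {p + q for p, q in product(A, B)}


def alternate(A, B):
    """L(x) | L(y): plain set union."""
    return A | B


def power(A, k):
    """L(x)^k: k-fold concatenation; L(x)^0 is {""}, the set holding only ε."""
    result = {""}
    for _ in range(k):
        result = concat(result, A)
    return result


def closure(A, max_k, start=0):
    """Union of L(x)^k for start <= k <= max_k.

    start=0 gives a finite slice of the Kleene closure,
    start=1 a finite slice of the positive closure.
    """
    out = set()
    for k in range(start, max_k + 1):
        out |= power(A, k)
    return out
```

For example, `closure({"a"}, 3)` yields the empty string plus a, aa, aaa, and the positive-closure slice `closure(A, 3, start=1)` equals `concat(A, closure(A, 2))`, matching the identity x+ = xx*.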

Regular Expressions

How do these operators help?

Regular Expressions over an Alphabet Σ
• If x is in Σ, then x is an RE denoting the set {x} or the language L = {x}
  ➝ The spelling of any letter in the alphabet is an RE
• If x and y are REs then
  – xy is an RE denoting L(x)L(y) = { pq | p ∈ L(x) and q ∈ L(y) }
    ➝ If we concatenate letters, the result is an RE, so we can spell words
  – x | y is an RE denoting L(x) ∪ L(y)
    ➝ Any finite list of words can be written as an RE, (w0 | w1 | w2 | … | wn)
  – x* is an RE denoting L(x)* = ∪_{0 ≤ k < ∞} L(x)^k
  – x+ is an RE denoting L(x)+ = ∪_{1 ≤ k < ∞} L(x)^k
    ➝ We can use closure to write finite descriptions of infinite, but countable, sets
• ε is an RE denoting the set {ε}, which contains only the empty string
  ➝ ε is sometimes useful for writing more concise REs

The operators are concatenation, alternation, and closure.

Regular Expressions

Let the notation [a…z] be shorthand for the RE (a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z)

Examples
Tasteless positive integer      [0…9][0…9]*  or  [0…9]+
Tasteful positive integer       0|[1…9][0…9]*
Identifier (Algol-like lang)    ([a…z]|[A…Z])([a…z]|[A…Z]|[0…9])*
Decimal number                  0|[1…9][0…9]*.[0…9]*
Real number                     ((0|[1…9][0…9]*)|(0|[1…9][0…9]*.[0…9]*))E[0…9][0…9]*

Each of these REs corresponds to a DFA.
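
For experimenting, the first four examples translate almost character for character into Python's `re` syntax. This is a sketch under two assumptions: the "." in the decimal RE is read as a literal decimal point (so it is escaped), and the dictionary keys and the helper `matches` are hypothetical names. Note that Python's engine is a backtracking matcher, not the DFA this lecture builds, so it shows the languages rather than the cost model.

```python
import re

# Slide notation [0…9] becomes the character class [0-9].
PATTERNS = {
    "tasteless_int": r"[0-9][0-9]*",
    "tasteful_int": r"0|[1-9][0-9]*",
    "identifier": r"([a-z]|[A-Z])([a-z]|[A-Z]|[0-9])*",
    # Assumption: "." on the slide means a literal decimal point.
    "decimal": r"0|[1-9][0-9]*\.[0-9]*",
}


def matches(name, word):
    """True iff the whole word is in the language of the named RE."""
    return re.fullmatch(PATTERNS[name], word) is not None
```

`matches("tasteless_int", "001")` holds and `matches("tasteful_int", "001")` does not, which is exactly the ambiguity the two REs resolve.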

What Is The Point?

Why do we care about regular expressions in the context of a compiler?
• We use REs to specify the mapping of words to parts of speech
  – An identifier is ([a…z]|[A…Z])([a…z]|[A…Z]|[0…9])*
  – Keywords are specified by their spellings, e.g., if, then, else
• We use tools derived from automata theory to construct scanners directly from the REs
  – Automatic construction reduces the time & cost of scanner construction
  – Derivation from a formal notation eliminates implementation errors
  – Resulting scanners are both efficient (O(n)) and fast (low constant overhead)
• RE-derived scanners are widely used
  – Compilers, text editors
  – Input checking in many contexts
  – Software to filter or block URLs

We typically add some special characters, e.g., _ # $ @

A Digression on Time

In COMP 412, we will talk about a lot of “times”
• Design time, implementation time, compile time, run time, …
• In practice, the issue of when something happens is one that causes a great deal of confusion among students of compiler construction
  – Design time and build time happen long before the compiler runs (small # of builds)
    ➝ Costs incurred at design or implementation time do not increase compile time
  – Compile time happens every time the user invokes the compiler (billions of compiles)
    ➝ Users are, appropriately, sensitive to compile time
    ➝ Costs incurred at compile time do not increase run time
  – Run-time costs affect actual application performance (many per compile)
    ➝ One critical goal for compilation is to keep run time to a minimum, which means reducing the overhead introduced by translation

As we look at strategies for generating scanners & parsers, keep in mind that generation costs are incurred at implementation time (the “meta” issue)

Automatic Scanner Construction

Goals
• Simplify the construction of robust, efficient scanners
• Develop techniques that have widespread applicability
• Understand the underlying theory & practice

[Figure: at design & build time, specifications written as regular expressions feed a Scanner Generator (e.g., lex, flex), which passes its knowledge to the Scanner. At compile time, the Scanner converts source code into a stream of <word, category> pairs.]

1. We write REs at design time
2. Tools generate the scanner at build time
3. When the compiler runs, it uses the generated scanner to convert source code into a stream of tokens.

Automatic Scanner Construction

Scanner Generator
• May encode its knowledge in tables that drive a “skeleton scanner”
  – Skeleton scanner interprets the tables to simulate the DFA
• Every scanner uses the same skeleton
• Scanner generator builds the DFA from the RE, & converts it to a table

[Figure: specifications (as REs) feed the Scanner Generator, which produces Tables. The Skeleton Scanner, driven by the Tables, converts source code into <word, category> pairs. Knowledge encoded in tables to drive skeleton.]

See §2.5.1

Automatic Scanner Construction

Scanner Generator
• May encode its knowledge of the recognizer directly into code
  – Transitions are compiled into conditional logic
• Produces a scanner that has very low overhead per character
• Scanner generator builds the DFA from the RE, & emits code for it

[Figure: specifications (as REs) feed the Scanner Generator, which produces the Scanner. The Scanner converts source code into <word, category> pairs. Knowledge embedded in generated program text.]

See §2.5.2
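
To make "transitions compiled into conditional logic" concrete, here is a hand-written sketch of what direct-coded output might look like for the register-name RE r[0…9][0…9]*. It is an illustration, not the output of lex or flex; the function name `scan_register` is hypothetical, and each DFA state becomes a region of code rather than a table row.

```python
def scan_register(text):
    """Direct-coded recognizer for r[0-9][0-9]*; True iff text is accepted."""
    i, n = 0, len(text)

    # State s0: the only legal move is on 'r'.
    if i < n and text[i] == "r":
        i += 1
    else:
        return False  # implicit transition to the error state

    # State s1: need at least one digit.
    if i < n and "0" <= text[i] <= "9":
        i += 1
    else:
        return False

    # State s2 (final): loop on further digits.
    while i < n and "0" <= text[i] <= "9":
        i += 1

    # Accept only if the digits ran to the end of the input.
    return i == n
```

There is no table lookup and no classifier call; the per-character work is one or two comparisons, which is the low overhead the slide refers to.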

Example from Lecture 2

Recognizer for an ILOC register name (allow redundant zeros)

Rules for DFA Operation
• Start in state s0 & make transitions on each input character
• DFA accepts a word x if and only if x leaves the DFA in a final state
• If the DFA encounters a character with no specified transition, it moves to se & stays in that state
• r17 takes it through s0, s1, s2, s2 and it accepts
• r takes it through s0, s1 and it fails
• ra takes it through s0, s1, se and it fails

[Figure: Recognizer for r[0…9][0…9]*. s0 → s1 on r; s1 → s2 on 0…9; s2 loops on 0…9. Error state se loops on any character. Transitions to se are implicit from every state.]

We will use the RE for a register name as a continuing example.

Example

To be useful, the DFA must be executable

Skeleton Scanner

    char ← next character
    state ← s0
    while (char ≠ EOF) {
        state ← δ[state, char]
        char ← next character
    }
    if (state is a final state)
        then report success
        else report failure

Transition Table (δ)

    δ     r     0,1,2,3,4,5,6,7,8,9    Any Other
    s0    s1    se                     se
    s1    se    s2                     se
    s2    se    s2                     se
    se    se    se                     se

For each character, the skeleton scanner does a table lookup and reads the next character, both of which should be O(1) operations. The cost is O(1) per character.

This skeleton scanner is simplified. See Figure 2.14 in §2.5.1 of EaC2e.

Character classifier maps any character into one of the 3 classes: {r}, {0…9}, {all others}
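
The skeleton and its table carry over to runnable code nearly line for line. The sketch below is one possible rendering, with assumptions: the class names "r", "digit", "other", the nested-dictionary table layout, and the helpers `skeleton` and `trace` are all hypothetical, and end of string stands in for EOF.

```python
def classify(ch):
    """Character classifier: map a character to one of 3 classes."""
    if ch == "r":
        return "r"
    if "0" <= ch <= "9":
        return "digit"
    return "other"


# Transition table δ, one row per state, one column per character class.
DELTA = {
    "s0": {"r": "s1", "digit": "se", "other": "se"},
    "s1": {"r": "se", "digit": "s2", "other": "se"},
    "s2": {"r": "se", "digit": "s2", "other": "se"},
    "se": {"r": "se", "digit": "se", "other": "se"},
}
FINAL = {"s2"}


def trace(word, delta=DELTA, classifier=classify):
    """Return the list of states visited, starting with s0."""
    states = ["s0"]
    for ch in word:
        # One classifier call and one table lookup per character.
        states.append(delta[states[-1]][classifier(ch)])
    return states


def skeleton(word, delta=DELTA, final=FINAL, classifier=classify):
    """Table-driven recognizer: accept iff the last state is final."""
    return trace(word, delta, classifier)[-1] in final
```

`trace("r17")` visits s0, s1, s2, s2 and `trace("ra")` visits s0, s1, se, as in the Lecture 2 walkthrough. Passing a different `delta`, `final`, and `classifier` changes the language without touching the loop.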

Example

To capture and classify the lexeme, we add a little work to each state

Skeleton Scanner

    char ← next character
    state ← s0
    lexeme ← null string
    while (char ≠ EOF) {
        lexeme ← lexeme || char
        state ← δ[state, char]
        char ← next character
    }
    if (state is a final state)
        then {
            category ← f(state)
            return <lexeme, category>
        }
        else report failure

Transition Table (δ): the same table as for the first skeleton scanner.

Still O(1)

Example

To capture the register number, we would need state-specific actions

Skeleton Scanner

    char ← next character
    state ← s0
    while (char ≠ EOF) {
        state ← δ[state, char]
        if (state = s1) then n ← 0                              (initialize n)
        else if (state = s2) then n ← n * 10 + char – ‘0’       (accumulate n)
        char ← next character
    }
    if (state is a final state)
        then {
            category ← f(state)
            return <lexeme, category>
        }
        else report failure

Transition Table (δ): the same table as for the first skeleton scanner.

[Figure: s0 → s1 on r; s1 → s2 on 0…9; s2 loops on 0…9. Entering s1 is annotated "Initialize n"; entering s2 is annotated "Accumulate n".]

Still O(1)
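
A runnable sketch of the skeleton with both additions, lexeme capture and state-specific actions, is shown here. The names `scan`, `CATEGORY`, and the class labels are hypothetical, and returning the number alongside the pair is an illustrative choice. The action uses the character that caused the transition, so it runs before the loop moves on to the next character.

```python
DIGITS = "0123456789"


def classify(ch):
    """Map a character to one of 3 classes: r, digit, other."""
    return "r" if ch == "r" else "digit" if ch in DIGITS else "other"


# Sparse table: any missing (state, class) entry means "go to se".
DELTA = {
    ("s0", "r"): "s1",
    ("s1", "digit"): "s2",
    ("s2", "digit"): "s2",
}

# f(state): category of the word recognized in each final state.
CATEGORY = {"s2": "register"}


def scan(word):
    """Return (lexeme, category, n) on success, None on failure."""
    state, lexeme, n = "s0", "", None
    for ch in word:
        lexeme += ch
        state = DELTA.get((state, classify(ch)), "se")
        if state == "s1":
            n = 0                                # initialize n
        elif state == "s2":
            n = n * 10 + (ord(ch) - ord("0"))    # accumulate n
    if state in CATEGORY:
        return lexeme, CATEGORY[state], n
    return None
```

For the input r17 this returns the lexeme r17, the category register, and the number 17. The extra work per character is a constant, so the cost stays O(1) per character.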

More Complex REs

What about a more complex language?
• r[0…9][0…9]* allows arbitrary register numbers (e.g., r000 or r999)
• What if we want to limit the register name to r0 through r31?

Write a tighter specification into the RE
• r((0|1|2)([0…9]|ε)|(4|5|6|7|8|9)|(3|30|31))
• r0|r1|r2|r3|…|r31|r00|r01|r02|…|r09   (non-standard use of …, but the meaning is clear)

Each of these REs can be converted to a DFA
• The DFA has the same O(1) cost per transition
• The DFA takes one transition per input character
• The DFA uses the same skeleton scanner
The added complexity is in the RE, not in the scanner †

With a scanner generated from an RE, using a more complex RE incurs no additional compile time.

† recall the Python documentation
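
A quick way to check that the two tighter specifications describe the same language is to enumerate candidates and compare. The sketch below does this with Python's `re` module; the ε alternative is written as an empty branch, and the names `REG`, `is_register`, `expected`, and `accepted` are hypothetical. It checks the language only; Python's backtracking engine does not give the DFA cost guarantees.

```python
import re

# First form from the slide; ([0…9]|ε) becomes ([0-9]|).
REG = re.compile(r"r((0|1|2)([0-9]|)|(4|5|6|7|8|9)|(3|30|31))")


def is_register(word):
    """True iff the whole word matches the tighter register RE."""
    return REG.fullmatch(word) is not None


# Second form from the slide, written out: r0 … r31 plus r00 … r09.
expected = {f"r{i}" for i in range(32)} | {f"r0{i}" for i in range(10)}

# Candidates: every 1-, 2-, and 3-digit register name.
candidates = (
    {f"r{i}" for i in range(10)}
    | {f"r{i:02d}" for i in range(100)}
    | {f"r{i:03d}" for i in range(1000)}
)
accepted = {w for w in candidates if is_register(w)}
```

Over these candidates, `accepted` equals `expected`: the first RE admits r0 through r31 and the redundant-zero forms r00 through r09, and nothing else.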
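
More Complex REs

The DFA for r((0|1|2)([0…9]|ε)|(4|5|6|7|8|9)|(3|30|31))
• Accepts a more constrained set of register names
• Same cost per input character
• More states ⇒ more rows in the transition table ⇒ more memory

[Figure: DFA with states s0 through s6 and error state se. s0 → s1 on r; s1 → s2 on 0,1,2; s1 → s5 on 3; s1 → s4 on 4…9; s2 → s3 on 0…9; s5 → s6 on 0,1. Transitions to se on any character are implicit from every state.]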

Automata Theory Moment
Earlier, we said we would revisit logical complement of an RE or a DFA. To complement a DFA:
• Make non-final states into final states (including the error state se)
• Make final states into non-final states
DFA then accepts any string that the original did not accept ⇒ its complement
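
Complementing the register-name DFA takes one set difference, provided the error state is counted among the states. This is a sketch with hypothetical names (`STATES`, `COMPLEMENT_FINAL`, `run`) that reuses the DFA for r[0…9][0…9]*; the transition function is untouched and only the set of final states changes.

```python
DIGITS = "0123456789"
STATES = {"s0", "s1", "s2", "se"}   # se must be listed explicitly
FINAL = {"s2"}

# Sparse table: missing entries go to the error state se.
DELTA = {
    ("s0", "r"): "s1",
    ("s1", "digit"): "s2",
    ("s2", "digit"): "s2",
}


def classify(ch):
    """Map a character to one of 3 classes: r, digit, other."""
    return "r" if ch == "r" else "digit" if ch in DIGITS else "other"


def run(word, final):
    """Run the DFA and test the last state against the given final set."""
    state = "s0"
    for ch in word:
        state = DELTA.get((state, classify(ch)), "se")
    return state in final


# Swap final and non-final states; se becomes an accepting state.
COMPLEMENT_FINAL = STATES - FINAL
```

Every word is accepted by exactly one of the two machines: `run("r17", FINAL)` holds, while `run("ra", COMPLEMENT_FINAL)` and `run("", COMPLEMENT_FINAL)` hold for the complement.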

More Complex REs

The DFA for r((0|1|2)([0…9]|ε)|(4|5|6|7|8|9)|(3|30|31))

    δ        r     0,1    2     3     4…9    Any Others
    s0       s1    se     se    se    se     se
    s1       se    s2     s2    s5    s4     se
    s2       se    s3     s3    s3    s3     se
    s3,s4    se    se     se    se    se     se     (compressed 2 states, as well)
    s5       se    s6     se    se    se     se
    s6       se    se     se    se    se     se
    se       se    se     se    se    se     se

This table runs without change in the same skeleton scanner as the first table
• To change the language, just change the table
• Still O(1) cost per character

Notice that the character classifier has many more divisions than did the earlier one. Still, it should be implementable as a function with O(1) cost. (see §2.5)
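
The table above can be dropped into a table-driven loop to confirm that it recognizes r0 through r31. The sketch below makes some assumptions: the set of final states (s2 through s6) is inferred from the RE because the transcript does not list it, rows s3 and s4 are stored separately rather than compressed, and all names (`COLUMNS`, `ROWS`, `accepts`) are hypothetical.

```python
COLUMNS = ["r", "0,1", "2", "3", "4-9", "other"]
ERR = ["se"] * 6

# One row per state, one entry per character class, copied from the table.
ROWS = {
    "s0": ["s1", "se", "se", "se", "se", "se"],
    "s1": ["se", "s2", "s2", "s5", "s4", "se"],
    "s2": ["se", "s3", "s3", "s3", "s3", "se"],
    "s3": ERR,
    "s4": ERR,
    "s5": ["se", "s6", "se", "se", "se", "se"],
    "s6": ERR,
    "se": ERR,
}

# Assumption: final states inferred from the RE, not given in the table.
FINAL = {"s2", "s3", "s4", "s5", "s6"}


def classify(ch):
    """Six-way classifier; returns a column index in O(1)."""
    if ch == "r":
        return 0
    if ch in ("0", "1"):
        return 1
    if ch == "2":
        return 2
    if ch == "3":
        return 3
    if ch in ("4", "5", "6", "7", "8", "9"):
        return 4
    return 5


def accepts(word):
    """Same loop as the skeleton scanner; only the table differs."""
    state = "s0"
    for ch in word:
        state = ROWS[state][classify(ch)]
    return state in FINAL
```

The loop body is the same single lookup as before, so the cost per character is unchanged; the larger language shows up only as more rows and more columns.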