lexical analyzer - csci4160: compiler design and software
CSCI 4160
Overview
TokenSpecificationRE
RE Extension
FiniteAutomataNFA
DFA
RE to NFA
NFA to DFA
DFA to Code
Lexical Analyzer
CSCI 4160: Compiler Design and Software Development
Dr. Zhijiang Dong
Dept. of Computer Science, Middle Tennessee State University
Credits
Some material found in these slides originated from the following textbooks and other authors, with modifications done by Dr. Dong:
Compilers: Principles, Techniques, and Tools by Alfred V. Aho et al.
Engineering a Compiler by Keith Cooper and Linda Torczon
Compiler Construction: Principles and Practice by Kenneth C. Louden
Dr. Al Cripps
Outline
1 Overview of Lexical Analyzer
2 Token Specification
  Regular Expressions
  Extension to Regular Expression
3 Finite Automata
  Non-Deterministic Finite Automaton (NFA)
  Deterministic Finite Automaton (DFA)
4 Construction of an NFA from a Regular Expression
5 NFA to DFA
6 Implementation of DFA
Lexical Analyzer
As the first phase of a compiler, the main task of the lexical analyzer is to read the input characters of the source program, group them into lexemes, and produce as output a sequence of tokens, one for each lexeme in the source program.
Besides identifying lexemes, the lexical analyzer may perform certain other tasks, such as:
stripping out comments and whitespace (blank, newline, tab)
correlating error messages generated by the compiler with the source program, by keeping track of the number of newline characters seen
interacting with the symbol table
Interaction Between Lexical Analyzer and Parser
Commonly, the interaction is implemented by having the parser call the lexical analyzer.
The call, suggested by the getNextToken command, causes the lexical analyzer to read characters from its input until it can identify the next lexeme and produce the next token, which it returns to the parser.
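The call pattern can be sketched as follows. This is only an illustration: the class name Lexer, the method get_next_token, and the toy token names (NUM, ID, SYMBOL, EOF) are assumptions made for this sketch and do not come from the slides.

```python
# Hypothetical sketch: a parser that pulls tokens on demand from a lexer.
class Lexer:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def get_next_token(self):
        """Return the next (token_name, lexeme) pair, or ("EOF", "") at end of input."""
        # Skip whitespace, one of the lexer's side tasks.
        while self.pos < len(self.text) and self.text[self.pos].isspace():
            self.pos += 1
        if self.pos == len(self.text):
            return ("EOF", "")
        start = self.pos
        ch = self.text[self.pos]
        if ch.isdigit():
            while self.pos < len(self.text) and self.text[self.pos].isdigit():
                self.pos += 1
            return ("NUM", self.text[start:self.pos])
        if ch.isalpha():
            while self.pos < len(self.text) and self.text[self.pos].isalnum():
                self.pos += 1
            return ("ID", self.text[start:self.pos])
        # Anything else is returned as a one-character symbol token.
        self.pos += 1
        return ("SYMBOL", ch)


def parse(lexer):
    """Stand-in for a parser: it only collects tokens until EOF."""
    tokens = []
    while True:
        token = lexer.get_next_token()
        if token[0] == "EOF":
            return tokens
        tokens.append(token)
```

The parser drives the loop and the lexer does no work until it is asked, which is the division of labor described above.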
Why Separate Lexical Analysis from Syntax Analysis?
Reasons to separate the analysis portion of a compiler into lexical analysis and syntax analysis (parsing) phases:
Simplicity of design is the most important consideration.
Compiler efficiency is improved.
  Specialized techniques that serve only the lexical task can be applied in the lexical analyzer.
Compiler portability is enhanced.
  Input-device-specific peculiarities can be restricted to the lexical analyzer.
Tokens, Patterns, and Lexemes
Token
A token is a pair consisting of a token name and an optional attribute value. The token name is an abstract symbol representing a kind of lexical unit.
Pattern
A pattern is a description of the form that the lexemes of a token may take.
Lexeme
A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.
Example of Tokens, Patterns, and Lexemes
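As an illustration of the three terms, the sketch below pairs hypothetical token names with patterns (written as Python regular expressions) and reports the lexemes found in a sample input. The token names, the patterns, and the function name lexemes are assumptions chosen for this example.

```python
import re

# Hypothetical token names and patterns, for illustration only.
PATTERNS = [
    ("IF", r"if"),
    ("ID", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("NUMBER", r"[0-9]+"),
    ("COMPARISON", r"<=|>=|==|!=|<|>"),
]


def lexemes(text):
    """Return (token_name, lexeme) pairs; whitespace is skipped."""
    result = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best = None
        for name, pattern in PATTERNS:
            m = re.compile(pattern).match(text, pos)
            # Prefer the longest lexeme; earlier patterns win ties.
            if m and (best is None or len(m.group()) > len(best[1])):
                best = (name, m.group())
        if best is None:
            raise ValueError(f"no pattern matches at position {pos}")
        result.append(best)
        pos += len(best[1])
    return result
```

For the input `if count <= 10`, the lexeme `if` is an instance of the token IF, `count` of ID, `<=` of COMPARISON, and `10` of NUMBER.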
Token Classes
The following classes cover most or all of the tokens in many programming languages:
One token for each keyword
Tokens for the operators, either individually or in classes such as the token comparison
One token representing all identifiers
One or more tokens representing constants, such as numbers and literal strings
Tokens for each punctuation symbol, such as left and right parentheses, comma, and semicolon
Pattern Matching
The major task of the lexical analyzer is to recognize, or match, a token (which represents a certain pattern of characters) at the beginning of the remaining input characters.
Therefore, methods of pattern specification and recognition are applied to the scanning process, especially:
regular expressions (RE)
finite automata (NFA, DFA)
Process of Constructing a Lexical Analyzer
Step 1: Write down the RE for the input language
Step 2: Build a big NFA
Step 3: Build the DFA that simulates the NFA
Step 4: Systematically shrink the DFA (Optional)
Step 5: Turn it into code
Regular Expression
Regular expressions represent patterns of strings of characters. A regular expression r is completely defined by the set of strings that it matches.
The set of strings represented by a regular expression r is called the language generated by r and is written as L(r).
This language depends on the character set that is available.
The elements of the character set are called symbols.
This set of legal symbols is called the alphabet and is usually written as the Greek symbol Σ.
Definition of Regular Expressions
Describe the meaning of regular expressions by stating which languages are generated by each pattern
Basic Regular Expressions
Regular Expression Operations
  Choice Among Alternatives
  Concatenation
  Repetition
Precedence of Operations and Use of Parentheses
Extensions to Regular Expressions
Regular Definitions
Basic Regular Expressions
These are just the single characters from the alphabet, which match themselves. Given any character a from the alphabet Σ, we indicate that the regular expression a matches the character a by writing L(a) = {a}.
The symbol ε is introduced to denote the empty string, and we define the metasymbol ε (boldface ε) by setting L(ε) = {ε}.
The set {} contains no strings at all, while the set {ε} contains the single string consisting of no characters.
Regular Expression Operations
Three basic operations are defined in regular expressions:
choice among alternatives, indicated by the metacharacter | (vertical bar);
concatenation, indicated by juxtaposition (without a metacharacter); and
repetition or "closure", indicated by the metacharacter *.
Regular Expression Operations: Choice Among Alternatives
Choice Among Alternatives
If r and s are regular expressions, then r|s is a regular expression which matches any string that is matched either by r or by s. In terms of languages, the language of r|s is the union of the languages of r and s, or L(r|s) = L(r) ∪ L(s).
L(a | b) = L(a) ∪ L(b) = {a, b}
L(a | ε) = {a, ε}
L(a | b | c | d) = {a, b, c, d}
Regular Expression Operations: Concatenation
Concatenation
The concatenation of two regular expressions r and s is written as rs, and it matches any string that is the concatenation of two strings, the first of which matches r and the second of which matches s.
Given two sets of strings S1 and S2, the concatenated set S1S2 is the set of all strings formed by appending a string of S2 to a string of S1.
If S1 = {aa, b} and S2 = {a, bb}, then S1S2 = {aaa, aabb, ba, bbb}
Therefore, the concatenation operation for regular expressions can be defined as: L(rs) = L(r)L(s)
L(ab) = {ab}
L((a|b)c) = L(a|b)L(c) = {a, b}{c} = {ac, bc}
Concatenation can also be extended to more than two regular expressions: L(r1r2...rn) = L(r1)L(r2)...L(rn)
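The set-level definition of concatenation is short enough to check by machine. A minimal sketch, assuming finite languages represented as Python sets of strings (the function name concat is chosen for this example):

```python
def concat(s1, s2):
    """Concatenation of two finite languages: every x in s1 followed by every y in s2."""
    return {x + y for x in s1 for y in s2}


# The example from the slide: S1 = {aa, b}, S2 = {a, bb}.
S1 = {"aa", "b"}
S2 = {"a", "bb"}
S1S2 = concat(S1, S2)

# L((a|b)c) = L(a|b)L(c), where choice is plain set union.
L_a_or_b = {"a"} | {"b"}
L_abc = concat(L_a_or_b, {"c"})
```

Here S1S2 evaluates to the four strings aaa, aabb, ba, and bbb, and L_abc to ac and bc, matching the equations above.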
Regular Expression Operations: Repetition
Repetition
The repetition operation of a regular expression is written r∗, where r is a regular expression. The regular expression r∗ matches any finite concatenation of strings, each of which matches r.
Given a set S of strings, let S∗ = {ε} ∪ S ∪ SS ∪ SSS ∪ .... Then the repetition operation for regular expressions can be defined as: L(r∗) = L(r)∗
L(a∗) = {ε, a, aa, aaa, aaaa, ...}
L((a|bb)∗) = L(a|bb)∗ = {a, bb}∗ = {ε, a, bb, aa, abb, bba, bbbb, aaa, aabb, abba, abbbb, bbaa, ...}
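Because S∗ is infinite whenever S contains a non-empty string, a program can only enumerate part of it. The sketch below, whose function name star and length bound max_len are assumptions of this example, lists the members of S∗ up to a given length by repeatedly appending strings of S:

```python
def star(s, max_len):
    """Return the strings of S* whose length is at most max_len."""
    result = {""}      # the empty string ε is always in S*
    frontier = {""}    # strings added in the previous round
    while frontier:
        # Extend each new string by one more string from S, keeping only unseen ones.
        frontier = {x + y for x in frontier for y in s
                    if len(x + y) <= max_len} - result
        result |= frontier
    return result
```

For example, `star({"a", "bb"}, 3)` yields the empty string together with a, aa, aaa, bb, abb, and bba, which are exactly the members of {a, bb}∗ of length at most 3.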
Precedence of Operations and Use of Parentheses
Precedence of Operations
Among the choice, concatenation, and repetition operations, * is given the highest precedence, concatenation is given the next highest, and | is given the lowest.
Thus, a|bc∗ is interpreted as a|(b(c∗)), and ab|c∗d is interpreted as (ab)|((c∗)d).
Parentheses are introduced to indicate a different precedence.
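Python's re module uses the same relative precedence for these three operators, so the effect of parentheses can be tried out directly. This is a small experiment rather than part of the slides; the helper name matches is an assumption.

```python
import re


def matches(pattern, text):
    """True if the whole of text is matched by pattern."""
    return re.fullmatch(pattern, text) is not None


print(matches("a|bc*", "a"))        # a alone is one alternative
print(matches("a|bc*", "bcc"))      # b followed by any number of c's
print(matches("a|bc*", "ac"))       # would require (a|b)c*
print(matches("(a|b)c*", "ac"))     # parentheses make the choice bind first
print(matches("a|bc*", "bcbc"))     # would require (a|bc)*
print(matches("(a|bc)*", "bcbc"))   # parentheses make * apply to the whole choice
```

The third and fifth calls report no match, because without parentheses the star applies only to c and the choice is between a and bc∗.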
Formal Definition of Regular Expression
Definition
A regular expression over an alphabet Σ is one of the following:
A basic regular expression, which is one of the following:
  the metacharacter ε, where L(ε) = {ε}
  any single character a from the alphabet Σ, where L(a) = {a}
An expression of the form r|s denoting the language L(r) ∪ L(s).
An expression of the form rs denoting the language L(r)L(s).
An expression of the form r∗ denoting the language L(r)∗.
An expression of the form (r) denoting the language L(r). Thus, parentheses do not change the language. They are used only to adjust the precedence of the operations.
Note: r and s are regular expressions.
Regular Expression: Examples
The following examples use the same alphabet: Σ = {a, b, c}
(a|c)∗b(a|c)∗ denotes the set of all strings that contain exactly one b.
(a|c)∗|(a|c)∗b(a|c)∗ denotes the set of all strings that contain at most one b.
What language is represented by (a|c)∗(b|ε)(a|c)∗?
The same language may be generated by many different regular expressions, but we never attempt to prove that we have found the "simplest" one because:
it rarely comes up in practical situations
the algorithms for recognizing regular expressions are able to simplify the recognition process without first simplifying the regular expression.
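The first two claims can be checked by brute force over all short strings of Σ = {a, b, c}. The sketch below uses Python's re module as the matcher; the helper names and the length bound of 4 are assumptions of this example, and a bounded check is evidence rather than a proof.

```python
import re
from itertools import product


def all_strings(alphabet, max_len):
    """Yield every string over alphabet of length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)


exactly_one_b = re.compile(r"(a|c)*b(a|c)*")
at_most_one_b = re.compile(r"(a|c)*|(a|c)*b(a|c)*")


def check(max_len=4):
    """Compare each regular expression with a direct count of b's."""
    for s in all_strings("abc", max_len):
        assert bool(exactly_one_b.fullmatch(s)) == (s.count("b") == 1)
        assert bool(at_most_one_b.fullmatch(s)) == (s.count("b") <= 1)
    return True
```

Calling `check()` runs through all 121 strings of length at most 4 and raises an AssertionError if either description is wrong for any of them.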
Extension to Regular Expression
Many extensions have been added to regular expressions to enhance their ability to specify string patterns.
The following slides describe the extensions used by Flex (the Fast Lexical Analyzer).
RE Extensions in Flex
r+ matches one or more repetitions of regular expression r.
  r∗ = r+ | ε, and r+ = rr∗
  (abcd)+ matches 1 or more "abcd" strings
  abcd+ matches abcd, abcdd, abcddd, etc.
r? matches 0 or 1 occurrences of r
  (abcd)? means 0 or 1 "abcd" string
  abcd? matches abc or abcd
r{n} matches exactly n occurrences of r, where n is an integer.
  It has two variants:
  r{n,} matches n or more occurrences of r.
  r{n,m} matches anywhere from n to m occurrences of r.
  a{4} means "aaaa"
  a{2,} means "aa", "aaa", "aaaa", etc.
  a{2,3} means "aa" or "aaa".
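Python's re module is not Flex, but it supports +, ?, and the {n}, {n,}, {n,m} forms with the same meaning for these simple cases, so the examples can be tried there. The helper name matches is an assumption of this sketch.

```python
import re


def matches(pattern, text):
    """True if the whole of text is matched by pattern."""
    return re.fullmatch(pattern, text) is not None


# + binds to the preceding item only: a single d here, the group below.
print(matches("abcd+", "abcddd"))
print(matches("abcd+", "abcdabcd"))
print(matches("(abcd)+", "abcdabcd"))

# ? makes the preceding item optional.
print(matches("abcd?", "abc"))
print(matches("(abcd)?", ""))

# Counted repetition.
print([n for n in range(6) if matches("a{4}", "a" * n)])
print([n for n in range(6) if matches("a{2,}", "a" * n)])
print([n for n in range(6) if matches("a{2,3}", "a" * n)])
```

The second call reports no match, which shows why the parentheses in (abcd)+ matter.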
RE Extensions in Flex (cont)
[a1a2...an] is the same as a1|a2|...|an, where a1, ..., an are symbols of the alphabet.
  It is called a character class
  [xyz] matches either an 'x', 'y', or 'z'
  [ab][cde] would match ac, ad, ae, bc, bd, or be
  (abcd) would only match the string "abcd"
A dash character "-" indicates a range of characters within a character set
  [abj-oZ] matches an 'a', a 'b', any letter from 'j' through 'o', or a 'Z'
  [a-f] is shorthand for [abcdef]
  [a-z][0-9] matches any two-character string beginning with a lowercase letter and ending with a single digit
RE Extensions in Flex (cont)
^ at the beginning of a character set indicates the complement of the set
  [^abc] means any character except a, b, or c
  [^^] matches all characters except "^"
  [^0-9A-Za-z] matches nonalphanumeric characters
The {-} operator computes the difference of two character classes.
  [a-c]{-}[b-z] represents all the characters in the class [a-c] that are not in the class [b-z], which here is just a
  [abc]{-}[b]{-}[c] is the same as [a].
The {+} operator computes the union of two character classes.
  [a-z]{+}[0-9] is the same as [a-z0-9]
RE Extensions in Flex (cont)
The metacharacter . (dot) stands for any character except newline (\n).
  .∗b.∗ matches all strings that contain at least one b.
\n, \t (newline, tab)
"" (double quotation marks)
  Characters inside are interpreted literally
  Can also use \ (a backslash)
  \. and "." both stand for the character . (period)
  (A|B)"CDE"? matches A, B, ACDE, or BCDE
^r matches an r, but only at the beginning of a line.
r$ matches an r, but only at the end of a line.
{name} is the expansion of the 'name' definition (see regular definition in next slide)
Regular Definition
Often, it is helpful as a notational simplification to give names to certain regular expressions, so that the names, instead of the regular expressions themselves, can be used in subsequent expressions.
These names are called regular definitions.
The following uses regular definitions to define C identifiers, which consist of letters, digits, and underscores.
letter_ = A|B|...|Z|a|b|...|z|_
digit = 0|1|...|9
id = letter_(letter_ | digit)∗
Regular Definition: Example
Unsigned numbers (integer or floating point) are strings such as 5280, 0.01234, 6.336E4, or 1.89E-4. The following regular definition is a precise specification for this set of strings.
digit = 0 | 1 | ... | 9
digits = digit digit∗
optFraction = . digits | ε
optExponent = ( E ( + | - | ε ) digits ) | ε
number = digits optFraction optExponent
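The same definitions can be composed as Python regular-expression strings, one variable per named definition. This is a sketch that mirrors the definitions above; the variable names follow the slide, but the translation into Python's re syntax (for example `[0-9]` for digit and `?` for "| ε") is this example's own.

```python
import re

digit = r"[0-9]"
digits = rf"{digit}{digit}*"              # digit digit*
opt_fraction = rf"(\.{digits})?"          # . digits | ε
opt_exponent = rf"(E[+-]?{digits})?"      # ( E ( + | - | ε ) digits ) | ε
number = re.compile(rf"{digits}{opt_fraction}{opt_exponent}")


def is_number(text):
    """True if the whole of text is an unsigned number under the definition."""
    return number.fullmatch(text) is not None
```

With these definitions, `is_number` accepts 5280, 0.01234, 6.336E4, and 1.89E-4, and rejects strings such as `1.` or `.5`, because both digits and the fraction require at least one digit.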
Algebraic Laws for Regular Expressions
Class Exercises on Regular Expression
Write regular expressions for the following sets of strings, or give reasons why no regular expression can be written
a All strings of lowercase letters that begin and end in a.
b All strings of lowercase letters that either begin or end in a (or both).
c All strings of digits that contain no leading zeros.
d All strings of digits that represent even numbers.
e All strings of digits such that all the 2’s occur before all the 9’s.
f All strings of a’s and b’s that contain no three consecutive b’s.
g All strings of a's and b's that contain an odd number of a's or an odd number of b's (or both).
h All strings of a's and b's that contain an even number of a's and an even number of b's.
i All strings of a’s and b’s that contain exactly as many a’s as b’s.
Class Exercises on Regular Expression (cont)
Describe the languages denoted by the following regular expressions
a) a(a|b)∗a
b) ((ε|a)b∗)∗
c) (a|b)∗a(a|b)(a|b)
d) a∗ba∗ba∗ba∗
e) (aa|bb)∗((ab|ba)(aa|bb)∗(ab|ba)(aa|bb)∗)∗
Implementation of Lexical Analyzer
There are two primary methods for implementing a scanner.
The first is a program that is hard-coded to perform the scanning tasks.
The second uses regular expressions and finite automata theory to model the scanning process.
Why study automatic scanner generation?
To avoid writing scanners by hand
To simplify specification and implementation of scanners
To understand the underlying techniques and technologies
Big Picture of Automatic Scanner Generation
Lexical analyzer generators on the market
ANTLR - ANTLR generates predicated-LL(k) lexers.
Lex - Classic lexical analyzer generator
JLex - A lexical analyzer generator for Java.
Flex - Alternative variant of the classic "lex" (C/C++).
JFlex - A rewrite of JLex.
Ragel - A state machine and lexical scanner generator with output support for C, C++, C#, Objective-C, D, Java, Go and Ruby source code.
JavaCC - JavaCC generates lexical analyzers written in Java.
Process of Automatic Scanner Generation
Step 1: Direct construction of a nondeterministic finite automaton (NFA) to recognize a given RE
  Easy to build in an algorithmic way
  Requires ε-transitions to combine regular subexpressions
Step 2: Construct a deterministic finite automaton (DFA) to simulate the NFA
  Use a set-of-states construction
Step 3: Minimize the number of states in the DFA (optional)
Step 4: Generate the scanner code
  Additional specifications are needed for the actions
Non-Deterministic Finite Automaton (NFA)
Definition
An NFA M consists of:
A finite set of states S
A set of input symbols Σ (assume that ε, the empty string, is never a member of Σ)
A transition function T that gives, for each state and for each symbol in Σ ∪ {ε}, a set of next states
A state s0 from S as the start state (or initial state)
A set of states F, a subset of S, as the accepting states (or final states)
An NFA can be represented by a transition graph, where
states are represented by circles
the transition function is represented by labeled edges from one circle to another
the start state s0 is indicated by an unlabeled incoming arc coming from nowhere
final states are represented by double circles
NFA Notes
The same symbol can label edges from one state to several different states
An edge labeled by ε represents an ε-transition, which may occur without consulting the input string.
Typically, states are numbered.
To simplify the diagram:
Names (i.e. regular definitions) representing a set of characters can be used to label edges
If a transition T(s, c) is not represented in the diagram, it means that at state s the symbol c cannot be recognized, which represents an error.
Language Defined by NFA
An NFA accepts a string x if and only if there is a path in the transition diagram from s0 to a final state such that the edge labels spell x, ignoring ε's
The language defined (accepted) by an NFA is the set of strings labeling some path from the start state to an accepting state.
To "run" the NFA, start in s0 and guess the right transition at each step
Always guess correctly
If some sequence of correct guesses accepts x, then accept
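A program cannot literally guess, but it can follow every possible guess at once by tracking the set of states the NFA could be in. The sketch below does this for a hypothetical four-state NFA for (a|b)*abb; the state numbering, the dictionary layout, and the use of the empty string as the ε label are assumptions of this example.

```python
EPS = ""  # label used in this sketch for ε-transitions

# Hypothetical NFA for (a|b)*abb: TRANSITIONS[state][symbol] is a set of next states.
TRANSITIONS = {
    0: {"a": {0, 1}, "b": {0}},
    1: {"b": {2}},
    2: {"b": {3}},
    3: {},
}
START = 0
ACCEPTING = {3}


def eps_closure(states):
    """All states reachable from the given states using only ε-transitions."""
    stack = list(states)
    closure = set(states)
    while stack:
        s = stack.pop()
        for t in TRANSITIONS[s].get(EPS, set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure


def accepts(text):
    """True if some path from START to an accepting state spells text."""
    current = eps_closure({START})
    for ch in text:
        moved = set()
        for s in current:
            moved |= TRANSITIONS[s].get(ch, set())
        current = eps_closure(moved)
    return bool(current & ACCEPTING)
```

The string is accepted when at least one of the tracked states is accepting at the end, which corresponds to "some sequence of correct guesses accepts x".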
1st NFA Example
The above NFA recognizes the language of the regular expression (a|b)∗abb
2nd NFA Example
The above NFA recognizes the language of the regular expression aa∗|bb∗
3rd NFA Example
The above NFA recognizes the language of the regular expression ab+ | ab∗ | b∗
Transition Table
An NFA can also be represented by a transition table, whose rows correspond to states and whose columns correspond to the input symbols and ε
The entry for a given state and input symbol is the value of the transition function applied to those arguments.
If the transition function has no information about a state-input pair, the empty set is put in the table for that pair.
Advantage: it is easy to find the transitions on a given state and input symbol.
Disadvantage: it takes a lot of space when the input alphabet is large, yet most states do not have any moves on most of the input symbols.
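Both the advantage and the disadvantage show up even in a tiny table. The sketch below stores a hypothetical four-state NFA for (a|b)*abb as a dense table with one column per symbol plus ε; the state numbering and the helper names are assumptions of this example.

```python
SYMBOLS = ["a", "b", "ε"]  # column order of the table

# Rows are states 0..3; each cell is the set of next states (empty set if none).
TABLE = [
    [{0, 1}, {0},   set()],   # state 0
    [set(),  {2},   set()],   # state 1
    [set(),  {3},   set()],   # state 2
    [set(),  set(), set()],   # state 3 (accepting)
]


def T(state, symbol):
    """Look up the transition function with two index operations."""
    return TABLE[state][SYMBOLS.index(symbol)]


def empty_fraction():
    """Fraction of table cells that hold no transition at all."""
    cells = [cell for row in TABLE for cell in row]
    return sum(1 for cell in cells if not cell) / len(cells)
```

Lookup is a direct index, yet 8 of the 12 cells are empty here, and the proportion of empty cells tends to grow with the size of the alphabet.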
Transition Table Example
The diagram of an NFA
and its transition table.
Deterministic Finite Automaton (DFA)
A DFA is a special case of an NFA where:
There are no moves on input ε, and
For each state s and input symbol a, there is at most one edge out of s labeled a.
While an NFA is an abstract representation of an algorithm to recognize the strings of a certain language, a DFA is a simple, concrete algorithm for recognizing strings.
Every regular expression and every NFA can be converted to a DFA accepting the same language.
1st DFA Example
This is a DFA corresponding to the 1st NFA example, recognizing the language of the regular expression (a|b)∗abb
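One DFA for (a|b)*abb can be written as a transition dictionary and run with a single current state. The state numbering below is an assumption of this sketch and need not match the diagram on the slide; state k records that the last k characters read form the longest usable prefix of abb.

```python
# Hypothetical DFA for (a|b)*abb: DFA[state][symbol] is the single next state.
DFA = {
    0: {"a": 1, "b": 0},
    1: {"a": 1, "b": 2},
    2: {"a": 1, "b": 3},
    3: {"a": 1, "b": 0},
}
START = 0
ACCEPTING = {3}


def dfa_accepts(text):
    """Run the DFA over text; no guessing or backtracking is needed."""
    state = START
    for ch in text:
        if ch not in DFA[state]:
            return False  # symbol outside the alphabet: error
        state = DFA[state][ch]
    return state in ACCEPTING
```

Because each state has at most one edge per symbol, the loop does a constant amount of work per input character.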
2nd DFA Example
In the above diagram, letter and digit are defined as follows:
letter = [a-zA-Z]
digit = [0-9]
The above DFA recognizes all strings that begin with a letter and continue with any sequence of letters and/or digits.
3rd DFA Example
This is a DFA corresponding to the 3rd NFA example, recognizing the language of the regular expression ab+ | ab∗ | b∗
4th DFA Example
This is a DFA that recognizes C comments (not nested)
The "other" transition from state 3 to itself stands for [^*]
The "other" transition from state 4 to state 3 stands for [^*/]
The meaning of each state:
state 1: start
state 2: entering comment
state 3: in comment
state 4: exiting comment
state 5: finish
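The five states translate directly into a hand-coded recognizer. This sketch follows the state meanings listed above; the function name is_c_comment is hypothetical, and the self-loop on * in state 4 is an assumption inferred from the [^*/] label, since the diagram itself is not part of this transcript.

```python
def is_c_comment(text):
    """True if text is exactly one C comment of the form /* ... */."""
    state = 1
    for ch in text:
        if state == 1:                      # start: expect '/'
            state = 2 if ch == "/" else None
        elif state == 2:                    # entering comment: expect '*'
            state = 3 if ch == "*" else None
        elif state == 3:                    # in comment: '*' may begin the end
            state = 4 if ch == "*" else 3
        elif state == 4:                    # exiting comment
            if ch == "/":
                state = 5
            elif ch == "*":
                state = 4                   # assumed self-loop on '*'
            else:
                state = 3                   # the [^*/] transition back to state 3
        else:                               # state 5 has no outgoing transitions
            state = None
        if state is None:
            return False
    return state == 5
```

Inputs such as `/**/` and `/***/` are accepted, while `/*/` is rejected because the slash is read while still in state 3.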
Thompson’s Construction
Thompson's Construction uses ε-transitions to "glue together" the NFA machines of each piece of a regular expression to form an NFA machine that corresponds to the whole expression.
The construction is inductive, and it follows the structure of the definition of a regular expression.
We exhibit an NFA for each basic regular expression and then show how each regular expression operation can be achieved by connecting together the NFAs of the subexpressions.
Thompson’s Construction
Each NFA constructed for a subexpression in Thompson's algorithm has exactly one initial state and one final state.
There may be multiple NFAs that recognize the language of a regular expression; Thompson's construction generates only one of them.
Thompson’s Construction Example
This example demonstrates the steps to generate an NFA for the regular expression (a|b)∗a
a
b
a|b
(a|b)∗
(a|b)∗a
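The same sequence of steps can be sketched in code, building one fragment per subexpression and gluing fragments with ε-edges. This is a sketch under stated assumptions: the names Fragment, symbol, union, concat, and star are chosen for this example, and concat joins its parts with an ε-edge, a common variant, whereas some presentations merge the two states instead.

```python
from itertools import count

EPS = None            # label for ε-transitions in this sketch
_ids = count()        # source of fresh state numbers


class Fragment:
    """An NFA piece with exactly one start state and one final state."""
    def __init__(self, start, final, edges):
        self.start, self.final = start, final
        self.edges = edges  # list of (source, label, destination)


def symbol(a):
    s, f = next(_ids), next(_ids)
    return Fragment(s, f, [(s, a, f)])


def union(r, s):
    start, final = next(_ids), next(_ids)
    glue = [(start, EPS, r.start), (start, EPS, s.start),
            (r.final, EPS, final), (s.final, EPS, final)]
    return Fragment(start, final, r.edges + s.edges + glue)


def concat(r, s):
    return Fragment(r.start, s.final, r.edges + s.edges + [(r.final, EPS, s.start)])


def star(r):
    start, final = next(_ids), next(_ids)
    glue = [(start, EPS, r.start), (start, EPS, final),
            (r.final, EPS, r.start), (r.final, EPS, final)]
    return Fragment(start, final, r.edges + glue)


def closure(states, edges):
    """States reachable from the given states through ε-edges only."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for src, label, dst in edges:
            if src == s and label is EPS and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def accepts(nfa, text):
    current = closure({nfa.start}, nfa.edges)
    for ch in text:
        moved = {dst for src, label, dst in nfa.edges
                 if src in current and label == ch}
        current = closure(moved, nfa.edges)
    return nfa.final in current


# Steps of the example: a, b, a|b, (a|b)*, (a|b)*a
nfa = concat(star(union(symbol("a"), symbol("b"))), symbol("a"))
```

The resulting automaton accepts exactly the strings of a's and b's that end in a, for example `accepts(nfa, "abba")`.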
NFA → DFA with Subset Construction
Key Functions
Move(si, a) is the set of states reachable from si by a single transition on input symbol a
ε-closure(si) is the set of states reachable from si by a series of zero or more ε-transitions.
The algorithm for constructing a DFA from an NFA
Step 1: The start state is derived from s0 of the NFA
Step 2: Take its ε-closure: S0 = ε-closure(s0)
Step 3: Take the image of S0, Move(S0, α), for each α ∈ Σ, and take its ε-closure
Step 4: Iterate until no more states are added
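The four steps can be sketched as a worklist algorithm. The NFA below is a hypothetical eleven-state automaton for (a|b)*abb in the style produced by Thompson's construction; its state numbering, the dictionary layout, and the function names are assumptions of this example.

```python
EPS = "ε"

# Hypothetical NFA for (a|b)*abb: NFA[state][label] is a set of next states.
NFA = {
    0: {EPS: {1, 7}},
    1: {EPS: {2, 4}},
    2: {"a": {3}},
    3: {EPS: {6}},
    4: {"b": {5}},
    5: {EPS: {6}},
    6: {EPS: {1, 7}},
    7: {"a": {8}},
    8: {"b": {9}},
    9: {"b": {10}},
    10: {},
}
NFA_START, NFA_ACCEPT, SIGMA = 0, {10}, ["a", "b"]


def eps_closure(states):
    """ε-closure of a set of NFA states, returned as a hashable frozenset."""
    stack, closure = list(states), set(states)
    while stack:
        for t in NFA[stack.pop()].get(EPS, set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def move(states, symbol):
    """NFA states reachable from any of the given states on one symbol."""
    return {t for s in states for t in NFA[s].get(symbol, set())}


def subset_construction():
    start = eps_closure({NFA_START})           # Steps 1 and 2
    dfa, worklist = {}, [start]
    while worklist:                            # Step 4: until nothing new appears
        S = worklist.pop()
        if S in dfa:
            continue
        dfa[S] = {}
        for a in SIGMA:                        # Step 3: Move, then ε-closure
            target = eps_closure(move(S, a))
            if target:
                dfa[S][a] = target
                worklist.append(target)
    accepting = {S for S in dfa if S & NFA_ACCEPT}
    return start, dfa, accepting
```

Each DFA state is a set of NFA states. For this NFA the construction produces five DFA states, one of which (the set containing NFA state 10) is accepting.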
NFA→ DFA Example
Implementation of DFA
Since we already have a DFA that recognizes all tokens, the next step is to build the lexical analyzer from the DFA.
One approach is called table driven.
The table-driven approach uses the transition table of the DFA to direct the progress of the algorithm.
Advantages:
the size of the code is reduced
the same code will work for many different problems
the code is easy to maintain
Disadvantages:
the transition table can become very large, causing a significant increase in the space used by the program
Table Driven Approach
Algorithm
state := 1;
ch := next input character;
while not Accept[state] and not error(state) do
    newstate := T[state, ch];
    if Advance[state, ch] then ch := next input character;
    state := newstate;
end while
if Accept[state] then accept;
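A runnable version of the algorithm might look like the sketch below, which scans one identifier of the form letter (letter | digit)∗. The three tables, the character classes, the ERROR state, and the use of the empty string for end of input are assumptions made for this example; only the control loop follows the pseudocode above.

```python
ERROR = 0  # hypothetical error state


def char_class(ch):
    """Map a character to a table column; "" (end of input) counts as other."""
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    return "other"


# State 1: start, state 2: inside an identifier, state 3: accept.
T = {
    1: {"letter": 2, "digit": ERROR, "other": ERROR},
    2: {"letter": 2, "digit": 2, "other": 3},
}
ADVANCE = {
    1: {"letter": True, "digit": False, "other": False},
    2: {"letter": True, "digit": True, "other": False},   # do not consume the delimiter
}
ACCEPT = {ERROR: False, 1: False, 2: False, 3: True}


def scan_identifier(text):
    """Return the identifier at the start of text, or None if there is none."""
    pos = 0
    lexeme = ""
    state = 1
    ch = text[pos] if pos < len(text) else ""
    while not ACCEPT[state] and state != ERROR:
        cls = char_class(ch)
        new_state = T[state][cls]
        if ADVANCE[state][cls]:
            lexeme += ch
            pos += 1
            ch = text[pos] if pos < len(text) else ""
        state = new_state
    return lexeme if ACCEPT[state] else None
```

The loop itself never mentions letters or digits, so a different token could be recognized by replacing only the three tables. The Advance table lets the transition into the accepting state leave the delimiter unread for the next token.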