lexical analyzer - csci4160: compiler design and software

Lexical Analyzer. CSCI 4160: Compiler Design and Software Development. Dr. Zhijiang Dong, Dept. of Computer Science, Middle Tennessee State University

Post on 15-Feb-2022

CSCI 4160

Overview

Token Specification: RE, RE Extension

Finite Automata: NFA, DFA

RE to NFA

NFA to DFA

DFA to Code

Lexical Analyzer
CSCI 4160: Compiler Design and Software Development

Dr. Zhijiang Dong

Dept. of Computer Science, Middle Tennessee State University

Credits

Some material found in these slides originated from the following textbooks and other authors, with modifications done by Dr. Dong:

Compilers: Principles, Techniques, and Tools by Alfred V. Aho et al.

Engineering a Compiler by Keith Cooper and Linda Torczon

Compiler Construction: Principles and Practice by Kenneth C. Louden

Dr. Al Cripps

Outline

1 Overview of Lexical Analyzer

2 Token Specification
  Regular Expressions
  Extensions to Regular Expressions

3 Finite Automata
  Non-Deterministic Finite Automaton (NFA)
  Deterministic Finite Automaton (DFA)

4 Construction of an NFA from a Regular Expression

5 NFA to DFA

6 Implementation of DFA

Lexical Analyzer

As the first phase of a compiler, the main task of the lexical analyzer is to read the input characters of the source program, group them into lexemes, and produce as output a sequence of tokens, one for each lexeme in the source program.

Besides identifying lexemes, the lexical analyzer may perform other tasks, such as:

stripping out comments and whitespace (blank, newline, tab)

correlating error messages generated by the compiler with the source program by keeping track of the number of newline characters seen

interacting with the symbol table

Interaction Between Lexical Analyzer and Parser

Commonly, the interaction is implemented by having the parser call the lexical analyzer.

The call, suggested by the getNextToken command, causes the lexical analyzer to read characters from its input until it can identify the next token, which it returns to the parser.

Why separate lexical analysis from syntax analysis?

Reasons to separate the analysis portion of a compiler into lexical analysis and syntax analysis (parsing) phases:

Simplicity of design is the most important consideration.

Compiler efficiency is improved. Specialized techniques that serve only the lexical task can be applied in the lexical analyzer.

Compiler portability is enhanced. Input-device-specific peculiarities can be restricted to the lexical analyzer.

Tokens, Patterns, and Lexemes

Token

A token is a pair consisting of a token name and an optional attribute value. The token name is an abstract symbol representing a kind of lexical unit.

Pattern

A pattern is a description of the form that the lexemes of a token may take.

Lexeme

A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.
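The three terms can be illustrated with a small sketch. The token names and patterns below are hypothetical and are not taken from the slides; Python's re module stands in for a real scanner, and the first pattern that matches wins (a real scanner would normally prefer the longest match).

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    name: str    # token name, e.g. "id" or "number"
    lexeme: str  # the matched characters, used here as the attribute value


# Hypothetical (token name, pattern) pairs in priority order:
# the keyword is listed before the identifier pattern.
TOKEN_PATTERNS = [
    ("if", r"if\b"),
    ("number", r"[0-9]+"),
    ("id", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("comparison", r"<=|>=|==|!=|<|>"),
    ("ws", r"[ \t\n]+"),
]


def tokenize(source: str) -> Iterator[Token]:
    """Yield one token per lexeme, skipping whitespace."""
    pos = 0
    while pos < len(source):
        for name, pattern in TOKEN_PATTERNS:
            match = re.compile(pattern).match(source, pos)
            if match:
                if name != "ws":
                    yield Token(name, match.group())
                pos = match.end()
                break
        else:
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")


tokens = list(tokenize("if count <= 42"))
for token in tokens:
    print(token.name, repr(token.lexeme))
```

Here "count" is a lexeme, the expression `[A-Za-z_][A-Za-z0-9_]*` is its pattern, and `id` is the token name that the parser would see.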

Example of Tokens, Patterns, and Lexemes

Token Classes

The following classes cover most or all of the tokens in many programming languages:

One token for each keyword

Tokens for the operators, either individually or in classes such as the token comparison

One token representing all identifiers

One or more tokens representing constants, such as numbers and literal strings

Tokens for each punctuation symbol, such as left and right parentheses, comma, and semicolon

Pattern Matching

The major task of the lexical analyzer is to recognize or match a token, which represents a certain pattern of characters, from the beginning of the remaining input characters.

Therefore, methods of pattern specification and recognition will be applied to the scanning process, especially:

regular expressions (RE)

finite automata (NFA, DFA)

Process of Constructing a Lexical Analyzer

Step 1: Write down the RE for the input language

Step 2: Build a big NFA

Step 3: Build the DFA that simulates the NFA

Step 4: Systematically shrink the DFA (Optional)

Step 5: Turn it into code

Regular Expression

Regular expressions represent patterns of strings of characters. A regular expression r is completely defined by the set of strings that it matches.

The set of strings represented by a regular expression r is called the language generated by r and is written as L(r).

This language depends on the character set that is available.

The elements of the character set are called symbols.

The set of legal symbols is called the alphabet and is usually written as the Greek symbol Σ.

Definition of Regular Expressions

Describe the meaning of regular expressions by stating which languages are generated by each pattern

Basic Regular Expressions

Regular Expression Operations

  Choice Among Alternatives

  Concatenation

  Repetition

Precedence of Operations and Use of Parentheses

Extensions to Regular Expressions

Regular Definitions

Basic Regular Expressions

These are just the single characters from the alphabet, which match themselves. Given any character a from the alphabet Σ, we indicate that the regular expression a matches the character a by writing L(a) = {a}.

The symbol ε denotes the empty string, and we define the metasymbol ε (boldface ε) as a regular expression by setting L(ε) = {ε}.

The set {} contains no strings at all, while the set {ε} contains the single string consisting of no characters.

Regular Expression Operations

Three basic operations are defined in regular expressions:

choice among alternatives, indicated by the metacharacter | (vertical bar);

concatenation, indicated by juxtaposition (without a metacharacter); and

repetition or "closure", indicated by the metacharacter *.

Regular Expression Operations: Choice Among Alternatives

If r and s are regular expressions, then r|s is a regular expression which matches any string that is matched either by r or by s. In terms of languages, the language of r|s is the union of the languages of r and s, or L(r|s) = L(r) ∪ L(s).

L(a | b) = L(a) ∪ L(b) = {a, b}

L(a | ε) = {a, ε}

L(a | b | c | d) = {a, b, c, d}

Regular Expression Operations: Concatenation

The concatenation of two regular expressions r and s is written as rs, and it matches any string that is the concatenation of two strings, the first of which matches r and the second of which matches s.

Given two sets of strings S1 and S2, the concatenated set of strings S1S2 is the set of all strings formed by appending a string of S2 to a string of S1.

If S1 = {aa, b} and S2 = {a, bb}, then S1S2 = {aaa, aabb, ba, bbb}

Therefore, the concatenation operation for regular expressions can be defined as: L(rs) = L(r)L(s)

L(ab) = {ab}

L((a|b)c) = L(a|b)L(c) = {a, b}{c} = {ac, bc}

Concatenation can also be extended to more than two regular expressions: L(r1r2...rn) = L(r1)L(r2)...L(rn)
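The set-level definition can be checked directly. This is a minimal sketch using Python sets of strings; the function name concat is chosen here for illustration.

```python
from __future__ import annotations


def concat(s1: set[str], s2: set[str]) -> set[str]:
    """Return S1S2: every string of S1 followed by every string of S2."""
    return {x + y for x in s1 for y in s2}


# The example from the slide: S1 = {aa, b}, S2 = {a, bb}.
s1s2 = concat({"aa", "b"}, {"a", "bb"})
print(sorted(s1s2))

# L((a|b)c) = {a, b}{c}
print(sorted(concat({"a", "b"}, {"c"})))
```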

Regular Expression Operations: Repetition

The repetition operation of a regular expression is written r∗, where r is a regular expression. The regular expression r∗ matches any finite concatenation of strings, each of which matches r (including the concatenation of zero strings, which is ε).

Given a set S of strings, let S∗ = {ε} ∪ S ∪ SS ∪ SSS ∪ .... Then the repetition operation for regular expressions can be defined as: L(r∗) = L(r)∗

L(a∗) = {ε, a, aa, aaa, aaaa, ...}

L((a|bb)∗) = L(a|bb)∗ = {a, bb}∗ = {ε, a, bb, aa, abb, bba, bbbb, aaa, aabb, abba, abbbb, bbaa, ...}
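Since S∗ is infinite whenever S contains a non-empty string, it can only be enumerated up to some bound. The sketch below lists the members of S∗ no longer than a chosen length; the name star_up_to and the length bound are illustrative choices, not part of the slides.

```python
def star_up_to(s, max_len):
    """Return the strings of S* whose length is at most max_len."""
    result = {""}    # the empty string ε is always in S*
    frontier = {""}  # strings found in the previous round
    while frontier:
        # Append one more string of S to each newly found string.
        frontier = {
            x + y for x in frontier for y in s if len(x + y) <= max_len
        } - result
        result |= frontier
    return result


short_strings = star_up_to({"a", "bb"}, 3)
print(sorted(short_strings, key=lambda w: (len(w), w)))
```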

Precedence of Operations and Use of Parentheses

Precedence of Operations

Among the choice, concatenation, and repetition operations, * is given the highest precedence, concatenation is given the next highest, and | is given the lowest.

Thus, a|bc∗ is interpreted as a|(b(c∗)), and ab|c∗d is interpreted as (ab)|((c∗)d).

Parentheses are introduced to indicate a different precedence.

Formal Definition of Regular Expression

Definition

A regular expression over an alphabet Σ is one of the following:

A basic regular expression, which is one of the following:

  the metacharacter ε, where L(ε) = {ε}

  any single character a from the alphabet Σ, where L(a) = {a}

An expression of the form r|s denoting the language L(r) ∪ L(s).

An expression of the form rs denoting the language L(r)L(s).

An expression of the form r∗ denoting the language L(r)∗.

An expression of the form (r) denoting the language L(r). Thus, parentheses do not change the language. They are used only to adjust the precedence of the operations.

Note: r and s are regular expressions.

Regular Expression: Examples

The following examples use the same alphabet: Σ = {a, b, c}

(a|c)∗b(a|c)∗ denotes the set of all strings that contain exactly one b.

(a|c)∗|(a|c)∗b(a|c)∗ denotes the set of all strings that contain at most one b.

What language is represented by (a|c)∗(b|ε)(a|c)∗?

The same language may be generated by many different regular expressions, but we never attempt to prove that we have found the "simplest" one because:

it rarely comes up in practical situations

the algorithms for recognizing regular expressions are able to simplify the recognition process without bothering to simplify the regular expression first.
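The first two claims can be checked by brute force over all short strings. This sketch uses Python's re module as a stand-in matcher (its syntax agrees with the slides for these operators); the helper names and the length bound of 5 are arbitrary choices made here.

```python
import re
from itertools import product

exactly_one_b = re.compile(r"(a|c)*b(a|c)*")
at_most_one_b = re.compile(r"(a|c)*|(a|c)*b(a|c)*")


def all_strings(alphabet, max_len):
    """Yield every string over the alphabet with length <= max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)


def counterexamples(max_len):
    """Collect strings on which a pattern disagrees with its description."""
    bad = []
    for s in all_strings("abc", max_len):
        if bool(exactly_one_b.fullmatch(s)) != (s.count("b") == 1):
            bad.append(("exactly one b", s))
        if bool(at_most_one_b.fullmatch(s)) != (s.count("b") <= 1):
            bad.append(("at most one b", s))
    return bad


mismatches = counterexamples(5)
print("counterexamples found:", len(mismatches))
```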

Extension to Regular Expression

Many extensions have been added to regular expressions to enhance their ability to specify string patterns.

The following slides cover the extensions used by Flex (The Fast Lexical Analyzer).

RE Extensions in Flex

r+ matches one or more repetitions of the regular expression r.

  r∗ = r+ | ε, and r+ = rr∗

  (abcd)+ matches one or more "abcd" strings

  abcd+ matches abcd, abcdd, abcddd, etc.

r? matches 0 or 1 occurrences of r

  (abcd)? means 0 or 1 "abcd" string

  abcd? matches abc or abcd

r{n} matches exactly n occurrences of r, where n is an integer. It has two variants:

  r{n,} matches n or more occurrences of r.

  r{n,m} matches anywhere from n to m occurrences of r.

  a{4} means "aaaa"

  a{2,} means "aa", "aaa", "aaaa", etc.

  a{2,3} means "aa" or "aaa".
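These operators can be tried out without Flex. The sketch below uses Python's re module, which is a different tool but happens to give +, ? and {n,m} the same meaning for these examples; it is an illustration of the operators, not of Flex itself.

```python
import re


def matches(pattern, text):
    """True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


# (pattern, text, expected result) triples taken from the examples above.
examples = [
    (r"(abcd)+", "abcdabcd", True),
    (r"abcd+", "abcddd", True),
    (r"abcd+", "abcdabcd", False),  # + applies only to the final d
    (r"(abcd)?", "", True),
    (r"abcd?", "abc", True),
    (r"a{4}", "aaaa", True),
    (r"a{2,}", "aaaaa", True),
    (r"a{2,3}", "aaa", True),
    (r"a{2,3}", "aaaa", False),
]

results = [matches(pattern, text) == expected
           for pattern, text, expected in examples]
print(all(results))
```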

RE Extensions in Flex (cont)

[a1a2...an] is the same as a1|a2|...|an, where a1, ..., an are symbols of the alphabet.

  It is called a character class

  [xyz] matches either an 'x', 'y', or 'z'

  [ab][cde] would match ac, ad, ae, bc, bd, or be

  (abcd) would only match the string "abcd"

A dash character "-" indicates a range of characters within a character class

  [abj-oZ] matches an 'a', a 'b', any letter from 'j' through 'o', or a 'Z'

  [a-f] is shorthand for [abcdef]

  [a-z][0-9] matches any two-character string beginning with a lowercase letter and ending with a single digit

RE Extensions in Flex (cont)

^ at the beginning of a character class indicates the complement of the class

  [^abc] means any character except a, b, or c

  [^^] matches all characters except "^"

  [^0-9A-Za-z] matches nonalphanumeric characters

The {-} operator computes the difference of two character classes.

  [a-c]{-}[b-z] represents all the characters in the class [a-c] that are not in the class [b-z]

  [abc]{-}[b]{-}[c] is the same as [a].

The {+} operator computes the union of two character classes.

  [a-z]{+}[0-9] is the same as [a-z0-9]

RE Extensions in Flex (cont)

The metacharacter . (dot) stands for any character except newline (\n).

  .∗b.∗ matches all strings (containing no newline) that contain at least one b.

\n, \t (newline, tab)

"" (double quotation marks)

  Characters inside are interpreted literally

  Can also use \ (a backslash) to quote a single character

  \. and "." both stand for the character . (period)

  (A|B)"CDE"? matches A, B, ACDE, or BCDE

^r matches an r, but only at the beginning of a line.

r$ matches an r, but only at the end of a line.

{name} is the expansion of the 'name' definition (see regular definition in next slide)

Regular Definition

Often, it is helpful as a notational simplification to give names to certain regular expressions, so that the names, instead of the regular expressions themselves, can be used in subsequent expressions.

Such named expressions are called regular definitions.

The following uses regular definitions to define C identifiers, consisting of letters, digits, and underscores.

letter_ = A|B|...|Z|a|b|...|z|_

digit = 0|1|...|9

id = letter_ (letter_ | digit)∗

Regular Definition: Example

Unsigned numbers (integer or floating point) are strings such as 5280, 0.01234, 6.336E4, or 1.89E-4. The following regular definition is a precise specification for this set of strings.

digit = 0 | 1 | ... | 9

digits = digit digit∗

optFraction = . digits | ε

optExponent = ( E ( + | - | ε ) digits ) | ε

number = digits optFraction optExponent
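The same definition can be assembled piece by piece in code. This is a sketch in Python's re syntax, where each named definition becomes a string that later definitions interpolate; "x | ε" is written as "(x)?" and the variable names mirror the definition above.

```python
import re

digit = r"[0-9]"
digits = rf"{digit}{digit}*"
opt_fraction = rf"(\.{digits})?"         # . digits | ε
opt_exponent = rf"(E[+-]?{digits})?"     # ( E ( + | - | ε ) digits ) | ε
number = re.compile(rf"{digits}{opt_fraction}{opt_exponent}")


def is_number(text):
    """True if the whole text is an unsigned number per the definition."""
    return number.fullmatch(text) is not None


for sample in ["5280", "0.01234", "6.336E4", "1.89E-4", ".5", "1.", "E4"]:
    print(sample, is_number(sample))
```

Note that the definition requires digits on both sides of the decimal point, so strings such as ".5" and "1." are rejected.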

Algebraic Laws for Regular Expressions

Class Exercises on Regular Expression

Write regular expressions for the following sets of strings, or give reasons why no regular expression can be written

a All strings of lowercase letters that begin and end in a.

b All strings of lowercase letters that either begin or end in a (or both).

c All strings of digits that contain no leading zeros.

d All strings of digits that represent even numbers.

e All strings of digits such that all the 2’s occur before all the 9’s.

f All strings of a’s and b’s that contain no three consecutive b’s.

g All strings of a’s and b’s that contain an odd number of a’s or an odd number of b’s (or both).

h All strings of a’s and b’s that contain an even number of a’s and an even number of b’s.

i All strings of a’s and b’s that contain exactly as many a’s as b’s.

Class Exercises on Regular Expression (cont)

Describe the languages denoted by the following regular expressions

a) a(a|b)∗a

b) ((ε|a)b∗)∗

c) (a|b)∗a(a|b)(a|b)

d) a∗ba∗ba∗ba∗

e) (aa|bb)∗((ab|ba)(aa|bb)∗(ab|ba)(aa|bb)∗)∗

Implementation of Lexical Analyzer

There are two primary methods for implementing a scanner.

  The first is a program that is hard-coded to perform the scanning tasks.

  The second uses regular expressions and finite automata theory to model the scanning process.

Why study automatic scanner generation?

  To avoid writing scanners by hand

  To simplify specification and implementation of scanners

  To understand the underlying techniques and technologies

Big Picture of Automatic Scanner Generation

Lexical analyzer generators on the market

ANTLR - ANTLR generates predicated-LL(k) lexers.

Lex - Classic lexical analyzer generator.

JLex - A lexical analyzer generator for Java.

Flex - Alternative variant of the classic "lex" (C/C++).

JFlex - A rewrite of JLex.

Ragel - A state machine and lexical scanner generator with output support for C, C++, C#, Objective-C, D, Java, Go and Ruby source code.

JavaCC - JavaCC generates lexical analyzers written in Java.

Process of Automatic Scanner Generation

Step 1: Direct construction of a nondeterministic finite automaton (NFA) to recognize a given RE

  Easy to build in an algorithmic way

  Requires ε-transitions to combine regular subexpressions

Step 2: Construct a deterministic finite automaton (DFA) to simulate the NFA

  Use a set-of-states construction

Step 3: Minimize the number of states in the DFA (Optional)

Step 4: Generate the scanner code

  Additional specifications needed for the actions

Non-Deterministic Finite Automaton (NFA)

Definition

An NFA M consists of:

A finite set of states S

A set of input symbols Σ. (Assume ε, the empty string, is never a member of Σ.)

A transition function T that gives, for each state and for each symbol in Σ ∪ {ε}, a set of next states.

A state s0 from S as the start state (or initial state)

A set of states F, a subset of S, as the accepting states (or final states)

An NFA can be represented by a transition graph, where

  states are represented by circles

  the transition function is represented by labeled edges from one circle to another circle

  the start state s0 is indicated by an unlabeled incoming arc coming from nowhere

  final states are represented by double circles.

NFA Notes

The same symbol can label edges from one state to several different states.

An edge labeled by ε represents an ε-transition, which may occur without consulting the input string.

Typically, states are numbered.

To simplify the diagram:

  Names (i.e. regular definitions) representing a set of characters can be used to label edges

  If a transition T(s, c) is not represented in the diagram, it means that at state s the symbol c cannot be recognized, which represents an error.

Language Defined by NFA

An NFA accepts a string x if and only if there is a path in the transition diagram from s0 to a final state such that the edge labels spell x, ignoring ε’s.

The language defined/accepted by an NFA is the set of strings labeling some path from the start state to an accepting state.

To "run" the NFA, start in s0 and guess the right transition at each step

  Always guess correctly

  If some sequence of correct guesses accepts x, then accept
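A program cannot guess, but it can follow every possible guess at once by tracking the set of states the NFA could be in. The sketch below does this for a commonly used four-state NFA for (a|b)∗abb; that particular machine, the dictionary encoding, and the function names are assumptions made for illustration and may differ from the diagrams in the slides.

```python
EPSILON = ""  # label used here for ε-transitions

# Assumed NFA for (a|b)*abb: (state, symbol) -> set of next states.
NFA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START, ACCEPTING = 0, {3}


def epsilon_closure(states, delta):
    """All states reachable from `states` using only ε-transitions."""
    closure, stack = set(states), list(states)
    while stack:
        for nxt in delta.get((stack.pop(), EPSILON), ()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return closure


def nfa_accepts(text, delta=NFA, start=START, accepting=ACCEPTING):
    """True if some path from the start state spelling `text` ends in an accepting state."""
    current = epsilon_closure({start}, delta)
    for ch in text:
        moved = set()
        for state in current:
            moved |= delta.get((state, ch), set())
        current = epsilon_closure(moved, delta)
    return bool(current & accepting)


for sample in ["abb", "babb", "ab", "abba"]:
    print(sample, nfa_accepts(sample))
```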

1st NFA Example

The NFA on this slide (diagram not preserved in this transcript) recognizes the language of the regular expression (a|b)∗abb

2nd NFA Example

The NFA on this slide (diagram not preserved in this transcript) recognizes the language of the regular expression aa∗|bb∗

3rd NFA Example

The NFA on this slide (diagram not preserved in this transcript) recognizes the language of the regular expression ab+ | ab∗ | b∗

Transition Table

An NFA can also be represented by a transition table, whose rows correspond to states and whose columns correspond to the input symbols and ε.

The entry for a given state and input symbol is the value of the transition function applied to those arguments.

If the transition function has no information about a state-input pair, the empty set is put in the table for that pair.

Advantage: it is easy to find the transitions on a given state and input symbol.

Disadvantage: it takes a lot of space when the input alphabet is large, yet most states do not have any moves on most of the input symbols.

Transition Table Example

[Figures: the diagram of an NFA and its transition table; not preserved in this transcript]

Deterministic Finite Automaton (DFA)

A DFA is a special case of an NFA where:

  There are no moves on input ε, and

  For each state s and input symbol a, there is at most one edge out of s labeled a.

While an NFA is an abstract representation of an algorithm to recognize the strings of a certain language, a DFA is a simple, concrete algorithm for recognizing strings.

Every regular expression and every NFA can be converted to a DFA accepting the same language.
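Because each state has at most one move per symbol, running a DFA is a single loop with no guessing. The sketch below uses a standard four-state DFA for (a|b)∗abb; the state numbering and table layout are assumptions made here and need not match the diagram in the slides.

```python
# Assumed DFA for (a|b)*abb: state -> {symbol: next state}.
DFA = {
    0: {"a": 1, "b": 0},
    1: {"a": 1, "b": 2},
    2: {"a": 1, "b": 3},
    3: {"a": 1, "b": 0},
}
START, ACCEPTING = 0, {3}


def dfa_accepts(text):
    """Follow exactly one transition per input character."""
    state = START
    for ch in text:
        state = DFA[state].get(ch)
        if state is None:  # no edge for this symbol: reject
            return False
    return state in ACCEPTING


for sample in ["abb", "ababb", "abba", "abc"]:
    print(sample, dfa_accepts(sample))
```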

1st DFA Example

The DFA on this slide (diagram not preserved in this transcript) corresponds to the 1st NFA example and recognizes the language of the regular expression (a|b)∗abb

2nd DFA Example

In the diagram for this example (not preserved in this transcript), letter and digit are defined as follows:

letter = [a-zA-Z]

digit = [0-9]

The DFA recognizes all strings that begin with a letter and continue with any sequence of letters and/or digits.

3rd DFA Example

The DFA on this slide (diagram not preserved in this transcript) corresponds to the 3rd NFA example and recognizes the language of the regular expression ab+ | ab∗ | b∗

4th DFA Example

This is a DFA that recognizes C comments (not nested); the diagram is not preserved in this transcript.

The "other" transition from state 3 to itself stands for [^*]

The "other" transition from state 4 to state 3 stands for [^*/]

The meaning of each state:

  state 1: start

  state 2: entering comment

  state 3: in comment

  state 4: exiting comment

  state 5: finish
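The five states can be hard-coded as a sketch. Only the two "other" transitions are spelled out in the text; the remaining edges used below (/ from 1 to 2, * from 2 to 3, * from 3 to 4, * from 4 to itself, / from 4 to 5) are assumed from the usual form of this automaton, since the diagram itself is missing.

```python
def is_c_comment(text):
    """True if the whole text is one non-nested C comment, e.g. /* ... */."""
    state = 1  # start
    for ch in text:
        if state == 1:
            state = 2 if ch == "/" else None
        elif state == 2:                      # entering comment
            state = 3 if ch == "*" else None
        elif state == 3:                      # in comment
            state = 4 if ch == "*" else 3     # "other" is [^*]
        elif state == 4:                      # exiting comment
            if ch == "/":
                state = 5
            elif ch == "*":
                state = 4
            else:
                state = 3                     # "other" is [^*/]
        else:                                 # state 5 has no outgoing edges
            state = None
        if state is None:
            return False
    return state == 5  # finish


for sample in ["/* hi */", "/***/", "/* a * b */", "/* open", "/**/x"]:
    print(repr(sample), is_c_comment(sample))
```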

Thompson’s Construction

Thompson’s Construction uses ε-transitions to "glue together" the NFA machines of each piece of a regular expression to form an NFA machine that corresponds to the whole expression.

The construction is inductive, and it follows the structure of the definition of a regular expression.

We exhibit an NFA for each basic regular expression and then show how each regular expression operation can be achieved by connecting together the NFAs of the subexpressions.

Thompson’s Construction

Each NFA constructed for a subexpression in Thompson’s algorithm has exactly one initial state and one final state.

There may be multiple NFAs that recognize the language of a regular expression; Thompson’s construction generates only one of them.
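The inductive idea can be sketched as follows. The regular expression is given as a nested tuple tree rather than parsed from text, every fragment gets a fresh start state and a fresh accepting state, and fragments are joined only by ε-edges. This is a simplified variant (textbook versions usually merge states for concatenation), and the class and function names are hypothetical.

```python
from collections import defaultdict
from itertools import count

EPSILON = ""  # label used here for ε-transitions


class Thompson:
    """Builds NFA fragments, each with one start and one accepting state."""

    def __init__(self):
        self.delta = defaultdict(set)  # (state, symbol) -> set of next states
        self._ids = count()

    def build(self, node):
        """Return (start, accept) for a regex tree given as nested tuples."""
        kind = node[0]
        start, accept = next(self._ids), next(self._ids)
        if kind == "sym":                     # single character (or EPSILON)
            self.delta[(start, node[1])].add(accept)
        elif kind == "alt":                   # r|s
            for child in node[1:]:
                c_start, c_accept = self.build(child)
                self.delta[(start, EPSILON)].add(c_start)
                self.delta[(c_accept, EPSILON)].add(accept)
        elif kind == "cat":                   # rs
            r_start, r_accept = self.build(node[1])
            s_start, s_accept = self.build(node[2])
            self.delta[(start, EPSILON)].add(r_start)
            self.delta[(r_accept, EPSILON)].add(s_start)
            self.delta[(s_accept, EPSILON)].add(accept)
        elif kind == "star":                  # r*
            c_start, c_accept = self.build(node[1])
            self.delta[(start, EPSILON)] |= {c_start, accept}
            self.delta[(c_accept, EPSILON)] |= {c_start, accept}
        else:
            raise ValueError(f"unknown node kind: {kind}")
        return start, accept


def closure(states, delta):
    """ε-closure of a set of states."""
    seen, stack = set(states), list(states)
    while stack:
        for nxt in delta.get((stack.pop(), EPSILON), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def accepts(text, delta, start, accept):
    """Simulate the NFA on `text` by tracking all reachable states."""
    current = closure({start}, delta)
    for ch in text:
        moved = {t for s in current for t in delta.get((s, ch), ())}
        current = closure(moved, delta)
    return accept in current


# (a|b)*a as a tree
regex = ("cat", ("star", ("alt", ("sym", "a"), ("sym", "b"))), ("sym", "a"))
builder = Thompson()
start, accept = builder.build(regex)
print(accepts("bba", builder.delta, start, accept))
```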

Thompson’s Construction Example

This example demonstrates the steps to generate an NFA for the regular expression (a|b)∗a. An NFA is built for each subexpression in turn (the diagrams are not preserved in this transcript):

a

b

a|b

(a|b)∗

(a|b)∗a

NFA → DFA with Subset Construction

Key Functions

Move(si, a) is the set of states reachable from si by a transition on input symbol a

ε-closure(si) is the set of states reachable from si by a series of zero or more ε-transitions.

The algorithm for constructing a DFA from an NFA

Step 1: The start state of the DFA is derived from s0 of the NFA

Step 2: Take its ε-closure: S0 = ε-closure(s0)

Step 3: Take the image of S0, Move(S0, α), for each α ∈ Σ, and take its ε-closure

Step 4: Iterate until no more states are added
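The four steps can be sketched with a worklist. The NFA used as input is an assumed five-state machine for aa∗|bb∗ with two ε-edges out of its start state; the dictionary encoding and function names are illustrative choices, and each DFA state is represented as a frozenset of NFA states.

```python
from collections import deque

EPSILON = ""  # label used here for ε-transitions


def epsilon_closure(states, delta):
    """States reachable from `states` by zero or more ε-transitions."""
    closure, stack = set(states), list(states)
    while stack:
        for nxt in delta.get((stack.pop(), EPSILON), ()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return closure


def move(states, symbol, delta):
    """States reachable from `states` by one transition on `symbol`."""
    return {t for s in states for t in delta.get((s, symbol), ())}


def subset_construction(delta, start, accepting, alphabet):
    """Return (dfa_delta, dfa_start, dfa_accepting) for the given NFA."""
    dfa_start = frozenset(epsilon_closure({start}, delta))  # steps 1 and 2
    dfa_delta, seen, work = {}, {dfa_start}, deque([dfa_start])
    while work:                                             # step 4
        current = work.popleft()
        for symbol in alphabet:                             # step 3
            target = frozenset(
                epsilon_closure(move(current, symbol, delta), delta))
            if not target:
                continue  # no transition on this symbol (error)
            dfa_delta[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                work.append(target)
    dfa_accepting = {s for s in seen if s & accepting}
    return dfa_delta, dfa_start, dfa_accepting


# Assumed NFA for aa*|bb*
NFA = {
    (0, EPSILON): {1, 3},
    (1, "a"): {2}, (2, "a"): {2},
    (3, "b"): {4}, (4, "b"): {4},
}
dfa_delta, dfa_start, dfa_accepting = subset_construction(NFA, 0, {2, 4}, "ab")
for (src, symbol), dst in dfa_delta.items():
    print(sorted(src), symbol, "->", sorted(dst))
```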

NFA → DFA Example

[Figures: step-by-step example of the subset construction; not preserved in this transcript]

Implementation of DFA

Since we already have the DFA that recognizes all tokens, the next step is to build the lexical analyzer from the DFA.

One approach is called table driven.

The table-driven approach uses the transition table of the DFA to direct the progress of the algorithm.

Advantages:

  the size of the code is reduced

  the same code will work for many different problems

  the code is easy to maintain

Disadvantages:

  the transition table can become very large, causing a significant increase in the space used by the program

Table Driven Approach

Algorithm

state := 1;
ch := next input character;
while not Accept[state] and not error(state) do
    newstate := T[state, ch];
    if Advance[state, ch] then ch := next input character;
    state := newstate;
end while
if Accept[state] then accept;
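The algorithm can be tried on a small case. The sketch below follows the same loop for an assumed identifier DFA (a letter followed by letters and digits), where the accepting state is entered on a lookahead character that is not consumed. The table contents, state names, and the helper char_class are assumptions made for illustration.

```python
from string import ascii_letters, digits

ERROR, START, IN_ID, DONE = 0, 1, 2, 3


def char_class(ch):
    """Map a character (or None at end of input) to a table column."""
    if ch is not None and ch in ascii_letters:
        return "letter"
    if ch is not None and ch in digits:
        return "digit"
    return "other"


# T[state][class] is the next state; ADVANCE says whether to consume ch.
T = {
    START: {"letter": IN_ID, "digit": ERROR, "other": ERROR},
    IN_ID: {"letter": IN_ID, "digit": IN_ID, "other": DONE},
}
ADVANCE = {
    START: {"letter": True, "digit": False, "other": False},
    IN_ID: {"letter": True, "digit": True, "other": False},
}


def scan_identifier(text, pos=0):
    """Return (lexeme, next_pos), or None if no identifier starts at pos."""
    def peek(i):
        return text[i] if i < len(text) else None

    begin, state, ch = pos, START, peek(pos)
    while state not in (DONE, ERROR):
        column = char_class(ch)
        new_state = T[state][column]
        if ADVANCE[state][column]:
            pos += 1
            ch = peek(pos)
        state = new_state
    return (text[begin:pos], pos) if state == DONE else None


print(scan_identifier("count1 = 0"))
print(scan_identifier("9lives"))
```

Changing the language recognized only requires new tables; the driver loop stays the same, which is the maintainability advantage noted above.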