Computational Language
Finite State Machines and Regular Expressions
Plan
Regular expressions
    Introduction
    Operators
    Disjunction, precedence, substitution
Finite State Machines
    Link with regular expressions
    Deterministic FSA
    Non-deterministic FSA
Lab session
    Regular expression implementation in UNIX (egrep)
Regular Expressions
Basis of all web-based and word-processor-based searches
Definition 1. An algebraic notation for describing a string
Definition 2. A set of rules that you can use to specify one or more items, such as words in a file, by using a single character string (Sarwar et al.)
Regular Expressions
Two ingredients: a regular expression and a text corpus to search
Regular expression algebra has variants: Perl, Unix tools
Unix tools: egrep, sed, awk
Regular Expressions
Find occurrences of /Nokia/ in the text
    egrep -n 'Nokia' nokia_corpus.txt
Regular Expressions
    egrep -n 'Nokia' nokia_corpus.txt
1:.Nokia shares slide after warning
4:HELSINKI (Reuters) - Nokia has cut its sales growth forecast for
7:markets sharply down.Nokia warned group sales would grow only
13:better than expected first-quarter profits from Nokia,
15:Finland's Nokia and rivals have been hit by debt-laden telecoms
19:Nokia said in a statement. "The speed of this transition has been
20:slower than was anticipated earlier this year." Nokia saw its market
26:"The problem with Nokia is that it looks like its going ex-growth,"
29:with a raft of new functions, was hurting. "Nokia had been perceived
36:Nokia cast another shadow over the sector by slashing its forecast for
41:be sold this year. "Nokia now believes that general weakness in all key
43:Nokia said. The market was caught by surprise, especially as Nokia had
46:said Nokia had been "a bit optimistic overall" in its forecasts. "We
49:adjust to weaker demand, Nokia followed the path of rivals in announcing
51:thousands of jobs in the group last year. Despite the bleak outlook, Nokia
57:Nokia also warned second quarter sales would grow only between two and
61:operating efficiencies, strong brand and leading product portfolio," Nokia
62:said. Nokia said it expected pro forma earnings per share (EPS) of 0.18-0.20
67:protecting the margins -- but Nokia has to be a top-line growth story as well,
69:analyst Susan Anthony.But Nokia, known for its strength in forecasting the
79:Nokia's own forecast. Nokia's January-March net sales came in worse than the
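The numbered output of egrep -n can be imitated in a few lines of Python. This is only a sketch: it uses Python's re module as a stand-in for egrep, and the function name grep_n and the three sample lines are made up for illustration (they are not the real nokia_corpus.txt).

```python
import re

def grep_n(pattern, lines):
    """Return (line_number, line) pairs for lines containing a match, like egrep -n."""
    regex = re.compile(pattern)
    return [(i, line) for i, line in enumerate(lines, start=1) if regex.search(line)]

# Hypothetical sample text, not the real corpus file
sample = [
    "Nokia shares slide after warning",
    "weak demand, sending its shares 12 percent lower",
    "HELSINKI (Reuters) - Nokia has cut its sales growth forecast",
]

for number, line in grep_n("Nokia", sample):
    print(f"{number}:{line}")  # prints lines 1 and 3 with their numbers
```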
Regular Expressions Suppress case distinctions
Nokia or nokia
Regular Expressions
Set operator
    egrep -n '[Nn]okia' nokia_corpus.txt
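As a quick check of what the set operator accepts, the same pattern can be tried in Python's re module (used here as a stand-in for egrep; the word list is invented for illustration).

```python
import re

pattern = re.compile(r"[Nn]okia")

# Hypothetical test words: only the first letter may vary between N and n
words = ["Nokia", "nokia", "NOKIA", "Mokia"]
matched = [w for w in words if pattern.search(w)]
print(matched)  # ['Nokia', 'nokia']
```

The set covers one character position only, so fully capitalised NOKIA is not found.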
Regular Expressions Suppress other features, for
example singular share or plural shares
Regular Expressions
Optional operator
    egrep -n 'shares?' nokia_corpus.txt
Regular Expressions
    egrep -n 'shares?' nokia_corpus.txt
1:.Nokia shares slide after warning
6:weak demand, sending its shares 12 percent lower and European
62:said. Nokia said it expected pro forma earnings per share (EPS) of 0.18-0.20
85:lion share of the company's sales and earnings, saw sales fall seven percent
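The optional operator can be explored in the same way. The sketch below uses Python's re module in place of egrep, with a made-up word list. Like egrep, re.search looks for the pattern anywhere in the string, so longer words that contain "share" are found too.

```python
import re

pattern = re.compile(r"shares?")

# Hypothetical test words
words = ["share", "shares", "shared", "shar"]
found = {w: bool(pattern.search(w)) for w in words}
print(found)
# {'share': True, 'shares': True, 'shared': True, 'shar': False}
```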
Regular Expressions Kleene operators:
/string*/ “zero or more occurrences of previous character”
/string+/ “1 or more occurrences of previous character”
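The difference between the two Kleene operators shows up when the repeated character may be missing altogether. A small sketch with Python's re module (a stand-in for egrep; the patterns /baa*!/ and /baa+!/ and the test strings are chosen here for illustration), using fullmatch so that the whole string has to fit the pattern:

```python
import re

star = re.compile(r"baa*!")  # b, a, then zero or more a, then !
plus = re.compile(r"baa+!")  # b, a, then one or more a, then !

candidates = ["b!", "ba!", "baa!", "baaa!"]
star_hits = [s for s in candidates if star.fullmatch(s)]
plus_hits = [s for s in candidates if plus.fullmatch(s)]

print(star_hits)  # ['ba!', 'baa!', 'baaa!']
print(plus_hits)  # ['baa!', 'baaa!']
```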
Regular Expressions Wildcard operator:
/string./ “any character after the previous character”
Combine wildcard and Kleene:
/string.*/ “zero or more instances of any character after the previous character”
/string.+/ “one or more instances of any character after the previous character”
Regular Expressions
    egrep -n 'profit.*' nokia_corpus.txt
13:better than expected first-quarter profits from Nokia,
52:remains the only profitable handset maker among the "big three" suppliers
60:company's profitability outlook remains strong, driven by increasing
81:Pre-tax profit was 1.31 billion euros.The company's struggling networks unit
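To see how much text each wildcard pattern actually covers, the matched span can be printed. This sketch uses Python's re module as a stand-in for egrep. The first string is output line 13 above; the second is a shortened, hypothetical line that ends right after "profit".

```python
import re

line = "better than expected first-quarter profits from Nokia,"
short = "Pre-tax profit"  # hypothetical line ending right after 'profit'

# .* runs on to the end of the line
span = re.search(r"profit.*", line).group()
print(span)  # profits from Nokia,

# . and .+ need at least one character after 'profit'; .* does not
dot_match = re.search(r"profit.", short)
plus_match = re.search(r"profit.+", short)
star_match = re.search(r"profit.*", short)
print(dot_match, plus_match, star_match.group())  # None None profit
```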
Regular Expressions Anchors
Beginning of line operator: ^
    egrep '^said' nokia_corpus.txt
End of line operator: $
    egrep 'said$' nokia_corpus.txt
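A short sketch of the two anchors, again with Python's re module standing in for egrep. The ^ goes before the string and the $ goes after it. The three sample lines are invented for illustration.

```python
import re

# Hypothetical sample lines
lines = [
    "said Nokia had been a bit optimistic",
    "Nokia said in a statement.",
    "leading product portfolio, Nokia said",
]

starts = [l for l in lines if re.search(r"^said", l)]  # 'said' at the start of the line
ends = [l for l in lines if re.search(r"said$", l)]    # 'said' at the end of the line

print(starts)  # ['said Nokia had been a bit optimistic']
print(ends)    # ['leading product portfolio, Nokia said']
```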
Regular Expressions Disjunction:
Set operator
    /[Ss]tring/ “a string which begins with either S or s”
Range
    /[A-Z]tring/ “a string beginning with a capital letter”
Pipe |
    /string1|string2/ “either string 1 or string 2”
Regular Expressions Disjunction
    egrep -n 'weak|warning|drop' nokia_corpus.txt
    egrep -n 'weak.*|warn.*|drop.*' nokia_corpus.txt
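The two disjunction patterns select the same lines but match different stretches of text. A sketch with Python's re module (a stand-in for egrep; the sample sentence is made up) makes the difference visible by listing the matched substrings:

```python
import re

# Hypothetical sample line
line = "Nokia shares slide after warning of weak demand"

plain = re.findall(r"weak|warning|drop", line)
greedy = re.findall(r"weak.*|warn.*|drop.*", line)

print(plain)   # ['warning', 'weak']
print(greedy)  # ['warning of weak demand']
```

With .* added, the first alternative that matches swallows the rest of the line, so "weak" is no longer reported separately.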
Regular Expressions
Negation:
    /[^a-z]tring/ “any string that does not begin with a lowercase letter”
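A sketch of the negated set in Python's re module (a stand-in for egrep; the word list is invented). Note that [^a-z] still has to match one character, so it accepts a digit or a capital but not a missing character.

```python
import re

pattern = re.compile(r"[^a-z]tring")

# Hypothetical test words
words = ["String", "string", "9tring", "tring"]
accepted = [w for w in words if pattern.search(w)]
print(accepted)  # ['String', '9tring']
```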
Regular Expressions Precedence
1. Parentheses
2. Kleene and optional operators * + ?
3. Anchors and sequences
4. Disjunction operator |

(a) /supply|iers/ matches /supply/ or /iers/
(b) /suppl(y|iers)/ matches /supply/ or /suppliers/
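The effect of precedence in examples (a) and (b) can be checked directly. The sketch below uses Python's re module as a stand-in for egrep and fullmatch so that the whole word must fit the pattern. The patterns are written without the spaces around the pipe.

```python
import re

without_parens = re.compile(r"supply|iers")   # sequence binds tighter than |
with_parens = re.compile(r"suppl(y|iers)")    # parentheses limit the scope of |

words = ["supply", "iers", "suppliers"]
a_hits = [w for w in words if without_parens.fullmatch(w)]
b_hits = [w for w in words if with_parens.fullmatch(w)]

print(a_hits)  # ['supply', 'iers']
print(b_hits)  # ['supply', 'suppliers']
```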
Regular Expressions Substitution
    sed 's/word1/word2/' corpus.txt
Me: I am feeling a bit depressed today
    sed 's/I am/sorry to hear that you are/' corpus.txt
Eliza: sorry to hear that you are feeling a bit depressed today
Regular Expressions Substitution
    sed 's/word1/word2/' corpus.txt
Me: I wish I could shake this depression
    sed 's/wish I/am sure you/' corpus.txt
Eliza: I am sure you could shake this depression
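The two sed substitutions can be combined into a tiny Eliza-style responder. This is a sketch only: it uses Python's re.sub in place of sed, the names RULES and respond are made up, and the policy of applying only the first rule that matches is an assumption made here so that the rules do not rewrite each other's output.

```python
import re

# (pattern, replacement) pairs taken from the two sed commands above
RULES = [
    (r"I am", "sorry to hear that you are"),
    (r"wish I", "am sure you"),
]

def respond(utterance):
    """Apply the first rule whose pattern occurs; replace only the first occurrence."""
    for pattern, replacement in RULES:
        if re.search(pattern, utterance):
            return re.sub(pattern, replacement, utterance, count=1)
    return utterance  # no rule applies: echo the input unchanged

print(respond("I am feeling a bit depressed today"))
# sorry to hear that you are feeling a bit depressed today
print(respond("I wish I could shake this depression"))
# I am sure you could shake this depression
```

count=1 mirrors sed without the g flag, which replaces only the first match on a line.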
Finite State Transition Networks
Finite State Automata (FSA)
Just like a regular expression, used to recognise a set of strings
    e.g. egrep -n 'baa+!' corpus.txt
Finite State Transition Networks
Finite State Automata (FSA)
Just like a regular expression, used to recognise a set of strings
Represented as a directed graph
    Set of nodes representing states
    Set of arcs (links between nodes) representing transitions between states
    Arcs are labelled
Finite State Automata How does it work?
Used to recognise a set of strings
Candidate input string represented as a segmented tape with a symbol for each cell
String fed into the machine one symbol at a time
If symbol on input matches symbol on arc, then
    A) move to next state
    B) advance one symbol on input string
    C) keep going until the input ends; accept if the machine is then in a final state
Otherwise: stop and reject string
Finite State Automata State Transition Table

State   Input
        b    a    !
0       1    Ø    Ø
1       Ø    2    Ø
2       Ø    3    Ø
3       Ø    3    4
4:      Ø    Ø    Ø
Finite State Automata Algorithm for FSA (Jurafsky and Martin, p. 37)
function D-RECOGNIZE(tape, machine) returns accept or reject index <- Beginning of tape current-state <- Initial state of machine loop if End of input has been reached then if current-state is an accept state then return accept else return reject elseif transition-table [current-state, tape [index]] is empty then return reject else Current-state <- transition-table [current-state, tape [index]] Index <- index + 1 end
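The pseudocode translates fairly directly into Python. The sketch below is one possible rendering, not the textbook's own code. It stores the state transition table for /baa+!/ as a dictionary and treats a missing entry as Ø. The names TRANSITIONS, ACCEPT_STATES and d_recognize are chosen here for illustration, and state 4 is assumed to be the only accept state.

```python
# (state, input symbol) -> next state; a missing key plays the role of Ø
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}
ACCEPT_STATES = {4}  # assumption: state 4 is the single accept state

def d_recognize(tape, transitions=TRANSITIONS, start=0, accept=ACCEPT_STATES):
    """Return True (accept) or False (reject) for the input string `tape`."""
    state = start
    for symbol in tape:                      # advance one cell at a time
        if (state, symbol) not in transitions:
            return False                     # empty table entry: reject
        state = transitions[(state, symbol)]
    return state in accept                   # end of input reached

for candidate in ["baa!", "baaaa!", "ba!", "baa", "baa!a"]:
    print(candidate, d_recognize(candidate))
# baa! True, baaaa! True, ba! False, baa False, baa!a False
```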
Finite State Automata
FSAs and recognition
FSAs and generation
    At each transition print out label of arc
    At final state stop printing
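Generation can be sketched by walking the arcs of the /baa+!/ machine and collecting the labels along the way. This is an illustration under assumptions: the names ARCS, FINAL and generate are made up, state 4 is taken as the final state, and a length bound is added because the self-loop on state 3 would otherwise produce strings forever.

```python
from collections import deque

# state -> list of (arc label, target state)
ARCS = {
    0: [("b", 1)],
    1: [("a", 2)],
    2: [("a", 3)],
    3: [("a", 3), ("!", 4)],
    4: [],
}
FINAL = {4}  # assumption: state 4 is the final state

def generate(max_length):
    """Collect every string of at most max_length symbols the machine can emit."""
    results = []
    queue = deque([(0, "")])  # (current state, labels emitted so far)
    while queue:
        state, output = queue.popleft()
        if state in FINAL:
            results.append(output)  # final state reached: stop emitting
            continue
        if len(output) >= max_length:
            continue                # bound reached: abandon this path
        for label, target in ARCS[state]:
            queue.append((target, output + label))
    return results

print(generate(6))  # ['baa!', 'baaa!', 'baaaa!']
```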
Finite State Automata
Deterministic FSAs
    An FSA whose recognition behaviour is fully determined by the state it is in and the input symbol it is looking at
Non-deterministic FSAs
    An FSA with decision points
Finite State Automata
Deterministic FSAs
Non-deterministic FSAs
    An FSA with decision points
    A state may have a self-loop and an outgoing arc with the same label
    Arcs may have ε transitions
Finite State Automata
Deterministic FSAs
Non-deterministic FSA: ways to handle decision points
    Backup: set a marker that can be returned to
    Look-ahead: look ahead at input
    Parallelism: look at alternative paths in parallel
Finite State Automata Non-deterministic FSA: state transition table

State   Input
        b    a      !    ε
0       1    Ø      Ø    Ø
1       Ø    2      Ø    Ø
2       Ø    2, 3   Ø    Ø
3       Ø    Ø      4    Ø
4:      Ø    Ø      Ø    Ø
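One way to run the non-deterministic table is the parallelism strategy: keep track of every state the machine could be in after each input symbol. The sketch below is a minimal illustration, not a general NFSA implementation. The names NFSA, ACCEPT and nd_recognize are made up, state 4 is assumed to be the accept state, and ε transitions are left out because the ε column of the table is empty.

```python
# (state, input symbol) -> set of possible next states; missing key means Ø
NFSA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},  # decision point: stay in state 2 or move on to state 3
    (3, "!"): {4},
}
ACCEPT = {4}  # assumption: state 4 is the accept state

def nd_recognize(tape):
    """Follow all alternative paths in parallel; accept if any path ends in an accept state."""
    current = {0}
    for symbol in tape:
        current = {t for s in current for t in NFSA.get((s, symbol), set())}
        if not current:
            return False  # every path has died
    return bool(current & ACCEPT)

for candidate in ["baa!", "baaaa!", "ba!", "baa"]:
    print(candidate, nd_recognize(candidate))
# baa! True, baaaa! True, ba! False, baa False
```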
Finite State Automata Formal language
Set of strings
Finite symbol set, alphabet
    Σ = {a, b, !}
L(m) = {baa!, baaa!, baaaa!, …} “formal language characterised by m”
    m = model
    L = formal language
A formal language models a fragment of a natural language