4/14/20151 regular expressions and automata lecture 3: regular expressions and automata husni...
Post on 14-Dec-2015
247 Views
Preview:
TRANSCRIPT
04/18/23 1
REGULAR EXPRESSIONS AND AUTOMATA
Lecture 3: REGULAR EXPRESSIONS AND AUTOMATA
Husni Al-Muhtaseb
04/18/23 2
الرحيم الرحمن الله بسمICS 482: Natural Language
ProcessingLecture 3: REGULAR EXPRESSIONS
AND AUTOMATA
Husni Al-Muhtaseb
NLP Credits and Acknowledgment
These slides were adapted from presentations of the Authors of the
bookSPEECH and LANGUAGE PROCESSING:
An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition
and some modifications from presentations found in the WEB by
several scholars including the following
NLP Credits and AcknowledgmentIf your name is missing please contact me
muhtasebAt
Kfupm.Edu.sa
NLP Credits and AcknowledgmentHusni Al-Muhtaseb
James MartinJim MartinDan JurafskySandiway FongSong young inPaula MatuszekMary-Angela PapalaskariDick Crouch Tracy KinL. Venkata
SubramaniamMartin Volk Bruce R. MaximJan HajičSrinath SrinivasaSimeon NtafosPaolo PirjanianRicardo VilaltaTom Lenaerts
Heshaam Feili Björn GambäckChristian Korthals Thomas G.
DietterichDevika
SubramanianDuminda
Wijesekera Lee McCluskey David J. KriegmanKathleen McKeownMichael J. CiaraldiDavid FinkelMin-Yen KanAndreas Geyer-
Schulz Franz J. KurfessTim FininNadjet BouayadKathy McCoyHans Uszkoreit Azadeh Maghsoodi
Khurshid Ahmad
Staffan Larsson
Robert Wilensky
Feiyu Xu
Jakub Piskorski
Rohini Srihari
Mark Sanderson
Andrew Elks
Marc Davis
Ray Larson
Jimmy Lin
Marti Hearst
Andrew McCallum
Nick KushmerickMark CravenChia-Hui ChangDiana MaynardJames Allan
Martha Palmerjulia hirschbergElaine RichChristof Monz Bonnie J. DorrNizar HabashMassimo PoesioDavid Goss-GrubbsThomas K HarrisJohn HutchinsAlexandros PotamianosMike RosnerLatifa Al-Sulaiti Giorgio Satta Jerry R. HobbsChristopher ManningHinrich SchützeAlexander GelbukhGina-Anne Levow Guitao GaoQing MaZeynep Altan
04/18/23 6
Agenda: REGULAR EXPRESSIONS AND AUTOMATA
• Why to study it?– Talk to ALICE
• Regular expressions
• Finite State Automata
• Assignments
04/18/23 7
NLP Example: Chat with Alice• http://www.pandorabots.com/pandora/talk
?botid=f5d922d97e345aa1&skin=custom_input
• A.L.I.C.E. (Artificial Linguistic Internet Computer Entity) is an award-winning free natural language artificial intelligence chat robot. The software used to create A.L.I.C.E. is available as free ("open source") Alicebot and AIML software.
• http://www.alicebot.org/about.html
04/18/23 8
NLP Representations
• State Machines– FSAs: Finite State Automata– FSTs: Finite State Transducers– HMMs: Hidden Markov Model– ATNs: Augmented Transition Network– RTNs: Recursive Transition Network
04/18/23 9
NLP Representations
• Rule Systems: – CFGs: Context Free Grammar– Unification Grammars– Probabilistic CFGs
• Logic-based Formalisms– 1st Order Predicate Calculus– Temporal and other Higher Order Logics
• Models of Uncertainty– Bayesian Probability Theory
04/18/23 10
NLP Algorithms
• Most are transducers: accept or reject input, and construct new structure from input– State space search
• To manage the problem of making choices during processing when we lack the information needed to make the right choice
– Dynamic programming• To avoid having to redo work during the course
of a state-space search
04/18/23 11
State Space Search
• States represent pairings of partially processed inputs with partially constructed answers
• Goals are exhausted inputs paired with complete answers that satisfy some criteria
• The spaces are normally too large to exhaustively explore
04/18/23 12
Dynamic Programming
• Don’t do the same work over and over
• Avoid this by building and making use of solutions to sub-problems that must be invariant across all parts of the space
04/18/23 13
Regular Expressions and Text Searching
• Regular expression (RE): A formula (in a special language) for specifying a set of strings
• String: A sequence of alphanumeric characters (letters, numbers, spaces, tabs, and punctuation)
04/18/23 14
Regular Expression Patterns
• Regular Expression can be considered as a pattern to specify text search strings to search a corpus of texts
• What is Corpus?
• For text search purpose: use Perl syntax
• Show the exact part of the string in a line that first matches a Regular Expression pattern
04/18/23 15
Regular Expression Patterns
RE String matched
/woodchucks/ “interesting links to woodchucks and lemurs”
/a/ “Sarah Ali stopped by Mona’s”
/Ali says,/ “My gift please,” Ali says,”
/book/ “all our pretty books”
/!/ “Leave him behind!” said Sami
04/18/23 16
RE Match
/[wW]oodchuck/ Woodchuck or woodchuck
/[abc]/ “a”, “b”, or “c”
/[0123456789]/ Any digit
04/18/23 17
RE Description
/a*/ Zero or more a’s
/a+/ One or more a’s
/a?/ Zero or one a’s
/cat|dog/ ‘cat’ or ‘dog’
/^cat$/ A line containing only ‘cat’
/\bun\B/ Beginnings of longer strings starts by ‘un’
04/18/23 18
Example
• Find all instances of the word “the” in a text.– /the/
•What About ‘The’
– /[tT]he/•What about ‘Theater”, ‘Another’
– /\b[tT]he\b/
04/18/23 19
Sidebar: Errors
• The process we just went through was based on two fixing kinds of errors– Matching strings that we should not have
matched (there, then, other)• False positives
– Not matching things that we should have matched (The)
• False negatives
04/18/23 20
Sidebar: Errors
• Reducing the error rate for an application often involves two efforts – Increasing accuracy (minimizing false
positives)– Increasing coverage (minimizing false
negatives)
04/18/23 21
Regular expressions• Basic regular expression patterns• Perl-based syntax (slightly different from
other notations for regular expressions)• Disjunctions [abc]• Ranges [A-Z]• Negations [^Ss]• Optional characters ? and *• Wild cards .• Anchors ^ and $, also \b and \B• Disjunction, grouping, and precedence |
04/18/23 22
04/18/23 23
04/18/23 24
04/18/23 25
Writing correct expressions
• Exercise: write a Perl regular expression to match the English article “the”:
/the/ missed ‘The’
included ‘the’ in ‘others’/[tT]he/
/\b[tT]he\b/ Missed ‘the25’ ‘the_’
/[^a-zA-Z][tT]he[^a-zA-Z]/Missed ‘The’ at the beginning of a line
/(^|[^a-zA-Z])[tT]he[^a-zA-Z]/
04/18/23 26
A more complex example
• Exercise: Write a regular expression that will match “any PC with more than 500MHz and 32 Gb of disk space for less than $1000”:
04/18/23 27
Example• Price
– /$[0-9]+/ # whole dollars – /$[0-9]+\.[0-9][0-9]/ # dollars and cents – /$[0-9]+(\.[0-9][0-9])?/ #cents optional – /\b$[0-9]+(\.[0-9][0-9])?\b/ #word boundaries
• Specifications for processor speed – /\b[0-9]+ *(MHz|[Mm]egahertz|Ghz|[Gg]igahertz)\b/
• Memory size – /\b[0-9]+ *(Mb|[Mm]egabytes?)\b/ – /\b[0-9](\.[0-9]+) *(Gb|[Gg]igabytes?)\b/
• Vendors – /\b(Win95|WIN98|WINNT|WINXP *(NT|95|98|2000|XP)?)\b/
– /\b(Mac|Macintosh|Apple)\b/
04/18/23 28
Advanced Operators
Underscore: Correct figure 2.6
04/18/23 29
04/18/23 30
04/18/23 31
Assignment: Try regular expressions in MS WORD in both
Arabic & English
04/18/23 32
baa! baaa! baaaa! baaaaa! ...
Finite State Automata
• FSAs recognize the regular languages represented by regular expressions– SheepTalk: /baa+!/
• Directed graph with labeled nodes and arc transitions
•Five states: q0 the start state, q4 the final state, 5 transitions
q0 q4q1 q2 q3
b a aa !
04/18/23 33
Formally
• FSA is a 5-tuple consisting of– Q: set of states {q0,q1,q2,q3,q4} : an alphabet of symbols {a,b,!}
– q0: A start state
– F: a set of final states in Q {q4} (q,i): a transition function mapping Q x
to Q
q0 q4q1 q2 q3
b a
a
a !
04/18/23 34
• FSA recognizes (accepts) strings of a regular language– baa!– baaa!– baaaa!– …
• Tape Input: a rejected input
a b a ! b
q0 q4q1 q2 q3b a
aa !
04/18/23 35
State Transition Table for SheepTalk
StateInput
b a !
0 1 Ø Ø
1 Ø 2 Ø
2 Ø 3 Ø
3 Ø 3 4
4 Ø Ø Ø
baa! baaa! baaaa! baaaaa! ...
q0 q4q1 q2 q3
b aa
a !
04/18/23 36
Non-Deterministic FSAs for SheepTalk
q0 q4q1 q2 q3
b a a a !
q0 q4q1 q2 q3
b a a !
04/18/23 37
• A language is a set of strings
• String: A sequence of letters
– Examples: “cat”, “dog”, “house”, …
– Defined over an alphabet:
Languages
zcba ,,,,
04/18/23 38
Alphabets and Strings
• We will use small alphabets:
• Strings
abbaw
bbbaaav
abu
ba,
baaabbbaaba
baba
abba
ab
a
04/18/23 39
Finite AutomatonInput
String
•
Output
String
FiniteAutomaton
04/18/23 40
Finite Accepter
• Input
“Accept” or“Reject”
String
FiniteAutomaton
Output
04/18/23 41
Transition Graph
•
initialstate
final state“accept”state
transition
abba -Finite Accepter
0q 1q 2q 3q 4qa b b a
5q
a a bb
ba,
ba,
04/18/23 42
Initial Configuration
•
1q 2q 3q 4qa b b a
5q
a a bb
ba,
Input Stringa b b a
ba,
0q
04/18/23 43
Reading the Input
•
0q 1q 2q 3q 4qa b b a
5q
a a bb
ba,
a b b a
ba,
04/18/23 44
•
0q 1q 2q 3q 4qa b b a
5q
a a bb
ba,
a b b a
ba,
04/18/23 45
•
0q 1q 2q 3q 4qa b b a
5q
a a bb
ba,
a b b a
ba,
04/18/23 46
•
0q 1q 2q 3q 4qa b b a
5q
a a bb
ba,
a b b a
ba,
04/18/23 47
0q 1q 2q 3q 4qa b b a
Output: “accept”
5q
a a bb
ba,
a b b a
ba,
04/18/23 48
Rejection
•
1q 2q 3q 4qa b b a
5q
a a bb
ba,
a b a
ba,
0q
04/18/23 49
•
0q 1q 2q 3q 4qa b b a
5q
a a bb
ba,
a b a
ba,
04/18/23 50
•
0q 1q 2q 3q 4qa b b a
5q
a a bb
ba,
a b a
ba,
04/18/23 51
•
0q 1q 2q 3q 4qa b b a
5q
a a bb
ba,
a b a
ba,
04/18/23 52
0q 1q 2q 3q 4qa b b a
5q
a a bb
ba,
Output:“reject”
a b a
ba,
04/18/23 53
Another Example
a
b ba,
ba,
0q 1q 2q
a ba
04/18/23 54
a
b ba,
ba,
0q 1q 2q
a ba
04/18/23 55
a
b ba,
ba,
0q 1q 2q
a ba
04/18/23 56
a
b ba,
ba,
0q 1q 2q
a ba
04/18/23 57
a
b ba,
ba,
0q 1q 2q
a ba
Output: “accept”
04/18/23 58
Rejection
a
b ba,
ba,
0q 1q 2q
ab b
04/18/23 59
a
b ba,
ba,
0q 1q 2q
ab b
04/18/23 60
a
b ba,
ba,
0q 1q 2q
ab b
04/18/23 61
a
b ba,
ba,
0q 1q 2q
ab b
04/18/23 62
a
b ba,
ba,
0q 1q 2q
ab b
Output: “reject”
04/18/23 63
Formalities
• Deterministic Finite Accepter (DFA) FqQM ,,,, 0
Q
0q
F
: set of states
: input alphabet
: transition function
: initial state
: set of final states
04/18/23 64
About Alphabets
• Alphabets means we need a finite set of symbols in the input.
• These symbols can and will stand for bigger objects that can have internal structure.
04/18/23 65
Input Aplhabet
•
0q 1q 2q 3q 4qa b b a
5q
a a bb
ba,
ba,
ba,
04/18/23 66
Set of States
Q
0q 1q 2q 3q 4qa b b a
5q
a a bb
ba,
543210 ,,,,, qqqqqqQ
ba,
04/18/23 67
Initial State
0q
1q 2q 3q 4qa b b a
5q
a a bb
ba,
ba,
0q
04/18/23 68
Set of Final States
F
0q 1q 2q 3qa b b a
5q
a a bb
ba,
4qF
ba,
4q
04/18/23 69
Transition Function
0q 1q 2q 3q 4qa b b a
5q
a a bb
ba,
QQ :
ba,
04/18/23 70
10 , qaq
2q 3q 4qa b b a
5q
a a bb
ba,
ba,
0q 1q
04/18/23 71
50 , qbq
1q 2q 3q 4qa b b a
5q
a a bb
ba,
ba,
0q
04/18/23 72
0q 1q 2q 3q 4qa b b a
5q
a a bb
ba,
ba,
32 , qbq
04/18/23 73
Transition Function
0q 1q 2q 3q 4qa b b a
5q
a a bb
ba,
a b
0q
1q
2q
3q
4q
5q
1q 5q
5q 2q
2q 3q
4q 5q
ba,5q5q5q5q
04/18/23 74
Extended Transition Function(Reads the entire string)
*
QQ *:*
0q 1q 2q 3q 4qa b b a
5q
a a bb
ba,
ba,
04/18/23 75
20 ,* qabq
3q 4qa b b a
5q
a a bb
ba,
ba,
0q 1q 2q
04/18/23 76
40 ,* qabbaq
0q 1q 2q 3q 4qa b b a
5q
a a bb
ba,
ba,
04/18/23 77
50 ,* qabbbaaq
1q 2q 3q 4qa b b a
5q
a a bb
ba,
ba,
0q
04/18/23 78
50 ,* qabbbaaq
1q 2q 3q 4qa b b a
5q
a a bb
ba,
ba,
0q
Observation: There is a walk from to with label
0q 5qabbbaa
04/18/23 79
Example
0q 1q 2q 3q 4qa b b a
5q
a a bb
ba,
ba,
abbaML M
accept
04/18/23 80
Another Example
0q 1q 2q 3q 4qa b b a
5q
a a bb
ba,
ba,
abbaabML ,, M
acceptacceptaccept
04/18/23 81
More Examples
a
b ba,
ba,
0q 1q 2q
}0:{ nbaML n
accept trap state
04/18/23 82
ML = { all substrings with prefix }ab
a b
ba,
0q 1q 2q
accept
ba,3q
ab
04/18/23 83
ML = { all strings without substring }001
0 00 001
1
0
1
10
0 1,0
04/18/23 84
Regular Languages
• A language is regular if there is
• a DFA such that
• All regular languages form a language family
LM MLL
04/18/23 85
Example• The language
• is regular:
*,: bawawaL
a
b
ba,
a
b
ba
0q 2q 3q
4q
04/18/23 86
Finite State Automata
• Regular expressions can be viewed as a textual way of specifying the structure of finite-state automata.
04/18/23 87
More Formally
• You can specify an FSA by enumerating the following things.– The set of states: Q– A finite alphabet: Σ– A start state– A set of accept/final states– A transition function that maps QxΣ to Q
04/18/23 88
Dollars and Cents
04/18/23 89
Assignment 2 - Part 1
• A windows-based version of Python interpreter is available at the supplementary material section of the course website. Please download the interpreter and practice it. Use the help, tutorials and available documentation to investigate the possibility of using Arabic text. summarize your findings.
04/18/23 90
Assignment 2 - Part 2
• Practice search in Ms Word using regular expressions (Wildcards) for both Arabic and English. Submit at least 5 nontrivial examples.
04/18/23 91
Assignment 2 - Part 3
• You have been asked to participate in writing an exam about chapter 2 of the textbook. Write one question to check student understanding of chapter two material. Include the answer in your submission.
04/18/23 92
Thank you
الله ورحمة عليكم السالم
top related