NLP (Fall 2013): NLP Areas, Chomsky Hierarchy, Finite State Automata, Regular Languages, Regular...
Natural Language Processing
www.vkedco.blogspot.com
NLP Areas, Chomsky Hierarchy, Finite State Machines & Regular Expressions, ELIZA: NL Dialogue through Pattern Matching
Vladimir Kulyukin
Outline
● NLP Areas
● Chomsky Hierarchy
● Finite State Automata, Regular Expressions, & Regular Languages
● ELIZA: Natural Language Dialogue through Pattern Matching
NLP Areas
● Morphology
● Phonology and Text-To-Speech (TTS)
● Syntactic Analysis
● Semantics
● Optical Character Recognition (OCR): this is somewhat questionable, because to many it is an area of computer vision
● Speech Recognition
● Natural Language Generation (NLG)
Chomsky Language Hierarchy
Where Should Natural Languages Be Placed?
Finite State Automata, Regular Expressions, & Regular Languages
DFA: Deterministic Finite Automata
• A DFA can be informally defined as a directed graph whose nodes are states and whose edges are transitions on specific symbols
• A DFA has a unique start state and a set (possibly empty) of final or accepting states
• A DFA processes the input string one symbol at a time; when the last symbol is read, the DFA reaches a state which is either final or not; if the state is final, the DFA accepts (recognizes) the string; if the state is not final, the DFA rejects the string
DFA: Formal Definition
A DFA is a 5-tuple, i.e. $M = (Q, \Sigma, \delta, q_0, F)$, where:
● $Q$ is a finite set of states;
● $\Sigma$ is an alphabet;
● $\delta$ is a transition function, $\delta : Q \times \Sigma \to Q$;
● $q_0 \in Q$ is the start state;
● $F \subseteq Q$ is the set of accepting (final) states.
Example DFA
[Diagram: a DFA with two states, q0 and q1, and transitions labeled a and b]
This DFA accepts all strings that end in a
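A DFA like this one can be simulated directly from its transition table. The sketch below is only an illustration and is not part of the original slides: it assumes that q0 is the start state, that q1 is the only accepting state, and that the transitions are δ(q0, a) = q1, δ(q0, b) = q0, δ(q1, a) = q1, δ(q1, b) = q0. The class name EndsInADfa is hypothetical.

```java
import java.util.Set;

class EndsInADfa {
    // DELTA[state][symbol]: states are 0 (q0) and 1 (q1); symbols are 0 ('a') and 1 ('b').
    // These transitions are an assumption reconstructed from the slide's description.
    static final int[][] DELTA = {
        {1, 0},  // from q0: a -> q1, b -> q0
        {1, 0}   // from q1: a -> q1, b -> q0
    };
    static final int START = 0;
    static final Set<Integer> FINAL = Set.of(1);

    // Reads the input one symbol at a time and reports whether the last state is final.
    static boolean accepts(String input) {
        int state = START;
        for (char c : input.toCharArray()) {
            if (c != 'a' && c != 'b') {
                return false; // symbol outside the alphabet {a, b}
            }
            state = DELTA[state][c - 'a'];
        }
        return FINAL.contains(state);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"", "a", "ab", "bba"}) {
            System.out.println("\"" + s + "\" -> " + accepts(s));
        }
    }
}
```

The loop mirrors the informal definition: one transition per input symbol, and acceptance is decided only after the last symbol is read.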
Nondeterminism
● Given an input, there can be more than one legal sequence of steps to process the input
● There can be various criteria to evaluate why one legal sequence is better than another
● The input is accepted if at least one legal sequence of moves ends up in an accepting state
Practical Implication of Nondeterminism
● The key computational implication of nondeterminism is the necessity of search
● In a typical scenario, a legal sequence of steps is a subset of some finite set
● Finding subsets brings us to the concept of power set
NFA: Definition
An NFA is a 5-tuple $M = (Q, \Sigma, \delta, q_0, F)$, where
● $Q$ is a finite set of states;
● $\Sigma$ is an alphabet, i.e. a finite set of symbols;
● $\delta : Q \times (\Sigma \cup \{\varepsilon\}) \to P(Q)$;
● $q_0 \in Q$ is the start state;
● $F \subseteq Q$ is the set of accepting states.
Example NFA
[Diagram: an NFA with three states, q0, q1, and q2, and transitions labeled a and a,b]
This is the transition table of the above NFA:
        a           b
q0      {q0, q1}    {q0}
q1      {q2}        { }
q2      {q2}        {q2}
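The transition table can be run by tracking the set of states the NFA may be in after each symbol. The following is a minimal sketch under two assumptions that the transcript does not state: q0 is the start state and q2 is the only accepting state. The class name ExampleNfa is hypothetical, and ε-transitions are left out because the table has none.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class ExampleNfa {
    static final Set<String> NONE = Set.of();

    // Transition table from the slide: state -> (symbol -> set of next states).
    static final Map<String, Map<Character, Set<String>>> DELTA = Map.of(
        "q0", Map.of('a', Set.of("q0", "q1"), 'b', Set.of("q0")),
        "q1", Map.of('a', Set.of("q2"), 'b', NONE),
        "q2", Map.of('a', Set.of("q2"), 'b', Set.of("q2")));

    static boolean accepts(String input) {
        Set<String> current = Set.of("q0");              // assumed start state
        for (char c : input.toCharArray()) {
            Set<String> next = new HashSet<>();
            for (String q : current) {
                // Follow every legal move; symbols outside {a, b} lead nowhere.
                next.addAll(DELTA.get(q).getOrDefault(c, NONE));
            }
            current = next;
        }
        return current.contains("q2");                   // assumed accepting state
    }

    public static void main(String[] args) {
        for (String s : new String[] {"aa", "baab", "abab", "a"}) {
            System.out.println(s + " -> " + accepts(s));
        }
    }
}
```

Under these assumptions the machine accepts exactly the strings over {a, b} that contain two consecutive a's.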
NFA vs. DFA
● NFAs are simpler to write because, in general, they have fewer states and allow for spontaneous (ε) transitions
● However, they are not more powerful than DFAs, i.e. they accept the same regular languages as DFAs
● For every NFA, one can construct a DFA that accepts the same language
Equivalence of NFAs and DFAs
● Basic insight: A DFA can keep track of the states that the equivalent NFA may be in after reading each symbol of the input
● Since the NFA may be in more than one state after reading a symbol, each state of the DFA must correspond to a subset of the NFA’s states
● The construction of an equivalent DFA from an NFA is called subset construction
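The subset construction can be sketched in a few lines. The code below is an illustration under stated assumptions, not the author's implementation: the NFA is a hypothetical three-state machine over {a, b} with no ε-transitions, and the names SubsetConstruction, NFA, and build are made up for this example. Each DFA state is a set of NFA states, and only subsets reachable from the start subset are generated.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class SubsetConstruction {
    // Hypothetical NFA without epsilon moves: state -> (symbol -> next states).
    static final Map<String, Map<Character, Set<String>>> NFA = Map.of(
        "q0", Map.of('a', Set.of("q0", "q1"), 'b', Set.of("q0")),
        "q1", Map.of('a', Set.of("q2")),
        "q2", Map.of('a', Set.of("q2"), 'b', Set.of("q2")));
    static final char[] ALPHABET = {'a', 'b'};

    // Returns the DFA transition table; each DFA state is a set of NFA states.
    static Map<Set<String>, Map<Character, Set<String>>> build(Set<String> start) {
        Map<Set<String>, Map<Character, Set<String>>> dfa = new HashMap<>();
        Deque<Set<String>> work = new ArrayDeque<>();
        work.add(start);
        while (!work.isEmpty()) {
            Set<String> subset = work.remove();
            if (dfa.containsKey(subset)) {
                continue;                       // this subset was already expanded
            }
            Map<Character, Set<String>> row = new HashMap<>();
            for (char c : ALPHABET) {
                Set<String> target = new TreeSet<>();
                for (String q : subset) {
                    target.addAll(NFA.get(q).getOrDefault(c, Set.<String>of()));
                }
                row.put(c, target);
                work.add(target);               // expand the new subset later
            }
            dfa.put(subset, row);
        }
        return dfa;
    }

    public static void main(String[] args) {
        build(Set.of("q0")).forEach((state, row) -> System.out.println(state + " -> " + row));
    }
}
```

A DFA state built this way is accepting whenever its subset contains at least one accepting NFA state. In the worst case the number of subsets is exponential in the number of NFA states, which is why the power set appears in the discussion of nondeterminism.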
Languages Accepted By DFAs & NFAs
● An immediate consequence of the subset construction algorithm is that non-determinism does not increase conceptual power (i.e., the same class of languages is recognized by DFAs and NFAs)
● Languages that are recognized by FSAs (DFAs and NFAs) are called regular
Regular Expressions
● Regular expressions (regexps) are one of the most useful tools in computer science
● NLP, as an area of computer science, has greatly benefitted from regexps: they are used in phonology, morphology, text analysis, information extraction, & speech recognition
● As a student of NLP, you should learn to recognize if a problem at hand can be solved via regexps
Three Language Operations
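● If $L_1$ and $L_2$ are languages, then their concatenation is $L_1 L_2 = \{x_1 x_2 \mid x_1 \in L_1 \text{ and } x_2 \in L_2\}$.
● If $L$ is a language, the Kleene closure of $L$ is $L^* = \{x_1 x_2 \dots x_n \mid n \ge 0,\ x_i \in L,\ 1 \le i \le n\}$.
● If $L$ is a language, then $L^+$ is $\{x_1 x_2 \dots x_n \mid n \ge 1,\ x_i \in L,\ 1 \le i \le n\}$.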
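As a small worked example of the three operations (the languages are chosen only for illustration and do not come from the slides), take $L_1 = \{a, ab\}$ and $L_2 = \{b\}$:

```latex
\begin{align*}
L_1 L_2 &= \{ab,\ abb\} \\
L_2^{*} &= \{\varepsilon,\ b,\ bb,\ bbb,\ \dots\} \\
L_2^{+} &= \{b,\ bb,\ bbb,\ \dots\}
\end{align*}
```

The only difference between $L_2^{*}$ and $L_2^{+}$ here is the empty string, which corresponds to the case $n = 0$.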
Atomic & Compound Regular Expressions
● Regular expressions can be divided into atomic and compound
● Atomic regular expressions are basic building blocks out of which compound regular expressions are built
● There are typically three atomic regular expressions: unit strings (strings of one symbol), the empty string (the string of no symbols), and the empty set of strings
Atomic Regular Expressions
A regular expression $r$ is a string that denotes the language $L(r)$ over some alphabet $\Sigma$.
Three types of atomic regular expressions:
1. If $a \in \Sigma$, then $L(a) = \{a\}$.
2. $L(\varepsilon) = \{\varepsilon\}$.
3. $L(\emptyset) = \{\,\}$.
Compound Regular Expressions
Let $r_1$ and $r_2$ be regular expressions.
1. $(r_1 + r_2)$ is a regular expression. Then $L(r_1 + r_2) = L(r_1) \cup L(r_2)$.
2. $(r_1 r_2)$ is a regular expression. Then $L(r_1 r_2) = L(r_1) L(r_2)$.
3. $(r_1)^*$ is a regular expression. Then $L((r_1)^*) = (L(r_1))^*$.
Compiling FSAs from RegExps
Atomic Reg Exps → NFAs
$a$, where $a \in \Sigma$
[Diagram: NFA for the atomic regular expression a]
This NFA accepts only the string 'a' and nothing else
Atomic Reg Exps → NFAs
$\varepsilon$
[Diagram: NFA for the atomic regular expression ε]
This NFA accepts only the empty string
Atomic Reg Exps → NFAs
$\emptyset$
[Diagram: NFA for the atomic regular expression ∅]
This NFA accepts the empty set of strings, i.e., it accepts no strings
Compound Reg Exps → NFAs
$(r_1 + r_2)$, where $r_1$, $r_2$ are regular expressions
[Diagram: the NFAs for r1 (upper) and r2 (lower) joined by ε-transitions]
Another notation, commonly used in regexp engines, is (r1 | r2), in other words, either r1 or r2
Compound Reg Exps → NFAs
[Diagram: the NFAs for r1 (upper) and r2 (lower) joined by ε-transitions]
This compound NFA accepts if and only if either the NFA for r1 (upper one) accepts or the NFA for r2 (lower one) accepts
Compound Reg Exps → NFAs
$(r_1)^+$, where $r_1$ is a regular expression
[Diagram: the NFA for r1 with additional ε-transitions]
This regular expression matches strings that consist of one or more matches of r1
Compound Reg Exps → NFAs
$(r_1)^*$
[Diagram: the NFA for r1 with additional ε-transitions]
This regular expression matches strings that consist of zero or more matches of r1
Defining Regular Expressions
Just in case you are interested/curious, below are a few links to my presentations on regular expressions in Python and Perl:
● http://vkedco.blogspot.com/2013/02/python-perl-regular-expression-match.html
● http://vkedco.blogspot.com/2013/02/python-amp-perl-py-pattern-compilation.html
● http://vkedco.blogspot.com/2013/02/python-perl-loose-tight-regular.html
Defining Regexps
● Regular expressions are placed inside a pair of matching forward slashes: / /
● Regular expressions are case-sensitive
● Examples:
/a/ match all occurrences of character 'a'
/ab/ match all occurrences of sequence 'ab'
/Ab cD/ match all occurrences of sequence 'Ab cD'
Disjunction
● Disjunction (or'ing) of characters inside a regexp is done with matching square brackets [ ]
● All characters inside [ ] are part of the disjunction
● Examples:
/[Aa]/ match either 'A' or 'a'
/[abc]/ match all occurrences of 'a' or 'b' or 'c'
/[cC]onference of [bB]irds/ match strings like 'conference of birds' or 'Conference of birds' or 'conference of Birds' or 'Conference of Birds'
Positive & Negative Ranges
● [ ] can be used in conjunction with - to specify character ranges
● Ranges can be negated with the special character ^ if it is the first character inside [ ]
● Examples:
/[A-Z]/ - match an uppercase letter
/[a-z]/ - match a lowercase letter
/[0-9]/ - match a digit
/[^A-Z]/ - match any character that is not an uppercase letter
/[^Ab]/ - match any character that is neither 'A' nor 'b'
Zero or One Occurrences
● Special character ? is used to specify zero or one occurrences of the preceding character
● Examples:
/Examples?/ - match 'Example' or 'Examples'
/colou?r/ - match 'colour' or 'color'
Kleene*, Kleene+, Wildcard .
● Special character + (aka Kleene +) specifies one or more occurrences of the regular expression that comes right before it
● Special character * (aka Kleene *) specifies zero or more occurrences of the regular expression that comes right before it
● Special character . (wildcard) specifies any single character
● Examples:
/a*/ match '', 'a', 'aa', 'aaa', 'aaaa', etc
/(ab)*/ match '', 'ab', 'abab', 'ababab', 'abababab', etc
/a+/ match 'a', 'aa', 'aaa', 'aaaa', etc
/(ab)+/ match 'ab', 'abab', 'ababab', 'abababab', etc
/beg.n/ match 'begin', 'began', 'begun', etc
Anchors
● Anchors are special characters that anchor a regexp to a specific position in the text it is matched against
● The anchors ^ and $ anchor regexps at the beginning and end of the text, respectively
● Examples:
/^Omar Khayyam/ matches 'Omar Khayyam' only at the beginning of the text
/Omar Khayyam$/ matches 'Omar Khayyam' only at the end of the text
/^Omar Khayyam$/ matches only 'Omar Khayyam'
Anchors
● Anchors \b and \B match at word boundaries and non-word boundaries, respectively
● Examples:
/\bthe\b/ matches the word 'the' in ' the ' but not the letters 'the' inside 'weather'
/\Bthe\B/ does not match the word 'the' but does match the letters 'the' inside 'weather'
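The notation from the last few slides can be checked quickly in Java. This is a minimal sketch rather than part of the original deck: the class RegexNotationDemo and the helper found are hypothetical names. The forward slashes of the /.../ notation are dropped, since Java takes the pattern as a plain string, and each backslash is doubled inside a Java string literal.

```java
import java.util.regex.Pattern;

class RegexNotationDemo {
    // True if the pattern matches somewhere inside the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // Disjunction with [ ]
        System.out.println(found("[cC]onference of [bB]irds", "the Conference of birds"));
        // Zero or one occurrence with ?
        System.out.println(found("colou?r", "color"));
        // Kleene + combined with anchors
        System.out.println(found("^(ab)+$", "ababab"));
        // Wildcard
        System.out.println(found("beg.n", "begun"));
        // Word boundary versus non-word boundary
        System.out.println(found("\\bthe\\b", "weather"));
        System.out.println(found("\\Bthe\\B", "weather"));
    }
}
```

Note that find() looks for a match anywhere in the text, which corresponds to the "match all occurrences" reading used on the slides; the anchors ^ and $ are what restrict a match to the beginning or end.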
Grouping Regular Expressions
● Regular expressions can be broken into components
● Such components are called groups
● A group match is a specific part of text that matches a specific regular subexpression in a larger expression
● Groups are specified with a pair of ( )
Referencing Group Matches
● Group matches are numbered $1, $2, $3, etc (some regexp notations use just numbers: 1, 2, 3, etc)
● These numbers are called backreferences because they refer specific text segments back to specific regular subexpressions
● Backreferences are used in substitutions
Sample Problem
Design a regular expression that parses email addresses into user name, host name, and host extension
Possible Solution
'([\w.]+)@([\w.]+)\.(com|net|org)'
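To see how the three groups split an address, the pattern can be tried in Java. This is a sketch under assumptions: the class EmailParser, the method parse, and the address jane.doe@mail.example.com are hypothetical and used only for illustration.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmailParser {
    // The pattern from the slide; backslashes are doubled inside a Java string literal.
    static final Pattern EMAIL = Pattern.compile("([\\w.]+)@([\\w.]+)\\.(com|net|org)");

    // Returns {user name, host name, host extension}, or null if the address does not match.
    static String[] parse(String address) {
        Matcher m = EMAIL.matcher(address);
        if (!m.matches()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2), m.group(3)};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parse("jane.doe@mail.example.com")));
        System.out.println(Arrays.toString(parse("jane@example.edu")));
    }
}
```

The second group is greedy, so it first consumes the whole host, and the engine then backtracks until the final dot and extension can match. Extensions other than com, net, and org are rejected by design of the pattern.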
Example 01
$txt_01 = '1+1=2';
## Suppose we match /(\d)\+(\d)\=(\d)/ against $txt_01
## Then the variable alignment is as follows:
## /(\d)\+(\d)\=(\d)/
##  $1    $2    $3
## In other words, $1 is bound to '1',
## $2 is bound to '1' and $3 is bound to '2'
Example 02
$txt_01 = '1+1=2';
## Suppose we match /((\d)\+(\d)\=(\d))/ against $txt_01.
## Then the special variable alignment is
## /((\d)\+(\d)\=(\d))/
##  $1$2   $3    $4
## In other words, $1 is bound to '1+1=2',
## $2 is bound to '1' and $3 is bound to '1',
## $4 is bound to '2'.
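The same bindings can be reproduced in Java, where Matcher.group(n) plays the role of $n. This is a minimal sketch; the class GroupNumbering and the method groups are hypothetical names, and the equals sign is left unescaped because it has no special meaning in a regular expression.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumbering {
    // Returns the text bound to groups 1..n for the first match, or an empty list.
    static List<String> groups(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        List<String> bound = new ArrayList<>();
        if (m.find()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                bound.add(m.group(i));
            }
        }
        return bound;
    }

    public static void main(String[] args) {
        // Three groups, numbered by their left parentheses.
        System.out.println(groups("(\\d)\\+(\\d)=(\\d)", "1+1=2"));
        // An extra outer group becomes group 1 and shifts the others by one.
        System.out.println(groups("((\\d)\\+(\\d)=(\\d))", "1+1=2"));
    }
}
```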
Substitutions
● Groups (subexpressions of larger regular expressions) are specified with special characters ( )
● Corresponding text segments that match subexpressions are retrieved with special variables $1, $2, $3, etc (some formalisms use just integers 1, 2, 3, etc)
● These variables are aligned with left parentheses: $1 is aligned with the 1st left parenthesis, $2 is aligned with the 2nd left parenthesis, etc
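A substitution that uses these variables might look as follows in Java, where String.replaceAll accepts $1, $2, etc in the replacement text. This is only a sketch; the class BackrefSubstitution and the method swapOperands are hypothetical names chosen for this example.

```java
class BackrefSubstitution {
    // Swaps the two single-digit operands of every sum found in the text.
    static String swapOperands(String text) {
        // $1 and $2 refer to the first and second parenthesized groups.
        return text.replaceAll("(\\d)\\+(\\d)", "$2+$1");
    }

    public static void main(String[] args) {
        System.out.println(swapOperands("1+2=3"));
    }
}
```

Text that contains no match is returned unchanged.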
Patterns in Java
● Java has the package java.util.regex that allows the programmer to work with patterns, including regular expressions
● java.util.regex package has two major classes: Pattern and Matcher
● Pattern compiles patterns into finite state automata
● Matcher uses the compiled pattern to find substrings that match the pattern
Java Pattern Example
● This is how you compile a regular expression:
– Pattern pat = Pattern.compile("abc");
● This is how you create a matcher object, essentially an NFA that can be used to find matches:
– Matcher match = pat.matcher(str);
● This is how you can test for a match:
– match.matches() is a boolean predicate that is true only if the entire input string matches the pattern
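The difference between matching the whole input and finding a match inside it is easy to miss, so here is a minimal sketch of both. The class MatchesVsFind and its two helper methods are hypothetical names; Matcher.matches() and Matcher.find() are the standard java.util.regex calls.

```java
import java.util.regex.Pattern;

class MatchesVsFind {
    // True only if the entire text matches the pattern.
    static boolean wholeMatch(String regex, String text) {
        return Pattern.compile(regex).matcher(text).matches();
    }

    // True if some substring of the text matches the pattern.
    static boolean partialMatch(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(wholeMatch("abc", "abc"));
        System.out.println(wholeMatch("abc", "xabcx"));
        System.out.println(partialMatch("abc", "xabcx"));
    }
}
```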
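Java Pattern Example
import java.io.Console;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTestHarness {
    public static void main(String[] args) {
        Console console = System.console();
        if (console == null) {
            System.err.println("No console.");
            System.exit(1);
        }
        while (true) {
            Pattern pattern = Pattern.compile(console.readLine("%nEnter your regex: "));
            Matcher matcher = pattern.matcher(console.readLine("Enter input string to search: "));
            boolean found = false;
            // iterate through the matches and print each one with its position
            while (matcher.find()) {
                console.format("I found the text \"%s\" starting at " +
                        "index %d and ending at index %d.%n",
                        matcher.group(), matcher.start(), matcher.end());
                found = true;
            }
            if (!found) {
                console.format("No match found.%n");
            }
        }
    }
}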
source code is here
Java Pattern Example
● This is how you can call RegexTestHarness from the command line:
– java RegexTestHarness ^(0|1(01*0)*1)*$
● Then you can give it strings to match
● You can also redirect a file to RegexTestHarness:
– java RegexTestHarness reg-exp < numbers.txt
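The slides do not say what the expression ^(0|1(01*0)*1)*$ matches. It is the standard regular expression for binary numerals whose value is divisible by three, obtained from a three-state DFA that tracks the remainder modulo 3. The sketch below checks that claim for a few numerals; the class DivisibleByThree and the method matches are hypothetical names.

```java
import java.util.regex.Pattern;

class DivisibleByThree {
    static final Pattern MULTIPLE_OF_3 = Pattern.compile("^(0|1(01*0)*1)*$");

    // True if the string of bits matches the expression from the slide.
    static boolean matches(String bits) {
        return MULTIPLE_OF_3.matcher(bits).find();
    }

    public static void main(String[] args) {
        for (int n = 0; n <= 12; n++) {
            String bits = Integer.toBinaryString(n);
            System.out.println(n + " = " + bits + " -> " + matches(bits));
        }
    }
}
```

Note that the expression also matches the empty string, because the outer Kleene star allows zero repetitions.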
Sample Outputs
Enter your regex: a
Enter input string to search: bac
I found the text "a" starting at index 1 and ending at index 2.
Enter your regex: /\d/
Enter input string to search: bc12ab
No match found.
Enter your regex: \d
Enter input string to search: bc12ab
I found the text "1" starting at index 2 and ending at index 3.
I found the text "2" starting at index 3 and ending at index 4.
Pattern Matching
Real Intelligence or
Illusion of Intelligence?
Illusion of Intelligence?
● In the 1960's and 70's, AI researchers developed a number of programs that impressed many people with their 'intelligence'
● ELIZA could have a dialog with a person
● STUDENT solved high school algebra word problems
● MACSYMA solved problems in integral and differential calculus
● SAM understood simple stories
Pattern Matching
● What all these systems had in common was a technique that in symbolic AI came to be known as pattern matching
● The basic idea was to define a set of rules that look for patterns in inputs and produce outputs by substituting what they matched for something else
● Results were impressive to the uninitiated until the researchers themselves started explaining how they did what they did
ELIZA
● ELIZA was written by Joseph Weizenbaum in 1966
● The system was named after the heroine of the play Pygmalion, whom a good teacher taught how to speak English
● ELIZA was written not just to speak but to emulate a Rogerian psychologist
● The basic principle of Rogerian psychology is to never ask direct questions but let the patients tell their own story
Sample Dialog with ELIZA
USER> Men are all alike
ELIZA> IN WHAT WAY
USER> They are always bugging us about something or another.
ELIZA> CAN YOU BE MORE SPECIFIC
USER> Well, my boyfriend made me come here.
ELIZA> YOUR BOYFRIEND MADE YOU COME HERE
USER> He says I'm depressed much of the time.
ELIZA> I AM SORRY TO HEAR YOU ARE DEPRESSED.
USER> It's true. I am unhappy.
ELIZA> DO YOU THINK COMING HERE WILL HELP YOU NOT BE UNHAPPY
USER> I need some help, that much is certain.
How ELIZA Works
● The program looks for specific patterns in the input and then prints the response on the basis of what it finds
● For example, when the program finds 'alike' or 'same', it may print 'IN WHAT WAY'
● When the program matches the pattern 'I need X', it may print 'WHAT WOULD IT MEAN IF YOU GOT X'
● For example, if the user types 'I need some help', ELIZA prints 'WHAT WOULD IT MEAN IF YOU GOT SOME HELP'
● The level of output sophistication depends on how elaborate the patterns are
● Try an online version of ELIZA here
ELIZA RULES
● A rule is a data structure that consists of a pattern and a set of responses
● Example:
RULE
PATTERN: 'X I want Y'
RESPONSES:
{ 'What would it mean if you got Y',
'Why do you want Y',
'Suppose you got Y soon' }
Rule Application
● Suppose the user types: Some day I want to read Farid Uddin Attar's Conference of Birds in the original
● The rule's pattern matches X to 'Some day' and Y to 'to read Farid Uddin Attar's Conference of Birds in the original'
● In symbolic pattern matching, the result of the match is referred to as a list of bindings: X is bound to 'Some day' and Y is bound to 'to read Farid Uddin Attar's Conference of Birds in the original'
Rule Application
● After the list of bindings is found, the program can then use it to produce the following responses:
– What would it mean if you got to read Farid Uddin Attar's Conference of Birds in the original
– Why do you want to read Farid Uddin Attar's Conference of Birds in the original
– Suppose you got to read Farid Uddin Attar's Conference of Birds in the original soon
● There are still two important problems to think about: what to do if multiple rules' patterns match, and how a specific response is chosen within a matched rule
ELIZA Algorithm
while ( True ) {
input = get_input_from_user();
applicable_rules = find_applicable_rules(input, rule_database);
chosen_rule = choose_applicable_rule(applicable_rules);
chosen_response = choose_response(chosen_rule);
chosen_response = substitute_matches(chosen_response);
print_response(chosen_response);
}
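One step of this loop can be sketched in Java with regular expressions standing in for the symbolic patterns. Everything below is an illustrative assumption rather than Weizenbaum's program: the class MiniEliza, the nested Rule class, the three rules, and the fallback reply "Please go on" are made up for the example. The sketch also settles the two open questions in the simplest way, by taking the first rule that matches and the first response of that rule, and it does not rewrite pronouns (for example, "my" to "your").

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MiniEliza {
    // A rule pairs a pattern with response templates; $1 marks where the binding goes.
    static final class Rule {
        final Pattern pattern;
        final List<String> responses;

        Rule(String regex, String... responses) {
            this.pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
            this.responses = List.of(responses);
        }
    }

    // Hypothetical rule database; every pattern has exactly one capturing group.
    static final List<Rule> RULES = List.of(
        new Rule(".*\\bI want (.+)",
                 "What would it mean if you got $1",
                 "Why do you want $1",
                 "Suppose you got $1 soon"),
        new Rule(".*\\bI need (.+)", "What would it mean if you got $1"),
        new Rule(".*\\b(alike|same)\\b.*", "In what way"));

    // Applies the first matching rule and fills its first response template.
    static String respond(String input) {
        for (Rule rule : RULES) {
            Matcher m = rule.pattern.matcher(input);
            if (m.matches()) {
                // Literal replacement of the $1 marker with the bound text.
                return rule.responses.get(0).replace("$1", m.group(1));
            }
        }
        return "Please go on"; // assumed fallback when no rule applies
    }

    public static void main(String[] args) {
        System.out.println(respond("Men are all alike"));
        System.out.println(respond("I need some help"));
    }
}
```

A fuller version would pick among applicable rules and responses at random or by priority, which is what choose_applicable_rule and choose_response stand for in the pseudocode.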
References
● Ch 02, D. Jurafsky & J. Martin. Speech & Language Processing, Prentice Hall, ISBN 0-13-095069-6
● Weizenbaum, J. 1966. "Eliza - A Computer Program for the Study of Natural Language Communication Between Man and Machine." Communications of the ACM, 9(1): 36-45.