nlp (fall 2013): nlp areas, chomsky hierarchy, finite state automata, regular languages, regular...

Natural Language Processing

www.vkedco.blogspot.com

NLP Areas, Chomsky Hierarchy, Finite State Machines & Regular Expressions, ELIZA: NL Dialogue through Pattern Matching

Vladimir Kulyukin

http://www.vkedco.blogspot.com/

http://www.linkedin.com/pub/vladimir-kulyukin/23/2a2/150

Outline

● NLP Areas

● Chomsky Hierarchy

● Finite State Automata, Regular Expressions, & Regular Languages

● ELIZA: Natural Language Dialogue through Pattern Matching



NLP Areas

● Morphology

● Phonology and Text-To-Speech (TTS)

● Syntactic Analysis

● Semantics

● Optical Character Recognition (OCR) – this is somewhat questionable, because to many it is an area of computer vision

● Speech Recognition

● Natural Language Generation (NLG)



Chomsky Language Hierarchy

Where Should Natural Languages Be Placed?



Finite State Automata, Regular Expressions, & Regular Languages



DFA: Deterministic Finite Automata

• A DFA can be informally defined as a directed graph whose nodes are states and whose edges are transitions on specific symbols

• A DFA has a unique start state and a set (possibly empty) of final or accepting states

• A DFA processes the input string one symbol at a time; when the last symbol is read, the DFA reaches a state which is either final or not; if the state is final, the DFA accepts (recognizes) the string; if the state is not final, the DFA rejects the string



DFA: Formal Definition

A DFA $M$ is a 5-tuple, i.e. $M = (Q, \Sigma, \delta, q_0, F)$, where:

● $Q$ is a finite set of states;

● $\Sigma$ is an alphabet;

● $\delta$ is a transition function, $\delta : Q \times \Sigma \to Q$;

● $q_0 \in Q$ is the start state;

● $F$ is the set of accepting (final) states.



Example DFA

[Figure: state diagram of a DFA with two states, q0 and q1, and transitions labeled a and b]

This DFA accepts all strings that end in a
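The informal definition above can be sketched as a table-driven simulator. The transition table below is an assumed reconstruction of the example DFA (q0 as the start state, q1 as the only accepting state); the name `dfa_accepts` is hypothetical.

```python
# Transition function delta: (state, symbol) -> state.
# Assumed reconstruction of the example: q0 is the start state and
# q1 (reached whenever the last symbol read is 'a') is accepting.
DELTA = {
    ("q0", "a"): "q1",
    ("q0", "b"): "q0",
    ("q1", "a"): "q1",
    ("q1", "b"): "q0",
}
START = "q0"
FINAL = {"q1"}


def dfa_accepts(string, delta=DELTA, start=START, final=FINAL):
    """Read the string one symbol at a time; accept if the last state is final."""
    state = start
    for symbol in string:
        state = delta[(state, symbol)]
    return state in final


for s in ["a", "abba", "ab", ""]:
    print(repr(s), dfa_accepts(s))
```

Because the automaton is deterministic, the simulator keeps exactly one current state and needs no search.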



Nondeterminism

● Given an input, there can be more than one legal sequence of steps to process the input

● There can be various criteria to evaluate why one legal sequence is better than another

● The input is accepted if at least one legal sequence of moves ends up in an accepting state



Practical Implication of Nondeterminism

● The key computational implication of nondeterminism is the necessity of search

● In a typical scenario, a legal sequence of steps is a subset of some finite set

● Finding subsets brings us to the concept of power set


http://en.wikipedia.org/wiki/Power_set
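As a small illustration of the power set, the sketch below enumerates every subset of a three-element set of state names. The helper name `power_set` and the state names are assumptions made for the example.

```python
from itertools import chain, combinations


def power_set(items):
    """Return all subsets of `items`, from the empty set up to the full set."""
    items = list(items)
    subsets = chain.from_iterable(
        combinations(items, size) for size in range(len(items) + 1)
    )
    return [set(subset) for subset in subsets]


states = ["q0", "q1", "q2"]
for subset in power_set(states):
    print(sorted(subset))
```

A set of n elements has 2^n subsets, which is why search over subsets can become expensive quickly.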


NFA: Definition

An NFA $M$ is a 5-tuple $M = (Q, \Sigma, \delta, q_0, F)$, where

● $Q$ is a finite set of states;

● $\Sigma$ is an alphabet, i.e. a finite set of symbols;

● $\delta : Q \times (\Sigma \cup \{\varepsilon\}) \to P(Q)$;

● $q_0 \in Q$ is the start state;

● $F \subseteq Q$ is the set of accepting states.



Example NFA

[Figure: state diagram of an NFA with three states, q0, q1, and q2, and transitions labeled a and b]

This is the transition table of the above NFA:

      a           b
q0    {q0, q1}    {q0}
q1    {q2}        { }
q2    {q2}        {q2}
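The transition table can be run directly by tracking the set of states the NFA may be in. This is a minimal sketch that assumes q0 is the start state and q2 is the only accepting state (the table itself does not say which states are accepting); `nfa_accepts` is a hypothetical name.

```python
# Transition table of the example NFA: (state, symbol) -> set of next states.
NFA_DELTA = {
    ("q0", "a"): {"q0", "q1"},
    ("q0", "b"): {"q0"},
    ("q1", "a"): {"q2"},
    ("q1", "b"): set(),
    ("q2", "a"): {"q2"},
    ("q2", "b"): {"q2"},
}


def nfa_accepts(string, delta=NFA_DELTA, start="q0", final=frozenset({"q2"})):
    """Accept if at least one legal sequence of moves ends in an accepting state."""
    current = {start}
    for symbol in string:
        # Union of the moves available from every state we may currently be in.
        current = set().union(*(delta.get((q, symbol), set()) for q in current))
    return bool(current & final)


for s in ["aa", "baab", "abab", ""]:
    print(repr(s), nfa_accepts(s))
```

Under these assumptions the automaton accepts exactly the strings that contain two consecutive a's.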


NFA vs. DFA

● NFAs are simpler to write because, in general, they have fewer states and allow for spontaneous transitions

● However, they are not more powerful than DFAs, i.e. they accept the same regular languages as DFAs

● For every NFA, one can construct a DFA that accepts the same language



Equivalence of NFAs and DFAs

● Basic insight: A DFA can keep track of the states that the equivalent NFA may be in after reading each symbol of the input

● Since the NFA may be in more than one state after reading a symbol, each state of the DFA must correspond to a subset of the NFA’s states

● The construction of an equivalent DFA from an NFA is called subset construction


http://vkedco.blogspot.com/2011/11/theory-of-computation-lecture-32.html
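The subset construction can be sketched in a few lines for an NFA without ε-transitions. The NFA below is the three-state example from the earlier slide, with q0 assumed to be the start state and q2 assumed to be accepting; the function names are hypothetical, and only subsets reachable from the start are built.

```python
from collections import deque

NFA_DELTA = {
    ("q0", "a"): {"q0", "q1"},
    ("q0", "b"): {"q0"},
    ("q1", "a"): {"q2"},
    ("q1", "b"): set(),
    ("q2", "a"): {"q2"},
    ("q2", "b"): {"q2"},
}
ALPHABET = ["a", "b"]


def subset_construction(delta, start, final, alphabet):
    """Build a DFA whose states are frozensets of NFA states (no ε-moves handled)."""
    start_set = frozenset({start})
    dfa_delta, dfa_final = {}, set()
    queue, seen = deque([start_set]), {start_set}
    while queue:
        subset = queue.popleft()
        if subset & final:                 # contains an accepting NFA state
            dfa_final.add(subset)
        for symbol in alphabet:
            target = frozenset().union(
                *(delta.get((q, symbol), set()) for q in subset)
            )
            dfa_delta[(subset, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return dfa_delta, start_set, dfa_final


def dfa_accepts(string, dfa_delta, dfa_start, dfa_final):
    state = dfa_start
    for symbol in string:
        state = dfa_delta[(state, symbol)]
    return state in dfa_final


DFA_DELTA, DFA_START, DFA_FINAL = subset_construction(
    NFA_DELTA, "q0", {"q2"}, ALPHABET
)
print(len({state for state, _ in DFA_DELTA}), "reachable DFA states")
print(dfa_accepts("baab", DFA_DELTA, DFA_START, DFA_FINAL))
```

Building only the reachable subsets keeps the DFA small here, although in the worst case the number of subsets is exponential in the number of NFA states.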


Languages Accepted By DFAs & NFAs

● An immediate consequence of the subset construction algorithm is that non-determinism does not increase conceptual power (i.e., the same class of languages is recognized by DFAs and NFAs)

● Languages that are recognized by FSAs (DFAs and NFAs) are called regular



Regular Expressions



Regular Expressions

● Regular expressions (regexps) are one of the most useful tools in computer science

● NLP, as an area of computer science, has greatly benefitted from regexps: they are used in phonology, morphology, text analysis, information extraction, & speech recognition

● As a student of NLP, you should learn to recognize if a problem at hand can be solved via regexps



Three Language Operations

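● If $L_1$ and $L_2$ are languages, then $L_1 L_2 = \{x_1 x_2 \mid x_1 \in L_1 \text{ and } x_2 \in L_2\}$.

● If $L$ is a language, the Kleene closure of $L$ is $L^* = \{x_1 x_2 \ldots x_n \mid n \geq 0, x_i \in L, 1 \leq i \leq n\}$.

● If $L$ is a language, then $L^+ = \{x_1 x_2 \ldots x_n \mid n \geq 1, x_i \in L, 1 \leq i \leq n\}$.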
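For finite languages the three operations can be tried out on Python sets. Because the Kleene closure is infinite for any language containing a non-empty string, the sketch below truncates it at a chosen maximum number of factors; that bound and the helper names are assumptions made for illustration.

```python
def concat(l1, l2):
    """L1 L2: every string x1 x2 with x1 in L1 and x2 in L2."""
    return {x1 + x2 for x1 in l1 for x2 in l2}


def closure_up_to(lang, max_n, min_n=0):
    """Strings x1 ... xn with each xi in lang and min_n <= n <= max_n.

    min_n=0 approximates the Kleene closure L*, min_n=1 approximates L+.
    """
    result = set()
    layer = {""}                      # n = 0: only the empty string
    for n in range(max_n + 1):
        if n >= min_n:
            result |= layer
        layer = concat(layer, lang)   # strings built from one more factor
    return result


print(sorted(concat({"a", "b"}, {"c"})))
print(sorted(closure_up_to({"ab"}, 2)))
print(sorted(closure_up_to({"ab"}, 2, min_n=1)))
```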



Atomic & Compound Regular Expressions

● Regular expressions can be divided into atomic and compound

● Atomic regular expressions are basic building blocks out of which compound regular expressions are built

● There are typically three atomic regular expressions: unit strings (strings of one symbol), the empty string (the string of no symbols), and the empty set of strings



Atomic Regular Expressions

A regular expression $r$ is a string that denotes the language $L(r)$ over some alphabet $\Sigma$.

Three types of atomic regular expressions:

1. If $a \in \Sigma$, then $L(a) = \{a\}$.

2. $L(\varepsilon) = \{\varepsilon\}$.

3. $L(\emptyset) = \{\}$.



Compound Regular Expressions

Let $r_1$ and $r_2$ be regular expressions.

1. $(r_1 + r_2)$ is a regular expression. Then $L(r_1 + r_2) = L(r_1) \cup L(r_2)$.

2. $(r_1 r_2)$ is a regular expression. Then $L(r_1 r_2) = L(r_1) L(r_2)$.

3. $(r_1)^*$ is a regular expression. Then $L((r_1)^*) = (L(r_1))^*$.



Compiling FSAs from RegExps



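Atomic Reg Exps → NFAs

$a$, where $a \in \Sigma$

[Figure: NFA for a]

This NFA accepts only the string 'a' and nothing else

$\varepsilon$

[Figure: NFA for ε]

This NFA accepts only the empty string

$\emptyset$

[Figure: NFA for ∅]

This NFA accepts the empty language, i.e., no strings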



Compound Reg Exps → NFAs

$(r_1 + r_2)$, where $r_1$, $r_2$ are regular expressions

[Figure: the NFAs for r1 (upper one) and r2 (lower one) connected by ε-transitions]

This compound NFA accepts if and only if either the NFA for r1 (upper one) accepts or the NFA for r2 (lower one) accepts

Another notation, commonly used in regexp engines, is (r1 | r2), in other words, either r1 or r2

$(r_1)^+$, where $r_1$ is a regular expression

[Figure: the NFA for r1 with ε-transitions]

This regular expression accepts strings that match r1 at least once

$(r_1)^*$

[Figure: the NFA for r1 with ε-transitions]

This regular expression accepts strings that match r1 zero or more times
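The compilation of regular expressions into NFAs can be sketched with a handful of combinators in the style of Thompson's construction. This is an illustrative sketch, not the construction drawn on the slides: each NFA is a hypothetical triple (start, accept, table), ε is represented by `None`, and every fragment should be built fresh rather than reused in two places.

```python
from itertools import count

_fresh = count()   # source of fresh state names
EPS = None         # label used for ε-transitions


def _merge(*tables):
    merged = {}
    for table in tables:
        for key, targets in table.items():
            merged.setdefault(key, set()).update(targets)
    return merged


def _add(table, src, label, dst):
    table.setdefault((src, label), set()).add(dst)


def atom(symbol):
    """NFA for one symbol; atom(EPS) accepts only the empty string."""
    start, accept = next(_fresh), next(_fresh)
    return start, accept, {(start, symbol): {accept}}


def empty_set():
    """NFA for the empty language: the accept state is unreachable."""
    return next(_fresh), next(_fresh), {}


def union(n1, n2):
    """(r1 + r2): branch on ε into either sub-NFA."""
    start, accept = next(_fresh), next(_fresh)
    table = _merge(n1[2], n2[2])
    for sub_start, sub_accept, _ in (n1, n2):
        _add(table, start, EPS, sub_start)
        _add(table, sub_accept, EPS, accept)
    return start, accept, table


def concat(n1, n2):
    """(r1 r2): ε-edge from the accept state of r1 to the start of r2."""
    table = _merge(n1[2], n2[2])
    _add(table, n1[1], EPS, n2[0])
    return n1[0], n2[1], table


def plus(n1):
    """(r1)+: one pass through r1, then optionally loop back."""
    start, accept = next(_fresh), next(_fresh)
    table = _merge(n1[2])
    _add(table, start, EPS, n1[0])
    _add(table, n1[1], EPS, accept)
    _add(table, n1[1], EPS, n1[0])
    return start, accept, table


def star(n1):
    """(r1)*: like plus, with one extra ε-edge that skips r1 entirely."""
    start, accept, table = plus(n1)
    _add(table, start, EPS, accept)
    return start, accept, table


def closure(states, table):
    """All states reachable from `states` through ε-transitions only."""
    stack, seen = list(states), set(states)
    while stack:
        for nxt in table.get((stack.pop(), EPS), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def accepts(nfa, string):
    start, accept, table = nfa
    current = closure({start}, table)
    for symbol in string:
        moved = set()
        for q in current:
            moved |= table.get((q, symbol), set())
        current = closure(moved, table)
    return accept in current


# (a + b)* a : all strings over {a, b} that end in a
ends_in_a = concat(star(union(atom("a"), atom("b"))), atom("a"))
print(accepts(ends_in_a, "abba"), accepts(ends_in_a, "ab"))
```

Keeping a single start and a single accept state per fragment is a design choice that lets every combinator glue fragments together with ε-transitions alone.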



Defining Regular Expressions


Just in case you are interested/curious, below are a few links to my presentations on regular expressions in Python and Perl:

● http://vkedco.blogspot.com/2013/02/python-perl-regular-expression-match.html

● http://vkedco.blogspot.com/2013/02/python-amp-perl-py-pattern-compilation.html

● http://vkedco.blogspot.com/2013/02/python-perl-loose-tight-regular.html

Defining Regexps

● Regular expressions are placed inside a pair of matching forward slashes: / /

● Regular expressions are case-sensitive

● Examples:

/a/ match all occurrences of character 'a'

/ab/ match all occurrences of sequence 'ab'

/Ab cD/ match all occurrences of sequence 'Ab cD'
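The same literal patterns can be tried with Python's `re` module. Note that the surrounding slashes are notation only; in Python the pattern is passed as a string. The sample text is made up for the example.

```python
import re

text = "Ab cD and ab and abba"

# findall returns every non-overlapping occurrence of the pattern.
print(re.findall(r"a", text))
print(re.findall(r"ab", text))
print(re.findall(r"Ab cD", text))

# Matching is case-sensitive unless a flag such as re.IGNORECASE is given.
print(re.findall(r"ab", text, flags=re.IGNORECASE))
```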



Disjunction

● Disjunction (or'ing) of characters inside a regexp is done with the matching square brackets [ ]

● All characters inside [ ] are part of the disjunction

● Examples:

/[Aa]/ match either 'A' or 'a'

/[abc]/ match 'a' or 'b' or 'c'

/[cC]onference of [bB]irds/ match strings like 'conference of birds' or 'Conference of birds' or 'conference of Birds' or 'Conference of Birds'



Positive & Negative Ranges

● [ ] can be used in conjunction with – to specify character ranges

● Ranges can be negated with the special character ^ if it is the first character inside [ ]

● Examples:

/[A-Z]/ - match an uppercase letter

/[a-z]/ - match a lowercase letter

/[0-9]/ - match a digit

/[^A-Z]/ - match any character that is not an uppercase letter

/[^Ab]/ - match neither 'A' nor 'b'
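Bracket disjunctions and ranges behave the same way in Python's `re` module. The sketch below uses made-up sample strings to show a disjunction, two positive ranges, and a negated set.

```python
import re

title = re.compile(r"[cC]onference of [bB]irds")
variants = ["conference of birds", "Conference of Birds", "CONFERENCE OF BIRDS"]
# fullmatch succeeds only if the whole string matches the pattern.
print([bool(title.fullmatch(v)) for v in variants])

sample = "Room 42B, floor 7"
print(re.findall(r"[A-Z]", sample))       # uppercase letters
print(re.findall(r"[0-9]", sample))       # digits
print(re.findall(r"[^A-Za-z ]", sample))  # anything except letters and spaces
```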



Zero or One Occurrences

● Special character ? is used to specify zero or one occurrences of the preceding character

● Examples:

/Examples?/ - match 'Example' or 'Examples'

/colou?r/ - match 'colour' or 'color'



Kleene*, Kleene+, Wildcard .

● Special character + (aka Kleene +) specifies one or more occurrences of the regular expression that comes right before it

● Special character * (aka Kleene *) specifies zero or more occurrences of the regular expression that comes right before it

● Special character . (wildcard) specifies any single character

/a*/ match '', 'a', 'aa', 'aaa', 'aaaa', etc

/(ab)*/ match '', 'ab', 'abab', 'ababab', 'abababab', etc

/a+/ match 'a', 'aa', 'aaa', 'aaaa', etc

/(ab)+/ match 'ab', 'abab', 'ababab', 'abababab', etc

/beg.n/ match 'begin', 'began', 'begun'
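The quantifiers ?, *, and + and the wildcard can be checked against whole strings with `re.fullmatch`. The helper `matches` and the candidate lists are assumptions made for this sketch.

```python
import re


def matches(pattern, candidates):
    """Return the candidates that the pattern matches in their entirety."""
    return [c for c in candidates if re.fullmatch(pattern, c)]


print(matches(r"colou?r", ["color", "colour", "colouur"]))
print(matches(r"(ab)*", ["", "ab", "abab", "aba"]))
print(matches(r"(ab)+", ["", "ab", "abab"]))
print(matches(r"beg.n", ["begin", "began", "begun", "begn"]))
```

Using `fullmatch` rather than `search` matters here: `search` would report a match for `(ab)*` in any string, because the empty string occurs everywhere.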



Anchors

● Anchors are special characters that anchor a regexp to a specific position in the text it is matched against

● The anchors ^ and $ anchor regexps at the beginning and end of the text, respectively

/^Omar Khayyam/ matches 'Omar Khayyam' only at the beginning of the text

/Omar Khayyam$/ matches 'Omar Khayyam' only at the end of the text

/^Omar Khayyam$/ matches only 'Omar Khayyam'



Anchors

● Anchors \b and \B match at word boundaries and non-word boundaries, respectively

/\bthe\b/ matches 'the' in ' the ' but not in 'weather'

/\Bthe\B/ does not match the word 'the' but does match 'the' inside 'weather'
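The four anchors can be exercised with `re.search`, which looks for a match anywhere in the text. The sample sentence is made up for the example.

```python
import re

text = "Omar Khayyam wrote the Rubaiyat"

starts = re.search(r"^Omar Khayyam", text) is not None   # anchored at the start
ends = re.search(r"Omar Khayyam$", text) is not None     # anchored at the end
print(starts, ends)

# \b requires a word boundary on each side of 'the'.
print(re.search(r"\bthe\b", text) is not None)
print(re.search(r"\bthe\b", "weather") is not None)

# \B requires that 'the' sits strictly inside a word.
inner = re.search(r"\Bthe\B", "weather")
print(inner.span() if inner else None)
print(re.search(r"\Bthe\B", "the") is not None)
```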



Regexp Groups



Grouping Regular Expressions

● Regular expressions can be broken into components

● Such components are called groups

● A group match is a specific part of text that matches a specific regular subexpression in a larger expression

● Groups are specified with a pair of ( )



Referencing Group Matches

● Group matches are numbered $1, $2, $3, etc (some regexp notations use just numbers: 1, 2, 3, etc)

● These numbers are called backreferences because they refer specific text segments back to specific regular subexpressions

● Backreferences are used in substitutions



Sample Problem

Design a regular expression that parses email addresses into user name, host name, and host extension



Possible Solution

'([\w.]+)@([\w.]+)\.(com|net|org)'
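The proposed pattern can be tried out as follows. The addresses are invented, and the use of `fullmatch` (so that the whole string must be an address) is an assumption; the pattern itself is written without anchors.

```python
import re

EMAIL = re.compile(r"([\w.]+)@([\w.]+)\.(com|net|org)")


def parse_email(address):
    """Return (user name, host name, host extension), or None if no match."""
    m = EMAIL.fullmatch(address)
    return m.groups() if m else None


print(parse_email("jane.doe@mail.example.com"))
print(parse_email("jane@example.edu"))
```

The second address is rejected only because the pattern lists three extensions; accepting others would mean extending the final group.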



Example 01

$txt_01 = '1+1=2';
## Suppose we match /(\d)\+(\d)\=(\d)/ against $txt_01
## Then the variable alignment is as follows:
## /(\d)\+(\d)\=(\d)/
##  $1    $2    $3
## In other words, $1 is bound to '1',
## $2 is bound to '1' and $3 is bound to '2'



Example 02

$txt_01 = '1+1=2';
## Suppose we match /((\d)\+(\d)\=(\d))/ against $txt_01.
## Then the special variable alignment is
## /((\d)\+(\d)\=(\d))/
##  $1$2   $3    $4
## In other words, $1 is bound to '1+1=2',
## $2 is bound to '1' and $3 is bound to '1',
## $4 is bound to '2'.
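The group numbering in both examples can be checked in Python, where `group(1)`, `group(2)`, and so on play the role of Perl's $1, $2, etc. This is a sketch using the same text, '1+1=2'.

```python
import re

txt = "1+1=2"

flat = re.fullmatch(r"(\d)\+(\d)=(\d)", txt)
print(flat.groups())

# Groups are numbered by the position of their left parenthesis,
# so the outermost group comes first.
nested = re.fullmatch(r"((\d)\+(\d)=(\d))", txt)
print(nested.groups())
print(nested.group(1), nested.group(4))
```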



Substitutions

● Groups (subexpressions of larger regular expressions) are specified with special characters ( )

● Corresponding text segments that match subexpressions are retrieved with special variables $1, $2, $3, etc (some formalisms use just integers 1, 2, 3, etc)

● These variables are aligned with left parentheses: $1 is aligned with the 1st left parenthesis, $2 is aligned with the 2nd left parenthesis, etc
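A substitution that uses backreferences might look like the sketch below. The task (rewriting month/day/year dates as year-month-day) and the function name are invented for illustration; in Python's replacement strings the group matches are written \1, \2, \3.

```python
import re


def us_to_iso(text):
    """Rewrite dates of the form month/day/year as year-month-day."""
    return re.sub(r"(\d+)/(\d+)/(\d+)", r"\3-\1-\2", text)


print(us_to_iso("Lecture on 09/05/2013"))
```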



Patterns in Java



Patterns in Java

● Java has the package java.util.regex that allows the programmer to work with patterns, including regular expressions

● java.util.regex package has two major classes: Pattern and Matcher

● Pattern compiles patterns into finite state automata

● Matcher uses the compiled pattern to find substrings that match the pattern



Java Pattern Example

● This is how you compile a regular expression:

– Pattern pat = Pattern.compile("abc");

● This is how you create a matcher object, essentially an NFA that can be used to find matches:

– Matcher match = pat.matcher(str);

● This is how you can test for a match:

– match.matches() is a boolean predicate that is true when the entire string matches the pattern



Java Pattern Example

    // console is a java.io.Console, e.g. obtained with System.console()
    Console console = System.console();
    while (true) {
        Pattern pattern =
            Pattern.compile(console.readLine("%nEnter your regex: "));
        Matcher matcher =
            pattern.matcher(console.readLine("Enter input string to search: "));
        boolean found = false;
        // iterate through successive matches and print each match and its position
        while (matcher.find()) {
            console.format("I found the text \"%s\" starting at "
                + "index %d and ending at index %d.%n",
                matcher.group(), matcher.start(), matcher.end());
            found = true;
        }
        if (!found) {
            console.format("No match found.%n");
        }
    }


The source code is here:


https://github.com/VKEDCO/java/blob/master/regexp/RegexTestHarness.java

Java Pattern Example

● This is how you can call RegexTestHarness from the command line:

– java RegexTestHarness ^(0|1(01*0)*1)*$

● Then you can give it strings to match

● You can also redirect a file to RegexTestHarness:

– java RegexTestHarness reg-exp < numbers.txt



Sample Outputs

Enter your regex: a

Enter input string to search: bac

I found the text "a" starting at index 1 and ending at index 2.

Enter your regex: /\d/

Enter input string to search: bc12ab

No match found.

Enter your regex: \d

Enter input string to search: bc12ab

I found the text "1" starting at index 2 and ending at index 3.

I found the text "2" starting at index 3 and ending at index 4.



Pattern Matching

Real Intelligence or

Illusion of Intelligence?



Illusion of Intelligence?

● In the 1960's and 70's, AI researchers developed a number of programs that impressed many people with their 'intelligence'

● ELIZA could have a dialog with a person

● STUDENT solved high school algebra word problems

● MACSYMA solved problems in integral and differential calculus

● SAM understood simple stories



Pattern Matching

● What all these systems had in common was a technique that in symbolic AI came to be known as pattern matching

● The basic idea was to define a set of rules that look for patterns in inputs and produce outputs by substituting what they matched for something else

● Results were impressive to the uninitiated until the researchers themselves started explaining how they did what they did



ELIZA

● ELIZA was written by Joseph Weizenbaum in 1966

● The system was named after the heroine of the play Pygmalion, whom a good teacher taught how to speak English

● ELIZA was written not just to speak but to emulate a Rogerian psychologist

● The basic principle of Rogerian psychology is to never ask direct questions but let the patients tell their own story


http://en.wikipedia.org/wiki/Joseph_Weizenbaum


Sample Dialog with ELIZA

USER> Men are all alike

ELIZA> IN WHAT WAY

USER> They are always bugging us about something or another.

ELIZA> CAN YOU BE MORE SPECIFIC

USER> Well, my boyfriend made me come here.

ELIZA> YOUR BOYFRIEND MADE YOU COME HERE

USER> He says I'm depressed much of the time.

ELIZA> I AM SORRY TO HEAR YOU ARE DEPRESSED.

USER> It's true. I am unhappy.

ELIZA> DO YOU THINK COMING HERE WILL HELP YOU NOT BE UNHAPPY

USER> I need some help, that much is certain.



How ELIZA Works

● The program looks for specific patterns in the input and then prints the response on the basis of what it finds

● For example, when the program finds 'alike' or 'same', it may print 'IN WHAT WAY'

● When the program matches the pattern 'I need X', it may print 'WHAT WOULD IT MEAN IF YOU GOT X'

● For example, if the user types 'I need some help', ELIZA prints 'WHAT WOULD IT MEAN IF YOU GOT SOME HELP'

● The level of output sophistication depends on how elaborate the patterns are

● Try an online version of ELIZA here


http://www.manifestation.com/neurotoys/eliza.php3


ELIZA RULES

● A rule is a data structure that consists of a pattern and a set of responses

● Example:

RULE

PATTERN: 'X I want Y'

RESPONSES:

{ 'What would it mean if you got Y',

'Why do you want Y',

'Suppose you got Y soon' }



Rule Application

● Suppose the user types: Some day I want to read Farid Uddin Attar's Conference of Birds in the original

● The rule's pattern matches X to 'Some day' and Y to 'to read Farid Uddin Attar's Conference of Birds in the original'

● In symbolic pattern matching, the result of the match is referred to as a list of bindings: X is bound to 'Some day' and Y is bound to 'to read Farid Uddin Attar's Conference of Birds in the original'



Rule Application

● After the list of bindings is found, the program can then use it to produce the following responses:

– What would it mean if you got to read Farid Uddin Attar's Conference of Birds in the original

– Why do you want to read Farid Uddin Attar's Conference of Birds in the original

– Suppose you got to read Farid Uddin Attar's Conference of Birds in the original soon

● There are still two important problems to think about: what if multiple rules' patterns match, and how is a specific response chosen within a matched rule
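One way to sketch a rule and its application is to encode the pattern 'X I want Y' as a regular expression with named groups, so that the match yields the bindings as a dictionary. This encoding and the names `bindings` and `responses` are assumptions; the slides do not say how patterns are represented.

```python
import re

# Hypothetical encoding of the rule: X and Y become named groups.
PATTERN = re.compile(r"(?P<X>.*)I want (?P<Y>.*)")
RESPONSES = [
    "What would it mean if you got {Y}",
    "Why do you want {Y}",
    "Suppose you got {Y} soon",
]


def bindings(user_input):
    """Return the bindings {'X': ..., 'Y': ...}, or None if the rule does not apply."""
    m = PATTERN.fullmatch(user_input)
    if m is None:
        return None
    return {name: value.strip() for name, value in m.groupdict().items()}


def responses(user_input):
    """Instantiate every response template of the rule with the bindings."""
    found = bindings(user_input)
    if found is None:
        return []
    return [template.format(**found) for template in RESPONSES]


sentence = ("Some day I want to read Farid Uddin Attar's "
            "Conference of Birds in the original")
print(bindings(sentence))
for line in responses(sentence):
    print(line)
```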



ELIZA Algorithm

while ( True ) {

input = get_input_from_user();

applicable_rules = find_applicable_rules(input, rule_database);

chosen_rule = choose_applicable_rule(applicable_rules);

chosen_response = choose_response(chosen_rule);

chosen_response = substitute_matches(chosen_response);

print_response(chosen_response);

}
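The loop above could be fleshed out roughly as follows. Everything concrete here is an assumption made for the sketch: the three-rule database, the fallback reply, the policy of taking the first applicable rule, and the random choice among its responses.

```python
import random
import re

# Hypothetical rule database: (compiled pattern, response templates).
RULES = [
    (re.compile(r".*\b(alike|same)\b.*", re.I), ["IN WHAT WAY"]),
    (re.compile(r".*\bI need (?P<X>.*)", re.I),
     ["WHAT WOULD IT MEAN IF YOU GOT {X}"]),
    (re.compile(r".*\bI want (?P<Y>.*)", re.I),
     ["WHAT WOULD IT MEAN IF YOU GOT {Y}", "WHY DO YOU WANT {Y}"]),
]
DEFAULT = "PLEASE GO ON"   # assumed reply when no rule applies


def find_applicable_rules(user_input, rules):
    """Return (match, responses) for every rule whose pattern matches."""
    found = []
    for pattern, responses in rules:
        m = pattern.fullmatch(user_input)
        if m:
            found.append((m, responses))
    return found


def respond(user_input, rules=RULES, rng=random):
    applicable = find_applicable_rules(user_input.strip(), rules)
    if not applicable:
        return DEFAULT
    match, responses = applicable[0]      # assumed policy: first rule wins
    template = rng.choice(responses)      # assumed policy: random response
    bound = {name: value.upper().rstrip(".!? ")
             for name, value in match.groupdict().items()}
    return template.format(**bound)


def eliza_loop():
    """Interactive loop; stops at end of input."""
    while True:
        try:
            line = input("USER> ")
        except EOFError:
            break
        print("ELIZA>", respond(line))


print(respond("I need some help"))
```

Calling `eliza_loop()` starts an interactive session; `respond` is kept separate so that single replies can be checked without a console.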



References

● Ch 02, D. Jurafsky & J. Martin. Speech & Language Processing, Prentice Hall, ISBN 0-13-095069-6

● Weizenbaum, J. 1966. "Eliza - A Computer Program for the Study of Natural Language Communication Between Man and Machine." Communications of the ACM, 9(1): 36-45. (pdf)


http://www.cse.buffalo.edu/~rapaport/572/S02/weizenbaum.eliza.1966.pdf

