NLP (Fall 2013): NLP Areas, Chomsky Hierarchy, Finite State Automata, Regular Languages, Regular...

Natural Language Processing. NLP Areas, Chomsky Hierarchy, Finite State Machines & Regular Expressions, ELIZA: NL Dialogue through Pattern Matching. Vladimir Kulyukin, www.vkedco.blogspot.com

Upload: vladimir-kulyukin

Post on 07-May-2015


Category:

Technology



Page 1: NLP (Fall 2013): NLP Areas, Chomsky Hierarchy, Finite State Automata, Regular Languages, Regular Expressions, Pattern Matching, ELIZA - NL Dialogue with Computer

Natural Language Processing

www.vkedco.blogspot.com

NLP Areas, Chomsky Hierarchy, Finite State Machines & Regular Expressions, ELIZA: NL Dialogue through Pattern Matching

Vladimir Kulyukin


Outline

● NLP Areas
● Chomsky Hierarchy
● Finite State Automata, Regular Expressions, & Regular Languages
● ELIZA: Natural Language Dialogue through Pattern Matching


NLP Areas

● Morphology
● Phonology and Text-To-Speech (TTS)
● Syntactic Analysis
● Semantics
● Optical Character Recognition (OCR): its inclusion is somewhat questionable, because to many it is an area of computer vision
● Speech Recognition
● Natural Language Generation (NLG)


Chomsky Language Hierarchy

Where Should Natural Languages Be Placed?


Finite State Automata, Regular Expressions, & Regular Languages


DFA: Deterministic Finite Automata

• A DFA can be informally defined as a directed graph whose nodes are states and whose edges are transitions on specific symbols

• A DFA has a unique start state and a set (possibly empty) of final or accepting states

• A DFA processes the input string one symbol at a time; when the last symbol is read, the DFA reaches a state which is either final or not; if the state is final, the DFA accepts (recognizes) the string; if the state is not final, the DFA rejects the string


DFA: Formal Definition

A DFA is a 5-tuple, i.e., $M = (Q, \Sigma, \delta, q_0, F)$, where:

● $Q$ is a finite set of states;
● $\Sigma$ is an alphabet;
● $\delta$ is a transition function, $\delta : Q \times \Sigma \to Q$;
● $q_0 \in Q$ is the start state;
● $F \subseteq Q$ is the set of accepting (final) states.


Example DFA

[Figure: state diagram of a DFA with two states, q0 and q1, and transitions labeled a and b.]

All strings that end in a
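A minimal Java sketch of how such a DFA can be simulated. The state diagram did not survive in this transcript, so the transition table below is an assumption: q0 is the start state, q1 is the only accepting state, every a leads to q1, and every b leads to q0. The class and method names are hypothetical.

```java
import java.util.Map;
import java.util.Set;

class EndsInADfa {
    // Assumed transition function delta: state -> (symbol -> next state).
    // State 0 stands for q0 and state 1 stands for q1.
    static final Map<Integer, Map<Character, Integer>> DELTA = Map.of(
        0, Map.of('a', 1, 'b', 0),
        1, Map.of('a', 1, 'b', 0));

    static final int START = 0;                     // q0 (assumption)
    static final Set<Integer> ACCEPTING = Set.of(1); // {q1} (assumption)

    // Reads the input one symbol at a time and reports whether the last state is accepting.
    static boolean accepts(String input) {
        int state = START;
        for (char symbol : input.toCharArray()) {
            Integer next = DELTA.get(state).get(symbol);
            if (next == null) {
                return false; // symbol outside the alphabet {a, b}
            }
            state = next;
        }
        return ACCEPTING.contains(state);
    }

    public static void main(String[] args) {
        System.out.println(accepts("bba")); // true
        System.out.println(accepts("ab"));  // false
    }
}
```

Because the automaton is deterministic, the loop tracks exactly one current state; no search is needed.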


Nondeterminism

● Given an input, there can be more than one legal sequence of steps to process the input

● There can be various criteria to evaluate why one legal sequence is better than another

● The input is accepted if at least one legal sequence of moves ends up in an accepting state


Practical Implication of Nondeterminism

● The key computational implication of nondeterminism is the necessity of search

● In a typical scenario, a legal sequence of steps is a subset of some finite set

● Finding subsets brings us to the concept of power set


NFA: Definition

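An NFA is a 5-tuple $M = (Q, \Sigma, \delta, q_0, F)$, where:

● $Q$ is a finite set of states;
● $\Sigma$ is an alphabet, i.e., a finite set of symbols;
● $\delta : Q \times (\Sigma \cup \{\varepsilon\}) \to P(Q)$ is the transition function, where $P(Q)$ is the power set of $Q$;
● $q_0 \in Q$ is the start state;
● $F \subseteq Q$ is the set of accepting states.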


Example NFA

[Figure: state diagram of an NFA with states q0, q1, q2; q0 loops on a,b; q0 goes to q1 on a; q1 goes to q2 on a; q2 loops on a,b.]

This is the transition table of the above NFA:

      a           b
q0    {q0, q1}    {q0}
q1    {q2}        { }
q2    {q2}        {q2}

NFA vs. DFA

● NFAs are simpler to write because, in general, they have fewer states and allow for spontaneous (ε) transitions

● However, they are not more powerful than DFAs, i.e. they accept the same regular languages as DFAs

● For every NFA, one can construct a DFA that accepts the same language


Equivalence of NFAs and DFAs

● Basic insight: A DFA can keep track of the states that the equivalent NFA may be in after reading each symbol of the input

● Since the NFA may be in more than one state after reading a symbol, each state of the DFA must correspond to a subset of the NFA’s states

● The construction of an equivalent DFA from an NFA is called subset construction
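The basic insight can be sketched in Java by tracking the set of NFA states reached after each symbol, which is the subset construction carried out on the fly for one input. The transition table is the one from the Example NFA slide; the choice of q0 as the start state and q2 as the only accepting state is an assumption, and the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class NfaSubsetSimulation {
    // Transition table of the example NFA (no epsilon moves).
    // A missing entry means the empty set of next states.
    static final Map<String, Map<Character, Set<String>>> DELTA = Map.of(
        "q0", Map.of('a', Set.of("q0", "q1"), 'b', Set.of("q0")),
        "q1", Map.of('a', Set.of("q2")),
        "q2", Map.of('a', Set.of("q2"), 'b', Set.of("q2")));

    static final String START = "q0";                  // assumption
    static final Set<String> ACCEPTING = Set.of("q2"); // assumption

    // Each value of 'current' corresponds to one state of the equivalent DFA.
    static boolean accepts(String input) {
        Set<String> current = new HashSet<>(List.of(START));
        for (char symbol : input.toCharArray()) {
            Set<String> next = new HashSet<>();
            for (String state : current) {
                next.addAll(DELTA.get(state).getOrDefault(symbol, Set.of()));
            }
            current = next;
        }
        // Accept if at least one reachable state is accepting.
        for (String state : current) {
            if (ACCEPTING.contains(state)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(accepts("baab")); // true
        System.out.println(accepts("abab")); // false
    }
}
```

Under these assumptions the NFA accepts exactly the strings over {a, b} that contain two consecutive a's. A full subset construction would precompute every reachable subset once instead of recomputing subsets per input.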


Languages Accepted By DFAs & NFAs

● An immediate consequence of the subset construction algorithm is that non-determinism does not increase conceptual power (i.e., the same class of languages is recognized by DFAs and NFAs)

● Languages that are recognized by FSAs (DFAs and NFAs) are called regular


Regular Expressions


Regular Expressions

● Regular expressions (regexps) are one of the most useful tools in computer science

● NLP, as an area of computer science, has greatly benefited from regexps: they are used in phonology, morphology, text analysis, information extraction, & speech recognition

● As a student of NLP, you should learn to recognize whether a problem at hand can be solved via regexps


Three Language Operations

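1. If $L_1$ and $L_2$ are languages, then $L_1 L_2 = \{ x_1 x_2 \mid x_1 \in L_1 \text{ and } x_2 \in L_2 \}$.

2. If $L$ is a language, the Kleene closure of $L$ is $L^* = \{ x_1 x_2 \ldots x_n \mid n \geq 0,\ x_i \in L,\ 1 \leq i \leq n \}$.

3. If $L$ is a language, then $L^+$ is $L^+ = \{ x_1 x_2 \ldots x_n \mid n \geq 1,\ x_i \in L,\ 1 \leq i \leq n \}$.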
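For finite languages, the operations above can be tried out directly. The following Java sketch represents a language as a set of strings; since the Kleene closure is infinite for any language containing a non-empty string, it only computes the slice with n up to a given bound. The class and method names are hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;

class LanguageOps {
    // Concatenation L1 L2: every string of l1 followed by every string of l2.
    static Set<String> concat(Set<String> l1, Set<String> l2) {
        Set<String> result = new TreeSet<>();
        for (String x1 : l1) {
            for (String x2 : l2) {
                result.add(x1 + x2);
            }
        }
        return result;
    }

    // Finite slice of the Kleene closure: concatenations of n strings of l, for 0 <= n <= maxN.
    static Set<String> kleeneUpTo(Set<String> l, int maxN) {
        Set<String> power = new TreeSet<>(Set.of("")); // n = 0 contributes the empty string
        Set<String> result = new TreeSet<>(power);
        for (int n = 1; n <= maxN; n++) {
            power = concat(power, l); // strings made of exactly n pieces
            result.addAll(power);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(concat(Set.of("a", "b"), Set.of("c"))); // [ac, bc]
        System.out.println(kleeneUpTo(Set.of("ab"), 2));           // [, ab, abab]
    }
}
```

Dropping the empty string contributed by n = 0 (when it is not itself produced for some n ≥ 1) gives the corresponding slice of $L^+$.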


Atomic & Compound Regular Expressions

● Regular expressions can be divided into atomic and compound
● Atomic regular expressions are the basic building blocks out of which compound regular expressions are built
● There are typically three atomic regular expressions: unit strings (strings of one symbol), the empty string (the string of no symbols), and the empty set of strings


Atomic Regular Expressions

A regular expression $r$ is a string that denotes the language $L(r)$ over some alphabet $\Sigma$.

Three types of atomic regular expressions:

1. If $a \in \Sigma$, then $L(a) = \{a\}$.
2. $L(\varepsilon) = \{\varepsilon\}$.
3. $L(\emptyset) = \{\,\}$.


Compound Regular Expressions

Let $r_1$ and $r_2$ be regular expressions.

1. $(r_1 + r_2)$ is a regular expression. Then $L(r_1 + r_2) = L(r_1) \cup L(r_2)$.
2. $(r_1 r_2)$ is a regular expression. Then $L(r_1 r_2) = L(r_1) L(r_2)$.
3. $(r_1)^*$ is a regular expression. Then $L((r_1)^*) = (L(r_1))^*$.


Compiling FSAs from RegExps


Atomic Reg Exps → NFAs

$a$, where $a \in \Sigma$

[Figure: NFA for the atomic regular expression a.]

This NFA accepts only the string 'a' and nothing else


Atomic Reg Exps → NFAs

ε

[Figure: NFA for the atomic regular expression ε.]

This NFA accepts only the empty string


Atomic Reg Exps → NFAs

∅

[Figure: NFA for the atomic regular expression ∅.]

This NFA accepts the empty set of strings, i.e., no strings


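Compound Reg Exps → NFAs

$(r_1 + r_2)$, where $r_1$, $r_2$ are regular expressions

[Figure: NFA for (r1 + r2), with the NFA for r1 drawn above the NFA for r2 and the two joined by ε-transitions.]

Another notation, commonly used in regexp engines, is (r1 | r2); in other words, either r1 or r2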


Compound Reg Exps → NFAs

[Figure: NFA for (r1 + r2), with the NFA for r1 drawn above the NFA for r2 and the two joined by ε-transitions.]

This compound NFA accepts if and only if either the NFA for r1 (upper one) accepts or the NFA for r2 (lower one) accepts


Compound Reg Exps → NFAs

$(r_1)^+$, where $r_1$ is a regular expression

[Figure: NFA for (r1)+, built from the NFA for r1 and ε-transitions.]

This NFA accepts strings that consist of one or more consecutive matches of r1


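Compound Reg Exps → NFAs

$(r_1)^*$, where $r_1$ is a regular expression

[Figure: NFA for (r1)*, built from the NFA for r1 and ε-transitions.]

This NFA accepts strings that consist of zero or more consecutive matches of r1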


Defining Regular Expressions

Just in case you are interested or curious, below are a few links to my presentations on regular expressions in Python and Perl:

● http://vkedco.blogspot.com/2013/02/python-perl-regular-expression-match.html
● http://vkedco.blogspot.com/2013/02/python-amp-perl-py-pattern-compilation.html
● http://vkedco.blogspot.com/2013/02/python-perl-loose-tight-regular.html

Defining Regexps

● Regular expressions are placed inside a pair of matching forward slashes: / /
● Regular expressions are case-sensitive
● Examples:

/a/ - match all occurrences of character 'a'
/ab/ - match all occurrences of sequence 'ab'
/Ab cD/ - match all occurrences of sequence 'Ab cD'


Disjunction

● Disjunction (or'ing) of characters inside a regexp is done with matching square brackets [ ]
● All characters inside [ ] are part of the disjunction
● Examples:

/[Aa]/ - match either 'A' or 'a'
/[abc]/ - match 'a' or 'b' or 'c'
/[cC]onference of [bB]irds/ - match 'conference of birds', 'Conference of birds', 'conference of Birds', or 'Conference of Birds'


Positive & Negative Ranges

● [ ] can be used in conjunction with - to specify character ranges
● Ranges can be negated with the special character ^ if it is the first character inside [ ]
● Examples:

/[A-Z]/ - match an uppercase letter
/[a-z]/ - match a lowercase letter
/[0-9]/ - match a digit
/[^A-Z]/ - match any character that is not an uppercase letter
/[^Ab]/ - match any character that is neither 'A' nor 'b'


Zero or One Occurrences

● Special character ? is used to specify zero or one occurrences of the preceding character
● Examples:

/Examples?/ - match 'Example' or 'Examples'
/colou?r/ - match 'colour' or 'color'


Kleene*, Kleene+, Wildcard .

● Special character * (aka Kleene *) specifies zero or more occurrences of the regular expression that comes right before it
● Special character + (aka Kleene +) specifies one or more occurrences of the regular expression that comes right before it
● Special character . (wildcard) specifies any single character
● Examples:

/a*/ - match '', 'a', 'aa', 'aaa', 'aaaa', etc.
/(ab)*/ - match '', 'ab', 'abab', 'ababab', 'abababab', etc.
/a+/ - match 'a', 'aa', 'aaa', 'aaaa', etc.
/(ab)+/ - match 'ab', 'abab', 'ababab', 'abababab', etc.
/beg.n/ - match 'begin', 'began', 'begun'
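The operators from the last few slides (disjunction, ranges, ?, *, +, and the wildcard) can be checked quickly in Java. This is a small sketch using `Pattern.matches`, which tests whether the entire string matches; note that Java patterns are written without the surrounding / / delimiters. The class and helper names are hypothetical.

```java
import java.util.regex.Pattern;

class RegexSyntaxDemo {
    // True when the ENTIRE text matches the regex.
    static boolean full(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        System.out.println(full("[cC]onference of [bB]irds", "Conference of birds")); // true
        System.out.println(full("[^A-Z]", "b"));      // true: 'b' is not an uppercase letter
        System.out.println(full("colou?r", "color")); // true: 'u' occurs zero times
        System.out.println(full("(ab)*", ""));        // true: zero repetitions of 'ab'
        System.out.println(full("(ab)+", ""));        // false: at least one 'ab' is required
        System.out.println(full("beg.n", "begun"));   // true: '.' stands for 'u'
    }
}
```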


Anchors

● Anchors are special characters that anchor a regexp to a specific position in the text it is matched against
● The anchors ^ and $ anchor regexps at the beginning and end of the text, respectively
● Examples:

/^Omar Khayyam/ matches 'Omar Khayyam' only at the beginning of the text
/Omar Khayyam$/ matches 'Omar Khayyam' only at the end of the text
/^Omar Khayyam$/ matches only the text 'Omar Khayyam'


Anchors

● Anchors \b and \B match at word boundaries and non-word boundaries, respectively
● Examples:

/\bthe\b/ matches the word 'the' in ' the ' but finds no match in 'weather'
/\Bthe\B/ finds no match in the standalone word 'the' but matches the 'the' inside 'weather'
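The behavior of the four anchors ^, $, \b, and \B can be verified with a short Java sketch. It uses `Matcher.find`, which searches for the pattern anywhere in the text, so the anchors are what restrict where a match may occur. The sample texts, class name, and helper name are assumptions made for illustration.

```java
import java.util.regex.Pattern;

class AnchorDemo {
    // True when the regex matches somewhere in the text (substring search).
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(found("^Omar Khayyam", "the poet Omar Khayyam")); // false: not at the start
        System.out.println(found("Omar Khayyam$", "the poet Omar Khayyam")); // true: at the end
        System.out.println(found("\\bthe\\b", " the "));   // true: 'the' is a whole word
        System.out.println(found("\\bthe\\b", "weather")); // false: 'the' is inside a word
        System.out.println(found("\\Bthe\\B", "weather")); // true: no boundary on either side
        System.out.println(found("\\Bthe\\B", "the"));     // false: boundaries on both sides
    }
}
```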


Regexp Groups


Grouping Regular Expressions

● Regular expressions can be broken into components
● Such components are called groups
● A group match is a specific part of the text that matches a specific regular subexpression in a larger expression
● Groups are specified with a pair of ( )


Referencing Group Matches

● Group matches are numbered $1, $2, $3, etc (some regexp notations use just numbers: 1, 2, 3, etc)

● These numbers are called backreferences because they refer specific text segments back to specific regular subexpressions

● Backreferences are used in substitutions


Sample Problem

Design a regular expression that parses email addresses into user name, host name, and host extension


Possible Solution

'([\w.-]+)@([\w.-]+)\.(com|net|org)'
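A minimal Java sketch of how this regular expression can split an address into its three parts, one per group. It assumes the characters inside the brackets are word characters, dot, and hyphen, and it requires the whole string to match; the class name, method name, and sample address are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmailParser {
    // Group 1: user name, group 2: host name, group 3: host extension.
    static final Pattern EMAIL =
        Pattern.compile("([\\w.-]+)@([\\w.-]+)\\.(com|net|org)");

    // Returns {user, host, extension}, or null if the text is not an address of this shape.
    static String[] parse(String address) {
        Matcher m = EMAIL.matcher(address);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] parts = parse("jane.doe@cs.example.org");
        System.out.println(String.join(" | ", parts)); // jane.doe | cs.example | org
    }
}
```

The pattern only recognizes the three listed extensions, so it is a teaching example rather than a general email validator.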


Example 01

$txt_01 = '1+1=2';

## Suppose we match /(\d)\+(\d)\=(\d)/ against $txt_01

## Then the variable alignment is as follows:

## /(\d)\+(\d)\=(\d)/

## $1 $2 $3

## In other words, $1 is bound to '1',

## $2 is bound to '1' and $3 is bound to '2'


Example 02

$txt_01 = '1+1=2';

## Suppose we match /((\d)\+(\d)\=(\d))/ against $txt_01.

## Then the special variable alignment is

## /((\d)\+(\d)\=(\d))/

## $1$2 $3 $4

## In other words, $1 is bound to '1+1=2',

## $2 is bound to '1' and $3 is bound to '1',

## $4 is bound to '2'.
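The same two alignments can be reproduced in Java, where `Matcher.group(i)` plays the role of the special variable $i. This is a small sketch; the class and method names are hypothetical, and the = sign is written without a backslash because it needs no escaping.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumbering {
    // Returns the text bound to groups 1..n, or an empty array if the text does not match.
    static String[] groups(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        if (!m.matches()) {
            return new String[0];
        }
        String[] bound = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            bound[i - 1] = m.group(i); // group i corresponds to $i
        }
        return bound;
    }

    public static void main(String[] args) {
        // Example 01: three groups, numbered by their left parentheses.
        System.out.println(String.join(", ", groups("(\\d)\\+(\\d)=(\\d)", "1+1=2")));   // 1, 1, 2
        // Example 02: the outer group opens first, so it becomes group 1.
        System.out.println(String.join(", ", groups("((\\d)\\+(\\d)=(\\d))", "1+1=2"))); // 1+1=2, 1, 1, 2
    }
}
```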


Substitutions

● Groups (subexpressions of larger regular expressions) are specified with special characters ( )
● Corresponding text segments that match subexpressions are retrieved with special variables $1, $2, $3, etc. (some formalisms use just integers 1, 2, 3, etc.)
● These variables are aligned with left parentheses: $1 is aligned with the 1st left parenthesis, $2 is aligned with the 2nd left parenthesis, etc.
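A minimal Java sketch of a substitution that uses these variables. `String.replaceAll` accepts $1, $2, $3 in its replacement string, so the notation carries over directly; the class name, method name, and the particular rewrite (swapping the two sides of the equation) are chosen only for illustration.

```java
class SubstitutionDemo {
    // Rewrites every "x+y=z" (single digits) as "z=x+y".
    // In the replacement, $1, $2, $3 refer to the text matched by groups 1, 2, 3.
    static String swapSides(String text) {
        return text.replaceAll("(\\d)\\+(\\d)=(\\d)", "$3=$1+$2");
    }

    public static void main(String[] args) {
        System.out.println(swapSides("1+1=2"));      // 2=1+1
        System.out.println(swapSides("sum: 3+4=7")); // sum: 7=3+4
    }
}
```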


Patterns in Java


Patterns in Java

● Java has the package java.util.regex that allows the programmer to work with patterns, including regular expressions

● The java.util.regex package has two major classes: Pattern and Matcher
● Pattern compiles a regular expression into a compiled representation (conceptually, a finite state automaton)
● Matcher uses the compiled pattern to find substrings that match the pattern


Java Pattern Example

● This is how you compile a regular expression:
– Pattern pat = Pattern.compile("abc");
● This is how you create a matcher object, essentially an NFA that can be used to find matches:
– Matcher match = pat.matcher(str);
● This is how you can test for a match:
– match.matches() is a boolean predicate that is true when the entire input string matches the pattern


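Java Pattern Example

import java.io.Console;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTestHarness {
    public static void main(String[] args) {
        Console console = System.console();
        if (console == null) {
            System.err.println("No console.");
            System.exit(1);
        }
        while (true) {
            Pattern pattern = Pattern.compile(console.readLine("%nEnter your regex: "));
            Matcher matcher = pattern.matcher(console.readLine("Enter input string to search: "));
            boolean found = false;
            // find successive matches and print each one with its start and end positions
            while (matcher.find()) {
                console.format("I found match \"%s\" starting at " +
                               "index %d and ending at index %d.%n",
                               matcher.group(), matcher.start(), matcher.end());
                found = true;
            }
            if (!found) {
                console.format("No match found.%n");
            }
        }
    }
}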


source code is here


Java Pattern Example

● This is how you can call RegexTestHarness from the command line:
– java RegexTestHarness '^(0|1(01*0)*1)*$'
● Then you can give it strings to match
● You can also redirect a file to RegexTestHarness:
– java RegexTestHarness reg-exp < numbers.txt


Sample Outputs

Enter your regex: a
Enter input string to search: bac
I found match "a" starting at index 1 and ending at index 2.

Enter your regex: /\d/
Enter input string to search: bc12ab
No match found.

Enter your regex: \d
Enter input string to search: bc12ab
I found match "1" starting at index 2 and ending at index 3.
I found match "2" starting at index 3 and ending at index 4.


Pattern Matching

Real Intelligence or

Illusion of Intelligence?


Illusion of Intelligence?

● In the 1960s and 70s, AI researchers developed a number of programs that impressed many people with their 'intelligence'
● ELIZA could have a dialog with a person
● STUDENT solved high school algebra word problems
● MACSYMA solved problems in integral and differential calculus
● SAM understood simple stories


Pattern Matching

● What all these systems had in common was a technique that in symbolic AI came to be known as pattern matching
● The basic idea was to define a set of rules that look for patterns in inputs and produce outputs by substituting the matched text into predefined responses
● Results were impressive to the uninitiated until the researchers themselves started explaining how they did what they did


ELIZA

● ELIZA was written by Joseph Weizenbaum in 1966
● The system was named after the heroine of the play Pygmalion, whom a good teacher taught how to speak English
● ELIZA was written not just to converse but to emulate a Rogerian psychologist
● The basic principle of Rogerian psychology is to never ask direct questions but to let the patients tell their own story


Sample Dialog with ELIZA

USER> Men are all alike

ELIZA> IN WHAT WAY

USER> They are always bugging us about something or another.

ELIZA> CAN YOU BE MORE SPECIFIC

USER> Well, my boyfriend made me come here.

ELIZA> YOUR BOYFRIEND MADE YOU COME HERE

USER> He says I'm depressed much of the time.

ELIZA> I AM SORRY TO HEAR YOU ARE DEPRESSED.

USER> It's true. I am unhappy.

ELIZA> DO YOU THINK COMING HERE WILL HELP YOU NOT BE UNHAPPY

USER> I need some help, that much is certain.


How ELIZA Works

● The program looks for specific patterns in the input and then prints the response on the basis of what it finds

● For example, when the program finds 'alike' or 'same', it may print 'IN WHAT WAY'

● When the program matches the pattern 'I need X', it may print 'WHAT WOULD IT MEAN IF YOU GOT X'

● For example, if the user types 'I need some help', ELIZA prints 'WHAT WOULD IT MEAN IF YOU GOT SOME HELP'

● The level of output sophistication depends on how elaborate the patterns are

● Try an online version of ELIZA here


ELIZA RULES

● A rule is a data structure that consists of a pattern and a set of responses

● Example:

RULE

PATTERN: 'X I want Y'

RESPONSES:

{ 'What would it mean if you got Y',

'Why do you want Y',

'Suppose you got Y soon' }


Rule Application

● Suppose the user types: Some day I want to read Farid Uddin Attar's Conference of Birds in the original

● The rule's pattern matches X to 'Some day' and Y to 'to read Farid Uddin Attar's Conference of Birds in the original'

● In symbolic pattern matching, the result of the match is referred to as a list of bindings: X is bound to 'Some day' and Y is bound to 'to read Farid Uddin Attar's Conference of Birds in the original'


Rule Application

● After the list of bindings is found, the program can then use it to produce the following responses:

– What would it mean if you got to read Farid Uddin Attar's Conference of Birds in the original
– Why do you want to read Farid Uddin Attar's Conference of Birds in the original
– Suppose you got to read Farid Uddin Attar's Conference of Birds in the original soon

● There are still two important problems to think about: what happens if multiple rules' patterns match, and how a specific response is chosen within a matched rule


ELIZA Algorithm

while ( True ) {

input = get_input_from_user();

applicable_rules = find_applicable_rules(input, rule_database);

chosen_rule = choose_applicable_rule(applicable_rules);

chosen_response = choose_response(chosen_rule);

chosen_response = substitute_matches(chosen_response);

print_response(chosen_response);

}
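One way the loop body above could be realized is sketched below in Java, with regular expression groups standing in for the bindings X and Y. Everything here is an assumption made for illustration: the two rules, the $1/$2 placeholders in the response templates, the policy of using the first rule that matches, the `turn` parameter (assumed non-negative) that selects a response, and the fallback reply. It is not Weizenbaum's original implementation.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ElizaSketch {
    // A rule is a pattern plus response templates; $1, $2, ... stand for the pattern's groups.
    static final class Rule {
        final Pattern pattern;
        final List<String> responses;

        Rule(String regex, String... responses) {
            this.pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
            this.responses = List.of(responses);
        }
    }

    // Hypothetical rule database; the first rule mirrors the 'X I want Y' example.
    static final List<Rule> RULES = List.of(
        new Rule("(.*)\\bI want (.*)",
                 "What would it mean if you got $2",
                 "Why do you want $2",
                 "Suppose you got $2 soon"),
        new Rule(".*\\b(alike|same)\\b.*", "IN WHAT WAY"));

    // Uses the first rule whose pattern matches the whole input;
    // 'turn' selects which of that rule's responses is used.
    static String respond(String input, int turn) {
        for (Rule rule : RULES) {
            Matcher m = rule.pattern.matcher(input);
            if (m.matches()) {
                String response = rule.responses.get(turn % rule.responses.size());
                for (int i = m.groupCount(); i >= 1; i--) {
                    response = response.replace("$" + i, m.group(i)); // substitute the bindings
                }
                return response;
            }
        }
        return "PLEASE GO ON"; // hypothetical fallback when no rule applies
    }

    public static void main(String[] args) {
        System.out.println(respond("Men are all alike", 0)); // IN WHAT WAY
        System.out.println(respond("I want some help", 0));  // What would it mean if you got some help
    }
}
```

Taking the first matching rule and cycling through responses by turn number are the simplest possible answers to the two open problems of rule choice and response choice; a rule ranking or random choice would fit the same structure.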


References

● Ch 02, D. Jurafsky & J. Martin. Speech & Language Processing, Prentice Hall, ISBN 0-13-095069-6

● Weizenbaum, J. 1966. "Eliza - A Computer Program for the Study of Natural Language Communication Between Man and Machine." Communications of the ACM, 9(1): 36-45.
