cpsc 503 computational linguistics

06/14/22 CPSC503 Spring 2004 1 CPSC 503 Computational Linguistics Lecture 4 Giuseppe Carenini


Page 1: CPSC 503 Computational Linguistics

04/19/23 CPSC503 Spring 2004 1

CPSC 503 Computational Linguistics

Lecture 4

Giuseppe Carenini

Page 2: CPSC 503 Computational Linguistics

Today 1/23

• Finite State Transducers (FSTs) and Morphological Parsing

• Stemming (Porter Stemmer)

Page 3: CPSC 503 Computational Linguistics

Computational problems in Morphology

• Recognition: recognize whether a string is an English word (FSA)

• Parsing/Generation: word ↔ stem, class, lexical features

e.g., lies ↔ lie +N +PL

lies ↔ lie +V +3SG

• Stemming: word → stem

Page 4: CPSC 503 Computational Linguistics

Finite State Transducers (FSTs)

• FSA cannot help …

• Need to extend FSA

– Add another tape

– Add extra symbols to the transitions

– On one tape we read “cats”, on the other we write “cat +N +PL” (or vice versa)

Page 5: CPSC 503 Computational Linguistics

FSTs as translators

[Figure: an FST as a translator, with its two directions labeled generation and parsing]

Page 6: CPSC 503 Computational Linguistics

Example

Transitions (as a translator):

• l:l means read an l on one tape and write an l on the other (or vice versa)

• +N:ε means read a +N symbol on one tape and write nothing on the other (or vice versa)

• +PL:s means read +PL and write an s (or vice versa)

• …

[Figure: FST with states q0 to q7 and arcs l:l, i:i, e:e, +N:ε, +PL:s, +V:ε, +3SG:s]
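A minimal sketch of this machine run in the generation direction (lexical to surface). The figure did not survive extraction, so the wiring of q0 to q7 below is an assumption read off the arc labels; `ARCS`, `FINAL` and `generate` are illustrative names, not part of the lecture.

```python
# Arcs as (state, lexical_symbol) -> (surface_output, next_state).
# ASSUMPTION: which state each arc leaves and enters is guessed from the labels.
EPS = ""  # epsilon: write nothing on the surface tape

ARCS = {
    ("q0", "l"): ("l", "q1"),
    ("q1", "i"): ("i", "q2"),
    ("q2", "e"): ("e", "q3"),
    ("q3", "+N"): (EPS, "q4"),
    ("q4", "+PL"): ("s", "q5"),
    ("q3", "+V"): (EPS, "q6"),
    ("q6", "+3SG"): ("s", "q7"),
}
FINAL = {"q5", "q7"}


def generate(lexical):
    """Read lexical symbols, write the surface string; None if rejected."""
    state, out = "q0", []
    for sym in lexical:
        if (state, sym) not in ARCS:
            return None
        written, state = ARCS[(state, sym)]
        out.append(written)
    return "".join(out) if state in FINAL else None


print(generate(["l", "i", "e", "+N", "+PL"]))
print(generate(["l", "i", "e", "+V", "+3SG"]))
```

Both lexical strings map to the same surface string, which is why the parsing direction is nondeterministic.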

Page 7: CPSC 503 Computational Linguistics

Examples (as a translator)

[Figure: two lexical/surface tape pairs. In one the surface tape holds "l i e s"; in the other the lexical tape holds "l i e +V +3SG".]

Page 8: CPSC 503 Computational Linguistics

Examples (as a recognizer and a generator)

[Figure: two lexical/surface tape pairs built from lexical "l i e +V +3SG" and surface "l i e s".]

Page 9: CPSC 503 Computational Linguistics

FST definition

• Q: a finite set of states

• I, O: input and output alphabets (which may include ε)

• Σ: a finite alphabet of complex symbols i:o, i ∈ I and o ∈ O

• q0: the start state

• F: a set of accept/final states (F ⊆ Q)

• A transition relation δ that maps Q × Σ to Q
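One way to hold this definition in code, as a sketch only. The class `FST`, its field names, and the "cat" machine are hypothetical choices for illustration; δ is stored as a dictionary, so this version is deterministic.

```python
from dataclasses import dataclass

EPS = ""  # epsilon, the empty symbol


@dataclass
class FST:
    states: set   # Q
    start: str    # q0
    finals: set   # F, a subset of Q
    delta: dict   # maps (state, (i, o)) to the next state

    def alphabet(self):
        """Sigma: the complex symbols i:o that label some arc."""
        return {label for (_, label) in self.delta}

    def invert(self):
        """Swap the two tapes: every i:o label becomes o:i."""
        flipped = {(q, (o, i)): r for (q, (i, o)), r in self.delta.items()}
        return FST(self.states, self.start, self.finals, flipped)


# Hypothetical machine relating lexical "cat +N +PL" to surface "cats".
cat = FST(
    states={"q0", "q1", "q2", "q3", "q4", "q5"},
    start="q0",
    finals={"q5"},
    delta={
        ("q0", ("c", "c")): "q1",
        ("q1", ("a", "a")): "q2",
        ("q2", ("t", "t")): "q3",
        ("q3", ("+N", EPS)): "q4",
        ("q4", ("+PL", "s")): "q5",
    },
)
print(sorted(cat.invert().alphabet()))
```

Inversion is what makes the "or vice versa" in the slides work: the same machine serves for parsing and for generation.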

Page 10: CPSC 503 Computational Linguistics

FST can be used as…

• Translators: input one string from I, output another from O (or vice versa)

• Recognizers: input a string from I × O

• Generators: output a string from I × O

Terminology warning!
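The recognizer and generator readings can be sketched over sequences of i:o pairs. The machine is the "lie" example again, with an assumed state wiring; `recognize` and `generate_all` are illustrative names.

```python
# Arcs keyed by state: list of ((lexical, surface), next_state).
# ASSUMPTION: the state wiring is guessed from the arc labels.
ARCS = {
    "q0": [(("l", "l"), "q1")],
    "q1": [(("i", "i"), "q2")],
    "q2": [(("e", "e"), "q3")],
    "q3": [(("+N", ""), "q4"), (("+V", ""), "q6")],
    "q4": [(("+PL", "s"), "q5")],
    "q6": [(("+3SG", "s"), "q7")],
}
FINAL = {"q5", "q7"}


def recognize(pairs):
    """Recognizer: accept or reject a sequence of lexical:surface pairs."""
    states = {"q0"}
    for pair in pairs:
        states = {nxt for q in states
                  for label, nxt in ARCS.get(q, []) if label == pair}
    return bool(states & FINAL)


def generate_all(state="q0", prefix=()):
    """Generator: yield every accepted pair sequence (the machine is acyclic)."""
    if state in FINAL:
        yield prefix
    for label, nxt in ARCS.get(state, []):
        yield from generate_all(nxt, prefix + (label,))


print(recognize([("l", "l"), ("i", "i"), ("e", "e"), ("+N", ""), ("+PL", "s")]))
for seq in generate_all():
    print(seq)
```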

Page 11: CPSC 503 Computational Linguistics

A step back: FSA can represent morphological knowledge

• Lexicon: list of stems and affixes, together with basic information about them

• Morphotactics: the rules governing the ordering of morphemes

• Orthographic rules: model changes in morphemes when they combine

Page 12: CPSC 503 Computational Linguistics

FSA for inflectional morphology of plural

[Figure: FSA with arcs for some regular nouns and some irregular nouns]
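A small sketch of such a recognizer, with arcs labeled by morpheme classes and a toy lexicon. The state names, class names, and word lists are assumptions made for illustration, since the figure itself is not recoverable.

```python
# Toy lexicon: morpheme class -> members (hypothetical word lists).
LEXICON = {
    "reg-noun": {"cat", "dog", "fox"},
    "irreg-sg-noun": {"goose", "mouse"},
    "irreg-pl-noun": {"geese", "mice"},
    "plural": {"s"},
}

# Morphotactics: (state, morpheme class) -> next state.
ARCS = {
    ("q0", "reg-noun"): "q1",
    ("q0", "irreg-sg-noun"): "q2",
    ("q0", "irreg-pl-noun"): "q2",
    ("q1", "plural"): "q2",
}
FINAL = {"q1", "q2"}


def accepts(word, state="q0"):
    """True if the word can be segmented into morphemes along a path to a final state."""
    if word == "":
        return state in FINAL
    for (src, cls), dst in ARCS.items():
        if src != state:
            continue
        for morpheme in LEXICON[cls]:
            if word.startswith(morpheme) and accepts(word[len(morpheme):], dst):
                return True
    return False


for w in ["cats", "geese", "gooses", "foxs"]:
    print(w, accepts(w))
```

Note that this sketch accepts "foxs": it encodes the lexicon and morphotactics only, with no orthographic rules.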

Page 13: CPSC 503 Computational Linguistics

FST for inflectional morphology of plural

[Figure: FST with arcs for some regular nouns and some irregular nouns, including an o:i arc]

Page 14: CPSC 503 Computational Linguistics

Examples

[Figure: two lexical/surface tape pairs. In one the lexical tape holds "c a t +N +PL"; in the other the surface tape holds "m i c e".]

Page 15: CPSC 503 Computational Linguistics

Problems/Challenges

• Ambiguity: one word can correspond to multiple structures

• Spelling changes: may occur when two morphemes are combined (inflectionally)

e.g. butterfly + -s -> butterflies

Page 16: CPSC 503 Computational Linguistics

Ambiguity

• ND recognition: multiple paths through a machine may lead to an accept state (it did not matter which path was actually traversed)

• In ND parsing the path to an accept state does matter: different paths represent different parses, and different outputs will result

[Figure: FST with states q0 to q7 and arcs l:l, i:i, e:e, +N:ε, +PL:s, +V:ε, +PL:s]
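To make the point about paths concrete, here is a sketch of nondeterministic parsing that returns every lexical string whose path reaches an accept state. It assumes the "lie" machine with the verb branch labeled +3SG:s, as in the earlier example; the state wiring and the name `parse` are illustrative.

```python
# Arcs keyed by state: (lexical_symbol, surface_string, next_state).
# ASSUMPTION: state wiring guessed from the arc labels; "" stands for epsilon.
ARCS = {
    "q0": [("l", "l", "q1")],
    "q1": [("i", "i", "q2")],
    "q2": [("e", "e", "q3")],
    "q3": [("+N", "", "q4"), ("+V", "", "q6")],
    "q4": [("+PL", "s", "q5")],
    "q6": [("+3SG", "s", "q7")],
}
FINAL = {"q5", "q7"}


def parse(surface, state="q0", pos=0, lexical=()):
    """Depth-first search over all paths; each accepting path is one parse."""
    results = []
    if pos == len(surface) and state in FINAL:
        results.append(lexical)
    for lex, surf, nxt in ARCS.get(state, []):
        if surface.startswith(surf, pos):  # epsilon arcs consume nothing
            results += parse(surface, nxt, pos + len(surf), lexical + (lex,))
    return results


for analysis in parse("lies"):
    print(" ".join(analysis))
```

The search terminates because this machine has no cycles; a general implementation would need to guard against ε-loops.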

Page 17: CPSC 503 Computational Linguistics

Ambiguity: more complex example

• What’s the right parse for Unionizable?

– Union-ize-able

– Un-ion-ize-able

• Each would represent a valid path through an FST for derivational morphology.

Page 18: CPSC 503 Computational Linguistics

Deal with Morphological Ambiguity

• There are a number of ways to deal with this problem

– Simply take the first output found

– Find all the possible outputs (all paths) and return them all (without choosing)

– Bias the search so that only one or a few likely paths are explored

Then part-of-speech tagging to choose

Page 19: CPSC 503 Computational Linguistics

Spelling Changes

When morphemes are combined inflectionally, the spelling at the boundaries may change. Examples:

• E-insertion: when -s is added to a word, -e is inserted if the word ends in -s, -z, -sh, -ch, -x (e.g., kiss, miss, waltz, bush, watch, rich, box)

• Y-replacement: when -s or -ed is added to a word ending in -y, -y changes to -ie or -i respectively (e.g., try, butterfly)
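The two rules can be sketched as plain string functions. This is not a transducer, just a direct statement of the rules; `add_s` and `add_ed` are hypothetical names, and restricting Y-replacement to a consonant followed by -y is an added assumption (so that "toy" gives "toys").

```python
import re


def add_s(word):
    """Attach -s, applying E-insertion or Y-replacement where the rule fires."""
    if re.search(r"(s|z|sh|ch|x)$", word):
        return word + "es"            # E-insertion
    if re.search(r"[^aeiou]y$", word):
        return word[:-1] + "ies"      # Y-replacement: y -> ie before -s
    return word + "s"


def add_ed(word):
    """Attach -ed, applying Y-replacement where the rule fires."""
    if re.search(r"[^aeiou]y$", word):
        return word[:-1] + "ied"      # Y-replacement: y -> i before -ed
    return word + "ed"


for w in ["kiss", "waltz", "bush", "watch", "box", "try", "butterfly", "cat"]:
    print(w, "->", add_s(w))
print("try", "->", add_ed("try"))
```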

Page 20: CPSC 503 Computational Linguistics

Solution: Multi-Tape Machines

• Add intermediate tape

• Use the output of one tape machine as the input to the next

• Add intermediate symbols

– ^ morpheme boundary

– # word boundary

Page 21: CPSC 503 Computational Linguistics

Multi-Level Tape Machines

• FST-1 translates between the lexical and the intermediate level

• FST-2 handles the spelling changes (due to one rule) to the surface tape

[Figure: FST-1 and FST-2 stacked between the lexical, intermediate and surface tapes]

Page 22: CPSC 503 Computational Linguistics

FST-1 for inflectional morphology of plural

[Figure: FST-1 with arcs for some regular nouns and some irregular nouns (including o:i), a +PL:^s# arc (expanded as +PL:^ ε:s ε:#), and word-final # arcs]
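The mapping FST-1 computes can be illustrated with a lookup function. This is a stand-in for the transducer rather than an implementation of it; the word lists, the `+SG` feature, and the name `lexical_to_intermediate` are assumptions for illustration.

```python
# Hypothetical toy lexicon.
REGULAR = {"fox", "cat", "dog"}
IRREGULAR_PL = {"mouse": "mice", "goose": "geese"}


def lexical_to_intermediate(stem, features):
    """Map a lexical analysis to the intermediate tape (^ = morpheme boundary, # = word boundary)."""
    known = stem in REGULAR or stem in IRREGULAR_PL
    if features == ("+N", "+PL"):
        if stem in IRREGULAR_PL:
            return IRREGULAR_PL[stem] + "#"   # irregular plural, no ^s
        if stem in REGULAR:
            return stem + "^s#"               # +PL:^s#
    if features == ("+N", "+SG") and known:   # ASSUMPTION: +SG is not shown on the slide
        return stem + "#"
    return None


print(lexical_to_intermediate("fox", ("+N", "+PL")))
print(lexical_to_intermediate("mouse", ("+N", "+PL")))
```

The intermediate form still contains the boundary symbols; turning "fox^s#" into "foxes" is left to the spelling-rule machine.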

Page 23: CPSC 503 Computational Linguistics

Example

[Figure: two lexical/intermediate tape pairs, with lexical tapes "f o x +N +PL" and "m o u s e +N +PL"]

Page 24: CPSC 503 Computational Linguistics

FST-2 for E-insertion (Intermediate to Surface)

E-insertion: when -s is added to a word, -e is inserted if the word ends in -s, -z, -sh, -ch, -x

…as in fox^s# <-> foxes

[Figure: E-insertion transducer; its arcs include #:ε]
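The intermediate-to-surface step can be approximated with a regular-expression rewrite. This sketch stands in for FST-2 and does not reproduce its state diagram; `intermediate_to_surface` is a hypothetical name.

```python
import re


def intermediate_to_surface(form):
    """Apply E-insertion, then delete the boundary symbols ^ and #."""
    # Insert e at a morpheme boundary that follows s, z, sh, ch or x
    # and is itself followed by the suffix s at the end of the word.
    form = re.sub(r"(s|z|sh|ch|x)\^(?=s#)", r"\1e", form)
    # Remaining boundary symbols map to nothing (^:ε, #:ε).
    return form.replace("^", "").replace("#", "")


for form in ["fox^s#", "cat^s#", "box^ing#", "mice#"]:
    print(form, "->", intermediate_to_surface(form))
```

Because the rewrite only fires before "s#", a form such as box^ing# passes through with just its boundary symbols removed.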

Page 25: CPSC 503 Computational Linguistics

Examples

[Figure: two intermediate/surface tape pairs, with intermediate tapes "f o x ^ s #" and "b o x ^ i n g #"]

Page 26: CPSC 503 Computational Linguistics

Where are we?

Page 27: CPSC 503 Computational Linguistics

Final Scheme: Part 1

Page 28: CPSC 503 Computational Linguistics

Final Scheme: Part 2

Page 29: CPSC 503 Computational Linguistics

Intersection (T1,T2)

δ3((xa,ya), i:c) = (xb,yb) iff

– δ1(xa, i:c) = xb AND

– δ2(ya, i:c) = yb

• States of T1 and T2 : Q1 and Q2

• States of intersection: Q1 x Q2

• Transitions of T1 and T2 : δ1, δ2

• Transitions of intersection : δ3
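The definition translates almost directly into a product construction. In this sketch each δ is a dictionary from (state, (i, c)) to the next state, so the machines are assumed deterministic; the function name and the toy machines are illustrative. In general FSTs are not closed under intersection, so this is safest for machines without ε labels.

```python
def intersect(delta1, delta2):
    """delta3((xa, ya), i:c) = (xb, yb) iff delta1(xa, i:c) = xb and delta2(ya, i:c) = yb."""
    delta3 = {}
    for (xa, label1), xb in delta1.items():
        for (ya, label2), yb in delta2.items():
            if label1 == label2:
                delta3[((xa, ya), label1)] = (xb, yb)
    return delta3


# Hypothetical toy machines sharing only the x:y arc.
t1 = {("a0", ("x", "y")): "a1", ("a1", ("z", "z")): "a1"}
t2 = {("b0", ("x", "y")): "b1", ("b1", ("w", "w")): "b1"}
print(intersect(t1, t2))
```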

Page 30: CPSC 503 Computational Linguistics

Composition(T1,T2)

δ3((xa,ya), i:o) = (xb,yb) iff

– There exists c such that

– δ1(xa, i:c) = xb AND

– δ2(ya, c:o) = yb

• States of T1 and T2 : Q1 and Q2

• States of composition : Q1 x Q2

• Transitions of T1 and T2 : δ1, δ2

• Transitions of composition : δ3
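Composition follows the same pattern, except that the output symbol c of T1 must match the input symbol of T2. The sketch below follows the slide's definition literally and ignores the extra care needed for ε labels; the dictionaries and the function name are illustrative, and if several intermediate symbols c yield the same i:o label the later one overwrites the earlier.

```python
def compose(delta1, delta2):
    """delta3((xa, ya), i:o) = (xb, yb) iff some c has delta1(xa, i:c) = xb and delta2(ya, c:o) = yb."""
    delta3 = {}
    for (xa, (i, c1)), xb in delta1.items():
        for (ya, (c2, o)), yb in delta2.items():
            if c1 == c2:
                delta3[((xa, ya), (i, o))] = (xb, yb)
    return delta3


# Hypothetical toy machines: T1 rewrites a to b, T2 rewrites b to c.
t1 = {("a0", ("a", "b")): "a1"}
t2 = {("b0", ("b", "c")): "b1"}
print(compose(t1, t2))
```

The composed machine maps a to c in one step, which is how a cascade of transducers can be collapsed into a single one.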

Page 31: CPSC 503 Computational Linguistics

Other important applications of FSTs in NLP

• Segmentation: finding word boundaries in text (?!)

• Shallow syntactic parsing: e.g., find only noun phrases

• Dialogue Act Disambiguation: “right” (IUI-04)

• Phonological Rules….

Page 32: CPSC 503 Computational Linguistics

FSTs in Practice

• Install an FST package…… (pointers)

• Describe your “formal language” (e.g., lexicon, morphotactics and rules) in a RegExp-like notation (pointer)

• Your specification is compiled into an FST

NOTE: FSTs for the morphology of a natural language may have 10^5 – 10^7 states and arcs

Page 33: CPSC 503 Computational Linguistics

Computational problems in Morphology

• Recognition: recognize whether a string is an English word (FSA)

• Parsing/Generation (FST): word ↔ stem, class, lexical features

e.g., lies ↔ lie +N +PL

lies ↔ lie +V +3SG

• Stemming: word → stem

Page 34: CPSC 503 Computational Linguistics

Stemmer

• E.g. the Porter algorithm (Appendix B), which is based on a series of sets of simple cascaded rewrite rules:

– ATIONAL → ATE (relational → relate)

– ING → ε if stem contains vowel (motoring → motor)

• Cascade of rules applied to: computerization

– ization → -ize: computerize

– ize → ε: computer

• Errors occur:

– organization → organ, doing → doe, university → universe
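A toy cascade using only the four rules quoted above shows the mechanism. This is not the Porter algorithm, which has many more rules and conditions; the grouping into three stages and the names `STAGES` and `stem` are assumptions for illustration.

```python
VOWELS = set("aeiou")


def has_vowel(stem_part):
    return any(ch in VOWELS for ch in stem_part)


# Each stage is a list of (suffix, replacement, condition on the remaining stem).
# ASSUMPTION: the stage grouping is invented for this sketch.
STAGES = [
    [("ing", "", has_vowel)],
    [("ational", "ate", None), ("ization", "ize", None)],
    [("ize", "", None)],
]


def stem(word):
    """Run the word through each stage; at most one rule fires per stage."""
    for stage in STAGES:
        for suffix, replacement, condition in stage:
            if word.endswith(suffix):
                base = word[: len(word) - len(suffix)]
                if condition is None or condition(base):
                    word = base + replacement
                break  # first matching suffix decides the stage
    return word


for w in ["relational", "motoring", "computerization", "organization", "sing"]:
    print(w, "->", stem(w))
```

Even this small cascade reproduces one of the errors listed on the slide: "organization" comes out as "organ".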

Page 35: CPSC 503 Computational Linguistics

Stemming mainly used in Information Retrieval

1. Run a stemmer on the documents to be indexed

2. Run a stemmer on users' queries

3. Compute similarity between queries and documents (based on stems they contain)
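The three steps can be sketched end to end. The suffix-stripping `toy_stem` is a hypothetical stand-in for a real stemmer, and Jaccard overlap of stem sets is one assumed choice of similarity; the slide does not say which measure is used.

```python
def toy_stem(word):
    """Hypothetical stand-in for a stemmer: strip a final -ing or -s from longer words."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def stems(text):
    """Steps 1 and 2: the set of stems in a document or a query."""
    return {toy_stem(w) for w in text.lower().split()}


def similarity(query, document):
    """Step 3 (ASSUMPTION): Jaccard overlap between the two stem sets."""
    q, d = stems(query), stems(document)
    return len(q & d) / len(q | d) if q | d else 0.0


docs = {
    "d1": "stemming algorithms for retrieval",
    "d2": "finite state transducers",
}
for name, text in docs.items():
    print(name, similarity("stemming algorithm", text))
```

Without stemming, "algorithm" in the query would not match "algorithms" in d1.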

Page 36: CPSC 503 Computational Linguistics

Porter as an FST

• The original exposition of the Porter stemmer did not describe it as a transducer but…

– Each stage is a separate transducer

– The stages can be composed to get one big transducer

Page 37: CPSC 503 Computational Linguistics

Linguistic Knowledge → Formalisms and associated Algorithms

• (English) Morphology → State Machines (no prob.)

– Finite State Automata (and Regular Expressions)

– Finite State Transducers

• Syntax → Rule systems (and prob. version), e.g., (Prob.) Context-Free Grammars

• Semantics → Logical formalisms (First-Order Logics)

• Pragmatics, Discourse and Dialogue → AI planners

Page 38: CPSC 503 Computational Linguistics

Next Time

• Intro to probability and information theory

• On your preferred source read about

– Conditional probability

– Bayes’ rule

– Independence

– Entropy

– Conditional Entropy and Mutual Information

Page 39: CPSC 503 Computational Linguistics

Lexical to Intermediate Level

Page 40: CPSC 503 Computational Linguistics

FST for inflectional morphology of plural

[Figure: FST with arcs for some regular nouns and some irregular nouns]

Page 41: CPSC 503 Computational Linguistics

Foxes

Page 42: CPSC 503 Computational Linguistics

FST Review

• FSTs allow us to take an input and deliver a structure based on it

• Or… take a structure and create a surface form

• Or take a structure and create another structure

Page 43: CPSC 503 Computational Linguistics

Linguistic Knowledge → Formalisms and associated Algorithms

• (English) Morphology → State Machines (no prob.)

– Finite State Automata (and Regular Expressions)

– Finite State Transducers

• Syntax → Rule systems (and prob. version), e.g., (Prob.) Context-Free Grammars

• Semantics → Logical formalisms (First-Order Logics)

• Pragmatics, Discourse and Dialogue → AI planners

Page 44: CPSC 503 Computational Linguistics

Review

• In many applications it’s convenient to decompose the problem into a set of cascaded transducers where

– The output of one feeds into the input of the next.

Page 45: CPSC 503 Computational Linguistics

English Spelling Changes

• We use one machine to transduce between the lexical and the intermediate level, and another to handle the spelling changes to the surface tape

Page 46: CPSC 503 Computational Linguistics

FST can be used as…

• Translators: input one string (a sequence from I), output another one (a sequence from O)… or vice versa

• Recognizers: input both strings (a sequence from IxO)

• Generator: output both strings (a sequence from IxO)