CPSC 503 Computational Linguistics
04/19/23 CPSC503 Spring 2004 1
CPSC 503 Computational Linguistics
Lecture 4
Giuseppe Carenini
Today 1/23
• Finite State Transducers (FSTs) and Morphological Parsing
• Stemming (Porter Stemmer)
Computational problems in Morphology
• Recognition: recognize whether a string is an English word (FSA)
• Parsing/Generation: word <-> stem, class, lexical features
– e.g., lies <-> lie +N +PL
– e.g., lies <-> lie +V +3SG
• Stemming: word -> stem
Finite State Transducers (FSTs)
• FSA cannot help ……
• Need to extend FSA
– Add another tape
– Add extra symbols to the transitions
– On one tape we read “cats”, on the other we write “cat +N +PL” (or vice versa)
FSTs as translators
[Figure: an FST used as a translator; labels: generation, parsing]
Example
Transitions (as a translator):
• l:l means read an l on one tape and write an l on the other (or vice versa)
• +N:ε means read a +N symbol on one tape and write nothing on the other (or vice versa)
• +PL:s means read +PL and write an s (or vice versa)
• …
[Figure: FST with states q0 to q7 and arcs labelled l:l, i:i, e:e, +N:ε, +PL:s, +V:ε, +3SG:s]
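The transition labels above can be tried out in a few lines of Python. This is only a sketch: the arc list is reconstructed from the labels on the slide, the way states q0 to q7 are wired together is an assumption, and the empty string stands in for ε. The hypothetical function `generate` runs the machine in the generation direction (lexical tape to surface tape).

```python
# Arcs of a toy FST for "lie": (state, lexical symbol, surface symbol, next state).
# The wiring of q0..q7 is an assumption; "" plays the role of ε.
ARCS = [
    ("q0", "l", "l", "q1"),
    ("q1", "i", "i", "q2"),
    ("q2", "e", "e", "q3"),
    ("q3", "+N", "", "q4"),
    ("q4", "+PL", "s", "q5"),
    ("q3", "+V", "", "q6"),
    ("q6", "+3SG", "s", "q7"),
]
FINAL = {"q5", "q7"}


def generate(lexical):
    """Translate a list of lexical symbols into a surface string, or None."""
    state, out = "q0", []
    for sym in lexical:
        for src, lex, surf, dst in ARCS:
            if src == state and lex == sym:
                state = dst
                out.append(surf)
                break
        else:
            return None  # no arc reads this symbol from the current state
    return "".join(out) if state in FINAL else None
```

Calling `generate(["l", "i", "e", "+V", "+3SG"])` walks q0, q1, q2, q3, q6, q7 and writes the surface string for the verb reading; a sequence that stops at q3 is rejected because q3 is not a final state.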
Examples (as a translator)
lexical: l i e +V +3SG
surface: l i e s
Examples (as a recognizer and a generator)
lexical: l i e +V +3SG
surface: l i e s
FST definition
• Q: a finite set of states
• I, O: input and output alphabets (which may include ε)
• Σ: a finite alphabet of complex symbols i:o, i ∈ I and o ∈ O
• Q0: the start state
• F: a set of accept/final states (F ⊆ Q)
• A transition relation δ that maps Q × Σ to Q
FST can be used as…
• Translators: input one string from I, output another from O (or vice versa)
• Recognizers: input a string from I×O
• Generators: output a string from I×O
Terminology warning!
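The recognizer and generator readings can be sketched in Python as follows. The machine is the toy "lie" example; its state wiring is an assumption, and `recognize` and `generate_all` are hypothetical names. A string over I×O is represented as a list of (i, o) pairs, with the empty string standing in for ε.

```python
# Toy "lie" machine as arcs: (state, input symbol, output symbol, next state).
# The wiring of the states is an assumption; "" stands in for ε.
ARCS = [
    ("q0", "l", "l", "q1"), ("q1", "i", "i", "q2"), ("q2", "e", "e", "q3"),
    ("q3", "+N", "", "q4"), ("q4", "+PL", "s", "q5"),
    ("q3", "+V", "", "q6"), ("q6", "+3SG", "s", "q7"),
]
FINAL = {"q5", "q7"}


def recognize(pairs):
    """Recognizer mode: accept or reject a sequence of i:o pairs."""
    states = {"q0"}
    for i, o in pairs:
        states = {dst for src, a, b, dst in ARCS
                  if src in states and (a, b) == (i, o)}
    return bool(states & FINAL)


def generate_all(state="q0", prefix=()):
    """Generator mode: yield every accepted sequence of i:o pairs.

    Terminates only because this toy machine has no cycles.
    """
    if state in FINAL:
        yield list(prefix)
    for src, a, b, dst in ARCS:
        if src == state:
            yield from generate_all(dst, prefix + ((a, b),))
```

In recognizer mode both tapes are given and the machine only answers yes or no; in generator mode nothing is given and the machine enumerates the pairs it relates.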
A step back: FSA can represent morphological knowledge
• Lexicon: list of stems and affixes, together with basic information about them
• Morphotactics: the rules governing the ordering of morphemes
• Orthographic rules: model changes in morphemes when they combine
FSA for inflectional morphology of plural
[Figure: FSA covering some regular nouns and some irregular nouns]
FST for inflectional morphology of plural
[Figure: FST covering some regular nouns and some irregular nouns; one arc is labelled o:i]
Examples
[Figure: two lexical/surface tape pairs; recoverable symbols: lexical c a t +N +PL, surface m i c e]
Problems/Challenges
• Ambiguity: one word can correspond to multiple structures
• Spelling changes: may occur when two morphemes are combined (inflectionally)
e.g. butterfly + -s -> butterflies
Ambiguity
• ND recognition: multiple paths through a machine may lead to an accept state (it didn’t matter which path was actually traversed)
• In ND parsing the path to an accept state does matter: different paths represent different parses, and different outputs will result
[Figure: FST with states q0 to q7 and arcs labelled l:l, i:i, e:e, +N:ε, +PL:s, +V:ε, +PL:s]
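A small Python sketch makes the point about ND parsing concrete. It assumes the toy "lie" machine with the verb arc labelled +3SG:s (as in the earlier example slide) and an assumed state wiring; the hypothetical function `parse` follows every path that matches the surface string and collects one lexical output per accepting path.

```python
# Toy "lie" machine: (state, lexical symbol, surface symbol, next state).
# State wiring and the +3SG label on the verb arc are assumptions; "" is ε.
ARCS = [
    ("q0", "l", "l", "q1"), ("q1", "i", "i", "q2"), ("q2", "e", "e", "q3"),
    ("q3", "+N", "", "q4"), ("q4", "+PL", "s", "q5"),
    ("q3", "+V", "", "q6"), ("q6", "+3SG", "s", "q7"),
]
FINAL = {"q5", "q7"}


def parse(surface, state="q0", pos=0, out=()):
    """Return every lexical analysis of `surface`, one per accepting path."""
    results = []
    if pos == len(surface) and state in FINAL:
        results.append(" ".join(out))
    for src, lex, surf, dst in ARCS:
        # An ε on the surface side ("" here) matches without consuming input.
        if src == state and surface.startswith(surf, pos):
            results += parse(surface, dst, pos + len(surf), out + (lex,))
    return results
```

For the surface string "lies" the search reaches two accept states and so returns two analyses, one nominal and one verbal; a recognizer would have reported only that the string is accepted.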
Ambiguity: more complex example
• What’s the right parse for Unionizable?
– Union-ize-able
– Un-ion-ize-able
• Each would represent a valid path through an FST for derivational morphology.
Deal with Morphological Ambiguity
• There are a number of ways to deal with this problem
– Simply take the first output found
– Find all the possible outputs (all paths) and return them all (without choosing)
– Bias the search so that only one or a few likely paths are explored
Then part-of-speech tagging to choose
Spelling Changes
When morphemes are combined inflectionally the spelling at the boundaries may change
Examples
• E-insertion: when –s is added to a word, -e is inserted if word ends in –s, -z, -sh, -ch, -x (e.g., kiss, miss, waltz, bush, watch, rich, box)
• Y-replacement: when –s or -ed are added to a word ending with a –y, -y changes to –ie or –i respectively (e.g., try, butterfly)
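The two rules can be written down directly as string operations. The sketch below is not an FST; it is a plain Python function (hypothetical name `add_suffix`) that applies the rules exactly as stated on the slide, so it inherits their simplifications.

```python
import re


def add_suffix(word, suffix):
    """Attach -s or -ed to a word, applying the two spelling rules as stated."""
    if suffix == "s" and re.search(r"(s|z|sh|ch|x)$", word):
        return word + "es"            # E-insertion
    if word.endswith("y"):
        if suffix == "s":
            return word[:-1] + "ies"  # Y-replacement: y -> ie
        if suffix == "ed":
            return word[:-1] + "ied"  # Y-replacement: y -> i
    return word + suffix              # no spelling change
```

Because the Y-replacement rule as stated does not look at the letter before the y, this sketch would also rewrite words such as "play"; a fuller rule would need that extra condition.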
Solution: Multi-Tape Machines
• Add intermediate tape
• Use the output of one tape machine as the input to the next
• Add intermediate symbols
– ^ morpheme boundary
– # word boundary
Multi-Level Tape Machines
• FST-1 translates between the lexical and the intermediate level
• FST-2 handles the spelling changes (due to one rule) to the surface tape
[Figure: cascade of FST-1 and FST-2]
FST-1 for inflectional morphology of plural
[Figure: FST-1 covering some regular nouns and some irregular nouns; arc labels include o:i, +PL:^s#, +PL:^, ε:s, ε:#, and #]
Example
[Figure: two lexical/intermediate tape pairs; the lexical tapes read f o x +N +PL and m o u s e +N +PL]
FST-2 for E-insertion (Intermediate to Surface)
E-insertion: when –s is added to a word, -e is inserted if word ends in –s, -z, -sh, -ch, -x
…as in fox^s# <-> foxes
[Figure: FST-2; one arc is labelled #:ε]
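The intermediate-to-surface mapping can be imitated with a regular-expression substitution. This is a stand-in for FST-2, not the transducer itself: the hypothetical function `intermediate_to_surface` covers only the E-insertion rule and then deletes the boundary symbols ^ and #.

```python
import re


def intermediate_to_surface(form):
    """Map an intermediate form such as 'fox^s#' to a surface string."""
    # E-insertion: insert e at the morpheme boundary when the stem ends in
    # s, z, sh, ch or x and the suffix is s.
    form = re.sub(r"(s|z|sh|ch|x)\^s#", r"\1es#", form)
    # The boundary symbols have no surface realisation, so drop them.
    return form.replace("^", "").replace("#", "")
```

Forms where the rule's context does not match, such as cat^s# or box^ing#, pass through unchanged apart from losing their boundary symbols.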
Examples
[Figure: two intermediate/surface tape pairs; the intermediate tapes read f o x ^ s # and b o x ^ i n g #]
Where are we?
Final Scheme: Part 1
Final Scheme: Part 2
Intersection (T1,T2)
δ3((xa,ya), i:c) = (xb,yb) iff
– δ1(xa, i:c) = xb AND
– δ2(ya, i:c) = yb
• States of T1 and T2 : Q1 and Q2
• States of intersection: Q1 x Q2
• Transitions of T1 and T2 : δ1, δ2
• Transitions of intersection : δ3
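The definition of δ3 translates almost word for word into Python. The sketch assumes each transducer is given as a dictionary from (state, (i, c)) to the next state, which is a representation chosen here for illustration; `intersect` is a hypothetical name, and start and final states are left out.

```python
def intersect(delta1, delta2):
    """Build δ3 over state pairs from two transition dicts.

    Each dict maps (state, (i, c)) to the next state. A pair state moves
    only when both machines have a transition with the same i:c label.
    """
    delta3 = {}
    for (xa, label1), xb in delta1.items():
        for (ya, label2), yb in delta2.items():
            if label1 == label2:
                delta3[((xa, ya), label1)] = (xb, yb)
    return delta3
```

The nested loop visits every pair of transitions, which is fine for a sketch but would be replaced by indexing on labels for machines of realistic size.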
Composition(T1,T2)
δ3((xa,ya), i:o) = (xb,yb) iff
– There exists c such that
– δ1(xa, i:c) = xb AND
– δ2(ya, c:o) = yb
• States of T1 and T2 : Q1 and Q2
• States of composition : Q1 x Q2
• Transitions of T1 and T2 : δ1, δ2
• Transitions of composition : δ3
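Composition can be sketched the same way. The dictionary representation and the name `compose` are assumptions made for illustration; since several intermediate symbols c may link the same i and o, the sketch stores a set of target pairs, and it ignores the special handling that ε symbols need in a full implementation.

```python
def compose(delta1, delta2):
    """Build δ3 for the composition of two transducers.

    delta1 maps (state, (i, c)) to a state and delta2 maps (state, (c, o))
    to a state. The pair state moves on i:o whenever some intermediate
    symbol c links a transition of T1 to a transition of T2.
    """
    delta3 = {}
    for (xa, (i, c1)), xb in delta1.items():
        for (ya, (c2, o)), yb in delta2.items():
            if c1 == c2:
                delta3.setdefault(((xa, ya), (i, o)), set()).add((xb, yb))
    return delta3
```

The intermediate symbol c disappears from the result: the composed machine reads i and writes o in a single step.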
Other important applications of FSTs in NLP
• Segmentation: finding word boundaries in text (?!)
• Shallow syntactic parsing: e.g., find only noun phrases
• Dialogue Act Disambiguation: “right” (IUI-04)
• Phonological Rules….
FSTs in Practice
• Install an FST package…… (pointers)
• Describe your “formal language” (e.g., lexicon, morphotactics and rules) in a RegExp-like notation (pointer)
• Your specification is compiled into an FST
NOTE: FSTs for the morphology of a natural language may have 10^5 – 10^7 states and arcs
Computational problems in Morphology
• Recognition: recognize whether a string is an English word (FSA)
• Parsing/Generation (FST): word <-> stem, class, lexical features
– e.g., lies <-> lie +N +PL
– e.g., lies <-> lie +V +3SG
• Stemming: word -> stem
Stemmer
• E.g. the Porter algorithm (Appendix B), which is based on a series of sets of simple cascaded rewrite rules:
– ATIONAL -> ATE (relational -> relate)
– ING -> ε if stem contains vowel (motoring -> motor)
• Cascade of rules applied to: computerization
– ization -> -ize computerize
– ize -> ε computer
• Errors occur:
– organization -> organ, doing -> doe, university -> universe
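The cascade idea can be illustrated with a toy rule list in Python. This is not the Porter algorithm, which has more steps and extra conditions on the stem; it contains only the four rules quoted above, applied once each in a fixed order, and the names `RULES` and `stem` are hypothetical.

```python
import re

# (suffix pattern, replacement, condition on the remaining stem or None)
RULES = [
    (r"ational$", "ate", None),
    (r"ization$", "ize", None),
    (r"ize$", "", None),
    (r"ing$", "", r"[aeiou]"),  # only if the remaining stem contains a vowel
]


def stem(word):
    """Apply each rewrite rule in order, feeding the result to the next rule."""
    for pattern, replacement, condition in RULES:
        match = re.search(pattern, word)
        if match:
            remaining = word[:match.start()]
            if condition is None or re.search(condition, remaining):
                word = remaining + replacement
    return word
```

Running it on "computerization" reproduces the two-step cascade on the slide, and running it on "organization" reproduces one of the listed errors, because the rules look only at suffixes and not at meaning.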
Stemming mainly used in Information Retrieval
1. Run a stemmer on the documents to be indexed
2. Run a stemmer on users’ queries
3. Compute similarity between queries and documents (based on stems they contain)
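The three steps can be sketched in Python under some loud assumptions: `crude_stem` is a placeholder suffix stripper rather than a real stemmer, and Jaccard overlap of stem sets is just one possible similarity measure; the slide does not say which one is used.

```python
def crude_stem(word):
    """Placeholder stemmer (an assumption): strip a few common suffixes."""
    for suffix in ("ization", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def stems(text):
    """Steps 1 and 2: reduce a document or a query to its set of stems."""
    return {crude_stem(w) for w in text.lower().split()}


def similarity(query, document):
    """Step 3: Jaccard overlap between the two stem sets (an assumed measure)."""
    q, d = stems(query), stems(document)
    return len(q & d) / len(q | d) if q | d else 0.0


docs = ["computers and computing", "motoring holidays"]
ranked = sorted(docs, key=lambda d: similarity("computer", d), reverse=True)
```

The query "computer" matches the first document only because "computers" is reduced to the same stem; without stemming the two strings would not overlap at all.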
Porter as an FST
• The original exposition of the Porter stemmer did not describe it as a transducer but…
– Each stage is a separate transducer
– The stages can be composed to get one big transducer
Linguistic Knowledge Formalisms and associated Algorithms
[Figure: formalisms shown alongside the kinds of linguistic knowledge]
• Formalisms: State Machines (no prob.): Finite State Automata (and Regular Expressions), Finite State Transducers; Rule systems (and prob. version), e.g., (Prob.) Context-Free Grammars; Logical formalisms (First-Order Logics); AI planners
• Linguistic knowledge: (English) Morphology; Syntax; Semantics; Pragmatics, Discourse and Dialogue
Next Time
• Intro to probability and information theory
• On your preferred source read about
– Conditional probability
– Bayes’ rule
– Independence
– Entropy
– Conditional Entropy and Mutual Information
Lexical to Intermediate Level
FST for inflectional morphology of plural
Some regular-nouns
Some irregular-nouns
Foxes
FST Review
• FSTs allow us to take an input and deliver a structure based on it
• Or… take a structure and create a surface form
• Or take a structure and create another structure
Review
• In many applications it’s convenient to decompose the problem into a set of cascaded transducers where
– The output of one feeds into the input of the next.
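A cascade of this kind can be sketched with two ordinary Python functions standing in for the two transducers. Both are simplifications made for illustration: `lexical_to_intermediate` handles regular nouns only, and `intermediate_to_surface` implements only E-insertion. All three function names are hypothetical.

```python
import re


def lexical_to_intermediate(lexical):
    """Stand-in for FST-1 (regular nouns only): 'fox +N +PL' -> 'fox^s#'."""
    stem, *features = lexical.split()
    return stem + ("^s" if "+PL" in features else "") + "#"


def intermediate_to_surface(form):
    """Stand-in for FST-2: E-insertion, then delete the boundary symbols."""
    form = re.sub(r"(s|z|sh|ch|x)\^s#", r"\1es#", form)
    return form.replace("^", "").replace("#", "")


def cascade(lexical):
    """Feed the output of the first stage into the input of the second."""
    return intermediate_to_surface(lexical_to_intermediate(lexical))
```

Each stage stays simple because it only has to know about one level of representation; the spelling rule never sees the +N and +PL features, only the boundary symbols.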
English Spelling Changes
• We use one machine to transduce between the lexical and the intermediate level, and another to handle the spelling changes to the surface tape
FST can be used as…
• Translators: input one string (a sequence from I), output another one (a sequence from O)……or vice versa
• Recognizers: input both strings (a sequence from I×O)
• Generators: output both strings (a sequence from I×O)