TRANSCRIPT
Ling 570
Day #3
Stemming, Probabilistic Automata, Markov Chains/Model
MORPHOLOGY AND FSTS
FST as Translator
FR: ce bill met de le baume sur une blessure
EN: this bill puts balm on a sore wound
Last Class
FST Application Examples
• Case folding: "He said" → "he said"
• Tokenization: "He ran." → " He ran . "
• POS tagging: "They can fish" → PRO VERB NOUN
FST Application Examples
• Pronunciation: B AH T EH R → B AH DX EH R
• Morphological generation: Fox +s → Foxes
• Morphological analysis: cats → cat +s
Roadmap
• Motivation: Representing words
• A little (mostly English) morphology
• Stemming
The Lexicon
• Goal: Represent all the words in a language
• Approach?
  – Enumerate all words?
    • Doable for English
      – Typical for ASR (Automatic Speech Recognition)
      – English is morphologically relatively impoverished
    • Other languages? Wildly impractical
      » Turkish: 40,000 forms/verb
        uygarlaştıramadıklarımızdanmışsınızcasına
        “(behaving) as if you are among those whom we could not civilize”
Morphological Parsing
• Goal: Take a surface word form and generate a linguistic structure of component morphemes
• A morpheme is the minimal meaning-bearing unit in a language.
  – Stem: the morpheme that forms the central meaning unit in a word
  – Affix: prefix, suffix, infix, circumfix
    • Prefix: e.g., possible → impossible
    • Suffix: e.g., walk → walking
    • Infix: e.g., hingi → humingi (Tagalog)
    • Circumfix: e.g., sagen → gesagt (German)
Surface Variation & Morphology
• Searching (a la Bing) for documents about:
  – Televised sports
• Many possible surface forms:
  – Televised, television, televise, …
  – Sports, sport, sporting, …
• How can we match?
  – Convert surface forms to a common base form
    • Stemming or morphological analysis
Two Perspectives
• Stemming:
  – writing → write (or writ)
  – Beijing → Beije
• Morphological Analysis:
  – writing → write+V+prog
  – cats → cat+N+pl
  – writes → write+V+3rdpers+Sg
Stemming
• Simple type of morphological analysis
• Supports matching using base form
  – e.g., television, televised, televising → televise
• Most popular: Porter stemmer
• Task: Given surface form, produce base form
  – Typically removes suffixes
• Model:
  – Rule cascade
  – No lexicon!
Stemming
• Used in many NLP/IR applications
• For building equivalence classes:
  – Connect, Connected, Connecting, Connection, Connections → same class; suffixes irrelevant
• Porter Stemmer: simple and efficient
  – Website: http://www.tartarus.org/~martin/PorterStemmer
  – On patas: ~/dropbox/12-13/570/porter
Porter Stemmer
• Rule cascade:
  – Rule form: (condition) PATT1 → PATT2
    • E.g., stem contains vowel: ING → ε
    • ATIONAL → ATE
  – Rule partial order:
    • Step 1a: -s
    • Step 1b: -ed, -ing
    • Steps 2-4: derivational suffixes
    • Step 5: cleanup
• Pros: Simple, fast, buildable for a variety of languages
• Cons: Overaggressive and underaggressive
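The rule cascade can be sketched in a few lines. This is a toy illustration of the rule format above, not the full Porter algorithm (which has measure-based conditions and many more rules):

```python
import re

def has_vowel(stem):
    """The 'stem contains a vowel' condition used by rules like ING -> epsilon."""
    return re.search(r"[aeiou]", stem) is not None

def porter_like_stem(word):
    """A toy cascade in the spirit of Porter's Steps 1a, 1b, and 2."""
    w = word.lower()
    # Step 1a (plural -s): SSES -> SS, IES -> I, S -> epsilon
    if w.endswith("sses"):
        w = w[:-2]
    elif w.endswith("ies"):
        w = w[:-2]
    elif w.endswith("s") and not w.endswith("ss"):
        w = w[:-1]
    # Step 1b: (stem contains vowel) ING -> epsilon, ED -> epsilon
    if w.endswith("ing") and has_vowel(w[:-3]):
        w = w[:-3]
    elif w.endswith("ed") and has_vowel(w[:-2]):
        w = w[:-2]
    # A Step-2-style derivational rule: ATIONAL -> ATE
    if w.endswith("ational"):
        w = w[:-7] + "ate"
    return w
```

Note how the vowel condition blocks "sing" → "s" while still allowing "walking" → "walk", and how the rules apply in cascade order.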
STEMMING & EVAL
Evaluating Performance
• Measures of stemming performance rely on metrics used in IR:
  – Precision: the proportion of selected items the system got right
    • precision = tp / (tp + fp)
    • # of correct answers / # of answers given
  – Recall: the proportion of the target items the system selected
    • recall = tp / (tp + fn)
    • # of correct answers / # of possible correct answers
  – Rule of thumb: as precision increases, recall drops, and vice versa
• These metrics are widely adopted in statistical NLP
Precision and Recall
• Take a given stemming task:
  – Suppose there are 100 words that could be stemmed
  – A stemmer gets 52 of these right (tp)
  – But it inadvertently stems 10 others (fp)
• Precision = 52 / (52 + 10) ≈ .84
• Recall = 52 / (52 + 48) = .52
• Note: it is easy to get a precision of 1.0. Why?
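The slide's arithmetic as a small helper; F-measure (used in the tokenizer comparison that follows) is the harmonic mean of precision and recall:

```python
def precision_recall_f1(tp, fp, fn):
    """IR-style metrics as defined on the slide; F1 is their harmonic mean."""
    precision = tp / (tp + fp)  # correct answers / answers given
    recall = tp / (tp + fn)     # correct answers / possible correct answers
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# The slide's example: 52 correct stems (tp), 10 spurious stems (fp),
# and 48 of the 100 stemmable words missed (fn).
p, r, f = precision_recall_f1(tp=52, fp=10, fn=48)
```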
[Slide: token-by-token output of a baseline and four tokenizers on the sentence “After coming close to a partial settlement a year ago, shareholders who filed civil suits against Ivan F. Boesky & Co. L.P. …”; the tokenizers differ mainly on punctuation and clitics (F., Co., L.P., Drexel's, plaintiffs').]

Tokenizer     Precision   Recall      F-Measure
Tokenizer 1   0.827586    0.888889    0.858238
Tokenizer 2   0.961538    0.925926    0.943732
Tokenizer 3   0.928571    0.962963    0.945767
Tokenizer 4   1           1           1
WEIGHTED AUTOMATA & MARKOV CHAINS
PFA Definition
• A Probabilistic Finite-State Automaton is a 6-tuple:
  – A set of states Q
  – An alphabet Σ
  – A set of transitions δ ⊆ Q × Σ × Q
  – Initial state probabilities: I: Q → R+
  – Transition probabilities: P: δ → R+
  – Final state probabilities: F: Q → R+
PFA Recap
• Subject to constraints:
  – Σ_{q∈Q} I(q) = 1
  – For every q ∈ Q: F(q) + Σ_{(q,a,q′)∈δ} P(q,a,q′) = 1
• Computing sequence probabilities: multiply the initial-state probability, the transition probabilities along the path, and the final-state probability (summing over paths when more than one path accepts the string)
PFA Example
• Example:
  – I(q0) = 1; I(q1) = 0
  – F(q0) = 0; F(q1) = 0.2
  – P(q0,a,q1) = 1; P(q1,b,q1) = 0.8
  – P(ab^n) = I(q0) · P(q0,a,q1) · P(q1,b,q1)^n · F(q1) = 0.8^n · 0.2
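The example can be checked numerically with a direct transcription of the formula:

```python
def p_ab_n(n):
    """P(a b^n) under the example PFA: start in q0 with probability 1,
    read 'a' to reach q1, loop on 'b' with probability 0.8,
    and stop in q1 with probability 0.2."""
    I_q0 = 1.0
    P_q0_a_q1 = 1.0
    P_q1_b_q1 = 0.8
    F_q1 = 0.2
    return I_q0 * P_q0_a_q1 * (P_q1_b_q1 ** n) * F_q1
```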
Markov Chain
• A Markov Chain is a special case of a PFA in which the input sequence uniquely determines which states the automaton will go through.
• Markov Chains cannot represent inherently ambiguous problems
  – But they can assign probabilities to unambiguous sequences
Markov Chain for Words
Markov Chain for Pronunciation
• Observations: 0/1
Markov Chain for Walking through Groningen
Markov Chain: “First-order observable Markov Model”
• A set of states:
  – Q = q1, q2, …, qN; the state at time t is qt
• Transition probabilities:
  – A set of probabilities A = a01, a02, …, an1, …, ann
  – Each aij represents the probability of transitioning from state i to state j
  – The set of these is the transition probability matrix A
• Distinguished start and final states: q0, qF
• The current state depends only on the previous state
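The last bullet is the first-order Markov assumption; spelled out as an equation:

```latex
P(q_i \mid q_1 q_2 \ldots q_{i-1}) = P(q_i \mid q_{i-1})
```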
Markov Models
• The parameters of an MM can be arranged in matrices
• The A-matrix for the set of transition probabilities:

      [ p11  p12  …  p1j ]
  A = [ p21  p22  …  p2j ]
      [ …                ]

• What's missing? Starting probabilities.
Markov Models
• Exercise:
  – Build the transition probability matrix over this set of data:
      The duck died.
      The car killed the duck.
      The duck died under her car.
      We duck under the car.
      We retrieve the poor duck.
  – Build the starting probability matrix
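One way to do the exercise mechanically: tally starting words and word bigrams, then normalize the counts. This is a maximum-likelihood sketch; it lowercases words, drops punctuation, and normalizes over observed bigrams only:

```python
from collections import Counter, defaultdict

# The exercise corpus from the slide (lowercased; final periods dropped).
sentences = [
    "the duck died",
    "the car killed the duck",
    "the duck died under her car",
    "we duck under the car",
    "we retrieve the poor duck",
]

start_counts = Counter()
trans_counts = defaultdict(Counter)
for s in sentences:
    words = s.split()
    start_counts[words[0]] += 1
    for prev, cur in zip(words, words[1:]):
        trans_counts[prev][cur] += 1

# Maximum-likelihood estimates: counts normalized per previous word.
start_probs = {w: c / len(sentences) for w, c in start_counts.items()}
trans_probs = {
    prev: {cur: c / sum(nexts.values()) for cur, c in nexts.items()}
    for prev, nexts in trans_counts.items()
}
```

For instance, "the" starts 3 of the 5 sentences, so its starting probability is 0.6, and 3 of the 6 bigrams beginning with "the" continue with "duck", giving P(duck | the) = 0.5.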
Markov Models
• Exercise:
  – Given your model, what's the probability of each of the following sentences?
      The duck died under her car.
      We duck under the car.
      The duck under the car.
      We retrieve killed the duck.
      We the poor duck died.
      We retrieve the poor duck under the car.
  – For a given start state (The, We), what's the most likely string (of the above)?
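Sentence probabilities can then be read off as a product of the start probability and the transition probabilities. The counts below are tallied by hand from the five exercise sentences and normalized over observed bigrams only, ignoring where sentences end; that is one of several reasonable conventions for this exercise:

```python
# A first-order Markov chain hand-estimated from the exercise corpus.
start_probs = {"the": 3/5, "we": 2/5}
trans_probs = {
    "the": {"duck": 3/6, "car": 2/6, "poor": 1/6},
    "duck": {"died": 2/3, "under": 1/3},
    "died": {"under": 1.0},
    "car": {"killed": 1.0},
    "killed": {"the": 1.0},
    "under": {"her": 1/2, "the": 1/2},
    "her": {"car": 1.0},
    "we": {"duck": 1/2, "retrieve": 1/2},
    "retrieve": {"the": 1.0},
    "poor": {"duck": 1.0},
}

def sentence_prob(sentence):
    """Product of the start probability and bigram transition probabilities;
    any unseen start word or bigram drives the probability to zero."""
    words = sentence.lower().rstrip(".").split()
    p = start_probs.get(words[0], 0.0)
    for prev, cur in zip(words, words[1:]):
        p *= trans_probs.get(prev, {}).get(cur, 0.0)
    return p
```

Under this model, ungrammatical strings such as "We retrieve killed the duck." contain an unseen bigram and therefore score zero.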
HMMs
• Next class