Lecture 4: Smoothing, Part-of-Speech Tagging (ivan-titov.org/teaching/nlmi-15/lecture-4.pdf)
TRANSCRIPT
[Page 1]
Lecture 4: Smoothing, Part-of-Speech Tagging
Ivan Titov
Institute for Logic, Language and Computation
Universiteit van Amsterdam
[Page 2]

[Page 3]
Language Models from Corpora
We want a model of sentence probability P(w_1, …, w_n) for all word sequences w_1, …, w_n over the vocabulary V:

P(w_1, …, w_n) = P(w_1) ∏_{i=2}^{n} P(w_i | w_1, …, w_{i−1})

Tasks to do:
- Estimate P(w_1)
- Estimate the probabilities P(w_i | w_1, …, w_{i−1}) for all w_1, …, w_i!
[Page 4]
What to do in order to estimate these probabilities?

- Bucketing histories: Markov models
- Smoothing techniques against sparse data
[Page 5]
Markov Assumption and N-grams
Limited history: there is a fixed finite k such that for all w_1^{i+1}:

P(w_{i+1} | w_1, …, w_i) ≈ P(w_{i+1} | w_{i−k}, …, w_i)

For k ≥ 0:

P(w_i | w_1, …, w_{i−1}) ≈ P(w_i | w_{i−k}, …, w_{i−1})

How to estimate the probabilities?

Estimation: k-th order Markov model

P(w_i | w_{i−k}, …, w_{i−1}) = Count(w_{i−k}, …, w_{i−1}, w_i) / Σ_{w∈V} Count(w_{i−k}, …, w_{i−1}, w)
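The relative-frequency estimate above can be sketched in a few lines. This is a minimal illustration, not from the slides; the function name, toy corpus, and padding symbols are assumptions:

```python
from collections import Counter

def mle_ngram(corpus, k):
    """Relative-frequency estimate for a k-th order Markov model:
    P(w_i | w_{i-k}, ..., w_{i-1}) = Count(history, w_i) / Count(history)."""
    ngrams, hists = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] * k + sent + ["</s>"]
        for i in range(k, len(toks)):
            hist = tuple(toks[i - k:i])
            ngrams[hist + (toks[i],)] += 1
            hists[hist] += 1
    return lambda w, hist: ngrams[tuple(hist) + (w,)] / hists[tuple(hist)]

p = mle_ngram([["the", "cat", "sees", "the", "mouse"]], k=1)
print(p("cat", ["the"]))  # Count(the, cat) / Count(the) = 1/2
```

Note that unseen histories raise a ZeroDivisionError and unseen n-grams get probability 0: exactly the sparseness problem the rest of the lecture addresses.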
[Page 6]
Bigram derivations
Transitions along one path in the bigram model (arc probabilities in parentheses; other arcs, e.g. to "saw", omitted):

START → the (.6) → cat (.4) → heard (.5) → a (.33) → mouse (.5) → END (.6)

Likelihood:

P_bigram("the cat heard a mouse") = .6 × .4 × .5 × .33 × .5 × .6 ≈ 0.012
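As a quick check of the arithmetic, the chain of arc probabilities can be multiplied out; the probability table is read off the bigram diagram above (a sketch, not from the slides):

```python
# Transition probabilities from the bigram diagram.
P = {("START", "the"): 0.6, ("the", "cat"): 0.4, ("cat", "heard"): 0.5,
     ("heard", "a"): 0.33, ("a", "mouse"): 0.5, ("mouse", "END"): 0.6}

def bigram_prob(words):
    """Product of bigram transition probabilities over the padded sequence."""
    seq = ["START"] + words + ["END"]
    prob = 1.0
    for prev, cur in zip(seq, seq[1:]):
        prob *= P[(prev, cur)]
    return prob

print(bigram_prob("the cat heard a mouse".split()))  # 0.01188
```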
[Page 7]
Zero counts
Model: we train a bigram model P(w_i | w_{i−1}) on Data, e.g. "the cat sees the mouse", …, "Tom sees Jerry", "the mouse sees Jerry"

Problem: P("Jerry sees the mouse") = 0. Why?

Data: the text contains neither bigram 〈START, Jerry〉 nor 〈Jerry, sees〉

Zeros are called "unseen events".

Question: the above sentence should have a non-zero probability; how do we estimate it?
[Page 8]
Zipf’s law
[Zipf plot: word frequency vs. frequency rank]

- In a given corpus: a few very frequent words, but very many infrequent words.
- Having a reasonably sized corpus is important, but we always need smoothing.
- The smoothing technique used has a large effect on the performance of any NLP system.
[Page 9]
Smoothing techniques for Markov Models
General idea: find a way to fill the gaps in counts of events in corpus C.

- Take care not to change the original distribution too much.
- Fill the gaps only as much as needed: as the corpus grows larger, there will be fewer gaps to fill.
[Page 10]
Naive smoothing: Adding λ method (1)
Events: set E of possible events, e.g. bigrams over V: E = V × V

Data: e_1, …, e_N (data size is N events)

Counting: event (bigram) e occurs C(e) times in Data

Relative frequency estimate: P_rf(e) = C(e) / N

Add 0 < λ ≤ 1: for all e ∈ E (bigrams), change C(e) and P_rf(e):

C*(e) = C(e) + λ

P_λ(e) = (C(e) + λ) / (N + λ|E|)
[Page 11]
Add λ method (2)
Example: Bigram Model
P_λ(w_i | w_{i−1}) = (λ + c(w_{i−1}, w_i)) / Σ_w (λ + c(w_{i−1}, w))

Advantages: very simple and easy to apply.

Disadvantages: the method performs poorly (see Chen & Goodman):
- All unseen events receive the same probability! Is that OK?
- All events are upgraded by λ! Is that OK?
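A minimal sketch of add-λ smoothing for bigrams; the function name and toy corpus are assumptions for illustration:

```python
from collections import Counter

def add_lambda_bigram(corpus, vocab, lam=0.5):
    """P_lambda(w | v) = (lam + c(v, w)) / (lam*|V| + sum_w' c(v, w'))."""
    big, uni = Counter(), Counter()
    for sent in corpus:
        for v, w in zip(sent, sent[1:]):
            big[(v, w)] += 1
            uni[v] += 1          # count of v as a bigram history
    V = len(vocab)
    return lambda w, v: (lam + big[(v, w)]) / (lam * V + uni[v])

p = add_lambda_bigram([["the", "cat", "sees", "the", "mouse"]],
                      vocab={"the", "cat", "sees", "mouse"}, lam=1.0)
# The unseen bigram ("cat", "the") now gets a small non-zero probability:
print(p("the", "cat"))  # (1 + 0) / (1*4 + 1) = 0.2
```

Note how every unseen bigram with the same history gets exactly the same probability, which is the first disadvantage listed above.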
[Page 12]
Good-Turing method
Intuition: use the counts of things you have seen once to estimate the count of things not seen, i.e. use n-grams with frequency 1 to re-estimate the frequency of zero counts.

Suppose we have data with total event count N. Standard notation:

r = frequency of event e
n_r = number of events e with frequency r: n_r = |{e ∈ E | Count(e) = r}|
N_r = total frequency of events occurring exactly r times: N_r = r × n_r

Observation: N = Σ_{r=1}^{∞} N_r, and N_0 = 0.

What we want: to recalculate the frequency r of an event (r*).
[Page 13]
Good-Turing Estimates
The Good-Turing probability estimate for events with frequency r:

P_GT(e) = ((r + 1)/N) × E[n_{r+1}]/E[n_r] ≈ ((r + 1)/N) × n_{r+1}/n_r

We can think of this as assigning frequency r* to events appearing r times:

r* = (r + 1) × n_{r+1}/n_r

where n_r is the number of events with frequency r and n_{r+1} is the number of events with frequency r + 1.
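The r* formula can be sketched directly from a table of event frequencies (an illustrative toy example, not from the slides):

```python
from collections import Counter

def good_turing_counts(freqs):
    """r* = (r + 1) * n_{r+1} / n_r for each observed frequency r."""
    n = Counter(freqs.values())  # n_r: number of events with frequency r
    return {r: (r + 1) * n[r + 1] / n[r] for r in n}

freqs = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 3}
rstar = good_turing_counts(freqs)
print(rstar[1])  # (1+1) * n_2/n_1 = 2 * 2/3 = 1.33...
```

Note that r* = 0 whenever n_{r+1} = 0 (here for the largest observed r), which is one reason the n_r values themselves need smoothing, as discussed below.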
[Page 14]
Properties of Good-Turing
Preservation: the total number of counts is preserved:

N = Σ_{r=1}^{∞} r n_r = Σ_{r=0}^{∞} (r + 1) n_{r+1} = Σ_{r=0}^{∞} n_r r*

Discounting: the total frequency of non-zero events is discounted.

Zero-frequency events: writing N_0* for the total count reassigned to zero events (the observed N_0 = 0):

N_0* = n_0 × 0* = n_0 × (1 × n_1/n_0) = n_1

so the reserved probability mass is

P_0 = N_0*/N = n_1/N

Zero events: there is no explicit method for redistributing N_0* among the zero events!

Redistribute the reserved mass (N_0*) uniformly among zero events?
[Page 15]
Estimating nr from Data
The values n_r from data could be sparse themselves!

GT: Good suggests estimating E[n_r] ≈ S(n_r) for some smoothing function S over the values n_r.

SGT: William Gale suggests the Simple Good-Turing, where S is linear in log-log space, log(n_r) = a + b log(r), fitted by linear regression such that b < −1 (to guarantee that r* < r).

An implementation by Geoffrey Sampson is available on the web.

Problem: Good-Turing says nothing about how to redistribute probability mass over unseen events!
[Page 16]
Why Uniform Redistribution is Not Good
Given some data, we observe that

- the frequency of trigram 〈gave, the, thing〉 is zero
- the frequency of trigram 〈gave, the, think〉 is also zero
- a uniform distribution over unseen events means:

  P(thing | gave, the) = P(think | gave, the)

Does that reflect our knowledge about English use? We would rather have:

P(thing | gave, the) > P(think | gave, the)

How do we achieve this?

- This might be achieved if we look at bigrams: P(thing | the) vs. P(think | the)
[Page 17]
Katz’s Backoff
In order to redistribute the reserved mass, use a lower-order distribution.

Discount and back off. Suppose C(w_{i−1} w_i) = r:

C_katz(w_{i−1} w_i) = d_r × r                    if r > 0
                      α(w_{i−1}) × P_ML(w_i)     if r = 0

Discounting: all non-zero counts are discounted by 1 − d_r.

Redistribution: the reserved mass is spread over zero events in proportion to a lower-order distribution; for bigrams this is unigrams, for trigrams this is bigrams, etc.
[Page 18]
Computing dr
Efficiency: discount only events occurring 1 ≤ r ≤ k times, e.g. k = 5.

Constraint 1: discounts should be proportional to the discounts of GT, i.e. there exists some constant µ such that 1 − d_r = µ(1 − r*/r).

Constraint 2: the total mass reserved for zero events is the same as for GT, which assigns n_0 × 0* = n_0 (n_1/n_0) = n_1 to all zero events:

Σ_{r=1}^{k} n_r × (1 − d_r) × r = n_1

Solution: the above system of equations has the solution:

d_r = (r*/r − (k+1) n_{k+1}/n_1) / (1 − (k+1) n_{k+1}/n_1)
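A sketch of the d_r computation, with a numeric check that Constraint 2 holds; the frequency-of-frequency counts n_r are made up for illustration:

```python
def katz_discount(n, k=5):
    """d_r = (r*/r - (k+1)n_{k+1}/n_1) / (1 - (k+1)n_{k+1}/n_1), 1 <= r <= k.
    n[r] must be the number of events with frequency r, for r = 1 .. k+1."""
    common = (k + 1) * n[k + 1] / n[1]
    d = {}
    for r in range(1, k + 1):
        rstar = (r + 1) * n[r + 1] / n[r]      # Good-Turing re-estimated count
        d[r] = (rstar / r - common) / (1 - common)
    return d

# Made-up frequency-of-frequency counts:
n = {1: 100, 2: 50, 3: 30, 4: 20, 5: 15, 6: 10}
d = katz_discount(n, k=5)

# Constraint 2: the total discounted mass equals n_1.
reserved = sum(n[r] * (1 - d[r]) * r for r in range(1, 6))
print(round(reserved, 6))  # 100.0 = n_1
```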
[Page 19]
Solution for α
The solution for α(w_{i−1}) comes from enforcing the constraint that total counts are not changed:

Σ_{w_i} C_katz(w_{i−1} w_i) = Σ_{w_i} C(w_{i−1} w_i)
See the Chen and Goodman tech report for details.
[Page 20]
Smoothing of Markov Model
Smoothing an (n+1)-gram model gives us an n-gram model.

P*(w_{n+1} | w_1, …, w_n) = Count*(w_1, …, w_n, w_{n+1}) / Σ_{w∈V} Count*(w_1, …, w_n, w)

For many smoothing methods:

Σ_{w∈V} Count*(w_1, …, w_n, w) ≠ Count*(w_1, …, w_n)
[Page 21]
Smoothing by linear interpolation: Jelinek-Mercer
(Also called finite mixture models)
Interpolate a bigram with a unigram model using some 0 ≤ λ ≤ 1:

P_interp(w_i | w_{i−1}) = λ P_ML(w_i | w_{i−1}) + (1 − λ) P_ML(w_i)

Estimation: use training data for P_ML and held-out data for estimating the λ's. What would happen if we used the training data to estimate λ?

Algorithm: Baum-Welch/forward-backward for Hidden Markov Models, searching for a maximum-likelihood estimate of the λ's.
More on this when we learn HMMs!
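The interpolation formula is a one-liner; here is a sketch with made-up ML estimates (the numbers and names are assumptions, not from the slides):

```python
def interp_bigram(p_ml_bigram, p_ml_unigram, lam):
    """P_interp(w | v) = lam * P_ML(w | v) + (1 - lam) * P_ML(w)."""
    return lambda w, v: lam * p_ml_bigram(w, v) + (1 - lam) * p_ml_unigram(w)

# Toy ML estimates, for illustration only:
p_bi = lambda w, v: {("cat", "the"): 0.4}.get((w, v), 0.0)
p_uni = lambda w: {"cat": 0.1, "dog": 0.05}.get(w, 0.0)

p = interp_bigram(p_bi, p_uni, lam=0.7)
print(p("cat", "the"))  # 0.7*0.4 + 0.3*0.1 = 0.31
print(p("dog", "the"))  # unseen bigram still gets 0.3*0.05 = 0.015
```

Unlike backoff, the lower-order model contributes to every probability, not just to unseen events.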
[Page 22]
Summary of Smoothing Techniques
- n-gram statistics suffer from sparseness in text as n grows
- There are smoothing methods against sparseness
- Add-λ is too simple
- Good-Turing only reserves mass for zero events
- Katz allows redistribution according to a lower-order distribution
- Interpolation combines lower-order distributions
[Page 23]
Smoothing N-gram Models
1. Add λ method
2. Good-Turing method
3. Discounting: Katz’ backoff
4. Interpolation: Jelinek-Mercer

Extra reference: Stanley Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. Tech. report TR-10-98, Harvard University, August 1998. http://research.microsoft.com/~joshuago/
[Page 24]
- Born 1932 (Vilem, Czechoslovakia) to a Jewish father and a Catholic mother.
- Father died in Theresienstadt; mother & children survived in Prague.
- Electrical Engineering, City College of New York; 1962 PhD, MIT (under Robert Fano; Shannon's department).
- 1957 met Milena Tabolova (arrived 1961, studied linguistics at MIT); together they attended Chomsky's classes.
- 1962-1972 Cornell, focus on Information Theory.
- 1972-1993 IBM Watson Research Center: applied the noisy channel model to speech recognition & machine translation.
- 1993-2010 Johns Hopkins.
- Died 2010.
[Page 25]
Shannon (1956): “Information theory has, in the last few years,become something of a scientific bandwagon. Starting as atechnical tool for the communication engineer, it has received anextraordinary amount of publicity in the popular as well as thescientific press... Our fellow scientists in many different fields,attracted by the fanfare and by the new avenues opened toscientific analysis, are using these ideas in their own problems.Applications are being made to biology, psychology, linguistics,fundamental physics, economics, the theory of organization, andmany others... Although this wave of popularity is certainlypleasant and exciting for those of us working in the field, it carriesat the same time an element of danger... It will be all too easy forour somewhat artificial prosperity to collapse overnight when it isrealized that the use of a few exciting words like information,entropy, redundancy, do not solve all our problems.”
[Page 26]
- Pierce and Carroll, ALPAC report (1966), persuaded the U.S. government to stop funding machine translation (MT) research.
- "Certainly, language is second to no phenomenon in importance. The new linguistics presents an attractive as well as an extremely important challenge."
- Machine translation is a field for "mad inventors or untrustworthy engineers"
- Pierce (1969, JASA): "We are safe in asserting that speech recognition is attractive to money. The attraction is perhaps similar to the attraction of schemes for turning water into gasoline, extracting gold from the sea, curing cancer, or going to the moon."
- Chomsky (1957): "Probabilistic models give no insight into basic problems of syntactic structure"
- Pierce: "Funding artificial intelligence is real stupidity"
[Page 27]
In this climate, Jelinek pioneered statistical models for speechrecognition and machine translation, reconciling Shannon’s andChomsky’s contributions.
- In 1976 (Proceedings of the IEEE): "Continuous Speech Recognition by Statistical Methods"
- "eventually his view of the problem entirely reshaped the landscape and set the foundations for virtually all modern speech recognition systems."
- "in his 20 years at IBM, he and his team invented nearly all of the key components that you need to build a high performance speech recognition system, including phone-based acoustic models, N-gram language models, decision tree clustering, and many more"
- 2000 (with Chelba): "Structured language modeling for speech recognition": PCFGs as language models.
[Page 28]
POS tagging: Background
What are POS tags?
- A Part of Speech (POS) tag is the class of a word.
  Examples: Verbs (VB), Nouns (NN), Adjectives (ADJ), Adverbs (ADV), Prepositions (PP), Determiners (DT)
- A POS tag provides significant info about a word and its neighbourhood.
  Example: a verb occurs in different neighbourhoods than a noun
  - Nouns are likely to be preceded by determiners, nouns, adjectives
  - Verbs are likely to be preceded by nouns, pronouns, adverbs
- POS tags capture mainly morpho-syntactic behaviour
- POS tag sets are only discrete approximations of the behaviour of words in sentences!
[Page 29]
What are POS tags?
- POS tags are also known as word classes, morphological classes, or lexical tags
- The number of tags used differs between systems/corpora/languages
  - Penn Treebank (Wall Street Journal newswire): 45
  - Brown corpus (mixed genres like fiction, biographies, etc.): 87
- POS tags can be of varying granularity
- Morphologically complex languages have very different tag sets from English or Dutch
- Even for English:
  He gave the book to the man (VB-N-N)
  He ate the banana. (VB-N)
[Page 30]
POS tags are useful
Pronunciation. Examples:

  NN CONtent    NN OBject    NN DIScount
  ADJ conTENT   VB obJECT    VB disCOUNT

Spelling correction: knowing the POS tag of a word helps.

  The house be the sea is beautiful.

  (VB be) is less likely to follow (NN house) than (PP by)

Parsing: POS tags say a lot about syntactic word behaviour.

Speech recognition: POS tags are used in a language model component.

Information Retrieval: important words are often (e.g.) nouns.
POS-tagging is often the first step in a larger NLP system.
[Page 31]
POS tagging and ambiguity
List the list .
VB   DT  NN  .

Book that flight .
VB   DT   NN    .

He  thought that the book was long .
PRN VB      COMP DT  NN   VB  ADJ  .

Ambiguity:

  list: VB/NN
  book: VB/NN
  following: NN/VB/ADJ
  that: DT/COMP, e.g. I thought (COMP that) …

POS tag ambiguity resolution demands "some" syntactic analysis!
[Page 32]
Why is POS tagging hard
Ambiguity: most words belong to more than one word class
- List/VB the list/NN
- He will race/VB the car
- When is the race/NN?

Some ambiguities are genuinely difficult, even for people
- particle vs. preposition
  - He talked over the deal.
  - He talked over the phone.
- past tense vs. past participle
  - The horse raced past the barn.
  - The horse raced past the barn was brown.
  - The old man the boat.
[Page 33]
How hard is POS tagging?
Using the Brown Corpus (approx. 1 million words of mixed-genre text), a study shows that for English:

- Over 40% of all word tokens are ambiguous!
- Many ambiguities are easy to resolve: just pick the most likely tag!
- A good portion is still hard: picking the most likely tag yields only 90% per-word tagging accuracy:

A sentence of length n is fully correct only with probability (0.9)^n:

  Length     10    20    30     40     50
  Precision  34%   12%   4.2%   1.5%   0.5%
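The row of precisions follows directly from compounding the per-word accuracy; a short loop reproduces the table above (up to rounding):

```python
p = 0.9  # per-word tagging accuracy
for n in (10, 20, 30, 40, 50):
    # Probability that all n words in a sentence are tagged correctly.
    print(f"length {n}: {100 * p ** n:.1f}% of sentences fully correct")
```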
Maybe we can get away with using local context?
[Page 34]
POS-tagging is among the easiest of NLP problems
- State-of-the-art methods get about 97-98% accuracy
- Simple heuristics can take you quite far
  - 90% accuracy just by choosing the most frequent tag
- A good portion is still hard: we need to use some local context
[Page 35]
Current POS taggers
"Rule-based": a set of rules constructed manually (e.g. ENGTWOL: CG)

Stochastic: Markov taggers, HMM taggers (and other ML techniques)

This lecture: stochastic taggers
- What is a stochastic tagger? What components does it contain?
- How to use Markov models?
- Hidden Markov Models for tagging
[Page 36]
A generative stochastic tagger
Given a vocabulary V and POS tag set T
Input: w_1, …, w_n, all w_i ∈ V

Output: t_1, …, t_n, all t_i ∈ T

Tagger: P : (V × T)^+ → [0, 1]

Disambiguation: select t_1^n = argmax_{t_1^n} P(t_1^n | w_1^n)

Noisy channel: message t_1^n → [Noisy Channel] → observation w_1^n
[Page 37]
Partitioning: Task Model & Language Model

argmax_{t_1^n} P(t_1^n | w_1^n) = argmax_{t_1^n} P(t_1^n, w_1^n) / P(w_1^n) = argmax_{t_1^n} P(t_1^n, w_1^n)

P(t_1^n, w_1^n) = P(t_1^n) × P(w_1^n | t_1^n)

where P(t_1^n) is the "language model" and P(w_1^n | t_1^n) is the "task model".
[Page 38]
Independence Assumptions
Language model: approximate P(t_1^n) using an n-gram model:

  bigram: ∏_{i=1}^{n} P(t_i | t_{i−1})
  trigram: ∏_{i=1}^{n} P(t_i | t_{i−2} t_{i−1})

Task model: approximate P(w_1^n | t_1^n) by assuming that a word appears in a category independently of its neighbours:

P(w_1^n | t_1^n) = ∏_{i=1}^{n} P(w_i | t_i)

We can smooth the language model using the smoothing techniques from earlier in this lecture.
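Putting the two factors together, the joint probability of a tag sequence and a word sequence can be sketched as follows (a toy illustration with made-up probability tables, using the bigram language model):

```python
import math

def hmm_joint(words, tags, p_trans, p_emit):
    """P(t_1^n, w_1^n) = prod_i P(t_i | t_{i-1}) * P(w_i | t_i)."""
    logp = 0.0                     # sum logs to avoid underflow on long inputs
    prev = "<s>"
    for w, t in zip(words, tags):
        logp += math.log(p_trans[(prev, t)]) + math.log(p_emit[(t, w)])
        prev = t
    return math.exp(logp)

# Toy (assumed) probabilities:
p_trans = {("<s>", "DT"): 0.5, ("DT", "NN"): 0.8}
p_emit = {("DT", "the"): 0.6, ("NN", "cat"): 0.1}
print(hmm_joint(["the", "cat"], ["DT", "NN"], p_trans, p_emit))  # 0.5*0.6*0.8*0.1 = 0.024
```

Disambiguation then amounts to maximizing this quantity over all tag sequences, which the next lecture's dynamic programming handles efficiently.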
[Page 39]
Training: estimating the probabilities
Given a training corpus with every word tagged with its correct tag, use relative frequencies from the training corpus.

Task model:

P(w_i | t_i) = Count(〈w_i, t_i〉) / Σ_w Count(〈w, t_i〉) = Count(〈w_i, t_i〉) / Count(t_i)

Language model (trigram model shown here):

P(t_i | t_{i−2}, t_{i−1}) = Count(t_{i−2}, t_{i−1}, t_i) / Σ_{t∈T} Count(t_{i−2}, t_{i−1}, t)
[Page 40]
Unknown words in the task model (1)
When a new word w_u arrives, we have never seen this word with any tag.

Hence: for all tags t ∈ T, P(w_u | t) = 0.

The performance of state-of-the-art taggers depends largely on the treatment of unknown words!
What should we do?
Use knowledge of language:
Morphology: eating, filled, trousers, Noam, naturally
POS tag classes: closed-class versus open-class words. Some classes, such as proper nouns and verbs, are much more likely to generate new words than others, e.g. prepositions and determiners.
[Page 41]
Unknown words in the task model (2)
To capture the fact that some POS classes are more likely to generate unknown words, we can simply use Good-Turing smoothing on P(w | t).

n_0: number of unknown words
n_1(t): number of words which appeared once with tag t
N(t): total number of words which appeared with tag t

Good-Turing estimate for an unknown word u:

P(u | t) = n_1(t) / (n_0 N(t))
[Page 42]
Unknown words: Good Turing Example
P(u | t) = n_1(t) / (n_0 N(t))

Number of unknown words: n_0 = 10

  t = NN            t = VBD            t = PREP
  n_1(NN) = 150     n_1(VBD) = 100     n_1(PREP) = 2
  N(NN) = 3000      N(VBD) = 1000      N(PREP) = 5000

  P(u|NN) = 1/200   P(u|VBD) = 1/100   P(u|PREP) = 1/25000
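The table's numbers can be checked directly from the formula (a small sketch; the function name is an assumption):

```python
def p_unknown(tag, n1, N, n0):
    """Good-Turing estimate P(u | t) = n1(t) / (n0 * N(t)) for an unknown word."""
    return n1[tag] / (n0 * N[tag])

n1 = {"NN": 150, "VBD": 100, "PREP": 2}
N = {"NN": 3000, "VBD": 1000, "PREP": 5000}
print(p_unknown("NN", n1, N, n0=10))    # 150/30000 = 1/200
print(p_unknown("PREP", n1, N, n0=10))  # 2/50000  = 1/25000
```

An unknown word is thus about a hundred times more likely to be a noun than a preposition, as intended.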
[Page 43]
Unknown words in the task model (3)
- For each morphological feature (capitalization, hyphen, etc.), separately train estimators P(feature | tag).
- Assume independence and multiply.
- The model of Weischedel et al. (1993):

P(w, unknown | t) ∝ P(unknown | t) × P(capitalized(w) | t) × P(suffix(w)/hyphen(w) | t)
This treatment of unknown words might not be suitable for“morphologically-rich” languages such as Finnish, Hebrew or Arabic
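A sketch of the multiply-independent-features idea, in the spirit of the Weischedel et al. model; the feature set (capitalization and an "-ing" suffix) and all probability values are illustrative assumptions, not the published model:

```python
# Toy (assumed) feature probabilities for two tags:
p_unk = {"NNP": 0.02, "IN": 0.0001}                 # P(unknown | t)
p_cap = {("NNP", True): 0.9, ("NNP", False): 0.1,   # P(capitalized | t)
         ("IN", True): 0.05, ("IN", False): 0.95}
p_ing = {("NNP", True): 0.01, ("NNP", False): 0.99,  # P(-ing suffix | t)
         ("IN", True): 0.001, ("IN", False): 0.999}

def p_unknown_word(word, tag):
    """P(w, unknown | t), proportional to the product of per-feature estimates."""
    return (p_unk[tag]
            * p_cap[(tag, word[0].isupper())]
            * p_ing[(tag, word.endswith("ing"))])

# A capitalized unknown word looks far more like a proper noun than a preposition:
print(p_unknown_word("Noam", "NNP") > p_unknown_word("Noam", "IN"))  # True
```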
[Page 44]
Next Lecture
- Tagging (dynamic programming)
- Recap Part A (exam preparation)
- Syntax: outlook
- No lecture on February 18