Lecture 4: Smoothing, Part-of-Speech Tagging
ivan-titov.org/teaching/nlmi-15/lecture-4.pdf

Page 1:

Lecture 4: Smoothing, Part-of-Speech Tagging

Ivan Titov
Institute for Logic, Language and Computation
Universiteit van Amsterdam

Page 2:

Page 3:

Language Models from Corpora

We want a model of sentence probability P(w_1, \ldots, w_n) for all word sequences w_1, \ldots, w_n over the vocabulary V:

P(w_1, \ldots, w_n) = P(w_1) \prod_{i=2}^{n} P(w_i \mid w_1, \ldots, w_{i-1})

Tasks to do:
- Estimate P(w_1)
- Estimate the probabilities P(w_i \mid w_1, \ldots, w_{i-1}) for all w_1, \ldots, w_i!

Page 4:

What to do in order to estimate these probabilities?

- Bucketing histories: Markov models
- Smoothing techniques against sparse data

Page 5:

Markov Assumption and N-grams

Limited history: there is a fixed finite k such that for all w_1^{i+1}:

P(w_{i+1} \mid w_1, \ldots, w_i) \approx P(w_{i+1} \mid w_{i-k}, \ldots, w_i)

For k \geq 0:

P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-k}, \ldots, w_{i-1})

How to estimate the probabilities?

Estimation: k-th order Markov Model

P(w_i \mid w_{i-k}, \ldots, w_{i-1}) = \frac{\text{Count}(w_{i-k}, \ldots, w_{i-1}, w_i)}{\sum_{w \in V} \text{Count}(w_{i-k}, \ldots, w_{i-1}, w)}
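As a concrete illustration of the relative-frequency estimate above (not part of the original slides), here is a minimal Python sketch of bigram (k = 1) maximum-likelihood estimation; the toy corpus and helper names are invented for the example.

```python
from collections import Counter

def bigram_mle(sentences):
    """Estimate P(w_i | w_{i-1}) by relative frequency (k = 1 Markov model)."""
    bigram_counts = Counter()
    context_counts = Counter()
    for sent in sentences:
        tokens = ["<START>"] + sent.split() + ["<END>"]
        for prev, cur in zip(tokens, tokens[1:]):
            bigram_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    # P(cur | prev) = Count(prev, cur) / sum_w Count(prev, w)
    return {bg: c / context_counts[bg[0]] for bg, c in bigram_counts.items()}

corpus = ["the cat sees the mouse", "the mouse sees Jerry"]
probs = bigram_mle(corpus)
print(probs[("the", "cat")])   # 1/3: "the" occurs 3 times as a context, "the cat" once
```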

Page 6:

Bigram derivations

[Diagram: bigram derivation automaton for "the cat heard a mouse", with transition probabilities START → the (.6), the → cat (.4), cat → heard (.5), heard → a (.33), a → mouse (.5), mouse → END (.6); alternative arcs (e.g. to "saw") are not reproduced here.]

Likelihood:

P_bigram("the cat heard a mouse") = .6 × .4 × .5 × .33 × .5 × .6 ≈ 0.012

Page 7:

Zero counts

Model: we train a bigram model P(w_i \mid w_{i-1}) on Data, e.g. "the cat sees the mouse", ..., "Tom sees Jerry", "the mouse sees Jerry"

Problem: P("Jerry sees the mouse") = 0. Why?

Data: the text does not contain the bigrams 〈START, Jerry〉 and 〈Jerry, sees〉

Zeros: such events are called "unseen events"

Question: the above sentence should have a non-zero probability; how do we estimate it?

Page 8:

Zipf's law

[Plot: word frequency against frequency rank]

- In a given corpus there are a few very frequent words, but very many infrequent words.
- Having a reasonably sized corpus is important, but we always need smoothing.
- The smoothing technique used has a large effect on the performance of any NLP system.

Page 9:

Smoothing techniques for Markov Models

General idea: find a way to fill the gaps in counts of events in corpus C.

- Take care not to change the original distribution too much.
- Fill the gaps only as much as needed: as the corpus grows larger, there will be fewer gaps to fill.

Page 10:

Naive smoothing: Adding λ method (1)

Events: set E of possible events, e.g. bigrams over V: E = V \times V

Data: e_1, \ldots, e_N (data size is N events)

Counting: event (bigram) e occurs C(e) times in Data

Relative frequency estimate: P_{rf}(e) = \frac{C(e)}{N}

Add 0 < λ ≤ 1: for all e ∈ E (bigrams) change C(e) and P_{rf}(e):

C_λ(e) = C(e) + λ

P_λ(e) = \frac{C(e) + λ}{N + λ |E|}

Page 11:

Add λ method (2)

Example: Bigram Model

P_λ(w_i \mid w_{i-1}) = \frac{λ + c(w_{i-1}, w_i)}{\sum_w (λ + c(w_{i-1}, w))}

(a code sketch follows below)

Advantages: very simple and easy to apply

Disadvantages: the method performs poorly (see Chen & Goodman):
- All unseen events receive the same probability! Is that OK?
- All events are upgraded by λ! Is that OK?
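As mentioned above, here is a minimal sketch (not from the slides) of add-λ smoothing for a bigram model; the toy corpus and the value of λ are invented.

```python
from collections import Counter

def add_lambda_bigram(tokens, lam=0.5):
    """P_lambda(w_i | w_{i-1}) = (lam + c(w_{i-1}, w_i)) / sum_w (lam + c(w_{i-1}, w))."""
    vocab = set(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    context = Counter(tokens[:-1])          # how often each word occurs as a history
    def prob(prev, cur):
        return (lam + bigrams[(prev, cur)]) / (context[prev] + lam * len(vocab))
    return prob

tokens = "the cat sees the mouse".split()
p = add_lambda_bigram(tokens)
print(p("the", "cat"))    # seen bigram: (0.5 + 1) / (2 + 0.5 * 4) = 0.375
print(p("the", "sees"))   # unseen bigram still gets non-zero probability: 0.125
```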

Page 12:

Good-Turing method

Intuition: use the counts of things you have seen once to estimate the count of things not seen, i.e. use n-grams with frequency 1 to re-estimate the frequency of zero counts.

Suppose we have data with a total count of events N.

Standard notation:
r = frequency of an event e
n_r = number of events e with frequency r: n_r = |{e ∈ E : Count(e) = r}|
N_r = total frequency of events occurring exactly r times: N_r = r \times n_r

Observation: N = \sum_{r=1}^{\infty} N_r and N_0 = 0

What we want: to recalculate the frequency r of an event (the adjusted frequency r^*)

Page 13:

Good-Turing Estimates

The Good-Turing probability estimate for an event α with frequency r:

P_{GT}(α) = \frac{r+1}{N} \cdot \frac{E[n_{r+1}]}{E[n_r]} \approx \frac{r+1}{N} \cdot \frac{n_{r+1}}{n_r} = \frac{1}{N} \times (r+1) \times \frac{n_{r+1}}{n_r}

We can think of this as assigning frequency r^* to events appearing r times:

r^* = (r+1) \times \frac{n_{r+1}}{n_r}

n_r: number of events with frequency r
n_{r+1}: number of events with frequency r + 1
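As an illustration (not from the slides), a minimal sketch of the adjusted counts r^* = (r+1) n_{r+1} / n_r computed from a table of frequencies of frequencies; the bigram counts are invented, and the fallback for a missing n_{r+1} is a simplification.

```python
from collections import Counter

def good_turing_adjusted_counts(counts):
    """Map each observed frequency r to r* = (r + 1) * n_{r+1} / n_r."""
    n = Counter(counts.values())      # n[r] = number of events seen exactly r times
    adjusted = {}
    for r in sorted(n):
        if n.get(r + 1, 0) == 0:
            adjusted[r] = r           # no n_{r+1} available: fall back to the raw count
        else:
            adjusted[r] = (r + 1) * n[r + 1] / n[r]
    return adjusted

bigram_counts = {("the", "cat"): 1, ("the", "dog"): 1, ("a", "cat"): 1, ("the", "mouse"): 2}
print(good_turing_adjusted_counts(bigram_counts))   # {1: 0.666..., 2: 2}
```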

Page 14:

Properties of Good-Turing

Preservation: the total number of counts is preserved:

N = \sum_{r=1}^{\infty} r \, n_r = \sum_{r=0}^{\infty} (r+1) \, n_{r+1} = \sum_{r=0}^{\infty} n_r \, r^*

Discounting: the total frequency of non-zero events is discounted:

N_0 = n_0 \times 0^* = n_0 \times \left(1 \times \frac{n_1}{n_0}\right) = n_1

Zero-frequency events: the total probability mass reserved for them is

P_0 = \frac{N_0}{N} = \frac{n_0 \times 0^*}{N} = \frac{n_1}{N}

Zero events: there is no explicit method for redistributing N_0 among the zero events!

Redistribute the reserved mass (N_0) uniformly among zero events?

Page 15:

Estimating nr from Data

The values nr from data could be sparse themselves!

GT: Good suggests estimating E[n_r] ≈ S(n_r) for some smoothing function S over the values n_r.

SGT: William Gale suggests the Simple Good-Turing method, where S is log-linear, log S(r) = a + b log(r), fitted by linear regression with b < −1 (to guarantee that r^* < r); see the sketch below.

Available: an implementation by Geoffrey Sampson is available on the web.
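A minimal sketch (not from the slides) of the log-linear fit behind Simple Good-Turing, using numpy's least-squares polyfit on an invented (r, n_r) table; a full SGT implementation (e.g. Sampson's) also handles gaps in r and switches between raw and smoothed estimates, which is omitted here.

```python
import numpy as np

# Invented frequency-of-frequencies table: n_r for r = 1..6
r = np.array([1, 2, 3, 4, 5, 6], dtype=float)
n_r = np.array([120, 40, 24, 13, 15, 5], dtype=float)

# Fit log S(r) = a + b * log(r) by linear regression (least squares)
b, a = np.polyfit(np.log(r), np.log(n_r), deg=1)
S = lambda x: np.exp(a + b * np.log(x))

# Smoothed Good-Turing adjusted count r* = (r + 1) * S(r + 1) / S(r)
r_star = (r + 1) * S(r + 1) / S(r)
print(b)        # slope; should be < -1 so that r* < r
print(r_star)
```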

Problem: Good-Turing says nothing about how to redistribute probability mass over unseen events!

Page 16:

Why Uniform Redistribution is Not Good

Given some data, we observe that

- the frequency of the trigram 〈gave, the, thing〉 is zero
- the frequency of the trigram 〈gave, the, think〉 is also zero
- a uniform distribution over unseen events means:

P(thing \mid gave, the) = P(think \mid gave, the)

Does that reflect our knowledge about English use?

P(thing \mid gave, the) > P(think \mid gave, the)

How do we achieve this?

- This might be achieved if we look at bigrams: P(thing \mid the) vs. P(think \mid the)

Page 17:

Katz's Backoff

In order to redistribute the reserved mass, use a lower-order distribution.

Discount and backoff: suppose C(w_{i-1} w_i) = r:

C_{katz}(w_{i-1} w_i) =
  \begin{cases}
    d_r \times r & \text{if } r > 0 \\
    α(w_{i-1}) \times P_{ML}(w_i) & \text{if } r = 0
  \end{cases}

Discounting: all non-zero counts are discounted by 1 − d_r

Redistribution: the reserved mass is redistributed over zero events in proportion to their lower-order distribution; for bigrams this is unigrams, for trigrams this is bigrams, etc.

Page 18:

Computing d_r

Efficiency: discount only events occurring 1 ≤ r ≤ k, e.g. k = 5.

Constraint 1: discounts should be proportional to the Good-Turing discounts, i.e. there exists some constant µ such that

1 − d_r = µ \left(1 − \frac{r^*}{r}\right)

Constraint 2: the total mass of zero events is the same as for GT; GT assigns to all zero events n_0 \times 0^* = n_0 \left(\frac{n_1}{n_0}\right) = n_1:

\sum_{r=1}^{k} n_r \times (1 − d_r) \times r = n_1

Solution: the above system of equations has the solution:

d_r = \frac{\frac{r^*}{r} − \frac{(k+1) n_{k+1}}{n_1}}{1 − \frac{(k+1) n_{k+1}}{n_1}}
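A small sketch (not from the slides) of the d_r formula above, with an invented frequency-of-frequencies table; r^* is the Good-Turing adjusted count defined earlier.

```python
def katz_discount(r, n, k=5):
    """Katz discount d_r for 1 <= r <= k, given n[r] = number of events with frequency r."""
    r_star = (r + 1) * n[r + 1] / n[r]          # Good-Turing adjusted count
    ratio = (k + 1) * n[k + 1] / n[1]
    return (r_star / r - ratio) / (1 - ratio)

# Invented counts-of-counts: n[r] for r = 1..6
n = {1: 120, 2: 40, 3: 24, 4: 13, 5: 8, 6: 5}
for r in range(1, 6):
    print(r, katz_discount(r, n))               # each d_r expected to lie in (0, 1)
```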

Page 19:

Solution for α

The solution for α(w_{i-1}) is obtained by enforcing the constraint that total counts are not changed:

\sum_{w_i} C_{katz}(w_{i-1} w_i) = \sum_{w_i} C(w_{i-1} w_i)

See the Chen and Goodman tech report for details.

Page 20:

Smoothing of Markov Model

Smoothing an (n + 1)-gram model gives us an n-gram model:

P^*(w_{n+1} \mid w_1, \ldots, w_n) = \frac{\text{Count}^*(w_1, \ldots, w_n, w_{n+1})}{\sum_{w \in V} \text{Count}^*(w_1, \ldots, w_n, w)}

For many smoothing methods:

\sum_{w \in V} \text{Count}^*(w_1, \ldots, w_n, w) \neq \text{Count}^*(w_1, \ldots, w_n)

Page 21:

Smoothing by linear interpolation: Jelinek-Mercer

(Also called finite mixture models)

Interpolate a bigram model with a unigram model using some 0 ≤ λ ≤ 1:

P_{interp}(w_i \mid w_{i-1}) = λ P_{ML}(w_i \mid w_{i-1}) + (1 − λ) P_{ML}(w_i)

(a code sketch follows below)

Estimation: training data for P_{ML}, held-out data for estimating the λ's. What would happen if we used the training data to estimate λ?

Algorithm: Baum-Welch/Forward-Backward for Hidden Markov Models, searching for a maximum-likelihood estimate of the λ's.

More on this when we learn HMMs!
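The sketch referred to above (not from the slides): interpolating maximum-likelihood bigram and unigram estimates with a fixed λ. The counts and λ are invented; in practice λ would be tuned on held-out data, as the slide notes.

```python
from collections import Counter

def jelinek_mercer(tokens, lam=0.7):
    """Return P_interp(w_i | w_{i-1}) = lam * P_ML(w_i | w_{i-1}) + (1 - lam) * P_ML(w_i)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)

    def prob(prev, cur):
        p_bigram = bigrams[(prev, cur)] / unigrams[prev] if unigrams[prev] else 0.0
        p_unigram = unigrams[cur] / total
        return lam * p_bigram + (1 - lam) * p_unigram

    return prob

tokens = "the cat sees the mouse the mouse sees Jerry".split()
p = jelinek_mercer(tokens)
print(p("the", "cat"))     # mixes P_ML(cat | the) = 1/3 with P_ML(cat) = 1/9
print(p("Jerry", "sees"))  # unseen bigram: falls back to (1 - lam) * P_ML(sees)
```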

Page 22:

Summary of Smoothing Techniques

- n-gram statistics suffer from sparseness in text as n grows
- There are smoothing methods against sparseness
- Add λ is too simple
- Good-Turing only reserves mass for zero events
- Katz allows redistribution according to a lower-order distribution
- Interpolation combines lower-order distributions

Page 23:

Smoothing N-gram Models

1. Add λ method

2. Good-Turing method

3. Discounting: Katz's backoff

4. Interpolation: Jelinek-Mercer

Extra reference: Stanley Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. Tech. report TR-10-98, Harvard University, August 1998. http://research.microsoft.com/~joshuago/

Page 24:

Frederick Jelinek

- Born 1932 (Vilem, Czechoslovakia) to a Jewish father and a Catholic mother.
- His father died in Theresienstadt; mother and children survived in Prague.
- Electrical Engineering at City College of New York; 1962 PhD at MIT (under Robert Fano, in Shannon's department).
- 1957 met Milena Tabolova (arrived 1961, studied linguistics at MIT); together they attended Chomsky's classes.
- 1962-1972 Cornell, focus on Information Theory.
- 1972-1993 IBM Watson Research Center: applied the noisy channel model to speech recognition and machine translation.
- 1993-2010 Johns Hopkins.
- Died 2010.

Page 25:

Shannon (1956): "Information theory has, in the last few years, become something of a scientific bandwagon. Starting as a technical tool for the communication engineer, it has received an extraordinary amount of publicity in the popular as well as the scientific press... Our fellow scientists in many different fields, attracted by the fanfare and by the new avenues opened to scientific analysis, are using these ideas in their own problems. Applications are being made to biology, psychology, linguistics, fundamental physics, economics, the theory of organization, and many others... Although this wave of popularity is certainly pleasant and exciting for those of us working in the field, it carries at the same time an element of danger... It will be all too easy for our somewhat artificial prosperity to collapse overnight when it is realized that the use of a few exciting words like information, entropy, redundancy, do not solve all our problems."

Page 26:

- Pierce and Carroll's ALPAC report (1966) persuaded the U.S. government to stop funding machine translation (MT) research.
- "Certainly, language is second to no phenomenon in importance. The new linguistics presents an attractive as well as an extremely important challenge."
- Machine translation is a field for "mad inventors or untrustworthy engineers".
- Pierce (1969, JASA): "We are safe in asserting that speech recognition is attractive to money. The attraction is perhaps similar to the attraction of schemes for turning water into gasoline, extracting gold from the sea, curing cancer, or going to the moon."
- Chomsky (1957): "Probabilistic models give no insight into basic problems of syntactic structure"
- Pierce: "Funding artificial intelligence is real stupidity"

Page 27:

In this climate, Jelinek pioneered statistical models for speech recognition and machine translation, reconciling Shannon's and Chomsky's contributions.

- In 1976 (Proceedings of the IEEE): "Continuous Speech Recognition by Statistical Methods"
- "eventually his view of the problem entirely reshaped the landscape and set the foundations for virtually all modern speech recognition systems."
- "in his 20 years at IBM, he and his team invented nearly all of the key components that you need to build a high performance speech recognition system, including phone-based acoustic models, N-gram language models, decision tree clustering, and many more"
- 2000 (with Chelba): "Structured language modeling for speech recognition": PCFGs as language models.

Page 28:

POS tagging: Background

What are POS tags?

- A Part of Speech (POS) tag is the class of a word.
  Examples: Verbs (VB), Nouns (NN), Adjectives (ADJ), Adverbs (ADV), Prepositions (PP), Determiners (DT)
- A POS tag provides significant information about a word and its neighbourhood.
  Example: a verb occurs in different neighbourhoods than a noun
  - Nouns are likely to be preceded by determiners, nouns, adjectives
  - Verbs are likely to be preceded by nouns, pronouns, adverbs
- POS tags capture mainly morpho-syntactic behaviour
- POS tag sets are only discrete approximations of the behaviour of words in sentences!

Page 29:

What are POS tags?

- POS tags are also known as word classes, morphological classes, or lexical tags
- The number of tags used by different systems/corpora/languages differs
  - Penn Treebank (Wall Street Journal newswire): 45
  - Brown corpus (mixed genres like fiction, biographies, etc.): 87
- POS tags can be of varying granularity
  - Morphologically complex languages have very different tag sets from English or Dutch.
  - Even for English:
    He gave the book to the man   (VB-N-N)
    He ate the banana.            (VB-N)

Page 30:

POS tags are useful

Pronunciation examples:
  NN CONtent    NN OBject    NN DIScount
  ADJ conTENT   VB obJECT    VB disCOUNT

Spelling correction: knowing the POS tag of a word helps.

The house be the sea is beautiful.

(VB be) is less likely to follow (NN house) than (PP by)

Parsing: POS tags say a lot about syntactic word behavior.

Speech recognition: POS tags are used in a language model component

Information Retrieval: important words are often (e.g.) nouns.

POS-tagging is often the first step in a larger NLP system.

Page 31:

POS tagging and ambiguity

List  the  list  .
VB    DT   NN    .

Book  that  flight  .
VB    DT    NN      .

He   thought  that  the  book  was  long  .
PRN  VB       COMP  DT   NN    VB   ADJ   .

Ambiguity:

list: VB/NN
book: VB/NN
following: NN/VB/ADJ
that: DT/COMP   (I thought (COMP that) ...)

POS tag ambiguity resolution demands "some" syntactic analysis!

Page 32:

Why is POS tagging hard?

Ambiguity: most words belong to more than one word class
- List/VB the list/NN
- He will race/VB the car
- When is the race/NN?

Some ambiguities are genuinely difficult, even for people:

- particle vs. preposition
  - He talked over the deal.
  - He talked over the phone.
- past tense vs. past participle
  - The horse raced past the barn.
  - The horse raced past the barn was brown.
  - The old man the boat.

Page 33:

How hard is POS tagging?

Using the Brown Corpus (approx. 1 million words of mixed-genre text), a study shows that for English:

- Over 40% of all word tokens are ambiguous!
- Many ambiguities are easy to resolve: just pick the most likely tag!
- A good portion is still hard: picking the most likely tag yields only about 90% per-word tagging accuracy.

A sentence of n words is then fully correct only about (0.9)^n of the time:

Length          10    20    30    40    50
Fully correct   34%   12%   4.2%  1.5%  0.5%
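A one-line check (not from the slides) of the (0.9)^n figures in the table above:

```python
# Whole-sentence accuracy if each of n tags is independently correct with probability 0.9
for n in (10, 20, 30, 40, 50):
    print(n, f"{0.9 ** n:.1%}")   # 34.9%, 12.2%, 4.2%, 1.5%, 0.5%
```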

Maybe we can get away with using local context?

Page 34:

POS-tagging is among the easiest of NLP problems

- State-of-the-art methods get about 97-98% accuracy
- Simple heuristics can take you quite far
  - 90% accuracy just by choosing the most frequent tag.
- A good portion is still hard: we need to use some local context.

Page 35:

Current POS taggers

"Rule-based": a set of rules constructed manually (e.g. ENGTWOL: CG)

Stochastic: Markov taggers, HMM taggers (and other ML techniques)

This lecture: stochastic taggers
- What is a stochastic tagger? What components does it contain?
- How to use Markov models?
- Hidden Markov Models for tagging.

Page 36:

A generative stochastic tagger

Given a vocabulary V and a POS tag set T

Input: w_1, \ldots, w_n, all w_i ∈ V

Output: t_1, \ldots, t_n, all t_i ∈ T

Tagger: P : (V \times T)^+ → [0, 1]

Disambiguation: select t_1^n = argmax_{t_1^n} P(t_1^n \mid w_1^n)

[Noisy channel diagram: message t_1^n → noisy channel → observation w_1^n]

Page 37:

Partitioning: Task Model & Language Model

argmax_{t_1^n} P(t_1^n \mid w_1^n) = argmax_{t_1^n} \frac{P(t_1^n, w_1^n)}{P(w_1^n)} = argmax_{t_1^n} P(t_1^n, w_1^n)

P(t_1^n, w_1^n) = P(t_1^n) \times P(w_1^n \mid t_1^n)

where P(t_1^n) is the "language model" and P(w_1^n \mid t_1^n) is the "task model".

Page 38:

Independence Assumptions

Language model: approximate P(t_1^n) using an n-gram model

bigram:  \prod_{i=1}^{n} P(t_i \mid t_{i-1})
trigram: \prod_{i=1}^{n} P(t_i \mid t_{i-2}, t_{i-1})

Task model: approximate P(w_1^n \mid t_1^n) by assuming that a word appears in a category independently of its neighbours:

P(w_1^n \mid t_1^n) = \prod_{i=1}^{n} P(w_i \mid t_i)

We can smooth the language model using the smoothing techniques from earlier in this lecture.
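To make the decomposition concrete (not from the slides), a small sketch that scores one candidate tag sequence as a bigram tag model times the word-given-tag model; all probability tables are invented toy numbers.

```python
import math

# Invented toy parameters
p_tag = {("<s>", "DT"): 0.5, ("DT", "NN"): 0.6, ("NN", "VB"): 0.4}          # P(t_i | t_{i-1})
p_word = {("the", "DT"): 0.7, ("cat", "NN"): 0.01, ("sleeps", "VB"): 0.05}  # P(w_i | t_i)

def score(words, tags):
    """log P(t_1^n) + log P(w_1^n | t_1^n) under the bigram language model and the task model."""
    logp = 0.0
    prev = "<s>"
    for w, t in zip(words, tags):
        logp += math.log(p_tag.get((prev, t), 1e-12))    # language model term P(t | prev)
        logp += math.log(p_word.get((w, t), 1e-12))      # task model term P(w | t)
        prev = t
    return logp

print(score(["the", "cat", "sleeps"], ["DT", "NN", "VB"]))
```

The argmax over all candidate tag sequences is computed with dynamic programming, which is the topic of the next lecture.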

Page 39:

Training: estimating the probabilities

Given a training corpus in which every word is tagged with its correct tag, use relative frequencies from the training corpus.

Task model:

P(w_i \mid t_i) = \frac{\text{Count}(〈w_i, t_i〉)}{\sum_w \text{Count}(〈w, t_i〉)} = \frac{\text{Count}(〈w_i, t_i〉)}{\text{Count}(t_i)}

Language model (trigram model shown here):

P(t_i \mid t_{i-2}, t_{i-1}) = \frac{\text{Count}(t_{i-2}, t_{i-1}, t_i)}{\sum_{t \in T} \text{Count}(t_{i-2}, t_{i-1}, t)}
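A minimal sketch (not from the slides) of estimating both tables by relative frequency from a toy tagged corpus; the corpus is invented, and a bigram tag model is used instead of the trigram above to keep it short.

```python
from collections import Counter

tagged = [[("the", "DT"), ("cat", "NN"), ("sleeps", "VB")],
          [("the", "DT"), ("dog", "NN"), ("sleeps", "VB")]]

emit, tag_count, trans, ctx = Counter(), Counter(), Counter(), Counter()
for sent in tagged:
    prev = "<s>"
    for w, t in sent:
        emit[(w, t)] += 1          # Count(<w, t>)
        tag_count[t] += 1          # Count(t)
        trans[(prev, t)] += 1      # Count(t_{i-1}, t_i)
        ctx[prev] += 1
        prev = t

p_word = lambda w, t: emit[(w, t)] / tag_count[t]     # task model P(w | t)
p_tag = lambda t, prev: trans[(prev, t)] / ctx[prev]  # language model P(t | t_{i-1})

print(p_word("cat", "NN"))   # 0.5
print(p_tag("NN", "DT"))     # 1.0
```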

Page 40:

Unknown words in the task model (1)

When a new word w_u arrives, we have never seen any tag with this word.

Hence, for all tags t ∈ T: P(w_u \mid t) = 0

The performance of state-of-the-art taggers depends largely on the treatment of unknown words!

What should we do?

Use knowledge of the language:

Morphology: eating, filled, trousers, Noam, naturally

POS tag classes: closed-class versus open-class words.

Some classes such as proper nouns and verbs are much more likely to generate new words than other classes, e.g. prepositions and determiners.

Page 41:

Unknown words in the task model (2)

To capture the fact that some POS classes are more likely to generate unknown words, we can simply use Good-Turing smoothing on P(w \mid t).

n_0: number of unknown words
n_1(t): number of words which appeared exactly once with tag t
N(t): total number of words which appeared with tag t

Good-Turing estimate for an unknown word u:

P(u \mid t) = \frac{n_1(t)}{n_0 N(t)}

Page 42:

Unknown words: Good-Turing Example

P(u \mid t) = \frac{n_1(t)}{n_0 N(t)}

Number of unknown words: n_0 = 10

             t = NN    t = VBD   t = PREP
n_1(t)       150       100       2
N(t)         3000      1000      5000
P(u | t)     1/200     1/100     1/25000
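The same computation as a tiny sketch (not from the slides):

```python
from fractions import Fraction

n0 = 10                                    # number of unknown word types
stats = {"NN": (150, 3000), "VBD": (100, 1000), "PREP": (2, 5000)}   # tag -> (n_1(t), N(t))

for tag, (n1_t, N_t) in stats.items():
    print(tag, Fraction(n1_t, n0 * N_t))   # NN 1/200, VBD 1/100, PREP 1/25000
```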

Page 43:

Unknown words in the task model (3)

- For each morphological feature (capitalization, hyphen, etc.), separately train estimators P(feature \mid tag).
- Assume independence and multiply (see the sketch below).
- The model of Weischedel et al. (1993):

P(w, unknown \mid t) \propto P(unknown \mid t) \times P(capitalized(w) \mid t) \times P(suffix(w)/hyphen(w) \mid t)

This treatment of unknown words might not be suitable for "morphologically rich" languages such as Finnish, Hebrew or Arabic.
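A rough sketch (not from the slides) of the independence-based unknown-word model referenced above; the feature estimators, their values, and the suffix length are invented placeholders, not Weischedel et al.'s actual parameters.

```python
def unknown_word_score(word, tag, p_unknown, p_cap, p_suffix):
    """P(w, unknown | t) proportional to P(unknown | t) * P(capitalized(w) | t) * P(suffix(w) | t)."""
    cap = word[0].isupper()
    suffix = word[-3:]
    return (p_unknown[tag]
            * p_cap[tag].get(cap, 1e-6)
            * p_suffix[tag].get(suffix, 1e-6))

# Invented toy estimates for two tags
p_unknown = {"NNP": 0.4, "VBD": 0.1}
p_cap = {"NNP": {True: 0.9, False: 0.1}, "VBD": {True: 0.05, False: 0.95}}
p_suffix = {"NNP": {"ton": 0.02}, "VBD": {"ied": 0.05}}

print(unknown_word_score("Washington", "NNP", p_unknown, p_cap, p_suffix))
print(unknown_word_score("glorified", "VBD", p_unknown, p_cap, p_suffix))
```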

Page 44:

Next Lecture

- Tagging (dynamic programming)
- Recap of Part A (exam preparation)
- Syntax: outlook
- No lecture on February 18