8/27/2015 CPSC503 Winter 2008 1: CPSC 503 Computational Linguistics, Lecture 5, Giuseppe Carenini
TRANSCRIPT
04/19/23 CPSC503 Winter 2008 1
CPSC 503 Computational Linguistics
Lecture 5
Giuseppe Carenini
04/19/23 CPSC503 Winter 2008 2
Today 22/9
• Finish spelling
• n-grams
• Model evaluation
04/19/23 CPSC503 Winter 2008 3
Spelling: the problem(s)
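Detection vs. correction, by error type:
• Non-word, isolated
– Detection: is w in the vocabulary V?
– Correction: find the most likely correct word (funn -> funny, funnel...)
• Non-word, in context
– Correction: find the most likely correct word in this context (trust funn; a lot of funn)
• Real-word, isolated
– ?!
• Real-word, in context (I want too go their)
– Detection: is it an impossible (or very unlikely) word in this context?
– Correction: find the most likely substitute word in this context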
04/19/23 CPSC503 Winter 2008 4
Key Transition
• Up to this point we’ve mostly been discussing words in isolation
• Now we’re switching to sequences of words
• And we’re going to worry about assigning probabilities to sequences of words
04/19/23 CPSC503 Winter 2008 5
Knowledge-Formalisms Map (including probabilistic formalisms)
Formalisms:
• State Machines (and prob. versions): Finite State Automata, Finite State Transducers, Markov Models
• Rule systems (and prob. versions), e.g., (Prob.) Context-Free Grammars
• Logical formalisms (First-Order Logics)
• AI planners
Linguistic knowledge:
• Morphology
• Syntax
• Semantics
• Pragmatics
• Discourse and Dialogue
04/19/23 CPSC503 Winter 2008 6
Only Spelling?
A. Assign a probability to a sentence
• Part-of-speech tagging
• Word-sense disambiguation
• Probabilistic Parsing
B. Predict the next word
• Speech recognition
• Hand-writing recognition
• Augmentative communication for the disabled
A and B both need $P(w_1,\ldots,w_n)$? Impossible to estimate
04/19/23 CPSC503 Winter 2008 7
Impossible to estimate!
$P(w_1,\ldots,w_n)$?
Assuming $10^4$ word types and an average sentence of 10 words -> sample space?
• Google language model update (22 Sept. 2006) was based on a corpus with number of sentences: 95,119,665,584
Most sentences will not appear, or will appear only once
Key point in Stat. NLP: your corpus should be >> your sample space!
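The size comparison on this slide can be checked with a few lines of arithmetic. This is only a sketch: it assumes every sentence has exactly 10 words and that any word can fill any position, which is the simplification the slide itself makes. The variable names are illustrative.

```python
# Assumed figures from the slide: 10^4 word types, 10-word sentences.
vocab_size = 10**4
sentence_length = 10
sample_space = vocab_size ** sentence_length  # number of distinct word sequences

google_sentences = 95_119_665_584  # corpus size quoted on the slide

print(f"sample space: {sample_space:.1e}")
print(f"corpus size:  {google_sentences:.1e}")
print(f"sequences per corpus sentence: {sample_space / google_sentences:.1e}")
```

Under these assumptions the sample space exceeds the corpus by many orders of magnitude, which is why direct estimation of whole-sentence probabilities fails.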
04/19/23 CPSC503 Winter 2008 8
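Decompose: apply chain rule
Chain Rule:
$$P(A_1,\ldots,A_n) = \prod_{i=1}^{n} P\Big(A_i \,\Big|\, \bigcap_{j=1}^{i-1} A_j\Big)$$
Applied to a word sequence from position 1 to n, written $w_1^n$:
$$P(w_1,\ldots,w_n) = P(w_1^n) = P(w_1)\,P(w_2 \mid w_1)\cdots P(w_n \mid w_1^{n-1}) = P(w_1)\prod_{k=2}^{n} P(w_k \mid w_1^{k-1})$$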
04/19/23 CPSC503 Winter 2008 9
Example
• Sequence “The big red dog barks”
• P(The big red dog barks) = P(The) * P(big|the) * P(red|the big) * P(dog|the big red) * P(barks|the big red dog)
Note: P(The) is better expressed as P(The|<Beginning of sentence>), written as P(The|<S>)
04/19/23 CPSC503 Winter 2008 10
Not a satisfying solution
Even for small n (e.g., 6) we would need a far too large corpus to estimate:
$$P(w_6 \mid w_1^5)$$
Markov Assumption: the entire prefix history isn’t necessary.
$$P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1})$$
• N=1 (unigram): $P(w_n \mid w_1^{n-1}) \approx P(w_n)$
• N=2 (bigram): $P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-1})$
• N=3 (trigram): $P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-2}, w_{n-1})$
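To make the truncated histories concrete, the sketch below lists the factors of a sentence probability under unigram, bigram and trigram approximations. The function name `ngram_factors` and the choice to pad the start of the sentence with n-1 copies of `<S>` are assumptions made for illustration, not something the slides prescribe.

```python
def ngram_factors(tokens, n, start="<S>"):
    """Terms of the sentence probability under an n-gram (Markov) approximation.

    Each word is conditioned on at most the n-1 preceding words; padding
    with the start marker is an assumed convention.
    """
    padded = [start] * (n - 1) + tokens
    factors = []
    for k in range(n - 1, len(padded)):
        context = padded[k - n + 1:k]   # the n-1 words before position k
        word = padded[k]
        factors.append(f"P({word}|{' '.join(context)})" if context else f"P({word})")
    return factors

sentence = "The big red dog barks".split()
for n, name in [(1, "unigram"), (2, "bigram"), (3, "trigram")]:
    print(f"{name}: " + " * ".join(ngram_factors(sentence, n)))
```

Compared with the full chain rule, every factor now has a history of bounded length, so the number of distinct conditional probabilities to estimate no longer grows with sentence length.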
04/19/23 CPSC503 Winter 2008 11
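Prob of a sentence: N-Grams
Chain-rule:
$$P(w_1,\ldots,w_n) = P(w_1^n) = P(w_1)\prod_{k=2}^{n} P(w_k \mid w_1^{k-1})$$
Simplifications:
• unigram: $P(w_1,\ldots,w_n) = P(w_1^n) \approx P(w_1)\prod_{k=2}^{n} P(w_k)$
• bigram: $P(w_1,\ldots,w_n) = P(w_1^n) \approx P(w_1)\prod_{k=2}^{n} P(w_k \mid w_{k-1})$
• trigram: $P(w_1,\ldots,w_n) = P(w_1^n) \approx P(w_1)\prod_{k=2}^{n} P(w_k \mid w_{k-2}, w_{k-1})$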
04/19/23 CPSC503 Winter 2008 12
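Bigram
<S> The big red dog barks
P(The big red dog barks) = P(The|<S>) * P(big|The) * P(red|big) * P(dog|red) * P(barks|dog)
$$P(w_1,\ldots,w_n) = P(w_1^n) \approx P(w_1 \mid \langle S\rangle)\prod_{k=2}^{n} P(w_k \mid w_{k-1})$$
Trigram?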
04/19/23 CPSC503 Winter 2008 13
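Estimates for N-Grams
bigram:
$$P(w_n \mid w_{n-1}) = \frac{P(w_{n-1}, w_n)}{P(w_{n-1})} = \frac{C(w_{n-1}, w_n)/N_{pairs}}{C(w_{n-1})/N_{words}} \approx \frac{C(w_{n-1}, w_n)}{C(w_{n-1})}$$
..in general:
$$P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1}\, w_n)}{C(w_{n-N+1}^{n-1})}$$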
04/19/23 CPSC503 Winter 2008 14
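Estimates for Bigrams
$$P(\text{red} \mid \text{big}) = \frac{P(\text{big}, \text{red})}{P(\text{big})} = \frac{C(\text{big}, \text{red})/N_{pairs}}{C(\text{big})/N_{words}} \approx \frac{C(\text{big}, \text{red})}{C(\text{big})}$$
Silly Corpus:
“<S> The big red dog barks against the big pink dog”
Word types vs. Word tokens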
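The count-ratio estimate can be tried directly on the silly corpus. The sketch below is a minimal illustration: lowercasing (so that "The" and "the" are one word type) and the helper name `bigram_mle` are assumptions, not part of the slides.

```python
from collections import Counter

# Silly corpus from the slide; lowercasing is an assumption so that
# "The" and "the" count as the same word type.
tokens = "<S> The big red dog barks against the big pink dog".lower().split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def bigram_mle(prev, word):
    """MLE estimate P(word | prev) = C(prev, word) / C(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

words = [t for t in tokens if t != "<s>"]   # leave out the start marker
print("tokens:", len(words), "types:", len(set(words)))
print("P(red|big) =", bigram_mle("big", "red"))
print("P(dog|big) =", bigram_mle("big", "dog"))   # unseen bigram
```

The unseen pair "big dog" gets probability zero here, which is the sparsity issue that smoothing addresses.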
04/19/23 CPSC503 Winter 2008 15
Berkeley ____________ Project (1994) Table: Counts
[Table: bigram counts; rows $w_{n-1}$, columns $w_n$]
$$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1}, w_n)}{C(w_{n-1})}$$
Corpus: ~10,000 sentences, 1616 word types
What domain? Dialog? Reviews?
04/19/23 CPSC503 Winter 2008 16
BERP Table: $P(w_n \mid w_{n-1})$
[Table: bigram probabilities; rows $w_{n-1}$, columns $w_n$]
04/19/23 CPSC503 Winter 2008 17
BERP Table Comparison
[Tables: Counts and Prob.; rows $w_{n-1}$, columns $w_n$]
Counts -> Prob.:
$$\frac{C(w_{n-1}, w_n)}{C(w_{n-1})}$$
1?
04/19/23 CPSC503 Winter 2008 18
Some Observations
• What’s being captured here?
– P(want|I) = .32
– P(to|want) = .65
– P(eat|to) = .26
– P(food|Chinese) = .56
– P(lunch|eat) = .055
[BERP bigram table; rows $w_{n-1}$, columns $w_n$]
04/19/23 CPSC503 Winter 2008 19
Some Observations
• P(I | I): I I I want
• P(want | I): I want I want to
• P(I | food): The food I want is
[BERP bigram table; rows $w_{n-1}$, columns $w_n$]
Speech-based restaurant consultant!
04/19/23 CPSC503 Winter 2008 20
Generation
• Choose N-Grams according to their probabilities and string them together
[BERP bigram table; rows $w_{n-1}$, columns $w_n$]
• I want -> want to -> to eat -> eat lunch
• I want -> want to -> to eat -> eat Chinese -> Chinese food
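Stringing bigrams together can be sketched as repeated sampling of the next word given the previous one. The example below does not use the BERP table (its full counts are not reproduced here); it assumes a toy training text, an added `</s>` end marker so that generation can stop, and a hypothetical helper `generate`.

```python
import random
from collections import Counter, defaultdict

# Toy training text (the lecture's "silly corpus", lowercased); the
# "</s>" end marker is an added assumption so that generation can stop.
tokens = "<s> the big red dog barks against the big pink dog </s>".split()

# successors[prev] counts how often each word follows prev.
successors = defaultdict(Counter)
for prev, word in zip(tokens, tokens[1:]):
    successors[prev][word] += 1

def generate(rng, max_words=20):
    """Sample words one bigram at a time, proportionally to bigram counts."""
    out, prev = [], "<s>"
    for _ in range(max_words):
        words, counts = zip(*successors[prev].items())
        prev = rng.choices(words, weights=counts)[0]
        if prev == "</s>":
            break
        out.append(prev)
    return out

print(" ".join(generate(random.Random(0))))
```

Every adjacent pair in the output is a bigram seen in training, but the sentence as a whole may never have occurred, for example "the big pink dog barks against the big red dog".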
04/19/23 CPSC503 Winter 2008 21
Two problems with applying:
$$P(w_1^n) \approx P(w_1)\prod_{k=2}^{n} P(w_k \mid w_{k-1})$$
to [BERP bigram table; rows $w_{n-1}$, columns $w_n$]
04/19/23 CPSC503 Winter 2008 22
Problem (1)
• We may need to multiply many very small numbers (underflows!)
• Easy Solution:
– Convert probabilities to logs and then do ……
– To get the real probability (if you need it) go back to the antilog.
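A short sketch of why working in log space helps. The first part uses four bigram probabilities quoted from the BERP table in this lecture; the second part uses made-up numbers (200 factors of 0.01) purely to trigger floating-point underflow. Natural logarithms are an arbitrary choice here; any base works.

```python
import math

# Bigram probabilities quoted from the BERP table in this lecture:
# P(want|I), P(to|want), P(eat|to), P(lunch|eat)
probs = [0.32, 0.65, 0.26, 0.055]

log_prob = sum(math.log(p) for p in probs)   # combine in log space
print(log_prob, math.exp(log_prob))          # antilog recovers the product

# Underflow illustration with made-up numbers: 200 factors of 1e-2.
tiny = [1e-2] * 200
product = math.prod(tiny)                    # too small for a float: 0.0
log_sum = sum(math.log(p) for p in tiny)     # still representable
print(product, log_sum)
```

The direct product collapses to zero, while the log-space value stays a perfectly ordinary number that can still be compared across candidate sentences.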
04/19/23 CPSC503 Winter 2008 23
Problem (2)
• The probability matrix for n-grams is sparse
• How can we assign a probability to a sequence where one of the component n-grams has a value of zero?
• Solutions:
– Add-one smoothing
– Good-Turing
– Back off and Deleted Interpolation
04/19/23 CPSC503 Winter 2008 24
Add-One
• Make the zero counts 1.
• Rationale: If you had seen these “rare” events chances are you would only have seen them once.
unigram:
$$P(w) = \frac{C(w)}{N}$$
$$P^*(w) = \frac{C(w)+1}{N+V}$$
$$P^*(w) = \frac{C^*(w)}{N}, \qquad C^*(w) = (C(w)+1)\frac{N}{N+V}$$
discount:
$$d_c = \frac{C^*(w)}{C(w)}$$
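The unigram add-one formulas can be checked numerically. This is a sketch on a toy text; the vocabulary is assumed to contain one hypothetical unseen word, "cat", so that there is a zero count to smooth, and the helper names are illustrative.

```python
from collections import Counter

tokens = "the big red dog barks against the big pink dog".split()
counts = Counter(tokens)
vocab = set(tokens) | {"cat"}   # "cat" is a hypothetical unseen vocabulary word
N = len(tokens)                 # number of word tokens
V = len(vocab)                  # number of word types in the vocabulary

def p_mle(word):
    """Unsmoothed estimate C(w) / N."""
    return counts[word] / N

def p_add_one(word):
    """Add-one estimate (C(w) + 1) / (N + V)."""
    return (counts[word] + 1) / (N + V)

def adjusted_count(word):
    """Smoothed count C*(w) = (C(w) + 1) * N / (N + V)."""
    return (counts[word] + 1) * N / (N + V)

for w in ["the", "pink", "cat"]:
    print(w, p_mle(w), p_add_one(w), adjusted_count(w))
```

Seen words lose probability mass (their adjusted counts drop below the raw counts) and the unseen word gains a non-zero probability, while the smoothed probabilities still sum to one over the vocabulary.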
04/19/23 CPSC503 Winter 2008 25
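Add-One: Bigram
$$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1}, w_n)}{C(w_{n-1})}$$
$$P^*(w_n \mid w_{n-1}) = \frac{C(w_{n-1}, w_n)+1}{C(w_{n-1})+V}$$
$$C^*(w_{n-1}, w_n) = (C(w_{n-1}, w_n)+1)\frac{C(w_{n-1})}{C(w_{n-1})+V}$$
[Table: Counts; rows $w_{n-1}$, columns $w_n$]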
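A minimal sketch of add-one smoothing for bigrams on a toy text. It assumes the vocabulary is exactly the set of observed tokens (including the `<s>` marker), and it computes the adjusted count as (C + 1) · C(prev) / (C(prev) + V), which is the value that gives back the smoothed probability when divided by C(prev). Function names are illustrative.

```python
from collections import Counter

tokens = "<s> the big red dog barks against the big pink dog".split()
vocab = set(tokens)             # assumed vocabulary: the observed types
V = len(vocab)
unigram = Counter(tokens)
bigram = Counter(zip(tokens, tokens[1:]))

def p_add_one(prev, word):
    """Add-one estimate (C(prev, word) + 1) / (C(prev) + V)."""
    return (bigram[(prev, word)] + 1) / (unigram[prev] + V)

def adjusted_count(prev, word):
    """Smoothed count C*(prev, word) = (C(prev, word) + 1) * C(prev) / (C(prev) + V)."""
    return (bigram[(prev, word)] + 1) * unigram[prev] / (unigram[prev] + V)

print("P*(red|big) =", p_add_one("big", "red"))    # seen bigram
print("P*(dog|big) =", p_add_one("big", "dog"))    # unseen bigram
print("C*(big, red) =", adjusted_count("big", "red"))
```

Even in this tiny example the seen bigram "big red" drops from an unsmoothed 0.5 to 0.2, a first hint of how much mass add-one moves to unseen events.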
04/19/23 CPSC503 Winter 2008 26
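BERP Original vs. Add-one smoothed Counts
[Tables: original and add-one smoothed bigram counts; rows $w_{n-1}$, columns $w_n$; values surviving from the slide: 6, 19225, 15]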
04/19/23 CPSC503 Winter 2008 27
Add-One Smoothed Problems
• An excessive amount of probability mass is moved to the zero counts
• When compared empirically with MLE or other smoothing techniques, it performs poorly
• -> Not used in practice
• Detailed discussion [Gale and Church 1994]
04/19/23 CPSC503 Winter 2008 28
Better smoothing technique
• Good-Turing Discounting (clear example in the textbook)
More advanced: Backoff and Interpolation
• To estimate an ngram use “lower order” ngrams
– Backoff: Rely on the lower order ngrams when we have zero counts for a higher order one; e.g., use unigrams to estimate zero count bigrams
– Interpolation: Mix the prob estimate from all the ngram estimators; e.g., we do a weighted interpolation of trigram, bigram and unigram counts
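The two ideas can be contrasted in a few lines. This is a deliberately simplified sketch: the interpolation weights are hypothetical fixed numbers (in practice they would be tuned on held-out data), and `naive_backoff` just falls through to the first non-zero estimate without the discounting that a proper backoff scheme needs to keep the probabilities summing to one.

```python
def interpolate(p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    """Linear interpolation of trigram, bigram and unigram estimates."""
    l3, l2, l1 = lambdas
    assert abs(l3 + l2 + l1 - 1.0) < 1e-9, "weights must sum to 1"
    return l3 * p_tri + l2 * p_bi + l1 * p_uni

def naive_backoff(p_tri, p_bi, p_uni):
    """Use the highest-order estimate that is non-zero (no discounting)."""
    for p in (p_tri, p_bi, p_uni):
        if p > 0:
            return p
    return 0.0

# Hypothetical estimates for one word: unseen trigram, seen bigram and unigram.
print(interpolate(0.0, 0.2, 0.05))
print(naive_backoff(0.0, 0.2, 0.05))
```

Interpolation always mixes all three estimators, whereas backoff consults the lower orders only when the higher-order count is zero.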
04/19/23 CPSC503 Winter 2008 29
N-Grams Summary: final
$P(w_1,\ldots,w_n)$?
• Impossible to estimate! Sample space much bigger than any realistic corpus
• Chain rule does not help
• Markov assumption:
– Unigram … sample space?
– Bigram … sample space?
– Trigram … sample space?
• Sparse matrix: Smoothing techniques
Look at practical issues: sec. 4.8
04/19/23 CPSC503 Winter 2008 30
You still need a big corpus…
• The biggest one is the Web!
$$P_{Web}(w_3 \mid w_1, w_2) = \frac{C_{Web}(w_1, w_2, w_3)}{C_{Web}(w_1, w_2)}$$
• Impractical to download every page from the Web to count the ngrams =>
• Rely on page counts (which are only approximations)
– Page can contain an ngram multiple times
– Search Engines round-off their counts
• Such “noise” is tolerable in practice
04/19/23 CPSC503 Winter 2008 31
Today 22/9
• Finish spelling
• n-grams
• Model evaluation
04/19/23 CPSC503 Winter 2008 32
Entropy
• Def1. Measure of uncertainty
• Def2. Measure of the information that we need to resolve an uncertain situation
– Let $p(x) = P(X = x)$, where $x \in X$.
– $H(p) = H(X) = -\sum_{x \in X} p(x)\log_2 p(x)$
– It is normally measured in bits.
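The entropy definition translates directly into code. The sketch below represents a distribution as a dictionary from outcomes to probabilities, which is an assumed encoding; the example distributions are made up.

```python
import math

def entropy(dist):
    """Entropy in bits of a distribution given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

print(entropy({"H": 0.5, "T": 0.5}))             # fair coin
print(entropy({w: 1 / 8 for w in "abcdefgh"}))   # 8 equally likely outcomes
print(entropy({"H": 0.9, "T": 0.1}))             # biased coin: less uncertain
```

The more predictable the outcome, the fewer bits are needed on average to resolve it.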
04/19/23 CPSC503 Winter 2008 33
Model Evaluation
• Actual distribution: $P(w_1,\ldots,w_n)$?
• Our approximation: $Q(w_1,\ldots,w_n)$?
How different?
Relative Entropy (KL divergence)?
$$D(p \| q) = \sum_{x \in X} p(x)\log\frac{p(x)}{q(x)}$$
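A small sketch of relative entropy on made-up distributions. It assumes base-2 logarithms (so the result is in bits) and that q assigns non-zero probability wherever p does.

```python
import math

def kl_divergence(p, q):
    """D(p||q) in bits; p and q map the same outcomes to probabilities."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.5}      # hypothetical "actual" distribution
q = {"a": 0.25, "b": 0.75}    # hypothetical approximation

print(kl_divergence(p, p))    # identical distributions
print(kl_divergence(p, q))
print(kl_divergence(q, p))    # not symmetric
```

The divergence is zero only when the two distributions agree, and swapping the arguments generally changes the value, so it is not a distance in the usual sense.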
04/19/23 CPSC503 Winter 2008 34
Entropy of $P(w_1,\ldots,w_n)$:
$$H(P) = H(w_1^n) = -\sum_{w_1^n \in L} P(w_1^n)\log P(w_1^n)$$
Entropy rate:
$$\frac{1}{n}H(w_1^n)$$
Language Entropy:
$$H(L) = \lim_{n\to\infty} -\frac{1}{n}\sum_{w_1^n \in L} P(w_1^n)\log P(w_1^n)$$
Assumptions: ergodic and stationary (Shannon-McMillan-Breiman)
$$H(L) = \lim_{n\to\infty} -\frac{1}{n}\log P(w_1^n)$$
NL?
Entropy can be computed by taking the average log probability of a looooong sample
04/19/23 CPSC503 Winter 2008 35
Cross-Entropy
Between probability distribution P and another distribution Q (model for P):
$$H(P,Q) = H(P) + D(P\|Q) = -\sum_x P(x)\log Q(x)$$
$$H(P,Q) \ge H(P)$$
Applied to Language:
$$H(P,Q) = \lim_{n\to\infty} -\frac{1}{n}\sum_{w_1^n \in L} P(w_1^n)\log Q(w_1^n)$$
$$H(P,Q) = \lim_{n\to\infty} -\frac{1}{n}\log Q(w_1^n)$$
Between two models Q1 and Q2, the more accurate is the one with the lower cross-entropy (equivalently, since $H(P)$ is fixed, the lower $D(P\|Q)$).
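The relation between cross-entropy, entropy and KL divergence can be verified numerically. The distributions below are invented for illustration (one "true" distribution and two candidate models), and base-2 logarithms are assumed.

```python
import math

def entropy(p):
    """H(P) in bits."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

def kl_divergence(p, q):
    """D(P||Q) in bits."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

def cross_entropy(p, q):
    """H(P,Q) = -sum_x P(x) log2 Q(x), in bits."""
    return -sum(px * math.log2(q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.25, "c": 0.25}   # hypothetical "true" distribution
q1 = {"a": 0.4, "b": 0.3, "c": 0.3}    # hypothetical model close to p
q2 = {"a": 0.1, "b": 0.1, "c": 0.8}    # hypothetical model far from p

for name, q in [("q1", q1), ("q2", q2)]:
    print(name, cross_entropy(p, q), entropy(p) + kl_divergence(p, q))
```

Both columns agree for each model, and the model closer to p has the lower cross-entropy, which is the basis for comparing models without knowing H(P) itself.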
04/19/23 CPSC503 Winter 2008 36
Model Evaluation: In practice
A: split the Corpus into Training Set and Testing set
B: train model on the Training Set -> Model: Q
• counting
• frequencies
• smoothing
C: Apply model to the Testing set $w_1^N$
• Compute cross-perplexity
$$2^{H(P,Q)}, \qquad H(P,Q) \approx -\frac{1}{N}\log_2 Q(w_1^N)$$
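The split / train / apply procedure can be sketched end to end on toy data. Everything here is an assumption made for illustration: the tiny training and test texts, the choice of an add-one smoothed bigram model as Q, and the convention that the start marker is conditioned on but not itself predicted.

```python
import math
from collections import Counter

# A: toy "split" (hypothetical data; every test word occurs in training)
train = "<s> the big red dog barks against the big pink dog".split()
test = "<s> the big pink dog barks".split()

# B: train the model Q by counting and smoothing
vocab = set(train)
V = len(vocab)
unigram = Counter(train)
bigram = Counter(zip(train, train[1:]))

def q(prev, word):
    """Add-one smoothed bigram model Q(word | prev)."""
    return (bigram[(prev, word)] + 1) / (unigram[prev] + V)

# C: apply the model to the test set
def cross_entropy(tokens):
    """-(1/N) * log2 Q(tokens), with N the number of predicted words."""
    log_q = sum(math.log2(q(p, w)) for p, w in zip(tokens, tokens[1:]))
    return -log_q / (len(tokens) - 1)

H = cross_entropy(test)
print("cross-entropy (bits/word):", H)
print("perplexity:", 2 ** H)
```

Perplexity can be read as an effective branching factor: a model that is less surprised by the test set has a lower value.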
04/19/23 CPSC503 Winter 2008 37
Knowledge-Formalisms Map (including probabilistic formalisms)
Formalisms:
• State Machines (and prob. versions): Finite State Automata, Finite State Transducers, Markov Models
• Rule systems (and prob. versions), e.g., (Prob.) Context-Free Grammars
• Logical formalisms (First-Order Logics)
• AI planners
Linguistic knowledge:
• Morphology
• Syntax
• Semantics
• Pragmatics
• Discourse and Dialogue
04/19/23 CPSC503 Winter 2008 38
Next Time
• Hidden Markov Models (HMM)
• Part-of-Speech (POS) tagging