Maximum Entropy
Advanced Statistical Methods in NLP
Ling 572, January 31, 2012
Roadmap

Maximum entropy:
- Overview
- Generative & discriminative models: Naïve Bayes vs. MaxEnt
- Maximum entropy principle
- Feature functions
- Modeling
- Decoding
- Summary
Maximum Entropy ("MaxEnt")

- Popular machine learning technique for NLP
- First uses in NLP circa 1996 (Rosenfeld, Berger)
- Applied to a wide range of tasks: sentence boundary detection (MxTerminator, Ratnaparkhi), POS tagging (Ratnaparkhi, Berger), topic segmentation (Berger), language modeling (Rosenfeld), prosody labeling, etc.
Readings & Comments

Several readings: (Berger, 1996), (Ratnaparkhi, 1997), and the (Klein & Manning, 2003) tutorial.

Note: some of these are very dense. Don't spend huge amounts of time on every detail; take a first pass before class and review after lecture.

Going forward, the techniques get more complex. The goal is to understand the basic model and concepts. Training is especially complex; we'll discuss it, but not implement it.
Notation Note

The readings are not entirely consistent. We'll use: input = x; output = y; pair = (x,y).

- Consistent with Berger, 1996
- Ratnaparkhi, 1996: input = h; output = t; pair = (h,t)
- Klein & Manning, 2003: input = d; output = c; pair = (c,d)
Joint vs Conditional Models

Assuming some training data {(x,y)}, we need to learn a model Θ s.t. given a new x, we can predict label y.

Different types of models:
- Joint models (aka generative models) estimate P(x,y) by maximizing P(X,Y|Θ). Most models so far: n-gram, Naïve Bayes, HMM, etc. Conceptually easy to compute weights: relative frequency.
- Conditional (aka discriminative) models estimate P(y|x) by maximizing P(Y|X,Θ). Models going forward: MaxEnt, SVM, CRF, etc. Computing the weights is more complex.
Naïve Bayes Model

The Naïve Bayes model assumes features f are independent of each other, given the class c.

(Graphical model: class node c with arrows to feature nodes f1, f2, f3, …, fk.)
Naïve Bayes Model

Makes the assumption of conditional independence of features given the class. However, this is generally unrealistic:

- P("cuts" | politics) = p_cuts
- What about P("cuts" | politics, "budget")? Is it still p_cuts?
- We would like a model that doesn't make this assumption.
Model Parameters

Our model: c* = argmax_c P(c) Π_j P(f_j|c)

Two types of parameters:
- P(c): class priors
- P(f_j|c): class-conditional feature probabilities

In total: |C| + |V||C| parameters, if features are words in vocabulary V.
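A minimal sketch of this decision rule, with hypothetical toy parameters invented for illustration (computed in log space to avoid underflow):

```python
import math

# Hypothetical toy parameters, for illustration only.
priors = {"politics": 0.6, "sports": 0.4}                       # P(c)
cond = {                                                        # P(f|c)
    "politics": {"cuts": 0.05, "budget": 0.04, "game": 0.001},
    "sports":   {"cuts": 0.01, "budget": 0.002, "game": 0.06},
}

def classify(words):
    # c* = argmax_c P(c) * prod_j P(f_j|c), computed in log space
    best_c, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)
        for w in words:
            score += math.log(cond[c].get(w, 1e-6))  # tiny floor for unseen words
        if score > best_score:
            best_c, best_score = c, score
    return best_c

print(classify(["budget", "cuts"]))  # -> politics
```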
Weights in Naïve Bayes

|        | c1            | c2          | c3          | …   | ck          |
|--------|---------------|-------------|-------------|-----|-------------|
| f1     | P(f1\|c1)     | P(f1\|c2)   | P(f1\|c3)   | …   | P(f1\|ck)   |
| f2     | P(f2\|c1)     | P(f2\|c2)   | …           |     |             |
| …      | …             |             |             |     |             |
| f\|V\| | P(f\|V\|\|c1) |             |             |     |             |
Weights in Naïve Bayes and Maximum Entropy

Naïve Bayes: the P(f|y) are probabilities in [0,1]; these are the weights.

P(y|x) = P(y) Π_j P(f_j|y) / P(x)

MaxEnt: weights are real numbers, of any magnitude and sign.

P(y|x) = exp(Σ_j λ_j f_j(x,y)) / Z(x)
MaxEnt Overview

Prediction: P(y|x) = exp(Σ_j λ_j f_j(x,y)) / Z(x), where Z(x) = Σ_y' exp(Σ_j λ_j f_j(x,y'))

- f_j(x,y): binary feature function, indicating the presence of feature j in an instance x of class y
- λ_j: feature weights, learned in training

Prediction: compute P(y|x) for each y, and pick the highest.
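A minimal decoding sketch of this prediction rule; the (feature, class) weight table below is invented for illustration:

```python
import math

# Hypothetical (feature, class) weights lambda_j, invented for illustration.
weights = {
    ("rifle", "guns"): 1.8, ("rifle", "politics"): -0.3,
    ("budget", "politics"): 1.2, ("budget", "guns"): -0.5,
}
classes = ["guns", "politics"]

def p_y_given_x(words, y):
    # P(y|x) = exp(sum_j lambda_j * f_j(x,y)) / Z(x)
    def score(c):
        # Each f_j(x,y) fires (value 1) iff its word is in x and its class is c.
        return sum(w for (word, cls), w in weights.items()
                   if cls == c and word in words)
    z = sum(math.exp(score(c)) for c in classes)   # normalizer Z(x)
    return math.exp(score(y)) / z

x = {"rifle", "budget"}
best = max(classes, key=lambda y: p_y_given_x(x, y))
print(best, p_y_given_x(x, best))  # -> guns 0.598...
```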
Weights in MaxEnt

|        | c1 | c2 | c3 | … | ck |
|--------|----|----|----|---|----|
| f1     | λ1 | λ8 | …  |   |    |
| f2     | λ2 | …  |    |   |    |
| …      | …  |    |    |   |    |
| f\|V\| | λ6 |    |    |   |    |
Maximum Entropy Principle

- Intuitively: model all that is known, and assume as little as possible about what is unknown.
- Maximum entropy = minimum commitment.
- Related to concepts like Occam's razor.
- Laplace's "Principle of Insufficient Reason": when one has no information to distinguish between the probability of two events, the best strategy is to consider them equally likely.
Example I (K&M 2003)

Consider a coin flip, with entropy H(X) = -Σ_x p(x) log p(x).

- What values of P(X=H), P(X=T) maximize H(X)? P(X=H) = P(X=T) = 1/2. If there is no prior information, the best guess is a fair coin.
- What if you know P(X=H) = 0.3? Then P(X=T) = 0.7.
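A quick numerical check of this claim, sweeping a few candidate values of P(X=H):

```python
import math

def entropy(p_heads):
    # H(X) = -sum_x p(x) log2 p(x) for the two-outcome coin
    p = [p_heads, 1.0 - p_heads]
    return -sum(q * math.log2(q) for q in p if q > 0)

for ph in [0.1, 0.3, 0.5, 0.7, 0.9]:
    print(f"P(H)={ph:.1f}  H(X)={entropy(ph):.4f}")
# H(X) peaks at 1 bit when P(H)=0.5; with P(H) fixed at 0.3,
# the remaining mass must go to tails: P(T)=0.7.
```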
Example II: MT (Berger, 1996)

Task: English → French machine translation; specifically, translating 'in'.

Suppose we've seen 'in' translated as {dans, en, à, au cours de, pendant}.

Constraint: p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1

If there is no other constraint, what is the maxent model? p(dans) = p(en) = p(à) = p(au cours de) = p(pendant) = 1/5
What if we find out that the translator uses dans or en 30% of the time? Constraint: p(dans) + p(en) = 3/10

Now what is the maxent model? p(dans) = p(en) = 3/20; p(à) = p(au cours de) = p(pendant) = 7/30

What if we also know the translator picks à or dans 50% of the time? Add a new constraint: p(à) + p(dans) = 0.5. Now what is the maxent model? Not intuitively obvious…
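With the third constraint added, the answer is no longer obvious by symmetry, but it can be checked numerically. A sketch using SciPy's SLSQP optimizer (assuming NumPy and SciPy are available; an illustration, not part of the original lecture):

```python
import numpy as np
from scipy.optimize import minimize

words = ["dans", "en", "à", "au cours de", "pendant"]

def neg_entropy(p):
    # Negative Shannon entropy; the small epsilon guards against log(0).
    return float(np.sum(p * np.log(p + 1e-12)))

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},       # probabilities sum to 1
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 0.3},   # p(dans) + p(en) = 3/10
    {"type": "eq", "fun": lambda p: p[2] + p[0] - 0.5},   # p(à) + p(dans) = 1/2
]
result = minimize(neg_entropy, x0=np.full(5, 0.2), method="SLSQP",
                  bounds=[(0.0, 1.0)] * 5, constraints=constraints)
for w, prob in zip(words, result.x):
    print(f"p({w}) = {prob:.4f}")
```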
Example III: POS (K&M, 2003)

(The slides step through candidate distributions over POS tags; the tables did not survive extraction.)
Problem: the distribution is too uniform. What else do we know?

- Nouns are more common than verbs. So with f_N = {NN, NNS, NNP, NNPS}, E[f_N] = 32/36.
- Also, proper nouns are more frequent than common nouns, so E[NNP, NNPS] = 24/36.
- Etc.
Maximum Entropy Principle: Summary

Among all probability distributions p in P that satisfy the set of constraints, select the p* that maximizes entropy: p* = argmax_{p in P} H(p)

Questions:
1) How do we model the constraints?
2) How do we select the distribution?
MaxEnt Modeling
Basic Approach

Given some training data, collect training examples (x,y) in (X,Y), where:
- x in X: the training data samples
- y in Y: the labels (classes) to predict

For example, in text classification:
- X: words in documents
- Y: text categories: guns, mideast, misc, …

Compute P(y|x); select the y with the highest P(y|x) as the classification.
Maximum Entropy Model

Compute P(y|x): select the distribution that maximizes entropy subject to the constraints in the training data:

p* = argmax_p H(p) s.t. E_p(f_j) = E_p̃(f_j) for each feature f_j
Modeling Components

- Feature functions
- Computing expectations
- Calculating P(y|x) and P(x,y)
Feature Functions
A feature function is a binary-valued indicator function. In text classification, j refers to a specific (feature, class) pair, s.t. the feature is present when y is that class. For example:

f_j(x,y) = { 1 if y = "guns" and x includes "rifle"
           { 0 otherwise
Feature functions can be complex. For example, features could be:

- class = y and a conjunction of features, e.g. class = "guns" and currwd = "rifles" and prevwd = "allow"
- class = y and a disjunction of features, e.g. class = "guns" and (currwd = "rifles" or currwd = "guns"), and so on

Many are simple. Feature selection will be an issue.
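As a sketch, such feature functions might look like the following in code; the words and classes mirror the examples above:

```python
# Each feature function f_j(x, y) is binary: it fires only for one
# (feature, class) pair. Here x is a dict with the current and previous word.
def f1(x, y):
    # Conjunction: class="guns" AND currwd="rifles" AND prevwd="allow"
    return 1 if (y == "guns" and x["currwd"] == "rifles"
                 and x["prevwd"] == "allow") else 0

def f2(x, y):
    # Disjunction: class="guns" AND (currwd="rifles" OR currwd="guns")
    return 1 if (y == "guns" and x["currwd"] in ("rifles", "guns")) else 0

x = {"currwd": "rifles", "prevwd": "allow"}
print(f1(x, "guns"), f2(x, "guns"), f1(x, "politics"))  # -> 1 1 0
```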
Key Points

Feature functions:
- are binary
- correspond to a (feature, class) pair

MaxEnt training learns the weights associated with feature functions.
Computing Expectations
1: Consider a coin-flipping experiment: heads, win 100; tails, lose 50. What is the expected yield?

P(X=H) * 100 + P(X=T) * (-50)

2: Consider the more general case, where outcome_i has yield v_i. What is the expected yield?

Σ_i P(X=i) * v_i
Calculating Expectations

Let P(X=i) be a distribution of a random variable X, and let f(x) be a function of x. Then E_p(f), the expectation of f(x) based on P(x), is:

E_p(f) = Σ_x P(x) f(x)
Empirical Expectation

The empirical distribution is denoted p̃.

Example: toss a coin 4 times. Result: H, T, H, H. Average return: (100 - 50 + 100 + 100)/4 = 62.5

Empirical distribution: p̃(X=H) = 3/4, p̃(X=T) = 1/4

Empirical expectation: 3/4 * 100 + 1/4 * (-50) = 62.5

(Example due to F. Xia)
Model Expectation

Model: p(x). Assume a fair coin: P(X=H) = 1/2, P(X=T) = 1/2.

Model expectation: 1/2 * 100 + 1/2 * (-50) = 25
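A minimal sketch contrasting the two expectations on the coin example above (values from the slides):

```python
from collections import Counter

yields = {"H": 100, "T": -50}
tosses = ["H", "T", "H", "H"]

# Empirical expectation: weight each outcome by its observed frequency p~(x).
counts = Counter(tosses)
p_tilde = {x: c / len(tosses) for x, c in counts.items()}
emp = sum(p_tilde[x] * yields[x] for x in p_tilde)

# Model expectation: weight each outcome by the model probability p(x).
p_model = {"H": 0.5, "T": 0.5}  # fair-coin model
mod = sum(p_model[x] * yields[x] for x in p_model)

print(emp, mod)  # -> 62.5 25.0
```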
Empirical Expectation
Example

Training data:

x1: c1; t1 t2 t3
x2: c2; t1 t4
x3: c1; t3 t4
x4: c3; t1 t3

Raw counts:

|    | t1 | t2 | t3 | t4 |
|----|----|----|----|----|
| c1 | 1  | 1  | 2  | 1  |
| c2 | 1  | 0  | 0  | 1  |
| c3 | 1  | 0  | 1  | 0  |

(Example due to F. Xia)
Empirical distribution (divide the raw counts by N = 4 training instances):

|    | t1  | t2  | t3  | t4  |
|----|-----|-----|-----|-----|
| c1 | 1/4 | 1/4 | 2/4 | 1/4 |
| c2 | 1/4 | 0   | 0   | 1/4 |
| c3 | 1/4 | 0   | 1/4 | 0   |
![Page 98: Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 31, 2012](https://reader036.vdocuments.site/reader036/viewer/2022062515/56649cf05503460f949bfc83/html5/thumbnails/98.jpg)