Maximum Entropy Model


Page 1: Maximum Entropy Model

Maximum Entropy Model

ELN – Natural Language Processing

Slides by: Fei Xia

Page 2: Maximum Entropy Model

History

The concept of Maximum Entropy can be traced back along multiple threads to Biblical times.

Introduced to NLP by Berger et al. (1996).

Used in many NLP tasks: MT, tagging, parsing, PP attachment, language modeling, …

Page 3: Maximum Entropy Model

Outline

Modeling: intuition, basic concepts, …

Parameter training
Feature selection
Case study

Page 4: Maximum Entropy Model

Reference papers

(Ratnaparkhi, 1997)
(Ratnaparkhi, 1996)
(Berger et al., 1996)
(Klein and Manning, 2003)

These papers use different notations.

Page 5: Maximum Entropy Model

Modeling

Page 6: Maximum Entropy Model

The basic idea

Goal: estimate p.
Choose the p with maximum entropy (or "uncertainty") subject to the constraints (or "evidence"):

  H(p) = − Σ_{x ∈ A×B} p(x) log p(x)

  where x = (a, b), a ∈ A, b ∈ B

Page 7: Maximum Entropy Model

Setting

From training data, collect (a, b) pairs:
 a: the thing to be predicted (e.g., a class in a classification problem)
 b: the context
 Ex: POS tagging:
  • a = NN
  • b = the words in a window and the previous two tags

Learn the probability of each (a, b): p(a, b)

Page 8: Maximum Entropy Model

Features in POS tagging (Ratnaparkhi, 1996)

Each feature pairs a condition on the context (a.k.a. history) with one of the allowable classes (tags):

Condition        Features
w_i is not rare  w_i = X & t_i = T
w_i is rare      X is a prefix of w_i, |X| ≤ 4 & t_i = T
                 X is a suffix of w_i, |X| ≤ 4 & t_i = T
                 w_i contains a number & t_i = T
                 w_i contains an uppercase character & t_i = T
                 w_i contains a hyphen & t_i = T
∀ w_i            t_{i−1} = X & t_i = T
                 t_{i−2} t_{i−1} = X Y & t_i = T
                 w_{i−1} = X & t_i = T
                 w_{i−2} = X & t_i = T
                 w_{i+1} = X & t_i = T
                 w_{i+2} = X & t_i = T

Page 9: Maximum Entropy Model

Maximum Entropy

Why maximum entropy?
Maximize entropy = minimize commitment.

Model all that is known and assume nothing about what is unknown:
 Model all that is known: satisfy a set of constraints that must hold
 Assume nothing about what is unknown: choose the most "uniform" distribution, i.e., the one with maximum entropy

Page 10: Maximum Entropy Model

(Maximum) Entropy

Entropy: the uncertainty of a distribution.

Quantifying uncertainty ("surprise"): an event x with probability p_x carries "surprise" log(1/p_x).

Entropy = expected surprise (over p):

  H(p) = E_p [ log(1/p_x) ] = − Σ_x p_x log p_x

[Figure: H as a function of p(HEADS) for a coin flip]

Page 11: Maximum Entropy Model

Ex1: Coin-flip example (Klein & Manning, 2003)

Toss a coin: p(H) = p1, p(T) = p2
Constraint: p1 + p2 = 1
Question: what is your estimate of p = (p1, p2)?
Answer: choose the p that maximizes H(p)

  H(p) = − Σ_x p(x) log p(x)

[Figure: the entropy curve H as a function of p1, with a point marked at p1 = 0.3]
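With only the sum-to-one constraint, the maximizer is the fair coin. A minimal sketch (not from the slides; assumes numpy) that scans p1 and confirms the maximum sits at p1 = 0.5:

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_x p(x) log p(x), ignoring zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

# Coin flip: p = (p1, 1 - p1). Scan p1 and report the entropy-maximizing value.
grid = np.linspace(0.001, 0.999, 999)
entropies = [entropy([p1, 1.0 - p1]) for p1 in grid]
print(f"H is maximized at p1 = {grid[int(np.argmax(entropies))]:.3f}")  # ~0.5
```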

Page 12: Maximum Entropy Model

Convexity

The entropy H(p) = − Σ_x p(x) log p(x) is concave:
 − x log x is concave, and a sum of concave functions is concave
The feasible region defined by the constraints is a linear subspace (which is convex)
Maximizing the constrained entropy is therefore a convex optimization problem: there are no spurious local optima

The maximum likelihood exponential model (dual) formulation is likewise a convex problem

Page 13: Maximum Entropy Model

Ex2: An MT example (Berger et al., 1996)

Possible translations of the word "in":

{dans, en, à, au cours de, pendant}

Constraint:

p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1

Intuitive answer:
p(dans) = 1/5
p(en) = 1/5
p(à) = 1/5
p(au cours de) = 1/5
p(pendant) = 1/5

Page 14: Maximum Entropy Model

An MT example (cont)

Constraints:
p(dans) + p(en) = 1/5
p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1

Intuitive answer:
p(dans) = 1/10
p(en) = 1/10
p(à) = 8/30
p(au cours de) = 8/30
p(pendant) = 8/30

Page 15: Maximum Entropy Model

An MT example (cont)

Constraints:
p(dans) + p(en) = 1/5
p(dans) + p(à) = 1/2
p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1

Intuitive answer:
p(dans) = ?
p(en) = ?
p(à) = ?
p(au cours de) = ?
p(pendant) = ?
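There is no longer an obvious "intuitive" split, which is exactly why a principled criterion is needed. As a rough illustration (not part of the original slides; assumes scipy), this small instance can be solved numerically by minimizing −H(p) under the stated constraints:

```python
import numpy as np
from scipy.optimize import minimize

words = ["dans", "en", "à", "au cours de", "pendant"]

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)          # guard against log(0)
    return float(np.sum(p * np.log(p)))

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},       # probabilities sum to 1
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 0.2},   # p(dans) + p(en) = 1/5
    {"type": "eq", "fun": lambda p: p[0] + p[2] - 0.5},   # p(dans) + p(à) = 1/2
]
result = minimize(neg_entropy, x0=np.full(5, 0.2),
                  bounds=[(0.0, 1.0)] * 5, constraints=constraints)
for w, prob in zip(words, result.x):
    print(f"p({w}) = {prob:.4f}")
```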

Page 16: Maximum Entropy Model

Ex3: POS tagging (Klein and Manning, 2003)

Let's say we have the following event space:

  NN    NNS   NNP   NNPS   VBZ   VBD

… and the following empirical data (counts out of 36):

  3     5     11    13     3     1

Maximize H: with no constraints at all, each cell gets the unnormalized value 1/e:

  1/e   1/e   1/e   1/e    1/e   1/e

… but we want probabilities: adding the constraint E[NN, NNS, NNP, NNPS, VBZ, VBD] = 1 gives the uniform distribution:

  1/6   1/6   1/6   1/6    1/6   1/6

Page 17: Maximum Entropy Model

Ex3 (cont)

Too uniform!

N* tags are more common than V* tags, so we add the feature f_N = {NN, NNS, NNP, NNPS}, with E[f_N] = 32/36:

  NN    NNS   NNP    NNPS   VBZ   VBD
  8/36  8/36  8/36   8/36   2/36  2/36

… and proper nouns are more frequent than common nouns, so we add f_P = {NNP, NNPS}, with E[f_P] = 24/36:

  NN    NNS   NNP    NNPS   VBZ   VBD
  4/36  4/36  12/36  12/36  2/36  2/36

… we could keep refining the model, e.g., by adding a feature to distinguish singular vs. plural nouns, or verb types.

Page 18: Maximum Entropy Model

Ex4: overlapping features (Klein and Manning, 2003)

Maxent models handle overlapping features.
Unlike a Naive Bayes model, there is no double counting!
But they do not automatically model feature interactions.

Page 19: Maximum Entropy Model

Modeling the problem

Objective function: H(p)

Goal: among all the distributions that satisfy the constraints, choose the one, p*, that maximizes H(p):

  p* = argmax_{p ∈ P} H(p)

Question: how do we represent the constraints?

Page 20: Maximum Entropy Model

Features

A feature (a.k.a. feature function, indicator function) is a binary-valued function on events:

  f_j : ε → {0, 1},  ε = A × B

A: the set of possible classes (e.g., tags in POS tagging)
B: the space of contexts (e.g., neighboring words/tags in POS tagging)

Example:

  f_j(a, b) = 1  if a = DET and curWord(b) = "that"
            = 0  otherwise
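A minimal sketch of such an indicator feature in code (illustrative only; the context b is assumed here to be a dict carrying the current word):

```python
def f_det_that(a: str, b: dict) -> int:
    """1 if the predicted class is DET and the current word is "that", else 0."""
    return 1 if a == "DET" and b.get("cur_word") == "that" else 0

print(f_det_that("DET", {"cur_word": "that"}))  # 1
print(f_det_that("NN",  {"cur_word": "that"}))  # 0
```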

Page 21: Maximum Entropy Model

Some notations

  S           finite training sample of events
  p̃(x)        observed probability of x in S
  p(x)        the model p's probability of x
  f_j         the jth feature

  E_p̃ f_j     observed expectation of f_j (the empirical count of f_j):
              E_p̃ f_j = Σ_x p̃(x) f_j(x)

  E_p f_j     model expectation of f_j:
              E_p f_j = Σ_x p(x) f_j(x)

Page 22: Maximum Entropy Model

Constraints

Model feature expectation = observed feature expectation:

  E_p f_j = E_p̃ f_j

How do we calculate E_p̃ f_j?

  f_j(a, b) = 1  if a = DET and curWord(b) = "that"
            = 0  otherwise

  E_p̃ f_j = Σ_x p̃(x) f_j(x) = (1/N) Σ_{i=1}^{N} f_j(x_i)
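In code, the observed expectation is just the fraction of training events on which the feature fires. A small sketch (illustrative names and event representation, not from the slides):

```python
from typing import Callable, Iterable, Tuple

def empirical_expectation(events: Iterable[Tuple[str, dict]],
                          f: Callable[[str, dict], int]) -> float:
    """E_p~[f] = (1/N) * sum_i f(a_i, b_i) over the training sample."""
    events = list(events)
    return sum(f(a, b) for a, b in events) / len(events)

f_det_that = lambda a, b: 1 if a == "DET" and b.get("cur_word") == "that" else 0
sample = [("DET", {"cur_word": "that"}), ("NN", {"cur_word": "dog"}),
          ("DET", {"cur_word": "the"}),  ("DET", {"cur_word": "that"})]
print(empirical_expectation(sample, f_det_that))  # 0.5
```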

Page 23: Maximum Entropy Model

Training data → observed events

Page 24: Maximum Entropy Model

Restating the problem

The task: find p* s.t.

  p* = argmax_{p ∈ P} H(p)

where

  P = { p | E_p f_j = E_p̃ f_j, j ∈ {1, …, k} }

Objective function: −H(p) (to be minimized)

Constraints:

  E_p f_j = E_p̃ f_j = d_j,  j = 1, …, k
  Σ_x p(x) = 1

Add a feature f_0 with f_0(a, b) = 1 for all (a, b); the normalization constraint then takes the same form: E_p f_0 = E_p̃ f_0 = 1.

Page 25: Maximum Entropy Model

Questions

Is P empty?
Does p* exist?
Is p* unique?
What is the form of p*?
How do we find p*?

Page 26: Maximum Entropy Model

What is the form of p*? (Ratnaparkhi, 1997)

  P = { p | E_p f_j = E_p̃ f_j, j ∈ {1, …, k} }

  Q = { p | p(x) = (1/Z) Π_{j=1}^{k} α_j^{f_j(x)},  α_j > 0 }

Theorem: if p* ∈ P ∩ Q, then p* = argmax_{p ∈ P} H(p). Furthermore, p* is unique.

Page 27: Maximum Entropy Model

Using Lagrangian multipliers

Minimize A(p):

  A(p) = −H(p) − Σ_{j=0}^{k} λ_j (E_p f_j − d_j)
       = Σ_x p(x) log p(x) − Σ_{j=0}^{k} λ_j ( Σ_x p(x) f_j(x) − d_j )

Set the derivative to 0:

  A'(p) = ∂A/∂p(x) = log p(x) + 1 − Σ_{j=0}^{k} λ_j f_j(x) = 0

  ⇒ log p(x) = Σ_{j=0}^{k} λ_j f_j(x) − 1

  ⇒ p(x) = exp( Σ_{j=0}^{k} λ_j f_j(x) ) · e^{−1}

  ⇒ p(x) = exp( Σ_{j=1}^{k} λ_j f_j(x) ) / Z,   where Z = e^{1 − λ_0}  (using f_0(x) = 1)

Page 28: Maximum Entropy Model

Two equivalent forms

  p(x) = (1/Z) Π_{j=1}^{k} α_j^{f_j(x)}

  p(x) = (1/Z) exp( Σ_{j=1}^{k} λ_j f_j(x) ),   with λ_j = ln α_j
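A small sketch of the second (log-linear) form for a finite, enumerable event space; names and the event representation are illustrative, not from the slides:

```python
import math
from typing import Callable, List, Sequence

def loglinear_prob(x, events: Sequence, features: List[Callable],
                   lambdas: List[float]) -> float:
    """p(x) = exp(sum_j lambda_j * f_j(x)) / Z, with Z summing over all events."""
    def score(e):
        return math.exp(sum(lam * f(e) for lam, f in zip(lambdas, features)))
    z = sum(score(e) for e in events)        # the normalizer Z
    return score(x) / z

# Toy usage: events are tags; a single feature fires on "NN".
tags = ["NN", "VB", "DT"]
f_nn = lambda t: 1 if t == "NN" else 0
print(loglinear_prob("NN", tags, [f_nn], [1.0]))  # > 1/3, since the NN weight is positive
```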

Page 29: Maximum Entropy Model

Relation to Maximum Likelihood

The log-likelihood of the empirical distribution p̃ as predicted by a model q is defined as

  L(q) = Σ_x p̃(x) log q(x)

Theorem: if p* ∈ P ∩ Q, then p* = argmax_{q ∈ Q} L(q). Furthermore, p* is unique.

Page 30: Maximum Entropy Model

Summary (so far)

Goal: find p* in P, which maximizes H(p):

  P = { p | E_p f_j = E_p̃ f_j, j = 1, …, k }

It can be proved that, when p* exists, it is unique.

The model p* in P with maximum entropy is also the model in Q that maximizes the likelihood of the training sample p̃:

  Q = { p | p(x) = (1/Z) Π_{j=1}^{k} α_j^{f_j(x)},  α_j > 0 }

Page 31: Maximum Entropy Model

Summary (cont)

Adding constraints (features) (Klein and Manning, 2003):
 Lowers the maximum entropy
 Raises the maximum likelihood of the data
 Brings the distribution further from uniform
 Brings the distribution closer to the data

Page 32: Maximum Entropy Model

Parameter estimation

Page 33: Maximum Entropy Model

Algorithms

Generalized Iterative Scaling (GIS): (Darroch and Ratcliff, 1972)

Improved Iterative Scaling (IIS): (Della Pietra et al., 1995)

Page 34: Maximum Entropy Model

GIS: setup

Requirements for running GIS:

Obey the form of the model and the constraints:

  p(x) = (1/Z) exp( Σ_{j=1}^{k} λ_j f_j(x) )
  E_p f_j = d_j

An additional constraint: for all x,

  Σ_{j=1}^{k} f_j(x) = C

If this does not hold, set C = max_x Σ_{j=1}^{k} f_j(x) and add a new feature f_{k+1}:

  f_{k+1}(x) = C − Σ_{j=1}^{k} f_j(x)

Page 35: Maximum Entropy Model

GIS algorithm

Compute d_j, j = 1, …, k+1

Initialize λ_j^(1) (to any values, e.g., 0)

Repeat until convergence:
 for each j:
  • compute the model expectation under the current parameters

     E_{p^(n)} f_j = Σ_x p^(n)(x) f_j(x)

    where

     p^(n)(x) = (1/Z) exp( Σ_{j=1}^{k+1} λ_j^(n) f_j(x) )

  • update

     λ_j^(n+1) = λ_j^(n) + (1/C) log( d_j / E_{p^(n)} f_j )
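A compact sketch of this loop for a small, fully enumerable event space (names are illustrative; a real implementation would add the slack feature f_{k+1} so that every event activates exactly C features):

```python
import math
from typing import Callable, List, Sequence

def gis(events: Sequence, p_tilde: List[float],
        features: List[Callable], iterations: int = 100) -> List[float]:
    """Generalized Iterative Scaling over an enumerable event space."""
    C = max(sum(f(x) for f in features) for x in events)
    d = [sum(pt * f(x) for pt, x in zip(p_tilde, events)) for f in features]  # observed E~[f_j]
    lam = [0.0] * len(features)

    for _ in range(iterations):
        # model distribution under the current weights
        scores = [math.exp(sum(l * f(x) for l, f in zip(lam, features))) for x in events]
        z = sum(scores)
        p = [s / z for s in scores]
        # model expectations and the GIS update
        for j, f in enumerate(features):
            e_model = sum(px * f(x) for px, x in zip(p, events))
            if d[j] > 0 and e_model > 0:
                lam[j] += (1.0 / C) * math.log(d[j] / e_model)
    return lam
```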

Page 36: Maximum Entropy Model

Approximation for calculating feature expectation

  E_p f_j = Σ_x p(x) f_j(x)
          = Σ_{a ∈ A, b ∈ B} p(a, b) f_j(a, b)
          = Σ_{a ∈ A, b ∈ B} p(b) p(a|b) f_j(a, b)
          ≈ Σ_{a ∈ A, b ∈ B} p̃(b) p(a|b) f_j(a, b)
          = (1/N) Σ_{i=1}^{N} Σ_{a ∈ A} p(a|b_i) f_j(a, b_i)
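A sketch of that approximation in code: sum over the observed training contexts b_i and over all classes a, weighting each term by the model's conditional p(a|b_i). The helper names are assumptions, not from the slides:

```python
from typing import Callable, Sequence

def approx_model_expectation(contexts: Sequence,        # training contexts b_1 .. b_N
                             classes: Sequence,         # the class set A
                             p_a_given_b: Callable,     # model conditional p(a | b)
                             f: Callable) -> float:     # feature function f_j(a, b)
    """(1/N) * sum_i sum_a p(a | b_i) * f_j(a, b_i)"""
    n = len(contexts)
    return sum(p_a_given_b(a, b) * f(a, b)
               for b in contexts for a in classes) / n
```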

Page 37: Maximum Entropy Model

Properties of GIS

L(p^(n+1)) ≥ L(p^(n))
The sequence is guaranteed to converge to p*.
Convergence can be very slow.

The running time of each iteration is O(NPA):
 N: the training set size
 P: the number of classes
 A: the average number of features that are active for a given event (a, b)

Page 38: Maximum Entropy Model

Feature selection

Page 39: Maximum Entropy Model

Feature selection

Throw in many features and let the machine select the weights:
 Manually specify feature templates
 Problem: too many features

An alternative: a greedy algorithm
 Start with an empty set S
 Add a feature at each iteration

Page 40: Maximum Entropy Model

Notation

With the feature set S: p_S

After adding a feature f: p_{S ∪ {f}}

The gain in the log-likelihood of the training data: ΔL(S, f) = L(p_{S ∪ {f}}) − L(p_S)

Page 41: Maximum Entropy Model

Feature selection algorithm (Berger et al., 1996)

Start with S being empty; thus p_S is uniform.

Repeat until the gain is small enough:
 For each candidate feature f:
  • Compute the model p_{S ∪ {f}} using IIS
  • Calculate the log-likelihood gain
 Choose the feature with maximal gain and add it to S

Problem: too expensive
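A schematic sketch of this greedy loop; train_model and log_likelihood are assumed hooks standing in for the IIS fit and the gain computation, and are not defined in the slides:

```python
def greedy_feature_selection(candidates, train_model, log_likelihood, min_gain=1e-3):
    """Repeatedly add the candidate feature with the largest log-likelihood gain."""
    selected = []                                        # S starts empty -> uniform model
    current_ll = log_likelihood(train_model(selected))
    while candidates:
        # score every remaining candidate by the gain it would yield
        gains = {f: log_likelihood(train_model(selected + [f])) - current_ll
                 for f in candidates}
        best = max(gains, key=gains.get)
        if gains[best] < min_gain:
            break
        selected.append(best)
        candidates = [f for f in candidates if f is not best]
        current_ll += gains[best]
    return selected
```

The expense is visible here: every iteration retrains one model per remaining candidate, which is what motivates the approximation on the next slide.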

Page 42: Maximum Entropy Model

Approximating gains (Berger et al., 1996)

Instead of recalculating all the weights, calculate only the weight of the new feature.

Page 43: Maximum Entropy Model

Training a MaxEnt Model

Scenario #1:
 Define feature templates
 Create the feature set
 Determine the optimal feature weights via GIS or IIS

Scenario #2:
 Define feature templates
 Create the candidate feature set S
 At every iteration, choose the feature from S with maximal gain and determine its weight (or choose the top-n features and their weights)

Page 44: Maximum Entropy Model

Case study

Page 45: Maximum Entropy Model

POS tagging (Ratnaparkhi, 1996)

Notation variation:
 f_j(a, b): a = class, b = context
 f_j(h_i, t_i): h_i = history for the ith word, t_i = tag for the ith word

History:

  h_i = { w_i, w_{i−1}, w_{i−2}, w_{i+1}, w_{i+2}, t_{i−2}, t_{i−1} }

Training data:
 Treat it as a list of (h_i, t_i) pairs
 How many pairs are there?

Page 46: Maximum Entropy Model

Using a MaxEnt Model

Modeling:

Training:
 Define feature templates
 Create the feature set
 Determine the optimal feature weights via GIS or IIS

Decoding:

Page 47: Maximum Entropy Model

Modeling

  P(t_1, …, t_n | w_1, …, w_n) = Π_{i=1}^{n} p(t_i | t_1 … t_{i−1}, w_1 … w_n)
                               ≈ Π_{i=1}^{n} p(t_i | h_i)

  p(t_i | h_i) = p(h_i, t_i) / Σ_{t' ∈ T} p(h_i, t')
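A minimal sketch of the conditional used at decoding time: the joint score of (h, t) from the log-linear model, normalized over the tagset. Names and the history representation are illustrative:

```python
import math
from typing import Callable, List, Sequence

def p_tag_given_history(t: str, h: dict, tagset: Sequence[str],
                        features: List[Callable], lambdas: List[float]) -> float:
    """p(t | h) = score(h, t) / sum_{t'} score(h, t')"""
    def score(tag):
        return math.exp(sum(l * f(h, tag) for l, f in zip(lambdas, features)))
    return score(t) / sum(score(tp) for tp in tagset)
```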

Page 48: Maximum Entropy Model

Training step 1: define feature templates

Each feature pairs a history h_i with a tag t_i:

Condition        Features
w_i is not rare  w_i = X & t_i = T
w_i is rare      X is a prefix of w_i, |X| ≤ 4 & t_i = T
                 X is a suffix of w_i, |X| ≤ 4 & t_i = T
                 w_i contains a number & t_i = T
                 w_i contains an uppercase character & t_i = T
                 w_i contains a hyphen & t_i = T
∀ w_i            t_{i−1} = X & t_i = T
                 t_{i−2} t_{i−1} = X Y & t_i = T
                 w_{i−1} = X & t_i = T
                 w_{i−2} = X & t_i = T
                 w_{i+1} = X & t_i = T
                 w_{i+2} = X & t_i = T

Page 49: Maximum Entropy Model

Step 2: Create the feature set

Collect all the features from the training data
Throw away features that appear fewer than 10 times

Page 50: Maximum Entropy Model

Step 3: determine the feature weights

GIS

Training time:
 Each iteration: O(NTA)
  • N: the training set size
  • T: the number of allowable tags
  • A: the average number of features that are active for a given (h, t)

How many features?

Page 51: Maximum Entropy Model

Decoding: Beam search

Generate tags for w_1, find the top N, and set s_{1j} accordingly, j = 1, 2, …, N
for i = 2 to n (n is the sentence length):
 for j = 1 to N:
  generate tags for w_i, given s_{(i−1)j} as the previous tag context
  append each tag to s_{(i−1)j} to make a new sequence
 find the N highest-probability sequences generated above, and set s_{ij} accordingly, j = 1, …, N

Return the highest-probability sequence s_{n1}.
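A sketch of this decoder; tag_probs is an assumed hook that returns {tag: p(tag | history)} from the trained model given the sentence, the tags chosen so far, and the current position:

```python
from typing import Callable, Dict, List, Sequence, Tuple

def beam_decode(words: Sequence[str],
                tag_probs: Callable[[Sequence[str], Sequence[str], int], Dict[str, float]],
                beam_size: int = 5) -> List[str]:
    """Keep the beam_size most probable partial tag sequences at each position."""
    beam: List[Tuple[List[str], float]] = [([], 1.0)]    # (tag sequence, probability)
    for i in range(len(words)):
        candidates = []
        for tags, prob in beam:
            for tag, p in tag_probs(words, tags, i).items():
                candidates.append((tags + [tag], prob * p))
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beam[0][0]                                    # the best complete sequence
```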

Page 52: Maximum Entropy Model

Beam search

Beam inference:
 At each position, keep the top k complete sequences
 Extend each sequence in each local way
 The extensions compete for the k slots at the next position

Advantages:
 Fast: beam sizes of 3–5 are as good, or almost as good, as exact inference in many cases
 Easy to implement (no dynamic programming required)

Disadvantage:
 Inexact: the globally best sequence can fall off the beam

Page 53: Maximum Entropy Model

Viterbi search

Viterbi inference:
 Dynamic programming or memoization
 Requires a small window of state influence (e.g., only the past two states are relevant)

Advantages:
 Exact: the globally best sequence is returned

Disadvantages:
 Harder to implement long-distance state–state interactions (but beam inference tends not to allow long-distance resurrection of sequences anyway)

Page 54: Maximum Entropy Model

Decoding (cont)

Tags for words:
 Known words: use a tag dictionary
 Unknown words: try all possible tags

Ex: "time flies like an arrow"

Running time: O(NTAB)
 N: sentence length
 B: beam size
 T: tagset size
 A: average number of features that are active for a given event

Page 55: Maximum Entropy Model

Experiment results

Page 56: Maximum Entropy Model

Comparison with other learners

HMM: MaxEnt uses more context

SDT (statistical decision trees): MaxEnt does not split the data

TBL (transformation-based learning): MaxEnt is statistical and provides probability distributions

Page 57: Maximum Entropy Model

MaxEnt Summary

Concept: choose the p* that maximizes entropy while satisfying all the constraints.

Maximum likelihood: p* is also the model, within the exponential model family Q, that maximizes the log-likelihood of the training data.

Training: GIS or IIS, which can be slow.

MaxEnt handles overlapping features well.

In general, MaxEnt achieves good performance on many NLP tasks.