Conditional Random Fields
William W. Cohen
CALD
Announcements
• Upcoming assignments:
– Today: Sha & Pereira, Lafferty et al
– Mon 2/23: Klein & Manning, Toutanova et al
– Wed 2/25: no writeup due
– Mon 3/1: no writeup due
– Wed 3/3: project proposal due: personnel + 1-2 pages
– Spring break week, no class
Review: motivation for CMMs
Ideally we would like to use many, arbitrary, overlapping features of words.
[HMM diagram: states S_{t-1}, S_t, S_{t+1} emitting observations O_{t-1}, O_t, O_{t+1}]

Features we would like to use: identity of word; ends in "-ski"; is capitalized; is part of a noun phrase; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in hyperlink anchor; …

[Callouts on the example word "Wisniewski": is "Wisniewski", ends in "-ski", part of noun phrase]
Motivation for CMMs
[Same HMM diagram and feature list as on the previous slide]
Idea: replace generative model in HMM with a maxent model, where state depends on observations and previous state
$\Pr(s_t \mid x_t, s_{t-1}, \ldots)$
Implications of the model
• Does this do what we want?
• Q: does Y[i-1] depend on X[i+1]?
– "a node is conditionally independent of its non-descendants given its parents"
Label Bias Problem
• Consider this MEMM:
• P(1 and 2 | ro) = P(2 | 1 and ro) P(1 | ro) = P(2 | 1 and o) P(1 | r)
• P(1 and 2 | ri) = P(2 | 1 and ri) P(1 | ri) = P(2 | 1 and i) P(1 | r)
• In the training data, label value 2 is the only label value observed after label value 1. Therefore P(2 | 1) = 1, so P(2 | 1 and x) = 1 for all x.
• Since P(2 | 1 and x) = 1 for all x, P(1 and 2 | ro) = P(1 and 2 | ri).
• However, we expect P(1 and 2 | ri) to be greater than P(1 and 2 | ro).
• Per-state normalization does not allow the required expectation.
Label Bias Problem
• Consider this MEMM, and enough training data to perfectly model it:
Pr(0123 | rob) = Pr(1 | 0, r) · Pr(2 | 1, o) · Pr(3 | 2, b) = 0.5 · 1 · 1 = 0.5
Pr(0453 | rib) = Pr(4 | 0, r) · Pr(5 | 4, i) · Pr(3 | 5, b) = 0.5 · 1 · 1 = 0.5
But likewise Pr(0123 | rib) = 0.5 and Pr(0453 | rob) = 0.5: after the first step every local distribution is forced to 1, so the model ranks both paths identically no matter which word it sees.
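To make this concrete, here is a minimal Python sketch (the transition tables are read directly off the example above; the state/observation encoding is mine) confirming that per-state normalization makes the two words indistinguishable:

```python
# Locally normalized transition tables read off the MEMM above.
# P[(prev_state, observation)] = {next_state: probability}
P = {
    (0, 'r'): {1: 0.5, 4: 0.5},              # 'r' is ambiguous at the start
    (1, 'o'): {2: 1.0}, (1, 'i'): {2: 1.0},  # state 1 has only one successor,
    (4, 'o'): {5: 1.0}, (4, 'i'): {5: 1.0},  # so does state 4: prob forced to 1
    (2, 'b'): {3: 1.0}, (5, 'b'): {3: 1.0},
}

def path_prob(path, word):
    """Probability of a state path under per-state (local) normalization."""
    prob = 1.0
    for prev, nxt, obs in zip(path, path[1:], word):
        prob *= P[(prev, obs)].get(nxt, 0.0)
    return prob

for word in ['rob', 'rib']:
    print(word, path_prob([0, 1, 2, 3], word), path_prob([0, 4, 5, 3], word))
# Both paths score 0.5 for both words: the middle observation is ignored.
```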
How important is label bias?
• Could be avoided in this case by changing structure:
• Our models are always wrong – is this “wrongness” a problem?
• See Klein & Manning’s paper for next week….
Another view of label bias [Sha & Pereira]
So what’s the alternative?
Review of maxent
$\Pr(x) \propto \exp\big(\sum_i \lambda_i f_i(x)\big)$

$\Pr(x, y) \propto \exp\big(\sum_i \lambda_i f_i(x, y)\big)$

$\Pr(y \mid x) = \dfrac{\exp\big(\sum_i \lambda_i f_i(x, y)\big)}{\sum_{y'} \exp\big(\sum_i \lambda_i f_i(x, y')\big)}$
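As a quick illustration of the conditional form, a minimal Python sketch; the feature functions and weights here are invented purely for illustration:

```python
import math

def maxent_prob(x, y, labels, features, weights):
    """Pr(y|x) = exp(sum_i w_i f_i(x,y)) / sum_y' exp(sum_i w_i f_i(x,y'))."""
    def score(cand):
        return sum(w * f(x, cand) for w, f in zip(weights, features))
    z = sum(math.exp(score(cand)) for cand in labels)  # normalizer over labels
    return math.exp(score(y)) / z

# Hypothetical example: two binary features over (word, label) pairs.
features = [
    lambda x, y: 1.0 if x.endswith('ski') and y == 'name' else 0.0,
    lambda x, y: 1.0 if x[0].isupper() and y == 'name' else 0.0,
]
weights = [1.5, 0.8]
print(maxent_prob('Wisniewski', 'name', ['name', 'other'], features, weights))
```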
Review of maxent/MEMM/CMMs
Maxent:

$\Pr(y \mid x) = \dfrac{\exp\big(\sum_i \lambda_i f_i(x, y)\big)}{\sum_{y'} \exp\big(\sum_i \lambda_i f_i(x, y')\big)} = \dfrac{\exp\big(\sum_i \lambda_i f_i(x, y)\big)}{Z(x)}$

For MEMM:

$\Pr(y_1 \ldots y_n \mid x_1 \ldots x_n) = \prod_j \Pr(y_j \mid x_j, y_{j-1}) = \prod_j \dfrac{\exp\big(\sum_i \lambda_i f_i(x_j, y_j, y_{j-1})\big)}{Z(x_j)}$
Details on CMMs
$\Pr(y_1 \ldots y_n \mid x_1 \ldots x_n) = \prod_j \dfrac{\exp\big(\sum_i \lambda_i f_i(x_j, y_j, y_{j-1})\big)}{Z(x_j)}$

$= \dfrac{\exp\big(\sum_i \lambda_i \sum_j f_i(x_j, y_j, y_{j-1})\big)}{\prod_j Z(x_j)} = \dfrac{\exp\big(\sum_i \lambda_i F_i(x, y)\big)}{Z(x)}$

where $F_i(x, y) = \sum_j f_i(x_j, y_j, y_{j-1})$ and $Z(x) = \prod_j Z(x_j)$.
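Following that algebra, a minimal sketch of a locally normalized CMM/MEMM sequence probability; feature functions and weights are placeholders with the same shape as in the maxent sketch above:

```python
import math

def local_prob(x_j, y_j, y_prev, labels, features, weights):
    """One locally normalized factor: exp(sum_i w_i f_i(x_j,y_j,y_prev)) / Z(x_j)."""
    def score(y):
        return sum(w * f(x_j, y, y_prev) for w, f in zip(weights, features))
    z = sum(math.exp(score(y)) for y in labels)  # Z(x_j): per-position normalizer
    return math.exp(score(y_j)) / z

def memm_prob(xs, ys, labels, features, weights, start='START'):
    """Pr(y_1..y_n | x_1..x_n) as a product of local factors."""
    prob, y_prev = 1.0, start
    for x_j, y_j in zip(xs, ys):
        prob *= local_prob(x_j, y_j, y_prev, labels, features, weights)
        y_prev = y_j
    return prob
```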
From CMMs to CRFs
$\Pr(y \mid x) = \prod_j \dfrac{\exp\big(\sum_i \lambda_i f_i(x_j, y_j, y_{j-1})\big)}{Z(x_j)} = \dfrac{\exp\big(\sum_i \lambda_i F_i(x, y)\big)}{Z(x)}$

where $F_i(x, y) = \sum_j f_i(x_j, y_j, y_{j-1})$.
Recall why we’re unhappy: we don’t want local normalization
$\Pr(y \mid x) = \dfrac{\exp\big(\sum_i \lambda_i F_i(x, y)\big)}{Z(x)}$

Written this way the CMM looks global, but its $Z(x) = \prod_j Z(x_j)$ is still built from per-position normalizers.
New model
What’s the new model look like?
$\Pr(y \mid x) = \dfrac{\exp\big(\sum_i \lambda_i F_i(x, y)\big)}{Z(x)} = \dfrac{\exp\big(\sum_i \lambda_i \sum_j f_i(x_j, y_j, y_{j-1})\big)}{Z(x)}$
[Diagram: observations x1, x2, x3; labels y1, y2, y3]
What’s independent?
What’s the new model look like?
$\Pr(y \mid x) = \dfrac{\exp\big(\sum_i \lambda_i F_i(x, y)\big)}{Z(x)} = \dfrac{\exp\big(\sum_i \lambda_i \sum_j f_i(x, y_j, y_{j-1})\big)}{Z(x)}$

Note that each $f_i$ may now see the whole observation sequence $x$.
[Diagram: a single observation node x; labels y1, y2, y3]
What’s independent now??
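Only the label chain carries dependence now, since every feature may inspect all of x. To see what the single global normalizer costs, here is a brute-force Python sketch (placeholder feature signature; it enumerates all |Y|^n label sequences, which is exactly what the matrix trick later in these slides avoids):

```python
import math
from itertools import product

def score(xs, ys, features, weights, start='START'):
    """sum_i w_i * F_i(x, y), where F_i(x, y) = sum_j f_i(x, y_j, y_{j-1})."""
    total, y_prev = 0.0, start
    for j, y_j in enumerate(ys):
        total += sum(w * f(xs, j, y_j, y_prev) for w, f in zip(weights, features))
        y_prev = y_j
    return total

def crf_prob(xs, ys, labels, features, weights):
    """Pr(y | x) = exp(score(x, y)) / Z(x), with Z(x) enumerated exhaustively."""
    z = sum(math.exp(score(xs, list(cand), features, weights))
            for cand in product(labels, repeat=len(xs)))
    return math.exp(score(xs, ys, features, weights)) / z

# Hypothetical feature: current label is 'name' and the current token is capitalized.
features = [lambda xs, j, y, y_prev: 1.0 if y == 'name' and xs[j][0].isupper() else 0.0]
print(crf_prob(['Mr', 'Wisniewski'], ['name', 'name'], ['name', 'other'], features, [2.0]))
```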
Hammersley-Clifford
• For positive distributions P(x1, …, xn):
– Pr(xi | x1, …, xi-1, xi+1, …, xn) = Pr(xi | Neighbors(xi))
– Pr(A | B, S) = Pr(A | S), where A and B are sets of nodes and S is a set that separates A and B
– P can be written as a normalized product of "clique potentials"
$\Pr(x) = \dfrac{1}{Z} \prod_{C \in \text{cliques}} \phi_C(x_C)$
So this is very general: any Markov distribution can be written in this form (modulo nits like “positive distribution”)
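A tiny numeric illustration of the theorem on a three-node chain, with made-up edge potentials; it also checks the Markov property that x3 is independent of x1 given x2:

```python
import numpy as np

# Chain x1 - x2 - x3 over binary variables; the cliques are the two edges.
phi12 = np.array([[2.0, 1.0], [1.0, 3.0]])  # made-up potential on (x1, x2)
phi23 = np.array([[1.0, 2.0], [4.0, 1.0]])  # made-up potential on (x2, x3)

# Joint = normalized product of clique potentials: Pr(x) = (1/Z) prod_C phi_C(x_C)
joint = np.einsum('ab,bc->abc', phi12, phi23)
joint /= joint.sum()  # divide by Z

# Markov check: Pr(x3 | x1, x2) should not depend on x1.
cond = joint / joint.sum(axis=2, keepdims=True)
print(np.allclose(cond[0], cond[1]))  # True
```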
Definition of CRFs
X is a random variable over data sequences to be labeled
Y is a random variable over corresponding label sequences
Example of CRFs
Graphical comparison among HMMs, MEMMs and CRFs
[Diagrams: graphical structures of HMM, MEMM, and CRF]
Lafferty et al notation
– x is a data sequence
– y is a label sequence
– v is a vertex from vertex set V = set of label random variables
– e is an edge from edge set E over V
– $f_k$ and $g_k$ are given and fixed: $g_k$ is a Boolean vertex feature, $f_k$ is a Boolean edge feature
– k is the number of features
– $\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_n;\ \mu_1, \mu_2, \ldots, \mu_n)$; the $\lambda_k$ and $\mu_k$ are parameters to be estimated
– y|e is the set of components of y defined by edge e
– y|v is the set of components of y defined by vertex v
If the graph G = (V, E) of Y is a tree, the conditional distribution over the label sequence Y = y, given X = x, is given by the fundamental theorem of random fields as:
$p_\theta(y \mid x) \propto \exp\Big(\sum_{e \in E,\, k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V,\, k} \mu_k g_k(v, y|_v, x)\Big)$
Conditional Distribution (cont’d)
• CRFs use the observation-dependent normalization Z(x) for the conditional distributions:
Z(x) is a normalization over the data sequence x
$p_\theta(y \mid x) = \dfrac{1}{Z(x)} \exp\Big(\sum_{e \in E,\, k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V,\, k} \mu_k g_k(v, y|_v, x)\Big)$
• Learning:
– Lafferty et al's IIS-based method is rather inefficient.
– Gradient-based methods are faster.
– The trickiest bit is computing the normalization Z(x), which is a sum over exponentially many label vectors y.
CRF learning – from Sha & Pereira
Something like forward-backward
Idea:
• Define a matrix of y, y' "affinities" at stage i
• $M_i[y, y']$ = "unnormalized probability" of a transition from y to y' at stage i
• $M_i M_{i+1}$ = "unnormalized probability" of any path through stages i and i+1
[Diagram: observation x; label nodes y1, y2, y3]
Forward backward ideas
[Lattice: three stages; at each stage the label is "name" or "nonName". Edge weights a, b, c, d connect stage 1 to stage 2, and e, f, g, h connect stage 2 to stage 3.]

$M_1 = \begin{pmatrix} a & b \\ c & d \end{pmatrix}, \qquad M_2 = \begin{pmatrix} e & f \\ g & h \end{pmatrix}$

$M_1 M_2 = \begin{pmatrix} ae + bg & af + bh \\ ce + dg & cf + dh \end{pmatrix}$

Each entry of $M_1 M_2$ sums the weights of all two-step paths between a pair of labels, e.g. the (name, name) entry is $ae + bg$.
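A small numpy check of that identity, with arbitrary positive edge weights standing in for a–h (ignoring designated start/stop states for simplicity):

```python
import numpy as np

# M_i[y, y'] = unnormalized weight of a y -> y' transition at stage i.
M1 = np.array([[1.0, 2.0],
               [3.0, 4.0]])
M2 = np.array([[5.0, 6.0],
               [7.0, 8.0]])

prod = M1 @ M2
# The (name, name) entry sums both two-step paths: a*e + b*g.
assert prod[0, 0] == M1[0, 0] * M2[0, 0] + M1[0, 1] * M2[1, 0]

# Summing the full product over start and end labels gives Z(x)
# in O(n * |Y|^2) time instead of enumerating |Y|^n paths.
Z = prod.sum()
print(Z)
```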
CRF learning – from Sha & Pereira
Sha & Pereira results
CRF beats MEMM (McNemar’s test); MEMM probably beats voted perceptron
Sha & Pereira results
[Table: training times in minutes, 375k examples]
POS tagging Experiments in Lafferty et al
• Compared HMMs, MEMMs, and CRFs on Penn Treebank POS tagging
• Each word in a given input sentence must be labeled with one of 45 syntactic tags
• Add a small set of orthographic features (sketched below): whether a spelling begins with a number or upper-case letter, whether it contains a hyphen, and whether it contains one of the following suffixes: -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies
• oov = out-of-vocabulary (not observed in the training set)
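A minimal sketch of that orthographic feature set; the function name and the dictionary return format are my own:

```python
SUFFIXES = ('-ing', '-ogy', '-ed', '-s', '-ly', '-ion', '-tion', '-ity', '-ies')

def orthographic_features(word):
    """Boolean orthographic features, as listed in Lafferty et al."""
    feats = {
        'starts_with_number': word[0].isdigit(),
        'starts_with_upper': word[0].isupper(),
        'contains_hyphen': '-' in word,
    }
    for suf in SUFFIXES:
        feats['suffix' + suf] = word.endswith(suf[1:])  # drop the leading '-'
    return feats

print(orthographic_features('Selling'))
```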
POS tagging vs MXPost