Conditional Random Fields
William W. Cohen
CALD
Announcements
• Upcoming assignments:
– Today: Sha & Pereira, Lafferty et al
– Mon 2/23: Klein & Manning, Toutanova et al
– Wed 2/25: no writeup due
– Mon 3/1: no writeup due
– Wed 3/3: project proposal due: personnel + 1-2 pages
– Spring break week, no class
Review: motivation for CMMs
Ideally we would like to use many, arbitrary, overlapping features of words.
[HMM diagram: states S_{t-1}, S_t, S_{t+1} emitting observations O_{t-1}, O_t, O_{t+1}]

Features we would like to use: identity of word; ends in "-ski"; is capitalized; is part of a noun phrase; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in hyperlink anchor; …

[Callouts on the example word "Wisniewski": is "Wisniewski", ends in "-ski", part of noun phrase]
Motivation for CMMs
[Same HMM diagram and feature list as on the previous slide]
Idea: replace generative model in HMM with a maxent model, where state depends on observations and previous state
$\Pr(s_t \mid x_t, s_{t-1}, \ldots)$
Implications of the model
• Does this do what we want?
• Q: does Y[i-1] depend on X[i+1]?
– "a node is conditionally independent of its non-descendants given its parents"
Label Bias Problem
• Consider this MEMM:
• P(1 and 2 | ro) = P(2 | 1 and ro) P(1 | ro) = P(2 | 1 and o) P(1 | r)
• P(1 and 2 | ri) = P(2 | 1 and ri) P(1 | ri) = P(2 | 1 and i) P(1 | r)
• In the training data, label value 2 is the only label value observed after label value 1. Therefore P(2 | 1) = 1, so P(2 | 1 and x) = 1 for all x.
• Since P(2 | 1 and x) = 1 for all x, P(1 and 2 | ro) = P(1 and 2 | ri).
• However, we expect P(1 and 2 | ri) to be greater than P(1 and 2 | ro).
• Per-state normalization does not allow the required expectation.
Label Bias Problem
• Consider this MEMM, and enough training data to perfectly model it:
Pr(0123 | rob) = Pr(1 | 0, r) · Pr(2 | 1, o) · Pr(3 | 2, b) = 0.5 · 1 · 1 = 0.5
Pr(0453 | rib) = Pr(4 | 0, r) · Pr(5 | 4, i) · Pr(3 | 5, b) = 0.5 · 1 · 1 = 0.5
But likewise Pr(0123 | rib) = 0.5 and Pr(0453 | rob) = 0.5: after the first step every local distribution is forced to 1, so the model ranks both paths identically no matter which word it sees.
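To make this concrete, here is a minimal Python sketch (the transition tables are read directly off the example above; the state/observation encoding is mine) confirming that per-state normalization makes the two words indistinguishable:

```python
# Locally normalized transition tables read off the MEMM above.
# P[(prev_state, observation)] = {next_state: probability}
P = {
    (0, 'r'): {1: 0.5, 4: 0.5},              # 'r' is ambiguous at the start
    (1, 'o'): {2: 1.0}, (1, 'i'): {2: 1.0},  # state 1 has only one successor,
    (4, 'o'): {5: 1.0}, (4, 'i'): {5: 1.0},  # so does state 4: prob forced to 1
    (2, 'b'): {3: 1.0}, (5, 'b'): {3: 1.0},
}

def path_prob(path, word):
    """Probability of a state path under per-state (local) normalization."""
    prob = 1.0
    for prev, nxt, obs in zip(path, path[1:], word):
        prob *= P[(prev, obs)].get(nxt, 0.0)
    return prob

for word in ['rob', 'rib']:
    print(word, path_prob([0, 1, 2, 3], word), path_prob([0, 4, 5, 3], word))
# Both paths score 0.5 for both words: the middle observation is ignored.
```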
How important is label bias?
• Could be avoided in this case by changing structure:
• Our models are always wrong – is this “wrongness” a problem?
• See Klein & Manning’s paper for next week….
Another view of label bias [Sha & Pereira]
So what’s the alternative?
Review of maxent
$\Pr(x) \propto \exp\big(\sum_i \lambda_i f_i(x)\big)$

$\Pr(x, y) \propto \exp\big(\sum_i \lambda_i f_i(x, y)\big)$

$\Pr(y \mid x) = \dfrac{\exp\big(\sum_i \lambda_i f_i(x, y)\big)}{\sum_{y'} \exp\big(\sum_i \lambda_i f_i(x, y')\big)}$
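As a quick illustration of the conditional form, a minimal Python sketch; the feature functions and weights here are invented purely for illustration:

```python
import math

def maxent_prob(x, y, labels, features, weights):
    """Pr(y|x) = exp(sum_i w_i f_i(x,y)) / sum_y' exp(sum_i w_i f_i(x,y'))."""
    def score(cand):
        return sum(w * f(x, cand) for w, f in zip(weights, features))
    z = sum(math.exp(score(cand)) for cand in labels)  # normalizer over labels
    return math.exp(score(y)) / z

# Hypothetical example: two binary features over (word, label) pairs.
features = [
    lambda x, y: 1.0 if x.endswith('ski') and y == 'name' else 0.0,
    lambda x, y: 1.0 if x[0].isupper() and y == 'name' else 0.0,
]
weights = [1.5, 0.8]
print(maxent_prob('Wisniewski', 'name', ['name', 'other'], features, weights))
```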
Review of maxent/MEMM/CMMs
Maxent:

$\Pr(y \mid x) = \dfrac{\exp\big(\sum_i \lambda_i f_i(x, y)\big)}{\sum_{y'} \exp\big(\sum_i \lambda_i f_i(x, y')\big)} = \dfrac{\exp\big(\sum_i \lambda_i f_i(x, y)\big)}{Z(x)}$

For MEMM:

$\Pr(y_1 \ldots y_n \mid x_1 \ldots x_n) = \prod_j \Pr(y_j \mid x_j, y_{j-1}) = \prod_j \dfrac{\exp\big(\sum_i \lambda_i f_i(x_j, y_j, y_{j-1})\big)}{Z(x_j)}$
Details on CMMs
$\Pr(y_1 \ldots y_n \mid x_1 \ldots x_n) = \prod_j \dfrac{\exp\big(\sum_i \lambda_i f_i(x_j, y_j, y_{j-1})\big)}{Z(x_j)}$

$= \dfrac{\exp\big(\sum_i \lambda_i \sum_j f_i(x_j, y_j, y_{j-1})\big)}{\prod_j Z(x_j)} = \dfrac{\exp\big(\sum_i \lambda_i F_i(x, y)\big)}{Z(x)}$

where $F_i(x, y) = \sum_j f_i(x_j, y_j, y_{j-1})$ and $Z(x) = \prod_j Z(x_j)$.
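Following that algebra, a minimal sketch of a locally normalized CMM/MEMM sequence probability; feature functions and weights are placeholders with the same shape as in the maxent sketch above:

```python
import math

def local_prob(x_j, y_j, y_prev, labels, features, weights):
    """One locally normalized factor: exp(sum_i w_i f_i(x_j,y_j,y_prev)) / Z(x_j)."""
    def score(y):
        return sum(w * f(x_j, y, y_prev) for w, f in zip(weights, features))
    z = sum(math.exp(score(y)) for y in labels)  # Z(x_j): per-position normalizer
    return math.exp(score(y_j)) / z

def memm_prob(xs, ys, labels, features, weights, start='START'):
    """Pr(y_1..y_n | x_1..x_n) as a product of local factors."""
    prob, y_prev = 1.0, start
    for x_j, y_j in zip(xs, ys):
        prob *= local_prob(x_j, y_j, y_prev, labels, features, weights)
        y_prev = y_j
    return prob
```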
From CMMs to CRFs
$\Pr(y \mid x) = \prod_j \dfrac{\exp\big(\sum_i \lambda_i f_i(x_j, y_j, y_{j-1})\big)}{Z(x_j)} = \dfrac{\exp\big(\sum_i \lambda_i F_i(x, y)\big)}{Z(x)}$

where $F_i(x, y) = \sum_j f_i(x_j, y_j, y_{j-1})$.
Recall why we’re unhappy: we don’t want local normalization
$\Pr(y \mid x) = \dfrac{\exp\big(\sum_i \lambda_i F_i(x, y)\big)}{Z(x)}$

Written this way the CMM looks global, but its $Z(x) = \prod_j Z(x_j)$ is still built from per-position normalizers.
New model
What’s the new model look like?
$\Pr(y \mid x) = \dfrac{\exp\big(\sum_i \lambda_i F_i(x, y)\big)}{Z(x)} = \dfrac{\exp\big(\sum_i \lambda_i \sum_j f_i(x_j, y_j, y_{j-1})\big)}{Z(x)}$
[Diagram: observations x1, x2, x3; labels y1, y2, y3]
What’s independent?
What’s the new model look like?
$\Pr(y \mid x) = \dfrac{\exp\big(\sum_i \lambda_i F_i(x, y)\big)}{Z(x)} = \dfrac{\exp\big(\sum_i \lambda_i \sum_j f_i(x, y_j, y_{j-1})\big)}{Z(x)}$

Note that each $f_i$ may now see the whole observation sequence $x$.
[Diagram: a single observation node x; labels y1, y2, y3]
What’s independent now??
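Only the label chain carries dependence now, since every feature may inspect all of x. To see what the single global normalizer costs, here is a brute-force Python sketch (placeholder feature signature; it enumerates all |Y|^n label sequences, which is exactly what the matrix trick later in these slides avoids):

```python
import math
from itertools import product

def score(xs, ys, features, weights, start='START'):
    """sum_i w_i * F_i(x, y), where F_i(x, y) = sum_j f_i(x, y_j, y_{j-1})."""
    total, y_prev = 0.0, start
    for j, y_j in enumerate(ys):
        total += sum(w * f(xs, j, y_j, y_prev) for w, f in zip(weights, features))
        y_prev = y_j
    return total

def crf_prob(xs, ys, labels, features, weights):
    """Pr(y | x) = exp(score(x, y)) / Z(x), with Z(x) enumerated exhaustively."""
    z = sum(math.exp(score(xs, list(cand), features, weights))
            for cand in product(labels, repeat=len(xs)))
    return math.exp(score(xs, ys, features, weights)) / z

# Hypothetical feature: current label is 'name' and the current token is capitalized.
features = [lambda xs, j, y, y_prev: 1.0 if y == 'name' and xs[j][0].isupper() else 0.0]
print(crf_prob(['Mr', 'Wisniewski'], ['name', 'name'], ['name', 'other'], features, [2.0]))
```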
Hammersley-Clifford
• For positive distributions P(x1, …, xn):
– Pr(xi | x1, …, xi-1, xi+1, …, xn) = Pr(xi | Neighbors(xi))
– Pr(A | B, S) = Pr(A | S), where A and B are sets of nodes and S is a set that separates A and B
– P can be written as a normalized product of "clique potentials"
$\Pr(x) = \dfrac{1}{Z} \prod_{C \in \text{cliques}} \phi_C(x_C)$
So this is very general: any Markov distribution can be written in this form (modulo nits like “positive distribution”)
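A tiny numeric illustration of the theorem on a three-node chain, with made-up edge potentials; it also checks the Markov property that x3 is independent of x1 given x2:

```python
import numpy as np

# Chain x1 - x2 - x3 over binary variables; the cliques are the two edges.
phi12 = np.array([[2.0, 1.0], [1.0, 3.0]])  # made-up potential on (x1, x2)
phi23 = np.array([[1.0, 2.0], [4.0, 1.0]])  # made-up potential on (x2, x3)

# Joint = normalized product of clique potentials: Pr(x) = (1/Z) prod_C phi_C(x_C)
joint = np.einsum('ab,bc->abc', phi12, phi23)
joint /= joint.sum()  # divide by Z

# Markov check: Pr(x3 | x1, x2) should not depend on x1.
cond = joint / joint.sum(axis=2, keepdims=True)
print(np.allclose(cond[0], cond[1]))  # True
```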
Definition of CRFs
X is a random variable over data sequences to be labeled
Y is a random variable over corresponding label sequences
Example of CRFs
Graphical comparison among HMMs, MEMMs and CRFs
[Diagrams: graphical structures of HMM, MEMM, and CRF]
Lafferty et al notation
– x is a data sequence
– y is a label sequence
– v is a vertex from vertex set V = set of label random variables
– e is an edge from edge set E over V
– $f_k$ and $g_k$ are given and fixed: $g_k$ is a Boolean vertex feature, $f_k$ is a Boolean edge feature
– k is the number of features
– $\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_n;\ \mu_1, \mu_2, \ldots, \mu_n)$; the $\lambda_k$ and $\mu_k$ are parameters to be estimated
– y|e is the set of components of y defined by edge e
– y|v is the set of components of y defined by vertex v
If the graph G = (V, E) of Y is a tree, the conditional distribution over the label sequence Y = y, given X = x, is given by the fundamental theorem of random fields as:
$p_\theta(y \mid x) \propto \exp\Big(\sum_{e \in E,\, k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V,\, k} \mu_k g_k(v, y|_v, x)\Big)$
Conditional Distribution (cont’d)
• CRFs use the observation-dependent normalization Z(x) for the conditional distributions:
Z(x) is a normalization over the data sequence x
$p_\theta(y \mid x) = \dfrac{1}{Z(x)} \exp\Big(\sum_{e \in E,\, k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V,\, k} \mu_k g_k(v, y|_v, x)\Big)$
• Learning:
– Lafferty et al's IIS-based method is rather inefficient.
– Gradient-based methods are faster.
– The trickiest bit is computing the normalization Z(x), which is a sum over exponentially many label vectors y.
CRF learning – from Sha & Pereira
Something like forward-backward
Idea:
• Define a matrix of y, y' "affinities" at stage i
• $M_i[y, y']$ = "unnormalized probability" of a transition from y to y' at stage i
• $M_i M_{i+1}$ = "unnormalized probability" of any path through stages i and i+1
[Diagram: observation x; label nodes y1, y2, y3]
Forward backward ideas
[Lattice: three stages; at each stage the label is "name" or "nonName". Edge weights a, b, c, d connect stage 1 to stage 2, and e, f, g, h connect stage 2 to stage 3.]

$M_1 = \begin{pmatrix} a & b \\ c & d \end{pmatrix}, \qquad M_2 = \begin{pmatrix} e & f \\ g & h \end{pmatrix}$

$M_1 M_2 = \begin{pmatrix} ae + bg & af + bh \\ ce + dg & cf + dh \end{pmatrix}$

Each entry of $M_1 M_2$ sums the weights of all two-step paths between a pair of labels, e.g. the (name, name) entry is $ae + bg$.
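A small numpy check of that identity, with arbitrary positive edge weights standing in for a–h (ignoring designated start/stop states for simplicity):

```python
import numpy as np

# M_i[y, y'] = unnormalized weight of a y -> y' transition at stage i.
M1 = np.array([[1.0, 2.0],
               [3.0, 4.0]])
M2 = np.array([[5.0, 6.0],
               [7.0, 8.0]])

prod = M1 @ M2
# The (name, name) entry sums both two-step paths: a*e + b*g.
assert prod[0, 0] == M1[0, 0] * M2[0, 0] + M1[0, 1] * M2[1, 0]

# Summing the full product over start and end labels gives Z(x)
# in O(n * |Y|^2) time instead of enumerating |Y|^n paths.
Z = prod.sum()
print(Z)
```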
CRF learning – from Sha & Pereira
Sha & Pereira results
CRF beats MEMM (McNemar’s test); MEMM probably beats voted perceptron
Sha & Pereira results
[Table: training times in minutes, 375k examples]
POS tagging Experiments in Lafferty et al
• Compared HMMs, MEMMs, and CRFs on Penn Treebank POS tagging
• Each word in a given input sentence must be labeled with one of 45 syntactic tags
• Add a small set of orthographic features (sketched below): whether a spelling begins with a number or upper-case letter, whether it contains a hyphen, and whether it contains one of the following suffixes: -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies
• oov = out-of-vocabulary (not observed in the training set)
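A minimal sketch of that orthographic feature set; the function name and the dictionary return format are my own:

```python
SUFFIXES = ('-ing', '-ogy', '-ed', '-s', '-ly', '-ion', '-tion', '-ity', '-ies')

def orthographic_features(word):
    """Boolean orthographic features, as listed in Lafferty et al."""
    feats = {
        'starts_with_number': word[0].isdigit(),
        'starts_with_upper': word[0].isupper(),
        'contains_hyphen': '-' in word,
    }
    for suf in SUFFIXES:
        feats['suffix' + suf] = word.endswith(suf[1:])  # drop the leading '-'
    return feats

print(orthographic_features('Selling'))
```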
POS tagging vs MXPost