Revisiting Output Coding for Sequential Supervised Learning
Guohua Hao & Alan Fern
School of Electrical Engineering and Computer Science
Oregon State University
Corvallis, OR, U.S.A.
Scalability in CRF Training
Linear Chain CRF model
$$P(Y \mid X) = \frac{1}{Z(X)} \exp\Big( \sum_t \Psi_t(y_{t-1}, y_t, X) \Big)$$

[Figure: linear-chain graphical model with labels $y_{t-1}, y_t, y_{t+1}$ and observations $X_{t-1}, X_t, X_{t+1}$.]

Inference in training:
- Partition function $Z(X)$: forward-backward algorithm
- Maximizing over label sequences: Viterbi algorithm
- Complexity of both: $O(T L^{k+1})$ for an order-$k$ model ($O(T L^2)$ for a first-order chain), where $T$ is the sequence length and $L$ the number of labels

Repeated inference during training is computationally demanding and cannot scale to large label sets.
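To make the cost concrete, here is a minimal Viterbi sketch over generic log-potentials (the `psi` array layout is our own convention, not from the poster); the loop over positions, each touching all $L \times L$ label pairs, is where the $O(TL^2)$ per-sweep cost of a first-order chain comes from:

```python
import numpy as np

def viterbi(psi):
    """Most likely label sequence under a first-order linear-chain model.

    psi[t, i, j]: log-potential of (y_{t-1} = i, y_t = j) at position t;
    psi[0, 0, :] holds the initial log-potentials.  Shape (T, L, L).
    """
    T, L, _ = psi.shape
    delta = psi[0, 0].copy()             # best log-score ending in each label
    back = np.zeros((T, L), dtype=int)   # argmax predecessors
    for t in range(1, T):                # T positions ...
        scores = delta[:, None] + psi[t]     # ... each touching L*L transitions
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0)
    path = [int(delta.argmax())]         # trace the best path backwards
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```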
Recent Work of Focus: Sequential Error-Correcting Output Coding (SECOC)
Error Correcting Output Coding (ECOC)
Class | Code word
      |  b1   b2   ...  bn
C1    |  1    0    ...  1
C2    |  0    0    ...  0
...   |  ...  ...  ...  ...
Cm    |  0    1    ...  1

One binary classifier per bit: h1, h2, ..., hn.
$H(x) = (h_1(x), h_2(x), \ldots, h_n(x))$
$\hat{y}(x) = \arg\min_i \Delta(H(x), C_i)$, where $\Delta$ is the Hamming distance.
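A minimal sketch of this decoding step (numpy-based, with names of our choosing):

```python
import numpy as np

def ecoc_decode(bit_predictions, C):
    """Return the index of the class whose code word is nearest to
    H(x) = (h_1(x), ..., h_n(x)) in Hamming distance."""
    H = np.asarray(bit_predictions)
    distances = (C != H).sum(axis=1)  # Hamming distance to each row C_i
    return int(distances.argmin())

# Example with the 3-class one-bit-per-class code used later in the poster:
C = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]])
print(ecoc_decode([1, 0, 1], C))  # rows 0 and 2 tie at distance 1; argmin -> 0
```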
Extension to CRF model
Training:
- For each bit $k$, map the original label sequence to a binary label sequence via $y_t^k = b_k(y_t)$
- Train an independent binary CRF $h^k$ on the resulting binary sequences

[Figure: original label sequence $y_{t-1}, y_t, y_{t+1}$ over observations $x_{t-1}, x_t, x_{t+1}$, projected bit-wise to the binary label sequence $y_{t-1}^k, y_t^k, y_{t+1}^k$ that trains binary CRF $h^k$.]
Decoding:
- Run each binary CRF $h^1, \ldots, h^n$ to obtain per-position bit predictions $\hat{y}_t^1, \ldots, \hat{y}_t^n$
- Form $H_t(\mathbf{x}) = (\hat{y}_t^1, \ldots, \hat{y}_t^n)$ and decode each position independently: $\hat{y}_t = \arg\min_i \Delta(H_t(\mathbf{x}), C_i)$
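A sketch of independent SECOC training and decoding, assuming a generic `BinaryCRF` sequence learner with a fit/predict interface (hypothetical, not a real library API; labels are integer indices into the code matrix):

```python
import numpy as np

def train_isecoc(X_seqs, Y_seqs, C, BinaryCRF):
    """Independent SECOC: one binary CRF per code-word bit.

    C: (m, n) 0/1 code matrix over m labels and n bits.
    BinaryCRF: hypothetical learner with fit(X_seqs, bit_seqs) and
    predict(x_seq) -> 0/1 sequence.
    """
    models = []
    for k in range(C.shape[1]):
        # b_k(Y): apply the k-th bit function to every label in every sequence
        bit_seqs = [[C[y, k] for y in Y] for Y in Y_seqs]
        models.append(BinaryCRF().fit(X_seqs, bit_seqs))
    return models

def decode_isecoc(x_seq, models, C):
    """Per-position Hamming decoding of the n binary predictions."""
    bits = np.array([h.predict(x_seq) for h in models]).T  # (T, n_bits)
    return [int((C != b).sum(axis=1).argmin()) for b in bits]
```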
Representational Capacity of SECOC
Intuitively, training each binary CRF independently cannot capture rich transition structure over the original labels; the counter-example below makes this concrete.
Our hypothesis: when the transition structure is critical, independent training will not do well.
[Figure: deterministic cycle over labels 1 → 2 → 3 → 1.]

The original label sequence is perfectly predictable, but its binary projection is not. A first-order model trained on the bit stream $b_1(Y)$ learns
$P(y_t^1 = 1 \mid y_{t-1}^1 = 0) = 0.5$ and $P(y_t^1 = 0 \mid y_{t-1}^1 = 0) = 0.5$,
so the corrupted sequence $b_1(Y)^*$ below receives at least as much probability as $b_1(Y)$ itself, even though it is the projection of no valid label sequence:

Y       = 1 2 3 1 2 3 1
b1(Y)   = 1 0 0 1 0 0 1
b1(Y)*  = 1 0 1 0 1 0 0
b2(Y)   = 0 1 0 0 1 0 0
b3(Y)   = 0 0 1 0 0 1 0

Code words:
y | b1 b2 b3
1 | 1  0  0
2 | 0  1  0
3 | 0  0  1
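A quick check of the claim, counting first-order transition statistics of the projected bit stream $b_1(Y)$ for the cyclic sequence (plain Python, our own construction):

```python
from collections import Counter

Y = [1, 2, 3] * 40 + [1]               # deterministic cycle 1 -> 2 -> 3 -> 1
b1 = [1 if y == 1 else 0 for y in Y]   # bit b1: "is the label 1?"

# First-order transition counts of the binary projection
counts = Counter(zip(b1, b1[1:]))
total_from_0 = counts[(0, 0)] + counts[(0, 1)]
print("P(1|0) =", counts[(0, 1)] / total_from_0)  # -> 0.5
print("P(0|0) =", counts[(0, 0)] / total_from_0)  # -> 0.5
```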
Our Method: Cascaded SECOC
Idea: help capture the transition structure by feeding the predictions of previously trained binary CRFs to later ones as additional input features (see the sketch after the figure).
- For problems where a transition model is critical, we expect cascade training to outperform independent training.
- For problems where the observation model is more informative, the benefit appears mainly when the sliding window is small; a large sliding window will dominate the effect of cascade training.
[Figure: cascaded SECOC. The $k$-th binary CRF $h^k$ is trained on the binary sequence $b_k(Y)$ (labels $y_{t-1}^k, y_t^k, y_{t+1}^k$); its input at each position combines the observation window around $x_t$ with the previous binary predictions $\hat{y}^1, \ldots, \hat{y}^{k-1}$ produced by the earlier binary CRFs.]
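A sketch of cascade training under the same hypothetical `BinaryCRF` interface as in the i-SECOC sketch; we read the history length as the number of most recent earlier-bit predictions appended per position, which is our simplification (all names are ours):

```python
def train_csecoc(X_seqs, Y_seqs, C, BinaryCRF, history):
    """Cascaded SECOC: bit k conditions on up to `history` earlier bits.

    X_seqs[s][t] is the feature list at position t of sequence s;
    C is the (m, n) code matrix; BinaryCRF is the hypothetical
    fit/predict sequence learner from the i-SECOC sketch.
    """
    models = []
    # pred_hist[s][t]: predictions of earlier binary CRFs at (s, t)
    pred_hist = [[[] for _ in X] for X in X_seqs]
    for k in range(C.shape[1]):
        # Augment each position's features with the last `history`
        # binary predictions -- the cascade.
        aug = [[list(x) + hist[-history:] for x, hist in zip(X, H)]
               for X, H in zip(X_seqs, pred_hist)]
        bit_seqs = [[C[y, k] for y in Y] for Y in Y_seqs]
        h_k = BinaryCRF().fit(aug, bit_seqs)
        models.append(h_k)
        # Record h_k's predictions so later bits can condition on them
        for s, X_aug in enumerate(aug):
            for t, p in enumerate(h_k.predict(X_aug)):
                pred_hist[s][t].append(p)
    return models
```

Decoding mirrors training: the binary CRFs run in the same order, each consuming the earlier bits' predictions, and the resulting per-position bit vectors are Hamming-decoded as in i-SECOC.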
Experimental Results
Base CRF training algorithms:
- Gradient Tree Boosting (GTB)
- Voted Perceptron (VP)

Methods for comparison:
- iid: non-sequential ECOC
- i-SECOC: independent SECOC
- c-SECOC(h): cascaded SECOC with history length h
- Beam search
Synthetic Data Sets
Generation by an HMM: each label $l_i$ has a set $L_i$ of likely successor labels and a set $O_i$ of likely observations, with label set $L$ and observation set $O$:

$$P(y_t = y \mid y_{t-1} = l_i) = \begin{cases} p_l / |L_i| & y \in L_i \\ (1 - p_l) / (|L| - |L_i|) & \text{otherwise} \end{cases}$$

$$P(o_t = o \mid y_t = l_i) = \begin{cases} p_o / |O_i| & o \in O_i \\ (1 - p_o) / (|O| - |O_i|) & \text{otherwise} \end{cases}$$

"Transition" data set (weak observations, informative transitions): $p_o = 0.2$, $|O_i| = 8$; $p_l = 0.6$, $|L_i| = 2$
"Both" data set (informative observations and transitions): $p_o = 0.6$, $|O_i| = 2$; $p_l = 0.6$, $|L_i| = 2$
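A minimal sketch of this sampler, assuming each label $i$ carries a likely-successor set `L_sets[i]` and a likely-observation set `O_sets[i]` as defined above (function and variable names are ours):

```python
import random

def sample_hmm_sequence(T, L_sets, O_sets, n_labels, n_obs, p_l, p_o):
    """Sample one (observations, labels) pair from the scheme above.

    L_sets[i]: likely successor labels of label i (drawn with prob. p_l,
    uniformly within the set); O_sets[i]: likely observations of label i
    (drawn with prob. p_o).  Otherwise draws are uniform over the rest.
    """
    def draw(likely, universe_size, p):
        if random.random() < p:
            return random.choice(sorted(likely))
        return random.choice([v for v in range(universe_size) if v not in likely])

    y = random.randrange(n_labels)
    obs, labels = [], []
    for _ in range(T):
        obs.append(draw(O_sets[y], n_obs, p_o))
        labels.append(y)
        y = draw(L_sets[y], n_labels, p_l)
    return obs, labels
```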
Real-world data sets: Nettalk (134 labels) and Noun Phrase Chunking (NPC) (121 labels); synthetic data sets have 40 labels.

[Result plots: synthetic "transition" data set with window size 1 and "both" data set with window size 3, each with GTB and VP; Nettalk and NPC with window sizes 1 and 3, each with GTB and VP.]
Comparing to Beam Search
[Result plots: beam search comparison on Nettalk and NPC, window sizes 1 and 3.]
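For reference, a sketch of the beam-search decoder being compared against: it keeps only the best B partial label sequences at each position, replacing Viterbi's $O(TL^2)$ sweep with roughly $O(TBL)$ work (the `psi` layout matches the earlier Viterbi sketch; this is our illustration, not the exact implementation evaluated on the poster):

```python
import heapq

def beam_search(psi, beam_width):
    """Approximate decoding keeping the best `beam_width` partial sequences.

    psi[t][i][j]: log-potential of (y_{t-1} = i, y_t = j) at position t;
    psi[0][0][j] holds the initial log-potentials.
    """
    T, L = len(psi), len(psi[0][0])
    beam = heapq.nlargest(beam_width,
                          [(psi[0][0][j], [j]) for j in range(L)])
    for t in range(1, T):
        # Extend every surviving partial sequence by every label,
        # then prune back to the top beam_width candidates.
        candidates = [(score + psi[t][seq[-1]][j], seq + [j])
                      for score, seq in beam for j in range(L)]
        beam = heapq.nlargest(beam_width, candidates)
    return max(beam)[1]  # highest-scoring complete sequence
```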
Summary
i-SECOC can perform poorly when explicitly capturing complex transition models is critical
c-SECOC can improve accuracy in such situations by using cascade features
Performance of c-SECOC can depend strongly on the base CRF learning algorithm; algorithms capable of capturing complex (non-linear) feature interactions are preferred
When using less powerful base CRF learning algorithms, other approaches (e.g. beam search) can outperform c-SECOC
Future Directions
- Efficient validation procedure for selecting the cascade history length
- Incremental generation of code words
- Wide comparison of methods for dealing with large label sets
Acknowledgements
We thank John Langford for discussion of the counter-example to independent SECOC and Thomas Dietterich for his support. This work was supported by NSF grant IIS-0307592.