
Page 1: Revisiting Output Coding for Sequential Supervised Learning

Revisiting Output Coding for Sequential Supervised Learning

Guohua Hao & Alan Fern

School of Electrical Engineering and Computer Science

Oregon State University

Corvallis, OR, U.S.A.

Page 2: Revisiting Output Coding for Sequential Supervised Learning

Scalability in CRF Training

Linear Chain CRF model

The model is

$$P(Y \mid X) = \frac{1}{Z(X)} \exp\Big( \sum_t \sum_k \lambda_k f_k(y_{t-1}, y_t, X) \Big)$$

Inference in training:
- Computing the partition function $Z(X)$: forward-backward algorithm
- Maximizing over label sequences: Viterbi algorithm
- Complexity of both: $O(T L^2)$ for sequence length $T$ and label-set size $L$

Training requires repeated inference, which is computationally demanding and cannot scale to large label sets.

[Figure: linear-chain graphical model with label nodes y_{t-1}, y_t, y_{t+1} and observation nodes X_{t-1}, X_t, X_{t+1}]
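To make the $O(TL^2)$ cost concrete, here is a minimal NumPy sketch (not from the slides) of the forward pass that computes $\log Z(X)$; the node and transition score matrices stand in for the log potentials of a trained chain model.

```python
import numpy as np

def log_partition(log_node, log_trans):
    """Forward algorithm: computes log Z(X) for a linear-chain model.

    log_node:  (T, L) per-position label scores, log phi_t(y_t)
    log_trans: (L, L) transition scores, log psi(y_{t-1}, y_t)
    Each of the T steps is an L x L log-sum-exp, hence O(T L^2) total.
    """
    T, L = log_node.shape
    alpha = log_node[0]                        # log alpha_1
    for t in range(1, T):
        # alpha_t(y) = logsumexp_{y'} [alpha_{t-1}(y') + psi(y', y)] + phi_t(y)
        alpha = np.logaddexp.reduce(alpha[:, None] + log_trans, axis=0) + log_node[t]
    return np.logaddexp.reduce(alpha)          # log Z(X)

# Toy example: T = 5 positions, L = 3 labels with random scores
rng = np.random.default_rng(0)
print(log_partition(rng.normal(size=(5, 3)), rng.normal(size=(3, 3))))
```

Viterbi replaces the log-sum-exp with a max over the same $L \times L$ table, so it has the same $O(TL^2)$ complexity, which is what makes large label sets expensive.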

Page 3: Revisiting Output Coding for Sequential Supervised Learning

Recent Work of Focus: Sequential Error Correcting Output Coding (SECOC)

Error Correcting Output Coding (ECOC)

Class   Code word
        b1   b2   ...   bn
C1      1    0    ...   1
C2      0    0    ...   0
...     ...  ...  ...   ...
Cm      0    1    ...   1

classifier   h1   h2   ...   hn

$H(x) = (h_1(x), h_2(x), \ldots, h_n(x))$

$\hat{y}(x) = \arg\min_i \Delta(H(x), C_i)$, where $\Delta$ measures the distance (e.g. Hamming) between the vector of binary predictions and code word $C_i$.
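As a sketch of this decoding rule, assuming a small hypothetical code matrix and Hamming distance:

```python
import numpy as np

# Hypothetical code matrix: m = 3 classes, n = 3 bits (row i is codeword C_i)
code = np.array([[1, 0, 1],
                 [0, 0, 0],
                 [0, 1, 1]])

def ecoc_decode(H, code):
    """Return the class whose codeword is closest in Hamming distance
    to the vector of binary predictions H = (h_1(x), ..., h_n(x))."""
    dists = (code != H).sum(axis=1)   # Hamming distance to each C_i
    return int(np.argmin(dists))

print(ecoc_decode(np.array([0, 1, 1]), code))  # -> 2 (codeword [0, 1, 1])
```

Since $n$ bits can distinguish up to $2^n$ classes, the number of binary classifiers can be much smaller than the number of labels, which is the source of the scalability gain.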

Page 4: Revisiting Output Coding for Sequential Supervised Learning

Extension to CRF model

[Figure: the original label sequence y_{t-1}, y_t, y_{t+1} over inputs x_{t-1}, x_t, x_{t+1} is mapped bit by bit to binary label sequences, and one binary CRF h^k is trained on each binary sequence.]

Relabeling: $y_t^k = b_k(y_t)$, i.e. the k-th binary sequence applies the bit function $b_k(Y)$ to the original labels.

Decoding: each binary CRF $h^k$ predicts a binary sequence $\hat{y}^k$; at each position $t$ the bit predictions are collected into

$H_t(x) = (\hat{y}_t^1, \ldots, \hat{y}_t^n)$

and decoded per position:

$\hat{y}_t = \arg\min_i \Delta(H_t(x), C_i)$
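A minimal sketch of the relabeling and per-position decoding steps; the one-hot code matrix and the use of the true bit sequences in place of binary-CRF predictions are simplifications for illustration.

```python
import numpy as np

# Code matrix: row i is the codeword for label i (one-hot here for clarity)
code = np.array([[1, 0, 0],
                 [0, 1, 0],
                 [0, 0, 1]])

def binary_sequences(y, code):
    """Relabel a sequence y with each bit function: y_t^k = b_k(y_t)."""
    return [code[y, k] for k in range(code.shape[1])]  # one 0/1 sequence per bit

def secoc_decode(bit_preds, code):
    """Per-position decoding: at each t, pick the label whose codeword
    is closest in Hamming distance to (y_t^1, ..., y_t^n)."""
    H = np.stack(bit_preds, axis=1)                          # (T, n) bit vectors
    dists = (H[:, None, :] != code[None, :, :]).sum(axis=2)  # (T, m) distances
    return dists.argmin(axis=1)

y = np.array([0, 1, 2, 0, 1, 2])   # a toy label sequence
bits = binary_sequences(y, code)   # in SECOC these come from binary CRFs
print(secoc_decode(bits, code))    # recovers [0 1 2 0 1 2]
```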

Page 5: Revisiting Output Coding for Sequential Supervised Learning

Representational Capacity of SECOC

Intuitively, it seems that training each binary CRF independently will not be able to capture rich transition structure. Below is a counter-example to independent training.

Our hypothesis: when the transition structure is critical, independent training will not do as well.

[Figure: a deterministic label cycle 1 → 2 → 3 → 1 → ...]

Under the code below, the binary sequence for bit $b_1$ looks purely random to a first-order model:

$P(y_{t+1}^1 = 1 \mid y_t^1 = 0) = 0.5$
$P(y_{t+1}^1 = 0 \mid y_t^1 = 0) = 0.5$

Y       = 1 2 3 1 2 3 1
b1(Y)   = 1 0 0 1 0 0 1
b1(Y)*  = 1 0 1 0 1 0 0
b2(Y)   = 0 1 0 0 1 0 0
b3(Y)   = 0 0 1 0 0 1 0

The corrupted sequence b1(Y)* has the same first-order statistics as b1(Y), so an independently trained binary CRF cannot distinguish the two, even though the underlying label process is deterministic.

Code words:

y   b1   b2   b3
1   1    0    0
2   0    1    0
3   0    0    1
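The counter-example is easy to check empirically. This small sketch generates the deterministic cycle and measures the first-order statistics of $b_1(Y)$, which come out uniform from state 0:

```python
from collections import Counter

# Deterministic cycle 1 -> 2 -> 3 -> 1 -> ... and bit b1 (fires on label 1)
T = 9999
Y = [(t % 3) + 1 for t in range(T)]
b1 = [1 if y == 1 else 0 for y in Y]

# First-order transition statistics of the binary sequence b1(Y)
pairs = Counter(zip(b1, b1[1:]))
for (prev, nxt), c in sorted(pairs.items()):
    total = sum(v for (p, _), v in pairs.items() if p == prev)
    print(f"P(b_t+1 = {nxt} | b_t = {prev}) ~ {c / total:.2f}")
```

From state 0 the next bit is 0 or 1 with probability 0.5 each, so a first-order binary chain model sees pure noise where the full label chain is perfectly predictable.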

Page 6: Revisiting Output Coding for Sequential Supervised Learning

Our Method: Cascaded SECOC

- Helps capture the transition structure (see the sketch after the figure below).
- For problems where a transition model is critical, we hope to see cascade training outperform independent training.
- For problems where the observation model is more informative, cascade training should still help when the sliding window is small; a large sliding window will dominate the effect of cascade training.

[Figure: the cascaded architecture. The binary CRF h^k for bit b_k(Y) takes, at each position t, the inputs x_{t-1}, x_t, x_{t+1} together with the previous binary predictions ŷ_{t-1}^{k-1}, ŷ_t^{k-1}, ŷ_{t+1}^{k-1}, ... produced by earlier CRFs in the cascade.]
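A schematic sketch of the cascade construction; `train_binary_crf` and its `.predict` interface are hypothetical stand-ins for the base learner (e.g. GTB or VP), and feeding training-time predictions forward is one plausible reading of the figure rather than a verbatim detail from the slides.

```python
import numpy as np

def cascaded_secoc_train(X, Y, code, train_binary_crf, h=1):
    """Train binary CRFs in sequence; CRF k sees the observations plus
    the predictions of the previous h binary CRFs as extra features.

    X: (T, d) observation features; Y: (T,) labels; code: (m, n) code matrix.
    train_binary_crf(features, bits) -> model with .predict(features) is a
    hypothetical stand-in for the base sequence learner.
    """
    models, prev_preds = [], []
    for k in range(code.shape[1]):
        bits = code[Y, k]                        # target bit sequence b_k(Y)
        extras = prev_preds[-h:]                 # last h cascade predictions
        feats = np.column_stack([X] + extras) if extras else X
        model = train_binary_crf(feats, bits)
        models.append(model)
        prev_preds.append(model.predict(feats))  # feed predictions forward
    return models
```

The history length h here corresponds to the c-SECOC(h) parameter compared in the experiments: with h = 0 this degenerates to independent training.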

Page 7: Revisiting Output Coding for Sequential Supervised Learning

Experimental Results

Base CRF training algorithms: Gradient Tree Boosting (GTB) and Voted Perceptron (VP)

Methods for comparison:
- iid: non-sequential ECOC
- i-SECOC: independent SECOC
- c-SECOC(h): cascaded SECOC with history length h
- Beam search

Synthetic data sets: generated by an HMM with transition model

$$P(y_{t+1} = y \mid y_t = l_i) = \begin{cases} p_l / |L_i| & \text{if } y \in L_i \\ (1 - p_l) / (|L| - |L_i|) & \text{otherwise} \end{cases}$$

and observation model

$$P(o_t = o \mid y_t = l_i) = \begin{cases} p_o / |O_i| & \text{if } o \in O_i \\ (1 - p_o) / (|O| - |O_i|) & \text{otherwise} \end{cases}$$

"Transition" data set: $p_o = 0.2$, $|O_i| = 8$ (weakly informative observations); $p_l = 0.6$, $|L_i| = 2$

"Both" data set: $p_o = 0.6$, $|O_i| = 2$; $p_l = 0.6$, $|L_i| = 2$
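A sketch of the generator implied by these definitions; the random choice of the preferred sets $L_i$ and $O_i$ is an assumption of this sketch, not a detail taken from the slides.

```python
import numpy as np

def sample_hmm_sequence(T, n_labels, n_obs, p_l, k_l, p_o, k_o, rng):
    """Sample one sequence from the synthetic HMM described above.

    Each label l_i has a preferred set L_i of k_l next labels (total mass
    p_l) and a preferred set O_i of k_o observations (total mass p_o); the
    remaining mass is spread uniformly over the remaining items.
    """
    L_sets = [rng.choice(n_labels, k_l, replace=False) for _ in range(n_labels)]
    O_sets = [rng.choice(n_obs, k_o, replace=False) for _ in range(n_labels)]

    def dist(size, fav, p):
        d = np.full(size, (1 - p) / (size - len(fav)))  # non-preferred mass
        d[fav] = p / len(fav)                           # preferred mass
        return d

    y, ys, obs = rng.integers(n_labels), [], []
    for _ in range(T):
        ys.append(y)
        obs.append(rng.choice(n_obs, p=dist(n_obs, O_sets[y], p_o)))
        y = rng.choice(n_labels, p=dist(n_labels, L_sets[y], p_l))
    return np.array(ys), np.array(obs)

rng = np.random.default_rng(0)
# "transition" setting: p_o = 0.2 over 8 observations, p_l = 0.6 over 2 labels
ys, obs = sample_hmm_sequence(50, 40, 40, p_l=0.6, k_l=2, p_o=0.2, k_o=8, rng=rng)
```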

Page 8: Revisiting Output Coding for Sequential Supervised Learning

Nettalk Data Set (134 labels)

[Figure: four result panels: window size 1 with GTB, window size 3 with GTB, window size 1 with VP, window size 3 with VP]

Page 9: Revisiting Output Coding for Sequential Supervised Learning

Synthetic Data Sets (40 labels) and Noun Phrase Chunking (NPC) (121 labels)

[Figure: result panels including the "transition" data set with window size 1 (GTB and VP), the "both" data set with window size 3 (GTB), and NPC with window size 3 (VP)]

Page 10: Revisiting Output Coding for Sequential Supervised Learning

Comparing to Beam Search

[Figure: four panels comparing to beam search: window sizes 1 and 3 on Nettalk, window sizes 1 and 3 on NPC]

Page 11: Revisiting Output Coding for Sequential Supervised Learning

Summary

i-SECOC can perform poorly when explicitly capturing complex transition models is critical

c-SECOC can improve accuracy in such situations by using cascade features

The performance of c-SECOC can depend strongly on the base CRF algorithm; algorithms capable of capturing complex (non-linear) feature interactions are preferred

When using less powerful base CRF learning algorithms, other approaches (e.g. beam search) can outperform c-SECOC

Page 12: Revisiting Output Coding for Sequential Supervised Learning

Future Directions

- Efficient validation procedure for selecting the cascade history length
- Incremental generation of code words
- Wide comparison of methods for dealing with large label sets

Acknowledgements

We thank John Langford for discussion of the counter-example to independent SECOC and Thomas Dietterich for his support. This work was supported by NSF grant IIS-0307592.