Max-margin sequential learning methods
William W. Cohen, CALD
Announcements
• Upcoming assignments:
– Wed 3/3: project proposal due: personnel + 1-2 pages
– Spring break next week, no class
– Will get feedback on project proposals by end of break
– No write-ups due the Monday after spring break; write-ups for the “Distance Metrics for Text” week are due Wed 3/17
Collins’ paper
• Notation:
– label (y) is a “tag” t
– observation (x) is a word w
– history h is a 4-tuple ⟨t_{i-1}, t_{i-2}, w_{[1:n]}, i⟩
– φ_s(h, t) is a feature of h, t
Collins’ paper
• Notation con’t:
– Φ is the sum of the φ_s over all positions i: Φ(w_{[1:n]}, t_{[1:n]}) = Σ_i φ_s(h_i, t_i)
– α_s is the weight given to feature φ_s (see the sketch below)
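To make the notation concrete, here is a minimal sketch of the training loop Collins’ paper describes; `phi` and `viterbi_decode` are assumed helper functions (not code from the paper), with `phi` returning the global feature vector as a dict.

```python
from collections import defaultdict

def train_perceptron(examples, phi, viterbi_decode, epochs=5):
    """Sketch of Collins' structured-perceptron training loop."""
    alpha = defaultdict(float)  # one weight alpha_s per feature phi_s
    for _ in range(epochs):
        for words, gold_tags in examples:
            # decode the best tag sequence under the current weights
            pred_tags = viterbi_decode(words, alpha)
            if pred_tags != gold_tags:
                # additive update: promote gold features, demote predicted ones
                for s, v in phi(words, gold_tags).items():
                    alpha[s] += v
                for s, v in phi(words, pred_tags).items():
                    alpha[s] -= v
    return alpha
```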
Collins’ paper
The theory
Claim 1: the algorithm is an instance of this perceptron variant:
Claim 2: the arguments in the mistake-bound classification results of F&S99 extend immediately to this ranking task as well.
F&S99 algorithm
F&S99 result
Collins’ result
Results
• Two experiments:
– POS tagging, using Adwait Ratnaparkhi’s features
– NP chunking (Start, Continue, Outside tags)
– (NER on a special AT&T dataset is covered in another paper)
Features for NP chunking
Results
More ideas
• The dual version of a perceptron:
– w is built up by repeatedly adding examples => w is a weighted sum of the examples x_1, ..., x_m
– the inner product ⟨w, x⟩ can be rewritten:
$$w = \sum_{j=1}^{m} \alpha_j x_j, \qquad \text{where each } \alpha_j \text{ is } 0,\, 1, \text{ or } {-1}$$

so

$$\langle w, x \rangle = \Big\langle \sum_{j=1}^{m} \alpha_j x_j,\; x \Big\rangle = \sum_{j=1}^{m} \alpha_j \langle x_j, x \rangle$$
Dual version of perceptron ranking
α_{i,j}: i ranges over the examples, j over the candidate (correct or incorrect) tag sequences for example i (sketched below)
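A minimal sketch of this dual update, assuming a kernel `K(xi, yi, x, y)` that computes the inner product ⟨Φ(x_i, y_i), Φ(x, y)⟩ and a `candidates(x)` helper that enumerates candidate tag sequences; both names are illustrative, not from the paper.

```python
def dual_perceptron(examples, candidates, K, epochs=5):
    """Sketch of the dual (kernelized) ranking perceptron."""
    support = []  # (x_i, tag sequence, weight) triples built from mistakes

    def score(x, y):
        # <w, Phi(x, y)> expressed as a weighted sum of kernel values
        return sum(a * K(xi, yi, x, y) for xi, yi, a in support)

    for _ in range(epochs):
        for x, gold in examples:
            pred = max(candidates(x), key=lambda y: score(x, y))
            if pred != gold:
                support.append((x, gold, +1.0))  # promote the correct sequence
                support.append((x, pred, -1.0))  # demote the mistaken winner
    return support
```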
NER features for re-ranking MAXENT tagger output
NER features
NER results
Altun et al paper
• Starting point – the dual version of Collins’ perceptron algorithm
– the final hypothesis is a weighted sum of inner products with a subset of the examples
– this is a lot like an SVM – except that the perceptron algorithm is used to set the weights rather than quadratic optimization
SVM optimization
• Notation:
– y_i is the correct tag sequence for x_i
– y is an incorrect tag sequence
– F(x_i, y_i) are the features
– Optimization problem (written out below):
• find weights w that maximize the minimal margin, subject to ||w|| = 1, or
• minimize ||w||² such that every margin is ≥ 1
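Written out in the slide’s notation, with F(x_i, y_i) − F(x_i, y) as the feature difference vector, the two formulations are (a standard max-margin equivalence):

$$\max_{\|w\|=1} \;\; \min_{i,\; y \neq y_i} \;\; \langle w,\; F(x_i, y_i) - F(x_i, y) \rangle$$

is equivalent to

$$\min_{w} \;\; \|w\|^2 \quad \text{s.t.} \quad \langle w,\; F(x_i, y_i) - F(x_i, y) \rangle \;\ge\; 1 \quad \forall i,\; \forall y \neq y_i$$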
SVMs for ranking
SVMs for ranking
Proposition: (14) and (15) are equivalent.
SVMs for ranking
A binary classification problem – with (x_i, y_i) as the positive example and each (x_i, y′) as a negative example, except that θ_i varies for each example. Why? Because we’re ranking.
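One way to write that reduction (a sketch only; θ_i is the per-example margin target, and Altun et al’s exact formulation may differ):

$$\min_{w} \;\; \|w\|^2 \quad \text{s.t.} \quad \langle w,\; F(x_i, y_i) - F(x_i, y') \rangle \;\ge\; \theta_i \quad \forall i,\; \forall y' \neq y_i$$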
SVMs for ranking
• Altun et al give the remaining details
• As in perceptron learning, “negative” data is found by running Viterbi with the learned weights and looking for errors (sketched below)
– each mistake is a possible new support vector
– need to iterate over the data repeatedly
– could take exponential time before convergence if the support vectors are dense...
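A rough sketch of that working-set loop, with `viterbi_decode` and `retrain_svm` as assumed stand-ins for the decoder and the quadratic-program solver:

```python
def mine_support_vectors(examples, viterbi_decode, retrain_svm, max_rounds=20):
    """Decode with current weights, add each Viterbi mistake as a
    candidate support vector, then re-solve the QP; repeat until stable."""
    support = set()  # (example index, wrong tag sequence) pairs
    weights = {}     # empty dict = all-zero initial model
    for _ in range(max_rounds):
        added = False
        for i, (words, gold) in enumerate(examples):
            pred = viterbi_decode(words, weights)
            if pred != gold and (i, tuple(pred)) not in support:
                support.add((i, tuple(pred)))  # a new error => new constraint
                added = True
        if not added:
            break  # no new mistakes: all current constraints are satisfied
        weights = retrain_svm(examples, support)  # re-solve the QP
    return weights, support
```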
Altun et al results
• NER on 300 sentences from the CoNLL2002 shared task – Spanish
– four entity types, nine labels (beginning-T, intermediate-T, other)
• POS tagging on 300 sentences from the Penn TreeBank
• 5-fold cross-validation, window of size 3, simple features
Altun et al results
Altun et al results