Max-margin sequential learning methods
William W. Cohen, CALD
Announcements
• Upcoming assignments:
– Wed 3/3: project proposal due: personnel + 1-2 pages
– Spring break next week, no class
– Will get feedback on project proposals by end of break
– No write-ups due the Monday after spring break; write-ups for the “Distance Metrics for Text” week are due Wed 3/17
Collins’ paper
• Notation:
– label (y) is a “tag” t
– observation (x) is a word w
– history h is a 4-tuple ⟨t_{i-1}, t_{i-2}, w_{[1:n]}, i⟩
– φ_s(h, t) is a feature of h, t
Collins’ paper
• Notation con’t:
– Φ is the sum of the φ_s over all positions i: Φ(w_{[1:n]}, t_{[1:n]}) = Σ_i φ_s(h_i, t_i)
– α_s is the weight given to feature φ_s (see the sketch below)
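To make the notation concrete, here is a minimal sketch of the training loop Collins’ paper describes; `phi` and `viterbi_decode` are assumed helper functions (not code from the paper), with `phi` returning the global feature vector as a dict.

```python
from collections import defaultdict

def train_perceptron(examples, phi, viterbi_decode, epochs=5):
    """Sketch of Collins' structured-perceptron training loop."""
    alpha = defaultdict(float)  # one weight alpha_s per feature phi_s
    for _ in range(epochs):
        for words, gold_tags in examples:
            # decode the best tag sequence under the current weights
            pred_tags = viterbi_decode(words, alpha)
            if pred_tags != gold_tags:
                # additive update: promote gold features, demote predicted ones
                for s, v in phi(words, gold_tags).items():
                    alpha[s] += v
                for s, v in phi(words, pred_tags).items():
                    alpha[s] -= v
    return alpha
```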
Collins’ paper
The theory
Claim 1: the algorithm is an instance of this perceptron variant:
Claim 2: the arguments in the mistake-bound classification results of F&S99 extend immediately to this ranking task as well.
F&S99 algorithm
F&S99 result
Collins’ result
Results
• Two experiments:
– POS tagging, using Adwait Ratnaparkhi’s features
– NP chunking (Start, Continue, Outside tags)
– (NER on a special AT&T dataset is covered in another paper)
Features for NP chunking
Results
More ideas
• The dual version of a perceptron:
– w is built up by repeatedly adding examples => w is a weighted sum of the examples x_1, ..., x_m
– the inner product ⟨w, x⟩ can be rewritten:
$$w = \sum_{j=1}^{m} \alpha_j x_j, \qquad \text{where each } \alpha_j \text{ is } 0,\, 1, \text{ or } {-1}$$

so

$$\langle w, x \rangle = \Big\langle \sum_{j=1}^{m} \alpha_j x_j,\; x \Big\rangle = \sum_{j=1}^{m} \alpha_j \langle x_j, x \rangle$$
Dual version of perceptron ranking
α_{i,j}: i ranges over the examples, j over the candidate (correct or incorrect) tag sequences for example i (sketched below)
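A minimal sketch of this dual update, assuming a kernel `K(xi, yi, x, y)` that computes the inner product ⟨Φ(x_i, y_i), Φ(x, y)⟩ and a `candidates(x)` helper that enumerates candidate tag sequences; both names are illustrative, not from the paper.

```python
def dual_perceptron(examples, candidates, K, epochs=5):
    """Sketch of the dual (kernelized) ranking perceptron."""
    support = []  # (x_i, tag sequence, weight) triples built from mistakes

    def score(x, y):
        # <w, Phi(x, y)> expressed as a weighted sum of kernel values
        return sum(a * K(xi, yi, x, y) for xi, yi, a in support)

    for _ in range(epochs):
        for x, gold in examples:
            pred = max(candidates(x), key=lambda y: score(x, y))
            if pred != gold:
                support.append((x, gold, +1.0))  # promote the correct sequence
                support.append((x, pred, -1.0))  # demote the mistaken winner
    return support
```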
NER features for re-ranking MAXENT tagger output
NER features
NER results
Altun et al paper
• Starting point – the dual version of Collins’ perceptron algorithm
– the final hypothesis is a weighted sum of inner products with a subset of the examples
– this is a lot like an SVM – except that the perceptron algorithm is used to set the weights rather than quadratic optimization
SVM optimization
• Notation:
– y_i is the correct tag sequence for x_i
– y is an incorrect tag sequence
– F(x_i, y_i) are the features
– Optimization problem (written out below):
• find weights w that maximize the minimal margin, subject to ||w|| = 1, or
• minimize ||w||² such that every margin is ≥ 1
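Written out in the slide’s notation, with F(x_i, y_i) − F(x_i, y) as the feature difference vector, the two formulations are (a standard max-margin equivalence):

$$\max_{\|w\|=1} \;\; \min_{i,\; y \neq y_i} \;\; \langle w,\; F(x_i, y_i) - F(x_i, y) \rangle$$

is equivalent to

$$\min_{w} \;\; \|w\|^2 \quad \text{s.t.} \quad \langle w,\; F(x_i, y_i) - F(x_i, y) \rangle \;\ge\; 1 \quad \forall i,\; \forall y \neq y_i$$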
SVMs for ranking
SVMs for ranking
Proposition: (14) and (15) are equivalent.
SVMs for ranking
A binary classification problem – with (x_i, y_i) as the positive example and each (x_i, y′) as a negative example, except that θ_i varies for each example. Why? Because we’re ranking.
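One way to write that reduction (a sketch only; θ_i is the per-example margin target, and Altun et al’s exact formulation may differ):

$$\min_{w} \;\; \|w\|^2 \quad \text{s.t.} \quad \langle w,\; F(x_i, y_i) - F(x_i, y') \rangle \;\ge\; \theta_i \quad \forall i,\; \forall y' \neq y_i$$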
SVMs for ranking
• Altun et al give the remaining details
• As in perceptron learning, “negative” data is found by running Viterbi with the learned weights and looking for errors (sketched below)
– each mistake is a possible new support vector
– need to iterate over the data repeatedly
– could take exponential time before convergence if the support vectors are dense...
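A rough sketch of that working-set loop, with `viterbi_decode` and `retrain_svm` as assumed stand-ins for the decoder and the quadratic-program solver:

```python
def mine_support_vectors(examples, viterbi_decode, retrain_svm, max_rounds=20):
    """Decode with current weights, add each Viterbi mistake as a
    candidate support vector, then re-solve the QP; repeat until stable."""
    support = set()  # (example index, wrong tag sequence) pairs
    weights = {}     # empty dict = all-zero initial model
    for _ in range(max_rounds):
        added = False
        for i, (words, gold) in enumerate(examples):
            pred = viterbi_decode(words, weights)
            if pred != gold and (i, tuple(pred)) not in support:
                support.add((i, tuple(pred)))  # a new error => new constraint
                added = True
        if not added:
            break  # no new mistakes: all current constraints are satisfied
        weights = retrain_svm(examples, support)  # re-solve the QP
    return weights, support
```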
Altun et al results
• NER on 300 sentences from the CoNLL2002 shared task – Spanish
– four entity types, nine labels (beginning-T, intermediate-T, other)
• POS tagging on 300 sentences from the Penn TreeBank
• 5-fold cross-validation, window of size 3, simple features
Altun et al results
Altun et al results