
Page 1: Margins, support vectors, and linear programming, oh my!
Reading: Bishop, 4.0, 4.1, 7.0, 7.1; Burges tutorial (on class resources page)


Page 2:

Administrivia I

•Tomorrow is the deadline for the CS grad student conference (CSUSC):

• http://www.cs.unm.edu/~csgsa/conference/cfp.html

•Still time to submit!

Page 3:

Administrivia II: Straw poll
•Which would you rather do first (and probably at all)?

•Unsupervised learning

•Clustering

•Structure of data

•Scientific discovery (genomics, taxonomy, etc.)

•Reinforcement learning

•Control

•Robot navigation

•Learning behavior

•Let me know (in person, email, etc.)

Page 4:

Group Reading #1
•Due: Feb 20
•Dietterich, T. G., “An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization.” Machine Learning, 40(2), 139-157.
• DOI: 10.1023/A:1007607513941
• http://www.springerlink.com
•Background material:
• Dietterich, T. G., “Ensemble Learning.” In The Handbook of Brain Theory and Neural Networks, Second edition (M.A. Arbib, Ed.), Cambridge, MA: The MIT Press, 2002. http://www.cs.orst.edu/~tgd/publications/hbtnn-ensemble-learning.ps.gz

• DH&S Ch 9.5.1-2

Page 5:

Time rolls on...
•Last time:

•The linear regression problem

•Squared-error loss

•Vector derivatives

•The least-squares hyperplane

•This time:

•What about multiclass, nonlinear, or nonseparable gunk?

• Intro to support vector machines

Page 6:

Exercise
•Derive the vector derivative expressions:

•Find an expression for the minimum squared error weight vector, w, in the loss function:
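A minimal reconstruction of the exercise setup, assuming the standard squared-error formulation used on the following slides (targets t_i stacked in a vector t, inputs x_i stacked as the rows of X); the two derivative identities and the loss are

\[
\frac{\partial}{\partial \mathbf{w}}\bigl(\mathbf{a}^\top \mathbf{w}\bigr) = \mathbf{a},
\qquad
\frac{\partial}{\partial \mathbf{w}}\bigl(\mathbf{w}^\top A \mathbf{w}\bigr) = (A + A^\top)\,\mathbf{w},
\]
\[
L(\mathbf{w}) = \sum_i \bigl(t_i - \mathbf{w}^\top \mathbf{x}_i\bigr)^2 = \lVert \mathbf{t} - X\mathbf{w} \rVert^2 .
\]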

Page 7:

Solution to LSE regression
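A standard version of the derivation, consistent with the Gram-matrix and pseudoinverse remarks on the next slide:

\[
\nabla_{\mathbf{w}} L(\mathbf{w}) = -2\,X^\top(\mathbf{t} - X\mathbf{w}) = \mathbf{0}
\;\Longrightarrow\;
X^\top X\,\mathbf{w} = X^\top \mathbf{t}
\;\Longrightarrow\;
\mathbf{w} = (X^\top X)^{-1} X^\top \mathbf{t}.
\]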

Page 8:

The LSE method
•The quantity X^T X is called a Gram matrix and is positive semidefinite and symmetric
•The quantity (X^T X)^(-1) X^T is the pseudoinverse of X

•May not exist if Gram matrix is not invertible

•The complete “learning algorithm” is 2 whole lines of Matlab code
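A minimal sketch of that two-line fit, written here in NumPy rather than Matlab (the function names are illustrative, not from the course):

import numpy as np

def lse_fit(X, t):
    """Least-squares weights w = (X^T X)^{-1} X^T t, computed via the pseudoinverse."""
    return np.linalg.pinv(X) @ t

def lse_predict(X, w):
    """Linear prediction: t_hat = X w."""
    return X @ w

In Matlab the same fit is essentially w = pinv(X) * t (or w = X \ t), which is presumably the pair of lines the slide has in mind.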

Page 9:

LSE example

w=[6.72, -0.36]

Page 10:

LSE example
[Figure: fitted regression line t = y(x1, w)]

Page 11:

LSE example
[Figure: fitted regression line t = y(x1, w), with w = [6.72, -0.36]]

Page 12:

The LSE method

•So far, we have a regressor -- it estimates a real-valued t_i for each x_i

•Can convert it to a classifier by assigning t = +1 or -1 to binary-class training data and thresholding the output at zero (sketch below)

•Q: How do you handle non-binary data?
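A minimal sketch of the conversion mentioned two bullets up, assuming the usual convention of thresholding the regressor's output at zero (the slide leaves the threshold implicit):

import numpy as np

def lse_classify(X, w):
    """Binary decision from the linear regressor: +1 if x . w >= 0, else -1."""
    return np.where(X @ w >= 0, 1, -1)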

Page 13:

Handling non-binary data
•DTs and k-NN can handle multi-class data

•Linear discriminants (& many other learners) only work on binary data

•3 ways to “hack” binary classifiers to c-ary data:

Page 14:

Handling non-binary data
•DTs and k-NN can handle multi-class data

•Linear discriminants (& many other learners) only work on binary data

•3 ways to “hack” binary classifiers to c-ary data:

•1 against many (sketched in code after this slide):

•Train c classifiers to recognize “class 1 vs anything else”; “class 2 vs everything else” ...

•Cheap, easy

•May drastically unbalance the classes for each classifier

•What if two classifiers make diff predictions?
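A minimal sketch of the 1-against-many (one-vs-rest) scheme just described, reusing the least-squares fit as the binary learner (an assumption for illustration; any binary classifier would do). Disagreements between the c classifiers are resolved here by taking the largest raw score, which is one common fix:

import numpy as np

def train_one_vs_rest(X, y, classes):
    """Train one least-squares 'class k vs. everything else' classifier per class."""
    W = []
    for k in classes:
        t = np.where(y == k, 1.0, -1.0)   # relabel: +1 for class k, -1 for everything else
        W.append(np.linalg.pinv(X) @ t)
    return np.stack(W, axis=1)            # one weight column per class

def predict_one_vs_rest(X, W, classes):
    """Score every class for each test point and pick the largest (breaks disagreements/ties)."""
    scores = X @ W                        # n x c matrix of raw scores
    return np.asarray(classes)[np.argmax(scores, axis=1)]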

Page 15:

Multiclass trouble
[Figure: region where the one-vs-rest classifiers disagree]

Page 16:

Handling non-binary data
•All against all (sketched in code after this slide):

•Train O(c^2) classifiers, one for each pair of classes

•Run every test point through all classifiers

•Majority vote for final classifier

•More stable than 1 vs many

•Lot more overhead, esp for large c

•Data may be more balanced

•Each classifier trained on very small part of data
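A minimal sketch of the all-against-all (pairwise) scheme with majority voting, again with the least-squares fit standing in for the binary learner:

import numpy as np
from itertools import combinations

def train_all_pairs(X, y, classes):
    """Train one classifier per pair of classes, using only that pair's training points."""
    models = {}
    for a, b in combinations(classes, 2):
        mask = (y == a) | (y == b)
        t = np.where(y[mask] == a, 1.0, -1.0)
        models[(a, b)] = np.linalg.pinv(X[mask]) @ t
    return models

def predict_all_pairs(X, models, classes):
    """Run each test point through all O(c^2) classifiers and take a majority vote."""
    index = {c: i for i, c in enumerate(classes)}
    votes = np.zeros((X.shape[0], len(classes)))
    for (a, b), w in models.items():
        picks_a = X @ w >= 0
        votes[picks_a, index[a]] += 1
        votes[~picks_a, index[b]] += 1
    return np.asarray(classes)[np.argmax(votes, axis=1)]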

Page 17:

Handling non-binary data
•Coding theory approach (sketched in code after the table below)

•Given c classes, choose b≥lg(c)

•Assign each class a b-bit “code word”

•Train one classifier for each bit

•Apply each classifier to a test instance => new code => reconstruct class

x1      x2   x3   y
green   3.2  -9   apple
yellow  1.8  0.7  lemon
yellow  6.9  -3   banana
red     0.8  0.8  grape
green   3.4  0.9  pear

x1      x2   x3   y1  y2  y3
green   3.2  -9
yellow  1.8  0.7  0   0   1
yellow  6.9  -3
red     0.8  0.8  0   1   1
green   3.4  0.9  1   0   0
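A minimal sketch of the coding-theory approach described above: pick b >= lg(c) bits, give each class a b-bit code word, train one binary learner per bit, and decode a test point to the class whose code word is nearest in Hamming distance. The code-word assignment below (just the binary index of each class) and the least-squares learner are illustrative assumptions, not the slide's own choices:

import numpy as np

def make_codes(classes):
    """Assign each class a b-bit code word, b = ceil(lg c); here simply its binary index (illustrative)."""
    b = max(1, int(np.ceil(np.log2(len(classes)))))
    return {c: [(i >> k) & 1 for k in reversed(range(b))] for i, c in enumerate(classes)}

def train_ecoc(X, y, codes):
    """Train one binary least-squares classifier per code bit."""
    bits = np.array([codes[label] for label in y])        # n x b matrix of target bits
    return [np.linalg.pinv(X) @ np.where(bits[:, j] == 1, 1.0, -1.0)
            for j in range(bits.shape[1])]

def predict_ecoc(X, models, codes):
    """Predict each bit, then decode to the class with the nearest code word (Hamming distance)."""
    pred = np.stack([(X @ w >= 0).astype(int) for w in models], axis=1)
    names, words = list(codes), np.array(list(codes.values()))
    dists = np.abs(pred[:, None, :] - words[None, :, :]).sum(axis=2)
    return np.array(names)[np.argmin(dists, axis=1)]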

Page 18:

Support Vector Machines

Page 19:

Linear separators are nice
• ... but what if your data looks like this:
[Figure: linearly nonseparable data]

Page 20:

Linearly nonseparable data
•2 possibilities:

•Use nonlinear separators (diff hypothesis space)

•Possibly intersection of multiple linear separators, etc. (E.g., decision tree)

Page 21:

Linearly nonseparable data
•2 possibilities:

•Use nonlinear separators (diff hypothesis space)

•Possibly intersection of multiple linear separators, etc. (E.g., decision tree)

•Change the data

•Nonlinear projection of data

•These turn out to be flip sides of each other

•Easier to think about (do math for) 2nd case

Page 22:

Nonlinear data projection
•Suppose you have a “projection function”:

•Original feature space

•“Projected” space

•Usually

•Do learning w/ linear model in

•Ex:
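A standard version of the setup these bullets describe (the symbol Phi and the quadratic-map example below are assumptions, not necessarily the slide's own example):

\[
\Phi : \mathcal{X} \to \tilde{\mathcal{X}}, \qquad \dim(\tilde{\mathcal{X}}) \gg \dim(\mathcal{X}),
\]
with learning done by fitting the linear model \( y(\mathbf{x}) = \mathbf{w}^\top \Phi(\mathbf{x}) + b \) in the projected space. A common example of such a projection is the quadratic feature map
\[
\Phi(x_1, x_2) = \bigl(x_1^2,\; \sqrt{2}\,x_1 x_2,\; x_2^2\bigr),
\]
which lets a linear separator in the projected space represent a quadratic (e.g., elliptical) boundary in the original one.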