
Page 1: Margins, support vectors, and linear programming, oh my!
Reading: Bishop, 4.0, 4.1, 7.0, 7.1; Burges tutorial (on class resources page)


Page 2:

Administrivia I

•Tomorrow is the deadline for the CS grad student conference (CSUSC):

• http://www.cs.unm.edu/~csgsa/conference/cfp.html

•Still time to submit!

Page 3:

Administrivia II: Straw poll
•Which would you rather do first (and probably at all)?

•Unsupervised learning

•Clustering

•Structure of data

•Scientific discovery (genomics, taxonomy, etc.)

•Reinforcement learning

•Control

•Robot navigation

•Learning behavior

•Let me know (in person, email, etc.)

Page 4:

Group Reading #1
•Due: Feb 20
•Dietterich, T. G., “An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization.” Machine Learning, 40(2), 139-157.
• DOI: 10.1023/A:1007607513941
• http://www.springerlink.com
•Background material:
• Dietterich, T. G., “Ensemble Learning.” In The Handbook of Brain Theory and Neural Networks, Second edition (M.A. Arbib, Ed.), Cambridge, MA: The MIT Press, 2002. http://www.cs.orst.edu/~tgd/publications/hbtnn-ensemble-learning.ps.gz

• DH&S Ch 9.5.1-2

Page 5:

Time rolls on...
•Last time:

•The linear regression problem

•Squared-error loss

•Vector derivatives

•The least-squares hyperplane

•This time:

•What about multiclass, nonlinear, or nonseparable gunk?

• Intro to support vector machines

Page 6:

Exercise
•Derive the vector derivative expressions:

•Find an expression for the minimum squared error weight vector, w, in the loss function:
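A minimal reconstruction of the exercise setup, assuming the standard squared-error formulation used on the following slides (targets t_i stacked in a vector t, inputs x_i stacked as the rows of X); the two derivative identities and the loss are

\[
\frac{\partial}{\partial \mathbf{w}}\bigl(\mathbf{a}^\top \mathbf{w}\bigr) = \mathbf{a},
\qquad
\frac{\partial}{\partial \mathbf{w}}\bigl(\mathbf{w}^\top A \mathbf{w}\bigr) = (A + A^\top)\,\mathbf{w},
\]
\[
L(\mathbf{w}) = \sum_i \bigl(t_i - \mathbf{w}^\top \mathbf{x}_i\bigr)^2 = \lVert \mathbf{t} - X\mathbf{w} \rVert^2 .
\]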

Page 7:

Solution to LSE regression
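A standard version of the derivation, consistent with the Gram-matrix and pseudoinverse remarks on the next slide:

\[
\nabla_{\mathbf{w}} L(\mathbf{w}) = -2\,X^\top(\mathbf{t} - X\mathbf{w}) = \mathbf{0}
\;\Longrightarrow\;
X^\top X\,\mathbf{w} = X^\top \mathbf{t}
\;\Longrightarrow\;
\mathbf{w} = (X^\top X)^{-1} X^\top \mathbf{t}.
\]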

Page 8:

The LSE method
•The quantity X^T X is called a Gram matrix and is positive semidefinite and symmetric
•The quantity (X^T X)^(-1) X^T is the pseudoinverse of X

•May not exist if Gram matrix is not invertible

•The complete “learning algorithm” is 2 whole lines of Matlab code
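A minimal sketch of that two-line fit, written here in NumPy rather than Matlab (the function names are illustrative, not from the course):

import numpy as np

def lse_fit(X, t):
    """Least-squares weights w = (X^T X)^{-1} X^T t, computed via the pseudoinverse."""
    return np.linalg.pinv(X) @ t

def lse_predict(X, w):
    """Linear prediction: t_hat = X w."""
    return X @ w

In Matlab the same fit is essentially w = pinv(X) * t (or w = X \ t), which is presumably the pair of lines the slide has in mind.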

Page 9:

LSE example

w=[6.72, -0.36]

Page 10:

LSE example
[Figure: fitted regression line t = y(x1, w)]

Page 11:

LSE example
[Figure: fitted regression line t = y(x1, w), with w = [6.72, -0.36]]

Page 12:

The LSE method

•So far, we have a regressor -- it estimates a real-valued t_i for each x_i

•Can convert it to a classifier by assigning t = +1 or -1 to binary-class training data and thresholding the output at zero (sketch below)

•Q: How do you handle non-binary data?
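A minimal sketch of the conversion mentioned two bullets up, assuming the usual convention of thresholding the regressor's output at zero (the slide leaves the threshold implicit):

import numpy as np

def lse_classify(X, w):
    """Binary decision from the linear regressor: +1 if x . w >= 0, else -1."""
    return np.where(X @ w >= 0, 1, -1)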

Page 13:

Handling non-binary data
•DTs and k-NN can handle multi-class data

•Linear discriminants (& many other learners) only work on binary data

•3 ways to “hack” binary classifiers to c-ary data:

Page 14:

Handling non-binary data
•DTs and k-NN can handle multi-class data

•Linear discriminants (& many other learners) only work on binary data

•3 ways to “hack” binary classifiers to c-ary data:

•1 against many (sketched in code after this slide):

•Train c classifiers to recognize “class 1 vs anything else”; “class 2 vs everything else” ...

•Cheap, easy

•May drastically unbalance the classes for each classifier

•What if two classifiers make diff predictions?
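A minimal sketch of the 1-against-many (one-vs-rest) scheme just described, reusing the least-squares fit as the binary learner (an assumption for illustration; any binary classifier would do). Disagreements between the c classifiers are resolved here by taking the largest raw score, which is one common fix:

import numpy as np

def train_one_vs_rest(X, y, classes):
    """Train one least-squares 'class k vs. everything else' classifier per class."""
    W = []
    for k in classes:
        t = np.where(y == k, 1.0, -1.0)   # relabel: +1 for class k, -1 for everything else
        W.append(np.linalg.pinv(X) @ t)
    return np.stack(W, axis=1)            # one weight column per class

def predict_one_vs_rest(X, W, classes):
    """Score every class for each test point and pick the largest (breaks disagreements/ties)."""
    scores = X @ W                        # n x c matrix of raw scores
    return np.asarray(classes)[np.argmax(scores, axis=1)]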

Page 15:

Multiclass trouble
[Figure: region where the one-vs-rest classifiers disagree]

Page 16:

Handling non-binary data
•All against all (sketched in code after this slide):

•Train O(c^2) classifiers, one for each pair of classes

•Run every test point through all classifiers

•Majority vote for final classifier

•More stable than 1 vs many

•Lot more overhead, esp for large c

•Data may be more balanced

•Each classifier trained on very small part of data
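A minimal sketch of the all-against-all (pairwise) scheme with majority voting, again with the least-squares fit standing in for the binary learner:

import numpy as np
from itertools import combinations

def train_all_pairs(X, y, classes):
    """Train one classifier per pair of classes, using only that pair's training points."""
    models = {}
    for a, b in combinations(classes, 2):
        mask = (y == a) | (y == b)
        t = np.where(y[mask] == a, 1.0, -1.0)
        models[(a, b)] = np.linalg.pinv(X[mask]) @ t
    return models

def predict_all_pairs(X, models, classes):
    """Run each test point through all O(c^2) classifiers and take a majority vote."""
    index = {c: i for i, c in enumerate(classes)}
    votes = np.zeros((X.shape[0], len(classes)))
    for (a, b), w in models.items():
        picks_a = X @ w >= 0
        votes[picks_a, index[a]] += 1
        votes[~picks_a, index[b]] += 1
    return np.asarray(classes)[np.argmax(votes, axis=1)]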

Page 17:

Handling non-binary data
•Coding theory approach (sketched in code after the table below)

•Given c classes, choose b≥lg(c)

•Assign each class a b-bit “code word”

•Train one classifier for each bit

•Apply each classifier to a test instance => new code => reconstruct class

x1      x2   x3   y
green   3.2  -9   apple
yellow  1.8  0.7  lemon
yellow  6.9  -3   banana
red     0.8  0.8  grape
green   3.4  0.9  pear

x1      x2   x3   y1  y2  y3
green   3.2  -9
yellow  1.8  0.7  0   0   1
yellow  6.9  -3
red     0.8  0.8  0   1   1
green   3.4  0.9  1   0   0
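A minimal sketch of the coding-theory approach described above: pick b >= lg(c) bits, give each class a b-bit code word, train one binary learner per bit, and decode a test point to the class whose code word is nearest in Hamming distance. The code-word assignment below (just the binary index of each class) and the least-squares learner are illustrative assumptions, not the slide's own choices:

import numpy as np

def make_codes(classes):
    """Assign each class a b-bit code word, b = ceil(lg c); here simply its binary index (illustrative)."""
    b = max(1, int(np.ceil(np.log2(len(classes)))))
    return {c: [(i >> k) & 1 for k in reversed(range(b))] for i, c in enumerate(classes)}

def train_ecoc(X, y, codes):
    """Train one binary least-squares classifier per code bit."""
    bits = np.array([codes[label] for label in y])        # n x b matrix of target bits
    return [np.linalg.pinv(X) @ np.where(bits[:, j] == 1, 1.0, -1.0)
            for j in range(bits.shape[1])]

def predict_ecoc(X, models, codes):
    """Predict each bit, then decode to the class with the nearest code word (Hamming distance)."""
    pred = np.stack([(X @ w >= 0).astype(int) for w in models], axis=1)
    names, words = list(codes), np.array(list(codes.values()))
    dists = np.abs(pred[:, None, :] - words[None, :, :]).sum(axis=2)
    return np.array(names)[np.argmin(dists, axis=1)]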

Page 18:

Support Vector Machines

Page 19:

Linear separators are nice
• ... but what if your data looks like this:
[Figure: linearly nonseparable data]

Page 20:

Linearly nonseparable data
•2 possibilities:

•Use nonlinear separators (diff hypothesis space)

•Possibly intersection of multiple linear separators, etc. (E.g., decision tree)

Page 21:

Linearly nonseparable data
•2 possibilities:

•Use nonlinear separators (diff hypothesis space)

•Possibly intersection of multiple linear separators, etc. (E.g., decision tree)

•Change the data

•Nonlinear projection of data

•These turn out to be flip sides of each other

•Easier to think about (do math for) 2nd case

Page 22:

Nonlinear data projection
•Suppose you have a “projection function”:

•Original feature space

•“Projected” space

•Usually

•Do learning w/ linear model in

•Ex:
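A standard version of the setup these bullets describe (the symbol Phi and the quadratic-map example below are assumptions, not necessarily the slide's own example):

\[
\Phi : \mathcal{X} \to \tilde{\mathcal{X}}, \qquad \dim(\tilde{\mathcal{X}}) \gg \dim(\mathcal{X}),
\]
with learning done by fitting the linear model \( y(\mathbf{x}) = \mathbf{w}^\top \Phi(\mathbf{x}) + b \) in the projected space. A common example of such a projection is the quadratic feature map
\[
\Phi(x_1, x_2) = \bigl(x_1^2,\; \sqrt{2}\,x_1 x_2,\; x_2^2\bigr),
\]
which lets a linear separator in the projected space represent a quadratic (e.g., elliptical) boundary in the original one.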