support vector machines

Support Vector Machines

Refer to Andrew’s tutorials: http://www.cs.cmu.edu/~awm/tutorials

Support Vector Machines: Slide 2

Linear Classifiers

denotes +1

denotes -1

f(x,w,b) = sign(w. x - b)

How would you classify this data?

Linear Classifiers

denotes +1

denotes -1

Linear Classifiers

denotes +1

denotes -1

Linear Classifiers

denotes +1

denotes -1

Linear Classifiers

denotes +1

denotes -1

Any of these would be fine..

..but which is best?

Classifier Margin

denotes +1

denotes -1

Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.

Maximum Margin

denotes +1

denotes -1

The maximum margin linear classifier is the linear classifier with the, um, maximum margin.

This is the simplest kind of SVM (Called an LSVM)Linear SVM

Maximum Margin

denotes +1

denotes -1

This is the simplest kind of SVM (Called an LSVM)

Support Vectors are those datapoints that the margin pushes up against

Linear SVM

Why Maximum Margin?

denotes +1

denotes -1

This is the simplest kind of SVM (Called an LSVM)

Support Vectors are those datapoints that the margin pushes up against

1. Intuitively this feels safest.

2. If we’ve made a small error in the location of the boundary (it’s been jolted in its perpendicular direction) this gives us least chance of causing a misclassification.

3. LOOCV is easy since the model is immune to removal of any non-support-vector datapoints.

4. There’s some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.

5. Empirically it works very very well.

Estimate the Margin

• What is the distance expression for a point x to a line wx+b= 0?

denotes +1

denotes -1 x wx +b = 0

Estimate the Margin

• What is the expression for margin?

denotes +1

denotes -1 wx +b = 0

Margin

Maximize Margindenotes

denotes -1wx +b =

Margin

denotes -1wx +b =

Margin

• Min-max problem game problem

denotes -1wx +b =

Margin

Strategy:

Maximum Margin Linear Classifier

• How to solve it?

Learning via Quadratic Programming

• QP is a well-studied class of optimization algorithms to maximize a quadratic function of some real-valued variables subject to linear constraints.

Quadratic Programming

2maxarg

TT Find

nmnmnn

buauaua

22222121

11212111

)()(22)(11)(

)2()2(22)2(11)2(

)1()1(22)1(11)1(

enmmenenen

nmmnnn

buauaua

And subject to

n additional linear inequality constraints

linear

uality

Quadratic criterion

Subject to

Quadratic Programming

2maxarg

TT Find

Subject to

nmnmnn

buauaua

22222121

11212111

)()(22)(11)(

)2()2(22)2(11)2(

)1()1(22)1(11)1(

enmmenenen

nmmnnn

buauaua

And subject to

n additional linear inequality constraints

linear

uality

Quadratic criterion

There exist algorithms for

finding such constrained

quadratic optima much

more efficiently and

reliably than gradient

ascent.

(But they are very fiddly…you

probably don’t want to

write one yourself)

Uh-oh!

denotes +1

denotes -1

This is going to be a problem!

What should we do?

Uh-oh!

denotes +1

denotes -1

What should we do?

Idea 1:

Find minimum w.w, while minimizing number of training set errors.

Problemette: Two things to minimize makes for an ill-defined optimization

Uh-oh!

denotes +1

denotes -1

What should we do?

Idea 1.1:

Minimize

w.w + C (#train errors)

There’s a serious practical problem that’s about to make us reject this approach. Can you guess what it is?

Tradeoff parameter

Uh-oh!

denotes +1

denotes -1

What should we do?

Idea 1.1:

Minimize

w.w + C (#train errors)

There’s a serious practical problem that’s about to make us reject this approach. Can you guess what it is?

Tradeoff parameterCan’t be expressed as a Quadratic

Programming problem.

Solving it may be too slow.

(Also, doesn’t distinguish between disastrous errors and near

misses)

So… any

ideas?

Uh-oh!

denotes +1

denotes -1

What should we do?

Idea 2.0:

Minimize w.w + C (distance of error points to their correct place)

Support Vector Machine (SVM) for Noisy Data

• Any problem with the above formulism?

denotes +1

denotes -1

Support Vector Machine (SVM) for Noisy Data

• Balance the trade off between margin and classification errors

denotes +1

denotes -1

Support Vector Machine for Noisy Data

How do we determine the appropriate value for c ?

An Equivalent QP: Determine b

A linear programming problem !

Suppose we’re in 1-dimension

What would SVMs do with this data?

Suppose we’re in 1-dimension

Not a big surprise

Positive “plane” Negative “plane”

Harder 1-dimensional dataset

That’s wiped the smirk off SVM’s face.

What can be done about this?

Harder 1-dimensional datasetRemember how

permitting non-linear basis functions made linear regression so much nicer?

Let’s permit them here too

x=0 ),( 2kkk xxz

Harder 1-dimensional datasetRemember how

permitting non-linear basis functions made linear regression so much nicer?

Let’s permit them here too

x=0 ),( 2kkk xxz

support vector machines

marginsupport vector

mooresupport vector

nonsupportvector datapoints

marginwx b

linear constraints

xwx b

maximum marginf xayestdenotes

x bhow

Documents

kernel methods: support vector machines maximum margin...

using support vector machines, convolutional … support...

support vector machines hyperplane...

support vector machines & kernel machines

using support vector machines, convolutional neural ... ·...

support vector machines - mit opencourseware · •...

support vector machines

support vector machines kernel machines

support vector machines -...