2013-1 Machine Learning Lecture 05 - Andrew Moore - Support Vector Machines


Page 1: Support Vector Machines

Nov 23rd, 2001 Copyright © 2001, 2003, Andrew W. Moore

Support Vector Machines

Modified from the slides by Dr. Andrew W. Moore

http://www.cs.cmu.edu/~awm/tutorials

Page 2: Linear Classifiers

f(x, w, b) = sign(w·x − b)

[Figure: input x → classifier f → estimated label y; training points, where one marker denotes +1 and the other denotes −1]

How would you classify this data?

Page 3: Linear Classifiers

f(x, w, b) = sign(w·x − b)

[Figure: the same labeled data, with a candidate separating line]

How would you classify this data?

Page 4: Linear Classifiers

f(x, w, b) = sign(w·x − b)

[Figure: the same labeled data, with another candidate separating line]

How would you classify this data?

Page 5: Linear Classifiers

f(x, w, b) = sign(w·x − b)

[Figure: the same labeled data, with another candidate separating line]

How would you classify this data?

Page 6: Linear Classifiers

f(x, w, b) = sign(w·x − b)

[Figure: the same labeled data, with several candidate separating lines]

Any of these would be fine..

..but which is best?

Page 7: Classifier Margin

f(x, w, b) = sign(w·x − b)

[Figure: a separating line with the margin band drawn around it; one marker denotes +1, the other denotes −1]

Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.

Page 8: Maximum Margin

f(x, w, b) = sign(w·x − b)

[Figure: the maximum margin separating line; one marker denotes +1, the other denotes −1]

The maximum margin linear classifier is the linear classifier with the, um, maximum margin.

This is the simplest kind of SVM (called an LSVM).

Linear SVM

Page 9: Maximum Margin

f(x, w, b) = sign(w·x − b)

[Figure: the maximum margin separating line with the datapoints on the margin highlighted]

The maximum margin linear classifier is the linear classifier with the, um, maximum margin.

This is the simplest kind of SVM (called an LSVM).

Support Vectors are those datapoints that the margin pushes up against.

Linear SVM

Page 10: Why Maximum Margin?

f(x, w, b) = sign(w·x − b)

[Figure: the maximum margin classifier with its support vectors highlighted; one marker denotes +1, the other denotes −1]

The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (called an LSVM). Support Vectors are those datapoints that the margin pushes up against.

1. Intuitively this feels safest.

2. If we’ve made a small error in the location of the boundary (it’s been jolted in its perpendicular direction) this gives us least chance of causing a misclassification.

3. LOOCV is easy since the model is immune to removal of any non-support-vector datapoints.

4. There’s some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.

5. Empirically it works very very well.

Page 11: Estimate the Margin

• What is the distance expression for a point x to a line w·x + b = 0?

[Figure: a point x and the line w·x + b = 0; one marker denotes +1, the other denotes −1]

d(x) = |w·x + b| / ||w||_2 = |w·x + b| / sqrt( Σ_{i=1..d} w_i² )
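The distance formula above translates directly into a few lines of code. Below is a minimal NumPy sketch (the points, weights, and offset are made-up values for illustration); it also takes the minimum distance over the data, which is exactly the margin defined on the next slide.

```python
import numpy as np

def distance_to_hyperplane(X, w, b):
    """Distance of each row of X to the hyperplane w.x + b = 0."""
    return np.abs(X @ w + b) / np.linalg.norm(w)

# Made-up 2-D data and a candidate separating line (illustrative values only).
X = np.array([[2.0, 1.0], [1.0, -1.5], [-1.0, -2.0], [-2.5, 0.5]])
w = np.array([1.0, 0.5])
b = -0.25

d = distance_to_hyperplane(X, w, b)
print("distances:", d)
print("margin (minimum distance over the data):", d.min())
```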

Page 12: Estimate the Margin

• What is the expression for margin?

[Figure: the line w·x + b = 0 with the margin band; one marker denotes +1, the other denotes −1]

margin = min_{x ∈ D} d(x) = min_{x ∈ D} |w·x + b| / sqrt( Σ_{i=1..d} w_i² )

Page 13: Maximize Margin

[Figure: the line w·x + b = 0 with the margin band; one marker denotes +1, the other denotes −1]

argmax_{w,b} margin(w, b, D)
  = argmax_{w,b} min_{x_i ∈ D} d(x_i)
  = argmax_{w,b} min_{x_i ∈ D} |w·x_i + b| / sqrt( Σ_{j=1..d} w_j² )

Page 14: Maximize Margin

[Figure: the line w·x + b = 0 with the margin band]

argmax_{w,b} min_{x_i ∈ D} |w·x_i + b| / sqrt( Σ_{j=1..d} w_j² )

subject to  ∀x_i ∈ D :  y_i (w·x_i + b) ≥ 0

• Min-max problem → game problem

Page 15: Maximize Margin

[Figure: the line w·x + b = 0 with the margin band]

argmax_{w,b} min_{x_i ∈ D} |w·x_i + b| / sqrt( Σ_{j=1..d} w_j² )

subject to  ∀x_i ∈ D :  y_i (w·x_i + b) ≥ 0

Strategy: fix the scale of (w, b) so that  min_{x_i ∈ D} |w·x_i + b| = 1.  Then the problem becomes

argmin_{w,b} Σ_{i=1..d} w_i²

subject to  ∀x_i ∈ D :  y_i (w·x_i + b) ≥ 1

Page 16: Maximum Margin Linear Classifier

• How to solve it?

{w*, b*} = argmin_{w,b} Σ_{k=1..d} w_k²

subject to

y_1 (w·x_1 + b) ≥ 1
y_2 (w·x_2 + b) ≥ 1
....
y_N (w·x_N + b) ≥ 1

Page 17: Learning via Quadratic Programming

• QP is a well-studied class of optimization algorithms for optimizing a quadratic function of some real-valued variables subject to linear constraints.

Page 18: Quadratic Programming

Find  u* = argmin_u  c + d^T u + (1/2) u^T R u        (quadratic criterion)

subject to n additional linear inequality constraints:

a_11 u_1 + a_12 u_2 + ... + a_1m u_m ≤ b_1
a_21 u_1 + a_22 u_2 + ... + a_2m u_m ≤ b_2
  :
a_n1 u_1 + a_n2 u_2 + ... + a_nm u_m ≤ b_n

and subject to e additional linear equality constraints:

a_(n+1)1 u_1 + a_(n+1)2 u_2 + ... + a_(n+1)m u_m = b_(n+1)
a_(n+2)1 u_1 + a_(n+2)2 u_2 + ... + a_(n+2)m u_m = b_(n+2)
  :
a_(n+e)1 u_1 + a_(n+e)2 u_2 + ... + a_(n+e)m u_m = b_(n+e)

Page 19: Quadratic Programming

(The same formulation as the previous slide: the quadratic criterion c + d^T u + (1/2) u^T R u, subject to the n linear inequality constraints and the e additional linear equality constraints.)

Page 20: Quadratic Programming

{w*, b*} = argmin_{w,b} Σ_i w_i²
subject to  y_i (w·x_i + b) ≥ 1  for all training data (x_i, y_i)

Written as a QP in the variables (w, b): the criterion w·w is a quadratic form (with no linear term), and

y_1 (w·x_1 + b) ≥ 1
y_2 (w·x_2 + b) ≥ 1
....
y_N (w·x_N + b) ≥ 1

are N linear inequality constraints.
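In practice this QP can be handed to any off-the-shelf solver. Below is a minimal sketch, assuming the cvxopt package is available (the solver, the stacked variable u = [w; b], and the tiny dataset are my own choices, not prescribed by the slides):

```python
import numpy as np
from cvxopt import matrix, solvers  # assumes cvxopt is installed

def hard_margin_svm(X, y):
    """Solve argmin ||w||^2 s.t. y_i (w.x_i + b) >= 1 as a QP in u = [w; b]."""
    n, m = X.shape
    # cvxopt minimizes (1/2) u'Pu + q'u  subject to  G u <= h
    P = np.zeros((m + 1, m + 1))
    P[:m, :m] = 2.0 * np.eye(m)          # quadratic term: ||w||^2, b is unpenalized
    q = np.zeros(m + 1)
    G = -y[:, None] * np.hstack([X, np.ones((n, 1))])  # -y_i [x_i, 1] u <= -1
    h = -np.ones(n)
    solvers.options['show_progress'] = False
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    u = np.array(sol['x']).ravel()
    return u[:m], u[m]

# Tiny linearly separable toy set (made-up points).
X = np.array([[2.0, 2.0], [2.5, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = hard_margin_svm(X, y)
print("w =", w, " b =", b)
print("constraint values y_i (w.x_i + b):", y * (X @ w + b))
```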

Page 21: Uh-oh!

[Figure: labeled data that no straight line can separate; one marker denotes +1, the other denotes −1]

This is going to be a problem!

What should we do?

Page 22: Uh-oh!

This is going to be a problem!

What should we do?

Idea 1:

Find minimum w·w, while minimizing number of training set errors.

Problemette: Two things to minimize makes for an ill-defined optimization.

Page 23: Uh-oh!

This is going to be a problem!

What should we do?

Idea 1.1:

Minimize  w·w + C (#train errors),  where C is a tradeoff parameter.

There’s a serious practical problem that’s about to make us reject this approach. Can you guess what it is?

Page 24: Uh-oh!

Idea 1.1:

Minimize  w·w + C (#train errors),  where C is a tradeoff parameter.

There’s a serious practical problem that’s about to make us reject this approach. Can you guess what it is?

Can’t be expressed as a Quadratic Programming problem. Solving it may be too slow.

(Also, doesn’t distinguish between disastrous errors and near misses.)

Page 25: Uh-oh!

This is going to be a problem!

What should we do?

Idea 2.0:

Minimize  w·w + C (distance of error points to their correct place)

Page 26: Support Vector Machine (SVM) for Noisy Data

• Any problem with the formulation below?

{w*, b*} = argmin_{w,b} Σ_{i=1..d} w_i² + C Σ_{j=1..N} ε_j

subject to

y_1 (w·x_1 + b) ≥ 1 − ε_1
y_2 (w·x_2 + b) ≥ 1 − ε_2
...
y_N (w·x_N + b) ≥ 1 − ε_N

[Figure: three points violating the margin, with slacks ε_1, ε_2, ε_3; one marker denotes +1, the other denotes −1]

Page 27: Support Vector Machine (SVM) for Noisy Data

• Balance the trade-off between margin and classification errors

{w*, b*} = argmin_{w,b} Σ_{i=1..d} w_i² + C Σ_{j=1..N} ε_j

subject to

y_1 (w·x_1 + b) ≥ 1 − ε_1,  ε_1 ≥ 0
y_2 (w·x_2 + b) ≥ 1 − ε_2,  ε_2 ≥ 0
...
y_N (w·x_N + b) ≥ 1 − ε_N,  ε_N ≥ 0

[Figure: the same three margin-violating points with slacks ε_1, ε_2, ε_3]

Page 28: Support Vector Machine for Noisy Data

{w*, b*} = argmin_{w,b} Σ_{i=1..d} w_i² + C Σ_{j=1..N} ε_j

subject to the inequality constraints

y_1 (w·x_1 + b) ≥ 1 − ε_1,  ε_1 ≥ 0
y_2 (w·x_2 + b) ≥ 1 − ε_2,  ε_2 ≥ 0
....
y_N (w·x_N + b) ≥ 1 − ε_N,  ε_N ≥ 0

How do we determine the appropriate value for C?
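The slides leave that question open; a standard practical answer is to pick C by cross-validation. Here is a minimal sketch using scikit-learn (the library, the synthetic data, and the candidate C grid are my additions, not part of the lecture):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic two-class data standing in for the noisy dataset on the slide.
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Cross-validate over a grid of C values for a linear soft-margin SVM.
grid = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X_tr, y_tr)

print("best C:", grid.best_params_["C"])
print("held-out accuracy:", grid.score(X_te, y_te))
```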

Page 29: The Dual Form of QP

Maximize   Σ_{k=1..R} α_k − (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l (x_k·x_l)

Subject to these constraints:

0 ≤ α_k ≤ C  for all k
Σ_{k=1..R} α_k y_k = 0

Then define:

w = Σ_{k=1..R} α_k y_k x_k

Then classify with:

f(x,w,b) = sign(w·x − b)

Page 30: The Dual Form of QP

Maximize   Σ_{k=1..R} α_k − (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l (x_k·x_l)

Subject to these constraints:

0 ≤ α_k ≤ C  for all k
Σ_{k=1..R} α_k y_k = 0

Then define:

w = Σ_{k=1..R} α_k y_k x_k

Page 31: An Equivalent QP

Maximize   Σ_{k=1..R} α_k − (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l (x_k·x_l)

Subject to these constraints:

0 ≤ α_k ≤ C  for all k
Σ_{k=1..R} α_k y_k = 0

Then define:

w = Σ_{k=1..R} α_k y_k x_k

Datapoints with α_k > 0 will be the support vectors

..so this sum only needs to be over the support vectors.

Page 32: Support Vectors

[Figure: the decision boundary with the two margin planes w·x + b = +1 and w·x + b = −1, the normal vector w, and the support vectors lying on the margin planes; one marker denotes +1, the other denotes −1]

Decision boundary is determined only by those support vectors!

w = Σ_{k=1..R} α_k y_k x_k

∀i :  α_i ( y_i (w·x_i + b) − 1 ) = 0

α_i = 0 for non-support vectors

α_i ≠ 0 for support vectors
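After fitting a linear SVM you can read off exactly the objects this slide talks about: the nonzero α_k, the corresponding support vectors, and w rebuilt as Σ α_k y_k x_k. A small sketch with scikit-learn on made-up data (the library and the data are my own choices, not part of the slides):

```python
import numpy as np
from sklearn.svm import SVC

# Made-up separable 2-D data.
X = np.array([[2.0, 2.0], [2.5, 1.0], [3.0, 2.5],
              [-1.0, -1.5], [-2.0, -0.5], [-1.5, -2.5]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=10.0).fit(X, y)

print("support vector indices:", clf.support_)
print("support vectors:\n", clf.support_vectors_)
# dual_coef_ holds alpha_k * y_k for the support vectors only.
w_from_alphas = clf.dual_coef_ @ clf.support_vectors_
print("w rebuilt from the alphas:", w_from_alphas.ravel())
print("w reported by the solver: ", clf.coef_.ravel())
```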

Page 33: The Dual Form of QP

Maximize   Σ_{k=1..R} α_k − (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l (x_k·x_l)

Subject to these constraints:

0 ≤ α_k ≤ C  for all k
Σ_{k=1..R} α_k y_k = 0

Then define:

w = Σ_{k=1..R} α_k y_k x_k

Then classify with:

f(x,w,b) = sign(w·x − b)

How to determine b?

Page 34: An Equivalent QP: Determine b

A linear programming problem!

Recall the soft-margin primal:

{w*, b*} = argmin_{w,b} Σ_{i=1..d} w_i² + C Σ_{j=1..N} ε_j
subject to  y_j (w·x_j + b) ≥ 1 − ε_j,  ε_j ≥ 0,  for j = 1..N

Fix w (from the dual solution). Then b is found by solving

{b*, ε*} = argmin_{b, ε} Σ_{j=1..N} ε_j
subject to  y_j (w·x_j + b) ≥ 1 − ε_j,  ε_j ≥ 0,  for j = 1..N

Page 35: An Equivalent QP

Maximize   Σ_{k=1..R} α_k − (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l (x_k·x_l)

Subject to these constraints:

0 ≤ α_k ≤ C  for all k
Σ_{k=1..R} α_k y_k = 0

Then define:

w = Σ_{k=1..R} α_k y_k x_k

b = y_K (1 − ε_K) − x_K·w    where  K = argmax_k α_k

Then classify with:

f(x,w,b) = sign(w·x − b)

Datapoints with α_k > 0 will be the support vectors

..so this sum only needs to be over the support vectors.

Why did I tell you about this equivalent QP?

• It’s a formulation that QP packages can optimize more quickly

• Because of further jaw-dropping developments you’re about to learn.
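The b formula above is easy to apply once the α's are known. A minimal NumPy sketch follows; the α's, labels, and points are made-up illustrative values (not an actual dual optimum), and ε_K is taken as 0, i.e. the chosen support vector is assumed to sit exactly on the margin.

```python
import numpy as np

# Made-up dual solution for a toy problem: alphas, labels, and points.
alpha = np.array([0.7, 0.0, 0.7, 0.0])
y = np.array([1.0, 1.0, -1.0, -1.0])
X = np.array([[2.0, 1.0], [3.0, 2.0], [0.5, -0.5], [-1.0, -2.0]])

# w = sum_k alpha_k y_k x_k  (only support vectors, alpha_k > 0, contribute).
w = (alpha * y) @ X

# b = y_K (1 - eps_K) - x_K . w  with K = argmax_k alpha_k and eps_K assumed 0.
K = np.argmax(alpha)
b = y[K] * 1.0 - X[K] @ w

# The slides write the classifier as sign(w.x - b) but state the constraints with
# w.x + b; here we stay with the +b convention throughout, for consistency.
print("w =", w, " b =", b)
print("predictions:", np.sign(X @ w + b))
```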

Page 36: Suppose we’re in 1-dimension

What would SVMs do with this data?

[Figure: points on the real line, with x = 0 marked]

Page 37: Suppose we’re in 1-dimension

Not a big surprise

[Figure: the 1-D maximum margin split, with the positive “plane” and negative “plane” marked around x = 0]

Page 38: Harder 1-dimensional dataset

That’s wiped the smirk off SVM’s face.

What can be done about this?

[Figure: 1-D data around x = 0 that no single threshold can separate]

Page 39: Harder 1-dimensional dataset

Remember how permitting non-linear basis functions made linear regression so much nicer?

Let’s permit them here too:

z_k = ( x_k , x_k² )

[Figure: the same 1-D data, with x = 0 marked]

Page 40: Harder 1-dimensional dataset

Remember how permitting non-linear basis functions made linear regression so much nicer?

Let’s permit them here too:

z_k = ( x_k , x_k² )

[Figure: the data mapped into (x, x²) space, where it becomes linearly separable]

Page 41: Common SVM basis functions

z_k = ( polynomial terms of x_k of degree 1 to q )

z_k = ( radial basis functions of x_k ):   z_k[j] = φ_j(x_k) = exp( −||x_k − c_j||² / σ² )

z_k = ( sigmoid functions of x_k )

This is sensible.

Is that the end of the story?

No…there’s one more trick!

Page 42: Quadratic Basis Functions

Φ(x) = ( 1,                                          (constant term)
         √2·x_1, √2·x_2, ..., √2·x_m,                 (linear terms)
         x_1², x_2², ..., x_m²,                       (pure quadratic terms)
         √2·x_1x_2, √2·x_1x_3, ..., √2·x_1x_m,
         √2·x_2x_3, ..., √2·x_{m−1}x_m )              (quadratic cross-terms)

Number of terms (assuming m input dimensions) = (m+2)-choose-2 = (m+2)(m+1)/2 = (as near as makes no difference) m²/2

You may be wondering what those √2’s are doing.

• You should be happy that they do no harm

• You’ll find out why they’re there soon.

Page 43: QP (old)

Maximize   Σ_{k=1..R} α_k − (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l (x_k·x_l)

Subject to these constraints:

0 ≤ α_k ≤ C  for all k
Σ_{k=1..R} α_k y_k = 0

Then define:

w = Σ_{k=1..R} α_k y_k x_k

Then classify with:

f(x,w,b) = sign(w·x − b)

Page 44: QP with basis functions

Maximize   Σ_{k=1..R} α_k − (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l ( Φ(x_k)·Φ(x_l) )

Subject to these constraints:

0 ≤ α_k ≤ C  for all k
Σ_{k=1..R} α_k y_k = 0

Then define:

w = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k)

Then classify with:

f(x,w,b) = sign(w·Φ(x) − b)

Most important changes:  x → Φ(x)

Page 45: QP with basis functions

Maximize   Σ_{k=1..R} α_k − (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l ( Φ(x_k)·Φ(x_l) )

Subject to these constraints:

0 ≤ α_k ≤ C  for all k
Σ_{k=1..R} α_k y_k = 0

Then define:

w = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k)

Then classify with:

f(x,w,b) = sign(w·Φ(x) − b)

We must do R²/2 dot products to get this matrix ready.

Each dot product requires m²/2 additions and multiplications.

The whole thing costs R² m² / 4. Yeeks!

…or does it?

Page 46: Quadratic Dot Products

Φ(a)·Φ(b) = ( 1, √2 a_1, ..., √2 a_m, a_1², ..., a_m², √2 a_1a_2, ..., √2 a_{m−1}a_m )
          · ( 1, √2 b_1, ..., √2 b_m, b_1², ..., b_m², √2 b_1b_2, ..., √2 b_{m−1}b_m )

  =  1  +  2 Σ_{i=1..m} a_i b_i  +  Σ_{i=1..m} a_i² b_i²  +  2 Σ_{i=1..m} Σ_{j=i+1..m} a_i a_j b_i b_j

Page 47: Quadratic Dot Products

Φ(a)·Φ(b) = 1 + 2 Σ_{i=1..m} a_i b_i + Σ_{i=1..m} a_i² b_i² + 2 Σ_{i=1..m} Σ_{j=i+1..m} a_i a_j b_i b_j

Just out of casual, innocent, interest, let’s look at another function of a and b:

(a·b + 1)²
  = (a·b)² + 2(a·b) + 1
  = ( Σ_{i=1..m} a_i b_i )² + 2 Σ_{i=1..m} a_i b_i + 1
  = Σ_{i=1..m} Σ_{j=1..m} a_i b_i a_j b_j + 2 Σ_{i=1..m} a_i b_i + 1
  = Σ_{i=1..m} a_i² b_i² + 2 Σ_{i=1..m} Σ_{j=i+1..m} a_i a_j b_i b_j + 2 Σ_{i=1..m} a_i b_i + 1

Page 48: Quadratic Dot Products

Φ(a)·Φ(b) = 1 + 2 Σ_{i=1..m} a_i b_i + Σ_{i=1..m} a_i² b_i² + 2 Σ_{i=1..m} Σ_{j=i+1..m} a_i a_j b_i b_j

Just out of casual, innocent, interest, let’s look at another function of a and b:

(a·b + 1)²
  = (a·b)² + 2(a·b) + 1
  = ( Σ_{i=1..m} a_i b_i )² + 2 Σ_{i=1..m} a_i b_i + 1
  = Σ_{i=1..m} Σ_{j=1..m} a_i b_i a_j b_j + 2 Σ_{i=1..m} a_i b_i + 1
  = Σ_{i=1..m} a_i² b_i² + 2 Σ_{i=1..m} Σ_{j=i+1..m} a_i a_j b_i b_j + 2 Σ_{i=1..m} a_i b_i + 1

They’re the same!

And this is only O(m) to compute!
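The identity is easy to check numerically. Below is a small NumPy sketch (my own, not from the slides) that builds the explicit quadratic feature map Φ with its √2 factors and verifies that Φ(a)·Φ(b) equals (a·b + 1)² for random vectors:

```python
import numpy as np

def phi_quadratic(x):
    """Explicit quadratic feature map: 1, sqrt(2)*x_i, x_i^2, sqrt(2)*x_i*x_j (i<j)."""
    m = len(x)
    cross = [np.sqrt(2) * x[i] * x[j] for i in range(m) for j in range(i + 1, m)]
    return np.concatenate(([1.0], np.sqrt(2) * x, x ** 2, cross))

rng = np.random.default_rng(0)
m = 6
a, b = rng.normal(size=m), rng.normal(size=m)

lhs = phi_quadratic(a) @ phi_quadratic(b)   # O(m^2) explicit dot product
rhs = (a @ b + 1) ** 2                      # O(m) kernel evaluation
print(lhs, rhs)                             # the two values agree
print("feature dimension:", len(phi_quadratic(a)),
      "= (m+2)(m+1)/2 =", (m + 2) * (m + 1) // 2)
```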

Page 49: QP with Quintic basis functions

Maximize   Σ_{k=1..R} α_k − (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l ( Φ(x_k)·Φ(x_l) )

Subject to these constraints:

0 ≤ α_k ≤ C  for all k
Σ_{k=1..R} α_k y_k = 0

Then define:

w = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k)

Then classify with:

f(x,w,b) = sign(w·Φ(x) − b)

Page 50: QP with Quadratic basis functions

Maximize   Σ_{k=1..R} α_k − (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l K(x_k, x_l)

Subject to these constraints:

0 ≤ α_k ≤ C  for all k
Σ_{k=1..R} α_k y_k = 0

Then define:

w = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k)

b = y_K (1 − ε_K) − x_K·w    where  K = argmax_k α_k

Then classify with:

f(x,w,b) = sign( K(w, x) − b )

Most important change:

Φ(x_k)·Φ(x_l) = K(x_k, x_l)

Page 51: Higher Order Polynomials

Polynomial | φ(x)                           | Cost to build Q_kl matrix traditionally | Cost if 100 inputs | φ(a)·φ(b)  | Cost to build Q_kl matrix sneakily | Cost if 100 inputs
Quadratic  | all m²/2 terms up to degree 2  | m² R² / 4                               | 2,500 R²           | (a·b + 1)² | m R² / 2                           | 50 R²
Cubic      | all m³/6 terms up to degree 3  | m³ R² / 12                              | 83,000 R²          | (a·b + 1)³ | m R² / 2                           | 50 R²
Quartic    | all m⁴/24 terms up to degree 4 | m⁴ R² / 48                              | 1,960,000 R²       | (a·b + 1)⁴ | m R² / 2                           | 50 R²

Page 52: SVM Kernel Functions

• K(a,b) = (a·b + 1)^d is an example of an SVM Kernel Function

• Beyond polynomials there are other very high dimensional basis functions that can be made practical by finding the right Kernel Function

• Radial-Basis-style Kernel Function:   K(a,b) = exp( −(a − b)² / (2σ²) )

• Neural-net-style Kernel Function:   K(a,b) = tanh( κ a·b − δ )
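These kernels are available off the shelf; for instance scikit-learn's SVC exposes them through its kernel, degree, coef0, and gamma parameters. A minimal sketch, with my own choice of library, dataset, and parameter values: (a·b + 1)^d corresponds to kernel='poly' with gamma=1 and coef0=1, and exp(−||a−b||²/(2σ²)) corresponds to kernel='rbf' with gamma = 1/(2σ²).

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two-moons data: not linearly separable, a classic kernel-SVM demo (my choice of dataset).
X, y = make_moons(n_samples=200, noise=0.15, random_state=0)

# (a.b + 1)^3 polynomial kernel.
poly_svm = SVC(kernel="poly", degree=3, gamma=1.0, coef0=1.0, C=1.0).fit(X, y)

# Radial-basis kernel exp(-||a-b||^2 / (2*sigma^2)) with sigma = 0.5, i.e. gamma = 2.0.
rbf_svm = SVC(kernel="rbf", gamma=2.0, C=1.0).fit(X, y)

print("poly training accuracy:", poly_svm.score(X, y))
print("rbf  training accuracy:", rbf_svm.score(X, y))
```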

Page 53: Kernel Tricks

• Replacing the dot product with a kernel function

• Not all functions are kernel functions

• Need to be decomposable:  K(a,b) = Φ(a)·Φ(b)

• Could K(a,b) = (a − b)³ be a kernel function?

• Could K(a,b) = (a − b)⁴ − (a + b)² be a kernel function?

Page 54: Kernel Tricks

• Mercer’s condition

To expand a kernel function K(x,y) into a dot product, i.e. K(x,y) = Φ(x)·Φ(y), K(x,y) has to be a positive semi-definite function, i.e., for any function f(x) whose ∫ f(x)² dx is finite, the following inequality holds:

∫∫ f(x) K(x,y) f(y) dx dy ≥ 0

• Could K(x,y) = Σ_i (x_i y_i)^p be a kernel function?
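Mercer's condition is stated over all square-integrable f, which cannot be checked directly; a common finite-sample sanity check is that the Gram matrix K(x_i, x_j) on any set of points must be symmetric and positive semi-definite (no negative eigenvalues). The sketch below is my own illustration: it runs that check for the polynomial kernel and for the candidate K(a,b) = (a − b)⁴ − (a + b)² from the previous slide. Failing the check on some sample proves a function is not a kernel; passing it is only evidence.

```python
import numpy as np

def min_gram_eigenvalue(kernel, xs):
    """Smallest eigenvalue of the (symmetric) Gram matrix K[i, j] = kernel(x_i, x_j)."""
    K = np.array([[kernel(a, b) for b in xs] for a in xs])
    return np.linalg.eigvalsh(K).min()

rng = np.random.default_rng(0)
xs = rng.normal(size=20)                       # 1-D sample points, for simplicity

poly_kernel = lambda a, b: (a * b + 1.0) ** 2          # a genuine kernel
candidate = lambda a, b: (a - b) ** 4 - (a + b) ** 2   # symmetric, but is it a kernel?

print("poly kernel, min Gram eigenvalue:", min_gram_eigenvalue(poly_kernel, xs))  # >= 0 (up to rounding)
print("candidate,   min Gram eigenvalue:", min_gram_eigenvalue(candidate, xs))    # negative: not a kernel
```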

Page 55: Kernel Tricks

• Pro

• Introduces nonlinearity into the model

• Computationally cheap

• Con

• Still has potential overfitting problems

Page 56: Nonlinear Kernel (I)

[Figure: nonlinear kernel example (I)]

Page 57: Nonlinear Kernel (II)

[Figure: nonlinear kernel example (II)]

Page 58: Kernelize Logistic Regression

How can we introduce the nonlinearity into the logistic regression?

p(y | x) = 1 / ( 1 + exp( −y (x·w) ) )

l_reg(w) = Σ_{i=1..N} log[ 1 / ( 1 + exp( −y_i (x_i·w) ) ) ] + c ||w||²

Page 59: Kernelize Logistic Regression

• Representer Theorem

x → Φ(x),    w = Σ_{i=1..N} α_i Φ(x_i)

x·w → K(w, x) = Σ_{i=1..N} α_i K(x, x_i)

p(y | x) = 1 / ( 1 + exp( −y K(w, x) ) ) = 1 / ( 1 + exp( −y Σ_{i=1..N} α_i K(x, x_i) ) )

l_reg(α) = Σ_{i=1..N} log[ 1 / ( 1 + exp( −y_i Σ_{j=1..N} α_j K(x_i, x_j) ) ) ] + c Σ_{i,j=1..N} α_i α_j K(x_i, x_j)
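In the α-parameterization above, the objective can be optimized directly by gradient descent. The sketch below is a minimal NumPy implementation under my own choices (an RBF kernel, regularization weight lam, a fixed step size, and the minimization form of the regularized logistic loss); it illustrates the representer-theorem parameterization rather than the lecture's exact procedure.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_logistic_regression(X, y, gamma=1.0, lam=0.01, lr=0.1, n_steps=1000):
    """Fit alphas for f(x) = sum_j alpha_j K(x, x_j) by gradient descent on
    sum_i log(1 + exp(-y_i f(x_i))) + lam * alpha' K alpha."""
    K = rbf_kernel(X, X, gamma)
    alpha = np.zeros(len(X))
    for _ in range(n_steps):
        f = K @ alpha
        # d/df_i of log(1 + exp(-y_i f_i)) is -y_i * sigmoid(-y_i f_i)
        g_f = -y / (1.0 + np.exp(y * f))
        grad = K @ g_f + 2.0 * lam * (K @ alpha)
        alpha -= lr * grad
    return alpha, K

# Made-up 1-D data with a nonlinear class boundary (negatives in the middle).
X = np.array([[-3.0], [-2.5], [-0.5], [0.0], [0.5], [2.5], [3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0, 1.0, 1.0])

alpha, K = kernel_logistic_regression(X, y)
print("training predictions:", np.sign(K @ alpha))
```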

Page 60: Overfitting in SVM

[Figure: two plots of classification error versus PolyDegree (1–5), one for the Breast Cancer dataset (error axis 0–0.08) and one for the Ionosphere dataset (error axis 0–0.2); each plot shows a training-error curve and a testing-error curve]
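An experiment in the spirit of this figure is easy to reproduce. The sketch below is my own: it uses scikit-learn's bundled breast-cancer dataset as a stand-in for the datasets on the slide, with my own train/test split, scaling, and C value, and sweeps the polynomial degree while printing training and testing error.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)          # feature scaling matters for polynomial kernels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in range(1, 6):
    clf = SVC(kernel="poly", degree=degree, gamma=1.0, coef0=1.0, C=1.0).fit(X_tr, y_tr)
    train_err = 1.0 - clf.score(X_tr, y_tr)
    test_err = 1.0 - clf.score(X_te, y_te)
    print(f"degree {degree}: train error {train_err:.3f}, test error {test_err:.3f}")
```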

Page 61: SVM Performance

• Anecdotally they work very very well indeed.

• Example: They are currently the best-known classifier on a well-studied hand-written-character recognition benchmark

• Another Example: Andrew knows several reliable people doing practical real-world work who claim that SVMs have saved them when their other favorite classifiers did poorly.

• There is a lot of excitement and religious fervor about SVMs as of 2001.

• Despite this, some practitioners are a little skeptical.

Page 62: Diffusion Kernel

• A kernel function describes the correlation or similarity between two data points

• Suppose I have a function s(x,y) that describes the similarity between two data points, and assume that it is a non-negative and symmetric function. How can we generate a kernel function based on this similarity function?

• A graph theory approach …

Page 63: Diffusion Kernel

• Create a graph for the data points

• Each vertex corresponds to a data point

• The weight of each edge is the similarity s(x,y)

• Graph Laplacian:

L_ij =  s(x_i, x_j)              if i ≠ j
        −Σ_{k≠i} s(x_i, x_k)     if i = j

• Properties of the Laplacian

• Negative semi-definite

Page 64: Diffusion Kernel

• Consider a simple Laplacian

L_ij =  1             if x_i and x_j are connected (x_j ∈ N(x_i))
        −|N(x_i)|     if i = j

• Consider L², L⁴, ...

• What do these matrices represent?

• A diffusion kernel

Page 65: Diffusion Kernel

• Consider a simple Laplacian

L_ij =  1             if x_i and x_j are connected (x_j ∈ N(x_i))
        −|N(x_i)|     if i = j

• Consider L², L⁴, ...

• What do these matrices represent?

• A diffusion kernel

Page 66: Diffusion Kernel: Properties

• Positive definite

• Local relationships L induce global relationships

• Works for undirected weighted graphs with symmetric similarities: s(x_i, x_j) = s(x_j, x_i)

• How to compute the diffusion kernel e^L ?

K = e^{βL},   or equivalently   dK/dβ = L K

Page 67: Computing Diffusion Kernel

• Singular value decomposition of Laplacian L:

L = U Σ U^T = (u_1, u_2, ..., u_m) diag(σ_1, σ_2, ..., σ_m) (u_1, u_2, ..., u_m)^T
  = Σ_{i=1..m} σ_i u_i u_i^T ,    where  u_i^T u_j = δ_{i,j}

• What is L² ?

L² = ( Σ_{i=1..m} σ_i u_i u_i^T )²
   = Σ_{i,j=1..m} σ_i σ_j u_i (u_i^T u_j) u_j^T
   = Σ_{i=1..m} σ_i² u_i u_i^T

Page 68: Computing Diffusion Kernel

• What about Lⁿ ?

Lⁿ = Σ_{i=1..m} σ_iⁿ u_i u_i^T

• Compute the diffusion kernel e^L:

e^L = Σ_{n≥0} (1/n!) Lⁿ = Σ_{n≥0} (1/n!) Σ_{i=1..m} σ_iⁿ u_i u_i^T = Σ_{i=1..m} e^{σ_i} u_i u_i^T
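Both routes on this slide are a few lines of code: build the Laplacian from a similarity matrix, then either exponentiate it directly (scipy.linalg.expm) or go through the eigendecomposition as above. The sketch below is my own illustration; the similarity matrix and the β value are made up.

```python
import numpy as np
from scipy.linalg import expm

# Made-up symmetric, non-negative similarity matrix for 4 points.
S = np.array([[0.0, 0.9, 0.1, 0.0],
              [0.9, 0.0, 0.2, 0.1],
              [0.1, 0.2, 0.0, 0.8],
              [0.0, 0.1, 0.8, 0.0]])

# Graph Laplacian as on slide 63: off-diagonal s(x_i, x_j), diagonal -sum_k s(x_i, x_k).
L = S - np.diag(S.sum(axis=1))

beta = 1.0
K_direct = expm(beta * L)                      # K = e^{beta L} via the matrix exponential

# Same thing through the eigendecomposition, as on this slide.
sigma, U = np.linalg.eigh(L)                   # L = U diag(sigma) U^T
K_eig = U @ np.diag(np.exp(beta * sigma)) @ U.T

print("max difference between the two routes:", np.abs(K_direct - K_eig).max())
print("smallest eigenvalue of K (should be > 0):", np.linalg.eigvalsh(K_direct).min())
```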

Page 69: Doing multi-class classification

• SVMs can only handle two-class outputs (i.e. a categorical output variable with arity 2).

• What can be done?

• Answer: with output arity N, learn N SVMs

• SVM 1 learns “Output==1” vs “Output != 1”

• SVM 2 learns “Output==2” vs “Output != 2”

• :

• SVM N learns “Output==N” vs “Output != N”

• Then to predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region.
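This one-vs-rest recipe maps directly onto code: train one binary SVM per class and take the argmax of the decision values. A minimal sketch with scikit-learn (the library, dataset, and kernel settings are my own choices):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)          # 3 classes, used here as the N-class example
classes = np.unique(y)

# One binary SVM per class: "Output == c" vs "Output != c".
models = {c: SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, (y == c).astype(int))
          for c in classes}

def predict(X_new):
    # Score each class by its SVM's signed distance to the boundary and pick the
    # class whose prediction is pushed furthest into the positive region.
    scores = np.column_stack([models[c].decision_function(X_new) for c in classes])
    return classes[np.argmax(scores, axis=1)]

print("training accuracy:", np.mean(predict(X) == y))
```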

Page 70: Ranking Problem

• Consider a problem of ranking essays

• Three ranking categories: good, ok, bad

• Given an input document, predict its ranking category

• How should we formulate this problem?

• A simple multiple-class solution

• Each ranking category is an independent class

• But, there is something missing here …

• We miss the ordinal relationship between classes!

Page 71: Ordinal Regression

• Which choice is better?

• How could we formulate this problem?

[Figure: ‘good’, ‘OK’, and ‘bad’ essays plotted in feature space, with two candidate projection directions w and w’]

Page 72: Ordinal Regression

• What are the two decision boundaries?

w·x − b_1 = 0   and   w·x − b_2 = 0

• What is the margin for ordinal regression?

(D_g, D_o, D_b are the ‘good’, ‘OK’, and ‘bad’ documents.)

margin_1(b_1, w) = min_{x ∈ D_g ∪ D_o} d(x) = min_{x ∈ D_g ∪ D_o} |w·x − b_1| / sqrt( Σ_i w_i² )

margin_2(b_2, w) = min_{x ∈ D_o ∪ D_b} d(x) = min_{x ∈ D_o ∪ D_b} |w·x − b_2| / sqrt( Σ_i w_i² )

margin(b_1, b_2, w) = min( margin_1(b_1, w), margin_2(b_2, w) )

• Maximize margin:

{w*, b_1*, b_2*} = argmax_{w, b_1, b_2} margin(b_1, b_2, w)

Page 73: Ordinal Regression

(Same content as the previous slide: the two decision boundaries, the two margins margin_1 and margin_2, the overall margin as their minimum, and the max-margin objective.)

Page 74: Ordinal Regression

(Same content as the previous slide.)

Page 75: Ordinal Regression

• How do we solve this monster?

{w*, b_1*, b_2*} = argmax_{w, b_1, b_2} margin(b_1, b_2, w)
  = argmax_{w, b_1, b_2} min( margin_1(b_1, w), margin_2(b_2, w) )
  = argmax_{w, b_1, b_2} min( min_{x_i ∈ D_g ∪ D_o} |w·x_i − b_1| / sqrt(Σ_j w_j²),  min_{x_i ∈ D_o ∪ D_b} |w·x_i − b_2| / sqrt(Σ_j w_j²) )

subject to

∀x_i ∈ D_g :  w·x_i − b_1 ≥ 0
∀x_i ∈ D_o :  w·x_i − b_1 ≤ 0,   w·x_i − b_2 ≥ 0
∀x_i ∈ D_b :  w·x_i − b_2 ≤ 0

Page 76: Ordinal Regression

• The same old trick

• To remove the scaling invariance, set

min_{x_i ∈ D_g ∪ D_o} |w·x_i − b_1| = 1
min_{x_i ∈ D_o ∪ D_b} |w·x_i − b_2| = 1

• Now the problem is simplified as:

{w*, b_1*, b_2*} = argmin_{w, b_1, b_2} Σ_{i=1..d} w_i²

subject to

∀x_i ∈ D_g :  w·x_i − b_1 ≥ 1
∀x_i ∈ D_o :  −(w·x_i − b_1) ≥ 1,   w·x_i − b_2 ≥ 1
∀x_i ∈ D_b :  −(w·x_i − b_2) ≥ 1
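Once w, b_1, b_2 are learned, prediction is just two threshold tests on the score w·x. The sketch below is my own illustration of that decision rule under the sign convention used above (good above b_1, bad below b_2, with b_1 ≥ b_2); the weights and thresholds are made-up values, not a solution of the optimization.

```python
import numpy as np

def predict_rank(X, w, b1, b2):
    """Ordinal prediction with two parallel boundaries w.x = b1 and w.x = b2 (b1 >= b2)."""
    s = X @ w
    return np.where(s > b1, "good", np.where(s < b2, "bad", "OK"))

# Made-up parameters and documents (feature vectors) for illustration only.
w = np.array([1.0, 0.5])
b1, b2 = 2.0, -1.0
X = np.array([[3.0, 1.0], [0.5, 0.5], [-2.0, -1.0]])

print(predict_rank(X, w, b1, b2))   # ['good' 'OK' 'bad'] for these made-up numbers
```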

Page 77: Ordinal Regression

• Noisy case

{w*, b_1*, b_2*} = argmin_{w, b_1, b_2} Σ_{i=1..d} w_i² + c Σ_{x_i ∈ D_g} ε_i + c Σ_{x_i ∈ D_o} (ε_i + ε_i’) + c Σ_{x_i ∈ D_b} ε_i

subject to

∀x_i ∈ D_g :  w·x_i − b_1 ≥ 1 − ε_i,   ε_i ≥ 0
∀x_i ∈ D_o :  −(w·x_i − b_1) ≥ 1 − ε_i,   w·x_i − b_2 ≥ 1 − ε_i’,   ε_i ≥ 0,  ε_i’ ≥ 0
∀x_i ∈ D_b :  −(w·x_i − b_2) ≥ 1 − ε_i,   ε_i ≥ 0

• Is this sufficient?

Page 78: Ordinal Regression

[Figure: the ‘good’, ‘OK’, and ‘bad’ essays separated by the two parallel boundaries along w]

{w*, b_1*, b_2*} = argmin_{w, b_1, b_2} Σ_{i=1..d} w_i² + c Σ_{x_i ∈ D_g} ε_i + c Σ_{x_i ∈ D_o} (ε_i + ε_i’) + c Σ_{x_i ∈ D_b} ε_i

subject to

∀x_i ∈ D_g :  w·x_i − b_1 ≥ 1 − ε_i,   ε_i ≥ 0
∀x_i ∈ D_o :  −(w·x_i − b_1) ≥ 1 − ε_i,   w·x_i − b_2 ≥ 1 − ε_i’,   ε_i ≥ 0,  ε_i’ ≥ 0
∀x_i ∈ D_b :  −(w·x_i − b_2) ≥ 1 − ε_i,   ε_i ≥ 0

b_1 ≥ b_2

Page 79: References

• An excellent tutorial on VC-dimension and Support Vector Machines:

C.J.C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998. http://citeseer.nj.nec.com/burges98tutorial.html

• The VC/SRM/SVM Bible: (Not for beginners, including myself)

Statistical Learning Theory by Vladimir Vapnik, Wiley-Interscience, 1998

• Software: SVM-light, http://svmlight.joachims.org/, free download