2013-1 Machine Learning Lecture 05 - Andrew Moore - Support Vector Machines


Page 1: Support Vector Machines

Nov 23rd, 2001 Copyright © 2001, 2003, Andrew W. Moore

Support Vector Machines

Modified from the slides by Dr. Andrew W. Moore

http://www.cs.cmu.edu/~awm/tutorials

Page 2: Linear Classifiers

f(x, w, b) = sign(w·x − b)

[Figure: input x → classifier f → estimated label y; training points, where one marker denotes +1 and the other denotes −1]

How would you classify this data?

Page 3: Linear Classifiers

f(x, w, b) = sign(w·x − b)

[Figure: the same labeled data, with a candidate separating line]

How would you classify this data?

Page 4: Linear Classifiers

f(x, w, b) = sign(w·x − b)

[Figure: the same labeled data, with another candidate separating line]

How would you classify this data?

Page 5: Linear Classifiers

f(x, w, b) = sign(w·x − b)

[Figure: the same labeled data, with another candidate separating line]

How would you classify this data?

Page 6: Linear Classifiers

f(x, w, b) = sign(w·x − b)

[Figure: the same labeled data, with several candidate separating lines]

Any of these would be fine..

..but which is best?

Page 7: Classifier Margin

f(x, w, b) = sign(w·x − b)

[Figure: a separating line with the margin band drawn around it; one marker denotes +1, the other denotes −1]

Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.

Page 8: Maximum Margin

f(x, w, b) = sign(w·x − b)

[Figure: the maximum margin separating line; one marker denotes +1, the other denotes −1]

The maximum margin linear classifier is the linear classifier with the, um, maximum margin.

This is the simplest kind of SVM (called an LSVM).

Linear SVM

Page 9: Maximum Margin

f(x, w, b) = sign(w·x − b)

[Figure: the maximum margin separating line with the datapoints on the margin highlighted]

The maximum margin linear classifier is the linear classifier with the, um, maximum margin.

This is the simplest kind of SVM (called an LSVM).

Support Vectors are those datapoints that the margin pushes up against.

Linear SVM

Page 10: Why Maximum Margin?

f(x, w, b) = sign(w·x − b)

[Figure: the maximum margin classifier with its support vectors highlighted; one marker denotes +1, the other denotes −1]

The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (called an LSVM). Support Vectors are those datapoints that the margin pushes up against.

1. Intuitively this feels safest.

2. If we’ve made a small error in the location of the boundary (it’s been jolted in its perpendicular direction) this gives us least chance of causing a misclassification.

3. LOOCV is easy since the model is immune to removal of any non-support-vector datapoints.

4. There’s some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.

5. Empirically it works very very well.

Page 11: Estimate the Margin

• What is the distance expression for a point x to a line w·x + b = 0?

[Figure: a point x and the line w·x + b = 0; one marker denotes +1, the other denotes −1]

d(x) = |w·x + b| / ||w||_2 = |w·x + b| / sqrt( Σ_{i=1..d} w_i² )
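The distance formula above translates directly into a few lines of code. Below is a minimal NumPy sketch (the points, weights, and offset are made-up values for illustration); it also takes the minimum distance over the data, which is exactly the margin defined on the next slide.

```python
import numpy as np

def distance_to_hyperplane(X, w, b):
    """Distance of each row of X to the hyperplane w.x + b = 0."""
    return np.abs(X @ w + b) / np.linalg.norm(w)

# Made-up 2-D data and a candidate separating line (illustrative values only).
X = np.array([[2.0, 1.0], [1.0, -1.5], [-1.0, -2.0], [-2.5, 0.5]])
w = np.array([1.0, 0.5])
b = -0.25

d = distance_to_hyperplane(X, w, b)
print("distances:", d)
print("margin (minimum distance over the data):", d.min())
```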

Page 12: Estimate the Margin

• What is the expression for margin?

[Figure: the line w·x + b = 0 with the margin band; one marker denotes +1, the other denotes −1]

margin = min_{x ∈ D} d(x) = min_{x ∈ D} |w·x + b| / sqrt( Σ_{i=1..d} w_i² )

Page 13: Maximize Margin

[Figure: the line w·x + b = 0 with the margin band; one marker denotes +1, the other denotes −1]

argmax_{w,b} margin(w, b, D)
  = argmax_{w,b} min_{x_i ∈ D} d(x_i)
  = argmax_{w,b} min_{x_i ∈ D} |w·x_i + b| / sqrt( Σ_{j=1..d} w_j² )

Page 14: Maximize Margin

[Figure: the line w·x + b = 0 with the margin band]

argmax_{w,b} min_{x_i ∈ D} |w·x_i + b| / sqrt( Σ_{j=1..d} w_j² )

subject to  ∀x_i ∈ D :  y_i (w·x_i + b) ≥ 0

• Min-max problem → game problem

Page 15: Maximize Margin

[Figure: the line w·x + b = 0 with the margin band]

argmax_{w,b} min_{x_i ∈ D} |w·x_i + b| / sqrt( Σ_{j=1..d} w_j² )

subject to  ∀x_i ∈ D :  y_i (w·x_i + b) ≥ 0

Strategy: fix the scale of (w, b) so that  min_{x_i ∈ D} |w·x_i + b| = 1.  Then the problem becomes

argmin_{w,b} Σ_{i=1..d} w_i²

subject to  ∀x_i ∈ D :  y_i (w·x_i + b) ≥ 1

Page 16: Maximum Margin Linear Classifier

• How to solve it?

{w*, b*} = argmin_{w,b} Σ_{k=1..d} w_k²

subject to

y_1 (w·x_1 + b) ≥ 1
y_2 (w·x_2 + b) ≥ 1
....
y_N (w·x_N + b) ≥ 1

Page 17: Learning via Quadratic Programming

• QP is a well-studied class of optimization algorithms for optimizing a quadratic function of some real-valued variables subject to linear constraints.

Page 18: Quadratic Programming

Find  u* = argmin_u  c + d^T u + (1/2) u^T R u        (quadratic criterion)

subject to n additional linear inequality constraints:

a_11 u_1 + a_12 u_2 + ... + a_1m u_m ≤ b_1
a_21 u_1 + a_22 u_2 + ... + a_2m u_m ≤ b_2
  :
a_n1 u_1 + a_n2 u_2 + ... + a_nm u_m ≤ b_n

and subject to e additional linear equality constraints:

a_(n+1)1 u_1 + a_(n+1)2 u_2 + ... + a_(n+1)m u_m = b_(n+1)
a_(n+2)1 u_1 + a_(n+2)2 u_2 + ... + a_(n+2)m u_m = b_(n+2)
  :
a_(n+e)1 u_1 + a_(n+e)2 u_2 + ... + a_(n+e)m u_m = b_(n+e)

Page 19: Quadratic Programming

(The same formulation as the previous slide: the quadratic criterion c + d^T u + (1/2) u^T R u, subject to the n linear inequality constraints and the e additional linear equality constraints.)

Page 20: Quadratic Programming

{w*, b*} = argmin_{w,b} Σ_i w_i²
subject to  y_i (w·x_i + b) ≥ 1  for all training data (x_i, y_i)

Written as a QP in the variables (w, b): the criterion w·w is a quadratic form (with no linear term), and

y_1 (w·x_1 + b) ≥ 1
y_2 (w·x_2 + b) ≥ 1
....
y_N (w·x_N + b) ≥ 1

are N linear inequality constraints.
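In practice this QP can be handed to any off-the-shelf solver. Below is a minimal sketch, assuming the cvxopt package is available (the solver, the stacked variable u = [w; b], and the tiny dataset are my own choices, not prescribed by the slides):

```python
import numpy as np
from cvxopt import matrix, solvers  # assumes cvxopt is installed

def hard_margin_svm(X, y):
    """Solve argmin ||w||^2 s.t. y_i (w.x_i + b) >= 1 as a QP in u = [w; b]."""
    n, m = X.shape
    # cvxopt minimizes (1/2) u'Pu + q'u  subject to  G u <= h
    P = np.zeros((m + 1, m + 1))
    P[:m, :m] = 2.0 * np.eye(m)          # quadratic term: ||w||^2, b is unpenalized
    q = np.zeros(m + 1)
    G = -y[:, None] * np.hstack([X, np.ones((n, 1))])  # -y_i [x_i, 1] u <= -1
    h = -np.ones(n)
    solvers.options['show_progress'] = False
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    u = np.array(sol['x']).ravel()
    return u[:m], u[m]

# Tiny linearly separable toy set (made-up points).
X = np.array([[2.0, 2.0], [2.5, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = hard_margin_svm(X, y)
print("w =", w, " b =", b)
print("constraint values y_i (w.x_i + b):", y * (X @ w + b))
```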

Page 21: Uh-oh!

[Figure: labeled data that no straight line can separate; one marker denotes +1, the other denotes −1]

This is going to be a problem!

What should we do?

Page 22: Uh-oh!

This is going to be a problem!

What should we do?

Idea 1:

Find minimum w·w, while minimizing number of training set errors.

Problemette: Two things to minimize makes for an ill-defined optimization.

Page 23: Uh-oh!

This is going to be a problem!

What should we do?

Idea 1.1:

Minimize  w·w + C (#train errors),  where C is a tradeoff parameter.

There’s a serious practical problem that’s about to make us reject this approach. Can you guess what it is?

Page 24: Uh-oh!

Idea 1.1:

Minimize  w·w + C (#train errors),  where C is a tradeoff parameter.

There’s a serious practical problem that’s about to make us reject this approach. Can you guess what it is?

Can’t be expressed as a Quadratic Programming problem. Solving it may be too slow.

(Also, doesn’t distinguish between disastrous errors and near misses.)

Page 25: Uh-oh!

This is going to be a problem!

What should we do?

Idea 2.0:

Minimize  w·w + C (distance of error points to their correct place)

Page 26: Support Vector Machine (SVM) for Noisy Data

• Any problem with the formulation below?

{w*, b*} = argmin_{w,b} Σ_{i=1..d} w_i² + C Σ_{j=1..N} ε_j

subject to

y_1 (w·x_1 + b) ≥ 1 − ε_1
y_2 (w·x_2 + b) ≥ 1 − ε_2
...
y_N (w·x_N + b) ≥ 1 − ε_N

[Figure: three points violating the margin, with slacks ε_1, ε_2, ε_3; one marker denotes +1, the other denotes −1]

Page 27: Support Vector Machine (SVM) for Noisy Data

• Balance the trade-off between margin and classification errors

{w*, b*} = argmin_{w,b} Σ_{i=1..d} w_i² + C Σ_{j=1..N} ε_j

subject to

y_1 (w·x_1 + b) ≥ 1 − ε_1,  ε_1 ≥ 0
y_2 (w·x_2 + b) ≥ 1 − ε_2,  ε_2 ≥ 0
...
y_N (w·x_N + b) ≥ 1 − ε_N,  ε_N ≥ 0

[Figure: the same three margin-violating points with slacks ε_1, ε_2, ε_3]

Page 28: Support Vector Machine for Noisy Data

{w*, b*} = argmin_{w,b} Σ_{i=1..d} w_i² + C Σ_{j=1..N} ε_j

subject to the inequality constraints

y_1 (w·x_1 + b) ≥ 1 − ε_1,  ε_1 ≥ 0
y_2 (w·x_2 + b) ≥ 1 − ε_2,  ε_2 ≥ 0
....
y_N (w·x_N + b) ≥ 1 − ε_N,  ε_N ≥ 0

How do we determine the appropriate value for C?
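The slides leave that question open; a standard practical answer is to pick C by cross-validation. Here is a minimal sketch using scikit-learn (the library, the synthetic data, and the candidate C grid are my additions, not part of the lecture):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic two-class data standing in for the noisy dataset on the slide.
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Cross-validate over a grid of C values for a linear soft-margin SVM.
grid = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X_tr, y_tr)

print("best C:", grid.best_params_["C"])
print("held-out accuracy:", grid.score(X_te, y_te))
```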

Page 29: The Dual Form of QP

Maximize   Σ_{k=1..R} α_k − (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l (x_k·x_l)

Subject to these constraints:

0 ≤ α_k ≤ C  for all k
Σ_{k=1..R} α_k y_k = 0

Then define:

w = Σ_{k=1..R} α_k y_k x_k

Then classify with:

f(x,w,b) = sign(w·x − b)

Page 30: The Dual Form of QP

Maximize   Σ_{k=1..R} α_k − (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l (x_k·x_l)

Subject to these constraints:

0 ≤ α_k ≤ C  for all k
Σ_{k=1..R} α_k y_k = 0

Then define:

w = Σ_{k=1..R} α_k y_k x_k

Page 31: An Equivalent QP

Maximize   Σ_{k=1..R} α_k − (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l (x_k·x_l)

Subject to these constraints:

0 ≤ α_k ≤ C  for all k
Σ_{k=1..R} α_k y_k = 0

Then define:

w = Σ_{k=1..R} α_k y_k x_k

Datapoints with α_k > 0 will be the support vectors

..so this sum only needs to be over the support vectors.

Page 32: Support Vectors

[Figure: the decision boundary with the two margin planes w·x + b = +1 and w·x + b = −1, the normal vector w, and the support vectors lying on the margin planes; one marker denotes +1, the other denotes −1]

Decision boundary is determined only by those support vectors!

w = Σ_{k=1..R} α_k y_k x_k

∀i :  α_i ( y_i (w·x_i + b) − 1 ) = 0

α_i = 0 for non-support vectors

α_i ≠ 0 for support vectors
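After fitting a linear SVM you can read off exactly the objects this slide talks about: the nonzero α_k, the corresponding support vectors, and w rebuilt as Σ α_k y_k x_k. A small sketch with scikit-learn on made-up data (the library and the data are my own choices, not part of the slides):

```python
import numpy as np
from sklearn.svm import SVC

# Made-up separable 2-D data.
X = np.array([[2.0, 2.0], [2.5, 1.0], [3.0, 2.5],
              [-1.0, -1.5], [-2.0, -0.5], [-1.5, -2.5]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=10.0).fit(X, y)

print("support vector indices:", clf.support_)
print("support vectors:\n", clf.support_vectors_)
# dual_coef_ holds alpha_k * y_k for the support vectors only.
w_from_alphas = clf.dual_coef_ @ clf.support_vectors_
print("w rebuilt from the alphas:", w_from_alphas.ravel())
print("w reported by the solver: ", clf.coef_.ravel())
```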

Page 33: The Dual Form of QP

Maximize   Σ_{k=1..R} α_k − (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l (x_k·x_l)

Subject to these constraints:

0 ≤ α_k ≤ C  for all k
Σ_{k=1..R} α_k y_k = 0

Then define:

w = Σ_{k=1..R} α_k y_k x_k

Then classify with:

f(x,w,b) = sign(w·x − b)

How to determine b?

Page 34: An Equivalent QP: Determine b

A linear programming problem!

Recall the soft-margin primal:

{w*, b*} = argmin_{w,b} Σ_{i=1..d} w_i² + C Σ_{j=1..N} ε_j
subject to  y_j (w·x_j + b) ≥ 1 − ε_j,  ε_j ≥ 0,  for j = 1..N

Fix w (from the dual solution). Then b is found by solving

{b*, ε*} = argmin_{b, ε} Σ_{j=1..N} ε_j
subject to  y_j (w·x_j + b) ≥ 1 − ε_j,  ε_j ≥ 0,  for j = 1..N

Page 35: An Equivalent QP

Maximize   Σ_{k=1..R} α_k − (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l (x_k·x_l)

Subject to these constraints:

0 ≤ α_k ≤ C  for all k
Σ_{k=1..R} α_k y_k = 0

Then define:

w = Σ_{k=1..R} α_k y_k x_k

b = y_K (1 − ε_K) − x_K·w    where  K = argmax_k α_k

Then classify with:

f(x,w,b) = sign(w·x − b)

Datapoints with α_k > 0 will be the support vectors

..so this sum only needs to be over the support vectors.

Why did I tell you about this equivalent QP?

• It’s a formulation that QP packages can optimize more quickly

• Because of further jaw-dropping developments you’re about to learn.
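The b formula above is easy to apply once the α's are known. A minimal NumPy sketch follows; the α's, labels, and points are made-up illustrative values (not an actual dual optimum), and ε_K is taken as 0, i.e. the chosen support vector is assumed to sit exactly on the margin.

```python
import numpy as np

# Made-up dual solution for a toy problem: alphas, labels, and points.
alpha = np.array([0.7, 0.0, 0.7, 0.0])
y = np.array([1.0, 1.0, -1.0, -1.0])
X = np.array([[2.0, 1.0], [3.0, 2.0], [0.5, -0.5], [-1.0, -2.0]])

# w = sum_k alpha_k y_k x_k  (only support vectors, alpha_k > 0, contribute).
w = (alpha * y) @ X

# b = y_K (1 - eps_K) - x_K . w  with K = argmax_k alpha_k and eps_K assumed 0.
K = np.argmax(alpha)
b = y[K] * 1.0 - X[K] @ w

# The slides write the classifier as sign(w.x - b) but state the constraints with
# w.x + b; here we stay with the +b convention throughout, for consistency.
print("w =", w, " b =", b)
print("predictions:", np.sign(X @ w + b))
```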

Page 36: Suppose we’re in 1-dimension

What would SVMs do with this data?

[Figure: points on the real line, with x = 0 marked]

Page 37: Suppose we’re in 1-dimension

Not a big surprise

[Figure: the 1-D maximum margin split, with the positive “plane” and negative “plane” marked around x = 0]

Page 38: Harder 1-dimensional dataset

That’s wiped the smirk off SVM’s face.

What can be done about this?

[Figure: 1-D data around x = 0 that no single threshold can separate]

Page 39: Harder 1-dimensional dataset

Remember how permitting non-linear basis functions made linear regression so much nicer?

Let’s permit them here too:

z_k = ( x_k , x_k² )

[Figure: the same 1-D data, with x = 0 marked]

Page 40: Harder 1-dimensional dataset

Remember how permitting non-linear basis functions made linear regression so much nicer?

Let’s permit them here too:

z_k = ( x_k , x_k² )

[Figure: the data mapped into (x, x²) space, where it becomes linearly separable]

Page 41: Common SVM basis functions

z_k = ( polynomial terms of x_k of degree 1 to q )

z_k = ( radial basis functions of x_k ):   z_k[j] = φ_j(x_k) = exp( −||x_k − c_j||² / σ² )

z_k = ( sigmoid functions of x_k )

This is sensible.

Is that the end of the story?

No…there’s one more trick!

Page 42: Quadratic Basis Functions

Φ(x) = ( 1,                                          (constant term)
         √2·x_1, √2·x_2, ..., √2·x_m,                 (linear terms)
         x_1², x_2², ..., x_m²,                       (pure quadratic terms)
         √2·x_1x_2, √2·x_1x_3, ..., √2·x_1x_m,
         √2·x_2x_3, ..., √2·x_{m−1}x_m )              (quadratic cross-terms)

Number of terms (assuming m input dimensions) = (m+2)-choose-2 = (m+2)(m+1)/2 = (as near as makes no difference) m²/2

You may be wondering what those √2’s are doing.

• You should be happy that they do no harm

• You’ll find out why they’re there soon.

Page 43: QP (old)

Maximize   Σ_{k=1..R} α_k − (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l (x_k·x_l)

Subject to these constraints:

0 ≤ α_k ≤ C  for all k
Σ_{k=1..R} α_k y_k = 0

Then define:

w = Σ_{k=1..R} α_k y_k x_k

Then classify with:

f(x,w,b) = sign(w·x − b)

Page 44: QP with basis functions

Maximize   Σ_{k=1..R} α_k − (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l ( Φ(x_k)·Φ(x_l) )

Subject to these constraints:

0 ≤ α_k ≤ C  for all k
Σ_{k=1..R} α_k y_k = 0

Then define:

w = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k)

Then classify with:

f(x,w,b) = sign(w·Φ(x) − b)

Most important changes:  x → Φ(x)

Page 45: QP with basis functions

Maximize   Σ_{k=1..R} α_k − (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l ( Φ(x_k)·Φ(x_l) )

Subject to these constraints:

0 ≤ α_k ≤ C  for all k
Σ_{k=1..R} α_k y_k = 0

Then define:

w = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k)

Then classify with:

f(x,w,b) = sign(w·Φ(x) − b)

We must do R²/2 dot products to get this matrix ready.

Each dot product requires m²/2 additions and multiplications.

The whole thing costs R² m² / 4. Yeeks!

…or does it?

Page 46: Quadratic Dot Products

Φ(a)·Φ(b) = ( 1, √2 a_1, ..., √2 a_m, a_1², ..., a_m², √2 a_1a_2, ..., √2 a_{m−1}a_m )
          · ( 1, √2 b_1, ..., √2 b_m, b_1², ..., b_m², √2 b_1b_2, ..., √2 b_{m−1}b_m )

  =  1  +  2 Σ_{i=1..m} a_i b_i  +  Σ_{i=1..m} a_i² b_i²  +  2 Σ_{i=1..m} Σ_{j=i+1..m} a_i a_j b_i b_j

Page 47: Quadratic Dot Products

Φ(a)·Φ(b) = 1 + 2 Σ_{i=1..m} a_i b_i + Σ_{i=1..m} a_i² b_i² + 2 Σ_{i=1..m} Σ_{j=i+1..m} a_i a_j b_i b_j

Just out of casual, innocent, interest, let’s look at another function of a and b:

(a·b + 1)²
  = (a·b)² + 2(a·b) + 1
  = ( Σ_{i=1..m} a_i b_i )² + 2 Σ_{i=1..m} a_i b_i + 1
  = Σ_{i=1..m} Σ_{j=1..m} a_i b_i a_j b_j + 2 Σ_{i=1..m} a_i b_i + 1
  = Σ_{i=1..m} a_i² b_i² + 2 Σ_{i=1..m} Σ_{j=i+1..m} a_i a_j b_i b_j + 2 Σ_{i=1..m} a_i b_i + 1

Page 48: Quadratic Dot Products

Φ(a)·Φ(b) = 1 + 2 Σ_{i=1..m} a_i b_i + Σ_{i=1..m} a_i² b_i² + 2 Σ_{i=1..m} Σ_{j=i+1..m} a_i a_j b_i b_j

Just out of casual, innocent, interest, let’s look at another function of a and b:

(a·b + 1)²
  = (a·b)² + 2(a·b) + 1
  = ( Σ_{i=1..m} a_i b_i )² + 2 Σ_{i=1..m} a_i b_i + 1
  = Σ_{i=1..m} Σ_{j=1..m} a_i b_i a_j b_j + 2 Σ_{i=1..m} a_i b_i + 1
  = Σ_{i=1..m} a_i² b_i² + 2 Σ_{i=1..m} Σ_{j=i+1..m} a_i a_j b_i b_j + 2 Σ_{i=1..m} a_i b_i + 1

They’re the same!

And this is only O(m) to compute!
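The identity is easy to check numerically. Below is a small NumPy sketch (my own, not from the slides) that builds the explicit quadratic feature map Φ with its √2 factors and verifies that Φ(a)·Φ(b) equals (a·b + 1)² for random vectors:

```python
import numpy as np

def phi_quadratic(x):
    """Explicit quadratic feature map: 1, sqrt(2)*x_i, x_i^2, sqrt(2)*x_i*x_j (i<j)."""
    m = len(x)
    cross = [np.sqrt(2) * x[i] * x[j] for i in range(m) for j in range(i + 1, m)]
    return np.concatenate(([1.0], np.sqrt(2) * x, x ** 2, cross))

rng = np.random.default_rng(0)
m = 6
a, b = rng.normal(size=m), rng.normal(size=m)

lhs = phi_quadratic(a) @ phi_quadratic(b)   # O(m^2) explicit dot product
rhs = (a @ b + 1) ** 2                      # O(m) kernel evaluation
print(lhs, rhs)                             # the two values agree
print("feature dimension:", len(phi_quadratic(a)),
      "= (m+2)(m+1)/2 =", (m + 2) * (m + 1) // 2)
```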

Page 49: QP with Quintic basis functions

Maximize   Σ_{k=1..R} α_k − (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l ( Φ(x_k)·Φ(x_l) )

Subject to these constraints:

0 ≤ α_k ≤ C  for all k
Σ_{k=1..R} α_k y_k = 0

Then define:

w = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k)

Then classify with:

f(x,w,b) = sign(w·Φ(x) − b)

Page 50: QP with Quadratic basis functions

Maximize   Σ_{k=1..R} α_k − (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l K(x_k, x_l)

Subject to these constraints:

0 ≤ α_k ≤ C  for all k
Σ_{k=1..R} α_k y_k = 0

Then define:

w = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k)

b = y_K (1 − ε_K) − x_K·w    where  K = argmax_k α_k

Then classify with:

f(x,w,b) = sign( K(w, x) − b )

Most important change:

Φ(x_k)·Φ(x_l) = K(x_k, x_l)

Page 51: Higher Order Polynomials

Polynomial | φ(x)                           | Cost to build Q_kl matrix traditionally | Cost if 100 inputs | φ(a)·φ(b)  | Cost to build Q_kl matrix sneakily | Cost if 100 inputs
Quadratic  | all m²/2 terms up to degree 2  | m² R² / 4                               | 2,500 R²           | (a·b + 1)² | m R² / 2                           | 50 R²
Cubic      | all m³/6 terms up to degree 3  | m³ R² / 12                              | 83,000 R²          | (a·b + 1)³ | m R² / 2                           | 50 R²
Quartic    | all m⁴/24 terms up to degree 4 | m⁴ R² / 48                              | 1,960,000 R²       | (a·b + 1)⁴ | m R² / 2                           | 50 R²

Page 52: SVM Kernel Functions

• K(a,b) = (a·b + 1)^d is an example of an SVM Kernel Function

• Beyond polynomials there are other very high dimensional basis functions that can be made practical by finding the right Kernel Function

• Radial-Basis-style Kernel Function:   K(a,b) = exp( −(a − b)² / (2σ²) )

• Neural-net-style Kernel Function:   K(a,b) = tanh( κ a·b − δ )
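These kernels are available off the shelf; for instance scikit-learn's SVC exposes them through its kernel, degree, coef0, and gamma parameters. A minimal sketch, with my own choice of library, dataset, and parameter values: (a·b + 1)^d corresponds to kernel='poly' with gamma=1 and coef0=1, and exp(−||a−b||²/(2σ²)) corresponds to kernel='rbf' with gamma = 1/(2σ²).

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two-moons data: not linearly separable, a classic kernel-SVM demo (my choice of dataset).
X, y = make_moons(n_samples=200, noise=0.15, random_state=0)

# (a.b + 1)^3 polynomial kernel.
poly_svm = SVC(kernel="poly", degree=3, gamma=1.0, coef0=1.0, C=1.0).fit(X, y)

# Radial-basis kernel exp(-||a-b||^2 / (2*sigma^2)) with sigma = 0.5, i.e. gamma = 2.0.
rbf_svm = SVC(kernel="rbf", gamma=2.0, C=1.0).fit(X, y)

print("poly training accuracy:", poly_svm.score(X, y))
print("rbf  training accuracy:", rbf_svm.score(X, y))
```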

Page 53: Kernel Tricks

• Replacing the dot product with a kernel function

• Not all functions are kernel functions

• Need to be decomposable:  K(a,b) = Φ(a)·Φ(b)

• Could K(a,b) = (a − b)³ be a kernel function?

• Could K(a,b) = (a − b)⁴ − (a + b)² be a kernel function?

Page 54: Kernel Tricks

• Mercer’s condition

To expand a kernel function K(x,y) into a dot product, i.e. K(x,y) = Φ(x)·Φ(y), K(x,y) has to be a positive semi-definite function, i.e., for any function f(x) whose ∫ f(x)² dx is finite, the following inequality holds:

∫∫ f(x) K(x,y) f(y) dx dy ≥ 0

• Could K(x,y) = Σ_i (x_i y_i)^p be a kernel function?
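Mercer's condition is stated over all square-integrable f, which cannot be checked directly; a common finite-sample sanity check is that the Gram matrix K(x_i, x_j) on any set of points must be symmetric and positive semi-definite (no negative eigenvalues). The sketch below is my own illustration: it runs that check for the polynomial kernel and for the candidate K(a,b) = (a − b)⁴ − (a + b)² from the previous slide. Failing the check on some sample proves a function is not a kernel; passing it is only evidence.

```python
import numpy as np

def min_gram_eigenvalue(kernel, xs):
    """Smallest eigenvalue of the (symmetric) Gram matrix K[i, j] = kernel(x_i, x_j)."""
    K = np.array([[kernel(a, b) for b in xs] for a in xs])
    return np.linalg.eigvalsh(K).min()

rng = np.random.default_rng(0)
xs = rng.normal(size=20)                       # 1-D sample points, for simplicity

poly_kernel = lambda a, b: (a * b + 1.0) ** 2          # a genuine kernel
candidate = lambda a, b: (a - b) ** 4 - (a + b) ** 2   # symmetric, but is it a kernel?

print("poly kernel, min Gram eigenvalue:", min_gram_eigenvalue(poly_kernel, xs))  # >= 0 (up to rounding)
print("candidate,   min Gram eigenvalue:", min_gram_eigenvalue(candidate, xs))    # negative: not a kernel
```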

Page 55: Kernel Tricks

• Pro

• Introduces nonlinearity into the model

• Computationally cheap

• Con

• Still has potential overfitting problems

Page 56: Nonlinear Kernel (I)

[Figure: nonlinear kernel example (I)]

Page 57: Nonlinear Kernel (II)

[Figure: nonlinear kernel example (II)]

Page 58: Kernelize Logistic Regression

How can we introduce the nonlinearity into the logistic regression?

p(y | x) = 1 / ( 1 + exp( −y (x·w) ) )

l_reg(w) = Σ_{i=1..N} log[ 1 / ( 1 + exp( −y_i (x_i·w) ) ) ] + c ||w||²

Page 59: Kernelize Logistic Regression

• Representer Theorem

x → Φ(x),    w = Σ_{i=1..N} α_i Φ(x_i)

x·w → K(w, x) = Σ_{i=1..N} α_i K(x, x_i)

p(y | x) = 1 / ( 1 + exp( −y K(w, x) ) ) = 1 / ( 1 + exp( −y Σ_{i=1..N} α_i K(x, x_i) ) )

l_reg(α) = Σ_{i=1..N} log[ 1 / ( 1 + exp( −y_i Σ_{j=1..N} α_j K(x_i, x_j) ) ) ] + c Σ_{i,j=1..N} α_i α_j K(x_i, x_j)
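In the α-parameterization above, the objective can be optimized directly by gradient descent. The sketch below is a minimal NumPy implementation under my own choices (an RBF kernel, regularization weight lam, a fixed step size, and the minimization form of the regularized logistic loss); it illustrates the representer-theorem parameterization rather than the lecture's exact procedure.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_logistic_regression(X, y, gamma=1.0, lam=0.01, lr=0.1, n_steps=1000):
    """Fit alphas for f(x) = sum_j alpha_j K(x, x_j) by gradient descent on
    sum_i log(1 + exp(-y_i f(x_i))) + lam * alpha' K alpha."""
    K = rbf_kernel(X, X, gamma)
    alpha = np.zeros(len(X))
    for _ in range(n_steps):
        f = K @ alpha
        # d/df_i of log(1 + exp(-y_i f_i)) is -y_i * sigmoid(-y_i f_i)
        g_f = -y / (1.0 + np.exp(y * f))
        grad = K @ g_f + 2.0 * lam * (K @ alpha)
        alpha -= lr * grad
    return alpha, K

# Made-up 1-D data with a nonlinear class boundary (negatives in the middle).
X = np.array([[-3.0], [-2.5], [-0.5], [0.0], [0.5], [2.5], [3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0, 1.0, 1.0])

alpha, K = kernel_logistic_regression(X, y)
print("training predictions:", np.sign(K @ alpha))
```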

Page 60: Overfitting in SVM

[Figure: two plots of classification error versus PolyDegree (1–5), one for the Breast Cancer dataset (error axis 0–0.08) and one for the Ionosphere dataset (error axis 0–0.2); each plot shows a training-error curve and a testing-error curve]
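An experiment in the spirit of this figure is easy to reproduce. The sketch below is my own: it uses scikit-learn's bundled breast-cancer dataset as a stand-in for the datasets on the slide, with my own train/test split, scaling, and C value, and sweeps the polynomial degree while printing training and testing error.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)          # feature scaling matters for polynomial kernels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in range(1, 6):
    clf = SVC(kernel="poly", degree=degree, gamma=1.0, coef0=1.0, C=1.0).fit(X_tr, y_tr)
    train_err = 1.0 - clf.score(X_tr, y_tr)
    test_err = 1.0 - clf.score(X_te, y_te)
    print(f"degree {degree}: train error {train_err:.3f}, test error {test_err:.3f}")
```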

Page 61: SVM Performance

• Anecdotally they work very very well indeed.

• Example: They are currently the best-known classifier on a well-studied hand-written-character recognition benchmark

• Another Example: Andrew knows several reliable people doing practical real-world work who claim that SVMs have saved them when their other favorite classifiers did poorly.

• There is a lot of excitement and religious fervor about SVMs as of 2001.

• Despite this, some practitioners are a little skeptical.

Page 62: Diffusion Kernel

• A kernel function describes the correlation or similarity between two data points

• Suppose I have a function s(x,y) that describes the similarity between two data points, and assume that it is a non-negative and symmetric function. How can we generate a kernel function based on this similarity function?

• A graph theory approach …

Page 63: Diffusion Kernel

• Create a graph for the data points

• Each vertex corresponds to a data point

• The weight of each edge is the similarity s(x,y)

• Graph Laplacian:

L_ij =  s(x_i, x_j)              if i ≠ j
        −Σ_{k≠i} s(x_i, x_k)     if i = j

• Properties of the Laplacian

• Negative semi-definite

Page 64: Diffusion Kernel

• Consider a simple Laplacian

L_ij =  1             if x_i and x_j are connected (x_j ∈ N(x_i))
        −|N(x_i)|     if i = j

• Consider L², L⁴, ...

• What do these matrices represent?

• A diffusion kernel

Page 65: Diffusion Kernel

• Consider a simple Laplacian

L_ij =  1             if x_i and x_j are connected (x_j ∈ N(x_i))
        −|N(x_i)|     if i = j

• Consider L², L⁴, ...

• What do these matrices represent?

• A diffusion kernel

Page 66: Diffusion Kernel: Properties

• Positive definite

• Local relationships L induce global relationships

• Works for undirected weighted graphs with symmetric similarities: s(x_i, x_j) = s(x_j, x_i)

• How to compute the diffusion kernel e^L ?

K = e^{βL},   or equivalently   dK/dβ = L K

Page 67: Computing Diffusion Kernel

• Singular value decomposition of Laplacian L:

L = U Σ U^T = (u_1, u_2, ..., u_m) diag(σ_1, σ_2, ..., σ_m) (u_1, u_2, ..., u_m)^T
  = Σ_{i=1..m} σ_i u_i u_i^T ,    where  u_i^T u_j = δ_{i,j}

• What is L² ?

L² = ( Σ_{i=1..m} σ_i u_i u_i^T )²
   = Σ_{i,j=1..m} σ_i σ_j u_i (u_i^T u_j) u_j^T
   = Σ_{i=1..m} σ_i² u_i u_i^T

Page 68: Computing Diffusion Kernel

• What about Lⁿ ?

Lⁿ = Σ_{i=1..m} σ_iⁿ u_i u_i^T

• Compute the diffusion kernel e^L:

e^L = Σ_{n≥0} (1/n!) Lⁿ = Σ_{n≥0} (1/n!) Σ_{i=1..m} σ_iⁿ u_i u_i^T = Σ_{i=1..m} e^{σ_i} u_i u_i^T
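Both routes on this slide are a few lines of code: build the Laplacian from a similarity matrix, then either exponentiate it directly (scipy.linalg.expm) or go through the eigendecomposition as above. The sketch below is my own illustration; the similarity matrix and the β value are made up.

```python
import numpy as np
from scipy.linalg import expm

# Made-up symmetric, non-negative similarity matrix for 4 points.
S = np.array([[0.0, 0.9, 0.1, 0.0],
              [0.9, 0.0, 0.2, 0.1],
              [0.1, 0.2, 0.0, 0.8],
              [0.0, 0.1, 0.8, 0.0]])

# Graph Laplacian as on slide 63: off-diagonal s(x_i, x_j), diagonal -sum_k s(x_i, x_k).
L = S - np.diag(S.sum(axis=1))

beta = 1.0
K_direct = expm(beta * L)                      # K = e^{beta L} via the matrix exponential

# Same thing through the eigendecomposition, as on this slide.
sigma, U = np.linalg.eigh(L)                   # L = U diag(sigma) U^T
K_eig = U @ np.diag(np.exp(beta * sigma)) @ U.T

print("max difference between the two routes:", np.abs(K_direct - K_eig).max())
print("smallest eigenvalue of K (should be > 0):", np.linalg.eigvalsh(K_direct).min())
```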

Page 69: Doing multi-class classification

• SVMs can only handle two-class outputs (i.e. a categorical output variable with arity 2).

• What can be done?

• Answer: with output arity N, learn N SVMs

• SVM 1 learns “Output==1” vs “Output != 1”

• SVM 2 learns “Output==2” vs “Output != 2”

• :

• SVM N learns “Output==N” vs “Output != N”

• Then to predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region.
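This one-vs-rest recipe maps directly onto code: train one binary SVM per class and take the argmax of the decision values. A minimal sketch with scikit-learn (the library, dataset, and kernel settings are my own choices):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)          # 3 classes, used here as the N-class example
classes = np.unique(y)

# One binary SVM per class: "Output == c" vs "Output != c".
models = {c: SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, (y == c).astype(int))
          for c in classes}

def predict(X_new):
    # Score each class by its SVM's signed distance to the boundary and pick the
    # class whose prediction is pushed furthest into the positive region.
    scores = np.column_stack([models[c].decision_function(X_new) for c in classes])
    return classes[np.argmax(scores, axis=1)]

print("training accuracy:", np.mean(predict(X) == y))
```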

Page 70: Ranking Problem

• Consider a problem of ranking essays

• Three ranking categories: good, ok, bad

• Given an input document, predict its ranking category

• How should we formulate this problem?

• A simple multiple-class solution

• Each ranking category is an independent class

• But, there is something missing here …

• We miss the ordinal relationship between classes!

Page 71: Ordinal Regression

• Which choice is better?

• How could we formulate this problem?

[Figure: ‘good’, ‘OK’, and ‘bad’ essays plotted in feature space, with two candidate projection directions w and w’]

Page 72: Ordinal Regression

• What are the two decision boundaries?

w·x − b_1 = 0   and   w·x − b_2 = 0

• What is the margin for ordinal regression?

(D_g, D_o, D_b are the ‘good’, ‘OK’, and ‘bad’ documents.)

margin_1(b_1, w) = min_{x ∈ D_g ∪ D_o} d(x) = min_{x ∈ D_g ∪ D_o} |w·x − b_1| / sqrt( Σ_i w_i² )

margin_2(b_2, w) = min_{x ∈ D_o ∪ D_b} d(x) = min_{x ∈ D_o ∪ D_b} |w·x − b_2| / sqrt( Σ_i w_i² )

margin(b_1, b_2, w) = min( margin_1(b_1, w), margin_2(b_2, w) )

• Maximize margin:

{w*, b_1*, b_2*} = argmax_{w, b_1, b_2} margin(b_1, b_2, w)

Page 73: Ordinal Regression

(Same content as the previous slide: the two decision boundaries, the two margins margin_1 and margin_2, the overall margin as their minimum, and the max-margin objective.)

Page 74: Ordinal Regression

(Same content as the previous slide.)

Page 75: Ordinal Regression

• How do we solve this monster?

{w*, b_1*, b_2*} = argmax_{w, b_1, b_2} margin(b_1, b_2, w)
  = argmax_{w, b_1, b_2} min( margin_1(b_1, w), margin_2(b_2, w) )
  = argmax_{w, b_1, b_2} min( min_{x_i ∈ D_g ∪ D_o} |w·x_i − b_1| / sqrt(Σ_j w_j²),  min_{x_i ∈ D_o ∪ D_b} |w·x_i − b_2| / sqrt(Σ_j w_j²) )

subject to

∀x_i ∈ D_g :  w·x_i − b_1 ≥ 0
∀x_i ∈ D_o :  w·x_i − b_1 ≤ 0,   w·x_i − b_2 ≥ 0
∀x_i ∈ D_b :  w·x_i − b_2 ≤ 0

Page 76: Ordinal Regression

• The same old trick

• To remove the scaling invariance, set

min_{x_i ∈ D_g ∪ D_o} |w·x_i − b_1| = 1
min_{x_i ∈ D_o ∪ D_b} |w·x_i − b_2| = 1

• Now the problem is simplified as:

{w*, b_1*, b_2*} = argmin_{w, b_1, b_2} Σ_{i=1..d} w_i²

subject to

∀x_i ∈ D_g :  w·x_i − b_1 ≥ 1
∀x_i ∈ D_o :  −(w·x_i − b_1) ≥ 1,   w·x_i − b_2 ≥ 1
∀x_i ∈ D_b :  −(w·x_i − b_2) ≥ 1
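Once w, b_1, b_2 are learned, prediction is just two threshold tests on the score w·x. The sketch below is my own illustration of that decision rule under the sign convention used above (good above b_1, bad below b_2, with b_1 ≥ b_2); the weights and thresholds are made-up values, not a solution of the optimization.

```python
import numpy as np

def predict_rank(X, w, b1, b2):
    """Ordinal prediction with two parallel boundaries w.x = b1 and w.x = b2 (b1 >= b2)."""
    s = X @ w
    return np.where(s > b1, "good", np.where(s < b2, "bad", "OK"))

# Made-up parameters and documents (feature vectors) for illustration only.
w = np.array([1.0, 0.5])
b1, b2 = 2.0, -1.0
X = np.array([[3.0, 1.0], [0.5, 0.5], [-2.0, -1.0]])

print(predict_rank(X, w, b1, b2))   # ['good' 'OK' 'bad'] for these made-up numbers
```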

Page 77: Ordinal Regression

• Noisy case

{w*, b_1*, b_2*} = argmin_{w, b_1, b_2} Σ_{i=1..d} w_i² + c Σ_{x_i ∈ D_g} ε_i + c Σ_{x_i ∈ D_o} (ε_i + ε_i’) + c Σ_{x_i ∈ D_b} ε_i

subject to

∀x_i ∈ D_g :  w·x_i − b_1 ≥ 1 − ε_i,   ε_i ≥ 0
∀x_i ∈ D_o :  −(w·x_i − b_1) ≥ 1 − ε_i,   w·x_i − b_2 ≥ 1 − ε_i’,   ε_i ≥ 0,  ε_i’ ≥ 0
∀x_i ∈ D_b :  −(w·x_i − b_2) ≥ 1 − ε_i,   ε_i ≥ 0

• Is this sufficient?

Page 78: Ordinal Regression

[Figure: the ‘good’, ‘OK’, and ‘bad’ essays separated by the two parallel boundaries along w]

{w*, b_1*, b_2*} = argmin_{w, b_1, b_2} Σ_{i=1..d} w_i² + c Σ_{x_i ∈ D_g} ε_i + c Σ_{x_i ∈ D_o} (ε_i + ε_i’) + c Σ_{x_i ∈ D_b} ε_i

subject to

∀x_i ∈ D_g :  w·x_i − b_1 ≥ 1 − ε_i,   ε_i ≥ 0
∀x_i ∈ D_o :  −(w·x_i − b_1) ≥ 1 − ε_i,   w·x_i − b_2 ≥ 1 − ε_i’,   ε_i ≥ 0,  ε_i’ ≥ 0
∀x_i ∈ D_b :  −(w·x_i − b_2) ≥ 1 − ε_i,   ε_i ≥ 0

b_1 ≥ b_2

Page 79: References

• An excellent tutorial on VC-dimension and Support Vector Machines:

C.J.C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998. http://citeseer.nj.nec.com/burges98tutorial.html

• The VC/SRM/SVM Bible: (Not for beginners, including myself)

Statistical Learning Theory by Vladimir Vapnik, Wiley-Interscience, 1998

• Software: SVM-light, http://svmlight.joachims.org/, free download