
Page 1: Support Vector Machines

Andrei Alexandrescu

June 19, 2007

Page 2: Introduction

Page 3: Metaphor #1

■ Imagine a single perceptron
■ Linear separation
■ Train it to draw the separation hyperplane:
  ◆ Maximize d+ + d−, where
  ◆ d+ is the distance to the closest positive point
  ◆ d− is the distance to the closest negative point

You’ve got an SVM.

Page 4: Metaphor #2

■ Imagine creating a highway:
■ Straight
■ Trees on the left
■ Rocks on the right
■ Farthest from closest tree
■ Farthest from closest rock

You need an SVM.

Page 5: Metaphor #3

■ Imagine you want to support a metal sheet with coil springs
■ Coil springs are attached to fixed points
■ They push from two different sides
■ For the metal sheet to be supported, the forces from all springs must cancel out

The sheet will be, well, supported along an SVM plane.

Page 6: Background: Structural Risk Minimization

Page 7: Capacity and Generalization

■ Generalization: Figure out similarities between already-seen data and new data
  ◆ Too much: “Square piece of paper? That’s a $100 bill”
■ Capacity: Ability to allocate new categories for data
  ◆ Too much: “#L26118670? It’s a fake; all $100 bills I’ve seen had other serial numbers”
■ The two compete with one another
■ How to strike the right balance?

Page 8: Empirical Risk

■ We are given l observations 〈xi, yi〉
  ◆ xi ∈ Rn
  ◆ yi ∈ {−1, 1}
■ Learn y = f(x, α) by tuning α
■ Expected test error (risk) and empirical risk:

R(\alpha) = \frac{1}{2} \int |y - f(x, \alpha)| \, dP(x, y)    (1)

R_{\mathrm{emp}}(\alpha) = \frac{1}{2l} \sum_{i=1}^{l} |y_i - f(x_i, \alpha)|    (2)
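To make Eq. (2) concrete, here is a minimal numpy sketch; the toy labels below are made up for illustration:

```python
import numpy as np

def empirical_risk(y_true, y_pred):
    """R_emp from Eq. (2): sum of |y_i - f(x_i)| over l samples, divided by 2l.

    With labels in {-1, +1}, |y - f| / 2 is 0 on a correct prediction
    and 1 on an error, so this is just the 0/1 training error."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return np.abs(y_true - y_pred).sum() / (2 * len(y_true))

# Hypothetical example: 3 of 4 predictions correct -> R_emp = 0.25
print(empirical_risk([+1, -1, +1, -1], [+1, -1, -1, -1]))  # 0.25
```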

Page 9: Risk Bound

■ For 0/1 loss and 0 < η < 1, the following holds with probability at least 1 − η:

R(\alpha) \le R_{\mathrm{emp}}(\alpha) + \sqrt{\frac{h \left(1 + \log \frac{2l}{h}\right) - \log \frac{\eta}{4}}{l}}    (3)

where h ∈ N is the Vapnik-Chervonenkis (VC) dimension
■ Second term: the “VC confidence”
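A quick sketch evaluating the VC confidence term of Eq. (3); the values of l, h, and η below are hypothetical:

```python
from math import log, sqrt

def vc_confidence(l, h, eta):
    """Second term of Eq. (3): sqrt((h(1 + log(2l/h)) - log(eta/4)) / l)."""
    return sqrt((h * (1 + log(2 * l / h)) - log(eta / 4)) / l)

# Hypothetical: 10,000 samples, VC dimension 50, eta = 0.05
print(vc_confidence(l=10_000, h=50, eta=0.05))  # ~0.19
# The term shrinks as l grows and grows with h: capacity costs us.
```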

Page 10: Importance of Risk Bound

1. Not dependent on P(x, y)
2. lhs not computable
3. rhs computable if we know h

■ For a given task, choose the machine that minimizes the risk bound!
■ Even when the bound is not tight, we can contrast the “tightness” of various families of machines

Page 11: The VC dimension

■ For a family of functions f(α):
  ◆ Choose a set of l points
  ◆ Label them in any way
  ◆ ∃α s.t. f(α) can recognize (“shatter”) them
■ Then f(α) has VC dimension at least l
■ The VC dimension h is the largest such l

Page 12: Example: Hyperplanes in Rn

■ Choosing 4 planar points:
  ◆ they can’t be separated by one line for all of their possible labelings (one labeling will be inseparable)
■ Similarly, n + 2 points in Rn can’t be separated for all labelings
■ So the VC dimension of hyperplanes in Rn is n + 1
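To see the shattering argument computationally, here is a brute-force sketch of my own (not from the slides): it tests every ±1 labeling for linear separability with a small feasibility LP, assuming numpy and scipy are available.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(X, y):
    """Feasibility of y_i (w . x_i + b) >= 1 as an LP in the variables (w, b)."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    n = X.shape[1]
    # Variables z = (w, b); constraints -y_i (x_i . w + b) <= -1.
    A = -y[:, None] * np.hstack([X, np.ones((len(X), 1))])
    res = linprog(c=np.zeros(n + 1), A_ub=A, b_ub=-np.ones(len(X)),
                  bounds=[(None, None)] * (n + 1))
    return res.success

def shattered(X):
    """True if every +/-1 labeling of the points in X is linearly separable."""
    return all(separable(X, y)
               for y in itertools.product([-1, 1], repeat=len(X)))

print(shattered([[0, 0], [1, 0], [0, 1]]))          # True: 3 points shatter
print(shattered([[0, 0], [0, 1], [1, 0], [1, 1]]))  # False: XOR labeling fails
```

Three non-collinear points pass; the four corners fail on the XOR-style labeling, matching a VC dimension of 3 for lines in the plane.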

Page 13: Not a Hard and Fast Formula

■ Consider a nearest-neighbor (NN) classifier
■ VC dimension is infinite
■ Remp = 0
■ The bound is irrelevant, yet the NN classifier can perform well in many situations

Page 14: Corollary

We’d like to find a machine able to zero the empirical risk (sufficient capacity) and minimize the VC dimension (capacity not wastefully large).

Page 15: The SVM Connection

Page 16: Linear SVMs

■ Training data {xi, yi}, i = 1, . . . , l, with xi ∈ Rn and yi ∈ {−1, 1}
■ On a separating hyperplane: xw + b = 0, where
  ◆ w is normal to the hyperplane
  ◆ |b| / ‖w‖ is the hyperplane’s distance to the origin
  ◆ ‖w‖ is the Euclidean norm of w

Page 17: Linear SVMs (cont’d)

■ d+, d−: the shortest distances from the positive and negative points to the hyperplane
■ Define the margin m = d+ + d−
■ Task: find the separating hyperplane that maximizes m

Key point: maximizing the margin minimizes the VC dimension.

Page 18: Computation

■ For the separating plane:

x_i w + b \ge +1 \quad \text{for } y_i = +1    (4)

x_i w + b \le -1 \quad \text{for } y_i = -1    (5)

\equiv    (6)

y_i (x_i w + b) - 1 \ge 0, \; \forall i    (7)

■ The closest points on each side satisfy the equalities, i.e. |xi w + b| = 1, so each lies at distance 1/‖w‖ from the plane:

d_+ + d_- = \frac{1}{\|w\|} + \frac{1}{\|w\|} = \frac{2}{\|w\|}    (8)
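A numeric sanity check of Eq. (8), with a hypothetical canonical (w, b) chosen purely for illustration:

```python
import numpy as np

w, b = np.array([3.0, 4.0]), -2.0      # hypothetical canonical hyperplane
norm = np.linalg.norm(w)               # ||w|| = 5

# Closest points: any x with x.w + b = +1 (positive) or -1 (negative).
x_pos = np.array([1.0, 0.0])           # 3*1 + 4*0 - 2 = +1
x_neg = np.array([1.0 / 3.0, 0.0])     # 3*(1/3) + 0 - 2 = -1

d_plus = abs(x_pos @ w + b) / norm     # 1/||w|| = 0.2
d_minus = abs(x_neg @ w + b) / norm    # 1/||w|| = 0.2
print(d_plus + d_minus, 2 / norm)      # both 0.4: the margin 2/||w||
```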

Page 19: Switching to Lagrangian

■ One coefficient per training sample
■ The constraints become easier to handle
■ Training data appears only in dot products
■ Great for applying the kernel trick later on

Page 20: Lagrangian Form

■ Minimize

L_P = \frac{\|w\|^2}{2} - \sum_{i=1}^{l} \alpha_i y_i (x_i w + b) + \sum_{i=1}^{l} \alpha_i    (9)

■ Convex quadratic programming problem with the dual: maximize

L_D = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)    (10)

subject to αi ≥ 0 and Σi αi yi = 0
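As a sketch of what solving the dual (10) involves, here is a toy problem handed to scipy's general-purpose SLSQP routine; real SVM trainers use dedicated QP/SMO solvers, and the 2-D data below is made up:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
G = (y[:, None] * X) @ (y[:, None] * X).T   # G_ij = y_i y_j (x_i . x_j)

# Maximize L_D  <=>  minimize -L_D, s.t. alpha_i >= 0, sum_i alpha_i y_i = 0
res = minimize(lambda a: 0.5 * a @ G @ a - a.sum(),
               x0=np.zeros(len(y)),
               jac=lambda a: G @ a - np.ones(len(y)),
               bounds=[(0, None)] * len(y),
               constraints=[{'type': 'eq', 'fun': lambda a: a @ y}],
               method='SLSQP')
alpha = res.x

w = (alpha * y) @ X                          # w = sum_i alpha_i y_i x_i
sv = np.argmax(alpha)                        # index of one support vector
b = y[sv] - w @ X[sv]                        # from y_s (x_s . w + b) = 1
print(alpha.round(3), w.round(3), round(b, 3))
```

The αi that come back nonzero flag the support vectors discussed on the next slide.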

Page 21: The Support Vectors

■ The points with αi > 0 are the support vectors
■ The solution depends only on them
■ All others have αi = 0 and can be moved arbitrarily far from the decision hyperplane, or removed

Page 22: Testing

■ Once the hyperplane is found:

y = sgn(wx + b) (11)

Page 23: Unseparable Data

■ Add slack variables ξi ≥ 0:

x_i w + b \ge +1 - \xi_i \quad \text{for } y_i = +1    (12)

x_i w + b \le -1 + \xi_i \quad \text{for } y_i = -1    (13)

■ The Lagrangian formulation is only influenced by an upper bound C on the αi
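In practice the bound C is the soft-margin hyperparameter. A hedged scikit-learn sketch (assuming the library is available; the overlapping blobs are synthetic) showing that a small C caps the αi and typically recruits more support vectors:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Hypothetical overlapping classes: two noisy Gaussian blobs
X = np.vstack([rng.normal(+1, 1.0, (50, 2)), rng.normal(-1, 1.0, (50, 2))])
y = np.array([+1] * 50 + [-1] * 50)

for C in (100.0, 0.01):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    print(f"C={C}: {len(clf.support_)} support vectors")
# Small C caps the alphas, so more points end up with alpha_i at the bound.
```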

Page 24: Nonlinear SVMs

Page 25: Space Transformation (RKHS)

■ Take points from Rd to some space H:

\Phi : \mathbb{R}^d \to \mathcal{H}    (14)

■ Choose a kernel function K such that

K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)    (15)

■ Since in the Lagrangian formulation we only have the xi in dot products (remember?), we don’t even need to know Φ!
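Eq. (15) can be verified by hand for the degree-2 polynomial kernel on R2, where an explicit Φ is a standard textbook construction (the map below is my illustration, not from the slides):

```python
import numpy as np

def phi(x):
    """Explicit feature map for K(x, y) = (x . y)^2 on R^2:
    Phi(x) = (x1^2, sqrt(2) x1 x2, x2^2) in R^3."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print((x @ z) ** 2, phi(x) @ phi(z))  # both 1.0: kernel == dot product in H
```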

Page 26: Kernel Example

■ Gaussian kernel:

K(x_i, x_j) = e^{-\frac{\|x_i - x_j\|^2}{2\sigma^2}}    (16)

■ Polynomial:

K(x_i, x_j) = (x_i \cdot x_j)^p    (17)

■ Kinda sorta neural net!

K(x_i, x_j) = \tanh(\kappa \, x_i \cdot x_j - \delta)    (18)
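All three kernels are one-liners; a small sketch with illustrative default parameters:

```python
import numpy as np

def gaussian(xi, xj, sigma=1.0):            # Eq. (16)
    return np.exp(-np.linalg.norm(xi - xj) ** 2 / (2 * sigma ** 2))

def polynomial(xi, xj, p=2):                # Eq. (17)
    return (xi @ xj) ** p

def sigmoid(xi, xj, kappa=1.0, delta=0.0):  # Eq. (18)
    return np.tanh(kappa * (xi @ xj) - delta)
```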

Page 27: Using Kernels

■ Just replace xi·xj with K(xi, xj) everywhere and the magic is complete
■ Training is identical and takes similar time
■ Separation is still linear, but in a different space (possibly infinite-dimensional!)
■ Same sleight of hand for testing, where s1, . . . , sNs are the support vectors:

y = \operatorname{sgn}\left( \sum_{i=1}^{N_s} \alpha_i y_i K(s_i, x) + b \right)    (19)
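As a sanity check of Eq. (19), here is a sketch reconstructing a trained RBF SVM's decision values from its support vectors; it assumes scikit-learn, whose dual_coef_ attribute stores the products αi yi, and uses made-up data:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = np.sign(X[:, 0] * X[:, 1])           # hypothetical XOR-ish nonlinear labels

clf = SVC(kernel='rbf', gamma=0.5).fit(X, y)

x_new = rng.normal(size=(5, 2))
K = rbf_kernel(x_new, clf.support_vectors_, gamma=0.5)    # K(x, s_i)
manual = K @ clf.dual_coef_[0] + clf.intercept_[0]        # Eq. (19) before sgn
print(np.allclose(manual, clf.decision_function(x_new)))  # True
```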

Page 28: Loose Ends. Conclusions

Page 29: SVM For Multiple Classes

■ Build n “one-versus-all” classifiers
■ Essentially costs n times the complexity of one classifier
  ◆ During testing choose the most confident answer
■ Build n(n−1)/2 “one-versus-one” classifiers
  ◆ Decide by voting where the data belongs
  ◆ Many classifiers, but little data for each
■ DAGSVM (Platt): Same training time, n times testing time
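Both reductions come prepackaged in scikit-learn; a brief sketch (assuming the library and its bundled digits dataset are available):

```python
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)     # 10 classes, so n = 10
ova = OneVsRestClassifier(SVC(kernel='linear')).fit(X, y)  # n classifiers
ovo = OneVsOneClassifier(SVC(kernel='linear')).fit(X, y)   # n(n-1)/2 classifiers
print(len(ova.estimators_), len(ovo.estimators_))          # 10 and 45
```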

Page 30: Soft Outputs

■ sgn is discontinuous; we want to restore confidence
■ SVM outputs can be mapped to posterior probabilities (Platt)
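Platt's method fits a sigmoid to the raw SVM scores; scikit-learn exposes it through probability=True. A sketch on synthetic data (assuming the library is available):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(+1, 1.0, (50, 2)), rng.normal(-1, 1.0, (50, 2))])
y = np.array([+1] * 50 + [-1] * 50)

clf = SVC(kernel='linear', probability=True).fit(X, y)  # internal Platt fit
print(clf.predict_proba(X[:3]))  # soft posteriors instead of hard sgn outputs
```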

Page 31: Conclusions

■ Powerful theoretical grounds
■ Global, unique solution
■ Performance depends on choice of kernel and parameters
  ◆ Still a research topic
■ Training is memory-intensive; chunking must be used
■ Complexity dependent on the # of support vectors