CS 2750: Machine Learning Support Vector Machines Prof. Adriana Kovashka University of Pittsburgh February 16, 2017


Page 1:

CS 2750: Machine Learning

Support Vector Machines

Prof. Adriana Kovashka, University of Pittsburgh

February 16, 2017

Page 2:

Plan for today

• Linear Support Vector Machines

• Non-linear SVMs and the “kernel trick”

• Extensions (briefly)

– Soft-margin SVMs

– Multi-class SVMs

– Hinge loss

– SVM vs logistic regression

– SVMs with latent variables

• How to solve the SVM problem (next class)

Page 3:

Lines in R2

ax + cy + b = 0

Let w = [a, c] and x = [x, y].

Kristen Grauman

Page 4:

Lines in R2

ax + cy + b = 0

Let w = [a, c] and x = [x, y]. Then the line can be written as w · x + b = 0.

Kristen Grauman

Page 5:

Lines in R2

ax + cy + b = 0, i.e. w · x + b = 0 with w = [a, c] and x = [x, y].

Consider a point (x0, y0).

Kristen Grauman

Page 6:

Lines in R2

ax + cy + b = 0, i.e. w · x + b = 0 with w = [a, c] and x = [x, y].

Distance D from the point (x0, y0) to the line:

D = (a x0 + c y0 + b) / sqrt(a^2 + c^2)

Kristen Grauman

Page 7:

Lines in R2

ax + cy + b = 0, i.e. w · x + b = 0 with w = [a, c] and x = [x, y].

Distance D from the point (x0, y0) to the line:

D = |a x0 + c y0 + b| / sqrt(a^2 + c^2) = |w · x0 + b| / ||w||

Kristen Grauman
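A quick numeric check of the distance formula, written as a small Python sketch (the line and the point are made-up values, purely for illustration):

```python
import numpy as np

# Line ax + cy + b = 0, written as w . x + b = 0 with w = [a, c]
w = np.array([3.0, 4.0])   # a = 3, c = 4 (illustrative values)
b = -5.0
x0 = np.array([2.0, 1.0])  # the point (x0, y0)

# Distance from the point to the line: |w . x0 + b| / ||w||
D = np.abs(w @ x0 + b) / np.linalg.norm(w)
print(D)  # |3*2 + 4*1 - 5| / 5 = 5 / 5 = 1.0
```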

Page 8:

Linear classifiers

• Find a linear function to separate positive and negative examples:

positive: xi · w + b ≥ 0
negative: xi · w + b < 0

Which line is best?

C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

Page 9:

Support vector machines

• Discriminative classifier based on optimal separating line (for 2d case)

• Maximize the margin between the positive and negative training examples

C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

Page 10:

Support vector machines

• Want line that maximizes the margin.

positive (yi = 1): xi · w + b ≥ 1
negative (yi = -1): xi · w + b ≤ -1

For support vectors, xi · w + b = ±1

(Figure labels: Margin, Support vectors)

C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

Page 11:

Support vector machines

• Want line that maximizes the margin.

positive (yi = 1): xi · w + b ≥ 1
negative (yi = -1): xi · w + b ≤ -1

For support vectors, xi · w + b = ±1

Distance between point and line: |xi · w + b| / ||w||

For support vectors: (xi · w + b) / ||w|| = ±1 / ||w||, so the margin is
M = 1/||w|| - (-1/||w||) = 2/||w||

C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

Page 12:

Support vector machines

• Want line that maximizes the margin.

positive (yi = 1): xi · w + b ≥ 1
negative (yi = -1): xi · w + b ≤ -1

For support vectors, xi · w + b = ±1

Distance between point and line: |xi · w + b| / ||w||

Therefore, the margin is 2 / ||w||

C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

Page 13:

Finding the maximum margin line

1. Maximize margin 2/||w||

2. Correctly classify all training data points:

positive (yi = 1): xi · w + b ≥ 1
negative (yi = -1): xi · w + b ≤ -1

Quadratic optimization problem:

Minimize (1/2) wT w

Subject to yi(w·xi + b) ≥ 1

One constraint for each training point. Note sign trick.

C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
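To make the phrase "quadratic optimization problem" concrete, here is a rough sketch that hands the primal directly to a generic convex solver. The use of cvxpy and the toy data are my own choices for illustration; how the problem is actually solved is covered next class.

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data: rows of X are points, y in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# Minimize (1/2) wT w subject to yi(w·xi + b) >= 1 for every training point
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)  # parameters of the maximum-margin line
```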

Page 14:

Finding the maximum margin line

• Solution: w = Σi αi yi xi

(αi: learned weight; xi: support vector)

C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

Page 15:

Finding the maximum margin line

• Solution: w = Σi αi yi xi

b = yi – w·xi (for any support vector)

• Classification function:

f(x) = sign(w·x + b) = sign(Σi αi yi xi·x + b)

• Notice that it relies on an inner product between the test point x and the support vectors xi

• (Solving the optimization problem also involves computing the inner products xi · xj between all pairs of training points)

If f(x) < 0, classify as negative, otherwise classify as positive.

C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

MORE DETAILS NEXT TIME
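A minimal sketch of that classification function, assuming the alphas, support vectors, and bias have already been obtained from the optimization (the numbers below are placeholders, not a solved model):

```python
import numpy as np

def svm_predict(x, support_vectors, alphas, labels, b):
    """f(x) = sign(sum_i alpha_i * y_i * (xi · x) + b), using only the support vectors."""
    score = np.sum(alphas * labels * (support_vectors @ x)) + b
    return -1 if score < 0 else 1   # f(x) < 0 -> negative, otherwise positive

# Placeholder values standing in for a trained SVM
support_vectors = np.array([[1.0, 1.0], [-1.0, -1.0]])
alphas = np.array([0.5, 0.5])
labels = np.array([1.0, -1.0])
b = 0.0

print(svm_predict(np.array([2.0, 0.5]), support_vectors, alphas, labels, b))  # prints 1
```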

Page 16:

Inner product

f(x) = sign(w·x + b) = sign(Σi αi yi xi·x + b)

Adapted from Milos Hauskrecht

Page 17:

Plan for today

• Linear Support Vector Machines

• Non-linear SVMs and the “kernel trick”

• Extensions (briefly)

– Soft-margin SVMs

– Multi-class SVMs

– Hinge loss

– SVM vs logistic regression

– SVMs with latent variables

• How to solve the SVM problem (next class)

Page 18:

Nonlinear SVMs

• Datasets that are linearly separable work out great:

• But what if the dataset is just too hard?

• We can map it to a higher-dimensional space:

(Figure: 1-d data on the x axis is not linearly separable; after mapping each point x to (x, x^2) it becomes separable.)

Andrew Moore

Page 19:

Nonlinear SVMs

• General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable:

Φ: x → φ(x)

Andrew Moore

Page 20:

Nonlinear kernel: Example

• Consider the mapping φ(x) = (x, x^2)

φ(x) · φ(y) = (x, x^2) · (y, y^2) = xy + x^2 y^2

K(x, y) = xy + x^2 y^2

Svetlana Lazebnik
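A quick numeric check of this example, treating x and y as scalars as on the slide:

```python
# phi(x) = (x, x^2); verify that phi(x) · phi(y) equals K(x, y) = xy + x^2 y^2
def phi(x):
    return (x, x ** 2)

def K(x, y):
    return x * y + (x ** 2) * (y ** 2)

x, y = 3.0, -2.0
explicit = phi(x)[0] * phi(y)[0] + phi(x)[1] * phi(y)[1]  # lift, then dot product
print(explicit, K(x, y))  # both print 30.0
```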

Page 21:

The "kernel trick"

• The linear classifier relies on the dot product between vectors: K(xi, xj) = xi · xj

• If every data point is mapped into a high-dimensional space via some transformation Φ: xi → φ(xi), the dot product becomes: K(xi, xj) = φ(xi) · φ(xj)

• A kernel function is a similarity function that corresponds to an inner product in some expanded feature space

• The kernel trick: instead of explicitly computing the lifting transformation φ(x), define a kernel function K such that: K(xi, xj) = φ(xi) · φ(xj)

Andrew Moore

Page 22:

Examples of kernel functions

• Linear: K(xi, xj) = xiT xj

• Polynomials of degree up to d: K(xi, xj) = (xiT xj + 1)^d

• Gaussian RBF: K(xi, xj) = exp(-||xi - xj||^2 / (2σ^2))

• Histogram intersection: K(xi, xj) = Σk min(xi(k), xj(k))

Andrew Moore / Carlos Guestrin
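The four kernels above, written out as small NumPy functions (a sketch; xi and xj are 1-d feature vectors and sigma is the RBF bandwidth):

```python
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj                                    # xiT xj

def polynomial_kernel(xi, xj, d=2):
    return (xi @ xj + 1) ** d                         # (xiT xj + 1)^d

def gaussian_rbf_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))

def histogram_intersection_kernel(xi, xj):
    return np.sum(np.minimum(xi, xj))                 # Σk min(xi(k), xj(k))
```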

Page 23:

The benefit of the “kernel trick”

• Example: Polynomial kernel for 2-dim features

• … lives in 6 dimensions

Page 24:

Is this function a kernel?

Blaschko / Lampert

Page 25:

Constructing kernels

Blaschko / Lampert

Page 26:

Using SVMs

1. Define your representation for each example.

2. Select a kernel function.

3. Compute pairwise kernel values between labeled examples.

4. Use this "kernel matrix" to solve for SVM support vectors & alpha weights.

5. To classify a new example: compute kernel values between new input and support vectors, apply alpha weights, check sign of output. (See the sketch below.)

Adapted from Kristen Grauman
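A hedged sketch of steps 2-5 using scikit-learn's precomputed-kernel interface; the library, the Gaussian RBF choice, and the toy data are mine, not the slide's:

```python
import numpy as np
from sklearn.svm import SVC

def rbf(A, B, sigma=1.0):
    # Pairwise Gaussian RBF kernel values between rows of A and rows of B
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [5.0, 5.0]])
y_train = np.array([0, 0, 1, 1])

clf = SVC(kernel="precomputed")
clf.fit(rbf(X_train, X_train), y_train)      # steps 3-4: kernel matrix -> support vectors, alphas

X_new = np.array([[4.5, 4.5]])
print(clf.predict(rbf(X_new, X_train)))      # step 5: kernel values vs. training points -> label
```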

Page 27:

Example: Learning gender w/ SVMs

Moghaddam and Yang, Learning Gender with Support Faces, TPAMI 2002

Moghaddam and Yang, Face & Gesture 2000

Kristen Grauman

Page 28:

Example: Learning gender w/ SVMs

Support faces

Kristen Grauman

Page 29:

Example: Learning gender w/ SVMs

• SVMs performed better than humans, at either resolution

Kristen Grauman

Page 30:

Plan for today

• Linear Support Vector Machines

• Non-linear SVMs and the “kernel trick”

• Extensions (briefly)

– Soft-margin SVMs

– Multi-class SVMs

– Hinge loss

– SVM vs logistic regression

– SVMs with latent variables

• How to solve the SVM problem (next class)

Page 31:

Hard-margin SVMs

Maximize margin: find the w that minimizes (1/2)||w||^2 subject to yi(w·xi + b) ≥ 1 for all training points.

Page 32:

Soft-margin SVMs (allowing misclassifications)

Maximize margin while minimizing misclassification: find the w that minimizes

(1/2)||w||^2 + C Σi ξi

subject to yi(w·xi + b) ≥ 1 - ξi and ξi ≥ 0 for i = 1, …, n,

where ξi is the slack variable for example i, C is the misclassification cost, and n is the number of data samples.

(Derivation worked on the board.)
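In practice the misclassification cost C is a hyperparameter of whatever SVM solver is used. A small hedged sketch with scikit-learn (the library and the toy data are my choices):

```python
import numpy as np
from sklearn.svm import LinearSVC

X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 3.0], [4.0, 3.0], [1.5, 1.5]])
y = np.array([-1, -1, 1, 1, -1])   # the last point sits close to the other class

# Small C: slack is cheap, so the margin stays wide even if points are misclassified.
# Large C: slack is expensive, approaching hard-margin behavior on separable data.
for C in (0.01, 100.0):
    clf = LinearSVC(C=C).fit(X, y)
    print(C, clf.coef_, clf.intercept_)
```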

Page 33:

Multi-class SVMs

• In practice, we obtain a multi-class SVM by combining two-class SVMs

• One vs. others

– Training: learn an SVM for each class vs. the others

– Testing: apply each SVM to the test example, and assign it to the class of the SVM that returns the highest decision value

• One vs. one

– Training: learn an SVM for each pair of classes

– Testing: each learned SVM “votes” for a class to assign to the test example

• There are also “natively multi-class” formulations

– Crammer and Singer, JMLR 2001

Svetlana Lazebnik / Carlos Guestrin
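A minimal sketch of the "one vs. others" scheme built from two-class SVMs, using scikit-learn's LinearSVC as a stand-in binary SVM (the data and class labels are made up):

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_one_vs_rest(X, y, classes):
    # One binary SVM per class: this class (+1) vs. all the others (-1)
    return {c: LinearSVC().fit(X, np.where(y == c, 1, -1)) for c in classes}

def predict_one_vs_rest(models, x):
    # Assign the test example to the class whose SVM returns the highest decision value
    scores = {c: m.decision_function(x.reshape(1, -1))[0] for c, m in models.items()}
    return max(scores, key=scores.get)

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [11, 0]], dtype=float)
y = np.array([0, 0, 1, 1, 2, 2])
models = train_one_vs_rest(X, y, classes=[0, 1, 2])
print(predict_one_vs_rest(models, np.array([9.0, 0.5])))  # expected: 2
```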

Page 34:

Hinge loss

• Let f(xi) = w·xi + b

• We have the objective to minimize (1/2)||w||^2 + C Σi ξi where yi f(xi) ≥ 1 - ξi and ξi ≥ 0

• Then we can define a loss: L(yi, f(xi)) = max(0, 1 - yi f(xi))

• and unconstrained SVM objective: minw (1/2)||w||^2 + C Σi max(0, 1 - yi f(xi))
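The unconstrained objective is easy to write directly in code; a small sketch with made-up values:

```python
import numpy as np

def hinge_loss(y, score):
    # L(y, f(x)) = max(0, 1 - y * f(x)): zero once the point is beyond the margin
    return np.maximum(0.0, 1.0 - y * score)

def svm_objective(w, b, X, y, C=1.0):
    # (1/2)||w||^2 + C * sum_i max(0, 1 - yi (w·xi + b))
    scores = X @ w + b
    return 0.5 * np.dot(w, w) + C * np.sum(hinge_loss(y, scores))

X = np.array([[1.0, 2.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
print(svm_objective(np.array([0.5, 0.5]), 0.0, X, y))  # 0.25: both points are beyond the margin
```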

Page 35:

Multi-class hinge loss

• We want: the score of the correct class to beat the score of every other class by a margin, e.g. wyi·xi ≥ wj·xi + 1 for all j ≠ yi

• Minimize: a regularization term plus the hinge losses summed over all training examples

• Hinge loss (one standard form): Li = Σj≠yi max(0, 1 + wj·xi - wyi·xi)

Page 36:

SVMs vs logistic regression

Adapted from Tommi Jaakkola

Page 37:

SVMs vs logistic regression

Adapted from Tommi Jaakkola

Page 38:

SVMs with latent variables

Adapted from S. Nowozin and C. Lampert

Page 39:

SVMs: Pros and cons

• Pros

– Kernel-based framework is very powerful, flexible

– Often a sparse set of support vectors – compact at test time

– Work very well in practice, even with very small training sample sizes

– Solution can be formulated as a quadratic program (next time)

– Many publicly available SVM packages: e.g. LIBSVM, LIBLINEAR, SVMLight

• Cons

– Can be tricky to select best kernel function for a problem

– Computation, memory

• At training time, must compute kernel values for all example pairs

• Learning can take a very long time for large-scale problems

Adapted from Lana Lazebnik