Source: people.cs.pitt.edu/~kovashka/cs2750_sp17/ml_09_svm.pdf


CS 2750: Machine Learning

Support Vector Machines

Prof. Adriana KovashkaUniversity of Pittsburgh

February 16, 2017

Plan for today

• Linear Support Vector Machines

• Non-linear SVMs and the “kernel trick”

• Extensions (briefly)

– Soft-margin SVMs

– Multi-class SVMs

– Hinge loss

– SVM vs logistic regression

– SVMs with latent variables

• How to solve the SVM problem (next class)

Lines in R2

Let w = [a, c] and x = [x, y]. Then the line

ax + cy + b = 0

can be written as

w · x + b = 0

Distance from a point (x0, y0) to the line:

D = |ax0 + cy0 + b| / sqrt(a² + c²) = |w · (x0, y0) + b| / ||w||

Kristen Grauman
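A quick numeric sketch (my own, with made-up numbers) of the point-to-line distance formula above:

```python
# Distance from a point to the line w·x + b = 0, matching the formula above.
import numpy as np

w = np.array([3.0, 4.0])     # line 3x + 4y + b = 0, so ||w|| = 5
b = -10.0
p = np.array([4.0, 1.0])     # query point (x0, y0)

D = abs(np.dot(w, p) + b) / np.linalg.norm(w)
print(D)                      # |3*4 + 4*1 - 10| / 5 = 1.2
```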

Linear classifiers

• Find linear function to separate positive and negative examples:

xi positive: w · xi + b ≥ 0

xi negative: w · xi + b < 0

Which line is best?

C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

Support vector machines

• Discriminative classifier based on the optimal separating line (for the 2-d case)

• Maximize the margin between the positive and negative training examples

C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

Support vector machines

• Want the line that maximizes the margin.

xi positive (yi = 1):  xi · w + b ≥ 1

xi negative (yi = -1): xi · w + b ≤ -1

For support vectors, xi · w + b = ±1

Distance between point and line: |xi · w + b| / ||w||

Therefore, the margin is 2 / ||w||

[Figure: separating line with the two margin lines; the points on the margin are the support vectors]

C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
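To connect the distance formula to the stated margin, here is the short derivation the slide summarizes (standard, with the intermediate steps filled in by me):

```latex
\begin{align*}
\text{distance from } x_i \text{ to the line } w \cdot x + b = 0:\quad
  D &= \frac{|w \cdot x_i + b|}{\lVert w \rVert} \\
\text{for a support vector } (w \cdot x_i + b = \pm 1):\quad
  D &= \frac{1}{\lVert w \rVert} \\
\text{margin (distance between the two margin lines):}\quad
  M &= \frac{1}{\lVert w \rVert} + \frac{1}{\lVert w \rVert} = \frac{2}{\lVert w \rVert}
\end{align*}
```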

Finding the maximum margin line

1. Maximize margin 2 / ||w||

2. Correctly classify all training data points:

xi positive (yi = 1):  xi · w + b ≥ 1

xi negative (yi = -1): xi · w + b ≤ -1

Quadratic optimization problem:

Minimize (1/2) wᵀw

Subject to yi(w·xi + b) ≥ 1

One constraint for each training point. Note the sign trick: multiplying by yi folds the two cases above into a single constraint.

C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
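The slides defer the solution method to the next class; as a preview, here is a minimal sketch (my own, not the lecture's solver) that hands exactly this primal problem, minimize (1/2)wᵀw subject to yi(w·xi + b) ≥ 1, to a generic constrained optimizer on a made-up toy dataset:

```python
# Hard-margin SVM primal solved with a generic constrained optimizer (illustrative only;
# the dedicated QP machinery is covered next class). Toy data is made up.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])  # linearly separable
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(z):
    w = z[:2]                      # optimization variable z = [w1, w2, b]
    return 0.5 * np.dot(w, w)      # (1/2) w^T w

# One inequality constraint per training point: yi (w·xi + b) - 1 >= 0
constraints = [
    {"type": "ineq", "fun": lambda z, xi=xi, yi=yi: yi * (np.dot(z[:2], xi) + z[2]) - 1.0}
    for xi, yi in zip(X, y)
]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b, "margin =", 2.0 / np.linalg.norm(w))
```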

Finding the maximum margin line

• Solution: w = Σi αi yi xi

(xi: support vector, αi: learned weight)

C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

Finding the maximum margin line

• Solution: w = Σi αi yi xi

b = yi – w·xi (for any support vector)

• Classification function:

f(x) = sign(w·x + b) = sign(Σi αi yi (xi · x) + b)

• Notice that it relies on an inner product between the test

point x and the support vectors xi

• (Solving the optimization problem also involves

computing the inner products xi · xj between all pairs of

training points)

If f(x) < 0, classify as negative; otherwise classify as positive.

C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
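A minimal sketch (my own illustration, not from the slides) of evaluating f(x) = sign(Σi αi yi (xi · x) + b) directly from a trained SVM's support vectors, checked against scikit-learn's own decision function; the toy data is made up:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C approximates a hard margin

x_test = np.array([1.0, 0.5])
# dual_coef_ stores alpha_i * y_i for each support vector
score = clf.dual_coef_[0] @ (clf.support_vectors_ @ x_test) + clf.intercept_[0]
print("manual f(x):", np.sign(score))
print("sklearn    :", clf.predict(x_test.reshape(1, -1))[0])
```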

MORE DETAILS NEXT TIME

Inner product

f(x) = sign(w·x + b) = sign(Σi αi yi (xi · x) + b)

The classification function touches the data only through the inner product (xi · x).

Adapted from Milos Hauskrecht

Plan for today

• Linear Support Vector Machines

• Non-linear SVMs and the “kernel trick”

• Extensions (briefly)

– Soft-margin SVMs

– Multi-class SVMs

– Hinge loss

– SVM vs logistic regression

– SVMs with latent variables

• How to solve the SVM problem (next class)

Nonlinear SVMs

• Datasets that are linearly separable work out great:

• But what if the dataset is just too hard?

• We can map it to a higher-dimensional space:

[Figures omitted: 1-D data plotted along the x axis, then lifted to the (x, x²) space]

Andrew Moore

Nonlinear SVMs

• General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable:

Φ: x → φ(x)

Andrew Moore

Nonlinear kernel: Example

• Consider the mapping φ(x) = (x, x²)

φ(x) · φ(y) = (x, x²) · (y, y²) = xy + x²y²

K(x, y) = xy + x²y²

[Figure omitted: the data plotted in the (x, x²) space]

Svetlana Lazebnik
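A small numeric check (my own illustration) that the kernel value K(x, y) = xy + x²y² equals the dot product of the lifted features φ(x) = (x, x²):

```python
import numpy as np

def phi(x):
    return np.array([x, x**2])          # explicit lifting map

def K(x, y):
    return x * y + x**2 * y**2          # kernel computed in the original 1-D space

for x, y in [(1.5, -2.0), (0.3, 0.7), (-1.0, 4.0)]:
    assert np.isclose(K(x, y), phi(x) @ phi(y))
print("K(x, y) matches phi(x) · phi(y) on all test pairs")
```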

The "kernel trick"

• The linear classifier relies on the dot product between vectors: K(xi, xj) = xi · xj

• If every data point is mapped into a high-dimensional space via some transformation Φ: xi → φ(xi), the dot product becomes: K(xi, xj) = φ(xi) · φ(xj)

• A kernel function is a similarity function that corresponds to an inner product in some expanded feature space

• The kernel trick: instead of explicitly computing the lifting transformation φ(x), define a kernel function K such that: K(xi, xj) = φ(xi) · φ(xj)

Andrew Moore

Examples of kernel functions

• Linear: K(xi, xj) = xiᵀxj

• Polynomials of degree up to d: K(xi, xj) = (xiᵀxj + 1)ᵈ

• Gaussian RBF: K(xi, xj) = exp(−‖xi − xj‖² / (2σ²))

• Histogram intersection: K(xi, xj) = Σk min(xi(k), xj(k))

Andrew Moore / Carlos Guestrin
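Minimal sketches (my own, not from the slides) of the four kernels above, each taking two feature vectors as NumPy arrays; the parameter defaults are arbitrary:

```python
import numpy as np

def linear_kernel(x, y):
    return x @ y

def polynomial_kernel(x, y, d=3):
    return (x @ y + 1.0) ** d

def gaussian_rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def histogram_intersection_kernel(x, y):
    # assumes x and y are (non-negative) histograms of the same length
    return np.sum(np.minimum(x, y))

x = np.array([1.0, 2.0, 0.5])
y = np.array([0.5, 1.0, 2.0])
print(linear_kernel(x, y), polynomial_kernel(x, y, d=2),
      gaussian_rbf_kernel(x, y, sigma=0.5), histogram_intersection_kernel(x, y))
```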

The benefit of the “kernel trick”

• Example: Polynomial kernel for 2-dim features

• … lives in 6 dimensions
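A hedged check of the "6 dimensions" claim, using my own expansion of (x·y + 1)² for 2-dim features (not copied from the slide): the lifted map φ(x) = (1, √2·x1, √2·x2, x1², x2², √2·x1x2) has 6 components, and its dot product reproduces the kernel value.

```python
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

def K(x, y):
    return (np.dot(x, y) + 1.0) ** 2          # degree-2 polynomial kernel

x, y = np.array([0.7, -1.2]), np.array([2.0, 0.4])
assert np.isclose(K(x, y), phi(x) @ phi(y))
print("6-dim phi reproduces the degree-2 polynomial kernel")
```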

Is this function a kernel?

Blaschko / Lampert

Constructing kernels

Blaschko / Lampert

Using SVMs

1. Define your representation for each example.

2. Select a kernel function.

3. Compute pairwise kernel values between labeled examples.

4. Use this "kernel matrix" to solve for SVM support vectors & alpha weights.

5. To classify a new example: compute kernel values between the new input and the support vectors, apply the alpha weights, check the sign of the output.

(A minimal code sketch of this five-step recipe follows below.)

Adapted from Kristen Grauman
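A minimal sketch of the recipe above (my own illustration using scikit-learn's precomputed-kernel interface; the dataset and kernel choice are made up):

```python
import numpy as np
from sklearn.svm import SVC

def rbf(a, b, sigma=1.0):
    # Step 2: chosen kernel function
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

# Step 1: each example is represented as a feature vector
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [5.0, 5.0]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[0.5, 0.5], [4.5, 4.5]])

# Step 3: pairwise kernel values between labeled examples ("kernel matrix")
K_train = np.array([[rbf(a, b) for b in X_train] for a in X_train])

# Step 4: solve for support vectors and alpha weights
clf = SVC(kernel="precomputed").fit(K_train, y_train)

# Step 5: kernel values between new inputs and training examples, then classify
K_test = np.array([[rbf(a, b) for b in X_train] for a in X_test])
print(clf.predict(K_test))
```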

Example: Learning gender w/ SVMs

Moghaddam and Yang, Learning Gender with Support Faces, TPAMI 2002

Moghaddam and Yang, Face & Gesture 2000

Support faces

• SVMs performed better than humans, at either resolution

Kristen Grauman

Plan for today

• Linear Support Vector Machines

• Non-linear SVMs and the “kernel trick”

• Extensions (briefly)

– Soft-margin SVMs

– Multi-class SVMs

– Hinge loss

– SVM vs logistic regression

– SVMs with latent variables

• How to solve the SVM problem (next class)

Hard-margin SVMs

Maximize margin: find the w that minimizes (1/2)||w||²

subject to yi(w·xi + b) ≥ 1 for all i

Soft-margin SVMs (allowing misclassifications)

Maximize margin and minimize misclassification: find the w that minimizes

(1/2)||w||² + C Σi ξi      (sum over the n data samples)

subject to yi(w·xi + b) ≥ 1 − ξi and ξi ≥ 0 for all i

ξi: slack variable      C: misclassification cost

BOARD
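As a minimal illustration (my own, using scikit-learn; data and values are made up) of how the misclassification cost C trades margin width against training errors in a soft-margin SVM:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Two overlapping Gaussian blobs, so no hard margin exists
X = np.vstack([rng.randn(50, 2) + [1.5, 1.5], rng.randn(50, 2) - [1.5, 1.5]])
y = np.array([1] * 50 + [-1] * 50)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C:>6}: margin = {2.0 / np.linalg.norm(w):.3f}, "
          f"training errors = {(clf.predict(X) != y).sum()}")
```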

Multi-class SVMs

• In practice, we obtain a multi-class SVM by combining two-class SVMs (a minimal one-vs.-others sketch follows after this list)

• One vs. others

– Training: learn an SVM for each class vs. the others

– Testing: apply each SVM to the test example, and assign it to the class of the SVM that returns the highest decision value

• One vs. one

– Training: learn an SVM for each pair of classes

– Testing: each learned SVM “votes” for a class to assign to the test example

• There are also “natively multi-class” formulations

– Crammer and Singer, JMLR 2001

Svetlana Lazebnik / Carlos Guestrin
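A minimal sketch (my own, not from the slides) of the one-vs.-others scheme: train one binary SVM per class against all other classes, then assign a test example to the class whose SVM returns the highest decision value. The toy data is made up.

```python
import numpy as np
from sklearn.svm import SVC

def train_one_vs_others(X, y):
    classifiers = {}
    for c in np.unique(y):
        y_binary = np.where(y == c, 1, -1)          # class c vs. everything else
        classifiers[c] = SVC(kernel="linear").fit(X, y_binary)
    return classifiers

def predict_one_vs_others(classifiers, X):
    classes = sorted(classifiers)
    # decision_function gives each binary SVM's signed score
    scores = np.column_stack([classifiers[c].decision_function(X) for c in classes])
    return np.array(classes)[np.argmax(scores, axis=1)]

# Toy 3-class problem
X = np.array([[0, 0], [0.5, 0.2], [5, 5], [5.2, 4.8], [0, 5], [0.3, 5.1]], dtype=float)
y = np.array([0, 0, 1, 1, 2, 2])
clfs = train_one_vs_others(X, y)
print(predict_one_vs_others(clfs, np.array([[0.2, 0.1], [4.9, 5.1], [0.1, 4.9]])))
```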

Hinge loss

• Let f(xi) = w·xi + b

• We have the objective to minimize where:

min (1/2)||w||² + C Σi ξi    subject to  yi f(xi) ≥ 1 − ξi,  ξi ≥ 0

• Then we can define a loss:

Loss(xi, yi) = max(0, 1 − yi f(xi))

• and unconstrained SVM objective:

min over w, b of (1/2)||w||² + C Σi max(0, 1 − yi f(xi))
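A minimal sketch (my own, not the lecture's solver) of minimizing the unconstrained objective above by subgradient descent; the step size, iteration count, and toy data are arbitrary choices:

```python
import numpy as np

def train_hinge_svm(X, y, C=1.0, lr=0.01, n_iters=2000):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iters):
        margins = y * (X @ w + b)
        violated = margins < 1                    # points with nonzero hinge loss
        # subgradient of (1/2)||w||^2 + C * sum_i max(0, 1 - yi(w·xi + b))
        grad_w = w - C * (y[violated][:, None] * X[violated]).sum(axis=0)
        grad_b = -C * y[violated].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = train_hinge_svm(X, y)
print("predictions:", np.sign(X @ w + b))
```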

Multi-class hinge loss

• We want: the score of the correct class to beat every other class's score by a margin of 1:

w_yi · xi ≥ w_c · xi + 1   for all c ≠ yi

• Minimize: (1/2) Σc ||w_c||² + C Σi ξi   subject to the constraints above, relaxed by slack ξi

• Hinge loss: L(xi, yi) = max over c ≠ yi of max(0, 1 + w_c · xi − w_yi · xi)
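A minimal sketch (my own) of the per-example multi-class hinge loss above; this is the max-over-competing-classes variant (another common variant sums over them), and the scores are made up:

```python
import numpy as np

def multiclass_hinge_loss(scores, correct_class):
    # scores[c] = w_c · x for each class c; correct_class = yi
    margins = 1.0 + scores - scores[correct_class]
    margins[correct_class] = 0.0                 # no loss term for the true class
    return np.max(np.maximum(0.0, margins))      # worst violating class

scores = np.array([2.0, 4.5, 3.8])               # hypothetical class scores for one x
print(multiclass_hinge_loss(scores, correct_class=1))   # 0.3: class 2 is within the margin
```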

SVMs vs logistic regression

Adapted from Tommi Jaakola

Adapted from Tommi Jaakola

SVMs vs logistic regression

SVMs with latent variables

Adapted from S. Nowozin and C. Lampert

SVMs: Pros and cons

• Pros

– Kernel-based framework is very powerful, flexible

– Often a sparse set of support vectors – compact at test time

– Work very well in practice, even with very small training sample sizes

– Solution can be formulated as a quadratic program (next time)

– Many publicly available SVM packages: e.g. LIBSVM, LIBLINEAR,

SVMLight

• Cons

– Can be tricky to select best kernel function for a problem

– Computation, memory

• At training time, must compute kernel values for all example pairs

• Learning can take a very long time for large-scale problems

Adapted from Lana Lazebnik
