
STA141C: Big Data & High Performance Statistical Computing

Lecture 9: Classification

Cho-Jui Hsieh, UC Davis

May 18, 2017

Linear SVM

Support Vector Machines

SVM is a widely used classifier. Given:

Training data points x_1, ..., x_n, where each x_i ∈ R^d is a feature vector.
Consider a simple case with two classes: y_i ∈ {+1, −1}.

Goal: find a hyperplane that separates the two classes of data:
if y_i = 1, w^T x_i ≥ 1; if y_i = −1, w^T x_i ≤ −1.

Support Vector Machines (hard constraints)

Given training data x_1, ..., x_n ∈ R^d with labels y_i ∈ {+1, −1}.
SVM primal problem (with hard constraints):

\min_{w}\ \frac{1}{2} w^T w
\quad \text{s.t. } y_i(w^T x_i) \ge 1,\ \ i = 1, \dots, n

What if there are outliers?

Support Vector Machines

Given training data x_1, ..., x_n ∈ R^d with labels y_i ∈ {+1, −1}.
SVM primal problem:

\min_{w,\xi}\ \frac{1}{2} w^T w + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t. } y_i(w^T x_i) \ge 1 - \xi_i,\ \ \xi_i \ge 0,\ \ i = 1, \dots, n

Support Vector Machines

The SVM primal problem can be written as

\min_{w}\ \underbrace{\frac{1}{2}\|w\|^2}_{\text{L2 regularization}} + C \sum_{i=1}^{n} \underbrace{\max(0,\ 1 - y_i w^T x_i)}_{\text{hinge loss}}

Non-differentiable when y_i w^T x_i = 1 for some i.
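Not part of the original slides: a minimal NumPy sketch of this unconstrained objective and one of its subgradients, assuming a feature matrix X, labels y, weight vector w, and regularization parameter C (all names are mine).

```python
import numpy as np

def svm_objective(w, X, y, C=1.0):
    """L2-regularized hinge loss: 0.5*||w||^2 + C * sum_i max(0, 1 - y_i * w^T x_i)."""
    margins = y * (X @ w)                   # y_i * w^T x_i for each sample
    hinge = np.maximum(0.0, 1.0 - margins)  # per-sample hinge loss
    return 0.5 * np.dot(w, w) + C * hinge.sum()

def svm_subgradient(w, X, y, C=1.0):
    """One valid subgradient; samples with y_i * w^T x_i = 1 may contribute 0 or -y_i x_i."""
    margins = y * (X @ w)
    active = margins < 1.0                  # samples with a nonzero hinge term
    return w - C * (y[active][:, None] * X[active]).sum(axis=0)
```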

General Empirical Risk Minimization

Regularized ERM:

\min_{w \in \mathbb{R}^d}\ P(w) := \sum_{i=1}^{n} \ell_i(w^T x_i) + R(w)

ℓ_i(·): loss function
R(w): regularization

The dual problem may have a different form.

Examples

Loss functions:

Regression (squared loss): ℓ_i(w^T x_i) = (w^T x_i − y_i)^2
SVM (hinge loss): ℓ_i(w^T x_i) = max(1 − y_i w^T x_i, 0)
Squared hinge loss: ℓ_i(w^T x_i) = max(1 − y_i w^T x_i, 0)^2
Logistic regression: ℓ_i(w^T x_i) = log(1 + e^{−y_i w^T x_i})
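As a reference, a small NumPy sketch of these loss functions (z denotes the linear prediction w^T x_i; the function names are my own, not the lecture's):

```python
import numpy as np

# z stands for the linear prediction w^T x_i; y is the corresponding label/target.
def squared_loss(z, y):
    return (z - y) ** 2                        # regression (least squares)

def hinge_loss(z, y):
    return np.maximum(0.0, 1.0 - y * z)        # SVM

def squared_hinge_loss(z, y):
    return np.maximum(0.0, 1.0 - y * z) ** 2   # squared hinge (L2-loss SVM)

def logistic_loss(z, y):
    return np.log1p(np.exp(-y * z))            # logistic regression
```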

Examples

Regularizations:

L2-regularization ‖w‖_2^2: small but dense solution
L1-regularization ‖w‖_1: sparse solution
Nuclear norm ‖W‖_*: low-rank solution
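A hedged illustration of the sparse-vs-dense difference, using scikit-learn's LogisticRegression with the liblinear solver on synthetic data (the dataset and parameters are assumptions, not from the lecture):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data where only a few of the 100 features are informative.
X, y = make_classification(n_samples=500, n_features=100,
                           n_informative=5, random_state=0)

for penalty in ("l2", "l1"):
    clf = LogisticRegression(penalty=penalty, solver="liblinear", C=1.0)
    clf.fit(X, y)
    print(penalty, "zero coefficients:", int(np.sum(clf.coef_ == 0)))

# The L1-regularized model typically zeroes out most coefficients (sparse),
# while the L2-regularized model keeps small but dense weights.
```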

LIBLINEAR

Implemented in LIBLINEAR:

https://www.csie.ntu.edu.tw/~cjlin/liblinear/

Other functionalities:

Logistic regression (L1 or L2 regularization)
Multi-class SVM
Support vector regression
Cross-validation (see the sketch below)
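A minimal sketch of using these solvers through scikit-learn's LinearSVC, which wraps LIBLINEAR; the dataset choice and parameters are assumptions, not from the lecture.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Any binary classification dataset will do; breast_cancer is just a stand-in.
X, y = load_breast_cancer(return_X_y=True)

# LinearSVC is scikit-learn's wrapper around LIBLINEAR's linear SVM solvers.
clf = LinearSVC(C=1.0, max_iter=10000)
scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross-validation
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```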

LIBLINEAR

RCV1: 677,399 training samples; 47,236 features; 49,556,258 nonzero entries in the whole dataset.

(Figure: time vs. primal objective function value)

LIBLINEAR

RCV1: 677,399 training samples; 47,236 features; 49,556,258 nonzero entries in the whole dataset.

(Figure: time vs. prediction accuracy)

Kernel SVM

Non-linearly separable problems

What if the data is not linearly separable?

Solution: map the data x_i to a higher-dimensional (maybe infinite-dimensional) feature space ϕ(x_i), where the classes are linearly separable.
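A tiny sketch of this idea with an explicit degree-2 feature map on an XOR-style toy set (the map and data are illustrative assumptions, not the lecture's ϕ):

```python
import numpy as np

# Four XOR-style points: not linearly separable in the original 2-D space.
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
y = np.array([1, -1, -1, 1])

def phi(x):
    """Explicit degree-2 feature map: (x1, x2) -> (x1, x2, x1*x2)."""
    return np.array([x[0], x[1], x[0] * x[1]])

Phi = np.array([phi(x) for x in X])
w = np.array([0.0, 0.0, 1.0])   # hyperplane that uses only the x1*x2 coordinate

print(np.sign(Phi @ w))         # [ 1. -1. -1.  1.] -- matches y after the mapping
```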

Support Vector Machines (SVM)

SVM primal problem:

\min_{w,\xi}\ \frac{1}{2} w^T w + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t. } y_i(w^T \varphi(x_i)) \ge 1 - \xi_i,\ \ \xi_i \ge 0,\ \ i = 1, \dots, n

The dual problem for SVM:

\min_{\alpha}\ \frac{1}{2} \alpha^T Q \alpha - e^T \alpha
\quad \text{s.t. } 0 \le \alpha_i \le C,\ \ i = 1, \dots, n

where Q_{ij} = y_i y_j ϕ(x_i)^T ϕ(x_j) and e = [1, ..., 1]^T.

Kernel trick: define K(x_i, x_j) = ϕ(x_i)^T ϕ(x_j).

At the optimum: w = ∑_{i=1}^n α_i y_i ϕ(x_i).

Various types of kernels

Gaussian kernel: K(x_i, x_j) = e^{−γ‖x_i − x_j‖_2^2};

Polynomial kernel: K(x_i, x_j) = (γ x_i^T x_j + c)^d.

Hard to solve: it requires solving an n-by-n quadratic minimization problem, taking ≥ O(n^2) time.

LIBSVM: https://www.csie.ntu.edu.tw/~cjlin/libsvm/

For linear SVM, use LIBLINEAR instead of LIBSVM.
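To make the kernel trick concrete, here is a small NumPy sketch (function and variable names are assumptions, not from the lecture) that builds Gaussian and polynomial kernel matrices and evaluates the dual decision function f(x) = ∑_i α_i y_i K(x_i, x) without ever forming w:

```python
import numpy as np

def gaussian_kernel(X1, X2, gamma=1.0):
    """K(x, z) = exp(-gamma * ||x - z||^2) for every pair of rows in X1 and X2."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def polynomial_kernel(X1, X2, gamma=1.0, c=1.0, d=3):
    """K(x, z) = (gamma * x^T z + c)^d for every pair of rows in X1 and X2."""
    return (gamma * X1 @ X2.T + c) ** d

def dual_decision_function(alpha, y_train, X_train, X_test, kernel=gaussian_kernel):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x); w is never formed explicitly."""
    K = kernel(X_train, X_test)      # n_train x n_test kernel matrix
    return (alpha * y_train) @ K

# The dual's Q matrix is Q_ij = y_i * y_j * K(x_i, x_j):
# Q = (y_train[:, None] * y_train[None, :]) * gaussian_kernel(X_train, X_train)
```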

Scikit-learn

Linear SVM: sklearn.svm.LinearSVC

Logistic Regression: sklearn.linear_model.LogisticRegression

Kernel SVM: sklearn.svm.SVC

· · ·

Practice in homework.
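A minimal usage sketch of the three classifiers listed above on a toy dataset (the dataset and parameter choices are assumptions, not from the homework):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC, SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Linear SVM": LinearSVC(C=1.0, max_iter=10000),
    "Logistic regression": LogisticRegression(max_iter=10000),
    "Kernel SVM (RBF)": SVC(kernel="rbf", C=1.0, gamma="scale"),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "test accuracy:", round(model.score(X_test, y_test), 3))
```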


Coming up

Classification

Questions?
