STA141C: Big Data & High Performance Statistical Computing
Lecture 9: Classification
Cho-Jui Hsieh, UC Davis
May 18, 2017
Linear SVM
Support Vector Machines
SVM is a widely used classifier. Given:
Training data points x1, . . . , xn, where each xi ∈ R^d is a feature vector.
Consider a simple case with two classes: yi ∈ {+1, −1}.
Goal: find a hyperplane that separates the two classes of data:
if yi = 1, w^T xi ≥ 1; if yi = −1, w^T xi ≤ −1.
Support Vector Machines (hard constraints)
Given training data x1, . . . , xn ∈ R^d with labels yi ∈ {+1, −1}, the SVM primal problem (with hard constraints) is:
$$\min_{w} \; \frac{1}{2} w^T w \quad \text{s.t.} \quad y_i (w^T x_i) \ge 1, \quad i = 1, \dots, n$$
What if there are outliers?
Support Vector Machines
Given training data x1, . . . , xn ∈ R^d with labels yi ∈ {+1, −1}, the SVM primal problem (with slack variables ξi to tolerate outliers) is:
$$\min_{w,\xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i (w^T x_i) \ge 1 - \xi_i, \; \xi_i \ge 0, \quad i = 1, \dots, n$$
Support Vector Machines
Since at the optimum ξi = max(0, 1 − yi w^T xi), the SVM primal problem can be written equivalently as the unconstrained problem

$$\min_{w} \; \underbrace{\frac{1}{2} \|w\|^2}_{\text{L2 regularization}} + \; C \sum_{i=1}^{n} \underbrace{\max(0,\, 1 - y_i w^T x_i)}_{\text{hinge loss}}$$

This objective is non-differentiable when yi w^T xi = 1 for some i.
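Because of that non-differentiability, solvers work with subgradients. A minimal sketch (NumPy only; the function names are mine, and this is not LIBLINEAR's actual solver) of the objective and one valid subgradient:

```python
import numpy as np

# Evaluate the unconstrained SVM primal objective
#   P(w) = 1/2 ||w||^2 + C * sum_i max(0, 1 - y_i w^T x_i)
def svm_primal(w, X, y, C=1.0):
    margins = y * (X @ w)                    # y_i * w^T x_i for all i
    hinge = np.maximum(0.0, 1.0 - margins)   # per-point hinge loss
    return 0.5 * w @ w + C * hinge.sum()

# One valid subgradient (the hinge is non-differentiable at margin = 1):
def svm_subgradient(w, X, y, C=1.0):
    active = y * (X @ w) < 1.0               # points with positive hinge loss
    return w - C * X[active].T @ y[active]
```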
General Empirical Risk Minimization
Regularized ERM:
$$\min_{w \in \mathbb{R}^d} \; P(w) := \sum_{i=1}^{n} \ell_i (w^T x_i) + R(w)$$

ℓi(·): loss function; R(w): regularization term.
The dual problem may take a different form depending on the choice of loss and regularizer.
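A minimal sketch of evaluating P(w) with a pluggable loss and regularizer (NumPy; the function name is mine, not from any library):

```python
import numpy as np

def erm_objective(w, X, y, loss, reg):
    """P(w) = sum_i loss(w^T x_i, y_i) + R(w), for a loss vectorized over margins."""
    margins = X @ w            # n-vector of inner products w^T x_i
    return loss(margins, y).sum() + reg(w)
```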
Examples
Loss functions:
Regression (squared loss): ℓi(w^T xi) = (w^T xi − yi)²
SVM (hinge loss): ℓi(w^T xi) = max(1 − yi w^T xi, 0)
Squared hinge loss: ℓi(w^T xi) = max(1 − yi w^T xi, 0)²
Logistic regression: ℓi(w^T xi) = log(1 + e^{−yi w^T xi})
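Sketched in NumPy (vectorized over the margins m_i = w^T x_i), these losses plug directly into the erm_objective sketch above:

```python
import numpy as np

def squared_loss(m, y):        # regression; y is the real-valued target
    return (m - y) ** 2

def hinge_loss(m, y):          # SVM; y is the +/-1 label
    return np.maximum(1.0 - y * m, 0.0)

def squared_hinge_loss(m, y):
    return np.maximum(1.0 - y * m, 0.0) ** 2

def logistic_loss(m, y):       # log1p(exp(.)) is more stable than log(1 + exp(.))
    return np.log1p(np.exp(-y * m))
```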
Examples
Regularizations:
L2-regularization ‖w‖₂²: small but dense solution
L1-regularization ‖w‖₁: sparse solution
Nuclear norm ‖W‖∗: low-rank solution (for a matrix variable W)
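And the corresponding regularizers as a sketch (λ is the regularization weight; the nuclear norm applies to a matrix variable):

```python
import numpy as np

def l2_reg(w, lam=1.0):        # encourages small but dense solutions
    return lam * np.dot(w, w)

def l1_reg(w, lam=1.0):        # encourages sparse solutions
    return lam * np.abs(w).sum()

def nuclear_reg(W, lam=1.0):   # encourages low-rank solutions (sum of singular values)
    return lam * np.linalg.norm(W, ord='nuc')
```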
LIBLINEAR
Implemented in LIBLINEAR:
https://www.csie.ntu.edu.tw/~cjlin/liblinear/
Other functionalities:
Logistic regression (L1 or L2 regularization)
Multi-class SVM
Support vector regression
Cross-validation
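A minimal usage sketch, assuming the Python wrapper distributed with LIBLINEAR (installable as `liblinear-official` on PyPI; the exact import path may differ by version, and the data file name here is hypothetical):

```python
from liblinear.liblinearutil import svm_read_problem, train, predict

# Read data in LIBSVM sparse format: "label index:value index:value ..."
y, x = svm_read_problem('train.txt')      # hypothetical file name

# -s 1: L2-regularized L2-loss SVC (dual); -c 1: penalty parameter C
model = train(y, x, '-s 1 -c 1')

# Evaluate (on the training set here; use held-out data in practice)
pred_labels, (acc, mse, scc), pred_values = predict(y, x, model)
```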
LIBLINEAR
RCV1 dataset: 677,399 training samples; 47,236 features; 49,556,258 nonzeros in the whole dataset.
[Figure: time vs. primal objective function value]
[Figure: time vs. prediction accuracy]
Kernel SVM
Non-linearly separable problems
What if the data is not linearly separable?
Solution: map the data xi to a higher-dimensional (possibly infinite-dimensional) feature space via ϕ(xi), where the classes become linearly separable.
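For example (an illustration of mine, not from the slides): with XOR-style labels no line in R² separates the classes, but a degree-2 monomial feature map does:

```python
import numpy as np

def phi(x):
    """Degree-2 monomial feature map for x in R^2."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2.0) * x1 * x2, x2**2])

X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
y = np.array([1, 1, -1, -1])             # XOR labels: not separable in R^2

Phi = np.vstack([phi(x) for x in X])
w = np.array([0.0, 1.0, 0.0])            # a separating hyperplane in feature space
print(y * (Phi @ w))                     # all positive => linearly separated
```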
Support Vector Machines (SVM)
SVM primal problem:
$$\min_{w,\xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i (w^T \varphi(x_i)) \ge 1 - \xi_i, \; \xi_i \ge 0, \quad i = 1, \dots, n$$
The dual problem for SVM:
$$\min_{\alpha} \; \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \quad \text{s.t.} \quad 0 \le \alpha_i \le C, \; i = 1, \dots, n,$$

where Qij = yi yj ϕ(xi)^T ϕ(xj) and e = [1, . . . , 1]^T.
Kernel trick: define K(xi, xj) = ϕ(xi)^T ϕ(xj), so Q can be formed without ever computing ϕ(xi) explicitly.
At the optimum, w = ∑ᵢ₌₁ⁿ αi yi ϕ(xi), so predictions also need only kernel values: w^T ϕ(x) = ∑ᵢ αi yi K(xi, x).
Various types of kernels
Gaussian kernel: K(xi, xj) = e^{−γ‖xi−xj‖²}
Polynomial kernel: K(xi, xj) = (γ xi^T xj + c)^d
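A sketch of these kernels and of the resulting decision function f(x) = ∑ᵢ αi yi K(xi, x), which follows from w = ∑ᵢ αi yi ϕ(xi) above (NumPy only; the function names are mine):

```python
import numpy as np

def gaussian_kernel(X, Z, gamma=1.0):
    # ||x - z||^2 = ||x||^2 + ||z||^2 - 2 x^T z, for all pairs at once
    sq = (X**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-gamma * sq)

def polynomial_kernel(X, Z, gamma=1.0, c=1.0, degree=3):
    return (gamma * X @ Z.T + c) ** degree

def decision_function(alpha, y, X_train, X_test, kernel=gaussian_kernel):
    K = kernel(X_train, X_test)      # n_train-by-n_test kernel matrix
    return (alpha * y) @ K           # f(x) = sum_i alpha_i y_i K(x_i, x)
```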
Hard to solve: the dual is an n-by-n quadratic minimization problem, which takes at least O(n²) time.
LIBSVM: https://www.csie.ntu.edu.tw/~cjlin/libsvm/
For linear SVM, use LIBLINEAR instead of LIBSVM.
Scikit-learn
Linear SVM: sklearn.svm.LinearSVC
Logistic Regression: sklearn.linear_model.LogisticRegression
Kernel SVM: sklearn.svm.SVC
· · ·
Practice in homework.
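A minimal sketch of the three classifiers listed above, run on synthetic data (the data is made up purely for illustration):

```python
import numpy as np
from sklearn.svm import LinearSVC, SVC
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = np.where(X[:, 0] + 0.5 * rng.randn(200) > 0, 1, -1)   # noisy linear labels

for clf in (LinearSVC(C=1.0),                      # linear SVM (uses LIBLINEAR)
            LogisticRegression(C=1.0),             # regularized logistic regression
            SVC(kernel='rbf', C=1.0, gamma=0.5)):  # kernel SVM (uses LIBSVM)
    clf.fit(X, y)
    print(type(clf).__name__, clf.score(X, y))     # training accuracy
```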
Coming up
Classification
Questions?