TRANSCRIPT
Support Vector Machines: Training with Stochastic Gradient Descent

Machine Learning Spring 2018

The slides are mainly from Vivek Srikumar

1
Support vector machines
• Training by maximizing margin
• The SVM objective
• Solving the SVM optimization problem
• Support vectors, duals and kernels
2
SVM objective function
3
Regularization term:
• Maximize the margin
• Imposes a preference over the hypothesis space and pushes for better generalization
• Can be replaced with other regularization terms which impose other preferences

Empirical loss:
• Hinge loss
• Penalizes weight vectors that make mistakes
• Can be replaced with other loss functions which impose other preferences

C: a hyper-parameter that controls the tradeoff between a large margin and a small hinge loss
\min_{w,b}\; \frac{1}{2} w^\top w \qquad \text{s.t. } \forall i,\; y_i(w^\top x_i + b) \ge 1

\min_{w,b}\; \frac{1}{2} w^\top w + C \sum_i \xi_i \qquad \text{s.t. } \forall i,\; y_i(w^\top x_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0

\min_{w,b}\; \frac{1}{2} w^\top w + C \sum_i \max\!\left(0,\, 1 - y_i(w^\top x_i + b)\right)

L_{\mathrm{Hinge}}(y, x, w, b) = \max\!\left(0,\, 1 - y(w^\top x + b)\right)

\min_{w}\; \frac{1}{2} w_0^\top w_0 + C \sum_i \max\!\left(0,\, 1 - y_i w^\top x_i\right)
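To make the two terms concrete, here is a minimal NumPy sketch of the objective above (the function name and the toy data are illustrative, not from the slides):

```python
import numpy as np

def svm_objective(w0, b, X, y, C):
    """Regularized hinge loss: 0.5 * w0^T w0 + C * sum_i max(0, 1 - y_i (w0^T x_i + b))."""
    margins = y * (X @ w0 + b)               # y_i (w0^T x_i + b) for every example
    hinge = np.maximum(0.0, 1.0 - margins)   # per-example hinge loss
    return 0.5 * w0 @ w0 + C * hinge.sum()

# Toy usage: two 2-d examples, one per class
X = np.array([[1.0, 2.0], [-1.0, -1.5]])
y = np.array([1.0, -1.0])
print(svm_objective(np.array([0.5, -0.25]), 0.1, X, y, C=1.0))
```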
Outline: Training SVM by optimization
1. Review of convex functions and gradient descent
2. Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
4
Solving the SVM optimization problem
This function is convex in w, b. For convenience, use simplified notation:

w0 ← w
w ← [w0; b]
xi ← [xi; 1]
6
\min_{w}\; \frac{1}{2} w_0^\top w_0 + C \sum_i \max\!\left(0,\, 1 - y_i w^\top x_i\right)
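A small sketch of this folding trick (names are illustrative): append a constant 1 to every xi and carry b as the last coordinate of w, so that wᵀxi reproduces w0ᵀxi + b.

```python
import numpy as np

def absorb_bias(X, w0, b):
    """Fold the bias into the weight vector: w = [w0; b], x_i = [x_i; 1]."""
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])  # x_i <- [x_i; 1]
    w = np.append(w0, b)                              # w   <- [w0; b]
    return X_aug, w

X = np.array([[1.0, 2.0], [-1.0, -1.5]])
X_aug, w = absorb_bias(X, np.array([0.5, -0.25]), 0.1)
print(X_aug @ w)  # equals w0^T x_i + b for every row
```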
Recall: Convex functions

A function f is convex if for every u, v in the domain, and for every λ ∈ [0, 1] we have

f(λu + (1 − λ)v) ≤ λ f(u) + (1 − λ) f(v)

[Figure: a convex function f; the chord between (u, f(u)) and (v, f(v)) lies above the curve]

From the geometric perspective: every tangent plane lies below the function.

7
For a differentiable convex f, the tangent-plane property reads

f(x) \ge f(u) + \nabla f(u)^\top (x - u)

8
Convex functions
9
Examples: linear functions are convex; a max of convex functions is convex.

Some ways to show that a function is convex:

1. Using the definition of convexity
2. Showing that the second derivative is nonnegative (for one-dimensional functions)
3. Showing that the Hessian is positive semi-definite (for multivariate functions)
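As a worked instance of these tests (this example is not on the slide): the second-derivative route shows x² is convex, and closure under max shows the hinge loss is convex in w:

f(x) = x^2 \implies f''(x) = 2 \ge 0 \implies f \text{ is convex}

\max\!\left(0,\, 1 - y\, w^\top x\right) \text{ is a pointwise max of two functions linear in } w\text{, hence convex in } w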
Not all functions are convex
10
These are concave. These are neither convex nor concave.

For concave functions the inequality is flipped:

f(λu + (1 − λ)v) ≥ λ f(u) + (1 − λ) f(v)
Convex functions are convenient
A function f is convex if for every u, v in the domain, and for every λ ∈ [0, 1] we have

f(λu + (1 − λ)v) ≤ λ f(u) + (1 − λ) f(v)

In general: a necessary condition for x to be a minimum of the function f is ∇f(x) = 0.

For convex functions, this is both necessary and sufficient.

11
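A one-line worked example of this fact (not on the slide): for the convex function f(x) = x²,

\nabla f(x) = 2x = 0 \iff x = 0,

and x = 0 is indeed the global minimum.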
Solving the SVM optimization problem

This function is convex in w.

• This is a quadratic optimization problem because the objective is quadratic
• Older methods used techniques from quadratic programming
  – Very slow
• No constraints, so we can use gradient descent
  – Still very slow!

12
\min_{w}\; \frac{1}{2} w_0^\top w_0 + C \sum_i \max\!\left(0,\, 1 - y_i w^\top x_i\right)
Gradient descent

General strategy for minimizing a function J(w)
• Start with an initial guess for w, say w0
• Iterate till convergence:
  – Compute the gradient of J at wt
  – Update wt to get wt+1 by taking a step in the opposite direction of the gradient

[Figure, animated over slides 13–16: a bowl-shaped J(w) with successive iterates w0, w1, w2, w3 stepping toward the minimum]

Intuition: The gradient is the direction of steepest increase in the function. To get to the minimum, go in the opposite direction.

13
We are trying to minimize

J(w) = \frac{1}{2} w_0^\top w_0 + C \sum_i \max\!\left(0,\, 1 - y_i w^\top x_i\right)
Gradient descent for SVM

1. Initialize w0
2. For t = 0, 1, 2, ...:
   1. Compute the gradient of J(w) at wt. Call it ∇J(wt)
   2. Update w as follows: wt+1 ← wt − r ∇J(wt)

17

r: called the learning rate.

We are trying to minimize J(w).
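A minimal sketch of this loop (the function names and the fixed learning rate are illustrative assumptions):

```python
import numpy as np

def gradient_descent(grad_J, w_init, r=0.01, n_steps=1000):
    """Minimize J by repeatedly stepping against its gradient: w <- w - r * grad_J(w)."""
    w = w_init
    for _ in range(n_steps):
        w = w - r * grad_J(w)
    return w

# Toy usage: J(w) = ||w||^2 / 2 has gradient w, so the minimizer is 0
w_star = gradient_descent(lambda w: w, np.array([3.0, -2.0]))
print(w_star)  # close to [0, 0]
```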
Outline: Training SVM by optimization
✓ Review of convex functions and gradient descent
2. Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
18
Gradient descent for SVM

1. Initialize w0
2. For t = 0, 1, 2, ...:
   1. Compute the gradient of J(w) at wt. Call it ∇J(wt)
   2. Update w as follows: wt+1 ← wt − r ∇J(wt)

19

r: called the learning rate

The gradient of the SVM objective requires summing over the entire training set. Slow, does not really scale.

We are trying to minimize J(w).
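For reference, the full gradient that each such step must evaluate, assembled from the case analysis of the hinge term (a valid sub-gradient; the non-differentiable tie at yi wᵀxi = 1 is resolved toward the zero branch, matching the case analysis later in the deck):

\nabla J(w) = [w_0;\, 0] \; - \; C \sum_{i\,:\; y_i w^\top x_i < 1} y_i x_i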
Stochastic gradient descent for SVM

Given a training set S = {(xi, yi)}, x ∈ ℜn, y ∈ {-1, 1}
1. Initialize w0 = 0 ∈ ℜn
2. For epoch = 1 ... T:
   1. Pick a random example (xi, yi) from the training set S
   2. Treat (xi, yi) as a full dataset and take the derivative of the SVM objective at the current wt-1 to be ∇Jt(wt-1)
   3. Update: wt ← wt-1 − γt ∇Jt(wt-1)
3. Return final w

20
Stochastic gradient descent for SVM

Given a training set S = {(xi, yi)}, x ∈ ℜn, y ∈ {-1, 1}
1. Initialize w0 = 0 ∈ ℜn
2. For epoch = 1 ... T:
   1. Pick a random example (xi, yi) from the training set S
   2. Repeat (xi, yi) to make a full dataset and take the derivative of the SVM objective at the current wt-1 to be ∇Jt(wt-1)
   3. Update: wt ← wt-1 − γt ∇Jt(wt-1)
3. Return final w

22
Stochastic gradient descent for SVM

Given a training set S = {(xi, yi)}, x ∈ ℜn, y ∈ {-1, 1}
1. Initialize w0 = 0 ∈ ℜn
2. For epoch = 1 ... T:
   1. Pick a random example (xi, yi) from the training set S
   2. Repeat (xi, yi) to make a full dataset and take the derivative of the SVM objective at the current wt-1 to be ∇Jt(wt-1)
   3. Update: wt ← wt-1 − γt ∇Jt(wt-1)
3. Return final w

24
J^t(w) = \frac{1}{2} w_0^\top w_0 + C \cdot N \max\!\left(0,\, 1 - y_i w^\top x_i\right)

where N is the number of training examples.
Stochastic gradient descent for SVM

Given a training set S = {(xi, yi)}, x ∈ ℜn, y ∈ {-1, 1}
1. Initialize w0 = 0 ∈ ℜn
2. For epoch = 1 ... T:
   1. Pick a random example (xi, yi) from the training set S
   2. Repeat (xi, yi) to make a full dataset and take the derivative of the SVM objective at the current wt-1 to be ∇Jt(wt-1)
   3. Update: wt ← wt-1 − γt ∇Jt(wt-1)
3. Return final w

This algorithm is guaranteed to converge to the minimum of J if γt is small enough.

What is the gradient of the hinge loss with respect to w? (The hinge loss is not a differentiable function!)

26
Outline: Training SVM by optimization
✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
27
Gradient Descent vs SGD
28
Gradient descent
Stochastic Gradient descent

[Figure, animated over slides 29–46: gradient descent reaches the minimum in a few smooth steps, while stochastic gradient descent takes many noisy steps that wander toward it]
Many more updates than gradient descent, but each individual update is less computationally expensive
Outline: Training SVM by optimization
✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
✓ Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
47
Stochastic gradient descent for SVM

Given a training set S = {(xi, yi)}, x ∈ ℜn, y ∈ {-1, 1}
1. Initialize w0 = 0 ∈ ℜn
2. For epoch = 1 ... T:
   1. Pick a random example (xi, yi) from the training set S
   2. Treat (xi, yi) as a full dataset and take the derivative of the SVM objective at the current wt-1 to be ∇Jt(wt-1)
   3. Update: wt ← wt-1 − γt ∇Jt(wt-1)
3. Return final w

What is the derivative of the hinge loss with respect to w? (The hinge loss is not a differentiable function!)
48
Hinge loss is not differentiable!
What is the derivative of the hinge loss with respect to w?
49
Detour: Sub-gradients
Generalization of gradients to non-differentiable functions. (Remember that every tangent lies below the function for convex functions.)

Informally, a sub-tangent at a point is any line that lies below the function at that point. A sub-gradient is the slope of that line.
50
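For the single-example hinge loss, a minimal sketch (names illustrative) that returns one valid sub-gradient, choosing the zero branch at the kink:

```python
import numpy as np

def hinge_subgradient(w, x, y):
    """Return one valid sub-gradient of max(0, 1 - y * w^T x) with respect to w."""
    if y * (w @ x) >= 1:          # loss is 0 here (including the kink at exactly 1)
        return np.zeros_like(w)
    return -y * x                 # loss is 1 - y w^T x, whose gradient is -y x

print(hinge_subgradient(np.array([0.5, -0.25]), np.array([1.0, 2.0]), 1.0))
```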
Sub-gradients

51 [Example from Boyd]

g1 is a gradient at x1.
g2 and g3 are both subgradients at x2.
f is differentiable at x1; the tangent at that point lies below f.

Formally, g is a subgradient to f at x if f(z) ≥ f(x) + gᵀ(z − x) for all z.
Sub-gradient of the SVM objective
54
General strategy: First solve the max and compute the gradient for each case
J^t(w) = \frac{1}{2} w_0^\top w_0 + C \cdot N \max\!\left(0,\, 1 - y_i w^\top x_i\right)

\nabla J^t = \begin{cases} [w_0;\, 0] & \text{if } \max\!\left(0,\, 1 - y_i w^\top x_i\right) = 0 \\ [w_0;\, 0] - C \cdot N\, y_i x_i & \text{otherwise} \end{cases}
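A direct transcription of this case analysis into NumPy (names are illustrative; w is the stacked [w0; b] vector and x carries the appended 1):

```python
import numpy as np

def svm_stochastic_subgradient(w, x, y, C, N):
    """Sub-gradient of J^t(w) = 0.5 * w0^T w0 + C*N*max(0, 1 - y w^T x).

    w = [w0; b] and x = [original x; 1], so the regularizer ignores the
    last (bias) coordinate: its gradient is [w0; 0].
    """
    reg = np.append(w[:-1], 0.0)      # [w0; 0]
    if y * (w @ x) >= 1:              # hinge term is zero
        return reg
    return reg - C * N * y * x        # hinge term is active

print(svm_stochastic_subgradient(np.array([0.5, -0.25, 0.1]),
                                 np.array([1.0, 2.0, 1.0]), 1.0, C=1.0, N=2))
```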
Outline: Training SVM by optimization
✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
✓ Gradient descent vs stochastic gradient descent
✓ Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
56
Stochastic sub-gradient descent for SVM

Given a training set S = {(xi, yi)}, x ∈ ℜn, y ∈ {-1, 1}
1. Initialize w0 = 0 ∈ ℜn
2. For epoch = 1 ... T:
   1. For each training example (xi, yi) ∈ S:
      If yi wTxi ≤ 1, w ← (1 − γt) w + γt C yi xi
      else w ← (1 − γt) w
3. Return w

(This is the update w ← w − γt ∇Jt.)
57
Stochastic sub-gradient descent for SVM

Given a training set S = {(xi, yi)}, x ∈ ℜn, y ∈ {-1, 1}
1. Initialize w0 = 0 ∈ ℜn
2. For epoch = 1 ... T:
   1. For each training example (xi, yi) ∈ S:
      If yi wTxi ≤ 1, w ← (1 − γt) [w0; 0] + γt C N yi xi
      else w0 ← (1 − γt) w0
3. Return w

60
Important to shuffle examples at the start of each epoch
Stochastic sub-gradient descent for SVM

Given a training set S = {(xi, yi)}, x ∈ ℜn, y ∈ {-1, 1}
1. Initialize w0 = 0 ∈ ℜn
2. For epoch = 1 ... T:
   1. Shuffle the training set
   2. For each training example (xi, yi) ∈ S:
      If yi wTxi ≤ 1, w ← (1 − γt) [w0; 0] + γt C N yi xi
      else w0 ← (1 − γt) w0
3. Return w

62
γt: learning rate, many tweaks possible
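Assembling the whole algorithm, a minimal runnable sketch (the toy data, the decaying schedule γt = γ0/(1 + t), and all names are assumptions for illustration; the update is implemented as the generic step w ← w − γt ∇Jt from the earlier slides, which leaves the unregularized bias coordinate un-shrunk):

```python
import numpy as np

def train_svm_sgd(X, y, C=1.0, T=100, gamma0=0.1, seed=0):
    """Stochastic sub-gradient descent for the SVM objective.

    X: (N, n) data matrix, y: (N,) labels in {-1, +1}.
    The bias is folded in by appending a constant-1 feature.
    """
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    X = np.hstack([X, np.ones((N, 1))])    # x_i <- [x_i; 1]
    w = np.zeros(X.shape[1])               # w   <- [w0; b], initialized to 0
    t = 0
    for _ in range(T):
        for i in rng.permutation(N):       # shuffle each epoch
            gamma = gamma0 / (1 + t)       # assumed decaying learning rate
            w0_part = np.append(w[:-1], 0.0)   # [w0; 0]: bias is not regularized
            if y[i] * (w @ X[i]) <= 1:
                w = w - gamma * (w0_part - C * N * y[i] * X[i])
            else:
                w = w - gamma * w0_part
            t += 1
    return w

# Toy usage on a linearly separable set
X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = train_svm_sgd(X, y)
print(np.sign(X @ w[:-1] + w[-1]))  # should match y
```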
Convergence and learning rates
With enough iterations, it will converge in expectation
Provided the step sizes are “square summable, but not summable”
• Step sizes γt are positive
• The sum of squares of the step sizes over t = 1 to ∞ is finite
• The sum of the step sizes over t = 1 to ∞ is infinite
• Some examples: γt = γ0 / (1 + γ0 t / C) or γt = γ0 / (1 + t)
63
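A quick check, not on the slide, that γt = γ0/(1 + t) is square summable but not summable:

\sum_{t=1}^{\infty} \left(\frac{\gamma_0}{1+t}\right)^2 \le \gamma_0^2 \sum_{t=1}^{\infty} \frac{1}{t^2} < \infty, \qquad \sum_{t=1}^{\infty} \frac{\gamma_0}{1+t} = \infty \ \text{(harmonic series)}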
Convergence and learning rates
• Number of iterations to get to accuracy within ε
• For strongly convex functions, N examples, d dimensional:
  – Gradient descent: O(N d ln(1/ε))
  – Stochastic gradient descent: O(d/ε)
• More subtleties involved, but SGD is generally preferable when the data size is huge
64
Outline: Training SVM by optimization
✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
✓ Gradient descent vs stochastic gradient descent
✓ Sub-derivatives of the hinge loss
✓ Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
65
Stochastic sub-gradient descent for SVM

SVM update (from above): if yi wTxi ≤ 1, w ← (1 − γt) [w0; 0] + γt C N yi xi; else w0 ← (1 − γt) w0

66

Compare with the Perceptron update: if yi wTxi ≤ 0, update w ← w + r yi xi
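The two rules differ only in the trigger (margin 1 vs. 0) and in the shrinkage that regularization adds; a compact sketch of the contrast (function names and demo values are illustrative):

```python
import numpy as np

def perceptron_step(w, x, y, r):
    """Perceptron: update only on a mistake (margin <= 0), no shrinking."""
    return w + r * y * x if y * (w @ x) <= 0 else w

def svm_sgd_step(w, x, y, gamma, C, N):
    """SVM SGD: update on any margin violation (margin <= 1), with weight decay."""
    w0_part = np.append(w[:-1], 0.0)          # regularizer ignores the bias slot
    if y * (w @ x) <= 1:
        return w - gamma * (w0_part - C * N * y * x)
    return w - gamma * w0_part

w = np.zeros(3)
x, y = np.array([1.0, 2.0, 1.0]), 1.0
print(perceptron_step(w, x, y, r=0.1))
print(svm_sgd_step(w, x, y, gamma=0.1, C=1.0, N=4))
```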
Perceptron vs. SVM
• Perceptron: stochastic sub-gradient descent for a different loss
  – No regularization, though
• SVM optimizes the hinge loss
  – With regularization
67
SVM summary from optimization perspective
• Minimize regularized hinge loss
• Solve using stochastic gradient descent
  – Very fast; run time does not depend on the number of examples
  – Compare with the Perceptron algorithm: similar framework with different objectives!
  – Compare with the Perceptron algorithm: Perceptron does not maximize margin width
    • Perceptron variants can force a margin
• Other successful optimization algorithms exist
  – E.g., dual coordinate descent, implemented in liblinear

68

Questions?