stochastic optimization for big data analytics: algorithms and … · 2018-05-25 · stochastic...
TRANSCRIPT
![Page 1: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/1.jpg)
Stochastic Optimization for Big Data Analytics:Algorithms and Libraries
Tianbao Yang‡
SDM 2014, Philadelphia, Pennsylvania
collaborators:Rong Jin†, Shenghuo Zhu‡
‡NEC Laboratories America, †Michigan State University
February 9, 2014
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 1 / 77
![Page 2: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/2.jpg)
The updates are available here
http://www.cse.msu.edu/~yangtia1/sdm14-tutorial.pdf
Thanks.
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 2 / 77
![Page 3: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/3.jpg)
STochastic OPtimization (STOP) and Machine Learning
Outline
1 STochastic OPtimization (STOP) and Machine Learning
2 STOP Algorithms for Big Data Classification and Regression
3 General Strategies for Stochastic Optimization
4 Implementations and A Library
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 3 / 77
![Page 4: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/4.jpg)
STochastic OPtimization (STOP) and Machine Learning
Introduction
Machine Learning problems and Stochastic Optimization
Coverage:
Classification and Regression in different forms
Motivation to employ STochastic OPtimization (STOP)
Basic Convex Optimization Knowledge
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 4 / 77
![Page 5: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/5.jpg)
STochastic OPtimization (STOP) and Machine Learning
Learning as Optimization
Least Square Regression Problem:
minw∈Rd
1
n
n∑i=1
(yi −w>xi )2 +
λ
2‖w‖2
2
xi ∈ Rd : d-dimensional feature vector
yi ∈ R: response variable
w ∈ Rd : unknown parameters
n: number of data points
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 5 / 77
![Page 6: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/6.jpg)
STochastic OPtimization (STOP) and Machine Learning
Learning as Optimization
Least Square Regression Problem:
minw∈Rd
1
n
n∑i=1
(yi −w>xi )2
︸ ︷︷ ︸Empirical Loss
+λ
2‖w‖2
2
xi ∈ Rd : d-dimensional feature vector
yi ∈ R: response variable
w ∈ Rd : unknown parameters
n: number of data points
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 6 / 77
![Page 7: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/7.jpg)
STochastic OPtimization (STOP) and Machine Learning
Learning as Optimization
Least Square Regression Problem:
minw∈Rd
1
n
n∑i=1
(yi −w>xi )2 +
λ
2‖w‖2
2︸ ︷︷ ︸Regularization
xi ∈ Rd : d-dimensional feature vector
yi ∈ R: response variable
w ∈ Rd : unknown parameters
n: number of data points
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 7 / 77
![Page 8: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/8.jpg)
STochastic OPtimization (STOP) and Machine Learning
Learning as Optimization
Classification Problems:
minw∈Rd
1
n
n∑i=1
`(yiw>xi ) +
λ
2‖w‖2
2
yi ∈ +1,−1: label
Loss function `(z)
1. SVMs: hinge loss `(z) = max(0, 1− z)p, where p = 1, 2
2. Logistic Regression: `(z) = log(1 + exp(−z))
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 8 / 77
![Page 9: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/9.jpg)
STochastic OPtimization (STOP) and Machine Learning
Learning as Optimization
Feature Selection:
minw∈Rd
1
n
n∑i=1
`(yiw>xi ) + λ‖w‖1
`1 regularization ‖w‖1 =∑d
i=1 |wi |λ controls sparsity level
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 9 / 77
![Page 10: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/10.jpg)
STochastic OPtimization (STOP) and Machine Learning
Learning as Optimization
Feature Selection using Elastic Net:
minw∈Rd
1
n
n∑i=1
`(yiw>xi )+λ
(‖w‖1 + γ‖w‖2
2
)
Elastic net regularizer, more robust than `1 regularizer
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 10 / 77
![Page 11: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/11.jpg)
STochastic OPtimization (STOP) and Machine Learning
Big Data Challenge
Huge amount of data generated on theinternet every day
Facebook users upload 3 million photos
Goolge receives 3000 million queries
GE 100 Million Hours of operationaldata on the World’s Largest GasTurbine Fleet
The global internet population is 2.1billion people
http://www.visualnews.com/2012/06/19/
how-much-data-created-every-minute/
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 11 / 77
![Page 12: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/12.jpg)
STochastic OPtimization (STOP) and Machine Learning
Why Learning from Big Data is Hard?
minw∈Rd
1
n
n∑i=1
`(w>xi , yi ) + λR(w)︸ ︷︷ ︸empirical loss + regularizer
Too many data points
Issue: can’t afford go through data set many times
Solution: Stochastic Optimization
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 12 / 77
![Page 13: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/13.jpg)
STochastic OPtimization (STOP) and Machine Learning
Why Learning from Big Data is Hard?
minw∈Rd
1
n
n∑i=1
`(w>xi , yi ) + λR(w)︸ ︷︷ ︸empirical loss + regularizer
High Dimensional Data
Issue: can’t afford second order optimization (e.g. Newton’s method)
Solution: first order method (i.e, gradient based method)
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 13 / 77
![Page 14: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/14.jpg)
STochastic OPtimization (STOP) and Machine Learning
Why Learning from Big Data is Hard?
minw∈Rd
1
n
n∑i=1
`(w>xi , yi ) + λR(w)︸ ︷︷ ︸empirical loss + regularizer
Data are distributed over many machines
Issue: expensive (if not impossible) to move data
Solution: Distributed Optimization
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 14 / 77
![Page 15: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/15.jpg)
STochastic OPtimization (STOP) and Machine Learning
Warm-up: Vector, Norm, Inner product, Dual Norm
bold letters x ∈ Rd , w ∈ Rd : d-dimensional vectors
norm ‖x‖: Rd → R+. e.g. `2 norm ‖x‖2 =√∑d
i=1 x2i , `1 norm
‖x‖1 =∑d
i=1 |xi |, `∞ norm ‖x‖∞ = maxi |xi |
inner product < x, y >= x>y =∑d
i=1 xiyi
dual norm ‖x‖∗ = max‖y‖≤1 x>y. ‖x‖2 ⇐⇒ ‖x‖2, ‖x‖1 ⇐⇒ ‖x‖∞
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 15 / 77
![Page 16: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/16.jpg)
STochastic OPtimization (STOP) and Machine Learning
Warm-up: Convex Optimization
minx∈X
f (x)
f (x) is a convex function
X is a convex domain
f (αx + (1− α)y) ≤ αf (x) + (1− α)f (y), ∀x, y ∈ X , α ∈ [0, 1]
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 16 / 77
![Page 17: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/17.jpg)
STochastic OPtimization (STOP) and Machine Learning
Warm-up: Convergence Measure
Most optimization algorithms areiterative
wt+1 = wt + ∆wt
Iteration Complexity: the number ofiterations T (ε) needed to have
f (xT )−minx∈X
f (x) ≤ ε (ε 1)
Convergence rate: after T iterations,how good is the solution
f (xT )−minx∈X
f (x) ≤ ε(T )
0 20 40 60 80 100
0
0.2
0.4
0.6
0.8
1
obje
ctive
ε(T)
Optimal Value
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 17 / 77
![Page 18: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/18.jpg)
STochastic OPtimization (STOP) and Machine Learning
More on Convergence Measure
Convergence Rate Iteration Complexity
linear µT (µ < 1) O(log( 1
ε ))
sub-linear 1Tα α > 0 O
(1
ε1/α
)Why are we interested in Bounds?
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 18 / 77
![Page 19: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/19.jpg)
STochastic OPtimization (STOP) and Machine Learning
More on Convergence Measure
Convergence Rate Iteration Complexity
linear µT (µ < 1) O(log( 1
ε ))
sub-linear 1Tα α > 0 O
(1
ε1/α
)Why are we interested in Bounds?
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 18 / 77
![Page 20: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/20.jpg)
STochastic OPtimization (STOP) and Machine Learning
More on Convergence Measure
Convergence Rate Iteration Complexity
linear µT (µ < 1) O(log( 1
ε ))
sub-linear 1Tα α > 0 O
(1
ε1/α
)
20 40 60 80 1000
0.05
0.1
0.15
0.2
0.25
0.3
iterations
dis
tan
ce
to
op
tim
al o
bje
ctive
0.5T
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 19 / 77
![Page 21: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/21.jpg)
STochastic OPtimization (STOP) and Machine Learning
More on Convergence Measure
Convergence Rate Iteration Complexity
linear µT (µ < 1) O(log( 1
ε ))
sub-linear 1Tα α > 0 O
(1
ε1/α
)
20 40 60 80 1000
0.05
0.1
0.15
0.2
0.25
0.3
iterations
dis
tan
ce
to
op
tim
al o
bje
ctive
0.5T
1/T2
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 20 / 77
![Page 22: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/22.jpg)
STochastic OPtimization (STOP) and Machine Learning
More on Convergence Measure
Convergence Rate Iteration Complexity
linear µT (µ < 1) O(log( 1
ε ))
sub-linear 1Tα α > 0 O
(1
ε1/α
)
20 40 60 80 1000
0.05
0.1
0.15
0.2
0.25
0.3
iterations
dis
tan
ce
to
op
tim
al o
bje
ctive
0.5T
1/T2
1/T
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 21 / 77
![Page 23: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/23.jpg)
STochastic OPtimization (STOP) and Machine Learning
More on Convergence Measure
Convergence Rate Iteration Complexity
linear µT (µ < 1) O(log( 1
ε ))
sub-linear 1Tα α > 0 O
(1
ε1/α
)
20 40 60 80 1000
0.05
0.1
0.15
0.2
0.25
0.3
iterations
dis
tan
ce
to
op
tim
al o
bje
ctive
0.5T
1/T2
1/T
1/T0.5
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 22 / 77
![Page 24: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/24.jpg)
STochastic OPtimization (STOP) and Machine Learning
More on Convergence Measure
Convergence Rate Iteration Complexity
linear µT (µ < 1) O(log( 1
ε ))
sub-linear 1Tα α > 0 O
(1
ε1/α
)
20 40 60 80 1000
0.05
0.1
0.15
0.2
0.25
0.3
iterations
dis
tan
ce
to
op
tim
al o
bje
ctive
0.5T
1/T2
1/T
1/T0.5
µT <1
T 2<
1
T<
1√T
log
(1
ε
)<
1√ε<
1
ε<
1
ε2
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 23 / 77
![Page 25: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/25.jpg)
STochastic OPtimization (STOP) and Machine Learning
Factors that affect Iteration Complexity
Domain X : size and geometry
Size of problem: dimension and number of data points
Property of function: smoothness of function
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 24 / 77
![Page 26: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/26.jpg)
STochastic OPtimization (STOP) and Machine Learning
Warm-up: Non-smooth function
Lipschitz continuous: e.g. absolute loss f (z) = |z |
|f (x)− f (y)| ≤ L‖x− y‖
f (x) ≥ f (y) + ∂f (y)>(x− y)
−1 −0.5 0 0.5 1−0.2
0
0.2
0.4
0.6
0.8
|x|
non−smooth
sub−gradient
Iteration complexity O
(1
ε2
)or convergence rate O
(1√T
)
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 25 / 77
![Page 27: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/27.jpg)
STochastic OPtimization (STOP) and Machine Learning
Warm-up: Smooth Convex function
smooth: e.g. logistic loss f (z) = log(1 + exp(−z))
‖∇f (x)−∇f (y)‖∗ ≤ β‖x− y‖where β > 0
Second Order Derivative is upperbounded −1 −0.5 0 0.5 1
−0.2
0
0.2
0.4
0.6
0.8
x2
gradient
smooth
Full gradient descent O
(1√ε
)or O
(1
T 2
)Stochastic gradient descent O
(1
ε2
)or O
(1√T
)
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 26 / 77
![Page 28: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/28.jpg)
STochastic OPtimization (STOP) and Machine Learning
Warm-up: Strongly Convex function
strongly convex: e.g. `22 norm f (x) = 1
2‖x‖22
‖∇f (x)−∇f (y)‖∗ ≥ λ‖x− y‖, λ > 0
Second Order Derivative is lower bounded
Iteration complexity O(1/ε) or convergence rate O(1/T )
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 27 / 77
![Page 29: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/29.jpg)
STochastic OPtimization (STOP) and Machine Learning
Warm-up: Smooth and Strongly Convex function
smooth and strongly convex:
λ‖x− y‖ ≤ ‖∇f (x)−∇f (y)‖∗ ≤ β‖x− y‖, β ≥ λ > 0
e.g. square loss: f (z) = 12 (1− z)2
Iteration complexity O
(log
(1
ε
))or convergence rate O(µT )
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 28 / 77
![Page 30: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/30.jpg)
STochastic OPtimization (STOP) and Machine Learning
Warm-up: Stochastic Gradient
Loss: L(w) =1
n
n∑i=1
`(w>xi , yi )
stochastic gradient: ∇`(w>xit , yit )
where it ∈ 1, . . . , n randomly sampled
key equation: Eit [∇`(w>xit , yit )] = ∇L(w)
computation is very cheap
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 29 / 77
![Page 31: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/31.jpg)
STochastic OPtimization (STOP) and Machine Learning
Warm-up: Dual
Convex Conjugate: `∗(α)⇐⇒ `(z)
`(z) = maxα∈Ω
αz − `∗(α), e.g .max(0, 1− z) = maxα≤0
αz − α
Dual Objective (classification):
minw
1
n
n∑i=1
`(yiw>xi︸ ︷︷ ︸z
) +λ
2‖w‖2
2
= minw
1
n
n∑i=1
[maxαi∈Ω
αi (yiw>xi )− `∗(αi )
]+λ
2‖w‖2
2
= maxα∈Ω
1
n
n∑i=1
−`∗(αi )−λ
2
∥∥∥∥∥ 1
λn
∑i=1
αiyixi
∥∥∥∥∥2
2
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 30 / 77
![Page 32: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/32.jpg)
STOP Algorithms for Big Data Classification and Regression
Outline
1 STochastic OPtimization (STOP) and Machine Learning
2 STOP Algorithms for Big Data Classification and Regression
3 General Strategies for Stochastic Optimization
4 Implementations and A Library
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 31 / 77
![Page 33: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/33.jpg)
STOP Algorithms for Big Data Classification and Regression
STOP Algorithms for Big Data Classification andRegression
In this section, we present several well-known STOP algorithms for solvingclassification and regression problems
Coverage:
Stochastic Gradient Descent (Pegasos) for L1-SVM (primal)
Stochastic Dual Coordinate Ascent (SDCA) for L2-SVM (dual)
Stochastic Average Gradient (SAG) for Logistic Regression/Regression(primal)
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 32 / 77
![Page 34: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/34.jpg)
STOP Algorithms for Big Data Classification and Regression
STOP Algorithms for Big Data Classification
Support Vector Machines:
minw∈Rd
f (w) =1
n
n∑i=1
`(w>xi , yi )︸ ︷︷ ︸loss
+λ
2‖w‖2
2
−1 −0.5 0 0.5 1 1.5 20
0.5
1
1.5
2
2.5
3
3.5
4
ywTx
loss
L1−SVML2−SVM
non−smooth function
smooth function
L1 SVM: `(w>x, y) = max(0, 1− yiw>xi )
L2 SVM: `(w>x, y) = max(0, 1− yiw>xi )
2
Equivalent to C-SVM formulations C = nλ
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 33 / 77
![Page 35: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/35.jpg)
STOP Algorithms for Big Data Classification and Regression
STOP Algorithms for Big Data Classification
Support Vector Machines:
minw∈Rd
f (w) =1
n
n∑i=1
`(w>xi , yi )︸ ︷︷ ︸loss
+λ
2‖w‖2
2
−1 −0.5 0 0.5 1 1.5 20
0.5
1
1.5
2
2.5
3
3.5
4
ywTx
loss
L1−SVML2−SVM
non−smooth function
smooth function
L1 SVM: `(w>x, y) = max(0, 1− yiw>xi )
L2 SVM: `(w>x, y) = max(0, 1− yiw>xi )
2
Equivalent to C-SVM formulations C = nλ
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 33 / 77
![Page 36: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/36.jpg)
STOP Algorithms for Big Data Classification and Regression
STOP Algorithms for Big Data Classification
Support Vector Machines:
minw∈Rd
f (w) =1
n
n∑i=1
`(w>xi , yi )︸ ︷︷ ︸loss
+λ
2‖w‖2
2
−1 −0.5 0 0.5 1 1.5 20
0.5
1
1.5
2
2.5
3
3.5
4
ywTx
loss
L1−SVML2−SVM
non−smooth function
smooth function
L1 SVM: `(w>x, y) = max(0, 1− yiw>xi )
L2 SVM: `(w>x, y) = max(0, 1− yiw>xi )
2
Equivalent to C-SVM formulations C = nλ
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 33 / 77
![Page 37: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/37.jpg)
STOP Algorithms for Big Data Classification and Regression
Pegasos (Shai et al. ICML ’07, primal): STOP for L1 SVM
minw∈Rd
f (w) =1
n
n∑i=1
max(0, 1− yiw>xi ) +
λ
2‖w‖2
2︸ ︷︷ ︸strongly convex
Pegasos (SGD for strongly convex) :
stochastic gradient: ∂`(w>xi , yi ) =
−yixi , 1− yiw
>xi < 0
0, otherwise
∂f (wt ; it) = ∂`(w>t xit , yit ) + λwt
Stochastic Gradient Descent Updates
wt+1 = wt −1
λt∂f (wt ; it)
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 34 / 77
![Page 38: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/38.jpg)
STOP Algorithms for Big Data Classification and Regression
SDCA (C.J. Hsieh, ICML ’08, Shai&Tong JMLR’12, dual): STOP forL2 SVM
convex conjugate: max(0, 1− yw>x)2 = maxα≥0
(α− 1
4α2
)− αyw>x
Dual SVM: maxα≥0
D(α) =1
n
n∑i=1
(αi −
α2i
4
)︸ ︷︷ ︸
φ∗(αi )
−λ2
∥∥∥∥∥ 1
λn
n∑i=1
αiyixi
∥∥∥∥∥2
2
primal solution: wt =1
λn
n∑i=1
αti yixi
SDCA: stochastic dual coordinate ascent
Dual Coordinate Updates
∆αi = maxαti +∆αi≥0
φ∗(αti + ∆αi )−
λ
2
∥∥∥∥wt +1
λn∆αiyixi
∥∥∥∥2
2
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 35 / 77
![Page 39: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/39.jpg)
STOP Algorithms for Big Data Classification and Regression
SAG (Schmidt et al. ’11, primal): Stopt for Logit Reg./ Reg.
smooth and strongly convex objective
minw∈Rd
f (w) =1
n
n∑i=1
`(w>xi , yi ) +λ
2‖w‖2
2︸ ︷︷ ︸smooth and strongly convex
e.g. logistic regression, least square regression
SAG (stochastic average gradient):
wt+1 = (1− ηλ)wt −η
n
n∑i=1
g ti , where
g ti =
∂`(w>t xi , yi ), if i is selected
g t−1i , otherwise
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 36 / 77
![Page 40: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/40.jpg)
STOP Algorithms for Big Data Classification and Regression
OK, tell me which one is the best?
Let us compare Theoretically first
alg. Pegasos SAG SDCA
f (w) sc sm & sc sm loss & sc reg.
e.g. SVMs, LR, LSR L2-SVM, LR, LSR SVMs, LR, LSR
com. cost O(d) O(d) O(d)
mem. cost O(d) O(n + d) O(n + d)
Iteration O(
1λε
)O((n + 1
λ) log(
1ε
))O((n + 1
λ) log(
1ε
))Table : sc: strongly convex; sm: smooth; LR: logistic regression; LSR: leastsquare regression
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 37 / 77
![Page 41: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/41.jpg)
STOP Algorithms for Big Data Classification and Regression
What about `1 regularization?
minw∈Rd
1
n
n∑i=1
`(w>xi , yi ) + σ
K∑g=1
‖wg‖1︸ ︷︷ ︸Lasso or Group Lasso
Issue: Not Strongly Convex Solution: Add `22 regularization
minw∈Rd
1
n
n∑i=1
`(w>xi , yi ) + σ
K∑g=1
‖wg‖1 +λ
2‖w‖2
2
setting λ = Θ(1/ε)Algorithm: Pegasos, SDCA
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 38 / 77
![Page 42: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/42.jpg)
General Strategies for Stochastic Optimization
Outline
1 STochastic OPtimization (STOP) and Machine Learning
2 STOP Algorithms for Big Data Classification and Regression
3 General Strategies for Stochastic Optimization
4 Implementations and A Library
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 39 / 77
![Page 43: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/43.jpg)
General Strategies for Stochastic Optimization
General Strategies for Stochastic Optimization
General strategies for Stochastic Gradient Descent
Coverage:
SGD for various objectives
Accelerated SGD for Stochastic Composite Optimization
Variance Reduced SGD
Parallel and Distributed Optimization
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 40 / 77
![Page 44: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/44.jpg)
General Strategies for Stochastic Optimization
Stochastic Gradient Descent
minx∈X
f (x)
stochastic gradient ∇f (x;ω): ω is a random variable
SGD updates:
xt ← ΠX [xt−1 − γt∂f (xt−1;ωt)]
ΠX [x] = minx∈X ‖x− x‖22
Issues: How to determine learning rate γt? key: decreasing
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 41 / 77
![Page 45: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/45.jpg)
General Strategies for Stochastic Optimization
Stochastic Gradient Descent
minx∈X
f (x)
stochastic gradient ∇f (x;ω): ω is a random variable
SGD updates:
xt ← ΠX [xt−1 − γt∂f (xt−1;ωt)]
ΠX [x] = minx∈X ‖x− x‖22
Issues: How to determine learning rate γt? key: decreasing
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 41 / 77
![Page 46: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/46.jpg)
General Strategies for Stochastic Optimization
Why Stochastic Gradient Descent Works?
Why decreasing step size? GD vs SGD
xt = xt−1 − η∇f (xt−1)
∇f (xt−1)→ 0
xt = xt−1 − ηt∇f (xt−1;ωt)
∇f (xt−1;ωt) 6→ 0, ηt → 0
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 42 / 77
![Page 47: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/47.jpg)
General Strategies for Stochastic Optimization
Why Stochastic Gradient Descent Works?
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 43 / 77
![Page 48: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/48.jpg)
General Strategies for Stochastic Optimization
Why Stochastic Gradient Descent Works?
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 43 / 77
![Page 49: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/49.jpg)
General Strategies for Stochastic Optimization
Why Stochastic Gradient Descent Works?
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 43 / 77
![Page 50: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/50.jpg)
General Strategies for Stochastic Optimization
Why Stochastic Gradient Descent Works?
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 43 / 77
![Page 51: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/51.jpg)
General Strategies for Stochastic Optimization
Why Stochastic Gradient Descent Works?
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 43 / 77
![Page 52: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/52.jpg)
General Strategies for Stochastic Optimization
Why Stochastic Gradient Descent Works?
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 44 / 77
![Page 53: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/53.jpg)
General Strategies for Stochastic Optimization
Why Stochastic Gradient Descent Works?
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 45 / 77
![Page 54: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/54.jpg)
General Strategies for Stochastic Optimization
Why Stochastic Gradient Descent Works?
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 46 / 77
![Page 55: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/55.jpg)
General Strategies for Stochastic Optimization
Why Stochastic Gradient Descent Works?
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 47 / 77
![Page 56: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/56.jpg)
General Strategies for Stochastic Optimization
How to decrease the Step size?
Step size Depends on the property of the function.
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 48 / 77
![Page 57: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/57.jpg)
General Strategies for Stochastic Optimization
SGD for Non-smooth Function: Convergence Rate
‖x− y‖ ≤ D and ‖∂f (x ;ω)‖∗ ≤ G , ∀x, y ∈ X .
Step size: γt = c√t, c usually needed to be tuned
Convergence rate of final solution xT : O((
D2
c + cG 2)
log TT
)Are we good enough? Close but not yet
minimax optimal rate : O(
1√T
)
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 49 / 77
![Page 58: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/58.jpg)
General Strategies for Stochastic Optimization
SGD for Non-smooth Function: Convergence Rate
‖x− y‖ ≤ D and ‖∂f (x ;ω)‖∗ ≤ G , ∀x, y ∈ X .
Step size: γt = c√t, c usually needed to be tuned
Convergence rate of final solution xT : O((
D2
c + cG 2)
log TT
)Are we good enough? Close but not yet
minimax optimal rate : O(
1√T
)
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 49 / 77
![Page 59: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/59.jpg)
General Strategies for Stochastic Optimization
SGD for Non-smooth Function: Convergence Rate
‖x− y‖ ≤ D and ‖∂f (x ;ω)‖∗ ≤ G , ∀x, y ∈ X .
Step size: γt = c√t, c usually needed to be tuned
Convergence rate of final solution xT : O((
D2
c + cG 2)
log TT
)Are we good enough? Close but not yet
minimax optimal rate : O(
1√T
)
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 49 / 77
![Page 60: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/60.jpg)
General Strategies for Stochastic Optimization
SGD for Non-smooth Function: Convergence Rate
‖x− y‖ ≤ D and ‖∂f (x ;ω)‖∗ ≤ G , ∀x, y ∈ X .
Step size: γt = c√t, c usually needed to be tuned
Convergence rate of final solution xT : O((
D2
c + cG 2)
log TT
)Are we good enough? Close but not yet
minimax optimal rate : O(
1√T
)
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 49 / 77
![Page 61: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/61.jpg)
General Strategies for Stochastic Optimization
SGD for Non-smooth Function: Convergence Rate
minimax optimal rate : O(
1√T
)Averaging:
xt =
(1− 1 + η
t + η
)xt−1 +
1 + η
t + ηxt , η ≥ 0
η = 0 standard average
xT =x1 + · · ·+ xT
T
Convergence Rate of xT : O(η(D2
c + cG 2)
1T
)
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 50 / 77
![Page 62: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/62.jpg)
General Strategies for Stochastic Optimization
SGD for Non-smooth Strongly Convex Function:Convergence Rate
λ-Strongly Convex minimax optimal rate : O( 1T )
Step size: γt = 1λt
Convergence Rate of xT : O(G2 log TλT
)Averaging:
xt =
(1− 1 + η
t + η
)xt−1 +
1 + η
t + ηxt , η ≥ 0
Convergence Rate of xT : O(ηG2
λT
)
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 51 / 77
![Page 63: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/63.jpg)
General Strategies for Stochastic Optimization
SGD for Smooth function
Generally No Further improvement upon O(
1√T
)Special structure may yield better convergence
Stochastic Composite Optimization: Smooth + Non-Smooth
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 52 / 77
![Page 64: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/64.jpg)
General Strategies for Stochastic Optimization
General Strategies for Stochastic Optimization
General strategies for Stochastic Gradient Descent
Coverage:
SGD for various objectives
Accelerated SGD for Stochastic Composite Optimization
Variance Reduced SGD
Parallel and Distributed Optimization
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 53 / 77
![Page 65: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/65.jpg)
General Strategies for Stochastic Optimization
Stochastic Composite Optimization
minx∈X
f (x)︸︷︷︸smooth
+ g(x)︸︷︷︸nonsmooth
Example 1
minw
1
n
n∑i=1
log(1 + exp(−yiw>xi )) + λ‖w‖1
Example 2
minw
1
n
n∑i=1
1
2(yi −w>xi )
2 + λ‖w‖1
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 54 / 77
![Page 66: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/66.jpg)
General Strategies for Stochastic Optimization
Stochastic Composite Optimization
Accelerated Stochastic Gradient
auxiliary sequence of search solutionsmaintain three solutions: xt (action solution), xt (averaged solution),xmt (auxiliary solution)Similar to Nesterov’s deterministic methodseveral variants
Stochastic Dual Coordinate Ascent
minw∈Rd
1
n
n∑i=1
`(w>xi , yi ) + σ
K∑g=1
‖wg‖1 +λ
2‖w‖2
2
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 55 / 77
![Page 67: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/67.jpg)
General Strategies for Stochastic Optimization
G. Lan’s Accelerated Stochastic Approximation
Iterate t = 1, . . . ,T1. compute auxiliary solution
xmt = β−1t xt + (1− β−1
t )xt
2. update solution
xt+1 = Pxt (γtG (xmt ;ωt)) (think about projection)
3. update averaged solution
xt+1 = (1− β−1t )xt + β−1
t xt+1 (Similar to SGD)
γt , βt , Convergence Rate?
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 56 / 77
![Page 68: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/68.jpg)
General Strategies for Stochastic Optimization
Accelerated Stochastic Approximation
1. β−1t = 2
t+1 :
xt+1 =
(1− 2
t + 1
)xt +
2
t + 1xt+1
2. γt = t+12 γ: Increasing, Never like before
3. Convergence rate of xT :
O
(L
T 2+
M + σ2
√T
)
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 57 / 77
![Page 69: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/69.jpg)
General Strategies for Stochastic Optimization
Variance Reduction: Mini-batch
Batch gradient: gt = 1b
∑bi=1 ∂f (xt−1, ωti )
Unbiased sampling: Egt = E∂f (xt−1, ω).
Variance reduced: Vgt = 1bV∂f (xt−1, ω).
Tradeoff the batch computation for variance decrease.
Batch size1 Constant batch size;2 Increasing batch size as proportion to t.
Convergence Rate: e.g. O( 1bλT ) of SGD for strongly convex
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 58 / 77
![Page 70: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/70.jpg)
General Strategies for Stochastic Optimization
Variance Reduction: Mixed-Grad (Mahdavi et. al, Johnson et al. ’13)
Iterate s = 1, . . . ,
Iterate t = 1, . . . ,m
xst = xst−1 − η (∇f (xst−1;ωt)−∇f (xs−1;ωt) +∇f (xs−1))︸ ︷︷ ︸StoGrad−StoGrad+Grad
update xs
how variance is reduced?if xs−1 → x∗, ∇f (xt−1)→ 0, ∇f (xt−1;ωt)−∇f (x∗;ωt)→ 0
xs = xsm or xs =∑m
t=1 xst/m
Constant learning rate
Better Iteration Complexity.: O((n + 1λ) log
(1ε
)) for sm&sc, O(1/ε)
for smooth
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 59 / 77
![Page 71: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/71.jpg)
General Strategies for Stochastic Optimization
Distributed Optimization: Why?
data distributed over multiple machines
moving to single machine suffers
low network bandwidthlimited disk or memory
benefits from parallel computation
cluster of machinesGPUs
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 60 / 77
![Page 72: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/72.jpg)
General Strategies for Stochastic Optimization
A Naive Solution
Data
w1 w2 w3 w4 w5 w6
w =1
k
k∑i=1
wi
Issue: Not the Optimal
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 61 / 77
![Page 73: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/73.jpg)
General Strategies for Stochastic Optimization
Distribute SGD is simple
Mini-batch
synchronization
Mini-batch SGD
Good: reduced variance, faster conv.
Bad: synchronization is expensive
Solutions:asynchronized update
convergence: difficult to prove
lesser synchronizations
guaranteed convergence
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 62 / 77
![Page 74: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/74.jpg)
General Strategies for Stochastic Optimization
Distribute SDCA is not trivial
issue: data are correlated
∆αi = arg max−φ∗i (−αti −∆αi )−
λn
2
∥∥∥∥wt +1
λn∆αixi
∥∥∥∥2
2
Not Working
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 63 / 77
![Page 75: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/75.jpg)
General Strategies for Stochastic Optimization
Distribute SDCA (T. Yang NIPS’13)
The Basic Variant
mR-SDCA running on machine k
Iterate: for t = 1, . . . ,T
Iterate: for j = 1, . . . ,m
Randomly pick i ∈ 1, · · · , nk and let ij = iFind ∆αk,i by IncDual(w = wt−1, scl = mK )Set αt
k,i = αt−1k,i + ∆αk,i
Reduce: wt ← wt−1 + 1λn
∑mj=1 ∆αk,ij xk,ij
∆αi = arg max−φ∗i (−αti −∆αi )−
λn
2mK
∥∥∥∥wt +mK
λn∆αixi
∥∥∥∥2
2
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 64 / 77
![Page 76: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/76.jpg)
General Strategies for Stochastic Optimization
Distribute SDCA (T. Yang NIPS’13)
The Practical Variant
Initialize: u0t = wt−1
Iterate: for j = 1, . . . ,m
Randomly pick i ∈ 1, · · · , nk and let ij = i
Find ∆αk,ij by IncDual(w = uj−1t , scl = k)
Update αtk,ij
= αt−1k,ij
+ ∆αk,ij
Update ujt = uj−1t + 1
λnk∆αk,ijxk,ij
∆αij = arg max−φ∗ij (−αtij−∆αij )−
λn
2K
∥∥∥∥uj−1t +
K
λn∆αijxij
∥∥∥∥2
2
Using Updated Informationand Large Step Size
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 65 / 77
![Page 77: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/77.jpg)
General Strategies for Stochastic Optimization
Distribute SDCA (T. Yang NIPS’13)
The Practical Variant is much faster than the Basic Variant
0 20 40 60 80 100−9
−8
−7
−6
−5
−4
−3
−2
−1
0
iterations
log
(P(w
t ) −
D(α
t ))
duality gap
DisDCA−pDisDCA−n
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 66 / 77
![Page 78: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/78.jpg)
Implementations and A Library
Outline
1 STochastic OPtimization (STOP) and Machine Learning
2 STOP Algorithms for Big Data Classification and Regression
3 General Strategies for Stochastic Optimization
4 Implementations and A Library
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 67 / 77
![Page 79: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/79.jpg)
Implementations and A Library
Implementations and A Library
In this section, we present some techniques for efficient implementationsand a practical library
Coverage:
efficient averaging
Gradient sparsification
pre-optimization: screening
pre-optimization: data preconditioning
distributed (parallel) optimization library
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 68 / 77
![Page 80: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/80.jpg)
Implementations and A Library
Efficient Averaging
Update rule:
xt = (1− γtλ)xt−1 + γtgt
xt = (1− αt)xt−1 + αtxt
Efficient update when gt has many 0, or gt is sparse,
St =
(1− λγt 0
αt(1− λγt) 1− αt
)St−1, S1 = I
yt = yt−1 − [S−1t ]11γtgt
yt = yt−1 − ([S−1t ]21 + [S−1
t ]22αt)γtgt
xT = [ST ]11yT
xT = [ST ]21yT + [ST ]22yT
When Gradient is SparseYang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 69 / 77
![Page 81: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/81.jpg)
Implementations and A Library
Gradient sparsification
Sparsification by importance sampling
Rti = unif(0, 1)
gti = gti [|gti | ≥ gi ] + gsign(gti )[giRti ≤ |gti | < gi ]
Unbiased sample: Egt = gt .
Tradeoff variance increase for the efficient computation.
Especially useful for Logistic Regression
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 70 / 77
![Page 82: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/82.jpg)
Implementations and A Library
Pre-optimization: Screening
Screening for Lasso (Wang et al., 2012)
Screening for SVM (Ogawa et al., 2013)
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 71 / 77
![Page 83: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/83.jpg)
Implementations and A Library
Distributed Optimization Library: Birds
The birds library implements distributed stochastic dual coordinateascent (DisDCA) for classification and regression with a broadsupport.
For technical details see:
”Trading Computation for Communication: Distributed StochasticDual Coordinate Ascent.” Tianbao Yang. NIPS 2013.”On the Theoretical Analysis of Distributed Stochastic DualCoordinate Ascent” Tianbao Yang, etc. Tech Report 2013.
The code is distributed under GNU General Publich License (seelicense.txt for details).
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 72 / 77
![Page 84: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/84.jpg)
Implementations and A Library
Distributed Optimization Library: Birds
What problems does it Solve?
Classification and Regression
Loss1 Hinge loss (SVM): L-1 hinge loss, L-2 hinge loss2 Logistic loss (Logistic Regression)3 Least Square Regression (Ridge Regression)
Regularizer1 `2 norm: SVM, Logistic Regression, Ridge Regression2 `1 norm: Lasso, SVM, LR with `1 norm
Multi-class : one-vs-all
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 73 / 77
![Page 85: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/85.jpg)
Implementations and A Library
Distributed Optimization Library: Birds
What data does it Support?
dense, sparse
txt, binary
What environment does it Support?
Prerequisites: Boost Library
Tested on A cluster of Linux Servers (up to hundreds of machines)
Tested on Cygwin in Windows with multi-core
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 74 / 77
![Page 86: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/86.jpg)
Implementations and A Library
How about kernel methods?
Stochastic/Online Optimization Approaches: (Jin et al., 2010;Orabona et al., 2012),
Linearization + STOP for linear methods
the Nystrom method (Drineas & Mahoney, 2005)Random Fourier Features (Rahimi & Recht, 2007)Comparison of two (Yang et al., 2012)
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 75 / 77
![Page 87: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/87.jpg)
Implementations and A Library
References I
Drineas, Petros and Mahoney, Michael W. On the nystrom method forapproximating a gram matrix for improved kernel-based learning. J.Mach. Learn. Res., 6:2153–2175, 2005.
Jin, Rong, Hoi, Steven C. H., and Yang, Tianbao. Online multiple kernellearning: Algorithms and mistake bounds. In Proceedings of the 21stInternational Conference on Algorithmic Learning Theory, pp. 390–404,2010.
Ogawa, Kohei, Suzuki, Yoshiki, and Takeuchi, Ichiro. Safe screening ofnon-support vectors in pathwise svm computation. In ICML (3), pp.1382–1390, 2013.
Orabona, Francesco, Luo, Jie, and Caputo, Barbara. Multi kernel learningwith online-batch optimization. Journal of Machine Learning Research,13:227–253, 2012.
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 76 / 77
![Page 88: Stochastic Optimization for Big Data Analytics: Algorithms and … · 2018-05-25 · Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yangz SDM 2014,](https://reader033.vdocuments.site/reader033/viewer/2022050107/5f459eb6e7e14a28fd360790/html5/thumbnails/88.jpg)
Implementations and A Library
References II
Rahimi, Ali and Recht, Benjamin. Random features for large-scale kernelmachines. In NIPS, 2007.
Wang, Jie, Lin, Binbin, Gong, Pinghua, Wonka, Peter, and Ye, Jieping.Lasso screening rules via dual polytope projection. CoRR,abs/1211.3966, 2012.
Yang, Tianbao, Li, Yu-Feng, Mahdavi, Mehrdad, Jin, Rong, and Zhou,Zhi-Hua. ”nystrom method vs random fourier features: A theoreticaland empirical comparison”. In NIPS, pp. 485–493, 2012.
Yang et al. (NEC Labs America) Tutorial for SDM’14 February 9, 2014 77 / 77