
Stochastic Optimization for Big Data Analytics

Tianbao Yang‡, Rong Jin†, Shenghuo Zhu‡

Tutorial @ SDM 2014, Philadelphia, Pennsylvania

‡NEC Laboratories America, †Michigan State University

April 26, 2014


The updates are available here

http://www.cse.msu.edu/~yangtia1/sdm14-tutorial.pdf

Thanks.


Some Claims

No:
- This tutorial is not an exhaustive literature survey
- The algorithms are not necessarily the best for small data
- The theories may not carry over to non-convex optimization

Yes:
- State-of-the-art stochastic optimization for SVM, Logistic Regression, Least Square Regression, LASSO
- A generic distributed library


Outline

1. Machine Learning and STochastic OPtimization (STOP)
   - Introduction
   - Motivation
   - Warm-up

2. STOP Algorithms for Big Data Classification and Regression
   - Classification and Regression
   - Algorithms

3. General Strategies for Stochastic Optimization
   - Stochastic Gradient Descent and Accelerated Variants
   - Parallel and Distributed Optimization
   - Other Effective Strategies

4. Implementations and A Distributed Library



Introduction

Machine Learning problems and Stochastic Optimization

Classification and Regression in different forms

Motivation to employ STochastic OPtimization (STOP)

Basic Convex Optimization Knowledge


Three Steps for Machine Learning and Pattern Recognition

[Figure: the three-step pipeline Data → Model → Optimization, illustrated with a plot of distance to the optimal objective vs. iterations for the rates 0.5^T, 1/T², and 1/T]


Learning as Optimization

Least Square Regression Problem:

min_{w∈R^d} (1/n) Σ_{i=1}^n (y_i − w^⊤x_i)² + (λ/2)‖w‖₂²

where the first term is the Empirical Loss and the second term is the Regularization.

x_i ∈ R^d: d-dimensional feature vector
y_i ∈ R: target variable
w ∈ R^d: model parameters
n: number of data points


Learning as Optimization

Classification Problems:

min_{w∈R^d} (1/n) Σ_{i=1}^n ℓ(y_i w^⊤x_i) + (λ/2)‖w‖₂²

y_i ∈ {+1, −1}: label
Loss function ℓ(z), with z = y w^⊤x:
1. SVMs: (squared) hinge loss ℓ(z) = max(0, 1 − z)^p, where p = 1, 2
2. Logistic Regression: ℓ(z) = log(1 + exp(−z))


Learning as Optimization

Feature Selection:

min_{w∈R^d} (1/n) Σ_{i=1}^n ℓ(w^⊤x_i, y_i) + λ‖w‖₁

ℓ₁ regularization: ‖w‖₁ = Σ_{i=1}^d |w_i|
λ controls the sparsity level


Learning as Optimization

Feature Selection using Elastic Net:

min_{w∈R^d} (1/n) Σ_{i=1}^n ℓ(w^⊤x_i, y_i) + λ(‖w‖₁ + γ‖w‖₂²)

The elastic net regularizer is more robust than the ℓ₁ regularizer alone.


Learning as Optimization

Multi-class/Multi-task Learning:

min_W (1/n) Σ_{i=1}^n ℓ(W x_i, y_i) + λ R(W), with W ∈ R^{K×d}

R(W) = ‖W‖_F² = Σ_{k=1}^K Σ_{j=1}^d W_{kj}²: Frobenius norm
R(W) = ‖W‖_* = Σ_i σ_i: nuclear norm (sum of singular values)
R(W) = ‖W‖_{1,∞} = Σ_{j=1}^d ‖W_{:j}‖_∞: ℓ_{1,∞} mixed norm
Extensions to matrix cases are possible


Big Data Challenge

Huge amount of data generated every day:
- Facebook users upload 3 million photos
- Google receives 3 billion queries
- YouTube users upload over 1,700 hours of video
- The global internet population is 2.1 billion people
- 247 billion emails are sent

Data Analytics
http://www.visualnews.com/2012/06/19/how-much-data-created-every-minute/


Why Learning from Big Data is Hard?

min_{w∈R^d} (1/n) Σ_{i=1}^n ℓ(w^⊤x_i, y_i) + λR(w)   (empirical loss + regularizer)

Too many data points
- Issue: we cannot afford to go through the data set many times
- Solution: Stochastic Optimization

High dimensional data
- Issue: we cannot afford second order optimization (Newton's method)
- Solution: first order methods (i.e., gradient based methods)

Data are distributed over many machines
- Issue: expensive (if not impossible) to move data
- Solution: Distributed Optimization


Stochastic Optimization

min_{w∈W} F(w) = E_ξ[f(w; ξ)]

f(w; ξ) is convex in w, so F(w) is convex; ξ is a random variable.

Methods:
1. Sample Average Approximation: draw samples ξ_1, …, ξ_n and solve
   min_{w∈W} (1/n) Σ_{i=1}^n f(w; ξ_i)
2. Stochastic Approximation: iterate using the stochastic gradient ∇f(w; ξ)


Machine Learning is Stochastic Optimization

Goal:
min_{w∈W} E_{ξ=(x,y)}[Loss(w^⊤x, y)]

Empirical Regularized Loss Minimization:
min_{w∈W} (1/n) Σ_{i=1}^n [Loss(w^⊤x_i, y_i) + λR(w)]

where the average (1/n) Σ_{i=1}^n plays the role of the expectation E_{ξ=i} over a uniformly random index i.

The Simplest Method for Stochastic Optimization

Stochastic Optimization:
min_{w∈R^d} F(w) = E_ξ[f(w; ξ)]

Stochastic Gradient Descent (Nemirovski & Yudin, 1978):
w_t = w_{t−1} − γ_t ∇f(w_{t−1}; ξ_t)

where γ_t is the step size and ∇f(w_{t−1}; ξ_t) is a stochastic gradient satisfying E_{ξ_t}[∇f(w; ξ_t)] = ∇F(w).


Stochastic Gradient in Machine Learning

F(w) = (1/n) Σ_{i=1}^n ℓ(w^⊤x_i, y_i) + (λ/2)‖w‖₂²

Let i_t ∈ {1, …, n} be sampled uniformly at random.

Key equation: E_{i_t}[∇ℓ(w^⊤x_{i_t}, y_{i_t}) + λw] = ∇F(w)

The computation is very cheap, O(d), compared with the full gradient, O(nd).

Update: w_t = (1 − γ_t λ) w_{t−1} − γ_t ∇ℓ(w_{t−1}^⊤x_{i_t}, y_{i_t})
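As a concrete illustration, here is a minimal Python/NumPy sketch of one such stochastic gradient step for the regularized least square objective; the function and variable names (sgd_step, gamma, lam for λ) are assumptions made for this example, not from the tutorial:

```python
import numpy as np

def sgd_step(w, X, y, lam, gamma, rng):
    """One stochastic gradient step for (1/n) sum_i (w^T x_i - y_i)^2 + (lam/2)||w||^2."""
    n = X.shape[0]
    i = rng.integers(n)                 # i_t sampled uniformly from {0, ..., n-1}
    resid = X[i] @ w - y[i]
    grad_loss = 2.0 * resid * X[i]      # gradient of the squared loss at one example: O(d)
    # w_t = (1 - gamma*lam) w_{t-1} - gamma * grad_loss
    return (1.0 - gamma * lam) * w - gamma * grad_loss
```

In expectation over i, grad_loss + lam*w equals the full gradient ∇F(w), which is what makes the cheap O(d) update an unbiased direction.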


Warm-up: Vector, Norm, Inner product, Dual Norm

Bold letters x ∈ R^d (data vector) and w ∈ R^d (model parameter) denote d-dimensional vectors; y_i denotes the response variable of the i-th data point. x, y ∈ X are finite-dimensional variables, where X is a normed space.

Norm ‖w‖: R^d → R_+, e.g.
1. ℓ₂ norm ‖w‖₂ = √(Σ_{i=1}^d w_i²)
2. ℓ₁ norm ‖w‖₁ = Σ_{i=1}^d |w_i|
3. ℓ_∞ norm ‖w‖_∞ = max_i |w_i|

Inner product: ⟨x, w⟩ = x^⊤w = Σ_{i=1}^d x_i w_i

Dual norm: ‖w‖_* = max_{‖x‖≤1} x^⊤w
1. ‖x‖₂ ⟺ ‖w‖₂ (the ℓ₂ norm is its own dual)
2. ‖x‖₁ ⟺ ‖w‖_∞ (ℓ₁ and ℓ_∞ are dual to each other)
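A quick numerical check of the dual-norm pairs, as a hedged sketch (the vector w used here is an arbitrary example):

```python
import numpy as np

w = np.array([3.0, -1.0, 2.0])

# Dual norm ||w||_* = max_{||x|| <= 1} x^T w.
# Over the l1 ball, the maximum is attained at a signed standard basis vector
# (a vertex of the ball), recovering ||w||_inf:
j = np.argmax(np.abs(w))
x_star = np.sign(w[j]) * np.eye(3)[j]          # vertex of the unit l1 ball
print(x_star @ w, np.linalg.norm(w, np.inf))   # both print 3.0

# Over the l2 ball, the maximizer is w/||w||_2, recovering ||w||_2 itself:
print((w / np.linalg.norm(w)) @ w, np.linalg.norm(w))
```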


Warm-up: Convex Optimization

min_{x∈X} f(x)

X is a convex domain; f(x) is a convex function.

[Figure: the convex function x², with a gradient (tangent) line illustrating smoothness]


Warm-up: Convex Function

Characterization of Convex Function

f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y), ∀x, y ∈ X, α ∈ [0, 1]

f(x) ≥ f(y) + ∇f(y)^⊤(x − y), ∀x, y ∈ X

[Figure: the chord αf(x) + (1 − α)f(y) lies above the function value f(αx + (1 − α)y); the tangent line f(y) + ∇f(y)^⊤(x − y) lies below f(x)]


Warm-up: Convergence Measure

Most optimization algorithms are iterative:
x_{t+1} = x_t + Δx_t

Iteration Complexity: the number of iterations T(ε) needed to have
f(x_T) − min_{x∈X} f(x) ≤ ε   (ε ≪ 1)

Convergence Rate: after T iterations, how good is the solution:
f(x_T) − min_{x∈X} f(x) ≤ ε(T)

[Figure: objective vs. iterations, with the target accuracy ε reached after T iterations]

Total Runtime = Per-iteration Cost × Iteration Complexity


More on Convergence Measure

Big O(·) notation: explicit dependence on T or ε

             Convergence Rate        Iteration Complexity
linear       O(μ^T) (μ < 1)          O(log(1/ε))
sub-linear   O(1/T^α), α > 0         O(1/ε^{1/α})

Why are we interested in bounds?

[Figure: distance to optimum vs. iterations T for the rates 0.5^T, 1/T, and 1/T^0.5; at a fixed target accuracy these correspond to runtimes on the order of seconds, minutes, and hours, respectively]

Theoretically, we consider
O(μ^T) ≺ O(1/T²) ≺ O(1/T) ≺ O(1/√T)
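To make the ordering concrete, a small hedged sketch computing how many iterations each rate needs to reach ε = 10⁻³ (the constant μ = 0.5 and the unit constants are illustrative assumptions, not from the tutorial):

```python
import math

eps = 1e-3
# linear rate O(mu^T) with mu = 0.5: solve 0.5^T = eps for T
t_linear = math.log(1 / eps) / math.log(2)
# sub-linear O(1/T): T = 1/eps;  O(1/sqrt(T)): T = 1/eps^2
t_inv = 1 / eps
t_inv_sqrt = 1 / eps ** 2
print(f"0.5^T     : {t_linear:10.1f} iterations")   # ~10      ("seconds")
print(f"1/T       : {t_inv:10.0f} iterations")      # 1,000    ("minutes")
print(f"1/sqrt(T) : {t_inv_sqrt:10.0f} iterations") # 1,000,000 ("hours")
```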


Factors that affect Iteration Complexity

Property of the function: e.g., smoothness

Domain X: size and geometry

Size of the problem: dimension and number of data points


Warm-up: Non-smooth function

Lipschitz continuous: e.g. the absolute loss f(x) = |x|

|f(x) − f(y)| ≤ G‖x − y‖₂, where G is the Lipschitz constant

Subgradient: f(x) ≥ f(y) + ∂f(y)^⊤(x − y)

[Figure: the non-smooth function |x| with one of its subgradient lines at the kink]


Warm-up: Smooth Convex function

Smooth: e.g. the logistic loss f(x) = log(1 + exp(−x))

‖∇f(x) − ∇f(y)‖₂ ≤ β‖x − y‖₂, where β > 0 is the smoothness constant

The second order derivative is upper bounded: ‖∇²f(x)‖₂ ≤ β

[Figure: the logistic loss bounded above near y by a quadratic function and below by its tangent f(y) + f′(y)(x − y)]


Warm-up: Strongly Convex function

Strongly convex: e.g. the squared Euclidean norm f(x) = (1/2)‖x‖₂²

‖∇f(x) − ∇f(y)‖₂ ≥ λ‖x − y‖₂, where λ > 0 is the strong convexity constant

The second order derivative is lower bounded: ‖∇²f(x)‖₂ ≥ λ

[Figure: the strongly convex function x² with a gradient line]


Warm-up: Smooth and Strongly Convex function

Smooth and strongly convex: e.g. the quadratic function f(z) = (1/2)(z − 1)²

λ‖x − y‖₂ ≤ ‖∇f(x) − ∇f(y)‖₂ ≤ β‖x − y‖₂, with β ≥ λ > 0


Outline

1. Machine Learning and STochastic OPtimization (STOP)
   - Introduction
   - Motivation
   - Warm-up

2. STOP Algorithms for Big Data Classification and Regression
   - Classification and Regression
   - Algorithms

3. General Strategies for Stochastic Optimization
   - Stochastic Gradient Descent and Accelerated Variants
   - Parallel and Distributed Optimization
   - Other Effective Strategies

4. Implementations and A Distributed Library


STOP Algorithms for Big Data Classification and Regression

Stochastic Gradient Descent (Pegasos) for SVM

Stochastic Average Gradient (SAG) for Logistic Regression and Least Square Regression

Stochastic Dual Coordinate Ascent (SDCA)

Stochastic Optimization for Lasso


Classification and Regression

min_{w∈R^d} F(w) = (1/n) Σ_{i=1}^n ℓ(w^⊤x_i, y_i) + (λ/2)‖w‖₂²

Classification (smooth losses are marked in red on the original slide):
1. SVM (hinge loss): ℓ(w^⊤x, y) = max(0, 1 − yw^⊤x)
2. Smooth SVM (squared hinge loss): ℓ(w^⊤x, y) = max(0, 1 − yw^⊤x)²
3. Equivalent to the C-SVM formulation with C = 1/(nλ)
4. Logistic Regression (logistic loss): ℓ(w^⊤x, y) = log(1 + exp(−yw^⊤x))

Regression:
1. Least Square Regression (square loss): ℓ(w^⊤x, y) = (w^⊤x − y)²
2. Least Absolute Deviation (absolute loss): ℓ(w^⊤x, y) = |w^⊤x − y|


Timeline of Stochastic Optimization in Machine Learning

min_{w∈R^d} F(w) = (1/n) Σ_{i=1}^n ℓ(w^⊤x_i, y_i) + (λ/2)‖w‖₂²

[Timeline figure:
- 90's: basic SGD
- 2003: general convex — [Zinkevich03], [Kivinen04] — rate O(1/√T)
- 2007: strongly convex — [Hazan07], [Shalev-Shwartz07] (Pegasos) — rate O(1/T)
- 2012: smooth & strongly convex — [Roux12] (SAG), [Shalev-Shwartz13] (SDCA), [Zhang13] — rate O(μ^T)]

Update: w_t = (1 − γ_t λ) w_{t−1} − γ_t ∇ℓ(w_{t−1}^⊤x_{i_t}, y_{i_t})


Basic SGD

Leveraging only convexity:

min_{w∈R^d} F(w) = (1/n) Σ_{i=1}^n ℓ(w^⊤x_i, y_i) + (λ/2)‖w‖₂²

Update: w_t = (1 − γ_t λ) w_{t−1} − γ_t ∇ℓ(w_{t−1}^⊤x_{i_t}, y_{i_t}), with γ_t = c/√t

Output solution: w̄_T = (1/T) Σ_{t=1}^T w_t ⇒ convergence rate O(1/√T)
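A hedged sketch of the full basic-SGD loop with the c/√t step size and averaged output; the function name, the grad_loss callback, and the loop structure are assumptions made for this example:

```python
import numpy as np

def basic_sgd(grad_loss, X, y, lam, c, T, rng):
    """Basic SGD: gamma_t = c/sqrt(t); output the average of all iterates.

    grad_loss(w, x_i, y_i) returns the (sub)gradient of the loss at one example.
    """
    w = np.zeros(X.shape[1])
    w_avg = np.zeros_like(w)
    for t in range(1, T + 1):
        i = rng.integers(X.shape[0])
        gamma = c / np.sqrt(t)
        w = (1.0 - gamma * lam) * w - gamma * grad_loss(w, X[i], y[i])
        w_avg += (w - w_avg) / t          # running average (1/T) sum_t w_t
    return w_avg
```

Swapping the step size for γ_t = 1/(λt) turns this into the Pegasos update on the next slide.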


Pegasos (Shalev-Shwartz et al. (2007))

Leveraging a strongly convex regularizer:

min_{w∈R^d} F(w) = (1/n) Σ_{i=1}^n max(0, 1 − y_i w^⊤x_i) + (λ/2)‖w‖₂²   (the ℓ₂² regularizer makes F strongly convex)

Update: w_t = (1 − γ_t λ) w_{t−1} − γ_t ∇ℓ(w_{t−1}^⊤x_{i_t}, y_{i_t}), with γ_t = 1/(λt)

Output solution: w̄_T = (1/T) Σ_{t=1}^T w_t ⇒ convergence rate O(1/(λT))


Pegasos (Shalev-Shwartz et al. (2007))

e.g. hinge loss (SVM), absolute loss (Least Absolute Deviation)

Stochastic subgradient:
∂ℓ(w^⊤x_{i_t}, y_{i_t}) = −y_{i_t} x_{i_t} if 1 − y_{i_t} w^⊤x_{i_t} > 0, and 0 otherwise

Computation cost per iteration: O(d)
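A minimal sketch of Pegasos using the hinge subgradient above; the function name and training-loop structure are illustrative assumptions:

```python
import numpy as np

def pegasos(X, y, lam, T, rng):
    """Pegasos: SGD with step size gamma_t = 1/(lam*t) for the SVM objective."""
    w = np.zeros(X.shape[1])
    w_avg = np.zeros_like(w)
    for t in range(1, T + 1):
        i = rng.integers(X.shape[0])
        gamma = 1.0 / (lam * t)
        # hinge subgradient: -y_i x_i if the margin is violated, else 0
        if 1.0 - y[i] * (X[i] @ w) > 0:
            g = -y[i] * X[i]
        else:
            g = np.zeros_like(w)
        w = (1.0 - gamma * lam) * w - gamma * g
        w_avg += (w - w_avg) / t
    return w_avg   # averaged solution, O(1/(lam*T)) suboptimality in expectation
```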


SAG (Roux et al. (2012))

Leveraging smoothness of the loss:

min_{w∈R^d} F(w) = (1/n) Σ_{i=1}^n ℓ(w^⊤x_i, y_i) + (λ/2)‖w‖₂²   (smooth and strongly convex)

Estimated average gradient:
G_t = (1/n) Σ_{i=1}^n g_i^t, where g_i^t = ∇ℓ(w_t^⊤x_i, y_i) if i = i_t is the selected example, and g_i^t = g_i^{t−1} otherwise

Update: w_t = (1 − γ_t λ) w_{t−1} − γ_t G_t, with γ_t = c/β

Output solution: w_T ⇒ linear convergence rate O(μ^T)


SAG: efficient update of averaged gradient

For logistic regression, least square regression, and smooth SVM, each individual gradient has the form

g_i = ∇ℓ(w^⊤x_i, y_i) = α_i x_i

so the averaged gradient can be refreshed incrementally:

G_t = (1/n) Σ_{i=1}^n g_i^t = (1/n) Σ_{i=1}^n α_i^t x_i = G_{t−1} + (1/n)(α_{i_t}^t − α_{i_t}^{t−1}) x_{i_t}

Computation cost per iteration: O(d)
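A hedged sketch of SAG for regularized least squares, exploiting g_i = α_i x_i with α_i = 2(w^⊤x_i − y_i) so the running average G is updated in O(d); the function and variable names are illustrative assumptions:

```python
import numpy as np

def sag_least_squares(X, y, lam, gamma, T, rng):
    """SAG: store one scalar alpha_i per example; update the average gradient in O(d)."""
    n, d = X.shape
    w = np.zeros(d)
    alpha = np.zeros(n)       # alpha_i = 2*(w^T x_i - y_i) at the last visit of example i
    G = np.zeros(d)           # G = (1/n) sum_i alpha_i x_i
    for _ in range(T):
        i = rng.integers(n)
        a_new = 2.0 * (X[i] @ w - y[i])
        G += (a_new - alpha[i]) / n * X[i]   # O(d) refresh of the stored average
        alpha[i] = a_new
        w = (1.0 - gamma * lam) * w - gamma * G
    return w
```

The O(d + n) memory cost in the summary table below comes from the stored scalars alpha plus the d-dimensional average G.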


SDCA (Shalev-Shwartz & Zhang (2013))

Stochastic Dual Coordinate Ascent (implemented in liblinear (Hsieh et al., 2008)); iteration complexity O(1/ε) for non-smooth losses and O(log(1/ε)) for smooth losses.

Dual problem:
max_{α∈Q} D(α) = (1/n) Σ_{i=1}^n −φ^*(−α_i) − (λ/2) ‖(1/(λn)) Σ_{i=1}^n α_i x_i‖₂²

Primal solution: w_t = (1/(λn)) Σ_{i=1}^n α_i^t x_i

Dual coordinate update:
Δα_i = argmax_{α_i^t + Δα_i ∈ Q} −φ^*(−α_i^t − Δα_i) − (λn/2) ‖w_t + (1/(λn)) Δα_i x_i‖₂²


SDCA updates

Closed-form solutions exist for the hinge loss, squared hinge loss, absolute loss, and square loss (Shalev-Shwartz & Zhang (2013)). E.g., for the square loss:

Δα_i^t = (y_i − w_t^⊤x_i − α_i^{t−1}) / (1 + ‖x_i‖₂²/(λn))

Computation cost per iteration: O(d)

For the logistic loss, an approximate solution is used (Shalev-Shwartz & Zhang (2013)).
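A hedged sketch of the SDCA loop for the square-loss (ridge) case, mirroring the closed-form Δα above; the function name and loop structure are assumptions made for illustration:

```python
import numpy as np

def sdca_ridge(X, y, lam, T, rng):
    """SDCA for (1/n) sum_i (w^T x_i - y_i)^2 + (lam/2)||w||^2."""
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)                   # maintained as w = (1/(lam*n)) sum_i alpha_i x_i
    sqnorm = (X ** 2).sum(axis=1)     # precompute ||x_i||^2
    for _ in range(T):
        i = rng.integers(n)
        # closed-form dual coordinate step for the square loss (per the slide)
        delta = (y[i] - X[i] @ w - alpha[i]) / (1.0 + sqnorm[i] / (lam * n))
        alpha[i] += delta
        w += delta / (lam * n) * X[i]  # keep the primal w in sync: O(d)
    return w
```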


Summary

alg.                  Pegasos        SAG                      SDCA
strongly convex       yes            yes                      yes
smooth                no             yes                      yes/no
loss                  hinge, abs.    logistic, square, sqh    all left
memory cost           O(d)           O(d + n)                 O(d + n)
computation cost*     O(d)           O(d)                     O(d)
iteration complexity  O(1/(λε))      O(log(1/ε))              O(1/(λε)) non-smooth / O(log(1/ε)) smooth
parameter             no             step size                no
averaging             yes            no need                  no need

Table: sqh: squared hinge loss; abs.: absolute loss; * per-iteration cost


What about `1 regularization?

min_{w∈R^d} (1/n) Σ_{i=1}^n ℓ(w^⊤x_i, y_i) + σ Σ_{g=1}^K ‖w_g‖₁   (Lasso or Group Lasso)

Issue: the regularizer is not strongly convex.


Adding `2 regularization (Shalev-Shwartz & Zhang, 2012)

min_{w∈R^d} (1/n) Σ_{i=1}^n ℓ(w^⊤x_i, y_i) + σ Σ_{g=1}^K ‖w_g‖₁   (Lasso or Group Lasso)

Issue: not strongly convex. Solution: add an ℓ₂² regularization term:

min_{w∈R^d} (1/n) Σ_{i=1}^n ℓ(w^⊤x_i, y_i) + σ Σ_{g=1}^K ‖w_g‖₁ + (λ/2)‖w‖₂²

Setting λ = Θ(ε), SDCA (non-smooth or smooth) gives O(1/ε²) iteration complexity for general convex losses and O(1/ε) for smooth losses.


Other algorithms for `1 regularization

(Stochastic) Proximal Gradient Descent
- Proximal Stochastic Gradient Descent (Langford et al., 2009; Shalev-Shwartz & Tewari, 2009; Duchi & Singer, 2009)
- sparsity can be achieved at each iteration, via the proximal step sketched below
- O(1/ε²) iteration complexity

Stochastic Coordinate Descent (Shalev-Shwartz & Tewari, 2009; Bradley et al., 2011; Richtarik & Takac, 2013)
- needs to compute the full gradient
- O(n/ε) iteration complexity for smooth losses
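The proximal step for the ℓ₁ term is soft-thresholding, which is what produces exact zeros at each iteration. A minimal hedged sketch (the plain Lasso case with the square loss is an assumption made for simplicity; names are illustrative):

```python
import numpy as np

def soft_threshold(v, tau):
    """Prox of tau*||.||_1: shrink each coordinate of v toward 0 by tau."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_sgd_step(w, x_i, y_i, sigma, gamma):
    """One proximal stochastic gradient step for the square loss + sigma*||w||_1."""
    g = 2.0 * (x_i @ w - y_i) * x_i                 # stochastic gradient of the loss part
    return soft_threshold(w - gamma * g, gamma * sigma)
```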


Outline

1. Machine Learning and STochastic OPtimization (STOP)
   - Introduction
   - Motivation
   - Warm-up

2. STOP Algorithms for Big Data Classification and Regression
   - Classification and Regression
   - Algorithms

3. General Strategies for Stochastic Optimization
   - Stochastic Gradient Descent and Accelerated Variants
   - Parallel and Distributed Optimization
   - Other Effective Strategies

4. Implementations and A Distributed Library


Be Back in 5 minutes


General Strategies for Stochastic Optimization

General strategies for STOPSGD and its variants for different objectives

Parallel and Distributed Optimization

Other Effective Strategies

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 54 / 99

Stochastic Gradient Descent

min_{x ∈ X} f(x), with stochastic gradient ∇f(x; ξ), where ξ is a random variable

Basic SGD update:

x_t ← Π_X [x_{t−1} − γ_t ∇f(x_{t−1}; ξ_t)]

where Π_X[x̂] = arg min_{x ∈ X} ‖x − x̂‖_2^2 is the projection onto X

Issue: how to determine the learning rate (step size) γ_t?
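A minimal sketch of the update above, under the illustrative assumptions that X is a Euclidean ball of radius R and that grad_i(x, i) returns a stochastic gradient for one random example:

import numpy as np

def project_ball(x, R):
    # Euclidean projection onto {x : ||x||_2 <= R}
    norm = np.linalg.norm(x)
    return x if norm <= R else x * (R / norm)

def sgd(grad_i, x0, n, T, R, step):
    x = x0.copy()
    for t in range(1, T + 1):
        i = np.random.randint(n)   # draw one random example
        x = project_ball(x - step(t) * grad_i(x, i), R)
    return x

Here step(t) is the step-size schedule; the following slides give the three standard choices.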


Convergence of final solution

Iterative updates: x_t = Π_X [x_{t−1} − γ_t Δ_t]

To have convergence, intuitively γ_t Δ_t → 0.

Convergence of (S)GD

Iterative updates: x_t = Π_X [x_{t−1} − γ_t Δ_t]

GD: x_t = x_{t−1} − γ_t ∇f(x_{t−1}); since ∇f(x_{t−1}) → 0 as x_t → x*, a constant step size suffices.

SGD: x_t = x_{t−1} − γ_t ∇f(x_{t−1}; ξ_t); the stochastic gradient does not vanish at x*, so we need γ_t ∇f(x_{t−1}; ξ_t) → 0, i.e., a decaying step size γ_t.

Three Schemes of Step Size

General convex optimization: γ_t ∝ 1/√t → 0

Strongly convex optimization: γ_t ∝ 1/t → 0

Smooth optimization: γ_t = c (constant), and Δ_t → 0
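The three schedules are one-liners; a sketch (the constants c and lam are tuning assumptions, usable as the step argument of the sgd sketch above):

def step_general_convex(t, c=1.0):
    # gamma_t proportional to 1/sqrt(t)
    return c / t ** 0.5

def step_strongly_convex(t, lam=0.1):
    # gamma_t = 1/(lam*t) for a lam-strongly convex objective
    return 1.0 / (lam * t)

def step_smooth(t, c=0.01):
    # constant step size
    return c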

SGD for General Convex Function

Step size: γ_t = c/√t, where c usually needs to be tuned

Convergence rate of the final solution x_T (Shamir & Zhang, 2013):

E[f(x_T) − f(x*)] ≤ O(DG log T / √T)

where ‖x − y‖_2 ≤ D and ‖∂f(x; ξ)‖_2 ≤ G for all x, y ∈ X

Close to optimal: O(DG/√T)


SGD for Strongly Convex Function

f(x) is λ-strongly convex

Step size: γ_t = 1/(λt)

Convergence rate of x_T (Shamir & Zhang, 2013):

E[f(x_T) − f(x*)] ≤ O(G^2 log T / (λT))

Close to optimal: O(G^2/(λT))

SGD for Smooth Convex Function

A sub-class of general convex functions

SGD with γ_t ∝ 1/√t achieves O(log T/√T)

Gradient descent with γ_t = c achieves O(1/T) (Nesterov, 2004)

Generally SGD cannot bridge this gap (Lan, 2012). A special case where it can, e.g.,

f(x) = (1/n) ∑_{i=1}^n f_i(x),   ∇f(x_t) = (1/n) ∑_{i=1}^n ∇f_i(x_t),   ∇f(x_t; ξ_t) = ∇f_{i_t}(x_t)

The constant step size of GD works because ∇f(x*) = 0.


Accelerated SGD for smooth functions (Johnson & Zhang, 2013; Mahdavi et al., 2013)

Iterate s = 1, . . .
    Iterate t = 1, . . . , m
        x_t^s = x_{t−1}^s − γ (∇f_{i_t}(x_{t−1}^s) − ∇f_{i_t}(x^{s−1}) + ∇f(x^{s−1}))
        (Δ_t = StoGrad − StoGrad + Grad)
    update x^s = x_m^s or x^s = ∑_{t=1}^m x_t^s / m, with m = O(n)

Constant step size, and Δ_t → 0: if x^{s−1} → x*, then ∇f(x^{s−1}) → 0 and ∇f_{i_t}(x_{t−1}^s) − ∇f_{i_t}(x^{s−1}) → 0.

Smooth function: O(1/ε). Smooth & strongly convex function: O(log(1/ε)).
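A minimal sketch of this variance-reduced scheme (often called SVRG) for the squared loss; the data (X, y), the step size gamma, the epoch length m, and the outer count S are illustrative assumptions:

import numpy as np

def svrg(X, y, gamma=0.1, S=10, m=None):
    n, d = X.shape
    m = m or n                     # epoch length m = O(n)
    w_snap = np.zeros(d)           # outer snapshot x^{s-1}
    for s in range(S):
        full_grad = X.T @ (X @ w_snap - y) / n      # full gradient at the snapshot
        w = w_snap.copy()
        for t in range(m):
            i = np.random.randint(n)
            g_w = (X[i] @ w - y[i]) * X[i]          # stochastic gradient at the iterate
            g_snap = (X[i] @ w_snap - y[i]) * X[i]  # stochastic gradient at the snapshot
            w -= gamma * (g_w - g_snap + full_grad) # variance-reduced step
        w_snap = w                                  # option: x^s = x_m^s
    return w_snap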


Averaged Stochastic Gradient Descent

Averaging usually speeds up convergence:

x̄_t = (1 − (1+η)/(t+η)) x̄_{t−1} + ((1+η)/(t+η)) x_t,   η ≥ 0

η = 0 gives simple averaging: x̄_T = (x_1 + . . . + x_T)/T

General convex optimization (Nemirovski et al., 2009): η = 0 ⇒ O(1/√T) vs O(log T/√T)

Strongly convex optimization (Shamir & Zhang, 2013; Zhu, 2013): η > 0 ⇒ O(1/(λT)) vs O(log T/(λT))
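The recursion above is a one-line running average; a sketch (eta is the averaging parameter, and eta = 0 recovers the simple mean over t iterates):

def update_average(x_bar, x_t, t, eta=0.0):
    # running average with polynomial-decay weight (1+eta)/(t+eta)
    alpha = (1.0 + eta) / (t + eta)
    return (1.0 - alpha) * x_bar + alpha * x_t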


General Strategies for Stochastic Optimization

SGD and its variants for different objectives

Parallel and Distributed Optimization

Other Effective Strategies

Parallel and Distributed Optimization

Parallel (shared memory) vs. distributed (memory not shared)

Goal: speed up convergence

Data are distributed over multiple machines; moving them to a single machine suffers from low network bandwidth and limited disk or memory

Benefits from a cluster of machines, or from a multi-core machine / GPU

A simple solution: Average Runs

Run SGD independently on k partitions of the data (on a multi-core machine or a cluster of machines), obtaining w_1, . . . , w_k, and average:

w̄ = (1/k) ∑_{i=1}^k w_i

Issue: not the optimum

Parallel SGD: Average Gradients

Mini-batch SGD with synchronization (multi-core or cluster); see the sketch below

Good: reduced variance, faster convergence
Bad: synchronization is expensive

Solutions:
    asynchronous updates: HogWild!
    fewer synchronizations: DisDCA
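A serial simulation of one synchronized mini-batch step (the per-example stochastic gradient grad_i and the batch size k, one sample per worker, are assumptions):

import numpy as np

def minibatch_step(w, grad_i, n, k, gamma):
    idx = np.random.randint(n, size=k)                # one sample per worker
    g = np.mean([grad_i(w, i) for i in idx], axis=0)  # synchronize: average gradients
    return w - gamma * g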

Lock-free Parallel SGD: HOGWILD! (Niu et al., 2011)

min_x ∑_{e ∈ E} f_e(x_e)

Multi-core with shared-memory access; each e is a small subset of [d]

Examples: sparse SVM, matrix completion, graph cuts

Robust 1/T convergence rate for strongly convex objectives
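A toy sketch in the HOGWILD! spirit: threads update a shared weight vector without locks, each update touching only the sparse support of its example. grad_sparse and the data layout are assumptions, and in Python this only illustrates the control flow (real implementations use native threads):

import threading
import numpy as np

def hogwild(w, examples, grad_sparse, gamma, n_threads=4, steps=1000):
    def worker():
        for _ in range(steps):
            i = np.random.randint(len(examples))
            idx, g = grad_sparse(w, examples[i])  # touched indices + gradient values
            w[idx] -= gamma * g                   # unsynchronized sparse write
    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads: t.start()
    for t in threads: t.join()
    return w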

Distributed SDCA (Yang, 2013)

Naive update on each machine:

Δα_i = arg max −φ*_i(−α_i^t − Δα_i) − (λn/2) ‖w^t + (1/(λn)) Δα_i x_i‖_2^2

Convergence is not guaranteed: data are correlated.

Conservative update with K machines:

Δα_i = arg max −φ*_i(−α_i^t − Δα_i) − (λn/(2K)) ‖w^t + (K/(λn)) Δα_i x_i‖_2^2

Guaranteed convergence, but limited speed-up.


DisDCA: Trading Computation for Communication

Δα_{i_j} = arg max −φ*_{i_j}(−α_{i_j}^t − Δα_{i_j}) − (λn/(2K)) ‖u_j^t + (K/(λn)) Δα_{i_j} x_{i_j}‖_2^2

u_j^{t+1} = u_j^t + (K/(λn)) Δα_{i_j} x_{i_j}
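For the squared loss the arg max above has a closed form; a serial sketch of one local update (the variable names and data layout are illustrative assumptions, not the library's code):

import numpy as np

def disdca_local_step(u, alpha, X, y, lam, n, K, i):
    # closed-form arg max of the local subproblem under squared loss
    denom = 1.0 + K * (X[i] @ X[i]) / (lam * n)
    delta = (y[i] - X[i] @ u - alpha[i]) / denom
    alpha[i] += delta
    u = u + (K / (lam * n)) * delta * X[i]   # the u_j^{t+1} update above
    return u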


DisDCA

Increasing m can lead to nearly linear speed-up; increasing K leads to parallel speed-up.

[Figure (a): log duality gap ε(t, m) vs. iteration t on 1 million synthetic data points for regression, for m = 10, 100, 1000.]
[Figure (b): running time to reach accuracy ε(T), DisDCA vs. Liblinear, on 1 billion synthetic data points for classification; 400 GB, 50×2 processors.]

The distributed library: Birds

General Strategies for Stochastic Optimization

SGD and its variants for different objectives

Parallel and Distributed Optimization

Other Effective Strategies

Factors that affect Iteration Complexity

Property of the function: smoothness

Size of the problem: dimension and number of data points

Domain X: size and geometry

Next: screening for Lasso and Support Vector Machines


Screening for Lasso

Lasso: min_{w ∈ R^d} (1/2) ‖y − Xw‖_2^2 + λ‖w‖_1

y = (y_1, . . . , y_n)⊤ ∈ R^n,  X = (x_1, · · · , x_d) ∈ R^{n×d}

I_0 = {i : w*_i = 0},  I = [d] \ I_0

Reduced Lasso: min_{w_I} (1/2) ‖y − X_I w_I‖_2^2 + λ‖w_I‖_1

Screening rules identify (part of) I_0 before solving: SAFE rule (Ghaoui et al., 2010), DPP (Wang et al., 2012)
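A sketch of the basic SAFE test for the Lasso (Ghaoui et al., 2010): feature j can be safely discarded if |x_j⊤y| < λ − ‖x_j‖_2 ‖y‖_2 (λ_max − λ)/λ_max, with λ_max = max_j |x_j⊤y|. This is written from memory of the paper's statement, so verify the exact constant against the reference:

import numpy as np

def safe_screen(X, y, lam):
    corr = np.abs(X.T @ y)      # |x_j^T y| for every feature
    lam_max = corr.max()
    thresh = lam - np.linalg.norm(X, axis=0) * np.linalg.norm(y) \
                 * (lam_max - lam) / lam_max
    return corr >= thresh       # True = keep feature j, False = safely discard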


Screening for Support Vector Machines

Dual SVM: max_{α ∈ [0,1]^n} (1/n) ∑_{i=1}^n α_i − (λ/2) ‖(1/(λn)) ∑_{i=1}^n α_i y_i x_i‖_2^2

y_i w*⊤x_i < 1 ⇒ α*_i = 1;  y_i w*⊤x_i > 1 ⇒ α*_i = 0

Ball test (Ogawa et al., 2014; Wang et al., 2013)

Factors that affect Iteration Complexity

Property of the function: smoothness

Size of the problem: dimension and number of data points

Domain X: size and geometry

Next: Stochastic Mirror Descent

Reducing G and D

The iteration complexity of SGD depends on D and G, where ‖x − x*‖_2 ≤ D and ‖∇f(x; ξ)‖_2 ≤ G: the complexity is positively correlated with both.

Interpretation of the gradient descent step:

x_t = Π_X [x_{t−1} − γ_t ∇f(x_{t−1}; ξ_t)]
    = arg min_{x ∈ X} f(x_{t−1}) + (x − x_{t−1})⊤∇f(x_{t−1}; ξ_t) + (1/(2γ_t)) ‖x − x_{t−1}‖_2^2
      (linear approximation + distance to the last solution)

(x − x_{t−1})⊤∇f(x_{t−1}; ξ_t) ≤ ‖x − x_{t−1}‖_2 ‖∇f(x_{t−1}; ξ_t)‖_2 ≤ GD

Stochastic Mirror Descent (Nemirovski et al., 2009)

x_t = arg min_{x ∈ X} f(x_{t−1}) + (x − x_{t−1})⊤∇f(x_{t−1}; ξ_t) + (1/γ_t) B(x, x_{t−1})
      (linear approximation + Bregman divergence)

B(x, x_t) = ω(x) − ω(x_t) − ∇ω(x_t)⊤(x − x_t)

B(x, x_t) ≥ (α/2) ‖x − x_t‖^2: ω is strongly convex w.r.t. a general norm

(x − x_{t−1})⊤∇f(x_{t−1}; ξ_t) ≤ ‖x − x_{t−1}‖ ‖∇f(x_{t−1}; ξ_t)‖_*

E[f(x_T)] − f(x*) ≤ O(DG/√T), where B(x, x*) ≤ D and ‖∇f(x; ξ)‖_* ≤ G
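A sketch of a standard instance: mirror descent on the probability simplex with the entropy mirror map ω(x) = ∑_i x_i log x_i, which yields the exponentiated-gradient update (grad and the 1/√t schedule with constant c are assumptions):

import numpy as np

def smd_simplex(grad, x0, T, c=1.0):
    x = x0.copy()                           # x0 must lie in the simplex interior
    for t in range(1, T + 1):
        g = grad(x)                         # stochastic gradient at x
        x = x * np.exp(-(c / t ** 0.5) * g) # mirror step under the entropy map
        x /= x.sum()                        # normalize back onto the simplex
    return x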


Reducing Projections

Recall the factors that affect iteration complexity: smoothness of the function, size of the problem, and the size and geometry of the domain X.

x_t = Π_X [x_{t−1} − γ_t ∇f(x_{t−1}; ξ_t)]

A complex domain X makes the projection expensive, e.g., the PSD cone.

Reducing Projections

Linear optimization over the domain: the Frank-Wolfe algorithm (Jaggi, 2013; Lacoste-Julien et al., 2013; Hazan, 2008)

s_t = arg min_{s ∈ X} ⟨s, ∇f(x_{t−1})⟩ : linear optimization

x_t = (1 − η_t) x_{t−1} + η_t s_t : x_t ∈ X by convexity
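A minimal Frank-Wolfe sketch on the ℓ1 ball {x : ‖x‖_1 ≤ R}, where the linear subproblem has a closed-form solution at a signed coordinate vertex (grad, R, and T are illustrative assumptions):

import numpy as np

def frank_wolfe_l1(grad, d, R, T):
    x = np.zeros(d)
    for t in range(1, T + 1):
        g = grad(x)
        s = np.zeros(d)
        j = np.argmax(np.abs(g))
        s[j] = -R * np.sign(g[j])       # arg min over the l1 ball of <s, g>
        eta = 2.0 / (t + 2.0)           # the standard FW step size
        x = (1 - eta) * x + eta * s     # convex combination stays in X
    return x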

Reducing Projections

Few projections: SGD with only one or log T projections (Mahdavi et al., 2012; Yang & Zhang, 2013)

x ∈ X ⟺ g(x) ≤ 0

Run SGD on the min-max problem min_x max_{λ≥0} f(x) + λ g(x)   (objective + penalized violation of the constraint)

Final projection: x̄_T = Π_X [(1/T) ∑_{t=1}^T x_t]

How about kernel methods?

Linearization + STOP for linear methods:
    the Nystrom method (Drineas & Mahoney, 2005)
    Random Fourier Features (Rahimi & Recht, 2007)

Comparison of the two (Yang et al., 2012):
    the Nystrom method: data-dependent sampling; better approximation error under a large eigen-gap and a power-law eigenvalue distribution
    Random Fourier Features: data-independent sampling
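A sketch of Random Fourier Features for the Gaussian (RBF) kernel k(x, z) = exp(−‖x − z‖^2/(2σ^2)) (Rahimi & Recht, 2007): map z(x) = √(2/D) cos(Wx + b) with W ~ N(0, σ^{−2} I) and b ~ Unif[0, 2π], then train a linear model on the features (D, sigma, and seed are tuning assumptions):

import numpy as np

def rff(X, D=500, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(scale=1.0 / sigma, size=(d, D))   # random frequencies
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)        # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)      # n x D feature matrix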

Outline

1 Machine Learning and STochastic OPtimization (STOP): Introduction; Motivation; Warm-up

2 STOP Algorithms for Big Data Classification and Regression: Classification and Regression; Algorithms

3 General Strategies for Stochastic Optimization: Stochastic Gradient Descent and Accelerated Variants; Parallel and Distributed Optimization; Other Effective Strategies

4 Implementations and A Distributed Library

Implementations and A Distributed Library

Efficient implementations and a practical library:

Efficient averaging

Gradient sparsification

Distributed (parallel) optimization library

Efficient Averaging

Update rules (the signs are written consistently with the lazy updates below):

x_t = (1 − γ_t λ) x_{t−1} − γ_t g_t
x̄_t = (1 − α_t) x̄_{t−1} + α_t x_t

Efficient update when g_t has many zeros, i.e., when the gradient is sparse:

S_t = ( 1 − λγ_t           0      ) S_{t−1},   S_1 = I
      ( α_t(1 − λγ_t)   1 − α_t )

y_t = y_{t−1} − [S_t^{−1}]_{11} γ_t g_t
ȳ_t = ȳ_{t−1} − ([S_t^{−1}]_{21} + [S_t^{−1}]_{22} α_t) γ_t g_t

x_T = [S_T]_{11} y_T
x̄_T = [S_T]_{21} y_T + [S_T]_{22} ȳ_T
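A sketch of the lazy-update trick above: maintain (y_t, ȳ_t) and the 2×2 matrix S_t so each step touches only the nonzero coordinates of g_t, costing O(nnz(g_t)) instead of O(d). It follows the recursion on the slide; the function and argument names are assumptions:

import numpy as np

def averaged_sgd_sparse(grads, d, gamma, alpha, lam):
    # grads yields (indices, values) sparse gradients;
    # gamma(t) and alpha(t) give the step size and averaging weight
    y, ybar = np.zeros(d), np.zeros(d)
    S = np.eye(2)
    for t, (idx, g) in enumerate(grads, start=1):
        gt, at = gamma(t), alpha(t)
        S = np.array([[1 - lam * gt, 0.0],
                      [at * (1 - lam * gt), 1 - at]]) @ S
        Sinv = np.linalg.inv(S)
        y[idx] -= Sinv[0, 0] * gt * g                     # touch only nnz coords
        ybar[idx] -= (Sinv[1, 0] + Sinv[1, 1] * at) * gt * g
    x = S[0, 0] * y                                       # recover the iterate
    xbar = S[1, 0] * y + S[1, 1] * ybar                   # recover the average
    return x, xbar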

Gradient sparsification

Sparsification by importance sampling:

R_{ti} = unif(0, 1)

g̃_{ti} = g_{ti} · 1[|g_{ti}| ≥ ḡ_i] + ḡ_i sign(g_{ti}) · 1[ḡ_i R_{ti} ≤ |g_{ti}| < ḡ_i]

i.e., a coordinate below the threshold ḡ_i is kept with probability |g_{ti}|/ḡ_i and rescaled to ḡ_i.

Unbiased sample: E[g̃_t] = g_t.

Trades a variance increase for efficient computation.

Especially useful for logistic regression.
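A vectorized sketch of this unbiased sparsification (gbar may be a scalar or a per-coordinate array of thresholds; both are assumptions):

import numpy as np

def sparsify(g, gbar, rng=np.random.default_rng()):
    R = rng.uniform(size=g.shape)
    keep_big = np.abs(g) >= gbar                       # large coords kept as-is
    keep_small = (gbar * R <= np.abs(g)) & ~keep_big   # small coords kept w.p. |g|/gbar
    return np.where(keep_big, g,
                    np.where(keep_small, gbar * np.sign(g), 0.0))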

Distributed Optimization Library: Birds

The Birds library implements distributed stochastic dual coordinate ascent (DisDCA) for classification and regression with broad support.

For technical details see:
    "Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent." Tianbao Yang. NIPS 2013.
    "Analysis of Distributed Stochastic Dual Coordinate Ascent." Tianbao Yang et al. Tech report, 2013, arXiv.

The code is distributed under the GNU General Public License (see license.txt for details).

Distributed Optimization Library: Birds

What problems does it solve? Classification and regression.

Losses:
1 hinge loss and squared hinge loss (SVM)
2 logistic loss (logistic regression)
3 least-squares loss (ridge regression)

Regularizers:
1 ℓ2 norm: SVM, logistic regression, ridge regression
2 ℓ1 norm: Lasso; SVM and logistic regression with the ℓ1 norm

Multi-class: one-vs-all

Distributed Optimization Library: Birds

What data does it support? Dense and sparse; txt and binary formats.

What environment does it support?
Prerequisites: the Boost.MPI and Boost.Serialization libraries
Tested on a cluster of Linux machines (up to hundreds of processors)

Thank You!

References

Bradley, Joseph K., Kyrola, Aapo, Bickson, Danny, and Guestrin, Carlos. Parallel coordinate descent for l1-regularized loss minimization. CoRR, 2011.

Drineas, Petros and Mahoney, Michael W. On the Nystrom method for approximating a Gram matrix for improved kernel-based learning. J. Mach. Learn. Res., 6:2153–2175, 2005.

Duchi, John and Singer, Yoram. Efficient online and batch learning using forward backward splitting. J. Mach. Learn. Res., 10:2899–2934, 2009.

Ghaoui, Laurent El, Viallon, Vivian, and Rabbani, Tarek. Safe feature elimination in sparse supervised learning. CoRR, abs/1009.3515, 2010.

Hazan, Elad. Sparse approximate solutions to semidefinite programs. In LATIN, pp. 306–316, 2008.

Hsieh, Cho-Jui, Chang, Kai-Wei, Lin, Chih-Jen, Keerthi, S. Sathiya, and Sundararajan, S. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, pp. 408–415, 2008.

Jaggi, Martin. Revisiting Frank-Wolfe: projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, 2013.

Johnson, Rie and Zhang, Tong. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pp. 315–323, 2013.

Lacoste-Julien, Simon, Jaggi, Martin, Schmidt, Mark W., and Pletscher, Patrick. Block-coordinate Frank-Wolfe optimization for structural SVMs. In ICML (1), volume 28, pp. 53–61, 2013.

Lan, Guanghui. An optimal method for stochastic composite optimization. Math. Program., 133(1-2):365–397, 2012.

Langford, John, Li, Lihong, and Zhang, Tong. Sparse online learning via truncated gradient. J. Mach. Learn. Res., 10:777–801, June 2009.

Mahdavi, Mehrdad, Yang, Tianbao, Jin, Rong, Zhu, Shenghuo, and Yi, Jinfeng. Stochastic gradient descent with only one projection. In NIPS, pp. 503–511, 2012.

Mahdavi, Mehrdad, Zhang, Lijun, and Jin, Rong. Mixed optimization for smooth functions. In NIPS, pp. 674–682, 2013.

Nemirovski, A. and Yudin, D. On Cezari's convergence of the steepest descent method for approximating saddle points of convex-concave functions. Soviet Math. Dokl., 19:341–362, 1978.

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM J. on Optimization, pp. 1574–1609, 2009.

Nesterov, Yurii. Introductory Lectures on Convex Optimization: A Basic Course (Applied Optimization). Springer Netherlands, 2004.

Niu, Feng, Recht, Benjamin, Re, Christopher, and Wright, Stephen J. Hogwild!: a lock-free approach to parallelizing stochastic gradient descent. CoRR, 2011.

Ogawa, Kohei, Suzuki, Yoshiki, Suzumura, Shinya, and Takeuchi, Ichiro. Safe sample screening for support vector machines. CoRR, 2014.

Rahimi, Ali and Recht, Benjamin. Random features for large-scale kernel machines. In NIPS, 2007.

Richtarik, Peter and Takac, Martin. Distributed coordinate descent method for learning with big data. CoRR, abs/1310.2059, 2013.

Roux, Nicolas Le, Schmidt, Mark, and Bach, Francis. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. CoRR, 2012.

Shalev-Shwartz, Shai and Tewari, Ambuj. Stochastic methods for l1 regularized loss minimization. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pp. 929–936, 2009.

Shalev-Shwartz, Shai and Zhang, Tong. Proximal stochastic dual coordinate ascent. CoRR, abs/1211.2717, 2012.

Shalev-Shwartz, Shai and Zhang, Tong. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research, 14:567–599, 2013.

Shalev-Shwartz, Shai, Singer, Yoram, and Srebro, Nathan. Pegasos: primal estimated sub-gradient solver for SVM. In Proceedings of the 24th International Conference on Machine Learning, pp. 807–814, 2007.

Shalev-Shwartz, Shai, Singer, Yoram, Srebro, Nathan, and Cotter, Andrew. Pegasos: primal estimated sub-gradient solver for SVM. Math. Program., 127(1):3–30, 2011.

Shamir, Ohad and Zhang, Tong. Stochastic gradient descent for non-smooth optimization: convergence results and optimal averaging schemes. In ICML (1), pp. 71–79, 2013.

Wang, Jie, Lin, Binbin, Gong, Pinghua, Wonka, Peter, and Ye, Jieping. Lasso screening rules via dual polytope projection. CoRR, abs/1211.3966, 2012.

Wang, Jie, Wonka, Peter, and Ye, Jieping. Scaling SVM and least absolute deviations via exact data reduction. CoRR, abs/1310.7048, 2013. URL http://dblp.uni-trier.de/db/journals/corr/corr1310.html#WangWY13.

Yang, Tianbao. Trading computation for communication: distributed stochastic dual coordinate ascent. In NIPS, 2013.

Yang, Tianbao and Zhang, Lijun. Efficient stochastic gradient descent for strongly convex optimization. CoRR, abs/1304.5504, 2013.

Yang, Tianbao, Li, Yu-Feng, Mahdavi, Mehrdad, Jin, Rong, and Zhou, Zhi-Hua. Nystrom method vs random Fourier features: a theoretical and empirical comparison. In NIPS, pp. 485–493, 2012.

Zhu, Shenghuo. Stochastic gradient descent algorithms for strongly convex functions at O(1/T) convergence rates. CoRR, abs/1305.2218, 2013.