Page 1: Shai Shalev-Shwartz and Tong Zhang

Proximal Stochastic Dual Coordinate Ascent

Shai Shalev-Shwartz and Tong Zhang

Statistics Department, Rutgers University

Shalev-Shwartz & Zhang (Rutgers) SDCA 1 / 31

Page 2: Shai Shalev-Shwartz and Tong Zhang

Motivation: regularized loss minimization

Assume we want to solve the Lasso problem:

min_w [ (1/n) ∑_{i=1}^n (w^⊤ x_i − y_i)² + λ ‖w‖_1 ]

or the ridge regression problem:

min_w [ (1/n) ∑_{i=1}^n (w^⊤ x_i − y_i)² + (λ/2) ‖w‖_2² ]

(the first term is the loss, the second the regularization)

Our goal: solve regularized loss minimization problems as fast as we can.

The problem is deterministic optimization.

But a good solution leads to a stochastic algorithm called Proximal Stochastic Dual Coordinate Ascent (Prox-SDCA).

We show: fast convergence of SDCA for many regularized loss minimization problems in machine learning.

Shalev-Shwartz & Zhang (Rutgers) SDCA 2 / 31

Page 5: Shai Shalev-Shwartz and Tong Zhang

Outline

Loss Minimization with L2 Regularization

dual formulation
Dual Coordinate Ascent (DCA) and Stochastic Gradient Descent
fast convergence properties of SDCA
the importance of randomization

General regularization

duality
Prox-SDCA algorithm
fast convergence and comparison to other methods

High-level proof ideas

Shalev-Shwartz & Zhang (Rutgers) SDCA 3 / 31

Page 6: Shai Shalev-Shwartz and Tong Zhang

Loss Minimization with L2 Regularization

min_w P(w) := [ (1/n) ∑_{i=1}^n φ_i(w^⊤ x_i) + (λ/2) ‖w‖² ].

Examples:

                          φ_i(z)                    Lipschitz   smooth
SVM                       max{0, 1 − y_i z}             ✓          ✗
Logistic regression       log(1 + exp(−y_i z))          ✓          ✓
Abs-loss regression       |z − y_i|                     ✓          ✗
Square-loss regression    (z − y_i)²                    ✗          ✓

Shalev-Shwartz & Zhang (Rutgers) SDCA 4 / 31
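
To make the table concrete, here is a minimal Python sketch of the four example losses for a single prediction z = w^⊤ x_i and label y_i (the function names and the use of NumPy are my own choices, not from the slides); the comments repeat the Lipschitz/smooth flags from the table.

```python
import numpy as np

def hinge(z, y):
    """SVM hinge loss: Lipschitz, but not smooth (kink at y*z = 1)."""
    return max(0.0, 1.0 - y * z)

def logistic(z, y):
    """Logistic loss: Lipschitz and smooth."""
    return np.log1p(np.exp(-y * z))

def abs_loss(z, y):
    """Absolute loss for regression: Lipschitz, but not smooth (kink at z = y)."""
    return abs(z - y)

def square_loss(z, y):
    """Squared loss for regression: smooth, but not Lipschitz (gradient grows with |z - y|)."""
    return (z - y) ** 2
```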

Page 8: Shai Shalev-Shwartz and Tong Zhang

Dual Formulation

Primal problem:

w* = argmin_w P(w) := [ (1/n) ∑_{i=1}^n φ_i(w^⊤ x_i) + (λ/2) ‖w‖² ]

Dual problem:

α* = argmax_{α ∈ ℝⁿ} D(α) := (1/n) ∑_{i=1}^n −φ*_i(−α_i) − (λ/2) ‖ (1/(λn)) ∑_{i=1}^n α_i x_i ‖²,

and the convex conjugate (dual) is defined as:

φ*_i(a) = sup_z (a z − φ_i(z)).

Shalev-Shwartz & Zhang (Rutgers) SDCA 5 / 31

Page 9: Shai Shalev-Shwartz and Tong Zhang

Relationship of Primal and Dual Solutions

Weak duality: P(w) ≥ D(α) for all w and α.
Strong duality: P(w*) = D(α*), with the relationship

w* = (1/(λn)) ∑_{i=1}^n α*_i x_i,    α*_i = −φ'_i(w*^⊤ x_i).

Duality gap: for any w and α,

P(w) − D(α) ≥ P(w) − P(w*),

i.e. the duality gap upper-bounds the primal sub-optimality.

Shalev-Shwartz & Zhang (Rutgers) SDCA 6 / 31

Page 11: Shai Shalev-Shwartz and Tong Zhang

Example: Linear Support Vector Machine

Primal formulation:

P(w) = (1/n) ∑_{i=1}^n max(0, 1 − y_i w^⊤ x_i) + (λ/2) ‖w‖_2²

Dual formulation:

D(α) = (1/n) ∑_{i=1}^n α_i y_i − (1/(2λn²)) ‖ ∑_{i=1}^n α_i x_i ‖_2²,    α_i y_i ∈ [0, 1].

Relationship:

w* = (1/(λn)) ∑_{i=1}^n α*_i x_i

Shalev-Shwartz & Zhang (Rutgers) SDCA 7 / 31
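
As an aside, the primal, dual, and duality gap for this SVM are easy to evaluate; the sketch below (my own code and naming, not from the slides) uses the convention above, α_i y_i ∈ [0, 1] and w = (1/(λn)) ∑_i α_i x_i, so the gap can later serve as a stopping criterion.

```python
import numpy as np

def svm_primal(w, X, y, lam):
    """P(w) = (1/n) sum_i max(0, 1 - y_i w^T x_i) + (lam/2) ||w||^2."""
    margins = 1.0 - y * (X @ w)
    return np.maximum(margins, 0.0).mean() + 0.5 * lam * (w @ w)

def svm_dual(alpha, X, y, lam):
    """D(alpha) = (1/n) sum_i alpha_i y_i - (1/(2 lam n^2)) ||sum_i alpha_i x_i||^2,
    assuming the feasibility constraint alpha_i y_i in [0, 1]."""
    n = len(y)
    v = X.T @ alpha                      # sum_i alpha_i x_i
    return (alpha * y).mean() - (v @ v) / (2.0 * lam * n * n)

def duality_gap(alpha, X, y, lam):
    """P(w) - D(alpha) for the primal point w induced by alpha; by weak duality
    this upper-bounds the primal sub-optimality P(w) - P(w*)."""
    n = len(y)
    w = (X.T @ alpha) / (lam * n)
    return svm_primal(w, X, y, lam) - svm_dual(alpha, X, y, lam)
```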

Page 12: Shai Shalev-Shwartz and Tong Zhang

Dual Coordinate Ascent (DCA)

Solve the dual problem using coordinate ascent

maxα∈Rn

D(α),

and keep the corresponding primal solution using the relationship

w = (1/(λn)) ∑_{i=1}^n α_i x_i.

DCA: At each iteration, optimize D(α) w.r.t. a single coordinate, while the rest of the coordinates are kept intact.

Stochastic Dual Coordinate Ascent (SDCA): choose the updated coordinate uniformly at random.

SMO (Platt) and Liblinear (Hsieh et al.), among others, implement DCA.

Shalev-Shwartz & Zhang (Rutgers) SDCA 8 / 31

Page 14: Shai Shalev-Shwartz and Tong Zhang

SDCA vs. SGD — update rule

Stochastic Gradient Descent (SGD) update rule:

w^(t+1) = (1 − 1/t) w^(t) − φ'_i(w^(t)⊤ x_i) x_i / (λt)

SDCA update rule:

1. Δ_i = argmax_{Δ ∈ ℝ} D(α^(t) + Δ e_i)

2. w^(t+1) = w^(t) + Δ_i x_i / (λn)

Rather similar update rules.

SDCA has several advantages:

Stopping criterion: stop when the duality gap is smaller than a given value
No need to tune a learning rate

Shalev-Shwartz & Zhang (Rutgers) SDCA 9 / 31

Page 15: Shai Shalev-Shwartz and Tong Zhang

SDCA vs. SGD — update rule — Example

SVM with the hinge loss: φ_i(w) = max{0, 1 − y_i w^⊤ x_i}

SGD update rule:

w^(t+1) = (1 − 1/t) w^(t) + 1[y_i x_i^⊤ w^(t) < 1] y_i x_i / (λt)

SDCA update rule:

1. Δ_i = y_i max(0, min(1, (1 − y_i x_i^⊤ w^(t−1)) / (‖x_i‖_2² / (λn)) + y_i α_i^(t−1))) − α_i^(t−1)

2. α^(t) = α^(t−1) + Δ_i e_i

3. w^(t) = w^(t−1) + Δ_i x_i / (λn)

Shalev-Shwartz & Zhang (Rutgers) SDCA 10 / 31
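
A minimal sketch of one SDCA pass for the hinge-loss SVM, using the closed-form Δ_i from this slide (the function and variable names are mine; a practical implementation would exploit sparsity of x_i and precompute ‖x_i‖²):

```python
import numpy as np

def sdca_epoch(X, y, lam, alpha, w, rng):
    """One stochastic pass of SDCA for the hinge-loss SVM.

    Maintains the primal-dual link w = (1/(lam*n)) * sum_i alpha_i x_i,
    with the feasibility constraint alpha_i * y_i in [0, 1].
    """
    n = X.shape[0]
    for _ in range(n):
        i = rng.integers(n)                      # pick a coordinate uniformly at random
        xi, yi = X[i], y[i]
        sq = xi @ xi
        if sq == 0.0:                            # skip all-zero examples
            continue
        # closed-form single-coordinate maximizer from the slide
        delta = yi * max(0.0, min(1.0,
                    (1.0 - yi * (xi @ w)) / (sq / (lam * n)) + yi * alpha[i])) - alpha[i]
        alpha[i] += delta
        w += delta * xi / (lam * n)
    return alpha, w
```

One would start from α = 0, w = 0 and repeat epochs until the duality gap (as in the earlier sketch) drops below a tolerance; this is the stopping criterion that SGD does not offer.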

Page 16: Shai Shalev-Shwartz and Tong Zhang

SDCA vs. SGD — experimental observations

On the CCAT dataset, λ = 10⁻⁶, smoothed loss:

[Figure: duality gap (log scale, 10⁻⁶ to 10⁰) vs. number of epochs for SDCA, SDCA-Perm, and SGD.]

The convergence of SDCA is shockingly fast! How to explain this?

Shalev-Shwartz & Zhang (Rutgers) SDCA 11 / 31

Page 18: Shai Shalev-Shwartz and Tong Zhang

SDCA vs. SGD — experimental observations

On the CCAT dataset, λ = 10⁻⁵, hinge loss:

[Figure: duality gap (log scale, 10⁻⁶ to 10⁰) vs. number of epochs for SDCA, SDCA-Perm, and SGD.]

How to understand the convergence behavior?

Shalev-Shwartz & Zhang (Rutgers) SDCA 12 / 31

Page 19: Shai Shalev-Shwartz and Tong Zhang

SDCA vs. SGD — Current analysis is unsatisfactory

How many iterations are required to guarantee P(w^(t)) ≤ P(w*) + ε?

For SGD: O(1/(λε))

For SDCA:

Hsieh et al. (ICML 2008), following Luo and Tseng (1992): O((1/ν) log(1/ε)), but ν can be arbitrarily small

Shalev-Shwartz and Tewari (2009), Nesterov (2010): O(n/ε) for general n-dimensional coordinate ascent
Can apply it to the dual problem
The resulting rate is slower than SGD
And the analysis does not hold for logistic regression (it requires a smooth dual)

The analysis is for dual sub-optimality

What we need: duality gap and primal sub-optimality

Shalev-Shwartz & Zhang (Rutgers) SDCA 13 / 31

Page 21: Shai Shalev-Shwartz and Tong Zhang

Dual vs. Primal sub-optimality

Good dual sub-optimality does not imply good primal sub-optimality!

Take data which is linearly separable using a vector w_0

Set λ = 2ε/‖w_0‖² and use the hinge loss

P(w*) ≤ P(w_0) = ε

Take the dual solution α = 0 and the corresponding primal solution w^(0) = 0

D(0) = 0  ⇒  D(α*) − D(0) = P(w*) − D(0) ≤ ε

P(w^(0)) − P(w*) = 1 − P(w*) ≥ 1 − ε

Conclusion: it is important to study the convergence of duality gap.

Shalev-Shwartz & Zhang (Rutgers) SDCA 14 / 31

Page 23: Shai Shalev-Shwartz and Tong Zhang

Our Results: to achieve ε accuracy

For (1/γ)-smooth loss:

O((n + 1/(γλ)) log(1/ε))

For L-Lipschitz loss:

O(n + L²/(λε))

For "almost smooth" loss functions (e.g. the hinge loss):

O(n + L/(λ (ε/L)^{1/(1+ν)}))

where ν > 0 is a data-dependent quantity

Shalev-Shwartz & Zhang (Rutgers) SDCA 15 / 31

Page 24: Shai Shalev-Shwartz and Tong Zhang

Compare to Batch Gradient Descent Algorithm

Number of examples that need to be processed to achieve ε accuracy:

(1/γ)-smooth loss:
Batch GD: O(n · (1/(γλ)) log(1/ε))
SDCA: O((n + 1/(γλ)) log(1/ε))

L-Lipschitz loss:
Batch GD: O(n · L²/(λε))
SDCA: O(n + L²/(λε))

The gain of SDCA over batch algorithm is significant when n is large.

Shalev-Shwartz & Zhang (Rutgers) SDCA 16 / 31

Page 26: Shai Shalev-Shwartz and Tong Zhang

SDCA vs. DCA — Randomization is Crucial!

On the CCAT dataset, λ = 10⁻⁴, smoothed hinge loss:

[Figure: duality gap (log scale, 10⁻⁶ to 10⁰) vs. number of epochs for SDCA, DCA-Cyclic, SDCA-Perm, and the theoretical bound.]

Randomization is crucial!

In particular, the bound of Luo and Tseng holds for the cyclic order, hence it must be inferior to our bound

Shalev-Shwartz & Zhang (Rutgers) SDCA 17 / 31

Page 28: Shai Shalev-Shwartz and Tong Zhang

Smoothing the hinge-loss

φ(x) =
  0                    if x > 1
  1 − x − γ/2          if x < 1 − γ
  (1/(2γ)) (1 − x)²    otherwise

Shalev-Shwartz & Zhang (Rutgers) SDCA 18 / 31
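
A small sketch (mine, not from the slides) of the smoothed hinge loss defined above, together with its derivative; the derivative changes by at most 1/γ per unit of x, which is exactly the (1/γ)-smoothness used later (γ > 0 is assumed):

```python
def smoothed_hinge(x, gamma):
    """Smoothed hinge loss: 0 for x > 1, linear for x < 1 - gamma,
    and a quadratic of width gamma in between."""
    if x > 1.0:
        return 0.0
    if x < 1.0 - gamma:
        return 1.0 - x - gamma / 2.0
    return (1.0 - x) ** 2 / (2.0 * gamma)

def smoothed_hinge_grad(x, gamma):
    """Derivative of the smoothed hinge loss; its slope is bounded by 1/gamma."""
    if x > 1.0:
        return 0.0
    if x < 1.0 - gamma:
        return -1.0
    return -(1.0 - x) / gamma
```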

Page 29: Shai Shalev-Shwartz and Tong Zhang

Smoothing the hinge-loss

Mild effect on 0-1 error

[Figure: 0-1 test error as a function of the smoothing parameter γ ∈ [0, 1] on astro-ph, CCAT, and cov1; the error varies only mildly with γ.]

Shalev-Shwartz & Zhang (Rutgers) SDCA 19 / 31

Page 30: Shai Shalev-Shwartz and Tong Zhang

Smoothing the hinge-loss

Improves training time

[Figure: duality gap (log scale) vs. runtime on astro-ph, CCAT, and cov1 for smoothing parameters γ ∈ {0, 0.01, 0.1, 1}.]

Duality gap as a function of runtime for different smoothing parameters

Shalev-Shwartz & Zhang (Rutgers) SDCA 20 / 31

Page 31: Shai Shalev-Shwartz and Tong Zhang

Additional related work

Collins et al. (2008): for smooth loss, a bound similar to ours, but for a more complicated algorithm (Exponentiated Gradient on the dual)

Lacoste-Julien, Jaggi, Schmidt, Pletscher (preprint on arXiv):

Study the Frank-Wolfe algorithm for the dual of structured prediction problems.
Boils down to SDCA for the case of the binary hinge loss.
Same bound as ours for the Lipschitz case.

Le Roux, Schmidt, Bach (NIPS 2012): a variant of SGD for smooth loss and a finite sample. Also obtains a log(1/ε) rate.

Shalev-Shwartz & Zhang (Rutgers) SDCA 21 / 31

Page 32: Shai Shalev-Shwartz and Tong Zhang

Proximal SDCA for General Regularizer

Want to solve:

min_w P(w) := [ (1/n) ∑_{i=1}^n φ_i(X_i^⊤ w) + λ g(w) ],

where the X_i are matrices and g(·) is strongly convex.

Examples:

Multi-class logistic loss:

φ_i(X_i^⊤ w) = ln ∑_{ℓ=1}^K exp(w^⊤ X_{i,ℓ}) − w^⊤ X_{i,y_i}

L1-L2 regularization:

g(w) = (1/2) ‖w‖_2² + (σ/λ) ‖w‖_1

Shalev-Shwartz & Zhang (Rutgers) SDCA 22 / 31

Page 33: Shai Shalev-Shwartz and Tong Zhang

Dual Formulation

Primal:

min_w P(w) := [ (1/n) ∑_{i=1}^n φ_i(X_i^⊤ w) + λ g(w) ],

Dual:

max_α D(α) := [ (1/n) ∑_{i=1}^n −φ*_i(−α_i) − λ g*( (1/(λn)) ∑_{i=1}^n X_i α_i ) ]

with the relationship

w = ∇g*( (1/(λn)) ∑_{i=1}^n X_i α_i ).

Prox-SDCA: an extension of SDCA to an arbitrary strongly convex g(w).

Shalev-Shwartz & Zhang (Rutgers) SDCA 23 / 31

Page 34: Shai Shalev-Shwartz and Tong Zhang

Prox-SDCA

Dual:

max_α D(α) := [ (1/n) ∑_{i=1}^n −φ*_i(−α_i) − λ g*(v) ],    v = (1/(λn)) ∑_{i=1}^n X_i α_i.

Assume g(w) is strongly convex w.r.t. a norm ‖·‖_P with dual norm ‖·‖_D.

For each α, and the corresponding v and w, define the prox-dual

D_α(Δα) = (1/n) ∑_{i=1}^n −φ*_i(−(α_i + Δα_i))
          − λ [ g*(v) + ∇g*(v)^⊤ (1/(λn)) ∑_{i=1}^n X_i Δα_i + (1/2) ‖ (1/(λn)) ∑_{i=1}^n X_i Δα_i ‖_D² ],

where the bracketed term is an upper bound on g*(·).

Prox-SDCA: randomly pick i and update ∆αi by maximizing Dα(·).

Shalev-Shwartz & Zhang (Rutgers) SDCA 24 / 31

Page 37: Shai Shalev-Shwartz and Tong Zhang

Example: L1 − L2 Regularized Logistic Regression

Primal:

P (w) =1

n

n∑i=1

ln(1 + e−w>XiYi)︸ ︷︷ ︸

φi(w)

2w>w + σ‖w‖1︸ ︷︷ ︸

λg(w)

.

Dual: with αiYi ∈ [0, 1]

D(α) =1

n

n∑i=1

−αiYi ln(αiYi)− (1− αiYi) ln(1− αiYi)︸ ︷︷ ︸φ∗i (−αi)

−λ2‖trunc(v, σ/λ)‖22

s.t. v =1

λn

n∑i=1

αiXi; w = trunc(v, σ/λ)

where

trunc(u, δ)j =

uj − δ if uj > δ

0 if |uj | ≤ δuj + δ if uj < −δ

Shalev-Shwartz & Zhang (Rutgers) SDCA 25 / 31
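
The trunc operator above is coordinate-wise soft-thresholding; a minimal NumPy version (my own naming) is:

```python
import numpy as np

def trunc(u, delta):
    """Coordinate-wise soft-thresholding: shrink each u_j toward 0 by delta
    and zero out the coordinates with |u_j| <= delta."""
    return np.sign(u) * np.maximum(np.abs(u) - delta, 0.0)
```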

Page 38: Shai Shalev-Shwartz and Tong Zhang

Proximal-SDCA for L1-L2 Regularization

Algorithm:

Keep the dual variables α and v = (1/(λn)) ∑_i α_i X_i

Randomly pick i

Find Δ_i by approximately maximizing:

−φ*_i(α_i + Δ_i) − trunc(v, σ/λ)^⊤ X_i Δ_i − (1/(2λn)) ‖X_i‖_2² Δ_i²,

where φ*_i(α_i + Δ) = (α_i + Δ) Y_i ln((α_i + Δ) Y_i) + (1 − (α_i + Δ) Y_i) ln(1 − (α_i + Δ) Y_i)

Update: α ← α + Δ_i e_i,   v ← v + (1/(λn)) Δ_i X_i.

Let w = trunc(v, σ/λ).

Closely related to Lin Xiao (2010): Dual Averaging Method for Regularized Stochastic Learning and Online Optimization

Shalev-Shwartz & Zhang (Rutgers) SDCA 26 / 31
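
Putting the pieces together, here is a minimal sketch of this algorithm (my own code; the inner maximization over Δ_i is done by a crude grid search for clarity, since the slide only asks for an approximate maximizer, whereas a practical implementation would use a few Newton steps):

```python
import numpy as np

def prox_sdca_l1_l2_logistic(X, Y, lam, sigma, epochs=10, seed=0):
    """Sketch of Prox-SDCA for L1-L2 regularized logistic regression.

    X is (n, d), Y has entries in {-1, +1}; lam is the L2 weight and sigma the L1 weight.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros(n)                              # dual variables, alpha_i Y_i in [0, 1]
    v = np.zeros(d)                                  # v = (1/(lam n)) sum_i alpha_i X_i

    def trunc(u, delta):                             # soft-thresholding (previous slide)
        return np.sign(u) * np.maximum(np.abs(u) - delta, 0.0)

    def phi_star(a, y):                              # conjugate term from this slide
        p = a * y
        if p < 1e-12 or p > 1.0 - 1e-12:             # endpoints: p*log(p) -> 0
            return 0.0
        return p * np.log(p) + (1.0 - p) * np.log(1.0 - p)

    for _ in range(epochs * n):
        i = rng.integers(n)
        xi, yi = X[i], Y[i]
        w = trunc(v, sigma / lam)
        sq = xi @ xi
        # candidate Delta values keeping (alpha_i + Delta) * Y_i inside [0, 1]
        deltas = yi * np.linspace(0.0, 1.0, 101) - alpha[i]
        scores = [-phi_star(alpha[i] + D, yi) - (w @ xi) * D - sq * D * D / (2.0 * lam * n)
                  for D in deltas]
        best = deltas[int(np.argmax(scores))]
        alpha[i] += best
        v += best * xi / (lam * n)

    return trunc(v, sigma / lam)                     # primal solution w
```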

Page 40: Shai Shalev-Shwartz and Tong Zhang

Convergence rate

The same as for the non-proximal version of SDCA: number of iterations needed to achieve ε accuracy

For (1/γ)-smooth loss:

O((n + 1/(γλ)) log(1/ε))

For L-Lipschitz loss:

O(n + L²/(λε))

Asymptotically faster rate for "almost smooth" loss functions (e.g. the hinge loss)

Shalev-Shwartz & Zhang (Rutgers) SDCA 27 / 31

Page 41: Shai Shalev-Shwartz and Tong Zhang

Solving L1 with Smooth Loss

Assume we want to solve L1 regularization to accuracy ε with smooth φi:

(1/n) ∑_{i=1}^n φ_i(w) + σ ‖w‖_1.

Apply Prox-SDCA with the extra term (λ/2) ‖w‖_2², where λ = O(ε):

number of iterations needed is O(n+ 1/ε).

Compare to Dual Averaging SGD (Xiao):

number of iterations needed is O(1/ε2).

Compare to batch accelerated proximal gradient (Nesterov):

number of iterations needed is O(n/√ε).

Prox-SDCA wins in the statistically interesting regime: ε > Ω(1/n²)

Shalev-Shwartz & Zhang (Rutgers) SDCA 28 / 31
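
As an illustration of this reduction (using the hypothetical prox_sdca_l1_l2_logistic sketch given earlier, with the logistic loss playing the role of the smooth φ_i, and synthetic data standing in for a real problem):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 20))
Y = np.where(X @ rng.standard_normal(20) > 0, 1.0, -1.0)

# To solve (1/n) sum_i phi_i(w) + sigma ||w||_1 to accuracy eps, add the small
# L2 term (lam/2) ||w||_2^2 with lam = O(eps) and run Prox-SDCA on the L1-L2 problem.
eps = 1e-3
w = prox_sdca_l1_l2_logistic(X, Y, lam=eps, sigma=0.01, epochs=50)
```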

Page 44: Shai Shalev-Shwartz and Tong Zhang

Analysis of SDCA: Highlevel Idea

Main lemma: for any t and s ∈ [0, 1],

E[D(α^(t)) − D(α^(t−1))]  ≥  (s/n) E[P(w^(t−1)) − D(α^(t−1))]  −  (s/n)² G^(t)

(the left-hand side is the dual sub-optimality improvement; the expectation on the right is the duality gap)

Improvement of the dual can be estimated from the duality gap

G^(t) = O(1) for Lipschitz losses:

E[D(α^(t)) − D(α^(t−1))] ≥ (s/n) E[P(w^(t−1)) − D(α^(t−1))] − (A/λ) (s/n)²

With an appropriate s, G^(t) ≤ 0 for smooth losses:

E[D(α^(t)) − D(α^(t−1))] ≥ (s/n) E[P(w^(t−1)) − D(α^(t−1))]

Shalev-Shwartz & Zhang (Rutgers) SDCA 29 / 31

Page 46: Shai Shalev-Shwartz and Tong Zhang

Proof Idea: smooth loss

Main lemma: for any t and s ∈ [0, 1],

E[D(α^(t)) − D(α^(t−1))] ≥ (s/n) E[P(w^(t−1)) − D(α^(t−1))]

Bounding dual sub-optimality: since P(w^(t−1)) ≥ D(α*) by weak duality, the above lemma yields

E[D(α^(t)) − D(α^(t−1))] ≥ (s/n) E[D(α*) − D(α^(t−1))],

which implies linear convergence of the dual sub-optimality

Bounding the duality gap: summing the inequality over iterations T_0 + 1, . . . , T and choosing a random t ∈ {T_0 + 1, . . . , T} yields

E[P(w^(t−1)) − D(α^(t−1))] ≤ (n/(s(T − T_0))) E[D(α^(T)) − D(α^(T_0))]

Shalev-Shwartz & Zhang (Rutgers) SDCA 30 / 31

Page 47: Shai Shalev-Shwartz and Tong Zhang

Summary

Prox-SDCA algorithm:

solves loss minimization problems with regularization such as L1 or L2

Works very well in practice

it is important to use SDCA, which is superior to cyclic DCA: one cannot just randomize the order once and then apply cyclic DCA

Our analysis shows that SDCA is superior to traditional methods in many interesting scenarios

What we learn:

the goal is to solve a deterministic optimization problem, but a good solution leads to a truly stochastic algorithm

Final question: is there a deterministic algorithm with similarly fast convergence properties?

Shalev-Shwartz & Zhang (Rutgers) SDCA 31 / 31
