TRANSCRIPT
Dykstra's Algorithm, ADMM, and Coordinate Descent: Connections, Insights, and Extensions
Ryan J. Tibshirani, Statistics Department and Department of Machine Learning, Carnegie Mellon University
Presented by Chenyang Tao
Dec 15, 2017
Outline
1. Overview
2. Dykstra’s Algorithm, ADMM and Coordinate Descent
3. Connections, Insights and Applications to Lasso
4. Nonquadratic Loss Generalizations and Future Directions
Highlights
∎ Establish the equivalence between Dykstra's algorithm (DA) and block coordinate descent (CD)
∎ Show the connections between DA and ADMM
∎ Improve CD's convergence results using theory from DA
∎ Parallelize CD using its DA and ADMM counterparts
∎ Generalize to problems with nonquadratic loss
Two convex optimization problems
Best approximation problem
$$\min_{u \in \mathbb{R}^n} \; \|y - u\|_2^2 \quad \text{subject to} \quad u \in C_1 \cap \cdots \cap C_d \tag{1}$$
∎ $C_1, \dots, C_d \subseteq \mathbb{R}^n$ are convex sets, and $y \in \mathbb{R}^n$

Regularized regression problem
$$\min_{w \in \mathbb{R}^p} \; \frac{1}{2}\|y - Xw\|_2^2 + \sum_{i=1}^{d} h_i(w_i) \tag{2}$$
∎ $y \in \mathbb{R}^n$ and $X \in \mathbb{R}^{n \times p}$
∎ $X_i \in \mathbb{R}^{n \times p_i}$, $i = 1, \dots, d$, is a block decomposition of $X$
∎ $h_i : \mathbb{R}^{p_i} \to \mathbb{R}$ are convex regularizers
Recall the best approximation problem (1) and the regularized regression problem (2).
A peek at the main result
∎ For particular choices of $C_1, \dots, C_d$ and $h_1, \dots, h_d$, problems (1) and (2) are duals of each other
∎ ADMM matches Dykstra's algorithm when $d = 2$ and $C_1$ is a linear subspace
Coordinate Descent
Recall the regularized regression problem (2).
∎ Initialize, say, $w^{(0)} = 0$, and repeat, for $k = 1, 2, 3, \dots$:
$$w_i^{(k)} = \operatorname*{argmin}_{w_i \in \mathbb{R}^{p_i}} \; \frac{1}{2} \Big\| y - \sum_{j<i} X_j w_j^{(k)} - X_i w_i - \sum_{j>i} X_j w_j^{(k-1)} \Big\|_2^2 + h_i(w_i), \quad i = 1, \dots, d$$
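To make the block updates concrete, here is a minimal sketch of this scheme for the lasso (1-dimensional blocks, $h_i(w_i) = \lambda |w_i|$), where each block minimization reduces to soft-thresholding. The function names are illustrative, not from the paper.

```python
import numpy as np

def soft_threshold(a, t):
    """Prox of t*|.|: shrink a toward zero by t."""
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def cd_lasso(X, y, lam, n_iter=100):
    """Cyclic coordinate descent for 0.5*||y - Xw||^2 + lam*||w||_1."""
    n, p = X.shape
    w = np.zeros(p)
    r = y - X @ w                    # residual, maintained incrementally
    col_sq = np.sum(X**2, axis=0)    # ||X_i||_2^2 for each column
    for _ in range(n_iter):
        for i in range(p):
            r += X[:, i] * w[i]      # remove block i from the fit
            w[i] = soft_threshold(X[:, i] @ r, lam) / col_sq[i]
            r -= X[:, i] * w[i]      # put the updated block back
    return w
```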
Alternating Direction Method of Multipliers (ADMM)
Example problem
∎ Solve problems such as
$$\min_x \; f(x) + g(x)$$
∎ Equivalent to solving
$$\min_{x,y} \; f(x) + g(y) \quad \text{subject to} \quad x = y$$
∎ Generalized form
$$\min_x \; f(x) \quad \text{subject to} \quad Ax - b = 0$$
† We always assume $f$ is convex
Dual ascent
∎ Lagrangian: $L(x, \lambda) = f(x) + \lambda^T (Ax - b)$
∎ Dual function: $g(\lambda) = \inf_x L(x, \lambda) = -f^*(-A^T \lambda) - b^T \lambda$, where $f^*$ is the conjugate of $f$
∎ The dual problem: $\lambda^* = \operatorname*{argmax}_{\lambda} g(\lambda)$
∎ The primal and dual optimal solutions are related by $x^* = \operatorname*{argmin}_x L(x, \lambda^*)$
Dual ascent iterations
∎ Step 1 (minimize over $x$): $x^{k+1} = \operatorname*{argmin}_x L(x, \lambda^k)$
∎ Step 2 (ascend in $\lambda$): $\lambda^{k+1} = \lambda^k + \alpha_k (A x^{k+1} - b)$

ADMM
∎ Replace $L$ with the augmented Lagrangian $L_\rho(x, \lambda) = L(x, \lambda) + \frac{\rho}{2}\|Ax - b\|_2^2$ and take the fixed step size $\alpha_k = \rho$; for the two-block form $\min_{x,y} f(x) + g(y)$ subject to $x = y$, the minimization alternates over $x$ and $y$ before each dual update
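As a concrete instance of these updates, here is a hedged sketch of ADMM for the lasso in the split form $\min \frac{1}{2}\|y - Xx\|_2^2 + \lambda\|z\|_1$ subject to $x = z$, written with the scaled dual variable; the names and the choice of $\rho$ are illustrative.

```python
import numpy as np

def admm_lasso(X, y, lam, rho=1.0, n_iter=200):
    """Scaled-form ADMM for 0.5*||y - Xw||^2 + lam*||w||_1, split as x = z."""
    n, p = X.shape
    x = np.zeros(p); z = np.zeros(p); u = np.zeros(p)
    M = X.T @ X + rho * np.eye(p)    # factor once in serious code
    Xty = X.T @ y
    for _ in range(n_iter):
        x = np.linalg.solve(M, Xty + rho * (z - u))              # x-minimization
        v = x + u
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)  # z-min: l1 prox
        u = u + x - z                                            # scaled dual update
    return z
```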
Dykstra’s Algorithm (DA)
Recall the best approximation problem (1).
Algorithm (DA)
∎ Initialize $u^{(0)} = y$ and $z^{(-d+1)} = \cdots = z^{(0)} = 0$
∎ Repeat, for $k = 1, 2, 3, \dots$:
$$u^{(k)} = P_{C_{[k]}}\big(u^{(k-1)} + z^{(k-d)}\big), \qquad z^{(k)} = u^{(k-1)} + z^{(k-d)} - u^{(k)}$$
∎ Here $P_C(x) = \operatorname*{argmin}_{c \in C} \|x - c\|_2^2$ is the Euclidean projection onto $C$, and $[k] = 1 + (k-1) \bmod d$ cycles through the sets
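A minimal sketch of these iterations, written with one pass over all $d$ sets per outer loop (the per-set indexing of the alternative version below), assuming the caller supplies Euclidean projection maps for each $C_i$; the corrections $z_i$ are returned as well, since they are used later.

```python
import numpy as np

def dykstra(y, projections, n_iter=100):
    """Dykstra's algorithm for min ||y - u||^2 s.t. u in the intersection.

    `projections` is a list of Euclidean projection maps P_{C_i}.
    Returns the primal iterate u and the correction terms z.
    """
    u = y.copy()
    z = [np.zeros_like(y) for _ in projections]
    for _ in range(n_iter):
        for i, P in enumerate(projections):
            v = u + z[i]        # add back set i's old correction
            u = P(v)            # project onto C_i
            z[i] = v - u        # store the new correction
    return u, z

# Example: intersection of the box [-1, 1]^n with the halfspace {1^T x <= 0}.
# proj_box  = lambda x: np.clip(x, -1.0, 1.0)
# proj_half = lambda x: x - np.maximum(x.sum(), 0.0) / x.size
```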
Algorithm (DA, alternative indexing)
∎ Initialize $u_d^{(0)} = y$ and $z_1^{(0)} = \cdots = z_d^{(0)} = 0$
∎ Repeat, for $k = 1, 2, 3, \dots$: set $u_0^{(k)} = u_d^{(k-1)}$, then for $i = 1, \dots, d$:
$$u_i^{(k)} = P_{C_i}\big(u_{i-1}^{(k)} + z_i^{(k-1)}\big), \qquad z_i^{(k)} = u_{i-1}^{(k)} + z_i^{(k-1)} - u_i^{(k)}$$
The Equivalence between DA and CD
Seminorms
∎ If $h$ is a seminorm, then it can be expressed in the general form
$$h(v) = \max_{d \in D} \; \langle d, v \rangle,$$
where $D$ is a closed, convex set containing 0

Equivalence via duality
∎ If the regularizers $h_i$ in the regularized regression problem (2) are seminorms, then the CD iterations are the same as DA run on the convex sets $C_i = \{v \in \mathbb{R}^n : X_i^T v \in D_i\}$, where $D_i$ is the convex set defining the seminorm $h_i$
Sketch of theory

Lemma 1.
$$w_i = \operatorname*{argmin}_{w_i \in \mathbb{R}^{p_i}} \; \frac{1}{2}\|b - X_i w_i\|_2^2 + h_i(w_i) \;\Longleftrightarrow\; X_i w_i = (\mathrm{Id} - P_{C_i})(b)$$

Theorem 1. Let $w_i^{(k)}$ and $(u_i^{(k)}, z_i^{(k)})$ denote the iterates of CD on problem (2) and of DA on problem (1), respectively. For all iterations $k = 1, 2, 3, \dots$ it holds that
$$z_i^{(k)} = X_i w_i^{(k)} \quad \text{and} \quad u_i^{(k)} = y - \sum_{j \le i} X_j w_j^{(k)} - \sum_{j > i} X_j w_j^{(k-1)}$$
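For intuition, here is Lemma 1 worked out in the lasso setting of the upcoming slides (a 1-dimensional block with $h_i = \lambda|\cdot|$, so $C_i = \{v : |X_i^T v| \le \lambda\}$). Writing $S_\lambda$ for soft-thresholding and using the identity $S_\lambda(a) = a - \mathrm{clip}(a; [-\lambda, \lambda])$, the two sides of the lemma are
$$X_i w_i = X_i \, \frac{S_\lambda(X_i^T b)}{\|X_i\|_2^2} \qquad \text{and} \qquad (\mathrm{Id} - P_{C_i})(b) = X_i \, \frac{X_i^T b - \mathrm{clip}(X_i^T b; [-\lambda, \lambda])}{\|X_i\|_2^2},$$
which visibly agree.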
Connections between DA and ADMM
∎ For $d = 2$, the best approximation problem (1) is equivalent to
$$\min_{u_1, u_2 \in \mathbb{R}^n} \; \|y - u_1\|_2^2 + 1_{C_1}(u_1) + 1_{C_2}(u_2) \quad \text{subject to} \quad u_1 = u_2$$
∎ This splitting can be solved by ADMM
∎ Augmented Lagrangian (in scaled form):
$$L_\rho(u_1, u_2, z) = \|y - u_1\|_2^2 + 1_{C_1}(u_1) + 1_{C_2}(u_2) + \rho \|u_1 - u_2 + z\|_2^2$$
∎ The resulting ADMM is equivalent to DA when $d = 2$, $C_1$ is a linear subspace, and $y \in C_1$
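A minimal sketch of this ADMM, in scaled form; the $u_1$-update follows by completing the square in the augmented Lagrangian, and P1, P2 are assumed to be Euclidean projections onto $C_1$, $C_2$ supplied by the caller.

```python
import numpy as np

def admm_best_approx(y, P1, P2, rho=1.0, n_iter=200):
    """ADMM for min ||y - u1||^2 + 1_{C1}(u1) + 1_{C2}(u2) s.t. u1 = u2."""
    u1 = y.copy(); u2 = y.copy()
    z = np.zeros_like(y)
    for _ in range(n_iter):
        u1 = P1((y + rho * (u2 - z)) / (1.0 + rho))  # minimize over u1
        u2 = P2(u1 + z)                              # minimize over u2
        z = z + u1 - u2                              # scaled dual update
    return u1
```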
Coordinate Descent for Lasso
Lasso problem
$$\min_{w \in \mathbb{R}^p} \; \frac{1}{2}\|y - Xw\|_2^2 + \lambda \|w\|_1$$
∎ $h_i(w_i) = \lambda |w_i| = \max_{d \in D_i} d\, w_i$, with $D_i = [-\lambda, \lambda]$
∎ $C_i = (X_i^T)^{-1}(D_i) = \{v \in \mathbb{R}^n : |X_i^T v| \le \lambda\}$, the intersection of two halfspaces (a slab)
∎ $C_1 \cap \cdots \cap C_d$ is a polyhedron
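Since each $C_i$ here is a slab with a closed-form projection, Theorem 1 can be checked numerically: reusing cd_lasso and dykstra from the sketches above, the Dykstra corrections $z_i$ should match $X_i w_i$ exactly, and the final Dykstra iterate should be the lasso residual. A hedged sketch (the random problem sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 20, 5, 0.5
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

def proj_slab(x):
    """Euclidean projection onto the slab {v : |x^T v| <= lam}."""
    def P(v):
        a = x @ v
        return v - x * (a - np.clip(a, -lam, lam)) / (x @ x)
    return P

u, z = dykstra(y, [proj_slab(X[:, i]) for i in range(p)], n_iter=50)
w = cd_lasso(X, y, lam, n_iter=50)
print(np.allclose(np.column_stack(z), X * w))  # z_i = X_i w_i  -> True
print(np.allclose(u, y - X @ w))               # u = lasso residual -> True
```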
Analyzing CD’s Convergence Rate for Lasso
Using Dykstra's theory to analyze the convergence rate of CD for the lasso.

Notation
∎ $w$ is the lasso solution
∎ $\Sigma = X^T X$ and $\|z\|_\Sigma^2 = z^T \Sigma z$
∎ $A$ is the active set and $a = |A|$ is its size
∎ $P^{\perp}_{\{i_{j+1}, \dots, i_a\}}$ is the projection onto the orthocomplement of the column span of $X_{\{i_{j+1}, \dots, i_a\}}$
Adaptation of Iusem and De Pierro (1990):
$$\frac{\|w^{(k+1)} - w\|_\Sigma}{\|w^{(k)} - w\|_\Sigma} \le \left( \frac{a^2}{a^2 + \lambda_{\min}(X_A^T X_A) / \max_{i \in A} \|X_i\|_2^2} \right)^{1/2}$$

Adaptation of Deutsch and Hundal (1994):
$$\frac{\|w^{(k+1)} - w\|_\Sigma}{\|w^{(k)} - w\|_\Sigma} \le \left( 1 - \prod_{j=1}^{a-1} \frac{\big\| P^{\perp}_{\{i_{j+1}, \dots, i_a\}} X_{i_j} \big\|_2^2}{\|X_{i_j}\|_2^2} \right)^{1/2}$$
Parallel Coordinate Descent - Dykstra
Reformulation
$$\min_{u = (u_1, \dots, u_d) \in \mathbb{R}^{nd}} \; \sum_{i=1}^{d} \gamma_i \|y - u_i\|_2^2 \quad \text{subject to} \quad u \in C_0 \cap (C_1 \times \cdots \times C_d)$$
∎ $C_0 = \{(u_1, \dots, u_d) \in \mathbb{R}^{nd} : u_1 = \cdots = u_d\}$
∎ The weights $\gamma_i > 0$ sum to 1
Parallel-Dykstra
∎ $u_0^{(k)} = \sum_{i=1}^d \gamma_i u_i^{(k-1)}$
∎ For $i = 1, \dots, d$ (executed in parallel):
$$u_i^{(k)} = P_{C_i}\big(u_0^{(k)} + z_i^{(k-1)}\big), \qquad z_i^{(k)} = u_0^{(k)} + z_i^{(k-1)} - u_i^{(k)}$$
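A minimal sketch of these parallel-Dykstra iterations; the inner loop is written serially but each block depends only on the shared average u0, so it could be dispatched to workers. Names are illustrative.

```python
import numpy as np

def parallel_dykstra(y, projections, gamma, n_iter=100):
    """Parallel-Dykstra: all blocks project from the weighted average u0."""
    d = len(projections)
    u = [y.copy() for _ in range(d)]
    z = [np.zeros_like(y) for _ in range(d)]
    for _ in range(n_iter):
        u0 = sum(g * ui for g, ui in zip(gamma, u))  # consensus (C_0) step
        for i, P in enumerate(projections):          # embarrassingly parallel
            v = u0 + z[i]
            u[i] = P(v)
            z[i] = v - u[i]
    return sum(g * ui for g, ui in zip(gamma, u))
```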
Recall the product-space reformulation above, with $C_0 = \{(u_1, \dots, u_d) : u_1 = \cdots = u_d\}$ and weights $\gamma_i > 0$ summing to 1.
Parallel-Dykstra-CD
∎ For $i = 1, \dots, d$ (executed in parallel):
$$w_i^{(k)} = \operatorname*{argmin}_{w_i \in \mathbb{R}^{p_i}} \; \frac{1}{2}\big\| y - X w^{(k-1)} + X_i w_i^{(k-1)}/\gamma_i - X_i w_i/\gamma_i \big\|_2^2 + h_i(w_i/\gamma_i)$$
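For the lasso with 1-dimensional blocks and equal weights $\gamma_i = 1/p$, substituting $t_i = w_i/\gamma_i$ turns each block update above into a soft-thresholding step, giving the following hedged sketch (reusing soft_threshold from the CD sketch earlier):

```python
import numpy as np

def parallel_dykstra_cd_lasso(X, y, lam, n_iter=500):
    """Parallel-Dykstra-CD for the lasso, equal weights gamma_i = 1/p."""
    n, p = X.shape
    gamma = np.full(p, 1.0 / p)
    w = np.zeros(p)
    col_sq = np.sum(X**2, axis=0)
    for _ in range(n_iter):
        r = y - X @ w                        # shared residual, computed once
        B = r[:, None] + X * (w / gamma)     # column i: r + X_i w_i / gamma_i
        t = soft_threshold(np.einsum('ni,ni->i', X, B), lam) / col_sq
        w = gamma * t                        # all blocks updated in parallel
    return w
```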
Parallel Coordinate Descent - ADMM
$$u_0^{(k)} = \frac{\big(\sum_{i=1}^d \rho_i\big) u_0^{(k-1)} + \big(y - X w^{(k-1)}\big) + X\big(w^{(k-2)} - w^{(k-1)}\big)}{1 + \sum_{i=1}^d \rho_i},$$
$$w_i^{(k)} = \operatorname*{argmin}_{w_i \in \mathbb{R}^{p_i}} \; \frac{1}{2}\big\| u_0^{(k)} + X_i w_i^{(k-1)}/\rho_i - X_i w_i/\rho_i \big\|_2^2 + h_i(w_i/\rho_i), \quad i = 1, \dots, d$$
∎ The $\rho_i > 0$ are the augmented Lagrangian parameters
See Theorems 4 and 5 in the paper for convergence results.
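For comparison, a hedged sketch of parallel-ADMM-CD on the lasso (again 1-dimensional blocks, equal $\rho_i = \rho/p$, and an illustrative zero initialization for $u_0$ and the lagged iterate); note the extra momentum-like term $X(w^{(k-2)} - w^{(k-1)})$ in the $u_0$ update:

```python
import numpy as np

def parallel_admm_cd_lasso(X, y, lam, rho=10.0, n_iter=500):
    """Parallel-ADMM-CD for the lasso with equal rho_i = rho/p."""
    n, p = X.shape
    rho_i = np.full(p, rho / p)
    s = rho_i.sum()
    w = np.zeros(p); w_prev = np.zeros(p)
    u0 = np.zeros(n)                         # illustrative initialization
    col_sq = np.sum(X**2, axis=0)
    for _ in range(n_iter):
        u0 = (s * u0 + (y - X @ w) + X @ (w_prev - w)) / (1.0 + s)
        B = u0[:, None] + X * (w / rho_i)    # column i: u0 + X_i w_i / rho_i
        t = soft_threshold(np.einsum('ni,ni->i', X, B), lam) / col_sq
        w_prev, w = w, rho_i * t             # all blocks updated in parallel
    return w
```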
Experiment
[Figure 1: two panels of suboptimality curves on a log scale, "No parallelization" (x-axis: actual iteration number) and "10% parallelization" (x-axis: effective iteration number), with curves for coordinate descent, Par-Dykstra-CD, and Par-ADMM-CD at $\rho$ = 10, 50, 200.]

Figure 1: Suboptimality curves for serial coordinate descent, parallel-Dykstra-CD, and three tunings of parallel-ADMM-CD (i.e., three different values of $\rho = \sum_{i=1}^p \rho_i$), each run over the same 30 lasso problems with $n = 100$ and $p = 500$. For details of the experimental setup, see the supplement.
Nonquadratic loss: Dykstra's algorithm and coordinate descent. Given a convex function $f$, a generalization of (2) is the regularized estimation problem
$$\min_{w \in \mathbb{R}^p} \; f(Xw) + \sum_{i=1}^{d} h_i(w_i). \tag{16}$$
Regularized regression (2) is given by $f(z) = \tfrac{1}{2}\|y - z\|_2^2$, and, e.g., regularized classification (under the logistic loss) by $f(z) = -y^T z + \sum_{i=1}^n \log(1 + e^{z_i})$. In (block) coordinate descent for (16), we initialize say $w^{(0)} = 0$, and repeat, for $k = 1, 2, 3, \dots$:
$$w_i^{(k)} = \operatorname*{argmin}_{w_i \in \mathbb{R}^{p_i}} \; f\Big( \sum_{j<i} X_j w_j^{(k)} + \sum_{j>i} X_j w_j^{(k-1)} + X_i w_i \Big) + h_i(w_i), \quad i = 1, \dots, d. \tag{17}$$
On the other hand, given a differentiable and strictly convex function $g$, we can generalize (1) to the following best Bregman-approximation problem,
$$\min_{u \in \mathbb{R}^n} \; D_g(u, b) \quad \text{subject to} \quad u \in C_1 \cap \cdots \cap C_d, \tag{18}$$
where $D_g(u, b) = g(u) - g(b) - \langle \nabla g(b), u - b \rangle$ is the Bregman divergence between $u$ and $b$ with respect to $g$. When $g(v) = \tfrac{1}{2}\|v\|_2^2$ (and $b = y$), this recovers the best approximation problem (1). As shown in Censor and Reich (1998) and Bauschke and Lewis (2000), Dykstra's algorithm can be extended to apply to (18). We initialize $u_d^{(0)} = b$, $z_1^{(0)} = \cdots = z_d^{(0)} = 0$, and repeat for $k = 1, 2, 3, \dots$:
$$u_0^{(k)} = u_d^{(k-1)}, \qquad \begin{aligned} u_i^{(k)} &= \big(P^g_{C_i} \circ \nabla g^*\big)\big( \nabla g(u_{i-1}^{(k)}) + z_i^{(k-1)} \big), \\ z_i^{(k)} &= \nabla g(u_{i-1}^{(k)}) + z_i^{(k-1)} - \nabla g(u_i^{(k)}), \end{aligned} \quad \text{for } i = 1, \dots, d, \tag{19}$$
where $P^g_C(x) = \operatorname*{argmin}_{c \in C} D_g(c, x)$ denotes the Bregman (rather than Euclidean) projection of $x$ onto a set $C$, and $g^*$ is the conjugate function of $g$. Though it may not be immediately obvious, when $g(v) = \tfrac{1}{2}\|v\|_2^2$ the above iterations (19) reduce to the standard (Euclidean) Dykstra iterations. Furthermore, Dykstra's algorithm and coordinate descent are equivalent in this more general setting.

Theorem 6. Let $f$ be a strictly convex, differentiable function that has full domain. Assume that $X_i \in \mathbb{R}^{n \times p_i}$ has full column rank and $h_i(v) = \max_{d \in D_i} \langle d, v \rangle$ for a closed, convex set $D_i \subseteq \mathbb{R}^{p_i}$, for $i = 1, \dots, d$. Also, let $g(v) = f^*(-v)$, $b = -\nabla f(0)$, and $C_i = (X_i^T)^{-1}(D_i) \subseteq \mathbb{R}^n$, $i = 1, \dots, d$.
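A hedged sketch of the Bregman-Dykstra iterations (19), assuming the caller supplies $\nabla g$, $\nabla g^*$, and the Bregman projection maps $P^g_{C_i}$; with $g(v) = \tfrac{1}{2}\|v\|_2^2$ (identity gradients, Euclidean projections) it reduces to the standard Dykstra sketch given earlier.

```python
import numpy as np

def bregman_dykstra(b, bregman_projs, grad_g, grad_g_star, n_iter=100):
    """Dykstra's algorithm for min D_g(u, b) s.t. u in the intersection."""
    u = b.copy()
    z = [np.zeros_like(b) for _ in bregman_projs]
    for _ in range(n_iter):
        for i, P in enumerate(bregman_projs):
            v = grad_g(u) + z[i]          # dual-space point for set i
            u_new = P(grad_g_star(v))     # map back, Bregman-project onto C_i
            z[i] = v - grad_g(u_new)      # new correction, kept in dual space
            u = u_new
    return u
```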
Nonquadratic Loss: DA and CD
Regularized nonquadratic loss problem
$$\min_{w \in \mathbb{R}^p} \; f(Xw) + \sum_{i=1}^{d} h_i(w_i) \tag{3}$$

Bregman approximation problem
∎ Bregman divergence: $D_g(u, b) = g(u) - g(b) - \langle \nabla g(b), u - b \rangle$, where $g$ is differentiable and strictly convex
∎ The best approximation problem generalizes to
$$\min_{u \in \mathbb{R}^n} \; D_g(u, b) \quad \text{subject to} \quad u \in C_1 \cap \cdots \cap C_d \tag{4}$$
Sketch of theory

Theorem 6. Let
$$g(v) = f^*(-v), \quad b = -\nabla f(0), \quad \text{and} \quad C_i = (X_i^T)^{-1}(D_i) \subseteq \mathbb{R}^n.$$
Then problems (3) and (4) are dual to each other, and their solutions $w$, $u$ satisfy $u = -\nabla f(Xw)$. Further, Dykstra's algorithm (Eq. 19 in the paper) and coordinate descent (Eq. 17 in the paper) are equivalent, i.e., for $k = 1, 2, 3, \dots$ and $i = 1, \dots, d$:
$$z_i^{(k)} = X_i w_i^{(k)}, \quad \text{and} \quad u_i^{(k)} = -\nabla f\Big( \sum_{j \le i} X_j w_j^{(k)} + \sum_{j > i} X_j w_j^{(k-1)} \Big)$$
Nonquadratic loss: parallel coordinate descent methods
For nonquadratic loss, the parallel Dykstra-CD and ADMM-CD iterations differ:
∎ Dykstra-CD: closed-form $u_0$ update; the parallel $w$-updates require coordinatewise minimization involving the smooth, convex loss $f$
∎ ADMM-CD: more difficult $u_0$ update; the parallel $w$-updates require coordinatewise minimization with only a quadratic loss
Future work proposed by the author
∎ Asynchronous parallel algorithms
∎ Coordinate descent in Hilbert space