TRANSCRIPT
Dykstra's Algorithm, ADMM, and Coordinate Descent: Connections, Insights, and Extensions
Ryan J. Tibshirani, Statistics Department and Department of Machine Learning, Carnegie Mellon University
Presented by Chenyang Tao
Dec 15, 2017
Outline
1. Overview
2. Dykstra’s Algorithm, ADMM and Coordinate Descent
3. Connections, Insights and Applications to Lasso
4. Nonquadratic Loss Generalizations and Future Directions
Highlights
∎ Establish the equivalence between Dykstra's algorithm (DA) and block coordinate descent (CD)
∎ Show the connections between DA and ADMM
∎ Improve CD's convergence results using theory from DA
∎ Parallelize CD using its DA and ADMM counterparts
∎ Generalize to problems with nonquadratic loss
Two convex optimization problems
Best approximation problem
$$\min_{u \in \mathbb{R}^n} \; \|y - u\|_2^2 \quad \text{subject to} \quad u \in C_1 \cap \cdots \cap C_d \tag{1}$$
∎ $C_1, \dots, C_d \subseteq \mathbb{R}^n$ are convex sets, and $y \in \mathbb{R}^n$

Regularized regression problem
$$\min_{w \in \mathbb{R}^p} \; \frac{1}{2}\|y - Xw\|_2^2 + \sum_{i=1}^{d} h_i(w_i) \tag{2}$$
∎ $y \in \mathbb{R}^n$ and $X \in \mathbb{R}^{n \times p}$
∎ $X_i \in \mathbb{R}^{n \times p_i}$, $i = 1, \dots, d$, is a block decomposition of $X$
∎ $h_i : \mathbb{R}^{p_i} \to \mathbb{R}$ are convex regularizers
Recall the best approximation problem (1) and the regularized regression problem (2).
A peek at the main result
∎ For particular choices of $C_1, \dots, C_d$ and $h_1, \dots, h_d$, problems (1) and (2) are duals of each other
∎ ADMM matches Dykstra's algorithm when $d = 2$ and $C_1$ is a linear subspace
Coordinate Descent
Recall the regularized regression problem (2).
∎ Initialize, say, $w^{(0)} = 0$, and repeat, for $k = 1, 2, 3, \dots$:
$$w_i^{(k)} = \operatorname*{argmin}_{w_i \in \mathbb{R}^{p_i}} \; \frac{1}{2} \Big\| y - \sum_{j<i} X_j w_j^{(k)} - X_i w_i - \sum_{j>i} X_j w_j^{(k-1)} \Big\|_2^2 + h_i(w_i), \quad i = 1, \dots, d$$
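To make the block updates concrete, here is a minimal sketch of this scheme for the lasso (1-dimensional blocks, $h_i(w_i) = \lambda |w_i|$), where each block minimization reduces to soft-thresholding. The function names are illustrative, not from the paper.

```python
import numpy as np

def soft_threshold(a, t):
    """Prox of t*|.|: shrink a toward zero by t."""
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def cd_lasso(X, y, lam, n_iter=100):
    """Cyclic coordinate descent for 0.5*||y - Xw||^2 + lam*||w||_1."""
    n, p = X.shape
    w = np.zeros(p)
    r = y - X @ w                    # residual, maintained incrementally
    col_sq = np.sum(X**2, axis=0)    # ||X_i||_2^2 for each column
    for _ in range(n_iter):
        for i in range(p):
            r += X[:, i] * w[i]      # remove block i from the fit
            w[i] = soft_threshold(X[:, i] @ r, lam) / col_sq[i]
            r -= X[:, i] * w[i]      # put the updated block back
    return w
```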
Alternating Direction Method of Multipliers (ADMM)
Example problem
∎ Solve problems such as
$$\min_x \; f(x) + g(x)$$
∎ Equivalent to solving
$$\min_{x,y} \; f(x) + g(y) \quad \text{subject to} \quad x = y$$
∎ Generalized form
$$\min_x \; f(x) \quad \text{subject to} \quad Ax - b = 0$$
† We always assume $f$ is convex
Dual ascent
∎ Lagrangian: $L(x, \lambda) = f(x) + \lambda^T (Ax - b)$
∎ Dual function: $g(\lambda) = \inf_x L(x, \lambda) = -f^*(-A^T \lambda) - b^T \lambda$, where $f^*$ is the conjugate of $f$
∎ The dual problem: $\lambda^* = \operatorname*{argmax}_{\lambda} g(\lambda)$
∎ The primal and dual optimal solutions are related by $x^* = \operatorname*{argmin}_x L(x, \lambda^*)$
Dual ascent iterations
∎ Step 1 (minimize over $x$): $x^{k+1} = \operatorname*{argmin}_x L(x, \lambda^k)$
∎ Step 2 (ascend in $\lambda$): $\lambda^{k+1} = \lambda^k + \alpha_k (A x^{k+1} - b)$

ADMM
∎ Replace $L$ with the augmented Lagrangian $L_\rho(x, \lambda) = L(x, \lambda) + \frac{\rho}{2}\|Ax - b\|_2^2$ and take the fixed step size $\alpha_k = \rho$; for the two-block form $\min_{x,y} f(x) + g(y)$ subject to $x = y$, the minimization alternates over $x$ and $y$ before each dual update
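As a concrete instance of these updates, here is a hedged sketch of ADMM for the lasso in the split form $\min \frac{1}{2}\|y - Xx\|_2^2 + \lambda\|z\|_1$ subject to $x = z$, written with the scaled dual variable; the names and the choice of $\rho$ are illustrative.

```python
import numpy as np

def admm_lasso(X, y, lam, rho=1.0, n_iter=200):
    """Scaled-form ADMM for 0.5*||y - Xw||^2 + lam*||w||_1, split as x = z."""
    n, p = X.shape
    x = np.zeros(p); z = np.zeros(p); u = np.zeros(p)
    M = X.T @ X + rho * np.eye(p)    # factor once in serious code
    Xty = X.T @ y
    for _ in range(n_iter):
        x = np.linalg.solve(M, Xty + rho * (z - u))              # x-minimization
        v = x + u
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)  # z-min: l1 prox
        u = u + x - z                                            # scaled dual update
    return z
```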
Dykstra’s Algorithm (DA)
Recall the best approximation problem (1).
Algorithm (DA)
∎ Initialize $u^{(0)} = y$ and $z^{(-d+1)} = \cdots = z^{(0)} = 0$
∎ Repeat, for $k = 1, 2, 3, \dots$:
$$u^{(k)} = P_{C_{[k]}}\big(u^{(k-1)} + z^{(k-d)}\big), \qquad z^{(k)} = u^{(k-1)} + z^{(k-d)} - u^{(k)}$$
∎ Here $P_C(x) = \operatorname*{argmin}_{c \in C} \|x - c\|_2^2$ is the Euclidean projection onto $C$, and $[k] = 1 + (k-1) \bmod d$ cycles through the sets
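A minimal sketch of these iterations, written with one pass over all $d$ sets per outer loop (the per-set indexing of the alternative version below), assuming the caller supplies Euclidean projection maps for each $C_i$; the corrections $z_i$ are returned as well, since they are used later.

```python
import numpy as np

def dykstra(y, projections, n_iter=100):
    """Dykstra's algorithm for min ||y - u||^2 s.t. u in the intersection.

    `projections` is a list of Euclidean projection maps P_{C_i}.
    Returns the primal iterate u and the correction terms z.
    """
    u = y.copy()
    z = [np.zeros_like(y) for _ in projections]
    for _ in range(n_iter):
        for i, P in enumerate(projections):
            v = u + z[i]        # add back set i's old correction
            u = P(v)            # project onto C_i
            z[i] = v - u        # store the new correction
    return u, z

# Example: intersection of the box [-1, 1]^n with the halfspace {1^T x <= 0}.
# proj_box  = lambda x: np.clip(x, -1.0, 1.0)
# proj_half = lambda x: x - np.maximum(x.sum(), 0.0) / x.size
```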
Algorithm (DA, alternative indexing)
∎ Initialize $u_d^{(0)} = y$ and $z_1^{(0)} = \cdots = z_d^{(0)} = 0$
∎ Repeat, for $k = 1, 2, 3, \dots$: set $u_0^{(k)} = u_d^{(k-1)}$, then for $i = 1, \dots, d$:
$$u_i^{(k)} = P_{C_i}\big(u_{i-1}^{(k)} + z_i^{(k-1)}\big), \qquad z_i^{(k)} = u_{i-1}^{(k)} + z_i^{(k-1)} - u_i^{(k)}$$
The Equivalence between DA and CD
Seminorms
∎ If $h$ is a seminorm, then it can be expressed in the general form
$$h(v) = \max_{d \in D} \; \langle d, v \rangle,$$
where $D$ is a closed, convex set containing 0

Equivalence via duality
∎ If the regularizers $h_i$ in the regularized regression problem (2) are seminorms, then the CD iterations are the same as DA run on the convex sets $C_i = \{v \in \mathbb{R}^n : X_i^T v \in D_i\}$, where $D_i$ is the convex set defining the seminorm $h_i$
Sketch of theory

Lemma 1.
$$w_i = \operatorname*{argmin}_{w_i \in \mathbb{R}^{p_i}} \; \frac{1}{2}\|b - X_i w_i\|_2^2 + h_i(w_i) \;\Longleftrightarrow\; X_i w_i = (\mathrm{Id} - P_{C_i})(b)$$

Theorem 1. Let $w_i^{(k)}$ and $(u_i^{(k)}, z_i^{(k)})$ denote the iterates of CD on problem (2) and of DA on problem (1), respectively. For all iterations $k = 1, 2, 3, \dots$ it holds that
$$z_i^{(k)} = X_i w_i^{(k)} \quad \text{and} \quad u_i^{(k)} = y - \sum_{j \le i} X_j w_j^{(k)} - \sum_{j > i} X_j w_j^{(k-1)}$$
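For intuition, here is Lemma 1 worked out in the lasso setting of the upcoming slides (a 1-dimensional block with $h_i = \lambda|\cdot|$, so $C_i = \{v : |X_i^T v| \le \lambda\}$). Writing $S_\lambda$ for soft-thresholding and using the identity $S_\lambda(a) = a - \mathrm{clip}(a; [-\lambda, \lambda])$, the two sides of the lemma are
$$X_i w_i = X_i \, \frac{S_\lambda(X_i^T b)}{\|X_i\|_2^2} \qquad \text{and} \qquad (\mathrm{Id} - P_{C_i})(b) = X_i \, \frac{X_i^T b - \mathrm{clip}(X_i^T b; [-\lambda, \lambda])}{\|X_i\|_2^2},$$
which visibly agree.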
Connections between DA and ADMM
∎ For $d = 2$, the best approximation problem (1) is equivalent to
$$\min_{u_1, u_2 \in \mathbb{R}^n} \; \|y - u_1\|_2^2 + 1_{C_1}(u_1) + 1_{C_2}(u_2) \quad \text{subject to} \quad u_1 = u_2$$
∎ This splitting can be solved by ADMM
∎ Augmented Lagrangian (in scaled form):
$$L_\rho(u_1, u_2, z) = \|y - u_1\|_2^2 + 1_{C_1}(u_1) + 1_{C_2}(u_2) + \rho \|u_1 - u_2 + z\|_2^2$$
∎ The resulting ADMM is equivalent to DA when $d = 2$, $C_1$ is a linear subspace, and $y \in C_1$
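A minimal sketch of this ADMM, in scaled form; the $u_1$-update follows by completing the square in the augmented Lagrangian, and P1, P2 are assumed to be Euclidean projections onto $C_1$, $C_2$ supplied by the caller.

```python
import numpy as np

def admm_best_approx(y, P1, P2, rho=1.0, n_iter=200):
    """ADMM for min ||y - u1||^2 + 1_{C1}(u1) + 1_{C2}(u2) s.t. u1 = u2."""
    u1 = y.copy(); u2 = y.copy()
    z = np.zeros_like(y)
    for _ in range(n_iter):
        u1 = P1((y + rho * (u2 - z)) / (1.0 + rho))  # minimize over u1
        u2 = P2(u1 + z)                              # minimize over u2
        z = z + u1 - u2                              # scaled dual update
    return u1
```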
Coordinate Descent for Lasso
Lasso problem
$$\min_{w \in \mathbb{R}^p} \; \frac{1}{2}\|y - Xw\|_2^2 + \lambda \|w\|_1$$
∎ $h_i(w_i) = \lambda |w_i| = \max_{d \in D_i} d\, w_i$, with $D_i = [-\lambda, \lambda]$
∎ $C_i = (X_i^T)^{-1}(D_i) = \{v \in \mathbb{R}^n : |X_i^T v| \le \lambda\}$, the intersection of two halfspaces (a slab)
∎ $C_1 \cap \cdots \cap C_d$ is a polyhedron
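Since each $C_i$ here is a slab with a closed-form projection, Theorem 1 can be checked numerically: reusing cd_lasso and dykstra from the sketches above, the Dykstra corrections $z_i$ should match $X_i w_i$ exactly, and the final Dykstra iterate should be the lasso residual. A hedged sketch (the random problem sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 20, 5, 0.5
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

def proj_slab(x):
    """Euclidean projection onto the slab {v : |x^T v| <= lam}."""
    def P(v):
        a = x @ v
        return v - x * (a - np.clip(a, -lam, lam)) / (x @ x)
    return P

u, z = dykstra(y, [proj_slab(X[:, i]) for i in range(p)], n_iter=50)
w = cd_lasso(X, y, lam, n_iter=50)
print(np.allclose(np.column_stack(z), X * w))  # z_i = X_i w_i  -> True
print(np.allclose(u, y - X @ w))               # u = lasso residual -> True
```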
Analyzing CD’s Convergence Rate for Lasso
Using Dykstra's theory to analyze the convergence rate of CD for the lasso.

Notation
∎ $w$ is the lasso solution
∎ $\Sigma = X^T X$ and $\|z\|_\Sigma^2 = z^T \Sigma z$
∎ $A$ is the active set and $a = |A|$ is its size
∎ $P^{\perp}_{\{i_{j+1}, \dots, i_a\}}$ is the projection onto the orthocomplement of the column span of $X_{\{i_{j+1}, \dots, i_a\}}$
Adaptation of Iusem and De Pierro (1990):
$$\frac{\|w^{(k+1)} - w\|_\Sigma}{\|w^{(k)} - w\|_\Sigma} \le \left( \frac{a^2}{a^2 + \lambda_{\min}(X_A^T X_A) / \max_{i \in A} \|X_i\|_2^2} \right)^{1/2}$$

Adaptation of Deutsch and Hundal (1994):
$$\frac{\|w^{(k+1)} - w\|_\Sigma}{\|w^{(k)} - w\|_\Sigma} \le \left( 1 - \prod_{j=1}^{a-1} \frac{\big\| P^{\perp}_{\{i_{j+1}, \dots, i_a\}} X_{i_j} \big\|_2^2}{\|X_{i_j}\|_2^2} \right)^{1/2}$$
Parallel Coordinate Descent - Dykstra
Reformulation
$$\min_{u = (u_1, \dots, u_d) \in \mathbb{R}^{nd}} \; \sum_{i=1}^{d} \gamma_i \|y - u_i\|_2^2 \quad \text{subject to} \quad u \in C_0 \cap (C_1 \times \cdots \times C_d)$$
∎ $C_0 = \{(u_1, \dots, u_d) \in \mathbb{R}^{nd} : u_1 = \cdots = u_d\}$
∎ The weights $\gamma_i > 0$ sum to 1
Parallel-Dykstra
∎ $u_0^{(k)} = \sum_{i=1}^d \gamma_i u_i^{(k-1)}$
∎ For $i = 1, \dots, d$ (executed in parallel):
$$u_i^{(k)} = P_{C_i}\big(u_0^{(k)} + z_i^{(k-1)}\big), \qquad z_i^{(k)} = u_0^{(k)} + z_i^{(k-1)} - u_i^{(k)}$$
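A minimal sketch of these parallel-Dykstra iterations; the inner loop is written serially but each block depends only on the shared average u0, so it could be dispatched to workers. Names are illustrative.

```python
import numpy as np

def parallel_dykstra(y, projections, gamma, n_iter=100):
    """Parallel-Dykstra: all blocks project from the weighted average u0."""
    d = len(projections)
    u = [y.copy() for _ in range(d)]
    z = [np.zeros_like(y) for _ in range(d)]
    for _ in range(n_iter):
        u0 = sum(g * ui for g, ui in zip(gamma, u))  # consensus (C_0) step
        for i, P in enumerate(projections):          # embarrassingly parallel
            v = u0 + z[i]
            u[i] = P(v)
            z[i] = v - u[i]
    return sum(g * ui for g, ui in zip(gamma, u))
```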
Recall the product-space reformulation above, with $C_0 = \{(u_1, \dots, u_d) : u_1 = \cdots = u_d\}$ and weights $\gamma_i > 0$ summing to 1.
Parallel-Dykstra-CD
∎ For $i = 1, \dots, d$ (executed in parallel):
$$w_i^{(k)} = \operatorname*{argmin}_{w_i \in \mathbb{R}^{p_i}} \; \frac{1}{2}\big\| y - X w^{(k-1)} + X_i w_i^{(k-1)}/\gamma_i - X_i w_i/\gamma_i \big\|_2^2 + h_i(w_i/\gamma_i)$$
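For the lasso with 1-dimensional blocks and equal weights $\gamma_i = 1/p$, substituting $t_i = w_i/\gamma_i$ turns each block update above into a soft-thresholding step, giving the following hedged sketch (reusing soft_threshold from the CD sketch earlier):

```python
import numpy as np

def parallel_dykstra_cd_lasso(X, y, lam, n_iter=500):
    """Parallel-Dykstra-CD for the lasso, equal weights gamma_i = 1/p."""
    n, p = X.shape
    gamma = np.full(p, 1.0 / p)
    w = np.zeros(p)
    col_sq = np.sum(X**2, axis=0)
    for _ in range(n_iter):
        r = y - X @ w                        # shared residual, computed once
        B = r[:, None] + X * (w / gamma)     # column i: r + X_i w_i / gamma_i
        t = soft_threshold(np.einsum('ni,ni->i', X, B), lam) / col_sq
        w = gamma * t                        # all blocks updated in parallel
    return w
```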
Parallel Coordinate Descent - ADMM
$$u_0^{(k)} = \frac{\big(\sum_{i=1}^d \rho_i\big) u_0^{(k-1)} + \big(y - X w^{(k-1)}\big) + X\big(w^{(k-2)} - w^{(k-1)}\big)}{1 + \sum_{i=1}^d \rho_i},$$
$$w_i^{(k)} = \operatorname*{argmin}_{w_i \in \mathbb{R}^{p_i}} \; \frac{1}{2}\big\| u_0^{(k)} + X_i w_i^{(k-1)}/\rho_i - X_i w_i/\rho_i \big\|_2^2 + h_i(w_i/\rho_i), \quad i = 1, \dots, d$$
∎ The $\rho_i > 0$ are the augmented Lagrangian parameters
See Theorems 4 and 5 in the paper for convergence results.
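For comparison, a hedged sketch of parallel-ADMM-CD on the lasso (again 1-dimensional blocks, equal $\rho_i = \rho/p$, and an illustrative zero initialization for $u_0$ and the lagged iterate); note the extra momentum-like term $X(w^{(k-2)} - w^{(k-1)})$ in the $u_0$ update:

```python
import numpy as np

def parallel_admm_cd_lasso(X, y, lam, rho=10.0, n_iter=500):
    """Parallel-ADMM-CD for the lasso with equal rho_i = rho/p."""
    n, p = X.shape
    rho_i = np.full(p, rho / p)
    s = rho_i.sum()
    w = np.zeros(p); w_prev = np.zeros(p)
    u0 = np.zeros(n)                         # illustrative initialization
    col_sq = np.sum(X**2, axis=0)
    for _ in range(n_iter):
        u0 = (s * u0 + (y - X @ w) + X @ (w_prev - w)) / (1.0 + s)
        B = u0[:, None] + X * (w / rho_i)    # column i: u0 + X_i w_i / rho_i
        t = soft_threshold(np.einsum('ni,ni->i', X, B), lam) / col_sq
        w_prev, w = w, rho_i * t             # all blocks updated in parallel
    return w
```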
Experiment
[Figure 1: two panels of suboptimality curves on a log scale, "No parallelization" (x-axis: actual iteration number) and "10% parallelization" (x-axis: effective iteration number), with curves for coordinate descent, Par-Dykstra-CD, and Par-ADMM-CD at $\rho$ = 10, 50, 200.]

Figure 1: Suboptimality curves for serial coordinate descent, parallel-Dykstra-CD, and three tunings of parallel-ADMM-CD (i.e., three different values of $\rho = \sum_{i=1}^p \rho_i$), each run over the same 30 lasso problems with $n = 100$ and $p = 500$. For details of the experimental setup, see the supplement.
Nonquadratic loss: Dykstra's algorithm and coordinate descent. Given a convex function $f$, a generalization of (2) is the regularized estimation problem
$$\min_{w \in \mathbb{R}^p} \; f(Xw) + \sum_{i=1}^{d} h_i(w_i). \tag{16}$$
Regularized regression (2) is given by $f(z) = \tfrac{1}{2}\|y - z\|_2^2$, and, e.g., regularized classification (under the logistic loss) by $f(z) = -y^T z + \sum_{i=1}^n \log(1 + e^{z_i})$. In (block) coordinate descent for (16), we initialize say $w^{(0)} = 0$, and repeat, for $k = 1, 2, 3, \dots$:
$$w_i^{(k)} = \operatorname*{argmin}_{w_i \in \mathbb{R}^{p_i}} \; f\Big( \sum_{j<i} X_j w_j^{(k)} + \sum_{j>i} X_j w_j^{(k-1)} + X_i w_i \Big) + h_i(w_i), \quad i = 1, \dots, d. \tag{17}$$
On the other hand, given a differentiable and strictly convex function $g$, we can generalize (1) to the following best Bregman-approximation problem,
$$\min_{u \in \mathbb{R}^n} \; D_g(u, b) \quad \text{subject to} \quad u \in C_1 \cap \cdots \cap C_d, \tag{18}$$
where $D_g(u, b) = g(u) - g(b) - \langle \nabla g(b), u - b \rangle$ is the Bregman divergence between $u$ and $b$ with respect to $g$. When $g(v) = \tfrac{1}{2}\|v\|_2^2$ (and $b = y$), this recovers the best approximation problem (1). As shown in Censor and Reich (1998) and Bauschke and Lewis (2000), Dykstra's algorithm can be extended to apply to (18). We initialize $u_d^{(0)} = b$, $z_1^{(0)} = \cdots = z_d^{(0)} = 0$, and repeat for $k = 1, 2, 3, \dots$:
$$u_0^{(k)} = u_d^{(k-1)}, \qquad \begin{aligned} u_i^{(k)} &= \big(P^g_{C_i} \circ \nabla g^*\big)\big( \nabla g(u_{i-1}^{(k)}) + z_i^{(k-1)} \big), \\ z_i^{(k)} &= \nabla g(u_{i-1}^{(k)}) + z_i^{(k-1)} - \nabla g(u_i^{(k)}), \end{aligned} \quad \text{for } i = 1, \dots, d, \tag{19}$$
where $P^g_C(x) = \operatorname*{argmin}_{c \in C} D_g(c, x)$ denotes the Bregman (rather than Euclidean) projection of $x$ onto a set $C$, and $g^*$ is the conjugate function of $g$. Though it may not be immediately obvious, when $g(v) = \tfrac{1}{2}\|v\|_2^2$ the above iterations (19) reduce to the standard (Euclidean) Dykstra iterations. Furthermore, Dykstra's algorithm and coordinate descent are equivalent in this more general setting.

Theorem 6. Let $f$ be a strictly convex, differentiable function that has full domain. Assume that $X_i \in \mathbb{R}^{n \times p_i}$ has full column rank and $h_i(v) = \max_{d \in D_i} \langle d, v \rangle$ for a closed, convex set $D_i \subseteq \mathbb{R}^{p_i}$, for $i = 1, \dots, d$. Also, let $g(v) = f^*(-v)$, $b = -\nabla f(0)$, and $C_i = (X_i^T)^{-1}(D_i) \subseteq \mathbb{R}^n$, $i = 1, \dots, d$.
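A hedged sketch of the Bregman-Dykstra iterations (19), assuming the caller supplies $\nabla g$, $\nabla g^*$, and the Bregman projection maps $P^g_{C_i}$; with $g(v) = \tfrac{1}{2}\|v\|_2^2$ (identity gradients, Euclidean projections) it reduces to the standard Dykstra sketch given earlier.

```python
import numpy as np

def bregman_dykstra(b, bregman_projs, grad_g, grad_g_star, n_iter=100):
    """Dykstra's algorithm for min D_g(u, b) s.t. u in the intersection."""
    u = b.copy()
    z = [np.zeros_like(b) for _ in bregman_projs]
    for _ in range(n_iter):
        for i, P in enumerate(bregman_projs):
            v = grad_g(u) + z[i]          # dual-space point for set i
            u_new = P(grad_g_star(v))     # map back, Bregman-project onto C_i
            z[i] = v - grad_g(u_new)      # new correction, kept in dual space
            u = u_new
    return u
```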
Nonquadratic Loss: DA and CD
Regularized nonquadratic loss problem
$$\min_{w \in \mathbb{R}^p} \; f(Xw) + \sum_{i=1}^{d} h_i(w_i) \tag{3}$$

Bregman approximation problem
∎ Bregman divergence: $D_g(u, b) = g(u) - g(b) - \langle \nabla g(b), u - b \rangle$, where $g$ is differentiable and strictly convex
∎ The best approximation problem generalizes to
$$\min_{u \in \mathbb{R}^n} \; D_g(u, b) \quad \text{subject to} \quad u \in C_1 \cap \cdots \cap C_d \tag{4}$$
Sketch of theory

Theorem 6. Let
$$g(v) = f^*(-v), \quad b = -\nabla f(0), \quad \text{and} \quad C_i = (X_i^T)^{-1}(D_i) \subseteq \mathbb{R}^n.$$
Then problems (3) and (4) are dual to each other, and their solutions $w$, $u$ satisfy $u = -\nabla f(Xw)$. Further, Dykstra's algorithm (Eq. 19 in the paper) and coordinate descent (Eq. 17 in the paper) are equivalent, i.e., for $k = 1, 2, 3, \dots$ and $i = 1, \dots, d$:
$$z_i^{(k)} = X_i w_i^{(k)}, \quad \text{and} \quad u_i^{(k)} = -\nabla f\Big( \sum_{j \le i} X_j w_j^{(k)} + \sum_{j > i} X_j w_j^{(k-1)} \Big)$$
Nonquadratic loss: parallel coordinate descent methods
For nonquadratic loss, the parallel Dykstra-CD and ADMM-CD iterations differ:
∎ Dykstra-CD: closed-form $u_0$ update; the parallel $w$-updates require coordinatewise minimization involving the smooth, convex loss $f$
∎ ADMM-CD: more difficult $u_0$ update; the parallel $w$-updates require coordinatewise minimization with only a quadratic loss
Future work proposed by the author
∎ Asynchronous parallel algorithms
∎ Coordinate descent in Hilbert space