coordinate descent - carnegie mellon...
TRANSCRIPT
Coordinate Descent
Ryan Tibshirani, Convex Optimization 10-725
Last time: numerical linear algebra primer
In R^n, rough flop counts for basic operations are as follows
• Vector-vector operations: n flops
• Matrix-vector multiplication: n^2 flops
• Matrix-matrix multiplication: n^3 flops
• Linear system solve: n^3 flops
Operations with banded or sparse matrices are much cheaper
Two classes of approaches for linear system solvers:
• Direct: QR decomposition, Cholesky decomposition
• Indirect: Jacobi, Gauss-Seidel, gradient descent, conjugate gradient
Rough rule of thumb: if the problem fits easily in memory, go direct, else go indirect
Outline
Today:
• Coordinate descent
• Examples
• Implementation tricks
• Graphical lasso
• Screening rules
Coordinatewise optimality
We’ve seen some pretty sophisticated methods thus far
We now focus on a very simple technique that can be surprisingly efficient and scalable: coordinate descent, or more appropriately called coordinatewise minimization
Q: Given convex, differentiable f : R^n → R, if we are at a point x such that f(x) is minimized along each coordinate axis, then have we found a global minimizer?

That is, does f(x + δe_i) ≥ f(x) for all δ, i ⇒ f(x) = min_z f(z)?

(Here e_i = (0, . . . , 1, . . . , 0) ∈ R^n is the ith standard basis vector)
[Figure: surface plot of a smooth convex f over (x_1, x_2)]
A: Yes! Proof:

0 = ∇f(x) = ( ∂f/∂x_1 (x), . . . , ∂f/∂x_n (x) )
Q: Same question, but now for f convex, and not differentiable?
[Figure: surface plot and contour plot of a convex, nondifferentiable f over (x_1, x_2); the marked point is minimized along each coordinate axis but is not a global minimizer]
A: No! Look at the above counterexample
Q: Same, now f(x) = g(x) + ∑_{i=1}^n h_i(x_i), with g convex, smooth, and each h_i convex? (Here the nonsmooth part is called separable)
[Figure: surface plot and contour plot of f(x) = g(x) + ∑_i h_i(x_i); here coordinatewise minimization does reach the global minimizer]
A: Yes! Proof: using convexity of g and subgradient optimality,

f(y) − f(x) = g(y) − g(x) + ∑_{i=1}^n [h_i(y_i) − h_i(x_i)]
            ≥ ∑_{i=1}^n [∇_i g(x)(y_i − x_i) + h_i(y_i) − h_i(x_i)] ≥ 0

where each bracketed term is ≥ 0, by coordinatewise optimality of x
Coordinate descent
This suggests that for the problem

min_x f(x)

where f(x) = g(x) + ∑_{i=1}^n h_i(x_i), with g convex and differentiable and each h_i convex, we can use coordinate descent: let x^(0) ∈ R^n, and repeat

x_i^(k) = argmin_{x_i} f(x_1^(k), . . . , x_{i−1}^(k), x_i, x_{i+1}^(k−1), . . . , x_n^(k−1)),  i = 1, . . . , n

for k = 1, 2, 3, . . .
Important note: we always use most recent information possible
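The iteration above can be sketched directly in code. This is a minimal illustration, not from the slides; the objective `f`, the 2-d quadratic, and the cycle count are hypothetical examples, and a generic scalar minimizer stands in for the exact 1-d solves:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def coordinate_descent(f, x0, n_cycles=50):
    """Cyclic coordinatewise minimization: exactly minimize f over one
    coordinate at a time, always using the freshest values of the others."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_cycles):
        for i in range(len(x)):
            # 1-d problem in coordinate i, all other coordinates held fixed
            obj = lambda z: f(np.concatenate([x[:i], [z], x[i + 1:]]))
            x[i] = minimize_scalar(obj).x
    return x

# smooth convex example: f(x) = (x1 - 1)^2 + (x2 + 2)^2 + x1 * x2
f = lambda x: (x[0] - 1) ** 2 + (x[1] + 2) ** 2 + x[0] * x[1]
x = coordinate_descent(f, np.zeros(2))   # converges to (8/3, -10/3)
```

Note the inner lambda always reads the current `x`, so each 1-d solve uses the most recent information, as emphasized above.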
Tseng (2001) showed that for such f (provided f is continuous on the compact set {x : f(x) ≤ f(x^(0))} and f attains its minimum), any limit point of x^(k), k = 1, 2, 3, . . . is a minimizer of f¹
Notes:
• Order of cycle through coordinates is arbitrary, can use any permutation of {1, 2, . . . , n}
• Can everywhere replace individual coordinates with blocks of coordinates
• “One-at-a-time” update scheme is critical, and “all-at-once” scheme does not necessarily converge
• The analogy for solving linear systems: Gauss-Seidel versus Jacobi method
¹Using basic real analysis, we know x^(k) has a subsequence converging to x⋆ (Bolzano-Weierstrass), and f(x^(k)) converges to f⋆ (monotone convergence)
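The Gauss-Seidel versus Jacobi contrast can be seen numerically. This is an illustration, not from the slides; the 3×3 positive definite system is a hypothetical example chosen so that the "all-at-once" Jacobi scheme actually diverges while the "one-at-a-time" Gauss-Seidel scheme converges:

```python
import numpy as np

# A is symmetric positive definite, but not diagonally dominant:
# Jacobi ("all-at-once") diverges here, Gauss-Seidel ("one-at-a-time") converges
A = np.array([[1.0, 0.9, 0.9],
              [0.9, 1.0, 0.9],
              [0.9, 0.9, 1.0]])
b = np.array([1.0, 2.0, 3.0])
x_star = np.linalg.solve(A, b)

def jacobi(A, b, iters):
    # every coordinate is updated from the previous iterate
    x = np.zeros(len(b))
    D = np.diag(A)
    for _ in range(iters):
        x = (b - (A @ x - D * x)) / D
    return x

def gauss_seidel(A, b, iters):
    # each coordinate update uses the freshest values of the others,
    # i.e. exact coordinate descent on (1/2) x^T A x - b^T x
    x = np.zeros(len(b))
    for _ in range(iters):
        for i in range(len(b)):
            x[i] = (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]
    return x

gs_err = np.linalg.norm(gauss_seidel(A, b, 200) - x_star)
ja_err = np.linalg.norm(jacobi(A, b, 200) - x_star)
```

For symmetric positive definite systems Gauss-Seidel always converges, while Jacobi need not; this is the linear-algebra face of the "one-at-a-time versus all-at-once" remark above.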
Example: linear regression
Given y ∈ R^n, and X ∈ R^{n×p} with columns X_1, . . . , X_p, consider the linear regression problem:

min_β (1/2)‖y − Xβ‖_2^2

Minimizing over β_i, with all β_j, j ≠ i fixed:

0 = ∇_i f(β) = X_i^T (Xβ − y) = X_i^T (X_i β_i + X_{−i} β_{−i} − y)

i.e., we take

β_i = X_i^T (y − X_{−i} β_{−i}) / (X_i^T X_i)

Coordinate descent repeats this update for i = 1, 2, . . . , p, 1, 2, . . . Note that this is exactly Gauss-Seidel for the system X^T X β = X^T y
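The regression update above can be sketched as follows. This is a toy illustration, not the course code; the sizes, seed, and cycle count are arbitrary, and each column of X is assumed nonzero:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

beta = np.zeros(p)
for _ in range(200):                        # cycles
    for i in range(p):
        Xi = X[:, i]
        r_i = y - X @ beta + Xi * beta[i]   # y - X_{-i} beta_{-i}
        beta[i] = Xi @ r_i / (Xi @ Xi)      # exact coordinate minimizer

# the Gauss-Seidel fixed point is the least squares solution
beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
```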
Coordinate descent vs gradient descent for linear regression: 100 random instances with n = 100, p = 20
[Figure: suboptimality f^(k) − f⋆ versus iteration k, on a log scale; curves for coordinate descent, gradient descent (exact step), conjugate gradient (exact step), gradient descent (1/L step), accelerated gradient (1/L step)]
Is it fair to compare 1 cycle of coordinate descent to 1 iteration of gradient descent? Yes, if we're clever

• Gradient descent: β ← β + tX^T (y − Xβ), costs O(np) flops
• Coordinate descent, one coordinate update:

β_i ← X_i^T (y − X_{−i} β_{−i}) / (X_i^T X_i) = X_i^T r / ‖X_i‖_2^2 + β_i

where r = y − Xβ
• Each coordinate update costs O(n) flops: O(n) to update r, O(n) to compute X_i^T r
• One cycle of coordinate descent costs O(np) operations, same as gradient descent
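The bookkeeping just described, maintaining r = y − Xβ so that each coordinate costs O(n), can be sketched as follows (toy data with arbitrary sizes, nonzero columns assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 10
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
col_sq = np.sum(X ** 2, axis=0)    # ||X_i||_2^2, precomputed once, O(np)

beta = np.zeros(p)
r = y - X @ beta                   # residual, maintained across updates
for _ in range(200):
    for i in range(p):
        old = beta[i]
        beta[i] = X[:, i] @ r / col_sq[i] + old    # O(n) flops
        r -= X[:, i] * (beta[i] - old)             # O(n) residual update

beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
```

The naive update recomputes y − X_{−i}β_{−i} at O(np) per coordinate; keeping `r` current makes a full cycle cost O(np) total, matching one gradient descent iteration.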
Example: lasso regression
Consider the lasso problem:

min_β (1/2)‖y − Xβ‖_2^2 + λ‖β‖_1

Note that the nonsmooth part here is separable: ‖β‖_1 = ∑_{i=1}^p |β_i|. Minimizing over β_i, with β_j, j ≠ i fixed:

0 = X_i^T X_i β_i + X_i^T (X_{−i} β_{−i} − y) + λ s_i

where s_i ∈ ∂|β_i|. Solution is simply given by soft-thresholding

β_i = S_{λ/‖X_i‖_2^2} ( X_i^T (y − X_{−i} β_{−i}) / (X_i^T X_i) )

Repeat this for i = 1, 2, . . . , p, 1, 2, . . .
Coordinate descent vs proximal gradient for lasso regression: 100 random instances with n = 200, p = 50 (all methods cost O(np) per iter)
[Figure: suboptimality f^(k) − f⋆ versus iteration k, on a log scale; curves for coordinate descent, proximal gradient, accelerated proximal gradient]
Example: box-constrained QP
Given b ∈ R^n, Q ∈ S^n_+, consider a box-constrained QP:

min_x (1/2)x^T Q x + b^T x subject to l ≤ x ≤ u

Fits into our framework, as I{l ≤ x ≤ u} = ∑_{i=1}^n I{l_i ≤ x_i ≤ u_i}

Minimizing over x_i with all x_j, j ≠ i fixed: same basic steps give

x_i = T_{[l_i, u_i]} ( −(b_i + ∑_{j≠i} Q_{ij} x_j) / Q_{ii} )

where T_{[l_i, u_i]} is the truncation (projection) operator onto [l_i, u_i]:

T_{[l_i, u_i]}(z) = u_i if z > u_i; z if l_i ≤ z ≤ u_i; l_i if z < l_i
Example: support vector machines
A coordinate descent strategy can be applied to the SVM dual:

min_α (1/2)α^T XX^T α − 1^T α subject to 0 ≤ α ≤ C1, α^T y = 0

Sequential minimal optimization or SMO (Platt 1998) is basically blockwise coordinate descent in blocks of 2. Instead of cycling, it chooses the next block greedily

Recall the complementary slackness conditions

α_i (1 − ξ_i − (Xβ)_i − y_i β_0) = 0, i = 1, . . . , n   (1)

(C − α_i) ξ_i = 0, i = 1, . . . , n   (2)

where β, β_0, ξ are the primal coefficients, intercept, and slacks. Recall that β = X^T α, β_0 is computed from (1) using any i such that 0 < α_i < C, and ξ is computed from (1), (2)
SMO repeats the following two steps:
• Choose α_i, α_j that violate complementary slackness, greedily (using heuristics)
• Minimize over α_i, α_j exactly, keeping all other variables fixed

Using the equality constraint, this reduces to minimizing a univariate quadratic over an interval (figure from Platt 1998)
Note this does not meet the separability assumptions for convergence from Tseng (2001), and a different treatment is required

Many further developments on coordinate descent for SVMs have been made; e.g., a recent one is Hsieh et al. (2008)
Coordinate descent in statistics and ML
History in statistics/ML:
• Idea appeared in Fu (1998), and then again in Daubechies et al. (2004), but was inexplicably ignored
• Later, three papers in 2007, especially Friedman et al. (2007), really sparked interest in statistics and ML communities
Why is it used?
• Very simple and easy to implement
• Careful implementations can achieve state-of-the-art
• Scalable, e.g., don’t need to keep full data in memory
Examples: lasso regression, lasso GLMs (under proximal Newton), SVMs, group lasso, graphical lasso (applied to the dual), additive modeling, matrix completion, regression with nonconvex penalties
Pathwise coordinate descent for lasso
Structure for pathwise coordinate descent, Friedman et al. (2009):

Outer loop (pathwise strategy):
• Compute the solution over a sequence λ_1 > λ_2 > · · · > λ_r of tuning parameter values
• For tuning parameter value λ_k, initialize coordinate descent algorithm at the computed solution for λ_{k+1} (warm start)

Inner loop (active set strategy):
• Perform one coordinate cycle (or small number of cycles), and record active set A of coefficients that are nonzero
• Cycle over only the coefficients in A until convergence
• Check KKT conditions over all coefficients; if not all satisfied, add offending coefficients to A, go back one step
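The outer and inner loops above can be sketched together. This is a simplified illustration, not the glmnet implementation; the data, the λ sequence, and the fixed inner-cycle counts are arbitrary choices:

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def cd_cycle(X, y, beta, lam, coords):
    # one lasso coordinate descent cycle over the given coordinates
    col_sq = np.sum(X ** 2, axis=0)
    r = y - X @ beta
    for i in coords:
        old = beta[i]
        beta[i] = soft_threshold(X[:, i] @ r + col_sq[i] * old, lam) / col_sq[i]
        r -= X[:, i] * (beta[i] - old)
    return beta

def pathwise_lasso(X, y, lambdas):
    # lambdas sorted decreasing; warm-start each solve at the previous one
    p = X.shape[1]
    beta = np.zeros(p)
    path = []
    for lam in lambdas:
        for _ in range(100):                             # outer passes
            beta = cd_cycle(X, y, beta, lam, range(p))   # one full cycle
            active = np.flatnonzero(beta)                # record active set
            for _ in range(50):                          # iterate on active set
                beta = cd_cycle(X, y, beta, lam, active)
            g = X.T @ (y - X @ beta)                     # KKT check, all coords
            if np.all(np.abs(g) <= lam + 1e-6):
                break
        path.append(beta.copy())
    return path

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 20))
y = X[:, :3] @ np.ones(3) + 0.1 * rng.standard_normal(100)
lambdas = np.geomspace(np.max(np.abs(X.T @ y)), 1.0, 10)
path = pathwise_lasso(X, y, lambdas)
```

Starting the sequence at λ_1 = ‖Xᵀy‖_∞ means the first solution is exactly zero, and each later solve begins close to its answer.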
Notes:
• Even when the solution is desired at only one λ, the pathwise strategy (solving over λ_1 > · · · > λ_r = λ) is typically much more efficient than directly performing coordinate descent at λ
• Active set strategy takes advantage of sparsity; e.g., for large problems, coordinate descent for lasso is much faster than it is for ridge regression
• With these strategies in place (and a few more clever tricks), coordinate descent can be competitive with the fastest algorithms for ℓ_1 penalized minimization problems
• Fortran implementation glmnet, linked to R, MATLAB, etc.
Coordinate gradient descent
For a smooth function f, the iterations

x_i^(k) = x_i^(k−1) − t_{ki} · ∇_i f(x_1^(k), . . . , x_{i−1}^(k), x_i^(k−1), x_{i+1}^(k−1), . . . , x_n^(k−1)),  i = 1, . . . , n

for k = 1, 2, 3, . . . are called coordinate gradient descent, and when f = g + h, with g smooth and h = ∑_{i=1}^n h_i, the iterations

x_i^(k) = prox_{h_i, t_{ki}} ( x_i^(k−1) − t_{ki} · ∇_i g(x_1^(k), . . . , x_i^(k−1), . . . , x_n^(k−1)) ),  i = 1, . . . , n

for k = 1, 2, 3, . . . are called coordinate proximal gradient descent
When g is quadratic, (proximal) coordinate gradient descent is the same as coordinate descent under proper step sizes
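For a quadratic g this equivalence can be checked directly: a coordinate gradient step with step size t_{ki} = 1/Q_ii is algebraically identical to exact coordinate minimization. A toy illustration (arbitrary sizes and data):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((20, 5))
Q = A.T @ A + 0.1 * np.eye(5)   # Hessian of g(x) = (1/2) x^T Q x - c^T x
c = rng.standard_normal(5)

def exact_cd(Q, c, cycles):
    x = np.zeros(len(c))
    for _ in range(cycles):
        for i in range(len(c)):
            # exact 1-d minimization over coordinate i
            x[i] = (c[i] - Q[i] @ x + Q[i, i] * x[i]) / Q[i, i]
    return x

def coord_grad_descent(Q, c, cycles):
    x = np.zeros(len(c))
    for _ in range(cycles):
        for i in range(len(c)):
            grad_i = Q[i] @ x - c[i]     # gradient coordinate at current iterate
            x[i] -= grad_i / Q[i, i]     # step size t_{ki} = 1 / Q_{ii}
    return x

x_exact = exact_cd(Q, c, 50)
x_grad = coord_grad_descent(Q, c, 50)
```

Expanding the gradient step recovers the exact update: x_i − (Q_i x − c_i)/Q_ii = (c_i − ∑_{j≠i} Q_ij x_j)/Q_ii.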
Convergence analyses
Theory for coordinate descent moves quickly. Each combination of the following cases has (probably) been analyzed:
• Coordinate descent or (proximal) coordinate gradient descent?
• Cyclic rule, permuted cyclic rule, greedy rule, or randomized rule?
Roughly speaking, results are similar to those for proximal gradient descent: under standard conditions, get standard rates
But constants differ and this matters! Much recent work is focusedon improving them
Also, it is generally believed that coordinate descent should perform better than first-order methods (when implementable)
Some references are Beck and Tetruashvili (2013), Wright (2015),Sun and Hong (2015), Li et al. (2016)
Graphical lasso
Consider X ∈ R^{n×p}, with rows x_i ∼ N(0, Σ), i = 1, . . . , n, drawn independently. Suppose Σ is unknown. It is often reasonable (for large p) to seek a sparse estimate of Σ^{−1}

Why? For z ∼ N(0, Σ), normality theory tells us

Σ^{−1}_{ij} = 0 ⇐⇒ z_i, z_j conditionally independent given z_ℓ, ℓ ≠ i, j

Graphical lasso (Banerjee et al. 2007, Friedman et al. 2007):

min_{Θ ∈ S^p_+}  −log det Θ + tr(SΘ) + λ‖Θ‖_1

where S = X^T X / n, and ‖Θ‖_1 = ∑_{i,j=1}^p |Θ_{ij}|. Observe that this is a convex problem. The solution Θ̂ serves as an estimate for Σ^{−1}
Glasso algorithm
Graphical lasso KKT conditions (stationarity):

−Θ^{−1} + S + λΓ = 0

where Γ_{ij} ∈ ∂|Θ_{ij}|. Let W = Θ^{−1}. Note W_{ii} = S_{ii} + λ, because Θ_{ii} > 0 at the solution. Now partition each matrix into its first p − 1 rows/columns and its last row/column:

W = [ W_11  w_12 ; w_21  w_22 ],  Θ = [ Θ_11  θ_12 ; θ_21  θ_22 ],  S = [ S_11  s_12 ; s_21  s_22 ],  Γ = [ Γ_11  γ_12 ; γ_21  γ_22 ]

where W_11 ∈ R^{(p−1)×(p−1)}, w_12 ∈ R^{(p−1)×1}, w_21 ∈ R^{1×(p−1)}, w_22 ∈ R; same with the others

Glasso algorithm (Friedman et al., 2007): solve for w_12 (recall that w_22 is known), with all other columns fixed; then solve for the second-to-last column, etc., and cycle around until convergence
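A compact sketch of these column updates. This is not the authors' code; the form of each column's lasso subproblem (minimize (1/2)βᵀW_11β − s_12ᵀβ + λ‖β‖_1, then set w_12 = W_11 β̂) follows Friedman et al. (2007), while the sizes, λ, and iteration counts below are arbitrary:

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def glasso(S, lam, outer=50, inner=200):
    p = S.shape[0]
    W = S + lam * np.eye(p)       # W_ii = S_ii + lam, fixed throughout
    B = np.zeros((p, p))          # lasso coefficients, one column per update
    for _ in range(outer):
        for j in range(p):        # cycle over columns
            idx = np.arange(p) != j
            W11 = W[np.ix_(idx, idx)]
            s12 = S[idx, j]
            beta = B[idx, j]
            # lasso subproblem, solved by coordinate descent
            for _ in range(inner):
                for k in range(p - 1):
                    z = s12[k] - W11[k] @ beta + W11[k, k] * beta[k]
                    beta[k] = soft_threshold(z, lam) / W11[k, k]
            B[idx, j] = beta
            W[idx, j] = W11 @ beta            # updated column w12
            W[j, idx] = W[idx, j]             # keep W symmetric
    return W                                  # estimate of Sigma = Theta^{-1}

rng = np.random.default_rng(6)
X = rng.standard_normal((200, 5))
S = X.T @ X / 200
W = glasso(S, lam=0.1)
```

At convergence the KKT conditions give |W_ij − S_ij| ≤ λ off the diagonal and W_ii = S_ii + λ on it, which provides a direct sanity check.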
Screening rules
In some problems, screening rules can be used in combination with coordinate descent to further whittle down the active set. Screening rules themselves have amassed a huge literature recently
Originated with El Ghaoui et al. (2010), the SAFE rule for the lasso:

|X_i^T y| < λ − ‖X_i‖_2 ‖y‖_2 (λ_max − λ)/λ_max  ⇒  β_i = 0, for all i = 1, . . . , p

where λ_max = ‖X^T y‖_∞ (the smallest value of λ such that β = 0)
Note: this is not an if and only if statement! But it does give us a way of eliminating features a priori, without solving the lasso
(There have been many advances in screening rules for the lasso ...but this was the first, and one of the simplest)
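The SAFE test is a few lines of linear algebra, and its guarantee can be checked against an actual lasso solve. A toy illustration (arbitrary data; the coordinate descent solver below is the one sketched earlier in the lecture, with arbitrary cycle counts):

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

rng = np.random.default_rng(7)
X = rng.standard_normal((100, 30))
y = rng.standard_normal(100)

lam_max = np.max(np.abs(X.T @ y))
lam = 0.9 * lam_max

# SAFE test: True entries are guaranteed zero in the lasso solution at lam
col_norms = np.linalg.norm(X, axis=0)
safe_zero = (np.abs(X.T @ y)
             < lam - col_norms * np.linalg.norm(y) * (lam_max - lam) / lam_max)

# verify against an actual lasso solve (coordinate descent)
beta = np.zeros(30)
r = y.copy()
for _ in range(500):
    for i in range(30):
        old = beta[i]
        z = X[:, i] @ r + col_norms[i] ** 2 * old
        beta[i] = soft_threshold(z, lam) / col_norms[i] ** 2
        r -= X[:, i] * (beta[i] - old)
```

The screened coordinates can simply be dropped from X before running coordinate descent, shrinking the problem without changing the solution.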
SAFE rule derivation
Why is the SAFE rule true? The construction comes from the lasso dual:

max_u g(u) subject to ‖X^T u‖_∞ ≤ λ

where g(u) = (1/2)‖y‖_2^2 − (1/2)‖y − u‖_2^2. Suppose that u_0 is dual feasible (e.g., take u_0 = y · λ/λ_max). Then γ = g(u_0) is a lower bound on the dual optimal value, so the dual problem is equivalent to

max_u g(u) subject to ‖X^T u‖_∞ ≤ λ, g(u) ≥ γ

Now let m_i = max_u |X_i^T u| subject to g(u) ≥ γ, for i = 1, . . . , p. Then

m_i < λ ⇒ |X_i^T u| < λ at the dual solution ⇒ β_i = 0, i = 1, . . . , p

The last implication comes from the KKT conditions
[Figure: illustration of the SAFE rule construction (from El Ghaoui et al. 2010)]
Another dual argument shows that

max_u −X_i^T u subject to g(u) ≥ γ
  = min_{μ>0} { −γμ + (1/(2μ))‖μy − X_i‖_2^2 }
  = ‖X_i‖_2 √(‖y‖_2^2 − 2γ) − X_i^T y

where the last equality comes from direct calculation

Thus m_i is given by the maximum of the above quantity over ±X_i:

m_i = ‖X_i‖_2 √(‖y‖_2^2 − 2γ) + |X_i^T y|, i = 1, . . . , p

Lastly, substitute γ = g(y · λ/λ_max). Then m_i < λ is precisely the SAFE rule given on the previous slide
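The closed form for m_i can be checked numerically: the constraint g(u) ≥ γ is the ball ‖y − u‖_2 ≤ √(‖y‖_2^2 − 2γ), over which |X_i^T u| is maximized at u = y ± √(‖y‖_2^2 − 2γ) · X_i/‖X_i‖_2. A toy check (arbitrary data; γ chosen arbitrarily below the dual maximum ‖y‖_2^2/2):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 50
y = rng.standard_normal(n)
Xi = rng.standard_normal(n)          # one feature column

# g(u) >= gamma is the ball ||y - u||_2 <= rho, rho = sqrt(||y||^2 - 2 gamma)
gamma = 0.3 * 0.5 * (y @ y)          # any value below max_u g(u) = ||y||^2 / 2
rho = np.sqrt(y @ y - 2 * gamma)

# closed form for m_i from the slide
m_i = np.linalg.norm(Xi) * rho + abs(Xi @ y)

# direct check: |Xi^T u| over the ball is maximized at u = y +/- rho * Xi/||Xi||
u1 = y + rho * Xi / np.linalg.norm(Xi)
u2 = y - rho * Xi / np.linalg.norm(Xi)
m_direct = max(abs(Xi @ u1), abs(Xi @ u2))
```

Both candidate maximizers sit exactly on the boundary g(u) = γ, matching the derivation above.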
References
Early coordinate descent references in optimization:
• D. Bertsekas and J. Tsitsiklis (1989), “Parallel and distributed computation: numerical methods”, Chapter 3
• Z. Luo and P. Tseng (1992), “On the convergence of the coordinate descent method for convex differentiable minimization”
• J. Ortega and W. Rheinboldt (1970), “Iterative solution of nonlinear equations in several variables”
• P. Tseng (2001), “Convergence of a block coordinate descent method for nondifferentiable minimization”
• J. Warga (1969), “Minimizing certain convex functions”
Early coordinate descent references in statistics and ML:
• I. Daubechies and M. Defrise and C. De Mol (2004), “An iterative thresholding algorithm for linear inverse problems with a sparsity constraint”
• J. Friedman and T. Hastie and H. Hoefling and R. Tibshirani (2007), “Pathwise coordinate optimization”
• W. Fu (1998), “Penalized regressions: the bridge versus the lasso”
• T. Wu and K. Lange (2008), “Coordinate descent algorithms for lasso penalized regression”
• A. van der Kooij (2007), “Prediction accuracy and stability of regression with optimal scaling transformations”
Applications of coordinate descent:
• O. Banerjee and L. El Ghaoui and A. d’Aspremont (2007), “Model selection through sparse maximum likelihood estimation”
• J. Friedman and T. Hastie and R. Tibshirani (2007), “Sparse inverse covariance estimation with the graphical lasso”
• J. Friedman and T. Hastie and R. Tibshirani (2009), “Regularization paths for generalized linear models via coordinate descent”
• C.J. Hsieh and K.W. Chang and C.J. Lin and S. Keerthi and S. Sundararajan (2008), “A dual coordinate descent method for large-scale linear SVM”
• R. Mazumder and J. Friedman and T. Hastie (2011), “SparseNet: coordinate descent with non-convex penalties”
• J. Platt (1998), “Sequential minimal optimization: a fast algorithm for training support vector machines”
Recent theory for coordinate descent:
• A. Beck and L. Tetruashvili (2013), “On the convergence of block coordinate descent type methods”
• X. Li, T. Zhao, R. Arora, H. Liu, M. Hong (2016), “An improved convergence analysis of cyclic block coordinate descent-type methods for strongly convex minimization”
• Y. Nesterov (2010), “Efficiency of coordinate descent methods on huge-scale optimization problems”
• J. Nutini, M. Schmidt, I. Laradji, M. Friedlander, H. Koepke (2015), “Coordinate descent converges faster with the Gauss-Southwell rule than random selection”
• P. Richtarik and M. Takac (2011), “Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function”
• R. Sun and M. Hong (2015), “Improved iteration complexity bounds of cyclic block coordinate descent for convex problems”
• S. Wright (2015), “Coordinate descent algorithms”
Screening rules and graphical lasso:
• L. El Ghaoui and V. Viallon and T. Rabbani (2010), “Safe feature elimination in sparse supervised learning”
• R. Tibshirani, J. Bien, J. Friedman, T. Hastie, N. Simon, J. Taylor, and R. J. Tibshirani (2011), “Strong rules for discarding predictors in lasso-type problems”
• R. Mazumder and T. Hastie (2011), “The graphical lasso: new insights and alternatives”
• R. Mazumder and T. Hastie (2011), “Exact covariance thresholding into connected components for large-scale graphical lasso”
• J. Wang, P. Wonka, and J. Ye (2015), “Lasso screening rules via dual polytope projection”
• D. Witten and J. Friedman and N. Simon (2011), “New insights and faster computations for the graphical lasso”