coordinate descent - carnegie mellon...
TRANSCRIPT
Coordinate Descent
Ryan Tibshirani, Convex Optimization 10-725
Last time: numerical linear algebra primer
In R^n, rough flop counts for basic operations are as follows
• Vector-vector operations: n flops
• Matrix-vector multiplication: n^2 flops
• Matrix-matrix multiplication: n^3 flops
• Linear system solve: n^3 flops
Operations with banded or sparse matrices are much cheaper
Two classes of approaches for linear system solvers:
• Direct: QR decomposition, Cholesky decomposition
• Indirect: Jacobi, Gauss-Seidel, gradient descent, conjugate gradient
Rough rule of thumb: if the problem fits easily in memory, go direct, else go indirect
Outline
Today:
• Coordinate descent
• Examples
• Implementation tricks
• Graphical lasso
• Screening rules
Coordinatewise optimality
We’ve seen some pretty sophisticated methods thus far
We now focus on a very simple technique that can be surprisingly efficient and scalable: coordinate descent, or more appropriately called coordinatewise minimization
Q: Given convex, differentiable f : R^n → R, if we are at a point x such that f(x) is minimized along each coordinate axis, then have we found a global minimizer?

That is, does f(x + δe_i) ≥ f(x) for all δ, i ⇒ f(x) = min_z f(z)?

(Here e_i = (0, . . . , 1, . . . , 0) ∈ R^n is the ith standard basis vector)
[Figure: surface plot of a smooth convex f over (x_1, x_2)]
A: Yes! Proof:

0 = ∇f(x) = ( ∂f/∂x_1 (x), . . . , ∂f/∂x_n (x) )
Q: Same question, but now for f convex, and not differentiable?
[Figure: surface plot and contour plot of a convex, nondifferentiable f over (x_1, x_2); the marked point is minimized along each coordinate axis but is not a global minimizer]
A: No! Look at the above counterexample
Q: Same, now f(x) = g(x) + ∑_{i=1}^n h_i(x_i), with g convex, smooth, and each h_i convex? (Here the nonsmooth part is called separable)
[Figure: surface plot and contour plot of f(x) = g(x) + ∑_i h_i(x_i); here coordinatewise minimization does reach the global minimizer]
A: Yes! Proof: using convexity of g and subgradient optimality,

f(y) − f(x) = g(y) − g(x) + ∑_{i=1}^n [h_i(y_i) − h_i(x_i)]
            ≥ ∑_{i=1}^n [∇_i g(x)(y_i − x_i) + h_i(y_i) − h_i(x_i)] ≥ 0

where each bracketed term is ≥ 0, by coordinatewise optimality of x
Coordinate descent
This suggests that for the problem

min_x f(x)

where f(x) = g(x) + ∑_{i=1}^n h_i(x_i), with g convex and differentiable and each h_i convex, we can use coordinate descent: let x^(0) ∈ R^n, and repeat

x_i^(k) = argmin_{x_i} f(x_1^(k), . . . , x_{i−1}^(k), x_i, x_{i+1}^(k−1), . . . , x_n^(k−1)),  i = 1, . . . , n

for k = 1, 2, 3, . . .
Important note: we always use most recent information possible
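The iteration above can be sketched directly in code. This is a minimal illustration, not from the slides; the objective `f`, the 2-d quadratic, and the cycle count are hypothetical examples, and a generic scalar minimizer stands in for the exact 1-d solves:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def coordinate_descent(f, x0, n_cycles=50):
    """Cyclic coordinatewise minimization: exactly minimize f over one
    coordinate at a time, always using the freshest values of the others."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_cycles):
        for i in range(len(x)):
            # 1-d problem in coordinate i, all other coordinates held fixed
            obj = lambda z: f(np.concatenate([x[:i], [z], x[i + 1:]]))
            x[i] = minimize_scalar(obj).x
    return x

# smooth convex example: f(x) = (x1 - 1)^2 + (x2 + 2)^2 + x1 * x2
f = lambda x: (x[0] - 1) ** 2 + (x[1] + 2) ** 2 + x[0] * x[1]
x = coordinate_descent(f, np.zeros(2))   # converges to (8/3, -10/3)
```

Note the inner lambda always reads the current `x`, so each 1-d solve uses the most recent information, as emphasized above.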
Tseng (2001) showed that for such f (provided f is continuous on the compact set {x : f(x) ≤ f(x^(0))} and f attains its minimum), any limit point of x^(k), k = 1, 2, 3, . . . is a minimizer of f¹
Notes:
• Order of cycle through coordinates is arbitrary, can use any permutation of {1, 2, . . . , n}
• Can everywhere replace individual coordinates with blocks of coordinates
• “One-at-a-time” update scheme is critical, and “all-at-once” scheme does not necessarily converge
• The analogy for solving linear systems: Gauss-Seidel versus Jacobi method
¹Using basic real analysis, we know x^(k) has a subsequence converging to x⋆ (Bolzano-Weierstrass), and f(x^(k)) converges to f⋆ (monotone convergence)
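The Gauss-Seidel versus Jacobi contrast can be seen numerically. This is an illustration, not from the slides; the 3×3 positive definite system is a hypothetical example chosen so that the "all-at-once" Jacobi scheme actually diverges while the "one-at-a-time" Gauss-Seidel scheme converges:

```python
import numpy as np

# A is symmetric positive definite, but not diagonally dominant:
# Jacobi ("all-at-once") diverges here, Gauss-Seidel ("one-at-a-time") converges
A = np.array([[1.0, 0.9, 0.9],
              [0.9, 1.0, 0.9],
              [0.9, 0.9, 1.0]])
b = np.array([1.0, 2.0, 3.0])
x_star = np.linalg.solve(A, b)

def jacobi(A, b, iters):
    # every coordinate is updated from the previous iterate
    x = np.zeros(len(b))
    D = np.diag(A)
    for _ in range(iters):
        x = (b - (A @ x - D * x)) / D
    return x

def gauss_seidel(A, b, iters):
    # each coordinate update uses the freshest values of the others,
    # i.e. exact coordinate descent on (1/2) x^T A x - b^T x
    x = np.zeros(len(b))
    for _ in range(iters):
        for i in range(len(b)):
            x[i] = (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]
    return x

gs_err = np.linalg.norm(gauss_seidel(A, b, 200) - x_star)
ja_err = np.linalg.norm(jacobi(A, b, 200) - x_star)
```

For symmetric positive definite systems Gauss-Seidel always converges, while Jacobi need not; this is the linear-algebra face of the "one-at-a-time versus all-at-once" remark above.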
Example: linear regression
Given y ∈ R^n, and X ∈ R^{n×p} with columns X_1, . . . , X_p, consider the linear regression problem:

min_β (1/2)‖y − Xβ‖_2^2

Minimizing over β_i, with all β_j, j ≠ i fixed:

0 = ∇_i f(β) = X_i^T (Xβ − y) = X_i^T (X_i β_i + X_{−i} β_{−i} − y)

i.e., we take

β_i = X_i^T (y − X_{−i} β_{−i}) / (X_i^T X_i)

Coordinate descent repeats this update for i = 1, 2, . . . , p, 1, 2, . . . Note that this is exactly Gauss-Seidel for the system X^T X β = X^T y
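The regression update above can be sketched as follows. This is a toy illustration, not the course code; the sizes, seed, and cycle count are arbitrary, and each column of X is assumed nonzero:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

beta = np.zeros(p)
for _ in range(200):                        # cycles
    for i in range(p):
        Xi = X[:, i]
        r_i = y - X @ beta + Xi * beta[i]   # y - X_{-i} beta_{-i}
        beta[i] = Xi @ r_i / (Xi @ Xi)      # exact coordinate minimizer

# the Gauss-Seidel fixed point is the least squares solution
beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
```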
Coordinate descent vs gradient descent for linear regression: 100 random instances with n = 100, p = 20
[Figure: suboptimality f^(k) − f⋆ versus iteration k, on a log scale; curves for coordinate descent, gradient descent (exact step), conjugate gradient (exact step), gradient descent (1/L step), accelerated gradient (1/L step)]
Is it fair to compare 1 cycle of coordinate descent to 1 iteration of gradient descent? Yes, if we're clever

• Gradient descent: β ← β + tX^T (y − Xβ), costs O(np) flops
• Coordinate descent, one coordinate update:

β_i ← X_i^T (y − X_{−i} β_{−i}) / (X_i^T X_i) = X_i^T r / ‖X_i‖_2^2 + β_i

where r = y − Xβ
• Each coordinate update costs O(n) flops: O(n) to update r, O(n) to compute X_i^T r
• One cycle of coordinate descent costs O(np) operations, same as gradient descent
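The bookkeeping just described, maintaining r = y − Xβ so that each coordinate costs O(n), can be sketched as follows (toy data with arbitrary sizes, nonzero columns assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 10
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
col_sq = np.sum(X ** 2, axis=0)    # ||X_i||_2^2, precomputed once, O(np)

beta = np.zeros(p)
r = y - X @ beta                   # residual, maintained across updates
for _ in range(200):
    for i in range(p):
        old = beta[i]
        beta[i] = X[:, i] @ r / col_sq[i] + old    # O(n) flops
        r -= X[:, i] * (beta[i] - old)             # O(n) residual update

beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
```

The naive update recomputes y − X_{−i}β_{−i} at O(np) per coordinate; keeping `r` current makes a full cycle cost O(np) total, matching one gradient descent iteration.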
Example: lasso regression
Consider the lasso problem:

min_β (1/2)‖y − Xβ‖_2^2 + λ‖β‖_1

Note that the nonsmooth part here is separable: ‖β‖_1 = ∑_{i=1}^p |β_i|. Minimizing over β_i, with β_j, j ≠ i fixed:

0 = X_i^T X_i β_i + X_i^T (X_{−i} β_{−i} − y) + λ s_i

where s_i ∈ ∂|β_i|. Solution is simply given by soft-thresholding

β_i = S_{λ/‖X_i‖_2^2} ( X_i^T (y − X_{−i} β_{−i}) / (X_i^T X_i) )

Repeat this for i = 1, 2, . . . , p, 1, 2, . . .
Coordinate descent vs proximal gradient for lasso regression: 100 random instances with n = 200, p = 50 (all methods cost O(np) per iter)
[Figure: suboptimality f^(k) − f⋆ versus iteration k, on a log scale; curves for coordinate descent, proximal gradient, accelerated proximal gradient]
Example: box-constrained QP
Given b ∈ R^n, Q ∈ S^n_+, consider a box-constrained QP:

min_x (1/2)x^T Q x + b^T x subject to l ≤ x ≤ u

Fits into our framework, as I{l ≤ x ≤ u} = ∑_{i=1}^n I{l_i ≤ x_i ≤ u_i}

Minimizing over x_i with all x_j, j ≠ i fixed: same basic steps give

x_i = T_{[l_i, u_i]} ( −(b_i + ∑_{j≠i} Q_{ij} x_j) / Q_{ii} )

where T_{[l_i, u_i]} is the truncation (projection) operator onto [l_i, u_i]:

T_{[l_i, u_i]}(z) = u_i if z > u_i; z if l_i ≤ z ≤ u_i; l_i if z < l_i
Example: support vector machines
A coordinate descent strategy can be applied to the SVM dual:

min_α (1/2)α^T XX^T α − 1^T α subject to 0 ≤ α ≤ C1, α^T y = 0

Sequential minimal optimization or SMO (Platt 1998) is basically blockwise coordinate descent in blocks of 2. Instead of cycling, it chooses the next block greedily

Recall the complementary slackness conditions

α_i (1 − ξ_i − (Xβ)_i − y_i β_0) = 0, i = 1, . . . , n   (1)

(C − α_i) ξ_i = 0, i = 1, . . . , n   (2)

where β, β_0, ξ are the primal coefficients, intercept, and slacks. Recall that β = X^T α, β_0 is computed from (1) using any i such that 0 < α_i < C, and ξ is computed from (1), (2)
SMO repeats the following two steps:
• Choose α_i, α_j that violate complementary slackness, greedily (using heuristics)
• Minimize over α_i, α_j exactly, keeping all other variables fixed

Using the equality constraint, this reduces to minimizing a univariate quadratic over an interval (figure from Platt 1998)
Note this does not meet the separability assumptions for convergence from Tseng (2001), and a different treatment is required

Many further developments on coordinate descent for SVMs have been made; e.g., a recent one is Hsieh et al. (2008)
Coordinate descent in statistics and ML
History in statistics/ML:
• Idea appeared in Fu (1998), and then again in Daubechies et al. (2004), but was inexplicably ignored
• Later, three papers in 2007, especially Friedman et al. (2007), really sparked interest in statistics and ML communities
Why is it used?
• Very simple and easy to implement
• Careful implementations can achieve state-of-the-art
• Scalable, e.g., don’t need to keep full data in memory
Examples: lasso regression, lasso GLMs (under proximal Newton), SVMs, group lasso, graphical lasso (applied to the dual), additive modeling, matrix completion, regression with nonconvex penalties
Pathwise coordinate descent for lasso
Structure for pathwise coordinate descent, Friedman et al. (2009):

Outer loop (pathwise strategy):
• Compute the solution over a sequence λ_1 > λ_2 > · · · > λ_r of tuning parameter values
• For tuning parameter value λ_k, initialize coordinate descent algorithm at the computed solution for λ_{k+1} (warm start)

Inner loop (active set strategy):
• Perform one coordinate cycle (or small number of cycles), and record active set A of coefficients that are nonzero
• Cycle over only the coefficients in A until convergence
• Check KKT conditions over all coefficients; if not all satisfied, add offending coefficients to A, go back one step
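The outer and inner loops above can be sketched together. This is a simplified illustration, not the glmnet implementation; the data, the λ sequence, and the fixed inner-cycle counts are arbitrary choices:

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def cd_cycle(X, y, beta, lam, coords):
    # one lasso coordinate descent cycle over the given coordinates
    col_sq = np.sum(X ** 2, axis=0)
    r = y - X @ beta
    for i in coords:
        old = beta[i]
        beta[i] = soft_threshold(X[:, i] @ r + col_sq[i] * old, lam) / col_sq[i]
        r -= X[:, i] * (beta[i] - old)
    return beta

def pathwise_lasso(X, y, lambdas):
    # lambdas sorted decreasing; warm-start each solve at the previous one
    p = X.shape[1]
    beta = np.zeros(p)
    path = []
    for lam in lambdas:
        for _ in range(100):                             # outer passes
            beta = cd_cycle(X, y, beta, lam, range(p))   # one full cycle
            active = np.flatnonzero(beta)                # record active set
            for _ in range(50):                          # iterate on active set
                beta = cd_cycle(X, y, beta, lam, active)
            g = X.T @ (y - X @ beta)                     # KKT check, all coords
            if np.all(np.abs(g) <= lam + 1e-6):
                break
        path.append(beta.copy())
    return path

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 20))
y = X[:, :3] @ np.ones(3) + 0.1 * rng.standard_normal(100)
lambdas = np.geomspace(np.max(np.abs(X.T @ y)), 1.0, 10)
path = pathwise_lasso(X, y, lambdas)
```

Starting the sequence at λ_1 = ‖Xᵀy‖_∞ means the first solution is exactly zero, and each later solve begins close to its answer.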
Notes:
• Even when the solution is desired at only one λ, the pathwise strategy (solving over λ_1 > · · · > λ_r = λ) is typically much more efficient than directly performing coordinate descent at λ
• Active set strategy takes advantage of sparsity; e.g., for large problems, coordinate descent for lasso is much faster than it is for ridge regression
• With these strategies in place (and a few more clever tricks), coordinate descent can be competitive with the fastest algorithms for ℓ_1 penalized minimization problems
• Fortran implementation glmnet, linked to R, MATLAB, etc.
Coordinate gradient descent
For a smooth function f, the iterations

x_i^(k) = x_i^(k−1) − t_{ki} · ∇_i f(x_1^(k), . . . , x_{i−1}^(k), x_i^(k−1), x_{i+1}^(k−1), . . . , x_n^(k−1)),  i = 1, . . . , n

for k = 1, 2, 3, . . . are called coordinate gradient descent, and when f = g + h, with g smooth and h = ∑_{i=1}^n h_i, the iterations

x_i^(k) = prox_{h_i, t_{ki}} ( x_i^(k−1) − t_{ki} · ∇_i g(x_1^(k), . . . , x_i^(k−1), . . . , x_n^(k−1)) ),  i = 1, . . . , n

for k = 1, 2, 3, . . . are called coordinate proximal gradient descent
When g is quadratic, (proximal) coordinate gradient descent is the same as coordinate descent under proper step sizes
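For a quadratic g this equivalence can be checked directly: a coordinate gradient step with step size t_{ki} = 1/Q_ii is algebraically identical to exact coordinate minimization. A toy illustration (arbitrary sizes and data):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((20, 5))
Q = A.T @ A + 0.1 * np.eye(5)   # Hessian of g(x) = (1/2) x^T Q x - c^T x
c = rng.standard_normal(5)

def exact_cd(Q, c, cycles):
    x = np.zeros(len(c))
    for _ in range(cycles):
        for i in range(len(c)):
            # exact 1-d minimization over coordinate i
            x[i] = (c[i] - Q[i] @ x + Q[i, i] * x[i]) / Q[i, i]
    return x

def coord_grad_descent(Q, c, cycles):
    x = np.zeros(len(c))
    for _ in range(cycles):
        for i in range(len(c)):
            grad_i = Q[i] @ x - c[i]     # gradient coordinate at current iterate
            x[i] -= grad_i / Q[i, i]     # step size t_{ki} = 1 / Q_{ii}
    return x

x_exact = exact_cd(Q, c, 50)
x_grad = coord_grad_descent(Q, c, 50)
```

Expanding the gradient step recovers the exact update: x_i − (Q_i x − c_i)/Q_ii = (c_i − ∑_{j≠i} Q_ij x_j)/Q_ii.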
Convergence analyses
Theory for coordinate descent moves quickly. Each combination of the following cases has (probably) been analyzed:
• Coordinate descent or (proximal) coordinate gradient descent?
• Cyclic rule, permuted cyclic rule, greedy rule, or randomized rule?
Roughly speaking, results are similar to those for proximal gradient descent: under standard conditions, get standard rates
But constants differ and this matters! Much recent work is focusedon improving them
Also, it is generally believed that coordinate descent should perform better than first-order methods (when implementable)
Some references are Beck and Tetruashvili (2013), Wright (2015),Sun and Hong (2015), Li et al. (2016)
Graphical lasso
Consider X ∈ R^{n×p}, with rows x_i ∼ N(0, Σ), i = 1, . . . , n, drawn independently. Suppose Σ is unknown. It is often reasonable (for large p) to seek a sparse estimate of Σ^{−1}

Why? For z ∼ N(0, Σ), normality theory tells us

Σ^{−1}_{ij} = 0 ⇐⇒ z_i, z_j conditionally independent given z_ℓ, ℓ ≠ i, j

Graphical lasso (Banerjee et al. 2007, Friedman et al. 2007):

min_{Θ ∈ S^p_+}  −log det Θ + tr(SΘ) + λ‖Θ‖_1

where S = X^T X / n, and ‖Θ‖_1 = ∑_{i,j=1}^p |Θ_{ij}|. Observe that this is a convex problem. The solution Θ̂ serves as an estimate for Σ^{−1}
Glasso algorithm
Graphical lasso KKT conditions (stationarity):

−Θ^{−1} + S + λΓ = 0

where Γ_{ij} ∈ ∂|Θ_{ij}|. Let W = Θ^{−1}. Note W_{ii} = S_{ii} + λ, because Θ_{ii} > 0 at the solution. Now partition each matrix into its first p − 1 rows/columns and its last row/column:

W = [ W_11  w_12 ; w_21  w_22 ],  Θ = [ Θ_11  θ_12 ; θ_21  θ_22 ],  S = [ S_11  s_12 ; s_21  s_22 ],  Γ = [ Γ_11  γ_12 ; γ_21  γ_22 ]

where W_11 ∈ R^{(p−1)×(p−1)}, w_12 ∈ R^{(p−1)×1}, w_21 ∈ R^{1×(p−1)}, w_22 ∈ R; same with the others

Glasso algorithm (Friedman et al., 2007): solve for w_12 (recall that w_22 is known), with all other columns fixed; then solve for the second-to-last column, etc., and cycle around until convergence
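A compact sketch of these column updates. This is not the authors' code; the form of each column's lasso subproblem (minimize (1/2)βᵀW_11β − s_12ᵀβ + λ‖β‖_1, then set w_12 = W_11 β̂) follows Friedman et al. (2007), while the sizes, λ, and iteration counts below are arbitrary:

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def glasso(S, lam, outer=50, inner=200):
    p = S.shape[0]
    W = S + lam * np.eye(p)       # W_ii = S_ii + lam, fixed throughout
    B = np.zeros((p, p))          # lasso coefficients, one column per update
    for _ in range(outer):
        for j in range(p):        # cycle over columns
            idx = np.arange(p) != j
            W11 = W[np.ix_(idx, idx)]
            s12 = S[idx, j]
            beta = B[idx, j]
            # lasso subproblem, solved by coordinate descent
            for _ in range(inner):
                for k in range(p - 1):
                    z = s12[k] - W11[k] @ beta + W11[k, k] * beta[k]
                    beta[k] = soft_threshold(z, lam) / W11[k, k]
            B[idx, j] = beta
            W[idx, j] = W11 @ beta            # updated column w12
            W[j, idx] = W[idx, j]             # keep W symmetric
    return W                                  # estimate of Sigma = Theta^{-1}

rng = np.random.default_rng(6)
X = rng.standard_normal((200, 5))
S = X.T @ X / 200
W = glasso(S, lam=0.1)
```

At convergence the KKT conditions give |W_ij − S_ij| ≤ λ off the diagonal and W_ii = S_ii + λ on it, which provides a direct sanity check.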
Screening rules
In some problems, screening rules can be used in combination with coordinate descent to further whittle down the active set. Screening rules themselves have amassed a huge literature recently
Originated with El Ghaoui et al. (2010), the SAFE rule for the lasso:

|X_i^T y| < λ − ‖X_i‖_2 ‖y‖_2 (λ_max − λ)/λ_max  ⇒  β_i = 0, for all i = 1, . . . , p

where λ_max = ‖X^T y‖_∞ (the smallest value of λ such that β = 0)
Note: this is not an if and only if statement! But it does give us a way of eliminating features a priori, without solving the lasso
(There have been many advances in screening rules for the lasso ...but this was the first, and one of the simplest)
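The SAFE test is a few lines of linear algebra, and its guarantee can be checked against an actual lasso solve. A toy illustration (arbitrary data; the coordinate descent solver below is the one sketched earlier in the lecture, with arbitrary cycle counts):

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

rng = np.random.default_rng(7)
X = rng.standard_normal((100, 30))
y = rng.standard_normal(100)

lam_max = np.max(np.abs(X.T @ y))
lam = 0.9 * lam_max

# SAFE test: True entries are guaranteed zero in the lasso solution at lam
col_norms = np.linalg.norm(X, axis=0)
safe_zero = (np.abs(X.T @ y)
             < lam - col_norms * np.linalg.norm(y) * (lam_max - lam) / lam_max)

# verify against an actual lasso solve (coordinate descent)
beta = np.zeros(30)
r = y.copy()
for _ in range(500):
    for i in range(30):
        old = beta[i]
        z = X[:, i] @ r + col_norms[i] ** 2 * old
        beta[i] = soft_threshold(z, lam) / col_norms[i] ** 2
        r -= X[:, i] * (beta[i] - old)
```

The screened coordinates can simply be dropped from X before running coordinate descent, shrinking the problem without changing the solution.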
SAFE rule derivation
Why is the SAFE rule true? The construction comes from the lasso dual:

max_u g(u) subject to ‖X^T u‖_∞ ≤ λ

where g(u) = (1/2)‖y‖_2^2 − (1/2)‖y − u‖_2^2. Suppose that u_0 is dual feasible (e.g., take u_0 = y · λ/λ_max). Then γ = g(u_0) is a lower bound on the dual optimal value, so the dual problem is equivalent to

max_u g(u) subject to ‖X^T u‖_∞ ≤ λ, g(u) ≥ γ

Now let m_i = max_u |X_i^T u| subject to g(u) ≥ γ, for i = 1, . . . , p. Then

m_i < λ ⇒ |X_i^T u| < λ at the dual solution ⇒ β_i = 0, i = 1, . . . , p

The last implication comes from the KKT conditions
[Figure: illustration of the SAFE rule construction (from El Ghaoui et al. 2010)]
Another dual argument shows that

max_u −X_i^T u subject to g(u) ≥ γ
  = min_{μ>0} { −γμ + (1/(2μ))‖μy − X_i‖_2^2 }
  = ‖X_i‖_2 √(‖y‖_2^2 − 2γ) − X_i^T y

where the last equality comes from direct calculation

Thus m_i is given by the maximum of the above quantity over ±X_i:

m_i = ‖X_i‖_2 √(‖y‖_2^2 − 2γ) + |X_i^T y|, i = 1, . . . , p

Lastly, substitute γ = g(y · λ/λ_max). Then m_i < λ is precisely the SAFE rule given on the previous slide
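The closed form for m_i can be checked numerically: the constraint g(u) ≥ γ is the ball ‖y − u‖_2 ≤ √(‖y‖_2^2 − 2γ), over which |X_i^T u| is maximized at u = y ± √(‖y‖_2^2 − 2γ) · X_i/‖X_i‖_2. A toy check (arbitrary data; γ chosen arbitrarily below the dual maximum ‖y‖_2^2/2):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 50
y = rng.standard_normal(n)
Xi = rng.standard_normal(n)          # one feature column

# g(u) >= gamma is the ball ||y - u||_2 <= rho, rho = sqrt(||y||^2 - 2 gamma)
gamma = 0.3 * 0.5 * (y @ y)          # any value below max_u g(u) = ||y||^2 / 2
rho = np.sqrt(y @ y - 2 * gamma)

# closed form for m_i from the slide
m_i = np.linalg.norm(Xi) * rho + abs(Xi @ y)

# direct check: |Xi^T u| over the ball is maximized at u = y +/- rho * Xi/||Xi||
u1 = y + rho * Xi / np.linalg.norm(Xi)
u2 = y - rho * Xi / np.linalg.norm(Xi)
m_direct = max(abs(Xi @ u1), abs(Xi @ u2))
```

Both candidate maximizers sit exactly on the boundary g(u) = γ, matching the derivation above.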
References
Early coordinate descent references in optimization:
• D. Bertsekas and J. Tsitsiklis (1989), “Parallel and distributed computation: numerical methods”, Chapter 3
• Z. Luo and P. Tseng (1992), “On the convergence of the coordinate descent method for convex differentiable minimization”
• J. Ortega and W. Rheinboldt (1970), “Iterative solution of nonlinear equations in several variables”
• P. Tseng (2001), “Convergence of a block coordinate descent method for nondifferentiable minimization”
• J. Warga (1969), “Minimizing certain convex functions”
Early coordinate descent references in statistics and ML:
• I. Daubechies and M. Defrise and C. De Mol (2004), “An iterative thresholding algorithm for linear inverse problems with a sparsity constraint”
• J. Friedman and T. Hastie and H. Hoefling and R. Tibshirani (2007), “Pathwise coordinate optimization”
• W. Fu (1998), “Penalized regressions: the bridge versus the lasso”
• T. Wu and K. Lange (2008), “Coordinate descent algorithms for lasso penalized regression”
• A. van der Kooij (2007), “Prediction accuracy and stability of regression with optimal scaling transformations”
Applications of coordinate descent:
• O. Banerjee and L. El Ghaoui and A. d’Aspremont (2007), “Model selection through sparse maximum likelihood estimation”
• J. Friedman and T. Hastie and R. Tibshirani (2007), “Sparse inverse covariance estimation with the graphical lasso”
• J. Friedman and T. Hastie and R. Tibshirani (2009), “Regularization paths for generalized linear models via coordinate descent”
• C.J. Hsieh and K.W. Chang and C.J. Lin and S. Keerthi and S. Sundararajan (2008), “A dual coordinate descent method for large-scale linear SVM”
• R. Mazumder and J. Friedman and T. Hastie (2011), “SparseNet: coordinate descent with non-convex penalties”
• J. Platt (1998), “Sequential minimal optimization: a fast algorithm for training support vector machines”
Recent theory for coordinate descent:
• A. Beck and L. Tetruashvili (2013), “On the convergence of block coordinate descent type methods”
• X. Li, T. Zhao, R. Arora, H. Liu, M. Hong (2016), “An improved convergence analysis of cyclic block coordinate descent-type methods for strongly convex minimization”
• Y. Nesterov (2010), “Efficiency of coordinate descent methods on huge-scale optimization problems”
• J. Nutini, M. Schmidt, I. Laradji, M. Friedlander, H. Koepke (2015), “Coordinate descent converges faster with the Gauss-Southwell rule than random selection”
• P. Richtarik and M. Takac (2011), “Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function”
• R. Sun and M. Hong (2015), “Improved iteration complexity bounds of cyclic block coordinate descent for convex problems”
• S. Wright (2015), “Coordinate descent algorithms”
Screening rules and graphical lasso:
• L. El Ghaoui and V. Viallon and T. Rabbani (2010), “Safe feature elimination in sparse supervised learning”
• R. Tibshirani, J. Bien, J. Friedman, T. Hastie, N. Simon, J. Taylor, and R. J. Tibshirani (2011), “Strong rules for discarding predictors in lasso-type problems”
• R. Mazumder and T. Hastie (2011), “The graphical lasso: new insights and alternatives”
• R. Mazumder and T. Hastie (2011), “Exact covariance thresholding into connected components for large-scale graphical lasso”
• J. Wang, P. Wonka, and J. Ye (2015), “Lasso screening rules via dual polytope projection”
• D. Witten and J. Friedman and N. Simon (2011), “New insights and faster computations for the graphical lasso”