
Sparse Inverse Covariance Estimation

Peder Olsen, Figen Oztoprak, Jorge Nocedal and Steven Rennie

Summer Tutorial at IBM TJ Watson Research Center

August 2, 2012

The optimization problem

Problem: Given an empirical covariance matrix S,

S = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)(x_i - \mu)^\top,

find a sparse inverse covariance matrix P to represent the data.

Approach: Minimize the objective function

\min_{P \succ 0} F(P) \stackrel{\mathrm{def}}{=} L(P) + \lambda \|\mathrm{vec}(P)\|_1, \qquad L(P) = -\log\det(P) + \mathrm{trace}(SP).

L is the negative log-likelihood function and the ℓ1 term is a sparsity-inducing regularizer.
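As a concrete reference, here is a minimal sketch of evaluating this objective in Python/NumPy (the function name `objective` and the penalty weight `lam` are ours, not from the talk):

```python
import numpy as np

def objective(P, S, lam):
    """F(P) = -log det(P) + trace(S P) + lam * ||vec(P)||_1 (assumes P is symmetric)."""
    sign, logdet = np.linalg.slogdet(P)
    if sign <= 0:
        return np.inf                 # outside the positive definite cone
    return -logdet + np.trace(S @ P) + lam * np.abs(P).sum()
```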


Problem History

Finding a sparse inverse covariance: Dempster (1972).

Covariance selection: the name the problem goes by.

Convex ℓ1 approach: Banerjee (2006).

First order methods: covsel, glasso, psm, smacs, sinco, alm.

Second order methods: ipm, quic, plus our methods.


First Order Solvers

covsel A block-coordinate descent that solves the dual problem one row at a time. (2006) O. Banerjee, L. El Ghaoui, A. d'Aspremont, and G. Natsoulis.

glasso Graphical LASSO. One of the more popular solvers. It solves the primal problem one row at a time. (2008) A. d'Aspremont, O. Banerjee, and L. El Ghaoui.

psm Projected Sub-gradient Method. (2008) P. Duchi, S. Gould, and D. Koller.

smacs Smooth Minimization Algorithm for Covariance Selection. An optimal first order ascent method. (2009) Z. Lu.


More Solvers

sinco Sparse INverse COvariances. A greedy coordinate descent method intended for massively parallel computation. (2009) K. Scheinberg and I. Rish.

alm Alternating linearization method. Uses an augmented Lagrangian to introduce an auxiliary variable for the non-smooth term. (2010) K. Scheinberg, S. Ma, and D. Goldfarb.

ipm Interior Point Method. A second order method. (2010) L. Li and K. C. Toh.

quic A second order Newton method that solves the LASSO subproblem using coordinate descent. (2011) C. J. Hsieh, M. A. Sustik, P. Ravikumar, and I. S. Dhillon.


Implementations of almost all of these methods are available for download. Among the first-order solvers, glasso and alm are the most popular. quic is extremely (!) fast for very sparse problems. A good overview of the problem is available from Irina Rish's and Genady Grabarnik's lectures on sparse signal modeling: ELEN E6898 Sparse Signal Modeling (Spring 2011).


Problem Extensions

The penalty is a bit too simplistic. Consider the more general penalty term

\sum_{ij} \lambda_{ij} |P_{ij}|.

Since P_{ii} > 0 is forced by the positive-definiteness requirement, we choose \lambda_{ii} = 0. We have found \lambda_{ij} \propto \frac{1}{\sqrt{N S_{ii} S_{jj}}} to work well for i \neq j.

Another possible extension is to smooth towards a matrix \Theta:

\sum_{ij} \lambda_{ij} |P_{ij} - \Theta_{ij}|.

It is also possible to consider "group LASSO" type penalties with blocking in the covariance, or other structural constraints on the sparsity of the inverse covariance.
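A sketch of how the suggested weights might be constructed in NumPy (the function name and the overall scale parameter are ours):

```python
import numpy as np

def penalty_weights(S, N, scale=1.0):
    """lam[i, j] proportional to 1/sqrt(N * S_ii * S_jj), with the diagonal unpenalized."""
    d = np.sqrt(np.diag(S))
    lam = scale / (np.sqrt(N) * np.outer(d, d))
    np.fill_diagonal(lam, 0.0)
    return lam
```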


The Exponential Family

Another avenue of extension worth considering is the viewpoint of exponential families. An exponential family is characterized by its features \phi(x) and is given by

P(x \mid \theta) = \frac{e^{\theta^\top \phi(x)}}{Z(\theta)}, \qquad Z(\theta) = \int e^{\theta^\top \phi(x)}\,dx.

Z(\theta) is the partition function or normalizer. The covariance selection problem corresponds to the features \phi(x) = \mathrm{vec}(xx^\top), with parameters \theta = \mathrm{vec}(P) and log partition function \log Z(\theta) = -\tfrac{1}{2}\log\det(P) + \tfrac{n}{2}\log(2\pi).


The general normal distribution

By extending the features to \phi(x) = \begin{pmatrix} x \\ \mathrm{vec}(xx^\top) \end{pmatrix} we can consider the general normal distribution with parameters

\theta = \begin{pmatrix} \psi \\ P \end{pmatrix} = \begin{pmatrix} \Sigma^{-1}\mu \\ \Sigma^{-1} \end{pmatrix}.

The corresponding log-likelihood function is

L(\theta) = s^\top\theta - \log Z(\theta), \qquad s = \frac{1}{T}\sum_{t=1}^{T}\phi(x_t),

and the log partition function is

\log Z(\theta) = \frac{1}{2}\psi^\top P^{-1}\psi - \frac{1}{2}\log\det(P) + \frac{n}{2}\log(2\pi).


Related Problems

Sparse multivariate regression with covariance estimation: LASSO + covariance selection. (2010) A. J. Rothman, E. Levina, and J. Zhu.

Covariance constrained to a Kronecker product: leads to two interconnected covariance selection problems. (2011) T. Tsiligkaridis and A. O. Hero III.


The ℓ0 problem

The ℓ0 problem: replace the penalty with card(P) = |\{(i, j) : P_{ij} \neq 0\}|.

The true sparsity structure can be recovered under the restricted eigenvalue property and enough data. We define d as the maximum number of non-zero entries in a row of the true covariance matrix.

ℓ1 problem: needs O(d^2 \log(n)) samples.
ℓ0 problem: needs O(d \log(n)) samples.


The problem is convex because it is the sum of two convex terms.

The negative log-likelihood of an exponential family is convex, since

\frac{\partial^2}{\partial\theta\,\partial\theta^\top}\log Z(\theta) = \mathrm{Var}[\phi(x)] \succeq 0.

This is probably the simplest and most elegant way to prove that -\log\det(P) is convex ((1997) R. Gopinath).

The penalty term is convex by inspection: all norms are by definition convex.


Consider the function |x|^p for p ≥ 0. The function is convex if p ≥ 1 and sparsity promoting if p ≤ 1.


Norms are convex and sparsity promoting

Convexity is ensured by the triangle inequality: for any 0 ≤ α ≤ 1 with α + β = 1 we have

‖αx + βy‖ ≤ ‖αx‖ + ‖βy‖ = α‖x‖ + β‖y‖.

That any norm is sparsity inducing follows from |x| being sparsity inducing, since along any direction x we have ‖αx‖ = |α| ‖x‖. ‖x‖_p is not a norm for p < 1, and convexity is lost, but it is still sparsity inducing.


How natural are ℓp norms?

They may seem unnatural except for p = 1, 2 and ∞, but consider the ℓp ball for p = 2/3, 1, 2, ∞.

[Figure: the ℓ∞, ℓ2, ℓ1 and ℓ2/3 balls in the plane.]


When λ = 0 the problem becomes equivalent to the maximum likelihood problem, and the solution is P* = S^{-1}. Consider the case when λ ≠ 0 and the solution P* is not sparse, with Z* = sign(P*). We then have

F(P^*) = L(P^*) + \lambda\|\mathrm{vec}(P^*)\|_1 = L(P^*) + \lambda\,\mathrm{trace}(Z^* P^*) = -\log\det(P^*) + \mathrm{trace}(P^*(S + \lambda Z^*)).

Therefore the solution is P* = (S + λZ*)^{-1}. In general, if we know sign(P*), the function is smooth in all the non-zero (free) variables and therefore the solution is "easy" to find.


What is an Orthant Face?

If the value Z = sign(P*) is known, then the problem is smooth for the free variables on the orthant face

O(Z) = \{P : \mathrm{sign}(P) = Z\}.

The orthant faces are the regions where the sign of P does not change.


The diagonal elements

Note that the diagonal elements of P always have to be strictly positive to ensure the solution is positive definite. Therefore these will always be free variables. Since P is symmetric, we need only determine the sign of \binom{n}{2} = n(n-1)/2 off-diagonal variables.


Even if the orthant problem can be solved efficiently, there are still 3^{n(n-1)/2} orthant faces to search over. This discrete optimization problem of selecting the orthant face seems equally hard. However, if we guide the orthant-face search using the gradient on the orthant surface, the discrete problem is aided by the continuous one. The rest of the talk will show the structure of the problem and how to do the optimization efficiently.


Dual Formulation

\min_{P \succ 0} F(P) = \min_{P \succ 0} L(P) + \lambda\|\mathrm{vec}(P)\|_1
= \min_{P \succ 0} L(P) + \lambda \max_{\|\mathrm{vec}(Z)\|_\infty \le 1} \mathrm{trace}(ZP)
= \min_{P \succ 0} \max_{\|\mathrm{vec}(Z)\|_\infty \le 1} -\log\det P + \mathrm{trace}(PS) + \lambda\,\mathrm{trace}(ZP)
= \min_{P \succ 0} \max_{\|\mathrm{vec}(Z)\|_\infty \le 1} -\log\det P + \mathrm{trace}(P(S + \lambda Z))
= \max_{\|\mathrm{vec}(Z)\|_\infty \le 1} \min_{P \succ 0} -\log\det P + \mathrm{trace}(P(S + \lambda Z))
= \max_{\|\mathrm{vec}(Z)\|_\infty \le 1} \log\det(S + \lambda Z) + d


At the optimum we have, as shown, F(P*) = U(Z*), with the primal and dual variables satisfying the relation

\lambda Z^* + S - (P^*)^{-1} = 0,

where P* ≻ 0 and ‖vec(Z*)‖∞ ≤ 1. Define the dual function to be

U(Z) = \log\det(S + \lambda Z) + d;

then we have

U(Z) \le U(Z^*) = F(P^*) \le F(P),

so that any pair of matrices P, Z satisfying P ≻ 0 and ‖vec(Z)‖∞ ≤ 1 yields an upper and a lower bound on the objective at the optimal point.

Note that the dual problem is smooth with a box constraint. Box-constrained problems can be solved using projected gradients, something that has a long history.


Relationships between the primal and dual

We have that

if [P^*]_{ij} = 0 then [Z^*]_{ij} \notin \{-1, 1\},
if [P^*]_{ij} > 0 then [Z^*]_{ij} = 1,
if [P^*]_{ij} < 0 then [Z^*]_{ij} = -1.

The corners of the box correspond to a non-sparse solution P*.


The gradient of L at a point P is, as we shall see later, given by G = vec(S - P^{-1}). Using this we can get a good approximation to Z* if we have a good approximation to P*. Let P be an approximation to P* and form the value

[Z]_{ij} = \begin{cases} 1 & \text{if } [P]_{ij} > 0 \\ -1 & \text{if } [P]_{ij} < 0 \\ -1 & \text{if } [P]_{ij} = 0 \text{ and } [G]_{ij} > \lambda \\ 1 & \text{if } [P]_{ij} = 0 \text{ and } [G]_{ij} < -\lambda \\ -\tfrac{1}{\lambda}[G]_{ij} & \text{if } [P]_{ij} = 0 \text{ and } |[G]_{ij}| \le \lambda. \end{cases}
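As an illustration (a sketch of ours, not from the slides), this recipe gives a cheap duality-gap bound: build a feasible Z from an approximate P and compare F(P) with U(Z). It assumes P ≻ 0 and S + λZ ≻ 0 so that both log-determinants exist; function names are ours.

```python
import numpy as np

def dual_feasible_Z(P, S, lam):
    """Dual candidate built from an approximate primal P, following the case rule above."""
    G = S - np.linalg.inv(P)                      # gradient of the smooth part at P
    Z = np.sign(P)                                # +/-1 on the non-zero entries
    zero = (P == 0)
    Z[zero] = np.clip(-G[zero] / lam, -1.0, 1.0)  # -(1/lam) G clipped to the box
    return Z

def duality_gap(P, S, lam):
    """F(P) - U(Z): an upper bound on how suboptimal P is (assumes S + lam*Z is positive definite)."""
    d = P.shape[0]
    Z = dual_feasible_Z(P, S, lam)
    F = -np.linalg.slogdet(P)[1] + np.trace(S @ P) + lam * np.abs(P).sum()
    U = np.linalg.slogdet(S + lam * Z)[1] + d
    return F - U
```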


We already know the solution for λ = 0. What other exact solutions can we find? The following is a list of solutions known to us:

For λ large the solution is diagonal and known.

For n = 2 we can give the exact solution.

For λ sufficiently close to zero the solution is not sparse and we can give the exact solution.

For values of λ where the solution is block-diagonal, the blocking can be detected and the exact solution consists of solving each block independently.

The block-diagonal solution is similar to, and was inspired by, safe feature elimination for the lasso problem. (2011) L. El Ghaoui.


Locating Exact Solutions

The key to finding exact solutions is to use the duality relationship

S - (P^*)^{-1} + \lambda Z^* = 0,

where P ≻ 0 and ‖vec(Z)‖∞ ≤ 1. Z will try to zero out S, and when it can't, P has to fill in the rest. Essentially it is easier to solve the dual problem analytically, since it is smooth, and we can simply guess the solution and verify it for the primal problem.

The key to proving that a solution is correct is the concept of the sub-gradient. A sub-gradient is the slope of a line that touches F at P and lies below F everywhere. If zero is a sub-gradient then this is the global minimum. The sub-differential is the set of all possible sub-gradients at a point.


Derivatives Small and large

Frechet Differentiable: The good old derivative exists (Frechet is the derivative extended to Banach spaces).

Gateaux Differentiable: The directional derivatives exist.

Sub-differential: The collection of all sub-gradients.

Clarke Derivative: An extension of the sub-differential.

Bouligand Derivative: An extension of the directional derivative.

Pseudo-gradient: Not quite a gradient; a few screws short of a hardware store.

Weak derivative: When a function is non-differentiable, the weak derivative works "under the integral sign".

Financial Derivatives: The biggest scam of all...


The diagonal solution

Recall that

S - (P^*)^{-1} + \lambda Z^* = 0,

where P ≻ 0 and ‖vec(Z)‖∞ ≤ 1. For Z to zero out the off-diagonal part we must have λ ≥ |S_{ij}| for all i ≠ j. Since the sign of the diagonal elements must be positive, we have Z_{ii} = 1 and we get P* = (diag(S) + λI)^{-1}. This is the solution if and only if λ ≥ |S_{ij}| for all i ≠ j.

The solution can be verified by computing the sub-differential and verifying that 0 is a sub-gradient. A more difficult proof uses Hadamard's inequality to verify that Z* is the solution to the dual problem.

This simple method of locating zeros only works because both P* and (P*)^{-1} have the same sparsity structure (diagonal). This is also the case for block-diagonal solutions.


The 2 by 2 case

A 2 × 2 matrix is diagonal if the only off-diagonal element is zero. Therefore we can guess that there are only two kinds of solutions: (1) a diagonal solution when λ is large, and (2) a non-sparse solution in the orthant given by λ = 0, i.e. Z* = sign(S^{-1}). We have

S^{-1} = \frac{1}{\det(S)}\begin{pmatrix} S_{22} & -S_{12} \\ -S_{12} & S_{11} \end{pmatrix}

and therefore

Z^* = \begin{pmatrix} 1 & -\mathrm{sign}(S_{12}) \\ -\mathrm{sign}(S_{12}) & 1 \end{pmatrix}.


Complete 2 by 2 solution

Assuming that n = 2, S ≻ 0 and λ ≥ 0, the solution to the covariance selection problem is

P^* = \begin{cases} (\mathrm{diag}(S) + \lambda I)^{-1} & \text{if } \lambda \ge |S_{12}|, \\ \begin{pmatrix} S_{11} + \lambda & S_{12}(1 - \lambda/|S_{12}|) \\ S_{12}(1 - \lambda/|S_{12}|) & S_{22} + \lambda \end{pmatrix}^{-1} & \text{if } 0 \le \lambda < |S_{12}|. \end{cases}
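A sketch of this closed form in NumPy (names are ours); it is handy as a unit test for any general solver:

```python
import numpy as np

def exact_2x2(S, lam):
    """Closed-form solution of the 2x2 covariance selection problem, per the formula above."""
    if lam >= abs(S[0, 1]):
        return np.linalg.inv(np.diag(np.diag(S)) + lam * np.eye(2))
    off = S[0, 1] * (1.0 - lam / abs(S[0, 1]))   # shrunk off-diagonal element
    A = np.array([[S[0, 0] + lam, off],
                  [off, S[1, 1] + lam]])
    return np.linalg.inv(A)
```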


Block diagonal solutions

If λ is larger than the absolute value of all the off-block-diagonal elements of S, then the solution P* is block-diagonal and each block can be found by solving a covariance selection problem. The procedure is best illustrated by solving an example. (2012) R. Mazumder, T. Hastie.


Let λ = 0.14

S =

 1.06  0.16 −0.03 −0.15  0.00 −0.04  0.01 −0.13  0.02
 0.16  0.85 −0.11 −0.15 −0.01  0.00  0.03  0.00  0.01
−0.03 −0.11  1.03  0.06  0.11  0.00 −0.04  0.02 −0.05
−0.15 −0.15  0.06  0.89  0.02 −0.03 −0.01 −0.02  0.20
 0.00 −0.01  0.11  0.02  0.93  0.04 −0.01 −0.02  0.14
−0.04  0.00  0.00 −0.03  0.04  1.12 −0.12 −0.06  0.00
 0.01  0.03 −0.04 −0.01 −0.01 −0.12  0.87  0.09 −0.09
−0.13  0.00  0.02 −0.02 −0.02 −0.06  0.09  1.03  0.02
 0.02  0.01 −0.05  0.20  0.14  0.00 −0.09  0.02  1.06

First locate all the elements for which |S_{ij}| > λ.



[S with the entries |S_{ij}| > λ marked by stars.]

We find the solution blocks by locating connected components of the graph corresponding to the stars.


The graph corresponding to S

[Figure: graph on the nodes 1-9 with an edge (i, j) wherever |S_{ij}| > λ.]

A glance at the graph shows that there are 4 single-element blocks and one block with 5 elements. We know the solution for the single-component blocks and only need to solve the remaining n = 5 block.
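A sketch of the block-detection step (names ours, using SciPy's connected-components routine); each returned index set can then be solved as an independent covariance selection problem:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def detect_blocks(S, lam):
    """Indices of the diagonal blocks: connected components of the graph with an edge
    (i, j) whenever |S_ij| > lam for i != j."""
    adj = np.abs(S) > lam
    np.fill_diagonal(adj, False)
    n_blocks, labels = connected_components(csr_matrix(adj), directed=False)
    return [np.where(labels == b)[0] for b in range(n_blocks)]
```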

Newton LASSO Methods

In general we might want to consider minimizing functions of the form

f(x) + \lambda\|x\|_1

for smooth, differentiable, convex functions f. When f is a quadratic, this problem is known as the least absolute shrinkage and selection operator (LASSO) problem ((1996) R. Tibshirani). When f is not a quadratic we iteratively solve the LASSO problems

\hat{x}_{k+1} = \arg\min_x\; f(x_k) + f'(x_k)^\top(x - x_k) + \frac{1}{2}(x - x_k)^\top f''(x_k)(x - x_k) + \lambda\|x\|_1.

We call this the Newton-lasso method. Convergence properties that rely on exact solutions of the lasso subproblems are given by (2009) P. Tseng and S. Yun.


The line search

Not so fast: the method on the previous page won't work in general. If f(x_{k+1}) < f(x_k) then we are OK; this will be the case when we are close to the optimum. Otherwise we need a more conservative approach. For the covariance selection problem the solution to the quadratic model may yield a matrix that is not positive definite. We consider x = x_k + t(\hat{x}_{k+1} - x_k) and find a value of t that sufficiently decreases the function according to the Armijo rule

f(x_{k+1}) - f(x_k) \le \sigma\,(x_{k+1} - x_k)^\top f'(x_k), \qquad \sigma \in (0, 1).

This condition is not enough to ensure quadratic convergence, but since eventually all steps will have t = 1 it is not an issue.


The first step in developing a Newton-lasso method is to compute the gradient g_k = vec(L'(P_k)) and the Hessian H_k = L''(P_k) for our problem. Recalling that L(P) = -\log\det(P) + \mathrm{trace}(PS), we compute the Taylor expansion of the non-linear term \log\det(P) around P_k.


The Taylor Expansion

Let P = P_k + \Delta, \Delta = P - P_k and X = P_k^{-1/2}\Delta P_k^{-1/2}. Let \{e_i\}_{i=1}^{d} denote the eigenvalues of X; then

\log\det P = \log\det(P_k + \Delta)
           = \log\det P_k + \log\det(I + P_k^{-1/2}\Delta P_k^{-1/2})
           = \log\det P_k + \log\det(I + X)
           = \log\det P_k + \sum_{i=1}^{d}\log(1 + e_i)
           = \log\det P_k + \sum_{k=1}^{\infty}\frac{(-1)^{k+1}}{k}\sum_{i=1}^{d} e_i^k
           = \log\det P_k + \sum_{k=1}^{\infty}\frac{(-1)^{k+1}}{k}\,\mathrm{trace}(X^k)


The quadratic part

Extracting the linear and quadratic parts in terms of \Delta we get

\log\det P = \log\det P_k + \mathrm{trace}(X) - \frac{1}{2}\mathrm{trace}(X^2) + O(X^3)
           = \log\det P_k + \mathrm{trace}(P_k^{-1/2}\Delta P_k^{-1/2}) - \frac{1}{2}\mathrm{trace}(P_k^{-1/2}\Delta P_k^{-1}\Delta P_k^{-1/2}) + O(X^3)
           = \log\det P_k + \mathrm{trace}(\Delta P_k^{-1}) - \frac{1}{2}\mathrm{trace}(\Delta P_k^{-1}\Delta P_k^{-1}) + O(X^3)
           = \log\det P_k + \mathrm{vec}^\top(\Delta)\mathrm{vec}(P_k^{-1}) - \frac{1}{2}\mathrm{vec}^\top(\Delta)\mathrm{vec}(P_k^{-1}\Delta P_k^{-1}) + O(X^3)
           = \log\det P_k + \mathrm{vec}^\top(\Delta)\mathrm{vec}(P_k^{-1}) - \frac{1}{2}\mathrm{vec}^\top(\Delta)(P_k^{-1}\otimes P_k^{-\top})\mathrm{vec}(\Delta) + O(X^3)


It follows that

g_k = \mathrm{vec}(S - P_k^{-1}), \qquad H_k = P_k^{-1}\otimes P_k^{-1}.

The fact that the Hessian is a Kronecker product can be used to make the computation of the Newton direction and the solution of the Newton-lasso problem more efficient. Also, we never need to explicitly instantiate the Hessian.
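A sketch of how the Kronecker structure is typically exploited (names ours): the n² × n² Hessian is never formed, because multiplying it against vec(D) is just two matrix products.

```python
import numpy as np

def gradient(P, S):
    """Gradient of L(P) = -log det P + trace(SP), as a matrix: S - P^{-1}."""
    return S - np.linalg.inv(P)

def hessian_times(P_inv, D):
    """(P^{-1} kron P^{-1}) vec(D) = vec(P^{-1} D P^{-1}), returned in matrix form."""
    return P_inv @ D @ P_inv
```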


Definition (Kronecker Products)

For matrices A \in \mathbb{R}^{m_1\times n_1} and B \in \mathbb{R}^{m_2\times n_2} we define the Kronecker product A\otimes B \in \mathbb{R}^{(m_1 m_2)\times(n_1 n_2)} to be

A\otimes B = \begin{pmatrix} a_{11}B & \cdots & a_{1 n_1}B \\ \vdots & & \vdots \\ a_{m_1 1}B & \cdots & a_{m_1 n_1}B \end{pmatrix}.

It can be verified that this definition is equivalent to (A\otimes B)_{(i-1)m_2+j,\,(k-1)n_2+l} = a_{ik}b_{jl}, which we simply write as (A\otimes B)_{(ij)(kl)} = a_{ik}b_{jl}.


Theorem

The following identities show how to multiply, transpose, invert and compute the trace of Kronecker products:

(A\otimes B)(C\otimes D) = (AC)\otimes(BD)
(A\otimes B)^\top = A^\top\otimes B^\top
(A\otimes B)^{-1} = A^{-1}\otimes B^{-1}
I_m\otimes I_n = I_{mn}
(B^\top\otimes A)\,\mathrm{vec}(X) = \mathrm{vec}(AXB)
\mathrm{trace}(A\otimes B) = \mathrm{trace}(A)\,\mathrm{trace}(B)
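A quick numerical sanity check of the vec identity; note that it holds for the column-major vec, hence order="F" below:

```python
import numpy as np

rng = np.random.default_rng(0)
A, B, X = rng.standard_normal((3, 3)), rng.standard_normal((4, 4)), rng.standard_normal((3, 4))

lhs = np.kron(B.T, A) @ X.flatten(order="F")   # (B^T kron A) vec(X)
rhs = (A @ X @ B).flatten(order="F")           # vec(A X B)
assert np.allclose(lhs, rhs)
```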


More on the linesearch

Two issues with the linesearch have not been dealt with yet. Both solutions are borrowed from the creators of QUIC.

1 What line-search method do we use?

2 How do we ensure positive definiteness?

We used the backtracking linesearch with t = 1, 1/2, 1/4, . . . The reasoning is two-fold. Firstly, since t = 1 will be the predominant choice, the line search does not dominate the compute time. Secondly, it is simple to implement, and the accuracy of the line search is not very important as long as convergence is guaranteed.


Positive definiteness

First of all how do we check for positive definiteness?

Find the smallest eigenvalue and check that it is positive.

Do a Cholesky decomposition P = LL^\top. If the decomposition succeeds and L_{ii} \neq 0, then P ≻ 0; otherwise P is not positive definite.

The first method is perhaps more reliable and gives more information, but the second method is by far the fastest.
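A sketch of the Cholesky test and the simple backtracking it supports (names ours; the decrease test below is a plain descent check rather than the full Armijo condition):

```python
import numpy as np

def is_pos_def(P):
    """Fast positive-definiteness test: does the Cholesky factorization succeed?"""
    try:
        np.linalg.cholesky(P)
        return True
    except np.linalg.LinAlgError:
        return False

def backtracking_step(P, V, F):
    """Try t = 1, 1/2, 1/4, ... along direction V, rejecting steps that leave the
    positive definite cone or fail to decrease the objective F."""
    f0 = F(P)
    t = 1.0
    while t > 1e-10:
        cand = P + t * V
        if is_pos_def(cand) and F(cand) < f0:
            return cand, t
        t *= 0.5
    return P, 0.0
```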


The line-search interval

In the linesearch we must ensure that t is chosen so that P_{k+1} = P_k + tV is positive definite. We can do that in two ways:

Find the smallest t > 0 such that det(P_k + tV) = 0 by solving the generalized eigenvalue problem P_k x = λVx.

If the Cholesky decomposition succeeds for a particular t, then P_{k+1} ≻ 0.


Along with the line-search strategy outlined, QUIC had two more important innovations:

The lasso problem was solved using coordinate descent. Each variable can be updated very efficiently by using the structure of the Hessian. In total it uses only O(n|F|) operations per sweep over the variables, where F is the set of free variables.

By starting the process from a sparse (diagonal) matrix, really sparse solutions can be found extremely efficiently.


QUIC results

[Figure: timing results for QUIC.]


ISTA

The Iterative Shrinkage Thresholding Algorithm (ISTA) is a method to solve the problem ((2004) I. Daubechies, M. Defrise, and C. De Mol)

f(x) + \lambda\|x\|_1

when f is a convex quadratic function. The ISTA method simply iterates

x_{i+1} = S_{\lambda/c}\!\left(x_i - \frac{1}{c}\nabla f(x_i)\right),

where cI - f''(x) \succeq 0 and S_\lambda is the Donoho-Johnstone shrinkage operator applied to each coordinate:

S_\lambda(x) = \begin{cases} x - \lambda & \text{if } x > \lambda \\ 0 & \text{if } |x| \le \lambda \\ x + \lambda & \text{if } x < -\lambda. \end{cases}


FISTA

The Fast Iterative Shrinkage Thresholding Algorithm (FISTA) ((2009) A. Beck and M. Teboulle) first does an ISTA step

\hat{x}_i = S_{\lambda/c}\!\left(x_i - \frac{1}{c}\nabla f(x_i)\right),

then applies a Nesterov acceleration to give

x_{i+1} = \hat{x}_i + \frac{t_i - 1}{t_{i+1}}(\hat{x}_i - \hat{x}_{i-1}),

where

\hat{x}_1 = \hat{x}_0, \qquad t_1 = 1, \qquad t_{i+1} = \frac{1 + \sqrt{1 + 4t_i^2}}{2}.

FISTA converges significantly faster than ISTA at very little computational overhead.
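A sketch in the standard Beck-Teboulle form (the gradient step is taken at the extrapolated point); it reuses shrink from the ISTA sketch above:

```python
import numpy as np

def fista(grad_f, c, lam, x0, iters=100):
    """FISTA: shrinkage step plus Nesterov momentum on the iterates."""
    x_prev = x0.copy()
    y = x0.copy()
    t = 1.0
    for _ in range(iters):
        x = shrink(y - grad_f(y) / c, lam / c)        # ISTA step at the extrapolated point
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x + ((t - 1.0) / t_next) * (x - x_prev)   # momentum
        x_prev, t = x, t_next
    return x_prev
```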


ISTA for Covariance Selection

If we apply the ISTA iteration to the lasso subproblem of the covariance selection problem we get the elegant iteration

x_{i+1} = S_{\lambda/c}\!\left(\mathrm{vec}(\hat{X}_i) - \frac{1}{c}\left(g_k + H_k\,\mathrm{vec}(\hat{X}_i - P_k)\right)\right)
        = S_{\lambda/c}\!\left(\frac{1}{c}\mathrm{vec}(-S + 2P_k^{-1} - P_k^{-1}\hat{X}_i P_k^{-1}) + \mathrm{vec}(\hat{X}_i)\right).

The inverse matrix P_k^{-1} can be precomputed and stored for the iterations x_0 \to x_1 \to \cdots \to x_i \to \cdots. After the ISTA/FISTA iteration is performed, we perform a backtracking line search, as described earlier, to determine P_{k+1}.


Algorithm

1 Compute the starting point P_0 = (\mathrm{diag}(S) + \lambda I)^{-1}, k = 0.

2 Stop if the minimum sub-gradient norm is smaller than ε.

3 Solve the lasso subproblem using FISTA. Call the approximate solution X_{k+1}.

4 Find P_{k+1} by a backtracking line search from P_k to X_{k+1}.

5 k ← k + 1 and go to step 2.
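For the stopping test in step 2, the minimum-norm sub-gradient has a simple closed form; a sketch (names ours):

```python
import numpy as np

def min_subgradient(P, S, lam):
    """Minimum-norm sub-gradient of F at P: G + lam*sign(P) on the non-zeros, and the
    shrunk gradient sign(G)*max(|G| - lam, 0) on the zeros."""
    G = S - np.linalg.inv(P)
    sub = G + lam * np.sign(P)
    zero = (P == 0)
    sub[zero] = np.sign(G[zero]) * np.maximum(np.abs(G[zero]) - lam, 0.0)
    return sub
```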

Orthant Wise Optimization

Active and Free Variables

We divide the variables into the following two groups:

Active Variables The active constraints/variables are the variables whose values we fix at 0. We denote the set of active variables as A.

Free Variables The free variables are the variables whose values are not fixed at 0. We denote the set of free variables as F.

An orthant face naturally divides the variables into active and free variables, where the sign of each free variable is fixed according to the orthant face.

We claimed earlier that optimizing over an orthant face was "simple". How do we do this?


Choosing the Orthant Face

If we are at a given point P_k and consider the optimization in an orthant face containing P_k, there may be several choices of orthant face. For each value [P_k]_{ij} = 0 we can choose the corresponding orthant sign to be negative, zero or positive. Consider an infinitesimal change of [P_k]_{ij} to decide the sign.

If a small positive change reduces the function value, make the sign positive. This happens if ∂L/∂P_{ij} < -λ.

If a small negative change reduces the function value, make the sign negative. This happens if ∂L/∂P_{ij} > λ.

Finally, if neither a positive nor a negative change reduces the function value, make the sign zero. This happens if |∂L/∂P_{ij}| ≤ λ.


As an example, consider the function f(x) = ½(x − a)² + |x| at x = 0 for various values of a:

For a < −1 the minimum occurs for x < 0.

For |a| ≤ 1 the minimum occurs at x = 0.

For a > 1 the minimum occurs for x > 0.


Orthant Indicator

We define the orthant indicator Z_k to be

[Z_k]_{ij} = \begin{cases} 1 & \text{if } [P_k]_{ij} > 0 \\ -1 & \text{if } [P_k]_{ij} < 0 \\ -1 & \text{if } [P_k]_{ij} = 0 \text{ and } [G_k]_{ij} > \lambda \\ 1 & \text{if } [P_k]_{ij} = 0 \text{ and } [G_k]_{ij} < -\lambda \\ 0 & \text{if } [P_k]_{ij} = 0 \text{ and } |[G_k]_{ij}| \le \lambda. \end{cases}

The 0 ensures the active variables do not move away from 0. (The dual variable took the different value -\tfrac{1}{\lambda}[G_k]_{ij} here.) The value G_k + \lambda Z_k is the steepest descent direction at the point P_k, and we refer to it as the pseudo-gradient. We consider minimizing L(P) + \lambda\,\mathrm{trace}(PZ_k) on the orthant face in place of F(P).
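A sketch of the orthant indicator in NumPy (name ours), mirroring the case definition above:

```python
import numpy as np

def orthant_indicator(P, G, lam):
    """Signs of the non-zeros; for zeros, the sign suggested by the gradient, or 0
    when the variable should stay active."""
    Z = np.sign(P)
    zero = (P == 0)
    Z[zero & (G > lam)] = -1.0
    Z[zero & (G < -lam)] = 1.0
    # remaining zeros (|G| <= lam) keep Z = 0, so active variables do not move
    return Z
```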


When the Newton step corresponding to the quadratic approximation on the orthant face is outside the orthant face, we must decide whether or not to allow the search to leave the orthant.

Leaving the orthant face leads to complications. The Newton direction is no longer guaranteed to be a descent direction, but this can be fixed with something known as pseudo-gradient alignment. Also, we need to ensure that we enforce sparsity whenever possible.


Not leaving the orthant face is simpler. However, only considering the line segment inside the orthant face leads to many small steps. Typically only one coordinate will be made sparse per line step, and this may mean millions of line searches for problems where n > 1000. We need a strategy that allows many variables to become sparse at once, a sparsity acceleration if you will. We project the line segment using the orthant projection

\Pi(P_{ij}) = \begin{cases} P_{ij} & \text{if } \mathrm{sign}(P_{ij}) = [Z_k]_{ij} \\ 0 & \text{otherwise.} \end{cases}
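A one-line sketch of the projection (name ours):

```python
import numpy as np

def orthant_project(P, Z):
    """Zero out every entry whose sign disagrees with the orthant indicator Z."""
    return np.where(np.sign(P) == np.sign(Z), P, 0.0)
```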


Figen tried line-search strategies both confined to and not confined to the orthant. She also tried different strategies for sparsity acceleration. The orthant projection scheme was best most of the time, and also happened to be the simplest to implement!


The OWL package, which optimizes functions with an ℓ1 penalty, uses a procedure called gradient alignment ((2007) G. Andrew, J. Gao).

Gradient alignment is not needed with the orthant projection method. Figen performed extensive experiments using pseudo-gradient alignment and other line-search and projection procedures. The performance of the alignment strategies was uniformly worse, and they were not employed in the final experiments.


Pseudo Gradient Alignment

[Figure: the pseudo-gradient, the Newton direction, the directional derivative, and the aligned direction.]


The reduced quadratic

We use the notation p_k = \mathrm{vec}(P_k) = \begin{pmatrix} p_{kF} \\ p_{kA} \end{pmatrix} = \begin{pmatrix} p_{kF} \\ 0 \end{pmatrix}. Recall that the piecewise quadratic approximation to F is

q_k(P) = L(P_k) + g_k^\top(p - p_k) + \frac{1}{2}(p - p_k)^\top H_k(p - p_k) + \lambda\|p\|_1.

If we constrain the model to the Z_k orthant face we get

q_k(P) = L(P_k) + g_k^\top(p - p_k) + \frac{1}{2}(p - p_k)^\top H_k(p - p_k) + \lambda p^\top z_k,

subject to sign(p) = z_k. Finally, if we substitute in p_{kA} = 0 and drop the constraints and the constant, we get the reduced quadratic

Q_F(p_F) = g_{kF}^\top(p_F - p_{kF}) + \frac{1}{2}(p_F - p_{kF})^\top H_{kF}(p_F - p_{kF}) + \lambda p_F^\top z_{kF}.

Here H_{kF} equals H_k = P_k^{-1}\otimes P_k^{-1} with the rows and columns corresponding to A removed.


The solution to the reduced quadratic can be seen to be

p_F^* = p_{kF} - H_{kF}^{-1}(g_{kF} + \lambda z_{kF}).

We need a quick way to compute p_F^* without storing H_{kF}^{-1}.

For A = ∅ the computation becomes trivial: P^* = P_k - P_k(G_k + \lambda Z_k)P_k.

Observation: we can do fast multiplication (O(n|F|)) by H_{kF} by lifting, multiplying by H_k and then projecting:

H_{kF}\,x_F = \left[H_k\begin{pmatrix} x_F \\ 0 \end{pmatrix}\right]_F = \left[P_k^{-1}\,\mathrm{mat}\!\begin{pmatrix} x_F \\ 0 \end{pmatrix}P_k^{-1}\right]_F.
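A sketch of the lift-multiply-project trick (indexing conventions ours; free_idx indexes vec(P), and P_inv is assumed symmetric). For clarity it forms the full product, whereas exploiting the sparsity of the lifted matrix and computing only the free entries gives the O(n|F|) cost quoted above:

```python
import numpy as np

def reduced_hess_vec(P_inv, free_idx, x_free, n):
    """Multiply H_F against x_F without forming H_F: lift x_F into an n x n matrix D,
    apply D -> P^{-1} D P^{-1}, then read off the free entries again."""
    D = np.zeros(n * n)
    D[free_idx] = x_free
    D = D.reshape(n, n)
    return (P_inv @ D @ P_inv).reshape(-1)[free_idx]
```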


The conjugate gradient algorithm is an iterative procedure to find the solution of Ax = b when A ≻ 0.

At iteration k the conjugate gradient algorithm finds the projection of the solution onto the Krylov subspace span{b, Ab, . . . , A^{k-1}b}.

In each iteration of the conjugate gradient algorithm we compute a matrix-vector product Ay_k. This is the most expensive step.


The Conjugate Gradient Algorithm

Initialize: r_0 = b - Ax_0, y_0 = r_0, k = 0
while ‖r_k‖ > ε do
    α_k = (r_k^\top r_k) / (y_k^\top A y_k)
    x_{k+1} = x_k + α_k y_k
    r_{k+1} = r_k - α_k A y_k
    β_k = (r_{k+1}^\top r_{k+1}) / (r_k^\top r_k)
    y_{k+1} = r_{k+1} + β_k y_k
    k ← k + 1
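A direct transcription into Python (names ours); matvec can be, for instance, the reduced Hessian product sketched earlier:

```python
import numpy as np

def conjugate_gradient(matvec, b, x0, eps=1e-8, max_iter=None):
    """Conjugate gradients for A x = b, A symmetric positive definite, given only a
    matrix-vector product."""
    x = x0.copy()
    r = b - matvec(x)
    y = r.copy()
    rs = r @ r
    for _ in range(max_iter or len(b)):
        if np.sqrt(rs) <= eps:
            break
        Ay = matvec(y)
        alpha = rs / (y @ Ay)
        x = x + alpha * y
        r = r - alpha * Ay
        rs_new = r @ r
        y = r + (rs_new / rs) * y
        rs = rs_new
    return x
```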


LBFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno) is considered the all-around best method for minimizing non-linear functions. We used it to solve the reduced quadratic. The Hessian H_F is replaced by a limited-memory BFGS matrix B_F. Instead of using the properties of H_F to efficiently compute the Newton step, we use the properties of the approximation B_F.

The OWL package ((2007) Andrew, G. and Gao, J.) does something similar, but uses the full quadratic instead of the reduced quadratic. Which approach is best depends on the sparsity of the solution.

Results

The algorithms in red are ours.

Algorithms tested

Algorithm   Description
MA-FISTA    Newton-LASSO-FISTA method
MA-Coord    Newton-LASSO method using coordinate descent (QUIC)
MB-CG-K     Orthant-based Newton-CG method with a limit of K CG iterations
MB-CG-D     MB-CG-K with K = 5 initially, increased by 1 every 3 iterations
MB-LBFGS    Orthant-based quasi-Newton method
ALM         Alternating linearization method


λ                        0.5            0.1            0.05           0.01
algorithm                iter  time     iter  time     iter  time     iter  time

n = 500, Card(Σ⁻¹) = 2.4%; card(P*) = 0.74% / 7.27% / 11.83% / 32.48%; cond(P*) = 8.24 / 27.38 / 51.01 / 118.56
MA-FISTA                 8     5.71     10    22.01    11    37.04    12    106.27
MA-Coord                 21    3.86     49    100.63   66    279.69   103   1885.89
MB-CG-5                  15    4.07     97    26.24    257   70.91    1221  373.63
MB-CG-D                  12    3.88     34    15.41    65    43.29    189   275.29
MB-LBFGS                 47    5.37     178   21.92    293   38.23    519   84.13
ALM                      445   162.96   387   152.76   284   115.11   574   219.80

n = 500, Card(Σ⁻¹) = 20.1%; card(P*) = 0.21% / 14.86% / 25.66% / 47.33%; cond(P*) = 3.39 / 16.11 / 32.27 / 99.49
MA-FISTA                 4     1.25     19    13.12    15    34.53    13    100.90
MA-Coord                 4     0.42     14    19.69    21    71.51    55    791.84
MB-CG-5                  3     0.83     27    7.36     101   28.40    795   240.90
MB-CG-D                  3     0.84     15    5.22     31    14.14    176   243.55
MB-LBFGS                 9     1.00     82    11.42    155   23.04    455   78.33
ALM                      93    35.75    78    32.98    149   61.35    720   292.43

n = 1000, Card(Σ⁻¹) = 3.5%; card(P*) = 0.18% / 6.65% / 13.19% / 25.03%; cond(P*) = 6.22 / 18.23 / 39.59 / 132.13
MA-FISTA                 7     28.20    9     106.79   12    203.07   12    801.79
MA-Coord                 9     5.23     24    225.59   36    951.23   -     >5000
MB-CG-5                  9     15.34    51    87.73    108   198.17   1103  2026.26
MB-CG-D                  8     15.47    21    51.99    39    132.38   171   1584.14
MB-LBFGS                 34    18.27    111   80.02    178   111.49   548   384.30
ALM                      247   617.63   252   639.49   186   462.34   734   1826.29


λ                        0.5            0.1            0.05           0.01
algorithm                iter  time     iter  time     iter  time     iter  time

n = 1000, Card(Σ⁻¹) = 11%; card(P*) = 0.10% / 8.18% / 18.38% / 36.34%; cond(P*) = 4.20 / 11.75 / 26.75 / 106.34
MA-FISTA                 4     9.03     7     72.21    10    156.46   22    554.08
MA-Coord                 4     2.23     12    79.71    19    408.62   49    4837.46
MB-CG-5                  3     4.70     20    35.85    47    83.42    681   1778.88
MB-CG-D                  3     4.61     12    26.87    27    78.98    148   2055.44
MB-LBFGS                 8     4.29     67    40.31    124   82.51    397   297.90
ALM                      113   283.99   99    255.79   106   267.02   577   1448.83

n = 2000, Card(Σ⁻¹) = 1%; card(P*) = 0.13% / 1.75% / 4.33% / 14.68%; cond(P*) = 7.41 / 23.71 / 46.54 / 134.54
MA-FISTA                 8     264.94   10    1039.08  10    1490.37  -     >5000
MA-Coord                 14    54.33    34    1178.07  -     >5000    -     >5000
MB-CG-5                  13    187.41   78    896.24   203   2394.95  -     >5000
MB-CG-D                  9     127.11   27    532.15   43    1038.26  -     >5000
MB-LBFGS                 41    115.13   155   497.31   254   785.36   610   2163.12
ALM                      -     >5000    -     >5000    -     >5000    -     >5000

n = 2000, Card(Σ⁻¹) = 18.7%; card(P*) = 0.05% / 1.49% / 10.51% / 31.68%; cond(P*) = 2.32 / 4.72 / 17.02 / 79.61
MA-FISTA                 P* = P0        7     153.18   9     694.93   12    2852.86
MA-Coord                 P* = P0        7     71.55    13    1152.86  -     >5000
MB-CG-5                  P* = P0        6     71.54    21    250.11   397   4766.69
MB-CG-D                  P* = P0        6     75.82    13    188.93   110   5007.83
MB-LBFGS                 P* = P0        26    78.34    71    232.23   318   1125.67
ALM                      52    874.22   76    1262.83  106   1800.67  -     >5000


Although QUIC or MA-Coord beats the other algorithms when the solution is very sparse, the situation can be reversed by detecting the diagonal blocks.

λ = 0.5
problem                                         algorithm          iter   time
n = 500,  Card(Σ⁻¹) = 2.4%,  card(P*) = 0.74%   MA-Coord           21     3.86
                                                MB-CG-D (blocks)   -      2.34
n = 500,  Card(Σ⁻¹) = 20.1%, card(P*) = 0.21%   MA-Coord           4      0.42
                                                MB-CG-D (blocks)   -      0.01
n = 1000, Card(Σ⁻¹) = 3.5%,  card(P*) = 0.18%   MA-Coord           9      5.23
                                                MB-CG-D (blocks)   -      0.55
n = 1000, Card(Σ⁻¹) = 11%,   card(P*) = 0.10%   MA-Coord           4      2.23
                                                MB-CG-D (blocks)   3      0.02
n = 2000, Card(Σ⁻¹) = 1%,    card(P*) = 0.13%   MA-Coord           14     54.33
                                                MB-CG-D (blocks)   -      33.29


So Long And Thanks For All The Fish

Thanks to everyone for coming to the talk and for lasting to the very end! If you find any typos, I would greatly appreciate it if you let me know.
