
Page 1:

Semidefinite Optimization

with Applications in Sparse Multivariate Statistics

Alexandre d’Aspremont, ORFE, Princeton University

Joint work with L. El Ghaoui, M. Jordan, V. Krishnamurthy, G. Lanckriet, R. Luss and Nathan Srebro.

Support from NSF and Google.

A. d’Aspremont, BIRS, Banff, January 2007.

Page 2:

Introduction

Semidefinite Programming:

• Essentially: linear programming over positive semidefinite matrices.

• Sounds very specialized but has applications everywhere (often non-obvious)...

• One example here: convex relaxations of combinatorial problems.

Sparse Multivariate Statistics:

• Sparse variants of PCA, SVD, etc. are combinatorial problems.

• Efficient relaxations using semidefinite programming.

• Solve realistically large problems.

Page 3:

Linear Programming

A linear program (LP) is written:

    minimize    c^T x
    subject to  Ax = b
                x ≥ 0

its dual is another LP:

    maximize    b^T y
    subject to  A^T y ≤ c

• Here, x ≥ 0 means that the vector x ∈ R^n has nonnegative coefficients.

• First solved using the simplex algorithm (worst-case exponential complexity).

• Using interior point methods, complexity is O(n^3.5).

Page 4:

Semidefinite Programming

A semidefinite program (SDP) is written:

    minimize    Tr(CX)
    subject to  Tr(A_i X) = b_i,   i = 1, ..., m
                X ⪰ 0

its dual is:

    maximize    b^T y
    subject to  Σ_i y_i A_i ⪯ C

• Here, X ⪰ 0 means that the matrix X ∈ S^n is positive semidefinite.

• Nesterov & Nemirovskii (1994) extended the complexity analysis of interior point methods used for solving LPs to semidefinite programs (and others).

• Complexity is O(n^4.5) when m ∼ n (see Ben-Tal & Nemirovski (2001)); it is harder to exploit problem structure such as sparsity or low-rank matrices.
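As a concrete illustration (our addition, not from the slides), here is a minimal sketch of the primal SDP above using the CVXPY modeling package; the data C, A_i, b_i are random placeholders, constructed so the problem is feasible and bounded:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, m = 5, 3

# Placeholder data: C psd so the objective is bounded below,
# and b generated from a known feasible point X0 = I.
M = rng.standard_normal((n, n))
C = M @ M.T
A = []
for _ in range(m):
    Ai = rng.standard_normal((n, n))
    A.append((Ai + Ai.T) / 2)
b = np.array([np.trace(Ai) for Ai in A])  # Tr(A_i X0) with X0 = I

# Primal SDP: minimize Tr(CX)  s.t.  Tr(A_i X) = b_i,  X psd.
X = cp.Variable((n, n), symmetric=True)
prob = cp.Problem(cp.Minimize(cp.trace(C @ X)),
                  [X >> 0] + [cp.trace(A[i] @ X) == b[i] for i in range(m)])
prob.solve()
print("optimal value:", prob.value)
```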

Page 5:

Outline

• Two classic relaxation tricks

◦ Semidefinite relaxations and the lifting technique

◦ The l1 heuristic

• Applications

◦ Covariance selection

◦ Sparse PCA, SVD

◦ Sparse nonnegative matrix factorization

• Solving large-scale semidefinite programs

◦ First-order methods

◦ Numerical performance

Page 6:

Semidefinite relaxations

Easy & Hard Problems...

Classical view on complexity:

• linear is easy

• nonlinear is hard(er)

Correct view:

• convex is easy

• nonconvex is hard(er)

Page 7:

Convex Optimization

Problem format:

    minimize    f_0(x)
    subject to  f_1(x) ≤ 0, ..., f_m(x) ≤ 0

where x ∈ R^n is the optimization variable and the f_i : R^n → R are convex.

• includes LS, LP, QP, and many others

• like LS, LP, and QP, convex problems are fundamentally tractable

(cf. ellipsoid method)

Nonconvexity makes problems essentially intractable...

• Sometimes the result of bad problem formulation

• However, often arises because of some natural limitation: fixed transaction costs, binary communications, ...

We can use convex optimization results to find bounds on the optimal value and approximate solutions by relaxation.

Page 8:

Basic Problem

• We focus here on a specific class of problems: Quadratically Constrained Quadratic Programs (QCQP).

• Vast range of applications...

A QCQP can be written:

    minimize    x^T P_0 x + q_0^T x + r_0
    subject to  x^T P_i x + q_i^T x + r_i ≤ 0,   i = 1, ..., m

• If all the P_i are positive semidefinite, this is a convex problem: easy.

• Here, we suppose that at least one P_i is not p.s.d.

Page 9:

Example: Partitioning Problem

Two-way partitioning problem:

    minimize    x^T W x
    subject to  x_i^2 = 1,   i = 1, ..., n

where W ∈ S^n, with W_ii = 0. A QCQP in the variable x ∈ R^n.

• A feasible x corresponds to the partition

    {1, ..., n} = {i | x_i = −1} ∪ {i | x_i = 1}.

• The matrix coefficient W_ij can be interpreted as the cost of having the elements i and j in the same partition.

• The objective is to find the partition with least total cost.

• Classic particular instance: MAXCUT (W_ij ≥ 0).

Page 10:

Semidefinite Relaxation

The original QCQP:

    minimize    x^T P_0 x + q_0^T x + r_0
    subject to  x^T P_i x + q_i^T x + r_i ≤ 0,   i = 1, ..., m

can be rewritten:

    minimize    Tr(X P_0) + q_0^T x + r_0
    subject to  Tr(X P_i) + q_i^T x + r_i ≤ 0,   i = 1, ..., m
                X = x x^T.

This is the same problem (lifted in S^n).

Page 11:

Semidefinite Relaxation

We can replace X = x x^T by X ⪰ x x^T, Rank(X) = 1, so this is again:

    minimize    Tr(X P_0) + q_0^T x + r_0
    subject to  Tr(X P_i) + q_i^T x + r_i ≤ 0,   i = 1, ..., m
                X ⪰ x x^T,  Rank(X) = 1

The constraint X ⪰ x x^T is a Schur complement constraint and is convex. The only remaining nonconvex constraint is now Rank(X) = 1. We simply drop it and solve:

    minimize    Tr(X P_0) + q_0^T x + r_0
    subject to  Tr(X P_i) + q_i^T x + r_i ≤ 0,   i = 1, ..., m

                [ X    x ]
                [ x^T  1 ]  ⪰ 0

This is a semidefinite program in the variables X ∈ S^n and x ∈ R^n.

Page 12:

Semidefinite Relaxation

The original QCQP:

    minimize    x^T P_0 x + q_0^T x + r_0
    subject to  x^T P_i x + q_i^T x + r_i ≤ 0,   i = 1, ..., m

was relaxed as:

    minimize    Tr(X P_0) + q_0^T x + r_0
    subject to  Tr(X P_i) + q_i^T x + r_i ≤ 0,   i = 1, ..., m

                [ X    x ]
                [ x^T  1 ]  ⪰ 0

• The relaxed problem is convex and can be solved efficiently.

• The optimal value of the SDP is a lower bound on the optimal value of the original problem.

Page 13:

Semidefinite Relaxation: Partitioning

The partitioning problem defined above was a QCQP:

    minimize    x^T W x
    subject to  x_i^2 = 1,   i = 1, ..., n

There are only quadratic terms, so the variable x disappears from the relaxation, which becomes:

    minimize    Tr(W X)
    subject to  X ⪰ 0
                X_ii = 1,   i = 1, ..., n

• These relaxations only provide a lower bound on the optimal value.

• If Rank(X) = 1 at the optimum, X = x x^T and the relaxation is tight.

• How can we compute good feasible points otherwise?

• One solution: take the dominant eigenvector of X and project it on {−1, 1}.
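A sketch of this relaxation and rounding step (our illustration, with CVXPY and a random cost matrix W standing in for real data):

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
n = 10
W = rng.standard_normal((n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0)  # W_ii = 0, as on the slide

# Relaxation: minimize Tr(WX)  s.t.  X psd, X_ii = 1.
X = cp.Variable((n, n), symmetric=True)
prob = cp.Problem(cp.Minimize(cp.trace(W @ X)), [X >> 0, cp.diag(X) == 1])
prob.solve()
print("lower bound:", prob.value)

# Rounding: project the dominant eigenvector of X onto {-1, 1}.
w, V = np.linalg.eigh(X.value)   # eigenvalues in ascending order
x = np.sign(V[:, -1])
x[x == 0] = 1.0
print("feasible partition cost:", x @ W @ x)
```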

Page 14:

Randomization

The original QCQP:

    minimize    x^T P_0 x + q_0^T x + r_0
    subject to  x^T P_i x + q_i^T x + r_i ≤ 0,   i = 1, ..., m

was relaxed into:

    minimize    Tr(X P_0) + q_0^T x + r_0
    subject to  Tr(X P_i) + q_i^T x + r_i ≤ 0,   i = 1, ..., m
                X ⪰ x x^T

• The last constraint means X − x x^T is a covariance matrix...

• Pick y as a Gaussian variable with y ∼ N(x, X − x x^T); y will solve the QCQP “on average” over this distribution, in other words:

    minimize    E[y^T P_0 y + q_0^T y + r_0]
    subject to  E[y^T P_i y + q_i^T y + r_i] ≤ 0,   i = 1, ..., m

• A good feasible point can then be obtained by drawing enough samples y...
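For the partitioning example, where the lifted mean is x = 0, this sampling step could look like the sketch below (our illustration; W and X are assumed to come from the relaxation two slides back):

```python
import numpy as np

def randomized_rounding(W, X, n_samples=1000, seed=None):
    """Sample y ~ N(0, X) and round each sample to {-1, 1}; keep the best.

    X is the (psd, unit-diagonal) solution of the partitioning relaxation.
    """
    rng = np.random.default_rng(seed)
    n = W.shape[0]
    best_x, best_val = None, np.inf
    for y in rng.multivariate_normal(np.zeros(n), X, size=n_samples):
        x = np.sign(y)
        x[x == 0] = 1.0
        val = x @ W @ x
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val
```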

Page 15:

Outline

• Two classic relaxation tricks

◦ Semidefinite relaxations and the lifting technique

◦ The l1 heuristic

• Applications

◦ Covariance selection

◦ Sparse PCA, SVD

◦ Sparse nonnegative matrix factorization

• Solving large-scale semidefinite programs

◦ First-order methods

◦ Numerical performance

Page 16:

The l1 heuristic

Start from a linear system:

    Ax = b

with A ∈ R^{m×n} where m < n. We look for a sparse solution:

    minimize    Card(x)
    subject to  Ax = b.

If the solution set is bounded, this can be formulated as a Mixed Integer Linear Program:

    minimize    1^T u
    subject to  Ax = b
                |x| ≤ B u
                u ∈ {0, 1}^n.

This is a hard problem...

Page 17:

l1 relaxation

Assuming |x| ≤ 1, we can replace:

    Card(x) = Σ_{i=1}^n 1_{x_i ≠ 0}

with

    ‖x‖_1 = Σ_{i=1}^n |x_i|

Graphically, assuming x ∈ [−1, 1], this is:

[Plot on [−1, 1]: the step function Card(x) jumps from 0 to 1 at x = 0, with |x| below it as a convex lower bound]

The l1 norm is the largest convex lower bound on Card(x) in [−1, 1].

Page 18:

l1 relaxation

    minimize    Card(x)
    subject to  Ax = b

becomes:

    minimize    ‖x‖_1
    subject to  Ax = b

• The relaxed problem is a linear program.

• This trick can be used for other problems (cf. the minimum rank result from Fazel, Hindi & Boyd (2001)).

• Candes & Tao (2005) or Donoho & Tanner (2005) show that if there is a sufficiently sparse solution, it is optimal and the relaxation is tight. (This result only works in the linear case.)
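A quick illustration (our addition): the relaxed problem is an LP and can be solved directly, here with CVXPY on a random underdetermined system with a planted sparse solution:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(2)
m, n, k = 40, 100, 5

# Plant a k-sparse solution x0 and set b = A x0.
A = rng.standard_normal((m, n))
x0 = np.zeros(n)
x0[rng.choice(n, size=k, replace=False)] = rng.standard_normal(k)
b = A @ x0

# l1 relaxation: minimize ||x||_1 subject to Ax = b.
x = cp.Variable(n)
prob = cp.Problem(cp.Minimize(cp.norm(x, 1)), [A @ x == b])
prob.solve()
print("recovery error:", np.linalg.norm(x.value - x0))
```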

Page 19:

l1 relaxation

The original problem in MILP format:

    minimize    1^T u
    subject to  Ax = b
                |x| ≤ B u
                u ∈ {0, 1}^n,

can be reformulated as a (nonconvex) QCQP:

    minimize    1^T u
    subject to  Ax = b
                −x ≤ B u,   x ≤ B u
                u_i^2 = u_i,   i = 1, ..., n.

• We could also formulate a semidefinite relaxation.

• Lemarechal & Oustry (1999) show that this relaxation is equivalent to relaxing u ∈ {0, 1}^n as u ∈ [0, 1]^n, which is exactly the l1 heuristic.

Page 20:

Outline

• Two classic relaxation tricks

◦ Semidefinite relaxations and the lifting technique

◦ The l1 heuristic

• Applications

◦ Covariance selection

◦ Sparse PCA, SVD

◦ Sparse nonnegative matrix factorization

• Solving large-scale semidefinite programs

◦ First-order methods

◦ Numerical performance

Page 21:

Covariance Selection

We estimate a sample covariance matrix Σ from empirical data...

• Objective: infer dependence relationships between variables.

• We want this information to be as sparse as possible.

• Basic solution: look at the magnitude of the covariance coefficients:

    |Σ_ij| > β  ⇔  variables i and j are related,

and simply threshold smaller coefficients to zero. (The result is not always psd.)

We can do better...

Page 22:

Covariance Selection

Following Dempster (1972), look for zeros in the inverse covariance matrix:

• Parsimony. Suppose that we are estimating a Gaussian density:

    f(x, Σ) = (1/2π)^{p/2} (1/det Σ)^{1/2} exp(−(1/2) x^T Σ^{-1} x),

a sparse inverse matrix Σ^{-1} corresponds to a sparse representation of the density f as a member of an exponential family of distributions:

    f(x, Σ) = exp(α_0 + t(x) + α_11 t_11(x) + ... + α_rs t_rs(x))

with here t_ij(x) = x_i x_j and α_ij = Σ^{-1}_ij.

• Dempster (1972) calls Σ^{-1}_ij a concentration coefficient.

There is more...

Page 23:

Covariance Selection

Conditional independence:

• Suppose X, Y, Z are jointly normal with covariance matrix Σ, with

    Σ = [ Σ_11  Σ_12 ]
        [ Σ_21  Σ_22 ]

where Σ_11 ∈ R^{2×2} and Σ_22 ∈ R.

• Conditioned on Z, the variables X, Y are still normally distributed, with covariance matrix C given by:

    C = Σ_11 − Σ_12 Σ_22^{-1} Σ_21 = ((Σ^{-1})_11)^{-1}

• So X and Y are conditionally independent iff (Σ^{-1})_11 is diagonal, which is also:

    (Σ^{-1})_xy = 0

Page 24:

Covariance Selection

• Suppose we have iid noise ε_i ∼ N(0, 1) and the following linear model:

    x = z + ε_1
    y = z + ε_2
    z = ε_3

• Graphically, this is: [graph with edges Z–X and Z–Y, and no edge between X and Y]

Page 25:

Covariance Selection

• The covariance matrix and inverse covariance matrix are given by:

    Σ = [ 2  1  1 ]        Σ^{-1} = [  1   0  −1 ]
        [ 1  2  1 ]                 [  0   1  −1 ]
        [ 1  1  1 ]                 [ −1  −1   3 ]

• The inverse covariance matrix has (Σ^{-1})_12 = 0, clearly showing that the variables x and y are independent conditioned on z.

• Graphically, this is again: [the graph with edges Z–X and Z–Y only, versus the fully connected graph on X, Y, Z]
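A short check of this example (our addition, not on the slide): build Σ from the linear model and confirm the zero pattern of its inverse:

```python
import numpy as np

# Covariance of (x, y, z) for x = z + e1, y = z + e2, z = e3, e_i iid N(0, 1).
Sigma = np.array([[2.0, 1.0, 1.0],
                  [1.0, 2.0, 1.0],
                  [1.0, 1.0, 1.0]])
P = np.linalg.inv(Sigma)
print(np.round(P, 6))  # the (0, 1) entry is 0: x, y independent given z
```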

Page 26:

Covariance Selection

On a slightly larger scale...

[Two dependence graphs on forward rates with maturities 0.5Y to 10Y: a dense graph before (left) and a sparse graph after (right)]

Page 27:

Applications & Related Work

• Gene expression data. The sample data is composed of gene expression vectors and we want to isolate links in the expression of various genes. See Dobra, Hans, Jones, Nevins, Yao & West (2004), Dobra & West (2004) for example.

• Speech Recognition. See Bilmes (1999), Bilmes (2000) or Chen & Gopinath (1999).

• Finance. Covariance estimation.

• Related work by Dahl, Roychowdhury & Vandenberghe (2005): interior point methods for large, sparse MLE.

Page 28:

Maximum Likelihood Estimation

• We can estimate Σ by solving the following maximum likelihood problem:

    max_{X ∈ S^n}  log det X − Tr(SX)

• This problem is convex and has an explicit solution X = S^{-1} if S ≻ 0.

• Problem here: how do we make Σ^{-1} sparse?

• In other words, how do we efficiently choose the index sets I and J of zero coefficients?

• Solution: penalize the MLE.

Page 29:

AIC and BIC

Original solution in Akaike (1973): penalize the likelihood function:

    max_{X ∈ S^n}  log det X − Tr(SX) − ρ Card(X)

where Card(X) is the number of nonzero elements in X.

• Set ρ = 2/(m + 1) for AIC and ρ = log(m + 1)/(m + 1) for BIC.

• We can form a convex relaxation of the AIC or BIC penalized MLE by replacing Card(X) with ‖X‖_1 = Σ_ij |X_ij| to solve:

    max_{X ∈ S^n}  log det X − Tr(SX) − ρ ‖X‖_1

Again, the classic l1 heuristic: ‖X‖_1 is a convex lower bound on Card(X).
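This penalized MLE is now standard; as a pointer (our addition, not from the slides), scikit-learn's GraphicalLasso solves the same l1-penalized problem, here on data simulated from the earlier three-variable example (its alpha parameter plays the role of ρ):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(3)
N = 2000  # number of samples

# Simulate the earlier model: x = z + e1, y = z + e2, z = e3.
z = rng.standard_normal(N)
data = np.column_stack([z + rng.standard_normal(N),
                        z + rng.standard_normal(N),
                        z])

model = GraphicalLasso(alpha=0.05).fit(data)
print(np.round(model.precision_, 2))  # the (x, y) entry is driven to zero
```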

Page 30:

Robustness

• This penalized MLE problem can be rewritten:

    max_{X ∈ S^n}  min_{|U_ij| ≤ ρ}  log det X − Tr((S + U)X)

• This can be interpreted as a robust MLE problem with componentwise noise of magnitude ρ on the elements of S.

• The relaxed sparsity requirement is equivalent to a robustification.

Page 31:

Outline

• Two classic relaxation tricks

◦ Semidefinite relaxations and the lifting technique

◦ The l1 heuristic

• Applications

◦ Covariance selection

◦ Sparse PCA, SVD

◦ Sparse nonnegative matrix factorization

• Solving large-scale semidefinite programs

◦ First-order methods

◦ Numerical performance

Page 32:

Sparse Principal Component Analysis

Principal Component Analysis (PCA): classic tool in multivariate data analysis

• Input: a covariance matrix A

• Output: a sequence of factors ranked by variance

• Each factor is a linear combination of the problem variables

Typical use: dimensionality reduction.

Numerically, just an eigenvalue decomposition of the covariance matrix:

    A = Σ_{i=1}^n λ_i x_i x_i^T

Page 33:

Sparse Principal Component Analysis

Computing factors amounts to solving:

    maximize    x^T A x
    subject to  ‖x‖_2 = 1.

This problem is easy: its optimal value is λ_max(A), attained at the leading eigenvector x_1. Here, however, we want a little bit more...

We look for a sparse solution and solve instead:

    maximize    x^T A x
    subject to  ‖x‖_2 = 1
                Card(x) ≤ k,

where Card(x) denotes the cardinality (number of nonzero elements) of x. This is nonconvex and numerically hard.

Page 34:

Related literature

Previous work:

• Cadima & Jolliffe (1995): the loadings with small absolute value are thresholded to zero.

• A non-convex method called SCoTLASS by Jolliffe, Trendafilov & Uddin(2003). (Same problem formulation)

• Zou, Hastie & Tibshirani (2004): a regression-based technique called SPCA, based on a representation of PCA as a regression problem. Sparsity is obtained using the LASSO (Tibshirani (1996)), an l1 norm penalty.

Performance:

• These methods are either very suboptimal (thresholding) or lead to nonconvex optimization problems (SPCA).

• Regression: works for very large scale examples.

Page 35:

Semidefinite relaxation

Start from:

    max         x^T A x
    subject to  ‖x‖_2 = 1
                Card(x) ≤ k,

Let X = x x^T, and write everything in terms of the matrix X:

    max         Tr(AX)
    subject to  Tr(X) = 1
                Card(X) ≤ k^2
                X = x x^T,

Replace X = x x^T by the equivalent X ⪰ 0, Rank(X) = 1:

    max         Tr(AX)
    subject to  Tr(X) = 1
                Card(X) ≤ k^2
                X ⪰ 0,  Rank(X) = 1,

again, this is the same problem.

Page 36:

Semidefinite relaxation

Numerically, this is still hard:

• The constraint Card(X) ≤ k^2 is still non-convex

• So is the constraint Rank(X) = 1

However, we have made some progress:

• The objective Tr(AX) is now linear in X

• The (non-convex) constraint ‖x‖_2 = 1 became a linear constraint Tr(X) = 1.

We still need to relax the two non-convex constraints above:

• If u ∈ R^p and Card(u) = q, then ‖u‖_1 ≤ √q ‖u‖_2. So we can replace Card(X) ≤ k^2 by the weaker (but convex) constraint: 1^T |X| 1 ≤ k

• Simply drop the rank constraint

Page 37:

Semidefinite relaxation

Semidefinite relaxation combined with l1 heuristic:

    max         x^T A x
    subject to  ‖x‖_2 = 1
                Card(x) ≤ k,

becomes:

    max         Tr(AX)
    subject to  Tr(X) = 1
                1^T |X| 1 ≤ k
                X ⪰ 0,

• This is a semidefinite program in the variable X ∈ S^n, with polynomial complexity...

• Small problem instances can be solved using SEDUMI by Sturm (1999) or SDPT3 by Toh, Todd & Tutuncu (1999).

• This semidefinite program has O(n^2) dense constraints on the matrix, and we want to solve large problems with n ∼ 10^3.

Can’t use interior point methods...
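For small instances, the relaxation can still be written directly in a modeling language; a sketch in CVXPY (our illustration, with a random sample covariance standing in for A):

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(4)
n, k = 20, 5
B = rng.standard_normal((50, n))
A = B.T @ B / 50  # sample covariance, stand-in data

# Relaxation: max Tr(AX)  s.t.  Tr(X) = 1,  1^T|X|1 <= k,  X psd.
X = cp.Variable((n, n), symmetric=True)
prob = cp.Problem(cp.Maximize(cp.trace(A @ X)),
                  [X >> 0, cp.trace(X) == 1, cp.sum(cp.abs(X)) <= k])
prob.solve()

# The dominant eigenvector of X gives an (approximately) sparse loading.
w, V = np.linalg.eigh(X.value)
print(np.round(V[:, -1], 3))
```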

Page 38:

Robustness & Tightness

Robustness. The penalized problem can be written:

    min_{|U_ij| ≤ ρ}  λ_max(A + U)

Natural interpretation: a robust maximum eigenvalue problem with componentwise noise of magnitude ρ on the coefficients of the matrix A.

Tightness. The KKT optimality conditions are here:

    (A + U)X = λ_max(A + U) X
    U ∘ X = ρ |X|
    Tr(X) = 1,  X ⪰ 0
    |U_ij| ≤ ρ,   i, j = 1, ..., n.

The first order condition means that if λ_max(A + U) is simple, then Rank(X) = 1, so the relaxation is tight: the solution to the relaxed problem is also a global optimum for the original combinatorial problem.

Page 39:

Sparse Singular Value Decomposition

A similar reasoning involves a non-square m × n matrix A, and the problem

    max         u^T A v
    subject to  ‖u‖_2 = ‖v‖_2 = 1
                Card(u) ≤ k_1,  Card(v) ≤ k_2,

in the variables (u, v) ∈ R^m × R^n, where k_1 ≤ m, k_2 ≤ n are fixed. This is relaxed as:

    max         Tr(A^T X_12)
    subject to  1^T |X_ii| 1 ≤ k_i,   i = 1, 2
                1^T |X_12| 1 ≤ √(k_1 k_2)
                X ⪰ 0,  Tr(X_ii) = 1

in the variable X ∈ S^{m+n} with blocks X_ij for i, j = 1, 2, using the fact that the eigenvalues of the matrix:

    [ 0    A ]
    [ A^T  0 ]

are {σ_i, ..., −σ_i, ...}, where the σ_i are the singular values of the matrix A ∈ R^{m×n}.

Page 40:

Nonnegative Matrix Factorization

Direct extension of the sparse PCA result... Solving:

    max         u^T A v
    subject to  ‖u‖_2 = ‖v‖_2 = 1
                Card(u) ≤ k_1,  Card(v) ≤ k_2,

also solves:

    min         ‖A − u v^T‖_F
    subject to  Card(u) ≤ k_1
                Card(v) ≤ k_2,

So, by adding constraints on u and v, we can use the previous result to form a relaxation for the Nonnegative Matrix Factorization problem:

    max         Tr(A^T X_12)
    subject to  1^T |X_ii| 1 ≤ k_i,   i = 1, 2
                1^T |X_12| 1 ≤ √(k_1 k_2)
                X ⪰ 0,  Tr(X_ii) = 1
                X_ij ≥ 0,

Caveat: this only works for rank-one factorizations...

Page 41:

Outline

• Two classic relaxation tricks

◦ Semidefinite relaxations and the lifting technique

◦ The l1 heuristic

• Applications

◦ Covariance selection

◦ Sparse PCA, SVD

◦ Sparse nonnegative matrix factorization

• Solving large-scale semidefinite programs

◦ First-order methods

◦ Numerical performance

Page 42:

Outline

Most of our problems are dense, with n ∼ 10^3.

Solver options:

• Interior point methods fail beyond n ∼ 400.

• Projected subgradient: extremely slow.

• Bundle method (see Helmberg & Rendl (2000)): a bit faster, but can’t take advantage of the box-like structure of the feasible set. Convergence in O(1/ε^2).

Page 43:

First order algorithm

Complexity options...

                  First-order    Smooth     Newton (IP)
    Memory        O(n)           O(n)       O(n^2)
    Complexity    O(1/ε^2)       O(1/ε)     O(log(1/ε))

Page 44:

First order algorithm

Here, we can exploit problem structure

• Our problems here have a min-max structure. For sparse PCA:

    min_{|U_ij| ≤ ρ}  λ_max(A + U) = min_{|U_ij| ≤ ρ}  max_{X ∈ S^n, Tr(X) = 1, X ⪰ 0}  Tr((A + U)X)

• This min-max structure means that we can use the prox function algorithms of Nesterov (2005) (see also Nemirovski (2004)) to solve large, dense problem instances.

Page 45:

First order algorithm

Solve:

    min_{x ∈ Q_1}  f(x)

• Starts from a particular min-max model of the problem:

    f(x) = f̂(x) + max_u { ⟨Tx, u⟩ − φ(u) : u ∈ Q_2 }

• assuming that:

  ◦ f is defined over a compact convex set Q_1 ⊂ R^n

  ◦ f̂(x) is convex, differentiable and has a Lipschitz continuous gradient with constant M ≥ 0

  ◦ T is a linear operator: T ∈ R^{n×n}

  ◦ φ(u) is a continuous convex function over some compact set Q_2 ⊂ R^n.

Page 46:

First order algorithm

If the problem has a min-max model, two steps:

• Regularization. Add a strongly convex penalty inside the min-max representation to produce an ε-approximation of f with Lipschitz continuous gradient (a generalized Moreau-Yosida regularization step, see Lemarechal & Sagastizabal (1997) for example).

• Optimal first order minimization. Use the optimal first order scheme for Lipschitz continuous functions detailed in Nesterov (1983) to solve the regularized problem.

Benefits:

• For a fixed problem size, the number of iterations required to get an ε-solution is O(1/ε), compared to O(1/ε^2) for generic first-order methods.

• Low memory requirements: a change in the granularity of the solver: a larger number of cheaper iterations.

Caveat: only efficient if the subproblems involved in these steps can be solved explicitly or extremely efficiently...

Page 47:

First order algorithm

Regularization. We can find a uniform ε-approximation of λ_max(X) with Lipschitz continuous gradient. Let µ > 0 and X ∈ S^n; we define:

    f_µ(X) = µ log Tr(exp(X/µ))

which requires computing a matrix exponential at a numerical cost of O(n^3). We then have:

    λ_max(X) ≤ f_µ(X) ≤ λ_max(X) + µ log n,

so if we set µ = ε/log n, f_µ(X) becomes a uniform ε-approximation of λ_max(X), and f_µ(X) has a Lipschitz continuous gradient with constant:

    L = 1/µ = (log n)/ε.

The gradient ∇f_µ(X) can also be computed explicitly as:

    exp((X − λ_max(X) I)/µ) / Tr(exp((X − λ_max(X) I)/µ))

using the same matrix exponential.
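A direct NumPy/SciPy sketch of this smoothing (our illustration; the shift by λ_max(X) is the same numerical-stability trick used for the gradient above):

```python
import numpy as np
from scipy.linalg import expm, eigh

def f_mu(X, mu):
    """Return f_mu(X) = mu * log Tr exp(X/mu) and its gradient."""
    n = X.shape[0]
    lmax = eigh(X, eigvals_only=True)[-1]
    # Tr exp(X/mu) = e^{lmax/mu} * Tr exp((X - lmax I)/mu): shift for stability.
    E = expm((X - lmax * np.eye(n)) / mu)
    return lmax + mu * np.log(np.trace(E)), E / np.trace(E)

rng = np.random.default_rng(5)
n, eps = 50, 1e-2
A = rng.standard_normal((n, n))
A = (A + A.T) / 2
mu = eps / np.log(n)
val, grad = f_mu(A, mu)
lmax = eigh(A, eigvals_only=True)[-1]
print(lmax, "<=", val, "<=", lmax + mu * np.log(n))  # uniform eps-approximation
```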

Page 48:

First order algorithm

Optimal first-order minimization. The minimization algorithm in Nesterov (1983) then involves the following steps:

Choose ε > 0 and set X_0 = β I_n. For k = 0, ..., N(ε) do:

1. Compute ∇f_ε(X_k).

2. Find Y_k = argmin_Y { Tr(∇f_ε(X_k)(Y − X_k)) + (1/2) L_ε ‖Y − X_k‖_F^2 : Y ∈ Q_1 }.

3. Find Z_k = argmin_X { L_ε β^2 d_1(X) + Σ_{i=0}^k ((i+1)/2) Tr(∇f_ε(X_i)(X − X_i)) : X ∈ Q_1 }.

4. Update X_{k+1} = (2/(k+3)) Z_k + ((k+1)/(k+3)) Y_k.

5. Test if the gap is less than the target precision.

• Step 1 requires computing a matrix exponential.

• Steps 2 and 3 are both Euclidean projections on Q_1 = {U : |U_ij| ≤ ρ}.
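Both projections thus reduce to entrywise clipping onto this box; a minimal sketch of step 2 (our illustration, assuming the gradient G = ∇f_ε(X_k) and the Lipschitz constant L are given):

```python
import numpy as np

def project_box(U, rho):
    """Euclidean projection onto Q1 = {U : |U_ij| <= rho}: clip each entry."""
    return np.clip(U, -rho, rho)

def step2(X_k, G, L, rho):
    """Y_k = argmin_{Y in Q1} Tr(G (Y - X_k)) + (L/2)||Y - X_k||_F^2,
    i.e. a gradient step of length 1/L followed by projection onto the box."""
    return project_box(X_k - G / L, rho)
```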

Page 49:

First order algorithm

Complexity:

• The number of iterations to get accuracy ε is

    O(n √(log n) / ε).

• At each iteration, the cost of computing a matrix exponential up to machine precision is O(n^3).

Computing matrix exponentials:

• Many options, cf. “Nineteen Dubious Ways to Compute the Exponential of a Matrix” by Moler & Van Loan (2003).

• Padé approximation or full eigenvalue decomposition: O(n^3) up to machine precision.

• In practice, machine precision is unnecessary...

Page 50:

First order algorithm

In d’Aspremont (2005): when minimizing a function with Lipschitz-continuous gradient using the method in Nesterov (1983), an approximate gradient is sufficient to get the O(1/ε) convergence rate. If the function and gradient approximations f̃, ∇f̃ satisfy:

    |f(x) − f̃(x)| ≤ δ   and   |⟨∇f(x) − ∇f̃(x), y⟩| ≤ δ,   x, y ∈ Q_1,

we have:

    f(x_k) − f(x⋆) ≤ L d(x⋆) / ((k + 1)(k + 2) σ) + 10 δ

where L, d(x⋆) and σ are problem constants.

• Only a few dominant eigs. are required to get the matrix exponential.

• Dominant eigenvalues with ARPACK: cubic convergence.

• Optimal complexity of O(1/ε), with the same cost per iteration as regular methods with complexity O(1/ε^2).

• ARPACK exploits sparsity.
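A sketch of this idea (our illustration): approximate the smoothed gradient using only the k dominant eigenpairs computed by ARPACK, via scipy.sparse.linalg.eigsh; the discarded terms are exponentially small in (λ_max − λ_i)/µ:

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def approx_grad(X, mu, k=10):
    """Approximate grad f_mu(X) = exp(X/mu) / Tr exp(X/mu) from the k
    eigenpairs with largest eigenvalues (ARPACK under the hood)."""
    w, V = eigsh(X, k=k, which="LA")   # k largest algebraic eigenvalues
    e = np.exp((w - w.max()) / mu)     # shifted for numerical stability
    return (V * e) @ V.T / e.sum()
```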

Page 51:

Outline

• Two classic relaxation tricks

◦ Semidefinite relaxations and the lifting technique

◦ The l1 heuristic

• Applications

◦ Covariance selection

◦ Sparse PCA, SVD

◦ Sparse nonnegative matrix factorization

• Solving large-scale semidefinite programs

◦ First-order methods

◦ Numerical performance

Page 52:

Covariance Selection

Forward rates covariance matrix for maturities ranging from 0.5 to 10 years.

[Two dependence graphs on the forward rates: dense at ρ = 0 (left), sparse at ρ = 0.01 (right)]

Page 53:

Zoom...

[Zoom on the sparse dependence graph obtained at ρ = 0.01]

Page 54:

Covariance Selection

[Two plots of Sensitivity versus Specificity comparing Covsel and simple thresholding]

Classification Error. Sensitivity/specificity curves for the solution to the covariance selection problem, compared with a simple thresholding of B^{-1}, for various levels of noise: σ = 0.3 (left) and σ = 0.5 (right). Here n = 50.

Page 55:

Sparse PCA

[Two 3D scatter plots: the data in the dense PCA basis (f_1, f_2, f_3) versus the sparse PCA basis (g_1, g_2, g_3)]

Clustering of the gene expression data in the PCA versus sparse PCA basis with 500 genes. The factors f on the left are dense and each use all 500 genes, while the sparse factors g_1, g_2 and g_3 on the right involve 6, 4 and 4 genes respectively. (Data: Iconix Pharmaceuticals)

Page 56:

Sparse PCA

[Two 2D scatter plots of normal versus cancer samples: factor 3 vs. factor 2 with 500 genes each (left), and factor 3 (1 gene) vs. factor 2 (12 genes) (right)]

PCA clustering (left) & DSPCA clustering (right), colon cancer data set in Alon, Barkai, Notterman, Gish, Ybarra, Mack & Levine (1999).

Page 57:

Smooth first-order vs IP

[Two plots comparing CSDP (interior point) and the smooth first-order method: CPU time in seconds versus problem size n (left), and memory used in bytes versus n (right); CSDP runs out of memory on the larger problems]

Figure 1: CPU time and memory usage versus n.

Page 58:

Sparse PCA

[Two plots of the percentage of eigenvalues required versus CPU time in seconds (left) and versus duality gap (right), for solutions with 14 genes (ρ = 32), 16 genes (ρ = 30), 21 genes (ρ = 28) and 28 genes (ρ = 24)]

Eigenvalues vs. CPU time (left), duality gap vs. eigenvalues (right), on 1000 genes.

Page 59:

Sparse PCA

[Plot of Rand index versus number of nonzeros, comparing PCA and DSPCA]

Sparsity versus Rand index on the colon cancer data set.

Page 60:

Sparse Nonnegative Matrix Factorization

Test the relaxation on a matrix of the form:

    M = x y^T + U

where U is uniform noise.

[Heatmaps of the rank-one matrix x y^T, the noisy matrix M, and the recovered solution, together with the recovered sparse factors x and y^T]

Page 61:

Conclusion

• Semidefinite relaxations of combinatorial problems in multivariate statistics.

• Infer sparse structural information on large datasets.

• Efficient codes can solve problems with 10^3 variables in a few minutes.

Source code and binaries for sparse PCA (DSPCA) and covariance selection (COVSEL) are available at:

www.princeton.edu/∼aspremon

These slides are available at:

www.princeton.edu/∼aspremon/Banff07.pdf

Page 62:

References

Akaike, J. (1973), Information theory and an extension of the maximum likelihood principle, in B. N. Petrov & F. Csaki, eds, ‘Second international symposium on information theory’, Akademiai Kiado, Budapest, pp. 267–281.

Alon, A., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D. & Levine, A. J. (1999), ‘Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays’, Cell Biology 96, 6745–6750.

Ben-Tal, A. & Nemirovski, A. (2001), Lectures on modern convex optimization: analysis, algorithms, and engineering applications, MPS-SIAM series on optimization, Society for Industrial and Applied Mathematics: Mathematical Programming Society, Philadelphia, PA.

Bilmes, J. A. (1999), ‘Natural statistic models for automatic speech recognition’, Ph.D. thesis, UC Berkeley, Dept. of EECS, CS Division.

Bilmes, J. A. (2000), ‘Factored sparse inverse covariance matrices’, IEEE International Conference on Acoustics, Speech, and Signal Processing.

Cadima, J. & Jolliffe, I. T. (1995), ‘Loadings and correlations in the interpretation of principal components’, Journal of Applied Statistics 22, 203–214.

Candes, E. & Tao, T. (2005), ‘Decoding by linear programming’, ArXiv: math.MG/0502327.

Chen, S. S. & Gopinath, R. A. (1999), ‘Model selection in acoustic modeling’, EUROSPEECH.

Dahl, J., Roychowdhury, V. & Vandenberghe, L. (2005), ‘Maximum likelihood estimation of Gaussian graphical models: numerical implementation and topology selection’, UCLA preprint.

d’Aspremont, A. (2005), ‘Smooth optimization for sparse semidefinite programs’, ArXiv: math.OC/0512344.

Dempster, A. (1972), ‘Covariance selection’, Biometrics 28, 157–175.

Dobra, A., Hans, C., Jones, B., Nevins, J. J. R., Yao, G. & West, M. (2004), ‘Sparse graphical models for exploring gene expression data’, Journal of Multivariate Analysis 90(1), 196–212.

Dobra, A. & West, M. (2004), ‘Bayesian covariance selection’, working paper.

Donoho, D. L. & Tanner, J. (2005), ‘Sparse nonnegative solutions of underdetermined linear equations by linear programming’, Proceedings of the National Academy of Sciences 102(27), 9446–9451.

Fazel, M., Hindi, H. & Boyd, S. (2001), ‘A rank minimization heuristic with application to minimum order system approximation’, Proceedings American Control Conference 6, 4734–4739.

Helmberg, C. & Rendl, F. (2000), ‘A spectral bundle method for semidefinite programming’, SIAM Journal on Optimization 10(3), 673–696.

Jolliffe, I. T., Trendafilov, N. & Uddin, M. (2003), ‘A modified principal component technique based on the LASSO’, Journal of Computational and Graphical Statistics 12, 531–547.

Page 63:

Lemarechal, C. & Oustry, F. (1999), ‘Semidefinite relaxations and Lagrangian duality with application to combinatorial optimization’, INRIA, Rapport de recherche 3710.

Lemarechal, C. & Sagastizabal, C. (1997), ‘Practical aspects of the Moreau-Yosida regularization: theoretical preliminaries’, SIAM Journal on Optimization 7(2), 367–385.

Moler, C. & Van Loan, C. (2003), ‘Nineteen dubious ways to compute the exponential of a matrix, twenty-five years later’, SIAM Review 45(1), 3–49.

Nemirovski, A. (2004), ‘Prox-method with rate of convergence O(1/T) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems’, SIAM Journal on Optimization 15(1), 229–251.

Nesterov, Y. (1983), ‘A method of solving a convex programming problem with convergence rate O(1/k^2)’, Soviet Mathematics Doklady 27(2), 372–376.

Nesterov, Y. (2005), ‘Smooth minimization of nonsmooth functions’, Mathematical Programming, Series A 103, 127–152.

Nesterov, Y. & Nemirovskii, A. (1994), Interior-point polynomial algorithms in convex programming, Society for Industrial and Applied Mathematics, Philadelphia.

Sturm, J. (1999), ‘Using SEDUMI 1.0x, a MATLAB toolbox for optimization over symmetric cones’, Optimization Methods and Software 11, 625–653.

Tibshirani, R. (1996), ‘Regression shrinkage and selection via the LASSO’, Journal of the Royal Statistical Society, Series B 58(1), 267–288.

Toh, K. C., Todd, M. J. & Tutuncu, R. H. (1999), ‘SDPT3 – a MATLAB software package for semidefinite programming’, Optimization Methods and Software 11, 545–581.

Zou, H., Hastie, T. & Tibshirani, R. (2004), ‘Sparse principal component analysis’, To appear in Journal of Computational and Graphical Statistics.
