
Page 1:

Semidefinite Optimization

with Applications in Sparse Multivariate Statistics

Alexandre d’Aspremont, ORFE, Princeton University

Joint work with L. El Ghaoui, M. Jordan, V. Krishnamurthy, G. Lanckriet, R. Luss and Nathan Srebro.

Support from NSF and Google.

A. d’Aspremont, BIRS, Banff, January 2007.

Page 2:

Introduction

Semidefinite Programming:

• Essentially: linear programming over positive semidefinite matrices.

• Sounds very specialized but has applications everywhere (often non-obvious)...

• One example here: convex relaxations of combinatorial problems.

Sparse Multivariate Statistics:

• Sparse variants of PCA, SVD, etc. are combinatorial problems.

• Efficient relaxations using semidefinite programming.

• Solve realistically large problems.

Page 3:

Linear Programming

A linear program (LP) is written:

    minimize    c^T x
    subject to  Ax = b
                x ≥ 0

its dual is another LP:

    maximize    b^T y
    subject to  A^T y ≤ c

• Here, x ≥ 0 means that the vector x ∈ R^n has nonnegative coefficients.

• First solved using the simplex algorithm (worst-case exponential complexity).

• Using interior point methods, complexity is O(n^3.5).

Page 4:

Semidefinite Programming

A semidefinite program (SDP) is written:

    minimize    Tr(CX)
    subject to  Tr(A_i X) = b_i,   i = 1, ..., m
                X ⪰ 0

its dual is:

    maximize    b^T y
    subject to  Σ_i y_i A_i ⪯ C

• Here, X ⪰ 0 means that the matrix X ∈ S^n is positive semidefinite.

• Nesterov & Nemirovskii (1994) extended the complexity analysis of interior point methods used for solving LPs to semidefinite programs (and others).

• Complexity is O(n^4.5) when m ∼ n (see Ben-Tal & Nemirovski (2001)); it is harder to exploit problem structure such as sparsity or low-rank matrices.
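As a concrete illustration (our addition, not from the slides), here is a minimal sketch of the primal SDP above using the CVXPY modeling package; the data C, A_i, b_i are random placeholders, constructed so the problem is feasible and bounded:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, m = 5, 3

# Placeholder data: C psd so the objective is bounded below,
# and b generated from a known feasible point X0 = I.
M = rng.standard_normal((n, n))
C = M @ M.T
A = []
for _ in range(m):
    Ai = rng.standard_normal((n, n))
    A.append((Ai + Ai.T) / 2)
b = np.array([np.trace(Ai) for Ai in A])  # Tr(A_i X0) with X0 = I

# Primal SDP: minimize Tr(CX)  s.t.  Tr(A_i X) = b_i,  X psd.
X = cp.Variable((n, n), symmetric=True)
prob = cp.Problem(cp.Minimize(cp.trace(C @ X)),
                  [X >> 0] + [cp.trace(A[i] @ X) == b[i] for i in range(m)])
prob.solve()
print("optimal value:", prob.value)
```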

Page 5:

Outline

• Two classic relaxation tricks

◦ Semidefinite relaxations and the lifting technique

◦ The l1 heuristic

• Applications

◦ Covariance selection

◦ Sparse PCA, SVD

◦ Sparse nonnegative matrix factorization

• Solving large-scale semidefinite programs

◦ First-order methods

◦ Numerical performance

Page 6:

Semidefinite relaxations

Easy & Hard Problems...

Classical view on complexity:

• linear is easy

• nonlinear is hard(er)

Correct view:

• convex is easy

• nonconvex is hard(er)

Page 7:

Convex Optimization

Problem format:

    minimize    f_0(x)
    subject to  f_1(x) ≤ 0, ..., f_m(x) ≤ 0

where x ∈ R^n is the optimization variable and the f_i : R^n → R are convex.

• includes LS, LP, QP, and many others

• like LS, LP, and QP, convex problems are fundamentally tractable

(cf. ellipsoid method)

Nonconvexity makes problems essentially intractable...

• Sometimes the result of bad problem formulation

• However, often arises because of some natural limitation: fixed transaction costs, binary communications, ...

We can use convex optimization results to find bounds on the optimal value and approximate solutions by relaxation.

Page 8:

Basic Problem

• We focus here on a specific class of problems: Quadratically Constrained Quadratic Programs (QCQP).

• Vast range of applications...

A QCQP can be written:

    minimize    x^T P_0 x + q_0^T x + r_0
    subject to  x^T P_i x + q_i^T x + r_i ≤ 0,   i = 1, ..., m

• If all the P_i are positive semidefinite, this is a convex problem: easy.

• Here, we suppose that at least one P_i is not p.s.d.

Page 9:

Example: Partitioning Problem

Two-way partitioning problem:

    minimize    x^T W x
    subject to  x_i^2 = 1,   i = 1, ..., n

where W ∈ S^n, with W_ii = 0. A QCQP in the variable x ∈ R^n.

• A feasible x corresponds to the partition

    {1, ..., n} = {i | x_i = −1} ∪ {i | x_i = 1}.

• The matrix coefficient W_ij can be interpreted as the cost of having the elements i and j in the same partition.

• The objective is to find the partition with least total cost.

• Classic particular instance: MAXCUT (W_ij ≥ 0).

Page 10:

Semidefinite Relaxation

The original QCQP:

    minimize    x^T P_0 x + q_0^T x + r_0
    subject to  x^T P_i x + q_i^T x + r_i ≤ 0,   i = 1, ..., m

can be rewritten:

    minimize    Tr(X P_0) + q_0^T x + r_0
    subject to  Tr(X P_i) + q_i^T x + r_i ≤ 0,   i = 1, ..., m
                X = x x^T.

This is the same problem (lifted in S^n).

Page 11:

Semidefinite Relaxation

We can replace X = x x^T by X ⪰ x x^T, Rank(X) = 1, so this is again:

    minimize    Tr(X P_0) + q_0^T x + r_0
    subject to  Tr(X P_i) + q_i^T x + r_i ≤ 0,   i = 1, ..., m
                X ⪰ x x^T,  Rank(X) = 1

The constraint X ⪰ x x^T is a Schur complement constraint and is convex. The only remaining nonconvex constraint is now Rank(X) = 1. We simply drop it and solve:

    minimize    Tr(X P_0) + q_0^T x + r_0
    subject to  Tr(X P_i) + q_i^T x + r_i ≤ 0,   i = 1, ..., m

                [ X    x ]
                [ x^T  1 ]  ⪰ 0

This is a semidefinite program in the variables X ∈ S^n and x ∈ R^n.

Page 12:

Semidefinite Relaxation

The original QCQP:

    minimize    x^T P_0 x + q_0^T x + r_0
    subject to  x^T P_i x + q_i^T x + r_i ≤ 0,   i = 1, ..., m

was relaxed as:

    minimize    Tr(X P_0) + q_0^T x + r_0
    subject to  Tr(X P_i) + q_i^T x + r_i ≤ 0,   i = 1, ..., m

                [ X    x ]
                [ x^T  1 ]  ⪰ 0

• The relaxed problem is convex and can be solved efficiently.

• The optimal value of the SDP is a lower bound on the optimal value of the original problem.

Page 13:

Semidefinite Relaxation: Partitioning

The partitioning problem defined above was a QCQP:

    minimize    x^T W x
    subject to  x_i^2 = 1,   i = 1, ..., n

There are only quadratic terms, so the variable x disappears from the relaxation, which becomes:

    minimize    Tr(W X)
    subject to  X ⪰ 0
                X_ii = 1,   i = 1, ..., n

• These relaxations only provide a lower bound on the optimal value.

• If Rank(X) = 1 at the optimum, X = x x^T and the relaxation is tight.

• How can we compute good feasible points otherwise?

• One solution: take the dominant eigenvector of X and project it on {−1, 1}.
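A sketch of this relaxation and rounding step (our illustration, with CVXPY and a random cost matrix W standing in for real data):

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
n = 10
W = rng.standard_normal((n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0)  # W_ii = 0, as on the slide

# Relaxation: minimize Tr(WX)  s.t.  X psd, X_ii = 1.
X = cp.Variable((n, n), symmetric=True)
prob = cp.Problem(cp.Minimize(cp.trace(W @ X)), [X >> 0, cp.diag(X) == 1])
prob.solve()
print("lower bound:", prob.value)

# Rounding: project the dominant eigenvector of X onto {-1, 1}.
w, V = np.linalg.eigh(X.value)   # eigenvalues in ascending order
x = np.sign(V[:, -1])
x[x == 0] = 1.0
print("feasible partition cost:", x @ W @ x)
```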

Page 14:

Randomization

The original QCQP:

    minimize    x^T P_0 x + q_0^T x + r_0
    subject to  x^T P_i x + q_i^T x + r_i ≤ 0,   i = 1, ..., m

was relaxed into:

    minimize    Tr(X P_0) + q_0^T x + r_0
    subject to  Tr(X P_i) + q_i^T x + r_i ≤ 0,   i = 1, ..., m
                X ⪰ x x^T

• The last constraint means X − x x^T is a covariance matrix...

• Pick y as a Gaussian variable with y ∼ N(x, X − x x^T); y will solve the QCQP “on average” over this distribution, in other words:

    minimize    E[y^T P_0 y + q_0^T y + r_0]
    subject to  E[y^T P_i y + q_i^T y + r_i] ≤ 0,   i = 1, ..., m

• A good feasible point can then be obtained by drawing enough samples y...
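For the partitioning example, where the lifted mean is x = 0, this sampling step could look like the sketch below (our illustration; W and X are assumed to come from the relaxation two slides back):

```python
import numpy as np

def randomized_rounding(W, X, n_samples=1000, seed=None):
    """Sample y ~ N(0, X) and round each sample to {-1, 1}; keep the best.

    X is the (psd, unit-diagonal) solution of the partitioning relaxation.
    """
    rng = np.random.default_rng(seed)
    n = W.shape[0]
    best_x, best_val = None, np.inf
    for y in rng.multivariate_normal(np.zeros(n), X, size=n_samples):
        x = np.sign(y)
        x[x == 0] = 1.0
        val = x @ W @ x
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val
```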

Page 15:

Outline

• Two classic relaxation tricks

◦ Semidefinite relaxations and the lifting technique

◦ The l1 heuristic

• Applications

◦ Covariance selection

◦ Sparse PCA, SVD

◦ Sparse nonnegative matrix factorization

• Solving large-scale semidefinite programs

◦ First-order methods

◦ Numerical performance

Page 16:

The l1 heuristic

Start from a linear system:

    Ax = b

with A ∈ R^{m×n} where m < n. We look for a sparse solution:

    minimize    Card(x)
    subject to  Ax = b.

If the solution set is bounded, this can be formulated as a Mixed Integer Linear Program:

    minimize    1^T u
    subject to  Ax = b
                |x| ≤ B u
                u ∈ {0, 1}^n.

This is a hard problem...

Page 17:

l1 relaxation

Assuming |x| ≤ 1, we can replace:

    Card(x) = Σ_{i=1}^n 1_{x_i ≠ 0}

with

    ‖x‖_1 = Σ_{i=1}^n |x_i|

Graphically, assuming x ∈ [−1, 1], this is:

[Plot on [−1, 1]: the step function Card(x) jumps from 0 to 1 at x = 0, with |x| below it as a convex lower bound]

The l1 norm is the largest convex lower bound on Card(x) in [−1, 1].

Page 18:

l1 relaxation

    minimize    Card(x)
    subject to  Ax = b

becomes:

    minimize    ‖x‖_1
    subject to  Ax = b

• The relaxed problem is a linear program.

• This trick can be used for other problems (cf. the minimum rank result from Fazel, Hindi & Boyd (2001)).

• Candes & Tao (2005) or Donoho & Tanner (2005) show that if there is a sufficiently sparse solution, it is optimal and the relaxation is tight. (This result only works in the linear case.)
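A quick illustration (our addition): the relaxed problem is an LP and can be solved directly, here with CVXPY on a random underdetermined system with a planted sparse solution:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(2)
m, n, k = 40, 100, 5

# Plant a k-sparse solution x0 and set b = A x0.
A = rng.standard_normal((m, n))
x0 = np.zeros(n)
x0[rng.choice(n, size=k, replace=False)] = rng.standard_normal(k)
b = A @ x0

# l1 relaxation: minimize ||x||_1 subject to Ax = b.
x = cp.Variable(n)
prob = cp.Problem(cp.Minimize(cp.norm(x, 1)), [A @ x == b])
prob.solve()
print("recovery error:", np.linalg.norm(x.value - x0))
```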

Page 19:

l1 relaxation

The original problem in MILP format:

    minimize    1^T u
    subject to  Ax = b
                |x| ≤ B u
                u ∈ {0, 1}^n,

can be reformulated as a (nonconvex) QCQP:

    minimize    1^T u
    subject to  Ax = b
                −x ≤ B u,   x ≤ B u
                u_i^2 = u_i,   i = 1, ..., n.

• We could also formulate a semidefinite relaxation.

• Lemarechal & Oustry (1999) show that this relaxation is equivalent to relaxing u ∈ {0, 1}^n as u ∈ [0, 1]^n, which is exactly the l1 heuristic.

Page 20:

Outline

• Two classic relaxation tricks

◦ Semidefinite relaxations and the lifting technique

◦ The l1 heuristic

• Applications

◦ Covariance selection

◦ Sparse PCA, SVD

◦ Sparse nonnegative matrix factorization

• Solving large-scale semidefinite programs

◦ First-order methods

◦ Numerical performance

Page 21:

Covariance Selection

We estimate a sample covariance matrix Σ from empirical data...

• Objective: infer dependence relationships between variables.

• We want this information to be as sparse as possible.

• Basic solution: look at the magnitude of the covariance coefficients:

    |Σ_ij| > β  ⇔  variables i and j are related,

and simply threshold smaller coefficients to zero. (The result is not always psd.)

We can do better...

Page 22:

Covariance Selection

Following Dempster (1972), look for zeros in the inverse covariance matrix:

• Parsimony. Suppose that we are estimating a Gaussian density:

    f(x, Σ) = (1/2π)^{p/2} (1/det Σ)^{1/2} exp(−(1/2) x^T Σ^{-1} x),

a sparse inverse matrix Σ^{-1} corresponds to a sparse representation of the density f as a member of an exponential family of distributions:

    f(x, Σ) = exp(α_0 + t(x) + α_11 t_11(x) + ... + α_rs t_rs(x))

with here t_ij(x) = x_i x_j and α_ij = Σ^{-1}_ij.

• Dempster (1972) calls Σ^{-1}_ij a concentration coefficient.

There is more...

Page 23:

Covariance Selection

Conditional independence:

• Suppose X, Y, Z are jointly normal with covariance matrix Σ, with

    Σ = [ Σ_11  Σ_12 ]
        [ Σ_21  Σ_22 ]

where Σ_11 ∈ R^{2×2} and Σ_22 ∈ R.

• Conditioned on Z, the variables X, Y are still normally distributed, with covariance matrix C given by:

    C = Σ_11 − Σ_12 Σ_22^{-1} Σ_21 = ((Σ^{-1})_11)^{-1}

• So X and Y are conditionally independent iff (Σ^{-1})_11 is diagonal, which is also:

    (Σ^{-1})_xy = 0

Page 24:

Covariance Selection

• Suppose we have iid noise ε_i ∼ N(0, 1) and the following linear model:

    x = z + ε_1
    y = z + ε_2
    z = ε_3

• Graphically, this is: [graph with edges Z–X and Z–Y, and no edge between X and Y]

Page 25:

Covariance Selection

• The covariance matrix and inverse covariance matrix are given by:

    Σ = [ 2  1  1 ]        Σ^{-1} = [  1   0  −1 ]
        [ 1  2  1 ]                 [  0   1  −1 ]
        [ 1  1  1 ]                 [ −1  −1   3 ]

• The inverse covariance matrix has (Σ^{-1})_12 = 0, clearly showing that the variables x and y are independent conditioned on z.

• Graphically, this is again: [the graph with edges Z–X and Z–Y only, versus the fully connected graph on X, Y, Z]
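A short check of this example (our addition, not on the slide): build Σ from the linear model and confirm the zero pattern of its inverse:

```python
import numpy as np

# Covariance of (x, y, z) for x = z + e1, y = z + e2, z = e3, e_i iid N(0, 1).
Sigma = np.array([[2.0, 1.0, 1.0],
                  [1.0, 2.0, 1.0],
                  [1.0, 1.0, 1.0]])
P = np.linalg.inv(Sigma)
print(np.round(P, 6))  # the (0, 1) entry is 0: x, y independent given z
```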

Page 26:

Covariance Selection

On a slightly larger scale...

[Two dependence graphs on forward rates with maturities 0.5Y to 10Y: a dense graph before (left) and a sparse graph after (right)]

Page 27:

Applications & Related Work

• Gene expression data. The sample data is composed of gene expression vectors and we want to isolate links in the expression of various genes. See Dobra, Hans, Jones, Nevins, Yao & West (2004), Dobra & West (2004) for example.

• Speech Recognition. See Bilmes (1999), Bilmes (2000) or Chen & Gopinath (1999).

• Finance. Covariance estimation.

• Related work by Dahl, Roychowdhury & Vandenberghe (2005): interior point methods for large, sparse MLE.

Page 28:

Maximum Likelihood Estimation

• We can estimate Σ by solving the following maximum likelihood problem:

    max_{X ∈ S^n}  log det X − Tr(SX)

• This problem is convex and has an explicit solution X = S^{-1} if S ≻ 0.

• Problem here: how do we make Σ^{-1} sparse?

• In other words, how do we efficiently choose the index sets I and J of zero coefficients?

• Solution: penalize the MLE.

Page 29:

AIC and BIC

Original solution in Akaike (1973): penalize the likelihood function:

    max_{X ∈ S^n}  log det X − Tr(SX) − ρ Card(X)

where Card(X) is the number of nonzero elements in X.

• Set ρ = 2/(m + 1) for AIC and ρ = log(m + 1)/(m + 1) for BIC.

• We can form a convex relaxation of the AIC or BIC penalized MLE by replacing Card(X) with ‖X‖_1 = Σ_ij |X_ij| to solve:

    max_{X ∈ S^n}  log det X − Tr(SX) − ρ ‖X‖_1

Again, the classic l1 heuristic: ‖X‖_1 is a convex lower bound on Card(X).
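This penalized MLE is now standard; as a pointer (our addition, not from the slides), scikit-learn's GraphicalLasso solves the same l1-penalized problem, here on data simulated from the earlier three-variable example (its alpha parameter plays the role of ρ):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(3)
N = 2000  # number of samples

# Simulate the earlier model: x = z + e1, y = z + e2, z = e3.
z = rng.standard_normal(N)
data = np.column_stack([z + rng.standard_normal(N),
                        z + rng.standard_normal(N),
                        z])

model = GraphicalLasso(alpha=0.05).fit(data)
print(np.round(model.precision_, 2))  # the (x, y) entry is driven to zero
```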

Page 30:

Robustness

• This penalized MLE problem can be rewritten:

    max_{X ∈ S^n}  min_{|U_ij| ≤ ρ}  log det X − Tr((S + U)X)

• This can be interpreted as a robust MLE problem with componentwise noise of magnitude ρ on the elements of S.

• The relaxed sparsity requirement is equivalent to a robustification.

Page 31:

Outline

• Two classic relaxation tricks

◦ Semidefinite relaxations and the lifting technique

◦ The l1 heuristic

• Applications

◦ Covariance selection

◦ Sparse PCA, SVD

◦ Sparse nonnegative matrix factorization

• Solving large-scale semidefinite programs

◦ First-order methods

◦ Numerical performance

Page 32:

Sparse Principal Component Analysis

Principal Component Analysis (PCA): classic tool in multivariate data analysis

• Input: a covariance matrix A

• Output: a sequence of factors ranked by variance

• Each factor is a linear combination of the problem variables

Typical use: dimensionality reduction.

Numerically, just an eigenvalue decomposition of the covariance matrix:

    A = Σ_{i=1}^n λ_i x_i x_i^T

Page 33:

Sparse Principal Component Analysis

Computing factors amounts to solving:

    maximize    x^T A x
    subject to  ‖x‖_2 = 1.

This problem is easy: its optimal value is λ_max(A), attained at the leading eigenvector x_1. Here, however, we want a little bit more...

We look for a sparse solution and solve instead:

    maximize    x^T A x
    subject to  ‖x‖_2 = 1
                Card(x) ≤ k,

where Card(x) denotes the cardinality (number of nonzero elements) of x. This is nonconvex and numerically hard.

Page 34:

Related literature

Previous work:

• Cadima & Jolliffe (1995): the loadings with small absolute value are thresholded to zero.

• A non-convex method called SCoTLASS by Jolliffe, Trendafilov & Uddin(2003). (Same problem formulation)

• Zou, Hastie & Tibshirani (2004): a regression-based technique called SPCA, based on a representation of PCA as a regression problem. Sparsity is obtained using the LASSO (Tibshirani (1996)), an l1 norm penalty.

Performance:

• These methods are either very suboptimal (thresholding) or lead to nonconvex optimization problems (SPCA).

• Regression: works for very large scale examples.

Page 35:

Semidefinite relaxation

Start from:

    max         x^T A x
    subject to  ‖x‖_2 = 1
                Card(x) ≤ k,

Let X = x x^T, and write everything in terms of the matrix X:

    max         Tr(AX)
    subject to  Tr(X) = 1
                Card(X) ≤ k^2
                X = x x^T,

Replace X = x x^T by the equivalent X ⪰ 0, Rank(X) = 1:

    max         Tr(AX)
    subject to  Tr(X) = 1
                Card(X) ≤ k^2
                X ⪰ 0,  Rank(X) = 1,

again, this is the same problem.

Page 36:

Semidefinite relaxation

Numerically, this is still hard:

• The constraint Card(X) ≤ k^2 is still non-convex

• So is the constraint Rank(X) = 1

However, we have made some progress:

• The objective Tr(AX) is now linear in X

• The (non-convex) constraint ‖x‖_2 = 1 became a linear constraint Tr(X) = 1.

We still need to relax the two non-convex constraints above:

• If u ∈ R^p and Card(u) = q, then ‖u‖_1 ≤ √q ‖u‖_2. So we can replace Card(X) ≤ k^2 by the weaker (but convex) constraint: 1^T |X| 1 ≤ k

• Simply drop the rank constraint

Page 37:

Semidefinite relaxation

Semidefinite relaxation combined with l1 heuristic:

    max         x^T A x
    subject to  ‖x‖_2 = 1
                Card(x) ≤ k,

becomes:

    max         Tr(AX)
    subject to  Tr(X) = 1
                1^T |X| 1 ≤ k
                X ⪰ 0,

• This is a semidefinite program in the variable X ∈ S^n, with polynomial complexity...

• Small problem instances can be solved using SEDUMI by Sturm (1999) or SDPT3 by Toh, Todd & Tutuncu (1999).

• This semidefinite program has O(n^2) dense constraints on the matrix, and we want to solve large problems with n ∼ 10^3.

Can’t use interior point methods...
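For small instances, the relaxation can still be written directly in a modeling language; a sketch in CVXPY (our illustration, with a random sample covariance standing in for A):

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(4)
n, k = 20, 5
B = rng.standard_normal((50, n))
A = B.T @ B / 50  # sample covariance, stand-in data

# Relaxation: max Tr(AX)  s.t.  Tr(X) = 1,  1^T|X|1 <= k,  X psd.
X = cp.Variable((n, n), symmetric=True)
prob = cp.Problem(cp.Maximize(cp.trace(A @ X)),
                  [X >> 0, cp.trace(X) == 1, cp.sum(cp.abs(X)) <= k])
prob.solve()

# The dominant eigenvector of X gives an (approximately) sparse loading.
w, V = np.linalg.eigh(X.value)
print(np.round(V[:, -1], 3))
```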

Page 38:

Robustness & Tightness

Robustness. The penalized problem can be written:

    min_{|U_ij| ≤ ρ}  λ_max(A + U)

Natural interpretation: a robust maximum eigenvalue problem with componentwise noise of magnitude ρ on the coefficients of the matrix A.

Tightness. The KKT optimality conditions are here:

    (A + U)X = λ_max(A + U) X
    U ∘ X = ρ |X|
    Tr(X) = 1,  X ⪰ 0
    |U_ij| ≤ ρ,   i, j = 1, ..., n.

The first order condition means that if λ_max(A + U) is simple, then Rank(X) = 1, so the relaxation is tight: the solution to the relaxed problem is also a global optimum for the original combinatorial problem.

Page 39:

Sparse Singular Value Decomposition

A similar reasoning involves a non-square m × n matrix A, and the problem

    max         u^T A v
    subject to  ‖u‖_2 = ‖v‖_2 = 1
                Card(u) ≤ k_1,  Card(v) ≤ k_2,

in the variables (u, v) ∈ R^m × R^n, where k_1 ≤ m, k_2 ≤ n are fixed. This is relaxed as:

    max         Tr(A^T X_12)
    subject to  1^T |X_ii| 1 ≤ k_i,   i = 1, 2
                1^T |X_12| 1 ≤ √(k_1 k_2)
                X ⪰ 0,  Tr(X_ii) = 1

in the variable X ∈ S^{m+n} with blocks X_ij for i, j = 1, 2, using the fact that the eigenvalues of the matrix:

    [ 0    A ]
    [ A^T  0 ]

are {σ_i, ..., −σ_i, ...}, where the σ_i are the singular values of the matrix A ∈ R^{m×n}.

Page 40:

Nonnegative Matrix Factorization

Direct extension of the sparse PCA result... Solving:

    max         u^T A v
    subject to  ‖u‖_2 = ‖v‖_2 = 1
                Card(u) ≤ k_1,  Card(v) ≤ k_2,

also solves:

    min         ‖A − u v^T‖_F
    subject to  Card(u) ≤ k_1
                Card(v) ≤ k_2,

So, by adding constraints on u and v, we can use the previous result to form a relaxation for the Nonnegative Matrix Factorization problem:

    max         Tr(A^T X_12)
    subject to  1^T |X_ii| 1 ≤ k_i,   i = 1, 2
                1^T |X_12| 1 ≤ √(k_1 k_2)
                X ⪰ 0,  Tr(X_ii) = 1
                X_ij ≥ 0,

Caveat: this only works for rank-one factorizations...

Page 41:

Outline

• Two classic relaxation tricks

◦ Semidefinite relaxations and the lifting technique

◦ The l1 heuristic

• Applications

◦ Covariance selection

◦ Sparse PCA, SVD

◦ Sparse nonnegative matrix factorization

• Solving large-scale semidefinite programs

◦ First-order methods

◦ Numerical performance

Page 42:

Outline

Most of our problems are dense, with n ∼ 10^3.

Solver options:

• Interior point methods fail beyond n ∼ 400.

• Projected subgradient: extremely slow.

• Bundle method (see Helmberg & Rendl (2000)): a bit faster, but can’t take advantage of the box-like structure of the feasible set. Convergence in O(1/ε^2).

Page 43:

First order algorithm

Complexity options...

                  First-order    Smooth     Newton (IP)
    Memory        O(n)           O(n)       O(n^2)
    Complexity    O(1/ε^2)       O(1/ε)     O(log(1/ε))

Page 44:

First order algorithm

Here, we can exploit problem structure

• Our problems here have a min-max structure. For sparse PCA:

    min_{|U_ij| ≤ ρ}  λ_max(A + U) = min_{|U_ij| ≤ ρ}  max_{X ∈ S^n, Tr(X) = 1, X ⪰ 0}  Tr((A + U)X)

• This min-max structure means that we can use the prox function algorithms of Nesterov (2005) (see also Nemirovski (2004)) to solve large, dense problem instances.

Page 45:

First order algorithm

Solve:

    min_{x ∈ Q_1}  f(x)

• Starts from a particular min-max model of the problem:

    f(x) = f̂(x) + max_u { ⟨Tx, u⟩ − φ(u) : u ∈ Q_2 }

• assuming that:

  ◦ f is defined over a compact convex set Q_1 ⊂ R^n

  ◦ f̂(x) is convex, differentiable and has a Lipschitz continuous gradient with constant M ≥ 0

  ◦ T is a linear operator: T ∈ R^{n×n}

  ◦ φ(u) is a continuous convex function over some compact set Q_2 ⊂ R^n.

Page 46:

First order algorithm

If the problem has a min-max model, two steps:

• Regularization. Add a strongly convex penalty inside the min-max representation to produce an ε-approximation of f with Lipschitz continuous gradient (a generalized Moreau-Yosida regularization step, see Lemarechal & Sagastizabal (1997) for example).

• Optimal first order minimization. Use the optimal first order scheme for Lipschitz continuous functions detailed in Nesterov (1983) to solve the regularized problem.

Benefits:

• For a fixed problem size, the number of iterations required to get an ε-solution is O(1/ε), compared to O(1/ε^2) for generic first-order methods.

• Low memory requirements: a change in the granularity of the solver: a larger number of cheaper iterations.

Caveat: only efficient if the subproblems involved in these steps can be solved explicitly or extremely efficiently...

Page 47:

First order algorithm

Regularization. We can find a uniform ε-approximation of λ_max(X) with Lipschitz continuous gradient. Let µ > 0 and X ∈ S^n; we define:

    f_µ(X) = µ log Tr(exp(X/µ))

which requires computing a matrix exponential at a numerical cost of O(n^3). We then have:

    λ_max(X) ≤ f_µ(X) ≤ λ_max(X) + µ log n,

so if we set µ = ε/log n, f_µ(X) becomes a uniform ε-approximation of λ_max(X), and f_µ(X) has a Lipschitz continuous gradient with constant:

    L = 1/µ = (log n)/ε.

The gradient ∇f_µ(X) can also be computed explicitly as:

    exp((X − λ_max(X) I)/µ) / Tr(exp((X − λ_max(X) I)/µ))

using the same matrix exponential.
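A direct NumPy/SciPy sketch of this smoothing (our illustration; the shift by λ_max(X) is the same numerical-stability trick used for the gradient above):

```python
import numpy as np
from scipy.linalg import expm, eigh

def f_mu(X, mu):
    """Return f_mu(X) = mu * log Tr exp(X/mu) and its gradient."""
    n = X.shape[0]
    lmax = eigh(X, eigvals_only=True)[-1]
    # Tr exp(X/mu) = e^{lmax/mu} * Tr exp((X - lmax I)/mu): shift for stability.
    E = expm((X - lmax * np.eye(n)) / mu)
    return lmax + mu * np.log(np.trace(E)), E / np.trace(E)

rng = np.random.default_rng(5)
n, eps = 50, 1e-2
A = rng.standard_normal((n, n))
A = (A + A.T) / 2
mu = eps / np.log(n)
val, grad = f_mu(A, mu)
lmax = eigh(A, eigvals_only=True)[-1]
print(lmax, "<=", val, "<=", lmax + mu * np.log(n))  # uniform eps-approximation
```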

Page 48:

First order algorithm

Optimal first-order minimization. The minimization algorithm in Nesterov (1983) then involves the following steps:

Choose ε > 0 and set X_0 = β I_n. For k = 0, ..., N(ε) do:

1. Compute ∇f_ε(X_k).

2. Find Y_k = argmin_Y { Tr(∇f_ε(X_k)(Y − X_k)) + (1/2) L_ε ‖Y − X_k‖_F^2 : Y ∈ Q_1 }.

3. Find Z_k = argmin_X { L_ε β^2 d_1(X) + Σ_{i=0}^k ((i+1)/2) Tr(∇f_ε(X_i)(X − X_i)) : X ∈ Q_1 }.

4. Update X_{k+1} = (2/(k+3)) Z_k + ((k+1)/(k+3)) Y_k.

5. Test if the gap is less than the target precision.

• Step 1 requires computing a matrix exponential.

• Steps 2 and 3 are both Euclidean projections on Q_1 = {U : |U_ij| ≤ ρ}.
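Both projections thus reduce to entrywise clipping onto this box; a minimal sketch of step 2 (our illustration, assuming the gradient G = ∇f_ε(X_k) and the Lipschitz constant L are given):

```python
import numpy as np

def project_box(U, rho):
    """Euclidean projection onto Q1 = {U : |U_ij| <= rho}: clip each entry."""
    return np.clip(U, -rho, rho)

def step2(X_k, G, L, rho):
    """Y_k = argmin_{Y in Q1} Tr(G (Y - X_k)) + (L/2)||Y - X_k||_F^2,
    i.e. a gradient step of length 1/L followed by projection onto the box."""
    return project_box(X_k - G / L, rho)
```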

Page 49:

First order algorithm

Complexity:

• The number of iterations to get accuracy ε is

    O(n √(log n) / ε).

• At each iteration, the cost of computing a matrix exponential up to machine precision is O(n^3).

Computing matrix exponentials:

• Many options, cf. “Nineteen Dubious Ways to Compute the Exponential of a Matrix” by Moler & Van Loan (2003).

• Padé approximation or full eigenvalue decomposition: O(n^3) up to machine precision.

• In practice, machine precision is unnecessary...

Page 50:

First order algorithm

In d’Aspremont (2005): when minimizing a function with Lipschitz-continuous gradient using the method in Nesterov (1983), an approximate gradient is sufficient to get the O(1/ε) convergence rate. If the function and gradient approximations f̃, ∇f̃ satisfy:

    |f(x) − f̃(x)| ≤ δ   and   |⟨∇f(x) − ∇f̃(x), y⟩| ≤ δ,   x, y ∈ Q_1,

we have:

    f(x_k) − f(x⋆) ≤ L d(x⋆) / ((k + 1)(k + 2) σ) + 10 δ

where L, d(x⋆) and σ are problem constants.

• Only a few dominant eigs. are required to get the matrix exponential.

• Dominant eigenvalues with ARPACK: cubic convergence.

• Optimal complexity of O(1/ε), with the same cost per iteration as regular methods with complexity O(1/ε^2).

• ARPACK exploits sparsity.
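A sketch of this idea (our illustration): approximate the smoothed gradient using only the k dominant eigenpairs computed by ARPACK, via scipy.sparse.linalg.eigsh; the discarded terms are exponentially small in (λ_max − λ_i)/µ:

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def approx_grad(X, mu, k=10):
    """Approximate grad f_mu(X) = exp(X/mu) / Tr exp(X/mu) from the k
    eigenpairs with largest eigenvalues (ARPACK under the hood)."""
    w, V = eigsh(X, k=k, which="LA")   # k largest algebraic eigenvalues
    e = np.exp((w - w.max()) / mu)     # shifted for numerical stability
    return (V * e) @ V.T / e.sum()
```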

Page 51:

Outline

• Two classic relaxation tricks

◦ Semidefinite relaxations and the lifting technique

◦ The l1 heuristic

• Applications

◦ Covariance selection

◦ Sparse PCA, SVD

◦ Sparse nonnegative matrix factorization

• Solving large-scale semidefinite programs

◦ First-order methods

◦ Numerical performance

Page 52:

Covariance Selection

Forward rates covariance matrix for maturities ranging from 0.5 to 10 years.

[Two dependence graphs on the forward rates: dense at ρ = 0 (left), sparse at ρ = 0.01 (right)]

Page 53:

Zoom...

[Zoom on the sparse dependence graph obtained at ρ = 0.01]

Page 54:

Covariance Selection

[Two plots of Sensitivity versus Specificity comparing Covsel and simple thresholding]

Classification Error. Sensitivity/specificity curves for the solution to the covariance selection problem, compared with a simple thresholding of B^{-1}, for various levels of noise: σ = 0.3 (left) and σ = 0.5 (right). Here n = 50.

Page 55:

Sparse PCA

[Two 3D scatter plots: the data in the dense PCA basis (f_1, f_2, f_3) versus the sparse PCA basis (g_1, g_2, g_3)]

Clustering of the gene expression data in the PCA versus sparse PCA basis with 500 genes. The factors f on the left are dense and each use all 500 genes, while the sparse factors g_1, g_2 and g_3 on the right involve 6, 4 and 4 genes respectively. (Data: Iconix Pharmaceuticals)

Page 56:

Sparse PCA

[Two 2D scatter plots of normal versus cancer samples: factor 3 vs. factor 2 with 500 genes each (left), and factor 3 (1 gene) vs. factor 2 (12 genes) (right)]

PCA clustering (left) & DSPCA clustering (right), colon cancer data set in Alon, Barkai, Notterman, Gish, Ybarra, Mack & Levine (1999).

Page 57:

Smooth first-order vs IP

[Two plots comparing CSDP (interior point) and the smooth first-order method: CPU time in seconds versus problem size n (left), and memory used in bytes versus n (right); CSDP runs out of memory on the larger problems]

Figure 1: CPU time and memory usage versus n.

Page 58:

Sparse PCA

[Two plots of the percentage of eigenvalues required versus CPU time in seconds (left) and versus duality gap (right), for solutions with 14 genes (ρ = 32), 16 genes (ρ = 30), 21 genes (ρ = 28) and 28 genes (ρ = 24)]

Eigenvalues vs. CPU time (left), duality gap vs. eigenvalues (right), on 1000 genes.

Page 59:

Sparse PCA

[Plot of Rand index versus number of nonzeros, comparing PCA and DSPCA]

Sparsity versus Rand index on the colon cancer data set.

Page 60:

Sparse Nonnegative Matrix Factorization

Test the relaxation on a matrix of the form:

    M = x y^T + U

where U is uniform noise.

[Heatmaps of the rank-one matrix x y^T, the noisy matrix M, and the recovered solution, together with the recovered sparse factors x and y^T]

Page 61:

Conclusion

• Semidefinite relaxations of combinatorial problems in multivariate statistics.

• Infer sparse structural information on large datasets.

• Efficient codes can solve problems with 10^3 variables in a few minutes.

Source code and binaries for sparse PCA (DSPCA) and covariance selection (COVSEL) are available at:

www.princeton.edu/∼aspremon

These slides are available at:

www.princeton.edu/∼aspremon/Banff07.pdf

Page 62:

References

Akaike, J. (1973), Information theory and an extension of the maximum likelihood principle, in B. N. Petrov & F. Csaki, eds, ‘Second international symposium on information theory’, Akademiai Kiado, Budapest, pp. 267–281.

Alon, A., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D. & Levine, A. J. (1999), ‘Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays’, Cell Biology 96, 6745–6750.

Ben-Tal, A. & Nemirovski, A. (2001), Lectures on modern convex optimization: analysis, algorithms, and engineering applications, MPS-SIAM series on optimization, Society for Industrial and Applied Mathematics: Mathematical Programming Society, Philadelphia, PA.

Bilmes, J. A. (1999), ‘Natural statistic models for automatic speech recognition’, Ph.D. thesis, UC Berkeley, Dept. of EECS, CS Division.

Bilmes, J. A. (2000), ‘Factored sparse inverse covariance matrices’, IEEE International Conference on Acoustics, Speech, and Signal Processing.

Cadima, J. & Jolliffe, I. T. (1995), ‘Loadings and correlations in the interpretation of principal components’, Journal of Applied Statistics 22, 203–214.

Candes, E. & Tao, T. (2005), ‘Decoding by linear programming’, ArXiv: math.MG/0502327.

Chen, S. S. & Gopinath, R. A. (1999), ‘Model selection in acoustic modeling’, EUROSPEECH.

Dahl, J., Roychowdhury, V. & Vandenberghe, L. (2005), ‘Maximum likelihood estimation of Gaussian graphical models: numerical implementation and topology selection’, UCLA preprint.

d’Aspremont, A. (2005), ‘Smooth optimization for sparse semidefinite programs’, ArXiv: math.OC/0512344.

Dempster, A. (1972), ‘Covariance selection’, Biometrics 28, 157–175.

Dobra, A., Hans, C., Jones, B., Nevins, J. J. R., Yao, G. & West, M. (2004), ‘Sparse graphical models for exploring gene expression data’, Journal of Multivariate Analysis 90(1), 196–212.

Dobra, A. & West, M. (2004), ‘Bayesian covariance selection’, working paper.

Donoho, D. L. & Tanner, J. (2005), ‘Sparse nonnegative solutions of underdetermined linear equations by linear programming’, Proceedings of the National Academy of Sciences 102(27), 9446–9451.

Fazel, M., Hindi, H. & Boyd, S. (2001), ‘A rank minimization heuristic with application to minimum order system approximation’, Proceedings American Control Conference 6, 4734–4739.

Helmberg, C. & Rendl, F. (2000), ‘A spectral bundle method for semidefinite programming’, SIAM Journal on Optimization 10(3), 673–696.

Jolliffe, I. T., Trendafilov, N. & Uddin, M. (2003), ‘A modified principal component technique based on the LASSO’, Journal of Computational and Graphical Statistics 12, 531–547.

Page 63:

Lemarechal, C. & Oustry, F. (1999), ‘Semidefinite relaxations and Lagrangian duality with application to combinatorial optimization’, INRIA, Rapport de recherche 3710.

Lemarechal, C. & Sagastizabal, C. (1997), ‘Practical aspects of the Moreau-Yosida regularization: theoretical preliminaries’, SIAM Journal on Optimization 7(2), 367–385.

Moler, C. & Van Loan, C. (2003), ‘Nineteen dubious ways to compute the exponential of a matrix, twenty-five years later’, SIAM Review 45(1), 3–49.

Nemirovski, A. (2004), ‘Prox-method with rate of convergence O(1/T) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems’, SIAM Journal on Optimization 15(1), 229–251.

Nesterov, Y. (1983), ‘A method of solving a convex programming problem with convergence rate O(1/k^2)’, Soviet Mathematics Doklady 27(2), 372–376.

Nesterov, Y. (2005), ‘Smooth minimization of nonsmooth functions’, Mathematical Programming, Series A 103, 127–152.

Nesterov, Y. & Nemirovskii, A. (1994), Interior-point polynomial algorithms in convex programming, Society for Industrial and Applied Mathematics, Philadelphia.

Sturm, J. (1999), ‘Using SEDUMI 1.0x, a MATLAB toolbox for optimization over symmetric cones’, Optimization Methods and Software 11, 625–653.

Tibshirani, R. (1996), ‘Regression shrinkage and selection via the LASSO’, Journal of the Royal Statistical Society, Series B 58(1), 267–288.

Toh, K. C., Todd, M. J. & Tutuncu, R. H. (1999), ‘SDPT3 – a MATLAB software package for semidefinite programming’, Optimization Methods and Software 11, 545–581.

Zou, H., Hastie, T. & Tibshirani, R. (2004), ‘Sparse principal component analysis’, To appear in Journal of Computational and Graphical Statistics.
