
Preconditioners for Sparse Approximations and Big Data Optimization

The University of Edinburgh

Jacek Gondzio

thanks to my collaborator:

K. Fountoulakis

Padova, February 18th, 2020


Introduction

New applications:

• signal/image processing (compression, reconstruction, deblurring)

• modern computational statistics (inverse problems, classification, machine learning)

Overarching mathematical problem:

• dimension reduction

Optimization faces new challenges:

huge scale of data available needs new processing algorithms


Outline

• Homotopy: “replace a difficult problem with a sequence of (easy to solve) auxiliary problems”
  – Interior Point Methods
  – Big Data optimization

• 2nd-order methods for optimization
  – Inexact Newton method
  – Preconditioners needed
  – Matrix-free methods

• The same idea of preconditioner in IPMs and BD

• Big Data problem (2^40 ≈ 10^12 variables)

• Conclusions


Interior Point Methods (IPMs)


Primal-Dual Pair of Quadratic Programs

Primal:  min  cᵀx + ½ xᵀQx          Dual:  max  bᵀy − ½ xᵀQx
         s.t. Ax = b,  x ≥ 0;              s.t. Aᵀy + s − Qx = c,  s ≥ 0.

Lagrangian: L(x, y, s) = cᵀx + ½ xᵀQx − yᵀ(Ax − b) − sᵀx.

Optimality Conditions

Ax = b,
Aᵀy + s − Qx = c,
XSe = 0   (i.e., x_j · s_j = 0 ∀j),
(x, s) ≥ 0,

where X = diag{x_1, …, x_n}, S = diag{s_1, …, s_n}, e = (1, …, 1) ∈ Rⁿ.


Interior-Point Framework

The log barrier −log x_j “replaces” the inequality x_j ≥ 0.

[Figure: plot of −ln x]

We derive the first order optimality conditions for the primal barrier problem:

Ax = b,
−Qx + Aᵀy + s = c,
XSe = µe,

and apply the Newton method to solve this system of (nonlinear) equations.


First Order Opt Conditions for QP

Ax = b,
Aᵀy + s − Qx = c,
XSe = 0,
(x, s) ≥ 0.

First Order Opt Conditions for Barrier QP

Ax = b,
Aᵀy + s − Qx = c,
XSe = µe,
(x, s) > 0.


Homotopy/Continuation

LP/QP is about “solving” the following equation

x_j · s_j = 0,   ∀ j = 1, 2, …, n.

Do not guess! Replace this equation with

x_j · s_j = µ,   ∀ j = 1, 2, …, n,

and decrease µ gradually to zero.

Let the intelligent IPMs find out what the optimal partition is.


The pronunciation of Greek letter µ

Robert De Niro, Taxi Driver (1976)


Big Data Optimization


Sparse Approximation

• Machine Learning: Classification with SVMs

• Statistics: Estimate x from observations

• Wavelet-based signal/image reconst. & restoration

• Compressed Sensing (Signal Processing)

All such problems lead to the same dense, potentially very large QP.


Binary Classification

min_x  τ‖x‖_1 + Σ_{i=1}^m log(1 + e^(−b_i·xᵀa_i))        min_x  τ‖x‖_2² + Σ_{i=1}^m log(1 + e^(−b_i·xᵀa_i))
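As an illustration of the ℓ1-regularized formulation above, here is a minimal NumPy sketch of the logistic loss and one of its subgradients. The data (A, b) and all helper names are made up for the example; this is not the software discussed later in the talk.

```python
import numpy as np

def logistic_l1_objective(x, A, b, tau):
    """tau*||x||_1 + sum_i log(1 + exp(-b_i * a_i^T x))."""
    margins = b * (A @ x)                       # b_i * a_i^T x, one entry per sample
    loss = np.sum(np.logaddexp(0.0, -margins))  # log(1 + exp(-t)) computed stably
    return tau * np.linalg.norm(x, 1) + loss

def logistic_l1_subgradient(x, A, b, tau):
    """One subgradient: tau*sign(x) - A^T (b * sigma(-b * A x))."""
    margins = b * (A @ x)
    sig = 1.0 / (1.0 + np.exp(margins))         # sigma(-margins)
    return tau * np.sign(x) - A.T @ (b * sig)

# tiny synthetic instance (illustrative only)
rng = np.random.default_rng(0)
m, n = 50, 20
A = rng.standard_normal((m, n))
b = np.sign(rng.standard_normal(m))
x = np.zeros(n)
print(logistic_l1_objective(x, A, b, tau=0.1))  # equals m*log(2) at x = 0
```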


Bayesian Statistics Viewpoint

Estimate x from observations

y = Ax+ e,

where y are observations and e is the Gaussian noise.

→   min_x ‖y − Ax‖_2²

If the prior on x is Laplacian (log p(x) = −λ‖x‖_1 + K), then

min_x  τ‖x‖_1 + ‖Ax − b‖_2²

Tibshirani, J. of Royal Statistical Soc. B 58 (1996) 267–288.


ℓ1-regularization

min_x  τ‖x‖_1 + φ(x).

Unconstrained optimization ⇒ easy

Serious issue: nondifferentiability of ‖·‖_1

Two possible tricks:

• Splitting x = u − v with u, v ≥ 0

• Smoothing with the pseudo-Huber approximation, which replaces ‖x‖_1 with

  ψ_µ(x) = Σ_{i=1}^n ( √(µ² + x_i²) − µ )

  (a minimal sketch follows below)
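A minimal NumPy sketch of the pseudo-Huber smoothing just defined (illustrative only; the function names are ours, not taken from the authors' codes):

```python
import numpy as np

def pseudo_huber(x, mu):
    """psi_mu(x) = sum_i ( sqrt(mu^2 + x_i^2) - mu ): smooth approximation of ||x||_1."""
    return np.sum(np.sqrt(mu**2 + x**2) - mu)

def pseudo_huber_grad(x, mu):
    """Gradient: x_i / sqrt(mu^2 + x_i^2) -- a smoothed sign(x_i)."""
    return x / np.sqrt(mu**2 + x**2)

def pseudo_huber_hess_diag(x, mu):
    """The Hessian is diagonal: mu^2 / (mu^2 + x_i^2)^(3/2).
    It tends to 0 where |x_i| >> mu and to 1/mu where x_i is close to 0."""
    return mu**2 / (mu**2 + x**2) ** 1.5

x = np.array([-2.0, -0.01, 0.0, 0.5, 3.0])
for mu in (1e-1, 1e-2, 1e-3):
    print(mu, pseudo_huber(x, mu), np.linalg.norm(x, 1))  # approaches ||x||_1 as mu -> 0
```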


Huber:


Continuation

Embed the inexact Newton method into a homotopy approach:

• Inequalities u ≥ 0, v ≥ 0 −→ use IPM:
  replace z ≥ 0 with −µ log z and drive µ to zero.

• Pseudo-Huber regression −→ use continuation:
  replace |x_i| with µ( √(1 + x_i²/µ²) − 1 ) and drive µ to zero.

Questions:

• Theory?

• Practice?


Main Tool: Inexact Newton Method

Replace an exact Newton direction

∇²f(x)∆x = −∇f(x)

with an inexact one:

∇²f(x)∆x = −∇f(x) + r,

where the error r is small: ‖r‖ ≤ η‖∇f(x)‖, η ∈ (0,1).

The NLP community usually writes it as:

‖∇²f(x)∆x + ∇f(x)‖_2 ≤ η‖∇f(x)‖_2,   η ∈ (0,1).

Dembo, Eisenstat & Steihaug, SIAM J. on Numerical Analysis 19 (1982) 400–408.
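A minimal sketch of an inexact Newton iteration in this spirit, applied to the smoothed problem f_µ(x) = τψ_µ(x) + ½‖Ax − b‖_2²: the inner CG is stopped as soon as the residual meets the forcing condition above. The data, the backtracking rule and all names are illustrative assumptions, not the pdNCG implementation.

```python
import numpy as np

def cg(hess_vec, rhs, tol, max_iter=200):
    """Plain CG for H d = rhs, stopped once ||rhs - H d|| <= tol (the inexactness)."""
    d = np.zeros_like(rhs)
    r = rhs.copy()                     # residual rhs - H d (d = 0 initially)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) <= tol:
            break
        Hp = hess_vec(p)
        alpha = rs / (p @ Hp)
        d += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return d

def f_mu(x, A, b, tau, mu):
    return tau * np.sum(np.sqrt(mu**2 + x**2) - mu) + 0.5 * np.sum((A @ x - b) ** 2)

def inexact_newton(A, b, tau, mu, eta=0.1, iters=25):
    """Inexact Newton on f_mu(x) = tau*psi_mu(x) + 0.5*||Ax - b||^2."""
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        w = np.sqrt(mu**2 + x**2)
        grad = tau * x / w + A.T @ (A @ x - b)
        hdiag = tau * mu**2 / w**3                   # diagonal Hessian of tau*psi_mu
        hess_vec = lambda v: hdiag * v + A.T @ (A @ v)
        # forcing condition: ||H dx + grad|| <= eta * ||grad||
        dx = cg(hess_vec, -grad, tol=eta * np.linalg.norm(grad))
        t = 1.0                                      # simple backtracking line search
        for _ in range(30):
            if f_mu(x + t * dx, A, b, tau, mu) <= f_mu(x, A, b, tau, mu) + 1e-4 * t * (grad @ dx):
                break
            t *= 0.5
        x = x + t * dx
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((40, 100))
x_true = np.zeros(100); x_true[:5] = 1.0
b = A @ x_true
x = inexact_newton(A, b, tau=0.1, mu=1e-4)
print(np.round(x[:8], 2))                            # compare with x_true[:8]
```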


Inexact Primal-Dual IPM:
G., Convergence Analysis of an Inexact Feasible IPM for Convex QP, SIAM J. Opt. 23 (2013), No. 3, 1510–1527.
G., Matrix-Free Interior Point Method, Comput. Opt. and Appls. 51 (2012), 457–480.

Primal-Dual Newton Conjugate Gradient Method:
Fountoulakis and G., A Second-order Method for Strongly Convex ℓ1-regularization Problems, Mathematical Programming 156 (2016), 189–219.
Dassios, Fountoulakis and G., A Preconditioner for a Primal-Dual Newton Conjugate Gradient Method for Compressed Sensing Problems, SIAM J. on Sci. Comput. 37 (2015), A2783–A2812.


Compressed Sensing and Continuation

Replace

  min_x  f(x) = τ‖W*x‖_1 + ½‖Ax − b‖_2²,      −→  x_τ

with

  min_x  f_µ(x) = τψ_µ(W*x) + ½‖Ax − b‖_2²,   −→  x_{τ,µ}

Solve approximately a family of problems for a (short) decreasing sequence of µ's: µ_0 > µ_1 > µ_2 > ···

Theorem (Brief description)

There exists a μ̄ such that for all µ ≤ μ̄ the difference of the two solutions satisfies

‖x_{τ,µ} − x_τ‖_2 = O(µ^(1/2))   ∀ τ, µ
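A schematic of the continuation loop just described, in Python. The inner solver here is an off-the-shelf L-BFGS call standing in for pdNCG; the µ schedule, the data and all names are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def continuation(solve_smoothed, x0, mu0=1.0, mu_min=1e-4, factor=0.1):
    """Approximately solve a short sequence of smoothed problems,
    mu0 > mu1 > ... >= mu_min, warm-starting each from the previous solution."""
    x, mu = x0, mu0
    while mu >= mu_min:
        x = solve_smoothed(x, mu)   # approximate minimizer of f_mu, started at x
        mu *= factor                # shrink the smoothing parameter
    return x

# illustrative inner solver: L-BFGS on f_mu(x) = tau*psi_mu(x) + 0.5*||Ax - b||^2
rng = np.random.default_rng(2)
A, tau = rng.standard_normal((30, 60)), 0.1
x_true = np.zeros(60); x_true[:4] = 1.0
b = A @ x_true

def solve_smoothed(x, mu):
    f = lambda z: tau * np.sum(np.sqrt(mu**2 + z**2) - mu) + 0.5 * np.sum((A @ z - b) ** 2)
    g = lambda z: tau * z / np.sqrt(mu**2 + z**2) + A.T @ (A @ z - b)
    return minimize(f, x, jac=g, method="L-BFGS-B").x

x = continuation(solve_smoothed, np.zeros(60))
print(np.round(x[:6], 2))           # compare with x_true[:6]
```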


Convergence of the primal-dual Newton CG

Use inexact Newton directions:

‖∇²f_µ(x)∆x + ∇f_µ(x)‖_2 ≤ η‖∇f_µ(x)‖_2,   η ∈ (0,1)

computed by the Newton CG method.

Theorem (Primal convergence)

Let {x_k}_{k=0}^∞ be a sequence generated by pdNCG. Then the sequence {x_k}_{k=0}^∞ converges to the primal (perturbed) solution x_{τ,µ}.

Theorem (Rate of convergence)

If the forcing factor η_k satisfies lim_{k→∞} η_k = 0, then pdNCG converges superlinearly.


Continuation

(three examples of ℓ1-regularization)


Three examples of ℓ1-regularization

• Compressed Sensing (with K. Fountoulakis and P. Zhlobich)

  min_x  τ‖x‖_1 + ½‖Ax − b‖_2²,   A ∈ R^{m×n}

• Compressed Sensing, Coherent and Redundant Dictionaries (with I. Dassios and K. Fountoulakis)

  min_x  τ‖W*x‖_1 + ½‖Ax − b‖_2²,   W ∈ C^{n×l}, A ∈ R^{m×n}

  think of Total Variation

• Big Data optimization, Machine Learning (with K. Fountoulakis)


Example 1: Compressed Sensing

with K. Fountoulakis and P. Zhlobich

Large dense quadratic optimization problem:

min_x  τ‖x‖_1 + ½‖Ax − b‖_2²,

where A ∈ R^{m×n} is a very special matrix.

Fountoulakis, G., Zhlobich, Matrix-free IPM for Compressed Sensing Problems, Math. Prog. Computation 6 (2014), pp. 1–31.

Software available at http://www.maths.ed.ac.uk/ERGO/


Restricted Isometry Property (RIP)

• rows of A are orthogonal to each other (A is built of a subset of rows of an orthonormal matrix U ∈ R^{n×n}):

  AAᵀ = I_m.

• small subsets of columns of A are nearly orthogonal to each other: the Restricted Isometry Property (RIP)

  ‖ĀᵀĀ − (m/n) I_k‖ ≤ δ_k ∈ (0,1),

  where Ā is any submatrix of k columns of A.

Candes, Romberg & Tao, Comm. on Pure and Appl. Maths 59 (2005) 1207–1233.


Restricted Isometry Property

Matrix Ā ∈ R^{m×k} (k ≪ n) is built of a subset of columns of A ∈ R^{m×n}.

[Figure: picking k columns of A to form Ā]

ĀᵀĀ ≈ (m/n) I_k.

This yields a very well conditioned optimization problem.


Problem Reformulation

min_x  τ‖x‖_1 + ½‖Ax − b‖_2²

Replace x = x⁺ − x⁻ to be able to use |x| = x⁺ + x⁻.
Use |x_i| = z_i + z_{i+n} to replace ‖x‖_1 with ‖x‖_1 = 1ᵀ_{2n} z.

(This increases the problem dimension from n to 2n.)

min_{z≥0}  cᵀz + ½ zᵀQz,

where

Q = [  Aᵀ ] [ A  −A ] = [  AᵀA   −AᵀA ]
    [ −Aᵀ ]             [ −AᵀA    AᵀA ]   ∈ R^{2n×2n}
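The 2n×2n matrix Q never needs to be formed: a product Qz costs one multiplication with A and one with Aᵀ. A small sketch (names and data are illustrative):

```python
import numpy as np

def Q_matvec(A, z):
    """Compute Q z for Q = [[A^T A, -A^T A], [-A^T A, A^T A]] without forming Q.
    With z = [u; v] of length 2n:  Q z = [A^T A (u - v); -A^T A (u - v)]."""
    n = A.shape[1]
    u, v = z[:n], z[n:]
    w = A.T @ (A @ (u - v))        # one matvec with A, one with A^T
    return np.concatenate([w, -w])

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 8))
z = rng.standard_normal(16)
# check against the explicit Q on this tiny example
AtA = A.T @ A
Q = np.block([[AtA, -AtA], [-AtA, AtA]])
print(np.allclose(Q @ z, Q_matvec(A, z)))   # True
```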


Preconditioner

Approximate

M = [  AᵀA   −AᵀA ]   [ Θ_1⁻¹        ]
    [ −AᵀA    AᵀA ] + [        Θ_2⁻¹ ]

with

P = (m/n) [  I_n   −I_n ]   [ Θ_1⁻¹        ]
          [ −I_n    I_n ] + [        Θ_2⁻¹ ].

We expect (optimal partition):

• k entries of Θ⁻¹ → 0, k ≪ 2n,

• 2n − k entries of Θ⁻¹ → ∞.
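Because P couples only the pairs (z_i, z_{n+i}), applying P⁻¹ amounts to n independent 2×2 solves. A sketch, with ρ = m/n and theta1, theta2 denoting the diagonal entries of Θ_1, Θ_2 (names and test data are illustrative):

```python
import numpy as np

def apply_P_inv(r, rho, theta1, theta2):
    """Solve P z = r for P = rho*[[I,-I],[-I,I]] + diag(1/theta1, 1/theta2).
    P decouples into n independent 2x2 systems, one per index pair (i, n+i)."""
    n = theta1.size
    r1, r2 = r[:n], r[n:]
    a = rho + 1.0 / theta1              # diagonal of the (1,1) block
    d = rho + 1.0 / theta2              # diagonal of the (2,2) block
    det = a * d - rho**2                # determinant of each 2x2 block (positive)
    z1 = (  d * r1 + rho * r2) / det
    z2 = (rho * r1 +   a * r2) / det
    return np.concatenate([z1, z2])

rng = np.random.default_rng(4)
n, rho = 6, 0.5
theta1, theta2 = rng.uniform(0.1, 10.0, n), rng.uniform(0.1, 10.0, n)
P = rho * np.block([[np.eye(n), -np.eye(n)], [-np.eye(n), np.eye(n)]]) \
    + np.diag(np.concatenate([1 / theta1, 1 / theta2]))
r = rng.standard_normal(2 * n)
print(np.allclose(np.linalg.solve(P, r), apply_P_inv(r, rho, theta1, theta2)))  # True
```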


Spectral Properties of P⁻¹M

Theorem

• Exactly n eigenvalues of P⁻¹M are 1.

• The remaining n eigenvalues satisfy

  |λ(P⁻¹M) − 1| ≤ δ_k + (n/m) δ_k L,

  where δ_k is the RIP constant, and L is a threshold of “large” (Θ_1 + Θ_2)⁻¹.

Fountoulakis, G., Zhlobich, Matrix-free IPM for Compressed Sensing Problems, Math. Prog. Computation 6 (2014), pp. 1–31.


Preconditioning

[Figure: matrix-vector products per CG/PCG call for unpreconditioned vs preconditioned CG, and the spread of λ(M) vs λ(P⁻¹M) per CG/PCG call]

−→ good clustering of eigenvalues

mf-IPM compares favourably with NestA on easy problems (NestA: Becker, Bobin and Candes).


SPARCO problems

Comparison on 18 out of 26 classes of problems (all but 6 complex and 2 installation-dependent ones).

Solvers compared:
PDCO, Saunders and Kim, Stanford,
ℓ1−ℓs, Kim, Koh, Lustig, Boyd, Gorinevsky, Stanford,
FPC-AS-CG, Wen, Yin, Goldfarb, Zhang, Rice,
SPGL1, Van Den Berg, Friedlander, Vancouver, and
mf-IPM, Fountoulakis, G., Zhlobich, Edinburgh.

On 36 runs (noisy and noiseless problems), mf-IPM:
• is the fastest on 11,
• is the second best on 14, and
• overall is very robust.


Linear Algebra Perspective

Convert a difficulty into an advantage

Interestingly the same trick works:in IPMs and in Big Data!


Linear Algebra of IPMs for LP/QP

Newton direction

[ A    0   0 ] [ ∆x ]   [ ξ_p ]
[ −Q   Aᵀ  I ] [ ∆y ] = [ ξ_d ]
[ S    0   X ] [ ∆s ]   [ ξ_µ ].

Eliminate ∆s from the second equation and get

[ −Q − Θ⁻¹   Aᵀ ] [ ∆x ]   [ ξ_d − X⁻¹ξ_µ ]
[  A          0 ] [ ∆y ] = [ ξ_p          ],

where Θ = XS⁻¹ is a diagonal scaling matrix.

Eliminate ∆x from the first equation and get

(A (Q + Θ⁻¹)⁻¹ Aᵀ) ∆y = g.
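On a tiny dense example the elimination steps above can be checked directly; a matrix-free IPM would instead apply A(Q + Θ⁻¹)⁻¹Aᵀ through matrix-vector products inside CG. A minimal sketch with made-up data (the right-hand side g below is the one produced by the elimination):

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 4, 9
A = rng.standard_normal((m, n))
Q = 0.5 * np.eye(n)                               # any symmetric positive semidefinite Q
x, s = rng.uniform(0.5, 2.0, n), rng.uniform(0.5, 2.0, n)
Theta = x / s                                     # Theta = X S^{-1}
xi_p, xi_d, xi_mu = rng.standard_normal(m), rng.standard_normal(n), rng.standard_normal(n)

# augmented system: [-Q - Theta^{-1}, A^T; A, 0] [dx; dy] = [xi_d - X^{-1} xi_mu; xi_p]
K = np.block([[-Q - np.diag(1 / Theta), A.T], [A, np.zeros((m, m))]])
rhs = np.concatenate([xi_d - xi_mu / x, xi_p])
dx, dy = np.split(np.linalg.solve(K, rhs), [n])

# normal equations: A (Q + Theta^{-1})^{-1} A^T dy = g,  g = xi_p + A (Q+Theta^{-1})^{-1}(xi_d - X^{-1} xi_mu)
M = Q + np.diag(1 / Theta)
g = xi_p + A @ np.linalg.solve(M, xi_d - xi_mu / x)
dy2 = np.linalg.solve(A @ np.linalg.solve(M, A.T), g)
print(np.allclose(dy, dy2))                       # True: both routes give the same dy
```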


IPM Linear Algebra: Splitting Preconditioner

For “basic” variables, x_B → x*_B > 0 and s_B → s*_B = 0, hence

Θ_j = x_j/s_j ≈ x_j²/(x_j s_j) = O(µ⁻¹)   ∀ j ∈ B.

For “non-basic” variables, x_N → x*_N = 0 and s_N → s*_N > 0, hence

Θ_j = x_j/s_j ≈ (x_j s_j)/s_j² = O(µ)   ∀ j ∈ N.

Convert a difficulty into an advantage → exploit the property:

As µ → 0:   Θ_B → ∞,  Θ_B⁻¹ → 0;   Θ_N → 0,  Θ_N⁻¹ → ∞.


Oliveira & Sorensen, LAA 394 (2005) 1-24.

Guess a basic/nonbasic partition A = [B | N], with invertible B.

E⁻¹ H E⁻ᵀ =

[ Θ_B^{1/2}   0           Θ_B^{−1/2} B⁻¹ ] [ −Θ_B⁻¹   0        Bᵀ ] [ Θ_B^{1/2}        0           Θ_B^{1/2} ]
[ 0           Θ_N^{1/2}   0              ] [  0      −Θ_N⁻¹    Nᵀ ] [ 0                Θ_N^{1/2}   0         ]
[ Θ_B^{1/2}   0           0              ] [  B       N        0  ] [ B⁻ᵀ Θ_B^{−1/2}   0           0         ]

  [ I_m   Wᵀ          0    ]
= [ W     −I_{n−m}    0    ] ,   where W = Θ_N^{1/2} Nᵀ B⁻ᵀ Θ_B^{−1/2}.
  [ 0      0         −I_m  ]

With µ → 0 we have Θ_B⁻¹ → 0, Θ_N → 0, hence

W = Θ_N^{1/2} Nᵀ B⁻ᵀ Θ_B^{−1/2} ≈ 0.


Property of Sparse Approximations

min_x  f(x) = τ‖x‖_1 + ‖Ax − b‖_2²

Assume a sparse solution exists, x = [x_B | x_Z] with x_Z = 0.
Partition matrix A = [A_B | A_Z] accordingly.

Then only a small subset of the Hessian AᵀA is “relevant”:

AᵀA = [ A_BᵀA_B   A_BᵀA_Z ]
      [ A_ZᵀA_B   A_ZᵀA_Z ]


Splitting x = u − v and IPMs

There is a need to solve equations with AᵀA + Θ⁻¹, and

• Θ_j⁻¹ → 0, for j in the sparse part B (x_j > 0),

• Θ_j⁻¹ → ∞, for j in the zero part Z (x_j ≈ 0).

Then

AᵀA + Θ⁻¹ = [ A_BᵀA_B + Θ_B⁻¹   A_BᵀA_Z          ]  ≈  [ A_BᵀA_B         ]
            [ A_ZᵀA_B            A_ZᵀA_Z + Θ_Z⁻¹ ]     [           Θ_Z⁻¹ ]

From RIP: A_BᵀA_B ≈ (m/n) I.


Smoothing with pseudo-Huber approximation

There is a need to solve equations with AᵀA + ∇²ψ_µ(x), and

• ψ_µ(x_j) ≫ 0 and ∇²ψ_µ(x_j) → 0, for j in part B,

• ψ_µ(x_j) ≈ 0 and ∇²ψ_µ(x_j) → 1/µ, for j in part Z.

Then

AᵀA + ∇²ψ_µ(x) = [ A_BᵀA_B + ∇²ψ_µ(x_B)   A_BᵀA_Z                ]  ≈  [ A_BᵀA_B          ]
                 [ A_ZᵀA_B                 A_ZᵀA_Z + ∇²ψ_µ(x_Z)  ]     [           (1/µ)I ]


Example 2: CS, Coherent & Redundant Dict.

with I. Dassios and K. Fountoulakis.

Large dense quadratic optimization problem:

min_x  τ‖W*x‖_1 + ½‖Ax − b‖_2²,

where A ∈ R^{m×n} and W ∈ C^{n×l} is a dictionary.

Dassios, Fountoulakis and G., A Preconditioner for a Primal-Dual Newton Conjugate Gradient Method for Compressed Sensing Problems, SIAM J. on Sci. Comput. 37 (2015) A2783–A2812.

Software available at http://www.maths.ed.ac.uk/ERGO/


W-Restricted Isometry Property (W-RIP)

• rows of A are nearly orthogonal to each other, i.e., there exists a small constant δ such that

  ‖AAᵀ − I_m‖ ≤ δ.

• W-Restricted Isometry Property (W-RIP): there exists a constant δ_q such that

  (1 − δ_q)‖Wz‖_2² ≤ ‖AWz‖_2² ≤ (1 + δ_q)‖Wz‖_2²

  for all at most q-sparse z ∈ Cⁿ.

Candes, Eldar & Needell, Appl. and Comp. Harmonic Anal. 31 (2011) 59–73.


Preconditioner

Approximate

H = τ∇²ψ_µ(W*x) + AᵀA

with

P = τ∇²ψ_µ(W*x) + ρ I_n.

We expect (optimal partition):

• k entries of W*x ≫ 0, k ≪ l, with ∇²ψ_µ(W*x)_j → 0,

• l − k entries of W*x ≈ 0, with ∇²ψ_µ(W*x)_j → 1/µ.

The preconditioner approximates well the 2nd derivative of the pseudo-Huber regularization.
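In the special case W = I the preconditioner P = τ∇²ψ_µ(x) + ρI is diagonal and trivially invertible. The sketch below runs preconditioned CG with it on a synthetic system; the choice ρ = m/n, the data and all names are illustrative assumptions, not the setting of the paper.

```python
import numpy as np

def pcg(hess_vec, rhs, precond_diag, tol=1e-8, max_iter=500):
    """Preconditioned CG for H d = rhs with a diagonal preconditioner."""
    d = np.zeros_like(rhs)
    r = rhs.copy()
    z = r / precond_diag
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        if np.linalg.norm(r) <= tol * np.linalg.norm(rhs):
            break
        Hp = hess_vec(p)
        alpha = rz / (p @ Hp)
        d += alpha * p
        r -= alpha * Hp
        z = r / precond_diag
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return d

rng = np.random.default_rng(6)
m, n, tau, mu = 60, 200, 0.1, 1e-3
A = np.linalg.qr(rng.standard_normal((n, m)))[0].T       # m x n with orthonormal rows
x = np.zeros(n); x[:5] = 1.0                             # current iterate: 5 "large" entries
hpsi = tau * mu**2 / (mu**2 + x**2) ** 1.5               # diagonal of tau * Hessian of psi_mu
hess_vec = lambda v: hpsi * v + A.T @ (A @ v)            # H v, applied matrix-free
precond = hpsi + m / n                                   # P = tau*Hess psi_mu + rho*I, rho = m/n
rhs = rng.standard_normal(n)
d = pcg(hess_vec, rhs, precond)
print(np.linalg.norm(hess_vec(d) - rhs) / np.linalg.norm(rhs))   # small relative residual
```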


Spectral Properties of P⁻¹H

Theorem

• The eigenvalues of P⁻¹H satisfy

  |λ(P⁻¹H) − 1| ≤ η(δ, δ_q, ρ) / ρ,

  where δ_q is the W-RIP constant, δ is another small constant, and η(δ, δ_q, ρ) is some simple function.

Dassios, Fountoulakis and G., A Preconditioner for a Primal-Dual Newton Conjugate Gradient Method for Compressed Sensing Problems, SIAM J. on Sci. Comput. 37 (2015) A2783–A2812.


CS: Coherent and Redundant Dictionaries

−→ good clustering of eigenvalues

pdNCG outperforms TFOCS on several examples (TFOCS: Becker, Candes and Grant).


Example 3: Big Data and Optimization

with K. Fountoulakis.

Large dense quadratic optimization problem:

min_x  τ‖x‖_1 + ½‖Ax − b‖_2²,

where A ∈ R^{m×n}.

Fountoulakis and G., Performance of First- and Second-Order Methods for ℓ1-regularized Least Squares Problems, COAP 65 (2016) 605–635.

Software available at http://www.maths.ed.ac.uk/ERGO/trillion/


Simple example for ℓ1-regularization

min_x  τ‖x‖_1 + ‖Ax − b‖_2²

Special matrix given in SVD form A = QΣGᵀ.

Matlab generator: http://www.maths.ed.ac.uk/ERGO/trillion/

The user controls:

• the condition number κ(A),

• the sparsity of matrix A.


Simple example for ℓ1-regularization

Suppose m ≥ n. Write A in the SVD form:

A = Q [ diag{σ_1, σ_2, …, σ_n} ] Gᵀ,
      [           0            ]

where Q is an m×m orthogonal matrix, σ_1, σ_2, …, σ_n are the singular values of A, and G is a product of Givens rotations

G = G(i_1, j_1, θ_1) G(i_2, j_2, θ_2) … G(i_K, j_K, θ_K),

hence it is an orthonormal matrix. Observe that

AᵀA = G diag{σ_1², σ_2², …, σ_n²} Gᵀ.


Excessive Computational Tests (4 months of CPU)

• FISTA (Fast Iterative Shrinkage-Thresholding Alg)

• PCDM (Parallel Coordinate Descent Method)

• PSSgb (Projected Scaled Subgrad, Gafni-Bertsekas)

• pdNCG (primal-dual Newton Conjugate Gradients)

The 1st-order methods:

• work well if the condition number κ(A) ≤ 10²,

• struggle when κ(A) ≥ 10³,

• stall when κ(A) ≥ 10⁴.

The 2nd-order method (pdNCG, diagonal preconditioner):

• works well if the condition number κ(A) ≤ 10⁶.


Let us go big: a trillion (2^40) variables

n (billions)   Processors   Memory (TB)   time (s)
         1            64         0.192       1923
         4           256         0.768       1968
        16          1024         3.072       1986
        64          4096        12.288       1970
       256         16384        49.152       1990
     1,024         65536       196.608       2006

ARCHER (ranked 25 on top500.com, 11 March 2015)
Linpack Performance (Rmax): 1,642.54 TFlop/s
Theoretical Peak (Rpeak): 2,550.53 TFlop/s

Fountoulakis and G., Performance of First- and Second-Order Methods for ℓ1-regularized Least Squares Problems, COAP 65 (2016) 605–635.


Conclusions

2nd-order methods for optimization:

• employ inexact Newton method
• rely on preconditioners
• enjoy matrix-free implementation

Trick:

• find the “essential subspace” and
• exploit it to simplify the linear algebra

– works in IPMs for LP
– works in Newton CG for ℓ1-regularization

Simple, reliable test example for ℓ1-regularization:
http://www.maths.ed.ac.uk/ERGO/trillion/


Baby test example: RCDC vs pdNCG

[Figure: objective function value vs CPU time for pdNCG and RCDC on two test instances]

Dimensions: m = 4×10³, n = 2×10³. x* has 50 non-zero elements, randomly positioned.

RCDC interrupted after 10⁹ iterations, 31 hours.


 ID   rhs   Accuracy    mfIPM   ℓ1−ℓs    pdco    fpc_as_cg   spgl1
  2    b    3.0e-04        61      48     687            9   40000
       b    1.0e-11        65      98   40007        40002      22
  3    b    7.0e-04       241     462    4941          106   40000
       b    1.0e-08       415    1612   40157          212     148
  5    b    2.0e-03      5991    9842   28203          521   40000
       b    2.0e-05      7953   19684   41283          874    2567
 10    b    1.0e-03      4775    8529    6203        40002   40000
       b    9.0e-10      4567    8192   41227        40161   40000
701    b    2.0e-02       947    1794    5967         1049   40000
       b    7.0e-09      1341    2656   42041        40017   15239
702    b    4.0e-03       809    1574    3341        40001   40000
       b    1.0e-07      1123    3030   49563        40157   11089


A better linearization

τ Dx + Aᵀ(Ax − b) = 0,       (the underbraced term Dx is ∇ψ_µ(x))

where D := diag(D_1, …, D_n) with D_i := (µ² + x_i²)^(−1/2), ∀ i = 1, …, n.

Set g = Dx. Use the easier form of the equations.

Difficult:
τg + Aᵀ(Ax − b) = 0,
g = Dx.

Easy:
τg + Aᵀ(Ax − b) = 0,
D⁻¹g = x.

Chan, Golub, Mulet, SIAM J. on Sci. Comput. 20 (1999) 1964–1977.


A better linearization. Example: g_i = 0.99

bad: g_i = D_i x_i        good: D_i⁻¹ g_i = x_i
