
arXiv:1402.5521v5 [cs.DC] 9 Dec 2014

Parallel Selective Algorithms for Nonconvex Big Data Optimization

Francisco Facchinei,∗ Gesualdo Scutari,∗ and Simone Sagratella

Abstract—We propose a decomposition framework for the parallel optimization of the sum of a differentiable (possibly nonconvex) function and a (block) separable nonsmooth, convex one. The latter term is usually employed to enforce structure in the solution, typically sparsity. Our framework is very flexible and includes both fully parallel Jacobi schemes and Gauss-Seidel (i.e., sequential) ones, as well as virtually all possibilities “in between” with only a subset of variables updated at each iteration. Our theoretical convergence results improve on existing ones, and numerical results on LASSO, logistic regression, and some nonconvex quadratic problems show that the new method consistently outperforms existing algorithms.

Index Terms—Parallel optimization, Variables selection, Distributed methods, Jacobi method, LASSO, Sparse solution.

I. INTRODUCTION

The minimization of the sum of a smooth function, F, and of a nonsmooth (block separable) convex one, G,

$$\min_{x \in X} \; V(x) \triangleq F(x) + G(x), \qquad (1)$$

is a ubiquitous problem that arises in many fields of engineering as diverse as compressed sensing, basis pursuit denoising, sensor networks, neuroelectromagnetic imaging, machine learning, data mining, sparse logistic regression, genomics, meteorology, tensor factorization and completion, geophysics, and radio astronomy. Usually the nonsmooth term is used to promote sparsity of the optimal solution, which often corresponds to a parsimonious representation of the phenomenon at hand. Many of the aforementioned applications can give rise to extremely large problems, so that standard optimization techniques are hardly applicable. Indeed, recent years have witnessed a flurry of research activity aimed at developing solution methods that are simple (for example, based solely on matrix/vector multiplications) yet capable of converging to a good approximate solution in reasonable time. It is hard even to summarize the huge amount of work done in this field; we refer the reader to the recent works [2]–[14] and books [15]–[17] as entry points to the literature.

However, with big data problems it is clearly necessary to design parallel methods able to exploit the computational power of multi-core processors in order to solve many interesting problems. Furthermore, even if parallel methods are used, it might be useful to reduce the number of (block) variables that are updated at each iteration, so as to alleviate the burden of dimensionality, and also because it has been observed

∗ The first two authors contributed equally to the paper.
F. Facchinei and S. Sagratella are with the Dept. of Computer, Control, and Management Engineering, Univ. of Rome La Sapienza, Rome, Italy. Emails: <facchinei,sagratella>@dis.uniroma1.it.
G. Scutari is with the Dept. of Electrical Engineering, State Univ. of New York at Buffalo, Buffalo, USA. Email: [email protected]
Part of this work has been presented at the 2014 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2014), Florence, Italy, May 4-9, 2014 [1].

experimentally (even if in restricted settings and under some very strong convergence assumptions, see [13], [18]–[20]) that this might be beneficial. While sequential solution methods for Problem (1) have been widely investigated (especially when F is convex), the analysis of parallel algorithms suitable to large-scale implementations lags behind. Gradient-type methods can of course be easily parallelized but, in spite of their good theoretical convergence bounds, they suffer from practical drawbacks. First of all, they are notoriously slow in practice. Accelerated (proximal) versions have been proposed in the literature to alleviate this issue, but they require the knowledge of some function parameters (e.g., the Lipschitz constant of ∇F, and the strong convexity constant of F when F is assumed strongly convex), which is not generally available, unless F and G have a very special structure (e.g., quadratic functions); (over)estimates of such parameters affect the convergence speed negatively. Moreover, all (proximal, accelerated) gradient-based schemes use only the first order information of F; recently we showed in [21] that exploiting the structure of F by replacing its linearization with a “better” approximant can enhance practical convergence speed. However, beyond that, and looking at recent approaches, we are aware of only a few (recent) papers dealing with parallel solution methods [8]–[13] and [20], [22]–[27]. The former group of works investigates deterministic algorithms, while the latter random ones. One advantage of the analyses in these works is that they provide a global rate of convergence. However, i) they are essentially still (regularized) gradient-based methods; ii) as proximal-gradient algorithms, they require good and global knowledge of some F and G parameters; and iii) except for [9], [10], [12], [26], they are proved to converge only when F is convex. Indeed, to date there are no methods that are parallel and random and that can be applied to nonconvex problems. For this reason, and also because of their markedly different flavor (for example, deterministic convergence vs. convergence in mean or in probability), we do not discuss random algorithms further in this paper. We refer instead to Section V for a more detailed discussion on deterministic, parallel, and sequential solution methods proposed in the literature for instances (mainly convex) of (1).

In this paper, we focus on nonconvex problems in the form (1), and propose a new broad, deterministic algorithmic framework for the computation of their stationary solutions, which hinges on ideas first introduced in [21]. The essential, rather natural idea underlying our approach is to decompose (1) into a sequence of (simpler) subproblems whereby the function F is replaced by suitable convex approximations; the subproblems can be solved in a parallel and distributed fashion and it is not necessary to update all variables at each iteration. Key features of the proposed algorithmic framework are: i) it is parallel, with a degree of parallelism that can be chosen by the user and that can go from a complete


parallelism (every variable is updated in parallel to all the others) to the sequential one (only one variable is updated at each iteration), covering virtually all the possibilities in “between”; ii) it permits the update in a deterministic fashion of only some (block) variables at each iteration (a feature that turns out to be very important numerically); iii) it easily leads to distributed implementations; iv) no knowledge of F and G parameters (e.g., the Lipschitz constant of ∇F) is required; v) it can tackle a nonconvex F; vi) it is very flexible in the choice of the approximation of F, which need not necessarily be its first or second order approximation (like in proximal-gradient schemes); of course it includes, among others, updates based on gradient- or Newton-type approximations; vii) it easily allows for inexact solution of the subproblems (in some large-scale problems the cost of computing the exact solution of all the subproblems can be prohibitive); and viii) it appears to be numerically efficient. While features i)-viii), taken singularly, are certainly not new in the optimization community, we are not aware of any algorithm that possesses them all, the more so if one limits attention to methods that can handle nonconvex objective functions, which is in fact the main focus of this paper. Furthermore, numerical results show that our scheme compares favorably to existing ones.

The proposed framework encompasses a gamut of novel algorithms, offering great flexibility to control iteration complexity, communication overhead, and convergence speed, while converging under the same conditions; these desirable features make our schemes applicable to several different problems and scenarios. Among the variety of new proposed updating rules for the (block) variables, it is worth mentioning a combination of Jacobi and Gauss-Seidel updates, which seems particularly valuable in the optimization of highly nonlinear objective functions; to the best of our knowledge, this is the first time that such a scheme is proposed and analyzed.

A further contribution of the paper is an extensive implementation effort over a parallel architecture (the General Compute Cluster of the Center for Computational Research at the SUNY at Buffalo), which includes our schemes and the most competitive ones in the literature. Numerical results on LASSO, Logistic Regression, and some nonconvex problems show that our algorithms consistently outperform state-of-the-art schemes.

The paper is organized as follows. Section II formally introduces the optimization problem along with the main assumptions under which it is studied. Section III describes our novel general algorithmic framework along with its convergence properties. In Section IV we discuss several instances of the general scheme introduced in Section III. Section V contains a detailed comparison of our schemes with state-of-the-art algorithms on similar problems. Numerical results are presented in Section VI. Finally, Section VII draws some conclusions. All proofs of our results are given in the Appendix.

II. PROBLEM DEFINITION

We consider Problem (1), where the feasible set X = X_1 × · · · × X_N is a Cartesian product of lower dimensional convex sets X_i ⊆ R^{n_i}, and x ∈ R^n is partitioned accordingly: x = (x_1, . . . , x_N), with each x_i ∈ R^{n_i}; F is smooth (and not necessarily convex) and G is convex and possibly nondifferentiable, with G(x) = Σ_{i=1}^N g_i(x_i). This formulation is very general and includes problems of great interest. Below we list some instances of Problem (1).
• G(x) = 0; the problem reduces to the minimization of a smooth, possibly nonconvex function over a convex constraint set.
• F(x) = ‖Ax − b‖^2 and G(x) = c‖x‖_1, X = R^n, with A ∈ R^{m×n}, b ∈ R^m, and c ∈ R_{++} given constants; this is the renowned and much studied LASSO problem [2].
• F(x) = ‖Ax − b‖^2 and G(x) = c Σ_{i=1}^N ‖x_i‖_2, X = R^n, with A ∈ R^{m×n}, b ∈ R^m, and c ∈ R_{++} given constants; this is the group LASSO problem [28] (see the sketch after this list for its block structure).
• F(x) = Σ_{j=1}^m log(1 + e^{−a_j y_j^T x}) and G(x) = c‖x‖_1 (or G(x) = c Σ_{i=1}^N ‖x_i‖_2), with y_j ∈ R^n, a_j ∈ R, and c ∈ R_{++} given constants; this is the sparse logistic regression problem [29], [30].
• F(x) = Σ_{j=1}^m max{0, 1 − a_j y_j^T x}^2 and G(x) = c‖x‖_1, with a_j ∈ {−1, 1}, y_j ∈ R^n, and c ∈ R_{++} given; this is the ℓ_1-regularized ℓ_2-loss Support Vector Machine problem [5].
• F(X_1, X_2) = ‖Y − X_1 X_2‖_F^2 and G(X_2) = c‖X_2‖_1, X = {(X_1, X_2) ∈ R^{n×m} × R^{m×N} : ‖X_1 e_i‖_2 ≤ α_i, ∀ i = 1, . . . , m}, where X_1 and X_2 are the (matrix) optimization variables, Y ∈ R^{n×N}, c > 0, and (α_i)_{i=1}^m > 0 are given constants, e_i is the m-dimensional vector with a 1 in the i-th coordinate and 0's elsewhere, and ‖X‖_F and ‖X‖_1 denote the Frobenius norm and the L1 matrix norm of X, respectively; this is an example of the dictionary learning problem for sparse representation, see, e.g., [31]. Note that F(X_1, X_2) is not jointly convex in (X_1, X_2).
• Other problems that can be cast in the form (1) include the Nuclear Norm Minimization problem, the Robust Principal Component Analysis problem, the Sparse Inverse Covariance Selection problem, and the Nonnegative Matrix (or Tensor) Factorization problem; see, e.g., [32] and references therein.
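To make the block structure of (1) concrete, here is a minimal NumPy sketch that evaluates V(x) = F(x) + G(x) for the group LASSO instance above, with G block-separable across the partition x = (x_1, . . . , x_N). It is illustrative only; the dimensions, block sizes, and the data A, b, c are placeholders.

```python
import numpy as np

# Illustrative group-LASSO instance of Problem (1): F(x) = ||Ax - b||^2, G(x) = c * sum_i ||x_i||_2.
m, n, N = 60, 100, 10                        # rows, variables, number of blocks (placeholders)
rng = np.random.default_rng(0)
A, b, c = rng.standard_normal((m, n)), rng.standard_normal(m), 0.1
blocks = np.array_split(np.arange(n), N)     # index sets of the blocks x_1, ..., x_N

def F(x):                                    # smooth (here convex) part
    r = A @ x - b
    return r @ r

def G(x):                                    # block-separable nonsmooth part: sum_i g_i(x_i)
    return c * sum(np.linalg.norm(x[idx]) for idx in blocks)

def V(x):                                    # objective of Problem (1)
    return F(x) + G(x)

print(V(np.zeros(n)))                        # V(0) = ||b||^2
```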

Assumptions. Given (1), we make the following blanket assumptions:
(A1) Each X_i is nonempty, closed, and convex;
(A2) F is C^1 on an open set containing X;
(A3) ∇F is Lipschitz continuous on X with constant L_F;
(A4) G(x) = Σ_{i=1}^N g_i(x_i), with all g_i continuous and convex on X_i;
(A5) V is coercive.

Note that the above assumptions are standard and are satisfied by most problems of practical interest. For instance, A3 holds automatically if X is bounded; the block-separability condition A4 is a common assumption in the literature of parallel methods for the class of problems (1) (it is in fact instrumental to deal with the nonsmoothness of G in a parallel environment). Interestingly, A4 is satisfied by all standard G usually encountered in applications, including G(x) = ‖x‖_1 and G(x) = Σ_{i=1}^N ‖x_i‖_2, which are among the most commonly used functions. Assumption A5 is needed to guarantee that the sequence generated by our method is bounded; we could dispense with it at the price of a more complex analysis and a more cumbersome statement of the convergence results.


III. MAIN RESULTS

We begin by introducing an informal description of our algorithmic framework along with a list of key features that we would like our schemes to enjoy; this will shed light on the core idea of the proposed decomposition technique.

We want to develop parallel solution methods for Problem (1) whereby operations can be carried out on some or (possibly) all (block) variables x_i at the same time. The most natural parallel (Jacobi-type) method one can think of is updating all blocks simultaneously: given x^k, each (block) variable x_i is updated by solving the following subproblem

$$x_i^{k+1} \in \operatorname*{argmin}_{x_i \in X_i} \bigl\{ F(x_i, x_{-i}^k) + g_i(x_i) \bigr\}, \qquad (2)$$

where x_{-i} denotes the vector obtained from x by deleting the block x_i. Unfortunately this method converges only under very restrictive conditions [33] that are seldom verified in practice. To cope with this issue the proposed approach introduces some “memory” in the iterate: the new point is a convex combination of x^k and the solutions of (2). Building on this iterate, we would like our framework to enjoy many additional features, as described next.
Approximating F: Solving each subproblem as in (2) may be too costly or difficult in some situations. One may then prefer to approximate this problem, in some suitable sense, in order to facilitate the task of computing the new iteration. To this end, we assume that for all i ∈ N ≜ {1, . . . , N} we can define a function P_i(z; w) : X_i × X → R, the candidate approximant of F, having the following properties (we denote by ∇P_i the partial gradient of P_i with respect to the first argument z):

(P1) P_i(•; w) is convex and continuously differentiable on X_i for all w ∈ X;
(P2) ∇P_i(x_i; x) = ∇_{x_i} F(x) for all x ∈ X;
(P3) ∇P_i(z; •) is Lipschitz continuous on X for all z ∈ X_i.

Such a function P_i should be regarded as a (simple) convex approximation of F at the point x with respect to the block of variables x_i that preserves the first order properties of F with respect to x_i.

Based on this approximation we can define at any point x^k ∈ X a regularized approximation h̃_i(x_i; x^k) of V with respect to x_i wherein F is replaced by P_i while the nondifferentiable term is preserved, and a quadratic proximal term is added to make the overall approximation strongly convex. More formally, we have

$$\tilde{h}_i(x_i; x^k) \,\triangleq\, \underbrace{P_i(x_i; x^k) + \frac{\tau_i}{2}\, (x_i - x_i^k)^T Q_i(x^k) (x_i - x_i^k)}_{\triangleq\, h_i(x_i; x^k)} \; + \; g_i(x_i), \qquad (3)$$

where Q_i(x^k) is an n_i × n_i positive definite matrix (possibly dependent on x^k). We always assume that the functions h̃_i(•; x^k) are uniformly strongly convex.

(A6) All h̃_i(•; x^k) are uniformly strongly convex on X_i with a common positive definiteness constant q > 0; furthermore, Q_i(•) is Lipschitz continuous on X.

Note that an easy and standard way to satisfy A6 is to take, for any i and for any k, τ_i = q > 0 and Q_i(x^k) = I. However, if P_i(•; x^k) is already uniformly strongly convex, one can avoid the proximal term and set τ_i = 0 while still satisfying A6.

Associated with each i and point x^k ∈ X we can define the following optimal block solution map:

$$\hat{x}_i(x^k, \tau_i) \,\triangleq\, \operatorname*{argmin}_{x_i \in X_i}\ \tilde{h}_i(x_i; x^k). \qquad (4)$$

Note that x̂_i(x^k, τ_i) is always well-defined, since the optimization problem in (4) is strongly convex. Given (4), we can then introduce, for each y ∈ X, the solution map

$$\hat{x}(y, \tau) \,\triangleq\, \bigl[ \hat{x}_1(y, \tau_1), \hat{x}_2(y, \tau_2), \ldots, \hat{x}_N(y, \tau_N) \bigr]^T.$$

The proposed algorithm (that we formally describe later on) is based on the computation of (an approximation of) x̂(x^k, τ). Therefore the functions P_i should lead to as easily computable maps x̂ as possible. An appropriate choice depends on the problem at hand and on computational requirements. We discuss alternative possible choices for the approximations P_i in Section IV.
Inexact solutions: In many situations (especially in the case of large-scale problems), it can be useful to further reduce the computational effort needed to solve the subproblems in (4) by allowing inexact computations z_i^k of x̂_i(x^k, τ_i), i.e., ‖z_i^k − x̂_i(x^k, τ_i)‖ ≤ ε_i^k, where ε_i^k measures the accuracy in computing the solution.
Updating only some blocks: Another important feature we want for our algorithm is the capability of updating at each iteration only some of the (block) variables, a feature that has been observed to be very effective numerically. In fact, our schemes are guaranteed to converge under the update of only a subset of the variables at each iteration; the only condition is that such a subset contains at least one (block) component which is within a factor ρ ∈ (0, 1] “far away” from optimality, in the sense explained next. Since x_i^k is an optimal solution of (4) if and only if x̂_i(x^k, τ_i) = x_i^k, a natural distance of x_i^k from optimality is d_i^k ≜ ‖x̂_i(x^k, τ_i) − x_i^k‖; one could then select the blocks x_i to update based on such an optimality measure (e.g., opting for blocks exhibiting larger d_i^k's). However, this choice requires the computation of all the solutions x̂_i(x^k, τ_i), for i = 1, . . . , N, which in some applications (e.g., huge-scale problems) might be computationally too expensive. Building on the same idea, we can introduce alternative, less expensive metrics by replacing the distance ‖x̂_i(x^k, τ_i) − x_i^k‖ with a computationally cheaper error bound, i.e., a function E_i(x) such that

$$\underline{s}_i \, \| \hat{x}_i(x^k, \tau_i) - x_i^k \| \;\le\; E_i(x^k) \;\le\; \bar{s}_i \, \| \hat{x}_i(x^k, \tau_i) - x_i^k \|, \qquad (5)$$

for some 0 < \underline{s}_i ≤ \bar{s}_i. Of course one can always set E_i(x^k) = ‖x̂_i(x^k, τ_i) − x_i^k‖, but other choices are also possible; we discuss this point further in Section IV.

Algorithmic framework: We are now ready to formally introduce our algorithm, Algorithm 1, which includes all the features discussed above; convergence to stationary solutions¹ of (1) is stated in Theorem 1.

¹ We recall that a stationary solution x* of (1) is a point for which a subgradient ξ ∈ ∂G(x*) exists such that (∇F(x*) + ξ)^T (y − x*) ≥ 0 for all y ∈ X. Of course, if F is convex, stationary points coincide with global minimizers.


Algorithm 1: Inexact Flexible Parallel Algorithm (FLEXA)

Data: {ε_i^k} for i ∈ N, τ ≥ 0, {γ^k} > 0, x^0 ∈ X, ρ ∈ (0, 1]. Set k = 0.
(S.1): If x^k satisfies a termination criterion: STOP;
(S.2): Set M^k ≜ max_i {E_i(x^k)}. Choose a set S^k that contains at least one index i for which E_i(x^k) ≥ ρ M^k.
(S.3): For all i ∈ S^k, solve (4) with accuracy ε_i^k: find z_i^k ∈ X_i s.t. ‖z_i^k − x̂_i(x^k, τ)‖ ≤ ε_i^k; set ẑ_i^k = z_i^k for i ∈ S^k and ẑ_i^k = x_i^k for i ∉ S^k.
(S.4): Set x^{k+1} ≜ x^k + γ^k (ẑ^k − x^k);
(S.5): k ← k + 1, and go to (S.1).

Theorem 1. Let {x^k} be the sequence generated by Algorithm 1, under A1-A6. Suppose that {γ^k} and {ε_i^k} satisfy the following conditions: i) γ^k ∈ (0, 1]; ii) Σ_k γ^k = +∞; iii) Σ_k (γ^k)^2 < +∞; and iv) ε_i^k ≤ γ^k α_1 min{α_2, 1/‖∇_{x_i} F(x^k)‖} for all i ∈ N and some α_1 > 0 and α_2 > 0. Additionally, if inexact solutions are used in Step 3, i.e., ε_i^k > 0 for some i and infinitely many k, then assume also that G is globally Lipschitz on X. Then, either Algorithm 1 converges in a finite number of iterations to a stationary solution of (1), or every limit point of {x^k} (at least one such point exists) is a stationary solution of (1).

Proof: See Appendix B.

The proposed algorithm is extremely flexible. We can always choose S^k = N, resulting in the simultaneous update of all the (block) variables (full Jacobi scheme); or, at the other extreme, one can update a single (block) variable at a time, thus obtaining a Gauss-Southwell kind of method. More classical cyclic Gauss-Seidel methods can also be derived and are discussed in the next subsection. One can also compute inexact solutions (Step 3) while preserving convergence, provided that the error term ε_i^k and the step-size γ^k are chosen according to Theorem 1; some practical choices for these parameters are discussed in Section IV. We emphasize that the Lipschitzianity of G is required only if x̂(x^k, τ) is not computed exactly for infinitely many iterations. At any rate, this Lipschitz condition is automatically satisfied if G is a norm (and therefore in LASSO and group LASSO problems, for example) or if X is bounded.

As a final remark, note that versions of Algorithm 1 where all (or most of) the variables are updated at each iteration are particularly amenable to implementation in distributed environments (e.g., multi-user communications systems, ad-hoc networks, etc.). In fact, in this case, not only can the calculation of the inexact solutions z_i^k be carried out in parallel, but the information that “the i-th subproblem” has to exchange with the “other subproblems” in order to compute the next iteration is very limited. A full appreciation of the potentialities of our approach in distributed settings depends however on the specific application under consideration and is beyond the scope of this paper. We refer the reader to [21] for some examples, even if in less general settings.
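To fix ideas, the following NumPy sketch instantiates Algorithm 1 on the LASSO problem (F(x) = ‖Ax − b‖², G(x) = c‖x‖₁, scalar blocks), using the simple linearization of F discussed in Section IV [cf. (7)] with Q_i = I, exact subproblem solutions (ε_i^k = 0), E_i(x^k) = |x̂_i(x^k, τ_i) − x_i^k|, and the diminishing step-size rule (6). It is a minimal illustration only, not the tuned C++/MKL implementation of Section VI (which uses the approximant (8) and adaptive τ_i, γ^k rules); parameter values here are placeholders.

```python
import numpy as np

def soft_threshold(v, t):
    """Entrywise soft-thresholding, i.e., the proximal operator of t * |.|_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def flexa_lasso(A, b, c, rho=0.5, gamma0=0.9, theta=1e-2, iters=500):
    """Sketch of Algorithm 1 (FLEXA) for LASSO with scalar blocks, exact best
    responses computed from the linearized approximant (7) with Q_i = I."""
    n = A.shape[1]
    x = np.zeros(n)
    tau = np.full(n, (A * A).sum() / (2 * n))           # tau_i = tr(A^T A)/(2n), cf. Section VI
    gamma = gamma0
    for _ in range(iters):
        grad = 2.0 * A.T @ (A @ x - b)                  # gradient of F(x) = ||Ax - b||^2
        x_hat = soft_threshold(x - grad / tau, c / tau) # best response (7), closed form
        E = np.abs(x_hat - x)                           # error bound E_i(x^k) = |x_hat_i - x_i^k|
        S = E >= rho * E.max()                          # greedy block selection, step (S.2)
        z = np.where(S, x_hat, x)                       # update only selected blocks, step (S.3)
        x = x + gamma * (z - x)                         # convex combination, step (S.4)
        gamma *= 1.0 - theta * gamma                    # diminishing step-size rule (6)
    return x
```

Setting rho = 0 recovers the fully parallel Jacobi variant in which every block is updated at each iteration.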

A. Gauss-Jacobi algorithms

Algorithm 1 and its convergence theory cover fully parallel Jacobi as well as Gauss-Southwell-type methods, and many of their variants. In this section we show that Algorithm 1 can also incorporate hybrid parallel-sequential (Jacobi–Gauss-Seidel) schemes wherein blocks of variables are updated simultaneously by sequentially computing entries per block. This procedure seems particularly well suited to parallel optimization on multi-core/processor architectures.

Suppose that we have P processors that can be used in parallel and we want to exploit them to solve Problem (1) (P will denote both the number of processors and the set {1, 2, . . . , P}). We “assign” to each processor p the variables I_p; therefore I_1, . . . , I_P is a partition of the index set N ≜ {1, . . . , N}. We denote by x_p ≜ (x_{pi})_{i∈I_p} the vector of (block) variables x_{pi} assigned to processor p, with i ∈ I_p; and x_{−p} is the vector of remaining variables, i.e., the vector of those assigned to all processors except the p-th one. Finally, given i ∈ I_p, we partition x_p as x_p = (x_{pi<}, x_{pi≥}), where x_{pi<} is the vector containing all variables in I_p that come before i (in the order assumed in I_p), while x_{pi≥} are the remaining variables. Thus we will write, with a slight abuse of notation, x = (x_{pi<}, x_{pi≥}, x_{−p}).

Once the optimization variables have been assigned to the processors, one could in principle apply the inexact Jacobi Algorithm 1. In this scheme each processor p would compute sequentially, at each iteration k and for every (block) variable x_{pi}, a suitable z_{pi}^k by keeping all variables but x_{pi} fixed to (x_{pj}^k)_{i≠j∈I_p} and x_{−p}^k. But since we are solving the problems for each group of variables assigned to a processor sequentially, this seems a waste of resources; it is instead much more efficient to use, within each processor, a Gauss-Seidel scheme, whereby the currently calculated iterates are used in all subsequent calculations. Our Gauss-Jacobi method, formally described in Algorithm 2, implements exactly this idea; its convergence properties are given in Theorem 2.

Algorithm 2: Inexact Gauss-Jacobi Algorithm

Data: {ε_{pi}^k} for p ∈ P and i ∈ I_p, τ ≥ 0, {γ^k} > 0, x^0 ∈ X. Set k = 0.
(S.1): If x^k satisfies a termination criterion: STOP;
(S.2): For all p ∈ P do (in parallel),
       For all i ∈ I_p do (sequentially)
       a) Find z_{pi}^k ∈ X_i s.t. ‖z_{pi}^k − x̂_{pi}((x_{pi<}^{k+1}, x_{pi≥}^k, x_{−p}^k), τ)‖ ≤ ε_{pi}^k;
       b) Set x_{pi}^{k+1} ≜ x_{pi}^k + γ^k (z_{pi}^k − x_{pi}^k)
(S.3): k ← k + 1, and go to (S.1).

Theorem 2. Let {x^k}_{k=1}^∞ be the sequence generated by Algorithm 2, under the setting of Theorem 1, but with the additional assumption that ∇F is bounded on X. Then, either Algorithm 2 converges in a finite number of iterations to a stationary solution of (1), or every limit point of the sequence {x^k}_{k=1}^∞ (at least one such point exists) is a stationary solution of (1).

Proof: See Appendix C.
Although the proof of Theorem 2 is relegated to the appendix, it is interesting to point out that the gist of the proof


is to show that Algorithm 2 is nothing else but an instance of Algorithm 1 with errors. We remark that Algorithm 2 contains as a special case the classical cyclic Gauss-Seidel scheme: it is sufficient to set P = 1; then a single processor updates all the (scalar) variables sequentially while using the new values of those that have already been updated.
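The update pattern of Algorithm 2, Jacobi across processors and Gauss-Seidel within each processor's blocks, can be made concrete with the following serial NumPy simulation for LASSO with scalar blocks and the linearized approximant (7). This is a sketch under illustrative parameter choices, not a parallel implementation; `partition` plays the role of I_1, . . . , I_P.

```python
import numpy as np

def gauss_jacobi_lasso(A, b, c, partition, tau=1.0, gamma0=0.9, theta=1e-2, iters=200):
    """Serial simulation of Algorithm 2 for LASSO with scalar blocks: Jacobi across
    the index sets in `partition` (each "processor" sees only x^k from the others),
    Gauss-Seidel within each set (its own already-updated entries use the new values)."""
    n = A.shape[1]
    x_old = np.zeros(n)
    gamma = gamma0
    for _ in range(iters):
        x_new = x_old.copy()
        for Ip in partition:                         # "in parallel" over processors p
            updated = []                             # entries of I_p already updated this sweep
            for i in Ip:                             # sequentially within processor p
                x_eval = x_old.copy()                # point (x^{k+1}_{pi<}, x^k_{pi>=}, x^k_{-p})
                x_eval[updated] = x_new[updated]
                g_i = 2.0 * A[:, i] @ (A @ x_eval - b)    # grad_i F at the mixed point
                v = x_old[i] - g_i / tau                  # soft-thresholding argument
                z_i = np.sign(v) * max(abs(v) - c / tau, 0.0)
                x_new[i] = x_old[i] + gamma * (z_i - x_old[i])
                updated.append(i)
        x_old = x_new
        gamma *= 1.0 - theta * gamma                 # diminishing step-size (6)
    return x_old

# e.g., partition = np.array_split(np.arange(A.shape[1]), P) for P "processors";
# P = 1 reproduces the cyclic Gauss-Seidel special case mentioned above.
```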

By updating all variables at each iteration, Algorithm 2 has the advantage that neither the error bounds E_i nor the exact solutions x̂_{pi} need to be computed in order to decide which variables should be updated. Furthermore, it is rather intuitive that the use of the “latest available information” should reduce the number of overall iterations needed to converge with respect to Algorithm 1. However, these advantages should be weighed against the following two facts: i) updating all variables at each iteration might not always be the best (or a feasible) choice; and ii) in many practical instances of Problem (1), using the latest information as dictated by Algorithm 2 may require extra calculations, e.g., to compute function gradients, and communication overhead (these aspects are discussed on specific examples in Section VI). It may then be of interest to consider a further scheme, which we might call “Gauss-Jacobi with Selection”, where we merge the basic ideas of Algorithms 1 and 2. Roughly speaking, at each iteration we proceed as in the Gauss-Jacobi Algorithm 2, but we perform the Gauss steps only on a subset S_p^k of each I_p, where the subset S_p^k is defined according to the rules used in Algorithm 1. To present this combined scheme, we need to extend the notation used in Algorithm 2. Let S_p^k be a subset of I_p. For notational purposes only, we reorder x_p so that first we have all variables in S_p^k and then the remaining variables in I_p: x_p = (x_{S_p^k}, x_{I_p \ S_p^k}). Now, similarly to what was done before, and given an index i ∈ S_p^k, we partition x_{S_p^k} as x_{S_p^k} = (x_{S_p^k i<}, x_{S_p^k i≥}), where x_{S_p^k i<} is the vector containing all variables in S_p^k that come before i (in the order assumed in S_p^k), while x_{S_p^k i≥} are the remaining variables in S_p^k. Thus we will write, with a slight abuse of notation, x = (x_{S_p^k i<}, x_{S_p^k i≥}, x_{−p}). The proposed Gauss-Jacobi with Selection is formally described in Algorithm 3 below.

Algorithm 3: Inexact GJ Algorithm with Selection

Data: {ε_{pi}^k} for p ∈ P and i ∈ I_p, τ ≥ 0, {γ^k} > 0, x^0 ∈ X, ρ ∈ (0, 1]. Set k = 0.
(S.1): If x^k satisfies a termination criterion: STOP;
(S.2): Set M^k ≜ max_i {E_i(x^k)}. Choose sets S_p^k ⊆ I_p so that their union S^k contains at least one index i for which E_i(x^k) ≥ ρ M^k.
(S.3): For all p ∈ P do (in parallel),
       For all i ∈ S_p^k do (sequentially)
       a) Find z_{pi}^k ∈ X_i s.t. ‖z_{pi}^k − x̂_{pi}((x_{S_p^k i<}^{k+1}, x_{S_p^k i≥}^k, x_{−p}^k), τ)‖ ≤ ε_{pi}^k;
       b) Set x_{pi}^{k+1} ≜ x_{pi}^k + γ^k (z_{pi}^k − x_{pi}^k)
(S.4): Set x_{pi}^{k+1} = x_{pi}^k for all i ∉ S^k, k ← k + 1, and go to (S.1).

Theorem 3. Let {x^k}_{k=1}^∞ be the sequence generated by Algorithm 3, under the setting of Theorem 1, but with the additional assumption that ∇F is bounded on X. Then, either Algorithm 3 converges in a finite number of iterations to a stationary solution of (1), or every limit point of the sequence {x^k}_{k=1}^∞ (at least one such point exists) is a stationary solution of (1).

Proof: The proof is just a (notationally complicated, but conceptually easy) combination of the proofs of Theorems 1 and 2 and is omitted for lack of space.

Our experiments show that, in the case of highly nonlinear objective functions, Algorithm 3 performs very well in practice; see Section VI.

IV. EXAMPLES AND SPECIAL CASES

Algorithms 1 and 2 are very general and encompass a gamut of novel algorithms, each corresponding to various forms of the approximant P_i, the error bound function E_i, the step-size sequence γ^k, the block partition, etc. These choices lead to algorithms that can be very different from each other, but all converging under the same conditions. These degrees of freedom offer a lot of flexibility to control iteration complexity, communication overhead, and convergence speed. We outline next several effective choices for the design parameters along with some illustrative examples of specific algorithms resulting from a proper combination of these choices.
On the choice of the step-size γ^k. An example of step-size rule satisfying conditions i)-iii) in Theorem 1 is: given 0 < γ^0 ≤ 1, let

$$\gamma^k = \gamma^{k-1} \bigl( 1 - \theta \, \gamma^{k-1} \bigr), \qquad k = 1, \ldots, \qquad (6)$$

where θ ∈ (0, 1) is a given constant. Notice that while this rule may still require some tuning for optimal behavior, it is quite reliable, since in general we are not using a (sub)gradient direction, so that many of the well-known practical drawbacks associated with a (sub)gradient method with diminishing step-size are mitigated in our setting. Furthermore, this choice of step-size does not require any form of centralized coordination, which is a favorable feature in a parallel environment. Numerical results in Section VI show the effectiveness of (the customization of) (6) on specific problems.
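As a quick illustration, rule (6) can be generated as follows (a minimal sketch; the values of γ⁰ and θ are placeholders):

```python
def stepsizes(gamma0=0.9, theta=0.1):
    """Generator for the diminishing step-size rule (6):
    gamma^k = gamma^{k-1} * (1 - theta * gamma^{k-1})."""
    gamma = gamma0
    while True:
        yield gamma
        gamma = gamma * (1 - theta * gamma)
```

The resulting sequence behaves like 1/(θk) for large k, so Σ_k γ^k = +∞ while Σ_k (γ^k)² < +∞, as required by Theorem 1.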

Remark 4 (Line-search variants of Algorithms 1 and 2). It is possible to prove convergence of Algorithms 1 and 2 also using other step-size rules, for example Armijo-like line-search procedures on V. In fact, Proposition 8 in Appendix A shows that the direction x̂(x^k, τ) − x^k is a “good” descent direction. Based on this result, it is not difficult to prove that if γ^k is chosen by a suitable line-search procedure on V, convergence of Algorithm 1 to stationary points of Problem (1) (in the sense of Theorem 1) is guaranteed. Note that standard line-search methods proposed for smooth functions cannot be applied to V (due to the nonsmooth part G); one needs to rely on more sophisticated procedures, e.g., in the spirit of those proposed in [6], [10], [12], [37]. We provide next an example of line-search rule that can be used to compute γ^k while guaranteeing convergence of Algorithms 1 and 2 [instead of using rules i)-iii) in Theorem 1]; because of space limitations, we consider only the case of exact solutions (i.e., ε_i^k = 0 in Algorithm 1 and ε_{pi}^k = 0 in Algorithms 2 and 3). Writing for short x̂_i^k ≜ x̂_i(x^k, τ) and Δx̂^k ≜ (Δx̂_i^k)_{i=1}^N, with Δx̂_i^k ≜ x̂_i^k − x_i^k, and denoting by (x)_{S^k} the vector whose component i is equal to x_i if i ∈ S^k, and zero otherwise, we have the following: let α, β ∈ (0, 1), and choose γ^k = β^ℓ, where ℓ is the smallest nonnegative integer such that

$$V\bigl(x^k + \beta^{\ell} (\Delta \hat{x}^k)_{S^k}\bigr) - V(x^k) \;\le\; -\alpha \, \beta^{\ell} \, \bigl\| (\Delta \hat{x}^k)_{S^k} \bigr\|^2.$$

Of course, the above procedure will likely be more efficient in terms of iterations than the one based on diminishing step-size rules [as (6)]. However, performing a line-search on a multicore architecture requires some shared memory and coordination among the cores/processors; therefore we do not consider this variant further. Finally, we observe that convergence of Algorithms 1 and 2 can also be obtained by choosing a constant (suitably small) step-size γ^k. This is actually the easiest option, but since it generally leads to extremely slow convergence we omitted this option from our developments here.
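A minimal sketch of this Armijo-like backtracking on V (exact best responses; V, the selected direction, and the parameters α, β are supplied by the caller, and the default values below are placeholders):

```python
import numpy as np

def armijo_stepsize(V, x, dx_S, alpha=0.1, beta=0.5, max_backtracks=50):
    """Backtracking rule of Remark 4: returns gamma = beta**l for the smallest l with
    V(x + beta**l * dx_S) - V(x) <= -alpha * beta**l * ||dx_S||^2,
    where dx_S is (Delta x_hat^k)_{S^k}: x_hat - x on the selected blocks, 0 elsewhere."""
    Vx = V(x)
    sq_norm = float(dx_S @ dx_S)
    gamma = 1.0                           # beta**0
    for _ in range(max_backtracks):
        if V(x + gamma * dx_S) - Vx <= -alpha * gamma * sq_norm:
            return gamma
        gamma *= beta
    return gamma                          # fallback if the backtrack cap is hit
```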

On the choice of the error bound function E_i(x).
• As we mentioned, the most obvious choice is to take E_i(x) = ‖x̂_i(x^k, τ_i) − x_i^k‖. This is a valuable choice if the computation of x̂_i(x^k, τ_i) can be easily accomplished. For instance, in the LASSO problem with N = {1, . . . , n} (i.e., when each block reduces to a scalar variable), it is well-known that x̂_i(x^k, τ_i) can be computed in closed form using the soft-thresholding operator [11].
• In situations where the computation of ‖x̂_i(x^k, τ_i) − x_i^k‖ is not possible or advisable, we can resort to estimates. Assume momentarily that G ≡ 0. Then it is known [34, Proposition 6.3.1] that under our assumptions ‖Π_{X_i}(x_i^k − ∇_{x_i} F(x^k)) − x_i^k‖ is an error bound for the minimization problem in (4) and therefore satisfies (5), where Π_{X_i}(y) denotes the Euclidean projection of y onto the closed and convex set X_i. In this situation we can choose E_i(x^k) = ‖Π_{X_i}(x_i^k − ∇_{x_i} F(x^k)) − x_i^k‖ (see the sketch after this list). If G(x) ≢ 0 things become more complex. In most cases of practical interest, adequate error bounds can be derived from [10, Lemma 7].
• It is interesting to note that the computation of E_i is only needed if a partial update of the (block) variables is performed. However, an option that is always feasible is to take S^k = N at each iteration, i.e., update all (block) variables at each iteration. With this choice we can dispense with the computation of E_i altogether.
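For instance, when G ≡ 0 and each X_i is a box, the projection-residual error bound above takes one line of NumPy (a sketch; the box bounds lo, hi and the supplied block gradient are assumptions of the example, not prescribed by the text):

```python
import numpy as np

def error_bound_box(x_i, grad_i, lo, hi):
    """E_i(x) = || Pi_{X_i}(x_i - grad_i) - x_i || for a box X_i = [lo, hi]^{n_i},
    with grad_i the block gradient of F at x (valid when G == 0)."""
    projected = np.clip(x_i - grad_i, lo, hi)   # Euclidean projection onto the box
    return np.linalg.norm(projected - x_i)
```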

On the choice of the approximant P_i(x_i; x).
• The most obvious choice for P_i is the linearization of F at x^k with respect to x_i:

$$P_i(x_i; x^k) = F(x^k) + \nabla_{x_i} F(x^k)^T (x_i - x_i^k).$$

With this choice, and taking for simplicity Q_i(x^k) = I,

$$\hat{x}_i(x^k, \tau_i) = \operatorname*{argmin}_{x_i \in X_i} \Bigl\{ F(x^k) + \nabla_{x_i} F(x^k)^T (x_i - x_i^k) + \frac{\tau_i}{2} \|x_i - x_i^k\|^2 + g_i(x_i) \Bigr\}. \qquad (7)$$

This is essentially the way a new iteration is computed in most (block-)CDMs for the solution of (group) LASSO problems and its generalizations.
• At another extreme we could just take P_i(x_i; x^k) = F(x_i, x_{-i}^k). Of course, to have P1 satisfied (cf. Section III), we must assume that F(x_i, x_{-i}^k) is convex. With this choice, and setting for simplicity Q_i(x^k) = I, we have

$$\hat{x}_i(x^k, \tau_i) \triangleq \operatorname*{argmin}_{x_i \in X_i} \Bigl\{ F(x_i, x_{-i}^k) + \frac{\tau_i}{2} \|x_i - x_i^k\|^2 + g_i(x_i) \Bigr\}, \qquad (8)$$

thus giving rise to a parallel nonlinear Jacobi type method for the constrained minimization of V(x) (for LASSO with scalar blocks, (8) has the closed form sketched after this list).
• Between the two “extreme” solutions proposed above, one can consider “intermediate” choices. For example, if F(x_i, x_{-i}^k) is convex, we can take P_i(x_i; x^k) as a second order approximation of F(x_i, x_{-i}^k), i.e.,

$$P_i(x_i; x^k) = F(x^k) + \nabla_{x_i} F(x^k)^T (x_i - x_i^k) + \tfrac{1}{2} (x_i - x_i^k)^T \nabla^2_{x_i x_i} F(x^k) (x_i - x_i^k). \qquad (9)$$

When g_i(x_i) ≡ 0, this essentially corresponds to taking a Newton step in minimizing the “reduced” problem min_{x_i ∈ X_i} F(x_i, x_{-i}^k), resulting in

$$\hat{x}_i(x^k, \tau_i) = \operatorname*{argmin}_{x_i \in X_i} \Bigl\{ F(x^k) + \nabla_{x_i} F(x^k)^T (x_i - x_i^k) + \tfrac{1}{2} (x_i - x_i^k)^T \nabla^2_{x_i x_i} F(x^k) (x_i - x_i^k) + \frac{\tau_i}{2} \|x_i - x_i^k\|^2 + g_i(x_i) \Bigr\}. \qquad (10)$$

• Another “intermediate” choice, relying on a specific structure of the objective function that has important applications, is the following. Suppose that F is a sum-utility function, i.e.,

$$F(x) = \sum_{j \in J} f_j(x_i, x_{-i}),$$

for some finite set J. Assume now that for every j ∈ S_i ⊆ J, the function f_j(•, x_{-i}) is convex. Then we may set

$$P_i(x_i; x^k) = \sum_{j \in S_i} f_j(x_i, x_{-i}^k) + \sum_{j \notin S_i} \nabla_{x_i} f_j(x_i^k, x_{-i}^k)^T (x_i - x_i^k),$$

thus preserving, for each i, the favorable convex part of F with respect to x_i while linearizing the nonconvex parts. This is the approach adopted in [21] in the design of multi-user systems, to which we refer for applications in signal processing and communications.
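As referenced in the bullet on (8), for LASSO with scalar blocks the subproblem (8) admits a closed-form solution via soft-thresholding. The following NumPy sketch (vectorized over all scalar blocks) is our own derivation from the optimality condition of (8) with F(x) = ‖Ax − b‖² and Q_i = I:

```python
import numpy as np

def best_response_eq8_lasso(A, b, x, c, tau):
    """Closed-form solution of subproblem (8) for LASSO with scalar blocks:
    x_hat_i = S_{c/d_i}( x_i - 2 * A_i^T (A x - b) / d_i ),  d_i = 2 ||A_i||^2 + tau_i,
    where S_t is soft-thresholding and A_i is the i-th column of A."""
    g = A.T @ (A @ x - b)                    # g_i = A_i^T (A x - b)
    d = 2.0 * np.sum(A * A, axis=0) + tau    # d_i = 2 ||A_i||^2 + tau_i
    v = x - 2.0 * g / d
    return np.sign(v) * np.maximum(np.abs(v) - c / d, 0.0)
```

Under these same scalar-block assumptions, this response can be plugged in place of the linearized one in the earlier Algorithm 1 sketch, exploiting more of the structure of F as suggested in Example #2 and used in Section VI-A.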

The framework described in Algorithm 1 can give rise to very different schemes, according to the choices one makes for the many variable features it contains, some of which have been detailed above. Because of space limitations, we cannot discuss here all possibilities. We provide next just a few instances of possible algorithms that fall in our framework.

Example #1 − (Proximal) Jacobi algorithms for convex functions. Consider the simplest problem falling in our setting: the unconstrained minimization of a continuously differentiable convex function, i.e., assume that F is convex, G ≡ 0, and X = R^n. Although this is possibly the best studied problem in nonlinear optimization, classical parallel methods for this problem [33, Sec. 3.2.4] require very strong contraction conditions. In our framework we can take P_i(x_i; x^k) = F(x_i, x_{-i}^k), resulting in a parallel Jacobi-type method which does not need any additional assumptions. Furthermore, our theory shows that we can even dispense with the convexity assumption and still get convergence of a Jacobi-type method


to a stationary point. If in addition we take S^k = N, we obtain the class of methods studied in [21], [35]–[37].

Example #2 − Parallel coordinate descent methods for LASSO. Consider the LASSO problem, i.e., Problem (1) with F(x) = ‖Ax − b‖², G(x) = c‖x‖_1, and X = R^n. Probably, to date, the most successful class of methods for this problem is that of CDMs, whereby at each iteration a single variable is updated using (7). We can easily obtain a parallel version of this method by taking n_i = 1, S^k = N and still using (7). Alternatively, instead of linearizing F, we can better exploit the structure of F and use (8). In fact, it is well known that in LASSO problems subproblem (8) can be solved analytically (when n_i = 1). We can easily consider similar methods for the group LASSO problem as well (just take n_i > 1).

Example #3 − Parallel coordinate descent methods for Logistic Regression. Consider the Logistic Regression problem, i.e., Problem (1) with F(x) = Σ_{j=1}^m log(1 + e^{−a_j y_j^T x}), G(x) = c‖x‖_1, and X = R^n, where y_j ∈ R^n, a_j ∈ {−1, 1}, and c ∈ R_{++} are given constants. Since F(x_i, x_{-i}^k) is convex, we can take P_i(x_i; x^k) = F(x^k) + ∇_{x_i} F(x^k)^T (x_i − x_i^k) + ½ (x_i − x_i^k)^T ∇^2_{x_i x_i} F(x^k)(x_i − x_i^k), thus obtaining a fully distributed and parallel CDM that uses a second order approximation of the smooth function F. Moreover, by taking n_i = 1 and using a soft-thresholding operator, each x̂_i can be computed in closed form (see the sketch below).
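A sketch of this scalar-block update for logistic regression is given below. It is our own worked-out instance of (9)-(10) with g_i = c|x_i|: the closed form is x̂_i = S_{c/d_i}(x_i^k − ∇_i F(x^k)/d_i) with d_i = [∇²F(x^k)]_{ii} + τ_i; the array names Y (rows y_j^T), a (labels), and tau are illustrative.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def best_response_logreg(Y, a, x, c, tau):
    """Scalar-block best response for sparse logistic regression with the second-order
    approximant (9): x_hat_i = S_{c/d_i}( x_i - grad_i / d_i ), d_i = H_ii + tau_i,
    where grad and H_ii come from F(x) = sum_j log(1 + exp(-a_j y_j^T x))."""
    u = a * (Y @ x)                          # u_j = a_j y_j^T x
    s = sigmoid(-u)                          # sigma(-u_j)
    grad = -Y.T @ (a * s)                    # grad_i = -sum_j a_j y_{j,i} sigma(-u_j)
    H_diag = (Y * Y).T @ (s * (1.0 - s))     # H_ii = sum_j y_{j,i}^2 sigma(u_j) sigma(-u_j)
    d = H_diag + tau
    v = x - grad / d
    return np.sign(v) * np.maximum(np.abs(v) - c / d, 0.0)
```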

Example #4 − Parallel algorithms for dictionary learning problems. Consider the dictionary learning problem, i.e., Problem (1) with F(X) = ‖Y − X_1 X_2‖_F^2 and G(X_2) = c‖X_2‖_1, and X = {X ≜ (X_1, X_2) ∈ R^{n×m} × R^{m×N} : ‖X_1 e_i‖_2 ≤ α_i, ∀ i = 1, . . . , m}. Since F(X_1, X_2) is convex in X_1 and X_2 separately, one can take P_1(X_1; X_2^k) = F(X_1, X_2^k) and P_2(X_2; X_1^k) = F(X_1^k, X_2). Although natural, this choice does not lead to closed form solutions of the subproblems associated with the optimization of X_1 and X_2. This desirable property can be obtained using the following alternative approximations of F [38]: P_1(X_1; X^k) = F(X^k) + ⟨∇_{X_1} F(X^k), X_1 − X_1^k⟩ and P_2(X_2; X^k) = F(X^k) + ⟨∇_{X_2} F(X^k), X_2 − X_2^k⟩, where ⟨A, B⟩ ≜ tr(A^T B). Note that, differently from [38], our algorithm is parallel and more flexible in the choice of the proximal gain terms [cf. (3)]; the resulting block updates are sketched below.
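With the linearized approximants above and Q_i = I, the two block subproblems reduce to a column-wise projection (for X_1, onto the norm balls defining X) and an entry-wise soft-thresholding (for X_2). The following NumPy sketch is our own derivation under those choices; the shapes and the parameters tau1, tau2 are placeholders.

```python
import numpy as np

def dictionary_learning_responses(Y, X1, X2, alpha, c, tau1=1.0, tau2=1.0):
    """Best responses for the two blocks of the dictionary learning problem with the
    linearized approximants P_1, P_2 and Q_i = I:
      X1_hat: projection of X1 - grad1/tau1 onto {||X1 e_i||_2 <= alpha_i for all i},
      X2_hat: soft-thresholding of X2 - grad2/tau2 with threshold c/tau2."""
    R = Y - X1 @ X2                          # residual
    grad1 = -2.0 * R @ X2.T                  # gradient of F w.r.t. X1
    grad2 = -2.0 * X1.T @ R                  # gradient of F w.r.t. X2

    # Block X1: column-wise projection onto the norm balls ||X1 e_i||_2 <= alpha_i
    Z1 = X1 - grad1 / tau1
    col_norms = np.linalg.norm(Z1, axis=0)
    scale = np.minimum(1.0, alpha / np.maximum(col_norms, 1e-12))
    X1_hat = Z1 * scale

    # Block X2: entry-wise soft-thresholding (prox of (c/tau2) * ||.||_1)
    Z2 = X2 - grad2 / tau2
    X2_hat = np.sign(Z2) * np.maximum(np.abs(Z2) - c / tau2, 0.0)
    return X1_hat, X2_hat
```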

V. RELATED WORKS

The proposed algorithmic framework draws on Successive Convex Approximation (SCA) paradigms that have a long history in the optimization literature. Nevertheless, our algorithms and their convergence conditions (cf. Theorems 1 and 2) unify and extend current parallel and sequential SCA methods in several directions, as outlined next.
(Partially) Parallel Methods: The roots of parallel deterministic SCA schemes (wherein all the variables are updated simultaneously) can be traced back at least to the work of Cohen on the so-called auxiliary principle [35], [36] and its related developments, see e.g. [8]–[13], [21], [23], [27], [37], [39], [40]. Roughly speaking these works can be divided in two groups, namely: solution methods for convex objective functions [8], [11], [13], [23], [27], [35], [36] and nonconvex ones [9], [10], [12], [21], [37], [39], [40]. All methods in the former group (and [9], [10], [12], [39], [40]) are (proximal) gradient schemes; they thus share the classical drawbacks of gradient-like schemes (cf. Sec. I); moreover, by replacing the convex function F with its first order approximation, they do not take advantage of any structure of F beyond mere differentiability; exploiting any available structural properties of F, instead, has been shown to enhance (practical) convergence speed, see e.g. [21]. Compared with the second group of works [9], [10], [12], [21], [37], [39], [40], our algorithmic framework improves on their convergence properties while adding great flexibility in the selection of how many variables to update at each iteration. For instance, with the exception of [10], [13], [18]–[20], all the aforementioned works do not allow parallel updates of only a subset of all variables, a feature that instead, fully explored as we do, can dramatically improve the convergence speed of the algorithm, as we show in Section VI. Moreover, with the exception of [37], they all require an Armijo-type line-search, which makes them not appealing for a (parallel) distributed implementation. A scheme in [37] is actually based on diminishing step-size rules, but its convergence properties are quite weak: not all the limit points of the sequence generated by this scheme are guaranteed to be stationary solutions of (1).

Our framework instead i) deals with nonconvex (nonsmooth) problems; ii) allows one to use a much more varied array of approximations for F and also inexact solutions of the subproblems; iii) is fully parallel and distributed (it does not rely on any line-search); and iv) leads to the first distributed convergent schemes based on very general (possibly) partial updating rules of the optimization variables. In fact, among deterministic schemes, we are aware of only three algorithms [10], [13], [23] performing at each iteration a parallel update of only a subset of all the variables. These algorithms however are gradient-like schemes, and do not allow inexact solutions of the subproblems when a closed form solution is not available (in some large-scale problems the cost of computing the exact solution of all the subproblems can be prohibitive). In addition, [10] requires an Armijo-type line-search, whereas [23] and [13] are applicable only to convex objective functions and are not fully parallel. In fact, convergence conditions therein impose a constraint on the maximum number of variables that can be simultaneously updated, a constraint that in many large-scale problems is likely not satisfied.
Sequential Methods: Our framework contains as special cases also sequential updates; it is then interesting to compare our results to sequential schemes too. Given the vast literature on the subject, we consider here only the most recent and general work [14]. In [14] the authors consider the minimization of a possibly nonsmooth function by Gauss-Seidel methods whereby, at each iteration, a single block of variables is updated by minimizing a global upper convex approximation of the function. However, finding such an approximation is generally not an easy task, if not impossible. To cope with this issue, the authors also proposed a variant of their scheme that does not need this requirement but uses an Armijo-type line-search, which however makes the scheme not suitable for a parallel/distributed implementation. Contrary to [14], in our framework conditions on the approximation function (cf.


P1-P3) are trivial to satisfy (in particular, P need not be an upper bound of F), significantly enlarging the class of utility functions V to which the proposed solution method is applicable. Furthermore, our framework gives rise to parallel and distributed methods (no line search is used) whose degree of parallelism can be chosen by the user.

VI. NUMERICAL RESULTS

In this section we provide some numerical results showing solid evidence of the viability of our approach. Our aim is to compare with state-of-the-art methods as well as to test the influence of two key features of the proposed algorithmic framework, namely: parallelism and selective (greedy) choice of (only some) variables to update at each iteration. The tests were carried out on i) LASSO and Logistic Regression problems, two of the most studied (convex) instances of Problem (1); and ii) nonconvex quadratic programming.

All codes have been written in C++. All algebra is performed by using the Intel Math Kernel Library (MKL). The algorithms were tested on the General Compute Cluster of the Center for Computational Research at the SUNY Buffalo. In particular we used a partition composed of 372 DELL 16x2.20GHz Intel E5-2660 “Sandy Bridge” Xeon Processor computer nodes with 128 Gb of DDR4 main memory and QDR InfiniBand 40Gb/s network card. In our experiments, parallel algorithms have been tested using up to 40 parallel processes (8 nodes with 5 cores each), while sequential algorithms ran on a single process (thus using one single core).

A. LASSO problem

We implemented the instance of Algorithm 1 described in Example #2 of the previous section, using the approximating function P_i as in (8). Note that in the case of LASSO problems x̂_i(x^k, τ_i), the unique solution of (8), can be easily computed in closed form using the soft-thresholding operator, see [11].
Tuning of Algorithm 1: In the description of our algorithmic framework we considered fixed values of τ_i, but it is clear that varying them a finite number of times does not affect in any way the theoretical convergence properties of the algorithms. We found that the following choices work well in practice: (i) τ_i are initially all set to τ_i = tr(A^T A)/2n, i.e., to half of the mean of the eigenvalues of ∇²F; (ii) all τ_i are doubled if at a certain iteration the objective function does not decrease; and (iii) they are all halved if the objective function decreases for ten consecutive iterations or the relative error on the objective function re(x) is sufficiently small, specifically if

$$\mathrm{re}(x) \,\triangleq\, \frac{V(x) - V^*}{V^*} \,\le\, 10^{-2}, \qquad (11)$$

where V^* is the optimal value of the objective function V (in our experiments on LASSO V^* is known). In order to avoid increments in the objective function, whenever all τ_i are doubled, the associated iteration is discarded, and in (S.4) of Algorithm 1 it is set x^{k+1} = x^k. In any case we limited the number of possible updates of the values of τ_i to 100 (a sketch of this heuristic is given below).
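A compact sketch of the τ adaptation heuristic just described. The bookkeeping (streak counter, update counter, discard flag) is our own way of encoding rules (ii)-(iii) above; it is illustrative pseudologic, not the authors' code.

```python
def update_tau(tau, V_new, V_old, decrease_streak, n_updates, rel_err, max_updates=100):
    """Heuristic tuning of the proximal weights tau_i (all scaled together):
    double them when the objective increases (and discard the iterate); halve
    them after ten consecutive decreases or once the relative error (11) is
    below 1e-2; stop adapting after max_updates changes."""
    discard = False
    if n_updates >= max_updates:
        return tau, decrease_streak, n_updates, discard
    if V_new > V_old:                              # objective increased
        tau, decrease_streak, n_updates, discard = 2.0 * tau, 0, n_updates + 1, True
    else:
        decrease_streak += 1
        if decrease_streak >= 10 or rel_err <= 1e-2:
            tau, decrease_streak, n_updates = 0.5 * tau, 0, n_updates + 1
    return tau, decrease_streak, n_updates, discard
```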

The step-size γ^k is updated according to the following rule:

$$\gamma^k = \gamma^{k-1} \Bigl( 1 - \min\Bigl\{ 1, \frac{10^{-4}}{\mathrm{re}(x^k)} \Bigr\} \, \theta \, \gamma^{k-1} \Bigr), \qquad k = 1, \ldots, \qquad (12)$$

with γ^0 = 0.9 and θ = 1e−7. The above diminishing rule is based on (6) while guaranteeing that γ^k does not become too close to zero before the relative error is sufficiently small.

Finally, the error bound function is chosen as E_i(x^k) = ‖x̂_i(x^k, τ_i) − x_i^k‖, and S^k in Step 2 of the algorithm is set to

$$S^k = \bigl\{ i : E_i(x^k) \ge \sigma M^k \bigr\}.$$

In our tests we consider two options for σ, namely: i) σ = 0, which leads to a fully parallel scheme wherein at each iteration all variables are updated; and ii) σ = 0.5, which corresponds to updating only a subset of all the variables at each iteration. Note that for both choices of σ, the resulting set S^k satisfies the requirement in (S.2) of Algorithm 1; indeed, S^k always contains the index i corresponding to the largest E_i(x^k). We term the above instance of Algorithm 1 FLEXible parallel Algorithm (FLEXA), and we will refer to these two versions as FLEXA σ = 0 and FLEXA σ = 0.5. Note that both versions satisfy the conditions of Theorem 1, and thus are convergent.

Algorithms in the literature: We compared our versions of FLEXA with the most competitive distributed and sequential algorithms proposed in the literature to solve the LASSO problem. More specifically, we consider the following schemes.
• FISTA: The Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) proposed in [11] is a first order method and can be regarded as the benchmark algorithm for LASSO problems. Building on the separability of the terms in the objective function V, this method can be easily parallelized and thus take advantage of a parallel architecture. We implemented the parallel version that uses a backtracking procedure to estimate the Lipschitz constant L_F of ∇F [11].
• SpaRSA: This is the first order method proposed in [12]; it is a popular spectral projected gradient method that uses a spectral step length together with a nonmonotone line search to enhance convergence. Also this method can be easily parallelized, which is the version implemented in our tests. In all the experiments we set the parameters of SpaRSA as in [12]: M = 5, σ = 0.01, α_max = 1e30, and α_min = 1e−30.
• GRock & Greedy-1BCD: GRock is a parallel algorithm proposed in [13] that performs well on sparse LASSO problems. We tested the instance of GRock where the number of variables simultaneously updated is equal to the number of the parallel processors. It is important to remark that the theoretical convergence properties of GRock are in jeopardy as the number of variables updated in parallel increases; roughly speaking, GRock is guaranteed to converge if the columns of the data matrix A in the LASSO problem are “almost” orthogonal, a feature that is not satisfied in many applications. A special instance with guaranteed convergence is the one where only one block at a time (chosen in a greedy fashion) is updated; we refer to this special case as greedy-1BCD.
• ADMM: This is a classical Alternating Direction Method of Multipliers (ADMM). We implemented the parallel version as proposed in [41].

In the implementation of the parallel algorithms, the data matrix A of the LASSO problem is generated in a uniform column-block distributed manner. More specifically, each processor generates a slice of the matrix itself such that the overall one can be reconstructed as A = [A_1 A_2 · · · A_P], where P is the number of parallel processors, and A_i has n/P columns for each i. Thus the computation of each product Ax (which is required to evaluate ∇F) and of the norm ‖x‖_1 (that is, G) is divided into the parallel jobs of computing A_i x_i and ‖x_i‖_1, followed by a reducing operation (see the sketch below).
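The column-block split and the reduction step can be illustrated with the following serial NumPy sketch. In the actual implementation each slice lives on a different process and the sums are MPI-style reductions; here everything is simulated in a single process for clarity.

```python
import numpy as np

def distributed_Ax_and_l1(A, x, P):
    """Simulates the column-block distributed evaluation used for LASSO:
    A = [A_1 ... A_P] and x = (x_1, ..., x_P) are sliced by columns/blocks,
    each 'processor' computes A_p x_p and ||x_p||_1, and a reduce step sums them."""
    A_blocks = np.array_split(A, P, axis=1)       # A_p: the columns owned by processor p
    x_blocks = np.array_split(x, P)               # x_p: the corresponding variables
    Ax = sum(A_p @ x_p for A_p, x_p in zip(A_blocks, x_blocks))   # reduce over p
    l1 = sum(np.abs(x_p).sum() for x_p in x_blocks)               # reduce over p
    return Ax, l1
```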

Numerical Tests: We generated six groups of LASSO problems using the random generator proposed by Nesterov [9], which permits to control the sparsity of the solution. For the first five groups, we considered problems with 10,000 variables and matrix A having 9,000 rows. The five groups differ in the degree of sparsity of the solution; more specifically, the percentage of nonzeros in the solution is 1%, 10%, 20%, 30%, and 40%, respectively. The last group is formed by instances with 100,000 variables and 5,000 rows for A, and solutions having 1% of nonzero variables. In all experiments and for all the algorithms, the initial point was set to the zero vector.

Results of our experiments for the 10,000-variable groups are reported in Fig. 1, where we plot the relative error as defined in (11) versus the CPU time; all the curves are obtained using 40 cores, and averaged over ten independent random realizations. Note that the CPU time includes communication times (for distributed algorithms) and the initial time needed by the methods to perform all pre-iteration computations (this explains why the curves of ADMM start after the others; in fact, ADMM requires some nontrivial initializations). For one instance, the one corresponding to 1% sparsity of the solution, we also plot the relative error versus iterations [Fig. 1(a2)]; similar behaviors of the algorithms have been observed also for the other instances, and thus are not reported. Results for the LASSO instance with 100,000 variables are plotted in Fig. 2. The curves are averaged over five random realizations.

Given Figs. 1 and 2, the following comments are in order. On all the tested problems, FLEXA σ = 0.5 outperforms in a consistent manner all the other implemented algorithms. In particular, as the sparsity of the solution decreases, the problems become harder and the selective update operated by FLEXA (σ = 0.5) improves over FLEXA (σ = 0), where instead all variables are updated at each iteration. FISTA is capable of reaching low accuracy relatively fast when the solution is not too sparse, but has difficulties in reaching high accuracy. SpaRSA seems to be very insensitive to the degree of sparsity of the solution; it behaves well on the 10,000-variable problems with not-too-sparse solutions, but is much less effective on very large-scale problems. The version of GRock with P = 40 is the closest match to FLEXA, but only when the problems are very sparse (and it is not supported by a convergence theory on our test problems). This is consistent with the fact that its convergence properties are at stake when the problems are quite dense. Furthermore, if the problem is very large, updating only 40 variables at each iteration, as GRock does, could slow down the convergence, especially when the optimal solution is not very sparse. From this point of view, FLEXA σ = 0.5 seems to strike a good balance between not updating variables that are probably zero at the optimum and nevertheless updating a sizeable number of variables when needed in order to enhance convergence.

Fig. 1: LASSO with 10,000 variables; relative error vs. time (in seconds) for: (a1) 1% non zeros, (b) 10% non zeros, (c) 20% non zeros, (d) 30% non zeros, (e) 40% non zeros; (a2) relative error vs. iterations for 1% non zeros.

Fig. 2: LASSO with 10^5 variables; relative error vs. time for: (a) 8 cores, (b) 20 cores.

Remark 5 (On parallelism). Fig. 2 shows that FLEXA seems to exploit well the available parallelism on LASSO problems. Indeed, when passing from 8 to 20 cores, the running time approximately halves. This kind of behavior has been consistently observed also for smaller problems and different numbers of cores; because of space limitations, we do not report these experiments. Note that the practical speed-up due to the use of a parallel architecture depends on several factors that are not easily predictable and are very dependent on the specific problem at hand, including communication times among the cores, the data format, etc. In this paper we do not pursue a theoretical study of the speed-up, which, given the generality of our framework, seems a challenging goal.


We finally observe that GRock appears to improve greatly with the number of cores. This is due to the fact that in GRock the maximum number of variables that is updated in parallel is exactly equal to the number of cores (i.e., the degree of parallelism), and this might become a serious drawback on very large problems (on top of the fact that convergence is in jeopardy). On the contrary, our theory permits the parallel update of any number of variables while guaranteeing convergence.
Remark 6 (On selective updates). It is interesting to comment on why FLEXA σ = 0.5 behaves better than FLEXA σ = 0. To understand the reason behind this apparently counterintuitive phenomenon, we first note that Algorithm 1 has the remarkable capability of identifying those variables that will be zero at a solution; because of lack of space, we do not provide here the proof of this statement but only an informal description. Roughly speaking, it can be shown that, for k large enough, those variables that are zero in x(x^k, τ) will be zero also in a limiting solution x̄. Therefore, suppose that k is large enough so that this identification property already takes place (we will say that "we are in the identification phase") and consider an index i such that x̄_i = 0. Then, if x^k_i is zero, it is clear, by Steps 3 and 4, that x^{k'}_i will be zero for all indices k' > k, independently of whether i belongs to S^k or not. In other words, if a variable that is zero at the solution is already zero when the algorithm enters the identification phase, that variable will be zero in all subsequent iterations; this fact, intuitively, should enhance the convergence speed of the algorithm. Conversely, if when we enter the identification phase x^k_i is not zero, the algorithm will have to bring it back to zero iteratively. It should then be clear why updating only the variables that we have "strong" reason to believe will be nonzero at a solution is a better strategy than updating them all. Of course, there may be a problem dependence, and the best value of σ can vary from problem to problem. But we believe that the explanation outlined above gives firm theoretical ground to the idea that it might be wise to "waste" some calculations and perform only a partial update of the variables. □
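As a concrete illustration of this identification mechanism, consider the hypothetical scalar best response below, corresponding to the simple choice of Pi given by the linearization of F plus a proximal term (only one of the approximants allowed by the framework).

```python
import numpy as np

def scalar_best_response(x_i, grad_i, tau_i, lam):
    # Illustrative scalar subproblem (4) for the LASSO case when P_i is the
    # linearization of F plus the proximal term (tau_i/2)*(x_i - x_i^k)^2:
    # the best response reduces to a soft-thresholding step. This specific
    # choice of P_i is an assumption made only for this example.
    v = x_i - grad_i / tau_i
    return np.sign(v) * max(abs(v) - lam / tau_i, 0.0)

# Once in the identification phase, a coordinate sitting at zero with a small
# partial gradient is mapped to zero again, so it never needs to be updated:
print(scalar_best_response(0.0, 0.3, 1.0, 1.0))   # -> 0.0
```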

B. Logistic regression problems

The logistic regression problem is described in Example #3 (cf. Section III) and is a highly nonlinear problem involving many exponentials that, notoriously, give rise to numerical difficulties. Because of these high nonlinearities, a Gauss-Seidel approach is expected to be more effective than a pure Jacobi method, a fact that was confirmed by our preliminary tests. For this reason, for the logistic regression problem we also tested an instance of Algorithm 3; we term it GJ-FLEXA. The setting of the free parameters in GJ-FLEXA is essentially the same as described for LASSO, but with the following differences:
(a) The approximant Pi is chosen as the second order approximation of the original function F (Ex. #3, Sec. III);
(b) The initial τi are set to tr(YTY)/2n for all i, where n is the total number of variables and Y = [y1 y2 · · · ym]T;
(c) Since the optimal value V∗ is not known for the logistic regression problem, we no longer use re(x) as merit function but ‖Z(x)‖∞, with Z(x) = ∇F(x) − Π[−c,c]n(∇F(x) − x). Here the projection Π[−c,c]n(z) can be efficiently computed; it acts component-wise on z, since [−c, c]n = [−c, c] × · · · × [−c, c]. Note that Z(x) is a valid optimality measure; indeed, Z(x) = 0 is equivalent to the standard necessary optimality condition for Problem (1), see [6]. Therefore, whenever re(x) was used for the LASSO problems, we now use ‖Z(x)‖∞ [including in the step-size rule (12)].
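Since the projection acts component-wise, ‖Z(x)‖∞ can be evaluated with a single clipping operation; a minimal sketch (assuming the gradient of F at x has already been computed) is the following.

```python
import numpy as np

def merit_logistic(grad_F, x, c):
    # ||Z(x)||_inf with Z(x) = grad F(x) - Proj_{[-c,c]^n}(grad F(x) - x);
    # the projection onto the box [-c,c]^n is a componentwise clip.
    Z = grad_F - np.clip(grad_F - x, -c, c)
    return np.abs(Z).max()
```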

We tested the algorithms on three instances of the logistic regression problem that are widely used in the literature, and whose essential data features are given in Table I; we downloaded the data from the LIBSVM repository http://www.csie.ntu.edu.tw/∼cjlin/libsvm/, which we refer to for a detailed description of the test problems. In our implementation, the matrix Y is stored in a column-block distributed manner, Y = [Y1 Y2 · · · YP], where P is the number of parallel processors. We compared FLEXA (σ = 0.5) and GJ-FLEXA with the other parallel algorithms (whose tuning of the free parameters is the same as in Fig. 1 and Fig. 2), namely: FISTA, SpaRSA, and GRock. For the logistic regression problem, we also tested one more algorithm, which we call CDM. This Coordinate Descent Method is an extremely efficient Gauss-Seidel-type method (customized for logistic regression), and is part of the LIBLINEAR package available at http://www.csie.ntu.edu.tw/∼cjlin/.

In Fig. 3, we plot the relative error vs. the CPU time (the latter defined as in Fig. 1 and Fig. 2) achieved by the aforementioned algorithms for the three datasets, using a different number of cores, namely 8, 16, 20, and 40; for each algorithm but GJ-FLEXA we report only the best performance over the aforementioned numbers of cores. Note that in order to plot the relative error, we had to preliminarily estimate V∗ (which is not known for logistic regression problems). To do so, we ran GJ-FLEXA until the merit function value ‖Z(xk)‖∞ went below 10−7, and used the corresponding value of the objective function as an estimate of V∗. We remark that we used this value only to plot the curves. Next to each plot, we also report the overall FLOPS counted up to reaching the relative errors indicated in the table. Note that the FLOPS of GRock on real-sim and rcv1 are those counted in 24 hours of simulation time; when terminated, the algorithm had achieved a relative error that was still very far from the reference values set in our experiment. Specifically, GRock reached 1.16 (instead of 1e−4) on real-sim and 0.58 (instead of 1e−3) on rcv1; the FLOPS counted up to those error values are still reported in the tables.

Data set           m        n       c
gisette (scaled)   6000     5000    0.25
real-sim           72309    20958   4
rcv1               677399   47236   4

TABLE I: Data sets for logistic regression tests

The analysis of the figures shows that, due to the high nonlinearities of the objective function, the better performing methods are the Gauss-Seidel-type ones. In spite of this, FLEXA still behaves quite well. But GJ-FLEXA with one core, thus a nonparallel method, clearly outperforms all the other algorithms.


[Fig. 3 plots: relative error vs. time for (i) gisette (6000×5000), (ii) real-sim (72309×20958), and (iii) rcv1; curves for GJ-FLEXA (with different numbers of cores), FLEXA σ = 0.5, CDM, SpaRSA, FISTA, and GRock (P = 20).]

gisette (FLOPS up to relative error 1e-2 / 1e-6):
GJ-FLEXA (1 core)        1.30e+10 / 1.23e+11
GJ-FLEXA (20 cores)      5.18e+11 / 5.07e+12
FLEXA σ=0.5 (20 cores)   1.88e+12 / 4.06e+13
CDM                      2.15e+10 / 1.68e+11
SpaRSA (20 cores)        2.20e+12 / 5.37e+13
FISTA (20 cores)         3.99e+12 / 5.66e+13
GRock (20 cores)         7.18e+12 / 1.81e+14

real-sim (FLOPS up to relative error 1e-4 / 1e-6):
GJ-FLEXA (1 core)        2.76e+9 / 6.60e+9
GJ-FLEXA (20 cores)      9.83e+10 / 2.85e+11
FLEXA σ=0.5 (20 cores)   3.54e+10 / 4.69e+11
CDM                      4.43e+9 / 2.18e+10
SpaRSA (20 cores)        7.18e+9 / 1.94e+11
FISTA (20 cores)         3.91e+10 / 1.56e+11
GRock (20 cores)         8.30e+14 (after 24h)

rcv1 (FLOPS up to relative error 1e-3 / 1e-6):
GJ-FLEXA (1 core)        3.61e+10 / 2.43e+11
GJ-FLEXA (20 cores)      1.35e+12 / 6.22e+12
FLEXA σ=0.5 (20 cores)   8.53e+11 / 7.19e+12
CDM                      5.60e+10 / 6.00e+11
SpaRSA (20 cores)        9.38e+12 / 7.20e+13
FISTA (20 cores)         2.58e+12 / 2.76e+13
GRock (20 cores)         1.72e+15 (after 24h)

Fig. 3: Logistic regression problems: relative error vs. time (in seconds) and FLOPS for i) gisette, ii) real-sim, and iii) rcv1.

The explanation can be the following. GJ-FLEXA with one core is essentially a Gauss-Seidel-type method, but with two key differences: the use of a stepsize and, more importantly, a (greedy) selection rule by which only some variables are updated at each round. As the number of cores increases, the algorithm gets "closer and closer" to a Jacobi-type method, and because of the high nonlinearities, moving along a "Jacobi direction" does not bring improvements. In conclusion, for logistic regression problems, our experiments suggest that, while the (opportunistic) selection of the variables to update seems useful and brings improvements even in comparison with the extremely efficient, dedicated CDM algorithm/software, parallelism (at least in the form embedded in our scheme) does not appear to be beneficial, unlike what was observed for LASSO problems.

C. Nonconvex quadratic problems

We now consider a nonconvex instance of Problem (1); because of space limitations, we present some preliminary results only on nonconvex quadratic problems; a more detailed analysis of the behavior of our method in the nonconvex case is an important topic that is the subject of a forthcoming paper. Nevertheless, the current results suggest that FLEXA's good behavior on convex problems extends to the nonconvex setting.

Consider the following nonconvex instance of Problem (1):

min_x  F(x) + G(x),   with F(x) ≜ ‖Ax − b‖² − c‖x‖² and G(x) ≜ c̄‖x‖₁,
s.t.  −b ≤ x_i ≤ b,  ∀ i = 1, . . . , n,        (13)

where c is a positive constant chosen so that F(x) is no longer convex. We simulated two instances of (13), namely: 1) A ∈ R^(9000×10000) generated using Nesterov's model (as in the LASSO problems in Sec. VI-A), with 1% of nonzeros in the solution, b = 1, c̄ = 100, and c = 1000; and 2) A ∈ R^(9000×10000) as in 1) but with 10% sparsity, b = 0.1, c̄ = 100, and c = 2800. Note that the Hessian of F in the two aforementioned cases has the same eigenvalues as the Hessian of ‖Ax − b‖² in the original LASSO problem, but translated to the left by 2c. In particular, F in 1) and 2) has (many) minimum eigenvalues equal to −2000 and −5600, respectively; therefore, the objective function V in (13) is (markedly) nonconvex. Since V is now unbounded from below by construction, we added the box constraints in (13).
Tuning of Algorithm 1: We ran FLEXA using the same tuning as for the LASSO problems in Sec. VI-A, but adding the extra condition τi > c for all i, so that the resulting one-dimensional subproblems (4) are convex and can be solved in closed form (as for LASSO). As termination criterion we used the merit function ‖Z̄(x)‖∞, with Z̄(x) ≜ [Z̄1(x), . . . , Z̄n(x)] and

Z̄i(x) ≜ { 0 if Zi(x) ≤ 0 and xi = b;  0 if Zi(x) ≥ 0 and xi = −b;  Zi(x) otherwise },

where Z(x) ≜ ∇F(x) − Π[−b,b]n(∇F(x) − x) and Zi(x) denotes the i-th component of Z(x). We stopped the iterations when ‖Z̄(x)‖∞ ≤ 1e−3.
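For completeness, a minimal sketch of this termination measure is reported below (the gradient of F at x is assumed given; exact floating-point comparison with the bounds is used only for illustration).

```python
import numpy as np

def merit_box(grad_F, x, b):
    # ||Z_bar(x)||_inf for problem (13): start from
    # Z(x) = grad F(x) - Proj_{[-b,b]^n}(grad F(x) - x) and zero out the
    # components whose sign is consistent with an active bound.
    Z = grad_F - np.clip(grad_F - x, -b, b)
    Zbar = np.where(((Z <= 0) & (x == b)) | ((Z >= 0) & (x == -b)), 0.0, Z)
    return np.abs(Zbar).max()
```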

We compared FLEXA with FISTA and SpaRSA. Note that among all the algorithms considered in the previous sections, only SpaRSA has convergence guarantees for nonconvex problems; but we also added FISTA to the comparison because it seems to perform well in practice and because of its benchmark status in the convex case. In our tests, the three algorithms always converged to the same (stationary) point. The computed stationary solutions of class 1) of simulated problems have approximately 1% of nonzeros and 0.1% of variables on the bounds, whereas those of class 2) have 3% of nonzeros and 0.3% of variables on the bounds. Results of our experiments for the 1% sparsity problem are reported in Fig. 4 and those for the 10% one in Fig. 5; we plot the relative error as defined in (11) and the merit value versus the CPU time; all the curves are obtained using 20 cores and averaged over 10 independent random realizations. The CPU time includes communication times and the initial time needed by the methods to perform all pre-iteration computations.

Fig. 4: Nonconvex problem (13) with 1% solution sparsity: relative error vs. time (in seconds), and merit value vs. time (in seconds).


Fig. 5: Nonconvex problem (13) with 10% solution sparsity: relative error vs. time (in seconds), and merit value vs. time (in seconds).

These preliminary tests seem to indicate that FLEXA performs well also on nonconvex problems.

VII. CONCLUSIONS

We proposed a general algorithmic framework for the minimization of the sum of a possibly nonconvex differentiable function and a possibly nonsmooth but block-separable convex one. Quite remarkably, our framework leads to different new algorithms whose degree of parallelism can be chosen by the user, who also has complete control on the number of variables to update at each iteration; all the algorithms converge under the same conditions. Many well-known sequential and simultaneous solution methods in the literature are just special cases of our algorithmic framework. It is remarkable that our scheme encompasses many different practical realizations and that this flexibility can be used to easily cope with different classes of problems that bring different challenges. Our preliminary tests are very promising, showing that our schemes can outperform state-of-the-art algorithms. Experiments on larger and more varied classes of problems (including those listed in Sec. II) are the subject of current research. Among the topics that should be further studied, one key issue is the right initial choice and subsequent tuning of the τi. These parameters to a certain extent determine the length of the step at each iteration and play a crucial role in establishing the quality of the results.

VIII. ACKNOWLEDGMENTS

The authors are very grateful to Loris Cannelli and Paolo Scarponi for their invaluable help in developing the C++ code of the simulated algorithms.

The work of Facchinei was supported by the MIUR project PLATINO (Grant Agreement n. PON01_01007). The work of Scutari was supported by the USA National Science Foundation under Grants CMS 1218717 and CAREER Award No. 1254739.

APPENDIX

We first introduce some preliminary results instrumental to prove both Theorem 1 and Theorem 2. Hereafter, for notational simplicity, we will omit the dependence of x(y, τ) on τ and write x(y). Given S ⊆ N and x ≜ (x_i)_{i=1}^N, we will also denote by (x)_S (or interchangeably x_S) the vector whose component i is equal to x_i if i ∈ S, and zero otherwise.

A. Intermediate results

Lemma 7. Let H(x; y) ≜ ∑_i h_i(x_i; y). Then, the following hold:
(i) H(•; y) is uniformly strongly convex on X with constant c_τ > 0, i.e.,

(x − w)^T (∇_x H(x; y) − ∇_x H(w; y)) ≥ c_τ ‖x − w‖²,   (14)

for all x, w ∈ X and given y ∈ X;
(ii) ∇_x H(x; •) is uniformly Lipschitz continuous on X, i.e., there exists 0 < L_∇H < ∞ independent of x such that

‖∇_x H(x; y) − ∇_x H(x; w)‖ ≤ L_∇H ‖y − w‖,   (15)

for all y, w ∈ X and given x ∈ X.

Proposition 8. Consider Problem (1) under A1-A6. Then the mapping X ∋ y ↦ x(y) has the following properties:
(a) x(•) is Lipschitz continuous on X, i.e., there exists a positive constant L such that

‖x(y) − x(z)‖ ≤ L ‖y − z‖,   ∀ y, z ∈ X;   (16)

(b) the set of fixed points of x(•) coincides with the set of stationary solutions of Problem (1); therefore x(•) has a fixed point;
(c) for every given y ∈ X and for any set S ⊆ N,

(x(y) − y)_S^T ∇_x F(y)_S + ∑_{i∈S} g_i(x_i(y)) − ∑_{i∈S} g_i(y_i) ≤ −c_τ ‖(x(y) − y)_S‖²,   (17)

with c_τ ≜ q min_i τ_i.

Proof: We prove the proposition in the following order: (c), (a), (b).
(c): Given y ∈ X, by definition, each x_i(y) is the unique solution of problem (4); then it is not difficult to see that the following holds: for all z_i ∈ X_i,

(z_i − x_i(y))^T ∇_{x_i} h_i(x_i(y); y) + g_i(z_i) − g_i(x_i(y)) ≥ 0.   (18)

Summing and subtracting ∇_{x_i} P_i(y_i; y) in (18), choosing z_i = y_i, and using P2, we get

(y_i − x_i(y))^T (∇_{x_i} P_i(x_i(y); y) − ∇_{x_i} P_i(y_i; y))
  + (y_i − x_i(y))^T ∇_{x_i} F(y) + g_i(y_i) − g_i(x_i(y))
  − τ_i (x_i(y) − y_i)^T Q_i(y) (x_i(y) − y_i) ≥ 0,   (19)

for all i ∈ N. Observing that the term on the first line of (19) is nonpositive and using P1, we obtain

(y_i − x_i(y))^T ∇_{x_i} F(y) + g_i(y_i) − g_i(x_i(y)) ≥ c_τ ‖x_i(y) − y_i‖²,

for all i ∈ N. Summing over i ∈ S we get (17).
(a): We use the notation introduced in Lemma 7. Given y, z ∈ X, by optimality and (18), we have, for all v and w in X,

(v − x(y))^T ∇_x H(x(y); y) + G(v) − G(x(y)) ≥ 0,
(w − x(z))^T ∇_x H(x(z); z) + G(w) − G(x(z)) ≥ 0.

Setting v = x(z) and w = x(y), summing the two inequalities above, and adding and subtracting ∇_x H(x(y); z), we obtain

(x(z) − x(y))^T (∇_x H(x(z); z) − ∇_x H(x(y); z)) ≤ (x(y) − x(z))^T (∇_x H(x(y); z) − ∇_x H(x(y); y)).   (20)

Using (14) we can lower bound the left-hand side of (20) as

(x(z) − x(y))^T (∇_x H(x(z); z) − ∇_x H(x(y); z)) ≥ c_τ ‖x(z) − x(y)‖²,   (21)

whereas the right-hand side of (20) can be upper bounded as

(x(y) − x(z))^T (∇_x H(x(y); z) − ∇_x H(x(y); y)) ≤ L_∇H ‖x(y) − x(z)‖ ‖y − z‖,   (22)

where the inequality follows from the Cauchy-Schwarz inequality and (15). Combining (20), (21), and (22), we obtain the desired Lipschitz property of x(•).

(b): Let x⋆ ∈ X be a fixed point of x(•), that is, x⋆ = x(x⋆). Each x_i(y) satisfies (18) for any given y ∈ X. For some ξ_i ∈ ∂g_i(x⋆_i), setting y = x⋆ and using x⋆ = x(x⋆) and the convexity of g_i, (18) reduces to

(z_i − x⋆_i)^T (∇_{x_i} F(x⋆) + ξ_i) ≥ 0,   (23)

for all z_i ∈ X_i and i ∈ N. Taking into account the Cartesian structure of X, the separability of G, and summing (23) over i ∈ N, we obtain (z − x⋆)^T (∇_x F(x⋆) + ξ) ≥ 0 for all z ∈ X, with z ≜ (z_i)_{i=1}^N and ξ ≜ (ξ_i)_{i=1}^N ∈ ∂G(x⋆); therefore x⋆ is a stationary solution of (1).
The converse holds because i) x(x⋆) is the unique optimal solution of (4) with y = x⋆, and ii) x⋆ is also an optimal solution of (4), since it satisfies the minimum principle.

Lemma 9 ([42, Lemma 3.4, p. 121]). Let {X^k}, {Y^k}, and {Z^k} be three sequences of numbers such that Y^k ≥ 0 for all k. Suppose that

X^{k+1} ≤ X^k − Y^k + Z^k,   ∀ k = 0, 1, . . . ,

and ∑_{k=0}^∞ Z^k < ∞. Then either X^k → −∞ or else {X^k} converges to a finite value and ∑_{k=0}^∞ Y^k < ∞.

Lemma 10. Let {x^k} be the sequence generated by Algorithm 1. Then, there is a positive constant c such that the following holds: for all k ≥ 1,

(∇_x F(x^k))_{S^k}^T (x(x^k) − x^k)_{S^k} + ∑_{i∈S^k} g_i(x_i(x^k)) − ∑_{i∈S^k} g_i(x^k_i) ≤ −c ‖x(x^k) − x^k‖².   (24)

Proof: Let j_k be an index in S^k such that E_{j_k}(x^k) ≥ ρ max_i E_i(x^k) (Step 2 of Algorithm 1). Then, using the aforementioned bound and (5), it is easy to check that the following chain of inequalities holds:

s_{j_k} ‖x_{S^k}(x^k) − x^k_{S^k}‖ ≥ s_{j_k} ‖x_{j_k}(x^k) − x^k_{j_k}‖ ≥ E_{j_k}(x^k) ≥ ρ max_i E_i(x^k)
  ≥ (ρ min_i s_i) (max_i ‖x_i(x^k) − x^k_i‖) ≥ (ρ min_i s_i / N) ‖x(x^k) − x^k‖.

Hence we have, for any k,

‖x_{S^k}(x^k) − x^k_{S^k}‖ ≥ (ρ min_i s_i / (N s_{j_k})) ‖x(x^k) − x^k‖.   (25)

Invoking now Proposition 8(c) with S = S^k and y = x^k, and using (25), (24) holds true, with c ≜ c_τ (ρ min_i s_i / (N max_j s_j))².

B. Proof of Theorem 1

We are now ready to prove the theorem. For any given k ≥ 0, the Descent Lemma [33] yields

F(x^{k+1}) ≤ F(x^k) + γ^k ∇_x F(x^k)^T (ẑ^k − x^k) + ((γ^k)² L_∇F / 2) ‖ẑ^k − x^k‖²,   (26)

with ẑ^k ≜ (ẑ^k_i)_{i=1}^N and z^k ≜ (z^k_i)_{i=1}^N defined in Steps 2 and 3 of Algorithm 1. Observe that

‖ẑ^k − x^k‖² ≤ ‖z^k − x^k‖² ≤ 2 ‖x(x^k) − x^k‖² + 2 ∑_{i∈N} ‖z^k_i − x_i(x^k)‖² ≤ 2 ‖x(x^k) − x^k‖² + 2 ∑_{i∈N} (ε^k_i)²,   (27)

where the first inequality follows from the definition of ẑ^k and z^k, and in the last inequality we used ‖z^k_i − x_i(x^k)‖ ≤ ε^k_i.

Denoting by S̄^k the complement of S^k, we also have, for all k,

∇_x F(x^k)^T (ẑ^k − x^k) = ∇_x F(x^k)^T (ẑ^k − x(x^k) + x(x^k) − x^k)
  = ∇_x F(x^k)^T_{S^k} (z^k − x(x^k))_{S^k} + ∇_x F(x^k)^T_{S̄^k} (x^k − x(x^k))_{S̄^k}
    + ∇_x F(x^k)^T_{S^k} (x(x^k) − x^k)_{S^k} + ∇_x F(x^k)^T_{S̄^k} (x(x^k) − x^k)_{S̄^k}
  = ∇_x F(x^k)^T_{S^k} (z^k − x(x^k))_{S^k} + ∇_x F(x^k)^T_{S^k} (x(x^k) − x^k)_{S^k},   (28)

where in the second equality we used the definition of ẑ^k and of the set S^k. Now, using (28) and Lemma 10, we can write

∇_x F(x^k)^T (ẑ^k − x^k) + ∑_{i∈S^k} g_i(z^k_i) − ∑_{i∈S^k} g_i(x^k_i)
  = ∇_x F(x^k)^T (ẑ^k − x^k) + ∑_{i∈S^k} g_i(x_i(x^k)) − ∑_{i∈S^k} g_i(x^k_i) + ∑_{i∈S^k} g_i(z^k_i) − ∑_{i∈S^k} g_i(x_i(x^k))   (29)
  ≤ −c ‖x(x^k) − x^k‖² + ∑_{i∈S^k} ε^k_i ‖∇_{x_i} F(x^k)‖ + L_G ∑_{i∈S^k} ε^k_i,   (30)

where L_G is a (global) Lipschitz constant for (all) g_i. Finally, from the definition of ẑ^k and of the set S^k, we have, for all k,

V(x^{k+1}) = F(x^{k+1}) + ∑_{i∈N} g_i(x^{k+1}_i)
  = F(x^{k+1}) + ∑_{i∈N} g_i(x^k_i + γ^k (ẑ^k_i − x^k_i))
  ≤ F(x^{k+1}) + ∑_{i∈N} g_i(x^k_i) + γ^k (∑_{i∈S^k} (g_i(z^k_i) − g_i(x^k_i)))
  ≤ V(x^k) − γ^k (c − γ^k L_∇F) ‖x(x^k) − x^k‖² + T^k,   (31)

where in the first inequality we used the convexity of the g_i's, whereas the second one follows from (26), (27), and (30), with

T^k ≜ γ^k ∑_{i∈S^k} ε^k_i (L_G + ‖∇_{x_i} F(x^k)‖) + (γ^k)² L_∇F ∑_{i∈N} (ε^k_i)².

Using assumption (iv), we can bound T^k as

T^k ≤ (γ^k)² [N α_1 (α_2 L_G + 1) + (γ^k)² L_∇F (N α_1 α_2)²],

which, by assumption (iii), implies ∑_{k=0}^∞ T^k < ∞. Since γ^k → 0, it follows from (31) that there exist some positive constant β_1 and a sufficiently large k, say k̄, such that

V(x^{k+1}) ≤ V(x^k) − γ^k β_1 ‖x(x^k) − x^k‖² + T^k,   (32)

for all k ≥ k̄. Invoking Lemma 9 with the identifications X^k = V(x^{k+1}), Y^k = γ^k β_1 ‖x(x^k) − x^k‖², and Z^k = T^k, while using ∑_{k=0}^∞ T^k < ∞, we deduce from (32) that either {V(x^k)} → −∞ or else {V(x^k)} converges to a finite value and

lim_{k→∞} ∑_{t=k̄}^{k} γ^t ‖x(x^t) − x^t‖² < +∞.   (33)

Since V is coercive, V(x) ≥ min_{y∈X} V(y) > −∞, implying that {V(x^k)} is convergent; it follows from (33) and ∑_{k=0}^∞ γ^k = ∞ that lim inf_{k→∞} ‖x(x^k) − x^k‖ = 0.

Using Proposition 8, we show next that lim_{k→∞} ‖x(x^k) − x^k‖ = 0; for notational simplicity we will write Δx(x^k) ≜ x(x^k) − x^k. Suppose, by contradiction, that lim sup_{k→∞} ‖Δx(x^k)‖ > 0. Then, there exists a δ > 0 such that ‖Δx(x^k)‖ > 2δ for infinitely many k and also ‖Δx(x^k)‖ < δ for infinitely many k. Therefore, one can always find an infinite set of indexes, say K, having the following properties: for any k ∈ K, there exists an integer i_k > k such that

‖Δx(x^k)‖ < δ,   ‖Δx(x^{i_k})‖ > 2δ,   (34)
δ ≤ ‖Δx(x^j)‖ ≤ 2δ,   k < j < i_k.   (35)

Given the above bounds, the following holds: for all k ∈ K,

δ (a)< ‖Δx(x^{i_k})‖ − ‖Δx(x^k)‖
  ≤ ‖x(x^{i_k}) − x(x^k)‖ + ‖x^{i_k} − x^k‖
  (b)≤ (1 + L) ‖x^{i_k} − x^k‖
  (c)≤ (1 + L) ∑_{t=k}^{i_k−1} γ^t (‖Δx(x^t)_{S^t}‖ + ‖(z^t − x(x^t))_{S^t}‖)
  (d)≤ (1 + L) (2δ + ε_max) ∑_{t=k}^{i_k−1} γ^t,   (36)

where (a) follows from (34); (b) is due to Proposition 8(a); (c) comes from the triangle inequality, the updating rule of the algorithm, and the definition of ẑ^k; and in (d) we used (34), (35), and ‖z^t − x(x^t)‖ ≤ ∑_{i∈N} ε^t_i, where ε_max ≜ max_k ∑_{i∈N} ε^k_i < ∞. It follows from (36) that

lim inf_{k→∞} ∑_{t=k}^{i_k−1} γ^t ≥ δ / ((1 + L)(2δ + ε_max)) > 0.   (37)

We show next that (37) is in contradiction with the convergence of {V(x^k)}. To do that, we preliminarily prove that, for sufficiently large k ∈ K, it must be ‖Δx(x^k)‖ ≥ δ/2. Proceeding as in (36), we have, for any given k ∈ K,

‖Δx(x^{k+1})‖ − ‖Δx(x^k)‖ ≤ (1 + L) ‖x^{k+1} − x^k‖ ≤ (1 + L) γ^k (‖Δx(x^k)‖ + ε_max).

It turns out that for sufficiently large k ∈ K such that (1 + L) γ^k < δ/(δ + 2 ε_max), it must be

‖Δx(x^k)‖ ≥ δ/2;   (38)

otherwise the condition ‖Δx(x^{k+1})‖ ≥ δ would be violated [cf. (35)]. Hereafter we assume without loss of generality that (38) holds for all k ∈ K (in fact, one can always restrict {x^k}_{k∈K} to a proper subsequence).

We can now show that (37) is in contradiction with the convergence of {V(x^k)}. Using (32) (possibly over a subsequence), we have, for sufficiently large k ∈ K,

V(x^{i_k}) ≤ V(x^k) − β_2 ∑_{t=k}^{i_k−1} γ^t ‖Δx(x^t)‖² + ∑_{t=k}^{i_k−1} T^t (a)< V(x^k) − β_2 (δ²/4) ∑_{t=k}^{i_k−1} γ^t + ∑_{t=k}^{i_k−1} T^t,   (39)

where in (a) we used (35) and (38), and β_2 is some positive constant. Since {V(x^k)} converges and ∑_{k=0}^∞ T^k < ∞, (39) implies lim_{K∋k→∞} ∑_{t=k}^{i_k−1} γ^t = 0, which contradicts (37).
Finally, since the sequence {x^k} is bounded [by the coercivity of V and the convergence of {V(x^k)}], it has at least one limit point x̄, which belongs to X. By the continuity of x(•) [Proposition 8(a)] and lim_{k→∞} ‖x(x^k) − x^k‖ = 0, it must be x(x̄) = x̄. By Proposition 8(b), x̄ is a stationary solution of Problem (1). As a final remark, note that if ε^k_i = 0 for every i and for every k large enough, i.e., if eventually x(x^k) is computed exactly, there is no need to assume that G is globally Lipschitz. In fact, in (30) the term containing L_G disappears, all the T^k are zero, and all the subsequent derivations hold independently of the Lipschitzianity of G. □

C. Proof of Theorem 2

We show next that Algorithm 2 is just an instance of the inexact Jacobi scheme described in Algorithm 1 satisfying the convergence conditions in Theorem 1, which proves Theorem 2. It is not difficult to see that this boils down to proving that, for all p ∈ P and i ∈ I_p, the sequence z^k_{pi} in Step 2a) of Algorithm 2 satisfies

‖z^k_{pi} − x_{pi}(x^k)‖ ≤ ε^k_{pi},   (40)

for some {ε^k_{pi}} such that ∑_k ε^k_{pi} γ^k < ∞.

The following holds for the LHS of (40):

‖z^k_{pi} − x_{pi}(x^k)‖ ≤ ‖x_{pi}(x^{k+1}_{pi<}, x^k_{pi≥}, x^k_{−p}) − x_{pi}(x^k)‖ + ‖z^k_{pi} − x_{pi}(x^{k+1}_{pi<}, x^k_{pi≥}, x^k_{−p})‖
  (a)≤ ‖x_{pi}(x^{k+1}_{pi<}, x^k_{pi≥}, x^k_{−p}) − x_{pi}(x^k)‖ + ε^k_{pi}
  (b)≤ L ‖x^{k+1}_{pi<} − x^k_{pi<}‖ + ε^k_{pi}
  (c)= L γ^k ‖z^k_{pi<} − x^k_{pi<}‖ + ε^k_{pi}
  ≤ L γ^k (∑_{j=1}^{i−1} (‖z^k_{pj} − x_{pj}(x^k)‖ + ‖x_{pj}(x^k) − x^k_{pj}‖)) + ε^k_{pi}
  (d)≤ L γ^k ∑_{j=1}^{i−1} ε^k_{pj} + L γ^k (i − 1)(L_G + β)/c_τ + ε^k_{pi} ≜ ε̃^k_{pi},   (41)

where (a) follows from the error bound in Step 2a) of Algorithm 2; in (b) we used Proposition 8(a); (c) is due to Step 2b); and (d) follows from ‖x_{pj}(x^k) − x^k_{pj}‖ ≤ (L_G + ‖∇_{x_{pj}} F(x^k)‖)/c_τ and ‖∇_{x_{pj}} F(x^k)‖ ≤ β for some 0 < β < ∞. It turns out that (40) is satisfied choosing ε̃^k_{pi} as defined in the last inequality of (41).

REFERENCES

[1] F. Facchinei, S. Sagratella, and G. Scutari, "Flexible parallel algorithms for big data optimization," in Proc. of the IEEE 2014 International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2014), Florence, Italy, May 4-9, 2014. [Online]. Available: http://arxiv.org/abs/1311.2444

[2] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society. Series B (Methodological), pp. 267–288, 1996.

[3] Z. Qin, K. Scheinberg, and D. Goldfarb, "Efficient block-coordinate descent algorithms for the group lasso," Mathematical Programming Computation, vol. 5, pp. 143–169, June 2013.

[4] A. Rakotomamonjy, "Surveying and comparing simultaneous sparse approximation (or group-lasso) algorithms," Signal Processing, vol. 91, no. 7, pp. 1505–1526, July 2011.

[5] G.-X. Yuan, K.-W. Chang, C.-J. Hsieh, and C.-J. Lin, "A comparison of optimization methods and software for large-scale l1-regularized linear classification," The Journal of Machine Learning Research, vol. 9999, pp. 3183–3234, 2010.

[6] R. H. Byrd, J. Nocedal, and F. Oztoprak, "An inexact successive quadratic approximation method for convex L-1 regularized optimization," arXiv preprint arXiv:1309.3529, 2013.

[7] K. Fountoulakis and J. Gondzio, "A second-order method for strongly convex L1-regularization problems," arXiv preprint arXiv:1306.5386, 2013.

[8] I. Necoara and D. Clipici, "Efficient parallel coordinate descent algorithm for convex optimization problems with separable constraints: application to distributed MPC," Journal of Process Control, vol. 23, no. 3, pp. 243–253, March 2013.

[9] Y. Nesterov, "Gradient methods for minimizing composite functions," Mathematical Programming, vol. 140, pp. 125–161, August 2013.

[10] P. Tseng and S. Yun, "A coordinate gradient descent method for nonsmooth separable minimization," Mathematical Programming, vol. 117, no. 1-2, pp. 387–423, March 2009.

[11] A. Beck and M. Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM Journal on Imaging Sciences, vol. 2, no. 1, pp. 183–202, Jan. 2009.

[12] S. J. Wright, R. D. Nowak, and M. A. Figueiredo, "Sparse reconstruction by separable approximation," IEEE Trans. on Signal Processing, vol. 57, no. 7, pp. 2479–2493, July 2009.

[13] Z. Peng, M. Yan, and W. Yin, "Parallel and distributed sparse optimization," in Signals, Systems and Computers, 2013 Asilomar Conference on. IEEE, 2013, pp. 659–646.

[14] M. Razaviyayn, M. Hong, and Z.-Q. Luo, "A unified convergence analysis of block successive minimization methods for nonsmooth optimization," SIAM Journal on Optimization, vol. 23, no. 2, pp. 1126–1153, 2013.

[15] P. L. Bühlmann, S. A. van de Geer, and S. Van de Geer, Statistics for High-Dimensional Data. Springer, 2011.

[16] S. Sra, S. Nowozin, and S. J. Wright, Eds., Optimization for Machine Learning, ser. Neural Information Processing. Cambridge, Massachusetts: The MIT Press, Sept. 2011.

[17] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski, Optimization with Sparsity-inducing Penalties. Foundations and Trends in Machine Learning, Now Publishers Inc, Dec. 2011.

[18] Y. Li and S. Osher, "Coordinate descent optimization for ℓ1 minimization with application to compressed sensing; a greedy algorithm," Inverse Probl. Imaging, vol. 3, no. 3, pp. 487–503, 2009.

[19] I. S. Dhillon, P. K. Ravikumar, and A. Tewari, "Nearest neighbor based greedy coordinate descent," in Advances in Neural Information Processing Systems 24, J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, Eds. Curran Associates, Inc., 2011, pp. 2160–2168.

[20] C. Scherrer, A. Tewari, M. Halappanavar, and D. Haglin, "Feature clustering for accelerating parallel coordinate descent," in Proc. of the 26th Annual Conference on Neural Information Processing Systems (NIPS), Lake Tahoe, NV, USA, Dec. 3–6 2012, pp. 28–36.

[21] G. Scutari, F. Facchinei, P. Song, D. Palomar, and J.-S. Pang, "Decomposition by partial linearization: Parallel optimization of multi-agent systems," IEEE Trans. Signal Process., vol. 62, pp. 641–656, Feb. 2014.

[22] Y. Nesterov, "Efficiency of coordinate descent methods on huge-scale optimization problems," SIAM Journal on Optimization, vol. 22, no. 2, pp. 341–362, 2012.

[23] J. K. Bradley, A. Kyrola, D. Bickson, and C. Guestrin, "Parallel coordinate descent for l1-regularized loss minimization," in Proc. of the 28th International Conference on Machine Learning, Bellevue, WA, USA, June 28–July 2, 2011.

[24] I. Necoara, Y. Nesterov, and F. Glineur, "A random coordinate descent method on large optimization problems with linear constraints," Technical Report, pp. 1–25, July 2011. [Online]. Available: http://acse.pub.ro/person/ion-necoara/

[25] I. Necoara and D. Clipici, "Distributed random coordinate descent method for composite minimization," Technical Report, pp. 1–41, Nov. 2013. [Online]. Available: http://arxiv-web.arxiv.org/abs/1312.5302

[26] Z. Lu and L. Xiao, "Randomized block coordinate non-monotone gradient method for a class of nonlinear programming," arXiv preprint arXiv:1306.5918v1, 2013.

[27] P. Richtárik and M. Takác, "Parallel coordinate descent methods for big data optimization," arXiv preprint arXiv:1212.0873, 2012.

[28] M. Yuan and Y. Lin, "Model selection and estimation in regression with grouped variables," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 68, no. 1, pp. 49–67, 2006.

[29] S. K. Shevade and S. S. Keerthi, "A simple and efficient algorithm for gene selection using sparse logistic regression," Bioinformatics, vol. 19, no. 17, pp. 2246–2253, 2003.

[30] L. Meier, S. Van De Geer, and P. Bühlmann, "The group lasso for logistic regression," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 70, no. 1, pp. 53–71, 2008.

[31] H. Lee, A. Battle, R. Raina, and A. Y. Ng, "Efficient sparse coding algorithms," in Proc. of the 21st Annual Conference on Neural Information Processing Systems (NIPS), Vancouver, British Columbia, CA, Dec. 3–6 2007, pp. 801–808.

[32] D. Goldfarb, S. Ma, and K. Scheinberg, "Fast alternating linearization methods for minimizing the sum of two convex functions," Mathematical Programming, vol. 141, pp. 349–382, Oct. 2013.

[33] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods, 2nd ed. Athena Scientific Press, 1989.

[34] F. Facchinei and J.-S. Pang, Finite-Dimensional Variational Inequalities and Complementarity Problems. Springer-Verlag, New York, 2003.

[35] G. Cohen, "Optimization by decomposition and coordination: A unified approach," IEEE Trans. on Automatic Control, vol. 23, no. 2, pp. 222–232, April 1978.


[36] ——, "Auxiliary problem principle and decomposition of optimization problems," Journal of Optimization Theory and Applications, vol. 32, no. 3, pp. 277–305, Nov. 1980.

[37] M. Patriksson, "Cost approximation: a unified framework of descent algorithms for nonlinear programs," SIAM Journal on Optimization, vol. 8, no. 2, pp. 561–582, 1998.

[38] M. Razaviyayn, H.-W. Tseng, and Z.-Q. Luo, "Dictionary learning for sparse representation: Complexity and algorithms," in Proc. of the IEEE 2014 International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2014), Florence, Italy, May 4-9, 2014.

[39] M. Fukushima and H. Mine, "A generalized proximal point algorithm for certain non-convex minimization problems," Int. J. of Systems Science, vol. 12, no. 8, pp. 989–1000, 1981.

[40] H. Mine and M. Fukushima, "A minimization method for the sum of a convex function and a continuously differentiable function," J. of Optimization Theory and App., vol. 33, no. 1, pp. 9–23, Jan. 1981.

[41] W. Deng, M.-J. Lai, Z. Peng, and W. Yin, "Parallel multi-block ADMM with O(1/k) convergence," arXiv:1312.3040, March 2014.

[42] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Cambridge, Massachusetts: Athena Scientific Press, May 2011.