
Optimization Methods & Software, 2013, Vol. 28, No. 5, 1012–1039, http://dx.doi.org/10.1080/10556788.2012.656368

A non-monotonic method for large-scale non-negative least squares

Dongmin Kim a*, Suvrit Sra b and Inderjit S. Dhillon a

a University of Texas at Austin, Austin, TX, USA; b Max Planck Institute for Intelligent Systems, Tübingen, Germany

(Received 22 November 2010; final version received 2 January 2012)

We present a new algorithm for solving the non-negative least-squares (NNLS) problem. Our algorithm extends the unconstrained quadratic optimization algorithm of Barzilai and Borwein (BB) [J. Barzilai and J.M. Borwein, Two-point step size gradient methods, IMA J. Numer. Anal. 1988] to handle nonnegativity constraints. Our extension differs from other constrained BB variants in simple but crucial aspects, the most notable being our modification to the BB stepsize itself. Our stepsize computation takes into account the nonnegativity constraints, and is further refined by a stepsize scaling strategy. These changes, in combination with orthogonal projections onto the nonnegative orthant, yield an effective NNLS algorithm. We compare our algorithm with several competing approaches, including established bound-constrained solvers, popular BB-based methods, and also a specialised NNLS algorithm. On several synthetic and real-world datasets our method displays highly competitive empirical performance.

Keywords: least squares; nonnegativity constraints; large-scale; non-monotonic descent; Barzilai–Borwein stepsize; gradient projection method; NNLS

AMS Subject Classification: 93E24; 90C52; 90C20

1. Introduction

We study the nonnegative least-squares (NNLS) problem

minimize_x   f(x) = (1/2)‖Ax − b‖²,   subject to x ≥ 0,   (1)

where A ∈ R^{m×n} and b ∈ R^m. NNLS is a fundamental problem that arises naturally in applications where, in addition to satisfying a least-squares model, the variables must also satisfy nonnegativity constraints. Such constraints often stem from physical grounds, e.g. when the variables x_i (1 ≤ i ≤ n) encode quantities such as frequency counts (data mining [20]), chemical concentrations (chemometrics [10]), and image intensities (astronomy [19,24]; medical imaging [28]).
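For reference, the objective and gradient of (1) are simple to set up in code; the following MATLAB fragment is a minimal sketch using illustrative placeholder data (not any of the data sets used later in the paper):

% Minimal sketch of the NNLS objective (1) and its gradient.
A = sprand(1000, 400, 0.1);            % illustrative sparse nonnegative data
xtrue = sprand(400, 1, 0.3);           % sparse nonnegative ground truth
b = A * xtrue;
f     = @(x) 0.5 * norm(A*x - b)^2;    % f(x) = (1/2)||Ax - b||^2
gradf = @(x) A' * (A*x - b);           % grad f(x) = A'(Ax - b)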

Despite its apparent simplicity, NNLS can be challenging to solve, especially for large-scale problems. For such problems (e.g., when A is large and sparse), first-order (gradient-based) methods are often preferable [2,3,26], since methods depending on second-order (Hessian) information can be computationally too expensive. Thus, we too focus on developing a first-order method.

*Corresponding author. Email: [email protected]

© 2013 Taylor & Francis



We begin by recalling gradient projection (GP), the canonical first-order constrained optimization method. Let f(x) be a differentiable function, and Ω the set of feasible solutions. GP performs the iteration x^{k+1} = [x^k − γ^k ∇f(x^k)]_Ω for k ≥ 1, where [·]_Ω denotes orthogonal projection onto Ω, while γ^k > 0 is a stepsize [29]. For NNLS, the projection is trivial and the stepsize may be computed using a standard linesearch [3], making GP seem attractive. Unfortunately, GP proves less effective because it inherits drawbacks from (unconstrained) steepest descent, namely zig-zagging or jamming, which make it converge very slowly [3,26].
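As a point of comparison for what follows, a bare-bones GP loop for NNLS takes only a few lines of MATLAB; this sketch uses a constant stepsize γ = 1/L with L = λ_max(A^TA) instead of a linesearch, purely for illustration:

% Plain gradient projection for NNLS: x <- [x - gamma * grad f(x)]_+
L = normest(A)^2;               % estimate of lambda_max(A'A) = ||A||_2^2
gamma = 1 / L;                  % safe constant stepsize
x = zeros(size(A, 2), 1);
for k = 1:500
    g = A' * (A*x - b);         % gradient of (1/2)||Ax - b||^2
    x = max(x - gamma * g, 0);  % orthogonal projection onto the nonnegative orthant
end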

For steepest descent, slow convergence has been addressed by numerous techniques. Amongst these, beyond the well-known idea of conjugate gradients, the stepsize computation suggested by Barzilai and Borwein (BB) [1] stands out. Not only is this computation inexpensive, but despite exhibiting non-monotonic descent, it is also surprisingly effective [7,14].

More specifically, for performing the unconstrained steepest-descent iteration x^{k+1} = x^k − γ^k ∇f(x^k), BB proposed the following two stepsizes (where Δx^k = x^k − x^{k−1} and Δf^k = ∇f(x^k) − ∇f(x^{k−1}) = A^TA Δx^k):

γ^k = ‖Δx^k‖² / ⟨Δx^k, Δf^k⟩ = ‖−γ^{k−1}∇f(x^{k−1})‖² / ⟨Δx^k, A^TA Δx^k⟩ = ‖∇f(x^{k−1})‖² / ⟨∇f(x^{k−1}), A^TA ∇f(x^{k−1})⟩   (2a)

and

γ^k = ⟨Δx^k, Δf^k⟩ / ‖Δf^k‖² = ⟨Δx^k, A^TA Δx^k⟩ / ‖A^TA Δx^k‖² = ⟨∇f(x^{k−1}), A^TA ∇f(x^{k−1})⟩ / ‖A^TA ∇f(x^{k−1})‖².   (2b)
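In code, both formulae need only the previous gradient and one extra multiplication by A^TA; the following minimal sketch uses the rightmost expressions in (2a) and (2b), with x_prev as a placeholder name for the previous iterate:

% BB stepsizes (2a) and (2b) for f(x) = (1/2)||Ax - b||^2.
g_prev = A' * (A*x_prev - b);                    % grad f(x^{k-1})
Ag     = A' * (A * g_prev);                      % (A'A) * grad f(x^{k-1})
gamma_a = (g_prev' * g_prev) / (g_prev' * Ag);   % stepsize (2a)
gamma_b = (g_prev' * Ag) / (Ag' * Ag);           % stepsize (2b)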

Convergence of steepest descent run with either of (2a) or (2b) was first proved for a two-variable quadratic problem [1]; later, convergence for the general unconstrained convex quadratic case was established [27].

Since these stepsizes accelerate steepest descent considerably [1,14,27], one might wonder whether they similarly accelerate GP. Unfortunately, despite the strong resemblance of GP to steepest descent, naïvely plugging the BB stepsizes (2) into GP does not work. Dai and Fletcher [14] presented a counter-example showing that GP fails to converge when run with BB steps (we reproduce a similar counter-example in Appendix 1, Figure A1). Thus, it seems that to ensure convergence for constrained problems, some form of linesearch that guarantees descent is almost inevitable when invoking the BB formulae (or variants thereof). Indeed, this observation is reaffirmed by noting that all methods in the literature that invoke BB steps depend on linesearch [5–8,13–15,17].

In this paper, we explore the possibility of using BB steps for GP, but without resorting to linesearch. Consider, therefore, the following typical alternatives to linesearch: (a) a constant stepsize (γ^k ≤ 2/L, where L is the Lipschitz constant of ∇f(x)); or (b) stepsizes γ^k given by a sequence {β^k} of diminishing scalars that satisfy, e.g.,

(i) lim_{k→∞} β^k = 0   and   (ii) lim_{k→∞} Σ_{i=1}^{k} β^i = ∞.   (3)

Alternative (a) cannot be used as the BB steps vary from iteration to iteration. We show, however, that alternative (b) can be combined with BB steps to develop a convergent GP method (see Section 2.4 for details). Besides obtaining convergence, there are two more reasons that motivate us to invoke diminishing scalars (DS). First, they align well with BB steps, since, akin to BB methods, descent based on DS is also usually non-monotonic. Second, and more importantly, the use of DS and its impact on BB-type methods has not been investigated elsewhere.

Figure 1. Objective function value (left), norm of projected gradient ‖∇f_+(x^k)‖_∞ (middle), and true error ‖x^k − x*‖ (right) versus running time (in seconds) for OA, OA+LS, and OA+DS.

Using DS, however, has its share of difficulties. Although theoretically elegant, using DS is not always practical as the diminishing sequence must be carefully selected; for an excellent example of how a poor choice of the diminishing sequence can be disastrous, see [25, pg. 1578]. To reduce dependence on diminishing scalars, we propose to not just invoke (3) out-of-the-box, but to rather use it in a relaxed fashion (see Section 2.2).

But even a relaxed use of DS is insufficient for ensuring a highly competitive algorithm. This difficulty arises because even though the DS strategy helps ensure convergence, the actual underlying problem remains unaddressed. In other words, although the clash between projection and BB steps (see Figure 1) is suppressed by diminishing scalars, it is hardly eliminated. We overcome this clash by introducing a crucial modification to the BB steps themselves. Our modification leads to rapid (empirical) convergence. The details of how we blend DS with modified BB steps are described in Section 2, wherein we derive our algorithm following a series of investigative experiments that guide our design choices.

2. Algorithm

We begin by investigating how BB steps augmented with DS fare against some known BB approaches. Admittedly, like most gradient-based approaches, actual performance can vary significantly in the face of issues such as ill-conditioning. So to avoid getting lost in numerical concerns and to gain insight into how the different approaches distinguish themselves, we begin our experimentation with a well-conditioned, large, sparse matrix. We remark that the NNLS problem in this section was chosen to illustrate the major character and deficiency of each BB variant discussed; obviously this does not imply any conclusions about the possible differences in performance of the variants across all possible NNLS instances.

We experiment with a sparse, full-rank matrix A of size 12288 × 4096, having 16.57% nonzeros; the matrix's smallest singular value is σ_min(A) = 0.0102, and the largest is σ_max(A) = 0.5779. We simulate the optimal solution x* to have 26.12% (1,070) nonzeros, while also satisfying Ax* = b. Then, we solve the associated NNLS problem (1) using the following algorithms:

(a) Ordinary Algorithm (OA) – GP with an alternating use of BB steps (2);
(b) OA + LineSearch (OA+LS) – OA with linesearch every five iterations;
(c) OA + Diminishing Scalar (OA+DS) – OA, but with stepsize β^k γ^k, with different choices for the sequence {β^k} satisfying (3).

We implemented all three algorithms in Matlab. For OA+LS, we use Armijo along the projection arc linesearch [3], invoking it every five iterations by using a reference iterate to ensure convergence (if after five iterations descent has happened, linesearch proceeds from the current iterate, otherwise it uses the reference iterate). For OA+DS we experimented with several diminishing sequences and have reported the best results. We run all three algorithms with a convergence tolerance of ‖∇f_+‖_∞ < 10^−8, where ∇f_+ is the projected gradient:

[∇f_+(x)]_i = min{0, ∂_i f(x)} if x_i = 0,   and   [∇f_+(x)]_i = ∂_i f(x) if x_i > 0.   (4)
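Computing (4) and the associated stopping test takes a couple of lines; a minimal MATLAB sketch (x denotes the current feasible iterate):

% Projected gradient (4) and the stopping test used in this section.
g  = A' * (A*x - b);
gp = g;
gp(x == 0) = min(g(x == 0), 0);     % [grad f_+(x)]_i = min{0, d_i f(x)} when x_i = 0
done = norm(gp, inf) < 1e-8;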

Note that ‖∇f_+(x*)‖_∞ = 0 must hold at optimality.

Figure 1 reports the result of our experiment with the three algorithms mentioned above. It plots the objective function value, the norm of the projected gradient, and the norm of the error (i.e. distance to the true solution), against the running time in seconds. As also observed in [14], OA oscillates significantly. The figure indicates that both OA+LS and OA+DS converge, though linesearch seems to be more helpful as OA+LS converges faster than OA+DS. Although OA+LS approaches the convergence tolerance the quickest, one notable behaviour of OA+DS is that it rapidly catches up and eventually satisfies the convergence tolerance, with a convergence rate similar to OA+LS near the solution. The figure suggests that despite being slower than OA+LS, algorithm OA+DS could still be a viable candidate. But it is not competitive enough, especially considering that it requires user intervention to tune the diminishing sequence. Let us thus see how to improve OA+DS.

2.1 Subspace BB steps

To avoid the dominance of diminishing scalars in ensuring convergence, we need to address a deeper underlying problem: the oscillatory behaviour exhibited by the unconstrained BB method when mixed with projections. When do oscillations (e.g., Figures 1 and A1) arise? Recall that the unconstrained BB method generates a sequence that converges to the unconstrained optimum. Thus, it must be the projection step that derails such a sequence by pulling it towards the nonnegative orthant. In ordinary gradient projection, the projection step is a simple device for enforcing feasibility, and it does not 'break' the algorithm since projections are nonexpansive and the stepsize is chosen to always ensure descent. Quite logically, we may also conclude that some restriction to the BB step is needed, especially when the projection actually affects the current iterate.

Thus, the non-monotonicity of the BB step is the source of trouble. But we also know that this non-monotonicity is a key ingredient for rapid convergence. So, to alleviate oscillations without curtailing the non-monotonicity of the BB steps, we once again recall that the unconstrained BB is guaranteed to converge. Now, suppose that oscillations start appearing. If we knew which variables were active, i.e., zero, at the solution, we could reduce the optimization task to an unconstrained problem over the inactive variables alone. Then, we could compute the solution to this reduced problem by restricting the computation of BB steps to the inactive variables. Note that we can obtain the constrained optimum by simply incorporating the active variables into this unconstrained solution.

This is a key observation behind active-set methods, and it proves key in the development of our method too. Specifically, we partition the variables into active and working sets, carrying out the optimization over the working set alone. In addition, since the gradient is readily available, we exploit it to refine the active set and obtain the binding set.

Definition 2.1 (Active and binding sets) Given x, the active set A(x) and the binding set B(x) are defined as

A(x) = {i | x_i = 0},   (5)
B(x) = {i | x_i = 0, ∂_i f(x) > 0}.   (6)

The role of the binding set is simple: variables bound at the current iteration are guaranteed to be active at the next. Denote an orthogonal projection onto the nonnegative orthant by [·]_+; if i ∈ B(x^k) and we iterate x^{k+1} = [x^k − γ^k ∇f(x^k)]_+ (for γ^k > 0), then since x^{k+1}_i = [x^k_i − γ^k ∂_i f(x^k)]_+ = 0, the membership i ∈ A(x^{k+1}) holds. Therefore, if we know that i ∈ B(x^k), we may discard x^k_i from the update. To employ this idea in conjunction with the BB step, we first compute B(x^k) and then confine the computation of the stepsize to the subspace defined by j ∉ B(x^k). Formally, we propose to replace the basic BB steps (2) by the subspace-BB steps:

α^k = ‖∇̃f^{k−1}‖² / ⟨∇̃f^{k−1}, A^TA ∇̃f^{k−1}⟩   (7a)

or

α^k = ⟨∇̃f^{k−1}, A^TA ∇̃f^{k−1}⟩ / ‖A^TA ∇̃f^{k−1}‖²,   (7b)

where ∇̃f^{k−1} is defined as

[∇̃f^{k−1}]_i = ∂_i f(x^{k−1}) for i ∉ B(x^k),   and   [∇̃f^{k−1}]_i = 0 otherwise.

Notice that ∇̃f^{k−1} = ∇f_+(x^{k−1}) only if B(x^k) = B(x^{k−1}). Using the subspace-BB steps (7), we now modify OA+DS to obtain the iteration

x^{k+1} = [x^k − β^k · α^k ∇f(x^k)]_+,   (8)

which defines another algorithm that we call SA+DS (for Subspace Algorithm + DS). To illustrate how SA+DS performs in comparison to OA+DS, with an identical choice of the diminishing scalar sequence, we repeat the experiment of Figure 1, running it this time with SA+DS. Figure 2 compares the objective function, projected-gradient norm, and the error norm achieved by OA+DS to those attained by SA+DS. Since both algorithms are run with the same sequence of diminishing scalars, the vast difference in performance shown in the plots may be attributed to the subspace-BB steps. Also note that in contrast to all other methods shown so far (Figures 1 and 2), SA+DS manages to satisfy the convergence criterion ‖∇f_+‖_∞ < 10^−8 fairly rapidly.
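One SA+DS iteration therefore only adds the binding-set computation and a restricted gradient to the BB machinery; a minimal MATLAB sketch follows, where beta is the current diminishing scalar β^k, and x and x_prev are placeholder names for x^k and x^{k−1}:

% One SA+DS step: binding set (6), subspace gradient, subspace-BB step (7a), update (8).
g_prev = A' * (A*x_prev - b);           % grad f(x^{k-1})
g      = A' * (A*x - b);                % grad f(x^k), used to form B(x^k)
gt = g_prev;
gt((x == 0) & (g > 0)) = 0;             % zero out the binding set B(x^k)
Agt = A' * (A * gt);
alpha = (gt' * gt) / (gt' * Agt);       % (7a); (7b) would be (gt'*Agt)/(Agt'*Agt)
x_prev = x;
x = max(x - beta * alpha * g, 0);       % projected update (8): full gradient, scaled subspace step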


Figure 2. Objective function value (left), projected gradient norm (middle), and true error ‖x^k − x*‖ (right) versus running time (in seconds) for OA+DS and SA+DS. The DS used corresponds to c = 5, δ = 0.4 from Figure 3.

2.2 Diminishing optimistically

Having seen the benefit of subspace-BB steps in SA+DS, we now take a closer look at the role of diminishing scalars in SA+DS. As already mentioned, gradient projection methods are in general highly sensitive to the choice of the diminishing sequence, which must be chosen carefully to obtain good empirical performance. Even though Figure 2 shows SA+DS to exhibit good performance, without proper tuning such performance is not easy to attain. Figure 3 illustrates this difficulty by showing results of running SA+DS with various choices for the diminishing sequence: one observes that the subspace-BB steps do not help much if a poor diminishing sequence is used.

However, one does note that the subspace-BB steps help in a 'robust' manner, that is, they consistently improve the convergence speed of SA+DS as the effect of the diminishing sequence weakens. To investigate this behaviour further, we run SA+DS with β^k = 1, i.e. without scaling the subspace-BB steps. While considering the impact of subspace-BB steps without diminishing scalars, one might wonder if linesearch combined with subspace-BB is superior. So we also test a method that combines subspace-BB steps with linesearch (SA+LS). We compare SA+DS run with β^k = 1 against SA+LS and the best-performing SA+DS (from Figure 3): the results are pleasantly surprising, as shown in Figure 4.

Figure 3. Empirical convergence of SA+DS with respect to different choices of the diminishing sequence. The choices used were β^k = c/k^δ. In this experiment, c = 5 and δ = 0.4 eventually led to the fastest convergence. As expected, if β^k decays too rapidly, the convergence gets impaired and the algorithm eventually stalls due to limited machine precision. The exact values of the diminishing parameters are not as important as the message that SA+DS's convergence is sensitive to the diminishing sequence employed.

Figure 4 suggests that the subspace-BB steps alone can produce converging iterates and that even the lazy linesearch may affect them adversely. In view of Figures 3 and 4, we conclude that either scaling the stepsize via β^k or invoking linesearch can adversely affect the convergence speed of SA+DS, whereas even a near-constant {β^k} seems to retain SA+DS's convergence. This behaviour is opposite to that exhibited by the non-subspace method OA+DS, for which the diminishing scalars not only control the convergence speed but also dominate the convergence itself. This contrast (Figure 4) is a crucial distinguishing feature of SA+DS.

Therefore, for empirical performance we must retain the benefits of subspace steps, while still guaranteeing convergence. However, at this point we must stress that despite their robust empirical performance, subspace-BB steps are unlikely to be sufficient for guaranteeing convergence: their non-monotonic behaviour and their lack of explicit active set partitioning make it difficult to prevent the oscillation of the active sets throughout iterations. To tackle this difficulty, while minimizing interference with the subspace-BB steps, we propose to relax the application of diminishing scalars by using an 'optimistic' diminishment strategy.

Figure 4. Objective function value (left), projected gradient norm (middle) and true error ‖x^k − x*‖ (right) versus running time for SA+DS with β^k = 1 compared with the best performing instance of SA+DS selected from Figure 2.

This strategy is as follows. We scale the subspace-BB step (7) with some constant scalar β for a fixed number, say M, of iterations. Then, we check whether a descent condition is satisfied. If yes, then the method continues for M more iterations with the same β; if no, then we diminish the scaling factor β. The diminishment is 'optimistic' because even when the method fails to satisfy the descent condition that triggers diminishment, we merely shrink β once and continue using it for the next M iterations. We remark that superficially this strategy might seem similar to an occasional linesearch, but it is fundamentally different: unlike linesearch it does not enforce monotonicity after failing to descend for a prescribed number of iterations. We formalize this below.

Suppose that the method is at iteration c, and then iterates with a constant β^c for the next M iterations, so that from the current iterate x^c, we compute

x^{k+1} = [x^k − β^c · α^k ∇f(x^k)]_+,   (9)

for k = c, c + 1, . . . , c + M − 1, where α^k is computed via (7). Now, for x^{c+M}, we check the descent condition

f(x^c) − f(x^{c+M}) ≥ σ ⟨∇f(x^c), x^c − x^{c+M}⟩,   (10)

for some σ ∈ (0, 1). If x^{c+M} passes the test (10), then we reuse β^c and set β^{c+M} = β^c; otherwise, we diminish β^c and set

β^{c+M} ← η · β^c,   (11)

for some η ∈ (0, 1). After adjusting β^{c+M}, the method repeats the update (9) for another M iterations.
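The check (10)–(11) costs one objective evaluation and one inner product every M iterations; a minimal sketch follows, assuming (as described above) that the current iterate is kept whether or not the test passes. Here x_ref and g_ref denote the stored iterate x^c and its gradient, x is the current iterate x^{c+M}, and f, sigma, eta are placeholder names for the objective handle and the constants σ, η:

% Optimistic diminishment: shrink beta once on failure of (10), never roll back.
if f(x_ref) - f(x) < sigma * (g_ref' * (x_ref - x))   % descent test (10) failed
    beta = eta * beta;                                % (11), with eta in (0,1)
end
% Either way the iterate is kept and the next M subspace-BB iterations proceed.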

2.3 SBB: Subspace-BB with optimistic diminishment

With the subspace-BB steps and the optimistic diminishment strategy we now have all the ingredients necessary to present our final NNLS algorithm: Subspace BB (SBB). The termination criterion used by our algorithm is approximate satisfaction of the KKT conditions, which for NNLS reduces to checking whether the norm

‖∇f_+(x^k)‖_∞ ≤ ε for a given threshold ε ≥ 0,   (12)

where the projected gradient ∇f_+ is as defined in (4). We use a termination criterion based on (12) for checking convergence of all the methods described in this paper. Algorithm 1 presents pseudo-code of SBB.

Algorithm 1 The Subspace BB algorithm (SBB).

Given x^0 and x^1
for i = 1, 2, . . . until the stopping criterion (12) is met do
    x^0 ← x^{i−1} and x^1 ← x^i
    for j = 1, . . . , M do   {/* Subspace BB */}
        Compute α^j using (7a) and (7b) alternately
        x^{j+1} ← [x^j − β^i · α^j ∇f(x^j)]_+
    end for
    if x^M satisfies (10) then
        x^{i+1} ← x^M, and β^{i+1} ← β^i
    else   {/* Diminish optimistically */}
        β^{i+1} ← η β^i, where η ∈ (0, 1)
    end if
end for
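The following is a minimal MATLAB sketch of Algorithm 1 under a few stated assumptions: the descent test uses the iterate and gradient stored at the start of each block of M iterations, the current iterate is retained whether or not the test passes (only β is diminished on failure), a simple outer-iteration cap replaces the elapsed-time safeguard used in the experiments, and the parameter values mirror the experimental section (M = 100, σ = 10^−2, η = 0.99). The function name sbb_nnls and all variable names are ours; the sketch illustrates the structure of SBB rather than reproducing the authors' implementation.

function x = sbb_nnls(A, b, tol)
% Sketch of SBB (Algorithm 1): min (1/2)||Ax - b||^2 subject to x >= 0.
if nargin < 3, tol = 1e-4; end
M = 100; sigma = 1e-2; eta = 0.99; maxouter = 1000;
f = @(x) 0.5 * norm(A*x - b)^2;
n = size(A, 2);
x = zeros(n, 1); x_prev = x;
beta = 1; use_a = true;
for i = 1:maxouter
    x_ref = x;
    g_ref = A' * (A*x_ref - b);
    gp = g_ref; gp(x_ref == 0) = min(g_ref(x_ref == 0), 0);   % projected gradient (4)
    if norm(gp, inf) <= tol, break; end                        % stopping test (12)
    for j = 1:M                                                % subspace-BB block
        g_prev = A' * (A*x_prev - b);
        g      = A' * (A*x - b);
        gt = g_prev;
        gt((x == 0) & (g > 0)) = 0;                            % zero the binding set B(x^j)
        Agt = A' * (A * gt);
        if use_a
            alpha = (gt' * gt) / (gt' * Agt);                  % (7a)
        else
            alpha = (gt' * Agt) / (Agt' * Agt);                % (7b)
        end
        use_a = ~use_a;                                        % alternate (7a)/(7b)
        if ~isfinite(alpha) || alpha <= 0, alpha = 1; end      % crude guard for the sketch
        x_prev = x;
        x = max(x - beta * alpha * g, 0);                      % projected update (9)
    end
    if f(x_ref) - f(x) < sigma * (g_ref' * (x_ref - x))        % descent test (10) failed
        beta = eta * beta;                                     % diminish optimistically (11)
    end                                                        % the iterate x is kept either way
end
end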

Figure 5 summarizes the best performing methods from Figures 1–4, while highlighting that SBB outperforms all other variants.

Figure 5. Objective function value (left), projected gradient norm (middle) and true error ‖x^k − x*‖ (right) versus running time for all the algorithms. In these plots, we have shown SBB using a dashed line to distinguish it from other methods. This figure is essentially a summary of the information shown in Figures 1–4.

2.4 Convergence analysis

In this section, we analyse some theoretical properties of SBB. First, we establish convergence under the assumption that f is strictly convex (or equivalently, that A^TA has full rank). Later, with an additional mild assumption, we show that the proof easily extends to the case where A^TA is rank-deficient. Finally, we briefly discuss properties such as convergence rate and the identification of active variables.

We remind the reader that when proving convergence of iterative optimization routines, one often assumes Lipschitz continuity of the objective function. The objective function for NNLS is only locally Lipschitz continuous, i.e. there exists a constant L such that

|f(x) − f(y)| ≤ L‖x − y‖, ∀x, y ∈ Ω,   or equivalently,   ‖∇f(x)‖ ≤ L ∀x ∈ Ω,

where Ω is an appropriate compact set.

Even though the domain of NNLS (x ≥ 0) does not define such a compact set, we can essentially view the domain to be compact. To see why, let x^u and x* denote the unconstrained least-squares solution and the NNLS solution, respectively. Let x^p = [x^u]_+ be the projection of x^u onto the nonnegative orthant. Then, the following inequalities are immediate:

‖Ax^u − b‖ ≤ ‖Ax* − b‖ ≤ ‖Ax^p − b‖.

Using these inequalities we can derive a simple upper bound U on ‖x*‖ as follows:

‖Ax*‖ − ‖b‖ ≤ ‖Ax* − b‖ ≤ ‖Ax^p − b‖,

hence

σ_min(A) · ‖x*‖ ≤ ‖Ax*‖ ≤ ‖Ax^p − b‖ + ‖b‖,

‖x*‖ ≤ (‖Ax^p − b‖ + ‖b‖) / σ_min(A) = U,   (13)

where σ_min(A) > 0 denotes the smallest singular value of A. Thus, the domain of NNLS can be effectively restricted to Ω = {x : 0 ≤ x ≤ U}, and we may safely consider the NNLS objective to be Lipschitz continuous. Finally, to ensure that the iterates remain in Ω, we can modify the projection [x]_+ so that no element x_i grows larger than U. We will assume this upper bound implicitly in the discussion below (also in Algorithm 1), and avoid mentioning it for simplicity of presentation.
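When A has full column rank, the bound (13) can be computed directly; a small sketch follows (the dense SVD is used only for illustration and would be replaced by an estimate of σ_min(A) for large sparse A):

% Upper bound U of (13) on ||x*||, assuming sigma_min(A) > 0.
xu = A \ b;                            % unconstrained least-squares solution
xp = max(xu, 0);                       % projection [x^u]_+ onto the nonnegative orthant
smin = min(svd(full(A)));              % smallest singular value (dense SVD for illustration)
U = (norm(A*xp - b) + norm(b)) / smin;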

For clarity in our proofs, we introduce some additional notation. Let M be the fixed number of subspace-BB iterations, so that we check the descent condition (10) only once every M iterations. We index these Mth iterates with I = {1, M + 1, 2M + 1, 3M + 1, . . .}, and then consider the sequence {x^r}, r ∈ I, generated by SBB. Let x* denote the optimal solution to the problem. We prove that {x^r} → x*.

Suppose that the diminishment step (11) is triggered only a finite number of times. Then, there exists a sufficiently large K such that

f(x^r) − f(x^{r+1}) ≥ σ ⟨∇f(x^r), x^r − x^{r+1}⟩,

for all r ∈ I, r ≥ K. In this case, we may view the sequence {x^r} as if it were generated by an ordinary gradient projection scheme, whereby convergence of {x^r} follows from [3]. Therefore, to prove the convergence of the entire algorithm, it is sufficient to discuss the case where (11) is invoked infinitely often.

Given an infinite sequence {x^r}, there is a corresponding sequence {β^r} which by construction is diminishing. Recall that the diminishing sequence with unbounded sum condition (3) ensures convergence [3]. We show that multiplying {β^r} with the subspace-BB steps preserves the condition (3), and hence inherits the convergence guarantees.

Proposition 2.2 In Algorithm 1, the stepsize β^r · α^r satisfies

(i) lim_{r→∞} β^r · α^r = 0   and   (ii) lim_{r→∞} Σ_{i=1}^{r} β^i · α^i = ∞.

Proof From the definition of the subspace-BB steps (7), simple algebra shows that

‖∇̃f^{k−1}‖² / ⟨∇̃f^{k−1}, A^TA ∇̃f^{k−1}⟩ = ⟨y_1, y_1⟩ / ⟨y_1, A^TA y_1⟩

and

⟨∇̃f^{k−1}, A^TA ∇̃f^{k−1}⟩ / ‖A^TA ∇̃f^{k−1}‖² = ⟨y_2, y_2⟩ / ⟨y_2, A^TA y_2⟩,

where y_1 = ∇̃f^{k−1} and y_2 = (A^TA)^{1/2} ∇̃f^{k−1}. Since A^TA is a positive definite matrix, for all y ≠ 0 its Rayleigh quotient satisfies

0 < λ_min(A^TA) ≤ ⟨y, A^TA y⟩ / ⟨y, y⟩ ≤ λ_max(A^TA),

where λ_max(A^TA) and λ_min(A^TA) denote the largest and the smallest eigenvalues of A^TA, respectively. Now we can see that at any given iteration r, the subspace-BB step α^r satisfies

1/λ_max(A^TA) ≤ α^r ≤ 1/λ_min(A^TA).   (14)

Since lim_{r→∞} β^r = 0 by construction, we can show that Condition (i) of (3) also holds for β^r α^r, since

lim_{r→∞} β^r · α^r ≤ (1/λ_min(A^TA)) · lim_{r→∞} β^r = 0.

Similarly, we also obtain Condition (ii) of (3), since

lim_{r→∞} Σ_{i=1}^{r} β^i · α^i ≥ (1/λ_max(A^TA)) · lim_{r→∞} Σ_{i=1}^{r} β^i = ∞.

Using this proposition, we now state the main convergence theorem. The proof is essentially that of gradient descent with diminishing stepsizes; we adapt it by showing some additional properties of {x^r}.


Theorem 2.3 Let the objective function f(x) = (1/2)‖Ax − b‖² be strictly convex, {x^r} be a sequence generated by Algorithm 1, and x* = argmin_{x≥0} f(x). Then {f(x^r)} → f(x*) and x^r → x*.

Proof Consider the update (9) of Algorithm 1; we can rewrite it as

x^{r+1} = x^r + β^r · α^r d^r,   (15)

where the descent direction d^r satisfies

d^r_i = 0                                        if i ∈ B(x^r),
d^r_i = −min{ x^r_i / (β^r · α^r), ∂_i f(x^r) }  if x^r_i > 0 and ∂_i f(x^r) > 0,
d^r_i = −∂_i f(x^r)                              otherwise.   (16)

With (16), and since there exists at least one ∂_i f(x^r) > 0 unless x^r = x*, we can conclude that there exists a constant c_1 > 0 such that

−⟨∇f(x^r), d^r⟩ ≥ Σ_{i ∉ B(x^r)} m_i² ≥ c_1 ‖∇f(x^r)‖² > 0,   (17)

where m_i = min{ x^r_i / (β^r · α^r), ∂_i f(x^r) }. Similarly, it can also be shown that there exists c_2 > 0 such that

‖d^r‖² ≤ c_2 ‖∇f(x^r)‖².   (18)

Using Proposition 2.2 with inequalities (17) and (18), the proof is immediate from Proposition 1.2.4 in [3]. ∎

To extend the above proof to rank-deficient A^TA, note that ultimately our convergence proof rests on that of ordinary gradient projection with linesearch or diminishing scalars. Neither of these two requires f to be strictly convex. The only difficulty that arises is in (13) and (14), where we compute the values 1/σ_min(A) or 1/λ_min(A^TA). It can be shown that for the convex quadratic program

minimize_x   (1/2) x^T H x − c^T x,

if one assumes c to be in the range of H, then the BB step (2) is bounded above by 1/λ⁺_min(H), where λ⁺_min denotes the smallest positive eigenvalue of H [15]. For our problem, we can equate this assumption on c to the assumption that A^Tb lies in the range of A^TA. Consequently, we can modify (13) and (14) to

‖x*‖ ≤ (‖Ax^p − b‖ + ‖b‖) / σ⁺_min(A) = U   and   α^r ≤ 1/λ⁺_min(A^TA),

where σ⁺_min and λ⁺_min denote the smallest positive singular value and eigenvalue, respectively.

Finally, to close our analysis we point out that since SBB may be viewed as a gradient projection method, it inherits properties such as the convergence rate [3] and identification of the active variables [11].


3. Related work

Before we show numerical results, it is fitting to briefly review related work. Over the years, a variety of methods have been applied to solve NNLS. Several of these approaches are summarized in [9, Chapter 5]. The main approaches can be roughly divided into three categories.

The first category includes methods that were developed for solving linear inequality constrained least squares problems by transforming them into corresponding least distance problems [9,30]. However, such transformations prove to be too costly for NNLS and do not yield much advantage, unless significant additional engineering efforts are made.

The second and more successful category of methods includes active-set methods [16]. Active set methods typically deal with one constraint per iteration, and the overall optimization problem is approximated by solving a series of equality-constrained problems; the equality constraints form the current active set that is then incrementally updated to construct the final active set. The famous NNLS algorithm of Lawson and Hanson [21] is an active set method, and has been the de facto method for solving (1) for many years. In fact, Matlab continues to ship lsqnonneg, an implementation of the original Lawson–Hanson NNLS algorithm [21]. Bro and de Jong [10] modified the latter algorithm and developed a method called Fast-NNLS (FNNLS) that is often faster than the Lawson–Hanson algorithm. The rationale behind FNNLS is simple: it accepts A^TA and A^Tb instead of A and b, thus taking advantage of the reduced dimensionality of A^TA when m ≫ n for A ∈ R^{m×n}. However, constructing A^TA is expensive, which makes the method prohibitive for large-scale problems, i.e. when both m and n are large.

The third category of methods includes algorithms based on more general iterative approaches that produce a sequence of intermediate solutions which converge to the optimal solution. For example, the gradient projection method [29] and some of its variants have been applied to NNLS [4,23]. The main advantage of this class of algorithms is that by using information from the projected gradient step to obtain a good approximation of the final active set, one can handle multiple active constraints per iteration. However, the projected gradient approach frequently suffers from slow convergence (zig-zagging), a difficulty potentially alleviated by more sophisticated methods such as LBFGS-B [12] or TRON [22].

The method proposed in this paper belongs to the third category. Observe that since NNLS is one of the simplest constrained optimization problems, any modern constrained optimization technique can be applied to solve it. However, generic off-the-shelf approaches frequently fail to exploit the inherent advantages arising from the simplicity of the problem, resulting in unnecessary computational and implementation overheads. In the following section, we illustrate the computational advantages of our method by comparing it with established optimization software.

Finally, we point out that there exist other BB-based methods that could also be applied to NNLS. But in contrast to our optimistic diminishment approach, all these approaches either employ linesearch [5–8,13,17] or introduce an explicit active-variable identification step to utilise the unconstrained BB stepsizes (2) [15]. SPG [6] belongs to the group of methods employing a nonmonotone linesearch, and it is one of the methods against which we compare, not only because it provides a publicly available implementation, but also because it is highly competitive. More recently, in [18] a BB-based active set algorithm (ASA) was proposed. ASA has two phases, a non-monotone gradient projection step and an unconstrained optimization step. It utilises the BB step in the gradient projection step and restricts its computation in a way similar to that developed in this paper. In our notation, their modified BB steps can be written as

‖Δx^k‖² / ⟨Δx^k, ∇f_+^k − ∇f_+^{k−1}⟩   and   ⟨Δx^k, ∇f_+^k − ∇f_+^{k−1}⟩ / ‖∇f_+^k − ∇f_+^{k−1}‖².


4. Numerical results

In this section, we show results on a large number of experiments, as validation of the empirical benefits of our algorithm, and to position it relative to other competing approaches. Specifically, we compare the following five methods:

(1) FNNLS [10] – Matlab implementation from http://www.mathworks.com/matlabcentral/fileexchange/3388-nnls-and-constrained-regression
(2) LBFGS-B [12] – Fortran implementation from http://users.eecs.northwestern.edu/∼nocedal/lbfgsb.html
(3) SPG [6] – Fortran implementation from http://www.ime.usp.br/∼egbirgin/
(4) ASA [18] – C implementation from http://www.math.ufl.edu/∼hager/
(5) SBB – Matlab implementation of Algorithm 1. The parameters M and σ for the descent step (10) were set to 100 and 10^−2, respectively. These values were not tuned to afford SBB any particular advantage. The diminishment parameter η in (11) was set to 0.99.

We ran all experiments on a Linux machine with an Intel Xeon 2.0 GHz CPU and 16 GB memory. We used Matlab to interface all algorithms, and its multi-threading option was turned off to prevent skewing the results due to multiple cores. Note that LBFGS-B and SPG are implemented in Fortran, while ASA is implemented in C. In contrast, FNNLS and SBB have merely Matlab implementations.

We ran all methods with their default parameters, except that we augmented each method by implementing for it an 'elapsed time' stopping criterion to allow stoppage after a user-specified time limit was reached. We also note here that SBB uses a threshold on ‖∇f_+(x)‖_∞ as its stopping criterion, while the other methods use ‖x − [x − ∇f(x)]_+‖_∞; this latter choice is always less strict than ‖∇f_+(x)‖_∞, which gives the other methods a slight advantage over SBB.
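The two measures are easy to compare numerically; a small sketch for a feasible x ≥ 0 (A, b, x are placeholders):

% The two stopping measures discussed above, for a feasible x >= 0.
g  = A' * (A*x - b);
gp = g;  gp(x == 0) = min(g(x == 0), 0);         % projected gradient (4)
crit_sbb    = norm(gp, inf);                     % used by SBB
crit_others = norm(x - max(x - g, 0), inf);      % used by the other solvers
% crit_others <= crit_sbb componentwise, so the SBB test is the stricter one.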

4.1 Summary of the data sets

Our first set of results involves simulations, where we simulate NNLS problems and assess the computational efficiency of various methods in a 'clean' setup. For each data set, we generate a nonnegative matrix A and compute an unconstrained observation vector b_t with a pre-determined sparse nonnegative solution x*, so that b_t ← Ax*. Note that these choices for A and b_t form an NNLS problem where the solution is degenerate, i.e. the gradient completely disappears at the solution. We include a brief summary of the results on such degenerate problems at the end of this section.

However, degeneracy in NNLS means that the solution can be obtained by any ordinary least squares method, hence we further refine the generated b_t to ensure non-degeneracy at the solution. Specifically, given A, b_t and x*, we identify A(x*), then generate an augmenting vector y such that y_i > 0 for i ∈ A(x*). Finally, a constrained observation vector b is produced by solving the linear system A^Tb = A^Tb_t − y. Note that x* is still a solution for the newly generated NNLS problem with A and b, and it is 'clean' since it satisfies the KKT complementarity conditions strictly.
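A sketch of this construction follows, under the assumption (implicit above) that the augmenting vector y is zero on the inactive variables, so that ∇f(x*) = y and the KKT conditions hold strictly; the backslash call is just one convenient way of producing some b with A^Tb = A^Tb_t − y, and the problem sizes are illustrative only:

% Simulate a non-degenerate NNLS instance with known solution xstar.
m = 2000; n = 800; density = 0.05;                 % illustrative sizes only
A = sprand(m, n, density);                         % nonnegative random matrix
xstar = zeros(n, 1);
inact = randperm(n, round(0.25*n));                % inactive (positive) variables
xstar(inact) = rand(numel(inact), 1);
bt = A * xstar;                                    % degenerate observation: grad f(xstar) = 0
act = setdiff(1:n, inact);                         % active set A(xstar)
y = zeros(n, 1);
y(act) = 0.1 + rand(numel(act), 1);                % y_i > 0 on the active set, 0 elsewhere
b = A' \ (A'*bt - y);                              % any b satisfying A'*b = A'*bt - y
% In exact arithmetic, grad f(xstar) = A'*(A*xstar - b) = y, i.e. strict complementarity.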

Based on the above scheme, we simulate several NNLS problems of varying size (data set P1) and varying sparsity (data set P2) – see Table 1.

Our second set of results is on seven data sets drawn from real-world applications (see Table 2). We obtained the first two data sets, namely 'ash958' and 'well1850', from the MatrixMarket; they arise from least-squares problems, and we impose nonnegativity constraints on the solution to form NNLS problems. The third data set 'orl-face' is the database of face images, for which one can build a face recognition algorithm based on solving NNLS [31]. We obtained the remaining four data sets, 'real-sim', 'mnist', 'news20', and 'webspam', from the collection of LIBSVM data sets. On these four data sets one can interpret the use of NNLS as a regression method for predicting labels for a classification task.

Table 1. The size of dense and uniformly random nonnegative matrices A in data set P1 and #nnz (i.e., number of nonzeros) of the matrices (of size 25600 × 9600) in data set P2. #active denotes the number of active variables in x*, and ‖∇f_+(x*)‖_∞ gives the max-norm of the projected gradient at x*. Also reported are the condition number κ(A^TA) and, in the row labelled 'inactive submatrix', the condition number of the submatrix of A^TA formed by retaining only the rows and columns corresponding to inactive variables in x*. The Matlab commands cond and condest were used to estimate these numbers for dense and sparse matrices, respectively.

P1                             P1-1           P1-2           P1-3           P1-4           P1-5           P1-6
Rows                           600            1,200          2,400          4,800          9,600          19,200
Columns                        400            800            1,600          3,200          6,400          12,800
#active                        301            599            1,178          2,372          4,743          9,459
‖∇f_+(x*)‖_∞                   7.84×10^−12    5.82×10^−11    2.93×10^−10    2.19×10^−9     1.11×10^−8     6.70×10^−8
κ(A^TA)                        3.19×10^4      7.12×10^4      1.35×10^5      2.85×10^5      5.64×10^5      1.14×10^6
κ(A^TA), inactive submatrix    7.65×10^2      1.72×10^3      3.62×10^3      7.09×10^3      1.43×10^4      2.95×10^4

P2                             P2-1           P2-2           P2-3           P2-4           P2-5           P2-6
#nnz                           1,225,734      2,445,519      3,659,062      4,866,734      6,068,117      7,263,457
#active                        7,117          7,125          7,117          7,104          7,106          7,106
‖∇f_+(x*)‖_∞                   2.89×10^−12    1.21×10^−11    2.38×10^−11    1.19×10^−10    1.59×10^−10    1.74×10^−10
κ(A^TA)                        5.26×10^3      1.03×10^4      1.45×10^4      1.87×10^4      2.13×10^4      2.70×10^4
κ(A^TA), inactive submatrix    2.42×10^2      4.04×10^2      5.48×10^2      6.99×10^2      9.33×10^2      1.06×10^3

Table 2. The size and sparsity of real-world data sets. #nnz denotes the number of non-zero entries.

Matrix     ash958    well1850    orl-face     real-sim     mnist        news20       webspam
Rows       958       1,850       400          72,309       60,000       19,996       350,000
Columns    292       712         10,304       20,958       780          1,355,191    254
#nnz       1,916     8,755       4,121,478    3,709,083    8,994,156    9,097,916    29,796,333

4.2 Summary of the experiments

For all data sets, each of the compared methods was started at x^0 = 0. We show results with several different stopping thresholds. In summary,

• Tables 3 and 5 use a stopping threshold of 10^−2 (low accuracy);
• Tables 4 and 6 use a stopping threshold of 10^−4 (medium accuracy); and
• Tables that use a stopping threshold of 10^−6 are provided in the appendix.

Tables 3 and 4 show that for small to midsized problems, e.g., P1-1 through P1-4, FNNLS is very competitive with the other methods. But unlike SBB, FNNLS's performance starts to deteriorate like that of the other methods (LBFGS-B, SPG, and ASA) as the data sets increase in size. Tables 5 and 6 show a comparison similar to that in Tables 3 and 4, but this time on problem set P2.

Both sets of results (Tables 3, 4 and 5, 6) show a marked pattern. All methods attain comparable objective function values, but FNNLS (whenever it scales) shows an edge over all the other methods in terms of recovering the active variables.

For low-accuracy solutions (Tables 3 and 5), SBB often lags behind other methods in terms of identifying active variables, but exhibits highly competitive computational efficiency and accuracy otherwise. This competitiveness becomes more pronounced for medium-accuracy solutions (Tables 4 and 6), where SBB estimates the active sets better, but without losing any of its computational advantages.


Table 3. NNLS experiments on data set P1. As the stopping criteria, we set the maximum elapsed time to 1,000 seconds, with ‖∇f_+‖_∞ ≤ 10^−2 for SBB and ‖x − [x − ∇f(x)]_+‖_∞ ≤ 10^−2 for the other methods.

Method    Metric                  P1-1          P1-2          P1-3          P1-4          P1-5          P1-6
FNNLS     Time (s)                0.07          0.39          4.90          38.91         356.93        1000.35
          f(x)                    1.63×10^6     2.61×10^7     3.21×10^8     1.10×10^10    1.02×10^11    9.27×10^11
          #active                 301           599           1178          2372          4743          11247
LBFGS-B   Time (s)                62.56         231.35        949.27        1001.18       1003.07       1006.23
          f(x)                    1.63×10^6     2.61×10^7     3.21×10^8     1.10×10^10    1.02×10^11    9.27×10^11
          #f                      394           731           1520          791           389           183
          #∇f                     394           731           1520          791           389           183
          #active                 156           450           287           1007          729           273
          ‖∇f_+‖_∞                1.06×10^−2    1.40×10^−2    3.59×10^−2    9.52×10^1     4.53×10^2     1.00×10^3
          ‖x − [x − ∇f(x)]_+‖_∞   9.92×10^−3    9.86×10^−3    3.59×10^−2    9.52×10^1     4.53×10^2     1.00×10^3
SPG       Time (s)                5.50          103.80        775.00        1000.14       1001.17       1006.65
          f(x)                    1.63×10^6     2.61×10^7     3.21×10^8     1.10×10^10    1.02×10^11    9.27×10^11
          #f                      5404          12017         23802         11890         3617          897
          #∇f                     4035          9045          17473         8139          1825          465
          #active                 138           0             0             124           2103          3951
          ‖∇f_+‖_∞                1.13×10^−2    1.38×10^−2    2.26×10^−2    3.29×10^0     8.55×10^1     1.56×10^2
          ‖x − [x − ∇f(x)]_+‖_∞   9.83×10^−3    9.85×10^−3    1.34×10^−2    1.07×10^0     2.00×10^1     1.56×10^2
ASA       Time (s)                0.12          1.65          1000.22       1001.62       1002.49       1005.20
          f(x)                    1.63×10^6     2.61×10^7     3.21×10^8     1.10×10^10    1.02×10^11    9.27×10^11
          #f                      470           447           46530         12409         3202          843
          #∇f                     278           256           30665         8243          2085          492
          #active                 77            241           0             0             0             0
          ‖∇f_+‖_∞                9.41×10^−3    9.95×10^−3    7.03×10^−1    2.31×10^0     7.48×10^0     1.59×10^1
          ‖x − [x − ∇f(x)]_+‖_∞   9.41×10^−3    9.23×10^−3    4.34×10^−1    1.51×10^0     4.30×10^0     9.66×10^0
SBB       Time (s)                0.11          1.38          9.87          29.86         172.30        1001.28
          f(x)                    1.63×10^6     2.61×10^7     3.21×10^8     1.10×10^10    1.02×10^11    9.27×10^11
          #f                      2             2             3             2             3             5
          #∇f                     202           219           315           247           360           519
          #active                 118           0             547           2219          2802          7950
          ‖∇f_+‖_∞                6.28×10^−3    9.82×10^−3    9.21×10^−3    5.78×10^−3    9.23×10^−3    2.15×10^0
          ‖x − [x − ∇f(x)]_+‖_∞   6.28×10^−3    9.82×10^−3    7.49×10^−3    5.69×10^−3    9.23×10^−3    9.99×10^−1


Table 4. NNLS experiments on data set P1. As the stopping criteria, we set the maximum elapsed time to 1,000 seconds, with ‖∇f_+‖_∞ ≤ 10^−4 for SBB and ‖x − [x − ∇f(x)]_+‖_∞ ≤ 10^−4 for the other methods.

Method    Metric                  P1-1          P1-2          P1-3          P1-4          P1-5          P1-6
FNNLS     Time (s)                0.07          0.39          4.90          38.91         356.93        1000.35
          f(x)                    1.63×10^6     2.61×10^7     3.21×10^8     1.10×10^10    1.02×10^11    9.27×10^11
          #active                 301           599           1178          2372          4743          11247
LBFGS-B   Time (s)                31.61         171.38        923.93        1000.85       1002.87       1005.07
          f(x)                    1.63×10^6     2.61×10^7     3.21×10^8     1.10×10^10    1.02×10^11    9.27×10^11
          #f                      581           828           1520          801           389           183
          #∇f                     581           828           1520          801           389           183
          #active                 269           324           287           945           729           273
          ‖∇f_+‖_∞                1.07×10^−3    6.80×10^−3    3.59×10^−2    5.71×10^1     4.53×10^2     1.00×10^3
          ‖x − [x − ∇f(x)]_+‖_∞   3.61×10^−4    6.01×10^−3    3.59×10^−2    3.01×10^1     4.53×10^2     1.00×10^3
SPG       Time (s)                6.30          114.38        1000.13       1000.11       1000.47       1004.91
          f(x)                    1.63×10^6     2.61×10^7     3.21×10^8     1.10×10^10    1.02×10^11    9.27×10^11
          #f                      6575          13613         32125         11915         3606          897
          #∇f                     4757          9593          19275         8160          1822          465
          #active                 300           563           1162          30            1753          3951
          ‖∇f_+‖_∞                1.03×10^−4    5.81×10^−4    2.30×10^−4    2.73×10^0     7.13×10^1     1.56×10^2
          ‖x − [x − ∇f(x)]_+‖_∞   1.03×10^−4    2.13×10^−4    2.30×10^−4    2.73×10^0     7.13×10^1     1.56×10^2
ASA       Time (s)                0.17          2.39          1000.74       1000.88       1001.85       1003.22
          f(x)                    1.63×10^6     2.61×10^7     3.21×10^8     1.10×10^10    1.02×10^11    9.27×10^11
          #f                      634           610           46,739        12,481        3193          843
          #∇f                     404           378           30,803        8291          2079          492
          #active                 301           599           0             0             0             0
          ‖∇f_+‖_∞                4.69×10^−5    4.53×10^−5    7.22×10^−1    3.41×10^0     7.37×10^0     1.59×10^1
          ‖x − [x − ∇f(x)]_+‖_∞   4.69×10^−5    4.53×10^−5    4.60×10^−1    2.02×10^0     4.30×10^0     9.66×10^0
SBB       Time (s)                0.14          1.75          12.66         42.91         227.00        1000.95
          f(x)                    1.63×10^6     2.61×10^7     3.21×10^8     1.10×10^10    1.02×10^11    9.27×10^11
          #f                      2             2             4             3             4             5
          #∇f                     283           277           409           355           474           519
          #active                 301           597           1178          2364          4708          7950
          ‖∇f_+‖_∞                7.25×10^−5    6.25×10^−5    9.82×10^−5    7.22×10^−5    9.79×10^−5    2.15×10^0
          ‖x − [x − ∇f(x)]_+‖_∞   7.25×10^−5    6.25×10^−5    9.82×10^−5    7.22×10^−5    9.79×10^−5    9.99×10^−1


Table 5. NNLS experiments on data set P2. As the stopping criteria, we set the maximum elapsed time to 1,000 seconds, with ‖∇f_+‖_∞ ≤ 10^−2 for SBB and ‖x − [x − ∇f(x)]_+‖_∞ ≤ 10^−2 for the other methods.

Method    Metric                  P2-1          P2-2          P2-3          P2-4          P2-5          P2-6
FNNLS     Time (s)                1000.10       1000.95       1000.52       1000.81       1000.64       1000.50
          f(x)                    1.50×10^10    7.16×10^9     6.96×10^10    1.56×10^11    6.87×10^11    3.28×10^11
          #active                 7541          8036          8104          8103          8110          8122
LBFGS-B   Time (s)                339.12        636.40        613.56        955.52        885.94        900.53
          f(x)                    1.50×10^10    7.16×10^9     6.96×10^10    1.56×10^11    6.87×10^11    3.28×10^11
          #f                      114           185           177           411           290           297
          #∇f                     114           185           177           411           290           297
          #active                 4428          6697          1822          747           2352          6121
          ‖∇f_+‖_∞                1.74×10^−2    7.23×10^−3    6.57×10^−2    1.33×10^−1    2.80×10^−1    2.63×10^−2
          ‖x − [x − ∇f(x)]_+‖_∞   1.48×10^−2    6.61×10^−3    6.57×10^−2    1.05×10^−1    2.59×10^−1    2.63×10^−2
SPG       Time (s)                2.47          30.65         29.70         1000.27       327.52        249.19
          f(x)                    1.50×10^10    7.16×10^9     6.96×10^10    1.56×10^11    6.87×10^11    3.28×10^11
          #f                      114           763           560           29325         7791          4854
          #∇f                     113           754           463           3200          1033          722
          #active                 1             5574          0             3837          5202          1009
          ‖∇f_+‖_∞                9.89×10^−3    5.00×10^−3    1.27×10^−2    1.93×10^−2    1.51×10^−2    6.92×10^−2
          ‖x − [x − ∇f(x)]_+‖_∞   9.89×10^−3    4.46×10^−3    1.27×10^−2    1.93×10^−2    1.51×10^−2    6.32×10^−2
ASA       Time (s)                3.05          11.33         1000.89       1001.10       1000.84       196.58
          f(x)                    1.50×10^10    7.16×10^9     6.96×10^10    1.56×10^11    6.87×10^11    3.28×10^11
          #f                      210           363           19827         15947         12300         2133
          #∇f                     107           243           15473         11295         9370          1487
          #active                 5924          3939          0             0             0             5642
          ‖∇f_+‖_∞                1.75×10^−2    8.93×10^−3    3.01×10^1     4.36×10^1     8.39×10^1     9.37×10^−3
          ‖x − [x − ∇f(x)]_+‖_∞   9.89×10^−3    4.34×10^−3    3.01×10^1     4.36×10^1     8.39×10^1     9.37×10^−3
SBB       Time (s)                1.55          3.29          7.77          9.42          13.59         14.86
          f(x)                    1.50×10^10    7.16×10^9     6.96×10^10    1.56×10^11    6.87×10^11    3.28×10^11
          #f                      0             0             1             1             1             1
          #∇f                     62            71            114           105           120           111
          #active                 7094          3667          4833          6922          2552          6608
          ‖∇f_+‖_∞                8.28×10^−3    8.36×10^−3    8.04×10^−3    5.20×10^−3    9.65×10^−3    7.93×10^−3
          ‖x − [x − ∇f(x)]_+‖_∞   8.28×10^−3    7.44×10^−3    8.04×10^−3    4.85×10^−3    9.65×10^−3    6.96×10^−3


Table 6. NNLS experiments on data set P2. As the stopping criteria, we set the maximum elapsed time to 1,000 seconds, with ‖∇f_+‖_∞ ≤ 10^−4 for SBB and ‖x − [x − ∇f(x)]_+‖_∞ ≤ 10^−4 for the other methods.

Method    Metric                  P2-1          P2-2          P2-3          P2-4          P2-5          P2-6
FNNLS     Time (s)                1000.10       1000.95       1000.52       1000.81       1000.64       1000.50
          f(x)                    1.50×10^10    7.16×10^9     6.96×10^10    1.56×10^11    6.87×10^11    3.28×10^11
          #active                 7541          8036          8104          8103          8110          8122
LBFGS-B   Time (s)                338.08        644.42        614.44        954.45        886.82        900.28
          f(x)                    1.50×10^10    7.16×10^9     6.96×10^10    1.56×10^11    6.87×10^11    3.28×10^11
          #f                      114           218           177           411           290           297
          #∇f                     114           218           177           411           290           297
          #active                 4428          6697          1822          747           2352          6121
          ‖∇f_+‖_∞                1.74×10^−2    7.23×10^−3    6.57×10^−2    1.33×10^−1    2.80×10^−1    2.63×10^−2
          ‖x − [x − ∇f(x)]_+‖_∞   1.48×10^−2    6.61×10^−3    6.57×10^−2    1.05×10^−1    2.59×10^−1    2.63×10^−2
SPG       Time (s)                3.10          32.86         1000.33       1000.29       1000.35       1000.75
          f(x)                    1.50×10^10    7.16×10^9     6.96×10^10    1.56×10^11    6.87×10^11    3.28×10^11
          #f                      145           849           42,315        29,380        26,911        22,391
          #∇f                     142           797           1367          3201          1185          1164
          #active                 3180          7096          7062          3837          1382          5334
          ‖∇f_+‖_∞                7.40×10^−3    5.43×10^−4    5.07×10^−4    1.93×10^−2    1.39×10^−2    1.06×10^−2
          ‖x − [x − ∇f(x)]_+‖_∞   1.85×10^−3    5.43×10^−4    5.07×10^−4    1.93×10^−2    1.12×10^−2    6.60×10^−3
ASA       Time (s)                4.35          14.32         1000.53       1000.82       1001.30       203.84
          f(x)                    1.50×10^10    7.16×10^9     6.96×10^10    1.56×10^11    6.87×10^11    3.28×10^11
          #f                      290           461           19838         15937         12300         2203
          #∇f                     161           307           15481         11287         9370          1545
          #active                 7117          7124          0             0             0             7106
          ‖∇f_+‖_∞                7.14×10^−5    4.25×10^−5    1.80×10^1     2.93×10^1     8.39×10^1     5.10×10^−5
          ‖x − [x − ∇f(x)]_+‖_∞   7.14×10^−5    4.25×10^−5    1.50×10^1     2.26×10^1     8.39×10^1     5.10×10^−5
SBB       Time (s)                1.90          4.58          9.52          11.92         17.44         18.96
          f(x)                    1.50×10^10    7.16×10^9     6.96×10^10    1.56×10^11    6.87×10^11    3.28×10^11
          #f                      0             0             1             1             1             1
          #∇f                     77            99            140           133           156           142
          #active                 7085          7123          7112          7046          7086          7098
          ‖∇f_+‖_∞                9.07×10^−5    3.35×10^−5    7.79×10^−5    9.10×10^−5    8.41×10^−5    9.40×10^−5
          ‖x − [x − ∇f(x)]_+‖_∞   9.07×10^−5    3.35×10^−5    7.79×10^−5    6.48×10^−5    8.41×10^−5    9.40×10^−5


Table 7. NNLS experiments on degenerate problems derived from P2. As the stopping criteria, we set the maximum elapsed time to 1000 seconds, ||∇f+||∞ ≤ 10^-4 for SBB and ||x − [x − ∇f(x)]+||∞ ≤ 10^-4 for the other methods.

Method    Metric                   P2-1        P2-2        P2-3        P2-4        P2-5        P2-6

FNNLS     Time (s)                 1000.66     1000.41     1000.28     1000.35     1000.92     1001.47
          f(x)                     1.50×10^10  7.16×10^9   6.96×10^10  1.56×10^11  6.87×10^11  3.28×10^11
          #active                  7200        7483        7516        7499        7510        7512

LBFGS-B   Time (s)                 338.76      645.73      613.80      954.55      887.18      901.17
          f(x)                     1.50×10^10  7.16×10^9   6.96×10^10  1.56×10^11  6.87×10^11  3.28×10^11
          #f                       114         218         177         411         290         297
          #∇f                      114         218         177         411         290         297
          #active                  4428        6697        1822        747         2352        6121
          ||∇f+||∞                 1.72×10^-2  6.61×10^-3  6.57×10^-2  1.27×10^-1  2.75×10^-1  2.63×10^-2
          ||x − [x − ∇f(x)]+||∞    1.48×10^-2  6.61×10^-3  6.57×10^-2  1.05×10^-1  2.59×10^-1  2.63×10^-2

SPG       Time (s)                 3.09        32.86       1000.16     1001.65     1000.73     1000.08
          f(x)                     1.50×10^10  7.16×10^9   6.96×10^10  1.56×10^11  6.87×10^11  3.28×10^11
          #f                       145         849         42315       29380       26911       22353
          #∇f                      142         797         1367        3201        1185        1163
          #active                  3180        7096        7062        3837        1382        5334
          ||∇f+||∞                 1.85×10^-3  5.43×10^-4  5.07×10^-4  1.93×10^-2  1.12×10^-2  6.60×10^-3
          ||x − [x − ∇f(x)]+||∞    1.85×10^-3  5.43×10^-4  5.07×10^-4  1.93×10^-2  1.12×10^-2  6.60×10^-3

ASA       Time (s)                 4.35        14.30       1001.02     1001.48     1001.38     203.52
          f(x)                     1.50×10^10  7.16×10^9   6.96×10^10  1.56×10^11  6.87×10^11  3.28×10^11
          #f                       290         461         19848       15958       12300       2203
          #∇f                      161         307         15489       11303       9370        1545
          #active                  7117        7124        0           0           0           7106
          ||∇f+||∞                 7.14×10^-5  4.25×10^-5  2.58×10^1   3.41×10^1   8.39×10^1   5.10×10^-5
          ||x − [x − ∇f(x)]+||∞    7.14×10^-5  4.25×10^-5  2.58×10^1   2.76×10^1   8.39×10^1   5.10×10^-5

SBB       Time (s)                 2.22        5.39        9.65        15.40       17.80       22.94
          f(x)                     1.50×10^10  7.16×10^9   6.96×10^10  1.56×10^11  6.87×10^11  3.28×10^11
          #f                       0           1           1           1           1           1
          #∇f                      90          116         142         172         159         172
          #active                  0           0           3555        0           2988        6147
          ||∇f+||∞                 8.78×10^-5  5.70×10^-5  5.62×10^-5  9.19×10^-5  7.89×10^-5  8.41×10^-5
          ||x − [x − ∇f(x)]+||∞    6.43×10^-5  5.70×10^-5  5.62×10^-5  9.19×10^-5  7.89×10^-5  8.41×10^-5


Also note that the performance of FNNLS substantially degrades on problem set P2 as compared to the dense problems of data set P1. We attribute this degradation to FNNLS's computation of A^T A: although this product effectively reduces the problem size when A ∈ R^{m×n} and m ≫ n, it is likely to be dense even when A is sparse, and can therefore actually increase the amount of computation required. Tables 3–6 offer numerical evidence that SBB frequently outperforms the other methods, and this difference becomes more pronounced with increasing data set size. In Appendix 3, we provide additional results with a stopping threshold of 10^-6, which further confirm the conclusions drawn above.
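The densification effect is easy to see numerically. The following is a rough, self-contained illustration of our own (using SciPy's sparse module, not anything from the paper): for a random sparse A, the Gram matrix A^T A is far denser than A itself.

    # Illustrative only: shows how forming the Gram matrix destroys sparsity.
    import scipy.sparse as sp

    m, n, density = 20000, 2000, 0.005
    A = sp.random(m, n, density=density, format="csr", random_state=0)

    G = (A.T @ A).tocsr()                 # the n-by-n Gram matrix that FNNLS-type methods form
    print("density of A    :", A.nnz / (m * n))   # about 0.005
    print("density of A^T A:", G.nnz / (n * n))   # typically a large fraction of 1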

As mentioned earlier, we also show some results on degenerate problems – see Table 7 and Figure 6. We observe that most of the results largely mirror those of the non-degenerate problems;

[Figure 6 appears here: six bar charts, panels (a)–(f), one per problem P2-1 to P2-6, each plotting #active for FNNLS, LBFGS-B, SPG, ASA and SBB on the non-degenerate and degenerate problem versions.]

Figure 6. Number of active variables discovered in the non-degenerate problems (Table 6, left bars) and the corresponding degenerate problems (Table 7, right bars). Note that in our synthetic data sets, non-degenerate and degenerate problems vary only in the components of the gradient for the final active variables; hence the target number of active variables, as well as the objective value and the values of ‖∇f+‖∞, remain the same.


Table 8. NNLS experiments on real-world data sets. For LBFGS-B, SPG, ASA and SBB, as before, we set the maximum elapsed time to 1000 seconds and used ||∇f+||∞ for SBB and ||x − [x − ∇f(x)]+||∞ for the other methods. For the data sets ash958, well1850 and orl-face, we set the threshold to 10^-6, since we have the true solutions where ∇f+ vanishes. For the remaining four large-scale problems, we relax the condition to ||∇f+||∞ ≤ 10^-1 to obtain solutions of low accuracy. We mainly report the elapsed running time and the objective function value for each method. For LBFGS-B, SPG, ASA and SBB we additionally report the number of objective function computations (#f), the number of gradient evaluations (#∇f), the approximate optimality condition (||∇f+||∞), and the value of ||x − [x − ∇f(x)]+||∞ at the final iteration. In this experiment, FNNLS, LBFGS-B and SPG did not scale for 'news20'.

Method    Metric                   ash958       well1850     orl-face    real-sim    mnist       news20      webspam

FNNLS     Time (s)                 0.04         0.11         1000.19     1000.77     15.15       –           26.70
          f(x)                     5.70×10^-30  3.80×10^-30  1.11×10^5   1.48×10^3   1.58×10^3   –           1.44×10^4

LBFGS-B   Time (s)                 2.58         33.05        1001.87     1002.86     1000.43     –           194.18
          f(x)                     3.54×10^-13  6.65×10^-11  2.16×10^4   1.13×10^3   1.58×10^3   –           1.44×10^4
          #f                       26           134          322         137         2741        –           404
          #∇f                      26           134          322         137         2741        –           404
          ||∇f+||∞                 8.76×10^-7   8.26×10^-7   6.93×10^3   8.69×10^-1  1.09×10^0   –           5.92×10^-2
          ||x − [x − ∇f(x)]+||∞    8.76×10^-7   8.26×10^-7   2.31×10^3   8.69×10^-1  1.09×10^0   –           5.92×10^-2

SPG       Time (s)                 0.05         0.05         661.94      30.48       1000.02     –           1000.63
          f(x)                     2.72×10^-13  9.41×10^-11  2.12×10^4   1.10×10^3   1.58×10^3   –           1.77×10^4
          #f                       29           212          50000       597         9356        –           2005
          #∇f                      29           172          1771        405         6018        –           1605
          ||∇f+||∞                 9.68×10^-7   9.57×10^-7   2.78×10^-4  3.14×10^-1  7.49×10^-1  –           1.18×10^4
          ||x − [x − ∇f(x)]+||∞    9.68×10^-7   9.57×10^-7   2.78×10^-4  9.70×10^-2  7.49×10^-1  –           4.42×10^0

ASA       Time (s)                 0.01         0.05         23.50       28.94       352.42      150.29      661.80
          f(x)                     9.48×10^-13  9.37×10^-11  2.12×10^4   1.10×10^3   1.58×10^3   6.78×10^1   1.44×10^4
          #f                       60           320          1035        497         3651        867         1243
          #∇f                      38           213          686         427         2004        527         1039
          ||∇f+||∞                 1.27×10^-6   9.59×10^-7   8.73×10^-7  4.11×10^-1  3.56×10^-1  3.54×10^-1  9.71×10^-2
          ||x − [x − ∇f(x)]+||∞    9.50×10^-7   9.59×10^-7   8.73×10^-7  9.94×10^-2  9.97×10^-2  9.94×10^-2  9.71×10^-2

SBB       Time (s)                 0.01         0.03         18.53       10.42       1000.10     67.46       209.80
          f(x)                     1.89×10^-13  9.89×10^-11  2.12×10^4   1.10×10^3   1.58×10^3   4.99×10^1   1.44×10^4
          #f                       0            1            5           1           62          2           3
          #∇f                      30           103          534         142         6271        224         303
          ||∇f+||∞                 5.78×10^-7   9.61×10^-7   7.89×10^-7  9.73×10^-2  2.60×10^-1  7.62×10^-2  6.05×10^-2
          ||x − [x − ∇f(x)]+||∞    4.31×10^-7   9.61×10^-7   7.89×10^-7  9.73×10^-2  1.17×10^-1  7.62×10^-2  6.05×10^-2


hence, we only present the results on data set P2, with a medium-accuracy threshold of 10^-4. Comparing Table 7 with its non-degenerate counterpart Table 6, one can observe that most of the values match each other. The only noticeable difference occurs in the number of active variables found at termination: FNNLS and SBB miss some of the active variables. Given that SBB still performs competitively in terms of the convergence criterion ‖∇f+‖∞, we attribute this phenomenon to the lack of an explicit active-set identification phase in SBB: each component of the gradient may diminish too quickly near the true active variables, which in turn prevents the iterates from obtaining a sufficient reduction for the near-active variables. The other methods, LBFGS-B, SPG and ASA, in contrast, share an explicit exploration phase of the active face throughout the iterations; hence, when they terminate 'correctly', the active set in both problem sets remains virtually intact (Figure 6).
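For reference, the two stopping quantities reported in the tables can be computed as follows. This is a small NumPy sketch of our own for f(x) = 0.5‖Ax − b‖² with x ≥ 0; it assumes the usual definition of the projected gradient ∇f+ on the nonnegative orthant and counts '#active' simply as the variables sitting at the bound, which may differ in detail from the conventions fixed earlier in the paper.

    import numpy as np

    def grad(A, b, x):
        return A.T @ (A @ x - b)                         # gradient of 0.5*||Ax - b||^2

    def projected_grad_norm(A, b, x):
        g = grad(A, b, x)
        gp = np.where(x > 0, g, np.minimum(g, 0.0))      # at the bound, only inward-pushing parts count
        return np.abs(gp).max()                          # ||∇f+||_inf, the criterion used for SBB

    def fixed_point_residual(A, b, x):
        g = grad(A, b, x)
        return np.abs(x - np.maximum(x - g, 0.0)).max()  # ||x - [x - ∇f(x)]+||_inf, used for the others

    def num_active(x):
        return int(np.sum(x == 0))                       # variables sitting at the bound

    rng = np.random.default_rng(1)
    A, b = rng.standard_normal((50, 20)), rng.standard_normal(50)
    x = np.maximum(rng.standard_normal(20), 0.0)
    print(projected_grad_norm(A, b, x), fixed_point_residual(A, b, x), num_active(x))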

Our final set of results (Table 8) is on real-world data sets. FNNLS behaves as before: it runs well on small matrices but rapidly becomes impractical for larger ones. The drawback of forming A^T A appears even more prominently for real-world problems, where highly sparse matrices A with m ≤ n are not uncommon. For example, in comparison to the other methods, FNNLS is heavily penalised on 'orl-face' of size 400 × 10,304 and 'real-sim' of size 72,309 × 20,958, but has a considerable advantage on 'mnist' of size 60,000 × 780, though it eventually fails to scale for 'news20', where m ≪ n. From Table 8, we observe that SBB is highly competitive on several real-world data sets too: it frequently and quickly attains solutions that are as accurate as (or more accurate than) those of its competitors in terms of ‖∇f+‖∞.

5. Conclusion and discussion

In this paper we have presented a new non-monotonic algorithm for solving non-negative least squares (NNLS) problems. Our algorithm builds on the unconstrained Barzilai-Borwein method [1], whose simplicity it retains. Moreover, our method employs an optimistic diminishment strategy which allows it to ensure convergence without a potentially expensive dependence on linesearch. We reported numerical results of our method applied to synthetic and real-world data sets, showing that our Matlab implementation performs competitively across a wide range of problems, both in terms of running time and accuracy.

Acknowledgements

We acknowledge support of NSF grants CCF-0431257 and CCF-0728879.

Notes

1. Other choices are also possible, e.g., lim_{k→∞} Σ_{i=1}^{k} β^i = ∞ and lim_{k→∞} Σ_{i=1}^{k} (β^i)² < ∞.

2. This tolerance is very tight for first-order methods. We will later see that in the experiments with real-world data, such tight tolerances are often not achieved.

3. Note that in practice we need not compute U at all; without compromising the theoretical guarantees, we can replace it by the maximum value permitted by the machine, provided that the solution x∗ is representable without overflow.

4. To run our method for rank-deficient A^T A in practice, we do not need to compute σ^+_min(A) or λ^+_min(A^T A); it is sufficient to place an arbitrarily large upper bound α_U and a small positive lower bound α_L on the subspace-BB computation (7). The diminishing sequence {β^r} safeguards against the step sizes becoming too large or too small, thereby eventually ensuring convergence (a small sketch of this safeguard follows these notes).

5. SPG and ASA have similar parameters, 'm' and 'asaParm.nm', respectively. In our experiments, we could improve the performance of ASA by setting 'asaParm.nm=100'; however, we set 'm=5' for SPG, since 'm=100' significantly degraded its performance.

6. http://math.nist.gov/MatrixMarket
7. http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html
8. http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets
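The safeguard mentioned in note 4 can be sketched in a few lines. The values of α_L, α_U and η below are illustrative choices of our own, not the paper's.

    # Minimal sketch: clamp a raw BB step into [alpha_L, alpha_U] and apply a diminishing scalar.
    import numpy as np

    def safeguarded_step(alpha_bb, beta, alpha_L=1e-12, alpha_U=1e12, eta=0.99):
        alpha = float(np.clip(alpha_bb, alpha_L, alpha_U))  # keeps the step finite and positive
        return beta * alpha, eta * beta                     # step to use now, and the next beta

    step, beta = safeguarded_step(alpha_bb=3.7e15, beta=1.0)  # a huge raw step is capped at beta * 1e12
    print(step, beta)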


References

[1] J. Barzilai and J.M. Borwein, Two-point step size gradient methods, IMA J. Numer. Anal. 8 (1988), pp. 141–148.
[2] A. Ben-Tal and A. Nemirovski, Non-Euclidean restricted memory level method for large-scale convex optimization, Math. Progr. A 102 (2005), pp. 407–456.
[3] D.P. Bertsekas, Nonlinear Programming, 2nd ed., Athena Scientific, MA, 1999.
[4] M. Bierlaire, P.L. Toint, and D. Tuyttens, On iterative algorithms for linear least squares problems with bound constraints, Linear Algebra Appl. 143 (1991), pp. 111–143.
[5] E.G. Birgin, J.M. Martínez, and M. Raydan, Nonmonotone spectral projected gradient methods on convex sets, SIAM J. Optim. 10 (2000), pp. 1196–1211.
[6] E.G. Birgin, J.M. Martínez, and M. Raydan, Algorithm 813: SPG – software for convex-constrained optimization, ACM Trans. Math. Softw. 27 (2001), pp. 340–349.
[7] E.G. Birgin, J.M. Martínez, and M. Raydan, Large-scale active-set box-constrained optimization method with spectral projected gradients, Comput. Optim. Appl. 23 (2002), pp. 101–125.
[8] E.G. Birgin, J.M. Martínez, and M. Raydan, Inexact spectral projected gradient methods on convex sets, IMA J. Numer. Anal. 23 (2003), pp. 539–559.
[9] Å. Björck, Numerical Methods for Least Squares Problems, SIAM, Philadelphia, PA, 1996.
[10] R. Bro and S.D. Jong, A fast non-negativity-constrained least squares algorithm, J. Chemometrics 11 (1997), pp. 393–401.
[11] J.V. Burke and J.J. Moré, On the identification of active constraints, SIAM J. Numer. Anal. 25 (1988), pp. 1197–1211.
[12] R. Byrd, P. Lu, J. Nocedal, and C. Zhu, A limited memory algorithm for bound constrained optimization, SIAM J. Sci. Comput. 16 (1995), pp. 1190–1208.
[13] W.L. Cruz and M. Raydan, Nonmonotone spectral methods for large-scale nonlinear systems, Optim. Meth. Softw. 18 (2003), pp. 583–599.
[14] Y.H. Dai and R. Fletcher, Projected Barzilai-Borwein methods for large-scale box-constrained quadratic programming, Numer. Math. 100 (2005), pp. 21–47.
[15] A. Friedlander, J.M. Martínez, and M. Raydan, A new method for large-scale box constrained convex quadratic minimization problems, Optim. Meth. Softw. 5 (1995), pp. 55–74.
[16] P.E. Gill, W. Murray, and M.H. Wright, Practical Optimization, Academic Press, New York, NY, 1981.
[17] L. Grippo and M. Sciandrone, Nonmonotone globalization techniques for the Barzilai-Borwein gradient method, Comput. Optim. Appl. 23 (2002), pp. 143–169.
[18] W.W. Hager and H. Zhang, A new active set algorithm for box constrained optimization, SIAM J. Optim. 17 (2006), pp. 526–557.
[19] M. Hirsch, S. Sra, B. Schölkopf, and S. Harmeling, Efficient filter flow for space-variant multiframe blind deconvolution, in CVPR, June 2010.
[20] D. Kim, S. Sra, and I.S. Dhillon, Fast Newton-type methods for the least squares nonnegative matrix approximation problem, in SIAM DM, 2007.
[21] C.L. Lawson and R.J. Hanson, Solving Least Squares Problems, Prentice-Hall, Englewood Cliffs, NJ, 1974.
[22] C.J. Lin and J.J. Moré, Newton's method for large bound-constrained optimization problems, SIAM J. Optim. 9 (1999), pp. 1100–1127.
[23] J.J. Moré and G. Toraldo, On the solution of large quadratic programming problems with bound constraints, SIAM J. Optim. 1 (1991), pp. 93–113.
[24] J. Nagy and Z. Strakos, Enforcing nonnegativity in image reconstruction algorithms, Math. Model. Estimation Imag. 4121 (2000), pp. 182–190.
[25] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM J. Optim. 19 (2009), pp. 1574–1609.
[26] J. Nocedal and S. Wright, Numerical Optimization, 2nd ed., Springer, Berlin, 2006.
[27] M. Raydan, On the Barzilai and Borwein choice of the steplength for the gradient method, IMA J. Numer. Anal. 13 (1993), pp. 321–326.
[28] A.J. Reader and H. Zaidi, Advances in PET image reconstruction, PET Clin. 2 (2007), pp. 173–190.
[29] J.B. Rosen, The gradient projection method for nonlinear programming. Part I: Linear constraints, J. SIAM 8 (1960), pp. 181–217.
[30] K. Schittkowski, The numerical solution of constrained linear least-squares problems, IMA J. Numer. Anal. 3 (1983), pp. 11–36.
[31] N. Vo, B. Moran, and S. Challa, Nonnegative-least-square classifier for face recognition, Adv. Neural Networks – ISNN 2009 (2009), pp. 449–456.

Appendix 1. Ordinary algorithm (OA) counterexample

We illustrate in Figure A1 that naively plugging (2) into GP does not work. Our illustration is inspired by the instructive counterexample of [14] (Table A1).


[Figure A1 appears here: level sets of the objective together with the iterate paths of OA and SBB. Marked points include the initial point x0 = (0, 0), the iterates x1, x2, (x3), x3, the intermediate point (x4) = (−1.9013, −4.5088), the nonnegative optimum x∗ = (2.3729, 0), and the unconstrained optimum (3, −1).]

Figure A1. Comparison between OA and SBB. The figure shows a 2D counterexample where OA fails to converge.

Given

    A = [0.8147  0.1270        b = [2.3172
         0.9058  0.9134],           1.8040],

we start both OA and SBB at the same initial point x0 = [0, 0]^T. At iteration k, each method first computes an intermediate point (x^k); this point is then projected onto R^2_+ to obtain a feasible iterate x^k. Both methods generate the same iterate sequence x0, x1, x2, (x3), x3 for the first three iterations. SBB starts behaving differently at the fourth iteration, where it converges to the optimal solution x∗. OA, in contrast, generates (x4), which upon subsequent projection brings it back to the initial point x0, leading OA to cycle indefinitely without converging.

Table A1. Coordinates of the iterates in Figure A1.

Iterate      x0         x1                  x2                  (x3)
Fixed set    {}         {}                  {}                  {}
OA           [0, 0]     [1.7934, 0.6893]    [1.8971, 0.5405]    [2.9779, −0.9683]
SBB          [0, 0]     [1.7934, 0.6893]    [1.8971, 0.5405]    [2.9779, −0.9683]

Iterate      x3             (x4)                    x4
Fixed set    {2}            {2}
OA           [2.9779, 0]    [−1.9013, −4.5088]      [0, 0] = x0
SBB          [2.9779, 0]    [2.3729, 0] = x∗
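The cycling can be re-enacted numerically. The sketch below is our own re-implementation of the OA recursion (step length from the first formula in (A1), evaluated at the previous iterate), started from x0 and the x1 listed in Table A1; because A and b are printed only to four decimals, the iterates match the table approximately, but the qualitative behaviour is the same.

    import numpy as np

    A = np.array([[0.8147, 0.1270],
                  [0.9058, 0.9134]])
    b = np.array([2.3172, 1.8040])
    H = A.T @ A                                   # Hessian of f(x) = 0.5*||Ax - b||^2
    grad = lambda x: H @ x - A.T @ b

    x_prev = np.array([0.0, 0.0])                 # x0
    x = np.array([1.7934, 0.6893])                # x1, taken from Table A1
    for k in range(1, 4):
        g_prev = grad(x_prev)
        gamma = (g_prev @ g_prev) / (g_prev @ H @ g_prev)   # first BB formula of (A1)
        x_tilde = x - gamma * grad(x)                        # intermediate point
        x_prev, x = x, np.maximum(x_tilde, 0.0)              # project onto the nonnegative orthant
        print(k, x_tilde.round(4), "->", x.round(4))
    # The last intermediate point has both coordinates negative, so the projection sends the
    # iterate back to the origin x0; this is the cycling behaviour depicted in Figure A1.

If the same step is instead taken with the subspace quantity α of (A2), with the second variable fixed at x3, the step length is roughly 0.67 and the next iterate lands at approximately (2.37, 0), mirroring SBB's convergence to x∗ in the table.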

Appendix 2. The various BB-based algorithms

Original BB steps:

    γ^k = ‖∇f(x^{k-1})‖² / ⟨∇f(x^{k-1}), A^T A ∇f(x^{k-1})⟩   and   γ^k = ⟨∇f(x^{k-1}), A^T A ∇f(x^{k-1})⟩ / ‖A^T A ∇f(x^{k-1})‖².   (A1)

Subspace-BB steps:

    α^k = ‖∇̃f^{k-1}‖² / ⟨∇̃f^{k-1}, A^T A ∇̃f^{k-1}⟩   and   α^k = ⟨∇̃f^{k-1}, A^T A ∇̃f^{k-1}⟩ / ‖A^T A ∇̃f^{k-1}‖²,   (A2)

where ∇̃f^{k-1} is defined componentwise by ∇̃_i f^{k-1} = ∇_i f(x^{k-1}) for i ∉ B(x^k), and ∇̃_i f^{k-1} = 0 otherwise.
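The only difference between (A1) and (A2) is thus that the components of the previous gradient indexed by B(x^k) are zeroed out before the step length is formed. A tiny NumPy sketch of our own illustrates this; here the fixed set is simply posited for the example, whereas in the method it is determined from the current iterate (the precise definition of B(·) is given in the main text).

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((30, 8))
    b = rng.standard_normal(30)
    H, Atb = A.T @ A, A.T @ b
    grad = lambda x: H @ x - Atb                     # gradient of f(x) = 0.5*||Ax - b||^2

    x_prev = np.abs(rng.standard_normal(8))          # plays the role of x^{k-1}
    g = grad(x_prev)                                  # ∇f(x^{k-1})
    gamma = (g @ g) / (g @ H @ g)                     # full BB step, first formula of (A1)

    B = np.zeros(8, dtype=bool); B[:2] = True         # pretend variables 0 and 1 lie in B(x^k)
    g_tilde = np.where(B, 0.0, g)                     # subspace gradient of (A2)
    alpha = (g_tilde @ g_tilde) / (g_tilde @ H @ g_tilde)

    print("gamma (A1) =", gamma, "  alpha (A2) =", alpha)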


A.1 Pseudocode for the variants

Variants of the Ordinary Algorithm (OA)

Basic Algorithm (OA).
    Given x^0 and x^1
    for i = 1, 2, ... until (12) is satisfied do
        Compute γ^i using (A1)
        x^{i+1} ← [x^i − γ^i ∇f(x^i)]+
    end for

OA + diminishing scalar (OA+DS).
    Given x^0 and x^1
    for i = 1, 2, ... until (12) is satisfied do
        Compute γ^i using (A1)
        x^{i+1} ← [x^i − β^i · γ^i ∇f(x^i)]+
        β^{i+1} ← η β^i, where η ∈ (0, 1)
    end for

OA + linesearch (OA+LS).
    Given x^0 and x^1
    for i = 1, 2, ... until (12) is satisfied do
        x^0 ← x^{i−1} and x^1 ← x^i
        for j = 1, ..., M do
            Compute γ^j using (A1)
            x^{j+1} ← [x^j − γ^j ∇f(x^j)]+
        end for
        repeat  {linesearch}
            Compute x^{i+1} ← [x^M − τ ∇f(x^M)]+
            Update τ
        until x^{i+1} and x^M satisfy (10)
    end for

Subspace-BB step based algorithms

Subspace-BB only (SA).
    Given x^0 and x^1
    for i = 1, 2, ... until (12) is satisfied do
        Compute α^i using (A2)
        x^{i+1} ← [x^i − α^i ∇f(x^i)]+
    end for

SA + diminishing scalar (SA+DS).
    Given x^0 and x^1
    for i = 1, 2, ... until (12) is satisfied do
        Compute α^i using (A2)
        x^{i+1} ← [x^i − β^i · α^i ∇f(x^i)]+
        β^{i+1} ← η β^i, where η ∈ (0, 1)
    end for

SA + linesearch (SA+LS).
    Given x^0 and x^1
    for i = 1, 2, ... until (12) is satisfied do
        x^0 ← x^{i−1} and x^1 ← x^i
        for j = 1, ..., M do
            Compute α^j using (A2)
            x^{j+1} ← [x^j − α^j ∇f(x^j)]+
        end for
        repeat  {linesearch}
            Compute x^{i+1} ← [x^M − τ ∇f(x^M)]+
            Update τ
        until x^{i+1} and x^M satisfy (10)
    end for
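For concreteness, here is a compact NumPy sketch of the SA+DS recursion above (subspace-BB step plus a diminishing scalar), which is the flavour of iteration behind SBB. It is our own illustrative re-implementation, not the authors' Matlab code: the choice of x^1, the value of η, the stand-in for the fixed set B(x^i), and the projected-gradient test used in place of the stopping rule (12) are all assumptions made for this example.

    import numpy as np

    def nnls_sa_ds(A, b, max_iter=5000, tol=1e-6, eta=0.999):
        """Projected subspace-BB iteration with a diminishing scalar (SA+DS sketch)."""
        H = A.T @ A
        Atb = A.T @ b
        grad = lambda x: H @ x - Atb                      # gradient of 0.5*||Ax - b||^2

        x_prev = np.zeros(A.shape[1])                     # x^0
        x = np.maximum(Atb, 0.0)                          # x^1: one projected gradient step from 0
        beta = 1.0
        for _ in range(max_iter):
            g = grad(x)
            fixed = (x == 0.0) & (g > 0.0)                # stand-in for B(x^i)
            g_sub = np.where(fixed, 0.0, grad(x_prev))    # subspace gradient of (A2)
            denom = g_sub @ H @ g_sub
            alpha = (g_sub @ g_sub) / denom if denom > 0 else 1.0
            x_prev, x = x, np.maximum(x - beta * alpha * g, 0.0)
            beta *= eta                                   # diminishing scalar: beta^{i+1} = eta * beta^i
            g_new = grad(x)
            pg = np.where(x > 0.0, g_new, np.minimum(g_new, 0.0))
            if np.abs(pg).max() <= tol:                   # stand-in for stopping rule (12)
                break
        return x

    # Example use on a random instance.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((200, 50))
    b = A @ np.maximum(rng.standard_normal(50), 0.0) + 0.01 * rng.standard_normal(200)
    x = nnls_sa_ds(A, b)
    print("objective:", 0.5 * np.linalg.norm(A @ x - b) ** 2, " #active:", int((x == 0).sum()))

The paper's SBB method combines this kind of step with further safeguards and a non-monotonic strategy; the sketch only mirrors the bare SA+DS recursion listed above.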

Appendix 3. Additional experiments

Tables A2 and A3 report additional NNLS experiments on data sets P1 and P2 with the tighter stopping threshold of 10^-6.


Table A2. NNLS experiments on data set P1. As the stopping criteria, we set the maximum elapsed time to 1000 seconds, ||∇f+||∞ ≤ 10^-6 for SBB and ||x − [x − ∇f(x)]+||∞ ≤ 10^-6 for the other methods.

Method    Metric                   P1-1        P1-2        P1-3        P1-4        P1-5        P1-6

FNNLS     Time (s)                 0.07        0.39        4.90        38.91       356.93      1000.35
          f(x)                     1.63×10^6   2.61×10^7   3.21×10^8   1.10×10^10  1.02×10^11  9.27×10^11
          #active                  301         599         1178        2372        4743        11247

LBFGS-B   Time (s)                 31.64       172.26      926.14      1000.67     1000.78     1006.32
          f(x)                     1.63×10^6   2.61×10^7   3.21×10^8   1.10×10^10  1.02×10^11  9.27×10^11
          #f                       581         828         1520        799         388         183
          #∇f                      581         828         1520        799         388         183
          #active                  269         324         287         1011        462         273
          ||∇f+||∞                 1.07×10^3   6.80×10^-3  3.59×10^-2  7.39×10^1   2.72×10^2   1.00×10^3
          ||x − [x − ∇f(x)]+||∞    3.61×10^-4  6.01×10^-3  3.59×10^-2  1.34×10^1   2.72×10^2   1.00×10^3

SPG       Time (s)                 201.85      461.42      1000.34     1000.14     1000.46     1004.21
          f(x)                     1.63×10^6   2.61×10^7   3.21×10^8   1.10×10^10  1.02×10^11  9.27×10^11
          #f                       50000       50000       31981       11896       3617        897
          #∇f                      5827        10396       19269       8144        1825        465
          #active                  290         593         1162        124         2103        3951
          ||∇f+||∞                 2.25×10^-4  1.09×10^-4  2.30×10^-4  3.29×10^0   8.55×10^1   1.56×10^2
          ||x − [x − ∇f(x)]+||∞    2.54×10^-5  1.09×10^-4  2.30×10^-4  1.06×10^0   2.00×10^1   1.56×10^2

ASA       Time (s)                 0.17        2.50        1000.47     1000.34     1001.76     1002.62
          f(x)                     1.63×10^6   2.61×10^7   3.21×10^8   1.10×10^10  1.02×10^11  9.27×10^11
          #f                       647         623         46621       12436       3184        843
          #∇f                      422         398         30725       8261        2073        492
          #active                  301         599         0           0           0           0
          ||∇f+||∞                 7.05×10^-7  5.64×10^-7  6.80×10^-1  3.12×10^0   7.23×10^0   1.59×10^1
          ||x − [x − ∇f(x)]+||∞    7.05×10^-7  5.64×10^-7  4.19×10^-1  1.84×10^0   4.32×10^0   9.66×10^0

SBB       Time (s)                 0.15        1.89        14.08       51.04       276.46      1001.25
          f(x)                     1.63×10^6   2.61×10^7   3.21×10^8   1.10×10^10  1.02×10^11  9.27×10^11
          #f                       3           2           4           4           5           5
          #∇f                      316         299         455         423         578         518
          #active                  301         599         1178        2371        4743        8174
          ||∇f+||∞                 8.82×10^-7  8.42×10^-7  4.02×10^-7  8.13×10^-7  5.79×10^-7  1.37×10^-2
          ||x − [x − ∇f(x)]+||∞    8.82×10^-7  8.42×10^-7  4.02×10^-7  3.15×10^-7  5.79×10^-7  1.37×10^-2


Table A3. NNLS experiments on data set P2. As the stopping criteria, we set the maximum elapsed time to 1000 seconds, ||∇f+||∞ ≤ 10^-6 for SBB and ||x − [x − ∇f(x)]+||∞ ≤ 10^-6 for the other methods.

Method    Metric                   P2-1        P2-2        P2-3        P2-4        P2-5        P2-6

FNNLS     Time (s)                 1000.10     1000.95     1000.52     1000.81     1000.64     1000.50
          f(x)                     1.50×10^10  7.16×10^9   6.96×10^10  1.56×10^11  6.87×10^11  3.28×10^11
          #active                  7541        8036        8104        8103        8110        8122

LBFGS-B   Time (s)                 337.80      645.14      614.35      953.23      884.68      900.36
          f(x)                     1.50×10^10  7.16×10^9   6.96×10^10  1.56×10^11  6.87×10^11  3.28×10^11
          #f                       114         218         177         411         290         297
          #∇f                      114         218         177         411         290         297
          #active                  4428        6697        1822        747         2352        6121
          ||∇f+||∞                 1.74×10^-2  7.23×10^-3  6.57×10^-2  1.33×10^-1  2.80×10^-1  2.63×10^-2
          ||x − [x − ∇f(x)]+||∞    1.48×10^-2  6.61×10^-3  6.57×10^-2  1.05×10^-1  2.59×10^-1  2.63×10^-2

SPG       Time (s)                 506.86      862.70      1000.28     1001.47     1001.06     1000.44
          f(x)                     1.50×10^10  7.16×10^9   6.96×10^10  1.56×10^11  6.87×10^11  3.28×10^11
          #f                       50000       50000       42260       29325       26911       22391
          #∇f                      1368        1725        1366        3200        1185        1164
          #active                  6831        7096        7062        3837        1382        5334
          ||∇f+||∞                 3.23×10^-4  5.43×10^-4  5.07×10^-4  1.93×10^-2  1.39×10^-2  1.06×10^-2
          ||x − [x − ∇f(x)]+||∞    2.12×10^-4  5.43×10^-4  5.07×10^-4  1.93×10^-2  1.12×10^-2  6.60×10^-3

ASA       Time (s)                 4.57        14.75       1000.30     1000.72     1000.56     204.91
          f(x)                     1.50×10^10  7.16×10^9   6.96×10^10  1.56×10^11  6.87×10^11  3.28×10^11
          #f                       297         470         19838       15947       12321       2208
          #∇f                      173         319         15481       11295       9386        1554
          #active                  7117        7125        0           0           0           7106
          ||∇f+||∞                 4.19×10^-7  3.46×10^-7  1.80×10^1   4.36×10^1   8.39×10^1   8.04×10^-7
          ||x − [x − ∇f(x)]+||∞    4.19×10^-7  3.46×10^-7  1.50×10^1   4.36×10^1   8.39×10^1   8.04×10^-7

SBB       Time (s)                 2.24        5.29        10.50       13.30       19.47       21.77
          f(x)                     1.50×10^10  7.16×10^9   6.96×10^10  1.56×10^11  6.87×10^11  3.28×10^11
          #f                       0           1           1           1           1           1
          #∇f                      90          114         154         148         175         163
          #active                  7117        7124        7117        7104        7106        7106
          ||∇f+||∞                 2.07×10^-7  3.60×10^-7  4.66×10^-7  8.91×10^-7  5.50×10^-7  9.55×10^-7
          ||x − [x − ∇f(x)]+||∞    2.07×10^-7  1.60×10^-7  4.66×10^-7  8.91×10^-7  5.50×10^-7  9.55×10^-7
