Getting to the Bottom of Matrix Completion and Nonnegative Least Squares with the MM Algorithm

Jocelyn T. Chi and Eric C. Chi

March 31, 2014

This article on the MM algorithm was written as part of the Getting to the Bottom learning series on optimization and machine learning methods for statistics. It was written for Statisticsviews.com, and John Wiley & Sons, Ltd retains the copyright for it. The series can be found at the Getting to the Bottom website.

Introduction

The Expectation Maximization (EM) algorithm has a rich and celebrated role in computational statistics. A hallmark of the EM algorithm is its numerical stability. Because the iterates of the EM algorithm always make steady monotonic progress towards a solution, iterates never wildly overshoot the target. The principles that lead to this monotonic behavior, however, are not limited to estimation in the context of missing data. In this article, we review its conceptually simpler generalization, the majorization-minimization (MM) algorithm, that embodies the core principles that give rise to the numerical stability enjoyed by the EM algorithm.[1] We will see that despite its simplicity, the basic MM strategy can take us surprisingly far. MM algorithms are applicable to a broad spectrum of non-trivial problems and often yield algorithms that are almost trivial to implement.

The basic idea behind the MM algorithm is to convert a hard optimization problem (for example, non-differentiable, non-convex, or constrained) into a sequence of simpler ones (for example, smooth, convex, or unconstrained). Thus, like the EM algorithm, the MM algorithm is not an algorithm, but a principled framework for the construction of one. As a preview of things to come, we will describe the MM algorithm framework and show how the gradient descent algorithm we discussed in the first article is an instance of the MM algorithm. This will give the intuitive picture of how the MM algorithm works. We then dive right into two case studies in the application of MM algorithms to the matrix completion and nonnegative least squares problems. We discuss implementation considerations along the way.

Overview of the MM Algorithm

Consider the basic optimization problem

\[
\text{minimize } f(x) \quad \text{subject to } x \in \mathbb{R}^n .
\]

Sometimes, it may be difficult to optimize f(x) directly. In these cases, the MM framework can provide a solution to minimizing f(x) through the minimization of a series of simpler functions. The resulting algorithm alternates between taking two kinds of steps: majorizations and minimizations. The majorizing step identifies a suitable surrogate function for the objective function f(x), and the minimizing step identifies the next iterate through minimization of the surrogate function. In the following section, we discuss these two steps in greater detail.

The Two Steps of the MM Algorithm

Suppose we have generated n iterates, the nth of which we denote by x_n. We describe how to generate the (n+1)th iterate. The first step of the MM algorithm framework is to majorize the objective function f(x) with a surrogate function g(x | x_n) anchored at x_n. The majorizing function g(x | x_n) must satisfy two conditions.

1. The majorizing function must equal the objective function at the anchor point. This tangency condition requires that g(x_n | x_n) = f(x_n).

2. The majorizing function must be at least as great as the objective function at all points. This domination condition requires that g(x | x_n) ≥ f(x) for all x.

Thus, a majorization touches the objective function at the anchor point and lies above it everywhere else.

Once a suitable surrogate function has been found, the second step of the MM algorithm framework is to minimize the surrogate function g(x | x_n). Let x_{n+1} denote a minimizer of g(x | x_n). Then the tangency and domination conditions guarantee that minimizing the majorization results in a descent algorithm since

\[
f(x_{n+1}) \le g(x_{n+1} \mid x_n) \le g(x_n \mid x_n) = f(x_n) .
\]

Consider these comparisons from left to right. The first inequality is due to the domination condition, the second arises since x_{n+1} is a minimizer of g(x | x_n), and the third is a result of the tangency condition. Hence, we are assured that the algorithm produces iterates that make monotonic progress towards minimizing the objective function. We repeat these two steps using the new minimizer x_{n+1} as the anchor for the subsequent majorizing function. In practice, exact minimization of the surrogate function is not essential since the domination and tangency conditions guarantee the descent property of the algorithm.

The “MM” algorithm derives its name from the two-step majorization-minimization process for minimization problems, or minorization-maximization for maximization problems. The idea is that it should be much easier to minimize the surrogate function than it would be to minimize the original objective function. The art, of course, is in finding a surrogate function whose minimizer can be simply derived explicitly, or determined quickly with iterative methods.
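Written as code, the two steps above amount to a short loop. The sketch below is generic and assumes the caller supplies f, the objective, and minimize_surrogate, a hypothetical function that returns the minimizer of a majorization anchored at the current iterate; neither name comes from the gettingtothebottom package.

# A minimal sketch of the generic MM loop, under the assumption that the
# caller supplies minimize_surrogate(x), which minimizes g(. | x).
mm <- function(x0, f, minimize_surrogate, tol = 1e-6, max_iter = 1000) {
  x <- x0
  for (iter in seq_len(max_iter)) {
    x_new <- minimize_surrogate(x)        # minimization step; g is anchored at x
    if (abs(f(x) - f(x_new)) < tol) {     # the descent property makes this difference nonnegative
      x <- x_new
      break
    }
    x <- x_new                            # the new minimizer becomes the next anchor
  }
  x
}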

Revisiting Gradient Descent as an MM Algorithm

In the gradient descent tutorial, we saw how different step-size values α in the gradient descent algorithm lead to different quadratic approximations for the objective function, shown below in blue.

library(gettingtothebottom)

## Loading required package: ggplot2
## Loading required package: grid
## Loading required package: Matrix
## Loading required package: lpSolve
## Loading required package: reshape2

example.quadratic.approx(alpha1 = 0.01, alpha2 = 0.12)


[Figure: two quadratic approximations of the objective, plotted as the value of the loss function against b.]

In the figure above, the black and red quadratic approximations determined by the different step-sizes also result in two majorizing functions. Both the black and red curves are tangent to the blue curve at the anchor point, depicted by the green dot. The curves also dominate the blue curve at all values of b.

Quadratic majorizations are particularly powerful and useful. In the next section, we discuss how a simple quadratic majorizer can help us solve matrix completion problems.
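To make the connection with gradient descent concrete (a standard argument that the article leaves implicit): suppose ∇f is Lipschitz continuous with constant L and the step size satisfies α ≤ 1/L. Then the quadratic surrogate

\[
g(x \mid x_n) = f(x_n) + \nabla f(x_n)^t (x - x_n) + \frac{1}{2\alpha}\, \| x - x_n \|_2^2
\]

satisfies both the tangency and domination conditions, and setting its gradient to zero gives x_{n+1} = x_n − α ∇f(x_n). In other words, each gradient descent step exactly minimizes a quadratic majorization of f anchored at the current iterate.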

Matrix Completion

The Netflix Prize

Between October 2006 and September 2009, Netflix, an American on-demand online streaming media provider, hosted the Netflix Prize challenge, a competition for the best algorithm to predict movie preferences based on user movie ratings. The competition featured a $1 million grand prize and a large training dataset containing ratings from approximately 480,000 users on 18,000 movies, with over 98 percent of the entries missing.

Movie Titles        Anna   Ben   Carl   ...
Star Wars           2      5     ?      ...
Harry Potter        ?      1     ?      ...
Miss Congeniality   1      5     1      ...
Lord of the Rings   5      2     ?      ...
...                 ...    ...   ...    ...

The matrix above depicts how a minuscule portion of the training dataset may have appeared. In this example, Anna, Ben, and Carl have each rated a small number of movies. The challenge would be to come up with an algorithm to predict what kind of movie Anna would want to watch based on her own ratings, and ratings by Netflix users with similar movie preferences.

Representation as the Matrix Completion Problem

Predicting unobserved movie ratings is an example of the matrix completion problem. In a matrix completion problem, we seek to “complete” a matrix of partially observed entries with the “simplest” complete matrix consistent with the observed entries in the original data. In this case, we take simplicity to mean low rank. We say that a matrix is rank-r if it can be expressed as the weighted sum of r rank-1 matrices,

\[
X = \sum_{i=1}^{r} \sigma_i u_i v_i^t .
\]

Every matrix X admits a decomposition of the above form, called the singular value decomposition (SVD). The decomposition can be rewritten as X = UDV^t, where the columns of U and V are orthonormal (i.e., U^t U = I and V^t V = I), and D is a diagonal matrix containing the ordered singular values of X such that

\[
D = \begin{pmatrix} \sigma_1 & & 0 \\ & \ddots & \\ 0 & & \sigma_r \end{pmatrix} ,
\]

with σ_1 ≥ σ_2 ≥ ... ≥ σ_r > 0.
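As a quick aside, base R's svd() computes exactly this decomposition, which we can verify on a small random matrix (an illustration only; nothing here is needed for the algorithm that follows).

# svd() returns the factors U and V and the ordered singular values d.
M <- matrix(rnorm(20), 4, 5)
s <- svd(M)
max(abs(M - s$u %*% diag(s$d) %*% t(s$v)))  # essentially zero: M is recovered exactly
all(diff(s$d) <= 0)                          # singular values come back in decreasing order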

Since we seek the model matrix Z that is most consistent with the original data matrix X, we require a measure of closeness between the data and the model. To do that, we can employ the Frobenius norm of a matrix, given by

\[
\| X \|_F = \sqrt{ \sum_{i=1}^{r} \sigma_i^2 } .
\]

The matrix completion problem can then be stated as the following optimization problem

\[
\text{minimize } \frac{1}{2}\, \| P_{\Omega^c}(X) - P_{\Omega^c}(Z) \|_F^2 ,
\]

subject to the constraint that rank(Z) ≤ r, where X is the original data, Z is the model, and Ω denotes the set of unobserved indices. Note that r is a user-defined upper limit on the rank of the matrix, or complexity of the model. The function P_{Ω^c}(Z) is the projection of the matrix Z onto the set of matrices with zero entries given by indices in the set Ω, namely

\[
[P_{\Omega^c}(Z)]_{ij} =
\begin{cases}
z_{ij} & \text{for } (i,j) \notin \Omega , \text{ and} \\
0 & \text{otherwise.}
\end{cases}
\]


The ½‖P_{Ω^c}(X) − P_{Ω^c}(Z)‖²_F portion of the objective penalizes differences between X and Z over the observed entries. This formulation balances the tradeoff between the complexity (rank) of the model and how well the model matches the data. As we relax the bound on the rank r, we can better fit the data at the risk of overfitting it.

Unfortunately, the formulated problem is difficult to solve, since the rank constraint makes the optimization task combinatorial in nature. Using matrix rank as a notion of complexity, however, is a very useful idea in many applications. In the Netflix data, the low rank assumption is one way to model our belief that there are a relatively small number of fundamental movie genres and moviegoer tastes that can explain much of the systematic variation in the ratings.

We can reach a compromise and relax the problem by substituting an easier objective to minimize that still trades off the goodness of fit with the matrix rank. In place of a rank constraint, we tack on a regularization term, or penalty, on the nuclear (or trace) norm of the model matrix Z. The nuclear norm is given by

\[
\| X \|_* = \sum_{i=1}^{r} \sigma_i ,
\]

and it is known to be a good surrogate of the rank in matrix approximation problems. Employing it as a regularizer results in the following formulation

\[
\text{minimize } \frac{1}{2}\, \| P_{\Omega^c}(X) - P_{\Omega^c}(Z) \|_F^2 + \lambda \| Z \|_* ,
\]

where the tuning parameter λ ≥ 0 trades off the complexity of the model with how well the model agrees with the data over the observed entries. Although we have made progress towards solving a simpler problem, the presence of the projection operator P_{Ω^c} still presents a challenge. This is where majorization comes to our aid. We next derive a quadratic majorization of the troublesome squared error term in our objective. First note that we can rewrite this term more verbosely as

\[
\frac{1}{2}\, \| P_{\Omega^c}(X) - P_{\Omega^c}(Z) \|_F^2 = \frac{1}{2} \sum_{(i,j) \in \Omega^c} (x_{ij} - z_{ij})^2 .
\]

Suppose we already have the nth iterate Z^(n). Observe that the following quadratic function of Z is always nonnegative

\[
0 \le \frac{1}{2} \sum_{(i,j) \in \Omega} \bigl( z_{ij} - z_{ij}^{(n)} \bigr)^2 ,
\]

and the inequality becomes equality when Z = Z^(n). Adding the above inequality to the equality above it produces the desired quadratic majorization anchored at the nth iterate Z^(n):

\[
\begin{aligned}
\frac{1}{2}\, \| P_{\Omega^c}(X) - P_{\Omega^c}(Z) \|_F^2
&\le \frac{1}{2} \sum_{(i,j) \in \Omega^c} (x_{ij} - z_{ij})^2 + \frac{1}{2} \sum_{(i,j) \in \Omega} \bigl( z_{ij} - z_{ij}^{(n)} \bigr)^2 \\
&= \frac{1}{2} \sum_{(i,j)} (y_{ij} - z_{ij})^2 \\
&= \frac{1}{2}\, \| Y - Z \|_F^2 ,
\end{aligned}
\]

where

\[
y_{ij} =
\begin{cases}
x_{ij} & \text{for } (i,j) \notin \Omega , \\
z_{ij}^{(n)} & \text{for } (i,j) \in \Omega .
\end{cases}
\]

Thus, we have converted our original problem to one of repeatedly solving the following problem

\[
\text{minimize } \frac{1}{2}\, \| Y - Z \|_F^2 + \lambda \| Z \|_* .
\]


Taking a step back, we see that the MM principle has allowed us to replace a problem with missing entries with one without missing entries. This majorization is very straightforward to minimize compared to the original problem. In fact, minimizing the majorization can be thought of as a matrix version of the ubiquitous lasso problem and can be accomplished in four steps. One of those steps involves the soft-threshold operator, synonymous with the lasso, which is itself the solution to an optimization problem involving the 1-norm and is given by

\[
S(s, \lambda) = \arg\min_{t}\ \frac{1}{2}(s - t)^2 + \lambda |t|
= \begin{cases}
s - \lambda & \text{if } s \ge \lambda , \\
s + \lambda & \text{if } s \le -\lambda , \text{ and} \\
0 & \text{if } |s| < \lambda .
\end{cases}
\]
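In R, the operator can be written in one line. This stand-alone version is for illustration only; the gettingtothebottom package supplies its own thresholding and plotting helpers.

# Scalar (vectorized) soft-threshold operator as defined above.
soft_threshold <- function(s, lambda) {
  sign(s) * pmax(abs(s) - lambda, 0)   # values inside (-lambda, lambda) go to zero
}
soft_threshold(c(-4, -1, 0, 2, 5), lambda = 3)

## [1] -1  0  0  0  2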

The figure below illustrates what the soft-threshold operator does. The gray line depicts the line f(x) = x, and the blue curve indicates how the soft-threshold function S(x, λ) shrinks f(x) by sending f(x) ∈ (−λ, λ) to zero, and shrinking all other f(x) towards zero by a constant amount λ.

Note: If you already installed the gettingtothebottom package for the gradient descent tutorial, you’ll need to update the package to version 2.0 in order to call the following functions. If you are using RStudio, you can do that by selecting Tools > Check for Package Updates.

plot_softhreshold(from=-5,to=5,lambda=3)

[Figure: the soft-threshold value S(x, λ) plotted against x for λ = 3, alongside the gray line f(x) = x.]

The minimizer of the majorization is given by Z = U S(D, λ) V^t, where Y has the singular value decomposition UDV^t and S(D, λ) denotes the application of the soft-threshold operator to the elements of the matrix D, namely

\[
S(D, \lambda) = \begin{pmatrix} S(\sigma_1, \lambda) & & 0 \\ & \ddots & \\ 0 & & S(\sigma_r, \lambda) \end{pmatrix} .
\]

We provide a sketch of how we derived the minimizer of the majorization later in this article.

MM Algorithm for Matrix Completion

A formalization of the steps described above presents the following MM algorithm for matrix completion. Given a data matrix X, a set of unobserved indices Ω, and an initial model matrix Z^(0), repeat the following four steps until convergence.

1. Set y_ij^(n+1) = x_ij for (i,j) ∉ Ω, and y_ij^(n+1) = z_ij^(n) for (i,j) ∈ Ω.
2. (U, D, V) = SVD[Y^(n+1)]
3. D̃ = S(D, λ)
4. Z^(n+1) = U D̃ V^t

This imputation method is known as the soft-impute algorithm. In words, we fill in the missing values using the relevant entries from our most recent MM iterate to get Y. We then take Y apart by factoring it into the matrices U, D, and V with an SVD and soft-threshold the diagonal matrix D to obtain a new diagonal matrix D̃. Finally, to get the next iterate, we put the pieces back together, substituting D̃ for D.
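For concreteness, a single soft-impute update can be sketched in a few lines of base R. This is only an illustration of the four steps, not the package's matrixcomplete implementation (which also monitors convergence); it assumes X stores NA at the unobserved entries, as in the examples below, and reuses the soft-thresholding idea from above.

# One soft-impute (MM) update: X has NAs at unobserved entries, Z is the current iterate.
mm_update <- function(X, Z, lambda) {
  Y <- X
  miss <- is.na(X)
  Y[miss] <- Z[miss]            # step 1: impute unobserved entries from the current iterate
  s <- svd(Y)                   # step 2: Y = U D V^t
  d <- pmax(s$d - lambda, 0)    # step 3: soft-threshold the singular values
  s$u %*% diag(d) %*% t(s$v)    # step 4: Z^(n+1) = U D~ V^t
}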

Since both the majorization and objective functions are convex, the algorithm is guaranteed to proceed towards a global minimum. Hence, the choice of the initial guess for the model matrix Z is not materially important for eventually reaching a solution, although an informed choice is likely to get to a good answer sooner. We will give some guidance on how to make that informed choice in a moment.

Convergence may be defined by several metrics. As the algorithm approaches convergence, the change in the iterates (and the resulting objective function values) will be minimal. Hence, we may decide to terminate the MM algorithm once the absolute difference in the objective function values of adjacent iterates falls below some small threshold. Alternatively, the relative change in adjacent iterates may also be used. In the matrixcomplete function of the gettingtothebottom package, we watch the relative change in the iterates to decide when to halt our algorithm.
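As a sketch of what such a relative-change rule might look like (the package's internal criterion may differ in detail):

relative_change <- function(Z_new, Z_old) {
  norm(Z_new - Z_old, type = "F") / max(1, norm(Z_old, type = "F"))
}
# e.g., compute Z_new from Z_old and stop once relative_change(Z_new, Z_old) < tol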

We further notice that this algorithm provides a solution Z for a given value of λ. In practice, the algorithm is typically run over a sequence of decreasing λ values. The resulting sequence of solutions is called the regularization, or solution, path. This path of solutions may then be used to identify the λ value that minimizes the error between the data X and the model Z on a held-out set of entries. We do not go into detail on how to choose λ in this article. Instead, since we illustrate the method on simulated data, we are content to show that there is a λ which minimizes the true prediction error on the unobserved entries. Before moving on to some R code, we point out that using the solution at the previous value of λ as the initial value Z^(0) for the next lower value of λ in the sequence of λ's can drastically reduce the number of iterations. This practice is often referred to as using “warm starts” and leverages the fact that solutions to problems with similar λ's are often close to each other.

An MM Algorithm for Matrix Completion

The following example implements the MM algorithm for matrix completion described above. In our code, we store the set of missing indices Ω as a vector.


# Generate a test matrix
set.seed(12345)
m <- 1000
n <- 1000
r <- 5
Mat <- testmatrix(m,n,r)

# Add some noise to the test matrix
E <- 0.1*matrix(rnorm(m*n),m,n)
A <- Mat + E

# Obtain a vector of unobserved entries
temp <- makeOmega(m,n,percent=0.5)
omega <- temp$omega

# Remove unobserved entries from test matrix
X <- A
X[omega] <- NA

# Make initial model matrix Z and find initial lambda
Z <- matrix(0,m,n)
lambda <- init.lambda(X,omega)

# Run example
Sol <- matrixcomplete(X,Z,omega,lambda)

## Optimization completed.

# Error (normed difference)
diff_norm(Sol$sol,A,omega)

## [1] 121.5

In this example, we utilized the init.lambda function to obtain an initial value for λ. For a sufficiently large λ, the solution is the matrix of all zeros, namely Z = 0. The smallest λ that gives this answer is given by the largest singular value of a matrix that equals X on Ω^c and is zero on Ω. The init.lambda function returns this value. We use this λ as the first of a decreasing sequence of regularization parameters, since we can make an informed choice on the first initial starting point, namely Z^(0) = 0. If the next smaller λ is not too far from the λ returned by init.lambda, our solution to the first problem, namely Z = 0, should not be too far from the solution to the current problem. As we progress through our sequence of decreasing λ values, we repeatedly recycle the solution at one λ as the initial guess Z^(0) for the next smaller λ.
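The warm-start loop itself is only a few lines. The sketch below assumes X, omega, and lambda.start from init.lambda as in the surrounding code; the 0.8 decay factor and sequence length of 20 are arbitrary illustrative choices, and the solutionpaths function used next handles this bookkeeping (plus the error comparisons) for us.

lambdas <- lambda.start * 0.8^(0:19)     # a decreasing sequence of regularization parameters
Z0 <- matrix(0, nrow(X), ncol(X))        # informed start for the largest lambda
for (k in seq_along(lambdas)) {
  fit <- matrixcomplete(X, Z0, omega, lambdas[k])
  Z0 <- fit$sol                          # warm start: recycle the solution for the next lambda
}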

# Generate a test matrix
seed <- 12345
m <- 100
n <- 100
r <- 3
Mat <- testmatrix(m,n,r,seed=seed)

# Add some noise to the test matrix
E <- 0.1*matrix(rnorm(m*n),m,n)
A <- Mat + E


# Obtain a vector of unobserved entries
temp <- makeOmega(m,n,percent=0.5)
omega <- temp$omega

# Remove unobserved entries from test matrix
X <- A
X[omega] <- NA

# Make initial model matrix Z and find initial lambda
Z <- matrix(0,m,n)
lambda.start <- init.lambda(X,omega)
lambdaseq_length <- 20
tol <- 1e-2

ans <- solutionpaths(A,X,Z,omega,lambda.start,tol=tol,liveupdates=TRUE,lambdaseq_length=lambdaseq_length)

## Optimization completed.
## Completed results for lambda 1 of 20 .
## Completed iteration 1 .
## Completed iteration 2 .
## Completed iteration 3 .
## Completed iteration 4 .
## Completed iteration 5 .
## Optimization completed.
## Completed results for lambda 2 of 20 .
## Completed iteration 1 .
## Completed iteration 2 .
## Completed iteration 3 .
## Completed iteration 4 .
## Optimization completed.
## Completed results for lambda 3 of 20 .
## Completed iteration 1 .
## Completed iteration 2 .
## Completed iteration 3 .
## Completed iteration 4 .
## Optimization completed.
## Completed results for lambda 4 of 20 .
## Completed iteration 1 .
## Completed iteration 2 .
## Completed iteration 3 .
## Optimization completed.
## Completed results for lambda 5 of 20 .
## Completed iteration 1 .
## Completed iteration 2 .
## Completed iteration 3 .
## Optimization completed.
## Completed results for lambda 6 of 20 .
## Completed iteration 1 .
## Completed iteration 2 .
## Completed iteration 3 .
## Optimization completed.
## Completed results for lambda 7 of 20 .
## Completed iteration 1 .
## Completed iteration 2 .
## Optimization completed.
## Completed results for lambda 8 of 20 .
## Completed iteration 1 .
## Completed iteration 2 .
## Optimization completed.
## Completed results for lambda 9 of 20 .
## Completed iteration 1 .
## Optimization completed.
## Completed results for lambda 10 of 20 .
## Completed iteration 1 .
## Optimization completed.
## Completed results for lambda 11 of 20 .
## Completed iteration 1 .
## Optimization completed.
## Completed results for lambda 12 of 20 .
## Completed iteration 1 .
## Optimization completed.
## Completed results for lambda 13 of 20 .
## Completed iteration 1 .
## Optimization completed.
## Completed results for lambda 14 of 20 .
## Optimization completed.
## Completed results for lambda 15 of 20 .
## Optimization completed.
## Completed results for lambda 16 of 20 .
## Optimization completed.
## Completed results for lambda 17 of 20 .
## Optimization completed.
## Completed results for lambda 18 of 20 .
## Optimization completed.
## Completed results for lambda 19 of 20 .
## Optimization completed.
## Completed results for lambda 20 of 20 .

The following figure is a plot of the true values (obtained from the data we generated in A) against the imputed values in the minimum error solution found using our comparison in the solutionpaths function. The plot shows the errors using the model with the λ that gives the solution with the least discrepancy with A. Of course, in practice we do not know A. The point we want to make is that there is a λ that can lead to good predictions. The plot indicates that the matrix completion algorithm does well in finding a complete model Z that is similar to X, since the imputed values are quite similar to the true values in our example. You will notice, however, that the imputed entries are all biased towards zero. This is an artifact of using the nuclear norm heuristic in place of the rank constraint. It is a small price to pay for a computationally tractable model. This bias is in fact a common theme in penalized estimation. Penalties like the 1-norm in the lasso and the 2-norm in ridge regression trade off increased bias for reduced variance to achieve an overall lower mean squared error.

plot_solpaths_error(A,omega,ans)


[Figure: imputed values plotted against true values for the minimum-error solution along the path.]

The figure below depicts the result of the comparisons from the solutionpaths function. The figure shows a plot of the error for each λ in our comparison as a function of log10(λ) (with rounded λ values below each point). The blue line indicates the λ value resulting in the minimum normed difference error between the data and the solution model. We observe that higher λ values resulted in larger errors, and as λ moved towards zero, the change in the model error also decreased. Although it is hard to see on the graph, the error actually increased again after some λ value.

plot_solutionpaths(ans)


[Figure: error along the solution path as a function of log10(λ), with rounded λ values printed beneath the points and a vertical blue line at the minimum-error λ.]

We conclude our brief foray into matrix completion with an abridged derivation of the minimizer of the majorizing function.

Derivation of the minimizer of the majorization

In deriving the MM algorithm we used the fact that the following function

\[
\frac{1}{2}\, \| Y - Z \|_F^2 + \lambda \| Z \|_*
\]

has a unique global minimizer Z* = UDV^t, where Y = UΣV^t is the singular value decomposition of Y and D is a diagonal matrix of soft-thresholded values of Σ. We begin by noting that

\[
\begin{aligned}
\frac{1}{2}\, \| Y - Z \|_F^2 + \lambda \| Z \|_*
&= \frac{1}{2}\, \| Y \|_F^2 - \operatorname{tr}(Y^t Z) + \frac{1}{2}\, \| Z \|_F^2 + \lambda \| Z \|_* \\
&= \frac{1}{2} \sum_{i=1}^{n} \sigma_i^2 - \operatorname{tr}(Y^t Z) + \frac{1}{2} \sum_{i=1}^{n} d_i^2 + \lambda \sum_{i=1}^{n} d_i ,
\end{aligned}
\]

where the σ_i are the singular values of Y and the d_i are the singular values of Z.

By Fan’s inequality,

\[
\operatorname{tr}(Y^t Z) \le \sum_{i=1}^{n} \sigma_i d_i .
\]

Equality is attained if and only if Y and Z share the same left and right singular vectors U and V. Therefore, the optimal Z has singular value decomposition Z = UDV^t, and the singular values d_i satisfy

\[
(d_1, \ldots, d_n) = \arg\min_{d_1, \ldots, d_n} \sum_{i=1}^{n} \left[ \frac{1}{2} \sigma_i^2 - \sigma_i d_i + \frac{1}{2} d_i^2 + \lambda |d_i| \right] .
\]


In other words, for each i,

\[
d_i = \arg\min_{d_i}\ \frac{1}{2} \left[ \sigma_i - d_i \right]^2 + \lambda |d_i| .
\]

But the solution to this problem is the soft-thresholded value S(σ_i, λ).
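A quick numerical check of this last step (purely illustrative): for a nonnegative singular value σ, the minimizer found by base R's optimize() matches the closed form max(σ − λ, 0).

lam <- 1.5
for (sigma in c(0.5, 1.5, 3.0)) {
  numeric_min <- optimize(function(d) 0.5 * (sigma - d)^2 + lam * abs(d),
                          interval = c(-10, 10), tol = 1e-8)$minimum
  cat("sigma =", sigma, " numeric:", round(numeric_min, 4),
      " closed form:", max(sigma - lam, 0), "\n")
}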

Nonnegative Least Squares

While quadratic majorizations are extremely useful, the majorization principle can fashion iterative algorithms from a wide array of inequalities. In our next example, we show how to use Jensen's inequality to construct an MM algorithm for solving the constrained minimization problem of nonnegative least squares regression.

We begin by reviewing the formal definition of convexity. Recall that a real valued function f over R is convex if for all α ∈ [0, 1] and pairs of points u, v ∈ R,

\[
f(\alpha u + (1 - \alpha) v) \le \alpha f(u) + (1 - \alpha) f(v) .
\]

Geometrically, the inequality implies that the function evaluated at every point in the segment {w = αu + (1 − α)v : α ∈ [0, 1]} must lie on or below the chord connecting the points (u, f(u)) and (v, f(v)). Curved functions that obey this inequality are bowl shaped.

This inequality can be generalized to multiple points. By an induction argument, we arrive at Jensen's inequality, which states that if f is convex, then for any collection of positive α_1, ..., α_p and u_1, ..., u_p ∈ R,

\[
f\!\left( \frac{\sum_{j=1}^{p} \alpha_j u_j}{\sum_{j'=1}^{p} \alpha_{j'}} \right) \le \frac{\sum_{j=1}^{p} \alpha_j f(u_j)}{\sum_{j'=1}^{p} \alpha_{j'}} .
\]

By Jensen’s inequality, if f is a convex function over the reals, the following inequality

\[
f(x^t b) \le \sum_{j=1}^{p} \alpha_j\, f\!\left( \frac{x_j [b_j - \tilde b_j]}{\alpha_j} + x^t \tilde b \right),
\qquad \text{where } \alpha_j = \frac{x_j \tilde b_j}{x^t \tilde b} ,
\]

always holds whenever the elements of x and b are positive. Moreover, the bound is sharp and equality is attained when b = b̃. Therefore, given n convex functions f_1, ..., f_n, the function

\[
g(b \mid \tilde b) = \sum_{i=1}^{n} \sum_{j=1}^{p} \alpha_{ij}\, f_i\!\left( \frac{x_{ij} [b_j - \tilde b_j]}{\alpha_{ij}} + x_i^t \tilde b \right),
\qquad \text{where } \alpha_{ij} = \frac{x_{ij} \tilde b_j}{x_i^t \tilde b} ,
\]

majorizes

\[
\sum_{i=1}^{n} f_i(x_i^t b)
\]

at b̃.

Suppose we have n samples and p covariates onto which we wish to regress a response. Let y ∈ R^n_+ and X ∈ R^{n×p}_+ denote the response and design matrix respectively. Suppose further that we require the regression coefficients be nonnegative. Let x_i denote the ith row of X. Then formally, we can cast the nonnegative least squares problem as the following constrained optimization problem:

\[
\min_{b}\ \frac{1}{2} \sum_{i=1}^{n} \left( y_i - x_i^t b \right)^2
\quad \text{subject to } b_j \ge 0 \text{ for } j = 1, \ldots, p .
\]

Note that the functions

\[
f_i(u) = \frac{1}{2} (y_i - u)^2
\]

are convex. Therefore, the objective is majorized by

\[
g(b \mid \tilde b) = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{p} \alpha_{ij} \left( \frac{x_{ij} [b_j - \tilde b_j]}{\alpha_{ij}} + x_i^t \tilde b - y_i \right)^2 .
\]

Despite how complicated this majorization may appear, it has a unique global minimizer which can be analytically derived. Note that the partial derivatives are given by

\[
\frac{\partial}{\partial b_j}\, g(b \mid \tilde b) = \sum_{i=1}^{n} x_{ij}\, b_j - \sum_{i=1}^{n} \frac{x_{ij}\, \tilde b_j}{x_i^t \tilde b}\, y_i .
\]

Setting them equal to zero, we conclude that the minimizer b of the majorization is an element-wise multiple of the anchor point b̃, namely

\[
b_j = \left[ \frac{\sum_{i=1}^{n} x_{ij}\, y_i / (x_i^t \tilde b)}{\sum_{i=1}^{n} x_{ij}} \right] \tilde b_j .
\]

Clearly, if we initialize our MM algorithm at a positive set of regression coefficients b, subsequent MM iterates will also be positive since all terms in the brackets are positive. Thus, this majorization gracefully turns a constrained optimization problem into an unconstrained one.

We can re-express the updates to simplify coding the algorithm. Let

\[
w_{ij} = \frac{x_{ij}}{\sum_{i=1}^{n} x_{ij}} .
\]

The updates can be written more compactly as

\[
b = \left[ W^t \left( y \oslash X\tilde b \right) \right] \ast \tilde b ,
\]

where ⊘ denotes element-wise division and ∗ denotes element-wise multiplication. Thus, the MM updates consist of multiplying the last iterate element-wise by a correction term.
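Translated into R, the update is a one-liner inside a loop. The sketch below is illustrative only and is deliberately named nnls_mm_sketch to distinguish it from the package's nnls_mm; it assumes X has positive entries and b0 is a positive starting vector, and it omits any stopping rule.

nnls_mm_sketch <- function(y, X, b0, max_iter = 500) {
  W <- sweep(X, 2, colSums(X), "/")                # w_ij = x_ij / sum_i x_ij
  b <- b0
  for (iter in seq_len(max_iter)) {
    b <- as.vector(t(W) %*% (y / (X %*% b))) * b   # b <- [W^t (y / (X b))] * b
  }
  b
}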

An Example from Chemometrics

A classic application of nonnegative least squares in chemometrics involves decomposing chemical traces into “components”. Different pure compounds of interest have different known measurement profiles. Given a sample containing a mixture of these known pure compounds, the task is to determine how much of these basic compounds was contained in the mixed sample. This was often done by fitting mixtures of normals of known location and width, so that only the weights were fitted. Clearly, the weights should be nonnegative.

In the following example, we generate a basis mixture of ten components. The resulting plot depicts the ten basis components in our generated mixture.


n <- 1e3
p <- 10
nnm <- generate_nnm(n,p)
plot_nnm(nnm)

[Figure: the ten basis components of the generated mixture, intensity versus frequency.]

We then set up our nonnegative least squares regression problem and utilize the MM algorithm described above to obtain a solution.

set.seed(12345)
n <- 100
p <- 3
X <- matrix(rexp(n*p,rate=1),n,p)
b <- matrix(runif(p),p,1)
y <- X %*% b + matrix(abs(rnorm(n)),n,1)
sol <- nnls_mm(y,X,b)

Next, we generate a small simulated de-mixing problem with noise added to a true mixture of 3 components.

## Setup mixture example
n <- 1e3
p <- 10
nnm <- generate_nnm(n,p)

set.seed(12345)
X <- nnm$X
b <- double(p)
nComponents <- 3

k <- sample(1:p,nComponents,replace=FALSE)
b[k] <- matrix(runif(nComponents),ncol=1)
y <- X%*%b + 0.25*matrix(abs(rnorm(n)),n,1)

# Plot the mixture
plot_spect(n,y,X,b,nnm)

[Figure: the simulated noisy mixture, intensity versus frequency.]

The following plot shows the unadulterated mixture that we wish to de-mix.

# Plot the true signal
plot_nnm_truth(X,b,nnm)

[Figure: the true, noiseless mixture, true intensity versus frequency.]

The next few lines of R code show the resulting mixture estimated by nonnegative least squares.

# Obtain solution to mixture problem
nnm_sol <- nnls_mm(y,X,runif(p))

# Plot the reconstruction
plot_nnm_reconstruction(nnm,X,nnm_sol)

[Figure: the mixture reconstructed by nonnegative least squares, reconstructed intensity versus frequency.]

The plot below shows the nonnegative regression components obtained using our algorithm.

# Plot the regression coefficients
plot_nnm_coef(nnm_sol)

[Figure: the estimated nonnegative regression coefficients b_k plotted against the component index k.]

The three largest components of the nonnegative least squares estimate coincide with the three true components in the simulated mixture. Nonetheless, the estimator also picked up several smaller false positives.

Before wrapping up with nonnegative least squares, we highlight the importance of verifying the monotonicity of our MM algorithm. The figure below depicts the decrease in the objective (or loss) function with each iteration of the algorithm after running it on the toy problem. As expected, the MM algorithm we utilize is monotonically decreasing. If this were not true, we would know immediately that something was amiss in either our code or our derivation of the algorithm.

set.seed(12345)
n <- 100
p <- 3
X <- matrix(rexp(n * p, rate = 1), n, p)
b <- matrix(runif(p), p, 1)
y <- X %*% b + matrix(abs(rnorm(n)), n, 1)
# Plot the objective
plot_nnm_obj(y,X,b)

[Figure: value of the loss function over 100 iterates, decreasing monotonically from roughly 22.30 to 22.15.]

Conclusion

The ideas underlying the MM algorithm are extremely simple, but when combined with a well-chosen inequality, they open the door to powerful and easily implementable algorithms for solving hard optimization problems. We readily admit that deriving an MM algorithm can be somewhat involved, as demonstrated in our examples. Nonetheless, the resulting algorithms are typically simple and easy to code. The matrix completion algorithm requires four simple steps, and the nonnegative least squares algorithm boils down to a multiplicative update that can be written in two lines of code. For both examples, the actual action part of the code rivals the amount of code needed to check for convergence in terms of brevity. Moreover, the monotonicity property provides a straightforward sanity check for the correctness of our code. We think most would agree with us that the time spent on messy calculus and algebra is worth it to avoid time spent debugging code.

In short, in the space of iterative methods, MM algorithms are relatively simple and can drastically cut down on code development time. While more sophisticated approaches might be faster in the end for any given problem, the MM algorithm allows us to rapidly evaluate the quality of our statistical models by minimizing the total time we spend developing ways to compute the estimators of parameters in those models. We have only skimmed the surface of the possible applications, and refer interested readers to Hunter and Lange for more examples.

Footnote

1. Exploring the connection between the EM and MM algorithms is interesting in its own right and would require a separate article to do it justice. For the time being, however, we point the interested reader to several nice papers which take up this comparison. See 1, 2, 3, and 4. A line-by-line derivation which shows explicitly the minorization used in the EM algorithm is given in 1.
