
Sparse Matrix Reconstruction

Michael Hankin

University of Southern California

[email protected]

December 5, 2013


Overview

1 Initial Problem

2 Algorithm Explanation

3 Convergence

4 Extensions

5 Demos

6 Other topics

7 References


Overview of Matrix Completion Problem

Motivation: Say that Netflix has N_Movies movies and N_Users users. Given universal knowledge they could construct an N_Movies × N_Users matrix of ratings and thus predict which movies their users would enjoy, and how much so. However, all they have are the few ratings their users have taken the time to input, and the data on which accounts have watched which movies. Can the full matrix be reconstructed from this VERY sparse, noisy sample?


Overview of Matrix Completion Problem

Idea: Without some constraint the values of the missing points could be any real (or even complex!) number. Obviously we have to impose some restrictions, beginning with real numbers only! Less obvious is the condition that the matrix be of low rank. In the Netflix problem this is natural: there really aren't that many types of people (as far as taste profiles go) or movies (as far as genre/appeal profiles go). However, this condition is relevant in many other scenarios as well.


Notational Interlude

For a matrix X define the nuclear norm to be

\|X\|_* = \sum_{i=1}^{r} |\sigma_i|

where the σ_i's are the singular values of the matrix and r is its rank (and therefore the number of nonzero singular values). Grievously abusing notation, we might say \|X\|_* = \|\vec{\sigma}\|_{\ell_1}.

If the true matrix is M and we observe only M_{i,j} ∀(i,j) ∈ Ω for some Ω, then let

[P_\Omega(X)]_{i,j} = \begin{cases} X_{i,j} & (i,j) \in \Omega \\ 0 & (i,j) \notin \Omega \end{cases}
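As a quick aid (not part of the original slides), here is a minimal NumPy sketch of these two definitions; the helper names nuclear_norm and P_Omega are my own:

```python
import numpy as np

def nuclear_norm(X):
    # ||X||_* : sum of the singular values of X.
    return np.linalg.svd(X, compute_uv=False).sum()

def P_Omega(X, Omega):
    # Keep the entries of X on the observed index set Omega (a boolean
    # mask of the same shape as X) and zero out everything else.
    return np.where(Omega, X, 0.0)
```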


Problem statement

Given P_Ω(M) and the knowledge that M is of low rank, recover M. To do so we work with an approximation X. We want to minimize the rank of X; however, a direct approach would be NP-hard, in the same way that minimizing \|\vec{\sigma}\|_{\ell_0} would be, so we relax our conditions in the same vein as LASSO and set up the problem:

\min \|X\|_* \;\big(\approx \min \|\vec{\sigma}\|_{\ell_1}\big) \quad \text{s.t. } P_\Omega(M) = P_\Omega(X) \tag{1}

We end up with something slightly resembling the Dantzig selector, which we know gives sparse results, and sparsity in \vec{\sigma} is equivalent to low rank for X.


To expand on this notion, consider the Dantzig case. Level sets of \|\cdot\|_{\ell_1} can be visually represented as the diamond in the image to the right.¹

In the case of LASSO regression, \|\cdot\|_{\ell_1} \le 1 can be considered to be the convex hull of the single-parameter euclidean basis vectors of unit length. In the nuclear norm case, \|\cdot\|_* \le 1 is just the convex hull of the set of rank 1 matrices whose spectral norm \|X\|_2 \le 1 (keeping in mind that \|X\|_2 = \|\vec{\sigma}\|_{\ell_\infty}). The solution to the previous minimization problem is the point at which the smallest level set of the nuclear norm to intersect the subspace \{X : P_\Omega(M) = P_\Omega(X)\} does so. Using the spatial intuition gleaned from our study of LASSO, we recognize that this will give a sparse set of singular values, and therefore a low rank matrix, that agrees with M on all of Ω.

¹ Credit to Nicolai Meinshausen: http://www.stats.ox.ac.uk/~meinshau/

Algorithm Background

Cai, Candès, and Shen introduced an algorithm that comes close to solving our problem. Let X be of low rank r and UΣV^* be its SVD, where \Sigma = \mathrm{diag}(\{\sigma_i\}_{1 \le i \le r}) (because it has only r nonzero singular values).

Next they define the soft-thresholding operator:

D_\tau(X) = U\, D_\tau(\Sigma)\, V^*, \qquad D_\tau(\Sigma) = \mathrm{diag}\big((\sigma_i - \tau)_+\big)_{1 \le i \le r}

for τ > 0, so that it shrinks all of the singular values of X, setting any that were originally ≤ τ to 0, thereby reducing its rank.

Note: D_\tau(X) = \arg\min_Y \tfrac{1}{2}\|Y - X\|_F^2 + \tau\|Y\|_*. This will affect the output of the algorithm.
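As an illustration (my own sketch, not code from the paper), D_τ can be computed directly from the SVD; the helper name shrink is an assumption:

```python
import numpy as np

def shrink(X, tau):
    # D_tau(X) = U diag((sigma_i - tau)_+) V*: soft-threshold the singular values.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt
```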


Algorithm

Start with some Y^0 that vanishes outside of Ω (an efficient choice for Y^0 will be discussed later, but for now just use 0 or even P_Ω(M)).

Choose a value for τ > 0 and a sequence of step sizes δ_k.

At step k, set X^k = D_τ(Y^{k−1}).

Then set Y^k = Y^{k−1} + δ_k P_Ω(M − X^k). (A rough code sketch of this loop follows below.)
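Here is a rough NumPy sketch of the iteration as I read it, reusing the shrink helper above; the dense SVDs, fixed iteration count, and the name svt_complete are my own simplifications rather than the authors' implementation:

```python
import numpy as np

def svt_complete(M_obs, Omega, tau, delta, n_iter=500):
    # M_obs holds the observed entries (zero off Omega); Omega is a boolean mask.
    Y = np.zeros_like(M_obs, dtype=float)   # Y^0 = 0, vanishes outside Omega
    X = np.zeros_like(M_obs, dtype=float)
    for _ in range(n_iter):
        X = shrink(Y, tau)                                  # X^k = D_tau(Y^{k-1})
        Y = Y + delta * np.where(Omega, M_obs - X, 0.0)     # Y^k = Y^{k-1} + delta_k * P_Omega(M - X^k)
    return X
```

In practice the authors exploit the sparsity of Y^k and a partial SVD instead of the dense computations used here.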


Algorithm Discussion

Notes on the algorithm:

Low Rank

The X^k's will tend to have low rank, unless too many of the singular values end up growing beyond τ, so that further iterations do not lower the rank. Both the authors and I found (empirically) that the rank of the X^k's tends to start low and grow to a stable point after a few dozen iterations. As long as the original matrix was of low rank, this stable point also tends to be of low rank. Unfortunately the authors have been unable to prove this. When the dimensions of X^k are high, this low rank property allows us to economize on memory by maintaining only the portion of its SVD corresponding to non-zeroed singular values instead of the entire, dense matrix itself.
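For illustration only (not from the slides), keeping just the truncated SVD factors could look like the following sketch:

```python
import numpy as np

def truncated_factors(X, tol=1e-10):
    # Keep only the factors tied to non-zeroed singular values:
    # storage drops from O(n1 * n2) to O((n1 + n2) * r).
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    r = int(np.count_nonzero(s > tol))
    return U[:, :r], s[:r], Vt[:r, :]
```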


Algorithm Discussion

Notes on the algorithm:

Sparsity

The Y^k's will always be sparse and vanish outside of Ω. This is obvious because we require that Y^0 be either equal to 0 or at least vanish outside of Ω. P_Ω(M − X^k) vanishes outside of Ω by definition, and if we assume Y^{k−1} does too, then Y^k = Y^{k−1} + δ_k P_Ω(M − X^k) must have the same property, and is therefore sparse. This lessens our storage requirements (though we must still maintain the dense matrices X^k), but more importantly it makes computing the SVD of Y^k much faster, as long as clever computational approaches and a sparse solver are used.


Proof that the algorithm gives a solution to

\min\; \tau\|X\|_* + \tfrac{1}{2}\|X\|_F^2 \quad \text{s.t. } P_\Omega(M) = P_\Omega(X) \tag{2}


Convergence Significance

Figure: Convergence towards the true value for different τ and δ values


Convergence Significance

Figure: Convergence towards the true value for different τ and δ values


Convergence Significance

As seen in the proof, the algorithm converges to the solution of:

\min\; \tau\|X\|_* + \tfrac{1}{2}\|X\|_F^2 \quad \text{s.t. } P_\Omega(M) = P_\Omega(X) \tag{3}


Convergence Significance

Why is a solution to

\min\; \tau\|X\|_* + \tfrac{1}{2}\|X\|_F^2 \quad \text{s.t. } P_\Omega(M) = P_\Omega(X) \tag{4}

satisfactory when we're looking for a solution to

\min\; \|X\|_* \quad \text{s.t. } P_\Omega(M) = P_\Omega(X) \tag{5}

(Intuitively, for large τ the nuclear norm term dominates the Frobenius term, and the solution of (4) approaches a solution of (5).)


Proof of adequacy in a more general case.


Convergence Significance

Figure: Convergence towards the true value for different τ and δ values


General Convex Constraints

Cai, Candès, and Shen extend their algorithm to the more general case, addressed in the previous proof:

\min\; f_\tau(X) \quad \text{s.t. } f_i(X) \le 0 \;\;\forall i \tag{6}

where the f_i(X)'s are convex, lower semi-continuous functionals.


Generalized Algorithm

In that case, the algorithm is as follows:

Denote \mathcal{F}(X) = (f_1(X), \ldots, f_n(X)) and initialize y^0.

X^k = \arg\min_X\; f_\tau(X) + \langle y^{k-1}, \mathcal{F}(X)\rangle, \qquad y^k = \big(y^{k-1} + \delta_k \mathcal{F}(X^k)\big)_+

In the special case where the constraints are linear, i.e. \mathcal{A}(X) \le b for some linear functional \mathcal{A}, the iterations are as follows:

X^k = D_\tau\big(\mathcal{A}^*(y^{k-1})\big), \qquad y^k = \big(y^{k-1} + \delta_k (b - \mathcal{A}(X^k))\big)_+

Consider b = (M_{i,j})_{(i,j)\in\Omega}, \mathcal{A}(X) = (X_{i,j})_{(i,j)\in\Omega}, and its adjoint \mathcal{A}^*(y) mapping y to a sparse matrix X with entries only on indices in Ω and values equal to those in y, as sketched below.
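A small sketch of these sampling operators for the matrix completion case; the names A, A_adjoint and the (row_indices, col_indices) convention are my own assumptions:

```python
import numpy as np

def A(X, idx):
    # A(X): read the entries of X on Omega as a vector; idx = (row_indices, col_indices).
    return X[idx]

def A_adjoint(y, idx, shape):
    # A*(y): scatter the vector y back onto Omega inside an otherwise zero matrix.
    X = np.zeros(shape)
    X[idx] = y
    return X
```

With b = A(M, idx), the operator-form iteration above reduces to the entrywise update from the earlier algorithm slide.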


Use Case

Noise!

If our data is noisy we can instead use the constraints |X_{i,j} − M_{i,j}| < ε ∀(i,j) ∈ Ω. These are convex, so the generalized framework above still applies.

Example

Triangulation: If the matrix in question holds the distances between points, we can fill in the relative locations from just a few entries.


Noise Free

Figure: Rank 10 matrix


Noisy

Figure: Rank 10 matrix with a little noise, using exact matrix reconstruction


Images


Stopping Criteria

Because we expect P_Ω(M − X) to converge to zero, the authors suggest using \|P_\Omega(M - X)\|_F / \|P_\Omega(M)\|_F \le \varepsilon as a stopping criterion. Because I generated my own data, I can actually plot \|M - X\|_F / \|M\|_F.

Figure: Rank 10 matrix with no noise
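A one-line check of that stopping rule, under the mask convention from the earlier sketches (the helper name should_stop is mine):

```python
import numpy as np

def should_stop(M_obs, X, Omega, eps=1e-4):
    # Stop when ||P_Omega(M - X)||_F / ||P_Omega(M)||_F <= eps.
    num = np.linalg.norm(np.where(Omega, M_obs - X, 0.0))
    den = np.linalg.norm(np.where(Omega, M_obs, 0.0))
    return num <= eps * den
```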


WORK IN PROGRESS: When can a matrix be reconstructed, and how much data is required? The most obvious issues arise when either a row or a column of P_Ω(M) is all 0. In that case nothing can be done, as that row (or column) could be totally independent of the others. Along those lines, if any row or column in the unshredded M is all 0, we are out of luck, as P_Ω(M) must then also have a 0 row (or column). Even when there are no such rows or columns in M, if any of its singular vectors are too heavily skewed in a euclidean basis direction, the likelihood of one of the rows (or columns) of P_Ω(M) being 0 is high. Also, note that an n_1 × n_2 matrix of rank r has (n_1 − r)r + r^2 + (n_2 − r)r = (n_1 + n_2 − r)r degrees of freedom.


References

Cai, J.-F., Candès, E. J., and Shen, Z. (2010). A singular value thresholding algorithm for matrix completion. SIAM J. Optim. 20, 1956–1982.

Candès, E. J. and Plan, Y. (2010). Matrix completion with noise. Proceedings of the IEEE 98, 925–936.


The End
