©Sham Kakade 2016 1
Machine Learning for Big Data CSE547/STAT548, University of Washington
Sham Kakade
May 19, 2016
Collaborative Filtering / Matrix Completion
Alternating Least Squares
Case Study 4: Collaborative Filtering
Collaborative Filtering
• Goal: Find movies of interest to a user based on movies watched by the user and others
• Methods: matrix factorization
City of God
Wild Strawberries
The Celebration
La Dolce Vita
Women on the Verge of a Nervous Breakdown
What do I recommend???
Netflix Prize
• Given 100 million ratings on a scale of 1 to 5, predict 3 million ratings to highest accuracy
• 17,770 total movies
• 480,189 total users
• Over 8.5 billion possible (user, movie) pairs in total, so most ratings are missing
• How to fill in the blanks?
Figures from Ben Recht
Matrix Completion Problem
• Filling missing data?
[Figure: ratings matrix X, with rows indexing users and columns indexing movies; X_ij is known for the black cells and unknown for the white cells]
Interpreting Low-Rank Matrix Completion (aka Matrix Factorization)
X = L R’
Identifiability of Factors
• If r_uv is described by L_u, R_v, what happens if we redefine the “topics”, e.g. as L → LQ and R → R Q^{-T} for an invertible Q?
• Then X = (LQ)(R Q^{-T})’ = L Q Q^{-1} R’ = L R’, so the predictions are unchanged and the factors are not uniquely identified
X = L R’
Matrix Completion via Rank Minimization
• Given observed values:
• Find matrix
• Such that:
• But…
• Introduce bias:
• Two issues:
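The blank bullets above correspond, presumably, to the standard rank-minimization formulation (the notation below is a reconstruction, not verbatim from the slides; Ω denotes the set of observed entries):

```latex
\min_{\hat X}\ \operatorname{rank}(\hat X)
\quad \text{such that} \quad
\hat X_{uv} = X_{uv} \quad \forall (u,v) \in \Omega
```

Two natural issues with this formulation: rank minimization is NP-hard, and the exact equality constraints leave no room for noise in the observed ratings.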
Approximate Matrix Completion
• Minimize squared error:
– (Other loss functions are possible)
• Choose rank k:
• Optimization problem:
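Written out, the optimization problem is typically the regularized squared-error objective over observed entries (a sketch in the slides' notation; λ_u and λ_v are regularization weights, Ω the observed entries):

```latex
\min_{L,\,R}\ \sum_{(u,v)\in\Omega} \big(L_u \cdot R_v - r_{uv}\big)^2
\;+\; \lambda_u \lVert L \rVert_F^2 \;+\; \lambda_v \lVert R \rVert_F^2
```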
Coordinate Descent for Matrix Factorization
• Fix movie factors, optimize for user factors
• First observation:
Minimizing Over User Factors
• For each user u:
• In matrix form:
• Second observation: Solve by
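Putting the two observations together: with R fixed, user u's subproblem is a small ridge regression over the movies V_u that u has rated, solved in closed form by a k × k linear system (a reconstruction consistent with the regularized objective, not verbatim from the slides):

```latex
\hat L_u \;=\; \Big(\sum_{v \in V_u} R_v R_v^\top \;+\; \lambda_u I\Big)^{-1} \sum_{v \in V_u} r_{uv}\, R_v
```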
Coordinate Descent for Matrix Factorization: Alternating Least-Squares
• Fix movie factors, optimize for user factors
– Independent least-squares problems over users
• Fix user factors, optimize for movie factors
– Independent least-squares problems over movies
• System may be underdetermined:
• Converges to
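The alternating scheme above can be sketched in NumPy as follows. This is a minimal illustration, not the course's reference implementation; the function and variable names (`als`, `lam`, etc.) are my own, and the per-user/per-movie ridge solves keep each small system well-determined even when a user or movie has few ratings:

```python
import numpy as np

def als(ratings, n_users, n_movies, k=2, lam=0.1, n_iters=50, seed=0):
    """Alternating least squares for matrix completion.

    ratings: list of (u, v, r) triples of observed entries.
    Returns factor matrices L (n_users x k) and R (n_movies x k),
    so that L[u] @ R[v] approximates r_uv on observed entries.
    """
    rng = np.random.default_rng(seed)
    L = rng.normal(scale=0.1, size=(n_users, k))
    R = rng.normal(scale=0.1, size=(n_movies, k))
    # Index the observations by user and by movie once, up front.
    by_user = {u: [] for u in range(n_users)}
    by_movie = {v: [] for v in range(n_movies)}
    for u, v, r in ratings:
        by_user[u].append((v, r))
        by_movie[v].append((u, r))

    for _ in range(n_iters):
        # Fix R; solve an independent ridge regression for each user.
        for u, obs in by_user.items():
            if not obs:
                continue
            Rv = np.array([R[v] for v, _ in obs])   # |V_u| x k
            rv = np.array([r for _, r in obs])
            L[u] = np.linalg.solve(Rv.T @ Rv + lam * np.eye(k), Rv.T @ rv)
        # Fix L; solve for each movie symmetrically.
        for v, obs in by_movie.items():
            if not obs:
                continue
            Lu = np.array([L[u] for u, _ in obs])
            ru = np.array([r for _, r in obs])
            R[v] = np.linalg.solve(Lu.T @ Lu + lam * np.eye(k), Lu.T @ ru)
    return L, R
```

Each outer iteration can only decrease the regularized objective, since both half-steps are exact minimizations, which is why ALS converges (to a local optimum of the bilinear problem).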
Effect of Regularization
X = L R’
What you need to know…
• Matrix completion problem for collaborative filtering
• Under-determined problem → low-rank approximation
• Rank minimization is NP-hard
• Minimize least-squares prediction for known values for given rank of matrix
– Must use regularization
• Coordinate descent algorithm = “Alternating Least Squares”
SGD for Matrix Completion, more algorithms
(Matrix-norm Minimization)
Case Study 4: Collaborative Filtering
Stochastic Gradient Descent
• Observe one rating at a time: r_uv
• Gradient observing r_uv:
• Updates:
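The update for a single observed rating r_uv can be sketched as follows (a minimal illustration; the step size `eta` and regularization weight `lam` are illustrative names, not from the slides). Only row u of L and row v of R are touched, which is what makes SGD cheap and streamable:

```python
import numpy as np

def sgd_step(L, R, u, v, r_uv, eta=0.05, lam=0.1):
    """One SGD step on the regularized squared error for a single rating.

    Loss term: (L[u] @ R[v] - r_uv)**2 + lam * (||L[u]||^2 + ||R[v]||^2)
    Updates L[u] and R[v] in place; returns the squared error before the step.
    """
    err = L[u] @ R[v] - r_uv
    # Gradients of the single-rating loss with respect to the two touched rows.
    grad_Lu = 2 * err * R[v] + 2 * lam * L[u]
    grad_Rv = 2 * err * L[u] + 2 * lam * R[v]
    L[u] -= eta * grad_Lu
    R[v] -= eta * grad_Rv
    return err ** 2
```

Sweeping repeatedly over the observed ratings with this step drives the training error down, though, as the next slide discusses, only to a local optimum of the non-convex bilinear problem.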
Local Optima v. Global Optima
• We are solving:
• We (kind of) wanted to solve:
• Which is NP-hard…
– How do these things relate?
Eigenvalue Decompositions for PSD Matrices
• Given a (square) symmetric positive semidefinite matrix:
– Eigenvalues:
• Thus rank is:
• Approximation:
• Property of trace:
• Thus, approximate rank minimization by:
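For a symmetric PSD matrix the eigenvalues are nonnegative, the rank is the number of nonzero eigenvalues, and the trace is their sum. Replacing the count by the sum gives the convex surrogate the slide points to (a sketch, reconstructed from the surrounding slides):

```latex
\operatorname{rank}(X) = \#\{i : \lambda_i(X) > 0\},
\qquad
\operatorname{tr}(X) = \sum_i \lambda_i(X),
```

```latex
\min_{X \succeq 0}\ \operatorname{tr}(X)
\quad \text{such that} \quad
X_{uv} = r_{uv} \ \ \forall (u,v) \in \Omega
```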
Generalizing the Trace Trick
• Non-square matrices have no trace
• For (square) positive semidefinite matrices, eigendecomposition:
• For rectangular matrices, singular value decomposition:
• Nuclear norm:
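The nuclear norm ||X||_* is the sum of the singular values, which is defined for any rectangular matrix; for a PSD matrix the singular values equal the eigenvalues, so it coincides with the trace. A quick numerical check (`nuclear_norm` is my own helper name):

```python
import numpy as np

def nuclear_norm(X):
    """Sum of the singular values of X (any rectangular matrix)."""
    return np.linalg.svd(X, compute_uv=False).sum()

A = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 3.0]])  # arbitrary rectangular matrix
P = A @ A.T                      # PSD by construction
```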
Nuclear Norm Minimization
• Optimization problem:
• Possible to relax equality constraints:
• Both are convex problems! (solved by semidefinite programming)
Nuclear Norm Minimization vs. Direct (Bilinear) Low Rank Solutions
• Nuclear norm minimization:
– Annoying because:
• Instead:
– Annoying because:
– But
• So
• And
• Under certain conditions [Burer, Monteiro ’04]
Theory
• Suppose the true matrix is exactly low rank; what might we hope for?
– Exact recovery?
– Statistically?
– Computationally?
• Is this possible?
• Assumptions:
Analysis of Nuclear Norm
• Nuclear norm minimization = convex relaxation of rank minimization:
• Theorem [Candès, Recht ’08]:
– If there is a true matrix of rank k,
– and we observe at least … random entries of the true matrix,
– then the true matrix is recovered exactly with high probability via convex nuclear norm minimization!
• Under certain conditions
Alternating Minimization
• Alt. Min. used in practice.
• Alt. Min. was widely thought to be just a search heuristic; does it work?
SGD
• SGD also widely used in practice (streaming, fast)
• Does it work?
What you need to know…
• Stochastic gradient descent for matrix factorization
• Norm minimization as convex relaxation of rank minimization
– Trace norm for PSD matrices
– Nuclear norm in general
• Intuitive relationship between nuclear norm minimization and direct (bilinear) minimization
Machine Learning for Big Data CSE547/STAT548, University of Washington
Emily Fox
May 6th, 2015
Nonnegative Matrix Factorization
Projected Gradient
Case Study 4: Collaborative Filtering
Matrix factorization solutions can be unintuitive…
• Many, many, many applications of matrix factorization
• E.g., in text data, can do topic modeling (alternative to LDA):
• Would like:
• But…
X = L R’
Nonnegative Matrix Factorization
• Just like before, but with nonnegativity constraints on the factors: L ≥ 0, R ≥ 0
• Constrained optimization problem
– Many, many, many, many solution methods… we’ll check out a simple one
X = L R’
Recall: Projected Gradient
• Standard optimization:
– Want to minimize:
– Use, e.g., gradient updates:
• Constrained optimization:
– Given convex set C of feasible solutions
– Want to find minima within C:
• Projected gradient:
– Take a gradient step (ignoring constraints):
– Projection into feasible set:
Projected Stochastic Gradient Descent for Nonnegative Matrix Factorization
• Gradient step observing r_uv, ignoring constraints:
• Convex set:
• Projection step:
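The three bullets above can be sketched as projected SGD: take the same single-rating gradient step as in plain matrix factorization, then project the two touched rows onto the feasible set, which for nonnegative factors is just elementwise clipping at zero (a minimal sketch; the names are my own):

```python
import numpy as np

def projected_sgd_step(L, R, u, v, r_uv, eta=0.05, lam=0.01):
    """One projected-SGD step for nonnegative matrix factorization.

    Gradient step on (L[u] @ R[v] - r_uv)^2 plus regularization, then
    projection of the touched rows onto C = {nonnegative factors}.
    """
    err = L[u] @ R[v] - r_uv
    L[u] -= eta * (2 * err * R[v] + 2 * lam * L[u])
    R[v] -= eta * (2 * err * L[u] + 2 * lam * R[v])
    # Projection onto the nonnegative orthant is elementwise max(., 0).
    np.maximum(L[u], 0.0, out=L[u])
    np.maximum(R[v], 0.0, out=R[v])
```

The projection is cheap here precisely because the feasible set is a product of simple per-coordinate constraints; for more complicated convex sets C the projection itself can be the expensive step.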
What you need to know…
• In many applications, want factors to be nonnegative
• Corresponds to constrained optimization problem
• Many possible approaches to solve, e.g., projected gradient
Cold Start Problem
Case Study 4: Collaborative Filtering
Cold-Start Problem
• Challenge: Cold-start problem (new movie or user)
• Methods: use features of movie/user
Cold-Start Problem More Formally
• Consider a new user u’ and predicting that user’s ratings
– No previous observations
– Objective considered so far:
– Optimal user factor:
– Predicted user ratings:
An Alternative Formulation
• A simpler model for collaborative filtering
– We would not have this issue if we assumed all users were identical
– What about for new movies? What if we had side information?
– What dimension should w be?
– Fit linear model:
– Minimize:
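If every movie v comes with a fixed feature vector φ(v) (so w has the same dimension as φ), the bullets above amount to ordinary regularized least squares on all ratings pooled across users. A sketch with illustrative names (`fit_global_model`, `movie_feats` are my own, not from the slides):

```python
import numpy as np

def fit_global_model(movie_feats, ratings, lam=0.1):
    """Fit w minimizing sum over observed (v, r) of (w @ phi_v - r)^2 + lam * ||w||^2.

    movie_feats: (n_movies, d) array, row v is the feature vector phi(v).
    ratings: list of (v, r) pairs pooled over all users ("wisdom of the crowd").
    """
    d = movie_feats.shape[1]
    Phi = np.array([movie_feats[v] for v, _ in ratings])
    r = np.array([r for _, r in ratings])
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ r)

# A brand-new user then gets the crowd prediction w @ phi(v) for any movie v,
# including movies with no ratings yet, as long as phi(v) is available.
```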
Personalization
• If we don’t have any observations about a user, use wisdom of the crowd
– Addresses the cold-start problem
• Clearly, not all users are the same
• Just as in personalized click prediction, consider model with global and user-specific parameters
• As we gain more information about the user, forget the crowd
User Features…
• In addition to movie features, may have information about the user:
• Combine with features of movie:
• Unified linear model:
Feature-Based Approach vs. Matrix Factorization
• Feature-based approach:
– Feature representation of users and movies is fixed
– Can address cold-start problem
• Matrix factorization approach:
– Suffers from cold-start problem
– User & movie features are learned from data
• A unified model:
Unified Collaborative Filtering via SGD
• Gradient step observing r_uv:
– For L, R:
– For w and w_u:
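One way to read the unified model: predict r̂_uv = (w + w_u) · φ(u,v) + L_u · R_v and update all parameters on each observed rating. A sketch (the split into global w, user-specific w_u, and latent L, R follows the slides; the function and argument names are my own):

```python
import numpy as np

def unified_sgd_step(w, w_user, L, R, phi, u, v, r_uv, eta=0.05, lam=0.1):
    """One SGD step for the unified feature + matrix-factorization model.

    Prediction: (w + w_user[u]) @ phi + L[u] @ R[v]
    phi is the feature vector phi(u, v) for this (user, movie) pair.
    Updates all parameters in place; returns the signed error before the step.
    """
    err = (w + w_user[u]) @ phi + L[u] @ R[v] - r_uv
    # Feature-based parameters: global and per-user.
    w -= eta * (2 * err * phi + 2 * lam * w)
    w_user[u] -= eta * (2 * err * phi + 2 * lam * w_user[u])
    # Latent factors, exactly as in plain matrix factorization.
    L[u] -= eta * (2 * err * R[v] + 2 * lam * L[u])
    R[v] -= eta * (2 * err * L[u] + 2 * lam * R[v])
    return err
```

With regularization, a user with few ratings keeps w_u and L_u near zero and is served mostly by the crowd model w; as ratings accumulate, the personalized terms take over.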
What you need to know…
• Cold-start problem
• Feature-based methods for collaborative filtering
– Help address cold-start problem
• Unified approach