©Sham Kakade 2016 1
Machine Learning for Big Data CSE547/STAT548, University of Washington
Sham Kakade
May 19, 2016
Collaborative Filtering / Matrix Completion
Alternating Least Squares
Case Study 4: Collaborative Filtering
Collaborative Filtering
• Goal: Find movies of interest to a user based on movies watched by the user and others
• Methods: matrix factorization
City of God
Wild Strawberries
The Celebration
La Dolce Vita
Women on the Verge of a Nervous Breakdown
What do I recommend???
Netflix Prize
• Given 100 million ratings on a scale of 1 to 5, predict 3 million ratings to highest accuracy
• 17,770 total movies
• 480,189 total users
• Over 8.5 billion possible (user, movie) pairs in total, so most ratings are missing
• How to fill in the blanks?
Figures from Ben Recht
Matrix Completion Problem
• Filling missing data?
[Figure: ratings matrix X, with rows indexing users and columns indexing movies; X_ij is known for the black cells and unknown for the white cells]
Interpreting Low-Rank Matrix Completion (aka Matrix Factorization)
X = L R’
Identifiability of Factors
• If r_uv is described by L_u, R_v, what happens if we redefine the “topics”, e.g. as L → LQ and R → R Q^{-T} for an invertible Q?
• Then X = (LQ)(R Q^{-T})’ = L Q Q^{-1} R’ = L R’, so the predictions are unchanged and the factors are not uniquely identified
X = L R’
Matrix Completion via Rank Minimization
• Given observed values:
• Find matrix
• Such that:
• But…
• Introduce bias:
• Two issues:
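The blank bullets above correspond, presumably, to the standard rank-minimization formulation (the notation below is a reconstruction, not verbatim from the slides; Ω denotes the set of observed entries):

```latex
\min_{\hat X}\ \operatorname{rank}(\hat X)
\quad \text{such that} \quad
\hat X_{uv} = X_{uv} \quad \forall (u,v) \in \Omega
```

Two natural issues with this formulation: rank minimization is NP-hard, and the exact equality constraints leave no room for noise in the observed ratings.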
Approximate Matrix Completion
• Minimize squared error:
– (Other loss functions are possible)
• Choose rank k:
• Optimization problem:
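Written out, the optimization problem is typically the regularized squared-error objective over observed entries (a sketch in the slides' notation; λ_u and λ_v are regularization weights, Ω the observed entries):

```latex
\min_{L,\,R}\ \sum_{(u,v)\in\Omega} \big(L_u \cdot R_v - r_{uv}\big)^2
\;+\; \lambda_u \lVert L \rVert_F^2 \;+\; \lambda_v \lVert R \rVert_F^2
```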
Coordinate Descent for Matrix Factorization
• Fix movie factors, optimize for user factors
• First observation:
Minimizing Over User Factors
• For each user u:
• In matrix form:
• Second observation: Solve by
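Putting the two observations together: with R fixed, user u's subproblem is a small ridge regression over the movies V_u that u has rated, solved in closed form by a k × k linear system (a reconstruction consistent with the regularized objective, not verbatim from the slides):

```latex
\hat L_u \;=\; \Big(\sum_{v \in V_u} R_v R_v^\top \;+\; \lambda_u I\Big)^{-1} \sum_{v \in V_u} r_{uv}\, R_v
```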
Coordinate Descent for Matrix Factorization: Alternating Least-Squares
• Fix movie factors, optimize for user factors
– Independent least-squares problems over users
• Fix user factors, optimize for movie factors
– Independent least-squares problems over movies
• System may be underdetermined:
• Converges to
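The alternating scheme above can be sketched in NumPy as follows. This is a minimal illustration, not the course's reference implementation; the function and variable names (`als`, `lam`, etc.) are my own, and the per-user/per-movie ridge solves keep each small system well-determined even when a user or movie has few ratings:

```python
import numpy as np

def als(ratings, n_users, n_movies, k=2, lam=0.1, n_iters=50, seed=0):
    """Alternating least squares for matrix completion.

    ratings: list of (u, v, r) triples of observed entries.
    Returns factor matrices L (n_users x k) and R (n_movies x k),
    so that L[u] @ R[v] approximates r_uv on observed entries.
    """
    rng = np.random.default_rng(seed)
    L = rng.normal(scale=0.1, size=(n_users, k))
    R = rng.normal(scale=0.1, size=(n_movies, k))
    # Index the observations by user and by movie once, up front.
    by_user = {u: [] for u in range(n_users)}
    by_movie = {v: [] for v in range(n_movies)}
    for u, v, r in ratings:
        by_user[u].append((v, r))
        by_movie[v].append((u, r))

    for _ in range(n_iters):
        # Fix R; solve an independent ridge regression for each user.
        for u, obs in by_user.items():
            if not obs:
                continue
            Rv = np.array([R[v] for v, _ in obs])   # |V_u| x k
            rv = np.array([r for _, r in obs])
            L[u] = np.linalg.solve(Rv.T @ Rv + lam * np.eye(k), Rv.T @ rv)
        # Fix L; solve for each movie symmetrically.
        for v, obs in by_movie.items():
            if not obs:
                continue
            Lu = np.array([L[u] for u, _ in obs])
            ru = np.array([r for _, r in obs])
            R[v] = np.linalg.solve(Lu.T @ Lu + lam * np.eye(k), Lu.T @ ru)
    return L, R
```

Each outer iteration can only decrease the regularized objective, since both half-steps are exact minimizations, which is why ALS converges (to a local optimum of the bilinear problem).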
Effect of Regularization
X = L R’
What you need to know…
• Matrix completion problem for collaborative filtering
• Under-determined problem → low-rank approximation
• Rank minimization is NP-hard
• Minimize least-squares prediction for known values for given rank of matrix
– Must use regularization
• Coordinate descent algorithm = “Alternating Least Squares”
SGD for Matrix Completion, more algorithms
(Matrix-norm Minimization)
Case Study 4: Collaborative Filtering
Stochastic Gradient Descent
• Observe one rating at a time: r_uv
• Gradient observing r_uv:
• Updates:
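The update for a single observed rating r_uv can be sketched as follows (a minimal illustration; the step size `eta` and regularization weight `lam` are illustrative names, not from the slides). Only row u of L and row v of R are touched, which is what makes SGD cheap and streamable:

```python
import numpy as np

def sgd_step(L, R, u, v, r_uv, eta=0.05, lam=0.1):
    """One SGD step on the regularized squared error for a single rating.

    Loss term: (L[u] @ R[v] - r_uv)**2 + lam * (||L[u]||^2 + ||R[v]||^2)
    Updates L[u] and R[v] in place; returns the squared error before the step.
    """
    err = L[u] @ R[v] - r_uv
    # Gradients of the single-rating loss with respect to the two touched rows.
    grad_Lu = 2 * err * R[v] + 2 * lam * L[u]
    grad_Rv = 2 * err * L[u] + 2 * lam * R[v]
    L[u] -= eta * grad_Lu
    R[v] -= eta * grad_Rv
    return err ** 2
```

Sweeping repeatedly over the observed ratings with this step drives the training error down, though, as the next slide discusses, only to a local optimum of the non-convex bilinear problem.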
Local Optima v. Global Optima
• We are solving:
• We (kind of) wanted to solve:
• Which is NP-hard…
– How do these things relate?
Eigenvalue Decompositions for PSD Matrices
• Given a (square) symmetric positive semidefinite matrix:
– Eigenvalues:
• Thus rank is:
• Approximation:
• Property of trace:
• Thus, approximate rank minimization by:
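For a symmetric PSD matrix the eigenvalues are nonnegative, the rank is the number of nonzero eigenvalues, and the trace is their sum. Replacing the count by the sum gives the convex surrogate the slide points to (a sketch, reconstructed from the surrounding slides):

```latex
\operatorname{rank}(X) = \#\{i : \lambda_i(X) > 0\},
\qquad
\operatorname{tr}(X) = \sum_i \lambda_i(X),
```

```latex
\min_{X \succeq 0}\ \operatorname{tr}(X)
\quad \text{such that} \quad
X_{uv} = r_{uv} \ \ \forall (u,v) \in \Omega
```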
Generalizing the Trace Trick
• Non-square matrices have no trace
• For (square) positive semidefinite matrices, eigendecomposition:
• For rectangular matrices, singular value decomposition:
• Nuclear norm:
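The nuclear norm ||X||_* is the sum of the singular values, which is defined for any rectangular matrix; for a PSD matrix the singular values equal the eigenvalues, so it coincides with the trace. A quick numerical check (`nuclear_norm` is my own helper name):

```python
import numpy as np

def nuclear_norm(X):
    """Sum of the singular values of X (any rectangular matrix)."""
    return np.linalg.svd(X, compute_uv=False).sum()

A = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 3.0]])  # arbitrary rectangular matrix
P = A @ A.T                      # PSD by construction
```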
Nuclear Norm Minimization
• Optimization problem:
• Possible to relax equality constraints:
• Both are convex problems! (solved by semidefinite programming)
Nuclear Norm Minimization vs. Direct (Bilinear) Low Rank Solutions
• Nuclear norm minimization:
– Annoying because:
• Instead:
– Annoying because:
– But
• So
• And
• Under certain conditions [Burer, Monteiro ’04]
Theory
• Suppose the true matrix is exactly low rank; what might we hope for?
– Exact recovery?
– Statistically?
– Computationally?
• Is this possible?
• Assumptions:
Analysis of Nuclear Norm
• Nuclear norm minimization = convex relaxation of rank minimization:
• Theorem [Candès, Recht ’08]:
– If there is a true matrix of rank k,
– and we observe at least … random entries of the true matrix,
– then the true matrix is recovered exactly with high probability via convex nuclear norm minimization!
• Under certain conditions
Alternating Minimization
• Alt. Min. used in practice.
• Alt. Min. was widely thought to be just a search heuristic; does it work?
SGD
• SGD also widely used in practice (streaming, fast)
• Does it work?
What you need to know…
• Stochastic gradient descent for matrix factorization
• Norm minimization as convex relaxation of rank minimization
– Trace norm for PSD matrices
– Nuclear norm in general
• Intuitive relationship between nuclear norm minimization and direct (bilinear) minimization
Machine Learning for Big Data CSE547/STAT548, University of Washington
Emily Fox
May 6th, 2015
Nonnegative Matrix Factorization
Projected Gradient
Case Study 4: Collaborative Filtering
Matrix factorization solutions can be unintuitive…
• Many, many, many applications of matrix factorization
• E.g., in text data, can do topic modeling (alternative to LDA):
• Would like:
• But…
X = L R’
Nonnegative Matrix Factorization
• Just like before, but with nonnegativity constraints on the factors: L ≥ 0, R ≥ 0
• Constrained optimization problem
– Many, many, many, many solution methods… we’ll check out a simple one
X = L R’
Recall: Projected Gradient
• Standard optimization:
– Want to minimize:
– Use, e.g., gradient updates:
• Constrained optimization:
– Given convex set C of feasible solutions
– Want to find minima within C:
• Projected gradient:
– Take a gradient step (ignoring constraints):
– Projection into feasible set:
Projected Stochastic Gradient Descent for Nonnegative Matrix Factorization
• Gradient step observing r_uv, ignoring constraints:
• Convex set:
• Projection step:
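The three bullets above can be sketched as projected SGD: take the same single-rating gradient step as in plain matrix factorization, then project the two touched rows onto the feasible set, which for nonnegative factors is just elementwise clipping at zero (a minimal sketch; the names are my own):

```python
import numpy as np

def projected_sgd_step(L, R, u, v, r_uv, eta=0.05, lam=0.01):
    """One projected-SGD step for nonnegative matrix factorization.

    Gradient step on (L[u] @ R[v] - r_uv)^2 plus regularization, then
    projection of the touched rows onto C = {nonnegative factors}.
    """
    err = L[u] @ R[v] - r_uv
    L[u] -= eta * (2 * err * R[v] + 2 * lam * L[u])
    R[v] -= eta * (2 * err * L[u] + 2 * lam * R[v])
    # Projection onto the nonnegative orthant is elementwise max(., 0).
    np.maximum(L[u], 0.0, out=L[u])
    np.maximum(R[v], 0.0, out=R[v])
```

The projection is cheap here precisely because the feasible set is a product of simple per-coordinate constraints; for more complicated convex sets C the projection itself can be the expensive step.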
What you need to know…
• In many applications, want factors to be nonnegative
• Corresponds to constrained optimization problem
• Many possible approaches to solve, e.g., projected gradient
Cold Start Problem
Case Study 4: Collaborative Filtering
Cold-Start Problem
• Challenge: Cold-start problem (new movie or user)
• Methods: use features of movie/user
Cold-Start Problem More Formally
• Consider a new user u’ and predicting that user’s ratings
– No previous observations
– Objective considered so far:
– Optimal user factor:
– Predicted user ratings:
An Alternative Formulation
• A simpler model for collaborative filtering
– We would not have this issue if we assumed all users were identical
– What about for new movies? What if we had side information?
– What dimension should w be?
– Fit linear model:
– Minimize:
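If every movie v comes with a fixed feature vector φ(v) (so w has the same dimension as φ), the bullets above amount to ordinary regularized least squares on all ratings pooled across users. A sketch with illustrative names (`fit_global_model`, `movie_feats` are my own, not from the slides):

```python
import numpy as np

def fit_global_model(movie_feats, ratings, lam=0.1):
    """Fit w minimizing sum over observed (v, r) of (w @ phi_v - r)^2 + lam * ||w||^2.

    movie_feats: (n_movies, d) array, row v is the feature vector phi(v).
    ratings: list of (v, r) pairs pooled over all users ("wisdom of the crowd").
    """
    d = movie_feats.shape[1]
    Phi = np.array([movie_feats[v] for v, _ in ratings])
    r = np.array([r for _, r in ratings])
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ r)

# A brand-new user then gets the crowd prediction w @ phi(v) for any movie v,
# including movies with no ratings yet, as long as phi(v) is available.
```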
Personalization
• If we don’t have any observations about a user, use wisdom of the crowd
– Addresses the cold-start problem
• Clearly, not all users are the same
• Just as in personalized click prediction, consider model with global and user-specific parameters
• As we gain more information about the user, forget the crowd
User Features…
• In addition to movie features, may have information about the user:
• Combine with features of movie:
• Unified linear model:
Feature-Based Approach vs. Matrix Factorization
• Feature-based approach:
– Feature representation of users and movies is fixed
– Can address cold-start problem
• Matrix factorization approach:
– Suffers from cold-start problem
– User & movie features are learned from data
• A unified model:
Unified Collaborative Filtering via SGD
• Gradient step observing r_uv:
– For L, R:
– For w and w_u:
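One way to read the unified model: predict r̂_uv = (w + w_u) · φ(u,v) + L_u · R_v and update all parameters on each observed rating. A sketch (the split into global w, user-specific w_u, and latent L, R follows the slides; the function and argument names are my own):

```python
import numpy as np

def unified_sgd_step(w, w_user, L, R, phi, u, v, r_uv, eta=0.05, lam=0.1):
    """One SGD step for the unified feature + matrix-factorization model.

    Prediction: (w + w_user[u]) @ phi + L[u] @ R[v]
    phi is the feature vector phi(u, v) for this (user, movie) pair.
    Updates all parameters in place; returns the signed error before the step.
    """
    err = (w + w_user[u]) @ phi + L[u] @ R[v] - r_uv
    # Feature-based parameters: global and per-user.
    w -= eta * (2 * err * phi + 2 * lam * w)
    w_user[u] -= eta * (2 * err * phi + 2 * lam * w_user[u])
    # Latent factors, exactly as in plain matrix factorization.
    L[u] -= eta * (2 * err * R[v] + 2 * lam * L[u])
    R[v] -= eta * (2 * err * L[u] + 2 * lam * R[v])
    return err
```

With regularization, a user with few ratings keeps w_u and L_u near zero and is served mostly by the crowd model w; as ratings accumulate, the personalized terms take over.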
What you need to know…
• Cold-start problem
• Feature-based methods for collaborative filtering
– Help address cold-start problem
• Unified approach