
CS910: Foundations of Data Analytics
Graham Cormode
G.Cormode@warwick.ac.uk

Recommender Systems


Objectives

- To understand the concept of recommendation
- To see neighbour-based methods
- To see latent factor methods
- To see how recommender systems are evaluated


Recommendations

- A modern problem: a very large number of possible items
  - Which item should I try next, based on my preferences?
- Arises in many different places:
  - Recommendations of content: books, music, movies, videos...
  - Recommendations of places to travel, hotels, restaurants
  - Recommendations of food to eat, sites to visit
  - Recommendations of articles to read: news, research, gossip
- Each person has different preferences/interests
  - How to elicit and model these preferences?
  - How to customize recommendations to a particular individual?


Recommendations in the Wild


Recommender Systems

- Recommender systems produce tailored recommendations
  - Inputs: ratings of items by users; possibly also user profiles and item profiles
  - Outputs: for a given user, a list of recommended items; or, for a given (user, item) pair, a predicted rating
- Ratings can be in many forms:
  - "Star rating" (out of 5)
  - Binary rating (thumbs up / thumbs down)
  - Likert scale (strongly like, like, neutral, dislike, strongly dislike)
  - Comparisons: prefer X to Y
- Will use movie recommendation as a running example


Ratings Matrix

         Item 1  Item 2  Item 3  Item 4  Item 5  ...
User 1     5       3       1       2
User 2     2               ?
User 3                     5
User 4     3       4       2       1
User 5     ?                       4
...


- n × m matrix of ratings R, where r_{u,i} is the rating of user u for item i
- Typically, the matrix is large and sparse:
  - Thousands of users (n) and thousands of items (m)
  - Each user has rated only a few items
  - Each item is rated by at most a small fraction of users
- Goal is to provide predictions p_{u,i} for certain (user, item) pairs
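In code, such a sparse ratings matrix is usually stored as a map from the observed (user, item) pairs to ratings, rather than as a dense n × m array. A minimal sketch in Python (the example entries are illustrative, in the spirit of the table above; the exact cell positions are my own assumption):

```python
# Sparse ratings matrix R: store only observed entries as a map from
# (user, item) pairs to ratings -- far smaller than an n x m array
# when each user has rated only a few items.
ratings = {
    (1, 1): 5, (1, 2): 3, (1, 3): 1, (1, 4): 2,
    (2, 1): 2,
    (3, 3): 5,
    (4, 1): 3, (4, 2): 4, (4, 3): 2, (4, 4): 1,
    (5, 4): 4,
}

def rating(u, i):
    """The observed rating r_{u,i}, or None if user u has not rated item i."""
    return ratings.get((u, i))

def items_rated_by(u):
    """The set of items that user u has rated."""
    return {i for (v, i) in ratings if v == u}
```

A prediction p_{u,i} is only needed for pairs where `rating(u, i)` returns None.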


Evaluating a recommender system

- Evaluation is similar to evaluating classifiers:
  - Break labelled data into training and test data
  - For each test (user, item) pair, predict the user's score for the item
  - Measure the difference, and aim to minimize it over the N tests
- Combine the differences into a single score to compare systems
  - Most common: Root Mean Square Error (RMSE) between p_{u,i} and r_{u,i}:
    RMSE = √( Σ_{(u,i)} (p_{u,i} − r_{u,i})² / N )
  - Sometimes also use Mean Absolute Error (MAE): Σ_{(u,i)} |p_{u,i} − r_{u,i}| / N
  - If recommendations are simply 'good' or 'bad', can use precision and recall
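Both error measures are straightforward to compute; a small sketch (the dictionaries of predicted and true ratings are hypothetical):

```python
import math

def rmse(preds, truth):
    """Root Mean Square Error over the N test (user, item) pairs in `truth`."""
    n = len(truth)
    return math.sqrt(sum((preds[k] - truth[k]) ** 2 for k in truth) / n)

def mae(preds, truth):
    """Mean Absolute Error over the same test pairs."""
    return sum(abs(preds[k] - truth[k]) for k in truth) / len(truth)

# Hypothetical test set: (user, item) -> rating
truth = {("u1", "i1"): 4, ("u1", "i2"): 2, ("u2", "i1"): 5}
preds = {("u1", "i1"): 3, ("u1", "i2"): 2, ("u2", "i1"): 5}
```

Note that RMSE penalizes large errors more heavily than MAE: here one prediction is off by 1, giving RMSE ≈ 0.577 but MAE ≈ 0.333.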


Initial attempts

- Can we use existing methods: classification, regression, etc.?
- Assume we have features for each user and each item:
  - User: demographics, stated preferences
  - Item: e.g. genre, director, actors
- Can treat as a classification problem: predict a score
  - Train a classifier from labelled examples
- Limitations of the classifier approach:
  - Don't necessarily have user and item information
  - Ignores what we do have: lots of ratings between users and items
    - Ratings are hard to use as features, unless everyone has rated the same fixed set of items


Neighbourhood method

- Neighbourhood-based collaborative filtering
  - Users "collaborate" to help recommend (filter) items
  1. Find k other users K who are similar to the target user u
     - Possibly assign a weight w_{u,v} based on how similar each v is to u
  2. Combine the k users' (weighted) preferences
     - Use these to make predictions for u
- Can use existing methods to measure similarity:
  - PMCC to measure the correlation of ratings, as w_{u,v}
  - Cosine similarity of rating vectors
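Both similarity measures can be computed directly from two users' ratings, restricted to the items they have both rated. A sketch (the function names are my own):

```python
import math

def pearson(ra, rb):
    """PMCC of two users' ratings over their co-rated items (used as w_{u,v}).
    ra, rb map item -> rating."""
    common = sorted(set(ra) & set(rb))
    if len(common) < 2:
        return 0.0                      # correlation undefined on < 2 points
    xs = [ra[i] for i in common]
    ys = [rb[i] for i in common]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs)
                    * sum((y - my) ** 2 for y in ys))
    return num / den if den else 0.0

def cosine(ra, rb):
    """Cosine similarity of the two rating vectors over co-rated items."""
    common = sorted(set(ra) & set(rb))
    num = sum(ra[i] * rb[i] for i in common)
    den = math.sqrt(sum(ra[i] ** 2 for i in common)
                    * sum(rb[i] ** 2 for i in common))
    return num / den if den else 0.0
```

Pearson correlation is invariant to each user's offset and scale, so two users who agree on relative preferences score 1 even if one rates everything higher than the other.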


Neighbourhood example (unweighted)

- 3 users like the same set of movies as Joe (an exact match)
  - All three like "Saving Private Ryan", so this is the top recommendation


Different Rating Scales

- Every user rates slightly differently
  - Some consistently rate high, some consistently rate low
- Using PMCC avoids this effect when picking neighbours, but an adjustment is needed when making predictions
- Make an adjustment when computing a score:
  - Predict: p_{u,i} = r̄_u + ( Σ_{v∈K} (r_{v,i} − r̄_v) w_{u,v} ) / ( Σ_{v∈K} w_{u,v} )
  - r̄_u: average rating for user u
  - w_{u,v}: weight assigned to user v based on their similarity to u, e.g. the correlation coefficient
  - p_{u,i} computes the weighted deviation from each neighbour v's average score, and adds it onto u's average score
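The adjusted prediction can be sketched as follows (names and toy data are my own; in practice the neighbour set K would come from a k-nearest-neighbour search using the similarity weights):

```python
def mean(user_ratings):
    """Average of a user's ratings (a dict item -> rating)."""
    return sum(user_ratings.values()) / len(user_ratings)

def predict(u_ratings, neighbours, item):
    """p_{u,i} = rbar_u + sum_{v in K} (r_{v,i} - rbar_v) * w_{u,v}
                        / sum_{v in K} w_{u,v}

    u_ratings:  the target user's ratings (item -> rating).
    neighbours: list of (v_ratings, w_uv) pairs, one per similar user v
                who has rated `item`.
    """
    rbar_u = mean(u_ratings)
    num = sum((vr[item] - mean(vr)) * w for vr, w in neighbours)
    den = sum(w for _, w in neighbours)
    return rbar_u + num / den if den else rbar_u
```

For example, a target user with average rating 3 and a single neighbour (weight 1.0) who rated the item 1 above their own average gets a prediction of 4.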


Item-based Collaborative Filtering

- Often there are many more users than items
  - E.g. only a few thousand movies available, but millions of users
  - Comparing to all users can be slow
- Can do neighbourhood-based filtering using items instead:
  - Two items are similar if the users rating both rate them similarly
  - Compute the PMCC between the ratings of users who rated both, as w_{i,j}
  - Find the k most similar items J
  - Compute a simple weighted average: p_{u,i} = ( Σ_{j∈J} r_{u,j} w_{i,j} ) / ( Σ_{j∈J} w_{i,j} )
    - No adjustment by the mean, as we assume no bias from items
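The item-based prediction is a plain weighted average; a sketch (the similar-item list J with its weights w_{i,j} is assumed to have been computed already):

```python
def predict_item_based(u_ratings, similar_items):
    """p_{u,i} = sum_{j in J} r_{u,j} * w_{i,j} / sum_{j in J} w_{i,j}.

    u_ratings:     the user's ratings (item -> rating).
    similar_items: list of (j, w_ij) pairs -- the k items most similar
                   to the target item i that user u has rated.
    """
    num = sum(u_ratings[j] * w for j, w in similar_items)
    den = sum(w for _, w in similar_items)
    return num / den if den else 0.0
```

With equal weights this reduces to the plain average of the user's ratings of the similar items; unequal weights pull the prediction toward the most similar items.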


Latent Factor Analysis

- We rejected methods based on features of items, since we could not guarantee they would be available
- Latent factor analysis tries to find "hidden" features from the ratings matrix:
  - Factors might correspond to recognisable features like genre
  - Other factors: child-friendly, comedic, light/dark
  - More abstract: depth of character, quirkiness
  - Could find factors that are hard to interpret


Latent Factor Example


Matrix Factorization

- Model each user and item as a vector of (inferred) factors
  - Let q_i be the vector for item i, and w_u the vector for user u
  - The predicted rating p_{u,i} is then the dot product w_u · q_i
- How to learn the factors from the given data?
  - Given the ratings matrix R, try to express R as a product W Q
    - W is an n × f matrix of users and their latent factors
    - Q is an f × m matrix of items and their latent factors
  - A matrix factorization problem: factor R into W Q
- Can be solved by the Singular Value Decomposition


Singular Value Decomposition

- Given an m × n matrix M, decompose it into M = U Σ Vᵀ, where:
  - U is an m × m matrix with orthogonal columns [left singular vectors]
  - Σ is a rectangular m × n diagonal matrix [singular values]
  - Vᵀ is an n × n matrix with orthogonal rows [right singular vectors]
- The Singular Value Decomposition is highly structured:
  - The singular values are the square roots of the eigenvalues of M Mᵀ
  - The left (right) singular vectors are eigenvectors of M Mᵀ (Mᵀ M)
- SVD can be used to give approximate representations:
  - Take the k largest singular values, set the rest to zero
  - Picks out the k most important "directions"
  - Gives the k latent factors to describe the data
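The truncation step can be seen directly with NumPy (a sketch on a small random matrix):

```python
import numpy as np

# Decompose a (complete) matrix M = U @ diag(s) @ Vt, then keep only
# the k largest singular values to get the best rank-k approximation.
rng = np.random.default_rng(0)
M = rng.standard_normal((6, 4))

U, s, Vt = np.linalg.svd(M, full_matrices=False)   # s is sorted descending
k = 2
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The Frobenius-norm error of the rank-k approximation is exactly the
# norm of the discarded singular values (Eckart-Young theorem).
err = np.linalg.norm(M - M_k)
tail = np.sqrt(np.sum(s[k:] ** 2))
```

The rows of U[:, :k] (columns of Vt[:k, :]) give each user (item) a description in terms of the k latent directions.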


SVD for recommender systems

- Textbook SVD doesn't work when the matrix has missing values!
  - Could try to fill in the missing values somehow, then factor
- Instead, set it up as an optimization problem:
  - Learn length-k vectors q_i, w_u to solve:
    min_{q,w} Σ_{(u,i)∈R} (r_{u,i} − q_i · w_u)²
  - Minimize the squared error between the predicted and true values
- If we had a complete matrix, SVD would solve this problem:
  - Set W = U_k Σ_k^{1/2} and Q = Σ_k^{1/2} V_kᵀ
    - U_k, V_k hold the singular vectors corresponding to the k largest singular values
- Additional problem: too much freedom (not enough ratings)
  - Risk of overfitting the training data and failing to generalize


Regularization

- Regularization is a technique used in many places
  - Here, avoid overfitting by penalizing large parameter values
  - Achieve this by adding the size of the parameters to the optimization:
    min_{q,w} Σ_{(u,i)∈R} (r_{u,i} − q_i · w_u)² + λ (‖q_i‖₂² + ‖w_u‖₂²)
  - ‖x‖₂² is the squared L2 (Euclidean) norm: the sum of squared values
  - The effect is to shrink the values of q and w towards 0, keeping the model simple
- Many different forms of regularization:
  - L2 regularization: add terms of the form ‖x‖₂²
  - L1 regularization: add terms of the form ‖x‖₁ (can give sparser solutions)
- The form of the regularization should fit the optimization
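A sketch of evaluating the regularized objective for given factors (the names are my own, and the L2 penalty is charged once per observed rating, one common convention):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def regularized_loss(ratings, q, w, lam):
    """sum over observed (u, i) of
       (r_{u,i} - q_i . w_u)^2 + lam * (||q_i||^2 + ||w_u||^2).

    ratings: (u, i) -> rating; q, w: maps from item / user to factor vector.
    """
    total = 0.0
    for (u, i), r in ratings.items():
        err = r - dot(q[i], w[u])
        total += err ** 2 + lam * (dot(q[i], q[i]) + dot(w[u], w[u]))
    return total
```

With lam = 0 this is the plain squared-error objective; increasing lam trades training error for smaller factor vectors.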


Solving the optimization: Gradient Descent

- How to solve min_{q,w} Σ_{(u,i)∈R} (r_{u,i} − q_i · w_u)² + λ (‖q_i‖₂² + ‖w_u‖₂²)?
- Gradient descent (applied one example at a time, i.e. stochastic gradient descent):
  - For each training example, find the error of the current prediction:
    e_{u,i} = r_{u,i} − q_i · w_u
  - Modify the parameters by taking a step in the direction of the negative gradient:
    q_i ← q_i + γ (e_{u,i} w_u − λ q_i)   [from the derivative of the objective with respect to q_i]
    w_u ← w_u + γ (e_{u,i} q_i − λ w_u)   [from the derivative with respect to w_u]
  - γ is a parameter that controls the speed of descent
- Advantages and disadvantages of gradient descent:
  - ++ Fairly easy to implement: the update at each step is easy to compute
  - −− Can be slow: hard to parallelize
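The update rules translate directly into code; a small stochastic-gradient sketch (function names, initialisation scale, and all parameter values are illustrative choices of mine):

```python
import random

def sgd_factorize(ratings, n_users, n_items, k=2,
                  gamma=0.02, lam=0.01, epochs=1000, seed=0):
    """Learn factors by stochastic gradient descent: for each observed
    r_{u,i}, compute e_{u,i} = r_{u,i} - q_i . w_u, then step
      q_i <- q_i + gamma * (e_{u,i} * w_u - lam * q_i)
      w_u <- w_u + gamma * (e_{u,i} * q_i - lam * w_u)
    """
    rng = random.Random(seed)
    w = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    pairs = list(ratings.items())
    for _ in range(epochs):
        rng.shuffle(pairs)                 # visit examples in random order
        for (u, i), r in pairs:
            e = r - sum(q[i][f] * w[u][f] for f in range(k))
            for f in range(k):
                qi, wu = q[i][f], w[u][f]
                q[i][f] += gamma * (e * wu - lam * qi)
                w[u][f] += gamma * (e * qi - lam * wu)
    return w, q

def predict_mf(w, q, u, i):
    """Predicted rating p_{u,i} = q_i . w_u."""
    return sum(qf * wf for qf, wf in zip(q[i], w[u]))
```

On a tiny rank-1 ratings matrix this recovers the observed ratings closely; the small residual comes from the regularization term shrinking the factors.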


Solving the optimization: Least Squares

- How to solve min_{q,w} Σ_{(u,i)∈R} (r_{u,i} − q_i · w_u)² + λ (‖q_i‖₂² + ‖w_u‖₂²)?
- Reducing to least squares:
  - Suppose the values of w_u are fixed
  - Then the goal is to minimize a function that is quadratic in the q_i
  - Solved by techniques from regression: least squares minimization
- Alternating least squares:
  - Pretend the values of w_u are fixed, optimize the values of q_i
  - Swap: pretend the values of q_i are fixed, optimize the values of w_u
  - Repeat until convergence
- Can be slower than gradient descent on a single machine
  - But can parallelize: compute each q_i independently
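A compact ALS sketch with NumPy, in which each half-step solves the regularized normal equations exactly for one user (or item) at a time (the data layout and parameters are illustrative):

```python
import numpy as np

def als(R, mask, k=2, lam=0.1, iters=20, seed=0):
    """Alternating least squares for R ~ W @ Q.T.

    R:    n x m array of ratings (entries outside `mask` are ignored).
    mask: n x m boolean array marking the observed entries.
    The per-user (per-item) ridge-regression solves are independent of
    one another, which is what makes ALS easy to parallelize.
    """
    rng = np.random.default_rng(seed)
    n, m = R.shape
    W = 0.1 * rng.standard_normal((n, k))   # user factors
    Q = 0.1 * rng.standard_normal((m, k))   # item factors
    for _ in range(iters):
        for u in range(n):                  # fix Q, solve for w_u
            obs = mask[u]                   # items this user rated
            A = Q[obs].T @ Q[obs] + lam * np.eye(k)
            W[u] = np.linalg.solve(A, Q[obs].T @ R[u, obs])
        for i in range(m):                  # fix W, solve for q_i
            obs = mask[:, i]                # users who rated this item
            A = W[obs].T @ W[obs] + lam * np.eye(k)
            Q[i] = np.linalg.solve(A, W[obs].T @ R[obs, i])
    return W, Q
```

Each inner solve is a k × k linear system, so the cost per sweep is modest even for large numbers of users and items.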


Adding biases

- Can generalize matrix factorization to incorporate other effects
  - E.g. Fred always rates 1 star less than average
  - E.g. Citizen Kane is rated 0.5 higher than other films on average
- These effects are not captured well by a model of the form q_i · w_u
  - Explicitly modelling biases (intercepts) can give a better fit
  - Model with biases: p_{u,i} = μ + b_i + b_u + (w_u · q_i)
    - μ: global average rating
    - b_i: bias for item i
    - b_u: rating bias of user u (similar to the neighbourhood method's mean adjustment)
- Optimize the new error function in the same way:
  min_{q,w,b} Σ_{(u,i)∈R} (r_{u,i} − μ − b_u − b_i − q_i · w_u)² + λ (‖q_i‖₂² + ‖w_u‖₂² + b_u² + b_i²)
  - Can add more biases, e.g. to incorporate variation over time
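A minimal sketch of prediction under the bias model (the names are my own):

```python
def predict_with_biases(mu, b_i, b_u, q_i, w_u):
    """p_{u,i} = mu + b_i + b_u + w_u . q_i: the global average, plus the
    item and user offsets, plus the learned factor interaction."""
    return mu + b_i + b_u + sum(x * y for x, y in zip(w_u, q_i))
```

For example, with a global average of 3.5, an item rated 0.5 above average, and a user who rates 1.0 below average, the baseline before any factor interaction is 3.0.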


Cold start problem: new items

- How to cope when new objects are added to the system?
  - New users arrive, new movies are released: the "cold start" problem
- A new item is created: with no ratings, will it ever be recommended?
  - Use attributes of the item (actors, genre) to give it some score
  - Randomly suggest it to users to get some ratings


Cold start problem: new users

- New users arrive: we have no idea what they like!
  - Recommend globally popular items to them (Harry Potter…)
    - May not give much specific information about their tastes
  - Encourage new users to rate some items before recommending
  - Suggest items that are "divisive": try to maximize the information gained
    - Tradeoff: "poor" recommendations may drive users away


Case Study: The Netflix Prize

- Netflix ran the competition from 2006-09
  - Netflix streams movies over the internet (and rents DVDs by mail)
  - Users rate each movie on a 5-star scale
  - Netflix makes recommendations of what to watch next
- Object of the competition: improve over the current recommendations
  - "Cinematch" algorithm: "uses straightforward linear models…"
  - Prize: $1M to improve RMSE by 10%
- Training data: 100M dated ratings from 480K users of 18K movies
  - Could submit ratings for the test data at most once per day
    - Avoids stressing the servers, and attempts to elicit true answers


The Netflix Prize


https://www.youtube.com/watch?v=ImpV70uLxyw



Netflix prize factors

- Postscript: Netflix adopted some of the ideas, but not all
  - "Explainability" of recommendations is an additional requirement
  - The cost of fitting models and making predictions is also important


Recommender Systems Summary

- Introduced the concept of recommendation
- Saw neighbour-based methods
- Saw latent factor methods
- Understood how recommender systems are evaluated
  - Netflix prize as a case study in applied recommender systems
- Recommended reading:
  - "Recommender Systems" (Encyclopedia of Machine Learning)
  - "Matrix Factorization Techniques for Recommender Systems", Koren, Bell, Volinsky, IEEE Computer, 2009
