Lecture 5 Instructor: Max Welling Squared Error Matrix Factorization


TRANSCRIPT

Page 1: Lecture 5 Instructor: Max Welling Squared Error Matrix Factorization

Lecture 5

Instructor: Max Welling

Squared Error Matrix Factorization

Page 2: Lecture 5 Instructor: Max Welling Squared Error Matrix Factorization

Remember User Ratings

[Figure: the ratings matrix R, movies (+/- 17,770) by users (+/- 240,000), with a total of +/- 400,000,000 nonzero entries (99% sparse).]

• We are going to “squeeze out” the noisy, uninformative part of the matrix.

• In doing so, the algorithm will learn to only retain the most valuable information for predicting the ratings in the data matrix.

• This can then be used to more reliably predict the entries that were not rated.

Page 3: Lecture 5 Instructor: Max Welling Squared Error Matrix Factorization

Squeezing out Information

[Figure: the ratings matrix R, movies (+/- 17,770) by users (+/- 240,000) with +/- 400,000,000 nonzero entries (99% sparse), is approximated by the product of two much smaller matrices: A, of size movies (+/- 17,770) x K, times B, of size K x users (+/- 240,000).]

“K” is the number of factors, or topics.

Page 4: Lecture 5 Instructor: Max Welling Squared Error Matrix Factorization

Interpretation

[Figure: the factor matrices A, movies (+/- 17,770) x K, and B, K x users (+/- 240,000).]

• Before, each movie was characterized by a signature over 240,000 user-ratings.

• Now, each movie is represented by a signature of K “movie-genre” values (or topics, or factors).

• Movies that are similar are expected to have similar values for their movie genres.

star-wars = [10,3,-4,-1,0.01]


• Before, users were characterized by their ratings over all movies.

• Now, each user is represented by his/her values over movie-genres.

• Users that are similar have similar values for their movie-genres.

Page 5: Lecture 5 Instructor: Max Welling Squared Error Matrix Factorization

Dual Interpretation

[Figure: the factor matrices A, movies (+/- 17,770) x K, and B, K x users (+/- 240,000).]

• Interestingly, we can also interpret the factors/topics as user-communities.

• Each user is represented by its “participation” in K communities.

• Each movie is represented by the ratings of these communities for the movie.

• Pick what you like! I’ll call them topics or factors from now on.


Page 6: Lecture 5 Instructor: Max Welling Squared Error Matrix Factorization

The Algorithm

• In an equation, the squeezing boils down to:

$$R_{mu} \approx \sum_{k=1}^{K} A_{mk} B_{ku}$$

• We want to find A and B such that we can reproduce the observed ratings as well as possible, given that we only have K topics.

• This reconstruction of the ratings will not be perfect (otherwise we would be over-fitting).

• To do so, we minimize the squared error (Frobenius norm):

$$\mathrm{Error} = \|R - AB\|_F^2 = \sum_{m=1}^{M}\sum_{u=1}^{U}\left(R_{mu} - \sum_{k=1}^{K} A_{mk}B_{ku}\right)^2$$
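To make the objective concrete, here is a minimal NumPy sketch that evaluates this squared error only over the observed entries; the names (`R`, `A`, `B`, `observed_mask`) and the masking trick are illustrative, not the lecture's own code:

```python
import numpy as np

def squared_error(R, A, B, observed_mask):
    """Frobenius-style error of the factorization, counting only observed ratings."""
    # R: (M, U) ratings, A: (M, K) movie factors, B: (K, U) user factors,
    # observed_mask: (M, U) array with 1 where a rating exists and 0 elsewhere.
    residual = (R - A @ B) * observed_mask   # zero out the unrated movie-user pairs
    return np.sum(residual ** 2)
```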

Page 7: Lecture 5 Instructor: Max Welling Squared Error Matrix Factorization

Prediction

• We train A,B to reproduce the observed ratings only and we ignore all the non-rated movie-user pairs.

• But here comes the magic: after learning A,B, the product AxB will have filled in values for the non-rated movie-user pairs!

• It has implicitly used the information from similar users and similar movies to generate ratings for movies that weren’t there before.

• This is what “learning” is: we can predict things we haven’t seen before by looking at old data.
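As a toy illustration of this fill-in effect (all numbers below are made up, not taken from the Netflix data):

```python
import numpy as np

# Toy example: 3 movies, 4 users, K = 2 factors (values made up).
A = np.array([[1.0, 0.5],
              [0.2, 1.5],
              [0.9, 0.1]])                     # (movies, K) movie factors
B = np.array([[1.0, 0.0, 2.0, 0.5],
              [0.3, 1.0, 0.0, 1.2]])           # (K, users) user factors

R_hat = A @ B        # dense (movies, users) matrix: every pair now has a value
print(R_hat[0, 1])   # predicted rating of movie 0 by user 1, even if never rated
```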

Page 8: Lecture 5 Instructor: Max Welling Squared Error Matrix Factorization

The Algorithm

• We want to minimize the Error. Let's take the gradients again.

$$\frac{d\,\mathrm{Error}}{dA} = -2\,(R - AB)\,B^T, \qquad \frac{d\,\mathrm{Error}}{dA_{mk}} = -2\sum_{u}\left(R_{mu} - \sum_i A_{mi}B_{iu}\right)B_{ku}$$

$$\frac{d\,\mathrm{Error}}{dB} = -2\,A^T(R - AB), \qquad \frac{d\,\mathrm{Error}}{dB_{ku}} = -2\sum_{m} A_{mk}\left(R_{mu} - \sum_i A_{mi}B_{iu}\right)$$

• We could now follow the gradients again:

$$A \leftarrow A - \eta\,\frac{d\,\mathrm{Error}}{dA}, \qquad B \leftarrow B - \eta\,\frac{d\,\mathrm{Error}}{dB}$$

• But wait:
  • how do we ignore the non-observed entries?
  • for Netflix this will actually be very slow; how can we make this practical?

Page 9: Lecture 5 Instructor: Max Welling Squared Error Matrix Factorization

Stochastic Gradient Descent

• First we pick a single observed movie-user pair at random: $R_{mu}$.

• Then we ignore the sums over u, m in the exact gradients and do an update:

$$A_{mk} \leftarrow A_{mk} + \eta\left(R_{mu} - \sum_i A_{mi}B_{iu}\right)B_{ku} \qquad \forall k$$

$$B_{ku} \leftarrow B_{ku} + \eta\, A_{mk}\left(R_{mu} - \sum_i A_{mi}B_{iu}\right) \qquad \forall k$$

• The trick is that although we don’t follow the exact gradient, on average we do move in the correct direction.
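A minimal sketch of one such stochastic pass in Python/NumPy; the data layout (a list of (movie, user, rating) triples) and the default step size are illustrative assumptions:

```python
import random

def sgd_epoch(ratings, A, B, eta=0.01):
    """One stochastic pass over the observed (movie, user, rating) triples.

    A: (M, K) movie factors, B: (K, U) user factors (NumPy arrays), updated in place."""
    random.shuffle(ratings)                  # visit observed pairs in random order
    for m, u, r in ratings:
        err = r - A[m, :] @ B[:, u]          # residual for this single rating
        A_m = A[m, :].copy()                 # keep old movie factors for B's update
        A[m, :] += eta * err * B[:, u]       # update all K factors of movie m
        B[:, u] += eta * err * A_m           # update all K factors of user u
```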

Page 10: Lecture 5 Instructor: Max Welling Squared Error Matrix Factorization

Stochastic Gradient Descent

[Figure: optimization paths of stochastic updates versus full updates (averaged over all data-items).]

• Stochastic gradient descent does not converge to the minimum, but “dances” around it.

• To get to the minimum, one needs to decrease the step-size as one gets closer to the minimum.

• Alternatively, one can obtain a few samples and average predictions over them (similar to bagging).
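For instance, a decaying step-size schedule could be wrapped around the `sgd_epoch` sketch from the previous page; the 1/(1 + t) form below is just one illustrative choice, not the schedule used in the lecture:

```python
eta0, n_epochs = 0.02, 50
for epoch in range(n_epochs):
    eta = eta0 / (1.0 + epoch / 10.0)     # step size shrinks as we approach the minimum
    sgd_epoch(ratings, A, B, eta=eta)     # reuse the stochastic pass sketched earlier
```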

Page 11: Lecture 5 Instructor: Max Welling Squared Error Matrix Factorization

Weight-Decay

• Often it is good to make sure the values in A,B do not grow too big.

• We can make that happen by adding weight decay terms which keep them small.

• Small weights are often better for generalization, but this depends on how many topics, K, you choose.

• The simplest weight decay results in:

$$A_{mk} \leftarrow A_{mk} + \eta\left[\left(R_{mu} - \sum_i A_{mi}B_{iu}\right)B_{ku} - \lambda A_{mk}\right] \qquad \forall k$$

$$B_{ku} \leftarrow B_{ku} + \eta\left[A_{mk}\left(R_{mu} - \sum_i A_{mi}B_{iu}\right) - \lambda B_{ku}\right] \qquad \forall k$$

• But a more aggressive variant is to replace the decay terms $A_{mk}$ and $B_{ku}$ with $\mathrm{sign}(A_{mk})$ and $\mathrm{sign}(B_{ku})$.
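Translated into the stochastic update, this could look like the following sketch; the parameter names (`eta`, `lam`) and the exact form of the aggressive variant are illustrative assumptions:

```python
import numpy as np

def sgd_step_decay(m, u, r, A, B, eta=0.01, lam=0.02, aggressive=False):
    """Single stochastic update for rating r of movie m by user u, with weight decay."""
    err = r - A[m, :] @ B[:, u]
    A_m = A[m, :].copy()
    if aggressive:
        # more aggressive variant: decay proportional to the sign of each weight
        A[m, :] += eta * (err * B[:, u] - lam * np.sign(A[m, :]))
        B[:, u] += eta * (err * A_m     - lam * np.sign(B[:, u]))
    else:
        # simplest weight decay: shrink each weight toward zero
        A[m, :] += eta * (err * B[:, u] - lam * A[m, :])
        B[:, u] += eta * (err * A_m     - lam * B[:, u])
```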

Page 12: Lecture 5 Instructor: Max Welling Squared Error Matrix Factorization

Incremental Updates

• One idea is to train one factor at a time:
  • it's fast;
  • one can monitor performance on held-out test data.

• A new factor is trained on the residuals of the model up till now.

• Hence, we get:

• If you fit slowly (i.e., don't run each factor to convergence, and use weight decay/shrinkage), results are better.

$$\mathrm{Error} = \sum_{m=1}^{M}\sum_{u=1}^{U}\left(R_{mu} - \sum_{k=1}^{K-1} A_{mk}B_{ku} - A_{mK}B_{Ku}\right)^2 = \sum_{m=1}^{M}\sum_{u=1}^{U}\left(\tilde{R}_{mu} - A_{mK}B_{Ku}\right)^2$$

$$A_{mK} \leftarrow A_{mK} + \eta\left[\left(\tilde{R}_{mu} - A_{mK}B_{Ku}\right)B_{Ku} - \lambda A_{mK}\right]$$

$$B_{Ku} \leftarrow B_{Ku} + \eta\left[A_{mK}\left(\tilde{R}_{mu} - A_{mK}B_{Ku}\right) - \lambda B_{Ku}\right]$$

where the residual is $\tilde{R}_{mu} = R_{mu} - \sum_{k=1}^{K-1} A_{mk}B_{ku}$.
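A compact sketch of this one-factor-at-a-time training, using the stochastic update with weight decay; the data layout, initialization, and hyper-parameters are illustrative assumptions rather than the lecture's exact recipe:

```python
import numpy as np

def train_incremental(ratings, M, U, K, n_passes=20, eta=0.01, lam=0.02):
    """Fit factors one at a time; each new factor is trained on the residual ratings."""
    A = 0.1 * np.random.randn(M, K)
    B = 0.1 * np.random.randn(K, U)
    residual = {(m, u): r for m, u, r in ratings}        # start from the raw ratings
    for k in range(K):                                    # add one factor at a time
        for _ in range(n_passes):                         # a few (non-converged) passes
            for (m, u), r in residual.items():
                err = r - A[m, k] * B[k, u]
                a_old = A[m, k]
                A[m, k] += eta * (err * B[k, u] - lam * A[m, k])
                B[k, u] += eta * (err * a_old   - lam * B[k, u])
        for (m, u) in residual:                           # next factor sees what is left
            residual[(m, u)] -= A[m, k] * B[k, u]
    return A, B
```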

Page 13: Lecture 5 Instructor: Max Welling Squared Error Matrix Factorization

Pre-Processing

• We are almost ready for implementation. But we can get a head start if we first remove some “obvious” structure from the data, so the algorithm doesn't have to search for it.

• In particular, you can subtract the user and movie means where you ignore missing entries.

$$\tilde{R}_{mu} = R_{mu} - \frac{1}{U}\sum_{s} R_{ms} - \frac{1}{M}\sum_{r} R_{ru} + \frac{1}{N_{mu}}\sum_{s}\sum_{r} R_{sr}$$

Here $N_{mu}$ is the total number of observed movie-user pairs, and the sums run over observed entries only.

• Note that you can now predict missing entries by simply using:

$$R^{\mathrm{predict}}_{mu} = \frac{1}{U}\sum_{s} R_{ms} + \frac{1}{M}\sum_{r} R_{ru} - \frac{1}{N_{mu}}\sum_{s}\sum_{r} R_{sr}$$
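A sketch of this centering step over the observed triples; here the per-movie and per-user means are taken over observed ratings only (one reading of "ignore missing entries"), with the global mean added back once, and all names are illustrative:

```python
def center_ratings(ratings):
    """Subtract movie and user means (observed entries only), adding the global mean back once."""
    movie_sum, movie_cnt, user_sum, user_cnt = {}, {}, {}, {}
    total, n = 0.0, 0
    for m, u, r in ratings:
        movie_sum[m] = movie_sum.get(m, 0.0) + r
        movie_cnt[m] = movie_cnt.get(m, 0) + 1
        user_sum[u] = user_sum.get(u, 0.0) + r
        user_cnt[u] = user_cnt.get(u, 0) + 1
        total += r
        n += 1
    global_mean = total / n

    def baseline(m, u):
        # movie mean + user mean - global mean: also the "means only" prediction
        return movie_sum[m] / movie_cnt[m] + user_sum[u] / user_cnt[u] - global_mean

    centered = [(m, u, r - baseline(m, u)) for m, u, r in ratings]
    return centered, baseline
```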

Page 14: Lecture 5 Instructor: Max Welling Squared Error Matrix Factorization

Homework

• Subtract the movie and user means from the Netflix data.

• Check theoretically and experimentally that both the user and movie means are now 0.

• Make a prediction for the quiz data and submit it to Netflix (your RMSE should be around 0.99).

• Implement the stochastic matrix factorization and submit to Netflix (you should get around 0.93).