Lecture 5: Squared Error Matrix Factorization
Instructor: Max Welling

TRANSCRIPT
Remember User Ratings

[Figure: the ratings matrix, movies (+/- 17,770) by users (+/- 240,000); a total of +/- 400,000,000 nonzero entries (99% sparse).]
• We are going to “squeeze out” the noisy, uninformative parts of the matrix.
• In doing so, the algorithm will learn to retain only the information most valuable for predicting the ratings in the data matrix.
• This can then be used to predict the unrated entries more reliably.
Squeezing out Information

[Figure: the movies (+/- 17,770) x users (+/- 240,000) ratings matrix (+/- 400,000,000 nonzero entries, 99% sparse) is approximated by the product of a movies x K matrix and a K x users matrix.]

“K” is the number of factors, or topics.
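The shape bookkeeping of this factorization can be sketched in NumPy. This is a toy illustration: the sizes M, U, K and the random matrices below are stand-ins, not the actual Netflix data.

```python
import numpy as np

# Toy stand-in for the Netflix matrix: M movies x U users,
# approximated by A (M x K) times B (K x U).
M, U, K = 6, 8, 2              # real data: M ~ 17,770, U ~ 240,000
rng = np.random.default_rng(0)

A = rng.normal(size=(M, K))    # one K-vector of "genre" values per movie
B = rng.normal(size=(K, U))    # one K-vector of "community" values per user

R_approx = A @ B               # reconstructed ratings matrix
assert R_approx.shape == (M, U)

# Storage drops from M*U numbers to K*(M+U):
print(M * U, K * (M + U))      # prints: 48 28
```

At Netflix scale the same arithmetic gives roughly 4.3 billion entries compressed into K times (17,770 + 240,000) numbers.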
Interpretation

• Before, each movie was characterized by a signature over 240,000 user-ratings.
• Now, each movie is represented by a signature of K “movie-genre” values (or topics, or factors), e.g.

star-wars = [10, 3, -4, -1, 0.01]

• Movies that are similar are expected to have similar values for their movie-genres.
• Before, users were characterized by their ratings over all movies.
• Now, each user is represented by his/her values over movie-genres.
• Users that are similar have similar values for their movie-genres.
Dual Interpretation

• Interestingly, we can also interpret the factors/topics as user-communities.
• Each user is represented by his/her “participation” in K communities.
• Each movie is represented by the ratings of these communities for the movie.
• Pick what you like! I’ll call them topics or factors from now on.
The Algorithm

• In an equation, the squeezing boils down to:

$$ R_{mu} \approx \sum_{k=1}^{K} A_{mk} B_{ku} $$

• We want to find A, B such that we can reproduce the observed ratings as well as possible, given that we are only allowed K topics.
• This reconstruction of the ratings will not be perfect (otherwise we would over-fit).
• To do so, we minimize the squared error (Frobenius norm):

$$ \mathrm{Error} = \|R - AB\|_F^2 = \sum_{m=1}^{M}\sum_{u=1}^{U}\Big(R_{mu} - \sum_{k=1}^{K} A_{mk}B_{ku}\Big)^2 $$
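The squared-error objective can be sketched as follows. This minimal version anticipates the next slides by including a 0/1 mask W over observed entries (the names R, A, B, W are illustrative; the slide's formula sums over all entries).

```python
import numpy as np

def squared_error(R, A, B, W):
    """Frobenius-style error, summed over observed (W == 1) entries only."""
    resid = W * (R - A @ B)   # zero out the non-rated movie-user pairs
    return np.sum(resid ** 2)
```

With W all ones this is exactly the Frobenius norm of R - AB squared.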
Prediction
• We train A,B to reproduce the observed ratings only and we ignore all the non-rated movie-user pairs.
• But here comes the magic: after learning A and B, the product AB will have filled in values for the non-rated movie-user pairs!
• It has implicitly used the information from similar users and similar movies to generate ratings for movies that weren’t there before.
• This is what “learning” is: we can predict things we haven’t seen before by looking at old data.
The Algorithm

• We want to minimize the Error. Let’s take the gradients again (dropping constant factors of 2, which can be absorbed into the step size):

$$ \frac{d\,\mathrm{Error}}{dA_{mk}} = -\sum_{u}\Big(R_{mu} - \sum_{i} A_{mi}B_{iu}\Big)B_{ku} = -\big[(R - AB)\,B^T\big]_{mk} $$

$$ \frac{d\,\mathrm{Error}}{dB_{ku}} = -\sum_{m} A_{mk}\Big(R_{mu} - \sum_{i} A_{mi}B_{iu}\Big) = -\big[A^T(R - AB)\big]_{ku} $$

• We could now follow the gradients again (with step-size $\eta$):

$$ A \leftarrow A - \eta\,\frac{d\,\mathrm{Error}}{dA}, \qquad B \leftarrow B - \eta\,\frac{d\,\mathrm{Error}}{dB} $$

• But wait:
• How do we ignore the non-observed entries?
• For Netflix, this will actually be very slow. How can we make this practical?
Stochastic Gradient Descent

• First we pick a single observed movie-user pair at random: $R_{mu}$.
• Then we ignore the sums over u, m in the exact gradients and do an update:

$$ A_{mk} \leftarrow A_{mk} + \eta\Big(R_{mu} - \sum_{i} A_{mi}B_{iu}\Big)B_{ku} \qquad \forall k $$

$$ B_{ku} \leftarrow B_{ku} + \eta\, A_{mk}\Big(R_{mu} - \sum_{i} A_{mi}B_{iu}\Big) \qquad \forall k $$
• The trick is that although we don’t follow the exact gradient, on average we do move in the correct direction.
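The stochastic updates above can be sketched as a toy implementation. The step size eta, the iteration count, and the uniform sampling over observed pairs are illustrative choices, not prescriptions from the slides.

```python
import numpy as np

def sgd_factorize(R, observed, K=2, eta=0.01, n_iters=100_000, seed=0):
    """R: movies x users ratings; observed: list of (m, u) pairs with known ratings."""
    rng = np.random.default_rng(seed)
    M, U = R.shape
    A = 0.1 * rng.normal(size=(M, K))   # small random init
    B = 0.1 * rng.normal(size=(K, U))
    for _ in range(n_iters):
        m, u = observed[rng.integers(len(observed))]  # pick a rated pair at random
        err = R[m, u] - A[m] @ B[:, u]                # R_mu - sum_i A_mi B_iu
        # tuple assignment: both updates use the old values of A and B
        A[m], B[:, u] = A[m] + eta * err * B[:, u], B[:, u] + eta * err * A[m]
    return A, B
```

After training, `(A @ B)[m, u]` gives a prediction for any pair, including the non-rated ones.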
Stochastic Gradient Descent
[Figure: stochastic updates vs. full updates (averaged over all data-items).]

• Stochastic gradient descent does not converge to the minimum, but “dances” around it.
• To get to the minimum, one needs to decrease the step-size as one gets closer to the minimum.
• Alternatively, one can obtain a few samples and average predictions over them (similar to bagging).
Weight-Decay
• Often it is good to make sure the values in A,B do not grow too big.
• We can make that happen by adding weight decay terms which keep them small.
• Small weights are often better for generalization, but this depends on how many topics, K, you choose.
• The simplest weight decay results in:

$$ A_{mk} \leftarrow A_{mk} + \eta\Big[\Big(R_{mu} - \sum_{i} A_{mi}B_{iu}\Big)B_{ku} - \lambda A_{mk}\Big] \qquad \forall k $$

$$ B_{ku} \leftarrow B_{ku} + \eta\Big[A_{mk}\Big(R_{mu} - \sum_{i} A_{mi}B_{iu}\Big) - \lambda B_{ku}\Big] \qquad \forall k $$

• But a more aggressive variant is to replace

$$ \lambda A_{mk} \rightarrow \lambda\,\mathrm{sign}(A_{mk}), \qquad \lambda B_{ku} \rightarrow \lambda\,\mathrm{sign}(B_{ku}) $$
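A single decayed stochastic update can be sketched as below; the values of eta and lam (the decay strength) are illustrative, and the l1 flag implements the more aggressive sign-based variant.

```python
import numpy as np

def sgd_step_decay(R, A, B, m, u, eta=0.01, lam=0.05, l1=False):
    """One stochastic update on observed pair (m, u), with weight decay."""
    err = R[m, u] - A[m] @ B[:, u]
    # L2 decay subtracts lam*A_mk; the aggressive L1 variant
    # subtracts lam*sign(A_mk) instead (and likewise for B).
    pen_A = np.sign(A[m]) if l1 else A[m]
    pen_B = np.sign(B[:, u]) if l1 else B[:, u]
    A[m], B[:, u] = (A[m] + eta * (err * B[:, u] - lam * pen_A),
                     B[:, u] + eta * (err * A[m] - lam * pen_B))
```

The sign variant penalizes all weights by a constant amount, which tends to drive small weights exactly to zero (L1-style shrinkage).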
Incremental Updates
• One idea is to train one factor at a time.
• It’s fast.
• One can monitor performance on held-out test data.
• A new factor is trained on the residuals of the model up till now:

$$ \tilde R_{mu} = R_{mu} - \sum_{k=1}^{K-1} A_{mk} B_{ku} $$

• Hence, we get:

$$ \mathrm{Error} = \sum_{m=1}^{M}\sum_{u=1}^{U}\Big(R_{mu} - \sum_{k=1}^{K-1} A_{mk}B_{ku} - A_{mK}B_{Ku}\Big)^2 = \sum_{m=1}^{M}\sum_{u=1}^{U}\big(\tilde R_{mu} - A_{mK}B_{Ku}\big)^2 $$

$$ A_{mK} \leftarrow A_{mK} + \eta\big[(\tilde R_{mu} - A_{mK}B_{Ku})B_{Ku} - \lambda A_{mK}\big] $$

$$ B_{Ku} \leftarrow B_{Ku} + \eta\big[A_{mK}(\tilde R_{mu} - A_{mK}B_{Ku}) - \lambda B_{Ku}\big] $$

• If you fit slowly (i.e. don’t go to convergence; use weight decay, shrinkage), results are better.
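One-factor-at-a-time training on residuals can be sketched as follows; eta, lam, n_iters, and the sampling scheme are illustrative settings, not from the slides.

```python
import numpy as np

def train_incremental(R, observed, K=2, eta=0.01, lam=0.01, n_iters=50_000, seed=0):
    """Greedily fit K rank-1 factors, each trained on the residual ratings."""
    rng = np.random.default_rng(seed)
    M, U = R.shape
    A = np.zeros((M, K))
    B = np.zeros((K, U))
    resid = R.copy()                                   # R_tilde, the residual ratings
    for k in range(K):
        a = 0.1 * rng.normal(size=M)                   # new factor, trained alone
        b = 0.1 * rng.normal(size=U)
        for _ in range(n_iters):
            m, u = observed[rng.integers(len(observed))]
            err = resid[m, u] - a[m] * b[u]
            a[m], b[u] = (a[m] + eta * (err * b[u] - lam * a[m]),
                          b[u] + eta * (err * a[m] - lam * b[u]))
        A[:, k], B[k] = a, b
        for m, u in observed:                          # next factor fits what is left
            resid[m, u] -= a[m] * b[u]
    return A, B
```

Monitoring held-out error after each factor tells you when adding more factors stops helping.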
Pre-Processing

• We are almost ready for implementation. But we can make a head-start if we first remove some “obvious” structure from the data, so the algorithm doesn’t have to search for it.
• In particular, you can subtract the user and movie means, where you ignore missing entries:

$$ \tilde R_{mu} = R_{mu} - \frac{1}{U}\sum_{s} R_{ms} - \frac{1}{M}\sum_{r} R_{ru} + \frac{1}{N_{mu}}\sum_{s,r} R_{sr} $$

(all sums run over observed entries only; $N_{mu}$ is the total number of observed movie-user pairs).

• Note that you can now predict missing entries by simply using:

$$ R^{\mathrm{predict}}_{mu} = \frac{1}{U}\sum_{s} R_{ms} + \frac{1}{M}\sum_{r} R_{ru} - \frac{1}{N_{mu}}\sum_{s,r} R_{sr} $$
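The mean-subtraction above can be sketched as below; storing unrated entries as NaN is an assumed convention (the slide only says to ignore missing entries).

```python
import numpy as np

def center(R):
    """Subtract movie and user means; unrated entries are NaN."""
    movie_mean = np.nanmean(R, axis=1, keepdims=True)  # mean rating per movie
    user_mean = np.nanmean(R, axis=0, keepdims=True)   # mean rating per user
    global_mean = np.nanmean(R)                        # mean over all observed pairs
    # subtracting both means removes the global mean twice, so add it back
    R_tilde = R - movie_mean - user_mean + global_mean
    predict = movie_mean + user_mean - global_mean     # baseline for missing entries
    return R_tilde, predict
```

When the ratings are exactly additive (movie effect plus user effect), the residual is zero and the baseline prediction is exact.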
Homework
• Subtract the movie and user means from the Netflix data.
• Check theoretically and experimentally that both the user and movie means are then 0.
• Make a prediction for the quiz data and submit it to Netflix (your RMSE should be around 0.99).
• Implement the stochastic matrix factorization and submit to Netflix (you should get around 0.93).