I like it... I like it not
Evaluating User Ratings Noise in
Recommender Systems
Xavier Amatriain (@xamat), Josep M. Pujol, Nuria Oliver
Telefonica Research
Recommender Systems are everywhere
Netflix: 2/3 of the movies rented were recommended
Google News: recommendations generate 38% more clickthrough
Amazon: 35% sales from recommendations
We are leaving the age of Information and entering the Age of Recommendation - The Long Tail (Chris Anderson)
The Netflix Prize
500K users x 17K movie titles = 100M ratings = $1M (if you improve the existing system by just 10%: from 0.95 to 0.85 RMSE)
This is what Netflix thinks a 10% improvement is worth for their business
49K contestants on 40K teams from 184 countries.
41K valid submissions from 5K teams; 64 submissions in the last 24 hours
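The prize's 10% target is a relative RMSE reduction. A quick sketch of the arithmetic, using the rounded 0.95 baseline quoted on the slide (the `rmse` helper is illustrative, not Netflix's scoring code):

```python
import math

def rmse(predictions, actuals):
    """Root mean squared error, the metric used by the Netflix Prize."""
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predictions, actuals)) / len(actuals)
    )

baseline_rmse = 0.95                      # rounded figure from the slide
target_rmse = baseline_rmse * (1 - 0.10)  # 10% relative improvement, ≈ 0.855
```

The slide's "from 0.95 to 0.85" is this same calculation, rounded once more.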
But, is there a limit to RS accuracy?
Evolution of accuracy in Netflix Prize
The Magic Barrier
Magic Barrier = Limit on prediction accuracy due to noise in original data
Natural Noise = involuntary noise introduced by users when giving feedback
Due to (a) mistakes, and (b) lack of resolution in users' personal rating scales (e.g. on a 1-to-5 scale, a 2 may mean the same as a 3 for some users and some items).
Magic Barrier >= Natural Noise Threshold
We cannot predict with less error than the resolution in the original data
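A minimal simulation of this point (the noise level and rating model are illustrative assumptions, not figures from the talk): even an oracle that knows each user's true opinion exactly cannot score below the noise floor.

```python
import random

random.seed(42)

NOISE_SD = 0.5  # assumed std. dev. of natural noise in expressed ratings
true_opinions = [random.uniform(1, 5) for _ in range(100_000)]

# Each observed rating = true opinion + involuntary noise.
# (Clipping to the 1-5 scale is omitted to keep the sketch short.)
observed = [t + random.gauss(0, NOISE_SD) for t in true_opinions]

# A perfect predictor outputs the true opinion for every rating,
# yet its RMSE against the noisy observations stays close to NOISE_SD.
rmse = (sum((o - t) ** 2 for o, t in zip(observed, true_opinions))
        / len(observed)) ** 0.5
```

Here `rmse` comes out near 0.5: the noise in the observed ratings, not the predictor, sets the floor.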
The Question in the Wind
Our related research questions
Q1. Are users inconsistent when providing explicit feedback to Recommender Systems via the common Rating procedure?
Q2. How large is the prediction error due to these inconsistencies?
Q3. What factors affect user inconsistencies?
Experimental Setup (I)
Test-retest procedure: you need at least 3 trials to separate reliability from stability
Reliability: how much you can trust the instrument you are using (i.e. ratings)
r = (r12 · r23) / r13
Stability: drift in user opinion between trials
s12 = r13/r23; s23 = r13/r12; s13 = r13² / (r12 · r23)
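The test-retest formulas above translate directly into code (the example correlations at the end are made-up illustrations, not the study's numbers):

```python
def heise_reliability(r12, r23, r13):
    """Heise-style reliability estimate from the three pairwise
    correlations between trials 1-2, 2-3, and 1-3."""
    return (r12 * r23) / r13

def heise_stabilities(r12, r23, r13):
    """Opinion-stability coefficients between pairs of trials."""
    s12 = r13 / r23
    s23 = r13 / r12
    s13 = r13 ** 2 / (r12 * r23)  # equivalently s12 * s23
    return s12, s23, s13

# hypothetical correlations, for illustration only
r = heise_reliability(0.90, 0.90, 0.88)            # ≈ 0.920
s12, s23, s13 = heise_stabilities(0.90, 0.90, 0.88)
```

Note that s13 is the product of s12 and s23, which is why the overall 1-3 stability is the lowest of the three.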
Users rated movies in 3 trials
Trial 1 → (24 h) → Trial 2 → (15 days) → Trial 3
Experimental Setup (II)
100 movies selected from the Netflix dataset by stratified random sampling on popularity
Ratings on a 1 to 5 star scale
A special "not seen" option was also available
Trials 1 and 3 = random order; trial 2 = ordered by popularity
118 participants
Results
Comparison to Netflix Data
Distribution of the number of ratings per movie is very similar to Netflix's, but the average rating is lower (our users did not voluntarily choose what to rate)
Test-retest Stability and Reliability
Overall reliability = 0.924 (good reliabilities are expected to be > 0.9)
Removing mild ratings yields higher reliabilities, while removing extreme ratings yields lower ones
Stabilities: s12 = 0.973, s23 = 0.977, and s13 = 0.951
Stabilities might also be accounting for a learning effect (note s12 < s23)