I like it... I like it not
Evaluating User Ratings Noise in
Recommender Systems
Xavier Amatriain (@xamat), Josep M. Pujol, Nuria Oliver
Telefonica Research
Recommender Systems are everywhere
Netflix: 2/3 of the movies rented were recommended
Google News: recommendations generate 38% more clickthrough
Amazon: 35% sales from recommendations
We are leaving the age of Information and entering the Age of Recommendation - The Long Tail (Chris Anderson)
The Netflix Prize
500K users x 17K movie titles = 100M ratings = $1M (if you improve the existing system by just 10%: from 0.95 to 0.85 RMSE)
This is what Netflix thinks a 10% improvement is worth for their business
49K contestants on 40K teams from 184 countries.
41K valid submissions from 5K teams; 64 submissions in the last 24 hours
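The prize's 10% target is a relative RMSE reduction. A quick sketch of the arithmetic, using the rounded 0.95 baseline quoted on the slide (the `rmse` helper is illustrative, not Netflix's scoring code):

```python
import math

def rmse(predictions, actuals):
    """Root mean squared error, the metric used by the Netflix Prize."""
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predictions, actuals)) / len(actuals)
    )

baseline_rmse = 0.95                      # rounded figure from the slide
target_rmse = baseline_rmse * (1 - 0.10)  # 10% relative improvement, ≈ 0.855
```

The slide's "from 0.95 to 0.85" is this same calculation, rounded once more.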
But, is there a limit to RS accuracy?
Evolution of accuracy in Netflix Prize
The Magic Barrier
Magic Barrier = Limit on prediction accuracy due to noise in original data
Natural Noise = involuntary noise introduced by users when giving feedback
Due to (a) mistakes, and (b) lack of resolution in users' personal rating scales (e.g. on a 1-to-5 scale, a 2 may mean the same as a 3 for some users and some items).
Magic Barrier >= Natural Noise Threshold
We cannot predict with less error than the resolution in the original data
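A minimal simulation of this point (the noise level and rating model are illustrative assumptions, not figures from the talk): even an oracle that knows each user's true opinion exactly cannot score below the noise floor.

```python
import random

random.seed(42)

NOISE_SD = 0.5  # assumed std. dev. of natural noise in expressed ratings
true_opinions = [random.uniform(1, 5) for _ in range(100_000)]

# Each observed rating = true opinion + involuntary noise.
# (Clipping to the 1-5 scale is omitted to keep the sketch short.)
observed = [t + random.gauss(0, NOISE_SD) for t in true_opinions]

# A perfect predictor outputs the true opinion for every rating,
# yet its RMSE against the noisy observations stays close to NOISE_SD.
rmse = (sum((o - t) ** 2 for o, t in zip(observed, true_opinions))
        / len(observed)) ** 0.5
```

Here `rmse` comes out near 0.5: the noise in the observed ratings, not the predictor, sets the floor.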
The Question in the Wind
Our related research questions
Q1. Are users inconsistent when providing explicit feedback to Recommender Systems via the common Rating procedure?
Q2. How large is the prediction error due to these inconsistencies?
Q3. What factors affect user inconsistencies?
Experimental Setup (I)
Test-retest procedure: you need at least 3 trials to separate reliability from stability
Reliability: how much you can trust the instrument you are using (i.e. ratings)
r = (r12 · r23) / r13
Stability: drift in user opinion between trials
s12 = r13/r23; s23 = r13/r12; s13 = r13² / (r12 · r23)
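The test-retest formulas above translate directly into code (the example correlations at the end are made-up illustrations, not the study's numbers):

```python
def heise_reliability(r12, r23, r13):
    """Heise-style reliability estimate from the three pairwise
    correlations between trials 1-2, 2-3, and 1-3."""
    return (r12 * r23) / r13

def heise_stabilities(r12, r23, r13):
    """Opinion-stability coefficients between pairs of trials."""
    s12 = r13 / r23
    s23 = r13 / r12
    s13 = r13 ** 2 / (r12 * r23)  # equivalently s12 * s23
    return s12, s23, s13

# hypothetical correlations, for illustration only
r = heise_reliability(0.90, 0.90, 0.88)            # ≈ 0.920
s12, s23, s13 = heise_stabilities(0.90, 0.90, 0.88)
```

Note that s13 is the product of s12 and s23, which is why the overall 1-3 stability is the lowest of the three.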
Users rated movies in 3 trials
Trial 1 → (24 h) → Trial 2 → (15 days) → Trial 3
Experimental Setup (II)
100 movies selected from the Netflix dataset by stratified random sampling on popularity
Ratings on a 1 to 5 star scale
A special "not seen" option was also available
Trials 1 and 3 = random order; trial 2 = ordered by popularity
118 participants
Results
Comparison to Netflix Data
Distribution of the number of ratings per movie is very similar to Netflix's, but the average rating is lower (our users did not voluntarily choose what to rate)
Test-retest Stability and Reliability
Overall reliability = 0.924 (good reliabilities are expected to be > 0.9)
Removing mild ratings yields higher reliabilities, while removing extreme ratings yields lower ones
Stabilities: s12 = 0.973, s23 = 0.977, and s13 = 0.951
Stabilities might also be accounting for a learning effect (note s12 < s23)