probabilistic rating systemssergey/slides/n15_ratingsaist.pdf · rating systems this work...
TRANSCRIPT
Rating systemsThis work
Probabilistic Rating Systems
Sergey Nikolenko
Steklov Mathematical Institute at St. Petersburg
Laboratory for Internet Studies,National Research University Higher School of Economics, St. Petersburg
April 9, 2015
Sergey Nikolenko Probabilistic rating systems
Rating systemsThis work
Problem setting and motivationTrueSkill
Outline
1 Rating systemsProblem setting and motivationTrueSkill
2 This workChanges in the problem settingModels and results
Sergey Nikolenko Probabilistic rating systems
Rating systemsThis work
Problem setting and motivationTrueSkill
My personal motivation
«What? Where? When?»: a team game of answeringquestions. Sometimes it looks like this...
Sergey Nikolenko Probabilistic rating systems
Rating systemsThis work
Problem setting and motivationTrueSkill
My personal motivation
...but usually it looks like this:
Sergey Nikolenko Probabilistic rating systems
Rating systemsThis work
Problem setting and motivationTrueSkill
My personal motivation
Teams of ≤ 6 players answer questions.
Whoever gets the most correct answers wins.
My motivation was to create a rating system that wouldpredict tournament results by team rosters.Characteristic features that make the problem hard:
it’s a hobby: players have no contracts, teams do not havepermanent rosters, playing for a different team is common;hence, we cannot just make a rating list of the teams, we needto go deeper, to individual players;but we do not know how players do, only team results;relatively few questions per tournament (36, 45, 60), hencemultiway ties;undersized teams are common.
Sergey Nikolenko Probabilistic rating systems
Rating systemsThis work
Problem setting and motivationTrueSkill
Introduction
In probabilistic rating models, Bayesian inference aims to finda linear ordering on a certain set given noisy comparisons ofrelatively small subsets of this set.
Useful whenever there is no way to compare a large number ofentities directly, but only partial (noisy) comparisons areavailable.
We will stick to the metaphor of matches and players.
Sergey Nikolenko Probabilistic rating systems
Rating systemsThis work
Problem setting and motivationTrueSkill
Introduction
Elo rating system: first probabilistic rating model (chess: twoplayers).
Bradley–Terry models: assume that each player has a “true”rating γi , and the win probability is proportional to this rating:γ1 wins over γ2 with probability γ1
γ1+γ2.
Inference: fit this model to the data from matches played.
Several extensions, but large matches are hard forBradley–Terry models.
The model that looked right to us for «What? Where?When» was TrueSkill.
Sergey Nikolenko Probabilistic rating systems
Rating systemsThis work
Problem setting and motivationTrueSkill
TrueSkill
TrueSkill was initially developed in MS Research for Xbox Livegaming servers [Graepel, Minka, Herbrich, 2007].
Given results of team competitions, learn the ratings of playersof these teams.
Direct application – matchmaking: find interesting opponentsfor a player or team.
[Graepel et al., 2010]: AdPredictor. Predicts CTRs ofadvertisements based on a set of features: the features are ateam, and the team wins whenever a user clicks the ad.
Sergey Nikolenko Probabilistic rating systems
Rating systemsThis work
Problem setting and motivationTrueSkill
TrueSkill factor graph
Sergey Nikolenko Probabilistic rating systems
Rating systemsThis work
Problem setting and motivationTrueSkill
TrueSkill
There is no evidence per se, it is incorporated in the structureof the graph, we just have to marginalize by message passing.The marginalization problem is complicated by the stepfunctions at the bottom; solved with Expectation Propagation[Minka, 2001]:
approximate messages from I(di > ε) and I(|di | ≤ ε) to di
with normal distributions;repeat message passing on the bottom layer of the graph untilconvergence.
Sergey Nikolenko Probabilistic rating systems
Rating systemsThis work
Problem setting and motivationTrueSkill
Example: a match of four players
Sergey Nikolenko Probabilistic rating systems
Rating systemsThis work
Problem setting and motivationTrueSkill
TrueSkill problems and solutions
As I said, TrueSkill looked perfect for «What? Where?When».But it didn’t really work due to the following properties of the«What? Where? When» dataset:
large multiway ties are common; it is common to have 30–40different places (because there were 35-50 questions in total)in a tournament with a thousand teams;teams vary in size (max 6 players, but often incomplete).
Sergey Nikolenko Probabilistic rating systems
Rating systemsThis work
Problem setting and motivationTrueSkill
TrueSkill problems and solutions
An undesired feature of TrueSkill is the assumption that ateam’s performance is the sum of player performances.
In many competitions (and comparison problems), anundersized team stands a very good chance against a full one,and it would be an unfair boost for the smaller team.
To alleviate the team performance formula problem, we simplyselect a different function (linear functions fit naturally).
Sergey Nikolenko Probabilistic rating systems
Rating systemsThis work
Problem setting and motivationTrueSkill
Multiway ties
Large multiway ties are deadly for TrueSkill. Consider fourteams in a tournament with performances p1, . . . , p4.
Team 1 has won, while teams 2–4, listed in this order, drewbehind the first.
Then the factor graph tells us that
t2 < t1 − ε, |t2 − t3| ≤ ε, |t3 − t4| ≤ ε.
Team 3’s performance t3 may actually nearly equal t1, and t4may exceed t1!
Sergey Nikolenko Probabilistic rating systems
Rating systemsThis work
Problem setting and motivationTrueSkill
Multiway ties
Moreover, these boundary cases are realized in practice whenunexpected results occur.
t2 < t1 − ε, |t2 − t3| ≤ ε, |t3 − t4| ≤ ε.
Suppose the winning team t1 was an underdog, and its priordistribution fell behind the priors of t2, t3, and t4, t4 being theprior leader of all four.
Then the maximum likelihood value of t4 is likely to exceed t1.
Sergey Nikolenko Probabilistic rating systems
Rating systemsThis work
Problem setting and motivationTrueSkill
Changes in the factor graph
For the multiway tie problem, we add another layer in thefactor graph, namely the layer of place performances li .
Each team performs in the ε-neighborhood of its placeperformance, and place performances relate to each other withstrict inequalities like l2 < l1 − 2ε.
Then it’s inference as usual. We have not experienced anyslowdown in convergence.
Sergey Nikolenko Probabilistic rating systems
Rating systemsThis work
Problem setting and motivationTrueSkill
New factor graph
Sergey Nikolenko Probabilistic rating systems
Rating systemsThis work
Problem setting and motivationTrueSkill
Experimental results
100 200 300 400 500 600 700
0.5
0.6
0.7
0.8
Tournaments
AUC
TSa TSbTS2a TS2bTS2c
Average AUC over a sliding window of 50 tournaments.
Sergey Nikolenko Probabilistic rating systems
Rating systemsThis work
Changes in the problem settingModels and results
Outline
1 Rating systemsProblem setting and motivationTrueSkill
2 This workChanges in the problem settingModels and results
Sergey Nikolenko Probabilistic rating systems
Rating systemsThis work
Changes in the problem settingModels and results
Changes
A couple of years ago, «What? Where? When?» tournamentdatabase started collecting question-wise data.
That is, we now know which specific questions a team hasanswered; previously we only had standings in a tournament.
So when I got back to the problem of «What? Where?When?» ratings, I found the problem greatly simplified.
Sergey Nikolenko Probabilistic rating systems
Rating systemsThis work
Changes in the problem settingModels and results
Changes
Sample relevant application:consider a test suite with many questions that tests something(e.g., IQ);participants answer a random subset of questions;we need to rate participants but questions are different, so thecomplexity level cannot be perfectly balanced.
«What? Where? When?» is just like that, but participants areworking on the test in teams.
Sergey Nikolenko Probabilistic rating systems
Rating systemsThis work
Changes in the problem settingModels and results
Notation
Notation:set of tournaments D = {d1, . . . , d|D|};each tournament contains a certain number nd of individualtasks Q(d), d ∈ D, and each question has binary results: it canbe either performed correctly or not;teams T (d) = {t(d)1 , . . . , t(d)
M(d) } participate in tournamentd ∈ D;each team t ∈ T (d) consists of players Pt = {p1, . . . , pmt }, atotal of mt players in the team;for each question q ∈ Q(d) we know for every participatingteam t ∈ T (d) whether this question has been answeredcorrectly; we denote these bits by xtq.
Sergey Nikolenko Probabilistic rating systems
Rating systemsThis work
Changes in the problem settingModels and results
Baseline: logistic regression
Baseline model – logistic regression; we model:each player i with his or her skill si ,each question q with its complexity score cq,add the global average µ,and train the logistic model
p(xtq | si , cq) ∼ σ(µ+ si + cq)
for each player i ∈ t of a participating team t ∈ T (d) and eachquestion q ∈ Q(d), where σ(x) = 1/(1+ ex) is the logisticsigmoid.
Sergey Nikolenko Probabilistic rating systems
Rating systemsThis work
Changes in the problem settingModels and results
Model with latent variables
The logistic model basically assumes that each playersuccessfully answered every question that the team hadanswered.
But in fact we do not know which player or players haveanswered.
We only can assume that if the team has failed then no onefrom this team has done it.
This situation is similar in spirit to presence-only data modelsfound in, e.g., ecology [Ward et al., 2009; Royle et al., 2012].
Sergey Nikolenko Probabilistic rating systems
Rating systemsThis work
Changes in the problem settingModels and results
Model with latent variables
Hence, a model with latent variables.
For each player-question pair, we add a latent variable ziq
which means “player i has answered question q”.For these variables, we have the following constraints:
if xtq = 0 then ziq = 0 for every player i ∈ t;if xtq = 1 then ziq = 1 for at least one player i ∈ t.
Sergey Nikolenko Probabilistic rating systems
Rating systemsThis work
Changes in the problem settingModels and results
Model with latent variables
Model parameters are still skill and complexity of the tasks:
p(ziq | si , cq) ∼ σ(µ+ si + cq).
Training with EM:E-step: fix all si and cq, compute expected values of latentvariables ziq as
E [ziq ] =
0 if xtq = 0,p(ziq = 1 | ∃j ∈ t zjq = 1) = σ(µ+si+cq)
1−∏
j∈t(1−σ(µ+sj+cq)), if xtq = 1;
M-step: fix E [ziq] and train the logistic model
E [ziq] ∼ σ(µ+ si + cq).
Sergey Nikolenko Probabilistic rating systems
Rating systemsThis work
Changes in the problem settingModels and results
Metrics and results
And, sure enough, it works fine.Evaluation metrics – ranking quality:
AUC – probability that the relative standings of a random pairof teams have been predicted correctly;WTA (winner takes all) – 1 if the top rated team has won thetournament, 0 otherwise;Top3 – correctly predicted teams among the top 3;Top5 – correctly predicted teams among the top 5;MAP (mean average precision) – correctly predicted teamsamong the top 10.
Sergey Nikolenko Probabilistic rating systems
Rating systemsThis work
Changes in the problem settingModels and results
Results
MAP
0.6
0.7
0.8
AUC
04.20
12
07.20
12
10.20
12
01.20
13
05.20
13
08.20
13
11.20
13
03.20
14
06.20
14
09.20
140.7
0.8
0.9
Sergey Nikolenko Probabilistic rating systems
Rating systemsThis work
Changes in the problem settingModels and results
Implementation
Sergey Nikolenko Probabilistic rating systems
Rating systemsThis work
Changes in the problem settingModels and results
Implementation
Sergey Nikolenko Probabilistic rating systems
Rating systemsThis work
Changes in the problem settingModels and results
Final points
So here’s the takeaway:try to collect new data! the new model is much simpler thanTrueSkill because we have new data available;do not be afraid to work on your passions; if you are excitedabout the problem, you will make better progress, and «real»applications will find you.
RuSSIR 2015: August 24–28, St. Petersburg, Higher School ofEconomics.
Sergey Nikolenko Probabilistic rating systems
Rating systemsThis work
Changes in the problem settingModels and results
Thank you!
Thank you for your attention!
Sergey Nikolenko Probabilistic rating systems