

Probabilistic Rating Systems

Sergey Nikolenko

Steklov Mathematical Institute at St. Petersburg

Laboratory for Internet Studies, National Research University Higher School of Economics, St. Petersburg

April 9, 2015




Outline

1 Rating systems
   - Problem setting and motivation
   - TrueSkill

2 This work
   - Changes in the problem setting
   - Models and results




My personal motivation

«What? Where? When?»: a team game of answering questions. Sometimes it looks like this...





...but usually it looks like this:





Teams of ≤ 6 players answer questions.

Whoever gets the most correct answers wins.

My motivation was to create a rating system that would predict tournament results by team rosters. Characteristic features that make the problem hard:

- it’s a hobby: players have no contracts, teams do not have permanent rosters, playing for a different team is common;
- hence, we cannot just make a rating list of the teams, we need to go deeper, to individual players;
- but we do not know how players do, only team results;
- relatively few questions per tournament (36, 45, 60), hence multiway ties;
- undersized teams are common.




Introduction

In probabilistic rating models, Bayesian inference aims to find a linear ordering on a certain set given noisy comparisons of relatively small subsets of this set.

Useful whenever there is no way to compare a large number of entities directly, but only partial (noisy) comparisons are available.

We will stick to the metaphor of matches and players.




Introduction

Elo rating system: first probabilistic rating model (chess: two players).

Bradley–Terry models: assume that each player has a “true” rating $\gamma_i$, and the win probability is proportional to this rating: player 1 wins over player 2 with probability $\gamma_1 / (\gamma_1 + \gamma_2)$.

Inference: fit this model to the data from matches played.
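
As an illustration of the fitting step, here is a minimal Python sketch. It uses the minorization-maximization (MM) update that is commonly used for Bradley–Terry models; the slides do not say which fitting procedure was meant, so the algorithm choice, the function names, and the toy data are all assumptions.

```python
from collections import defaultdict


def bt_win_prob(gamma_i, gamma_j):
    """Probability that a player rated gamma_i beats a player rated gamma_j."""
    return gamma_i / (gamma_i + gamma_j)


def fit_bradley_terry(matches, n_iter=200):
    """Fit ratings from a list of (winner, loser) pairs with MM updates.

    Assumption: every player has at least one win and one loss;
    otherwise the maximum likelihood estimate does not exist.
    """
    players = sorted({p for match in matches for p in match})
    wins = defaultdict(int)    # wins[i]: matches won by i
    games = defaultdict(int)   # games[(i, j)]: matches played between i and j
    for winner, loser in matches:
        wins[winner] += 1
        games[(winner, loser)] += 1
        games[(loser, winner)] += 1

    gamma = {p: 1.0 for p in players}
    for _ in range(n_iter):
        new = {}
        for i in players:
            denom = sum(games[(i, j)] / (gamma[i] + gamma[j])
                        for j in players if j != i)
            new[i] = wins[i] / denom
        total = sum(new.values())
        gamma = {p: g / total for p, g in new.items()}  # fix the overall scale
    return gamma


# Toy data (hypothetical): "a" beats "b" three times out of four.
toy_matches = [("a", "b")] * 3 + [("b", "a")]
ratings = fit_bradley_terry(toy_matches)
p_a_beats_b = bt_win_prob(ratings["a"], ratings["b"])
```

With only two players the fitted win probability simply reproduces the empirical win rate; the model becomes informative when players are compared through chains of common opponents.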

Several extensions, but large matches are hard for Bradley–Terry models.

The model that looked right to us for «What? Where? When?» was TrueSkill.




TrueSkill

TrueSkill was initially developed in MS Research for Xbox Live gaming servers [Graepel, Minka, Herbrich, 2007].

Given results of team competitions, learn the ratings of players of these teams.

Direct application – matchmaking: find interesting opponents for a player or team.

[Graepel et al., 2010]: AdPredictor. Predicts CTRs of advertisements based on a set of features: the features are a team, and the team wins whenever a user clicks the ad.




TrueSkill factor graph




TrueSkill

There is no evidence per se: it is incorporated in the structure of the graph, so we just have to marginalize by message passing. The marginalization problem is complicated by the step functions at the bottom; solved with Expectation Propagation [Minka, 2001]:

- approximate messages from I(di > ε) and I(|di| ≤ ε) to di with normal distributions;
- repeat message passing on the bottom layer of the graph until convergence.
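
The normal approximation in the first step is usually obtained by moment matching: the true marginal of di under the step function is a truncated Gaussian, and EP replaces it with the normal distribution that has the same mean and variance. The sketch below covers only the win factor I(d > ε) and uses the standard formulas for the moments of a one-sided truncated normal; the function names are hypothetical, and the full EP update (which also divides out the incoming message) is not shown.

```python
import math


def normal_pdf(t):
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def normal_cdf(t):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def match_win_moments(mu, sigma, eps):
    """Mean and variance of N(mu, sigma^2) truncated to d > eps.

    Note: numerically fragile when (mu - eps) / sigma is very negative,
    because normal_cdf then underflows towards zero.
    """
    t = (mu - eps) / sigma
    v = normal_pdf(t) / normal_cdf(t)   # additive correction to the mean
    w = v * (v + t)                     # multiplicative correction to the variance
    return mu + sigma * v, sigma * sigma * (1.0 - w)


# A standard normal truncated to d > 0 is a half-normal distribution.
mean, var = match_win_moments(0.0, 1.0, 0.0)
```

The draw factor I(|d| ≤ ε) is handled the same way, with the moments of a two-sided truncation instead.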




Example: a match of four players




TrueSkill problems and solutions

As I said, TrueSkill looked perfect for «What? Where? When?». But it didn’t really work due to the following properties of the «What? Where? When?» dataset:

- large multiway ties are common; it is common to have 30–40 different places (because there were 35–50 questions in total) in a tournament with a thousand teams;
- teams vary in size (max 6 players, but often incomplete).




TrueSkill problems and solutions

An undesired feature of TrueSkill is the assumption that a team’s performance is the sum of player performances.

In many competitions (and comparison problems), an undersized team stands a very good chance against a full one, so the sum assumption would give the smaller team an unfair boost.

To alleviate the team performance formula problem, we simply select a different function (linear functions fit naturally).




Multiway ties

Large multiway ties are deadly for TrueSkill. Consider four teams in a tournament with performances t1, . . . , t4.

Team 1 has won, while teams 2–4, listed in this order, drew behind the first.

Then the factor graph tells us that

t2 < t1 − ε, |t2 − t3| ≤ ε, |t3 − t4| ≤ ε.

Team 3’s performance t3 may actually nearly equal t1, and t4 may exceed t1!
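
A small numeric check makes the loophole concrete. The values below are made up for illustration; they satisfy all three constraints of the chain, yet the last tied team ends up above the winner.

```python
def satisfies_tie_chain(t, eps):
    """Check t2 < t1 - eps, |t2 - t3| <= eps, |t3 - t4| <= eps."""
    t1, t2, t3, t4 = t
    return t2 < t1 - eps and abs(t2 - t3) <= eps and abs(t3 - t4) <= eps


eps = 1.0
performances = (10.0, 8.9, 9.8, 10.7)   # hypothetical t1, t2, t3, t4

chain_ok = satisfies_tie_chain(performances, eps)
t4_beats_winner = performances[3] > performances[0]
```

Each extra team in the tie adds another ε of slack, so with dozens of tied teams the constraints say almost nothing about the last one.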




Multiway ties

Moreover, these boundary cases are realized in practice when unexpected results occur.

t2 < t1 − ε, |t2 − t3| ≤ ε, |t3 − t4| ≤ ε.

Suppose the winning team t1 was an underdog, and its prior distribution fell behind the priors of t2, t3, and t4, t4 being the prior leader of all four.

Then the maximum likelihood value of t4 is likely to exceed t1.




Changes in the factor graph

For the multiway tie problem, we add another layer to the factor graph, namely the layer of place performances li.

Each team performs in the ε-neighborhood of its place performance, and place performances relate to each other with strict inequalities like l2 < l1 − 2ε.

Then it’s inference as usual. We have not experienced any slowdown in convergence.
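
To see why the margin is 2ε, consider the worst case: the higher-placed team sits at the bottom of its ε-neighborhood and the lower-placed team at the top of its own. The sketch below computes that worst-case gap; the helper name and the numbers are illustrative assumptions.

```python
def worst_case_gap(l_upper, l_lower, eps):
    """Smallest possible gap between a higher-placed and a lower-placed team.

    Each team lies within eps of its place performance, so the higher-placed
    team is at least l_upper - eps and the lower-placed one at most l_lower + eps.
    """
    return (l_upper - eps) - (l_lower + eps)


eps = 1.0
l1, l2 = 10.0, 7.5                       # hypothetical, satisfies l2 < l1 - 2 * eps
gap = worst_case_gap(l1, l2, eps)        # positive whenever l2 < l1 - 2 * eps
```

Since every tied team is attached to the same place performance, the slack no longer accumulates along the tie, however many teams share the place.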




New factor graph




Experimental results

[Figure: average AUC over a sliding window of 50 tournaments. x-axis: tournaments (100 to 700); y-axis: AUC (0.5 to 0.8); curves: TSa, TSb, TS2a, TS2b, TS2c.]




Outline

1 Rating systems
   - Problem setting and motivation
   - TrueSkill

2 This work
   - Changes in the problem setting
   - Models and results




Changes

A couple of years ago, the «What? Where? When?» tournament database started collecting question-wise data.

That is, we now know which specific questions a team has answered; previously we only had standings in a tournament.

So when I got back to the problem of «What? Where? When?» ratings, I found the problem greatly simplified.




Changes

Sample relevant application:

- consider a test suite with many questions that tests something (e.g., IQ);
- participants answer a random subset of questions;
- we need to rate participants but questions are different, so the complexity level cannot be perfectly balanced.

«What? Where? When?» is just like that, but participants are working on the test in teams.




Notation

- set of tournaments $D = \{d_1, \ldots, d_{|D|}\}$;
- each tournament contains a certain number $n_d$ of individual tasks $Q^{(d)}$, $d \in D$, and each question has a binary result: it can be either performed correctly or not;
- teams $T^{(d)} = \{t^{(d)}_1, \ldots, t^{(d)}_{M^{(d)}}\}$ participate in tournament $d \in D$;
- each team $t \in T^{(d)}$ consists of players $P_t = \{p_1, \ldots, p_{m_t}\}$, a total of $m_t$ players in the team;
- for each question $q \in Q^{(d)}$ we know for every participating team $t \in T^{(d)}$ whether this question has been answered correctly; we denote these bits by $x_{tq}$.




Baseline: logistic regression

Baseline model – logistic regression; we model:

- each player $i$ with his or her skill $s_i$,
- each question $q$ with its complexity score $c_q$,
- add the global average $\mu$,
- and train the logistic model

$$p(x_{tq} \mid s_i, c_q) \sim \sigma(\mu + s_i + c_q)$$

for each player $i \in t$ of a participating team $t \in T^{(d)}$ and each question $q \in Q^{(d)}$, where $\sigma(x) = 1/(1 + e^{-x})$ is the logistic sigmoid.
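
A minimal sketch of this baseline in plain Python, assuming one training row per (player, question) pair labelled with the team result. The optimizer (unregularized stochastic gradient ascent), the function names, and the toy data are assumptions for illustration, not the setup used in the talk.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_baseline(rows, n_players, n_questions, lr=0.1, epochs=200):
    """Fit mu, skills s and question scores c by stochastic gradient ascent.

    rows: list of (player index, question index, x_tq) triples, one per
    player of each participating team and each question.
    """
    mu = 0.0
    s = [0.0] * n_players
    c = [0.0] * n_questions
    for _ in range(epochs):
        for i, q, x in rows:
            # Gradient of the log-likelihood with respect to the logit.
            err = x - sigmoid(mu + s[i] + c[q])
            mu += lr * err
            s[i] += lr * err
            c[q] += lr * err
    return mu, s, c


# Toy data (hypothetical): two one-player teams, two questions.
# Player 0 answered both questions, player 1 only question 0.
rows = [(0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
mu, s, c = train_baseline(rows, n_players=2, n_questions=2)
```

On real data some regularization would be needed: a player who appears only on winning answers otherwise drifts towards an ever larger skill.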




Model with latent variables

The logistic model basically assumes that each player successfully answered every question that the team had answered.

But in fact we do not know which player or players have answered.

We can only assume that if the team has failed, then no one on the team has answered.

This situation is similar in spirit to presence-only data models found in, e.g., ecology [Ward et al., 2009; Royle et al., 2012].





Hence, a model with latent variables.

For each player-question pair, we add a latent variable $z_{iq}$ which means “player $i$ has answered question $q$”. For these variables, we have the following constraints:

- if $x_{tq} = 0$ then $z_{iq} = 0$ for every player $i \in t$;
- if $x_{tq} = 1$ then $z_{iq} = 1$ for at least one player $i \in t$.





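Model parameters are still skill and complexity of the tasks:

$$p(z_{iq} \mid s_i, c_q) \sim \sigma(\mu + s_i + c_q).$$

Training with EM:

- E-step: fix all $s_i$ and $c_q$, compute expected values of latent variables $z_{iq}$ as

$$E[z_{iq}] = \begin{cases} 0, & \text{if } x_{tq} = 0, \\ p(z_{iq} = 1 \mid \exists j \in t:\ z_{jq} = 1) = \dfrac{\sigma(\mu + s_i + c_q)}{1 - \prod_{j \in t} \left(1 - \sigma(\mu + s_j + c_q)\right)}, & \text{if } x_{tq} = 1; \end{cases}$$

- M-step: fix $E[z_{iq}]$ and train the logistic model

$$E[z_{iq}] \sim \sigma(\mu + s_i + c_q).$$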
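
The two steps can be sketched as follows. The E-step implements the expectation formula directly; the M-step is shown as stochastic gradient ascent on the logistic model with soft labels, which is one possible choice and not necessarily the one used in the talk. All function names and the data layout are assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def e_step(team, q, x_tq, mu, s, c):
    """Expected z_iq for every player i in `team` on question q."""
    if x_tq == 0:
        return {i: 0.0 for i in team}
    p = {i: sigmoid(mu + s[i] + c[q]) for i in team}
    p_nobody = 1.0                      # probability that no player answers
    for i in team:
        p_nobody *= 1.0 - p[i]
    # Condition on "at least one player has answered".
    return {i: p[i] / (1.0 - p_nobody) for i in team}


def m_step(targets, mu, s, c, lr=0.1, epochs=50):
    """Fit the logistic model to soft labels.

    targets: list of (player index, question index, E[z_iq]) triples.
    Returns updated copies of mu, s, c.
    """
    s, c = list(s), list(c)
    for _ in range(epochs):
        for i, q, target in targets:
            err = target - sigmoid(mu + s[i] + c[q])
            mu += lr * err
            s[i] += lr * err
            c[q] += lr * err
    return mu, s, c


# Two equally skilled players on a team that answered question 0.
expected = e_step([0, 1], 0, 1, 0.0, [0.0, 0.0], [0.0])
```

A full EM loop alternates the two functions over all teams and questions until the parameters stop changing. For a one-player team the E-step returns 1 for every answered question, so the model then coincides with the baseline.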




Metrics and results

And, sure enough, it works fine. Evaluation metrics – ranking quality:

- AUC – probability that the relative standings of a random pair of teams have been predicted correctly;
- WTA (winner takes all) – 1 if the top rated team has won the tournament, 0 otherwise;
- Top3 – correctly predicted teams among the top 3;
- Top5 – correctly predicted teams among the top 5;
- MAP (mean average precision) – correctly predicted teams among the top 10.
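
The first two metrics can be sketched for a single tournament as below. The handling of ties is an assumption: pairs tied in the actual results are skipped, and pairs tied in the prediction count as half correct. The function names and the toy tournament are hypothetical.

```python
from itertools import combinations


def pairwise_auc(predicted, actual):
    """Share of team pairs whose relative standing is predicted correctly.

    predicted: team -> rating (higher is better);
    actual: team -> number of correct answers (higher is better).
    """
    correct, total = 0.0, 0
    for a, b in combinations(list(actual), 2):
        if actual[a] == actual[b]:
            continue                     # assumption: skip actual ties
        total += 1
        agreement = (predicted[a] - predicted[b]) * (actual[a] - actual[b])
        if agreement > 0:
            correct += 1.0
        elif agreement == 0:
            correct += 0.5               # assumption: predicted tie is half right
    return correct / total if total else float("nan")


def winner_takes_all(predicted, actual):
    """1 if the top rated team has the best actual result, 0 otherwise."""
    top = max(predicted, key=predicted.get)
    return int(actual[top] == max(actual.values()))


# Toy tournament (hypothetical): B and C finish in the opposite order.
predicted = {"A": 3.0, "B": 2.0, "C": 1.0}
actual = {"A": 30, "B": 25, "C": 27}
auc = pairwise_auc(predicted, actual)
wta = winner_takes_all(predicted, actual)
```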




Results

[Figure: MAP and AUC over time, from 04.2012 to 09.2014.]




Implementation




Final points

So here’s the takeaway:

- try to collect new data! The new model is much simpler than TrueSkill because we have new data available;
- do not be afraid to work on your passions; if you are excited about the problem, you will make better progress, and «real» applications will find you.

RuSSIR 2015: August 24–28, St. Petersburg, Higher School of Economics.




Thank you!

Thank you for your attention!

