You can NOT be serious!
How to build a tennis model in 30 minutes
Dr Tim Paulden (Innovation & Development Manager, ATASS Sports)
EARL 2014, London, 16 September 2014

Introduction

ATASS Sports: sports forecasting, hardcore statistical research, and a fusion of the academic and the pragmatic. Today: building a very basic tennis model and highlighting some key ideas along the way.

Tennis modelling

Data obtained from tennis-data.co.uk, with a spreadsheet for each year, so you can easily get the data yourself. The ultimate goal of modelling is to determine the probability of different outcomes. Can we forecast the probability of victory in a match from the players' world rankings? How do we identify a "good" model?

Concept 1: Model calibration

An effective model must be well-calibrated: the probabilities produced by the model must be consistent with the available data. Think in terms of bins: if we gather together all the cases where our generated win probability lies between 0.6 and 0.7 (say), the observed proportion of wins should match the mean win probability for the bin (roughly 0.65). Here's an extract from Nate Silver's recent bestseller, The Signal and the Noise.

Concept 2: Model score

Suppose we use a model to produce probabilities for a large number of sporting events (e.g. a collection of tennis matches). We can assess the model's quality by summing log(p) over all predictions, where p is the probability we assigned to the outcome that occurred; this is the model score. The closer we match the "true" probabilities, the higher the model score (i.e. the closer it is to zero).

The data set

The tennis data frame has 68,972 rows, with each match appearing twice (A vs B and B vs A); dim(tennis), head(tennis) and tail(tennis) show its size and columns (matchid, date, day, ago, surf, bestof, aname, arank, bname, brank, res). The head of the data is early hard-court, best-of-3 matches (clement a, ranked 18, against gaudenzi a, ranked 101; goldstein p, ranked 81, against jones a, ranked 442; haas t, ranked 23, against smith l, ranked 485), and the tail is recent clay-court matches (rosol l, ranked 48, against simon g, ranked 16; garcia lopez g, ranked 87, against mayer f, ranked 29; rosol l against garcia lopez g).

From ranks to probabilities

How might we map the players' rankings onto a win probability? We'll look at an extremely rudimentary approach in a moment as a worked example. But first, consider for a moment how you might mathematically combine the players' rankings to get a win probability for each player. What are the important properties?

A "first stab": Model 1

Suppose our first guess is that if the two players' rankings are A and B, the probability of A winning the match is B/(A+B). For the first few matches in the data (including henman t, ranked 10, against rusedski g, ranked 69, and hewitt l, ranked 7, against arthurs w, ranked 83), this gives aprob1 values such as 101/(18+101) = 0.849 for clement a against gaudenzi a and 83/(7+83) = 0.922 for hewitt l against arthurs w.

In this case, Model 1 improves the model score by about 1710 relative to the "null" model in which each player is always assigned a probability of 0.5 (remember, closer to zero is better). How about the calibration?

Let's generate a calibration plot for Model 1. We'll use bins of width 0.1 (0 to 0.1, 0.1 to 0.2, etc), closed at the left-hand side (e.g. 0.6 <= x < 0.7). For each bin, we consider all instances where our model probability lies inside the bin, and plot a point whose x-coordinate is the mean of the model probabilities and whose y-coordinate is the observed proportion of wins for these instances. Example: for the bin 0.6 <= x < 0.7, the point plotted is (0.648, 0.588).
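To make the model-score and calibration calculations concrete, here is a minimal R sketch, assuming a data frame named tennis with the columns described above (in particular arank, brank and res, with res = 1 when player A wins); the helper names model_score and calibration_table are illustrative, not from the talk.

# Model score: sum of log(p) over the outcomes that actually occurred.
# Each match appears twice in the data, so summing over the rows where A won
# counts every match exactly once, with p the probability given to the winner.
model_score <- function(prob, res) {
  sum(log(prob[res == 1]))
}

# Calibration table: bins of width 0.1, closed on the left (e.g. 0.6 <= p < 0.7).
# For each bin, report the mean modelled probability and the observed win rate.
calibration_table <- function(prob, res, width = 0.1) {
  bins <- cut(prob, breaks = seq(0, 1, by = width),
              right = FALSE, include.lowest = TRUE)
  data.frame(bin       = levels(bins),
             mean_prob = as.vector(tapply(prob, bins, mean)),
             obs_rate  = as.vector(tapply(res, bins, mean)),
             n         = as.vector(table(bins)))
}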
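Using those helpers, Model 1 itself is only a couple of lines. This is a sketch under the same assumptions; the quoted improvement of roughly 1710 is the figure reported in the talk rather than something this snippet guarantees to reproduce.

# Model 1: P(A wins) = B / (A + B), where A and B are the two world rankings.
tennis$aprob1 <- tennis$brank / (tennis$arank + tennis$brank)

# Null model: every prediction is 0.5.
score_null   <- model_score(rep(0.5, nrow(tennis)), tennis$res)
score_model1 <- model_score(tennis$aprob1, tennis$res)
score_model1 - score_null                  # improvement over the null model (about 1710 in the talk)

# Calibration table for Model 1 (compare mean_prob with obs_rate in each bin).
calibration_table(tennis$aprob1, tennis$res)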
Systematic bias of Model 1

The calibration plot reveals a systematic bias: the Model 1 probabilities are systematically too extreme.

A quick fix...

Since the probabilities are too extreme, we could try blending Model 1 with 0.5. What weighting on Model 1 gives the best model score? A weighting of 0.71 on Model 1 turns out to be best, and the model score improves accordingly. Obtaining the best weighting can be done as a one-liner in R, because a blend w*aprob1 + (1-w)*0.5 is simply a linear function of aprob1, so an identity-link binomial GLM finds the optimal intercept and slope directly:

glm(tennis$res ~ tennis$aprob1, family = binomial(link = "identity"))

The fitted model (AIC 85330) supplies the blended probabilities; call the result "Model 2". Its calibration plot shows the bias is reduced, but still apparent, and a side-by-side comparison of the Model 1 and Model 2 scores confirms a substantial improvement.

Stepping up a gear: logistic regression

The invlogit function is widely used to predict binary sports outcomes (logistic regression). Let's do a logistic regression of the result on the difference in rank, (B - A). This is equivalent to player A's win probability being invlogit( k*(B - A) ), and the optimal value of k can be found using glm:

rankdiff = tennis$brank - tennis$arank
g1 = glm(tennis$res ~ rankdiff - 1, family = binomial(link = "logit"))

summary(g1) shows the rankdiff coefficient is highly significant, and the model score can be read straight off the fitted values:

sum(log(g1$fitted.values[which(tennis$res == 1)]))

Model 3 repeats the regression with the difference in log rankings in place of the raw difference, i.e. glm(tennis$res ~ logterm - 1, family = binomial(link = "logit")) with logterm = log(tennis$brank) - log(tennis$arank). The logterm coefficient is again highly significant, and the calibration of Model 3 is almost perfect. Comparisons of the fitted probabilities and scores on sample matches show Model 3 ahead of both Model 1 and Model 2.

Coming full circle

In fact, a bit of algebra shows that invlogit( 0.58*(log(B) - log(A)) ), which is Model 3 with its fitted coefficient, is exactly the same as B^0.58 / (A^0.58 + B^0.58), and that invlogit( B - A ) is the same as exp(B)/(exp(A) + exp(B)). Try the simplest thing that could possibly work! (The same mapping is also shown graphically.)
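Continuing the sketch, the quick fix can be written as follows; this mirrors the talk's identity-link one-liner, with aprob2 as an illustrative name for the blended "Model 2" probabilities.

# Quick fix (Model 2): find the best blend of Model 1 with a constant 0.5.
# The blend w*aprob1 + (1-w)*0.5 is linear in aprob1, so an identity-link
# binomial GLM recovers the optimal intercept and slope directly.
g_blend <- glm(res ~ aprob1, family = binomial(link = "identity"), data = tennis)
coef(g_blend)                              # slope of roughly 0.71 reported in the talk

tennis$aprob2 <- fitted(g_blend)           # blended ("squeezed") probabilities
model_score(tennis$aprob2, tennis$res)     # improved model score for Model 2
calibration_table(tennis$aprob2, tennis$res)   # bias reduced, but still apparent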
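A sketch putting the two logistic fits side by side is below; g1 restates the fit shown above, while g2 and the logterm column are assumed names for the log-rank model (Model 3), and plogis is base R's inverse-logit function.

# Logistic model: P(A wins) = invlogit( k*(B - A) ), with no intercept by symmetry.
tennis$rankdiff <- tennis$brank - tennis$arank
g1 <- glm(res ~ rankdiff - 1, family = binomial(link = "logit"), data = tennis)

# Model 3: use the difference in log rankings instead of the raw difference.
tennis$logterm <- log(tennis$brank) - log(tennis$arank)
g2 <- glm(res ~ logterm - 1, family = binomial(link = "logit"), data = tennis)
coef(g2)                                   # roughly 0.58 in the talk

# Model scores (closer to zero is better).
model_score(fitted(g1), tennis$res)
model_score(fitted(g2), tennis$res)

# Check the algebra: invlogit(0.58*(log(B) - log(A))) equals B^0.58 / (A^0.58 + B^0.58).
with(tennis[1:5, ], plogis(0.58 * (log(brank) - log(arank))) -
                    brank^0.58 / (arank^0.58 + brank^0.58))   # differences are essentially zero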
A final extension

What about the effect of the number of sets? Let's take the best model (Model 3) and look at its calibration plots separately: for all matches, for best-of-3-set matches, and for best-of-5-set matches.

Model 4

This suggests we should have a combined model, "Model 4", based on the rules that are in operation:
- For best of 3 sets: invlogit( 0.54*(log(B) - log(A)) )
- For best of 5 sets: invlogit( 0.72*(log(B) - log(A)) )
The calibration plots for Model 4 (all matches, best of 3 sets, best of 5 sets) look good, and Model 4 achieves the best model score so far, -21252 (a code sketch of this split fit appears after the closing slide).

A final comparison of model scores (closer to zero is better):
- Null model (probabilities all 0.5)
- Model 1 (simple B/(A+B) model)
- Model 2 (Model 1 squeezed)
- Logistic model (based on B - A): -22480
- Model 3 (logistic with logs)
- Combined Model 4 (split version of Model 3): -21252

Some further questions

How can we incorporate some of the other data available into the model? Surface? Individual players? Mapping rankings to probabilities is only one component of the modelling process: you could use your own rankings or ratings!

Final thoughts

Try it yourself! Modelling principles: Start Simple, Generalise Gradually, Capture Curvature, Banish Bias.

Thank you for listening! Dr Tim Paulden
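As a closing sketch, here is one way the split "Model 4" fit might be reproduced, reusing model_score and the logterm column from the earlier sketches; the 0.54 and 0.72 coefficients and the score of about -21252 are the values quoted in the talk, not guaranteed outputs of this snippet.

# Model 4: fit the log-rank logistic model separately for each match format.
g_bo3 <- glm(res ~ logterm - 1, family = binomial(link = "logit"),
             data = subset(tennis, bestof == 3))
g_bo5 <- glm(res ~ logterm - 1, family = binomial(link = "logit"),
             data = subset(tennis, bestof == 5))
coef(g_bo3); coef(g_bo5)                   # roughly 0.54 and 0.72 in the talk

# Combined predictions and the overall model score.
tennis$aprob4 <- ifelse(tennis$bestof == 3,
                        plogis(coef(g_bo3) * tennis$logterm),
                        plogis(coef(g_bo5) * tennis$logterm))
model_score(tennis$aprob4, tennis$res)     # the best score so far in the talk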