
Predicting Baseball Win/Loss Records from Player

Projections

Connor Daly

[email protected]

November 29, 2017

1 Introduction

When forecasting future results in major league baseball (MLB), there are essentially two

sources from which you can derive your predictions: teams and players. How do players

perform individually, and how do their collaborative actions coalesce to form a team’s results?

Several methods of both types exist, but they are often shrouded in proprietary formulas.

Currently, several mature and highly sophisticated player projection systems are used to

forecast season results. None are abundantly transparent about their methodology. Here I

set out to develop the simplest possible player-based team projection system and try to add

one basic improvement.

2 Predicting Wins and Losses

2.1 Team-Based Projections

One approach to such forecasting is to analyze team performance in head-to-head matchups. A common implementation of this approach is known as an Elo rating and prediction system.

Elo systems start by assigning teams an average rating. After games are played, winning

teams’ ratings increase and losing teams’ ratings decrease relative to the expected outcome

of the matchup. Expected outcomes are determined by the difference in rating between the

two teams. If a very good team almost loses to a really bad team, its rating will only increase

slightly. If an underdog pulls off an upset, however, it will earn relatively more points. As

more games are played, older games become progressively less meaningful. Essentially, this

prediction method considers only who you played, what the margin of victory was, and

where the match was played (home team advantage is adjusted for). Using Monte Carlo

simulations, one can predict the outcomes of individual seasons for each team. In between

seasons, teams are regressed towards the mean. For a detailed explanation of a baseball Elo

model, see FiveThirtyEight [Boi].
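To make the update mechanics concrete, here is a minimal sketch of an Elo step in R (the language used for this project's implementation). The K-factor, home-field bonus, and margin-of-victory multiplier are illustrative placeholders, not FiveThirtyEight's actual constants.

    # Expected score for the home team from the rating difference,
    # with a flat home-field bonus added to the home rating.
    elo_expected <- function(r_home, r_away, home_adv = 24) {
      1 / (1 + 10 ^ ((r_away - (r_home + home_adv)) / 400))
    }

    # Move both ratings toward the actual result; larger margins of
    # victory move ratings more, with diminishing returns.
    elo_update <- function(r_home, r_away, home_won, margin = 1, k = 4) {
      exp_home <- elo_expected(r_home, r_away)
      mov <- log(abs(margin) + 1)
      delta <- k * mov * ((if (home_won) 1 else 0) - exp_home)
      c(home = r_home + delta, away = r_away - delta)
    }

    elo_update(1550, 1480, home_won = TRUE, margin = 3)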

The main advantage of this kind of team-based approach is that it can capture some of the hard-to-pin-down factors that make teams more than just the sum of their parts. Without

figuring out what the secret sauce is, this method estimates the sum total contributions

of ownership, coaches, team philosophy, and an uncountable number of other factors. The

method does have a significant downside, however, in that it can’t take advantage of known

changes in team dynamics, such as changes in players and coaches. If I know Babe Ruth is

leaving the Yankees after a particular season, I probably want to project them differently

than I would have otherwise. This model fails to capture that.

2.2 Player-Based Projections

Baseball enjoys a unique advantage over other major American sports in that it is sig-

nificantly easier to decouple the performance of individual players to determine who was

ultimately responsible for creating a certain result. If a batter hits a home run, we can say

with a high degree of certainty that the batter and the pitcher combined to cause this event.

By looking at the large number of combinations of batter/pitcher matchups, we can gauge

the relative skill of each by their performance against a wide variety of opponents. On the

other hand, a sport such as football presents significant challenges to gauging the true skill of

individual players. Looking at the running game, how can one intelligently and objectively

pass out credit and blame? If a running back runs for a seven yard gain on a toss sweep

to the right, how much credit should the left guard receive? Decoupling in baseball isn’t

perfect, but compared to other sports, it’s much easier.

2.2.1 Wins Above Replacement

A foundational pillar of sabermetrics, the empirical, quantitative analysis of baseball, is

the concept of wins above replacement (WAR). Essentially, the idea is that all meaningful

baseball statistics must measure how events on the field help or hurt a team’s chances of

winning in expectation. The way games are won is by teams scoring runs and preventing

runs from being scored. Thus, every event can be understood in the context of runs

allowed or runs created.

This idea can be hard to grasp at first. How many runs does a home run create? One?

Rather counter-intuitively, the generally accepted value is around 1.4 runs. How is this?

Well, not only does the batter score himself, he also bats in any potential

runners on base. You must also consider the possibility that had the batter made an out

instead of scoring these base runners, following batters could have driven them in. Using

real playing data, we can determine the expected run-creating or run-subtracting value of every event in baseball. See Table 1 for a complete breakdown of the run values of such events.

By looking at the total contributions of a player over the course of a season, we can

sum up the expected run contributions of every event the player caused. Now we need to

compare our player against a baseline. A first intuition might be to compare the player

to league average. Well, defining league average to be a baseline of zero runs created sells

league average players short. A league average player is better than approximately half of the

players in the league. That’s valuable production! Instead, we scale our player’s contribution

against the idea of a replacement level player. The production of a replacement level player

is intended to be equivalent to the contributions of an infinitely replaceable minimum salary

veteran or minor league free agent. For reference, a team of replacement players is defined

by Fangraphs to win approximately 48 games over the course of a 162 game season. Using

this replacement level, we determine the original player's runs above replacement. Next, we divide the runs above replacement by the number of runs per win in an average game. Finally, we rescale the calculated WAR so that the sum of all WAR plus replacement-level wins equals the total number of wins available in the season. On average, a player's context-free stats would have resulted in his team winning an extra number of games, corresponding to his WAR, compared to fielding a replacement level player in his place.
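As a minimal sketch of that chain of conversions in R, under two common rule-of-thumb assumptions the text does not pin down exactly: roughly ten runs per win, and a replacement level about 20 runs per 600 plate appearances below average:

    # Convert batting runs above average into wins above replacement.
    # Both constants are season-dependent in the real calculation; the
    # values here are rules of thumb for illustration only.
    runs_to_war <- function(batting_runs, pa,
                            runs_per_win = 10,
                            replacement_runs_per_600pa = 20) {
      # Credit the gap between average and replacement, prorated by
      # playing time, then convert runs to wins.
      replacement_runs <- replacement_runs_per_600pa * pa / 600
      (batting_runs + replacement_runs) / runs_per_win
    }

    runs_to_war(batting_runs = 2, pa = 600)   # ~2.2 WAR

Note that under these assumptions an exactly average full-time player comes out around 2 WAR, matching the usual sabermetric convention.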

There is a finite pool of WAR for all players. When one player performs better, that

means less WAR will be allocated to the rest of the players.

Unfortunately for the reader, there are several variants of WAR, and all define things

slightly differently. Several rely on inexplicably chosen constants or proprietary formulas.

The main basis of my calculations relies on Fangraphs WAR, but I did make some alterations,

which will be explained later. For more in depth explanations of WAR and its underpinnings,

see [Joh85], [Tom06], and [Fanb].

2.2.2 WAR to Wins

By projecting a season’s worth of players’ expected WAR contributions, we can group players

by target year team and take the sum total of their contributions. The combined total of

their WAR should be able to help predict the team’s actual number of wins. This relationship

isn’t necessarily one-to-one, as will be discussed in 4.5. This method benefits from being

able to track players as they change teams.

Table 1: Run Values by Event

Event                  Run Value    Event                     Run Value
Home Run                   1.397    Balk                          0.264
Triple                     1.070    Intentional Walk              0.179
Double                     0.776    Stolen Base                   0.175
Error                      0.508    Defensive Indifference        0.120
Single                     0.475    Bunt                          0.042
Interference               0.392    Sacrifice Bunt               -0.096
Hit By Pitch               0.352    Pickoff                      -0.281
Non-intentional Walk       0.323    Out                          -0.299
Passed Ball                0.269    Strikeout                    -0.301
Wild Pitch                 0.266    Caught Stealing              -0.467

Empirical measurements of the run value of events from the 1999–2002 seasons. Data from [Tom06].
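Turning a stat line into batting runs is then just a dot product between event counts and the run values in Table 1. A small illustration in R (the stat line is made up):

    # Run values for a few common events, from Table 1.
    run_values <- c(single = 0.475, double = 0.776, triple = 1.070,
                    home_run = 1.397, walk = 0.323, out = -0.299)

    # A hypothetical season stat line, as counts of each event.
    events <- c(single = 100, double = 30, triple = 3,
                home_run = 25, walk = 60, out = 400)

    # Batting runs above average: counts weighted by run values.
    sum(events * run_values[names(events)])   # ~8.7 runs above average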

3 Projecting Players

To create a player projection based season long team projecting system, the first step is to

project players. Essentially, you need to look at a player's past performance and predict how

he will perform in the future. Some methods of doing this are highly sophisticated, others

quite simple. Systems like Baseball Prospectus’s PECOTA, Dan Szymborski’s ZiPS, and

Chris Mitchell’s KATOH all combine bunches of variables and various calculations to com-

pute projected outcomes. PECOTA in particular is based primarily around player similarity

scores. Mainly, it uses various metrics to find comparable players for a given to-be-projected

player and uses the performance of those comparables to infer a trajectory for the targeted

player’s future performance. Although its general methodology has been discussed, its spe-

cific implementation is proprietary. On the other end of the sophistication spectrum is perhaps

the simplest possible projection system: Marcel the Monkey.

3.1 Marcel the Monkey

Marcel the Monkey, or simply Marcel, is a player projection system invented by Tom Tango

[Tan]. It sets out to be the simplest possible player projection system. Essentially, it takes

a weighted average of a player’s last three years (5/4/3 for batters and 3/2/1 for pitchers),

regresses the player toward the mean by adding 1200 plate appearances of league-average performance, and applies an aging curve that increases a player's skills until age 29, after which point they begin to decline. These projections

make no attempt to differentiate for team, league, or position, with the exception that some

different constants are used for starting pitchers and relief pitchers.

Rather than calculating counting stats such as hits or home runs specifically, Marcel

projects rate stats like hits or home runs per plate appearance. Plate appearances for batters are

calculated from the previous two years and then added to a baseline of 200 plate appearances.

Thus, all players are projected to have at least 200 plate appearances in the target year,

even a player that may have retired two years prior. When translating from player to team

projections, this is controlled for by setting rosters with the actual players who played on

teams in the target year.

A note about pitchers: they are projected per inning pitched rather than by plate

appearance. Starting pitchers are projected to a minimum of 60 innings and relievers are

projected to a minimum of 25. A pitcher’s starter or relief role is defined by the ratio of

games started to games played. A starter has started more than half his appearances in the

given period.
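A compressed sketch of the batter half of this scheme in R. The 5/4/3 weights, the 1200 plate appearances of regression, and the 200 plate appearance baseline follow Tango's published description; the aging multiplier below matches the commonly cited Marcel adjustment, but treat it as an approximation:

    # Marcel-style projection of one rate stat (e.g., HR per PA).
    # yr1 is the most recent season's rate; lg_rate is league average.
    marcel_rate <- function(yr1, yr2, yr3, pa1, pa2, pa3, lg_rate, age) {
      w <- c(5, 4, 3)
      num <- w[1] * yr1 * pa1 + w[2] * yr2 * pa2 + w[3] * yr3 * pa3
      den <- w[1] * pa1 + w[2] * pa2 + w[3] * pa3
      # Regress toward the mean by mixing in 1200 PA of league-average play.
      rate <- (num + 1200 * lg_rate) / (den + 1200)
      # Aging: improve toward 29, decline afterwards.
      age_adj <- if (age < 29) 1 + 0.006 * (29 - age) else 1 - 0.003 * (age - 29)
      rate * age_adj
    }

    # Projected playing time: weighted recent PAs plus a 200 PA baseline.
    marcel_pa <- function(pa1, pa2) 0.5 * pa1 + 0.1 * pa2 + 200

    marcel_pa(600, 550)   # 555 projected plate appearances
    marcel_rate(0.045, 0.040, 0.035, 600, 550, 500, lg_rate = 0.030, age = 27)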

Marcel player projections are the foundation of my Marcel-based projection system. The

first phase of my project centered around implementing a Marcel projection scheme in R for

both batters and pitchers. Going back in time, older seasons don’t contain the same amount

of statistical data that modern seasons do. Because of this, I am only able to create Marcel

projections for seasons from 1955 onwards.

4 Marcel the Monkey to Marcel the Manager

After developing my Marcel projections, the next step in projecting team seasonal results

was to group the players into teams and sum their accomplishments.

4.1 Season Lengths

Prior to 1961, both the American and National Leagues played a 154 game season before

later switching to 162. Other regular seasons have also been shortened, such as by the players'

strike in 1994. As such, all projections must account for varying season lengths. All reported

accuracy statistics will be scaled to a 162 game season.

4.2 Adjustments to WAR Calculation

Although I generally followed standard calculations for Fangraphs WAR, my calculations did

diverge enough to be considered significantly different. For position players, I only considered

batting runs created, not fielding runs or baserunning runs. The numbers also aren’t position,

league, or park adjusted. Because WAR is designed to be a retrospective statistic and my

numbers are forward looking, I did my best to remove them from all possible context. My

projections don’t take things like park factors or league adjustments into account, so neither

should my WAR calculations. Fielding runs were not calculated because advanced fielding

statistics are not provided in the Lahman database I used to gather my projections. See A.1

for more information on data sources.

4.3 New Season’s Rosters

Rosters for target year teams were assembled by looking at batting statistics for the following

year. I defined being “on the team” for that year to be having at least one plate appearance

for said team and that being the first team the player appeared with that year. Because I

drew my batting stats from the Lahman database (see A.1), I only could project through

the 2016 season. I will soon be able to predict the 2017 season when the 2017 version of

the Lahman database is published, likely in the coming weeks. To predict a season before it

actually happens, I would need to add a new source of data to determine which players to

include on a roster.

4.4 Rescaling WAR

Going by Fangraphs' definition of WAR, a team of only replacement players is expected

to achieve a winning percentage of approximately 0.294. Over a 162 game season, this

corresponds to about 48 wins; however, not all seasons since 1955 contained 162 games.

Additionally, not all seasons feature 30 teams. Hence, the number of available wins for

players to earn fluctuates from year to year. To calculate the available WAR in year x:

WAR(x) = NumTeams(x) * NumGames(x) * (1/2 − ReplacementLevel)    (1)

where NumTeams(x) is the number of teams playing in year x, NumGames(x) is the mode

of a team’s played games in year x, and ReplacementLevel is the winning percentage for a

team of exclusively replacement level players. WAR is then divided so that 57% is allocated

for position players and 43% for pitchers.

Once the total amount of WAR has been allocated, player projections must be scaled so that their summed WAR equals the available WAR for the season.
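In R, computing the seasonal pool from equation (1) and rescaling a vector of projected player WAR might look like this (0.294 is the replacement level cited above; the 57/43 split comes from the text):

    # Total WAR available to earn in a season, per equation (1).
    available_war <- function(num_teams, num_games, replacement_level = 0.294) {
      num_teams * num_games * (0.5 - replacement_level)
    }

    # Rescale projected player WAR so it sums to the seasonal pool.
    rescale_war <- function(projected_war, pool) {
      projected_war * pool / sum(projected_war)
    }

    pool <- available_war(30, 162)   # ~1001 WAR for a modern season
    batter_pool  <- 0.57 * pool      # position players' share
    pitcher_pool <- 0.43 * pool      # pitchers' share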

4.5 Correcting Diminishing Returns

The next step in projecting teams is to sum individual player WAR to establish a team

WAR. Once that has been done, we can add a team’s total WAR to the season’s per team

replacement win total. The resulting win total is that team's win projection for the year. You could stop there; however, doing so makes a key incorrect assumption about win totals: that they increase linearly with run differential. Unfortunately, that is not the case. As you will see in the results section, there are clear diminishing returns at the extreme ends of the projection spectrum. The more WAR a team adds over a projected 81 wins (a .500 season), the more the model will overestimate the value of those WAR in predicting the number of wins. Similarly, the further a team falls below .500, the more the model will underestimate it. The relationship between WAR and wins is not entirely linear!

A simple solution to this problem is to apply a correcting function to the projections.

I looked at applying two different correction models to the data, one linear and one cubic. I

used simultaneous perturbation stochastic approximation (SPSA) to help determine the

parameters [Spa03]. For a more detailed explanation of model selection, see 5.1.

4.6 Measuring Correctness

4.6.1 Validity and Verification

When constructing mathematical models of reality, one must always ask two questions: is

the model correctly implemented, and does the model actually represent some semblance of reality? To answer the first question, we will look at publicly available data on Marcel player

projections and wins above replacement. For the second, we will construct a loss function

to measure how predictive our model can be.

The first step in model verification is to ensure that my implementation of Marcel pro-

jections matches the intended projections of the method's creator. Thankfully, he has many

years of Marcel projections posted on his website [Tan]. Although our numbers aren’t in

complete agreement, they appear to be within an acceptable bound. Differences are on

the order of one or two per stat and are likely due to implementation details such as digit

precisions and rounding decisions.

To verify I’m computing WAR correctly, I compared my projected WAR totals to the

actual WAR earned in the target season. Looking mostly at the top of the board, I checked

that my WAR projections seemed to be a reasonable weighted average from the previous

three years. If a player averaged three WAR per year and was projected for six, I’d know

something was off. I did recognize, however, that there would likely be reasonably large

divergences for players who were extreme defensively, either extremely good or extremely

bad. On the aggregate, the WAR totals seemed to match up pretty well, but I don’t have a

rigorous calculation showing this is true.

Finally, to verify I aggregated team projections correctly, I looked at the sum of all projected

wins per year and compared it to the total number of available wins. I made sure the

calculated number was within a couple wins of the actual. I allowed small differences because

some years teams play a different number of games and rounding can cause a win or two to

fall through the cracks. The projections will still be reasonable.

4.6.2 The Loss Function

When measuring the validity of the model, it may seem tempting to say that we can measure

its accuracy directly. But what exactly is it that our model is trying to measure? Are we

trying to predict actual wins and losses, or are we trying to predict true talent, which can only be measured noisily via wins and losses? I would argue that we attempt to ascertain true talent by noisily measuring wins and losses. Thus, we define our loss function y(θ):

y(θ) = L(θ) + ε(θ)                                      (2)

y(θ) = (162/n) ∑_{i=1}^{n} |x̂i − xi| / NumGames(i)      (3)

where L(θ) measures the loss of the prediction’s ability to measure true talent and ε(θ)

is a noise term. The more concrete version specifies that the loss can be measured as the

mean absolute error of the model’s predictions scaled to a 162 game season. That is, for

teams numbered 1, ..., i, ..., n, x̂i is the model’s predicted number of wins for team i, xi is the

team’s actual number of wins, and NumGames(i) is the number of games played by team

i. This computes the mean absolute error for all teams, scaled to 162 games. This allows us

to compare results from teams who played seasons of different lengths.
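Equation (3) translates directly into R:

    # Mean absolute error of win predictions, scaled to a 162 game season.
    # predicted and actual are vectors of team win totals; games gives the
    # number of games each team actually played.
    season_loss <- function(predicted, actual, games) {
      162 * mean(abs(predicted - actual) / games)
    }

    season_loss(predicted = c(88, 75), actual = c(92, 70), games = c(162, 162))
    # 4.5 wins per team per 162 game season for this toy example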

5 Results

Without applying any correction model, I was able to achieve a loss of 8.16 wins per team per

162 game season. See Figure 1 for a visual representation of results. Although not perfect,

Figure 1: Results of Uncorrected Projections. (a) Uncorrected projections for 1955–2016; (b) residuals for uncorrected projections.

there is a clear trend between the predictions and the results. The residuals from the one-to-one line, however, appear to show a positive trend, meaning the model is overestimating teams

at the right end of the graph and underestimating teams at the left. We can attempt to

correct for this.

5.1 Calculating Correction Parameters

I used two different models to attempt to apply corrections: one linear and one cubic. For

both, I used SPSA to determine the optimal value. I chose a linear and a cubic because

I assumed that the correction needed to be reasonably antisymmetric around 81 projected

wins, corresponding to a .500 record. Both a negative sloping linear and a cubic function

could provide that correction. I picked my initial parameters by guessing a scaling factor and

choosing the other terms such that the x-intercept was 81. I attempted to find the correction

factor such that:

CorrectedWins = ProjectedWins + Correction    (4)

5.1.1 Linear Model

For the linear model, I modelled Correction = β0 * proj.wins − β1, starting with an initial β value set [−0.25, 20.25]. The initial beta values were determined by manually guessing and

checking a few test values. After a million runs with parameters A = 1000, a = .01, c =

.015, α = 0.602, γ = 0.101, and a Bernoulli distribution (+1,-1) for my deltas, I determined

my optimal value to be [−0.250002, 20.249998]. This resulted in a net loss of 7.77 wins per

team per 162 game season.

I used the gain sequence provided in [Spa03], so I know that the gain sequence con-

ditions for convergence are satisfied. By using a Bernoulli distribution for my deltas, as

in [Spa03], I’ve satisfied conditions on deltas. The rest of the conditions are unknowable

without knowledge of L, but it seems reasonable that it is sufficiently smooth and bounded.
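For reference, a bare-bones version of the SPSA loop in R, following the standard form in [Spa03] with the gain parameters quoted above. The loss function is caller-supplied; in this project it would be the scaled mean absolute error of the corrected projections (team_loss below is a hypothetical wrapper, not code from the paper):

    # SPSA minimization of a noisy loss over a parameter vector beta.
    spsa <- function(loss_fn, beta, iters = 1e6,
                     A = 1000, a = 0.01, c = 0.015,
                     alpha = 0.602, gamma = 0.101) {
      p <- length(beta)
      for (k in seq_len(iters)) {
        ak <- a / (k + A) ^ alpha                       # step-size gain
        ck <- c / k ^ gamma                             # perturbation gain
        delta <- sample(c(-1, 1), p, replace = TRUE)    # Bernoulli +/- 1
        # Two-sided simultaneous-perturbation gradient estimate.
        g_hat <- (loss_fn(beta + ck * delta) - loss_fn(beta - ck * delta)) /
                 (2 * ck * delta)
        beta <- beta - ak * g_hat
      }
      beta
    }

    # e.g., spsa(function(b) team_loss(b[1], b[2]), beta = c(-0.25, 20.25))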

5.1.2 Cubic Model

For the cubic model, I used vertex form to model Correction = β0 * (proj.wins − β1)^3, starting with an initial β value set [−0.01, 81]. The initial beta values were determined by manually

guessing and checking a few test values. After a million runs with parameters A = 1000, a

= .0001, c = .0015, α = 0.602, γ = 0.101, and a Bernoulli distribution (+1,-1) for my deltas,

I determined my optimal value to be [−0.0003673105, 80.9289773302]. This resulted in a net

loss of 8.02 wins per team per 162 game season.

I used a scalar multiple of the gain sequence provided in [Spa03], so I know that the

gain sequence conditions for convergence are satisfied. By using a Bernoulli distribution for

my deltas, as in [Spa03], I’ve satisfied conditions on deltas. The rest of the conditions are

unknowable without knowledge of L, but it seems reasonable that it is sufficiently smooth

and bounded.

5.2 Results with Corrections

After analyzing both the linear and the cubic model, I needed to decide which to use

for my corrections. I decided to use cross validation to determine which model to use. I

used three different test sets that were created by grouping the data points by their position

modulo three. Performing the same SPSA calculations as in the individual trials but with

10,000 runs, I found the linear model had an average loss of 7.82 wins per 162 game season

and the cubic model had 8.04 wins per 162 game season. I decided to use the linear model

for my corrections.
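The fold construction described here is simple to express in R (team_data is a hypothetical stand-in for the real table of team-season projections):

    # Toy stand-in for the real table of projected and actual wins.
    team_data <- data.frame(projected = c(88, 75, 92, 70, 81, 85),
                            actual    = c(92, 70, 90, 74, 79, 88))

    # Three folds: group data points by their row index modulo three.
    fold <- seq_len(nrow(team_data)) %% 3
    for (f in 0:2) {
      train <- team_data[fold != f, ]
      test  <- team_data[fold == f, ]
      # fit correction parameters on train (e.g., via SPSA), then
      # evaluate the held-out loss on test
    }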

Although this helped reduce our loss function, our corrected model still isn't perfect. Noticeably, the corrected left tail in Figure 2a isn't as well predicted as in the uncorrected version. Overall, though, the corrected model sees noticeable improvements year to year over the uncorrected model, as shown in Figure 2b.

Figure 2: Looking at Corrected Projections. (a) Corrected projections for 1955–2016; (b) year-to-year correction improvement, 1955–2016.

5.3 Perfect and Perfectly Imperfect Knowledge

So how good is the model actually? We know we can achieve an average loss of under

eight games per season, but is that any good? Suppose we knew nothing about individual MLB teams and instead only knew the distribution of MLB records. We can assume that it is approximately normal and by definition has an average winning percentage of .500. The standard deviation in win percentage turns out to be around 0.07. That

corresponds to around 11.3 wins per 162 game season. If we randomly assign teams a win

percentage from this distribution, we end up with a loss function around 13 wins per 162

game season.

Similarly, if we were to project a .500 record for every team, we'd be off by about 9.5 wins

per 162 game season.
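Both baselines are easy to sanity-check by simulation in R, using the 0.07 standard deviation quoted above. The results land near the reported 13 and 9.5 wins; the small gaps reflect the normal approximation versus the true distribution of records:

    set.seed(1)
    actual_pct <- rnorm(1e5, mean = 0.5, sd = 0.07)   # "true" team records

    # Baseline 1: guess a random record from the same distribution.
    guess_pct <- rnorm(1e5, mean = 0.5, sd = 0.07)
    162 * mean(abs(guess_pct - actual_pct))   # ~12.8 wins per 162 games

    # Baseline 2: project a .500 record for every team.
    162 * mean(abs(0.5 - actual_pct))         # ~9.1 wins per 162 games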

By contrast, how good of a projection could we ever hope to get? The best predictor

of how many wins a team accrues turns out to be an estimation based solely on its runs

scored and runs allowed. These estimations are called pythagorean win projections, the

most accurate of which is referred to as the pythagenpat win total [Pro]. If we had perfect

knowledge of how many runs a team would score and allow, we could use their pythagenpat

wins to predict their record, like in Figure 3. Yet still with this perfect knowledge, we can

only come within 3.18 wins per 162 game season.
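For reference, pythagenpat estimates winning percentage from runs scored and runs allowed using a run-environment-dependent exponent. A sketch in R; the 0.287 constant is the commonly cited value, though implementations differ slightly:

    # Pythagenpat expected winning percentage from runs scored/allowed.
    pythagenpat <- function(rs, ra, games) {
      x <- ((rs + ra) / games) ^ 0.287   # exponent grows with run environment
      rs ^ x / (rs ^ x + ra ^ x)
    }

    162 * pythagenpat(rs = 800, ra = 700, games = 162)   # ~91 expected wins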

If we consider projecting all teams to a .500 record (a loss of about 9.5 wins) to be the low point and the pythagenpat loss of 3.18 wins to be a theoretical upper bound, our model's 7.77-win loss achieves (9.5 − 7.77) / (9.5 − 3.18) ≈ 27% of all possible knowledge gain.

6 Park Effects

Now, I attempt to add one final improvement to the model: park effects. Essentially, not

all ballparks in major league baseball are created equal; they have different dimensions and

atmospheric effects that make some parks easier to score runs in than others. Using park

Figure 3: Predicting Wins from Pythagenpat Wins 1955-2016

effects data (see A.1), I deflated all player stats to remove park effects before computing their

Marcel projection. After their season was projected, I looked at their destination home

ballpark and inflated their numbers to reflect their new home. Surprisingly, this made

my projections worse across the board, only beating my standard Marcel model twice in 60

years, as shown in Figure 4. A clear shift occurs around 1973: in 1974, more detailed park effects were released, which led to improved predictions. Although park effects are certainly real, I'm left to conclude that, averaged out over a very large sample of players, the current level of granularity is too coarse to be very predictive.
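The adjustment itself is multiplicative. A minimal sketch in R, assuming Fangraphs-style factors where 100 is a neutral park (the example factors are made up):

    # Deflate a stat out of its old park, then inflate into the new one.
    park_adjust <- function(stat, old_factor, new_factor) {
      neutral <- stat / (old_factor / 100)   # remove the old park effect
      neutral * (new_factor / 100)           # apply the new park effect
    }

    # A 30-homer season moved from a hitter's park to a pitcher's park:
    park_adjust(30, old_factor = 105, new_factor = 95)   # ~27.1 HR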

7 Conclusions

At the start of this project, I wanted to build the dumbest player-based projection model possible

and see if I could improve it. Beyond a simple error correction, I couldn't in the short time I

had. Although my model may be dumb enough for a monkey, it is still reasonably predictive

and appears likely to hold up against far more sophisticated predictions.

Figure 4: Comparing Projections With and Without Park Effects

System                 Loss
Marcel                 6.00
PECOTA                 6.20
FanGraphs              5.80
Davenport              6.97
Banished to the Pen    5.97
Essays                 6.37
Composite              5.73

Table 2: 2016 Projection Comparison (loss in wins per team per 162 game season)

7.1 Comparison to Other Models

Unfortunately, many of the data points required to do a full many-year model comparison lie behind paywalls or aren't easily searchable on the internet. Baseball Prospectus has currently

taken down the seasonal PECOTA projections as they upgrade their site. We can, however,

look at the year 2016. After training my model on the years 1955-2015, it predicts 2016

with a loss of 6 wins per 162 game season. Look at Table 2 to see how it stacked up to the

competition. Basically, Marcel projections went toe-to-toe with the best of the best. Data

is courtesy of [Aus].

7.2 Challenges and Future Directions

There are several limitations to my model, some mathematical, some sabermetrical. First,

my measurements of WAR only look at batting and defense-independent pitching. This

removes skills related to baserunning and fielding from the game. This causes players with

fielding or baserunning talent significantly different from league average to be incorrectly

valued. Secondly, Marcel doesn’t do a great job of adjusting for playing time. Every player

is projected a minimum of 200 plate appearances with no regard for their expected role on the

team. More intelligently modelling fielding and baserunning skill as well as better adjusting

for playing time could significantly improve the model. Another simple improvement would

be to add a more robust aging curve. Different positions tend to age differently; a position-specific aging curve could add benefits.

Obviously, I’d like a better way to manage roster data so I can project current rosters

into the future without relying on Lahman data.

Mathematically, I would have liked to run better SPSA optimizations. For the amount

of time I ran them, I wasn’t able to move my final parameters very far from my initial guess.

This caused me to need to check a lot of values by hand to figure out where the best place

to start the optimization was. Better choices of SPSA parameters and longer running times

likely would have helped.

Additionally, my model doesn’t account well for uncertainty. Marcel has a way to measure

reliability based on how much the player’s projection comes from his own stats versus how

much it is regressed towards the mean. I would’ve liked to have added a similar component

that could perhaps provide confidence intervals around a team’s projection.

A Appendix

A.1 Sources of Data

Seasonal batting and pitching data was obtained from the Lahman database [Lah]. I made

use of years through 2016, which was the last published year with entries at the time of

writing. Park effect factors came from Fangraphs [Fana]. Fully detailed factors were available from 1974 through 2015; earlier years only had basic effects available.

References

[Joh85] John Thorn and Pete Palmer. The Hidden Game of Baseball. University of Chicago Press, 1985. ISBN: 9780226242484.

[Spa03] James C. Spall. Introduction to Stochastic Search and Optimization. Wiley-Interscience Series in Discrete Mathematics and Optimization. John Wiley and Sons, 2003. ISBN: 9780471330523.

[Tom06] Tom M. Tango, Mitchel G. Lichtman, and Andrew E. Dolphin. The Book: Playing the Percentages in Baseball. TMA Press, 2006. ISBN: 9781494230170.

[Aus] Darius Austin. Evaluating the 2016 Season Preview Predictions. URL: http://www.banishedtothepen.com/evaluating-the-2016-season-preview-predictions/.

[Boi] Jay Boice. How Our 2017 MLB Predictions Work. URL: https://fivethirtyeight.com/features/how-our-2017-mlb-predictions-work/.

[Fana] Fangraphs. Park Factors. URL: http://www.fangraphs.com/guts.aspx?type=pf&teamid=0&season=2012.

[Fanb] Fangraphs. WAR for Position Players. URL: https://www.fangraphs.com/library/war/war-position-players/.

[Lah] Sean Lahman. Lahman's Baseball Database. URL: http://www.seanlahman.com/baseball-archive/statistics/.

[Pro] Baseball Prospectus. Pythagenpat. URL: http://legacy.baseballprospectus.com/glossary/index.php?mode=viewstat&stat=136.

[Tan] Tom Tango. The 2004 Marcels. URL: http://www.tangotiger.net/archives/stud0346.shtml.
