eco 231 empirical project mets attendance
Post on 21-Apr-2015
Empirical Project:
Mets Attendance (1986-2011)
Kevin Mulcahey
ECO-231
Dr. Letcher
The College of New Jersey
I. Statement of the Problem
This empirical study seeks to identify the explanatory variables that best explain the
stadium attendance of the New York Mets from 1986 to 2011. The explanatory variables that I
have decided to study are the years since last playoff appearance, payroll (adjusted for inflation),
number of all-stars, winning percentage, and average batter’s age. In order to discover the
relationship between these explanatory variables and the dependent variable of attendance,
multiple regression, residual plots, normal probability plots, and various other statistical analyses
will be employed. The resulting regression equation will also be used to predict the Mets stadium
attendance at Citi Field in 2012.
II. Review of Literature Related to the Variables
Before beginning the study, I consulted journal articles and analyses regarding Major League
Baseball attendance, payroll, winning percentages, and other variables. The
first article, written by Don N. MacDonald and Morgan O. Reynolds of Texas A&M University,
analyzes the relationship between players’ salaries and their marginal product by using many of
the same explanatory variables as my empirical study. Marginal revenue product is defined as the
additional revenue a firm earns by hiring one extra unit of labor.
Research from the 1986 and 1987 Major League Baseball seasons indicates that
players are paid for what they earn for their respective teams in ticket sales. Payroll
correlates directly with revenue for MLB organizations largely because of the institutions
of free agency and final-offer arbitration. Free agency allows players to test the market and seek
the best offer for their abilities among all major league teams. Arbitration allows for pay
increases during contracts, based upon performance. These two contractual outlets for players
allow their marginal revenue product to correlate more directly with ticket sales, attendance, and
team revenue. These findings relate very closely to my data, as my dependent variable is
attendance, and the explanatory variable with the lowest p-value is payroll (adjusted for
inflation). My data also begins in 1986, just like the study performed by MacDonald and
Reynolds. Arbitration officially began in 1970, when a second MLB collective
bargaining agreement allowed impartial arbitrators, as opposed to the commissioner of baseball,
to settle player contract disputes. In the season of 1985, arbitrators discovered that owners across
Major League Baseball were colluding to keep baseball player salaries artificially low, thus
reducing competitive bidding. For their collusion, owners were fined $280 million in
damages, and baseball team payrolls have steadily increased each season thereafter.
The journal article by MacDonald and Reynolds also relates to another one of my
variables. Winning percentage, they say, is not as significant a predictor of
attendance as statistics that forecast a team’s success. For example, once a team’s
winning percentage has risen from .500 to .550, a further increase from .550 to .600 will not make a
noticeable difference in stadium attendance. This is because fans view entertainment
on an “ex ante” basis rather than an “ex post” basis (MacDonald, 445). In other words, the
forecast of a team’s success is more conducive to sales and attendance than past
performance. People are more likely to buy tickets when they expect a team to perform well
than once a team is already doing well. In relating this idea to my variables, the number of
all-stars and team payroll would be more significant predictors of attendance than winning
percentage. Payroll and all-stars are similarly related because when a roster is popular and high
quality, attendance is more likely to increase in a given season.
Another journal article, written by Michael C. Davis of the University of Missouri-Rolla,
analyzes the interaction between baseball attendance and winning percentage. According to
Davis, the interaction between baseball attendance and winning may not be completely obvious.
It is expected that as a team performs well, the organization’s “bandwagon effect” will come to
fruition and a team should “therefore expect to see an increase in attendance during and
following seasons in which the team played well on the field” (Davis, 4). This journal article
implies that although winning percentage affects attendance directly (winning has become
increasingly important to fans in recent years), attendance could also affect winning percentage.
When a team generates superior attendance and revenue, winning percentage should rise because
successful organizations have more room in their budget to acquire high-quality players. In my
regression, I chose to place payroll and winning percentage as explanatory variables, and
attendance as the dependent variable.
The study concluded that in the long term, all ten of the sampled teams
(Cubs, Reds, Yankees, White Sox, Phillies, Pirates, Indians, Tigers, Cardinals, Red Sox)
showed positive attendance growth with winning percentage as an explanatory variable.
The data also showed that only one team, the Indians, had a
positive effect on winning percentage with attendance as an explanatory variable. This would
indicate that, of the two, attendance is the better choice of dependent variable and
winning percentage is the better explanatory variable in Major League Baseball.
A final statistical analysis, conducted by market research analyst David P. Kronheim,
takes into account another variable that affected New York Mets attendance in the past 3 years.
This analysis discusses the effect of the stadium, which is worth considering alongside
the numerical data I employed in my analysis. Kronheim raises the point that when
the Mets moved to their new stadium, Citi Field, in 2009, attendance was going to decrease
regardless of performance. The total number of seats in Shea Stadium was 57,365, whereas Citi
Field has only 41,800 seats. From 2005 to 2007, the Mets had gains in attendance of more than
470,000 per year, which made them one of the top 2 teams in the National League in attendance
increases. The Mets were very competitive during these last few years at Shea Stadium.
Kronheim notes that “If the Mets had sold every single ticket possible in 2009, including player
and ‘comp’ tickets, their attendance still would have fallen by 656,243” (Kronheim, 31).
However, the Mets still had quality attendance in 2009 at Citi Field, as 3,168,571 spectators
attended.
The large drop from 2009 to 2010 of 576,166 (an 18.4% decline) has to do with the
smaller number of seats, as well as other statistics. The Mets fell below a .500 winning percentage
again in 2009-2011 and had fewer all-stars. In fact, “the smallest attendance at any Mets home
game in 2008 was 45,321, which is more than 3,500 higher than Citi Field’s capacity” (Kronheim,
31). The Mets were playoff contenders in their last year at Shea Stadium, although they missed
out on the playoffs in the last game of the season. Attendance that year was 4,042,045. I decided
not to include the type of stadium in my own regression analysis, because the data dates back
to 1986, and the Mets have only been at Citi Field since 2009. The majority of the data
comes from Shea Stadium from 1986-2008, and the new stadium statistics would only appear to
be outliers. However, I wanted to include this market research analysis in my report because it
could partially explain the drastic drop in the most recent data I have compiled (2009-2011).
III. Data Sources and Descriptions
In compiling the Mets data set from 1986 to 2011, I used two main sources. For the
dependent variable of attendance, as well as the explanatory variables of winning percentage,
years since last playoff appearance, payroll, and average batter’s age, I used Baseball-
Reference.com. In order to discover the number of all-stars per year for the New York Mets, I
used Mets.com. I also decided to adjust the payroll for each year from 1986 to 2010 for inflation
in order to have the most accurate comparison possible. The inflation calculator on bls.gov aided
me in this process.
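
The inflation adjustment described above amounts to scaling each nominal payroll by a ratio of price-index values. The sketch below is only an illustration of that arithmetic, not a reproduction of what the bls.gov calculator does internally; the function name, the payroll figure, and the index values of 100 and 200 are made-up placeholders rather than actual CPI data.

```python
def adjust_for_inflation(nominal: float, index_then: float, index_target: float) -> float:
    """Restate a nominal dollar amount in target-year dollars using a price-index ratio."""
    if index_then <= 0 or index_target <= 0:
        raise ValueError("price index values must be positive")
    return nominal * index_target / index_then


# Hypothetical inputs: a $15,000,000 payroll from a year whose index is 100,
# restated in the dollars of a year whose index is 200 (placeholder values only).
real_payroll = adjust_for_inflation(15_000_000, index_then=100.0, index_target=200.0)
print(f"{real_payroll:,.2f}")
```

With real index values substituted for the placeholders, the same ratio would be applied to each season’s payroll.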
As mentioned earlier, my data organizes the effect of years since last playoff appearance,
payroll (adjusted for inflation), number of all-stars, winning percentage, and average batter’s age
on stadium attendance for the New York Mets from 1986 to 2011 (Figure 1). For the first few
years of the data, namely 1986 to 1990, the Mets were very successful. After making the playoffs
in 1986 and 1988, and winning the World Series in 1986, total season stadium attendance
ranged from 2.7 million to 3 million. In these 5 years, the Mets had high winning percentages
ranging from .537 to .667, and a total of 19 all-stars, which is an extremely high number.
In direct contrast to the years of 1986-1990, the Mets performed very poorly from
1991 to 1998. The average batter’s age during these years was much younger than during
successful years (27 as opposed to 30 in 2000 when they made it to the World Series), winning
percentages were in the dismal range of .364 to .478, and payroll was much lower, highlighted
by the $35,015,247.14 team payroll in 1996. Attendance in these years was very low, as it rarely
broke 2 million. The Mets performed poorly again from 2001 to 2005, performed well from 2006
to 2008, and performed poorly again from 2009 to 2011. These eras indicate that the
dependent variable of attendance fluctuates along with most of the explanatory
variables.
IV. Scatter Plots, Multiple Regression, Variable Selection, and Analysis
I have identified my explanatory variables, or X-variables, in this study as years since last
playoff appearance (YSLPA), payroll adjusted for inflation (Payroll), winning percentage (Win
%), number of all-stars (All-stars), and average batter’s age (Avg Batt. Age). The dependent
variable, or Y-variable, is attendance (Attendance). The data set is made up entirely of numeric
explanatory variables. First, individual scatter plots of each explanatory variable were created.
The scatter plot of the X-variable YSLPA against the Y-variable attendance is shown below.
YSLPA v. Attendance (Figure 2)
[Scatter plot: x-axis Years Since Last Playoff Appearance (0 to 12); y-axis Attendance (0 to 4,500,000)]
Trend lines shown on the plot:
Logarithmic: f(x) = −712380.93056898 ln(x) + 3285693.67286353, R² = 0.499236624667553
Power: f(x) = 3339126.98358037 x^(−0.304548335135481), R² = 0.485690393216885
Cubic: f(x) = 12415.3785763 x³ − 159529.959663 x² + 290506.684811 x + 2967486.31414, R² = 0.63413029095782
The scatter plot of “YSLPA” against Attendance indicates that as the number of years
since the Mets have reached the playoffs increases, the attendance decreases. Originally, I tried a
linear trend line to fit the data. The linear line fit the data relatively well, aside from three data
points from years 8 through 10. The R-square for the linear line was only .4, however, and I
opted to try a quadratic or cubic equation to reflect the curvature of the data in years 8, 9, and 10.
With the cubic equation, the R-square improved dramatically from .4 to .634. Although it would
appear that a straight line with a negative slope should fit the data, there could be an explanation
for the curvature. In years 8, 9, and 10 of missed playoff berths, according to the data set, the Mets were
starting to come out of their decade-long slump to become possible playoff contenders. It is
possible that the fans, in anticipation of the better performance of the team and potential playoff
implications, started to attend more games.
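
To make the upturn in years 8 through 10 concrete, the cubic trend line reported for Figure 2 can be evaluated directly. The sketch below assumes the coefficients exactly as they appear on the chart; the function and variable names are my own and are only illustrative.

```python
# Cubic trend line coefficients as reported for Figure 2 (YSLPA v. Attendance).
A3, A2, A1, A0 = 12415.3785763, -159529.959663, 290506.684811, 2967486.31414


def cubic_trend(years_since_playoffs: float) -> float:
    """Evaluate the fitted cubic at a given number of years since the last playoff appearance."""
    x = years_since_playoffs
    return A3 * x**3 + A2 * x**2 + A1 * x + A0


# Fitted attendance for 0 through 10 years since the last playoff appearance.
fitted = {x: cubic_trend(x) for x in range(0, 11)}
for x, y in fitted.items():
    print(f"YSLPA={x:2d}  fitted attendance={y:,.0f}")
```

With these coefficients the fitted values bottom out between years 7 and 8 and then climb through year 10, which is the curvature a straight line cannot capture.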
“Payroll”, my second X-variable, has a scatter plot that reveals some curvature as well.
By using the R-square as a measure of fit, I decided that a linear equation was not appropriate. A
linear equation had an R-square of .1, while the fourth-order (quartic) equation had an R-square of .48.
Payroll v. Attendance (Figure 3)
[Scatter plot: x-axis Payroll (0 to 200,000,000); y-axis Attendance (0 to 4,500,000)]
Quartic trend line: f(x) = 1.8609402E-25 x⁴ − 7.187176E-17 x³ + 9.79745286E-09 x² − 0.538350076 x + 11955827.293, R² = 0.481238298393091
In choosing a fourth-order equation, I took into account the R-squares of the
quadratic and cubic equations, and decided that the gain in fit outweighed the preference for
parsimony. Each increase in order improved the R-square by .08 to .10. Therefore,
the increases in goodness of fit were large enough to justify the higher-order equation.
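
One way to weigh fit against parsimony, which the analysis above does not use, is the adjusted R-square, which penalizes each additional polynomial term. The sketch below applies the standard formula to the linear and quartic R-squares quoted for Payroll, assuming 26 observations (one per season from 1986 to 2011); the function name is hypothetical.

```python
def adjusted_r_squared(r_squared: float, n_obs: int, n_terms: int) -> float:
    """Penalize R-square for the number of fitted terms (excluding the intercept)."""
    if n_obs - n_terms - 1 <= 0:
        raise ValueError("need more observations than estimated terms")
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_terms - 1)


N_SEASONS = 26  # assumption: one observation per season, 1986 through 2011

linear_adj = adjusted_r_squared(0.10, N_SEASONS, n_terms=1)
quartic_adj = adjusted_r_squared(0.481238298393091, N_SEASONS, n_terms=4)
print(f"linear:  {linear_adj:.3f}")
print(f"quartic: {quartic_adj:.3f}")
```

Under these assumptions the quartic still comes out ahead after the penalty, which is consistent with the choice made above, although the margin is smaller than the raw R-squares suggest.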
For my third X-variable, “All-stars”, I analyzed the coefficient of determination, the R-
square, once again. Although a linear equation had a decent fit at .397, I still opted for a cubic
polynomial equation. The R-square for the fit of this equation was .419.
All-stars v. Attendance (Figure 4)
[Scatter plot: x-axis All-stars (0.5 to 5.5); y-axis Attendance (0 to 4,500,000)]
Cubic trend line: f(x) = −41835.174790099 x³ + 270645.06697743 x² − 49071.849825538 x + 1795789.1082761, R² = 0.419157873670009
As teams acquire more talented players, attendance tends to increase. However, there is still
some curvature that prevents the data from being linear. A possible explanation for this curvature
could be that as a team goes from 1 to 2 to 3 all-stars, its attendance rises
dramatically, but the excitement fans have for a fourth, fifth, or sixth all-star increases at a slower rate.
While the attendance figures are still higher, this may suggest that a team only performs marginally
better with more than 3 or 4 all-stars.
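
The diminishing-returns reading can be checked against the cubic trend line reported for Figure 4 by looking at how much the fitted attendance changes with each additional all-star. This is a small sketch using the chart’s coefficients; the helper names are illustrative only.

```python
# Cubic trend line coefficients as reported for Figure 4 (All-stars v. Attendance),
# ordered from the x^3 term down to the constant.
COEFFS = (-41835.174790099, 270645.06697743, -49071.849825538, 1795789.1082761)


def fitted_attendance(all_stars: int) -> float:
    """Evaluate the fitted cubic with Horner's method for a given number of all-stars."""
    result = 0.0
    for c in COEFFS:
        result = result * all_stars + c
    return result


# Change in fitted attendance when moving from (k - 1) to k all-stars.
marginal_gain = {k: fitted_attendance(k) - fitted_attendance(k - 1) for k in range(2, 6)}
for k, gain in marginal_gain.items():
    print(f"all-star #{k}: {gain:+,.0f}")
```

With these coefficients the increment shrinks after the third all-star and the fitted curve dips at five. A cubic fitted to so few points should be read cautiously near the edge of the data, so this supports the general pattern more than any exact figure.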
The fourth X-variable, “Win%”, has a somewhat scattered plot. The goodness
of fit, regardless of the type of equation, is relatively low. The highest R-square I was
able to attain was .284 with a cubic equation. The curvature indicates that attendance is low
for winning percentages from .35 to .45. This may suggest that within this range fans do not
wish to attend games, whether the winning percentage is somewhat higher or lower, because
the team is not competitive. There are, however, dramatic increases in attendance for
winning percentages from .45 to .55. With these percentages, the Mets have a chance at the
playoffs.
Win% v. Attendance (Figure 5)
[Scatter plot: x-axis Win% (0.3 to 0.7); y-axis Attendance (0 to 4,500,000)]
Cubic trend line: f(x) = −243251228.120707 x³ + 370083180.27432 x² − 179730140.437134 x + 30221937.0151547, R² = 0.283880070614012
There are still modest increases in attendance from .55 to .6, before it levels off. The next pattern
in the curvature reflects that attendance actually decreases at the winning percentage of .667, but
this could reflect an outlier, because the majority of the data has already leveled off from .6
to .65.
My final X-variable is the Mets’ average batter’s age, “Avg. Batt. Age”, for each season
from 1986 to 2011.
Avg. Batt. Age v. Attendance (Figure 7)
[Scatter plot: x-axis Avg. Batt. Age (27 to 31); y-axis Attendance (0 to 5,000,000)]
Quartic trend line: f(x) = −93160.639 x⁴ + 11018274.11 x³ − 488154740.6 x² + 9602092047 x − 70754078648, R² = 0.333701062726698
For this fifth variable, a polynomial equation was appropriate once again. A quartic
equation had the highest R-square, with a value of .334. A linear equation would
not have been as appropriate, because attendance appears to rise from average batter’s
ages of 27.5 to 28, before leveling off from 28.5 to 29.5. Attendance then rises dramatically from
the age of 30 and up. A possible explanation for this curvature could be that an older lineup may
have more experience, perform better, and thus affect attendance. This variable,
however, proved not to be a significant predictor of attendance, as shown by the multiple regression
performed in the subsequent portion of this study.
Additional statistical analysis, aside from scatter plots and goodness of fit, is required in
order to discover significant predictors of attendance. A multiple regression including each
explanatory variable against the dependent variable of attendance indicated that a form of
variable selection was necessary (Figure 8).
First, a global F-test was run on the full model to decide whether any of the variables were
significant predictors. The hypotheses for the global F-test are as follows:
H0: B1 = B2 = B3 = B4 = B5 = 0
Ha: At least one of the Betas ≠ 0
For the alpha level of the hypothesis test, I decided to use .15. My
reasoning is that studies in the social sciences involve human beings, so there is usually
more variation. Therefore, I do not want to reject any variables that could be found significant.
The global F-test showed a very low P-value of .00002 (Figure 8). Because this P-value is below
the alpha level, I rejected the null hypothesis and concluded that at least one of the explanatory
variables is significant.
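
The global F-test can be sketched from the regression’s R-square, the number of observations, and the number of predictors. Because the regression output (Figure 8) is not reproduced in the text, the R-square of 0.75 below is a hypothetical placeholder; only the alpha level of .15, the reported P-value of .00002, the 26 seasons, and the 5 predictors come from the study.

```python
def global_f_statistic(r_squared: float, n_obs: int, n_predictors: int) -> float:
    """F = (R^2 / k) / ((1 - R^2) / (n - k - 1)) for a regression with an intercept."""
    k = n_predictors
    dof_error = n_obs - k - 1
    if not 0.0 <= r_squared < 1.0 or dof_error <= 0:
        raise ValueError("need 0 <= R^2 < 1 and n > k + 1")
    return (r_squared / k) / ((1.0 - r_squared) / dof_error)


ALPHA = 0.15              # alpha level chosen in the text
P_VALUE_GLOBAL = 0.00002  # P-value reported for the global F-test (Figure 8)

# Hypothetical R-square of 0.75; 26 seasons and 5 explanatory variables as in the study.
f_stat = global_f_statistic(0.75, n_obs=26, n_predictors=5)
reject_null = P_VALUE_GLOBAL < ALPHA
print(f"F = {f_stat:.2f}, reject H0: {reject_null}")
```

The decision rule is the comparison on the last lines: the null hypothesis is rejected whenever the reported P-value falls below the chosen alpha.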
After conducting the global F-test and deciding at least one variable was significant, I
chose to use backward selection to narrow down my set of explanatory variables. Right away,
according to the first regression (Figure 8), the X-variable of “Average Batter’s Age” had an
extremely high P-value of .6058. Because this P-value is much higher than
the alpha level, the variable was eliminated. Figure 9 is shown below to reflect the new multiple
regression without the eliminated variable.
Upon eliminating the variable of “Average Batter’s Age”, the P-values for every other variable
were well below the alpha level of .15, and the variables were deemed significant predictors of attendance.
The overall R-square for the second multiple regression with only 4 variables fell by just .004,
and the standard error increased only minimally as well. These changes are too small
to indicate that the removal of “Average Batter’s Age” causes a weaker regression
equation. Also, the P-value of the global F-test fell even lower after the removal of
“Average Batter’s Age”, supporting the decision to remove it.
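
The backward selection procedure described above can be sketched as a loop that drops the least significant variable and refits until every remaining P-value is at or below alpha. The sketch uses a stand-in for the regression fit: only the .6058 P-value for average batter’s age is taken from the text, and every other P-value is a made-up placeholder chosen to sit below .15. In a real run the fit function would re-estimate the model each time, so the P-values would change after each removal.

```python
from typing import Callable, Dict, List


def backward_select(variables: List[str],
                    fit_p_values: Callable[[List[str]], Dict[str, float]],
                    alpha: float) -> List[str]:
    """Drop the least significant variable, refit, and repeat until all P-values <= alpha."""
    kept = list(variables)
    while kept:
        p_values = fit_p_values(kept)
        worst = max(kept, key=lambda name: p_values[name])
        if p_values[worst] <= alpha:
            break
        kept.remove(worst)
    return kept


def stub_fit(kept: List[str]) -> Dict[str, float]:
    """Stand-in for a regression fit. Only .6058 comes from the text; the rest are placeholders."""
    placeholder = {"YSLPA": 0.05, "Payroll": 0.01, "All-stars": 0.04,
                   "Win%": 0.10, "Avg Batt. Age": 0.6058}
    return {name: placeholder[name] for name in kept}


selected = backward_select(["YSLPA", "Payroll", "All-stars", "Win%", "Avg Batt. Age"],
                           stub_fit, alpha=0.15)
print(selected)
```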
V. Forming the Final Regression Equation and Testing Assumptions
After running two multiple regressions and employing backward selection once, I was
able to form the following regression equation:
Ŷ = B0 + B1(X1) + B2(X2) + B3(X3) + B4(X4)
Ŷ = 393803.6703 − 80945.75352(X1) + 0.008003511(X2) + 176153.9487(X3) + 2567089.4(X4)
1. Ŷ = Mets Attendance
2. X1 = Years Since Last Playoff Appearance
3. X2 = Payroll (adjusted for inflation)
4. X3 = Number of All-Stars
5. X4 = Winning Percentage
Before recommending a regression equation, the disturbances must be checked to make
sure that they do not violate any of the assumptions. The four assumptions are that the
disturbances have an expected value of zero, the disturbances have constant variance, the
disturbances are normally distributed, and the disturbances are independent. In order to check
these assumptions, a plot of the residuals versus the predicted-y values is made.
Residuals v. Predicted Attendance (Figure 10)
[Residual plot: x-axis Predicted Attendance (500,000 to 4,000,000); y-axis Residuals (−1,000,000 to 1,000,000)]
In the residual plot, the residuals appear to be randomly scattered around zero.
There are no concave or fanning patterns, and almost all of the residuals fall within
two standard deviations of zero, consistent with the empirical rule. This is a desirable residual
plot that does not appear to violate any of the assumptions of the disturbances.
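
The two numerical checks implied above, a mean residual near zero and most residuals within two standard deviations, can be sketched as follows. The residuals in the example are hypothetical values chosen for illustration, not the residuals from this regression, and the standard deviation is taken from the residuals themselves rather than from the regression’s standard error.

```python
from statistics import mean, pstdev


def residual_summary(residuals):
    """Return the mean residual and the share of residuals within two standard deviations of zero."""
    spread = pstdev(residuals)
    within_two_sd = sum(abs(r) <= 2 * spread for r in residuals) / len(residuals)
    return mean(residuals), within_two_sd


# Hypothetical residuals (in fans), NOT the residuals from this study.
example_residuals = [-420_000, -310_000, -150_000, -90_000, -20_000,
                     40_000, 110_000, 180_000, 260_000, 400_000]
avg, share = residual_summary(example_residuals)
print(f"mean residual = {avg:,.0f}, share within 2 SD = {share:.0%}")
```

A numerical summary like this complements the plot but does not replace it, since fanning or curved patterns only show up when the residuals are viewed against the predicted values.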
Another check that the residuals follow a normal distribution is a
normal probability plot. In this graph, the residuals are plotted against Z-scores, the
values expected under a standard normal distribution. The main criterion in judging
whether a normal probability plot is desirable is the straightness of the points.
Normal Probability Plot (Figure 11)
[Normal probability plot: x-axis Z-Values (−2 to 2); y-axis Residuals (−800,000 to 1,000,000)]
Upon review, the normal probability plot seems to be almost completely straight, and reflects a
desirable plot. The residuals are normally distributed according to the residual plot and the
normal probability plot, and no assumptions appear to be violated.
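
The coordinates of a normal probability plot can be sketched by sorting the residuals and pairing each with the Z-score of its plotting position. The sketch assumes the common (i − 0.5)/n plotting position, which may differ from the convention behind Figure 11, and it uses hypothetical residuals rather than the ones from this study. The correlation between the two columns gives a rough numerical measure of straightness.

```python
from statistics import NormalDist


def normal_plot_points(residuals):
    """Pair each sorted residual with the z-score of its plotting position (i - 0.5) / n."""
    n = len(residuals)
    ordered = sorted(residuals)
    z_scores = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(z_scores, ordered))


def pearson_r(pairs):
    """Correlation between z-scores and residuals; values near 1 indicate a straight plot."""
    xs, ys = zip(*pairs)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical residuals, NOT the residuals from this study.
example = [-420_000, -310_000, -150_000, -90_000, -20_000,
           40_000, 110_000, 180_000, 260_000, 400_000]
points = normal_plot_points(example)
print(f"straightness (r) = {pearson_r(points):.3f}")
```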
VI. Prediction from the Regression Equation
After assessing that my equation was reasonable, and that none of the assumptions of the
disturbances were violated, I made a prediction of the Mets stadium attendance for
the 2012 season. While deciding the values for my explanatory variables, I took many
considerations into account. First, the Mets have payroll constraints for the 2012 season that
have been set by GM Sandy Alderson. He is restricting the Mets payroll to between $100 million and
$110 million. I decided to take the average and make “Payroll” $105 million. For my “All-star” variable,
I realized that the Mets will probably not be able to re-sign 2-time all-star Jose Reyes. I decided
to keep my total number of all-stars for the Mets next year at 3. As for the number of years since
their last playoff appearance, “YSLPA”, that number will increase from 5 to 6 in 2012. Finally, I
decided to increase the Mets winning percentage, “Win%”, to .531, under the assumption that
they will not have the same number of injuries as in 2011, and that they will acquire a better set of
relief pitchers. My prediction appears below:
Ŷ (Predicted Mets Attendance) = B0 + B1(YSLPA) + B2(Payroll) + B3(All-stars) + B4(Win%)
= 393803.6703 − 80945.75352(6) + 0.008003511(105,000,000) + 176153.9487(3) + 2567089.4(.531)
= 2,640,084
This prediction is reasonable, as it reflects an increase in attendance of 287,488 for the
2012 season. If the Mets have more wins and one more all-star than in 2011 (indicating popular,
high-caliber players for fans to see), it is reasonable that they could have close to 300,000 more
fans attend games next season. However, the Mets payroll and number of years since the playoffs
will worsen. This equation suggests that popular players and wins are more important to fans than
the previous year’s success or the total team payroll.
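
The 2012 prediction can be reproduced term by term from the regression coefficients and the chosen input values. The sketch below uses only numbers stated in this section; the dictionary layout and function name are illustrative.

```python
INTERCEPT = 393803.6703
COEFFICIENTS = {
    "YSLPA": -80945.75352,    # years since last playoff appearance
    "Payroll": 0.008003511,   # inflation-adjusted dollars
    "All-stars": 176153.9487,
    "Win%": 2567089.4,
}


def predict_attendance(inputs: dict) -> float:
    """Intercept plus the sum of coefficient * value for each explanatory variable."""
    return INTERCEPT + sum(COEFFICIENTS[name] * inputs[name] for name in COEFFICIENTS)


inputs_2012 = {"YSLPA": 6, "Payroll": 105_000_000, "All-stars": 3, "Win%": 0.531}
contributions = {name: COEFFICIENTS[name] * inputs_2012[name] for name in COEFFICIENTS}
prediction = predict_attendance(inputs_2012)

for name, value in contributions.items():
    print(f"{name:10s} {value:>14,.0f}")
print(f"{'Predicted':10s} {prediction:>14,.0f}")
```

Rounded to the nearest fan, this gives 2,640,084. The size of each term depends on the units of its variable, so the breakdown shows where the predicted total comes from and should not be read as a ranking of importance.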
VII. Suggestions for Future Research
After completing my empirical project of Mets attendance, there are a multitude of
additional factors I would like to study if I had the time or money. One issue that I had with my
regression is that I wanted to invert my years since last playoff appearance (YSLPA) variable,
but the zero values became undefined. I tried multiple remedies for this problem, but they all
involved manipulations of the data set. Inversion would have dramatically improved my R-
square, and would have fit the data much more appropriately. Inverting an X-variable is typically
appropriate when a scatter plot reveals data that has a downward sloping curve and diminishes
over time. My scatter plot of “YSLPA” v. Attendance reflects this description, and my standard
error, R-square, and adjusted R-square all would have improved. I would be interested to see
how this issue could be resolved.
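
One commonly used workaround, which is itself a manipulation of the variable and was not part of this study, is to shift the variable before inverting it so that a zero no longer produces a division by zero. The sketch below uses a shift of 1, which is an arbitrary assumption; whether it would actually improve the fit here is untested.

```python
def shifted_reciprocal(years_since_playoffs: int, shift: float = 1.0) -> float:
    """Return 1 / (x + shift), which stays defined when x is zero."""
    if years_since_playoffs + shift <= 0:
        raise ValueError("x + shift must be positive")
    return 1.0 / (years_since_playoffs + shift)


# Transformed values for 0 through 10 years since the last playoff appearance.
transformed = [shifted_reciprocal(x) for x in range(0, 11)]
print([round(t, 3) for t in transformed])
```

The transformed variable equals 1 in a playoff year and declines toward zero as the drought lengthens, which matches the downward-sloping, flattening shape described above.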
Another suggestion for future research would be to employ quadratic, cubic, and quartic
transformations of all of my X-variables in the multiple regression. Although this would be far
too time consuming for the purposes of this project, the goodness of fit would likely improve,
and the predicted equation might be more accurate as well.
A final aspect that I would like to study is the effect of specific teams, players, and
promotions on stadium attendance. If I were able to gather these categorical X-variables, I would
be able to get a sense of how fans react when players such as Jose Reyes decide to test the free
agent market and sign elsewhere. Also, from a marketing perspective, it would be prudent to
understand which opponents draw fewer spectators to Mets home games. This way, market
researchers for the Mets could schedule promotions for games that are played against the less
popular opponents.
VIII. Reference List
Data
Baseball-Reference Web Site. (2011). Retrieved November 5, 2011, from http://www.baseball-reference.com/teams/NYM/attend.shtml.
New York Mets Web Site. (2011). Retrieved November 5, 2011, from http://www.newyork.mets.mlb.com.
Statistical Journals
Are Baseball Players Paid Their Marginal Products? Don N. MacDonald and Morgan O. Reynolds. Managerial and Decision Economics, Vol. 15, No. 5, Special Issue: The Economics of Sports Enterprises (Sep. - Oct., 1994), pp. 443-457.
The Interaction Between Baseball Attendance and Winning Percentage: A VAR Analysis. Michael C Davis. University of Missouri-Rolla, Department of Economics. http://umresearchboard.org/resources/davis/Baseball_Attendance_Winning.pdf
Major League Baseball 2010 Attendance Analysis: METS SUFFER A SECOND STRAIGHT HUGE LOSS. David P. Kronheim. Retrieved November 19, 2011, from http://www.numbertamer.com/files/2010_MLB_Attendance_Analysis.pdf
Other Information
Mets’ Payroll Will Hinder Teams Success. Adam Rubin. ESPN Web Site. Retrieved November 19, 2011, from http://espn.go.com/new-york/mlb/story/_/id/7123377/new-york-mets-payroll-100-million-amazin-struggle-compete-2012.
APPENDIX
Figure 1
New York Mets Attendance Data Set