Empirical Project:
Mets Attendance (1986-2011)
Kevin Mulcahey
ECO-231
Dr. Letcher
The College of New Jersey
I. Statement of the Problem
This empirical study seeks to discover the explanatory variables that best reflect the
stadium attendance of the New York Mets from 1986 to 2011. The explanatory variables that I
have decided to study are the years since last playoff appearance, payroll (adjusted for inflation),
number of all-stars, winning percentage, and average batter’s age. In order to discover the
relationship between these explanatory variables and the dependent variable of attendance,
multiple regression, residual plots, normal probability plots, and various other statistical analyses
will be employed.
II. Review of Literature Related to the Variables
Before beginning the study, statistical journals and analyses regarding Major League
Baseball attendance, payroll, winning percentages, and other variables will be consulted. The
first journal, written by Don N. MacDonald and Morgan O. Reynolds of Texas A&M University,
analyzes the relationship between players’ pay and their marginal product by using many of the
same explanatory variables from my empirical study. Marginal revenue product is defined as the
additional revenue a firm earns by hiring one more unit of labor.
Research from the 1986 and 1987 Major League Baseball seasons indicates that players
are paid in line with the revenue they generate for their organizations. The reason for
payroll correlating directly to revenue for MLB organizations has a lot to do with the institutions
of free agency and final offer arbitration. Free agency allows players to test the market and seek
the best offer for their abilities among all major league teams. Arbitration allows for pay
increases during the season, based upon performance. These two contractual outlets for players
allow their marginal revenue product to correlate more directly with ticket sales, attendance, and
team revenue. These findings relate to my data very closely, as my dependent variable is
attendance, and my explanatory variable with the lowest p-value is payroll (adjusted for
inflation). My data also begins in 1986, just like this study performed by MacDonald and
Reynolds. The allowance for arbitration officially began in 1970, when a second MLB collective
bargaining agreement allowed for an impartial arbitrator in settling player contract
disagreements, rather than the commissioner of baseball. In the season of 1985, arbitrators
discovered that owners across Major League Baseball were colluding to keep baseball player
salaries artificially low, thus reducing competitive bidding. For their collusion, owners
paid $280 million in damages, and baseball team payrolls have risen steadily in the seasons
since.
The journal article by MacDonald and Reynolds also relates to another one of my
variables. Winning percentage, they say, is a less significant predictor of attendance than
statistics that forecast a team’s success. For example, once a team’s winning percentage has
risen from .500 to .550, a further increase from .550 to .600 will not make a noticeable
impact on stadium attendance. This is because fans view entertainment on an “ex ante” basis
rather than an “ex post” basis. In other words, the forecast of a team’s success is more
conducive to sales and attendance than past performance. People are more likely to buy
tickets when they expect a team to perform well than once it is already
doing well. In relating this idea to my variables, number of All-Stars would be a more significant
predictor of attendance than winning percentage. Payroll and All-Stars are similarly related in
that the more popular and high-quality a roster is, the more attendance will increase in a given
season.
Another statistical journal, written by Michael C. Davis of the University of Missouri-Rolla,
analyzes the interaction between baseball attendance and winning percentage. According to
Davis, the interaction between attendance and winning may not be completely obvious.
It is expected that when a team performs well, a “bandwagon effect” will come to
fruition and a team should “therefore expect to see an increase in attendance during and
following seasons in which the team played well on the field.” This journal article implies that
although winning percentage affects attendance directly because winning has become
increasingly important to fans in recent years, attendance also could affect winning percentage.
When a team is successful and generates superior attendance and revenue, winning percentage
should rise. Successful organizations have more room in their budget to attain high quality
players. In my regression, I chose to place payroll and winning percentage as the explanatory
variables, and attendance as the dependent variable.
Interestingly enough, according to the article, only about half of the National League
teams had “up-ticks” in attendance. This would indicate that the significance of winning
percentage as a predictor of attendance varies by team. In the American League, some
teams such as the Yankees actually showed a negative shock response to winning, possibly
because fans have become almost indifferent to the team’s consistent winning.
However, the study concluded that in the long term, all ten of the sampled
teams (Cubs, Reds, Yankees, White Sox, Phillies, Pirates, Indians, Tigers, Cardinals, Red Sox)
showed positive attendance growth in response to winning percentage. The data also showed
that for only one team, the Indians, did attendance have a positive effect on winning
percentage. This would indicate that attendance is the better choice of dependent variable
between the two, and that winning percentage is better suited as an explanatory variable in
Major League Baseball.
A final statistical analysis, conducted by market research analyst David P. Kronheim,
takes into account another variable that has affected New York Mets attendance in the past 3
years. This report discusses the effect of the stadium, which matters when interpreting the
numerical data I employed in my analysis. Kronheim raises the point that when
the Mets moved to their new stadium, Citi Field, in 2009, attendance was going to decrease
regardless of performance. Shea Stadium had 57,365 seats, whereas Citi
Field has only 41,800. From 2005 to 2007, the Mets had gains in attendance of more than
470,000 per year, which placed them within the top 2 teams in the National League in
attendance increases. The Mets were very competitive during these last few years at Shea
Stadium. Kronheim notes that, “If the Mets had sold every single ticket possible in 2009,
including player and ‘comp’ tickets, their attendance still would have fallen by 656,243.”
However, the Mets still had quality attendance in 2009 at Citi Field, as 3,168,571 spectators
attended.
The large drop-off of 576,166 from 2009 to 2010 (an 18.4% decline) has to do with the
smaller number of seats, as well as other statistics. The Mets fell below a .500 winning
percentage again in 2009-2011 and had fewer All-Stars. In fact, “the smallest attendance at any
Mets home game in 2008 was 45,321, which is more than 3,500 higher than Citi Field’s capacity.”
The Mets were playoff contenders in their last year at Shea Stadium, although they missed the
playoffs on the last game of the season. Attendance that year was 4,042,045. I decided not to
include the type of stadium in my own regression analysis, because the data dates back to
1986, and the Mets have only been at Citi Field since 2009. The majority of the data comes
from Shea Stadium from 1986-2008, and the new stadium statistics would only appear to be
outliers. However, I wanted to include this market research report in my paper because it could
partially explain the drastic drop in the most recent data I have compiled (2009-2011).
III. Data Sources and Descriptions
In compiling the information for my Mets data set from 1986 to 2011, I used two main
sources. For the dependent variable of attendance, as well as the explanatory variables of
winning percentage, years since last playoff appearance, payroll, and average batter’s age, I used
Baseball-Reference.com. In order to discover the number of all-stars per year for the New York
Mets, I used Mets.com. I also decided to adjust the payroll for each year from 1986 to 2010 for
inflation in order to have the most accurate comparison possible. The inflation calculator on
bls.gov aided me in this process.
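The inflation adjustment described above amounts to rescaling each season's nominal payroll by a ratio of price indexes. The sketch below is only an illustration of that arithmetic: it assumes 2011 dollars as the base year (which the text implies but does not state), and the CPI values and payroll figure are hypothetical placeholders, not numbers taken from bls.gov or from the data set.

```python
def adjust_for_inflation(nominal_dollars, cpi_in_year, cpi_in_base_year):
    """Convert a nominal dollar amount into base-year dollars."""
    return nominal_dollars * cpi_in_base_year / cpi_in_year


# Hypothetical index values and payroll, for illustration only.
cpi = {1996: 150.0, 2011: 225.0}
nominal_payroll_1996 = 24_000_000

real_payroll_1996 = adjust_for_inflation(
    nominal_payroll_1996, cpi[1996], cpi[2011]
)
print(f"1996 payroll in 2011 dollars: ${real_payroll_1996:,.2f}")
```

Applying the same function to every season from 1986 to 2010 would put all payroll figures on a common 2011-dollar scale, which is what makes year-to-year comparison meaningful.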
As mentioned earlier, my data organizes the effect of years since last playoff appearance,
payroll (adjusted for inflation), number of all-stars, winning percentage, and average batter’s age
on stadium attendance for the New York Mets from 1986 to 2011 (Figure 1). For the first few
years of the data, namely 1986 to 1990, the Mets were very successful. They made the playoffs
in 1986 and 1988 and won the World Series in 1986, and total season stadium attendance
ranged from 2.7 million to 3 million. In these 5 years, the Mets had high winning percentages
ranging from .537 to .667, and a total of 19 All-Stars, which is an extremely high number.
In direct contrast to 1986-1990, the Mets performed horribly from
1991 to 1998. The average batter’s age during these years was much younger than during
successful years (27 as opposed to 30 in 2000, when they made it to the World Series), winning
percentages were in the dismal range of .364 to .478, and payroll was much lower, highlighted
by the $35,015,247.14 team payroll in 1996. Attendance in these years was very low, rarely
breaking 2 million. The Mets performed poorly again from 2001 to 2005, performed well from
2006 to 2008, and performed poorly from 2009 to 2011. These Mets eras indicate that the
dependent variable of attendance fluctuates with most of the explanatory variables.
IV. Regression and Analysis
I have identified my explanatory variables, or X-variables, in this study as years since last
playoff appearance (YSLPA), payroll adjusted for inflation (Payroll), winning percentage (Win
%), number of all-stars (All-stars), and average batter’s age (Avg Batt. Age). My dependent
variable, or Y-variable, is attendance (Attendance). My data set is made up entirely of numeric
explanatory variables. First, individual scatter plots of each explanatory variable were created.
The scatter plot of the X-variable YSLPA against the Y-variable attendance is shown below.
[Figure 2: YSLPA v. Attendance. X-axis: Years Since Last Playoff Appearance; Y-axis: Attendance.
Trend lines shown on the plot:
Logarithmic: f(x) = −712380.93056898 ln(x) + 3285693.67286353, R² = 0.499236624667553
Power: f(x) = 3339126.98358037 x^−0.304548335135481, R² = 0.485690393216885
Cubic: f(x) = 12415.3785763 x³ − 159529.959663 x² + 290506.684811 x + 2967486.31414, R² = 0.63413029095782]
The scatter plot of YSLPA against Attendance indicates that as the number of years since
the Mets last reached the playoffs increases, attendance decreases. Originally, I tried a linear
trend line to fit the data. The line fit the data relatively well, aside from three data points
from years 8 through 10. The R-square for the linear fit was only .4, however, so I opted to
try a quadratic or cubic equation to reflect the curvature of the data in years 8, 9, and 10.
With the cubic equation, the R-square improved dramatically from .4 to .634. Although it would
appear that a straight line with a negative slope should fit the data, there could be an
explanation for the
curvature. In years 8, 9, and 10 of missed playoff berths, according to the data set, the Mets were
starting to come out of their decade-long slump and becoming possible playoff contenders. It is
possible that the fans, in anticipation of the improved play of the team, started to attend more
games.
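The comparison of linear and cubic trend lines by R-square can be reproduced with an ordinary least-squares polynomial fit. The following is a minimal sketch using only the Python standard library; the function names are my own, and the data points are hypothetical values shaped like the pattern described (a decline followed by an uptick in years 8 through 10), not the actual Mets figures.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial coefficients, lowest power first."""
    m = degree + 1
    # Normal equations (X^T X) c = X^T y for the polynomial design matrix X.
    a = [[sum(x ** (i + j) for x in xs) for j in range(m)] for i in range(m)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, m):
            factor = a[row][col] / a[col][col]
            for k in range(col, m):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * m
    for i in reversed(range(m)):
        tail = sum(a[i][j] * coeffs[j] for j in range(i + 1, m))
        coeffs[i] = (b[i] - tail) / a[i][i]
    return coeffs


def r_squared(xs, ys, coeffs):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(ys) / len(ys)
    fitted = [sum(c * x ** p for p, c in enumerate(coeffs)) for x in xs]
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical data for illustration only (attendance in millions).
years_since_playoffs = list(range(11))
attendance_millions = [3.0, 3.1, 2.8, 2.4, 2.1, 1.9, 1.8, 1.7, 1.9, 2.2, 2.6]

r2_linear = r_squared(years_since_playoffs, attendance_millions,
                      fit_polynomial(years_since_playoffs, attendance_millions, 1))
r2_cubic = r_squared(years_since_playoffs, attendance_millions,
                     fit_polynomial(years_since_playoffs, attendance_millions, 3))
print(f"linear R-square: {r2_linear:.3f}, cubic R-square: {r2_cubic:.3f}")
```

Because a cubic contains the straight line as a special case, its R-square can never be lower than the linear one on the same data; the question is only whether the improvement is large enough to justify the extra terms.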
Payroll, my second X-variable, against the Y-variable, has a scatter plot that also reveals
some curvature. By using the R-square as a measure of fit again, I decided that a linear fit was
not appropriate. A linear equation had an R-square of .1, while the fourth-order (quartic)
equation had an R-square of .48. In choosing a fourth-order equation, I took into account the
R-squares of the quadratic and cubic equations and decided that parsimony did not apply. With
each increase in order, the R-square improved by .08 to .10. These increases in fit were too
large to justify stopping at a quadratic or cubic equation.
[Figure 3: Payroll v. Attendance. X-axis: Payroll; Y-axis: Attendance.
Quartic trend line: f(x) = 1.8609402E-25 x⁴ − 7.187176E-17 x³ + 9.79745286E-09 x² − 0.538350076 x + 11955827.293, R² = 0.481238298393091]
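One common cross-check on a parsimony decision like this, which is not part of the original analysis, is the adjusted R-square: it penalizes each additional polynomial term, so a higher-order fit is only rewarded when its gain outweighs the lost degrees of freedom. The sketch below applies the standard formula to the two R-squares reported for payroll (.1 linear, .48 quartic), assuming n = 26 seasons (1986 through 2011).

```python
def adjusted_r_squared(r_squared, n_observations, n_terms):
    """Adjusted R-square for a model with n_terms predictors plus an intercept."""
    return 1 - (1 - r_squared) * (n_observations - 1) / (n_observations - n_terms - 1)


n_seasons = 26  # 1986 through 2011, inclusive

adj_linear = adjusted_r_squared(0.10, n_seasons, 1)   # one term: x
adj_quartic = adjusted_r_squared(0.48, n_seasons, 4)  # four terms: x, x^2, x^3, x^4

print(f"linear:  adjusted R-square = {adj_linear:.4f}")
print(f"quartic: adjusted R-square = {adj_quartic:.4f}")
```

Under these assumptions the quartic still comes out well ahead after the penalty (roughly .38 against .06), which is consistent with the decision to keep the higher-order terms.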
For my third X-variable, All-stars, I analyzed the coefficient of determination, the R-square,
once again. Although the linear fit was stronger than it was for payroll, with an R-square of
.397, I still opted for a cubic polynomial equation. The R-square for the fit of this equation
was .419.
[Figure 4: All-stars v. Attendance. X-axis: Number of All-stars; Y-axis: Attendance.
Cubic trend line: f(x) = −41835.174790099 x³ + 270645.06697743 x² − 49071.849825538 x + 1795789.1082761, R² = 0.419157873670009]
Clearly, as a team acquires more All-Stars, attendance increases. However,
there is still some curvature. A possible explanation for this curvature could be that as a team
goes from 1 to 2 to 3 All-Stars, its prospects for attendance rise dramatically, but the
excitement fans have for All-Stars 4, 5, and 6 increases at a slower rate. While attendance is
still higher, this may suggest that a team only performs marginally better with more than 3 or 4
All-Stars.
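The diminishing-returns reading can be checked directly against the cubic trend line reported in Figure 4. The sketch below evaluates that fitted curve at 1 through 5 All-Stars and prints the change in fitted attendance for each additional All-Star. The coefficients are the figure's values rounded to two decimals; the function and variable names are my own.

```python
def fitted_attendance(all_stars):
    """Cubic trend line from Figure 4 (coefficients rounded to two decimals)."""
    return (-41835.17 * all_stars ** 3
            + 270645.07 * all_stars ** 2
            - 49071.85 * all_stars
            + 1795789.11)


fitted = {n: fitted_attendance(n) for n in range(1, 6)}

# Change in fitted attendance when going from n - 1 to n All-Stars.
marginal_gain = {n: fitted[n] - fitted[n - 1] for n in range(2, 6)}

for n, gain in marginal_gain.items():
    print(f"{n - 1} -> {n} All-Stars: {gain:+,.0f} fitted attendance")
```

On this curve the gain from a fourth All-Star is smaller than the gain from a third, and the fitted value turns down at five, although with so few seasons at the high end the tail of a cubic should be read with caution.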
The fourth X-variable, Win%, has a somewhat scattered plot. The goodness of
fit, regardless of the type of equation, seems to be relatively low. The highest R-square I was
able to attain was .284 with a cubic equation. The curvature indicates that attendance is low
for winning percentages of .35 to .45, with little increase across that range. This may suggest
that even though .45 is a much higher winning percentage than .35, fans still do not wish to
attend games because the team is
not competitive in the championship season. There are, however, dramatic increases for
winning percentages of .45 to .55. With these percentages, the Mets have a chance at the
playoffs.
[Figure 5: Win% v. Attendance. X-axis: Winning Percentage; Y-axis: Attendance.
Cubic trend line: f(x) = −243251228.120716 x³ + 370083180.274333 x² − 179730140.437141 x + 30221937.0151557, R² = 0.283880070614012]
There are still modest increases in attendance from .55 to .6, before it levels off. The curve
then shows attendance actually decreasing at the winning percentage of .667, but
this could reflect an outlier, because the majority of the data has already leveled off from .6
to .65.
My final X-variable is the Mets’ average batter’s age, Avg. Batt. Age, for each season
from 1986 to 2011.
[Figure 7: Avg. Batt. Age v. Attendance. X-axis: Average Batter's Age; Y-axis: Attendance.
Quartic trend line: f(x) = −93160.63899996 x⁴ + 11018274.10692 x³ − 488154740.5889 x² + 9602092047.389 x − 70754078648.07, R² = 0.333701062726698]
For this fifth variable, a polynomial equation was appropriate once again. A quartic equation
had the highest R-square, with a value of .334. A linear equation would not have
been as appropriate, because attendance appears to rise from the average batter’s ages of
27.5 to 28, level off from 28.5 to 29.5, and rise dramatically from the age of 30 on. A
possible explanation for this curvature could be that an older lineup has more experience,
performs better, and thus raises attendance. This variable, however, proved to be
an insignificant predictor of attendance, as shown by the multiple regression performed in the
subsequent portion of this study.
Additional statistical analysis, aside from scatter plots and goodness of fit, is required in
order to discover significant predictors of attendance. A multiple regression including each
explanatory variable against the dependent variable of attendance indicated that a form of
variable selection was necessary (Figure 8). First, a global F-test was run on the full model
to decide whether any of the variables were significant predictors. The hypotheses for the
global F-test are as follows:
H0: β1 = β2 = β3 = β4 = β5 = 0
Ha: At least one of the betas ≠ 0
For this hypothesis test, I decided to use an alpha level of .15. My
reasoning is that studies in the social sciences involve human behavior, which has more
variation. Therefore, I do not want to exclude any variables that could be found significant. The
global F-test showed a very low P-value of .00002. Because the P-value is below the alpha level,
I rejected the null hypothesis and concluded that at least one of the explanatory variables was
significant.
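For reference, the global F-statistic behind this test can be written in terms of the multiple regression's R-square. The expression below is the standard textbook form, offered as a sketch under two assumptions: the model has an intercept with each of the five explanatory variables entering once (k = 5), and there are n = 26 seasons. The regression's own R-square is not restated here, so it is left symbolic.

```latex
F = \frac{R^2 / k}{(1 - R^2) / (n - k - 1)},
\qquad k = 5,\quad n = 26
\;\Rightarrow\; F \sim F_{5,\,20} \text{ under } H_0

\text{Reject } H_0 \text{ if } P\!\left(F_{5,\,20} \ge F_{\text{obs}}\right) < \alpha = 0.15
```

With the reported P-value of .00002, this decision rule leads to rejection at the .15 level and would at any conventional stricter level as well.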