eco 231 empirical project mets attendance

Empirical Project:

Mets Attendance (1986-2011)

Kevin Mulcahey

ECO-231

Dr. Letcher

The College of New Jersey

I. Statement of the Problem

This empirical study seeks to discover the explanatory variables that best reflect the

stadium attendance of the New York Mets from 1986 to 2011. The explanatory variables that I

have decided to study are the years since last playoff appearance, payroll (adjusted for inflation),

number of all-stars, winning percentage, and average batter’s age. In order to discover the

relationship between these explanatory variables and the dependent variable of attendance,

multiple regression, residual plots, normal probability plots, and various other statistical analyses

will be employed. A prediction for the Mets stadium attendance at Citi Field in 2012 will be

conducted using a regression equation as well.

II. Review of Literature Related to the Variables

Before beginning the study, statistical journals and analyses regarding Major League

Baseball attendance, payroll, winning percentages, and other variables will be consulted. The

first journal article, written by Don N. MacDonald and Morgan O. Reynolds of Texas A&M University,

analyzes the relationship between players and their marginal product by using many of the same

explanatory variables from my empirical study. Marginal revenue product is defined as the

additional revenue a firm earns by hiring one extra unit of labor.

Their research on the 1986 and 1987 Major League Baseball seasons suggests that

players are paid in line with what they earn for their respective teams in ticket sales. The reason for

payroll correlating directly to revenue for MLB organizations has a lot to do with the institutions

of free agency and final offer arbitration. Free agency allows players to test the market and seek

the best offer for their abilities among all major league teams. Arbitration allows for pay

increases during contracts, based upon performance. These two contractual outlets for players

allow their marginal revenue product to correlate more directly with ticket sales, attendance, and

team revenue. These findings relate to my data very closely, as my dependent variable is

attendance, and the explanatory variable with the lowest p-value is payroll (adjusted for

inflation). My data also begins in 1986, just like the study performed by MacDonald and

Reynolds. The allowance for arbitration officially began in 1970, when a second MLB collective

bargaining agreement allowed impartial arbitrators to settle player contract disputes, as opposed

to the commissioner of baseball. In the season of 1985, arbitrators discovered that owners across

Major League Baseball were colluding to keep baseball player salaries artificially low, thus

reducing competitive bidding. For their collusion, owners were fined $280 million in

damages, and baseball team payrolls have steadily increased each season thereafter.

The journal article by MacDonald and Reynolds also relates to another one of my

variables. Winning percentage, they say, is not as significant a predictor of

attendance as statistics that forecast a team’s success. For example, once a team’s

winning percentage has risen from .500 to .550, a further increase from .550 to .600 will not make a

noticeable difference in stadium attendance. This is because fans view entertainment

on an “ex ante” basis rather than an “ex post” basis (MacDonald, 445). In other words, the

forecast of a team’s success is more conducive to sales and attendance than its past

success. People are more likely to buy more tickets when they expect a team to perform well,

rather than once a team is already doing well. In relating this idea to my variables, the number of

all-stars and team payroll would be more significant predictors of attendance than winning

percentage. Payroll and all-stars are similarly related because when a roster is popular and high

quality, attendance is more likely to increase in a given season.

Another journal article, written by Michael C. Davis of the University of Missouri-Rolla,

analyzes the interaction between baseball attendance and winning percentage. According to

Davis, the interaction between baseball attendance and winning may not be completely obvious.

It is expected that as a team performs well, the organization’s “bandwagon effect” will come to

fruition and a team should “therefore expect to see an increase in attendance during and

following seasons in which the team played well on the field” (Davis, 4). This journal article

implies that although winning percentage affects attendance directly (winning has become

increasingly important to fans in recent years), attendance also could affect winning percentage.

When a team generates superior attendance and revenue, winning percentage should rise because

successful organizations have more room in their budget to attain high quality players. In my

regression, I chose to place payroll and winning percentage as the explanatory variables, and

attendance as the dependent variable.

The study concluded that in the long term, all ten of the sampled teams

(Cubs, Reds, Yankees, White Sox, Phillies, Pirates, Indians, Tigers, Cardinals, Red Sox)

exhibited positive attendance growth with winning percentage as an explanatory variable.

However, the data showed that for only one team, the Indians, did attendance have a

positive effect on winning percentage when attendance was the explanatory variable. This would

indicate that attendance is the better choice of dependent variable between the two, and that

winning percentage is the better explanatory variable in Major League Baseball.

A final statistical analysis, conducted by market research analyst David P. Kronheim,

takes into account another variable that affected New York Mets attendance in the past three years.

This analysis discusses the effect of the stadium itself, which matters when considered

alongside the numerical data I employed in my analysis. Kronheim raises the point that when

the Mets moved to their new stadium in 2009, Citi Field, attendance was going to decrease

regardless of performance. The total number of seats in Shea Stadium was 57,365, whereas Citi

Field has only 41,800 seats. From 2005 to 2007, the Mets had gains in attendance of more than

470,000 per year, which made them one of the top 2 teams in the National League in attendance

increases. The Mets were very competitive during these last few years at Shea Stadium.

Kronheim notes that “If the Mets had sold every single ticket possible in 2009, including player

and ‘comp’ tickets, their attendance still would have fallen by 656,243” (Kronheim, 31).

However, the Mets still had quality attendance in 2009 at Citi Field, as 3,168,571 spectators

attended.
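
As a rough check on Kronheim's capacity argument, the sell-out ceiling at Citi Field can be compared with the final-season attendance at Shea Stadium. This is only a sketch: the 81 home dates are an assumption (a full home schedule, not a figure stated in the sources), while the seat count and the 2008 attendance are the values quoted in this section.

```python
# Rough check of the capacity argument, using figures quoted in this section.
CITI_FIELD_SEATS = 41_800         # seating capacity cited above
SHEA_2008_ATTENDANCE = 4_042_045  # final-season attendance at Shea Stadium
HOME_DATES = 81                   # assumption: a full 81-date home schedule


def max_season_attendance(seats: int, home_dates: int) -> int:
    """Attendance if every seat were sold on every home date."""
    return seats * home_dates


citi_ceiling = max_season_attendance(CITI_FIELD_SEATS, HOME_DATES)
unavoidable_drop = SHEA_2008_ATTENDANCE - citi_ceiling

print(f"Sell-out ceiling at Citi Field: {citi_ceiling:,}")
print(f"Minimum decline from 2008:      {unavoidable_drop:,}")
```

Under these assumptions the minimum decline comes to roughly 656,000, which is in line with the figure Kronheim quotes.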

The huge drop-off of 576,166 from 2009 to 2010 (an 18.4% decline) has to do with the

smaller number of seats, as well as other statistics. The Mets fell below a .500 winning percentage

again in 2009-2011 and had fewer all-stars. In fact, “the smallest attendance at any Mets home

game in 2008 was 45,321, which is more than 3,500 higher than Citi Field’s capacity” (Kronheim,

31). The Mets were playoff contenders in their last year at Shea Stadium, although they missed

out on the playoffs on the last game of the season. Attendance that year was 4,042,045. I decided

not to include the type of stadium in my personal regression analysis, because the data dates back

to 1986, and the Mets have only been at Citi Field since 2009. The majority of the analysis

comes from Shea Stadium from 1986-2008, and the new stadium statistics would only appear to

be outliers. However, I wanted to include this market research journal in my report because it

could partially explain the drastic drop in the most recent data I have compiled (2009-2011).

III. Data Sources and Descriptions

In compiling the Mets data set from 1986 to 2011, I used two main sources. For the

dependent variable of attendance, as well as the explanatory variables of winning percentage,

years since last playoff appearance, payroll, and average batter’s age, I used Baseball-

Reference.com. In order to discover the number of all-stars per year for the New York Mets, I

used Mets.com. I also decided to adjust the payroll for each year from 1986 to 2010 for inflation

in order to have the most accurate comparison possible. The inflation calculator on bls.gov aided

me in this process.
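
The inflation adjustment can be sketched as a ratio of price indexes. This is a minimal illustration, assuming a CPI-style index; the function name, the index values, and the payroll figure below are hypothetical placeholders and are not taken from bls.gov or from the data set.

```python
def adjust_for_inflation(nominal: float, cpi_then: float, cpi_base: float) -> float:
    """Convert a nominal dollar amount into base-year dollars."""
    if cpi_then <= 0 or cpi_base <= 0:
        raise ValueError("CPI values must be positive")
    return nominal * cpi_base / cpi_then


# Hypothetical illustration only: placeholder index values, NOT actual BLS data.
example_cpi = {1996: 150.0, 2011: 225.0}
nominal_payroll_1996 = 20_000_000.0  # placeholder payroll, not the Mets figure

real_payroll = adjust_for_inflation(
    nominal_payroll_1996, example_cpi[1996], example_cpi[2011]
)
print(f"{real_payroll:,.2f}")
```

Expressing every season's payroll in the same base-year dollars keeps the regression from treating general price growth as payroll growth.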

As mentioned earlier, my data organizes the effect of years since last playoff appearance,

payroll (adjusted for inflation), number of all-stars, winning percentage, and average batter’s age

on stadium attendance for the New York Mets from 1986 to 2011 (Figure 1). For the first few

years of the data, namely 1986 to 1990, the Mets were very successful. They made the playoffs

in 1986 and 1988 and won the World Series in 1986, and total season stadium attendance

ranged from 2.7 million to 3 million. In these 5 years, the Mets had high winning percentages

ranging from .537 to .667, and a total of 19 all-stars, which is an extremely high number.

In direct contrast to the years of 1986-1990, the Mets performed horribly from the years

of 1991 to 1998. The average batter’s age during these years was much younger than during

successful years (27 as opposed to 30 in 2000 when they made it to the World Series), winning

percentages were in the dismal range of .364 to .478, and payroll was much lower, highlighted

by the $35,015,247.14 team payroll in 1996. Attendance in these years was very low, as it rarely

broke 2 million. The Mets performed poorly again from 2001 to 2005, performed well from 2006

to 2008, and performed poorly again from 2009 to 2011. These three Mets eras indicate

fluctuations in the dependent variable of attendance in correlation with most of the explanatory

variables.

IV. Scatter Plots, Multiple Regression, Variable Selection, and Analysis

I have identified my explanatory variables, or X-variables, in this study as years since last

playoff appearance (YSLPA), payroll adjusted for inflation (Payroll), winning percentage (Win

%), number of all-stars (All-stars), and average batter’s age (Avg Batt. Age). The dependent

variable, or Y-variable, is attendance (Attendance). The data set is made up entirely of numeric

explanatory variables. First, individual scatter plots of each explanatory variable were created.

The scatter plot of the X-variable YSLPA against the Y-variable attendance is shown below.

[Figure 2: YSLPA v. Attendance. X-axis: Years Since Last Playoff Appearance; Y-axis: Attendance.

Fitted trend lines: logarithmic, f(x) = −712380.93056898 ln(x) + 3285693.67286353 (R² = 0.499);

power, f(x) = 3339126.98358037 x^(−0.304548335135481) (R² = 0.486);

cubic, f(x) = 12415.3785763 x³ − 159529.959663 x² + 290506.684811 x + 2967486.31414 (R² = 0.634)]

The scatter plot of “YSLPA” against Attendance indicates that as the number of years

since the Mets last reached the playoffs increases, attendance decreases. Originally, I tried a

linear trend line to fit the data. The line fit the data relatively well, aside from three data

points from years 8 through 10. The R-square for the linear fit was only .4, however, so I

tried quadratic and cubic equations to reflect the curvature of the data in years 8, 9, and 10.

With the cubic equation, the R-square improved dramatically from .4 to .634. Although it would appear that there should

be a negatively sloped straight line that fits the data, there could be an explanation for the

curvature. In years 8, 9, and 10 of missed playoff berths, according to the data set, the Mets were

starting to come out of their decade-long slump to become possible playoff contenders. It is

possible that the fans, in anticipation of the better performance of the team and potential playoff

implications, started to attend more games.
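
The trend-line comparison described above (linear versus higher-order fits judged by R-square) can be sketched in plain Python. This is an illustrative least-squares routine, not the spreadsheet procedure used for the figures; the function names are hypothetical, and the sample values are made up to show a dip followed by a late rebound. They are NOT the Mets data set.

```python
def solve_linear_system(a, b):
    """Solve a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def polyfit_r2(xs, ys, degree):
    """Least-squares polynomial fit; returns (coefficients low to high, R-square)."""
    k = degree + 1
    # Normal equations (X'X) beta = X'y, where X[i][j] = xs[i] ** j
    xtx = [[sum(x ** (i + j) for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum((x ** i) * y for x, y in zip(xs, ys)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * x ** j for j, b in enumerate(beta)) for x in xs]
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return beta, 1.0 - ss_res / ss_tot


# Hypothetical data (NOT the Mets data set): attendance in millions.
years_since = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
attendance = [3.0, 3.1, 2.8, 2.4, 2.1, 1.9, 1.7, 1.6, 1.8, 2.2, 2.7]

for degree in (1, 2, 3):
    _, r2 = polyfit_r2(years_since, attendance, degree)
    print(f"degree {degree}: R-square = {r2:.3f}")
```

Because each higher-order polynomial contains the lower-order one as a special case, R-square can never fall as the degree rises, so a higher R-square alone does not show that the extra terms are warranted.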

“Payroll”, my second X-variable, has a scatter plot that reveals some curvature as well.

Using the R-square as a measure of fit, I decided that a linear equation was not appropriate. A

linear equation had an R-square of .1, while the fourth-order (quartic) equation

had an R-square of .48. In choosing a fourth-order equation, I took into account the R-squares of

the quadratic and cubic equations, and decided that parsimony did not justify a lower order. With each

increase in order, the R-square improved by .08 to .10. Therefore,

the increases in goodness of fit were large enough to justify the higher-order equation.

[Figure 3: Payroll v. Attendance. X-axis: Payroll; Y-axis: Attendance.

Fitted trend line: quartic, f(x) = 1.8609402E-25 x⁴ − 7.187176E-17 x³ + 9.79745286E-09 x² − 0.538350076 x + 11955827.293 (R² = 0.481)]
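
One conventional way to weigh goodness of fit against parsimony is the adjusted R-square, which penalizes each additional fitted term. The sketch below applies it to the two R-square values reported for the payroll trend lines. It assumes 26 observations (one per season from 1986 through 2011); the function name is illustrative.

```python
def adjusted_r2(r2: float, n_obs: int, n_terms: int) -> float:
    """Adjusted R-square: discounts R-square for each extra fitted term."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_terms - 1)


N_SEASONS = 26  # assumption: one observation per season, 1986 through 2011

linear = adjusted_r2(0.10, N_SEASONS, 1)   # linear trend line, 1 term
quartic = adjusted_r2(0.48, N_SEASONS, 4)  # fourth-order trend line, 4 terms

print(f"linear:  {linear:.3f}")
print(f"quartic: {quartic:.3f}")
```

Under these assumptions the quartic fit still scores higher after the penalty (about .38 against about .06), which is consistent with the decision to keep the higher-order equation.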

For my third X-variable, “All-stars”, I analyzed the coefficient of determination, the R-

square, once again. Although a linear equation had a decent fit at .397, I still opted for a cubic

polynomial equation. The R-square for the fit of this equation was .419.

[Figure 4: All-stars v. Attendance. X-axis: All-stars; Y-axis: Attendance.

Fitted trend line: cubic, f(x) = −41835.174790099 x³ + 270645.06697743 x² − 49071.849825538 x + 1795789.1082761 (R² = 0.419)]

Clearly, as teams acquire more talented players, attendance increases. However, there is still

some curvature that prevents the data from being linear. A possible explanation for this curvature

could be that when a team has 1, 2, or 3 all-stars, its attendance prospects rise

dramatically, but the excitement fans have for all-stars 4, 5, and 6 increases at a slower rate.

While the attendance rates are still higher, this may suggest that a team only performs marginally

better with more than 3 or 4 all-stars.

The fourth X-variable, “Win%”, has somewhat of a sporadic scatter plot. The goodness

of fit, regardless of the type of equation, seems to be relatively low. The highest R-square I was

able to attain was .284 with a cubic equation. The curvature indicates that there is low attendance

from winning percentages of .35 to .45. This may suggest that regardless of higher or lower

winning percentages, fans do not wish to attend games because the team is not competitive

within this percentage range. There are, however, dramatic increases in attendance from the

winning percentages of .45 to .55. With these percentages, the Mets have a chance at playoff

aspirations.

[Figure 5: Win% v. Attendance. X-axis: Win%; Y-axis: Attendance.

Fitted trend line: cubic, f(x) = −243251228.120707 x³ + 370083180.27432 x² − 179730140.437134 x + 30221937.0151547 (R² = 0.284)]

There are still modest increases in attendance from .55 to .6, before it levels off. The next pattern

in the curvature reflects that attendance actually decreases at the winning percentage of .667, but

this could reflect an outlier, because the majority of the data has already leveled off from .6

to .65.

My final X-variable is the Mets’ average batter’s age, “Avg. Batt. Age”, for each season

from 1986 to 2011.

[Figure 7: Avg. Batt. Age v. Attendance. X-axis: Avg. Batt. Age; Y-axis: Attendance.

Fitted trend line: quartic, f(x) = −93160.639 x⁴ + 11018274.11 x³ − 488154740.6 x² + 9602092047 x − 70754078648 (R² = 0.334)]

For this fifth variable, a polynomial equation was appropriate once again. A quartic

equation appeared to have the highest R-square, with a value of .334. A linear equation would

not have been as appropriate, because it appears that attendance rises from the average batter’s

ages of 27.5 to 28, before leveling off from 28.5 to 29.5. Attendance then rises dramatically from

the age of 30 and up. A possible explanation for this curvature could be that an older lineup may

have more experience, reflect better performance, and thus affect attendance. This variable,

however, proved to be insignificant toward attendance, as shown by the multiple regression

performed in the subsequent portion of this study.

Additional statistical analysis, aside from scatter plots and goodness of fit, is required in

order to discover significant predictors of attendance. A multiple regression including each

explanatory variable against the dependent variable of attendance indicated that a form of

variable selection was necessary (Figure 8).

First, a global F-test was run on all five variables jointly to decide whether any of them were

significant predictors. The hypotheses for the global F-test are as follows:

$$H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = \beta_5 = 0$$

$$H_a: \text{at least one } \beta_j \neq 0$$

As far as the alpha level for the hypothesis test, I decided to use an alpha of .15. My

reasoning is that studies of social sciences are conducted with human beings, and there is usually

more variation. Therefore, I do not want to reject any variables that could be found significant.

The global F-test showed a very low P-value of .00002 (Figure 8). Because this P-value is far below alpha, I

rejected the null hypothesis and concluded that at least one of the explanatory variables is

significant.
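
For reference, the global F statistic is conventionally computed from the R-square of the full regression. The expression below is the standard textbook form rather than output from Figure 8; it takes k = 5 explanatory variables and assumes n = 26 observations (one per season from 1986 through 2011).

$$F = \frac{R^2 / k}{(1 - R^2) / (n - k - 1)}, \qquad k = 5, \quad n = 26, \quad \text{degrees of freedom} = (5,\ 20)$$

The null hypothesis is rejected when the P-value associated with this F statistic falls below the chosen alpha level.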

After conducting the global F-test and deciding at least one variable was significant, I

chose to use backward selection to narrow down my set of explanatory variables. Right away,

according to the first regression (Figure 8), the X-variable of “Average Batter’s Age” had an

extremely high P-value of .6058. The P-value for this explanatory variable is much higher than

the alpha level, so the variable was eliminated. Figure 9 reflects the new multiple

regression without the eliminated variable.

Upon eliminating the variable of “Average Batter’s Age”, the P-values for every other variable

were well below the alpha level of .15, and the variables were deemed significant predictors of attendance.

The overall R-square for the second multiple regression with only 4 variables fell by just .004,

and the standard error increased only minimally. These changes are not large

enough to indicate that the removal of “Average Batter’s Age” weakens the regression

equation. Also, the P-value of the global F-test fell even lower after the removal of

“Average Batter’s Age”, indicating that the right decision was made.
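
The backward selection procedure described above can be sketched as a loop that refits the regression and drops the least significant variable until no P-value exceeds alpha. This is a minimal sketch, not the spreadsheet workflow actually used: `backward_select` and `fake_fit` are hypothetical names, and apart from the .6058 P-value reported for average batter's age, every P-value below is a placeholder.

```python
from typing import Callable, Dict, List


def backward_select(
    variables: List[str],
    fit_p_values: Callable[[List[str]], Dict[str, float]],
    alpha: float = 0.15,
) -> List[str]:
    """Drop the least significant variable until no P-value exceeds alpha."""
    kept = list(variables)
    while kept:
        p_values = fit_p_values(kept)  # refit with the current variables
        worst = max(kept, key=lambda v: p_values[v])
        if p_values[worst] <= alpha:
            break  # everything that remains is significant at alpha
        kept.remove(worst)
    return kept


# Stand-in for a real regression routine. Only the .6058 P-value comes from
# the text (Figure 8); every other number is a hypothetical placeholder.
def fake_fit(current: List[str]) -> Dict[str, float]:
    placeholder = {
        "YSLPA": 0.05,
        "Payroll": 0.01,
        "All-stars": 0.08,
        "Win%": 0.10,
        "Avg Batt. Age": 0.6058,
    }
    return {name: placeholder[name] for name in current}


selected = backward_select(
    ["YSLPA", "Payroll", "All-stars", "Win%", "Avg Batt. Age"], fake_fit
)
print(selected)  # ['YSLPA', 'Payroll', 'All-stars', 'Win%']
```

In a real analysis the callback would rerun the regression each time, since the remaining P-values shift whenever a variable is removed.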

V. Forming the Final Regression Equation and Testing Assumptions

After running two multiple regressions and employing backward selection once, I was

able to form the following regression equation:

$$\hat{Y} = 393803.6703 - 80945.75352\,X_1 + 0.008003511\,X_2 + 176153.9487\,X_3 + 2567089.4\,X_4$$

1. Ŷ = Predicted Mets Attendance

2. X1 = Years Since Last Playoff Appearance

3. X2 = Payroll (adjusted for inflation)

4. X3 = Number of All-Stars

5. X4 = Winning Percentage

Upon recommending a regression equation, the disturbances must be checked to make

sure that they do not violate any of the assumptions. The four assumptions are that the

disturbances have an expected value of zero, the disturbances have constant variance, the

disturbances are normally distributed, and the disturbances are independent. In order to check

these assumptions, a residual plot of the residuals versus the predicted-y values is made.

[Figure 10: Residuals v. Predicted Attendance. X-axis: Predicted Attendance; Y-axis: Residuals]

In the residual plot, the residuals appear to be randomly scattered around zero.

There are no concave or fanning patterns, and almost all of the data points fall within

two standard deviations of zero, consistent with the empirical rule. This reflects a very desirable residual

plot that does not appear to violate any of the assumptions of the disturbances.

Another measure that ensures that the residuals fall within a normal distribution is a

normal probability plot. In this graph, the residuals are plotted against the Z-scores

they would be expected to have under a standard normal distribution. The main assessment made in determining

whether or not the data in a normal probability plot is desirable is the straightness of the points.

[Figure 11: Normal Probability Plot. X-axis: Z-Values; Y-axis: Residuals]

Upon review, the normal probability plot seems to be almost completely straight, and reflects a

desirable plot. The residuals are normally distributed according to the residual plot and the

normal probability plot, and no assumptions appear to be violated.
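
The construction of a normal probability plot can be sketched as follows: sort the residuals, pair each with the Z-value expected for its rank, and judge how close the pairs lie to a straight line. This is an illustration only. It assumes Blom plotting positions, (i − 0.375) / (n + 0.25), which may differ from the convention used to produce Figure 11; the function names are hypothetical, and the residuals are made-up values, NOT those of the Mets regression.

```python
from statistics import NormalDist, mean


def normal_scores(n: int) -> list:
    """Expected Z-values for n ordered residuals (Blom plotting positions)."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]


def straightness(residuals) -> float:
    """Correlation between the sorted residuals and their normal scores."""
    ordered = sorted(residuals)
    z = normal_scores(len(ordered))
    mr, mz = mean(ordered), mean(z)
    cov = sum((r - mr) * (s - mz) for r, s in zip(ordered, z))
    var_r = sum((r - mr) ** 2 for r in ordered)
    var_z = sum((s - mz) ** 2 for s in z)
    return cov / (var_r * var_z) ** 0.5


# Hypothetical residuals (NOT from the Mets regression), for illustration only.
sample = [-610000, -320000, -150000, -40000, 30000, 120000, 260000, 410000, 700000]
print(round(straightness(sample), 3))
```

A correlation close to 1 corresponds to the nearly straight line that the visual assessment looks for; it supplements the plot rather than replacing a formal normality test.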

VI. Prediction from the Regression Equation

After assessing that my equation was reasonable, and that none of the assumptions of the

disturbances of the data were violated, I made a prediction of the Mets stadium attendance for

the 2012 season. While deciding the values for my explanatory variables, I took many

considerations into account. First, the Mets have payroll constraints for the season of 2012 that

have been set by GM Sandy Alderson. He is restricting the Mets’ payroll to between $100 and $110

million. I decided to take the midpoint and make “Payroll” $105 million. For my “All-star” variable,

I realized that the Mets will probably not be able to re-sign 2-time all-star Jose Reyes. I decided

to set my total number of all-stars for the Mets next year at 3. As for the number of years since

their last playoff appearance, “YSLPA”, that number will increase from 5 to 6 in 2012. Finally, I

decided to increase the Mets winning percentage, “Win%”, to .531, under the assumption that

they will not have the same number of injuries as in 2011, and that they will acquire a better set of

relief pitchers. My prediction appears below:

$$\hat{Y} = 393803.6703 - 80945.75352(6) + 0.008003511(105{,}000{,}000) + 176153.9487(3) + 2567089.4(0.531) \approx 2{,}640{,}084$$

Here Ŷ is the predicted Mets attendance, and the terms are, in order, the intercept (B0), YSLPA (B1), Payroll (B2), All-stars (B3), and Win% (B4).
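
The arithmetic behind the prediction can be reproduced with a short script. The coefficients and the 2012 inputs are the ones given in this paper; the function and dictionary names are only illustrative.

```python
# Coefficients of the final regression equation, as reported above.
COEFFICIENTS = {
    "intercept": 393803.6703,
    "YSLPA": -80945.75352,
    "Payroll": 0.008003511,
    "All-stars": 176153.9487,
    "Win%": 2567089.4,
}


def predict_attendance(yslpa, payroll, all_stars, win_pct):
    """Evaluate the fitted regression equation for one season."""
    c = COEFFICIENTS
    return (
        c["intercept"]
        + c["YSLPA"] * yslpa
        + c["Payroll"] * payroll
        + c["All-stars"] * all_stars
        + c["Win%"] * win_pct
    )


forecast_2012 = predict_attendance(
    yslpa=6, payroll=105_000_000, all_stars=3, win_pct=0.531
)
print(round(forecast_2012))  # 2640084
```

Changing one input at a time shows how sensitive the forecast is to each assumption. For example, each additional all-star adds about 176,000 to the predicted attendance.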

This prediction is reasonable, as it reflects an increase in attendance of 287,488 for the

season of 2012. If the Mets have more wins and one more all-star than 2011 (indicating popular,

high caliber players for fans to see), it is reasonable that they could have close to 300,000 more

fans attend games next season. However, the Mets’ payroll and number of years since the playoffs

will both worsen. This suggests that, in this equation, popular players and wins matter more to fans than

the previous year’s success or the total team payroll.

VII. Suggestions for Future Research

After completing my empirical project of Mets attendance, there are a multitude of

additional factors I would like to study if I had the time or money. One issue that I had with my

regression is that I wanted to invert my years since last playoff appearance (YSLPA) variable,

but the inverse of zero is undefined. I tried multiple remedies for this problem, but they all

involved manipulations of the data set. Inversion would have dramatically improved my R-

square, and would have fit the data much more appropriately. Inverting an X-variable is typically

appropriate when a scatter plot reveals data that has a downward sloping curve and diminishes

over time. My scatter plot of “YSLPA” v. Attendance reflects this description, and my standard

error, R-square, and adjusted R-square all would have improved. I would be interested to see

how this issue could be resolved.
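
One commonly used workaround for inverting a variable that contains zeros is to add a constant before taking the reciprocal. The sketch below is a hypothetical illustration, not a remedy tried in this study; it is itself a manipulation of the data, and the choice of shift (1 here) is a modeling assumption that would need to be justified.

```python
def shifted_inverse(years_since_playoffs: float, shift: float = 1.0) -> float:
    """Return 1 / (x + shift), which is defined at x = 0 whenever shift > 0."""
    if years_since_playoffs + shift <= 0:
        raise ValueError("x + shift must be positive")
    return 1.0 / (years_since_playoffs + shift)


for yslpa in (0, 1, 5, 10):
    print(yslpa, round(shifted_inverse(yslpa), 3))
```

The transformed variable equals 1 in a playoff year and decays toward zero as the drought lengthens, which matches the downward-sloping, flattening pattern described for the YSLPA scatter plot.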

Another suggestion for future research would be to employ quadratic, cubic, and quartic

manipulations of all of my X-variables. Although this would be far too time consuming for the

purposes of this project, the goodness of fit for each of the scatter plots would improve

dramatically, and the predicted equation would be more accurate as well.

A final aspect that I would like to study is the effect of specific teams, players, and

promotions on stadium attendance. If I were able to gather these categorical X-variables, I would

be able to get a sense of how fans react when players such as Jose Reyes decide to test the free

agent market and sign elsewhere. Also, from a marketing perspective, it would be prudent to

understand which opponents draw fewer spectators to Mets home games. This way, market

researchers for the Mets could schedule promotions for games that are played against the less

popular opponents.
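
Categorical X-variables such as the opposing team would need to be converted to 0/1 indicator (dummy) columns before entering a regression. The sketch below shows one way this might be done; the function name and the list of opponents are hypothetical and serve only to illustrate the encoding.

```python
def dummy_encode(values, baseline):
    """One 0/1 column per category except the baseline (avoids perfect collinearity)."""
    categories = sorted(set(values) - {baseline})
    rows = [[1 if v == cat else 0 for cat in categories] for v in values]
    return categories, rows


# Hypothetical opponents for a handful of home games (illustrative only).
opponents = ["Phillies", "Marlins", "Yankees", "Marlins", "Phillies"]
columns, encoded = dummy_encode(opponents, baseline="Marlins")

print(columns)     # ['Phillies', 'Yankees']
print(encoded[0])  # [1, 0]
```

Each coefficient on an indicator column would then estimate how attendance against that opponent differs from attendance against the baseline opponent, holding the other variables fixed.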

VIII. Reference List

Baseball-Reference Web Site. (2011). Retrieved November 5, 2011, from http://www.baseball-reference.com/teams/NYM/attend.shtml.

New York Mets Web Site. (2011). Retrieved November 5, 2011, from http://www.newyork.mets.mlb.com.

Statistical Journals

MacDonald, Don N., and Morgan O. Reynolds. Are Baseball Players Paid Their Marginal Products? Managerial and Decision Economics, Vol. 15, No. 5, Special Issue: The Economics of Sports Enterprises (Sep. - Oct., 1994), pp. 443-457.

Davis, Michael C. The Interaction Between Baseball Attendance and Winning Percentage: A VAR Analysis. University of Missouri-Rolla, Department of Economics. http://umresearchboard.org/resources/davis/Baseball_Attendance_Winning.pdf

Kronheim, David P. Major League Baseball 2010 Attendance Analysis: METS SUFFER A SECOND STRAIGHT HUGE LOSS. Retrieved November 19, 2011, from http://www.numbertamer.com/files/2010_MLB_Attendance_Analysis.pdf

Other Information

Rubin, Adam. Mets’ Payroll Will Hinder Teams Success. ESPN Web Site. Retrieved November 19, 2011, from http://espn.go.com/new-york/mlb/story/_/id/7123377/new-york-mets-payroll-100-million-amazin-struggle-compete-2012.

APPENDIX

Figure 1

New York Mets Attendance Data Set
