Upload: john-michael-croft

Post on 15-Jul-2015



Regression Analysis:

NBA Points

By:

Matthew Adkins

John Michael Croft

Chima Iheme

Anthony Podolak


Table of Contents

Regression Analysis of NBA ppg

I.) Table of Contents
II.) Abstract
III.) Introduction
IV.) Model Specifications
    A.) Explanation of Coefficients
    B.) Test for Non-Linearity
    C.) Test for Heteroskedasticity
    D.) Normality Test
        i.) Histogram, PP Plot, other graphical methods
        ii.) Quantitative Analysis of Normality
    E.) Outlier Control
        i.) Decision to keep/disregard outliers
        ii.) Explanation for decision
V.) Conclusion
    A.) Final Model
    B.) Limitations of Model
    C.) Possible Improvements
Appendix: Includes relevant graphs, plots, and non-essential output reports.


Abstract

In this project, we analyzed a set of data and determined its relationship to NBA points per game (ppg). The first thing we did was create a baseline model against which we could compare all of our later results. We then tested for non-linearity in our independent variables to find the forms of the variables that explained the most of the variation in our model. After we built a preliminary model, we tested it for multicollinearity, heteroskedasticity, and normality. Once we had adjusted our model to deal with these issues, we made the decision to retain the outliers, as they did not appear to significantly affect the model. By doing these things, we feel confident that we have created a model that accurately assesses the information we were given. However, we were limited by the available data; to build a truly explanatory model, we would need more data.


Introduction

This project is a required assignment from our Econometrics professor. It incorporates building a model and testing that model for problems, and it provides an avenue for us to practice the concepts we learned this semester. As we proceed, we will lead you through a step-by-step process that will allow you to intuitively understand how we arrived at our specific model.

Initially, we were given a data set that included our dependent variable, PointsPerGame, along with a number of possible factors and statistics for NBA players from one completed season. From this, we created a baseline model with the following Adjusted R²:

Model Summary

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .928a   .861       .858                2.215

a. Predictors: (Constant), coll, assists, rebounds, allstar, wage, avgmin

This model is sub-optimal and serves only as a starting point against which to compare our later models. However, it gives us a sense of whether we are progressing or regressing as our analysis of the data proceeds.
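A baseline fit like the one summarized above can be reproduced with any least-squares routine. The sketch below is a minimal OLS implementation showing how R² and adjusted R² are computed; since the original player data set is not reproduced here, the data are synthetic and every number is illustrative only.

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns coefficients,
    R-squared, and adjusted R-squared."""
    n = len(y)
    X1 = np.column_stack([np.ones(n), X])        # prepend intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    r2 = 1 - ss_res / ss_tot
    k = X.shape[1]                               # number of predictors
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return beta, r2, adj_r2

# Illustrative stand-in data: 269 players (the sample size reported in
# the appendix) and 6 predictors, as in the baseline model.
rng = np.random.default_rng(0)
X = rng.normal(size=(269, 6))
y = 5 + X @ np.array([1.0, 0.5, 0.8, 2.0, 0.3, 1.5]) + rng.normal(scale=2.2, size=269)
beta, r2, adj_r2 = fit_ols(X, y)
```

Adjusted R² penalizes each extra predictor, which is why it is the figure compared across models rather than raw R².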

Model Specifications

After many regressions, we were able to create what we believe to be the optimal model, given the data we are working with and the limitations placed upon us. To achieve this, we regressed pointspergame against every variation of our available independent variables that we considered (in this case: the inverse, square root, natural log, quadratic, and cubic forms). We eliminated variables first by using the stepwise function. After that initial round of variable elimination, we focused on eliminating variables with a VIF over ten. Lastly, once the VIFs of all remaining variables were under ten, we eliminated variables that we could not show to be statistically significant. This process slightly increased our adjusted R², which increases the explanatory power of our model. The summary of the final model is as follows:
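The VIF screen described above is simple to implement directly: each predictor is regressed on the remaining predictors, and VIF_j = 1/(1 − R²_j). A minimal sketch with synthetic data, where two columns are deliberately near-collinear so the rule of thumb (drop VIF > 10) fires:

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X: regress column j
    on the remaining columns (plus an intercept), then 1 / (1 - R^2)."""
    n, k = X.shape
    out = []
    for j in range(k):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        ss_res = resid @ resid
        ss_tot = ((X[:, j] - X[:, j].mean()) ** 2).sum()
        out.append(1.0 / (1.0 - (1 - ss_res / ss_tot)) if ss_tot else 1.0)
    return np.array(out)

rng = np.random.default_rng(1)
minutes = rng.uniform(10, 40, size=200)
X = np.column_stack([
    minutes,
    minutes + rng.normal(scale=1.0, size=200),  # nearly collinear with minutes
    rng.normal(size=200),                       # independent predictor
])
v = vif(X)   # first two VIFs are large, third is near 1
```

In practice the transformed variables (a variable and its cube, for instance) are exactly the kind of pair that produces inflated VIFs, which is why this screen matters after the transformation search.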

Explanation of the Coefficients

Now that we have our model, we can move on to testing it for reliability. But before we do that, we will explain the meaning behind the independent variables we incorporated.

• Constant: 1.259.

• Avgmin: as average minutes increase by 1, LN(PPG) increases by .065.

• Wage: as average annual salary increases by $1,000, LN(PPG) increases by 4.286E-5.

• Coll: as years of college played increase by 1, LN(PPG) decreases by .068.

• Assists: as average assists per game increase by 1, LN(PPG) decreases by .063.

• Allstar: if the player was ever an all-star, LN(PPG) increases by .220.

• Invminutes: as the inverse of minutes per year increases by 1, LN(PPG) decreases by .001.

• Invassists: as the inverse of assists per game increases by 1, LN(PPG) decreases by .058.

• Cubeminutes: as minutes cubed increases by 1, LN(PPG) decreases by .008.

• Cuberebounds: as rebounds cubed increases by 1, LN(PPG) decreases by .000 (the coefficient rounds to zero at three decimals).
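Because the dependent variable is LN(PPG), each coefficient is approximately the proportional change in PPG for a one-unit change in its predictor; the exact multiplier is exp(b). A quick check using the avgmin coefficient from the list above:

```python
import math

b_avgmin = 0.065               # coefficient on avgmin from the model
pct = math.exp(b_avgmin) - 1   # exact proportional change in PPG per extra minute
# One more average minute multiplies PPG by exp(.065), about a 6.7% increase,
# close to (but slightly above) the 6.5% the raw coefficient suggests.
```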

Testing for Non-Linearity

One of the main problems found in regression analysis comes from non-linearity in the variables. It is not always easy to tell whether a variable enters linearly, but there are a couple of methods we can use to account for possible non-linearity. The first is to visually inspect the data by plotting each independent variable against pointspergame. We did this and found a few variables that were questionable. We then incorporated all of the variations we could think of into our regression and eliminated the ones that did not increase the explanatory power of our model. These two methods allowed us to account for the non-linearity in our variables, which in turn allowed us to further optimize our model.
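The transformation search just described can be mechanized: generate each candidate form and compare adjusted R². A sketch on synthetic data built with a known logarithmic relationship (variable names and numbers are illustrative, not from the NBA data set):

```python
import numpy as np

def adj_r2(X, y):
    """Adjusted R^2 of an OLS fit of y on X (intercept added)."""
    n = len(y)
    X1 = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    r2 = 1 - (resid @ resid) / (((y - y.mean()) ** 2).sum())
    k = X.shape[1]
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

rng = np.random.default_rng(2)
x = rng.uniform(1, 40, size=250)                 # e.g. average minutes
y = 0.2 + 2.0 * np.log(x) + rng.normal(scale=0.3, size=250)

# The candidate forms considered in the text
forms = {
    "linear":  x,
    "inverse": 1.0 / x,
    "sqrt":    np.sqrt(x),
    "log":     np.log(x),
    "square":  x ** 2,
    "cube":    x ** 3,
}
scores = {name: adj_r2(f.reshape(-1, 1), y) for name, f in forms.items()}
best = max(scores, key=scores.get)   # rank candidate forms by fit
```

Here the log form should outscore the linear and cubic ones, mirroring how a visually curved scatter plot points toward a transformed variable.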


Testing for Heteroskedasticity

Our analysis shows no heteroskedasticity in the residual output charts and graphs. The residual histogram shows that the residuals are normally distributed, and the residual scatter plot shows no obvious trend. The number of observations is consistent with the Empirical Rule, with 99.7% of the data within three standard deviations of the expected value. Based on the Breusch-Pagan test, we decided that no correction by way of GLS was necessary.
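The Breusch-Pagan test mentioned above regresses the squared OLS residuals on the predictors; under homoskedasticity the LM statistic n·R² is chi-squared with as many degrees of freedom as predictors. A minimal hand-rolled sketch on synthetic, homoskedastic data (the statistic and p-value here are illustrative, not the ones from our regression):

```python
import numpy as np
from scipy import stats

def breusch_pagan(X, y):
    """Breusch-Pagan LM test: fit OLS, regress squared residuals on X;
    under homoskedasticity LM = n * R^2 ~ chi2(k)."""
    n = len(y)
    X1 = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    e2 = (y - X1 @ beta) ** 2                    # squared residuals
    gamma, *_ = np.linalg.lstsq(X1, e2, rcond=None)
    u = e2 - X1 @ gamma
    r2 = 1 - (u @ u) / (((e2 - e2.mean()) ** 2).sum())
    lm = n * r2
    return lm, stats.chi2.sf(lm, df=X.shape[1])  # p-value

rng = np.random.default_rng(3)
X = rng.normal(size=(269, 3))
y = 1 + X @ np.array([0.5, -0.3, 0.8]) + rng.normal(scale=1.0, size=269)
lm, p = breusch_pagan(X, y)   # large p-value: no evidence of heteroskedasticity
```

A small p-value would have pointed toward a GLS (or robust-standard-error) correction.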

Testing for Normality

There are two methods for assessing whether a model's residuals are normally distributed: graphically and quantitatively. We use two different types of graphs to assess the normality of our regression: when we graph our standardized residuals in a PP plot and a histogram, we see a distribution that looks very normal. These visual aids are helpful, but we cannot accurately determine normality solely through graphs; we also need to test it quantitatively. Thus, we use the Kolmogorov-Smirnov and Shapiro-Wilk tests to determine the normality of our regression. Through these tests we determine that our regression is in fact normal.
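Both quantitative tests are available in scipy.stats. The sketch below applies them to simulated normal draws rather than our actual residuals, so the statistics it produces are illustrative, not the ones reported in the appendix:

```python
import numpy as np
from scipy import stats

# Stand-in for the standardized residuals of a fitted model; drawn from
# a normal distribution purely for illustration (n = 269, the sample
# size reported in the appendix table).
rng = np.random.default_rng(4)
std_resid = rng.standard_normal(269)

ks_stat, ks_p = stats.kstest(std_resid, "norm")   # Kolmogorov-Smirnov
sw_stat, sw_p = stats.shapiro(std_resid)          # Shapiro-Wilk
# Small p-values would reject the null hypothesis of normality.
```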

Outlier Control

In order to obtain a better perspective on our outliers, we created a box plot to see where they fit in relation to our data as a whole. We also ran descriptive statistics on our standardized residuals, which allowed us to examine our outliers statistically. We found that most of our outliers lie right at or within three standard deviations of the mean, and even the few outside three standard deviations were within four.

Because of this statistical evidence and the size of our sample, we decided it would be best to leave the outliers in the model. We do not believe they will significantly skew our data, and keeping them may help explain the variation in our model.
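The standard-deviation check described above is easy to automate: standardize the residuals and count how many fall beyond three and four standard deviations. The residuals below are simulated for illustration.

```python
import numpy as np

def outlier_summary(resid):
    """Standardize residuals and count points beyond 3 and 4 SDs."""
    z = (resid - resid.mean()) / resid.std(ddof=1)
    return {
        "beyond_3sd": int(np.sum(np.abs(z) > 3)),
        "beyond_4sd": int(np.sum(np.abs(z) > 4)),
    }

rng = np.random.default_rng(5)
resid = rng.normal(scale=2.2, size=269)   # illustrative residuals
summary = outlier_summary(resid)
# Under the Empirical Rule, roughly 0.3% of points (about 1 of 269)
# should fall beyond three standard deviations.
```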

Conclusion

After analyzing the data, we constructed a model that, as far as we can tell, explains the most variation of any possible model. As we tested the model, we were able to refine it further and eliminate a great deal of the multicollinearity and insignificant factors found in it. This was achieved primarily by applying natural log, inverse, square, cube, and square root transformations to our independent and dependent variables. Our final model is:

LN(PPG) = 1.295 + .065(AvgMin) + 4.286E-5(Wage) - .068(Coll) - .063(Assists) + .220(Allstar) - .001(Minutes)⁻¹ - .058(Assists)⁻¹ - .008(Minutes)³ + .000(Rebounds)³
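Transcribed as a function, the final model looks like the sketch below. The coefficients are copied from the equation above; the units of the transformed minutes terms are not fully specified in the report, so the example inputs are placeholders, and the Rebounds³ coefficient prints as .000 (it rounds to zero at three decimals), so that term contributes nothing here.

```python
import math

def predict_ppg(avgmin, wage, coll, assists, allstar, minutes, rebounds):
    """Predicted points per game from the final model above.
    wage is annual salary in thousands of dollars; allstar is 0/1.
    Units of the transformed minutes terms are as printed in the report."""
    ln_ppg = (1.295
              + 0.065 * avgmin
              + 4.286e-5 * wage
              - 0.068 * coll
              - 0.063 * assists
              + 0.220 * allstar
              - 0.001 * (1.0 / minutes)
              - 0.058 * (1.0 / assists)
              - 0.008 * minutes ** 3
              + 0.000 * rebounds ** 3)   # coefficient rounds to zero
    return math.exp(ln_ppg)              # undo the log transform

# Placeholder inputs, purely to show the call shape
ppg = predict_ppg(avgmin=30, wage=2000, coll=4, assists=5,
                  allstar=1, minutes=2.0, rebounds=7.0)
```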

Model Limitations

Due to the nature of this model, there are certain inherent limitations. Foremost among these is its limited ability to predict future values: a model like this one is primarily explanatory. It lacks multiple years of historical data, which would allow it to predict future values based on trends in older data. In addition, at least one of the variables (draft) was entered in such a way that there were blanks if the player was undrafted. There is probably a better way to enter this information, perhaps using dummy variables, which might add to the model's effectiveness. The model is also limited by our exposure to the various tools and techniques presented in an undergraduate class; we are certain there are countless other tests and methods that could be applied to improve the model further.
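The dummy-variable fix suggested for the draft column could look like this sketch: an indicator for whether the player was drafted at all, plus the pick number with blanks recoded to zero (the picks shown are made up for illustration).

```python
def encode_draft(pick):
    """Encode a draft pick that may be missing (None = undrafted) as
    (drafted_dummy, pick_number): the dummy separates undrafted players
    instead of leaving a blank, and pick is 0 when undrafted."""
    if pick is None:
        return 0, 0
    return 1, pick

picks = [1, 14, None, 27, None]          # None marks an undrafted player
encoded = [encode_draft(p) for p in picks]
# encoded == [(1, 1), (1, 14), (0, 0), (1, 27), (0, 0)]
```

With both columns in the regression, the dummy absorbs the level difference for undrafted players, so the pick coefficient is estimated only from drafted ones.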

Possible Improvements

The major improvement that could be made to this model would be the addition of more variables. If we could add variables such as the number of players at the same position per team, points per game from the prior season or from college, average team points scored per game, average opponent points allowed per game, injuries, motivation level, win/loss record, and contract year, we could build a more accurate model. This would allow us to account for a great deal more of the variation than we currently can under the scope of this data.


Appendix

III. Final Model


III.D.


Tests of Normality

                        Kolmogorov-Smirnov(a)        Shapiro-Wilk
                        Statistic   df    Sig.       Statistic   df    Sig.
Standardized Residual   .056        269   .040       .981        269   .001

a. Lilliefors Significance Correction

III.E.
