checking regression model assumptions
Post on 05-Jan-2016
Checking Regression Model Assumptions
NBA 2013/14 Player Heights and Weights
Data Description / Model
• Heights (X) and Weights (Y) for 505 NBA Players in 2013/14 Season.
• Other Variables included in the Dataset: Age, Position
• Simple Linear Regression Model: Y = β0 + β1X + ε
• Model Assumptions:
  ◦ Errors are normally distributed: ε ~ N(0, σ²)
  ◦ Errors are independent
  ◦ Error variance (σ²) is constant
  ◦ Relationship between Y and X is linear
  ◦ No important (available) predictors have been omitted
[Figure: Weight (Y) vs Height (X) - 2013/2014 NBA Players. Horizontal axis: Height (inches); vertical axis: Weight (lbs).]
Regression Model

Regression Statistics
Multiple R           0.821
R Square             0.674
Adjusted R Square    0.673
Standard Error       15.237
Observations         505

ANOVA
              df     SS        MS        F       Significance F
Regression    1      240985    240985    1038    0.0000
Residual      503    116782    232
Total         504    357767

             Coefficients   Standard Error   t Stat     P-value   Lower 95%   Upper 95%
Intercept    -279.869       15.551           -17.997    0.0000    -310.423    -249.316
Height       6.331          0.197            32.217     0.0000    5.945       6.717
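Fitted equation: $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X = -279.869 + 6.331X$

Estimated standard error of the slope: $s\{\hat{\beta}_1\} = 0.197$

Critical value (cdf-based: $t_{0.975;503}$ = upper-tail based: $t_{0.025;503}$): $1.965$

$H_0: \beta_1 = 0 \qquad H_A: \beta_1 \neq 0 \qquad TS: t^* = \dfrac{\hat{\beta}_1}{s\{\hat{\beta}_1\}} = \dfrac{6.331}{0.197} = 32.217$

95% Confidence Interval for $\beta_1$: $6.331 \pm 1.965(0.197) \equiv (5.945,\ 6.717)$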
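Total (Corrected) Sum of Squares: $SSTO = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = 357767$

Regression Sum of Squares: $SSR = SS_{Reg} = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 = 240985 \qquad df_{Reg} = 1$

Error Sum of Squares: $SSE = SS_{Res} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = 116782 \qquad df_{Err} = 505 - 2 = 503$

$H_0: \beta_1 = 0 \qquad H_A: \beta_1 \neq 0 \qquad TS: F^* = \dfrac{MSR}{MSE} = \dfrac{MS_{Reg}}{MS_{Res}} = \dfrac{240985/1}{116782/503} = 1038$

$r^2 = \dfrac{SSR}{SSTO} = \dfrac{SS_{Reg}}{SSTO} = \dfrac{240985}{357767} = 0.674$

$s^2 = MSE = MS_{Res} = \dfrac{116782}{503} = 232 \qquad s = \sqrt{232} = 15.24$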
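The test statistics, confidence interval, and fit summaries above can be reproduced from the rounded values in the regression output. The following is a minimal sketch in R; the variable names are illustrative, and because the inputs are rounded, the t statistic comes out near 32.14 rather than the reported 32.217.

```r
# Summary values copied from the regression output (rounded as printed).
b1     <- 6.331      # estimated slope
se_b1  <- 0.197      # standard error of the slope
n      <- 505
ss_reg <- 240985     # regression sum of squares
ss_res <- 116782     # residual (error) sum of squares

df_res <- n - 2
t_star <- b1 / se_b1                       # t statistic for H0: beta1 = 0
t_crit <- qt(0.975, df_res)                # two-sided 5% critical value
ci     <- b1 + c(-1, 1) * t_crit * se_b1   # 95% confidence interval for beta1

mse    <- ss_res / df_res                  # mean squared error
f_star <- (ss_reg / 1) / mse               # ANOVA F statistic
r_sq   <- ss_reg / (ss_reg + ss_res)       # coefficient of determination
s      <- sqrt(mse)                        # residual standard error

round(c(t = t_star, t_crit = t_crit, F = f_star, r2 = r_sq, s = s), 3)
round(ci, 3)
```

Working from the sums of squares rather than the raw data keeps the sketch self-contained; with the data available, `summary(lm(Weight ~ Height))` would report the same quantities at full precision.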
Checking Normality of Errors
• Graphically
  ◦ Histogram – Should be mound shaped around 0
  ◦ Normal Probability Plot – Residuals versus expected values under normality should follow a straight line.
    ▪ Rank residuals from smallest (large negative) to highest (k = 1,…,n)
    ▪ Compute the quantile for the ranked residual: p = (k − 0.375)/(n + 0.25)
    ▪ Obtain the Z-score corresponding to the quantiles: z(p)
    ▪ Expected Residual = √MSE × z(p)
    ▪ Plot Ordered residuals versus Expected Residuals
• Numerical Tests:
  ◦ Correlation Test: Obtain correlation between ordered residuals and z(p). Critical Values for n up to 100 are provided by Looney and Gulledge (1985).
  ◦ Shapiro-Wilk Test: Similar to Correlation Test, with more complex calculations. Printed directly by statistical software packages.
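The normal probability plot steps can be sketched in R as follows. This is an illustration, not the code used for the slides: it takes n = 505 and MSE = 116782/503 from the regression output, and `fit` in the trailing comments is a hypothetical fitted-model object.

```r
# Expected residuals under normality for ranks k = 1..n.
n   <- 505
mse <- 116782 / 503

k        <- 1:n
p        <- (k - 0.375) / (n + 0.25)   # quantile for the k-th smallest residual
expected <- sqrt(mse) * qnorm(p)       # expected residual = sqrt(MSE) * z(p)

# First five ranks, quantiles, and expected residuals
round(head(cbind(rank = k, quantile = p, expected = expected), 5), 4)

# With a fitted model `fit` (hypothetical name), the plot and the
# correlation test statistic would follow as:
#   e <- sort(resid(fit))
#   plot(expected, e); abline(0, 1)
#   cor(e, expected)
```

The expected values depend only on n and MSE, so they can be computed before looking at the residuals themselves.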
Normal Probability Plot / Correlation Test
[Figure: Normal Probability Plot of Residuals. Horizontal axis: Expected Value Under Normality; vertical axis: Residual.]
Extreme and Middle Residuals

e          rank    quantile    z(p)*s
-45.583    1       0.0012      -46.115
-44.921    2       0.0032      -41.519
-39.929    3       0.0052      -39.045
-36.921    4       0.0072      -37.306
-36.590    5       0.0092      -35.949
…          …       …           …
-0.260     251     0.4960      -0.151
-0.260     252     0.4980      -0.076
-0.260     253     0.5000      0.000
-0.260     254     0.5020      0.076
0.063      255     0.5040      0.151
…          …       …           …
40.748     501     0.9908      35.949
42.079     502     0.9928      37.306
44.417     503     0.9948      39.045
49.740     504     0.9968      41.519
56.079     505     0.9988      46.115
The correlation between the Residuals and their expected values under normality is 0.9972. Based on the Shapiro-Wilk test in R, the P-value for H0: Errors are normal is P = 0.0859 (Do not reject Normality).
Checking the Constant Variance Assumption
• Plot Residuals versus X or Predicted Values
  ◦ Random Cloud around 0 ⇒ Linear Relation
  ◦ Funnel Shape ⇒ Non-constant Variance
  ◦ Outliers fall far above (positive) or below (negative) the general cloud pattern
  ◦ Plot absolute Residuals, squared residuals, or square root of absolute residuals: Positive Association ⇒ Non-constant Variance
• Numerical Tests
  ◦ Brown-Forsythe Test – 2 Sample t-test of absolute deviations from group medians
  ◦ Breusch-Pagan Test – Regresses squared residuals on model predictors (X variables)
[Figure: Residuals vs Fitted Values. Horizontal axis: Fitted Values; vertical axis: Residuals.]
[Figure: Absolute Residuals vs Fitted Values. Horizontal axis: Fitted Values; vertical axis: Absolute Residuals.]
Equal (Homogeneous) Variance - I
Brown-Forsythe Test:
$H_0: \sigma_i^2 = \sigma^2$ (Equal Variance Among Errors)
$H_A$: Unequal Variance Among Errors (Increasing or Decreasing in $X$)

1) Split Dataset into 2 groups based on levels of $X$ (or fitted values) with sample sizes: $n_1, n_2$
2) Compute the median residual in each group: $\tilde{e}_1, \tilde{e}_2$
3) Compute absolute deviation from group median for each residual: $d_{ij} = |e_{ij} - \tilde{e}_j| \qquad i = 1,\ldots,n_j \quad j = 1,2$
4) Compute the mean and variance for each group of $d_{ij}$: $\bar{d}_1, s_1^2, \bar{d}_2, s_2^2$
5) Compute the pooled variance: $s^2 = \dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$

Test Statistic: $t_{BF} = \dfrac{\bar{d}_1 - \bar{d}_2}{s\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \overset{H_0}{\sim} t_{n_1 + n_2 - 2}$

Reject $H_0$ if $|t_{BF}| \geq t_{1-\alpha/2;\ n_1 + n_2 - 2}$
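The five Brown-Forsythe steps translate into a few lines of base R. The sketch below is illustrative only: the function name `brown_forsythe`, its arguments, and the simulated data are assumptions made for the example, not part of the slides.

```r
# Brown-Forsythe test from raw residuals, split into two groups at a cutoff on x.
brown_forsythe <- function(e, x, cutoff) {
  g  <- ifelse(x <= cutoff, 1, 2)            # 1) two groups by level of X
  d  <- abs(e - ave(e, g, FUN = median))     # 2)-3) |e_ij - group median|
  nj <- tapply(d, g, length)                 # group sample sizes
  dj <- tapply(d, g, mean)                   # 4) group means of d
  vj <- tapply(d, g, var)                    #    and group variances of d
  s2 <- sum((nj - 1) * vj) / (sum(nj) - 2)   # 5) pooled variance
  t_bf <- unname((dj[1] - dj[2]) / sqrt(s2 * sum(1 / nj)))
  df   <- sum(nj) - 2
  list(t = t_bf, df = df, p.value = 2 * pt(-abs(t_bf), df))
}

# Simulated example (not the NBA data): the error SD grows with x.
set.seed(1)
x <- runif(200, 69, 87)
e <- rnorm(200, sd = 0.2 * (x - 60))
res <- brown_forsythe(e, x, cutoff = 79)
res
```

Because the statistic is an ordinary pooled two-sample t on the absolute deviations, `t.test(d1, d2, var.equal = TRUE)` applied to the two groups of deviations gives the same result.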
Equal (Homogeneous) Variance - II
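Breusch-Pagan (aka Cook-Weisberg) Test:
$H_0: \sigma_i^2 = \sigma^2$ (Equal Variance Among Errors)
$H_A: \sigma_i^2 = \sigma^2 h(\gamma_1 X_{i1} + \cdots + \gamma_p X_{ip})$ (Unequal Variance Among Errors)

1) Let $SSE = \sum_{i=1}^{n} e_i^2$ from original regression
2) Fit Regression of $e_i^2$ on $X_{i1},\ldots,X_{ip}$ and obtain $SS(\text{Reg}^*)$

Test Statistic: $X^2_{BP} = \dfrac{SS(\text{Reg}^*)/2}{\left(\sum_{i=1}^{n} e_i^2 / n\right)^2} \overset{H_0}{\sim} \chi^2_p$

Reject $H_0$ if $X^2_{BP} \geq \chi^2_{1-\alpha;\ p}$, where $p$ = # of predictors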
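The Breusch-Pagan statistic needs only two sums of squares and the sample size. The sketch below is a hedged illustration: `bp_stat` and `bp_from_fit` are hypothetical helper names, the numeric inputs are the sums of squares reported in this deck's results for the Weight on Height model, and the built-in `cars` data set is used only to show the call on a fitted model.

```r
# Breusch-Pagan statistic from summary quantities.
#   ss_reg_star: regression SS from regressing e^2 on the predictors
#   sse:         error SS from the original regression
bp_stat <- function(ss_reg_star, sse, n, p = 1) {
  x2 <- (ss_reg_star / 2) / (sse / n)^2
  c(X2 = x2, df = p, p.value = pchisq(x2, df = p, lower.tail = FALSE))
}

# Same statistic starting from a fitted lm object (assumes an intercept).
bp_from_fit <- function(fit) {
  e2  <- resid(fit)^2
  X   <- model.matrix(fit)                   # intercept plus predictors
  aux <- lm.fit(X, e2)                       # regress e_i^2 on the predictors
  ss_reg_star <- sum((aux$fitted.values - mean(e2))^2)
  bp_stat(ss_reg_star, sum(e2), length(e2), p = ncol(X) - 1)
}

# Values reported for the regression of Weight on Height
nba_bp <- bp_stat(ss_reg_star = 963633.270, sse = 116782.311, n = 505)
round(nba_bp, 4)

# Usage on a fitted model, with a built-in data set
cars_bp <- bp_from_fit(lm(dist ~ speed, data = cars))
```

This is the non-studentized form of the test, matching the formula above.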
Brown-Forsythe and Breusch-Pagan Tests

Brown-Forsythe Test: Group 1: Heights ≤ 79”, Group 2: Heights ≥ 80”

Group    Heights(Grp)    n(Grp)    Med(e|Grp)    Mean(d|Grp)    Var(d|Grp)
1        69-79           252       -1.2673       10.8039        70.4186
2        80-87           253       0.7482        12.9193        108.7256

MeanDiff            -2.1155
PooledVar           89.6102
PooledSD            9.4663
sqrt(1/n1+1/n2)     0.0890
s{d1bar-d2bar}      0.8425
t*(BF)              -2.5110
t(.975,505-2)       1.9647
P-value             0.012 (two-sided)

H0: Equal Variances Among Errors (Reject H0)

Breusch-Pagan Test

Regression of Weight on Height
ANOVA         df     SS
Regression    1      240984.7782
Residual      503    116782.3109
Total         504    357767.0891

Regression of e^2 on Height
ANOVA         df     SS
Regression    1      963633.2703
Residual      503    67658845.93
Total         504    68622479.2

SSE(Model1)      116782.311
n                505
SS(Reg*)         963633.270
X2(BP):Num       481816.635
X2(BP):Denom     53477.534
X2(BP)           9.010
Chisq(.95,1)     3.841
P-value          0.003

H0: Equal Variances Among Errors (Reject H0)
Linearity of Regression
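F-Test for Lack-of-Fit ($n_j$ observations at $c$ distinct levels of "X")

$H_0: E\{Y_i\} = \beta_0 + \beta_1 X_i \qquad H_A: E\{Y_i\} = \mu_i \neq \beta_0 + \beta_1 X_i$

Compute fitted value $\hat{Y}_j$ and sample mean $\bar{Y}_j$ for each distinct $X$ level

Lack-of-Fit: $SS(LF) = \sum_{j=1}^{c}\sum_{i=1}^{n_j} (\bar{Y}_j - \hat{Y}_j)^2 \qquad df_{LF} = c - 2$

Pure Error: $SS(PE) = \sum_{j=1}^{c}\sum_{i=1}^{n_j} (Y_{ij} - \bar{Y}_j)^2 \qquad df_{PE} = n - c$

Test Statistic: $F_{LOF} = \dfrac{SS(LF)/(c-2)}{SS(PE)/(n-c)} = \dfrac{MS(LF)}{MS(PE)} \overset{H_0}{\sim} F_{c-2,\ n-c}$

Reject $H_0$ if $F_{LOF} \geq F_{1-\alpha;\ c-2,\ n-c}$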
Linearity of Regression
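Full Model ($H_A$): $E\{Y_{ij}\} = \mu_j \Rightarrow \hat{\mu}_j = \bar{Y}_j$

$SSE(F) = \sum_{j=1}^{c}\sum_{i=1}^{n_j} (Y_{ij} - \bar{Y}_j)^2 = SS(PE) \qquad df_F = n - c$ ($c$ means are estimated)

Reduced Model ($H_0$): $E\{Y_{ij}\} = \beta_0 + \beta_1 X_j \Rightarrow \hat{Y}_j = \hat{\beta}_0 + \hat{\beta}_1 X_j$

$SSE(R) = \sum_{j=1}^{c}\sum_{i=1}^{n_j} (Y_{ij} - \hat{Y}_j)^2 = SSE \qquad df_R = n - 2$ (2 parameters are estimated)

Decomposition of the error sum of squares:

$$\sum_{j=1}^{c}\sum_{i=1}^{n_j} (Y_{ij} - \hat{Y}_j)^2 = \sum_{j=1}^{c}\sum_{i=1}^{n_j} \left[(Y_{ij} - \bar{Y}_j) + (\bar{Y}_j - \hat{Y}_j)\right]^2$$

$$= \sum_{j=1}^{c}\sum_{i=1}^{n_j} (Y_{ij} - \bar{Y}_j)^2 + \sum_{j=1}^{c}\sum_{i=1}^{n_j} (\bar{Y}_j - \hat{Y}_j)^2 + 2\sum_{j=1}^{c}\sum_{i=1}^{n_j} (Y_{ij} - \bar{Y}_j)(\bar{Y}_j - \hat{Y}_j)$$

$$= \sum_{j=1}^{c}\sum_{i=1}^{n_j} (Y_{ij} - \bar{Y}_j)^2 + \sum_{j=1}^{c}\sum_{i=1}^{n_j} (\bar{Y}_j - \hat{Y}_j)^2 + 0 \quad \Rightarrow \quad SSE = SS(PE) + SS(LF)$$

Test Statistic:

$$F_{LOF} = \dfrac{\dfrac{SSE(R) - SSE(F)}{df_R - df_F}}{\dfrac{SSE(F)}{df_F}} = \dfrac{\dfrac{SSE - SS(PE)}{(n-2) - (n-c)}}{\dfrac{SS(PE)}{n-c}} = \dfrac{SS(LF)/(c-2)}{SS(PE)/(n-c)} = \dfrac{MS(LF)}{MS(PE)} \overset{H_0}{\sim} F_{c-2,\ n-c}$$

Reject $H_0$ if $F_{LOF} \geq F_{1-\alpha;\ c-2,\ n-c}$

Computing Strategy:

1) For each group ($j$): Compute: $\bar{Y}_j = \dfrac{\sum_{i=1}^{n_j} Y_{ij}}{n_j} \qquad s_j^2 = \begin{cases} \dfrac{\sum_{i=1}^{n_j} (Y_{ij} - \bar{Y}_j)^2}{n_j - 1} & n_j > 1 \\ 0 & \text{otherwise} \end{cases}$

2) $\hat{Y}_j = \hat{\beta}_0 + \hat{\beta}_1 X_j$

3) $SS(LF) = \sum_{j=1}^{c}\sum_{i=1}^{n_j} (\bar{Y}_j - \hat{Y}_j)^2 = \sum_{j=1}^{c} n_j (\bar{Y}_j - \hat{Y}_j)^2 \qquad SS(PE) = \sum_{j=1}^{c}\sum_{i=1}^{n_j} (Y_{ij} - \bar{Y}_j)^2 = \sum_{j=1}^{c} (n_j - 1) s_j^2$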
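The computing strategy for the lack-of-fit test can be sketched in base R. The data below are simulated with replicates at each X level (they are not the NBA data), and all object names are illustrative.

```r
# Simulated data with 5 replicates at each of 10 distinct X levels.
set.seed(42)
x <- rep(70:79, times = 5)
y <- -280 + 6.3 * x + rnorm(length(x), sd = 15)

fit <- lm(y ~ x)
b   <- unname(coef(fit))

# 1) group sizes, means, and variances (variance set to 0 when n_j = 1)
nj   <- tapply(y, x, length)
ybar <- tapply(y, x, mean)
s2   <- tapply(y, x, var)
s2[is.na(s2)] <- 0

# 2) fitted value at each distinct X level
xj   <- as.numeric(names(ybar))
yhat <- b[1] + b[2] * xj

# 3) lack-of-fit and pure-error sums of squares
ss_lf <- sum(nj * (ybar - yhat)^2)
ss_pe <- sum((nj - 1) * s2)

n     <- length(y)
c_lev <- length(nj)                          # number of distinct X levels
f_lof <- (ss_lf / (c_lev - 2)) / (ss_pe / (n - c_lev))
p_val <- pf(f_lof, c_lev - 2, n - c_lev, lower.tail = FALSE)
c(SSLF = ss_lf, SSPE = ss_pe, F = f_lof, P = p_val)
```

The same F statistic comes out of `anova(fit, lm(y ~ factor(x)))`, which compares the reduced (straight-line) and full (separate means) models directly.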
Height and Weight Data – n=505, c=18 Groups

Height    n      Mean      SD       Y-hat     SSLF        SSPE        SSE
69        2      182.50    3.54     156.95    1305.39     12.50       1317.89
71        4      175.75    15.52    169.61    150.62      722.75      873.37
72        13     181.00    13.00    175.94    332.27      2028.00     2360.27
73        16     186.13    12.09    182.28    237.15      2191.75     2428.90
74        21     183.33    9.26     188.61    583.79      1716.67     2300.45
75        41     193.71    11.58    194.94    61.96       5360.49     5422.44
76        32     200.84    11.96    201.27    5.74        4434.22     4439.96
77        31     204.13    10.70    207.60    373.06      3433.48     3806.55
78        43     211.00    12.83    213.93    368.86      6912.00     7280.86
79        49     221.35    18.70    220.26    57.94       16781.10    16839.04
80        46     227.33    15.13    226.59    24.90       10300.11    10325.01
81        67     232.49    19.63    232.92    12.30       25430.75    25443.05
82        53     241.49    14.79    239.25    265.64      11369.25    11634.88
83        44     245.66    17.55    245.58    0.26        13241.89    13242.14
84        34     254.62    14.70    251.91    248.66      7128.03     7376.69
85        7      247.86    10.75    258.24    755.21      692.86      1448.07
86        1      278.00    0.00     264.57    180.24      0.00        180.24
87        1      263.00    0.00     270.91    62.50       0.00        62.50
Sum       505                                 5026.479    111755.8    116782.3

Source       df     SS          MS       F(LOF)    F(.95)    P-value
LackFit      16     5026.5      314.2    1.369     1.664     0.1521
PureError    487    111755.8    229.5

Do not reject H0: μj = β0 + β1Xj
Box-Cox Transformations
• Automatically selects a transformation from power family with goal of obtaining: normality, linearity, and constant variance (not always successful, but widely used)
• Goal: Fit model: Y’ = β0 + β1X + ε for various power transformations on Y, and select the transformation producing minimum SSE (maximum likelihood)
• Procedure: over a range of λ from, say, -2 to +2, obtain Wi and regress Wi on X (assuming all Yi > 0, although adding a constant won’t affect shape or spread of Y distribution)

$$W_i = \begin{cases} K_1 (Y_i^{\lambda} - 1) & \lambda \neq 0 \\ K_2 \ln(Y_i) & \lambda = 0 \end{cases} \qquad K_2 = \left(\prod_{i=1}^{n} Y_i\right)^{1/n} \qquad K_1 = \dfrac{1}{\lambda K_2^{\lambda - 1}}$$
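The grid-search procedure can be sketched in base R as follows. The helper names `boxcox_w` and `boxcox_sse` are hypothetical, and the data are simulated so that log(y) is linear in x (they are not the NBA data); this is an illustration of the procedure, not the code behind the slide's plot.

```r
# Standardized Box-Cox variable W for a given lambda (all y must be > 0).
boxcox_w <- function(y, lambda) {
  k2 <- exp(mean(log(y)))                    # K2: geometric mean of Y
  if (abs(lambda) < 1e-8) return(k2 * log(y))
  k1 <- 1 / (lambda * k2^(lambda - 1))       # K1 = 1 / (lambda * K2^(lambda - 1))
  k1 * (y^lambda - 1)
}

# SSE from regressing W on x, for each lambda on a grid.
boxcox_sse <- function(y, x, lambdas = seq(-2, 2, by = 0.1)) {
  sse <- sapply(lambdas, function(l) sum(resid(lm(boxcox_w(y, l) ~ x))^2))
  data.frame(lambda = lambdas, sse = sse)
}

# Simulated example: log(y) is linear in x with normal errors.
set.seed(7)
x <- runif(300, 69, 87)
y <- exp(3.08 + 0.029 * x + rnorm(300, sd = 0.07))
out <- boxcox_sse(y, x)
out$lambda[which.min(out$sse)]   # lambda with the smallest SSE on the grid
```

Scaling by K1 and K2 puts the SSE values for different λ on a comparable footing, which is why they can be compared directly across the grid.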
Box-Cox Transformation – Obtained in R
Maximum occurs near λ = 0 (Interval Contains 0) – Try taking logs of Weight
Results of Tests (Using R Functions) on ln(WT)

> nba.mod2 <- lm(log(Weight) ~ Height)
> summary(nba.mod2)

Call:
lm(formula = log(Weight) ~ Height)

Coefficients:
              Est      Std. Error   t value   Pr(>|t|)
(Intercept)   3.0781   0.0696       44.20     <2e-16
Height        0.0292   0.0009       33.22     <2e-16

Residual standard error: 0.06823 on 503 degrees of freedom
Multiple R-squared: 0.6869, Adjusted R-squared: 0.6863
F-statistic: 1104 on 1 and 503 DF, p-value: < 2.2e-16

Normality of Errors (Shapiro-Wilk Test)
> e2 <- resid(nba.mod2)   # assumed definition of e2; not shown on the slide
> shapiro.test(e2)
        Shapiro-Wilk normality test
data: e2
W = 0.9976, p-value = 0.679

Constant Error Variance (Breusch-Pagan Test)
> library(lmtest)   # bptest() is provided by the lmtest package
> bptest(log(Weight) ~ Height,studentize=FALSE)
        Breusch-Pagan test
data: log(Weight) ~ Height
BP = 0.4711, df = 1, p-value = 0.4925

Linearity of Regression (Lack of Fit Test)
> nba.mod3 <- lm(log(Weight) ~ factor(Height))
> anova(nba.mod2,nba.mod3)
Analysis of Variance Table

Model 1: log(Weight) ~ Height
Model 2: log(Weight) ~ factor(Height)
  Res.Df    RSS Df Sum of Sq     F Pr(>F)
1    503 2.3414
2    487 2.2478 16  0.093642 1.268 0.2131
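Model fits well on all assumptions: with ln(Weight) as the response, none of the three tests rejects its null hypothesis (normality P = 0.679, constant variance P = 0.4925, linearity P = 0.2131).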