diagnostics and remedial measures
![Page 1: Diagnostics and Remedial Measures](https://reader035.vdocuments.site/reader035/viewer/2022081504/568138e9550346895da09bd6/html5/thumbnails/1.jpg)
Diagnostics and Remedial Measures
KNNL – Chapter 3
![Page 2: Diagnostics and Remedial Measures](https://reader035.vdocuments.site/reader035/viewer/2022081504/568138e9550346895da09bd6/html5/thumbnails/2.jpg)
Diagnostics for Predictor Variables
• Problems can occur when:
  - Outliers exist among X levels
  - X levels are associated with run order when experiment is run sequentially
• Useful plots of X levels:
  - Dot plot for discrete data
  - Histogram
  - Box Plot
  - Sequence Plot (X versus Run #)
![Page 3: Diagnostics and Remedial Measures](https://reader035.vdocuments.site/reader035/viewer/2022081504/568138e9550346895da09bd6/html5/thumbnails/3.jpg)
Residuals
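$$e_i = Y_i - \hat{Y}_i = Y_i - (b_0 + b_1 X_i)$$
Note this is different from the model error $\varepsilon_i = Y_i - E\{Y_i\} = Y_i - (\beta_0 + \beta_1 X_i)$
Properties of Residuals:
• Mean: $\bar{e} = \frac{\sum_{i=1}^{n} e_i}{n} = 0$
• Variance: $s^2 = \frac{\sum_{i=1}^{n} (e_i - \bar{e})^2}{n-2} = \frac{\sum_{i=1}^{n} e_i^2}{n-2} = \frac{SSE}{n-2} = MSE$; Correct Model $\Rightarrow E\{MSE\} = \sigma^2$
• Nonindependence: Residuals are not independent (based on same fitted regression). 2 Constraints: $\sum_{i=1}^{n} e_i = \sum_{i=1}^{n} X_i e_i = 0$. Not a problem if number of observations is large compared to parameters in model
Semistudentized Residuals:
$$e_i^* = \frac{e_i - \bar{e}}{\sqrt{MSE}} = \frac{e_i}{\sqrt{MSE}} \quad \text{where } \sqrt{MSE} \text{ approximates } s\{e_i\}$$
$e_i^*$ is like a $t$-statistic and can be used to detect outliers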
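As a rough illustration of the residual quantities on this slide, the sketch below fits a simple linear regression by least squares and computes the residuals, MSE, and semistudentized residuals. The data values and the function names (`fit_simple_ols`, `residual_summary`) are made up for the example and do not come from KNNL.

```python
import math

def fit_simple_ols(x, y):
    """Least-squares intercept b0 and slope b1 for Y = b0 + b1*X."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def residual_summary(x, y):
    """Return residuals e_i, MSE = SSE/(n-2), and e_i* = e_i / sqrt(MSE)."""
    b0, b1 = fit_simple_ols(x, y)
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    mse = sum(ei ** 2 for ei in e) / (len(e) - 2)
    e_star = [ei / math.sqrt(mse) for ei in e]
    return e, mse, e_star

# Made-up illustrative data (assumption, not from the text)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 15.0]
e, mse, e_star = residual_summary(x, y)

# The two constraints from the slide: sum(e_i) = 0 and sum(X_i * e_i) = 0
constraint_1 = sum(e)
constraint_2 = sum(xi * ei for xi, ei in zip(x, e))
```

Both constraint sums should be zero up to floating-point rounding, which is why the residuals cannot be fully independent.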
![Page 4: Diagnostics and Remedial Measures](https://reader035.vdocuments.site/reader035/viewer/2022081504/568138e9550346895da09bd6/html5/thumbnails/4.jpg)
Model Departures Detected With Residuals and Plots
• Relation between Y and X is not linear
• Errors have non-constant variance
• Errors are not independent
• Existence of Outlying Observations
• Non-normal Errors
• Missing predictor variable(s)
• Common Plots
  - Residuals/Absolute Residuals versus Predictor Variable
  - Residuals/Absolute Residuals versus Predicted Values
  - Residuals versus Omitted variables
  - Residuals versus Time
  - Box Plots, Histograms, Normal Probability Plots
![Page 5: Diagnostics and Remedial Measures](https://reader035.vdocuments.site/reader035/viewer/2022081504/568138e9550346895da09bd6/html5/thumbnails/5.jpg)
Detecting Nonlinearity of Regression Function
• Plot Residuals versus X
  - Random Cloud around 0 ⇒ Linear Relation
  - U-Shape or Inverted U-Shape ⇒ Nonlinear Relation
• Maps Distribution Example (Table 3.1, Figure 3.5, p.10)
![Page 6: Diagnostics and Remedial Measures](https://reader035.vdocuments.site/reader035/viewer/2022081504/568138e9550346895da09bd6/html5/thumbnails/6.jpg)
Non-Constant Error Variance / Outliers / Non-Independence
• Plot Residuals versus X or Predicted Values
  - Random Cloud around 0 ⇒ Linear Relation
  - Funnel Shape ⇒ Non-constant Variance
  - Outliers fall far above (positive) or below (negative) the general cloud pattern
  - Plot absolute residuals, squared residuals, or square root of absolute residuals: Positive Association ⇒ Non-constant Variance
• Measurements made over time: Plot Residuals versus Time Order (Expect Random Cloud if independent)
  - Linear Trend ⇒ Process “improving” or “worsening” over time
  - Cyclical Trend ⇒ Measurements close in time are similar
![Page 7: Diagnostics and Remedial Measures](https://reader035.vdocuments.site/reader035/viewer/2022081504/568138e9550346895da09bd6/html5/thumbnails/7.jpg)
Non-Normal Errors
• Box-Plot of Residuals – Can confirm symmetry and lack of outliers
• Check Proportion that lie within 1 standard deviation from 0, 2 SD, etc., where SD = sqrt(MSE)
• Normal probability plot of residuals versus expected values under normality – should fall approximately on a straight line (Only works well with moderate to large samples). In R: qqnorm(e); qqline(e)
Expected value of Residuals under Normality:
1) Rank residuals from smallest (large/negative) to highest (large/positive): Rank = $k$
2) Compute the percentile using $p = \frac{k - 0.375}{n + 0.25}$ and obtain the corresponding $z$-value: $z(p)$
3) Multiply by $s = \sqrt{MSE}$: expected residual = $\sqrt{MSE}\; z(p)$
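The three steps for the expected residuals can be sketched in Python's standard library as follows. The function name `expected_normal_residuals` and the sample residuals are illustrative assumptions. Ties are ranked in arbitrary order, a simplification the slide does not address.

```python
import math
from statistics import NormalDist

def expected_normal_residuals(e):
    """Expected residual sqrt(MSE) * z(p_k) for each residual, in input order.

    Assumes e holds residuals from a simple linear regression, so
    MSE = sum(e_i^2) / (n - 2).
    """
    n = len(e)
    s = math.sqrt(sum(ei ** 2 for ei in e) / (n - 2))
    # Step 1: rank residuals from smallest to largest (rank k = 1..n)
    order = sorted(range(n), key=lambda i: e[i])
    expected = [0.0] * n
    for k, idx in enumerate(order, start=1):
        # Step 2: percentile and corresponding standard normal quantile
        p = (k - 0.375) / (n + 0.25)
        z = NormalDist().inv_cdf(p)
        # Step 3: scale by s = sqrt(MSE)
        expected[idx] = s * z
    return expected

# Made-up residuals for illustration (they sum to zero)
e_demo = [-1.2, 0.4, 2.0, -0.7, -0.5]
expected_demo = expected_normal_residuals(e_demo)
```

Plotting each residual against its expected value gives the normal probability plot described above.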
![Page 8: Diagnostics and Remedial Measures](https://reader035.vdocuments.site/reader035/viewer/2022081504/568138e9550346895da09bd6/html5/thumbnails/8.jpg)
Omission of Important Predictors
• If data are available on other variables, plots of residuals versus X, separate for each level or sub-ranges of the new variable(s) can lead to adding new predictor variable(s)
• For instance, if a regression of salary on experience was fit, we could view residuals versus experience for males and females separately. If one group tends to have positive residuals and the other negative, then add gender to the model
![Page 9: Diagnostics and Remedial Measures](https://reader035.vdocuments.site/reader035/viewer/2022081504/568138e9550346895da09bd6/html5/thumbnails/9.jpg)
Test for Independence – Runs Test
• Runs Test (Presumes data are in time order)
1) Write out the sequence of +/- signs of the residuals
2) Count n1 = # of positive residuals, n2 = # of negative residuals
3) Count u = # of “runs” of positive and negative residuals
4) If n1 + n2 ≤ 20, refer to Table of critical values (Not random if u is too small)
5) If n1 + n2 > 20, use a large-sample (approximate) z-test:
Under Independence:
$$E\{u\} = \mu_u = \frac{2 n_1 n_2}{n_1 + n_2} + 1 \qquad \sigma^2\{u\} = \sigma_u^2 = \frac{2 n_1 n_2 \,(2 n_1 n_2 - n_1 - n_2)}{(n_1 + n_2)^2 \,(n_1 + n_2 - 1)}$$
$$z_u = \frac{u - \mu_u + 0.5}{\sigma_u} \qquad \text{P-value} = P(Z \le z_u)$$
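A minimal sketch of the large-sample runs test, assuming the residuals are already in time order. The function name `runs_test` and the demo sequence are made up for illustration. Residuals exactly equal to zero are dropped, a choice the slide does not specify.

```python
import math
from statistics import NormalDist

def runs_test(e):
    """Large-sample runs test on time-ordered residuals.

    Returns (u, mu_u, var_u, z_u, p_value). Requires at least one
    positive and one negative residual.
    """
    signs = [ei > 0 for ei in e if ei != 0]  # assumption: zeros are dropped
    n1 = sum(signs)                # number of positive residuals
    n2 = len(signs) - n1           # number of negative residuals
    # A new run starts whenever the sign changes
    u = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    mu_u = 2 * n1 * n2 / (n1 + n2) + 1
    var_u = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
             / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    z_u = (u - mu_u + 0.5) / math.sqrt(var_u)
    p_value = NormalDist().cdf(z_u)  # lower tail: too few runs is suspicious
    return u, mu_u, var_u, z_u, p_value

# Made-up sequence: 6 positive, 6 negative, 6 positive, 6 negative (4 runs)
e_demo = [1.0] * 6 + [-1.0] * 6 + [1.0] * 6 + [-1.0] * 6
u, mu_u, var_u, z_u, p_value = runs_test(e_demo)
```

With only 4 runs where about 13 are expected, the small lower-tail P-value points toward positively autocorrelated errors.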
![Page 10: Diagnostics and Remedial Measures](https://reader035.vdocuments.site/reader035/viewer/2022081504/568138e9550346895da09bd6/html5/thumbnails/10.jpg)
Test For Independence - Durbin-Watson Test
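$$Y_t = \beta_0 + \beta_1 X_t + \varepsilon_t \qquad \varepsilon_t = \rho\,\varepsilon_{t-1} + u_t \qquad u_t \sim NID(0, \sigma^2) \qquad |\rho| < 1$$
$H_0: \rho = 0 \Rightarrow$ Errors are uncorrelated over time
$H_A: \rho > 0 \Rightarrow$ Positively correlated
1) Obtain Residuals from Regression
2) Compute Durbin-Watson Statistic (given below)
3) Obtain Critical Values from Table B.7, pp. 1330-1331
  - If $DW < d_L(1, n)$, Reject $H_0$
  - If $DW > d_U(1, n)$, Conclude $H_0$
  - Otherwise Inconclusive
Test Statistic:
$$DW = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}$$
Note: This generalizes to any number of Predictors $(p - 1)$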
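The Durbin-Watson statistic is simple to compute directly. In the sketch below, `durbin_watson` and `dw_decision` are hypothetical helper names, and the bounds passed to `dw_decision` are placeholders for illustration only. In practice `d_lower` and `d_upper` must be looked up in the critical-value table for the actual n and number of predictors.

```python
def durbin_watson(e):
    """DW = sum_{t=2..n} (e_t - e_{t-1})^2 / sum_{t=1..n} e_t^2."""
    numerator = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    denominator = sum(et ** 2 for et in e)
    return numerator / denominator

def dw_decision(dw, d_lower, d_upper):
    """Apply the three-way rule for H0: rho = 0 versus HA: rho > 0."""
    if dw < d_lower:
        return "reject H0"
    if dw > d_upper:
        return "conclude H0"
    return "inconclusive"

# Made-up residual sequences (time-ordered)
dw_alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])   # signs flip every step
dw_persistent = durbin_watson([1.0, 1.0, -1.0, -1.0])    # neighbors look alike

# Placeholder bounds, NOT values from Table B.7
decision = dw_decision(dw_persistent, d_lower=1.1, d_upper=1.4)
```

Small DW values arise when successive residuals are similar, which is the signature of positive autocorrelation.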
![Page 11: Diagnostics and Remedial Measures](https://reader035.vdocuments.site/reader035/viewer/2022081504/568138e9550346895da09bd6/html5/thumbnails/11.jpg)
Test for Normality of Residuals
• Correlation Test
1) Obtain correlation between observed residuals and expected values under normality (see slide 7)
2) Compare correlation with critical value based on α-level from Table B.6, page 1329. A good approximation for the critical value is: 1.02 - 1/sqrt(10n)
3) Reject the null hypothesis of normal errors if the correlation falls below the table value
• Shapiro-Wilk Test – Performed by most software packages. Related to correlation test, but more complex calculations – see NFL Point Spreads and Actual Scores Case Study for description of one version
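The correlation test can be sketched as below, using the (k - 0.375)/(n + 0.25) percentiles for the expected values and the approximation 1.02 - 1/sqrt(10n) quoted above. The slide does not say which α-level that approximation corresponds to, so treat it as a rough screen and use the table for a formal test. The function name and demo data are assumptions.

```python
import math
from statistics import NormalDist

def normality_correlation(e):
    """Correlation between ordered residuals and their expected values
    under normality, plus the approximate critical value from the slide."""
    n = len(e)
    s = math.sqrt(sum(ei ** 2 for ei in e) / (n - 2))  # sqrt(MSE), simple regression
    observed = sorted(e)
    expected = [s * NormalDist().inv_cdf((k - 0.375) / (n + 0.25))
                for k in range(1, n + 1)]
    mean_o = sum(observed) / n
    mean_x = sum(expected) / n
    sxy = sum((o - mean_o) * (x - mean_x) for o, x in zip(observed, expected))
    sxx = sum((x - mean_x) ** 2 for x in expected)
    syy = sum((o - mean_o) ** 2 for o in observed)
    r = sxy / math.sqrt(sxx * syy)
    approx_critical = 1.02 - 1 / math.sqrt(10 * n)
    return r, approx_critical

# Demo: "residuals" that are exact normal quantiles, so r should be about 1
e_demo = [NormalDist().inv_cdf((k - 0.375) / 20.25) for k in range(1, 21)]
r, approx_critical = normality_correlation(e_demo)
reject_normality = r < approx_critical
```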
![Page 12: Diagnostics and Remedial Measures](https://reader035.vdocuments.site/reader035/viewer/2022081504/568138e9550346895da09bd6/html5/thumbnails/12.jpg)
Equal (Homogeneous) Variance - I
Brown-Forsythe Test:
$H_0: \sigma_i^2 = \sigma^2$ Equal Variance Among Errors
$H_A:$ Unequal Variance Among Errors (Increasing or Decreasing in $X$)
1) Split Dataset into 2 groups based on levels of $X$ (or fitted values) with sample sizes: $n_1, n_2$
2) Compute the median residual in each group: $\tilde{e}_1, \tilde{e}_2$
3) Compute absolute deviation from group median for each residual: $d_{ij} = |e_{ij} - \tilde{e}_j|, \quad i = 1, \ldots, n_j, \quad j = 1, 2$
4) Compute the mean and variance for each group of $d_{ij}$: $\bar{d}_1, s_1^2, \bar{d}_2, s_2^2$
5) Compute the pooled variance: $s^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}$
Test Statistic:
$$t_{BF} = \frac{\bar{d}_1 - \bar{d}_2}{s \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \;\overset{H_0}{\sim}\; t(n_1 + n_2 - 2)$$
Reject $H_0$ if $|t_{BF}| \ge t(1 - \alpha/2;\; n_1 + n_2 - 2)$
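A sketch of the Brown-Forsythe computation, assuming the residuals have already been split into two groups (for example, low-X and high-X halves). The function name and the demo residuals are made up. The critical value still has to come from a t table or statistical software, since the Python standard library has no t quantile function.

```python
import math
from statistics import mean, median, variance

def brown_forsythe(e_group1, e_group2):
    """Return (t_BF, degrees of freedom) for two groups of residuals."""
    med1, med2 = median(e_group1), median(e_group2)
    # Absolute deviations from each group's median residual
    d1 = [abs(ei - med1) for ei in e_group1]
    d2 = [abs(ei - med2) for ei in e_group2]
    n1, n2 = len(d1), len(d2)
    # statistics.variance is the sample variance (divides by n - 1)
    pooled = ((n1 - 1) * variance(d1) + (n2 - 1) * variance(d2)) / (n1 + n2 - 2)
    t_bf = (mean(d1) - mean(d2)) / math.sqrt(pooled * (1 / n1 + 1 / n2))
    return t_bf, n1 + n2 - 2

# Made-up residuals: the second group is four times as spread out
low_x = [-1.0, -0.5, 0.0, 0.5, 1.0]
high_x = [-4.0, -2.0, 0.0, 2.0, 4.0]
t_bf, df = brown_forsythe(low_x, high_x)
```

A large negative t_BF here reflects larger absolute deviations in the second group, that is, error variance increasing with X.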
![Page 13: Diagnostics and Remedial Measures](https://reader035.vdocuments.site/reader035/viewer/2022081504/568138e9550346895da09bd6/html5/thumbnails/13.jpg)
Equal (Homogeneous) Variance - II
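Breusch-Pagan (aka Cook-Weisberg) Test:
$H_0: \sigma_i^2 = \sigma^2$ Equal Variance Among Errors
$H_A: \sigma_i^2 = \sigma^2\, h(\gamma_1 X_{i1} + \ldots + \gamma_p X_{ip})$ Unequal Variance Among Errors
1) Let $SSE = \sum_{i=1}^{n} e_i^2$ from original regression
2) Fit Regression of $e_i^2$ on $X_{i1}, \ldots, X_{ip}$ and obtain $SS(\text{Reg}^*)$
Test Statistic:
$$X^2_{BP} = \frac{SS(\text{Reg}^*)/2}{\left(\sum_{i=1}^{n} e_i^2 / n\right)^2} \;\overset{H_0}{\sim}\; \chi^2_p$$
Reject $H_0$ if $X^2_{BP} \ge \chi^2(1 - \alpha;\; p)$, where $p$ = # of predictors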
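For the single-predictor case (p = 1), the Breusch-Pagan statistic can be sketched without any external library. The function name `breusch_pagan_simple` and the demo values are assumptions. The P-value uses the fact that a chi-square variable with 1 degree of freedom is a squared standard normal, so this shortcut does not extend to p > 1.

```python
import math

def breusch_pagan_simple(x, e):
    """Breusch-Pagan statistic and P-value for one predictor.

    x: predictor values; e: residuals from the original regression.
    """
    n = len(e)
    e2 = [ei ** 2 for ei in e]
    x_bar = sum(x) / n
    e2_bar = sum(e2) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (vi - e2_bar) for xi, vi in zip(x, e2))
    # Regression sum of squares from regressing e_i^2 on X_i
    ss_reg_star = sxy ** 2 / sxx
    sse = sum(e2)
    x2_bp = (ss_reg_star / 2) / (sse / n) ** 2
    # Upper tail of chi-square with 1 df: P(Z^2 > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(x2_bp / 2))
    return x2_bp, p_value

# Made-up values: residual magnitude grows with X
x_demo = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
e_demo = [0.1, -0.2, 0.5, -0.9, 1.5, -2.0]
x2_bp, p_value = breusch_pagan_simple(x_demo, e_demo)

# Constant squared residuals give a statistic of exactly zero
x2_flat, p_flat = breusch_pagan_simple([1.0, 2.0, 3.0, 4.0], [1.0, -1.0, 1.0, -1.0])
```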
![Page 14: Diagnostics and Remedial Measures](https://reader035.vdocuments.site/reader035/viewer/2022081504/568138e9550346895da09bd6/html5/thumbnails/14.jpg)
Linearity of Regression
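$F$-Test for Lack-of-Fit ($n$ observations at $c$ distinct levels of "$X$")
$H_0: E\{Y_i\} = \beta_0 + \beta_1 X_i \qquad H_A: E\{Y_i\} = \mu_i \ne \beta_0 + \beta_1 X_i$
Compute fitted value $\hat{Y}_j$ and sample mean $\bar{Y}_j$ for each distinct $X$ level
Lack-of-Fit: $SS(LF) = \sum_{j=1}^{c} \sum_{i=1}^{n_j} (\bar{Y}_j - \hat{Y}_j)^2 \qquad df_{LF} = c - 2$
Pure Error: $SS(PE) = \sum_{j=1}^{c} \sum_{i=1}^{n_j} (Y_{ij} - \bar{Y}_j)^2 \qquad df_{PE} = n - c$
Test Statistic:
$$F_{LOF} = \frac{SS(LF)/(c - 2)}{SS(PE)/(n - c)} = \frac{MS(LF)}{MS(PE)} \;\overset{H_0}{\sim}\; F_{c-2,\, n-c}$$
Reject $H_0$ if $F_{LOF} \ge F(1 - \alpha;\; c - 2,\; n - c)$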
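The lack-of-fit decomposition can be sketched as follows. The function name and data are made up for illustration. Observations are grouped by exact equality of X, which assumes replicates were recorded with identical X values. The test also needs c > 2 distinct levels and at least one replicated level.

```python
from collections import defaultdict

def lack_of_fit(x, y):
    """Return SSE, SS(LF), SS(PE), F_LOF and its degrees of freedom."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

    groups = defaultdict(list)          # Y values at each distinct X level
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    c = len(groups)

    ss_pe = 0.0                         # variation around the group means
    ss_lf = 0.0                         # group means versus the fitted line
    for xj, ys in groups.items():
        group_mean = sum(ys) / len(ys)
        ss_pe += sum((yi - group_mean) ** 2 for yi in ys)
        ss_lf += len(ys) * (group_mean - (b0 + b1 * xj)) ** 2

    f_lof = (ss_lf / (c - 2)) / (ss_pe / (n - c))
    return {"sse": sse, "ss_lf": ss_lf, "ss_pe": ss_pe,
            "f_lof": f_lof, "df": (c - 2, n - c)}

# Made-up data: roughly quadratic in X, two replicates per level
x_demo = [1, 1, 2, 2, 3, 3, 4, 4]
y_demo = [1.0, 1.2, 3.9, 4.1, 9.2, 8.8, 15.9, 16.1]
result = lack_of_fit(x_demo, y_demo)
```

A useful check is that SSE from the linear fit equals SS(LF) + SS(PE).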
![Page 15: Diagnostics and Remedial Measures](https://reader035.vdocuments.site/reader035/viewer/2022081504/568138e9550346895da09bd6/html5/thumbnails/15.jpg)
Remedial Measures
• Nonlinear Relation – Add polynomials, fit exponential regression function, or transform Y and/or X
• Non-Constant Variance – Weighted Least Squares, transform Y and/or X, or fit Generalized Linear Model
• Non-Independence of Errors – Transform Y or use Generalized Least Squares
• Non-Normality of Errors – Box-Cox transformation, or fit Generalized Linear Model
• Omitted Predictors – Include important predictors in a multiple regression model
• Outlying Observations – Robust Estimation
![Page 16: Diagnostics and Remedial Measures](https://reader035.vdocuments.site/reader035/viewer/2022081504/568138e9550346895da09bd6/html5/thumbnails/16.jpg)
Transformations for Non-Linearity – Constant Variance
X’ = √X    X’ = ln(X)    X’ = X²    X’ = e^X    X’ = 1/X    X’ = e^(−X)
![Page 17: Diagnostics and Remedial Measures](https://reader035.vdocuments.site/reader035/viewer/2022081504/568138e9550346895da09bd6/html5/thumbnails/17.jpg)
Transformations for Non-Linearity – Non-Constant Variance
Y’ = √Y Y’ = ln(Y) Y’ = 1/Y
![Page 18: Diagnostics and Remedial Measures](https://reader035.vdocuments.site/reader035/viewer/2022081504/568138e9550346895da09bd6/html5/thumbnails/18.jpg)
Box-Cox Transformations
• Automatically selects a transformation from power family with goal of obtaining: normality, linearity, and constant variance (not always successful, but widely used)
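• Goal: Fit model: $Y' = \beta_0 + \beta_1 X + \varepsilon$ for various power transformations on Y ($Y' = Y^\lambda$), selecting the transformation that produces minimum SSE (maximum likelihood)
• Procedure: over a range of $\lambda$ from, say, -2 to +2, obtain $W_i$ and regress $W_i$ on X (assuming all $Y_i > 0$, although adding a constant won’t affect shape or spread of Y distribution)
$$W_i = \begin{cases} K_1 \left(Y_i^{\lambda} - 1\right) & \lambda \ne 0 \\ K_2 \ln(Y_i) & \lambda = 0 \end{cases} \qquad K_2 = \left(\prod_{i=1}^{n} Y_i\right)^{1/n} \qquad K_1 = \frac{1}{\lambda K_2^{\lambda - 1}}$$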
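The Box-Cox search can be sketched as a grid search: for each candidate power, standardize Y into W, regress W on X, and keep the power with the smallest SSE. The function names, the grid, and the demo data are assumptions for illustration. The demo Y values are constructed so that the square root of Y is exactly linear in X.

```python
import math

def box_cox_sse(x, y, lam):
    """SSE from regressing the standardized Box-Cox variable W on X."""
    n = len(y)
    k2 = math.exp(sum(math.log(yi) for yi in y) / n)   # geometric mean of Y
    if lam == 0:
        w = [k2 * math.log(yi) for yi in y]
    else:
        k1 = 1 / (lam * k2 ** (lam - 1))
        w = [k1 * (yi ** lam - 1) for yi in y]
    x_bar = sum(x) / n
    w_bar = sum(w) / n
    b1 = (sum((xi - x_bar) * (wi - w_bar) for xi, wi in zip(x, w))
          / sum((xi - x_bar) ** 2 for xi in x))
    return sum((wi - w_bar - b1 * (xi - x_bar)) ** 2 for xi, wi in zip(x, w))

def best_lambda(x, y, grid):
    """Power in the grid giving the minimum SSE."""
    return min(grid, key=lambda lam: box_cox_sse(x, y, lam))

# Made-up data with sqrt(Y) = 2 + X exactly (all Y > 0, as required)
x_demo = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y_demo = [(2 + xi) ** 2 for xi in x_demo]
grid = [k / 2 for k in range(-4, 5)]    # -2.0, -1.5, ..., 2.0
chosen = best_lambda(x_demo, y_demo, grid)
```

The K1 and K2 constants put every W on the scale of Y, which is what makes the SSE values comparable across powers.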
![Page 19: Diagnostics and Remedial Measures](https://reader035.vdocuments.site/reader035/viewer/2022081504/568138e9550346895da09bd6/html5/thumbnails/19.jpg)
Lowess (Smoothed) Plots
• Nonparametric method of obtaining a smooth plot of the regression relation between Y and X
• Fits regression in small neighborhoods around points along the regression line on the X axis
• Weights observations closer to the specific point higher than more distant points
• Re-weights after fitting, putting lower weights on larger residuals (in absolute value)
• Obtains fitted value for each point after “final” regression is fit
• Model is plotted along with linear fit and confidence bands; linear fit is good if lowess lies within bands
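A simplified lowess sketch following the steps above: local weighted linear fits, then re-weighting by residual size. The tricube distance weights and bisquare robustness weights are the customary choices for lowess but are not stated on the slide. The `frac` and `iterations` parameters, the function names, and the data are assumptions. This is not a substitute for a library implementation.

```python
import math
from statistics import median

def weighted_line_at(x, y, w, x0):
    """Fit a weighted least-squares line and return its value at x0."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    if sxx == 0:                       # all weight sits on one X value
        return y_bar
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    return y_bar + (sxy / sxx) * (x0 - x_bar)

def lowess(x, y, frac=0.5, iterations=2):
    """Return a smoothed fitted value at each x[i]."""
    n = len(x)
    k = min(n, max(2, math.ceil(frac * n)))    # neighborhood size
    robust = [1.0] * n
    fitted = [0.0] * n
    for _ in range(iterations + 1):
        for i in range(n):
            dist = [abs(xj - x[i]) for xj in x]
            h = sorted(dist)[k - 1]            # distance to k-th nearest point
            if h == 0:
                local = [1.0 if d == 0 else 0.0 for d in dist]
            else:                              # tricube: closer points weigh more
                local = [(1 - (d / h) ** 3) ** 3 if d < h else 0.0 for d in dist]
            w = [lw * rw for lw, rw in zip(local, robust)]
            if sum(w) == 0:                    # every neighbor was down-weighted
                w = local
            fitted[i] = weighted_line_at(x, y, w, x[i])
        resid = [yi - fi for yi, fi in zip(y, fitted)]
        s = median(abs(r) for r in resid)
        if s == 0:
            break
        # Bisquare: larger residuals get lower weight in the next pass
        robust = [(1 - (r / (6 * s)) ** 2) ** 2 if abs(r) < 6 * s else 0.0
                  for r in resid]
    return fitted

# Made-up data: a line with small alternating noise and one large outlier
x_demo = [float(i) for i in range(20)]
y_demo = [2 + 3 * xi + (0.5 if i % 2 == 0 else -0.5) for i, xi in enumerate(x_demo)]
y_demo[10] += 50
plain_fit = lowess(x_demo, y_demo, iterations=0)    # no re-weighting
robust_fit = lowess(x_demo, y_demo, iterations=2)   # with re-weighting
```

Comparing `plain_fit[10]` with `robust_fit[10]` shows how the re-weighting step keeps a single outlier from pulling the smooth curve toward it.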