lecture 6 regression diagnostics - department of statisticsghobbs/stat_512/... · sas procedures...
Lecture 6
Regression Diagnostics
STAT 512
Spring 2011
Background Reading
KNNL: 3.1-3.6
Topic Overview
Chapter 3 – Diagnostics & Remedial Measures
for Simple Linear Regression
Diagnostics: Look at the data to diagnose
situations where the assumptions of our model are
violated. (Lecture 6)
Remedies: Changes in analytic strategy to fix
these problems. (Lecture 7)
What Do We Need To Check?
Main Assumptions: Errors are independent,
normal random variables with common
variance $\sigma^2$
Does the assumption of linearity make
sense?
Were any important predictors excluded
from the model?
What Do We Need To Check?
Are there “outlying” values for the predictor
variables (X) that could unduly influence
the regression model?
Are there outliers? (Generally the term
outlier refers to a response that is vastly
different from other responses (Y) – see
KNNL pg 108)
How to Get Started? - Look at the Data!
Diagnostics for Predictors (X)
We do not make any specific assumptions
about X. However, understanding what is
going on with X is necessary to
interpreting what is going on with the
response (Y).
So, we can look at some basic summaries of
the X variables to get oriented.
However, we are not checking our
assumptions at this point.
Diagnostics for Predictors (X)
Dot plots, Stem-and-leaf plots, Box plots,
and Histograms can be useful in
identifying potential outlying observations
in X. Note that an outlying observation
will not necessarily create a problem in
the analysis. However, it is a data point
that will probably have higher influence
over the regression estimates.
Sequence plots can be useful for identifying
potential problems with independence.
SAS Procedures
PROC UNIVARIATE for getting basic
statistics and creating histograms for both
response and predictor variables. Check
for outliers, unusual skewness, clumping.
PROC GPLOT to create a scatter plot of Y
against X. Assess linearity visually.
Reminder – Scatterplot
UNIVARIATE Procedure (1)
(06_misc.sas)
PROC UNIVARIATE data=muscle plot;
var age;
histogram age / normal (mu=est sigma=est);
title 'Histogram for Age';
RUN;

Basic Statistical Measures
    Location                    Variability
Mean     59.98333     Std Deviation            11.79700
Median   60.00000     Variance                139.16921
Mode     78.00000     Range                    37.00000
                      Interquartile Range      20.50000
UNIVARIATE Procedure (2)
Stem Leaf # Boxplot
78 00000 5 |
76 0000 4 |
74 0 1 |
72 000 3 |
70 000 3 +-----+
68 000 3 | |
66 0 1 | |
64 0000 4 | |
62 000 3 | |
60 0000 4 *--+--*
58 000 3 | |
56 0000 4 | |
54 000 3 | |
52 000 3 | |
50 0 1 | |
48 00 2 +-----+
46 0000 4 |
44 00 2 |
42 0000 4 |
40 000
----+----+----+----+
UNIVARIATE Procedure (3)
UNIVARIATE Procedure (4)
What if we add in a data point for:
age=100, mmass=40?
Stem Leaf # Boxplot
10 0 1 0
9
9
8
8
7 5666788888 10 |
7 001223 6 +-----+
6 5556889 7 | |
6 00013334 8 *--+--*
5 56777999 8 | |
5 123344 6 +-----+
4 5677788 7 |
4 11122334 8 |
----+----+----+----+
Multiply Stem.Leaf by 10**+1
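One way to reproduce this experiment is sketched below. The data set names extra and muscle2 are hypothetical, chosen only for illustration; the sketch assumes the original data are in muscle with variables age and mmass.

```sas
/* Build a one-row data set holding the added point (hypothetical name: extra). */
DATA extra;
  age = 100;
  mmass = 40;
RUN;

/* Append the new point to a copy of the original data (hypothetical name: muscle2). */
DATA muscle2;
  set muscle extra;
RUN;

/* Re-run the summary; the PLOT option requests the stem-and-leaf and box plots. */
PROC UNIVARIATE data=muscle2 plot;
  var age;
  title 'Age with Added Point (age=100)';
RUN;
```

Working on a copy keeps the original muscle data set unchanged for the later regression steps.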
UNIVARIATE Procedure (5)
Diagnostics for Residuals (1)
Basic Distributional Assumptions on Errors
Model: $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
o Where $\varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)$ (i.e., the $\varepsilon_i$ are
independent, normal, and have constant
variance).
The ei (residuals) should be similar to the εi
How do we check this? Plot the Residuals!
Diagnostics for Residuals (2)
Basic Questions addressed by diagnostics
for residuals
o Is the relationship linear?
o Does the variance depend on X?
o Are the errors normal?
o Are the errors independent?
o Are there outliers?
o Are any important predictors omitted?
Checking Linearity
Plot Y vs. X (scatterplot)
Plot e vs. X (or $\hat{Y}$) - residual plot
Generally can see from a scatter plot when a
relationship is nonlinear
Patterns in residual plots can emphasize
deviations from linear pattern
Checking Constant Variance
Plot e vs. X (or $\hat{Y}$) - residual plot
Patterns suggest issues!
Megaphone shape indicates
increasing/decreasing variance with X
Other shapes can indicate non-linearity
Outliers show up in obvious way
SAS Code
PROC REG data=muscle;
model mmass=age;
output out=diag p=pred r=resid;
RUN;
*Plot residuals vs age;
symbol1 v=dot i=none;
PROC GPLOT data=diag;
plot resid*age;
title 'Residuals for Muscle Mass Data';
run;
Checking for Normality
Plot residuals in a Normal Probability Plot
o Compare residuals to their expected value
under normality (normal quantiles)
o Should be linear IF normal
Plot residuals in a Histogram
PROC UNIVARIATE is used for both of
these
Book shows method to do this by hand –
you do not need to worry about having to
do that.
SAS Code
PROC REG data=muscle;
model mmass=age;
output out=diag p=pred r=resid;
RUN;
*Check normality assumption;
PROC UNIVARIATE data=diag normal;
var resid;
histogram resid /normal(mu=est sigma=est);
qqplot resid /normal;
title 'Check for Normality';
RUN;
Normality Plot
Outliers show up in a quite obvious way.
Non-normal distributions can look very
wacky.
Symmetric / Heavy tailed distributions show
an “S” shape.
Skewed distributions show exponential
looking curves (see figure 3.9)
[Figure: normal probability plot of residuals; vertical axis: Residual, horizontal axis: Normal Quantiles]
Checking Independence
Sequence Plot: Residuals against time/order
Patterns suggest non-independence
See figure 3.8 in KNNL.
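A sequence plot can be sketched as follows, assuming the diag data set created by the PROC REG output statement in this lecture, and assuming that the row order of the data reflects the order in which observations were collected. The data set name diagseq and the variable name order are hypothetical.

```sas
/* Add an observation-order variable; _N_ is the DATA step row counter. */
/* Assumption: rows of diag are stored in collection (time) order.      */
DATA diagseq;
  set diag;
  order = _N_;
RUN;

/* Plot residuals against order, joining points to make runs easier to see. */
symbol1 v=dot i=join;
PROC GPLOT data=diagseq;
  plot resid*order / vref=0;   /* reference line at zero */
  title 'Sequence Plot of Residuals';
RUN;
QUIT;
```

If the data carry no meaningful time or collection order, this plot is not informative and can be skipped.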
Additional Predictors
Plot residuals against other potential
predictors (not predictors from the model)
Patterns indicate an important predictor that
maybe should be in the model.
Example: Suppose we use a muscle mass
dataset that includes both men and women.
Residuals vs Age
Plot looks great, right?
But what happens if we separate male and
female?
PROC GPLOT data=diag;
plot resid*age=gender /overlay;
RUN;
Additional Predictors
Seems like gender is also an important
predictor of muscle mass (note that gender
is categorical, so we’ll have to wait until
later in the semester for further analysis)
For continuous variables, you look for a
linear pattern with a non-zero slope.
Summary of Diagnostic Plots
You will have noticed that the same plots are
used for checking more than one assumption.
These are your basic tools.
o Plot Y vs. X (check for linearity, outliers)
o Plot Residuals vs. X (check for constant
variance, outliers, linearity)
o Normal Probability Plot and/or
Histogram of residuals (normality, outliers)
If it makes sense, consider also doing a
sequence plot of the residuals (independence)
Plots vs. Significance Tests
If you are uncertain what to conclude after
examining the plots, you may additionally wish
to perform hypothesis tests for model
assumptions (normality, homogeneity of
variance, independence).
These tests are not a replacement for the plots,
but rather a supplement to them.
Note of caution: Plots are more likely to
suggest a remedy, whereas significance test
results depend heavily on sample size.
Significance Tests for
Model Assumptions
Constancy of Variance:
o Brown-Forsythe (modified Levene)
o Breusch-Pagan
Normality
o Kolmogorov-Smirnov, etc.
Independence of Errors:
o Durbin-Watson Test
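Two of these tests can be sketched in SAS as follows. This is an illustration under stated assumptions, not necessarily the approach used in the course files: the Durbin-Watson statistic is requested with the DW option of the MODEL statement in PROC REG, and a Brown-Forsythe test is obtained by splitting the residuals into two groups at age 60 (the median age reported earlier) and using HOVTEST=BF in PROC GLM. The data set name bf and the variable name group are hypothetical.

```sas
/* Durbin-Watson: only meaningful if the rows are in time/collection order. */
PROC REG data=muscle;
  model mmass=age / dw;
  output out=diag p=pred r=resid;
RUN;
QUIT;

/* Brown-Forsythe: compare residual spread for low vs. high X.       */
/* Assumption: split at age 60; bf and group are illustrative names. */
DATA bf;
  set diag;
  group = (age > 60);   /* 0 = age <= 60, 1 = age > 60 */
RUN;

PROC GLM data=bf;
  class group;
  model resid = group;
  means group / hovtest=bf;   /* Brown-Forsythe test of equal variances */
RUN;
QUIT;
```

With two groups, the F statistic reported by HOVTEST=BF is the square of the two-sample t statistic on absolute deviations from the group medians, so the two forms of the test lead to the same conclusion.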
Tests for Normality
PROC UNIVARIATE data=diag normal;
var resid;
RUN;
Tests for Normality
Test                  --Statistic---    -----p Value------
Shapiro-Wilk          W     0.979585    Pr < W      0.4112
Kolmogorov-Smirnov    D     0.079433    Pr > D     >0.1500
Cramer-von Mises      W-Sq  0.057805    Pr > W-Sq  >0.2500
Anderson-Darling      A-Sq  0.383556    Pr > A-Sq  >0.2500
Small p-values indicate non-normality
Upcoming in Lecture 7...
Remedial Measures: What to do when there
is a problem with your model assumptions
(KNNL: 3.8-3.11)