6-1

Lecture 6

Regression Diagnostics

STAT 512

Spring 2011

Background Reading

KNNL: 3.1-3.6

6-2

Topic Overview

Chapter 3 – Diagnostics & Remedial Measures

for Simple Linear Regression

Diagnostics: Look at the data to diagnose

situations where the assumptions of our model are

violated. (Lecture 6)

Remedies: Changes in analytic strategy to fix

these problems. (Lecture 7)

6-3

What Do We Need To Check?

Main Assumptions: Errors are independent,

normal random variables with common

variance σ²

Does the assumption of linearity make

sense?

Were any important predictors excluded

from the model?

6-4

What Do We Need To Check?

Are there “outlying” values for the predictor

variables (X) that could unduly influence

the regression model?

Are there outliers? (Generally the term

outlier refers to a response that is vastly

different from other responses (Y) – see

KNNL pg 108)

How to Get Started? - Look at the Data!

6-5

Diagnostics for Predictors (X)

We do not make any specific assumptions

about X. However, understanding what is

going on with X is necessary to

interpreting what is going on with the

response (Y).

So, we can look at some basic summaries of

the X variables to get oriented.

However, we are not checking our

assumptions at this point.

6-6

Diagnostics for Predictors (X)

Dot plots, Stem-and-leaf plots, Box plots,

and Histograms can be useful in

identifying potential outlying observations

in X. Note that just because it is an

outlying observation does not mean it will

create a problem in the analysis. However,

it is a data point that will probably have

more influence on the regression

estimates.

Sequence plots can be useful for identifying

potential problems with independence.

6-7

SAS Procedures

PROC UNIVARIATE for getting basic

statistics and creating histograms for both

response and predictor variables. Check

for outliers, unusual skewness, clumping.

PROC GPLOT to create a scatter plot of X

against Y. Assess linearity visually.
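
As a minimal sketch of the scatter plot step, assuming the muscle data set with variables mmass (response) and age (predictor) used in the rest of this lecture, something like the following should work. The title text is only illustrative, and the i=rl interpolation overlays a least-squares line as a visual reference for judging linearity.

```sas
* Scatter plot of muscle mass (Y) against age (X);
* i=rl overlays the fitted least-squares line as a visual reference;
symbol1 v=circle i=rl;

PROC GPLOT data=muscle;
  plot mmass*age;
  title 'Muscle Mass vs. Age';
RUN;
```

Use i=none instead of i=rl to show the points alone.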

6-8

Reminder – Scatterplot

6-9

UNIVARIATE Procedure (1)

(06_misc.sas)

PROC UNIVARIATE data=muscle plot;

var age;

histogram age / normal (mu=est sigma=est);

title 'Histogram for Age';

RUN;

Basic Statistical Measures

    Location                    Variability
Mean     59.98333     Std Deviation           11.79700
Median   60.00000     Variance               139.16921
Mode     78.00000     Range                   37.00000
                      Interquartile Range     20.50000

6-10


Stem Leaf      #  Boxplot
  78 00000     5     |
  76 0000      4     |
  74 0         1     |
  72 000       3     |
  70 000       3  +-----+
  68 000       3  |     |
  66 0         1  |     |
  64 0000      4  |     |
  62 000       3  |     |
  60 0000      4  *--+--*
  58 000       3  |     |
  56 0000      4  |     |
  54 000       3  |     |
  52 000       3  |     |
  50 0         1  |     |
  48 00        2  +-----+
  46 0000      4     |
  44 00        2     |
  42 0000      4     |
  40 000       3     |
     ----+----+----+----+

6-11


6-12


What if we add in a data point for:

age=100, mmass =40?
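
One way to try this, sketched under the assumption that the original data set is named muscle as elsewhere in this lecture, is to append the new case in a DATA step and rerun PROC UNIVARIATE. The data set names extra and muscle2 are hypothetical and chosen only for illustration.

```sas
* Hypothetical extra observation described on the slide;
DATA extra;
  age = 100;
  mmass = 40;
RUN;

* Append it to a copy so the original muscle data set is left untouched;
DATA muscle2;
  set muscle extra;
RUN;

* Rerun the summaries and the stem-and-leaf / box plot for age;
PROC UNIVARIATE data=muscle2 plot;
  var age;
RUN;
```

Any other variables in muscle will be missing for the appended case.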

Stem Leaf          #  Boxplot
  10 0             1     0
   9
   9
   8
   8
   7 5666788888   10     |
   7 001223        6  +-----+
   6 5556889       7  |     |
   6 00013334      8  *--+--*
   5 56777999      8  |     |
   5 123344        6  +-----+
   4 5677788       7     |
   4 11122334      8     |
     ----+----+----+----+

Multiply Stem.Leaf by 10**+1

6-13


6-14

Diagnostics for Residuals (1)

Basic Distributional Assumptions on Errors

Model: Yi = β0 + β1Xi + εi

o Where εi ~ iid N(0, σ²) (i.e., the εi are

independent, normal, and have constant

variance).

The ei (residuals) should be similar to the εi

How do we check this? Plot the Residuals!

6-15

Diagnostics for Residuals (2)

Basic Questions addressed by diagnostics

for residuals

o Is the relationship linear?

o Does the variance depend on X?

o Are the errors normal?

o Are the errors independent?

o Are there outliers?

o Are any important predictors omitted?

6-16

Checking Linearity

Plot Y vs. X (scatterplot)

Plot e vs. X (or Ŷ) - residual plot

Generally can see from a scatter plot when a

relationship is nonlinear

Patterns in residual plots can emphasize

deviations from linear pattern

6-17

Checking Constant Variance

Plot e vs. X (or Ŷ) - residual plot

Patterns suggest issues!

Megaphone shape indicates

increasing/decreasing variance with X

Other shapes can indicate non-linearity

Outliers show up in obvious way

6-18

SAS Code

PROC REG data=muscle;

model mmass=age;

output out=diag p=pred r=resid;

RUN;

*Plot residuals vs age;

symbol1 v=dot i=none;

PROC GPLOT data=diag;

plot resid*age;

title 'Residuals for Muscle Mass Data';

run;
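
The residuals can also be plotted against the fitted values rather than against X. A minimal sketch, assuming the diag data set and the pred and resid variables created by the OUTPUT statement in the PROC REG step on this slide (the title is illustrative, and vref=0 draws a horizontal reference line at zero):

```sas
* Residuals against fitted values from the diag data set;
symbol1 v=dot i=none;

PROC GPLOT data=diag;
  plot resid*pred / vref=0;
  title 'Residuals vs. Predicted Values';
RUN;
```

In simple linear regression the fitted values are a linear function of X, so this plot shows the same patterns as the plot against age, possibly mirrored if the slope is negative.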

6-23

Checking for Normality

Plot residuals in a Normal Probability Plot

o Compare residuals to their expected value

under normality (normal quantiles)

o Should be linear IF normal

Plot residuals in a Histogram

PROC UNIVARIATE is used for both of

these

Book shows method to do this by hand –

you do not need to worry about having to

do that.

6-24

SAS Code

PROC REG data=muscle;

model mmass=age;

output out=diag p=pred r=resid;

RUN;

*Check normality assumption;

PROC UNIVARIATE data=diag normal;

var resid;

histogram resid /normal(mu=est sigma=est);

qqplot resid /normal;

title 'Check for Normality';

RUN;

6-27

Normality Plot

Outliers show up in a quite obvious way.

Non-normal distributions can look very

wacky.

Symmetric / Heavy tailed distributions show

an “S” shape.

Skewed distributions show exponential

looking curves (see figure 3.9)

6-30

[Figure: normal probability plot of residuals. Vertical axis: Residual; horizontal axis: Normal Quantiles]

6-31

Checking Independence

Sequence Plot: Residuals against time/order

Patterns suggest non-independence

See figure 3.8 in KNNL.
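
A sequence plot could be sketched as follows. It assumes, and this is only an assumption, that the rows of the diag data set from the earlier PROC REG step are stored in the order the observations were collected. The data set name diagseq and the variable order are hypothetical.

```sas
* Record each observation's position in the data set;
DATA diagseq;
  set diag;
  order = _n_;
RUN;

* Join consecutive residuals so runs and trends are easier to see;
symbol1 v=dot i=join;

PROC GPLOT data=diagseq;
  plot resid*order / vref=0;
  title 'Sequence Plot of Residuals';
RUN;
```

If the data have no meaningful time or collection order, this plot is not informative.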

6-32

Additional Predictors

Plot residuals against other potential

predictors (not predictors from the model)

Patterns suggest an important predictor that

perhaps should be in the model.

Example: Suppose we use a muscle mass

dataset that includes both men and women.

6-34

Residuals vs Age

Plot looks great, right?

But what happens if we separate male and

female?

PROC GPLOT data=diag;

plot resid*age=gender /overlay;

RUN;

6-36

Additional Predictors

Seems like gender is also an important

predictor of muscle mass (note that gender

is categorical, so we’ll have to wait until

later in the semester for further analysis)

For continuous variables, you look for a

linear pattern with a non-zero slope.

6-37

Summary of Diagnostic Plots

You will have noticed that the same plots are

used for checking more than one assumption.

These are your basic tools.

o Plot Y vs. X (check for linearity, outliers)

o Plot Residuals vs. X (check for constant

variance, outliers, linearity)

o Normal Probability Plot and/or

Histogram of residuals (normality, outliers)

If it makes sense, consider also doing a

sequence plot of the residuals (independence)

6-38

Plots vs. Significance Tests

If you are uncertain what to conclude after

examining the plots, you may additionally wish

to perform hypothesis tests for model

assumptions (normality, homogeneity of

variance, independence).

These tests are not a replacement for the plots,

but rather a supplement to them.

Note of caution: Plots are more likely to

suggest a remedy and significance test results

are very dependent on sample size.

6-39

Significance Tests for

Model Assumptions

Constancy of Variance:

o Brown-Forsythe (modified Levene)

o Breusch-Pagan

Normality:

o Kolmogorov-Smirnov, etc.

Independence of Errors:

o Durbin-Watson Test
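
As a hedged sketch of two of these tests in SAS, assuming the muscle and diag data sets from earlier in this lecture: the Brown-Forsythe test can be approximated by splitting the cases into two groups at the median of X and using the HOVTEST=BF option of PROC GLM, and the Durbin-Watson statistic is available through the DW option of the MODEL statement in PROC REG. The names med, medage, bf, and group are hypothetical. With two groups, the F test reported by HOVTEST=BF should be equivalent to the two-sample t test on absolute deviations from the group medians.

```sas
* Brown-Forsythe: split the cases into two groups at the median of age;
PROC MEANS data=diag noprint;
  var age;
  output out=med median=medage;
RUN;

DATA bf;
  if _n_ = 1 then set med(keep=medage);  * attach the median to every row;
  set diag;
  group = (age > medage);                * 1 = above the median, 0 = otherwise;
RUN;

* Compare the spread of the residuals between the two groups;
PROC GLM data=bf;
  class group;
  model resid = group;
  means group / hovtest=bf;
RUN;
QUIT;

* Durbin-Watson statistic, meaningful only if rows are in time order;
PROC REG data=muscle;
  model mmass = age / dw;
RUN;
QUIT;
```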

6-40

Tests for Normality

PROC UNIVARIATE data=diag normal;

var resid;

Tests for Normality

Test                  --Statistic---    -----p Value------
Shapiro-Wilk          W      0.979585   Pr < W      0.4112
Kolmogorov-Smirnov    D      0.079433   Pr > D     >0.1500
Cramer-von Mises      W-Sq   0.057805   Pr > W-Sq  >0.2500
Anderson-Darling      A-Sq   0.383556   Pr > A-Sq  >0.2500

Small p-values indicate non-normality

6-41

Upcoming in Lecture 7...

Remedial Measures: What to do when there

is a problem with your model assumptions

(KNNL: 3.8-3.11)
