regression diagnostics checking assumptions and data

Regression Diagnostics

Checking Assumptions and Data

Questions

• What is the linearity assumption? How can you tell if it seems met?

• What is homoscedasticity (heteroscedasticity)? How can you tell if it’s a problem?

• What is an outlier?

• What is leverage?

• What is a residual?

• How can you use residuals in assuring that the regression model is a good representation of the data?

• What is a studentized residual?

Linear Model Assumptions

• Linear relations between X and Y

• Independent Errors

• Normal distribution for errors & Y

• Equal variance of errors: homoscedasticity (the spread of the errors in Y is the same across all levels of X)

Good-Looking Graph

[Figure: scatterplot of Y against X]

No apparent departures from line.

Problem with Linearity

[Figure: scatterplot of Miles per Gallon against Horsepower with a fitted straight line; R Sq Linear = 0.595]

Problem with Heteroscedasticity

[Figure: scatterplot of Y against X]

Common problem when Y is a dollar amount.

Outliers

[Figure: scatterplot of Y against X with one point labeled "Outlier"]

Outlier = pathological point

Residual Plots

• Histogram of Residuals

• Residuals vs Fitted Values

• Residuals vs Predictor Variable

• Normal Q-Q Plots

• Studentized Residuals or standardized Residuals

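Residuals

• Standardized residual: the residual divided by the standard deviation of the residuals:

standardized residual_i = residual_i / standard deviation

Look for large values (some say absolute values greater than 2).

• Studentized residual: the studentized residual takes into account how far the point's X value is from the mean of X. The farther X is from the mean, the smaller the standard error of the residual, and so the larger the studentized residual. Look for large values.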
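The two kinds of scaled residual can be compared side by side with a minimal sketch for simple (one-predictor) regression. The function names and the data are made up for illustration, and the "studentized" version here is the internally studentized form, residual / (s × sqrt(1 − h_ii)); terminology varies between textbooks and software, so treat this as one reasonable reading rather than the definitive definition.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1*x; returns (b0, b1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def residual_diagnostics(x, y):
    """Return raw, standardized, and internally studentized residuals."""
    n = len(x)
    b0, b1 = fit_line(x, y)
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    # Residual standard deviation s, using n - 2 degrees of freedom
    s = math.sqrt(sum(e ** 2 for e in resid) / (n - 2))
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    # Leverage of each point in simple regression
    h = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    standardized = [e / s for e in resid]
    # Dividing by s * sqrt(1 - h) inflates residuals at high-leverage points
    studentized = [e / (s * math.sqrt(1 - hi)) for e, hi in zip(resid, h)]
    return resid, standardized, studentized

# Made-up data for illustration only; the last X value is far from the mean
x = [1, 2, 3, 4, 5, 6, 7, 15]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 25.0]
resid, standardized, studentized = residual_diagnostics(x, y)
for xi, z, t in zip(x, standardized, studentized):
    print(f"x={xi:>2}  standardized={z:7.3f}  studentized={t:7.3f}")
```

Because 1 − h_ii is never more than 1, each studentized value is at least as large in magnitude as the corresponding standardized value, and the gap is widest for the point whose X is farthest from the mean.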

Residual Plots

[Figure: Histogram of the Residuals (response is Crimrate); x-axis: Residual, y-axis: Frequency]

[Figure: Normal Probability Plot of the Residuals (response is Crimrate); x-axis: Residual, y-axis: Normal Score]

[Figure: Residuals Versus the Fitted Values (response is Crimrate); x-axis: Fitted Value, y-axis: Residual]

[Figure: Residuals Versus the Order of the Data (response is Crimrate); x-axis: Observation Order, y-axis: Residual]

Abnormal Patterns in Residual Plots

• Figures a), b): Non-linearity

• Figure c): Autocorrelation

• Figure d): Heteroscedasticity

Patterns of Outliers

a) Outlier is extreme in both X and Y but not in pattern. Removal is unlikely to alter the regression line.

b) Outlier is extreme in both X and Y as well as in the overall pattern. Inclusion will strongly influence the regression line.

c) Outlier is extreme in X but nearly average in Y.

d) Outlier is extreme in Y but not in X.

e) Outlier is extreme in pattern, but not in X or Y.

Influence Analysis

• Leverage: h_ii (see page 8)

• Leverage is an index of the importance of an observation to a regression analysis.

– It is a function of X only.

– Observations whose X values deviate greatly from the mean are influential.

– Maximum is 1; minimum is 1/n.

– It is considered large if it exceeds 3 × p/n (p = number of predictors, including the constant).
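As a rough illustration of the leverage rule of thumb, the sketch below computes h_ii for simple regression, where it reduces to 1/n + (x_i − x̄)² / Σ(x_j − x̄)², and flags values above 3 × p/n. The closed-form expression is a standard result for the one-predictor case; the function name `leverages` and the data are assumptions made for this example.

```python
def leverages(x):
    """Leverage h_ii of each observation in a one-predictor regression."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    # Depends on X only: the Y values never enter the calculation
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

# Made-up predictor values; the last one lies far from the mean
x = [1, 2, 3, 4, 5, 6, 7, 15]
h = leverages(x)

p = 2                      # intercept plus one slope
cutoff = 3 * p / len(x)    # rule of thumb from the slide: 3 x p / n
flagged = [i for i, hi in enumerate(h) if hi > cutoff]

for i, hi in enumerate(h):
    mark = "  <-- high leverage" if i in flagged else ""
    print(f"obs {i}: x={x[i]:>2}  h={hi:.3f}{mark}")
```

The leverages always sum to p, so their average is p/n; the 3 × p/n cutoff therefore flags points with about three times the average leverage.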

Cook’s distance

• Measures the influence of a data point on the regression equation, i.e., the effect of deleting a given observation. Data points with large residuals (outliers) and/or high leverage tend to have a large Cook’s distance.

• Cook’s D > 1 requires careful checking (such points are influential); > 4 suggests potentially serious outliers.
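To make the "effect of deleting an observation" concrete, here is a minimal sketch of Cook's distance for simple regression using the usual closed form D_i = (e_i² / (p s²)) × h_ii / (1 − h_ii)², which avoids refitting the model n times. The helper names and the data are hypothetical, chosen so that one point combines high leverage with a large residual.

```python
def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1*x; returns (b0, b1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def cooks_distance(x, y):
    """Cook's D for every observation in a one-predictor regression."""
    n, p = len(x), 2  # p counts the intercept and the slope
    b0, b1 = fit_line(x, y)
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s2 = sum(e ** 2 for e in resid) / (n - p)  # residual variance
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    h = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    # Large residual and high leverage both push D upward
    return [(e ** 2 / (p * s2)) * hi / (1 - hi) ** 2
            for e, hi in zip(resid, h)]

# Made-up data: the last point is far out in X and well off the trend
x = [1, 2, 3, 4, 5, 6, 7, 15]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 15.0]
d = cooks_distance(x, y)
worst = max(range(len(d)), key=lambda i: d[i])
print("Cook's D:", [round(v, 3) for v in d])
print("most influential observation:", worst)
```

The closed form gives the same values as refitting the line without each point in turn and measuring how much all the fitted values move, scaled by p s².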

Sensitivity in Inference

• All tests and intervals are very sensitive to even minor departures from independence.

• All tests and intervals are sensitive to moderate departures from equal variance.

• The hypothesis tests and confidence intervals for β0 and β1 are fairly "robust" (that is, forgiving) against departures from normality.

• Prediction intervals are quite sensitive to departures from normality.

Remedies

• If important predictor variables are omitted, see whether adding the omitted predictors improves the model.

• If there are unequal error variances, try transforming the response and/or predictor variables or use "weighted least squares regression."

• If an outlier exists, try using a "robust estimation procedure."

• If error terms are not independent, try fitting a "time series model."

• If the mean of the response is not a linear function of the predictors, try a different function.

• For example, polynomial regression involves transforming one or more predictor variables while remaining within the multiple linear regression framework.

• For another example, applying a logarithmic transformation to the response variable also allows for a nonlinear relationship between the response and the predictors.
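The logarithmic-transformation remedy can be sketched as follows: when the response grows exponentially with the predictor, fitting a straight line to log(Y) recovers a linear relationship, and predictions are mapped back with exp. The data are generated without noise purely for illustration, and `fit_line` and `predict` are hypothetical helper names.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1*x; returns (b0, b1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

# Made-up, noise-free exponential data: y = 2 * exp(0.3 * x)
x = [0, 1, 2, 3, 4, 5]
y = [2.0 * math.exp(0.3 * xi) for xi in x]

# Fit the line on the log scale: log(y) = b0 + b1 * x
log_y = [math.log(yi) for yi in y]
b0, b1 = fit_line(x, log_y)

def predict(x_new):
    """Back-transform the log-scale prediction to the original scale."""
    return math.exp(b0 + b1 * x_new)

print(f"slope on log scale: {b1:.3f}")
print(f"multiplier exp(b0): {math.exp(b0):.3f}")
print(f"prediction at x=6:  {predict(6):.3f}")
```

With real, noisy data the back-transformed prediction estimates something closer to the median than the mean of Y, so it is worth checking the residual plots again on the transformed scale.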

Data Transformation

• The usual approach for dealing with nonconstant variance, when it occurs, is to apply a variance-stabilizing transformation.

• For some distributions, the variance is a function of E(Y).

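• Box-Cox transformation:

$$
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\
\ln y, & \lambda = 0
\end{cases}
$$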
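A minimal sketch of the Box-Cox family for a fixed λ is shown below, assuming the standard one-parameter form (y^λ − 1)/λ with the natural log as the λ = 0 case. The function name `box_cox` is made up for this example, and choosing λ (usually by maximum likelihood) is outside what the sketch covers.

```python
import math

def box_cox(y, lam):
    """Apply the one-parameter Box-Cox transformation for a given lambda."""
    if any(v <= 0 for v in y):
        # The transformation is only defined for strictly positive values
        raise ValueError("Box-Cox requires strictly positive data")
    if lam == 0:
        return [math.log(v) for v in y]
    return [(v ** lam - 1) / lam for v in y]

# Made-up positive responses with a long right tail
y = [1.0, 2.0, 4.0, 8.0, 16.0, 64.0]

for lam in (1, 0.5, 0, -1):
    transformed = box_cox(y, lam)
    print(f"lambda={lam:>4}: {[round(v, 3) for v in transformed]}")
```

λ = 1 leaves the data unchanged apart from a shift, λ = 0.5 behaves like a square root, λ = 0 is the log, and λ = −1 behaves like a reciprocal, so smaller values of λ pull in a long right tail more strongly.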
