Regression Diagnostics
Checking Assumptions and Data
Questions
• What is the linearity assumption? How can you tell if it seems met?
• What is homoscedasticity (heteroscedasticity)? How can you tell if it’s a problem?
• What is an outlier?
• What is leverage?
• What is a residual?
• How can you use residuals in assuring that the regression model is a good representation of the data?
• What is a studentized residual?
Linear Model Assumptions
• Linear relations between X and Y
• Independent Errors
• Normal distribution for errors (and so for Y at each level of X)
• Equal Variance of Errors: Homoscedasticity (the same spread of error in Y across all levels of X)
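All four assumptions concern the errors, which are estimated by the residuals (observed Y minus fitted Y). The following is a minimal sketch of computing them for a single predictor, using only the Python standard library. The function names `fit_line` and `residuals` and the data are made up for illustration and are not from the slides.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares intercept and slope for one predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def residuals(x, y):
    """Observed Y minus fitted Y for each observation."""
    b0, b1 = fit_line(x, y)
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Made-up illustrative data (not from the slides).
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = fit_line(x, y)
e = residuals(x, y)
```

With an intercept in the model the residuals always sum to zero, so the diagnostics look at their pattern and spread rather than their average.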
Good-Looking Graph
[Figure: scatterplot of Y versus X]
No apparent departures from line.
Problem with Linearity
[Figure: scatterplot of Miles per Gallon versus Horsepower with linear fit; R Sq Linear = 0.595]
Problem with Heteroscedasticity
[Figure: scatterplot of Y versus X]
Common problem when Y is a dollar amount
Outliers
[Figure: scatterplot of Y versus X with one point labeled "Outlier"]
Outlier = pathological point
Residual Plots
• Histogram of Residuals
• Residuals vs Fitted Values
• Residuals vs Predictor Variable
• Normal Q-Q Plots
• Studentized Residuals or Standardized Residuals
Residuals
• Standardized residual: Look for large values (some say absolute value > 2).
• Studentized residual: The studentized residual also considers the distance of the point's X value from the mean of X. The farther X is from the mean, the smaller the standard error of the residual and the larger the studentized residual. Look for large values.
Both divide the residual by a standard deviation: Residual_i / Standard deviation
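A minimal sketch of both scaled residuals for a single predictor follows. It assumes one common convention: the standardized residual divides by the overall residual standard deviation s, and the (internally) studentized residual divides by s·sqrt(1 − h_ii), where h_ii is the leverage. Software packages differ in how they name these two quantities. The function name and data are made up for illustration.

```python
from math import sqrt
from statistics import mean

def scaled_residuals(x, y):
    """Return (standardized, studentized) residuals for one predictor."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    # Residual standard deviation with n - 2 degrees of freedom.
    s = sqrt(sum(ei ** 2 for ei in e) / (n - 2))
    # Leverage grows with the squared distance of X from its mean.
    h = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    standardized = [ei / s for ei in e]
    studentized = [ei / (s * sqrt(1 - hi)) for ei, hi in zip(e, h)]
    return standardized, studentized

# Made-up data; the last point has an X value far from the mean.
x = [1, 2, 3, 4, 5, 6, 7, 20]
y = [1.2, 2.1, 2.9, 4.2, 5.1, 5.8, 7.1, 15.0]
standardized, studentized = scaled_residuals(x, y)
flagged = [i for i, t in enumerate(studentized) if abs(t) > 2]
```

Because sqrt(1 − h_ii) is less than 1, each studentized residual is at least as large in absolute value as the matching standardized residual, and the gap is widest for points whose X is far from the mean.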
Residual Plots
[Figure: Histogram of the Residuals (response is Crimrate); x-axis: Residual, y-axis: Frequency]
[Figure: Normal Probability Plot of the Residuals (response is Crimrate); x-axis: Residual, y-axis: Normal Score]
[Figure: Residuals Versus the Fitted Values (response is Crimrate); x-axis: Fitted Value, y-axis: Residual]
[Figure: Residuals Versus the Order of the Data (response is Crimrate); x-axis: Observation Order, y-axis: Residual]
Abnormal Patterns in Residual Plots
• Figures a), b): Non-linearity
• Figure c): Autocorrelation
• Figure d): Heteroscedasticity
Patterns of Outliers
a) Outlier is extreme in both X and Y but not in pattern. Removal is unlikely to alter the regression line.
b) Outlier is extreme in both X and Y as well as in the overall pattern. Inclusion will strongly influence the regression line.
c) Outlier is extreme for X, nearly average for Y.
d) Outlier is extreme in Y, not in X.
e) Outlier is extreme in pattern, but not in X or Y.
Influence Analysis
• Leverage: h_ii (see page 8)
• Leverage is an index of the importance of an observation to a regression analysis.
– Function of X only
– Large deviations from mean are influential
– Maximum is 1; min is 1/n
– It is considered large if more than 3 × p / n (p = number of predictors including the constant).
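For a single predictor the leverage has the closed form h_ii = 1/n + (x_i − x̄)² / Σ(x_j − x̄)², which makes the "function of X only" point concrete. Below is a minimal sketch that applies the 3 × p / n rule with p = 2 (intercept plus one slope). The function name and data are made up for illustration.

```python
from statistics import mean

def leverages(x):
    """Leverage h_ii of each observation in a one-predictor regression."""
    n = len(x)
    x_bar = mean(x)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

# Made-up data; only the X values are needed.
x = [1, 2, 3, 4, 5, 6, 7, 20]
h = leverages(x)

p = 2                     # constant plus one predictor
cutoff = 3 * p / len(x)   # rule of thumb from the slide
flagged = [i for i, hi in enumerate(h) if hi > cutoff]
```

Here only the last observation (X = 20) exceeds the cutoff. The leverages sum to p, so their average is p / n, which is why the cutoff is stated as a multiple of p / n.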
Cook’s distance
• Measures the influence of a data point on the regression equation, i.e. the effect of deleting a given observation. Data points with large residuals (outliers) and/or high leverage can have a large Cook’s distance.
• Cook’s D > 1 requires careful checking (such points are influential); > 4 suggests potentially serious outliers.
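A minimal sketch of Cook's distance for a single predictor, assuming the standard formula D_i = e_i² · h_ii / (p · s² · (1 − h_ii)²), where e_i is the residual, h_ii the leverage, p the number of coefficients, and s² the residual variance. The formula is the usual textbook one rather than something stated on the slide, and the function name and data are made up for illustration.

```python
from statistics import mean

def cooks_distance(x, y):
    """Cook's D for each observation in a one-predictor regression."""
    n, p = len(x), 2  # p counts the constant and the slope
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s2 = sum(ei ** 2 for ei in e) / (n - p)  # residual variance
    h = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return [ei ** 2 * hi / (p * s2 * (1 - hi) ** 2) for ei, hi in zip(e, h)]

# Made-up data: the last point is far out in X and off the trend of the rest.
x = [1, 2, 3, 4, 5, 6, 7, 20]
y = [1.2, 2.1, 2.9, 4.2, 5.1, 5.8, 7.1, 10.0]
d = cooks_distance(x, y)
needs_checking = [i for i, di in enumerate(d) if di > 1]
```

The last point's raw residual is not the largest in this data set because the point pulls the fitted line toward itself, yet its Cook's D is by far the largest. This is why the measure combines residual size with leverage.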
Sensitivity in Inference
• All tests and intervals are very sensitive to even minor departures from independence.
• All tests and intervals are sensitive to moderate departures from equal variance.
• The hypothesis tests and confidence intervals for β0 and β1 are fairly "robust" (that is, forgiving) against departures from normality.
• Prediction intervals are quite sensitive to departures from normality.
Remedies
• If important predictor variables are omitted, see whether adding the omitted predictors improves the model.
• If there are unequal error variances, try transforming the response and/or predictor variables or use "weighted least squares regression."
• If an outlier exists, try using a "robust estimation procedure."
• If error terms are not independent, try fitting a "time series model."
• If the mean of the response is not a linear function of the predictors, try a different function.
• For example, polynomial regression involves transforming one or more predictor variables while remaining within the multiple linear regression framework.
• For another example, applying a logarithmic transformation to the response variable also allows for a nonlinear relationship between the response and the predictors.
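As a small illustration of the logarithmic-transformation remedy, the sketch below fits a straight line to made-up, noise-free data that follow Y = exp(0.5 + 0.3·X), first on the raw response and then on log(Y). The data, function names, and coefficients are assumptions chosen for illustration only.

```python
from math import exp, log
from statistics import mean

def fit_line(x, y):
    """Least-squares intercept and slope for one predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - b1 * x_bar, b1

def r_squared(x, y):
    """Share of the variation in y explained by the fitted line."""
    b0, b1 = fit_line(x, y)
    y_bar = mean(y)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse / sst

# Made-up, noise-free exponential data.
x = [0, 1, 2, 3, 4, 5]
y = [exp(0.5 + 0.3 * xi) for xi in x]
log_y = [log(yi) for yi in y]

raw_r2 = r_squared(x, y)         # a straight line misses the curvature
log_r2 = r_squared(x, log_y)     # the line fits log(Y) essentially exactly
log_b0, log_b1 = fit_line(x, log_y)
```

On the log scale the fit recovers the intercept 0.5 and slope 0.3 used to generate the data. With real, noisy data the improvement would be judged from the residual plots rather than from R-squared alone.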
Data Transformation
• The usual approach for dealing with nonconstant variance, when it occurs, is to apply a variance-stabilizing transformation.
• For some distributions, the variance is a function of E(Y).
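• Box-Cox transformation:
$$y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \ln y, & \lambda = 0 \end{cases}$$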
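A minimal sketch of the Box-Cox transformation in its usual one-parameter form: a power transform of the response for λ ≠ 0 and the natural log at λ = 0. It assumes a strictly positive response. The function name `box_cox` is made up for illustration, and choosing λ (typically by maximum likelihood) is not shown.

```python
from math import log

def box_cox(y, lam):
    """Apply the one-parameter Box-Cox transform to each value in y."""
    if any(yi <= 0 for yi in y):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        # Limiting case of (y**lam - 1) / lam as lam approaches 0.
        return [log(yi) for yi in y]
    return [(yi ** lam - 1) / lam for yi in y]

# Made-up values for illustration.
y = [1, 4, 9]
sqrt_like = box_cox(y, 0.5)   # square-root-type transform
log_like = box_cox(y, 0)      # natural log
shifted = box_cox(y, 1)       # lam = 1 only shifts the data by 1
```

Values of λ below 1 compress large responses more than small ones, which is what stabilizes a variance that grows with E(Y).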