Detecting and Responding to Violations of Regression Assumptions
Chunfeng Huang, Department of Statistics, Indiana University
Example
[Figure: histograms of the response y and the regressors x1 and x2]
Linear Regression Model
$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \varepsilon$
y: response variable.
x1, ..., xk: regressor variables (independent variables).
β0, β1, ..., βk: regression coefficients.
ε: model error.
- Uncorrelated: cov(εi, εj) = 0 for i ≠ j.
- Mean zero, same variance: var(εi) = σ² (homoscedasticity).
- Normally distributed.
Linear Models Examples:
$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon$
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2 + \varepsilon$
$y = \beta_0 + \beta_1 \log x_1 + \beta_2 \log x_2 + \varepsilon$
$\log y = \beta_0 + \beta_1 (1/x_1) + \beta_2 (1/x_2) + \varepsilon$
Nonlinear Models Examples:
$y = \beta_0 + \beta_1 x_1^{\gamma_1} + \beta_2 x_2^{\gamma_2} + \varepsilon$
$y = \dfrac{\beta_0}{1 + e^{\beta_1 x_1}} + \varepsilon$
Regression Inferences
Least squares estimation of the regression coefficients: $b = (X^T X)^{-1} X^T y$.
Variance estimation for σ²: s².
Coefficient of determination, R².
Partial F test or t-test for $H_0: \beta_j = 0$.
Partial F test versus sequential F test.
General linear hypothesis, F test. For example:
$H_0: \beta_1 = \beta_2 \text{ and } \beta_3 = 2$
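A minimal R sketch of these computations on simulated data (the data frame dat and its model are invented for illustration; car is an assumed add-on package for the general linear hypothesis):

    set.seed(1)
    dat <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
    dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(50)
    fit <- lm(y ~ x1 + x2, data = dat)     # least squares: b = (X'X)^{-1} X'y
    summary(fit)                           # b, t-tests of H0: beta_j = 0, s, R^2
    summary(fit)$sigma^2                   # s^2, the estimate of sigma^2
    anova(fit)                             # sequential F tests
    car::linearHypothesis(fit, "x1 = x2")  # general linear hypothesis F test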
Confidence interval for the mean response $E(y \mid x_0)$:
$(\text{predicted value}) \pm t_{\alpha/2,\,n-k-1} \cdot (\text{standard error of prediction})$
For example, for a simple linear regression
$\hat{y}(x_0) \pm t_{\alpha/2,\,n-2}\, s\, \sqrt{\dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$
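In R, predict() returns this interval directly; a sketch reusing the simulated dat from the earlier example (the point x0 = 1 is arbitrary):

    fit0 <- lm(y ~ x1, data = dat)                        # simple linear regression
    x0   <- data.frame(x1 = 1)
    predict(fit0, newdata = x0, interval = "confidence")  # CI for E(y | x0)
    predict(fit0, newdata = x0, interval = "prediction")  # wider interval for a new y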
The use of regression analysis depends on the research purpose (goal):
Prediction.
Model building.
Parameter estimation.
[Figure: scatterplot of y versus x (x from 1800 to 1950), followed by residual-versus-fitted plots for three successive fits: lmout (R² = 0.922), lmout2 (R² = 0.998), lmout3 (R² = 0.9998)]
Residuals
Ordinary residuals: the difference between the observed value and the fitted value.
Studentized residual: scale-free, t-like.
R-residual (externally studentized residual): scale-free, follows a t distribution.
[Figure: studentized residuals plotted against observation index]
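Base R computes all three residual types from a fitted lm object; a sketch with fit from the earlier example:

    residuals(fit)  # ordinary residuals: observed minus fitted
    rstandard(fit)  # (internally) studentized residuals, scale-free
    rstudent(fit)   # R-residuals: externally studentized, t-distributed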
[Figure: histograms of y and the regressors x1 through x7]
[Figure: studentized residuals, ls.diag(lmout)$stud.res, plotted against fitted values, lmout$fitted]
[Figure: four scatterplots, y1 versus x1 through y4 versus x4]
Outliers.
High-leverage observations.
Influential points:
- Influence on the fitted value: (DFFITS)_i.
- Influence on the regression coefficients: (DFBETAS)_{j,i} and Cook's distance.
- Influence on performance: (COVRATIO)_i.
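Base R exposes each of these diagnostics for a fitted lm object (fit as in the earlier sketch):

    hatvalues(fit)           # leverages h_ii
    dffits(fit)              # (DFFITS)_i: influence on the fitted value
    dfbetas(fit)             # (DFBETAS)_{j,i}: influence on each coefficient
    cooks.distance(fit)      # Cook's distance
    covratio(fit)            # (COVRATIO)_i: influence on estimation precision
    influence.measures(fit)  # all of the above, with unusual cases flagged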
     i    y_i      ŷ_i       e_i      Stu Res   R-Res   h_ii
     ⋮
    23    3534.49  3694.76   -160.27    -3.27   -5.24   0.98
    24    8266.77  7853.50    413.26     2.58    3.20   0.87
     ⋮

     i    (DFBETAS)_1,i   (DFBETAS)_2,i   ···   (DFFITS)_i   (COVRATIO)_i   Cook's D_i
     ⋮
    23    -44.4           1.16            ···    -48.51      0.0473         115
    24    -1.65           0.59            ···      8.53      0.24           5.89
     ⋮
What do we do with outliers/influential points?
Check for coding errors.
Run regression with and without the point, and compare results.
Transform the variable, for example, log.
Reconsider the model/study.
Robust regression.
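A sketch of the refit-and-compare strategy plus a robust alternative (dat as before; dropping case 23 here just echoes the earlier table and is hypothetical):

    fit_all  <- lm(y ~ x1 + x2, data = dat)
    fit_drop <- lm(y ~ x1 + x2, data = dat[-23, ])        # refit without case 23
    rbind(all = coef(fit_all), dropped = coef(fit_drop))  # compare coefficients
    fit_rob  <- MASS::rlm(y ~ x1 + x2, data = dat)        # robust regression (MASS)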
Partial Regression Plots (added variable plots)
Plot $e_{y|X_{-j}}$ against $e_{x_j|X_{-j}}$.
$e_{y|X_{-j}}$: residuals in which the linear dependency of y on all regressors apart from $x_j$ has been removed.
$e_{x_j|X_{-j}}$: residuals in which $x_j$'s linear dependency on the other regressors has been removed.
If $x_j$ enters the regression in a linear fashion, the partial regression plot should show a linear relationship through the origin.
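The plot can be built directly from this definition; a sketch for x1 with the single other regressor x2 (dat as before; car::avPlots, an assumed add-on, draws the same plots in one call):

    e_y  <- resid(lm(y  ~ x2, data = dat))  # y with x2's linear effect removed
    e_x1 <- resid(lm(x1 ~ x2, data = dat))  # x1 with x2's linear effect removed
    plot(e_x1, e_y, main = "Added-Variable Plot: x1 | others")
    abline(lm(e_y ~ e_x1))  # slope equals the coefficient of x1 in the full fit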
[Figure: histograms of y, x1, and x2]
[Figure: added-variable plots for x1 | others and x2 | others]
Transformation
Box-Tidwell transformation of the regressor variables:
$y = \beta_0 + \beta_1 x_1^{\alpha_1} + \cdots + \beta_k x_k^{\alpha_k} + \varepsilon$
Box-Cox transformation of the response:
$w = \begin{cases} \dfrac{y^\lambda - 1}{\lambda}, & \lambda \neq 0 \\ \log y, & \lambda = 0 \end{cases}$
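Both transformations have standard R implementations (MASS ships with R; car is an assumed add-on). Both need positive variables, so assume a hypothetical data frame pdat with positive y, x1, x2:

    fit_p <- lm(y ~ x1 + x2, data = pdat)
    MASS::boxcox(fit_p, lambda = seq(-2, 2, 0.1))  # profile likelihood for lambda
    car::boxTidwell(y ~ x1 + x2, data = pdat)      # estimates alpha_1, alpha_2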
[Figure: added-variable plots for x1t | others and x2 | others, after transforming x1]
To detect correlated errors:
Residual plot: positively or negatively autocorrelated?
[Figure: two residual sequence plots illustrating positive and negative autocorrelation]
The Durbin-Watson (D-W) test detects positive autocorrelation by assuming
$\varepsilon_t = \rho\, \varepsilon_{t-1} + (\text{random disturbance})$
$H_0: \rho = 0$ vs $H_a: \rho > 0$. The D-W test statistic:
$d = \dfrac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}$
Reject $H_0$ if $d < d_L$; fail to reject $H_0$ if $d > d_U$; inconclusive if $d_L < d < d_U$. ($d_L$ and $d_U$ are bounds depending on n and α.)
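The lmtest package (an assumed add-on) implements this test for a fitted model; a sketch with fit from the earlier example:

    lmtest::dwtest(fit, alternative = "greater")  # H0: rho = 0 vs Ha: rho > 0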
Dealing with autocorrelation: time series methods (ARMA, GARCH, etc.).
To detect unequal error variances (heteroscedasticity):
Residual plots.
Breusch-Pagan / Cook-Weisberg test.
White's general test.
Dealing with heteroscedasticity
Transform the variable.
Revisit the model.
Use robust standard errors.
Weighted least squares.
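A sketch of the detection tests and two remedies (lmtest and sandwich are assumed add-ons; the weights shown are hypothetical; fit and dat as in the earlier example):

    lmtest::bptest(fit)                                                 # Breusch-Pagan test
    lmtest::coeftest(fit, vcov. = sandwich::vcovHC(fit, type = "HC3"))  # robust SEs
    wls <- lm(y ~ x1 + x2, data = dat, weights = 1 / x1^2)              # weighted least squares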
To detect non-normality in the errors:
Normal probability plot.
Shapiro-Wilk test.
A non-normal condition often suggests other violations as well.
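Both checks are one-liners in base R (fit as in the earlier sketch):

    qqnorm(rstudent(fit)); qqline(rstudent(fit))  # normal probability plot
    shapiro.test(residuals(fit))                  # Shapiro-Wilk test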
Revisit the response variable y:
y is binary: logistic regression.
$P(Y = 1) = \dfrac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}$
y is measured on an ordinal scale: ordinal logistic regression.
y is measured on non-ordered scale: multinomial logistic regression.
y is counts: Poisson or Negative Binomial regression.
Generalized linear models.
Survival models.
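In R these models are fit with glm() (and MASS::glm.nb for the negative binomial); a sketch assuming hypothetical data frames bin_dat and cnt_dat whose y matches each response type:

    glm(y ~ x1 + x2, family = binomial, data = bin_dat)  # logistic: y in {0, 1}
    glm(y ~ x1 + x2, family = poisson,  data = cnt_dat)  # Poisson counts
    MASS::glm.nb(y ~ x1 + x2, data = cnt_dat)            # negative binomial counts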
Multicollinearity
Detecting multicollinearity
High R², but most coefficients appear individually non-significant.
Variance Inflation Factor (VIF).
Eigenvalues of the correlation matrix.
Dealing with multicollinearity
Ridge regression.
Principal component regression.
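A sketch of these checks and one remedy (car is an assumed add-on; dat and fit as in the earlier example):

    car::vif(fit)                            # variance inflation factors
    eigen(cor(dat[, c("x1", "x2")]))$values  # near-zero eigenvalues flag collinearity
    MASS::lm.ridge(y ~ x1 + x2, data = dat, lambda = 1)  # ridge regression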
Other issues:
Model selection: PRESS, Mallows' Cp, AIC, etc.
Categorical variables: use dummy variables and consider interaction.
Nonparametric regression.
Nonlinear regression.
All models are wrong, but some are useful. - George Box