Detecting and Responding to Violations of Regression Assumptions
Chunfeng Huang, Department of Statistics, Indiana University
Example
[Figure: histograms of the response y and the regressors x1 and x2]
Linear Regression Model
$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \varepsilon$
y: response variable.
x1, ..., xk: regressor variables (independent variables).
β0, β1, ..., βk: regression coefficients.
ε: model error.
- Uncorrelated: cov(εi, εj) = 0 for i ≠ j.
- Mean zero, same variance: var(εi) = σ² (homoscedasticity).
- Normally distributed.
Linear Models Examples:
$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon$
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2 + \varepsilon$
$y = \beta_0 + \beta_1 \log x_1 + \beta_2 \log x_2 + \varepsilon$
$\log y = \beta_0 + \beta_1 (1/x_1) + \beta_2 (1/x_2) + \varepsilon$
Nonlinear Models Examples:
$y = \beta_0 + \beta_1 x_1^{\gamma_1} + \beta_2 x_2^{\gamma_2} + \varepsilon$
$y = \dfrac{\beta_0}{1 + e^{\beta_1 x_1}} + \varepsilon$
Regression Inferences
Least squares estimation of the regression coefficients: $b = (X^T X)^{-1} X^T y$.
Variance estimation for σ²: s².
Coefficient of determination, R².
Partial F test or t-test for $H_0: \beta_j = 0$.
Partial F test versus sequential F test.
General linear hypothesis, F test. For example:
$H_0: \beta_1 = \beta_2 \text{ and } \beta_3 = 2$
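A minimal R sketch of these computations on simulated data (the data frame dat and its model are invented for illustration; car is an assumed add-on package for the general linear hypothesis):

    set.seed(1)
    dat <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
    dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(50)
    fit <- lm(y ~ x1 + x2, data = dat)     # least squares: b = (X'X)^{-1} X'y
    summary(fit)                           # b, t-tests of H0: beta_j = 0, s, R^2
    summary(fit)$sigma^2                   # s^2, the estimate of sigma^2
    anova(fit)                             # sequential F tests
    car::linearHypothesis(fit, "x1 = x2")  # general linear hypothesis F test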
Confidence interval for the mean response $E(y \mid x_0)$:
$(\text{predicted value}) \pm t_{\alpha/2,\,n-k-1} \cdot (\text{standard error of prediction})$
For example, for a simple linear regression
$\hat{y}(x_0) \pm t_{\alpha/2,\,n-2}\, s\, \sqrt{\dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$
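In R, predict() returns this interval directly; a sketch reusing the simulated dat from the earlier example (the point x0 = 1 is arbitrary):

    fit0 <- lm(y ~ x1, data = dat)                        # simple linear regression
    x0   <- data.frame(x1 = 1)
    predict(fit0, newdata = x0, interval = "confidence")  # CI for E(y | x0)
    predict(fit0, newdata = x0, interval = "prediction")  # wider interval for a new y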
The use of regression analysis depends on the research purpose (goal):
Prediction.
Model building.
Parameter estimation.
[Figure: scatterplot of y versus x (x from 1800 to 1950), followed by residual-versus-fitted plots for three successive fits: lmout (R² = 0.922), lmout2 (R² = 0.998), lmout3 (R² = 0.9998)]
Residuals
Ordinary residuals: the difference between the observed value and the fitted value.
Studentized residual: scale-free, t-like.
R-residual (externally studentized residual): scale-free, follows a t distribution.
[Figure: studentized residuals plotted against observation index]
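Base R computes all three residual types from a fitted lm object; a sketch with fit from the earlier example:

    residuals(fit)  # ordinary residuals: observed minus fitted
    rstandard(fit)  # (internally) studentized residuals, scale-free
    rstudent(fit)   # R-residuals: externally studentized, t-distributed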
[Figure: histograms of y and the regressors x1 through x7]
[Figure: studentized residuals, ls.diag(lmout)$stud.res, plotted against fitted values, lmout$fitted]
[Figure: four scatterplots, y1 versus x1 through y4 versus x4]
Outliers.
High-leverage observations.
Influential points:
- Influence on the fitted value: (DFFITS)_i.
- Influence on the regression coefficients: (DFBETAS)_{j,i} and Cook's distance.
- Influence on performance: (COVRATIO)_i.
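Base R exposes each of these diagnostics for a fitted lm object (fit as in the earlier sketch):

    hatvalues(fit)           # leverages h_ii
    dffits(fit)              # (DFFITS)_i: influence on the fitted value
    dfbetas(fit)             # (DFBETAS)_{j,i}: influence on each coefficient
    cooks.distance(fit)      # Cook's distance
    covratio(fit)            # (COVRATIO)_i: influence on estimation precision
    influence.measures(fit)  # all of the above, with unusual cases flagged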
     i    y_i      ŷ_i       e_i      Stu Res   R-Res   h_ii
     ⋮
    23    3534.49  3694.76   -160.27    -3.27   -5.24   0.98
    24    8266.77  7853.50    413.26     2.58    3.20   0.87
     ⋮

     i    (DFBETAS)_1,i   (DFBETAS)_2,i   ···   (DFFITS)_i   (COVRATIO)_i   Cook's D_i
     ⋮
    23    -44.4           1.16            ···    -48.51      0.0473         115
    24    -1.65           0.59            ···      8.53      0.24           5.89
     ⋮
What do we do with outliers/influential points?
Check for coding errors.
Run regression with and without the point, and compare results.
Transform the variable, for example, log.
Reconsider the model/study.
Robust regression.
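A sketch of the refit-and-compare strategy plus a robust alternative (dat as before; dropping case 23 here just echoes the earlier table and is hypothetical):

    fit_all  <- lm(y ~ x1 + x2, data = dat)
    fit_drop <- lm(y ~ x1 + x2, data = dat[-23, ])        # refit without case 23
    rbind(all = coef(fit_all), dropped = coef(fit_drop))  # compare coefficients
    fit_rob  <- MASS::rlm(y ~ x1 + x2, data = dat)        # robust regression (MASS)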
Partial Regression Plots (added variable plots)
Plot $e_{y|X_{-j}}$ against $e_{x_j|X_{-j}}$.
$e_{y|X_{-j}}$: residuals in which the linear dependency of y on all regressors apart from $x_j$ has been removed.
$e_{x_j|X_{-j}}$: residuals in which $x_j$'s linear dependency on the other regressors has been removed.
If $x_j$ enters the regression in a linear fashion, the partial regression plot should show a linear relationship through the origin.
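The plot can be built directly from this definition; a sketch for x1 with the single other regressor x2 (dat as before; car::avPlots, an assumed add-on, draws the same plots in one call):

    e_y  <- resid(lm(y  ~ x2, data = dat))  # y with x2's linear effect removed
    e_x1 <- resid(lm(x1 ~ x2, data = dat))  # x1 with x2's linear effect removed
    plot(e_x1, e_y, main = "Added-Variable Plot: x1 | others")
    abline(lm(e_y ~ e_x1))  # slope equals the coefficient of x1 in the full fit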
[Figure: histograms of y, x1, and x2]
[Figure: added-variable plots for x1 | others and x2 | others]
Transformation
Box-Tidwell transformation of the regressor variables:
$y = \beta_0 + \beta_1 x_1^{\alpha_1} + \cdots + \beta_k x_k^{\alpha_k} + \varepsilon$
Box-Cox transformation of the response:
$w = \begin{cases} \dfrac{y^\lambda - 1}{\lambda}, & \lambda \neq 0 \\ \log y, & \lambda = 0 \end{cases}$
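Both transformations have standard R implementations (MASS ships with R; car is an assumed add-on). Both need positive variables, so assume a hypothetical data frame pdat with positive y, x1, x2:

    fit_p <- lm(y ~ x1 + x2, data = pdat)
    MASS::boxcox(fit_p, lambda = seq(-2, 2, 0.1))  # profile likelihood for lambda
    car::boxTidwell(y ~ x1 + x2, data = pdat)      # estimates alpha_1, alpha_2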
[Figure: added-variable plots for x1t | others and x2 | others, after transforming x1]
To detect correlated errors:
Residual plot: positively or negatively autocorrelated?
[Figure: two residual sequence plots illustrating positive and negative autocorrelation]
The Durbin-Watson (D-W) test detects positive autocorrelation by assuming
$\varepsilon_t = \rho\, \varepsilon_{t-1} + (\text{random disturbance})$
$H_0: \rho = 0$ vs $H_a: \rho > 0$. The D-W test statistic:
$d = \dfrac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}$
Reject $H_0$ if $d < d_L$; fail to reject $H_0$ if $d > d_U$; inconclusive if $d_L < d < d_U$. ($d_L$ and $d_U$ are bounds depending on n and α.)
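The lmtest package (an assumed add-on) implements this test for a fitted model; a sketch with fit from the earlier example:

    lmtest::dwtest(fit, alternative = "greater")  # H0: rho = 0 vs Ha: rho > 0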
Dealing with autocorrelation: time series methods (ARMA, GARCH, etc.).
To detect unequal error variances (heteroscedasticity):
Residual plots.
Breusch-Pagan / Cook-Weisberg test.
White's general test.
Dealing with heteroscedasticity
Transform the variable.
Revisit the model.
Use robust standard errors.
Weighted least squares.
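A sketch of the detection tests and two remedies (lmtest and sandwich are assumed add-ons; the weights shown are hypothetical; fit and dat as in the earlier example):

    lmtest::bptest(fit)                                                 # Breusch-Pagan test
    lmtest::coeftest(fit, vcov. = sandwich::vcovHC(fit, type = "HC3"))  # robust SEs
    wls <- lm(y ~ x1 + x2, data = dat, weights = 1 / x1^2)              # weighted least squares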
To detect non-normality in the errors:
Normal probability plot.
Shapiro-Wilk test.
A non-normal condition often suggests other violations as well.
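Both checks are one-liners in base R (fit as in the earlier sketch):

    qqnorm(rstudent(fit)); qqline(rstudent(fit))  # normal probability plot
    shapiro.test(residuals(fit))                  # Shapiro-Wilk test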
Revisit the response variable y:
y is binary: logistic regression.
$P(Y = 1) = \dfrac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}$
y is measured on an ordinal scale: ordinal logistic regression.
y is measured on non-ordered scale: multinomial logistic regression.
y is counts: Poisson or Negative Binomial regression.
Generalized linear models.
Survival models.
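In R these models are fit with glm() (and MASS::glm.nb for the negative binomial); a sketch assuming hypothetical data frames bin_dat and cnt_dat whose y matches each response type:

    glm(y ~ x1 + x2, family = binomial, data = bin_dat)  # logistic: y in {0, 1}
    glm(y ~ x1 + x2, family = poisson,  data = cnt_dat)  # Poisson counts
    MASS::glm.nb(y ~ x1 + x2, data = cnt_dat)            # negative binomial counts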
Multicollinearity
Detecting multicollinearity
High R², but most coefficients appear individually non-significant.
Variance Inflation Factor (VIF).
Eigenvalues of the correlation matrix.
Dealing with multicollinearity
Ridge regression.
Principal component regression.
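A sketch of these checks and one remedy (car is an assumed add-on; dat and fit as in the earlier example):

    car::vif(fit)                            # variance inflation factors
    eigen(cor(dat[, c("x1", "x2")]))$values  # near-zero eigenvalues flag collinearity
    MASS::lm.ridge(y ~ x1 + x2, data = dat, lambda = 1)  # ridge regression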
Other issues:
Model selection: PRESS, Mallows' Cp, AIC, etc.
Categorical variables: use dummy variables and consider interaction.
Nonparametric regression.
Nonlinear regression.
All models are wrong, but some are useful. - George Box