STATS 330: Lecture 12

DESCRIPTION

Independence: One of the regression assumptions is that the errors are independent. Data that are collected sequentially over time often have errors that are not independent. If the independence assumption does not hold, then the standard errors will be wrong and the tests and confidence intervals will be unreliable. Thus, we need to be able to detect lack of independence.

TRANSCRIPT

Page 1: STATS 330: Lecture 12

Page 2: Diagnostics 4: Aim of today's lecture

To discuss diagnostics for independence

Page 3: Independence

One of the regression assumptions is that the errors are independent.

Data that are collected sequentially over time often have errors that are not independent.

If the independence assumption does not hold, then the standard errors will be wrong and the tests and confidence intervals will be unreliable.

Thus, we need to be able to detect lack of independence.
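The consequence for standard errors can be seen directly by simulation. Below is a minimal sketch (not from the lecture; the seed, sample size, number of replications and AR coefficient are arbitrary choices) comparing the standard error that lm() reports for a slope with the actual sampling variability of that slope when the errors are positively autocorrelated:

set.seed(330)
n <- 100
x <- 1:n
beta.hat <- naive.se <- numeric(500)
for (r in 1:500) {
  e <- as.numeric(arima.sim(list(ar = 0.9), n = n))   # positively autocorrelated errors
  y <- 2 + 0.5 * x + e
  fit <- lm(y ~ x)
  beta.hat[r] <- coef(fit)[2]                         # slope estimate
  naive.se[r] <- summary(fit)$coefficients[2, 2]      # standard error lm() reports
}
sd(beta.hat)    # actual variability of the slope estimates
mean(naive.se)  # average reported standard error: typically much smaller here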

Page 4: Types of dependence

If large positive errors tend to follow large positive errors, and large negative errors tend to follow large negative errors, we say the data has positive autocorrelation.

If large positive errors tend to follow large negative errors, and large negative errors tend to follow large positive errors, we say the data has negative autocorrelation.
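Both patterns are easy to generate for comparison. A minimal sketch (the seed and series length are arbitrary; arima.sim is from the stats package):

set.seed(1)
pos <- arima.sim(list(ar =  0.9), n = 100)  # positive autocorrelation: long runs of one sign
neg <- arima.sim(list(ar = -0.9), n = 100)  # negative autocorrelation: rapid sign changes
par(mfrow = c(2, 1))
plot(pos, ylab = "error", main = "Autocorrelation = 0.9")
plot(neg, ylab = "error", main = "Autocorrelation = -0.9")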

Page 5: Diagnostics

If the errors are positively autocorrelated:

• Plotting the residuals against time will show long runs of positive and negative residuals

• Plotting residuals against the previous residual (i.e. e_i vs e_{i-1}) will show a positive trend

• A correlogram of the residuals will show positive spikes, gradually decaying

Page 6: Diagnostics (2)

If the errors are negatively autocorrelated:

• Plotting the residuals against time will show alternating positive and negative residuals

• Plotting residuals against the previous residual (i.e. e_i vs e_{i-1}) will show a negative trend

• A correlogram of the residuals will show alternating positive and negative spikes, gradually decaying

Page 7: Residuals against time

res <- residuals(lm.obj)
plot(1:length(res), res, xlab = "time", ylab = "residuals", type = "b")
lines(1:length(res), res)
abline(h = 0, lty = 2)

Notes:
• type = "b" plots both dots and lines
• abline(h = 0, lty = 2) draws a dotted line at 0 (the mean residual)
• The x vector can be omitted when it is just the sequence numbers (see the example below)
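With the x vector omitted, an equivalent shorter call (plot() supplies 1:length(res) itself):

plot(res, type = "b", xlab = "time", ylab = "residuals")
abline(h = 0, lty = 2)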

Page 8

[Figure: three time series plots of residuals, titled "Autocorrelation = 0.9", "Autocorrelation = 0.0" and "Autocorrelation = -0.9"; x-axis time (0 to 100), y-axis residual.]

Page 9: Residuals against previous

res <- residuals(lm.obj)
n <- length(res)
plot.res <- res[-1]  # element 1 has no previous
prev.res <- res[-n]  # drop the last so the two vectors have equal length
plot(prev.res, plot.res, xlab = "previous residual", ylab = "residual")

Page 10: Plots for different degrees of autocorrelation

[Figure: three scatterplots of residual versus previous residual, titled "Autocorrelation = 0.9", "Autocorrelation = 0.0" and "Autocorrelation = -0.9".]

Page 11: Correlogram

acf(residuals(lm.obj))

The correlogram (autocorrelation function, acf) is a plot of the lag-k autocorrelation versus k. The lag-k autocorrelation is the correlation between residuals k time units apart.
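As a sketch of what acf() computes, the lag-k autocorrelation can also be obtained directly (lm.obj is the generic fitted model used above; acf() uses a slightly different divisor, so the two values agree only approximately):

lag.k.cor <- function(e, k) cor(e[-(1:k)], e[1:(length(e) - k)])
res <- residuals(lm.obj)
lag.k.cor(res, 1)              # direct lag-1 autocorrelation
acf(res, plot = FALSE)$acf[2]  # acf's lag-1 value ($acf[1] is lag 0)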

Page 12

[Figure: three correlograms (ACF versus lag, 0 to 20), titled "Autocorrelation = 0.9", "Autocorrelation = 0.0" and "Autocorrelation = -0.9".]

Page 13: Durbin-Watson test

We can also do a formal hypothesis test (the Durbin-Watson test) for independence. The test assumes the errors follow a model of the form

$$\epsilon_i = \rho\,\epsilon_{i-1} + u_i$$

where the u_i's are independent, normal and have constant variance, and ρ is the lag-1 correlation. This is the autoregressive model of order 1 (the AR(1) model).
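To make the model concrete, a small sketch (seed, series length and ρ are arbitrary choices): simulate a long AR(1) series and check that its lag-1 correlation is close to ρ.

set.seed(42)
rho <- 0.7
eps <- as.numeric(arima.sim(list(ar = rho), n = 5000))  # AR(1) errors, normal innovations
cor(eps[-1], eps[-length(eps)])                         # should be close to 0.7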

Page 14: Durbin-Watson test (2)

When ρ = 0, the errors are independent, so the DW test tests independence by testing ρ = 0. ρ is estimated by

$$\hat{\rho} = \frac{\sum_{i=2}^{n} e_i e_{i-1}}{\sum_{i=1}^{n} e_i^2}$$
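The estimator translates directly into R; a sketch, with lm.obj again standing for a generic fitted model:

e <- residuals(lm.obj)
n <- length(e)
rhohat <- sum(e[-1] * e[-n]) / sum(e^2)  # sum of e_i * e_(i-1) over the sum of squares
rhohat
cor(e[-1], e[-n])  # the cor() shortcut used later in this lecture gives a similar value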

Page 15: Durbin-Watson test (3)

The DW test statistic is

$$DW = \frac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2} \approx 2(1 - \hat{\rho})$$

The value of DW is between 0 and 4:

• Values of DW around 2 are consistent with independence
• Values close to 0 indicate positive serial correlation
• Values close to 4 indicate negative serial correlation
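A sketch computing the statistic both ways (the two differ slightly because of the end terms e_1 and e_n):

e <- residuals(lm.obj)
DW.exact  <- sum(diff(e)^2) / sum(e^2)                        # the definition above
DW.approx <- 2 * (1 - sum(e[-1] * e[-length(e)]) / sum(e^2))  # 2(1 - rhohat)
c(DW.exact, DW.approx)                                        # close but not identical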

Page 16: Durbin-Watson test (4)

There exist values dL and dU, depending on the number of variables k in the regression and the sample size n (see the table on the next slide).

Use the value of DW to decide on independence as follows:

• 0 ≤ DW < dL: positive autocorrelation
• dL ≤ DW ≤ dU: inconclusive
• dU < DW < 4 - dU: consistent with independence
• 4 - dU ≤ DW ≤ 4 - dL: inconclusive
• 4 - dL < DW ≤ 4: negative autocorrelation
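The bands above translate into a small helper function; a sketch (dL and dU must be looked up in a Durbin-Watson table for the relevant k and n, so they are inputs here; the dU in the example call is purely illustrative):

dw.decision <- function(DW, dL, dU) {
  if (DW < dL)           "positive autocorrelation"
  else if (DW <= dU)     "inconclusive"
  else if (DW < 4 - dU)  "consistent with independence"
  else if (DW <= 4 - dL) "inconclusive"
  else                   "negative autocorrelation"
}
dw.decision(1.11, dL = 1.34, dU = 1.58)  # returns "positive autocorrelation"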

Page 17: Durbin-Watson table

[Table of dL and dU critical values by k and n; not reproduced in the transcript.]

Page 18: Example: the advertising data

Data on monthly sales and advertising spend for 35 months.

Model is Sales ~ spend + prev.spend

(prev.spend = spend in previous month)
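A sketch of how the lagged predictor could have been built, assuming the raw data frame held only spend and sales (the 15 is the month-0 spend implied by the first row of the listing on the next page):

n <- nrow(ad.df)
ad.df$prev.spend <- c(15, ad.df$spend[-n])  # shift spend down by one month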

Page 19: Advertising data

> ad.df
   spend prev.spend sales
1     16         15  20.5
2     18         16  21.0
3     27         18  15.5
4     21         27  15.3
5     49         21  23.5
6     21         49  24.5
7     22         21  21.3
8     28         22  23.5
9     36         28  28.0
10    40         36  24.0
11     3         40  15.5
12    21          3  17.3
... 35 lines in all

Page 20: R code for residual vs previous plot

advertising.lm <- lm(sales ~ spend + prev.spend, data = ad.df)
res <- residuals(advertising.lm)
n <- length(res)
plot.res <- res[-1]
prev.res <- res[-n]
plot(prev.res, plot.res, xlab = "previous residual", ylab = "residual",
     main = "Residual versus previous residual \n for the advertising data")
abline(coef(lm(plot.res ~ prev.res)), col = "red", lwd = 2)

Page 21

[Figure: "Residual versus previous residual for the advertising data", with fitted red trend line; both axes roughly -5 to 5.]

Page 22: Time series plot and correlogram: R code

par(mfrow = c(2, 1))
plot(res, type = "b", xlab = "Time Sequence", ylab = "Residual",
     main = "Time series plot of residuals for the advertising data")
abline(h = 0, lty = 2, lwd = 2, col = "blue")
acf(res, main = "Correlogram of residuals for the advertising data")

Page 23

[Figure: "Time series plot of residuals for the advertising data" (top; time sequence 0 to 35) and "Correlogram of residuals for the advertising data" (bottom; ACF versus lag 0 to 15). Slide annotation: "Increasing trend?"]

Page 24: Calculating DW

> rhohat <- cor(plot.res, prev.res)
> rhohat
[1] 0.4450734
> DW <- 2*(1 - rhohat)
> DW
[1] 1.109853

For n = 35 and k = 2, dL = 1.34. Since DW = 1.109 < dL = 1.34, there is strong evidence of positive serial correlation.

Page 25: Durbin-Watson table

Use (1.28 + 1.39)/2 ≈ 1.34, interpolating between the tabulated dL values. [Table not reproduced in the transcript.]

Page 26: Remedy (1)

If we detect serial correlation, we need to fit special time series models to the data.

For full details see STATS 326/726.

Assuming that the AR(1) model is OK, we can use the arima function in R to fit the regression.

Page 27: Fitting a regression with AR(1) errors

> arima(ad.df$sales, order = c(1,0,0), xreg = cbind(spend, prev.spend))

Call:
arima(x = ad.df$sales, order = c(1, 0, 0), xreg = cbind(spend, prev.spend))

Coefficients:
         ar1  intercept   spend  prev.spend
      0.4966    16.9080  0.1218      0.1391
s.e.  0.1580     1.6716  0.0308      0.0316

sigma^2 estimated as 9.476:  log likelihood = -89.16,  aic = 188.32

Page 28: Comparisons

                       lm             arima
Const (std err)        15.60 (1.34)   16.90 (1.67)
Spend (std err)        0.142 (0.035)  0.128 (0.031)
Prev spend (std err)   0.166 (0.036)  0.139 (0.031)
1st-order correlation  0.442          0.497
Sigma                  3.652          3.078
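A sketch of where these numbers come from (fit names assumed: advertising.lm from the earlier slide, ad.arima for the arima() fit):

ad.arima <- arima(ad.df$sales, order = c(1, 0, 0),
                  xreg = with(ad.df, cbind(spend, prev.spend)))
summary(advertising.lm)$coefficients[, 1:2]                     # lm estimates and std errors
cbind(est = ad.arima$coef, se = sqrt(diag(ad.arima$var.coef)))  # arima equivalents
summary(advertising.lm)$sigma                                   # lm residual std error
sqrt(ad.arima$sigma2)                                           # arima innovation std dev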

Page 29: Remedy (2)

Recall there was a trend in the time series plot of the residuals: the residuals seem related to time.

Thus, time is a "lurking variable", a variable that should be in the regression but isn't.

Try the model Sales ~ spend + prev.spend + time.

Page 30: Fitting the new model

time <- 1:35
new.advertising.lm <- lm(sales ~ spend + prev.spend + time, data = ad.df)
res <- residuals(new.advertising.lm)
n <- length(res)
plot.res <- res[-1]
prev.res <- res[-n]
DW <- 2*(1 - cor(plot.res, prev.res))

Page 31: DW Retest

DW is now 1.73. For a model with 3 explanatory variables, dU is about 1.66 (refer to the table), so there is no evidence of serial correlation.

Time is a highly significant variable in the regression.

Problem is fixed!
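As a cross-check, assuming the add-on package lmtest is installed, its dwtest() computes the DW statistic together with a p-value, avoiding the table lookup:

library(lmtest)
dwtest(advertising.lm)      # original model: should signal positive autocorrelation
dwtest(new.advertising.lm)  # model with time added: should no longer be significant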