STATS 330: Lecture 12 (2/25/2016)
05/07/23 330 lecture 12 1
STATS 330: Lecture 12
Diagnostics 4
Aim of today’s lecture
To discuss diagnostics for independence
Independence
One of the regression assumptions is that the errors are independent.
Data collected sequentially over time often have errors that are not independent.
If the independence assumption does not hold, then the standard errors will be wrong and the tests and confidence intervals will be unreliable.
Thus, we need to be able to detect lack of independence.
Types of dependence
If large positive errors have a tendency to follow large positive errors, and large negative errors a tendency to follow large negative errors, we say the data have positive autocorrelation.
If large positive errors have a tendency to follow large negative errors, and large negative errors a tendency to follow large positive errors, we say the data have negative autocorrelation.
Diagnostics
If the errors are positively autocorrelated,
• Plotting the residuals against time will show long runs of positive and negative residuals
• Plotting residuals against the previous residual (i.e. $e_i$ vs $e_{i-1}$) will show a positive trend
• A correlogram of the residuals will show positive spikes, gradually decaying
Diagnostics (2)
If the errors are negatively autocorrelated,
• Plotting the residuals against time will show alternating positive and negative residuals
• Plotting residuals against the previous residual (i.e. $e_i$ vs $e_{i-1}$) will show a negative trend
• A correlogram of the residuals will show alternating positive and negative spikes, gradually decaying
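The patterns listed on the two Diagnostics slides can be checked on simulated errors. The following is a minimal sketch, assuming errors generated by `arima.sim()` from R's stats package with lag 1 coefficients 0.9 and -0.9. The seed, the series length and the helper names `lag1` and `sign.changes` are illustrative choices, not part of the lecture.

```r
set.seed(330)
n <- 200

# Simulated error sequences (assumption: first-order autoregressive errors)
pos <- as.numeric(arima.sim(model = list(ar = 0.9), n = n))   # positive autocorrelation
neg <- as.numeric(arima.sim(model = list(ar = -0.9), n = n))  # negative autocorrelation
ind <- rnorm(n)                                               # independent errors

# Correlation between each value and the previous one
lag1 <- function(e) cor(e[-1], e[-length(e)])

# Number of times the sequence changes sign: few changes = long runs,
# many changes = alternating pattern
sign.changes <- function(e) sum(diff(sign(e)) != 0)

sims <- list(positive = pos, independent = ind, negative = neg)
sapply(sims, lag1)
sapply(sims, sign.changes)
```

Positively autocorrelated errors should give a positive lag 1 correlation and few sign changes (long runs); negatively autocorrelated errors should give a negative lag 1 correlation and many sign changes.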
Residuals against time
res <- residuals(lm.obj)
# The "x" vector can be omitted if it is just the sequence numbers
plot(1:length(res), res, xlab = "time", ylab = "residuals", type = "b")  # type = "b": dots and lines
lines(1:length(res), res)
abline(h = 0, lty = 2)  # dashed line at 0 (the mean residual)
[Figure: residuals plotted against time, three panels: Autocorrelation = 0.9, Autocorrelation = 0.0, Autocorrelation = -0.9]
Residuals against previous
res <- residuals(lm.obj)
n <- length(res)
plot.res <- res[-1]   # element 1 has no previous
prev.res <- res[-n]   # have to be equal length
plot(prev.res, plot.res, xlab = "previous residual", ylab = "residual")
Plots for different degrees of autocorrelation
[Figure: residual plotted against previous residual, three panels: Autocorrelation = 0.9, Autocorrelation = 0.0, Autocorrelation = -0.9]
Correlogram
acf(residuals(lm.obj))
The correlogram (autocorrelation function, acf) is a plot of the lag k autocorrelation versus k.
The lag k autocorrelation is the correlation of residuals k time units apart.
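To make the definition concrete, the sketch below computes the lag k autocorrelation by hand and compares it with the values returned by `acf()`. It assumes a simulated series in place of real residuals; the function name `lag.k.autocorr` is illustrative. Note that `acf()` divides the lag k sum of cross-products by the full sum of squares, which differs slightly from calling `cor()` on the two lagged vectors.

```r
set.seed(1)
# Stand-in for a residual vector (assumption: simulated data)
e <- as.numeric(arima.sim(model = list(ar = 0.9), n = 100))

# Lag k autocorrelation using the same estimator as acf():
# sum of products of deviations k apart, divided by the total sum of squares
lag.k.autocorr <- function(e, k) {
  n <- length(e)
  d <- e - mean(e)
  sum(d[(k + 1):n] * d[1:(n - k)]) / sum(d^2)
}

by.hand  <- sapply(0:5, function(k) lag.k.autocorr(e, k))
from.acf <- acf(e, lag.max = 5, plot = FALSE)$acf[, 1, 1]

all.equal(by.hand, from.acf)
```

The first element (lag 0) is always 1, which is why every correlogram starts with a spike of height 1.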
[Figure: correlograms (ACF against Lag, lags 0 to 20), three panels: Autocorrelation = 0.9, Autocorrelation = 0.0, Autocorrelation = -0.9]
Durbin-Watson test
We can also do a formal hypothesis test (the Durbin-Watson test) for independence.
The test assumes the errors follow a model of the form
$$\varepsilon_i = \rho\,\varepsilon_{i-1} + u_i$$
where the $u_i$’s are independent, normal and have constant variance.
NB: $\rho$ is the lag 1 correlation; this is the autoregressive model of order 1.
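A short derivation links this model to the correlogram shapes seen earlier. It is a sketch under the first-order autoregressive model $\varepsilon_i = \rho\,\varepsilon_{i-1} + u_i$, with two added assumptions that the slides do not state: $|\rho| < 1$ and a stationary error sequence (so every $\varepsilon_i$ has the same variance). Because $u_i$ is independent of all earlier errors, for $k \ge 1$:

```latex
\begin{aligned}
\operatorname{Cov}(\varepsilon_i, \varepsilon_{i-k})
  &= \operatorname{Cov}(\rho\,\varepsilon_{i-1} + u_i,\ \varepsilon_{i-k})
   = \rho\,\operatorname{Cov}(\varepsilon_{i-1}, \varepsilon_{i-k}) \\
\Rightarrow\quad
\operatorname{Corr}(\varepsilon_i, \varepsilon_{i-k})
  &= \rho\,\operatorname{Corr}(\varepsilon_{i-1}, \varepsilon_{i-k})
   = \dots = \rho^{k}
\end{aligned}
```

So for positive $\rho$ the lag k correlations are positive and decay gradually, and for negative $\rho$ they alternate in sign while decaying, matching the correlogram descriptions given on the Diagnostics slides.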
Durbin-Watson test (2)
When $\rho = 0$, the errors are independent.
The DW test tests independence by testing $\rho = 0$.
$\rho$ is estimated by
$$\hat{\rho} = \frac{\sum_{i=2}^{n} e_i e_{i-1}}{\sum_{i=2}^{n} e_i^2}$$
Durbin-Watson test (3)
The DW test statistic is
$$DW = \frac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=2}^{n} e_i^2} \approx 2(1 - \hat{\rho})$$
The value of DW is between 0 and 4.
Values of DW around 2 are consistent with independence.
Values close to 4 indicate negative serial correlation.
Values close to 0 indicate positive serial correlation.
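The relationship between the DW statistic and the estimated lag 1 correlation can be checked numerically. The sketch below assumes simulated data with positively autocorrelated errors (the model, seed and variable names are illustrative). It computes the estimate of the lag 1 correlation as the sum of products of successive residuals over the sum of squared residuals (both sums from i = 2), the DW statistic as the sum of squared successive differences over the same denominator, and the approximation 2 times (1 minus the estimate).

```r
set.seed(12)
n <- 100
x <- 1:n
# Assumption: regression with autocorrelated errors (lag 1 coefficient 0.6)
y <- 2 + 0.5 * x + as.numeric(arima.sim(model = list(ar = 0.6), n = n))

e <- residuals(lm(y ~ x))

rho.hat   <- sum(e[-1] * e[-n]) / sum(e[-1]^2)   # estimated lag 1 correlation
DW        <- sum(diff(e)^2) / sum(e[-1]^2)       # diff(e) gives e_i - e_(i-1), i = 2..n
DW.approx <- 2 * (1 - rho.hat)

c(rho.hat = rho.hat, DW = DW, DW.approx = DW.approx)
```

The exact and approximate values differ only by a term involving the first and last squared residuals, so they agree closely unless n is small. The "Calculating DW" slide later uses `cor()` of the lagged residuals as a further approximation to the estimate.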
Durbin-Watson test (4)
There exist values dL, dU depending on the number of variables k in the regression and the sample size n (see the table on the next slide).
Use the value of DW to decide on independence as follows:
• 0 to dL: positive autocorrelation
• dL to dU: inconclusive
• dU to 4-dU: independence
• 4-dU to 4-dL: inconclusive
• 4-dL to 4: negative autocorrelation
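The decision rule can be written as a small helper function. This is a sketch only: the function name `dw.decision` is hypothetical, and the value dU = 1.58 in the usage lines is an illustrative number not given in the slides (dL = 1.34 is the value used later for the advertising data).

```r
# Classify a Durbin-Watson statistic given the table values dL and dU
dw.decision <- function(DW, dL, dU) {
  stopifnot(DW >= 0, DW <= 4, dL <= dU)
  if (DW < dL) {
    "positive autocorrelation"
  } else if (DW < dU) {
    "inconclusive"
  } else if (DW <= 4 - dU) {
    "independence"
  } else if (DW <= 4 - dL) {
    "inconclusive"
  } else {
    "negative autocorrelation"
  }
}

# Usage (dU = 1.58 is an assumed, illustrative value)
dw.decision(1.11, dL = 1.34, dU = 1.58)
dw.decision(2.00, dL = 1.34, dU = 1.58)
```

How the boundary points themselves are classified is a convention chosen here; the slide's diagram does not specify it.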
Durbin-Watson table
[Table image: values of dL and dU by sample size n and number of variables k]
Example: the advertising data
Sales and advertising data
Data on monthly sales and advertising spend for 35 months
Model is Sales ~ spend + prev.spend (prev.spend = spend in previous month)
Advertising data
> ad.df
   spend prev.spend sales
1     16         15  20.5
2     18         16  21.0
3     27         18  15.5
4     21         27  15.3
5     49         21  23.5
6     21         49  24.5
7     22         21  21.3
8     28         22  23.5
9     36         28  28.0
10    40         36  24.0
11     3         40  15.5
12    21          3  17.3
… 35 lines in all
R code for residual vs previous plot
advertising.lm <- lm(sales ~ spend + prev.spend, data = ad.df)
res <- residuals(advertising.lm)
n <- length(res)
plot.res <- res[-1]
prev.res <- res[-n]
plot(prev.res, plot.res, xlab = "previous residual", ylab = "residual",
     main = "Residual versus previous residual \n for the advertising data")
# Add the least squares line of residual on previous residual
abline(coef(lm(plot.res ~ prev.res)), col = "red", lwd = 2)
[Figure: Residual versus previous residual for the advertising data (residual against previous residual, with fitted line)]
Time series plot, correlogram: R code
par(mfrow = c(2, 1))
plot(res, type = "b", xlab = "Time Sequence", ylab = "Residual",
     main = "Time series plot of residuals for the advertising data")
abline(h = 0, lty = 2, lwd = 2, col = "blue")
acf(res, main = "Correlogram of residuals for the advertising data")
[Figure, top panel: Time series plot of residuals for the advertising data (Residual against Time Sequence), annotated "Increasing trend?"]
[Figure, bottom panel: Correlogram of residuals for the advertising data (ACF against Lag)]
Calculating DW
> rhohat <- cor(plot.res, prev.res)
> rhohat
[1] 0.4450734
> DW <- 2*(1-rhohat)
> DW
[1] 1.109853
For n = 35 and k = 2, dL = 1.34. Since DW = 1.11 < dL = 1.34, there is strong evidence of positive serial correlation.
Durbin-Watson table
Use the average of two table entries for dL: (1.28 + 1.39)/2 ≈ 1.34
Remedy (1)
If we detect serial correlation, we need to fit special time series models to the data.
For full details see STATS 326/726.
Assuming that the AR(1) model is ok, we can use the arima function in R to fit the regression.
Fitting a regression with AR(1) errors
> arima(ad.df$sales,order=c(1,0,0), xreg=cbind(spend,prev.spend))
Call:
arima(x = ad.df$sales, order = c(1, 0, 0), xreg = cbind(spend, prev.spend))

Coefficients:
         ar1  intercept   spend  prev.spend
      0.4966    16.9080  0.1218      0.1391
s.e.  0.1580     1.6716  0.0308      0.0316

sigma^2 estimated as 9.476: log likelihood = -89.16, aic = 188.32
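Since the advertising data are only partly listed in these slides, the same kind of fit can be tried on simulated data. The sketch below assumes a single explanatory variable `x` and first-order autoregressive errors with lag 1 coefficient 0.5; all names and parameter values are illustrative. It fits the ordinary regression with `lm()` and the regression with AR(1) errors with `arima()`, passing the explanatory variable through `xreg`.

```r
set.seed(326)
n <- 200
x <- rnorm(n)
# Assumption: true model y = 10 + 2x + AR(1) error
err <- as.numeric(arima.sim(model = list(ar = 0.5), n = n))
y <- 10 + 2 * x + err

fit.lm  <- lm(y ~ x)                                          # ignores the autocorrelation
fit.ar1 <- arima(y, order = c(1, 0, 0), xreg = cbind(x = x))  # regression with AR(1) errors

coef(fit.lm)
coef(fit.ar1)               # ar1, intercept and the coefficient of x
sqrt(diag(vcov(fit.ar1)))   # standard errors of the arima estimates
```

The `ar1` coefficient plays the role of the lag 1 correlation in the error model, and the remaining coefficients are the regression coefficients.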
Comparisons
                         lm               arima
Const (std err)          15.60 (1.34)     16.91 (1.67)
Spend (std err)          0.142 (0.035)    0.122 (0.031)
Prev Spend (std err)     0.166 (0.036)    0.139 (0.032)
1st order correlation    0.442            0.497
Sigma                    3.652            3.078
Remedy (2)
Recall there was a trend in the time series plot of the residuals: the residuals seem to be related to time.
Thus, time is a “lurking variable”, a variable that should be in the regression but isn’t.
Try model Sales ~ spend + prev.spend + time
Fitting new model
time <- 1:35
new.advertising.lm <- lm(sales ~ spend + prev.spend + time, data = ad.df)
res <- residuals(new.advertising.lm)
n <- length(res)
plot.res <- res[-1]
prev.res <- res[-n]
DW <- 2*(1 - cor(plot.res, prev.res))
DW Retest
DW is now 1.73.
For a model with 3 explanatory variables, dU is about 1.66 (refer to the table). Since dU = 1.66 < DW = 1.73 < 4 - dU = 2.34, there is no evidence of serial correlation.
Time is a highly significant variable in the regression
Problem is fixed!
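The effect of a lurking time variable can be reproduced on simulated data. The sketch below assumes independent errors plus a linear time trend that the first model omits; the data, seed and the helper name `dw` are illustrative and are not the advertising data. The helper uses the same approximation as the slides, 2*(1 - lag 1 correlation of the residuals).

```r
set.seed(35)
n <- 100
time <- 1:n
x <- rnorm(n)
# Assumption: errors are independent, but the response drifts upward with time
y <- 5 + 2 * x + 0.1 * time + rnorm(n)

# Approximate Durbin-Watson statistic from a fitted lm object
dw <- function(fit) {
  e <- residuals(fit)
  2 * (1 - cor(e[-1], e[-length(e)]))
}

dw.without <- dw(lm(y ~ x))          # time omitted: trend ends up in the residuals
dw.with    <- dw(lm(y ~ x + time))   # time included

c(without.time = dw.without, with.time = dw.with)
```

When time is omitted, the trend is left in the residuals and DW falls well below 2; once time is included, DW moves back towards 2, which is the pattern seen for the advertising data.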