
STAT 479 Sample Project

D. P. Wiens

Note This is a sample project, as might be prepared by a STAT 479 student.

It uses techniques as discussed in Part I of the course. These were modified,

as I thought about the data and its analysis more closely. Then as well

techniques from Part II were applied in the second month. This current version

incorporates final revisions to the previous parts, as well as the frequency

domain methods of Part III.

The data set I analyze was found by first going to the “Data” section of

Henryk Kolacz’s “Statistics on the Web” compilation on the course web site,

then following various links to

www.sci.usq.edu.au/staff/dunn/Datasets/tech-timeseries.html,

where I found a link to “Daily rainfall, max and min temperatures in Toowoomba,

Australia”.

In §1 the various series in the data are examined, and possible relationships

are hypothesized. A time domain analysis is carried out in §2, and a frequency

domain analysis in §3. The findings are summarized in §4.

1 Introduction and Exploratory Data Analysis

The data set, found as described above, is actually a set of data for four

locations in the state of Queensland, Australia. One location is Toowoomba;

the other three are Emerald, Gatton and Jondaryn - see Figure 1. Toowoomba

(west of the coastal city of Brisbane) and Emerald (west of the coastal city

of Rockhampton) are on the map; Jondaryn is about 40 km northwest of

Toowoomba and Gatton is between Toowoomba and Brisbane. Students who

have driven from Edmonton to Calgary might be interested to note the city of

Innisfail just south of the northern resort city of Cairns. Innisfail is in turn

just down the road from the much smaller village of Edmonton (!) - see Figure

2.

The data give the daily rainfall and the daily maximum and minimum temperatures at each location

from 1 January 1889 to 15 September 2002 (from 1 January 1889 to 22 July

2002 for Toowoomba), as well as data on a number of other variables. The

complete description, taken from the web site, is


Figure 1: Queensland, Australia.

Variables:

Year     The year
Month    The month
Day      The day within each month
maxt     The daily maximum temperature
mint     The daily minimum temperature
rain     The daily rainfall
radn     Presumably radiation, in MJ/m^2
pan      Pan evaporation (in mm)
vpd      Maximum vapour pressure deficit (in hPa)

Data Quality:

There are no missing values. The temperature data before 1920 (at least) appears filled for Toowoomba (the data is almost exactly the same for each year before 1920).

Figure 2: An effect of global warming.

Source:

From the Queensland Department of Primary Industries.

The meanings of most of these variables are clear; vpd and pan are two

with which I was unfamiliar. I learned that “vapour pressure deficit” refers to

the ability of the atmosphere to absorb water vapour; this in turn controls the

rate at which water evaporates from the soil. When this rate is high, more

water must be provided to crops. The prediction of vpd is then particularly

important to agricultural scientists in dry climates such as that in Australia.

A search, with the aid of Google, led to the paper Wang, Smith, Bond and Verburg (2004). These authors are associated with Australia’s Commonwealth Scientific and Industrial Research Organisation (CSIRO) Land and Water group. The abstract to this paper reads in part:

“Vapour pressure deficit (VPD) has a significant effect on the amount of water required by the crop to maintain optimal growth. Data required to calculate the mean VPD on a daily basis are rarely available, and most models use approximations to estimate it. In APSIM (Agricultural Production Systems Simulator), VPD is estimated from daily maximum and minimum temperatures with the assumption that the minimum temperature equals dew point, and there is little change in vapour pressure or dew point during any one day. The accuracy of such VPD estimations was assessed using data collected every 15 min near Wagga Wagga in New South Wales, Australia. ... the prediction of vapour pressure was poor. Vapour pressure at 0900 hours was a better estimate of daily mean vapour pressure. ... Simulations using historical weather data for 1957-2002 show that such improved accuracy in daytime VPD estimation slightly increased simulated crop yield and deep drainage, while slightly reducing crop water uptake. Comparison of the APSIM RUE/TE and CERES-Wheat approaches for modelling potential transpiration revealed differences in crop water demand estimated by the two approaches. Although the differences had a small effect on the probability distribution of simulated long-term wheat yield, water uptake, and deep drainage, this finding highlights the need for a scientific re-appraisal of the APSIM RUE/TE and energy balance approaches for the estimation of crop demand, which will have implications for modelling crop growth under water-limited conditions and calculation of water required to maintain maximum growth.”

Figure 3: Complete data for Toowoomba. [Panels: maxt, mint, rain, radn, pan and vpd, each plotted against date, 1900-2000.]

The variable pan (“pan evaporation”) is a measure which incorporates temperature, humidity, wind and solar radiation in order to estimate evaporation rates; it is apparently highly correlated with vpd. It is measured merely by observing the rate at which water evaporates from a “pan” of a particular standard shape.

Figure 4: Minimum temperatures by location. [Panels: too, em, gat, jon; mint plotted against Time.]

These considerations led me to wonder how well one might estimate and

predict vpd, using the methods of this course, from some or all of the variables

maxt, mint, rain, radn and pan. All are observed over time, and so I will be

attempting to predict one time series using one or more other time series.

Much of the data gathered pre-1970 exhibit suspiciously constant fluctuations, as though the values were truncated above and below. Figure 3, for Toowoomba, is typical. Thus I chose to analyze only the data from January 1, 1970 onwards. These are plotted for the variable mint in Figure 4. This plot is typical in showing that there is little variation from one location to another, as we would expect since the locations are quite close to each other. For this reason I chose to combine the four locations by averaging. This resulted in just one series for each of the six variables, each containing 11,891 data values - one per day for over 32 years. Motivated partly by the fact that R responds very slowly to data sets of this size, I decided to reduce the data further by studying only the weekly averages. Thus the first seven points of each series were averaged, then the second set of seven, etc. (A fortuitous side effect of this was - as could be foreseen from the Central Limit Theorem - that the normality of each series was greatly improved.) These weekly averages are plotted against time in Figure 5, and against each other in Figure 6. The acf of vpd, and the ccf of vpd with each of the other five series, are plotted in Figure 7. These series are highly correlated with vpd at almost all lags, and so one expects them to predict well.
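The two averaging steps described above can be sketched as follows, in plain Python rather than R and on toy numbers. The function names, the two toy location series and the decision to drop an incomplete final week are assumptions made for illustration; the text states only that consecutive sets of seven points were averaged.

```python
from statistics import fmean


def average_across_locations(series_by_location):
    """Average several equal-length daily series point by point."""
    return [fmean(values) for values in zip(*series_by_location)]


def weekly_averages(daily, block=7):
    """Average consecutive non-overlapping blocks of `block` days.

    An incomplete final block is dropped (an assumption of this sketch).
    """
    n_blocks = len(daily) // block
    return [fmean(daily[i * block:(i + 1) * block]) for i in range(n_blocks)]


# Toy daily series for two hypothetical locations (15 days each).
loc_a = [float(d) for d in range(15)]
loc_b = [float(d) + 2.0 for d in range(15)]

combined = average_across_locations([loc_a, loc_b])
weekly = weekly_averages(combined)
print(weekly)

# Number of complete weeks in 11,891 daily values.
n_weeks = 11891 // 7
print(n_weeks)
```

Under this convention 11,891 daily values give 1698 complete weeks, which agrees with the length of the weekly series used in the rest of the analysis.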

There was one further data reduction method applied. I regressed vpd on

the other variables, obtaining the following output.

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.197411 0.285515 4.194 2.88e-05 ***

Xmaxt 0.270878 0.019928 13.593 < 2e-16 ***

Xmint 0.835236 0.012704 65.745 < 2e-16 ***

Xrain 0.055367 0.008581 6.452 1.44e-10 ***

Xradn 0.063401 0.015831 4.005 6.47e-05 ***

Xpan -0.975426 0.047070 -20.723 < 2e-16 ***

Residual standard error: 0.9386 on 1692 degrees of freedom

Multiple R-squared: 0.9557, Adjusted R-squared: 0.9555

F-statistic: 7293 on 5 and 1692 DF, p-value: < 2.2e-16

From this I inferred that perhaps a reasonable, single predictor based on the other variables could be obtained from the (estimated) regression equation, i.e. from the linear combination prd defined by

$$\mathrm{prd} = 0.3 \cdot \mathrm{maxt} + 0.8 \cdot \mathrm{mint} + 0.06 \cdot \mathrm{rain} + 0.06 \cdot \mathrm{radn} - \mathrm{pan}.$$

Here the estimated regression coefficients have been rounded.
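As a small illustration, the combined predictor can be written as a function of the five weekly averages. This sketch assumes the rounding suggested by the regression output (0.27 to 0.3, 0.84 to 0.8, 0.055 and 0.063 to 0.06, and -0.98 to -1) and assumes that the intercept is not part of prd; the input values are hypothetical and serve only to exercise the function.

```python
def prd(maxt, mint, rain, radn, pan):
    """Rounded linear combination of the five predictors (no intercept assumed)."""
    return 0.3 * maxt + 0.8 * mint + 0.06 * rain + 0.06 * radn - pan


# Hypothetical weekly averages, not taken from the data set.
example = prd(maxt=25.0, mint=12.0, rain=2.0, radn=20.0, pan=5.0)
print(round(example, 2))
```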

In summary, after all of this I had arrived at a reduced set of data containing

1698 weekly average values of vpd, and 1698 corresponding values of prd. I

hope to develop models allowing one to predict future values of vpd from past

and present values of vpd and prd. As a preliminary attempt at this I regressed

vpd on its own lag-1 values and on prd, obtaining the following output as well

as the residual plots in Figure 8.

dynlm(formula = vpd ~ L(vpd, 1) + prd)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 0.520996 0.083888 6.211 6.63e-10 ***

L(vpd, 1) 0.079722 0.009221 8.646 < 2e-16 ***

prd 0.952589 0.009637 98.844 < 2e-16 ***

Residual standard error: 0.9186 on 1694 degrees of freedom

Multiple R-squared: 0.9575, Adjusted R-squared: 0.9574

F-statistic: 1.907e+04 on 2 and 1694 DF, p-value: < 2.2e-16


The acf and pacf in Figure 8 indicate that the residuals from this regression

are far from independent. As well, there is a “U-shaped” trend in the plot

against the fitted values. Time series methods should yield substantially better

models.

The major question to be investigated is thus:

Using the methods of this course, can I develop models to predict vpd from prd,

which are better than merely using past values of vpd as predictors? (1)

Figure 5: Data, averaged across locations and weeks, by variable. [Panels: maxt, mint, rain, radn, pan, vpd, each plotted against Time.]

Figure 6: “Pairs” plot of vpd and predictors vs. each other.

Figure 7: Autocorrelation function of vpd, and cross-correlations of vpd with each of the five possible predictor series. [Panels: Series vpd; vpd & maxt; vpd & mint; vpd & rain; vpd & radn; vpd & pan.]

Figure 8: Residuals from regression of vpd on its lag-1 values and on prd. [Panels: standardized residuals vs. fits; ACF and PACF of resids.]

Figure 9: Sample ACF and PACF of vpd and prd.

2 Time Domain Analysis

I first looked for stationary ARIMA models for vpd and prd individually. The acf and pacf plots in Figure 9 indicate that neither series is stationary but, as indicated by the plots in Figure 10, the differenced series are stationary. Each appears to be ARMA(p,q) with p in the range 1-3 and q = 1. I saw no signs of seasonality. A ccf plot (not shown) shows the differenced series to be highly correlated at lags 0 and ±1.

I fitted all ARIMA(p,d,q) models for p ∈ {0, 1, 2, 3, 4}, d ∈ {1, 2} and q ∈ {0, 1, 2}, and selected the “best” according to one of the three information criteria. For vpd, AIC and AICc both chose ARIMA(4,1,2) and BIC chose ARIMA(2,1,2). The latter fared poorly with respect to the Ljung-Box test for independence of the residuals, whereas the former seemed to “overcompensate”, and was even worse. But ARIMA(3,1,2) seems to be quite satisfactory - see Figure 11. This model was also chosen by AIC and AICc for prd (BIC chose instead p = 2) and again seems to fit well - Figure 12. In each case even the normality is satisfactory.
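To illustrate why BIC can prefer a smaller model than AIC and AICc, here is a minimal Python sketch using the textbook definitions of the three criteria. The log-likelihoods and parameter counts below are made up for illustration and are not the fitted values; the scaling may also differ from what sarima() reports. Only the candidate grid and the series length come from the text.

```python
import math
from itertools import product


def aic(loglik, k):
    """Akaike information criterion (textbook form)."""
    return -2.0 * loglik + 2.0 * k


def aicc(loglik, k, n):
    """AIC with the small-sample correction term."""
    return aic(loglik, k) + 2.0 * k * (k + 1) / (n - k - 1)


def bic(loglik, k, n):
    """Bayesian information criterion: penalty log(n) per parameter."""
    return -2.0 * loglik + k * math.log(n)


# Candidate orders (p, d, q) as described in the text: 5 * 2 * 3 = 30 models.
grid = list(product(range(5), (1, 2), range(3)))

n = 1697  # length of a once-differenced series of 1698 weekly values

# Hypothetical log-likelihoods; k counts AR, MA and variance parameters.
fits = {
    "ARIMA(2,1,2)": {"loglik": -3780.0, "k": 5},
    "ARIMA(4,1,2)": {"loglik": -3776.0, "k": 7},
}

best_aic = min(fits, key=lambda m: aic(fits[m]["loglik"], fits[m]["k"]))
best_aicc = min(fits, key=lambda m: aicc(fits[m]["loglik"], fits[m]["k"], n))
best_bic = min(fits, key=lambda m: bic(fits[m]["loglik"], fits[m]["k"], n))
print(len(grid), best_aic, best_aicc, best_bic)
```

With n = 1697 the BIC penalty per parameter is log(1697), about 7.4, against 2 for AIC, so a modest gain in likelihood from two extra AR terms can win under AIC and lose under BIC.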

Figure 10: Sample ACF and PACF of vpd and prd. [Panel titles: Series: diff(vpd) and Series: diff(prd); ACF and PACF plotted against LAG.]

The models are:

• Model 1: ARIMA(3,1,2) model for vpd.

Coefficients:

ar1 ar2 ar3 ma1 ma2 constant

1.2521 -0.2722 -0.0752 -1.7034 0.7997 0.0138

s.e. 0.0498 0.0399 0.0339 0.0421 0.0322 0.0552

sigma^2 estimated as 5.045

• Model 2: ARIMA(3,1,2) model for prd.

Coefficients:

ar1 ar2 ar3 ma1 ma2 constant

1.2408 -0.2905 -0.0675 -1.6858 0.8027 0.0129

s.e. 0.0398 0.0403 0.0302 0.0306 0.0268 0.0505

sigma^2 estimated as 4.333

These were fitted using the authors’ (Shumway & Stoffer) contributed function sarima(). This function fits a constant by default, but in each case the estimate of this constant was only a fraction of one standard error, and so the true value could well be zero.
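For reference, the ratios of the estimated constants to their standard errors, taken from the output for Models 1 and 2, are

$$\frac{0.0138}{0.0552} = 0.25, \qquad \frac{0.0129}{0.0505} \approx 0.26,$$

both far below the value of about 2 that would usually be taken to indicate a nonzero constant.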

I next looked at the problem of predicting vpd - with and without the aid

of prd. I decided to aim for 4-weeks ahead predictions. Without prd, the

(approximate) 95% prediction intervals are obtained from the R command

vpd.pr = sarima.for(vpd, n.ahead = 4, p = 3, d = 1, q = 2)

and are plotted in red in Figure 13. They are

$$8.05 \pm 4.49, \quad 7.69 \pm 5.12, \quad 7.28 \pm 5.62, \quad 6.90 \pm 6.07. \tag{2}$$

To predict with the aid of prd is more complicated. I first used the R arima() function to find an appropriate ARIMA model for vpd, which regresses on prd and then fits a time series model to the residuals from this regression. It turned out that when the regressor is included, an ARIMA(3,1,3) model is appropriate:

• Model 3: ARIMA(3,1,3) model for vpd after regressing on prd.

Call:

arima(x = vpd, order = c(3, 1, 3), xreg = prd)

Coefficients:

ar1 ar2 ar3 ma1 ma2 ma3 prd

2.0859 -1.2817 0.1465 -2.8538 2.7728 -0.9156 1.0198

s.e. 0.0244 0.0471 0.0241 0.0007 0.0019 0.0006 0.0056

sigma^2 estimated as 0.5665

Now predicting in this model requires that one have at hand future values

of prd to use in the regression, and so one must first predict prd from its

ARIMA(3,1,2) model - “Model 2” given above. These next four values are

obtained from a call to

prd.pr = sarima.for(prd, n.ahead = 4, p = 3, d = 1, q = 2)

and are

prd.pr$pred = 5.03 4.49 3.88 3.30 (3)

Using these in

predict(fit2.vpd, n.ahead = 4, newxreg = prd.pr$pred),

where fit2.vpd contains the output for “Model 3”, yields the predictions

$$7.59 \pm 1.51, \quad 7.00 \pm 1.55, \quad 6.36 \pm 1.56, \quad 5.71 \pm 1.56. \tag{4}$$

These predictions (4) are all somewhat lower than those in (2). As well,

the intervals are all substantially narrower.

The material in the rest of this section goes beyond what I would expect

of STAT 479 students, but I feel I should add it in order to point out that

the apparent improvement in the prediction intervals, as a result of including

the regressor in the model, might be misleading. A problem which has arisen

is that the method so far does not account for the variation incurred in the

prediction of the next four values of prd, i.e. in (3). To see this, note that the model being fitted is

$$\mathrm{vpd}_t = \beta \cdot \mathrm{prd}_t + x_t,$$

where $\beta$ is the regression coefficient and $x_t$ is ARIMA(3,1,3). So we are predicting the sum of two series $\{\beta \cdot \mathrm{prd}_t\}$ and $\{x_t\}$, and the relevant standard error of the prediction is that of the sum of these two predictions. The standard errors used in (4) are based only on the variation in $\{x_t\}$ - the regressors are treated as non-random.

It would be very involved to derive the exact standard error, but there is a simple bound on the standard deviation of a sum that can be applied:

$$\sigma_{X+Y} \le \sigma_X + \sigma_Y. \tag{5}$$

Reason: If $X$ and $Y$ are any two random variables, with standard deviations $\sigma_X$ and $\sigma_Y$ and covariance $\sigma_{XY}$, then the correlation satisfies

$$-1 \le \rho = \frac{\sigma_{XY}}{\sigma_X \sigma_Y} \le 1.$$

In particular, $-\sigma_X \sigma_Y \le \sigma_{XY} \le \sigma_X \sigma_Y$ and so, from the usual formula for the variance of the sum of two random variables,

$$\sigma^2_{X+Y} = \sigma_X^2 + \sigma_Y^2 + 2\sigma_{XY} \le \sigma_X^2 + \sigma_Y^2 + 2\sigma_X \sigma_Y = (\sigma_X + \sigma_Y)^2,$$

which is (5). (Using instead the lower bound on $\sigma_{XY}$ gives $\sigma_{X+Y} \ge |\sigma_X - \sigma_Y|$.)
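The bound (5) and its lower counterpart can be checked numerically on any pair of samples. The sketch below uses population standard deviations from Python's standard library; the sample values and the helper name sd_bounds_hold are arbitrary choices for illustration.

```python
import math
from statistics import pstdev


def sd_bounds_hold(x, y, tol=1e-12):
    """Check |sd(x) - sd(y)| <= sd(x + y) <= sd(x) + sd(y) on paired samples."""
    total = [a + b for a, b in zip(x, y)]
    sd_x, sd_y, sd_sum = pstdev(x), pstdev(y), pstdev(total)
    return abs(sd_x - sd_y) - tol <= sd_sum <= sd_x + sd_y + tol


x = [1.0, 2.0, 4.0, 7.0]
y = [3.0, 1.0, 5.0, 2.0]
bounds_ok = sd_bounds_hold(x, y)
print(bounds_ok)

# With perfect positive correlation the upper bound is attained.
y_scaled = [2.0 * a for a in x]
sum_scaled = [a + b for a, b in zip(x, y_scaled)]
bound_attained = math.isclose(pstdev(sum_scaled), pstdev(x) + pstdev(y_scaled))
print(bound_attained)
```

The second check shows why the resulting interval is conservative: the bound is attained only when the two components are perfectly positively correlated.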

Thus, a conservative prediction interval (one that, if anything, will be too wide) is obtained by using, as the standard errors of the predictions, those obtained from the prediction of prd:

prd.pr$se = 2.08 2.38 2.61 2.83 (6)

plus those obtained by predicting in Model 3. Equivalently, to get the usual “2 standard errors” prediction intervals I added twice the values in (6) to the limits in (4), obtaining final prediction intervals plotted in blue in Figure 13. Specifically, these intervals are, rather than (4), now given by

$$7.59 \pm 5.67, \quad 7.00 \pm 6.31, \quad 6.36 \pm 6.78, \quad 5.71 \pm 7.23$$

and are slightly wider than those in (2) computed without the regression on prd.
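The widening step is simple arithmetic, sketched below in Python. It takes the half-widths in (4) as 1.51, 1.55, 1.56, 1.56 and the standard errors in (6) as 2.08, 2.38, 2.61, 2.83; the variable names are illustrative only.

```python
# Half-widths of the Model 3 intervals in (4) and standard errors of prd in (6).
model3_half_widths = [1.51, 1.55, 1.56, 1.56]
prd_standard_errors = [2.08, 2.38, 2.61, 2.83]

# Half-widths of the intervals in (2), from the model without prd.
vpd_only_half_widths = [4.49, 5.12, 5.62, 6.07]

# Conservative half-width: Model 3 half-width plus two prd standard errors.
conservative = [
    round(h + 2.0 * s, 2)
    for h, s in zip(model3_half_widths, prd_standard_errors)
]
print(conservative)

# Every conservative interval is wider than the corresponding one in (2).
all_wider = all(c > v for c, v in zip(conservative, vpd_only_half_widths))
print(all_wider)
```

Starting from the rounded inputs the last half-width comes out as 7.22 rather than the reported 7.23, presumably because the reported value was computed from unrounded standard errors.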

To this point it appears that my model for the prediction of vpd is not

improved by the inclusion of prd. But what about the accuracy of these

intervals? One way to answer this is to hold back a final portion of the series,

fit the models described above to the initial portions, and then see how well

these models predict the held-back portion by comparing the predictions to

the actual observations. To this end I first held back the final 20 observations

from each of vpd and prd. These I labelled vpd2 and prd2 ; the remaining,

initial portions I labelled vpd1 and prd1. I then:

1. predicted the 20 values in vpd2 using only an ARIMA(3,1,2) model for

vpd1 :

vpd2.pr = sarima.for(vpd1, n.ahead = 20, p = 3, d = 1, q = 2);

2. fit an ARIMA(3,1,3) to vpd1 which included a regression on prd1 :

fit2.vpd1 = arima(vpd1, order = c(3,1,3), xreg = prd1),

then predicted prd2 from an ARIMA(3,1,2) model for prd1 :

prd2.pr2 = sarima.for(prd1, n.ahead = 20, p = 3, d = 1, q = 2),

and finally predicted the 20 values of vpd2 using

vpd2.pr2 = predict(fit2.vpd1, n.ahead = 20, newxreg = prd2.pr2$pred).

The results are shown in Figure 14. Neither set of predictions captures

the drop in vpd over the final 20 weeks. But the intervals based on the use

of prd as a regressor are only slightly wider than those based on vpd alone,

and this extra width allows them to capture one point missed by the narrower

intervals. Figure 15 gives the results of the same procedures, but with 40

observations held back, rather than 20. In this case the predictions based on

the regression on prd are somewhat more reflective of the rise and fall in vpd

over this period. So, by these measures, the more involved Model 3 seems to

be slightly superior to Model 1.
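The hold-back comparison can be summarised numerically as well as graphically. The following Python sketch shows one way to do so, by interval coverage and mean squared prediction error; these summary measures, the function names and the toy numbers are assumptions of this sketch and do not reproduce the analysis behind Figures 14 and 15.

```python
def hold_out(series, n_test):
    """Split a series into an initial portion and the final n_test values."""
    return series[:-n_test], series[-n_test:]


def coverage(actual, predicted, half_widths):
    """Fraction of held-back values inside prediction +/- half-width."""
    hits = sum(
        abs(a - p) <= h for a, p, h in zip(actual, predicted, half_widths)
    )
    return hits / len(actual)


def mse(actual, predicted):
    """Mean squared prediction error over the held-back values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Toy series and hypothetical forecasts for the last three values.
series = [float(v) for v in range(1, 11)]
train, test = hold_out(series, 3)
forecasts = [7.5, 8.0, 8.5]
half_widths = [1.0, 1.0, 1.0]

cov = coverage(test, forecasts, half_widths)
err = mse(test, forecasts)
print(train, test)
print(round(cov, 3), round(err, 3))
```

Applied to both sets of forecasts, the same two numbers would give a compact comparison of Model 1 and Model 3 for each hold-back length.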

Figure 11: Residual diagnostics for ARIMA(3,1,2) fit to vpd. Shapiro-Wilk p = 0.17. [Panels: Standardized Residuals; ACF of Residuals; PACF of Residuals; Histogram of Stdres and N(0,1) density; Q-Q Plot of Stdres; p values for Ljung-Box statistic.]

Figure 12: Residual diagnostics for ARIMA(3,1,2) fit to prd. Shapiro-Wilk p = 0.39. [Panels as in Figure 11.]

Figure 13: Predictions of vpd with (blue, dotted) and without (red, dashed) the use of prd as a regressor.

Figure 14: Predictions of the final 20 values of vpd from the remaining values of the series - with (blue, solid) and without (red, dashed) the use of prd as a regressor.

Figure 15: Predictions of the final 40 values of vpd from the remaining values of the series - with (blue, solid) and without (red, dashed) the use of prd as a regressor.

Figure 16: Log-spectrum of prd (top) and vpd (bottom); L = 15. Horizontal lines at log(average power).

3 Frequency Domain Analysis

In this section two frequency domain analyses are carried out - one (§3.1) which

relates vpd to prd, and another (§3.2) which relates vpd to radn (radiation).

3.1 Series vpd and prd

I first computed and plotted (Figure 16) the spectrum of each of prd and vpd, using a smoothing parameter L = 15. Each plot peaks at a frequency of 0.019 cycles/week, corresponding to a period of 52.4 weeks, i.e. one year. This is of course not unexpected. The interval in which both series have greater than average power is [0.014, 0.024], corresponding to periods of 41.7 weeks to 71.4 weeks. The coherence plot in Figure 17 shows that the series are strongly coherent (at the 0.01 level) at all frequencies, but in particular at the frequencies surrounding that of the annual trends.
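The correspondence between frequencies and periods used here is simply period = 1/frequency. A short Python sketch, with illustrative names, taking the band of above-average power as 0.014 to 0.024 cycles/week:

```python
def period_in_weeks(freq_cycles_per_week):
    """Period corresponding to a frequency given in cycles per week."""
    return 1.0 / freq_cycles_per_week


weeks_per_year = 365.25 / 7  # about 52.2 weeks

band = (0.014, 0.024)  # frequency band of above-average power
longest = period_in_weeks(band[0])
shortest = period_in_weeks(band[1])
peak = period_in_weeks(0.019)

print(round(shortest, 1), round(longest, 1))
print(round(peak, 1))
print(shortest < weeks_per_year < longest)
```

The rounded peak frequency 0.019 gives about 52.6 weeks rather than 52.4, which suggests that the reported period was computed from the unrounded frequency; either way the annual cycle lies well inside the band.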

Motivated by this, I filtered each series in a manner designed to highlight the frequency range [0.014, 0.024] - see Figure 18 for prd and Figure 19 for vpd. Figure 20 compares the original and filtered series; from these it appears that the annual trends in vpd mirror almost exactly (up to a factor of 1.04) those in prd. An impulse-response analysis - Figure 21 - indicates that the response is essentially instantaneous, although the predictions of vpd from prd are slightly improved if the lag-one values of prd are included:

L = 15 M = 200

The lags at which the coefficients are deemed significant,

the corresponding values of the impulse-response function

and the coefficients re-estimated by least squares, are

lag s beta(s) regr.coef(s)

[1,] 0 0.9799 1.0151

[2,] 1 -0.0458 0.0089

The regression-based prediction equation is detrended.output =

sum( regr.coef(s)*lag(detrended.input, -s)).

MSE = 0.8777

The intercept and slope of the input linear trend are 14.4148 2e-04

The intercept and slope of the output linear trend are 15.5543 1e-04

See Figure 22.

3.2 Series vpd and radn

At this point I wondered if perhaps there was a simpler predictor, using just

one of the original variables, which would be adequate. I chose radn, and

again looked at the spectra (Figure 23) and coherence (Figure 24). The series

are again strongly coherent near the annual frequency, much less so elsewhere.

Filtering each to highlight the frequency range [0.014, 0.042] in which each has

greater than average power results in Figures 25 (for radn) and 26 (for vpd).

But Figure 27 suggests that the correspondence between the filtered series is

shifted, and that perhaps the annual trends in vpd are better reflected by those

of radn at some lag. Thus I looked at plots of the two with radn lagged by

0,1,...,11 weeks - Figure 28. The correlation is highest at lag 6, and a plot

of filtered vpd with lag-6 radn (Figure 29, top plot) shows the annual trends

to be in much closer agreement. A comparison of the middle and bottom

plots in Figure 29 reveals that (unfiltered) radn lagged by 6 weeks is as well a

somewhat better predictor of (unfiltered) vpd than is its unlagged version.

I next carried out an impulse-response analysis, with radn as “input” and vpd as “output”. I isolated a set of seemingly significant, positive lags and re-estimated the relationship at these lags:

L = 15 M = 200

The lags at which the coefficients are deemed significant,

the corresponding values of the impulse-response function

and the coefficients re-estimated by least squares, are

lag s beta(s) regr.coef(s)

[1,] 0 -0.3572 -0.1120

[2,] 3 0.0247 0.2615

[3,] 5 0.0228 0.2128

[4,] 6 0.0431 0.2045

[5,] 7 0.0441 0.1938

[6,] 12 0.0418 0.1717

The regression-based prediction equation is detrended.output =

sum( regr.coef(s)*lag(detrended.input, -s)).

MSE = 6.3857

The intercept and slope of the input linear trend are 18.6217 4e-04

The intercept and slope of the output linear trend are 15.5543 1e-04

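From this output, a model for predicting vpd from radn is

$$\mathrm{vpd}_t = -0.1120\,\mathrm{radn}_t + 0.2615\,\mathrm{radn}_{t-3} + 0.2128\,\mathrm{radn}_{t-5} + 0.2045\,\mathrm{radn}_{t-6} + 0.1938\,\mathrm{radn}_{t-7} + 0.1717\,\mathrm{radn}_{t-12}. \tag{7}$$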

The graphical output is in Figure 30.
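The regression-based prediction equation in the output above, sum(regr.coef(s)*lag(detrended.input, -s)), amounts to a weighted sum of lagged input values. A minimal Python sketch, using the re-estimated coefficients from the output; the function name and the constant toy input are assumptions for illustration, and both series are taken to be detrended already.

```python
import math

# Lag -> re-estimated regression coefficient, from the output above.
COEF_RADN = {0: -0.1120, 3: 0.2615, 5: 0.2128, 6: 0.2045, 7: 0.1938, 12: 0.1717}


def predict_from_lags(x, coef):
    """Predict the output at each time t from lagged values of the input x.

    Predictions start at t = max lag, the first time at which every
    required lagged input value is available.
    """
    max_lag = max(coef)
    return [
        sum(b * x[t - s] for s, b in coef.items())
        for t in range(max_lag, len(x))
    ]


# Toy detrended input: constant at 1.0, so each prediction equals the
# sum of the coefficients.
toy_input = [1.0] * 15
preds = predict_from_lags(toy_input, COEF_RADN)
print(len(preds), round(preds[0], 4))
```

Because the largest lag is 12, the first 12 weeks of the series have no prediction under this scheme.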

An alternate approach is to treat vpd as “input” and radn as “output”, to isolate a number of significant (positive and negative) lags, and to then transform the regression equation to one expressing vpd in terms of lagged vpd and lagged radn. This resulted in the plot in Figure 31 and the equation

$$\mathrm{vpd}_t = 0.2631\,\mathrm{vpd}_{t-5} + 0.4272\,\mathrm{vpd}_{t-6} - 0.1284\,\mathrm{vpd}_{t-10} - 0.2140\,\mathrm{vpd}_{t-11} + 0.4618\,\mathrm{radn}_{t-6}. \tag{8}$$

The mean squared error arising from (8) is 5.95 - a 7% reduction compared to that from (7). By this measure (8) is preferable as well as being more parsimonious.
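The quoted reduction follows from the two mean squared errors, 6.3857 for (7) and 5.95 for (8):

$$\frac{6.3857 - 5.95}{6.3857} \approx 0.068,$$

i.e. roughly 7%.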


4 Summary

Various models have been constructed and examined for the prediction of vpd

from one or more of the other series. A time domain analysis (§2) yielded

ARIMA models for vpd and for a linear combination prd of the other series;

adequate predictions were obtained using either the model for vpd alone, or a

model incorporating as well a regression on prd. A frequency domain analysis

(§3) showed that the driving force in each series is a strong annual trend;

that of vpd can be predicted almost exactly by that in prd. A companion

analysis yielded impulse-response models relating vpd to the single predictor

radn; again the models seem adequate.

References

Wang, E., Smith, C. J., Bond, W. J., Verburg, K. (2004), “Estimations of vapour pressure deficit and crop water demand in APSIM and their implications for prediction of crop yield, water use, and deep drainage”, Australian Journal of Agricultural Research, 55, 1227-1240.

Figure 17: Coherence of vpd and prd with α = 0.01 critical values.

Figure 18: Filtering output for prd; L = 15, M = 200. [Panels: Original series; Filtered series; Spectrum of original series; Spectrum of filtered series; Filter coefficients; Desired and attained freq. response functions.]

Figure 19: Filtering output for vpd; L = 15, M = 200. [Panels as in Figure 18.]

Figure 20: Top: Series vpd vs. prd (correlation = 0.978). Middle: Filtered vpd vs. filtered prd (correlation = 0.999, slope = 1.04). Bottom: Both filtered series - they are so close that one can’t distinguish one from the other in this plot.

Figure 21: Input-output analysis; input = prd, output = vpd; L = 15, M = 50. [Panels: coefficients beta(s); ccf(out,in); detrended output and values predicted from the detrended input using all coefficients beta(s).]

Figure 22: Series vpd and regression fit using prd at lags 0, 1.

Figure 23: Log-spectrum of radn (top) and vpd (bottom); L = 15. Horizontal lines at log(average power).

Figure 24: Coherence of vpd and radn with α = 0.01 critical values.

Figure 25: Filtering output for radn; L = 15, M = 200. [Panels as in Figure 18.]

Figure 26: Filtering output for vpd in §3.2; L = 15, M = 200. [Panels as in Figure 18.]

Figure 27: Top: Series vpd vs. radn (correlation = 0.405). Middle: Filtered vpd vs. filtered radn (correlation = 0.667, slope = 0.68). Bottom: Both filtered series.

Figure 28: Pairs plots of filtered vpd (vertical axes) vs. lagged and filtered radn, lags 0, 1, ..., 11 weeks. [Correlations at lags 0 to 11: 0.67, 0.74, 0.81, 0.86, 0.90, 0.93, 0.95, 0.95, 0.93, 0.91, 0.87, 0.82.]

Figure 29: Top: Filtered vpd and filtered radn, with the latter lagged by 6 weeks (mse = 1.55). Middle: Neither series filtered, but radn lagged by 6 weeks (mse = 22.07). Bottom: No filtering or lagging (mse = 34.85).

Figure 30: Series vpd and regression fit using radn through (7). MSE = 6.39.

Figure 31: Series vpd and regression fit using lagged vpd and radn through (8). MSE = 5.95.