
Time Series and Forecasting

Piotr Fryzlewicz

Department of Statistics

[email protected]

http://stats.lse.ac.uk/fryzlewicz/

January 11, 2010

1 Motivation

Time series are measurements of a quantity xt, taken repeatedly over a certain period of

time.

• The quantity xt can be a scalar, but it can also be a vector, or a more complex object

such as an image or a network.

• The time index t can be continuous (when xt is observed continuously), discrete and

equally spaced (when xt is measured at discrete time intervals, e.g. every day or every

month), or have a more complex form (think of an experiment which needs close

supervision at the beginning, but can later be observed less frequently).

Time series arise in many sciences, or more generally in many “domains of human endeav-

our”. We first look at some examples of time series, before moving on to describe the branch

of statistics called Time Series Analysis.


1.1 Examples of time series

1. Finance. We look at two examples related to finance.

(a) Time series of the daily exchange rate between the British Pound and the US

Dollar, from 3 Jan 2000 to 8 May 2009. Plot in Figure 1. Source of the data:

http://www.federalreserve.gov/releases/h10/Hist/. In this example, t

takes discrete (daily) values which we have numbered from 1 to 2440. Note

that plotting the values of a scalar-valued time series is often the most natural

way of visualising such a dataset. We denote this time series by xt; it will be

used again in the next example. The values of xt do not change much from day

to day, but over time, clear trends are formed. Note, for example, the strong

negative trend starting around t = 2000 (which corresponds to August 2008),

associated, perhaps, with the start of the “credit crunch” in the UK.

Figure 1: Daily USD/GBP exchange rate, from 03/01/2000 to 08/05/2009.


(b) Time series of the daily increments of the exchange rate between the British

Pound and the US Dollar, over the same time period. Plot in Figure 2. Note

that this time series is simply yt = xt − xt−1. Contrary to xt, there are no clear

trends in the mean of yt, which oscillates around zero. However, its variance

changes over time.

Figure 2: Increments in daily USD/GBP exchange rate, from 03/01/2000 to 08/05/2009.

2. Economics. Time series of yearly percentage increments in inflation-adjusted GDP

per capita of China, 1951–2007. Data from: http://www.gapminder.org. See Figure

3. The series is short, the values are mostly positive and appear to oscillate around a

flat but increasing trend.

3. Social Sciences. Time series of female labour force in Hong Kong, as percentage of

total labour force, 1980–2005. Data from: http://www.gapminder.org. See Figure

4. The series is not only short but has clear trends and does not appear too “random”.


Figure 3: Yearly percentage increments in inflation-adjusted GDP per capita of China, 1951–2007.

Figure 4: Female labour force in Hong Kong, as percentage of total labour force, 1980–2005.

4. Environment. Time series of monthly mean maximum temperatures, recorded in

Oxford between January 1900 and December 2008. Data from:

http://www.metoffice.gov.uk/climate/uk/stationdata/index.html. See Fig-

ure 5. The yearly periodicity is very pronounced, as expected. Might there be a


slight upward trend towards the end of the series? Anything to do with the “global

warming”?

Figure 5: Monthly mean maximum temperatures in Oxford between 01/1900 and 12/2008.

5. Engineering. Speech signal (digitised acoustic sound wave) representing the word

“Piotr” (my first name) recorded using the wavrecord command in Matlab. Plot in

Figure 6. Both the amplitude and the frequency of the signal change over time.

This list only mentions a few out of many, many possible domains in which time series arise.

Rather than focusing on the different domains, we will mention a few possible types of time

series that we often observe in practice.

Note that all of the above examples are univariate, i.e. they are measurements of a single

scalar quantity. Often, we are faced with multivariate time series, in which a number of

(often related) quantities are measured simultaneously. For example, an EEG recording

measures electrical activity simultaneously in a number of locations around the scalp. Also,


Figure 6: The word “Piotr”.

any of the univariate time series above could be a component of a multivariate time series.

For instance, rather than observing a single price process as in Example 1, market profes-

sionals often trace prices in a number of markets simultaneously with the purpose of e.g.

portfolio construction.

Finally, we note that any video can be viewed as an image-valued time series. On the other

hand, the evolution of e.g. facebook (as a graph) over time can be viewed as a network- (or

graph-)valued time series.

1.2 Statistical time series analysis

Scientists and analysts are interested in a variety of different questions / issues when faced

with time series data.

For example, one pertinent question in finance and economics / econometrics is that of


forecasting future values. This can be done, for example, for the purpose of potential gain

(e.g. in hedge funds or investment banks) or planning for the future (e.g. when should I

buy a house?).

Another common aim is to understand and be able to summarise time series data.

Again, the underlying “science” question differs from one example to another. For example,

in the analysis of EEG recordings, how can we decide whether the subject is “healthy” or

not? Or how can we decide if the labour market in HK (as in example 3 above) has been

evolving “significantly differently” from that, say, in Singapore? Or, more generally, given

a time series, how can we describe and summarise the mechanics of its evolution?

Finally, another frequent objective is to be able to control the evolution of a time series.

This is not quite the same as forecasting, where we do not intervene in the process in

any way. As an example of the control problem, consider the global temperature data: is

“global warming” really happening, and if so, what impacts the temperature and how can

we eliminate or suppress those factors?

As expected, there are often a variety of ways in which those questions can be answered,

and many of them do not formally involve statistics at all: for example, people often debate

expected trends in house prices and investment opportunities, express their informal views

about global warming, or use techniques originating from computer science (e.g. pattern

recognition) to aid medical diagnosis in neuroscience. So do we need statistics in time series

analysis?

The answer is not necessarily, but there are good arguments why the statistical approach

may often be very useful.

1. Firstly, even those informal approaches to time series analysis are in fact often statis-

tical in nature, sometimes in a “hidden” way: for example, people’s subjective views

about time series can often be formally formulated as priors in Bayesian statistics,

and informal forecasts which we often encounter in the media are in fact often in-


stances of simple statistical forecasting procedures, such as trend extrapolation. Also,

frequently, techniques originating in computer science (such as: neural networks, ma-

chine learning, pattern recognition, artificial intelligence) often have their counterparts

in statistics, which do exactly the same thing but are named differently.

2. The above-mentioned tasks: forecasting, understanding the structure of time se-

ries, as well as time series control, have inherent uncertainty about them, which

makes probability and statistics a natural tool for describing them. We briefly discuss

those issues in turn.

(a) Most real-life time series are so complex that accurate forecasting is impossible.

For example, rather than saying “tomorrow’s value of Xt will be exactly 2.745”

it often makes more sense to say “tomorrow’s value will be around 2.745”, but

then there is a chance that we will still be wrong, so our forecasts, even those

informal ones, will often be of the form “tomorrow’s value will probably be around

2.745”, which is already in the territory of probability, since it contains a natural

statement of uncertainty. We will find that probability and statistics provide a

natural and simple language to express forecasts and their associated uncertainty.

(b) Again, with the complexity of time series data, it is often impossible to build

exact deterministic models describing their structure. Indeed, if we had a

correct deterministic model, we would be able to predict the evolution of the

time series exactly, but since we are not able to do that, it means that we do

not have the exact model! Often, probabilistic models make more sense: for

example, “tomorrow’s value is about a half of today’s value plus a term which

is best described as random, i.e. there is no clear pattern in its values from one

day to another”. In this way, we have a simple model for the evolution of the

time series, but again we are in the territory of randomness, i.e. probability and

statistics.


(c) In the issue of time series control, one natural task that often needs to be

performed is to understand what factors affect the evolution of the series. But

this is often impossible to specify exactly: it is unlikely that any one factor (out

of the ones we are considering), or indeed their combination, is fully responsible

for the evolution of the time series. Therefore, again, a statistical approach,

where we permit uncertainty by building a statistical model, might be of use.

For example, if we suspect that there is a link between pollution and global

warming, it might be helpful to build a statistical model in which we will be able

to test this hypothesis, and answer questions like: “how sure are we that there is

correspondence between pollution and global warming”, or “what is the strength

of this relationship”.

We will use this discussion as a motivation for studying the statistical approach to the

analysis of time series, starting from next section.

1.3 Simple descriptive technique: autocorrelation

Denote by xt the Oxford temperature series from example 4 above. One of the main

characteristics of time series data is dependence between observations at different lags: i.e.

often, there is a relationship between observations separated by a lag k. For numerical data,

we can illustrate this using scatter plots. As an example, consider a scatter plot of xt+6

against xt, as t varies from 1 to N − 6, where N is the length of xt. It is shown in Figure

7. As expected, there is a clear negative dependence between temperatures separated by 6

months, due to seasonality.

Similar scatter plots could be created for k = 1, 2, 3, 4, 5, 7, . . ., but they are unwieldy.

Suppose we make the assumption that a linear relationship holds approximately between

xt+k and xt for all k, i.e.,

xt+k = αk + βkxt + εt+k


Figure 7: Plot of $x_{t+6}$ against $x_t$ (temperature in a given month against temperature 6 months later).

where εt+k is an error term. Then we can use as a summary statistic a measure of the

strength of the linear relationship between two variables {yt} and {zt} say, namely the

Pearson product moment correlation coefficient

$$\rho = \frac{\sum_t (y_t - \bar y)(z_t - \bar z)}{\sqrt{\sum_t (y_t - \bar y)^2 \sum_t (z_t - \bar z)^2}}$$

where y and z are the sample means. Hence if yt = xt+k and zt = xt we are led to the lag

k sample autocorrelation for a time series:

$$\rho_k = \frac{\sum_{t=1}^{N-k}(x_{t+k} - \bar x)(x_t - \bar x)}{\sum_{t=1}^{N}(x_t - \bar x)^2}$$

with ρ0 = 1. The sequence {ρk} is called the sample autocorrelation sequence (sample acs)

for the time series. The sample acs for the Oxford temperature data is given in Figure 8.
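In code, the lag-$k$ sample autocorrelation can be computed directly from this definition. The sketch below assumes Python with NumPy; the function name sample_acs and the input array x are illustrative, with x standing for any observed series such as the Oxford temperatures.

```python
import numpy as np

def sample_acs(x, max_lag):
    # lag-k sample autocorrelation rho_k, k = 0, ..., max_lag, as defined above
    x = np.asarray(x, dtype=float)
    N, xbar = len(x), np.mean(x)
    denom = np.sum((x - xbar) ** 2)
    return np.array([np.sum((x[k:] - xbar) * (x[:N - k] - xbar)) / denom
                     for k in range(max_lag + 1)])
```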


Figure 8: Sample autocorrelation for $x_t$.

Note e.g., that xt and xt+6 are negatively correlated, while xt and xt+12 are positively

correlated (consistent with the yearly temperature cycle).

In Section 1.2, we thoroughly justified using probabilistic models in time series analysis.

Consistent with this reasoning, we regard x1, . . . , xN as a realization of the corresponding

random variables X1, . . . ,XN . The quantity ρk is an estimate of a corresponding population

quantity called the lag k theoretical autocorrelation, defined as

$$\rho_k = \frac{E\{(X_t - \mu)(X_{t+k} - \mu)\}}{\sigma^2}$$

where E{·} is the expectation operator, µ = E{Xt} is the population mean, and σ2 =

E{(Xt − µ)2} is the corresponding population variance. (Note that ρk, µ and σ2 do not

depend on ‘t’ here. As we shall see soon, models for which this is true play a central role in

TSA and are called stationary).


2 Real-valued discrete time stationary processes

Denote the process by {Xt}. For fixed t, Xt is a random variable (r.v.), and hence there is

an associated cumulative probability distribution function (cdf):

Ft(a) = P(Xt ≤ a),

and

$$E\{X_t\} = \int_{-\infty}^{\infty} x\, dF_t(x) \equiv \mu_t, \qquad \mathrm{var}\{X_t\} = \int_{-\infty}^{\infty} (x - \mu_t)^2\, dF_t(x).$$

But we are interested in the relationships between the various r.v.s that form the process.

For example, for any t1 and t2 ∈ T ,

Ft1,t2(a1, a2) = P(Xt1 ≤ a1,Xt2 ≤ a2)

gives the bivariate cdf. More generally for any t1, t2, . . . , tn ∈ T ,

Ft1,t2,...,tn(a1, a2, . . . , an) = P(Xt1 ≤ a1, . . . ,Xtn ≤ an)

Stationarity

The class of all stochastic processes is too large to work with in practice. We consider

only the subclass of stationary processes (later, if time permits, we will also discuss some

subclasses of non-stationary processes).

COMPLETE/STRONG/STRICT stationarity

{Xt} is said to be completely stationary if, for all n ≥ 1, for any t1, t2, . . . , tn ∈ T , and for


any τ such that t1 + τ, t2 + τ, . . . , tn + τ ∈ T are also contained in the index set, the joint

cdf of {Xt1 ,Xt2 , . . . ,Xtn} is the same as that of {Xt1+τ ,Xt2+τ , . . . ,Xtn+τ} i.e.,

Ft1,t2,...,tn(a1, a2, . . . , an) = Ft1+τ,t2+τ,...,tn+τ (a1, a2, . . . , an),

so that the probabilistic structure of a completely stationary process is invariant under a

shift in time.

SECOND-ORDER/WEAK/COVARIANCE stationarity

{Xt} is said to be second-order stationary if, for all n ≥ 1, for any t1, t2, . . . , tn ∈ T , and

for any τ such that t1 + τ, t2 + τ, . . . , tn + τ ∈ T are also contained in the index set, all the

joint moments of orders 1 and 2 of {Xt1 ,Xt2 , . . . ,Xtn} exist, are finite, and equal to the

corresponding joint moments of {Xt1+τ ,Xt2+τ , . . . ,Xtn+τ}. Hence,

$E\{X_t\} \equiv \mu$; $\quad \mathrm{var}\{X_t\} \equiv \sigma^2\ (= E\{X_t^2\} - \mu^2)$,

are constants independent of t. If we let τ = −t1,

E{Xt1Xt2} = E{Xt1+τXt2+τ}

= E{X0Xt2−t1},

and with τ = −t2,

E{Xt1Xt2} = E{Xt1+τXt2+τ}

= E{Xt1−t2X0}.

Hence, E{Xt1Xt2} is a function of the absolute difference |t2 − t1| only, similarly, for the


covariance between Xt1 & Xt2 :

cov{Xt1 ,Xt2} = E{(Xt1 − µ)(Xt2 − µ)} = E{Xt1Xt2} − µ2.

For a discrete time second-order stationary process {Xt} we define the autocovariance se-

quence (acvs) by

sτ ≡ cov{Xt,Xt+τ} = cov{X0,Xτ}.

Note,

1. τ is called the lag.

2. s0 = σ2 and s−τ = sτ .

3. The autocorrelation sequence (acs) is given by

$$\rho_\tau = \frac{s_\tau}{s_0} = \frac{\mathrm{cov}\{X_t, X_{t+\tau}\}}{\sigma^2}.$$

4. By Cauchy-Schwarz inequality, |sτ | ≤ s0.

5. The sequence {sτ} is positive semidefinite, i.e., for all n ≥ 1, for any t1, t2, . . . , tn

contained in the index set, and for any set of nonzero real numbers a1, a2, . . . , an

$$\sum_{j=1}^{n}\sum_{k=1}^{n} s_{t_j - t_k}\, a_j a_k \ge 0.$$

Proof

Let

a = (a1, a2, . . . , an)T, V = (Xt1 ,Xt2 , . . . ,Xtn)T

and let Σ be the variance-covariance matrix of V . Its j, k-th element is given by

stj−tk = E{(Xtj − µ)(Xtk − µ)}.


Define the r.v.

$$w = \sum_{j=1}^{n} a_j X_{t_j} = a^T V.$$

Then

$$0 \le \mathrm{var}\{w\} = \mathrm{var}\{a^T V\} = a^T\,\mathrm{var}\{V\}\,a = a^T \Sigma a = \sum_{j=1}^{n}\sum_{k=1}^{n} s_{t_j - t_k}\, a_j a_k.$$

6. The variance-covariance matrix of equispaced X’s, (X1,X2, . . . ,XN )T has the form

$$\begin{pmatrix}
s_0 & s_1 & \cdots & s_{N-2} & s_{N-1} \\
s_1 & s_0 & \cdots & s_{N-3} & s_{N-2} \\
\vdots & & \ddots & & \vdots \\
s_{N-2} & s_{N-3} & \cdots & s_0 & s_1 \\
s_{N-1} & s_{N-2} & \cdots & s_1 & s_0
\end{pmatrix}$$

which is known as a symmetric Toeplitz matrix – all elements on a diagonal are the

same. Note the matrix has only N unique elements, s0, s1, . . . , sN−1.

7. A stochastic process {Xt} is called Gaussian if, for all n ≥ 1 and for any t1, t2, . . . , tn

contained in the index set, the joint cdf of Xt1 ,Xt2 , . . . ,Xtn is multivariate Gaussian.

• 2nd-order stationary Gaussian ⇒ complete stationarity (since the MVN distribution is completely characterized by its 1st and 2nd moments). Not true in general.

• Complete stationarity ⇒ 2nd-order stationarity, provided the first and second moments exist and are finite.

2.1 Examples of discrete stationary processes

[1] White noise process

Also known as a purely random process. Let {Xt} be a sequence of uncorrelated r.v.s


such that

$$E\{X_t\} = \mu, \quad \mathrm{var}\{X_t\} = \sigma^2 \quad \forall\, t$$

and

$$s_\tau = \begin{cases} \sigma^2 & \tau = 0 \\ 0 & \tau \ne 0 \end{cases} \qquad \text{or} \qquad \rho_\tau = \begin{cases} 1 & \tau = 0 \\ 0 & \tau \ne 0 \end{cases}$$

Such a process forms a basic building block in time series analysis. Very different realizations of white

noise can be obtained for different distributions of {Xt}. Examples are given in Figure

9.

Figure 9: Simulated realisations of white noise (top: Gaussian white noise; bottom: exponential white noise).
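For illustration, realisations of the kind shown in Figure 9 can be simulated in a couple of lines. This is a minimal sketch assuming Python with NumPy; the seed, series length and the choice of Exp(1) innovations are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 250
gaussian_wn = rng.standard_normal(n)                  # iid N(0, 1): mean 0, variance 1
exponential_wn = rng.exponential(scale=1.0, size=n)   # iid Exp(1): mean 1, variance 1
# both sequences are white noise (uncorrelated, constant mean and variance),
# yet their sample paths look very different
```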


[2] q-th order moving average process MA(q)

Xt can be expressed in the form

$$X_t = \mu - \theta_{0,q}\epsilon_t - \theta_{1,q}\epsilon_{t-1} - \cdots - \theta_{q,q}\epsilon_{t-q} = \mu - \sum_{j=0}^{q}\theta_{j,q}\epsilon_{t-j},$$

where $\mu$ and the $\theta_{j,q}$'s are constants ($\theta_{0,q} \equiv -1$, $\theta_{q,q} \ne 0$), and $\{\epsilon_t\}$ is a zero-mean white noise process with variance $\sigma^2_\epsilon$.

W.l.o.g. assume E{Xt} = µ = 0.

Then cov{Xt,Xt+τ } = E{XtXt+τ}.

Recall: cov(X,Y ) = E{(X − E{X})(Y − E{Y })}.

Since $E\{\epsilon_t\epsilon_{t+\tau}\} = 0$ for all $\tau \ne 0$, we have, for $\tau \ge 0$,

$$\mathrm{cov}\{X_t, X_{t+\tau}\} = \sum_{j=0}^{q}\sum_{k=0}^{q}\theta_{j,q}\theta_{k,q}\,E\{\epsilon_{t-j}\epsilon_{t+\tau-k}\} = \sigma^2_\epsilon\sum_{j=0}^{q-\tau}\theta_{j,q}\theta_{j+\tau,q} \quad (k = j + \tau) \quad \equiv s_\tau,$$

which does not depend on t. Since sτ = s−τ , {Xt} is a stationary process with acvs

given by

$$s_\tau = \begin{cases} \sigma^2_\epsilon\sum_{j=0}^{q-|\tau|}\theta_{j,q}\theta_{j+|\tau|,q} & |\tau| \le q \\ 0 & |\tau| > q. \end{cases}$$

N.B. No restrictions were placed on the θj,q’s to ensure stationarity. (Though obvi-

ously, |θj,q| <∞ ∀ j).

Some examples are shown in Figure 10.
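The acvs formula above is also easy to evaluate numerically. The sketch below assumes Python with NumPy; the function name ma_acvs and the convention of supplying $\theta_{1,q},\ldots,\theta_{q,q}$ (with $\theta_{0,q} = -1$ filled in internally) are illustrative choices.

```python
import numpy as np

def ma_acvs(theta, sigma2_eps, max_lag):
    # acvs of the MA(q) process above; theta = [theta_1, ..., theta_q], theta_0 = -1
    th = np.concatenate(([-1.0], np.asarray(theta, dtype=float)))
    q = len(th) - 1
    s = np.zeros(max_lag + 1)
    for tau in range(min(q, max_lag) + 1):
        s[tau] = sigma2_eps * np.sum(th[:q - tau + 1] * th[tau:])
    return s  # s_tau = 0 for |tau| > q, and s_{-tau} = s_tau

# MA(1) with theta_{1,1} = 1 and sigma_eps^2 = 1: s_0 = 2, s_1 = -1, s_2 = ... = 0
print(ma_acvs([1.0], 1.0, 3))
```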

Further examples

Xt = ǫt − θ1,1ǫt−1 MA(1)


Figure 10: Top: realisation of the MA(9) process $X_t = \sum_{i=0}^{9}\epsilon_{t-i}$. Bottom: realisation of the MA(9) process $X_t = \sum_{i=0}^{9}(-1)^i\epsilon_{t-i}$.

acvs:

$$s_\tau = \sigma^2_\epsilon\sum_{j=0}^{1-|\tau|}\theta_{j,1}\theta_{j+|\tau|,1}, \quad |\tau| \le 1,$$

so,

$$s_0 = \sigma^2_\epsilon(\theta_{0,1}\theta_{0,1} + \theta_{1,1}\theta_{1,1}) = \sigma^2_\epsilon(1 + \theta_{1,1}^2);$$


and,

$$s_1 = \sigma^2_\epsilon\,\theta_{0,1}\theta_{1,1} = -\sigma^2_\epsilon\,\theta_{1,1}.$$

acs:

$$\rho_\tau = \frac{s_\tau}{s_0}, \qquad \rho_0 = 1.0, \quad \rho_1 = \frac{-\theta_{1,1}}{1 + \theta_{1,1}^2}.$$

(a) θ1,1 = 1.0, σ2ǫ = 1.0,

we have,

s0 = 2.0, s1 = −1.0, s2, s3, . . . = 0.0,

giving,

ρ0 = 1.0, ρ1 = −0.5, ρ2, ρ3, . . . = 0.0.

(b) θ1,1 = −1.0, σ2ǫ = 1.0,

we have,

s0 = 2.0, s1 = 1.0, s2, s3, . . . = 0.0,

giving,

ρ0 = 1.0, ρ1 = 0.5, ρ2, ρ3, . . . = 0.0.

Note: if we replace $\theta_{1,1}$ by $\theta_{1,1}^{-1}$ the model becomes

$$X_t = \epsilon_t - \frac{1}{\theta_{1,1}}\epsilon_{t-1}$$

and the autocorrelation becomes

$$\rho_1 = \frac{-\frac{1}{\theta_{1,1}}}{1 + \left(\frac{1}{\theta_{1,1}}\right)^2} = \frac{-\theta_{1,1}}{\theta_{1,1}^2 + 1},$$

i.e., is unchanged!

We cannot identify the MA(1) process uniquely from the autocorrelation.

[3] p-th order autoregressive process AR(p)

{Xt} is expressed in the form

Xt = φ1,pXt−1 + φ2,pXt−2 + . . . + φp,pXt−p + ǫt,

where $\phi_{1,p}, \phi_{2,p}, \ldots, \phi_{p,p}$ are constants ($\phi_{p,p} \ne 0$) and $\{\epsilon_t\}$ is a zero mean white noise

process with variance σ2ǫ . In contrast to the parameters of an MA(q) process, the

{φk,p} must satisfy certain conditions for {Xt} to be a stationary process – i.e., not

all AR(p) processes are stationary (more later).

Some examples are in Figure 11.

Further examples

$$X_t = \phi_{1,1}X_{t-1} + \epsilon_t \qquad \text{AR(1) – Markov process} \qquad (1)$$
$$= \phi_{1,1}\{\phi_{1,1}X_{t-2} + \epsilon_{t-1}\} + \epsilon_t = \phi_{1,1}^2 X_{t-2} + \phi_{1,1}\epsilon_{t-1} + \epsilon_t = \phi_{1,1}^3 X_{t-3} + \phi_{1,1}^2\epsilon_{t-2} + \phi_{1,1}\epsilon_{t-1} + \epsilon_t = \ldots = \sum_{k=0}^{\infty}\phi_{1,1}^k\epsilon_{t-k},$$

where we take the initial condition $X_{-N} = 0$ and let $N \to \infty$.


Figure 11: Top: realisation of the AR(2) process $X_t = 0.5X_{t-1} + 0.2X_{t-2} + \epsilon_t$. Bottom: realisation of the AR(2) process $X_t = 0.5X_{t-1} - 0.2X_{t-2} + \epsilon_t$.

Note E{Xt} = 0.

$$\mathrm{var}\{X_t\} = \mathrm{var}\left\{\sum_{k=0}^{\infty}\phi_{1,1}^k\epsilon_{t-k}\right\} = \sum_{k=0}^{\infty}\mathrm{var}\{\phi_{1,1}^k\epsilon_{t-k}\} = \sigma^2_\epsilon\sum_{k=0}^{\infty}\phi_{1,1}^{2k}.$$

For $\mathrm{var}\{X_t\} < \infty$ we must have $|\phi_{1,1}| < 1$, in which case

$$\mathrm{var}\{X_t\} = \frac{\sigma^2_\epsilon}{1 - \phi_{1,1}^2}.$$

To find the form of the acvs, we notice that for τ > 0, Xt−τ is a linear function of


ǫt−τ , ǫt−τ−1, . . . and is therefore uncorrelated with ǫt. Hence

E{ǫtXt−τ} = 0,

so, assuming stationarity and multiplying the defining equation (1) by Xt−τ :

$$X_t X_{t-\tau} = \phi_{1,1}X_{t-1}X_{t-\tau} + \epsilon_t X_{t-\tau} \;\Rightarrow\; E\{X_t X_{t-\tau}\} = \phi_{1,1}E\{X_{t-1}X_{t-\tau}\},$$

i.e., $s_\tau = \phi_{1,1}s_{\tau-1} = \phi_{1,1}^2 s_{\tau-2} = \ldots = \phi_{1,1}^{\tau}s_0$, so that

$$\rho_\tau = \frac{s_\tau}{s_0} = \phi_{1,1}^{\tau}.$$

But $\rho_\tau$ is an even function of $\tau$, so

$$\rho_\tau = \phi_{1,1}^{|\tau|}, \qquad \tau = 0, \pm 1, \pm 2, \ldots.$$

[Plots of the acs $\rho_\tau = \phi_{1,1}^{|\tau|}$ against lag, for $X_t = 0.5X_{t-1} + \epsilon_t$ ($\phi_{1,1} = 0.5$) and $X_t = -0.5X_{t-1} + \epsilon_t$ ($\phi_{1,1} = -0.5$)] – exponential decay.
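A quick simulation check of this exponential decay (a sketch assuming Python with NumPy; the seed, series length and $\phi_{1,1} = 0.5$ are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(7)
phi, n = 0.5, 20000
x = np.zeros(n)
eps = rng.standard_normal(n)
for t in range(1, n):                  # AR(1) recursion X_t = phi * X_{t-1} + eps_t
    x[t] = phi * x[t - 1] + eps[t]
for tau in range(1, 6):                # sample lag-tau correlation vs. phi^tau
    print(tau, round(np.corrcoef(x[:-tau], x[tau:])[0, 1], 3), phi ** tau)
```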

[4] (p, q)’th order autoregressive-moving average process ARMA(p, q)

Here {Xt} is expressed as

Xt = φ1,pXt−1 + . . . + φp,pXt−p + ǫt − θ1,qǫt−1 − . . .− θq,qǫt−q,


where the $\phi_{j,p}$'s and the $\theta_{j,q}$'s are all constants ($\phi_{p,p} \ne 0$; $\theta_{q,q} \ne 0$) and again $\{\epsilon_t\}$ is

a zero mean white noise process with variance σ2ǫ . The ARMA class is important as

many data sets may be approximated in a more parsimonious way (meaning fewer

parameters are needed) by a mixed ARMA model than by a pure AR or MA process.

In brief, time series analysts like MA and AR models for different reasons. MA models

are appealing because they are easy to manipulate mathematically, e.g. as we saw, no

restrictions on parameter values are needed to ensure stationarity. On the other hand,

AR models are more convenient for forecasting, which we will see later. Obviously, the

main criterion for whether a model is or isn’t useful is whether it performs well at our

desired task, which will often be: modelling (or: understanding) the data, forecasting,

or control.

The ARMA model shares the best, and the worst, features of the AR and MA classes!

Models for changing variance

Objective: obtain better estimates of local variance in order to obtain a better assess-

ment of risk.

Note: not to get better estimates of the trend.

[5] p’th order autoregressive conditionally heteroscedastic

model ARCH(p)

Assume we have a derived time series {Yt} that is (approximately) uncorrelated but

has a variance (volatility) that changes through time,

Yt = σtεt (2)

where {εt} is a white noise sequence with zero mean and unit variance. Here, σt

represents the local conditional standard deviation of the process. Note that σt is not

observable directly.


{Yt} is ARCH(p) if it satisfies equation (2) and

$$\sigma_t^2 = \alpha + \beta_{1,p}y_{t-1}^2 + \cdots + \beta_{p,p}y_{t-p}^2, \qquad (3)$$

where α > 0 and βj,p ≥ 0, j = 1, . . . , p (to ensure the variance remains positive), and

yt−1 is the observed value of the derived time series at time (t− 1).

Notes:

(a) the absence of the error term in equation (3).

(b) unconstrained estimation often leads to violation of the non-negativity constraints

that are needed to ensure positive variance.

(c) quadratic form (i.e. modelling σ2t ) prevents modelling of asymmetry in volatility

(i.e. volatility tends to be higher after a decrease than after an equal increase

and ARCH cannot account for this).

Example: ARCH(1)

$$\sigma_t^2 = \alpha + \beta_{1,1}y_{t-1}^2.$$

Define

$$v_t = y_t^2 - \sigma_t^2 \quad \Rightarrow \quad \sigma_t^2 = y_t^2 - v_t.$$

The model can also be written:

$$y_t^2 = \alpha + \beta_{1,1}y_{t-1}^2 + v_t,$$

i.e. an AR(1) model for $\{y_t^2\}$, where the errors $\{v_t\}$ have zero mean, but as $v_t = \sigma_t^2(\varepsilon_t^2 - 1)$ the errors are heteroscedastic.
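A minimal simulation sketch of ARCH(1), assuming Python with NumPy ($\alpha$, $\beta_{1,1}$ and the sample size are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, beta, n = 0.1, 0.8, 2000
y = np.zeros(n)
for t in range(1, n):
    sigma2_t = alpha + beta * y[t - 1] ** 2              # equation (3) with p = 1
    y[t] = np.sqrt(sigma2_t) * rng.standard_normal()     # equation (2)
# y_t is (approximately) uncorrelated, but y_t^2 shows AR(1)-like positive autocorrelation
print(np.corrcoef(y[:-1], y[1:])[0, 1], np.corrcoef(y[:-1] ** 2, y[1:] ** 2)[0, 1])
```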

[6] (p, q)’th order generalized autoregressive conditionally

heteroscedastic model GARCH(p, q)


{Yt} is GARCH(p, q) if it satisfies equation (2) and

$$\sigma_t^2 = \alpha + \beta_{1,p}y_{t-1}^2 + \cdots + \beta_{p,p}y_{t-p}^2 + \gamma_{1,q}\sigma_{t-1}^2 + \cdots + \gamma_{q,q}\sigma_{t-q}^2,$$

where the parameters are chosen to ensure positive variance. GARCH models were

introduced because it was observed that the ARCH class does not account sufficiently

well for the persistence of volatility in financial time series data; i.e. according to the

ARCH model, the series y2t often has less (theoretical) autocorrelation than real data

tend to have in practice.

2.2 Trend removal and seasonal adjustment

There are certain, quite common, situations where the observations exhibit a trend – a

tendency to increase or decrease steadily over time – or may fluctuate in a periodic

manner due to seasonal effects. The model is modified to

Xt = µt + Yt

µt = time dependent mean.

Yt = zero mean stationary process.

Example Oxford temperature data, last 30 years. The data are plotted in the top-left

plot of Figure 12.

Model suggested by plot: Xt = α+ βt+ Yt.

Trend adjustment

At least two possible approaches:


(a) Estimate α and β by least squares, and work with the residuals

$\hat Y_t = X_t - \hat\alpha - \hat\beta t$.

For the Oxford data these are shown in the top-right plot of Figure 12.

(b) Take first differences:

$$X_t^{(1)} = X_t - X_{t-1} = \alpha + \beta t + Y_t - (\alpha + \beta(t-1) + Y_{t-1}) = \beta + Y_t - Y_{t-1}.$$

For the Oxford temperature data these are shown in the bottom-left plot of Figure

12.

Note: if $\{Y_t\}$ is stationary, so is $\{Y_t^{(1)}\}$.

In the case of linear trend, if we difference again:

$$X_t^{(2)} = X_t^{(1)} - X_{t-1}^{(1)} = (X_t - X_{t-1}) - (X_{t-1} - X_{t-2}) = (\beta + Y_t - Y_{t-1}) - (\beta + Y_{t-1} - Y_{t-2}) = Y_t - 2Y_{t-1} + Y_{t-2} \;\left(\equiv Y_t^{(1)} - Y_{t-1}^{(1)} = Y_t^{(2)}\right),$$

so that the effect of µt(= α+ βt) has been completely removed.

If µt is a polynomial of degree (d− 1) in t, then dth differences of µt will be zero (d = 2 for

linear trend).

Further,

$$X_t^{(d)} = \sum_{k=0}^{d}\binom{d}{k}(-1)^k X_{t-k} = \sum_{k=0}^{d}\binom{d}{k}(-1)^k Y_{t-k}.$$


There are other ways of writing this. Define the difference operator

∆ = (1−B)

where BXt = Xt−1 is the backward shift operator (sometimes known as the lag operator L

– especially in econometrics). Then,

$$X_t^{(d)} = \Delta^d X_t = \Delta^d Y_t.$$

For example, for d = 2:

$$X_t^{(2)} = (1-B)^2 X_t = (1-B)(X_t - X_{t-1}) = (X_t - X_{t-1}) - (X_{t-1} - X_{t-2}) = (\beta + Y_t - Y_{t-1}) - (\beta + Y_{t-1} - Y_{t-2}) = (Y_t - Y_{t-1}) - (Y_{t-1} - Y_{t-2}) = (1-B)^2 Y_t = \Delta^2 Y_t.$$

This notation can be incorporated into the ARMA set up, recall if {Xt} is ARMA(p, q),

Xt = φ1,pXt−1 + . . .+ φp,pXt−p + ǫt − θ1,qǫt−1 − . . .− θq,qǫt−q,

Xt − φ1,pXt−1 − . . .− φp,pXt−p = ǫt − θ1,qǫt−1 − . . .− θq,qǫt−q

$$(1 - \phi_{1,p}B - \phi_{2,p}B^2 - \cdots - \phi_{p,p}B^p)X_t = (1 - \theta_{1,q}B - \theta_{2,q}B^2 - \cdots - \theta_{q,q}B^q)\epsilon_t$$

Φ(B)Xt = Θ(B)ǫt.


where

$$\Phi(B) = 1 - \phi_{1,p}B - \phi_{2,p}B^2 - \cdots - \phi_{p,p}B^p, \qquad \Theta(B) = 1 - \theta_{1,q}B - \theta_{2,q}B^2 - \cdots - \theta_{q,q}B^q$$

are known as the associated or characteristic polynomials.

Further, we can generalize the class of ARMA models to include differencing to account for

certain types of non-stationarity, namely, Xt is called ARIMA(p, d, q) if

$$\Phi(B)(1-B)^d X_t = \Theta(B)\epsilon_t, \qquad \text{i.e.} \qquad \Phi(B)\Delta^d X_t = \Theta(B)\epsilon_t.$$

Seasonal adjustment

The model is modified to

Xt = st + Yt

where

st = seasonal component,

Yt = zero mean stationary process.

Presuming that the seasonal component maintains a constant pattern over time with period

s, there are again several approaches to removing st. A popular approach used by Box &

Jenkins is to use the operator (1−Bs):

$$X_t^{(s)} = (1 - B^s)X_t = X_t - X_{t-s} = (s_t + Y_t) - (s_{t-s} + Y_{t-s}) = Y_t - Y_{t-s}$$


since st has period s (and so st−s = st).

The bottom-right plot of Figure 12 shows this technique applied to the Oxford temperature

data – most of the seasonal structure and trend has been removed by applying the following

differencing:

$$(1 - B^s)(1 - B)X_t$$

with s = 12.
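In code this differencing is immediate. The sketch below assumes Python with NumPy; the function name and the array x holding the observed series are illustrative.

```python
import numpy as np

def trend_and_seasonal_difference(x, s=12):
    # apply (1 - B^s)(1 - B) to the series x
    x = np.asarray(x, dtype=float)
    d = x[1:] - x[:-1]        # (1 - B) x_t
    return d[s:] - d[:-s]     # (1 - B^s) applied to the differenced series
```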

2.3 ARMA — stationarity, invertibility and autocorrelation

For this section, assume that Φ(z) and Θ(z) have no common zeroes (for technical reasons).

Stationarity

Consider a general ARMA process

Φ(B)Xt = Θ(B)ǫt.

Note that the RHS, being an MA process, is always stationary. The following general result

will help us determine when Xt itself is stationary, and find its autocovariance sequence.

Proposition 3.1.2, Brockwell & Davis. If $Y_t$ is a second-order stationary process with autocovariance function $s^Y_\tau$ and if $\sum_{j=-\infty}^{\infty}|\psi_j| < \infty$, then for each $t \in \mathbb{Z}$ the series

$$X_t := \sum_{j=-\infty}^{\infty}\psi_j Y_{t-j}$$

converges absolutely with probability one, and in mean square to the same limit. Furthermore, $X_t$ is second-order stationary with autocovariance sequence

$$s^X_\tau = \sum_{j,k=-\infty}^{\infty}\psi_j\psi_k\, s^Y_{\tau - j + k}.$$


Proof. Brockwell & Davis, p. 84.

General ARMA can be represented as

Xt = Φ−1(B)Θ(B)ǫt.

The above proposition is telling us that if we can represent $\Phi^{-1}(B)$ as $\sum_{j=0}^{\infty}\phi_j B^j$ with $\sum_{j=0}^{\infty}|\phi_j| < \infty$, then $X_t$ will be stationary.

Fact. $\Phi^{-1}(z)$, where $\Phi(z)$ is a polynomial, can be represented as $\sum_{j=0}^{\infty}\phi_j z^j$ if $\Phi(z) \ne 0$ for all $z \in \mathbb{C}$ s.t. $|z| \le 1$.

To summarise, $X_t$ is stationary if all roots of $\Phi(z) = 0$ lie outside the unit circle.

Invertibility

Again, let $\Phi(B)X_t = \Theta(B)\epsilon_t$. The process $X_t$ is said to be invertible if there exists a sequence of constants $\pi_j$ s.t. $\sum_{j=0}^{\infty}|\pi_j| < \infty$ and

$$\epsilon_t = \sum_{j=0}^{\infty}\pi_j X_{t-j}.$$

Theorem 3.1.2, Brockwell & Davis. $X_t$ is invertible if and only if $\Theta(z) \ne 0$ for all $z \in \mathbb{C}$ s.t. $|z| \le 1$. The coefficients $\pi_j$ are obtained by Taylor-expanding $\Phi(z)/\Theta(z)$.

Proof. Brockwell & Davis, p. 87.

Example 1

Consider the following process

Xt = ǫt − 1.3ǫt−1 + 0.4ǫt−2


Writing this in B notation:

$$X_t = (1 - 1.3B + 0.4B^2)\epsilon_t = \Theta(B)\epsilon_t;$$

to check if invertible, find the roots of $\Theta(z) = 1 - 1.3z + 0.4z^2$:

$$1 - 1.3z + 0.4z^2 = 0 \;\Leftrightarrow\; 4z^2 - 13z + 10 = 0 \;\Leftrightarrow\; (4z - 5)(z - 2) = 0;$$

roots of Θ(z) are z = 2 and z = 5/4, which are both outside the unit circle ⇒ invertible.

Example 2

Determine whether the following model is stationary and/or invertible,

Xt = 1.3Xt−1 − 0.4Xt−2 + ǫt − 1.5ǫt−1.

Writing in B notation:

$$(1 - 1.3B + 0.4B^2)X_t = (1 - 1.5B)\epsilon_t$$

we have $\Phi(z) = 1 - 1.3z + 0.4z^2$ with roots $z = 2$ and $5/4$ (from the previous example), so the roots of $\Phi(z) = 0$ both lie outside the unit circle; therefore the model is stationary, and

Θ(z) = 1− 1.5z,


so the root of Θ(z) = 0 is given by z = 2/3 which lies inside the unit circle and the model

is not invertible.
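The root conditions in Examples 1 and 2 can also be checked numerically. A sketch assuming Python with NumPy, applied to the model of Example 2 (variable names are illustrative):

```python
import numpy as np

phi = [1.3, -0.4]    # AR coefficients phi_{1,p}, ..., phi_{p,p}
theta = [1.5]        # MA coefficients theta_{1,q}, ..., theta_{q,q}

# Phi(z) = 1 - 1.3 z + 0.4 z^2 and Theta(z) = 1 - 1.5 z; np.roots expects
# coefficients ordered from the highest power of z down to the constant
Phi = np.concatenate(([1.0], -np.asarray(phi)))[::-1]
Theta = np.concatenate(([1.0], -np.asarray(theta)))[::-1]

print("stationary:", np.all(np.abs(np.roots(Phi)) > 1))    # True: roots 2 and 5/4
print("invertible:", np.all(np.abs(np.roots(Theta)) > 1))  # False: root 2/3
```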

Computing the autocovariance sequence of an AR process. The above Proposition

provides us with an “easy” method for computing the autocovariance sequence of a station-

ary AR process. This is best seen in an example. Let us simplify slightly the above model

to give

Xt = 1.3Xt−1 − 0.4Xt−2 + ǫt.

Write as

$$(1 - 1.3B + 0.4B^2)X_t = \epsilon_t.$$

Then, having found the roots of the polynomial on the left-hand side, we factorise

$$\left(1 - \frac{B}{2}\right)\left(1 - \frac{4}{5}B\right)X_t = \epsilon_t.$$

Formally, this is equivalent to:

$$X_t = \frac{1}{\left(1 - \frac{B}{2}\right)\left(1 - \frac{4}{5}B\right)}\,\epsilon_t,$$

which, using Taylor expansion, gives

$$X_t = \left(\sum_{i=0}^{\infty}\frac{B^i}{2^i}\right)\left(\sum_{i=0}^{\infty}\frac{4^i B^i}{5^i}\right)\epsilon_t.$$

Collecting the terms (not always easy; Maple can help), we get

$$X_t = \left(1 + \frac{13}{10}B + \frac{129}{100}B^2 + \ldots\right)\epsilon_t = \epsilon_t + \frac{13}{10}\epsilon_{t-1} + \frac{129}{100}\epsilon_{t-2} + \ldots$$

Truncating this expansion at a sufficiently large lag, we then proceed in the same way as in


calculating the acs of an MA process.
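The "collecting the terms" step can be delegated to the computer: convolving truncated versions of the two geometric expansions gives the MA($\infty$) weights, from which the acvs follows as for an MA process. A sketch assuming Python with NumPy (the truncation lag 20 is arbitrary):

```python
import numpy as np

n = 21                                         # keep lags 0, ..., 20
psi = np.convolve((1 / 2) ** np.arange(n), (4 / 5) ** np.arange(n))[:n]
print(psi[:3])                                 # approx [1.0, 1.3, 1.29], as derived above

# acvs of the truncated MA(20) representation (taking sigma_eps^2 = 1)
s = np.array([np.sum(psi[:n - tau] * psi[tau:]) for tau in range(6)])
print(s / s[0])                                # autocorrelation sequence rho_tau
```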

Remark. The above discussion implies that

MA (finite order) ≡ AR (infinite order)

AR (finite order) ≡ MA (infinite order)

provided the infinite order expansions exist!

SUMMARY

              | AR(p)                           | MA(q)                           | ARMA(p, q)
Stationarity  | Roots of Φ(z) outside |z| ≤ 1   | Always stationary               | Roots of Φ(z) outside |z| ≤ 1
Invertibility | Always invertible               | Roots of Θ(z) outside |z| ≤ 1   | Roots of Θ(z) outside |z| ≤ 1

Why are stationarity and invertibility desirable?

• Stationarity — because it assumes that the joint distribution of Xt (or second-order

properties) does not change over time. Therefore, we can average over time to obtain

better estimates of the characteristics of the process.

Example. Assume $\{X_t\}_{t=1}^{T}$ is stationary and $E(X_t) = \mu$. We can estimate $\mu$ by averaging $X_t$ over time:

$$\hat\mu = \frac{1}{T}\sum_{t=1}^{T}X_t.$$

More later!

• Invertibility — because it permits us to represent our process as AR, and AR processes

are often “easy” to estimate and forecast. Again, more later.


2.4 Partial autocorrelation

In the section on MA processes, we saw that the autocovariance sequence cut off after

some lag k. Also, we saw that for an AR(1) process, it never cut off to zero, but decayed

exponentially. The same result can be proved for a general AR process.

For purposes of model identification, it would be useful to have a quantity which did cut

off to zero for a general autoregressive process AR(p). One such quantity is the partial

autocorrelation function (or sequence).

For a process Xt, it is defined by

π(1) = corr(X2,X1)

π(2) = corr(X3 −E(X3|X2),X1 − E(X1|X2))

π(3) = corr(X4 −E(X4|X3,X2),X1 − E(X1|X3,X2))

etc . . .

Interpretation: E(X4|X3,X2) is the “part of X4 that is explained by X3,X2” (or more

formally, it is the prediction of X4 based on X3,X2). Thus X4 − E(X4|X3,X2) is the part

of X4 that is unexplained (un-predicted) by X2,X3. Thus the partial autocorrelation at lag

k is the correlation of those portions of X1, Xk+1 which are unexplained by the intermediate

variables X2, . . . ,Xk.

Fact: for an autoregressive process AR(p), pacf at lags k > p is zero.

Example for AR(1), Xt = aXt−1 + ǫt.

π(1) = corr(X2,X1) = ρ(1).

π(2) = corr(X3 − E(X3|X2),X1 − E(X1|X2)).


Now,

X3 − E(X3|X2) = X3 − E(aX2 + ǫ3|X2) = X3 − aX2 + 0 = ǫ3.

But X1 −E(X1|X2) is a function of X1,X2, which are independent of ǫ3. Thus π(2) = 0.

Note: for a general AR(p) process as defined above,

Xt = φ1,pXt−1 + φ2,pXt−2 + . . .+ φp,pXt−p + ǫt,

if the ǫ’s are iid Gaussian, then it can be shown that π(k) = φk,p.
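The definition of the pacf translates directly into a sample version: regress $X_{t+k}$ and $X_t$ on the intermediate values and correlate the residuals. A sketch assuming Python with NumPy; the function name is illustrative, and ordinary least squares plays the role of the predictions $E(\cdot \mid \cdot)$:

```python
import numpy as np

def sample_pacf(x, max_lag):
    x = np.asarray(x, dtype=float)
    N, out = len(x), []
    for k in range(1, max_lag + 1):
        rows = np.column_stack([x[j:N - k + j] for j in range(k + 1)])  # (x_t, ..., x_{t+k})
        first, last = rows[:, 0], rows[:, k]
        if k == 1:
            resid_f, resid_l = first, last
        else:
            Z = np.column_stack([np.ones(len(rows)), rows[:, 1:k]])     # intermediate lags
            resid_f = first - Z @ np.linalg.lstsq(Z, first, rcond=None)[0]
            resid_l = last - Z @ np.linalg.lstsq(Z, last, rcond=None)[0]
        out.append(np.corrcoef(resid_f, resid_l)[0, 1])
    return np.array(out)   # for AR(p) data, values beyond lag p should be near zero
```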

3 Spectral Representation theorem for discrete time stationary processes

Spectral analysis is a study of the frequency domain characteristics of a process, and de-

scribes the contribution of each frequency to the variance of the process.

The whole idea is best described graphically on the board.

A more formal description follows.

We start with a stochastic process Z(f), possibly complex-valued, defined on the interval

f ∈ [−1/2, 1/2]. The variable f will correspond to “frequency”.

Consider “infinitely small” jumps of Z(f), denoted by dZ(f):

$$dZ(f) \equiv \begin{cases} Z(f + df) - Z(f), & f < 1/2; \\ 0, & f = 1/2, \end{cases}$$

where df is a small positive increment. If the intervals [f, f + df ] and [f ′, f ′ + df ′] are non-

intersecting subintervals of [−1/2, 1/2], then the r.v.’s dZ(f) and dZ(f ′) are uncorrelated.

Again, we say that the process has orthogonal increments, and the process itself is called


an orthogonal process – this orthogonality assumption is very important.

Note. While Z(f) is a well-defined stochastic process, dZ(f) is not a real process, in the

same way as the “delta function” is not a real function: both make sense only in integration!

Let {Xt} be a (possibly complex-valued) discrete time second-order stationary process, with

zero mean. The spectral representation theorem states that there exists such an orthogonal

process {Z(f)}, defined on [−1/2, 1/2], such that

$$X_t = \int_{-1/2}^{1/2} e^{i2\pi f t}\, dZ(f)$$

for all integers t.

The process {Z(f)} has the following properties:

[1] E{dZ(f)} = 0 ∀ |f | ≤ 1/2.

[2] E{|dZ(f)|2} ≡ dS(I)(f) say ∀ |f | ≤ 1/2, where dS(I)(f) is called the integrated spec-

trum of {Xt}, and

[3] for any two distinct frequencies f and f ′ ∈ [−1/2, 1/2]

cov{dZ(f ′), dZ(f)} = E{dZ∗(f ′)dZ(f)} = 0.

The spectral representation

$$X_t = \int_{-1/2}^{1/2} e^{i2\pi f t}\, dZ(f) = \int_{-1/2}^{1/2} e^{i2\pi f t}\,|dZ(f)|\,e^{i\arg\{dZ(f)\}},$$

means that we can represent any discrete complex-valued stationary process as an “infinite”

sum of complex exponentials at frequencies f with associated random amplitudes |dZ(f)|

and random phases arg{dZ(f)}.

Since for any two different frequencies, dZ(f) and dZ(f ′) are uncorrelated (= independent if


Z is Gaussian), we have a convenient decomposition of Xt into a sum of uncorrelated com-

ponents of a particularly simple form: note that ei2πft are simple complex-valued oscillatory

functions. Think of them as “extended” versions of sines or cosines.

Non-examinable example, mostly for those interested in Stochastic Processes.

Let B(f) be a standard Brownian motion on [−1/2, 1/2], where the integration w.r.t.

dB(−f) is defined by

$$\int_{[-1/2,1/2]} g(f)\, dB(-f) = \int_{[-1/2,1/2]} g(-f)\, dB(f).$$

Define the orthogonal increment process by

dZ(f) = dB(f) + dB(−f) + i(dB(−f)− dB(f)).

Note:

1. Quick check whether Z really has orthogonal increments: from the properties of

Brownian motion, it is obvious that $\mathrm{cov}(dZ(f), dZ(f')) = 0$ for all $|f| \ne |f'|$. The

only case where it is not obvious is f ′ = −f . W.l.o.g. assume f > 0. Recalling that

E(dB(−f), dB(−f)) = df , we compute

cov(dZ(f), dZ(−f)) =

E(dZ(f)dZ∗(−f)) =

E {(dB(f) + dB(−f) + i(dB(−f)− dB(f))) (dB(−f) + dB(f)− i(dB(f)− dB(−f)))} =

df − idf + df + idf + idf − df − idf − df =

0.

2. Note that dZ∗(−f) = dZ(f). It is always the case when the process Xt “generated”

by Z(f) is real-valued!


We have

$$X_t = \int_{-1/2}^{1/2} e^{i2\pi ft}\, dZ(f) = \int_{-1/2}^{1/2}(\cos(2\pi ft) + i\sin(2\pi ft))(dB(f) + dB(-f) + i(dB(-f) - dB(f)))$$
$$= \int_{-1/2}^{1/2}\cos(2\pi ft)\,dB(f) + \cos(2\pi ft)\,dB(-f) - \sin(2\pi ft)\,dB(-f) + \sin(2\pi ft)\,dB(f)$$
$$\quad + i\int_{-1/2}^{1/2}\cos(2\pi ft)\,dB(-f) - \cos(2\pi ft)\,dB(f) + \sin(2\pi ft)\,dB(f) + \sin(2\pi ft)\,dB(-f)$$
$$= 2\int_{-1/2}^{1/2}(\cos(2\pi ft) + \sin(2\pi ft))\,dB(f).$$

So Xt is real-valued. Obviously, it is Gaussian and has zero mean. We will now compute

the covariance structure of Xt.

$$\mathrm{cov}(X_t, X_{t+\tau}) = 4\,E\left\{\int_{-1/2}^{1/2}\int_{-1/2}^{1/2}(\cos(2\pi ft)+\sin(2\pi ft))(\cos(2\pi f'(t+\tau))+\sin(2\pi f'(t+\tau)))\,dB(f)\,dB(f')\right\}$$
$$= 4\int_{-1/2}^{1/2}(\cos(2\pi ft)+\sin(2\pi ft))(\cos(2\pi f(t+\tau))+\sin(2\pi f(t+\tau)))\,df = 4\int_{-1/2}^{1/2}\cos(2\pi f\tau)+\sin(2\pi f(2t+\tau))\,df.$$

Now, the sin part is always zero, whereas the cos part is equal to 4 if τ = 0 and 0 otherwise.

So, Z “generates” the standard Gaussian white noise (up to a multiplicative factor).

End of example.

The orthogonal increments property can be used to define the relationship between the


autocovariance sequence $\{s_\tau\}$ and the integrated spectrum $S^{(I)}(f)$:

$$s_\tau = E\{X_t X_{t+\tau}\} = E\{X_t^* X_{t+\tau}\} = E\left\{\int_{-1/2}^{1/2} e^{-i2\pi f' t}\, dZ^*(f')\int_{-1/2}^{1/2} e^{i2\pi f(t+\tau)}\, dZ(f)\right\} = \int_{-1/2}^{1/2}\int_{-1/2}^{1/2} e^{i2\pi(f-f')t}e^{i2\pi f\tau}\,E\{dZ^*(f')\,dZ(f)\}.$$

Because of the orthogonal increments property,

$$E\{dZ^*(f')\,dZ(f)\} = \begin{cases} dS^{(I)}(f) & f = f' \\ 0 & f \ne f' \end{cases}$$

so

$$s_\tau = \int_{-1/2}^{1/2} e^{i2\pi f\tau}\, dS^{(I)}(f),$$

which shows that the integrated spectrum determines the acvs for a stationary process. If

in fact S(I)(f) is differentiable everywhere with a derivative denoted by S(f) we have

E{|dZ(f)|2} = dS(I)(f) = S(f) df.

The function S(·) is called the spectral density function (sdf). Hence

$$s_\tau = \int_{-1/2}^{1/2} e^{i2\pi f\tau}\,S(f)\, df.$$

But a square summable deterministic sequence {gt} say has the Fourier representation

$$g_t = \int_{-1/2}^{1/2} G(f)\,e^{i2\pi f t}\, df,$$

where

$$G(f) = \sum_{t=-\infty}^{\infty} g_t\, e^{-i2\pi f t}.$$


If we assume that S(f) is square integrable, then S(f) is the Fourier transform of {sτ},

$$S(f) = \sum_{\tau=-\infty}^{\infty} s_\tau\, e^{-i2\pi f\tau}.$$

Hence,

{sτ} ←→ S(f),

i.e., {sτ} and S(f) are a F.T. pair.

Conclusion: because for real-valued processes, sτ is real-valued and symmetric, S(f) is nec-

essarily real-valued (obviously) and also symmetric!

Spectral Density Function

Subject to its existence, S(·) has the following interpretation: S(f) df is the average con-

tribution (over all realizations) to the power from components with frequencies in a small

interval about f . The power – or variance – is

$$\int_{-1/2}^{1/2} S(f)\, df.$$

Hence, S(f) is often called the power spectral density function or just power spectrum.

Properties: (assuming existence)

[1] $S^{(I)}(f) = \int_{-1/2}^{f} S(f')\, df'$.

[2] $0 \le S^{(I)}(f) \le \sigma^2$ where $\sigma^2 = \mathrm{var}\{X_t\}$; $S(f) \ge 0$.

[3] $S^{(I)}(-1/2) = 0$; $S^{(I)}(1/2) = \sigma^2$; $\int_{-1/2}^{1/2} S(f)\, df = \sigma^2$.

[4] $f < f' \Rightarrow S^{(I)}(f) \le S^{(I)}(f')$; $S(-f) = S(f)$.

Except, basically, for the scaling factor σ2, S(I)(f) has all the properties of a probability


distribution function, and hence is sometimes called a spectral distribution function.

Classification of Spectra

For most practical purposes any integrated spectrum $S^{(I)}(f)$ can be written as

$$S^{(I)}(f) = S^{(I)}_1(f) + S^{(I)}_2(f)$$

where the $S^{(I)}_j(f)$'s are nonnegative, nondecreasing functions with $S^{(I)}_j(-1/2) = 0$ and are of the following types:

[1] $S^{(I)}_1(\cdot)$ is absolutely continuous, i.e., its derivative exists for almost all $f$ and is equal almost everywhere to an sdf $S(\cdot)$ such that

$$S^{(I)}_1(f) = \int_{-1/2}^{f} S(f')\, df'.$$

[2] $S^{(I)}_2(\cdot)$ is a step function with jumps of size $\{p_l : l = 1, 2, \ldots\}$ at the points $\{f_l : l = 1, 2, \ldots\}$.

We consider the integrated spectrum to be a combination of two ‘pure’ forms :

case (a): $S^{(I)}_1(f) \ge 0$; $S^{(I)}_2(f) = 0$.

$\{X_t\}$ is said to have a purely continuous spectrum and $S(f)$ is absolutely integrable, with

$$\int_{-1/2}^{1/2} S(f)\cos(2\pi f\tau)\, df \quad \text{and} \quad \int_{-1/2}^{1/2} S(f)\sin(2\pi f\tau)\, df \to 0$$

as $\tau \to \infty$. [This is known as the Riemann-Lebesgue theorem.] But,

$$s_\tau = \int_{-1/2}^{1/2} e^{i2\pi f\tau}S(f)\, df = \int_{-1/2}^{1/2} S(f)\cos(2\pi f\tau)\, df + i\int_{-1/2}^{1/2} S(f)\sin(2\pi f\tau)\, df.$$

Hence sτ → 0 as |τ | → ∞. In other words, the acvs diminishes to zero (called “mixing


condition”).

case (b): $S^{(I)}_1(f) = 0$; $S^{(I)}_2(f) \ge 0$.

Here the integrated spectrum consists entirely of a step function, and the {Xt} is said

to have a purely discrete spectrum or a line spectrum. The acvs for a process with a

line spectrum never damps down to 0.

Examples see Figs 13 and 14.

We will not be studying processes falling under the (b) category.

White noise spectrum

Recall that a white noise process {ǫt} has acvs:

$$s_\tau = \begin{cases} \sigma^2_\epsilon & \tau = 0 \\ 0 & \text{otherwise.} \end{cases}$$

Therefore, the spectrum of a white noise process is given by:

$$S_\epsilon(f) = \sum_{\tau=-\infty}^{\infty} s_\tau\, e^{-i2\pi f\tau} = s_0 = \sigma^2_\epsilon,$$

i.e., white noise has a constant spectrum.

Spectral density function vs. autocovariance function

The sdf and acvs contain the same amount of information in that if we know one of them,

we can calculate the other. However, they are often not equally informative.

On some occasions, sdf or acvs proves to be the more sensitive and interpretable diagnostic

or exploratory tool, so it is often useful to apply both tools in the exploratory analysis of

time series data.

There are also other transformations, for example those based on wavelets, which help bring


out important features of the data.

Linear Filtering

A digital filter maps a sequence to another sequence. The following digital filtering method-

ology will be extremely useful in establishing spectral density functions of the linear time

series models covered so far, e.g. AR, MA, etc.

A digital filter L that transforms an input sequence {xt} into an output sequence {yt} is

called a linear time-invariant (LTI) digital filter if it has the following three properties:

[1] Scale-preservation:

L {{αxt}} = αL {{xt}} .

[2] Superposition:

L {{xt,1 + xt,2}} = L {{xt,1}}+ L {{xt,2}} .

[3] Time invariance:

If

L {{xt}} = {yt}, then L {{xt+τ}} = {yt+τ}.

Where τ is integer-valued, and the notation {xt+τ} refers to the sequence whose t-th

element is xt+τ .

Suppose we use a sequence with t-th element exp(i2πft) as the input to a LTI digital filter:

Let ξf,t = {ei2πft}, and let yf,t denote the output function:

yf,t = L{ξf,t}.

By properties [1] and [3]:

yf,t+τ = L{ξf,t+τ} = L{ei2πfτ ξf,t} = ei2πfτL{ξf,t} = ei2πfτyf,t.


In particular, for t = 0:

yf,τ = ei2πfτyf,0.

Now set τ = t:

yf,t = ei2πftyf,0.

Thus, when ξf,t is input to the LTI digital filter, the output is the same function multiplied

by some constant, yf,0, which is independent of time but will depend on f . Let G(f) = yf,0.

Then

L{ξf,t} = ξf,tG(f).

G(f) is called the transfer function or frequency response function of L. We can write

$$G(f) = |G(f)|\,e^{i\theta(f)}$$

where $|G(f)|$ is the gain and $\theta(f) = \arg\{G(f)\}$ is the phase.

Any LTI digital filter can be expressed in the form:

$$L\{\{X_t\}\} = \sum_{u=-\infty}^{\infty} g_u X_{t-u} \equiv \{Y_t\},$$

where {gu} is a real-valued deterministic sequence called the impulse response sequence.

Note,

$$L\{\{e^{i2\pi ft}\}\} = \sum_{u=-\infty}^{\infty} g_u e^{i2\pi f(t-u)} = e^{i2\pi ft}G(f),$$

with

$$G(f) = \sum_{u=-\infty}^{\infty} g_u e^{-i2\pi fu} \qquad \text{for } |f| \le \tfrac{1}{2}.$$


Note:

{gu} ←→ G(f) (F.T. pair).

We have,

$$Y_t = \sum_u g_u X_{t-u}.$$

Recall,

$$X_t = \int_{-1/2}^{1/2} e^{i2\pi ft}\, dZ_X(f), \qquad Y_t = \int_{-1/2}^{1/2} e^{i2\pi ft}\, dZ_Y(f),$$
$$\Rightarrow \int_{-1/2}^{1/2} e^{i2\pi ft}\, dZ_Y(f) = \sum_u g_u\int_{-1/2}^{1/2} e^{i2\pi f(t-u)}\, dZ_X(f) = \int_{-1/2}^{1/2} e^{i2\pi ft}G(f)\, dZ_X(f)$$

so that,

dZY (f) = G(f) dZX(f) ; (1 : 1)

and

E{|dZY (f)|2} = |G(f)|2E{|dZX(f)|2},

and if the spectral densities exist

SY (f) = |G(f)|2SX(f).

This relationship can be used to determine the sdf’s of discrete parameter stationary

processes.

Determination of sdf’s by LTI digital filtering


[1] q-th order moving average: MA(q),

Xt = ǫt − θ1,qǫt−1 − . . . − θq,qǫt−q,

with usual assumptions (mean zero). Define

L {{ǫt}} = ǫt − θ1,qǫt−1 − . . . − θq,qǫt−q,

so that {Xt} = L {{ǫt}}. To determine G(f), input ei2πft:

$$L\{\{e^{i2\pi ft}\}\} = e^{i2\pi ft} - \theta_{1,q}e^{i2\pi f(t-1)} - \cdots - \theta_{q,q}e^{i2\pi f(t-q)} = e^{i2\pi ft}\left[1 - \theta_{1,q}e^{-i2\pi f} - \cdots - \theta_{q,q}e^{-i2\pi fq}\right],$$

so that

$$G_\theta(f) = 1 - \theta_{1,q}e^{-i2\pi f} - \cdots - \theta_{q,q}e^{-i2\pi fq}.$$

Since

$$S_X(f) = |G_\theta(f)|^2 S_\epsilon(f) \quad \text{and} \quad S_\epsilon(f) = \sigma^2_\epsilon,$$

we have

$$S_X(f) = \sigma^2_\epsilon\,\big|1 - \theta_{1,q}e^{-i2\pi f} - \cdots - \theta_{q,q}e^{-i2\pi fq}\big|^2.$$

Let $z = e^{-i2\pi f}$ and define

$$H_\theta(z) = 1 - \theta_{1,q}z - \cdots - \theta_{q,q}z^q,$$

so that $G_\theta(f) = H_\theta(z)$. Of course, $H_\theta(z)$ is the characteristic polynomial of $X_t$. We have

$$|G_\theta(f)|^2 = G_\theta(f)G^*_\theta(f) \equiv H_\theta(z)H_\theta(z^{-1}).$$

Roots of $H(z)$ and $H(z^{-1})$ are inverses. Hence, if $X_t$ is invertible (roots of $H(z)$ outside


the unit circle), then there exists a non-invertible process with the same |Gθ(f)|2 (and

hence the same spectrum). Thus, we cannot determine from the spectrum whether

the process is invertible or not. This makes sense, since we cannot distinguish these

cases using the acvs either.

Example:

1. The invertible case: H(z) = 1− z/2.

2. The non-invertible case: H(z) = 1/2− z.

Both have the same spectrum, the same autocovariance sequence and the same auto-

correlation sequence.
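A numerical check of this example, assuming Python with NumPy (the grid size is arbitrary): the invertible and non-invertible filters have identical squared gain, hence identical spectra.

```python
import numpy as np

f = np.linspace(-0.5, 0.5, 1001)
z = np.exp(-1j * 2 * np.pi * f)
gain2_inv = np.abs(1 - z / 2) ** 2    # invertible case,     H(z) = 1 - z/2
gain2_non = np.abs(0.5 - z) ** 2      # non-invertible case, H(z) = 1/2 - z
print(np.allclose(gain2_inv, gain2_non))   # True (taking sigma_eps^2 = 1)
```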

[2] p-th order autoregressive process: AR(p),

Xt − φ1,pXt−1 − . . .− φp,pXt−p = ǫt

Define

L {{Xt}} = Xt − φ1,pXt−1 − . . . − φp,pXt−p,

so that L {{Xt}} = {ǫt}. By analogy to MA(q)

$$G_\phi(f) = 1 - \phi_{1,p}e^{-i2\pi f} - \cdots - \phi_{p,p}e^{-i2\pi fp}.$$

Since

$$|G_\phi(f)|^2 S_X(f) = S_\epsilon(f) \quad \text{and} \quad S_\epsilon(f) = \sigma^2_\epsilon,$$

we have

$$S_X(f) = \frac{\sigma^2_\epsilon}{\big|1 - \phi_{1,p}e^{-i2\pi f} - \cdots - \phi_{p,p}e^{-i2\pi fp}\big|^2}.$$

Interpretation of AR spectra


Recall that for an AR process we have characteristic equation

$$1 - \phi_{1,p}z - \phi_{2,p}z^2 - \cdots - \phi_{p,p}z^p$$

and the process is stationary if the roots of this equation lie outside the unit circle.

Consider an AR(2) process with complex characteristic roots; these roots must form a complex conjugate pair:

$$z = \frac{1}{r}e^{-i2\pi f'}, \qquad z = \frac{1}{r}e^{i2\pi f'}$$

and we can write

$$1 - \phi_{1,p}z - \phi_{2,p}z^2 = (rz - e^{-i2\pi f'})(rz - e^{i2\pi f'}) = r^2z^2 - zr(e^{-i2\pi f'} + e^{i2\pi f'}) + 1 = r^2z^2 - 2zr\cos(2\pi f') + 1$$

and the AR process can be written

$$(r^2B^2 - 2r\cos(2\pi f')B + 1)X_t = \epsilon_t \;\Rightarrow\; X_t = 2r\cos(2\pi f')X_{t-1} - r^2X_{t-2} + \epsilon_t.$$

The spectrum can be written in terms of the complex roots, by substituting $z = e^{-i2\pi f}$ in the characteristic equation:

$$S_X(f) = \frac{\sigma^2_\epsilon}{|re^{-i2\pi f} - e^{-i2\pi f'}|^2\,|re^{-i2\pi f} - e^{i2\pi f'}|^2}.$$


Now,

$$|re^{-i2\pi f} - e^{-i2\pi f'}|^2 = |e^{-i2\pi f}(r - e^{-i2\pi(f'-f)})|^2 = (r - e^{-i2\pi(f'-f)})(r - e^{i2\pi(f'-f)}) = r^2 - r(e^{-i2\pi(f'-f)} + e^{i2\pi(f'-f)}) + 1 = r^2 - 2r\cos(2\pi(f'-f)) + 1,$$

and similarly

$$|re^{-i2\pi f} - e^{i2\pi f'}|^2 = r^2 - 2r\cos(2\pi(f'+f)) + 1,$$

giving

$$S_X(f) = \frac{\sigma^2_\epsilon}{(r^2 - 2r\cos(2\pi(f'+f)) + 1)(r^2 - 2r\cos(2\pi(f'-f)) + 1)}.$$

The spectrum will be at its largest when the denominator is at its smallest; when $r$ is close to 1 this occurs when $f \approx \pm f'$. Notice also that at $f = \pm f'$ the denominator contains the factor $r^2 - 2r + 1 = (1-r)^2$, so the spectrum becomes larger as $r \to 1$ (from below, since $0 < r < 1$).

Generally speaking, complex roots will induce a peak in the spectrum, indicating a tendency towards a cycle at frequency $f'$; the larger the value of $r$, the more dominant the cycle. This may be termed pseudo-cyclical behaviour (recall that a deterministic cycle shows up as a sharp spike, i.e. a line spectrum).

A similar analysis is possible in the case of real roots.
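To see the pseudo-cyclical peak numerically (a sketch under assumed values of $r$, $f'$ and $\sigma^2_\epsilon$, not taken from the notes), we can evaluate the AR(2) spectrum on a frequency grid:

    import numpy as np

    # AR(2) with characteristic roots (1/r) e^{+/- i 2 pi f'}: phi_1 = 2 r cos(2 pi f'),
    # phi_2 = -r^2.  The spectrum should peak near f = f' and sharpen as r -> 1.
    sigma2_eps = 1.0
    r, f_prime = 0.95, 0.2                                   # illustrative values
    phi1, phi2 = 2 * r * np.cos(2 * np.pi * f_prime), -r ** 2

    f = np.linspace(0, 0.5, 501)
    z = np.exp(-2j * np.pi * f)
    S = sigma2_eps / np.abs(1 - phi1 * z - phi2 * z ** 2) ** 2

    print(f[np.argmax(S)])                                   # close to f_prime = 0.2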

[3] (p, q)-th order autoregressive, moving average process: ARMA(p, q),

$$X_t - \phi_{1,p}X_{t-1} - \ldots - \phi_{p,p}X_{t-p} = \epsilon_t - \theta_{1,q}\epsilon_{t-1} - \ldots - \theta_{q,q}\epsilon_{t-q}.$$

If we write this as

$$X_t - \phi_{1,p}X_{t-1} - \ldots - \phi_{p,p}X_{t-p} = Y_t; \qquad Y_t = \epsilon_t - \theta_{1,q}\epsilon_{t-1} - \ldots - \theta_{q,q}\epsilon_{t-q},$$

then we have

$$|G_\phi(f)|^2 S_X(f) = S_Y(f) \quad\text{and}\quad S_Y(f) = |G_\theta(f)|^2 S_\epsilon(f),$$

so that

$$S_X(f) = S_\epsilon(f)\,\frac{|G_\theta(f)|^2}{|G_\phi(f)|^2} = \sigma^2_\epsilon\,\frac{|1 - \theta_{1,q}e^{-i2\pi f} - \ldots - \theta_{q,q}e^{-i2\pi fq}|^2}{|1 - \phi_{1,p}e^{-i2\pi f} - \ldots - \phi_{p,p}e^{-i2\pi fp}|^2}.$$

[4] Differencing

Let $\{X_t\}$ be a stationary process with sdf $S_X(f)$, and let $Y_t = X_t - X_{t-1}$. Then

$$L\{\{e^{i2\pi ft}\}\} = e^{i2\pi ft} - e^{i2\pi f(t-1)} = e^{i2\pi ft}(1 - e^{-i2\pi f}) = e^{i2\pi ft}G(f),$$

so

$$|G(f)|^2 = |1 - e^{-i2\pi f}|^2 = |e^{-i\pi f}(e^{i\pi f} - e^{-i\pi f})|^2 = |e^{-i\pi f}\,2i\sin(\pi f)|^2 = 4\sin^2(\pi f),$$

and hence $S_Y(f) = 4\sin^2(\pi f)\,S_X(f)$.
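A one-line numerical check of this squared gain (illustrative only; the frequency grid is an assumption):

    import numpy as np

    # The first-difference filter Y_t = X_t - X_{t-1} has transfer function
    # G(f) = 1 - e^{-i 2 pi f} and squared gain 4 sin^2(pi f).
    f = np.linspace(-0.5, 0.5, 401)
    G = 1 - np.exp(-2j * np.pi * f)
    assert np.allclose(np.abs(G) ** 2, 4 * np.sin(np.pi * f) ** 2)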


4 Estimation

4.1 Estimation of mean and autocovariance function

Ergodic Property

Methods we shall look at for estimating quantities such as the autocovariance function will use observations from a single realization. Such methods are based on the strategy of replacing ensemble averages by their corresponding time averages.

Sample mean:

Given a time series $X_1, X_2, \ldots, X_N$, let

$$\bar{X} = \frac{1}{N}\sum_{t=1}^{N} X_t \qquad \left(\text{assume } \sum_{\tau=-\infty}^{\infty}|s_\tau| < \infty\right).$$

Then

$$E\{\bar{X}\} = \frac{1}{N}\sum_{t=1}^{N} E\{X_t\} = \frac{1}{N}\,N\mu = \mu,$$

so $\bar{X}$ is an unbiased estimator of $\mu$. Hence, $\bar{X}$ converges to $\mu$ in mean square if

$$\lim_{N\to\infty}\mathrm{var}\{\bar{X}\} = 0.$$

Now,

$$\mathrm{var}\{\bar{X}\} = E\{(\bar{X}-\mu)^2\} = E\left\{\left(\frac{1}{N}\sum_{t=1}^{N}(X_t-\mu)\right)^2\right\} = \frac{1}{N^2}\sum_{t=1}^{N}\sum_{u=1}^{N} E\{(X_t-\mu)(X_u-\mu)\} = \frac{1}{N^2}\sum_{t=1}^{N}\sum_{u=1}^{N} s_{u-t}$$


$$= \frac{1}{N^2}\sum_{\tau=-(N-1)}^{N-1}\;\sum_{k=1}^{N-|\tau|} s_\tau = \frac{1}{N^2}\sum_{\tau=-(N-1)}^{N-1}(N-|\tau|)s_\tau = \frac{1}{N}\sum_{\tau=-(N-1)}^{N-1}\left(1 - \frac{|\tau|}{N}\right)s_\tau.$$

The summation interchange merely swaps row sums for diagonal sums, e.g., for $N = 4$ the elements are $s_{u-t}$:

[Diagram: a $4 \times 4$ grid of dots indexed by $t$ (rows) and $u$ (columns), with the diagonals labelled $\tau = -3, \ldots, 3$; summing along each diagonal collects the $(N - |\tau|)s_\tau$ terms.]

By the Cesàro summability theorem, if $\sum_{\tau=-(N-1)}^{N-1} s_\tau$ converges to a limit as $N \to \infty$ (it must, since $\sum_{\tau=-(N-1)}^{N-1}|s_\tau| \le \sum_{\tau=-\infty}^{\infty}|s_\tau| < \infty$ for all $N$), then $\sum_{\tau=-(N-1)}^{N-1}\left(1 - \frac{|\tau|}{N}\right)s_\tau$ converges to the same limit.


We can thus conclude that

$$\lim_{N\to\infty} N\,\mathrm{var}\{\bar{X}\} = \lim_{N\to\infty}\sum_{\tau=-(N-1)}^{N-1}\left(1 - \frac{|\tau|}{N}\right)s_\tau = \lim_{N\to\infty}\sum_{\tau=-(N-1)}^{N-1} s_\tau = \sum_{\tau=-\infty}^{\infty} s_\tau.$$

The assumption of absolute summability of $\{s_\tau\}$ implies that $\{X_t\}$ has a purely continuous spectrum with sdf

$$S(f) = \sum_{\tau=-\infty}^{\infty} s_\tau e^{-i2\pi f\tau}, \qquad\text{so that}\qquad S(0) = \sum_{\tau=-\infty}^{\infty} s_\tau.$$

Thus

$$\lim_{N\to\infty} N\,\mathrm{var}\{\bar{X}\} = S(0), \qquad\text{i.e.}\qquad \mathrm{var}\{\bar{X}\} \approx \frac{S(0)}{N} \text{ for large } N,$$

and therefore $\mathrm{var}\{\bar{X}\} \to 0$.

Note that the convergence of $\bar{X}$ depends only on the value of the spectrum at $f = 0$, i.e. on $S(0)$.
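A quick Monte Carlo illustration of $\mathrm{var}\{\bar{X}\} \approx S(0)/N$ (a sketch with made-up parameter values, not part of the notes), using an AR(1) process for which $S(0) = \sigma^2_\epsilon/(1-\phi)^2$:

    import numpy as np

    # Simulate many AR(1) realisations X_t = phi X_{t-1} + eps_t and compare the
    # empirical variance of the sample mean with S(0)/N.
    rng = np.random.default_rng(0)
    phi, sigma_eps, N, reps = 0.6, 1.0, 2000, 500

    means = np.empty(reps)
    for r in range(reps):
        eps = rng.normal(scale=sigma_eps, size=N)
        X = np.empty(N)
        X[0] = eps[0] / np.sqrt(1 - phi ** 2)      # start in the stationary distribution
        for t in range(1, N):
            X[t] = phi * X[t - 1] + eps[t]
        means[r] = X.mean()

    S0 = sigma_eps ** 2 / (1 - phi) ** 2
    print(means.var(), S0 / N)                     # the two should be comparable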

Autocovariance Sequence:

Now,

$$s_\tau = E\{(X_t - \mu)(X_{t+\tau} - \mu)\},$$

so a natural estimator for the acvs is

$$s^{(u)}_\tau = \frac{1}{N - |\tau|}\sum_{t=1}^{N-|\tau|}(X_t - \bar{X})(X_{t+|\tau|} - \bar{X}), \qquad \tau = 0, \pm 1, \ldots, \pm(N-1).$$


Note that $s^{(u)}_{-\tau} = s^{(u)}_\tau$, as it should be.

If we replace $\bar{X}$ by $\mu$:

$$E\{s^{(u)}_\tau\} = \frac{1}{N - |\tau|}\sum_{t=1}^{N-|\tau|} E\{(X_t - \mu)(X_{t+|\tau|} - \mu)\} = \frac{1}{N - |\tau|}\sum_{t=1}^{N-|\tau|} s_\tau = s_\tau, \qquad \tau = 0, \pm 1, \ldots, \pm(N-1).$$

Thus, $s^{(u)}_\tau$ is an unbiased estimator of $s_\tau$ when $\mu$ is known (hence the $(u)$, for unbiased). Most texts refer to $s^{(u)}_\tau$ as unbiased; however, if $\mu$ is estimated by $\bar{X}$, $s^{(u)}_\tau$ is typically a biased estimator of $s_\tau$!

A second estimator of $s_\tau$ is typically preferred:

$$s^{(p)}_\tau = \frac{1}{N}\sum_{t=1}^{N-|\tau|}(X_t - \bar{X})(X_{t+|\tau|} - \bar{X}), \qquad \tau = 0, \pm 1, \ldots, \pm(N-1).$$

With $\bar{X}$ replaced by $\mu$:

$$E\{s^{(p)}_\tau\} = \frac{1}{N}\sum_{t=1}^{N-|\tau|} s_\tau = \left(1 - \frac{|\tau|}{N}\right)s_\tau,$$

so that $s^{(p)}_\tau$ is a biased estimator, and the magnitude of its bias increases as $|\tau|$ increases. Most texts refer to $s^{(p)}_\tau$ as biased.

Why should we prefer the "biased" estimator $s^{(p)}_\tau$ to the "unbiased" estimator $s^{(u)}_\tau$?

[1] For many stationary processes of practical interest

$$\mathrm{mse}\{s^{(p)}_\tau\} < \mathrm{mse}\{s^{(u)}_\tau\},$$


where

$$\mathrm{mse}\{\hat{s}_\tau\} = E\{(\hat{s}_\tau - s_\tau)^2\} = E\{\hat{s}_\tau^2\} - 2s_\tau E\{\hat{s}_\tau\} + s_\tau^2 = \left(E\{\hat{s}_\tau^2\} - E^2\{\hat{s}_\tau\}\right) + E^2\{\hat{s}_\tau\} - 2s_\tau E\{\hat{s}_\tau\} + s_\tau^2 = \mathrm{var}\{\hat{s}_\tau\} + (s_\tau - E\{\hat{s}_\tau\})^2 = \text{variance} + (\text{bias})^2.$$

[2] If $\{X_t\}$ has a purely continuous spectrum we know that $s_\tau \to 0$ as $|\tau| \to \infty$. It therefore makes sense to choose an estimator that decreases nicely as $|\tau| \to N - 1$ (i.e. choose $s^{(p)}_\tau$).

[3] We know that the acvs must be positive semidefinite; the sequence $\{s^{(p)}_\tau\}$ has this property, whereas the sequence $\{s^{(u)}_\tau\}$ may not.
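Both estimators are straightforward to compute; the following sketch (illustrative helper name and test data, not from the notes) returns the pair $(s^{(u)}_\tau, s^{(p)}_\tau)$ for a given lag:

    import numpy as np

    # The "unbiased" estimator divides by N - |tau|; the preferred "biased" one by N.
    def acvs_estimates(X, tau):
        X = np.asarray(X, dtype=float)
        N, k = len(X), abs(tau)
        Xbar = X.mean()
        cross = np.sum((X[: N - k] - Xbar) * (X[k:] - Xbar))
        return cross / (N - k), cross / N      # (s_u, s_p)

    rng = np.random.default_rng(1)
    print(acvs_estimates(rng.normal(size=200), tau=5))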

4.2 Parametric model fitting: autoregressive processes

Here we concentrate on models of the form

$$X_t - \phi_{1,p}X_{t-1} - \ldots - \phi_{p,p}X_{t-p} = \epsilon_t.$$

As we have seen, the corresponding sdf is

$$S(f) = \frac{\sigma^2_\epsilon}{|1 - \phi_{1,p}e^{-i2\pi f} - \ldots - \phi_{p,p}e^{-i2\pi fp}|^2}.$$

This class of models is appealing for time series analysis for several reasons:

[1] Any time series with a purely continuous sdf can be approximated well by an AR(p) model if p is large enough.


[2] There exist efficient algorithms for fitting AR(p) models to time series.

[3] Quite a few physical phenomena are reverberant and hence an AR model is naturally appropriate.

A method for estimating the $\{\phi_{j,p}\}$ – Yule-Walker

We start by multiplying the defining equation by $X_{t-k}$:

$$X_tX_{t-k} = \sum_{j=1}^{p}\phi_{j,p}X_{t-j}X_{t-k} + \epsilon_tX_{t-k}.$$

Taking expectations, for $k > 0$:

$$s_k = \sum_{j=1}^{p}\phi_{j,p}s_{k-j}.$$

Let $k = 1, 2, \ldots, p$ and recall that $s_{-\tau} = s_\tau$ to obtain

$$\begin{aligned}
s_1 &= \phi_{1,p}s_0 + \phi_{2,p}s_1 + \ldots + \phi_{p,p}s_{p-1}\\
s_2 &= \phi_{1,p}s_1 + \phi_{2,p}s_0 + \ldots + \phi_{p,p}s_{p-2}\\
&\;\;\vdots\\
s_p &= \phi_{1,p}s_{p-1} + \phi_{2,p}s_{p-2} + \ldots + \phi_{p,p}s_0,
\end{aligned}$$

or, in matrix notation,

$$\gamma_p = \Gamma_p\phi_p,$$

where $\gamma_p = [s_1, s_2, \ldots, s_p]^T$, $\phi_p = [\phi_{1,p}, \phi_{2,p}, \ldots, \phi_{p,p}]^T$ and

$$\Gamma_p = \begin{pmatrix}
s_0 & s_1 & \ldots & s_{p-1}\\
s_1 & s_0 & \ldots & s_{p-2}\\
\vdots & \vdots & & \vdots\\
s_{p-1} & s_{p-2} & \ldots & s_0
\end{pmatrix}.$$


Note: this is a symmetric Toeplitz matrix, which we have met already; all elements on a given diagonal are the same.

Suppose we do not know the $\{s_\tau\}$ but the mean is known to be zero; then take

$$\hat{s}_\tau = \frac{1}{N}\sum_{t=1}^{N-|\tau|}X_tX_{t+|\tau|},$$

and substitute these for the $s_\tau$'s in $\gamma_p$ and $\Gamma_p$ to obtain $\hat{\gamma}_p$, $\hat{\Gamma}_p$, from which we estimate $\phi_p$ as $\hat{\phi}_p$:

$$\hat{\phi}_p = \hat{\Gamma}_p^{-1}\hat{\gamma}_p.$$

Finally, we need to estimate $\sigma^2_\epsilon$. To do so, we multiply the defining equation by $X_t$ and take expectations to obtain

$$s_0 = \sum_{j=1}^{p}\phi_{j,p}s_j + E\{\epsilon_tX_t\} = \sum_{j=1}^{p}\phi_{j,p}s_j + \sigma^2_\epsilon,$$

so that as an estimator for $\sigma^2_\epsilon$ we take

$$\hat{\sigma}^2_\epsilon = \hat{s}_0 - \sum_{j=1}^{p}\hat{\phi}_{j,p}\hat{s}_j.$$

The estimators $\hat{\phi}_p$ and $\hat{\sigma}^2_\epsilon$ are called the Yule-Walker estimators of the AR(p) process. The resulting estimate of the sdf is

$$\hat{S}(f) = \frac{\hat{\sigma}^2_\epsilon}{\left|1 - \sum_{j=1}^{p}\hat{\phi}_{j,p}e^{-i2\pi fj}\right|^2}.$$
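The whole procedure fits in a few lines; the sketch below (illustrative function name and simulated AR(2) data, not from the notes) builds the Toeplitz system from the biased acvs estimates and solves it:

    import numpy as np

    # Yule-Walker fit for a zero-mean AR(p) series.
    def yule_walker(X, p):
        X = np.asarray(X, dtype=float)
        N = len(X)
        # biased acvs estimates s_0, ..., s_p (mean assumed to be zero)
        s = np.array([np.sum(X[: N - k] * X[k:]) / N for k in range(p + 1)])
        Gamma = np.array([[s[abs(i - j)] for j in range(p)] for i in range(p)])
        gamma = s[1 : p + 1]
        phi = np.linalg.solve(Gamma, gamma)      # phi_hat = Gamma_hat^{-1} gamma_hat
        sigma2 = s[0] - phi @ gamma              # sigma_hat^2 = s_0 - sum_j phi_j s_j
        return phi, sigma2

    # quick check on a simulated AR(2) with phi = (0.5, -0.3)
    rng = np.random.default_rng(2)
    X = np.zeros(5000)
    eps = rng.normal(size=5000)
    for t in range(2, 5000):
        X[t] = 0.5 * X[t - 1] - 0.3 * X[t - 2] + eps[t]
    print(yule_walker(X, 2))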


Least squares estimation of the $\{\phi_{j,p}\}$

Let $\{X_t\}$ be a zero-mean AR(p) process, i.e.,

$$X_t = \phi_{1,p}X_{t-1} + \phi_{2,p}X_{t-2} + \ldots + \phi_{p,p}X_{t-p} + \epsilon_t.$$

We can formulate an appropriate least squares model in terms of data $X_1, X_2, \ldots, X_N$ as follows:

$$X_F = F\phi + \epsilon_F,$$

where

$$F = \begin{pmatrix}
X_p & X_{p-1} & \ldots & X_1\\
X_{p+1} & X_p & \ldots & X_2\\
\vdots & \vdots & & \vdots\\
X_{N-1} & X_{N-2} & \ldots & X_{N-p}
\end{pmatrix},$$

and

$$X_F = \begin{pmatrix} X_{p+1}\\ X_{p+2}\\ \vdots\\ X_N \end{pmatrix}; \qquad
\phi = \begin{pmatrix} \phi_{1,p}\\ \phi_{2,p}\\ \vdots\\ \phi_{p,p} \end{pmatrix}; \qquad
\epsilon_F = \begin{pmatrix} \epsilon_{p+1}\\ \epsilon_{p+2}\\ \vdots\\ \epsilon_N \end{pmatrix}.$$

We can thus estimate $\phi$ by finding that $\phi$ such that

$$SS_F(\phi) = \sum_{t=p+1}^{N}\left(X_t - \sum_{k=1}^{p}\phi_{k,p}X_{t-k}\right)^2 = \sum_{t=p+1}^{N}\epsilon_t^2 = (X_F - F\phi)^T(X_F - F\phi)$$

is minimized. If we denote the vector that minimizes the above as $\hat{\phi}_F$, standard least squares theory tells us that it is given by

$$\hat{\phi}_F = (F^TF)^{-1}F^TX_F.$$


Note: convince yourselves of this using the fact that

$$\frac{\partial}{\partial x}(Ax + b)^T(Ax + b) = 2A^T(Ax + b).$$

We can estimate the innovations variance $\sigma^2_\epsilon$ by the usual estimator of the residual variation, namely

$$\hat{\sigma}^2_F = \frac{(X_F - F\hat{\phi}_F)^T(X_F - F\hat{\phi}_F)}{N - 2p}.$$

(Note: there are $N - p$ effective observations, and $p$ parameters are estimated.)

The estimator $\hat{\phi}_F$ is known as the forward least squares estimator of $\phi$.

Notes:

[1] $\hat{\phi}_F$ produces estimated models which need not be stationary. This may be a concern for prediction; however, for spectral estimation the parameter values will still produce a valid sdf (i.e., nonnegative everywhere, symmetric about the origin and integrates to a finite number).

[2] The Yule-Walker estimates can be formulated as a least squares problem.
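A minimal sketch of the forward least squares fit (illustrative function name, not from the notes); it builds the design matrix $F$, solves the least squares problem, and estimates the innovations variance with the $N - 2p$ divisor used above:

    import numpy as np

    def ar_forward_ls(X, p):
        X = np.asarray(X, dtype=float)
        N = len(X)
        # row t of F holds (X_{t-1}, ..., X_{t-p}) for t = p+1, ..., N
        F = np.column_stack([X[p - k - 1 : N - k - 1] for k in range(p)])
        XF = X[p:]
        phi, *_ = np.linalg.lstsq(F, XF, rcond=None)   # (F^T F)^{-1} F^T X_F
        resid = XF - F @ phi
        sigma2 = resid @ resid / (N - 2 * p)           # residual-variance estimate
        return phi, sigma2

    # e.g. ar_forward_ls(X, 2) for the AR(2) series simulated in the previous sketch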

4.3 Non-parametric spectral estimation – the periodogram

Suppose the zero-mean discrete stationary process $\{X_t\}$ has a purely continuous spectrum with sdf $S(f)$. We have

$$S(f) = \sum_{\tau=-\infty}^{\infty} s_\tau e^{-i2\pi f\tau}, \qquad |f| \le \frac{1}{2}.$$

With $\mu = 0$, we can use the biased estimator of $s_\tau$,

$$s^{(p)}_\tau = \frac{1}{N}\sum_{t=1}^{N-|\tau|}X_tX_{t+|\tau|},$$

for $|\tau| \le N - 1$, but not for $|\tau| \ge N$. Hence we could replace $s_\tau$ by $s^{(p)}_\tau$ for $|\tau| \le N - 1$ and assume $s_\tau = 0$ for $|\tau| \ge N$.

This gives

$$S^{(p)}(f) = \sum_{\tau=-(N-1)}^{N-1} s^{(p)}_\tau e^{-i2\pi f\tau} = \frac{1}{N}\sum_{\tau=-(N-1)}^{N-1}\sum_{t=1}^{N-|\tau|}X_tX_{t+|\tau|}e^{-i2\pi f\tau} = \frac{1}{N}\sum_{j=1}^{N}\sum_{k=1}^{N}X_jX_ke^{-i2\pi f(k-j)} = \frac{1}{N}\left|\sum_{t=1}^{N}X_te^{-i2\pi ft}\right|^2,$$

where the summation interchange has merely swapped diagonal sums for row sums (see section 4.1 on the ergodic property). $S^{(p)}(f)$ defined above is known as the periodogram, and is defined over $[-1/2, 1/2]$.

Note that $\{s^{(p)}_\tau\}$ and $S^{(p)}(f)$ form a Fourier transform pair,

$$\{s^{(p)}_\tau\} \longleftrightarrow S^{(p)}(f) \qquad\text{(hence the $(p)$, for periodogram)},$$

just like the population quantities

$$\{s_\tau\} \longleftrightarrow S(f).$$

Hence, $\{s^{(p)}_\tau\}$ can be written as

$$s^{(p)}_\tau = \int_{-1/2}^{1/2}S^{(p)}(f)e^{i2\pi f\tau}\,df, \qquad |\tau| \le N - 1.$$
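In practice the periodogram is evaluated at the Fourier frequencies $f_k = k/N$ via the FFT; a minimal sketch (illustrative function name and white-noise test input, not from the notes):

    import numpy as np

    def periodogram(X):
        X = np.asarray(X, dtype=float)
        N = len(X)
        J = np.fft.fft(X) / np.sqrt(N)            # J(f_k) = N^{-1/2} sum_t X_t e^{-i 2 pi f_k t}
        # frequencies f_k = k/N (values above 1/2 correspond to negative frequencies)
        return np.arange(N) / N, np.abs(J) ** 2   # f_k and S^(p)(f_k)

    rng = np.random.default_rng(3)
    f_k, S_p = periodogram(rng.normal(size=256))
    print(S_p.mean())   # for unit-variance white noise this is close to S(f) = 1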

If $S^{(p)}(f)$ were an ideal estimator of $S(f)$ we would have

[1] $E\{S^{(p)}(f)\} \approx S(f)$ for all $f$,

[2] $\mathrm{var}\{S^{(p)}(f)\} \to 0$ as $N \to \infty$, and

[3] $\mathrm{cov}\{S^{(p)}(f), S^{(p)}(f')\} \approx 0$ for $f \ne f'$.

We find that

[1] is a good approximation for some processes,

[2] is false,

[3] holds if $f$ and $f'$ are certain distinct frequencies, namely the Fourier frequencies $f_k = k/N$ (taking $\Delta t = 1$).

We firstly look at the expectation in [1] (assuming $\mu = 0$):

$$E\{S^{(p)}(f)\} = \sum_{\tau=-(N-1)}^{N-1}E\{s^{(p)}_\tau\}e^{-i2\pi f\tau} = \sum_{\tau=-(N-1)}^{N-1}\left(1 - \frac{|\tau|}{N}\right)s_\tau e^{-i2\pi f\tau}.$$

Hence, if we know the acvs $\{s_\tau\}$ we can work out from this what $E\{S^{(p)}(f)\}$ will be. We can obtain much more insight by considering

$$E\{|J(f)|^2\}, \qquad\text{where}\qquad J(f) = \frac{1}{\sqrt{N}}\sum_{t=1}^{N}X_te^{-i2\pi ft}, \quad |f| \le \frac{1}{2},$$

noting that $S^{(p)}(f) = |J(f)|^2$.

We know from the spectral representation theorem that

$$X_t = \int_{-1/2}^{1/2}e^{i2\pi f't}\,dZ(f'),$$

so that

$$J(f) = \sum_{t=1}^{N}\left(\int_{-1/2}^{1/2}\frac{1}{\sqrt{N}}e^{i2\pi f't}\,dZ(f')\right)e^{-i2\pi ft} = \int_{-1/2}^{1/2}\sum_{t=1}^{N}\frac{1}{\sqrt{N}}e^{-i2\pi(f-f')t}\,dZ(f').$$

We find that

$$E\{S^{(p)}(f)\} = E\{|J(f)|^2\} = E\{J^*(f)J(f)\} = E\left\{\int_{-1/2}^{1/2}\sum_{t=1}^{N}\frac{1}{\sqrt{N}}e^{i2\pi(f-f')t}\,dZ^*(f')\int_{-1/2}^{1/2}\sum_{s=1}^{N}\frac{1}{\sqrt{N}}e^{-i2\pi(f-f'')s}\,dZ(f'')\right\}$$

$$= \int_{-1/2}^{1/2}\int_{-1/2}^{1/2}\sum_{t=1}^{N}\frac{1}{\sqrt{N}}e^{i2\pi(f-f')t}\sum_{s=1}^{N}\frac{1}{\sqrt{N}}e^{-i2\pi(f-f'')s}\,E\{dZ^*(f')\,dZ(f'')\} = \int_{-1/2}^{1/2}\mathcal{F}(f-f')S(f')\,df',$$

where $\mathcal{F}$ is Fejer's kernel, defined by

$$\mathcal{F}(f) = \left|\sum_{t=1}^{N}\frac{1}{\sqrt{N}}e^{-i2\pi ft}\right|^2 = \frac{\sin^2(N\pi f)}{N\sin^2(\pi f)}.$$

This result tells us that the expected value of $S^{(p)}(f)$ is the true spectrum convolved with Fejer's kernel. To understand the implications of this we need to know the properties of Fejer's kernel:

[1] For all integers $N \ge 1$, $\mathcal{F}(f) \to N$ as $f \to 0$.

[2] For $N \ge 1$, $f \in [-1/2, 1/2]$ and $f \ne 0$, $\mathcal{F}(f) < \mathcal{F}(0)$.

[3] For $f \in [-1/2, 1/2]$, $f \ne 0$, $\mathcal{F}(f) \to 0$ as $N \to \infty$.

[4] For any integer $k \ne 0$ such that $f_k = k/N \in [-1/2, 1/2]$, $\mathcal{F}(f_k) = 0$.

[5] $\int_{-1/2}^{1/2}\mathcal{F}(f)\,df = 1$.
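Properties [4] and [5] are easy to check numerically (a sketch with an arbitrary choice of $N$ and grid, not part of the notes):

    import numpy as np

    # Fejer's kernel for N = 16 on a fine grid of [-1/2, 1/2].
    N = 16
    f = np.linspace(-0.5, 0.5, 20001)
    F = np.empty_like(f)
    nz = ~np.isclose(np.sin(np.pi * f), 0.0)   # avoid 0/0 at f = 0
    F[nz] = np.sin(N * np.pi * f[nz]) ** 2 / (N * np.sin(np.pi * f[nz]) ** 2)
    F[~nz] = N                                 # limit value at f = 0

    print(F.mean())                            # Riemann sum of the integral: close to 1
    print(F[np.isclose(f, 1 / N)])             # essentially 0 at the Fourier frequency f_1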

From the above discussion, we can see that the periodogram is an asymptotically unbiased estimator of the true spectral density. However, we now turn to property [2] above: the variance of the periodogram does not go to zero as the sample size goes to infinity.


The intuition for this is best seen for Gaussian time series. From the formula for the periodogram, we can see that it is simply "the modulus squared of a complex random variable with mean zero". To start with, let us ignore the complex-valued nature of this variable and imagine that it is real-valued. Squaring a mean-zero normal variable produces a $\chi^2_1$ variable. We know that its mean is, approximately, a constant (equal to the true spectral density). Because the variance of a $\chi^2_1$ variable equals twice its mean squared, the variance will also be approximately a positive constant, and in particular it will not go to zero.

The same phenomenon happens for the "correct" complex-valued variable. Note that squaring the real and imaginary parts and adding them together is a bit like adding the squares of two (asymptotically independent, as it turns out) normal variables with mean zero and equal variances. Thus the resulting variable will be asymptotically exponential ($\tfrac{1}{2}\chi^2_2$). Again, for an exponential variable the variance equals the mean squared, so we cannot hope for the variance to go to zero if the mean goes to a positive constant.

For a precise result regarding the asymptotic distribution of the periodogram, see Brockwell & Davis, Theorem 10.3.2.

The way to reduce the variance of the periodogram is by smoothing. There are several smoothing techniques for the periodogram, some very advanced. However, we will study what is probably the simplest possible smoothing technique: kernel smoothing via the uniform kernel.

To simplify the arguments even further and "abstract away" unnecessary details, we consider a model which only approximates the periodogram. In the approximate model, each observation is exponential with mean $g_i$, independently of the other observations. However, the arguments we will make for this model apply, with only very slight changes, to the full model for the periodogram.

The approximate model is: $y_i$, for $i = 1, \ldots, N$, are independent, exponentially distributed variables with means $g_i$ (and, obviously, variances $g_i^2$), where $g_i$ is a sampled version of a "smooth" (Lipschitz-continuous) function $g(z)$, in the sense that $g_i = g(i/N)$. The smoothness of the underlying function is essential for kernel smoothing to work: it would be pointless to average the observations if the underlying means were completely unrelated to each other.

To form the kernel smoothing estimate with the uniform kernel, simply take the average of the neighbouring observations:

$$\hat{g}_i = \frac{1}{2M+1}\sum_{j=i-M}^{i+M}y_j.$$

Firstly, to compute the mean, we have

$$E(\hat{g}_i) = \frac{1}{2M+1}\sum_{j=i-M}^{i+M}g_j = g_i + \frac{1}{2M+1}\sum_{j=i-M}^{i+M}(g_j - g_i).$$

Due to the Lipschitz property, the bias is bounded as

$$\frac{1}{2M+1}\sum_{j=i-M}^{i+M}|g_j - g_i| \le \frac{C}{2M+1}\sum_{j=i-M}^{i+M}\frac{|i-j|}{N} \le \frac{CM}{N}.$$

On the other hand, for the variance we have

$$\mathrm{Var}(\hat{g}_i) = \frac{1}{(2M+1)^2}\sum_{j=i-M}^{i+M}g_j^2 \le \max_z g^2(z)\,\frac{1}{2M+1} \le \frac{C}{M}.$$

The mean-squared error (of any estimator) equals its bias squared plus its variance (why?). Thus here, the best rate of convergence to zero can be obtained by equating

$$\frac{M^2}{N^2} = \frac{1}{M},$$

and solving for $M$ to give $M = O(N^{2/3})$.
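The following sketch implements the uniform-kernel average in the approximate exponential model (the mean function, seed and sample size are illustrative choices, not from the notes):

    import numpy as np

    # y_i independent exponential with mean g(i/N); smooth with a (2M+1)-point average.
    rng = np.random.default_rng(4)
    N = 2000
    g = 2 + np.sin(2 * np.pi * np.arange(1, N + 1) / N)   # a smooth (Lipschitz) mean function
    y = rng.exponential(scale=g)

    M = int(N ** (2 / 3))                                 # bandwidth of the order N^{2/3}
    kernel = np.ones(2 * M + 1) / (2 * M + 1)
    g_hat = np.convolve(y, kernel, mode="same")           # uniform-kernel estimate

    interior = slice(M, N - M)                            # ignore edge effects
    print(np.mean((g_hat[interior] - g[interior]) ** 2))  # empirical MSE of the estimate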


5 Forecasting

Suppose we wish to predict the value $X_{t+l}$ of a process, given $X_t, X_{t-1}, X_{t-2}, \ldots$. Let the appropriate model for $\{X_t\}$ be an ARMA(p, q) process:

$$\Phi(B)X_t = \Theta(B)\epsilon_t.$$

Consider a forecast $X_t(l)$ of $X_{t+l}$ (an $l$-step ahead forecast) which is a linear combination of $X_t, X_{t-1}, X_{t-2}, \ldots$:

$$X_t(l) = \sum_{k=0}^{\infty}\pi_kX_{t-k}.$$

Note: this assumes a semi-infinite realization of $\{X_t\}$. Let us now assume that $\{X_t\}$ can be written as a one-sided linear process, so that

$$X_t = \sum_{k=0}^{\infty}\psi_k\epsilon_{t-k} = \Psi(B)\epsilon_t, \qquad\text{and}\qquad X_{t+l} = \sum_{k=0}^{\infty}\psi_k\epsilon_{t+l-k} = \Psi(B)\epsilon_{t+l}.$$

Hence,

$$X_t(l) = \sum_{k=0}^{\infty}\pi_kX_{t-k} = \sum_{k=0}^{\infty}\pi_k\Psi(B)\epsilon_{t-k} = \Pi(B)\Psi(B)\epsilon_t.$$

Let $\delta(B) = \Pi(B)\Psi(B)$, so that

$$X_t(l) = \delta(B)\epsilon_t = \sum_{k=0}^{\infty}\delta_k\epsilon_{t-k}.$$

Now,

$$X_{t+l} = \sum_{k=0}^{\infty}\psi_k\epsilon_{t+l-k} = \underbrace{\sum_{k=0}^{l-1}\psi_k\epsilon_{t+l-k}}_{(A)} + \underbrace{\sum_{k=l}^{\infty}\psi_k\epsilon_{t+l-k}}_{(B)}.$$

(A) involves future $\epsilon_t$'s, and so represents the "unpredictable" part of $X_{t+l}$.

(B) depends only on past and present values of $\epsilon_t$, thus representing the "predictable" part of $X_{t+l}$.

Hence we would expect

$$X_t(l) = \sum_{k=l}^{\infty}\psi_k\epsilon_{t+l-k} = \sum_{j=0}^{\infty}\psi_{j+l}\epsilon_{t-j},$$

so that $\delta_k \equiv \psi_{k+l}$. This can be readily proved. For linear least squares, we want to minimize

$$E\{(X_{t+l} - X_t(l))^2\} = E\left(\sum_{k=0}^{l-1}\psi_k\epsilon_{t+l-k} + \sum_{k=0}^{\infty}[\psi_{k+l} - \delta_k]\epsilon_{t-k}\right)^2 = \sigma^2_\epsilon\left\{\left(\sum_{k=0}^{l-1}\psi_k^2\right) + \sum_{k=0}^{\infty}(\psi_{k+l} - \delta_k)^2\right\}.$$

The first term is independent of the choice of the $\{\delta_k\}$ and the second term is clearly minimized by choosing $\delta_k = \psi_{k+l}$, $k = 0, 1, 2, \ldots$, as expected. With this choice of $\{\delta_k\}$ the second term vanishes, and we have

$$\sigma^2(l) = E\{(X_{t+l} - X_t(l))^2\} = \sigma^2_\epsilon\sum_{k=0}^{l-1}\psi_k^2,$$

which is known as the $l$-step prediction variance.

When $l = 1$, $\delta_k = \psi_{k+1}$, and

$$X_t(1) = \delta_0\epsilon_t + \delta_1\epsilon_{t-1} + \delta_2\epsilon_{t-2} + \ldots = \psi_1\epsilon_t + \psi_2\epsilon_{t-1} + \psi_3\epsilon_{t-2} + \ldots,$$

$$X_{t+1} = \psi_0\epsilon_{t+1} + \psi_1\epsilon_t + \psi_2\epsilon_{t-1} + \ldots,$$

so that

$$X_{t+1} - X_t(1) = \psi_0\epsilon_{t+1} = \epsilon_{t+1}, \qquad\text{since } \psi_0 = 1.$$

Hence $\epsilon_{t+1}$ can be thought of as the "one-step prediction error". Also, of course,

$$X_{t+1} = X_t(1) + \epsilon_{t+1},$$

so that $\epsilon_{t+1}$ is the essentially "new" part of $X_{t+1}$ which is not linearly dependent on past observations. The sequence $\{\epsilon_t\}$ is often called the innovations process of $\{X_t\}$, and $\sigma^2_\epsilon$ is often called the innovations variance.

If we wish to write $X_t(l)$ explicitly as a function of $X_t, X_{t-1}, \ldots$ then we note first that

$$X_t(l) = \sum_{k=0}^{\infty}\delta_k\epsilon_{t-k} = \sum_{k=0}^{\infty}\psi_{k+l}\epsilon_{t-k},$$

so that

$$X_t(l) = \Psi^{(l)}(B)\epsilon_t, \quad\text{say, where}\quad \Psi^{(l)}(z) = \sum_{k=0}^{\infty}\psi_{k+l}z^k.$$

Assuming that $\Psi(z)$ is analytic in and on the unit circle (stationary and invertible), we can write

$$X_t = \Psi(B)\epsilon_t \qquad\text{and}\qquad \epsilon_t = \Psi^{-1}(B)X_t,$$

and thus

$$X_t(l) = \Psi^{(l)}(B)\epsilon_t = \Psi^{(l)}(B)\Psi^{-1}(B)X_t = G^{(l)}(B)X_t, \quad\text{say, with}\quad G^{(l)}(z) = \Psi^{(l)}(z)\Psi^{-1}(z).$$

If we consider the sequence of predictors $X_t(l)$ for different values of $t$ (with $l$ fixed) then this forms a new process which, since $X_t(l) = G^{(l)}(B)X_t$, may be regarded as the output of a linear filter acting on the $\{X_t\}$. Since

$$X_t(l) = \left(\sum_u g^{(l)}_u B^u\right)X_t = \sum_u g^{(l)}_u X_{t-u},$$

we know that the transfer function is

$$G^{(l)}(f) = \sum_u g^{(l)}_u e^{-i2\pi fu}.$$

Example: AR(1)

$$X_t - \phi_{1,1}X_{t-1} = \epsilon_t, \qquad |\phi_{1,1}| < 1.$$

Then

$$X_t = (1 - \phi_{1,1}B)^{-1}\epsilon_t,$$

so

$$\Psi(z) = 1 + \phi_{1,1}z + \phi_{1,1}^2z^2 + \ldots = \psi_0 + \psi_1z + \psi_2z^2 + \ldots,$$

i.e., $\psi_k = \phi_{1,1}^k$. Hence,

$$X_t(l) = \sum_{k=0}^{\infty}\delta_k\epsilon_{t-k} = \sum_{k=0}^{\infty}\psi_{k+l}\epsilon_{t-k} = \sum_{k=0}^{\infty}\phi_{1,1}^{k+l}\epsilon_{t-k} = \phi_{1,1}^l\sum_{k=0}^{\infty}\phi_{1,1}^k\epsilon_{t-k} = \phi_{1,1}^lX_t.$$

The $l$-step prediction variance is

$$\sigma^2(l) = \sigma^2_\epsilon\left(\sum_{k=0}^{l-1}\psi_k^2\right) = \sigma^2_\epsilon\left(\sum_{k=0}^{l-1}\phi_{1,1}^{2k}\right) = \sigma^2_\epsilon\,\frac{1 - \phi_{1,1}^{2l}}{1 - \phi_{1,1}^2}.$$

Alternatively,

$$X_t(l) = G^{(l)}(B)X_t, \qquad\text{with}\qquad G^{(l)}(z) = \Psi^{(l)}(z)\Psi^{-1}(z).$$

But

$$\Psi^{(l)}(z) = \sum_{k=0}^{\infty}\psi_{k+l}z^k = \sum_{k=0}^{\infty}\phi_{1,1}^{k+l}z^k, \qquad\text{and}\qquad \Psi^{-1}(z) = 1 - \phi_{1,1}z,$$

so that

$$G^{(l)}(z) = (\phi_{1,1}^l + \phi_{1,1}^{l+1}z + \phi_{1,1}^{l+2}z^2 + \ldots)\times(1 - \phi_{1,1}z) = \phi_{1,1}^l,$$

i.e., $X_t(l) = \phi_{1,1}^lX_t$ as before.

We have demonstrated that for the AR(1) model the linear least squares predictor of $X_{t+l}$ depends only on the most recent observation, $X_t$, and does not involve $X_{t-1}, X_{t-2}, \ldots$, which is what we would expect bearing in mind the Markov nature of the AR(1) model. As $l \to \infty$, $X_t(l) \to 0$, since $X_t(l) = \phi_{1,1}^lX_t$ and $|\phi_{1,1}| < 1$. Also, the $l$-step prediction variance satisfies

$$\sigma^2(l) \to \frac{\sigma^2_\epsilon}{1 - \phi_{1,1}^2} = \mathrm{var}\{X_t\}.$$

In fact the solution to the forecasting problem for the AR(1) model can be derived directly from the difference equation $X_t - \phi_{1,1}X_{t-1} = \epsilon_t$ by setting future innovations $\epsilon_t$ to zero:

$$X_t(1) = \phi_{1,1}X_t + 0,$$
$$X_t(2) = \phi_{1,1}X_t(1) + 0,$$
$$\vdots$$
$$X_t(l) = \phi_{1,1}X_t(l-1) + 0,$$

so that $X_t(l) = \phi_{1,1}^lX_t$.
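A two-line sketch of the AR(1) forecast and its prediction variance (the numerical values of $\phi$, $\sigma^2_\epsilon$, $X_t$ and $l$ below are illustrative, not from the notes):

    # X_t(l) = phi^l X_t, with sigma^2(l) = sigma_eps^2 (1 - phi^{2l}) / (1 - phi^2)
    def ar1_forecast(x_t, phi, sigma2_eps, l):
        forecast = phi ** l * x_t
        pred_var = sigma2_eps * (1 - phi ** (2 * l)) / (1 - phi ** 2)
        return forecast, pred_var

    print(ar1_forecast(x_t=1.5, phi=0.8, sigma2_eps=1.0, l=3))
    # As l grows, the forecast shrinks to 0 and pred_var approaches var{X_t}.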


For general AR(p) processes it turns out that $X_t(l)$ depends only on the last $p$ observed values of $\{X_t\}$, and may be obtained by solving the AR(p) difference equation with the future $\{\epsilon_t\}$ set to zero. For example, for an AR(p) process and $l = 1$,

$$X_t(1) = \phi_{1,p}X_t + \ldots + \phi_{p,p}X_{t-p+1}.$$

Example: ARMA(1,1)

$$(1 - \phi_{1,1}B)X_t = (1 - \theta_{1,1}B)\epsilon_t.$$

Take $\phi_{1,1} = \phi$ and $\theta_{1,1} = \theta$, so

$$X_t = \frac{1 - \theta B}{1 - \phi B}\,\epsilon_t = \Psi(B)\epsilon_t.$$

So

$$\Psi(z) = (1 - \theta z)(1 + \phi z + \phi^2z^2 + \phi^3z^3 + \ldots) = 1 + (\phi - \theta)z + \phi(\phi - \theta)z^2 + \ldots + \phi^{l-1}(\phi - \theta)z^l + \ldots = \psi_0 + \psi_1z + \psi_2z^2 + \ldots,$$

and hence

$$\psi_l = \begin{cases}1, & l = 0,\\ \phi^{l-1}(\phi - \theta), & l \ge 1.\end{cases}$$

The $l$-step prediction variance is

$$\sigma^2(l) = \sigma^2_\epsilon\left(\sum_{k=0}^{l-1}\psi_k^2\right) = \sigma^2_\epsilon\left(1 + \sum_{k=1}^{l-1}\psi_k^2\right) = \sigma^2_\epsilon\left(1 + (\phi - \theta)^2\sum_{k=1}^{l-1}\phi^{2k-2}\right) = \sigma^2_\epsilon\left(1 + (\phi - \theta)^2\,\frac{1 - \phi^{2l-2}}{1 - \phi^2}\right).$$

Now,

$$\Psi^{(l)}(z) = \sum_{k=0}^{\infty}\psi_{k+l}z^k = \phi^{l-1}(\phi - \theta)\sum_{k=0}^{\infty}\phi^kz^k = \phi^{l-1}(\phi - \theta)(1 - \phi z)^{-1},$$

and

$$\Psi^{-1}(z) = \frac{1 - \phi z}{1 - \theta z}.$$

So

$$G^{(l)}(z) = \Psi^{(l)}(z)\Psi^{-1}(z) = \phi^{l-1}(\phi - \theta)(1 - \theta z)^{-1},$$

and

$$X_t(l) = G^{(l)}(B)X_t = \phi^{l-1}(\phi - \theta)(1 - \theta B)^{-1}X_t.$$

Consider $l = 1$:

$$X_t(1) = (\phi - \theta)(1 - \theta B)^{-1}X_t = (\phi - \theta)(1 + \theta B + \theta^2B^2 + \ldots)X_t = (\phi - \theta)X_t + \theta(\phi - \theta)X_{t-1} + \theta^2(\phi - \theta)X_{t-2} + \ldots$$

$$= \phi X_t - \theta\left[X_t - (\phi - \theta)X_{t-1} - \theta(\phi - \theta)X_{t-2} - \ldots - \theta^{k-1}(\phi - \theta)X_{t-k} - \ldots\right].$$

But consider

$$\epsilon_t = \Psi^{-1}(B)X_t = (1 - \phi B)(1 - \theta B)^{-1}X_t = (1 - \phi B)(1 + \theta B + \theta^2B^2 + \theta^3B^3 + \ldots)X_t = X_t - (\phi - \theta)X_{t-1} - \theta(\phi - \theta)X_{t-2} - \ldots - \theta^{k-1}(\phi - \theta)X_{t-k} - \ldots.$$

Therefore,

$$X_t(1) = \phi X_t - \theta\epsilon_t.$$

So $X_t(1)$ can again be derived directly from the difference equation

$$X_t = \phi X_{t-1} - \theta\epsilon_{t-1} + \epsilon_t,$$

by setting future innovations $\epsilon_t$ to zero.

MA(1) (invertible)

$$X_t = \epsilon_t - \theta_{1,1}\epsilon_{t-1}, \qquad |\theta_{1,1}| < 1.$$

So

$$\Psi(z) = \psi_0 + \psi_1z + \psi_2z^2 + \ldots = 1 - \theta_{1,1}z.$$

Hence $\psi_0 = 1$, $\psi_1 = -\theta_{1,1}$ and $\psi_k = 0$ for $k \ge 2$. Then

$$X_t(l) = \sum_{k=0}^{\infty}\psi_{k+l}\epsilon_{t-k} = \Psi^{(l)}(B)\epsilon_t = \psi_l\epsilon_t + \psi_{l+1}\epsilon_{t-1} + \ldots,$$

so

$$\Psi^{(l)}(z) = \sum_{k=0}^{\infty}\psi_{k+l}z^k = \psi_lz^0 + \psi_{l+1}z^1 = \begin{cases}-\theta_{1,1}, & l = 1,\\ 0, & l \ge 2.\end{cases}$$

Hence

$$G^{(l)}(z) = \Psi^{(l)}(z)\Psi^{-1}(z) = \begin{cases}-\theta_{1,1}(1 - \theta_{1,1}z)^{-1}, & l = 1,\\ 0, & l \ge 2.\end{cases}$$

Thus, for $l = 1$,

$$G^{(1)}(z) = -\theta_{1,1}(1 + \theta_{1,1}z + \theta_{1,1}^2z^2 + \ldots),$$

and hence

$$X_t(1) = G^{(1)}(B)X_t = -\sum_{k=0}^{\infty}\theta_{1,1}^{k+1}X_{t-k}.$$

Forecast errors and updating

We have seen that when $\delta_k = \psi_{k+l}$ the forecast error is $\sum_{k=0}^{l-1}\psi_k\epsilon_{t+l-k}$.

Let

$$e_t(l) = X_{t+l} - X_t(l) = \sum_{k=0}^{l-1}\psi_k\epsilon_{t+l-k}.$$

Then

$$e_t(l+m) = \sum_{j=0}^{l+m-1}\psi_j\epsilon_{t+l+m-j}.$$

Clearly, $E\{e_t(l)\} = E\{e_t(l+m)\} = 0$. Hence,

$$\mathrm{cov}\{e_t(l), e_t(l+m)\} = E\{e_t(l)e_t(l+m)\} = \sigma^2_\epsilon\sum_{k=0}^{l-1}\psi_k\psi_{k+m},$$

and

$$\mathrm{var}\{e_t(l)\} = \sigma^2_\epsilon\sum_{k=0}^{l-1}\psi_k^2 = \sigma^2(l).$$

For example, $\mathrm{cov}\{e_t(1), e_t(2)\} = \sigma^2_\epsilon\psi_1$. This could be quite large: should the forecast for a series wander off target, it is possible for it to remain there in the short run, since forecast errors can be quite highly correlated. Hence, when $X_{t+1}$ becomes available we should update the forecast. We have

$$X_{t+1}(l) = \sum_{k=0}^{\infty}\psi_{k+l}\epsilon_{t+1-k} = \psi_l\epsilon_{t+1} + \psi_{l+1}\epsilon_t + \psi_{l+2}\epsilon_{t-1} + \ldots,$$

but

$$X_t(l+1) = \sum_{k=0}^{\infty}\psi_{k+l+1}\epsilon_{t-k} = \psi_{l+1}\epsilon_t + \psi_{l+2}\epsilon_{t-1} + \psi_{l+3}\epsilon_{t-2} + \ldots,$$

and so

$$X_{t+1}(l) = X_t(l+1) + \psi_l\epsilon_{t+1} = X_t(l+1) + \psi_l(X_{t+1} - X_t(1)).$$

Hence, to forecast $X_{t+l+1}$ we can modify the $(l+1)$-step ahead forecast made at time $t$, producing an $l$-step ahead forecast at time $t+1$ using $X_{t+1}$ as it becomes available.
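The updating rule is a one-liner; the sketch below (illustrative names and AR(1) values, not from the notes) checks it against the direct AR(1) forecast, for which $\psi_l = \phi^l$:

    # X_{t+1}(l) = X_t(l+1) + psi_l * (X_{t+1} - X_t(1))
    def update_forecast(forecast_l_plus_1, psi_l, x_new, one_step_forecast):
        return forecast_l_plus_1 + psi_l * (x_new - one_step_forecast)

    phi, l, x_t, x_new = 0.8, 2, 1.0, 1.1            # illustrative AR(1) setting
    updated = update_forecast(phi ** (l + 1) * x_t,  # X_t(l+1) = phi^{l+1} X_t
                              phi ** l,              # psi_l = phi^l
                              x_new,
                              phi * x_t)             # X_t(1) = phi X_t
    print(updated, phi ** l * x_new)                 # both equal X_{t+1}(l)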


Figure 12: Top left: Oxford temperature data. Top right: residuals from least-squares fit. Bottom left: detrended. Bottom right: detrended and deseasonalised.

Figure 13: Top: purely continuous spectrum and a corresponding simulated sample path. Bottom: line spectrum + sample path.

Figure 14: Continuation of Figure 13 with plots of the corresponding sample autocorrelations.