
Time Series

    Piotr Fryzlewicz

    Department of Statistics

    [email protected]

    http://stats.lse.ac.uk/fryzlewicz/

    November 10, 2009

    1 Motivation

    Time series are measurements of a quantity xt, taken repeatedly over a certain period of

    time.

    • The quantity xt can be a scalar, but it can also be a vector, or a more complex object

    such as an image or a network.

    • The time index t can be continuous (when xt is observed continuously), discrete and

    equally spaced (when xt is measured at discrete time intervals, e.g. every day or every

    month), or have a more complex form (think of an experiment which needs close

    supervision at the beginning, but can later be observed less frequently).

    Time series arise in many sciences, or more generally in many “domains of human endeav-

    our”. We first look at some examples of time series, before moving on to describe the branch

    of statistics called Time Series Analysis.

1.1 Examples of time series

    1. Finance. We look at two examples related to finance.

    (a) Time series of the daily exchange rate between the British Pound and the US

    Dollar, from 3 Jan 2000 to 8 May 2009. Plot in Figure 1. Source of the data:

    http://www.federalreserve.gov/releases/h10/Hist/. In this example, t

    takes discrete (daily) values which we have numbered from 1 to 2440. Note

    that plotting the values of a scalar-valued time series is often the most natural

    way of visualising such a dataset. We denote this time series by xt; it will be

    used again in the next example. The values of xt do not change much from day

    to day, but over time, clear trends are formed. Note, for example, the strong

    negative trend starting around t = 2000 (which corresponds to August 2008),

    associated, perhaps, with the start of the “credit crunch” in the UK.

Figure 1: Daily USD/GBP exchange rate, from 03/01/2000 to 08/05/2009.

(b) Time series of the daily increments of the exchange rate between the British

    Pound and the US Dollar, over the same time period. Plot in Figure 2. Note

    that this time series is simply yt = xt − xt−1. Contrary to xt, there are no clear

    trends in the mean of yt, which oscillates around zero. However, its variance

    changes over time.

Figure 2: Increments in daily USD/GBP exchange rate, from 03/01/2000 to 08/05/2009.

    2. Economics. Time series of yearly percentage increments in inflation-adjusted GDP

    per capita of China, 1951–2007. Data from: http://www.gapminder.org. See Figure

    3. The series is short, the values are mostly positive and appear to oscillate around a

    flat but increasing trend.

    3. Social Sciences. Time series of female labour force in Hong Kong, as percentage of

total labour force, 1980–2005. Data from: http://www.gapminder.org. See Figure

    4. The series is not only short but has clear trends and does not appear too “random”.

Figure 3: Yearly percentage increments in inflation-adjusted GDP per capita of China, 1951–2007.

Figure 4: Female labour force in Hong Kong, as percentage of total labour force, 1980–2005.

    4. Environment. Time series of monthly mean maximum temperatures, recorded in

    Oxford between January 1900 and December 2008. Data from:

    http://www.metoffice.gov.uk/climate/uk/stationdata/index.html. See Fig-

    ure 5. The yearly periodicity is very pronounced, as expected. Might there be a

slight upward trend towards the end of the series? Anything to do with “global
warming”?

Figure 5: Monthly mean maximum temperatures in Oxford between 01/1900 and 12/2008.

    5. Engineering. Speech signal (digitised acoustic sound wave) representing the word

    “Piotr” (my first name) recorded using the wavrecord command in Matlab. Plot in

    Figure 6. Both the amplitude and the frequency of the signal change over time.

    This list only mentions a few out of many, many possible domains in which time series arise.

    Rather than focusing on the different domains, we will mention a few possible types of time

    series that we often observe in practice.

    Note that all of the above examples are univariate, i.e. they are measurements of a single

    scalar quantity. Often, we are faced with multivariate time series, in which a number of

    (often related) quantities are measured simultaneously. For example, an EEG recording

    measures electrical activity simultaneously in a number of locations around the scalp. Also,

Figure 6: The word “Piotr”.

    any of the univariate time series above could be a component of a multivariate time series.

For instance, rather than observing a single price process as in Example 1, market
professionals often trace prices in a number of markets simultaneously, with the purpose of e.g.
portfolio construction.

    Finally, we note that any video can be viewed as an image-valued time series. On the other

    hand, the evolution of e.g. facebook (as a graph) over time can be viewed as a network- (or

    graph-)valued time series.

    1.2 Statistical time series analysis

    Scientists and analysts are interested in a variety of different questions / issues when faced

    with time series data.

    For example, one pertinent question in finance and economics / econometrics is that of

forecasting future values. This can be done, for example, for the purpose of potential gain

    (e.g. in hedge funds or investment banks) or planning for the future (e.g. when should I

    buy a house?).

    Another common aim is to understand and be able to summarise time series data.

    Again, the underlying “science” question differs from one example to another. For example,

    in the analysis of EEG recordings, how can we decide whether the subject is “healthy” or

    not? Or how can we decide if the labour market in HK (as in example 3 above) has been

    evolving “significantly differently” from that, say, in Singapore? Or, more generally, given

    a time series, how can we describe and summarise the mechanics of its evolution?

    Finally, another frequent objective is to be able to control the evolution of a time series.

    This is not quite the same as forecasting, where we do not intervene in the process in

    any way. As an example of the control problem, consider the global temperature data: is

    “global warming” really happening, and if so, what impacts the temperature and how can

    we eliminate or suppress those factors?

    As expected, there are often a variety of ways in which those questions can be answered,

    and many of them do not formally involve statistics at all: for example, people often debate

    expected trends in house prices and investment opportunities, express their informal views

    about global warming, or use techniques originating from computer science (e.g. pattern

    recognition) to aid medical diagnosis in neuroscience. So do we need statistics in time series

    analysis?

The answer is “not necessarily”, but there are good arguments why the statistical approach
may often be very useful.

    1. Firstly, even those informal approaches to time series analysis are in fact often statis-

    tical in nature, sometimes in a “hidden” way: for example, people’s subjective views

    about time series can often be formally formulated as priors in Bayesian statistics,

    and informal forecasts which we often encounter in the media are in fact often in-

stances of simple statistical forecasting procedures, such as trend extrapolation. Also,

frequently, techniques originating in computer science (such as neural networks, machine
learning, pattern recognition, artificial intelligence) have their counterparts
in statistics, which do exactly the same thing but are named differently.

    2. The above-mentioned tasks: forecasting, understanding the structure of time se-

    ries, as well as time series control, have inherent uncertainty about them, which

    makes probability and statistics a natural tool for describing them. We briefly discuss

    those issues in turn.

    (a) Most real-life time series are so complex that accurate forecasting is impossible.

    For example, rather than saying “tomorrow’s value of Xt will be exactly 2.745”

    it often makes more sense to say “tomorrow’s value will be around 2.745”, but

    then there is a chance that we will still be wrong, so our forecasts, even those

    informal ones, will often be of the form “tomorrow’s value will probably be around

    2.745”, which is already in the territory of probability, since it contains a natural

    statement of uncertainty. We will find that probability and statistics provide a

    natural and simple language to express forecasts and their associated uncertainty.

    (b) Again, with the complexity of time series data, it is often impossible to build

    exact deterministic models describing their structure. Indeed, if we had a

    correct deterministic model, we would be able to predict the evolution of the

    time series exactly, but since we are not able to do that, it means that we do

    not have the exact model! Often, probabilistic models make more sense: for

    example, “tomorrow’s value is about a half of today’s value plus a term which

    is best described as random, i.e. there is no clear pattern in its values from one

    day to another”. In this way, we have a simple model for the evolution of the

    time series, but again we are in the territory of randomness, i.e. probability and

    statistics.

(c) In the issue of time series control, one natural task that often needs to be

    performed is to understand what factors affect the evolution of the series. But

    this is often impossible to specify exactly: it is unlikely that any one factor (out

    of the ones we are considering), or indeed their combination, is fully responsible

    for the evolution of the time series. Therefore, again, a statistical approach,

    where we permit uncertainty by building a statistical model, might be of use.

    For example, if we suspect that there is a link between pollution and global

    warming, it might be helpful to build a statistical model in which we will be able

    to test this hypothesis, and answer questions like: “how sure are we that there is

    correspondence between pollution and global warming”, or “what is the strength

    of this relationship”.

We will use this discussion as a motivation for studying the statistical approach to the
analysis of time series, starting from the next section.

    1.3 Simple descriptive technique: autocorrelation

    Denote by xt the Oxford temperature series from example 4 above. One of the main

    characteristics of time series data is dependence between observations at different lags: i.e.

    often, there is a relationship between observations separated by a lag k. For numerical data,

    we can illustrate this using scatter plots. As an example, consider a scatter plot of xt+6

    against xt, as t varies from 1 to N − 6, where N is the length of xt. It is shown in Figure

    7. As expected, there is a clear negative dependence between temperatures separated by 6

    months, due to seasonality.

    Similar scatter plots could be created for k = 1, 2, 3, 4, 5, 7, . . ., but they are unwieldy.

    Suppose we make the assumption that a linear relationship holds approximately between

    xt+k and xt for all k, i.e.,

    xt+k = αk + βkxt + εt+k

Figure 7: Plot of xt+6 against xt.

where εt+k is an error term. Then we can use as a summary statistic a measure of the
strength of the linear relationship between two variables {yt} and {zt} say, namely the
Pearson product moment correlation coefficient

\hat{\rho} = \frac{\sum_t (y_t - \bar{y})(z_t - \bar{z})}{\sqrt{\sum_t (y_t - \bar{y})^2 \sum_t (z_t - \bar{z})^2}},

where ȳ and z̄ are the sample means. Hence if yt = xt+k and zt = xt we are led to the lag
k sample autocorrelation for a time series:

\hat{\rho}_k = \frac{\sum_{t=1}^{N-k} (x_{t+k} - \bar{x})(x_t - \bar{x})}{\sum_{t=1}^{N} (x_t - \bar{x})^2},

with ρ̂0 = 1. The sequence {ρ̂k} is called the sample autocorrelation sequence (sample acs)
for the time series. The sample acs for the Oxford temperature data is given in Figure 8.
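The sample acs is straightforward to compute directly from its definition. Below is a minimal sketch in Python (numpy assumed; the function name and the synthetic "seasonal" series are illustrative, not part of the notes). It mirrors the formula above: the numerator sum runs over N − k terms, the denominator is the full sum of squares.

import numpy as np

def sample_acs(x, max_lag):
    """Sample autocorrelation sequence rho_hat_k, k = 0, ..., max_lag."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    d = x - x.mean()
    denom = np.sum(d ** 2)                          # sum_{t=1}^{N} (x_t - xbar)^2
    acs = np.empty(max_lag + 1)
    for k in range(max_lag + 1):
        acs[k] = np.sum(d[k:] * d[:n - k]) / denom  # lag-k numerator over the denominator
    return acs

# Illustrative use on a synthetic period-12 series plus noise:
rng = np.random.default_rng(0)
t = np.arange(1308)
x = 15 + 10 * np.cos(2 * np.pi * t / 12) + rng.normal(0, 2, size=t.size)
rho = sample_acs(x, 30)
print(rho[6], rho[12])   # negative at lag 6, positive at lag 12, as in Figure 8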

Figure 8: Sample autocorrelation for xt.

    Note e.g., that xt and xt+6 are negatively correlated, while xt and xt+12 are positively

    correlated (consistent with the yearly temperature cycle).

    In Section 1.2, we thoroughly justified using probabilistic models in time series analysis.

    Consistent with this reasoning, we regard x1, . . . , xN as a realization of the corresponding

    random variables X1, . . . ,XN . The quantity ρ̂k is an estimate of a corresponding population

    quantity called the lag k theoretical autocorrelation, defined as

\rho_k = \frac{E\{(X_t - \mu)(X_{t+k} - \mu)\}}{\sigma^2},

where E{·} is the expectation operator, µ = E{Xt} is the population mean, and σ² =
E{(Xt − µ)²} is the corresponding population variance. (Note that ρk, µ and σ² do not

    depend on ‘t’ here. As we shall see soon, models for which this is true play a central role in

    TSA and are called stationary).

2 Real-valued discrete time stationary processes

    Denote the process by {Xt}. For fixed t, Xt is a random variable (r.v.), and hence there is

    an associated cumulative probability distribution function (cdf):

    Ft(a) = P(Xt ≤ a),

and

E\{X_t\} = \int_{-\infty}^{\infty} x \, dF_t(x) \equiv \mu_t, \qquad
\mathrm{var}\{X_t\} = \int_{-\infty}^{\infty} (x - \mu_t)^2 \, dF_t(x).

    But we are interested in the relationships between the various r.v.s that form the process.

    For example, for any t1 and t2 ∈ T ,

    Ft1,t2(a1, a2) = P(Xt1 ≤ a1,Xt2 ≤ a2)

    gives the bivariate cdf. More generally for any t1, t2, . . . , tn ∈ T ,

    Ft1,t2,...,tn(a1, a2, . . . , an) = P(Xt1 ≤ a1, . . . ,Xtn ≤ an)

    Stationarity

    The class of all stochastic processes is too large to work with in practice. We consider

only the subclass of stationary processes (later, if time permits, we will discuss also some

    subclasses of non-stationary processes).

    COMPLETE/STRONG/STRICT stationarity

    {Xt} is said to be completely stationary if, for all n ≥ 1, for any t1, t2, . . . , tn ∈ T , and for

any τ such that t1 + τ, t2 + τ, . . . , tn + τ ∈ T are also contained in the index set, the joint

    cdf of {Xt1 ,Xt2 , . . . ,Xtn} is the same as that of {Xt1+τ ,Xt2+τ , . . . ,Xtn+τ} i.e.,

    Ft1,t2,...,tn(a1, a2, . . . , an) = Ft1+τ,t2+τ,...,tn+τ (a1, a2, . . . , an),

    so that the probabilistic structure of a completely stationary process is invariant under a

    shift in time.

    SECOND-ORDER/WEAK/COVARIANCE stationarity

    {Xt} is said to be second-order stationary if, for all n ≥ 1, for any t1, t2, . . . , tn ∈ T , and

    for any τ such that t1 + τ, t2 + τ, . . . , tn + τ ∈ T are also contained in the index set, all the

    joint moments of orders 1 and 2 of {Xt1 ,Xt2 , . . . ,Xtn} exist, are finite, and equal to the

    corresponding joint moments of {Xt1+τ ,Xt2+τ , . . . ,Xtn+τ}. Hence,

E\{X_t\} \equiv \mu; \qquad \mathrm{var}\{X_t\} \equiv \sigma^2 \; (= E\{X_t^2\} - \mu^2),

    are constants independent of t. If we let τ = −t1,

    E{Xt1Xt2} = E{Xt1+τXt2+τ}

    = E{X0Xt2−t1},

    and with τ = −t2,

    E{Xt1Xt2} = E{Xt1+τXt2+τ}

    = E{Xt1−t2X0}.

    Hence, E{Xt1Xt2} is a function of the absolute difference |t2 − t1| only, similarly, for the

covariance between Xt1 and Xt2:

    cov{Xt1 ,Xt2} = E{(Xt1 − µ)(Xt2 − µ)} = E{Xt1Xt2} − µ2.

    For a discrete time second-order stationary process {Xt} we define the autocovariance se-

    quence (acvs) by

    sτ ≡ cov{Xt,Xt+τ} = cov{X0,Xτ}.

    Note,

    1. τ is called the lag.

    2. s0 = σ2 and s−τ = sτ .

3. The autocorrelation sequence (acs) is given by

\rho_\tau = \frac{s_\tau}{s_0} = \frac{\mathrm{cov}\{X_t, X_{t+\tau}\}}{\sigma^2}.

    4. By Cauchy-Schwarz inequality, |sτ | ≤ s0.

5. The sequence {sτ} is positive semidefinite, i.e., for all n ≥ 1, for any t1, t2, . . . , tn
contained in the index set, and for any set of nonzero real numbers a1, a2, . . . , an,

\sum_{j=1}^{n} \sum_{k=1}^{n} s_{t_j - t_k} a_j a_k \ge 0.

    Proof

Let

a = (a_1, a_2, \ldots, a_n)^T, \qquad V = (X_{t_1}, X_{t_2}, \ldots, X_{t_n})^T,

and let Σ be the variance-covariance matrix of V. Its (j, k)-th element is given by

s_{t_j - t_k} = E\{(X_{t_j} - \mu)(X_{t_k} - \mu)\}.

Define the r.v.

w = \sum_{j=1}^{n} a_j X_{t_j} = a^T V.

Then

0 \le \mathrm{var}\{w\} = \mathrm{var}\{a^T V\} = a^T \mathrm{var}\{V\} a = a^T \Sigma a = \sum_{j=1}^{n} \sum_{k=1}^{n} s_{t_j - t_k} a_j a_k.

6. The variance-covariance matrix of equispaced X's, (X_1, X_2, \ldots, X_N)^T, has the form

\begin{pmatrix}
s_0     & s_1     & \ldots & s_{N-2} & s_{N-1} \\
s_1     & s_0     & \ldots & s_{N-3} & s_{N-2} \\
\vdots  &         & \ddots &         & \vdots  \\
s_{N-2} & s_{N-3} & \ldots & s_0     & s_1     \\
s_{N-1} & s_{N-2} & \ldots & s_1     & s_0
\end{pmatrix},

which is known as a symmetric Toeplitz matrix – all elements on a diagonal are the
same. Note the matrix has only N unique elements, s_0, s_1, \ldots, s_{N-1}.

    7. A stochastic process {Xt} is called Gaussian if, for all n ≥ 1 and for any t1, t2, . . . , tncontained in the index set, the joint cdf of Xt1 ,Xt2 , . . . ,Xtn is multivariate Gaussian.

• 2nd-order stationary Gaussian ⇒ completely stationary (since the multivariate normal
distribution is completely characterized by its 1st and 2nd moments). Not true in general.

• Complete stationarity does not imply 2nd-order stationarity in general, since complete
stationarity does not require the first and second moments to be finite.

    2.1 Examples of discrete stationary processes

    [1] White noise process

    Also known as a purely random process. Let {Xt} be a sequence of uncorrelated r.v.s

such that

E\{X_t\} = \mu, \quad \mathrm{var}\{X_t\} = \sigma^2 \quad \forall t,

and

s_\tau = \begin{cases} \sigma^2 & \tau = 0 \\ 0 & \tau \neq 0 \end{cases}
\qquad \text{or} \qquad
\rho_\tau = \begin{cases} 1 & \tau = 0 \\ 0 & \tau \neq 0 \end{cases}

Such a process forms a basic building block in time series analysis. Very different realizations of white

    noise can be obtained for different distributions of {Xt}. Examples are given in Figure

    9.

Figure 9: Simulated realisations of white noise (top panel: Gaussian white noise; bottom panel: exponential white noise).
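Realisations like those in Figure 9 are easy to generate. A minimal sketch in Python (numpy assumed; the distributions, sample size and seed are illustrative): both series are white noise in the sense above – constant mean and variance, zero autocorrelation at non-zero lags – yet they look very different.

import numpy as np

rng = np.random.default_rng(1)
n = 250

# Gaussian white noise: mean 0, variance 1.
gaussian_wn = rng.normal(loc=0.0, scale=1.0, size=n)

# Exponential white noise: Exp(1) draws are uncorrelated with mean 1 and variance 1
# (white noise requires a constant mean, not necessarily a zero one).
exponential_wn = rng.exponential(scale=1.0, size=n)

for name, x in [("Gaussian", gaussian_wn), ("exponential", exponential_wn)]:
    lag1 = np.corrcoef(x[:-1], x[1:])[0, 1]
    print(f"{name}: mean={x.mean():.3f}, var={x.var():.3f}, lag-1 corr={lag1:.3f}")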

[2] q-th order moving average process MA(q)

Xt can be expressed in the form

X_t = \mu - \theta_{0,q}\epsilon_t - \theta_{1,q}\epsilon_{t-1} - \ldots - \theta_{q,q}\epsilon_{t-q}
    = \mu - \sum_{j=0}^{q} \theta_{j,q}\epsilon_{t-j},

where µ and the θ_{j,q}'s are constants (θ_{0,q} ≡ −1, θ_{q,q} ≠ 0), and {ǫt} is a zero-mean white
noise process with variance σ²_ǫ.

W.l.o.g. assume E{Xt} = µ = 0. Then cov{Xt, Xt+τ} = E{Xt Xt+τ}.
(Recall: cov(X, Y) = E{(X − E{X})(Y − E{Y})}.)

Since E{ǫt ǫt+τ} = 0 for all τ ≠ 0, we have for τ ≥ 0:

\mathrm{cov}\{X_t, X_{t+\tau}\} = \sum_{j=0}^{q} \sum_{k=0}^{q} \theta_{j,q}\theta_{k,q} E\{\epsilon_{t-j}\epsilon_{t+\tau-k}\}
= \sigma_\epsilon^2 \sum_{j=0}^{q-\tau} \theta_{j,q}\theta_{j+\tau,q} \quad (k = j + \tau)
\equiv s_\tau,

which does not depend on t. Since sτ = s−τ, {Xt} is a stationary process with acvs
given by

s_\tau = \begin{cases} \sigma_\epsilon^2 \sum_{j=0}^{q-|\tau|} \theta_{j,q}\theta_{j+|\tau|,q} & |\tau| \le q \\ 0 & |\tau| > q \end{cases}

N.B. No restrictions were placed on the θ_{j,q}'s to ensure stationarity. (Though obviously,
|θ_{j,q}| < ∞.)

  • 0 50 100 150 200 250

    −50

    5

    0 50 100 150 200 250

    −50

    5

    Figure 10: Top: realisation of the MA(9) process Xt =∑9

    i=0 ǫt−i. Bottom: realisation ofthe MA(9) process Xt =

    ∑9i=0(−1)iǫt−i.
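The two MA(9) realisations of Figure 10 are easy to reproduce, and their theoretical acvs follows from the formula above. A minimal sketch in Python (numpy assumed; seed and sample size are arbitrary). Writing the process as X_t = Σ_{j=0}^{q} c_j ε_{t−j} (so c_j = −θ_{j,q}, c_0 = 1), the acvs is s_τ = σ²_ε Σ_j c_j c_{j+|τ|}, which is what ma_acvs computes.

import numpy as np

def ma_acvs(c, sigma2=1.0):
    """Theoretical acvs of X_t = sum_j c[j] * eps_{t-j}: s_tau = sigma2 * sum_j c_j c_{j+|tau|}."""
    c = np.asarray(c, dtype=float)
    q = len(c) - 1
    return np.array([sigma2 * np.sum(c[: q + 1 - k] * c[k:]) for k in range(q + 1)])

def simulate_ma(c, n, rng):
    eps = rng.normal(size=n + len(c) - 1)
    # X_t = sum_j c[j] * eps_{t-j}; np.convolve in "valid" mode does exactly this.
    return np.convolve(eps, c, mode="valid")

rng = np.random.default_rng(2)
c_plus  = np.ones(10)                   # X_t = sum_{i=0}^{9} eps_{t-i}      (top panel)
c_minus = (-1.0) ** np.arange(10)       # X_t = sum_{i=0}^{9} (-1)^i eps_{t-i} (bottom panel)

for c in (c_plus, c_minus):
    x = simulate_ma(c, 250, rng)
    print("theoretical s_0..s_3:", ma_acvs(c)[:4], " sample var:", round(float(x.var()), 2))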

Example: first-order moving average MA(1), X_t = \epsilon_t - \theta_{1,1}\epsilon_{t-1} (taking µ = 0).

acvs:

s_\tau = \sigma_\epsilon^2 \sum_{j=0}^{1-|\tau|} \theta_{j,1}\theta_{j+|\tau|,1}, \quad |\tau| \le 1,

so

s_0 = \sigma_\epsilon^2(\theta_{0,1}\theta_{0,1} + \theta_{1,1}\theta_{1,1}) = \sigma_\epsilon^2(1 + \theta_{1,1}^2);

and

s_1 = \sigma_\epsilon^2 \theta_{0,1}\theta_{1,1} = -\sigma_\epsilon^2 \theta_{1,1}.

acs:

\rho_\tau = \frac{s_\tau}{s_0}, \qquad \rho_0 = 1.0, \quad \rho_1 = \frac{-\theta_{1,1}}{1 + \theta_{1,1}^2}.

    (a) θ1,1 = 1.0, σ2ǫ = 1.0,

    we have,

    s0 = 2.0, s1 = −1.0, s2, s3, . . . = 0.0,

    giving,

    ρ0 = 1.0, ρ1 = −0.5, ρ2, ρ3, . . . = 0.0.

    (b) θ1,1 = −1.0, σ2ǫ = 1.0,

    we have,

    s0 = 2.0, s1 = 1.0, s2, s3, . . . = 0.0,

    giving,

    ρ0 = 1.0, ρ1 = 0.5, ρ2, ρ3, . . . = 0.0.

Note: if we replace θ_{1,1} by θ_{1,1}^{-1} the model becomes

X_t = \epsilon_t - \frac{1}{\theta_{1,1}}\epsilon_{t-1}

and the autocorrelation becomes

\rho_1 = \frac{-\frac{1}{\theta_{1,1}}}{1 + \left(\frac{1}{\theta_{1,1}}\right)^2} = \frac{-\theta_{1,1}}{\theta_{1,1}^2 + 1},

i.e., it is unchanged!

We cannot identify the MA(1) process uniquely from its autocorrelation.
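This non-identifiability is easy to verify numerically. A minimal sketch in Python (numpy assumed; the θ values are illustrative): for any θ, the MA(1) models with parameters θ and 1/θ give the same lag-1 autocorrelation.

import numpy as np

def ma1_rho1(theta):
    """Lag-1 autocorrelation of X_t = eps_t - theta * eps_{t-1}."""
    return -theta / (1.0 + theta ** 2)

for theta in (0.5, 2.0, -0.25, -4.0):
    print(theta, ma1_rho1(theta), ma1_rho1(1.0 / theta))
# Each pair (theta, 1/theta) produces an identical rho_1, so the acs cannot
# distinguish the two MA(1) parameterizations.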

[3] p-th order autoregressive process AR(p)

{Xt} is expressed in the form

X_t = \phi_{1,p}X_{t-1} + \phi_{2,p}X_{t-2} + \ldots + \phi_{p,p}X_{t-p} + \epsilon_t,

where φ_{1,p}, φ_{2,p}, . . . , φ_{p,p} are constants (φ_{p,p} ≠ 0) and {ǫt} is a zero mean white
noise process with variance σ²_ǫ. In contrast to the parameters of an MA(q) process, the
{φ_{k,p}} must satisfy certain conditions for {Xt} to be a stationary process – i.e., not
all AR(p) processes are stationary (more later).

Some examples are in Figure 11.

Further examples

X_t = \phi_{1,1}X_{t-1} + \epsilon_t \qquad \text{AR(1) -- Markov process} \qquad (1)
    = \phi_{1,1}\{\phi_{1,1}X_{t-2} + \epsilon_{t-1}\} + \epsilon_t
    = \phi_{1,1}^2 X_{t-2} + \phi_{1,1}\epsilon_{t-1} + \epsilon_t
    = \phi_{1,1}^3 X_{t-3} + \phi_{1,1}^2\epsilon_{t-2} + \phi_{1,1}\epsilon_{t-1} + \epsilon_t
    \;\vdots
    = \sum_{k=0}^{\infty} \phi_{1,1}^k \epsilon_{t-k},

where we take the initial condition X_{-N} = 0 and let N → ∞.

Figure 11: Top: realisation of the AR(2) process X_t = 0.5X_{t-1} + 0.2X_{t-2} + \epsilon_t. Bottom:
realisation of the AR(2) process X_t = 0.5X_{t-1} - 0.2X_{t-2} + \epsilon_t.

Note E{Xt} = 0.

\mathrm{var}\{X_t\} = \mathrm{var}\left\{\sum_{k=0}^{\infty} \phi_{1,1}^k \epsilon_{t-k}\right\}
= \sum_{k=0}^{\infty} \mathrm{var}\{\phi_{1,1}^k \epsilon_{t-k}\} = \sigma_\epsilon^2 \sum_{k=0}^{\infty} \phi_{1,1}^{2k}.

For var{Xt} < ∞ we require |φ_{1,1}| < 1. For τ > 0, X_{t-τ} is a linear function of
ǫ_{t-τ}, ǫ_{t-τ-1}, . . . and is therefore uncorrelated with ǫ_t. Hence

E\{\epsilon_t X_{t-\tau}\} = 0,

so, assuming stationarity and multiplying the defining equation (1) by X_{t-τ}:

X_t X_{t-\tau} = \phi_{1,1} X_{t-1} X_{t-\tau} + \epsilon_t X_{t-\tau}
\;\Rightarrow\; E\{X_t X_{t-\tau}\} = \phi_{1,1} E\{X_{t-1} X_{t-\tau}\},

i.e., s_\tau = \phi_{1,1} s_{\tau-1} = \phi_{1,1}^2 s_{\tau-2} = \ldots = \phi_{1,1}^\tau s_0
\;\Rightarrow\; \rho_\tau = \frac{s_\tau}{s_0} = \phi_{1,1}^\tau.

But ρ_τ is an even function of τ, so

\rho_\tau = \phi_{1,1}^{|\tau|}, \qquad \tau = 0, \pm 1, \pm 2, \ldots

[Plots of the acs ρ_τ = φ^{|τ|} against lag τ = −5, . . . , 5 for X_t = 0.5X_{t-1} + \epsilon_t (φ = 0.5)
and X_t = -0.5X_{t-1} + \epsilon_t (φ = −0.5)] – exponential decay in |τ|, with alternating signs
when φ = −0.5.
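A quick simulation confirms that the sample acs of an AR(1) process decays like φ^|τ| (alternating in sign for negative φ). A minimal sketch in Python (numpy assumed; the burn-in length, sample size and seed are arbitrary choices).

import numpy as np

def simulate_ar1(phi, n, rng, burn=200):
    x = np.zeros(n + burn)
    eps = rng.normal(size=n + burn)
    for t in range(1, n + burn):
        x[t] = phi * x[t - 1] + eps[t]
    return x[burn:]                      # discard burn-in so start-up effects are negligible

def sample_acs(x, max_lag):
    d = x - x.mean()
    denom = np.sum(d ** 2)
    return np.array([np.sum(d[k:] * d[: len(x) - k]) / denom for k in range(max_lag + 1)])

rng = np.random.default_rng(3)
for phi in (0.5, -0.5):
    x = simulate_ar1(phi, 5000, rng)
    print(f"phi={phi}:")
    print("  sample acs:", sample_acs(x, 5).round(3))
    print("  phi^|tau| :", np.array([phi ** k for k in range(6)]).round(3))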

    [4] (p, q)’th order autoregressive-moving average process ARMA(p, q)

    Here {Xt} is expressed as

X_t = \phi_{1,p}X_{t-1} + \ldots + \phi_{p,p}X_{t-p} + \epsilon_t - \theta_{1,q}\epsilon_{t-1} - \ldots - \theta_{q,q}\epsilon_{t-q},

where the φ_{j,p}'s and the θ_{j,q}'s are all constants (φ_{p,p} ≠ 0; θ_{q,q} ≠ 0) and again {ǫt} is

    a zero mean white noise process with variance σ2ǫ . The ARMA class is important as

    many data sets may be approximated in a more parsimonious way (meaning fewer

    parameters are needed) by a mixed ARMA model than by a pure AR or MA process.

    In brief, time series analysts like MA and AR models for different reasons. MA models

    are appealing because they are easy to manipulate mathematically, e.g. as we saw, no

    restrictions on parameter values are needed to ensure stationarity. On the other hand,

    AR models are more convenient for forecasting, which we will see later. Obviously, the

    main criterion for whether a model is or isn’t useful is whether it performs well at our

    desired task, which will often be: modelling (or: understanding) the data, forecasting,

    or control.

    The ARMA model shares the best, and the worst, features of the AR and MA classes!

    Models for changing variance

    Objective: obtain better estimates of local variance in order to obtain a better assess-

    ment of risk.

    Note: not to get better estimates of the trend.

    [5] p’th order autoregressive conditionally heteroscedastic

    model ARCH(p)

    Assume we have a derived time series {Yt} that is (approximately) uncorrelated but

    has a variance (volatility) that changes through time,

    Yt = σtεt (2)

where {εt} is a white noise sequence with zero mean and unit variance. Here, σt represents
the local conditional standard deviation of the process. Note that σt is not
observable directly.

{Yt} is ARCH(p) if it satisfies equation (2) and

\sigma_t^2 = \alpha + \beta_{1,p} y_{t-1}^2 + \ldots + \beta_{p,p} y_{t-p}^2, \qquad (3)

where α > 0 and β_{j,p} ≥ 0, j = 1, . . . , p (to ensure the variance remains positive), and
y_{t-1} is the observed value of the derived time series at time (t − 1).

    Notes:

    (a) the absence of the error term in equation (3).

    (b) unconstrained estimation often leads to violation of the non-negativity constraints

    that are needed to ensure positive variance.

    (c) quadratic form (i.e. modelling σ2t ) prevents modelling of asymmetry in volatility

    (i.e. volatility tends to be higher after a decrease than after an equal increase

    and ARCH cannot account for this).

Example: ARCH(1)

\sigma_t^2 = \alpha + \beta_{1,1} y_{t-1}^2.

Define

v_t = y_t^2 - \sigma_t^2 \;\Rightarrow\; \sigma_t^2 = y_t^2 - v_t.

The model can also be written

y_t^2 = \alpha + \beta_{1,1} y_{t-1}^2 + v_t,

i.e. an AR(1) model for {y_t^2}, where the errors {v_t} have zero mean; but as
v_t = \sigma_t^2(\varepsilon_t^2 - 1), the errors are heteroscedastic.
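The ARCH(1) recursion is straightforward to simulate from equations (2) and (3). A minimal sketch in Python (numpy assumed; the values of α and β are illustrative): the simulated returns y_t are nearly uncorrelated, while y_t² shows clear positive autocorrelation, as the AR(1) representation above suggests.

import numpy as np

def simulate_arch1(alpha, beta, n, rng, burn=500):
    """y_t = sigma_t * eps_t with sigma_t^2 = alpha + beta * y_{t-1}^2, eps_t ~ N(0,1)."""
    y = np.zeros(n + burn)
    eps = rng.normal(size=n + burn)
    for t in range(1, n + burn):
        sigma2 = alpha + beta * y[t - 1] ** 2
        y[t] = np.sqrt(sigma2) * eps[t]
    return y[burn:]

def lag1_corr(x):
    return np.corrcoef(x[:-1], x[1:])[0, 1]

rng = np.random.default_rng(4)
y = simulate_arch1(alpha=0.2, beta=0.5, n=20000, rng=rng)
print("lag-1 corr of y   :", round(float(lag1_corr(y)), 3))      # close to 0
print("lag-1 corr of y^2 :", round(float(lag1_corr(y ** 2)), 3)) # clearly positive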

    [6] (p, q)’th order generalized autoregressive conditionally

    heteroscedastic model GARCH(p, q)

{Yt} is GARCH(p, q) if it satisfies equation (2) and

\sigma_t^2 = \alpha + \beta_{1,p} y_{t-1}^2 + \ldots + \beta_{p,p} y_{t-p}^2 + \gamma_{1,q}\sigma_{t-1}^2 + \ldots + \gamma_{q,q}\sigma_{t-q}^2,

    where the parameters are chosen to ensure positive variance. GARCH models were

    introduced because it was observed that the ARCH class does not account sufficiently

    well for the persistence of volatility in financial time series data; i.e. according to the

    ARCH model, the series y2t often has less (theoretical) autocorrelation than real data

    tend to have in practice.

    2.2 Trend removal and seasonal adjustment

    There are certain, quite common, situations where the observations exhibit a trend – a

tendency to increase or decrease slowly and steadily over time – or may fluctuate in a periodic

    manner due to seasonal effects. The model is modified to

    Xt = µt + Yt

    µt = time dependent mean.

    Yt = zero mean stationary process.

    Example Oxford temperature data, last 30 years. The data are plotted in the top-left

    plot of Figure 12.

    Model suggested by plot: Xt = α+ βt+ Yt.

    Trend adjustment

    At least two possible approaches:

(a) Estimate α and β by least squares, and work with the residuals

    Ŷt = Xt − α̂− β̂t.

    For the Oxford data these are shown in the top-right plot of Figure 12.

    (b) Take first differences:

X_t^{(1)} = X_t - X_{t-1} = \alpha + \beta t + Y_t - (\alpha + \beta(t-1) + Y_{t-1}) = \beta + Y_t - Y_{t-1}.

    For the Oxford temperature data these are shown in the bottom-left plot of Figure

    12.

Note: if {Yt} is stationary, so is {Y_t^{(1)}}.

In the case of linear trend, if we difference again:

X_t^{(2)} = X_t^{(1)} - X_{t-1}^{(1)} = (X_t - X_{t-1}) - (X_{t-1} - X_{t-2})
          = (\beta + Y_t - Y_{t-1}) - (\beta + Y_{t-1} - Y_{t-2})
          = Y_t - 2Y_{t-1} + Y_{t-2} \quad (\equiv Y_t^{(1)} - Y_{t-1}^{(1)} = Y_t^{(2)}),

so that the effect of µ_t (= α + βt) has been completely removed.

If µ_t is a polynomial of degree (d − 1) in t, then d-th differences of µ_t will be zero (d = 2 for
linear trend).
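Both trend-adjustment approaches are easy to carry out. A minimal sketch in Python (numpy assumed; the simulated linear-trend-plus-AR-noise series is illustrative): approach (a) regresses X_t on t and keeps the residuals, approach (b) simply takes first differences.

import numpy as np

rng = np.random.default_rng(5)
n = 360
t = np.arange(1, n + 1)

# Illustrative data: linear trend alpha + beta*t plus stationary AR(1) noise Y_t.
y = np.zeros(n)
for i in range(1, n):
    y[i] = 0.6 * y[i - 1] + rng.normal()
x = 10.0 + 0.05 * t + y

# (a) Least squares: fit X_t = alpha + beta*t and work with the residuals.
beta_hat, alpha_hat = np.polyfit(t, x, deg=1)    # polyfit returns highest power first
resid = x - (alpha_hat + beta_hat * t)

# (b) First differences: X_t - X_{t-1} = beta + Y_t - Y_{t-1}.
diff1 = np.diff(x)

print("estimated (alpha, beta):", round(float(alpha_hat), 2), round(float(beta_hat), 4))
print("mean of residuals      :", round(float(resid.mean()), 4))
print("mean of differences    :", round(float(diff1.mean()), 4), "(approx. beta)")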

Further,

X_t^{(d)} = \sum_{k=0}^{d} \binom{d}{k} (-1)^k X_{t-k} = \sum_{k=0}^{d} \binom{d}{k} (-1)^k Y_{t-k}.

There are other ways of writing this. Define the difference operator

\Delta = (1 - B),

where B X_t = X_{t-1} is the backward shift operator (sometimes known as the lag operator L
– especially in econometrics). Then,

X_t^{(d)} = \Delta^d X_t = \Delta^d Y_t.

For example, for d = 2:

X_t^{(2)} = (1 - B)^2 X_t = (1 - B)(X_t - X_{t-1})
          = (X_t - X_{t-1}) - (X_{t-1} - X_{t-2})
          = (\beta + Y_t - Y_{t-1}) - (\beta + Y_{t-1} - Y_{t-2})
          = (Y_t - Y_{t-1}) - (Y_{t-1} - Y_{t-2})
          = (1 - B)^2 Y_t = \Delta^2 Y_t.

This notation can be incorporated into the ARMA set-up. Recall, if {Xt} is ARMA(p, q),

X_t = \phi_{1,p}X_{t-1} + \ldots + \phi_{p,p}X_{t-p} + \epsilon_t - \theta_{1,q}\epsilon_{t-1} - \ldots - \theta_{q,q}\epsilon_{t-q},

X_t - \phi_{1,p}X_{t-1} - \ldots - \phi_{p,p}X_{t-p} = \epsilon_t - \theta_{1,q}\epsilon_{t-1} - \ldots - \theta_{q,q}\epsilon_{t-q},

(1 - \phi_{1,p}B - \phi_{2,p}B^2 - \ldots - \phi_{p,p}B^p)X_t = (1 - \theta_{1,q}B - \theta_{2,q}B^2 - \ldots - \theta_{q,q}B^q)\epsilon_t,

\Phi(B)X_t = \Theta(B)\epsilon_t,

where

\Phi(B) = 1 - \phi_{1,p}B - \phi_{2,p}B^2 - \ldots - \phi_{p,p}B^p,
\Theta(B) = 1 - \theta_{1,q}B - \theta_{2,q}B^2 - \ldots - \theta_{q,q}B^q

    are known as the associated or characteristic polynomials.

    Further, we can generalize the class of ARMA models to include differencing to account for

    certain types of non-stationarity, namely, Xt is called ARIMA(p, d, q) if

\Phi(B)(1 - B)^d X_t = \Theta(B)\epsilon_t, \quad \text{i.e.} \quad \Phi(B)\Delta^d X_t = \Theta(B)\epsilon_t.

    Seasonal adjustment

    The model is modified to

    Xt = st + Yt

    where

    st = seasonal component,

    Yt = zero mean stationary process.

    Presuming that the seasonal component maintains a constant pattern over time with period

    s, there are again several approaches to removing st. A popular approach used by Box &

Jenkins is to use the operator (1 - B^s):

X_t^{(s)} = (1 - B^s)X_t = X_t - X_{t-s} = (s_t + Y_t) - (s_{t-s} + Y_{t-s}) = Y_t - Y_{t-s},

since s_t has period s (and so s_{t-s} = s_t).

    The bottom-right plot of Figure 12 shows this technique applied to the Oxford temperature

    data – most of the seasonal structure and trend has been removed by applying the following

    differencing:

    (1−Bs)(1−B)Xt

    with s = 12.
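The combined differencing (1 − B^s)(1 − B) used for the Oxford series can be coded in a few lines. A minimal sketch in Python (numpy assumed; the simulated monthly series is illustrative): the lag-s difference removes the seasonal component and the additional first difference removes a linear trend.

import numpy as np

def seasonal_plus_first_difference(x, s=12):
    """Apply (1 - B^s)(1 - B) to the series x."""
    d1 = x[1:] - x[:-1]          # (1 - B) x_t
    return d1[s:] - d1[:-s]      # then (1 - B^s) applied to the first differences

rng = np.random.default_rng(6)
n = 12 * 30                                   # 30 years of monthly data
t = np.arange(n)
seasonal = 8 * np.cos(2 * np.pi * t / 12)     # period-12 seasonal component
trend = 0.01 * t
x = trend + seasonal + rng.normal(0, 1, n)

z = seasonal_plus_first_difference(x, s=12)
print("variance before:", round(float(x.var()), 2), " after:", round(float(z.var()), 2))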

    2.3 ARMA — stationarity, invertibility and autocorrelation

    For this section, assume that Φ(z) and Θ(z) have no common zeroes (for technical reasons).

    Stationarity

    Consider a general ARMA process

    Φ(B)Xt = Θ(B)ǫt.

    Note that the RHS, being an MA process, is always stationary. The following general result

    will help us determine when Xt itself is stationary, and find its autocovariance sequence.

Proposition 3.1.2, Brockwell & Davis. If {Yt} is a second-order stationary process with
autocovariance function s^Y_\tau and if \sum_{j=-\infty}^{\infty} |\psi_j| < \infty, then the process

X_t = \sum_{j=-\infty}^{\infty} \psi_j Y_{t-j}

is second-order stationary with autocovariance function

s^X_\tau = \sum_{j=-\infty}^{\infty}\sum_{k=-\infty}^{\infty} \psi_j \psi_k \, s^Y_{\tau+j-k}.

Proof. Brockwell & Davis, p. 84.

A general ARMA process can be represented as

X_t = \Phi^{-1}(B)\Theta(B)\epsilon_t.

The above proposition is telling us that if we can represent \Phi^{-1}(B) as \sum_{j=0}^{\infty} \phi_j B^j with
\sum_{j=0}^{\infty} |\phi_j| < \infty, then Xt is stationary. Such a representation exists if and only if all the
roots of Φ(z) lie outside the unit circle; similarly, the process is invertible (i.e. expressible as
an AR process, possibly of infinite order) if and only if all the roots of Θ(z) lie outside the
unit circle.

Example 1

Determine whether the MA(2) model Xt = ǫt − 1.3ǫt−1 + 0.4ǫt−2 is invertible.

Writing this in B notation:

    Xt = (1− 1.3B + 0.4B2)ǫt

    = Θ(B)ǫt

    to check if invertible, find roots of Θ(z) = 1− 1.3z + 0.4z2,

    1− 1.3z + 0.4z2 = 0

    4z2 − 13z + 10 = 0

    (4z − 5)(z − 2) = 0

    roots of Θ(z) are z = 2 and z = 5/4, which are both outside the unit circle ⇒ invertible.

    Example 2

    Determine whether the following model is stationary and/or invertible,

    Xt = 1.3Xt−1 − 0.4Xt−2 + ǫt − 1.5ǫt−1.

    Writing in B notation:

    (1− 1.3B + 0.4B2)Xt = (1− 1.5B)ǫt

    we have

    Φ(z) = 1− 1.3z + 0.4z2

    with roots z = 2 and 5/4 (from previous example), so the roots of Φ(z) = 0 both lie outside

the unit circle, therefore the model is stationary, and

    Θ(z) = 1− 1.5z,

so the root of Θ(z) = 0 is given by z = 2/3, which lies inside the unit circle, and the model

    is not invertible.
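Checking stationarity and invertibility amounts to finding the roots of Φ(z) and Θ(z) and comparing their moduli with 1. A minimal sketch in Python (numpy assumed) for the model of Example 2; note that numpy.roots expects coefficients ordered from the highest power downwards.

import numpy as np

def roots_outside_unit_circle(coeffs_ascending):
    """coeffs_ascending: [c0, c1, c2, ...] for c0 + c1*z + c2*z^2 + ...
    Returns the roots and whether they all lie outside the unit circle."""
    roots = np.roots(coeffs_ascending[::-1])   # np.roots wants the highest power first
    return roots, bool(np.all(np.abs(roots) > 1.0))

# Example 2: X_t = 1.3 X_{t-1} - 0.4 X_{t-2} + eps_t - 1.5 eps_{t-1}
phi_poly = [1.0, -1.3, 0.4]     # Phi(z) = 1 - 1.3 z + 0.4 z^2
theta_poly = [1.0, -1.5]        # Theta(z) = 1 - 1.5 z

for name, poly in [("Phi", phi_poly), ("Theta", theta_poly)]:
    roots, ok = roots_outside_unit_circle(poly)
    print(f"{name}: roots = {np.round(roots, 3)}, all outside unit circle: {ok}")
# Phi has roots 2 and 1.25 (stationary); Theta has root 2/3 (not invertible).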

    Computing the autocovariance sequence of an AR process. The above Proposition

    provides us with an “easy” method for computing the autocovariance sequence of a station-

    ary AR process. This is best seen in an example. Let us simplify slightly the above model

    to give

    Xt = 1.3Xt−1 − 0.4Xt−2 + ǫt.

    Write as

    (1− 1.3B + 0.4B2)Xt = ǫt.

Then, having found the roots of the characteristic polynomial, we factorise

\left(1 - \frac{B}{2}\right)\left(1 - \frac{4}{5}B\right)X_t = \epsilon_t.

Formally, this is equivalent to

X_t = \frac{1}{\left(1 - \frac{B}{2}\right)\left(1 - \frac{4}{5}B\right)}\,\epsilon_t,

which, using Taylor expansion, gives

X_t = \left(\sum_{i=0}^{\infty} \frac{B^i}{2^i}\right)\left(\sum_{i=0}^{\infty} \frac{4^i B^i}{5^i}\right)\epsilon_t.

Collecting the terms (not always easy – Maple can help), we get

X_t = \left(1 + \frac{13}{10}B + \frac{129}{100}B^2 + \ldots\right)\epsilon_t
    = \epsilon_t + \frac{13}{10}\epsilon_{t-1} + \frac{129}{100}\epsilon_{t-2} + \ldots

Truncating this expansion at a sufficiently large lag, we then proceed in the same way as in
calculating the acs of an MA process.
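This recipe – expand the stationary AR process as a (truncated) MA(∞) and then use the MA acvs formula – can be automated. A minimal sketch in Python (numpy assumed; the truncation lag is an arbitrary choice): the ψ-coefficients are obtained recursively rather than by symbolic expansion.

import numpy as np

def ar_to_ma_coeffs(phi, n_terms):
    """psi_j in X_t = sum_j psi_j eps_{t-j} for the AR model X_t = sum_k phi[k-1] X_{t-k} + eps_t."""
    p = len(phi)
    psi = np.zeros(n_terms)
    psi[0] = 1.0
    for j in range(1, n_terms):
        for k in range(1, min(j, p) + 1):
            psi[j] += phi[k - 1] * psi[j - k]
    return psi

def acvs_from_ma(psi, max_lag, sigma2=1.0):
    """s_tau = sigma2 * sum_j psi_j psi_{j+tau}, truncated at len(psi) terms."""
    return np.array([sigma2 * np.sum(psi[: len(psi) - k] * psi[k:]) for k in range(max_lag + 1)])

phi = [1.3, -0.4]                     # X_t = 1.3 X_{t-1} - 0.4 X_{t-2} + eps_t
psi = ar_to_ma_coeffs(phi, n_terms=200)
print("psi_0..psi_2:", psi[:3])       # 1, 1.3, 1.29, as in the expansion above
s = acvs_from_ma(psi, max_lag=5)
print("acs rho_0..rho_5:", (s / s[0]).round(3))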

    Remark. The above discussion implies that

    MA (finite order) ≡ AR (infinite order)

    AR (finite order) ≡ MA (infinite order)

    provided the infinite order expansions exist!

SUMMARY

                  AR(p)                MA(q)                ARMA(p, q)
Stationarity      Roots of Φ(z)        Always stationary    Roots of Φ(z)
                  outside |z| ≤ 1                           outside |z| ≤ 1
Invertibility     Always invertible    Roots of Θ(z)        Roots of Θ(z)
                                       outside |z| ≤ 1      outside |z| ≤ 1

    Why are stationarity and invertibility desirable?

    • Stationarity — because it assumes that the joint distribution of Xt (or second-order

    properties) does not change over time. Therefore, we can average over time to obtain

    better estimates of the characteristics of the process.

Example. Assume {X_t}_{t=1}^{T} is stationary and E(X_t) = µ. We can estimate µ by
averaging X_t over time:

\hat{\mu} = \frac{1}{T}\sum_{t=1}^{T} X_t.

    More later!

    • Invertibility — because it permits us to represent our process as AR, and AR processes

    are often “easy” to estimate and forecast. Again, more later.

2.4 Partial autocorrelation

    In the section on MA processes, we saw that the autocovariance sequence cut off after

    some lag k. Also, we saw that for an AR(1) process, it never cut off to zero, but decayed

    exponentially. The same result can be proved for a general AR process.

    For purposes of model identification, it would be useful to have a quantity which did cut

    off to zero for a general autoregressive process AR(p). One such quantity is the partial

    autocorrelation function (or sequence).

    For a process Xt, it is defined by

    π(1) = corr(X2,X1)

    π(2) = corr(X3 −E(X3|X2),X1 − E(X1|X2))

    π(3) = corr(X4 −E(X4|X3,X2),X1 − E(X1|X3,X2))

    etc . . .

    Interpretation: E(X4|X3,X2) is the “part of X4 that is explained by X3,X2” (or more

    formally, it is the prediction of X4 based on X3,X2). Thus X4 − E(X4|X3,X2) is the part

    of X4 that is unexplained (un-predicted) by X2,X3. Thus the partial autocorrelation at lag

    k is the correlation of those portions of X1, Xk+1 which are unexplained by the intermediate

    variables X2, . . . ,Xk.

    Fact: for an autoregressive process AR(p), pacf at lags k > p is zero.

    Example for AR(1), Xt = aXt−1 + ǫt.

    π(1) = corr(X2,X1) = ρ(1).

π(2) = corr(X3 − E(X3|X2), X1 − E(X1|X2)).

Now,

X3 − E(X3|X2) = X3 − E(aX2 + ǫ3|X2) = X3 − (aX2 + 0) = ǫ3.

But X1 − E(X1|X2) is a function of X1, X2, which are independent of ǫ3. Thus π(2) = 0.

Note: for a general AR(p) process as defined above,

X_t = \phi_{1,p}X_{t-1} + \phi_{2,p}X_{t-2} + \ldots + \phi_{p,p}X_{t-p} + \epsilon_t,

if the ǫ's are iid Gaussian, then it can be shown that π(k) = φ_{k,k}, the last coefficient of the
best-fitting AR(k) model; in particular, π(p) = φ_{p,p}.
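The pacf can therefore be computed by fitting AR(k) models of increasing order (e.g. via the Yule-Walker equations of Section 4.2) and recording the last coefficient. A minimal sketch in Python (numpy assumed; the simulated AR(2) series is illustrative): the sample pacf is large at lags 1 and 2 and close to zero beyond, as the Fact above predicts.

import numpy as np

def sample_acvs(x, max_lag):
    d = x - x.mean()
    n = len(x)
    return np.array([np.sum(d[k:] * d[: n - k]) / n for k in range(max_lag + 1)])

def sample_pacf(x, max_lag):
    """pacf(k) = last coefficient of the AR(k) model fitted by Yule-Walker."""
    s = sample_acvs(x, max_lag)
    pacf = np.empty(max_lag)
    for k in range(1, max_lag + 1):
        Gamma = np.array([[s[abs(i - j)] for j in range(k)] for i in range(k)])
        gamma = s[1: k + 1]
        phi_k = np.linalg.solve(Gamma, gamma)
        pacf[k - 1] = phi_k[-1]
    return pacf

rng = np.random.default_rng(7)
n = 5000
x = np.zeros(n)
eps = rng.normal(size=n)
for t in range(2, n):
    x[t] = 0.5 * x[t - 1] + 0.2 * x[t - 2] + eps[t]     # the AR(2) of Figure 11 (top)

print("sample pacf, lags 1-5:", sample_pacf(x, 5).round(3))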

    3 Spectral Representation theorem for discrete time station-

    ary processes

    Spectral analysis is a study of the frequency domain characteristics of a process, and de-

    scribes the contribution of each frequency to the variance of the process.

    The whole idea is best described graphically on the board.

    A more formal description follows.

    We start with a stochastic process Z(f), possibly complex-valued, defined on the interval

    f ∈ [−1/2, 1/2]. The variable f will correspond to “frequency”.

    Consider “infinitely small” jumps of Z(f), denoted by dZ(f):

dZ(f) \equiv \begin{cases} Z(f + df) - Z(f), & f < 1/2; \\ 0, & f = 1/2, \end{cases}

    where df is a small positive increment. If the intervals [f, f + df ] and [f ′, f ′ + df ′] are non-

    intersecting subintervals of [−1/2, 1/2], then the r.v.’s dZ(f) and dZ(f ′) are uncorrelated.

    Again, we say that the process has orthogonal increments, and the process itself is called

an orthogonal process – this orthogonality assumption is very important.

    Note. While Z(f) is a well-defined stochastic process, dZ(f) is not a real process, in the

    same way as the “delta function” is not a real function: both make sense only in integration!

    Let {Xt} be a (possibly complex-valued) discrete time second-order stationary process, with

    zero mean. The spectral representation theorem states that there exists such an orthogonal

    process {Z(f)}, defined on [−1/2, 1/2], such that

    Xt =

    ∫ 1/2

    −1/2ei2πft dZ(f)

    for all integers t.

    The process {Z(f)} has the following properties:

    [1] E{dZ(f)} = 0 ∀ |f | ≤ 1/2.

[2] E{|dZ(f)|²} ≡ dS^{(I)}(f), say, for all |f| ≤ 1/2, where S^{(I)}(f) is called the integrated
spectrum of {Xt}, and

    [3] for any two distinct frequencies f and f ′ ∈ [−1/2, 1/2]

    cov{dZ(f ′), dZ(f)} = E{dZ∗(f ′)dZ(f)} = 0.

The spectral representation

X_t = \int_{-1/2}^{1/2} e^{i2\pi f t}\, dZ(f) = \int_{-1/2}^{1/2} e^{i2\pi f t}\, |dZ(f)|\, e^{i\arg\{dZ(f)\}},

    means that we can represent any discrete complex-valued stationary process as an “infinite”

    sum of complex exponentials at frequencies f with associated random amplitudes |dZ(f)|

    and random phases arg{dZ(f)}.

    Since for any two different frequencies, dZ(f) and dZ(f ′) are uncorrelated (= independent if

Z is Gaussian), we have a convenient decomposition of Xt into a sum of uncorrelated com-

    ponents of a particularly simple form: note that ei2πft are simple complex-valued oscillatory

    functions. Think of them as “extended” versions of sines or cosines.

    Non-examinable example, mostly for those taking ST409 Stochastic Processes.

    Let B(f) be a standard Brownian motion on [−1/2, 1/2], where the integration w.r.t.

    dB(−f) is defined by

\int_{[-1/2,1/2]} g(f)\, dB(-f) = \int_{[-1/2,1/2]} g(-f)\, dB(f).

    Define the orthogonal increment process by

    dZ(f) = dB(f) + dB(−f) + i(dB(−f)− dB(f)).

    Note:

    1. Quick check whether Z really has orthogonal increments: from the properties of

Brownian motion, it is obvious that cov(dZ(f), dZ(f′)) = 0 for all |f| ≠ |f′|. The
only case where it is not obvious is f′ = −f. W.l.o.g. assume f > 0. Recalling that
E(dB(−f) dB(−f)) = df, we compute

\mathrm{cov}(dZ(f), dZ(-f)) = E(dZ(f)\, dZ^*(-f))
= E\{(dB(f) + dB(-f) + i(dB(-f) - dB(f)))\,(dB(-f) + dB(f) - i(dB(f) - dB(-f)))\}
= df - i\,df + df + i\,df + i\,df - df - i\,df - df = 0.

    2. Note that dZ∗(−f) = dZ(f). It is always the case when the process Xt “generated”

    by Z(f) is real-valued!

We have

X_t = \int_{-1/2}^{1/2} e^{i2\pi f t}\, dZ(f)
    = \int_{-1/2}^{1/2} (\cos(2\pi f t) + i\sin(2\pi f t))(dB(f) + dB(-f) + i(dB(-f) - dB(f)))
    = \int_{-1/2}^{1/2} \cos(2\pi f t)\,dB(f) + \cos(2\pi f t)\,dB(-f) - \sin(2\pi f t)\,dB(-f) + \sin(2\pi f t)\,dB(f)
    \;+\; i\int_{-1/2}^{1/2} \cos(2\pi f t)\,dB(-f) - \cos(2\pi f t)\,dB(f) + \sin(2\pi f t)\,dB(f) + \sin(2\pi f t)\,dB(-f)
    = 2\int_{-1/2}^{1/2} (\cos(2\pi f t) + \sin(2\pi f t))\,dB(f).

    So Xt is real-valued. Obviously, it is Gaussian and has zero mean. We will now compute

    the covariance structure of Xt.

\mathrm{cov}(X_t, X_{t+\tau}) =
4\,E\left\{\int_{-1/2}^{1/2}\int_{-1/2}^{1/2} (\cos(2\pi f t) + \sin(2\pi f t))(\cos(2\pi f'(t+\tau)) + \sin(2\pi f'(t+\tau)))\,dB(f)\,dB(f')\right\}
= 4\int_{-1/2}^{1/2} (\cos(2\pi f t) + \sin(2\pi f t))(\cos(2\pi f(t+\tau)) + \sin(2\pi f(t+\tau)))\,df
= 4\int_{-1/2}^{1/2} \cos(2\pi f \tau) + \sin(2\pi f(2t+\tau))\,df.

Now, the sin part integrates to zero, whereas the cos part is equal to 4 if τ = 0 and 0 otherwise.
So, Z “generates” standard Gaussian white noise (up to a multiplicative factor).

    End of example.

    The orthogonal increments property can be used to define the relationship between the

autocovariance sequence {sτ} and the integrated spectrum S^{(I)}(f):

s_\tau = E\{X_t X_{t+\tau}\} = E\{X_t^* X_{t+\tau}\}
= E\left\{\int_{-1/2}^{1/2} e^{-i2\pi f' t}\, dZ^*(f') \int_{-1/2}^{1/2} e^{i2\pi f(t+\tau)}\, dZ(f)\right\}
= \int_{-1/2}^{1/2}\int_{-1/2}^{1/2} e^{i2\pi(f - f')t}\, e^{i2\pi f \tau}\, E\{dZ^*(f')\, dZ(f)\}.

    Because of the orthogonal increments property,

E\{dZ^*(f')\, dZ(f)\} = \begin{cases} dS^{(I)}(f) & f = f' \\ 0 & f \neq f' \end{cases}

    so

s_\tau = \int_{-1/2}^{1/2} e^{i2\pi f \tau}\, dS^{(I)}(f),

    which shows that the integrated spectrum determines the acvs for a stationary process. If

    in fact S(I)(f) is differentiable everywhere with a derivative denoted by S(f) we have

    E{|dZ(f)|2} = dS(I)(f) = S(f) df.

    The function S(·) is called the spectral density function (sdf). Hence

s_\tau = \int_{-1/2}^{1/2} e^{i2\pi f \tau} S(f)\, df.

But a square summable deterministic sequence {g_t}, say, has the Fourier representation

g_t = \int_{-1/2}^{1/2} G(f) e^{i2\pi f t}\, df, \qquad \text{where} \qquad G(f) = \sum_{t=-\infty}^{\infty} g_t e^{-i2\pi f t}.

If we assume that S(f) is square integrable, then S(f) is the Fourier transform of {s_\tau}:

S(f) = \sum_{\tau=-\infty}^{\infty} s_\tau e^{-i2\pi f \tau}.

Hence,

\{s_\tau\} \longleftrightarrow S(f),

i.e., {s_\tau} and S(f) are a Fourier transform pair.

    Conclusion: because for real-valued processes, sτ is real-valued and symmetric, S(f) is nec-

    essarily real-valued (obviously) and also symmetric!

    Spectral Density Function

    Subject to its existence, S(·) has the following interpretation: S(f) df is the average con-

    tribution (over all realizations) to the power from components with frequencies in a small

    interval about f . The power – or variance – is

\int_{-1/2}^{1/2} S(f)\, df.

    Hence, S(f) is often called the power spectral density function or just power spectrum.

Properties (assuming existence):

[1] S^{(I)}(f) = \int_{-1/2}^{f} S(f')\, df'.

[2] 0 ≤ S^{(I)}(f) ≤ σ², where σ² = var{Xt}; S(f) ≥ 0.

[3] S^{(I)}(−1/2) = 0; S^{(I)}(1/2) = σ²; \int_{-1/2}^{1/2} S(f)\, df = σ².

[4] f < f′ ⇒ S^{(I)}(f) ≤ S^{(I)}(f′); S(−f) = S(f).

Except, basically, for the scaling factor σ², S^{(I)}(f) has all the properties of a probability
distribution function, and hence is sometimes called a spectral distribution function.

    Classification of Spectra

For most practical purposes any integrated spectrum S^{(I)}(f) can be written as

S^{(I)}(f) = S_1^{(I)}(f) + S_2^{(I)}(f),

where the S_j^{(I)}(f)'s are nonnegative, nondecreasing functions with S_j^{(I)}(−1/2) = 0 and are
of the following types:

[1] S_1^{(I)}(·) is absolutely continuous, i.e., its derivative exists for almost all f and is equal
almost everywhere to an sdf S(·) such that

S_1^{(I)}(f) = \int_{-1/2}^{f} S(f')\, df'.

[2] S_2^{(I)}(·) is a step function with jumps of size {p_l : l = 1, 2, . . .} at the points {f_l : l = 1, 2, . . .}.

We consider the integrated spectrum to be a combination of two ‘pure’ forms:

case (a): S_1^{(I)}(f) ≥ 0; S_2^{(I)}(f) = 0.

{Xt} is said to have a purely continuous spectrum and S(f) is absolutely integrable,
with

\int_{-1/2}^{1/2} S(f)\cos(2\pi f \tau)\, df \to 0 \quad \text{and} \quad \int_{-1/2}^{1/2} S(f)\sin(2\pi f \tau)\, df \to 0,

as τ → ∞. [This is known as the Riemann–Lebesgue theorem.] But,

s_\tau = \int_{-1/2}^{1/2} e^{i2\pi f \tau} S(f)\, df = \int_{-1/2}^{1/2} S(f)\cos(2\pi f \tau)\, df + i\int_{-1/2}^{1/2} S(f)\sin(2\pi f \tau)\, df.

Hence s_\tau → 0 as |τ| → ∞. In other words, the acvs diminishes to zero (this is called the
“mixing condition”).

case (b): S_1^{(I)}(f) = 0; S_2^{(I)}(f) ≥ 0.

    Here the integrated spectrum consists entirely of a step function, and the {Xt} is said

    to have a purely discrete spectrum or a line spectrum. The acvs for a process with a

    line spectrum never damps down to 0.

    Examples see Figs 13 and 14.

    We will not be studying processes falling under the (b) category.

    White noise spectrum

Recall that a white noise process {ǫt} has acvs

s_\tau = \begin{cases} \sigma_\epsilon^2 & \tau = 0 \\ 0 & \text{otherwise.} \end{cases}

Therefore, the spectrum of a white noise process is given by

S_\epsilon(f) = \sum_{\tau=-\infty}^{\infty} s_\tau e^{-i2\pi f \tau} = s_0 = \sigma_\epsilon^2,

    i.e., white noise has a constant spectrum.

    Spectral density function vs. autocovariance function

    The sdf and acvs contain the same amount of information in that if we know one of them,

    we can calculate the other. However, they are often not equally informative.

    On some occasions, sdf or acvs proves to be the more sensitive and interpretable diagnostic

    or exploratory tool, so it is often useful to apply both tools in the exploratory analysis of

    time series data.

    There are also other transformations, for example those based on wavelets, which help bring

out important features of the data.

    Linear Filtering

    A digital filter maps a sequence to another sequence. The following digital filtering method-

    ology will be extremely useful in establishing spectral density functions of the linear time

    series models covered so far, e.g. AR, MA, etc.

    A digital filter L that transforms an input sequence {xt} into an output sequence {yt} is

    called a linear time-invariant (LTI) digital filter if it has the following three properties:

    [1] Scale-preservation:

    L {{αxt}} = αL {{xt}} .

    [2] Superposition:

    L {{xt,1 + xt,2}} = L {{xt,1}}+ L {{xt,2}} .

    [3] Time invariance:

    If

    L {{xt}} = {yt}, then L {{xt+τ}} = {yt+τ}.

    Where τ is integer-valued, and the notation {xt+τ} refers to the sequence whose t-th

    element is xt+τ .

    Suppose we use a sequence with t-th element exp(i2πft) as the input to a LTI digital filter:

    Let ξf,t = {ei2πft}, and let yf,t denote the output function:

    yf,t = L{ξf,t}.

    By properties [1] and [3]:

    yf,t+τ = L{ξf,t+τ} = L{ei2πfτ ξf,t} = ei2πfτL{ξf,t} = ei2πfτyf,t.

In particular, for t = 0:

    yf,τ = ei2πfτyf,0.

    Now set τ = t:

    yf,t = ei2πftyf,0.

    Thus, when ξf,t is input to the LTI digital filter, the output is the same function multiplied

    by some constant, yf,0, which is independent of time but will depend on f . Let G(f) = yf,0.

    Then

    L{ξf,t} = ξf,tG(f).

    G(f) is called the transfer function or frequency response function of L. We can write

G(f) = |G(f)|\, e^{i\theta(f)},

where |G(f)| is the gain and θ(f) = arg{G(f)} is the phase.

Any LTI digital filter can be expressed in the form

L\{\{X_t\}\} = \sum_{u=-\infty}^{\infty} g_u X_{t-u} \equiv \{Y_t\},

where {g_u} is a real-valued deterministic sequence called the impulse response sequence.
Note,

L\{\{e^{i2\pi f t}\}\} = \sum_{u=-\infty}^{\infty} g_u e^{i2\pi f(t-u)} = e^{i2\pi f t} G(f),

with

G(f) = \sum_{u=-\infty}^{\infty} g_u e^{-i2\pi f u} \quad \text{for } |f| \le \tfrac{1}{2}.

Note:

    {gu} ←→ G(f) (F.T. pair).

We have

Y_t = \sum_{u} g_u X_{t-u}.

Recall,

X_t = \int_{-1/2}^{1/2} e^{i2\pi f t}\, dZ_X(f), \qquad Y_t = \int_{-1/2}^{1/2} e^{i2\pi f t}\, dZ_Y(f),

\Rightarrow \int_{-1/2}^{1/2} e^{i2\pi f t}\, dZ_Y(f) = \sum_{u} g_u \int_{-1/2}^{1/2} e^{i2\pi f(t-u)}\, dZ_X(f)
= \int_{-1/2}^{1/2} e^{i2\pi f t} G(f)\, dZ_X(f),

so that

dZ_Y(f) = G(f)\, dZ_X(f) \quad (1:1)

and

E\{|dZ_Y(f)|^2\} = |G(f)|^2 E\{|dZ_X(f)|^2\},

and if the spectral densities exist,

S_Y(f) = |G(f)|^2 S_X(f).

    This relationship can be used to determine the sdf’s of discrete parameter stationary

    processes.

    Determination of sdf’s by LTI digital filtering

[1] q-th order moving average: MA(q),

X_t = \epsilon_t - \theta_{1,q}\epsilon_{t-1} - \ldots - \theta_{q,q}\epsilon_{t-q},

with the usual assumptions (mean zero). Define

L\{\{\epsilon_t\}\} = \epsilon_t - \theta_{1,q}\epsilon_{t-1} - \ldots - \theta_{q,q}\epsilon_{t-q},

so that {Xt} = L{{ǫt}}. To determine G(f), input e^{i2πft}:

L\{\{e^{i2\pi f t}\}\} = e^{i2\pi f t} - \theta_{1,q}e^{i2\pi f(t-1)} - \ldots - \theta_{q,q}e^{i2\pi f(t-q)}
= e^{i2\pi f t}\left[1 - \theta_{1,q}e^{-i2\pi f} - \ldots - \theta_{q,q}e^{-i2\pi f q}\right],

so that

G_\theta(f) = 1 - \theta_{1,q}e^{-i2\pi f} - \ldots - \theta_{q,q}e^{-i2\pi f q}.

Since

S_X(f) = |G_\theta(f)|^2 S_\epsilon(f) \quad \text{and} \quad S_\epsilon(f) = \sigma_\epsilon^2,

we have

S_X(f) = \sigma_\epsilon^2 |1 - \theta_{1,q}e^{-i2\pi f} - \ldots - \theta_{q,q}e^{-i2\pi f q}|^2.

Let z = e^{-i2\pi f} and define

H_\theta(z) = 1 - \theta_{1,q}z - \ldots - \theta_{q,q}z^q,

so that G_\theta(f) = H_\theta(z). Of course, H_\theta(z) is the characteristic polynomial of X_t. We have

|G_\theta(f)|^2 = G_\theta(f)G_\theta^*(f) \equiv H_\theta(z)H_\theta(z^{-1}).

Roots of H(z) and H(z^{-1}) are inverses. Hence, if X_t is invertible (roots of H(z) outside
the unit circle), then there exists a non-invertible process with the same |G_\theta(f)|^2 (and
hence the same spectrum). Thus, we cannot determine from the spectrum whether

    the process is invertible or not. This makes sense, since we cannot distinguish these

    cases using the acvs either.

    Example:

    1. The invertible case: H(z) = 1− z/2.

    2. The non-invertible case: H(z) = 1/2− z.

    Both have the same spectrum, the same autocovariance sequence and the same auto-

    correlation sequence.

    [2] p-th order autoregressive process: AR(p),

    Xt − φ1,pXt−1 − . . .− φp,pXt−p = ǫt

    Define

    L {{Xt}} = Xt − φ1,pXt−1 − . . . − φp,pXt−p,

    so that L {{Xt}} = {ǫt}. By analogy to MA(q)

    Gφ(f) = 1− φ1,pe−i2πf − . . . − φp,pe−i2πfp.

    Since,

    |Gφ(f)|2SX(f) = Sǫ(f) and Sǫ(f) = σ2ǫ ,

    we have

S_X(f) = \frac{\sigma_\epsilon^2}{|1 - \phi_{1,p}e^{-i2\pi f} - \ldots - \phi_{p,p}e^{-i2\pi f p}|^2}.

Interpretation of AR spectra

Recall that for an AR process we have characteristic equation

1 - \phi_{1,p}z - \phi_{2,p}z^2 - \ldots - \phi_{p,p}z^p,

and the process is stationary if the roots of this equation lie outside the unit circle.

Consider an AR(2) process with complex characteristic roots; these roots must form
a complex conjugate pair:

z = \frac{1}{r}e^{-i2\pi f'}, \qquad z = \frac{1}{r}e^{i2\pi f'},

and we can write

1 - \phi_{1,p}z - \phi_{2,p}z^2 = (rz - e^{-i2\pi f'})(rz - e^{i2\pi f'})
= r^2 z^2 - zr(e^{-i2\pi f'} + e^{i2\pi f'}) + 1
= r^2 z^2 - 2zr\cos(2\pi f') + 1,

and the AR process can be written

(r^2 B^2 - 2r\cos(2\pi f')B + 1)X_t = \epsilon_t
\;\Rightarrow\; X_t = 2r\cos(2\pi f')X_{t-1} - r^2 X_{t-2} + \epsilon_t.

The spectrum can be written in terms of the complex roots, by substituting z = e^{-i2\pi f}
in the characteristic equation:

S_X(f) = \frac{\sigma_\epsilon^2}{|re^{-i2\pi f} - e^{-i2\pi f'}|^2\,|re^{-i2\pi f} - e^{i2\pi f'}|^2}.

Now,

|re^{-i2\pi f} - e^{-i2\pi f'}|^2 = |e^{-i2\pi f}(r - e^{-i2\pi(f'-f)})|^2
= (r - e^{-i2\pi(f'-f)})(r - e^{i2\pi(f'-f)})
= r^2 - r(e^{-i2\pi(f'-f)} + e^{i2\pi(f'-f)}) + 1
= r^2 - 2r\cos(2\pi(f'-f)) + 1;

similarly,

|re^{-i2\pi f} - e^{i2\pi f'}|^2 = r^2 - 2r\cos(2\pi(f'+f)) + 1,

giving

S_X(f) = \frac{\sigma_\epsilon^2}{(r^2 - 2r\cos(2\pi(f'+f)) + 1)(r^2 - 2r\cos(2\pi(f'-f)) + 1)}.

The spectrum will be at its largest when the denominator is at its smallest; when r is
close to 1 this occurs when f ≈ ±f'. Also notice that at f = ±f' the spectrum becomes
larger as r → 1 (from below, as 0 < r < 1).

    Generally speaking complex roots will induce a peak in the spectrum, indicating a

    tendency towards a cycle at frequency f ′. Also, the larger the value of r the more

dominant the cycle. This may be termed pseudo-cyclical behaviour (recall that a
deterministic cycle will show up as a sharp spike – i.e., a line spectrum).

A similar analysis is possible in the case of real roots.
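The pseudo-cyclical behaviour is easy to see numerically by evaluating S_X(f) on a grid. A minimal sketch in Python (numpy assumed; the choices of r and f' are illustrative): the peak of the AR(2) spectrum sits close to f' and sharpens as r approaches 1.

import numpy as np

def ar2_spectrum(r, f_prime, freqs, sigma2_eps=1.0):
    """S_X(f) for X_t = 2r cos(2 pi f') X_{t-1} - r^2 X_{t-2} + eps_t, via the general AR formula."""
    phi1 = 2.0 * r * np.cos(2.0 * np.pi * f_prime)
    phi2 = -r ** 2
    z = np.exp(-2j * np.pi * freqs)
    G = 1.0 - phi1 * z - phi2 * z ** 2
    return sigma2_eps / np.abs(G) ** 2

freqs = np.linspace(0.0, 0.5, 2001)
f_prime = 0.2
for r in (0.7, 0.9, 0.99):
    S = ar2_spectrum(r, f_prime, freqs)
    print(f"r={r}: peak at f={freqs[np.argmax(S)]:.3f}, peak height={S.max():.1f}")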

    [3] (p, q)−th order autoregressive, moving average process: ARMA(p, q),

    Xt − φ1,pXt−1 − . . .− φp,pXt−p = ǫt − θ1,qǫt−1 − . . .− θq,qǫt−q

    If we write this as

X_t - \phi_{1,p}X_{t-1} - \ldots - \phi_{p,p}X_{t-p} = Y_t; \qquad Y_t = \epsilon_t - \theta_{1,q}\epsilon_{t-1} - \ldots - \theta_{q,q}\epsilon_{t-q},

then we have

|G_\phi(f)|^2 S_X(f) = S_Y(f), \quad \text{and} \quad S_Y(f) = |G_\theta(f)|^2 S_\epsilon(f),

so that

S_X(f) = S_\epsilon(f)\frac{|G_\theta(f)|^2}{|G_\phi(f)|^2}
= \sigma_\epsilon^2 \frac{|1 - \theta_{1,q}e^{-i2\pi f} - \ldots - \theta_{q,q}e^{-i2\pi f q}|^2}{|1 - \phi_{1,p}e^{-i2\pi f} - \ldots - \phi_{p,p}e^{-i2\pi f p}|^2}.

    [4] Differencing

    Let {Xt} be a stationary process with sdf SX(f). Let Yt = Xt −Xt−1. Then

L\{\{e^{i2\pi f t}\}\} = e^{i2\pi f t} - e^{i2\pi f(t-1)} = e^{i2\pi f t}(1 - e^{-i2\pi f}) = e^{i2\pi f t}G(f),

so

|G(f)|^2 = |1 - e^{-i2\pi f}|^2 = |e^{-i\pi f}(e^{i\pi f} - e^{-i\pi f})|^2 = |e^{-i\pi f}\,2i\sin(\pi f)|^2 = 4\sin^2(\pi f).

4 Estimation

    4.1 Estimation of mean and autocovariance function

    Ergodic Property

    Methods we shall look at for estimating quantities such as the autocovariance function will

    use observations from a single realization. Such methods are based on the strategy of re-

    placing ensemble averages by their corresponding time averages.

    Sample mean:

Given a time series X1, X2, . . . , XN, let

\bar{X} = \frac{1}{N}\sum_{t=1}^{N} X_t
\qquad \left(\text{assume } \sum_{\tau=-\infty}^{\infty} |s_\tau| < \infty\right).

Then

\mathrm{var}\{\bar{X}\} = \frac{1}{N^2}\sum_{t=1}^{N}\sum_{u=1}^{N} \mathrm{cov}\{X_t, X_u\}
= \frac{1}{N^2}\sum_{\tau=-(N-1)}^{N-1}\;\sum_{k=1}^{N-|\tau|} s_\tau
= \frac{1}{N^2}\sum_{\tau=-(N-1)}^{N-1} (N - |\tau|)\,s_\tau
= \frac{1}{N}\sum_{\tau=-(N-1)}^{N-1} \left(1 - \frac{|\tau|}{N}\right)s_\tau.

The summation interchange merely swaps row sums for diagonal sums: for example, with
N = 4, the elements s_{t-u} of the N × N grid indexed by (t, u) are summed along the
diagonals of constant τ = t − u, τ = −3, . . . , 0, . . . , 3.

By the Cesàro summability theorem, if \sum_{\tau=-(N-1)}^{N-1} s_\tau converges to a limit as N → ∞
(it must, since \left|\sum_{\tau=-(N-1)}^{N-1} s_\tau\right| \le \sum_{\tau=-(N-1)}^{N-1} |s_\tau| \le \sum_{\tau=-\infty}^{\infty} |s_\tau| < \infty),
then \sum_{\tau=-(N-1)}^{N-1} (1 - |\tau|/N)\, s_\tau converges to the same limit.

We can thus conclude that

\lim_{N\to\infty} N\,\mathrm{var}\{\bar{X}\}
= \lim_{N\to\infty} \sum_{\tau=-(N-1)}^{N-1} \left(1 - \frac{|\tau|}{N}\right)s_\tau
= \lim_{N\to\infty} \sum_{\tau=-(N-1)}^{N-1} s_\tau
= \sum_{\tau=-\infty}^{\infty} s_\tau.

The assumption of absolute summability of {s_\tau} implies that {Xt} has a purely continuous
spectrum with sdf

S(f) = \sum_{\tau=-\infty}^{\infty} s_\tau e^{-i2\pi f \tau},

so that

S(0) = \sum_{\tau=-\infty}^{\infty} s_\tau.

Thus

\lim_{N\to\infty} N\,\mathrm{var}\{\bar{X}\} = S(0), \quad \text{i.e.,} \quad \mathrm{var}\{\bar{X}\} \approx \frac{S(0)}{N} \text{ for large } N,

and therefore var{X̄} → 0.

Note that the convergence of X̄ depends only on the spectrum at f = 0, i.e. on S(0).

    Autocovariance Sequence:

    Now,

    sτ = E{(Xt − µ)(Xt+τ − µ)}

    so that a natural estimator for the acvs is

\hat{s}_\tau^{(u)} = \frac{1}{N - |\tau|}\sum_{t=1}^{N-|\tau|} (X_t - \bar{X})(X_{t+|\tau|} - \bar{X}), \qquad \tau = 0, \pm 1, \ldots, \pm(N-1).

Note \hat{s}_{-\tau}^{(u)} = \hat{s}_\tau^{(u)}, as it should be.

If we replace X̄ by µ:

E\{\hat{s}_\tau^{(u)}\} = \frac{1}{N - |\tau|}\sum_{t=1}^{N-|\tau|} E\{(X_t - \mu)(X_{t+|\tau|} - \mu)\}
= \frac{1}{N - |\tau|}\sum_{t=1}^{N-|\tau|} s_\tau = s_\tau, \qquad \tau = 0, \pm 1, \ldots, \pm(N-1).

Thus, \hat{s}_\tau^{(u)} is an unbiased estimator of s_\tau when µ is known (hence the (u) – for unbiased).
Most texts refer to \hat{s}_\tau^{(u)} as unbiased – however, if µ is estimated by X̄, \hat{s}_\tau^{(u)} is typically a
biased estimator of s_\tau!

    A second estimator of sτ is typically preferred:

\hat{s}_\tau^{(p)} = \frac{1}{N}\sum_{t=1}^{N-|\tau|} (X_t - \bar{X})(X_{t+|\tau|} - \bar{X}), \qquad \tau = 0, \pm 1, \ldots, \pm(N-1).

With X̄ replaced by µ:

E\{\hat{s}_\tau^{(p)}\} = \frac{1}{N}\sum_{t=1}^{N-|\tau|} s_\tau = \left(1 - \frac{|\tau|}{N}\right)s_\tau,

so that \hat{s}_\tau^{(p)} is a biased estimator, and the magnitude of its bias increases as |τ| increases.
Most texts refer to \hat{s}_\tau^{(p)} as biased.

Why should we prefer the “biased” estimator \hat{s}_\tau^{(p)} to the “unbiased” estimator \hat{s}_\tau^{(u)}?

[1] For many stationary processes of practical interest,

\mathrm{mse}\{\hat{s}_\tau^{(p)}\} < \mathrm{mse}\{\hat{s}_\tau^{(u)}\},

where

\mathrm{mse}\{\hat{s}_\tau\} = E\{(\hat{s}_\tau - s_\tau)^2\}
= E\{\hat{s}_\tau^2\} - 2s_\tau E\{\hat{s}_\tau\} + s_\tau^2
= (E\{\hat{s}_\tau^2\} - E^2\{\hat{s}_\tau\}) + E^2\{\hat{s}_\tau\} - 2s_\tau E\{\hat{s}_\tau\} + s_\tau^2
= \mathrm{var}\{\hat{s}_\tau\} + (s_\tau - E\{\hat{s}_\tau\})^2
= \text{variance} + (\text{bias})^2.

    [2] If {Xt} has a purely continuous spectrum we know that sτ → 0 as |τ | → ∞. It therefore

    makes sense to choose an estimator that decreases nicely as |τ | → N − 1 (i.e. choose

    ŝ(p)τ ).

[3] We know that the acvs must be positive semidefinite; the sequence \{\hat{s}_\tau^{(p)}\} has this
property, whereas the sequence \{\hat{s}_\tau^{(u)}\} may not.
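The practical difference between the two estimators shows up at large lags. A minimal sketch in Python (numpy assumed; the AR(1) example is illustrative): ŝ^(p) shrinks towards zero as |τ| approaches N − 1, while ŝ^(u) becomes very erratic there because it averages only a handful of products.

import numpy as np

def acvs_estimates(x, max_lag):
    """Return (s_hat_u, s_hat_p) for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    d = x - x.mean()
    s_u = np.empty(max_lag + 1)
    s_p = np.empty(max_lag + 1)
    for k in range(max_lag + 1):
        prods = d[k:] * d[: n - k]
        s_u[k] = prods.sum() / (n - k)      # "unbiased" version: divide by N - |tau|
        s_p[k] = prods.sum() / n            # "biased" version: divide by N
    return s_u, s_p

rng = np.random.default_rng(8)
n = 200
x = np.zeros(n)
eps = rng.normal(size=n)
for t in range(1, n):
    x[t] = 0.9 * x[t - 1] + eps[t]

s_u, s_p = acvs_estimates(x, n - 1)
print("lag 190:", round(float(s_u[190]), 3), round(float(s_p[190]), 3))
print("lag 199:", round(float(s_u[199]), 3), round(float(s_p[199]), 3))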

4.2 Parametric model fitting: autoregressive processes

Here we concentrate on models of the form

X_t - φ_{1,p} X_{t-1} - . . . - φ_{p,p} X_{t-p} = ǫ_t.

As we have seen, the corresponding sdf is

S(f) = σ²_ǫ / |1 - φ_{1,p} e^{-i2πf} - . . . - φ_{p,p} e^{-i2πfp}|².

This class of models is appealing for time series analysis for several reasons:

[1] Any time series with a purely continuous sdf can be approximated well by an AR(p) model if p is large enough.

[2] There exist efficient algorithms for fitting AR(p) models to time series.

[3] Quite a few physical phenomena are reverberant, and hence an AR model is naturally appropriate.

A method for estimating the {φ_{j,p}} – Yule-Walker

We start by multiplying the defining equation by X_{t-k}:

X_t X_{t-k} = Σ_{j=1}^{p} φ_{j,p} X_{t-j} X_{t-k} + ǫ_t X_{t-k}.

Taking expectations, for k > 0:

s_k = Σ_{j=1}^{p} φ_{j,p} s_{k-j}.

Let k = 1, 2, . . . , p and recall that s_{-τ} = s_τ to obtain

s_1 = φ_{1,p} s_0 + φ_{2,p} s_1 + . . . + φ_{p,p} s_{p-1}
s_2 = φ_{1,p} s_1 + φ_{2,p} s_0 + . . . + φ_{p,p} s_{p-2}
    ...
s_p = φ_{1,p} s_{p-1} + φ_{2,p} s_{p-2} + . . . + φ_{p,p} s_0,

or, in matrix notation,

γ_p = Γ_p φ_p,

where γ_p = [s_1, s_2, . . . , s_p]^T, φ_p = [φ_{1,p}, φ_{2,p}, . . . , φ_{p,p}]^T and

Γ_p = [ s_0      s_1      . . .  s_{p-1} ]
      [ s_1      s_0      . . .  s_{p-2} ]
      [ . . .                    . . .   ]
      [ s_{p-1}  s_{p-2}  . . .  s_0     ].

Note: this is a symmetric Toeplitz matrix, which we have met already; all elements on a diagonal are the same.

Suppose we do not know the {s_τ}, but the mean is zero. Then take

ŝ_τ = (1/N) Σ_{t=1}^{N-|τ|} X_t X_{t+|τ|},

and substitute these for the s_τ's in γ_p and Γ_p to obtain γ̂_p, Γ̂_p, from which we estimate φ_p as

φ̂_p = Γ̂_p^{-1} γ̂_p.

Finally, we need to estimate σ²_ǫ. To do so, we multiply the defining equation by X_t and take expectations to obtain

s_0 = Σ_{j=1}^{p} φ_{j,p} s_j + E{ǫ_t X_t} = Σ_{j=1}^{p} φ_{j,p} s_j + σ²_ǫ,

so that as an estimator for σ²_ǫ we take

σ̂²_ǫ = ŝ_0 - Σ_{j=1}^{p} φ̂_{j,p} ŝ_j.

The estimators φ̂_p and σ̂²_ǫ are called the Yule-Walker estimators of the AR(p) process. The resulting estimate of the sdf is

Ŝ(f) = σ̂²_ǫ / |1 - Σ_{j=1}^{p} φ̂_{j,p} e^{-i2πfj}|².
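A minimal sketch of the Yule-Walker procedure (not part of the notes; it assumes numpy and a zero-mean series, and the function names are arbitrary):

    # Yule-Walker estimation of an AR(p) model and the resulting sdf estimate.
    import numpy as np

    def yule_walker(x, p):
        x = np.asarray(x, dtype=float)
        N = len(x)
        # Biased acvs estimates s_0, ..., s_p (mean assumed to be zero).
        s = np.array([np.dot(x[: N - tau], x[tau:]) / N for tau in range(p + 1)])
        Gamma = np.array([[s[abs(j - k)] for k in range(p)] for j in range(p)])
        gamma = s[1 : p + 1]
        phi = np.linalg.solve(Gamma, gamma)   # hat{phi}_p = hat{Gamma}_p^{-1} hat{gamma}_p
        sigma2 = s[0] - np.dot(phi, gamma)    # hat{sigma}^2_eps
        return phi, sigma2

    def ar_sdf(f, phi, sigma2):
        """Estimated sdf of the fitted AR(p) model at the frequencies in the array f."""
        p = len(phi)
        z = np.exp(-2j * np.pi * np.outer(f, np.arange(1, p + 1)))
        return sigma2 / np.abs(1 - z @ phi) ** 2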

Least squares estimation of the {φ_{j,p}}

Let {X_t} be a zero-mean AR(p) process, i.e.,

X_t = φ_{1,p} X_{t-1} + φ_{2,p} X_{t-2} + . . . + φ_{p,p} X_{t-p} + ǫ_t.

We can formulate an appropriate least squares model in terms of the data X_1, X_2, . . . , X_N as follows:

X_F = F φ + ǫ_F,

where

F = [ X_p      X_{p-1}  . . .  X_1     ]
    [ X_{p+1}  X_p      . . .  X_2     ]
    [ . . .                    . . .   ]
    [ X_{N-1}  X_{N-2}  . . .  X_{N-p} ]

and

X_F = [X_{p+1}, X_{p+2}, . . . , X_N]^T,   φ = [φ_{1,p}, φ_{2,p}, . . . , φ_{p,p}]^T,   ǫ_F = [ǫ_{p+1}, ǫ_{p+2}, . . . , ǫ_N]^T.

We can thus estimate φ by finding the φ that minimizes

SS_F(φ) = Σ_{t=p+1}^{N} (X_t - Σ_{k=1}^{p} φ_{k,p} X_{t-k})² = Σ_{t=p+1}^{N} ǫ_t² = (X_F - Fφ)^T (X_F - Fφ).

If we denote the minimizing vector by φ̂_F, standard least squares theory tells us that it is given by

φ̂_F = (F^T F)^{-1} F^T X_F.

Note: convince yourselves of this using the fact that

∂/∂x (Ax + b)^T (Ax + b) = 2 A^T (Ax + b).

We can estimate the innovations variance σ²_ǫ by the usual estimator of the residual variation, namely

σ̂²_F = (X_F - F φ̂_F)^T (X_F - F φ̂_F) / (N - 2p).

(Note: there are N - p effective observations, and p parameters are estimated.)

The estimator φ̂_F is known as the forward least squares estimator of φ.

Notes:

[1] φ̂_F produces estimated models which need not be stationary. This may be a concern for prediction; however, for spectral estimation the parameter values will still produce a valid sdf (i.e., one that is nonnegative everywhere, symmetric about the origin and integrates to a finite number).

[2] The Yule-Walker estimates can be formulated as a least squares problem.
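A minimal sketch of the forward least squares estimator (not part of the notes; it assumes numpy and a zero-mean series, and the function name is arbitrary):

    # Forward least squares fit of an AR(p) model: build the design matrix F,
    # solve the least squares problem, and estimate the innovations variance.
    import numpy as np

    def ar_least_squares(x, p):
        x = np.asarray(x, dtype=float)
        N = len(x)
        # Row for t = p+1, ..., N contains (X_{t-1}, ..., X_{t-p}).
        F = np.column_stack([x[p - j - 1 : N - j - 1] for j in range(p)])
        XF = x[p:]
        phi, *_ = np.linalg.lstsq(F, XF, rcond=None)   # minimizes ||XF - F phi||^2
        resid = XF - F @ phi
        sigma2 = resid @ resid / (N - 2 * p)           # residual-variance estimate
        return phi, sigma2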

4.3 Non-parametric spectral estimation – the periodogram

Suppose the zero-mean discrete stationary process {X_t} has a purely continuous spectrum with sdf S(f). We have

S(f) = Σ_{τ=-∞}^{∞} s_τ e^{-i2πfτ},   |f| ≤ 1/2.

With µ = 0, we can use the biased estimator of s_τ,

ŝ^{(p)}_τ = (1/N) Σ_{t=1}^{N-|τ|} X_t X_{t+|τ|},

for |τ| ≤ N - 1, but not for |τ| ≥ N. Hence we could replace s_τ by ŝ^{(p)}_τ for |τ| ≤ N - 1 and assume s_τ = 0 for |τ| ≥ N.

This gives

Ŝ^{(p)}(f) = Σ_{τ=-(N-1)}^{N-1} ŝ^{(p)}_τ e^{-i2πfτ} = (1/N) Σ_{τ=-(N-1)}^{N-1} Σ_{t=1}^{N-|τ|} X_t X_{t+|τ|} e^{-i2πfτ}
           = (1/N) Σ_{j=1}^{N} Σ_{k=1}^{N} X_j X_k e^{-i2πf(k-j)}
           = (1/N) |Σ_{t=1}^{N} X_t e^{-i2πft}|²,

where the summation interchange has merely swapped diagonal sums for row sums (see Section 4.1 on the ergodic property). Ŝ^{(p)}(f) defined above is known as the periodogram, and is defined over [-1/2, 1/2].

Note that {ŝ^{(p)}_τ} and Ŝ^{(p)}(f) form a Fourier transform pair,

{ŝ^{(p)}_τ} ←→ Ŝ^{(p)}(f)   (hence the (p), for periodogram),

just like the population quantities

{s_τ} ←→ S(f).

Hence ŝ^{(p)}_τ can be written as

ŝ^{(p)}_τ = ∫_{-1/2}^{1/2} Ŝ^{(p)}(f) e^{i2πfτ} df,   |τ| ≤ N - 1.
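In practice the periodogram is evaluated at the Fourier frequencies f_k = k/N via the FFT. A minimal sketch follows (not part of the notes; it assumes numpy, and the shift from t = 1, . . . , N to the FFT's t = 0, . . . , N - 1 only changes the phase of J(f), not |J(f)|²).

    # Periodogram at the Fourier frequencies f_k = k/N, via the FFT.
    import numpy as np

    def periodogram(x):
        x = np.asarray(x, dtype=float)
        N = len(x)
        J = np.fft.fft(x) / np.sqrt(N)     # J(f_k) up to a phase factor
        freqs = np.fft.fftfreq(N)          # f_k = k/N, folded into [-1/2, 1/2)
        return freqs, np.abs(J) ** 2       # hat{S}^{(p)}(f_k) = |J(f_k)|^2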

If Ŝ^{(p)}(f) were an ideal estimator of S(f), we would have

[1] E{Ŝ^{(p)}(f)} ≈ S(f) for all f;

[2] var{Ŝ^{(p)}(f)} → 0 as N → ∞; and

[3] cov{Ŝ^{(p)}(f), Ŝ^{(p)}(f′)} ≈ 0 for f ≠ f′.

We find that

[1] is a good approximation for some processes,

[2] is false,

[3] holds if f and f′ are certain distinct frequencies, namely the Fourier frequencies f_k = k/N (∆t = 1).

We first look at the expectation in [1] (assuming µ = 0):

E{Ŝ^{(p)}(f)} = Σ_{τ=-(N-1)}^{N-1} E{ŝ^{(p)}_τ} e^{-i2πfτ} = Σ_{τ=-(N-1)}^{N-1} (1 - |τ|/N) s_τ e^{-i2πfτ}.

Hence, if we know the acvs {s_τ}, we can work out from this what E{Ŝ^{(p)}(f)} will be. We can obtain much more insight by considering E{|J(f)|²}, where

J(f) = (1/√N) Σ_{t=1}^{N} X_t e^{-i2πft},   |f| ≤ 1/2,

so that Ŝ^{(p)}(f) = |J(f)|².

We know from the spectral representation theorem that

X_t = ∫_{-1/2}^{1/2} e^{i2πf′t} dZ(f′),

so that

J(f) = Σ_{t=1}^{N} ( ∫_{-1/2}^{1/2} (1/√N) e^{i2πf′t} dZ(f′) ) e^{-i2πft} = ∫_{-1/2}^{1/2} Σ_{t=1}^{N} (1/√N) e^{-i2π(f-f′)t} dZ(f′).

We find that

E{Ŝ^{(p)}(f)} = E{|J(f)|²} = E{J*(f) J(f)}

= E{ ∫_{-1/2}^{1/2} Σ_{t=1}^{N} (1/√N) e^{i2π(f-f′)t} dZ*(f′)  ∫_{-1/2}^{1/2} Σ_{s=1}^{N} (1/√N) e^{-i2π(f-f″)s} dZ(f″) }

= ∫_{-1/2}^{1/2} ∫_{-1/2}^{1/2} Σ_{t=1}^{N} (1/√N) e^{i2π(f-f′)t} Σ_{s=1}^{N} (1/√N) e^{-i2π(f-f″)s} E{dZ*(f′) dZ(f″)}

= ∫_{-1/2}^{1/2} F(f - f′) S(f′) df′,

where F is Fejér's kernel, defined by

F(f) = |Σ_{t=1}^{N} (1/√N) e^{-i2πft}|² = sin²(Nπf) / (N sin²(πf)).

This result tells us that the expected value of Ŝ^{(p)}(f) is the true spectrum convolved with Fejér's kernel. To understand the implications of this we need to know the properties of Fejér's kernel:

[1] For all integers N ≥ 1, F(f) → N as f → 0.

[2] For N ≥ 1, f ∈ [-1/2, 1/2] and f ≠ 0, F(f) < F(0).

[3] For f ∈ [-1/2, 1/2], f ≠ 0, F(f) → 0 as N → ∞.

[4] For any integer k ≠ 0 such that f_k = k/N ∈ [-1/2, 1/2], F(f_k) = 0.

[5] ∫_{-1/2}^{1/2} F(f) df = 1.

From the above discussion, we can see that the periodogram is an asymptotically unbiased estimator of the true spectral density. However, we now turn to property [2] above: the fact that the variance of the periodogram does not go to zero as the sample size goes to infinity.

The intuition for this is best seen for Gaussian time series. From the formula for the periodogram, we can see that it is simply “the modulus squared of a complex random variable with mean zero”. To start with, let us ignore the complex-valued nature of this variable and imagine that it is real-valued. Squaring a mean-zero normal variable produces a (scaled) χ²_1 variable. We know that its mean is, approximately, a constant (= the true spectral density). Because the variance of a χ²_1 variable equals twice its mean squared, the variance of the periodogram will also be approximately a positive constant, and in particular it will not go to zero.

The same phenomenon happens for the “correct” complex-valued variable. Note that squaring the real and imaginary parts and adding them together is a bit like adding the squares of two (asymptotically independent, as it turns out) normal variables with means zero and equal variances. Thus the resulting variable will be asymptotically exponential (a scaled ½χ²_2). Again, for an exponential variable, the variance equals the mean squared, so we cannot hope for the variance to go to zero if the mean goes to a positive constant.

For a precise result regarding the asymptotic distribution of the periodogram, see Brockwell & Davis, Theorem 10.3.2.
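A small simulation sketch of this point (not from the notes): for Gaussian white noise with variance σ², the true spectrum is flat, S(f) = σ², and the periodogram ordinates at the Fourier frequencies 0 < f_k < 1/2 behave like independent exponential variables with mean σ², so their variance stays near σ⁴ however large N becomes.

    # The periodogram of white noise: its mean settles near sigma^2,
    # but its variance settles near sigma^4 rather than shrinking to zero.
    import numpy as np

    rng = np.random.default_rng(2)
    sigma2 = 2.0
    for N in (128, 1024, 8192):
        x = rng.normal(scale=np.sqrt(sigma2), size=N)
        I = np.abs(np.fft.fft(x)) ** 2 / N     # periodogram at f_k = k/N
        I = I[1 : N // 2]                      # keep the frequencies 0 < f_k < 1/2
        print(N, I.mean(), I.var())            # mean ~ sigma^2, variance ~ sigma^4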

The way to reduce the variance of the periodogram is by smoothing. There are several smoothing techniques for the periodogram, some very advanced. However, we will study probably the simplest possible smoothing technique: kernel smoothing via the uniform kernel.

To simplify the arguments even further and “abstract” from unnecessary details, we consider a model which only approximates the periodogram. In the approximate model, each observation is exponential with mean g_i, independently of the other observations. However, the arguments we will make for this model apply, with only very slight changes, to the full model for the periodogram.

The approximate model is: y_i, for i = 1, . . . , N, are independent, exponentially distributed variables with means g_i (and, obviously, variances g_i²), where g_i is a sampled version of a “smooth” (Lipschitz-continuous) function g(z), in the sense that g_i = g(i/N). The smoothness of the underlying function is essential for kernel smoothing to work: it would be pointless to average the observations if the underlying means were completely unrelated to each other.

To form the kernel smoothing estimate with the uniform kernel, we simply take the average of the neighbouring observations:

ĝ_i = (1/(2M + 1)) Σ_{j=i-M}^{i+M} y_j.

Firstly, to compute the mean, we have

E(ĝ_i) = (1/(2M + 1)) Σ_{j=i-M}^{i+M} g_j = g_i + (1/(2M + 1)) Σ_{j=i-M}^{i+M} (g_j - g_i).

Due to the Lipschitz property, the bias is bounded as

(1/(2M + 1)) Σ_{j=i-M}^{i+M} |g_j - g_i| ≤ (C/(2M + 1)) Σ_{j=i-M}^{i+M} |i - j|/N ≤ C M/N.

On the other hand, for the variance we have

Var(ĝ_i) = (1/(2M + 1)²) Σ_{j=i-M}^{i+M} g_j² ≤ max_z g²(z) / (2M + 1) ≤ C/M.

The mean-squared error (of any estimator) equals its squared bias plus its variance (why?). Thus, here, the best rate of convergence to zero is obtained by balancing the two bounds, i.e. equating

M²/N² = 1/M,

and solving for M to give M = O(N^{2/3}).
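A minimal sketch of the approximate model and the uniform-kernel smoother (not part of the notes; it assumes numpy, and the mean function g(z) = 2 + sin(2πz) is an arbitrary smooth example).

    # Simulate y_i ~ Exponential with mean g(i/N) and smooth with a uniform
    # kernel of half-width M of order N^(2/3).
    import numpy as np

    rng = np.random.default_rng(3)
    N = 4096
    z = np.arange(1, N + 1) / N
    g = 2 + np.sin(2 * np.pi * z)                  # smooth (Lipschitz) mean function
    y = rng.exponential(scale=g)                   # independent, mean g_i, variance g_i^2

    M = int(N ** (2 / 3))                          # bandwidth of order N^{2/3}
    kernel = np.ones(2 * M + 1) / (2 * M + 1)
    g_hat = np.convolve(y, kernel, mode="same")    # local averages (edges are cruder)

    interior = slice(M, N - M)                     # ignore boundary effects
    print("MSE on the interior:", np.mean((g_hat[interior] - g[interior]) ** 2))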

5 Forecasting

Suppose we wish to predict the value X_{t+l} of a process, given X_t, X_{t-1}, X_{t-2}, . . .. Let the appropriate model for {X_t} be an ARMA(p, q) process:

Φ(B) X_t = Θ(B) ǫ_t.

Consider a forecast X_t(l) of X_{t+l} (an l-step ahead forecast) which is a linear combination of X_t, X_{t-1}, X_{t-2}, . . .:

X_t(l) = Σ_{k=0}^{∞} π_k X_{t-k}.

Note: this assumes a semi-infinite realization of {X_t}. Let us now assume that {X_t} can be written as a one-sided linear process, so that

X_t = Σ_{k=0}^{∞} ψ_k ǫ_{t-k} = Ψ(B) ǫ_t,

and

X_{t+l} = Σ_{k=0}^{∞} ψ_k ǫ_{t+l-k} = Ψ(B) ǫ_{t+l}.

Hence,

X_t(l) = Σ_{k=0}^{∞} π_k X_{t-k} = Σ_{k=0}^{∞} π_k Ψ(B) ǫ_{t-k} = Π(B) Ψ(B) ǫ_t.

Let δ(B) = Π(B) Ψ(B), so that

X_t(l) = δ(B) ǫ_t = Σ_{k=0}^{∞} δ_k ǫ_{t-k}.

Now,

X_{t+l} = Σ_{k=0}^{∞} ψ_k ǫ_{t+l-k} = Σ_{k=0}^{l-1} ψ_k ǫ_{t+l-k}  +  Σ_{k=l}^{∞} ψ_k ǫ_{t+l-k},
                                           (A)                        (B)

where

(A) involves future ǫ_t's, and so represents the “unpredictable” part of X_{t+l};

(B) depends only on past and present values of ǫ_t, thus representing the “predictable” part of X_{t+l}.

Hence we would expect

X_t(l) = Σ_{k=l}^{∞} ψ_k ǫ_{t+l-k} = Σ_{j=0}^{∞} ψ_{j+l} ǫ_{t-j},

so that δ_k ≡ ψ_{k+l}. This can be readily proved. For linear least squares, we want to minimize

E{(X_{t+l} - X_t(l))²} = E{( Σ_{k=0}^{l-1} ψ_k ǫ_{t+l-k} + Σ_{k=0}^{∞} [ψ_{k+l} - δ_k] ǫ_{t-k} )²}
                       = σ²_ǫ { Σ_{k=0}^{l-1} ψ_k² + Σ_{k=0}^{∞} (ψ_{k+l} - δ_k)² }.

The first term is independent of the choice of the {δ_k}, and the second term is clearly minimized by choosing δ_k = ψ_{k+l}, k = 0, 1, 2, . . ., as expected. With this choice of {δ_k} the second term vanishes, and we have

σ²(l) = E{(X_{t+l} - X_t(l))²} = σ²_ǫ Σ_{k=0}^{l-1} ψ_k²,

which is known as the l-step prediction variance.

When l = 1, δ_k = ψ_{k+1}, so

X_t(1) = δ_0 ǫ_t + δ_1 ǫ_{t-1} + δ_2 ǫ_{t-2} + . . . = ψ_1 ǫ_t + ψ_2 ǫ_{t-1} + ψ_3 ǫ_{t-2} + . . . ,

while

X_{t+1} = ψ_0 ǫ_{t+1} + ψ_1 ǫ_t + ψ_2 ǫ_{t-1} + . . . ,

so that

X_{t+1} - X_t(1) = ψ_0 ǫ_{t+1} = ǫ_{t+1},   since ψ_0 = 1.

Hence ǫ_{t+1} can be thought of as the “one-step prediction error”. Also, of course,

X_{t+1} = X_t(1) + ǫ_{t+1},

so that ǫ_{t+1} is the essentially “new” part of X_{t+1} which is not linearly dependent on past observations. The sequence {ǫ_t} is often called the innovations process of {X_t}, and σ²_ǫ is often called the innovations variance.

If we wish to write X_t(l) explicitly as a function of X_t, X_{t-1}, . . ., then we note first that

X_t(l) = Σ_{k=0}^{∞} δ_k ǫ_{t-k} = Σ_{k=0}^{∞} ψ_{k+l} ǫ_{t-k},

so that

X_t(l) = Ψ^{(l)}(B) ǫ_t, say,

where

Ψ^{(l)}(z) = Σ_{k=0}^{∞} ψ_{k+l} z^k.

Assuming that Ψ(z) is analytic in and on the unit circle (stationary and invertible), we can write

X_t = Ψ(B) ǫ_t   and   ǫ_t = Ψ^{-1}(B) X_t,

and thus

X_t(l) = Ψ^{(l)}(B) ǫ_t = Ψ^{(l)}(B) Ψ^{-1}(B) X_t = G^{(l)}(B) X_t, say,

with

G^{(l)}(z) = Ψ^{(l)}(z) Ψ^{-1}(z).

If we consider the sequence of predictors X_t(l) for different values of t (with l fixed), then this forms a new process which, since

X_t(l) = G^{(l)}(B) X_t,

may be regarded as the output of a linear filter acting on {X_t}. Since

X_t(l) = ( Σ_u g^{(l)}_u B^u ) X_t = Σ_u g^{(l)}_u X_{t-u},

we know that the transfer function is

G^{(l)}(f) = Σ_u g^{(l)}_u e^{-i2πfu}.

Example: AR(1)

X_t - φ_{1,1} X_{t-1} = ǫ_t,   |φ_{1,1}| < 1.

Then

X_t = (1 - φ_{1,1} B)^{-1} ǫ_t,

so

Ψ(z) = 1 + φ_{1,1} z + φ_{1,1}² z² + . . . = ψ_0 + ψ_1 z + ψ_2 z² + . . . ,

i.e., ψ_k = φ_{1,1}^k.

Hence,

X_t(l) = Σ_{k=0}^{∞} δ_k ǫ_{t-k} = Σ_{k=0}^{∞} ψ_{k+l} ǫ_{t-k} = Σ_{k=0}^{∞} φ_{1,1}^{k+l} ǫ_{t-k} = φ_{1,1}^l Σ_{k=0}^{∞} φ_{1,1}^k ǫ_{t-k} = φ_{1,1}^l X_t.

The l-step prediction variance is

σ²(l) = σ²_ǫ Σ_{k=0}^{l-1} ψ_k² = σ²_ǫ Σ_{k=0}^{l-1} φ_{1,1}^{2k} = σ²_ǫ (1 - φ_{1,1}^{2l}) / (1 - φ_{1,1}²).

Alternatively,

X_t(l) = G^{(l)}(B) X_t,   with G^{(l)}(z) = Ψ^{(l)}(z) Ψ^{-1}(z).

But

Ψ^{(l)}(z) = Σ_{k=0}^{∞} ψ_{k+l} z^k = Σ_{k=0}^{∞} φ_{1,1}^{k+l} z^k,

and

Ψ^{-1}(z) = 1 - φ_{1,1} z,

so that

G^{(l)}(z) = (φ_{1,1}^l + φ_{1,1}^{l+1} z + φ_{1,1}^{l+2} z² + . . .) × (1 - φ_{1,1} z) = φ_{1,1}^l,

i.e., X_t(l) = φ_{1,1}^l X_t, as before.

We have demonstrated that for the AR(1) model the linear least squares predictor of X_{t+l} depends only on the most recent observation, X_t, and does not involve X_{t-1}, X_{t-2}, . . ., which is what we would expect bearing in mind the Markov nature of the AR(1) model. As l → ∞, X_t(l) → 0, since X_t(l) = φ_{1,1}^l X_t and |φ_{1,1}| < 1. Also, the l-step prediction variance

σ²(l) → σ²_ǫ / (1 - φ_{1,1}²) = var{X_t}.

In fact the solution to the forecasting problem for the AR(1) model can be derived directly from the difference equation

X_t - φ_{1,1} X_{t-1} = ǫ_t

by setting the future innovations ǫ_t to zero:

X_t(1) = φ_{1,1} X_t + 0
X_t(2) = φ_{1,1} X_t(1) + 0
   ...
X_t(l) = φ_{1,1} X_t(l - 1) + 0,

so that

X_t(l) = φ_{1,1}^l X_t.

For general AR(p) processes it turns out that X_t(l) depends only on the last p observed values of {X_t}, and may be obtained by solving the AR(p) difference equation with the future {ǫ_t} set to zero. For example, for an AR(p) process and l = 1,

X_t(1) = φ_{1,p} X_t + . . . + φ_{p,p} X_{t-p+1}.
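A minimal sketch of this recursion for a general AR(p) model (not part of the notes; it assumes numpy, and the function name and the AR(1) check values are arbitrary):

    # l-step ahead AR(p) forecasts obtained by iterating the difference
    # equation with the future innovations set to zero.
    import numpy as np

    def ar_forecast(history, phi, l):
        phi = np.asarray(phi, dtype=float)
        p = len(phi)
        buf = list(history[-p:])                 # last p observations, oldest first
        forecasts = []
        for _ in range(l):
            nxt = np.dot(phi, buf[::-1])         # phi_1*X_t + ... + phi_p*X_{t-p+1}
            forecasts.append(nxt)
            buf = buf[1:] + [nxt]                # roll the window forward
        return forecasts

    # AR(1) check: X_t(l) should equal phi^l * X_t.
    print(ar_forecast([1.7], [0.8], 3))          # [1.36, 1.088, 0.8704]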

Example: ARMA(1,1)

(1 - φ_{1,1} B) X_t = (1 - θ_{1,1} B) ǫ_t.

Take φ_{1,1} = φ and θ_{1,1} = θ, so that

X_t = ((1 - θB)/(1 - φB)) ǫ_t = Ψ(B) ǫ_t.

So

Ψ(z) = (1 - θz)(1 + φz + φ²z² + φ³z³ + . . .)
     = 1 + (φ - θ)z + φ(φ - θ)z² + . . . + φ^{l-1}(φ - θ)z^l + . . .
     = ψ_0 + ψ_1 z + ψ_2 z² + . . . ,

so that

ψ_l = 1 for l = 0,   and   ψ_l = φ^{l-1}(φ - θ) for l ≥ 1.

The l-step prediction variance is

σ²(l) = σ²_ǫ Σ_{k=0}^{l-1} ψ_k² = σ²_ǫ (1 + Σ_{k=1}^{l-1} ψ_k²)
      = σ²_ǫ (1 + (φ - θ)² Σ_{k=1}^{l-1} φ^{2k-2})
      = σ²_ǫ ( 1 + (φ - θ)² (1 - φ^{2l-2}) / (1 - φ²) ).
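A small numerical sketch (not part of the notes; it assumes numpy, and the parameter values are arbitrary) that computes the ψ-weights of the ARMA(1,1) model by the recursion ψ_0 = 1, ψ_1 = φ - θ, ψ_k = φψ_{k-1} for k ≥ 2, and checks the closed form for σ²(l) above:

    import numpy as np

    def arma11_psi(phi, theta, n):
        psi = np.empty(n)
        psi[0] = 1.0
        if n > 1:
            psi[1] = phi - theta
        for k in range(2, n):
            psi[k] = phi * psi[k - 1]            # equals phi^(k-1) * (phi - theta)
        return psi

    phi, theta, sigma2, l = 0.7, 0.3, 1.0, 5
    psi = arma11_psi(phi, theta, l)              # psi_0, ..., psi_{l-1}
    print(sigma2 * np.sum(psi ** 2))             # sigma^2(l) from the psi-weights
    print(sigma2 * (1 + (phi - theta) ** 2 * (1 - phi ** (2 * l - 2)) / (1 - phi ** 2)))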

Now,

Ψ^{(l)}(z) = Σ_{k=0}^{∞} ψ_{k+l} z^k = φ^{l-1}(φ - θ) Σ_{k=0}^{∞} φ^k z^k = φ^{l-1}(φ - θ)(1 - φz)^{-1},

and

Ψ^{-1}(z) = (1 - φz)/(1 - θz).

So

G^{(l)}(z) = Ψ^{(l)}(z) Ψ^{-1}(z) = φ^{l-1}(φ - θ)(1 - θz)^{-1},

and

X_t(l) = G^{(l)}(B) X_t = φ^{l-1}(φ - θ)(1 - θB)^{-1} X_t.

Consider l = 1:

X_t(1) = (φ - θ)(1 - θB)^{-1} X_t
       = (φ - θ)(1 + θB + θ²B² + . . .) X_t
       = (φ - θ) X_t + θ(φ - θ) X_{t-1} + θ²(φ - θ) X_{t-2} + . . .
       = φ X_t - θ [X_t - (φ - θ) X_{t-1} - θ(φ - θ) X_{t-2} - . . . - θ^{k-1}(φ - θ) X_{t-k} - . . .].

But consider

ǫ_t = Ψ^{-1}(B) X_t = (1 - φB)(1 - θB)^{-1} X_t
    = (1 - φB)(1 + θB + θ²B² + θ³B³ + . . .) X_t
    = X_t - (φ - θ) X_{t-1} - θ(φ - θ) X_{t-2} - . . . - θ^{k-1}(φ - θ) X_{t-k} - . . . .

Therefore,

X_t(1) = φ X_t - θ ǫ_t.

This can again be derived directly from the difference equation

X_t = φ X_{t-1} - θ ǫ_{t-1} + ǫ_t

by setting the future innovations ǫ_t to zero.

MA(1) (invertible)

X_t = ǫ_t - θ_{1,1} ǫ_{t-1},   |θ_{1,1}| < 1.

So

Ψ(z) = ψ_0 + ψ_1 z + ψ_2 z² + . . . = 1 - θ_{1,1} z.

Hence ψ_0 = 1, ψ_1 = -θ_{1,1} and ψ_k = 0 for k ≥ 2.

Then

X_t(l) = Σ_{k=0}^{∞} ψ_{k+l} ǫ_{t-k} = Ψ^{(l)}(B) ǫ_t = ψ_l ǫ_t + ψ_{l+1} ǫ_{t-1} + . . . ,

so

Ψ^{(l)}(z) = Σ_{k=0}^{∞} ψ_{k+l} z^k = ψ_l z^0 + ψ_{l+1} z^1 = -θ_{1,1} for l = 1,   and 0 for l ≥ 2.

Hence

G^{(l)}(z) = Ψ^{(l)}(z) Ψ^{-1}(z) = -θ_{1,1}(1 - θ_{1,1}z)^{-1} for l = 1,   and 0 for l ≥ 2.

Thus, for l = 1,

G^{(1)}(z) = -θ_{1,1}(1 + θ_{1,1} z + θ_{1,1}² z² + . . .),

and hence

X_t(1) = G^{(1)}(B) X_t = - Σ_{k=0}^{∞} θ_{1,1}^{k+1} X_{t-k}.

Forecast errors and updating

We have seen that when δ_k = ψ_{k+l}, the forecast error is Σ_{k=0}^{l-1} ψ_k ǫ_{t+l-k}.

Let

e_t(l) = X_{t+l} - X_t(l) = Σ_{k=0}^{l-1} ψ_k ǫ_{t+l-k}.

Then

e_t(l + m) = Σ_{j=0}^{l+m-1} ψ_j ǫ_{t+l+m-j}.

Clearly,

E{e_t(l)} = E{e_t(l + m)} = 0.

Hence

cov{e_t(l), e_t(l + m)} = E{e_t(l) e_t(l + m)} = σ²_ǫ Σ_{k=0}^{l-1} ψ_k ψ_{k+m},

and

var{e_t(l)} = σ²_ǫ Σ_{k=0}^{l-1} ψ_k² = σ²(l).

E.g.,

cov{e_t(1), e_t(2)} = σ²_ǫ ψ_1.

This could be quite large – should the forecast for a series wander off target, it is possible for it to remain there in the short run, since forecast errors can be quite highly correlated. Hence, when X_{t+1} becomes available, we should update the forecast.

Now,

X_{t+1}(l) = Σ_{k=0}^{∞} ψ_{k+l} ǫ_{t+1-k} = ψ_l ǫ_{t+1} + ψ_{l+1} ǫ_t + ψ_{l+2} ǫ_{t-1} + . . . ,

but

X_t(l + 1) = Σ_{k=0}^{∞} ψ_{k+l+1} ǫ_{t-k} = ψ_{l+1} ǫ_t + ψ_{l+2} ǫ_{t-1} + ψ_{l+3} ǫ_{t-2} + . . . ,

and hence

X_{t+1}(l) = X_t(l + 1) + ψ_l ǫ_{t+1} = X_t(l + 1) + ψ_l (X_{t+1} - X_t(1)).

Hence, to forecast X_{t+l+1}, we can update the (l + 1)-step ahead forecast made at time t into an l-step ahead forecast made at time t + 1, using X_{t+1} as it becomes available.
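A tiny numerical check of the updating formula for the AR(1) model, for which X_t(l) = φ^l X_t and ψ_l = φ^l (the values of φ, X_t and X_{t+1} below are arbitrary):

    # Check X_{t+1}(l) = X_t(l+1) + psi_l * (X_{t+1} - X_t(1)) for an AR(1) model.
    phi, l = 0.8, 3
    x_t, x_tp1 = 1.5, 0.9                      # arbitrary values of X_t and X_{t+1}

    lhs = phi**l * x_tp1                                          # X_{t+1}(l)
    rhs = phi**(l + 1) * x_t + phi**l * (x_tp1 - phi * x_t)       # updated forecast
    print(lhs, rhs)                                               # identical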

Figure 12: Top left: Oxford temperature data. Top right: residuals from least-squares fit. Bottom left: detrended. Bottom right: detrended and deseasonalised.

Figure 13: Top: purely continuous spectrum and a corresponding simulated sample path. Bottom: line spectrum + sample path.

Figure 14: Continuation of Figure 13 with plots of the corresponding sample autocorrelations.