
ST3454 Stochastic processes in Space and Time

Rozenn Dahyot

School of Computer Science & Statistics, Trinity College Dublin, Ireland

https://www.scss.tcd.ie/Rozenn.Dahyot/

Rozenn.Dahyot@scss.tcd.ie

Hilary term 2017

1

Space and time I

This course is concerned with the study of quantities (stochastic processes) that vary over the space-time domain.

The space-time of our universe is usually interpreted as a Euclidean space, with space consisting of three dimensions and time consisting of one dimension (the 4th dimension).

2

Space and time II

In this course:

space x will be limited to R2 (2D Euclidean space), as a local planar approximation of the Earth's surface;

time t will be limited to R+.

3

Stochastic Processes I

We denote by s the stochastic process of interest; it may vary over the time domain s(t), the space domain s(x), or space-time s(x, t). For instance, the stochastic process of interest may be the temperature in Ireland (space) over the next few days (time).

4

Stochastic Processes II

Sometimes, s is not measurable directly (at any location in spacetime) but

instead we can observe another related stochastic process o.

A simple example is o the temperature as read on a thermometer, while s

is the true temperature with the relation o = s+ ε such that ε is the noise

associated with the sensor (thermometer).

5

Course content overview I

Many techniques seen in this class are linear smoothing methods:

s(x, t) = Σ_{i=1}^n λ_i(x, t) s^(i) + ε(x, t)

having observed the stochastic process s at n (space-time) locations {(x^(i), t^(i), s^(i))}_{i=1,...,n}.

The goal is to estimate the quantity s at a new given location (x, t) and, if possible, to obtain a measure of its uncertainty.

6

Course content overview II

Estimation depends on the model chosen for the weights λ and on the hypotheses about ε.

s(x, t) = Σ_{i=1}^n λ_i(x, t) s^(i),   or for short   s = Σ_{i=1}^n λ_i s^(i) = ⟨λ | s⟩

Note that Σ_{i=1}^n λ_i s^(i) = λ^T s = ⟨λ | s⟩.

Examples: Regression, ARIMA, Filtering, Kriging, Nadaraya-Watson estimator.

7

Content & Exam

We will attempt to study all the following:

Regression as a Linear Smoothing method

ARIMA models

Multivariate State-Space Model & Kalman Filter,

Kriging,

Functional data analysis

For illustration, some R labs may be organised.

Exam (100%):

8

References

Gaussian Markov Random Fields: Theory and Applications, H. Rue & L. Held, 2005.

Geostatistics for Environmental Scientists, R. Webster & M. A. Oliver, Wiley 2001.

Statistics for Spatial Data, N. A. C. Cressie, Wiley 1993.

Functional Data Analysis, J. O. Ramsay and B. W. Silverman, Springer 2006.

Statistical Models of Shape: Optimisation and Evaluation, R. Davies, C. Twining & C. Taylor, Springer 2008.

Linear Models for Multivariate, Time Series, and Spatial Data, R. Christensen, Springer 1991.

9

Stochastic processes I

Definition (stochastic process)

A spatio-temporal stochastic process s(x, t) is defined such that:

x is a spatial location,

t is the time,

s is a random quantity of interest; s takes values in the state space S.

For a two-state model, S = {S1, S2}, a probability P(s_xt = S1) (or P(s_xt = S2), such that P(s_xt = S1) + P(s_xt = S2) = 1) is associated with s at the spatio-temporal position (x, t) (denoted s_xt).

10

Stochastic processes II

Like the state space, the variables x and t can take values in discrete spaces (e.g. N, Z, N2, etc.) or continuous spaces (e.g. R, R2, R3, or S2 (the unit sphere)).

When no direct observation is available for s (hidden random variable), another stochastic process o(x, t), linked to s(x, t) and for which observations are available, is defined.

11

Stochastic processes III

Many applications use stochastic processes to estimate s with its

associated uncertainty at a new (unobserved) location in spacetime.

For most examples in this course, we look at stochastic processes s that

are 1-dimensional (it is a random variable and not a random vector).

12

Stochastic processes IV

Several cost functions (e.g. probability density functions) will be

presented in the course to compute a prediction (estimate) and its

associated uncertainty.

The Normal distribution is often used as a hypothesis in these modellings.

13

Regression as a Linear Smoothing method I

Imagine the following model for the stochastic process s:

s(t) = c + α t + ε(t)

Let us assume:

E[ε(t)] = 0, ∀t.

E[(ε(t))²] = σ², ∀t.

14

Regression as a Linear Smoothing method II

Questions. Assuming that we have collected n observations {(t^(i), s^(i))}_{i=1,...,n}:

propose an estimate ŝ for s at a new location t;

show that this estimate can be written as

ŝ(t) = Σ_{i=1}^n λ_i(t) s^(i)

15

Regression as a Linear Smoothing method III

Answers. Defining s = [s^(1) · · · s^(n)]^T and the matrix

U = [ 1  t^(1)
      ⋮   ⋮
      1  t^(n) ]

the least squares estimate minimising the sum of squared errors

Σ_{i=1}^n ε_i² = Σ_{i=1}^n (s^(i) − c − α t^(i))²

is

[ĉ, α̂]^T = (U^T U)^(−1) U^T s

16

Regression as a Linear Smoothing method IV

Having computed the estimates ĉ and α̂, an estimate for s at a new temporal location t can be proposed:

ŝ = ĉ + α̂ t = [1  t] [ĉ, α̂]^T = [1  t] (U^T U)^(−1) U^T s = Σ_{i=1}^n λ_i s^(i)

where λ^T = [1  t] (U^T U)^(−1) U^T.
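The weights λ_i(t) can be computed explicitly. A minimal R sketch of this view of regression as a linear smoother, using simulated data (the data and the values of c, α, σ below are assumptions for illustration, not values from the course):

# regression as a linear smoother: prediction = weighted sum of observations
set.seed(1)
n     <- 50
t_obs <- sort(runif(n, 0, 10))                 # observation times t^(i)
s_obs <- 2 + 0.5 * t_obs + rnorm(n, 0, 1)      # s^(i) = c + alpha*t^(i) + noise
U     <- cbind(1, t_obs)                       # design matrix with rows (1, t^(i))
t_new <- 4.3                                   # new temporal location
# weights lambda^T = [1 t] (U^T U)^{-1} U^T
lambda <- as.vector(c(1, t_new) %*% solve(crossprod(U)) %*% t(U))
s_hat  <- sum(lambda * s_obs)                  # prediction as a weighted sum of the data
# sanity check: identical to the usual least squares prediction from lm()
fit <- lm(s_obs ~ t_obs)
all.equal(s_hat, unname(predict(fit, data.frame(t_obs = t_new))))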

17

Regression as a Linear Smoothing method V

We suggested the following model for the stochastic process s:

s(t) = c + α t + ε(t),   with ε(t) ∼ N(0, σ²), ∀t

What are the limitations of such a model?

What other explanatory variables could be used?

18

Regression as a Linear Smoothing method VI

Exercise: Consider the model

s(x, t) = c + α t + β x + γ x t + ε(x, t),   with ε(x, t) ∼ N(0, σ²), ∀(x, t)

n observations {(x^(i), t^(i), s^(i))}_{i=1,...,n} are available.

Propose an estimate ŝ at a new location (x, t).

Show that this solution can also be written ŝ(x, t) = Σ_{i=1}^n λ_i(x, t) s^(i).

19

Time series analysis I

Example time series: beer production. The monthly Australian beer production has been observed from Jan 1991 to Aug 1995.

      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1991  164 148 152 144 155 125 153 146 138 190 192 192
1992  147 133 163 150 129 131 145 137 138 168 176 188
1993  139 143 150 154 137 129 128 140 143 151 177 184
1994  151 134 164 126 131 125 127 143 143 160 190 182
1995  138 136 152 127 151 130 119 153

20

Time series analysis II

Figure: Monthly Australian beer production (beer vs. time), 1991-1995.

21

Time series analysis III

Questions:

What are the patterns in the beer data ?

What model can you propose to model this stochastic process ? (what

explanatory variable would you use ?)

22

ARIMA models I

Stochastic process s in the time domain.

AR(1):

{s(t + Δ) = c + φ1 s(t) + ε(t + Δ)}  ≡  {s_{t+1} = c + φ1 s_t + ε_{t+1}}

MA(1):

{s(t + Δ) = c + φ1 ε(t) + ε(t + Δ)}  ≡  {s_{t+1} = c + φ1 ε_t + ε_{t+1}}

with Δ a fixed interval of time.

23

ARIMA models II

Exercise: Assume a simple AR(1) model:

s_t = α s_{t−1} + ε_t,   with ε_t ∼ N(0, σ²) ∀t,  |α| < 1

and observations {s_t^(1)}_{t=1,...,n}. Draw a probabilistic graph to model this time series.

Gaussian Markov Random Fields: Theory and Applications, H. Rue & L. Held, 2005 (chap. 1).

24

ARIMA models III

Exercise: Assume a simple AR(1) model:

s_t = α s_{t−1} + ε_t,   with ε_t ∼ N(0, σ²) ∀t,  |α| < 1

Show that

p(s_t | s_{t−1}, ..., s_1) = p(s_t | s_{t−1}) = N(s_t; α s_{t−1}, σ²)

25

ARIMA models IV

Exercise: Given a temporal stochastic process s with the following Markov property

p(s_t | s_{t−1}) = N(s_t; α s_{t−1}, σ²) ∀t

and observations {s_t^(1)}_{t=1,...,n}:

Estimate s at the new time n + 1 with its confidence interval.

Give an estimate for s at the new time n + 2 with its confidence interval.

26

ARIMA models V

27

ARIMA models VI

We know

p(s_t | s_{t−1}) = N(s_t; α s_{t−1}, σ²) ∀t

so at n + 1:

p(s_{n+1} | s_n) = N(s_{n+1}; α s_n, σ²)

and we use the observation s_n^(1) that is available for s_n:

p(s_{n+1} | s_n^(1)) = N(s_{n+1}; α s_n^(1), σ²)

A good estimate is the expectation ŝ_{n+1} = α s_n^(1), and the 95% prediction interval is ŝ_{n+1} ± 2σ.

28

ARIMA models VII

At n + 2:

p(s_{n+2} | s_{n+1}) = N(s_{n+2}; α s_{n+1}, σ²)

But we don't have an observation for s_{n+1}! So instead we use:

p(s_{n+2} | s_n) = ∫ p(s_{n+2}, s_{n+1} | s_n) ds_{n+1}
               = ∫ p(s_{n+2} | s_{n+1}, s_n) p(s_{n+1} | s_n) ds_{n+1}
               = ∫ p(s_{n+2} | s_{n+1}) p(s_{n+1} | s_n) ds_{n+1}
               = ∫ N(s_{n+2}; α s_{n+1}, σ²) N(s_{n+1}; α s_n, σ²) ds_{n+1}
               = N(s_{n+2}; α² s_n, σ² (1 + α²))

A good estimate is the expectation ŝ_{n+2} = α² s_n^(1), using the observation s_n^(1), and the 95% prediction interval is ŝ_{n+2} ± 2σ√(1 + α²).
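A minimal R sketch of these one- and two-step-ahead AR(1) predictions on a simulated series (α, σ and the series itself are assumptions for illustration):

set.seed(2)
alpha <- 0.7; sigma <- 1; n <- 200
s <- numeric(n)
for (t in 2:n) s[t] <- alpha * s[t - 1] + rnorm(1, 0, sigma)   # simulate AR(1)
s_n <- s[n]                                                    # last observation s_n^(1)
# one step ahead: mean alpha * s_n, standard deviation sigma
pred1 <- alpha * s_n;   pi1 <- pred1 + c(-2, 2) * sigma
# two steps ahead: mean alpha^2 * s_n, standard deviation sigma * sqrt(1 + alpha^2)
pred2 <- alpha^2 * s_n; pi2 <- pred2 + c(-2, 2) * sigma * sqrt(1 + alpha^2)
pred1; pi1; pred2; pi2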

29

ARIMA models VIII

30

Filters I

31

Filters II

Given a hidden temporal stochastic process (s_1, ..., s_J) and an observed one (o_1, ..., o_J) linked by

p_{s1}(s_1)                       initial state
p_{sj|sj−1}(s_j | s_{j−1})        transition (order 1)
p_{oj|sj}(o_j | s_j)              link between observed variable and hidden state

estimate the posterior p_{sj|o1:j}. Hint: express p_{sj|o1:j}(s_j | o_{1:j}) w.r.t. p_{sj−1|o1:j−1}(s_{j−1} | o_{1:j−1}) (notation o_{1:j−1} = (o_1, ..., o_{j−1})).

32

Filters III

p(s_j | o_{1:j})
 = ∫ p(s_j, s_{j−1} | o_{1:j}) ds_{j−1}                                        (total probability theorem)
 = (1 / p(o_{1:j})) ∫ p(s_j, s_{j−1}, o_{1:j}) ds_{j−1}                        (Bayes)
 = (1 / p(o_{1:j})) ∫ p(s_j, s_{j−1}, o_j, o_{1:j−1}) ds_{j−1}
 = (p(o_{1:j−1}) / p(o_{1:j})) ∫ p(o_j | s_j, s_{j−1}, o_{1:j−1}) p(s_j | s_{j−1}, o_{1:j−1}) p(s_{j−1} | o_{1:j−1}) ds_{j−1}   (Bayes)
 = (p(o_{1:j−1}) / p(o_{1:j})) ∫ p(o_j | s_j) p(s_j | s_{j−1}) p(s_{j−1} | o_{1:j−1}) ds_{j−1}

33

Filters IV

Hence

p(s_j | o_{1:j}) = (p(o_{1:j−1}) / p(o_{1:j})) p(o_j | s_j) ∫ p(s_j | s_{j−1}) p(s_{j−1} | o_{1:j−1}) ds_{j−1}

given a hidden temporal stochastic process (s_1, ..., s_J) and an observed one (o_1, ..., o_J) linked by

p_{s1}(s_1)                       initial state
p_{sj|sj−1}(s_j | s_{j−1})        transition (order 1)
p_{oj|sj}(o_j | s_j)              link between observed variable and hidden state

34

Filters V

In the general case, we have the following integral to solve:

p(s_j | o_{1:j}) = (p(o_{1:j−1}) / p(o_{1:j})) p(o_j | s_j) ∫ p(s_j | s_{j−1}) p(s_{j−1} | o_{1:j−1}) ds_{j−1}

1 Prediction:

p(s_j | o_{1:j−1}) = ∫ p(s_j | s_{j−1}) p(s_{j−1} | o_{1:j−1}) ds_{j−1}

2 Update:

p(s_j | o_{1:j}) ∝ p(o_j | s_j) p(s_j | o_{1:j−1})

35

Filters VI

Having collected observations {o_t^(1)}_{t=1,...,j−1} (also denoted o_{1:j−1}^(1)), the probability density function used to predict s_j is

p(s_j | o_{1:j−1}^(1))

Once the new observation o_j^(1) has been collected at time j, the probability density function taking this update into account is

p(s_j | o_{1:j}^(1))

36

Filters VII

37

Kalman Filter I

We have illustrated the idea behind Filtering with a transition model of

order 1 (This can be extended to any order). In practice, we need to be

able to calculate or compute p(sj |o1:j). Kalman filtering is one elegant

way of doing that.

In 1960, R.E. Kalman published his famous paper describing a recursive

solution to the discrete-data linear filtering problem.

The Kalman filter uses hypotheses of linearity (in the state transition

equation, and the observation equation) and Normal distributions.

38

Kalman Filter II

Exercise: Given

p(s_1) = N(s_1; μ_1, σ_1²)                 (initial state pdf)
p(s_j | s_{j−1}) = N(s_j; α s_{j−1}, σ²)   (state transition pdf)
p(o_j | s_j) = N(o_j; s_j, σ_o²)           (observation pdf)

with the notation N(x; μ, σ²) corresponding to the normal pdf for the r.v. x with mean μ and variance σ², show by induction that p(s_j | o_{1:j}) is a normal distribution N(s_j; μ_j, σ_j²) and give the recursive formulae to compute the parameters (μ_j, σ_j²), assuming p(s_{j−1} | o_{1:j−1}) = N(s_{j−1}; μ_{j−1}, σ_{j−1}²).

39

Kalman Filter III

Answer.

1 At j = 1, we know the initial state s_1 is normally distributed (given in the hypothesis).

2 Assuming at j − 1 that p(s_{j−1} | o_{1:j−1}) = N(s_{j−1}; μ_{j−1}, σ²_{j−1}), then solving the prediction and update steps of filtering, we find that:

μ_j = α (1 − K_j) μ_{j−1} + K_j o_j

σ_j² = σ_o² K_j

K_j = (σ² + α² σ²_{j−1}) / (σ² + σ_o² + α² σ²_{j−1})

40

Kalman Filter IV

1 The prediction step corresponds to (using the result on the product of two normal densities ⟨N_A | N_B⟩):

p(s_j | o_{1:j−1}) = ∫ N(s_j; α s_{j−1}, σ²) N(s_{j−1}; μ_{j−1}, σ²_{j−1}) ds_{j−1}
                   = (1/|α|) ∫ N(s_{j−1}; s_j/α, σ²/α²) N(s_{j−1}; μ_{j−1}, σ²_{j−1}) ds_{j−1}
                   = (1/|α|) N(0; s_j/α − μ_{j−1}, σ²/α² + σ²_{j−1})
                   = N(s_j; α μ_{j−1}, σ² + α² σ²_{j−1})

2 The update step (rewriting is required to identify (μ_j, σ_j²)):

p(s_j | o_{1:j}) ∝ N(o_j; s_j, σ_o²) N(s_j; α μ_{j−1}, σ² + α² σ²_{j−1})
                = N(s_j; μ_j, σ_j²)

41

Kalman Filter V

These two parameters (μ_j, σ_j²) are enough to fully characterise the Normal distribution p(s_j | o_{1:j}).

The available observations {o_j^(1)} are plugged into the expressions for (μ_j, σ_j²) so that they can be computed iteratively.
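A minimal R sketch of this scalar Kalman recursion, run on simulated data (α, σ, σ_o, and the initial values μ_1, σ_1² are assumptions for illustration):

set.seed(3)
alpha <- 0.9; sigma <- 0.5; sigma_o <- 1; J <- 100
s <- numeric(J); o <- numeric(J)
s[1] <- rnorm(1); o[1] <- s[1] + rnorm(1, 0, sigma_o)
for (j in 2:J) {
  s[j] <- alpha * s[j - 1] + rnorm(1, 0, sigma)    # hidden state
  o[j] <- s[j] + rnorm(1, 0, sigma_o)              # noisy observation
}
mu <- numeric(J); sig2 <- numeric(J)
mu[1] <- 0; sig2[1] <- 1                           # assumed initial state pdf N(mu_1, sigma_1^2)
for (j in 2:J) {
  K       <- (sigma^2 + alpha^2 * sig2[j - 1]) /
             (sigma^2 + sigma_o^2 + alpha^2 * sig2[j - 1])   # Kalman gain K_j
  mu[j]   <- alpha * (1 - K) * mu[j - 1] + K * o[j]          # posterior mean mu_j
  sig2[j] <- sigma_o^2 * K                                    # posterior variance sigma_j^2
}
head(cbind(truth = s, filtered = mu))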

42

State-Space Model I

In the multivariate case:

State equation, with s_j a random vector:

s_j = F s_{j−1} + G ε_j

F is a square matrix, G is a matrix, and ε_j ∼ N(0, Σ) is a random vector.

Observation equation:

o_j = H s_j + e_j

with H a matrix and e_j ∼ N(0, Σ_o).

43

State-Space Model II

Given p(s_{j−1} | o_{1:j−1}) = N(s_{j−1}; μ_{j−1}, Σ_{j−1}):

Prediction:

p(s_j | o_{1:j−1}) = N(s_j; F μ_{j−1}, R),   with R = F Σ_{j−1} F^T + G Σ G^T

Update: p(s_j | o_{1:j}) = N(s_j; μ_j, Σ_j), with

Σ_j = (H^T Σ_o^(−1) H + R^(−1))^(−1)

and

μ_j = Σ_j (H^T Σ_o^(−1) o_j + R^(−1) F μ_{j−1})

44

State-Space Model III

Example

Consider the following model

1 Observation equation:

oj = sj + ej

2 State equation

AR(3) : sj = φ1 sj−1 + φ2 sj−2 + φ3 sj−3 + εj

Define ~sj , ~oj , H, F and G .

45

State-Space Model IV

Answer: Using matrix notation, the state equation is:

[ s_j     ]   [ φ1  φ2  φ3 ] [ s_{j−1} ]   [ ε_j ]
[ s_{j−1} ] = [ 1   0   0  ] [ s_{j−2} ] + [ 0   ]
[ s_{j−2} ]   [ 0   1   0  ] [ s_{j−3} ]   [ 0   ]
    s_j              F           s_{j−1}

and the observation equation is:

o_j = (1, 0, 0) [ s_j, s_{j−1}, s_{j−2} ]^T + e_j,   with H = (1, 0, 0).
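A minimal R sketch of this AR(3) state-space form, simulating the state and observation equations (the φ values and noise scales are assumptions for illustration; Fmat, Gmat, Hmat stand for F, G, H):

phi  <- c(0.5, 0.3, -0.2)
Fmat <- rbind(phi, c(1, 0, 0), c(0, 1, 0))   # state transition matrix F
Gmat <- matrix(c(1, 0, 0), ncol = 1)         # noise loading G
Hmat <- matrix(c(1, 0, 0), nrow = 1)         # observation matrix H
set.seed(4)
J <- 200
state <- matrix(0, nrow = 3, ncol = J)       # column j holds the state vector s_j
o     <- numeric(J)
for (j in 2:J) {
  state[, j] <- Fmat %*% state[, j - 1] + Gmat %*% rnorm(1)   # s_j = F s_{j-1} + G eps_j
  o[j]       <- Hmat %*% state[, j] + rnorm(1, 0, 0.3)        # o_j = H s_j + e_j
}
head(o)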

46

State-Space Model V

Exercises: Find F and G when sj follows

1 an MA(2)

2 an ARMA(2,2)

47

State-Space Model VI

F, G and H are assumed to be known and constant over time.

More general state space models allow these matrices to depend on time.

48

Introduction to geostatistics I

Founders: D. G. Krige and G. Matheron formulated the theory of

geostatistics and Kriging in the 1950’s.

Application to mining industry

We want to predict at unobserved locations.

Geostatistics is the study of spatial data where the spatial correlation is

modeled through the variogram. Focus of geostatistical analyses is on

predicting responses at unsampled sites.

49

Introduction to geostatistics II

Applications:

hydrological data,

mining applications,

air quality studies

soil science data

biological applications

economic housing data

etc.

Geostatistics for Environmental Scientists, R. Webster & M. A. Oliver, Wiley 2001.

50

Introduction to geostatistics III

Let's consider:

J physical locations {x_j}_{j=1,...,J},

some information of interest (e.g. radioactivity levels) modelled as a stochastic process at these locations, {s_j = s(x_j)}_{j=1,...,J},

one observation (or measurement) available for each site, {s_j^(1)}_{j=1,...,J}.

At a new location x_0, we want to predict s_0.

51

Introduction to geostatistics IV

Content:

1 Ad hoc methods: spatial interpolation

2 Statistical modelling with Kriging

1 simple, ordinary and universal Kriging

2 modelling with covariance and/or variogram

3 Stationarity assumptions

52

Spatial interpolation I

Prediction of s at a new site x_0 can be expressed as a weighted average of the data:

ŝ(x_0) = Σ_{j=1}^J λ_j s(x_j)

with the constraints:

(λ_j ≥ 0, ∀j)   and   Σ_{j=1}^J λ_j = 1

53

Spatial interpolation II

1 Thiessen polygons (Voronoi polygons, Dirichlet tessellations)

2 Triangulation

3 Natural neighbour interpolation

4 Inverse function of distance

5 Trend surface

54

Spatial interpolation III

1 Thiessen polygons (Voronoi polygons, Dirichlet tessellations)

ŝ(x_0) = s(x_j)   with   x_j = arg min_{i=1,...,J} ‖x_i − x_0‖

so we have binary weights:

λ_j = 1, and λ_i = 0 ∀i ≠ j

55

Spatial interpolation IV

Figure: Voronoi polygons

56

Spatial interpolation V

2 Triangulation. Sampling points are linked to their neighbours by straight lines to create triangles that do not contain any of the points. Given a new position x_0 = (u_0, v_0) inside one of the triangles, say the one defined by (x_1, x_2, x_3), then

λ_1 = |x_0 − x_3 ; x_2 − x_3| / |x_1 − x_3 ; x_2 − x_3|

with the notation

|x_0 − x_3 ; x_2 − x_3| = det [ u_0 − u_3   u_2 − u_3
                                v_0 − v_3   v_2 − v_3 ]

λ_2 and λ_3 are defined in a similar fashion, and all the other λs are 0.

Unlike the Thiessen method, the resulting surface is continuous, but it still has abrupt changes in gradient at the edges of the triangles.

57

Spatial interpolation VI

Figure: Triangulation. The weight λ1 corresponds to the blue area divided by the area of the triangle (x1, x2, x3); similarly, λ2 corresponds to the green area and λ3 to the pink area, each divided by the area of the triangle (x1, x2, x3).

58

Spatial interpolation VII

3 Natural neighbour interpolation

59

Spatial interpolation VIII

60

Spatial interpolation IX

61

Spatial interpolation X

62

Spatial interpolation XI

λ_j = a_j / Σ_{j=1}^J a_j,   with a_j = 0 if x_j is not a natural neighbour of x_0.

63

Spatial interpolation XII

4 Inverse function of distance

λ_j ∝ 1 / ‖x_j − x_0‖^β,   β > 0

- The weights {λ_j}_{j=1,...,J} are scaled so that they sum to 1.
- Usually, β = 2 and ‖·‖ is the Euclidean distance.
- If x_0 = x_j, then ŝ(x_0) = s(x_j).
- There are no discontinuities in the map ŝ.
- There is no measure of the error.
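A minimal R sketch of inverse-distance-weighted prediction at a new site x_0 (the coordinates, values and query point below are made-up illustration data):

idw_predict <- function(X, s, x0, beta = 2) {
  # X: J x 2 matrix of site coordinates, s: J observed values, x0: new site
  d <- sqrt(rowSums((X - matrix(x0, nrow(X), 2, byrow = TRUE))^2))
  if (any(d == 0)) return(s[which(d == 0)[1]])   # exact hit: return the observed value
  w <- 1 / d^beta
  lambda <- w / sum(w)                           # weights scaled to sum to 1
  sum(lambda * s)
}
X <- cbind(c(2, 4, 4), c(3, 4, 3))   # three sites
s <- c(2, 3, 2.4)                    # observed values at the sites
idw_predict(X, s, x0 = c(3, 3.5))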

64

Spatial interpolation XIII

5 Trend surface. This method proposes to do regression:

s(x) = μ(x) + ε

with the error term ε ∼ N(0, σ_ε²). The function μ is a parametric function such as a plane or a quadratic, e.g.

μ(x = (u, v)) = b_0 + b_1 u + b_2 v

The coefficients b = (b_0, b_1, b_2)^T can then be estimated by least squares using the J observations.

Once b is estimated, the prediction at the new location x_0 is computed by:

ŝ(x_0) = b̂_0 + b̂_1 u_0 + b̂_2 v_0

65

Spatial interpolation XIV

Limits of Interpolation for prediction:

Some interpolators give a crude prediction and the spatial variation is

displayed poorly.

The interpolators fail to provide any estimates of the error on the

prediction.

With the exception of trend surface, these methods were deterministic.

However the processes are stochastic by nature.

In practice the modelling with trend surface is too simplistic to perform

well and the uncertainty is the same everywhere.

66

Introduction: Kriging I

The aim of Kriging is to estimate the value of a random variable s at one or more unsampled points or locations, from more or less sparse sample data on a given support, say {s(x_1), ..., s(x_J)} at {x_1, ..., x_J}.

Different kinds of Kriging methods exist, which pertain to the assumptions about the mean structure of the model:

E[s(x)] = μ(x)   or equivalently   E[s̃(x)] = 0,  writing s̃(x) = s(x) − μ(x)

67

Introduction: Kriging II

Different Kriging methods:

- Ordinary Kriging: E[s(x)] = μ (μ is unknown)

- Simple Kriging: E[s(x)] = μ (μ is known)

- Universal Kriging: the mean is unknown and depends on a linear model

μ(x) = Σ_{p=0}^P β_p φ_p(x)

and the coefficients {β_p} need to be estimated.

68

Ordinary kriging I

Ordinary kriging is the most common type of kriging.

The underlying model assumption in ordinary kriging is:

E[s(x)] = µ

with µ unknown.

The stochastic process s has been observed at J sites ( the r.v.

s(xj) = sj has one observation s(1)j associated with it).

69

Ordinary kriging II

The model for s(x_0) is:

s(x_0) − μ = Σ_{j=1}^J λ_j (s(x_j) − μ) + ε(x_0)

or

s(x_0) = Σ_{j=1}^J λ_j s(x_j) + μ (1 − Σ_{j=1}^J λ_j) + ε(x_0)

We filter out the unknown mean by requiring that the kriging weights sum to 1, leading to the ordinary kriging estimator:

s(x_0) = Σ_{j=1}^J λ_j s(x_j) + ε(x_0),   subject to Σ_{j=1}^J λ_j = 1

70

Ordinary kriging III

ε(x_0) is the noise at position x_0, such that:

E[ε(x_0)] = 0

We want to estimate s(x_0); in other words, we need to obtain the appropriate weights {λ_j}_{j=1,...,J}.

Estimation by mean squared error subject to a constraint:

(λ_1, ..., λ_J) = arg min_{λ_1,...,λ_J} E[ε(x_0)²]   subject to Σ_{j=1}^J λ_j = 1

71

Ordinary kriging IV

This is solved using Lagrange multipliers. We define the energy J, which depends both on {λ_j}_{j=1,...,J} and on ψ:

({λ_j}_{j=1,...,J}, ψ) = arg min_{ψ, λ_1,...,λ_J} J(λ_1, ..., λ_J, ψ) = E[ε(x_0)²] + 2 ψ (Σ_{j=1}^J λ_j − 1)

72

Ordinary kriging V

1 First we express the expectation of the squared error:

E[ε(x_0)²] = E[(s(x_0) − Σ_{j=1}^J λ_j s(x_j))²]
           = E[(s(x_0) − μ + μ − Σ_{j=1}^J λ_j s(x_j))²]
           = E[(s(x_0) − μ)²] − 2 Σ_{j=1}^J λ_j E[(s(x_0) − μ)(s(x_j) − μ)]
             + Σ_{i=1}^J Σ_{j=1}^J λ_i λ_j E[(s(x_j) − μ)(s(x_i) − μ)]

Remember that the covariance is defined as

Cov(s(x_j); s(x_i)) = c_ij = E[(s(x_j) − μ)(s(x_i) − μ)]

So the energy to minimise is:

J(λ_1, ..., λ_J, ψ) = c_00 − 2 Σ_{j=1}^J λ_j c_0j + Σ_{i=1}^J Σ_{j=1}^J λ_i λ_j c_ij + 2 ψ (Σ_{j=1}^J λ_j − 1)

73

Ordinary kriging VI

2 Second, we differentiate J w.r.t. λ_k, k = 1, ..., J, and ψ; the minimum of J is found when all the derivatives are equal to zero:

∂J/∂ψ = 0,   ∂J/∂λ_k = 0, ∀k = 1, ..., J

The derivative w.r.t. ψ is:

∂J/∂ψ = Σ_{j=1}^J λ_j − 1 = 0

The derivative w.r.t. λ_k is:

∂J/∂λ_k = 2 ψ − 2 c_0k + 2 Σ_{j=1}^J λ_j c_jk = 0

74

Ordinary kriging VII

3 The solution is the linear system A λ = b:

[ c_11 · · · c_1J  1 ] [ λ_1 ]   [ c_10 ]
[  ⋮           ⋮   ⋮ ] [  ⋮  ] = [  ⋮   ]
[ c_J1 · · · c_JJ  1 ] [ λ_J ]   [ c_J0 ]
[  1   · · ·  1    0 ] [ ψ   ]   [  1   ]
         A                λ          b

or

λ = A^(−1) b

75

Ordinary kriging VIII

Once you have the estimate λ, you can predict (using the observations):

ŝ(x_0) = Σ_{j=1}^J λ_j s_j^(1)
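A minimal R sketch of the ordinary kriging system A λ = b, using an assumed exponential covariance function c(h) = sill · exp(−h / range) (the covariance model, its parameters and the data are assumptions for illustration):

ok_predict <- function(X, s, x0, sill = 1, range = 1) {
  J  <- nrow(X)
  covfun <- function(h) sill * exp(-h / range)
  D  <- as.matrix(dist(rbind(X, x0)))           # distances between all sites and x0
  C  <- covfun(D[1:J, 1:J])                     # c_ij between observed sites
  c0 <- covfun(D[1:J, J + 1])                   # c_j0 between the sites and x0
  A  <- rbind(cbind(C, 1), c(rep(1, J), 0))     # kriging matrix with the constraint row
  b  <- c(c0, 1)
  sol    <- solve(A, b)
  lambda <- sol[1:J]                            # kriging weights
  psi    <- sol[J + 1]                          # Lagrange multiplier
  list(pred     = sum(lambda * s),
       krig_var = sill - sum(lambda * c0) - psi)  # c_00 - sum(lambda*c_0j) - psi
}
X <- cbind(c(2, 4, 4), c(3, 4, 3)); s <- c(2, 3, 2.4)
ok_predict(X, s, x0 = c(3, 3.5))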

76

Simple Kriging I

Assumption for Simple Kriging: the mean E[s(x)] = μ is known.

We estimate s(x_0) using the relation (same as Ordinary Kriging):

s(x_0) − μ = Σ_{j=1}^J λ_j (s(x_j) − μ) + ε(x_0)

or

s(x_0) = Σ_{j=1}^J λ_j s(x_j) + μ (1 − Σ_{j=1}^J λ_j) + ε(x_0)

where μ is known. The λ_j no longer need to be constrained to sum to 1, and the second term ensures that E[s(x)] = μ, ∀x.

77

Simple Kriging II

The hypothesis for the error is E[ε(x_0)] = 0, and we estimate {λ_j}_{j=1,...,J} such that the mean squared error E[ε²(x_0)] is minimised.

The solution is then:

[ λ_1 ]   [ c_11 · · · c_1J ]^(−1) [ c_10 ]
[  ⋮  ] = [  ⋮           ⋮  ]      [  ⋮   ]
[ λ_J ]   [ c_J1 · · · c_JJ ]      [ c_J0 ]

Once you have the estimate λ, you can predict (using the observations):

ŝ(x_0) = Σ_{j=1}^J λ_j s_j^(1) + μ (1 − Σ_{j=1}^J λ_j)

78

Universal Kriging I

For a new location x_0, we have the following model:

s(x_0) − μ(x_0) = Σ_{j=1}^J λ_j (s(x_j) − μ(x_j)) + ε(x_0)

or

s(x_0) = Σ_{j=1}^J λ_j s(x_j) + μ(x_0) − Σ_{j=1}^J λ_j μ(x_j) + ε(x_0)

In Universal Kriging, the mean of s depends on the position x:

μ(x) = Σ_{p=0}^P β_p φ_p(x)

79

Universal Kriging II

Example of a choice of the functions {φ_p}_{p=0,...,P} of x = (u, v) ∈ R²:

Linear trend (P = 2):

φ_0(x) = 1, φ_1(x) = u, φ_2(x) = v

Quadratic trend (P = 5), adding:

φ_3(x) = u², φ_4(x) = u v, φ_5(x) = v²

80

Universal Kriging III

In a similar fashion to ordinary kriging (we don't know the β_p, so μ is unknown), we rewrite

s(x_0) = Σ_{j=1}^J λ_j s(x_j) + μ(x_0) − Σ_{j=1}^J λ_j μ(x_j) + ε(x_0)

as

s(x_0) = Σ_{j=1}^J λ_j s(x_j) + ε(x_0),   subject to the constraint   μ(x_0) − Σ_{j=1}^J λ_j μ(x_j) = 0

81

Universal Kriging IV

Having μ(x) = Σ_{p=0}^P β_p φ_p(x), the constraint is equivalent to:

Σ_{p=0}^P β_p φ_p(x_0) = Σ_{p=0}^P β_p Σ_{j=1}^J λ_j φ_p(x_j)

This must hold for any combination of the β_p. Hence we have in fact P + 1 constraints:

φ_p(x_0) = Σ_{j=1}^J λ_j φ_p(x_j),   ∀p = 0, ..., P

82

Universal Kriging V

Note that at p = 0, using φ_0(x) = 1, we recover the constraint Σ_{j=1}^J λ_j = 1.

This minimisation is solved by introducing P + 1 Lagrange multipliers:

(λ_1, ..., λ_J, m_0, ..., m_P) = arg min { E[ε²(x_0)] + Σ_{p=0}^P m_p (φ_p(x_0) − Σ_{j=1}^J λ_j φ_p(x_j)) }

83

Universal Kriging VI

The solution of Universal Kriging is the linear system:

[ C    F^T ] [ λ_1 ... λ_J  m_0  m_1 ... m_P ]^T = [ c_01 ... c_0J  φ_0(x_0)  φ_1(x_0) ... φ_P(x_0) ]^T
[ F    0   ]

i.e. the unknowns are the weights (λ_1, ..., λ_J) and the Lagrange multipliers (m_0, ..., m_P), and the right-hand side stacks the covariances to x_0 and the basis functions evaluated at x_0.

84

Universal Kriging VII

with

F = [ φ_0(x_1)  φ_0(x_2)  · · ·  φ_0(x_J)
      φ_1(x_1)  φ_1(x_2)  · · ·  φ_1(x_J)
      φ_2(x_1)  φ_2(x_2)  · · ·  φ_2(x_J)
        ⋮
      φ_P(x_1)  φ_P(x_2)  · · ·  φ_P(x_J) ]

and

C = [ c_11  · · ·  c_1J
       ⋮            ⋮
      c_J1  · · ·  c_JJ ]

The prediction at x_0 is computed by:

ŝ_0 = Σ_{j=1}^J λ_j s_j^(1)

85

Remarks about Kriging I

The covariance function C[s(x), s(x′)] is a function assumed to be known

in all the solutions proposed here for Kriging.

In practice such a function is not known and need to be estimated.

Some hypotheses will be used about the process to ease this estimation.

We define the concept of variogram as an alternative to covariance and

we see next what are the assumptions about the process that will be

used for Kriging.

86

Remarks about Kriging II

Definition (variogram)

The variogram is defined as:

γ(s(x_i), s(x_j)) = γ_ij = (1/2) E[(s(x_i) − s(x_j))²]

When E[s(x_i) − s(x_j)] = 0, the variogram is linked to the covariance as follows:

γ_ij = (1/2)(c_ii + c_jj − 2 c_ij)

87

Stationarity I

Definition (Strict Stationarity)

The distribution of the random process has certain attributes that are the

same everywhere. Strict stationarity indicates that for any number k of any

sites x1, x2, · · · , xk, the joint cumulative distribution of (s(x1), · · · , s(xk))

remains the same under an arbitrary translation h:

P (s(x1), · · · , s(xk)) = P (s(x1 + h), · · · , s(xk + h))

If the random process is strictly stationary, its moments if they exist are also

invariant under translations.

88

Stationarity II

Definition (weak stationarity)

A process s(x) is said to be second-order stationary (or weakly stationary, or wide-sense stationary) when

1 the mean of the process does not depend on x: E[s(x)] = μ

2 the variance of the process does not depend on x: E[(s(x) − μ)²] = σ²

3 the covariance between s(x) and s(x + h) only depends on h:

Cov[s(x), s(x + h)] = E[(s(x) − E[s(x)])(s(x + h) − E[s(x + h)])] = E[(s(x) − μ)(s(x + h) − μ)] = C(h)

Note that C(0) = σ².

89

Stationarity III

For a weakly stationary process, the autocorrelation can also be defined as:

ρ(h) = C(h) / C(0)

where C(0) is the covariance at lag 0, i.e. σ².

If a random field has a function C(h) that depends only on the distance ‖h‖ and not on its orientation, it is said to be isotropic.

90

Stationarity IV

Definition (Intrinsic Stationarity)

s(x) is said to be an intrinsic random function such that:

E[s(x+ h)− s(x)] = 0

and

Var[s(x+ h)− s(x)] = 2 γ(h)

The function γ(h) is called the variogram.

91

Stationarity V

Exercises: For second-order stationary processes,

1 Express γ(h) w.r.t. C(h).

2 Express γ(h) w.r.t. ρ(h).

3 Show that a process that is weakly stationary is also intrinsically

stationary.

92

Brownian Motion & Ornstein-Uhlenbeck Processes

A second-order stationary process is also an intrinsically stationary process, but an intrinsically stationary process is not always a second-order stationary process or a strictly stationary process.

To illustrate weak stationarity and intrinsic stationarity, we look at the following processes:

- Brownian Motion,
- the Ornstein-Uhlenbeck Process.

93

Brownian Motion I

Definition (diffusion & Brownian Motion)

A diffusion is a continuous time Stochastic Process s(t) with the following

properties:

1 s(0) = 0,

2 s(t) has independent increments,

3 P (s(t2) | s(t1)) has a density function f(s(t2)|t1, t2, s(t1)).

Standard Brownian motion is a diffusion s(t), t ≥ 0 satisfying the following:

1 s(0) = 0.

2 s(t) has independent increments.

3 For t2 > t1, s(t2)− s(t1) ∼ N (0, σ2(t2 − t1)).

94

Brownian Motion II

s(t) has independent increments means that for all times

0 ≤ t1 ≤ t2 · · · ≤ tn the increments s(tn)− s(tn−1), s(tn−1)− s(tn−2), · · · ,s(t2)− s(t1) are independent random variables.

For t2 > t1, s(t2)− s(t1) ∼ N (0, σ2(t2 − t1)) is equivalent to

s(t2) | s(t1) ∼ N (s(t1), σ2(t2 − t1))

s(t2) given s(t1) is normally distributed with mean s(t1) and variance

σ2(t2 − t1).

95

Brownian Motion III

Exercises:

1 Show that a standard brownian motion is intrinsically stationary

2 Show that a standard brownian motion is not weakly stationary

96

Ornstein-Uhlenbeck Process I

Definition (Ornstein-Uhlenbeck)

Let s(t) be standard Brownian motion. The process

V (t) = e−t s(e2t)

is called the Ornstein-Uhlenbeck process.

97

Ornstein-Uhlenbeck Process II

Show that V (t) is intrinsically stationary and weakly stationary.

98

Brownian Motion and Ornstein-Uhlenbeck processes

Historical remarks:

Robert Brown observes ceaseless irregular motion of small particles in a

fluid. Motion explained by believing particles to be alive (1827-1829).

Goul puts forward a kinetic theory to explain the motion; it is due to rapid

bombardment of a huge number of fluid molecules (1860).

Einstein presents the theory of “Brownian motion” (c. 1900). At the same

time (c. 1900), Bachelier defines it to model stock options.

The Ornstein-Uhlenbeck process was proposed by Uhlenbeck and

Ornstein (1930) in a physical modelling context, as an alternative to

Brownian Motion. The model has been used since in a wide variety of

applications areas e.g. in finance (see Vasicek (1977) interest rate

model).

99

Kriging formulation with the variogram I

In Kriging, we have

s̃(x_0) = s(x_0) − μ(x_0) = Σ_{j=1}^J λ_j (s(x_j) − μ(x_j)) + ε(x_0) = Σ_{j=1}^J λ_j s̃(x_j) + ε(x_0)

or

ε(x_0) = s̃(x_0) − Σ_{j=1}^J λ_j s̃(x_j)

with the assumption E[ε(x_0)] = 0.

100

Kriging formulation with the variogram II

The estimation is performed by:

(λ_1, ..., λ_J) = arg min E[ε(x_0)²]

solved

- without constraint: Simple Kriging
- with constraints: Universal/Ordinary Kriging

and the estimate is computed using the observations:

ŝ_0 = Σ_{j=1}^J λ_j s_j^(1)   (Ordinary/Universal Kriging)

or

ŝ_0 = Σ_{j=1}^J λ_j s_j^(1) + μ (1 − Σ_{j=1}^J λ_j)   (Simple Kriging)

101

Kriging formulation with the variogram III

Exercise: With the constraint Σ_{j=1}^J λ_j = 1 (true in Universal/Ordinary Kriging):

1 Show that:

(s̃_0 − Σ_{j=1}^J λ_j s̃_j)² = −(1/2) Σ_{j=1}^J Σ_{i=1}^J λ_i λ_j (s̃_i − s̃_j)² + Σ_{j=1}^J λ_j (s̃_0 − s̃_j)²

2 Show that

E[ε(x_0)²] = − Σ_{j=1}^J Σ_{i=1}^J λ_i λ_j γ_ij + 2 Σ_{j=1}^J λ_j γ_0j

3 Deduce the assumptions of stationarity (weak or intrinsic) that will be used for Simple, Ordinary and Universal Kriging.

102

Kriging formulation with the variogram IV

Definition (Kriging variance)

The minimised mean-squared prediction error E[ε(x_0)²] is sometimes called the Kriging variance:

σ_0² = 2 Σ_{j=1}^J λ_j γ_0j − Σ_{j=1}^J Σ_{i=1}^J λ_i λ_j γ_ij   (Ordinary/Universal Kriging)

or

σ_0² = c_00 − 2 Σ_{j=1}^J λ_j c_0j + Σ_{i=1}^J Σ_{j=1}^J λ_i λ_j c_ij   (Simple Kriging)

so a prediction interval can be constructed: ŝ_0 ± 2 σ_0.

103

Kriging formulation with the variogram V

Kriging     Hypotheses on µ(x)     Constraints     Prediction at x0     Variance     Stationarity?

Simple

Ordinary

Universal

104

Modelling variogram I

We have presented the Kriging approaches to perform prediction. These approaches require knowledge of the covariance function or variogram of the stochastic process s̃(x) = s(x) − μ(x).

Definition (variogram cloud)

For any set of data we can compute the semivariance for each pair of points x_i and x_j individually as

γ(x_i, x_j) = γ_ij = (1/2)(s(x_i) − s(x_j))²

These values are plotted against the difference h = x_i − x_j between the sites. When a variogram is isotropic, it depends only on the distance ‖h‖. A scatterplot of this form is called the variogram cloud.

105

Modelling variogram II

Exercises:

1 How many points does the variogram cloud contain if the separation

distance h is not limited?

2 Consider x1 = (2, 3), x2 = (4, 4) and x3 = (4, 3) where we have measured

s1 = 2, s2 = 3 and s3 = 2.4. Assuming the variogram is isotropic, plot the

variogram cloud.
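A minimal R sketch computing the isotropic variogram cloud for the 3-point exercise above (all pairs of sites, semivariance against separation distance):

X <- rbind(c(2, 3), c(4, 4), c(4, 3))    # sites x1, x2, x3
s <- c(2, 3, 2.4)                         # measured values s1, s2, s3
pairs <- combn(nrow(X), 2)                # all pairs of site indices
h     <- apply(pairs, 2, function(p) sqrt(sum((X[p[1], ] - X[p[2], ])^2)))
gamma <- apply(pairs, 2, function(p) 0.5 * (s[p[1]] - s[p[2]])^2)
cbind(distance = h, semivariance = gamma)
plot(h, gamma, xlab = "distance", ylab = "semivariance")   # variogram cloud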

106

Modelling variogram III

Definition (Average semivariance)

The definition of the semivariance is

γ(h) = (1/2) E[(s(x) − s(x + h))²]

which can be estimated by:

γ̂(h) = (1 / (2 m(h))) Σ_{i=1}^{m(h)} (s(x_i) − s(x_i + h))²

where m(h) is the number of pairs of sites separated by the particular lag vector h. The way it is computed depends on the sampling design.

107

Modelling variogram IV

Example: In labs, we explore the first tutorial given with the R-package gstat.

This tutorial looks at the Meuse data set that gives locations and topsoil heavy

metal concentrations, collected in a flood plain of the river Meuse, near the

village of Stein (NL). Heavy metal concentrations are from composite samples

of an area of approximately 15 m x 15 m.

108

Modelling variogram V

Figure: Variogram cloud of log(zinc) in the Meuse data set (semivariance vs. distance).

109

Modelling variogram VI

Figure: Average semivariogram of log(zinc) in the Meuse data set (semivariance vs. distance).

110

Sampling designs I

Consider the region D ⊂ R2, the location {xj}j=1,··· ,J can be chosen by:

1 Systematic Sampling (e.g. grid)

2 Pure random sampling

3 Stratified sampling

111

Conclusion

We have seen how to estimate a variogram using the observation we

have for stochastic processes {s(xj)}j=1,··· ,J that are measured at each

site xj .

Can we estimate the variogram γ in this fashion when Universal Kriging

will be used to perform prediction at a new position x0?

What is the stochastic process for which the variogram is used in

Universal Kriging? Ordinary Kriging? Simple Kriging?

112

Variogram modelling for Universal Kriging I

While s̃(x_i) − s̃(x_j) = s(x_i) − s(x_j) for simple and ordinary Kriging, this is no longer the case for universal Kriging, where E[s(x)] = μ(x).

We have collected observations for s but not for s̃, and we don't know the form of μ(x).

For instance, if we assume a linear drift μ(x) = a x + b, then the difference

s(x + h) − s(x) has expectation a h,

so the estimated variogram (variogram plot) is affected by this drift and looks like a parabola:

γ(h) ≈ (1/2)(a h)²

113

Variogram modelling for Universal Kriging II

1 The first method proposes to obtain a quick estimate of the mean in order to remove it:

1 Specify a linear model μ(x) = Σ_{p=0}^P β_p φ_p(x) on a chosen basis of functions {φ_p}_{p=0,...,P}.

2 Estimate {β_p}_{p=0,...,P} by least squares to give an estimate of the mean μ̂(x) = Σ_{p=0}^P β̂_p φ_p(x).

3 Compute the variogram of s̃(x) = s(x) − μ̂(x).

114

Variogram modelling for Universal Kriging III

2 The other approach to estimate the variogram works as follows:

1 Choose a theoretical variogram function.

2 Specify a linear model μ(x) = Σ_{p=0}^P β_p φ_p(x) on a chosen basis of functions {φ_p}_{p=0,...,P}.

3 Estimate μ(x) (i.e. find {β_p}_{p=0,...,P}) such that the residuals s̃(x) = s(x) − μ̂(x) have a variogram that matches the chosen theoretical one.

115

Variogram functions I

Several analytical functions have been proposed in the literature to model a variogram. They all follow a few basic rules.

1 Limits on the variogram function.

- Behaviour near the origin. The way in which the variogram approaches the origin is determined by the continuity (or lack of continuity) of the variable s(x) itself. The semivariance at |h| = 0 is by definition 0. It can happen, however, that experimental values give a positive value (positive intercept, nugget effect).

Near 0, the variogram can approach linearly, γ(h) ≈ b|h|, or quadratically, γ(h) ≈ b|h|², as |h| → 0.

116

Variogram functions II

- Behaviour towards infinity. The variogram is constrained by:

lim_{|h|→∞} γ(h) / |h|² = 0

If it does not satisfy this, then the process is not entirely random (and is not compatible with the intrinsic hypothesis).

117

Variogram functions III

2 Examples of models:

- Unbounded models:

γ(h) = ω |h|^α,   for 0 < α < 2

For α = 1 the variogram is linear.

- Bounded models. Based on experience, bounded variation is more common than unbounded variation:
  - bounded linear
  - spherical
  - exponential
  - ...

118

Variogram functions IV

Once we have computed the variogram plot, we select the best analytical

(theoretic) variogram function by:

1 fitting the models to the observed variogram plot.

2 choosing the theoretic model that has the smallest RSS (Residual sum of

square) and/or AIC (Akaike Information Criterion).

Exercise: Assuming that s is weakly stationary, explain why the variogram

cannot be unbounded.
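The fitting procedure above can be sketched with the gstat package mentioned earlier, here on the Meuse log(zinc) semivariogram (the starting values for the spherical model are assumptions; fit.variogram then adjusts them):

library(sp); library(gstat)
data(meuse)
coordinates(meuse) <- ~ x + y                       # promote the data frame to spatial data
v   <- variogram(log(zinc) ~ 1, meuse)              # average (empirical) semivariogram
fit <- fit.variogram(v, model = vgm(psill = 0.6, model = "Sph",
                                    range = 900, nugget = 0.05))
fit                                                  # fitted sill, range and nugget
plot(v, model = fit)                                 # empirical points with the fitted model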

119

Introduction Spectral Analysis I

We consider a time series where observations have been collected at regular intervals of time. One of the earliest general types of models for time series was that they are a combination of

1 a trend describing the long-term behaviour of the series (deterministic),

2 a periodic or seasonal component describing variation within a cycle (deterministic),

3 an error term describing random fluctuations of the series about the trend and seasonal components (stochastic):

(s(t_j) = μ(t_j) + ε(t_j)) ≡ (s_j = μ_j + ε_j)

where μ captures the deterministic behaviour (while both s and ε are stochastic).

120

Introduction Spectral Analysis II

The purpose of the trend term is to describe the long-term features of the series (e.g. an overall tendency); it would usually correspond to some smooth, slowly changing function.

By differencing the time series the trend can easily be removed, so now we assume that only a seasonal pattern remains.

For certain series we can specify the periodicity of the seasonal component very accurately (e.g. economic or geophysical series which contain a strict 12-month cycle).

However, in other series it may not be possible to specify the period of the cyclical term a priori, or there may be several (non-harmonically related) periodic terms present with unknown periods.

121

Introduction Spectral Analysis III

We need special techniques to detect these hidden periodicities. We can represent the seasonal component by a sum of sine and cosine terms (well-suited explanatory variables to capture periodic information):

s_j = Σ_p (A_p cos(ω_p j) + B_p sin(ω_p j)) + ε_j

How do we estimate {(A_p, B_p, ω_p)}?

122

Periodogram I

Definition (Periodogram)

Given a time series s_1, s_2, ..., s_J, the periodogram is defined by:

I(ω) = (2/J) [ (Σ_{j=1}^J s_j cos(ω j))² + (Σ_{j=1}^J s_j sin(ω j))² ]

The function I(ω) is then computed for ω_p = 2πp/J with p = 0, 1, ..., J/2.

The periodogram is also sometimes represented w.r.t. the frequency f = ω/(2π). This implies f ∈ {0, 1/J, 2/J, ..., 1/2}.

123

Periodogram II

Example (Pure sine time series)

Consider a noiseless time series:

s_j = cos((2π × 5 / J) × j),   j = 1, ..., J = 100

1 Compute numerically ω = 2π × 5 / J.

2 Compute numerically the corresponding frequency f = ω / (2π).

3 Look at the periodogram and conclude.
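A minimal R sketch computing the periodogram of this pure sine example, both from the definition above and with the built-in spec.pgram for comparison:

J <- 100
j <- 1:J
s <- cos(2 * pi * 5 / J * j)                      # s_j = cos(2*pi*5/J * j)
f <- (0:(J / 2)) / J                              # frequencies 0, 1/J, ..., 1/2
Iw <- sapply(2 * pi * f, function(w)
  (2 / J) * (sum(s * cos(w * j))^2 + sum(s * sin(w * j))^2))
f[which.max(Iw)]                                  # peak at f = 5/J = 0.05
# base-R periodogram (scaled differently, but it peaks at the same frequency)
pg <- spec.pgram(s, taper = 0, detrend = FALSE, plot = FALSE)
pg$freq[which.max(pg$spec)]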

124

Periodogram III

Figure: Pure sine time series sj = cos(2×π×5J× j), j = 1, · · · , J = 100.

125

Periodogram IV

Figure: Periodogram I(f).

126

Periodogram V

Example (Combination of pure sine time series)

Consider the time series:

s_j = 2 cos((2π × 5 / J) j) + sin((2π × 10 / J) j),   j = 1, ..., J = 100

1 Compute numerically ω_1 = 2π × 5 / J and ω_2 = 2π × 10 / J.

2 Compute numerically the corresponding frequencies f_1 and f_2.

3 Look at the periodogram and conclude.

127

Periodogram VI

Figure: Combination of two pure sine time series.

128

Periodogram VII

Figure: Periodogram I(f).

129

Periodogram VIII

For a time series with cyclical patterns:

s_j = Σ_p (A_p cos(ω_p j) + B_p sin(ω_p j)) + ε_j,   j = 1, ..., J

By computing the periodogram, you can identify the most prevalent frequencies {ω_p}.

How would you propose to estimate the corresponding coefficients {A_p, B_p}?

130

Periodogram IX

Let's define the complex vector:

V_p = [ exp(−ı ω_p 1), exp(−ı ω_p 2), ..., exp(−ı ω_p J) ]^T
    = [ cos(ω_p 1) − ı sin(ω_p 1), ..., cos(ω_p J) − ı sin(ω_p J) ]^T

You can check that the periodogram I(ω_p) satisfies (with s = [s_1, ..., s_J]^T):

I(ω_p) ∝ ‖⟨s | V_p⟩‖²

with ⟨·|·⟩ the dot product and ‖·‖ the Euclidean norm.

131

Periodogram X

Definition (Inner product)

For complex vectors x and y of length J, the inner product is:

⟨x, y⟩ = Σ_{n=1}^J x_n ȳ_n

where ȳ_n is the complex conjugate of y_n.

You can also check that (with ω_p = 2πp/J):

⟨V_{p1} | V_{p2}⟩ = J · δ_{p1 p2} = J if p1 = p2, and 0 otherwise

(δ_{p1 p2} denotes the Kronecker delta).

132

Periodogram XI

The set of vectors {V_p}_{p=0,...,J−1} forms a basis of the space C^J. So any s ∈ C^J can be written exactly (note V_0 = V_J):

s = Σ_{p=0}^{J−1} θ_p V_p   or   s = Σ_{p=1}^J θ_p V_p

Because here s is a time series of real numbers, it can be shown that

θ_p is the complex conjugate of θ_{J−p}

so only half of the coefficients, {θ_p}_{p=0,...,J/2}, are needed.

133

Periodogram XII

Definition (Discrete Fourier Transform and power spectrum)

The discrete Fourier transform (DFT) of the time series s_1, ..., s_J is defined as:

θ_p = ⟨s | V_p⟩ = Σ_{j=1}^J s_j exp(−ı ω_p j)

with ω_p = 2πp/J. The {θ_p}_{p=0,...,J−1} are the Fourier coefficients of the time series s, and:

s = Σ_{p=0}^{J−1} θ_p V_p

The power spectrum is defined as P(ω_p) = |θ_p|², and the periodogram is therefore proportional to the power spectral density of the signal.

134

Periodogram XIII

Reading and using a periodogram

High values of the periodogram indicate periodic patterns.

Detect the main maximum (or maxima) and identify the corresponding

frequency(ies) f .

The period of the cyclical pattern is recovered by computing 1/f . If the

time series have been recorded every months, then the period is 1/f

months.

135

Case study: Variable Star Brightness I

The brightness (magnitude) of a particular star at midnight was observed on 600 consecutive nights.

Figure: Star brightness observed on 600 consecutive nights.

136

Case study: Variable Star Brightness II

Figure: Periodogram of the variable star brightness time series.

Two prominent peaks appear at f1 = 0.035 and f2 = 0.04167.

137

Case study: Variable Star Brightness III

Exercises:

1 What are the periods in days associated with f1 and f2 ?

2 What model would you suggest to fit to this time series?

138

Case study: Variable Star Brightness IV

Answers:

1 f1 corresponds to a 29-day period (1/0.035 ≈ 29) and f2 corresponds to a 24-day period (1/0.04167 ≈ 24).

2 Model:

s_t = β_0 + β_1 cos(2π f_1 t) + β_2 sin(2π f_1 t) + β_3 cos(2π f_2 t) + β_4 sin(2π f_2 t) + ε_t
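A minimal R sketch fitting this harmonic regression model by least squares; a simulated brightness series stands in for the real data, which is not reproduced here:

set.seed(5)
tt <- 1:600
f1 <- 0.035; f2 <- 0.04167
y  <- 17 + 8 * cos(2 * pi * f1 * tt) + 7 * sin(2 * pi * f2 * tt) + rnorm(600)  # stand-in data
fit <- lm(y ~ cos(2 * pi * f1 * tt) + sin(2 * pi * f1 * tt) +
              cos(2 * pi * f2 * tt) + sin(2 * pi * f2 * tt))
coef(fit)   # estimates of beta_0, ..., beta_4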

139

Conclusion

Remember that spectral analysis by discrete Fourier transform can be applied when observations have been collected at regular intervals.

Frequency-domain analysis, or spectral analysis, has been found to be useful in (for instance):

- acoustics
- communication engineering
- geophysical science
- biomedical science

140

Fourier transform I

The Fourier transform is an important tool to analyse signals, discrete or continuous.

A Wavelet Tour of Signal Processing, S. Mallat, second edition.

Definition (Fourier transform)

The Fourier integral measures how much oscillation at the frequency ω there is in the function f:

F(ω) = ∫_{−∞}^{∞} f(t) exp(−ı ω t) dt

141

Fourier transform II

Definition (Inverse Fourier transform)

The inverse Fourier transform can be computed as:

f(t) = (1/(2π)) ∫_{−∞}^{∞} F(ω) exp(ı ω t) dω

Exercises (Fourier transform): compute the Fourier transforms of:

1 f(t) = 1 for −T < t < T, and 0 otherwise

2 f(t) = δ(t)

3 f(t) = δ(t − t_0)

142

Fourier transform III

4 f(t) = exp(−a t) h(t) (h is the Heaviside step function)

5 f(t) = cos(ω_0 t)

6 Show:

1 Linearity: a x(t) + b y(t) ↔ a X(ω) + b Y(ω)

2 Time shifting: x(t − t_0) ↔ exp(−ı ω t_0) X(ω)

143

Fourier transform IV

Definition (Convolution)

The convolution of two functions x(t) and h(t) gives a function y(t) such that:

y(t) = x(t) ∗ h(t) = ∫_{−∞}^{+∞} x(τ) h(t − τ) dτ

Theorem (Convolution)

If y(t) = x(t) ∗ h(t) then:

Y(ω) = X(ω) · H(ω)

Prove this theorem.

144

Two-dimensional Fourier transform I

Definition (2D Fourier transform)

F(ω_x, ω_y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) exp(−ı ω_x x) exp(−ı ω_y y) dx dy

Remember that spectral analysis is well suited to signals observed at regular intervals. In particular, the 2D Fourier transform is suitable for analysing images:

Definition (image)

An image can be understood as a stochastic process s(x) that has been observed at J locations {x_j}_{j=1,...,J} lying on a regular 2D grid. For grey-level images, the state space of s is the set of integers between 0 (black) and 255 (white).

145

Two-dimensional Fourier transform II

Low pass filtering

146

Two-dimensional Fourier transform III

High pass filtering

147

Functional Data Analysis: Introduction I

Definition (functional data)

Functional data indicate that the collected (observed ) data are observations

of functions varying over a continuum.

Functional Data Analysis , J.O. Ramsay and B. W. Silverman, Springer 2006.

148

Functional Data Analysis: Introduction II

Functional Data Analysis (FDA) aims at:

representing data in ways that help further analysis.

to display the data so as to highlight various characteristics

studying important sources of patterns and variations among the data

149

Functional Data Analysis: Introduction III

Example: Height of Girls

150

Functional Data Analysis: Introduction IV

10 subjects took part in the study.

Each record has 31 discrete values of height that are not equally spaced.

There is uncertainty or noise in the collected height values (about 3 mm).

The underlying process can be assumed to be a smooth function μ_j(t) for each subject j taking part in the study, and observations have been collected for the stochastic process

s_j(t) = μ_j(t) + ε_j(t)

151

Functional Data Analysis: Introduction V

For each record j, it is interesting to calculate an estimate of the function

µj(t) (Linear Smoothing methods).

When having a family of functions {µj(t)}j=1,··· ,J , it is interesting to

investigate the variations of this family of functions (functional Principal

Component Analysis).

For simplicity here, we are considering temporal stochastic processes. All

methods can easily be extended to spatio-temporal stochastic processes.

152

Linear Smoothing: Introduction I

We consider that we have observations {(t^(i), s^(i))}_{i=1,...,n}.

Definition (Linear smoothing)

A linear smoother estimates the function value s(t) by a linear combination of the discrete observations:

ŝ(t) = Σ_{i=1}^n λ_i(t) s^(i) = ⟨λ(t) | s⟩

The behaviour of the smoother at t is controlled by the weights λ_i(t).

153

Linear Smoothing: Introduction II

Example of linear smoothing methods:

1 Kriging X

2 Regression on basis of functions

3 Nadaraya-Watson estimator

Note that these methods have different hypotheses concerning the nature of

the noise ε.

154

Linear Smoothing: Regression on basis of functions I

A model can be proposed for the deterministic part of the process as follows:

μ(t) = Σ_{k=0}^K θ_k φ_k(t) = ⟨φ(t) | θ⟩

The standard model for the error ε is N(0, σ²), independent over time (i.e. E[s(t)] = μ(t) and Var[s(t)] = σ²). The observations collected are {(t^(i), s^(i))}_{i=1,...,n}.

155

Linear Smoothing: Regression on basis of functions II

The coefficients {θ_k}_{k=0,...,K} are estimated using least squares:

θ̂ = (Φ^T Φ)^(−1) Φ^T s

where s = [s^(1), s^(2), ..., s^(n)]^T and Φ is the n × (K + 1) matrix collecting the values {φ_k(t^(i))}. So the estimate for E[s(t)] is:

ŝ(t) = ⟨φ(t) | θ̂⟩ = φ(t)^T (Φ^T Φ)^(−1) Φ^T s = S(t)^T s

Exercise: What basis functions {φ_k}_{k=0,...,K} can be used?
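A minimal R sketch of least squares regression on a Fourier basis, applied to a simulated periodic series (the data, the period and the number of harmonics K are assumptions for illustration):

set.seed(6)
n     <- 120
t_obs <- seq(0, 12, length.out = n)                     # e.g. 12 "months"
s_obs <- 10 + 3 * sin(2 * pi * t_obs / 12) + rnorm(n)   # noisy periodic signal
K     <- 3                                              # number of harmonics
omega <- 2 * pi / 12
Phi   <- cbind(1, sapply(1:K, function(k) sin(k * omega * t_obs)),
                  sapply(1:K, function(k) cos(k * omega * t_obs)))   # basis matrix
theta <- solve(crossprod(Phi), crossprod(Phi, s_obs))   # (Phi^T Phi)^{-1} Phi^T s
mu_hat <- Phi %*% theta                                  # smoothed curve at the t^(i)
plot(t_obs, s_obs); lines(t_obs, mu_hat, lwd = 2)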

156

Linear Smoothing: Regression on basis of functions III

Figure: Vancouver precipitation data fitted with 13 Fourier basis functions.

157

Linear Smoothing: Regression on basis of functions IV

Choice of basis expansions.

When performing least squares fitting of basis expansions, φ_0(t), φ_1(t), ... is a basis system for s. The choice of this basis system is important, in particular if you want to explore the time derivatives of ŝ(t).

158

Linear Smoothing: Regression on basis of functions V

Monomial Basis φ0(t) = 1, φ1(t) = t, φ2(t) = t2, · · · , φK(t) = tK

Properties

I difficult for K > 6

I Derivatives of φk(t) get simpler but this is often not a desirable property

when fitting real-world data.

159

Linear Smoothing: Regression on basis of functions VI

Figure: Monomial basis fits for K = 0, 1, 2, 3.

160

Linear Smoothing: Regression on basis of functions

VII

Fourier basis: {1, sin(ωt), cos(ωt), sin(2ωt), cos(2ωt), sin(3ωt), cos(3ωt), ..., sin(Kωt), cos(Kωt)}

Properties:

- natural basis for periodic data;
- derivatives retain the complexity of the basis.

Linear Smoothing: Regression on basis of functions

VIII

Figure: Fourier basis fits for K = 3, 4, 5, 6.

162

Linear Smoothing: Regression on basis of functions IX

Splines. Splines are polynomial segments joined end-to-end. They are

constrained to be smooth at the joints (called knots). The order (order =

degree+1) of the polynomial segments and the location of the knots

define the system.

163

Linear Smoothing: Regression on basis of functions X

Figure: Spline basis fits for K = 1, 2, 3, 4.

164

Linear Smoothing: Regression on basis of functions XI

Other basis:

wavelet

...

How to choose the number K of functions?

Information Criterion: AIC, BIC.

Cross-Validation

165

Linear Smoothing: Nadaraya-Watson estimator I

The model is s(t) = μ(t) + ε(t) with E[ε(t)] = 0.

The estimate ŝ(t) is computed as an expectation of s at time t:

E[s | t] = E[s(t)] = E[μ(t) + ε(t)] = μ(t)

By definition of the expectation:

E[s | t] = ∫ s p_{s|t}(s | t) ds = ∫ s p_{st}(s, t) ds / p_t(t)

Note how Bayes' formula is used.

166

Linear Smoothing: Nadaraya-Watson estimator II

Since we don't know the true densities p_{st}(s, t) and p_t(t), the idea of the Nadaraya-Watson estimator is to replace them by density estimates such that the integral is easy to compute and the solution provides a smooth estimate of s(t).

Consequently the Nadaraya-Watson estimator can be written:

E[s(t)] ≈ ∫ s p̂_{st}(s, t) ds / p̂_t(t) = [ Σ_{i=1}^n (1/h) k((t − t^(i))/h) s^(i) ] / [ Σ_{i=1}^n (1/h) k((t − t^(i))/h) ] = Σ_{i=1}^n S_i(t) s^(i) = ŝ(t)
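A minimal R sketch of the Nadaraya-Watson estimator with a Gaussian kernel (the data and the bandwidth h are assumptions for illustration):

nw_estimate <- function(t, t_obs, s_obs, h) {
  w <- dnorm((t - t_obs) / h)      # kernel weights k((t - t^(i)) / h)
  sum(w * s_obs) / sum(w)          # weighted average of the observations
}
set.seed(7)
t_obs  <- sort(runif(100, 0, 10))
s_obs  <- sin(t_obs) + rnorm(100, 0, 0.3)
t_grid <- seq(0, 10, by = 0.1)
s_hat  <- sapply(t_grid, nw_estimate, t_obs = t_obs, s_obs = s_obs, h = 0.5)
plot(t_obs, s_obs); lines(t_grid, s_hat, lwd = 2)
# base R offers a comparable smoother: ksmooth(t_obs, s_obs, "normal", bandwidth = 1)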

167

Linear Smoothing: Nadaraya-Watson estimator III

Choosing the kernel k as the Dirac delta function is equivalent to using the empirical density estimates for p_{st}(s, t) and p_t(t).

The resulting estimate of s(t) would not be smooth with the Dirac kernel, and other kernels can be used (e.g. Gaussian, uniform, quadratic, triangle, Epanechnikov, etc.).

Linear Smoothing: Nadaraya-Watson estimator IV

Definition (Kernel Density Estimate)

Consider a random variable x for which observations {x^(i)}_{i=1,...,n} have been collected. Choosing an even function k such that ∫ k(x) dx = 1 and k(x) ≥ 0 ∀x, a kernel density estimate (KDE) of the density function p_x is:

p̂_x(x) = (1/(nh)) Σ_{i=1}^n k((x − x^(i))/h)

h is called the bandwidth and is a parameter controlling the level of smoothing.

169

Linear Smoothing: Conclusion and Extensions I

Conclusion:

Nadaraya Watson estimator: the bandwidth needs to be

chosen/estimated.

Regression on basis of functions: {θk} needs to be estimated, the basis

{φk(t)} needs to be chosen, and the number K of functions needs to be

selected.

170

Linear Smoothing: Conclusion and Extensions II

Extension: regularised basis approach.

When fitting the model μ(t) = Σ_{k=1}^K θ_k φ_k(t), one may add a smoothness constraint to ensure that the estimated function is smooth.

If the basis of functions is differentiable, then D²μ(t) can be computed and a constraint (prior) can be added to the cost function C used to estimate the coefficients:

C(θ_1, ..., θ_K) = Σ_{i=1}^n (s^(i) − μ(t^(i)))² + λ ∫ [D²μ(t)]² dt

171

Linear Smoothing: Conclusion and Extensions III

Another example: consider growth, exponential growth or decay (e.g. population, radioactive decay, economic indicators); we know that such a system follows an ODE:

Lx = −γ x + D x = 0

Then fitting the model μ(t) = Σ_{k=1}^K θ_k φ_k(t) can be constrained with

C(θ_1, ..., θ_K) = Σ_{i=1}^n (s^(i) − μ(t^(i)))² + λ ∫ [Lμ(t)]² dt

where γ, if unknown, can be estimated.

172

Linear Smoothing: Conclusion and Extensions IV

Depending on the problem, derivatives can be used for regularisation, adding a prior for the type of solution we are looking for (e.g. smoothness, or being subject to a known ODE).

Derivatives can also be the subject of interest in the study. For instance in mechanics, if you collect spatial positions over time (trajectories), one may be interested in studying the velocity (first derivative) or the acceleration (second derivative).

173

Linear Smoothing: Conclusion and Extensions V

Example: Height of Girls

174

Linear Smoothing: Conclusion and Extensions VI

175

Functional PCA I

Definition (Mean and variance functions)

Consider the functions {μ_j(t)}_{j=1,...,J} that are observed instances of the random function s(t). The mean function is the average of the functions:

E[s(t)] = μ̄(t) = (1/J) Σ_{j=1}^J μ_j(t)

and the variance function is:

Var[s(t)] = (1/(J − 1)) Σ_{j=1}^J (μ_j(t) − μ̄(t))²

176

Functional PCA II

Example (Height of Girls)

Height̄(t) = (1/10) Σ_{i=1}^{10} Height_i(t)

VarHeight(t) = (1/(10 − 1)) Σ_{i=1}^{10} (Height_i(t) − Height̄(t))²

177

Functional PCA III

FPCA, Principal Component Analysis for functions:

We compute the mean μ̄(t) and subtract it from each curve:

μ̃_j(t) = μ_j(t) − μ̄(t)

Let v(t_1, t_2) = E[μ̃(t_1) μ̃(t_2)] be the covariance function; it can be estimated from the sample by:

v̂(t_1, t_2) = (1/J) Σ_{j=1}^J μ̃_j(t_1) μ̃_j(t_2)

An eigencurve u(t) is then computed such that:

∀t_1, ∫ v(t_1, t) u(t) dt = λ u(t_1)   subject to   ∫ u(t)² dt = 1

178

Functional PCA IV

A simple approach is to discretise the functions μ_j(t) on a fine grid of N sites equally spaced on the time line. This allows these functions to be treated as finite-dimensional vectors, and standard PCA can be applied to estimate the eigenvectors.

The continuous principal component u(t) is recovered by interpolating the discrete eigenvector.

The choice of interpolation method does not matter when recovering the eigencurve from the eigenvector, provided the spacing between the N sampling sites is small.

This approach is the earliest approach to functional principal component analysis.
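A minimal R sketch of this discretisation approach to functional PCA, using simulated curves (the curves, grid and number of subjects are assumptions for illustration):

set.seed(8)
N <- 100; J <- 30
tt <- seq(0, 1, length.out = N)                        # fine grid of N sites
# each curve = common trend + random amplitude * sin + noise
curves <- t(sapply(1:J, function(j)
  5 * tt + rnorm(1) * sin(2 * pi * tt) + rnorm(N, 0, 0.1)))
pca <- prcomp(curves, center = TRUE, scale. = FALSE)   # standard PCA on the J x N matrix
u1  <- pca$rotation[, 1]                               # discretised first eigencurve
plot(tt, u1, type = "l", xlab = "t", ylab = "first principal component")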

179

Transforms I

Definition (Integral transform)

An integral transform T is defined as:

T[f(t)] ≡ F(u) = ∫_{t1}^{t2} K(t, u) f(t) dt

where K is the kernel function. This can also be understood as a linear combination ('continuous sum') over a basis of functions K.

Example

The Laplace transform is an example of an integral transform, with K(t, u) = exp(−u t).

The Fourier transform is another example, with K(t, u) = exp(−ı u t).

180

Transforms II

Another useful example is the Radon transform:

Definition (Radon transform)

Given a function f defined on a domain x = (x, y) ∈ R², the Radon transform of f is its integral along the line of equation ρ − x cos θ − y sin θ = 0, i.e.:

Rf(ρ, θ) = ∫_R ∫_R δ(ρ − x cos θ − y sin θ) f(x, y) dx dy = ∫_{R²} δ(ρ − x^T n) f(x) dx

with δ(·) the Dirac delta function, ρ ∈ R, θ ∈ [−π/2; π/2] and n = (cos θ, sin θ)^T.

181

Transforms III

Figure: Geometry of the Radon transform: the line ρ − x^T n = 0 in the (x, y) plane, with normal direction n at angle θ and distance ρ from the origin.

182

Transforms IV

Exercises:

1 Assuming that f(x) = p_x(x) is the probability density function of x, what is Rf(ρ, θ) when θ = 0 and θ = π/2?

2 Assuming that f(x) = p_x(x) is the probability density function of x, is Rf(ρ, θ) a probability density function?

183

Application Radon Transform I

1 Consider the r.v. x = (x, y), for which only one observation x^(1) is available. Then the empirical density function is:

p_x(x) = δ(x − x^(1)) = δ(x − x^(1)) δ(y − y^(1))

What is the Radon transform of p_x(x)?

2 With two observations,

p_x(x) = (1/2) [δ(x − x^(1)) + δ(x − x^(2))]

What is the Radon transform of p_x(x)?

184

Application Radon Transform II

Figure: image in the (x, y) domain.

185

Application Radon Transform III

Figure: Radon transform in the (θ, ρ) domain.

186

Application Radon Transform IV

Figure: image in the (x, y) domain.

187

Application Radon Transform V

Figure: Radon transform in the (θ, ρ) domain.

Conclude on how the Radon transform can be used to detect perfect lineaments.

188

Application Radon Transform VI

Remarks:

The Hough transform is a closely related method to the Radon transform for finding shapes (lines, circles, etc.).

The formulation of the Radon transform itself has been generalised to consider quadratic shapes, for instance:

The general quadratic Radon transform, Koen Denecker, Jeroen Van Overloop and Frank Sommen, Inverse Problems 14 (1998) 615-633.

189

Application Radon Transform VII

Example: Archaeology

Detecting Roman land boundaries in aerial photographs using Radon transforms, D. J. Bescoby, Journal of Archaeological Science, Volume 33, Issue 5, May 2006, Pages 735-743.

190

Application Radon Transform VIII

Example: crater detection on Mars.

Method for Crater Detection From Martian Digital Topography Data Using Gradient Value/Orientation, Morphometry, Vote Analysis, Slip Tuning, and Calibration, Goran Salamuniccar and Sven Loncaric, IEEE Transactions on Geoscience and Remote Sensing, vol. 48, no. 5, May 2010.

191
