TRANSCRIPT
Gaussian Markov Random Fields
Gaussian Markov Random Fields: Theory and Applications
Håvard Rue
Department of Mathematical Sciences, NTNU, Norway
September 26, 2008
Gaussian Markov Random Fields
Part I
Definition and basic properties
Gaussian Markov Random Fields
Outline I
Introduction
  Why?
  Outline of this course
What is a GMRF?
  The precision matrix
  Definition of a GMRF
  Example: Auto-regressive process
Why are GMRFs important?
Main features of GMRFs
Properties of GMRFs
  Interpretation of elements of Q
  Markov properties
  Conditional density
  Specification through full conditionals
  Proof via Brook's lemma
Gaussian Markov Random Fields
Introduction
Why?
Why is it a good idea to learn about Gaussian Markov random fields (GMRFs)? That is a good question!
What's there to learn? Isn't it all just Gaussian?
That is also a good question!
Gaussian Markov Random Fields
Introduction
Why?
Why?
• Gaussians (and most often in the form of a GMRF) are extensively used in statistical models.
• However, their excellent computational properties are often left unexplored.
• There is a lack of knowledge of general-purpose and near-optimal numerical algorithms.
Gaussian Markov Random Fields
Introduction
Why?
Why?
• Fast(er) MCMC-based inference
Recent developments (the really nice stuff...)
• Near-instant (compared to MCMC) approximate Bayesian inference is often possible
• Deterministic
• Relative error
• In practice: no "error"
Gaussian Markov Random Fields
Introduction
Why?
Longitudinal mixed effects model: Epil-example from BUGS
Gaussian Markov Random Fields
Introduction
Why?
Longitudinal mixed effects model: Epil-example from BUGS...
In this example
x = (a0, aBase, aTrt, aBT, aAge, aV4, b1j, bjk)
is a GMRF.
Two (hyper-)parameters: Precision(b1) and Precision(b).
Poisson observations.
Gaussian Markov Random Fields
Outline of this course
Outline of this course: Day I
Lecture 1 GMRFs: definition and basic properties
Lecture 2 Simulation algorithms for GMRFs
Lecture 3 Numerical methods for sparse matrices
Lecture 4 Intrinsic GMRFs
Lecture 5 Hierarchical GMRF models and MCMC algorithmsfor such models
Lecture 6 Case-studies
Gaussian Markov Random Fields
Outline of this course
Outline of this course: Day II
Lecture 1 Motivation for Approximate Bayesian inference
Lecture 2 INLA: Integrated Nested Laplace Approximations
Lecture 3 The inla-program: examples
Lecture 4 Using R-interface to the inla-program (Sara Martino)
Lecture 5 Case studies with R (Sara Martino)
Lecture 6 Discussion
Gaussian Markov Random Fields
Outline of this course
What is a GMRF?
What is a Gaussian Markov random field (GMRF)?
A GMRF is a simple construct
• A normally distributed random vector
x = (x1, . . . , xn)ᵀ
• Additional Markov properties:
xi ⊥ xj | x−ij,
i.e. xi and xj are conditionally independent (CI) given the rest.
Gaussian Markov Random Fields
Outline of this course
The precision matrix
If xi ⊥ xj | x−ij for a set of pairs {i, j}, then we need to constrain the parametrisation of the GMRF.
• Covariance matrix: difficult
• Precision matrix: easy
Gaussian Markov Random Fields
Outline of this course
The precision matrix
Conditional independence and the precision matrix
The density of a zero-mean Gaussian is
π(x) ∝ |Q|^{1/2} exp( −(1/2) xᵀQx ).
Constraining the parametrisation to obey the CI properties:
Theorem
xi ⊥ xj | x−ij ⇐⇒ Qij = 0
Gaussian Markov Random Fields
Outline of this course
The precision matrix
Use an (undirected) graph G = (V, E) to represent the CI properties:
V Vertices: 1, 2, . . . , n.
E Edges {i, j}:
• no edge between i and j if xi ⊥ xj | x−ij;
• an edge between i and j if xi and xj are not conditionally independent given x−ij.
Gaussian Markov Random Fields
Outline of this course
Definition of a GMRF
Definition of a GMRF
Definition (GMRF)
A random vector x = (x1, . . . , xn)ᵀ is called a GMRF wrt the graph G = (V = {1, . . . , n}, E) with mean µ and precision matrix Q > 0, iff its density has the form
π(x) = (2π)^{−n/2} |Q|^{1/2} exp( −(1/2)(x − µ)ᵀQ(x − µ) )
and
Qij ≠ 0 ⇐⇒ {i, j} ∈ E for all i ≠ j.
Gaussian Markov Random Fields
Outline of this course
Example: Auto-regressive process
Simple example of a GMRF
Auto-regressive process of order 1:
xt | xt−1, . . . , x1 ∼ N(φ xt−1, 1), t = 2, . . . , n,
and x1 ∼ N(0, (1 − φ²)⁻¹).
Tridiagonal precision matrix:
Q =
  1    −φ
 −φ   1+φ²   −φ
        ⋱     ⋱     ⋱
             −φ    1+φ²   −φ
                    −φ     1
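As a concrete check (not from the slides; a minimal Python/numpy sketch), one can build this tridiagonal Q for a small n and verify that its inverse is the familiar AR(1) covariance Cov(xs, xt) = φ^|s−t|/(1 − φ²):

```python
import numpy as np

def ar1_precision(n, phi):
    """Tridiagonal precision matrix of a stationary AR(1) process, unit innovation variance."""
    Q = np.zeros((n, n))
    for t in range(n):
        Q[t, t] = 1.0 + phi**2 if 0 < t < n - 1 else 1.0
        if t > 0:
            Q[t, t - 1] = Q[t - 1, t] = -phi
    return Q

n, phi = 6, 0.8
Q = ar1_precision(n, phi)
# Inverting Q recovers the dense AR(1) covariance phi^|s-t| / (1 - phi^2).
Sigma = np.array([[phi ** abs(s - t) / (1 - phi**2) for t in range(n)] for s in range(n)])
assert np.allclose(np.linalg.inv(Q), Sigma)
```

Note the contrast: Q is sparse (tridiagonal) while its inverse, the covariance matrix, is completely dense.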
Gaussian Markov Random Fields
Outline of this course
Why are GMRFs important?
Usage of GMRFs (I)
Structural time-series analysis
• Autoregressive models.
• Gaussian state-space models.
• Computational algorithms based on the Kalman filter and itsvariants.
Analysis of longitudinal and survival data
• temporal GMRF priors
• state-space approaches
• spatial GMRF priors
used to analyse longitudinal and survival data.
Gaussian Markov Random Fields
Outline of this course
Why are GMRFs important?
Usage of GMRFs (II)
Graphical models
• A key model
• Estimate Q and its (associated) graph from data.
• Often used in a larger context.
Semiparametric regression and splines
• Model a smooth curve in time or a surface in space.
• Intrinsic GMRF models and random walk models
• Discretely observed integrated Wiener processes are GMRFs
• GMRFs models for coefficients in B-splines.
Gaussian Markov Random Fields
Outline of this course
Why are GMRFs important?
Usage of GMRFs (III)
Image analysis
• The first main area for spatial models
• Image restoration using the Wiener filter.
• Texture modelling and texture discrimination.
• Segmentation
• Deformable templates
• Object identification
• 3D reconstruction
• Restoring ultrasound images
Gaussian Markov Random Fields
Outline of this course
Why are GMRFs important?
Usage of GMRFs (IV)
Spatial statistics
• Latent GMRF model analysis of spatial binary data
• Geostatistics using GMRFs
• Analysis of data in social sciences and spatial econometrics
• Spatial and space-time epidemiology
• Environmental statistics
• Inverse problems
Gaussian Markov Random Fields
Outline of this course
Main features of GMRFs
Main features
• Analytical tractable
• Modelling using conditional independence
• Merging GMRFs using conditioning (hierarchical models)
• Unified framework for• understanding• representation• computation using numerical methods for sparse matrices
• Fits nicely into the MCMC world
• Can construct fast and reliable block-MCMC algorithms.
Gaussian Markov Random Fields
Properties of GMRFs
Constraining the parametrisation to obey CI properties
Theorem
xi ⊥ xj | x−ij ⇐⇒ Qij = 0
Gaussian Markov Random Fields
Properties of GMRFs
Proof
Start with the factorisation theorem.
Theorem
x ⊥ y | z ⇐⇒ π(x , y , z) = f (x , z)g(y , z) (1)
for some functions f and g, and for all z with π(z) > 0.
Example
For
π(x, y, z) ∝ exp(x + xz + yz)
on some bounded region, we see that x ⊥ y | z. However, this is not the case for
π(x, y, z) ∝ exp(xyz).
Gaussian Markov Random Fields
Properties of GMRFs
Proof...
π(xi, xj, x−ij) ∝ exp( −(1/2) ∑_{k,l} xk Qkl xl )
∝ exp( −(1/2) xi xj (Qij + Qji)  −  (1/2) ∑_{{k,l} ≠ {i,j}} xk Qkl xl ).
The first term involves xi xj iff Qij ≠ 0; the second term does not involve xi xj.
We now see that π(xi, xj, x−ij) = f(xi, x−ij) g(xj, x−ij) for some functions f and g, iff Qij = 0. Now use the factorisation theorem.
Gaussian Markov Random Fields
Properties of GMRFs
Interpretation of elements of Q
Interpretation of elements of Q
Let x be a GMRF wrt G = (V, E) with mean µ and precision matrix Q > 0. Then
E(xi | x−i) = µi − (1/Qii) ∑_{j: j∼i} Qij (xj − µj),
Prec(xi | x−i) = Qii, and
Corr(xi, xj | x−ij) = −Qij / √(Qii Qjj),  i ≠ j.
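These identities are easy to check numerically; a small sketch (Python/numpy, not from the slides) with a generic SPD precision matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
Q = M @ M.T + n * np.eye(n)            # a generic precision matrix Q > 0
Sigma = np.linalg.inv(Q)

i, j = 0, 1
rest = [k for k in range(n) if k != i]
# Prec(x_i | x_-i) = Q_ii: compare 1/Q_ii with the Schur-complement conditional variance.
cond_var = Sigma[i, i] - Sigma[i, rest] @ np.linalg.solve(Sigma[np.ix_(rest, rest)],
                                                          Sigma[rest, i])
assert np.isclose(cond_var, 1.0 / Q[i, i])

# Corr(x_i, x_j | x_-ij): the conditional precision of (x_i, x_j) given the rest
# is the 2x2 block of Q, so its inverse is the conditional covariance.
C = np.linalg.inv(Q[np.ix_([i, j], [i, j])])
assert np.isclose(C[0, 1] / np.sqrt(C[0, 0] * C[1, 1]),
                  -Q[i, j] / np.sqrt(Q[i, i] * Q[j, j]))
```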
Gaussian Markov Random Fields
Properties of GMRFs
Markov properties
Markov properties
Let x be a GMRF wrt G = (V, E). Then the following are equivalent.
The pairwise Markov property:
xi ⊥ xj | x−ij if {i, j} ∉ E and i ≠ j.
Gaussian Markov Random Fields
Properties of GMRFs
Markov properties
Markov properties
Let x be a GMRF wrt G = (V, E). Then the following are equivalent.
The local Markov property:
xi ⊥ x_{−{i, ne(i)}} | x_{ne(i)} for every i ∈ V.
Gaussian Markov Random Fields
Properties of GMRFs
Markov properties
Markov properties
Let x be a GMRF wrt G = (V, E). Then the following are equivalent.
The global Markov property:
xA ⊥ xB | xC
for all disjoint sets A, B and C, where C separates A and B, and A and B are non-empty.
Gaussian Markov Random Fields
Properties of GMRFs
Conditional density
(Induced) subgraph
Let A ⊂ V. Let G^A denote the graph restricted to A:
• remove all nodes not belonging to A, and
• remove all edges where at least one node does not belong to A.
Example
If A = {1, 2}, then V^A = {1, 2} and E^A = {{1, 2}}.
Gaussian Markov Random Fields
Properties of GMRFs
Conditional density
Conditional density I
Let V = A ∪ B where A ∩ B = ∅, and partition
x = (xA, xB)ᵀ, µ = (µA, µB)ᵀ, Q = ( Q_AA Q_AB ; Q_BA Q_BB ).
Result: xA | xB is then a GMRF wrt the subgraph G^A with parameters µ_{A|B} and Q_{A|B} > 0, where
µ_{A|B} = µA − Q_AA⁻¹ Q_AB (xB − µB) and Q_{A|B} = Q_AA.
Gaussian Markov Random Fields
Properties of GMRFs
Conditional density
Conditional density II
µ_{A|B} = µA − Q_AA⁻¹ Q_AB(xB − µB) and Q_{A|B} = Q_AA.
• We have explicit knowledge of Q_{A|B}: it is the principal submatrix Q_AA.
• Passing to the subgraph G^A does not change the structure among the nodes in A; nodes and edges outside A are simply removed.
• The conditional mean only depends on nodes in A ∪ ne(A).
• If Q_AA is sparse, then µ_{A|B} is the solution of a sparse linear system:
Q_AA(µ_{A|B} − µA) = −Q_AB(xB − µB)
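A sketch of that computation (Python/scipy; the AR(1)-type Q and the partition into A and B are illustrative, not from the slides):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

n, phi = 1000, 0.9
main = np.r_[1.0, np.full(n - 2, 1.0 + phi**2), 1.0]
off = np.full(n - 1, -phi)
Q = sp.diags([off, main, off], [-1, 0, 1], format="csr")   # sparse AR(1) precision

A_idx = np.arange(n // 2)                 # condition the first half of the nodes ...
B_idx = np.arange(n // 2, n)              # ... on the second half
mu = np.zeros(n)
xB = np.random.default_rng(1).standard_normal(B_idx.size)  # "observed" values

QAA = Q[A_idx, :][:, A_idx]
QAB = Q[A_idx, :][:, B_idx]
# Solve the sparse system Q_AA (mu_{A|B} - mu_A) = -Q_AB (x_B - mu_B).
mu_AB = mu[A_idx] + spsolve(QAA.tocsc(), -QAB @ (xB - mu[B_idx]))
```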
Gaussian Markov Random Fields
Properties of GMRFs
Conditional density
Conditional density III
[figure-only slides illustrating the conditional density]
Gaussian Markov Random Fields
Properties of GMRFs
Specification through full conditionals
Specification through full conditionals
An alternative to specifying a GMRF by its mean and precision matrix is to specify it implicitly through the full conditionals
π(xi | x−i), i = 1, . . . , n.
This approach was pioneered by Besag (1974, 1975).
Also known as conditional autoregressions or CAR models.
Gaussian Markov Random Fields
Properties of GMRFs
Specification through full conditionals
Example: AR-process
Specify full conditionals
E (x1 | x−1) = 0.3x2 Prec(x1 | x−1) = 1
E (x3 | x−3) = 0.2x2 + 0.4x4 Prec(x3 | x−3) = 2
and so on.
Gaussian Markov Random Fields
Properties of GMRFs
Specification through full conditionals
Example: Spatial model
Specify full conditionals for each i (assuming zero mean)
E(xi | x−i) = ∑_j wij xj,  Prec(xi | x−i) = κi
Gaussian Markov Random Fields
Properties of GMRFs
Specification through full conditionals
Potential problems
• Even though the "full conditionals" are well defined, are we sure that the joint density exists and is unique?
• What are the compatibility conditions required for a joint GMRF to exist with the prescribed Markov properties?
Gaussian Markov Random Fields
Properties of GMRFs
Specification through full conditionals
In general
Specify the full conditionals as normals with
E(xi | x−i) = µi − ∑_{j: j∼i} βij (xj − µj)   (2)
Prec(xi | x−i) = κi > 0   (3)
for i = 1, . . . , n, for some {βij : i ≠ j} and vectors µ and κ.
Clearly, ∼ is defined implicitly by the non-zero terms among the βij.
Gaussian Markov Random Fields
Properties of GMRFs
Specification through full conditionals
E(xi | x−i) = µi − ∑_{j: j∼i} βij(xj − µj) and Prec(xi | x−i) = κi > 0
There must exist a joint density π(x) with these as its full conditionals.
Comparing term by term with (2) gives
Qii = κi and Qij = κi βij.
Q must be symmetric, i.e.
κi βij = κj βji.
We then have a candidate for a joint density, provided Q > 0.
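To make the construction concrete, a tiny sketch (Python/numpy, with illustrative numbers) that assembles Q from given κ and β and checks the compatibility conditions:

```python
import numpy as np

# Full-conditional specification: E(x_i | x_-i) = -sum_j beta_ij x_j (zero mean),
# Prec(x_i | x_-i) = kappa_i. Values chosen so that kappa_i beta_ij = kappa_j beta_ji.
kappa = np.array([1.0, 2.0, 2.0])
beta = np.array([[0.00, -0.30, 0.00],
                 [-0.15, 0.00, -0.20],
                 [0.00, -0.20, 0.00]])

Q = np.diag(kappa) + kappa[:, None] * beta   # Q_ii = kappa_i, Q_ij = kappa_i beta_ij
assert np.allclose(Q, Q.T)                   # symmetry: kappa_i beta_ij = kappa_j beta_ji
assert np.all(np.linalg.eigvalsh(Q) > 0)     # Q > 0: a valid joint density exists
```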
Gaussian Markov Random Fields
Properties of GMRFs
Proof via Brook’s lemma
Lemma (Brook's lemma)
Let π(x) be the density for x ∈ ℝⁿ and define Ω = {x ∈ ℝⁿ : π(x) > 0}. Let x, x′ ∈ Ω; then
π(x)/π(x′) = ∏_{i=1}^{n} [ π(xi | x1, . . . , x_{i−1}, x′_{i+1}, . . . , x′_n) / π(x′_i | x1, . . . , x_{i−1}, x′_{i+1}, . . . , x′_n) ]
= ∏_{i=1}^{n} [ π(xi | x′_1, . . . , x′_{i−1}, x_{i+1}, . . . , xn) / π(x′_i | x′_1, . . . , x′_{i−1}, x_{i+1}, . . . , xn) ].
Gaussian Markov Random Fields
Properties of GMRFs
Proof via Brook’s lemma
Assume µ = 0 and fix x′ = 0. Then
log π(x)/π(0) = −(1/2) ∑_{i=1}^{n} κi xi² − ∑_{i=2}^{n} ∑_{j=1}^{i−1} κi βij xi xj   (4)
and
log π(x)/π(0) = −(1/2) ∑_{i=1}^{n} κi xi² − ∑_{i=1}^{n−1} ∑_{j=i+1}^{n} κi βij xi xj.   (5)
Since (4) and (5) must be identical, κi βij = κj βji for i ≠ j, and
log π(x) = const − (1/2) ∑_{i=1}^{n} κi xi² − (1/2) ∑_{i≠j} κi βij xi xj;
hence x is zero-mean multivariate normal, provided Q > 0.
Gaussian Markov Random Fields
Part II
Simulation algorithms for GMRFs
Gaussian Markov Random Fields
Outline I
Introduction
Simulation algorithms for GMRFs
  Task
  Summary of the simulation algorithms
Basic numerical linear algebra
  Cholesky factorisation and the Cholesky triangle
  Solving linear equations
  Avoid computing the inverse
Unconditional sampling
  Simulation algorithm
  Evaluating the log-density
Conditional sampling
  Canonical parameterisation
  Conditional distribution in the canonical parameterisation
  Sampling from a canonical parameterised GMRF
Gaussian Markov Random Fields
Outline II
Sampling under hard linear constraints
  Introduction
  General algorithm (slow)
  Specific algorithm with not too many constraints (fast)
  Example
  Evaluating the log-density
Sampling under soft linear constraints
  Introduction
  Specific algorithm with not too many constraints (fast)
  The algorithm
  Evaluate the log-density
  Example
Gaussian Markov Random Fields
Introduction
Sparse precision matrix
Recall that
E(xi − µi | x−i) = −(1/Qii) ∑_{j∼i} Qij (xj − µj),  Prec(xi | x−i) = Qii
In most cases:
• Total number of neighbours is O(n).
• Only O(n) of the n2 terms in Q will be non-zero.
• Use this to construct exact simulation algorithms for GMRFs, using numerical algorithms for sparse matrices.
Gaussian Markov Random Fields
Introduction
Example of a typical precision matrix
[figure: non-zero pattern of a typical Q]
Each node has on average 7 neighbours.
Gaussian Markov Random Fields
Simulation algorithms for GMRFs
Task
Simulation algorithms for GMRFs
Can we take advantage of the sparse structure of Q?
• It is faster to factorise a sparse Q compared to a dense Q.
• The speedup depends on the “pattern” in Q, not only thenumber of non-zero terms.
Our task
• Formulate all algorithms to use only sparse matrices.
• Unconditional simulation
• Conditional simulation
  • condition on a subset of variables
  • condition on linear constraints
  • condition on linear constraints with normal noise
• Evaluation of the log-density in all cases.
Gaussian Markov Random Fields
Simulation algorithms for GMRFs
Summary of the simulation algorithms
The result
In most cases, the cost is
• O(n) for temporal GMRFs
• O(n3/2) for spatial GMRFs
• O(n2) for spatio-temporal GMRFs
including evaluation of the log-density.
Condition on k linear constraints, add O(k3).
These are general algorithms only depending on the graph G notthe numerical values in Q.
The core is numerical algorithms for sparse matrices.
Gaussian Markov Random Fields
Basic numerical linear algebra
Cholesky factorisation and the Cholesky triangle
Cholesky factorisation
Let A be an n × n positive definite matrix (A > 0). Then there exists a unique Cholesky triangle L, such that L is lower triangular and
A = LLᵀ.
Computing L costs n³/3 flops.
This factorisation is the basis for solving systems like
Ax = b or AX = B
for k right hand sides, or equivalently, computing
x = A−1b or X = A−1B
Gaussian Markov Random Fields
Basic numerical linear algebra
Solving linear equations
Algorithm 1 Solving Ax = b where A > 0
1: Compute the Cholesky factorisation, A = LLᵀ
2: Solve Lv = b
3: Solve Lᵀx = v
4: Return x
Step 2 is called forward-substitution and costs O(n²) flops.
The solution v is computed in a forward loop:
vi = (1/Lii) ( bi − ∑_{j=1}^{i−1} Lij vj ),  i = 1, . . . , n   (6)
Gaussian Markov Random Fields
Basic numerical linear algebra
Solving linear equations
Algorithm 1 Solving Ax = b where A > 0
1: Compute the Cholesky factorisation, A = LLᵀ
2: Solve Lv = b
3: Solve Lᵀx = v
4: Return x
Step 3 is called back-substitution and costs O(n²) flops.
The solution x is computed in a backward loop:
xi = (1/Lii) ( vi − ∑_{j=i+1}^{n} Lji xj ),  i = n, . . . , 1   (7)
Gaussian Markov Random Fields
Basic numerical linear algebra
Avoid computing the inverse
To compute A⁻¹B, where B is an n × k matrix, we compute the solution X of
A Xj = Bj
for each of the k columns of X.
Algorithm 2 Solving AX = B where A > 0
1: Compute the Cholesky factorisation, A = LLᵀ
2: for j = 1 to k do
3:   Solve Lv = Bj
4:   Solve Lᵀ Xj = v
5: end for
6: Return X
Gaussian Markov Random Fields
Unconditional sampling
Simulation algorithm
Sample x ∼ N(µ, Q⁻¹)
If Q = LLᵀ and z ∼ N(0, I), then x defined by
Lᵀx = z
has covariance
Cov(x) = Cov(L⁻ᵀz) = (LLᵀ)⁻¹ = Q⁻¹.
Algorithm 3 Sampling x ∼ N(µ, Q⁻¹)
1: Compute the Cholesky factorisation, Q = LLᵀ
2: Sample z ∼ N(0, I)
3: Solve Lᵀv = z
4: Compute x = µ + v
5: Return x
Gaussian Markov Random Fields
Unconditional sampling
Evaluating the log-density
The log-density is
log π(x) = −(n/2) log 2π + ∑_{i=1}^{n} log Lii − (1/2)(x − µ)ᵀQ(x − µ),
where the quadratic form is denoted q. If x was sampled by Algorithm 3, then q = zᵀz; otherwise, compute it as
• u = x − µ
• v = Qu
• q = uᵀv
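A matching sketch (Python/numpy; L is the Cholesky triangle from the sampler above):

```python
import numpy as np

def gmrf_logdensity(x, mu, Q, L):
    """log pi(x) for x ~ N(mu, Q^{-1}), with L the Cholesky triangle of Q."""
    n = len(x)
    u = x - mu                      # u = x - mu
    q = u @ (Q @ u)                 # v = Q u, then q = u^T v
    return -0.5 * n * np.log(2.0 * np.pi) + np.sum(np.log(np.diag(L))) - 0.5 * q
```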
Gaussian Markov Random Fields
Conditional sampling
Canonical parameterisation
The canonical parameterisation
A GMRF x wrt G with canonical parameterisation (b, Q) has density
π(x) ∝ exp( −(1/2) xᵀQx + bᵀx )   (8)
i.e., precision matrix Q and mean µ = Q⁻¹b.
Write this as
x ∼ N_C(b, Q).   (9)
The relation to the usual parameterisation is
N(µ, Q⁻¹) = N_C(Qµ, Q).
Gaussian Markov Random Fields
Conditional sampling
Conditional distribution in the canonical parameterisation
Conditional simulation of a GMRF
Decompose x as
(xA, xB)ᵀ ∼ N( (µA, µB)ᵀ, ( Q_AA Q_AB ; Q_BA Q_BB )⁻¹ ).
Then xA | xB has canonical parameterisation
xA − µA | xB ∼ N_C( −Q_AB(xB − µB), Q_AA ).   (10)
Simulate using an algorithm for the canonical parameterisation.
Gaussian Markov Random Fields
Conditional sampling
Sampling from a canonical parameterised GMRF
Sample x ∼ N_C(b, Q)
Recall that "N_C(b, Q) = N(Q⁻¹b, Q⁻¹)", so we need to compute the mean as well.
Algorithm 4 Sampling x ∼ N_C(b, Q)
1: Compute the Cholesky factorisation, Q = LLᵀ
2: Solve Lw = b
3: Solve Lᵀµ = w
4: Sample z ∼ N(0, I)
5: Solve Lᵀv = z
6: Compute x = µ + v
7: Return x
Gaussian Markov Random Fields
Sampling under hard linear constraints
Introduction
Sampling from x | Ax = e
• A is a k × n matrix, 0 < k < n, with rank k,
• e is a vector of length k.
This case occurs quite frequently:
a sum-to-zero constraint corresponds to k = 1, A = 1ᵀ and e = 0.
We will denote this problem sampling under a hard constraint.
Note that π(x | Ax) is singular with rank n − k.
Gaussian Markov Random Fields
Sampling under hard linear constraints
Introduction
The linear constraint makes the conditional distribution Gaussian,
• but it is singular, as the rank of the constrained covariance matrix is n − k;
• more care must be exercised when sampling from this distribution.
We have that
E(x | Ax = e) = µ − Q⁻¹Aᵀ(AQ⁻¹Aᵀ)⁻¹(Aµ − e)   (11)
Cov(x | Ax = e) = Q⁻¹ − Q⁻¹Aᵀ(AQ⁻¹Aᵀ)⁻¹AQ⁻¹   (12)
This is typically a dense-matrix case, which must be solved using general O(n³) algorithms (see the next frame for details).
Gaussian Markov Random Fields
Sampling under hard linear constraints
General algorithm (slow)
We can sample from this distribution as follows. As the covariance matrix is singular, we compute its eigenvalues and eigenvectors and write the covariance matrix as VΛVᵀ, where the columns of V are the eigenvectors and Λ is a diagonal matrix with the eigenvalues on the diagonal. This corresponds to a different factorisation than the Cholesky triangle, but any matrix C satisfying CCᵀ = Σ will do. Note that k of the eigenvalues are zero. We can produce a sample by computing v = Cz, where C = VΛ^{1/2} and z ∼ N(0, I), and then adding the conditional mean. We can compute the log-density as
log π(x | Ax = e) = −((n − k)/2) log 2π − (1/2) ∑_{i: Λii > 0} log Λii − (1/2)(x − µ*)ᵀ Σ⁻ (x − µ*)   (13)
where µ* is given by (11) and Σ⁻ = VΛ⁻Vᵀ, with (Λ⁻)ii = Λii⁻¹ if Λii > 0 and zero otherwise. In total, this is a quite computationally demanding procedure, as the algorithm is not able to take advantage of the sparse structure of Q.
Gaussian Markov Random Fields
Sampling under hard linear constraints
Specific algorithm with not too many constraints (fast)
Conditioning via Kriging
There is an alternative algorithm which corrects for the constraints at nearly no cost when k ≪ n.
Let x ∼ N(µ, Q⁻¹); then compute
x* = x − Q⁻¹Aᵀ(AQ⁻¹Aᵀ)⁻¹(Ax − e).   (14)
Now x* has the correct conditional distribution!
AQ⁻¹Aᵀ is a k × k matrix, hence its factorisation is fast to compute for small k.
Gaussian Markov Random Fields
Sampling under hard linear constraints
Specific algorithm with not too many constraints (fast)
Algorithm 5 Sampling x | Ax = e when x ∼ N(µ, Q⁻¹)
1: Compute the Cholesky factorisation, Q = LLᵀ
2: Sample z ∼ N(0, I)
3: Solve Lᵀv = z
4: Compute x = µ + v
5: Compute V_{n×k} = Q⁻¹Aᵀ with Algorithm 2
6: Compute W_{k×k} = AV
7: Compute U_{k×n} = W⁻¹Vᵀ using Algorithm 2
8: Compute c = Ax − e
9: Compute x* = x − Uᵀc
10: Return x*
If z = 0 in Algorithm 5, then x* is the conditional mean.
The extra cost is only O(k³) for large k!
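Conditioning via kriging, (14) and Algorithm 5, as a sketch (Python/numpy; the sum-to-zero constraint serves as the example):

```python
import numpy as np

def correct_for_constraint(x, Q, A, e):
    """Turn a sample x ~ N(mu, Q^{-1}) into x* distributed as x | Ax = e, as in (14)."""
    V = np.linalg.solve(Q, A.T)                  # V = Q^{-1} A^T (Algorithm 2 in the sparse case)
    W = A @ V                                    # the small k x k matrix A Q^{-1} A^T
    return x - V @ np.linalg.solve(W, A @ x - e)

rng = np.random.default_rng(4)
n = 5
Q = np.eye(n)                                    # independent unit-variance components
A = np.ones((1, n))                              # sum-to-zero: k = 1, A = 1^T, e = 0
x_star = correct_for_constraint(rng.standard_normal(n), Q, A, np.zeros(1))
assert np.isclose(x_star.sum(), 0.0)
```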
Gaussian Markov Random Fields
Sampling under hard linear constraints
Example
Example
Let x1, . . . , xn be independent Gaussian variables with variances σi² and means µi. To sample x conditioned on
∑ xi = 0,
we first sample
xi ∼ N(µi, σi²), i = 1, . . . , n,
and compute the constrained sample x* as
x*_i = xi − c σi²   (15)
where
c = (∑_j xj) / (∑_j σj²).
Gaussian Markov Random Fields
Sampling under hard linear constraints
Evaluating the log-density
The log-density can be rapidly computed using the following identity:
π(x | Ax) = π(x) π(Ax | x) / π(Ax)   (16)
Each term on the rhs is easier to compute than the lhs.
π(x)-term. This is a GMRF, and the log-density is easy to compute using L from Algorithm 5, step 1.
Gaussian Markov Random Fields
Sampling under hard linear constraints
Evaluating the log-density
The log-density can be rapidly computed using the following identity:
π(x | Ax) = π(x) π(Ax | x) / π(Ax)   (16)
Each term on the rhs is easier to compute than the lhs.
π(Ax | x)-term. This is a degenerate density, which is either zero or a constant:
log π(Ax | x) = −(1/2) log |AAᵀ|   (17)
The determinant of a k × k matrix can be found from its Cholesky factorisation.
Gaussian Markov Random Fields
Sampling under hard linear constraints
Evaluating the log-density
The log-density can be rapidly computed using the following identity:
π(x | Ax) = π(x) π(Ax | x) / π(Ax)   (16)
Each term on the rhs is easier to compute than the lhs.
π(Ax)-term. Ax is Gaussian with mean Aµ and covariance matrix AQ⁻¹Aᵀ, with the Cholesky triangle available from Algorithm 5, step 7.
Gaussian Markov Random Fields
Sampling under soft linear constraints
Introduction
Sampling from x | Ax = ε
Let x be a GMRF which is observed through e, where e | x ∼ N(Ax, Σε).
• e is a vector of length k < n
• A is a k × n matrix with rank k
• Σε > 0 is the covariance matrix of the noise.
The conditional for x is
log π(x | e) = −(1/2)(x − µ)ᵀQ(x − µ) − (1/2)(e − Ax)ᵀΣε⁻¹(e − Ax) + const,   (18)
so
x | e ∼ N_C( Qµ + AᵀΣε⁻¹e, Q + AᵀΣε⁻¹A ).   (19)
This precision matrix is most often a dense matrix, and the nice sparse structure of Q is lost.
Gaussian Markov Random Fields
Sampling under soft linear constraints
Introduction
Example
Observing the sum of the xi with unit-variance noise, the posterior precision is
Q + 11ᵀ,
which is a dense matrix.
In general, we have to sample from (18) using a general algorithm, which is computationally expensive for large n.
Gaussian Markov Random Fields
Sampling under soft linear constraints
Specific algorithm with not too many constraints (fast)
Alternative approach
This is a similar problem to sampling under a hard constraint Ax = e, but now "e" is stochastic:
Ax = ε where ε ∼ N(e, Σε).   (20)
We denote this case sampling under a soft constraint.
If we extend (14) to
x* = x − Q⁻¹Aᵀ(AQ⁻¹Aᵀ + Σε)⁻¹(Ax − ε),   (21)
where ε and x are samples from their respective distributions, then x* has the correct conditional distribution.
Gaussian Markov Random Fields
Sampling under soft linear constraints
The algorithm
Algorithm 6 Sampling x | Ax = ε when ε ∼ N(e, Σε) and x ∼ N(µ, Q⁻¹)
1: Compute the Cholesky factorisation, Q = LLᵀ
2: Sample z ∼ N(0, I)
3: Solve Lᵀv = z
4: Compute x = µ + v
5: Compute V_{n×k} = Q⁻¹Aᵀ using Algorithm 2 and L
6: Compute W_{k×k} = AV + Σε
7: Compute U_{k×n} = W⁻¹Vᵀ using Algorithm 2
8: Sample ε ∼ N(e, Σε)
9: Compute c = Ax − ε
10: Compute x* = x − Uᵀc
11: Return x*
When z = 0 and ε = e, then x* is the conditional mean.
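The soft-constraint version differs from the hard one only by the added Σε in the k × k system and by sampling ε (a sketch):

```python
import numpy as np

def correct_for_soft_constraint(x, Q, A, e, Sigma_eps, rng):
    """Turn x ~ N(mu, Q^{-1}) into a sample from x | Ax = eps, eps ~ N(e, Sigma_eps), as in (21)."""
    eps = rng.multivariate_normal(e, Sigma_eps)      # the stochastic "right-hand side"
    V = np.linalg.solve(Q, A.T)                      # Q^{-1} A^T
    W = A @ V + Sigma_eps                            # A Q^{-1} A^T + Sigma_eps
    return x - V @ np.linalg.solve(W, A @ x - eps)
```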
Gaussian Markov Random Fields
Sampling under soft linear constraints
Evaluate the log-density
The log-density is computed via
π(x | e) = π(x) π(e | x) / π(e)   (22)
All Cholesky triangles required are available from the simulation algorithm.
π(x)-term. This is a GMRF.
π(e | x)-term. e | x is Gaussian with mean Ax and covariance Σε.
π(e)-term. e is Gaussian with mean Aµ and covariance matrix AQ⁻¹Aᵀ + Σε.
Gaussian Markov Random Fields
Sampling under soft linear constraints
Example
Example
Let x1, . . . , xn be independent Gaussian variables with variances σi² and means µi. We now observe
e ∼ N(∑ xi, σε²).
To sample from π(x | e) we first sample
xi ∼ N(µi, σi²), i = 1, . . . , n,
and then
ε ∼ N(e, σε²).
A conditional sample x* is then given by
x*_i = xi − c σi², where c = (∑_j xj − ε) / (∑_j σj² + σε²).
Gaussian Markov Random Fields
Part III
Numerical methods for sparse matrices
Gaussian Markov Random Fields
Outline I
Introduction
Cholesky factorisation
The Cholesky triangle
  Interpretation
  The zero-pattern in L
  Example: A simple graph
  Example: auto-regressive processes
Band matrices
  Bandwidth is preserved
  Cholesky factorisation for band-matrices
Reordering schemes
  Introduction
  Reordering to band-matrices
  Reordering using the idea of nested dissection
What to do with sparse matrices?
Gaussian Markov Random Fields
Outline II
A numerical case-study
  Introduction
  GMRF-models in time
  GMRF-models in space
  GMRF-models in time×space
Gaussian Markov Random Fields
Introduction
Numerical methods for sparse matrices
We have shown that computations on GMRFs can be expressed such that the main tasks are
1. compute the Cholesky factorisation Q = LLᵀ, and
2. solve Lv = b and Lᵀx = z.
The second task is much faster than the first, but sparsity is an advantage here as well.
Gaussian Markov Random Fields
Introduction
The goal is to explain
• why a sparse Q allows for fast factorisation,
• how we can take advantage of it,
• why we gain if we permute the vertices before factorising the matrix,
• how statisticians can benefit from recent research in this area by numerical mathematicians.
At the end we present a small case study factorising some typical matrices for GMRFs, using classical and more recent methods for factorising matrices.
Gaussian Markov Random Fields
Cholesky factorisation
How to compute the Cholesky factorisation
Qij = ∑_{k=1}^{j} Lik Ljk,  i ≥ j.
Define
vi = Qij − ∑_{k=1}^{j−1} Lik Ljk,  i ≥ j.
Then
• Ljj² = vj, and
• Lij Ljj = vi for i > j.
If we know vi for fixed j, then
Ljj = √vj and Lij = vi/√vj for i = j + 1, . . . , n.
This gives the jth column of L.
Gaussian Markov Random Fields
Cholesky factorisation
Cholesky factorization of Q > 0
Algorithm 7 Computing the Cholesky triangle L of Q
1: for j = 1 to n do
2:   v_{j:n} = Q_{j:n, j}
3:   for k = 1 to j − 1 do v_{j:n} = v_{j:n} − L_{j:n,k} Ljk
4:   L_{j:n,j} = v_{j:n}/√vj
5: end for
6: Return L
The overall process requires n³/3 flops.
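Algorithm 7 written out (a didactic Python/numpy sketch; production code would call LAPACK or a sparse solver):

```python
import numpy as np

def cholesky_columns(Q):
    """Column-by-column Cholesky: returns lower-triangular L with Q = L L^T."""
    n = Q.shape[0]
    L = np.zeros_like(Q, dtype=float)
    for j in range(n):
        v = Q[j:, j].astype(float).copy()     # v_{j:n} = Q_{j:n, j}
        for k in range(j):
            v -= L[j:, k] * L[j, k]           # v_{j:n} -= L_{j:n,k} L_{jk}
        L[j:, j] = v / np.sqrt(v[0])          # L_{j:n,j} = v_{j:n} / sqrt(v_j)
    return L

M = np.random.default_rng(5).standard_normal((4, 4))
Q = M @ M.T + 4 * np.eye(4)
assert np.allclose(cholesky_columns(Q), np.linalg.cholesky(Q))
```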
Gaussian Markov Random Fields
The Cholesky triangle
Interpretation
Interpretation of L (I)
Let Q = LLᵀ. Then the solution of
Lᵀx = z where z ∼ N(0, I)
is N(0, Q⁻¹) distributed.
Since Lᵀ is upper triangular, x is computed in a backward loop:
xn = zn / Lnn
x_{n−1} = (z_{n−1} − L_{n,n−1} xn) / L_{n−1,n−1}
. . .
Gaussian Markov Random Fields
The Cholesky triangle
Interpretation
Interpretation of L (II)
Theorem
Let x be a GMRF wrt the labelled graph G, with mean µ and precision matrix Q > 0. Let L be the Cholesky triangle of Q. Then for i ∈ V,
E(xi | x_{(i+1):n}) = µi − (1/Lii) ∑_{j=i+1}^{n} Lji (xj − µj) and
Prec(xi | x_{(i+1):n}) = Lii².
Gaussian Markov Random Fields
The Cholesky triangle
The zero-pattern in L
Determine the zero-pattern in L (I)
Theorem
Let x be a GMRF wrt G, with mean µ and precision matrix Q > 0. Let L be the Cholesky triangle of Q and define, for 1 ≤ i < j ≤ n, the set
F(i, j) = {i + 1, . . . , j − 1, j + 1, . . . , n},
which is the future of i except j. Then
xi ⊥ xj | x_{F(i,j)} ⇐⇒ Lji = 0.
If we can verify that Lji is zero, we do not have to compute it when factorising Q.
Gaussian Markov Random Fields
The Cholesky triangle
The zero-pattern in L
Proof
Assume µ = 0 and fix 1 ≤ i < j ≤ n. The theorem on the interpretation of L gives that
π(x_{i:n}) ∝ exp( −(1/2) ∑_{k=i}^{n} Lkk² ( xk + (1/Lkk) ∑_{j=k+1}^{n} Ljk xj )² )
= exp( −(1/2) x_{i:n}ᵀ Q^{(i:n)} x_{i:n} ),
where Q^{(i:n)}_{ij} = Lii Lji. Then
xi ⊥ xj | x_{F(i,j)} ⇐⇒ Lii Lji = 0,
which is equivalent to Lji = 0, since Lii > 0 as Q^{(i:n)} > 0.
Gaussian Markov Random Fields
The Cholesky triangle
The zero-pattern in L
Determine the zero-pattern in L (II)
The global Markov property provides a simple and sufficient criterion for checking whether Lji = 0.
Corollary
If F(i, j) separates i < j in G, then Lji = 0.
Corollary
If i ∼ j, then F(i, j) does not separate i < j.
The idea is simple
• Use the global Markov property to check if Lji = 0.
• Compute only the non-zero terms in L, so that Q = LLT .
Gaussian Markov Random Fields
The Cholesky triangle
Example: A simple graph
Example
Q =
× × × ·
× × · ×
× · × ×
· × × ×

L =
×
× ×
× ? ×
? × × ×

("?" marks an entry that is zero in Q but may or may not become non-zero in L.)
Gaussian Markov Random Fields
The Cholesky triangle
Example: A simple graph
Example
Q =
× × × ·
× × · ×
× · × ×
· × × ×

L =
×
× ×
× √ ×
· × × ×

Using the global Markov property: F(1, 4) = {2, 3} separates 1 and 4, so L41 = 0, while L32 (marked √) is a genuine fill-in.
Gaussian Markov Random Fields
The Cholesky triangle
Example: auto-regressive processes
Example: AR(1)-process
xt | x1:(t−1) ∼ N (φxt−1, σ2), t = 1, . . . , n
Q =
× ×× × ×× × ×× × ×× × ×× × ×× ×
L =
×× ×× ×× ×× ×× ×× ×
Gaussian Markov Random Fields
Band matrices
Bandwidth is preserved
Bandwidth is preserved
Similarly, for an AR(p)-process:
• Q has bandwidth p.
• L has lower bandwidth p.
Theorem
Let Q > 0 be a band matrix with bandwidth p and dimension n; then the Cholesky triangle of Q has (lower) bandwidth p.
It is easy to modify existing Cholesky-factorisation code to use only entries where |i − j| ≤ p.
Gaussian Markov Random Fields
Band matrices
Cholesky factorisation for band-matrices
Avoid computing Lij and reading Qij for |i − j | > p.
Algorithm 8 Band-Cholesky factorization of Q with bandwidth p
1: for j = 1 to n do2: λ = minj + p, n3: vj :λ = Qj :λ,j
4: for k = max1, j − p to j − 1 do5: i = mink + p, n6: vj :i = vj :i − Lj :i ,kLjk
7: end for8: Lj :λ,j = vj :λ/
√vj
9: end for10: Return L
Cost is now n(p2 + 3p) flops assuming n p.
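The same loop restricted to the band (a sketch; LAPACK's DPBTRF, mentioned later, is the production version):

```python
import numpy as np

def band_cholesky(Q, p):
    """Cholesky triangle of a banded SPD matrix, touching only entries with |i - j| <= p."""
    n = Q.shape[0]
    L = np.zeros_like(Q, dtype=float)
    for j in range(n):
        lam = min(j + p + 1, n)                 # lambda = min{j + p, n} (exclusive bound)
        v = Q[j:lam, j].astype(float).copy()
        for k in range(max(0, j - p), j):
            i = min(k + p + 1, n)               # i = min{k + p, n}
            v[: i - j] -= L[j:i, k] * L[j, k]
        L[j:lam, j] = v / np.sqrt(v[0])
    return L

Q = np.diag(np.full(6, 2.0)) + np.diag(np.full(5, -1.0), 1) + np.diag(np.full(5, -1.0), -1)
assert np.allclose(band_cholesky(Q, 1), np.linalg.cholesky(Q))
```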
Gaussian Markov Random Fields
Reordering schemes
Introduction
Reorder the vertices
We can permute the vertices:
select one of the n! possible permutations and define the corresponding permutation matrix P, such that i_P = Pi, where i = (1, . . . , n)ᵀ, is the new ordering of the vertices.
Choose P, if possible, such that
Q_P = PQPᵀ   (23)
is a band matrix with a small bandwidth.
Gaussian Markov Random Fields
Reordering schemes
Introduction
• It is impossible in general to obtain the optimal permutation; n! is too large!
• A sub-optimal ordering will do as well.
• Solve Qµ = b as follows:
  • compute b_P = Pb;
  • solve Q_P µ_P = b_P;
  • map the solution back, µ = Pᵀµ_P.
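scipy ships a bandwidth-reducing permutation (reverse Cuthill-McKee), which plays the same role; a sketch with an arrowhead-type matrix (one "global" node coupled to all others):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee
from scipy.sparse.linalg import spsolve

n = 50
Q = sp.lil_matrix((n, n))
Q.setdiag(np.full(n, float(n)))              # diagonally dominant, hence SPD
Q[0, 1:] = -1.0
Q[1:, 0] = -1.0
Q = Q.tocsr()

perm = reverse_cuthill_mckee(Q, symmetric_mode=True)   # the permutation i_P
QP = Q[perm, :][:, perm]                               # Q_P = P Q P^T

b = np.ones(n)
muP = spsolve(QP.tocsc(), b[perm])                     # solve in the permuted ordering
mu = np.empty(n)
mu[perm] = muP                                         # map the solution back
assert np.allclose(Q @ mu, b)
```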
Gaussian Markov Random Fields
Reordering schemes
Reordering to band-matrices
Reordering to band-matrices
[figure-only slides: sparsity patterns before and after the bandwidth reordering]
Gaussian Markov Random Fields
Reordering schemes
Reordering to band-matrices
Reordering to band-matrices
• Factorisation of Q_P required about 0.0018 seconds.
• Solving Qµ = b required about 0.0006 seconds.
• On a 1200 MHz laptop.
Gaussian Markov Random Fields
Reordering schemes
Reordering using the idea of nested dissection
More optimal reordering schemes
A "global" node ordered first (a neighbour of all the others):
Q =
× × × × × ×
× × · · · ·
× · × · · ·
× · · × · ·
× · · · × ·
× · · · · ×

L =
×
× ×
× √ ×
× √ √ ×
× √ √ √ ×
× √ √ √ √ ×

(√ marks fill-in: L becomes completely dense.)
Ordering the global node last instead:
Q =
× · · · · ×
· × · · · ×
· · × · · ×
· · · × · ×
· · · · × ×
× × × × × ×

L =
×
· ×
· · ×
· · · ×
· · · · ×
× × × × × ×

(no fill-in at all).
Gaussian Markov Random Fields
Reordering schemes
Reordering using the idea of nested dissection
Nested dissection reordering (I)
The idea generalises as follows.
• Select a (small) set of nodes whose removal divides the graph into two disconnected subgraphs of almost equal size.
• Order the chosen nodes after ordering all the nodes in both subgraphs.
• Apply this procedure recursively to the nodes in each subgraph.
Costs in the spatial case:
• Factorisation O(n^{3/2})
• Fill-in O(n log n)
• Optimal in the order sense.
Gaussian Markov Random Fields
Reordering schemes
Reordering using the idea of nested dissection
Nested dissection reordering (II)
Gaussian Markov Random Fields
What to do with sparse matrices?
Statisticians and numerical methods for sparse matrices
Gupta (2002) summarises his findings on recent advances in sparse linear solvers:
". . . recent sparse solvers have significantly improved the state of the art of the direct solution of general sparse systems . . . recent years have seen some remarkable advances in the general sparse direct-solver algorithms and software."
• Good news:
• we can just use their software.
Gaussian Markov Random Fields
A numerical case-study
Introduction
A numerical case-study of typical GMRFs
Divide GMRF models into three categories.
1. GMRF models in time or on a line: auto-regressive models and models for smooth functions. Neighbours of xt are those xs with |s − t| ≤ p.
2. Spatial GMRF models: a regular lattice, or an irregular lattice induced by a tessellation or by regions of land. Neighbours of xi (spatial index i) are those j spatially "close" to i, where "close" is defined from the context.
3. Spatio-temporal GMRF models: often an extension of spatial models to also include dynamic changes.
Gaussian Markov Random Fields
A numerical case-study
Introduction
Also include some "global" nodes.
Let x be a GMRF with a common mean µ,
x | µ ∼ N(µ1, Q⁻¹),   (24)
and assume µ ∼ N(0, σ²).
Then (x, µ) is also a GMRF, where the node µ is a neighbour of all the xi's.
Gaussian Markov Random Fields
A numerical case-study
Introduction
Two different algorithms
1. The band-Cholesky factorisation (BCF) as in Algorithm 8. Here we use the LAPACK routines DPBTRF and DTBSV for the factorisation and the forward/back-substitution, respectively, and the Gibbs-Poole-Stockmeyer algorithm for bandwidth reduction.
2. The multifrontal supernodal Cholesky factorisation (MSCF) implementation in the library TAUCS, using the nested dissection reordering from the library METIS.
Both solvers are available in the GMRFLib library.
Gaussian Markov Random Fields
A numerical case-study
Introduction
The tasks we want to investigate are
1. factorising Q into LLᵀ, and
2. solving LLᵀµ = b.
Producing a random sample from the GMRF is half the cost of solving the linear system in step 2.
All tests reported here were conducted on a 1200 MHz laptop with 512 MB memory running Linux.
New machines are nearly 3-10 times as fast...
Gaussian Markov Random Fields
A numerical case-study
GMRF-models in time
GMRF models in time
Let Q be a band matrix with bandwidth p and dimension n. For such a problem, using BCF will be (theoretically) optimal, as the fill-in will be zero.

CPU-time (seconds):
            n = 10³            n = 10⁴            n = 10⁵
            p = 5    p = 25    p = 5    p = 25    p = 5    p = 25
Factorise   0.0005   0.0019    0.0044   0.0271    0.0443   0.2705
Solve       0.0000   0.0004    0.0031   0.0109    0.0509   0.1052

"Long and thin" is fast! The MSCF is less optimal for band matrices: fill-in and more complicated data structures.
Gaussian Markov Random Fields
A numerical case-study
GMRF-models in time
Add 10 global nodes:

CPU-time (seconds):
            n = 10³            n = 10⁴            n = 10⁵
            p = 5    p = 25    p = 5    p = 25    p = 5    p = 25
Factorise   0.0119   0.0335    0.1394   0.4085    1.6396   4.1679
Solve       0.0007   0.0035    0.0138   0.0306    0.1541   0.3078

• The fill-in is ≈ pn.
• The nested dissection ordering gives good results in all cases considered so far.
Gaussian Markov Random Fields
A numerical case-study
GMRF-models in space
Spatial GMRF models
Regular lattice
Gaussian Markov Random Fields
A numerical case-study
GMRF-models in space
Spatial GMRF models
Irregular lattice
Gaussian Markov Random Fields
A numerical case-study
GMRF-models in space
CPU-time (seconds):
                      n = 100²         n = 150²         n = 200²
CPU-time   Method     3×3     5×5      3×3     5×5      3×3      5×5
Factorise  BCF        0.51    1.02     2.60    4.93     13.30    38.12
           MSCF       0.17    0.62     0.55    1.92     1.91     4.90
Solve      BCF        0.03    0.05     0.10    0.16     0.24     0.43
           MSCF       0.01    0.04     0.04    0.11     0.08     0.21

• For the largest lattice the MSCF really outperforms the BCF.
• The reason is the O(n^{3/2}) cost for the MSCF compared to O(n²) for the BCF.
Gaussian Markov Random Fields
A numerical case-study
GMRF-models in time×space
Spatio-temporal GMRF models
Spatio-temporal GMRF models are often an extension of spatial GMRF models to account for time variation.
[figure: space×time graph]
Use this model and the graph of Germany, for T = 10 and T = 100.
Gaussian Markov Random Fields
A numerical case-study
GMRF-models in time×space
CPU-time (seconds; last two columns with 10 global nodes added):
                                 10 global nodes
CPU-time     T = 10   T = 100    T = 10   T = 100
Factorise    0.25     39.96      0.31     39.22
Solve        0.02     0.42       0.02     0.42

• The results show a quite heavy dependency on T.
• Spatio-temporal GMRFs are more demanding than spatial ones: O(n²) for an n^{1/3} × n^{1/3} × n^{1/3} cube.
Gaussian Markov Random Fields
Part IV
Intrinsic GMRFs
Gaussian Markov Random Fields
Outline I
Introduction
Definition of IGMRFs
  Improper GMRFs
IGMRFs of first order
  Forward differences
  On the line with regular locations
  On the line with irregular locations
  Why IGMRFs are useful in applications
  IGMRFs of first order on irregular lattices
  IGMRFs of first order on regular lattices
IGMRFs of second order
  RW2 model for regular locations
  The CRWk model for regular and irregular locations
  IGMRFs of general order on regular lattices
  IGMRFs of second order on regular lattices
Gaussian Markov Random Fields
Introduction
Intrinsic GMRFs (IGMRF)
• IGMRFs are improper, i.e., they have precision matrices not offull rank.
• Often used as prior distributions in various applications.
• Of particular importance are IGMRFs that are invariant to anytrend that is a polynomial of the locations of the nodes up toa specific order.
Gaussian Markov Random Fields
Definition of IGMRFs
Improper GMRFs
Improper GMRF
Definition
Let Q be an n × n SPSD matrix with rank n − k > 0. Then x = (x1, . . . , xn)ᵀ is an improper GMRF of rank n − k with parameters (µ, Q), if its density is
π(x) = (2π)^{−(n−k)/2} (|Q|*)^{1/2} exp( −(1/2)(x − µ)ᵀQ(x − µ) ).   (25)
Further, x is an improper GMRF wrt the labelled graph G = (V, E), where
Qij ≠ 0 ⇐⇒ {i, j} ∈ E for all i ≠ j.
| · |* denotes the generalised determinant: the product of the non-zero eigenvalues.
Gaussian Markov Random Fields
Definition of IGMRFs
Improper GMRFs
Markov properties
The Markov properties are interpreted as those obtained from the limit of a proper density.
Let the columns of Aᵀ span the null space of Q, and define
Q(γ) = Q + γAᵀA.   (26)
The Markov properties of Q(γ) tend to the corresponding ones of Q as γ → 0.
In particular,
E(xi | x−i) = µi − (1/Qii) ∑_{j∼i} Qij(xj − µj)
is interpreted in the limit γ → 0.
Gaussian Markov Random Fields
IGMRFs of first order
IGMRF of first order
An intrinsic GMRF of first order is an improper GMRF of rank n − 1 where Q1 = 0.
The condition Q1 = 0 means that
∑_j Qij = 0, i = 1, . . . , n.
Gaussian Markov Random Fields
IGMRFs of first order
Let µ = 0, so
E(xi | x−i) = −(1/Qii) ∑_{j: j∼i} Qij xj   (27)
with weights satisfying
−∑_{j: j∼i} Qij/Qii = 1.
• The conditional mean of xi is a weighted mean of its neighbours.
• There is no shrinkage towards an overall level.
• Many IGMRFs are constructed such that the deviation from the overall level is a smooth curve in time or a smooth surface in space.
Gaussian Markov Random Fields
IGMRFs of first order
Forward differences
Definition (Forward difference)
Define the first-order forward difference of a function f(·) as
∆f(z) = f(z + 1) − f(z).
Higher-order forward differences are defined recursively,
∆^k f(z) = ∆∆^{k−1} f(z),
so
∆²f(z) = f(z + 2) − 2f(z + 1) + f(z)   (28)
and in general, for k = 1, 2, . . .,
∆^k f(z) = (−1)^k ∑_{j=0}^{k} (−1)^j C(k, j) f(z + j),
where C(k, j) is the binomial coefficient.
Gaussian Markov Random Fields
IGMRFs of first order
Forward differences
For a vector z = (z1, z2, . . . , zn)ᵀ, ∆z has elements
∆zi = z_{i+1} − zi, i = 1, . . . , n − 1.
The forward difference of kth order is an approximation to the kth derivative of f(z), recalling that
f′(z) = lim_{h→0} (f(z + h) − f(z))/h.
Gaussian Markov Random Fields
IGMRFs of first order
On the line with regular locations
IGMRFs of first order on the line
The location of node i is i (think "time").
Assume independent increments:
∆xi ∼ N(0, κ⁻¹), independently for i = 1, . . . , n − 1,   (29)
so that
xj − xi ∼ N(0, (j − i)κ⁻¹) for i < j.   (30)
If the intersection between {i, . . . , j} and {k, . . . , l} is empty for i < j and k < l, then
Cov(xj − xi, xl − xk) = 0.   (31)
These properties coincide with those of a Wiener process.
Gaussian Markov Random Fields
IGMRFs of first order
On the line with regular locations
The density for x is
π(x | κ) ∝ κ^{(n−1)/2} exp( −(κ/2) ∑_{i=1}^{n−1} (∆xi)² )   (32)
= κ^{(n−1)/2} exp( −(κ/2) ∑_{i=1}^{n−1} (x_{i+1} − xi)² ),   (33)
or
π(x) ∝ κ^{(n−1)/2} exp( −(κ/2) xᵀRx ).   (34)
Gaussian Markov Random Fields
IGMRFs of first order
On the line with regular locations
The structure matrix:
R =
  1  −1
 −1   2  −1
     −1   2  −1
         ⋱   ⋱   ⋱
            −1   2  −1
                −1   2  −1
                    −1   1
   (35)
• The n − 1 independent increments ensure that the rank of Q is n − 1.
• We denote this model by RW1(κ), or RW1 for short.
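R can be built directly as DᵀD, with D the (n−1) × n first-order difference matrix (a Python/numpy sketch, not from the slides):

```python
import numpy as np

def rw1_structure(n):
    """Structure matrix R = D^T D of the RW1 model, D being the first-difference matrix."""
    D = np.zeros((n - 1, n))
    idx = np.arange(n - 1)
    D[idx, idx] = -1.0
    D[idx, idx + 1] = 1.0
    return D.T @ D

R = rw1_structure(5)
assert np.allclose(R @ np.ones(5), 0.0)          # invariant to adding a constant: R 1 = 0
assert np.linalg.matrix_rank(R) == 4             # rank n - 1
```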
Gaussian Markov Random Fields
IGMRFs of first order
On the line with regular locations
Full conditionals
xi | x−i, κ ∼ N( (x_{i−1} + x_{i+1})/2, 1/(2κ) ), 1 < i < n.   (36)
Alternative interpretation:
• Fit a first-order polynomial
p(j) = β0 + β1 j   (37)
locally through the points (i − 1, x_{i−1}) and (i + 1, x_{i+1}).
• The conditional mean is then p(i).
Gaussian Markov Random Fields
IGMRFs of first order
On the line with regular locations
Example
[figure]
Samples with n = 99 and κ = 1, obtained by conditioning on the constraint ∑ xi = 0.
Gaussian Markov Random Fields
IGMRFs of first order
On the line with regular locations
Example
Marginal variances Var(xi ) for i = 1, . . . , n.
Gaussian Markov Random Fields
IGMRFs of first order
On the line with regular locations
Example
Corr(xn/2, xi ) for i = 1, . . . , n
Gaussian Markov Random Fields
IGMRFs of first order
On the line with irregular locations
The first order RW for irregular locations
The location of xi is si, but si+1 − si is not constant. Assume
s1 < s2 < · · · < sn
and let
δi := si+1 − si.   (38)
Gaussian Markov Random Fields
IGMRFs of first order
On the line with irregular locations
Wiener process
Consider xi as the realisation of a Brownian motion in continuous time, i.e. a Wiener process W(t), observed at time si.
Definition (Wiener process)
• A Wiener process with precision κ is a continuous-time stochastic process W(t), t ≥ 0, with W(0) = 0, such that the increments W(t) − W(s) are Gaussian with mean 0 and variance (t − s)/κ for any 0 ≤ s < t.
• Furthermore, increments over non-overlapping time intervals are independent.
• For κ = 1, this process is called a standard Wiener process.
Gaussian Markov Random Fields
IGMRFs of first order
On the line with irregular locations
Brownian bridge
The full conditional is in this case known as the Brownian bridge:
E(xi | x−i, κ) = (δi x_{i−1} + δ_{i−1} x_{i+1}) / (δ_{i−1} + δi)   (39)
and
Prec(xi | x−i, κ) = κ (1/δ_{i−1} + 1/δi).   (40)
κ is a precision parameter.
Gaussian Markov Random Fields
IGMRFs of first order
On the line with irregular locations
The precision matrix now has, for 1 < i < n,
Qij = κ (1/δ_{i−1} + 1/δi) for j = i, Qij = −κ/δi for j = i + 1, and Qij = 0 otherwise.   (41)
A proper correction at the boundary gives the remaining diagonal terms, Q11 = κ/δ1 and Qnn = κ/δ_{n−1}.
Gaussian Markov Random Fields
IGMRFs of first order
On the line with irregular locations
The joint density is
π(x | κ) ∝ κ^{(n−1)/2} exp( −(κ/2) ∑_{i=1}^{n−1} (x_{i+1} − xi)²/δi ),   (42)
and it is invariant to the addition of a constant.
Gaussian Markov Random Fields
IGMRFs of first order
On the line with irregular locations
• The interpretation of the RW1 model as a discretely observed Wiener process justifies the corrections needed.
• The underlying model is the same; it is only observed differently.
[figure]
Gaussian Markov Random Fields
IGMRFs of first order
Why IGMRFs are useful in applications
When the mean is only locally constant
Alternative IGMRF
π(x) ∝ κ^{(n−1)/2} exp( −(κ/2) ∑_{i=1}^{n} (xi − x̄)² )   (43)
where x̄ is the empirical mean of x.
Gaussian Markov Random Fields
IGMRFs of first order
Why IGMRFs are useful in applications
Assume n is even and
xi = 0 for 1 ≤ i ≤ n/2, and xi = 1 for n/2 < i ≤ n.   (44)
Thus x is locally constant with two levels.
Evaluating the density at this configuration under the RW1 model and under the alternative (43), we obtain
κ^{(n−1)/2} exp(−κ/2) and κ^{(n−1)/2} exp(−nκ/8),   (45)
respectively. The log-ratio of the two densities is then of order O(n).
Gaussian Markov Random Fields
IGMRFs of first order
Why IGMRFs are useful in applications
• The RW1 model only penalises the local deviation from aconstant level.
• The alternative penalises the global deviation from a constantlevel.
This local behaviour is advantageous in applications if the mean level of x is approximately or locally constant.
A similar argument also applies to polynomial IGMRFs of higher order, constructed using forward differences of order k as independent Gaussian increments.
Gaussian Markov Random Fields
IGMRFs of first order
IGMRFs of first order on irregular lattices
First order IGMRFs on irregular lattices
Consider the map of the 544 regions in Germany. Two regions are neighbours if they share a common border.
Gaussian Markov Random Fields
IGMRFs of first order
IGMRFs of first order on irregular lattices
Between neighbouring regions i and j we define an "independent" Gaussian increment
xi − xj ∼ N(0, κ⁻¹).   (46)
Assuming independent increments yields
π(x) ∝ κ^{(n−1)/2} exp( −(κ/2) ∑_{i∼j} (xi − xj)² ).   (47)
Here "i ∼ j" denotes the set of all unordered pairs of neighbours.
The number of increments |i ∼ j| is larger than n, but the rank of the corresponding precision matrix is still n − 1.
Gaussian Markov Random Fields
IGMRFs of first order
IGMRFs of first order on irregular lattices
There are hidden constraints in the increments, due to the more complicated geometry of a lattice compared to the line.
Example
Let n = 3 where all nodes are neighbours. Then x1 − x2 = ε1, x2 − x3 = ε2, and x3 − x1 = ε3, where ε1, ε2 and ε3 are the increments.
This implies that
ε1 + ε2 + ε3 = 0,
which is the "hidden" linear constraint.
Gaussian Markov Random Fields
IGMRFs of first order
IGMRFs of first order on irregular lattices
Let ni denote the number of neighbours of region i. The precision matrix Q is
Qij = κ ni for i = j, Qij = −κ for i ∼ j, and Qij = 0 otherwise,   (48)
with full conditionals
xi | x−i, κ ∼ N( (1/ni) ∑_{j: j∼i} xj, 1/(ni κ) ).   (49)
Gaussian Markov Random Fields
IGMRFs of first order
IGMRFs of first order on irregular lattices
Example
Gaussian Markov Random Fields
IGMRFs of first order
IGMRFs of first order on irregular lattices
• In general there is no longer an underlying continuousstochastic process that we can relate to this density.
• If we change the spatial resolution or split a region into twonew ones, we change the model.
Gaussian Markov Random Fields
IGMRFs of first order
IGMRFs of first order on irregular lattices
Weighted variants
Incorporate symmetric weights wij for each pair of adjacent nodes i and j; for example, wij = 1/d(i, j).
Assuming independent increments
xi − xj ∼ N(0, 1/(wij κ)),   (50)
we obtain
π(x) ∝ κ^{(n−1)/2} exp( −(κ/2) ∑_{i∼j} wij (xi − xj)² ).   (51)
Gaussian Markov Random Fields
IGMRFs of first order
IGMRFs of first order on irregular lattices
The precision matrix is
Qij = κ ∑_{k: k∼i} wik for i = j, Qij = −κ wij for i ∼ j, and Qij = 0 otherwise,   (52)
with conditional mean and precision
E(xi | x−i) = ( ∑_{j: j∼i} wij xj ) / ( ∑_{j: j∼i} wij ) and Prec(xi | x−i) = κ ∑_{j: j∼i} wij,   (53)
respectively.
Gaussian Markov Random Fields
IGMRFs of first order
IGMRFs of first order on regular lattices
First order IGMRFs on regular lattices
For a lattice I_N with n = n1 n2 nodes, let i = (i1, i2) denote the node in the i1th row and i2th column.
Use the nearest four sites of i as its neighbours:
(i1 + 1, i2), (i1 − 1, i2), (i1, i2 + 1), (i1, i2 − 1).
The precision matrix is
Qij = κ ni for i = j, Qij = −κ for i ∼ j, and Qij = 0 otherwise,   (54)
and the full conditionals for xi are
xi | x−i, κ ∼ N( (1/ni) ∑_{j: j∼i} xj, 1/(ni κ) ).   (55)
Gaussian Markov Random Fields
IGMRFs of first order
IGMRFs of first order on regular lattices
Limiting behaviour
This model has an important property:
the process converges to the de Wijs process, a Gaussian process with variogram log(distance).
This is important because
• there is a limiting process, and
• the de Wijs process has a dense precision matrix, whereas the IGMRF is very sparse!
Gaussian Markov Random Fields
IGMRFs of second order
RW2 model for regular locations
The RW2 model for regular locations
Let si = i for i = 1, . . . , n, with a constant distance between consecutive nodes.
Use the second-order increments
∆²xi ∼ N(0, κ⁻¹), independently,   (56)
for i = 1, . . . , n − 2, to define the joint density of x:
π(x) ∝ κ^{(n−2)/2} exp( −(κ/2) ∑_{i=1}^{n−2} (xi − 2x_{i+1} + x_{i+2})² )   (57)
= κ^{(n−2)/2} exp( −(κ/2) xᵀRx ).   (58)
Gaussian Markov Random Fields
IGMRFs of second order
RW2 model for regular locations
R =
  1 −2  1
 −2  5 −4  1
  1 −4  6 −4  1
     1 −4  6 −4  1
        ⋱  ⋱  ⋱  ⋱  ⋱
           1 −4  6 −4  1
              1 −4  6 −4  1
                 1 −4  5 −2
                    1 −2  1
   (59)
• One can verify directly that QS1 = 0 and that the rank of Q is n − 2.
• An IGMRF of second order is invariant to adding a line to x.
• Known as the second-order random walk model, denoted RW2(κ) or simply RW2.
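Analogously to RW1, R = DᵀD with D the (n−2) × n second-difference matrix reproduces (59); a sketch that also checks the invariances:

```python
import numpy as np

def rw2_structure(n):
    """Structure matrix R = D^T D of the RW2 model, D being the second-difference matrix."""
    D = np.zeros((n - 2, n))
    idx = np.arange(n - 2)
    D[idx, idx] = 1.0
    D[idx, idx + 1] = -2.0
    D[idx, idx + 2] = 1.0
    return D.T @ D

R = rw2_structure(7)
line = np.arange(7, dtype=float)
assert np.allclose(R @ np.ones(7), 0.0)      # invariant to adding a constant ...
assert np.allclose(R @ line, 0.0)            # ... and to adding a line, so the rank is n - 2
```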
Gaussian Markov Random Fields
IGMRFs of second order
RW2 model for regular locations
Remarks
• This is the RW2 model defined and used in the literature.
• We cannot extend it consistently to the case where thelocations are irregular.
• Similar problems occur if we increase the resolution from n to2n locations, say.
• This is in contrast to the RW1 model.
• There exists an alternative (somewhat more involved)formulation with the desired continuous time interpretation.
Gaussian Markov Random Fields
IGMRFs of second order
RW2 model for regular locations
The conditional mean and precision are
E(xi | x−i, κ) = (4/6)(x_{i+1} + x_{i−1}) − (1/6)(x_{i+2} + x_{i−2}),   (60)
Prec(xi | x−i, κ) = 6κ,   (61)
respectively, for 2 < i < n − 2.
Gaussian Markov Random Fields
IGMRFs of second order
RW2 model for regular locations
Consider the second-order polynomial
p(j) = β0 + β1 j + (1/2) β2 j²   (62)
and compute the coefficients by a local least-squares fit to the points
(i − 2, x_{i−2}), (i − 1, x_{i−1}), (i + 1, x_{i+1}), (i + 2, x_{i+2}).   (63)
It turns out that
E(xi | x−i, κ) = p(i).
Gaussian Markov Random Fields
IGMRFs of second order
RW2 model for regular locations
Samples with n = 99 and κ = 1 by conditioning on the constraints.
Gaussian Markov Random Fields
IGMRFs of second order
RW2 model for regular locations
Marginal variances Var(xi ) for i = 1, . . . , n.
Gaussian Markov Random Fields
IGMRFs of second order
RW2 model for regular locations
Corr(xn/2, xi ) for i = 1, . . . , n
Gaussian Markov Random Fields
The CRWk model for regular and irregular locations
Random walk models: irregular locations
• Locations s1 < s2 < . . . < sn.
• Discretely observed (integrated) Wiener process
• Augment with the velocities
• Valid for any order
• Connection to spline-models
Gaussian Markov Random Fields
The CRWk model for regular and irregular locations
Random walk models: irregular locations
The precision matrix has the block-tridiagonal structure
  A1    B1
  B1ᵀ   A2 + C1   B2
        B2ᵀ       A3 + C2   B3
                  ⋱         ⋱          B_{n−1}
                            B_{n−1}ᵀ   C_{n−1}
Gaussian Markov Random Fields
The CRWk model for regular and irregular locations
Random walk models: irregular locations
The blocks are
Ai = ( 12/δi³   6/δi²
        6/δi²   4/δi ),
Bi = ( −12/δi³   6/δi²
        −6/δi²   2/δi ),
Ci = ( 12/δi³   −6/δi²
       −6/δi²    4/δi ),
where δi = si+1 − si.
Gaussian Markov Random Fields
IGMRFs of general order on regular lattices
IGMRFs of higher order on regular lattices
The precision matrix is orthogonal to a polynomial design matrix of a certain degree.
Definition (IGMRF of order k in dimension d)
An IGMRF of order k in dimension d is an improper GMRF of rank n − m_{k−1,d} where Q S_{k−1,d} = 0.
Gaussian Markov Random Fields
IGMRFs of second order on regular lattices
A second order IGMRF in two dimensions
Consider a regular lattice I_N in d = 2 dimensions, where s_{i1} = i1 and s_{i2} = i2.
Choose the independent increments
( x_{(i1+1,i2)} + x_{(i1−1,i2)} + x_{(i1,i2+1)} + x_{(i1,i2−1)} ) − 4 x_{(i1,i2)}.   (64)
The motivation for this choice is that (64) equals
( ∆²_{i1} + ∆²_{i2} ) x_{(i1−1,i2−1)},   (65)
which is invariant to adding a first-order polynomial
p_{1,2}(i1, i2) = β00 + β10 i1 + β01 i2.
Gaussian Markov Random Fields
IGMRFs of second order on regular lattices
The precision matrix (apart from boundary effects) should have non-zero elements given by
−( ∆²_{i1} + ∆²_{i2} )² = −( ∆⁴_{i1} + 2 ∆²_{i1}∆²_{i2} + ∆⁴_{i2} ),   (66)
which is a negative difference approximation to the biharmonic differential operator
( ∂²/∂x² + ∂²/∂y² )² = ∂⁴/∂x⁴ + 2 ∂⁴/∂x²∂y² + ∂⁴/∂y⁴.   (67)
The fundamental solution of the biharmonic equation
( ∂⁴/∂x⁴ + 2 ∂⁴/∂x²∂y² + ∂⁴/∂y⁴ ) φ(x, y) = 0   (68)
is the thin plate spline.
Gaussian Markov Random Fields
IGMRFs of second order on regular lattices
Full conditionals in the interior
In the interior, the conditional mean weights the four nearest neighbours, the four diagonal neighbours, and the four neighbours two steps away along the axes:
E(xi | x−i) = (1/20) ( 8 × [sum of the four nearest neighbours] − 2 × [sum of the four diagonal neighbours] − 1 × [sum of the four axial neighbours at distance two] )   (69)
and
Prec(xi | x−i) = 20κ.   (70)
Gaussian Markov Random Fields
IGMRFs of second order on regular lattices
Alternative IGMRFs in two dimensions
Using only the five-point stencil (the centre and its four nearest neighbours)   (71)
to obtain a difference approximation to
∂²/∂x² + ∂²/∂y²   (72)
is not optimal.
The discretisation error in the 45-degree directions differs from that in the main directions; hence we should expect a "directional" effect.
Gaussian Markov Random Fields
IGMRFs of second order on regular lattices
Use an isotropic approximation (e.g. the "Mehrstellen" stencil):
−(10/3) × [centre] + (2/3) × [sum of the four nearest neighbours] + (1/6) × [sum of the four diagonal neighbours]   (73)
The resulting full conditionals in the interior are
E(xi | x−i) = (1/468) ( 144 × [four nearest neighbours] − 18 × [four diagonal neighbours] + 8 × [four axial neighbours at distance two] − 8 × [eight "knight's-move" neighbours] − 1 × [four diagonal neighbours at distance two] )   (74)
and
Prec(xi | x−i) = 13κ.   (75)
Gaussian Markov Random Fields
IGMRFs of second order on regular lattices
[figure]
Samples.
Gaussian Markov Random Fields
IGMRFs of second order on regular lattices
[figure]
Samples.
Gaussian Markov Random Fields
Part V
Hierarchical GMRF models
Gaussian Markov Random Fields
Outline I
Hierarchical GMRF models
  A simple example
  Classes of hierarchical GMRF models
A brief introduction to MCMC
  Blocking
  Rate of convergence
  Example
  One-block algorithm
Block algorithms for hierarchical GMRF models
  General setup
  The one-block algorithm
  The sub-block algorithm
  GMRF approximations
Merging GMRFs
  Introduction
Gaussian Markov Random Fields
Outline II
Why is it so?
Gaussian Markov Random Fields
Hierarchical GMRF models
Hierarchical GMRF models
Characterised through several stages of observables and parameters. A typical scenario is as follows.
Stage 1 Formulate a distributional assumption for the observables, dependent on latent parameters.
• For a time series of binary observations y we may assume yi ∼ B(pi), i = 1, . . . , n.
• We assume the observations to be conditionally independent.
Gaussian Markov Random Fields
Hierarchical GMRF models
Hierarchical GMRF models
Characterised through several stages of observables and parameters. A typical scenario is as follows.
Stage 2 Assign a prior model, i.e. a GMRF, to the unknown parameters, here the pi.
• Choose an autoregressive model for the logit-transformed probabilities xi = logit(pi).
Gaussian Markov Random Fields
Hierarchical GMRF models
Hierarchical GMRF models
Characterised through several stages of observables and parameters. A typical scenario is as follows.
Stage 3 Assign priors to the unknown parameters (or hyperparameters) of the GMRF:
• the precision parameter κ
• the "strength" of the dependency.
Add further stages if needed.
Gaussian Markov Random Fields
Hierarchical GMRF models
A simple example
Tokyo rainfall data
Stage 1 Binomial data:
yi ∼ Binomial(2, p(xi)) or Binomial(1, p(xi))
Gaussian Markov Random Fields
Hierarchical GMRF models
A simple example
Tokyo rainfall data
Stage 2 Assume a smooth latent x:
x ∼ RW2(κ), logit(pi) = xi
Gaussian Markov Random Fields
Hierarchical GMRF models
A simple example
Tokyo rainfall data
Stage 3 Gamma(α, β)-prior on κ
Gaussian Markov Random Fields
Hierarchical GMRF models
Classes of hierarchical GMRF models
Classes of hierarchical GMRF models
• Normal data
  • Block-MCMC
• Non-normal data that allows for a normal-mixture representation
  • Student-t distribution
  • Logistic and Laplace (binary regression)
  • Block-MCMC with auxiliary variables
• Non-normal data
  • Poisson
  • and others...
  • Block-MCMC with GMRF approximations
Gaussian Markov Random Fields
A brief introduction to MCMC
MCMC
Construct a Markov chain
θ^(1), θ^(2), . . . , θ^(k), . . .
that converges (under some conditions) to π(θ).
1. Start with θ^(0) where π(θ^(0)) > 0. Set k = 1.
2. Generate a proposal θ* from some proposal kernel q(θ* | θ^(k−1)). Set θ^(k) = θ* with probability
α = min{ 1, [π(θ*) q(θ^(k−1) | θ*)] / [π(θ^(k−1)) q(θ* | θ^(k−1))] };
otherwise set θ^(k) = θ^(k−1).
3. Set k = k + 1 and go back to 2.
Gaussian Markov Random Fields
A brief introduction to MCMC
Depending on the specific choice of the proposal kernel q(θ*|θ), very different algorithms result.
• When q(θ*|θ) does not depend on the current value of θ, the proposal is called an independence proposal.
• When q(θ*|θ) = q(θ|θ*) we have a Metropolis proposal. These include so-called random-walk proposals.
The rate of convergence towards π(θ) and the degree of dependence between successive samples of the Markov chain (mixing) will depend on the chosen proposal.
Gaussian Markov Random Fields
A brief introduction to MCMC
Single-site algorithms
• Most MCMC algorithms have been based on updating each scalar component
θi, i = 1, . . . , p,
of θ conditional on θ−i.
• Apply the MH-algorithm in turn to every component θi of θ, with arbitrary proposal kernels
qi(θ∗i | θi, θ−i).
• As long as we update each component of θ, this algorithm will converge to the target distribution π(θ).
Gaussian Markov Random Fields
A brief introduction to MCMC
However...
• Such single-site updating can be disadvantageous if parameters are highly dependent in the posterior distribution π(θ).
• The problem is that the Markov chain may move around very slowly in its target (posterior) distribution.
• A general approach to circumvent this problem is to update parameters in larger blocks, θj: a vector of components of θ.
• The choice of blocks is often controlled by what is possible to do in practice.
• Ideally, we should choose a small number of blocks with large dependence within the blocks but less dependence between blocks.
Gaussian Markov Random Fields
Blocking
Blocking
Assume the following
• θ = (κ, x), where
  • κ is the precision and
  • x is a GMRF.
Two-block approach
• sample x ∼ π(x|κ), and
• sample κ ∼ π(κ|x)
There is often strong dependence between κ and x in the posterior; this is resolved using a joint update of (x, κ).
Why this modification is important and why it works is discussed next.
Gaussian Markov Random Fields
Blocking
Rate of convergence
Rate of convergence
Let θ(1), θ(2), . . . denote a Markov chain with target distribution π(θ) and initial value θ(0) ∼ π(θ).
Rate of convergence ρ: how quickly E(h(θ(t)) | θ(0)) approaches the stationary value E(h(θ)), for all square π-integrable functions h(·).
Let ρ be the minimum number such that, for all h(·) and for all r > ρ,

lim_{k→∞} E[ ( E( h(θ(k)) | θ(0) ) − E( h(θ) ) )² r^(−2k) ] = 0. (76)
Gaussian Markov Random Fields
Blocking
Example
Example
Let x be a first-order autoregressive process,

xt − µ = γ(xt−1 − µ) + νt, t = 2, . . . , n, (77)

where |γ| < 1, the νt are iid normal with zero mean and variance σ², and x1 ∼ N(µ, σ²/(1 − γ²)).

Let γ, σ², and µ be fixed parameters.
At each iteration, a single-site Gibbs sampler samples xt from the full conditional π(xt | x−t) for t = 1, . . . , n:

xt | x−t ∼ N( µ + γ(x2 − µ), σ² )                                   for t = 1,
xt | x−t ∼ N( µ + [γ/(1 + γ²)](xt−1 + xt+1 − 2µ), σ²/(1 + γ²) )     for t = 2, . . . , n − 1,
xt | x−t ∼ N( µ + γ(xn−1 − µ), σ² )                                 for t = n.
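As an illustration (ours, not from the slides), this single-site Gibbs sampler is a few lines of Python; running it with γ close to one exhibits the slow mixing quantified on the next slide:

```python
import numpy as np

def single_site_gibbs_ar1(n=100, gamma=0.95, sigma2=1.0, mu=0.0,
                          n_iter=2000, rng=None):
    """Single-site Gibbs sampler for the AR(1) model, cycling through
    the full conditionals pi(x_t | x_{-t}) given above."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.full(n, mu)
    sd_end = np.sqrt(sigma2)                    # t = 1 and t = n
    sd_mid = np.sqrt(sigma2 / (1 + gamma**2))   # t = 2, ..., n - 1
    trace = np.empty(n_iter)
    for k in range(n_iter):
        x[0] = mu + gamma * (x[1] - mu) + sd_end * rng.standard_normal()
        for t in range(1, n - 1):
            m = mu + gamma / (1 + gamma**2) * (x[t - 1] + x[t + 1] - 2 * mu)
            x[t] = m + sd_mid * rng.standard_normal()
        x[-1] = mu + gamma * (x[-2] - mu) + sd_end * rng.standard_normal()
        trace[k] = x[n // 2]   # monitor one component; its autocorrelation
    return trace               # decays at the slow geometric rate below
```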
Gaussian Markov Random Fields
Blocking
Example
• For large n, the rate of convergence is

ρ = 4γ²/(1 + γ²)². (78)

• For |γ| close to one the rate of convergence can be slow: if γ = 1 − δ for small δ > 0, then ρ = 1 − δ² + O(δ³).
To circumvent this problem, we may update x in one block. This is possible as x is a GMRF.
This yields immediate convergence.
Gaussian Markov Random Fields
Blocking
Example
Relax the assumptions of fixed hyperparameters.
Consider a hierarchical formulation where the mean of xt, µ, is unknown and assigned a standard normal prior,

µ ∼ N(0, 1) and x | µ ∼ N(µ1, Q⁻¹),

where Q is the precision matrix of the GMRF x | µ.
The joint density of (µ, x) is normal.
We have two natural blocks, µ and x.
Gaussian Markov Random Fields
Blocking
Example
A two-block Gibbs sampler updates µ and x with samples from their full conditionals,

µ(k) | x(k−1) ∼ N( 1ᵀQx(k−1) / (1 + 1ᵀQ1), (1 + 1ᵀQ1)⁻¹ )
x(k) | µ(k) ∼ N( µ(k)1, Q⁻¹ ).     (79)
The presence of the hyperparameter µ will slow down the convergence compared to the case when µ is fixed.
Due to the nice structure of (79) we can characterise explicitly the marginal chain of µ(k).
Gaussian Markov Random Fields
Blocking
Example
Theorem
The marginal chain µ(1), µ(2), . . . from the two-block Gibbs sampler defined in (79), started in equilibrium, is a first-order autoregressive process

µ(k) = φ µ(k−1) + εk,

where

φ = 1ᵀQ1 / (1 + 1ᵀQ1)

and the εk are iid N(0, 1 − φ²).
Gaussian Markov Random Fields
Blocking
Example
Proof. It follows directly that the marginal chain µ(1), µ(2), . . . is a first-order autoregressive process.
The coefficient φ is found by computing the covariance at lag 1,

Cov(µ(k), µ(k+1)) = E( µ(k) µ(k+1) )
                  = E( µ(k) E( µ(k+1) | µ(k) ) )
                  = E( µ(k) E( E( µ(k+1) | x(k) ) | µ(k) ) )
                  = E( µ(k) E( 1ᵀQx(k) / (1 + 1ᵀQ1) | µ(k) ) )
                  = [ 1ᵀQ1 / (1 + 1ᵀQ1) ] Var(µ(k)),

which is known to be φ times the variance Var(µ(k)) for a first-order autoregressive process.
Gaussian Markov Random Fields
Blocking
Example
For our model
φ = [n(1 − γ)²/σ²] / [1 + n(1 − γ)²/σ²] = 1 − (Var(xt)/n) · (1 − γ²)/(1 − γ)² + O(1/n²).
When n is large, φ is close to 1 and the chain will both mix andconverge slowly even though we use a two-block Gibbs sampler.
Note that
• increasing variance of xt improves the convergence.
However, φ→ 1 as n→∞, which is bad.
Gaussian Markov Random Fields
Blocking
Example
What is going on?
[Figure: (a) trace of µ(k); (b) the pairs (µ(k), 1ᵀQx(k)), with µ(k) on the horizontal axis.]
Gaussian Markov Random Fields
Blocking
One-block algorithm
One-block algorithm
So far: blocking mainly improves mixing within the block. If there is strong dependence between blocks, the MCMC algorithm may still suffer from slow convergence.
Update (µ, x) jointly by delaying the accept/reject step until x is updated as well.
µ∗ ∼ q(µ∗ | µ(k−1))
x∗ | µ∗ ∼ N(µ∗1, Q⁻¹)     (80)

then accept/reject (µ∗, x∗) jointly.

Here, q(µ∗ | µ(k−1)) can be a simple random-walk proposal or some other suitable proposal distribution.
Gaussian Markov Random Fields
Blocking
One-block algorithm
What is going on?
[Figure: (a) trace of µ(k); (b) the pairs (µ(k), 1ᵀQx(k)), with µ(k) on the horizontal axis.]
Gaussian Markov Random Fields
Blocking
One-block algorithm
With a symmetric µ-proposal,
α = min{ 1, exp( −½ ((µ∗)² − (µ(k−1))²) ) }. (81)

• Only the marginal density of µ is needed in (81): we effectively integrate x out of the target density.
• The minor modification to delay the accept/reject step until x is updated as well can give a large improvement.
• In this case: a random walk on a one-dimensional density.
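A minimal sketch of one such joint update in Python (our illustration; it assumes the Cholesky factor L of Q, with Q = LLᵀ, is available, and uses the simplified acceptance probability (81)):

```python
import numpy as np

def one_block_update(mu, x, L, step=0.5, rng=None):
    """One joint (mu, x) update as in (80): random-walk proposal for mu,
    then x* | mu* ~ N(mu* 1, Q^{-1}) via the Cholesky factor L of Q.
    With a symmetric mu-proposal the acceptance probability is (81)."""
    rng = np.random.default_rng() if rng is None else rng
    mu_star = mu + step * rng.standard_normal()
    # Solving L^T z = e with e ~ N(0, I) gives z ~ N(0, Q^{-1}).
    z = np.linalg.solve(L.T, rng.standard_normal(L.shape[0]))
    x_star = mu_star + z
    log_alpha = -0.5 * (mu_star**2 - mu**2)   # eq. (81)
    if np.log(rng.uniform()) < log_alpha:
        return mu_star, x_star
    return mu, x
```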
Gaussian Markov Random Fields
Block algorithms for hierarchical GMRF models
General setup
A more general setup
• Hyperparameters θ (low dimension)
• GMRF x | θ of size n
• Observe x with data y.
The posterior is
π(x,θ | y) ∝ π(θ) π(x | θ) π(y | x,θ).
Assume we are able to sample from π(x | θ, y), i.e., the full conditional of x is a GMRF.
Gaussian Markov Random Fields
Block algorithms for hierarchical GMRF models
The one-block algorithm
The one-block algorithm
The following proposal updates (θ, x) in one block:
θ∗ ∼ q(θ∗ | θ(k−1))
x∗ ∼ π(x | θ∗, y).(82)
The proposal (θ∗, x∗) is then accepted/rejected jointly.
We denote this as the one-block algorithm.
Gaussian Markov Random Fields
Block algorithms for hierarchical GMRF models
The one-block algorithm
Consider the θ-chain: we are in fact sampling from the posterior marginal π(θ | y) using the proposal

θ∗ ∼ q(θ∗ | θ(k−1)).

The dimension of θ is typically low (1-5, say).

The proposed algorithm should not experience any serious mixing problems.
Gaussian Markov Random Fields
Block algorithms for hierarchical GMRF models
The one-block algorithm
The one-block algorithm is not always feasible
The one-block algorithm is not always feasible, for the following reasons:

1. The full conditional of x can be a GMRF with a precision matrix that is not sparse. This prohibits a fast factorisation, hence a joint update is feasible but not computationally efficient.
2. The data can be non-normal, so the full conditional of x is not a GMRF and sampling x∗ using (82) is not possible (in general).
Gaussian Markov Random Fields
Block algorithms for hierarchical GMRF models
The sub-block algorithm
...not sparse precision matrix
These cases can often be approached using sub-blocks of (θ, x): the sub-block algorithm.

Assume a natural splitting exists for both θ and x into

(θa, xa), (θb, xb) and (θc, xc). (83)

The sets a, b, and c do not need to be disjoint.

One class of examples where such an approach is fruitful is (geo-)additive models, where a, b, and c represent three different covariate effects with their respective hyperparameters.
Gaussian Markov Random Fields
Block algorithms for hierarchical GMRF models
GMRF approximations
...non-Gaussian full conditionals
Auxiliary variables
• can help achieve Gaussian full conditionals, e.g. for
  • logit and probit regression models for binary and multi-categorical data, and
  • Student-tν distributed observations.

GMRF approximations
• approximate π(x | θ, y) using a second-order Taylor expansion.
• Prominent example: Poisson regression.
• Such approximations can be surprisingly accurate in many cases and can be interpreted as approximately integrating x out of π(x, θ | y).
Gaussian Markov Random Fields
Merging GMRFs
Introduction
Merging GMRFs using conditioning (I)
µ ∼ N (0, 1)
x− µ | µ ∼ AR(1)
z | x ∼ N (x, I)
y | z ∼ N (z, I)
• x∗ = (µ, x, z, y) is a GMRF
• x∗ | y is a GMRF
• Additional hyperparameters θ
Gaussian Markov Random Fields
Merging GMRFs
Introduction
Merging GMRFs using conditioning (II)
Gaussian Markov Random Fields
Merging GMRFs
Why is it so?
Merging GMRFs using conditioning (III)
If
x ∼ N (0,Q−1)
y | x ∼ N (x,K−1)
z | x, y ∼ N (y,H−1)
then
Prec(x, y, z) =
[ Q + K    −K        0  ]
[ −K       K + H    −H  ]
[ 0        −H        H  ]

which is sparse if Q, K and H are.
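As a sketch (ours, not from the slides), the joint precision can be assembled directly with scipy's sparse block tools; the toy Q, K, H below are assumptions for illustration:

```python
import numpy as np
import scipy.sparse as sp

def joint_precision(Q, K, H):
    """Assemble Prec(x, y, z) for x ~ N(0, Q^{-1}), y | x ~ N(x, K^{-1}),
    z | x, y ~ N(y, H^{-1}); the result is sparse if Q, K, H are."""
    return sp.bmat([[Q + K, -K,    None],
                    [-K,    K + H, -H  ],
                    [None,  -H,    H   ]], format="csr")

# Toy example: tridiagonal Q and diagonal K, H of the same size.
n = 5
Q = sp.diags([-np.ones(n - 1), 2 * np.ones(n), -np.ones(n - 1)], [-1, 0, 1])
P = joint_precision(Q, sp.identity(n), sp.identity(n))  # 3n x 3n, sparse
```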
Gaussian Markov Random Fields
Part VI
Applications
Gaussian Markov Random Fields
Outline I

Normal data
  Munich rental data
Normal-mixture response models
  Introduction
  Hierarchical-t formulations
  Binary regression
  Summary
Non-normal response models
  GMRF approximation
  Joint analysis of diseases
Gaussian Markov Random Fields
Normal data
Normal data: Munich rental guide
• Response variable yi: rent per m²
• location
• floor space
• year of construction
• various indicator variables, such as
  • no central heating
  • no bathroom
  • large balcony

German law: an increase in the rent must be based on the ‘average rent’ of a comparable flat.
Gaussian Markov Random Fields
Normal data
Munich rental data
Spatial regression model
yi ∼ N( µ + xS(i) + xC(i) + xL(i) + ziᵀβ, 1/κy )

xS(i)  floor space (continuous RW2)
xC(i)  year of construction (continuous RW2)
xL(i)  location (first-order IGMRF)
β      parameters for the indicator variables
• 380 spatial locations
• 2035 observations
• sum-to-zero constraint on the spatial IGMRF model and the year of construction covariate
Gaussian Markov Random Fields
Normal data
Munich rental data
Inference using MCMC
Sub-block approach
(xS , κS), (xC , κC ), (xL, κL), and (β, µ, κy).
Update one block at a time, using

κL∗ ∼ q(κL∗ | κL)
xL,∗ ∼ π(xL,∗ | the rest)

and then accept/reject (κL∗, xL,∗) jointly.
Gaussian Markov Random Fields
Normal data
Munich rental data
Log-RW proposal for precisions
It’s convenient to propose new precisions using

κnew = κold · f,   f ∼ π(f) ∝ 1 + 1/f,

for f in [1/F, F] and F > 1. With this choice,

q(κold | κnew) / q(κnew | κold) = 1

and

Var(κnew | κold) ∝ (κold)².
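A sketch of this proposal in Python (ours; the rejection step for drawing f is one of several possible implementations):

```python
import numpy as np

def sample_f(F, rng):
    """Draw f with density proportional to 1 + 1/f on [1/F, F] by
    rejection from a uniform envelope; the density peaks at f = 1/F."""
    while True:
        f = rng.uniform(1.0 / F, F)
        if rng.uniform() < (1.0 + 1.0 / f) / (1.0 + F):
            return f

def propose_precision(kappa_old, F=2.0, rng=None):
    """Multiplicative proposal kappa_new = kappa_old * f; the ratio
    q(kappa_old | kappa_new) / q(kappa_new | kappa_old) equals 1,
    so it drops out of the acceptance probability."""
    rng = np.random.default_rng() if rng is None else rng
    return kappa_old * sample_f(F, rng)
```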
Gaussian Markov Random Fields
Normal data
Munich rental data
Full conditional π(xL,∗|the rest)
Introduce ‘fake’ data ỹ,

ỹi = yi − ( µ + xS(i) + xC(i) + ziᵀβ ).

The full conditional of xL is

π(xL | the rest) ∝ exp( −(κL/2) Σ_{i∼j} (xLi − xLj)² ) × exp( −(κy/2) Σk (ỹk − xL(k))² ).

The data ỹ do not introduce extra dependence between the xLi’s, as ỹi acts as a noisy observation of xLi.
Gaussian Markov Random Fields
Normal data
Munich rental data
Denote by ni the number of neighbours of location i, and let L(i) be the set of observations taken at location i,

L(i) = {k : xL(k) = xLi},

with size |L(i)|.

The full conditional of xL is a GMRF with parameters (Q⁻¹b, Q), where

bi = κy Σ_{k∈L(i)} ỹk

and

Qij =
  κL ni + κy |L(i)|   if i = j,
  −κL                 if i ∼ j,
  0                   otherwise.
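A sketch (ours) of how (b, Q) could be assembled; the adjacency-list and observation-to-location encodings are assumptions made for illustration:

```python
import numpy as np
import scipy.sparse as sp

def location_full_conditional(adj, obs_loc, y_tilde, kappa_L, kappa_y):
    """Canonical parameters (b, Q) of pi(x^L | the rest).
    adj[i]     : list of neighbours of location i
    obs_loc[k] : location index of observation k
    y_tilde    : the 'fake' data defined above"""
    n = len(adj)
    counts = np.bincount(obs_loc, minlength=n)                      # |L(i)|
    b = kappa_y * np.bincount(obs_loc, weights=y_tilde, minlength=n)
    rows, cols, vals = [], [], []
    for i, nbrs in enumerate(adj):
        rows.append(i); cols.append(i)
        vals.append(kappa_L * len(nbrs) + kappa_y * counts[i])      # diagonal
        for j in nbrs:
            rows.append(i); cols.append(j); vals.append(-kappa_L)   # i ~ j
    Q = sp.csr_matrix((vals, (rows, cols)), shape=(n, n))
    return b, Q   # the full conditional is N(Q^{-1} b, Q^{-1})
```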
Gaussian Markov Random Fields
Normal data
Munich rental data
Results
Floor space
Gaussian Markov Random Fields
Normal data
Munich rental data
Results
Year of construction
Gaussian Markov Random Fields
Normal data
Munich rental data
Results
Location
Gaussian Markov Random Fields
Normal-mixture response models
Introduction
Normal-mixture representation
Theorem (Kelker, 1971)
If x has density f(x) symmetric around 0, then there exist independent random variables z and v, with z standard normal, such that x = z/v iff the derivatives of f(x) satisfy

(−d/dy)^k f(√y) ≥ 0

for y > 0 and for k = 1, 2, . . ..
• Student-t
• Logistic and Laplace
Gaussian Markov Random Fields
Normal-mixture response models
Introduction
The corresponding mixing distributions for the precision parameter λ generate these distributions as scale mixtures of normals:

Distribution of x   Mixing distribution of λ
Student-tν          G(ν/2, ν/2)
Logistic            1/(2K)², where K is Kolmogorov-Smirnov distributed
Laplace             1/(2E), where E is exponentially distributed
Gaussian Markov Random Fields
Normal-mixture response models
Hierarchical-t formulations
RW1 with tν-increments

Replace the assumption of normally distributed increments by a Student-tν distribution to allow for larger jumps in the sequence x.

Introduce n − 1 independent G(ν/2, ν/2) scale-mixture variables λi:

Δxi | λi ∼ N(0, (κλi)⁻¹), independently for i = 1, . . . , n − 1.
Gaussian Markov Random Fields
Normal-mixture response models
Hierarchical-t formulations
Observe data yi ∼ N(xi, κy⁻¹) for i = 1, . . . , n.

The posterior density for (x, λ) is

π(x, λ | y) ∝ π(x | λ) π(λ) π(y | x).

Note that
• x | (y, λ) is now a GMRF, while
• λ1, . . . , λn−1 | (x, y) are conditionally independent and gamma distributed with parameters (ν + 1)/2 and (ν + κ(Δxi)²)/2.
Use the sub-block algorithm with blocks
(θ, x), λ
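The λ-block of this sub-block algorithm is a one-line Gibbs update; a sketch (ours):

```python
import numpy as np

def update_lambda(x, kappa, nu, rng=None):
    """Gibbs update: given x, the lambda_i are independent
    Gamma((nu + 1)/2, (nu + kappa * (dx_i)^2)/2) in shape/rate form."""
    rng = np.random.default_rng() if rng is None else rng
    dx = np.diff(x)                        # increments, length n - 1
    shape = 0.5 * (nu + 1.0)
    rate = 0.5 * (nu + kappa * dx**2)
    return rng.gamma(shape, 1.0 / rate)    # numpy parametrises by scale
```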
Gaussian Markov Random Fields
Normal-mixture response models
Binary regression
Example: Binary regression
GMRF x and Bernoulli data
yi ∼ B(g⁻¹(xi))

g(p) =
  log(p/(1 − p))   logit link
  Φ⁻¹(p)           probit link

Equivalent representation using auxiliary variables w (for the probit link):

εi iid∼ N(0, 1)
wi = xi + εi
yi = 1 if wi > 0, and 0 otherwise.
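For the probit link, the auxiliary variables have truncated-normal full conditionals; a sketch of the w-update using scipy (our illustration):

```python
import numpy as np
from scipy.stats import truncnorm

def update_w(x, y, rng=None):
    """Gibbs update of w in the probit model: w_i | x_i, y_i is N(x_i, 1)
    truncated to (0, inf) if y_i = 1 and to (-inf, 0] if y_i = 0."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    lo = np.where(y == 1, 0.0, -np.inf)
    hi = np.where(y == 1, np.inf, 0.0)
    # truncnorm takes the bounds standardised by loc and scale
    return truncnorm.rvs(lo - x, hi - x, loc=x, scale=1.0, random_state=rng)
```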
Gaussian Markov Random Fields
Normal-mixture response models
Binary regression
MCMC Inference
• Full conditional for the GMRF prior is a GMRF
• Full conditional for the auxiliary variables
π(w | . . .) = ∏i π(wi | . . .)
Use the sub-block algorithm with blocks
(θ, x), w
Gaussian Markov Random Fields
Normal-mixture response models
Binary regression
Simple example: Tokyo rainfall data
Stage 1 Binomial data
yi ∼ Binomial(2, p(xi)) for most days, and
yi ∼ Binomial(1, p(xi)) for 29 February.
Gaussian Markov Random Fields
Normal-mixture response models
Binary regression
Simple example: Tokyo rainfall data
Stage 2 Assume a smooth latent x,
x ∼ RW2(κ), logit(pi) = xi
Gaussian Markov Random Fields
Normal-mixture response models
Binary regression
Simple example: Tokyo rainfall data
Stage 3 Gamma(α, β)-prior on κ
Gaussian Markov Random Fields
Normal-mixture response models
Binary regression
Simple example: Tokyo rainfall data
[Figure: trace plots of the precision, block vs. single-site algorithm.]
Gaussian Markov Random Fields
Normal-mixture response models
Binary regression
Simple example: Tokyo rainfall data
[Figure: estimated probabilities, block vs. single-site algorithm.]
Gaussian Markov Random Fields
Normal-mixture response models
Summary
• These auxiliary-variable methods work better than one might think.
• The dependence between the blocks (θ, x) and w is not that strong.
• Performance is nearly as good as with normal data.
Gaussian Markov Random Fields
Non-normal response models
GMRF approximation
Non-normal response models
Example of a full conditional:

π(x | y) ∝ exp( −½ xᵀQx − Σi exp(xi) )
         ≈ exp( −½ (x − µ)ᵀ (Q + diag(ci)) (x − µ) )

Construct the GMRF approximation:
• locate the mode x∗,
• expand to second order,
• to obtain the GMRF approximation.
• The graph is unchanged.
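A dense-matrix sketch (ours) of these steps for the particular log-density above, where the second-order term is ci = exp(µi):

```python
import numpy as np

def gmrf_approximation(Q, max_iter=50, tol=1e-10):
    """GMRF approximation to pi(x) prop. to exp(-x'Qx/2 - sum_i exp(x_i)):
    Newton steps locate the mode mu; expanding to second order there
    gives precision Q + diag(c) with c_i = exp(mu_i)."""
    mu = np.zeros(Q.shape[0])
    for _ in range(max_iter):
        c = np.exp(mu)
        grad = -Q @ mu - c            # gradient of the log-density
        H = Q + np.diag(c)            # negative Hessian; same graph as Q
        step = np.linalg.solve(H, grad)
        mu = mu + step
        if np.max(np.abs(step)) < tol:
            break
    return mu, Q + np.diag(np.exp(mu))   # mean and precision
```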
Gaussian Markov Random Fields
Non-normal response models
GMRF approximation
Why use the mode?
Gaussian Markov Random Fields
Non-normal response models
GMRF approximation
How to find the mode?
• Numerical gradient-search methods
  • evaluate the gradient and Hessian in O(n) flops
  • faster for large GMRFs
• Newton-Raphson method
  • current x is the initial value
  • Taylor-expand around x
  • compute the mean in the GMRF approximation
  • repeat until convergence
• Further complications arise with linear constraints Ax = b.
Gaussian Markov Random Fields
Non-normal response models
GMRF approximation
“Taylor” or not? (I)
Consider a non-quadratic function g(x),

g(x) = g(x0) + (x − x0) g′(x0) + ½ (x − x0)² g″(x0) + . . .

• the error is zero at x0
• the error typically increases with |x − x0|
Gaussian Markov Random Fields
Non-normal response models
GMRF approximation
“Taylor” or not? (II)
Alternative approximation:

g(x) ≈ ĝ(x) = a + bx + ½ cx², (84)

where a, b and c are chosen such that

ĝ(x0) = g(x0), ĝ(x0 ± h) = g(x0 ± h).

• More “uniformly distributed” error
• Better for approximating distributions
• ... if h is a rough guess of the region of interest.

Numerical approximations of the derivatives are preferred to analytical expressions.
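A sketch (ours) of computing a, b and c in (84) from the three evaluations g(x0 − h), g(x0), g(x0 + h):

```python
def quadratic_fit(g, x0, h):
    """Fit g_hat(x) = a + b*x + 0.5*c*x**2 through g at x0 - h, x0, x0 + h."""
    g0, gp, gm = g(x0), g(x0 + h), g(x0 - h)
    d1 = (gp - gm) / (2.0 * h)          # tends to g'(x0) as h -> 0
    c = (gp - 2.0 * g0 + gm) / h**2     # tends to g''(x0) as h -> 0
    # Rewrite g0 + d1*(x - x0) + 0.5*c*(x - x0)**2 as a + b*x + 0.5*c*x**2:
    a = g0 - d1 * x0 + 0.5 * c * x0**2
    b = d1 - c * x0
    return a, b, c
```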
Gaussian Markov Random Fields
Non-normal response models
Joint analysis of diseases
Joint analysis of diseases
[Figure: standardised mortality ratios (SMR), oral and lung cancer.]
Gaussian Markov Random Fields
Non-normal response models
Joint analysis of diseases
Separate analysis: model
yij: number of cases in area i for cancer type j,

yij ∼ P(eij exp(ηij)),

ηij: log relative risk in area i for disease j; eij: constants.

• Decompose the log relative risk η as
η = µ1 + u + v
• µ is the overall mean
• u is a spatially structured component (IGMRF)
• v is an unstructured component (random effects)
Gaussian Markov Random Fields
Non-normal response models
Joint analysis of diseases
Separate analysis: model
Posterior
π(x, κ | y) ∝ κv^(n/2) κu^((n−1)/2) exp( −½ xᵀQx ) × exp( Σi [yi ηi − ei exp(ηi)] ) π(κ)

with x = (µ, u, η) and

Q =
[ κµ + nκv   κv1ᵀ        −κv1ᵀ ]
[ κv1        κuR + κvI   −κvI  ]
[ −κv1       −κvI         κvI  ]

Q is a (2n + 1) × (2n + 1) matrix, where n = 544.

The spatially structured term u has a sum-to-zero constraint, 1ᵀu = 0.
Gaussian Markov Random Fields
Non-normal response models
Joint analysis of diseases
Separate analysis: MCMC
Joint update of all parameters:
κ∗u ∼ q(κ∗u | κu)
κ∗v ∼ q(κ∗v | κv)
x∗ ∼ π(x | κ∗, y)
Gaussian Markov Random Fields
Non-normal response models
Joint analysis of diseases
Separate analysis: Results
[Figure: separate-analysis results, oral and lung cancer.]
Gaussian Markov Random Fields
Non-normal response models
Joint analysis of diseases
Joint analysis: Model
• Oral cavity and lung cancer relates to tobacco smoking (u1)
• Oral cancer relates to alcohol consumption (u2)
• Latent shared and specific spatial components:
η1 | u1, u2, µ, κ ∼ N( µ11 + δu1 + u2, κη1⁻¹ I )
η2 | u1, u2, µ, κ ∼ N( µ21 + δ⁻¹u1, κη2⁻¹ I )
Gaussian Markov Random Fields
Non-normal response models
Joint analysis of diseases
Joint analysis: MCMC
Update all parameters in one block
• Simple (log-)RW for the hyperparameters
• Sample (µ1, η1, u1, µ2, η2, u2) from the GMRF approximation.
• Accept/reject jointly.
Gaussian Markov Random Fields
Non-normal response models
Joint analysis of diseases
Joint analysis: Results
[Figure: estimated shared (“tobacco”) and specific (“alcohol”) spatial components.]
Gaussian Markov Random Fields
Non-normal response models
Joint analysis of diseases
Independence samplers
For many of the examples it is quite easy to construct independence samplers:
• (more or less) exact samples
• the basis for methods that avoid MCMC completely
...continue tomorrow!
Gaussian Markov Random Fields
Part VII
Advanced topics
Gaussian Markov Random Fields
Outline I
Computing marginal variances for GMRFs∗
  Introduction
  Statistical derivation
  Marginal variances under hard and soft constraints
Gaussian Markov Random Fields
Computing marginal variances for GMRFs∗
Introduction
Computing marginal variances for GMRFs∗
Let
Q = VDVᵀ,
• where D is a diagonal matrix, and
• V is a lower-triangular matrix with ones on the diagonal.

The matrix identity

Σ = D⁻¹V⁻¹ + (I − Vᵀ)Σ

defines recursions which can be used to compute
• Var(xi) and Cov(xi, xj) for i ∼ j
essentially without cost once the Cholesky triangle L is known. This is a result from 1973.
Gaussian Markov Random Fields
Computing marginal variances for GMRFs∗
Statistical derivation
Statistical derivation
Recall for a zero mean GMRF that
xi | xi+1, . . . , xn ∼ N( −(1/Lii) Σ_{k=i+1}^n Lki xk, 1/Lii² ), i = n, . . . , 1,

provides a sequential representation of the GMRF backward in “time” i.
Gaussian Markov Random Fields
Computing marginal variances for GMRFs∗
Statistical derivation
Multiplying by xj, j ≥ i, and taking expectations yields

Σij = δij/Lii² − (1/Lii) Σ_{k∈I(i)} Lki Σkj, j ≥ i, i = n, . . . , 1,

where I(i) is the set of those k for which Lki is non-zero,

I(i) = {k > i : Lki ≠ 0},

and δij is one if i = j and zero otherwise.
Gaussian Markov Random Fields
Computing marginal variances for GMRFs∗
Statistical derivation
We can use
Σij = δij/Lii² − (1/Lii) Σ_{k∈I(i)} Lki Σkj, j ≥ i, i = n, . . . , 1,
to compute Σij for each ij :
• Outer loop i = n, . . . , 1
• Inner loop j = n, . . . , i
Gaussian Markov Random Fields
Computing marginal variances for GMRFs∗
Statistical derivation
Example
Let n = 3, I(1) = {2, 3}, I(2) = {3}; then we get

Σ33 = 1/L33²
Σ23 = −(1/L22) (L32 Σ33)
Σ22 = 1/L22² − (1/L22) (L32 Σ32)
Σ13 = −(1/L11) (L21 Σ23 + L31 Σ33)
Σ12 = −(1/L11) (L21 Σ22 + L31 Σ32)
Σ11 = 1/L11² − (1/L11) (L21 Σ21 + L31 Σ31)

where we also need to use that Σ is symmetric.
Gaussian Markov Random Fields
Computing marginal variances for GMRFs∗
Statistical derivation
• Assume we want to compute all marginal variances.
• To do so, we need to compute Σij (or Σji) for all ij in some set S.
• If the recursions can be solved by only computing Σij for all ij ∈ S, we say that the recursions are solvable using S.
Gaussian Markov Random Fields
Computing marginal variances for GMRFs∗
Statistical derivation
From
Σij = δij/Lii² − (1/Lii) Σ_{k∈I(i)} Lki Σkj, j ≥ i, i = n, . . . , 1, (85)
it is evident that S must satisfy
ij ∈ S and k ∈ I(i) =⇒ kj ∈ S (86)
We also need that ii ∈ S for i = 1, . . . , n.
Gaussian Markov Random Fields
Computing marginal variances for GMRFs∗
Statistical derivation
• S = V × V is a valid set, but we want |S| to be minimal to avoid unnecessary computations.
• Such a minimal set depends, however, on the numerical values in L, or Q implicitly.
• Denote by S(Q) a minimal set.

Theorem
The union of S(Q) over all Q > 0 with fixed graph G is a subset of

S∗ = {ij ∈ V × V : j ≥ i, i and j are not separated by F(i, j)},

and S∗ is solvable.
Gaussian Markov Random Fields
Computing marginal variances for GMRFs∗
Statistical derivation
Proof. We first note that ii ∈ S∗ for i = 1, . . . , n, since i and i are not separated by F(i, i). We now verify that the recursions are solvable using S∗. The global Markov property ensures that if ij ∉ S∗ then Lji = 0 for all Q > 0 with fixed graph G. We use this to replace I(i) with I∗(i) = {k > i : ik ∈ S∗} in (86), which is legal since I(i) ⊆ I∗(i) and the difference only identifies terms Lki which are zero. It is now sufficient to show that

ij ∈ S∗ and ik ∈ S∗ =⇒ kj ∈ S∗, (87)

which implies (86). Eq. (87) is trivially true for i ≤ k = j. Fix now i < k < j.

Then ij ∈ S∗ says that there exists a path i, i1, . . . , in, j, where i1, . . . , in are all smaller than i, and ik ∈ S∗ says that there exists a path i, i′1, . . . , i′n′, k, where i′1, . . . , i′n′ are all smaller than i. Then there is a path from k to i and from i to j where all nodes are less than or equal to i, but then also less than k since i < k.

Hence, k and j are not separated by F(k, j), so kj ∈ S∗. Finally, since S∗ contains {11, . . . , nn} and only depends on G, it must contain the union of all S(Q), since each S(Q) is minimal.
Gaussian Markov Random Fields
Computing marginal variances for GMRFs∗
Statistical derivation
Interpretation of S∗
• S∗ is the set of all possible non-zero elements in L, based on G only.
• This is the set of Lji’s that are computed when computing Q = LLᵀ.
• Since in general Lji ≠ 0 when i ∼ j, we also compute Cov(xi, xj) for i ∼ j.
• Some of the Lij’s might turn out to be zero, depending on the conditional independence properties of the marginal density of xi:n for i = n, . . . , 1. This might cause a slight problem in practice (implementation dependent).
Gaussian Markov Random Fields
Computing marginal variances for GMRFs∗
Statistical derivation
General algorithm
for i = n, . . . , 1
  for decreasing j in I(i)
    compute Σij from Eq. (85)
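A dense-matrix sketch (ours) of these recursions, with a check against direct inversion; a sparse implementation would restrict both loops to the non-zero pattern S∗:

```python
import numpy as np

def marginal_covariances(L):
    """Solve the recursions (85) for a dense Cholesky triangle L (Q = L L^T)."""
    n = L.shape[0]
    S = np.zeros((n, n))
    for i in range(n - 1, -1, -1):
        for j in range(n - 1, i - 1, -1):        # j = n, ..., i
            acc = 0.0
            for k in range(i + 1, n):            # k in I(i) iff L[k, i] != 0
                if L[k, i] != 0.0:
                    acc += L[k, i] * S[k, j]
            S[i, j] = (1.0 if i == j else 0.0) / L[i, i]**2 - acc / L[i, i]
            S[j, i] = S[i, j]                    # Sigma is symmetric
    return S

# Check: the recursions reproduce Q^{-1}.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
Q = A @ A.T + 5 * np.eye(5)
L = np.linalg.cholesky(Q)
assert np.allclose(marginal_covariances(L), np.linalg.inv(Q))
```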
Gaussian Markov Random Fields
Computing marginal variances for GMRFs∗
Statistical derivation
Band matrices
for i = n, . . . , 1
  for j = min(i + bw, n), . . . , i
    compute Σij from Eq. (85)

Equivalent to the Kalman recursions for smoothing.
Gaussian Markov Random Fields
Computing marginal variances for GMRFs∗
Marginal variances under hard and soft constraints
Marginal variances under hard and soft constraints
Let Σ̃ be the covariance with the constraints and Σ the one without. Then Σ̃ relates to Σ as

Σ̃ = Σ − Q⁻¹Aᵀ (AQ⁻¹Aᵀ)⁻¹ AQ⁻¹,

hence

Σ̃ii = Σii − ( Q⁻¹Aᵀ (AQ⁻¹Aᵀ)⁻¹ AQ⁻¹ )ii, i = 1, . . . , n,

for the hard constraint Ax = b.
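A dense sketch (ours) of the corrected marginal variances; note that b does not enter, since conditioning on Ax = b shifts only the mean:

```python
import numpy as np

def constrained_variances(Q, A):
    """Marginal variances under the hard constraint Ax = b, via
    Var~_ii = Sigma_ii - (Q^{-1}A' (A Q^{-1} A')^{-1} A Q^{-1})_ii."""
    Sigma = np.linalg.inv(Q)                 # unconstrained covariance
    W = Sigma @ A.T                          # Q^{-1} A'
    C = W @ np.linalg.solve(A @ W, W.T)      # the rank-k correction term
    return np.diag(Sigma) - np.diag(C)

# Example: variances of a GMRF under a sum-to-zero constraint 1'x = 0.
n = 4
Q = 2.0 * np.eye(n)
v = constrained_variances(Q, np.ones((1, n)))   # each equals 0.5 - 0.125
```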
Gaussian Markov Random Fields
Computing marginal variances for GMRFs∗
Marginal variances under hard and soft constraints
Computational costs
Time O(n)
Spatial O(n log(n)²)
Spatio-temporal O(n^(5/3)).