Sensitivity analysis of discrete Markov chains via matrix calculus

Linear Algebra and its Applications 438 (2013) 1727–1745
Contents lists available at SciVerse ScienceDirect
Journal homepage: www.elsevier.com/locate/laa

Hal Caswell*
Biology Department MS-34, Woods Hole Oceanographic Institution, Woods Hole, MA 02543, USA
Max Planck Institute for Demographic Research, Rostock, Germany

ARTICLE INFO
Article history: Received 29 November 2010; Accepted 29 July 2011; Available online 12 November 2011. Submitted by V. Mehrmann.
Keywords: Sensitivity; Elasticity; Population models; Life expectancy; Life lost to mortality; Community dynamics; Demography

ABSTRACT
Markov chains are often used as mathematical models of natural phenomena, with transition probabilities defined in terms of parameters that are of interest in the scientific question at hand. Sensitivity analysis is an important way to quantify the effects of changes in these parameters on the behavior of the chain. Many properties of Markov chains can be written as simple matrix expressions, and hence matrix calculus is a powerful approach to sensitivity analysis. Using matrix calculus, we derive the sensitivity and elasticity of a variety of properties of absorbing and ergodic finite-state chains. For absorbing chains, we present the sensitivities of the moments of the number of visits to each transient state, the moments of the time to absorption, the mean number of states visited before absorption, the quasistationary distribution, and the probabilities of absorption in each of several absorbing states. For ergodic chains, we present the sensitivity of the stationary distribution, the mean first passage time matrix, the fundamental matrix, and the Kemeny constant. We include two examples of application of the results to demographic and ecological problems.

© 2011 Elsevier Inc. All rights reserved.

1. Introduction

Perturbation analysis is a long-standing problem in the theory of Markov chains [40,10,15,13,42,43,31,9,33,35,36,27]. When Markov chains are applied as models of physical, biological, or social systems, they are often defined as functions of parameters that have substantive meaning. Perturbation

* Address: Biology Department MS-34, Woods Hole Oceanographic Institution, Woods Hole, MA 02543, USA. E-mail address: hcaswell@whoi.edu
0024-3795/$ - see front matter © 2011 Elsevier Inc. All rights reserved. http://dx.doi.org/10.1016/j.laa.2011.07.046

analysis is an important tool for understanding how those parameters determine the properties of the chain, and in predicting how changes in the environment (sensu lato) will change the outcome. For applications, the analyses should be easily computable, and flexible enough to handle a variety of dependent variables. To the extent that the properties of the chain can be written in terms of matrices, matrix calculus [28,29,1] is an ideal tool for this end.

In this paper, we apply matrix calculus to absorbing and ergodic discrete-time chains. For absorbing chains, we derive the sensitivities of the moments of the number of visits to each transient state, of the moments of the time to absorption, of the mean number of states visited before absorption, of the quasistationary distribution, and of the probabilities of absorption in each of several absorbing states. For ergodic chains, we derive the sensitivity of the stationary distribution, of the mean first passage time matrix, of the fundamental matrix, and of the Kemeny constant. The results are motivated by (but not restricted to) problems in ecology and demography, and we include two examples. The examples present, as far as we know, the first calculations of many of these sensitivities.

Notation: Vectors are column vectors, and transition matrices are column-stochastic and operate on column vectors, to agree with usage in other dynamical systems. The norm ‖x‖ denotes the 1-norm. The vector e is a vector of ones, e_i is the ith unit vector, E = ee^T is a matrix of ones, and E_ij = e_i e_j^T is a matrix with a 1 in the (i, j) entry and zeros elsewhere. The matrix D(x) is the matrix with the vector x on the diagonal and zeros elsewhere. The matrix X_dg is the diagonal matrix with the diagonal of X on its diagonal and zeros elsewhere. The Hadamard product is denoted by ∘ and the Kronecker product by ⊗. Where necessary to avoid ambiguity, a subscript denotes the dimension of the identity matrix (e.g., I_s is the identity matrix of dimension s). As in Matlab, row i and column j of X are denoted X(i,:) and X(:,j), respectively. The vec operator applied to an m × n matrix X yields the mn × 1 vector created by stacking the columns of X one above the other.

In matrix calculus [29], the derivative of a vector y with respect to a vector x is

dy/dx^T = (dy_i/dx_j).  (1)

If y and x are nonnegative it is often useful to evaluate the elasticity, or proportional sensitivity, of y to x. This gives the relative change in y caused by a relative change in x. There is no standard notation for elasticities, but here we adapt Samuelson's suggestion [39] and use a notation analogous to that for derivatives, but with ε replacing d:

εy/εx^T = D(y)^{-1} (dy/dx^T) D(x).  (2)

Matrices are treated by using the vec operator to transform them to vectors, and differentiated accordingly. Thus the derivative of y with respect to the matrix X is written dy/dvec^T X; this arrangement permits the chain rule to operate (see Appendix A and references therein). Most results are reported as differentials. If dy = Φ dx for some matrix Φ, then

dy/dx^T = Φ  (3)

dy/dθ^T = Φ dx/dθ^T  (4)

where θ is any vector of which x is a function [29]. If y depends on x_1 and x_2, then the partial differentials of y [14, p. 37] are denoted ∂_{x_1} y = (∂y/∂x_1^T) dx_1, and similarly for x_2. For more details of matrix calculus notation, see Appendix A.
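As a concrete illustration (not from the paper), the derivative of Eq. (1) and the elasticity of Eq. (2) can be sketched in Python/NumPy; the function and the example values are hypothetical:

```python
import numpy as np

def elasticity(y, x, dydx):
    """Elasticity matrix of Eq. (2): D(y)^{-1} (dy/dx^T) D(x).

    y, x  : positive 1-D arrays
    dydx  : Jacobian matrix (dy_i/dx_j), as in Eq. (1)
    """
    return np.diag(1.0 / y) @ dydx @ np.diag(x)

# hypothetical example: y1 = x1*x2, y2 = x1**2
x = np.array([2.0, 3.0])
y = np.array([x[0] * x[1], x[0] ** 2])
J = np.array([[x[1], x[0]],
              [2 * x[0], 0.0]])       # exact Jacobian dy/dx^T
E = elasticity(y, x, J)               # [[1, 1], [2, 0]]
```

For power functions the elasticities are the exponents, which is why the example returns integer entries regardless of the value of x.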

2. Absorbing chains

The transition matrix for a discrete-time absorbing chain can be written

P = [ U  0
      M  I ]  (5)

where U, of dimension s × s, is the transition matrix among the s transient states, and M, of dimension a × s, contains probabilities of transition from the transient states to the a absorbing states. Assume that the spectral radius of U is strictly less than 1. Because we are concerned here with absorption, but not what happens after, we ignore transitions among absorbing states; hence the identity matrix (a × a) in the lower right corner. The matrices U[θ] and M[θ] are functions of a vector of parameters. We assume that θ varies over some set in which the column sums of P are 1 and the spectral radius of U is strictly less than one.

2.1. Number of visits to transient states

Let ν_ij be the number of visits to transient state i, prior to absorption, by an individual starting in transient state j. The expectations of the ν_ij are entries of the fundamental matrix N = N_1 = (E(ν_ij)):

N = (I − U)^{-1}  (6)

(e.g., [25,23]). Let N_k = (E(ν_ij^k)) be a matrix containing the kth moments about the origin of the ν_ij. The first several of these matrices are [23, Theorem 3.1]

N_1 = (I − U)^{-1}  (7)

N_2 = (2N_dg − I) N_1  (8)

N_3 = (6N_dg^2 − 6N_dg + I) N_1  (9)

N_4 = (24N_dg^3 − 36N_dg^2 + 14N_dg − I) N_1.  (10)

Theorem 2.1. Let N_k be the matrix of kth moments of the ν_ij, as given by (7)–(10). The sensitivities of N_k, for k = 1, …, 4, are

dvec N_1 = (N_1^T ⊗ N_1) dvec U  (11)

dvec N_2 = [2(I ⊗ N_dg) − I_{s²}] dvec N_1 + 2(N^T ⊗ I) dvec N_dg  (12)

dvec N_3 = [I ⊗ (6N_dg^2 − 6N_dg + I)] dvec N_1
  + [6(N^T N_dg ⊗ I) + 6(N^T ⊗ N_dg) − 6(N^T ⊗ I)] dvec N_dg  (13)

dvec N_4 = [I ⊗ (24N_dg^3 − 36N_dg^2 + 14N_dg − I)] dvec N_1
  + [24(N^T N_dg^2 ⊗ I) + 24(N^T N_dg ⊗ N_dg) + 24(N^T ⊗ N_dg^2)
  − 36(N^T N_dg ⊗ I) − 36(N^T ⊗ N_dg) + 14(N^T ⊗ I)] dvec N_dg  (14)

where dvec N_dg = D(vec I) dvec N_1; see (A-12).

Proof. The result (11) is derived in [3, Section 3.1]. For k > 1, and considering N_k as a function of N_1 and N_dg, the total differential of N_k is

dvec N_k = (∂vec N_k/∂vec^T N_1) dvec N_1 + (∂vec N_k/∂vec^T N_dg) dvec N_dg.  (15)

The two terms of (15) are the partial differentials of vec N_k, obtained by taking differentials treating only N_1 or only N_dg as variables, respectively. Denote these partial differentials as ∂_{N_1} and ∂_{N_dg}. Differentiating N_2 in (8) gives

∂_{N_1} N_2 = 2N_dg (dN_1) − dN_1  (16)

∂_{N_dg} N_2 = 2(dN_dg) N_1.  (17)

Applying the vec operator and using (A-8) gives

∂_{N_1} vec N_2 = [2(I ⊗ N_dg) − I_{s²}] dvec N_1  (18)

∂_{N_dg} vec N_2 = 2(N_1^T ⊗ I) dvec N_dg  (19)

and (15) becomes

dvec N_2 = [2(I ⊗ N_dg) − I_{s²}] dvec N_1 + 2(N_1^T ⊗ I) dvec N_dg  (20)

which is (12). The derivations of dvec N_3 and dvec N_4 follow the same sequence of steps. The details are given in Online Supplement Appendix B. □

The derivatives of N_2, N_3, and N_4 can be used to study the variance, standard deviation, coefficient of variation, skewness, and kurtosis of the number of visits to the transient states [3,6,7].
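The Kronecker-product formula (11) is easy to check numerically. The following sketch (mine, not the paper's; the matrix U is an arbitrary hypothetical example) compares (11) against a finite-difference perturbation of one entry of U, with column-major flattening playing the role of vec:

```python
import numpy as np

rng = np.random.default_rng(0)
s = 3
U = rng.uniform(0.0, 0.25, (s, s))      # spectral radius safely below 1
N1 = np.linalg.inv(np.eye(s) - U)       # fundamental matrix, Eq. (6)

J = np.kron(N1.T, N1)                   # Eq. (11): dvec N1 / dvec^T U

# finite-difference check: perturb the single entry u_{21}
h = 1e-7
dU = np.zeros((s, s))
dU[1, 0] = h
N1h = np.linalg.inv(np.eye(s) - (U + dU))
fd = (N1h - N1).flatten('F') / h        # numerical derivative
an = J @ dU.flatten('F') / h            # matrix-calculus derivative
```

The two derivative vectors agree to within the finite-difference truncation error; `flatten('F')` stacks columns, matching the paper's definition of vec.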

2.2. Time to absorption

Let η_j be the time to absorption starting in transient state j, and let η_k = E(η_1^k, …, η_s^k)^T. The first several of these moments satisfy [23, Theorem 3.2]

η_1^T = e^T N_1  (21)

η_2^T = η_1^T (2N_1 − I)  (22)

η_3^T = η_1^T (6N_1^2 − 6N_1 + I)  (23)

η_4^T = η_1^T (24N_1^3 − 36N_1^2 + 14N_1 − I).  (24)

Theorem 2.2. Let η_k be the vector of the kth moments of the η_i. The sensitivities of these moment vectors are

dη_1 = (I ⊗ e^T) dvec N_1  (25)

dη_2 = (2N_1^T − I) dη_1 + 2(I ⊗ η_1^T) dvec N_1  (26)

dη_3 = (6N_1^2 − 6N_1 + I)^T dη_1
  + [6(N_1^T ⊗ η_1^T) + 6(I ⊗ η_1^T N_1) − 6(I ⊗ η_1^T)] dvec N_1  (27)

dη_4 = (24N_1^3 − 36N_1^2 + 14N_1 − I)^T dη_1
  + {24[(N_1^T)^2 ⊗ η_1^T] + 24(N_1^T ⊗ η_1^T N_1) + 24(I ⊗ η_1^T N_1^2)
  − 36(N_1^T ⊗ η_1^T) − 36(I ⊗ η_1^T N_1) + 14(I ⊗ η_1^T)} dvec N_1  (28)

where dvec N_1 is given by (11).

Proof. The derivative of η_1 is obtained [3] by differentiating (21) to get dη_1^T = e^T (dN_1) and then applying the vec operator. For the higher moments, consider the η_k to be functions of η_1 and N_1, and write the total differential

dη_k = (∂η_k/∂η_1^T) dη_1 + (∂η_k/∂vec^T N_1) dvec N_1.  (29)

The partial differentials of η_2 with respect to η_1 and N_1 are

∂_{η_1} η_2^T = (dη_1^T)(2N_1 − I)  (30)

∂_{N_1} η_2^T = 2η_1^T (dN_1).  (31)

Applying the vec operator and using (A-8) gives

∂_{η_1} η_2 = (2N_1^T − I) dη_1  (32)

∂_{N_1} η_2 = 2(I ⊗ η_1^T) dvec N_1  (33)

which combine according to (29) to yield (26). The derivations of dη_3 and dη_4 follow the same sequence of steps; the details are shown in Online Supplementary Appendix B. □
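Eqs. (21)–(24) are straightforward to evaluate once N_1 is in hand. As a sketch (mine, not from the paper), consider a single transient state with a hypothetical survival probability p = 0.5, so the time to absorption is geometric on {1, 2, …} with success probability q = 1 − p; the moment formulas reproduce the known geometric moments:

```python
import numpy as np

p = 0.5
U = np.array([[p]])                     # one transient state
s = U.shape[0]
I = np.eye(s)
N1 = np.linalg.inv(I - U)

e = np.ones(s)
eta1 = N1.T @ e                          # Eq. (21): eta1^T = e^T N1
eta2 = (2 * N1 - I).T @ eta1             # Eq. (22)
eta3 = (6 * N1 @ N1 - 6 * N1 + I).T @ eta1             # Eq. (23)
eta4 = (24 * np.linalg.matrix_power(N1, 3)
        - 36 * N1 @ N1 + 14 * N1 - I).T @ eta1         # Eq. (24)

# geometric with q = 0.5: E[T] = 2, E[T^2] = 6, E[T^3] = 26, E[T^4] = 150
```

The same code applies unchanged to any s × s transient matrix U with spectral radius below one.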

2.3. Number of states visited before absorption

Let ξ_i ≥ 1 be the number of distinct transient states visited before absorption, and let ξ_1 = E(ξ). Then

ξ_1^T = e^T N_dg^{-1} N_1  (34)

[23, Section 3.2.5], where N_dg^{-1} = (N_dg)^{-1}.

Theorem 2.3. Let ξ_1 = E(ξ). The sensitivity of ξ_1 is

dξ_1 = [−(N_1^T ⊗ e^T)(N_dg^{-1} ⊗ N_dg^{-1}) D(vec I) + (I ⊗ e^T N_dg^{-1})] dvec N_1  (35)

where dvec N_1 is given by (11).

Proof. Differentiating (34) yields

dξ_1^T = e^T (dN_dg^{-1}) N_1 + e^T N_dg^{-1} dN_1.  (36)

Applying the vec operator and using (A-8) gives

dξ_1 = (N_1^T ⊗ e^T) dvec N_dg^{-1} + (I ⊗ e^T N_dg^{-1}) dvec N_1.  (37)

Applying (A-10) to dvec N_dg^{-1} and using (A-13) for dvec N_dg gives

dξ_1 = −(N_1^T ⊗ e^T)(N_dg^{-1} ⊗ N_dg^{-1}) D(vec I) dvec N_1 + (I ⊗ e^T N_dg^{-1}) dvec N_1  (38)

which simplifies to (35). □

2.4. Multiple absorbing states and probabilities of absorption

When the chain includes a > 1 absorbing states, the entry m_ij of the a × s submatrix M in (5) is the probability of transition from transient state j to absorbing state i. The result of the competing risks of absorption is a set of probabilities b_ij = P[absorption in i | starting in j] for i = 1, …, a and j = 1, …, s. The matrix B = (b_ij) = M N_1 [23, Theorem 3.3].

Theorem 2.4. Let B = M N_1 be the matrix of absorption probabilities. Then

dvec B = (N_1^T ⊗ I) dvec M + (N_1^T ⊗ B) dvec U.  (39)

Proof. Differentiating B yields

dB = (dM) N_1 + M (dN_1).  (40)

Applying the vec operator gives

dvec B = (N_1^T ⊗ I) dvec M + (I ⊗ M) dvec N_1.  (41)

Substituting (11) for dvec N_1 and simplifying gives (39). □

Column j of B is the probability distribution of the eventual absorption state for an individual starting in transient state j. Usually a few of those starting states are of particular interest (e.g., states corresponding to "birth" or to the start of some process). Let B(:,j) = B e_j denote column j of B, where e_j is the jth unit vector of length s. Thus the derivative of B(:,j) is

dvec B(:,j) = (e_j^T ⊗ I_a) dvec B  (42)

where dvec B is given by (39). Similarly, row i of B is B(i,:) = e_i^T B and

dvec B(i,:) = (I_s ⊗ e_i^T) dvec B  (43)

where e_i is the ith unit vector of length a.
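The following sketch (mine, using a hypothetical chain with three transient and two absorbing states) computes B = M N_1 and checks the U-term of Eq. (39) against a finite difference:

```python
import numpy as np

s, a = 3, 2
U = np.array([[0.3, 0.1, 0.0],
              [0.2, 0.4, 0.1],
              [0.0, 0.2, 0.5]])
M = np.array([[0.3, 0.1, 0.2],
              [0.2, 0.2, 0.2]])         # columns of [U; M] sum to 1

N1 = np.linalg.inv(np.eye(s) - U)
B = M @ N1                               # absorption probabilities, a x s

# Eq. (39), U part: dvec B = (N1^T kron B) dvec U  (holding M fixed)
J_U = np.kron(N1.T, B)

h = 1e-7
dU = np.zeros((s, s))
dU[2, 1] = h
Bh = M @ np.linalg.inv(np.eye(s) - (U + dU))
fd = (Bh - B).flatten('F') / h
an = J_U @ dU.flatten('F') / h
```

Because absorption is certain, each column of B sums to one, which is a useful sanity check on the construction of U and M.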

2.5. The quasistationary distribution

The quasistationary distribution of an absorbing Markov chain gives the limiting probability distribution, over the set of transient states, of the state of an individual that has yet to be absorbed. Let w and v be the right and left eigenvectors associated with the dominant eigenvalue of U, normalized so that ‖w‖ = ‖v‖ = 1. Darroch and Seneta [11] defined two quasistationary distributions in terms of w and v. The limiting probability distribution of the state of an individual, given that absorption has not yet happened, converges to

q_a = w.  (44)

The limiting probability distribution of the state of an individual, given that absorption has not happened and will not happen for a long time, is

q_b = (w ∘ v)/(w^T v).  (45)

Lemma 2.1. Let the dominant eigenvalue of U, guaranteed real and nonnegative by the Perron–Frobenius theorem, satisfy 0 < λ < 1, and let w and v be the right and left eigenvectors corresponding to λ, scaled to sum to one. Then

dw = (λI_s − U + we^T U)^{-1} [w^T ⊗ (I_s − we^T)] dvec U  (46)

dv = (λI_s − U^T + ve_1^T U^T)^{-1} [(I_s − ve_1^T) ⊗ v^T] dvec U.  (47)

Proof. Eq. (46) is proven in [5, Section 6.1]. Eq. (47) is obtained by treating v as the right eigenvector of U^T. □

Theorem 2.5. The derivative of the quasistationary distribution q_a is given by (46). The derivative of the quasistationary distribution q_b is

dq_b = (1/(v^T w)) [(D(v) − q_b v^T) dw + (D(w) − q_b w^T) dv]  (48)

where dw and dv are given by (46) and (47), respectively.

Proof. The derivative of q_a follows from its definition as the scaled right eigenvector of U. For q_b, differentiating (45) gives

dq_b = (1/(v^T w)^2) {(v^T w) d(v ∘ w) − (v ∘ w)[(dv^T) w + v^T (dw)]}  (49)

     = (1/(v^T w)) [d(v ∘ w) − q_b (dv^T) w − q_b v^T (dw)].  (50)

Applying the vec operator and using (A-11) for dvec(v ∘ w) gives

dq_b = (1/(v^T w)) [D(v) dw + D(w) dv − (w^T ⊗ q_b) dv − q_b v^T dw]  (51)

which simplifies to give (48). □
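A small numerical sketch (mine; U is a hypothetical 3-state transient matrix) of the two Darroch–Seneta distributions (44)–(45), computing w and v as the dominant right and left eigenvectors of U scaled to sum to one:

```python
import numpy as np

U = np.array([[0.4, 0.2, 0.1],
              [0.3, 0.4, 0.2],
              [0.1, 0.2, 0.5]])          # hypothetical transient matrix

lam, W = np.linalg.eig(U)
w = W[:, np.argmax(lam.real)].real
w = w / w.sum()                           # right eigenvector, sums to 1

lam2, V = np.linalg.eig(U.T)
v = V[:, np.argmax(lam2.real)].real
v = v / v.sum()                           # left eigenvector, sums to 1

qa = w                                    # Eq. (44)
qb = (w * v) / (w @ v)                    # Eq. (45): Hadamard product, scaled
```

Since the entries of w ∘ v sum to w^T v, the vector q_b is automatically a probability distribution.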

3. An absorbing chain example: life lost due to mortality

The approach here makes it easy to compute the sensitivity of a variety of dependent variables calculated from the Markov chain. As an example of this flexibility, consider a recently developed demographic index, the number of years of life lost due to mortality [44].

The transient states of this chain are age classes, absorption corresponds to death, and absorbing states correspond to age at death. Let μ_i be the mortality rate and p_i = exp(−μ_i) the survival probability at age i. The matrix U has the p_i on the subdiagonal and zeros elsewhere. The matrix M has 1 − p_i on the diagonal and zeros elsewhere. Let f = B(:,1) be the distribution of age at death and η_1 the vector of expected longevity as a function of age.

A death at age i represents the loss of some number of years of life beyond that age. The expectation of that loss is given by the ith entry of η_1, and the expected number of years lost over the distribution of age at death is η† = η_1^T f. This quantity also measures the disparity among individuals in longevity [44]. If everyone died at the identical age x, f would be a delta function at x and further life expectancy at age x would be zero; their product would give η† = 0. Declines in disparity have accompanied the increases in life expectancy observed in developed countries [12,45]. Thus it is useful to know how η† responds to changes in mortality.

Differentiating η† gives

dη† = (dη_1^T) B e_1 + η_1^T (dB) e_1.  (52)

Applying the vec operator gives

dη† = f^T dη_1 + (e_1^T ⊗ η_1^T) dvec B.  (53)

Substituting (25) for dη_1 and (39) for dvec B gives

dη† = f^T (I ⊗ e^T) dvec N_1 + (e_1^T ⊗ η_1^T)[(N_1^T ⊗ I) dvec M + (N_1^T ⊗ B) dvec U].  (54)

Simplifying and writing derivatives in terms of μ gives

dη†/dμ^T = [f^T (N_1^T ⊗ η_1^T) + (e_1^T N_1^T ⊗ η_1^T B)] dvec U/dμ^T
  + (e_1^T N_1^T ⊗ η_1^T) dvec M/dμ^T.  (55)
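The quantities entering (52)–(55) can be assembled directly from a mortality schedule. A numerical sketch (mine; the mortality rates are hypothetical, and I close the last age class with a self-loop so that the columns of P sum to one, a detail the construction above leaves open):

```python
import numpy as np

mu = np.array([0.05, 0.01, 0.02, 0.10, 0.30])   # hypothetical mortality rates
p = np.exp(-mu)                                  # survival probabilities
s = len(p)

U = np.diag(p[:-1], -1)          # survival to the next age class
U[-1, -1] = p[-1]                # assumed: last age class persists with p_s
M = np.diag(1.0 - p)             # absorbing state i = death at age i

N1 = np.linalg.inv(np.eye(s) - U)
eta1 = N1.T @ np.ones(s)         # remaining life expectancy by age, Eq. (21)
B = M @ N1
f = B[:, 0]                      # distribution of age at death, from age 1

eta_dagger = eta1 @ f            # mean years of life lost: eta† = eta1^T f
```

Because every trajectory is eventually absorbed, f is a probability distribution, and η† is a weighted average of remaining life expectancies.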

Fig. 1. The elasticity of mean years of life lost due to mortality, η†, to changes in age-specific mortality, calculated from the female life tables of India in 1961 and of Japan in 2006. Data obtained from the Human Mortality Database [20].

Because mortality rates vary over several orders of magnitude with age, it is useful to present the results as elasticities,

εη†/εμ^T = (1/η†) (dη†/dμ^T) D(μ).  (56)

Fig. 1 shows these elasticities for two populations chosen to have very different life expectancies: India in 1961, with female life expectancy of 45 years and η† = 23.9 years, and Japan in 2006, with female life expectancy of 86 years and η† = 10.1 years [20]. In both cases, elasticities are positive from birth to some age (≈50 for India, ≈85 for Japan) and negative thereafter. This implies that reductions in infant and early life mortality would reduce η†, whereas reductions in old-age mortality would increase η†.

4. Ergodic chains

Now let us consider perturbations of an ergodic finite-state Markov chain with an irreducible, primitive, column-stochastic transition matrix P of dimension s × s. The stationary distribution π is given by the right eigenvector, scaled to sum to 1, corresponding to the dominant eigenvalue λ_1 = 1 of P. The fundamental matrix of the chain is, in our notation, Z = (I − P + πe^T)^{-1} [25]. (This is simply the transpose of the result derived from a row-stochastic matrix.)

The perturbations of interest are those for which P remains a stochastic matrix. Such perturbations are easily defined when the p_ij depend explicitly on a parameter vector θ. However, when the parameters of interest are the p_ij themselves, an implicit parameterization must be defined to preserve the stochastic nature of P under perturbation [10,2]. In Section 4.5 we will explore new expressions for two different forms of implicit parameterization.

Previous studies of perturbations of ergodic chains focus almost completely on perturbations of the stationary distribution, and are divided between those focusing on sensitivity as a derivative, e.g. [40,10,15], and studies focusing on perturbation bounds and condition numbers [13,31,42,21,26]; for reviews see [9,27]. My approach here is similar in spirit to that of [40,10,15] in focusing on derivatives of Markov chain properties with respect to parameter perturbations, but taking advantage of the matrix calculus approach. We do not consider perturbation bounds here.

In Sections 4.1–4.4 we derive the sensitivities of the stationary distribution, the fundamental matrix, the mean first passage times, and the Kemeny constant. In Section 4.5 we consider implicit parameter dependence, and in Section 5 give an ecological example.

4.1. The stationary distribution

Theorem 4.1. Let π be the stationary distribution, satisfying Pπ = π and e^T π = 1. The sensitivity of π is

dπ = [π^T ⊗ (Z − πe^T)] dvec P  (57)

where Z is the fundamental matrix of the chain.

Proof. The vector π is the right eigenvector of P, scaled to sum to 1. Applying Lemma 2.1, and noting that λ = 1 and e^T P = e^T, gives dπ = Z[π^T ⊗ (I_s − πe^T)] dvec P. Noting that Zπ = π and simplifying the Kronecker products yields (57). □

Based on an analysis of eigenvector sensitivity [32], Golub and Meyer [15] derived an expression for the derivative of π to a change in a single element of P using the group generalized inverse (I − P)^# of I − P. Since (I − P)^# = Z − πe^T [15], expression (57) is exactly the Golub–Meyer result expressed in matrix calculus notation. Our results here permit sensitivity analysis of functions of π using only the chain rule. If g(π) is a vector- or scalar-valued function of π, then

dg(π) = (dg/dπ^T)(dπ/dvec^T P) dvec P.  (58)

Some examples will appear in Section 5.
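Eq. (57) can be checked numerically. In this sketch (mine; P is an arbitrary hypothetical column-stochastic matrix), the perturbation of one entry is compensated additively within its column so that the perturbed matrix remains stochastic:

```python
import numpy as np

rng = np.random.default_rng(1)
s = 4
P = rng.uniform(0.1, 1.0, (s, s))
P /= P.sum(axis=0)                         # column-stochastic

def stationary(P):
    """Right eigenvector for the dominant eigenvalue, scaled to sum to 1."""
    lam, V = np.linalg.eig(P)
    w = V[:, np.argmax(lam.real)].real
    return w / w.sum()

pi = stationary(P)
Z = np.linalg.inv(np.eye(s) - P + np.outer(pi, np.ones(s)))

# Eq. (57): dpi = [pi^T kron (Z - pi e^T)] dvec P
J = np.kron(pi.reshape(1, -1), Z - np.outer(pi, np.ones(s)))

# finite difference with additive compensation in column 1
h = 1e-7
dP = np.zeros((s, s))
dP[:, 0] = -h / (s - 1)
dP[2, 0] = h
fd = (stationary(P + dP) - pi) / h
an = J @ dP.flatten('F') / h
```

The compensated direction keeps the column sums at one, so both sides of the comparison refer to genuine stationary distributions.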

4.2. The fundamental matrix

The fundamental matrix Z = (I − P + πe^T)^{-1} plays a role in ergodic chains similar to that played by N_1 in absorbing chains [25]. It has been extended using generalized inverses [30,24], but we do not consider those extensions here.

Theorem 4.2. The sensitivity of the fundamental matrix is

dvec Z = (Z^T ⊗ Z) {I_{s²} − [eπ^T ⊗ (Z − πe^T)]} dvec P.  (59)

Proof. From (A-10),

dvec Z = −(Z^T ⊗ Z) dvec(I − P + πe^T)  (60)

       = (Z^T ⊗ Z)(dvec P − (e ⊗ I_s) dπ).  (61)

Substituting (57) for dπ and simplifying gives (59). □

4.3. The first passage time matrix

Let R = (r_ij) be the matrix of mean first passage times from j to i, given by [23, Theorem 4.7]

R = D(π)^{-1} (I − Z + Z_dg E).  (62)

Again, this is the transpose of the expression obtained when P is row-stochastic.

Theorem 4.3. The sensitivity of R is

dvec R = −[R^T ⊗ D(π)^{-1}] D(vec I_s)(e ⊗ I_s) dπ
  − {[I_s ⊗ D(π)^{-1}] − [E ⊗ D(π)^{-1}] D(vec I_s)} dvec Z  (63)

where dπ is given by (57) and dvec Z is given by (59).

Proof. Differentiating (62) gives

dR = d[D(π)^{-1}](I − Z + Z_dg E) + D(π)^{-1} [−dZ + (dZ_dg) E].  (64)

Applying the vec operator gives

dvec R = [(I − Z + Z_dg E)^T ⊗ I_s] dvec[D(π)^{-1}]
  − [I_s ⊗ D(π)^{-1}] dvec Z + [E ⊗ D(π)^{-1}] dvec Z_dg.  (65)

Using (A-10) for dvec[D(π)^{-1}], (A-16) for dvec D(π), and (A-13) for dvec Z_dg yields

dvec R = −[R^T D(π) ⊗ I_s][D(π)^{-1} ⊗ D(π)^{-1}] D(vec I)(e ⊗ I) dπ
  − [I ⊗ D(π)^{-1}] dvec Z + [E ⊗ D(π)^{-1}] D(vec I) dvec Z  (66)

which simplifies to give (63). □

4.4. Mixing time and the Kemeny constant

The mixing time K of a chain is the mean time required to get from a specified state to a state chosen at random from the stationary distribution π. Remarkably, K is independent of the starting state [16,22] and is sometimes called Kemeny's constant; it is a measure of the rate of convergence to stationarity, and is K = trace(Z) [22]. In addition to being a quantity of interest in itself, the rate of convergence also plays a role in the sensitivity of the stationary distribution of ergodic chains [21,35].

Theorem 4.4. The sensitivity of K is

dK = (vec I_s)^T dvec Z.  (67)

Proof. Differentiating K = trace(Z) gives

dK = e^T (I ∘ dZ) e.  (68)

Applying the vec operator and using (A-11) gives

dK = (e^T ⊗ e^T) D(vec I) dvec Z  (69)

which simplifies to (67). □
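Combining (67) with (59) gives the sensitivity of K to every entry of P at once. A sketch (mine, with a hypothetical column-stochastic P and a compensated finite-difference check on K = trace(Z)):

```python
import numpy as np

rng = np.random.default_rng(2)
s = 4
P = rng.uniform(0.1, 1.0, (s, s))
P /= P.sum(axis=0)                         # column-stochastic

def stat_and_Z(P):
    lam, V = np.linalg.eig(P)
    pi = V[:, np.argmax(lam.real)].real
    pi /= pi.sum()
    n = len(pi)
    Z = np.linalg.inv(np.eye(n) - P + np.outer(pi, np.ones(n)))
    return pi, Z

pi, Z = stat_and_Z(P)
e = np.ones(s)

# Eq. (59) then Eq. (67): dK = (vec I)^T dvec Z
dvecZ_dvecP = np.kron(Z.T, Z) @ (
    np.eye(s * s) - np.kron(np.outer(e, pi), Z - np.outer(pi, e)))
dK_dvecP = np.eye(s).flatten('F') @ dvecZ_dvecP

# additively compensated perturbation of column 2
h = 1e-7
dP = np.zeros((s, s))
dP[:, 1] = -h / (s - 1)
dP[0, 1] = h
fd = (np.trace(stat_and_Z(P + dP)[1]) - np.trace(Z)) / h
an = dK_dvecP @ dP.flatten('F') / h
```

As with the stationary distribution, the compensation keeps the perturbed matrix stochastic, so the theorem applies exactly.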

4.5. Implicit parameters and compensation

Theorems 4.1–4.4 are written in terms of dvec P. However, perturbation of any element, say p_kj, to p_kj + θ_kj, must be compensated for by adjustments of the other elements in column j so that the column sum is 1 [10]. Caswell [2] introduced two kinds of compensation likely to be of use in applications: additive and proportional. Additive compensation adjusts all the elements of the column by an equal amount, distributing the perturbation θ_kj additively over column j. Proportional compensation distributes θ_kj in proportion to the values of the p_ij, for i ≠ k. Proportional compensation is attractive because it preserves the pattern of zero and non-zero elements within P.

To develop the compensation formulae, let us start by considering a probability vector p, of dimension s × 1, with p_i ≥ 0 and Σ_i p_i = 1. Let θ_i be the perturbation of p_i, and write

p(θ) = p(0) + Aθ  (70)

for some matrix A to be determined. If y is a function of p, then

dy = (dy/dp^T)(dp/dθ^T) dθ  (71)

evaluated at θ = 0. For the case of additive compensation, we write

p_1(θ) = p_1(0) + θ_1 − θ_2/(s−1) − ⋯ − θ_s/(s−1)
p_2(θ) = p_2(0) − θ_1/(s−1) + θ_2 − ⋯ − θ_s/(s−1)
⋮  (72)
p_s(θ) = p_s(0) − θ_1/(s−1) − θ_2/(s−1) − ⋯ + θ_s.

The perturbation θ_1 is added to p_1 and compensated for by subtracting θ_1/(s−1) from all other entries of p; clearly Σ_i p_i(θ) = 1 for any perturbation vector θ. The system of equations (72) can be written

p(θ) = p(0) + (I − C/(s−1)) θ  (73)

where C = E − I is the Toeplitz matrix with zeros on the diagonal and ones elsewhere. Thus the matrix A in (70) is

A = I − C/(s−1).  (74)

For proportional compensation, assume that p_i < 1 for all i. The vector p(θ) is

p_1(θ) = p_1(0) + θ_1 − p_1 θ_2/(1−p_2) − ⋯ − p_1 θ_s/(1−p_s)
p_2(θ) = p_2(0) − p_2 θ_1/(1−p_1) + θ_2 − ⋯ − p_2 θ_s/(1−p_s)
⋮  (75)
p_s(θ) = p_s(0) − p_s θ_1/(1−p_1) − p_s θ_2/(1−p_2) − ⋯ + θ_s.

The perturbation θ_1 is added to p_1 and compensated for by subtracting θ_1 p_i/(1−p_1) from the ith entry of p. Again, Σ_i p_i(θ) = 1 for any perturbation vector θ. Eq. (75) can be written

p(θ) = p(0) + [I − D(p) C D(e − p)^{-1}] θ  (76)

so that the matrix A in (70) is

A = I − D(p) C D(e − p)^{-1}.  (77)

Now consider perturbation of a probability matrix P, where each column is a probability vector. Define a perturbation matrix Θ where θ_ij is the perturbation of p_ij. Perturbations of column j are to be compensated by a matrix A_j, so that

P(Θ) = P(0) + [A_1 Θ(:,1) ⋯ A_s Θ(:,s)]  (78)

where A_i compensates for the changes in column i of P. Applying the vec operator to (78) gives

vec P(Θ) = vec P(0) + blockdiag(A_1, …, A_s) vec Θ  (79)

          = vec P(0) + Σ_{i=1}^{s} (E_ii ⊗ A_i) vec Θ.  (80)

The terms in the summation in (80) are recognizable as the vec of the product A_i Θ E_ii; thus

P(Θ) = P(0) + Σ_{i=1}^{s} A_i Θ E_ii.  (81)

Theorem 4.5. Let P be a column-stochastic s × s transition matrix. Let Θ be a matrix of perturbations, where θ_ij is applied to p_ij, and the other entries of Θ compensate for the perturbation. Let C = E − I. If compensation is additive, then

P(Θ) = P(0) + (I − C/(s−1)) Θ  (82)

dvec P/dvec^T Θ = [I_{s²} − (I_s ⊗ C)/(s−1)].  (83)

If compensation is proportional,

P(Θ) = P(0) + Σ_{i=1}^{s} {I − D[P(:,i)] C D[e − P(:,i)]^{-1}} Θ E_ii  (84)

dvec P/dvec^T Θ = I_{s²} − Σ_{i=1}^{s} {E_ii ⊗ D[P(:,i)] C D[e − P(:,i)]^{-1}}.  (85)

Proof. P(Θ) is given by (81). If compensation is additive, A_i is given by (74) for all i. Substituting into (81) gives (82). Differentiating (82), applying the vec operator, and using (A-5) gives (83).

If compensation is proportional, substituting (77) for A_i in (81) gives (84). Differentiating yields

dP = (dΘ) Σ_{i=1}^{s} E_ii − Σ_{i=1}^{s} D[P(:,i)] C D[e − P(:,i)]^{-1} (dΘ) E_ii.  (86)

Using the vec operator and (A-5) gives (85). □

Perturbations of P subject to compensation are given by perturbations of Θ. Thus for any function y(P) we can write

dy/dvec^T P |_comp = (dy/dvec^T P)(dvec P/dvec^T Θ)  (87)

where dvec P/dvec^T Θ is given (for additive and proportional compensation) by Theorem 4.5. The slight notational complexity is worthwhile for clarifying how to use Theorem 4.5 in practice.
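The two compensation schemes are easy to construct explicitly. A sketch (mine; p is a hypothetical probability vector) of the matrices A from (74) and (77), checking that both preserve the column sum:

```python
import numpy as np

def A_additive(s):
    """Eq. (74): A = I - C/(s-1), with C = E - I."""
    C = np.ones((s, s)) - np.eye(s)
    return np.eye(s) - C / (s - 1)

def A_proportional(p):
    """Eq. (77): A = I - D(p) C D(e-p)^{-1}; assumes all p_i < 1."""
    s = len(p)
    C = np.ones((s, s)) - np.eye(s)
    return np.eye(s) - np.diag(p) @ C @ np.diag(1.0 / (1.0 - p))

p = np.array([0.5, 0.3, 0.2])            # hypothetical column of P
theta = np.array([0.01, 0.0, 0.0])       # perturb p_1 by 0.01

p_add = p + A_additive(len(p)) @ theta   # both perturbed vectors
p_prop = p + A_proportional(p) @ theta   # still sum to one
```

Algebraically, e^T A = 0 in both cases (for the proportional scheme this uses Σ_i p_i = 1), which is exactly the statement that any perturbation direction Aθ leaves the column sum unchanged.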

5. An ergodic chain example: species succession in a marine community

Markov chains are used by ecologists as models of species replacement (succession) in ecological

communities; e.g. [19,18,37]. In these models, the state of a point on a landscape is given by the

species occupying that point. The entry pij of P is the probability that species j is replaced by species

i between t and t + 1. If a community consists of a large number of points independently subject to

the transition probabilities in P, the stationary distribution π gives the relative frequencies of species

in the community at equilibrium.

Hill et al. [18] used a Markov chain to describe a community of encrusting organisms occupying

rock surfaces at 30–35 m depth in the Gulf of Maine. The Markov chain contained 14 species plus an

additional state (“bare rock”) for unoccupied substrate. The matrix P was estimated from longitudinal

data in [17,18] and is given, along with a list of species names, in Online Appendix C. We will analyze

the response of species diversity and of the Kemeny constant to perturbations of the transition matrix.

5.1. Biotic diversity

The stationary distribution π, with the species numbered in order of decreasing abundance and bare rock placed at the end as state 15, is shown in Fig. 2. The two dominant species are an encrusting sponge (called Hymedesmia) and a bryozoan (Crisia).

The entropy of the stationary distribution, H(π) = −π^T (log π), where the logarithm is applied elementwise, is used as an index of biodiversity; it is maximal when all species are equally abundant and goes to 0 in a community dominated by a single species. The sensitivity of H is

\[ dH = -\left(\log \pi^{T} + e^{T}\right) d\pi. \tag{88} \]

Most ecologists, however, would not include bare substrate in a measure of biodiversity [18], so we

define a “biotic diversity” H_b(π) = H(π_b), where

\[ \pi_b = \frac{G\pi}{\|G\pi\|} \tag{89} \]


Fig. 2. The stationary distribution for the subtidal benthic community succession model of Hill et al. [18]. States 1–14 correspond to

species, numbered in decreasing order of abundance in the stationary distribution. State 15 is bare rock, unoccupied by any species.

For the identity of species and the transition matrix, see Appendix C.


with G a 14 × 15 0–1 matrix that selects rows 1–14 of π. Because π is positive, ‖Gπ‖ = e^T Gπ. Differentiating π_b gives

\[ d\pi_b = \left( \frac{G}{e^{T} G \pi} - \frac{G\pi\, e^{T} G}{\left(e^{T} G \pi\right)^2} \right) d\pi \tag{90} \]

which simplifies to

\[ d\pi_b = \frac{G - \pi_b\, e^{T} G}{e^{T} G \pi}\, d\pi. \tag{91} \]
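The Jacobian in (91) is easy to verify by finite differences. A minimal sketch (Python with NumPy; a 5-state toy vector stands in for the 15-state π, with the last state playing the role of bare rock):

```python
import numpy as np

rng = np.random.default_rng(1)
s = 5
pi = rng.random(s); pi /= pi.sum()           # a positive probability vector
G = np.eye(s)[:-1, :]                        # 0-1 matrix selecting the biotic states

f = lambda p: (G @ p) / (G @ p).sum()        # Eq. (89), using ||G pi|| = e^T G pi
pib = f(pi)

# Jacobian from Eq. (91): dpi_b = (G - pi_b e^T G) / (e^T G pi) dpi
e = np.ones(s - 1)
J = (G - np.outer(pib, e @ G)) / (e @ G @ pi)

# central-difference check along a random direction
d = rng.standard_normal(s); h = 1e-6
fd = (f(pi + h * d) - f(pi - h * d)) / (2 * h)
print(np.allclose(fd, J @ d, atol=1e-6))     # True
```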

This model contains no explicit parameters; perturbations of the transition probabilities themselves are of interest, and a compensation pattern is needed. Because the relative magnitudes of the entries in a column of P reflect the relative abilities of species to capture or to hold space, proportional compensation is appropriate here: it preserves these relative abilities.

The sensitivity and elasticity of the biotic diversity H_b to changes in the matrix P, subject to proportional compensation, are

\[ \left.\frac{dH_b}{d\operatorname{vec}^{T} P}\right|_{\mathrm{comp}} = \frac{dH_b}{d\pi_b^{T}}\,\frac{d\pi_b}{d\pi^{T}}\,\frac{d\pi}{d\operatorname{vec}^{T} P}\,\frac{d\operatorname{vec} P}{d\operatorname{vec}^{T}\Theta} \tag{92} \]

\[ \left.\frac{\varepsilon H_b}{\varepsilon \operatorname{vec}^{T} P}\right|_{\mathrm{comp}} = \frac{1}{H_b}\,\frac{dH_b}{d\operatorname{vec}^{T} P}\, D(\operatorname{vec} P) \tag{93} \]

where the four derivatives on the right-hand side of (92) are given by (88), (91), (57), and (85), respectively.

The sensitivity and elasticity vectors (92) and (93) are of dimension 1 × s² = 1 × 225. To reduce the number of independent perturbations, we consider subsets of the p_ij: disturbance (in which a species is replaced by bare rock), colonization of bare rock, replacement of one species by another, and persistence of a species in its location, where

\[
\begin{aligned}
P[\text{disturbance of sp. } i] &= p_{si} \\
P[\text{colonization by sp. } i] &= p_{is} \\
P[\text{persistence of sp. } i] &= p_{ii} \\
P[\text{replacement of sp. } i] &= \textstyle\sum_{k \neq i,s} p_{ki} \\
P[\text{replacement by sp. } i] &= \textstyle\sum_{j \neq i,s} p_{ij}.
\end{aligned}
\]

Extracting the corresponding elements of εH_b/ε vec^T P gives the elasticities to these classes of probabilities.
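This aggregation is straightforward to implement. In the sketch below (Python with NumPy; the 3 × 3 numbers and the function name are hypothetical, for illustration only), the 1 × s² elasticity vector is assumed to have been reshaped column-wise into an s × s matrix whose (i, j) entry is the elasticity to p_ij, with the last state as bare rock:

```python
import numpy as np

def aggregate_elasticities(Ei):
    """Aggregate an s x s matrix of elasticities to p_ij into the five
    classes of probabilities; state s (index -1) is bare rock."""
    s = Ei.shape[0]
    bare = s - 1
    out = {}
    for i in range(s - 1):                        # species states only
        others = [k for k in range(s) if k not in (i, bare)]
        out[i] = {
            "disturbance of": Ei[bare, i],        # p_si
            "colonization by": Ei[i, bare],       # p_is
            "persistence of": Ei[i, i],           # p_ii
            "replacement of": Ei[others, i].sum(),   # sum_{k != i,s} p_ki
            "replacement by": Ei[i, others].sum(),   # sum_{j != i,s} p_ij
        }
    return out

# hypothetical 3-state example: two species plus bare rock
Ei = np.arange(9.0).reshape(3, 3)
res = aggregate_elasticities(Ei)[0]
print(res["disturbance of"], res["replacement of"])   # 6.0 3.0
```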

Fig. 3 shows that the dominant species (1 and 2) have impacts that are larger than, and opposite in sign to, those of the remaining species. Biodiversity would be enhanced by increasing the disturbance of, or the replacement of, species 1 and 2, and reduced by increasing the rates of colonization by, persistence of, or replacement by species 1 and 2.

5.2. The Kemeny constant and mixing

Ecologists have used several measures of the rate of convergence of communities modeled by

Markov chains, including the damping ratio and Dobrushin's coefficient of ergodicity [18]. The Kemeny

constant K is an interesting addition; it gives the expected time to get from any initial state to a state

selected at random from the stationary distribution [22]. Once that state is reached, the behavior of the chain is indistinguishable from that of the stationary process.
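As a concrete illustration (Python with NumPy; a random 6-state chain, assuming the fundamental-matrix formulation of Hunter [22], K = trace(Z) with Z = (I − P + πe^T)^{−1} in the column-stochastic convention), K can be computed and checked against its spectral form K = 1 + Σ_{λ≠1} 1/(1 − λ):

```python
import numpy as np

rng = np.random.default_rng(2)
s = 6
P = rng.random((s, s)); P /= P.sum(axis=0)   # column-stochastic, positive, ergodic

# stationary distribution: right eigenvector of P for eigenvalue 1
w, V = np.linalg.eig(P)
idx = np.argmin(np.abs(w - 1.0))
pi = np.real(V[:, idx]); pi /= pi.sum()

# Kemeny constant via the fundamental matrix Z = (I - P + pi e^T)^{-1}
Z = np.linalg.inv(np.eye(s) - P + np.outer(pi, np.ones(s)))
K = np.trace(Z)

# equivalent spectral form: K = 1 + sum over non-unit eigenvalues of 1/(1 - lambda)
lam = np.delete(w, idx)
K_spec = 1.0 + np.sum(1.0 / (1.0 - lam))
print(np.isclose(K, K_spec.real))            # True
```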


[Fig. 3 here: five panels showing the elasticity of H_b to disturbance of, colonization by, persistence of, replacement of, and replacement by each species.]

Fig. 3. The elasticity of the biotic diversityHb(π) calculated over the biotic states of the stationary distribution of the subtidal benthic

community successionmodel of [18]. States 1–14 correspond to species, numbered in decreasing order of abundance in the stationary

distribution. State 15 is bare rock, unoccupied by any species. For the identity of species and the transition matrix, see Appendix C.

The sensitivity of K, subject to compensation, is

\[ \left.\frac{dK}{d\operatorname{vec}^{T} P}\right|_{\mathrm{comp}} = \frac{dK}{d\operatorname{vec}^{T} Z}\,\frac{d\operatorname{vec} Z}{d\operatorname{vec}^{T} P}\,\frac{d\operatorname{vec} P}{d\operatorname{vec}^{T}\Theta} \tag{94} \]

where the three terms on the right hand side are given by (67), (59), and (85), respectively.

Fig. 4 shows the sensitivities dK/d vec^T P, subject to proportional compensation, and aggregated as in Fig. 3. Unlike the case with H_b, the two dominant species do not stand out from the others. Increases in the rates of replacement will speed up convergence, and increases in persistence will slow convergence. The disturbance of, colonization by, persistence of, and replacement of species 6 (a sea anemone, Urticina crassicornis) have particularly large impacts on K. Examination of row 6 and column 6 of P (Appendix C) shows that U. crassicornis has the highest probability of persistence (p_66 = 0.86), and one of the lowest rates of disturbance, in the community. While it is far from dominant (Fig. 2), it has a major impact on the rate of mixing.

6. Discussion

Given that many properties of finite-state Markov chains can be expressed as simple matrix expressions, matrix calculus is an attractive approach to finding the sensitivity and elasticity to parameter perturbations. Most of the literature on perturbation analysis of Markov chains has focused on the stationary distribution of ergodic chains, but the approach here is equally applicable to absorbing chains, and to dependent variables other than the stationary distribution. The perturbation of ergodic chains is often studied using generalized inverses, since the influential studies of Meyer [30,15,13,31]. Matrix calculus provides a complementary approach; the sensitivity of the stationary distribution π obtained here agrees with the result obtained by Golub and Meyer [15] using the group generalized inverse.

The issue of compensation to preserve stochasticity of the transition matrix under perturbations may have interesting generalizations. An anonymous reviewer has pointed out that, in the case of reversible Markov chains, preserving reversibility under perturbations could be addressed by combining the approach here with results in [34].

The formulae obtained by matrix calculus may appear daunting, but they are in fact far simpler than expressions written without this formalism would be. They are easy to implement in a matrix-oriented


[Fig. 4 here: five panels showing the sensitivity of K to disturbance of, colonization by, persistence of, replacement of, and replacement by each species.]

Fig. 4. The sensitivity of the Kemeny constant K of the subtidal benthic community succession model of [18]. States 1–14 correspond

to species, numbered in decreasing order of abundance in the stationary distribution. State 15 is bare rock, unoccupied by any species.

For the identity of species and the transition matrix, see Appendix C.

language such as Matlab. Some of the intermediate matrices can be large, but they are often quite

sparse, and a language that can take advantage of the sparsity is helpful. The computational problems

arising in very large state spaces have not yet been investigated.

The examples shown here are typical of cases where absorbing or ergodic Markov chains are used

in population biology and ecology. In each example, the dependent variables of interest are functions

several steps removed from the chain itself. The ease with which one can differentiate such functions

is a particularly attractive property of the matrix calculus approach.

Acknowledgments

This research has been supported by NSF Grant DEB-0816514. Completion of the paper was supported by a Research Award from the Alexander von Humboldt Foundation. I am grateful for discussions with Mike Neubert, Alyson van Raalte, Michal Engelman, Catherine Nosova, Jeffrey Hunter, Carl Meyer, Joel Cohen, and students at the International Max Planck Research School in Demography course in Perturbation Analysis of Longevity. The comments of two reviewers helped to improve the manuscript. The hospitality of the Max Planck Institute for Demographic Research was helpful in developing these ideas.

Appendix A. Matrix calculus

Matrix calculus permits the consistent differentiation of scalar-, vector-, and matrix-valued func-

tions of scalar, vector, or matrix arguments [28,29,41]; for an introduction, see [1], and for ecological

and demographic applications see [4–6]. Here we describe the approach briefly.

If y is an n × 1 vector and x an m × 1 vector, the derivative of y with respect to x is the n × m Jacobian matrix

\[ \frac{dy}{dx^{T}} = \left( \frac{dy_i}{dx_j} \right). \tag{A-1} \]


Derivatives of matrices are written by transforming the matrices into vectors using the vec operator, which stacks the columns of the matrix into a column vector. The derivative of the m × n matrix Y with respect to the p × q matrix X is the mn × pq matrix

\[ \frac{d\operatorname{vec} Y}{d\operatorname{vec}^{T} X} \tag{A-2} \]

where vec^T X ≡ (vec X)^T. These definitions (unlike others that do not vectorize the matrices [28]) lead to the familiar chain rule; e.g.,

\[ \frac{d\operatorname{vec} Y}{d\operatorname{vec}^{T} Z} = \frac{d\operatorname{vec} Y}{d\operatorname{vec}^{T} X}\,\frac{d\operatorname{vec} X}{d\operatorname{vec}^{T} Z}. \tag{A-3} \]

Derivatives are constructed from differentials, where the differential of a matrix is the matrix of differentials of its elements. If, for vectors x and y and some matrix A, it can be shown that

\[ dy = A\, dx \tag{A-4} \]

then

\[ \frac{dy}{dx^{T}} = A \tag{A-5} \]

(the “first identification theorem” of Magnus and Neudecker [28]). Because differentials operate elementwise, d vec X = vec dX.

The combination of the chain rule and the identification theorem permits more complicated expressions involving differentials to be turned into derivatives with respect to an arbitrary vector, say θ. If

\[ dy = A\, dx + B\, dz \tag{A-6} \]

then

\[ \frac{dy}{d\theta^{T}} = A\,\frac{dx}{d\theta^{T}} + B\,\frac{dz}{d\theta^{T}} \tag{A-7} \]

for any θ. Obtaining expressions like (A-4) or (A-6) often uses the result [38] that

\[ \operatorname{vec}(ABC) = \left( C^{T} \otimes A \right) \operatorname{vec} B. \tag{A-8} \]
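Result (A-8) is easy to confirm numerically; note that NumPy's default reshape is row-major, so the column-stacking vec operator needs order="F":

```python
import numpy as np

rng = np.random.default_rng(3)
A, B, C = rng.random((2, 3)), rng.random((3, 4)), rng.random((4, 5))

vec = lambda X: X.reshape(-1, order="F")   # column-stacking vec operator

lhs = vec(A @ B @ C)                       # vec(ABC)
rhs = np.kron(C.T, A) @ vec(B)             # (C^T kron A) vec B
print(np.allclose(lhs, rhs))               # True
```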

A.1. Sensitivities and elasticities

Perturbation analysis may involve comparison of parameters measured on very different scales.

In such cases, it is helpful to compare proportional effects of proportional perturbations, also called

elasticities. If the sensitivity of y to x is dy/dx^T, then the elasticity is

\[ \frac{\varepsilon y}{\varepsilon x^{T}} = D(y)^{-1}\,\frac{dy}{dx^{T}}\, D(x). \tag{A-9} \]

There seems to be no accepted notation for elasticities; the notation used here is adapted [8] from that

in [39].

A.2. Some useful results

Several matrix calculus results will be used repeatedly. The derivative of the inverse of a matrix X is [29]

\[ d\operatorname{vec} X^{-1} = -\left[ \left( X^{-1} \right)^{T} \otimes X^{-1} \right] d\operatorname{vec} X. \tag{A-10} \]
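A quick finite-difference check of (A-10), assuming NumPy and a well-conditioned random X:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 3
X = rng.random((n, n)) + n * np.eye(n)      # diagonally dominant, well-conditioned
dX = rng.standard_normal((n, n))

vec = lambda M: M.reshape(-1, order="F")    # column-stacking vec operator
Xinv = np.linalg.inv(X)
J = -np.kron(Xinv.T, Xinv)                  # Jacobian from Eq. (A-10)

# central-difference approximation of d vec X^{-1} in the direction dX
h = 1e-6
fd = (vec(np.linalg.inv(X + h * dX)) - vec(np.linalg.inv(X - h * dX))) / (2 * h)
print(np.allclose(fd, J @ vec(dX), atol=1e-6))   # True
```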


The derivative of the Hadamard product A ∘ B is [28]

\[ d(A \circ B) = D(\operatorname{vec} A)\, d\operatorname{vec} B + D(\operatorname{vec} B)\, d\operatorname{vec} A. \tag{A-11} \]

The derivative of X_dg is obtained by differentiating X_dg = I ∘ X,

\[ dX_{\mathrm{dg}} = I \circ dX \tag{A-12} \]

and applying the vec operator and (A-11) to obtain

\[ d\operatorname{vec} X_{\mathrm{dg}} = D(\operatorname{vec} I)\, d\operatorname{vec} X. \tag{A-13} \]

The diagonal matrix D(x) can be written

\[ D(x) = I \circ \left( x\, e^{T} \right). \tag{A-14} \]

Differentiating and applying the vec operator gives

\[ d\operatorname{vec} D(x) = D(\operatorname{vec} I)\, \operatorname{vec}\left( dx\, e^{T} \right) \tag{A-15} \]
\[ \phantom{d\operatorname{vec} D(x)} = D(\operatorname{vec} I)\, (e \otimes I)\, dx. \tag{A-16} \]

Appendix B. Supplementary material

Supplementary data associated with this article can be found, in the online version, at

doi:10.1016/j.laa.2011.07.046.

References

[1] K.M. Abadir, J.R. Magnus, Matrix Algebra, Econometric Exercises 1, Cambridge University Press, Cambridge, United Kingdom, 2005.
[2] H. Caswell, Matrix Population Models: Construction, Analysis, and Interpretation, second ed., Sinauer Associates, Sunderland, Massachusetts, USA, 2001.
[3] H. Caswell, Applications of Markov chains in demography, in: A.N. Langville, W.J. Stewart (Eds.), MAM2006: Markov Anniversary Meeting, Boson Books, Raleigh, North Carolina, USA, 2006, pp. 319–334.
[4] H. Caswell, Sensitivity analysis of transient population dynamics, Ecol. Lett. 10 (2007) 1–15.
[5] H. Caswell, Perturbation analysis of nonlinear matrix population models, Demogr. Res. 18 (2008) 59–116.
[6] H. Caswell, Stage, age, and individual stochasticity in demography, Oikos 118 (2009) 1763–1782.
[7] H. Caswell, Perturbation analysis of continuous-time absorbing Markov chains, Numer. Linear Algebra Appl. (2011), doi:10.1002/nla.791.
[8] H. Caswell, M.G. Neubert, C.M. Hunter, Demography and dispersal: invasion speeds and sensitivity analysis in periodic and stochastic environments, Theor. Ecol. (2010), doi:10.1007/s12080-010-0091-z.
[9] G.E. Cho, C.D. Meyer, Comparison of perturbation bounds for the stationary distribution of a Markov chain, Linear Algebra Appl. 335 (2000) 137–150.
[10] J. Conlisk, Comparative statics for Markov chains, J. Econom. Dynam. Control 19 (1985) 139–151.
[11] J.N. Darroch, E. Seneta, On quasi-stationary distributions in absorbing discrete-time finite Markov chains, J. Appl. Probab. 2 (1965) 88–100.
[12] R.D. Edwards, S. Tuljapurkar, Inequality in life spans and a new perspective on mortality convergence across industrialized countries, Popul. Develop. Rev. 31 (2005) 645–674.
[13] R.E. Funderlic, C.D. Meyer Jr., Sensitivity of the stationary distribution vector for an ergodic Markov chain, Linear Algebra Appl. 76 (1986) 1–17.
[14] R.P. Gillespie, Partial Differentiation, second ed., Oliver and Boyd, Edinburgh, United Kingdom, 1954.
[15] G.H. Golub, C.D. Meyer Jr., Using the QR factorization and group inversion to compute, differentiate, and estimate the sensitivity of stationary probabilities for Markov chains, SIAM J. Algebraic Discrete Methods 7 (1986) 273–281.
[16] C.M. Grinstead, J.L. Snell, Introduction to Probability, second ed., American Mathematical Society, 2003.
[17] M.F. Hill, J.D. Witman, H. Caswell, Spatio-temporal variation in Markov chain models of subtidal community succession, Ecol. Lett. 5 (2002) 665–675.
[18] M.F. Hill, J.D. Witman, H. Caswell, A Markov chain model of a rocky subtidal community: succession and species interactions in a complex assemblage, Am. Nat. 164 (2004) E46–E61.
[19] H.S. Horn, Markovian properties of forest succession, in: M.L. Cody, J.M. Diamond (Eds.), Ecology and Evolution of Communities, Harvard University Press, Cambridge, MA, 1975, pp. 196–211.
[20] Human Mortality Database, University of California, Berkeley (USA), and Max Planck Institute for Demographic Research (Germany). Available from: <http://www.mortality.org> or <http://www.humanmortality.de>.
[21] J.J. Hunter, Stationary distributions and mean first passage times of perturbed Markov chains, Linear Algebra Appl. 410 (2005) 217–243.
[22] J.J. Hunter, Mixing times with applications to perturbed Markov chains, Linear Algebra Appl. 417 (2006) 108–123.
[23] M. Iosifescu, Finite Markov Processes and their Applications, Wiley, New York, USA, 1980.
[24] J.G. Kemeny, Generalization of a fundamental matrix, Linear Algebra Appl. 38 (1981) 193–206.
[25] J.G. Kemeny, J.L. Snell, Finite Markov Chains, Van Nostrand, New York, NY, USA, 1960.
[26] S. Kirkland, Conditioning properties of the stationary distribution for a Markov chain, Electron. J. Linear Algebra 10 (2003) 1–15.
[27] S.J. Kirkland, M. Neumann, N.-S. Sze, On optimal condition numbers for Markov chains, Numer. Math. 110 (2008) 521–537.
[28] J.R. Magnus, H. Neudecker, Matrix differential calculus with applications to simple, Hadamard, and Kronecker products, J. Math. Psych. 29 (1985) 474–492.
[29] J.R. Magnus, H. Neudecker, Matrix Differential Calculus with Applications in Statistics and Econometrics, John Wiley & Sons, New York, NY, USA, 1988.
[30] C.D. Meyer, The role of the group generalized inverse in the theory of finite Markov chains, SIAM Rev. 17 (1975) 443–464.
[31] C.D. Meyer, Sensitivity of the stationary distribution of a Markov chain, SIAM J. Matrix Anal. Appl. 15 (1994) 715–728.
[32] C.D. Meyer, G.W. Stewart, Derivatives and perturbations of eigenvectors, SIAM J. Numer. Anal. 25 (1982) 679–691.
[33] A.Y. Mitrophanov, Stability and exponential convergence of continuous-time Markov chains, J. Appl. Probab. 40 (2003) 970–979.
[34] A.Y. Mitrophanov, Reversible Markov chains and spanning trees, Math. Sci. 29 (2004) 107–114.
[35] A.Y. Mitrophanov, Sensitivity and convergence of uniformly ergodic Markov chains, J. Appl. Probab. 42 (2005) 1003–1014.
[36] A.Y. Mitrophanov, A. Lomsadze, M. Borodovsky, Sensitivity of hidden Markov models, J. Appl. Probab. 42 (2005) 632–642.
[37] L.C. Nelis, J.T. Wootton, Treatment-based Markov chain models clarify mechanisms of invasion in an invaded grassland community, Proc. Roy. Soc. London B 277 (2010) 539–547, doi:10.1098/rspb.2009.1564.
[38] W.E. Roth, On direct product matrices, Bull. Amer. Math. Soc. 40 (1934) 461–468.
[39] P.A. Samuelson, Foundations of Economic Analysis, Harvard University Press, Cambridge, Massachusetts, USA, 1947.
[40] P.J. Schweitzer, Perturbation theory and finite Markov chains, J. Appl. Probab. 5 (1968) 401–413.
[41] G.A.F. Seber, A Matrix Handbook for Statisticians, Wiley, New York, NY, USA, 2008.
[42] E. Seneta, Perturbation of the stationary distribution measured by ergodicity coefficients, Adv. Appl. Probab. 20 (1988) 228–230.
[43] E. Seneta, Sensitivity of finite Markov chains under perturbation, Stat. Probab. Lett. 17 (1993) 163–168.
[44] J.W. Vaupel, V. Canudas Romo, Decomposing change in life expectancy: a bouquet of formulas in honor of Nathan Keyfitz's 90th birthday, Demography 40 (2004) 201–216.
[45] J.R. Wilmoth, S. Horiuchi, Rectangularization revisited: variability of age at death within human populations, Demography 36 (1999) 475–495.