Lecture notes: Actuarial Models 1

Ronnie Loeffen

University of Manchester

November 22, 2019




Contents

1 Discrete time Markov chains
   1.1 Stochastic processes and the Markov property
   1.2 Discrete time Markov chains
   1.3 How to construct a discrete time Markov chain?
   1.4 Time homogeneous Markov chains

2 More on time homogeneous, discrete time Markov chains
   2.1 Communicating classes
   2.2 Stationary and limiting distributions

3 Analytical aspects of Markov jump processes
   3.1 Introduction
   3.2 Transition rates and Kolmogorov's forward differential equations
   3.3 Examples
   3.4 Kolmogorov's backward differential equations (non-examinable)

4 Probabilistic aspects of Markov jump processes
   4.1 Survival times and hazard functions
   4.2 Construction of an MJP
   4.3 Further remarks in the time homogeneous case

5 Estimation of transition rates for time homogeneous Markov jump processes
   5.1 Maximum likelihood for a time homogeneous MJP
   5.2 The likelihood ratio test: a tool for model selection

6 Estimation of mortality rates using classical actuarial methods
   6.1 Poisson model
   6.2 Estimating the exposed to risk
   6.3 Binomial model
   6.4 Graduation
      6.4.1 Smoothness
      6.4.2 Methods of graduation
   6.5 Testing the graduation for goodness of fit
      6.5.1 Chi-squared overall goodness of fit test
      6.5.2 Residual plots
      6.5.3 The sign test
      6.5.4 Change of sign test
      6.5.5 Grouping of signs test

A
   A.1 Dominated convergence theorem and Fubini
   A.2 Duration dependent transition rates


Chapter 1

Discrete time Markov chains

1.1 Stochastic processes and the Markov property

We are all familiar with the notion of a random variable. However, we are sometimes concerned with a quantity that changes at random with time (or in space). For example, we may be concerned with the capital of an insurance company. This fluctuates in a random manner with time since the number of policies the company has on its books, and hence its income, varies at random with time, the number and times of claims are random, and so are the sizes of the individual claims. We therefore say that the capital of the company is a random or stochastic process. Thus, in general, a random process X is a family {Xt : t ∈ T} of random variables indexed by some set T. Invariably t will represent 'time', as it will throughout this course, although in general the index t could represent a spatial position, e.g. Xt may represent the number of infected at geographical position t in a country. The set S of possible values of the process is called the state space of the random process. Random processes can thus be classified according to whether (a) the time set T is discrete or continuous and (b) the state space S is discrete or continuous. Below are examples of the four categories of random processes.

Example of a discrete time, discrete state space random process. Let Xt be the No Claims Discount (NCD) status of a motor insurance policyholder with insurance company ABC. Under this company's NCD scheme five levels of discount are possible, 0%, 20%, 40%, 50% and 60%, depending on the accident history of the driver; the company reviews the NCD status of each customer at the start of the policy and yearly thereafter. Thus Xt is a random process with discrete state space S = {0, 20, 40, 50, 60} and discrete time set T = {0, 1, 2, . . .}. Notice that since the state space is discrete there is a one-to-one correspondence between its states and the positive integers. We could thus code the states as '1' for 0% discount, '2' for 20% discount, . . . , '5' for 60% discount and have S = {1, 2, 3, 4, 5}. We will study this example in more detail later on.


Example of a discrete time, continuous state space random process. Customers enter a queue at random, i.e. the times between successive arrivals are random. Let Wn be the waiting time of the nth customer to enter the queue before he/she starts receiving service. The amount of service required by each customer is a random quantity. Here Wn is a random process with continuous state space S = [0, ∞) and discrete time set T = {1, 2, . . .}.

Example of a continuous time, discrete state space random process. An insurance company is interested in the number of claims Xt that it receives by time t. Here Xt is a random process with discrete state space S = {0, 1, 2, . . .} in continuous time T = [0, ∞). This particular example will be studied in more detail in the course unit MATH3/69542 Risk Theory.

Example of a continuous time, continuous state space random process. An insurance company is interested in the total value of the claims Xt that it receives by time t. Here Xt is a random process with continuous state space S = [0, ∞) in continuous time T = [0, ∞). Again this example will be studied in more detail in the course unit MATH3/69542 Risk Theory.

A particular realisation of Xt for all t in T is called a sample path of the process Xt. Just like a sample point is an outcome of a random variable, so a sample path is an outcome of a stochastic process. One should keep in mind that in the background there is some sample space Ω on which all the random variables Xt, t ∈ T, are defined. So a sample path of the stochastic process X is nothing more than the function t ↦ Xt(ω) for some fixed ω ∈ Ω. Drawing sample paths of a stochastic process is very useful for visualising the behaviour of the stochastic process, especially in the continuous time case. Note that in what follows we shall usually suppress the dependence of Xt on ω in the notation.

Example 1.1. Let Xt denote the number of observed heads after t independent coin tosses. Then X = {Xt : t ∈ {0, 1, 2, . . .}} is a discrete time stochastic process with discrete state space. In Figure 1.1 two possible sample paths of this process are depicted. We see that the sample paths of this process are increasing (in the weak sense) and the increments Xt − Xt−1, t ≥ 1, are either equal to 0 or +1.

In these notes we shall only consider stochastic processes with discrete state space. Further, we will only look at stochastic processes that satisfy a very particular dependence structure, namely the so-called Markov property.

Definition 1.2. A (discrete-valued) random process X = {Xt : t ∈ T} is said to enjoy the Markov property (or to be a Markov process, for short) if

Pr(Xtn = kn | Xt0 = k0, Xt1 = k1, . . . , Xtn−1 = kn−1) = Pr(Xtn = kn | Xtn−1 = kn−1),   (1.1)

for all n ≥ 1, all k0, k1, . . . , kn ∈ S and all increasing sequences t0 < t1 < . . . < tn−1 < tn of times in T such that Pr(Xt0 = k0, Xt1 = k1, . . . , Xtn−1 = kn−1) > 0.¹

¹Note that if Pr(Xt0 = k0, Xt1 = k1, . . . , Xtn−1 = kn−1) = 0, then the conditional probability on the left hand side of the equality (1.1) is not well-defined.


[Two panels, each plotting Xt against t for t = 0, 1, . . . , 7, with crosses marking the values of the path.]

Figure 1.1: Two possible sample paths of the discrete time stochastic process X defined in Example 1.1.

A very useful equivalent characterisation of the Markov property is given in the following theorem. It gives the joint distribution of the Markov process at a finite number of times, the so-called finite dimensional distributions of the process.

Theorem 1.3. The Markov property (1.1) is equivalent to

Pr(Xtn = kn, Xtn−1 = kn−1, . . . , Xt1 = k1 | Xt0 = k0) = ∏_{i=1}^{n} Pr(Xti = ki | Xti−1 = ki−1),   (1.2)

for n ≥ 1, 0 ≤ t0 < t1 < . . . < tn−1 < tn and k0, k1, . . . , kn−1, kn ∈ S. Note that the equality (1.2) can be equivalently expressed as

Pr(∩_{i=1}^{n} {Xti = ki} | Xt0 = k0) = ∏_{i=1}^{n} Pr(Xti = ki | Xti−1 = ki−1).

Proof. See Exercise 1.4.

Intuitively the Markov property says that the future behaviour of the Markov process is determined by where the process currently is, but not by where it was in the past (i.e. not by the path by which it got to the current state). For this reason the study of Markov processes is much easier than that of general random processes. This intuitive meaning of the Markov property is often phrased as: 'Given the present (state), the future and the past (of the process) are independent.' It is further justified in Exercise 1.3.

Markov processes with a discrete state space are called Markov chains. Whereas in the rest of the chapter we deal with discrete time Markov chains, in later chapters we will study continuous time Markov chains, also called Markov jump processes.

1.2 Discrete time Markov chains

In this chapter we acquaint ourselves with the properties of discrete time Markov chains. Since for such processes T contains at most a countable number of time points, these points


can be put in one-to-one correspondence with either the non-negative integers {0, 1, 2, . . .} or a finite subset {0, 1, 2, . . . , N} of the non-negative integers, where N is some (finite) integer. Since the latter does not bring any additional difficulties in comparison to the former, we will just focus on the case T = {0, 1, 2, . . .} for the rest of this chapter (and the next). We therefore denote a discrete time Markov chain as {Xn : n ≥ 0}, with the time index being n instead of t since n is more naturally associated with integers. We first record the following result, which says that in order to verify the Markov property we only have to check that (1.1) holds in the case where ti = i, i = 0, 1, . . . , n.

Proposition 1.4. Let X = {Xn : n ≥ 0} be a discrete time stochastic process with a discrete state space S. Then X is a Markov chain (or possesses the Markov property) if for all n ≥ 1 and all possible values k0, k1, . . . , kn−1, kn in S, we have²

Pr(Xn = kn | X0 = k0, X1 = k1, . . . , Xn−1 = kn−1) = Pr(Xn = kn | Xn−1 = kn−1).   (1.3)

(Here it is implicitly understood that only the cases need to be considered for which Pr(X0 = k0, X1 = k1, . . . , Xn−1 = kn−1) > 0.)

Proof. Step 1. We first prove that (1.3) implies that

Pr(Xn = kn | Xt0 = k0, Xt1 = k1, . . . , Xtm = km, Xn−1 = kn−1) = Pr(Xn = kn | Xn−1 = kn−1)   (1.4)

for all t0 < t1 < . . . < tm < n − 1 and m ≤ n − 2. Denote T = {t0, . . . , tm}, I = {0, 1, . . . , n − 2} \ T and {XT = kT} = ∩_{i∈T} {Xi = ki}, {XI = kI} = ∩_{i∈I} {Xi = ki}. Then

Pr(Xn = kn | XT = kT, Xn−1 = kn−1)
  = ∑_{ki∈S, i∈I} Pr(Xn = kn, XT = kT, Xn−1 = kn−1, XI = kI) / Pr(XT = kT, Xn−1 = kn−1)
  = ∑_{ki∈S, i∈I} Pr(Xn = kn | XT = kT, Xn−1 = kn−1, XI = kI) Pr(XT = kT, Xn−1 = kn−1, XI = kI) / Pr(XT = kT, Xn−1 = kn−1)
  = ∑_{ki∈S, i∈I} Pr(Xn = kn | Xn−1 = kn−1) Pr(XT = kT, Xn−1 = kn−1, XI = kI) / Pr(XT = kT, Xn−1 = kn−1)
  = Pr(Xn = kn | Xn−1 = kn−1),

where we used (1.3) in the third equality, the definition of conditional probability in the first two equalities, and the fact that ∪_{ki∈S, i∈I} {XI = kI} = Ω together with the finite additivity property of probability measures³ in the first and last equalities. This shows (1.4).

Step 2. We show that (1.4) implies that

Pr(Xm = k | XT = kT, Xn−1 = kn−1) = Pr(Xm = k | Xn−1 = kn−1)   (1.5)

²Note that, in case T is finite, X = {Xn : 0 ≤ n ≤ N} is a Markov chain if (1.3) holds for all 1 ≤ n ≤ N.
³Recall that by the finite additivity property for probability measures we mean that Pr(A ∪ B) = Pr(A) + Pr(B) for two events A and B satisfying A ∩ B = ∅. Note that this implies that Pr(∪_{j=1}^{n} Aj) = ∑_{j=1}^{n} Pr(Aj) for any n ≥ 2 and events A1, A2, . . . , An satisfying Ap ∩ Aq = ∅ for all p ≠ q.


for all k ∈ S and m ≥ n + 1. We have

Pr(Xn+1 = k | XT = kT, Xn−1 = kn−1)
  = ∑_{kn∈S} Pr(Xn+1 = k, Xn = kn | XT = kT, Xn−1 = kn−1)
  = ∑_{kn∈S} Pr(Xn+1 = k | XT = kT, Xn−1 = kn−1, Xn = kn) Pr(Xn = kn | XT = kT, Xn−1 = kn−1)
  = ∑_{kn∈S} Pr(Xn+1 = k | Xn = kn) Pr(Xn = kn | Xn−1 = kn−1)
  = ∑_{kn∈S} Pr(Xn+1 = k | Xn−1 = kn−1, Xn = kn) Pr(Xn = kn | Xn−1 = kn−1)
  = ∑_{kn∈S} Pr(Xn+1 = k, Xn−1 = kn−1, Xn = kn) / Pr(Xn−1 = kn−1)
  = Pr(Xn+1 = k | Xn−1 = kn−1),

where we used (1.4) twice in the third equality and once in the fourth, the definition of conditional probability in the second and last two equalities, and the fact that ∪_{kn∈S} {Xn = kn} = Ω together with the finite additivity property of probability measures in the first and last equalities. Hence we have shown (1.5) for m = n + 1. By repeating the argument, we can prove (1.5) for m = n + 2, n + 3, etc.⁴

As (1.5) is equivalent to (1.1), the proposition is proved.

For a Markov chain {Xn : n ≥ 0}, we write for m ≤ n

pik(m, n) = Pr(Xn = k | Xm = i)

and refer to (pik(m, n); i ∈ S, k ∈ S) as the transition probabilities from time m to time n. Since the state space S of a Markov chain is by definition discrete, i.e. it contains a countable number of states, the states of a MC can be put in one-to-one correspondence with the positive integers or a subset of the positive integers, and each state in S can be identified by an integer. Let d denote the total number of states in S. Without loss of generality we assume that the state space is of the form S = {1, 2, . . . , d} if d < ∞ or S = {1, 2, 3, . . .} if d = ∞. Then we can construct a d × d matrix P(m, n) whose (i, k)th entry is the transition probability pik(m, n). We now list a few properties of the matrices P(m, n) for m ≤ n. First note that P(n, n) = I, where I is the identity matrix, i.e. the matrix with ones on the diagonal and off-diagonal elements equal to zero. Second, the transition probabilities of a Markov chain must satisfy

∑_{k∈S} pik(m, n) = 1

since at time n the process must be in one of the states of S. Hence the elements of each row of P(m, n) add up to one or, in other words, the row sums of P(m, n)

⁴Alternatively, an induction argument can be used here.


equal one. We call a matrix a stochastic matrix if all the entries of the matrix are non-negative and its row sums equal one. (Note that these two properties imply that all the entries of a stochastic matrix are less than or equal to one.) So we have that the matrices P(m, n) are stochastic matrices. Another property that the matrices P(m, n) satisfy is the following.

Proposition 1.5. The transition probabilities of a discrete time Markov chain satisfy the Chapman-Kolmogorov (C-K) equations

pik(m, n) = ∑_{j∈S} pij(m, l) pjk(l, n)   (1.6)

for all i, k ∈ S and all times m ≤ l ≤ n. In matrix notation this equality reads as

P(m, n) = P(m, l) P(l, n).

Proof.

pik(m, n) = Pr(Xn = k | Xm = i)
  = ∑_{j∈S} Pr(Xn = k, Xl = j | Xm = i)
  = ∑_{j∈S} Pr(Xn = k | Xl = j, Xm = i) Pr(Xl = j | Xm = i)
  = ∑_{j∈S} Pr(Xn = k | Xl = j) Pr(Xl = j | Xm = i)
  = ∑_{j∈S} pij(m, l) pjk(l, n),

where we used the Markov property in the fourth equality. This proves the result.
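As a quick numerical illustration (not part of the original notes), the sketch below checks the Chapman-Kolmogorov identity P(0, 2) = P(0, 1) P(1, 2) against simulated two-step paths of a small chain; the two one-step matrices are made-up illustrative numbers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical one-step transition matrices on S = {0, 1, 2}
# (illustrative numbers, not taken from the notes).
P01 = np.array([[0.5, 0.3, 0.2],
                [0.1, 0.6, 0.3],
                [0.2, 0.2, 0.6]])
P12 = np.array([[0.7, 0.2, 0.1],
                [0.3, 0.4, 0.3],
                [0.1, 0.1, 0.8]])

# Chapman-Kolmogorov: P(0,2) = P(0,1) P(1,2).
P02 = P01 @ P12

# Monte Carlo check of the first row, i.e. of p_0k(0,2): start in
# state 0, take one step with P(0,1) and a second step with P(1,2).
n_paths = 50_000
x1 = rng.choice(3, size=n_paths, p=P01[0])
x2 = np.array([rng.choice(3, p=P12[j]) for j in x1])
freq = np.bincount(x2, minlength=3) / n_paths

print(P02[0])  # exact distribution of X_2 given X_0 = 0: (0.46, 0.24, 0.30)
print(freq)    # simulated frequencies, close to the exact row
```

Since each P(m, n) here is built as a product of stochastic matrices, its rows automatically sum to one, matching the row-sum property noted above.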

A direct consequence of the Chapman-Kolmogorov equations is that the matrix P(m, n) for any 0 ≤ m ≤ n can be expressed in terms of the so-called one-step transition matrices P(0, 1), P(1, 2), P(2, 3), . . .. Namely, we have

P(m, n) = P(m, m + 1) P(m + 1, m + 2) · · · P(n − 2, n − 1) P(n − 1, n).   (1.7)

So the one-step transition matrices determine completely how the Markov chain moves/transitions from one state at a certain time to another state at a later time (or, to be more precise, they determine completely the probabilities of those transitions). This means that if we know where the Markov chain starts (or, to be more precise, if we know the distribution of the Markov chain at time 0), then we can completely determine the distribution of the Markov chain at any later time. Indeed, in the following simple proposition we see how to compute the probability Pr(Xn = k) in the case where we are given the initial distribution of the Markov chain.


Proposition 1.6. Let vk(n) = Pr(Xn = k) be the probability that the process is in state k at time n. Then by the law of total probability⁵ we have, for 0 ≤ m ≤ n,

vk(n) = Pr(Xn = k)
  = ∑_{i∈S} Pr(Xn = k | Xm = i) Pr(Xm = i)
  = ∑_{i∈S} vi(m) pik(m, n),   k ∈ S,   (1.8)

or in matrix notation

~v(n) = ~v(m) P(m, n),   (1.9)

where ~v(l) = (v1(l), v2(l), . . .) denotes the distribution of the Markov chain at time l ≥ 0. In particular, setting m = 0, we get

vk(n) = ∑_{i∈S} vi(0) pik(0, n),

or in matrix notation

~v(n) = ~v(0) P(0, n).

Note that in many applications the starting point of the Markov chain will not be truly random but instead we are just given that it starts in a particular state, say state i. Then the initial distribution is simply given by vi(0) = 1 and vj(0) = 0 for j ≠ i, and so

Pr(Xn = k) = vk(n) = pik(0, n) = Pr(Xn = k | X0 = i).

1.3 How to construct a discrete time Markov chain?

In this section we will see how to construct/specify a Markov chain on the state space S. Although what we do in this section seems obvious, we have included it for comparison later with the case of continuous time Markov chains, for which it is less clear how they can be constructed. We start with the following definition.

Definition 1.7. Let (P(m, n), m ≥ 0, n ≥ m + 1) be a set of d-dimensional stochastic matrices satisfying the Chapman-Kolmogorov equations

P(m, n) = P(m, l) P(l, n)

for all m < l < n. Then we say that (P(m, n), m ≥ 0, n ≥ m + 1) is a (discrete) transition matrix function.

⁵Recall that the law of total probability states that for an event B and for events A1, A2, . . . , An forming a partition of the sample space Ω (i.e. Ai ∩ Aj = ∅ for i ≠ j and ∪_{i=1}^{n} Ai = Ω), we have Pr(B) = ∑_{i=1}^{n} Pr(B | Ai) Pr(Ai).


The next proposition says how one can construct transition matrix functions.

Proposition 1.8. For each n ≥ 0, let P(n, n + 1) be a stochastic matrix. For m ≥ 0 and n ≥ m + 2, form the matrix P(m, n) by

P(m, n) := P(m, m + 1) P(m + 1, m + 2) · · · P(n − 2, n − 1) P(n − 1, n).   (1.10)

Then (P(m, n), m ≥ 0, n ≥ m + 1) is a transition matrix function.

Proof. Since the product of two stochastic matrices is a stochastic matrix, it follows that P(m, n) is a stochastic matrix. Further, as

P(m, n) = P(m, m + 1) · · · P(l − 1, l) P(l, l + 1) · · · P(n − 1, n) = P(m, l) P(l, n)

for any m < l < n, it follows that the Chapman-Kolmogorov equations are satisfied as well.

So in order to construct a transition matrix function, one only needs to specify the one-step transition matrices P(n, n + 1). The other r-step transition matrices P(n, n + r), for r ≥ 2, are then given via (1.10).

It is easy to construct a Markov chain whose transition probabilities are equal to a given transition matrix function (P(m, n), m ≥ 0, n ≥ m + 1). Indeed, given the collection of one-step transition matrices P(n, n + 1) with entries (pik(n, n + 1))_{i,k=1}^{d} and an initial state i0, construct recursively for n = 0, 1, 2, . . . the discrete time stochastic process X = {Xn : n ≥ 0}, starting in state i0, as follows.

• Given Xn = i, let Xn+1 = k with probability pik(n, n + 1), independently from everything else that happened before time n.

By construction this stochastic process X satisfies (1.3) and thus is indeed a Markov chain. Moreover, by the reasoning leading up to (1.7), it then follows that Pr(Xn = k | Xm = i) is equal to the (i, k)th entry of the matrix P(m, n).
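This recursive construction translates directly into a short simulation routine. Below is a minimal sketch (the function name and the example matrix are hypothetical, not from the notes); the one-step matrices passed in may differ from step to step.

```python
import numpy as np

def simulate_chain(one_step_matrices, i0, rng):
    """Simulate X_0, X_1, ..., X_N by the construction above: given
    X_n = i, draw X_{n+1} = k with probability p_ik(n, n+1) from the
    n-th one-step matrix, independently of the past."""
    path = [i0]
    for P in one_step_matrices:
        path.append(int(rng.choice(len(P), p=P[path[-1]])))
    return path

# Hypothetical two-state chain using the same one-step matrix at every
# step (so in fact a time homogeneous chain); numbers are illustrative.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
rng = np.random.default_rng(1)
path = simulate_chain([P] * 10, i0=0, rng=rng)
print(path)  # one sample path of length 11, starting in state 0
```

Passing a different stochastic matrix for each step would give a time-inhomogeneous chain, which is exactly the generality allowed by (1.10).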

1.4 Time homogeneous Markov chains

A Markov chain is said to be time homogeneous if the transition probabilities from time m to time n depend only on the difference (n − m) of the times and not on the location of the times. Thus we have the following definition.

Definition 1.9. A Markov chain is said to be time homogeneous if the transition probabilities

Pr(Xn+r = k | Xn = i)

depend only on r but not on n, for all r ≥ 0, n ≥ 0 and all i, k ∈ S. In other words,

Pr(Xn+r = k | Xn = i) = Pr(Xm+r = k | Xm = i) for all m, n ≥ 0, r ≥ 0 and i, k ∈ S.


From the Chapman-Kolmogorov equations it easily follows that a Markov chain is time homogeneous if and only if the last equation holds for r = 1, i.e. for all i, k ∈ S and n ≥ 1,

Pr(Xn+1 = k|Xn = i) = Pr(X1 = k|X0 = i).

Example 1.10. Consider a discrete-time alive-dead model where X = {Xn : n ≥ 0} is a Markov chain consisting of two states 1 and 2. The random variable Xn represents the status of the individual at age n, whereby state 1 represents that the individual is alive and state 2 represents that the individual is dead. We assume that p22(m, n) = 1 for all 0 ≤ m ≤ n, which means that if the individual is dead at age m, he/she remains dead at age n. In this model p11(n, n + 1) represents the probability that the individual, if assumed alive at age n, is still alive one year later. In general this probability can be different for different ages n, but in the time homogeneous case these probabilities do not depend on n, i.e. the probability of the individual living one more year is the same no matter what age he/she is.

We now introduce a simplification in our notation for the case of time homogeneous Markov chains. For a time homogeneous Markov chain the r-step (with r fixed) transition matrices P(0, r), P(1, r + 1), P(2, r + 2), . . . are all equal to each other. Therefore we can use the notation pik(r) := pik(n, n + r) and P(r) := P(n, n + r) for r ≥ 0. We refer to P(r) as the r-step transition matrix. The most important r-step transition matrix is the one-step transition matrix, as it generates all the other ones, cf. (1.7). We therefore simply denote the 1-step transition matrix P(1) by P and refer to it as the transition matrix of the time homogeneous Markov chain. Similarly, we write pik := pik(1). With this new notation we can, in the case of time homogeneous Markov chains, write the Chapman-Kolmogorov equations (1.6) as

P(n+ r) = P(n)P(r), n, r ≥ 0,

and we can write identity (1.7) as

P(n) = Pn, (1.11)

i.e. the n-step transition matrix is equal to the nth power of the one-step transition matrix (or equivalently, the n-step transition probability pik(n) is given by the (i, k)th element of the matrix Pn).

We finish this chapter by considering a few examples of time homogeneous discrete time Markov chains. For each example we provide a so-called transition graph/diagram of the chain. This is a picture which consists of non-overlapping circles, one for each state in the state space, and has an arrow from the circle corresponding to state i to the circle corresponding to state j if (and only if) it is possible (i.e. the probability is strictly positive) to go in one step from state i to state j. Typically the corresponding probability pij is put next to the arrow as well. Note that there can be an arrow from state/circle i to state/circle i itself, namely in the case where pii > 0. A transition graph is essentially a graphical representation of the transition matrix of a time homogeneous discrete time Markov chain and is often a very useful tool for understanding the structure of such a chain.


[Transition graph on the states 0%, 20% and 40%, with arrows labelled by the probabilities 0.25 and 0.75.]

Figure 1.2: Transition graph corresponding to the transition matrix of Example 1.11(a).

Example 1.11 (No Claims Discount (NCD) model). Motor insurance company ABC offers its customers a premium discount of either 0%, 20% or 40%, depending on the driver's claim record. A claims-free year results in the discount being increased to the next higher stage, or in the retention of the maximum discount. A year with at least one claim results in the discount being reduced to the next lower stage, or in the retention of a zero discount. Each year a driver has probability 1/4 of making at least one claim and probability 3/4 of making no claims, independently of what happened in previous years.

Under this plan, if we let Xn be the level of discount a policyholder receives in year n, then {Xn : n ≥ 1} is a time homogeneous Markov chain with state space S = {1, 2, 3}, where state 1 = 0% discount, state 2 = 20% discount and state 3 = 40% discount. The transition matrix is

         1     2     3
   1 ( 0.25  0.75  0    )
P =  2 ( 0.25  0     0.75 )
   3 ( 0     0.25  0.75 ).

The transition matrix can also be represented graphically in terms of a transition graph, as seen in Figure 1.2.

Question: Given that a driver does not qualify for a discount in year n, what is the probability that he/she is holding the maximum discount in year n + 3?

Answer: The required probability is

p13(3) = (P(3))13 = (P3)13,

where the last equality follows from (1.11). Since

          1      2      3
    1 ( 7/64  21/64  36/64 )
P3 = 2 ( 7/64  12/64  45/64 )
    3 ( 4/64  15/64  45/64 ),

we have that p13(3) = 36/64 = 9/16.

Question: Suppose that currently the proportions of customers in each of the three discount


states are 60%, 30% and 10% respectively. What proportion of these customers are in the three discount states exactly three years later?

Answer: Note that the proportion of customers in a particular state k at time n can be interpreted as the probability that a single customer, selected uniformly at random, is in state k at time n, i.e. it can be expressed as Pr(Xn = k). In other words, the proportions of customers in the three states at time n can be interpreted as the distribution of the Markov chain at time n. Therefore, denoting as in Proposition 1.6 the distribution of the Markov chain at time n by ~v(n) and assuming without loss of generality that 'currently' corresponds to time 0, we are given that ~v(0) = (0.6, 0.3, 0.1) and we want to compute ~v(3). By Proposition 1.6 in combination with (1.11), we get

~v(3) = ~v(0) P3 = (0.6, 0.3, 0.1) ( 7/64  21/64  36/64 )
                                   ( 7/64  12/64  45/64 )
                                   ( 4/64  15/64  45/64 )
      = (0.105, 0.277, 0.619).

Hence 10.5% of the customers will be on 0% discount, 27.7% will be on 20% discount and 61.9% will be on 40% discount exactly three years later.
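The two computations above can be reproduced in a few lines of numpy; a sketch, not part of the original notes:

```python
import numpy as np

# Transition matrix of the NCD chain from Example 1.11.
P = np.array([[0.25, 0.75, 0.00],
              [0.25, 0.00, 0.75],
              [0.00, 0.25, 0.75]])

# Three-step transition matrix P(3) = P^3, cf. (1.11).
P3 = np.linalg.matrix_power(P, 3)
print(P3[0, 2])              # p_13(3) = 36/64 = 0.5625

# Distribution three years on from v(0) = (0.6, 0.3, 0.1),
# cf. Proposition 1.6.
v0 = np.array([0.6, 0.3, 0.1])
print(np.round(v0 @ P3, 3))  # [0.105 0.277 0.619]
```

The exact values are ~v(3) = (6.7/64, 17.7/64, 39.6/64); the figures quoted in the answer are these rounded to three decimal places.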

Note that if the probability of zero claims varies from year to year and does not remain constant at 3/4, then the process giving the NCD status of the driver is no longer a time homogeneous process.

Example 1.12 (Simple random walk on S = {. . . , −2, −1, 0, 1, 2, . . .}). Let Yn, n = 1, 2, . . ., be a sequence of mutually independent, identically distributed random variables with

Pr(Yn = 1) = p and Pr(Yn = −1) = 1− p

for all n. Define, with the convention X0 = 0, the partial sums

Xn = ∑_{j=1}^{n} Yj = Xn−1 + Yn,   n = 1, 2, . . . .

The process X = {Xn, n = 1, 2, . . .} is in fact a random process that enjoys the Markov property with state space S = {. . . , −2, −1, 0, 1, 2, . . .}, i.e. it is a Markov chain. This can be shown as follows:

Pr(Xn = k | X1 = i1, . . . , Xn−1 = i) = Pr(Xn−1 + Yn = k | X1 = i1, . . . , Xn−1 = i)
  = Pr(Yn = k − i | X1 = i1, . . . , Xn−1 = i)
  = Pr(Yn = k − i),

where the last equality follows from the fact that the Xj's for j ≤ n − 1 are functions of Y1, Y2, . . . , Yn−1 and hence independent of Yn. By a similar computation

Pr(Xn = k | Xn−1 = i) = Pr(Yn = k − i),


[Transition graph on the states . . . , −2, −1, 0, 1, 2, . . . : each state i has an arrow labelled p to state i + 1 and an arrow labelled 1 − p to state i − 1.]

Figure 1.3: Transition graph corresponding to the simple random walk in Example 1.12.

which proves that the process {Xn, n = 1, 2, . . .} satisfies the Markov property (cf. Proposition 1.4). The transition matrix of the simple random walk is

      ( . . .  . . .                       )
      ( 1−p     0     p                    )
P =   (        1−p    0     p              )
      (              1−p    0     p        )
      (                    . . .  . . .    )

that is, the doubly infinite tridiagonal matrix with pi,i+1 = p, pi,i−1 = 1 − p and all other entries zero, and its transition graph is given in Figure 1.3.

To find the n-step transition probability pik(n) of the simple random walk, one could compute the matrix Pn, but this is a complicated procedure as P is a matrix of infinite dimension. Instead we can compute pik(n) in the following way. First, we note that for the simple random walk to go from state i to state k in n steps it must take r upward steps and s downward steps such that i + r − s = k and r + s = n. Hence r = ½(n + k − i) and s = ½(n − k + i). Now the number of upward steps in n steps is a binomial random variable with parameters (n, p)⁶; we therefore have that

pik(n) = ( n choose ½(n + k − i) ) p^{½(n+k−i)} (1 − p)^{½(n−k+i)},

provided (n + k − i) is even and |k − i| ≤ n; otherwise pik(n) = 0.
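The binomial formula can be sanity-checked against simulation. Below is a sketch (the function name is hypothetical); it also guards the requirement that the number of upward steps r lies between 0 and n, i.e. |k − i| ≤ n.

```python
import numpy as np
from math import comb

def p_rw(i, k, n, p):
    """n-step transition probability p_ik(n) of the simple random walk,
    via the binomial formula above; zero unless n + k - i is even and
    the walk can actually reach k from i in n steps (|k - i| <= n)."""
    if (n + k - i) % 2 != 0 or abs(k - i) > n:
        return 0.0
    r = (n + k - i) // 2  # number of upward steps
    return comb(n, r) * p**r * (1 - p)**(n - r)

# Compare with a Monte Carlo estimate of Pr(X_n - X_0 = k - i).
rng = np.random.default_rng(2)
p, n, i, k = 0.3, 6, 0, 2
steps = rng.choice([1, -1], size=(100_000, n), p=[p, 1 - p])
freq = np.mean(steps.sum(axis=1) == k - i)
print(p_rw(i, k, n, p))  # exact: 15 * 0.3^4 * 0.7^2
print(freq)              # simulated frequency, close to the exact value
```

The simulation uses the fact that the increments Yn are i.i.d., so Xn − X0 has the same distribution from any starting state, consistent with the walk being time homogeneous.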

Example 1.13 (Bounded simple random walk on S = {0, 1, 2, . . . , b}). This is similar to the last example, except that boundaries are placed at 0 and b > 0, so that the Markov chain cannot take values below zero or above b. The behaviour of the MC at a boundary can be reflecting, absorbing or mixed, depending on the behaviour of the process it is modelling. The following condition defines a reflecting boundary at 0, which reflects the MC to state 1:

Pr(Xn+1 = 1 | Xn = 0) = 1.

The following condition defines an absorbing boundary at 0:

Pr(Xn+1 = 0 | Xn = 0) = 1.

⁶Recall that the sum of n i.i.d. Bernoulli random variables with parameter p is a binomial random variable with parameters (n, p).


Figure 1.4: Transition graph corresponding to the bounded simple random walk in Example 1.13 (states 0, 1, . . . , b − 1, b; interior states step up with probability p and down with probability 1 − p; the boundary behaviour is governed by α at 0 and β at b).

The following condition defines a mixed boundary at 0 which absorbs with probability α > 0 and reflects with probability 1 − α > 0.

Pr(Xn+1 = 0|Xn = 0) = α, Pr(Xn+1 = 1|Xn = 0) = 1− α.

The transition matrix of a bounded simple random walk with mixed boundaries at 0 and b is given below and the transition graph is given in Figure 1.4:

P =
        0       1                                  b
0   [   α     1 − α                                  ]
1   [ 1 − p     0       p                            ]
    [         1 − p     0       p                    ]
    [                 . . .   . . .   . . .          ]
    [                         1 − p     0       p    ]
b   [                                 1 − β     β    ].
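As an illustration, the transition matrix above can be assembled programmatically. The following sketch is not part of the notes; the function name and the convention of representing P as a list of rows are our own choices:

```python
def bounded_rw_matrix(b, p, alpha, beta):
    """Transition matrix (as a list of rows) of the bounded simple random
    walk on {0, 1, ..., b} with the mixed boundaries described above:
    at 0 absorb w.p. alpha, reflect to 1 w.p. 1 - alpha;
    at b absorb w.p. beta, reflect to b - 1 w.p. 1 - beta."""
    P = [[0.0] * (b + 1) for _ in range(b + 1)]
    P[0][0], P[0][1] = alpha, 1 - alpha          # mixed boundary at 0
    P[b][b], P[b][b - 1] = beta, 1 - beta        # mixed boundary at b
    for i in range(1, b):                        # interior states
        P[i][i - 1], P[i][i + 1] = 1 - p, p
    return P
```

Every row of the resulting matrix sums to one, as it must for a stochastic matrix.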


Chapter 2

More on time homogeneous, discrete time Markov chains

2.1 Communicating classes

In this chapter we will only concern ourselves with time homogeneous (discrete time) Markov chains (MCs). In the next section we will look at the long term behaviour of a (time homogeneous) MC. For this, we first need to understand the structure of the state space S of the MC, in particular how states of an MC are related to each other. We start with the following definitions.

Definition 2.1. We say that for an MC, state i leads to a state k, and write i → k, if pik(n) > 0 for some n ≥ 0. Note hereby that pik(0) = Pr(X0 = k|X0 = i) = 1 if i = k and pik(0) = 0 if i ≠ k. We say that i and k communicate, and write i ↔ k, if i → k and k → i.

Clearly, we have (i) i ↔ i and (ii) if i ↔ k, then k ↔ i. Further, the transitivity property (iii) if i ↔ j and j ↔ k, then i ↔ k holds as well (this is an exercise). Hence the relation ↔ forms an equivalence relation and therefore we can partition the state space S of the MC into communicating classes, i.e. S = C1 ∪ C2 ∪ . . ., where each Ci consists of states that are communicating with each other. Finding the communicating classes of an MC is greatly simplified if one can draw the transition graph of the corresponding transition matrix P.

Example 2.2. Consider the MC with transition matrix

P =
[ 1/2  1/2   0    0    0    0 ]
[  0    0    1    0    0    0 ]
[ 1/3   0    0   1/3  1/3   0 ]
[  0    0    0   1/2  1/2   0 ]
[  0    0    0    0    0    1 ]
[  0    0    0    0    1    0 ].

The corresponding transition graph is given in Figure 2.1. From this graph we see that the communicating classes are {1, 2, 3}, {4} and {5, 6}.
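Finding communicating classes can also be automated once the transition matrix is known. The sketch below is illustrative and not part of the notes (states are indexed from 0, so state 0 here corresponds to state 1 of the example); it computes the transitive closure of the "leads to" relation and groups mutually reachable states:

```python
def communicating_classes(P):
    """Partition the states of a finite MC into communicating classes,
    given its transition matrix P (list of rows)."""
    d = len(P)
    # reach[i][j]: does i lead to j (in zero or more steps)?
    reach = [[i == j or P[i][j] > 0 for j in range(d)] for i in range(d)]
    for m in range(d):                      # Warshall transitive closure
        for i in range(d):
            for j in range(d):
                reach[i][j] = reach[i][j] or (reach[i][m] and reach[m][j])
    classes, seen = [], set()
    for i in range(d):
        if i in seen:
            continue
        cls = {j for j in range(d) if reach[i][j] and reach[j][i]}
        seen |= cls
        classes.append(sorted(cls))
    return classes
```

Applied to the matrix of Example 2.2 this returns the classes {0, 1, 2}, {3} and {4, 5}, i.e. {1, 2, 3}, {4} and {5, 6} in the example's numbering.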


Figure 2.1: Transition graph corresponding to Example 2.2.

Definition 2.3. We say that a communicating class C is closed if i ∈ C and i → j implies j ∈ C, i.e. a closed class is one from which there is no escape. A state i is called absorbing if {i} is a closed class. An MC for which the state space S consists of a single class is called irreducible.

Note that in Example 2.2 only the communicating class {5, 6} is closed.

Definition 2.4. A state i of an MC is called aperiodic if pii(n) > 0 for all sufficiently large n, i.e. there exists N ≥ 1 such that pii(n) > 0 for all n ≥ N. We say that the MC is aperiodic if all states of the MC are aperiodic.

So considering the infinite sequence pii(1), pii(2), pii(3), . . . we say that state i is aperiodic if (and only if) there are only finitely many elements in this sequence which have the value 0. In most cases it is far too difficult to compute pii(n) for all n and so we would like to have easier ways of determining whether a state is aperiodic or not. A useful tool in this direction is the following proposition, which says that for an irreducible MC, either all states are aperiodic or no states are aperiodic.

Proposition 2.5. Suppose the MC is irreducible and has an aperiodic state i. Then all states are aperiodic.

Proof. Suppose the MC is irreducible and the state i is aperiodic. Let k be a state of the MC. By irreducibility, i ↔ k and so there exist m, n ≥ 0 such that pki(n), pik(m) > 0. Using


the Chapman-Kolmogorov equations twice, we have that for all r ≥ n + m,

pkk(r) = ∑_{j∈S} pkj(n) pjk(r − n)
       ≥ pki(n) pik(r − n)
       = pki(n) ∑_{j∈S} pij(r − n − m) pjk(m)
       ≥ pki(n) pii(r − n − m) pik(m).

Since pii(r − n − m) > 0 for all sufficiently large r (as i was assumed to be aperiodic), it follows that pkk(r) > 0 for all r sufficiently large. Hence k is aperiodic.

Note that when an MC is irreducible, a sufficient condition for the state i (and hence all states) to be aperiodic is pii(1) > 0 (since one can show pii(n) ≥ (pii(1))^n), but that this is not a necessary condition (see for instance Example 2.10 below where p22 = 0 but state 2 is still aperiodic).

2.2 Stationary and limiting distributions

Definition 2.6. A vector ~π = (πi)_{i∈S} is called a stationary or invariant (probability) distribution of a (discrete time, time homogeneous) MC with state space S and with transition matrix P if

(i) πi ≥ 0 for all i ∈ S and ∑_{i∈S} πi = 1,

(ii) ~π = ~πP (i.e. ~π is a left-eigenvector of the matrix P with eigenvalue 1).

It is easy to see from Proposition 1.6 that if the distribution of the process at time 0 is equal to a stationary distribution ~π, then the distribution of the process at time n, denoted by ~v(n), is equal to ~π as well. Indeed, we have

~v(n) = ~πP^n = (~πP)P^{n−1} = ~πP^{n−1} = . . . = ~πP = ~π.

Hence the name stationary (or invariant) distribution. So in order to find the stationary distributions of an MC one needs to find all the row vectors ~π that satisfy πi ≥ 0, ∑_{i∈S} πi = 1 and

πi = ∑_{j∈S} πj pji    (2.1)

for all i ∈ S. In particular, if there are a finite number of states, say d, then a stationary distribution satisfies a system of d + 1 linear equations. However, it turns out that in that case one can ignore one of the equations in (2.1), as it will be automatically satisfied if all the d remaining equations are satisfied.


Proposition 2.7. Let P be a d-by-d stochastic matrix with d < ∞ and fix k ∈ {1, . . . , d}. Suppose ~π = (π1, . . . , πd) is a row vector satisfying ∑_{i=1}^d πi = 1 and (2.1) for all i ≠ k. Then (2.1) is also satisfied for i = k and so ~π is a stationary distribution.

Proof. We just need to show that πk = ∑_{j=1}^d πj pjk, while we are given that (i) ∑_{i=1}^d πi = 1 and (ii) πi = ∑_{j=1}^d πj pji for i ≠ k. We have by subtracting the equations in (ii) from the equation (i),

∑_{i=1}^d πi − ∑_{i=1, i≠k}^d πi = 1 − ∑_{i=1, i≠k}^d ∑_{j=1}^d πj pji.

Then by switching the two sums (which we can safely do since d < ∞), the above equation is equal to

πk = 1 − ∑_{j=1}^d πj ∑_{i=1, i≠k}^d pji = 1 − ∑_{j=1}^d πj (1 − pjk) = ∑_{j=1}^d πj pjk.
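As a numerical illustration of this recipe (drop one balance equation of (2.1) and replace it by the normalisation condition, as Proposition 2.7 justifies), here is a small sketch; numpy is assumed to be available and the function name is our own:

```python
import numpy as np

def stationary_distribution(P):
    """Solve ~pi = ~pi P together with sum(pi_i) = 1.
    One balance equation is dropped and replaced by the normalisation,
    which Proposition 2.7 shows is harmless."""
    d = P.shape[0]
    A = P.T - np.eye(d)   # row i encodes equation (2.1) for state i
    A[-1, :] = 1.0        # replace the last equation by sum(pi) = 1
    b = np.zeros(d)
    b[-1] = 1.0
    return np.linalg.solve(A, b)
```

For the two-state chain of Example 2.10 below, this returns (2/3, 1/3).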

Stationary distributions are important because under certain conditions the distribution of the MC at time n converges, as n goes to infinity, to the stationary distribution. Therefore under these conditions, the stationary distribution describes how the MC behaves in the long run. To be more precise, we have the following definition and theorem.

Definition 2.8. Let ~π = (πi)_{i∈S} be a probability distribution, i.e. πi ≥ 0 for all i ∈ S and ∑_{i∈S} πi = 1. We call ~π the limiting distribution of the (time homogeneous) Markov chain with transition matrix P if

lim_{n→∞} pik(n) = πk, for all i, k ∈ S.

If the MC X = {Xn : n ≥ 0} has a limiting distribution ~π, then the distribution of Xn converges as n → ∞ to this limiting distribution, no matter what the initial distribution of the MC is. Indeed, suppose the initial distribution of the MC is given by ~v(0) = (vi(0))_{i∈S}, i.e. vi(0) = Pr(X0 = i). Then with ~π the limiting distribution, we have

lim_{n→∞} Pr(Xn = k) = lim_{n→∞} ∑_{i∈S} vi(0) pik(n) = ∑_{i∈S} vi(0) lim_{n→∞} pik(n) = ∑_{i∈S} vi(0) πk = πk,

where the first equality follows by Proposition 1.6, the second by Theorem A.1(ii) in the Appendix¹ and the third is due to Definition 2.8. So the limiting distribution describes how the MC behaves in the long run. However, an MC does not always have a limiting distribution. The theorem below gives sufficient conditions for the existence of a limiting distribution and says that under these conditions the limiting distribution is equal to the unique stationary distribution, which provides a way of computing the limiting distribution. Unfortunately, the proof of this theorem is too elaborate to include here.

¹Switching a limit and an infinite sum (which is essentially also a limit) cannot always be done and so in the case where S consists of a countably infinite number of states we have to justify that one can switch the limit and the sum, which we do via Theorem A.1(ii) in the Appendix.


Theorem 2.9. Let P be the transition matrix of an irreducible, aperiodic MC with finitely many states. Then P has a unique stationary distribution ~π. Moreover, this unique stationary distribution ~π is also the limiting distribution of the MC.

Example 2.10. Consider the two-state MC with transition matrix

P =
[ 1/2  1/2 ]
[  1    0  ].

This MC is irreducible and aperiodic, as state 1 is clearly aperiodic because p11 > 0 and so by Proposition 2.5 state 2 must be aperiodic as well. So by Theorem 2.9 there exists a unique stationary distribution ~π which must also be the limiting distribution of the MC. Let's compute this distribution now. The system of equations ~πP = ~π can be written as

(1/2)π1 + π2 = π1
(1/2)π1 = π2.

By Proposition 2.7 we can ignore one of these equations, so let's ignore the first one. Then using the second equation and the fact that π1 + π2 = 1, we must have π1 = 2/3 and π2 = 1/3. So ~π = (2/3, 1/3).
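One can also check the limiting behaviour numerically by raising P to a high power; the sketch below (numpy assumed, illustrative only) does just that:

```python
import numpy as np

P = np.array([[0.5, 0.5],
              [1.0, 0.0]])

# P^n for large n: every row should approach the stationary
# distribution (2/3, 1/3), whatever the starting state.
Pn = np.linalg.matrix_power(P, 50)
```

Convergence here is fast because the second eigenvalue of P is −1/2, so the distance to the limit shrinks like (1/2)^n.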

Example 2.11. Consider the two-state MC with transition matrix

P =
[ 0  1 ]
[ 1  0 ].

One easily sees that the MC is irreducible and that the unique stationary distribution is given by ~π = (1/2, 1/2). However, as P^2 equals the identity matrix I, it follows that for all n ≥ 1, P^{2n} = I and P^{2n+1} = P. Consequently, P^n does not converge as n goes to infinity. Note that the MC here is not aperiodic since p11(n) = 0 for odd n and so one cannot apply the above theorem.


Chapter 3

Analytical aspects of Markov jump processes

3.1 Introduction

Continuous time Markov chains or Markov jump processes (MJPs) are continuous time random processes X = {Xt : t ≥ 0} with discrete state space that enjoy the Markov property (1.1), i.e.

Pr(Xtn = kn | Xt0 = k0, Xt1 = k1, . . . , Xtn−1 = kn−1) = Pr(Xtn = kn | Xtn−1 = kn−1),

for any finite sequence tn > tn−1 > . . . > t1 > t0 ≥ 0 and all possible values k0, k1, . . . , kn−1, kn in the state space S.

We will assume that an MJP has sample paths that are right-continuous, i.e. the map t ↦ Xt is right-continuous for every realisation of the stochastic process. This is a common assumption made in the theory of continuous time stochastic processes. In Figure 3.1 a typical sample path of a Markov jump process is displayed. As the sample paths are assumed to be right-continuous and the state space is discrete, it follows that X has piecewise constant sample paths. In Figure 3.1, the random variables J(n), n = 1, 2, . . . denote the consecutive jump times of X and the random variables H(n), n = 1, 2, . . . denote the consecutive holding times of X, i.e. the times between two consecutive jumps of X. Further, Rs, for a given s ≥ 0, denotes the residual holding time after s, i.e. the time that is spent between s and the next jump time. Note that X_{s+Rs} is the state the MJP jumps to at the next jump time after s. Finally, the random variable Cs denotes the current holding time at s, i.e. the time that is spent between s and the last jump time before time s. Later, we will derive several results about the (joint) distribution of these random variables.

We will now introduce some notation and concepts which are very similar to the discrete time case that is studied in Chapter 1. For an MJP, the transition probability pik(s, t)¹ is

¹Actuaries and actuarial literature tend to use the notation _{t−s}p^{ik}_s instead of pik(s, t) for the transition probabilities.


Figure 3.1: A typical sample path of a continuous time Markov chain, showing the jump times J(1), J(2), . . ., the holding times H(1), H(2), . . ., and, for a time s, the residual holding time Rs, the current holding time Cs and the post-jump state X_{s+Rs}.


defined to be

pik(s, t) = Pr(Xt = k | Xs = i), s ≤ t and i, k ∈ S.

We assume throughout this chapter (unless explicitly mentioned otherwise) that the state space S is of the form S = {1, 2, . . . , d}, where d < ∞; so we only consider MJPs with a finite state space. We can then construct the d × d transition matrices P(s, t) whose entries are the transition probabilities pik(s, t), i.e. pik(s, t) is the (i, k)th element of P(s, t). By the same reasoning as for discrete time MCs, we then have:

(i) P(t, t) = I with I the identity matrix, i.e. pii(t, t) = 1 and pik(t, t) = 0 for i ≠ k.

(ii) P(s, t) is a stochastic matrix for any 0 ≤ s ≤ t, i.e. it has positive entries and row sums equal to one.

(iii) The transition probabilities of an MJP satisfy the Chapman-Kolmogorov equations:

P(s, t) = P(s, u)P(u, t),

where 0 ≤ s ≤ u ≤ t. Note that the above is equivalent to

pik(s, t) = ∑_{j∈S} pij(s, u) pjk(u, t), for all i, k ∈ S,    (3.1)

where 0 ≤ s ≤ u ≤ t.

(iv) Let’s denote vk(t) = Pr(Xt = k) so that ~v(t) = (v1(t), . . . , vd(t)) is the distribution of Xt. Then similar to Proposition 1.6 in Chapter 1, we have using the law of total probability,

~v(t) = ~v(s)P(s, t), 0 ≤ s ≤ t.

The MJP is called time homogeneous if the transition probabilities pik(s, t) depend on the difference t − s but not on the individual values of t and s, i.e. if

pik(s, t) = pik(0, t − s) for all i, k ∈ S, 0 ≤ s ≤ t.

For a time homogeneous chain we can and will work with the notation pik(t) for pik(s, s + t) and P(t) for P(s, s + t). If we need to stress that an MJP is not time homogeneous we shall refer to it as time inhomogeneous.

When dealing with discrete time Markov chains we have seen that the one-step transition probabilities played a fundamental role, as everything of interest could be analysed in terms of these one-step probabilities. This is NOT the case for continuous time Markov chains, as one-step transition probabilities (no matter what length of time we define one step to cover) only describe the behaviour of the MJP at a discrete set of times, whereas we want to know what happens for all times in the continuous set [0, ∞). Instead the fundamental role in continuous time chains is played by the so-called transition rates, also referred to as transition intensities or forces of transition, denoted by µik(s). In the next section we will introduce transition rates, whereas in Chapter 4 we will see how MJPs can be constructed given a set of transition rates.


3.2 Transition rates and Kolmogorov’s forward differential equations

We begin with the following definition.

Definition 3.1. Let {P(s, t), s ≥ 0, t ≥ s} be a set of d-dimensional stochastic matrices satisfying the Chapman-Kolmogorov equations

P(s, t) = P(s, u)P(u, t)

for all s ≤ u ≤ t (note that this implies in particular P(t, t) = I). Then we say that {P(s, t), s ≥ 0, t ≥ s} is a transition matrix function. If moreover P(s, t) = P(0, t − s) for all 0 ≤ s ≤ t, then we call the transition matrix function (time) homogeneous.

By the previous section we obviously have the following proposition.

Proposition 3.2. Let X = {Xt : t ≥ 0} be an MJP and let P(s, t) be the matrix whose (i, k)th element equals pik(s, t) := Pr(Xt = k|Xs = i). Then {P(s, t), s ≥ 0, t ≥ s} is a transition matrix function. If the MJP is time homogeneous, then this transition matrix function is homogeneous.

Because of the above proposition, transition matrix functions are an important object to study when considering MJPs. In this section we will study some analytic aspects of transition matrix functions that do not involve any probability; in particular we will not assume that a transition matrix function is associated with some MJP.

An important question is how to construct transition matrix functions. This is not so straightforward: although it is easy to specify a set of stochastic matrices {P(s, t), s ≥ 0, t ≥ s}, it is less obvious how to choose them such that they satisfy the Chapman-Kolmogorov equations. It turns out that transition matrix functions can be generated by so-called Q-matrices.

Definition 3.3. We call a d-dimensional matrix Q = (µij)_{i,j=1}^d a Q-matrix if

(i) µij ≥ 0 for i ≠ j and

(ii) ∑_{j=1}^d µij = 0 for all i = 1, . . . , d.

So a Q-matrix is a matrix with positive off-diagonal elements and row sums equal to zero. This automatically implies that the diagonal elements µii must be negative (in the weak sense). Note that it is easy to specify/construct Q-matrices. The next theorem shows how to construct a transition matrix function given a set of Q-matrices {Q(t), t ≥ 0}.

Theorem 3.4. Let {Q(t) = (µij(t))_{i,j=1}^d, t ≥ 0}² be a set of Q-matrices such that for each i and j, the function t ↦ µij(t), t ≥ 0, is continuous. Then for each s ≥ 0, there exists

²The elements of Q(t) are denoted by µij(t) rather than qij(t) in order to be more consistent with standard actuarial notation.


a unique solution to the following system of ordinary differential equations, known as the Kolmogorov forward differential equations:

∂/∂t P(s, t) = P(s, t)Q(t), t > s,    (3.2)

with boundary condition P(s, s) = I. Moreover, the set of solutions {P(s, t), s ≥ 0, t ≥ s} forms a transition matrix function. If Q(t) does not depend on t, i.e. Q(s) = Q(t) for all s, t ≥ 0, then the transition matrix function {P(s, t), s ≥ 0, t ≥ s} is homogeneous.

Remarks

• We see that in the above theorem the Q-matrices Q(t) generate a transition matrix function P(s, t) and therefore these Q-matrices are also referred to as generator matrices.

• Denoting by pik(s, t) the (i, k)th entry of P(s, t), we have that the system of ordinary³ differential equations (ODEs) (3.2) is equivalently written as

∂/∂t pik(s, t) = ∑_{j=1}^d pij(s, t) µjk(t), i, k = 1, . . . , d,    (3.3)

with boundary condition pik(s, s) = 1 if k = i and pik(s, s) = 0 if k ≠ i. Letting t ↓ s in (3.3), we have that

lim_{t↓s} (pik(s, t) − pik(s, s))/(t − s) = µik(s)    (3.4)

and so µik(s) represents the instantaneous rate of change of pik(s, t) at t = s. Therefore the quantities µik(s) are referred to as the transition rates/intensities (per unit of time) of the transition matrix function. Actuaries tend to refer to the quantities µik(s) as forces of transition.⁴

• If Q(t) does not depend on t, then writing P(t) := P(0, t) and Q := Q(t), the Kolmogorov forward equations simplify to

d/dt P(t) = P(t)Q, t > 0,

with boundary condition P(0) = I.

³Note that although the Kolmogorov forward equations contain a partial derivative, the differential equation is an ordinary rather than a partial differential equation since only the derivative with respect to t is taken in the equation.

⁴Note that mathematicians and actuaries tend to interpret the word ‘rate’ differently. As an example, where a mathematician would say mortality rate, an actuary would say force of mortality. Conversely, when an actuary talks about mortality rates, he/she might well mean one-year/annual death probabilities instead of forces of mortality. In this course unit we interpret the word ‘rate’ as a (typical) mathematician would do, i.e. as a force in the actuarial sense.
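As an illustration of the simplified homogeneous forward equations above, the sketch below integrates dP/dt = P(t)Q with a basic Euler scheme and compares the result with the known closed form for a two-state generator with a single possible jump. The scheme, the rate λ = 0.7 and the step count are our own illustrative choices (numpy assumed):

```python
import numpy as np

def solve_forward(Q, t, steps=20000):
    """Euler scheme for the homogeneous forward equations
    dP/dt = P(t) Q with boundary condition P(0) = I."""
    d = Q.shape[0]
    P = np.eye(d)
    h = t / steps
    for _ in range(steps):
        P = P + h * (P @ Q)   # one Euler step of the forward equation
    return P

lam = 0.7                      # illustrative transition rate
Q = np.array([[-lam, lam],
              [0.0,  0.0]])    # two-state generator, one possible jump
P1 = solve_forward(Q, 1.0)
```

Here P1[0, 0] should be close to exp(−0.7), the exact value of p11(1) for this generator, and every row of P1 sums to one (the Euler step preserves row sums because the rows of Q sum to zero).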


Proof of Theorem 3.4. Since for each i, j ∈ S the function t ↦ µij(t), t ≥ 0, is continuous, the existence and uniqueness of the solution to (3.2) follows from standard ODE theory, namely the Picard-Lindelöf theorem. Next we show that the set of solutions {P(s, t), s ≥ 0, t ≥ s} is a transition matrix function. Fix s ≥ 0 and set, for i = 1, . . . , d, Si(t) = ∑_{k=1}^d pik(s, t) with pik(s, t) the (i, k)th entry of P(s, t). Then by (3.3) and the assumption that Q(t) is a Q-matrix (and thus has row sums equal to zero),

d/dt Si(t) = ∑_{k=1}^d ∑_{j=1}^d pij(s, t) µjk(t) = ∑_{j=1}^d pij(s, t) ∑_{k=1}^d µjk(t) = 0, t > s,

from which it follows that Si(t) is constant for t > s. By the boundary condition, we have Si(s) = ∑_{j=1}^d pij(s, s) = 1 and so it follows (as Si(t) is right-continuous at t = s) that Si(t) = 1 for t ≥ s. Hence the row sums of P(s, t) are equal to one. In order to show the Chapman-Kolmogorov equations, fix s ≥ 0 and u ≥ s. Then by (3.2),

∂/∂t [P(s, u)P(u, t)] = P(s, u) ∂/∂t P(u, t) = [P(s, u)P(u, t)] Q(t), t > u,

and by the boundary condition, P(s, u)P(u, u) = P(s, u). So [P(s, u)P(u, t)] is equal to P(s, t) for u = t and satisfies the same ODE as P(s, t) for t > u. Therefore by uniqueness of the ODE, we must have that [P(s, u)P(u, t)] = P(s, t) for all t ≥ u. Since s ≥ 0 and u ≥ s were chosen arbitrarily, it follows that the Chapman-Kolmogorov equations are satisfied.

We also need to show that pik(s, t) ≥ 0 for all states i, k and times 0 ≤ s ≤ t. Now fix 0 ≤ s ≤ t and let α > 0 be such that −µii(u) ≤ α for all states i and u ∈ [s, t]. Note that such an α exists since the transition intensities µjk(u) are assumed to be continuous and so they are bounded on compact time intervals. Now form the matrix

P̃(s, t) = e^{α(t−s)} P(s, t).

Then by (3.2), for u ∈ [s, t],

∂/∂u P̃(s, u) = e^{α(u−s)} ∂/∂u P(s, u) + α e^{α(u−s)} P(s, u)
             = e^{α(u−s)} P(s, u)(Q(u) + αI)
             = P̃(s, u)(Q(u) + αI)    (3.5)

and further P̃(s, s) = P(s, s) = I. Note that by the Picard-Lindelöf theorem, there exists a unique solution to the system of ODEs (3.5) with boundary condition P̃(s, s) = I (and this unique solution is of course given by P̃(s, u) = e^{α(u−s)}P(s, u)). Since by definition of α the matrix Q(u) + αI has only positive entries for u ∈ [s, t], and since the same holds for the matrix P̃(s, s), it is not hard to see via (3.5) that we must have that the matrices P̃(s, u) and ∂/∂u P̃(s, u) have only positive entries for u ∈ [s, t]. In particular, P̃(s, t) has only positive entries and since P(s, t) = e^{−α(t−s)}P̃(s, t), then so does P(s, t).


In order to prove the last assertion of the theorem, assume Q := Q(t) does not depend on t. Fix s ≥ 0 and let A(t) := P(s, t) and B(t) := P(0, t − s) for t ≥ s. Then A(s) = I = B(s) and further, using (3.2),

d/dt A(t) = A(t)Q(t) = A(t)Q and d/dt B(t) = B(t)Q(t − s) = B(t)Q

and so by uniqueness of the ODE, we must have A(t) = B(t) for all t ≥ s. Hence the transition matrix function {P(s, t), s ≥ 0, t ≥ s} is homogeneous.

The next theorem is a sort of converse to Theorem 3.4.

Theorem 3.5. Let {P(s, t), s ≥ 0, t ≥ s} be a transition matrix function and assume that the limits

lim_{t↓s} (P(s, t) − I)/(t − s)  and  lim_{t↑s} (P(t, s) − I)/(s − t)    (3.6)

exist and are equal to each other for all s ≥ 0. Denote the common limit matrix by Q(s). Then for each s ≥ 0, Q(s) is a Q-matrix and P(s, t) satisfies Kolmogorov’s forward differential equations

∂/∂t P(s, t) = P(s, t)Q(t), t > s.

Further, if {P(s, t), s ≥ 0, t ≥ s} is homogeneous, then Q(s) = Q(t) for all s, t ≥ 0.

Proof. We first prove that Q(s) is a Q-matrix for each s ≥ 0. Since P(s, t) is a stochastic matrix, we have that its elements pij(s, t) satisfy pij(s, t) ∈ [0, 1]. Hence

µii(s) = lim_{t↓s} (pii(s, t) − 1)/(t − s) ≤ 0

and for i ≠ j,

µij(s) = lim_{t↓s} (pij(s, t) − 0)/(t − s) ≥ 0.

Further, with δij the (i, j)th element of I, the row sums of Q(s) are equal to

∑_{j=1}^d µij(s) = ∑_{j=1}^d lim_{t↓s} (pij(s, t) − δij)/(t − s) = lim_{t↓s} ∑_{j=1}^d (pij(s, t) − δij)/(t − s) = 0,

where we used that d is finite (so that we can switch the sum and the limit) and that the row sums of P(s, t) are equal to one. Hence Q(s) is a Q-matrix.

Next we prove that P(s, t) satisfies the Kolmogorov forward differential equations. We do this in two steps: first we prove that

lim_{h↓0} (pik(s, t + h) − pik(s, t))/h = ∑_{j∈S} pij(s, t) µjk(t)

and then that

lim_{h↑0} (pik(s, t + h) − pik(s, t))/h = ∑_{j∈S} pij(s, t) µjk(t).

Step 1. We have by the Chapman-Kolmogorov equations for h > 0,

pik(s, t + h) = ∑_{j∈S} pij(s, t) pjk(t, t + h) = pik(s, t) pkk(t, t + h) + ∑_{j∈S, j≠k} pij(s, t) pjk(t, t + h),

leading to

(pik(s, t + h) − pik(s, t))/h = pik(s, t) (pkk(t, t + h) − 1)/h + ∑_{j∈S, j≠k} pij(s, t) pjk(t, t + h)/h.

Taking limits as h ↓ 0 on both sides and using the assumption on the first limit in (3.6) leads to

lim_{h↓0} (pik(s, t + h) − pik(s, t))/h = pik(s, t) µkk(t) + lim_{h↓0} ∑_{j∈S, j≠k} pij(s, t) pjk(t, t + h)/h
    = pik(s, t) µkk(t) + ∑_{j∈S, j≠k} pij(s, t) lim_{h↓0} pjk(t, t + h)/h
    = ∑_{j∈S} pij(s, t) µjk(t),    (3.7)

where the switching of the sum and the limit is justified since S = {1, . . . , d} is a finite set.

Step 2. By the Chapman-Kolmogorov equations we have for h < 0 such that s < t + h,

pik(s, t) = ∑_{j∈S} pij(s, t + h) pjk(t + h, t).    (3.8)

Taking limits on both sides of (3.8) as h ↑ 0 and using the assumption on the second limit in (3.6) (which shows that pjk(t + h, t) → δjk as h ↑ 0) and the fact that S is a finite set, which allows switching the sum and the limit, we see that

pik(s, t) = lim_{h↑0} pik(s, t + h),    (3.9)

i.e. pik(s, t) is left-continuous in t. Now (3.8) implies

(pik(s, t + h) − pik(s, t))/h = pik(s, t + h) (1 − pkk(t + h, t))/h − ∑_{j∈S, j≠k} pij(s, t + h) pjk(t + h, t)/h

and taking limits as h ↑ 0 gives us, with the aid of (3.9) and the assumption on the second limit in (3.6),

lim_{h↑0} (pik(s, t + h) − pik(s, t))/h = pik(s, t) µkk(t) + ∑_{j∈S, j≠k} pij(s, t) µjk(t) = ∑_{j∈S} pij(s, t) µjk(t).


This proves the second step and so we have shown that P(s, t) satisfies the Kolmogorov forward differential equations. Note further that from the definition of a transition matrix function (cf. Definition 3.1), P(s, t) satisfies the boundary condition P(s, s) = I.

Finally, the last assertion of the theorem follows easily since when the transition matrix function is homogeneous, we have

Q(s) = lim_{t↓s} (P(s, t) − I)/(t − s) = lim_{t↓s} (P(0, t − s) − I)/(t − s) = lim_{h↓0} (P(0, h) − I)/h

and so clearly Q(s) does not depend on s.

Remark 3.6. In order to explain the name ‘forward’ equations, recall that for an MJP {Xt : t ≥ 0} the quantity pik(s, t) := Pr(Xt = k|Xs = i) gives the probability of the MJP moving from state i at time s to state k at time t. Therefore one can refer to the variable i here as the backward variable (when you travel from i to k you see i when looking backwards) and to the variable k as the forward variable (when you travel from i to k you see k when looking forwards). In (3.3) the sum in the right hand side is taken with respect to the forward variable and this is the reason why these differential equations are called the forward equations. Note though that at this stage we have not actually shown that the solutions of the forward equations are indeed the transition probabilities of some MJP. This we will do in the next chapter.

3.3 Examples

In this section we look at a number of examples that appear in actuarial science. In each example the starting point is a schematic representation of a Q-matrix.

The time homogeneous health-sickness-death (hsd) model Consider the following model for transitions between health, sickness and death. The arrows indicate the possible transitions and the value attached to each arrow gives the transition rate (NOT the transition probability).

[Diagram: three states h:healthy, s:sick and d:dead, with transitions h → s at rate α, s → h at rate β, h → d at rate µ and s → d at rate ν.]


Notice further that the diagram gives the transition rates OUT of each state. The Q- or generator matrix for this model with states S = {h, s, d} is given by

Q =
[ −(α + µ)      α        µ ]
[     β     −(β + ν)     ν ]
[     0         0        0 ].

Let

P(t) =
[ phh(t)  phs(t)  phd(t) ]
[ psh(t)  pss(t)  psd(t) ]
[ pdh(t)  pds(t)  pdd(t) ], t ≥ 0,

be the homogeneous transition matrix function generated by Q. Then P(t) satisfies the system of ODEs d/dt P(t) = P(t)Q with boundary condition P(0) = I. The three forward equations corresponding to the backward state h:healthy are

d/dt phh(t) = −(α + µ) phh(t) + β phs(t)
d/dt phs(t) = α phh(t) − (β + ν) phs(t)
d/dt phd(t) = µ phh(t) + ν phs(t).

Notice that in addition we must have phh(t) + phs(t) + phd(t) = 1. One can easily show that this last equation, in combination with any two of the three forward equations above, implies the remaining third forward equation (see exercise). For the backward state s:sick the three forward differential equations are

d/dt psh(t) = −(α + µ) psh(t) + β pss(t)
d/dt pss(t) = α psh(t) − (β + ν) pss(t)
d/dt psd(t) = µ psh(t) + ν pss(t).

Once again, in addition we have the equation psh(t) + pss(t) + psd(t) = 1. Note that the three equations for the state s:sick look very similar to the ones for the state h:healthy and so one might guess that phh(t) = psh(t), phs(t) = pss(t) and phd(t) = psd(t). However, this is not the case as the boundary conditions are different for the two sets of ODEs. Indeed, for example we have phh(0) = 1 and psh(0) = 0, so the two transition probabilities cannot be the same.

Note that similar equations can be obtained for the remaining state d:dead. Further notice that for this example Kolmogorov's forward equations consist of three systems of three differential equations each.
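To see concretely that phh(t) and psh(t) differ even though the two sets of forward equations look identical, one can integrate them numerically. The rates below are hypothetical and the Euler scheme is our own illustrative choice (numpy assumed):

```python
import numpy as np

alpha, beta, mu, nu = 0.2, 0.1, 0.01, 0.05   # hypothetical rates
Q = np.array([[-(alpha + mu), alpha,        mu],
              [beta,          -(beta + nu), nu],
              [0.0,           0.0,          0.0]])

def P_hsd(t, steps=20000):
    """Euler scheme for dP/dt = P Q, P(0) = I, in the hsd model."""
    P = np.eye(3)
    h = t / steps
    for _ in range(steps):
        P = P + h * (P @ Q)
    return P

P5 = P_hsd(5.0)   # rows: from h, s, d; columns: to h, s, d
```

Each row of P5 sums to one, the row for d:dead stays (0, 0, 1), and phh(5) and psh(5) are clearly different, reflecting the different boundary conditions.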

The in-work model Consider the following time homogeneous model for being in work.

[Diagram: state w:working, with transitions w → d at rate µ and w → r at rate ν; the states d:dead and r:retired are absorbing.]


The generator matrix for the states w, r, d (in that order) is

Q = ( −(µ+ν)  ν  µ
        0     0  0
        0     0  0 )

and the corresponding forward differential equations for the backward state w:working are (with obvious notation)

d/dt pww(t) = −(µ+ν)pww(t)
d/dt pwr(t) = νpww(t)
d/dt pwd(t) = µpww(t).

From the first equation we have that

pww(t) = Ae^{−(µ+ν)t}

for some constant A, and since pww(0) = 1 we must have A = 1 and thus pww(t) = e^{−(µ+ν)t}. Substituting this in the second equation we have that

d/dt pwr(t) = νe^{−(µ+ν)t},

i.e.

pwr(t) = −(ν/(µ+ν)) e^{−(µ+ν)t} + C

for some constant C. Since pwr(0) = 0 we must have C = ν/(µ+ν) and thus

pwr(t) = (ν/(µ+ν)) (1 − e^{−(µ+ν)t}),  t ≥ 0.

Similarly from the third equation we get

pwd(t) = (µ/(µ+ν)) (1 − e^{−(µ+ν)t}),  t ≥ 0.

Alternatively, we could have derived the last equation via pww(t) + pwr(t) + pwd(t) = 1.
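The closed forms above are easy to verify numerically. The sketch below (with illustrative values of µ and ν, not from the notes) checks the row-sum identity and the forward equation d/dt pwr(t) = ν pww(t) by a central finite difference.

```python
import math

# Closed-form transition probabilities of the in-work model; the rate
# values are arbitrary illustrations.
mu, nu = 0.04, 0.10

def pww(t): return math.exp(-(mu + nu) * t)
def pwr(t): return nu / (mu + nu) * (1.0 - math.exp(-(mu + nu) * t))
def pwd(t): return mu / (mu + nu) * (1.0 - math.exp(-(mu + nu) * t))

t, h = 2.0, 1e-6
# rows of P(t) sum to one
assert abs(pww(t) + pwr(t) + pwd(t) - 1.0) < 1e-12
# finite-difference check of the forward equation d/dt p_wr = nu * p_ww
lhs = (pwr(t + h) - pwr(t - h)) / (2 * h)
assert abs(lhs - nu * pww(t)) < 1e-6
```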

The time inhomogeneous alive-dead model Consider the case in which the only possible transition, from state a:alive to state d:dead, has transition rate µ(t), which depends on time/age t, making the model time inhomogeneous.

[Diagram: a:alive → d:dead at rate µ(t).]


The rate µ(t) is often referred to as the force of mortality. We assume that µ(t) is continuous. The generator matrix is

Q(t) = ( −µ(t)  µ(t)
           0     0  )

and the forward differential equations give, for 0 ≤ s ≤ t,

∂/∂t paa(s,t) = −µ(t)paa(s,t)
∂/∂t pad(s,t) = µ(t)paa(s,t)
∂/∂t pda(s,t) = −µ(t)pda(s,t)
∂/∂t pdd(s,t) = µ(t)pda(s,t)

with the solution, corresponding to the initial conditions paa(s,s) = 1, pad(s,s) = 0, pda(s,s) = 0, pdd(s,s) = 1, being

paa(s,t) = exp(−∫_s^t µ(u)du)
pad(s,t) = 1 − exp(−∫_s^t µ(u)du)
pda(s,t) = 0
pdd(s,t) = 1.

For the alive-dead model it is easy to construct an MJP whose transition probabilities are given by the solutions to the forward equations. Indeed, let T be a positive random variable with density function f_T(t) = µ(t) exp(−∫_0^t µ(u)du) and set

Xt = a if t < T,
     d if t ≥ T.

Then X = {Xt : t ≥ 0} is a stochastic process satisfying the Markov property and, for 0 ≤ s ≤ t,

Pr(Xt = a|Xs = a) = Pr(T > t|T > s) = Pr(T > t)/Pr(T > s)
                  = ∫_t^∞ f_T(v)dv / ∫_s^∞ f_T(v)dv
                  = exp(−∫_0^t µ(u)du) / exp(−∫_0^s µ(u)du)
                  = exp(−∫_s^t µ(u)du) = paa(s,t),

Pr(Xt = d|Xs = a) = Pr(T ≤ t|T > s) = 1 − paa(s,t) = pad(s,t),
Pr(Xt = a|Xs = d) = Pr(T > t|T ≤ s) = 0 = pda(s,t),
Pr(Xt = d|Xs = d) = Pr(T ≤ t|T ≤ s) = 1 = pdd(s,t).

Thus we see that the transition probabilities of this particular MJP X are equal to the solutions to the forward equations associated with the given Q-matrices. In the next chapter we will show that for the other models considered here as well, we can construct a corresponding MJP whose transition probabilities are given by the solutions to the forward equations.
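The construction can also be checked by simulation: sample T by inverse transform and estimate Pr(Xt = a|X0 = a). The sketch below assumes a Gompertz force of mortality µ(t) = βe^{γt} with illustrative parameters, for which the inversion is available in closed form.

```python
import math, random

# Monte Carlo sketch of the alive-dead MJP with a Gompertz force of
# mortality mu(t) = beta * exp(gamma * t); beta, gamma are illustrative.
beta, gamma = 0.2, 0.5
random.seed(1)

def survival(t):
    # p_aa(0, t) = exp(-int_0^t mu(u) du) in closed form
    return math.exp(-beta / gamma * (math.exp(gamma * t) - 1.0))

def sample_T():
    # inverse-transform sampling: solve S(T) = U for U uniform(0, 1)
    u = random.random()
    return math.log(1.0 - gamma / beta * math.log(u)) / gamma

t = 1.5
n = 100_000
alive = sum(1 for _ in range(n) if sample_T() > t)
estimate = alive / n
assert abs(estimate - survival(t)) < 0.01
```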

3.4 Kolmogorov’s backward differential equations (non-examinable)

The terminology of the forward equations suggests that there is also something like the backward equations. This is indeed the case.

Theorem 3.7. Let P(s,t), s ≥ 0, t ≥ s, be a transition matrix function generated by the set of Q-matrices Q(t) = (µij(t))_{i,j=1}^d, t ≥ 0, whose entries µij(t) are continuous in t. Then for each t ≥ 0, P(s,t) satisfies the following system of differential equations, known as the Kolmogorov backward differential equations:

∂/∂s P(s,t) = −Q(s)P(s,t),  0 ≤ s < t,

i.e. for all i, k ∈ S = {1, ..., d},

∂/∂s pik(s,t) = −∑_{j∈S} µij(s)pjk(s,t).  (3.10)

Since in (3.10) the sum on the right hand side is taken with respect to the backward variable (see Remark 3.6), these differential equations are called the backward equations. Note that in the homogeneous case

∂/∂s P(s,t) = ∂/∂s P(0, t−s) = −d/du P(u)|_{u=t−s}

and so in that case the backward equations can be written as

d/dt P(t) = QP(t),  t ≥ 0,


i.e. for all i, k ∈ S,

d/dt pik(t) = ∑_{j∈S} µij pjk(t).

Proof of Theorem 3.7. Since P(s,t), s ≥ 0, t ≥ s, satisfies the Chapman-Kolmogorov equations, we have for s ≤ u ≤ t,

P(s,t) = P(s,u)P(u,t).

Taking derivatives with respect to u on both sides of the above equation and using the Kolmogorov forward equations leads to

0 = (∂/∂u P(s,u)) P(u,t) + P(s,u) ∂/∂u P(u,t) = P(s,u)Q(u)P(u,t) + P(s,u) ∂/∂u P(u,t),

where 0 is the matrix consisting of zeros only. Then setting u = s and using P(s,s) = I yields

0 = Q(s)P(s,t) + ∂/∂s P(s,t).

We see that P(s,t) satisfies the Kolmogorov backward equations.

Note that if we would like to compute, for a fixed i, the probabilities pik(s,t) for all k ∈ S, then it is easier to use the forward differential equations, because we only need to solve a system of d ODEs, whereas if we use the backward equations, we need to solve a system of d × d ODEs. On the other hand, the backward equations are more convenient if we would like to compute, for a fixed k, the probabilities pik(s,t) for all i ∈ S. However, in applications the probabilities with the backward state fixed are usually more interesting than those with the forward state fixed, and so in practice one tends to use the forward equations more often than the backward ones.
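In the homogeneous case P(t) = exp(Qt) solves both sets of equations, since exp(Qt) commutes with Q. The sketch below (an illustrative 2 × 2 generator, with the matrix exponential computed by a truncated power series) checks both d/dt P = PQ and d/dt P = QP by a central finite difference.

```python
# For a time homogeneous MJP, P(t) = exp(Qt) solves both the forward
# (P' = PQ) and backward (P' = QP) equations. The generator is illustrative.
def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def mat_exp(Q, t, terms=60):
    # truncated power series exp(Qt) = sum_k (Qt)^k / k!
    n = len(Q)
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # I
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = mat_mul(term, [[q * t / k for q in row] for row in Q])
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result

Q = [[-0.3, 0.3], [0.1, -0.1]]
t, h = 1.0, 1e-5
deriv = [[(mat_exp(Q, t + h)[i][j] - mat_exp(Q, t - h)[i][j]) / (2 * h)
          for j in range(2)] for i in range(2)]
PQ = mat_mul(mat_exp(Q, t), Q)
QP = mat_mul(Q, mat_exp(Q, t))
assert all(abs(deriv[i][j] - PQ[i][j]) < 1e-6 for i in range(2) for j in range(2))
assert all(abs(deriv[i][j] - QP[i][j]) < 1e-6 for i in range(2) for j in range(2))
```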


Chapter 4

Probabilistic aspects of Markov jump processes

In the previous chapter we showed that the matrices of transition probabilities of a given MJP form a transition matrix function (Proposition 3.2) and that such transition matrix functions can be constructed from a given set of Q-matrices via the Kolmogorov forward equations (Theorem 3.4). The next important question is whether, given a set of Q-matrices, there actually exists an MJP whose transition probabilities satisfy the Kolmogorov forward equations associated with those Q-matrices. In this chapter we show that for a set of Q-matrices consisting of continuous transition rates, we can indeed construct such an MJP. This is done in Section 4.2, with further details provided in Section 4.3 for the time homogeneous case. The construction also allows us to say something about the distribution of (i) the length of time an MJP spends in a given state before transitioning to another state and (ii) the state the MJP visits next, given that the MJP makes another jump from a given state. Before we discuss the construction of an MJP, we first introduce in Section 4.1 the concept of hazard functions/rates, since this allows us to present the construction in a clearer way.

4.1 Survival times and hazard functions

By a survival time T we will mean a random variable taking possible values in the set [0,∞) ∪ {∞} and which has a right-continuous density function f on [0,∞). This means that the cumulative distribution function (cdf) of T is given by

Pr(T ≤ t) = ∫_0^t f(s)ds,  0 ≤ t < ∞.  (4.1)

We will further always assume that a survival time T is unbounded, which means that

Pr(T ≤ t) < 1 for all 0 ≤ t < ∞.

Typically a survival time is used to model the time it takes until a certain event takes place. We include the possibility that T = ∞ with some strictly positive probability. This is natural as sometimes the event of interest never takes place. This means that ∫_0^∞ f(s)ds = Pr(T < ∞) can be strictly less than one, in which case f is, strictly speaking, not a probability density function (but just a density function). Instead of working with the cdf or (p)df of T, it is often more convenient to work with the survival or hazard function of the survival time. We will introduce these two functions next.

Definition 4.1. The survival function of the survival time T is defined as

S(t) := Pr(T > t), 0 ≤ t <∞.

Note that S(t) is a decreasing (in the weak sense) function, that S(t) is strictly positive since T is assumed to be unbounded, and that lim_{t→∞} S(t) = Pr(T = ∞).

Definition 4.2. The hazard function or hazard rate of T with right-continuous density function f is defined as

µ(t) := lim_{h↓0} (1/h) Pr(T ≤ t+h | T > t)
      = lim_{h↓0} (1/h) Pr(t < T ≤ t+h) / Pr(T > t)
      = (1/S(t)) lim_{h↓0} (Pr(T ≤ t+h) − Pr(T ≤ t)) / h
      = (1/S(t)) lim_{h↓0} (1/h) ∫_t^{t+h} f(s)ds
      = f(t)/S(t),  0 < t < ∞,   (4.2)

where the last equality follows by the assumption that f is right-continuous. Note that µ(t) is well-defined since S(t) > 0.

Interpreting T as the time until an event takes place, we have for small h > 0,

f(t)h ≈ Pr(t < T ≤ t+h)
      ≈ probability of the event taking place between time t and t+h

and

µ(t)h ≈ Pr(T ≤ t+h | T > t)
      ≈ probability of the event taking place between time t and t+h,
        given that the event has not taken place before time t.

This explains the difference between the functions f and µ. It is useful to express the survival function directly in terms of the hazard function. The next proposition does exactly this.


Proposition 4.3. We have

S(t) = e^{−∫_0^t µ(u)du},  0 ≤ t < ∞.  (4.3)

Proof. First we assume that the density function f is continuous. Then f(t) = d/dt Pr(T ≤ t) = d/dt (1 − S(t)) = −S′(t). Using (4.2), it follows that S(t) is the solution of the following first-order, linear, ordinary differential equation:

S′(t) = −µ(t)S(t), 0 < t < ∞, with S(0) = Pr(T > 0) = 1.

We can easily solve this ODE using separation of variables:

log(S(t)) = log(S(t)) − log(S(0)) = ∫_0^t S′(u)/S(u) du = −∫_0^t µ(u)du.

By taking exponentials, we get S(t) = e^{−∫_0^t µ(u)du}. In the general case where it is merely assumed that f is right-continuous, the same arguments work with the understanding that the derivative everywhere is to be replaced by a right-derivative.

Example 4.4. The simplest example of a hazard function is the one with constant hazard rate, i.e. µ(t) = λ for all t ≥ 0, where λ > 0. Then by (4.3) the survival function is given by S(t) = e^{−λt}. Thus the density function is given by f(t) = µ(t)S(t) = λe^{−λt} and we recognise that the survival time in this case is exponentially distributed with parameter λ. So a constant hazard rate corresponds to an exponential distribution.
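The defining difference quotient of the hazard rate can be evaluated directly for the exponential case; the sketch below (with an illustrative λ) confirms that it tends to λ for every t.

```python
import math

# Sketch of Definition 4.2 for the exponential case: the difference
# quotient (1/h) Pr(T <= t + h | T > t) tends to lambda as h decreases,
# for every t. The value of lam and the times t are illustrative.
lam = 0.8

def hazard_quotient(t, h):
    # Pr(t < T <= t+h) / (h * Pr(T > t)), with Pr(T > t) = exp(-lam * t)
    return (math.exp(-lam * t) - math.exp(-lam * (t + h))) / (h * math.exp(-lam * t))

for t in (0.0, 0.5, 2.0):
    assert abs(hazard_quotient(t, 1e-6) - lam) < 1e-5
```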

Some other examples of hazard functions that are popular for modelling times of death of humans are:

• Gompertz: µ(t) = βe^{γt} with parameters β > 0 and γ ∈ R.

• Makeham: µ(t) = α + βe^{γt} with parameters α > 0, β > 0 and γ ∈ R.

• Weibull: µ(t) = αλ^α t^{α−1} with parameters α, λ > 0.

Note that the Makeham hazard function is a generalisation of the Gompertz one. Plots of the hazard functions corresponding to the Weibull and the Gompertz-Makeham distributions are given in Figures 4.1 and 4.2.
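Each of these hazards determines its survival function through (4.3). The sketch below recomputes S(t) by a crude midpoint rule and compares against the known closed forms for the Weibull and Gompertz cases; all parameter values are illustrative.

```python
import math

# Survival functions via Proposition 4.3, S(t) = exp(-int_0^t mu(u) du),
# computed with a midpoint rule for two of the hazard families above.
def survival(mu, t, n=20_000):
    h = t / n
    integral = sum(mu((k + 0.5) * h) for k in range(n)) * h
    return math.exp(-integral)

alpha_w, lam = 1.4, 0.5        # Weibull parameters (illustrative)
weibull = lambda u: alpha_w * lam**alpha_w * u**(alpha_w - 1)
# Weibull has the closed form S(t) = exp(-(lam * t)^alpha)
t = 2.0
assert abs(survival(weibull, t) - math.exp(-(lam * t)**alpha_w)) < 1e-4

beta, gam = 0.4, 0.6           # Gompertz parameters (illustrative)
gompertz = lambda u: beta * math.exp(gam * u)
# Gompertz: S(t) = exp(-(beta/gamma) * (exp(gamma * t) - 1))
assert abs(survival(gompertz, t)
           - math.exp(-beta / gam * (math.exp(gam * t) - 1))) < 1e-4
```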

Now let T be a survival time and fix x > 0. Then the function

t ↦ Pr(T ≤ t+x | T > x),  t ≥ 0,

is a cdf defined on [0,∞) ∪ {∞} with a right-continuous density function. Hence we can define a survival time which has this function as its cdf. We denote this survival time by Tx. We want to know what the survival and hazard functions of Tx are, i.e. what are the


Figure 4.1: The hazard function of the Weibull distribution for the case α = 0.6 (blue/decreasing), α = 1 (green/constant, exponential distribution) and α = 1.4 (red/increasing).

Figure 4.2: The hazard function of the Gompertz-Makeham distribution for the case α = 1, β = −0.4, γ = −0.6 (blue/bottom), α = 1, β = 0.4, γ = 0.6 (red/middle) and α = 1, β = 0.4, γ = −0.6 (green/top).


survival and hazard functions of the conditional distribution of T given that T > x? Well, its survival function is given by

Pr(Tx > t) = Pr(T > t+x | T > x)
           = Pr(T > t+x) / Pr(T > x)
           = e^{−∫_0^{x+t} µ(u)du} / e^{−∫_0^x µ(u)du}
           = e^{−∫_x^{x+t} µ(u)du}
           = e^{−∫_0^t µ(s+x)ds},  t ≥ 0,   (4.4)

where we used (4.3) twice in the third equality. Hence, by using (4.3) again, we see that the hazard function of Tx is

t ↦ µ(t+x),  t ≥ 0.

Example 4.5. Consider again the case where the hazard function of a survival time T is constant, i.e. µ(t) = λ, t ≥ 0. Let Tx be the random variable with distribution equal to the conditional distribution of T given that T > x. Then by (4.4), for any x > 0,

Pr(Tx > t) = e^{−λt}.

Hence for each x > 0, Tx is also exponentially distributed with parameter λ. This is the so-called memoryless property of the exponential distribution. Note that if the hazard function is not constant, then the distribution of Tx is not the same as the (unconditional) distribution of T.

4.2 Construction of an MJP

Let, for each t ≥ 0, Q(t) = (µij(t))_{i,j=1}^d be a Q-matrix such that the transition intensities µij(t) are continuous in t. For convenience we will use the following notation:

µi(t) := −µii(t) = ∑_{j≠i} µij(t)

for the total transition rate out of state i. Note that µi(t) ≥ 0. Given this set of Q-matrices and an initial state i0, we next construct a stochastic process X = {Xt : t ≥ 0} with state space {1, ..., d}, right-continuous sample paths and initial state i0. Since X will have right-continuous sample paths, we can define, as in Figure 3.1, the consecutive jump times J(1), J(2), ... and the consecutive holding times H(1), H(2), .... We further set J(0) = 0. Note that H(n) = J(n) − J(n−1) for n ≥ 1 and, by right-continuity, for any n ≥ 1, Xt is constant for J(n−1) ≤ t < J(n). Note that the jump times and holding times are random variables. The process X will be specified uniquely once we


specify the (joint) distribution of the holding times H(1), H(2), ... and the states to which the process jumps at the jump times, i.e. the random variables XJ(1), XJ(2), .... We do this recursively for n = 1, 2, ... via the following two steps:

(i) Given J(n−1) = s < ∞ and XJ(n−1) = i, the holding time H(n) is

(a) defined as a survival time with hazard function t ↦ µi(t+s) and

(b) assumed to be conditionally independent of H(1), H(2), ..., H(n−1) and XJ(0), XJ(1), ..., XJ(n−2).

Note that step (i)(a) means via (4.3) that, for 0 ≤ s < ∞,

Pr(H(n) > t | J(n−1) = s, XJ(n−1) = i) = e^{−∫_0^t µi(u+s)du},  t ≥ 0.

(ii) Given J(n) = t < ∞ and XJ(n−1) = i, the state XJ(n) of the process at the next jump time is

(a) set equal to k with probability µik(t)/µi(t) for k ≠ i and

(b) assumed to be conditionally independent of H(1), H(2), ..., H(n) and XJ(0), XJ(1), ..., XJ(n−2).

Note that step (ii)(a) means that, for 0 ≤ t < ∞,

Pr(XJ(n) = k | J(n) = t, XJ(n−1) = i) = µik(t)/µi(t),  k ≠ i.¹  (4.5)

Because the transition rates are bounded on compact time intervals (this is a consequence of the assumption that they are continuous), one can show that either there exists n ≥ 1 such that J(n) = ∞ or lim_{n→∞} J(n) = ∞, which means that we have defined the stochastic process X at all time points t ≥ 0.
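Steps (i) and (ii) translate directly into a simulation recipe. The sketch below is my own illustration, not from the notes: it assumes rates of the form µij(t) = c_ij e^{gt}, for which the holding-time survival function can be inverted in closed form; note that for this particular family the jump probabilities µik(t)/µi(t) = c_ik/c_i happen not to depend on t.

```python
import math, random

# Runnable sketch of the recursive construction (i)-(ii) for an
# illustrative time-inhomogeneous model with mu_ij(t) = c_ij * exp(g*t).
# States: 0 (h), 1 (s), 2 (d, absorbing); the c_ij values are invented.
g = 0.3
C = [[0.0, 0.4, 0.1],
     [0.6, 0.0, 0.2],
     [0.0, 0.0, 0.0]]

def simulate_path(i0, horizon, rng):
    s, i, path = 0.0, i0, [(0.0, i0)]
    while s < horizon:
        ci = sum(C[i])
        if ci == 0.0:                     # absorbing state: stays forever
            break
        # step (i): holding time with hazard t -> mu_i(t + s), by inversion
        # of exp(-c_i * e^{g s} * (e^{g t} - 1) / g) = U
        u = rng.random()
        hold = math.log(1.0 - g * math.log(u) / (ci * math.exp(g * s))) / g
        s += hold
        if s >= horizon:
            break
        # step (ii): jump to k != i with probability mu_ik(s)/mu_i(s)
        r, acc = rng.random() * ci, 0.0
        for k in range(3):
            acc += C[i][k]
            if r < acc:
                i = k
                break
        path.append((s, i))
    return path

rng = random.Random(4)
path = simulate_path(0, horizon=50.0, rng=rng)
assert path[0] == (0.0, 0)
# jump times strictly increase and consecutive states differ
assert all(t1 < t2 and s1 != s2
           for (t1, s1), (t2, s2) in zip(path, path[1:]))
```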

The stochastic process X that we have constructed above (given a set of Q-matrices) turns out to satisfy the Markov property and is thus a Markov jump process. Before we show this, we first look at residual holding times. For the process X as constructed above, we define the residual holding time Rs by

Rs = inf{t > s : Xt ≠ Xs} − s,

i.e. it is the length of time between s and the next jump time after s; see also Figure 3.1. Note that there is a difference between a residual holding time Rs and a holding time H(n).

¹ One might be worried here that J(n) = t is such that the denominator µi(t) = 0, but one can show that

Pr(µi(J(n)) = 0 | XJ(n−1) = i) = 0,

so this does not form a problem.


The holding time H(n) is the amount of time the process stays in the same state right after a jump time (which is a random time), whereas Rs is the amount of time the process stays in the same state right after a deterministic time s. Nevertheless, it turns out that the conditional (in a certain sense) distributions of both types of holding times are the same; see the theorem below. Another important holding time random variable is the current holding time at s, denoted by Cs, which is defined to be the length of time between s and the last jump time before s, see also Figure 3.1. Mathematically, this reads as

Cs := s − sup{t ≤ s : Xt ≠ Xs}.

If the process has not jumped before time s, then we set Cs = s. Note that we must have 0 ≤ Cs ≤ s. The next theorem gives the conditional (given the state of the process at time s) distribution of the residual holding time at time s and says that, conditional on the state of the process at time s, the residual and current holding times at time s are independent. The latter result is crucial for showing that the process we just constructed satisfies the Markov property.

Theorem 4.6. Given Xs = i, the residual holding time Rs is a survival time with hazard function t ↦ µi(s+t), t > 0, and is conditionally independent of the current holding time Cs.

Proof. Let Is be the random variable that indicates how many jumps have occurred up to time s. Then, by conditioning on the current holding time and Is,

Pr(Rs > t | Xs = i, Cs = w, Is = n)
= Pr(Rs > t, Xs = i, Cs = w, Is = n | Xs = i, Cs = w, Is = n)
= Pr(H(n+1) > t+w | H(n+1) > w, J(n) = s−w, XJ(n) = i)
= Pr(H(n+1) > t+w, H(n+1) > w | J(n) = s−w, XJ(n) = i) / Pr(H(n+1) > w | J(n) = s−w, XJ(n) = i)
= Pr(H(n+1) > t+w | J(n) = s−w, XJ(n) = i) / Pr(H(n+1) > w | J(n) = s−w, XJ(n) = i)
= exp(−∫_0^{t+w} µi(s−w+u)du) / exp(−∫_0^w µi(s−w+u)du)
= exp(−∫_0^t µi(s+u)du),   (4.6)

where the second equality follows because

{Rs > t, Xs = i, Cs = w, Is = n} = {H(n+1) > t+w, J(n) = s−w, XJ(n) = i},
{Xs = i, Cs = w, Is = n} = {H(n+1) > w, J(n) = s−w, XJ(n) = i},

the third equality follows by the definition of conditional probability, and the fifth equality holds because, by step (i)(a) of the construction of X, H(n+1) given J(n) = s−w and XJ(n) = i


has a hazard rate given by t ↦ µi(s−w+t); see also (4.3). Since the right hand side of (4.6) does not depend on w or n, it follows that, conditional on Xs = i, Rs has hazard rate t ↦ µi(s+t) and is (conditionally) independent of Cs.

By the above theorem we see that, given Xs = i, the probability of X being in state i at time t while not having left state i in between, which is denoted by p̄ii(s,t), is given by

p̄ii(s,t) := Pr(Xu = i for all u ∈ [s,t] | Xs = i)
          = Pr(Rs > t−s | Xs = i)
          = exp(−∫_s^t µi(u)du).   (4.7)

We are not just interested in the residual holding time Rs but also in the state to which the process X jumps next after time s. We have the following result, which is very similar to (4.5).

Theorem 4.7. We have for s, r ≥ 0 and all states k ≠ i,

Pr(X_{Rs+s} = k | Rs = r, Xs = i) = µik(s+r)/µi(s+r),  k ≠ i.  (4.8)

Proof. As in the proof of Theorem 4.6, let Is be the random variable that indicates how many jumps have occurred up to time s. Then, using similar steps as in that proof,

Pr(X_{Rs+s} = k | Rs = r, Xs = i, Cs = w, Is = n)
= Pr(X_{J(n+1)} = k | H(n+1) = w+r, J(n) = s−w, XJ(n) = i)
= Pr(X_{J(n+1)} = k | J(n+1) = s+r, J(n) = s−w, XJ(n) = i)
= µik(s+r)/µi(s+r),

where the last line holds because, by step (ii) of the construction of X, conditional on J(n+1) = s+r and XJ(n) = i, the state X_{J(n+1)} is independent of J(n) and equals k with probability µik(s+r)/µi(s+r). Since the right hand side of the above equation does not depend on w or n, the theorem follows.

We now turn to the theorem that says that the stochastic process X defined at the beginning of this section is a Markov jump process.

Theorem 4.8. For a given set of Q-matrices Q(t) = (µij(t))_{i,j=1}^d with µij(t) continuous in t, let X = {Xt : t ≥ 0} be the stochastic process as constructed at the beginning of Section 4.2. Then X satisfies the Markov property and therefore we call X the Markov jump process generated by Q(t), t ≥ 0.

Proof. We need to show that X satisfies the Markov property, i.e.

Pr(X_{t_{n+1}} = k_{n+1} | X_{t_1} = k_1, X_{t_2} = k_2, ..., X_{t_n} = k_n) = Pr(X_{t_{n+1}} = k_{n+1} | X_{t_n} = k_n),

for t_1 < t_2 < ... < t_n < t_{n+1} and states k_1, k_2, ..., k_n, k_{n+1}. The first step is to realise that, by the construction of the process X, given the residual holding time R_{t_n} = r and X_{t_n} = k_n, what happens to the process after time t_n does not depend on what has happened before time t_n, i.e.

Pr(X_{t_{n+1}} = k_{n+1} | X_{t_1} = k_1, ..., X_{t_n} = k_n, R_{t_n} = r) = Pr(X_{t_{n+1}} = k_{n+1} | X_{t_n} = k_n, R_{t_n} = r).

What then remains to be shown is that, given X_{t_n} = k_n, the distribution of the residual holding time R_{t_n} does not depend on what has happened before time t_n, i.e.

Pr(R_{t_n} ≤ r | X_{t_1} = k_1, ..., X_{t_n} = k_n) = Pr(R_{t_n} ≤ r | X_{t_n} = k_n).  (4.9)

Again, from the construction of the process it is easy to see that, given the current holding time C_{t_n} = w and X_{t_n} = k_n, the residual holding time R_{t_n} does not depend on anything else that happened before time t_n, i.e.

Pr(R_{t_n} ≤ r | X_{t_1} = k_1, ..., X_{t_n} = k_n, C_{t_n} = w) = Pr(R_{t_n} ≤ r | X_{t_n} = k_n, C_{t_n} = w).  (4.10)

But by Theorem 4.6 we know that, given X_{t_n}, the residual and current holding times R_{t_n} and C_{t_n} are independent, which means that

Pr(R_{t_n} ≤ r | X_{t_n} = k_n, C_{t_n} = w) = Pr(R_{t_n} ≤ r | X_{t_n} = k_n).

Then, since the right hand side of (4.10) does not depend on w, neither does the left hand side of (4.10). Hence the left hand side of (4.10) is equal to the left hand side of (4.9). We have thus shown that (4.9) holds, which finishes the proof.

Let X be the MJP generated by the set of Q-matrices Q(t) satisfying the aforementioned conditions. By Proposition 3.2 it follows that the matrices of transition probabilities pij(s,t) := Pr(Xt = j | Xs = i), denoted by P(s,t), form a transition matrix function. On the other hand, in Theorem 3.4 we saw that, given the set of Q-matrices Q(t), we can also construct a transition matrix function P̃(s,t), s ≥ 0, t ≥ s, which is the unique solution of the Kolmogorov forward equations

∂/∂t P̃(s,t) = P̃(s,t)Q(t),  t > s,

with boundary condition P̃(s,s) = I. The question now is of course whether these two transition matrix functions are the same, i.e. do we have P(s,t) = P̃(s,t)? The answer is yes, and this follows from Theorem 4.9 below. Before stating and proving this theorem, we need to introduce a few auxiliary functions.

To this end, denote by f_{Rs|Xs}(r|i) the conditional density (function) of Rs on (0,∞) given Xs = i. This means that

Pr(Rs ≤ t | Xs = i) = ∫_0^t f_{Rs|Xs}(r|i)dr,  0 < t < ∞.


Note that f_{Rs|Xs}(r|i) is not necessarily a probability density function since it is possible that Rs = ∞ (i.e. the process never jumps again after time s) with strictly positive probability, and so ∫_0^∞ f_{Rs|Xs}(r|i)dr can be strictly less than one. Combining Theorem 4.6 with (4.2) and (4.3), we see that for s, r ≥ 0,

f_{Rs|Xs}(r|i) = µi(r+s) exp(−∫_s^{s+r} µi(u)du).  (4.11)

Denote further by f_{Rs,X_{Rs+s}|Xs}(r,k|i) the joint (again not necessarily probability) density/mass function of Rs and X_{Rs+s} conditional on Xs = i. Note that this means that

Pr(Rs ≤ t, X_{Rs+s} = k | Xs = i) = ∫_0^t f_{Rs,X_{Rs+s}|Xs}(r,k|i)dr,  t > 0.

Then by the definition of conditional probability, (4.8) and (4.11),

f_{Rs,X_{Rs+s}|Xs}(r,k|i) = Pr(X_{Rs+s} = k | Rs = r, Xs = i) f_{Rs|Xs}(r|i)
                          = (µik(s+r)/µi(s+r)) µi(r+s) exp(−∫_s^{s+r} µi(u)du)
                          = µik(r+s) exp(−∫_s^{s+r} µi(u)du).   (4.12)

Theorem 4.9. Given a set of Q-matrices Q(t) = (µij(t))_{i,j=1}^d with µij(t) continuous in t, let X = {Xt : t ≥ 0} be the Markov jump process generated by Q(t), t ≥ 0, and let P(s,t) be the matrix whose (i,j)th entry is given by pij(s,t) = Pr(Xt = j | Xs = i). Then P(s,t) satisfies the Kolmogorov forward equations.

Proof. By Theorem 3.5 we only need to show that

lim_{t↓s} (pii(s,t) − 1)/(t−s) = µii(s),  lim_{t↓s} pik(s,t)/(t−s) = µik(s),  k ≠ i  (4.13)

and

lim_{t↑s} (pii(t,s) − 1)/(s−t) = µii(s),  lim_{t↑s} pik(t,s)/(s−t) = µik(s),  k ≠ i.  (4.14)

By (4.7) (and recalling that µi(t) := −µii(t)), we have

pii(s,t) ≥ p̄ii(s,t) = exp(−∫_s^t µi(u)du) = exp(∫_s^t µii(u)du)

and thus

lim_{t↓s} (pii(s,t) − 1)/(t−s) ≥ lim_{t↓s} (exp(∫_s^t µii(u)du) − 1)/(t−s)
                               = d/dt exp(∫_s^t µii(u)du)|_{t=s} = µii(s),   (4.15)


if the limit on the left hand side exists. For k ≠ i, we have

pik(s,t) ≥ Pr(Rs ≤ t−s, X_{Rs+s} = k, R_{Rs+s} ≥ t−s−Rs | Xs = i)
         = ∫_0^{t−s} Pr(R_{Rs+s} ≥ t−s−r | Rs = r, X_{Rs+s} = k, Xs = i) f_{Rs,X_{Rs+s}|Xs}(r,k|i)dr
         = ∫_0^{t−s} Pr(R_{r+s} ≥ t−s−r | X_{r+s} = k) f_{Rs,X_{Rs+s}|Xs}(r,k|i)dr
         = ∫_0^{t−s} p̄kk(s+r,t) f_{Rs,X_{Rs+s}|Xs}(r,k|i)dr
         = ∫_0^{t−s} exp(∫_{s+r}^t µkk(u)du) µik(r+s) exp(∫_s^{s+r} µii(u)du) dr,

where we used the Markov property in the third line and (4.7) and (4.12) in the last line. Hence, taking (right-)derivatives in t on both sides and putting t = s leads to

lim_{t↓s} pik(s,t)/(t−s) ≥ µik(s),  (4.16)

if the limit on the left hand side exists. Since P(s,t) and Q(s) have row sums equal to one and zero respectively, we have, with δik the (i,k)th entry of the identity matrix I,

lim_{t↓s} ∑_{k=1}^d (pik(s,t) − δik)/(t−s) = 0 = ∑_{k=1}^d µik(s).

Using the above, one can show that the limits in (4.15) and (4.16) exist and that the inequalities in (4.15) and (4.16) have to be equalities. Hence (4.13) holds. Using the same arguments one can see that (4.14) holds as well.

4.3 Further remarks in the time homogeneous case

In this section we consider the time homogeneous case and see how the results of the previous section simplify. To this end, let Q = (µij)_{i,j=1}^d be a Q-matrix and let X be the time homogeneous MJP generated by Q. We write µi := −µii = ∑_{j≠i} µij for the total transition rate out of state i. We have the following result regarding the holding times H(1), H(2), ... and the states XJ(1), XJ(2), ... of the process at the jump times.

Theorem 4.10. Let X be the time homogeneous Markov jump process generated by the Q-matrix Q = (µij)_{i,j=1}^d. Let n ≥ 1 and suppose J(n−1) < ∞; here we set J(0) = 0. Then we have the following.

(i) Suppose µi > 0. Then given XJ(n−1) = i, the holding time H(n) is

(a) exponentially distributed with parameter µi and

(b) conditionally independent of H(1), H(2), ..., H(n−1) and XJ(0), XJ(1), ..., XJ(n−2).

(ii) Suppose µi > 0. Then given XJ(n−1) = i, the state XJ(n) of the process at the next jump time is

(a) equal to k ≠ i with probability µik/µi and

(b) conditionally independent of H(1), H(2), ..., H(n) and XJ(0), XJ(1), ..., XJ(n−2).

(iii) Suppose µi = 0. Then given XJ(n−1) = i, we have J(n) = H(n) = ∞ and Xt = i for all t ≥ J(n−1) with probability 1.

Proof. From step (i) of the construction of an MJP at the beginning of Section 4.2 we have, given J(n−1) = s and XJ(n−1) = i, that H(n) has a hazard rate given by µi. So the hazard rate is constant and, further, does not depend on s. Hence in the time homogeneous case, given XJ(n−1) = i, the nth holding time H(n) does not depend on J(n−1) and is exponentially distributed with parameter µi, provided µi > 0. If µi = 0, then H(n) = ∞ with probability one and so the MJP is absorbed in state i. This explains parts (i) and (iii). Regarding (ii), we have from step (ii) of the construction of an MJP that, given J(n) = t < ∞ and XJ(n−1) = i (note that J(n) = ∞ if and only if µi = 0), the MJP jumps to state k ≠ i with probability µik/µi. Since this probability does not depend on t, it follows that given XJ(n−1) = i, the distribution of XJ(n) is independent of the jump time J(n).

We have a very similar story for residual holding times.

Theorem 4.11. Let X be the time homogeneous Markov jump process generated by the Q-matrix Q = (µij)_{i,j=1}^d and let s ≥ 0. Then we have the following.

(i) Suppose µi > 0. Then given Xs = i, the residual holding time Rs is exponentially distributed with parameter µi and does not depend on s.

(ii) Suppose µi > 0. Then given Xs = i, the state X_{s+Rs} at the next jump time after s is independent of Rs and s and is equal to k ≠ i with probability µik/µi.

(iii) Suppose µi = 0. Then given Xs = i, we have Rs = ∞ and Xt = i for all t ≥ s with probability 1.

Proof. Follows directly from Theorems 4.6 and 4.7.
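Theorems 4.10 and 4.11 give the standard recipe for simulating a homogeneous MJP: exponential holding times plus the jump chain. A sketch with an illustrative 3-state generator (not from the notes), checking the mean holding time in the initial state:

```python
import random

# By Theorem 4.10 the holding time in state i is Exp(mu_i) and the next
# state is k with probability mu_ik / mu_i. Q below is illustrative.
Q = [[-0.5, 0.3, 0.2],
     [0.4, -0.6, 0.2],
     [0.0, 0.0, 0.0]]   # state 2 absorbing

def simulate(i0, rng):
    """Return (holding times, visited states) until absorption."""
    holds, states, i = [], [i0], i0
    while -Q[i][i] > 0:
        mu_i = -Q[i][i]
        holds.append(rng.expovariate(mu_i))   # Exp(mu_i) holding time
        # pick the next state with probability mu_ik / mu_i
        r, acc = rng.random() * mu_i, 0.0
        for k in range(3):
            if k != i:
                acc += Q[i][k]
                if r < acc:
                    i = k
                    break
        states.append(i)
    return holds, states

rng = random.Random(5)
first_holds = []
for _ in range(50_000):
    holds, states = simulate(0, rng)
    first_holds.append(holds[0])
# mean holding time in state 0 should be close to 1 / mu_0 = 1 / 0.5 = 2
assert abs(sum(first_holds) / len(first_holds) - 2.0) < 0.05
```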

Example 4.12. Consider the following time homogeneous health-sickness-death model with the addition of a 'terminally ill' state k. The transition rates are as given below.


[Diagram: transition rates h → i: 0.02, h → d: 0.05, i → h: 1.00, i → k: 0.15, i → d: 0.05, k → d: 0.40; d:dead is absorbing.]

(i) Given that the MJP is in state i:ill at some time, calculate the expected residual holding time.

Answer: Note that because of time homogeneity we can assume without loss of generality that the MJP is in state ill at time 0. By Theorem 4.11(i), R0 is exponentially distributed with parameter µi = µih + µik + µid = 1.2. As the expectation of an exponential random variable with parameter α is 1/α, we have

E[R0 | X0 = ill] = 1/µi = 1/1.2 ≈ 0.833.

(ii) Calculate the probability that an ill life goes to state d:dead when he/she leaves the i:ill state.

Answer: By Theorem 4.11(ii), we have

Pr(X_{R0} = dead | X0 = ill) = µid/µi = 0.05/1.2 ≈ 0.042.

(iii) Calculate the probability that a healthy life becomes ill and then dies without going to the terminally ill state or back to the healthy state.

Answer: We have

Pr(XJ(2) = dead, XJ(1) = ill | X0 = healthy)
= Pr(XJ(2) = dead | X0 = healthy, XJ(1) = ill) Pr(XJ(1) = ill | X0 = healthy)
= Pr(XJ(2) = dead | XJ(1) = ill) Pr(XJ(1) = ill | X0 = healthy)
= (µid/µi)(µhi/µh) = (0.05/1.2) × (0.02/0.07) ≈ 0.012,

where we used the definition of conditional probability in the first equality, the fact that given XJ(1), the random variable XJ(2) is independent of X0 = XJ(0) in the second equality (i.e. Theorem 4.10(ii)(b)), and Theorem 4.10(ii)(a) in the third.


(iv) Calculate the probability that a life in the sick state dies without ever recovering to a healthy state.

Answer: Note that if the person gets terminally ill (the killing state k), this person will die without recovering to a healthy state. Hence the required probability is the sum of the probability of going from sick to dead plus the probability of going from sick to terminally ill, i.e. by Theorem 4.11(ii) (or Theorem 4.10(ii)(a)),

Pr(X_{R_0} ∈ {k, d} | X_0 = ill) = Pr(X_{R_0} = k | X_0 = ill) + Pr(X_{R_0} = d | X_0 = ill) = 0.15/1.2 + 0.05/1.2 ≈ 0.167.

Note that R_0 = H(1) = J(1).
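The numerical answers in Example 4.12 are a handful of arithmetic operations on the transition rates; as a quick check, here they are reproduced in Python. The rates are those of the model (μ_kd = 0.40 is read off the diagram and is not needed for any of the four answers):

```python
# Transition rates of the health-sickness-death model of Example 4.12
mu = {("h", "i"): 0.02, ("h", "d"): 0.05,
      ("i", "h"): 1.00, ("i", "k"): 0.15, ("i", "d"): 0.05,
      ("k", "d"): 0.40}

def total_rate(state):
    """mu_j: the sum of the rates out of `state`."""
    return sum(rate for (j, _), rate in mu.items() if j == state)

mu_i = total_rate("i")                        # = 1.2
expected_holding = 1 / mu_i                   # (i): E[R0 | X0 = ill]
p_ill_to_dead = mu[("i", "d")] / mu_i         # (ii)
p_part_iii = (mu[("i", "d")] / mu_i) * (mu[("h", "i")] / total_rate("h"))  # (iii)
p_no_recovery = (mu[("i", "k")] + mu[("i", "d")]) / mu_i                   # (iv)
```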


Chapter 5

Estimation of transition rates for time homogeneous Markov jump processes

Typically in practice when we want to use a model involving an MJP we do not know in advance what the values of the transition intensities are or should be. So then we can try to estimate these transition rates given some data. In this chapter we look at estimating the (constant) transition rates of time homogeneous Markov jump processes via maximum likelihood in the case where we observe several independent sample paths of the MJP during a given period. In particular we will find the maximum likelihood estimators/estimates (mles) of the transition intensities and find their asymptotic variance, allowing us, for instance, to construct asymptotic confidence intervals for the transition rates. In the last section of this chapter we see how one can use the likelihood ratio test to perform some quite sophisticated hypothesis tests that can give some indication about whether or not a time homogeneous MJP is really an appropriate model for a particular application.

5.1 Maximum likelihood for a time homogeneous MJP

Let X = {X_t : t ≥ 0} be a time homogeneous MJP with state space {1, . . . , d} and generated by the Q-matrix Q = (μ_jk)_{j,k=1}^d. Our aim here is to estimate the transition intensities μ_jk for j ≠ k, i.e. the off-diagonal elements of Q. There is no need to estimate the diagonal elements μ_jj at the same time since μ_jj = −∑_{k=1,k≠j}^d μ_jk (recall that the row sums of a Q-matrix are 0) and so we can estimate μ_jj after we have estimated the off-diagonal elements μ_jk, j ≠ k.

Regarding the type of data we have, we assume we observe n independent ‘individuals’ all following the same MJP X and that individual i is observed from time 0 onwards until time c(i), where 0 < c(i) < ∞ is a deterministic quantity referred to as the censoring time of individual i. So for each individual we observe X_t, its status at time t, for all t ∈ [0, c(i)]. Note that by time homogeneity of X we can assume without loss of generality that all individuals are observed from time 0 onwards and further that it is natural to have a finite period of observation for each individual. Since we are only interested in estimating the transition rates and not the initial distribution of the Markov chain we can assume that

49

Page 53: Ronnie Loe en November 22, 2019 - University of Manchester · 4 2 1 0 1 2 3 4 5 6 7 3 5 6 X X X X X X X t Xt 4 2 1 0 1 2 3 4 5 6 7 3 5 6 X X X X X X X t Xt Figure 1.1: Two possible

we know the starting location X_0 of each individual in advance (i.e. before we start the observations).

We now try to find the likelihood of the unknown transition intensities given that we observe X_t, t ∈ (0, c], for some 0 < c < ∞ and assuming that we know the starting point X_0 in advance. Let m ≥ 1 be such that J(m−1) ≤ c < J(m). Observing (the exact value of) X_t for t ∈ (0, c] is equivalent to observing the exact value of X_{J(1)}, . . . , X_{J(m−1)}, H(1), . . . , H(m−1) and observing a lower bound of H(m), cf. Figure 3.1.¹ So we obtain the data

H(1) = h_1, X_{J(1)} = k_1, H(2) = h_2, X_{J(2)} = k_2, . . . , H(m−1) = h_{m−1}, X_{J(m−1)} = k_{m−1}, H(m) > h_m,

for some specific values h_1, . . . , h_m ≥ 0 satisfying ∑_{j=1}^m h_j = c and for some specific states k_1, . . . , k_{m−1}. Note that this second representation of the data only involves finitely many random variables and so this is the more useful representation of the data for determining the likelihood. However, we still have the following three complications in comparison to the standard case where we observe a number of i.i.d. random variables:

• the random variables X_{J(1)}, . . . , X_{J(m−1)}, H(1), . . . , H(m−1), H(m) that we observe are not independent;

• some of the random variables that we observe are discrete whereas others are continuous random variables;

• due to the censoring we do not observe H(m) completely/exactly but we only observe a lower bound, i.e. we have an incomplete observation for H(m).

Regarding the first issue, although the random variables are not independent, there is a conditional independence structure (recall Theorem 4.10) and we will use this later on. In order to deal with having a collection of discrete and continuous random variables we use the following notation and terminology. Let Y_1, . . . , Y_q be a collection of real-valued random variables, where q ≥ 1. By the function

(y_1, . . . , y_q) ↦ f(Y_1 = y_1, . . . , Y_q = y_q)

we denote the (joint) probability function of (Y_1, . . . , Y_q). If e.g. Y_1, . . . , Y_r are discrete random variables and Y_{r+1}, . . . , Y_q are continuous random variables, where 0 ≤ r ≤ q, then there is the following relation between the probability function and the (joint) cumulative distribution function:

Pr(Y_1 ≤ y_1, . . . , Y_q ≤ y_q) = ∑_{x_1 ≤ y_1} · · · ∑_{x_r ≤ y_r} ∫_{−∞}^{y_{r+1}} · · · ∫_{−∞}^{y_q} f(Y_1 = x_1, . . . , Y_q = x_q) dx_q · · · dx_{r+1}.

¹Note that when considering the sample path in Figure 3.1 restricted to the time interval [0, c] where c = s, then m = 3 and we observe X_{J(1)}, X_{J(2)}, H(1), H(2) exactly but only a lower bound for H(3).


Note that if the random variables Y_1, . . . , Y_q are all discrete, then f(Y_1 = y_1, . . . , Y_q = y_q) is a (joint) probability mass function, whereas if Y_1, . . . , Y_q are all continuous random variables, then f(Y_1 = y_1, . . . , Y_q = y_q) is a (joint) probability density function. Further, with Z_1, . . . , Z_p another collection of real-valued random variables, where p ≥ 1, we denote by

f(Y_1 = y_1, . . . , Y_q = y_q | Z_1 = z_1, . . . , Z_p = z_p) = f(Y_1 = y_1, . . . , Y_q = y_q, Z_1 = z_1, . . . , Z_p = z_p) / f(Z_1 = z_1, . . . , Z_p = z_p)

the conditional (joint) probability function of (Y_1, . . . , Y_q) given (Z_1 = z_1, . . . , Z_p = z_p). With the notation f(·) for a probability function and f(· | ·) for a conditional probability function, the likelihood L of the transition intensities μ_ij, i ≠ j, given the data/observations H(1) = h_1, X_{J(1)} = k_1, . . . , H(m−1) = h_{m−1}, X_{J(m−1)} = k_{m−1} and H(m) > h_m and assuming that we know X_0 = k_0 in advance, is given by

L = f(H(1) = h_1, X_{J(1)} = k_1, . . . , H(m−1) = h_{m−1}, X_{J(m−1)} = k_{m−1}, 1_{H(m)>h_m} = 1 | X_{J(0)} = k_0).

Note that the incomplete observation H(m) > h_m of the continuous random variable H(m) is expressed as the exact/complete observation 1_{H(m)>h_m} = 1 of the discrete random variable 1_{H(m)>h_m}. The following lemma simplifies the above expression of the likelihood L by using Theorem 4.10.

Lemma 5.1. We have

L = (∏_{l=1}^{m−1} f(H(l) = h_l | X_{J(l−1)} = k_{l−1}) f(X_{J(l)} = k_l | X_{J(l−1)} = k_{l−1})) Pr(H(m) > h_m | X_{J(m−1)} = k_{m−1})

= (∏_{l=1}^{m−1} μ_{k_{l−1}} exp(−μ_{k_{l−1}} h_l) · μ_{k_{l−1}k_l}/μ_{k_{l−1}}) exp(−μ_{k_{m−1}} h_m)

= (∏_{l=1}^{m−1} exp(−μ_{k_{l−1}} h_l) μ_{k_{l−1}k_l}) exp(−μ_{k_{m−1}} h_m)

= (∏_{l=1}^{m−1} μ_{k_{l−1}k_l}) (∏_{l=1}^{m} exp(−μ_{k_{l−1}} h_l)),    (5.1)

where we understand that ∏_{l=1}^{0} · = 1.

Proof. The second equality follows directly from parts (i)(a) and (ii)(a) of Theorem 4.10. The third and fourth equalities are obvious. So it is left to prove the first equality. By the definition of a conditional probability function and part (i)(b) of Theorem 4.10,

f(H(1) = h_1, X_{J(1)} = k_1, . . . , H(m−1) = h_{m−1}, X_{J(m−1)} = k_{m−1}, 1_{H(m)>h_m} = 1 | X_{J(0)} = k_0)
= f(H(1) = h_1, X_{J(1)} = k_1, . . . , H(m−1) = h_{m−1}, X_{J(m−1)} = k_{m−1} | X_{J(0)} = k_0)
 × f(1_{H(m)>h_m} = 1 | X_{J(0)} = k_0, H(1) = h_1, X_{J(1)} = k_1, . . . , H(m−1) = h_{m−1}, X_{J(m−1)} = k_{m−1})
= f(H(1) = h_1, X_{J(1)} = k_1, . . . , H(m−1) = h_{m−1}, X_{J(m−1)} = k_{m−1} | X_{J(0)} = k_0)
 × f(1_{H(m)>h_m} = 1 | X_{J(m−1)} = k_{m−1}),


where we note that

f(1_{H(m)>h_m} = 1 | X_{J(m−1)} = k_{m−1}) = Pr(1_{H(m)>h_m} = 1 | X_{J(m−1)} = k_{m−1}) = Pr(H(m) > h_m | X_{J(m−1)} = k_{m−1}).

Similarly, by the definition of a conditional probability function and parts (i)(b) and (ii)(b) of Theorem 4.10,

f(H(1) = h_1, X_{J(1)} = k_1, . . . , H(m−1) = h_{m−1}, X_{J(m−1)} = k_{m−1} | X_{J(0)} = k_0)
= f(H(1) = h_1, X_{J(1)} = k_1, . . . , H(m−2) = h_{m−2}, X_{J(m−2)} = k_{m−2} | X_{J(0)} = k_0)
 × f(H(m−1) = h_{m−1} | X_{J(0)} = k_0, . . . , H(m−2) = h_{m−2}, X_{J(m−2)} = k_{m−2})
 × f(X_{J(m−1)} = k_{m−1} | X_{J(0)} = k_0, . . . , H(m−2) = h_{m−2}, X_{J(m−2)} = k_{m−2}, H(m−1) = h_{m−1})
= f(H(1) = h_1, X_{J(1)} = k_1, . . . , H(m−2) = h_{m−2}, X_{J(m−2)} = k_{m−2} | X_{J(0)} = k_0)
 × f(H(m−1) = h_{m−1} | X_{J(m−2)} = k_{m−2})
 × f(X_{J(m−1)} = k_{m−1} | X_{J(m−2)} = k_{m−2}).

By repeating the last argument over and over and then combining all computations, we obtain the first equality in the statement of the lemma.

Recalling that k_0, k_1, . . . , k_{m−1} are the consecutive states that we observed the process to visit, the first factor on the right hand side of (5.1) can be written as

∏_{l=1}^{m−1} μ_{k_{l−1}k_l} = ∏_{j=1}^{d} ∏_{k=1,k≠j}^{d} μ_{jk}^{δ_{jk}},

where δ_{jk} ≥ 0 denotes the observed number of transitions from state j to state k by the MJP. Similarly, since h_1, . . . , h_m are the consecutive amounts of time that the MJP is observed to stay in the same state, we can write the second factor on the right hand side of (5.1) as

∏_{l=1}^{m} exp(−μ_{k_{l−1}} h_l) = exp(−∑_{l=1}^{m} μ_{k_{l−1}} h_l) = exp(−∑_{j=1}^{d} μ_j w_j),

where w_j ≥ 0 denotes the observed total amount of time the MJP has spent in state j. Combining these two points gives the following expression for the likelihood L of the transition intensities μ_{jk}, j ≠ k, given that we observe X_t during a finite time interval,

L = (∏_{j=1}^{d} ∏_{k=1,k≠j}^{d} μ_{jk}^{δ_{jk}}) exp(−∑_{j=1}^{d} μ_j w_j),

where, to recall, δ_{jk} denotes the number of observed transitions from state j to state k of the MJP and w_j is the total amount of time the MJP was observed to be in state j. From the above formula for the likelihood, we see that what is important are the number of transitions from a state j to another state k and the total time spent in a state j, but what is not


important is the exact order of the transitions or the length of each separate visit in a state j.
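The observation that only the transition counts δ_{jk} and the total holding times w_j matter can be made concrete. The Python sketch below (the path encoding as a list of visited states plus observed holding times, with the last one censored, is an illustrative choice) extracts these sufficient statistics and evaluates the log of the likelihood formula above:

```python
import math
from collections import defaultdict

def sufficient_stats(states, holding_times):
    """Given the consecutive observed states k0, ..., k_{m-1} and the observed
    holding times h1, ..., hm (the last one censored), return the transition
    counts delta[j, k] and the total times w[j] spent in each state."""
    assert len(states) == len(holding_times)
    delta, w = defaultdict(int), defaultdict(float)
    for l, (j, h) in enumerate(zip(states, holding_times)):
        w[j] += h
        if l + 1 < len(states):      # the last holding time ends by censoring
            delta[j, states[l + 1]] += 1
    return dict(delta), dict(w)

def log_likelihood(mu, delta, w):
    """log L = sum_jk delta_jk log mu_jk - sum_j mu_j w_j, where mu is a dict
    of off-diagonal rates mu[j, k]."""
    ll = sum(d * math.log(mu[jk]) for jk, d in delta.items())
    ll -= sum(sum(r for (j, _), r in mu.items() if j == state) * t
              for state, t in w.items())
    return ll

# One censored path: healthy for 2.0, sick for 1.5, healthy for 0.5 (censored)
delta, w = sufficient_stats(["h", "s", "h"], [2.0, 1.5, 0.5])
# delta == {("h", "s"): 1, ("s", "h"): 1}, w == {"h": 2.5, "s": 1.5}
```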

We have just determined the likelihood given that we observe one partial sample path of the MJP. But recall that we assumed in the beginning that we observe n independent (partial) sample paths corresponding to n individuals. We should of course combine all observations into one likelihood, which we (again) denote by L, but this is easy to do. Namely, individual i's contribution to the likelihood L is given by

L(i) = (∏_{j=1}^{d} ∏_{k=1,k≠j}^{d} μ_{jk}^{δ_{jk}^{(i)}}) exp(−∑_{j=1}^{d} μ_j w_j^{(i)}),

where δ_{jk}^{(i)} denotes the observed number of transitions from state j to state k of individual i and w_j^{(i)} denotes the total observed amount of time individual i spent in state j. Then the likelihood L given the data of all individuals is, due to the assumed independence of the individuals, given by

L = ∏_{i=1}^{n} L(i) = (∏_{j=1}^{d} ∏_{k=1,k≠j}^{d} μ_{jk}^{δ_{jk}}) exp(−∑_{j=1}^{d} μ_j w_j),

where δ_{jk} := ∑_{i=1}^{n} δ_{jk}^{(i)} is the total observed number of transitions from state j to state k of all individuals combined and w_j := ∑_{i=1}^{n} w_j^{(i)} is the total amount of time spent in state j by all individuals combined.

Example 5.2. We consider the three state time homogeneous healthy-sick-dead (hsd) Markov jump process with constant transition rates as indicated in the diagram below:

[Transition-rate diagram: μ_hs = α, μ_sh = β, μ_hd = μ, μ_sd = ν; there are no transitions out of d:dead.]

Our aim is to estimate the transition rates α, β, μ and ν given that we have observed two (independent) individuals during some time period and that the observations of individuals 1 and 2 are as in Figure 5.1 and Figure 5.2. Note that here we do not try to estimate μ_dh or μ_ds but just make the logical assumption that they are both 0.


[Figure 5.1: The evolution of the health of individual 1 over the time interval [0, c(1)]: healthy during periods of length τ_1 and τ_3, sick during periods of length τ_2 and τ_4, and still sick at the censoring time c(1).]

[Figure 5.2: The evolution of the health of individual 2 over the time interval [0, c(2)]: healthy during a period of length t_1, then sick during a period of length t_2, then dead for the remaining time t_3 up to the censoring time c(2).]


Then individual 1's contribution to the likelihood equals

L(1) = α² β exp(−(α + μ)(τ_1 + τ_3) − (β + ν)(τ_2 + τ_4))

and individual 2's contribution to the likelihood equals

L(2) = α ν exp(−(α + μ) t_1 − (β + ν) t_2 − 0 · t_3).

Hence the (total) likelihood is given by

L = L(1) × L(2) = α³ β ν exp(−(α + μ)(τ_1 + τ_3 + t_1) − (β + ν)(τ_2 + τ_4 + t_2)).

Note that δ_hs = 3, δ_hd = 0, δ_sh = 1, δ_sd = 1, w_h = τ_1 + τ_3 + t_1, w_s = τ_2 + τ_4 + t_2.

Now that we have a fairly simple explicit expression of the likelihood, it is not hard to maximise it in order to find the mles of the transition intensities and their asymptotic variances. This is done in the following theorem.

Theorem 5.3. For a time homogeneous Markov jump process X = {X_t : t ≥ 0} with state space {1, . . . , d} and transition intensities μ_{jk}, suppose we observe n independent individuals during some finite period of time. Then the likelihood of the transition rates μ_{jk}, j ≠ k, is given by

L = (∏_{j=1}^{d} ∏_{k=1,k≠j}^{d} μ_{jk}^{δ_{jk}}) exp(−∑_{j=1}^{d} μ_j w_j),

where μ_j := ∑_{k=1,k≠j}^{d} μ_{jk}, δ_{jk} is the total observed number of transitions from state j to state k of all individuals combined and w_j := ∑_{i=1}^{n} w_j^{(i)} is the total amount of time spent in state j by all individuals combined. The maximum likelihood estimate (mle) of μ_{jk}, for j ≠ k, is given by

μ̂_{jk} = δ_{jk}/w_j.

Further, under some conditions (which we do not go into), we have that the maximum likelihood estimator μ̂_{jk}, seen as a random variable, is asymptotically (i.e. as the sample size n tends to infinity) normally distributed with mean μ_{jk} and asymptotic variance

δ_{jk}/w_j².

Proof. We have already proven the given expression for the likelihood. Taking the logarithm of L leads to the following log-likelihood:

ℓ := log L = ∑_{j=1}^{d} ∑_{k=1,k≠j}^{d} δ_{jk} log(μ_{jk}) − ∑_{j=1}^{d} μ_j w_j = ∑_{j=1}^{d} ∑_{k=1,k≠j}^{d} (δ_{jk} log(μ_{jk}) − μ_{jk} w_j).


We see that we can write ℓ as

ℓ = ∑_{j=1}^{d} ∑_{k=1,k≠j}^{d} ℓ_{jk}(μ_{jk}),    (5.2)

where ℓ_{jk}(μ_{jk}) = δ_{jk} log(μ_{jk}) − μ_{jk} w_j is a function that depends on only one of the transition intensities that we want to estimate. In order to find the maximum likelihood estimates of the transition rates μ_{jk}, j ≠ k, we need to maximise L or equivalently ℓ with respect to in total d(d − 1) parameters. But from the decomposition (5.2), we see that we can reduce this single, seemingly complex maximisation problem to d(d − 1) simple maximisation problems, namely for each j ≠ k we need to maximise ℓ_{jk} with respect to one parameter only. We have

ℓ′_{jk}(μ_{jk}) = δ_{jk}/μ_{jk} − w_j,   ℓ″_{jk}(μ_{jk}) = −δ_{jk}/μ_{jk}²,

and we easily see that μ_{jk} = δ_{jk}/w_j is the unique maximiser of ℓ_{jk}. Hence μ̂_{jk} = δ_{jk}/w_j is the mle of μ_{jk} (for j ≠ k). From the theory on maximum likelihood and due to the simple structure of our log-likelihood (recall (5.2)) it is known that, under some conditions, we have, for each j ≠ k, that the mle μ̂_{jk} (seen as a random variable, i.e. seeing the data δ_{jk} and w_j as random variables) is asymptotically normally distributed with mean given by the true value of the unknown transition intensity μ_{jk} and asymptotic variance given by the inverse of the observed Fisher information associated with ℓ_{jk} and evaluated at the mle, i.e.

−(ℓ″_{jk}(μ_{jk}))^{−1} |_{μ_{jk} = μ̂_{jk}} = −1/ℓ″_{jk}(μ̂_{jk}) = μ̂_{jk}²/δ_{jk} = δ_{jk}/w_j².
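In code, the estimation step of Theorem 5.3 is one division per ordered pair of observed states. A small Python sketch; the counts and exposure times below are illustrative, not from the notes:

```python
import math

def mjp_mles(delta, w):
    """For each observed transition (j, k), return the mle mu_jk = delta_jk / w_j
    together with its asymptotic standard deviation sqrt(delta_jk) / w_j."""
    out = {}
    for (j, k), d in delta.items():
        out[j, k] = (d / w[j], math.sqrt(d) / w[j])
    return out

# Illustrative pooled data: e.g. 12 h->s transitions during 100 time units healthy
delta = {("h", "s"): 12, ("s", "h"): 7, ("h", "d"): 3, ("s", "d"): 4}
w = {"h": 100.0, "s": 40.0}
est = mjp_mles(delta, w)
# est["h", "s"] == (0.12, sqrt(12)/100)
```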

Remark 5.4. Note that the results in Theorem 5.3 still apply if we do not want to estimate all the transition intensities but only estimate a subset of them while setting some default values for the remaining transition rates. We had this situation in Example 5.2 where we did not estimate μ_dh and μ_ds but set them equal to 0.

Let us consider the setting of the above theorem and let us fix a j ∈ {1, . . . , d} and k ∈ {1, . . . , d}\{j}. The asymptotic normality of the mle μ̂_{jk} allows us to say something about the accuracy of this estimator. In particular, we can derive explicit asymptotic confidence intervals for the unknown transition intensity μ_{jk}. Namely, with α ∈ (0, 1) and z_α ∈ ℝ the number such that Pr(N(0, 1) ≤ z_α) = 1 − α, where N(μ, σ²) denotes a random variable that is normally distributed with mean μ and variance σ², we have that

μ̂_{jk} ± z_{α/2} √(δ_{jk}/w_j²) = μ̂_{jk} ± z_{α/2} √δ_{jk}/w_j    (5.3)

is a confidence interval of asymptotic level 1 − α for μ_{jk}; note here that we use the notation a ± b := [a − b, a + b]. By a confidence interval of asymptotic level 1 − α we mean that the


confidence interval, seen as a random interval (which, note, depends on n), contains the true value of the parameter μ_{jk} with a probability that converges to 1 − α as the sample size n goes to infinity, i.e.

lim_{n→∞} Pr(μ_{jk} ∈ μ̂_{jk} ± z_{α/2} √δ_{jk}/w_j) = 1 − α.

The confidence interval (5.3) for μ_{jk} allows us to easily carry out the so-called Wald test for testing

H_0 : μ_{jk} = μ versus H_A : μ_{jk} ≠ μ,    (5.4)

where μ > 0 is a given number. Namely, according to the Wald test we reject H_0 at significance level α if and only if the number μ does not lie in the confidence interval (5.3).
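The interval (5.3) and the Wald test take only a few lines to implement. In the Python sketch below the value z_{α/2} ≈ 1.96 for α = 0.05 is hardcoded (it would normally come from normal tables), and the counts are illustrative:

```python
import math

def wald_ci(delta_jk, w_j, z=1.959964):
    """Asymptotic confidence interval (5.3): mle +/- z * sqrt(delta_jk) / w_j."""
    mle = delta_jk / w_j
    half_width = z * math.sqrt(delta_jk) / w_j
    return mle - half_width, mle + half_width

def wald_test(delta_jk, w_j, mu0, z=1.959964):
    """Reject H0: mu_jk = mu0 iff mu0 lies outside the confidence interval."""
    lo, hi = wald_ci(delta_jk, w_j, z)
    return not (lo <= mu0 <= hi)

# 12 observed h->s transitions over 100 time units spent in the healthy state
lo, hi = wald_ci(12, 100.0)                 # roughly (0.052, 0.188)
reject = wald_test(12, 100.0, mu0=0.5)      # 0.5 lies far above the interval
```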

Once we have estimated the transition rates, we can use these estimates to estimate other quantities that depend on the transition intensities. For instance, consider the probability

p_jj(t) = Pr(X_u = j for all u ∈ [0, t] | X_0 = j) = exp(−t μ_j) = exp(−t ∑_{k≠j} μ_{jk}),    (5.5)

where t > 0 is given. By the so-called invariance property/principle of mles, we have that the mle of p_jj(t) is simply given by

p̂_jj(t) = exp(−t ∑_{k≠j} μ̂_{jk}),

i.e. we replace the unknown values of the transition intensities in (5.5) by their mles. Asymptotic normality results can be found for p̂_jj(t) by using the delta method, but we will not pursue this.

5.2 The likelihood ratio test: a tool for model selection

In the previous section we saw how we could perform the Wald test for testing that a certain transition intensity is equal to some given value, see (5.4). In this section we are interested in how we can test some more fundamental assumptions of a given Markov jump process. For instance, if we have decided to work with a time homogeneous MJP, we might want to test if this time homogeneity assumption is appropriate or not. Another example could be to test whether or not in the healthy-sick-dead model of Example 5.2 the transition intensity from healthy to dead is equal to the one from sick to dead. We might even want to test whether the Markov property is an appropriate assumption for the particular application that we have in mind. Note that in each case the question is whether we can work with a particular (relatively) simple model for a given application or if we should consider a more complex one. Such model selection questions cannot be tackled by performing tests of the


form (5.4) but lead to more complex hypothesis tests. In this section we explain how the (generalised) likelihood ratio test can be employed to deal with these types of model selection questions involving Markov jump processes. We start by describing the likelihood ratio test itself.

Likelihood ratio test for nested models Assume we are given a whole class of models parametrised by a vector ~θ. This class of models we refer to as the full model. Suppose we want to test whether our data comes from a specific model (or subclass of models) in this class, which we refer to as the null model. Since the null model is a particular example of the full model, we can say that the null model is nested in the full model. Let Θ be the set of all possible values that ~θ can take in the full model and let Θ_0 ⊂ Θ be the set of possible values of ~θ that corresponds to the null model. So we want to test

H_0 : ~θ ∈ Θ_0, H_A : ~θ ∉ Θ_0.

In the (generalised) likelihood ratio test we use the test statistic

Λ = max_{~θ ∈ Θ_0} L(~θ) / max_{~θ ∈ Θ} L(~θ),

where L is the likelihood of ~θ given the data. So Λ is the ratio of the maximum likelihood under the null model and the maximum likelihood of the full model. If the null model is an appropriate model for the data, then we expect that we cannot increase the maximum likelihood by much if we consider the full model instead and so the test statistic Λ should in this case be close to 1. On the other hand, if the null model is completely inappropriate, then we expect Λ to be close to 0. So by this reasoning we should reject H_0 if Λ is too small. Next we make precise what is meant with ‘Λ being too small’ in the likelihood ratio test. Wilks’ theorem says that under H_0, assuming some regularity conditions (which we will not specify) and when the sample size is large enough, −2 log Λ is approximately chi-squared distributed with degrees of freedom (dof) q given by dim(Θ) − dim(Θ_0), i.e. q is the difference of the dimension of Θ and the dimension of Θ_0. Or in other words, q is the difference between the number of free parameters in the full model and the null model. Note that Λ being too small corresponds to −2 log Λ being too large. We conclude that an appropriate test at significance level α to check whether the data comes from the null model is to reject H_0 if −2 log Λ ≥ λ_α, where λ_α > 0 is defined such that

Pr(χ²_q ≥ λ_α) = α,

where χ²_q denotes a chi-squared distribution (or to be more precise, a random variable that is chi-squared distributed) with q degrees of freedom. Note that the critical value λ_α can be obtained from chi-squared tables. Although, strictly speaking, one could conduct a likelihood ratio test in the case where the degrees of freedom q = 0, we are only interested in situations for which q ≥ 1.


Next we discuss three particular applications of the likelihood ratio test associated with the healthy-sick-dead model of Example 5.2. Each time we assume that the data we have is as in the previous sections of this chapter, i.e. we observe a number of independent individuals and we observe their status (i.e. healthy, sick or dead) during a certain time period, where the precise time period might be different for different individuals. These examples will be further worked out in the exercises.

Example 5.5. Suppose we want to test if the transition intensity from healthy to dead is equal to the one from sick to dead, i.e. we want to test H_0 : μ = ν versus H_A : μ ≠ ν. With the vector of parameters being ~θ = (α, β, μ, ν) we have that the parameter space of the full model is Θ = (0, ∞)⁴ whereas the parameter space of the null model is Θ_0 = {(α, β, μ, ν) ∈ (0, ∞)⁴ : μ = ν}. Note that dim(Θ) = 4 and dim(Θ_0) = 3 in this case. With L(α, β, μ, ν) the likelihood of the parameters α, β, μ, ν given the data, we have that the likelihood ratio test statistic is

Λ = max_{~θ ∈ Θ_0} L(~θ) / max_{~θ ∈ Θ} L(~θ) = max_{α,β,μ} L(α, β, μ, μ) / max_{α,β,μ,ν} L(α, β, μ, ν).

Both maximisation problems (i.e. the one in the numerator and in the denominator) are of the form that is covered in Theorem 5.3 and so they are easy to solve. If we are given some real data, then we can compute Λ and compare −2 log Λ with the appropriate quantile of a chi-squared distribution with dim(Θ) − dim(Θ_0) = 1 degree of freedom in order to see whether we should reject H_0 or not at a given significance level.
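Putting the pieces of Example 5.5 together in code: under the full model the mles are as in Theorem 5.3, while under H_0 : μ = ν the same differentiation as in the proof of Theorem 5.3 gives the pooled estimate (δ_hd + δ_sd)/(w_h + w_s) for the common death rate. A Python sketch with illustrative data; the critical value 3.841 is the 95% point of χ²_1 from chi-squared tables:

```python
import math

def loglik(rates, delta, w):
    """log-likelihood of the hsd model: sum delta_jk log(mu_jk) - sum mu_j w_j."""
    ll = sum(delta[jk] * math.log(rates[jk]) for jk in delta)
    for j in w:
        mu_j = sum(r for (a, _), r in rates.items() if a == j)
        ll -= mu_j * w[j]
    return ll

# Illustrative pooled data from the observed individuals
delta = {("h", "s"): 12, ("s", "h"): 7, ("h", "d"): 3, ("s", "d"): 4}
w = {"h": 100.0, "s": 40.0}

# Full model: mles as in Theorem 5.3
full = {jk: delta[jk] / w[jk[0]] for jk in delta}
# Null model (mu = nu): pool the deaths over the total alive time
pooled = (delta["h", "d"] + delta["s", "d"]) / (w["h"] + w["s"])
null = dict(full)
null[("h", "d")] = pooled
null[("s", "d")] = pooled

stat = -2 * (loglik(null, delta, w) - loglik(full, delta, w))  # -2 log Lambda
reject = stat >= 3.841   # 95% quantile of chi-squared with 1 degree of freedom
```

For these particular counts the statistic is about 2.48, so H_0 : μ = ν would not be rejected at the 5% level.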

Example 5.6. Suppose we want to test if the assumption that α, the transition rate from healthy to sick, is constant in time is appropriate given the data. In order to apply the likelihood ratio test we need to make some parametric assumption on the form of this transition rate when it is no longer constant. Here we make the following choice:

α(t) = α_1 if t ∈ [0, 6),  α_2 if t ≥ 6,    (5.6)

where α_1 and α_2 are strictly positive, unknown constants and t is the time parameter. So we assume that α(t) is piecewise constant with the only possible change occurring at time t = 6.² With this assumption in place we want to test H_0 : α_1 = α_2 versus H_A : α_1 ≠ α_2. Of course there are many other assumptions than (5.6) that we could have made here for the parametric form of α(t) (the non-constant transition intensity from healthy to sick). Especially the assumption that it changes at exactly time t = 6 could be criticised (we could instead have assumed that this change point is itself an unknown parameter and try to

²So far we have always assumed that the transition intensities are continuous in time. However, one can construct an MJP with piecewise constant and right-continuous transition intensities via the same procedure as in the beginning of Section 4.2. All the results in Section 4.2 still hold for the MJP with piecewise constant and right-continuous transition intensities, with the exception that the forward equations (3.3) are not satisfied any more at the points of discontinuity of the transition intensities. However, one can show that the transition probabilities still satisfy an integral version of the Kolmogorov forward differential equations.


estimate it), but it turns out that fixing the change point in advance is particularly easy to deal with. In order to compute the likelihood ratio test statistic, we need to determine the maximum likelihood of (α_1, α_2, β, μ, ν) in the full model. Note that we cannot use Theorem 5.3 to determine this likelihood function, since in this theorem the underlying MJP was assumed to be time homogeneous, whereas now we have a time inhomogeneous MJP since the transition intensity from healthy to sick is no longer assumed to be constant. However, using the same principles as in Section 5.1, one can determine also the likelihood under the assumption (5.6), see the exercises.

Example 5.7. Suppose in the healthy-sick-dead model of Example 5.2 we believe that β, the transition rate from sick to healthy, might not be constant but depends on the amount of time a person has currently been sick (i.e. the current holding time). We would then like to test if, for the given data, it is appropriate to remain working with β constant or if instead we should work with a more complex model where the transition rate from sick to healthy is duration-dependent. Although it might perhaps be intuitively clear what is meant with duration dependent transition rates, we have not actually encountered such stochastic processes (which note do not satisfy the Markov property) and a precise definition is therefore given in Appendix A.2. In order to use the likelihood ratio test we need to make a parametric assumption on the form of the transition rate β(w) as a function of duration (i.e. the current holding time) w. Similar to the previous example we assume the parametric form

β(w) = β_1 if w ∈ [0, 2.5),  β_2 if w ≥ 2.5,

where β_1 and β_2 are strictly positive, unknown constants. So the transition rate from sick to healthy changes from β_1 to β_2 each time the length of a period of sickness exceeds 2.5 units of time. Our hypothesis test then takes the form H_0 : β_1 = β_2 versus H_A : β_1 ≠ β_2. Again similar to the previous example, we need to find out what the likelihood function is in the full model as we are outside the setting of Theorem 5.3. This is done in the exercises.


Chapter 6

Estimation of mortality rates using classical actuarial methods

In this chapter we look at a large group/population of people, assume that they all have the same force of mortality (i.e. the hazard functions of the lifetimes of each person are the same) and then estimate the common force of mortality μ(t), for t in a suitable range, using classical actuarial methods. What the actuaries traditionally do is not to estimate the force of mortality μ(t) for all ages t at once, but to split up the range of interest into intervals of the form [x, x+1], where x typically, but not always, is an integer and then estimate the force of mortality for each age interval separately. Such an age interval [x, x+1] is also referred to as a rate interval. In particular, this means that to estimate the force of mortality in the rate interval [x, x+1], only data will be used that correspond to lives that are of exact age between x and x+1. Once all rate intervals have been dealt with separately, the actuaries will then link the different estimates together via a smoothing procedure called graduation. Instead of taking the force of mortality as the main reference point, some classical actuarial methods look instead at what actuaries call annual death rates. The annual death rate at age x is denoted by q_x and is defined as the probability that a person who is aged exactly x years dies within the next year.

The contents of the rest of the chapter are as follows. We first look at estimating the force of mortality corresponding to the rate interval [x, x+1] by using the Poisson model. An important quantity in the estimation of this force of mortality is the so-called exposed to risk and we will see how one can determine or approximate this object. Afterwards we look at the binomial model, which is a way of estimating the annual death rate q_x corresponding to the rate interval [x, x+1]. Finally, we study the concept of graduation and in particular look at some statistical tests for judging whether the graduation procedure has been successful or not.


6.1 Poisson model

We fix an age x (which is typically an integer but not always) and consider the rate interval [x, x+1]. Let us assume that there are in total n people who were observed during a time when they were of exact age between x and x+1. The total amount of time the individuals were observed to be alive while being of exact age in the rate interval [x, x+1] is referred to by actuaries as the (central) exposed to risk at age x and we denote this quantity by w_x.

Under the Poisson model, deaths in the rate interval [x, x+1] are assumed to occur according to a Poisson process (see Exercise ??) with constant rate which we denote by μ_{x+1/2}. This means that under the Poisson model the total number of deaths amongst the n persons during the time they were observed is assumed to follow a Poisson distribution with parameter μ_{x+1/2} w_x, i.e. with Δ_x the random variable indicating the number of deaths at exact age in the rate interval [x, x+1],

Pr(Δ_x = δ_x) = e^{−μ_{x+1/2} w_x} (μ_{x+1/2} w_x)^{δ_x} / δ_x!,   δ_x = 0, 1, 2, . . . .

With this model, we can then use the number of observed deaths to estimate the unknown parameter $\mu_{x+\frac12}$. Indeed, assume that $\delta_x$ deaths are actually observed. Then the likelihood is given by
\[
L(\mu_{x+\frac12}) = \Pr(\Delta_x = \delta_x) = \frac{e^{-\mu_{x+\frac12} w_x} \big(\mu_{x+\frac12} w_x\big)^{\delta_x}}{\delta_x!}.
\]

Hence the log-likelihood equals
\[
\ell(\mu_{x+\frac12}) = \log L(\mu_{x+\frac12}) = \delta_x \log \mu_{x+\frac12} - \mu_{x+\frac12} w_x + \text{constant}
\]

and its derivative is given by
\[
\ell'(\mu_{x+\frac12}) = \frac{\delta_x}{\mu_{x+\frac12}} - w_x.
\]

We have $\ell'(\delta_x/w_x) = 0$ and $\ell''(\mu_{x+\frac12}) = -\delta_x/\mu_{x+\frac12}^2$ (which implies in particular that $\ell''(\delta_x/w_x) < 0$), so we conclude that the maximum likelihood estimate (mle) of $\mu_{x+\frac12}$ is given by
\[
\hat\mu_{x+\frac12} = \frac{\delta_x}{w_x}.
\]

Further, when $\mu_{x+\frac12} w_x$ is very large, which is for instance the case if the sample size $n$ is very large, $\Delta_x$ is approximately $\mathcal{N}(\mu_{x+\frac12} w_x,\, \mu_{x+\frac12} w_x)$ distributed, where $\mathcal{N}(\mu, \sigma^2)$ stands for the normal distribution with mean $\mu$ and variance $\sigma^2$.

The mle in the Poisson model has the same form as in the time homogeneous alive-dead MJP model (in combination with the methods of Chapter 5), though the assumptions underlying the two models are different. Note that there is an unrealistic feature in the Poisson model.


Namely, under the Poisson model the probability of more than $n$ deaths is strictly positive, whereas there are in total only $n$ individuals in the study. However, when $\mu_{x+\frac12}$ is small (as it is likely to be in mortality problems) and $n$ is large, the probability of more than $n$ deaths is very small.
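As a quick illustration of the estimate and its normal approximation, the sketch below (in Python; the numbers are taken to match the age-40 row of Table 6.1, and the confidence interval is a standard add-on based on the normal approximation rather than part of the derivation above) computes the mle, using that $\hat\mu_{x+\frac12} = \Delta_x/w_x$ is then approximately $\mathcal{N}(\mu_{x+\frac12},\, \mu_{x+\frac12}/w_x)$ distributed:

```python
import math

def poisson_mle_mu(deaths, exposed):
    """MLE of the constant force of mortality in the Poisson model:
    mu_hat = deaths / exposed to risk, with an approximate 95% CI from
    the normal approximation mu_hat ~ N(mu, mu / exposed)."""
    mu_hat = deaths / exposed
    se = math.sqrt(mu_hat / exposed)  # plug-in standard error
    return mu_hat, (mu_hat - 1.96 * se, mu_hat + 1.96 * se)

# 62 deaths over an exposed to risk of 67300 person-years (age-40 row of Table 6.1)
mu_hat, ci = poisson_mle_mu(62, 67300)
print(round(mu_hat, 6))  # 0.000921
```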

6.2 Estimating the exposed to risk

In order to compute the mle of $\mu_{x+\frac12}$ in the Poisson model, we need to determine the number of deaths $\delta_x$ and the exposed to risk $w_x$ in the rate interval $[x, x+1]$.¹ In practice there might be some complications in determining these numbers, especially for the exposed to risk $w_x$, and we go over some of these difficulties in this section.

Suppose individuals are observed from time 0 to time $T$ for some $T > 0$ and let $N_x(t)$ denote the total number of individuals that are exposed to risk at time $t$ and are of exact age in $[x, x+1]$ at time $t$. We have the following relation between $w_x$ and $N_x(t)$.

Proposition 6.1. For any $x$,
\[
w_x = \int_0^T N_x(t)\,\mathrm{d}t. \tag{6.1}
\]

Proof. Suppose in total $n$ individuals are observed and let $w_x^{(i)}$ be the total amount of time during $[0, T]$ individual $i$ was observed to be alive/exposed to risk while being of exact age in $[x, x+1]$. Further, define
\[
N_x^{(i)}(t) = \begin{cases} 1 & \text{if individual } i \text{ was exposed to risk and of exact age in } [x, x+1] \text{ at time } t, \\ 0 & \text{otherwise.} \end{cases}
\]
Note that $w_x = \sum_{i=1}^n w_x^{(i)}$ and $N_x(t) = \sum_{i=1}^n N_x^{(i)}(t)$. We have that $w_x^{(i)}$ is the amount of time that the function $t \mapsto N_x^{(i)}(t)$ equals 1 for $t \in [0, T]$, or to be more precise, $w_x^{(i)}$ is the length of the set $\{t \in [0, T] : N_x^{(i)}(t) = 1\}$. Hence, since $N_x^{(i)}(t)$ is either equal to 0 or 1,
\[
w_x^{(i)} = \int_{\{t \in [0,T] :\, N_x^{(i)}(t) = 1\}} 1\,\mathrm{d}t = \int_0^T N_x^{(i)}(t)\,\mathrm{d}t.
\]
Summing over $i$ on both sides gives
\[
w_x = \sum_{i=1}^n w_x^{(i)} = \sum_{i=1}^n \int_0^T N_x^{(i)}(t)\,\mathrm{d}t = \int_0^T \sum_{i=1}^n N_x^{(i)}(t)\,\mathrm{d}t = \int_0^T N_x(t)\,\mathrm{d}t.
\]

¹Note that actuaries tend to denote $\delta_x$ by $d_x$ and $w_x$ by $E^c_x$.


For determining the exposed to risk, one should take into account the so-called principle of correspondence. This principle says that an individual belongs to $N_x(t)$ if and only if he/she is of exact age in $[x, x+1]$ at time $t$ and he/she is observed to be alive at time $t$. To illustrate this principle, consider a study that observes a number of people during a period of length $T = 2$ years and consider a person who was not observed from the start of the study but only joined after six months (i.e. at time $t = 1/2$), when he/she was of exact age 60.5. Then this person would be included in $N_{60}(t)$ for $t \in [1/2, 1]$ provided he/she is alive at time $t$, but would not be included in $N_{60}(t)$ for $t \in (1, 2]$, because his/her exact age at those times does not belong to the rate interval $[60, 61]$, and he/she would also not be included in $N_{60}(t)$ for $t \in [0, 1/2)$, because he/she was not part of the study (and thus was not observed) at those ages.

When we observe the exact times of entrance into and exit from observation of each individual, it is straightforward to determine the exposed to risk. In practice it is not always possible to observe these entrance and exit times. Especially when dealing with large numbers of people this is simply not possible, and what is typically observed instead is $N_x(t)$ for a few values of $t$ only. To deal with this case, let us assume that we observe $N_x(t)$ only for $t = t_0, t_1, \ldots, t_k$, where $t_0 = 0$, $t_k = T$ and $k \ge 1$ is some integer. The set of times $\{t_0, t_1, \ldots, t_k\}$ is referred to as the census dates². Since we do not know $N_x(t)$ for all $t \in [0, T]$, we cannot determine $w_x$ exactly via (6.1). Instead we can use the so-called census approximation, which approximates $N_x(t)$ for $t \notin \{t_0, t_1, \ldots, t_k\}$ via linear interpolation, i.e. we use the approximation

\[
N_x(t) \approx N_x(t_j) + \frac{t - t_j}{t_{j+1} - t_j}\big(N_x(t_{j+1}) - N_x(t_j)\big) \qquad \text{for } t \in [t_j, t_{j+1}].
\]

This leads to the following approximation for $w_x$:
\[
w_x = \int_0^T N_x(t)\,\mathrm{d}t \approx \sum_{j=0}^{k-1} \int_{t_j}^{t_{j+1}} \left( N_x(t_j) + \frac{t - t_j}{t_{j+1} - t_j}\big(N_x(t_{j+1}) - N_x(t_j)\big) \right) \mathrm{d}t = \frac{1}{2} \sum_{j=0}^{k-1} \big(N_x(t_{j+1}) + N_x(t_j)\big)(t_{j+1} - t_j).
\]
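The last expression is just the trapezoidal rule applied to the head counts at the census dates. A minimal sketch (the function name and the census data are hypothetical):

```python
def census_exposed_to_risk(times, counts):
    """Census approximation of w_x = int_0^T N_x(t) dt: linear interpolation
    of the head counts N_x(t_j) between census dates, i.e. the trapezoidal rule."""
    return sum(
        0.5 * (counts[j + 1] + counts[j]) * (times[j + 1] - times[j])
        for j in range(len(times) - 1)
    )

# Head counts 100, 110, 90 at census dates t = 0, 1, 2 (a two-year study)
print(census_exposed_to_risk([0, 1, 2], [100, 110, 90]))  # 205.0
```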

An additional complication can occur due to the way the ages of individuals are recorded. Typically the age that is recorded is not the exact age of the individual; rather, the exact age is rounded off to some integer and that integer is recorded as the age of the individual. In practice one distinguishes between three different ways of rounding off the exact age to an integer age, namely rounding off to (i) age last birthday, (ii) age next birthday or (iii) age nearest birthday. As an example, assume that a person dies at exact age 60.4. Then the age that is recorded as the age of death is (i) 60 if age last birthday is used, (ii) 61 if age next birthday is used and (iii) 60 if age nearest birthday is used, since 60.4 is closer to 60 than to 61. Difficulties can arise when the death data and the census data (i.e. the

²A census is a complete population count for a given area or place taken on a specific date.


data $N_x(t)$ for $t \in \{t_0, t_1, \ldots, t_k\}$) use different ways to record the age. For instance, assume that for the death data age is defined by/recorded as age nearest birthday, while for the census data age is defined by age last birthday. The rule that actuaries use is that the death data determines the rate interval, so let us assume we are interested in estimating the force of mortality corresponding to people dying at a recorded age of 60. Then the relevant rate interval is $[59.5, 60.5]$, since any individual who dies at exact age in this interval will have a recorded age of death of 60. The corresponding number of deaths, $\delta_{59.5}$, can be read off from the death data. But the corresponding number of individuals exposed to risk at a census date, $N_{59.5}(t_j)$, cannot be read off from the census data, since for the census data a different definition of (recorded) age is used, which means that the census data gives us $N_{59}(t_j)$ and $N_{60}(t_j)$ but not $N_{59.5}(t_j)$. In that case we need to use another approximation to get $N_{59.5}(t_j)$. A reasonable assumption is that births are distributed uniformly over the year, so that we can use $N_{59.5}(t_j) \approx \frac{1}{2}\big(N_{59}(t_j) + N_{60}(t_j)\big)$. Then we can proceed with the census approximation to approximate $w_{59.5}$ and consequently estimate the force of mortality corresponding to the rate interval $[59.5, 60.5]$ via $\hat\mu_{60} = \delta_{59.5}/w_{59.5}$.
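In code, the uniform-births adjustment and the resulting estimate chain together as follows (all numbers are hypothetical):

```python
# Hypothetical census counts, recorded as age last birthday, at census dates t = 0, 1, 2
N59 = [4000, 3900, 3850]
N60 = [3800, 3750, 3700]

# Uniform-births approximation for the rate interval [59.5, 60.5]
N59_5 = [0.5 * (a + b) for a, b in zip(N59, N60)]

# Census (trapezoidal) approximation of w_{59.5} over the two years
w59_5 = sum(0.5 * (N59_5[j] + N59_5[j + 1]) for j in range(2))
delta59_5 = 60  # hypothetical number of deaths recorded at age 60 (age nearest birthday)
mu60 = delta59_5 / w59_5
print(N59_5[0])  # 3900.0
```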

6.3 Binomial model

In order to use the Poisson model, we require knowledge of the exposed to risk, which in turn requires the exact times at which individuals died during the study, or alternatively some census data. This information is not always available, in which case there will be a need to resort to alternative methods, one of them being the binomial model. It must be stressed, however, that the binomial approach should be used only if we do not have any information on the exposed to risk. In the binomial model one estimates the annual death rate $q_x$ corresponding to the rate interval $[x, x+1]$ rather than the force of mortality itself.

We assume that $n$ independent individuals, all aged exactly $x$ years, are observed for one year. Let us denote by $\Delta_x$ the random variable representing the number of people that have died by the end of the year. Clearly, each individual dies within the next year with probability $q_x$ and survives with probability $1 - q_x$. Hence $\Delta_x$ is the sum of $n$ independent Bernoulli random variables with parameter $q_x$ and therefore $\Delta_x \sim \mathrm{Bin}(n, q_x)$, i.e. $\Delta_x$ has a binomial distribution with parameters $n$ and $q_x$. The likelihood of $q_x$ given the observation that $\Delta_x = \delta_x$ is therefore

\[
L(q_x) = \binom{n}{\delta_x} q_x^{\delta_x} (1 - q_x)^{n - \delta_x}
\]

and the log-likelihood is
\[
\ell(q_x) = \delta_x \log q_x + (n - \delta_x) \log(1 - q_x) + \log \binom{n}{\delta_x}.
\]

The mle $\hat q_x$ of $q_x$ is obtained as the solution of $\ell'(q_x) = 0$, which gives us
\[
\hat q_x = \frac{\delta_x}{n}.
\]


Figure 6.1: Plot of the crude rates $\hat\mu_{x+\frac12}$ against age $x$, indicated by (+). The solid line represents the 'true' (but unobservable) values of the mortality rates $\mu(x)$.

Note again that $\ell''(q_x) < 0$, so that $\hat q_x = \delta_x/n$ truly maximises the likelihood. Moreover, using the property that a binomial distribution can be approximated by a normal distribution³ when $n$ is large, we have that for large $n$ the maximum likelihood estimator $\hat q_x = \Delta_x/n$ is approximately $\mathcal{N}\big(q_x,\, \frac{q_x(1 - q_x)}{n}\big)$ distributed.
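A sketch of the binomial estimate and its large-$n$ standard error (the function name and the data are hypothetical):

```python
import math

def binomial_mortality_mle(deaths, n):
    """MLE of the annual death rate q_x = deaths / n, together with the
    large-n normal-approximation standard error sqrt(q(1 - q) / n)."""
    q_hat = deaths / n
    se = math.sqrt(q_hat * (1 - q_hat) / n)
    return q_hat, se

# 30 deaths observed among 10000 lives aged exactly x, followed for one year
q_hat, se = binomial_mortality_mle(30, 10000)
print(q_hat)  # 0.003
```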

6.4 Graduation

We aim to estimate the force of mortality for a certain range of ages (e.g. from age 30 to 50). Suppose that for each of the integer ages $x$ in this range we have an estimate of the force of mortality corresponding to the rate interval $[x, x+1]$, which we denote by $\hat\mu_{x+\frac12}$. From now on we refer to these estimates as the crude estimates or crude rates. We assume here that the crude estimates are obtained by the Poisson model via $\hat\mu_{x+\frac12} = \frac{\delta_x}{w_x}$, but other methods for obtaining the crude rates can be used as well.

If we were to plot the crude estimates $\hat\mu_{x+\frac12}$, indicated by (+), against $x$, we would get something like Figure 6.1. As can be seen, the crude estimates typically do not progress smoothly with age. They exhibit random fluctuations because they are based on randomly selected samples of individuals. Somewhere between these estimates should lie the true curve of $\mu(x)$. The progression of the true values $\mu(x)$ is smooth, or should be smooth, since human mortality is influenced by a gradual ageing process. We would therefore like our estimates also to progress smoothly with age. We must therefore develop a smoothing/graduation process that will produce graduated rates from the crude rates.

³Recall also that the expectation of a binomial random variable with parameters $n$ and $p$ is equal to $np$ and its variance is equal to $np(1-p)$.

Figure 6.2: Examples of an over-smoothed graduation (left) and of an under-smoothed graduation (right).

Remark 6.2. Similarly, one can carry out this procedure for the annual death rates $q_x$. Namely, first estimate, via e.g. the binomial model, for each integer $x$ the annual death rate $q_x$ by using data on a group of individuals who are observed while they are between the ages $x$ and $x+1$, and then perform graduation to smooth the estimates $\hat q_x$ of $q_x$. We focus here on the case of graduating the crude estimates of the force of mortality.

There are two steps in the graduation process.

• The construction of the graduated rates $\mathring\mu(x)$ for all ages $x$ from the crude rates $\hat\mu_{x+\frac12}$, which are determined for integer-valued ages $x$ only. This is a curve fitting exercise comparable to that in regression models, in which a regression curve $\mathring\mu(x)$ is made to best fit the observed responses $\hat\mu_{x_i+\frac12}$, $i = 1, 2, \ldots$, with $x_1, x_2, \ldots$ integers.

• The testing of the graduated rates $\mathring\mu_{x+\frac12} := \mathring\mu(x + \frac12)$ for integer-valued ages $x$ to determine whether they are

(i) acceptably smooth,

(ii) an acceptable fit to the original data (i.e. the crude estimates $\hat\mu_{x+\frac12}$).

Aims (i) and (ii) may be in conflict, as over-smoothing may destroy important features of the data, resulting in a poor fit, while under-smoothing may improve the fit but leaves us with unacceptably irregular graduated estimates.

The crude estimates (indicated by •) in the left plot of Figure 6.2 have been over-smoothed by the graduation (graduated values are indicated by the solid line). As can be seen from the plot, the curvature of the relationship between the crude estimates and age has been misrepresented by the graduated values: the graduated values appear to consistently lie above the crude estimates for middle ages, whereas they consistently lie below the crude estimates for large ages. On the other hand, the crude estimates in the right plot may possibly have been under-smoothed. The graduated values seem to wiggle too much; to begin with they progress upwards rather slowly, then much faster, then the upwards progression slows down again and picks up again at very large ages.

6.4.1 Smoothness

Smoothness is a function of the first and higher order derivatives. (A straight line has zero higher order derivatives; a quadratic has constant second derivative and zero third and higher order derivatives; a cubic has constant third order derivative and zero fourth and higher order derivatives; etc.) One measure of smoothness for graduated rates used by actuarial offices is that the third order differences of the graduated rates

• should be small in comparison to the estimates themselves and

• should progress regularly.

We remark that for several methods of graduation the obtained graduated rates will automatically be smooth, so in that case there is no need to test the graduated rates for smoothness.

Definition 6.3. Let $\mathring\mu_{x+\frac12}$ be the graduated values for integer-valued $x$. The differences of $\mathring\mu_{x+\frac12}$ are defined as follows:

• first difference: $\Delta \mathring\mu_{x+\frac12} = \mathring\mu_{x+\frac32} - \mathring\mu_{x+\frac12}$,

• second difference: $\Delta^2 \mathring\mu_{x+\frac12} = \Delta \mathring\mu_{x+\frac32} - \Delta \mathring\mu_{x+\frac12}$,

• third difference: $\Delta^3 \mathring\mu_{x+\frac12} = \Delta^2 \mathring\mu_{x+\frac32} - \Delta^2 \mathring\mu_{x+\frac12}$.

Example 6.4. Tables 6.1 and 6.2 give the crude and graduated values of the two graduations that produced the plots in Figure 6.2. Table 6.3 gives the third differences of the graduated values for the ages 40, 41 and 42 for each of the two graduations. It can be seen that the first graduation is acceptably smooth, as the third differences are small compared to the graduated values and progress smoothly (in fact we have seen from the plot in Figure 6.2 that this graduation is possibly over-smoothed). The table of third differences for the second graduation shows that the graduated values are under-smoothed, as the third differences are relatively large and their progression is irregular.
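The differences of Definition 6.3 can be computed by repeatedly differencing the graduated values; the sketch below reproduces the third differences of the second graduation in Table 6.3:

```python
def differences(values):
    """One level of differencing: consecutive differences of a list of values."""
    return [b - a for a, b in zip(values, values[1:])]

# Graduated rates for ages 40-45 of the second graduation (Table 6.3)
grad = [0.000867, 0.001018, 0.001215, 0.001494, 0.001701, 0.001945]
third = differences(differences(differences(grad)))
# third is approximately [0.000036, -0.000154, 0.000109]: large and irregular
```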


x    $w_x$    $\delta_x$   $\hat\mu_{x+\frac12}$   $\mathring\mu_{x+\frac12}$   $w_x\mathring\mu_{x+\frac12}$   $z_x$    $z_x^2$
30   70000    39   0.000557   0.000387    27.09    2.29   5.24
31   66672    43   0.000645   0.000424    28.29    2.77   7.67
32   68375    34   0.000497   0.000473    32.34    0.29   0.09
33   65420    31   0.000474   0.000523    34.21   -0.55   0.30
34   61779    23   0.000372   0.000579    35.77   -2.14   4.56
35   66091    50   0.000757   0.000640    42.30    1.18   1.40
36   68514    48   0.000701   0.000707    48.44   -0.06   0.00
37   69560    43   0.000618   0.000782    54.40   -1.55   2.39
38   65000    48   0.000738   0.000864    56.16   -1.09   1.19
39   66279    47   0.000709   0.000955    63.30   -2.05   4.20
40   67300    62   0.000921   0.001056    71.07   -1.08   1.16
41   65368    63   0.000964   0.001167    76.28   -1.52   2.31
42   65391    84   0.001285   0.001290    84.35   -0.04   0.00
43   62917    86   0.001367   0.001426    89.72   -0.39   0.15
44   66537   120   0.001804   0.001576   104.86    1.48   2.19
45   62302   121   0.001942   0.001742   108.53    1.20   1.43
46   62145   122   0.001963   0.001926   119.69    0.21   0.04
47   63856   162   0.002537   0.002129   135.95    2.24   4.99
48   61097   151   0.002471   0.002353   143.76    0.60   0.36
49   61110   184   0.003011   0.002600   158.89    1.99   3.97
Total        1561                       1515.65          43.72

Table 6.1: Table of crude and graduated rates and other quantities corresponding to the left plot in Figure 6.2. Here $w_x \mathring\mu_{x+\frac12}$ forms an approximation of the expected total number of deaths in the age interval $[x, x+1]$ when the mortality rates are given by the graduated rates, and $z_x = (\delta_x - w_x \mathring\mu_{x+\frac12})/\sqrt{w_x \mathring\mu_{x+\frac12}}$ are the so-called standardised residuals. We do not specify here how the graduated rates $\mathring\mu_{x+\frac12}$ were obtained.


x    $w_x$    $\delta_x$   $\hat\mu_{x+\frac12}$   $\mathring\mu_{x+\frac12}$   $w_x\mathring\mu_{x+\frac12}$   $z_x$    $z_x^2$
30   70000    39   0.000557   0.000460    32.20    1.20   1.44
31   66672    43   0.000645   0.000508    33.87    1.57   2.46
32   68375    34   0.000497   0.000548    37.47   -0.57   0.32
33   65420    31   0.000474   0.000578    37.81   -1.11   1.23
34   61779    23   0.000372   0.000600    37.07   -2.31   5.34
35   66091    50   0.000757   0.000616    40.71    1.46   2.12
36   68514    48   0.000701   0.000633    43.30    0.71   0.51
37   69560    43   0.000618   0.000654    45.49   -0.37   0.14
38   65000    48   0.000738   0.000693    45.04    0.44   0.19
39   66279    47   0.000709   0.000761    50.44   -0.48   0.23
40   67300    62   0.000921   0.000867    58.35    0.48   0.23
41   65368    63   0.000964   0.001018    66.54   -0.43   0.19
42   65391    84   0.001285   0.001215    79.45    0.51   0.26
43   62917    86   0.001367   0.001494    94.00   -0.83   0.68
44   66537   120   0.001804   0.001701   113.18    0.64   0.41
45   62302   121   0.001942   0.001945   121.18   -0.02   0.00
46   62145   122   0.001963   0.002155   133.92   -1.03   1.06
47   63856   162   0.002537   0.002332   148.91    1.07   1.15
48   61097   151   0.002471   0.002545   155.49   -0.36   0.13
49   61110   184   0.003011   0.003002   183.45    0.04   0.00
Total        1561                       1557.87          18.09

Table 6.2: Table of crude and graduated rates and other quantities corresponding to the right plot in Figure 6.2. Here $w_x \mathring\mu_{x+\frac12}$ forms an approximation of the expected total number of deaths in the age interval $[x, x+1]$ when the mortality rates are given by the graduated rates, and $z_x = (\delta_x - w_x \mathring\mu_{x+\frac12})/\sqrt{w_x \mathring\mu_{x+\frac12}}$ are the so-called standardised residuals. We do not specify here how the graduated rates $\mathring\mu_{x+\frac12}$ were obtained.

were obtained.

70

Page 74: Ronnie Loe en November 22, 2019 - University of Manchester · 4 2 1 0 1 2 3 4 5 6 7 3 5 6 X X X X X X X t Xt 4 2 1 0 1 2 3 4 5 6 7 3 5 6 X X X X X X X t Xt Figure 1.1: Two possible

First graduationx µx+ 1

2∆µx+ 1

2∆2µx+ 1

2∆3µx+ 1

2

40 0.001056 0.000111 0.000012 0.00000141 0.001167 0.000123 0.000013 0.00000142 0.001290 0.000136 0.000014 0.00000243 0.001426 0.000150 0.00001644 0.001576 0.00016645 0.001742

Second graduationx µx+ 1

2∆µx+ 1

2∆2µx+ 1

2∆3µx+ 1

2

40 0.000867 0.000151 0.000046 0.00003641 0.001018 0.000197 0.000082 -0.00015442 0.001215 0.000279 -0.000072 0.00010943 0.001494 0.000207 0.00003744 0.001701 0.00024445 0.001945

Table 6.3: Table of first second and third differences of the graduated values in the twograduations. Hereby the first graduation corresponds to Table 6.1 and the left plot of Figure6.2 and the second graduation corresponds to Table 6.2 and the right plot of Figure 6.2.


6.4.2 Methods of graduation

There are three main methods of graduation:

• graduation by a mathematical formula/model,

• graduation by reference to a standard table,

• graphical graduation.

Graduation by mathematical formula/model

This approach is the preferred method for large data sets. The underlying assumption is that $\mu(x)$ can be represented/modelled using a mathematical formula
\[
\mu(x) = g(x, \vec\alpha) \tag{6.2}
\]
which may involve a (finite) number of unspecified parameters $\vec\alpha = (\alpha_1, \alpha_2, \ldots)$. Some of the parametric formulae that have been used successfully in graduation are:

• Gompertz: $\mu(x) = B e^{\gamma x} = e^{\beta + \gamma x}$;

• Makeham: $\mu(x) = \alpha + B e^{\gamma x} = \alpha + e^{\beta + \gamma x}$;

• Perks: $\mu(x) = \dfrac{A + B e^{\gamma x}}{E e^{-\gamma x} + 1 + D e^{\gamma x}}$.

The Gompertz-Makeham $GM_{m,n}(\vec\alpha)$ family of curves A curve is said to belong to the $GM_{m,n}(\vec\alpha)$ family with $m + n$ parameters
\[
\vec\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_m, \alpha_{m+1}, \alpha_{m+2}, \ldots, \alpha_{m+n}),
\]
if it can be represented by the function
\[
G^{m,n}_{\vec\alpha}(x) = \sum_{i=1}^m \alpha_i x^{i-1} + \exp\left( \sum_{i=1}^n \alpha_{m+i} x^{i-1} \right),
\]
i.e. the curve is the sum of a polynomial of degree $m - 1$ and a term whose logarithm is a polynomial of degree $n - 1$. We see that in the Gompertz model $\mu(x)$ is modelled by a $GM_{0,2}(\vec\alpha)$ curve and in the Makeham model $\mu(x)$ is modelled by a $GM_{1,2}(\vec\alpha)$ curve.
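A direct transcription of the $GM_{m,n}$ curve (the function name and the parameter values in the example are ours); passing an empty polynomial part and two exponential coefficients recovers the Gompertz curve:

```python
import math

def gm_curve(alphas_poly, alphas_exp, x):
    """Gompertz-Makeham GM_{m,n} curve: a polynomial of degree m - 1 plus
    the exponential of a polynomial of degree n - 1."""
    poly = sum(a * x ** i for i, a in enumerate(alphas_poly))
    if not alphas_exp:  # pure polynomial case GM_{m,0}
        return poly
    return poly + math.exp(sum(a * x ** i for i, a in enumerate(alphas_exp)))

# Gompertz is GM_{0,2}: mu(x) = exp(beta + gamma x); Makeham is GM_{1,2}
beta, gamma = -10.0, 0.1  # hypothetical parameter values
print(math.isclose(gm_curve([], [beta, gamma], 50), math.exp(beta + gamma * 50)))  # True
```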

There are three steps in the graduation process when we graduate by using a mathematical formula/model.

Step 1 Choose the mathematical model $g(x, \vec\alpha)$ to model the graduated rates. This is done by examining the shape of various, probably low order, $GM_{m,n}(\vec\alpha)$ formulae and choosing one that seems by eye to best fit the crude rates. If none appears suitable, you may need to look for a possible graduation formula outside the $GM_{m,n}(\vec\alpha)$ family.


Step 2 Determine the values $\hat{\vec\alpha}$ of the parameters $\vec\alpha$ which make the graduated rates best fit the crude rates. There are three main methods to achieve this.

1. Weighted least squares Choose $\vec\alpha$ so that
\[
\sum_x \tilde w_x \left( \hat\mu_{x+\frac12} - g\Big(x + \frac12, \vec\alpha\Big) \right)^2
\]
is minimised, where the $\tilde w_x$ are the corresponding weights. A possible choice for the weights is $\tilde w_x = w_x$ or some function of $w_x$.

2. Minimising the chi-squared statistic Choose $\vec\alpha$ so that
\[
\sum_x \frac{\big(\delta_x - g(x + \frac12, \vec\alpha)\, w_x\big)^2}{g(x + \frac12, \vec\alpha)\, w_x}
\]
is minimised. We will see later what the chi-squared distribution has to do with the above expression.

3. Maximum likelihood Under the assumption that the force of mortality in the rate interval $[x, x+1)$ is a constant given by $g(x + \frac12, \vec\alpha)$, the likelihood of the parameters $\vec\alpha$ given the observations $(\delta_x, w_x)$ is given by
\[
L(\vec\alpha) = \prod_x g\Big(x + \frac12, \vec\alpha\Big)^{\delta_x} \exp\left( -g\Big(x + \frac12, \vec\alpha\Big) w_x \right),
\]
see Chapter 5. Now choose $\hat{\vec\alpha}$ such that $L(\vec\alpha)$ is maximised with respect to $\vec\alpha$, i.e. choose $\hat{\vec\alpha}$ as the mle of $\vec\alpha$.

Step 3 Once $\hat{\vec\alpha}$ is determined by any of the three above methods, the graduated values are calculated by
\[
\mathring\mu(x) = g(x, \hat{\vec\alpha}).
\]
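As a crude illustration of Steps 1-3: for the Gompertz model $\log\mu(x) = \beta + \gamma x$ is linear in $x$, so ordinary least squares on the log crude rates gives quick parameter values. This is a simplified stand-in for, not one of, the three fitting methods above:

```python
import math

def fit_gompertz_log_ls(ages, crude_rates):
    """Fit log mu(x) = beta + gamma x by ordinary least squares on the log
    crude rates; a rough substitute for the weighted/ML fits of Step 2."""
    xs = [x + 0.5 for x in ages]          # crude rates correspond to age x + 1/2
    ys = [math.log(r) for r in crude_rates]
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    gamma = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    return ybar - gamma * xbar, gamma

# Crude rates for ages 40-45 from Table 6.1
ages = [40, 41, 42, 43, 44, 45]
crude = [0.000921, 0.000964, 0.001285, 0.001367, 0.001804, 0.001942]
beta, gamma = fit_gompertz_log_ls(ages, crude)
graduated = [math.exp(beta + gamma * (x + 0.5)) for x in ages]  # smooth by construction
```

The fitted curve is automatically smooth, which is one reason graduation by formula needs no separate smoothness test.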

Remark 6.5. The maximum likelihood method of determining $\hat{\vec\alpha}$ has one advantage over the other two methods. Namely, if we use the maximum likelihood method and if $M_1$ and $M_2$ are two alternative candidate models for producing the graduated rates with $M_1$ nested within $M_2$, i.e. $M_1$ is a sub-model of $M_2$ with $k$ fewer parameters⁴, then we can use the generalised likelihood ratio test to check whether $M_2$ provides a significantly better fit to the data than $M_1$, or not.

This test goes as follows. We suppose that the true mortality rates are given by $g(x, \vec\alpha)$ with unknown $\vec\alpha \in M_2$. Let the null and alternative hypotheses be given by
\[
H_0 : \vec\alpha \in M_1, \qquad H_A : \vec\alpha \in M_2 \setminus M_1.
\]

⁴For example, $GM_{0,2}(\vec\alpha)$ is nested within the $GM_{2,2}(\vec\alpha)$ model. In fact $GM_{m,n}(\vec\alpha)$ is nested within the $GM_{i,j}(\vec\alpha)$ model as long as $m \le i$ and $n \le j$, with at least one of the two inequalities being strict.


Form the statistic
\[
\Lambda = \frac{\max_{\vec\alpha \in M_1} L(\vec\alpha)}{\max_{\vec\alpha \in M_2} L(\vec\alpha)}.
\]

Note that $\Lambda \in [0, 1]$. Under the null hypothesis the test statistic $-2 \log \Lambda$ is approximately (for large sample sizes) chi-squared distributed with degrees of freedom equal to
\[
\dim M_2 - \dim M_1 = k.
\]

Then an appropriate test at significance level $\alpha$ to check whether or not $M_2$ provides a better fit than $M_1$ is to reject the null hypothesis if $-2 \log \Lambda \ge \lambda_\alpha$, where $\lambda_\alpha > 0$ is defined such that
\[
\Pr(\chi^2_k \ge \lambda_\alpha) = \alpha,
\]
whereby $\chi^2_k$ is a chi-squared distribution with $k$ degrees of freedom. Note that we reject the null hypothesis if $\Lambda$ is sufficiently small, i.e. if the maximum likelihood under $M_2$ is sufficiently larger than the maximum likelihood under $M_1$.
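The mechanics of the test reduce to comparing $-2\log\Lambda$ with a chi-squared critical value. A sketch with hypothetical maximised log-likelihoods, and with the 5% critical value for $k = 2$ hard-coded (in practice it would be read from chi-squared tables or computed by a statistics package):

```python
def lr_test_statistic(loglik_m1, loglik_m2):
    """-2 log Lambda for nested models, from the maximised log-likelihoods
    of the smaller model M1 and the larger model M2."""
    return -2.0 * (loglik_m1 - loglik_m2)

# Hypothetical fits: M1 = GM_{0,2} nested in M2 = GM_{2,2}, so k = 2
stat = lr_test_statistic(-1040.3, -1036.1)   # -2 log Lambda = 8.4
critical_5pct_k2 = 5.991                     # Pr(chi^2_2 >= 5.991) = 0.05
print(stat > critical_5pct_k2)  # True: reject H0, M2 fits significantly better
```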

The advantages of graduation by mathematical formula are that,

• provided a reasonably low order formula is chosen, smoothness is guaranteed,

• it is not subjective,

• it is suitable for large data sets,

• as indicated above, in the case of using maximum likelihood estimation to fit the graduation curve, we can assess the significance of the improvement in model fit as additional parameters are added.

Graduation by reference to a standard table

If the mortality study is not sufficiently extensive, i.e. it is based on relatively few data, and you believe that the lives under consideration are similar to those of the large number of lives that formed the basis of a standard life table, we can allow the basic structure/features of the standard life table⁵ to be imported into our new graduated rates. The levels of the mortality rates in our table may be different from those in the standard table, but the progression of the rates in our table is made similar to those in the standard table. For this method one selects a reasonably simple function $g(\cdot, \vec\alpha)$ depending on a (low-dimensional) vector of parameters $\vec\alpha$ and sets
\[
\mathring\mu_{x+\frac12} = g\big(\mu^{(s)}_{x+\frac12}, \vec\alpha\big),
\]

⁵A life table is a table which shows, for each age, what the probability is that a person of that age will die before his or her next birthday. Examples include national life tables based on a country's entire population (e.g. the English Life Tables) and insured lives tables based on large numbers of insured lives (e.g. the "92 series" tables).


where the $\mu^{(s)}_{x+\frac12}$ are the mortality rates corresponding to a standard life table. We will not go into more detail here. The specific form of the function $g$ and the value of the parameter $\vec\alpha$ are determined in a similar way as for the method of graduation by a mathematical formula.

Graphical graduation

As the name suggests, this method of graduation involves drawing by hand a smooth curve that follows the crude rates in a satisfactory way. This method is not suitable when a high degree of precision is required.

6.5 Testing the graduation for goodness of fit

Once the graduation is performed, we need to examine whether the graduated rates are smooth and whether or not they provide a good fit to the crude rates. For the most common graduation methods the graduated rates will automatically be smooth, and therefore we focus here on testing the goodness of fit.

6.5.1 Chi-squared overall goodness of fit test

First we look at how to assess the overall goodness of fit of the graduated values $\mathring\mu_{x+\frac12}$. We saw in Section 6.1 that with the Poisson model we have, for large $n$,
\[
\Delta_x \simeq \mathcal{N}(\mu_{x+\frac12} w_x,\, \mu_{x+\frac12} w_x),
\]
where the symbol '$\simeq$' means 'is approximately equal in distribution to'. This means that (again for large sample sizes)
\[
\frac{\Delta_x - \mu_{x+\frac12} w_x}{\sqrt{\mu_{x+\frac12} w_x}} \simeq \mathcal{N}(0, 1),
\]

and thus, assuming that independent individuals are used for the collection of the data for in total $k$ different ages, we have that, approximately,
\[
\sum_x \left( \frac{\Delta_x - \mu_{x+\frac12} w_x}{\sqrt{\mu_{x+\frac12} w_x}} \right)^2 \simeq \chi^2_k,
\]
where $\chi^2_k$ is a chi-squared distribution with $k$ degrees of freedom⁶.

Now if the graduated rates are the true rates, then asymptotically
\[
\frac{\Delta_x - \mathring\mu_{x+\frac12} w_x}{\sqrt{\mathring\mu_{x+\frac12} w_x}} \simeq \mathcal{N}(0, 1),
\]

⁶Recall that the sum of the squares of $k$ independent standard normal random variables has a chi-squared distribution with $k$ degrees of freedom.


which means that for a large sample size the standardised residuals $z_x$, defined by
\[
z_x = \frac{\delta_x - \mathring\mu_{x+\frac12} w_x}{\sqrt{\mathring\mu_{x+\frac12} w_x}},
\]

are approximately samples from a standard normal distribution. Now it is tempting to say that $\sum_x z_x^2$ then follows a chi-squared distribution with degrees of freedom equal to the number of age groups, but one should keep in mind that the graduated rate $\mathring\mu_{x+\frac12}$ is not obtained from just $\hat\mu_{x+\frac12}$ but typically also from the crude rates corresponding to other ages (i.e. $\hat\mu_{x+\frac32}, \hat\mu_{x-\frac12}, \hat\mu_{x+\frac52}, \hat\mu_{x-\frac32}, \ldots$), and so the standardised residuals are not independent in general. What the actuaries do is that they still compare $\sum_x z_x^2$ to a chi-squared distribution, but with a possibly lower number of degrees of freedom than the number of different age groups. The exact number of degrees of freedom depends on the graduation method used. For instance, if the graduated rates are obtained by using a mathematical formula with $l$ parameters, then the degrees of freedom equal the number of age groups minus $l$. Thus, an appropriate test at significance level $\alpha$ to check whether the graduated rates fit, overall, the crude rates well, with the null hypothesis being

$H_0$ : the graduated rates are the true rates,

is to reject $H_0$ if $\sum_x z_x^2 \ge \lambda_\alpha$, where $\lambda_\alpha > 0$ is defined such that
\[
\Pr(\chi^2_k \ge \lambda_\alpha) = \alpha,
\]
where the degrees of freedom $k$ depend on the number of age groups and on the graduation method. Note that the critical value $\lambda_\alpha$ can be obtained from chi-squared tables.

Example 6.6. For the graduation corresponding to the data in Table 6.1, the graduated rates were produced by using a mathematical formula/model with two parameters. Since the graduation involves 20 ages in total, the goodness of fit chi-squared test will have $20 - 2 = 18$ degrees of freedom. From Table 6.1 we see that $\sum_x z_x^2 = 43.72$. Since
\[
43.72 > \lambda_{0.05} = 28.87,
\]
this indicates at the 5% level of significance that the graduated values are a poor fit to the data. In fact, since $\lambda_{0.005} = 37.16$, the same conclusion would have been reached at the 0.5% level of significance: pretty strong evidence of a bad fit. The graduation needs to be rethought.
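Using the columns of Table 6.1, the statistic $\sum_x z_x^2$ and the decision of Example 6.6 can be reproduced directly:

```python
import math

# Exposed to risk w_x, deaths delta_x and graduated rates for ages 30-49 (Table 6.1)
w = [70000, 66672, 68375, 65420, 61779, 66091, 68514, 69560, 65000, 66279,
     67300, 65368, 65391, 62917, 66537, 62302, 62145, 63856, 61097, 61110]
d = [39, 43, 34, 31, 23, 50, 48, 43, 48, 47,
     62, 63, 84, 86, 120, 121, 122, 162, 151, 184]
grad = [0.000387, 0.000424, 0.000473, 0.000523, 0.000579, 0.000640, 0.000707,
        0.000782, 0.000864, 0.000955, 0.001056, 0.001167, 0.001290, 0.001426,
        0.001576, 0.001742, 0.001926, 0.002129, 0.002353, 0.002600]

# Standardised residuals and the chi-squared statistic (approx. 43.7)
z = [(dx - wx * gx) / math.sqrt(wx * gx) for dx, wx, gx in zip(d, w, grad)]
chi2_stat = sum(zx ** 2 for zx in z)

# 20 ages, 2 fitted parameters: 18 degrees of freedom, 5% critical value 28.87
print(chi2_stat > 28.87)  # True: reject H0, a poor overall fit
```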

Even if the result of the overall goodness of fit test is to not reject the null hypothesis, indicating an overall good fit, too small a goodness of fit statistic may suggest an under-smoothed graduation, and so we still need to do further tests to check for certain defects that the goodness of fit test can fail to detect. The overall goodness of fit chi-squared test will fail to


• detect a few large residuals if the majority of the rest are small,

• detect patterns in the residuals indicating consistent bias, overall or in some age regions, as a result of persistent underestimation or persistent overestimation.

Note that this could have considerable financial importance. Underestimation of mortality results in losses in life insurance, as premiums do not cover benefits, and overestimation of mortality results in losses in pensions and annuity work, since benefits are paid out longer than estimated.

6.5.2 Residual plots

The presence of a few unacceptably large standardised residuals amongst many acceptably small residuals can be detected by plotting the standardised residuals against age. If the graduation is a successful one, the scatter plot should be a random cloud of points about the zero level, with 95% of the points lying within the $(-2, +2)$ band and none of the points outside the $(-3, +3)$ band. Points outside the $(-3, +3)$ band should be considered outliers, i.e. the graduated values differ substantially more than can be explained by chance. If the scatter plot of the standardised residuals against age is not a cloud of random points and exhibits a pattern, that suggests that the graduation is not successful. You should be able to tell from the pattern what the defects of the graduation are. The scatter plot in Figure 6.3 gives the standardised residuals plotted against age for the graduation data in Table 6.1. The plot shows there are five residuals outside the $(-2, +2)$ band; admittedly, four of them are only just outside it. More worrying, however, is the presence of a pattern in the residuals. They tend to be positive for the early ages, then they are negative for the middle ages, and they switch back to being positive at the later ages. This suggests a bias in the graduated values; the graduated values under-represent the crude rates at the earlier ages, then they tend to over-represent the crude rates at the middle ages, and again they under-represent the crude rates at the later ages. The curvature of the graduation curve is not as sharp as that of the crude rates. We therefore need to work with a graduation curve with a higher curvature to fit the crude rates.
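Counting residuals outside the bands is easy to automate; with the $z_x$ column of Table 6.1 this recovers the five residuals outside the $(-2, +2)$ band:

```python
# Standardised residuals z_x for ages 30-49 (Table 6.1)
z = [2.29, 2.77, 0.29, -0.55, -2.14, 1.18, -0.06, -1.55, -1.09, -2.05,
     -1.08, -1.52, -0.04, -0.39, 1.48, 1.20, 0.21, 2.24, 0.60, 1.99]

outside_2 = [zx for zx in z if abs(zx) > 2]  # residuals outside the (-2, +2) band
outside_3 = [zx for zx in z if abs(zx) > 3]  # outliers outside the (-3, +3) band
print(len(outside_2), len(outside_3))  # 5 0
```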

We can also produce normal plots to check that the standardised residuals have the standard normal distribution; if they don't, that is an indication that the graduation is wrong. One way to check graphically for normality is to plot the ith largest standardised residual against the ith normal score (i.e. the ith largest value amongst a number of independent samples from the standard normal distribution). Another way is to plot, on special probability paper in which the scale on the vertical axis is distorted by the transformation Φ−1(v), with Φ(·) being the standard normal distribution function, the value of

(i − 1/3) / (n + 3/8)

against the ith largest standardised residual. Here n is the number of different ages involved in the graduation. If the standardised residuals are indeed standard normally distributed, then with either plot we get a near straight line. If we don't, that is an indication that either the residuals are not standard normally distributed, i.e. the graduation is wrong, or there are some outliers, i.e. some ages with unacceptably large residuals.
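As an illustration, the plotting positions for such a normal probability plot can be computed with Python's standard library. The helper below is a hypothetical sketch, not part of the notes; it uses the (i − 1/3)/(n + 3/8) positions as printed in the text, though other conventions (e.g. Blom's (i − 3/8)/(n + 1/4)) are also in common use.

```python
from statistics import NormalDist

def plotting_points(residuals):
    """Pairs (expected standard normal quantile, ith smallest residual)
    for a normal probability plot, using the (i - 1/3)/(n + 3/8)
    plotting positions.  Illustrative helper, not from the notes."""
    z = sorted(residuals)
    n = len(z)
    q = NormalDist()  # standard normal distribution
    return [(q.inv_cdf((i - 1/3) / (n + 3/8)), z[i - 1])
            for i in range(1, n + 1)]

# If the residuals are approximately standard normal, the returned pairs
# lie close to a straight line when plotted.
```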


Ronnie Loeffen, November 22, 2019 - University of Manchester

Figure 6.3: Plot of standardised residuals against age for the data in Table 6.1.

In Figure 6.4 and Figure 6.5 the two types of probability plots are displayed corresponding to the data in Table 6.1. Both plots are near straight lines, suggesting that the standardised residuals are indeed standard normal. In fact, most statistical packages, like the one used to produce these plots, calculate the Anderson-Darling (AD) statistic for formally testing that the data represent a sample from the standard normal distribution and report the p-value

of this statistic. In this case the reported p-value is 0.721, showing that there is absolutely no evidence of lack of normality in the standardised residuals.

6.5.3 The sign test

This test checks whether there is an overall bias, i.e. whether the graduated values consistently underestimate or consistently overestimate the crude rates over all ages in the table. If the graduated rates are a good representation of the crude rates, without bias, then at each age there should be the same chance of the standardised residual being positive as of being negative. Let S be the random variable that gives the number of positive (standardised) residuals in the n ages of the table. Then with the null hypothesis

H0 : the graduated rates are the true rates, (6.3)

we have that under the null hypothesis, for each x in the age range, the binary random variable Ix defined by

Ix = 1 if zx ≥ 0, and Ix = 0 if zx < 0,

is a Bernoulli random variable with parameter 1/2. If these n Bernoulli random variables are also independent, then S = ∑x Ix is binomially distributed with parameters n and 1/2, i.e. S ∼ Bin(n, 1/2). Now when discussing

7 Recall that the p-value of a hypothesis test is the smallest value of the significance level α for which the null hypothesis will be rejected (given the observation of the test statistic).



Figure 6.4: Plot of standardised residuals against Normal scores for the data in Table 6.1.

Figure 6.5: Probability plot of residuals for the data in Table 6.1.



the chi-squared overall goodness of fit test, we said that the standardised residuals zx are not actually independent, but for the sign test this is simply ignored and it is just assumed that S ∼ Bin(n, 1/2) under the null hypothesis. With our definition of S, we have that if the observed value s of S is either too small or too large, then that is an indication that the graduated values are consistently above or consistently below the crude rates, suggesting overall bias. Therefore a test to check for overall bias at significance level α is to reject the null hypothesis if

Pr(Z ≥ s) ≤ α/2 if s > n/2,

Pr(Z ≤ s) ≤ α/2 if s < n/2,

with Z a binomially distributed random variable with parameters n and 1/2. Note that if s = n/2, then we will not reject the null hypothesis.

Given the observed number s of positive residuals we can also compute the p-value of the test. This is done as follows. Since the Bin(n, 1/2) distribution is symmetric and we are looking for either too many or too few positive residuals to reject the graduation, the p-value equals

2 Pr(Z ≥ s) = 2 ∑_{k=s}^{n} \binom{n}{k} (1/2)^n  if s > n/2,

2 Pr(Z ≤ s) = 2 ∑_{k=0}^{s} \binom{n}{k} (1/2)^n  if s < n/2.

These probabilities can be looked up in binomial tables to save us the hand calculation or, if n is large, we can exploit the fact that under the null hypothesis S ∼ N(n/2, n/4) approximately, which enables us to calculate these p-values using the normal distribution. Note that if s = n/2, then the p-value is usually defined to be equal to 1.

Example 6.7. In the graduation of the data of Table 6.1 there are 10 positive residuals out of 20. This means that there is absolutely no evidence of overall bias.
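The sign test p-value above is a short calculation; a minimal Python sketch (the function name is illustrative, not from the notes):

```python
from math import comb

def sign_test_p_value(s, n):
    """Two-sided p-value of the sign test: s positive residuals out of n,
    with S ~ Bin(n, 1/2) assumed under the null hypothesis."""
    if 2 * s == n:
        return 1.0                          # conventionally defined as 1
    if s > n / 2:
        tail = sum(comb(n, k) for k in range(s, n + 1))  # Pr(Z >= s)
    else:
        tail = sum(comb(n, k) for k in range(s + 1))     # Pr(Z <= s)
    return 2 * tail * 0.5 ** n

print(sign_test_p_value(10, 20))  # Example 6.7: s = n/2, so the p-value is 1.0
```

For large n one would instead use the N(n/2, n/4) approximation mentioned above.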

The sign test is not good at detecting local bias, i.e. when the graduated values are either consistently above or consistently below the crude rates in subregions of ages. For local bias, there is either the change of sign test or the grouping of signs test.

6.5.4 Change of sign test

Local bias can be detected by looking at how often the standardised residuals change sign with x (age). With n different ages/age groups there are n − 1 consecutive pairs of residuals, so there are at most n − 1 changes of sign; too few changes of sign indicates the presence of local bias somewhere. If the graduation is successful and the residuals are indeed N(0, 1) distributed and independent, then each residual has a chance of 1/2 of being of the same sign as the previous residual and a chance of 1/2 of being of the opposite sign. Thus under the null hypothesis (6.3) the number of sign changes C has a Bin(n − 1, 1/2)

8 Again, as for the sign test, we ignore here that the standardised residuals are typically dependent because of how the graduation is carried out.



distribution. Since local bias is identified by too few changes, for the change of sign test we reject the null hypothesis (6.3) at significance level α if

Pr(Z ≤ c) ≤ α,

where c is the observed value of C and Z a binomially distributed random variable with parameters n − 1 and 1/2. The corresponding p-value of the observed number c of changes is

Pr(Z ≤ c) = ∑_{k=0}^{c} \binom{n−1}{k} (1/2)^{n−1},

which can again be computed by consulting binomial tables or, in the case of large n, by using the normal approximation to the binomial distribution.

Example 6.8. In the graduation of the data of Table 6.1 there are 4 sign changes in the sequence of standardised residuals. Hence the p-value is given by

∑_{k=0}^{4} \binom{19}{k} (1/2)^{19} ≈ 0.010,

which means that there is pretty strong evidence for local bias.
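The binomial sum in Example 6.8 can be reproduced in a line or two of Python (an illustrative sketch, not part of the notes):

```python
from math import comb

def change_of_sign_p_value(c, n):
    """p-value Pr(Z <= c) of the change of sign test: c observed sign
    changes among n residuals, with C ~ Bin(n - 1, 1/2) under the null."""
    return sum(comb(n - 1, k) for k in range(c + 1)) * 0.5 ** (n - 1)

p = change_of_sign_p_value(4, 20)  # Example 6.8
print(round(p, 3))                 # approximately 0.010
```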

6.5.5 Grouping of signs test

This is another test for detecting local bias. Given the observed number of positive residuals s, the test statistic is

G = number of groups of positive signs in the standardised residuals.

What we exactly mean by a group of positive signs is best made clear by an example. Take for instance the sequence of signs '+ + + − − + − + +'; for this sequence there are three groups of positive signs: the first consists of 3 pluses, the second of a single plus and the third of 2 pluses.

We would like to find the distribution of G under the null hypothesis (6.3), given that there are s positive standardised residuals and n − s negative ones. For this we need to know how many arrangements, denoted by M, of pluses and minuses there are that lead to g groups of pluses (given that there are in total s pluses and n − s minuses). This we do in 2 steps. First we compute in how many ways g groups of pluses can be placed between n − s minuses (without distinguishing between the exact size of each group of pluses); denote this number by M1. For instance, if n − s = 5 and g = 3, this number is equal to the number of ways three of the six circles below can be filled by a +:

◦ − ◦ − ◦ − ◦ − ◦ − ◦

For this example, M1 = \binom{6}{3} and in general we have M1 = \binom{n−s+1}{g}. Next we compute the number M2 of different ways one can create g (non-empty) groups of positive signs out of s positive signs. For example, if s = 7 and g = 3, then M2 is equal to the number of ways one can fill two of the six open circles:

+ ◦ + ◦ + ◦ + ◦ + ◦ + ◦ +



For this example, M2 = \binom{6}{2} and in general M2 = \binom{s−1}{g−1}. It follows now that the number M is equal to

M = M1 M2 = \binom{n−s+1}{g} \binom{s−1}{g−1}.

Now the total number of arrangements of pluses and minuses when there are s pluses and n − s minuses is equal to \binom{n}{s}. Under the null hypothesis each arrangement is equally likely and thus under the null hypothesis the probability of G = g equals

Pr(G = g | H0, S = s) = \binom{n−s+1}{g} \binom{s−1}{g−1} / \binom{n}{s}.

Too few runs of positive signs is an indication of over-smoothing and the presence of local biases. Therefore we should reject the null hypothesis if G is too small. Hence with the grouping of signs test we reject the null hypothesis at significance level α if

Pr(Z ≤ g) ≤ α,

with g the observed number of positive groups and Z a discrete-valued random variable whose probability mass function is given by

Pr(Z = z) = \binom{n−s+1}{z} \binom{s−1}{z−1} / \binom{n}{s},  z = 1, 2, . . . , min{s, n − s + 1}.

The corresponding p-value of the observed number g of groups of positive signs is then

Pr(Z ≤ g) = ∑_{z=1}^{g} \binom{n−s+1}{z} \binom{s−1}{z−1} / \binom{n}{s}.

When n is large, Z is known to be approximately N( s(n−s+1)/n , (s(n−s))²/n³ ) distributed, and then e.g. the p-value can be calculated using this approximation.

Note that one of the differences between the grouping of signs test and the change of sign test is that in the former one conditions on the number of positive standardised residuals S.

Example 6.9. In the graduation of the data of Table 6.1, g = 3, s = 10 and n = 20. Hence the p-value of the grouping of signs test equals

∑_{z=1}^{3} \binom{11}{z} \binom{9}{z−1} / \binom{20}{10} ≈ 0.035,

which means that, just as in the change of sign test, there is pretty strong evidence for local bias.



Appendix A

A.1 Dominated convergence theorem and Fubini

Theorem A.1 (Dominated convergence theorem). (i) Assume (fn)n≥0 is a sequence of continuous functions defined on R such that f(x) := lim_{n→∞} fn(x) exists for all x ∈ R. Assume moreover that there exists a function g with ∫_{−∞}^{∞} g(x) dx < ∞ such that for all x ∈ R and all n ≥ 0, |fn(x)| ≤ g(x). Then

lim_{n→∞} ∫_{−∞}^{∞} fn(x) dx = ∫_{−∞}^{∞} lim_{n→∞} fn(x) dx = ∫_{−∞}^{∞} f(x) dx.

(ii) Let (a_{m,n})_{m,n=0}^{∞} be a double sequence of numbers such that a_m := lim_{n→∞} a_{m,n} exists for all m ≥ 0. Assume moreover that there exists a sequence (b_m)_{m=0}^{∞} with ∑_{m=0}^{∞} b_m < ∞ such that for all m ≥ 0 and for all n ≥ 0, |a_{m,n}| ≤ b_m. Then

lim_{n→∞} ∑_{m=0}^{∞} a_{m,n} = ∑_{m=0}^{∞} lim_{n→∞} a_{m,n} = ∑_{m=0}^{∞} a_m.

Theorem A.2 (Fubini). Let f : R × R → R be a continuous function.

(i) If f(x, y) is a positive (in the weak sense) function, then

∫_{−∞}^{∞} ( ∫_{−∞}^{∞} f(x, y) dx ) dy = ∫_{−∞}^{∞} ( ∫_{−∞}^{∞} f(x, y) dy ) dx.

(ii) If ∫_{−∞}^{∞} ( ∫_{−∞}^{∞} |f(x, y)| dx ) dy < ∞ or ∫_{−∞}^{∞} ( ∫_{−∞}^{∞} |f(x, y)| dy ) dx < ∞, then

∫_{−∞}^{∞} ( ∫_{−∞}^{∞} f(x, y) dx ) dy = ∫_{−∞}^{∞} ( ∫_{−∞}^{∞} f(x, y) dy ) dx.

(iii) If (a_{m,n})_{m,n=0}^{∞} is a double sequence of positive numbers (i.e. a_{m,n} ≥ 0), then

∑_{m=0}^{∞} ∑_{n=0}^{∞} a_{m,n} = ∑_{n=0}^{∞} ∑_{m=0}^{∞} a_{m,n}.



Note that this equality in particular says that if one side of the equality equals infinity, then the other side equals infinity as well.

Part (i) of the above theorem is also referred to as Tonelli's theorem. Note that, for instance, the condition a_{m,n} ≥ 0 in (iii) cannot be dropped. Consider e.g. the numbers a_{m,n} defined by

a_{m,n} = −1 if m = n,  +1 if m = n − 2,  0 otherwise.

Then ∑_{m=0}^{∞} ∑_{n=0}^{∞} a_{m,n} = 0 and ∑_{n=0}^{∞} ∑_{m=0}^{∞} a_{m,n} = −2.

A.2 Duration dependent transition rates

Let X = {Xt : t ≥ 0} be a continuous time stochastic process with state space {1, . . . , d} and with right-continuous sample paths. We can then define, as in Section 4.2, the jump times J(1), J(2), . . . (with J(0) = 0) and the holding times H(1), H(2), . . ., see Figure 3.1. Let, for i ≠ j, w ↦ µij(w) be some given right-continuous and locally integrable functions and let µi(w) = ∑_{j=1, j≠i}^{d} µij(w). We call X a semi-Markov process with duration dependent transition rates/intensities µij(w), i ≠ j, w ≥ 0, if the following two properties are satisfied for n ≥ 1:

(i) Given J(n − 1) < ∞ and X_{J(n−1)} = i, the holding time H(n) is

(a) a survival time with hazard function w ↦ µi(w) and

(b) assumed to be conditionally independent of H(1), H(2), . . . , H(n − 1) and X_{J(0)}, X_{J(1)}, . . . , X_{J(n−2)}.

(ii) Given H(n) = w < ∞ and X_{J(n−1)} = i, the state X_{J(n)} of the process at the next jump time is

(a) equal to k with probability µik(w)/µi(w) for k ≠ i and

(b) assumed to be conditionally independent of H(1), H(2), . . . , H(n − 1) and X_{J(0)}, X_{J(1)}, . . . , X_{J(n−2)}.

Note that these two properties characterise the joint distribution of the consecutive holding times H(1), H(2), . . . and the consecutive new states X_{J(1)}, X_{J(2)}, . . . that are visited. Note the differences between the above two properties and the ones at the beginning of Section 4.2, where the latter characterise Markov jump processes with time-dependent transition rates. Semi-Markov processes with duration dependent transition rates do not have the Markov property (unless the transition rates µij(w) are constant with respect to w, i.e. they do not depend on the duration w). In particular, for such a semi-Markov process X we have that, given Xs = i, the future state Xt, with t > s, depends on the current holding time at time s (which, recall, is defined as the length of time between s and the last jump time before s).
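Properties (i) and (ii) translate directly into a simulation recipe: sample each holding time from the hazard µi(·), here by accumulating the integrated hazard until it exceeds an Exp(1) draw, and then pick the next state with probabilities µik(w)/µi(w). The Python sketch below uses two hypothetical illustrative rate functions; they are not taken from the notes.

```python
import math
import random

# Hypothetical duration dependent rates for a 2-state process (states 0, 1);
# these particular functions are illustrative choices, not from the notes.
RATES = {
    (0, 1): lambda w: 0.5 + w,             # hazard grows with the duration w
    (1, 0): lambda w: 1.0 + 1.0 / (1 + w)  # hazard decays with the duration w
}
STATES = (0, 1)

def mu(i, j, w):
    return RATES[(i, j)](w)

def mu_total(i, w):
    return sum(mu(i, j, w) for j in STATES if j != i)

def sample_holding_time(i, step=1e-3):
    """Property (i): H(n) is a survival time with hazard w -> mu_i(w).
    Accumulate the integrated hazard until it exceeds an Exp(1) draw."""
    target = -math.log(random.random())    # Exp(1) sample
    w = cum = 0.0
    while cum < target:
        cum += mu_total(i, w) * step       # left Riemann sum of the hazard
        w += step
    return w

def simulate(x0, n_jumps):
    """Return the states visited at jump times J(0), ..., J(n_jumps)."""
    path, i = [x0], x0
    for _ in range(n_jumps):
        w = sample_holding_time(i)
        # Property (ii): jump to k with probability mu_ik(w) / mu_i(w)
        u, acc = random.random() * mu_total(i, w), 0.0
        for k in STATES:
            if k != i:
                acc += mu(i, k, w)
                if u <= acc:
                    i = k
                    break
        path.append(i)
    return path

random.seed(1)
print(simulate(0, 5))  # [0, 1, 0, 1, 0, 1] -- with two states the path must alternate
```

With d > 2 states the same recipe applies unchanged; only the RATES dictionary grows.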
