AN INTRODUCTION TO BAYESIAN STATISTICS AND MCMC
CHAPTER 5
THE BABY MODEL
Essentially, all models are wrong, but some are useful.
George Box and Norman Draper, 1987, p. 424
5.1. The model
5.2. Analytical solutions
5.2.1. Marginal posterior distribution of the mean and variance
5.2.2. Joint posterior distribution of the mean and variance
5.2.3. Inferences
5.3. Working with MCMC
5.3.1. The process
5.3.2. Using flat priors
5.3.3. Using vague informative priors
5.3.4. Common misinterpretations
Appendix 5.1
Appendix 5.2
Appendix 5.3
5.1. The model
We will start this chapter with the simplest possible model, and we will see
models that are more complicated in chapter 6. Our model consists of only a
mean plus an error term (1).
y_i = μ + e_i
Throughout the book, we will consider that the data are normally distributed, although all procedures and conclusions can be applied to other distributions. The “Normal” distribution was called “normal” because it is the most common one in nature (2). All errors have mean zero and are uncorrelated, and all sampled data come from the same distribution. Thus, to describe our model
we will say that
y_i ~ N(μ, σ²)
y ~ N(1μ, Iσ²)
where 1’ = [1, 1, 1, ... , 1] and I is the identity matrix.
f(y_i \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - \mu)^2}{2\sigma^2} \right)
1 We took from Daniel Gianola the humorous name “the baby model”, called “baby” because it is the most elementary one.
2 We will say ‘Normal’ throughout the book to distinguish the density distribution from the common uses of the word ‘normal’. The term “normal” applied to the density distribution appeared independently among several statisticians during the 19th century; Galton in 1877, Edgeworth in 1887, Pearson in 1893 and Poincaré in 1893 were all using this term (see Kruskal and Stigler, 1997).
f(\mathbf{y} \mid \mu, \sigma^2) = f(y_1, y_2, \ldots, y_n \mid \mu, \sigma^2) = f(y_1 \mid \mu, \sigma^2) \, f(y_2 \mid \mu, \sigma^2) \cdots f(y_n \mid \mu, \sigma^2)

= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - \mu)^2}{2\sigma^2} \right) = \frac{1}{(2\pi\sigma^2)^{\frac{n}{2}}} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu)^2 \right)
as we saw in 3.5.3. Now we have to establish our objectives. What we want is
to estimate the unknowns ‘µ’ and ‘σ²’ that define the distribution.
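As a minimal sketch, the baby model can be simulated in a few lines of Python; the sample size, true mean and true variance below are arbitrary illustrative values.

import numpy as np

# a minimal simulation of the baby model y_i = mu + e_i
rng = np.random.default_rng(0)
n, mu_true, sigma2_true = 20, 10.0, 4.0       # arbitrary illustrative values
e = rng.normal(0.0, np.sqrt(sigma2_true), n)  # errors with mean zero, uncorrelated
y = mu_true + e                               # the simulated data vector
print(y.mean(), y.var(ddof=1))                # sample mean and sample variance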
5.2. Analytical solutions
5.2.1. Marginal posterior density of the mean and variance
We will try to find the marginal posterior distributions for each unknown because
this distribution takes into account the uncertainty when estimating the other
parameter, as we have seen in chapters 2 and 3. Thus, we should find the
posterior distribution for μ after having marginalised it for σ2, and the posterior
distribution for σ2 after having marginalised it for μ. Although both are marginal
distributions, they are conditional on the data. In the Bayesian school, all inferences are made conditional on the data; if the data change, the inferences change as well. We will see this later with some examples.
f(\mu \mid \mathbf{y}) = \int_0^{\infty} f(\mu, \sigma^2 \mid \mathbf{y}) \, d\sigma^2

f(\sigma^2 \mid \mathbf{y}) = \int_{-\infty}^{\infty} f(\mu, \sigma^2 \mid \mathbf{y}) \, d\mu
We have derived these distributions by calculating the integrals in chapter 3.
f(\mu \mid \mathbf{y}) = t_{n-1}(\bar{y}, s^2)
This is a “Student” t-distribution with parameters ȳ and s², and n-1 degrees of freedom, where

\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i ; \qquad s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2
The other marginal density we look for is (see Chapter 3)
f(\sigma^2 \mid \mathbf{y}) \propto \frac{1}{(\sigma^2)^{\frac{n-1}{2}}} \exp\left( -\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{2\sigma^2} \right) = IG(\alpha, \beta)
This is an Inverted Gamma distribution with parameters α, β
\alpha = \frac{n-1}{2} - 1 ; \qquad \beta = \frac{1}{2} \sum_{i=1}^{n} (y_i - \bar{y})^2
5.2.2. Joint Posterior density of the mean and variance
We have seen that, using flat priors for the mean and variance,
f(\mu, \sigma^2 \mid \mathbf{y}) = \frac{f(\mathbf{y} \mid \mu, \sigma^2) \, f(\mu, \sigma^2)}{f(\mathbf{y})} \propto f(\mathbf{y} \mid \mu, \sigma^2) \, f(\mu, \sigma^2) \propto f(\mathbf{y} \mid \mu, \sigma^2)

f(\mu, \sigma^2 \mid \mathbf{y}) \propto \frac{1}{(2\pi\sigma^2)^{\frac{n}{2}}} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu)^2 \right)
now both parameters are in red because this is a bivariate distribution.
5.2.3. Inferences
We can draw inferences from the joint or from the marginal posterior
distributions. For example, if we find the maximum from the joint posterior
distribution, this would be the most probable value for both parameters µ and σ²
simultaneously. This is not the most probable value for the mean and the
variance when all possible values of the other parameter have been weighted
by their probability, and summed up (i.e., the mode of the marginal posterior
densities of μ and σ²). We will now show some inferences that have related
estimators in the frequentist world.
Mode of the joint posterior density
To find the mode, which is the maximum of the posterior distribution, we differentiate and set the derivatives equal to zero (Appendix 5.1)
\frac{\partial}{\partial \mu} f(\mu, \sigma^2 \mid \mathbf{y}) = 0 \;\Rightarrow\; \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} y_i = \bar{y} \quad \text{corresponding to } \hat{\mu}_{ML}

\frac{\partial}{\partial \sigma^2} f(\mu, \sigma^2 \mid \mathbf{y}) = 0 \;\Rightarrow\; \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{\mu})^2 \quad \text{corresponding to } \hat{\sigma}^2_{ML}
Thus the mode of the joint posterior density gives formulas that look like the maximum likelihood (ML) estimates, although here the interpretation is different. Here they mean that this estimate is the most probable value of the unknowns μ and σ², whereas in a frequentist context this
means that these values would make our current sample y most probable if
they were the true values. The numeric value of the estimate is the same, but
the interpretation is different.
Notice that we will not usually make inferences from joint posterior distributions because when estimating one of the parameters we do not take into account the uncertainty of estimating the other parameter, unless we are interested in simultaneous inferences for some reason.
Mean, median and mode of the marginal posterior density of the mean
As the marginal posterior distribution of the mean is t_{n-1}(ȳ, s²), the mean, median and mode are the same and they are equal to the sample mean. For credibility intervals, we can consult a table of the t_{n-1} distribution.
Mode of the marginal posterior density of the variance
Differentiating the marginal posterior density and equating to zero, we obtain (Appendix 5.2)

\frac{\partial}{\partial \sigma^2} f(\sigma^2 \mid \mathbf{y}) = 0 \;\Rightarrow\; \hat{\sigma}^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2 \quad \text{corresponding to } \hat{\sigma}^2_{REML}
Thus the mode of the marginal posterior density gives a formula that looks like the residual maximum likelihood (REML) estimate of the variance, although here the interpretation is different. Here this estimate is the most probable value of the unknown σ² when the values of the other unknown μ have been considered, weighted by their probability and integrated out (summed up). In a frequentist context, we mean that this value would make the sample most probable if it were the true value, working in a subspace in which there is no µ (see Blasco, 2001 for a more detailed interpretation). The numeric value of the estimate is the same, but the interpretation is different. Here the use of this estimate seems to be better founded than in the frequentist case, but notice that the frequentist properties are different from the Bayesian ones; thus a good Bayesian estimator is not necessarily a good frequentist estimator, and vice versa.
Mean of the marginal posterior distribution of the variance
To calculate the mean of a distribution is not so simple because we need to
calculate an integral. Because of that, modes were more popular before the
MCMC era. However, the mean has a better loss function than the mode, as we
have seen in chapter 2, and we may prefer this estimate. By definition of the
mean, we have to calculate the integral:
\text{mean} = E(\sigma^2 \mid \mathbf{y}) = \int_0^{\infty} \sigma^2 \, f(\sigma^2 \mid \mathbf{y}) \, d\sigma^2
In this case we know that f(σ² | y) is an inverted gamma with parameters α and β, as we have seen in 5.2.1. We can calculate the mean of this distribution if we know its parameters, and the formula can be found in several books (see, for example, Bernardo and Smith, 1994). Taking the values of α and β from paragraph 5.2.1, we have:
\text{Mean}(IG) = \frac{\beta}{\alpha - 1} = \frac{\frac{1}{2} \sum_{i=1}^{n} (y_i - \bar{y})^2}{\frac{n-3}{2} - 1} = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 5}
which does not have an equivalent in the frequentist world. Notice that this gives a larger estimate than the mode.
Notice that this estimate does not agree with the frequentist estimate of minimum quadratic risk that we saw in 1.4.4. That estimate had the same expression but divided by n+1 instead of by n-5. One reason is that we are not minimizing the same risk; in the frequentist case the variable is the estimate σ̂², which is a combination of the data, whereas in the Bayesian case the variable is the parameter σ², which is not a combination of the data. Thus, when calculating the risk we integrate in one case over this combination of data and in the other case over the parameter (3).
\text{Bayesian RISK} = E_u\left[ (\hat{u} - u)^2 \right] = \int (\hat{u} - u)^2 \, f(u) \, du
3 Remember that the parameter has a true unknown value σ²_TRUE and we use σ² to express our uncertainty about σ²_TRUE; thus, properly speaking, we do not integrate the parameter but the auxiliary variable σ².
\text{Frequentist RISK} = E_y\left[ (\hat{u} - u)^2 \right] = \int (\hat{u} - u)^2 \, f(\mathbf{y}) \, d\mathbf{y}
The other reason is that these Bayesian estimates have been derived under the assumption of flat (constant) prior information. These estimates will be different if other prior information is used. For example, we will see in chapter 9 that there are reasons for taking another prior for the variance:
f(\sigma^2) = \frac{1}{\sigma^2}
In this case, we have
f(\sigma^2 \mid \mathbf{y}) \propto \frac{1}{\sigma^2} \cdot \frac{1}{(\sigma^2)^{\frac{n-1}{2}}} \exp\left( -\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{2\sigma^2} \right) = \frac{1}{(\sigma^2)^{\frac{n+1}{2}}} \exp\left( -\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{2\sigma^2} \right)
then, calculating the mean as before
\text{Mean}(IG) = \frac{\beta}{\alpha - 1} = \frac{\frac{1}{2} \sum_{i=1}^{n} (y_i - \bar{y})^2}{\frac{n-1}{2} - 1} = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 3}
2
which is different from the former estimate. When the sample is high, all
estimates are similar, but if ‘n’ is low we will get different results. Prior
information may be important in Bayesian analyses if the samples are very
small.
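As a small numerical sketch of how these divisors behave, one can compare the estimates on a hypothetical small sample (the data, sample size and seed are arbitrary):

import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(10.0, 2.0, size=12)        # hypothetical small sample
SS = np.sum((y - y.mean()) ** 2)
n = len(y)

print("mode, flat prior (REML-like):   ", SS / (n - 1))
print("mean, flat prior:               ", SS / (n - 5))
print("mean, prior 1/sigma^2:          ", SS / (n - 3))
print("frequentist minimum-risk (n+1): ", SS / (n + 1))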
Median of the posterior marginal distribution of the variance
We stressed in chapter 2 the advantages of the median as an estimator that
uses a reasonable loss function and that it is invariant to transformations. The
median m is the value such that

\int_0^{m} f(\sigma^2 \mid \mathbf{y}) \, d\sigma^2 = \frac{1}{2}
thus we must calculate the integral.
Credibility intervals between two values ‘a’ and ‘b’
The probability that the true value lies between ‘a’ and ‘b’ is
P(a \le \sigma^2 \le b \mid \mathbf{y}) = \int_a^b f(\sigma^2 \mid \mathbf{y}) \, d\sigma^2
thus we must calculate the integral.
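As a minimal numerical sketch of these analytical inferences, scipy can evaluate the inverted gamma marginal posterior of σ² (assuming the IG(α, β) parameterisation of 5.2.1) and the t marginal posterior of μ; the data vector here is hypothetical, and the scale s/√n used for the t distribution is the standard result for this model under flat priors.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.normal(10.0, 2.0, 20)                 # hypothetical sample (n > 5, so the mean exists)
n, ybar = len(y), y.mean()
SS = np.sum((y - ybar) ** 2)

# marginal posterior of sigma^2 under flat priors: IG(alpha, beta)
alpha, beta = (n - 1) / 2 - 1, SS / 2
post_s2 = stats.invgamma(alpha, scale=beta)   # density ~ x^(-(alpha+1)) exp(-beta/x)
print("mode  :", SS / (n - 1))
print("mean  :", post_s2.mean())              # equals SS / (n - 5)
print("median:", post_s2.median())
print("95% credibility interval:", post_s2.interval(0.95))

# marginal posterior of mu: Student t with n-1 degrees of freedom centred at ybar
s = np.sqrt(SS / (n - 1))
post_mu = stats.t(df=n - 1, loc=ybar, scale=s / np.sqrt(n))
print("mu: mean =", ybar, ", 95% interval =", post_mu.interval(0.95))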
5.3. Working with MCMC
5.3.1. The process
The process for estimating marginal posterior densities is:
1. To write the joint posterior distribution of all parameters. To do this,
we need
a. The distribution of the data
b. The joint prior distributions
We have seen how to write the joint posterior distribution for μ and σ² in 5.2.2 in the case of using flat priors.
2. To write the conditional distributions for each individual parameter,
given all other parameters and the data. We just take the joint posterior
distribution, leave in red colour the parameter of interest, and paint in
black the other parameters, which now become constants. In our
example, we have two conditional distributions, one for μ and another one for σ².
3. To find algorithms allowing us to take random samples from the
conditional distributions. We can find these algorithms in books
dedicated to distributions. If we cannot find these algorithms, we should
use alternative sampling methods like Metropolis, as we have seen in
chapter 4.
4. To generate chains using the Gibbs sampling mechanism. In
general, we can run several chains from different starting points, and
check convergence using a test or by visual inspection of the chains. We
can also calculate the Monte Carlo Standard Error (MCSE) and check
the correlation between consecutive samples, in order to discard (thin) samples if the correlation is high. We will also discard the first part of the chains (the “burn-in”) until we are sure that the chain has converged to the
posterior distribution. In simple problems such as linear models with fixed
effects for mean comparisons, the chains converge immediately, the
MCSE are very low and the correlation between consecutive samples is
near zero, thus chains are used without discarding any sample.
5. To make inferences from the samples of the marginal posterior
distributions. We have seen how to make these inferences from chains
in chapter 4.
Now we will present two examples with different prior information.
5.3.2. Using flat priors
We have written the joint posterior distribution for μ and σ² in 5.2.2. Now we will
write the conditionals.
f(\mu \mid \sigma^2, \mathbf{y}) \propto f(\mu, \sigma^2 \mid \mathbf{y}) \propto \frac{1}{(2\pi\sigma^2)^{\frac{n}{2}}} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu)^2 \right)
f(\sigma^2 \mid \mu, \mathbf{y}) \propto f(\mu, \sigma^2 \mid \mathbf{y}) \propto \frac{1}{(2\pi\sigma^2)^{\frac{n}{2}}} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu)^2 \right)
Notice that the formulae are the same, but the variable of interest (the one shown in red) is different; thus the functions are completely different. As we saw in chapter 3, the first is a normal
distribution and the second an inverted gamma distribution.
f(\mu \mid \sigma^2, \mathbf{y}) = N\!\left( \bar{y}, \; \frac{\sigma^2}{n} \right)

f(\sigma^2 \mid \mu, \mathbf{y}) = IG(\alpha, \beta) ; \qquad \alpha = \frac{n}{2} - 1 ; \quad \beta = \frac{1}{2} \sum_{i=1}^{n} (y_i - \mu)^2
In 4.2.1 we have seen that we have algorithms for random sampling from
normal distributions, and we also have algorithms for random sampling from
inverted gamma distributions. We start, for example, with an arbitrary value for
the variance σ²₀ and then we get a sample value of the mean from its conditional distribution. We substitute this value in the conditional of the variance and we get a random value of the variance. We then substitute it in the conditional distribution of the mean and we continue the process (figure 5.1).
Figure 5.1. Gibbs sampling process for the mean and the variance of the “Baby
model”
We will give an example. We have a data vector with four samples
y' = [2, 4, 4, 2]
then we calculate
\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i = \frac{12}{4} = 3 ; \qquad \sum_{i=1}^{n} (y_i - \mu)^2 = \sum_{i=1}^{n} y_i^2 - 2\mu \sum_{i=1}^{n} y_i + n\mu^2 = 40 - 24\mu + 4\mu^2
and we can prepare the first conditional distributions
f(\mu \mid \sigma^2, \mathbf{y}) \sim N\!\left( \bar{y}, \; \frac{\sigma^2}{n} \right) = N\!\left( 3, \; \frac{\sigma^2}{4} \right)
f(\sigma^2 \mid \mu, \mathbf{y}) \sim IG\!\left( \frac{n}{2} - 1, \; \frac{1}{2} \sum_{i=1}^{n} (y_i - \mu)^2 \right) = IG\!\left( 1, \; \frac{40 - 24\mu + 4\mu^2}{2} \right)
Now we start the Gibbs sampling process by taking an arbitrary value for σ², for example

\sigma^2_0 = 1
then we substitute this arbitrary value in the first conditional distribution and we
have
f(\mu \mid \sigma^2_0, \mathbf{y}) \sim N\!\left( 3, \; \frac{1}{4} \right)
we sample from this distribution using an appropriate algorithm and we find
µ0 = 4
then we substitute this sampled value into the second conditional distribution,
f(\sigma^2 \mid \mu_0, \mathbf{y}) \sim IG\!\left( 1, \; \frac{40 - 24 \cdot 4 + 4 \cdot 4^2}{2} \right) = IG(1, 4)
now we sample from this distribution using an appropriate algorithm and we find
\sigma^2_1 = 5
then we substitute this sampled value in the first conditional distribution,
f(\mu \mid \sigma^2_1, \mathbf{y}) \sim N\!\left( 3, \; \frac{5}{4} \right)
now we sample from this distribution using an appropriate algorithm and we find
\mu_1 = 3
then we substitute this sampled value in the second conditional distribution, and
continue the process. Notice that we sample each time from a different
conditional distribution. The first conditional distribution of µ was a normal with
mean equal to 3 and variance equal to 1/4, but the second time we sampled it was a normal with the same mean but with variance equal to 5/4. The same
happened with the different Inverted Gamma distributions from which we were
sampling.
We obtain two chains. Each sample comes from a different conditional distribution, but after a while, the samples are also samples from the respective marginal posterior distributions (figure 5.2). After rejecting the samples of the “burn-in” period, we can use the rest of the samples for inferences as we did in chapter 4 with the MCMC chains.
Figure 5.2. A Gibbs sampling process. Samples are obtained from different
conditional distributions, but after a “burn-in” in which samples are rejected, the
rest of them are samples of the marginal posterior distributions.
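A minimal Python sketch of this Gibbs sampler for the toy data, assuming the conditional distributions N(ȳ, σ²/n) and IG(n/2 − 1, Σ(y_i − μ)²/2) given above; the chain length, burn-in length and seed are arbitrary.

import numpy as np

rng = np.random.default_rng(2)
y = np.array([2.0, 4.0, 4.0, 2.0])
n, ybar = len(y), y.mean()

iters, burn_in = 10_000, 1_000
mu_chain, s2_chain = np.empty(iters), np.empty(iters)

sigma2 = 1.0                          # arbitrary starting value, as in the text
for t in range(iters):
    # sample mu from its conditional: N(ybar, sigma2 / n)
    mu = rng.normal(ybar, np.sqrt(sigma2 / n))
    # sample sigma2 from its conditional: IG(n/2 - 1, sum((y - mu)^2) / 2),
    # obtained as the inverse of a Gamma(shape, rate) draw
    shape = n / 2 - 1
    rate = 0.5 * np.sum((y - mu) ** 2)
    sigma2 = 1.0 / rng.gamma(shape, 1.0 / rate)
    mu_chain[t], s2_chain[t] = mu, sigma2

mu_post, s2_post = mu_chain[burn_in:], s2_chain[burn_in:]
print("posterior mean of mu      :", mu_post.mean())
print("posterior median of sigma2:", np.median(s2_post))

Sampling σ² as the inverse of a gamma draw avoids the need for a dedicated inverted gamma sampler; note that with only four data the posterior median of σ² is reported, since its posterior mean does not exist for n < 6.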
5.3.3. Using vague informative priors
As we saw in chapter 2, and we will see again in chapter 9 with more detail,
vague informative priors should reflect the beliefs of the researcher, beliefs that
are supposed to be shared by the scientific community to a higher or lesser
degree. They need not be precise, since very precise informative priors would make it unnecessary to perform the experiment in the first place. There
are many density functions that can express our vague beliefs. For example, we
may have an a priori expectation of obtaining a difference of 100 g of liveweight
between two treatments for poultry growth. We believe that it is less probable to
obtain 50 g, than 150 g. We also believe that it is rather improbable to obtain 25
g, as improbable as to obtain 175 g of difference between both treatments.
These beliefs are symmetrical around the most probable value, 100 g, and can
be approximately represented by a Normal distribution, but also by a t-
distribution or a Cauchy distribution, all of them symmetrical. We will choose the
most convenient distribution in order to facilitate the way of obtaining the
conditional distributions we need for the Gibbs sampling. We will see below that
this will be a Normal distribution. The same can be said about the variance, but
here our beliefs are typically asymmetric because we are using squared
measurements. For example, we will not believe that the heritability of growth
rate in beef cattle is going to be 0.8 or 0.9; even if our expectations are around
0.5 we tend to believe that lower values are more probable than higher values.
These beliefs can be represented by many density functions, but again we will
choose the ones that will facilitate our task of obtaining the conditional
distributions we need for Gibbs sampling. These are called conjugate density
distributions. Figure 2.2 shows an example of different prior beliefs represented
by inverted gamma distributions.
Vague Informative priors for the variance
Let us take independent prior distributions for μ and σ2, a flat prior for the mean
and an inverse gamma distribution to represent our asymmetric beliefs for the
variance. This function can show very different shapes by changing its
parameters α and β, as we have seen in Figure 2.2.
f(\mu, \sigma^2) = f(\mu) \, f(\sigma^2) \propto f(\sigma^2) = \frac{1}{(\sigma^2)^{\alpha + 1}} \exp\left( -\frac{\beta}{\sigma^2} \right)
Then the joint posterior distribution is
f(\mu, \sigma^2 \mid \mathbf{y}) \propto \frac{1}{(\sigma^2)^{\frac{n}{2}}} \exp\left( -\frac{\sum_{i=1}^{n} (y_i - \mu)^2}{2\sigma^2} \right) \cdot \frac{1}{(\sigma^2)^{\alpha + 1}} \exp\left( -\frac{\beta}{\sigma^2} \right)
and the conditional distribution of σ² given μ and the data y is

f(\sigma^2 \mid \mu, \mathbf{y}) \propto \frac{1}{(\sigma^2)^{\frac{n}{2}}} \exp\left( -\frac{\sum_{i=1}^{n} (y_i - \mu)^2}{2\sigma^2} \right) \frac{1}{(\sigma^2)^{\alpha + 1}} \exp\left( -\frac{\beta}{\sigma^2} \right) = \frac{1}{(\sigma^2)^{\frac{n}{2} + \alpha + 1}} \exp\left( -\frac{\frac{1}{2}\sum_{i=1}^{n} (y_i - \mu)^2 + \beta}{\sigma^2} \right)
which is an inverted gamma with parameters ‘a’ and ‘b’
a = \frac{n}{2} + \alpha ; \qquad b = \beta + \frac{1}{2} \sum_{i=1}^{n} (y_i - \mu)^2
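A minimal sketch of this conditional update; the function name and the prior hyperparameter values alpha0 and beta0 below are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(3)

def sample_sigma2(y, mu, alpha0, beta0):
    # conditional of sigma^2 given mu under an IG(alpha0, beta0) prior:
    # IG(a, b) with a = n/2 + alpha0 and b = beta0 + 0.5 * sum((y - mu)^2)
    a = len(y) / 2 + alpha0
    b = beta0 + 0.5 * np.sum((y - mu) ** 2)
    return 1.0 / rng.gamma(a, 1.0 / b)   # inverse of a Gamma(shape=a, rate=b) draw

y = np.array([2.0, 4.0, 4.0, 2.0])
print(sample_sigma2(y, mu=3.0, alpha0=2.0, beta0=2.0))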
Vague Informative priors for the mean
Now we use an informative prior for the mean and a flat prior for the variance.
As we said before, our beliefs can be represented by a normal distribution. Thus
we can determine the mean and variance of our beliefs, ‘m’ and ‘v’ respectively.
Our prior will be
f(\mu, \sigma^2) = f(\mu) \, f(\sigma^2) \propto f(\mu) = \frac{1}{(2\pi v)^{\frac{1}{2}}} \exp\left( -\frac{(\mu - m)^2}{2v} \right)
The joint distribution is
f(\mu, \sigma^2 \mid \mathbf{y}) \propto \frac{1}{(\sigma^2)^{\frac{n}{2}}} \exp\left( -\frac{\sum_{i=1}^{n} (y_i - \mu)^2}{2\sigma^2} \right) \frac{1}{(2\pi v)^{\frac{1}{2}}} \exp\left( -\frac{(\mu - m)^2}{2v} \right)
and the conditional of μ given σ² and the data y is

f(\mu \mid \sigma^2, \mathbf{y}) \propto \frac{1}{(\sigma^2)^{\frac{n}{2}}} \exp\left( -\frac{\sum_{i=1}^{n} (y_i - \mu)^2}{2\sigma^2} \right) \frac{1}{(2\pi v)^{\frac{1}{2}}} \exp\left( -\frac{(\mu - m)^2}{2v} \right)
After some algebra gymnastics (Appendix 5.3) this becomes a normal distribution with parameters w and d²

f(\mu \mid \sigma^2, \mathbf{y}) \propto \exp\left( -\frac{(\mu - w)^2}{2 d^2} \right) = N(w, d^2)

The values of w and d² can be found in Appendix 5.3. This is a normal distribution with mean w and variance d² and we know how to sample from a
normal distribution. Thus, we can start with the Gibbs sampling mechanism as
in 5.3.1.
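A minimal sketch of sampling from this conditional, using the expressions for w and d² derived in Appendix 5.3; the function name and the prior values m and v below are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(4)

def sample_mu(y, sigma2, m, v):
    # conditional of mu given sigma^2 under a N(m, v) prior (Appendix 5.3):
    # w = (v*ybar + (sigma2/n)*m) / (v + sigma2/n),  1/d^2 = n/sigma2 + 1/v
    n, ybar = len(y), y.mean()
    w = (v * ybar + (sigma2 / n) * m) / (v + sigma2 / n)
    d2 = 1.0 / (n / sigma2 + 1.0 / v)
    return rng.normal(w, np.sqrt(d2))

y = np.array([2.0, 4.0, 4.0, 2.0])
print(sample_mu(y, sigma2=1.0, m=2.5, v=4.0))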
5.3.4. Common misinterpretations
We estimate the parameters “by Gibbs sampling”: This is incorrect. Gibbs
sampling is not a method of estimation, but a numerical method used to
integrate the joint posterior distribution and find the marginal ones. We estimate
our parameters using the mode, median or mean of the marginal posterior
distribution, never “by Gibbs sampling”.
The parameters of the inverted gamma distribution are degrees of
freedom: The concept of degrees of freedom was developed by Fisher (1922),
who represented the sample in a space of n-dimensions (see Blasco 2001 for
an intuitive representation of degrees of freedom). This has no relation to what we want. We manipulate the parameters in order to change the shape of the function, and it is irrelevant whether these “hyperparameters” are natural
numbers or fractions. For example, Blasco et al. (1998) use fractions for these
parameters.
One of the parameters of the inverted gamma distribution represents
credibility and the other represents the variance of the function: Both
parameters modify the shape of the function in both senses, dispersion and sharpness, and thus the credibility it shows; therefore it is incorrect to name one of them as the parameter of credibility. Both parameters should be manipulated in order to obtain a shape that will show our beliefs, and it is irrelevant which values they have as long as the shape of the function represents something similar to our state of beliefs. Often standard deviations or variances coming from other experiments are used as the ‘dispersion parameter’ or ‘window parameter’ of the inverted gamma distribution, but this is only correct when, after drawing the distribution, we agree that it actually represents our state of prior uncertainty.
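As a small illustration of this point, one can evaluate the inverted gamma density for several (α, β) pairs (the values below are arbitrary) and see that both parameters affect where the density peaks and how spread out it is:

import numpy as np
from scipy import stats

x = np.linspace(0.01, 5.0, 500)
for alpha, beta in [(2.0, 1.0), (2.0, 3.0), (6.0, 3.0), (0.5, 0.5)]:
    dens = stats.invgamma.pdf(x, alpha, scale=beta)    # IG density over a grid
    peak = x[np.argmax(dens)]                          # where the density is sharpest
    below1 = stats.invgamma.cdf(1.0, alpha, scale=beta)
    print(f"alpha={alpha}, beta={beta}: peak near {peak:.2f}, P(sigma2 < 1) = {below1:.2f}")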
Appendix 5.1
The joint posterior density under flat priors is

f(\mu, \sigma^2 \mid \mathbf{y}) \propto \frac{1}{(2\pi\sigma^2)^{\frac{n}{2}}} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu)^2 \right)

Differentiating with respect to μ and equating to zero,

\frac{\partial}{\partial \mu} f(\mu, \sigma^2 \mid \mathbf{y}) \propto \frac{1}{(2\pi\sigma^2)^{\frac{n}{2}}} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu)^2 \right) \cdot \frac{\sum_{i=1}^{n} (y_i - \mu)}{\sigma^2} = 0

\sum_{i=1}^{n} (y_i - \hat{\mu}) = 0 \;\Rightarrow\; \sum_{i=1}^{n} y_i - n\hat{\mu} = 0 \;\Rightarrow\; \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} y_i

Differentiating with respect to σ² and equating to zero,

\frac{\partial}{\partial \sigma^2} f(\mu, \sigma^2 \mid \mathbf{y}) \propto -\frac{n}{2} \, \frac{1}{(\sigma^2)^{\frac{n}{2}+1}} \exp\left( -\frac{\sum_{i=1}^{n} (y_i - \mu)^2}{2\sigma^2} \right) + \frac{1}{(\sigma^2)^{\frac{n}{2}}} \exp\left( -\frac{\sum_{i=1}^{n} (y_i - \mu)^2}{2\sigma^2} \right) \frac{\sum_{i=1}^{n} (y_i - \mu)^2}{2(\sigma^2)^2} = 0

\sum_{i=1}^{n} (y_i - \hat{\mu})^2 - n\hat{\sigma}^2 = 0 \;\Rightarrow\; \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{\mu})^2
Appendix 5.2
The marginal posterior density of the variance is

f(\sigma^2 \mid \mathbf{y}) \propto \frac{1}{(\sigma^2)^{\frac{n-1}{2}}} \exp\left( -\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{2\sigma^2} \right)

Differentiating with respect to σ² and equating to zero,

\frac{\partial}{\partial \sigma^2} f(\sigma^2 \mid \mathbf{y}) \propto -\frac{n-1}{2} \, \frac{1}{(\sigma^2)^{\frac{n-1}{2}+1}} \exp\left( -\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{2\sigma^2} \right) + \frac{1}{(\sigma^2)^{\frac{n-1}{2}}} \exp\left( -\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{2\sigma^2} \right) \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{2(\sigma^2)^2} = 0

\sum_{i=1}^{n} (y_i - \bar{y})^2 - (n-1)\hat{\sigma}^2 = 0 \;\Rightarrow\; \hat{\sigma}^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2
Appendix 5.3
n
22
i2 2 1
n 12 22 22 2
ym1 1
f | , f | , f exp exp2 2v
v
y y
but we saw in chapter 3, Appendix 3.2, that
2
2
1 22 2
y1f | , exp
2n
n
y
21
then, substituting, we have
2 2
2 2
1 2 1 22 22 2
y m1 1f | , f | , f exp exp
2v2 v
nn
y y
2
2 222 2
1 2 2222 2
2
v y my m1 nexp exp2v
2 2 vv n n
n
Now, the exponential can be transformed if we take into account that
2 2 2 2
2 22 2 2 2 2 2 2v y m v 2 v y m v y mn n n n
22
2
22
2 22 2 2
22
v y mn
2
vn
v 2 v y m1n n
vn
calling
22
22
v y mn
w
vn
and substituting in the expression, it becomes
22
22 2 2 2
2 2 2 2 2 2
w2w 2w w w
1 1 1
v /n v /n v /n
2
2
2 2
2 2
wf | , exp
/ n v2
/ n v
y
calling
2 2
2 2 2 2 2
/n v1 1 1
d /n v /n v
we have
2
2 2
2
wf | , exp N w,d
2 dy