AN INTRODUCTION TO BAYESIAN STATISTICS AND MCMC
CHAPTER 5
THE BABY MODEL
Essentially, all models are wrong, but some are useful.
George Box and Norman Draper, 1987, p. 424
5.1. The model
5.2. Analytical solutions
5.2.1. Marginal posterior distribution of the mean and variance
5.2.2. Joint posterior distribution of the mean and variance
5.2.3. Inferences
5.3. Working with MCMC
5.3.1. The process
5.3.2. Using flat priors
5.3.3. Using vague informative priors
5.3.4. Common misinterpretations
Appendix 5.1
Appendix 5.2
Appendix 5.3
5.1. The model
We will start this chapter with the simplest possible model, and we will see
models that are more complicated in chapter 6. Our model consists of only a
mean plus an error term (1).
y_i = μ + e_i
Throughout the book, we will consider that the data are normally distributed, although all procedures and conclusions can be applied to other distributions. The “Normal” distribution was called “normal” because it is the most common one in nature (2). All errors have mean zero and are uncorrelated, and all sampled data come from the same distribution. Thus, to describe our model
we will say that
y_i ~ N(μ, σ²)
y ~ N(1μ, Iσ²)
where 1’ = [1, 1, 1, ... , 1] and I is the identity matrix.
f(y_i \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - \mu)^2}{2\sigma^2} \right)
1 We took from Daniel Gianola the humorous name “the baby model”, called “baby” because it is the most elementary one.
2 We will say ‘Normal’ throughout the book to distinguish the density distribution from the common uses of the word ‘normal’. The term “normal” applied to the density distribution appeared independently among several statisticians during the 19th century; Galton in 1877, Edgeworth in 1887, Pearson in 1893 and Poincaré in 1893 were all using this term (see Kruskal and Stigler, 1997).
f(\mathbf{y} \mid \mu, \sigma^2) = f(y_1, y_2, \ldots, y_n \mid \mu, \sigma^2) = f(y_1 \mid \mu, \sigma^2) \, f(y_2 \mid \mu, \sigma^2) \cdots f(y_n \mid \mu, \sigma^2)

= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - \mu)^2}{2\sigma^2} \right) = \frac{1}{(2\pi\sigma^2)^{\frac{n}{2}}} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu)^2 \right)
as we saw in 3.5.3. Now we have to establish our objectives. What we want is
to estimate the unknowns ‘µ’ and ‘σ²’ that define the distribution.
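As a minimal sketch, the baby model can be simulated in a few lines of Python; the sample size, true mean and true variance below are arbitrary illustrative values.

import numpy as np

# a minimal simulation of the baby model y_i = mu + e_i
rng = np.random.default_rng(0)
n, mu_true, sigma2_true = 20, 10.0, 4.0       # arbitrary illustrative values
e = rng.normal(0.0, np.sqrt(sigma2_true), n)  # errors with mean zero, uncorrelated
y = mu_true + e                               # the simulated data vector
print(y.mean(), y.var(ddof=1))                # sample mean and sample variance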
5.2. Analytical solutions
5.2.1. Marginal posterior density of the mean and variance
We will try to find the marginal posterior distributions for each unknown because
this distribution takes into account the uncertainty when estimating the other
parameter, as we have seen in chapters 2 and 3. Thus, we should find the
posterior distribution for μ after having marginalised it for σ2, and the posterior
distribution for σ2 after having marginalised it for μ. Although both are marginal
distributions, they are conditional on the data. In the Bayesian school, all inferences are made conditional on the data; if the data change, the inferences change as well. We will see this later with some examples.
f(\mu \mid \mathbf{y}) = \int_0^{\infty} f(\mu, \sigma^2 \mid \mathbf{y}) \, d\sigma^2

f(\sigma^2 \mid \mathbf{y}) = \int_{-\infty}^{\infty} f(\mu, \sigma^2 \mid \mathbf{y}) \, d\mu
We have derived these distributions by calculating the integrals in chapter 3.
f(\mu \mid \mathbf{y}) = t_{n-1}(\bar{y}, s^2)
This is a “Student” t-distribution with parameters ȳ and s², and n-1 degrees of freedom, where

\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i ; \qquad s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2
The other marginal density we look for is (see Chapter 3)
f(\sigma^2 \mid \mathbf{y}) \propto \frac{1}{(\sigma^2)^{\frac{n-1}{2}}} \exp\left( -\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{2\sigma^2} \right) = IG(\alpha, \beta)
This is an Inverted Gamma distribution with parameters α, β
\alpha = \frac{n-1}{2} - 1 ; \qquad \beta = \frac{1}{2} \sum_{i=1}^{n} (y_i - \bar{y})^2
5.2.2. Joint Posterior density of the mean and variance
We have seen that, using flat priors for the mean and variance,
f(\mu, \sigma^2 \mid \mathbf{y}) = \frac{f(\mathbf{y} \mid \mu, \sigma^2) \, f(\mu, \sigma^2)}{f(\mathbf{y})} \propto f(\mathbf{y} \mid \mu, \sigma^2) \, f(\mu, \sigma^2) \propto f(\mathbf{y} \mid \mu, \sigma^2)

f(\mu, \sigma^2 \mid \mathbf{y}) \propto \frac{1}{(2\pi\sigma^2)^{\frac{n}{2}}} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu)^2 \right)
now both parameters are in red because this is a bivariate distribution.
5.2.3. Inferences
We can draw inferences from the joint or from the marginal posterior
distributions. For example, if we find the maximum from the joint posterior
distribution, this would be the most probable value for both parameters µ and σ²
simultaneously. This is not the most probable value for the mean and the
variance when all possible values of the other parameter have been weighted
by their probability, and summed up (i.e., the mode of the marginal posterior
densities of μ and σ²). We will now show some inferences that have related
estimators in the frequentist world.
Mode of the joint posterior density
To find the mode, which is the maximum of the posterior distribution, we differentiate and set the derivatives equal to zero (Appendix 5.1)
\frac{\partial}{\partial \mu} f(\mu, \sigma^2 \mid \mathbf{y}) = 0 \;\Rightarrow\; \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} y_i = \bar{y} \quad \text{corresponding to } \hat{\mu}_{ML}

\frac{\partial}{\partial \sigma^2} f(\mu, \sigma^2 \mid \mathbf{y}) = 0 \;\Rightarrow\; \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{\mu})^2 \quad \text{corresponding to } \hat{\sigma}^2_{ML}
Thus the mode of the joint posterior density gives formulas that look like the maximum likelihood (ML) estimates, although here the interpretation is different. Here they mean that this estimate is the most probable value of the unknowns μ and σ², whereas in a frequentist context this
means that these values would make our current sample y most probable if
they were the true values. The numeric value of the estimate is the same, but
the interpretation is different.
Notice that we will not usually make inferences from joint posterior distributions because when estimating one of the parameters we do not take into account the uncertainty of estimating the other parameter, unless we are interested in simultaneous inferences for some reason.
Mean, median and mode of the marginal posterior density of the mean
As the marginal posterior distribution of the mean is t_{n-1}(ȳ, s²), the mean, median and mode are the same and they are equal to the sample mean. For credibility intervals, we can consult a table of the t_{n-1} distribution.
Mode of the marginal posterior density of the variance
Differentiating the marginal posterior density and equating to zero, we obtain (Appendix 5.2)

\frac{\partial}{\partial \sigma^2} f(\sigma^2 \mid \mathbf{y}) = 0 \;\Rightarrow\; \hat{\sigma}^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2 \quad \text{corresponding to } \hat{\sigma}^2_{REML}
Thus the mode of the marginal posterior density gives a formula that looks like the residual maximum likelihood (REML) estimate of the variance, although here the interpretation is different. Here this estimate is the most probable value of the unknown σ² when the values of the other unknown μ have been considered, weighted by their probability and integrated out (summed up). In a frequentist context, we mean that this value would make the sample most probable if it were the true value, working in a subspace in which there is no µ (see Blasco, 2001 for a more detailed interpretation). The numeric value of the estimate is the same, but the interpretation is different. Here the use of this estimate seems to be better founded than in the frequentist case, but notice that the frequentist properties are different from the Bayesian ones; thus a good Bayesian estimator is not necessarily a good frequentist estimator, and vice versa.
Mean of the marginal posterior distribution of the variance
To calculate the mean of a distribution is not so simple because we need to
calculate an integral. Because of that, modes were more popular before the
MCMC era. However, the mean has a better loss function than the mode, as we
have seen in chapter 2, and we may prefer this estimate. By definition of the
mean, we have to calculate the integral:
\text{mean} = E(\sigma^2 \mid \mathbf{y}) = \int_0^{\infty} \sigma^2 \, f(\sigma^2 \mid \mathbf{y}) \, d\sigma^2
In this case we know that f(σ² | y) is an inverted gamma with parameters α and β, as we have seen in 5.2.1. We can calculate the mean of this distribution if we know its parameters, and the formula can be found in several books (see, for example, Bernardo and Smith, 1994). Taking the values of α and β from paragraph 5.2.1, we have:
\text{Mean}(IG) = \frac{\beta}{\alpha - 1} = \frac{\frac{1}{2} \sum_{i=1}^{n} (y_i - \bar{y})^2}{\frac{n-3}{2} - 1} = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 5}
which does not have an equivalent in the frequentist world. Notice that this gives a larger estimate than the mode.
Notice that this estimate does not agree with the frequentist estimate of minimum quadratic risk that we saw in 1.4.4. That estimate had the same expression but divided by n+1 instead of by n-5. One reason is that we are not minimizing the same risk; in the frequentist case the variable is the estimate σ̂², which is a combination of the data, whereas in the Bayesian case the variable is the parameter σ², which is not a combination of the data. Thus, when calculating the risk we integrate in one case over this combination of data and in the other case over the parameter (3).
\text{Bayesian RISK} = E_u\left[ (\hat{u} - u)^2 \right] = \int (\hat{u} - u)^2 \, f(u) \, du
3 Remember that the parameter has a true unknown value σ²_TRUE and we use σ² to express our uncertainty about σ²_TRUE; thus, properly speaking, we do not integrate the parameter but the auxiliary variable σ².
\text{Frequentist RISK} = E_y\left[ (\hat{u} - u)^2 \right] = \int (\hat{u} - u)^2 \, f(\mathbf{y}) \, d\mathbf{y}
The other reason is that these Bayesian estimates have been derived under the assumption of flat (constant) prior information. These estimates will be different if other prior information is used. For example, we will see in chapter 9 that there are reasons for taking another prior for the variance:
f(\sigma^2) = \frac{1}{\sigma^2}
In this case, we have
f(\sigma^2 \mid \mathbf{y}) \propto \frac{1}{\sigma^2} \cdot \frac{1}{(\sigma^2)^{\frac{n-1}{2}}} \exp\left( -\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{2\sigma^2} \right) = \frac{1}{(\sigma^2)^{\frac{n+1}{2}}} \exp\left( -\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{2\sigma^2} \right)
then, calculating the mean as before
\text{Mean}(IG) = \frac{\beta}{\alpha - 1} = \frac{\frac{1}{2} \sum_{i=1}^{n} (y_i - \bar{y})^2}{\frac{n-1}{2} - 1} = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 3}
2
which is different from the former estimate. When the sample is high, all
estimates are similar, but if ‘n’ is low we will get different results. Prior
information may be important in Bayesian analyses if the samples are very
small.
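As a small numerical sketch of how these divisors behave, one can compare the estimates on a hypothetical small sample (the data, sample size and seed are arbitrary):

import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(10.0, 2.0, size=12)        # hypothetical small sample
SS = np.sum((y - y.mean()) ** 2)
n = len(y)

print("mode, flat prior (REML-like):   ", SS / (n - 1))
print("mean, flat prior:               ", SS / (n - 5))
print("mean, prior 1/sigma^2:          ", SS / (n - 3))
print("frequentist minimum-risk (n+1): ", SS / (n + 1))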
Median of the posterior marginal distribution of the variance
We stressed in chapter 2 the advantages of the median as an estimator that
uses a reasonable loss function and that it is invariant to transformations. The
median m is the value such that

\int_0^{m} f(\sigma^2 \mid \mathbf{y}) \, d\sigma^2 = \frac{1}{2}
thus we must calculate the integral.
Credibility intervals between two values ‘a’ and ‘b’
The probability that the true value lies between ‘a’ and ‘b’ is
P(a \le \sigma^2 \le b \mid \mathbf{y}) = \int_a^b f(\sigma^2 \mid \mathbf{y}) \, d\sigma^2
thus we must calculate the integral.
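As a minimal numerical sketch of these analytical inferences, scipy can evaluate the inverted gamma marginal posterior of σ² (assuming the IG(α, β) parameterisation of 5.2.1) and the t marginal posterior of μ; the data vector here is hypothetical, and the scale s/√n used for the t distribution is the standard result for this model under flat priors.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.normal(10.0, 2.0, 20)                 # hypothetical sample (n > 5, so the mean exists)
n, ybar = len(y), y.mean()
SS = np.sum((y - ybar) ** 2)

# marginal posterior of sigma^2 under flat priors: IG(alpha, beta)
alpha, beta = (n - 1) / 2 - 1, SS / 2
post_s2 = stats.invgamma(alpha, scale=beta)   # density ~ x^(-(alpha+1)) exp(-beta/x)
print("mode  :", SS / (n - 1))
print("mean  :", post_s2.mean())              # equals SS / (n - 5)
print("median:", post_s2.median())
print("95% credibility interval:", post_s2.interval(0.95))

# marginal posterior of mu: Student t with n-1 degrees of freedom centred at ybar
s = np.sqrt(SS / (n - 1))
post_mu = stats.t(df=n - 1, loc=ybar, scale=s / np.sqrt(n))
print("mu: mean =", ybar, ", 95% interval =", post_mu.interval(0.95))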
5.3. Working with MCMC
5.3.1. The process
The process for estimating marginal posterior densities is:
1. To write the joint posterior distribution of all parameters. To do this,
we need
a. The distribution of the data
b. The joint prior distributions
We have seen how to write the joint posterior distribution for μ and σ² in 5.2.2 in the case of using flat priors.
2. To write the conditional distributions for each individual parameter,
given all other parameters and the data. We just take the joint posterior
distribution, leave in red colour the parameter of interest, and paint in
black the other parameters, which now become constants. In our
example, we have two conditional distributions, one for μ and another one for σ².
3. To find algorithms allowing us to take random samples from the
conditional distributions. We can find these algorithms in books
dedicated to distributions. If we cannot find these algorithms, we should
use alternative sampling methods like Metropolis, as we have seen in
chapter 4.
4. To generate chains using the Gibbs sampling mechanism. In
general, we can run several chains from different starting points, and
check convergence using a test or by visual inspection of the chains. We
can also calculate the Monte Carlo Standard Error (MCSE) and check
the correlation between consecutive samples, in order to discard (thin) samples if the correlation is high. We will also discard the first part of the chains (the “burn-in”) until we are sure that the chain has converged to the
posterior distribution. In simple problems such as linear models with fixed
effects for mean comparisons, the chains converge immediately, the
MCSE are very low and the correlation between consecutive samples is
near zero, thus chains are used without discarding any sample.
5. To make inferences from the samples of the marginal posterior
distributions. We have seen how to make these inferences from chains
in chapter 4.
Now we will present two examples with different prior information.
5.3.2. Using flat priors
We have written the joint posterior distribution for μ and σ² in 5.2.2. Now we will
write the conditionals.
f(\mu \mid \sigma^2, \mathbf{y}) \propto f(\mu, \sigma^2 \mid \mathbf{y}) \propto \frac{1}{(2\pi\sigma^2)^{\frac{n}{2}}} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu)^2 \right)
f(\sigma^2 \mid \mu, \mathbf{y}) \propto f(\mu, \sigma^2 \mid \mathbf{y}) \propto \frac{1}{(2\pi\sigma^2)^{\frac{n}{2}}} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu)^2 \right)
Notice that the formulae are the same, but the variable of interest (the one shown in red) is different; thus the functions are completely different. As we saw in chapter 3, the first is a normal
distribution and the second an inverted gamma distribution.
f(\mu \mid \sigma^2, \mathbf{y}) = N\!\left( \bar{y}, \; \frac{\sigma^2}{n} \right)

f(\sigma^2 \mid \mu, \mathbf{y}) = IG(\alpha, \beta) ; \qquad \alpha = \frac{n}{2} - 1 ; \quad \beta = \frac{1}{2} \sum_{i=1}^{n} (y_i - \mu)^2
In 4.2.1 we have seen that we have algorithms for random sampling from
normal distributions, and we also have algorithms for random sampling from
inverted gamma distributions. We start, for example, with an arbitrary value for
the variance σ²₀ and then we get a sample value of the mean from its conditional distribution. We substitute this value in the conditional of the variance and we get a random value of the variance. We then substitute it in the conditional distribution of the mean and we continue the process (figure 5.1).
Figure 5.1. Gibbs sampling process for the mean and the variance of the “Baby
model”
We will give an example. We have a data vector with four samples
y' = [2, 4, 4, 2]
then we calculate
\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i = \frac{12}{4} = 3 ; \qquad \sum_{i=1}^{n} (y_i - \mu)^2 = \sum_{i=1}^{n} y_i^2 - 2\mu \sum_{i=1}^{n} y_i + n\mu^2 = 40 - 24\mu + 4\mu^2
and we can prepare the first conditional distributions
f(\mu \mid \sigma^2, \mathbf{y}) \sim N\!\left( \bar{y}, \; \frac{\sigma^2}{n} \right) = N\!\left( 3, \; \frac{\sigma^2}{4} \right)
f(\sigma^2 \mid \mu, \mathbf{y}) \sim IG\!\left( \frac{n}{2} - 1, \; \frac{1}{2} \sum_{i=1}^{n} (y_i - \mu)^2 \right) = IG\!\left( 1, \; \frac{40 - 24\mu + 4\mu^2}{2} \right)
Now we start the Gibbs sampling process by taking an arbitrary value for σ², for example

\sigma^2_0 = 1
then we substitute this arbitrary value in the first conditional distribution and we
have
f(\mu \mid \sigma^2_0, \mathbf{y}) \sim N\!\left( 3, \; \frac{1}{4} \right)
we sample from this distribution using an appropriate algorithm and we find
µ0 = 4
then we substitute this sampled value into the second conditional distribution,
f(\sigma^2 \mid \mu_0, \mathbf{y}) \sim IG\!\left( 1, \; \frac{40 - 24 \cdot 4 + 4 \cdot 4^2}{2} \right) = IG(1, 4)
now we sample from this distribution using an appropriate algorithm and we find
\sigma^2_1 = 5
then we substitute this sampled value in the first conditional distribution,
f(\mu \mid \sigma^2_1, \mathbf{y}) \sim N\!\left( 3, \; \frac{5}{4} \right)
now we sample from this distribution using an appropriate algorithm and we find
\mu_1 = 3
then we substitute this sampled value in the second conditional distribution, and
continue the process. Notice that we sample each time from a different
conditional distribution. The first conditional distribution of µ was a normal with
mean equal to 3 and variance equal to 1/4, but the second time we sampled it was a normal with the same mean but with variance equal to 5/4. The same
happened with the different Inverted Gamma distributions from which we were
sampling.
We obtain two chains. Each sample comes from a different conditional distribution, but after a while, the samples are also samples from the respective marginal posterior distributions (figure 5.2). After rejecting the samples of the “burn-in” period, we can use the rest of the samples for inferences as we did in chapter 4 with the MCMC chains.
Figure 5.2. A Gibbs sampling process. Samples are obtained from different
conditional distributions, but after a “burn-in” in which samples are rejected, the
rest of them are samples of the marginal posterior distributions.
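A minimal Python sketch of this Gibbs sampler for the toy data, assuming the conditional distributions N(ȳ, σ²/n) and IG(n/2 − 1, Σ(y_i − μ)²/2) given above; the chain length, burn-in length and seed are arbitrary.

import numpy as np

rng = np.random.default_rng(2)
y = np.array([2.0, 4.0, 4.0, 2.0])
n, ybar = len(y), y.mean()

iters, burn_in = 10_000, 1_000
mu_chain, s2_chain = np.empty(iters), np.empty(iters)

sigma2 = 1.0                          # arbitrary starting value, as in the text
for t in range(iters):
    # sample mu from its conditional: N(ybar, sigma2 / n)
    mu = rng.normal(ybar, np.sqrt(sigma2 / n))
    # sample sigma2 from its conditional: IG(n/2 - 1, sum((y - mu)^2) / 2),
    # obtained as the inverse of a Gamma(shape, rate) draw
    shape = n / 2 - 1
    rate = 0.5 * np.sum((y - mu) ** 2)
    sigma2 = 1.0 / rng.gamma(shape, 1.0 / rate)
    mu_chain[t], s2_chain[t] = mu, sigma2

mu_post, s2_post = mu_chain[burn_in:], s2_chain[burn_in:]
print("posterior mean of mu      :", mu_post.mean())
print("posterior median of sigma2:", np.median(s2_post))

Sampling σ² as the inverse of a gamma draw avoids the need for a dedicated inverted gamma sampler; note that with only four data the posterior median of σ² is reported, since its posterior mean does not exist for n < 6.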
5.3.3. Using vague informative priors
As we saw in chapter 2, and we will see again in chapter 9 with more detail,
vague informative priors should reflect the beliefs of the researcher, beliefs that
are supposed to be shared by the scientific community to a higher or lesser
degree. They need not be precise, since very precise informative priors would make it unnecessary to perform the experiment in the first place. There
are many density functions that can express our vague beliefs. For example, we
may have an a priori expectation of obtaining a difference of 100 g of liveweight
between two treatments for poultry growth. We believe that it is less probable to
obtain 50 g, than 150 g. We also believe that it is rather improbable to obtain 25
g, as improbable as to obtain 175 g of difference between both treatments.
These beliefs are symmetrical around the most probable value, 100 g, and can
be approximately represented by a Normal distribution, but also by a t-
distribution or a Cauchy distribution, all of them symmetrical. We will choose the
most convenient distribution in order to facilitate the way of obtaining the
conditional distributions we need for the Gibbs sampling. We will see below that
this will be a Normal distribution. The same can be said about the variance, but
here our beliefs are typically asymmetric because we are using squared
measurements. For example, we will not believe that the heritability of growth
rate in beef cattle is going to be 0.8 or 0.9; even if our expectations are around
0.5 we tend to believe that lower values are more probable than higher values.
These beliefs can be represented by many density functions, but again we will
choose the ones that will facilitate our task of obtaining the conditional
distributions we need for Gibbs sampling. These are called conjugate density
distributions. Figure 2.2 shows an example of different prior beliefs represented
by inverted gamma distributions.
Vague Informative priors for the variance
Let us take independent prior distributions for μ and σ2, a flat prior for the mean
and an inverse gamma distribution to represent our asymmetric beliefs for the
variance. This function can show very different shapes by changing its
parameters α and β, as we have seen in Figure 2.2.
f(\mu, \sigma^2) = f(\mu) \, f(\sigma^2) \propto f(\sigma^2) = \frac{1}{(\sigma^2)^{\alpha + 1}} \exp\left( -\frac{\beta}{\sigma^2} \right)
Then the joint posterior distribution is
f(\mu, \sigma^2 \mid \mathbf{y}) \propto \frac{1}{(\sigma^2)^{\frac{n}{2}}} \exp\left( -\frac{\sum_{i=1}^{n} (y_i - \mu)^2}{2\sigma^2} \right) \cdot \frac{1}{(\sigma^2)^{\alpha + 1}} \exp\left( -\frac{\beta}{\sigma^2} \right)
and the conditional distribution of σ² given μ and the data y is

f(\sigma^2 \mid \mu, \mathbf{y}) \propto \frac{1}{(\sigma^2)^{\frac{n}{2}}} \exp\left( -\frac{\sum_{i=1}^{n} (y_i - \mu)^2}{2\sigma^2} \right) \frac{1}{(\sigma^2)^{\alpha + 1}} \exp\left( -\frac{\beta}{\sigma^2} \right) = \frac{1}{(\sigma^2)^{\frac{n}{2} + \alpha + 1}} \exp\left( -\frac{\frac{1}{2}\sum_{i=1}^{n} (y_i - \mu)^2 + \beta}{\sigma^2} \right)
which is an inverted gamma with parameters ‘a’ and ‘b’
a = \frac{n}{2} + \alpha ; \qquad b = \beta + \frac{1}{2} \sum_{i=1}^{n} (y_i - \mu)^2
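A minimal sketch of this conditional update; the function name and the prior hyperparameter values alpha0 and beta0 below are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(3)

def sample_sigma2(y, mu, alpha0, beta0):
    # conditional of sigma^2 given mu under an IG(alpha0, beta0) prior:
    # IG(a, b) with a = n/2 + alpha0 and b = beta0 + 0.5 * sum((y - mu)^2)
    a = len(y) / 2 + alpha0
    b = beta0 + 0.5 * np.sum((y - mu) ** 2)
    return 1.0 / rng.gamma(a, 1.0 / b)   # inverse of a Gamma(shape=a, rate=b) draw

y = np.array([2.0, 4.0, 4.0, 2.0])
print(sample_sigma2(y, mu=3.0, alpha0=2.0, beta0=2.0))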
Vague Informative priors for the mean
Now we use an informative prior for the mean and a flat prior for the variance.
As we said before, our beliefs can be represented by a normal distribution. Thus
we can determine the mean and variance of our beliefs, ‘m’ and ‘v’ respectively.
Our prior will be
f(\mu, \sigma^2) = f(\mu) \, f(\sigma^2) \propto f(\mu) = \frac{1}{(2\pi v)^{\frac{1}{2}}} \exp\left( -\frac{(\mu - m)^2}{2v} \right)
The joint distribution is
f(\mu, \sigma^2 \mid \mathbf{y}) \propto \frac{1}{(\sigma^2)^{\frac{n}{2}}} \exp\left( -\frac{\sum_{i=1}^{n} (y_i - \mu)^2}{2\sigma^2} \right) \frac{1}{(2\pi v)^{\frac{1}{2}}} \exp\left( -\frac{(\mu - m)^2}{2v} \right)
and the conditional of μ given σ² and the data y is

f(\mu \mid \sigma^2, \mathbf{y}) \propto \frac{1}{(\sigma^2)^{\frac{n}{2}}} \exp\left( -\frac{\sum_{i=1}^{n} (y_i - \mu)^2}{2\sigma^2} \right) \frac{1}{(2\pi v)^{\frac{1}{2}}} \exp\left( -\frac{(\mu - m)^2}{2v} \right)
After some algebra gymnastics (Appendix 5.3) this becomes a normal distribution with parameters w and d²

f(\mu \mid \sigma^2, \mathbf{y}) \propto \exp\left( -\frac{(\mu - w)^2}{2 d^2} \right) = N(w, d^2)

The values of w and d² can be found in Appendix 5.3. This is a normal distribution with mean w and variance d² and we know how to sample from a
normal distribution. Thus, we can start with the Gibbs sampling mechanism as
in 5.3.1.
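A minimal sketch of sampling from this conditional, using the expressions for w and d² derived in Appendix 5.3; the function name and the prior values m and v below are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(4)

def sample_mu(y, sigma2, m, v):
    # conditional of mu given sigma^2 under a N(m, v) prior (Appendix 5.3):
    # w = (v*ybar + (sigma2/n)*m) / (v + sigma2/n),  1/d^2 = n/sigma2 + 1/v
    n, ybar = len(y), y.mean()
    w = (v * ybar + (sigma2 / n) * m) / (v + sigma2 / n)
    d2 = 1.0 / (n / sigma2 + 1.0 / v)
    return rng.normal(w, np.sqrt(d2))

y = np.array([2.0, 4.0, 4.0, 2.0])
print(sample_mu(y, sigma2=1.0, m=2.5, v=4.0))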
5.3.4. Common misinterpretations
We estimate the parameters “by Gibbs sampling”: This is incorrect. Gibbs
sampling is not a method of estimation, but a numerical method used to
integrate the joint posterior distribution and find the marginal ones. We estimate
our parameters using the mode, median or mean of the marginal posterior
distribution, never “by Gibbs sampling”.
The parameters of the inverted gamma distribution are degrees of
freedom: The concept of degrees of freedom was developed by Fisher (1922),
who represented the sample in a space of n-dimensions (see Blasco 2001 for
an intuitive representation of degrees of freedom). This has no relation to what we want. We manipulate the parameters in order to change the shape of the function, and it is irrelevant whether these “hyperparameters” are natural
numbers or fractions. For example, Blasco et al. (1998) use fractions for these
parameters.
One of the parameters of the inverted gamma distribution represents
credibility and the other represents the variance of the function: Both
parameters modify the shape of the function in both senses, dispersion and sharpness, and thus the credibility it shows; therefore it is incorrect to name one of them as the parameter of credibility. Both parameters should be manipulated in order to obtain a shape that will show our beliefs, and it is irrelevant which values they have as long as the shape of the function represents something similar to our state of beliefs. Often standard deviations or variances coming from other experiments are used as the ‘dispersion parameter’ or ‘window parameter’ of the inverted gamma distribution, but this is only correct when, after drawing the distribution, we agree that it actually represents our state of prior uncertainty.
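As a small illustration of this point, one can evaluate the inverted gamma density for several (α, β) pairs (the values below are arbitrary) and see that both parameters affect where the density peaks and how spread out it is:

import numpy as np
from scipy import stats

x = np.linspace(0.01, 5.0, 500)
for alpha, beta in [(2.0, 1.0), (2.0, 3.0), (6.0, 3.0), (0.5, 0.5)]:
    dens = stats.invgamma.pdf(x, alpha, scale=beta)    # IG density over a grid
    peak = x[np.argmax(dens)]                          # where the density is sharpest
    below1 = stats.invgamma.cdf(1.0, alpha, scale=beta)
    print(f"alpha={alpha}, beta={beta}: peak near {peak:.2f}, P(sigma2 < 1) = {below1:.2f}")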
Appendix 5.1
The joint posterior density under flat priors is

f(\mu, \sigma^2 \mid \mathbf{y}) \propto \frac{1}{(2\pi\sigma^2)^{\frac{n}{2}}} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu)^2 \right)

Differentiating with respect to μ and equating to zero,

\frac{\partial}{\partial \mu} f(\mu, \sigma^2 \mid \mathbf{y}) \propto \frac{1}{(2\pi\sigma^2)^{\frac{n}{2}}} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu)^2 \right) \cdot \frac{\sum_{i=1}^{n} (y_i - \mu)}{\sigma^2} = 0

\sum_{i=1}^{n} (y_i - \hat{\mu}) = 0 \;\Rightarrow\; \sum_{i=1}^{n} y_i - n\hat{\mu} = 0 \;\Rightarrow\; \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} y_i

Differentiating with respect to σ² and equating to zero,

\frac{\partial}{\partial \sigma^2} f(\mu, \sigma^2 \mid \mathbf{y}) \propto -\frac{n}{2} \, \frac{1}{(\sigma^2)^{\frac{n}{2}+1}} \exp\left( -\frac{\sum_{i=1}^{n} (y_i - \mu)^2}{2\sigma^2} \right) + \frac{1}{(\sigma^2)^{\frac{n}{2}}} \exp\left( -\frac{\sum_{i=1}^{n} (y_i - \mu)^2}{2\sigma^2} \right) \frac{\sum_{i=1}^{n} (y_i - \mu)^2}{2(\sigma^2)^2} = 0

\sum_{i=1}^{n} (y_i - \hat{\mu})^2 - n\hat{\sigma}^2 = 0 \;\Rightarrow\; \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{\mu})^2
Appendix 5.2
The marginal posterior density of the variance is

f(\sigma^2 \mid \mathbf{y}) \propto \frac{1}{(\sigma^2)^{\frac{n-1}{2}}} \exp\left( -\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{2\sigma^2} \right)

Differentiating with respect to σ² and equating to zero,

\frac{\partial}{\partial \sigma^2} f(\sigma^2 \mid \mathbf{y}) \propto -\frac{n-1}{2} \, \frac{1}{(\sigma^2)^{\frac{n-1}{2}+1}} \exp\left( -\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{2\sigma^2} \right) + \frac{1}{(\sigma^2)^{\frac{n-1}{2}}} \exp\left( -\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{2\sigma^2} \right) \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{2(\sigma^2)^2} = 0

\sum_{i=1}^{n} (y_i - \bar{y})^2 - (n-1)\hat{\sigma}^2 = 0 \;\Rightarrow\; \hat{\sigma}^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2
Appendix 5.3
n
22
i2 2 1
n 12 22 22 2
ym1 1
f | , f | , f exp exp2 2v
v
y y
but we saw in chapter 3, Appendix 3.2, that
2
2
1 22 2
y1f | , exp
2n
n
y
21
then, substituting, we have
2 2
2 2
1 2 1 22 22 2
y m1 1f | , f | , f exp exp
2v2 v
nn
y y
2
2 222 2
1 2 2222 2
2
v y my m1 nexp exp2v
2 2 vv n n
n
Now, the exponential can be transformed if we take into account that
2 2 2 2
2 22 2 2 2 2 2 2v y m v 2 v y m v y mn n n n
22
2
22
2 22 2 2
22
v y mn
2
vn
v 2 v y m1n n
vn
calling
22
22
v y mn
w
vn
and substituting in the expression, it becomes
22
22 2 2 2
2 2 2 2 2 2
w2w 2w w w
1 1 1
v /n v /n v /n
2
2
2 2
2 2
wf | , exp
/ n v2
/ n v
y
calling
2 2
2 2 2 2 2
/n v1 1 1
d /n v /n v
we have
2
2 2
2
wf | , exp N w,d
2 dy