TRANSCRIPT
AN INTRODUCTION TO BAYESIAN STATISTICS AND MCMC
CHAPTER 9
PRIOR INFORMATION
“We were certainly aware that inferences must make use of prior
information, but after some considerable thought and discussion round
these matters we came to the conclusion, rightly or wrongly, that it was so
rarely possible to give sure numerical values to these entities, that our line
of approach must proceed otherwise”
Egon Pearson, 1962.
9.1. Exact prior information
9.1.1. Prior information
9.1.2. Posterior probabilities with exact prior information
9.1.3. Influence of prior information on posterior probabilities
9.2. Vague prior information
9.2.1. A vague definition of vague prior information
9.2.2. Examples of the use of vague prior information
9.3. No prior information
9.3.1. Flat priors
9.3.2. Jeffreys priors
9.3.3. Bernardo’s “Reference” priors
9.4. Improper priors
9.5. The Achilles heel of Bayesian inference
9.1. Exact prior information
9.1.1. Prior information
We have delayed until this chapter the assessment of prior information, and we
have used constant or conjugate priors without explaining the underlying
reasons for using them. One of the most attractive characteristics of Bayesian
theory is the possibility of integrating prior information into the inferences, thus
we should discuss here why this is so rarely done, at least in the biological area
of knowledge. When exact prior information is available there is no dispute
about Bayesian methods, and prior information can be integrated using the
rules of probability. The following example is based on one prepared by Fisher
(1959), who was a notorious anti-Bayesian but never objected to the use of
prior probability when it was clearly established. We introduced this example in
chapter 2 (paragraph 2.1.3).
Suppose there is a type of laboratory mouse whose skin colour is controlled by
a single gene with two alleles, ‘A’ and ‘a’, so that when the mouse has two
copies of the recessive allele (aa) its skin is brown, and it is black in the other
cases (AA and Aa). We cross a black and a brown mouse and obtain two
black mice, which must be (Aa): they received the allele ‘a’ from the brown
parent, and the allele from the other parent must be ‘A’, otherwise they would
have been brown. We now cross both black (Aa) mice and obtain a black
descendant. This black mouse could have received both ‘A’ alleles from its
parents, in which case we say it is homozygous (AA), or it could have received
one ‘A’ allele from one parent and one ‘a’ allele from the other parent, in which
case we say it is heterozygous (Aa or aA). We are interested in knowing which
type of black mouse it is, homozygous or heterozygous. To find out, we must
cross this mouse with a brown mouse (aa) and examine the offspring (Figure 9.1).
Figure 9.1. Experiment to determine whether a parent is homozygous (AA) or
heterozygous (Aa or aA)
If we get a brown mouse as an offspring, we will know that the mouse is
heterozygous, because a brown mouse must be (aa) and must have received
an ‘a’ allele from each parent; but if we only get black offspring we will still be
in doubt about whether it is homo- or heterozygous. If we get many black
offspring, it will be unlikely that the mouse is heterozygous, because the ‘a’
allele should have been transmitted to at least one of its descendants. Notice
that before performing the experiment we can calculate the probability of each
genotype. We know that the mouse to be tested cannot be (aa), because
otherwise it would be brown. Thus, it either received both ‘A’ alleles from its
mother and father, and is therefore (AA), or an ‘A’ allele from the father and an
‘a’ allele from the mother, becoming (Aa), or the opposite, becoming (aA). We
have three equally probable possibilities, thus the probability of being ‘AA’ is
1/3 and the probability of being heterozygous (‘Aa’ or ‘aA’, both genetically
identical) is 2/3. This is what we expect before having any data from the
experiment. Notice that these expectations are not merely ‘beliefs’, but
quantified probabilities. Also notice that they come from our knowledge of
Mendel’s laws and from the knowledge that our mouse is the offspring of two
heterozygotes; this prior knowledge is based on previous experience.
9.1.2. Posterior probabilities with exact prior information
Now the experiment is performed and we obtain three offspring, all black (figure
9.2). They received for sure an ‘a’ allele from the mother and an ‘A’ allele from
our mouse, but our mouse can still be homozygous (AA) or heterozygous (we
will not distinguish between Aa and aA in the rest of the chapter, since they
are genetically identical; we will use (Aa) for both). What is the probability of
each type?
Figure 9.2. Experiment to determine whether a parent is homozygous (AA) or
heterozygous (Aa or aA)
To find out, we will apply Bayes theorem. The probability of being
homozygous (AA) given that we have obtained three black offspring is

P(AA | 3 black) = P(3 black | AA) · P(AA) / P(3 black)
We know that if it is true that our mouse is AA, the probability of obtaining a
black offspring is 1, since the offspring will always have an ‘A’ allele. Thus,
P(3 black | AA) = 1

We also know that the prior probability of being AA is 1/3, thus

P(AA) = 1/3 ≈ 0.33
Finally, the probability of the sample is the sum of the probabilities of two
mutually exclusive events: having a homozygous parent (AA) or having a
heterozygous parent (Aa):

P(3 black) = P(3 black & AA) + P(3 black & Aa)
           = P(3 black | AA) · P(AA) + P(3 black | Aa) · P(Aa)
To calculate it we need the prior probability of being heterozygous, which we
know is

P(Aa) = 2/3

and the probability of obtaining our sample if it is true that our mouse is Aa. If
our mouse were Aa, the only way of obtaining a black offspring is if the offspring
got its ‘A’ allele from it, thus the probability of obtaining one black offspring
would be 1/2. The probability of obtaining three black offspring is then
(1/2) × (1/2) × (1/2), thus

P(3 black | Aa) = (1/2)^3
Now we can calculate the probability of our sample:

P(3 black) = P(3 black | AA) · P(AA) + P(3 black | Aa) · P(Aa)
           = 1 × (1/3) + (1/2)^3 × (2/3) = 0.42
Then, applying Bayes theorem,

P(AA | 3 black) = P(3 black | AA) · P(AA) / P(3 black) = (1 × 0.33) / 0.42 = 0.80
The probability of being heterozygous can be calculated again using Bayes
theorem, or simply

P(Aa | 3 black) = 1 − P(AA | 3 black) = 1 − 0.80 = 0.20
Thus, we had a prior probability before obtaining any data, and a posterior
probability after obtaining three black offspring:

prior P(AA) = 0.33    posterior P(AA | y) = 0.80
prior P(Aa) = 0.67    posterior P(Aa | y) = 0.20

Before the experiment was performed it was more probable that our mouse was
heterozygous (Aa), but after the experiment it is more probable that it is
homozygous (AA).
Notice that the sum of both probabilities is 1
P(AA | y) + P(Aa | y) = 1.00
thus the posterior probabilities give a relative measure of uncertainty (80% and
20% respectively).
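The whole calculation can be reproduced in a few lines. This is an illustrative sketch in Python (not code from the book); the function name is ours.

```python
# Posterior probability that the tested mouse is homozygous (AA) or
# heterozygous (Aa) after observing n black offspring from a cross with
# a brown (aa) mouse.

def mouse_posterior(n_black, prior_AA=1/3):
    """Bayes theorem for the mouse example: P(black | AA) = 1 and
    P(black | Aa) = 1/2 for each offspring."""
    lik_AA = 1.0               # an AA parent always gives black offspring
    lik_Aa = 0.5 ** n_black    # an Aa parent gives black with prob. 1/2 each
    prior_Aa = 1 - prior_AA
    evidence = lik_AA * prior_AA + lik_Aa * prior_Aa   # P(n black)
    post_AA = lik_AA * prior_AA / evidence
    return post_AA, 1 - post_AA

post_AA, post_Aa = mouse_posterior(3)
print(round(post_AA, 2), round(post_Aa, 2))  # 0.8 0.2
```

The two returned values sum to 1, as posterior probabilities must.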
However, using the maximum likelihood method, we can also give a higher
chance to the homozygous (AA), but we do not have a measure of evidence.
Consider the likelihood of both events; as we have seen before,
if it is true that our mouse is AA, the probability of obtaining three black offspring is

P(3 black | AA) = 1

and if it is true that our mouse is Aa, the probability of obtaining our sample is

P(3 black | Aa) = (1/2)^3 = 0.125
Notice that the sum of the likelihoods is not 1 because they come from different
events
P(y |AA) + P(y |Aa) = 1.125
thus the likelihoods do not provide a measure of uncertainty. Even if we rescale
the likelihoods to force their sum to 1, their relative values still do not provide a
relative measure of evidence as the probabilities do.
9.1.3. Influence of prior information on posterior probabilities
If we had used flat priors instead of exact prior information, repeating the
calculation we would obtain

prior P(AA) = 0.50    posterior P(AA | y) = 0.89
prior P(Aa) = 0.50    posterior P(Aa | y) = 0.11
We can see that the flat prior had an influence on the final result. We still
favour the homozygous case (AA), but we are overestimating the evidence in
favour of it. When exact prior information is available, the correct way to make
an inference is to use it.
Suppose now that we have a large amount of prior information favouring the
case (Aa). Suppose we know that the prior probability of being AA is
P(AA) = 0.002. Computing the probabilities again we obtain

prior P(AA) = 0.002    posterior P(AA | y) = 0.02
prior P(Aa) = 0.998    posterior P(Aa | y) = 0.98
Thus, despite having evidence from the data in favour of AA, we decide that
the mouse is Aa because the prior information dominates and the posterior
distribution favours Aa. This has been a frequent criticism of Bayesian
inference, but one wonders why an experiment should be performed at all when
the previous evidence in favour of Aa is so strong.
What would have happened if, instead of three black offspring, we had obtained
seven black offspring? Repeating the calculation for y = 7 black, we obtain

prior P(AA) = 0.33    posterior P(AA | y) = 0.99
prior P(Aa) = 0.67    posterior P(Aa | y) = 0.01
If flat priors were used, we obtain

prior P(AA) = 0.50    posterior P(AA | y) = 0.99
prior P(Aa) = 0.50    posterior P(Aa | y) = 0.01
in this case the evidence provided by the data dominates over the prior
information. However, if the prior information is very accurate,

prior P(AA) = 0.002    posterior P(AA | y) = 0.33
prior P(Aa) = 0.998    posterior P(Aa | y) = 0.67
thus, even with more data, prior information dominates the final result when it
is very accurate, which should not normally be the case. In general, prior
information loses importance with larger samples. For example, if we have n
uncorrelated data,

f(θ | y) ∝ f(y | θ) · f(θ) = f(y1, y2, …, yn | θ) · f(θ) = f(y1 | θ) · f(y2 | θ) ⋯ f(yn | θ) · f(θ)

and taking logarithms,

log f(θ | y) = log f(y1 | θ) + log f(y2 | θ) + ⋯ + log f(yn | θ) + log f(θ) + constant

we can see that prior information has less and less importance as the amount of
data increases: the prior contributes a single term, whereas the data contribute
one term per observation.
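The fading influence of the prior can be checked numerically with the mouse example. This is a sketch (not code from the book): the function below computes P(AA | n black offspring) for an arbitrary prior.

```python
# Posterior P(AA | n black offspring) in the mouse example, for any prior:
# P(black | AA) = 1 per offspring, P(black | Aa) = 1/2 per offspring.

def post_AA(n, prior):
    lik_AA = 1.0
    lik_Aa = 0.5 ** n
    return lik_AA * prior / (lik_AA * prior + lik_Aa * (1 - prior))

# Three priors (exact, flat, strongly favouring Aa) and growing samples:
for prior in (1/3, 0.5, 0.002):
    print([round(post_AA(n, prior), 2) for n in (3, 7, 20)])
```

Whatever the prior, the posterior probability of AA approaches 1 as the number of black offspring grows, while for small samples the prior still matters.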
9.2. Vague prior information
9.2.1. A vague definition of vague prior information
It is infrequent to find exact prior information. Usually there is prior information,
but it is not clear how to formalize it in order to describe this information using a
prior distribution. For example, if we are going to estimate the heritability of litter
size of a rabbit breed we know that this heritability has also been estimated in
other breeds and it has often given values between 0.05 and 0.11. We have a
case in which the estimate was 0.30, but the standard error was high. We also
have a high realized heritability from an experiment that we have reasons for
not taking too seriously. A high heritability was also presented at a congress,
but this paper did not pass the usual peer review filter and we tend to give less
credibility to these results. Moreover, some of the experiments were performed
in situations that are more similar to our own experiment, or with breeds that
are closer to ours. It is obvious that we have prior information, but how can we
manage all of it?
One of the disappointments for those who arrive at Bayesian inference attracted
by the possibility of using prior information is that modern Bayesians tend to
avoid the use of prior information, due to the difficulties of defining it properly. A
solution for integrating all sources of prior knowledge into a clearly defined prior
was offered independently in the 1930s by the British philosopher Frank
Ramsey (1931) and by the Italian mathematician Bruno de Finetti (1937), but
the solution is unsatisfactory in many cases, as we will see. They propose that
what we call probability is just a state of beliefs. This definition has the
advantage of covering events like the probability of obtaining a 6 when
throwing a die as well as the probability of Scotland becoming an independent
republic in this decade. Of course, in the first case we have mathematical rules
that will determine our beliefs and in the second we do not, but in both cases
we can express sensible beliefs about the events.
Transforming probability, which looks to us like an external concept, into beliefs,
which look like an arbitrary product of our daily mood, is a step that some
scientists refuse to take. Nevertheless, there are three aspects to consider:
1. It should be clear that although beliefs are subjective, this does not
mean that they are arbitrary. Ideally, previous beliefs should be
expressed by experts and there should be a good agreement among
experts on how prior information is evaluated.
2. Prior beliefs should be vague and contain little information, otherwise
there is no reason to perform the experiment in the first place, as we
have seen in 9.1.3. In some cases, an experiment may be performed
in order to add more accuracy to a good previous estimation, but this
is not normally the case.
3. When having enough data, prior information loses importance, and
different prior beliefs can give rise to the same result, as we have
seen in 9.1.3 (1).
There is another problem of a different nature. In multivariate analyses it is
almost impossible to determine a rational state of beliefs. For example, suppose
we are analysing the heritabilities and genetic correlations of three traits. How
can we determine our beliefs about the heritability of the first trait when the
second trait has a heritability of 0.2, the correlation between both traits is −0.7,
the heritability of the third trait is 0.1, the correlation between the first and third
traits is 0.3 and the correlation between the second and third traits is 0.4; and
then our beliefs about the heritability of the first trait when the heritability of the
second trait is 0.1, etc.? Here we are unable to represent any state of beliefs,
even a vague one.
9.2.2. Examples of the use of vague prior information
We can describe our prior information using many density functions, but it would
be convenient to propose a density that can be easily combined with the
distribution of the data, what we called a conjugate distribution in chapter 5
(paragraph 5.3.3). When we are comparing two means, as in the example in
5.3.3, it is easy to understand how we can construct a prior density. We have
an a priori expectation of obtaining a difference of 100 g of live weight between
two treatments for poultry growth. We believe that it is less probable to obtain a
difference of 50 g than one of 100 g, and as probable as obtaining 150 g. We
also believe that it is rather improbable to obtain 25 g, as improbable as
obtaining a difference of 175 g between both treatments. We can draw a
diagram with our beliefs, representing the probabilities we give to these
differences. These beliefs are symmetrical around the most probable
1 Bayesian statisticians often stress that when having enough data the problem of the prior
information becomes irrelevant. However, having enough data, Statistics is rather irrelevant; the
science of Statistics is useful when we want to determine which part of our observation is due to
random sampling and what is due to a natural law.
value, 100 g, and can be approximately represented by a Normal distribution. It
is not important whether the representation of our beliefs is very accurate, since
it should necessarily be vague; otherwise we would not perform the
experiment, as we said before.
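As a sketch of how such a belief could be encoded, the density below is a Normal centred at 100 g; the standard deviation of 40 g is our own arbitrary choice, not a value from the text.

```python
import math

def normal_pdf(x, mu=100.0, sd=40.0):
    """Density of the Normal prior for the difference between treatments."""
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# The stated beliefs: symmetry around 100 g, and 25 g considered more
# improbable than 50 g.
print(normal_pdf(50) == normal_pdf(150))  # symmetric around 100 g
print(normal_pdf(25) < normal_pdf(50))    # 25 g less probable than 50 g
```

Any reasonably vague symmetric density would serve equally well here, which is the point of the text: the exact shape is not critical.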
This is a simple example about how to construct prior beliefs. The reader can
find friendly examples about how to construct prior beliefs for simpler problems
in Press (2002). However, it is generally difficult to work with prior information in
more complex problems. To illustrate the difficulties of working with vague prior
information we will show here two examples of attempts of constructing prior
information for the variance components in genetic analyses. In the first
attempt, Blasco et al. (1998) tried to express the prior information about variance
components for ovulation rate in pigs. As the available information is on
heritabilities, they considered that the phenotypic variance is estimated without
error (in genetic experiments, often performed with a large amount of data, this
error is very small), and expressed their beliefs about additive variances just as
if they were heritabilities (2). The error variance priors are constructed according
to these beliefs. Blasco et al. (1998) used inverted gamma densities to express
the uncertainty about the heritability of one trait, because they are conjugate
priors that can be easily combined with the distribution of the data, as we have
seen in chapter 5 (paragraph 5.3.3). These functions depend on two
parameters that we called α and β in 3.5.4:

f(x | α, β) = (β^α / Γ(α)) · x^−(α+1) · exp(−β / x)
The shape of the function changes when changing α and β, thus the researcher
can try different values for these parameters until she finds a shape that agrees
with her beliefs (an example is drawn in figure 9.3). These parameters are
frequently called ν (degrees of freedom) and S2 (‘window’ parameter) for prior
2 There are other alternatives, for example, thinking in standardised data to construct prior
beliefs for the additive variance.
densities of variances, because they affect mainly the shape and width of the
distribution respectively.
f(σ² | ν, S²) ∝ (σ²)^−(1+ν/2) · exp(−ν S² / (2σ²))
This leads to some confusion, not only because the shape and width of the
distribution depend on both parameters, but also because their only meaning for
us is to have two parameters that change the distribution until it represents our
prior beliefs, without any relationship with “degrees of freedom”. Moreover, ν
does not need to be a natural number; for example, Blasco et al. (1998) used
ν = 2.5 and ν = 6.5 for the prior of one of the variances. We will now see how
Blasco et al. (1998) constructed their prior densities.
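The “trial and error” search for a shape can be mimicked numerically. A sketch, assuming the unnormalised scaled inverted chi-square density given above; the (ν, S²) pairs are illustrative only.

```python
import math

def scaled_inv_chi2(x, v, S2):
    """Unnormalised scaled inverted chi-square density:
    f(x | v, S2) proportional to x^(-(1 + v/2)) * exp(-v * S2 / (2 * x))."""
    return x ** (-(1 + v / 2)) * math.exp(-v * S2 / (2 * x))

# The mode sits at v * S2 / (v + 2), so changing v and S2 moves and
# reshapes the density: this is the knob the researcher turns until the
# shape agrees with her beliefs.
for v, S2 in [(2.5, 2.0), (6.5, 2.0)]:
    mode = v * S2 / (v + 2)
    # check numerically that the density peaks near the analytical mode
    assert scaled_inv_chi2(mode, v, S2) > scaled_inv_chi2(mode / 2, v, S2)
    assert scaled_inv_chi2(mode, v, S2) > scaled_inv_chi2(mode * 2, v, S2)
    print(v, S2, round(mode, 2))
```

Plotting this function over a grid of variances for several (ν, S²) values reproduces curves like those in Figure 9.3.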
“Prior distributions for variance components were built on the basis of information from
the literature. For ovulation rate, most of the published research shows heritabilities of
either 0.1 or 0.4, ranging from 0.1 to 0.6 (Blasco et al. 1993). Bidanel et al. (1992) reports
an estimate of heritability of 0.11 with a standard error of 0.02 in a French Large White
population. On the basis of this prior information for ovulation rate, and assuming [without
error] a phenotypic variance of 6.25 (Bidanel et al. 1996), three different sets of prior
distributions reflecting different states of knowledge were constructed for the variance
components. In this way, we can study how the use of different prior distributions affects
the conclusions from the experiment. The first set is an attempt to ignore prior knowledge
about the additive variance for ovulation rate. This was approximated assuming a uniform
distribution, where the additive variance can take any positive value up to the assumed
value of the phenotypic variance, with equal probability. In set two, the prior distribution of
the additive variance is such that its most probable value is close to 2.5 [corresponding to
a heritability of 0.4], but the opinion about this value is rather vague. Thus, the
approximate prior distribution assigns similar probabilities to different values of the
additive variance of 2.5. The last case is state three, which illustrates a situation where a
stronger opinion about the probable distribution of the additive variance is held, a priori,
based on the fact that the breed used in this experiment is the same as in Bidanel et al.
(1992). The stronger prior opinion is reflected in a smaller prior standard deviation. Priors
describing states two and three are scaled inverted chi-square distributions. The scaled
inverted chi-square distribution has two parameters, v and S2. These parameters were
varied on a trial and error basis until the desired shape was obtained. Figure 1 [Figure 9.3
in this book] illustrates the three prior densities for the additive variance for ovulation
rate.”
Blasco et al., 1998
In the second example, Blasco et al. (2001) made an attempt at constructing
prior information on uterine capacity in rabbits. Here the phenotypic variances
are considered to be estimated with error, and the authors argue about
heritabilities using the transformation that can be found in Sorensen and
Gianola (2002).
“Attempts were made to choose prior distributions that represent the state of knowledge
available on uterine capacity up to the time the present experiment was initiated. This
knowledge is however extremely limited; the only available information about this trait has
been provided in rabbits by BOLET et al. (1994) and in mice by GION et al. (1990), who
report heritabilities of 0.05 and 0.08 respectively… Under this scenario of uncertain prior
information, we decided to consider three possible prior distributions. State 1 considers
proper uniform priors for all variance components. The (open) bounds assumed for the
additive variance, the permanent environmental variance and the residual variance were
(0.0, 2.0), (0.0, 0.7) and (0.0, 10.0), respectively. Uniform priors are used for two reasons: as
an attempt to show prior indifference about the values of the parameters, and to use them as
approximate reference priors in Bayesian analyses. In states 2 and 3, we have assumed
scaled inverse chi-square distributions for the variance components, as proposed by
SORENSEN et al. (1994). The scaled inverse chi-square distribution has 2 parameters, ν
and S2, which define the shape. In state 2, we assigned the following values to these two
parameters: (6.5, 1.8), (6.5, 0.9) and (30.0, 6.3) for the additive genetic, permanent
environmental and residual variance, respectively. In state 3, the corresponding values of ν
and S2 were (6.5, 0.3), (6.5, 0.2) and (20.0, 10.0). The implied, assumed mean value for the
heritability and repeatability under these priors, approximated … [Sorensen and Gianola,
2002], is 0.15 and 0.21, respectively for state 1, 0.48 and 0.72 for state 2, and 0.08 and 0.16
for state 3”.
Blasco et al., 2001.

Figure 9.3. Prior densities showing three different states of belief about the
heritability of ovulation rate in French Landrace pigs (from Blasco et al., 1998).
These are particularly complex examples of constructing prior beliefs, but there
are cases in which it is not even feasible to construct prior opinions. In
particular, in the multivariate case, comparisons between several priors should
be made with caution. We often find authors who express their multivariate
beliefs as inverted Wishart distributions (the multivariate version of the inverted
gamma distribution), changing the hyperparameters arbitrarily and saying that
“since the results almost do not change, this means that we have enough data,
and that prior information does not affect the results”. If we do not know the
amount of information we are introducing when changing priors, this is
nonsense, because we can always find a multivariate prior sharp enough to
dominate the results. Moreover, inverted Wishart distributions without bounds
can produce covariance components that are obvious outliers. There is no clear
solution to this problem. Blasco et al. (2003) used priors based on the
observation that, in the univariate case, the parameter S2 is frequently used to
represent the variability of the density. They generalized this to the multivariate
case, taking the corresponding parameter of the Wishart distribution as similar
to a matrix of (co)variances:
We can then compare the two possible states of opinion, and study how the use of
the different prior distributions affects the conclusions from the experiment. We first used
flat priors (with limits that guarantee the propriety of the distribution) for two reasons: to
show indifference about their value and to use them as reference priors, since they are
usual in Bayesian analyses. Since prior opinions are difficult to draw in the multivariate
case, we chose the second prior by substituting a (co)variance matrix of the components
in the hyper parameters SR and SG and using nR = nG = 3, as proposed by Gelman et al.
[11] in order to have a vague prior information. These last priors are based on the idea
that S is a scale-parameter of the inverted Wishart function, thus using for SR and SG
prior covariance matrixes with a low value for n, would be a way of expressing prior
uncertainty. We proposed SR and SG from phenotypic covariances obtained from the data
of Blasco and Gómez [5].
Blasco et al., 2003.
The reader probably feels that we are far from the beauty initially promised by
the Bayesian paradigm, in which prior information was integrated with the
information of the experiment to better assess the current knowledge. Therefore,
the reader should not be surprised to learn that modern Bayesian
statisticians tend to avoid vague prior information, or to use it only as a tool with
no particular meaning. As Bernardo and Smith (1994) say:
“The problem of characterizing a ‘non-informative’ or ‘objective’ prior distribution,
representing ‘prior ignorance’, ‘vague prior knowledge’ and ‘letting the data speak for
themselves’ is far more complex than the apparent intuitive immediacy of these words would
suggest… ‘vague’ is itself much too vague an idea to be useful. There is no “objective” prior
that represents ignorance… the reference prior component of the analysis is simply a
mathematical tool.”
Bernardo and Smith, 1994
The advantage of using this “mathematical tool” is no longer the possibility of
introducing prior information into the analysis, but the possibility of still working
with probabilities, with all the advantages that we have seen in chapter 2
(multiple confidence intervals, marginalisation, etc.).
9.3. No prior information
The assumption of having no prior information sounds quite unrealistic. We
may have rather vague prior information, but in Biology it is difficult to sustain
that we have a complete lack of information. We can say that we have no
information about the average height of Scottish people, but we know at least
that they are humans, thus it is very improbable that their average height is
above 2 m or below 1.5 m. It is also improbable, but less so, that their average
height is above 1.9 m or below 1.6 m. We can construct a vague state of
beliefs as before, based only on our vague information about how humans are.
As we have seen, this vague state of beliefs will not affect our results unless
our sample is very small. However, even in this case, we may be interested in
knowing which results we would obtain if we had no prior information at all; i.e.,
what the results would look like if they were based only on our data. The
problem is not as easy as it appears, and several methods to deal with the
no-information case have been proposed. We will only examine some of them
since, as we have said, vague prior information will be enough for most of the
problems we have in Biology or Agriculture.
9.3.1. Flat priors
As we said in chapter 2 (paragraph 2.1.3), we can try to represent ignorance as
synonymous with indifference, and say that all possible values of the
parameters to be estimated were equally probable before performing the
experiment (3). Since the origins of Bayesian inference (Laplace, 1774) and
during its development in the XIX century, Bayesian inference was always
performed under the supposition that prior ignorance was well represented by
flat priors. Laplace himself, Gauss, Pearson and others suspected that flat
priors did not represent prior ignorance, and moved on to examine the
properties of the sampling distribution. Including actual prior information in the
inferences, instead of using flat priors, was not proposed until the work of
Ramsey and de
3 Bernard Bosanquet, a XIXth century British logician, quotes a play by Richard Sheridan to
illustrate the principle of indifference, which Bosanquet considers a sophism:
ABSOLUTE. - Sure, Sir, this is not very reasonable, to summon my affection for a lady I
know nothing of.
SIR ANTHONY. - I am sure, Sir, it is more unreasonable in you to object to a lady you
know nothing of.
Finetti, quoted before. It is quite easy to see why flat priors cannot represent
ignorance. Suppose we believe that we do not have any prior information about
the heritability of a trait. If we represent this using a flat prior (figure 9.4), the
event A “the heritability is lower than 0.5” has a probability of 50% (blue area).
Take now the event B “the square of the heritability is lower than 0.25”. This is
exactly the same event (4) as event A, thus its probability should be the same,
also 50%.
Figure 9.4. Flat priors are informative
We are just as ignorant about h4 as about h2; thus, if flat priors represent
ignorance, we should also represent the ignorance about h4 with a flat prior.
However, if we do this, and we also maintain that P(h4 < 0.25) = 50%, we arrive
at an absurd conclusion: we do not know anything about h2, but we know that h4
is closer to zero than h2.
To avoid this absurd conclusion, we have to admit that flat priors do not
represent ignorance: they are informative. The problem is that we do not
know what this prior information really means. However, this information is very
vague and should not cause problems; in most cases the data will dominate
and the results will not be practically affected by the prior. Nevertheless, it is
bad practice to call flat priors “non-informative priors”.
4 If the reader does not like squares, take half of the heritability or other transformation.
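The argument can be verified with a quick Monte Carlo sketch (ours, not the book’s): drawing h² from a flat prior and squaring it shows that the implied prior on h⁴ is not flat.

```python
# Flat priors are informative: if h2 is uniform on (0, 1), its square h4
# is *not* uniform, since small values become more probable.

import random

random.seed(1)
h2 = [random.random() for _ in range(100_000)]   # flat prior on h2
h4 = [x * x for x in h2]                         # the same draws, transformed

p_A = sum(x < 0.5 for x in h2) / len(h2)      # P(h2 < 0.5)  ~ 0.5
p_B = sum(x < 0.25 for x in h4) / len(h4)     # P(h4 < 0.25) ~ 0.5 (same event)
p_flat_h4 = sum(x < 0.5 for x in h4) / len(h4)  # P(h4 < 0.5) ~ 0.71, not 0.5

print(round(p_A, 2), round(p_B, 2), round(p_flat_h4, 2))
```

Events A and B coincide draw by draw, so their probabilities agree; but P(h4 < 0.5) is about 0.71 rather than 0.5, so the implied prior on h4 piles mass near zero instead of being flat.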
9.3.2. Jeffreys prior
Ideally, the estimate of a parameter should be invariant to transformations. It is
somewhat annoying that, having an estimate of the standard deviation (for
example, the mean of the marginal posterior distribution of the standard
deviation) and an estimate of the variance (for example, the mean of the
marginal posterior distribution of the variance), one is not the square of the
other. It is quite common that the good properties of estimators are not
conserved when transforming the parameter; for example, the least squares
estimator of the variance is unbiased and is also ‘least squares’, but the square
root of it is a biased estimator of the standard deviation and is not a ‘least
squares’ estimator anymore. In the case of representing prior information, we
would like our prior information to be conserved after transformations; i.e., if
we represent vague information about the standard deviation, we would like the
information about the variance to be equally vague. For example, if the prior
information on the standard deviation is represented by f(σ), the prior
information on the variance should be, according to the transformation rule we
exposed in 3.3.2,

f(σ²) = f(σ) · |dσ/dσ²| = f(σ) · 1/(2σ)
Harold Jeffreys proposed to use priors that are invariant to transformations, so
that if f(σ) is the Jeffreys prior for σ, then the transformed density f(σ²) is the
Jeffreys prior for σ². For example, the Jeffreys prior for the standard deviation is

f(σ) ∝ 1/σ

Then the prior for the variance would be

f(σ²) = f(σ) · 1/(2σ) = (1/σ) · 1/(2σ) = 1/(2σ²) ∝ 1/σ²

which is proportional to the square of the Jeffreys prior for the standard deviation.
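A small numerical check of this change of variables (a sketch; since these priors are improper, we compare only unnormalised densities):

```python
import math

def f_sigma(s):
    """Jeffreys prior for the standard deviation, up to a constant."""
    return 1.0 / s

def f_sigma2(s2):
    """Prior implied for the variance by the transformation rule:
    f(sigma2) = f(sigma) * |d sigma / d sigma2| = (1/sigma) * 1/(2*sigma)."""
    s = math.sqrt(s2)
    jacobian = 1.0 / (2.0 * s)
    return f_sigma(s) * jacobian

# The transformed density equals 1 / (2 * sigma2) on an arbitrary grid,
# i.e. it is proportional to 1/sigma2, the Jeffreys prior for the variance.
for s2 in (0.5, 1.0, 4.0):
    assert abs(f_sigma2(s2) - 1.0 / (2.0 * s2)) < 1e-12
print("transformed prior is proportional to 1/sigma2")
```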
The Jeffreys prior of a parameter θ is

f(θ) ∝ √( −E_y[ ∂² log f(y | θ) / ∂θ² ] )
For example, the Jeffreys prior of the variance is

f(σ²) ∝ √( −E_y[ ∂² log f(y | σ²) / ∂(σ²)² ] ) ∝ 1/σ²
The deduction of this prior is in appendix 9.1. In Appendix 9.2 we show that
these priors are invariant to transformations.
Jeffreys priors are widely used in univariate problems, but they lead to some
paradoxes in multivariate problems.
9.3.3. Bernardo’s “Reference” priors
If we do not have prior information and we know that all priors are informative, a
sensible solution may be to use priors with minimum information. Bernardo
(1979) proposed to calculate posterior distributions in which the amount of
information provided by the prior is minimal. These priors have the great
advantage of being invariant to reparametrization.
To build these posterior densities we need:
1. To use some definition of information. Bernardo proposes the
definition of Shannon (1948), which we will see in chapter 10.
2. To define the amount of information provided by an experiment: this
is defined as the distance between the prior and the posterior
information, averaged over all possible samples.
3. To use some definition of distance between distributions. Bernardo
uses Kullback's divergence between distributions, which is
based on Bayesian arguments. We will see Kullback's divergence in
chapter 10.
4. To find a technically feasible way for solving the problem and deriving
the posterior distributions. This is technically complex and reference
priors are not easy to derive.
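As an illustration of steps 2 and 3 (a sketch not from the book, using a hypothetical binomial experiment with a Beta prior), the expected information provided by an experiment can be approximated by simulation: draw parameters from the prior, simulate data, and average the Kullback divergence between posterior and prior over the simulated samples.

```python
import math
import numpy as np

rng = np.random.default_rng(1)
n, a, b = 10, 1.0, 1.0                  # Bernoulli trials, Beta(a, b) prior (illustrative)
th = np.linspace(1e-4, 1 - 1e-4, 2000)  # grid over the parameter theta
dth = th[1] - th[0]

def beta_pdf(x, a, b):
    # Beta(a, b) density, computed through log-gamma for numerical stability
    logc = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return np.exp(logc + (a - 1) * np.log(x) + (b - 1) * np.log(1 - x))

prior = beta_pdf(th, a, b)

# average, over the prior predictive, the Kullback divergence
# between the posterior and the prior (steps 2 and 3 above)
m, kl_sum = 2000, 0.0
for _ in range(m):
    theta = rng.beta(a, b)                 # draw a "true" parameter
    y = rng.binomial(n, theta)             # simulate the experiment
    post = beta_pdf(th, a + y, b + n - y)  # conjugate Beta posterior
    kl_sum += np.sum(post * np.log(post / prior)) * dth  # KL on the grid

expected_information = kl_sum / m
print(expected_information)  # expected information in nats, always positive
```

A reference prior is, loosely speaking, the prior that maximizes this expected information, so that the data dominate the inference as much as possible; step 4 (actually deriving it) is the technically hard part.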
In the multivariate case we should transform the multivariate problem into
univariate ones, taking into account the parameters that are not of interest. For
technical reasons, this should be done by conditioning on the other parameters in
some order. The problem is that the reference prior obtained differs depending
on the order of conditioning. This is somewhat uncomfortable, since it
would oblige the scientist to consider several orders of conditioning to see
how the results differ, a procedure feasible with few parameters, but not in
highly parameterized problems. For the latter cases, the
strategy would be to use sensible vague informative priors when possible, or to
nest the problems as we have seen in chapter 8, where parameters of growth
curves were under environmental and genetic effects that were normally
distributed with known variances, defined using conjugated priors
with some hyperparameters (for example ν and S² as before). Reference
priors would then be calculated only in the last step, for the hyperparameters, or in
earlier steps in cases in which we do not want to consider the prior information
introduced (by the distribution we chose).
Bernardo's reference priors are obviously beyond the scope of this book.
However, José-Miguel Bernardo has developed at the University of Valencia a
promising area in which the same "reference" idea has been applied to
hypothesis tests, credibility intervals and other areas, creating a "reference
Bayesian statistics".
9.4. Improper priors
Some priors are not densities, for example $f(\theta) = k$, where $k$ is an arbitrary
constant; it is not a density because $\int f(\theta)\,d\theta = \infty$. However, improper priors
lead to proper posterior densities when

$$\int f(y\,|\,\theta)\, f(\theta)\, d\theta < \infty$$
Sometimes they are innocuous and do not affect the inference; for example, take

$$f(y\,|\,\mu) = N(\mu, 1), \qquad f(\mu) = k$$

Then

$$f(y) = \int f(y\,|\,\mu)\, f(\mu)\, d\mu = k \int f(y\,|\,\mu)\, d\mu = k \int \frac{1}{\sqrt{2\pi}}\exp\!\left[-\frac{(y-\mu)^2}{2}\right] d\mu = k$$

since the normal density, viewed as a function of $\mu$, integrates to one. Therefore

$$f(\mu\,|\,y) = \frac{f(y\,|\,\mu)\, f(\mu)}{f(y)} = \frac{f(y\,|\,\mu)\, k}{k} = f(y\,|\,\mu)$$

thus in this case the posterior density of μ does not take into account the prior.
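The cancellation of $k$ can be verified numerically. This minimal sketch (with an illustrative observation $y$ and constant $k$, not from the book) computes the posterior on a grid under the flat prior $f(\mu) = k$ and checks that it coincides with the normalized likelihood $N(y, 1)$, whatever the value of $k$.

```python
import numpy as np

y = 1.3   # a single observation from N(mu, 1) (illustrative value)
k = 7.0   # arbitrary constant of the improper flat prior f(mu) = k
mu = np.linspace(y - 8, y + 8, 4001)  # grid wide enough to hold all the mass
dmu = mu[1] - mu[0]

like = np.exp(-0.5 * (y - mu) ** 2) / np.sqrt(2 * np.pi)  # f(y|mu)
post = like * k              # unnormalized posterior f(y|mu) f(mu)
post /= post.sum() * dmu     # normalize numerically on the grid

# the posterior is N(y, 1) regardless of k
target = np.exp(-0.5 * (mu - y) ** 2) / np.sqrt(2 * np.pi)
print(np.max(np.abs(post - target)))  # essentially zero
```

Changing `k` to any other positive value leaves `post` unchanged, which is exactly the sense in which this improper prior is innocuous.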
In general, it is recommended to always use proper priors, to be sure that we
always obtain proper posterior densities. When using Gibbs sampling, some
densities look proper but may be improper. Although when using
MCMC all densities are, in practice, "proper" (we never sample at
infinity), samples can have very long burn-in periods and can lead to chains
that only apparently have converged. The recommendation is to always use
proper priors (bounded priors with reasonable limits, for example), unless it has
been proved that they are innocuous (Hobert and Casella, 1992).
9.5. The Achilles heel of Bayesian inference

Bayesian inference, or inverse probability, as it was always called before and
should still be called, is extremely attractive because of the use of probabilities
and the possibility of integrating prior information. However, integrating prior
information is much more difficult than the optimistic Bayesians of the fifties
thought. We have given an example in which actual prior information was used,
and there are some fields of knowledge in which integrating prior information
has been successful (see McGrayne, 2011, for an excellent and entertaining
history of this "proper" Bayesian inference). However, in the biological and
agricultural fields, in which the models used often have many parameters to be
estimated and multivariate procedures are frequent, integrating prior
information has been shown to be a challenge beyond the possibilities of the
average researcher. This situation has led to the use of several artefacts
substituting the proper 'prior information' in order to make the use of
probability possible. Some statisticians think that an artefact multiplied by a
probability will give an artefact and not a probability, and consequently they are
reluctant to use Bayesian inference. There is no definitive answer to this
problem, and it is a matter of opinion whether to use Bayesian or frequentist
statistics; both are now widely used and no paper will be refused by a publisher
because it uses one type of statistics or the other.
Many users of statistics, like the author of this book, are not "Bayesians" or
"frequentists", but just people with problems. Statistics is a tool for helping to
solve these problems, and users are more interested in the availability of easy
solutions and friendly software than in the background philosophy. I use
Bayesian statistics because I understand probability better than significance
levels and because it permits me to express my results in a clearer way for
later discussion. Some other users prefer Bayesian statistics because there is a
route for solving their problems: make a joint posterior distribution, find the
conditionals and use MCMC to find the marginal distributions. We behave as if
we were working with real probabilities. Frequentists should not object to this,
since they behave as if their alternative hypothesis were true or their
confidence interval contained the true value. We have seen in chapter 1 that
Neyman and Pearson (1933) justified frequentist statistics on the grounds of
right scientific behaviour. Behaving as if the probabilities found using 'innocent
priors' were true probabilities does not seem to be dangerous, since very little
prior information is introduced in the inference. To acknowledge the true
probabilities drives us to the Problem of Induction, a very difficult problem out of
the scope of this book. The Appendix "Three new dialogues between Hylas and
Filonus" tries to expose this problem in a literary form. Virgil, in an agrarian
context, expressed our wish for certainty. He said:
"Felix qui potuit rerum cognoscere causas" (Happy the man who can
know the causes of things!)
Virgil, Georgics II, 490
Appendix 9.1

We deduce the Jeffreys prior of the variance,

$$f(\sigma^2) \propto \sqrt{I(\sigma^2\,|\,\mathbf{y})} = \sqrt{-\,\mathrm{E}_y\!\left[\frac{\partial^2 \log f(\mathbf{y}\,|\,\sigma^2)}{\partial (\sigma^2)^2}\right]}$$

For a sample of $n$ independent observations $y_i \sim N(\mu, \sigma^2)$,

$$f(\mathbf{y}\,|\,\sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\!\left[-\frac{\sum_i (y_i-\mu)^2}{2\sigma^2}\right]$$

$$\log f(\mathbf{y}\,|\,\sigma^2) = k - \frac{n}{2}\log\sigma^2 - \frac{\sum_i (y_i-\mu)^2}{2\sigma^2}$$

$$\frac{\partial \log f(\mathbf{y}\,|\,\sigma^2)}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{\sum_i (y_i-\mu)^2}{2\sigma^4}$$

$$\frac{\partial^2 \log f(\mathbf{y}\,|\,\sigma^2)}{\partial (\sigma^2)^2} = \frac{n}{2\sigma^4} - \frac{\sum_i (y_i-\mu)^2}{\sigma^6}$$

Since

$$\mathrm{E}_y\!\left[\sum_i (y_i-\mu)^2\right] = n\,\mathrm{E}_y\!\left[(y_i-\mu)^2\right] = n\sigma^2$$

we have

$$-\,\mathrm{E}_y\!\left[\frac{\partial^2 \log f(\mathbf{y}\,|\,\sigma^2)}{\partial (\sigma^2)^2}\right] = -\frac{n}{2\sigma^4} + \frac{n\sigma^2}{\sigma^6} = \frac{n}{2\sigma^4} = \frac{n}{2}\cdot\frac{1}{\sigma^4}$$

Therefore

$$f(\sigma^2) \propto \sqrt{I(\sigma^2\,|\,\mathbf{y})} = \sqrt{-\,\mathrm{E}_y\!\left[\frac{\partial^2 \log f(\mathbf{y}\,|\,\sigma^2)}{\partial (\sigma^2)^2}\right]} \propto \frac{1}{\sigma^2}$$
Appendix 9.2

We show that Jeffreys priors are invariant to transformations. Let $\theta' = g(\theta)$. By the chain rule,

$$\frac{\partial \log f(y\,|\,\theta')}{\partial \theta'} = \frac{\partial \log f(y\,|\,\theta)}{\partial \theta}\cdot\frac{d\theta}{d\theta'}$$

so that

$$I(\theta'\,|\,y) = \mathrm{E}_y\!\left[\left(\frac{\partial \log f(y\,|\,\theta')}{\partial \theta'}\right)^{\!2}\right] = \mathrm{E}_y\!\left[\left(\frac{\partial \log f(y\,|\,\theta)}{\partial \theta}\right)^{\!2}\right]\left(\frac{d\theta}{d\theta'}\right)^{\!2} = I(\theta\,|\,y)\left(\frac{d\theta}{d\theta'}\right)^{\!2}$$

Taking square roots,

$$\sqrt{I(\theta'\,|\,y)} = \sqrt{I(\theta\,|\,y)}\,\left|\frac{d\theta}{d\theta'}\right|$$

which is exactly the transformation rule for densities; thus the Jeffreys prior of $\theta'$ is the transformed Jeffreys prior of $\theta$.