
AN INTRODUCTION TO BAYESIAN STATISTICS AND MCMC

CHAPTER 9

PRIOR INFORMATION

“We were certainly aware that inferences must make use of prior

information, but after some considerable thought and discussion round

these matters we came to the conclusion, rightly or wrongly, that it was so

rarely possible to give sure numerical values to these entities, that our line

of approach must proceed otherwise”

Egon Pearson, 1962.

9.1. Exact prior information

9.1.1. Prior information

9.1.2. Posterior probabilities with exact prior information

9.1.3. Influence of prior information on posterior probabilities

9.2. Vague prior information

9.2.1. A vague definition of vague prior information

9.2.2. Examples of the use of vague prior information

9.3. No prior information

9.3.1. Flat priors

9.3.2. Jeffreys priors

9.3.3. Bernardo’s “Reference” priors

9.4. Improper priors

9.5. The Achilles heel of Bayesian inference

9.1. Exact prior information

9.1.1. Prior information

We have delayed until this chapter the assessment of prior information, and we have used constant or conjugate priors without explaining the underlying reasons for using them. One of the most attractive characteristics of Bayesian theory is the possibility of integrating prior information into the inferences, so we should discuss here why this is so rarely done, at least in the biological area of knowledge. When exact prior information is available there is no discussion about Bayesian methods, and prior information can be integrated using the rules of probability. The following example is based on an example prepared by Fisher (1959), who was a notorious anti-Bayesian, but who never objected to the use of prior probability when clearly established. We introduced this example in chapter 2 (paragraph 2.1.3).

Suppose there is a type of laboratory mouse whose skin colour is controlled by a single gene with two alleles, ‘A’ and ‘a’, so that when the mouse has two copies of the recessive allele (aa) its skin is brown, and it is black in the other cases (AA and Aa). We cross a black and a brown mouse and we obtain two black mice, which must have the alleles (Aa) because they have received the allele ‘a’ from the brown parent, and the allele from the other parent must be ‘A’, otherwise they would have been brown. We now cross both black (Aa) mice, and we obtain a descendant that is black. This black mouse could have received both ‘A’ alleles from its parents, in which case we would say that it is homozygous (AA), or it could have received one ‘A’ allele from one parent and one ‘a’ allele from the other parent, in which case we would say it is heterozygous (Aa or aA). We are interested in knowing which type of black mouse it is, homozygous or heterozygous. In order to find out, we must cross this mouse with a brown mouse (aa) and examine the offspring (Figure 9.1).

Figure 9.1. Experiment to determine whether a parent is homozygous (AA) or

heterozygous (Aa or aA)

If we get a brown mouse as an offspring, we will know that our mouse is heterozygous, because the brown offspring has to be (aa) and must have received an ‘a’ allele from each parent; but if we only get black offspring we will still be in doubt about whether it is homo- or heterozygous. If we get many black offspring, it will be unlikely that the mouse is heterozygous, because the ‘a’ allele should have been transmitted to at least one of its descendants. Notice that before performing the experiment we can calculate the probability of obtaining black or brown offspring. We know that the mouse to be tested cannot be (aa) because otherwise it would be brown. Thus, it either received both ‘A’ alleles from its mother and father and therefore it is (AA), or an ‘A’ allele from the father and an ‘a’ allele from the mother to become (Aa), or the opposite, to become (aA). We have three possibilities, thus the probability of being (AA) is 1/3 and the probability of being heterozygous (‘Aa’ or ‘aA’, both genetically identical) is 2/3. This is what we expect before having any data from the experiment. Notice that these expectations are not merely ‘beliefs’, but quantified probabilities. Also notice that they come from our knowledge of Mendel’s laws and from the knowledge that our mouse is the offspring of two heterozygotes; this prior knowledge is based on previous experience.

9.1.2. Posterior probabilities with exact prior information

Now, the experiment is performed and we obtain three offspring, all black (Figure 9.2). They received for sure an ‘a’ allele from the mother and an ‘A’ allele from our mouse, but our mouse can still be homozygous (AA) or heterozygous (we will not distinguish between Aa and aA in the rest of the chapter, since they are genetically identical; we will use (Aa) for both). What is the probability of each type?

Figure 9.2. Experiment to determine whether a parent is homozygous (AA) or

heterozygous (Aa or aA)

To find out, we will apply Bayes’ theorem. The probability of being homozygous (AA) given that we have obtained three black offspring is

$$P(AA \mid y = 3\ \text{black}) = \frac{P(y = 3\ \text{black} \mid AA)\; P(AA)}{P(y = 3\ \text{black})}$$

We know that if it is true that our mouse is AA, the probability of obtaining a

black offspring is 1, since the offspring will always have an ‘A’ allele. Thus,

$$P(y = 3\ \text{black} \mid AA) = 1$$

We also know that the prior probability of being AA is 1/3, thus

$$P(AA) = \frac{1}{3} = 0.33$$

Finally, the probability of the sample is the sum of the probabilities of two mutually exclusive events: having a homozygous parent (AA) or having a heterozygous parent (Aa):

$$P(y = 3\ \text{black}) = P(y = 3\ \text{black}\ \&\ AA) + P(y = 3\ \text{black}\ \&\ Aa) = P(y = 3\ \text{black} \mid AA)\,P(AA) + P(y = 3\ \text{black} \mid Aa)\,P(Aa)$$

To calculate it we need the prior probability of being heterozygous, which we know is

$$P(Aa) = \frac{2}{3}$$

and the probability of obtaining our sample if it is true that our mouse is Aa. If our mouse were Aa, the only way of obtaining a black offspring is if this offspring got its ‘A’ allele from it, thus the probability of obtaining one black offspring would be ½. The probability of obtaining three black offspring will be ½ × ½ × ½, thus

$$P(y = 3\ \text{black} \mid Aa) = \left(\frac{1}{2}\right)^{3}$$

Now we can calculate the probability of our sample:

$$P(y = 3\ \text{black}) = P(y = 3\ \text{black} \mid AA)\,P(AA) + P(y = 3\ \text{black} \mid Aa)\,P(Aa) = 1 \cdot \frac{1}{3} + \left(\frac{1}{2}\right)^{3} \cdot \frac{2}{3} = 0.42$$

Then, applying Bayes’ theorem,

$$P(AA \mid y = 3\ \text{black}) = \frac{P(y = 3\ \text{black} \mid AA)\;P(AA)}{P(y = 3\ \text{black})} = \frac{1 \times 0.33}{0.42} = 0.80$$

The probability of being heterozygous can be calculated using Bayes’ theorem again, or simply

$$P(Aa \mid y = 3\ \text{black}) = 1 - P(AA \mid y = 3\ \text{black}) = 1 - 0.80 = 0.20$$

Thus, we had a prior probability before obtaining any data, and a posterior probability after obtaining three black offspring:

prior P(AA) = 0.33 → posterior P(AA | y) = 0.80
prior P(Aa) = 0.67 → posterior P(Aa | y) = 0.20

before the experiment was performed it was more probable that our mouse was

heterozygous (Aa), but after the experiment it is more probable that it is

homozygous (AA).

Notice that the sum of both probabilities is 1

P(AA | y) + P(Aa | y) = 1.00

thus the posterior probabilities give a relative measure of uncertainty (80% and

20% respectively).

However, using the maximum likelihood method we would also favour the homozygous case (AA), but we would not have a measure of evidence.

Consider the likelihood of both events; as we have seen before,

If it is true that our mouse is AA, the probability of obtaining a black offspring is 1

$$P(y = 3\ \text{black} \mid AA) = 1$$

If it is true that our mouse is Aa, the probability of obtaining our sample is

$$P(y = 3\ \text{black} \mid Aa) = \left(\frac{1}{2}\right)^{3} = 0.125$$

Notice that the sum of the likelihoods is not 1 because they come from different

events

P(y |AA) + P(y |Aa) = 1.125

thus the likelihoods do not provide a measure of uncertainty. Even if we rescale

the likelihood to force the sum to 1, the relative value of the likelihoods still does

not provide a relative measure of their evidence as the probability does.
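The whole calculation can be checked with a few lines of code. The following Python sketch (the function name and layout are ours, not from the book) reproduces the posterior probabilities under the exact prior and also prints the sum of the likelihoods:

```python
# Posterior probability that the tested black mouse is homozygous (AA)
# after observing n black offspring, using the exact prior P(AA) = 1/3.

def posterior_AA(n_black, prior_AA=1/3):
    """Return P(AA | n black offspring) by Bayes' theorem."""
    prior_Aa = 1 - prior_AA
    lik_AA = 1.0              # P(n black | AA): every offspring receives an 'A'
    lik_Aa = 0.5 ** n_black   # P(n black | Aa): each offspring receives 'A' with prob 1/2
    evidence = lik_AA * prior_AA + lik_Aa * prior_Aa   # P(n black)
    return lik_AA * prior_AA / evidence

p_AA = posterior_AA(3)
print(f"P(AA | 3 black) = {p_AA:.2f}")       # 0.80
print(f"P(Aa | 3 black) = {1 - p_AA:.2f}")   # 0.20

# The likelihoods of the two hypotheses do not sum to one:
print(f"P(y | AA) + P(y | Aa) = {1.0 + 0.5 ** 3:.3f}")   # 1.125
```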

9.1.3. Influence of prior information on posterior probabilities

If we had used flat priors instead of the exact prior information, repeating the calculation we would obtain

prior P(AA) = 0.50 → posterior P(AA | y) = 0.89
prior P(Aa) = 0.50 → posterior P(Aa | y) = 0.11

we can see that the flat prior information had an influence on the final result. We still favour the homozygous case (AA), but we are overestimating the evidence in its favour. When exact prior information is available, the correct way to make an inference is to use it.

Suppose now that we have a large amount of prior information favouring the case (Aa); suppose we know that the prior probability of being AA is P(AA) = 0.002. Computing the probabilities again we obtain

prior P(AA) = 0.002 → posterior P(AA | y) = 0.02
prior P(Aa) = 0.998 → posterior P(Aa | y) = 0.98

thus despite having evidence from the data in favour of AA, we decide that the mouse is Aa because the prior information dominates and the posterior distribution favours Aa. This has been a frequent criticism of Bayesian inference, but one wonders why an experiment should be performed at all when the previous evidence in favour of Aa is so strong.

What would have happened if, instead of three black offspring, we had obtained seven black offspring? Repeating the calculation for y = 7 black offspring, we obtain

prior P(AA) = 0.33 → posterior P(AA | y) = 0.99
prior P(Aa) = 0.67 → posterior P(Aa | y) = 0.01

If flat priors were used, we obtain

prior P(AA) = 0.50 → posterior P(AA | y) = 0.99
prior P(Aa) = 0.50 → posterior P(Aa | y) = 0.01

in this case the evidence provided by the data dominates over the prior

information. However, if the prior information is very accurate

prior P(AA) = 0.002 → posterior P(AA | y) = 0.33
prior P(Aa) = 0.998 → posterior P(Aa | y) = 0.67

thus even with more data, prior information dominates the final result when it is very accurate, which should not normally be the case. In general, prior information loses importance with larger samples. For example, if we have n uncorrelated data,

$$f(\theta \mid \mathbf{y}) \propto f(y_1, y_2, \ldots, y_n \mid \theta)\, f(\theta) = f(y_1 \mid \theta)\, f(y_2 \mid \theta) \cdots f(y_n \mid \theta)\, f(\theta)$$

and taking logarithms,

$$\log f(\theta \mid \mathbf{y}) = \log f(y_1 \mid \theta) + \log f(y_2 \mid \theta) + \cdots + \log f(y_n \mid \theta) + \log f(\theta) + \text{constant}$$

we can see that prior information has less and less importance as the amount of data increases.
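To illustrate this numerically with the mouse example, the sketch below (ours; the exact figures need not match the rounded values quoted above) computes the posterior probability of AA for several priors and increasing numbers of black offspring; with enough data all priors lead to essentially the same answer:

```python
# How the posterior P(AA | n black offspring) depends on the prior and on n.

def posterior_AA(n_black, prior_AA):
    lik_AA, lik_Aa = 1.0, 0.5 ** n_black
    return lik_AA * prior_AA / (lik_AA * prior_AA + lik_Aa * (1 - prior_AA))

priors = {"exact prior (1/3)": 1/3, "flat prior (1/2)": 1/2, "sharp prior (0.002)": 0.002}
for label, p in priors.items():
    row = ", ".join(f"n={n}: {posterior_AA(n, p):.3f}" for n in (3, 7, 15, 30))
    print(f"{label:22s} -> {row}")
```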

9.2. Vague prior information

9.2.1. A vague definition of vague prior information

It is infrequent to find exact prior information. Usually there is prior information, but it is not clear how to formalize it in order to describe it using a prior distribution. For example, if we are going to estimate the heritability of litter size in a rabbit breed, we know that this heritability has also been estimated in other breeds and has often given values between 0.05 and 0.11. We have a case in which the estimate was 0.30, but the standard error was high. We also have a high realized heritability from an experiment that we have reasons not to take too seriously. A high heritability was also presented at a congress, but that paper did not pass the usual peer-review filter and we tend to give less credibility to its results. Moreover, some of the experiments were performed in situations that are more similar to our own experiment, or with breeds that are closer to ours. It is obvious that we have prior information, but how can we manage all of this?

One of the disappointments on arriving at Bayesian inference, attracted by the possibility of using prior information, is that modern Bayesians tend to avoid the use of prior information due to the difficulties of defining it properly. A solution for integrating all sources of prior knowledge into a clearly defined prior was offered independently in the 1930s by the British philosopher Frank Ramsey (1931) and by the Italian mathematician Bruno de Finetti (1937), but the solution is unsatisfactory in many cases, as we will see. They propose that what we call probability is just a state of beliefs. This definition has the advantage of including events like the probability of obtaining a 6 when throwing a die and the probability of Scotland becoming an independent republic in this decade. Of course, in the first case we have some mathematical rules that will determine our beliefs and in the second we do not have these rules, but in both cases we can express sensible beliefs about the events.

Transforming probability, which looks like an external concept to us, into beliefs, which look like an arbitrary product of our daily mood, is a step that some scientists refuse to take. Nevertheless, there are three aspects to consider:

1. It should be clear that although beliefs are subjective, this does not

mean that they are arbitrary. Ideally, previous beliefs should be

expressed by experts and there should be a good agreement among

experts on how prior information is evaluated.

2. Prior beliefs should be vague and contain little information, otherwise

there is no reason to perform the experiment in the first place, as we

have seen in 9.1.3. In some cases, an experiment may be performed

in order to add more accuracy to a good previous estimation, but this

is not normally the case.

3. When there are enough data, prior information loses importance, and different prior beliefs can give rise to the same result, as we have seen in 9.1.3 (1).

There is another problem of a different nature. In the case of multivariate

analyses, it is almost impossible to determine a rational state of beliefs. For

example, suppose we are analysing heritabilities and genetic correlations of

three traits. How can we determine our beliefs about the heritability of the first

trait, when the second trait has a heritability of 0.2, the correlation between both

traits is -0.7, the heritability of the third trait is 0.1, the correlation between the

first and the third traits is 0.3 and the correlation between the second and the

third trait is 0.4; then our beliefs about the heritability of the first trait when the

heritability of the second trait is 0.1, …etc.? Here we are unable to represent

any state of beliefs, even a vague one.

9.2.2. Examples of the use of vague prior information

We can describe our prior information using many density functions, but it would

be convenient to propose a density that can be easily combined with the

distribution of the data, what we called a conjugate distribution in chapter 5

(paragraph 5.3.3). When we are comparing two means, as in the example in

5.3.3, it is easy to understand how we can construct a prior density. We

compare an a priori expectation of obtaining a difference of 100 g of live weight

between two treatments for poultry growth. We believe that it is less probable to

obtain 50 g, than 150 g. We also believe that it is rather improbable to obtain 25

g, as improbable as to obtain 175 g of difference between both treatments. We

can draw a diagram with our beliefs, representing the probabilities we give to

these differences. These beliefs are symmetrical around the most probable value, 100 g, and can be approximately represented by a Normal distribution. It is not important whether the representation of our beliefs is very accurate or not, since it should necessarily be vague; otherwise we would not perform the experiment, as we have said before.

(1) Bayesian statisticians often stress that when there are enough data the problem of the prior information becomes irrelevant. However, when there are enough data, statistics is rather irrelevant; the science of statistics is useful when we want to determine which part of our observation is due to random sampling and which part is due to a natural law.
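As a minimal sketch of this elicitation (the numbers are our own illustrative assumptions, except for the 100 g centre), one could centre a Normal prior at 100 g and choose its standard deviation so that differences as extreme as 25 g or 175 g receive little prior probability, for example by placing them about two standard deviations away from the mean:

```python
# Eliciting a vague Normal prior for the difference between two treatments.
# Centre: 100 g (the most probable value). Spread: chosen so that 25 g and
# 175 g (assumed 'rather improbable' values) lie about 2 sd from the mean.
from scipy.stats import norm

prior_mean = 100.0
prior_sd = (175.0 - 100.0) / 2.0   # = 37.5 g, an assumption for illustration

prior = norm(loc=prior_mean, scale=prior_sd)
for d in (25, 50, 100, 150, 175):
    print(f"prior density at {d:3d} g: {prior.pdf(d):.5f}")
print(f"P(25 g < difference < 175 g) = {prior.cdf(175) - prior.cdf(25):.2f}")  # ~0.95
```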

This is a simple example about how to construct prior beliefs. The reader can

find friendly examples about how to construct prior beliefs for simpler problems

in Press (2002). However, it is generally difficult to work with prior information in

more complex problems. To illustrate the difficulties of working with vague prior information, we will show here two examples of attempts to construct prior information for variance components in genetic analyses. In the first attempt, Blasco et al. (1998) try to express the prior information about variance components for ovulation rate in pigs. As the available information is on heritabilities, they consider that the phenotypic variance is estimated without error (in genetic experiments, often performed with a large amount of data, this error is very small), and express their beliefs about additive variances just as if they were heritabilities (2). The error variance priors are constructed according to these beliefs. Blasco et al. (1998) used inverted gamma densities to express the uncertainty about the heritability of one trait, because they are conjugate priors that can be easily combined with the distribution of the data, as we have seen in chapter 5 (paragraph 5.3.3). These functions depend on two parameters that we called α and β in 3.5.4,

$$f(x \mid \alpha, \beta) \propto \frac{1}{x^{\alpha + 1}} \exp\left(-\frac{\beta}{x}\right)$$

The shape of the function changes when changing α and β, thus the researcher

can try different values for these parameters until she finds a shape that agrees

with her beliefs (an example is drawn in figure 9.3). These parameters are

frequently called ν (‘degrees of freedom’) and S² (‘window’ parameter) for prior densities of variances, because they affect mainly the shape and the width of the distribution, respectively.

(2) There are other alternatives; for example, thinking in terms of standardised data to construct prior beliefs for the additive variance.

$$f(\sigma^2 \mid \nu, S^2) \propto \left(\frac{1}{\sigma^2}\right)^{\frac{\nu}{2} + 1} \exp\left(-\frac{\nu S^2}{2\sigma^2}\right)$$

This leads to some confusion, since the shape and width of the distribution depend on both parameters, but also because their only meaning for us is as two parameters that change the distribution until it represents our prior beliefs, without any relationship with “degrees of freedom”. Moreover, ν does not need to be a natural number; for example, Blasco et al. (1998) used ν = 2.5 and ν = 6.5 for the prior of one of the variances. We will now see how Blasco et al. (1998) constructed their prior densities.

“Prior distributions for variance components were built on the basis of information from

the literature. For ovulation rate, most of the published research shows heritabilities of

either 0.1 or 0.4, ranging from 0.1 to 0.6 (Blasco et al. 1993). Bidanel et al. (1992) reports

an estimate of heritability of 0.11 with a standard error of 0.02 in a French Large White

population. On the basis of this prior information for ovulation rate, and assuming [without

error] a phenotypic variance of 6.25 (Bidanel et al. 1996), three different sets of prior

distributions reflecting different states of knowledge were constructed for the variance

components. In this way, we can study how the use of different prior distributions affects

the conclusions from the experiment. The first set is an attempt to ignore prior knowledge

about the additive variance for ovulation rate. This was approximated assuming a uniform

distribution, where the additive variance can take any positive value up to the assumed

value of the phenotypic variance, with equal probability. In set two, the prior distribution of

the additive variance is such that its most probable value is close to 2.5 [corresponding to

a heritability of 0.4], but the opinion about this value is rather vague. Thus, the

approximate prior distribution assigns similar probabilities to different values of the

additive variance of 2.5. The last case is state three, which illustrates a situation where a

stronger opinion about the probable distribution of the additive variance is held, a priori,

based on the fact that the breed used in this experiment is the same as in Bidanel et al.

(1992). The stronger prior opinion is reflected in a smaller prior standard deviation. Priors

describing states two and three are scaled inverted chi-square distributions. The scaled

inverted chi-square distribution has two parameters, v and S2. These parameters were

varied on a trial and error basis until the desired shape was obtained. Figure 1 [Figure 9.3

in this book] illustrates the three prior densities for the additive variance for ovulation

rate.”

Blasco et al., 1998

Figure 9.3. Prior densities showing three different states of belief about the heritability of ovulation rate in French Landrace pigs (from Blasco et al., 1998).
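The trial-and-error search for ν and S² described in the quotation is easy to reproduce numerically. The sketch below is our own illustration (the candidate (ν, S²) pairs are arbitrary examples, not the values used by Blasco et al.); it uses the fact that a scaled inverted chi-square with parameters ν and S² is an inverted gamma with α = ν/2 and β = νS²/2, so its shape can be inspected for different parameter values:

```python
# Comparing shapes of scaled inverted chi-square priors for an additive variance.
# Scaled-inv-chi2(v, S2) is an inverted gamma with alpha = v/2 and beta = v*S2/2.
import numpy as np
from scipy.stats import invgamma

def scaled_inv_chi2(v, S2):
    return invgamma(a=v / 2, scale=v * S2 / 2)

# Illustrative (v, S2) pairs; the phenotypic variance assumed in the quote is 6.25,
# so an additive variance of 2.5 corresponds to a heritability of 0.4.
candidates = {"vague, mode near 2.5": (2.5, 4.5), "sharper, mode near 2.5": (12.0, 2.9)}

x = np.linspace(0.01, 6.25, 200)
for label, (v, S2) in candidates.items():
    d = scaled_inv_chi2(v, S2)
    mode = v * S2 / (v + 2)   # mode of the scaled inverted chi-square
    print(f"{label:25s} v={v}, S2={S2}, mode={mode:.2f}, P(var < 6.25)={d.cdf(6.25):.2f}")
    densities = d.pdf(x)      # could be plotted to judge the shape visually
```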

In the second example, Blasco et al. (2001) make an attempt to draw prior information on uterine capacity in rabbits. Here the phenotypic variances are considered to be estimated with error, and the authors argue about heritabilities using the transformation that can be found in Sorensen and Gianola (2002).

“Attempts were made to choose prior distributions that represent the state of knowledge

available on uterine capacity up to the time the present experiment was initiated. This

knowledge is however extremely limited; the only available information about this trait has

been provided in rabbits by BOLET et al. (1994) and in mice by GION et al. (1990), who

report heritabilities of 0.05 and 0.08 respectively… Under this scenario of uncertain prior

information, we decided to consider three possible prior distributions. State 1 considers

proper uniform priors for all variance components. The (open) bounds assumed for the

additive variance, the permanent environmental variance and the residual variance were

(0.0, 2.0), (0.0, 0.7) and (0.0, 10.0), respectively. Uniform priors are used for two reasons: as an attempt to show prior indifference about the values of the parameters, and to use them as

approximate reference priors in Bayesian analyses. In states 2 and 3, we have assumed

scaled inverse chi-square distributions for the variance components, as proposed by

SORENSEN et al. (1994). The scaled inverse chi-square distribution has 2 parameters, ν and S2, which define the shape. In state 2, we assigned the following values to these two parameters: (6.5, 1.8), (6.5, 0.9) and (30.0, 6.3) for the additive genetic, permanent environmental and residual variance, respectively. In state 3, the corresponding values of ν and S2 were (6.5, 0.3), (6.5, 0.2) and (20.0, 10.0). The implied, assumed mean value for the

heritability and repeatability under these priors, approximated …[Sorensen and Gianola,

2002], is 0.15 and 0.21, respectively for state 1, 0.48 and 0.72 for state 2, and 0.08 and 0.16

for state 3”.

Blasco et al., 2001.

These are particularly complex examples of constructing prior beliefs, but there

are cases in which it is not feasible to construct prior opinions. In particular, in

the multivariate case, comparisons between several priors should be made with

caution. We often find authors who express their multivariate beliefs as inverted Wishart distributions (the multivariate version of the inverted gamma distributions), changing the hyperparameters arbitrarily and saying that “since the results hardly change, this means that we have enough data and that prior information does not affect the results”. If we do not know the amount of information we are introducing when changing priors, this is nonsense, because we can always find a multivariate prior sharp enough to dominate the results. Moreover, inverted Wishart distributions without bounds can produce covariance components that are obvious outliers. There is no clear solution to this problem. Blasco et al. (2003) use priors based on the fact that, in the univariate case, the parameter S² is frequently used to represent the variability of the density distribution. They generalize to the multivariate case and take the corresponding parameter of the inverted Wishart distribution to be similar to a matrix of (co)variances:

We can then compare the two possible states of opinion, and study how the use of

the different prior distributions affects the conclusions from the experiment. We first used

flat priors (with limits that guarantee the propriety of the distribution) for two reasons: to

show indifference about their value and to use them as reference priors, since they are

usual in Bayesian analyses. Since prior opinions are difficult to draw in the multivariate

case, we chose the second prior by substituting a (co)variance matrix of the components

in the hyper parameters SR and SG and using nR = nG = 3, as proposed by Gelman et al.

[11] in order to have a vague prior information. These last priors are based on the idea

that S is a scale-parameter of the inverted Wishart function, thus using for SR and SG

prior covariance matrixes with a low value for n, would be a way of expressing prior

uncertainty. We proposed SR and SG from phenotypic covariances obtained from the data

of Blasco and Gómez [5].

Blasco et al., 2003.

The reader probably feels that we are far from the beauty initially proposed by the Bayesian paradigm, in which prior information was integrated with the information of the experiment to better assess the current knowledge. Therefore, the reader should not be surprised to learn that modern Bayesian statisticians tend to avoid vague prior information, or to use it only as a tool with no particular meaning. As Bernardo and Smith (1994) say:

no particular meaning. As Bernardo and Smith (1994) say:

“The problem of characterizing a ‘non-informative’ or ‘objective’ prior distribution,

representing ‘prior ignorance’, ‘vague prior knowledge’ and ‘letting the data speak for

themselves’ is far more complex than the apparent intuitive immediacy of these words would

suggest… ‘vague’ is itself much too vague idea to be useful. There is no “objective” prior

that represents ignorance… the reference prior component of the analysis is simply a

mathematical tool.”

Bernardo and Smith, 1994

The advantage of using this “mathematical tool” is no longer the possibility of introducing prior information into the analysis, but the possibility of still working with probabilities, with all the advantages that we have seen in chapter 2 (multiple confidence intervals, marginalisation, etc.).

9.3. No prior information

The assumption of having no prior information sounds quite unrealistic. We can have rather vague prior information, but in Biology it is difficult to sustain that we have a complete lack of information. We can say that we have no information about the average height of the Scottish people, but we know at least that they are humans, thus it is very improbable that their average height is higher than 2 m or lower than 1.5 m. It is also improbable, but less improbable, that their average height is higher than 1.9 m or lower than 1.6 m. We can construct a vague state of beliefs as before, based only on our vague information about what humans are like. Later, as we have seen, this vague state of beliefs will not affect our results unless our sample is very small. However, even in this case, we may be interested in knowing which results we would obtain if we had no prior information at all; i.e., what the results would look like if they were based only on our data. The problem is not as easy as it appears, and several methods to deal with the no-information case have been proposed. We will only examine some of them, since, as we have said, using vague prior information will be enough for most of the problems we have in Biology or Agriculture.

9.3.1. Flat priors

As we said in chapter 2 (paragraph 2.1.3), we can try to represent ignorance as synonymous with indifference, and say that all possible values of the parameters to be estimated were equally probable before performing the experiment (3). Since the origins of Bayesian inference (Laplace, 1774) and during its development in the XIX century, Bayesian inference was always performed under the supposition that prior ignorance was well represented by flat priors. Laplace himself, Gauss, Pearson and others suspected that flat priors did not represent prior ignorance, and moved on to examine the properties of the sampling distribution. Including actual prior information in the inferences, instead of using flat priors, was not proposed until the work of Ramsey and de Finetti, quoted before.

(3) Bernard Bosanquet, a XIXth-century British logician, quotes a play by Richard Sheridan to illustrate the principle of indifference, which Bosanquet considers a sophism:

ABSOLUTE. - Sure, Sir, this is not very reasonable, to summon my affection for a lady I know nothing of.

SIR ANTHONY. - I am sure, Sir, it is more unreasonable in you to object to a lady you know nothing of.

It is quite easy to see why flat priors cannot represent

ignorance: Suppose we believe that we do not have any prior information about the heritability of a trait. If we represent this using a flat prior (figure 9.4), the event A “the heritability is lower than 0.5” has a probability of 50% (blue area). Take now the event B “the square of the heritability is lower than 0.25”. This is exactly the same event (4) as event A, thus its probability should be the same, also 50%.

Figure 9.4. Flat priors are informative

We are just as ignorant about h⁴ as about h², thus, if flat priors represent ignorance, we should also represent the ignorance about h⁴ with a flat prior. However, if we do this and we also maintain that P(h⁴ < 0.25) = 50%, we arrive at an absurd conclusion: we do not know anything about h², but we know that h⁴ is closer to zero than h².
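A small simulation (our own sketch) makes the argument concrete: drawing the heritability from a flat prior and squaring it shows that the implied prior on h⁴ is far from flat.

```python
# A flat prior on the heritability h2 implies a non-flat prior on h4 = (h2)^2.
import numpy as np

rng = np.random.default_rng(0)
h2 = rng.uniform(0.0, 1.0, size=1_000_000)   # flat prior on the heritability
h4 = h2 ** 2                                  # the implied prior on its square

print(f"P(h2 < 0.5)  = {np.mean(h2 < 0.5):.3f}")    # ~0.50
print(f"P(h4 < 0.25) = {np.mean(h4 < 0.25):.3f}")   # ~0.50, same event as h2 < 0.5
# If h4 itself had a flat prior, P(h4 < 0.25) would be 0.25, not 0.50.
```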

To avoid this absurd conclusion, we have to admit that flat priors do not represent ignorance: they are informative. The problem is that we do not know what this prior information really means. However, this information is very vague and should not cause problems; in most cases the data will dominate and the results will not be practically affected by the prior. Nevertheless, it is bad practice to define flat priors as “non-informative priors”.

(4) If the reader does not like squares, take half of the heritability or another transformation.

9.3.2. Jeffreys prior

Ideally, the estimate of a parameter should be invariant to transformations. It is somewhat annoying that, having an estimate of the standard deviation (for example, the mean of the marginal posterior distribution of the standard deviation) and an estimate of the variance (for example, the mean of the marginal posterior distribution of the variance), one is not the square of the other. It is quite common that the good properties of estimators are not conserved when transforming the parameter; for example, the least squares estimator of the variance is unbiased and is also ‘least squares’, but its square root is a biased estimator of the standard deviation and is not a ‘least squares’ estimator anymore. In the case of representing prior information, we would like our prior information to be conserved after transformations; i.e., if we represent vague information about the standard deviation, we would like the information about the variance to be equally vague. For example, if the prior information on the standard deviation is represented by f(σ), the prior information on the variance should be, according to the transformation rule we exposed in 3.3.2,

$$f(\sigma^2) = f(\sigma)\left|\frac{d\sigma}{d\sigma^2}\right| = f(\sigma)\,\frac{1}{2\sigma}$$

Harold Jeffreys proposed to use priors that are invariant to transformations, so that if f(σ) is the Jeffreys prior for σ, then the transformed density should be the Jeffreys prior for σ². For example, the Jeffreys prior for the standard deviation is

$$f(\sigma) \propto \frac{1}{\sigma}$$

Then, the prior for the variance would be:

$$f(\sigma^2) = f(\sigma)\,\frac{1}{2\sigma} \propto \frac{1}{\sigma}\cdot\frac{1}{2\sigma} \propto \frac{1}{\sigma^2}$$

which is the square of the Jeffreys prior for the standard deviation.

In general, the Jeffreys prior of a parameter θ is

$$f(\theta) \propto \sqrt{I(\theta \mid \mathbf{y})} = \sqrt{-\,\mathrm{E}_{\mathbf{y}}\!\left[\frac{\partial^2 \log f(\mathbf{y} \mid \theta)}{\partial \theta^2}\right]}$$

For example, the Jeffreys prior of the variance is

$$f(\sigma^2) \propto \sqrt{-\,\mathrm{E}_{\mathbf{y}}\!\left[\frac{\partial^2 \log f(\mathbf{y} \mid \sigma^2)}{\partial (\sigma^2)^2}\right]} \propto \frac{1}{\sigma^2}$$

The deduction of this prior is in appendix 9.1. In Appendix 9.2 we show that

these priors are invariant to transformations.

Jeffreys priors are widely used in univariate problems, but they lead to some

paradoxes in multivariate problems.
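A quick Monte Carlo check of this invariance for the Normal model (our own sketch, not from the book): the Fisher information for σ² and for σ, estimated from simulated scores, satisfies √I(σ) = √I(σ²) |dσ²/dσ|, which is precisely the transformation rule for densities.

```python
# Monte Carlo check of the Fisher information behind the Jeffreys priors
# f(sigma) ~ 1/sigma and f(sigma^2) ~ 1/sigma^2 for a Normal(0, sigma^2) model.
import numpy as np

rng = np.random.default_rng(1)
sigma2, n, reps = 2.0, 10, 200_000
y = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
ss = (y ** 2).sum(axis=1)

# Scores (first derivatives of the log-likelihood) at the true parameter value
score_var = -n / (2 * sigma2) + ss / (2 * sigma2 ** 2)   # d log f / d sigma^2
score_sd = -n / np.sqrt(sigma2) + ss / sigma2 ** 1.5     # d log f / d sigma

I_var = np.mean(score_var ** 2)   # should be close to n / (2 sigma^4)
I_sd = np.mean(score_sd ** 2)     # should be close to 2 n / sigma^2

print(f"I(sigma^2): MC {I_var:.3f} vs theory {n / (2 * sigma2 ** 2):.3f}")
print(f"I(sigma):   MC {I_sd:.3f} vs theory {2 * n / sigma2:.3f}")
# Invariance: sqrt(I(sigma)) equals sqrt(I(sigma^2)) times |d sigma^2 / d sigma| = 2 sigma
print(f"sqrt(I(sigma)) = {np.sqrt(I_sd):.3f}, sqrt(I(sigma^2)) * 2*sigma = "
      f"{np.sqrt(I_var) * 2 * np.sqrt(sigma2):.3f}")
```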

9.3.3. Bernardo’s “Reference” priors

If we do not have prior information and we know that all priors are informative, a

sensible solution may be to use priors with minimum information. Bernardo

(1979) proposed to calculate posterior distributions in which the amount of

information provided by the prior is minimal. These priors have the great

advantage of being invariant to reparametrization.

To build these posterior densities we need:

1. To use some definition of information. Bernardo proposes the

definition of Shannon (1948), which we will see in chapter 10.

2. To define the amount of information provided by an experiment: This

is defined as the distance between the prior and the posterior

information, averaging for all possible samples.

3. To use some definition of distance between distributions. Bernardo uses Kullback’s divergence between distributions, which is based on Bayesian arguments. We will see Kullback’s divergence in chapter 10.

4. To find a technically feasible way for solving the problem and deriving

the posterior distributions. This is technically complex and reference

priors are not easy to derive.

In the multivariate case we should transform the multivariate problem into univariate ones, taking into account the parameters that are not of interest. This should be done, for technical reasons, by conditioning on the other parameters in some order. The problem is that the reference prior obtained differs depending on the order of conditioning. This is somewhat uncomfortable, since it would oblige the scientist to consider several orders of conditioning to see how the results differ, a procedure feasible with few parameters but not in highly parameterised problems. For the latter cases, the strategy would be to use sensible vague informative priors when possible, or to nest the problems as we did in chapter 8, where the parameters of growth curves were under environmental and genetic effects that were normally distributed with variances defined using conjugate priors with some hyperparameters (for example, ν and S² as before). Reference priors would then be calculated only in the last step, for the hyperparameters, or at earlier steps in cases in which we are not concerned about the prior information introduced by the distributions assumed.

Bernardo’s reference priors are obviously beyond the scope of this book. However, José-Miguel Bernardo has developed at the University of Valencia a promising area in which the same “reference” idea has been applied to hypothesis tests, credibility intervals and other areas, creating a “reference Bayesian statistics”.

9.4. Improper priors

Some priors are not densities; for example, f(θ) = k, where k is an arbitrary constant, is not a density because ∫ f(θ) dθ = ∞. However, improper priors lead to proper posterior densities when

$$f(\mathbf{y}) = \int f(\mathbf{y} \mid \theta)\, f(\theta)\, d\theta < \infty$$

Sometimes they are innocuous and do not actually enter the inference; for example, take a single observation with

$$f(y \mid \mu) = N(\mu, 1), \qquad f(\mu) = k$$

Then

$$f(y) = \int f(y \mid \mu)\, f(\mu)\, d\mu = k \int \frac{1}{\sqrt{2\pi}} \exp\!\left[-\frac{1}{2}(y - \mu)^2\right] d\mu = k$$

and applying Bayes’ theorem,

$$f(\mu \mid y) = \frac{f(y \mid \mu)\, f(\mu)}{f(y)} = \frac{f(y \mid \mu)\, k}{k} = f(y \mid \mu)$$

thus in this case the posterior density of μ does not take the prior into account.
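A numerical check of this innocuousness (our own sketch, with arbitrary values for y and k): integrating the unnormalised posterior over a wide grid of μ gives back k, so the constant cancels and the posterior is the proper density N(y, 1).

```python
# Numerical check that a flat (improper) prior on mu still gives a proper posterior
# for a single observation y ~ N(mu, 1): the unnormalised posterior integrates to k.
import numpy as np

y, k = 1.3, 7.0                              # an arbitrary observation and constant
mu = np.linspace(y - 40, y + 40, 200_001)    # wide grid standing in for (-inf, inf)
dmu = mu[1] - mu[0]
unnormalised = k * np.exp(-0.5 * (y - mu) ** 2) / np.sqrt(2 * np.pi)

integral = unnormalised.sum() * dmu          # should be very close to k
posterior = unnormalised / integral          # a proper density, free of k
post_mean = (mu * posterior).sum() * dmu     # should be very close to y

print(f"integral of the unnormalised posterior = {integral:.4f} (k = {k})")
print(f"posterior mean = {post_mean:.4f} (observation y = {y})")
```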

In general, it is recommended to always use proper priors, to be sure that we always obtain proper posterior densities. When using Gibbs sampling, some densities look like proper ones but may be improper. Although when using MCMC all densities are, in practice, “proper” (we never sample at infinity), the samples can have very long burn-in periods and can lead to chains that have only apparently converged. The recommendation is to always use proper priors (bounded priors with reasonable limits, for example), unless it has been proved that the improper ones are innocuous (Hobert and Casella, 1992).

9.5. The Achilles heel of Bayesian inference

Bayesian inference, or inverse probability, as it was always called before and should still be called, is extremely attractive because of the use of probabilities and the possibility of integrating prior information. However, integrating prior information is much more difficult than the optimistic Bayesians of the fifties thought. We have given an example in which actual prior information was used, and there are some fields of knowledge in which integrating prior information has been successful (see McGrayne, 2011, for an excellent and entertaining history of this “proper” Bayesian inference). However, in the biological and agricultural fields, in which the models used often have many parameters to be estimated and multivariate procedures are frequent, integrating prior information has been shown to be a challenge beyond the possibilities of the average researcher. This situation has led to the use of several artefacts substituting for proper ‘prior information’ in order to make the use of probability possible. Some statisticians think that an artefact multiplied by a probability will give an artefact and not a probability, and consequently they are reluctant to use Bayesian inference. There is no definitive answer to this problem, and it is a matter of opinion whether to use Bayesian or frequentist statistics; both are now widely used and no paper will be refused by a publisher because it uses one type of statistics or the other.

Many users of statistics, like the author of this book, are not “Bayesians” or “frequentists”, but just people with problems. Statistics is a tool for helping to solve these problems, and users are more interested in the availability of easy solutions and friendly software than in the background philosophy. I use Bayesian statistics because I understand probability better than significance levels and because it permits me to express my results in a clearer way for later discussion. Some other users prefer Bayesian statistics because there is a route for solving their problems: construct a joint posterior distribution, find the conditionals and use MCMC to find the marginal distributions. We behave as if we were working with real probabilities. This should not be objected to by frequentists, who behave as if their alternative hypothesis were true or their confidence interval contained the true value. We have seen in chapter 1 that Neyman and Pearson (1933) justified frequentist statistics on the grounds of right scientific behaviour. Behaving as if the probabilities found using ‘innocent priors’ were true probabilities does not seem to be dangerous, since very little prior information is introduced in the inference. Acknowledging the true probabilities drives us to the Problem of Induction, a very difficult problem beyond the scope of this book. The Appendix “Three new dialogues between Hylas and Filonus” tries to expose this problem in a literary form. Virgil, in an agrarian context, expressed our wish for certainty. He said:

“Felix qui potuit rerum cognoscere causas” (Happy the man who can know the causes of things!)

Virgil, Georgics II, 490

Appendix 9.1

The Jeffreys prior of the variance of a Normal distribution is

$$f(\sigma^2) \propto \sqrt{I(\sigma^2 \mid \mathbf{y})} = \sqrt{-\,\mathrm{E}_{\mathbf{y}}\!\left[\frac{\partial^2 \log f(\mathbf{y} \mid \sigma^2)}{\partial (\sigma^2)^2}\right]}$$

For a sample of n independent observations (taking, for simplicity, a Normal distribution with zero mean, so that E[yᵢ²] = σ²),

$$f(\mathbf{y} \mid \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\!\left(-\frac{\sum_{i=1}^{n} y_i^2}{2\sigma^2}\right)$$

$$\log f(\mathbf{y} \mid \sigma^2) = k - \frac{n}{2}\log \sigma^2 - \frac{\sum y_i^2}{2\sigma^2}$$

Differentiating twice with respect to σ²,

$$\frac{\partial \log f(\mathbf{y} \mid \sigma^2)}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{\sum y_i^2}{2(\sigma^2)^2}$$

$$\frac{\partial^2 \log f(\mathbf{y} \mid \sigma^2)}{\partial (\sigma^2)^2} = \frac{n}{2(\sigma^2)^2} - \frac{\sum y_i^2}{(\sigma^2)^3}$$

Taking expectations over y, and using

$$\mathrm{E}_{\mathbf{y}}\!\left[\sum_{i=1}^{n} y_i^2\right] = n\,\mathrm{E}_{\mathbf{y}}\!\left[y_i^2\right] = n\sigma^2$$

we obtain

$$\mathrm{E}_{\mathbf{y}}\!\left[\frac{\partial^2 \log f(\mathbf{y} \mid \sigma^2)}{\partial (\sigma^2)^2}\right] = \frac{n}{2(\sigma^2)^2} - \frac{n\sigma^2}{(\sigma^2)^3} = \frac{n}{2\sigma^4} - \frac{n}{\sigma^4} = -\frac{n}{2\sigma^4}$$

Therefore

$$f(\sigma^2) \propto \sqrt{I(\sigma^2 \mid \mathbf{y})} = \sqrt{\frac{n}{2\sigma^4}} \propto \frac{1}{\sigma^2}$$

Appendix 9.2

Jeffreys priors are invariant to transformations. If φ = g(θ) is a one-to-one transformation, then, by the chain rule,

$$I(\theta \mid \mathbf{y}) = \mathrm{E}_{\mathbf{y}}\!\left[\left(\frac{\partial \log f(\mathbf{y} \mid \theta)}{\partial \theta}\right)^{2}\right] = \mathrm{E}_{\mathbf{y}}\!\left[\left(\frac{\partial \log f(\mathbf{y} \mid g(\theta))}{\partial g(\theta)}\right)^{2}\right]\left(\frac{\partial g(\theta)}{\partial \theta}\right)^{2} = I\big(g(\theta) \mid \mathbf{y}\big)\left(\frac{\partial g(\theta)}{\partial \theta}\right)^{2}$$

Taking square roots,

$$\sqrt{I(\theta \mid \mathbf{y})} = \sqrt{I\big(g(\theta) \mid \mathbf{y}\big)}\;\left|\frac{\partial g(\theta)}{\partial \theta}\right|$$

which is exactly the rule for transforming a density (see 3.3.2); thus the Jeffreys prior of θ transforms into the Jeffreys prior of g(θ).