
Page 1: Robert

Bayes 250th versus Bayes 2.5.0

Christian P. Robert, Université Paris-Dauphine, University of Warwick, & CREST, Paris

written for EMS 2013, Budapest

Page 2: Robert

Outline

Bayes, Thomas (1702–1761)

Jeffreys, Harold (1891–1989)

Lindley, Dennis (1923– )

Besag, Julian (1945–2010)

de Finetti, Bruno (1906–1985)

Page 3: Robert

Bayes, Price and Laplace

Bayes, Thomas (1702–1761)
 Bayes’ 1763 paper
 Bayes’ example
 Laplace’s 1774 derivation

Jeffreys, Harold (1891–1989)

Lindley, Dennis (1923– )

Besag, Julian (1945–2010)

de Finetti, Bruno (1906–1985)

Page 4: Robert

a first Bayes 250

Took place in Edinburgh, Sept. 5–7, 2011:

I Sparse Nonparametric Bayesian Learning from Big Data, David Dunson, Duke University

I Classification Models and Predictions for Ordered Data, Chris Holmes, Oxford University

I Bayesian Variable Selection in Markov Mixture Models, Luigi Spezia, Biomathematics & Statistics Scotland, Aberdeen

I Bayesian inference for partially observed Markov processes, with application to systems biology, Darren Wilkinson, University of Newcastle

I Coherent Inference on Distributed Bayesian Expert Systems, Jim Smith, University of Warwick

I Probabilistic Programming, John Winn, Microsoft Research

I How To Gamble If You Must (courtesy of the Reverend Bayes), David Spiegelhalter, University of Cambridge

I Inference and computing with decomposable graphs, Peter Green, University of Bristol

I Nonparametric Bayesian Models for Sparse Matrices and Covariances, Zoubin Ghahramani, University of Cambridge

I Latent Force Models, Neil Lawrence, University of Sheffield

I Does Bayes Theorem Work?, Michael Goldstein, Durham University

I Bayesian Priors in the Brain, Peggy Series, University of Edinburgh

I Approximate Bayesian Computation for model selection, Christian Robert, Université Paris-Dauphine

I ABC-EP: Expectation Propagation for Likelihood-free Bayesian Computation, Nicolas Chopin, CREST–ENSAE

I Bayes at Edinburgh University - a talk and tour, Dr Andrew Fraser, Honorary Fellow, University of Edinburgh

I Intractable likelihoods and exact approximate MCMC algorithms, Christophe Andrieu, University of Bristol

I Bayesian computational methods for intractable continuous-time non-Gaussian time series, Simon Godsill, University of Cambridge

I Efficient MCMC for Continuous Time Discrete State Systems, Yee Whye Teh, Gatsby Computational Neuroscience Unit, University College London

I Adaptive Control and Bayesian Inference, Carl Rasmussen, University of Cambridge

I Bernstein–von Mises theorem for irregular statistical models, Natalia Bochkina, University of Edinburgh

Page 5: Robert

Why Bayes 250?

Publication on Dec. 23, 1763 of “An Essay towards solving a Problem in the Doctrine of Chances” by the late Rev. Mr. Bayes, communicated by Mr. Price in the Philosophical Transactions of the Royal Society of London.

© 250th anniversary of the Essay


Page 9: Robert

Breaking news!!!

An accepted paper by Stephen Stigler in Statistical Science uncovers the true title of the Essay:

A Method of Calculating the Exact Probability of All Conclusions founded on Induction

Intended as a reply to Hume’s (1748) evaluation of the probability of miracles

Page 10: Robert

Breaking news!!!

I may have been written as early as 1749: “we may hope to determine the Propositions, and, by degrees, the whole Nature of unknown Causes, by a sufficient Observation of their effects” (D. Hartley)

I in 1767, Richard Price used Bayes’ theorem as a tool to attack Hume’s argument, referring to the above title

I Bayes’ offprints available at Yale’s Beinecke Library (but missing the title page) and at the Library Company of Philadelphia (Franklin’s library)

[Stigler, 2013]

Page 11: Robert

Bayes Theorem

Bayes theorem = Inversion of causes and effects

If A and E are events such that P(E) ≠ 0, P(A|E) and P(E|A) are related by

$$P(A\mid E) = \frac{P(E\mid A)\,P(A)}{P(E\mid A)\,P(A) + P(E\mid A^c)\,P(A^c)} = \frac{P(E\mid A)\,P(A)}{P(E)}$$


Page 13: Robert

Bayes Theorem

Bayes theorem = Inversion of causes and effects

Continuous version for random variables X and Y:

$$f_{X\mid Y}(x\mid y) = \frac{f_{Y\mid X}(y\mid x)\times f_X(x)}{f_Y(y)}$$
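As a quick sanity check of the discrete version, here is a minimal Python sketch; the test characteristics (sensitivity, false-positive rate, prevalence) are made-up numbers chosen purely for illustration.

```python
# Minimal numerical check of the discrete Bayes theorem; all probabilities are
# made-up illustration values, not taken from the slides.
def bayes(p_E_given_A, p_A, p_E_given_Ac):
    """Return P(A|E) from P(E|A), P(A) and P(E|A^c)."""
    p_E = p_E_given_A * p_A + p_E_given_Ac * (1 - p_A)  # law of total probability
    return p_E_given_A * p_A / p_E

# a "test" with 95% sensitivity, 5% false-positive rate, 1% prevalence
print(bayes(0.95, 0.01, 0.05))  # approximately 0.161
```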

Page 14: Robert

Who was Thomas Bayes?

Reverend Thomas Bayes (ca. 1702–1761), educated in London then at the University of Edinburgh (1719-1721), Presbyterian minister in Tunbridge Wells (Kent) from 1731, son of Joshua Bayes, Nonconformist minister.

“Election to the Royal Society based on a tract of 1736 where he defended the views and philosophy of Newton. A notebook of his includes a method of finding the time and place of conjunction of two planets, notes on weights and measures, a method of differentiation, and logarithms.”

[Wikipedia]


Page 16: Robert

Bayes’ 1763 paper:

Billiard ball W rolled on a line of length one, with a uniform probability of stopping anywhere: W stops at p. Second ball O then rolled n times under the same assumptions. X denotes the number of times the ball O stopped on the left of W.

Bayes’ question:

Given X, what inference can we make on p?

Bayes’ wording:

“Given the number of times in which an unknown event has happened and failed; Required the chance that the probability of its happening in a single trial lies somewhere between any two degrees of probability that can be named.”

Modern translation:

Derive the posterior distribution of p given X, when

$$p \sim \mathcal{U}([0,1]) \quad\text{and}\quad X\mid p \sim \mathcal{B}(n, p)$$

Page 20: Robert

Resolution

Since

$$P(X = x\mid p) = \binom{n}{x}\, p^x (1-p)^{n-x},$$

$$P(a < p < b \text{ and } X = x) = \int_a^b \binom{n}{x}\, p^x (1-p)^{n-x}\,dp$$

and

$$P(X = x) = \int_0^1 \binom{n}{x}\, p^x (1-p)^{n-x}\,dp,$$

Page 21: Robert

Resolution (2)

then

$$P(a < p < b \mid X = x) = \frac{\int_a^b \binom{n}{x}\, p^x(1-p)^{n-x}\,dp}{\int_0^1 \binom{n}{x}\, p^x(1-p)^{n-x}\,dp} = \frac{\int_a^b p^x(1-p)^{n-x}\,dp}{B(x+1,\, n-x+1)},$$

i.e.

$$p\mid x \sim \mathcal{B}e(x+1,\, n-x+1)$$

[Beta distribution]

In Bayes’ words:

“The same things supposed, I guess that the probability of the event M lies somewhere between 0 and the ratio of Ab to AB, my chance to be in the right is the ratio of Abm to AiB.”
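A small numerical illustration of this resolution (the values of n, x, a, b below are arbitrary): the closed-form Beta(x+1, n−x+1) probability matches the ratio of integrals from Bayes’ derivation.

```python
# Check of Bayes' billiard-ball posterior p | X=x ~ Beta(x+1, n-x+1);
# n, x, a, b are arbitrary illustration values.
from scipy.integrate import quad
from scipy.stats import beta, binom

n, x = 10, 3
a, b = 0.2, 0.5

# closed form via the Beta(x+1, n-x+1) cdf
closed_form = beta.cdf(b, x + 1, n - x + 1) - beta.cdf(a, x + 1, n - x + 1)

# direct ratio of integrals, as in the resolution above
num, _ = quad(lambda p: binom.pmf(x, n, p), a, b)
den, _ = quad(lambda p: binom.pmf(x, n, p), 0, 1)

print(closed_form, num / den)  # the two values agree
```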

Page 23: Robert

Laplace’s version

Pierre Simon (de) Laplace (1749–1827): “I propose to determine the probability of causes from events, a question new in many respects and which deserves all the more to be cultivated since it is mainly from this point of view that the science of chance can be useful in civil life.”

[Mémoire sur la probabilité des causes par les évènemens, 1774]

Page 24: Robert

Laplace’s version

“If an event can be produced by a number n of different causes, the probabilities of the existence of these causes given the event are to each other as the probabilities of the event given these causes, and the probability of the existence of each of them is equal to the probability of the event given that cause, divided by the sum of all the probabilities of the event given each of these causes.”

[Mémoire sur la probabilité des causes par les évènemens, 1774]

Page 25: Robert

Laplace’s version

In modern terms: under a uniform prior,

$$\frac{P(A_i\mid E)}{P(A_j\mid E)} = \frac{P(E\mid A_i)}{P(E\mid A_j)}$$

and

$$f(x\mid y) = \frac{f(y\mid x)}{\int f(y\mid x)\,dx}$$

[Mémoire sur la probabilité des causes par les évènemens, 1774]

Page 26: Robert

Laplace’s version

Later Laplace acknowledges Bayes: “Bayes sought directly the probability that the possibilities indicated by experiments already made lie within given limits, and he arrived at it in a subtle and very ingenious manner.”

[Essai philosophique sur les probabilités, 1810]

Page 27: Robert

Another Bayes 250

Meeting that took place at the Royal Statistical Society, June 19-20, 2013, on the current state of Bayesian statistics:

I G. Roberts (University of Warwick) “Bayes for differential equation models”

I N. Best (Imperial College London) “Bayesian space-time models for environmental epidemiology”

I D. Prangle (Lancaster University) “Approximate Bayesian Computation”

I P. Dawid (University of Cambridge) “Putting Bayes to the Test”

I M. Jordan (UC Berkeley) “Feature Allocations, Probability Functions, and Paintboxes”

I I. Murray (University of Edinburgh) “Flexible models for density estimation”

I M. Goldstein (Durham University) “Geometric Bayes”

I C. Andrieu (University of Bristol) “Inference with noisy likelihoods”

I A. Golightly (Newcastle University) “Auxiliary particle MCMC schemes for partially observed diffusion processes”

I S. Richardson (MRC Biostatistics Unit) “Biostatistics and Bayes”

I C. Yau (Imperial College London) “Understanding cancer through Bayesian approaches”

I S. Walker (University of Kent) “The Misspecified Bayesian”

I S. Wilson (Trinity College Dublin) “Linnaeus, Bayes and the number of species problem”

I B. Calderhead (UCL) “Probabilistic Integration for Differential Equation Models”

I P. Green (University of Bristol and UT Sydney) “Bayesian graphical model determination”

Page 28: Robert

The search for certain π

Bayes, Thomas (1702–1761)

Jeffreys, Harold (1891–1989)
 Keynes’ treatise
 Jeffreys’ prior distributions
 Jeffreys’ Bayes factor
 expected posterior priors

Lindley, Dennis (1923– )

Besag, Julian (1945–2010)

de Finetti, Bruno (1906–1985)

Page 29: Robert

Keynes’ dead end

In John Maynard Keynes’s A Treatise on Probability (1921):

“I do not believe that there is any direct and simple method by which we can make the transition from an observed numerical frequency to a numerical measure of probability.”

[Robert, 2011, ISR]

Page 30: Robert

Keynes’ dead end

In John Maynard Keynes’s A Treatise on Probability (1921):

“Bayes’ enunciation is strictly correct and its method of arriving at it shows its true logical connection with more fundamental principles, whereas Laplace’s enunciation gives it the appearance of a new principle specially introduced for the solution of causal problems.”

[Robert, 2011, ISR]

Page 31: Robert

Who was Harold Jeffreys?

Harold Jeffreys (1891–1989), mathematician, statistician, geophysicist, and astronomer. Knighted in 1953 and Gold Medal of the Royal Astronomical Society in 1937. Founder of modern British geophysics. Many of his contributions are summarised in his book The Earth.

[Wikipedia]

Page 32: Robert

Theory of Probability

The first modern and comprehensive treatise on (objective) Bayesian statistics

Theory of Probability (1939) begins with probability, refining the treatment in Scientific Inference (1937), and proceeds to cover a range of applications comparable to that in Fisher’s book.

[Robert, Chopin & Rousseau, 2009, Stat. Science]

Page 33: Robert

Jeffreys’ justifications

I All probability statements are conditional

I Actualisation of the information on θ by extracting the information on θ contained in the observation x

The principle of inverse probability does correspond to ordinary processes of learning (I, §1.5)

I Allows incorporation of imperfect information in the decision process

A probability number can be regarded as a generalization of the assertion sign (I, §1.51).

Page 34: Robert

Posterior distribution

I Operates conditional upon the observations

I Incorporates the requirement of the Likelihood Principle

...the whole of the information contained in the observations that is relevant to the posterior probabilities of different hypotheses is summed up in the values that they give the likelihood (II, §2.0).

I Avoids averaging over the unobserved values of x

I Coherent updating of the information available on θ, independent of the order in which i.i.d. observations are collected

...can be used as the prior probability in taking account of a further set of data, and the theory can therefore always take account of new information (I, §1.5).

I Provides a complete inferential scope


Page 38: Robert

Subjective priors

Subjective nature of priors

Critics (...) usually say that the prior probability is ‘subjective’ (...) or refer to the vagueness of previous knowledge as an indication that the prior probability cannot be assessed (VIII, §8.0).

Long walk (from Laplace’s principle of insufficient reason) to a reference prior:

A prior probability used to express ignorance is merely the formal statement of ignorance (VIII, §8.1).


Page 40: Robert

The fundamental prior

...if we took the prior probability density for the parameters to be proportional to ||g_ik||^{1/2} [= |I(θ)|^{1/2}], it could be stated for any law that is differentiable with respect to all parameters that the total probability in any region of the α_i would be equal to the total probability in the corresponding region of the α′_i; in other words, it satisfies the rule that equivalent propositions have the same probability (III, §3.10)

Note: Jeffreys never mentions Fisher information in connection with (g_ik)

Page 41: Robert

The fundamental prior

In modern terms: if I(θ) is the Fisher information matrix associated with the likelihood ℓ(θ|x),

$$I(\theta) = \mathbb{E}_\theta\!\left[\frac{\partial \ell}{\partial \theta^{\mathsf T}}\,\frac{\partial \ell}{\partial \theta}\right]$$

the reference prior distribution is

$$\pi^*(\theta) \propto |I(\theta)|^{1/2}$$

Note: Jeffreys never mentions Fisher information in connection with (g_ik)
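For instance, in the Bernoulli model the Fisher information is I(θ) = 1/{θ(1−θ)}, so this prior is the Beta(1/2, 1/2) distribution; the sketch below (with an arbitrary grid) only checks this proportionality numerically.

```python
# Sketch: Jeffreys prior for a single Bernoulli(theta) observation, where
# I(theta) = 1/(theta(1-theta)) and pi*(theta) ∝ |I(theta)|^{1/2} is Beta(1/2,1/2).
import numpy as np
from scipy.stats import beta

theta = np.linspace(0.01, 0.99, 99)          # interior grid, arbitrary choice
jeffreys_unnorm = np.sqrt(1.0 / (theta * (1.0 - theta)))   # |I(theta)|^{1/2}

# the ratio to the Beta(1/2, 1/2) density is constant (equal to pi), i.e. the
# two functions are proportional
ratio = jeffreys_unnorm / beta.pdf(theta, 0.5, 0.5)
print(ratio.std(), ratio.mean())             # std ~ 0, mean ~ 3.1416
```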

Page 42: Robert

Objective prior distributions

I reference priors (Bayarri, Bernardo, Berger, ...)

I not supposed to represent complete ignorance (Kass & Wasserman, 1996)

The prior probabilities needed to express ignorance of the value of a quantity to be estimated, where there is nothing to call special attention to a particular value, are given by an invariance theory (Jeffreys, VIII, §8.6).

I often endowed with or seeking frequency-based properties

I Jeffreys also proposed another Jeffreys prior dedicated to testing (Bayarri & Garcia-Donato, 2007)

Page 43: Robert

Jeffreys’ Bayes factor

Definition (Bayes factor, Jeffreys, V, §5.01)

For testing the hypothesis H0 : θ ∈ Θ0 vs. Ha : θ ∉ Θ0,

$$B_{01} = \frac{\pi(\Theta_0\mid x)}{\pi(\Theta_0^c\mid x)} \Big/ \frac{\pi(\Theta_0)}{\pi(\Theta_0^c)} = \frac{\int_{\Theta_0} f(x\mid\theta)\,\pi_0(\theta)\,d\theta}{\int_{\Theta_0^c} f(x\mid\theta)\,\pi_1(\theta)\,d\theta}$$

Equivalent to the Bayes rule: acceptance if

$$B_{01} > \frac{(1-\pi(\Theta_0))/a_1}{\pi(\Theta_0)/a_0}$$

What if... π0 is improper?!

[DeGroot, 1973; Berger, 1985; Marin & Robert, 2007]
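A toy numerical version of this Bayes factor (not from the slides): testing H0: θ ≤ 1/2 against Ha: θ > 1/2 for binomial data, with uniform priors π0, π1 on each subset; the data (n, x) are made up.

```python
# Toy Bayes factor B01 by numerical integration, for H0: theta <= 0.5 vs
# Ha: theta > 0.5 with uniform priors on each subset; n and x are made up.
from scipy.integrate import quad
from scipy.stats import binom

n, x = 20, 14

def marginal(lo, hi):
    # integral of f(x|theta) pi(theta) d theta, with pi uniform on (lo, hi)
    val, _ = quad(lambda t: binom.pmf(x, n, t) / (hi - lo), lo, hi)
    return val

B01 = marginal(0.0, 0.5) / marginal(0.5, 1.0)
print(B01)  # well below 1: 14 successes out of 20 favour Ha
```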


Page 45: Robert

Expected posterior priors (example)

Starting from reference priors π_0^N and π_1^N, substitute prior distributions π_0 and π_1 that solve the system of integral equations

$$\pi_0(\theta_0) = \int_{\mathcal X} \pi_0^N(\theta_0\mid x)\, m_1(x)\,dx$$

and

$$\pi_1(\theta_1) = \int_{\mathcal X} \pi_1^N(\theta_1\mid x)\, m_0(x)\,dx,$$

where x is an imaginary minimal training sample and m_0, m_1 are the marginals associated with π_0 and π_1 respectively,

$$m_0(x) = \int f_0(x\mid\theta_0)\,\pi_0(d\theta_0), \qquad m_1(x) = \int f_1(x\mid\theta_1)\,\pi_1(d\theta_1)$$

[Perez & Berger, 2000]

Page 46: Robert

Existence/Unicity

Recurrence condition: when both the observations and the parameters in both models are continuous, if the Markov chain with transition

$$Q(\theta_0'\mid\theta_0) = \int g(\theta_0, \theta_0', \theta_1, x, x')\,dx\,dx'\,d\theta_1$$

where

$$g(\theta_0, \theta_0', \theta_1, x, x') = \pi_0^N(\theta_0'\mid x)\, f_1(x\mid\theta_1)\, \pi_1^N(\theta_1\mid x')\, f_0(x'\mid\theta_0),$$

is recurrent, then there exists a solution to the integral equations, unique up to a multiplicative constant.

[Cano, Salmeron, & Robert, 2008, 2013]

Page 47: Robert

Bayesian testing of hypotheses

Bayes, Thomas (1702–1761)

Jeffreys, Harold (1891–1989)

Lindley, Dennis (1923– )
 Lindley’s paradox
 dual versions of the paradox
 “Who should be afraid of the Lindley–Jeffreys paradox?”
 Bayesian resolutions

Besag, Julian (1945–2010)

de Finetti, Bruno (1906–1985)

Page 48: Robert

Who is Dennis Lindley?

British statistician, decision theorist and leading advocate of Bayesian statistics. Held positions at Cambridge, Aberystwyth, and UCL, retiring at the early age of 54 to become an itinerant scholar. Wrote four books and numerous papers on Bayesian statistics.

© “Coherence is everything”

Page 49: Robert

Lindley’s paradox

In a normal mean testing problem,

$$\bar x_n \sim \mathcal N(\theta, \sigma^2/n), \qquad H_0: \theta = \theta_0,$$

under Jeffreys prior, θ ∼ N(θ0, σ²), the Bayes factor

$$B_{01}(t_n) = (1+n)^{1/2} \exp\!\left(-\frac{n\, t_n^2}{2(1+n)}\right),$$

where $t_n = \sqrt{n}\,|\bar x_n - \theta_0|/\sigma$, satisfies

$$B_{01}(t_n) \xrightarrow{\ n\to\infty\ } \infty$$

[assuming a fixed t_n]
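The paradox is easy to reproduce numerically: fixing t_n at the borderline 5% value 1.96 (an arbitrary illustrative choice) and letting n grow, the Bayes factor drifts towards H0.

```python
# Jeffreys-Lindley paradox: B01(t_n) = sqrt(1+n) exp(-n t_n^2 / (2(1+n)))
# with the t-statistic frozen at 1.96 while n increases.
import numpy as np

t_n = 1.96
for n in (10, 100, 1_000, 10_000, 1_000_000):
    B01 = np.sqrt(1 + n) * np.exp(-n * t_n**2 / (2 * (1 + n)))
    print(f"n = {n:>9d}   B01 = {B01:9.2f}")
# B01 keeps increasing: the same "significant" p-value supports H0 more and more.
```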

Page 50: Robert

Lindley’s paradox

Often dubbed Jeffreys–Lindley paradox...

In terms of

t =√

n − 1x/s ′, ν = n − 1

K ∼

√πν

2

(1 +

t2

ν

)−1/2ν+1/2

.

(...) The variation of K with t is much more importantthan the variation with ν (Jeffreys, V, §5.2).

Page 51: Robert

Two versions of the paradox

“the weight of Lindley’s paradoxical result (...) burdens proponents of the Bayesian practice”.

[Lad, 2003]

I official version, opposing frequentist and Bayesian assessments [Lindley, 1957]

I intra-Bayesian version, blaming vague and improper priors for the Bayes factor misbehaviour: if π1(·|σ) depends on a scale parameter σ, it is often the case that

$$B_{01}(x) \xrightarrow{\ \sigma\to\infty\ } +\infty$$

for a given x, meaning H0 is always accepted [Robert, 1992, 2013]

Page 52: Robert

Evacuation of the first version

Two paradigms [(b) versus (f)]:

I one (b) operates on the parameter space Θ, while the other (f) is produced from the sample space

I one (f) relies solely on the point-null hypothesis H0 and the corresponding sampling distribution, while the other (b) opposes H0 to a (predictive) marginal version of H1

I one (f) could reject “a hypothesis that may be true (...) because it has not predicted observable results that have not occurred” (Jeffreys, VII, §7.2), while the other (b) conditions upon the observed value x_obs

I one (f) resorts to an arbitrary fixed bound α on the p-value, while the other (b) refers to the boundary probability of 1/2

Page 53: Robert

More arguments on the first version

I observing a constant t_n as n increases is of limited interest: under H0, t_n has a limiting N(0, 1) distribution, while, under H1, t_n a.s. converges to ∞

I behaviour that remains entirely compatible with the consistency of the Bayes factor, which a.s. converges either to 0 or ∞, depending on which hypothesis is true.

Consequent literature (e.g., Berger & Sellke, 1987) has since then shown how divergent those two approaches could be (to the point of being asymptotically incompatible).

[Robert, 2013]

Page 54: Robert

Nothing’s wrong with the second version

I n, the prior’s scale factor: the prior variance is n times larger than the observation variance and, when n goes to ∞, the Bayes factor goes to ∞ no matter what the observation is

I n becomes what Lindley (1957) calls “a measure of lack of conviction about the null hypothesis”

I when prior diffuseness under H1 increases, the only relevant information becomes that θ could be equal to θ0, and this overwhelms any evidence to the contrary contained in the data

I the mass of the prior distribution in any fixed neighbourhood of the null hypothesis vanishes to zero under H1

[Robert, 2013]

© deep coherence in the outcome: being indecisive about the alternative hypothesis means we should not choose it


Page 56: Robert

“Who should be afraid of the Lindley–Jeffreys paradox?”

Recent publication by A. Spanos with above title:

I the paradox argues against Bayesian and likelihood resolutions of the problem for failing to account for the large sample size

I the failure of all three main paradigms leads Spanos to advocate Mayo’s and Spanos’ “postdata severity evaluation”

[Spanos, 2013]

Page 57: Robert

“Who should be afraid of the Lindley–Jeffreys paradox?”

Recent publication by A. Spanos with above title:

“the postdata severity evaluation (...) addresses the key problem with Fisherian p-values in the sense that the severity evaluation provides the “magnitude” of the warranted discrepancy from the null by taking into account the generic capacity of the test (that includes n) in question as it relates to the observed data” (p. 88)

[Spanos, 2013]

Page 58: Robert

On some resolutions of the second version

I use of pseudo-Bayes factors, fractional Bayes factors, &tc, which lack proper Bayesian justification [Berger & Pericchi, 2001]

I use of identical improper priors on nuisance parameters, a notion already entertained by Jeffreys [Berger et al., 1998; Marin & Robert, 2013]

I use of the posterior predictive distribution, which uses the data twice (see also Aitkin’s (2010) integrated likelihood) [Gelman, Rousseau & Robert, 2013]

I use of score functions extending the log score function

$$\log B_{12}(x) = \log m_1(x) - \log m_2(x) = S_0(x, m_1) - S_0(x, m_2),$$

that are independent of the normalising constant [Dawid et al., 2013]

Page 59: Robert

Bayesian computing (R)evolution

Bayes, Thomas (1702–1761)

Jeffreys, Harold (1891–1989)

Lindley, Dennis (1923– )

Besag, Julian (1945–2010)
 Besag’s early contributions
 MCMC revolution and beyond

de Finetti, Bruno (1906–1985)

Page 60: Robert

computational jam

In the 1970’s and early 1980’s, the theoretical foundations of Bayesian statistics were sound, but methodology was lagging for lack of computing tools:

I restriction to conjugate priors

I limited complexity of models

I small sample sizes

The field was desperately in need of a new computing paradigm!

[Robert & Casella, 2012]

Page 61: Robert

MCMC as in Markov Chain Monte Carlo

Notion that i.i.d. simulation is definitely not necessary: all that matters is the ergodic theorem. The realization that Markov chains could be used in a wide variety of situations only came to mainstream statisticians with Gelfand and Smith (1990), despite earlier publications in the statistical literature like Hastings (1970) and growing awareness in spatial statistics (Besag, 1986). Reasons:

I lack of computing machinery

I lack of background on Markov chains

I lack of trust in the practicality of the method

Page 62: Robert

Who was Julian Besag?

British statistician known chiefly for his work in spatial statistics (including its applications to epidemiology, image analysis and agricultural science), and Bayesian inference (including Markov chain Monte Carlo algorithms). Lecturer in Liverpool and Durham, then professor in Durham and Seattle.

[Wikipedia]

Page 63: Robert

pre-Gibbs/pre-Hastings era

Early 1970’s, Hammersley, Clifford, and Besag were working on the specification of joint distributions from conditional distributions and on necessary and sufficient conditions for the conditional distributions to be compatible with a joint distribution.

[Hammersley and Clifford, 1971]

Page 64: Robert

pre-Gibbs/pre-Hastings era

Early 1970’s, Hammersley, Clifford, and Besag were working on the specification of joint distributions from conditional distributions and on necessary and sufficient conditions for the conditional distributions to be compatible with a joint distribution.

“What is the most general form of the conditional probability functions that define a coherent joint function? And what will the joint look like?”

[Besag, 1972]

Page 65: Robert

Hammersley-Clifford[-Besag] theorem

Theorem (Hammersley-Clifford)

Joint distribution of a vector associated with a dependence graph must be represented as a product of functions over the cliques of the graph, i.e., of functions depending only on the components indexed by the labels in the clique.

[Cressie, 1993; Lauritzen, 1996]

Page 66: Robert

Hammersley-Clifford[-Besag] theorem

Theorem (Hammersley-Clifford)

A probability distribution P with positive and continuous density f satisfies the pairwise Markov property with respect to an undirected graph G if and only if it factorizes according to G, i.e.,

(F) ≡ (G)

[Cressie, 1993; Lauritzen, 1996]

Page 67: Robert

Hammersley-Clifford[-Besag] theorem

Theorem (Hammersley-Clifford)

Under the positivity condition, the joint distribution g satisfies

$$g(y_1, \ldots, y_p) \propto \prod_{j=1}^{p} \frac{g_{\ell_j}\big(y_{\ell_j}\mid y_{\ell_1}, \ldots, y_{\ell_{j-1}}, y'_{\ell_{j+1}}, \ldots, y'_{\ell_p}\big)}{g_{\ell_j}\big(y'_{\ell_j}\mid y_{\ell_1}, \ldots, y_{\ell_{j-1}}, y'_{\ell_{j+1}}, \ldots, y'_{\ell_p}\big)}$$

for every permutation ℓ on {1, 2, . . . , p} and every y′ ∈ Y.

[Cressie, 1993; Lauritzen, 1996]

Page 68: Robert

To Gibbs or not to Gibbs?

Julian Besag should certainly be credited to a large extent for the (re?-)discovery of the Gibbs sampler.

Page 69: Robert

To Gibbs or not to Gibbs?

Julian Besag should certainly be credited to a large extent for the (re?-)discovery of the Gibbs sampler.

“The simulation procedure is to consider the sites cyclically and, at each stage, to amend or leave unaltered the particular site value in question, according to a probability distribution whose elements depend upon the current value at neighboring sites (...) However, the technique is unlikely to be particularly helpful in many other than binary situations and the Markov chain itself has no practical interpretation.”

[Besag, 1974]
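For readers who have never run one, here is a minimal Gibbs sampler sketch on a toy target (a bivariate normal with correlation ρ), far simpler than Besag’s spatial settings; the target and the number of iterations are arbitrary choices.

```python
# Minimal Gibbs sampler for a toy bivariate normal target with correlation rho,
# alternating the full conditionals x|y ~ N(rho*y, 1-rho^2), y|x ~ N(rho*x, 1-rho^2).
import numpy as np

rng = np.random.default_rng(0)
rho, n_iter = 0.8, 10_000
x = y = 0.0
samples = np.empty((n_iter, 2))

for t in range(n_iter):
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))  # draw from pi(x | y)
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))  # draw from pi(y | x)
    samples[t] = x, y

print(np.corrcoef(samples.T)[0, 1])  # close to rho = 0.8
```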

Page 70: Robert

Clicking in

After Peskun (1973), MCMC mostly dormant in the mainstream statistical world for about 10 years, then several papers/books highlighted its usefulness in specific settings:

I Geman and Geman (1984)

I Besag (1986)

I Strauss (1986)

I Ripley (Stochastic Simulation, 1987)

I Tanner and Wong (1987)

I Younes (1988)

Page 71: Robert

Enters the Gibbs sampler

Geman and Geman (1984), building on Metropolis et al. (1953), Hastings (1970), and Peskun (1973), constructed a Gibbs sampler for optimisation in a discrete image processing problem with a Gibbs random field, without completion. Back to Metropolis et al., 1953: the Gibbs sampler is already in use therein and ergodicity is proven on the collection of global maxima.


Page 73: Robert

Besag (1986) integrates GS for SA...

“...easy to construct the transition matrix Q, of a discrete time Markov chain, with state space Ω and limit distribution (4). Simulated annealing proceeds by running an associated time inhomogeneous Markov chain with transition matrices QT, where T is progressively decreased according to a prescribed “schedule” to a value close to zero.”

[Besag, 1986]

Page 74: Robert

...and links with Metropolis-Hastings...

“There are various related methods of constructing a manageable QT (Hastings, 1970). Geman and Geman (1984) adopt the simplest, which they term the ”Gibbs sampler” (...) time reversibility, a common ingredient in this type of problem (see, for example, Besag, 1977a), is present at individual stages but not over complete cycles, though Peter Green has pointed out that it returns if QT is taken over a pair of cycles, the second of which visits pixels in reverse order”

[Besag, 1986]

Page 75: Robert

The candidate’s formula

Representation of the marginal likelihood as

$$m(x) = \frac{\pi(\theta)\, f(x\mid\theta)}{\pi(\theta\mid x)}$$

or of the marginal predictive as

$$p_n(y^\star\mid y) = f(y^\star\mid\theta)\,\pi_n(\theta\mid y)\,\big/\,\pi_{n+1}(\theta\mid y, y^\star)$$

[Besag, 1989]

Why candidate?

“Equation (2) appeared without explanation in a Durham University undergraduate final examination script of 1984. Regrettably, the student’s name is no longer known to me.”
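The identity is trivial to check on a conjugate model where all three densities are available in closed form; the normal model and numerical values below are arbitrary illustration choices.

```python
# Check of the candidate's formula m(x) = pi(theta) f(x|theta) / pi(theta|x)
# on theta ~ N(0, tau^2), x|theta ~ N(theta, sigma^2); all values are arbitrary.
from scipy.stats import norm

sigma, tau = 1.0, 2.0
x, theta = 0.7, -0.3                     # any observation and any theta value work

prior = norm.pdf(theta, 0.0, tau)
likelihood = norm.pdf(x, theta, sigma)
post_var = sigma**2 * tau**2 / (sigma**2 + tau**2)
posterior = norm.pdf(theta, tau**2 * x / (sigma**2 + tau**2), post_var**0.5)

candidate = prior * likelihood / posterior               # Besag's identity
exact = norm.pdf(x, 0.0, (sigma**2 + tau**2) ** 0.5)     # marginal of x
print(candidate, exact)                  # identical up to rounding error
```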


Page 77: Robert

Implications

I Newton and Raftery (1994) used this representation to derive the [infamous] harmonic mean approximation to the marginal likelihood

I Gelfand and Dey (1994) also relied on this formula for the same purpose, in a more general perspective

I Geyer and Thompson (1995) derived MLEs by a Monte Carlo approximation to the normalising constant

I Chib (1995) uses this representation to build an MCMC approximation to the marginal likelihood

I Marin and Robert (2010) and Robert and Wraith (2009) corrected Newton and Raftery (1994) by restricting the importance function to an HPD region

[Chen, Shao & Ibrahim, 2000]

Page 82: Robert

Removing the jam

In the early 1990s, researchers found that Gibbs and then Metropolis-Hastings algorithms would crack almost any problem! A flood of papers followed applying MCMC:

I linear mixed models (Gelfand & al., 1990; Zeger & Karim, 1991; Wang & al., 1993, 1994)

I generalized linear mixed models (Albert & Chib, 1993)

I mixture models (Tanner & Wong, 1987; Diebolt & Robert, 1990, 1994; Escobar & West, 1993)

I changepoint analysis (Carlin & al., 1992)

I point processes (Grenander & Møller, 1994)

I &tc

Page 83: Robert

Removing the jam

In the early 1990s, researchers found that Gibbs and then Metropolis-Hastings algorithms would crack almost any problem! A flood of papers followed applying MCMC:

I genomics (Stephens & Smith, 1993; Lawrence & al., 1993; Churchill, 1995; Geyer & Thompson, 1995; Stephens & Donnelly, 2000)

I ecology (George & Robert, 1992)

I variable selection in regression (George & McCulloch, 1993; Green, 1995; Chen & al., 2000)

I spatial statistics (Raftery & Banfield, 1991; Besag & Green, 1993)

I longitudinal studies (Lange & al., 1992)

I &tc

Page 84: Robert

MCMC and beyond

I reversible jump MCMC, which considerably impacted Bayesian model choice (Green, 1995)

I adaptive MCMC algorithms (Haario & al., 1999; Roberts & Rosenthal, 2009)

I exact approximations to targets (Tanner & Wong, 1987; Beaumont, 2003; Andrieu & Roberts, 2009)

I particle filters with application to sequential statistics, state-space models, signal processing, &tc. (Gordon & al., 1993; Doucet & al., 2001; del Moral & al., 2006)

Page 85: Robert

MCMC and beyond beyond

I comp’al stats catching up with comp’al physics: free energy sampling (e.g., Wang-Landau), Hamiltonian Monte Carlo (Girolami & Calderhead, 2011)

I sequential Monte Carlo (SMC) for non-sequential problems (Chopin, 2002; Neal, 2001; Del Moral et al., 2006)

I retrospective sampling

I intractability: EP – GIMH – PMCMC – SMC2 – INLA

I QMC[MC] (Owen, 2011)

Page 86: Robert

Particles

Iterating/sequential importance sampling is about as old as Monte Carlo methods themselves!

[Hammersley and Morton, 1954; Rosenbluth and Rosenbluth, 1955]

Found in the molecular simulation literature of the 50’s with self-avoiding random walks and signal processing

[Marshall, 1965; Handschin and Mayne, 1969]

Use of the term “particle” dates back to Kitagawa (1996), and Carpenter et al. (1997) coined the term “particle filter”.


Page 88: Robert

pMC & pMCMC

I Recycling of past simulations is legitimate to build better importance sampling functions, as in population Monte Carlo [Iba, 2000; Cappé et al., 2004; Del Moral et al., 2007]

I synthesis by Andrieu, Doucet, and Holenstein (2010) using particles to build an evolving MCMC kernel pθ(y1:T) in state-space models p(x1:T)p(y1:T |x1:T) (a minimal particle-filter sketch follows below)

I importance sampling on discretely observed diffusions [Beskos et al., 2006; Fearnhead et al., 2008, 2010]
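As a companion to these pointers, here is a minimal bootstrap particle filter sketch on a toy linear-Gaussian state-space model; it returns the likelihood estimate that particle MCMC reuses. Model, sizes and seed are arbitrary, and this is only a sketch, not the constructions cited above.

```python
# Minimal bootstrap particle filter for the toy model
#   x_t = 0.9 x_{t-1} + N(0,1),   y_t = x_t + N(0,1),
# returning an estimate of log p(y_{1:T}); everything here is illustrative.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
T, N = 50, 500

# simulate synthetic data from the model
x = np.zeros(T)
for t in range(1, T):
    x[t] = 0.9 * x[t - 1] + rng.normal()
y = x + rng.normal(size=T)

particles = rng.normal(size=N)            # initial particles
log_lik = 0.0
for t in range(T):
    if t > 0:
        particles = 0.9 * particles + rng.normal(size=N)  # propagate through the transition
    w = norm.pdf(y[t], particles, 1.0)    # weight by the observation density
    log_lik += np.log(w.mean())           # accumulate the log-likelihood estimate
    particles = rng.choice(particles, size=N, p=w / w.sum())  # multinomial resampling

print(log_lik)                            # particle estimate of log p(y_{1:T})
```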

Page 89: Robert

towards ever more complexity

Bayes, Thomas (1702–1761)

Jeffreys, Harold (1891–1989)

Lindley, Dennis (1923– )

Besag, Julian (1945–2010)

de Finetti, Bruno (1906–1985)
 de Finetti’s exchangeability theorem
 Bayesian nonparametrics
 Bayesian analysis in a Big Data era

Page 90: Robert

Who was Bruno de Finetti?

“Italian probabilist, statistician and actuary, noted for the “operational subjective” conception of probability. The classic exposition of his distinctive theory is the 1937 “La prévision: ses lois logiques, ses sources subjectives,” which discussed probability founded on the coherence of betting odds and the consequences of exchangeability.”

[Wikipedia]

Chair in Financial Mathematics at Trieste University (1939) and Roma (1954), then in Calculus of Probabilities (1961). Most famous sentence:

“Probability does not exist”


Page 92: Robert

Exchangeability

Notion of exchangeable sequences:

A random sequence (x1, . . . , xn, . . .) is exchangeable if for any n the distribution of (x1, . . . , xn) is equal to the distribution of any permutation of the sequence, (xσ1, . . . , xσn)

de Finetti’s theorem (1937):

An exchangeable distribution is a mixture of iid distributions,

$$p(x_1, \ldots, x_n) = \int \prod_{i=1}^n f(x_i\mid G)\,d\pi(G)$$

where G can be infinite-dimensional

Extension to Markov chains (Freedman, 1962; Diaconis & Freedman, 1980)


Page 95: Robert

Bayesian nonparametrics

Based on de Finetti’s representation,

I use of priors on functional spaces (densities, regression, trees, partitions, clustering, &tc)

I production of Bayes estimates in those spaces

I convergence mileage may vary

I efficient (MCMC) algorithms available to conduct non-parametric inference

[van der Vaart, 1998; Hjort et al., 2010; Muller & Rodriguez, 2013]

Page 96: Robert

Dirichlet processes

One of the earliest examples of priors on distributions [Ferguson, 1973]

Stick-breaking construction of D(α0, G0):

I generate $\beta_k \sim \mathcal{B}(1, \alpha_0)$

I define $\pi_1 = \beta_1$ and $\pi_k = \prod_{j=1}^{k-1}(1-\beta_j)\,\beta_k$

I generate $\theta_k \sim G_0$

I derive $G = \sum_k \pi_k \delta_{\theta_k} \sim \mathcal{D}(\alpha_0, G_0)$

[Sethuraman, 1994]
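A short (truncated) implementation of this construction, with G0 = N(0, 1) and an arbitrary truncation level, purely as a sketch:

```python
# Truncated stick-breaking construction of G ~ D(alpha0, G0) with G0 = N(0,1);
# alpha0 and the truncation level K are arbitrary illustration choices.
import numpy as np

rng = np.random.default_rng(2)
alpha0, K = 1.0, 1000

beta_k = rng.beta(1.0, alpha0, size=K)                              # beta_k ~ B(1, alpha0)
pi = beta_k * np.concatenate(([1.0], np.cumprod(1 - beta_k)[:-1]))  # stick-breaking weights
theta = rng.normal(size=K)                                          # atoms theta_k ~ G0

# G = sum_k pi_k delta_{theta_k}: draw from it by picking atoms with weights pi
draws = rng.choice(theta, size=10, p=pi / pi.sum())  # renormalise the truncated weights
print(pi[:5].round(3), draws.round(2))
```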

Page 97: Robert

Chinese restaurant process

If we assume

$$G \sim \mathcal{D}(\alpha_0, G_0), \qquad \theta_i \sim G,$$

then the marginal distribution of (θ1, . . .) is a Chinese restaurant process (Pólya urn model), which is exchangeable. In particular,

$$\theta_i \mid \theta_{1:i-1} \sim \alpha_0 G_0 + \sum_{j=1}^{i-1} \delta_{\theta_j}$$

(up to a normalising constant). Posterior distribution built by MCMC

[Escobar and West, 1992]
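The Pólya-urn form of this predictive makes it straightforward to simulate; the sketch below (with arbitrary α0, n and G0 = N(0, 1)) shows the induced clustering through repeated values.

```python
# Sampling theta_1, ..., theta_n from the Chinese restaurant process marginal of
# a DP(alpha0, G0) with G0 = N(0,1); alpha0 and n are arbitrary choices.
import numpy as np

rng = np.random.default_rng(3)
alpha0, n = 1.0, 20
thetas = []

for i in range(n):
    # new atom from G0 with probability alpha0/(alpha0+i), otherwise reuse a
    # previous theta_j chosen uniformly (the Polya urn weights)
    if rng.random() < alpha0 / (alpha0 + i):
        thetas.append(rng.normal())
    else:
        thetas.append(thetas[rng.integers(i)])

print(np.round(thetas, 2))
print(len(set(thetas)), "distinct values among", n, "draws")
```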


Page 99: Robert

Many alternatives

I truncated Dirichlet processes

I Pitman Yor processes

I completely random measures

I normalized random measures with independent increments (NRMI)

[Muller and Mitra, 2013]

Page 100: Robert

Theoretical advances

I posterior consistency: seminal work of Schwartz (1965) in the iid case and extension of Barron et al. (1999) for general consistency

I consistency rates: Ghosal & van der Vaart (2000), Ghosal et al. (2008), with minimax (adaptive) Bayesian nonparametric estimators for nonparametric process mixtures (Gaussian, Beta) (Rousseau, 2008; Kruijer, Rousseau & van der Vaart, 2010; Shen, Tokdar & Ghosal, 2013; Scricciolo, 2013)

I Bernstein–von Mises theorems (Castillo, 2011; Rivoirard & Rousseau, 2012; Kleijn & Bickel, 2013; Castillo & Rousseau, 2013)

I recent extensions to semiparametric models

Page 101: Robert

Consistency and posterior concentration rates

Posterior

$$d\pi(\theta\mid X^n) = \frac{f_\theta(X^n)\,d\pi(\theta)}{m(X^n)}, \qquad m(X^n) = \int_\Theta f_\theta(X^n)\,d\pi(\theta)$$

and posterior concentration: under $P_{\theta_0}$,

$$P^\pi\big[d(\theta, \theta_0) \le \varepsilon \mid X^n\big] = 1 + o_p(1) \quad\text{(consistency, for a given } \varepsilon\text{)},$$

$$P^\pi\big[d(\theta, \theta_0) \le \varepsilon_n \mid X^n\big] = 1 + o_p(1) \quad\text{(concentration rates, for } \varepsilon_n \downarrow 0\text{)},$$

where d(θ, θ′) is a loss function, e.g. Hellinger, L1, L2, L∞.

Page 103: Robert

Bernstein–von Mises theorems

Parameter of interest ψ = ψ(θ) ∈ R^d, d < +∞, θ ∼ π (with dim(θ) = +∞)

BvM:

$$\pi\big[\sqrt{n}\,(\psi - \hat\psi) \le z \mid X^n\big] = \Phi\big(z/\sqrt{V_0}\big) + o_p(1) \quad\text{under } P_{\theta_0}$$

and

$$\sqrt{n}\,(\hat\psi - \psi(\theta_0)) \approx \mathcal N(0, V_0) \quad\text{under } P_{\theta_0}$$

[Doob, 1949; Le Cam, 1986; van der Vaart, 1998]

Page 104: Robert

New challenges

Novel statistical issues that force a different Bayesian answer:

I very large datasets

I complex or unknown dependence structures, with maybe p ≫ n

I multiple and involved random effects

I missing data structures containing most of the information

I sequential structures involving most of the above

Page 105: Robert

New paradigm?

“Surprisingly, the confident prediction of the previous generation that Bayesian methods would ultimately supplant frequentist methods has given way to a realization that Markov chain Monte Carlo (MCMC) may be too slow to handle modern data sets. Size matters because large data sets stress computer storage and processing power to the breaking point. The most successful compromises between Bayesian and frequentist methods now rely on penalization and optimization.”

[Lange et al., ISR, 2013]

Page 106: Robert

New paradigm?

Observe (X_i, R_i, Y_iR_i) where

$$X_i \sim \mathcal U(0,1)^d, \qquad R_i\mid X_i \sim \mathcal B(\pi(X_i)) \quad\text{and}\quad Y_i\mid X_i \sim \mathcal B(\theta(X_i))$$

(π(·) is known and θ(·) is unknown). Then any estimator of E[Y] that does not depend on π is inconsistent.

© There is no genuine Bayesian answer producing a consistent estimator (without throwing away part of the data)

[Robins & Wasserman, 2000, 2013]

Page 107: Robert

New paradigm?

I sad reality constraint that size does matter

I focus on much smaller dimensions and on sparse summaries

I many (fast if non-Bayesian) ways of producing those summaries

I Bayesian inference can kick in almost automatically at this stage

Savage and de Finetti, 1961

Page 108: Robert

Approximate Bayesian computation (ABC)

Case of a well-defined statistical model where the likelihood function

$$\ell(\theta\mid y) = f(y_1, \ldots, y_n\mid\theta)$$

is out of reach!

Empirical approximations to the original Bayesian inference problem:

I Degrading the data precision down to a tolerance ε

I Replacing the likelihood with a non-parametric approximation

I Summarising/replacing the data with insufficient statistics


Page 112: Robert

ABC methodology

Bayesian setting: target is π(θ)f(x|θ).

When the likelihood f(x|θ) is not in closed form, likelihood-free rejection technique:

Foundation

For an observation y ∼ f(y|θ), under the prior π(θ), if one keeps jointly simulating

$$\theta' \sim \pi(\theta), \qquad z \sim f(z\mid\theta'),$$

until the auxiliary variable z is equal to the observed value, z = y, then the selected

$$\theta' \sim \pi(\theta\mid y)$$

[Rubin, 1984; Diggle & Gratton, 1984; Griffith et al., 1997]


Page 115: Robert

ABC algorithm

In most implementations, a degree of approximation:

Algorithm 1 Likelihood-free rejection sampler

  for i = 1 to N do
    repeat
      generate θ′ from the prior distribution π(·)
      generate z from the likelihood f(·|θ′)
    until ρ{η(z), η(y)} ≤ ε
    set θi = θ′
  end for

where η(y) defines a (not necessarily sufficient) statistic
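A runnable sketch of Algorithm 1 on a toy problem where the exact posterior is known (y ~ N(θ, 1), prior θ ~ N(0, 10), summary η = sample mean, Euclidean distance); tolerance and sample sizes are arbitrary illustration choices.

```python
# Likelihood-free rejection sampler (ABC) on a toy normal-mean problem;
# all numerical settings are arbitrary and the likelihood is in fact known.
import numpy as np

rng = np.random.default_rng(4)
n_obs, eps, N = 100, 0.05, 500

y = rng.normal(1.5, 1.0, size=n_obs)           # pseudo-observed data
eta_y = y.mean()                               # summary statistic eta(y)

accepted = []
while len(accepted) < N:
    theta = rng.normal(0.0, np.sqrt(10.0))     # theta' from the prior
    z = rng.normal(theta, 1.0, size=n_obs)     # z from the likelihood
    if abs(z.mean() - eta_y) <= eps:           # rho(eta(z), eta(y)) <= eps
        accepted.append(theta)

print(np.mean(accepted), np.std(accepted))     # close to the exact posterior, N(~1.5, ~0.1^2)
```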

Page 116: Robert

Comments

I role of the distance is paramount (because ε ≠ 0)

I scaling of the components of η(y) is also capital

I ε matters little if “small enough”

I representative of the “curse of dimensionality”

I small is beautiful!, i.e. the data as a whole may be weakly informative for ABC

I non-parametric method at its core

Page 117: Robert

ABC simulation advances

Simulating from the prior is often poor in efficiency.

Either modify the proposal distribution on θ to increase the density of x’s within the vicinity of y...

[Marjoram et al., 2003; Beaumont et al., 2009; Del Moral et al., 2012]

...or view the problem as conditional density estimation and develop techniques to allow for larger ε

[Beaumont et al., 2002; Blum & Francois, 2010; Biau et al., 2013]

...or even include ε in the inferential framework [ABCµ]

[Ratmann et al., 2009]


Page 121: Robert

ABC as an inference machine

Starting point is the summary statistic η(y), either chosen for computational realism or imposed by external constraints

I ABC can produce a distribution on the parameter of interest conditional on this summary statistic η(y)

I inference based on ABC may be consistent or not, so it needs to be validated on its own

I the choice of the tolerance level ε is dictated by both computational and convergence constraints


Page 123: Robert

How Bayesian aBc is..?

At best, ABC approximates π(θ|η(y)):

I approximation error unknown (w/o massive simulation)

I pragmatic or empirical Bayes (there is no other solution!)

I many calibration issues (tolerance, distance, statistics)

I the NP side should be incorporated into the whole Bayesian picture

I the approximation error should also be part of the Bayesian inference

Page 124: Robert

Noisy ABC

ABC approximation error (under non-zero tolerance ε) replaced with exact simulation from a controlled approximation to the target, the convolution of the true posterior with a kernel function,

$$\pi_\varepsilon(\theta, z\mid y) = \frac{\pi(\theta)\, f(z\mid\theta)\, K_\varepsilon(y - z)}{\int \pi(\theta)\, f(z\mid\theta)\, K_\varepsilon(y - z)\,dz\,d\theta},$$

with $K_\varepsilon$ a kernel parameterised by the bandwidth ε.

[Wilkinson, 2013]

Theorem

The ABC algorithm based on a randomised observation ỹ = y + ξ, ξ ∼ Kε, and an acceptance probability of

$$K_\varepsilon(\tilde y - z)/M$$

gives draws from the posterior distribution π(θ | ỹ).
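A sketch of the noisy-ABC recipe on the same toy normal-mean problem, with a Gaussian kernel Kε on the summary statistic (so M = Kε(0)); again, every numerical setting is an arbitrary illustration choice.

```python
# Noisy ABC sketch: jitter the observed summary, y~ = eta(y) + xi, xi ~ N(0, eps^2),
# and accept each proposal with probability K_eps(y~ - eta(z)) / K_eps(0).
import numpy as np

rng = np.random.default_rng(5)
n_obs, eps, N = 100, 0.05, 500

y = rng.normal(1.5, 1.0, size=n_obs)
y_tilde = y.mean() + rng.normal(0.0, eps)      # randomised observation

accepted = []
while len(accepted) < N:
    theta = rng.normal(0.0, np.sqrt(10.0))     # prior draw
    z = rng.normal(theta, 1.0, size=n_obs)
    accept_prob = np.exp(-0.5 * (y_tilde - z.mean()) ** 2 / eps**2)  # Gaussian kernel ratio
    if rng.random() < accept_prob:
        accepted.append(theta)

print(np.mean(accepted), np.std(accepted))
```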



Page 127: Robert

Which summary?

Fundamental difficulty of the choice of the summary statistic when there is no non-trivial sufficient statistic [except when done by the experimenters in the field]

I Loss of statistical information balanced against gain in data roughening

I Approximation error and information loss remain unknown

I Choice of statistics induces choice of distance function towards standardisation

I borrowing tools from data analysis (LDA) and machine learning

[Estoup et al., ME, 2012]

Page 128: Robert

Which summary?

Fundamental difficulty of the choice of the summary statistic when there is no non-trivial sufficient statistic [except when done by the experimenters in the field]

I may be imposed for external/practical reasons

I may gather several non-B point estimates

I we can learn about efficient combination

I distance can be provided by estimation techniques

Page 129: Robert

Which summary for model choice?

‘This is also why focus on model discrimination typically (...) proceeds by (...) accepting that the Bayes Factor that one obtains is only derived from the summary statistics and may in no way correspond to that of the full model.’

[Scott Sisson, Jan. 31, 2011, xianblog]

Depending on the choice of η(·), the Bayes factor based on this insufficient statistic,

$$B_{12}^\eta(y) = \frac{\int \pi_1(\theta_1)\, f_1^\eta(\eta(y)\mid\theta_1)\,d\theta_1}{\int \pi_2(\theta_2)\, f_2^\eta(\eta(y)\mid\theta_2)\,d\theta_2},$$

is either consistent or inconsistent

[Robert et al., PNAS, 2012]

Page 130: Robert

Which summary for model choice?


[Figure: Gauss vs Laplace comparison, n = 100 (two panels)]

Page 131: Robert

Selecting proper summaries

Consistency only depends on the range of

$$\mu_i(\theta) = \mathbb E_i[\eta(y)]$$

under both models against the asymptotic mean µ0 of η(y)

Theorem

If Pn belongs to one of the two models and if µ0 cannot be attained by the other one, i.e.

$$0 = \min\big(\inf\{|\mu_0 - \mu_i(\theta_i)|;\ \theta_i\in\Theta_i\},\ i=1,2\big) < \max\big(\inf\{|\mu_0 - \mu_i(\theta_i)|;\ \theta_i\in\Theta_i\},\ i=1,2\big),$$

then the Bayes factor $B_{12}^\eta$ is consistent

[Marin et al., 2012]

Page 132: Robert

Selecting proper summaries

Consistency only depends on the range of $\mu_i(\theta) = \mathbb E_i[\eta(y)]$ under both models against the asymptotic mean µ0 of η(y)

[Figure: M1 vs M2 comparison (multiple panels)]

[Marin et al., 2012]

Page 133: Robert

on some Bayesian open problems

In 2011, Michael Jordan, then ISBA President, conducted a mini-survey on Bayesian open problems:

I Nonparametrics and semiparametrics: assessing and validating priors on infinite-dimensional spaces with an infinite number of nuisance parameters

I Priors: elicitation mechanisms and strategies to get the prior from the likelihood or even from the posterior distribution

I Bayesian/frequentist relationships: how far should one reach for frequentist validation?

I Computation and statistics: computational abilities should be part of the modelling, with some expressing doubts about INLA and ABC

I Model selection and hypothesis testing: still unsettled opposition between model checking, model averaging and model selection

[Jordan, ISBA Bulletin, March 2011]

Page 134: Robert

yet another Bayes 250

Meeting that will take place at Duke University, December 17:

I Stephen Fienberg, Carnegie Mellon University

I Michael Jordan, University of California, Berkeley

I Christopher Sims, Princeton University

I Adrian Smith, University of London

I Stephen Stigler, University of Chicago

I Sharon Bertsch McGrayne, author of “the theory that would not die”