Pittsburgh and Toronto "Halloween US trip" seminars

DESCRIPTION

These are the slides of the talks given in Pittsburgh and Toronto on Oct. 30 and 31, 2013.

TRANSCRIPT
Relevant statistics for (approximate) Bayesian model choice

Christian P. Robert, Université Paris-Dauphine, Paris & University of Warwick, Coventry
a meeting of potential interest?!

MCMSki IV, to be held in Chamonix Mont Blanc, France, from Monday, Jan. 6 to Wed., Jan. 8, 2014 (with a BNP workshop on Jan. 9). All aspects of MCMC++ theory, methodology and applications, and more! All posters welcomed!

Website http://www.pages.drexel.edu/~mwl25/mcmski/
Outline
Intractable likelihoods
ABC methods
ABC as an inference machine
ABC for model choice
Model choice consistency
Conclusion and perspectives
Intractable likelihood

Case of a well-defined statistical model where the likelihood function

ℓ(θ|y) = f(y1, . . . , yn|θ)

- is (really!) not available in closed form
- cannot (easily!) be either completed or demarginalised
- cannot be (at all!) estimated by an unbiased estimator

Prohibits direct implementation of a generic MCMC algorithm like Metropolis–Hastings
Getting approximative

Case of a well-defined statistical model where the likelihood function

ℓ(θ|y) = f(y1, . . . , yn|θ)

is out of reach

Empirical approximations to the original B problem:

- degrading the data precision down to a tolerance ε
- replacing the likelihood with a non-parametric approximation
- summarising/replacing the data with insufficient statistics
![Page 7: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/7.jpg)
Getting approximative
Case of a well-defined statistical model where the likelihoodfunction
`(θ|y) = f (y1, . . . , yn|θ)
is out of reach
Empirical approximations to the original B problem
I Degrading the data precision down to a tolerance ε
I Replacing the likelihood with a non-parametric approximation
I Summarising/replacing the data with insufficient statistics
![Page 8: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/8.jpg)
Getting approximative
Case of a well-defined statistical model where the likelihoodfunction
`(θ|y) = f (y1, . . . , yn|θ)
is out of reach
Empirical approximations to the original B problem
I Degrading the data precision down to a tolerance ε
I Replacing the likelihood with a non-parametric approximation
I Summarising/replacing the data with insufficient statistics
![Page 9: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/9.jpg)
Different [levels of] worries about abc

Impact on B inference

- a mere computational issue (that will eventually end up being solved by more powerful computers, &tc, even if too costly in the short term, as for regular Monte Carlo methods) Not!
- an inferential issue (opening opportunities for a new inference machine on complex/Big Data models, with legitimacy differing from the classical B approach)
- a Bayesian conundrum (how closely related to the/a B approach? more than recycling B tools? true B with raw data? a new type of empirical B inference?)
Approximate Bayesian computation

Intractable likelihoods

ABC methods
  Genesis of ABC
  ABC basics
  Advances and interpretations
  ABC as knn

ABC as an inference machine

ABC for model choice

Model choice consistency

Conclusion and perspectives
Genetic background of ABC

skip genetics

ABC is a recent computational technique that only requires being able to sample from the likelihood f(·|θ)

This technique stemmed from population genetics models, about 15 years ago, and population geneticists still contribute significantly to methodological developments of ABC.

[Griffith & al., 1997; Tavare & al., 1999]
Demo-genetic inference

Each model is characterized by a set of parameters θ that cover historical (divergence time, admixture time, ...), demographic (population sizes, admixture rates, migration rates, ...) and genetic (mutation rate, ...) factors

The goal is to estimate these parameters from a dataset of polymorphism (DNA sample) y observed at the present time

Problem: most of the time, we cannot calculate the likelihood of the polymorphism data f(y|θ)...
Neutral model at a given microsatellite locus, in a closed panmictic population at equilibrium

Observations: leaves of the tree; θ = ?

Kingman's genealogy: when the time axis is normalized, T(k) ∼ Exp(k(k − 1)/2)

Mutations according to the Simple stepwise Mutation Model (SMM):
- dates of the mutations ∼ Poisson process with intensity θ/2 over the branches
- MRCA = 100
- independent mutations: ±1 with pr. 1/2
![Page 20: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/20.jpg)
Much more interesting models. . .

- several independent loci: independent gene genealogies and mutations
- different populations: linked by an evolutionary scenario made of divergences, admixtures, migrations between populations, selection pressure, etc.
- larger sample size: usually between 50 and 100 genes

A typical evolutionary scenario:

[Tree diagram: MRCA at the root; populations POP 0, POP 1, POP 2 at the leaves; divergence times τ1 and τ2]
![Page 21: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/21.jpg)
Instance of ecological questions

- How did the Asian ladybird beetle arrive in Europe?
- Why do they swarm right now?
- What are the routes of invasion?
- How to get rid of them?
- Why did the chicken cross the road?

[Lombaert & al., 2010, PLoS ONE]
![Page 22: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/22.jpg)
Worldwide routes of invasion of Harmonia axyridis

For each outbreak, the arrow indicates the most likely invasion pathway and the associated posterior probability, with 95% credible intervals in brackets

[Estoup et al., 2012, Molecular Ecology Res.]
![Page 23: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/23.jpg)
Intractable likelihood

Missing (too much missing!) data structure:

f(y|θ) = ∫_G f(y|G, θ) f(G|θ) dG

cannot be computed in a manageable way...
[Stephens & Donnelly, 2000]

The genealogies are considered as nuisance parameters.

This modelling clearly differs from the phylogenetic perspective, where the tree is the parameter of interest.
![Page 24: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/24.jpg)
c© Intractable likelihood
Missing (too much missing!) data structure:
f (y|θ) =
∫G
f (y|G ,θ)f (G |θ)dG
cannot be computed in a manageable way...[Stephens & Donnelly, 2000]
The genealogies are considered as nuisance parameters
This modelling clearly differs from the phylogenetic perspective
where the tree is the parameter of interest.
![Page 25: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/25.jpg)
A?B?C?

- A stands for approximate [wrong likelihood / pic?]
- B stands for Bayesian
- C stands for computation [producing a parameter sample]
[Figure: a grid of 15 density estimates of θ, one per panel, with effective sample sizes ranging from ESS = 75.93 to ESS = 155.7]
How Bayesian is aBc?

Could we turn the resolution into a Bayesian answer?

- ideally so (not meaningful: requires an ∞-ly powerful computer)
- approximation error unknown (w/o costly simulation)
- true B for wrong model (formal and artificial)
- true B for noisy model (much more convincing)
- true B for estimated likelihood (back to econometrics?)
- illuminating the tension between information and precision
![Page 29: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/29.jpg)
ABC methodology

Bayesian setting: target is π(θ)f(x|θ)

When the likelihood f(x|θ) is not in closed form, likelihood-free rejection technique:

Foundation

For an observation y ∼ f(y|θ), under the prior π(θ), if one keeps jointly simulating

θ′ ∼ π(θ), z ∼ f(z|θ′),

until the auxiliary variable z is equal to the observed value, z = y, then the selected

θ′ ∼ π(θ|y)

[Rubin, 1984; Diggle & Gratton, 1984; Tavare et al., 1997]
![Page 30: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/30.jpg)
ABC methodology
Bayesian setting: target is π(θ)f (x |θ)When likelihood f (x |θ) not in closed form, likelihood-free rejectiontechnique:
Foundation
For an observation y ∼ f (y|θ), under the prior π(θ), if one keepsjointly simulating
θ′ ∼ π(θ) , z ∼ f (z|θ′) ,
until the auxiliary variable z is equal to the observed value, z = y,then the selected
θ′ ∼ π(θ|y)
[Rubin, 1984; Diggle & Gratton, 1984; Tavare et al., 1997]
![Page 31: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/31.jpg)
ABC methodology
Bayesian setting: target is π(θ)f (x |θ)When likelihood f (x |θ) not in closed form, likelihood-free rejectiontechnique:
Foundation
For an observation y ∼ f (y|θ), under the prior π(θ), if one keepsjointly simulating
θ′ ∼ π(θ) , z ∼ f (z|θ′) ,
until the auxiliary variable z is equal to the observed value, z = y,then the selected
θ′ ∼ π(θ|y)
[Rubin, 1984; Diggle & Gratton, 1984; Tavare et al., 1997]
![Page 32: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/32.jpg)
A as A...pproximative

When y is a continuous random variable, strict equality z = y is replaced with a tolerance zone

ρ(y, z) ≤ ε

where ρ is a distance

Output distributed from

π(θ) Pθ{ρ(y, z) < ε} ∝ π(θ | ρ(y, z) < ε)

[Pritchard et al., 1999]
ABC algorithm

In most implementations, a further degree of A...pproximation:

Algorithm 1 Likelihood-free rejection sampler
  for i = 1 to N do
    repeat
      generate θ′ from the prior distribution π(·)
      generate z from the likelihood f(·|θ′)
    until ρ{η(z), η(y)} ≤ ε
    set θi = θ′
  end for

where η(y) defines a (not necessarily sufficient) statistic
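Algorithm 1 can be sketched in a few lines. The sketch below is a minimal illustration under a toy conjugate model of my own choosing (y ∼ N(θ, 1) with a N(0, 10) prior, η the sample mean, ρ the absolute difference); none of these modelling choices come from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def abc_rejection(y_obs, prior_sample, simulate, eta, rho, eps, N):
    """Likelihood-free rejection sampler: keep drawing theta' from the
    prior and z from the likelihood until rho(eta(z), eta(y)) <= eps."""
    s_obs = eta(y_obs)
    accepted = []
    while len(accepted) < N:
        theta = prior_sample()
        z = simulate(theta)
        if rho(eta(z), s_obs) <= eps:
            accepted.append(theta)
    return np.array(accepted)

# Toy example (hypothetical): y ~ N(theta, 1), theta ~ N(0, 10),
# eta = sample mean (sufficient here), rho = absolute difference.
n = 50
y_obs = rng.normal(1.0, 1.0, size=n)
post = abc_rejection(
    y_obs,
    prior_sample=lambda: rng.normal(0.0, np.sqrt(10.0)),
    simulate=lambda th: rng.normal(th, 1.0, size=n),
    eta=lambda x: x.mean(),
    rho=lambda a, b: abs(a - b),
    eps=0.05,
    N=200,
)
print(post.mean())  # close to the sample mean of y_obs
```

Since η is sufficient in this toy model and ε is small, the accepted θ's are nearly draws from π(θ|y); with an insufficient η they would only approximate a posterior conditioned on η(y).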
![Page 35: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/35.jpg)
Output

The likelihood-free algorithm samples from the marginal in z of:

πε(θ, z|y) = π(θ) f(z|θ) I_{Aε,y}(z) / ∫_{Aε,y×Θ} π(θ) f(z|θ) dz dθ,

where Aε,y = {z ∈ D | ρ(η(z), η(y)) < ε}.

The idea behind ABC is that the summary statistics coupled with a small tolerance should provide a good approximation of the posterior distribution:

πε(θ|y) = ∫ πε(θ, z|y) dz ≈ π(θ|y).

...does it?!
Output

The likelihood-free algorithm samples from the marginal in z of:

πε(θ, z|y) = π(θ) f(z|θ) I_{Aε,y}(z) / ∫_{Aε,y×Θ} π(θ) f(z|θ) dz dθ,

where Aε,y = {z ∈ D | ρ(η(z), η(y)) < ε}.

The idea behind ABC is that the summary statistics coupled with a small tolerance should provide a good approximation of the restricted posterior distribution:

πε(θ|y) = ∫ πε(θ, z|y) dz ≈ π(θ|η(y)).

Not so good..!
![Page 39: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/39.jpg)
Comments

- Role of the distance is paramount (because ε ≠ 0)
- Scaling of the components of η(y) is also determinant
- ε matters little if "small enough"
- representative of the "curse of dimensionality"
- small is beautiful!
- the data as a whole may be paradoxically weakly informative for ABC
![Page 40: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/40.jpg)
Curse of dimensionality in summaries

Consider that

- the simulated summary statistics η(y1), . . . , η(yN)
- the observed summary statistic η(y_obs)

are iid with uniform distribution on [0, 1]^d

Take δ∞(d, N) = E[ min_{i=1,...,N} ‖η(y_obs) − η(y_i)‖∞ ]

              N = 100   N = 1,000   N = 10,000   N = 100,000
δ∞(1, N)      0.0025    0.00025     0.000025     0.0000025
δ∞(2, N)      0.033     0.01        0.0033       0.001
δ∞(10, N)     0.28      0.22        0.18         0.14
δ∞(200, N)    0.48      0.48        0.47         0.46
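The δ∞ values above are easy to reproduce by Monte Carlo; the sketch below (the number of replications is an arbitrary choice of mine) estimates the expected sup-norm distance to the nearest of N uniform draws on [0, 1]^d.

```python
import numpy as np

rng = np.random.default_rng(1)

def delta_inf(d, N, reps=200):
    """Monte Carlo estimate of E[min_i ||eta(y_obs) - eta(y_i)||_inf]
    for iid uniform summaries on [0, 1]^d."""
    vals = np.empty(reps)
    for r in range(reps):
        obs = rng.random(d)              # observed summary
        sims = rng.random((N, d))        # N simulated summaries
        vals[r] = np.abs(sims - obs).max(axis=1).min()
    return vals.mean()

d2 = delta_inf(2, 100)
d10 = delta_inf(10, 10_000)
print(round(d2, 3), round(d10, 2))  # roughly 0.033 and 0.18, as in the table
```

In dimension 10 and beyond, even very large simulation sizes leave the nearest neighbour far from the observation, which is exactly the curse the table illustrates.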
![Page 41: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/41.jpg)
ABC (simul') advances

how approximative is ABC? ABC as knn

Simulating from the prior is often poor in efficiency.

Either modify the proposal distribution on θ to increase the density of x's within the vicinity of y...
[Marjoram et al, 2003; Bortot et al., 2007; Sisson et al., 2007]

...or view the problem as conditional density estimation and develop techniques to allow for larger ε
[Beaumont et al., 2002]

...or even include ε in the inferential framework [ABCµ]
[Ratmann et al., 2009]
![Page 42: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/42.jpg)
ABC (simul’) advances
how approximative is ABC? ABC as knn
Simulating from the prior is often poor in efficiencyEither modify the proposal distribution on θ to increase the densityof x ’s within the vicinity of y ...
[Marjoram et al, 2003; Bortot et al., 2007, Sisson et al., 2007]
...or by viewing the problem as a conditional density estimationand by developing techniques to allow for larger ε
[Beaumont et al., 2002]
.....or even by including ε in the inferential framework [ABCµ][Ratmann et al., 2009]
![Page 43: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/43.jpg)
ABC (simul’) advances
how approximative is ABC? ABC as knn
Simulating from the prior is often poor in efficiencyEither modify the proposal distribution on θ to increase the densityof x ’s within the vicinity of y ...
[Marjoram et al, 2003; Bortot et al., 2007, Sisson et al., 2007]
...or by viewing the problem as a conditional density estimationand by developing techniques to allow for larger ε
[Beaumont et al., 2002]
.....or even by including ε in the inferential framework [ABCµ][Ratmann et al., 2009]
![Page 44: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/44.jpg)
ABC (simul’) advances
how approximative is ABC? ABC as knn
Simulating from the prior is often poor in efficiencyEither modify the proposal distribution on θ to increase the densityof x ’s within the vicinity of y ...
[Marjoram et al, 2003; Bortot et al., 2007, Sisson et al., 2007]
...or by viewing the problem as a conditional density estimationand by developing techniques to allow for larger ε
[Beaumont et al., 2002]
.....or even by including ε in the inferential framework [ABCµ][Ratmann et al., 2009]
![Page 45: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/45.jpg)
ABC as knn

[Biau et al., 2013, Annales de l'IHP]

Practice of ABC: determine the tolerance ε as a quantile of the observed distances, say the 10% or 1% quantile,

ε = εN = qα(d1, . . . , dN)

- Interpretation of ε as a nonparametric bandwidth is only an approximation of the actual practice
[Blum & Francois, 2010]
- ABC is a k-nearest neighbour (knn) method with kN = NεN
[Loftsgaarden & Quesenberry, 1965]
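In practice the tolerance is thus fixed after simulation, as a quantile of the realised distances. A sketch, under a hypothetical toy model of my own (scalar summaries, normal prior and likelihood) and α = 1%:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical toy setup: theta ~ N(0, 10), eta(z) | theta ~ N(theta, 1),
# and an observed summary s_obs.
N, alpha = 10_000, 0.01
theta = rng.normal(0.0, np.sqrt(10.0), size=N)   # theta' ~ prior
summaries = rng.normal(theta, 1.0)               # eta(z) given theta'
s_obs = 0.8

dist = np.abs(summaries - s_obs)                 # d_1, ..., d_N
eps = np.quantile(dist, alpha)                   # eps_N = q_alpha(d_1, ..., d_N)
keep = theta[dist <= eps]                        # the alpha*N nearest neighbours

print(len(keep))  # about alpha * N = 100 accepted draws
```

Seen this way, ABC retains the k nearest neighbours of the observed summary, which is what links it to knn estimation.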
ABC consistency

Provided

kN / log log N −→ ∞ and kN / N −→ 0

as N → ∞, for almost all s0 (with respect to the distribution of S), with probability 1,

(1/kN) Σ_{j=1}^{kN} ϕ(θj) −→ E[ϕ(θ) | S = s0]

[Devroye, 1982]

Biau et al. (2013) also recall pointwise and integrated mean square error consistency results on the corresponding kernel estimate of the conditional posterior distribution, under the constraints

kN → ∞, kN/N → 0, hN → 0 and h_N^p kN → ∞
Rates of convergence

Further assumptions (on target and kernel) allow for precise (integrated mean square) convergence rates (as a power of the sample size N), derived from classical k-nearest neighbour regression, like

- when m = 1, 2, 3, kN ≈ N^{(p+4)/(p+8)} and rate N^{−4/(p+8)}
- when m = 4, kN ≈ N^{(p+4)/(p+8)} and rate N^{−4/(p+8)} log N
- when m > 4, kN ≈ N^{(p+4)/(m+p+4)} and rate N^{−4/(m+p+4)}

[Biau et al., 2013]

Drag: only applies to sufficient summary statistics
ABC inference machine

Intractable likelihoods

ABC methods

ABC as an inference machine
  Error inc.
  Exact BC and approximate targets
  summary statistic

ABC for model choice

Model choice consistency

Conclusion and perspectives
How Bayesian is aBc..?

- may be a convergent method of inference (meaningful? sufficient? foreign?)
- approximation error unknown (w/o massive simulation)
- pragmatic/empirical B (there is no other solution!)
- many calibration issues (tolerance, distance, statistics)
- the NP side should be incorporated into the whole B picture
- the approximation error should also be part of the B inference
ABCµ

Idea: infer about the error as well as about the parameter. Use the joint density

f(θ, ε|y) ∝ ξ(ε|y, θ) × πθ(θ) × πε(ε)

where y is the data, and ξ(ε|y, θ) is the prior predictive density of ρ(η(z), η(y)) given θ and y when z ∼ f(z|θ)

Warning! Replacement of ξ(ε|y, θ) with a non-parametric kernel approximation.

[Ratmann, Andrieu, Wiuf and Richardson, 2009, PNAS]
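The kernel approximation of ξ(ε|y, θ) can be sketched as follows, under a toy model of my own choosing (z a normal sample, η the sample mean); the simulation size m and bandwidth h are arbitrary tuning choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)

def xi_hat(theta, s_obs, simulate, eta, eps_grid, m=500, h=0.05):
    """Gaussian-kernel estimate of xi(eps | y, theta): the density of
    rho(eta(z), eta(y)) for z ~ f(.|theta), from m simulated distances."""
    d = np.array([abs(eta(simulate(theta)) - s_obs) for _ in range(m)])
    u = (eps_grid[:, None] - d[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (m * h * np.sqrt(2 * np.pi))

# Toy model: z is a sample of size n from N(theta, 1), eta = sample mean.
n = 50
simulate = lambda th: rng.normal(th, 1.0, size=n)
eta = lambda x: x.mean()
s_obs = 0.8
eps_grid = np.linspace(0.0, 1.0, 101)

dens = xi_hat(0.8, s_obs, simulate, eta, eps_grid)
# When theta matches the data, the mass of xi concentrates near eps = 0.
print(dens[0] > dens[-1])
```

This estimated error density then enters the joint target above in place of the intractable ξ(ε|y, θ).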
Wilkinson's exact BC (not exactly!)

ABC approximation error (i.e. non-zero tolerance) replaced with exact simulation from a controlled approximation to the target, a convolution of the true posterior with a kernel function

πε(θ, z|y) = π(θ) f(z|θ) Kε(y − z) / ∫ π(θ) f(z|θ) Kε(y − z) dz dθ,

with Kε a kernel parameterised by the bandwidth ε.
[Wilkinson, 2013]

Theorem

The ABC algorithm based on the production of a randomised observation y = y + ξ, ξ ∼ Kε, and an acceptance probability of

Kε(y − z)/M

gives draws from the posterior distribution π(θ|y).
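A sketch of this kernel-acceptance step, under assumptions of my own (scalar y ∼ N(θ, 1), θ ∼ N(0, 10), Gaussian Kε, and M = Kε(0) so that the acceptance ratio is at most 1):

```python
import numpy as np

rng = np.random.default_rng(4)

eps = 0.2
# K_eps(u) / M with Gaussian K_eps and M = K_eps(0): constants cancel.
accept_prob = lambda u: np.exp(-0.5 * (u / eps) ** 2)

y = 1.3
samples = []
while len(samples) < 500:
    theta = rng.normal(0.0, np.sqrt(10.0))   # theta' ~ prior
    z = rng.normal(theta, 1.0)               # z ~ f(.|theta')
    if rng.random() < accept_prob(y - z):    # accept w.p. K_eps(y - z) / M
        samples.append(theta)

# The output targets the posterior under the convolved likelihood
# f * K_eps, here N(theta, 1 + eps^2), with posterior mean:
exact_mean = 10 * y / (10 + 1 + eps**2)
print(abs(np.mean(samples) - exact_mean) < 0.2)
```

The Gaussian case makes the "controlled approximation" explicit: the draws are exact for the slightly noisier model y = z + ξ rather than approximate for the original one.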
How exact a BC?
Pros
I Pseudo-data from true model vs. observed data from noisy model
I Outcome is completely controlled
I Link with ABCµ when assuming y observed with measurement error with density Kε
I Relates to the theory of model approximation
[Kennedy & O’Hagan, 2001]
I leads to “noisy ABC”: just do it!, i.e. perturb the data down to precision ε and proceed
[Fearnhead & Prangle, 2012]
Cons
I Requires Kε to be bounded by M
I True approximation error never assessed
![Page 61: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/61.jpg)
Which summary?
Fundamental difficulty of the choice of the summary statistic when there is no non-trivial sufficient statistic
I Loss of statistical information balanced against gain in data roughening
I Approximation error and information loss remain unknown
I Choice of statistics induces choice of distance function, towards standardisation
I may be imposed for external/practical reasons (e.g., DIYABC)
I may gather several non-B point estimates
I can learn about efficient combination
![Page 64: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/64.jpg)
Semi-automatic ABC
Fearnhead and Prangle (2012) study ABC and the selection of the summary statistic in close proximity to Wilkinson’s proposal
I ABC considered as an inferential method and calibrated as such
I randomised (or ‘noisy’) version of the summary statistics
η̃(y) = η(y) + τε
I derivation of a well-calibrated version of ABC, i.e. an algorithm that gives proper predictions for the distribution associated with this randomised summary statistic
[weak property]
![Page 65: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/65.jpg)
Summary [of F&P/statistics]
I optimality of the posterior expectation
E[θ|y]
of the parameter of interest as summary statistic η(y)! [requires iterative process]
I use of the standard quadratic loss function
(θ − θ0)ᵀ A (θ − θ0)
I recent extension to model choice, optimality of Bayes factor
B12(y)
[F&P, ISBA 2012, Kyoto]
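The iterative recipe behind the posterior-expectation summary can be sketched in a few lines: run a pilot batch of simulations, regress θ on a vector of candidate statistics, and use the fitted linear predictor of E[θ|y] as a one-dimensional summary. The toy model (y_i ∼ N(θ, 1), θ ∼ U(−3, 3)) and the candidate statistics are illustrative choices, not F&P's actual examples; the least-squares step is hand-rolled to stay stdlib-only:

```python
import random, statistics as st

rng = random.Random(1)

def solve(A, b):
    # Gauss-Jordan elimination with partial pivoting (small systems only)
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c and M[r][c]:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b_ for a, b_ in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def simulate(theta, n=20):          # toy model: y_i ~ N(theta, 1)
    return [rng.gauss(theta, 1.0) for _ in range(n)]

def stats(y):                       # candidate summaries (plus intercept)
    return [1.0, st.fmean(y), st.median(y), st.variance(y)]

# pilot run: (theta, summaries) pairs with theta ~ U(-3, 3)
pilot = [(th, stats(simulate(th)))
         for th in (rng.uniform(-3, 3) for _ in range(3000))]

# least-squares fit of theta on the summaries: beta = (X'X)^{-1} X'theta
k = 4
XtX = [[sum(s[i] * s[j] for _, s in pilot) for j in range(k)] for i in range(k)]
Xty = [sum(s[i] * th for th, s in pilot) for i in range(k)]
beta = solve(XtX, Xty)

def eta(y):
    # fitted linear predictor of E[theta | y], used as the ABC summary
    return sum(b * s for b, s in zip(beta, stats(y)))

y_obs = simulate(1.5)
```

Since the sample mean is sufficient here, the learnt summary essentially reproduces it; the point of the construction is that it also works when no single candidate statistic is sufficient.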
![Page 67: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/67.jpg)
LDA summaries for model choice
In parallel to F&P’s semi-automatic ABC, selection of the most discriminant subvector out of a collection of summary statistics, based on Linear Discriminant Analysis (LDA)
[Estoup & al., 2012, Mol. Ecol. Res.]
Solution now implemented in DIYABC.2
[Cornuet & al., 2008, Bioinf.; Estoup & al., 2013]
![Page 68: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/68.jpg)
Implementation
Step 1: Take a subset of the α% (e.g., 1%) best simulations from an ABC reference table, usually including 10^6–10^9 simulations for each of the M compared scenarios/models
Selection based on normalised Euclidean distance computed between observed and simulated raw summaries
Step 2: run LDA on this subset to transform summaries into (M − 1) discriminant variables
When computing LDA functions, weight simulated data with an Epanechnikov kernel
Step 3: estimation of the posterior probabilities of each competing scenario/model by polychotomous local logistic regression against the M − 1 most discriminant variables
[Cornuet & al., 2008, Bioinformatics]
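Step 2 above can be illustrated with a numpy-free Fisher LDA for the simplest case of M = 2 scenarios (hence a single discriminant variable). The Poisson/geometric scenario pair, the log-variance summary and all sample sizes are illustrative assumptions; DIYABC's actual pipeline (Epanechnikov weighting, polychotomous regression) is not reproduced here:

```python
import math, random

rng = random.Random(2)

def r_poisson(lam, n):
    # inversion sampling for the Poisson distribution (stdlib only)
    out = []
    for _ in range(n):
        u, k, p = rng.random(), 0, math.exp(-lam)
        c = p
        while u > c:
            k += 1
            p *= lam / k
            c += p
        out.append(k)
    return out

def r_geometric(p, n):
    # P(X = k) = p (1 - p)^k on k = 0, 1, 2, ...
    return [int(math.log(1.0 - rng.random()) / math.log(1.0 - p))
            for _ in range(n)]

def summaries(x):
    # raw summaries: sample mean and log sample variance
    m = sum(x) / len(x)
    v = sum((xi - m) ** 2 for xi in x) / (len(x) - 1)
    return (m, math.log(v + 1.0))

# step 1 stand-in: a small reference table (two scenarios, matched means)
table = [(0, summaries(r_poisson(3.0, 50))) for _ in range(200)]
table += [(1, summaries(r_geometric(0.25, 50))) for _ in range(200)]

# step 2: Fisher LDA, M = 2 so a single discriminant direction w
s0 = [s for c, s in table if c == 0]
s1 = [s for c, s in table if c == 1]
m0 = tuple(sum(s[i] for s in s0) / len(s0) for i in range(2))
m1 = tuple(sum(s[i] for s in s1) / len(s1) for i in range(2))
Sw = [[0.0, 0.0], [0.0, 0.0]]          # pooled within-class scatter
for group, m in ((s0, m0), (s1, m1)):
    for s in group:
        d = (s[0] - m[0], s[1] - m[1])
        for i in range(2):
            for j in range(2):
                Sw[i][j] += d[i] * d[j]
det = Sw[0][0] * Sw[1][1] - Sw[0][1] * Sw[1][0]
dm = (m0[0] - m1[0], m0[1] - m1[1])
w = ((Sw[1][1] * dm[0] - Sw[0][1] * dm[1]) / det,    # w = Sw^{-1}(m0 - m1)
     (-Sw[1][0] * dm[0] + Sw[0][0] * dm[1]) / det)

def discriminant(s):
    return w[0] * s[0] + w[1] * s[1]

# sanity check: the single discriminant variable separates the scenarios
cut = (discriminant(m0) + discriminant(m1)) / 2.0
flip = 1.0 if discriminant(m0) > cut else -1.0
acc = sum(((discriminant(s) - cut) * flip > 0) == (c == 0)
          for c, s in table) / len(table)
```

The projection collapses the two raw summaries onto the one axis that discriminates the scenarios, which is exactly the dimension reduction the slide advocates.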
![Page 69: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/69.jpg)
LDA advantages
I much faster computation of scenario probabilities via polychotomous regression
I a (much) lower number of explanatory variables improves the accuracy of the ABC approximation, reduces the tolerance ε and avoids extra costs in constructing the reference table
I allows for a large collection of initial summaries
I ability to evaluate Type I and Type II errors on more complex models
I LDA reduces correlation among explanatory variables
When available, using both simulated and real data sets, posterior probabilities of scenarios computed from LDA-transformed and raw summaries are strongly correlated
![Page 71: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/71.jpg)
ABC for model choice
Intractable likelihoods
ABC methods
ABC as an inference machine
ABC for model choice
BMC Principle
Gibbs random fields (counterexample)
Generic ABC model choice
Model choice consistency
Conclusion and perspectives
![Page 72: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/72.jpg)
Bayesian model choice
BMC Principle
Several models
M1, M2, . . .
are considered simultaneously for dataset y, with the model index M central to inference. Use of
I prior π(M = m), plus
I prior distribution on the parameter conditional on the value m of the model index, πm(θm)
Goal is to derive the posterior distribution of M,
π(M = m|data)
a challenging computational target when models are complex.
![Page 74: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/74.jpg)
Generic ABC for model choice
Algorithm 2 Likelihood-free model choice sampler (ABC-MC)
for t = 1 to T do
  repeat
    Generate m from the prior π(M = m)
    Generate θm from the prior πm(θm)
    Generate z from the model fm(z|θm)
  until ρ{η(z), η(y)} < ε
  Set m(t) = m and θ(t) = θm
end for
[Grelaud & al., 2009; Toni & al., 2009]
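A minimal Python sketch of this likelihood-free model choice sampler, on an illustrative pair of models (Poisson vs. geometric, anticipating a later example of the talk). The priors λ ∼ U(0, 10) and p ∼ U(0.01, 0.99), the summary η(z) = (mean, variance) and the tolerance ε = 1 are arbitrary choices, not taken from the slides:

```python
import math, random

rng = random.Random(0)

def r_poisson(lam, n):
    # inversion sampling for the Poisson distribution (stdlib only)
    out = []
    for _ in range(n):
        u, k, p = rng.random(), 0, math.exp(-lam)
        c = p
        while u > c:
            k += 1
            p *= lam / k
            c += p
        out.append(k)
    return out

def r_geometric(p, n):
    # P(X = k) = p (1 - p)^k on k = 0, 1, 2, ...
    return [int(math.log(1.0 - rng.random()) / math.log(1.0 - p))
            for _ in range(n)]

def eta(z):  # summary: (sample mean, sample variance)
    m = sum(z) / len(z)
    return (m, sum((zi - m) ** 2 for zi in z) / (len(z) - 1))

def abc_mc(y, T, eps):
    s_obs, models = eta(y), []
    while len(models) < T:                      # "repeat ... until" loop
        m = 1 if rng.random() < 0.5 else 2      # m ~ prior pi(M = m)
        if m == 1:
            z = r_poisson(rng.uniform(0.0, 10.0), len(y))   # lambda ~ U(0,10)
        else:
            z = r_geometric(rng.uniform(0.01, 0.99), len(y))  # p ~ U(.01,.99)
        if math.dist(eta(z), s_obs) < eps:      # rho{eta(z), eta(y)} < eps
            models.append(m)
    return models

y = r_poisson(3.0, 50)                          # observed data
models = abc_mc(y, T=200, eps=1.0)
p_poisson = sum(m == 1 for m in models) / len(models)
```

Since Poisson data have variance close to their mean while a geometric with the same mean is far more dispersed, the accepted model indices concentrate on the Poisson model.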
![Page 75: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/75.jpg)
ABC estimates
Posterior probability π(M = m|y) approximated by the frequency of acceptances from model m
(1/T) ∑_{t=1}^{T} I{m(t)=m} .
Extension to a weighted polychotomous logistic regression estimate of π(M = m|y), with non-parametric kernel weights
[Cornuet et al., DIYABC, 2009]
![Page 77: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/77.jpg)
Potts model
Skip MRFs
Potts model
Distribution with an energy function of the form
θS(y) = θ ∑_{l∼i} δ{yl=yi}
where l∼i denotes a neighbourhood structure
In most realistic settings, the summation
Zθ = ∑_{x∈X} exp{θS(x)}
involves too many terms to be manageable and numerical approximations cannot always be trusted
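To see the combinatorial blow-up concretely, here is a brute-force computation of Zθ for a tiny one-dimensional Potts chain (a toy setting chosen for illustration; realistic applications involve 2-D grids where the q^n terms are utterly out of reach). For a chain, the transfer-matrix identity Zθ = q(e^θ + q − 1)^{n−1} provides an exact check:

```python
import itertools, math

def Z_brute(theta, n, q):
    # sum over all q**n colourings of a 1-D chain with q colours;
    # S(x) counts agreeing neighbour pairs
    total = 0.0
    for x in itertools.product(range(q), repeat=n):
        S = sum(x[i] == x[i + 1] for i in range(n - 1))
        total += math.exp(theta * S)
    return total

def Z_chain(theta, n, q):
    # exact transfer-matrix value, available only for the 1-D chain
    return q * (math.exp(theta) + q - 1) ** (n - 1)

# 3 colours on 8 sites: already 6561 terms; a modest 20x20 grid with
# q = 3 would require 3**400 terms
z = Z_brute(0.8, 8, 3)
```

The brute-force sum matches the closed form on the chain, but no such closed form exists on general graphs, which is what forces likelihood-free methods here.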
![Page 79: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/79.jpg)
Neighbourhood relations
Setup: choice to be made between M neighbourhood relations
i ∼m i′ (0 ≤ m ≤ M − 1)
with
Sm(x) = ∑_{i∼m i′} I{xi=xi′}
driven by the posterior probabilities of the models.
![Page 80: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/80.jpg)
Model index
Computational target:
P(M = m|x) ∝ ∫_{Θm} fm(x|θm) πm(θm) dθm π(M = m)
If S(x) is a sufficient statistic for the joint parameters (M, θ0, . . . , θM−1), then
P(M = m|x) = P(M = m|S(x)) .
![Page 82: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/82.jpg)
Sufficient statistics in Gibbs random fields
Each model m has its own sufficient statistic Sm(·) and S(·) = (S0(·), . . . , SM−1(·)) is also (model-)sufficient.
Explanation: for Gibbs random fields,
x|M = m ∼ fm(x|θm) = f1m(x|S(x)) f2m(S(x)|θm) = f2m(S(x)|θm) / n(S(x))
where
n(S(x)) = ♯ {x̃ ∈ X : S(x̃) = S(x)}
c© S(x) is sufficient for the joint parameters
![Page 84: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/84.jpg)
Toy example
iid Bernoulli model versus two-state first-order Markov chain, i.e.
f0(x|θ0) = exp(θ0 ∑_{i=1}^{n} I{xi=1}) / {1 + exp(θ0)}^n ,
versus
f1(x|θ1) = (1/2) exp(θ1 ∑_{i=2}^{n} I{xi=xi−1}) / {1 + exp(θ1)}^{n−1} ,
with priors θ0 ∼ U(−5, 5) and θ1 ∼ U(0, 6) (inspired by “phase transition” boundaries).
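A small simulation sketch of this toy pair (the parameter values and sample size below are arbitrary choices): each model is sampled through its natural parameterisation, and the two model-wise sufficient statistics S0 (number of ones) and S1 (number of repeats) are computed:

```python
import math, random

rng = random.Random(3)

def r_bernoulli(theta0, n):
    # f0: iid Bernoulli with P(x_i = 1) = exp(theta0) / (1 + exp(theta0))
    p = math.exp(theta0) / (1.0 + math.exp(theta0))
    return [1 if rng.random() < p else 0 for _ in range(n)]

def r_markov(theta1, n):
    # f1: two-state chain, x_1 uniform on {0, 1},
    # P(x_i = x_{i-1}) = exp(theta1) / (1 + exp(theta1))
    stay = math.exp(theta1) / (1.0 + math.exp(theta1))
    x = [1 if rng.random() < 0.5 else 0]
    for _ in range(n - 1):
        x.append(x[-1] if rng.random() < stay else 1 - x[-1])
    return x

def S0(x):  # sufficient statistic under f0: number of ones
    return sum(xi == 1 for xi in x)

def S1(x):  # sufficient statistic under f1: number of repeats
    return sum(x[i] == x[i - 1] for i in range(1, len(x)))

x0 = r_bernoulli(0.0, 1000)   # theta0 = 0: fair coin, S0 near n/2
x1 = r_markov(3.0, 1000)      # theta1 = 3: sticky chain, S1 near 0.95 n
```

As the next slides discuss, the pair (S0, S1) is sufficient for the joint parameter (m, θm) here, which is what makes this Gibbs-field example favourable for ABC model choice.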
![Page 85: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/85.jpg)
About sufficiency
If η1(x) is a sufficient statistic for model m = 1 and parameter θ1 and η2(x) is a sufficient statistic for model m = 2 and parameter θ2, (η1(x), η2(x)) is not always sufficient for (m, θm)
c© Potential loss of information at the testing level
![Page 87: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/87.jpg)
Poisson/geometric example
Sample
x = (x1, . . . , xn)
from either a Poisson P(λ) or from a geometric G(p)
Sum
S = ∑_{i=1}^{n} xi = η(x)
sufficient statistic for either model but not simultaneously
![Page 88: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/88.jpg)
Limiting behaviour of B12 (T →∞)
ABC approximation
B12(y) = [∑_{t=1}^{T} I{mt=1} I{ρ{η(zt),η(y)}≤ε}] / [∑_{t=1}^{T} I{mt=2} I{ρ{η(zt),η(y)}≤ε}] ,
where the (mt, zt)’s are simulated from the (joint) prior
As T goes to infinity, limit
Bε12(y) = [∫ I{ρ{η(z),η(y)}≤ε} π1(θ1) f1(z|θ1) dz dθ1] / [∫ I{ρ{η(z),η(y)}≤ε} π2(θ2) f2(z|θ2) dz dθ2]
= [∫ I{ρ{η,η(y)}≤ε} π1(θ1) fη1(η|θ1) dη dθ1] / [∫ I{ρ{η,η(y)}≤ε} π2(θ2) fη2(η|θ2) dη dθ2] ,
where fη1(η|θ1) and fη2(η|θ2) are the distributions of η(z)
![Page 90: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/90.jpg)
Limiting behaviour of B12 (ε→ 0)
When ε goes to zero,
Bη12(y) = ∫ π1(θ1) fη1(η(y)|θ1) dθ1 / ∫ π2(θ2) fη2(η(y)|θ2) dθ2
c© Bayes factor based on the sole observation of η(y)
![Page 92: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/92.jpg)
Limiting behaviour of B12 (under sufficiency)
If η(y) is a sufficient statistic in both models,
fi(y|θi) = gi(y) fηi(η(y)|θi)
Thus
B12(y) = [∫_{Θ1} π(θ1) g1(y) fη1(η(y)|θ1) dθ1] / [∫_{Θ2} π(θ2) g2(y) fη2(η(y)|θ2) dθ2]
= [g1(y) ∫ π1(θ1) fη1(η(y)|θ1) dθ1] / [g2(y) ∫ π2(θ2) fη2(η(y)|θ2) dθ2]
= [g1(y)/g2(y)] Bη12(y) .
[Didelot, Everitt, Johansen & Lawson, 2011]
c© No discrepancy only when cross-model sufficiency holds
![Page 94: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/94.jpg)
Poisson/geometric example (back)
Sample
x = (x1, . . . , xn)
from either a Poisson P(λ) or from a geometric G(p)
Discrepancy ratio
g1(x)/g2(x) = [S! n^{−S} / ∏_i xi!] / [1 / C(n+S−1, S)]
![Page 95: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/95.jpg)
Poisson/geometric discrepancy
Range of B12(x) versus Bη12(x): The values produced have nothingin common.
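The discrepancy can be checked exactly in this conjugate setting. Assuming an Exp(1) prior on λ and a U(0, 1) prior on p (priors not specified on the slides, chosen here for tractability), all four marginal likelihoods are available in closed form, and the identity B12(y) = (g1/g2) Bη12(y) from the previous slide can be verified term by term:

```python
import math

def log_m1(x):
    # Poisson P(lambda), lambda ~ Exp(1):
    # m1(x) = S! / ((n+1)^{S+1} prod_i x_i!)
    n, S = len(x), sum(x)
    return (math.lgamma(S + 1) - (S + 1) * math.log(n + 1)
            - sum(math.lgamma(xi + 1) for xi in x))

def log_m2(x):
    # geometric G(p), p ~ U(0, 1): m2(x) = B(n+1, S+1)
    n, S = len(x), sum(x)
    return math.lgamma(n + 1) + math.lgamma(S + 1) - math.lgamma(n + S + 2)

def log_m1_eta(x):
    # based on eta(x) = S alone: S | lambda ~ Poisson(n lambda),
    # m1^eta(S) = n^S / (n+1)^{S+1}
    n, S = len(x), sum(x)
    return S * math.log(n) - (S + 1) * math.log(n + 1)

def log_m2_eta(x):
    # S | p ~ NegBin(n, p): m2^eta(S) = C(n+S-1, S) B(n+1, S+1)
    n, S = len(x), sum(x)
    return (math.lgamma(n + S) - math.lgamma(n) - math.lgamma(S + 1)
            + math.lgamma(n + 1) + math.lgamma(S + 1) - math.lgamma(n + S + 2))

x = [2, 0, 1, 3, 1, 0, 2, 1]
log_B12 = log_m1(x) - log_m2(x)
log_B12_eta = log_m1_eta(x) - log_m2_eta(x)
# slide identity: g1/g2 = C(n+S-1, S) S! n^{-S} / prod_i x_i!
n, S = len(x), sum(x)
log_g_ratio = (math.lgamma(n + S) - math.lgamma(n) - math.lgamma(S + 1)
               + math.lgamma(S + 1) - S * math.log(n)
               - sum(math.lgamma(xi + 1) for xi in x))
```

Even on this small dataset the two log Bayes factors differ by the non-negligible term log(g1/g2), illustrating why the summary-based factor cannot be used as a stand-in for the exact one.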
![Page 96: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/96.jpg)
Formal recovery
Creating an encompassing exponential family
f (x|θ1, θ2, α1, α2) ∝ exp{θ1ᵀ η1(x) + θ2ᵀ η2(x) + α1 t1(x) + α2 t2(x)}
leads to a sufficient statistic (η1(x), η2(x), t1(x), t2(x))
[Didelot, Everitt, Johansen & Lawson, 2011]
In the Poisson/geometric case, if ∏_i xi! is added to S, no discrepancy
Only applies in genuine sufficiency settings...
c© Inability to evaluate information loss due to summary statistics
![Page 99: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/99.jpg)
Meaning of the ABC-Bayes factor
The ABC approximation to the Bayes factor is based solely on the summary statistics...
In the Poisson/geometric case, if E[yi] = θ0 > 0,
lim_{n→∞} Bη12(y) = (θ0 + 1)^2 e^{−θ0} / θ0
![Page 101: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/101.jpg)
ABC model choice consistency
Intractable likelihoods
ABC methods
ABC as an inference machine
ABC for model choice
Model choice consistency
Formalised framework
Consistency results
Summary statistics
Conclusion and perspectives
![Page 102: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/102.jpg)
The starting point
Central question to the validation of ABC for model choice:
When is a Bayes factor based on an insufficient statistic T(y)consistent?
Note/warning: the conclusion c© drawn on T(y) through BT12(y) necessarily differs from the conclusion c© drawn on y through B12(y)
[Marin, Pillai, X, & Rousseau, JRSS B, 2013]
![Page 104: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/104.jpg)
A benchmark, if toy, example
Comparison suggested by a referee of the PNAS paper [thanks!]:
[X, Cornuet, Marin, & Pillai, Aug. 2011]
Model M1: y ∼ N(θ1, 1), opposed to model M2: y ∼ L(θ2, 1/√2), Laplace distribution with mean θ2 and scale parameter 1/√2 (variance one).
Four possible statistics
1. sample mean ȳ (sufficient for M1 if not M2);
2. sample median med(y) (insufficient);
3. sample variance var(y) (ancillary);
4. median absolute deviation mad(y) = med(|y − med(y)|).
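A quick numerical look at the limiting values of these statistics (a stdlib-only sketch; the Laplace draw uses the difference-of-exponentials representation): under both models the sample variance settles near 1, whereas mad(y) tends to Φ⁻¹(0.75) ≈ 0.6745 under the Gaussian and to (log 2)/√2 ≈ 0.4901 under the Laplace — the only statistic of the four whose limit separates the two models:

```python
import math, random, statistics as st

rng = random.Random(4)

def r_normal(theta, n):
    return [rng.gauss(theta, 1.0) for _ in range(n)]

def r_laplace(theta, n):
    # Laplace with mean theta, scale 1/sqrt(2) (hence variance one),
    # via the difference of two independent Exp(1) variables
    b = 1.0 / math.sqrt(2.0)
    return [theta + b * (rng.expovariate(1.0) - rng.expovariate(1.0))
            for _ in range(n)]

def mad(y):
    m = st.median(y)
    return st.median([abs(v - m) for v in y])

n = 20_000
g, lap = r_normal(1.0, n), r_laplace(1.0, n)
# mean and median estimate theta = 1 under both models, the variance
# is close to 1 under both; only mad separates the two models
mad_g, mad_lap = mad(g), mad(lap)
```

This is the population-level reason behind the behaviour shown in the boxplots that follow.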
![Page 106: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/106.jpg)
[Boxplots of results over Gauss and Laplace samples, n = 100]
![Page 108: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/108.jpg)
Framework
Starting from the sample
y = (y1, . . . , yn)
the observed data, not necessarily iid, with true distribution
y ∼ Pn
Summary statistics
T(y) = Tn = (T1(y), T2(y), · · · , Td(y)) ∈ R^d
with true distribution Tn ∼ Gn.
![Page 109: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/109.jpg)
Framework
c© Comparison of
– under M1, y ∼ F1,n(·|θ1) where θ1 ∈ Θ1 ⊂ R^{p1}
– under M2, y ∼ F2,n(·|θ2) where θ2 ∈ Θ2 ⊂ R^{p2}
turned into
– under M1, T(y) ∼ G1,n(·|θ1), and θ1|T(y) ∼ π1(·|Tn)
– under M2, T(y) ∼ G2,n(·|θ2), and θ2|T(y) ∼ π2(·|Tn)
![Page 110: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/110.jpg)
Assumptions
A collection of asymptotic “standard” assumptions:
[A1] is a standard central limit theorem under the true model
[A2] controls the large deviations of the estimator Tn from the estimand µ(θ)
[A3] is the standard prior mass condition found in Bayesian asymptotics (di effective dimension of the parameter)
[A4] restricts the behaviour of the model density against the true density
[Think CLT!]
![Page 111: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/111.jpg)
Assumptions
A collection of asymptotic “standard” assumptions:
[Think CLT!]
[A1] There exist
I a sequence {vn} converging to +∞,
I a distribution Q,
I a symmetric, d × d positive definite matrix V0 and
I a vector µ0 ∈ Rd
such that
vn V0^{−1/2} (Tn − µ0) → Q as n → ∞, under Gn
![Page 112: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/112.jpg)
Assumptions
A collection of asymptotic “standard” assumptions:
[Think CLT!]
[A2] For i = 1, 2, there exist sets Fn,i ⊂ Θi, functions µi(θi) and constants εi, τi, αi > 0 such that for all τ > 0,
sup_{θi∈Fn,i} Gi,n[ |Tn − µi(θi)| > τ|µi(θi) − µ0| ∧ εi | θi ] (τ|µi(θi) − µ0| ∧ εi)^{−αi} ≲ vn^{−αi}
with
πi(Fcn,i) = o(vn^{−τi}).
![Page 113: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/113.jpg)
Assumptions
A collection of asymptotic “standard” assumptions:
[Think CLT!]
[A3] If inf{|µi(θi) − µ0|; θi ∈ Θi} = 0, defining (u > 0)
Sn,i(u) = {θi ∈ Fn,i; |µi(θi) − µ0| ≤ u vn^{−1}} ,
there exist constants di < τi ∧ αi − 1 such that
πi(Sn,i(u)) ∼ u^{di} vn^{−di} , ∀u ≲ vn
![Page 114: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/114.jpg)
Assumptions
A collection of asymptotic “standard” assumptions:
[Think CLT!]
[A4] If inf{|µi(θi) − µ0|; θi ∈ Θi} = 0, for any ε > 0, there exist U, δ > 0 and (En)n such that, if θi ∈ Sn,i(U),
En ⊂ {t; gi(t|θi) < δ gn(t)} and Gn(Ecn) < ε.
![Page 116: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/116.jpg)
Effective dimension
Understanding di in [A3]: defined only when µ0 ∈ {µi(θi), θi ∈ Θi},
πi(θi : |µi(θi) − µ0| < n^{−1/2}) = O(n^{−di/2})
is the effective dimension of the model Θi around µ0
![Page 117: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/117.jpg)
Effective dimension
Understanding di in [A3]: when inf{|µi(θi) − µ0|; θi ∈ Θi} = 0,
mi(Tn) / gn(Tn) ∼ vn^{−di} ,
thus
log( mi(Tn) / gn(Tn) ) ∼ −di log vn
and vn^{−di} is the penalisation factor resulting from integrating θi out (see the effective number of parameters in DIC)
![Page 118: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/118.jpg)
Effective dimension
Understanding di in [A3]: in regular models, di is the dimension of T(Θi), leading to BIC; in non-regular models, di can be smaller
![Page 119: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/119.jpg)
Asymptotic marginals
Asymptotically, under [A1]–[A4]
mi(t) = ∫_{Θi} gi(t|θi) πi(θi) dθi
is such that
(i) if inf{|µi(θi) − µ0|; θi ∈ Θi} = 0,
Cl vn^{d−di} ≤ mi(Tn) ≤ Cu vn^{d−di}
and
(ii) if inf{|µi(θi) − µ0|; θi ∈ Θi} > 0,
mi(Tn) = oPn[ vn^{d−τi} + vn^{d−αi} ].
![Page 120: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/120.jpg)
Within-model consistency
Under the same assumptions, if inf{|µi(θi) − µ0|; θi ∈ Θi} = 0, the posterior distribution of µi(θi) given Tn is consistent at rate 1/vn provided αi ∧ τi > di.
![Page 121: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/121.jpg)
Between-model consistency
Consequence of the above: the asymptotic behaviour of the Bayes factor is driven by the asymptotic mean value of Tn under both models. And only by this mean value!
Indeed, if
inf{|µ0 − µ2(θ2)|; θ2 ∈ Θ2} = inf{|µ0 − µ1(θ1)|; θ1 ∈ Θ1} = 0
then
Cl vn^{−(d1−d2)} ≤ m1(Tn)/m2(Tn) ≤ Cu vn^{−(d1−d2)} ,
where Cl, Cu = OPn(1), irrespective of the true model.
c© Only depends on the difference d1 − d2: no consistency
Else, if
inf{|µ0 − µ2(θ2)|; θ2 ∈ Θ2} > inf{|µ0 − µ1(θ1)|; θ1 ∈ Θ1} = 0
then
m1(Tn)/m2(Tn) ≥ Cu min( vn^{−(d1−α2)}, vn^{−(d1−τ2)} )
![Page 124: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/124.jpg)
Consistency theorem
If
inf{|µ0 − µ2(θ2)|; θ2 ∈ Θ2} = inf{|µ0 − µ1(θ1)|; θ1 ∈ Θ1} = 0,
the Bayes factor

B^T_12 = O(vn^{−(d1−d2)})

irrespective of the true model. It is inconsistent, since it always picks the model with the smallest dimension.
![Page 125: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/125.jpg)
Consistency theorem
If Pn belongs to one of the two models and if µ0 cannot be attained by the other one:

0 = min( inf{|µ0 − µi(θi)|; θi ∈ Θi}, i = 1, 2 )
  < max( inf{|µ0 − µi(θi)|; θi ∈ Θi}, i = 1, 2 ),

then the Bayes factor B^T_12 is consistent.
![Page 126: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/126.jpg)
Consequences on summary statistics
The Bayes factor is driven by the means µi(θi) and the relative position of µ0 with respect to both sets {µi(θi); θi ∈ Θi}, i = 1, 2.

For ABC, this implies the simplest summary statistics Tn are ancillary statistics with different mean values under both models.

Else, if Tn asymptotically depends on some of the parameters of the models, it is possible that there exists θi ∈ Θi such that µi(θi) = µ0 even though model M1 is misspecified.
![Page 127: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/127.jpg)
![Page 128: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/128.jpg)
Toy example: Laplace versus Gauss [1]
If

Tn = n^{−1} Σ_{i=1}^n X_i^4 ,   µ1(θ) = 3 + θ^4 + 6θ^2 ,   µ2(θ) = 6 + · · ·

and the true distribution is Laplace with mean θ0 = 1, then under the Gaussian model the value θ∗ = √(2√3 − 3) also leads to µ0 = µ(θ∗) [here d1 = d2 = d = 1].
![Page 129: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/129.jpg)
Toy example: Laplace versus Gauss [1]
© A Bayes factor associated with such a statistic is inconsistent
![Page 130: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/130.jpg)
Toy example: Laplace versus Gauss [1]
![Page 131: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/131.jpg)
Toy example: Laplace versus Gauss [1]
[Boxplots of the fourth-moment statistic Tn under models M1 and M2, n = 1000. Caption: Fourth moment]
![Page 132: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/132.jpg)
Toy example: Laplace versus Gauss [0]
When

T(y) = ( y_n^{(4)}, y_n^{(6)} )

and the true distribution is Laplace with mean θ0 = 0, then µ0 = 6 and µ1(θ∗1) = 6 with θ∗1 = √(2√3 − 3) [d1 = 1 and d2 = 1/2], thus

B12 ∼ n^{−1/4} → 0 : consistent

Under the Gaussian model, µ0 = 3 and µ2(θ2) ≥ 6 > 3 for all θ2, hence

B12 → +∞ : consistent
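As a quick numerical sanity check (assuming, as on the earlier slide, that the Gaussian fourth-moment map is µ1(θ) = 3 + θ^4 + 6θ^2), the value (θ∗)² = 2√3 − 3 indeed yields µ1(θ∗) = 6:

```python
import math

theta_star_sq = 2 * math.sqrt(3) - 3              # (theta*)^2 = 2*sqrt(3) - 3
mu1 = 3 + theta_star_sq**2 + 6 * theta_star_sq    # mu1(theta) = 3 + theta^4 + 6*theta^2
print(abs(mu1 - 6) < 1e-9)  # True: mu1(theta*) = 6, the Laplace fourth moment
```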
![Page 133: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/133.jpg)
![Page 134: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/134.jpg)
Toy example: Laplace versus Gauss [0]
When

T(y) = ( y_n^{(4)}, y_n^{(6)} )

[Boxplots of posterior probabilities under the Gauss and Laplace models for n = 100, 1000, 10000. Caption: Fourth and sixth moments]
![Page 135: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/135.jpg)
Checking for adequate statistics
Run a practical check of the relevance (or non-relevance) of Tn:
test the null hypothesis that both models are compatible with the statistic Tn,

H0 : inf{|µ2(θ2) − µ0|; θ2 ∈ Θ2} = 0

against

H1 : inf{|µ2(θ2) − µ0|; θ2 ∈ Θ2} > 0

The testing procedure provides estimates of the mean of Tn under each model and checks for equality.
![Page 136: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/136.jpg)
Checking in practice
I Under each model Mi, generate an ABC sample θi,l, l = 1, . . . , L

I For each θi,l, generate yi,l ∼ Fi,n(·|ψi,l), derive Tn(yi,l) and compute

  µ̂i = (1/L) Σ_{l=1}^L Tn(yi,l), i = 1, 2.

I Conditionally on Tn(y),

  √L ( µ̂i − E^π[µi(θi)|Tn(y)] ) ⇝ N(0, Vi)

I Test for a common mean,

  H0 : µ̂1 ∼ N(µ0, V1) , µ̂2 ∼ N(µ0, V2)

  against the alternative of different means,

  H1 : µ̂i ∼ N(µi, Vi), with µ1 ≠ µ2.
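The procedure above can be sketched as follows. This is a minimal illustration, not the talk's implementation: the Gaussian and Laplace simulators, the U[−0.1, 0.1] stand-in for the ABC posterior draws, and the plain two-sample z statistic are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, L = 1000, 500                     # data size and ABC sample size (illustrative)

def T(y):
    """Summary statistic: empirical fourth moment."""
    return np.mean(y ** 4)

# Hypothetical stand-ins for the samplers F_{i,n}(.|theta) of two models:
def sim_gauss(theta):                # M1: N(theta, 1)
    return rng.normal(theta, 1.0, size=n)

def sim_laplace(theta):              # M2: Laplace with mean theta and variance 1
    return rng.laplace(theta, 1.0 / np.sqrt(2), size=n)

def mean_T(simulate, thetas):
    """T_n evaluated on one simulated dataset per parameter draw."""
    return np.array([T(simulate(th)) for th in thetas])

# Stand-in for ABC posterior samples under each model (assumed U[-0.1, 0.1])
t1 = mean_T(sim_gauss, rng.uniform(-0.1, 0.1, L))
t2 = mean_T(sim_laplace, rng.uniform(-0.1, 0.1, L))

# Two-sample z statistic testing a common mean of T_n under both models;
# a large |z| flags the models as incompatible with this summary statistic
se = np.sqrt(t1.var(ddof=1) / L + t2.var(ddof=1) / L)
z = (t1.mean() - t2.mean()) / se
print(f"z = {z:.1f}")
```

Here the fourth moments concentrate near 3 under the Gaussian model and near 6 under the Laplace one, so the test rejects a common mean by a wide margin.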
![Page 137: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/137.jpg)
Toy example: Laplace versus Gauss
[Boxplots of the normalised χ2 statistics for the Gauss and Laplace models, without and with mad calibration. Caption: Normalised χ2 without and with mad]
![Page 138: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/138.jpg)
A population genetic illustration
Two populations (1 and 2) having diverged at a fixed, known time in the past, and a third population (3) which diverged from one of those two populations (models 1 and 2, respectively).

Observation of 50 diploid individuals per population, genotyped at 5, 50 or 100 independent microsatellite loci.

[Divergence tree diagram, model 2]
![Page 139: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/139.jpg)
A population genetic illustration
Stepwise mutation model: the number of repeats of the mutated gene increases or decreases by one. The mutation rate µ, common to all loci, is set to 0.005 (single parameter) with a uniform prior distribution

µ ∼ U[0.0001, 0.01]
![Page 140: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/140.jpg)
A population genetic illustration
Summary statistics associated with the (δµ)² distance:
xl,i,j is the repeat number of the allele at locus l = 1, . . . , L for individual i = 1, . . . , 100 within population j = 1, 2, 3. Then

(δµ)²_{j1,j2} = (1/L) Σ_{l=1}^L [ (1/100) Σ_{i1=1}^{100} x_{l,i1,j1} − (1/100) Σ_{i2=1}^{100} x_{l,i2,j2} ]²
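A direct transcription of this statistic in code (the Poisson-distributed repeat numbers are a purely illustrative stand-in for real genotypes):

```python
import numpy as np

rng = np.random.default_rng(1)
n_loci, n_ind = 5, 100               # loci and gene copies per population, as above

# x[l, i, j]: repeat number of the allele at locus l for individual i in population j
x = rng.poisson(20, size=(n_loci, n_ind, 3)).astype(float)

def delta_mu_sq(x, j1, j2):
    """(delta mu)^2 distance: mean over loci of the squared difference
    between the within-population average repeat numbers of j1 and j2."""
    diff = x[:, :, j1].mean(axis=1) - x[:, :, j2].mean(axis=1)
    return float(np.mean(diff ** 2))

print(delta_mu_sq(x, 0, 1))          # distance between populations 1 and 2
```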
![Page 141: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/141.jpg)
A population genetic illustration
For two copies of locus l with allele sizes xl,i,j1 and xl,i′,j2, with most recent common ancestor at coalescence time τj1,j2, the gene genealogy distance is 2τj1,j2, hence the number of mutations is Poisson with parameter 2µτj1,j2. Therefore,

E{ (xl,i,j1 − xl,i′,j2)² | τj1,j2 } = 2µτj1,j2

and

                 Model 1    Model 2
E{(δµ)²1,2}      2µ1t′      2µ2t′
E{(δµ)²1,3}      2µ1t       2µ2t′
E{(δµ)²2,3}      2µ1t′      2µ2t
![Page 142: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/142.jpg)
A population genetic illustration
Thus,
I Bayes factor based only on distance (δµ)²1,2 not convergent: if µ1 = µ2, same expectation

I Bayes factor based only on distance (δµ)²1,3 or (δµ)²2,3 not convergent: if µ1 = 2µ2 or 2µ1 = µ2, same expectation

I if two of the three distances are used, the Bayes factor converges: there is no (µ1, µ2) for which all expectations are equal
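The last point can be illustrated numerically (with hypothetical divergence times t = 1 and t′ = 2 and a grid over the prior range U[0.0001, 0.01]; both are assumptions of this sketch): no pair (µ1, µ2) makes the model-1 and model-2 expectations of both (δµ)²1,3 and (δµ)²2,3 coincide.

```python
import numpy as np

t, tp = 1.0, 2.0                       # divergence times t and t' (t != t'), illustrative
mus = np.linspace(0.0001, 0.01, 200)   # grid over the prior range of the mutation rate

best = np.inf
for m1 in mus:
    # Model 1 expectations of ((delta mu)^2_{1,3}, (delta mu)^2_{2,3}): (2 m1 t, 2 m1 t')
    # Model 2 expectations:                                             (2 m2 t', 2 m2 t)
    gap = np.abs(2 * m1 * t - 2 * mus * tp) + np.abs(2 * m1 * tp - 2 * mus * t)
    best = min(best, gap.min())

print(best > 0)  # True: both expectations can never be matched simultaneously
```

Matching both would require µ1 t = µ2 t′ and µ1 t′ = µ2 t at once, which forces t = t′, a contradiction.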
![Page 143: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/143.jpg)
A population genetic illustration
[Boxplots of posterior probabilities that the data is from model 1 for 5, 50 and 100 loci, based on DM2(12), DM2(13), and DM2(13) & DM2(23)]
![Page 144: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/144.jpg)
A population genetic illustration
[Boxplots of the test statistics and of the corresponding p-values, for DM2(12) and DM2(13) & DM2(23)]
![Page 145: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/145.jpg)
Embedded models
When M1 is a submodel of M2, and if the true distribution belongs to the smaller model M1, the Bayes factor is of order

vn^{−(d1−d2)}
![Page 146: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/146.jpg)
Embedded models
If the summary statistic is only informative on a parameter that is the same under both models, i.e. if d1 = d2, then

© the Bayes factor is not consistent
![Page 147: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/147.jpg)
Embedded models
Else, d1 < d2 and the Bayes factor goes to ∞ under M1. If the true distribution is not in M1, then

© the Bayes factor is consistent only if µ1 ≠ µ2 = µ0
![Page 148: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/148.jpg)
Summary
I Model selection feasible with ABC

I Choice of summary statistics is paramount

I At best, ABC → π(· | T(y)), which concentrates around µ0

I For consistent estimation: {θ; µ(θ) = µ0} = {θ0}

I For consistent testing: {µ1(θ1); θ1 ∈ Θ1} ∩ {µ2(θ2); θ2 ∈ Θ2} = ∅
![Page 149: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/149.jpg)
![Page 150: Pittsburgh and Toronto "Halloween US trip" seminars](https://reader033.vdocuments.site/reader033/viewer/2022052822/554e8030b4c90545698b5261/html5/thumbnails/150.jpg)
Conclusion/perspectives
I abc is part of a wider picture to handle complex/Big Data models, able to start from rudimentary machine-learning summaries

I many formats of empirical [likelihood] Bayes methods available

I lack of comparative tools and of an assessment of information loss