Statistical analysis of compartmental models: epidemiology, molecular biology, and
everything in between
Vladimir N. Minin
Department of Statistics, University of California, Irvine
BIRS Workshop "Challenges in the Statistical Modeling of Stochastic Processes for the Natural Sciences"
July 2017
Joint work with Jason Xu (UW → UCLA), Sam Koelle (UW), Peter Guttorp (UW), Chuanfeng Wu (NIH), Cynthia Dunbar (NIH), Jan Abkowitz (UW), Amanda Koepke (NIST), Ira Longini (UF), Betz Halloran (Fred Hutch and UW), Jon Wakefield (UW), and Jon Fintzi (UW)
Funding: NIH R01-AI107034 and NIH U54 GM111274
Hematopoiesis HMM driven by a branching process
[Diagram of the branching process: compartments HSC, Progenitor, and five mature cell types (Gr, Mo, T, B, NK); rate labels λ, ν_a, µ_a, ν_1, …, ν_5, and µ_1, …, µ_5.]
• Latent process: each barcode lineage evolves as a multi-type branching process X(t) whose components are counts of each cell type.
• Observation process: multivariate hypergeometric distribution Y ∼ mvhypgeo(X), driven by experimental design.
• Read data: read counts are proportional to Y, with an unknown amplification constant.
• Likelihood is intractable due to the massive size of the state space of the branching process.
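The observation step can be illustrated with a minimal sketch. It assumes a single barcode lineage whose latent counts are known and a blood sample of fixed size drawn without replacement. The counts, the sample size, and the variable names are made up for illustration and do not come from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical latent counts X(t) for one barcode lineage, one entry per
# mature cell type (Gr, Mo, T, B, NK). The numbers are illustrative only.
latent_counts = np.array([500, 300, 120, 60, 20])

# Hypothetical number of cells captured in one sample.
sample_size = 100

# Sampling without replacement from the latent population gives a
# multivariate hypergeometric draw Y given X.
observed = rng.multivariate_hypergeometric(latent_counts, sample_size)
print(dict(zip(["Gr", "Mo", "T", "B", "NK"], observed.tolist())))
```

In the real experiment the read counts are further scaled by an unknown amplification constant, which this sketch leaves out.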
Systems (molecular) biology
Figure 2: Models and data. From "A framework for parameter estimation and model selection from experimental data in systems biology using approximate Bayesian computation", Juliane Liepe, Paul Kirk, Sarah Filippi, Tina Toni, Chris P Barnes & Michael P H Stumpf, Nature Protocols 9, 439–456 (2014), doi:10.1038/nprot.2014.025. Published online 23 January 2014.
(a–e) The full mRNA self-regulation model is shown in a. mRNA (m) produces protein P1, which can be transformed into protein P2. P1 is required to produce mRNA, whereas P2 degrades mRNA into the empty set (–). P1 and P2 can also be degraded. The reactions that occur according to this model are shown in b. Fitting of the model to the data (c), which comprise mRNA measurements over time. The second model (d) is based on the first model, but it does not contain protein P2. The relevant reactions are shown in e.
Case study: cholera in Bangladesh
Goal: To understand the dynamics of endemic cholera in Bangladesh and to develop a model that will be able to predict outbreaks several weeks in advance.
• Specifically, to understand how the disease dynamics are related to environmental covariates.
[Figure: study sites Bakerganj, Chhatak, Chaugachha, Matlab, and Mathbaria.]
Mathbaria cholera incidence data and covariates
[Figure: time series from January 2004 to July 2013. Axes: number of cholera cases in Mathbaria (0 to 10) and standardized covariates (−2 to 3). Series: case counts, water depth, water temperature.]
• Covariates are the smoothed standardized daily values $C_{WT}(i) = (WT(i) - \overline{WT})/s_{WT}$, $C_{WD}(i) = (WD(i) - \overline{WD})/s_{WD}$.
• We could use a phenomenological model (e.g., Poisson regression), but this would not tell us anything about the number of infected people or the disease transmission parameters.
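The standardization of a covariate can be sketched as follows. The water temperature readings are made up, the smoothing step is omitted, and using the sample standard deviation (`ddof=1`) is an assumption, since the slide does not say which estimator was used.

```python
import numpy as np

def standardize(values):
    """Center a covariate at its mean and scale by its sample standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std(ddof=1)

# Hypothetical daily water temperature readings (degrees Celsius).
water_temp = np.array([24.1, 25.3, 27.8, 29.0, 28.4, 26.2])
c_wt = standardize(water_temp)
print(np.round(c_wt, 2))
```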
Deterministic SIR-like models
Easy system of nonlinear differential equations. Models mean behavior of a stochastic system.
Problem: Poorly mimics stochastic dynamics when one of the compartments is low. Not clear how to model noisy data.
Solution: Use a real stochastic model (e.g., continuous-time Markov chain or SDE).
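To make the deterministic option concrete, here is a minimal sketch of an SIRS-type system of differential equations integrated with a classical fourth-order Runge-Kutta scheme. The specific right-hand side, the constant environmental term `alpha`, and all parameter values are assumptions chosen for illustration, not the model fitted in the talk.

```python
import numpy as np

def sirs_rhs(state, beta, gamma, mu, alpha):
    """Right-hand side of an assumed SIRS system of ODEs."""
    s, i, r = state
    infection = (beta * i + alpha) * s   # new infections per unit time
    recovery = gamma * i                 # recoveries per unit time
    waning = mu * r                      # loss of immunity per unit time
    return np.array([-infection + waning, infection - recovery, recovery - waning])

def solve_sirs(initial, beta, gamma, mu, alpha, t_end, dt=0.05):
    """Integrate the ODE system with fixed-step RK4 and return the whole path."""
    n_steps = int(round(t_end / dt))
    path = np.empty((n_steps + 1, 3))
    path[0] = initial
    for k in range(n_steps):
        x = path[k]
        k1 = sirs_rhs(x, beta, gamma, mu, alpha)
        k2 = sirs_rhs(x + 0.5 * dt * k1, beta, gamma, mu, alpha)
        k3 = sirs_rhs(x + 0.5 * dt * k2, beta, gamma, mu, alpha)
        k4 = sirs_rhs(x + dt * k3, beta, gamma, mu, alpha)
        path[k + 1] = x + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return path

# Illustrative values only: N = 10,000 people, 10 initially infectious.
n_pop = 10_000
path = solve_sirs([n_pop - 10, 10, 0], beta=0.5 / n_pop, gamma=0.1,
                  mu=0.01, alpha=1e-4, t_end=200.0)
print("peak number infectious:", round(path[:, 1].max()))
```

The solution is a smooth curve of real-valued compartment sizes, so it cannot represent chance extinction when only a handful of individuals are infectious.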
SIRS model
[Diagram: from state (S, I, R), the chain can move to (S−1, I+1, R), (S, I−1, R+1), or (S+1, I, R−1).]
• S, I, and R denote the numbers of susceptible, infectious, and recovered individuals at time t; N = S + I + R.
• β represents the infectious contact rate, $\alpha(t) = \alpha_{A_i} = \exp[\alpha_0 + \alpha_1 C_{WD}(i-\kappa) + \alpha_2 C_{WT}(i-\kappa)]$ represents the time-varying environmental force of infection, γ is the recovery rate, and µ is the rate at which immunity is lost.
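A stochastic SIRS trajectory can be simulated with the Gillespie algorithm. The sketch below assumes the jump rates (βI + α)S for infection, γI for recovery, and µR for loss of immunity. It also holds α constant, whereas the slide lets it vary with the covariates. All numerical values are made up.

```python
import numpy as np

def simulate_sirs(s, i, r, beta, gamma, mu, alpha, t_end, rng):
    """Simulate one SIRS path as a continuous-time Markov chain (Gillespie)."""
    t = 0.0
    times, states = [t], [(s, i, r)]
    while True:
        # Assumed jump rates: infection, recovery, loss of immunity.
        rates = np.array([(beta * i + alpha) * s, gamma * i, mu * r])
        total = rates.sum()
        if total == 0.0:
            break
        t += rng.exponential(1.0 / total)   # waiting time to the next event
        if t > t_end:
            break
        event = rng.choice(3, p=rates / total)
        if event == 0:
            s, i = s - 1, i + 1             # (S-1, I+1, R)
        elif event == 1:
            i, r = i - 1, r + 1             # (S, I-1, R+1)
        else:
            r, s = r - 1, s + 1             # (S+1, I, R-1)
        times.append(t)
        states.append((s, i, r))
    return np.array(times), np.array(states)

rng = np.random.default_rng(42)
times, states = simulate_sirs(s=195, i=5, r=0, beta=0.5 / 200, gamma=0.1,
                              mu=0.01, alpha=1e-4, t_end=100.0, rng=rng)
print("number of events:", len(times) - 1)
```

Each event changes exactly two compartments by one individual, so N = S + I + R stays fixed along the path.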
Hidden SIRS model
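[Diagram: hidden Markov model. The latent states X_{t_0}, X_{t_1}, …, X_{t_n}, with X_{t_i} = (S_{t_i}, I_{t_i}, R_{t_i}), form a Markov chain; each observation y_{t_i} depends on X_{t_i}.]
• X_t is only indirectly observed through y_t ⇒ hidden Markov model (HMM)
• y_t ∼ Binomial(I_t, ρ)
• ρ depends on the number of symptomatic infectious individuals that seek treatment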
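We are interested in the posterior distribution

$$\Pr(\theta \mid y) \propto \Pr(y \mid \theta)\,\Pr(\theta),$$

where $y = (y_{t_0}, \dots, y_{t_n})$, $\theta = (\beta, \gamma, \rho, \alpha_0, \dots, \alpha_k)$, and

$$\Pr(y \mid \theta) = \sum_{X} \left( \prod_{i=0}^{n} \Pr(y_{t_i} \mid I_{t_i}, \rho) \left[ \Pr(X_{t_0} \mid \phi_S, \phi_I) \prod_{i=1}^{n} p(X_{t_i} \mid X_{t_{i-1}}, \theta) \right] \right).$$

Problem: This likelihood is intractable, because the state space of X_t is too large even for moderately high population size N.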
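A quick count shows how fast the state space grows. The sketch assumes a closed population, so a state is any way of splitting N individuals among the compartments, and it uses N = 10,000, the population size quoted later in the talk. The function name is hypothetical.

```python
from math import comb

def n_states(population_size, n_compartments=3):
    """Number of ways to split a closed population among the compartments."""
    # Stars and bars: non-negative integer solutions of S + I + R = N.
    return comb(population_size + n_compartments - 1, n_compartments - 1)

small = n_states(100)
large = n_states(10_000)
print(small, large)

# A dense transition matrix over these states would need large**2 entries.
print(f"{large ** 2:.2e} matrix entries")
```

With roughly fifty million states for N = 10,000, the sum over all latent paths in the likelihood cannot be evaluated by standard HMM forward recursions on a dense transition matrix.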
Estimation results
Posterior medians and 95% CIs for the parameters of the SIRS model.

Coefficient    Estimate    95% CI
β × N          0.491       (0.103, 0.945)
γ              0.115       (0.096, 0.142)
(β × N)/γ      4.35        (0.99, 7.15)
α0             -5.32       (-6.63, -4.51)
α1             -1.37       (-1.98, -0.98)
α2             2.18        (1.8, 2.62)
ρ × N          55.8        (43.4, 73.5)

Recall:
• N is the population size (set artificially to 10,000).
• β is the infectious contact rate.
• γ is the recovery rate.
• ρ is the reporting rate.
• $\alpha(t) = \alpha_{A_i} = \exp[\alpha_0 + \alpha_1 C_{WD}(i-\kappa) + \alpha_2 C_{WT}(i-\kappa)]$
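As a rough reading aid, the posterior medians can be plugged into the formula for α(t). This is a sketch only: plugging in medians ignores posterior uncertainty and the correlation between parameters, and the function and variable names are hypothetical.

```python
import math

# Posterior medians from the table above.
alpha0, alpha1, alpha2 = -5.32, -1.37, 2.18
beta_times_n, gamma = 0.491, 0.115

def alpha(c_wd, c_wt):
    """Environmental force of infection at given lagged standardized covariates."""
    return math.exp(alpha0 + alpha1 * c_wd + alpha2 * c_wt)

baseline = alpha(0.0, 0.0)                    # both covariates at their means
warm_multiplier = alpha(0.0, 1.0) / baseline  # +1 SD water temperature, about 8.8
deep_multiplier = alpha(1.0, 0.0) / baseline  # +1 SD water depth, about 0.25

# The ratio of the two medians is about 4.27, not the reported 4.35, because
# the median of a ratio need not equal the ratio of the medians.
ratio_of_medians = beta_times_n / gamma

print(baseline, warm_multiplier, deep_multiplier, ratio_of_medians)
```

Under these plug-in values, warmer water raises the environmental force of infection and deeper water lowers it.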
Prediction results
[Figure: predictions from March 2012 to May 2013. First panel: case counts (0 to 9) and posterior probability (0 to 1). Second panel: susceptible fraction (0 to 0.15) and infected fraction (0 to 0.06) against time in days.]
What models cause all this trouble?
• Too broad of an answer: partially observed stochastic processes
• Too narrow of an answer: intractable HMMs
• Mouthful, but good enough answer: Markov compartmental models in which a subset of compartments is observed over time with noise
Statistical inference options
• "Exact" likelihood-based methods:
  – Bayesian inference with particle MCMC (Koepke et al.)
  – Maximum likelihood inference with particle MCMC (Ionides et al.)
• "Exact" summary statistics based methods:
  – Approximate Bayesian Computation (Liepe et al.)
  – Quasi- and pseudo-maximum likelihood estimators and generalized methods of moments (Chen and Hyrien, Xu et al.)
• Likelihood approximations:
  – Trajectory matching with iid normal error
  – Pretending some compartments change slowly (diffusion, branching processes)
  – Emulating with GPs to reduce costly likelihood evaluations (Jandarov et al.)
  – Linear noise approximation: makes Markov transition densities Gaussian, with complicated mean and covariance functions.
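Particle MCMC rests on a particle filter estimate of Pr(y | θ). The sketch below is a bootstrap particle filter for a hidden SIRS model with Binomial(I_t, ρ) observations. It is not the implementation from the cited papers: the chain-binomial `step` function is an assumed discrete-time stand-in for the continuous-time chain, the initial state is fixed rather than drawn from a prior, and all names and values are hypothetical.

```python
import math
import numpy as np

def step(states, beta, gamma, mu, alpha, rng):
    """Advance every particle one time unit (assumed chain-binomial approximation)."""
    s, i, r = states[:, 0], states[:, 1], states[:, 2]
    new_inf = rng.binomial(s, 1.0 - np.exp(-(beta * i + alpha)))
    new_rec = rng.binomial(i, 1.0 - math.exp(-gamma))
    new_sus = rng.binomial(r, 1.0 - math.exp(-mu))
    return np.column_stack([s - new_inf + new_sus,
                            i + new_inf - new_rec,
                            r + new_rec - new_sus])

def log_binom_pmf(y, n, rho):
    """Log of the Binomial(n, rho) probability of observing y."""
    if y > n:
        return -math.inf
    return (math.lgamma(n + 1) - math.lgamma(y + 1) - math.lgamma(n - y + 1)
            + y * math.log(rho) + (n - y) * math.log1p(-rho))

def particle_loglik(y, init_state, theta, n_particles, rng):
    """Bootstrap particle filter estimate of log Pr(y | theta)."""
    beta, gamma, mu, alpha, rho = theta
    particles = np.tile(np.asarray(init_state), (n_particles, 1))
    loglik = 0.0
    for t, y_t in enumerate(y):
        if t > 0:
            particles = step(particles, beta, gamma, mu, alpha, rng)
        # Weight each particle by the observation density y_t | I_t.
        logw = np.array([log_binom_pmf(y_t, int(i), rho) for i in particles[:, 1]])
        max_logw = logw.max()
        if not np.isfinite(max_logw):
            return -math.inf
        w = np.exp(logw - max_logw)
        loglik += max_logw + math.log(w.mean())
        # Multinomial resampling proportional to the weights.
        idx = rng.choice(n_particles, size=n_particles, p=w / w.sum())
        particles = particles[idx]
    return loglik

# Simulate a short made-up data set from the same model, then evaluate it.
rng = np.random.default_rng(1)
theta = (0.5 / 1000, 0.1, 0.01, 1e-4, 0.3)   # beta, gamma, mu, alpha, rho
init = (980, 20, 0)
truth = np.array([init])
y = []
for _ in range(15):
    y.append(int(rng.binomial(truth[0, 1], theta[4])))
    truth = step(truth, *theta[:4], rng)
loglik = particle_loglik(y, init, theta, n_particles=300, rng=rng)
print("estimated log-likelihood:", loglik)
```

A particle MCMC sampler would call `particle_loglik` inside a Metropolis-Hastings acceptance ratio at each proposed θ, which is where most of the computational cost goes.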
Open problems and glimpses of possible solutions
• Open problem:
  – Computationally stable inference procedure that can handle at least 10-100 compartments
• Current lines of attack:
  – Ho et al. use clever re-parameterization to develop a new deterministic algorithm for computing SIR (and more complex) transition densities "exactly." Doesn't solve all the problems, because not all compartments are observed perfectly.
  – Approximate Bayesian computation and/or indirect inference. But model comparison is tough!
  – New data augmentation methods with block updates of latent variables.
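For orientation, the approximate Bayesian computation route can be sketched with plain rejection sampling, its simplest variant and not necessarily the scheme used in the work cited in this talk. The chain-binomial simulator, the uniform prior on β × N, the two summary statistics, and the tolerance are all assumptions made for illustration.

```python
import numpy as np

def simulate_cases(beta_n, gamma, rho, n_pop, i0, n_steps, rng):
    """Simulate reported case counts from an assumed chain-binomial SIR model."""
    s, i = n_pop - i0, i0
    cases = np.empty(n_steps, dtype=int)
    for t in range(n_steps):
        cases[t] = rng.binomial(i, rho)   # reported cases y_t given I_t
        new_inf = rng.binomial(s, 1.0 - np.exp(-beta_n * i / n_pop))
        new_rec = rng.binomial(i, 1.0 - np.exp(-gamma))
        s, i = s - new_inf, i + new_inf - new_rec
    return cases

def summaries(cases):
    """Hypothetical summary statistics: total and peak reported cases."""
    return np.array([cases.sum(), cases.max()], dtype=float)

def abc_rejection(observed, n_draws, tolerance, rng, **fixed):
    """Keep prior draws whose simulated summaries land close to the observed ones."""
    target = summaries(observed)
    scale = np.maximum(target, 1.0)
    accepted = []
    for _ in range(n_draws):
        beta_n = rng.uniform(0.0, 1.0)    # assumed prior on beta x N
        sim = simulate_cases(beta_n=beta_n, rng=rng, **fixed)
        distance = np.max(np.abs(summaries(sim) - target) / scale)
        if distance <= tolerance:
            accepted.append(beta_n)
    return np.array(accepted)

rng = np.random.default_rng(7)
fixed = dict(gamma=0.1, rho=0.3, n_pop=1000, i0=10, n_steps=40)
observed = simulate_cases(beta_n=0.5, rng=rng, **fixed)   # made-up "data"
accepted = abc_rejection(observed, n_draws=2000, tolerance=0.2, rng=rng, **fixed)
print("accepted draws:", accepted.size)
```

The accepted draws approximate the posterior of β × N given the summaries rather than the full data, which is one reason model comparison with this approach is delicate.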
References
• Xu J, Koelle S, Guttorp P, Wu C, Dunbar CE, Abkowitz JL, Minin VN. Statistical inference in partially observed stochastic compartmental models with application to cell lineage tracking of in vivo hematopoiesis, https://arxiv.org/abs/1610.07550.
• Koepke AA, Longini IM, Halloran ME, Wakefield J, Minin VN. Predictive modeling of cholera outbreaks in Bangladesh, Annals of Applied Statistics, 10, 575–595, 2016, arXiv:1402.0536 [stat.AP].
• Fintzi J, Wakefield J, Minin VN. Efficient data augmentation for fitting stochastic epidemic models to prevalence data, arXiv:1606.07995 [stat.CO].
• Ho LST, Xu J, Crawford FW, Minin VN, Suchard MA. Birth(death)/birth-death processes and their computable transition probabilities with statistical applications, arXiv:1603.03819 [stat.CO].
• http://www.stat.osu.edu/ pfc/BIRS/