The Bayesian approach to inverse problems
Youssef Marzouk
Department of Aeronautics and Astronautics
Center for Computational Engineering
Massachusetts Institute of Technology
[email protected], http://uqgroup.mit.edu
7 July 2015
Marzouk (MIT) ICERM IdeaLab 7 July 2015 1 / 29
Statistical inference
Why is a statistical perspective useful in inverse problems?
To characterize uncertainty in the inverse solution
To understand how this uncertainty depends on the number and
quality of observations, features of the forward model, prior
information, etc.
To make probabilistic predictions
To choose “good” observations or experiments
To address questions of model error, model validity, and model
selection
Bayesian inference
Bayes’ rule
p(θ|y) = p(y|θ) p(θ) / p(y)
Key idea: model parameters θ are treated as random variables
(For simplicity, we let our random variables have densities)
Notation
θ are model parameters; y are the data; assume both to be
finite-dimensional unless otherwise indicated
p(θ) is the prior probability density
L(θ) ≡ p(y |θ) is the likelihood function
p(θ|y) is the posterior probability density
p(y) is the evidence, or equivalently, the marginal likelihood
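As a minimal numerical illustration (an addition, not from the slides): Bayes' rule applied on a grid for a scalar θ, with a Gaussian prior, one noisy observation, and illustrative values throughout.

```python
import numpy as np

# Grid over a scalar parameter theta
theta = np.linspace(-6.0, 6.0, 4001)
dtheta = theta[1] - theta[0]

# Prior: theta ~ N(0, 1)
prior = np.exp(-0.5 * theta**2) / np.sqrt(2.0 * np.pi)

# Likelihood: one observation y = theta + eps, eps ~ N(0, sigma^2)
y, sigma = 1.3, 0.5
likelihood = np.exp(-0.5 * ((y - theta) / sigma) ** 2)

# Bayes' rule: posterior = likelihood * prior / evidence
unnorm = likelihood * prior
evidence = unnorm.sum() * dtheta          # quadrature approximation of p(y)
posterior = unnorm / evidence

# Conjugacy check: the posterior is N(y / (1 + sigma^2), sigma^2 / (1 + sigma^2))
mean_grid = (theta * posterior).sum() * dtheta
mean_exact = y / (1.0 + sigma**2)
```

Note that the evidence p(y) enters only as the normalizing constant; this is why many sampling methods need the posterior only up to proportionality.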
Bayesian inference
Summaries of the posterior distribution
What information to extract?
Posterior mean of θ; maximum a posteriori (MAP) estimate of θ
Posterior covariance or higher moments of θ
Quantiles
Credible intervals: C(y) such that P [θ ∈ C(y) | y] = 1 − α.
Credible intervals are not uniquely defined by this condition; thus consider, for
example, the HPD (highest posterior density) region.
Posterior realizations: for direct assessment, or to estimate posterior
predictions or other posterior expectations
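To make these summaries concrete, here is a small sketch extracting them from a set of posterior realizations (synthetic draws from a known Gaussian, standing in for MCMC output; all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for posterior realizations; here drawn from a known N(2, 0.7^2)
samples = rng.normal(loc=2.0, scale=0.7, size=200_000)

# Moments
post_mean = samples.mean()
post_var = samples.var()

# Equal-tailed (1 - alpha) credible interval: P[theta in C(y) | y] = 1 - alpha
alpha = 0.05
lo, hi = np.quantile(samples, [alpha / 2, 1 - alpha / 2])

# A posterior expectation, e.g. the posterior prediction of g(theta) = theta^2
pred = np.mean(samples**2)
```

For a Gaussian posterior the equal-tailed interval coincides with the HPD region; for skewed or multimodal posteriors the two differ, which is why the HPD region is singled out above.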
Bayesian and frequentist statistics
Understanding both perspectives is useful and important. . .
Key differences between these two statistical paradigms
Frequentists do not assign probabilities to unknown parameters θ.
One can write likelihoods pθ(y) ≡ p(y |θ) but not priors p(θ) or
posteriors. θ is not a random variable.
In the frequentist viewpoint, there is no single preferred
methodology for inverting the relationship between parameters and
data. Instead, consider various estimators θ̂(y) of θ.
The estimator θ̂ is a random variable. Why? The frequentist paradigm
considers y to result from a random and repeatable experiment.
Bayesian and frequentist statistics
Key differences (continued)
Evaluate the quality of θ̂ through various criteria: bias, variance,
mean-square error, consistency, efficiency, . . .
One common estimator is maximum likelihood:
θ̂_ML = argmax_θ p(y|θ). Here p(y|θ) defines a family of distributions
indexed by θ.
Link to Bayesian approach: MAP estimate maximizes a “penalized
likelihood.”
What about Bayesian versus frequentist prediction of ynew ⊥ y | θ?
Frequentist: “plug-in” or other estimators of y_new
Bayesian: posterior prediction via integration
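A sketch of this contrast for a conjugate scalar Gaussian model (all values are illustrative assumptions): the plug-in predictive fixes θ at θ̂_ML, while the Bayesian predictive integrates over the posterior and so carries extra variance.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, tau = 1.0, 2.0                  # known noise std; prior theta ~ N(0, tau^2)
y = rng.normal(0.5, sigma, size=5)     # a few observations of theta + noise
n = y.size

# Frequentist plug-in prediction: y_new ~ N(theta_hat_ML, sigma^2)
theta_ml = y.mean()
plugin_var = sigma**2

# Bayesian: posterior N(mu_pos, s2_pos); predictive y_new ~ N(mu_pos, sigma^2 + s2_pos)
s2_pos = 1.0 / (n / sigma**2 + 1.0 / tau**2)
mu_pos = s2_pos * y.sum() / sigma**2
predictive_var = sigma**2 + s2_pos
```

The Bayesian predictive variance exceeds the plug-in variance by exactly s2_pos, the posterior uncertainty in θ that the plug-in approach ignores.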
Bayesian inference
Likelihood functions
In general, p(y |θ) is a probabilistic model for the data
In the inverse problem or parameter estimation context, the
likelihood function is where the forward model appears, along with a
noise model and (if applicable) an expression for model discrepancy
Contrasting example (but not really!): parametric density
estimation, where the likelihood function results from the probability
density itself.
Selected examples of likelihood functions:
1 Bayesian linear regression
2 Nonlinear forward model g(θ) with additive Gaussian noise
3 Nonlinear forward model with noise + model discrepancy
Bayesian inference
Prior distributions
In ill-posed parameter estimation problems, e.g., inverse problems,
prior information plays a key role
Intuitive idea: assign lower probability to values of θ that you don’t
expect to see, higher probability to values of θ that you do expect to
see
Examples:
1 Gaussian processes with specified covariance kernel
2 Gaussian Markov random fields
3 Gaussian priors derived from differential operators
4 Hierarchical priors
5 Besov space priors
6 Higher-level representations (objects, marked point processes)
Gaussian process priors
Key idea: any finite-dimensional distribution of the stochastic
process θ(x, ω) : D × Ω→ R is multivariate normal.
In other words: θ(x, ω) is a collection of jointly Gaussian random
variables, indexed by x
Specify via mean function and covariance function:
E[θ(x)] = µ(x)
E[(θ(x) − µ(x)) (θ(x′) − µ(x′))] = C(x, x′)
Smoothness of the process is controlled by the behavior of the covariance
function as x′ → x
Restrictions: stationarity, isotropy, . . .
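A sketch of the finite-dimensional view (kernel choice and parameters are illustrative assumptions): evaluate a squared-exponential covariance function on a grid and draw a sample path from the resulting multivariate normal.

```python
import numpy as np

# Grid on D = [0, 1] and a squared-exponential covariance function
x = np.linspace(0.0, 1.0, 200)
ell, var = 0.2, 1.0                                    # correlation length, variance
C = var * np.exp(-0.5 * (x[:, None] - x[None, :])**2 / ell**2)

# Any finite set of grid values of the process is jointly Gaussian:
# sample via a Cholesky factor (small jitter keeps C numerically SPD)
rng = np.random.default_rng(2)
L = np.linalg.cholesky(C + 1e-8 * np.eye(x.size))
theta = L @ rng.standard_normal(x.size)                # one realization of theta(x)
```

A shorter correlation length ell produces rougher sample paths, which is the smoothness control mentioned above.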
Gaussian process priors
Example: stationary Gaussian random fields
Prior is a stationary Gaussian random field; the figure contrasts realizations
drawn with an exponential covariance kernel and with a Gaussian covariance
kernel, represented as
θ(x, ω) = µ(x) + Σ_{i=1}^{K} c_i(ω) φ_i(x)   (Karhunen–Loève expansion)
Both are θ(x, ω) : D × Ω→ R, with D = [0, 1]2.
Gaussian Markov random fields
Key idea: discretize space and specify a sparse inverse covariance
(“precision”) matrix W
p(θ) ∝ exp( − (1/2γ) θᵀWθ ),
where γ controls the scale
Full conditionals p(θi |θ∼i) are available analytically and may simplify
dramatically. Represent as an undirected graphical model
Example: E [θi |θ∼i ] is just an average of site i ’s nearest neighbors
Quite flexible; even used to simulate textures
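A minimal 1-D sketch (the chain graph, the unit scale γ = 1, and the small diagonal shift are assumptions made for illustration): with a tridiagonal precision matrix, the full conditional mean of an interior site is essentially the average of its two neighbours.

```python
import numpy as np

# 1-D GMRF on a chain: sparse tridiagonal precision matrix W
# density (with gamma = 1): p(theta) ∝ exp(-theta^T W theta / 2)
n = 100
W = (np.diag(2.0 * np.ones(n))
     + np.diag(-np.ones(n - 1), 1)
     + np.diag(-np.ones(n - 1), -1)
     + 1e-4 * np.eye(n))        # small shift so W is strictly positive definite

# Full conditional mean: E[theta_i | theta_{-i}] = -(1/W_ii) sum_{j != i} W_ij theta_j
rng = np.random.default_rng(3)
theta = rng.standard_normal(n)
i = 50
cond_mean = -(W[i, :i] @ theta[:i] + W[i, i + 1:] @ theta[i + 1:]) / W[i, i]
```

In practice W would be stored sparsely (e.g. scipy.sparse), which is what makes GMRFs attractive computationally.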
Priors through differential operators
Key idea: return to infinite-dimensional setting; again penalize
roughness in θ(x)
Stuart 2010: define the prior using fractional negative powers of the
Laplacian A = −∆:
θ ∼ N(θ₀, βA^{−α})
Sufficiently large α (α > d/2), along with conditions on the
likelihood, ensures that the posterior measure is well defined
GPs, GMRFs, and SPDEs
In fact, all three “types” of Gaussian priors just described are closely
connected.
Linear fractional SPDE:
(κ² − ∆)^{β/2} θ(x) = W(x),   x ∈ R^d,   β = ν + d/2,   κ > 0,   ν > 0
Then θ(x) is a Gaussian field with Matérn covariance:
C(x, x′) = σ² / (2^{ν−1} Γ(ν)) (κ‖x − x′‖)^ν K_ν(κ‖x − x′‖)
The covariance kernel is the Green's function of the differential operator:
(κ² − ∆)^β C(x, x′) = δ(x − x′)
ν = 1/2 is equivalent to the exponential covariance; ν → ∞ is equivalent to
the squared exponential covariance
Can construct a discrete GMRF that approximates the solution of the SPDE
(see Lindgren, Rue & Lindström, JRSSB 2011)
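The ν = 1/2 special case can be checked numerically; a sketch using scipy's modified Bessel function K_ν (parameter values are arbitrary):

```python
import numpy as np
from scipy.special import gamma as Gamma, kv

def matern(r, nu, kappa, sigma2=1.0):
    """Matern covariance: sigma2 / (2^(nu-1) Gamma(nu)) * (kappa r)^nu * K_nu(kappa r)."""
    r = np.asarray(r, dtype=float)
    return (sigma2 / (2.0 ** (nu - 1.0) * Gamma(nu))
            * (kappa * r) ** nu * kv(nu, kappa * r))

kappa = 2.0
r = np.linspace(0.01, 3.0, 100)

# nu = 1/2 should reduce to the exponential kernel sigma2 * exp(-kappa * r)
approx = matern(r, 0.5, kappa)
exact = np.exp(-kappa * r)
```

This works because K_{1/2}(z) = sqrt(π/(2z)) e^{−z}, which cancels the prefactor exactly.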
Hierarchical Gaussian priors
[Figure 1: Three realizations drawn from the prior (6) with constant variance
θ_j = θ₀ (left), and from the corresponding prior where the variance is 100-fold
at two points indicated by arrows (right).]
where X and W are the n-variate random variables with components X_j and W_j,
respectively, L is the n × n lower bidiagonal matrix with ones on the diagonal
and −1 on the subdiagonal, and D = diag(θ₁, θ₂, . . . , θₙ). (5)
Since W is a standard normal random variable, relation (4) allows us to write
the (prior) probability density of X as
π_prior(x) ∝ exp( −½ ‖D^{−1/2} L x‖² ). (6)
Not surprisingly, the first-order autoregressive Markov model leads to a
first-order smoothness prior for the variable X. The variance vector θ expresses
the expected variability of the signal over the support interval, and provides a
handle to control the qualitative behavior of the signal. Assume, for example,
that we set θ_j = θ₀ = const., 1 ≤ j ≤ n, leading to a homogeneous smoothness
over the support interval. By changing some of the components, e.g., setting
θ_k = θ_ℓ = 100 θ₀ for some k, ℓ, we expect the signal to have jumps of standard
deviation √θ_k = √θ_ℓ = 10 √θ₀ at the grid intervals [t_{k−1}, t_k] and
[t_{ℓ−1}, t_ℓ]. This is illustrated in figure 1, where we show some random draws
from the prior. It is important to note that the higher values of the θ_j do not
force jumps, but simply make them more likely to occur by increasing the local
variance.
This observation suggests that when the number, location and expected amplitudes
of the jumps are known, that is, when the prior information is quantitative, the
first-order Markov model provides the means to encode the available information
into the prior. Suppose now that the only available information about the
solution of the inverse problem is qualitative: jumps may occur, but there is no
information about how many, where, and how large. Adhering to the Bayesian
paradigm, we express this lack of quantitative information by modeling the
variance of the Markov process as a random variable. The estimation of the
variance vector thus becomes part of the inverse problem.
Calvetti & Somersalo, Inverse Problems 24 (2008) 034013.
Hierarchical Gaussian priors
[Figure 4: Approximation of the MAP estimate of the image (top row) and of the
variance (bottom row) after 1, 3 and 5 iterations of the cyclic algorithm, when
using the GMRES method to compute the update of the image at each iteration
step.]
[Figure 5: Approximation of the MAP estimate of the image (top row) and of the
variance (bottom row) after 1, 3 and 7 iterations of the cyclic algorithm, when
using the CGLS method to compute the update of the image at each iteration
step.]
The graphs displayed in figure 6 refer to the CGLS iteration with an inverse
gamma hyperprior. The value of the objective function levels off after five
iterations, and this could be the basis for a stopping criterion. Note that
after seven iterations, the norm of the estimation error starts to grow again,
typical of algorithms which exhibit semi-convergence. The speckling phenomenon,
by which individual pixel values close to the discontinuity start to diverge, is
partly responsible for the growth of the error. This suggests that the
iterations should be stopped soon after the settling of the objective function.
The fact that the norm of the derivative is already small at the end of the
first iterations indicates that the sequential iteration indeed finds a good
approximation to a minimizer.
Calvetti & Somersalo, Inverse Problems 24 (2008) 034013.
Non-Gaussian priors
Besov space B^s_pq(T):

θ(x) = c₀ + Σ_{j=0}^{∞} Σ_{h=0}^{2^j − 1} w_{j,h} ψ_{j,h}(x)

and

‖θ‖_{B^s_pq(T)} := ( |c₀|^q + Σ_{j=0}^{∞} 2^{jq(s + 1/2 − 1/p)} ( Σ_{h=0}^{2^j − 1} |w_{j,h}|^p )^{q/p} )^{1/q} < ∞.

Consider p = q = s = 1:

‖θ‖_{B^1_11(T)} = |c₀| + Σ_{j=0}^{∞} Σ_{h=0}^{2^j − 1} 2^{j/2} |w_{j,h}|.

Then the distribution of θ is a Besov prior if αc₀ and α2^{j/2}w_{j,h} are
independent and Laplace(1).

Loosely, π(θ) ∝ exp( −α ‖θ‖_{B^1_11(T)} ).
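A sketch of sampling from such a B^1_11 prior on T = [0, 1), truncated at a finite level J and using Haar wavelets for the ψ_{j,h}; the wavelet family, α, and J are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

def haar(j, h, x):
    """Orthonormal Haar wavelet psi_{j,h}(x) = 2^{j/2} psi(2^j x - h) on [0, 1)."""
    t = 2.0**j * x - h
    return 2.0**(j / 2.0) * (((0.0 <= t) & (t < 0.5)).astype(float)
                             - ((0.5 <= t) & (t < 1.0)).astype(float))

alpha, J = 10.0, 6                         # inverse scale and truncation level
x = np.linspace(0.0, 1.0, 512, endpoint=False)

# alpha*c0 and alpha*2^{j/2}*w_{j,h} i.i.d. Laplace(1) means
# c0 ~ Laplace(scale 1/alpha) and w_{j,h} ~ Laplace(scale 1/(alpha*2^{j/2}))
theta = np.full_like(x, rng.laplace(scale=1.0 / alpha))   # c0 term
for j in range(J):
    for h in range(2**j):
        w = rng.laplace(scale=1.0 / (alpha * 2.0**(j / 2.0)))
        theta += w * haar(j, h, x)
```

Unlike Gaussian priors, the heavier-tailed Laplace coefficients make occasional large jumps in θ(x) much more probable, which is the edge-preserving behavior these priors are used for.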
Higher-level representations
Marked point processes, and more:
Rue & Hurn, Biometrika 86 (1999) 649–660.
Bayesian inference
Hierarchical modeling
One of the key flexibilities of the Bayesian construction!
Hierarchical modeling has important implications for the design of
efficient MCMC samplers (later in the lecture)
Examples:
1 Unknown noise variance
2 Unknown variance of a Gaussian process prior (cf. choosing the
regularization parameter)
3 Many more, as dictated by the physical models at hand
Example: prior variance hyperparameter in an inverse
diffusion problem
[Figure: Posterior marginal density of the variance hyperparameter θ, versus
quality of data, contrasted with its hyperprior density. Curves: hyperprior;
posterior with ς = 10⁻¹, 13 sensors; posterior with ς = 10⁻², 25 sensors.
“Regularization” ∝ ς²/θ.]
The linear Gaussian model
A key building-block problem:
Parameters θ ∈ Rn, observations y ∈ Rm
Forward model f (θ) = Gθ, where G ∈ Rm×n
Additive noise yields observations: y = Gθ + ε
ε ∼ N(0, Γobs) and is independent of θ
Endow θ with a Gaussian prior, θ ∼ N(0, Γpr ).
Posterior probability density:

p(θ|y) ∝ p(y|θ) p(θ) = L(θ) p(θ)
       = exp( −½ (y − Gθ)ᵀ Γ_obs⁻¹ (y − Gθ) ) · exp( −½ θᵀ Γ_pr⁻¹ θ )
       = exp( −½ (θ − µ_pos)ᵀ Γ_pos⁻¹ (θ − µ_pos) )
The linear Gaussian model
Posterior is again Gaussian:
Γ_pos = ( Gᵀ Γ_obs⁻¹ G + Γ_pr⁻¹ )⁻¹
      = Γ_pr − Γ_pr Gᵀ ( G Γ_pr Gᵀ + Γ_obs )⁻¹ G Γ_pr
      = ( I − KG ) Γ_pr,   with K = Γ_pr Gᵀ ( G Γ_pr Gᵀ + Γ_obs )⁻¹

µ_pos = Γ_pos Gᵀ Γ_obs⁻¹ y

In the context of filtering, K is known as the (optimal) Kalman gain.
H := Gᵀ Γ_obs⁻¹ G is the Hessian of the negative log-likelihood.
How does low rank of H affect the structure of the posterior? How
does H interact with the prior?
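The equivalence of the two covariance forms, and of the corresponding expressions for the mean, is easy to verify numerically; a sketch with small random test matrices:

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 8, 5
G = rng.standard_normal((m, n))                  # forward model matrix
A = rng.standard_normal((n, n))
Gamma_pr = A @ A.T / n + np.eye(n)               # an SPD prior covariance
Gamma_obs = 0.1 * np.eye(m)                      # observation noise covariance
y = rng.standard_normal(m)                       # synthetic data

# Precision form of the posterior covariance
Gamma_pos = np.linalg.inv(G.T @ np.linalg.inv(Gamma_obs) @ G
                          + np.linalg.inv(Gamma_pr))

# Kalman-gain form: K = Gamma_pr G^T (G Gamma_pr G^T + Gamma_obs)^{-1}
K = Gamma_pr @ G.T @ np.linalg.inv(G @ Gamma_pr @ G.T + Gamma_obs)
Gamma_pos_alt = (np.eye(n) - K @ G) @ Gamma_pr

# Posterior mean (zero prior mean); equals K @ y by the push-through identity
mu_pos = Gamma_pos @ G.T @ np.linalg.inv(Gamma_obs) @ y
```

The precision form inverts an n × n matrix and the gain form an m × m matrix, so in practice one chooses whichever of n or m is smaller.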
Likelihood-informed directions
Consider the Rayleigh ratio

R(w) = ( wᵀ H w ) / ( wᵀ Γ_pr⁻¹ w ).

When R(w) is large, the likelihood dominates the prior in direction w.

The ratio is maximized by solutions of the generalized eigenvalue problem

H w = λ Γ_pr⁻¹ w.

The posterior covariance can be written as a negative update along these
“likelihood-informed” directions, and an approximation can be obtained by
using only the r largest eigenvalues:

Γ_pos = Γ_pr − Σ_{i=1}^{n} [λ_i/(1 + λ_i)] w_i w_iᵀ ≈ Γ_pr − Σ_{i=1}^{r} [λ_i/(1 + λ_i)] w_i w_iᵀ   (1)
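A numerical sketch of this decomposition with random test matrices; note that scipy's generalized symmetric eigensolver returns eigenvectors normalized so that WᵀΓ_pr⁻¹W = I, which is exactly the normalization the update formula requires.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(6)
m, n = 20, 15
G = rng.standard_normal((m, n))
A = rng.standard_normal((n, n))
Gamma_pr = A @ A.T / n + np.eye(n)              # an SPD prior covariance
Gamma_obs = 0.5 * np.eye(m)

H = G.T @ np.linalg.inv(Gamma_obs) @ G          # Hessian of neg. log-likelihood
P_pr = np.linalg.inv(Gamma_pr)                  # prior precision

# Generalized eigenproblem H w = lambda * P_pr w; sort eigenvalues descending
lam, W = eigh(H, P_pr)
lam, W = lam[::-1], W[:, ::-1]

def gamma_pos_approx(r):
    """Prior minus rank-r negative update along likelihood-informed directions."""
    d = lam[:r] / (1.0 + lam[:r])
    return Gamma_pr - (W[:, :r] * d) @ W[:, :r].T

Gamma_pos_exact = np.linalg.inv(H + P_pr)       # full posterior covariance
```

With r = n the update is exact; truncating at smaller r discards the directions where the prior already dominates.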
Optimality results for Γpos
It turns out that the approximation

Γ̂_pos = Γ_pr − Σ_{i=1}^{r} [λ_i/(1 + λ_i)] w_i w_iᵀ   (2)

is optimal in a class of loss functions L(Γ̂_pos, Γ_pos) for approximations of
the form Γ̂_pos = Γ_pr − KKᵀ, where rank(K) ≤ r.¹

Γ̂_pos minimises the Hellinger distance and the KL divergence between
N(µ_pos(y), Γ̂_pos) and N(µ_pos(y), Γ_pos).

The results can also be used to devise efficient approximations for the
posterior mean.

λ = 1 means that prior and likelihood are roughly balanced; truncate at
λ = 0.1, for instance.

¹For details see Spantini et al., Optimal low-rank approximations of Bayesian
linear inverse problems, http://arxiv.org/abs/1407.3463
Remarks on the optimal approximation
Γ*_pos = Γ_pr − KKᵀ,   KKᵀ = Σ_{i=1}^{r} [λ_i/(1 + λ_i)] w_i w_iᵀ

The form of the optimal update is widely used (Flath et al. 2011)
Compute with Lanczos, randomized SVD, etc.
The directions w̃_i = Γ_pr⁻¹ w_i maximize the relative difference between
prior and posterior variance:

[ Var(w̃_iᵀ x) − Var(w̃_iᵀ x | y) ] / Var(w̃_iᵀ x) = λ_i / (1 + λ_i)

Using the Frobenius norm as a loss would instead yield directions of
greatest absolute difference between prior and posterior variance.
A metric between covariance matrices
Förstner metric

Let A, B ≻ 0, and let (σ_i) be the generalized eigenvalues of the pencil
(A, B). Then:

d²_F(A, B) = tr[ ln²( B^{−1/2} A B^{−1/2} ) ] = Σ_i ln²(σ_i)

Compare curvatures: sup_u ( uᵀAu / uᵀBu ) = σ₁

Invariance properties (for invertible M):

d_F(A, B) = d_F(A⁻¹, B⁻¹)
d_F(A, B) = d_F(MAMᵀ, MBMᵀ)

The Frobenius distance ‖A − B‖_F does not share the same properties
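A sketch verifying the two invariance properties on random SPD test matrices, with the σ_i supplied by a generalized eigensolver:

```python
import numpy as np
from scipy.linalg import eigh

def d_forstner(A, B):
    """Forstner distance: sqrt(sum_i ln^2 sigma_i), sigma_i gen. eigenvalues of (A, B)."""
    sigma = eigh(A, B, eigvals_only=True)
    return np.sqrt(np.sum(np.log(sigma) ** 2))

rng = np.random.default_rng(7)
n = 6
X, Y = rng.standard_normal((n, n)), rng.standard_normal((n, n))
A = X @ X.T + n * np.eye(n)            # SPD test matrices
B = Y @ Y.T + n * np.eye(n)
M = rng.standard_normal((n, n))        # almost surely invertible

d = d_forstner(A, B)
d_inverse = d_forstner(np.linalg.inv(A), np.linalg.inv(B))
d_congruence = d_forstner(M @ A @ M.T, M @ B @ M.T)
```

Inversion replaces each σ_i by 1/σ_i, which only flips the sign of ln σ_i, and a congruence transformation leaves the generalized eigenvalues unchanged; both leave the sum of squared logs intact.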
Example: computerized tomography
X-rays travel from sources to detectors through an object of interest. The
intensities from the sources are measured at the detectors, and the goal is to
reconstruct the density of the object.
[Figure: measured intensities versus detector pixel.]
This synthetic example is motivated by a real application: real-time X-ray
imaging of logs that enter a saw mill for the purpose of automatic quality
control.²
²Check www.bintec.fi
Example: computerized tomography
Weaker data → faster decay of generalized eigenvalues → lower order
approximations possible.
[Figure: (left) eigenvalue spectra (prior, limited angle, full angle) versus
index i; (center) generalized eigenvalues versus index i for the limited-angle
and full-angle cases; (right) Förstner distance dF versus the rank of the
update, for the limited-angle and full-angle cases.]
In the limited angle case, roughly r = 200 is enough to get a good
approximation (with full angle r ≈ 800 needed).
[Figure: images at update ranks 50, 100 and 200; top row labeled “prior”,
bottom row labeled “posterior”.]
Example: computerized tomography
Approximation of the mean: µ_pos(y) = Γ_pos Gᵀ Γ_obs⁻¹ y ≈ A_r y
Note: pre-computing the LIS basis offline allows fast online reconstructions for
repeated data!
Questions yet to answer
How to simulate from or explore the posterior distribution?
How to make Bayesian inference computationally tractable when the
forward model is expensive (e.g., a PDE) and the parameters are
high- or infinite-dimensional?
Downstream questions: model selection, optimal experimental
design, decision-making, etc.