hIPPYlib-MUQ: A Bayesian Inference Software Framework for Integration of Data with Complex Predictive Models under Uncertainty

KI-TAE KIM, University of California, Merced, USA
UMBERTO VILLA, Washington University in St. Louis, USA
MATTHEW PARNO, Dartmouth College, USA
YOUSSEF MARZOUK, Massachusetts Institute of Technology, USA
OMAR GHATTAS, The University of Texas at Austin, USA
NOEMI PETRA, University of California, Merced, USA

Bayesian inference provides a systematic framework for integration of data with mathematical models to quantify the uncertainty in the solution of the inverse problem. However, solution of Bayesian inverse problems governed by complex forward models described by partial differential equations (PDEs) remains prohibitive with black-box Markov chain Monte Carlo (MCMC) methods. We present hIPPYlib-MUQ, an extensible and scalable software framework that contains implementations of state-of-the-art algorithms aimed at overcoming the challenges of high-dimensional, PDE-constrained Bayesian inverse problems. These algorithms accelerate MCMC sampling by exploiting the geometry and intrinsic low-dimensionality of parameter space via derivative information and low-rank approximation. The software integrates two complementary open-source software packages, hIPPYlib and MUQ. hIPPYlib solves PDE-constrained inverse problems using automatically generated adjoint-based derivatives, but it lacks full Bayesian capabilities. MUQ provides a spectrum of powerful Bayesian inversion models and algorithms, but expects forward models to come equipped with gradients and Hessians to permit large-scale solution. By combining these two complementary libraries, we created a robust, scalable, and efficient software framework that realizes the benefits of each and allows us to tackle complex large-scale Bayesian inverse problems across a broad spectrum of scientific and engineering disciplines. To illustrate the capabilities of hIPPYlib-MUQ, we present a comparison of a number of MCMC methods available in the integrated software on several high-dimensional Bayesian inverse problems. These include problems characterized by both linear and nonlinear PDEs, low and high levels of data noise, and different parameter dimensions. The results demonstrate that large (∼50×) speedups over conventional black-box and gradient-based MCMC algorithms can be obtained by exploiting Hessian information (from the log-posterior), underscoring the power of the integrated hIPPYlib-MUQ framework.

CCS Concepts: • Mathematics of computing → Bayesian computation; Mathematical optimization; Partial differential equations; Computations on matrices; Discretization; Solvers; • Computing methodologies → Uncertainty quantification; • Applied computing → Physical sciences and engineering.

Authors' addresses: Ki-Tae Kim, University of California, Merced, Applied Mathematics, School of Natural Sciences, Merced, CA, USA, [email protected]; Umberto Villa, Washington University in St. Louis, Electrical & Systems Engineering, St. Louis, MO, USA, [email protected]; Matthew Parno, Dartmouth College, Department of Mathematics, Hanover, NH, USA, [email protected]; Youssef Marzouk, Massachusetts Institute of Technology, Department of Aeronautics and Astronautics, Boston, MA, USA, [email protected]; Omar Ghattas, The University of Texas at Austin, Oden Institute for Computational Engineering & Sciences, Department of Mechanical Engineering, Department of Geological Sciences, Austin, TX, USA, [email protected]; Noemi Petra, University of California, Merced, Applied Mathematics, School of Natural Sciences, Merced, CA, USA, [email protected].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
© 2021 Association for Computing Machinery.
0098-3500/2021/12-ART $15.00
https://doi.org/10.1145/nnnnnnn.nnnnnnn


Additional Key Words and Phrases: Infinite-dimensional inverse problems, adjoint-based methods, inexact Newton-CG method, low-rank approximation, Bayesian inference, uncertainty quantification, sampling, generic PDE toolkit

ACM Reference Format:
Ki-Tae Kim, Umberto Villa, Matthew Parno, Youssef Marzouk, Omar Ghattas, and Noemi Petra. 2021. hIPPYlib-MUQ: A Bayesian Inference Software Framework for Integration of Data with Complex Predictive Models under Uncertainty. ACM Trans. Math. Softw. 1, 1 (December 2021), 32 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION

With the rapid explosion of observational and experimental data, a prominent challenge is how to derive knowledge and insight from these data to make better predictions and high-consequence decisions. This question arises in all areas of science, engineering, technology, and medicine, and in many cases there are mathematical models available that represent the underlying physical systems from which the data are observed or measured. These models are often subject to considerable uncertainties stemming from unknown or uncertain input model parameters (e.g., coefficient fields, constitutive laws, source terms, geometries, initial and/or boundary conditions) as well as from noisy and limited observations. The goal is to infer these unknown model parameters from observations of model outputs through the corresponding partial differential equation (PDE) models, and to quantify the uncertainty in the solution of such inverse problems.

Bayesian inversion provides a systematic framework for integration of data with complex PDE-based models to quantify uncertainties in model parameter inference [34, 59]. In the Bayesian framework, noisy data and possibly uncertain mathematical models are combined with prior information, yielding a posterior probability distribution of the model parameters. The Markov chain Monte Carlo (MCMC) method is a common way to explore the posterior distribution by means of sampling. However, Bayesian inversion with complex forward models via conventional MCMC methods faces several computational challenges. First, characterizing the posterior distribution of the model parameters, or of subsequent predictions, often requires repeated evaluations of expensive-to-solve large-scale PDE models. Second, the posterior distribution often has a complex structure stemming from the nonlinear mapping from model parameters to observed quantities. Third, the parameters are often fields, which, after discretization, lead to very high-dimensional posteriors. These difficulties make the solution of Bayesian inverse problems with complex large-scale PDE forward models computationally intractable.

Extensive research efforts have been devoted to overcoming the prohibitive cost of Bayesian inverse problems governed by large-scale PDEs. With rapid progress in high-performance computing and advances in scalable PDE solvers, repeated evaluations of forward PDE models for different input parameters [5, 61] are becoming tractable. Furthermore, structure-exploiting MCMC methods have effectively facilitated the exploration of complex posterior distributions [9, 13, 18, 48]. Finally, dimension reduction methods have been shown to significantly reduce the computational cost of MCMC simulations [20, 69]. Applying and combining these advanced techniques can be extremely challenging; a computational tool that helps the computational and scientific community apply, extend, and tailor these methods is therefore highly beneficial.

In this paper, we present a software framework to tackle large-scale Bayesian inverse problems with PDE-based forward models, which have applications across a wide range of science and engineering fields. The software integrates two open-source software packages, an Inverse Problems Python library (hIPPYlib) [65] and the MIT Uncertainty Quantification Library (MUQ) [46], exploiting their complementary capabilities.

hIPPYlib is an extensible software framework for the solution of deterministic and linearized Bayesian inverse problems constrained by complex PDE models. Based on FEniCS [37] for the finite element approximation of PDEs and on PETSc [6] for high-performance linear algebra operations and solvers, hIPPYlib allows users to describe (and solve) the underlying PDE-based forward model (required by the inverse problem solver) in a relatively straightforward way. hIPPYlib also contains implementations of efficient numerical methods for the solution of deterministic and linearized Bayesian inverse problems. These include globalized inexact Newton-conjugate gradient [1, 10], adjoint-based computation of gradients and Hessian actions [62], randomized linear algebra [30], and scalable sampling from large-scale Gaussian fields. The state-of-the-art algorithms implemented in hIPPYlib deliver the solution of the linearized Bayesian inverse problem at a cost that is independent of the parameter dimension. hIPPYlib is, however, mainly designed for deterministic and linearized Bayesian inverse problems, and lacks full Bayesian inversion capabilities.

MUQ complements hIPPYlib's capabilities with more support for the formulation and solution of Bayesian inference problems. MUQ is a modular software framework designed to address uncertainty quantification problems involving complex models. The software provides an abstract modeling interface for combining physical (e.g., PDEs) and statistical components (e.g., additive error models, Gaussian process priors) to define Bayesian posterior distributions in a flexible and semi-intrusive way. MUQ also contains a suite of powerful uncertainty quantification algorithms including Markov chain Monte Carlo (MCMC) methods [47], transport maps [40], likelihood-informed subspaces, sparse adaptive generalized polynomial chaos (gPC) expansions [17], Karhunen-Loève expansions, Gaussian process modeling [31, 51], and prediction tools enabling global sensitivity analysis and optimal experimental design. To effectively apply these tools to Bayesian inverse problems, however, MUQ needs to be equipped with the type of gradient and/or Hessian information that hIPPYlib can provide.

By interfacing these two software libraries, we aim to create a robust, scalable, efficient, flexible, and easy-to-use software framework that overcomes the computational challenges inherent in complex large-scale Bayesian inverse problems. Representative features of the software are summarized as follows:
• The software combines the benefits of the two packages, hIPPYlib and MUQ, to enable scalable solution of Bayesian inverse problems governed by large-scale PDEs.
• Various advanced MCMC methods are available that can exploit problem structure (e.g., the derivative/Hessian information of the log-posterior).
• The software makes use of sparsity, low-dimensionality, and geometric structure of the log-posterior to achieve scalable and efficient MCMC methods.
• Convergence diagnostics are implemented to assess the quality of MCMC samples.

In the following sections, we first briefly review the Bayesian formulation of inverse problems governed by PDEs, both in infinite-dimensional and in finite-dimensional spaces (Section 2). We then describe the MCMC methods used to characterize the posterior (Section 3) and summarize the convergence diagnostics available in the software (Section 3.2). Next, we present the design of hIPPYlib-MUQ (Section 4). Finally, we present several benchmark problems and a step-by-step implementation guide to illustrate the key aspects of the present software (Section 5). Section 6 provides concluding remarks.

2 THE BAYESIAN INFERENCE FRAMEWORK

In this section, we present a brief discussion of the Bayesian inference approach to solving inverse problems governed by PDEs. We begin by providing an overview of the framework for infinite-dimensional Bayesian inverse problems following [14, 58, 65]. Then we present a brief discussion of the finite-dimensional approximations of the prior and the posterior distributions; a lengthier discussion can be found in [14]. Finally, we present the Laplace approximation to the posterior distribution, which requires the solution of a PDE-constrained optimization problem for the computation of the maximum a posteriori (MAP) point.

2.1 Infinite-dimensional Bayesian inverse problems

The objective of the inverse problem is to determine an unknown input parameter field m that would give rise to given observational (or experimental) data d by means of a (physics-based) mathematical model. In other words, given d ∈ R^q, we seek to infer m ∈ M (here, M is an infinite-dimensional Hilbert space of functions defined on a domain D ⊂ R^d) such that

$$ \boldsymbol{d} \approx \mathcal{F}(m), \tag{1} $$

where F : M → R^q is the parameter-to-observable map that predicts observations from a given parameter m through a forward mathematical model. Note that the evaluation of this map involves solving the forward PDE model given m, followed by extracting the observations from the solution of the forward problem.

In the Bayesian approach, the inverse problem is framed as a statistical inference problem. The uncertain parameter m and the observational data d are treated as random variables, and the solution is a conditional probability distribution that represents the level of confidence in the estimate of the parameter given the data. The approach combines a prior model, reflecting our knowledge of the parameter before the data are acquired, and a likelihood model, measuring how likely it is that a given parameter field would give rise to the observed data.

Using the Radon-Nikodym derivative [66] of the posterior measure μ_post with respect to the prior measure μ_prior, Bayes' theorem in infinite dimensions is stated as

$$ \frac{d\mu_{\text{post}}}{d\mu_{\text{prior}}} \propto \pi_{\text{like}}(\boldsymbol{d} \,|\, m), \tag{2} $$

where π_like denotes the likelihood function. For detailed conditions under which the posterior measure is well defined, we refer the reader to Stuart [58].

For the construction of the likelihood function, we restrict our attention to additive noise models. Noise may stem from measurement uncertainties and/or modeling errors. In this work, we assume that the noise is mutually independent of the parameter m and can be modeled as a Gaussian random variable η ∈ R^q with zero mean and covariance matrix Γ_noise ∈ R^{q×q}, i.e.,

$$ \boldsymbol{d} = \mathcal{F}(m) + \boldsymbol{\eta}, \qquad \boldsymbol{\eta} \sim \mathcal{N}(\boldsymbol{0}, \Gamma_{\text{noise}}). \tag{3} $$

This allows us to express the probability density function of the likelihood as

$$ \pi_{\text{like}}(\boldsymbol{d} \,|\, m) \propto \exp\{-\Phi(m)\}, \tag{4} $$

where $\Phi(m) = \tfrac{1}{2}\|\mathcal{F}(m) - \boldsymbol{d}\|^2_{\Gamma_{\text{noise}}^{-1}}$ is referred to as the negative log-likelihood.

We take the prior to be a Gaussian measure, i.e., μ_prior = N(m_pr, C_prior), and assume that samples from the prior distribution are square-integrable functions in the domain D, i.e., belong to L²(D). The prior covariance operator C_prior is constructed to be a trace-class operator to guarantee bounded variance of samples from the prior distribution and well-posedness of the Bayesian inverse problem; see Bui-Thanh et al. [14], Stuart [58], Villa et al. [65] for a detailed explanation. Specifically, we take $\mathcal{C}_{\text{prior}} := \mathcal{A}^{-v} = (-\gamma\Delta + \delta I)^{-v}$ with $v > d/2$, where γ and δ > 0 control the correlation length and the pointwise variance of the prior operator; see Lindgren et al. [35], Villa et al. [65].
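
To make the roles of γ, δ, and the exponent v concrete, the following minimal sketch draws a sample from a discretized prior of this form with v = 2 on a hypothetical 1D grid using finite differences; it is not hIPPYlib's finite-element construction, only an illustration of the idea that a draw from N(m_pr, A^{-2}) can be obtained as m_pr + A^{-1} z with z ~ N(0, I) when A is symmetric.

import numpy as np

# Hypothetical 1D finite-difference discretization of A = -gamma*Laplacian + delta*I
n, gamma, delta = 200, 0.1, 0.5
h = 1.0 / (n - 1)
lap = (np.diag(-2.0 * np.ones(n)) + np.diag(np.ones(n - 1), 1)
       + np.diag(np.ones(n - 1), -1)) / h**2
A = -gamma * lap + delta * np.eye(n)

# For C_prior = A^{-2} (v = 2), a prior sample is m_pr + A^{-1} z with z ~ N(0, I),
# since Cov(A^{-1} z) = A^{-1} A^{-T} = A^{-2} for symmetric A.
m_pr = np.zeros(n)
z = np.random.randn(n)
m_sample = m_pr + np.linalg.solve(A, z)

Larger γ/δ ratios yield longer correlation lengths, while increasing both γ and δ reduces the pointwise variance of the samples.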


2.2 Discretization of the Bayesian formulation

Here, we briefly present the finite-dimensional approximation of the Bayesian formulation described in the previous section; we refer the reader to Bui-Thanh et al. [14] for a lengthier discussion. We consider a finite-dimensional subspace M_h of M, defined by the span of a set of globally continuous basis functions $\{\phi_j\}_{j=1}^{n}$. For example, for the finite element method, these basis functions are piecewise polynomial on each element of a mesh discretization of the domain D [8, 57]. The parameter field m is then approximated as $m \approx m_h = \sum_{j=1}^{n} m_j \phi_j$, and, in what follows, $\boldsymbol{m} = (m_1, \ldots, m_n)^T \in \mathbb{R}^n$ denotes the vector of the finite element coefficients of m_h.

The finite-dimensional approximation of the prior measure μ_prior is now specified by the density

$$ \pi_{\text{prior}}(\boldsymbol{m}) \propto \exp\Big(-\tfrac{1}{2}\|\boldsymbol{m} - \boldsymbol{m}_{\text{pr}}\|^2_{\Gamma_{\text{prior}}^{-1}}\Big), \tag{5} $$

where m_pr ∈ R^n and Γ_prior ∈ R^{n×n} are the mean vector and the covariance matrix that arise upon discretization of m_pr and C_prior, respectively. We refer the reader to Bui-Thanh et al. [14], Villa et al. [65] for the explicit expression of the prior covariance matrix Γ_prior.

Then Bayes' theorem for the density of the finite-dimensional approximation of the posterior measure μ_post is given by

$$ \pi_{\text{post}}(\boldsymbol{m}) := \pi_{\text{post}}(\boldsymbol{m} \,|\, \boldsymbol{d}) \propto \pi_{\text{like}}(\boldsymbol{d} \,|\, \boldsymbol{m}) \, \pi_{\text{prior}}(\boldsymbol{m}). \tag{6} $$

The finite-dimensional posterior probability density function can be expressed explicitly as

$$ \pi_{\text{post}}(\boldsymbol{m}) \propto \exp\Big(-\tfrac{1}{2}\|\mathbf{F}(\boldsymbol{m}) - \boldsymbol{d}\|^2_{\Gamma_{\text{noise}}^{-1}} - \tfrac{1}{2}\|\boldsymbol{m} - \boldsymbol{m}_{\text{pr}}\|^2_{\Gamma_{\text{prior}}^{-1}}\Big), \tag{7} $$

where F(m) refers to the parameter-to-observable map obtained from the finite element discretization of the forward model.
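
For intuition, the unnormalized negative log-posterior implied by (7) is simply the sum of a data-misfit term and a prior term. A minimal NumPy sketch, assuming a generic user-supplied forward map forward(m) and dense covariance matrices (placeholders, not the hIPPYlib/MUQ interfaces), is:

import numpy as np

def neg_log_posterior(m, d, forward, Gamma_noise, Gamma_prior, m_pr):
    """Unnormalized negative log-posterior from (7): data misfit plus prior term."""
    r = forward(m) - d                                 # residual F(m) - d
    misfit = 0.5 * r @ np.linalg.solve(Gamma_noise, r)
    dm = m - m_pr
    reg = 0.5 * dm @ np.linalg.solve(Gamma_prior, dm)
    return misfit + reg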

2.3 The Laplace approximation of the posterior distribution

In general, the posterior probability distribution (7) is not Gaussian due to the nonlinearity of the parameter-to-observable map. In this section, we discuss the solution of the so-called linearized Bayesian inverse problem by use of the Laplace approximation. The Laplace approximation amounts to constructing a Gaussian distribution centered at the maximum a posteriori (MAP) point. The MAP point represents the most probable value of the parameter vector over the posterior, i.e.,

$$ \boldsymbol{m}_{\text{MAP}} := \arg\min_{\boldsymbol{m}} \big(-\log \pi_{\text{post}}(\boldsymbol{m})\big) = \arg\min_{\boldsymbol{m}} \; \tfrac{1}{2}\|\mathbf{F}(\boldsymbol{m}) - \boldsymbol{d}\|^2_{\Gamma_{\text{noise}}^{-1}} + \tfrac{1}{2}\|\boldsymbol{m} - \boldsymbol{m}_{\text{pr}}\|^2_{\Gamma_{\text{prior}}^{-1}}. \tag{8} $$

The covariance matrix of the Laplace approximation is the inverse of the Hessian of the negative log-posterior evaluated at the MAP point. Then, under the Laplace approximation, the solution of the linearized Bayesian inverse problem is given by

$$ \pi_{\text{post}}(\boldsymbol{m}) \sim \mathcal{N}(\boldsymbol{m}_{\text{MAP}}, \Gamma_{\text{post}}), \tag{9} $$

with

$$ \Gamma_{\text{post}} = \mathbf{H}^{-1}(\boldsymbol{m}_{\text{MAP}}) = \big(\mathbf{H}_{\text{misfit}}(\boldsymbol{m}_{\text{MAP}}) + \Gamma_{\text{prior}}^{-1}\big)^{-1}, \tag{10} $$

where H(m_MAP) and H_misfit(m_MAP) denote the Hessian matrices of, respectively, the negative log-posterior and the negative log-likelihood evaluated at the MAP point.

The quality of the Gaussian approximation of the posterior depends on the degree of nonlinearity in the parameter-to-observable map, the noise covariance matrix, and the number of observations [14, 24, 26, 33, 50, 56, 59, 60, 68]. When the parameter-to-observable map is linear and the additive noise and prior models are both Gaussian, the Laplace approximation is exact. Even if the parameter-to-observable map is significantly nonlinear, the Laplace approximation is a crucial ingredient for achieving scalable, efficient, and accurate posterior sampling with MCMC methods, as we will discuss in the following section.

Note that the Laplace approximation involves the Hessian of the negative log-likelihood (the data misfit part of the Hessian), which cannot be explicitly constructed when the parameter dimension is large. However, the data typically provide only limited information about the parameter field, and thus the eigenspectrum of the Hessian matrix often decays very rapidly. We exploit this compact nature of the Hessian to overcome its prohibitive computational cost, and construct a low-rank approximation of the data misfit Hessian matrix using a matrix-free method (such as the randomized subspace iteration [30]).

Concretely, we construct a low-rank approximation of the data misfit Hessian, i.e., $\mathbf{H}_{\text{misfit}} \approx \Gamma_{\text{prior}}^{-1} \mathbf{V}_r \mathbf{\Lambda}_r \mathbf{V}_r^T \Gamma_{\text{prior}}^{-1}$, where $\mathbf{\Lambda}_r = \operatorname{diag}(\lambda_1, \ldots, \lambda_r) \in \mathbb{R}^{r \times r}$ and $\mathbf{V}_r = [\boldsymbol{v}_1, \ldots, \boldsymbol{v}_r] \in \mathbb{R}^{n \times r}$ contain the r largest eigenvalues and corresponding eigenvectors, respectively, of the generalized symmetric eigenvalue problem

$$ \mathbf{H}_{\text{misfit}} \boldsymbol{v}_i = \lambda_i \Gamma_{\text{prior}}^{-1} \boldsymbol{v}_i, \qquad i = 1, \ldots, n. \tag{11} $$

Note that the eigenvectors $\boldsymbol{v}_i$ are orthonormal with respect to $\Gamma_{\text{prior}}^{-1}$, that is, $\boldsymbol{v}_i^T \Gamma_{\text{prior}}^{-1} \boldsymbol{v}_j = \delta_{ij}$, where $\delta_{ij}$ is the Kronecker delta. With this low-rank approximation, using the Sherman-Morrison-Woodbury formula [28], we obtain, for the inverse of the Hessian in (10),

$$ \mathbf{H}^{-1} = \big(\mathbf{H}_{\text{misfit}} + \Gamma_{\text{prior}}^{-1}\big)^{-1} = \Gamma_{\text{prior}} - \mathbf{V}_r \mathbf{D}_r \mathbf{V}_r^T + \mathcal{O}\Big(\sum_{i=r+1}^{n} \frac{\lambda_i}{1 + \lambda_i}\Big), \tag{12} $$

where $\mathbf{D}_r = \operatorname{diag}\big(\lambda_1/(\lambda_1 + 1), \ldots, \lambda_r/(\lambda_r + 1)\big) \in \mathbb{R}^{r \times r}$. We can see from the remainder term in (12) that to obtain an accurate low-rank approximation of H^{-1}, we must retain the eigenvectors corresponding to eigenvalues that are greater than 1. This approximation is used to efficiently perform various operations involving the Hessian, for example, applying the square-root inverse of the Hessian to a vector, which is needed to draw samples from the Gaussian approximation discussed in this section; see Villa et al. [65] for details.
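
As an illustration of how (12) is used in practice, the following sketch (plain NumPy with dense matrices; hIPPYlib performs the same operations matrix-free) applies the low-rank approximation of H^{-1} to a vector and draws an approximate sample from the Laplace approximation (9). It assumes the generalized eigenpairs (λ_i, v_i) of (11) have already been computed; the sampling map used here is one standard way to exploit the Γ_prior^{-1}-orthonormality of V_r, not necessarily the exact hIPPYlib formula.

import numpy as np

def apply_Hinv(v, Gamma_prior, Vr, lam):
    """Apply the low-rank approximation (12): H^{-1} v ≈ Γ_prior v - V_r D_r V_r^T v."""
    Dr = lam / (1.0 + lam)                          # entries λ_i / (1 + λ_i)
    return Gamma_prior @ v - Vr @ (Dr * (Vr.T @ v))

def sample_laplace(m_map, m_pr, prior_sample, Gamma_prior, Vr, lam):
    """Draw an approximate sample from N(m_MAP, Γ_post) given a prior sample.

    With V_r orthonormal w.r.t. Γ_prior^{-1}, the map
    S = I + V_r diag(1/sqrt(1+λ_i) - 1) V_r^T Γ_prior^{-1}
    satisfies S Γ_prior S^T = Γ_prior - V_r D_r V_r^T, so m_MAP + S (prior fluctuation)
    has the Laplace covariance (10).
    """
    x = prior_sample - m_pr                         # zero-mean prior fluctuation
    coeff = Vr.T @ np.linalg.solve(Gamma_prior, x)
    return m_map + x + Vr @ ((1.0 / np.sqrt(1.0 + lam) - 1.0) * coeff)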

3 MCMC SAMPLING

As mentioned above, when the parameter-to-observable map is nonlinear, the Laplace approximation may be a poor approximation of the posterior. In this case, one needs to apply a sampling-based method to explore the full posterior. In this section, we focus on several advanced Markov chain Monte Carlo (MCMC) methods available in the present software. We outline the general structure of MCMC methods with a brief discussion of their key features. We then present various diagnostics to assess the convergence of MCMC simulations.

3.1 Markov chain Monte Carlo

MCMC provides a flexible framework for exploring the posterior distribution. It generates samples from the posterior distribution that can be employed in Monte Carlo approximations of posterior expectations. For example, the posterior expectation of a quantity of interest G(m) can be approximated by

$$ \int \mathcal{G}(m) \, d\mu_{\text{post}} \approx \frac{1}{N} \sum_{i=1}^{N} \mathcal{G}(m_i), \tag{13} $$

where each $m_i \sim \mu_{\text{post}}$ is a sample from the posterior distribution.


MCMC techniques construct ergodic Markov chains for which the posterior distribution is the unique stationary distribution of the chain [52]. Asymptotically, the states of the Markov chain are therefore exact samples of the posterior distribution and can be used in (13). Markov chains are defined in terms of a transition kernel, which is a position-dependent probability distribution K(·|m_i) over the next state m_{i+1} in the chain given the previous state m_i, i.e., m_{i+1} ∼ K(·|m_i). Note that chains of finite length must be employed in practice, and the statistical accuracy of the Monte Carlo estimator is therefore highly dependent on the ability of the transition kernel to efficiently explore the parameter space.

There are several frameworks for constructing transition kernels that are appropriate for MCMC, including the well-known Metropolis-Hastings (MH) rule [32, 42], the Gibbs sampler (e.g., [16]), and delayed rejection (DR) [43]. MUQ provides implementations of these frameworks, as well as the generalized Metropolis-Hastings (gMH) kernel [15] and the multilevel MCMC framework of [23]. Most of these frameworks start by drawing samples from one or more proposal distributions q_1(·|m_i), ..., q_K(·|m_i) that are easy to sample from (e.g., Gaussian) and then "correct" the proposed samples to obtain exact, but correlated, posterior samples. In the MH and DR kernels, corrections take the form of accepting or rejecting the proposed point. In the gMH kernel, the correction involves analytically sampling a finite state Markov chain over multiple proposed points. Intuitively, proposal distributions that capture the shape of the posterior, either locally around m_i or globally over the parameter space, tend to require fewer "corrections" and yield more efficient algorithms.

Algorithm 1: Drawing a sample from the Metropolis-Hastings kernel
Input: Current state m_i, posterior density π_post(m), proposal q(·|m_i).
Output: Next state m_{i+1}.

/* Computes acceptance probability of proposed sample m′ */
Function AcceptProb(m_i, m′):
    γ ← [π_post(m′) / π_post(m_i)] · [q(m_i | m′) / q(m′ | m_i)]
    α ← min{1, γ}
    return α

/* Draws a sample from the Metropolis-Hastings kernel. */
Function MHKernel(m_i):
    m′ ∼ q(·|m_i)                     /* Sample the proposal. */
    α ← AcceptProb(m_i, m′)           /* Calculate the acceptance probability. */
    u ∼ U[0, 1]                       /* Accept proposed point with probability α. */
    if u < α then
        return m′                     /* Accept the proposed point as the next step in the chain. */
    else
        return m_i                    /* Reject proposed point; return current state m_i as next state. */
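
A direct transcription of Algorithm 1 into generic NumPy code, working with log-densities for numerical stability (a sketch, not the MUQ implementation; log_post, propose, and log_q are assumed user-supplied callables), looks as follows:

import numpy as np

def mh_kernel(m_i, log_post, propose, log_q):
    """One Metropolis-Hastings step (Algorithm 1), using log-densities.

    log_post(m): unnormalized log-posterior; propose(m): draws m' ~ q(.|m);
    log_q(a, b): log proposal density of a given b.
    """
    m_prop = propose(m_i)
    log_gamma = (log_post(m_prop) - log_post(m_i)
                 + log_q(m_i, m_prop) - log_q(m_prop, m_i))
    if np.log(np.random.rand()) < min(0.0, log_gamma):
        return m_prop          # accept
    return m_i                 # reject: keep the current state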

Proposal Distributions. Let q(·|m_i) denote a proposal distribution that is "parameterized" by the current state of the chain m_i. We require that the proposal distribution is easily sampled and that its density can be efficiently evaluated. The MH rule [32, 42] defines a transition kernel K_MH(·|m_i) through a two-step process: first draw a random sample m′ ∼ q(·|m_i) from the proposal distribution, and then accept the proposed sample m′ as the next step in the chain m_{i+1} with a probability α, which is defined in Algorithm 1. If rejected, set m_{i+1} = m_i. Under mild technical conditions on the proposal distribution (see, e.g., Roberts et al. [53]), the MH rule defines a Markov chain that is ergodic and has μ_post as a stationary distribution, thus enabling states in the chain to be used in Monte Carlo estimators. Note that the detailed balance condition (see, e.g., Owen [45]) is commonly employed to verify that a Markov chain has μ_post as a stationary distribution, but this condition alone is not sufficient to guarantee that the chain will converge to the stationary distribution. See Roberts et al. [53] for a detailed discussion of MH convergence and convergence rates.

Fig. 1. The relationship of various MCMC proposal distributions (RW, MALA, pCN, ∞-MALA, H-pCN, H-MALA, H-∞-MALA) with respect to mesh-refinement independence (blue arrow), gradient awareness (green arrow), and curvature awareness (red arrow). [Figure not reproduced; caption retained.]

While the MH rule will yield a valid MCMC kernel for a large class of proposal distributions, the dependence of the proposal on the previous state, combined with possible rejection of the proposed state, results in inter-sample correlations in the Markov chain. Because of these correlations, the error of the Monte Carlo approximation in (13) will be larger when using MCMC than in the classic Monte Carlo setting with independent samples. Markov chains with large correlations will result in larger estimator variance. To reduce correlation in the Markov chain, we seek proposal distributions that can take large steps with a high probability of acceptance. From the acceptance probability in Algorithm 1 we see that this can occur when the proposal density q(m|m_i) is a good approximation to π_post(m), so that γ is close to one.

We now turn to describing the specific proposal distributions used in hIPPYlib-MUQ. First, we describe common proposal mechanisms that exploit gradient and curvature information to accelerate sampling in finite-dimensional spaces. These algorithms comprise the left face of the cube in Figure 1. We then show how these ideas can be extended to construct proposals whose performance is independent of mesh refinement, thus "lifting" the derivative-accelerated proposals to an infinite-dimensional setting. This "lifting" operation transforms proposals on the left face in Figure 1 to their dimension-independent analogs on the right face of the proposal cube.

Exploiting Gradient and Curvature Information. Perhaps the simplest and most common, but not generally efficient, proposal distribution takes the form of a Gaussian distribution centered at the current state in the chain,

$$ q_{\text{RW}}(\boldsymbol{m} \,|\, \boldsymbol{m}_i) = \mathcal{N}\big(\boldsymbol{m}_i, \Gamma_{\text{prop}}\big), \tag{14} $$


where Γ_prop ∈ R^{n×n} is a user-defined covariance matrix. When used with the MH rule, this random walk (RW) proposal yields an MCMC algorithm commonly called the random walk Metropolis algorithm. The adaptive Metropolis (AM) algorithm employs a variant of this proposal in which the covariance Γ_prop is adapted based on previous samples [29]. A proposal covariance Γ_prop that matches the posterior covariance increases efficiency, but the random walk proposal is still a poor approximation of the posterior density.

A slightly more efficient proposal can be obtained through a one-step Euler-Maruyama discretization of the Langevin stochastic differential equation [54]. The resulting Langevin proposal takes the form

$$ q_{\text{MALA}}(\boldsymbol{m} \,|\, \boldsymbol{m}_i) = \mathcal{N}\big(\boldsymbol{m}_i + \tau \Gamma_{\text{prop}} \nabla \log \pi_{\text{post}}(\boldsymbol{m}_i), \; 2\tau \Gamma_{\text{prop}}\big), \tag{15} $$

where τ is the step size parameter. MH samplers with this proposal are called Metropolis-adjusted Langevin algorithms (MALA). Like the AM algorithm, adapting the covariance of the MALA proposal can also improve performance [4, 38].

Both (14) and (15) use a covariance that is constant across the parameter space. Allowing this covariance to adapt to the local correlation structure of the posterior density enables higher order approximations to be obtained, resulting in more efficient MCMC algorithms. In Girolami and Calderhead [27], a differential geometric viewpoint was employed to define a family of proposal mechanisms on a Riemannian manifold. Adapting the MALA proposal in (15) to this manifold setting, and ignoring the manifold's curvature, results in

$$ q_{\text{sMMALA}}(\boldsymbol{m} \,|\, \boldsymbol{m}_i) = \mathcal{N}\big(\boldsymbol{m}_i + \tau \mathbf{G}^{-1}(\boldsymbol{m}_i) \nabla \log \pi_{\text{post}}(\boldsymbol{m}_i), \; 2\tau \mathbf{G}^{-1}(\boldsymbol{m}_i)\big), \tag{16} $$

where G(m) is a position-dependent metric tensor. This is known as the simplified manifold MALA (sMMALA) proposal. Girolami and Calderhead [27] defined the metric tensor G(m) using the expected Fisher information metric, which provides a positive definite approximation of the posterior covariance at the point m. In this work, however, we consider an alternative version of the sMMALA proposal that uses a constant metric built from the low-rank approximation of the log-posterior Hessian at the MAP point (cf. eq. (12)),

$$ q_{\text{H-MALA}}(\boldsymbol{m} \,|\, \boldsymbol{m}_i) = \mathcal{N}\big(\boldsymbol{m}_i + \tau \mathbf{H}^{-1} \nabla \log \pi_{\text{post}}(\boldsymbol{m}_i), \; 2\tau \mathbf{H}^{-1}\big). \tag{17} $$

This metric is similar to the one used by Martin et al. [39] and is equivalent to the preconditioned MALA proposal in (15) using the covariance of the Laplace approximation in (10).
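
To make the structure of (14), (15), and (17) concrete, here is a small sketch of the three proposal draws (generic NumPy code; grad_log_post, apply_Hinv, and sample_Hinv are assumed helpers, e.g., the low-rank H^{-1} application sketched in Section 2.3):

import numpy as np

def rw_proposal(m_i, L_prop):
    """Random walk (14): m' = m_i + L_prop z, with Γ_prop = L_prop L_prop^T."""
    return m_i + L_prop @ np.random.randn(m_i.size)

def mala_proposal(m_i, tau, Gamma_prop, L_prop, grad_log_post):
    """MALA (15): mean m_i + τ Γ_prop ∇log π_post(m_i), covariance 2τ Γ_prop."""
    mean = m_i + tau * (Gamma_prop @ grad_log_post(m_i))
    return mean + np.sqrt(2.0 * tau) * (L_prop @ np.random.randn(m_i.size))

def h_mala_proposal(m_i, tau, apply_Hinv, sample_Hinv, grad_log_post):
    """H-MALA (17): MALA preconditioned by the Laplace Hessian inverse H^{-1}.

    apply_Hinv(v) applies H^{-1}; sample_Hinv() draws z ~ N(0, H^{-1})."""
    mean = m_i + tau * apply_Hinv(grad_log_post(m_i))
    return mean + np.sqrt(2.0 * tau) * sample_Hinv()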

Hamiltonian Monte Carlo techniques define another important class of MCMC proposals. These techniques approximately solve a Hamiltonian system to take large jumps in the parameter space. While efficient in various scenarios (see, e.g., Neal [44]), we have found that solving the Hamiltonian system typically involves an intractable number of posterior gradient evaluations on our problems of interest. The transport map MCMC algorithms of Parno and Marzouk [47] are also not considered here because of the challenge of building high-dimensional transformations.

Dimension-Independent Proposal Distributions. For finite-dimensional parameters, the random walk and MALA proposals defined above can be used with the MH rule for MCMC. However, their performance is not discretization invariant. As the discretization of the function m is refined, the performance of the samplers on the finite-dimensional posterior π_post(m) will worsen. Some modifications to the proposals are necessary to obtain "dimension-independent" performance. The works of Cotter et al. [18], Beskos et al. [9], and Bardsley et al. [7], for example, modify existing finite-dimensional proposals to ensure the algorithm performance is independent of mesh refinement.


The dimension-independent analog of the RW proposal is the preconditioned Crank-Nicolson (pCN) proposal introduced in Cotter et al. [18]. It takes the form

$$ q_{\text{pCN}}(\boldsymbol{m} \,|\, \boldsymbol{m}_i) = \mathcal{N}\Big(\boldsymbol{m}_{\text{pr}} + \sqrt{1 - \beta^2}\,(\boldsymbol{m}_i - \boldsymbol{m}_{\text{pr}}), \; \beta^2 \Gamma_{\text{prior}}\Big). \tag{18} $$

Notice that when β = 1, the pCN proposal is equal to the prior distribution. The MALA proposal was also adapted in Cotter et al. [18] to obtain the infinite-dimensional MALA (∞-MALA) proposal

$$ q_{\infty\text{-MALA}}(\boldsymbol{m} \,|\, \boldsymbol{m}_i) = \mathcal{N}\Big(\sqrt{1 - \beta^2}\,\boldsymbol{m}_i + \beta \frac{\sqrt{h}}{2}\big(\boldsymbol{m}_{\text{pr}} - \Gamma_{\text{prior}} \nabla \Phi(\boldsymbol{m}_i)\big), \; \beta^2 \Gamma_{\text{prior}}\Big), \tag{19} $$

where $\beta = 4\sqrt{h}/(4 + h)$ and h is a parameter that can be tuned. While the pCN and ∞-MALA proposals result in discretization-invariant Metropolis-Hastings algorithms, they suffer from the same deficiencies as their finite-dimensional RW and MALA analogs: they do not capture the posterior geometry.
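
A pCN step is particularly simple to implement because it only requires prior samples, never an application of Γ_prior^{-1}; a minimal sketch (generic NumPy, with sample_prior_fluctuation() standing in for a draw from N(0, Γ_prior)) is:

import numpy as np

def pcn_proposal(m_i, m_pr, beta, sample_prior_fluctuation):
    """pCN proposal (18): m' = m_pr + sqrt(1 - beta^2) (m_i - m_pr) + beta * xi,
    where xi ~ N(0, Γ_prior) is supplied by sample_prior_fluctuation()."""
    return m_pr + np.sqrt(1.0 - beta**2) * (m_i - m_pr) + beta * sample_prior_fluctuation()

A well-known property of pCN with a Gaussian prior is that the prior terms cancel in the MH acceptance ratio, so only the data misfit Φ needs to be evaluated for each proposed point.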

Several efforts have worked to minimize this deficiency; see, for example, Beskos et al. [9], Petra et al. [48], Pinski et al. [49], Rudolph and Sprungk [55]. We consider a generalization of the pCN proposal described in Pinski et al. [49]. It incorporates the MAP point and the posterior curvature information at that point into the pCN proposal; it is denoted by H-pCN and takes the form

$$ q_{\text{H-pCN}}(\boldsymbol{m} \,|\, \boldsymbol{m}_i) = \mathcal{N}\Big(\boldsymbol{m}_{\text{MAP}} + \sqrt{1 - \beta^2}\,(\boldsymbol{m}_i - \boldsymbol{m}_{\text{MAP}}), \; \beta^2 \mathbf{H}^{-1}\Big). \tag{20} $$

Another method that can exploit the posterior geometry is an extension of the ∞-MALA proposal discussed in Beskos et al. [9]:

$$ q_{\infty\text{-sMMALA}}(\boldsymbol{m} \,|\, \boldsymbol{m}_i) = \mathcal{N}\big(\mu'(\boldsymbol{m}_i), \; \Gamma'(\boldsymbol{m}_i)\big), \tag{21} $$

where

$$ \mu'(\boldsymbol{m}_i) = \sqrt{1 - \beta^2}\,\boldsymbol{m}_i + \beta \frac{\sqrt{h}}{2}\Big(\boldsymbol{m}_i - \mathbf{G}^{-1}\Gamma_{\text{prior}}^{-1}(\boldsymbol{m}_i - \boldsymbol{m}_{\text{pr}}) - \mathbf{G}^{-1}\nabla \Phi(\boldsymbol{m}_i)\Big), \tag{22} $$

$$ \Gamma'(\boldsymbol{m}_i) = \beta^2 \mathbf{G}^{-1}(\boldsymbol{m}_i). \tag{23} $$

This ∞-sMMALA proposal simplifies to ∞-MALA when G^{-1}(m_i) = Γ_prior. When G(m) is the Laplace approximation Hessian from (10), the ∞-sMMALA proposal simplifies to

$$ q_{\infty\text{-H-MALA}}(\boldsymbol{m} \,|\, \boldsymbol{m}_i) = \mathcal{N}\Big(\sqrt{1 - \beta^2}\,\boldsymbol{m}_i + \beta \frac{\sqrt{h}}{2}\big(\boldsymbol{m}_i - \mathbf{H}^{-1}\Gamma_{\text{prior}}^{-1}(\boldsymbol{m}_i - \boldsymbol{m}_{\text{pr}}) - \mathbf{H}^{-1}\nabla \Phi(\boldsymbol{m}_i)\big), \; \beta^2 \mathbf{H}^{-1}\Big), \tag{24} $$

which we denote by H-∞-MALA.

Alternative Transition Kernels. The proposal distributions above are classically considered in the context of a Metropolis-Hastings kernel. However, there are alternative transition kernels that also result in ergodic Markov chains. Here we consider transition kernels constructed from the delayed rejection approach of Mira et al. [43] as well as Metropolis-within-Gibbs kernels, which repeatedly use the Metropolis-Hastings rule on different conditional slices of the posterior distribution to construct the Markov chain. In particular, we consider the family of dimension-independent likelihood-informed (DILI) approaches [19, 21], which define a Metropolis-within-Gibbs sampler that inherits dimension-independent properties from an appropriate dimension-independent proposal.

The delayed rejection kernel allows multiple proposals to be attempted in each step of the Markov chain. This can be advantageous when using multiple proposals with complementary properties. For example, it is possible to start with a proposal that attempts to make large, ambitious jumps across the parameter space but may have a low acceptance probability, while falling back on a more conservative proposal that takes smaller steps with a larger probability of acceptance. Similarly, it is possible to start with a proposal that is more computationally efficient (e.g., does not require gradient information) but less likely to be accepted, while employing a more expensive proposal mechanism in a second stage to ensure the chain explores the space. In either case, if the first proposed move is rejected by the Metropolis-Hastings rule, another, more expensive proposal that is more likely to be accepted can be tried with an adjusted acceptance probability. More than two stages can also be employed. The details of delayed rejection are provided in Algorithm 2.

Algorithm 2: Drawing a sample from the delayed rejection kernel
Input: Current state m_i, posterior density π_post(m), proposals q_1(·|m_i), ..., q_J(·|m_i).
Output: Next state m_{i+1}.

/* Computes the probability of accepting the proposed point m′_j from DR stage j, given the
   previous point m_i in the chain and the j−1 points [m′_1, ..., m′_{j−1}] that were rejected
   in previous DR stages. */
Function AcceptProb(m_i, [m′_1, ..., m′_j]):
    γ ← (π_post(m′_j) / π_post(m_i)) · (q_j(m_i | m′_j) / q_j(m′_j | m_i)) ·
        ∏_{k=1}^{j−1} [ (q_k(m′_{j−k} | m′_j) / q_k(m′_k | m_i)) ·
                        (1 − AcceptProb(m′_j, [m′_{j−1}, m′_{j−2}, ..., m′_{j−k}])) /
                        (1 − AcceptProb(m_i, [m′_1, m′_2, ..., m′_k])) ]
    α ← min{1, γ}
    return α

/* Draws a sample from the delayed rejection kernel. */
Function DRKernel(m_i):
    for j ← 1 to J do
        m′_j ∼ q_j(·|m_i)                           /* Sample the j-th proposal. */
        α ← AcceptProb(m_i, [m′_1, ..., m′_j])      /* Calculate the acceptance probability. */
        u ∼ U[0, 1]
        if u < α then
            return m′_j                             /* Accept the current proposed point. */
    return m_i                                      /* All proposed points were rejected; return the current state. */
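
As a concrete special case, the following sketch implements a two-stage (J = 2) delayed-rejection step following the acceptance probability in Algorithm 2, working with log-densities (a generic NumPy sketch, not MUQ's implementation; prop1, prop2, log_q1, and log_q2 are assumed user-supplied helpers):

import numpy as np

def dr_two_stage(m_i, log_post, prop1, log_q1, prop2, log_q2):
    """One two-stage delayed-rejection step (Algorithm 2 with J = 2).

    prop*(m) draws from q_*(.|m); log_q*(a, b) evaluates log q_*(a | b).
    """
    def log_alpha1(x, y):
        return min(0.0, log_post(y) - log_post(x) + log_q1(x, y) - log_q1(y, x))

    # Stage 1: standard Metropolis-Hastings step with proposal q_1.
    m1 = prop1(m_i)
    if np.log(np.random.rand()) < log_alpha1(m_i, m1):
        return m1

    # Stage 2: fall back on proposal q_2 with the DR-adjusted acceptance probability.
    m2 = prop2(m_i)
    log_gamma = (log_post(m2) - log_post(m_i)
                 + log_q2(m_i, m2) - log_q2(m2, m_i)
                 + log_q1(m1, m2) - log_q1(m1, m_i)
                 + np.log1p(-np.exp(log_alpha1(m2, m1)))
                 - np.log1p(-np.exp(log_alpha1(m_i, m1))))
    if np.log(np.random.rand()) < min(0.0, log_gamma):
        return m2
    return m_i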

DILI divides the parameter space into a finite-dimensional subspace, which can be explored with standard proposal mechanisms, and a complementary infinite-dimensional space that can be explored with a dimension-independent approach, such as those described above. The resulting transition kernel is more complicated than the Metropolis-Hastings rule, but inherits the dimension-independent properties of the complementary-space proposal. The likelihood-informed subspace is computed using the generalized eigenvalue problem in (11). If an eigenvalue is larger than one, it indicates that the likelihood function dominates the prior density in the corresponding direction. The same low-rank structure used to approximate the posterior Hessian can therefore be used to decompose the parameter space into a likelihood-informed subspace (LIS) spanned by the columns of V_r and an orthogonal complementary space (CS). As shown in Algorithm 3, within each subspace a standard Metropolis-Hastings kernel is employed. As long as the kernel in the CS uses a dimension-independent proposal (typically pCN), the DILI sampler will remain dimension-independent. Unlike the original implementation described in Cui et al. [19], the MUQ implementation does not use a whitening transform and thus avoids computing any symmetric decomposition of the prior covariance. In general, the Hessian used in (11) can be adapted to capture more correlation structure. However, we did not find this necessary in the numerical experiments below.

Algorithm 3: Drawing a sample from the DILI kernel
Input: Current state m_i, current subspace V_r, subspace kernel K_s(·|r, c), complementary kernel K_c(·|r, c).
Output: Next state m_{i+1}.

/* Use Metropolis-within-Gibbs steps to draw a sample from the DILI kernel. */
Function DILIKernel(m_i):
    /* Split current state into LIS and CS components. */
    W_r ← Γ_prior^{-1} V_r
    r_i ← W_r^T m_i
    c_i ← (I − V_r W_r^T) m_i
    /* Take a step in the likelihood-informed subspace (LIS). */
    r′ ← K_s(·|r_i, c_i)
    /* Take a step in the complementary space (CS). */
    c′ ← K_c(·|r′, c_i)
    /* Compute the new location in the full space. */
    m′ = V_r r′ + c′
    return m′
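
The subspace splitting at the top of Algorithm 3 is just a Γ_prior^{-1}-weighted projection; a minimal sketch (dense NumPy, assuming the eigenvectors V_r from (11) and a dense Γ_prior are available) is:

import numpy as np

def dili_split(m, Vr, Gamma_prior):
    """Split m into LIS coefficients r and the complementary-space part c (Algorithm 3)."""
    Wr = np.linalg.solve(Gamma_prior, Vr)    # W_r = Γ_prior^{-1} V_r
    r = Wr.T @ m                             # coefficients in the likelihood-informed subspace
    c = m - Vr @ (Wr.T @ m)                  # complementary-space component
    return r, c

def dili_recombine(r_new, c_new, Vr):
    """Reassemble the full-space state: m' = V_r r' + c'."""
    return Vr @ r_new + c_new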

Assembling an MCMC Algorithm. It is possible to combine nearly any of the proposals and kernels described above, resulting in myriad possible MCMC algorithms. As suggested in Figure 2, there are three fundamental building blocks to an MCMC algorithm. The chain keeps track of previous points and allows computing Monte Carlo estimates. The kernel defines a mechanism for sampling the next state m_{i+1} given the value of the current state m_i and one or more proposal distributions. The proposal defines a position-specific probability distribution that can be easily sampled and has a density that can be efficiently evaluated. We mimic these abstract interfaces in our software design to define and test a large number of kernel-proposal combinations.

3.2 MCMC diagnostics

Two questions naturally arise when analyzing a length-N Markov chain [m_1, ..., m_N] produced by MCMC. First, has the chain converged to the stationary distribution? Second, what is the statistical efficiency of the chain? Most theoretical guarantees are asymptotic, and it is important to quantitatively answer these questions when employing finite-length MCMC chains. Based on these considerations, this section describes the diagnostics implemented in hIPPYlib-MUQ to check the convergence and statistical efficiency of high-dimensional MCMC chains.

3.2.1 Assessing Convergence. To assess convergence, we compute two different asymptotically unbiased estimators of the posterior covariance: one that is an overestimate for finite N and one that is an underestimate for finite N. As the ratio of these two estimates approaches one, we can be confident that the MCMC chain has converged (see, e.g., Brooks and Gelman [11], Gelman et al. [26], Vehtari et al. [64]).


Fig. 2. The flexible framework of hIPPYlib-MUQ allows many different combinations of transition kernels and proposal distributions to be employed. The components of the transition kernels defined in Algorithms 1-3 are shown (Metropolis-Hastings: chain, MH kernel, proposal; delayed rejection: chain, DR kernel, proposals 1, ..., J; DILI: chain, DILI kernel with an MH kernel and proposal for each of the LIS and CS). Note that each kernel can interact with any proposal distribution, which enables many different MCMC algorithms to be constructed from the same basic components. [Figure not reproduced; caption and block labels retained.]

The estimates are based on running M independent chains starting from randomly chosen points that are more disperse than the posterior distribution μ_post, where we define a "disperse" distribution as one that has a larger covariance than μ_post. Each chain has the same length N. Letting m_{ij} be the i-th MCMC sample in chain j, we define the within-sequence covariance matrix W and the between-sequence covariance matrix B as

$$ \mathbf{W} = \frac{1}{M(N-1)} \sum_{j=1}^{M} \sum_{i=1}^{N} (\mathbf{m}_{ij} - \bar{\mathbf{m}}_{.j})(\mathbf{m}_{ij} - \bar{\mathbf{m}}_{.j})^T, \qquad \bar{\mathbf{m}}_{.j} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{m}_{ij}, \tag{25} $$

$$ \mathbf{B} = \frac{N}{M-1} \sum_{j=1}^{M} (\bar{\mathbf{m}}_{.j} - \bar{\mathbf{m}}_{..})(\bar{\mathbf{m}}_{.j} - \bar{\mathbf{m}}_{..})^T, \qquad \bar{\mathbf{m}}_{..} = \frac{1}{M} \sum_{j=1}^{M} \bar{\mathbf{m}}_{.j}. \tag{26} $$

As pointed out in Brooks and Gelman [11], W and B can be combined to produce an estimate V of the posterior covariance that takes the form

$$ \mathbf{V} = \frac{N-1}{N} \mathbf{W} + \frac{M+1}{MN} \mathbf{B}. \tag{27} $$

The overdispersion of the initial points in each chain causes V to overestimate the posterior covariance for finite N. On the other hand, the average within-chain covariance W will tend to underestimate the covariance because the chains have not explored the entire parameter space. Comparing W and V thus provides a way of assessing convergence.

The $\hat{R}$ statistic of Gelman et al. [26] and Vehtari et al. [64] is a common way of comparing W and V. It uses the ratio of the diagonal components of V and W to construct a componentwise convergence diagnostic. For high-dimensional problems, it is more natural to consider a multivariate convergence diagnostic. We will therefore employ the multivariate potential scale reduction factor (MPSRF) of Brooks and Gelman [11], which is a natural extension of the componentwise $\hat{R}$ statistic. The MPSRF is defined by

$$ \text{MPSRF} = \sqrt{\max_{a} \frac{a^T \mathbf{V} a}{a^T \mathbf{W} a}} = \sqrt{\frac{N-1}{N} + \frac{M+1}{MN} \lambda_{\max}}, \tag{28} $$

where λ_max is the largest eigenvalue satisfying the generalized eigenvalue problem B v = λ W v.


Note that by construction MPSRF ≥ 1. When the MPSRF approaches 1, the variance within each sequence approaches the variance across sequences, thus indicating that each individual chain has converged to the target distribution. Following the recommendations of Vehtari et al. [64], we will consider the chains "converged" if MPSRF < 1.01.
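
The diagnostic in (25)-(28) is straightforward to compute from a stack of chains; a minimal NumPy/SciPy sketch (assuming chains is an array of shape (M, N, n)) is:

import numpy as np
from scipy.linalg import eigh

def mpsrf(chains):
    """Multivariate potential scale reduction factor (28) from M chains of length N."""
    M, N, n = chains.shape
    chain_means = chains.mean(axis=1)                 # per-chain means, eq. (25)
    grand_mean = chain_means.mean(axis=0)             # overall mean, eq. (26)

    # Within-sequence covariance W, eq. (25)
    centered = chains - chain_means[:, None, :]
    W = np.einsum('jik,jil->kl', centered, centered) / (M * (N - 1))

    # Between-sequence covariance B, eq. (26)
    dev = chain_means - grand_mean
    B = N * (dev.T @ dev) / (M - 1)

    # Largest generalized eigenvalue of B v = lambda W v
    lam_max = eigh(B, W, eigvals_only=True)[-1]
    return np.sqrt((N - 1) / N + (M + 1) / (M * N) * lam_max)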

3.2.2 Statistical Efficiency. The samples in an MCMC chain are generally correlated, which increases the variance of Monte Carlo estimators constructed with MCMC samples. For a quantity of interest G(m), the effective sample size (ESS) of a Markov chain is defined as the number of independent samples of the posterior that would be needed to estimate E[G] with the same statistical accuracy as an estimate from the Markov chain. The ESS is therefore a measure of how much information is contained in the MCMC chain. It is common to define the ESS for estimators of the posterior mean, i.e., E[G] = E[m]. Here we derive the ESS under this common assumption, but discuss alternatives that are better suited for high-dimensional parameter spaces in Section 3.2.3.

There are several ways of estimating the ESS. For instance, spectral approaches use the integrated autocorrelation of the MCMC chain to estimate the effective sample size (see, e.g., Gelman et al. [26], Wolff et al. [67]). Other common methods use the statistics of small sample batches (see, e.g., Flegal and Jones [25], Vats et al. [63]). MUQ provides implementations of both spectral and batch methods. Here we focus on the spectral formulation of the ESS, however, because it gives additional insight into the structure of MCMC chains. The ESS for component i of m is defined by

$$ \text{ESS}_i = \frac{MN}{1 + 2\sum_{t=1}^{\infty} \rho_{it}}, \tag{29} $$

where ρ_{it} is the autocorrelation function of component i of the MCMC chain at lag t. Here, the autocorrelation function ρ_{it} is estimated by the following formula [26]:

$$ \rho_{it} \approx \hat{\rho}_{it} = 1 - \frac{v_{it}}{2 V_{ii}}, \tag{30} $$

where V_{ii} is the i-th diagonal component of the posterior covariance estimate defined in (27) and v_{it} is the variogram defined by

$$ v_{it} = \frac{1}{M(N-t)} \sum_{j=1}^{M} \sum_{k=t+1}^{N} \big(m_{kj,i} - m_{(k-t)j,i}\big)^2. \tag{31} $$

In practice, $\hat{\rho}_{it}$ is noisy for large values of t, and we truncate the summation in (29) at some lag t′. Following common practice, we choose t′ ≥ 0 to be the lag for which the sum of successive autocorrelation estimates $\hat{\rho}_{2t'} + \hat{\rho}_{2t'+1}$ is negative [26].
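
Putting (29)-(31) together, a minimal NumPy sketch of the spectral ESS estimate for a single parameter component is given below; it assumes chains has shape (M, N) for that component and that V_ii is the corresponding diagonal entry of (27), and it uses a simplified version of the paired truncation rule.

import numpy as np

def ess_component(chains, V_ii):
    """Spectral ESS estimate (29) for one parameter component from M chains of length N."""
    M, N = chains.shape
    # Variogram-based autocorrelation estimates, eqs. (30)-(31)
    rho = np.empty(N - 1)
    rho[0] = 1.0
    for t in range(1, N - 1):
        v_t = np.mean((chains[:, t:] - chains[:, :-t]) ** 2)   # variogram (31)
        rho[t] = 1.0 - v_t / (2.0 * V_ii)                       # eq. (30)
    # Truncate where the sum of a pair of successive estimates turns negative
    T = N - 2
    for tp in range((N - 2) // 2):
        if rho[2 * tp] + rho[2 * tp + 1] < 0.0:
            T = 2 * tp - 1
            break
    return M * N / (1.0 + 2.0 * rho[1:T + 1].sum())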

3.2.3 Projection along the dominant eigenvectors. We note that the evaluation of the integrated autocorrelation time and ESS for all components of the parameter vector m is computationally intractable when m is high-dimensional. Moreover, a huge amount of disk storage is required to save all the samples before the ESS evaluation. To alleviate these issues for large-scale problems, we consider only the subspace spanned by the r dominant eigenvectors of the generalized eigensystem in (11). Specifically, we compute the autocorrelation time and ESS based on the coefficient vector c ∈ R^r defined by

$$ \mathbf{c} = \mathbf{V}_r^T \Gamma_{\text{prior}}^{-1} \mathbf{m}. \tag{32} $$

4 SOFTWARE FRAMEWORK

hIPPYlib-MUQ is a Python interface that integrates these two open-source software libraries into a unique software framework, allowing the user to implement state-of-the-art Bayesian inversion algorithms in a seamless way. In this framework, hIPPYlib is used to define the forward model, the prior, and the likelihood, to compute the maximum a posteriori (MAP) point, and to construct a Gaussian (Laplace) approximation of the posterior distribution based on approximations of the posterior covariance as a low-rank update of the prior [14]. MUQ is employed to exploit advanced MCMC methods to fully characterize the posterior distribution in non-Gaussian/nonlinear settings. hIPPYlib-MUQ offers a set of wrappers that encapsulate the functionality of hIPPYlib in such a way that various features of hIPPYlib can be accessed by MUQ. A key aspect of hIPPYlib-MUQ is that it enables the use of curvature-informed MCMC methods, which is crucial for efficient and scalable exploration of the posterior distribution for large-scale Bayesian inverse problems. We summarize in Figure 3 the main functionalities of hIPPYlib and MUQ and the integration of their complementary components.

Fig. 3. Description of the functionalities of hIPPYlib and MUQ and their interface. Orange and red boxes represent hIPPYlib and MUQ functionalities, respectively (hIPPYlib Model, hIPPYlib Algorithms, Model Evaluation & Sensitivities, Laplace Approximation; MUQ Modeling, MUQ Algorithms, MCMC Proposals & Kernels, ModPieces). Green boxes indicate external software libraries, FEniCS and PETSc, that provide parallel implementations of finite element discretizations and solvers. Arrows represent one-way or reciprocal interactions. [Figure not reproduced; caption and block titles retained.]

Figure 4 provides an overview of the Python classes implemented by the hIPPYlib-MUQ interface. Inherited from MUQ classes, the interface classes wrap the hIPPYlib functionalities needed to achieve curvature-informed MCMC sampling methods. These include:

(1) a prior Gaussian interface (e.g., the LaplaceGaussian class) to enable the use of hIPPYlib prior models (e.g., the LaplacianPrior class) in MUQ probability distribution models (e.g., the GaussianBase class);

(2) a likelihood interface (the Param2LogLikelihood class) to incorporate hIPPYlib likelihood models (the Model class) into the MUQ Bayesian modeling framework, so that MUQ can exploit the model evaluation (the parameter-to-observable map) and optionally its gradient and Hessian actions;

(3) a Laplace approximation interface (the LAPosteriorGaussian class) to provide access to the Laplace approximation of the posterior distribution generated by hIPPYlib (the GaussianLRPosterior class) from the MUQ modeling component (the ModPiece class).


[Figure 4 shows the class hierarchy: the interface classes LaplaceGaussian, BiLaplaceGaussian, LAPosteriorGaussian, and Param2LogLikelihood wrap the hIPPYlib classes LaplacianPrior, BiLaplacianPrior, GaussianLRPosterior, and Model, respectively, and inherit from the MUQ classes GaussianBase, Density, Distribution, and ModPiece.]

Fig. 4. Class hierarchy for the hIPPYlib-MUQ framework. Classes of hIPPYlib, MUQ, and the interface are colored in orange, red, and blue, respectively. Dashed arrows represent the inheritance relationship between two classes: the arrowhead attaches to the super-class and the other end attaches to the sub-class.

# Example code snippet
import muq.Modeling as mm
import hippylib2muq as hm

# ... Use hIPPYlib to define prior and model variables

# Convert hIPPYlib components to MUQ components
prior_density = hm.BiLaplaceGaussian(prior).AsDensity()
likelihood = hm.Param2LogLikelihood(model)

# Add all of the components to the graph
graph = mm.WorkGraph()
graph.AddNode(mm.IdentityOperator(dim), 'Parameter')
graph.AddNode(prior_density, 'Prior')
graph.AddNode(likelihood, 'Likelihood')
graph.AddNode(mm.DensityProduct(2), 'Posterior')

# Define right branch: Parameter -> Prior -> Posterior
graph.AddEdge('Parameter', 0, 'Prior', 0)
graph.AddEdge('Prior', 0, 'Posterior', 0)

# Define left branch: Parameter -> Likelihood -> Posterior
graph.AddEdge('Parameter', 0, 'Likelihood', 0)
graph.AddEdge('Likelihood', 0, 'Posterior', 1)

[Figure 5 (left) depicts the posterior as a graph with nodes Parameter (IdentityOperator), Prior (e.g., LaplaceGaussian), Likelihood (Param2LogLikelihood), and Posterior (DensityProduct): the input parameter node feeds both the prior and likelihood nodes, whose outputs feed the posterior node.]

Fig. 5. Graphical description of Bayesian posterior modeling using the hIPPYlib-MUQ software framework (left) and an example code snippet (right). In the left figure, class names of MUQ and the interface are colored in red and blue, respectively. The MUQ WorkGraph class provides a way to combine all the Bayesian posterior model components via its member functions AddNode and AddEdge. The MUQ IdentityOperator class identifies the input parameters, and its input argument dim represents the parameter dimension. The MUQ DensityProduct class defines the product of the prior and likelihood densities, and its input argument 2 is the number of input densities.

These interface classes can then be used to form a Bayesian posterior model governed by PDEs using the MUQ graphical modeling interface (WorkGraph), as shown in Figure 5, as well as to construct MCMC proposals.
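To illustrate how such a graph is subsequently used for sampling, the following hedged sketch sets up a Metropolis-Hastings chain with a Laplace-approximation-based pCN proposal. The option names and constructor arguments are assumptions based on our reading of MUQ's SamplingAlgorithms module and the hIPPYlib-MUQ tutorial, not a verbatim excerpt of the library API.

import muq.SamplingAlgorithms as ms

# Hedged sketch (names are assumptions): `graph` is the WorkGraph from Figure 5,
# `la_posterior` an LAPosteriorGaussian wrapping the hIPPYlib Laplace approximation,
# and `m0` a starting vector, e.g. a sample from the Laplace approximation.
problem = ms.SamplingProblem(graph.CreateModPiece("Posterior"))

opts = {"NumSamples": 25000, "BurnIn": 0, "Beta": 0.4, "PrintLevel": 0}
proposal = ms.CrankNicolsonProposal(opts, problem, la_posterior)  # H-pCN-type proposal
kernel = ms.MHKernel(opts, problem, proposal)                     # Metropolis-Hastings kernel
sampler = ms.SingleChainMCMC(opts, [kernel])

samples = sampler.Run([m0])   # returns a sample collection for post-processing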


hIPPYlib-MUQ also implements the MCMC convergence diagnostics described in Section 3.2. These include the potential scale reduction factor and its extension to multivariate parameter cases [11], the autocorrelation function, and the effective sample size. A detailed description of all classes and functionalities of hIPPYlib-MUQ can also be found at https://hippylib2muq.readthedocs.io/en/latest/modules.html.

5 NUMERICAL ILLUSTRATION

The objective of this section is to showcase applications of the integrated software framework discussed in the previous sections via a step-by-step implementation procedure. We focus on comparing the performance of several MCMC methods available in the software framework. For the illustration, we first revisit the model problem considered in Villa et al. [65], an inverse problem of reconstructing the log-diffusion coefficient field in a two-dimensional elliptic partial differential equation. We then consider a nonlinear 𝑝-Poisson problem in three dimensions for which the forcing term of a natural boundary condition is inferred. In this section, we summarize the Bayesian formulation of the example problems and present numerical results obtained using the proposed software framework. The accompanying Jupyter notebook provides a detailed description of the hIPPYlib-MUQ implementations; see https://hippylib2muq.readthedocs.io/en/latest/tutorial.html.

5.1 Inferring the coefficient field in a two-dimensional Poisson PDE

We first consider the coefficient field inversion in a Poisson partial differential equation given pointwise noisy state measurements. We begin by describing the forward model setup and the quantity of interest (the log flux through the bottom surface), followed by the definition of the prior and the likelihood distributions. We next present the Laplace approximation of the posterior and apply several MCMC methods to characterize the posterior distribution, as well as the predictive posterior distribution of the scalar quantity of interest. The scalability of the proposed methods with respect to the parameter dimension is then assessed in a mesh refinement study. Finally, a comparison between curvature-informed and classical MCMC methods is shown for a different noise level and number of observation points.

5.1.1 Forward model. Let Ξ© βŠ‚ R𝑑 (𝑑 = 2, 3) be an open bounded domain with boundary πœ•Ξ© = πœ•Ξ©π· βˆͺ πœ•Ξ©π‘, πœ•Ξ©π· ∩ πœ•Ξ©π‘ = βˆ…. Given a realization of the uncertain parameter field π‘š, the state variable 𝑒 is governed by

βˆ’βˆ‡ Β· (𝑒^π‘š βˆ‡π‘’) = 𝑓   in Ξ©,
𝑒 = 𝑔   on πœ•Ξ©π·,    (33)
𝑒^π‘š βˆ‡π‘’ Β· n = β„Ž   on πœ•Ξ©π‘,

where 𝑓 is a volume source term, 𝑔 and β„Ž are the prescribed Dirichlet and Neumann boundary data, respectively, and n is the outward unit normal vector.

The weak form of (33) reads as follows: Find 𝑒 ∈ V𝑔 such that

βŸ¨π‘’^π‘š βˆ‡π‘’, βˆ‡π‘βŸ© = βŸ¨π‘“, π‘βŸ© + βŸ¨β„Ž, π‘βŸ©πœ•Ξ©π‘   βˆ€π‘ ∈ V0,    (34)

where

V𝑔 = {𝑣 ∈ 𝐻1(Ξ©) | 𝑣 = 𝑔 on πœ•Ξ©π·},   V0 = {𝑣 ∈ 𝐻1(Ξ©) | 𝑣 = 0 on πœ•Ξ©π·}.    (35)

Above, we denote the 𝐿2-inner product over Ξ© by ⟨·, ·⟩ and that over πœ•Ξ©π‘ by ⟨·, Β·βŸ©πœ•Ξ©π‘.


Fig. 6. Prior mean (leftmost) and three sample fields drawn from the prior distribution for the Poisson problem.

As a quantity of interest, the log of the normal flux through the bottom boundary πœ•Ξ©π‘ βŠ‚ πœ•Ξ©π· is considered. Specifically, we define the quantity of interest G(π‘š) as

G(π‘š) = ln { βˆ’ βˆ«πœ•Ξ©π‘ 𝑒^π‘š βˆ‡π‘’ Β· n 𝑑𝑠 }.    (36)

In this example we consider a unit square domain in RΒ² with no source term (𝑓 = 0), no normal flux (β„Ž = 0) on the left and right boundaries, and the Dirichlet condition imposed on the top boundary (𝑔 = 1) and the bottom boundary (𝑔 = 0).

For the spatial discretization, we use quadratic finite elements for the state variable (and also for the adjoint variable) and linear finite elements for the parameter variable. For the numerical results presented in Sections 5.1.5 and 5.1.7, the computational domain is discretized using a regular mesh with 2,048 triangular elements. This leads to 4,225 and 1,089 degrees of freedom for the state and parameter variables, respectively. For the scalability results presented in Section 5.1.6, the mesh is refined with up to four levels of uniform refinement, leading to 263,169 and 66,049 degrees of freedom for the state and parameter variables, respectively, on the finest level.
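The following is a minimal FEniCS/dolfin sketch of this setup, assuming the log-coefficient enters exponentially as in the weak form (34) and assuming the convention of expressing the forward problem through a variational form handle; the mesh size, variable names, and the handle name pde_varf are illustrative, and the hIPPYlib object that would consume the form is not shown.

import dolfin as dl

# Illustrative discretization of the Poisson forward model (33)-(34)
mesh = dl.UnitSquareMesh(32, 32)
Vh_u = dl.FunctionSpace(mesh, "Lagrange", 2)   # quadratic elements: state/adjoint
Vh_m = dl.FunctionSpace(mesh, "Lagrange", 1)   # linear elements: parameter

f = dl.Constant(0.0)                           # no volume source

def pde_varf(u, m, p):
    # weak form (34) with h = 0: <e^m grad u, grad p> - <f, p>
    return dl.exp(m) * dl.inner(dl.grad(u), dl.grad(p)) * dl.dx - f * p * dl.dx

# Dirichlet data: g = 1 on the top boundary, g = 0 on the bottom boundary
bc_top = dl.DirichletBC(Vh_u, dl.Constant(1.0), "on_boundary && near(x[1], 1.0)")
bc_bot = dl.DirichletBC(Vh_u, dl.Constant(0.0), "on_boundary && near(x[1], 0.0)")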

5.1.2 Prior model. As discussed in Section 2, we choose the prior to be a Gaussian distribution N(π‘špr, Cprior) with Cprior = A⁻², where A is a Laplacian-like operator given as

Aπ‘š = βˆ’π›Ύ βˆ‡ Β· (Θ βˆ‡π‘š) + π›Ώπ‘š   in Ξ©,
      Θ βˆ‡π‘š Β· n + π›½π‘š          on πœ•Ξ©.    (37)

Here, 𝛽 βˆβˆšπ›Ύπ›Ώ is the optimal Robin coefficient introduced to alleviate undesirable boundary

effects [22], and an anisotropic tensor Θ is of the form

Θ =

[\1 sin2 (𝛼) (\1 βˆ’ \2) sin(𝛼) cos𝛼

(\1 βˆ’ \2) sin(𝛼) cos𝛼 \2 cos2 (𝛼)

]. (38)

For this example we take 𝛾 = 0.1, 𝛿 = 0.5, 𝛽 =βˆšπ›Ύπ›Ώ/1.42, \1 = 2.0, \2 = 0.5 and 𝛼 = πœ‹/4. Figure 6

shows the prior meanπ‘špr and three samples from the prior distribution.

5.1.3 Observations with noise and the likelihood. We generate state observations at 𝑙 random locations uniformly distributed over [0.05, 0.95]Β² by solving the forward problem on the finest mesh with the true parameter field π‘štrue (here a sample from the prior is used) and then adding random Gaussian noise to the resulting state values; see Figure 7. The number of observations 𝑙 is set to 300 for the experiments in Sections 5.1.5 and 5.1.6, while 𝑙 = 60 for the comparison in Section 5.1.7. The vector of synthetic observations is given by

𝒅 = B𝑒 + 𝜼, (39)


Fig. 7. True parameter field (left) and the corresponding state field (right) for the Poisson problem. The locations of the observation points are marked as white squares in the right figure.

where B is a linear observation operator restricting the state solution to the 𝑙 observation points. The additive noise vector 𝜼 has mutually independent components that are normally distributed with zero mean and standard deviation 𝜎 = 0.005 (Sections 5.1.5 and 5.1.6) or 𝜎 = 0.1 (Section 5.1.7). The likelihood function is then given by

πœ‹like (𝒅 |π‘š) ∝ exp(βˆ’1

2 βˆ₯B 𝑒 (π‘š) βˆ’ 𝒅obsβˆ₯2πšͺβˆ’1noise

), (40)

where πšͺnoise = 𝜎2I.

5.1.4 Laplace approximation of the posterior. We next construct the Laplace approximation of the posterior, a Gaussian distribution N(π‘šMAP, H(π‘šMAP)⁻¹) with mean equal to the MAP point and covariance given by the inverse of the Hessian of the negative log-posterior evaluated at the MAP point. The MAP point is obtained by minimizing the negative log-posterior, i.e.,

min_{π‘š ∈ M} J(π‘š) := (1/2) β€–B𝑒(π‘š) βˆ’ 𝒅obsβ€–Β²_{πšͺnoise⁻¹} + (1/2) β€–π‘š βˆ’ π‘šprβ€–Β²_{Cprior⁻¹}.    (41)

We employ the inexact Newton-CG algorithm implemented in hIPPYlib to solve the above optimization problem. We refer the reader to Villa et al. [65] for a detailed description of the algorithm and the expressions for the gradient and Hessian actions of the negative log-posterior J(π‘š). As pointed out in Section 2, explicitly computing the Hessian is prohibitive for large-scale problems, as this entails solving two forward-like PDEs as many times as the number of parameters. To make the operations with the Hessian scalable with respect to the parameter dimension, we invoke a low-rank approximation of the data misfit part of the Hessian, retaining only the π‘Ÿ eigenvectors corresponding to the directions most significantly informed by the data [65].

Figure 8 shows the eigenspectrum of the prior-preconditioned data misfit Hessian. The double pass randomized algorithm provided by hIPPYlib, with an oversampling factor of 20, is used to accurately compute the dominant eigenpairs. We see that the eigenvalues fall below 1 after around the 60th eigenvalue, indicating that keeping 60 eigenpairs is sufficient for the low-rank approximation. Figure 8 also shows four eigenvectors, which, as expected, illustrate that eigenvectors corresponding to smaller eigenvalues display more fluctuations.
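To make the double pass idea concrete, the sketch below gives a generic randomized eigensolver for a symmetric operator applied matrix-free (cf. Halko et al. [30]). It is illustrative only: hIPPYlib's implementation solves the generalized eigenproblem and works with the prior-weighted inner product, and all names here are placeholders.

import numpy as np

def double_pass_randomized_eig(apply_H, n, k, p=20, rng=None):
    """Hedged sketch of a double pass randomized eigensolver.

    apply_H : callable returning H @ X for a symmetric n-by-n operator H
              (here, the prior-preconditioned data misfit Hessian).
    k, p    : number of requested eigenpairs and oversampling factor.
    """
    rng = np.random.default_rng() if rng is None else rng
    Omega = rng.standard_normal((n, k + p))
    Q, _ = np.linalg.qr(apply_H(Omega))      # first pass: approximate the range of H
    T = Q.T @ apply_H(Q)                     # second pass: small projected operator
    lam, S = np.linalg.eigh(T)
    idx = np.argsort(lam)[::-1][:k]          # keep the k largest eigenvalues
    return lam[idx], Q @ S[:, idx]           # approximate eigenvalues and eigenvectors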

In Figure 9, we depict the MAP point and three samples drawn from the Laplace approximation of the posterior.


Fig. 8. Logarithmic plot of the π‘Ÿ = 100 dominant eigenvalues of the prior-preconditioned data misfit Hessian and the eigenvectors corresponding to the 1st, 4th, 16th, and 64th largest eigenvalues for the Poisson problem.

Fig. 9. The MAP point (leftmost) and three sample fields drawn from the Laplace approximation of the posterior distribution for the Poisson problem.

5.1.5 Exploring the posterior using MCMC methods. In this section, we apply the advanced MCMC algorithms discussed in Section 3 to explore the posterior and compare their performance.

In particular, we consider pCN, MALA, ∞-MALA, DR, DILI, and their Hessian-informed counterparts. For each method, we simulate 20 independent MCMC chains, each with 25,000 samples, and hence draw a total of 500,000 samples from the posterior. A sample from the Laplace approximation of the posterior is chosen as the starting point for the chains.

Table 1 shows the convergence diagnostics and computational efficiency of the MCMC samples. MPSRF and ESS are computed from projections of the parameter samples along the first 25 dominant eigenvectors of the prior-preconditioned data misfit Hessian at the MAP point. Table 1 reports the minimum, maximum, and average ESS over all 25 projections.

The last column in Table 1 reports the number of forward and/or adjoint PDE solves required to draw a single independent sample (the average ESS is used). This quantity can be used to measure the sampling efficiency and to rank the methods in terms of computational efficiency. Under this metric, DILI-MAP is the most efficient method and requires only 202 PDE solves per effective sample. DR with H-∞-MALA (213 NPS/ES), DR with H-MALA (215 NPS/ES), and H-pCN (216 NPS/ES) are close seconds.


Table 1. Comparison of the performance of several MCMC methods for the Poisson problem: pCN, MALA, ∞-MALA, DR, DILI, and their Hessian-informed versions. Acceptance rate (AR), multivariate potential scale reduction factor (MPSRF), and effective sample size (ESS) are reported for convergence diagnostics. MPSRF and ESS are computed from projections of the parameter samples along the first 25 dominant eigenvectors of the prior-preconditioned data misfit Hessian at the MAP point. Two values of AR are listed for the DR and DILI methods, which are for the first and the second proposal moves, respectively. We also provide the number of forward and/or adjoint PDE solves per effective sample (NPS/ES) for sampling efficiency. We use 20 MCMC chains, each with 25,000 iterations (500,000 samples in total). The numbers in parentheses in each method name are the parameter values used (𝛽 for pCN, 𝜏 for MALA, β„Ž for ∞-MALA, and 𝛽 and 𝜏 for DILI). The number in parentheses next to the minimum and maximum ESS indicates the corresponding eigenvector index.

Method | AR (%) | MPSRF | Min. ESS (index) | Max. ESS (index) | Avg. ESS | NPS/ES
pCN (5.0E-3) | 24 | 2.629 | 25 (24) | 225 (8) | 84 | 5,952
MALA (6.0E-6) | 48 | 2.642 | 26 (22) | 874 (5) | 148 | 10,135
∞-MALA (1.0E-5) | 57 | 2.943 | 25 (23) | 1,102 (5) | 160 | 9,375
H-pCN (4.0E-1) | 27 | 1.192 | 64 (1) | 3,598 (15) | 2,314 | 216
H-MALA (6.0E-2) | 60 | 1.014 | 545 (1) | 8,868 (19) | 6,459 | 232
H-∞-MALA (1.0E-1) | 71 | 1.016 | 582 (1) | 8,417 (18) | 5,905 | 254
DR (H-pCN (1.0E0), H-MALA (6.0E-2)) | (4, 61) | 1.013 | 641 (1) | 12,522 (17) | 9,222 | 215
DR (H-pCN (1.0E0), H-∞-MALA (2.0E-1)) | (4, 48) | 1.011 | 613 (1) | 12,812 (17) | 9,141 | 213
DILI-PRIOR (0.8, 0.1) | (60, 33) | 1.064 | 314 (1) | 4,667 (13) | 3,216 | 548
DILI-LA (0.8, 0.1) | (83, 36) | 1.017 | 562 (1) | 10,882 (17) | 7,192 | 245
DILI-MAP (0.8, 0.1) | (77, 22) | 1.006 | 1,675 (1) | 10,271 (20) | 8,692 | 202

Fig. 10. Autocorrelation function estimate (30) of the quantity of interest G (36) for several MCMC methods (pCN, MALA, ∞-MALA, H-pCN, H-MALA, H-∞-MALA, DR with H-pCN/H-MALA, DR with H-pCN/H-∞-MALA, DILI-PRIOR, DILI-LA, and DILI-MAP).

We next assess the convergence of the MCMC samples of the quantity of interest G(π‘š) in (36) to the predictive posterior distribution of G(π‘š): the autocorrelation function estimates of the quantity of interest G (36) are shown in Figure 10 (here, we use formula (30) to account for the use of multiple chains), the trace plots from three independent MCMC chains are depicted in Figure 11, and histograms of all the MCMC samples (with the number of counts normalized) are shown in Figure 12. Lastly, we compare estimates of moments of the quantity of interest for the different sampling strategies.


Fig. 11. Trace plots of the quantity of interest G (36) from three MCMC chains (out of 20 independent chains). Different colors (here blue, green, and red) represent the traces from each chain.

Fig. 12. Probability density function estimate of the quantity of interest G (36) computed from several MCMC methods (one panel per method: pCN, MALA, ∞-MALA, H-pCN, H-MALA, H-∞-MALA, DR with H-pCN/H-MALA, DR with H-pCN/H-∞-MALA, DILI-PRIOR, DILI-LA, and DILI-MAP). All 500,000 samples, 20 chains with 25,000 samples each, are pooled together in the histogram. The number of counts is normalized so that the plot represents a probability density function.

For each MCMC chain, the π‘˜th (π‘˜ = 1, 2, 3) moment of the quantity of interest computed from the parameter samples m𝑖 (𝑖 = 1, 2, . . . , 𝑁; 𝑁 = 25,000) is computed as

Gπ‘˜ = (1/𝑁) βˆ‘_{𝑖=1}^{𝑁} Gπ‘˜(m𝑖).    (42)


Fig. 13. Box plots of the first, second, and third moment estimates (Gπ‘˜, π‘˜ = 1, 2, 3) of the quantity of interest (42) computed by using several MCMC methods. The central mark is the median; the lower and upper quartiles represent the 25th and 75th percentiles, respectively. Whiskers extend to the extreme data points that fall within the distance from the lower or upper quartiles to 1.5 times the interquartile range (the distance between the upper and lower quartiles); all other data points are plotted as outliers. The number of data points for each method is 20, the number of independent MCMC chains.

The results are reported in Figure 13 as box-and-whisker plots. From the results presented in this section, we draw the following conclusions:

β€’ The Hessian information at the MAP point plays an important role in enhancing the sampling performance of the MCMC methods. In fact, MCMC chains without the Hessian information did not converge over the entire length of the chain and remained localized around the starting point. Convergence was achieved only when the MCMC proposal exploited the Laplace approximation of the posterior, which incorporates the Hessian information.
β€’ DILI-MAP shows the best sampling efficiency in terms of the number of forward and/or adjoint PDE solves per effective sample. Note that the parameter values used in the MCMC methods (e.g., 𝛽 and/or 𝜏) were not optimal, and a different result may be obtained with different parameter values.


Fig. 14. Logarithmic plot of the π‘Ÿ = 100 dominant eigenvalues of the prior-preconditioned data misfit Hessian computed using four different meshes. The mesh is uniformly refined from the coarsest (mesh 1) to the finest (mesh 4).

Table 2. Acceptance rate (AR), multivariate potential scale reduction factor (MPSRF), and effective sample size (ESS) of the posterior samples generated by using the H-pCN method for different parameter dimensions. We use 𝛽 = 0.4 for the H-pCN method and draw in total 500,000 samples (20 MCMC chains, each with 25,000 iterations). MPSRF and ESS are computed from the projection of the samples along the first 25 dominant eigenvectors of the prior-preconditioned data misfit Hessian at the MAP point. The number in parentheses next to the minimum and maximum ESS indicates the corresponding eigenvector index.

Dimension (state, parameter) | AR (%) | MPSRF | Min. ESS (index) | Max. ESS (index) | Avg. ESS
(4,225, 1,089) | 27 | 1.192 | 64 (1) | 3,598 (15) | 2,314
(16,641, 4,225) | 24 | 1.333 | 63 (1) | 3,221 (18) | 1,830
(66,049, 16,641) | 23 | 1.075 | 209 (1) | 3,073 (11) | 1,940
(263,169, 66,049) | 22 | 1.117 | 102 (2) | 3,276 (15) | 1,767

We further study the performance of MCMC methods under different problem settings to provide more insight into the practical use of the hIPPYlib-MUQ framework.

5.1.6 Scalability of Hessian-informed pCN. Here we investigate the effect of the mesh resolution on the sampling performance. A curvature-aware MCMC method, H-pCN, is selected with 𝛽 = 0.4 for this test. The dimensions of the parameter and the state variables from the coarsest mesh (mesh 1) to the finest mesh (mesh 4) are (1,089, 4,225), (4,225, 16,641), (16,641, 66,049), and (66,049, 263,169), respectively.

We follow the same problem setting as before and use the same synthetic observations (obtained from the true parameter field generated on the finest mesh) for all levels. Figure 14 shows the π‘Ÿ = 100 dominant eigenvalues of the prior-preconditioned data misfit Hessian. One observes that the eigenspectrum is virtually independent of the mesh refinement.

To assess the convergence of the MCMC methods, in Table 2 we report the acceptance rate, MPSRF, and ESS of the posterior samples. The MPSRF and ESS are computed from projections of the parameter samples along the first 25 dominant eigenvectors of the prior-preconditioned data misfit Hessian at the MAP point, as discussed in Section 3.2.3. In Figure 15, we present the autocorrelation function estimates (30) and show histograms for the quantity of interest G (36). The results show


Fig. 15. Left: Autocorrelation function estimate (30) of the quantity of interest G (36). Right: Probability density function estimate of the quantity of interest G (36); all the samples, 20 chains with 25,000 samples each (500,000 in total), are pooled together in the histogram; the number of counts is normalized so that the plot represents a probability density function. We use the H-pCN method (𝛽 = 0.4) to draw samples. We consider four different meshes which are increasingly refined from the coarsest (mesh 1) to the finest (mesh 4).

Table 3. Acceptance rate (AR), multivariate potential scale reduction factor (MPSRF), and effective sample size (ESS) of the posterior samples generated by using the pCN and H-pCN methods for the larger noise case. MPSRF and ESS are computed from the projection of the samples along the first 5 dominant eigenvectors of the prior-preconditioned data misfit Hessian at the MAP point. We use 𝛽 = 0.2 for the pCN method and 𝛽 = 0.9 for the H-pCN method, respectively, and draw in total 500,000 samples (20 MCMC chains, each with 25,000 iterations). The number in parentheses next to the minimum and maximum ESS indicates the corresponding eigenvector index.

Method | AR (%) | MPSRF | Min. ESS (index) | Max. ESS (index) | Avg. ESS
pCN | 35 | 1.004 | 3,014 (5) | 16,100 (2) | 8,684
H-pCN | 61 | 1.006 | 890 (1) | 37,691 (5) | 11,401

that while the ESS decreases as the dimension increases, the convergence of the samples, as measured by the MPSRF and the autocorrelation function, is almost independent of the mesh resolution.

5.1.7 MCMC results with larger uncertainty. So far, we have considered inverse problems with a large number of observations (𝑙 = 300) and small noise (𝜎 = 0.005). In some cases, however, only a limited number of measurements is available, with larger noise, and one may expect the posterior to be less concentrated. In this section we extend our study to such a problem and report a comparison of two MCMC methods, the pCN and the H-pCN.

For consistency, for this study we use the same setup as in the first example, but with fewer observations (𝑙 = 60) and larger noise (𝜎 = 0.1). We summarize in Table 3 and Figure 16 the convergence diagnostics of the pCN and the H-pCN methods. The results reveal that increasing the uncertainty in the observations leads to an improved performance of both the pCN and the H-pCN methods. As expected, the Hessian-informed H-pCN still largely outperforms pCN both in terms


Fig. 16. Left: Autocorrelation function estimate (30) of the quantity of interest G (36). Right (top): Trace plots of the quantity of interest G (36) computed from the parameter samples of three independent chains. Right (bottom): Probability density function estimate of the quantity of interest G (36); all 500,000 samples (20 chains with 25,000 samples each) are pooled together in the histogram. We use the pCN (𝛽 = 0.2) and the H-pCN (𝛽 = 0.9) MCMC methods. These results are for the larger noise case.

of MPSRF and ESS; however, it is worth noting that, in this case, pCN is still able to adequately sample the posterior distribution.

5.2 Boundary condition inversion in a three-dimensional 𝑝-Poisson nonlinear PDE

In this second example, we consider a nonlinear PDE in three space dimensions for which we seek to infer unknown boundary data from pointwise uncertain state observations. Specifically, the forward governing equations are given by

βˆ’βˆ‡ Β·(|βˆ‡π‘’ |π‘βˆ’2

πœ– βˆ‡π‘’)= 𝑓 in Ξ©,

𝑒 = 𝑔 on πœ•Ξ©π· , (43)

|βˆ‡π‘’ |π‘βˆ’2πœ– βˆ‡π‘’ Β· n =π‘š on πœ•Ξ© \ πœ•Ξ©π· ,

with 1 ≀ 𝑝 ≀ ∞. Note that the 𝑝-Laplacian, βˆ‡ Β· (|βˆ‡π‘’|^(π‘βˆ’2) βˆ‡π‘’), is singular when 𝑝 < 2 and degenerate when 𝑝 > 2 at points where βˆ‡π‘’ = 0 [12, 36], so a regularization term πœ– (here we take πœ– = 1.0 Γ— 10⁻⁸) is introduced in the above equation via |βˆ‡π‘’|πœ– = √(|βˆ‡π‘’|Β² + πœ–). The 𝑝-Laplacian is a nonlinear counterpart of the Laplacian operator and appears in many nonlinear diffusion problems (e.g., non-Newtonian fluids), where the nonlinear diffusion is modeled by a power law.
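As a minimal illustration, the regularized weak form of (43) can be written in FEniCS/dolfin as sketched below, assuming 𝑝 = 3 and πœ– = 10⁻⁸ as in this example; the names (grad_norm_eps, pde_varf, ds_bottom) are illustrative, and the boundary measure over the bottom surface is assumed to be defined elsewhere.

import dolfin as dl

p_exp = 3.0        # p in the p-Laplacian (this example)
eps = 1.0e-8       # regularization parameter

def grad_norm_eps(u):
    # regularized gradient norm |grad u|_eps = sqrt(|grad u|^2 + eps)
    return dl.sqrt(dl.inner(dl.grad(u), dl.grad(u)) + dl.Constant(eps))

def pde_varf(u, m, q, ds_bottom):
    # weak form of (43) with f = 0: the Neumann datum m enters through the
    # bottom-surface boundary measure ds_bottom (assumed defined elsewhere)
    return (grad_norm_eps(u) ** (p_exp - 2.0) * dl.inner(dl.grad(u), dl.grad(q)) * dl.dx
            - m * q * ds_bottom)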

In this example, we consider a thin brick domain Ξ© = [0, 1]Β² Γ— [0, 0.05] with zero volume source term (𝑓 = 0) and we assume 𝑝 = 3. Homogeneous Dirichlet boundary conditions (𝑔 = 0) are prescribed on the lateral boundaries and no normal flux is applied on the top boundary surface. We aim to estimate the normal flux π‘š on the bottom boundary surface from state observations measured on the top boundary surface.

We discretize Ξ© using a regular tetrahedral grid and use linear finite elements for all of the state, adjoint, and parameter variables. The dimension of each variable after discretization is 233,289.


Fig. 17. Left: True parameter field (true normal flux on the bottom surface) of the 𝑝-Poisson problem. Middle: Corresponding state field and 𝑙 = 300 observation points (white square marks) on the top surface. Right: The MAP point.

Table 4. Convergence diagnostics for the 𝑝-Poisson problem: acceptance rate (AR), multivariate potential scale reduction factor (MPSRF), and effective sample size (ESS) of the projection of the parameter samples along the first 25 eigenvectors of the prior-preconditioned data misfit Hessian at the MAP point. We use the H-pCN method (𝛽 = 0.9) with 20 chains, each with 25,000 iterations (500,000 samples in total).

AR (%) | MPSRF | Min. ESS | Max. ESS | Avg. ESS
50 | 1.000 | 27,075 | 62,005 | 49,469

The prior is taken as a Gaussian with zero mean and Cprior = (βˆ’π›ΎΞ” + 𝛿𝐼)⁻² with the Robin boundary condition π›Ύβˆ‡π‘š Β· n + π›½π‘š imposed on πœ•Ξ©. Here we take 𝛾 = 1, 𝛿 = 1, and 𝛽 = 0.7. In particular, the value of 𝛽 was chosen following [22] to mitigate boundary artifacts in the prior marginal variance. Synthetic state observations are created at 𝑙 = 300 random locations uniformly distributed on the top surface by solving the forward problem with the true parameter field π‘štrue, generated by sampling the prior, and then adding Gaussian noise (here we take 𝜎 = 0.005 for the noise vector). Figure 17 illustrates the true parameter field on the bottom boundary, the locations of the observations on the top surface, and the MAP point obtained by solving the optimization problem of minimizing the negative log-posterior. The Laplace approximation of the posterior is then constructed based on the low-rank factorization of the data misfit Hessian at the MAP point. The spectrum of the prior-preconditioned data misfit Hessian indicates that the number of dominant eigenvalues (larger than 1) is about 50.

5.2.1 MCMC results for characterizing the posterior. We present MCMC sampling results for the uncertain boundary condition. In this example we only consider the H-pCN method with 𝛽 = 0.9 and run 20 independent MCMC chains, each with 25,000 iterations (500,000 samples are generated in total). For each MCMC run, a sample from the Laplace approximation of the posterior is taken as the starting point.

As before, we consider the quantity of interest defined by

G = βˆ«πœ•Ξ©π‘™ |βˆ‡π‘’|^(π‘βˆ’2) βˆ‡π‘’ Β· n 𝑑𝒙,    (44)

where πœ•Ξ©π‘™ is the lateral boundary surfaces. Note that the above quantity of interest is evaluatedfrom the state field 𝑒 which is the solution of the nonlinear forward problem (43) given a realizationof the parameter fieldπ‘š (the boundary condition on the bottom surface).Table 4 lists convergence diagnostics of the MCMC simulation. The parameter samples are

projected along the first 25 eigenvectors of the prior-preconditioned data misfit Hessian at the


Fig. 18. Left: Autocorrelation function estimate (30) of the quantity of interest G (44). Middle: Trace plots of the quantity of interest G (44) computed using parameter samples from three independent MCMC chains (colored in blue, green, and red). Right: Probability density function estimate of the quantity of interest G (44); all 500,000 samples are pooled together in the histogram. We use the H-pCN method (𝛽 = 0.9).

MAP point, and the MPSRF and the ESS are evaluated based on this projection. We also estimate the quantity of interest G (44) using the parameter samples and illustrate its autocorrelation function, trace plots (three independent MCMC chains), and histograms (all samples) in Figure 18, where it is observed that the MCMC chains mix well and reach stationarity.

6 CONCLUSION

We have presented a robust and scalable software framework for the solution of large-scale Bayesian inverse problems governed by PDEs. The software integrates two complementary open-source software libraries, hIPPYlib and MUQ, resulting in a unique software framework that addresses the prohibitive nature of the Bayesian solution of inverse problems governed by PDEs. The main objectives of the proposed software framework are to (1) provide to domain scientists a suite of sophisticated and computationally efficient MCMC methods that exploit Bayesian inverse problem structure; and (2) allow researchers to easily implement new methods and compare against the state of the art.

The integration of the two libraries allows advanced MCMC methods to exploit the geometry and intrinsic low-dimensionality of parameter space, leading to efficient and scalable exploration of the posterior distribution. In particular, the Laplace approximation of the posterior is employed to generate high-quality MCMC proposals. This approximation is based on the inverse of the Hessian of the log-posterior, made tractable via a low-rank approximation of the Hessian of the log-likelihood. Numerical experiments on linear and nonlinear PDE-based Bayesian inverse problems illustrate the ability of Laplace-based proposals to accelerate MCMC sampling by factors of ∼ 50Γ—.

Despite the fast and dimension-independent convergence of these advanced structure-exploiting MCMC methods, many Bayesian inverse problems governed by expensive-to-solve PDEs remain out of reach. For example, the results of Section 5.1.5 for the Poisson coefficient inverse problem indicate that O(10⁢) PDE solves may still be required even with the most efficient MCMC methods. In such cases, hIPPYlib-MUQ can be used as a prototyping environment to study new methods that further exploit problem structure, for example through the use of various reduced models (e.g., [20]) or via advanced Hessian approximations that go beyond low rank [2, 3].

Future versions of hIPPYlib-MUQ will feature parallel implementations of MCMC methods. The resulting multilevel parallelism (within PDE solves, and across MCMC chains) will allow the solution of even more complex PDE-based Bayesian inverse problems with higher-dimensional parameter spaces.


Software Availability. hIPPYlib-MUQ is distributed under the GNU General Public License version 3 (GPL3). The hIPPYlib-MUQ project is hosted on GitHub (https://github.com/hippylib/hippylib2muq) and uses Travis-CI for continuous integration. hIPPYlib-MUQ uses semantic versioning. The results presented in this work were obtained with hIPPYlib-MUQ version 0.2.0, hIPPYlib version 3.0.0, and MUQ version 0.3.5. A Docker image [41] containing the pre-installed software and examples is available at https://hub.docker.com/r/ktkimyu/hippylib2muq. hIPPYlib-MUQ documentation is hosted on ReadTheDocs (https://hippylib2muq.readthedocs.io). Users are encouraged to join the hIPPYlib and MUQ workspaces on Slack to connect with other users, get help, and discuss new features; see https://hippylib.github.io/#slack-channel and https://mituq.bitbucket.io for more information on how to join.

ACKNOWLEDGMENTS

This work was supported by the U.S. National Science Foundation, Software Infrastructure for Sustained Innovation (SI2: SSE & SSI) Program under grants ACI-1550593, ACI-1550547, and ACI-1550487, and the Division of Mathematical Sciences under CAREER grant 1654311. MP and YM were also supported in part by Office of Naval Research MURI grant N00014-20-1-2595. OG was also supported in part by Department of Energy Advanced Scientific Computing Research grants DE-SC0021239 and DE-SC0019303. The authors gratefully acknowledge computing time on the Multi-Environment Computer for Exploration and Discovery (MERCED) cluster at UC Merced, which was funded by National Science Foundation Grant No. ACI-1429783.

REFERENCES
[1] Volkan Akçelik, George Biros, Omar Ghattas, Judith Hill, David Keyes, and Bart van Bloeman Waanders. 2006. Parallel PDE-constrained optimization. In Parallel Processing for Scientific Computing, M. Heroux, P. Raghaven, and H. Simon (Eds.). SIAM.
[2] N. Alger, V. Rao, A. Meyers, T. Bui-Thanh, and O. Ghattas. 2019. Scalable matrix-free adaptive product-convolution approximation for locally translation-invariant operators. SIAM Journal on Scientific Computing 41, 4 (2019), A2296–A2328. https://arxiv.org/abs/1805.06018
[3] Ilona Ambartsumyan, Wajih Boukaram, Tan Bui-Thanh, Omar Ghattas, David Keyes, Georg Stadler, George Turkiyyah, and Stefano Zampini. 2020. Hierarchical Matrix Approximations of Hessians Arising in Inverse Problems Governed by PDEs. SIAM Journal on Scientific Computing 42, 5 (2020), A3397–A3426.
[4] Yves F. AtchadΓ©. 2006. An adaptive version for the Metropolis adjusted Langevin algorithm with a truncated drift. Methodology and Computing in Applied Probability 8 (2006), 235–254.
[5] Satish Balay, Shrirang Abhyankar, Mark F. Adams, Jed Brown, Peter Brune, Kris Buschelman, Lisandro Dalcin, Alp Dener, Victor Eijkhout, William D. Gropp, Dinesh Kaushik, Matthew G. Knepley, Dave A. May, Lois Curfman McInnes, Richard Tran Mills, Todd Munson, Karl Rupp, Patrick Sanan, Barry F. Smith, Stefano Zampini, and Hong Zhang. 2018. PETSc Web page. http://www.mcs.anl.gov/petsc
[6] Satish Balay, Shrirang Abhyankar, Mark F. Adams, Jed Brown, Peter Brune, Kris Buschelman, Victor Eijkhout, William D. Gropp, Dinesh Kaushik, Matthew G. Knepley, Lois Curfman McInnes, Karl Rupp, Barry F. Smith, and Hong Zhang. 2014. PETSc Web page. http://www.mcs.anl.gov/petsc
[7] Johnathan M Bardsley, Tiangang Cui, Youssef M Marzouk, and Zheng Wang. 2020. Scalable optimization-based sampling on function space. SIAM Journal on Scientific Computing 42, 2 (2020), A1317–A1347.
[8] E. B. Becker, G. F. Carey, and J. T. Oden. 1981. Finite Elements: An Introduction, Vol I. Prentice Hall, Englewood Cliffs, New Jersey.
[9] Alexandros Beskos, Mark Girolami, Shiwei Lan, Patrick E Farrell, and Andrew M Stuart. 2017. Geometric MCMC for infinite-dimensional inverse problems. J. Comput. Phys. 335 (2017), 327–351.
[10] Alfio Borzì and Volker Schulz. 2012. Computational Optimization of Systems Governed by Partial Differential Equations. SIAM.
[11] Stephen P Brooks and Andrew Gelman. 1998. General Methods for Monitoring Convergence of Iterative Simulations. Journal of Computational and Graphical Statistics 7, 4 (Dec. 1998), 434–455. https://doi.org/10.1080/10618600.1998.10474787
[12] Jed Brown. 2010. Efficient Nonlinear Solvers for Nodal High-Order Finite Elements in 3D. Journal of Scientific Computing 45, 1 (2010), 48–63. https://doi.org/10.1007/s10915-010-9396-8


[13] Tan Bui-Thanh, Carsten Burstedde, Omar Ghattas, James Martin, Georg Stadler, and Lucas C. Wilcox. 2012. Extreme-scale UQ for Bayesian inverse problems governed by PDEs. In SC12: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. Gordon Bell Prize finalist.
[14] T. Bui-Thanh, O. Ghattas, J. Martin, and G. Stadler. 2013. A computational framework for infinite-dimensional Bayesian inverse problems Part I: The linearized case, with application to global seismic inversion. SIAM Journal on Scientific Computing 35, 6 (2013), A2494–A2523.
[15] Ben Calderhead. 2014. A general construction for parallelizing Metropolis-Hastings algorithms. Proceedings of the National Academy of Sciences 111, 49 (2014), 17408–17413.
[16] George Casella and Edward I. George. 1992. Explaining the Gibbs sampler. The American Statistician 46, 3 (1992), 167–174.
[17] Patrick R Conrad and Youssef M Marzouk. 2013. Adaptive Smolyak pseudospectral approximations. SIAM Journal on Scientific Computing 35, 6 (2013), A2643–A2670. https://doi.org/10.1137/120890715
[18] S. L. Cotter, G. O. Roberts, A. M. Stuart, and D. White. 2012. MCMC methods for functions: modifying old algorithms to make them faster. (2012). Submitted.
[19] T. Cui, K.J.H. Law, and Y.M. Marzouk. 2016. Dimension-independent likelihood-informed MCMC. J. Comput. Phys. 304 (2016), 109–137.
[20] Tiangang Cui, Youssef Marzouk, and Karen Willcox. 2016. Scalable posterior approximations for large-scale Bayesian inverse problems via likelihood-informed parameter and state reduction. J. Comput. Phys. 315 (2016), 363–387.
[21] Tiangang Cui and Olivier Zahm. 2021. Data-free likelihood-informed dimension reduction of Bayesian inverse problems. Inverse Problems 37, 4 (2021), 045009.
[22] Yair Daon and Georg Stadler. 2018. Mitigating the Influence of Boundary Conditions on Covariance Operators Derived from Elliptic PDEs. Inverse Problems and Imaging 12, 5 (2018), 1083–1102. arXiv:1610.05280
[23] Tim J. Dodwell, Christian Ketelsen, Robert Scheichl, and Aretha L. Teckentrup. 2019. Multilevel Markov chain Monte Carlo. SIAM Rev. 61, 3 (2019), 509–545.
[24] M. Evans and T. Swartz. 2000. Approximating integrals via Monte Carlo and deterministic methods. Vol. 20. OUP Oxford.
[25] James M Flegal and Galin L Jones. 2010. Batch means and spectral variance estimators in Markov chain Monte Carlo. The Annals of Statistics 38, 2 (2010), 1034–1070.
[26] Andrew Gelman, John B Carlin, Hal S Stern, and Donald B Rubin. 2004. Bayesian data analysis.
[27] Mark Girolami and Ben Calderhead. 2011. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73, 2 (2011), 123–214.
[28] Gene H. Golub and Charles F. Van Loan. 1996. Matrix Computations (third ed.). Johns Hopkins University Press, Baltimore, MD.
[29] Heikki Haario, Eero Saksman, and Johanna Tamminen. 2001. An Adaptive Metropolis Algorithm. Bernoulli 7, 2 (Sep. 2001), 223–242. https://doi.org/10.2307/3318737
[30] Nathan Halko, Per Gunnar Martinsson, and Joel A. Tropp. 2011. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev. 53, 2 (2011), 217–288.
[31] Jouni Hartikainen and Simo SΓ€rkkΓ€. 2010. Kalman filtering and smoothing solutions to temporal Gaussian process regression models. In 2010 IEEE International Workshop on Machine Learning for Signal Processing. IEEE, 379–384. https://doi.org/10.1109/MLSP.2010.5589113
[32] W. Keith Hastings. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 1 (1970), 97–109.
[33] Tobin Isaac. 2015. Scalable, Adaptive Methods for Forward and Inverse Problems in Continental-Scale Ice Sheet Modeling. Ph.D. Dissertation. The University of Texas at Austin.
[34] Jari Kaipio and Erkki Somersalo. 2005. Statistical and Computational Inverse Problems. Applied Mathematical Sciences, Vol. 160. Springer-Verlag New York. https://doi.org/10.1007/b138659
[35] Finn Lindgren, HΓ₯vard Rue, and Johan LindstrΓΆm. 2011. An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73, 4 (2011), 423–498. https://doi.org/10.1111/j.1467-9868.2011.00777.x
[36] Peter Lindqvist. 2017. Notes on the p-Laplace equation. Number 161. University of JyvΓ€skylΓ€.
[37] Anders Logg, Kent-Andre Mardal, and Garth N. Wells (Eds.). 2012. Automated Solution of Differential Equations by the Finite Element Method. Lecture Notes in Computational Science and Engineering, Vol. 84. Springer. https://doi.org/10.1007/978-3-642-23099-8
[38] Tristan Marshall and Gareth Roberts. 2012. An Adaptive Approach to Langevin MCMC. Statistics and Computing 22, 5 (Sept. 2012), 1041–1057. https://doi.org/10.1007/s11222-011-9276-6
[39] James Martin, Lucas C Wilcox, Carsten Burstedde, and Omar Ghattas. 2012. A stochastic Newton MCMC method for large-scale statistical inverse problems with application to seismic inversion. SIAM Journal on Scientific Computing 34, 3 (2012), A1460–A1487.


[40] Youssef Marzouk, Tarek Moselhy, Matthew Parno, and Alessio Spantini. 2016. Sampling via Measure Transport: An Introduction. Springer International Publishing, 1–41. https://doi.org/10.1007/978-3-319-11259-6_23-1
[41] Dirk Merkel. 2014. Docker: Lightweight Linux Containers for Consistent Development and Deployment. Linux J. 2014, 239, Article 2 (2014). http://dl.acm.org/citation.cfm?id=2600239.2600241
[42] Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. 1953. Equation of State Calculations by Fast Computing Machines. The Journal of Chemical Physics 21, 6 (1953), 1087–1092. https://doi.org/10.1063/1.1699114
[43] Antonietta Mira et al. 2001. On Metropolis-Hastings algorithms with delayed rejection. Metron 59, 3-4 (2001), 231–241.
[44] R. M. Neal. 2010. Handbook of Markov Chain Monte Carlo. Chapman & Hall / CRC Press, Chapter: MCMC using Hamiltonian dynamics.
[45] Art B Owen. 2013. Monte Carlo theory, methods and examples. (2013).
[46] Matthew Parno, Andrew Davis, Patrick Conrad, and Y. M. Marzouk. 2014. MIT Uncertainty Quantification (MUQ) Library. https://muq.mit.edu
[47] Matthew D Parno and Youssef M Marzouk. 2018. Transport map accelerated Markov chain Monte Carlo. SIAM/ASA Journal on Uncertainty Quantification 6, 2 (2018), 645–682. https://doi.org/10.1137/17M1134640
[48] Noemi Petra, James Martin, Georg Stadler, and Omar Ghattas. 2014. A computational framework for infinite-dimensional Bayesian inverse problems: Part II. Stochastic Newton MCMC with application to ice sheet inverse problems. SIAM Journal on Scientific Computing 36, 4 (2014), A1525–A1555.
[49] Frank J Pinski, Gideon Simpson, Andrew M Stuart, and Hendrik Weber. 2015. Algorithms for Kullback–Leibler approximation of probability measures in infinite dimensions. SIAM Journal on Scientific Computing 37, 6 (2015), A2733–A2757.
[50] S. J. Press. 2003. Subjective and Objective Bayesian Statistics: Principles, Methods and Applications. Wiley, New York.
[51] Carl Edward Rasmussen and Christopher K. I. Williams. 2005. Gaussian Processes for Machine Learning. The MIT Press.
[52] Christian P. Robert and George Casella. 2005. Monte Carlo Statistical Methods (Springer Texts in Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA.
[53] Gareth O Roberts, Jeffrey S Rosenthal, et al. 2004. General state space Markov chains and MCMC algorithms. Probability Surveys 1 (2004), 20–71.
[54] Gareth O. Roberts and Osnat Stramer. 2003. Langevin Diffusions and Metropolis-Hastings Algorithms. Methodology and Computing in Applied Probability 4 (2003), 337–357.
[55] D. Rudolph and B. Sprungk. 2018. On a Generalization of the Preconditioned Crank-Nicolson Metropolis Algorithm. Foundations of Computational Mathematics 18, 2 (2018), 309–343.
[56] S. M. Stigler. 1986. Laplace's 1774 Memoir on Inverse Probability. Statist. Sci. 1, 3 (1986), 359–363. https://doi.org/10.1214/ss/1177013620
[57] G. Strang and G. J. Fix. 1988. An Analysis of the Finite Element Method. Wellesley-Cambridge Press, Wellesley, MA.
[58] Andrew M. Stuart. 2010. Inverse problems: A Bayesian perspective. Acta Numerica 19 (2010), 451–559. https://doi.org/10.1017/S0962492910000061
[59] Albert Tarantola. 2005. Inverse Problem Theory and Methods for Model Parameter Estimation. SIAM, Philadelphia, PA. xii+342 pages.
[60] L. Tierney and J. B. Kadane. 1986. Accurate Approximations for Posterior Moments and Marginal Densities. J. Amer. Statist. Assoc. 81, 393 (1986), 82–86. https://doi.org/10.1080/01621459.1986.10478240
[61] The Trilinos Project Team. 2020 (accessed May 22, 2020). The Trilinos Project Website. https://trilinos.github.io
[62] Fredi TrΓΆltzsch. 2010. Optimal Control of Partial Differential Equations: Theory, Methods and Applications. Graduate Studies in Mathematics, Vol. 112. American Mathematical Society.
[63] Dootika Vats, James M Flegal, and Galin L Jones. 2019. Multivariate output analysis for Markov chain Monte Carlo. Biometrika 106, 2 (2019), 321–337.
[64] Aki Vehtari, Andrew Gelman, Daniel Simpson, Bob Carpenter, and Paul-Christian BΓΌrkner. 2020. Rank-Normalization, Folding, and Localization: An Improved RΜ‚ for Assessing Convergence of MCMC (with Discussion). Bayesian Analysis 16, 2 (2020), 1–26. https://doi.org/10.1214/20-ba1221 arXiv:1903.08008v5
[65] Umberto Villa, Noemi Petra, and Omar Ghattas. 2021. hIPPYlib: An Extensible Software Framework for Large-Scale Inverse Problems Governed by PDEs: Part I: Deterministic Inversion and Linearized Bayesian Inference. ACM Trans. Math. Softw. 47, 2, Article 16 (April 2021), 34 pages. https://doi.org/10.1145/3428447
[66] David Williams. 1991. Probability with Martingales. Cambridge University Press.
[67] Ulli Wolff, Alpha Collaboration, et al. 2004. Monte Carlo errors with less errors. Computer Physics Communications 156, 2 (2004), 143–153.
[68] R. Wong. 2001. Asymptotic Approximations of Integrals. Society for Industrial and Applied Mathematics. https://doi.org/10.1137/1.9780898719260


[69] Olivier Zahm, Tiangang Cui, Kody Law, Alessio Spantini, and Youssef Marzouk. 2018. Certified dimension reduction in nonlinear Bayesian inverse problems. Preprint (2018).
