
Uncertainty Quantification in Ill-Posed Inverse Problems: Case Studies in the Physical Sciences

Mikael Kuusela
Department of Statistics and Data Science
Carnegie Mellon University

CERN-EP/IT Data Science Seminar
April 1, 2020

Motivation: Heat equation

Consider the heat equation on the real line:

∂u(x, t)/∂t = k ∂²u(x, t)/∂x²,  (x, t) ∈ R × [0, ∞),
u(x, 0) = f(x),  x ∈ R

Forward problem: Given the initial condition f(x) = u(x, 0), what is the temperature distribution g(x) = u(x, t*) at time t*?

Inverse problem: Given the temperature distribution g(x) = u(x, t*) at time t*, what was the initial distribution f(x) = u(x, 0) at time t = 0?

Motivation: Heat equation

The forward problem is easy:

u(x, t) = ∫_{−∞}^{∞} Φ_t(x − y) f(y) dy

where

Φ_t(x) = (1 / √(4πkt)) exp(−x² / (4kt))

So we have a well-defined integral operator K that maps the initial distribution f to the distribution g at time t*:

g(x) = u(x, t*) = ∫_{−∞}^{∞} Φ_{t*}(x − y) f(y) dy = (Kf)(x)

The inverse problem is to solve g = Kf for f. This is fundamentally difficult. Let's see why.
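To make the forward map concrete, here is a minimal numerical sketch (not from the talk; the grid, diffusivity k and time t* are illustrative choices): it discretizes K as a convolution matrix with the kernel Φ_t and shows that two visibly different initial distributions are mapped to nearly identical outputs, and that the matrix is severely ill-conditioned.

```python
import numpy as np

# Discretize the forward operator g = Kf on a uniform grid:
# (Kf)(x_i) ≈ sum_j Phi_t(x_i - x_j) f(x_j) dx
k, t_star = 1.0, 1.0                       # diffusivity and observation time (illustrative values)
x = np.linspace(-25, 15, 400)
dx = x[1] - x[0]

def heat_kernel(u, t):
    return np.exp(-u**2 / (4 * k * t)) / np.sqrt(4 * np.pi * k * t)

K = heat_kernel(x[:, None] - x[None, :], t_star) * dx    # discretized integral operator

# Two visibly different initial distributions: a Gaussian bump with and without a wiggle
f1 = np.exp(-0.5 * (x + 2)**2) / np.sqrt(2 * np.pi)
f2 = f1 + 0.05 * np.sin(np.pi * x) * np.exp(-0.5 * (x / 3)**2)

g1, g2 = K @ f1, K @ f2
print("max |f1 - f2|:", np.abs(f1 - f2).max())           # clearly different inputs
print("max |g1 - g2|:", np.abs(g1 - g2).max())           # nearly identical outputs
print("condition number of K:", np.linalg.cond(K))       # huge => inversion is ill-posed
```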

Motivation: Heat equation

[Figures: the initial distribution together with the temperature distribution at times t = 0.05, 0.2, 0.5, 1, 10 and 100 (one slide per time).]

Motivation: Heat equation

Now, consider the two initial distributions shown below and run them forward in time.

[Figure: two visibly different initial distributions.]

Motivation: Heat equation

[Figure: the same two initial distributions, each shown together with its distribution at time t = 1.]

The distributions at time t = 1 are almost indistinguishable!

If the distribution at time t = 1 is observed with even the tiniest amount of noise, then there is no way to distinguish between these two initial distributions based on the distribution at time t = 1

Ill-posed inverse problems

Ill-posed inverse problems are problems of the type

g = K(f),

where the mapping K is such that it can take inputs f1 and f2 that look very different into outputs g1 and g2 that are very similar

This means that, without further information, data collected in the g-space can only mildly constrain the solution in the f-space

Ill-posed inverse problems

A classical example is the Fredholm integral operator

g(x) = ∫ k(x, y) f(y) dy,

where k(x, y) is a problem-specific integration kernel

Also more complex operators, including nonlinear ones, arise in important practical applications

Ill-posed inverse problems in the physical sciences

Ill-posed inverse problems are ubiquitous in the physical sciences

Some illustrative examples:

Black hole image from the Event Horizon Telescope

Inversion of ice sheet flow parameters (Isaac et al., 2015)

Positron emission tomography of human brain

Case studies

Throughout this talk, I will specifically focus on the following two problems:

Large Hadron Collider

Unfolding of detector smearing in differential cross-section measurements

Orbiting Carbon Observatory-2

Space-based retrieval of atmospheric carbon dioxide concentration

The unfolding problem

Any differential cross section measurement is affected by the finite resolution of the particle detectors

This causes the observed spectrum of events to be “smeared” or “blurred” with respect to the true one

The unfolding problem is to estimate the true spectrum using the smeared observations

Ill-posed inverse problem with major methodological challenges

[Figure: true spectrum (“True intensity”) and smeared spectrum (“Smeared intensity”) as functions of the physical observable; folding maps the true spectrum to the smeared one, unfolding goes in the opposite direction.]

Problem formulation

Let f be the true, particle-level spectrum and g the smeared, detector-level spectrum

Denote the true space by T and the smeared space by S (both taken to be intervals on the real line for simplicity)

Mathematically f and g are the intensity functions of the underlying Poisson point process

The two spectra are related by

g(s) = ∫_T k(s, t) f(t) dt,

where the smearing kernel k represents the response of the detector and is given by

k(s, t) = p(Y = s | X = t, X observed) P(X observed | X = t),

where X is a true event and Y the corresponding smeared event

Task: Infer the true spectrum f given smeared observations from g

Discretization

Problem usually discretized using histograms (splines are also sometimes used)

Let {T_i}_{i=1}^p and {S_i}_{i=1}^n be binnings of the true space T and the smeared space S

Smeared histogram y = [y_1, ..., y_n]^T with mean

μ = [∫_{S_1} g(s) ds, ..., ∫_{S_n} g(s) ds]^T

Quantity of interest:

x = [∫_{T_1} f(t) dt, ..., ∫_{T_p} f(t) dt]^T

The mean histograms are related by μ = Kx, where the elements of the response matrix K are given by

K_{i,j} = (∫_{S_i} ∫_{T_j} k(s, t) f(t) dt ds) / (∫_{T_j} f(t) dt) = P(smeared event in bin i | true event in bin j)

The discretized statistical model becomes

y ∼ Poisson(Kx)

and we wish to make inferences about x under this model
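As a concrete illustration of this discretization (a sketch only: the binning, spectrum and smearing kernel below are illustrative assumptions, not the setup used later in the talk), one can build K by Monte Carlo for a Gaussian smearing kernel and simulate a smeared histogram y ∼ Poisson(Kx):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative true intensity on T = [-7, 7]: two Gaussian peaks plus a flat background
T_edges = np.linspace(-7, 7, 41)     # p = 40 true bins
S_edges = np.linspace(-7, 7, 41)     # n = 40 smeared bins
lam_tot = 20000

def f_true(t):
    return lam_tot * (0.4 * np.exp(-0.5 * (t + 2)**2) / np.sqrt(2 * np.pi)
                      + 0.4 * np.exp(-0.5 * (t - 2)**2) / np.sqrt(2 * np.pi)
                      + 0.2 / 14.0)

# Response matrix K_ij = P(smeared event in S_i | true event in T_j), estimated by Monte Carlo
# with Gaussian smearing Y = X + N(0, 1); events are drawn uniformly within each true bin,
# a simplification of the f-weighted definition above.
p, n = len(T_edges) - 1, len(S_edges) - 1
K = np.zeros((n, p))
for j in range(p):
    t = rng.uniform(T_edges[j], T_edges[j + 1], size=100_000)
    s = t + rng.standard_normal(t.size)
    K[:, j] = np.histogram(s, bins=S_edges)[0] / t.size   # events smeared outside S lower the column sum

# True bin contents x_j = integral of f over T_j (midpoint rule) and smeared data y ~ Poisson(Kx)
t_mid = 0.5 * (T_edges[:-1] + T_edges[1:])
x = f_true(t_mid) * np.diff(T_edges)
y = rng.poisson(K @ x)
```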

OCO-2: general model

General retrieval model: y = F(x) + ε, where

x ∈ R^p: state vector, F : R^p → R^n: forward model, ε ∈ R^n: instrument noise, y ∈ R^n: radiance observations

OCO-2: linearized surrogate model (Hobbs et al., 2017)

State vector x:
  CO2 profile (layer 1 to layer 20) [20 elements]
  surface pressure [1 element]
  surface albedo [6 elements]
  aerosols [12 elements]

Forward model F: linearized using the Jacobian K(x) = ∂F(x)/∂x

Noise ε: normal approximation of Poisson noise

Observations y: radiances in 3 near-infrared bands [1024 in each band]
  O2 A-band (around 0.76 microns)
  weak CO2 band (around 1.61 microns)
  strong CO2 band (around 2.06 microns)

Synthesis

Both HEP unfolding and CO2 remote sensing can be approximated reasonably well using the Gaussian linear model:

y = Kx + ε, ε ∼ N(0, Σ),

where y ∈ R^n are the observations, x ∈ R^p is the unknown quantity and K ∈ R^{n×p} is an ill-conditioned matrix

A few caveats:

May have n ≥ p or n < p

Even when n ≥ p, may have rank(K) < p

Usually have physical constraints for x, for example x ≥ 0

The fundamental problem, though, is that K has a huge condition number

Regularization

In both fields, the standard approaches¹ for handling the ill-posedness are very similar

Find a regularized solution by solving

x̂ = argmin_{x ∈ R^p} (y − Kx)^T Σ^{−1} (y − Kx) + δ‖L(x − x_a)‖²  (∗)

Here, the first term is a data-fit term and the second term penalizes physically implausible solutions

→ Balances the competing requirements of fitting the data and finding a well-behaved solution (bias-variance trade-off)

Two viewpoints, the same answer:

1. Frequentist: Penalized maximum likelihood

   x̂ = argmax_{x ∈ R^p} log L(x) − δP(x) ⇒ (∗)

2. Bayesian: Using prior x ∼ N(x_a, (1/δ)(L^T L)^{−1}), estimate x using the mean/maximum of the posterior p(x|y) ∝ p(y|x) p(x) ⇒ (∗)

¹ There are many variants of these ideas, but the fundamental issues remain the same.
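A minimal sketch of the penalized least-squares solution (∗) in the Gaussian linear model (assuming known Σ; the second-difference penalty, δ, x_a and all names are illustrative choices, not the settings used in either application):

```python
import numpy as np

def second_difference(p):
    """Discretized second-derivative operator L of shape (p-2, p)."""
    L = np.zeros((p - 2, p))
    for i in range(p - 2):
        L[i, i:i + 3] = [1.0, -2.0, 1.0]
    return L

def tikhonov(y, K, Sigma, delta, x_a, L):
    """Closed-form minimizer of (y - Kx)^T Sigma^{-1} (y - Kx) + delta * ||L(x - x_a)||^2."""
    Sinv = np.linalg.inv(Sigma)
    A = K.T @ Sinv @ K + delta * (L.T @ L)
    b = K.T @ Sinv @ (y - K @ x_a)
    return x_a + np.linalg.solve(A, b)

# Usage sketch with the simulated unfolding example above (delta would be chosen data-driven,
# e.g. by cross-validation):
# Sigma = np.diag(np.maximum(y, 1.0))   # Gaussian approximation to the Poisson noise
# x_hat = tikhonov(y, K, Sigma, delta=1e-3, x_a=np.zeros(K.shape[1]),
#                  L=second_difference(K.shape[1]))
```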

Regularized uncertainty quantification

For the rest of this talk, let's assume that we are interested in some functional (or potentially some collection of functionals) of x

Let θ = h^T x be a quantity of interest; for example

OCO-2: θ = XCO2 = weighted average of the CO2 profile

HEP: θ = e_i^T x = ith unfolded bin or θ = aggregate of several unfolded bins

In both frequentist and Bayesian approaches, a natural estimator of θ is θ̂ = h^T x̂

While the frequentist and Bayesian perspectives lead to the same point estimators θ̂, the associated uncertainties are different:

1. Frequentist: 1 − α variability intervals

   [θ_lo, θ_hi] = [θ̂ − z_{1−α/2} √var(θ̂), θ̂ + z_{1−α/2} √var(θ̂)]

2. Bayesian: 1 − α posterior credible intervals

   [θ_lo, θ_hi] = [θ̂ − z_{1−α/2} √var(θ|y), θ̂ + z_{1−α/2} √var(θ|y)]

HEP unfolding typically takes approach 1, while remote sensing uses 2
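Continuing the Gaussian linear model sketch from above (illustrative only, with the same assumed names; the prior corresponds to the penalty in (∗)), the two intervals can be computed side by side; they share the centre θ̂ but use different spreads:

```python
import numpy as np
from scipy.stats import norm

def regularized_intervals(y, K, Sigma, delta, L, x_a, h, alpha=0.05):
    """Point estimate of theta = h^T x plus the frequentist variability interval and the
    Bayesian credible interval, both centred at the same regularized estimator theta_hat."""
    Sinv = np.linalg.inv(Sigma)
    Fisher = K.T @ Sinv @ K
    M = np.linalg.inv(Fisher + delta * (L.T @ L))    # posterior covariance = regularized inverse
    x_hat = x_a + M @ (K.T @ Sinv @ (y - K @ x_a))   # posterior mean = penalized LS estimate
    theta_hat = h @ x_hat

    z = norm.ppf(1 - alpha / 2)
    sd_freq = np.sqrt(h @ (M @ Fisher @ M) @ h)      # sampling sd of theta_hat (ignores the bias)
    sd_bayes = np.sqrt(h @ M @ h)                    # posterior sd of theta given y
    return (theta_hat,
            (theta_hat - z * sd_freq, theta_hat + z * sd_freq),     # variability interval
            (theta_hat - z * sd_bayes, theta_hat + z * sd_bayes))   # credible interval
```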

Regularization and frequentist coverage

Let's assume that ultimately we would like to quantify the uncertainty of θ using a confidence interval with well-calibrated frequentist coverage

That is, for a given confidence level 1 − α, we are looking for a random interval [θ_lo, θ_hi] such that

P(θ ∈ [θ_lo, θ_hi]) ≈ 1 − α, for any x

When constructed based on a regularized point estimator x̂, both the frequentist variability interval and the Bayesian credible interval have trouble satisfying this criterion

There are always x's such that the regularized estimator x̂ is biased (i.e., E(x̂) ≠ x) and the estimator θ̂ inherits this bias

Variability intervals ignore the bias and only account for the variance
⇒ Undercoverage when large bias, correct coverage when small bias

Credible intervals are always wider than the variability intervals
⇒ Undercoverage when large bias, overcoverage when small bias

Regularization and frequentist coverage

When the noise is Gaussian with known covariance, it is possible to write down the coverage of these intervals in closed form:

1. Variability intervals

   P(θ ∈ [θ_lo, θ_hi]) = Φ( bias(θ̂)/√var(θ̂) + z_{1−α/2} ) − Φ( bias(θ̂)/√var(θ̂) − z_{1−α/2} )

2. Credible intervals

   P(θ ∈ [θ_lo, θ_hi]) = Φ( bias(θ̂)/√var(θ̂) + z_{1−α/2} √(var(θ|y)/var(θ̂)) ) − Φ( bias(θ̂)/√var(θ̂) − z_{1−α/2} √(var(θ|y)/var(θ̂)) )

Both of these are even functions of bias(θ̂) and maximized when bias(θ̂) = 0

The variability intervals have coverage 1 − α if and only if bias(θ̂) = 0; otherwise coverage < 1 − α

The credible intervals have coverage > 1 − α for bias(θ̂) = 0; coverage 1 − α for a certain value of |bias(θ̂)|; and coverage < 1 − α for large |bias(θ̂)|
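These closed-form expressions are straightforward to evaluate; a small sketch (the standardized-bias values and the variance ratio are illustrative, not numbers from the talk):

```python
import numpy as np
from scipy.stats import norm

def coverage_variability(std_bias, alpha=0.05):
    """Coverage of the variability interval, given bias(theta_hat) / sd(theta_hat)."""
    z = norm.ppf(1 - alpha / 2)
    return norm.cdf(std_bias + z) - norm.cdf(std_bias - z)

def coverage_credible(std_bias, ratio, alpha=0.05):
    """Coverage of the credible interval; ratio = sqrt(var(theta | y) / var(theta_hat))."""
    z = norm.ppf(1 - alpha / 2)
    return norm.cdf(std_bias + z * ratio) - norm.cdf(std_bias - z * ratio)

for b in [0.0, 0.5, 1.0, 2.0]:
    print(f"standardized bias {b}: variability {coverage_variability(b):.3f}, "
          f"credible {coverage_credible(b, ratio=1.3):.3f}")
```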

Unfolding: Simulation setup

[Figure: true and smeared intensities for the deconvolution simulation setup.]

f(t) = λ_tot { π_1 N(t | −2, 1) + π_2 N(t | 2, 1) + π_3 / |T| }

g(s) = ∫_T N(s − t | 0, 1) f(t) dt

f^MC(t) = λ_tot { π_1 N(t | −2, 1.1²) + π_2 N(t | 2, 0.9²) + π_3 / |T| }

Undercoverage in unfolding

[Figures: (a) coverage at the right peak as a function of the regularization strength for the SVD variant of Tikhonov regularization; (b) binwise coverage for SVD unfolding with weighted cross-validation.]

Coverage in SVD unfolding: as a function of the regularization strength (left) and for cross-validated regularization strength (right)

Bias and coverage for operational OCO-2 retrievals

[Figures: (a) bias distribution; (b) coverage distribution.]

Bias (left) and coverage (right) for operational OCO-2 retrievals for several realizations of the state vector x

Coverage for various unfolded confidence intervals

[Figure: empirical coverage of various unfolded confidence intervals: EB; HB with Pareto(1, 10⁻¹⁰), Pareto(1/2, 10⁻¹⁰), Gamma(0.001, 0.001) and Gamma(1, 0.001) hyperpriors; basic bootstrap; percentile bootstrap.]

Unregularized inversion?

At the end of the day, any regularization technique makes unverifiable assumptions about the true solution

If these assumptions are not satisfied, the uncertainties will be wrong

In the absence of oracle information about the true x, there does not seem to be any obvious way around this

So maybe we should reconsider whether explicit regularization is such a good idea to start with?

Instead of finding a regularized estimator of x, what if we simply used the unregularized maximum likelihood / least squares estimator

x̂ = argmin_{x ∈ R^p} (y − Kx)^T Σ^{−1} (y − Kx)

When K has full column rank, the solution is x̂ = (K^T Σ^{−1} K)^{−1} K^T Σ^{−1} y

This is unbiased (E(x̂) = x) and hence also the corresponding estimator θ̂ = h^T x̂ of the functional θ = h^T x is unbiased

Therefore, by the previous discussion, the variability interval

[θ_lo, θ_hi] = [θ̂ − z_{1−α/2} √var(θ̂), θ̂ + z_{1−α/2} √var(θ̂)]

has correct coverage 1 − α
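A minimal sketch of this unregularized route (assuming full column rank K and known Σ; function and variable names are mine): generalized least squares for x̂, then the exact variability interval for any functional θ = h^T x.

```python
import numpy as np
from scipy.stats import norm

def gls_functional_interval(y, K, Sigma, h, alpha=0.05):
    """Unregularized GLS estimate of x and a 1 - alpha variability interval for theta = h^T x."""
    Sinv = np.linalg.inv(Sigma)
    cov_x = np.linalg.inv(K.T @ Sinv @ K)    # covariance of the unbiased estimator x_hat
    x_hat = cov_x @ (K.T @ Sinv @ y)
    theta_hat = h @ x_hat
    se = np.sqrt(h @ cov_x @ h)
    z = norm.ppf(1 - alpha / 2)
    return theta_hat, (theta_hat - z * se, theta_hat + z * se)
```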

Implicit regularization

Of course, when K is ill-conditioned, the unregularized estimator x̂ will have a huge variance

But this does not mean that θ̂ = h^T x̂ needs to have a huge variance!

The mapping x ↦ θ = h^T x can act as an implicit regularizer resulting in a well-constrained interval [θ_lo, θ_hi] for the functional θ = h^T x

This is especially the case when the functional is a smoothing or averaging operation, for example:

Inference for wide unfolded bins (demo to follow)

XCO2 in OCO-2 (average CO2 over the atmospheric column)

Of course, there are also functionals that are more difficult to constrain (e.g., point evaluators θ = e_i^T x, derivatives, ...)

In those cases, the intervals [θ_lo, θ_hi] are wide, as they should be, since there is simply not enough information in the data y to constrain these functionals

Wide bin unfolding

In the case of unfolding, one functional we should be able to recover without explicit regularization is the integral of f over a wide unfolded bin:

H_j[f] = ∫_{T_j} f(t) dt, width of T_j large

But one cannot simply arbitrarily increase the particle-level bin size in the conventional approaches, since this increases the MC dependence of K

To circumvent this, it is possible to first unfold with fine bins (and no regularization) and then aggregate into wide bins

Let's see how this works using a similar deconvolution setup as before
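A sketch of this fine-bins-then-aggregate idea under the Gaussian approximation used earlier (the binning, grouping and names are illustrative): unfold the fine bins without regularization, then sum the estimates into wide bins while propagating the full covariance, so the bin-to-bin correlations are kept.

```python
import numpy as np
from scipy.stats import norm

def wide_bins_via_fine_bins(y, K, Sigma, groups, alpha=0.05):
    """Unregularized unfolding in fine bins, then aggregation into wide bins.

    groups: list of index arrays, one per wide bin, e.g. [[0, 1, 2, 3], [4, 5, 6, 7], ...].
    Returns wide-bin estimates and 1 - alpha intervals with full covariance propagation.
    """
    Sinv = np.linalg.inv(Sigma)
    cov_x = np.linalg.inv(K.T @ Sinv @ K)        # fine-bin covariance (no regularization)
    x_hat = cov_x @ (K.T @ Sinv @ y)

    A = np.zeros((len(groups), K.shape[1]))      # aggregation matrix: wide bin = sum of fine bins
    for i, idx in enumerate(groups):
        A[i, idx] = 1.0

    theta_hat = A @ x_hat
    cov_theta = A @ cov_x @ A.T                  # carries the bin-to-bin correlations
    half = norm.ppf(1 - alpha / 2) * np.sqrt(np.diag(cov_theta))
    return theta_hat, np.column_stack([theta_hat - half, theta_hat + half])
```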

Wide bins, standard approach, perturbed MC

[Figure: unfolded vs. true intensity (left) and binwise coverage (right) for wide bins with the standard approach and perturbed MC.]

The response matrix

K_{i,j} = (∫_{S_i} ∫_{T_j} k(s, t) f^MC(t) dt ds) / (∫_{T_j} f^MC(t) dt)

depends on f^MC

⇒ Undercoverage if f^MC ≠ f

Wide bins, standard approach, correct MC

[Figure: unfolded vs. true intensity (left) and binwise coverage (right) for wide bins with the standard approach and correct MC.]

If f^MC = f, coverage is correct

⇒ But this situation is unrealistic because f of course is unknown

Fine bins, standard approach, perturbed MC

[Figure: unfolded vs. true intensity (left) and binwise coverage (right) for fine bins with the standard approach and perturbed MC.]

With narrow bins, less dependence on f^MC so coverage is correct, but the intervals are very wide²

⇒ Let's aggregate these into wide bins, keeping track of the bin-to-bin correlations in the error propagation

² More unfolded realizations given in the backup.

Wide bins via fine bins, perturbed MC

[Figure: unfolded vs. true intensity (left) and binwise coverage (right) for wide bins obtained via fine bins with perturbed MC.]

Wide bins via fine bins gives both correct coverage and intervals with reasonable length³

³ More unfolded realizations given in the backup.

Relaxing the full rank assumption

This simple approach works as long as the forward model K has full column rank and there are no constraints that x needs to satisfy

The full rank requirement can be quite restrictive in practice, for example:

Unfolding with more true bins p than smeared bins n ⇒ K column-rank deficient

The linearized OCO-2 forward model has n ≫ p, but is nevertheless column-rank deficient

When K is column-rank deficient, it has a non-trivial null space ker(K)

Therefore, confidence intervals for θ = h^T x would need to be infinitely long if there are no constraints on x (assuming h not orthogonal to ker(K))

However, simple constraints such as x ≥ 0 or Ax ≤ b can be enough to make the intervals finite

And we would in any case like to make use of constraints, if available

Strict bounds: Motivation

So the question becomes:

Assuming model y = Kx + ε, where K need not have full column rank, how does one obtain a finite-sample 1 − α confidence interval for the linear functional θ = h^T x subject to the constraint Ax ≤ b?

In the following:

We assume that we have transformed the problem so that ε ∼ N(0, I)

We denote the noise-free data by µ = Kx

Strict bounds (e.g., Stark (1992))

[Diagram: the data space R^n contains the noise-free data µ = Kx and a confidence set Ξ for µ; pulling Ξ back through K gives D = K^{−1}(Ξ) in the x space R^p, which is intersected with the constraint set C; the functional h^T maps x to θ ∈ R.]

θ = h^T x,  θ_lo = min_{x ∈ C ∩ D} h^T x,  θ_hi = max_{x ∈ C ∩ D} h^T x

P(µ ∈ Ξ) ≥ 1 − α ⇒ P(x ∈ D) ≥ 1 − α ⇒ P(x ∈ C ∩ D) ≥ 1 − α ⇒ P(θ ∈ [θ_lo, θ_hi]) ≥ 1 − α

Strict bounds

If we construct the confidence set Ξ as

Ξ = {µ ∈ R^n : ‖y − µ‖² ≤ χ²_{n,1−α}},

then the endpoints of the confidence interval for θ are given by the solutions of the following two quadratic programs:

minimize h^T x
subject to ‖y − Kx‖² ≤ χ²_{n,1−α}
           Ax ≤ b

and

maximize h^T x
subject to ‖y − Kx‖² ≤ χ²_{n,1−α}
           Ax ≤ b

The resulting interval [θ_lo, θ_hi] has by construction coverage at least 1 − α
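In practice these two programs can be handed to a generic convex solver; the sketch below uses cvxpy (an implementation choice of mine, not something prescribed in the talk) with the χ² radius from above.

```python
import numpy as np
import cvxpy as cp
from scipy.stats import chi2

def strict_bounds(y, K, h, A, b, alpha=0.05):
    """Strict-bounds interval for theta = h^T x under Ax <= b, assuming eps ~ N(0, I)."""
    n, p = K.shape
    radius = chi2.ppf(1 - alpha, df=n)             # chi^2_{n, 1 - alpha}
    x = cp.Variable(p)
    constraints = [cp.sum_squares(y - K @ x) <= radius, A @ x <= b]
    theta_lo = cp.Problem(cp.Minimize(h @ x), constraints).solve()
    theta_hi = cp.Problem(cp.Maximize(h @ x), constraints).solve()
    return theta_lo, theta_hi
```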

Strict bounds

The main limitation of the previous construction is that there is slack in the last step:

P(x ∈ C ∩ D) ≥ 1 − α ⇒ P(θ ∈ [θ_lo, θ_hi]) ≥ 1 − α

Because C ∩ D is a simultaneous confidence set for x, this construction yields simultaneous confidence intervals for any arbitrarily large collection of functionals of x

This means that, if evaluated as a one-at-a-time interval, [θ_lo, θ_hi] from this construction is necessarily conservative (i.e., it has overcoverage)

One-at-a-time strict bounds

Denote s² = min_{Ax ≤ b} ‖y − Kx‖².

It has been conjectured (Rust and Burrus, 1972) that the following modification gives a better one-at-a-time interval:

minimize h^T x
subject to ‖y − Kx‖² ≤ z²_{1−α/2} + s²
           Ax ≤ b

and

maximize h^T x
subject to ‖y − Kx‖² ≤ z²_{1−α/2} + s²
           Ax ≤ b

The coverage of these intervals has been “proven” (Rust and O'Leary, 1994) and then “disproven” (Tenorio et al., 2007) so, as of now, there is limited theoretical understanding of their properties

We hope to ultimately prove the coverage, but for now we'll simply go ahead and empirically see how these perform in the CO2 retrieval problem with the constraint x_1, ..., x_{21} ≥ 0 and column-rank deficient K

Coverage for OCO-2 retrievals

[Figure: empirical coverage for OCO-2 retrievals.]

Coverage for OCO-2 retrievals

Table: Comparison of 95% Bayesian credible intervals (operational) and the one-at-a-time strict bounds intervals (proposed) for inference of XCO2 in OCO-2.

x realization | operational bias | operational coverage | operational length | proposed coverage | proposed avg. length | proposed length s.d.
 1 | 1.4173 | 0.7899 | 3.94 | 0.9515 | 11.20 | 0.29
 2 | 1.3707 | 0.8090 | 3.94 | 0.9511 | 11.20 | 0.28
 3 | 1.2986 | 0.8363 | 3.94 | 0.9510 | 11.20 | 0.29
 4 | 1.2357 | 0.8579 | 3.94 | 0.9515 | 11.20 | 0.28
 5 | 1.1590 | 0.8816 | 3.94 | 0.9513 | 11.20 | 0.28
 6 | 1.0747 | 0.9042 | 3.94 | 0.9512 | 11.21 | 0.27
 7 | 0.9721 | 0.9272 | 3.94 | 0.9515 | 11.20 | 0.29
 8 | 0.8420 | 0.9500 | 3.94 | 0.9513 | 11.19 | 0.31
 9 | 0.6477 | 0.9730 | 3.94 | 0.9508 | 11.19 | 0.32
10 | 0.0001 | 0.9959 | 3.94 | 0.9502 | 11.18 | 0.35

One-at-a-time strict bounds: Discussion

We have tried more complex constraints of the form Ax ≤ b and so far the one-at-a-time strict bounds have always had coverage at least 1 − α

The intervals have excellent empirical coverage but, as of now, we are missing a rigorous proof of coverage and/or optimality

The price to pay for good calibration is an increased interval length in comparison to regularization-based intervals

The OCO-2 problem has n ≫ p and rank(K) = p − 1

Do these intervals still work when p > n or rank(K) ≪ p?

There are a few things we can say about these intervals though:

They are variable length (as opposed to fixed length in traditional constructions); important since there are known optimality results for fixed length intervals (Donoho, 1994)

In the case of full column rank K and no constraints on x, they reduce to the classical fixed-length variability interval

Conclusions

Regularization works well for point estimation, but uncertainty quantification based on regularized estimators is very difficult

Intervals from both frequentist and Bayesian constructions tend to have poor frequentist calibration

It seems that there is a need for a major rethinking of the role of regularization

Any explicit regularization really means some amount of “cheating” in the uncertainties

We should probably avoid explicit regularization and instead provide uncertainties for functionals that implicitly regularize the problem

A simple example is to first unfold with narrow bins and then aggregate into wide bins

One-at-a-time strict bounds provide good calibration even in situations where classical intervals do not apply

Acknowledgments:

HEP unfolding: Joint work with Victor Panaretos, Philip Stark and Lyle Kim, with valuable input from members of the CMS Statistics Committee

CO2 retrieval: Joint work with Pratik Patil and Jonathan Hobbs, with valuable input from scientists at NASA Jet Propulsion Laboratory

STAMPS @ CMU

The Statistical Methods for the Physical Sciences (STAMPS) Focus Group at CMU develops statistical methods for analyzing large and complex datasets arising across the physical sciences.

Coordinated by: Ann Lee and Mikael Kuusela

Application areas include:

Common statistical themes: spatio-temporal data, ill-posed inverse problems, uncertainty quantification, high-dimensional data, non-Gaussian data, large-scale simulations, massive datasets, ...

See: http://stat.cmu.edu/stamps/

References I

T. Adye. Unfolding algorithms and tests using RooUnfold. In H. B. Prosper and L. Lyons, editors, Proceedings of the PHYSTAT 2011 Workshop on Statistical Issues Related to Discovery Claims in Search Experiments and Unfolding, CERN-2011-006, pages 313–318, CERN, Geneva, Switzerland, 17–20 January 2011.

V. Blobel. Unfolding methods in high-energy physics experiments. In CERN Yellow Report 85-09, pages 88–127, 1985.

V. Blobel. The run manual: Regularized unfolding for high-energy physics experiments. OPAL Technical Note TN361, 1996.

J. Bourbeau and Z. Hampel-Arias. PyUnfold: A Python package for iterative unfolding. The Journal of Open Source Software, 3(26):741, 2018.

A. Bozson, G. Cowan, and F. Spano. Unfolding with Gaussian processes, 2018.

G. Choudalakis. Fully Bayesian unfolding. arXiv:1201.4612v4 [physics.data-an], 2012.

G. D'Agostini. A multidimensional unfolding method based on Bayes' theorem. Nuclear Instruments and Methods A, 362:487–498, 1995.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1–38, 1977.

References II

D. L. Donoho. Statistical estimation and optimal recovery. The Annals of Statistics, 22(1):238–270, 1994.

P. J. Green and B. W. Silverman. Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. Chapman & Hall, 1994.

P. C. Hansen. Analysis of discrete ill-posed problems by means of the L-curve. SIAM Review, 34(4):561–580, 1992.

J. Hobbs, A. Braverman, N. Cressie, R. Granat, and M. Gunson. Simulation-based uncertainty quantification for estimating atmospheric CO2 from satellite data. SIAM/ASA Journal on Uncertainty Quantification, 5(1):956–985, 2017.

A. Hocker and V. Kartvelishvili. SVD approach to data unfolding. Nuclear Instruments and Methods in Physics Research A, 372:469–481, 1996.

T. Isaac, N. Petra, G. Stadler, and O. Ghattas. Scalable and efficient algorithms for the propagation of uncertainty from data through inference to prediction for large-scale problems, with application to flow of the Antarctic ice sheet. Journal of Computational Physics, 296:348–368, 2015.

M. Kuusela. Uncertainty quantification in unfolding elementary particle spectra at the Large Hadron Collider. PhD thesis, EPFL, 2016. Available online at: https://infoscience.epfl.ch/record/220015.

References III

M. Kuusela and V. M. Panaretos. Statistical unfolding of elementary particle spectra: Empirical Bayes estimation and bias-corrected uncertainty quantification. The Annals of Applied Statistics, 9(3):1671–1705, 2015.

M. Kuusela and P. B. Stark. Shape-constrained uncertainty quantification in unfolding steeply falling elementary particle spectra. The Annals of Applied Statistics, 11(3):1671–1710, 2017.

K. Lange and R. Carson. EM reconstruction algorithms for emission and transmission tomography. Journal of Computer Assisted Tomography, 8(2):306–316, 1984.

L. B. Lucy. An iterative technique for the rectification of observed distributions. Astronomical Journal, 79(6):745–754, 1974.

B. Malaescu. An iterative, dynamically stabilized (IDS) method of data unfolding. In H. B. Prosper and L. Lyons, editors, Proceedings of the PHYSTAT 2011 Workshop on Statistical Issues Related to Discovery Claims in Search Experiments and Unfolding, CERN-2011-006, pages 271–275, CERN, Geneva, Switzerland, 17–20 January 2011.

N. Milke, M. Doert, S. Klepser, D. Mazin, V. Blobel, and W. Rhode. Solving inverse problems with the unfolding program TRUEE: Examples in astroparticle physics. Nuclear Instruments and Methods in Physics Research A, 697:133–147, 2013.

References IV

W. H. Richardson. Bayesian-based iterative method of image restoration. Journal of the Optical Society of America, 62(1):55–59, 1972.

B. W. Rust and W. R. Burrus. Mathematical Programming and the Numerical Solution of Linear Equations. American Elsevier, 1972.

B. W. Rust and D. P. O'Leary. Confidence intervals for discrete approximations to ill-posed problems. Journal of Computational and Graphical Statistics, 3(1):67–96, 1994.

S. Schmitt. TUnfold, an algorithm for correcting migration effects in high energy physics. Journal of Instrumentation, 7:T10003, 2012.

L. A. Shepp and Y. Vardi. Maximum likelihood reconstruction for emission tomography. IEEE Transactions on Medical Imaging, 1(2):113–122, 1982.

P. B. Stark. Inference in infinite-dimensional inverse problems: Discretization and duality. Journal of Geophysical Research, 97(B10):14055–14082, 1992.

A. M. Stuart. Inverse problems: A Bayesian perspective. Acta Numerica, 19:451–559, 2010.

L. Tenorio, A. Fleck, and K. Moses. Confidence intervals for linear discrete inverse problems with a non-negativity constraint. Inverse Problems, 23:669–681, 2007.

References V

Y. Vardi, L. A. Shepp, and L. Kaufman. A statistical model for positron emission tomography. Journal of the American Statistical Association, 80(389):8–20, 1985.

E. Veklerov and J. Llacer. Stopping rule for the MLE algorithm based on statistical hypothesis testing. IEEE Transactions on Medical Imaging, 6(4):313–319, 1987.

I. Volobouev. On the expectation-maximization unfolding with smoothing. arXiv:1408.6500v2 [physics.data-an], 2015.

Backup


Current unfolding methods

Two main approaches:

1. Tikhonov regularization (i.e., SVD by Hocker and Kartvelishvili (1996) and TUnfold by Schmitt (2012)):

   min_{λ ∈ R^p} (y − Kλ)^T C^{−1} (y − Kλ) + δP(λ)

   with

   P_SVD(λ) = ‖L [λ_1/λ^MC_1, λ_2/λ^MC_2, ..., λ_p/λ^MC_p]^T‖²  or  P_TUnfold(λ) = ‖L(λ − λ^MC)‖²,

   where L is usually the discretized second derivative (also other choices possible)

2. Expectation-maximization iteration with early stopping (D'Agostini, 1995):

   λ^(t+1)_j = (λ^(t)_j / Σ_{i=1}^n K_{i,j}) Σ_{i=1}^n K_{i,j} y_i / (Σ_{k=1}^p K_{i,k} λ^(t)_k),  with λ^(0) = λ^MC

All these methods typically regularize by biasing towards a MC ansatz λ^MC

Regularization strength controlled by the choice of δ in Tikhonov or by the number of iterations in D'Agostini

Uncertainty quantification:

[λ_lo,i, λ_hi,i] = [λ̂_i − z_{1−α/2} √var(λ̂_i), λ̂_i + z_{1−α/2} √var(λ̂_i)],

with var(λ̂_i) estimated using error propagation or resampling

Regularized unfolding

Two popular approaches to regularized unfolding:

1 Tikhonov regularization (Hocker and Kartvelishvili, 1996; Schmitt, 2012)

2 Expectation-maximization iteration with early stopping (D'Agostini, 1995; Richardson, 1972; Lucy, 1974; Shepp and Vardi, 1982; Lange and Carson, 1984; Vardi et al., 1985)

Tikhonov regularization

Tikhonov regularization estimates λ by solving:

min_{λ ∈ R^p} (y − Kλ)^T C^{−1} (y − Kλ) + δP(λ)

The first term is a Gaussian approximation to the Poisson log-likelihood

The second term penalizes physically implausible solutions

Common penalty terms:

Norm: P(λ) = ‖λ‖²

Curvature: P(λ) = ‖Lλ‖², where L is a discretized 2nd derivative operator

SVD unfolding (Hocker and Kartvelishvili, 1996):

P(λ) = ‖L [λ_1/λ^MC_1, λ_2/λ^MC_2, ..., λ_p/λ^MC_p]^T‖²,

where λ^MC is a MC prediction for λ

TUnfold⁴ (Schmitt, 2012): P(λ) = ‖L(λ − λ^MC)‖²

⁴ TUnfold implements also more general penalty terms

D’Agostini iteration

Starting from some initial guess λ^(0) > 0, iterate

λ^(k+1)_j = (λ^(k)_j / Σ_{i=1}^n K_{i,j}) Σ_{i=1}^n K_{i,j} y_i / (Σ_{l=1}^p K_{i,l} λ^(k)_l)

Regularization by stopping the iteration before convergence:

λ̂ = λ^(K) for some small number of iterations K

Will bias the solution towards λ^(0)

Regularization strength controlled by the choice of K

In RooUnfold (Adye, 2011), λ^(0) = λ^MC

PyUnfold (Bourbeau and Hampel-Arias, 2018) implements free choice of λ^(0)
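A minimal NumPy sketch of this iteration (a generic EM implementation written for illustration, not RooUnfold's or PyUnfold's code); K is the response matrix, y the smeared histogram, and the iteration is stopped after a fixed, small number of steps:

```python
import numpy as np

def dagostini(y, K, lam0, n_iter=4):
    """D'Agostini / EM iteration for y ~ Poisson(K lam), stopped after n_iter iterations."""
    lam = np.asarray(lam0, dtype=float).copy()
    eff = K.sum(axis=0)                          # sum_i K_ij: efficiency of true bin j
    for _ in range(n_iter):
        mu = np.clip(K @ lam, 1e-300, None)      # current smeared-space prediction (guard against 0)
        lam = (lam / eff) * (K.T @ (y / mu))     # multiplicative EM update
    return lam

# Usage sketch with the simulated unfolding example above, starting from a flat initial guess:
# lam_hat = dagostini(y, K, lam0=np.full(K.shape[1], y.sum() / K.shape[1]), n_iter=4)
```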

D’Agostini iteration

λ^(k+1)_j = (λ^(k)_j / Σ_{i=1}^n K_{i,j}) Σ_{i=1}^n K_{i,j} y_i / (Σ_{l=1}^p K_{i,l} λ^(k)_l)

This iteration has been discovered in various fields, including optics (Richardson, 1972), astronomy (Lucy, 1974) and tomography (Shepp and Vardi, 1982; Lange and Carson, 1984; Vardi et al., 1985)

In particle physics, it was popularized by D'Agostini (1995) who called it “Bayesian” unfolding

But: This is in fact an expectation-maximization (EM) iteration (Dempster et al., 1977) for finding the maximum likelihood estimator of λ in the Poisson regression problem y ∼ Poisson(Kλ)

As k → ∞, λ^(k) → λ̂_MLE (Vardi et al., 1985)

This is a fully frequentist technique for finding the (regularized) MLE

The name “Bayesian” is an unfortunate misnomer

D'Agostini demo, k = 0, 100, 10000, 100000

[Figures, one slide per k: the smeared histogram with y, µ and Kλ^(k) (left) and the true histogram with λ and λ^(k) (right).]

Other methods

Bin-by-bin correction factors
  Attempts to unfold resolution effects by performing multiplicative efficiency corrections
  This method is simply wrong and must not be used

Fully Bayesian unfolding (FBU) (Choudalakis, 2012)
  Unfolding using Bayesian statistics where the prior regularizes the ill-posed problem
  Certain priors lead to solutions similar to Tikhonov, but with Bayesian credible intervals as the uncertainties
  Note: D'Agostini has nothing to do with proper Bayesian inference

Gaussian processes (Bozson et al., 2018; Stuart, 2010)
  Very closely related to Tikhonov regularization / penalized maximum likelihood / FBU
  Inherits many of the same limitations

RUN/TRUEE (Blobel, 1985, 1996; Milke et al., 2013)
  Penalized maximum likelihood with B-spline discretization

Shape-constrained unfolding (Kuusela and Stark, 2017)
  Correct-coverage simultaneous confidence intervals by imposing constraints on positivity, monotonicity and/or convexity

Expectation-maximization with smoothing (Volobouev, 2015)
  Adds a smoothing step to each iteration of D'Agostini and iterates until convergence

Iterative dynamically stabilized unfolding (Malaescu, 2011)
  Seems ad-hoc, with many free tuning parameters and unknown (at least to me) statistical properties
  I have not seen this used in CMS, but it seems to be quite common in ATLAS

...

Choice of the regularization strength

A key issue in unfolding is the choice of the regularization strength (δ in Tikhonov, # of iterations in D'Agostini)

The solution and especially the uncertainties depend heavily on this choice

This choice should be done using an objective data-driven criterion

In particular, one must not rely on the software defaults for the regularization strength (such as 4 iterations of D'Agostini in RooUnfold)

Many data-driven methods have been proposed:
1. (Weighted/generalized) cross-validation (e.g., Green and Silverman, 1994)
2. L-curve (Hansen, 1992)
3. Marginal maximum likelihood (MMLE; Kuusela and Panaretos (2015))
4. Goodness-of-fit test in the smeared space (Veklerov and Llacer, 1987)
5. Akaike information criterion (Volobouev, 2015)
6. Minimization of a global correlation coefficient (Schmitt, 2012)
7. Stein's unbiased risk estimate (SURE; new in TUnfold V17.9)
8. ...

Limited experience about the relative merits of these in typical unfolding problems

Note: All of these are designed for point estimation!

Not necessarily optimal for uncertainty quantification
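As one rough illustration of what such a data-driven criterion can look like in code, here is a heavily simplified version of the idea behind a goodness-of-fit rule in the smeared space (my own sketch with a crude degrees-of-freedom choice, not the Veklerov–Llacer rule or any of the implementations above): scan δ and keep the largest value whose Tikhonov fit is still compatible with the data.

```python
import numpy as np
from scipy.stats import chi2

def choose_delta_by_gof(y, K, Sigma, deltas, L, x_a, alpha=0.05):
    """Pick the largest delta whose smeared-space fit is not rejected by a chi-square test."""
    Sinv = np.linalg.inv(Sigma)
    threshold = chi2.ppf(1 - alpha, df=len(y))   # crude df choice; refinements are possible
    best = None
    for delta in sorted(deltas):
        M = np.linalg.inv(K.T @ Sinv @ K + delta * (L.T @ L))
        x_hat = x_a + M @ (K.T @ Sinv @ (y - K @ x_a))
        r = y - K @ x_hat
        if r @ Sinv @ r <= threshold:            # residual chi-square in the smeared space
            best = delta                         # keep the largest delta that still fits the data
    return best
```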

Tikhonov regularization, P(λ) = ‖λ‖², varying δ

[Figures: regularized solutions λ̂ ± SE[λ̂] compared to the true λ for δ = 10⁻⁵, 0.001, 0.01 and 1.]

Uncertainty quantification in unfolding

Aim: Find random intervals [λ_lo,i(y), λ_hi,i(y)], i = 1, ..., p, with coverage probability 1 − α:

P(λ_i ∈ [λ_lo,i(y), λ_hi,i(y)]) ≈ 1 − α

Most implementations quantify the uncertainty using the binwise variance (estimated using either error propagation or resampling):

[λ_lo,i, λ_hi,i] = [λ̂_i − z_{1−α/2} √var(λ̂_i), λ̂_i + z_{1−α/2} √var(λ̂_i)]

But: These intervals may suffer from significant undercoverage since they ignore the regularization bias

Undersmoothed unfolding

Standard methods for picking the regularization strength optimize the point estimation performance

These estimators have too much bias from the perspective of the variance-based uncertainties

One possible solution is to debias the estimator, i.e., to adjust the bias-variance trade-off to the direction of less bias and more variance

The simplest form of debiasing is to reduce δ from the cross-validation / L-curve / MMLE value until the intervals have close-to-nominal coverage

The challenge is to come up with a data-driven rule for deciding how much to undersmooth

I have been working with Lyle Kim to implement the data-driven methods from Kuusela (2016) as an extension of TUnfold

The code is available at: https://github.com/lylejkim/UndersmoothedUnfolding

If you're already working with TUnfold, then trying this approach requires adding only one extra line of code to your analysis

Unfolded histograms, λ^MC = 0

[Figures: unfolded histograms with the L-curve regularization strength, τ = √δ = 0.01186 (left), and with undersmoothing, τ = √δ = 0.00177 (right).]

Binwise coverage, λ^MC = 0

[Figures: binwise coverage over 1000 repetitions for the L-curve choice (ScanLcurve, left) and for undersmoothing (right).]

Coverage as a function of τ = √δ

[Figure: TUnfold, coverage at the peak bin (10000 repetitions) as a function of the regularization strength τ, with the undersmoothing and L-curve choices marked. Caption: Coverage at the right peak of a bimodal density.]

Interval lengths, λ^MC = 0

[Figure: comparison of log(interval length): LcurveScan mean = 66.017, median = 66.231; Undersmoothing mean = 238.09, median = 207.391; Undersmoothing (oracle) mean = 197.102, median = 197.36.]

Histograms, coverage and interval lengths when λ^MC ≠ 0

[Figures: binwise coverage over 1000 repetitions and unfolded histograms for ScanLcurve (average interval length 29.3723) and for Undersmoothing (average interval length 308.004).]

Coverage study from Kuusela (2016)

Method Coverage at t = 0 Mean length

BC (data) 0.932 (0.915, 0.947) 0.079 (0.077, 0.081)BC (oracle) 0.937 (0.920, 0.951) 0.064 (0.064, 0.064)US (data) 0.933 (0.916, 0.948) 0.091 (0.087, 0.095)US (oracle) 0.949 (0.933, 0.962) 0.070 (0.070, 0.070)MMLE 0.478 (0.447, 0.509) 0.030 (0.030, 0.030)MISE 0.359 (0.329, 0.390) 0.028Unregularized 0.952 (0.937, 0.964) 40316

BC = iterative bias-correctionUS = undersmoothingMMLE = choose δ to maximize the marginal likelihood

MISE = choose δ to minimize the mean integrated squared error

Mikael Kuusela (CMU) April 1, 2020 66 / 40

Fine bins, standard approach, perturbed MC, 4 realizations

[Figures: four realizations of the unfolded intensity vs. the truth and the corresponding binwise coverage; fine bins, standard approach, perturbed MC.]

Wide bins via fine bins, perturbed MC, 4 realizations

[Figures: four realizations of the unfolded intensity vs. the truth and the corresponding binwise coverage; wide bins via fine bins, perturbed MC.]