Semi-Parametric Importance Sampling for Rare-Event Probability Estimation
Z. I. Botev and P. L'Ecuyer
IMACS Seminar 2011
Borovets, Bulgaria
Semi-Parametric Importance Sampling for Rare-event probability Estimation – p.1/24
Outline
- Formulation of the importance sampling problem.
- Background on currently suggested adaptive importance sampling methods.
- The proposed methodology.
- A numerical example taken from recent work by Asmussen, Blanchet, Juneja, and Rojas-Nandayapa: tail probabilities of sums of correlated lognormals.
- Conclusions.
Problem formulation
The problem is to estimate high-dimensional integrals of the form

    ℓ = ∫ f(x) H(x) dx = E_f H(X),

where H : R^d → R and X is a d-dimensional random vector with pdf f.
For discrete counting or combinatorial problems we simply replace the integral by a sum.
How do we estimate such integrals via Monte Carlo?
Existing methods
Importance sampling: let g be an importance sampling density such that g(x) = 0 ⇒ H(x) f(x) = 0 for all x.
Generate X_1, ..., X_m iid from g; then an unbiased estimator of ℓ is

    ℓ̂ = (1/m) ∑_{k=1}^m Z_k,   with   Z_k = H(X_k) f(X_k) / g(X_k).

The minimum-variance importance sampling density is

    π(x) = H(x) f(x) / ℓ,   (H(x) > 0),

which depends on the unknown ℓ and is therefore not directly usable.
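As a concrete toy illustration of this estimator (our own example, not from the slides): estimate ℓ = P(X > 3) for X ∼ N(0, 1), i.e. H(x) = I{x > 3}, using the shifted normal g = N(3, 1) as importance sampling density. The problem, shift, and sample size are all illustrative assumptions.

```python
import math
import random

rng = random.Random(42)
m = 100_000

# Toy problem: ell = P(X > 3), X ~ N(0,1), so H(x) = I{x > 3} and f = N(0,1) pdf.
# IS density g = N(3, 1) puts most of its mass on the rare event.
# Likelihood ratio: f(x)/g(x) = exp(-x^2/2) / exp(-(x-3)^2/2) = exp(4.5 - 3x).
total = 0.0
for _ in range(m):
    x = rng.gauss(3.0, 1.0)                # X_k ~ g
    if x > 3.0:                            # H(X_k) = 1
        total += math.exp(4.5 - 3.0 * x)   # Z_k = H(X_k) f(X_k) / g(X_k)
ell_hat = total / m

exact = 0.5 * math.erfc(3.0 / math.sqrt(2.0))  # 1 - Phi(3), about 1.35e-3
print(ell_hat, exact)
```

With this g, the estimator achieves a relative error of well under one percent at m = 10^5, whereas crude Monte Carlo would average only about 135 hits in the same budget.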
Existing importance sampling ideas
Existing methods for selecting the density g assume that g belongs to a parametric family {g(·; η)}.
The objective is to select the parameter η so that g is as "close" to the optimal density π as possible.
The closeness between π and g is measured by the φ-divergence

    ∫ π(x) φ( g(x; η) / π(x) ) dx,    (1)

where φ : R_+ → R is twice continuously differentiable, with φ(1) = 0 and φ″(x) > 0 for all x > 0.
Variance Minimization method

Variance Minimization (VM) method:
- Equivalent to minimizing the φ-divergence with φ(z) = 1/z.
- The resulting minimization problem

      argmin_η ∫ π²(x) / g(x; η) dx

  is highly nonlinear (a drawback).
- The φ-divergence has to be estimated, so we face a noisy nonlinear optimization problem (another drawback).
Cross-Entropy method

Cross-Entropy (CE) method:
- Equivalent to minimizing the φ-divergence with φ(z) = −ln(z).
- The minimization of the Kullback-Leibler distance

      argmin_η −∫ π(x) ln( g(x; η) / π(x) ) dx

  is similar to likelihood maximization, so we frequently have analytical solutions to the optimization problem (an advantage).
- The Kullback-Leibler distance still has to be estimated, however (a drawback).
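To illustrate the "analytical solutions" point, here is a hedged sketch of multilevel CE (our own toy construction, not the authors' code) for the family g(·; η) = N(η, 1) and H(x) = I{x > 3}: for this family the CE update for η reduces to a likelihood-ratio-weighted mean of the elite samples. The level schedule, elite fraction, and iteration count are illustrative choices.

```python
import math
import random

rng = random.Random(0)
target = 3.0        # estimate ell = P(X > 3), X ~ N(0,1)
N, frac = 2000, 0.1 # samples per CE iteration, elite fraction

eta = 0.0           # start from the nominal density f = N(0,1)
for _ in range(6):
    xs = sorted(rng.gauss(eta, 1.0) for _ in range(N))
    level = min(target, xs[int((1 - frac) * N)])  # multilevel CE threshold
    elite = [x for x in xs if x >= level]         # nonempty by construction
    # Closed-form CE update for N(eta, 1): likelihood-ratio-weighted mean of
    # the elite samples, with w(x) = f(x)/g(x; eta) = exp(eta^2/2 - eta*x).
    wsum = wxsum = 0.0
    for x in elite:
        w = math.exp(0.5 * eta * eta - eta * x)
        wsum += w
        wxsum += w * x
    eta = wxsum / wsum

# Final IS run with the tuned density g = N(eta, 1)
m = 100_000
total = 0.0
for _ in range(m):
    x = rng.gauss(eta, 1.0)
    if x > target:
        total += math.exp(0.5 * eta * eta - eta * x)
ell_hat = total / m
print(eta, ell_hat)
```

The tuned η settles near E[X | X > 3] ≈ 3.28, and no nonlinear solver is needed at any step, which is exactly the appeal of CE over VM.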
Shortcomings
Both the Cross-Entropy and Variance Minimization methods suffer from the following drawbacks:
- Their performance is limited by how well a simple, rigid parametric density can approximate the optimal π. Ideally we would like a flexible nonparametric model for the importance sampling density, but this is often impossible.
- The VM method always requires nonlinear, non-convex optimization, and the Cross-Entropy method is, at best, only as simple as likelihood maximization.
MCMC methods for estimation
The Bayesian community has alternatives to parametric importance sampling that use MCMC:
- Chib's method
- Bridge sampling
- Path sampling
- Equi-energy sampling
- Gelfand-Dey method
The most popular and efficient is Chib's method and its variants. However, even Chib's approach suffers from the following drawbacks.
MCMC problems
- The estimators are biased, because the chain almost never starts in stationarity.
- Empirical and asymptotic error estimates are difficult to compute, because MCMC does not generate iid samples.
- Chib's estimator relies on the output of multiple different chains.
Is there a way to draw on the strengths of both MCMC and importance sampling?
Markov chain importance sampling
Like the Cross-Entropy and Variance Minimization methods, the MCIS method consists of two stages:
- a Markov chain (MC) stage, in which we construct an ergodic estimator π̂ of the minimum-variance importance sampling density π;
- an importance sampling (IS) stage, in which we use π̂ as an importance sampling density to estimate ℓ.
There are many different ways of constructing a model-free estimator of π, e.g., standard kernel density estimation (which has poor convergence). Here we explore only one, related to the Gibbs sampler.
MCIS with Gibbs
Suppose that we have used an MCMC sampler to generate the population

    X_1, ..., X_n approx∼ π(x),   X_i = (X_{i,1}, ..., X_{i,d}).

We can construct a nonparametric estimator of π using the Gibbs transition kernel:

    π̂(y) = (1/n) ∑_{i=1}^n κ(y | X_i),

    κ(y | x) := ∏_{j=1}^d π(y_j | y_1, ..., y_{j−1}, x_{j+1}, ..., x_d).
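To make the kernel-mixture construction concrete, here is a minimal sketch (our own toy, not from the slides): the target π is a bivariate normal with correlation 0.5, so the systematic-Gibbs kernel is κ(y | x) = π(y_1 | x_2) π(y_2 | y_1) with known normal conditionals, and exact draws stand in for the MCMC output.

```python
import math
import random

rng = random.Random(3)
RHO = 0.5   # toy target pi: bivariate normal, zero mean, unit variances, corr RHO
n = 4000

def phi(z, mean, var):
    """Univariate normal pdf."""
    return math.exp(-(z - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Population X_1..X_n drawn exactly from pi (a stand-in for the MCMC output)
pop = []
for _ in range(n):
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    pop.append((z1, RHO * z1 + math.sqrt(1 - RHO * RHO) * z2))

def kappa(y1, y2, x):
    # Systematic-Gibbs transition kernel for this pi:
    #   kappa(y | x) = pi(y1 | x2) * pi(y2 | y1),
    # with pi(y1 | x2) = N(RHO*x2, 1-RHO^2) and pi(y2 | y1) = N(RHO*y1, 1-RHO^2).
    v = 1 - RHO * RHO
    return phi(y1, RHO * x[1], v) * phi(y2, RHO * y1, v)

def pi_hat(y1, y2):
    """Mixture-of-kernels estimator: pi_hat(y) = (1/n) sum_i kappa(y | X_i)."""
    return sum(kappa(y1, y2, x) for x in pop) / n

exact_at_origin = 1.0 / (2 * math.pi * math.sqrt(1 - RHO * RHO))
print(pi_hat(0.0, 0.0), exact_at_origin)
```

Sampling from π̂ is equally simple: pick an index i uniformly, then run one Gibbs sweep started from X_i. Because the Gibbs kernel leaves π invariant, the mixture is an unbiased pointwise estimator of π(y) here.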
MCIS vs importance sampling
The MCIS estimator is

    ℓ̂ = (1/m) ∑_{k=1}^m |H(Y_k)| f(Y_k) / π̂(Y_k),

where Y_1, ..., Y_m iid∼ π̂. Advantages and disadvantages of the MCIS approach:
- P( lim_{n→∞} π̂(x) = π(x) ) = 1 for all x.
- No need to solve a φ-divergence optimization problem.
- If n is large, then evaluating π̂ may be costly.
- The MCIS estimator, unlike Chib's, is unbiased, and iid sampling allows for standard estimation error bands.
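A self-contained toy run of the full two-stage MCIS recipe (our own construction; the problem, parameters, and helper functions are illustrative assumptions, not the authors' code): estimate ℓ = P(X_1 + X_2 > γ) for X ∼ N(0, I), where the exact answer 1 − Φ(γ/√2) is available for checking.

```python
import math
import random
from statistics import NormalDist

rng = random.Random(11)
N01 = NormalDist()
GAMMA = 2.0   # target: ell = P(X1 + X2 > GAMMA), X ~ N(0, I)

def rtrunc(lower):
    """Draw Z ~ N(0,1) conditioned on Z > lower (inverse-CDF method)."""
    c = N01.cdf(lower)
    u = c + (1.0 - c) * rng.random()
    return N01.inv_cdf(min(u, 1.0 - 1e-12))

def dtrunc(z, lower):
    """Density of N(0,1) truncated to (lower, inf), evaluated at z."""
    return N01.pdf(z) / (1.0 - N01.cdf(lower)) if z > lower else 0.0

# MC stage: systematic Gibbs sampler for pi(x) ∝ phi(x1) phi(x2) I{x1+x2 > GAMMA}
x1 = x2 = GAMMA                 # start inside the rare-event region
chain = []
for t in range(400):
    x1 = rtrunc(GAMMA - x2)     # pi(x1 | x2): truncated standard normal
    x2 = rtrunc(GAMMA - x1)     # pi(x2 | x1)
    if t >= 300:                # discard burn-in
        chain.append((x1, x2))
n = len(chain)

def pi_hat(y1, y2):
    # (1/n) sum_i kappa(y | X_i), with kappa(y | x) = pi(y1 | x2) pi(y2 | y1)
    return sum(dtrunc(y1, GAMMA - xb) * dtrunc(y2, GAMMA - y1)
               for (_, xb) in chain) / n

# IS stage: draw Y ~ pi_hat, average f(Y) I{S(Y) > GAMMA} / pi_hat(Y)
m = 2000
total = 0.0
for _ in range(m):
    _, xb = chain[rng.randrange(n)]   # pick a mixture component uniformly
    y1 = rtrunc(GAMMA - xb)
    y2 = rtrunc(GAMMA - y1)           # guarantees y1 + y2 > GAMMA
    total += N01.pdf(y1) * N01.pdf(y2) / pi_hat(y1, y2)
ell_hat = total / m

exact = 1.0 - N01.cdf(GAMMA / math.sqrt(2.0))
print(ell_hat, exact)
```

The weights f(Y)/π̂(Y) concentrate near ℓ because π̂ is close to the optimal π, which is the source of the variance reduction; the cost is the n kernel evaluations per π̂ call noted above.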
Sums of correlated lognormals
Consider the estimation of the rare-event probability

    ℓ = P(e^{X_1} + ··· + e^{X_d} > γ) = ∫ f(x) I{S(x) > γ} dx,

where:
- X = (X_1, ..., X_d) and X ∼ N(µ, Σ);
- S(x) = e^{x_1} + ··· + e^{x_d};
- f is the density of N(µ, Σ), with associated precision matrix Λ = Σ^{−1};
- we use the notation Λ = (Λ_{i,j}) and Σ = (Σ_{i,j}).
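Before any variance reduction, a crude Monte Carlo sketch of this quantity (our own toy parameters: d = 2, correlation 0.5, γ = 10, not the slides' setup) also exhibits the deterministic sandwich P(max_i e^{X_i} > γ) ≤ ℓ ≤ P(max_i e^{X_i} > γ/2) that motivates the max-term decomposition used later.

```python
import math
import random

rng = random.Random(5)
RHO, GAMMA = 0.5, 10.0   # illustrative parameters, not the slides' setup
m = 50_000

hits_sum = hits_max = hits_half = 0
for _ in range(m):
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    x1 = z1
    x2 = RHO * z1 + math.sqrt(1 - RHO * RHO) * z2   # corr(X1, X2) = RHO
    t1, t2 = math.exp(x1), math.exp(x2)
    if t1 + t2 > GAMMA:            # S(X) > gamma
        hits_sum += 1
    if max(t1, t2) > GAMMA:        # dominant (max) term
        hits_max += 1
    if max(t1, t2) > GAMMA / 2:    # upper-bound event (valid for d = 2)
        hits_half += 1

ell_crude = hits_sum / m
print(hits_max / m, ell_crude, hits_half / m)
```

For the slides' γ = 5 × 10^5, ℓ is on the order of 10^−5, so crude Monte Carlo needs on the order of 1/ℓ draws per hit; this is exactly where the importance sampling machinery is required.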
Conditional densities needed for π
We need the conditional densities of the optimal importance sampling pdf π(y) = f(y) I{S(y) > γ}/ℓ:

    π(y_i | y_{−i}) ∝ f(y_i | y_{−i})                                        if ∑_{j≠i} e^{y_j} > γ,
    π(y_i | y_{−i}) ∝ f(y_i | y_{−i}) I{ y_i > ln( γ − ∑_{j≠i} e^{y_j} ) }   if ∑_{j≠i} e^{y_j} < γ,

where, by standard properties of the multivariate normal density, f(y_i | y_{−i}) is a normal density with mean

    µ_i + Λ_{i,i}^{−1} ∑_{j≠i} Λ_{i,j} (µ_j − y_j)   and variance   Λ_{i,i}^{−1}.
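The two-case conditional is straightforward to turn into a Gibbs update. A hedged sketch (a hypothetical d = 3 instance with our own µ, Λ, and γ; truncated-normal draws via the inverse CDF):

```python
import math
import random
from statistics import NormalDist

rng = random.Random(9)

# Hypothetical small instance (illustration only): d = 3, mu = 0,
# tridiagonal precision matrix Lam = Sigma^{-1}, threshold gamma.
mu = [0.0, 0.0, 0.0]
Lam = [[ 2.0, -0.5,  0.0],
       [-0.5,  2.0, -0.5],
       [ 0.0, -0.5,  2.0]]
gamma = 30.0

def gibbs_step(y, i):
    """Update coordinate i in place from pi(y_i | y_-i) as on the slide."""
    d = len(y)
    # f(y_i | y_-i) is normal with mean
    #   mu_i + Lam_ii^{-1} sum_{j != i} Lam_ij (mu_j - y_j)
    # and variance Lam_ii^{-1}.
    m = mu[i] + sum(Lam[i][j] * (mu[j] - y[j])
                    for j in range(d) if j != i) / Lam[i][i]
    sd = math.sqrt(1.0 / Lam[i][i])
    rest = sum(math.exp(y[j]) for j in range(d) if j != i)
    if rest > gamma:
        y[i] = rng.gauss(m, sd)          # constraint already satisfied
    else:
        # truncate to y_i > ln(gamma - rest), sampled via the inverse CDF
        nd = NormalDist(m, sd)
        c = nd.cdf(math.log(gamma - rest))
        u = c + (1.0 - c) * rng.random()
        y[i] = nd.inv_cdf(min(u, 1.0 - 1e-12))

# Systematic sweeps; every visited state stays inside {S(y) > gamma}
y = [math.log(gamma)] * 3
states = []
for _ in range(200):
    for i in range(3):
        gibbs_step(y, i)
    states.append(list(y))
```

By construction each update preserves the constraint S(y) > γ: either the other coordinates already exceed γ, or the truncation forces e^{y_i} > γ − rest.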
Numerical setup
We compare with the importance sampling vanishing relative error estimator (ISVE) and the cross-entropy vanishing relative error estimator (CEVE) of Asmussen, Blanchet, Juneja, and Rojas-Nandayapa, "Efficient simulation of tail probabilities of sums of correlated lognormals", Ann. Oper. Res. (2009).
Both of these estimators decompose the probability ℓ(γ) into two parts:

    ℓ(γ) = P( max_i e^{X_i} > γ ) + P( S(X) > γ, max_i e^{X_i} ≤ γ ).

The first (dominant) term and the second (residual) term are estimated by two different importance sampling estimators that ensure the strong efficiency of the sum.
Numerical setup I
The first term is asymptotically dominant in the sense that

    lim_{γ→∞} P( S(X) > γ, max_i e^{X_i} ≤ γ ) / P( max_i e^{X_i} > γ ) = 0.

We compute ℓ for various values of the common correlation coefficient

    ρ = Σ_{i,j} / √(Σ_{i,i} Σ_{j,j}),   i ≠ j.

Parameters: d = 10, µ_i = i − 10, σ_i² = i (i = 1, ..., d), γ = 5 × 10^5.
We used the MCIS estimator with m = 5 × 10^5 and a Markov chain sample of size n = 80 obtained using splitting.
Numerical results I
    ρ      MCIS est. ℓ̂        relative error %
                              MCIS     CEVE     ISVE
    0      1.795092 × 10^−5   0.0092   0.0063   0.0069
    0.4    1.8077 × 10^−5     0.093    0.17     0.23
    0.7    1.9014 × 10^−5     0.04     0.65     2.85
    0.9    2.0735 × 10^−5     0.068    0.63     2.80
    0.93   2.0997 × 10^−5     0.17     3.5      4.59
    0.95   2.1412 × 10^−5     0.11     6        8.09
    0.99   2.1882 × 10^−5     0.29     15       4.35
Numerical setup II
Both the ISVE and CEVE estimators are strongly efficient, so we expect that these estimators will eventually outperform the MCIS estimator.
This intuition is confirmed in the next table, which describes 14 cases with increasing values of the threshold γ.
We use the same algorithmic and problem parameters, except that ρ = 0.9 in all cases and the threshold depends on the case number c according to the formula

    γ = 5 × 10^{c+3},   c = 1, ..., 14.
Numerical results II

    case c   ISVE estimate     relative error %         efficiency
                               MCIS    ISVE             MCIS   ISVE
    1        3.9865 × 10^−4    0.049   1.6              25     1500
    2        2.0802 × 10^−5    0.067   4.3              75     10000
    3        6.4385 × 10^−7    0.077   5.5              64     22000
    4        1.2039 × 10^−8    0.041   3.0              21     5000
    5        1.3468 × 10^−10   0.034   6.1              12     20000
    6        8.9791 × 10^−13   0.036   7.6              17     30000
    7        3.5899 × 10^−15   0.043   0.014            27     0.11
    8        8.5302 × 10^−18   0.079   0.029            81     0.50
    9        1.2082 × 10^−20   0.025   0.024            7      0.32
    10       1.0148 × 10^−23   0.042   0.00064          28     0.00021
    11       5.0428 × 10^−27   0.042   0.00030          27     4.6 × 10^−5
    12       1.4898 × 10^−30   0.024   0.00014          7      1.0 × 10^−5
    13       2.5961 × 10^−34   0.023   3.87 × 10^−15    8.2    7.7 × 10^−27
    14       2.6754 × 10^−38   0.012   4.22 × 10^−13    2.6    9.1 × 10^−23
L2 properties
Assume that:
- X_1, ..., X_n iid∼ κ_{t−1}(x | θ), where κ_{t−1} is the (t−1)-step transition density;
- κ is the transition density of the systematic Gibbs sampler;
- ∫∫ κ²(y | x) π(x)/π(y) dy dx < ∞ (warning: this condition can be difficult to verify);
- all states of the chain can be accessed in a single step of the chain (irreducibility condition).
Then we have the following result for ℓ̂ = f(Y) H(Y) / π̂(Y).
L2 theoretical properties
We have the following bound on the Neyman χ² distance:

    E[ (ℓ̂ − ℓ)² / (ℓ ℓ̂) ]
        = ( −1 + ∫ κ_t²(y | θ) / π(y) dy )                        [χ² component]
          + (1/n) ∫ ( E κ²(y | X) − κ_t²(y | θ) ) / π(y) dy       [variance component]
        ≤ V(θ) e^{−rt} + O(1/n),

where V is a positive Liapunov function and r > 0 is a constant (the geometric rate of convergence of the Gibbs sampler).
Note that the above is NOT the relative error, because the denominator contains ℓ̂ instead of ℓ.
Conclusions
- We have presented a new method for integration that combines MCMC with importance sampling.
- We argue that the method is preferable to methods based purely on MCMC or purely on importance sampling.
- The method dispenses with the traditional parametric modeling used in the design of importance sampling densities.
- Unlike MCMC-based methods, the proposed method provides unbiased estimators, and estimation of relative error and confidence bands is straightforward.
Some interesting links
- Can we use large deviations theory to sample exactly from the minimum-variance density, instead of using MCMC?
- The idea is related to empirical likelihood and Rao-Blackwellization in classical statistics: essentially, we construct an estimator based on empirical likelihood arguments.

Thank you for your attention.