on non-negative unbiased estimators

On non-negative unbiased estimators

Pierre E. Jacob@ University of Oxford

& Alexandre H. Thiery@ National University of Singapore

Banff – March 2014

Pierre Jacob Non-negative unbiased estimators 1/ 28

Outline

1 Exact inference

2 Unbiased estimators and sign problem

3 Existence and non-existence results

4 Discussion


Exact inference

For a target probability distribution with unnormalised density π, anumerical method is “exact” if for any test function φ,∫

φ(θ)π(θ)dθ∫π(θ)dθ

can be approximated with arbitrary precision, at the cost of morecomputational effort.

⇒ In this sense MCMC is exact.


Exact inference

With exact methods, no systematic error.

No guarantees that for a fixed computational budget, exactmethods should be preferred over approximate methods.

Still important to know in which settings exact methods areavailable.


Exact inference using Metropolis-Hastings

Metropolis-Hastings algorithm.

1: Set some θ(1).2: for i = 2 to Nθ do3: Propose θ⋆ ∼ q(·|θ(i−1)).4: Compute the ratio:

α = min(

1,π(θ⋆)

π(θ(i−1))q(θ(i−1)|θ⋆)q(θ⋆|θ(i−1))

).

5: Set θ(i) = θ⋆ with probability α, otherwise set θ(i) = θ(i−1).6: end for


Exact inference with unbiased estimators

Pseudo-marginal Metropolis-Hastings algorithm.For each θ, we can sample Z (θ) with E(Z (θ)) = π(θ).

1: Set some θ(1) and sample Z (θ(1)).2: for i = 2 to Nθ do3: Propose θ⋆ ∼ q(·|θ(i−1)) and sample Z (θ⋆).4: Compute the ratio:

α = min(

1,Z (θ⋆)

Z (θ(i−1))q(θ(i−1)|θ⋆)q(θ⋆|θ(i−1))

).

5: Set θ(i) = θ⋆, Z (θ(i)) = Z (θ⋆) with probability α, otherwiseset θ(i) = θ(i−1), Z (θ(i)) = Z (θ(i−1)).

6: end for



Game-changer when one has access to efficient unbiasedestimators of the target density.

Especially in parameter inference for hidden Markov models,with particle MCMC methods based on particle filters toestimate the likelihood.

What if one doesn’t have access to straightforward unbiasedestimators? Are there general schemes to obtain thoseunbiased estimators?



Example: big dataObservations yi

iid∼ fθ for i = 1, . . . , n, and n is very large.Can we do exact inference without computing the full likelihoodevery time we try a new parameter value?

Unbiased estimator of the log-likelihood

ℓ(θ) = (n/m)m∑

i=1log f (yσi | θ)

for m < n and σi corresponding to some subsampling scheme.

It doesn’t directly provide an unbiased estimator of thelikelihood.



Doubly intractable distributionsPosterior density decomposable into

π(θ | y) = 1C (θ)

f (y; θ)p(θ).

One can typically get an unbiased estimator of C (θ) usingimportance sampling.

It doesn’t directly provide an unbiased estimator of 1/C (θ).



Reference priorsStarting from an arbitrary prior π⋆, define

fk(θ) = exp{∫

p(y1, . . . , yk | θ) log π⋆(θ | y1, . . . , yk)dy1 . . . dyk

}and the reference prior is, for any θ0 in the interior of Θ,

f (θ) = limk→∞

fk(θ)fk(θ0)

.

(Berger, Bernardo, Sun 2009.)Unbiased estimators of f (θ)?


Outline

1 Exact inference



4 Discussion


Unbiased estimators

Von Neuman & Ulam (∼ 1950), Kuti (∼ 1980), Rychlik (∼ 1990),McLeish, Rhee & Glynn (∼ 2010). . .

Removing the bias off consistent estimatorsIntroduce

a random variable S with E(S) = λ ∈ R,a sequence (Sn)n≥0 converging to S in L2,N be an integer valued random variable andwn = 1/P(N ≥ n) < ∞ for all n ≥ 0,

then

Y =N∑

n=0wn ×

(Sn − Sn−1

)has expectation E(Y ) = E(S) = λ.


Unbiased estimators

If ∞∑n=1

wn × E(

|S − Sn−1|2)

< ∞, (1)

then the variance of Y is finite.

Denote by tn the expected computing time to obtainSn − Sn−1.Then the computing time of Y , denoted by τ , shouldpreferably satisfy

E(τ) =∞∑

n=0w−1

n × tn < ∞. (2)

Success story in multi-level Monte Carlo.


Unbiased estimators

Even if the consistent estimators Sn are each almost-surelynon-negative, Y is not in general almost-surely non-negative:

Y =N∑

n=0wn ×

(Sn − Sn−1

),

unless we manage to construct ordered consistent estimators, ie:

P(Sn−1 ≤ Sn) = 1.

Direct implementation of the pseudo-marginal approach is difficultin the presence of possibly negative acceptance probabilities.


Dealing with negative values

One can still perform exact inference by noting∫φ(θ)π(θ)dθ∫

π(θ)dθ=∫

φ(θ)σ(π(θ))|π(θ)|dθ∫σ(π(θ))|π(θ)|dθ

which suggests using the absolute values of Z (θ) in the MHacceptance ratio.The integral is recovered using the importance samplingestimator: ∑N

i=1 σ(Z (θ(i)))φ(θ(i))∑Ni=1 σ(Z (θ(i)))

As an importance sampler, deteriorates with the dimension.Called the sign problem in lattice quantum chromodynamics.


Avoiding the sign problem

Can we avoid the sign problem by directly designingnon-negative unbiased estimators?

Given an unbiased estimator of λ > 0, can I generate anon-negative unbiased estimator of λ?

Let f be any function f : R → R+. Given an unbiasedestimator of λ ∈ R, can I generate a non-negative unbiasedestimator of f (λ)?


Outline

1 Exact inference



4 Discussion


f -factory

Let X be a subset of R and f : conv(X ) → R+ a function.

DefinitionAn X -algorithm A is an f -factory if, given as inputs

any i.i.d sequence X = (Xk)k≥1 with expectation λ ∈ R,an auxiliary random variable U ∼ Uniform(0, 1) independentof (Xk)k≥1,

then Y = A(U , X) is a non-negative unbiased estimator of f (λ).


X -algorithm

Let X be a subset of R.

DefinitionAn X -algorithm A is a pair

(T , φ

)where

T = (Tn)n≥0 is a sequence of Tn : (0, 1) × X n → {0, 1},φ = (φn)n≥0 is a sequence of φn : (0, 1) × X n → R+.

A ≡ (T , φ) takes u ∈ (0, 1) and x = (xi)i≥1 ∈ X ∞ as inputs andproduces as output

exit time: τ = τ(u, x) = inf{n ≥ 0 : Tn(u, x1, . . . , xn) = 1}exit value: A(u, x) = φτ (u, x1, . . . , xτ )

Set A(u, x) = ∞ if Tn never gives 1.


General non-existence of f -factories

TheoremFor any non constant function f : R → R+, no f -factory exists.

LemmaGiven i.i.d copies of an unbiased estimator of λ > 0 and a uniformrandom variable U , there is no algorithm producing a non-negativeunbiased estimator of λ.

The lemma is not directly implied by the theorem but the proof isvery similar.


Proof

For the sake of contradiction, introducea non-constant function f : R → R+, and λ1, λ2 ∈ R withf (λ1) > f (λ2),an f -factory (φ, T ).

Consider an i.i.d sequence X = (Xn)n≥1 with expectation λ1.Then

A(U , X) = φτX (U , X1, . . . , XτX )

has expectation f (λ1), and

τX = inf{n : Tn(U , X1, . . . , Xn) = 1}

is almost surely finite.


Proof

An f -factory should work for any input sequence.Introduce Bernoulli variables (Bn)n≥1, with P(Bn = 0) = ε and

Yn = Bn Xn + λ2 − λ1(1 − ε)ε

(1 − Bn)

so that E(Yn) = λ2.Then

A(U , Y ) = φτY (U , Y1, . . . , YτY )

has expectation f (λ2) < f (λ1), and

τY = inf{n : Tn(U , Y1, . . . , Yn) = 1}

is almost surely finite.


Proof

By construction we can tune the probability (1 − ε)n of

Mn = {(Y1, . . . , Yn) = (X1, . . . , Xn)},

by changing ε. On the events

{(Y1, . . . , Yn) = (X1, . . . , Xn)}

the algorithm has to “compensate”, so that

f (λ2) = E[A(U , Y )] < E[A(U , X)] = f (λ1).

But the algorithm cannot ouptut values lower than zero⇒ for ε small enough it leads to a contradiction.


Other cases

By putting more restrictions on X we get different results.

The case where X ⊂ R+ and f is decreasing also leads to anon-existence result.

The case where X ⊂ R+ and f is increasing allows someconstructions, for instance for real analytic functions.


Other cases

No full characterisation of increasing functions allowingf -factories for X ⊂ R+, yet.

The case where X = [a, b] is related to the Bernoulli factory.Necessary and sufficient condition: f continuous and thereexist n, m ∈ N and ε > 0 such that

∀x ∈ [a, b] f (x) ≥ ε min ((x − a)m , (b − x)n)


Case X = [a, b]

Assume

∀x ∈ [a, b] f (x) ≥ ε min ((x − a)m , (b − x)n) .

Introduceg : x 7→ f (x)/{(x − a)m(b − x)n}

bounded away from zero.Hence g can be approximated from below by polynomials.Introduce

P1(x) =∑

(i,j)∈I1

α(1)i,j (x − a)i(b − x)j

with non-negative coefficients, and such that P1(x) ≤ g(x).Then approximate g − P1 from below by P2, g − P1 − P2 by P3,etc.


Case X = [a, b]

We obtain a sum of polynomials∑n

k=0 Pk(x) converging to g(x)when n → ∞.We multiply by (x − a)m(b − x)n to estimate f (x) instead.This leads to a sequence of estimators

Sn =n∑

k=0

∑(i,j)∈Ik

a(k)i,j

{ i∏p=1

(Xp − a)i}{ j∏

q=1(b − Xi+q)j

}

for which P(Sn−1 ≤ Sn) = 1, yielding a non-negative unbiasedestimator of f (λ).


Outline

1 Exact inference



4 Discussion


Back to statistics

No f -factory for X = R and any non-constant f .⇒ without lower bounds on the log-likelihood estimator, nonon-negative unbiased likelihood estimators.

No f -factory for decreasing functions f and X = R+.⇒ without lower and upper bounds on the estimator of C (θ),no non-negative unbiased estimators of 1/C (θ).

For the reference prior, it seems hopeless unless X = [a, b].


Discussion

No answer for the case f “slowly” increasing and X ⊂ R+.

We only considered the transformation of an unbiasedestimator of λ to an unbiased estimator of f (λ).

Should we tolerate negative values and come up withappropriate methodology?

Should we aim for exact inference?

Thanks!


References

On non-negative unbiased estimators, Jacob, Thiery, 2014(arXiv)

Playing Russian Roulette with Intractable Likelihoods,Girolami, Lyne, Strathmann, Simpson, Atchade, 2013 (arXiv)

Computational complexity and fundamental limitations tofermionic quantum Monte Carlo simulations, Troyer, Wiese,2005 (Phys. rev. let. 94)

Unbiased Estimation with Square Root Convergence for SDEModels , Rhee, Glynn, 2013 (arXiv)


on non-negative unbiased estimators

Education