Quasi-Monte Carlo Sampling to Improve the Efficiency of Monte Carlo EM

Wolfgang Jank
Department of Decision and Information Technologies
University of Maryland, College Park, MD 20742-1815
November 17, 2003
Abstract
In this paper we investigate an efficient implementation of the Monte Carlo EM al-
gorithm based on Quasi-Monte Carlo sampling. The Monte Carlo EM algorithm is a
stochastic version of the deterministic EM (Expectation-Maximization) algorithm in
which an intractable E-step is replaced by a Monte Carlo approximation. Quasi-Monte
Carlo methods produce deterministic sequences of points that can significantly improve
the accuracy of Monte Carlo approximations over purely random sampling. One draw-
back to deterministic Quasi-Monte Carlo methods is that it is generally difficult to
determine the magnitude of the approximation error. However, in order to implement
the Monte Carlo EM algorithm in an automated way, the ability to measure this error
is fundamental. Recent developments of randomized Quasi-Monte Carlo methods can
overcome this drawback. We investigate the implementation of an automated, data-
driven Monte Carlo EM algorithm based on randomized Quasi-Monte Carlo methods.
We apply this algorithm to a geostatistical model of online purchases and find that it
can significantly decrease the total simulation effort, thus showing great potential for
improving upon the efficiency of the classical Monte Carlo EM algorithm.
Key words and phrases: Monte Carlo error; low-discrepancy sequence; Halton sequence; EM algo-
rithm; geostatistical model.
1 Introduction
The Expectation-Maximization (EM) algorithm (Dempster et al., 1977) is a popular tool in statis-
tics and many other fields. One limitation to the use of EM is, however, that quite often the
E-step of the algorithm involves an analytically intractable, sometimes high dimensional integral.
Hobert (2000), for example, considers a model for which the E-step involves intractable integrals of
dimension twenty. The Monte Carlo EM (MCEM) algorithm, proposed by Wei & Tanner (1990),
estimates this intractable integral with an empirical average based on simulated data. Typically, the
simulated data is obtained by producing random draws from the distribution commanded by EM.
By the law of large numbers, this integral-estimate can be made arbitrarily accurate by increasing
the size of the simulated data. The MCEM algorithm typically requires a very high accuracy,
especially at the later iterations. Booth & Hobert (1999), for example, report sample sizes of over
66,000 at convergence. This suggests that the overall efficiency of MCEM could be improved by
using simulation methods that achieve a high accuracy in the integral-estimate with smaller sample
sizes.
Recent research has provided evidence that entirely random draws do not necessarily result in
the most efficient use of the simulated data. In particular, one criticism of random draws is that
they often do not explore the sample space well (Morokoff & Caflisch, 1995; Caflisch et al., 1997).
For instance, points drawn at random tend to form clusters, which leads to gaps where the sample
space is not explored at all (see Figure 1 for an illustration). This criticism has led to the development
of a variety of deterministic methods that provide a better spread of the sample points. These
deterministic methods are often classified as Quasi-Monte Carlo (QMC) methods. Theoretical as
well as empirical research has shown that QMC methods can significantly increase the accuracy of
the integral-estimate over random draws.
Figure 1 about here
In this paper we investigate an implementation of the MCEM algorithm based on QMC methods.
Wei & Tanner (1990) point out that for an efficient implementation, the size of the simulated data
should be chosen small at the initial stage but increased successively as the algorithm moves along.
Early versions of the method require a manual, user-determined increase of the sample size, for
instance, by fixing the amount of data to be simulated in each iteration before the start
of the algorithm (e.g. McCulloch, 1997). Implementations of MCEM that determine the necessary
sample size in an automated, data-driven fashion have been developed only recently (see Booth
& Hobert, 1999; Levine & Casella, 2001; Levine & Fan, 2003). Automated implementations of
MCEM base the decision to increase the sample size on the magnitude of the error in the integral-
approximation. In their seminal work, Booth & Hobert (1999) use statistical methods to estimate
this error when the simulated data is generated at random. However, since QMC methods are
deterministic in nature, statistical methods do not apply. Moreover, determining the error of the
QMC integral-estimate analytically can be extremely hard (Caflisch et al., 1997).
Recently, the development of randomized QMC methods has overcome this early drawback.
Randomized Quasi-Monte Carlo (RQMC) methods combine the benefits of deterministic sampling
methods, which achieve a more uniform exploration of the sample space, with the statistical advan-
tages of random draws. A survey of recent advances in RQMC methods can be found in L’Ecuyer &
Lemieux (2002). In this work we implement an automated MCEM algorithm based on RQMC meth-
ods. Specifically, we demonstrate how to obtain a QMC sample from the distribution commanded
by EM and we use the ideas of RQMC sampling to measure the error of the integral-estimate in
every iteration of the algorithm. We implement this Quasi-Monte Carlo EM (QMCEM) algorithm
within the framework of the automated MCEM formulation proposed by Booth & Hobert (1999).
The remainder of this paper is organized as follows. In Section 2 we briefly motivate the ideas
surrounding QMC and RQMC. In Section 3 we explain how RQMC methods can be used to imple-
ment QMCEM in an automated, data-driven fashion. We apply this algorithm to a geostatistical
model of online purchases in Section 4 and conclude with final remarks in Section 5.
2 Quasi-Monte Carlo Sampling
Quasi-Monte Carlo methods can be regarded as a deterministic counterpart to classical Monte
Carlo. Suppose we want to evaluate an (analytically intractable) integral
I = \int_{C^d} f(x) \, dx    (1)
over the d-dimensional unit cube, C^d := [0, 1]^d. Classical Monte Carlo integration randomly selects
points x_k \sim \text{Uniform}(C^d), k = 1, \ldots, m, and approximates (1) by the empirical average

\hat{I} = \frac{1}{m} \sum_{k=1}^{m} f(x_k).    (2)
Quasi-Monte Carlo methods, on the other hand, select the points deterministically. Specifically,
QMC methods produce a deterministic sequence of points that provides the best-possible spread
in Cd. These deterministic sequences are often referred to as low-discrepancy sequences (see, for
example, Niederreiter, 1992; Fang & Wang, 1994).
A variety of different low-discrepancy sequences exist. Examples include the Halton sequence
(Halton, 1960), the Sobol sequence (Sobol, 1967), the Faure sequence (Faure, 1982), and the Nieder-
reiter sequence (Niederreiter, 1992), but this list is not exhaustive. In this work we focus our
attention on the Halton sequence since it is conceptually very appealing.
2.1 Halton Sequences
Let b be a prime number. Then any integer k, k \geq 0, can be written in base-b representation as

k = d_j b^j + d_{j-1} b^{j-1} + \cdots + d_1 b + d_0,

where d_i \in \{0, 1, \ldots, b - 1\} for i = 0, 1, \ldots, j. Define the base-b radical inverse function, \phi_b(k), as

\phi_b(k) = \frac{d_0}{b} + \frac{d_1}{b^2} + \cdots + \frac{d_j}{b^{j+1}}.
Notice that for every integer k \geq 0, \phi_b(k) \in [0, 1).
The kth element of the Halton sequence is obtained via the radical inverse function evaluated
at k. Specifically, if b_1, \ldots, b_d are d different prime numbers, then a d-dimensional Halton sequence
of length m is given by \{x_1, \ldots, x_m\}, where the kth element of the sequence is

x_k = [\phi_{b_1}(k - 1), \ldots, \phi_{b_d}(k - 1)]^T, \quad k = 1, \ldots, m.    (3)

(See Halton (1960) or Wang & Hickernell (2000) for more details.)
Notice that the Halton sequence does not need to be started at the origin. Indeed, for any d-vector of non-negative integers, n = (n_1, \ldots, n_d)^T say, the Halton sequence with the first elements skipped,

x_k = [\phi_{b_1}(n_1 + k - 1), \ldots, \phi_{b_d}(n_d + k - 1)]^T, \quad k = 1, \ldots, m,    (4)

remains a low-discrepancy sequence (see Pagès, 1992; Bouleau & Lepingle, 1994). We will refer to
the sequence defined by (4) as a Halton sequence with starting point n. Figure 1 shows the first
2500 elements of a two-dimensional Halton sequence with n = (0, 0)^T.
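To make the construction concrete, the following minimal sketch implements the radical inverse function and the Halton sequences of (3) and (4) in Python. This is our illustration, not the paper's own code (the simulations in this work were done in Ox); all names are ours.

```python
import numpy as np

def radical_inverse(k, b):
    """Base-b radical inverse phi_b(k) of an integer k >= 0."""
    x, scale = 0.0, 1.0 / b
    while k > 0:
        k, d = divmod(k, b)  # peel off the next base-b digit d
        x += d * scale
        scale /= b
    return x

def halton(m, primes, start=None):
    """First m points of a d-dimensional Halton sequence with starting
    point n = start, as in (4); start = None reproduces (3)."""
    d = len(primes)
    n = np.zeros(d, dtype=int) if start is None else np.asarray(start)
    return np.array([[radical_inverse(int(n[i]) + k, primes[i])
                      for i in range(d)] for k in range(m)])

points = halton(2500, primes=[2, 3])  # the sequence plotted in Figure 1
```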
2.2 Randomized Quasi-Monte Carlo
Owen (1998b) points out that the main (practical) disadvantage of QMC is that determining the
accuracy of the integral-estimate in (2) is typically very complicated, if not impossible. Moreover,
since QMC methods are based on deterministic sequences, statistical procedures for error estimation
do not apply. This drawback has led to the development of randomized Quasi-Monte Carlo
(RQMC) methods.
L'Ecuyer & Lemieux (2002) suggest that any RQMC sequence should have the following two
properties: 1) every element of the sequence has a uniform distribution over C^d; 2) the low-discrepancy property of the sequence is preserved under the randomization. The first property
guarantees that the approximation \hat{I} in (2) is an unbiased estimate of the integral in (1). Moreover,
one can estimate its variance by generating r independent copies of \hat{I} (which is typically done by
generating r independent sequences x_1^{(j)}, \ldots, x_m^{(j)}, j = 1, \ldots, r). Given a desired total simulation
amount N = rm, smaller values of r (paired with a larger value of m) should result in a better
accuracy of the integral-estimate, since this takes better advantage of the low-discrepancy property
of each sequence. At the extreme, taking r = N and m = 1 simply reproduces classical Monte
Carlo estimation.
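The following sketch shows how the r independent copies translate into a pooled estimate with a standard error. Here `randomized_sequence` is a placeholder for any generator satisfying properties 1) and 2); one concrete choice, the randomized Halton sequence, follows in Subsection 2.3.

```python
import numpy as np

def rqmc_integrate(f, randomized_sequence, d, m, r, rng):
    """Pooled RQMC estimate of I and its standard error, computed
    from r independent randomized sequences of m points each."""
    copies = np.array([
        np.mean([f(x) for x in randomized_sequence(m, d, rng)])
        for _ in range(r)
    ])
    return copies.mean(), copies.std(ddof=1) / np.sqrt(r)
```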
2.3 Randomized Halton Sequences
Recall that, regardless of the starting point, the Halton sequence remains a low-discrepancy se-
quence. Wang & Hickernell (2000) use this fact to show that if the Halton sequence is started at
a random point, x_1 \sim \text{Uniform}(C^d), then it satisfies RQMC properties 1) and 2) from Subsection 2.2. In the following sections, we will use RQMC sampling based on the randomized Halton
sequence.
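A minimal sketch of this randomization, reusing the `halton` function sketched in Subsection 2.1: we approximate the uniformly distributed starting point by drawing a large random integer offset per coordinate. This shortcut is our simplification; Wang & Hickernell (2000) define the randomization exactly.

```python
import numpy as np

def randomized_halton(m, primes, rng, max_offset=2**32):
    """One replicate of a randomized Halton sequence: a Halton
    sequence started at a random point, approximated here by a
    random integer offset per coordinate."""
    start = rng.integers(0, max_offset, size=len(primes))
    return halton(m, primes, start=start)

rng = np.random.default_rng(0)
x = randomized_halton(1000, primes=[2, 3, 5], rng=rng)  # one RQMC replicate
```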
3 Quasi-Monte Carlo EM
The Expectation-Maximization (EM) algorithm (Dempster et al., 1977) is an iterative procedure
useful to approximate the maximum likelihood estimator (MLE) in incomplete data problems. Let
y be a vector of observed data, let u be a vector of unobserved data or random effects and let θ
denote a vector of parameters. Furthermore, let f(y, u; \theta) denote the joint density of the complete
data, (y, u). Let L(\theta; y) = \int f(y, u; \theta) \, du denote the (marginal) likelihood function for this model.
The MLE, \hat{\theta}, maximizes L(\cdot; y).
In each iteration, the EM algorithm performs an expectation and a maximization step. Let
θ(t−1) denote the current parameter value. Then, in the tth iteration of the algorithm, the E-
step computes the conditional expectation of the complete data log-likelihood, conditional on the
observed data and the current parameter value,

Q(\theta | \theta^{(t-1)}) = E\left[ \log f(y, u; \theta) \mid y; \theta^{(t-1)} \right].    (5)

The tth EM update, \theta^{(t)}, maximizes (5). That is, \theta^{(t)} satisfies

Q(\theta^{(t)} | \theta^{(t-1)}) \geq Q(\theta | \theta^{(t-1)})    (6)
for all θ in the parameter space. This is also known as the M-step. The M-step is often implemented
using standard numerical methods like Newton-Raphson (see Lange, 1995). Solutions to overcome
a difficult M-step have been proposed in, for example, Meng & Rubin (1993). Given an initial value
\theta^{(0)}, the EM algorithm produces a sequence \{\theta^{(0)}, \theta^{(1)}, \theta^{(2)}, \ldots\} that, under regularity conditions
(see Boyles, 1983; Wu, 1983), converges to \hat{\theta}.
In this work we focus on the situation when the E-step does not have a closed form solution.
Wei & Tanner (1990) proposed to approximate an analytically intractable expectation in (5) by
the empirical average

\hat{Q}(\theta | \theta^{(t-1)}) \equiv \hat{Q}(\theta | \theta^{(t-1)}; u_1, \ldots, u_{m_t}) = \frac{1}{m_t} \sum_{k=1}^{m_t} \log f(y, u_k; \theta),    (7)

where u_1, \ldots, u_{m_t} are simulated from the conditional distribution f(u | y; \theta^{(t-1)}). Then, by the
law of large numbers, \hat{Q}(\theta | \theta^{(t-1)}) will be a reasonable approximation to Q(\theta | \theta^{(t-1)}) if m_t is large
enough.
We consider a modification of (7) suitable for RQMC sampling. Let u_1^{(j)}, \ldots, u_{m_t}^{(j)}, j = 1, \ldots, r,
be r independent RQMC sequences of length m_t, each simulated from f(u | y; \theta^{(t-1)}). (The details
of how to simulate a RQMC sequence from f(u | y; \theta^{(t-1)}) are deferred until Subsection 3.2.) Then,
an unbiased estimate of (5) is given by the pooled estimate

\hat{Q}_P(\theta | \theta^{(t-1)}) = \frac{1}{r} \sum_{j=1}^{r} \hat{Q}^{(j)}(\theta | \theta^{(t-1)}),    (8)

where \hat{Q}^{(j)}(\theta | \theta^{(t-1)}) = \hat{Q}(\theta | \theta^{(t-1)}; u_1^{(j)}, \ldots, u_{m_t}^{(j)}) in (7). The tth Quasi-Monte Carlo EM (QMCEM)
update, \tilde{\theta}^{(t)}, maximizes \hat{Q}_P(\cdot | \theta^{(t-1)}).
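In code, the estimates (7) and (8) are plain averages. The sketch below assumes the model supplies a complete-data log-likelihood `loglik(y, u, theta)` and that the r sequences of draws have already been generated (Subsection 3.2); names and signatures are ours.

```python
import numpy as np

def Q_hat(theta, u_seq, y, loglik):
    """Equation (7): average complete-data log-likelihood over one
    sequence of draws u_1, ..., u_mt."""
    return np.mean([loglik(y, u_k, theta) for u_k in u_seq])

def Q_pooled(theta, u_sequences, y, loglik):
    """Equation (8): pool the r per-sequence estimates."""
    return np.mean([Q_hat(theta, u_seq, y, loglik) for u_seq in u_sequences])
```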
3.1 Increasing the length of the RQMC sequences
We have pointed out earlier that the Monte Carlo sample sizes m_t should be increased successively as
the algorithm moves along. In fact, Booth et al. (2001) argue that MCEM will never converge if m_t
is held fixed across iterations because of a persistent Monte Carlo error (see also Chan & Ledolter,
1995). Earlier versions of the method choose the Monte Carlo sample sizes in a deterministic
fashion before the start of the algorithm (e.g. McCulloch, 1997), but a deterministic allocation of
Monte Carlo resources that works well in one problem may result in a very inefficient (or inaccurate)
algorithm in another problem. Thus, data-dependent (and user-independent) sample size rules are
necessary in order to implement MCEM in an automated way. Booth & Hobert (1999) base the
decision of a sample size increase on the noise in the parameter updates (see also Levine & Casella,
2001; Levine & Fan, 2003).
Let \tilde{\theta}^{(t-1)} denote the current QMCEM parameter value and let \tilde{\theta}^{(t)} denote the maximizer of
\hat{Q}_P(\cdot | \tilde{\theta}^{(t-1)}) in (8) based on r independent RQMC sequences, each of length m_t. Thus, \tilde{\theta}^{(t)} satisfies

F_P(\tilde{\theta}^{(t)} | \tilde{\theta}^{(t-1)}) = 0,    (9)

where we define F_P(\theta | \theta') = \partial \hat{Q}_P(\theta | \theta') / \partial \theta. Let \theta^{(t)} denote the parameter update of the deterministic EM algorithm, that is, \theta^{(t)} satisfies

F(\theta^{(t)} | \tilde{\theta}^{(t-1)}) = 0,    (10)

where, in similar fashion to above, we define F(\theta | \theta') = \partial Q(\theta | \theta') / \partial \theta. Thus, a first order Taylor
expansion of F_P(\tilde{\theta}^{(t)} | \tilde{\theta}^{(t-1)}) about \theta^{(t)} yields

(\tilde{\theta}^{(t)} - \theta^{(t)})^T S_P(\theta^{(t)} | \tilde{\theta}^{(t-1)}) \approx -F_P(\theta^{(t)} | \tilde{\theta}^{(t-1)}),    (11)

where we define the matrix S_P(\theta | \theta') = \partial^2 \hat{Q}_P(\theta | \theta') / \partial \theta \partial \theta^T. Under RQMC sampling, \hat{Q}_P is an
unbiased estimate of Q. Assuming mild regularity conditions, it follows that

E[F_P(\theta^{(t)} | \tilde{\theta}^{(t-1)})] = F(\theta^{(t)} | \tilde{\theta}^{(t-1)}) = 0.    (12)

Therefore, the expected value of \tilde{\theta}^{(t)} is \theta^{(t)} and its variance-covariance matrix is given by

\text{Var}(\tilde{\theta}^{(t)}) = \left[ S_P(\theta^{(t)} | \tilde{\theta}^{(t-1)}) \right]^{-1} \text{Var}\left( F_P(\theta^{(t)} | \tilde{\theta}^{(t-1)}) \right) \left[ S_P(\theta^{(t)} | \tilde{\theta}^{(t-1)}) \right]^{-1}.    (13)
Under regular Monte Carlo sampling, it follows that, for a large enough Monte Carlo sample
size, \tilde{\theta}^{(t)} is approximately normally distributed with the mean and variance specified above. Under
RQMC sampling, however, the accuracy of the normal approximation may depend on the number
r of independent RQMC sequences. In Section 4 we consider a range of values for r in order to
investigate its effect on QMCEM.

In our implementations we estimate \text{Var}(\tilde{\theta}^{(t)}) by substituting \tilde{\theta}^{(t)} for \theta^{(t)} in (13) and estimate
\text{Var}(F_P(\theta^{(t)} | \tilde{\theta}^{(t-1)})) via
\frac{1}{r^2} \sum_{j=1}^{r} \left( \frac{\partial}{\partial \theta} \hat{Q}^{(j)}(\theta | \tilde{\theta}^{(t-1)}) \right) \left( \frac{\partial}{\partial \theta} \hat{Q}^{(j)}(\theta | \tilde{\theta}^{(t-1)}) \right)^T \Bigg|_{\theta = \tilde{\theta}^{(t)}}.    (14)
Larger values of r should result in a more accurate estimate of \text{Var}(\tilde{\theta}^{(t)}). However, we also
pointed out that smaller values of r should result in a better accuracy of the Monte Carlo estimate
in (8), since this takes better advantage of the low-discrepancy property of each individual sequence
u_1^{(j)}, \ldots, u_{m_t}^{(j)}. We investigate the impact of this trade-off on the overall efficiency of the method in
Section 4.
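The sketch below shows how (13) and (14) combine computationally. The per-sequence gradients and the pooled Hessian are model-specific; we assume they are available, for instance from analytic differentiation of (7).

```python
import numpy as np

def var_qmcem_update(theta_tilde, scores, hessian):
    """Sandwich estimate of Var(theta_tilde): equation (13), with the
    outer-product estimate (14) plugged in for Var(F_P).
    scores: r x p array; row j holds dQ^(j)/dtheta at theta_tilde."""
    r = scores.shape[0]
    var_F = scores.T @ scores / r**2             # equation (14)
    S_inv = np.linalg.inv(hessian(theta_tilde))  # [S_P]^{-1}
    return S_inv @ var_F @ S_inv                 # equation (13)
```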
The QMCEM algorithm proceeds as follows. Following Booth & Hobert's recommendation, we
measure the noise in the QMCEM update \tilde{\theta}^{(t)} by constructing a (1 - \alpha) \times 100\% confidence ellipsoid
about the deterministic EM update \theta^{(t)}, using the normal approximation for \tilde{\theta}^{(t)}. If this ellipsoid
contains the previous parameter value \tilde{\theta}^{(t-1)}, then we conclude that the system is too noisy and
we increase the length m_t of the RQMC sequences. Booth et al. (2001) argue that the sample sizes
should be increased at an exponential rate. Thus, we increase the sample size to m_{t+1} := (1 + \kappa) m_t,
where \kappa is a small number, typically \kappa = 0.2, 0.3, 0.4. Since stochastic algorithms like MCEM can
satisfy deterministic stopping rules purely by chance, it is recommended to continue the method
until the stopping rule is satisfied for several consecutive iterations (see also Booth & Hobert,
1999). Thus, we stop the algorithm when the relative change in two successive parameter updates
is smaller than some small number \delta, \delta > 0, for 3 consecutive iterations.
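A hedged sketch of these rules in one loop; `qmcem_step` stands in for a full E- and M-step that returns the update \tilde{\theta}^{(t)} together with its estimated covariance (13). This is one possible realization of the recipe above, not the paper's Ox code.

```python
import numpy as np
from scipy.stats import chi2

def automated_qmcem(qmcem_step, theta0, m0, r,
                    alpha=0.25, kappa=0.2, delta=0.01):
    theta, m_t, quiet = np.asarray(theta0, dtype=float), m0, 0
    while quiet < 3:
        theta_new, cov = qmcem_step(theta, m_t, r)
        diff = theta_new - theta
        # previous value inside the (1 - alpha) ellipsoid: too noisy
        if diff @ np.linalg.solve(cov, diff) <= chi2.ppf(1 - alpha, theta.size):
            m_t = int(np.ceil((1 + kappa) * m_t))
        # stop after 3 consecutive small relative changes; the small
        # constant guards against division by zero-valued parameters
        rel = np.max(np.abs(diff) / (np.abs(theta) + 1e-10))
        quiet = quiet + 1 if rel < delta else 0
        theta = theta_new
    return theta
```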
3.2 Laplace Importance Sampling to generate RQMC sequences
Recall that the pooled estimate in (8) is based on r independent RQMC sequences u_1^{(j)}, \ldots, u_{m_t}^{(j)},
j = 1, \ldots, r, simulated from f(u | y; \theta^{(t-1)}). In this section we demonstrate how to generate randomized
Halton sequences using Laplace importance sampling.
Laplace importance sampling has proven useful for drawing approximate samples from f(u | y; \theta)
in many instances (see Booth & Hobert, 1999; Kuk, 1999). Laplace importance sampling attempts
to find an importance sampling distribution whose mean and variance match the mode and curvature of f(u | y; \theta). More specifically, suppressing the dependence on y, let

l(u; \theta) = \log f(y, u; \theta)    (15)

denote the complete data log-likelihood and let l'(u; \theta) and l''(u; \theta) denote its first and second
derivatives in u, respectively. Suppose that \hat{u} denotes the maximizer of l, satisfying l'(\hat{u}; \theta) = 0.
Then the Laplace approximations to the mean and variance of f(u | y; \theta) are \mu(\theta) = \hat{u} and
\Sigma(\theta) = -\{l''(\hat{u}; \theta)\}^{-1}, respectively (e.g. De Bruijn, 1958). Booth & Hobert (1999) as well as Kuk
(1999) propose to use a multivariate normal or multivariate t importance sampling distribution,
shifted and scaled by \mu(\theta) and \Sigma(\theta), respectively. Let f_{Lap}(u | y; \theta) denote the resulting Laplace
importance sampling distribution.
Recall that by RQMC property 1), every element of a RQMC sequence has a uniform distribution over C^d. Let x_k be the kth element of a randomized Halton sequence. Using a suitable
transformation (e.g. Robert & Casella, 1999), we can generate a d-vector of i.i.d. normal or t variates. Shifting and scaling this vector by \mu(\theta) and \Sigma(\theta) results in a draw u_k from f_{Lap}(u | y; \theta).
Thus, using r independent randomized Halton sequences of length m_t, x_1^{(j)}, \ldots, x_{m_t}^{(j)}, j = 1, \ldots, r,
we obtain r independent sequences u_1^{(j)}, \ldots, u_{m_t}^{(j)} from f_{Lap}(u | y; \theta).
Booth & Hobert (1999) and Kuk (1999) successfully use Laplace importance sampling for fitting
generalized linear mixed models. In the following we apply the method to generalized linear mixed
models for data exhibiting spatial correlation.
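For one sequence, the whole construction can be sketched as follows. We assume the user supplies the negative complete-data log-likelihood `neg_l(u)` and its Hessian `neg_l_hess(u)` for the current \theta; `randomized_halton` is the sketch from Subsection 2.3. We show the normal variant; the t variant used in Section 4 would replace `norm.ppf` by the corresponding t quantile function with an appropriate rescaling.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm
from scipy.linalg import cholesky

def laplace_rqmc_draws(neg_l, neg_l_hess, u0, m, primes, rng):
    """m approximate draws from f(u | y; theta) via Laplace importance
    sampling driven by one randomized Halton sequence."""
    mu = minimize(neg_l, u0, method="BFGS").x   # mode u-hat of l(u; theta)
    Sigma = np.linalg.inv(neg_l_hess(mu))       # -{l''(u-hat; theta)}^{-1}
    L = cholesky(Sigma, lower=True)             # Sigma = L L^T
    x = randomized_halton(m, primes, rng)       # m uniforms on C^d
    z = norm.ppf(np.clip(x, 1e-12, 1 - 1e-12))  # i.i.d. standard normals
    return mu + z @ L.T                         # shift and scale
```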
4 Application: A Geostatistical Model of Online Purchases
In this section we consider sales data from an online book publisher and retailer. The publisher
sells its titles online in print form and, more recently, also in PDF form. The publisher has good
reason to believe that a customer's preference for either print or PDF form varies significantly
with his or her geographical location. In fact, since the PDF form is directly downloaded from the
publisher's web site, it requires a reliable and typically fast internet connection. However, the
availability of reliable internet connections varies greatly across regions. Moreover, directly
downloaded PDF files provide content immediately, without the wait for shipment as in the case of
a printed book. Thus, shipping times can also influence a customer's preference. The preference can
further be affected by a customer's access to good quality printers or his/her technology readiness,
all of which often exhibit strong local variability.
Data exhibiting spatial correlation can be modelled using generalized linear mixed models (e.g.
Breslow & Clayton, 1993). Diggle et al. (1998) refer to these spatial applications of generalized
linear mixed models as "model-based geostatistics." These spatial mixed models are challenging
from a computational point of view since they often involve approximating rather high dimensional
integrals. In the following we consider a set of data leading to an analytically intractable likelihood-integral of dimension 16.
Let \{z_i\}_{i=1}^{d}, z_i = (z_{i1}, z_{i2}), denote the spatial coordinates of the observed responses \{y_i\}_{i=1}^{d}.
For example, z_{i1} and z_{i2} could denote the longitude and latitude of the observation y_i. While y_i
could represent a variety of response types, we focus here on the binomial case only. For instance,
y_i could indicate whether or not a person living at location z_i has a certain disease, or whether or
not this person has a preference for a certain product. One of the modelling goals is to account
for the possibility that two people living in close geographic proximity are more likely to share the
same disease or the same preference.
Let u = (u_1, \ldots, u_d) be a vector of random effects. Assume that, conditional on u_i, the responses
y_i arise from the model

y_i | u_i \sim \text{Binomial}\left( n_i, \frac{\exp(\beta + u_i)}{1 + \exp(\beta + u_i)} \right),    (16)
where \beta is an unknown regression coefficient. Assume furthermore that u follows a multivariate
normal distribution with mean zero and a covariance structure such that the correlation between two
random effects decays with the geographical distance between the associated observations. For
example, assume that

\text{Cov}(u_i, u_j) = \sigma^2 \exp\{-\alpha \|z_i - z_j\|\},    (17)

where \|\cdot\| denotes the Euclidean norm. While different modelling alternatives exist (see, for example,
Diggle et al., 1998), we will use the above model to investigate the efficiency of Quasi-Monte Carlo
MCEM implementations for estimating the parameter vector \theta = (\beta, \sigma^2, \alpha).
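As an illustration of (16) and (17), the following sketch builds the covariance matrix and simulates one synthetic data set. The coordinates, trial sizes n_i and parameter values are arbitrary placeholders, not the retail data analyzed below.

```python
import numpy as np

def spatial_cov(z, sigma2, alpha):
    """Equation (17): Cov(u_i, u_j) = sigma^2 exp(-alpha ||z_i - z_j||)."""
    dist = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    return sigma2 * np.exp(-alpha * dist)

rng = np.random.default_rng(1)
z = rng.uniform(size=(16, 2))          # 16 locations (z_i1, z_i2)
beta, sigma2, alpha = 0.0, 1.0, 1.0    # placeholder parameter values
u = rng.multivariate_normal(np.zeros(16), spatial_cov(z, sigma2, alpha))
p = 1.0 / (1.0 + np.exp(-(beta + u)))  # success probabilities in (16)
y = rng.binomial(n=5, p=p)             # responses with n_i = 5 trials each
```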
We analyze a set of online retail data for the Washington, DC, area. Washington is a very
diverse area with respect to aspects such as socio-economic factors and infrastructure. This
diversity often expresses itself in strongly varying regional and local customer preferences. The data
set consists of 39 customers who accessed the publisher’s web site and either purchased the title in
print form or in PDF. In addition to a customer’s purchasing choice, the publisher also recorded
the customer’s geographical location. Geographical location can easily be obtained (at least ap-
proximately) through the customer’s ZIP code. ZIP code information can then be transformed into
longitude and latitude coordinates. After aggregating customers from the same ZIP code with
the same preference, we obtained d = 16 distinct geographical locations. Let ni denote the number
of purchases from location i and let yi denote the number of PDF purchases thereof. Figure 2
displays the data.
Figure 2 about here
Quasi-Monte Carlo has been found to improve upon the efficiency of classical Monte Carlo
methods in a variety of settings. For instance, Bhat (2001) reports efficiency gains via the Halton
sequence in a logit model for integral dimensions ranging from 1 to 5. Lemieux & L'Ecuyer (1998),
on the other hand, consider integral dimensions as large as 120 and find efficiency improvements
for the pricing of Asian options. In our example, the correlation structure of the random effects in
equation (17) causes the likelihood function (and therefore also the E-step of the EM algorithm)
to include an analytically intractable integral of dimension 16. Indeed, the (marginal) likelihood
function for the model in (16) and (17) can be written as
L(\theta; y) \propto \int \left( \prod_{i=1}^{d} f(y_i | u_i; \theta) \right) \frac{\exp\{-\frac{1}{2} u^T \Sigma^{-1} u\}}{|\Sigma|^{1/2}} \, du,    (18)

where u = (u_1, \ldots, u_{16})^T contains the random effects corresponding to the 16 distinct locations
and \Sigma is a 16 \times 16 matrix with elements \sigma_{ij} = \text{Cov}(u_i, u_j).
The evaluation of high dimensional integrals is computationally burdensome. We conducted
a simulation study to investigate the efficiency of QMC approaches relative to that of classical
Monte Carlo. Table 1 shows the results for three different QMCEM algorithms, using r = 5,
r = 10 and r = 30 RQMC sequences, respectively, and compares them to an implementation of
MCEM using classical Monte Carlo techniques. We can see that the Monte Carlo standard errors
of the parameter estimates of \theta = (\beta, \sigma^2, \alpha) are very similar across the estimation methods,
indicating that all 4 methods estimate the parameters with (on average) comparable accuracy. However,
the total simulation effort required to obtain this accuracy differs greatly. Indeed, while classical
Monte Carlo requires an average number of 800,200 simulated vectors (each of dimension 16!),
it only takes 20,836 for QMC (using r = 5 RQMC sequences). This is a reduction in the total
simulation effort by a factor of almost 40! It is also interesting to note that among the 3 different
QMC approaches, choosing r = 30 RQMC sequences results in an (average) total simulation effort
of 30,997 simulated vectors, compared to only 20,836 for r = 5.
Table 1 about here
The reduction in the total simulation effort that is possible with the use of QMC methods is
intriguing. The MCEM algorithm usually spends most of its simulation effort in the final iterations,
when the algorithm is in the vicinity of the MLE. This has already been observed by, for example,
Booth & Hobert (1999) or McCulloch (1997). The reason for this is the convergence behavior of
the underlying deterministic EM algorithm. EM usually takes large steps in the early iterations,
but the size of the steps reduces drastically as EM approaches \hat{\theta}. The step size in the tth iteration
of EM can be thought of as the signal that is transmitted to MCEM. However, due to the error
in the Monte Carlo approximation of the E-step in (7), MCEM receives only a noisy version of
that signal. While the signal-to-noise ratio is large in the early iterations of MCEM, it declines
continuously as MCEM approaches \hat{\theta}. This makes larger Monte Carlo sample sizes necessary in
order to increase the accuracy of the approximation in (7) and thus to reduce the noise. Table
1 shows that QMC methods, due to their superior ability to estimate an intractable integral
accurately, manage to reduce that noise with smaller sample sizes. The result is a smaller total
simulation effort required by QMC.
Table 1 also shows that among the 3 different QMCEM algorithms, implementations that use
fewer but longer low-discrepancy sequences result in a lower total simulation effort than a large
number of short sequences. Indeed, the simulation effort for r = 30 RQMC sequences is about 50%
higher than that for r = 5 or r = 10. We pointed out in Section 2 that for a given total simulation
amount r \cdot m, smaller values of r paired with larger values of m should result in a more accurate
integral-estimate. On the other hand, the trade-off for using small values of r is a less accurate
variance estimate in (14). In order to implement MCEM using randomized Halton sequences, a
balance has to be struck between a more accurate integral-estimate (i.e. less noise) and a more
accurate variance estimate. In our example, we found this balance for values of r between 5 and
10. We also experimented with values smaller than 5 and frequently encountered problems with
the numerical stability of the estimate of the covariance matrix in (14).
In the final paragraphs of this section we take a closer look at the noise of the QMCEM
algorithm and compare it to classical MCEM. Figure 3 visualizes the Monte Carlo error for three
different Monte Carlo estimation methods: classical Monte Carlo using random sampling (column
1), randomized Quasi-Monte Carlo with r = 5 RQMC sequences (column 2) and pure Quasi-Monte
Carlo without randomization (column 3).
Figure 3 about here
We can see that for classical Monte Carlo, the average parameter update (thick solid line) is very
volatile and has wide confidence bounds (dotted lines). This suggests that the Monte Carlo error is
substantial. This is in strong contrast to QMC: for pure QMC sampling the parameter updates
are significantly less volatile, with much tighter confidence bounds. Notice that we allocated the
same simulation effort to both simulation methods! It takes classical MCEM much larger sample
sizes to reduce the noise to the level achieved under QMC sampling.
We have argued at the beginning of this paper that, in order to implement MCEM in an
automated way, the ability to estimate the error in the Monte Carlo approximation is essential.
Randomized QMC methods provide this ability. While randomized Halton sequences retain the low-discrepancy property (and thus estimate the integral with a higher accuracy than classical Monte
Carlo), randomization may not come for free. Indeed, the second column of Figure 3 shows that,
while the error reduction is still substantial compared to a classical Monte Carlo approach, the
system is noisier than under pure QMC sampling.
5 Conclusion
In this paper we have demonstrated how recent advances in randomized Quasi-Monte Carlo can
be used to implement the MCEM algorithm in an automated, data-driven way. The empirical
investigations provide encouraging evidence that this Quasi-Monte Carlo EM algorithm can lead
to significant efficiency gains over implementations using regular Monte Carlo methods.
We focused our investigations in this work on the randomized Halton sequence only. Other
randomized Quasi-Monte Carlo methods exist. See, for example, Owen (1998a) or L’Ecuyer &
Lemieux (2002). It could be a rewarding topic for future research to investigate the benefits of
different Quasi-Monte Carlo methods for the implementation of Monte Carlo EM (and also other
stochastic estimation methods that are frequently encountered in the statistics literature).
Acknowledgements
All simulations in this work were performed using the programming language Ox (Doornik, 2001).
References
Bhat, C. (2001). Quasi-random maximum simulated likelihood estimation for the mixed multinomial logit model. Transportation Research Part B 35, 677–693.
Booth, J. G. & Hobert, J. P. (1999). Maximizing generalized linear mixed model likelihoods
with an automated Monte Carlo EM algorithm. Journal of the Royal Statistical Society B 61,
265–285.
Booth, J. G., Hobert, J. P. & Jank, W. (2001). A survey of Monte Carlo algorithms for
maximizing the likelihood of a two-stage hierarchical model. Statistical Modelling 1, 333–349.
Bouleau, N. & Lepingle, D. (1994). Numerical Methods for Stochastic Processes. New York:
Wiley.
Boyles, R. A. (1983). On the convergence of the EM algorithm. Journal of the Royal Statistical
Society B 45, 47–50.
Breslow, N. E. & Clayton, D. G. (1993). Approximate inference in generalized linear mixed
models. Journal of the American Statistical Association 88, 9–25.
Caflisch, R., Morokoff, W. & Owen, A. (1997). Valuation of mortgage-backed securities
using Brownian bridges to reduce effective dimension. Journal of Computational Finance 1,
27–46.
Chan, K. S. & Ledolter, J. (1995). Monte Carlo EM estimation for time series models involving
counts. Journal of the American Statistical Association 90, 242–252.
De Bruijn, N. G. (1958). Asymptotic Methods in Analysis. Amsterdam: North-Holland.
Dempster, A. P., Laird, N. M. & Rubin, D. B. (1977). Maximum likelihood from incomplete
data via the EM algorithm. Journal of the Royal Statistical Society B 39, 1–22.
Diggle, P. J., Tawn, J. A. & Moyeed, R. A. (1998). Model-based geostatistics. Journal of the
Royal Statistical Society C (Applied Statistics) 47, 299–350.
Doornik, J. A. (2001). Ox: Object Oriented Matrix Programming. London: Timberlake.
Fang, K.-T. & Wang, Y. (1994). Number Theoretic Methods in Statistics. New York: Chapman
& Hall.
Faure, H. (1982). Discrépance de suites associées à un système de numération (en dimension s).
Acta Arithmetica 41, 337–351.
Halton, J. H. (1960). On the efficiency of certain quasi-random sequences of points in evaluating
multi-dimensional integrals. Numerische Mathematik 2, 84–90.
Hobert, J. P. (2000). Hierarchical models: A current computational perspective. Journal of the
American Statistical Association 95, 1312–1316.
Kuk, A. Y. C. (1999). Laplace importance sampling for generalized linear mixed models. Journal
of Statistical Computation and Simulation 63, 143–158.
Lange, K. (1995). A gradient algorithm locally equivalent to the EM algorithm. Journal of the
Royal Statistical Society B 57, 425–437.
L'Ecuyer, P. & Lemieux, C. (2002). Recent advances in randomized Quasi-Monte Carlo methods. In Modeling Uncertainty: An Examination of Stochastic Theory, Methods, and Applications,
M. Dror, P. L'Ecuyer & F. Szidarovszky, eds. Kluwer Academic Publishers.
Lemieux, C. & L'Ecuyer, P. (1998). Efficiency improvement by lattice rules for pricing Asian
options. In Proceedings of the 1998 Winter Simulation Conference. IEEE Press.
Levine, R. & Fan, J. (2003). An automated (Markov Chain) Monte Carlo EM algorithm. Tech.
rep., San Diego State University.
Levine, R. A. & Casella, G. (2001). Implementations of the Monte Carlo EM algorithm. Journal
of Computational and Graphical Statistics 10, 422–439.
McCulloch, C. E. (1997). Maximum likelihood algorithms for generalized linear mixed models.
Journal of the American Statistical Association 92, 162–170.
Meng, X.-L. & Rubin, D. B. (1993). Maximum likelihood estimation via the ECM algorithm:
A general framework. Biometrika 80, 267–278.
Morokoff, W. J. & Caflisch, R. E. (1995). Quasi-Monte Carlo integration. Journal of
Computational Physics 122, 218–230.
Niederreiter, H. (1992). Random Number Generation and Quasi-Monte Carlo Methods.
Philadelphia: SIAM.
Owen, A. (1998a). Scrambling Sobol’ and Niederreiter-Xing points. Journal of Complexity 14,
466–489.
Owen, A. B. (1998b). Monte Carlo extension of Quasi-Monte Carlo. In 1998 Winter Simulation
Conference Proceedings. New York: Springer, pp. 571–577.
Pagès, G. (1992). Van der Corput sequences, Kakutani transforms and one-dimensional numerical
integration. Journal of Computational and Applied Mathematics 44, 21–39.
Robert, C. P. & Casella, G. (1999). Monte Carlo Statistical Methods. New York: Springer.
Sobol, I. M. (1967). Distribution of points in a cube and approximate evaluation of integrals.
U.S.S.R. Computational Mathematics and Mathematical Physics 7, 784–802.
Wang, X. & Hickernell, F. J. (2000). Randomized Halton sequences. Mathematical and
Computer Modelling 32, 887–899.
Wei, G. C. G. & Tanner, M. A. (1990). A Monte Carlo implementation of the EM algorithm and
the poor man’s data augmentation algorithms. Journal of the American Statistical Association
85, 699–704.
Wu, C. F. J. (1983). On the convergence properties of the EM algorithm. The Annals of Statistics
11, 95–103.
Figure 1: 2500 points in the unit square. The upper plot ("Regular Monte Carlo") shows the result of regular Monte Carlo sampling, that is, 2500 points selected at random. Random points tend to form clusters, over-sampling the unit square in some places; this leads to gaps in other places, where the sample space is not explored at all. The lower plot ("Quasi-Monte Carlo") shows the result of Quasi-Monte Carlo sampling: 2500 points of a two-dimensional Halton sequence.
Figure 2: Geographical distribution of PDF purchases for Washington, DC. The upper plot ("Geographical Distribution of Data"; axes longitude and latitude) shows the geographical borders of Washington, DC, as well as the geographical locations of the 39 purchases of PDF or print. The lower plot ("Proportion of PDF Purchases per Location"; axes longitude, latitude and proportion) displays the geographical scatter of the relative proportion of PDF purchases.
Figure 3: Monte Carlo error and Quasi-Monte Carlo error (panels Beta, Sigma and Alpha per column). Starting MCEM near the MLE, we performed 100 iterations using a fixed Monte Carlo sample size of r m_t \equiv 1000 for all t = 1, \ldots, 100. We repeated this experiment 50 times for a) MCEM using classical Monte Carlo sampling (column 1); b) randomized Quasi-Monte Carlo with r = 5 (column 2); c) pure Quasi-Monte Carlo without randomization, i.e. r = 1 (column 3). For each parameter value we plotted the average of the 50 iteration histories (thick, solid lines) as well as pointwise 95% confidence bounds (dotted lines).
Table 1: Spatial model. The table investigates the efficiency of Quasi-Monte Carlo implementations of MCEM for fitting geostatistical models. We investigate three different Quasi-Monte Carlo (QMC) algorithms using r = 5, 10 and 30 independent RQMC sequences, respectively. These RQMC sequences are obtained via randomized Halton sequences, using Laplace importance sampling based on a t distribution with 10 degrees of freedom. We benchmark these three QMC algorithms against an implementation of MCEM based on regular Monte Carlo (MC) sampling using the same Laplace importance sampler. We start each algorithm from (\beta^{(0)}, \sigma^{2(0)}, \alpha^{(0)}) = (0, 1, 1) and increase the length of the RQMC sequences according to Section 3.1 using \alpha = 0.25 and \kappa = 0.2. The algorithm is terminated if the relative difference in two successive parameter updates falls below \delta = 0.01 for 3 consecutive iterations. For each of the four MCEM implementations we performed this experiment 50 times, recording the final parameter values \hat{\beta}_i, \hat{\sigma}^2_i and \hat{\alpha}_i and the total number of simulated vectors, N = \sum_{j=1}^{T_i} r \cdot m_j, where T_i denotes the final iteration number (i = 1, \ldots, 50). The table displays the Monte Carlo average (AVG) and the Monte Carlo standard error (SE) of these values. For instance, for the regression parameter \beta it displays the average \bar{\beta} over the 50 replications and the Monte Carlo standard error s_\beta / \sqrt{50}, where s_\beta denotes the sample standard deviation over the 50 replicates.

                     \beta    \sigma^2   \alpha        N
MC          AVG    -0.973      2.879    13.095    800,200
            SE      0.005      0.010     0.028    230,347
QMC (r=5)   AVG    -0.978      2.856    13.173     20,836
            SE      0.005      0.011     0.032        908
QMC (r=10)  AVG    -0.975      2.881    13.094     21,768
            SE      0.005      0.013     0.035        941
QMC (r=30)  AVG    -0.969      2.858    13.148     30,997
            SE      0.005      0.011     0.031      1,234