Quasi-Monte Carlo Sampling to Improve the Efficiency of Monte Carlo EM

Wolfgang Jank
Department of Decision and Information Technologies
University of Maryland, College Park, MD 20742-1815
November 17, 2003
Abstract
In this paper we investigate an efficient implementation of the Monte Carlo EM al-
gorithm based on Quasi-Monte Carlo sampling. The Monte Carlo EM algorithm is a
stochastic version of the deterministic EM (Expectation-Maximization) algorithm in
which an intractable E-step is replaced by a Monte Carlo approximation. Quasi-Monte
Carlo methods produce deterministic sequences of points that can significantly improve
the accuracy of Monte Carlo approximations over purely random sampling. One draw-
back to deterministic Quasi-Monte Carlo methods is that it is generally difficult to
determine the magnitude of the approximation error. However, in order to implement
the Monte Carlo EM algorithm in an automated way, the ability to measure this error
is fundamental. Recent developments of randomized Quasi-Monte Carlo methods can
overcome this drawback. We investigate the implementation of an automated, data-
driven Monte Carlo EM algorithm based on randomized Quasi-Monte Carlo methods.
We apply this algorithm to a geostatistical model of online purchases and find that it
can significantly decrease the total simulation effort, thus showing great potential for
improving upon the efficiency of the classical Monte Carlo EM algorithm.
Key words and phrases: Monte Carlo error; low-discrepancy sequence; Halton sequence; EM algo-
rithm; geostatistical model.
1 Introduction
The Expectation-Maximization (EM) algorithm (Dempster et al., 1977) is a popular tool in statis-
tics and many other fields. One limitation to the use of EM is, however, that quite often the
E-step of the algorithm involves an analytically intractable, sometimes high dimensional integral.
Hobert (2000), for example, considers a model for which the E-step involves intractable integrals of
dimension twenty. The Monte Carlo EM (MCEM) algorithm, proposed by Wei & Tanner (1990),
estimates this intractable integral with an empirical average based on simulated data. Typically, the
simulated data is obtained by producing random draws from the distribution commanded by EM.
By the law of large numbers, this integral-estimate can be made arbitrarily accurate by increasing
the size of the simulated data. The MCEM algorithm typically requires a very high accuracy,
especially at the later iterations. Booth & Hobert (1999), for example, report sample sizes of over
66,000 at convergence. This suggests that the overall efficiency of MCEM could be improved by
using simulation methods that achieve a high accuracy in the integral-estimate with smaller sample
sizes.
Recent research has provided evidence that entirely random draws do not necessarily result in
the most efficient use of the simulated data. In particular, one criticism of random draws is that
they often do not explore the sample space well (Morokoff & Caflisch, 1995; Caflisch et al., 1997).
For instance, points drawn at random tend to form clusters, which leads to gaps where the sample
space is not explored at all (see Figure 1 for an illustration). This criticism has led to the development
of a variety of deterministic methods that provide a better spread of the sample points. These
deterministic methods are often classified as Quasi-Monte Carlo (QMC) methods. Theoretical as
well as empirical research has shown that QMC methods can significantly increase the accuracy of
the integral-estimate over random draws.
Figure 1 about here
In this paper we investigate an implementation of the MCEM algorithm based on QMC methods.
Wei & Tanner (1990) point out that for an efficient implementation, the size of the simulated data
should be chosen small at the initial stage but increased successively as the algorithm moves along.
Early versions of the method require a manual, user-determined increase of the sample size, for
instance, by fixing the amount of data to be simulated in each iteration before the start
of the algorithm (e.g. McCulloch, 1997). Implementations of MCEM that determine the necessary
sample size in an automated, data-driven fashion have been developed only recently (see Booth
& Hobert, 1999; Levine & Casella, 2001; Levine & Fan, 2003). Automated implementations of
MCEM base the decision to increase the sample size on the magnitude of the error in the integral-
approximation. In their seminal work, Booth & Hobert (1999) use statistical methods to estimate
this error when the simulated data is generated at random. However, since QMC methods are
deterministic in nature, statistical methods do not apply. Moreover, determining the error of the
QMC integral-estimate analytically can be extremely hard (Caflisch et al., 1997).
Recently, the development of randomized QMC methods has overcome this early drawback.
Randomized Quasi-Monte Carlo (RQMC) methods combine the benefits of deterministic sampling
methods, which achieve a more uniform exploration of the sample space, with the statistical advan-
tages of random draws. A survey of recent advances in RQMC methods can be found in L’Ecuyer &
Lemieux (2002). In this work we implement an automated MCEM algorithm based on RQMC meth-
ods. Specifically, we demonstrate how to obtain a QMC sample from the distribution commanded
by EM and we use the ideas of RQMC sampling to measure the error of the integral-estimate in
every iteration of the algorithm. We implement this Quasi-Monte Carlo EM (QMCEM) algorithm
within the framework of the automated MCEM formulation proposed by Booth & Hobert (1999).
The remainder of this paper is organized as follows. In Section 2 we briefly motivate the ideas
surrounding QMC and RQMC. In Section 3 we explain how RQMC methods can be used to imple-
ment QMCEM in an automated, data-driven fashion. We apply this algorithm to a geostatistical
model of online purchases in Section 4 and conclude with final remarks in Section 5.
2 Quasi-Monte Carlo Sampling
Quasi-Monte Carlo methods can be regarded as a deterministic counterpart to classical Monte
Carlo. Suppose we want to evaluate an (analytically intractable) integral
I = \int_{C^d} f(x) \, dx    (1)
over the d-dimensional unit cube, C^d := [0, 1]^d. Classical Monte Carlo integration randomly selects
points x_k \sim \text{Uniform}(C^d), k = 1, \ldots, m, and approximates (1) by the empirical average

\hat{I} = \frac{1}{m} \sum_{k=1}^{m} f(x_k).    (2)
Quasi-Monte Carlo methods, on the other hand, select the points deterministically. Specifically,
QMC methods produce a deterministic sequence of points that provides the best-possible spread
in Cd. These deterministic sequences are often referred to as low-discrepancy sequences (see, for
example, Niederreiter, 1992; Fang & Wang, 1994).
A variety of different low-discrepancy sequences exist. Examples include the Halton sequence
(Halton, 1960), the Sobol sequence (Sobol, 1967), the Faure sequence (Faure, 1982), and the Nieder-
reiter sequence (Niederreiter, 1992), but this list is not exhaustive. In this work we focus our
attention on the Halton sequence since it is conceptually very appealing.
2.1 Halton Sequences
Let b be a prime number. Then any integer k, k \geq 0, can be written in base-b representation as

k = d_j b^j + d_{j-1} b^{j-1} + \cdots + d_1 b + d_0,

where d_i \in \{0, 1, \ldots, b - 1\} for i = 0, 1, \ldots, j. Define the base-b radical inverse function, \phi_b(k), as

\phi_b(k) = \frac{d_0}{b} + \frac{d_1}{b^2} + \cdots + \frac{d_j}{b^{j+1}}.
Notice that for every integer k \geq 0, \phi_b(k) \in [0, 1).
The kth element of the Halton sequence is obtained via the radical inverse function evaluated
at k. Specifically, if b_1, \ldots, b_d are d different prime numbers, then a d-dimensional Halton sequence
of length m is given by \{x_1, \ldots, x_m\}, where the kth element of the sequence is

x_k = [\phi_{b_1}(k - 1), \ldots, \phi_{b_d}(k - 1)]^T, \quad k = 1, \ldots, m.    (3)

(See Halton (1960) or Wang & Hickernell (2000) for more details.)
Notice that the Halton sequence does not need to be started at the origin. Indeed, for any d-vector of non-negative integers, n = (n_1, \ldots, n_d)^T say, the Halton sequence with the first elements skipped,

x_k = [\phi_{b_1}(n_1 + k - 1), \ldots, \phi_{b_d}(n_d + k - 1)]^T, \quad k = 1, \ldots, m,    (4)

remains a low-discrepancy sequence (see Pagès, 1992; Bouleau & Lepingle, 1994). We will refer to
the sequence defined by (4) as a Halton sequence with starting point n. Figure 1 shows the first
2500 elements of a two-dimensional Halton sequence with n = (0, 0)^T.
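To make the construction concrete, the following minimal sketch implements the radical inverse function and the Halton sequences of (3) and (4) in Python. This is our illustration, not the paper's own code (the simulations in this work were done in Ox); all names are ours.

```python
import numpy as np

def radical_inverse(k, b):
    """Base-b radical inverse phi_b(k) of an integer k >= 0."""
    x, scale = 0.0, 1.0 / b
    while k > 0:
        k, d = divmod(k, b)  # peel off the next base-b digit d
        x += d * scale
        scale /= b
    return x

def halton(m, primes, start=None):
    """First m points of a d-dimensional Halton sequence with starting
    point n = start, as in (4); start = None reproduces (3)."""
    d = len(primes)
    n = np.zeros(d, dtype=int) if start is None else np.asarray(start)
    return np.array([[radical_inverse(int(n[i]) + k, primes[i])
                      for i in range(d)] for k in range(m)])

points = halton(2500, primes=[2, 3])  # the sequence plotted in Figure 1
```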
2.2 Randomized Quasi-Monte Carlo
Owen (1998b) points out that the main (practical) disadvantage of QMC is that determining the
accuracy of the integral-estimate in (2) is typically very complicated, if not impossible. Moreover,
since QMC methods are based on deterministic sequences, statistical procedures for error estimation
do not apply. This drawback has led to the development of randomized Quasi-Monte Carlo
(RQMC) methods.
L'Ecuyer & Lemieux (2002) suggest that any RQMC sequence should have the following two
properties: 1) every element of the sequence has a uniform distribution over C^d; 2) the low-discrepancy property of the sequence is preserved under the randomization. The first property
guarantees that the approximation \hat{I} in (2) is an unbiased estimate of the integral in (1). Moreover,
one can estimate its variance by generating r independent copies of \hat{I} (which is typically done by
generating r independent sequences x_1^{(j)}, \ldots, x_m^{(j)}, j = 1, \ldots, r). Given a desired total simulation
amount N = rm, smaller values of r (paired with a larger value of m) should result in a better
accuracy of the integral-estimate, since this takes better advantage of the low-discrepancy property
of each sequence. At the extreme, taking r = N and m = 1 simply reproduces classical Monte
Carlo estimation.
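The following sketch shows how the r independent copies translate into a pooled estimate with a standard error. Here `randomized_sequence` is a placeholder for any generator satisfying properties 1) and 2); one concrete choice, the randomized Halton sequence, follows in Subsection 2.3.

```python
import numpy as np

def rqmc_integrate(f, randomized_sequence, d, m, r, rng):
    """Pooled RQMC estimate of I and its standard error, computed
    from r independent randomized sequences of m points each."""
    copies = np.array([
        np.mean([f(x) for x in randomized_sequence(m, d, rng)])
        for _ in range(r)
    ])
    return copies.mean(), copies.std(ddof=1) / np.sqrt(r)
```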
2.3 Randomized Halton Sequences
Recall that, regardless of the starting point, the Halton sequence remains a low-discrepancy se-
quence. Wang & Hickernell (2000) use this fact to show that if the Halton sequence is started at
a random point, x_1 \sim \text{Uniform}(C^d), then it satisfies RQMC properties 1) and 2) from Subsection 2.2. In the following sections, we will use RQMC sampling based on the randomized Halton
sequence.
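A minimal sketch of this randomization, reusing the `halton` function sketched in Subsection 2.1: we approximate the uniformly distributed starting point by drawing a large random integer offset per coordinate. This shortcut is our simplification; Wang & Hickernell (2000) define the randomization exactly.

```python
import numpy as np

def randomized_halton(m, primes, rng, max_offset=2**32):
    """One replicate of a randomized Halton sequence: a Halton
    sequence started at a random point, approximated here by a
    random integer offset per coordinate."""
    start = rng.integers(0, max_offset, size=len(primes))
    return halton(m, primes, start=start)

rng = np.random.default_rng(0)
x = randomized_halton(1000, primes=[2, 3, 5], rng=rng)  # one RQMC replicate
```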
3 Quasi-Monte Carlo EM
The Expectation-Maximization (EM) algorithm (Dempster et al., 1977) is an iterative procedure
useful to approximate the maximum likelihood estimator (MLE) in incomplete data problems. Let
y be a vector of observed data, let u be a vector of unobserved data or random effects and let θ
denote a vector of parameters. Furthermore, let f(y, u; \theta) denote the joint density of the complete
data, (y, u). Let L(\theta; y) = \int f(y, u; \theta) \, du denote the (marginal) likelihood function for this model.
The MLE, \hat{\theta}, maximizes L(\cdot; y).
In each iteration, the EM algorithm performs an expectation and a maximization step. Let
θ(t−1) denote the current parameter value. Then, in the tth iteration of the algorithm, the E-
step computes the conditional expectation of the complete data log-likelihood, conditional on the
observed data and the current parameter value,

Q(\theta | \theta^{(t-1)}) = E\left[ \log f(y, u; \theta) \mid y; \theta^{(t-1)} \right].    (5)

The tth EM update, \theta^{(t)}, maximizes (5). That is, \theta^{(t)} satisfies

Q(\theta^{(t)} | \theta^{(t-1)}) \geq Q(\theta | \theta^{(t-1)})    (6)
for all θ in the parameter space. This is also known as the M-step. The M-step is often implemented
using standard numerical methods like Newton-Raphson (see Lange, 1995). Solutions to overcome
a difficult M-step have been proposed in, for example, Meng & Rubin (1993). Given an initial value
\theta^{(0)}, the EM algorithm produces a sequence \{\theta^{(0)}, \theta^{(1)}, \theta^{(2)}, \ldots\} that, under regularity conditions
(see Boyles, 1983; Wu, 1983), converges to \hat{\theta}.
In this work we focus on the situation when the E-step does not have a closed form solution.
Wei & Tanner (1990) proposed to approximate an analytically intractable expectation in (5) by
the empirical average

\hat{Q}(\theta | \theta^{(t-1)}) \equiv \hat{Q}(\theta | \theta^{(t-1)}; u_1, \ldots, u_{m_t}) = \frac{1}{m_t} \sum_{k=1}^{m_t} \log f(y, u_k; \theta),    (7)

where u_1, \ldots, u_{m_t} are simulated from the conditional distribution f(u | y; \theta^{(t-1)}). Then, by the
law of large numbers, \hat{Q}(\theta | \theta^{(t-1)}) will be a reasonable approximation to Q(\theta | \theta^{(t-1)}) if m_t is large
enough.
We consider a modification of (7) suitable for RQMC sampling. Let u_1^{(j)}, \ldots, u_{m_t}^{(j)}, j = 1, \ldots, r,
be r independent RQMC sequences of length m_t, each simulated from f(u | y; \theta^{(t-1)}). (The details
of how to simulate a RQMC sequence from f(u | y; \theta^{(t-1)}) are deferred until Subsection 3.2.) Then,
an unbiased estimate of (5) is given by the pooled estimate

\hat{Q}_P(\theta | \theta^{(t-1)}) = \frac{1}{r} \sum_{j=1}^{r} \hat{Q}^{(j)}(\theta | \theta^{(t-1)}),    (8)

where \hat{Q}^{(j)}(\theta | \theta^{(t-1)}) = \hat{Q}(\theta | \theta^{(t-1)}; u_1^{(j)}, \ldots, u_{m_t}^{(j)}) in (7). The tth Quasi-Monte Carlo EM (QMCEM)
update, \tilde{\theta}^{(t)}, maximizes \hat{Q}_P(\cdot | \theta^{(t-1)}).
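In code, the estimates (7) and (8) are plain averages. The sketch below assumes the model supplies a complete-data log-likelihood `loglik(y, u, theta)` and that the r sequences of draws have already been generated (Subsection 3.2); names and signatures are ours.

```python
import numpy as np

def Q_hat(theta, u_seq, y, loglik):
    """Equation (7): average complete-data log-likelihood over one
    sequence of draws u_1, ..., u_mt."""
    return np.mean([loglik(y, u_k, theta) for u_k in u_seq])

def Q_pooled(theta, u_sequences, y, loglik):
    """Equation (8): pool the r per-sequence estimates."""
    return np.mean([Q_hat(theta, u_seq, y, loglik) for u_seq in u_sequences])
```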
3.1 Increasing the length of the RQMC sequences
We have pointed out earlier that the Monte Carlo sample sizes m_t should be increased successively as
the algorithm moves along. In fact, Booth et al. (2001) argue that MCEM will never converge if m_t
is held fixed across iterations because of a persistent Monte Carlo error (see also Chan & Ledolter,
1995). Earlier versions of the method choose the Monte Carlo sample sizes in a deterministic
fashion before the start of the algorithm (e.g. McCulloch, 1997), but a deterministic allocation of
Monte Carlo resources that works well in one problem may result in a very inefficient (or inaccurate)
algorithm in another problem. Thus, data-dependent (and user-independent) sample size rules are
necessary in order to implement MCEM in an automated way. Booth & Hobert (1999) base the
decision of a sample size increase on the noise in the parameter updates (see also Levine & Casella,
2001; Levine & Fan, 2003).
Let \tilde{\theta}^{(t-1)} denote the current QMCEM parameter value and let \tilde{\theta}^{(t)} denote the maximizer of
\hat{Q}_P(\cdot | \tilde{\theta}^{(t-1)}) in (8) based on r independent RQMC sequences, each of length m_t. Thus, \tilde{\theta}^{(t)} satisfies

F_P(\tilde{\theta}^{(t)} | \tilde{\theta}^{(t-1)}) = 0,    (9)

where we define F_P(\theta | \theta') = \partial \hat{Q}_P(\theta | \theta') / \partial \theta. Let \theta^{(t)} denote the parameter update of the deterministic EM algorithm, that is, \theta^{(t)} satisfies

F(\theta^{(t)} | \tilde{\theta}^{(t-1)}) = 0,    (10)

where, in similar fashion to above, we define F(\theta | \theta') = \partial Q(\theta | \theta') / \partial \theta. Thus, a first order Taylor
expansion of F_P(\tilde{\theta}^{(t)} | \tilde{\theta}^{(t-1)}) about \theta^{(t)} yields

(\tilde{\theta}^{(t)} - \theta^{(t)})^T S_P(\theta^{(t)} | \tilde{\theta}^{(t-1)}) \approx -F_P(\theta^{(t)} | \tilde{\theta}^{(t-1)}),    (11)

where we define the matrix S_P(\theta | \theta') = \partial^2 \hat{Q}_P(\theta | \theta') / \partial \theta \partial \theta^T. Under RQMC sampling, \hat{Q}_P is an
unbiased estimate of Q. Assuming mild regularity conditions, it follows that

E[F_P(\theta^{(t)} | \tilde{\theta}^{(t-1)})] = F(\theta^{(t)} | \tilde{\theta}^{(t-1)}) = 0.    (12)

Therefore, the expected value of \tilde{\theta}^{(t)} is \theta^{(t)} and its variance-covariance matrix is given by

\text{Var}(\tilde{\theta}^{(t)}) = \left[ S_P(\theta^{(t)} | \tilde{\theta}^{(t-1)}) \right]^{-1} \text{Var}\left( F_P(\theta^{(t)} | \tilde{\theta}^{(t-1)}) \right) \left[ S_P(\theta^{(t)} | \tilde{\theta}^{(t-1)}) \right]^{-1}.    (13)
Under regular Monte Carlo sampling, it follows that, for a large enough Monte Carlo sample
size, \tilde{\theta}^{(t)} is approximately normally distributed with the mean and variance specified above. Under
RQMC sampling, however, the accuracy of the normal approximation may depend on the number
r of independent RQMC sequences. In Section 4 we consider a range of values for r in order to
investigate its effect on QMCEM.

In our implementations we estimate \text{Var}(\tilde{\theta}^{(t)}) by substituting \tilde{\theta}^{(t)} for \theta^{(t)} in (13) and estimate
\text{Var}(F_P(\theta^{(t)} | \tilde{\theta}^{(t-1)})) via
\frac{1}{r^2} \sum_{j=1}^{r} \left( \frac{\partial}{\partial \theta} \hat{Q}^{(j)}(\theta | \tilde{\theta}^{(t-1)}) \right) \left( \frac{\partial}{\partial \theta} \hat{Q}^{(j)}(\theta | \tilde{\theta}^{(t-1)}) \right)^T \Bigg|_{\theta = \tilde{\theta}^{(t)}}.    (14)
Larger values of r should result in a more accurate estimate of \text{Var}(\tilde{\theta}^{(t)}). However, we also
pointed out that smaller values of r should result in a better accuracy of the Monte Carlo estimate
in (8), since this takes better advantage of the low-discrepancy property of each individual sequence
u_1^{(j)}, \ldots, u_{m_t}^{(j)}. We investigate the impact of this trade-off on the overall efficiency of the method in
Section 4.
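The sketch below shows how (13) and (14) combine computationally. The per-sequence gradients and the pooled Hessian are model-specific; we assume they are available, for instance from analytic differentiation of (7).

```python
import numpy as np

def var_qmcem_update(theta_tilde, scores, hessian):
    """Sandwich estimate of Var(theta_tilde): equation (13), with the
    outer-product estimate (14) plugged in for Var(F_P).
    scores: r x p array; row j holds dQ^(j)/dtheta at theta_tilde."""
    r = scores.shape[0]
    var_F = scores.T @ scores / r**2             # equation (14)
    S_inv = np.linalg.inv(hessian(theta_tilde))  # [S_P]^{-1}
    return S_inv @ var_F @ S_inv                 # equation (13)
```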
The QMCEM algorithm proceeds as follows. Following Booth & Hobert's recommendation, we
measure the noise in the QMCEM update \tilde{\theta}^{(t)} by constructing a (1 - \alpha) \times 100\% confidence ellipsoid
about the deterministic EM update \theta^{(t)}, using the normal approximation for \tilde{\theta}^{(t)}. If this ellipsoid
contains the previous parameter value \tilde{\theta}^{(t-1)}, then we conclude that the system is too noisy and
we increase the length m_t of the RQMC sequences. Booth et al. (2001) argue that the sample sizes
should be increased at an exponential rate. Thus, we increase the sample size to m_{t+1} := (1 + \kappa) m_t,
where \kappa is a small number, typically \kappa = 0.2, 0.3, 0.4. Since stochastic algorithms like MCEM can
satisfy deterministic stopping rules purely by chance, it is recommended to continue the method
until the stopping rule is satisfied for several consecutive iterations (see also Booth & Hobert,
1999). Thus, we stop the algorithm when the relative change in two successive parameter updates
is smaller than some small number \delta, \delta > 0, for 3 consecutive iterations.
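A hedged sketch of these rules in one loop; `qmcem_step` stands in for a full E- and M-step that returns the update \tilde{\theta}^{(t)} together with its estimated covariance (13). This is one possible realization of the recipe above, not the paper's Ox code.

```python
import numpy as np
from scipy.stats import chi2

def automated_qmcem(qmcem_step, theta0, m0, r,
                    alpha=0.25, kappa=0.2, delta=0.01):
    theta, m_t, quiet = np.asarray(theta0, dtype=float), m0, 0
    while quiet < 3:
        theta_new, cov = qmcem_step(theta, m_t, r)
        diff = theta_new - theta
        # previous value inside the (1 - alpha) ellipsoid: too noisy
        if diff @ np.linalg.solve(cov, diff) <= chi2.ppf(1 - alpha, theta.size):
            m_t = int(np.ceil((1 + kappa) * m_t))
        # stop after 3 consecutive small relative changes; the small
        # constant guards against division by zero-valued parameters
        rel = np.max(np.abs(diff) / (np.abs(theta) + 1e-10))
        quiet = quiet + 1 if rel < delta else 0
        theta = theta_new
    return theta
```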
3.2 Laplace Importance Sampling to generate RQMC sequences
Recall that the pooled estimate in (8) is based on r independent RQMC sequences u_1^{(j)}, \ldots, u_{m_t}^{(j)},
j = 1, \ldots, r, simulated from f(u | y; \theta^{(t-1)}). In this section we demonstrate how to generate randomized
Halton sequences using Laplace importance sampling.
Laplace importance sampling has proven useful for drawing approximate samples from f(u | y; \theta)
in many instances (see Booth & Hobert, 1999; Kuk, 1999). Laplace importance sampling attempts
to find an importance sampling distribution whose mean and variance match the mode and curvature of f(u | y; \theta). More specifically, suppressing the dependence on y, let

l(u; \theta) = \log f(y, u; \theta)    (15)

denote the complete data log-likelihood and let l'(u; \theta) and l''(u; \theta) denote its first and second
derivatives in u, respectively. Suppose that \hat{u} denotes the maximizer of l, satisfying l'(\hat{u}; \theta) = 0.
Then the Laplace approximations to the mean and variance of f(u | y; \theta) are \mu(\theta) = \hat{u} and
\Sigma(\theta) = -\{l''(\hat{u}; \theta)\}^{-1}, respectively (e.g. De Bruijn, 1958). Booth & Hobert (1999) as well as Kuk
(1999) propose to use a multivariate normal or multivariate t importance sampling distribution,
shifted and scaled by \mu(\theta) and \Sigma(\theta), respectively. Let f_{Lap}(u | y; \theta) denote the resulting Laplace
importance sampling distribution.
Recall that by RQMC property 1), every element of a RQMC sequence has a uniform distribution over C^d. Let x_k be the kth element of a randomized Halton sequence. Using a suitable
transformation (e.g. Robert & Casella, 1999), we can generate a d-vector of i.i.d. normal or t variates. Shifting and scaling this vector by \mu(\theta) and \Sigma(\theta) results in a draw u_k from f_{Lap}(u | y; \theta).
Thus, using r independent randomized Halton sequences of length m_t, x_1^{(j)}, \ldots, x_{m_t}^{(j)}, j = 1, \ldots, r,
we obtain r independent sequences u_1^{(j)}, \ldots, u_{m_t}^{(j)} from f_{Lap}(u | y; \theta).
Booth & Hobert (1999) and Kuk (1999) successfully use Laplace importance sampling for fitting
generalized linear mixed models. In the following we apply the method to generalized linear mixed
models for data exhibiting spatial correlation.
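For one sequence, the whole construction can be sketched as follows. We assume the user supplies the negative complete-data log-likelihood `neg_l(u)` and its Hessian `neg_l_hess(u)` for the current \theta; `randomized_halton` is the sketch from Subsection 2.3. We show the normal variant; the t variant used in Section 4 would replace `norm.ppf` by the corresponding t quantile function with an appropriate rescaling.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm
from scipy.linalg import cholesky

def laplace_rqmc_draws(neg_l, neg_l_hess, u0, m, primes, rng):
    """m approximate draws from f(u | y; theta) via Laplace importance
    sampling driven by one randomized Halton sequence."""
    mu = minimize(neg_l, u0, method="BFGS").x   # mode u-hat of l(u; theta)
    Sigma = np.linalg.inv(neg_l_hess(mu))       # -{l''(u-hat; theta)}^{-1}
    L = cholesky(Sigma, lower=True)             # Sigma = L L^T
    x = randomized_halton(m, primes, rng)       # m uniforms on C^d
    z = norm.ppf(np.clip(x, 1e-12, 1 - 1e-12))  # i.i.d. standard normals
    return mu + z @ L.T                         # shift and scale
```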
4 Application: A Geostatistical Model of Online Purchases
In this section we consider sales data from an online book publisher and retailer. The publisher
sells its titles online in print form and, more recently, also in PDF form. The publisher has good
reason to believe that a customer's preference for either print or PDF form varies significantly
with his or her geographical location. In fact, since the PDF form is directly downloaded from the
publisher's web site, it requires a reliable and typically fast internet connection. However, the
availability of reliable internet connections varies greatly across regions. Moreover, directly
downloaded PDF files provide content immediately, without the wait for shipment as in the case of
a printed book. Thus, shipping times can also influence a customer's preference. The preference can
further be affected by a customer's access to good quality printers or his/her technology readiness,
all of which often exhibit strong local variability.
Data exhibiting spatial correlation can be modelled using generalized linear mixed models (e.g.
Breslow & Clayton, 1993). Diggle et al. (1998) refer to these spatial applications of generalized
linear mixed models as "model-based geostatistics." These spatial mixed models are challenging
from a computational point of view since they often involve approximating rather high dimensional
integrals. In the following we consider a set of data leading to an analytically intractable likelihood-integral of dimension 16.
Let \{z_i\}_{i=1}^{d}, z_i = (z_{i1}, z_{i2}), denote the spatial coordinates of the observed responses \{y_i\}_{i=1}^{d}.
For example, z_{i1} and z_{i2} could denote the longitude and latitude of the observation y_i. While y_i
could represent a variety of response types, we focus here on the binomial case only. For instance,
y_i could indicate whether or not a person living at location z_i has a certain disease, or whether or
not this person has a preference for a certain product. One of the modelling goals is to account
for the possibility that two people living in close geographic proximity are more likely to share the
same disease or the same preference.
Let u = (u_1, \ldots, u_d) be a vector of random effects. Assume that, conditional on u_i, the responses
y_i arise from the model

y_i | u_i \sim \text{Binomial}\left( n_i, \frac{\exp(\beta + u_i)}{1 + \exp(\beta + u_i)} \right),    (16)
where \beta is an unknown regression coefficient. Assume furthermore that u follows a multivariate
normal distribution with mean zero and a covariance structure such that the correlation between two
random effects decays with the geographical distance between the associated observations. For
example, assume that

\text{Cov}(u_i, u_j) = \sigma^2 \exp\{-\alpha \|z_i - z_j\|\},    (17)

where \|\cdot\| denotes the Euclidean norm. While different modelling alternatives exist (see, for example,
Diggle et al., 1998), we will use the above model to investigate the efficiency of Quasi-Monte Carlo
MCEM implementations for estimating the parameter vector \theta = (\beta, \sigma^2, \alpha).
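As an illustration of (16) and (17), the following sketch builds the covariance matrix and simulates one synthetic data set. The coordinates, trial sizes n_i and parameter values are arbitrary placeholders, not the retail data analyzed below.

```python
import numpy as np

def spatial_cov(z, sigma2, alpha):
    """Equation (17): Cov(u_i, u_j) = sigma^2 exp(-alpha ||z_i - z_j||)."""
    dist = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    return sigma2 * np.exp(-alpha * dist)

rng = np.random.default_rng(1)
z = rng.uniform(size=(16, 2))          # 16 locations (z_i1, z_i2)
beta, sigma2, alpha = 0.0, 1.0, 1.0    # placeholder parameter values
u = rng.multivariate_normal(np.zeros(16), spatial_cov(z, sigma2, alpha))
p = 1.0 / (1.0 + np.exp(-(beta + u)))  # success probabilities in (16)
y = rng.binomial(n=5, p=p)             # responses with n_i = 5 trials each
```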
We analyze a set of online retail data for the Washington, DC, area. Washington is a very
diverse area with respect to aspects such as socio-economic factors and infrastructure. This
diversity often expresses itself in strongly varying regional and local customer preferences. The data
set consists of 39 customers who accessed the publisher’s web site and either purchased the title in
print form or in PDF. In addition to a customer’s purchasing choice, the publisher also recorded
the customer’s geographical location. Geographical location can easily be obtained (at least ap-
proximately) through the customer’s ZIP code. ZIP code information can then be transformed into
longitude and latitude coordinates. After aggregating customers from the same ZIP code with
the same preference, we obtained d = 16 distinct geographical locations. Let ni denote the number
of purchases from location i and let yi denote the number of PDF purchases thereof. Figure 2
displays the data.
Figure 2 about here
Quasi-Monte Carlo has been found to improve upon the efficiency of classical Monte Carlo
methods in a variety of settings. For instance, Bhat (2001) reports efficiency gains via the Halton
sequence in a logit model for integral dimensions ranging from 1 to 5. Lemieux & L'Ecuyer (1998),
on the other hand, consider integral dimensions as large as 120 and find efficiency improvements
for the pricing of Asian options. In our example, the correlation structure of the random effects in
equation (17) causes the likelihood function (and therefore also the E-step of the EM algorithm)
to include an analytically intractable integral of dimension 16. Indeed, the (marginal) likelihood
function for the model in (16) and (17) can be written as
L(\theta; y) \propto \int \left( \prod_{i=1}^{d} f(y_i | u_i; \theta) \right) \frac{\exp\{-\frac{1}{2} u^T \Sigma^{-1} u\}}{|\Sigma|^{1/2}} \, du,    (18)

where u = (u_1, \ldots, u_{16})^T contains the random effects corresponding to the 16 distinct locations
and \Sigma is a 16 \times 16 matrix with elements \sigma_{ij} = \text{Cov}(u_i, u_j).
The evaluation of high dimensional integrals is computationally burdensome. We conducted
a simulation study to investigate the efficiency of QMC approaches relative to that of classical
Monte Carlo. Table 1 shows the results for three different QMCEM algorithms, using r = 5,
r = 10 and r = 30 RQMC sequences, respectively, and compares them to an implementation of
MCEM using classical Monte Carlo techniques. We can see that the Monte Carlo standard errors
of the parameter estimates of \theta = (\beta, \sigma^2, \alpha) are very similar across the estimation methods,
indicating that all 4 methods estimate the parameters with (on average) comparable accuracy. However,
the total simulation effort required to obtain this accuracy differs greatly. Indeed, while classical
Monte Carlo requires an average number of 800,200 simulated vectors (each of dimension 16!),
it only takes 20,836 for QMC (using r = 5 RQMC sequences). This is a reduction in the total
simulation effort by a factor of almost 40! It is also interesting to note that among the 3 different
QMC approaches, choosing r = 30 RQMC sequences results in an (average) total simulation effort
of 30,997 simulated vectors, compared to only 20,836 for r = 5.
Table 1 about here
The reduction in the total simulation effort that is possible with the use of QMC methods is
intriguing. The MCEM algorithm usually spends most of its simulation effort in the final iterations,
when the algorithm is in the vicinity of the MLE. This has already been observed by, for example,
Booth & Hobert (1999) or McCulloch (1997). The reason for this is the convergence behavior of
the underlying deterministic EM algorithm. EM usually takes large steps in the early iterations,
but the size of the steps reduces drastically as EM approaches \hat{\theta}. The step size in the tth iteration
of EM can be thought of as the signal that is transmitted to MCEM. However, due to the error
in the Monte Carlo approximation of the E-step in (7), MCEM receives only a noisy version of
that signal. While the signal-to-noise ratio is large in the early iterations of MCEM, it declines
continuously as MCEM approaches \hat{\theta}. This makes larger Monte Carlo sample sizes necessary in
order to increase the accuracy of the approximation in (7) and thus to reduce the noise. Table
1 shows that QMC methods, due to their superior ability to estimate an intractable integral
accurately, manage to reduce that noise with smaller sample sizes. The result is a smaller total
simulation effort required by QMC.
Table 1 also shows that among the 3 different QMCEM algorithms, implementations that use
fewer but longer low-discrepancy sequences result in a lower total simulation effort than a large
number of short sequences. Indeed, the simulation effort for r = 30 RQMC sequences is about 50%
higher than that for r = 5 or r = 10. We pointed out in Section 2 that for a given total simulation
amount r \cdot m, smaller values of r paired with larger values of m should result in a more accurate
integral-estimate. On the other hand, the trade-off for using small values of r is a less accurate
variance estimate in (14). In order to implement MCEM using randomized Halton sequences, a
balance has to be struck between a more accurate integral-estimate (i.e. less noise) and a more
accurate variance estimate. In our example, we found this balance for values of r between 5 and
10. We also experimented with values smaller than 5 and frequently encountered problems with
the numerical stability of the estimate of the covariance matrix in (14).
In the final paragraphs of this section we take a closer look at the noise of the QMCEM
algorithm and compare it to classical MCEM. Figure 3 visualizes the Monte Carlo error for three
different Monte Carlo estimation methods: classical Monte Carlo using random sampling (column
1), randomized Quasi-Monte Carlo with r = 5 RQMC sequences (column 2) and pure Quasi-Monte
Carlo without randomization (column 3).
Figure 3 about here
We can see that for classical Monte Carlo, the average parameter update (thick solid line) is very
volatile and has wide confidence bounds (dotted lines). This suggests that the Monte Carlo error is
substantial. This is in strong contrast to QMC: for pure QMC sampling the parameter updates
are significantly less volatile, with much tighter confidence bounds. Notice that we allocated the
same simulation effort to both simulation methods! It takes classical MCEM much larger sample
sizes to reduce the noise to the level achieved under QMC sampling.
We have argued at the beginning of this paper that, in order to implement MCEM in an
automated way, the ability to estimate the error in the Monte Carlo approximation is essential.
Randomized QMC methods provide this ability. While randomized Halton sequences retain the low-discrepancy property (and thus estimate the integral with a higher accuracy than classical Monte
Carlo), randomization may not come for free. Indeed, the second column of Figure 3 shows that,
while the error reduction is still substantial compared to a classical Monte Carlo approach, the
system is noisier than under pure QMC sampling.
5 Conclusion
In this paper we have demonstrated how recent advances in randomized Quasi-Monte Carlo can
be used to implement the MCEM algorithm in an automated, data-driven way. The empirical
investigations provide encouraging evidence that this Quasi-Monte Carlo EM algorithm can lead
to significant efficiency gains over implementations using regular Monte Carlo methods.
We focused our investigations in this work on the randomized Halton sequence only. Other
randomized Quasi-Monte Carlo methods exist. See, for example, Owen (1998a) or L’Ecuyer &
Lemieux (2002). It could be a rewarding topic for future research to investigate the benefits of
different Quasi-Monte Carlo methods for the implementation of Monte Carlo EM (and also other
stochastic estimation methods that are frequently encountered in the statistics literature).
Acknowledgements
All simulations in this work were performed using the programming language Ox (Doornik, 2001).
References
Bhat, C. (2001). Quasi-random maximum simulated likelihood estimation for the mixed multinomial logit model. Transportation Research Part B 35, 677–693.
Booth, J. G. & Hobert, J. P. (1999). Maximizing generalized linear mixed model likelihoods
with an automated Monte Carlo EM algorithm. Journal of the Royal Statistical Society B 61,
265–285.
Booth, J. G., Hobert, J. P. & Jank, W. (2001). A survey of Monte Carlo algorithms for
maximizing the likelihood of a two-stage hierarchical model. Statistical Modelling 1, 333–349.
Bouleau, N. & Lepingle, D. (1994). Numerical Methods for Stochastic Processes. New York:
Wiley.
Boyles, R. A. (1983). On the convergence of the EM algorithm. Journal of the Royal Statistical
Society B 45, 47–50.
Breslow, N. E. & Clayton, D. G. (1993). Approximate inference in generalized linear mixed
models. Journal of the American Statistical Association 88, 9–25.
Caflisch, R., Morokoff, W. & Owen, A. (1997). Valuation of mortgage-backed securities
using Brownian bridges to reduce effective dimension. Journal of Computational Finance 1,
27–46.
Chan, K. S. & Ledolter, J. (1995). Monte Carlo EM estimation for time series models involving
counts. Journal of the American Statistical Association 90, 242–252.
De Bruijn, N. G. (1958). Asymptotic Methods in Analysis. Amsterdam: North-Holland.
Dempster, A. P., Laird, N. M. & Rubin, D. B. (1977). Maximum likelihood from incomplete
data via the EM algorithm. Journal of the Royal Statistical Society B 39, 1–22.
Diggle, P. J., Tawn, J. A. & Moyeed, R. A. (1998). Model-based geostatistics. Journal of the
Royal Statistical Society C (Applied Statistics) 47, 299–350.
Doornik, J. A. (2001). Ox: Object Oriented Matrix Programming. London: Timberlake.
Fang, K.-T. & Wang, Y. (1994). Number Theoretic Methods in Statistics. New York: Chapman
& Hall.
Faure, H. (1982). Discrépance de suites associées à un système de numération (en dimension s).
Acta Arithmetica 41, 337–351.
Halton, J. H. (1960). On the efficiency of certain quasi-random sequences of points in evaluating
multi-dimensional integrals. Numerische Mathematik 2, 84–90.
Hobert, J. P. (2000). Hierarchical models: A current computational perspective. Journal of the
American Statistical Association 95, 1312–1316.
Kuk, A. Y. C. (1999). Laplace importance sampling for generalized linear mixed models. Journal
of Statistical Computation and Simulation 63, 143–158.
Lange, K. (1995). A gradient algorithm locally equivalent to the EM algorithm. Journal of the
Royal Statistical Society B 57, 425–437.
L'Ecuyer, P. & Lemieux, C. (2002). Recent advances in randomized Quasi-Monte Carlo methods. In Modeling Uncertainty: An Examination of Stochastic Theory, Methods, and Applications,
M. Dror, P. L'Ecuyer & F. Szidarovszky, eds. Kluwer Academic Publishers.
Lemieux, C. & L'Ecuyer, P. (1998). Efficiency improvement by lattice rules for pricing Asian
options. In Proceedings of the 1998 Winter Simulation Conference. IEEE Press.
Levine, R. & Fan, J. (2003). An automated (Markov Chain) Monte Carlo EM algorithm. Tech.
rep., San Diego State University.
Levine, R. A. & Casella, G. (2001). Implementations of the Monte Carlo EM algorithm. Journal
of Computational and Graphical Statistics 10, 422–439.
McCulloch, C. E. (1997). Maximum likelihood algorithms for generalized linear mixed models.
Journal of the American Statistical Association 92, 162–170.
Meng, X.-L. & Rubin, D. B. (1993). Maximum likelihood estimation via the ECM algorithm:
A general framework. Biometrika 80, 267–278.
Morokoff, W. J. & Caflisch, R. E. (1995). Quasi-Monte Carlo integration. Journal of
Computational Physics 122, 218–230.
Niederreiter, H. (1992). Random Number Generation and Quasi-Monte Carlo Methods.
Philadelphia: SIAM.
Owen, A. (1998a). Scrambling Sobol’ and Niederreiter-Xing points. Journal of Complexity 14,
466–489.
Owen, A. B. (1998b). Monte Carlo extension of Quasi-Monte Carlo. In 1998 Winter Simulation
Conference Proceedings. New York: Springer, pp. 571–577.
Pagès, G. (1992). Van der Corput sequences, Kakutani transforms and one-dimensional numerical
integration. Journal of Computational and Applied Mathematics 44, 21–39.
Robert, C. P. & Casella, G. (1999). Monte Carlo Statistical Methods. New York: Springer.
Sobol, I. M. (1967). Distribution of points in a cube and approximate evaluation of integrals.
U.S.S.R. Computational Mathematics and Mathematical Physics 7, 784–802.
Wang, X. & Hickernell, F. J. (2000). Randomized Halton sequences. Mathematical and
Computer Modelling 32, 887–899.
Wei, G. C. G. & Tanner, M. A. (1990). A Monte Carlo implementation of the EM algorithm and
the poor man’s data augmentation algorithms. Journal of the American Statistical Association
85, 699–704.
Wu, C. F. J. (1983). On the convergence properties of the EM algorithm. The Annals of Statistics
11, 95–103.
Figure 1: 2500 points in the unit square. The upper plot ("Regular Monte Carlo") shows the result of regular Monte Carlo sampling, that is, 2500 points selected at random. Random points tend to form clusters, over-sampling the unit square in some places; this leads to gaps in other places, where the sample space is not explored at all. The lower plot ("Quasi-Monte Carlo") shows the result of Quasi-Monte Carlo sampling: 2500 points of a two-dimensional Halton sequence.
Figure 2: Geographical distribution of PDF purchases for Washington, DC. The upper plot ("Geographical Distribution of Data"; axes longitude and latitude) shows the geographical borders of Washington, DC, as well as the geographical locations of the 39 purchases of PDF or print. The lower plot ("Proportion of PDF Purchases per Location"; axes longitude, latitude and proportion) displays the geographical scatter of the relative proportion of PDF purchases.
Figure 3: Monte Carlo error and Quasi-Monte Carlo error (panels Beta, Sigma and Alpha per column). Starting MCEM near the MLE, we performed 100 iterations using a fixed Monte Carlo sample size of r m_t \equiv 1000 for all t = 1, \ldots, 100. We repeated this experiment 50 times for a) MCEM using classical Monte Carlo sampling (column 1); b) randomized Quasi-Monte Carlo with r = 5 (column 2); c) pure Quasi-Monte Carlo without randomization, i.e. r = 1 (column 3). For each parameter value we plotted the average of the 50 iteration histories (thick, solid lines) as well as pointwise 95% confidence bounds (dotted lines).
Table 1: Spatial model. The table investigates the efficiency of Quasi-Monte Carlo implementations of MCEM for fitting geostatistical models. We investigate three different Quasi-Monte Carlo (QMC) algorithms using r = 5, 10 and 30 independent RQMC sequences, respectively. These RQMC sequences are obtained via randomized Halton sequences, using Laplace importance sampling based on a t distribution with 10 degrees of freedom. We benchmark these three QMC algorithms against an implementation of MCEM based on regular Monte Carlo (MC) sampling using the same Laplace importance sampler. We start each algorithm from (\beta^{(0)}, \sigma^{2(0)}, \alpha^{(0)}) = (0, 1, 1) and increase the length of the RQMC sequences according to Section 3.1 using \alpha = 0.25 and \kappa = 0.2. The algorithm is terminated if the relative difference in two successive parameter updates falls below \delta = 0.01 for 3 consecutive iterations. For each of the four MCEM implementations we performed this experiment 50 times, recording the final parameter values \hat{\beta}_i, \hat{\sigma}^2_i and \hat{\alpha}_i and the total number of simulated vectors, N = \sum_{j=1}^{T_i} r \cdot m_j, where T_i denotes the final iteration number (i = 1, \ldots, 50). The table displays the Monte Carlo average (AVG) and the Monte Carlo standard error (SE) of these values. For instance, for the regression parameter \beta it displays the average \bar{\beta} over the 50 replications and the Monte Carlo standard error s_\beta / \sqrt{50}, where s_\beta denotes the sample standard deviation over the 50 replicates.

                     \beta    \sigma^2   \alpha        N
MC          AVG    -0.973      2.879    13.095    800,200
            SE      0.005      0.010     0.028    230,347
QMC (r=5)   AVG    -0.978      2.856    13.173     20,836
            SE      0.005      0.011     0.032        908
QMC (r=10)  AVG    -0.975      2.881    13.094     21,768
            SE      0.005      0.013     0.035        941
QMC (r=30)  AVG    -0.969      2.858    13.148     30,997
            SE      0.005      0.011     0.031      1,234