
Bayesian-Optimal Design via Interacting Particle Systems

Billy Amzal, Frédéric Y. Bois, Eric Parent, and Christian P. Robert

We propose a new stochastic algorithm for Bayesian-optimal design in nonlinear and high-dimensional contexts. Following Peter Müller, we solve an optimization problem by exploring the expected utility surface through Markov chain Monte Carlo simulations. The optimal design is the mode of this surface, considered as a probability distribution. Our algorithm relies on a "particle" method to efficiently explore high-dimensional multimodal surfaces, with simulated annealing to concentrate the samples near the modes. We first test the method on an optimal allocation problem for which the explicit solution is available, to compare its efficiency with that of a simpler algorithm. We then apply our method to a challenging medical case study in which an optimal treatment protocol needs to be determined. For this case, we propose a formalization of the problem in the framework of Bayesian decision theory, taking into account physicians' knowledge and motivations. We also briefly review further improvements and alternatives.

KEY WORDS: Bayesian decision theory; Experimental design; Markov chain Monte Carlo; Particle methods; Simulated annealing; Stochastic optimization.

Billy Amzal is a Doctoral Candidate, ENGREF, GRESE Laboratory, 75015 Paris, France (E-mail: [email protected]). Frédéric Y. Bois is Senior Modeler-Statistician and Toxicologist, INERIS, 60550 Verneuil-en-Halatte, France. Eric Parent is Professor of Statistics, GRESE Laboratory, 75015 Paris, France. Christian P. Robert is Professor of Statistics, CEREMADE, Paris-Dauphine University, 75017 Paris, France. Partial funding for this work was provided by the French Ministry of Research and Technology (project DIADEME, decision 02 C 0141). The authors thank Jacques Bernier for helpful comments and Andrew Gelman for a careful reading of this work.

© 2006 American Statistical Association. Journal of the American Statistical Association, June 2006, Vol. 101, No. 474, Theory and Methods. DOI 10.1198/016214505000001159

1. INTRODUCTION

This article gives a full Bayesian simulation-based treatment for the problem of optimal experimental design, including workable implementation algorithms, adapted to complex statistical models and decision spaces.

Statistical analysis and inference need to be closely associated with the process of data collection. Practitioners have been warned about the limits of using standard techniques, such as regression with observational data (Box, Hunter, and Hunter 1978, chap. 14.7), latent variables, and so on. The statistician's answer is always the same: Experiments must be carefully designed.

The art of statistical design has been widely developed in the classical frequentist framework, both in theory (Scheffé 1961) and in practice (Atkinson and Donev 1992). Although many articles are still published in this area, the classical approach fails to be realistic in many routine applications, because of two of its requirements. First, optimization criteria such as minimal variance do not necessarily coincide with the goals of the decision makers. Second, whereas traditional experimental design theory is well suited for linear or linearized models, it is not uncommon nowadays to have complex nonlinear models with high-dimensional parameters (Goldstein 1995), hierarchical structures (Müller and Rosner 1997), or a dependence structure (Sampson and Guttorp 1992; Müller, Sansó, and De Iorio 2003) highly sensitive to the design (e.g., epidemic models and spatial statistics). In such cases only heuristic or approximate methods are available, by linearizing the model (Mentré, Mallet, and Baccar 1997) or by constraining the exploration of design space (Bois 1999).

The Bayesian point of view for optimal design relies on the maximum expected utility (MEU) principle (see, e.g., DeGroot 1970). The preferences of the decision maker are assumed to be encoded by a utility function u(d, θ, y) describing the merit of choosing the design d and getting the result y when the unknowns (parameters and latent variables) of the model take the value θ. Of course, the form of the criterion should be made specific to the application. It is also assumed that a probability density function p(θ, y|d) is available that encodes direct probabilistic judgment about credible values of the unobserved y and the unknown quantities θ for any possible design d. It includes knowledge coming from learning samples, previous experiments, personal expertise, and other relevant information.

The Bayesian optimal design d* maximizes the expectation of u with respect to the random quantities θ and y,

d* = argmax_{d ∈ 𝒟} U(d) = argmax_{d ∈ 𝒟} ∫∫_{𝒴×Θ} u(d, θ, y) p(θ, y|d) dy dθ.

In this article it is assumed that such a d* exists. This Bayesian solution is easier said than done, however, because multiple integrations and a maximization over a large decision space are required to get the optimum. Various stochastic algorithms have been proposed to approximate the maximization and integration problem (see Wakefield 1994; Clyde and Chaloner 1996; Carlin, Kadane, and Gelfand 1998; Müller 1999). This article extends these approaches with a new simulation-based method for general models, based on recent particle algorithms. The method is implemented on a real-world case of medical decision support: for a caffeine treatment for premature neonates, we optimize the daily doses and the sampling day, according to the specified goals of physicians.

In Section 2 we recall the theoretical material for simulation-based optimal design through MCMC. We formulate the maximization problem as a problem of simulation from the utility surface, seen as a probability density over the design space. Simulated annealing can be added to accentuate the peaks of that surface to make the search easier. In Section 3 we introduce the approach of optimization by particle systems. In Section 4 we present a general optimization algorithm that exploits interaction between particles for more efficient simulations close to the modes and also present an alternative version, the resampling-Markov algorithm. We analyze the convergence issues in Section 5, and in Section 6 solve a toy example of an optimal decision problem (for which the solution is known) to check the performance of the algorithm. We then present in Section 7 a real-world case study and derive the optimal design for caffeine treatment of neonates. In Section 8 we discuss the efficiency of the method and propose improvements or alternatives, and in Section 9 offer a general conclusion and highlights.

2. MARKOV CHAIN MONTE CARLO METHODS FOR BAYESIAN-OPTIMAL DESIGN

2.1 Optimization Seen as an Inference Problem

Two main problems complicate analytic methods for maximizing expected utilities. First, the decision space can be fairly intricate in practical applications. For example, the decision can include the following components:

- The sample size of future experiments to be conducted (for some recent discussions of this issue, see Smeeton and Adcock 1997; Parent, Chaouche, and Girard 1995; Parent, Lang, and Girard 1996)
- The times at which we should observe a random phenomenon (with a fixed or unknown total number of measurements, equally spaced or not on the time axis)
- The number of individuals to be tested in an experiment, with individuals in turn being surveyed with a fixed measurement budget to be allocated
- A network from a spatial grid, with measurement stations to be chosen.

Second, even if the decision space 𝒟 is small or easily parameterized, the utility function may not be simple to integrate, so the expected utility U cannot be obtained explicitly.

In this work, U is assumed to be positive and integrable (although not necessarily analytically integrable). We follow the idea of Müller (1999) to consider U as a probability density over the design space and implement a simulation-based maximization algorithm. We introduce an augmented probability distribution h over the triplet (d, θ, y) ∈ 𝒟 × Θ × 𝒴,

h(d, θ, y) ∝ u(d, θ, y) p(θ, y|d) = u(d, θ, y) p(y|θ, d) p(θ).

We can compute h, up to a normalizing constant, because we know the prior distribution p(θ) over Θ, the model likelihood p(y|θ, d), and the utility u(d, θ, y). As desired, the marginal distribution h(d) is proportional to U(d). In this context, we can use Markov chain Monte Carlo (MCMC) techniques to simulate (d, θ, y) under h and search for the marginal mode of ∫_{Θ×𝒴} h(d, θ, y) dθ dy. If M couples (θ_i, y_i) simulated from p(θ, y|d) are available, then we may evaluate U according to its Monte Carlo approximation,

Û(d) = (1/M) ∑_{i=1}^{M} u(d, θ_i, y_i).

Unfortunately, the shape of U(d) around its mode is often very flat, and errors in the Monte Carlo approximation may hide the top of the surface for reasonable computational costs. The next section shows how a "simulated annealing" scheme improves this mode search.
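To fix ideas, here is a minimal Python sketch of this Monte Carlo evaluation of U(d); the helper names sample_theta_y and utility are hypothetical stand-ins for a sampler of p(θ, y|d) and for u(d, θ, y), not code from the paper.

```python
import numpy as np

def estimate_expected_utility(d, sample_theta_y, utility, M=1000, rng=None):
    """Monte Carlo estimate of U(d): average u(d, theta, y) over M couples
    (theta, y) drawn from p(theta, y | d)."""
    rng = rng or np.random.default_rng()
    values = [utility(d, *sample_theta_y(d, rng)) for _ in range(M)]
    return float(np.mean(values))
```

As the paragraph above warns, the flatness of U near its mode means that the Monte Carlo error of such an estimate can easily mask the location of the maximum.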

2.2 Mode Determination by Simulated Annealing

For high-dimensional problems, getting a sample of designs from the probability density proportional to U may not be sufficient to determine its mode. In our mode search problem, a generic idea (used in Müller 1999; Müller et al. 2003; Brooks and Morgan 1995) is to instead simulate a sample from U^J, where J is a large integer. This will obviously sharpen the top of the utility surface and concentrate simulations closer to the mode. It relies on the same idea as in simulated annealing (Van Laarhoven and Aarts 1987), where 1/J can be interpreted as a "temperature." Given J, sampling from h_J implies that the joint augmented distribution to simulate is now over 𝒟 × Θ^J × 𝒴^J,

h_J(d, θ₁, …, θ_J, y₁, …, y_J) ∝ ∏_{j=1}^{J} u(d, θ_j, y_j) p(θ_j, y_j|d),

which preserves the necessary property that ∫_{Θ^J×𝒴^J} h_J is proportional to U(d)^J: because the couples (θ_j, y_j) are drawn independently from p(θ, y|d), the integral factors as ∏_{j=1}^{J} ∫ u(d, θ_j, y_j) p(θ_j, y_j|d) dθ_j dy_j = U(d)^J. One could use the same value of J for all iterations, but this is not efficient in high-dimensional cases. One could also use a "cooling" schedule that makes J(t) increase up to +∞ with iterations t, mimicking the temperature decreasing to 0 in simulated annealing. However, unlike with the standard simulated annealing algorithm, U does not need to be evaluated here, and the dimension of the support space changes with iterations. Despite this, Müller (1999) referred to the work of Geman and Geman (1984) to propose a logarithmic cooling schedule. As J increases, the dimension of the support space of the joint target density h_J also increases, making the mode search of the marginal more difficult. This is a reason why J(t) should not grow too fast.

2.3 Müller's Algorithm and Its Limitations

Based on the previous two sections, the optimization algorithm proposed by Müller (1999) is a Metropolis-Hastings algorithm with target density h_J associated with a simulated annealing. The proposal distribution over the design space at iteration t is assumed to be q(·|d^(t−1)) (random-walk type). The proposal distribution on the (θ_j, y_j) at iteration t is assumed to be p(θ, y|d^(t)) (independent type). With this natural choice, the acceptance probability is easier to compute, because we do not need to evaluate p(θ_j, y_j|d^(t)).

Algorithm:
1. Start with a design d^(0) at t = 0. Set J = J(0). Simulate (θ_j, y_j) from p(θ, y|d^(0)) for j = 1, …, J. Compute u^(0) = ∏_{j=1}^{J} u(d^(0), θ_j, y_j).
2. Generate a "candidate" d̃ from q(·|d^(0)), then (θ̃_j, ỹ_j) from p(θ, y|d̃) for j = 1, …, J. Compute ũ = ∏_{j=1}^{J} u(d̃, θ̃_j, ỹ_j).
3. Compute the acceptance probability,

   α = min(1, [ũ q(d^(0)|d̃)] / [u^(0) q(d̃|d^(0))]).

4. Set (d^(1), u^(1)) = (d̃, ũ) with probability α and (d^(1), u^(1)) = (d^(0), u^(0)) with probability 1 − α.
5. Set J = J(t + 1).
6. Repeat steps 2-5 until approximate convergence.
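A compact Python sketch of this annealed Metropolis-Hastings scheme follows; all helper callables (propose, q_density, sample_theta_y, utility, schedule) are assumptions to be supplied by the user. The sketch re-estimates the current utility product whenever J changes, a pragmatic choice since each product is re-estimated independently anyway.

```python
import numpy as np

def muller_optimal_design(d0, sample_theta_y, utility, propose, q_density,
                          schedule, n_iter=200, rng=None):
    """Sketch of the Mueller-type MCMC design search with annealing.

    propose(d, rng): candidate from the random-walk proposal q(.|d)
    q_density(a, b): density value q(a | b)
    schedule(t):     integer J(t), nondecreasing in t
    """
    rng = rng or np.random.default_rng()

    def prod_u(d, J):
        # Product of u over J independent experiments from p(theta, y | d).
        return np.prod([utility(d, *sample_theta_y(d, rng)) for _ in range(J)])

    d = d0
    for t in range(n_iter):
        J = schedule(t)
        u = prod_u(d, J)            # refresh the current product at this J
        d_new = propose(d, rng)
        u_new = prod_u(d_new, J)
        alpha = min(1.0, (u_new * q_density(d, d_new)) /
                         (u * q_density(d_new, d)))
        if rng.uniform() < alpha:   # Metropolis-Hastings accept/reject
            d = d_new
    return d
```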

Depending on the model structure, we can also propose a similar algorithm based on the Gibbs sampler. Müller's algorithm generates a sample from h_J. As t → +∞, h_J uniformly converges to a Dirac delta function centered at the optimal design d*. Therefore, d^(t) converges almost surely toward d*. Unfortunately, this convergence can be extremely slow in practice, especially in high-dimensional cases, for which this algorithm becomes inefficient. As usual with Metropolis algorithms, the exploration of a large design space can be slow, and the Markov chain can be trapped around local modes. These drawbacks are worsened by the annealing effect. To improve this method, a better exploration of the design space is needed.

3. A NEW PARTICLE APPROACH

3.1 Interacting MCMC: An Intuitive Approach

This section introduces a new simulation-based algorithm for optimal design in the framework of complex or high-dimensional models. The method is based on recent developments of particle methods, with application to particle filters (Doucet, de Freitas, and Gordon 2001) or to population Monte Carlo (Cappé, Guillin, Marin, and Robert 2004) simulations. To simulate a sample from h_J, we no longer simulate one (or even more) Markov chains (d^(t), θ^(t), y^(t)) as in Müller's algorithm, but instead generate N "parallel" chains (d_i^(t), θ_i^(t), y_i^(t)) for i = 1, …, N. Using the vocabulary from sequential MCMC theory, each triplet (d_i^(t), θ_i^(t), y_i^(t)) is called a particle, and the set of N Markov chains is called an interacting-particle system. The goal is not to produce a sample to approximate the target distribution, but rather to simulate particles close to the modes. The idea is to generate at each iteration t an approximate weighted sample from h_J by "importance sampling" (Geweke 1989), then use a selection procedure to duplicate particles closer to the modes of the target distribution while eliminating the others. A standard selection procedure is a "sampling importance resampling" scheme (Rubin 1988; Smith and Gelfand 1992). This selection procedure has been widely studied and applied in the literature on particle methods for sequential MCMC (Doucet et al. 2001). Other selection methods are possible (Liu and Chen 1998; Carpenter, Clifford, and Fearnhead 1999), but we do not discuss these here. As is usual for this type of algorithm, an independent Markov step may also be added for each particle, to avoid degeneracy problems. At each step (importance sampling, resampling, and Markov steps) and for any iteration t, this would generate a sample (ξ_i^(t))_{i=1,…,N} from h_J such that as N → +∞, we have the approximation

(1/N) ∑_{i=1}^{N} φ(ξ_i^(t)) → ∫ φ(ξ) h_J(ξ) dξ

for any measurable and bounded function φ. The aim in this approach is to get a rich sample from h_J. The obvious drawback is that this iterative scheme for fixed J and N would accumulate noise, so that the approximation would worsen with iterations. This point has been underlined in developments of nonsequential population MCMC algorithms (Cappé et al. 2004; Chopin 2002). However, the interest of such iterative algorithms is fully recovered when the target distribution changes with iterations, as in sequential MCMC or particle filter algorithms. In our case this holds when we add a simulated annealing effect.
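The sampling importance resampling selection invoked above takes only a few lines; a minimal sketch in Python:

```python
import numpy as np

def sir_select(particles, weights, rng=None):
    """Sampling importance resampling: draw N indices with probabilities
    proportional to the weights, so particles near the modes are duplicated
    and the others eliminated, keeping the population size fixed."""
    rng = rng or np.random.default_rng()
    w = np.asarray(weights, dtype=float)
    idx = rng.choice(len(particles), size=len(particles), p=w / w.sum())
    return [particles[i] for i in idx]
```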

3.2 Simulated Annealing for Particles

For the same reasons as with Müller's approach, the target distribution h_J changes with iteration t, so that J → +∞ as t → +∞. Although the goals differ, this approach can be compared with particle filtering for sequential MCMC. In the latter, the target distribution is typically the posterior distribution of a state variable. At each iteration, this distribution changes to be updated with new data. Sampling importance resampling allows a "quick" update of the sample, and an additional Markov step brings a better exploration. In our case, the philosophy is just the same; sampling importance resampling updates the sample to suit the new J, and the Markov kernel that follows enriches the sample. Similarities can also be found with genetic algorithms (Reeves and Wright 1999; Del Moral, Kallel, and Rowe 2001), which use the same idea of interacting chains with simulated annealing but in a less general setup. An important issue is the choice of the cooling schedule function J(t). In this context, a logarithmic rule may be far from optimal. Intuitively, temperature should change faster because of interactions (resampling), and we see that linear growth is often a reasonable choice.

4. OPTIMIZATION ALGORITHMS

4.1 A General Algorithm

The previous two sections yielded a new algorithm for optimal design that combines Müller's idea with a particle approach. For convenience, we choose a Metropolis-Hastings step as the Markov step. We need to choose two proposal distributions over the design space, q_IS for the importance sampling step and q_MH for the Metropolis-Hastings step. For a better exploration in multimodal cases, these functions can be widely different; for example, q_IS could have a smaller variance to allow a small-scale exploration at the importance sampling step and a larger-scale exploration at the Metropolis-Hastings step. Indeed, according to ideas of importance sampling, the optimal choice for q_IS should be "close" to the target distribution. This remark leads us to a heuristic choice for the variance of q_IS. Because q_IS is a random-walk step from a sample drawn approximately from U^{J(t−1)}/∫U^{J(t−1)}, its variance should be of the order of the variance of U^{J(t)}/∫U^{J(t)}. Moreover, for J large enough, we can locally approximate the marginal target distribution U^{J(t)}/∫U^{J(t)} by a Gaussian distribution with mean d* and variance of order 1/J. We should then consider letting this variance for q_IS decrease to 0 with rate 1/J. In contrast, the exploration scale for q_MH could stay large.

Summing up, we can propose the following general optimization algorithm:

General algorithm:
1. Initialization. Start with a sample (d_i^(0))_{i=1,…,N} at t = 0. Set J = J(0).
2. Importance sampling step. For each i = 1, …, N, simulate d̃_i^(t) from q_IS(d|d_i^(t−1)) and J independent experiments (θ̃_{ij}^(t), ỹ_{ij}^(t)) from p(θ, y|d̃_i^(t)) for j = 1, …, J. Compute

   ũ_i^(t) = ∏_{j=1}^{J} u(d̃_i^(t), θ̃_{ij}^(t), ỹ_{ij}^(t))

   and the normalized weights

   w_i^(t) ∝ ũ_i^(t) / q_IS(d̃_i^(t)|d_i^(t−1)).

3. Selection step. Resample (d_1^(t), …, d_N^(t)) from (d̃_1^(t), …, d̃_N^(t)) according to a multinomial distribution with weights w_i^(t), carrying along the utilities ũ_i^(t).
4. Metropolis-Hastings step. For each i = 1, …, N, simulate d̄_i^(t) from q_MH(d|d_i^(t)) and J independent experiments (θ̄_{ij}^(t), ȳ_{ij}^(t)) from p(θ, y|d̄_i^(t)) for j = 1, …, J. Compute ū_i^(t) = ∏_{j=1}^{J} u(d̄_i^(t), θ̄_{ij}^(t), ȳ_{ij}^(t)) and the acceptance rates

   α_i = min(1, [ū_i^(t) q_MH(d_i^(t)|d̄_i^(t))] / [u_i^(t) q_MH(d̄_i^(t)|d_i^(t))]).

   Set d_i^(t) = d̄_i^(t) with probability α_i and keep d_i^(t) unchanged with probability 1 − α_i.
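The following Python sketch assembles one full iteration of the general algorithm. All model and proposal callables are user-supplied assumptions, and the weight formula follows the reconstruction given in step 2 (utility product divided by the importance proposal density).

```python
import numpy as np

def general_algorithm_step(particles, t, schedule, sample_theta_y, utility,
                           propose_is, density_is, propose_mh, density_mh,
                           rng):
    """One iteration: importance sampling, multinomial selection, MH move."""
    N, J = len(particles), schedule(t)

    def prod_u(d):
        return np.prod([utility(d, *sample_theta_y(d, rng)) for _ in range(J)])

    # Step 2: importance sampling move and weighting.
    proposals = [propose_is(d, rng) for d in particles]
    utils = np.array([prod_u(d) for d in proposals])
    w = utils / np.array([density_is(dn, d)
                          for dn, d in zip(proposals, particles)])
    w /= w.sum()

    # Step 3: multinomial selection duplicates high-utility designs.
    counts = rng.multinomial(N, w)
    selected = [(d, u) for d, u, c in zip(proposals, utils, counts)
                for _ in range(c)]

    # Step 4: Metropolis-Hastings rejuvenation of each particle.
    new_particles = []
    for d, u in selected:
        d_new = propose_mh(d, rng)
        u_new = prod_u(d_new)
        alpha = min(1.0, (u_new * density_mh(d, d_new)) /
                         (u * density_mh(d_new, d)))
        new_particles.append(d_new if rng.uniform() < alpha else d)
    return new_particles
```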

We once again insist on the difference between the importance sampling step and the Markov step. The importance sampling proposal can be very generic, not necessarily Markovian, whereas the Markov transition must be such that its stationary distribution is the right target distribution at each iteration. In the foregoing detailed algorithm, we considered a Metropolis-Hastings step for practical convenience, but any other algorithm could be used instead, provided that the stationarity condition holds. Consequently, all samples from q_IS are accepted, whereas q_MH only generates candidates that may or may not be accepted. Of course, step 4 could be dropped, considering only a generic class of importance sampling proposals. But although theoretically correct, the resulting algorithm would be helpless in terms of both practicability and convergence proofs. Indeed, as we demonstrate in Section 5, the importance sampling proposal is the key step to control to demonstrate convergence properties. In contrast, one could consider dropping the importance sampling step in the loops; this particular case is presented in the next section.

4.2 A Particular Case: The Resampling-Markov Algorithm

We now consider a simpler algorithm by removing the importance sampling step. In this case, we obtain at the beginning of each loop t a sample drawn approximately from h_{J(t−1)} that will be resampled and enriched by a Markov step to become an approximate sample from h_{J(t)}. We assume that J(t) > J(t−1) to give a sense to the resampling, which means that the cooling schedule is at least linear [J(t) ≥ t + 1]. The resampling weights also have a simpler form, because we can take

w_i^(t) ∝ ∏_{j=J(t−1)+1}^{J(t)} u(d_i^(t−1), θ_{ij}^(t−1), y_{ij}^(t−1)),

where (θ_{ij}^(t−1), y_{ij}^(t−1))_{j=J(t−1)+1,…,J(t)} are additional draws of independent experiments for particle i, conditionally on d_i^(t−1).

The Markov step is the same as in the former algorithm, although its efficiency in rejuvenating the sample is more needed here, because degeneracy problems could occur more often than in the general case. In the following algorithm, we again use a Metropolis-Hastings transition for the Markov step:

Resampling-Markov algorithm:
1. Initialization. Start at t = 0 with a sample (d_i^(0), θ_i^(0), y_i^(0))_{i=1,…,N} drawn by importance sampling as in step 1 of the general algorithm. Set J = J(0) = 1.
2. Reweighting. For each i = 1, …, N, draw additional independent experiments (θ_{ij}^(0), y_{ij}^(0))_{j=J(0)+1,…,J(1)}, and compute the normalized weights

   w_i^(1) ∝ ∏_{j=J(0)+1}^{J(1)} u(d_i^(0), θ_{ij}^(0), y_{ij}^(0)).

3. Resampling step. Resample (d_1^(1), …, d_N^(1)) from (d_1^(0), …, d_N^(0)) according to a multinomial distribution with weights w_i^(1), carrying along the utilities ũ_i^(1).
4. Metropolis-Hastings step. For each i = 1, …, N, simulate d̄_i^(1) from q_MH(d|d_i^(1)) and J independent experiments (θ̄_{ij}^(1), ȳ_{ij}^(1)) from p(θ, y|d̄_i^(1)) for j = 1, …, J. Compute ū_i^(1) = ∏_{j=1}^{J} u(d̄_i^(1), θ̄_{ij}^(1), ȳ_{ij}^(1)) and the acceptance rates

   α_i = min(1, [ū_i^(1) q_MH(d_i^(1)|d̄_i^(1))] / [u_i^(1) q_MH(d̄_i^(1)|d_i^(1))]).

   Set d_i^(1) = d̄_i^(1) with probability α_i and keep d_i^(1) unchanged with probability 1 − α_i.
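Relative to the general algorithm, the only new ingredient is the reweighting with additional draws; a minimal Python sketch (helper names illustrative):

```python
import numpy as np

def rm_reweights(particles, J_prev, J_new, sample_theta_y, utility, rng=None):
    """Resampling-Markov reweighting: weight each design by the product of u
    over the J_new - J_prev *additional* independent experiments, since the
    previous target h_{J_prev} acts as the proposal for h_{J_new}."""
    rng = rng or np.random.default_rng()
    w = np.array([np.prod([utility(d, *sample_theta_y(d, rng))
                           for _ in range(J_new - J_prev)])
                  for d in particles])
    return w / w.sum()
```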

This version of our algorithm obviously would save computation time and also should be theoretically more stable in convergence, as we show in the next section. It can be viewed as a genetic algorithm looping selection and Markov steps; however, it may be less efficient for exploring complex utility surfaces.

5. CONVERGENCE ISSUES

The algorithm that we propose here is of the "interacting particle systems" type. Theoretical convergence studies for sequential MCMC applications have been given by Del Moral (1998), Del Moral and Doucet (2004), and Crisan and Doucet (2000). Investigating precise convergence rates is quite difficult in this context, because we are dealing with N dependent nonhomogeneous Markov chains. We compare the target distribution h_{J(t)} with the random empirical distribution (1/N) ∑_{i=1}^{N} δ_{ξ_i^(t)}. More precisely, we want to compare the marginal measures η^(t) with (1/N) ∑_{i=1}^{N} δ_{d_i^(t)}, where η^(t) is the target measure derived from the distribution proportional to U^{J(t)}(d) over the design space 𝒟 and where (d_i^(t))_{i=1,…,N} are the N designs sampled by the algorithm at iteration t.

5.1 Convergence of the General Algorithm

We consider the general algorithm and prove a mean squared convergence result. The importance sampling step obviously plays a major role. We introduce the notation q_it(d) = q_IS(d|d_i^(t−1)), p_it(θ_{i1}, y_{i1}, …, θ_{iJ}, y_{iJ}) = p(θ_{i1}, y_{i1}|d̃_i^(t)) ⋯ p(θ_{iJ}, y_{iJ}|d̃_i^(t)), and the following assumption (A):

1. The utility u is positive and bounded.
2. For all i and t, q_it > 0 on the support of h_J.
3. The expression (1/N) ∑_{i=1}^{N} Var_{q_it⊗p_it}(w_i^(t)) is bounded independently of N.
4. There exists δ > 0 such that

   (∑_{i=1}^{N} Var(w_i^(t)))^{−(2+δ)/2} ∑_{i=1}^{N} E_{q_it⊗p_it}[|w_i^(t)|^{2+δ}] → 0 as N → +∞.

5. d ↦ Var[q_MH(·|d)] is bounded.

Theorem 1 (Step-by-step convergence). Under assumption (A), for any iteration t, there exists a constant a_t such that for any measurable and bounded φ over 𝒟, we have

E[((1/N) ∑_{i=1}^{N} φ(d_i^(t)) − η^(t)(φ))² | F_{t−1}] ≤ a_t ‖φ‖²_∞ / N,

where F_{t−1} stands for the conditioning on the previous sample (d_i^(t−1))_{i=1,…,N}.

This result ensures that for any given t, our current sample (d_i^(t))_{i=1,…,N} is, in some sense, near a sample from the target distribution η^(t). We show this result with classical statistical arguments. The randomness in the expectation comes from three sources: the importance sampling, the resampling, and the Markov step. For each source, we first state a lemma to bound the corresponding error.

Lemma 1 (Bound of importance sampling error). Let (ξ_i^(t))_{i=1,…,N} be the importance sample drawn from (q_it ⊗ p_it)_{i=1,…,N} and with nonnormalized weights (w_i^(t))_{i=1,…,N}. Under assumption (A),

E[(∑_{i=1}^{N} w_i^(t) f(ξ_i^(t)) / ∑_{i=1}^{N} w_i^(t) − ∫ f(ξ) h_J(ξ) dξ)² | F_{t−1}] ≤ a′_t ‖f‖²_∞ / N.

Proof. In what follows, all expectations and variances are considered conditionally on F_{t−1}. The proof relies on a generalization of the central limit theorem for importance sampling when the importance distribution q_it depends on the previously sampled point i. To generalize Geweke's demonstration (Geweke 1989) to our case, we need to replace the standard central limit theorem by the Lindeberg theorem for the particles ξ_i with weights w_i^(t) and importance distributions q_it ⊗ p_it. Indeed, conditionally on F_{t−1}, the ξ_i are independent random variables. For that purpose, we need the w_i to satisfy the Lindeberg condition. The last point of assumption (A) is nothing but the Lyapunov condition, which implies the Lindeberg condition. The Lindeberg theorem applies to ∑_{i=1}^{N} w_i ψ(ξ_i) and gives convergence in law. Using point 3 of assumption (A), this implies that

(1/N) ∑_{i=1}^{N} w_i ψ(ξ_i) = ψ̄ (1 + O_N(1)),

where O_N(1) denotes a random variable independent of ψ that converges almost surely toward 0 as N goes to infinity and where ψ̄ denotes ∫ ψ(ξ) h_J(ξ) dξ. Then

∑_{i=1}^{N} w_i ψ(ξ_i) / ∑_{i=1}^{N} w_i = ψ̄ (1 + O_N(1)). (1)

Consequently,

E[(∑_{i=1}^{N} w_i ψ(ξ_i) / ∑_{i=1}^{N} w_i − ψ̄)²] ≤ a′_t ‖ψ‖²_∞ / N.

Now we end the proof using again point 3 of assumption (A), with a constant a′_t independent of ψ but that depends on the previous sample (d_i^(t−1))_{i=1,…,N}.

We now evaluate the effect of the resampling step on the sample.

Lemma 2 (Bound of the resampling error). Let (ξ_i^(t), w_i^(t))_{i=1,…,N} be a weighted sample such that for all bounded measurable functions ψ,

E[(∑_{i=1}^{N} w_i^(t) ψ(ξ_i^(t)) / ∑_{i=1}^{N} w_i^(t) − ∫ ψ(ξ) h_J(ξ) dξ)² | F_{t−1}] ≤ a′_t ‖ψ‖²_∞ / N.

Then, if (n₁, …, n_N) denotes a draw from the multinomial distribution M(N, (w_1^(t)/∑_{i} w_i^(t)), …, (w_N^(t)/∑_{i} w_i^(t))), we have

E[((1/N) ∑_{i=1}^{N} n_i ψ(ξ_i^(t)) − ∫ ψ(ξ) h_J(ξ) dξ)² | F_{t−1}] ≤ ā_t ‖ψ‖²_∞ / N.

Proof. We write

(1/N) ∑_{i=1}^{N} n_i ψ(ξ_i^(t)) − ∫ ψ(ξ) h_J(ξ) dξ
= [(1/N) ∑_{i} n_i ψ(ξ_i^(t)) − ∑_{i} w_i^(t) ψ(ξ_i^(t)) / ∑_{i} w_i^(t)] + [∑_{i} w_i^(t) ψ(ξ_i^(t)) / ∑_{i} w_i^(t) − ∫ ψ(ξ) h_J(ξ) dξ].

Then Minkowski's inequality brings

E[((1/N) ∑_{i} n_i ψ(ξ_i^(t)) − ∫ ψ h_J)² | F_{t−1}]^{1/2}
≤ E[((1/N) ∑_{i} n_i ψ(ξ_i^(t)) − ∑_{i} w_i^(t) ψ(ξ_i^(t)) / ∑_{i} w_i^(t))² | F_{t−1}]^{1/2}
+ E[(∑_{i} w_i^(t) ψ(ξ_i^(t)) / ∑_{i} w_i^(t) − ∫ ψ h_J)² | F_{t−1}]^{1/2}.

If F̃_t represents the conditioning on the sample (ξ_i^(t))_{i=1,…,N}, then we easily get, from the properties of the multinomial distribution, that

E[((1/N) ∑_{i} n_i ψ(ξ_i^(t)) − ∑_{i} w_i^(t) ψ(ξ_i^(t)) / ∑_{i} w_i^(t))² | F̃_t] ≤ a ‖ψ‖²_∞ / N,

where a is a constant independent of the other variables. Therefore, integrating over F̃_t and using the hypothesis of the lemma, we have

E[((1/N) ∑_{i} n_i ψ(ξ_i^(t)) − ∫ ψ h_J)² | F_{t−1}]^{1/2} ≤ (√a + √a′_t) ‖ψ‖_∞ / √N,

which completes the proof of this lemma.

The last source of randomness comes from the Markov step.

Lemma 3 (Bound of Markov step error). Let η be a probability measure and let (d_i)_{i=1,…,N} be a sample over 𝒟 such that

E[((1/N) ∑_{i=1}^{N} δ_{d_i}(φ) − η(φ))²] ≤ c ‖φ‖²_∞ / N  for all φ.

Then, if Q is a Markov kernel with η as stationary distribution and such that d ↦ Var[Q(·|d)] is bounded, we have

E[((1/N) ∑_{i=1}^{N} δ_{d̄_i}(φ) − η(φ))²] ≤ c′ ‖φ‖²_∞ / N,

where (d̄_i)_{i=1,…,N} is a sample drawn independently from ⊗_i Q(·|d_i).

Proof. We can write

E[((1/N) ∑_{i=1}^{N} δ_{d̄_i}(φ) − η(φ))²]
= E[((1/N) ∑_{i=1}^{N} [φ(d̄_i) − Qφ(d_i)])²] + E[((1/N) ∑_{i=1}^{N} δ_{d_i}(Qφ) − η(Qφ))²]
= (1/N²) ∑_{i=1}^{N} E[Var_{Q(·|d_i)}(φ)] + E[((1/N) ∑_{i=1}^{N} δ_{d_i}(Qφ) − η(Qφ))²],

where Qφ(d) = ∫ φ(d′) Q(d′|d) dd′ and η(Qφ) = η(φ) by stationarity. Then, using the Markov stationarity and the variance bound hypothesis,

E[((1/N) ∑_{i=1}^{N} δ_{d̄_i}(φ) − η(φ))²] ≤ c̄ ‖φ‖²_∞ / N + c ‖Qφ‖²_∞ / N ≤ c′ ‖φ‖²_∞ / N.

Proof of the Step-by-Step Convergence Theorem. At a given iteration t and under assumption (A), we put together Lemmas 1 and 2 to get

E[((1/N) ∑_{i=1}^{N} δ_{d_i^(t)}(φ) − η^(t)(φ))² | F_{t−1}] ≤ ā_t ‖φ‖²_∞ / N. (2)

Then, using Jensen's inequality, we get for any bounded and continuous function φ over 𝒟,

E[((1/N) ∑_{i=1}^{N} δ_{d_i^(t)}(φ) − η^(t)(φ))²] ≤ ā_t ‖φ‖²_∞ / N. (3)

The last point of (A) implies that the Markov step Q over 𝒟 engendered by q_MH verifies the uniformly bounded variance condition. Therefore, Lemma 3 can be used to end the proof.

Of course, a sufficient condition is the existence of a uniform lower bound for q_it, but this constraint may be too strong in practice. Some weaker sufficient conditions on the lower bound for q_it can also be asserted. In applications, we might need to truncate the proposal distributions slightly on the border of 𝒟.

The obvious limitation of our convergence theorem is that the bound depends on t, which means that there is no theoretical guarantee that this upper bound does not explode when t → +∞. Obtaining more powerful results within this formalism, and with the same generality on the importance sampling proposal, appears to be very difficult. We emphasize that in practice, the number of iterations should rarely exceed 50. Nonetheless, the fact that there is no importance sampling step in the resampling-Markov algorithm invites us to investigate further convergence results for it.

5.2 Convergence of the Resampling-Markov Algorithm

We now analyze the convergence of the lighter algorithm presented in the previous section. As a particular case of our general algorithm, it would be easy, using the same approach, to prove a result similar to Theorem 1 but with an unconditional expectation. But, as we already noted, this algorithm can be considered an evolutionary algorithm embedded in a Feynman-Kac formalism (see Del Moral et al. 2001; Del Moral 2004 for deeper and more recent developments). This drives us to stronger theoretical convergence results.

We introduce the following notation and conventions:

- We arbitrarily fix t, the iteration at which we analyze convergence.
- For all s ≤ t, we canonically inject 𝒟 × Θ^{J(s)} × 𝒴^{J(s)} into E = 𝒟 × Θ^{J(t)} × 𝒴^{J(t)} (we are now dealing with a fixed-dimensional space).
- We denote by η_N^(s) the empirical measure over 𝒟, η_N^(s) = (1/N) ∑_{i=1}^{N} δ_{d_i^(s)}.
- We define on E^{t+1} (the space of the trajectories of our algorithm)

  η^{[0,t]} = η^(0) ⊗ ⋯ ⊗ η^(t)  and  η_N^{[0,t]} = η_N^(0) ⊗ ⋯ ⊗ η_N^(t).

We can then state the following theorem.

Theorem 2 (Trajectories convergence theorem). Assuming that u is bounded by strictly positive constants u_min and u_max, there exists a constant c_t such that for any φ measurable and bounded on E^{t+1}, we have

E[(η_N^{[0,t]}(φ) − η^{[0,t]}(φ))⁴]^{1/4} ≤ c_t ‖φ‖_∞ / √N, (4)

with

c_t = c′_t + c″ ∑_{s=1}^{t} (u_max / u_min)^{2(J(s)−J(s−1))}.

This result is a direct application of Del Moral's work, the proof of which has been given by Bartoli and Del Moral (2001). It requires a strictly positive lower bound for u, which can be obtained by a truncation. The explicit form of c_t gives us an optimal form for the cooling schedule, which should be linear. With this optimal choice, the constant c_t can be written as ct for some constant c (depending only on u_max and u_min), and this leads to the following corollary on the uniform convergence of the visiting frequency of any optimum's neighborhood.

visiting frequency of any optimum's neighborhood.

Corollary 1 (Uniform convergence result). Let B be a neigh borhood of the targeted optimum. Let/^(B) be the frequency of visiting B for the RM algorithm, and let /(r) (B) be the fre

quency of visiting B for a theoretical trajectory from ̂?'rJ. If the cooling schedule is linear, then

c(t)rn\ ^)/^\4l1/4 < c

E[(^f)(?)-/(,)(?))] - ,N

Proof As noted previously, if the cooling schedule is linear, then it comes from the linear bound in Theorem 2,

Ettf'1?)^10'"?))4]1^^' (5)

for any measurable and bounded function qb on Et+l. Choosing the particular function

0 ?

l(BxEt)U(ExBxE<-l)U---U(EtxB)

leads us to Corollary 1.

Recalling that our goal is to determine the optimum, this

corollary provides information about the effectiveness of the

resampling-Markov algorithm. It also proves that the linear

cooling schedule is an efficient choice.

6. KING ARTHUR'S OPTIMAL-SHARING PROBLEM

6.1 The Decision Problem and Its Solution

The purpose of this example is to test the algorithm in a case where explicit calculations are possible. Once upon a time, King Arthur put T jars onto the round table to be shared between his n faithful knights. Every day, one jar must be brought back, filled by the daily income from the county of the knight in charge of it. Each jar is used only once, for one day and one knight to be chosen. King Arthur wants to share the jars so that all knights contribute as equally as possible to the daily welfare of the kingdom. Lancelot suggests that every knight take the same number of jars, but wise Merlin points out that the wealth of the knights differs significantly and states the problem as follows: Let T_i be the number of jars given to the ith knight. We have

∑_{i=1}^{n} T_i = T,  0 ≤ T_i ≤ T.

Let Y_{i,t_i} be the revenue brought by knight i on day t_i, where t_i = 1, …, T_i, and denote by y the set (Y_{i,t_i}; t_i = 1, …, T_i, i = 1, …, n). The Y_{i,t_i} are supposed to be iid normal random variables fluctuating around mean μ_i with variance σ_i². The prior density of θ = (μ₁, …, μ_n, σ₁, …, σ_n) is available, because King Arthur has some prior knowledge about the welfare of his vassals. In this (anticipated) Bayesian framework, the king wants to find the allocation d = (T₁, …, T_n) that minimizes on average the discrepancy between the daily knights' contributions, measured by

c(d, θ, y) = ∑_{i=1}^{n} ((1/T_i) ∑_{t_i=1}^{T_i} Y_{i,t_i} − (1/T_{i+1}) ∑_{t_{i+1}=1}^{T_{i+1}} Y_{i+1,t_{i+1}})²,

with the convention that, on the round table, index (n + 1) denotes knight number 1. The optimal sharing, d*, is then

d* = argmin_d E_{y,θ}[c(d, θ, y)].

An explicit solution to this problem exists and is unique, as found by Merlin after a wizard visit to Lagrange; see the Appendix for details. We now want to apply our algorithm to this example, for which the number of all possible designs d is the binomial coefficient C(T+n−1, n−1), which can be very large.
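For concreteness, a Python sketch of the discrepancy cost c(d, θ, y) under the normal revenue model (a simulation helper written for this illustration, not the authors' code):

```python
import numpy as np

def discrepancy_cost(T_alloc, mu, sigma, rng=None):
    """Simulate Y_{i,t} ~ N(mu_i, sigma_i^2) for the T_i days of each knight
    and return the circular sum of squared gaps between consecutive knights'
    average daily contributions."""
    rng = rng or np.random.default_rng()
    means = np.array([rng.normal(mu[i], sigma[i], size=T_alloc[i]).mean()
                      if T_alloc[i] > 0 else 0.0
                      for i in range(len(T_alloc))])
    # np.roll implements the round-table convention: knight n+1 is knight 1.
    return float(np.sum((means - np.roll(means, -1)) ** 2))
```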

6.2 Practical Implementation of the Algorithm

Several choices are required for the practical implementation of the algorithm. First, we must choose the number N of particles and the discretization of the time scale. Then, we must choose proposal distributions for the importance sampling and Metropolis-Hastings steps. We also must define the function J that drives the simulated annealing. Finally, because the algorithm deals only with positive criteria to be maximized, we need a transformation for the minimization.

We evaluate the algorithm using multiple independent simulations. We choose for the simulations to share T = 100 jars between n = 10 knights, using only N = 50 particles. Better results would certainly be obtained with larger N, but here we want fast algorithms that can be run many times for our comparison study. In this example, the decision space is the discrete simplex

𝒟 = {(T₁, …, T₁₀) : ∑_{i=1}^{10} T_i = 100}.

Therefore, Dirichlet distributions are convenient choices for the proposal distributions, for both the importance sampling and Metropolis-Hastings steps, centered on the previous state but with a smaller, decreasing variance for the importance sampling proposal (as described in Sec. 4). This allows a reasonable random-walk scheme on the decision space.

The cooling schedule J(t) also must be chosen. Following the convergence study of the resampling-Markov algorithm, we choose a linear cooling schedule.

Finally, we must define a positive and bounded local utility, u(d, θ, y), as in Section 2. The most convenient case is when c(d, θ, y) is uniformly bounded for all θ, y, d, so that we can simply set u(d, θ, y) = K − c(d, θ, y). If it is not, then we can choose a reasonable bound K and set u(d, θ, y) = max(K − c(d, θ, y), .01), as we did in this example, or else truncate the prior distribution on θ so that K − c(d, θ, y) is positive. In either case we would be dealing with an approximation to the utility surface.
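A sketch of these two ingredients in Python: a Dirichlet random-walk proposal on the discrete simplex, and the bounded utility transform. The concentration parameter kappa and the bound K are illustrative tuning choices, not values reported in the paper.

```python
import numpy as np

def propose_allocation(T_prev, kappa=50.0, rng=None):
    """Dirichlet random-walk proposal centered at the previous allocation,
    rounded back to nonnegative integers summing to T; a larger kappa gives
    smaller moves."""
    rng = rng or np.random.default_rng()
    T_prev = np.asarray(T_prev)
    T = int(T_prev.sum())
    p = rng.dirichlet(kappa * (T_prev + 0.5) / T)   # +0.5 keeps parameters > 0
    alloc = np.floor(p * T).astype(int)
    alloc[np.argmax(p)] += T - alloc.sum()          # repair rounding, keep total
    return alloc

def bounded_utility(cost, K=10.0):
    """u = max(K - c, .01), the positive bounded transform for minimization."""
    return max(K - cost, 0.01)
```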


6.3 Results

The simulations were performed with Matlab (version 6) on personal computers. Here the chosen number N of particles was small compared with the size of a representative sample of such high-dimensional distributions. We compared our algorithm with N independent Metropolis-Hastings algorithms (rather than with many independent sampling importance resampling algorithms). The same Dirichlet proposal distribution is taken for the independent Metropolis-Hastings algorithms and for the Metropolis-Hastings step of our algorithm. We also keep the same simulated annealing scheme for both algorithms. The purpose is to show the benefit of the selection process throughout iterations.

Figure 1 shows a typical simulation of h(d, θ, y). At each iteration, we have computed C(d) for every particle, because the explicit calculations make this possible. Figure 1 displays the minimum, mean, and maximum of the N values C(d_i) found at each iteration. The sample is converging toward a Dirac delta function at the optimum, as the three curves get closer to one another as iterations go by. Considering how few iterations we use, our algorithm appears to be efficient in this example, despite our approximation on u and border effects of the Dirichlet proposals near the extremal points of the simplex. We compare it with N independent Metropolis-Hastings algorithms. Figure 2 shows the minimum of C(d_i) over the N particles sampled at each iteration. The selection step obviously makes the algorithm more efficient. In this case, clearly the Markov chains simulated by our algorithm are trapped around local modes less often than Metropolis-Hastings Markov chains.

To make sure that our results are not due to chance, we should consider the performance of our algorithm on average. Figures 3 and 4 present the same curves as previously, but averaged over 20 runs. We can make the same observations as for the single run. Our procedure brings decisions d to the minimum faster than Metropolis-Hastings algorithms.

Figure 1. On a Typical Run, the Maximal, Mean, and Minimal Costs for the System of Particles Shrink to the Theoretical Minimal Cost.

Figure 2. Minimal Cost Attained by the Interacting Particles Improves on the Solution Reached With N Independent Müller Algorithms.

Figure 3. On Average, Over 20 Runs, We Observe the Same Convergence Toward the Optimum as for One Run.

Figure 4. On Average, Over 20 Runs, We Observe the Same Improvement of Convergence When Chains Interact as for One Run.

7. OPTIMAL DESIGN OF CAFFEINE TREATMENT FOR PREMATURE NEONATES

7.1 The Clinical Problem

We now present a real-world application for the University Hospital of Amiens (France). During the first months of life, preterm infants encounter respiratory problems, such as apnea, and daily injections of caffeine can be an efficient treatment. But caffeine is a heart stimulant, and an overdose can cause tachycardia. Therefore, pharmacologists are concerned with finding a balanced caffeine dosing schedule so as to minimize the risks of apnea and tachycardia. According to pediatricians, the concentration of caffeine in preterm infants' blood should be between 12 and 15 mg/l. The problem is to find the optimal daily concentrations c_i such that the caffeine blood concentration remains, on average for the population, in this "therapeutic" window as much as possible, as shown in Figure 5.

For a given dosing schedule, the time series of concentrations can vary widely among individuals, because they depend on physiological parameters. These parameters cannot be measured, but pharmacologists have some previous knowledge of their distributions. The actual concentration of caffeine in blood can also be measured at any time, but to minimize blood sampling, this should not be done too often. As is often the case in practice, we assume that a single measurement is made for each treatment, at a time to be determined, and that the duration of treatment is 4 weeks. For each subject, the treatment protocol comprises the measurement day τ and the 28 daily concentrations c_i. But because the protocol depends on what we measure at τ, the optimal design to be searched for comprises only τ and the τ first concentrations. A pharmacokinetic model is needed to describe how caffeine concentration varies over time given a design. We also need to quantify a specific utility function to be optimized.


7.2 Pharmacokinetic Model and Utility Function

According to published pharmacologic studies (Thomson, Kerr, and Wright 1996; Lee, Charles, Steer, Flenady, and Shearman 1997; Falcao et al. 1997), caffeine distribution in the body is well described by a one-compartment model (Fig. 6). The infant's body is modeled by one homogeneous compartment of volume V(t) at time t that eliminates caffeine with clearance CL(t) (volume cleaned by unit of time) according to

dq(t)/q(t) = −[CL(t)/V(t)] dt, (2)

where, for a given individual, c(t) is the concentration (in mg/kg) of caffeine in blood at time t, M(t) is the body weight, and q(t) = c(t)M(t) is the quantity (or mass) of caffeine in the blood.

In most studies of caffeine treatment of neonates, V and CL are not time-dependent, to simplify statistical inference and calculations. This approximation is crude in this case, because parameters or covariates such as body weight, body volume, and metabolic clearance (driven by enzyme system maturation) for young infants vary significantly over time during the treatment.

Figure 5. A Typical Trajectory for Caffeine Concentration in Blood for a Baby.

Figure 6. A One-Compartment Model Describes Caffeine Elimination in Blood.

We now want to express the time dependence of M, V, and CL. From pediatricians' knowledge, it appears that the growth of premature neonates can be divided into two distinct periods. Body weight and volume decrease during the first period, then increase during the second period. We propose linear variation depending on the covariates M₀ (body weight at birth) and A₀ (postconceptual age at birth) as follows:

M(t) = M₀ + α₁t if t < δ,
M(t) = M(δ) + α₂(t − δ) if t ≥ δ;

V₀ = M₀ β/α₁,
V(t) = V₀ + βt if t < δ,
V(t) = V(δ) + [V(δ)/M(δ)] α₂(t − δ) if t ≥ δ;

CL(t) = γ(t + A₀).

This is a nonlinear model with a five-dimensional parameter (α₁, α₂, β, γ, δ), which we denote by θ. In a Bayesian setup, previous informative experiments allow us to derive a prior distribution over θ that depicts the interindividual variability of physiological characteristics. These prior distributions are given in Table 1. From the foregoing equations, we can explicitly compute the quantity of interest, c(d, θ, A₀, M₀, t), for given parameter θ, covariates A₀ and M₀, and injection protocol d. In the study that follows, we consider neonates with M₀ = 1.2 kg and A₀ = 30 weeks, so that we can simply write c(d, θ, t) for the concentration of caffeine in blood at time t for a subject with parameter θ treated with protocol d.
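To illustrate the mechanics, here is a schematic Euler discretization of the model in Python on an hourly grid. The reading of each dose c_i as a concentration jump at the start of day i, the initialization V₀ = M₀β/α₁ reconstructed above, and the unit handling are all assumptions of this sketch; Table 1's values would need consistent unit conversion in a real implementation.

```python
import numpy as np

def simulate_concentration(doses, theta, M0=1.2, A0_hours=30 * 7 * 24, dt=1.0):
    """Euler simulation of c(t) = q(t)/M(t) under the piecewise-linear growth
    model and the elimination equation dq/q = -(CL/V) dt."""
    alpha1, alpha2, beta, gamma, delta_days = theta
    delta = delta_days * 24.0                       # switch point, in hours
    V0 = M0 * beta / alpha1
    M_d, V_d = M0 + alpha1 * delta, V0 + beta * delta
    q, conc = 0.0, np.zeros(24 * len(doses))
    for t in range(conc.size):
        if t < delta:                               # first period: decline
            M, V = M0 + alpha1 * t, V0 + beta * t
        else:                                       # second period: growth
            M = M_d + alpha2 * (t - delta)
            V = V_d + (V_d / M_d) * alpha2 * (t - delta)
        CL = gamma * (t + A0_hours)                 # clearance grows with age
        if t % 24 == 0:
            q += doses[t // 24] * M                 # daily injection of dose c_i
        q -= (CL / V) * q * dt                      # first-order elimination
        conc[t] = q / M
    return conc
```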

We now need to quantify the utility of having c(d, θ, t) for as long as possible in the therapeutic interval [12, 15]. In the notation of Sections 1 and 2, we must define utility functions u(d, θ, y) and U(d). For a given parameter θ, the utility of protocol d can be written as ∫ Ψ(c(d, θ, t)) dt, where Ψ encodes the wish of physicians. This Ψ function can be reasonably chosen as a broken line such that 0 ≤ Ψ ≤ 1, Ψ(c) = 1 when c ∈ [12, 15], and Ψ(c) = 0 when c ∈ [0, 6] ∪ [30, +∞). This shape for Ψ permits rapid numerical integration. Figure 7 illustrates this choice of utility.
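This broken-line Ψ is a single interpolation call; a minimal sketch, reading the linear ramps between 6-12 and 15-30 mg/l off the trapezoid of Figure 7:

```python
import numpy as np

def psi(c):
    """Trapezoidal daily utility: 0 on [0, 6] and beyond 30, 1 on [12, 15],
    linear in between."""
    return float(np.interp(c, [6.0, 12.0, 15.0, 30.0], [0.0, 1.0, 1.0, 0.0]))
```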

Table 1. Prior Distributions Derived From Earlier Experimental Data

Parameter      Prior distribution
α₁ (mg/h)      N(−1.07, .20)
α₂ (mg/h)      N(1.15, .13)
β (ml/h)       N(−.79, .08)
γ (µl/h)       N(.70, .15)
δ (days)       Γ(8.41, .69)

Figure 7. A Trapezoidal Daily Utility Function Matches the Pediatricians' Requirements.

If no blood sampling were allowed, then we would have to maximize U(d) = E_θ(∫₀²⁸ Ψ(c(d, θ, t)) dt), where the expectation is calculated with respect to the prior density p(θ). But here we get a datum y at measurement day τ, with an additive lognormal error ε ∼ logN(0, .05),

y = c(d, θ, τ) + ε.

Therefore, we can write the injection protocol d as d = (d₁, d₂(y)), where the first component d₁ = (c₁, …, c_{τ−1}, τ) is the same for all individuals and the second component d₂(y) = (c_τ(y), …, c₂₈(y)) depends on the subject. For a given y, d₂(y) is calculated as with d₁ but with y as the initial concentration of caffeine in blood instead of 0. The utility to maximize is then

U(d) = E_θ(∫₀^τ Ψ(c(d, θ, t)) dt) + E_{θ,y}(∫_τ²⁸ Ψ(c(d₁, d₂(y), θ, t)) dt),

where the second expectation is calculated over θ with respect to the posterior density p(θ|d₁, y) and then over y with respect to p(y|d₁). Here, U is the sum of two expectations, so we need to introduce an augmented distribution to define u and h (which has no major effect on algorithm implementation). Indeed, using Bayes' formula, p(y|d₁, θ′)p(θ′) = p(θ′|d₁, y)p(y|d₁), the density function h that we must simulate from is

h(d₁, d₂(y), θ, θ′, y) ∝ [∫₀^τ Ψ(c(d₁, θ, t)) dt + ∫_τ²⁸ Ψ(c(d₁, d₂(y), θ′, t)) dt] p(y|d₁, θ′) p(θ′) p(θ).

The point of having (θ, θ′) instead of only θ is that h is such that

h(d₁, d₂(y), θ, θ′, y) ∝ [∫₀^τ Ψ(c(d₁, θ, t)) dt + ∫_τ²⁸ Ψ(c(d₁, d₂(y), θ′, t)) dt] p(y, θ′|d₁) p(θ).

In this application the uniqueness of the optimal design is not guaranteed, and there may be many locally optimal designs. More specifically, the optimal sampling day τ is not defined if the support of the prior distribution on θ is too small. For example, if we take the extreme case where we know the exact θ, the same for all subjects, then we would be able to find (c₁, …, c₂₈) such that the utility is maximal and equal to its upper bound, 28. In this situation, we would not need any blood sampling, and looking for an optimal sampling time would not make any sense. But in our case we are in a true Bayesian decision context, where the prior p(θ) is definitely informative but still vague enough to represent the population variability.

7.3 Results and Comments

We have implemented our algorithm with 300 particles over 80 iterations. Exploration of 𝒟 follows a random-walk scheme. The proposal distribution for τ is taken as a beta distribution with mean equal to the previous value. Proposal distributions for the concentrations are truncated normal distributions with mean the previous value. The variances were set as evoked in Section 4: decreasing with J for the importance sampling step and fixed for the Metropolis-Hastings step. To determine the optimal design in practice, we can use a clustering method to evaluate the multivariate mode from the sample. Here we have simply represented the marginal histograms of each component of the simulated designs, because they were highly uncorrelated. Figure 8 represents the histogram for the sampling time τ. We see that the optimal sampling day is the tenth day of the treatment. Similarly, we can draw histograms for the injection doses until τ, as represented in Figure 9. From these histograms, we can get the optimal d₁ by taking the average for each day. Under this optimal protocol, which minimizes the defined risk, Figure 10 shows the concentration of caffeine in blood for 100 simulated subjects throughout the treatment. This figure indicates the influence of the shape of the chosen utility function.

Our results show that improvements are possible. In current practice, physicians are overdosing and are sampling too early (in the first week). However, our results are strongly dependent on the length of treatment, which is actually not the same for all subjects. In addition, medical decisions are very subject-dependent. Consequently, this study will be deepened by applying our method to an improved model, including a new data collection, a population model, and a better model of the decision process and physicians' behavior. The final results could provide a practical tool for decision support.

Figure 8. Histogram of Simulated Measurement Days τ.


Figure 9. Histograms of Simulated Injection Doses c_i.

8. DISCUSSION

8.1 Implementation

8.1.1 Setting Up a Bayesian Model. Traditional reluctance to use a Bayesian approach can, of course, be invoked. The specification of the prior distribution, likelihood, and utility function can involve difficult choices that are not always explained in the Bayesian literature. Moreover, in ordinary Bayesian decision theory, only the product of the posterior density and utility functions matters, because these functions play equivalent roles in the expected utility criterion. In this article this convenient dual symmetry is broken, and thus each function must be assessed separately and carefully. In other examples in the fields of predictive biology and hierarchical toxicological models, it has been more rewarding to work out the information put in the prior than the exactness of the utility function. An appropriate case-dependent sensitivity study should complete the method.

8.1.2 Proposal Distributions. Mathematical guarantees are not sufficient to turn the procedure described in this article into a ready-made, hands-off computer routine. Many numerical choices are needed to perform the method, and this practical aspect can be seriously problematic in some cases. The most sensitive tasks for convergence concerns are the choice of transitions for the Markov step and, of course, the choice of the importance proposal. The problem of choosing an efficient Markov transition is the same here as in standard MCMC methods. Considering the assumptions [see (A)] required for convergence and classical results on importance sampling (Robert and Casella 2004), one might believe that uniform bounds (in i and t) for E_{η^(t)}[1/q_it(d)] should significantly improve convergence rates. It would be interesting to consider adaptive importance proposals that evolve with t, following the annealing effect. An interesting way of doing this would be to use a population Monte Carlo scheme (see Cappé et al. 2004).

Figure 10. Response of 100 Simulated Subjects With the Optimal Injection Protocol.

8.2 Computation

8.2.1 Computation Cost. An obvious drawback of iterative particle methods is the computational cost, which is aggravated by simulated annealing and is not much reduced by the selection process. Because our goal was to optimize designs before any experiment, with no sequential data, computation time is not so important a determinant as long as it remains reasonable. Nonetheless, we briefly present two ways of accelerating the algorithm in the next section. Related questions involve setting the sample size N and the final temperature. The algorithm might be stopped when the empirical distribution is close to a Dirac delta function.

8.2.2 Alternatives for Faster Algorithms. Two improvements are suggested here, both based on minimizing importance sampling cost. The first idea is to use importance sampling and resampling only when "necessary," instead of at every iteration. For instance, one can suspect that the best iterations are those at which the target distribution actually changes (when J grows to J + 1 in our examples). In this case, because we can afford more Markov iterations, the particles are more likely to be interpreted in terms of Markov chains than in terms of importance-sampled points. Consequently, a smaller sample size N should be required. The second idea is based on the opposite point of view. With a greater sample size, one can assume that it is sufficient to approximate the target distribution at each iteration. In this case, only one iteration is needed for a given J, so we can set J(t) = t. Furthermore, q_{i,t} no longer plays a major role, and we can suppress it, because the sample from the Markov transitions is supposed to be sufficiently accurate to be interpreted as a sample from the target distribution. In this particular setup, one can modify the algorithm proposed in Section 3.3 by removing step 2. The target distribution at iteration t - 1 is considered the importance proposal at iteration t, so that importance weights at t are much easier to compute; they are just proportional to u(d_i^(t), θ_i^(t), y_i^(t)). This variant is computationally much lighter, because we no longer simulate from q_{i,t}, and we need only N draws of (θ, y) instead of JN draws for resampling. This algorithm is close to algorithms proposed previously, such as the annealed importance sampler of Neal (2001). Another advantage of this simpler algorithm is that it is easy to study theoretically and to exhibit convergence rates, for example, in the framework of measure-valued processes depicted by Del Moral (1998). Its major drawback lies in complicated cases, which were basically our center of interest; the sample size N might need to be huge to handle such problems. A generic and theoretical overview of such particle algorithms has been given by Del Moral and Doucet (2004).
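For concreteness, a minimal Python sketch of this lighter variant follows, under illustrative assumptions: J(t) = t, a positive utility u(d, theta, y), and hypothetical helpers markov_move(d) (a Markov transition on designs) and simulate(d) (one draw of (theta, y) from the prior and the model). It is a stylized rendering, not the exact procedure of Section 3.3.

    import numpy as np

    def light_annealed_step(designs, u, simulate, markov_move, rng):
        # Move every particle by a Markov transition; the moved cloud is
        # treated as an (approximate) sample from the previous target.
        moved = [markov_move(d) for d in designs]
        # One fresh draw of (theta, y) per particle; the weights are then
        # simply proportional to the utility, as with the annealed
        # importance sampler of Neal (2001).
        w = np.array([u(d, *simulate(d)) for d in moved], dtype=float)
        w /= w.sum()
        # Multinomial resampling concentrates the cloud on high-utility designs.
        idx = rng.choice(len(moved), size=len(moved), p=w)
        return [moved[i] for i in idx]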


8.3 Further Improvements and Perspectives

Aside from minimizing computation cost and optimizing proposal distributions, we can think of further algorithmic improvements. Approaches similar to those in the literature on sequential MCMC could be proposed. For instance, one could try to improve the selection process, as mentioned earlier, or adapt N to the variance of the current sample. Another approach could be to randomize N. Finally, one could optimize the cooling schedule (a toy illustration of these tuning choices follows). It should also be possible to adapt our algorithm to sequential problems, so that the design can be optimized as new data arrive.
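As a simple illustration of these tuning knobs (our own assumptions, not recommendations): a geometric cooling schedule for the annealing exponent J(t), and an effective-sample-size diagnostic that could drive an adaptive or randomized N.

    import numpy as np

    def cooling(t, rate=1.2):
        # Geometric growth of the annealing exponent; smaller rates cool
        # more gently and explore longer.
        return int(np.ceil(rate ** t))

    def effective_sample_size(weights):
        # ESS of a weighted particle set; a low value suggests increasing N
        # or resampling more aggressively.
        w = np.asarray(weights, dtype=float)
        w /= w.sum()
        return 1.0 / np.sum(w ** 2)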

9. CONCLUSIONS

The new algorithm proposed in this article launches N particles and iterates the following three steps:

    d^(t)  --importance sampling-->  d̃^(t)  --resampling-->  d̂^(t)  --Markov step-->  d^(t+1)
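A stripped-down Python rendering of one such iteration may help fix ideas. It relies on the augmented-target device in which the d-marginal of the target is proportional to U(d)^J; propose, simulate, markov_move, and u are hypothetical stand-ins for problem-specific choices, rng is a numpy Generator, and log-weights would be preferable numerically for large J.

    import numpy as np

    def particle_iteration(designs, J, u, simulate, propose, markov_move, rng):
        # 1. Importance sampling: propose candidates and weight them toward
        #    the annealed target whose d-marginal is proportional to U(d)^J.
        cand, w = [], []
        for d in designs:
            d_new, q_density = propose(d)  # proposal draw and its density at d_new
            # Independent (theta, y) draws: the product of utilities has
            # expectation U(d_new)^J.
            utils = [u(d_new, *simulate(d_new)) for _ in range(J)]
            cand.append(d_new)
            w.append(np.prod(utils) / q_density)
        w = np.asarray(w, dtype=float)
        w /= w.sum()
        # 2. Resampling: select particles in proportion to their weights.
        idx = rng.choice(len(cand), size=len(cand), p=w)
        # 3. Markov step: rejuvenate the selected particles.
        return [markov_move(cand[i]) for i in idx]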

The convergence arguments developed herein rely on both importance sampling statistical results and Markov chain theory. They show that, under general regularity conditions, the number N of particles controls the degree of accuracy between the empirical distribution of the particles and the importance sampling target distribution h_J.

Consequently, the algorithm inherits two key features of its conceptual components:

1. In contrast to some MCMC techniques, one does not have to wait for some asymptotic behavior to hit the target distribution. At each iteration, the particles give a sample from h_J, and subsequent statistics [such as the marginal mode of h_J, arg max_d U^J(d)] can be readily evaluated. If one is not satisfied with the precision of this importance sampling estimate, then one need only increase N, the number of particles.

2. The algorithm adjusts itself onto the region of interest, which keeps shrinking toward the limiting distribution. As iterations proceed, the particles are spread more and more densely around the modes of the target distribution.

Because no specific condition on the model is required, this method can be applied to general decision- or design-optimization problems. But in simpler cases, such as linear or low-dimensional models, more straightforward approaches should be more efficient. The toy example shows how this algorithm can perform better than Müller's algorithm. The pharmacological case study attests that the approach is effective and implementable for complex design structures and nonlinear models. As noted earlier, further improvements could possibly lead to ready-to-use software for practitioners.

APPENDIX: EXPLICIT OPTIMIZATION WITH THE LAGRANGE METHOD FOR KING ARTHUR'S PROBLEM

With the same notation, we want to minimize the criterion C(d), the expected sum of squared differences between successive estimates,

    C(d) = E\Big[ \sum_{j=1}^{n-1} \big( \hat{\theta}_{j+1} - \hat{\theta}_j \big)^2 \Big],

knowing the model likelihood p(y | \theta) and the prior density p(\theta). Because we are dealing with a Gaussian model, we can integrate further: each estimate has a closed conjugate form, so that C(d) reduces to squared differences of the posterior means m_j. Then we have

    C(d) = \sum_{j=1}^{n-1} E_\theta\big[ (m_j - m_{j+1})^2 \big]

and

    C(d) = A + \sum_{j=1}^{n} \frac{\lambda_j}{T_j},

where A and the \lambda_j are known positive constants that depend only on the prior on \theta. We are now dealing with a standard optimization problem in which we look for the decision d^* = (T_1^*, \dots, T_n^*) that minimizes \sum_{j=1}^{n} \lambda_j / T_j under the linear constraint \sum_{j=1}^{n} T_j = T. We can then apply the Lagrange multiplier method, and we finally find that

    T_j^* = \frac{T \sqrt{\lambda_j}}{\sum_{k=1}^{n} \sqrt{\lambda_k}}, \qquad j = 1, \dots, n.
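For completeness, the Lagrange computation is elementary and can be written out; this is the standard argument, reproduced under the notation above.

    \mathcal{L}(T_1, \dots, T_n, \mu)
      = \sum_{j=1}^{n} \frac{\lambda_j}{T_j}
        + \mu \Big( \sum_{j=1}^{n} T_j - T \Big),
    \qquad
    \frac{\partial \mathcal{L}}{\partial T_j}
      = -\frac{\lambda_j}{T_j^2} + \mu = 0
    \;\Longrightarrow\;
    T_j = \sqrt{\lambda_j / \mu}.

Enforcing the constraint \sum_{j=1}^{n} T_j = T gives 1/\sqrt{\mu} = T / \sum_{k=1}^{n} \sqrt{\lambda_k}, hence T_j^* = T \sqrt{\lambda_j} / \sum_{k=1}^{n} \sqrt{\lambda_k}.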

[Received November 2003. Revised May 2005.]

REFERENCES

Atkinson, A., and Donev, A. (1992), Optimum Experimental Designs, New York: Oxford University Press.
Bartoli, N., and Del Moral, P. (2001), Simulation et Algorithmes Stochastiques, Toulouse: Cepadues.
Bois, F. (1999), "Optimal Design for a Study of Butadiene Toxicokinetics in Humans," Toxicological Sciences, 49, 213-224.
Box, G., Hunter, W., and Hunter, J. (1978), Statistics for Experimenters, New York: Wiley-Interscience.
Brooks, S., and Morgan, B. (1995), "Optimization Using Simulated Annealing," The Statistician, 44, 241-257.
Cappé, O., Guillin, A., Marin, J., and Robert, C. (2004), "Population Monte Carlo," Journal of Computational and Graphical Statistics, 13, 907-929.
Carlin, B., Kadane, J., and Gelfand, A. (1998), "Approaches for Optimal Sequential Decision Analysis in Clinical Trials," Biometrics, 54, 964-975.
Carpenter, J., Clifford, P., and Fearnhead, P. (1999), "An Improved Particle Filter for Nonlinear Problems," IEE Proceedings - Radar, Sonar and Navigation, 146, 2-7.
Chopin, N. (2002), "A Sequential Particle Filter Method for Static Models," Biometrika, 89, 539-552.
Clyde, M., and Chaloner, K. (1996), "The Equivalence of Constrained and Weighted Designs in Multiple Objective Design Problems," Journal of the American Statistical Association, 91, 1236-1244.
Crisan, D., and Doucet, A. (2000), "Convergence of Sequential Monte Carlo Methods," technical report, University of Cambridge, Signal Processing Group, Dept. of Engineering.
DeGroot, M. H. (1970), Optimal Statistical Decisions, New York: McGraw-Hill.
Del Moral, P. (1998), "Measure-Valued Processes and Interacting Particle Systems: Application to Nonlinear Filtering Problems," The Annals of Applied Probability, 8, 438-495.
Del Moral, P. (2004), Feynman-Kac Formulae: Genealogical and Interacting Particle Systems With Applications, New York: Springer-Verlag.
Del Moral, P., and Doucet, A. (2004), "Particle Motions in Absorbing Medium With Hard and Soft Obstacles," Stochastic Analysis and Applications, 22, 1175-1207.
Del Moral, P., Kallel, L., and Rowe, J. (2001), Natural Computing Series: Theoretical Aspects of Evolutionary Computing, Berlin: Springer-Verlag, pp. 10-67.


Doucet, A., de Freitas, N., and Gordon, N. (2001), Sequential Monte Carlo Methods in Practice, New York: Springer-Verlag.
Falcao, A., Fernandez de Gatta, M., Delgado Iribarnegaray, M., Garcia, M., Dominguez-Gil, A., and Lanao, J. (1997), "Population Pharmacokinetics of Caffeine in Premature Neonates," European Journal of Clinical Pharmacology, 52, 211-217.
Geman, S., and Geman, D. (1984), "Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images," IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721-741.
Geweke, J. (1989), "Bayesian Inference in Econometric Models Using Monte Carlo Integration," Econometrica, 57, 1317-1339.
Goldstein, H. (1995), Multilevel Statistical Models, New York: Wiley/London: Edward Arnold.
Lee, C., Charles, B., Steer, P., Flenady, V., and Shearman, A. (1997), "Population Pharmacokinetics of Intravenous Caffeine in Neonates With Apnea of Prematurity," Clinical Pharmacology and Therapeutics, 61, 628-640.
Liu, J., and Chen, R. (1998), "Sequential Monte Carlo Methods for Dynamic Systems," Journal of the American Statistical Association, 93, 1032-1043.
Mentré, F., Mallet, A., and Baccar, D. (1997), "Optimal Design in Random-Effects Regression Models," Biometrika, 84, 429-442.
Müller, P. (1999), "Simulation-Based Optimal Design," Bayesian Statistics, 6, 459-474.
Müller, P., and Rosner, G. (1997), "A Bayesian Population Model With Hierarchical Mixture Priors Applied to Blood Count Data," Journal of the American Statistical Association, 92, 1279-1292.
Müller, P., Sansó, B., and De Iorio, M. (2003), "Optimal Bayesian Design by Inhomogeneous Markov Chain Simulation," technical report, University of California-Santa Cruz.
Neal, R. M. (2001), "Annealed Importance Sampling," Statistics and Computing, 11, 125-139.
Parent, E., Chaouche, A., and Girard, P. (1995), "Sur l'Apport des Statistiques Bayésiennes au Contrôle de la Qualité par Attribut, Partie 1: Contrôle Simple," Revue de Statistique Appliquée, XLIII, 5-18.
Parent, E., Lang, G., and Girard, P. (1996), "Sur l'Apport des Statistiques Bayésiennes au Contrôle de la Qualité par Attribut, Partie 2: Contrôle Séquentiel Tronqué," Revue de Statistique Appliquée, XLIV, 37-54.
Reeves, C., and Wright, C. (1999), "Genetic Algorithms and the Design of Experiments," in Evolutionary Algorithms, New York: Springer-Verlag, pp. 207-226.
Robert, C., and Casella, G. (2004), Monte Carlo Statistical Methods, New York: Springer-Verlag.
Rubin, D. (1988), Bayesian Statistics, Vol. 3, New York: Oxford University Press, pp. 395-402.
Sampson, P., and Guttorp, P. (1992), "Nonparametric Estimation of Nonstationary Spatial Covariance Structure," Journal of the American Statistical Association, 87, 108-119.
Scheffé, H. (1961), The Analysis of Variance, New York: Wiley.
Smeeton, N., and Adcock, C. (1997), "Sample Size Determination," The Statistician, 46, 129-283.
Smith, A., and Gelfand, A. (1992), "Bayesian Statistics Without Tears: A Sampling-Resampling Perspective," The American Statistician, 46, 84-88.
Thomson, A., Kerr, S., and Wright, S. (1996), "Population Pharmacokinetics of Caffeine in Neonates and Young Infants," Therapeutic Drug Monitoring, 18, 245-253.
Van Laarhoven, P., and Aarts, E. (1987), Simulated Annealing: Theory and Applications, Dordrecht: Reidel.
Wakefield, J. (1994), "An Expected Loss Approach to the Design of Dosage Regimens via Sampling-Based Methods," The Statistician, 43, 13-29.
