
Research Collection

Report

Intrinsic fault tolerance of multi level Monte Carlo methods

Author(s): Pauli, Stefan; Arbenz, Peter; Schwab, Christoph

Publication Date: 2013

Permanent Link: https://doi.org/10.3929/ethz-a-010387066

Rights / License: In Copyright - Non-Commercial Use Permitted


Eidgenössische Technische Hochschule Zürich
Ecole polytechnique fédérale de Zurich
Politecnico federale di Zurigo
Swiss Federal Institute of Technology Zurich

Intrinsic fault tolerance of multi level Monte Carlo methods

St. Pauli, P. Arbenz and Ch. Schwab

Research Report No. 2012-24

August 2012

Revised: November 2013

Seminar für Angewandte Mathematik

Eidgenössische Technische Hochschule

CH-8092 Zurich

Switzerland

Intrinsic Fault Tolerance of Multi Level Monte Carlo Methods

Stefan Pauli^{a,b,1,∗}, Peter Arbenz^{a}, Christoph Schwab^{b,2}

^{a} ETH Zurich, Computer Science Department, Universitätstrasse 6, 8092 Zurich, Switzerland
^{b} ETH Zurich, Seminar for Applied Mathematics, Rämistrasse 101, 8092 Zurich, Switzerland

Abstract

Monte Carlo (MC) and Multilevel Monte Carlo (MLMC) methods applied to solvers for Partial Differential Equations with random input data are proved to exhibit intrinsic failure resilience. Sufficient conditions are provided for non-recoverable loss of a random fraction of MC samples not to fatally damage the asymptotic accuracy versus work of an MC simulation. Specifically, the convergence behavior of MLMC methods on massively parallel hardware with runtime faults is analyzed mathematically and investigated computationally. Our mathematical model assumes node failures which occur uncorrelated with MC sampling, with general sample failure statistics on the different levels, and in the absence of checkpointing, i.e., we assume irrecoverable sample failures with complete loss of data. Modifications of MLMC with enhanced resilience are proposed. The theoretical results are obtained under general statistical models of CPU failure at runtime. Particular attention is paid to node failures following so-called Weibull failure models; massively parallel stochastic Finite Volume computational fluid dynamics simulations are discussed.

Keywords: Multilevel Monte Carlo, fault tolerance, failure resilience, exascale parallel computing

1. Introduction

Monte Carlo (MC) methods estimate statistical moments of random variables (such as means or so-called “ensemble averages”) by sample averages [6]. The goal can, for instance, be to determine the expected solution of a partial differential equation (PDE) with random initial or boundary conditions that follow some statistical law [12–14]. Then each sample is the solution of the PDE for a random input (such as, in the context of hyperbolic systems of conservation laws, a particular initial/boundary condition). The statistical independence of the input data makes it possible to execute the simulations corresponding to the individual samples in parallel. The slow convergence of Monte Carlo methods (M^{−1/2} for M draws of input data) entails large numbers of samples. This, in turn, implies good parallel scalability of MC methods to large numbers of processors. Mostly, the simulations take a similar amount of time, so that a distribution among large numbers of processors with a balanced load is achieved quite easily. In a parallel setting the only serious problem is to guarantee the statistical independence of the random input draws (e.g. [19]).

∗ Corresponding author.
Email addresses: [email protected] (Stefan Pauli), [email protected] (Peter Arbenz), [email protected] (Christoph Schwab)
1 The work of this author has been funded by the ETH interdisciplinary research grant CH1-03 10-1.
2 CS acknowledges partial support by the European Research Council under grant ERC AdG 247277-STAHDPDE.

Preprint submitted to Journal of Parallel and Distributed Computing November 4, 2013

Multilevel Monte Carlo (MLMC) methods were recently proposed in [7, 12] in order to improve the convergence versus work. They can be used for efficient numerical simulations of stochastic ordinary or partial differential equations. Unlike MC methods, where samples are computed on a single discretization only, MLMC methods use a hierarchy of discretization levels; hence computations are done on many different discretizations. To the expected solution computed on the coarsest discretization level, the expected difference from that level to the next finer one is added, until the required finest discretization level is reached. In MLMC methods the expected difference between two consecutive discretization levels is computed using the MC method. Hence, in this paper the difference of a realization computed on two consecutive discretization levels is referred to as an MLMC sample of a certain “level”. A “level” does, therefore, in this paper, not denote a discretization level but a level related to an MLMC sample. It is by now known (see, e.g., [2, 7, 12–14]) that under rather general assumptions, MLMC methods converge faster than MC in terms of overall computational work, i.e., cumulated execution time. The execution time and memory consumption of the computation of a sample depend on the level and differ considerably: the computation of an MLMC sample of a ‘finer’ level may require much more compute resources (execution time, memory space, number of cores) than a sample of a ‘coarser’ level. The load of an MLMC simulation is therefore not as easy to balance [25] as in MC, as there are only few samples on the fine levels. Nevertheless, in the context of partial differential equations with random inputs, MLMC methods allow the approximation of ensemble averages of the solution with accuracy versus complexity analogous to that necessary for one numerical solution of the deterministic problem on the finest mesh [2, 12].

The present study is based on the following assumptions: a) in large scale simulations on emerging, massively parallel computing platforms, processor failures at runtime are inevitable [4, 5], and occur, in fact, with increasing frequency as the number of processors increases, respectively as the quality of processors decreases; b) processor failures at runtime can, in general, not be predicted, but occur randomly and should therefore be modelled as stochastic processes; c) processor failures at runtime are not checkpointed and not recoverable; d) the algorithm of interest has redundancy by design in order to “survive” a certain number of (non-checkpointed) failure events with random arrivals.

Assumptions c) and d) exclude a large number of currently used standard algorithms, for which any loss of data entails abortion of execution; we mention only standard Gaussian Elimination with the loss of one pivot element. Other algorithms may, however, tolerate partial loss of data at runtime. We think of iterative solvers of large linear systems, which may converge even if one or several iterates are “lost” due to hardware failures, and which thus satisfy assumption d). Interestingly, assumption b) implies that the convergence results of deterministic algorithms for deterministic problems on random hardware will necessarily be probabilistic in nature. Here, we consider the case when the algorithm under consideration is stochastic by design, such as Monte Carlo (MC) methods. As we argue in the present paper, MC methods, being probabilistic in nature, are intrinsically fault tolerant: as we prove, the loss of (a “subcritical” fraction of) information through failed samples does not render the whole simulation useless, as is the case, e.g., in many matrix computations such as Gaussian Elimination. MC samples which were lost due to node failures at runtime can be repeated, since new, independent samples can be generated to replace the failed ones.

We provide a mathematical argument predicting that the convergence behavior of MC is not affected substantially if the failed samples are simply disregarded, provided there are “not too many” of these failures (made precise in the mathematical analysis, and corroborated in the numerical experiments).

Specifically, in the present paper we analyze the performance of both MC and MLMC PDE solvers in the presence of hardware failures at runtime. In particular, we investigate the convergence behavior of these methods if processors fail according to a stochastic failure model; the presently developed mathematical analysis accommodates rather general failure models. Naturally, to arrive at a theory which is amenable to rigorous mathematical treatment, a number of simplifying assumptions had to be made. In particular, in our analysis we do not distinguish among different reasons of processor failure; so we do not distinguish between node, program, network, or any other type of failure. We assume that the complete MC sample is lost if one of the (maybe multiple) processors fails that are used for its simulation. We disregard all samples affected by a failure and compute the results with the ‘surviving’ ones.

While in MC all samples are from the (single) finest level (or grid), MLMC gets its statistics also from samples corresponding to coarser grids. (The resolution of the finest level is determined by the required discretization error.) By using information on multiple levels, MLMC needs far fewer samples on the finest level than ordinary MC to attain the same quality of answer. MLMC thus turns out to be much more efficient than MC. To get the optimal MLMC convergence rate (with respect to work), it is crucial to properly choose the numbers M_ℓ of samples on the levels ℓ.

In the presence of failures without checkpointing, MC samples on all levels might be irrevocably lost. The larger (in the sense that they are defined on finer meshes) samples of the finer levels are more vulnerable than the (larger number of) smaller samples on the coarser levels. With very high failure rates it might not be feasible for sufficiently many samples to survive on the finer levels. In general, the error components of faulty levels increase and the overall convergence rate is reduced. With a sufficiently high failure rate, all samples on a particular level may get lost. In this case, the attainable error is bounded from below by the discretization error of that level.

Our main mathematical results are as follows: we prove convergence of MC and MLMC for the first moment (sample average) provided that sufficiently many samples survive on average. We compute the effect of failures according to existing failure models. Numerical experiments of MLMC for hyperbolic PDEs coupled with the Weibull failure model validate our theory. We further investigate the failure resilience of two and three dimensional time-dependent grid applications, like finite elements, finite differences, or finite volumes. These results are obtained by MLMC simulations treating the sample sizes M_ℓ as random variables.

We also discuss fault tolerance issues regarding MPI. In the present standard MPI-3.0 [15], the failure of a single MPI process is fatal for the entire MLMC simulation. For our method to work, MPI would have to be extended by a mechanism to survive losses of MPI processes at runtime and continue with the remaining ones. Based on this paper, the proposed fault tolerant MLMC was already implemented [18] using User Level Failure Mitigation (ULFM) [3], a fault tolerant version of MPI. We compare the error bound with the measured results from this implementation.

The paper is organized as follows: In Sections 2 and 3 we give error bounds for MC and MLMC, respectively, in the presence of a statistical loss of samples. In Section 4 we discuss the Weibull failure model. In Section 5 we conduct a number of numerical experiments to investigate how convergence is affected by failure. We consider two and three dimensional problems with different convergence rates of the PDE solver.


2. MC with a random number of samples

We first introduce a fault tolerant MC (FT-MC) method. Starting from there, the fault tolerance technique used in MLMC is derived.

We are interested in the expected value E[X] of a random variable (RV) X in the probability space (Ω, A, P), with sample space Ω, σ-algebra A and probability measure P [17]. If the second moments of X exist, the Monte Carlo method can be used to estimate E[X]. Given a fixed number M of independent, identically distributed samples X_i, i = 1, 2, ..., M, the MC approximation of E[X] is defined by

\[ E_M[X] := \frac{1}{M} \sum_{i=1}^{M} X_i. \tag{1} \]

The mean square error in the estimator (1) is known to be (see, e.g., [12, 24])

\[ \| E[X] - E_M[X] \|_{L^2(\Omega; L^1(\mathbb{R}^d))} \le M^{-1/2} \, \| X \|_{L^2(\Omega; L^1(\mathbb{R}^d))} := M^{-1/2} \sqrt{E\big[ \| X \|^2_{L^1(\mathbb{R}^d)} \big]}. \tag{2} \]
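For illustration, here is a minimal Python sketch of the estimator (1) and of the M^{−1/2} error decay (2); the uniform toy random variable standing in for X is an assumption made purely for this example.

```python
import numpy as np

def mc_estimate(samples):
    """Plain MC estimator E_M[X] = (1/M) * sum_i X_i, cf. eq. (1)."""
    return float(np.mean(samples))

rng = np.random.default_rng(0)
for M in (10**2, 10**4, 10**6):
    X = rng.uniform(0.0, 1.0, size=M)    # toy RV with exact mean E[X] = 1/2
    print(M, abs(mc_estimate(X) - 0.5))  # error decays roughly like M**-0.5
```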

The “embarrassingly parallel” evaluation of (1) across a possibly large number of nodes is a common approach to reduce wall clock time in MC simulations. In the present work, we highlight and capitalize on a second feature of the MC estimator (1): software or hardware failures at runtime may cause nodes to fail or even crash, such that some samples get lost. We will give sufficient conditions on the node failure statistics to prove

1. that it is reasonable to continue the MC simulation without checkpointing,

2. that no recovery of samples from failed nodes is required,

3. that the statistical quality of the MC simulation is unaffected provided “node failures at runtime do not occur too frequently”, and

4. that there is a certain critical node failure intensity above which the MC simulation does deteriorate, almost surely.

Let us now be more specific about the mathematical model from which these assertions are deduced.

In the presence of system failures at runtime, the sample size M in (1) is not a fixed number anymore, but becomes itself a random variable M̂. We denote the probability space for the failures by (Ω′, A′, P′) and the respective expectation by E′[·]. We assume throughout the paper that node failures at runtime occur with statistics that are independent of the realizations of X; in particular, the runtime needed to compute a realization of X is assumed independent of the realization itself. We furthermore assume in the sequel that the surviving samples are statistically independent. The N-out-of-M strategy suggested by Li and Mascagni [10] has been designed in such a way that the random numbers generated by any N out of M random number streams are independent random numbers; the pseudo-random number generators in [11, 21] incorporate this strategy.

For fixed M the approximation E_M[X] is a RV over Ω. For M̂ considered as an RV over Ω′ this approximation becomes

\[ E_{\hat M}[X] := \frac{1}{\hat M} \sum_{i=1}^{\hat M} X_i. \]


Therefore, it is a RV over the product probability space (Ω × Ω′, A ⊗ A′, P ⊗ P′). Since (Ω, A, P) and (Ω′, A′, P′) are finite measure spaces, there exists (see, e.g., [24] and the references there) a unique product probability measure P̄ = P ⊗ P′ on A ⊗ A′ such that

\[ \bar P(A \times A') = P(A)\, P'(A'), \qquad \forall A \in \mathcal{A},\ A' \in \mathcal{A}'. \]

Theorem 2.1. The MC error estimate with a random number of samples is given by

\[ \| E[X] - E_{\hat M}[X] \|^2_{L^2(\Omega \times \Omega'; L^1(\mathbb{R}^d); \bar P)} \le E'\big[ \hat M^{-1/2} \big] \, \| X \|^2_{L^2(\Omega; L^1(\mathbb{R}^d))}. \tag{3} \]

Proof. We substitute the RV M̂ for M in equation (2), square, use that M̂^{−1} ≤ M̂^{−1/2} (recall that M̂ ≥ 1 almost surely, see below), and integrate over (Ω′, A′, P′) to obtain

\[ \int_{\Omega'} \| E[X] - E_{\hat M}[X] \|^2_{L^2(\Omega; L^1(\mathbb{R}^d))} \, P'(d\omega') \le \int_{\Omega'} \hat M^{-1/2} \, \| X \|^2_{L^2(\Omega; L^1(\mathbb{R}^d))} \, P'(d\omega') = E'\big[ \hat M^{-1/2} \big] \, \| X \|^2_{L^2(\Omega; L^1(\mathbb{R}^d))}. \]

Theorem I.3.17 in [23] then shows that

\[ L^2\big( \Omega'; L^2(\Omega; L^1(\mathbb{R}^d)) \big) \cong L^2\big( \Omega \times \Omega'; L^1(\mathbb{R}^d); P \otimes P' \big), \]

which leads to the claimed error estimate (3).

Once a particular statistical distribution for the node failures has been adopted (and calibrated to the hardware platform of interest), the term E′[M̂^{−1/2}] can be computed, and hence, using Theorem 2.1, the a priori error bound for the fault tolerant MC (FT-MC) method as well. The intuition behind Theorem 2.1 is that all failures are accounted for in the error bound of the FT-MC method. The number of failures occurring in an FT-MC simulation varies according to a known failure distribution; nevertheless, the a priori error bound in Theorem 2.1 remains valid regardless of the number of failures. Knowing the failure distribution allows one to compute an error bound before the number of failures and the number of surviving samples are known.

The condition M̂(ω′) = 0 means that no samples remain. If the probability of this event is positive, then obviously E′[M̂^{−1/2}] is infinite and the FT-MC error bound becomes meaningless. We therefore assume that P′(M̂ = 0) = 0 in the ensuing analysis. In practice, this means that the MC estimation has to be restarted from scratch in the event that all samples are lost at runtime.
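The factor E′[M̂^{−1/2}] can be approximated by simulation once a failure model is fixed. The following sketch does this under a hypothetical toy model (independent per-sample Bernoulli losses), which is an illustrative assumption and not the Weibull node model adopted in Section 4.

```python
import numpy as np

rng = np.random.default_rng(1)

def surviving_samples(M, p_fail, rng):
    """Draw the random number of surviving samples M_hat out of M scheduled
    ones.  Hypothetical toy failure model: each sample is lost independently
    with probability p_fail; redraw if M_hat = 0, since the analysis assumes
    P'(M_hat = 0) = 0."""
    while True:
        m = int(np.sum(rng.random(M) > p_fail))
        if m >= 1:
            return m

# MC estimate of the factor E'[M_hat^(-1/2)] appearing in Theorem 2.1.
M, p_fail, runs = 1000, 0.2, 10**4
factor = np.mean([surviving_samples(M, p_fail, rng) ** -0.5 for _ in range(runs)])
print(factor, M ** -0.5)  # only slightly larger than the failure-free M**-0.5
```

The printed factor quantifies how mildly a moderate loss rate inflates the a priori bound (3) relative to the failure-free value M^{−1/2}.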

3. Abstract Multilevel Monte Carlo with sample losses

3.1. Review of the Multilevel Monte Carlo method

In this section we consider random variables X that are solutions of problems with random inputs that can be solved only approximately. Prominent examples are PDEs with random initial or boundary conditions as they arise in uncertainty quantification (UQ) in engineering applications. The solutions of these PDEs are typically obtained numerically by a discretization method like the finite element, finite difference, or finite volume method. We assume that we have available a hierarchy of discretizations that is indexed by a measure of resolution, e.g., the grid spacing.

In contrast to ordinary Monte Carlo methods, Multilevel Monte Carlo (MLMC) methods can exploit this hierarchy of discretizations. If implemented properly, MLMC provides estimates of E[X] of higher accuracy than MC for the same amount of work measured in floating point operations.


A hierarchy of discretizations is given, for instance, under the assumption that the computational domain is rectangular and is discretized by a hierarchy of regular grids with grid spacings (or “meshwidths”)

\[ h_0 > \cdots > h_L, \qquad h_i = 2 h_{i+1}. \tag{4} \]

Such hierarchies are available to most simulations in engineering. Similar to the derivation of Giles [7], the difference between X(·, ω) of the original problem and its discretization X_h(·, ω) on the mesh of width h, i.e., the discretization error, is assumed to converge with order α,

\[ \| E[X - X_{h_\ell}] \|_B = O(h_\ell^\alpha), \qquad \alpha > 0, \tag{5} \]

where ‖·‖_B denotes a norm in the Banach space B. We also assume a convergence order β for

\[ \| X_{h_\ell} - X_{h_{\ell-1}} \|_{L^2(\Omega;B)} = O(h_{\ell-1}^\beta), \qquad \beta > 0,\ \ell > 0, \tag{6} \]

and assume that

\[ \| X_{h_0} \|_{L^2(\Omega;B)} = O(1). \tag{7} \]

The first equality is a consequence of the bounded variation of X and of the consistency of the discretization scheme, given the almost sure regularity of the sample paths; the latter equality can always be achieved by a scaling of the data. Furthermore, it is assumed that the work needed to compute X_{h_ℓ} satisfies

\[ W_\ell = O(h_\ell^{-\gamma}), \qquad \gamma > 0. \tag{8} \]

As explained, e.g., in [7, 12–14], the MLMC method is based on the following telescopic sum,

\[ E[X_{h_L}] = E[X_{h_0}] + \sum_{\ell=1}^{L} E[X_{h_\ell} - X_{h_{\ell-1}}], \tag{9} \]

that allows one to estimate E[X] levelwise,

\[ E^L[X_{h_L}] = E_{M_0}[X_{h_0}] + \sum_{\ell=1}^{L} E_{M_\ell}\big[ X_{h_\ell} - X_{h_{\ell-1}} \big]. \tag{10} \]

On level ℓ we use an ordinary MC method with M_ℓ samples (X^i_{h_ℓ} − X^i_{h_{ℓ−1}}), i = 1, ..., M_ℓ, to approximate E[X_{h_ℓ} − X_{h_{ℓ−1}}]. Note that in this paper a sample on level ℓ comprises the difference between the same realization X^i computed on the two consecutive discretization levels h_ℓ and h_{ℓ−1}. In order to form this difference, the realizations X^i_{h_ℓ} and X^i_{h_{ℓ−1}} have to be computed with the same ω. On the coarsest level we estimate E[X_{h_0}] with M_0 samples X^i_{h_0}, i = 1, ..., M_0. The MLMC error ‖E[X] − E^L[X_{h_L}]‖_{L^2(Ω;B)} is the norm of the difference between the true expectation E[X] in (9) and the MLMC estimate E^L[X_{h_L}] in (10),

\[
\begin{aligned}
\| E[X] - E^L[X_{h_L}] \|_{L^2(\Omega;B)}
&\le \| E[X] - E[X_{h_L}] \|_B + \| E[X_{h_L}] - E^L[X_{h_L}] \|_{L^2(\Omega;B)} \\
&\le \| E[X] - E[X_{h_L}] \|_B + \| E[X_{h_0}] - E_{M_0}[X_{h_0}] \|_{L^2(\Omega;B)} \\
&\qquad + \sum_{\ell=1}^{L} \| E[X_{h_\ell} - X_{h_{\ell-1}}] - E_{M_\ell}[X_{h_\ell} - X_{h_{\ell-1}}] \|_{L^2(\Omega;B)} \\
&\overset{(2)}{\le} \| E[X] - E[X_{h_L}] \|_B + M_0^{-1/2} \| X_{h_0} \|_{L^2(\Omega;B)} \\
&\qquad + \sum_{\ell=1}^{L} M_\ell^{-1/2} \| X_{h_\ell} - X_{h_{\ell-1}} \|_{L^2(\Omega;B)}.
\end{aligned}
\tag{11}
\]


With (6) and (7) this bound becomes

\[ \| E[X] - E^L[X_{h_L}] \|_{L^2(\Omega;B)} = O(h_L^\alpha) + M_0^{-1/2}\, O(1) + \sum_{\ell=1}^{L} M_\ell^{-1/2}\, O(h_{\ell-1}^\beta). \tag{12} \]

This result is valid for any numbers L and M_ℓ. However, to benefit from MLMC, a clever choice is crucial. One may balance the expected error terms on the right side of (12) [12], or minimize the computational work with respect to the expected error, see [7]. In either case, when expressed in terms of work, the convergence rates of MLMC are at least as good as the ones of standard MC methods. Giles [7] shows that for α ≥ 1/2 and any ε < e^{−1} there are an L and values M_ℓ, 0 ≤ ℓ ≤ L, such that the MLMC (discretization and sampling) error is bounded by

\[ \| E[X] - E^L[X_{h_L}] \|_{L^2(\Omega;B)} < \varepsilon, \]

with the total work for computing E^L[X_{h_L}] given by

\[ \text{work} = \begin{cases} O(\varepsilon^{-2}), & \beta > 1, \\ O(\varepsilon^{-2} (\log \varepsilon)^2), & \beta = 1, \\ O(\varepsilon^{-2-(1-\beta)/\alpha}), & 0 < \beta < 1. \end{cases} \]
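As a hedged illustration, the following sketch balances the expected error terms on the right side of (12), with all hidden O(·) constants set to 1 (an assumption), and evaluates the resulting total work; this naive equilibration matches the rates quoted above only up to logarithmic factors.

```python
import math

def mlmc_parameters(eps, alpha, beta, gamma, h0=1.0):
    """Hedged sketch of the error balancing of [12] applied to bound (12):
    choose L such that the bias h_L^alpha <= eps/2 (with h_l = h0 * 2^-l),
    then equilibrate the L+1 sampling terms at eps/(2(L+1)).
    Returns L, the M_l, and the work sum W = sum_l M_l * h_l^-gamma."""
    L = max(1, math.ceil(math.log2(2.0 / eps) / alpha))
    h = [h0 * 2.0 ** (-l) for l in range(L + 1)]
    M = [math.ceil((2.0 * (L + 1) / eps * h[max(l - 1, 0)] ** beta) ** 2)
         for l in range(L + 1)]
    work = sum(M[l] * h[l] ** (-gamma) for l in range(L + 1))
    return L, M, work

for eps in (1e-1, 1e-2, 1e-3):
    L, M, work = mlmc_parameters(eps, alpha=1.0, beta=1.0, gamma=2.0)
    print(eps, L, work)  # work grows roughly like eps^-2 times log factors
```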

3.2. Sample losses in MLMC

In the previous Section 2 we considered the MC method with a random number of samples, under the realistic assumption that at least one MC sample survives. Within each level of the MLMC method a standard MC simulation is executed. This allows reuse of most parts of the FT-MC approach in the FT-MLMC method. In the FT-MLMC method, in contrast to the FT-MC method, it is feasible to compute an MLMC estimate even in the case that all samples of one level ℓ ∈ [0, L] are lost. A level ℓ of mesh refinement on which all MC samples are lost will be referred to as a lost level. In MLMC, M_ℓ = 0 on some level ℓ should not lead to an infinite error bound. Therefore, the MLMC error bound has to be modified such that it can handle lost levels.

3.3. MLMC error bound with lost levels

We derive an MLMC error bound under the assumption that M_ℓ ≥ 0, so that M_ℓ = 0 may occur. We shall refer to this case as an “entirely lost level”. In the discussion of this section, the numbers of samples M_ℓ are not random.

The MLMC estimate (10) is modified such that lost levels (M_ℓ = 0) are not taken into account,

\[ E^L[X_{h_L}] = E_{M_0}[X_{h_0}] + \sum_{\ell=1}^{L} E_{M_\ell}\big[ X_{h_\ell} - X_{h_{\ell-1}} \big], \qquad E_{M_\ell}[\cdot] = 0 \ \text{if} \ M_\ell = 0. \tag{13} \]
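A minimal sketch of the modified estimator (13): an entirely lost level simply contributes zero (the numerical values in the example are hypothetical).

```python
import numpy as np

def ft_mlmc_estimate(level_samples):
    """FT-MLMC estimate (13): level_samples[l] holds the surviving differences
    X_{h_l} - X_{h_{l-1}} on level l (on level 0: the plain samples X_{h_0}).
    An entirely lost level (empty list, M_l = 0) contributes 0 instead of
    breaking the estimator."""
    return sum(float(np.mean(s)) if len(s) > 0 else 0.0 for s in level_samples)

# Example with hypothetical numbers: level 2 is entirely lost and skipped.
print(ft_mlmc_estimate([[0.63, 0.61, 0.66], [0.02, 0.01], [], [0.001]]))
```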

The error bound ‖E[X] − E^L[X_{h_L}]‖_{L^2(Ω;B)} is analyzed. In the following derivation we use a bound for an MC simulation with M = 0, when nothing is computed (E_0[X] = 0):

\[ \| E[X] - E_0[X] \|_{L^2(\Omega;B)} = \| E[X] \|_{L^2(\Omega;B)} = \sqrt{\| E[X] \|_B^2} \le \sqrt{E\big[ \| X \|_B^2 \big]} = \| X \|_{L^2(\Omega;B)}. \tag{14} \]


Following the derivation of (11), we obtain an error bound for the MLMC method where levels might be lost (M_ℓ = 0):

\[
\begin{aligned}
\| E[X] - E^L[X_{h_L}] \|_{L^2(\Omega;B)}
&\le \| E[X] - E[X_{h_L}] \|_B + \| E[X_{h_L}] - E^L[X_{h_L}] \|_{L^2(\Omega;B)} \\
&\le \| E[X] - E[X_{h_L}] \|_B + \| E[X_{h_0}] - E_{M_0}[X_{h_0}] \|_{L^2(\Omega;B)} \\
&\qquad + \sum_{\ell=1}^{L} \| E[X_{h_\ell} - X_{h_{\ell-1}}] - E_{M_\ell}[X_{h_\ell} - X_{h_{\ell-1}}] \|_{L^2(\Omega;B)} \\
&\overset{(2),(14)}{\le} \| E[X] - E[X_{h_L}] \|_B + \min(1, M_0^{-1/2}) \| X_{h_0} \|_{L^2(\Omega;B)} \\
&\qquad + \sum_{\ell=1}^{L} \min(1, M_\ell^{-1/2}) \| X_{h_\ell} - X_{h_{\ell-1}} \|_{L^2(\Omega;B)}.
\end{aligned}
\tag{15}
\]

This inequality bounds the MLMC error when some levels might be lost.

Theorem 3.1. The Multilevel Monte Carlo error with lost levels (i.e., when M_ℓ ≥ 0) is bounded by

\[ \| E[X] - E^L[X_{h_L}] \|_{L^2(\Omega;B)} \le \| E[X] - E[X_{h_L}] \|_B + \min(1, M_0^{-1/2}) \| X_{h_0} \|_{L^2(\Omega;B)} + \sum_{\ell=1}^{L} \min(1, M_\ell^{-1/2}) \| X_{h_\ell} - X_{h_{\ell-1}} \|_{L^2(\Omega;B)}, \]

or, with assumptions (5), (6) and (7),

\[ \| E[X] - E^L[X_{h_L}] \|_{L^2(\Omega;B)} \le O(h_L^\alpha) + \min(1, M_0^{-1/2})\, O(1) + \sum_{\ell=1}^{L} \min(1, M_\ell^{-1/2})\, O(h_{\ell-1}^\beta). \]

Losing a level (M_ℓ = 0) increases the MLMC error bound in Theorem 3.1 substantially. This applies particularly to the low levels, where the cost of a loss is highest, since losing level ℓ leads to an error O(h_{ℓ−1}^β). Note that levels finer than the lost one no longer improve the order of the error.

3.4. Fault Tolerant MLMC error bound with random failures

As in the fault tolerant MC method, we work under the “all or nothing” paradigm, i.e., a sample is either correctly computed or irrecoverably lost when nodes fail at runtime. We do not distinguish between node, program, network, or any other cause of failure at runtime. We discard all samples affected by a failure, and compute the result with the remaining ones. Then the number of samples per level M_ℓ is not a fixed number anymore, but a random number M̂_ℓ.

In accordance with Section 2, the probability space for the random solution X is denoted by (Ω, A, P), and the probability space modeling runtime failures by (Ω′, A′, P′). Evidently, the number of MC samples per level depends on the runtime failures. Hence M̂_ℓ is a measurable mapping

\[ \hat M_\ell : (\Omega', \mathcal{A}', P') \to (\mathbb{N}_0, 2^{\mathbb{N}_0}). \]


We derive an error bound for an MLMC sample average estimate computed from random numbers of samples, which models MLMC in the presence of failures at runtime. Specifically, the following fault tolerant MLMC (FT-MLMC) estimator is used:

\[ \hat E^L[X_{h_L}] = E_{\hat M_0}[X_{h_0}] + \sum_{\ell=1}^{L} E_{\hat M_\ell}\big[ X_{h_\ell} - X_{h_{\ell-1}} \big], \qquad E_{\hat M_\ell}[\cdot] = 0 \ \text{if} \ \hat M_\ell = 0. \]

Theorem 3.2. The FT-MLMC error under the influence of failures is bounded by

\[ \| E[X] - \hat E^L[X_{h_L}] \|_{L^1(\Omega'; L^2(\Omega;B))} \le \| E[X] - E[X_{h_L}] \|_B + E'\big[ \min(1, \hat M_0^{-1/2}) \big] \| X_{h_0} \|_{L^2(\Omega;B)} + \sum_{\ell=1}^{L} E'\big[ \min(1, \hat M_\ell^{-1/2}) \big] \| X_{h_\ell} - X_{h_{\ell-1}} \|_{L^2(\Omega;B)}, \]

or, given the assumptions (5), (6) and (7),

\[ \| E[X] - \hat E^L[X_{h_L}] \|_{L^1(\Omega'; L^2(\Omega;B))} \le O(h_L^\alpha) + E'\big[ \min(1, \hat M_0^{-1/2}) \big]\, O(1) + \sum_{\ell=1}^{L} E'\big[ \min(1, \hat M_\ell^{-1/2}) \big]\, O(h_{\ell-1}^\beta). \]

Proof. By Theorem 3.1 we have

\[ \| E[X] - \hat E^L[X_{h_L}] \|_{L^2(\Omega;B)} \le \| E[X] - E[X_{h_L}] \|_B + \min(1, \hat M_0^{-1/2}) \| X_{h_0} \|_{L^2(\Omega;B)} + \sum_{\ell=1}^{L} \min(1, \hat M_\ell^{-1/2}) \| X_{h_\ell} - X_{h_{\ell-1}} \|_{L^2(\Omega;B)}. \]

Both sides are integrated over the probability space (Ω′, A′, P′), leading to

\[ \int_{\Omega'} \| E[X] - \hat E^L[X_{h_L}] \|_{L^2(\Omega;B)}\, P'(d\omega') \le \int_{\Omega'} \Big( \| E[X] - E[X_{h_L}] \|_B + \min(1, \hat M_0^{-1/2}) \| X_{h_0} \|_{L^2(\Omega;B)} + \sum_{\ell=1}^{L} \min(1, \hat M_\ell^{-1/2}) \| X_{h_\ell} - X_{h_{\ell-1}} \|_{L^2(\Omega;B)} \Big)\, P'(d\omega'). \]

With the linearity of the mathematical expectation the error estimate becomes

\[ \int_{\Omega'} \| E[X] - \hat E^L[X_{h_L}] \|_{L^2(\Omega;B)}\, P'(d\omega') \le \int_{\Omega'} \| E[X] - E[X_{h_L}] \|_B\, P'(d\omega') + \int_{\Omega'} \min(1, \hat M_0^{-1/2})\, P'(d\omega')\, \| X_{h_0} \|_{L^2(\Omega;B)} + \sum_{\ell=1}^{L} \int_{\Omega'} \min(1, \hat M_\ell^{-1/2})\, P'(d\omega')\, \| X_{h_\ell} - X_{h_{\ell-1}} \|_{L^2(\Omega;B)}. \]

Since the first integrand does not depend on ω′, its integral equals ‖E[X] − E[X_{h_L}]‖_B, and the claim follows.


We derived a general fault tolerant MLMC method. Regarding the implementation, some aspects should be considered. A (single level) MC method is used to estimate E[X_{h_ℓ} − X_{h_{ℓ−1}}] by E_{M_ℓ}[X_{h_ℓ} − X_{h_{ℓ−1}}] = (1/M_ℓ) Σ_{i=1}^{M_ℓ} (X^i_{h_ℓ} − X^i_{h_{ℓ−1}}), where X^i_{h_ℓ} and X^i_{h_{ℓ−1}} approximate the same realization X^i with the two different discretization parameters h_ℓ and h_{ℓ−1}. This implies a strong statistical correlation between X^i_{h_ℓ} and X^i_{h_{ℓ−1}} (in general the same random numbers are used in X^i_{h_ℓ} and X^i_{h_{ℓ−1}}). Failures influence the random number of samples per level M̂_ℓ. It is emphasized that a sample on a level always consists of the difference between the two computed solutions (X^i_{h_ℓ} − X^i_{h_{ℓ−1}}). In other words, whenever an X^i_{h_{ℓ−1}} is lost due to a failure, the corresponding X^i_{h_ℓ} has to be disregarded as well, and vice versa. We therefore propose to do all computations for a sample X^i_{h_ℓ} − X^i_{h_{ℓ−1}} on a single unit; a failure in this unit automatically leads to the loss of both X^i_{h_ℓ} and X^i_{h_{ℓ−1}}. Only entirely computed samples X^i_{h_ℓ} − X^i_{h_{ℓ−1}} are accumulated or communicated to other units. This procedure has the slight disadvantage that two different discretizations have to be computed on the same unit.

4. Statistical model of node failure at runtime

System failures can interfere with the computation of the MLMC estimate. They can be due to errors in software, hardware, or the network, but also due to environmental effects. These failures are manifestations of various types of errors, which can be roughly classified as permanent, transient, or silent [5]. Permanent errors do not disappear, whereas transient errors are short-term errors. Silent errors are undetected (permanent or transient) errors, and hence they may lead to undetected erroneous results. Other classifications, into hard and soft errors, are conceivable and are for instance described in [8].

In the previous section we extended the MLMC theory to cover random numbers of samples. This allows us to disregard all samples affected by detected errors (permanent or transient, soft or hard). Our fault tolerant MLMC method simply ignores these lost samples and tries to make the best out of the surviving ones. Silent (undetected) errors, however, are not covered by this theory.

To specify the error bound of the FT-MLMC method, the statistics of the failures leading to sample losses has to be known. At present only rather rough failure models are available, and the development of more detailed and realistic models is needed, in our opinion. For new HPC systems first empirical statistical failure studies [16, 22] are available. Schroeder and Gibson [22] derived a somewhat realistic failure model that has been adopted in the present paper. The authors studied the failures which occurred over 9 years in more than 20 different systems at Los Alamos National Laboratory. Their raw data [27] contains an entry for any failure that required the attention of a system administrator. The authors conclude that the time interval between two failures of an individual node, as well as of the entire system, is well fitted by the Weibull distribution with shape parameter k between 0.7 and 0.8. The probability density function of a Weibull random variable x is given by (e.g., [17])

\[ f(x; \lambda, k) = \begin{cases} \dfrac{k}{\lambda} \left( \dfrac{x}{\lambda} \right)^{k-1} e^{-(x/\lambda)^k}, & \text{if } x \ge 0, \\ 0, & \text{if } x < 0, \end{cases} \]

where λ > 0 is the scale parameter and k > 0 is the shape parameter. Schroeder and Gibsonalso found the failure rates to be roughly proportional to the number of processors in the system.


Therefore, we use the Weibull distribution to model the time between two consecutive failures on one node. This model assumes that node failures are statistically independent, which is one of the rather rough approximations made. The data of Schroeder and Gibson [22] also gives evidence of a correlation between the failure rate and the type and intensity of the workload running on a machine, as the failure rate is considerably lower during the night and on weekends. However, we had to idealize, since more accurate node failure data and models are hard to come by, at least at present. Nevertheless, we point out that our theory accommodates other, more sophisticated parametric failure models as well, as long as they satisfy the general assumptions of the present paper, and we hope that failure models with better quantitative accuracy will become available in the near future.

4.1. Time to next failure in a renewal process

When a new simulation on a computer is started, we generally do not know the last time of failure of this computer. Thus, when applying the failure model from Section 4, we are not only interested in the time between two failures, but also in the distribution of the time from the start of the computation until the first failure on a node. This time is modeled with an ordinary renewal process (RP). An ordinary RP is a sequence of i.i.d. nonnegative RVs Y_1, Y_2, .... In the model applied, these RVs are the time intervals between two failures; hence they are Weibull distributed. After a failure, the failed component is assumed to be renewed within negligible time. Therefore, the time to the next failure is again Weibull distributed. This ordinary RP is started at t = 0. We are interested in the so-called forward recurrence time until the next failure: F_t is the time from t up to and including the moment of the next failure. The MLMC computation is started at time t. Rinne [20] states that, as long as t is finite, F_t can only be evaluated numerically. In the case t → ∞, however, [20, eq. (4.41a)] provides the analytic distribution function of F_∞.

A procedure to draw realizations of the forward recurrence time F_∞ is given in [26]:

1. Generate the random variable S from S ∼ Γ(1 + 1/k, λ^k), where λ and k are the Weibull scale and shape parameters, respectively.

2. Compute the random variable V as V = S^{1/k}.

3. Generate the random variable F_∞ from F_∞ ∼ Uniform(0, V).
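This three-step procedure translates directly into code; a minimal sketch using NumPy follows (the sanity check at the end uses the parameters calibrated in Section 4.3).

```python
import numpy as np

rng = np.random.default_rng(4)

def forward_recurrence_time(lam, k, rng):
    """Draw F_infinity by the three-step procedure of [26] quoted above:
    S ~ Gamma(1 + 1/k, lam^k), V = S^(1/k), F_infinity ~ Uniform(0, V)."""
    S = rng.gamma(shape=1.0 + 1.0 / k, scale=lam ** k)
    V = S ** (1.0 / k)
    return rng.uniform(0.0, V)

lam, k = 7.6e5, 0.7                  # parameters of Section 4.3
F = np.array([forward_recurrence_time(lam, k, rng) for _ in range(10**4)])
# Note: for k < 1 the mean of F_infinity exceeds the ~11 day MTBF quoted in
# Section 4.3 (inspection paradox of renewal processes).
print(F.mean() / 86400.0, "days")
```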

4.2. Estimating the number of samples per level with MC

One instance of the number of samples M̂_ℓ on level ℓ is drawn by generating a realization of the first failure F_∞ on a node where samples of level ℓ are computed. The term E′[min(1, M̂_ℓ^{−1/2})] is then estimated by MC, i.e., by averaging over different realizations of M̂_ℓ.
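A hedged sketch of this estimation: it assumes, purely for illustration, that the scheduled samples of a level run back to back on one node and that exactly the samples completed before the drawn failure time survive.

```python
import numpy as np

rng = np.random.default_rng(5)

def forward_recurrence_time(lam, k, rng):
    # as in the Section 4.1 sketch
    return rng.uniform(0.0, rng.gamma(1.0 + 1.0 / k, lam ** k) ** (1.0 / k))

def level_factor(lam, k, sample_time, M_scheduled, runs, rng):
    """MC estimate of E'[min(1, M_hat^(-1/2))] for one level, under the
    hypothetical simplification that M_scheduled samples of sample_time
    seconds each run sequentially on a single node."""
    factors = []
    for _ in range(runs):
        f = forward_recurrence_time(lam, k, rng)
        m = min(M_scheduled, int(f // sample_time))  # surviving samples M_hat
        factors.append(1.0 if m == 0 else min(1.0, m ** -0.5))
    return float(np.mean(factors))

print(level_factor(7.6e5, 0.7, sample_time=94.0, M_scheduled=64,
                   runs=10**4, rng=rng))
```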

4.3. Calibration of parameters

The Weibull scale and shape parameters λ and k are used in the above drawing of the forward recurrence times F_∞. They depend on the machinery the FT-MLMC method runs on. In [20] several methods are presented to estimate both Weibull parameters.

In our analysis we use the parameters given in [22], k ≈ 0.7 and λ ≈ 7.6 · 10⁵ for a single node. Note that k ≈ 0.7 is explicitly given in [22] for an individual node; λ is estimated from [22, Fig. 6(b)]. The time between failures for which the cumulative distribution function equals 0.5 is given approximately by x ≈ 4.5 · 10⁵ s. Therefore, λ can be obtained from 0.5 ≈ 1 − e^{−(4.5·10⁵/λ)^k}, yielding λ ≈ 7.6 · 10⁵.
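This calibration amounts to inverting the Weibull CDF at the median; a short sketch (the median value 4.5 · 10⁵ s is the one read off [22, Fig. 6(b)]):

```python
import math

# The Weibull CDF is F(x) = 1 - exp(-(x/lam)^k), so the median x_med
# satisfies x_med = lam * (ln 2)^(1/k).
k, x_med = 0.7, 4.5e5
lam = x_med / math.log(2.0) ** (1.0 / k)
mtbf = lam * math.gamma(1.0 + 1.0 / k)  # mean of the Weibull distribution
print(lam)              # ~7.6e5 s, as used in the text
print(mtbf / 86400.0)   # ~11 days, cf. the MTBF quoted below
```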


These parameters lead to a mean time between failures (MTBF) of approximately 11 days for a typical node. We use nodes with 128 cores, as in [22] a node has 80–128 cores. Whenever a node is hit by a failure, the results of all 128 involved cores are assumed to be irrevocably lost.

5. Numerical experiments

All MLMC methods satisfying assumptions (5)–(8) are suited for the FT-MLMC method. We assess the quality of our FT-MLMC error bounds by means of the grid-based finite volume code ALSVID-UQ [1] for solving hyperbolic systems of conservation laws. We set α = β = s in eqs. (5)–(6). With this choice, the work to compute a sample X^i_{h_ℓ} − X^i_{h_{ℓ−1}} on level ℓ is

\[ W_\ell = 2^{d+1} W_{\ell-1} = 2^{(d+1)\ell} W_0, \qquad \ell \ge 0, \tag{16} \]

where the exponent originates from d space and 1 time dimensions. Note that samples on low levels require only a small fraction of the execution time of samples on high levels. The same holds for memory space. The memory usage of a grid-based PDE solver is given by

\[ \text{mem}_\ell = 2^d\, \text{mem}_{\ell-1} = 2^{d\ell}\, \text{mem}_0, \qquad \ell \ge 0. \tag{17} \]

As suggested in [12] for s < (d + 1)/2, the number of samples M_ℓ on level ℓ is set to

\[ M_\ell = 2^{-2s} M_{\ell-1} = 2^{2(L-\ell)s} M_L, \qquad \ell \ge 0. \tag{18} \]

On the finest level we choose a moderate number of samples, e.g., M_L = 8. We parallelize our simulation level-wise and across levels. A core deals with samples of only one level, i.e., it computes solutions on the two resolutions used for a sample X^i_{h_ℓ} − X^i_{h_{ℓ−1}} on level ℓ.

In our implementation we choose an intermediate level b, on and below which individual samples are executed on a single core. This level is somewhat arbitrary; it is typically determined based on memory requirements or execution times of a sample. The subdomains are chosen large enough such that the communication over subdomain interfaces plays a minor role. On levels ℓ > b, a single sample is executed in parallel on multiple cores. The number of cores is determined by the memory requirement of the sample, as given in equation (17). On levels ℓ ≤ b samples are executed sequentially. So, the number N_ℓ of cores invested per sample on level ℓ is

\[ N_\ell = \lceil 2^{d(\ell-b)} \rceil, \qquad \ell \ge 0. \tag{19} \]

We collect samples in tasks of equal execution times. Tasks are the units of work that are submittedto the compute nodes. A sample of the finest level forms a task. Tasks of coarser levels consist ofmultiple samples. On very coarse levels, all respective samples are included in a single task thatmay have a shorter execution time than (most) other tasks. Tasks of the same level are alwayscomputed on independent nodes. Any single node may compute tasks from different levels.

In Fig. 1 a 3D example is given with 4 levels, b = 1, M_L = 4, and s = 1/2. The MLMC estimate is computed in parallel on 293 cores. The 4 samples of level L are each computed in a task of 64 cores; hence 256 cores are involved in the computation of this level. On level b = 1 the number of cores in a task is one. This number increases by a factor of 2^d = 8 towards the finer levels, according to (19). On level 0, one core deals with a single task that comprises all 32 samples of this level. The execution time of this task is approximately half of that of the others, causing a small load imbalance.
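The per-level sample and core counts of this example follow directly from (18) and (19); a small sketch reproducing the 293 cores (the task counts per level are taken from Fig. 1):

```python
import math

def level_layout(L, ML, d, s, b):
    """Per-level sample counts (18) and cores per task (19):
    M_l = 2^(2s(L-l)) * M_L and N_l = ceil(2^(d(l-b)))."""
    M = [round(2.0 ** (2.0 * s * (L - l)) * ML) for l in range(L + 1)]
    N = [math.ceil(2.0 ** (d * (l - b))) for l in range(L + 1)]
    return M, N

# The Fig. 1 example: L = 3, ML = 4, d = 3, s = 1/2, b = 1.  With 4 tasks on
# levels 1-3 and a single task on level 0 (as in the figure), the core count
# 4*64 + 4*8 + 4*1 + 1*1 = 293 quoted in the text is reproduced.
M, N = level_layout(L=3, ML=4, d=3, s=0.5, b=1)
tasks = [1, 4, 4, 4]
print(M, N, sum(t * n for t, n in zip(tasks, N)))  # [32,16,8,4] [1,1,8,64] 293
```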


Figure 1: Computing an MLMC estimate with parameters L = 3, M_L = 4, d = 3, s = 1/2 and only one core per task on levels 0 and 1. (Gantt-style chart over wall clock time: level L = 3 runs 4 tasks of 64 cores with one sample each; level 2 runs 4 tasks of 8 cores; level 1 runs 4 single-core tasks of 16 samples in total; level 0 runs one single-core task comprising all 32 samples, followed by idle time.)

As mentioned in Section 3.4, it is crucial that samples (X^i_{h_ℓ} − X^i_{h_{ℓ−1}}) are lost as a whole, i.e., it must never happen that one part (either X^i_{h_ℓ} or X^i_{h_{ℓ−1}}) survives and is used in the end result while the other part is lost. This is achieved by computing a whole sample (X^i_{h_ℓ} − X^i_{h_{ℓ−1}}) on cores belonging to the same node; in case of a node failure the whole sample is lost. Large samples, however, use so many cores that multiple nodes have to be used to compute them. In case of a node failure, the processes running on unaffected cores have to realize that they should stop computing their part of the sample. How this is solved is implementation dependent. In our FT-MLMC implementation [18] we used User Level Failure Mitigation (ULFM) [3], a fault tolerant extension of MPI. With ULFM an MPI process trying to communicate with a failed one is notified about the failure. This process can then prohibit any further communication on the communicator used for the computation of the affected sample. This ensures that the information about failures is received by all involved MPI processes. Further details are available in [18].

5.1. Lowering risk of level loss

A high probability of losing all samples on one level increases the FT-MLMC error bound in Theorem 3.2 substantially. This applies particularly to the coarse levels, where the impact of a loss is highest. Two different ways of computing the samples are investigated; both lower the risk of losing all samples on coarse levels:

• “late save/4 tasks”: The samples of a level are split into at least four tasks, each of which is computed on independent nodes. The samples are stored only after the completion of the whole task. So, according to our assumptions, a failure at runtime of a task leads to the total loss of all its samples. In the example shown in Fig. 1 only level 0 is affected, as all other levels already run on at least four cores, see Fig. 2. This strategy reduces the risk of losing all samples on level 0 by reducing the execution time of a task as well as by using multiple tasks. The execution time of the now 4 tasks is reduced by a factor of 4 compared to the original one, which increases the load imbalance.


Figure 2: The “late save/4 tasks” strategy requests at least 4 tasks per level. In the example of Fig. 1 this leads to a large number of very small tasks on level 0 (4 × 1 cores with 8 samples each, plus idle time).

• “immediate save”: Immediately after the completion of a sample, it is safely stored, andhence will not get lost anymore.

The two strategies behave differently in case of failure. In the “immediate save” strategy, a larger part of the data on a node survives the failure of the node, leading to higher failure resilience. This advantage is offset by a higher communication overhead. In the “late save” strategy the statistical quality of the remaining samples is equivalent to the N-out-of-M strategy suggested by Li and Mascagni [10]. How to guarantee statistical independence of the remaining samples in the “immediate save” strategy, where only the first entries (up to the failure) of a random number stream are used, is to our knowledge an open question. Also the resilience of massively parallel RNGs (such as WELL [9]) in the presence of partial loss of streams remains to be addressed.

5.2. Euler equations of gas dynamics

We show results for the finite volume method (FT-MLMC-FVM) that solves the Euler equations of gas dynamics. Following [13, 25], the d-dimensional Euler equations of gas dynamics are given by

\[
\begin{aligned}
\rho_t + \operatorname{div}(\rho \mathbf{u}) &= 0, \\
(\rho \mathbf{u})_t + \operatorname{div}(\rho \mathbf{u} \otimes \mathbf{u} + p\, \mathrm{I}_D) &= 0, \\
E_t + \operatorname{div}\big( (E + p) \mathbf{u} \big) &= 0,
\end{aligned}
\]

in D = (0, 1)^d, with outflow boundary conditions, where ρ is the density and u the velocity vector. The pressure p and the total energy E are related by the equation of state of an ideal gas, i.e.,

\[ E := \frac{p}{\gamma - 1} + \frac{1}{2} \rho |\mathbf{u}|^2, \]

with γ the ratio of specific heats. The RV of interest is X(x, t, ω) = (ρ(x, t, ω), u(x, t, ω), p(x, t, ω)), computed at t = 0.6 s. In [13] an MLMC Finite Volume Method (MLMC-FVM) approach is proposed to estimate E[X(·, t)]. An MLMC-FVM error bound is shown for scalar solutions in [12], and assumed to hold for nonlinear hyperbolic systems of conservation laws in [13, 14]:

\[ \| E[X(\cdot,t)] - E^L[X(\cdot,t)] \|_{L^2(\Omega; L^1(\mathbb{R}^d))} \le C_1 h_L^s + C_2 M_0^{-1/2} + C_3 \sum_{\ell=1}^{L} M_\ell^{-1/2} h_{\ell-1}^s, \tag{20} \]

which corresponds to the error bound (12). We set the constants as 10 · C₁ = C₂ = 10 · C₃ = 0.5. The convergence rate

\[ \| E[X(\cdot,t)] - E^L[X(\cdot,t)] \|_{L^2(\Omega; L^1(\mathbb{R}^d))} \lesssim W^{-s/(d+1)} \cdot \log(W), \qquad s < (d+1)/2, \]

is empirically shown in [13]. Here, W = Σ_ℓ W_ℓ M_ℓ is the total work. Sample errors on level ℓ are bounded by [12–14]

\[ \| X_\ell(\cdot,t) - X_{\ell-1}(\cdot,t) \|_{L^2(\Omega; L^1(\mathbb{R}^d))} \le C_4 h_{\ell-1}^s. \tag{21} \]

This bound is related to assumption (6). The bounds (20) and (21) imply the validity of Theorems 3.1 and 3.2 for the FT-MLMC-FVM method.


5.3. Assessment of the FT-MLMC error bound

The error bound is compared with the measured error of an FT-MLMC implementation. In [18] we report on the implementation of an MPI-parallelized FT-MLMC method in ALSVID-UQ [1]. In this implementation we used User Level Failure Mitigation (ULFM) [3], a fault tolerant extension of MPI. A failure generator is used to simulate MPI process failures. For this purpose a timed asynchronous interrupt is set, which kills the MPI process using the exit system call. This ensures that failures can happen at any time: during the computation, during communication, or while ULFM is recovering from previous failures. Our implementation in [18] demonstrates that the FT-MLMC algorithm proposed in this paper can be successfully implemented.

In [18] we computed the FT-MLMC results on Brutus, a large compute cluster at ETH Zurich, where we used one node with four 12-core AMD Opteron 6174 CPUs and 64 GB of RAM. Due to an incompatibility of the ULFM developer implementation with the batch system of Brutus, it was not possible to run the code on multiple nodes. We used a modified version of the failure model of Section 4 where failures do not affect a whole node but only a single process; otherwise the failure model remains the same.

The exact parameters used for the simulation are described in [18] and summarized in Table 1. All measurements are averages over multiple FT-MLMC runs; in the current paper we used 100 runs for 6 s < MTBF < 100 s, otherwise 30 runs were used. The error is computed using an MLMC reference solution E[X_ref] with L = 8, M_L = 8 and h_L = 2^{−11}; the absolute FT-MLMC error is measured as √(E_{30;100}[‖Ê^L[X_{h_L}] − E[X_ref]‖²_{L¹}]). In contrast to the “immediate save” strategy described in Section 5.1, we implemented an “intermediate save” strategy, where the intermediate results of a process are sent, up to four times during the computation, to other processes computing samples of the same level. Additionally, in both the “intermediate save” and the “late save” strategy the samples of a level are split into at least two tasks, each of which is computed on independent cores.

    M_L                              2
    d; s                             2; 1/2
    # cells on level 0 (∝ 1/h_0)     2^4
    intermediate save                4 times, using 2 tasks
    late save                        using 2 tasks
    W_5 (time for one sample)        16 × 94 s = 1504 s
    simulations with levels          L = 5
    one core per task                b = 3
    Weibull parameters λ; k          variable; 0.5

Table 1: The parameters used in Fig. 3.

In Fig. 3 we compare the measurements from the FT-MLMC implementation with the derived FT-MLMC error bound. The same FT-MLMC problem was simulated with different mean times between failures (MTBF), by varying the Weibull scale parameter λ of the failure model. In order to save computing time, rather large failure rates (small MTBFs) were used. We determine the entry failure rate, the failure rate from which on fault tolerance is useful. This point is reached when at least 10% of all FT-MLMC runs encounter a failure, and hence 10% of all non fault tolerant implementations (using standard MPI-3.0) would terminate without a result. This is measured by the “at least a failure” probability, shown in Fig. 3(a) for the theory derived in this paper and in Fig. 3(c) for the measured FT-MLMC implementation [18].


Figure 3: Behavior of the 2D FT-MLMC-FVM with different parameters under Weibull distributed failures. (a) Failure probabilities of the 2D FT-MLMC bound (“at least a failure” vs. MTBF). (b) 2D FT-MLMC error bound (bound for the L¹(L¹)-error, “intermediate save” and “late save”). (c) Measured failure probabilities of the 2D FT-MLMC implementation [18] (“at least a failure” and “process failure”). (d) Measured relative errors of the 2D FT-MLMC implementation [18] (relative L¹(L¹)-error, “intermediate save”, “late save”, “fault free”).

In both subfigures the entry failure rate is around MTBF = 10³ s for the given problem. Fig. 3(c) additionally presents the measured “process failure” probability, which shows that on average 20% of all started processes fail at MTBF ≈ 6 s. However, at the entry failure rate (MTBF = 10³ s) the process failure probability is still small.

In Fig. 3(b) and 3(d) the FT-MLMC error bound and the measured relative FT-MLMC error, respectively, are shown for the “intermediate save” and the “late save” strategies. Additionally, we present the measured reference error of the fault free ALSVID-UQ [1] in Fig. 3(d). We measure the critical failure rate, the failure rate from which on the FT-MLMC method no longer performs well. In the FT-MLMC implementation, shown in Fig. 3(d), the critical failure rate is at MTBF ≈ 40 s. In the FT-MLMC error bound, shown in Fig. 3(b), the critical failure rate is similar. There the “intermediate save” strategy provides a small benefit over the “late save” strategy, whereas in the measurements this benefit is negligible; at MTBF ≈ 40 s the “intermediate save” strategy even performs slightly better than the “late save” strategy. This artifact appears because we average over only 100 FT-MLMC runs.


It is observed both in the FT-MLMC implementation and in the FT-MLMC error bound that the error increases only slightly between the entry failure rate and the critical failure rate. In our opinion this is a fundamental insight, as one could also expect a “gradual” loss of performance, rather than a range of failure intensities where only a minor degradation is observed. Only after the critical failure rate is reached does the error increase strongly, and the numerical quality of the simulation results is lost.

The rate of error explosion beyond the critical failure rate does not match well when comparing the error bound and the measured relative error. One reason is that the constants C₁, C₂, and C₃ in the error bound are unknown or can only be estimated very conservatively. In reality many additional constants are implicitly contained in the three constants C₁, C₂, and C₃, as in E′[min(1, M̂₀^{−1/2})] with certain implicitly problem dependent constants C′ and C″. It would be more precise to write instead, with 1_A denoting the indicator function of the set A,

\[ E'\big[ C'\, \mathbf{1}_{\hat M_0 = 0} + C''\, \hat M_0^{-1/2}\, \mathbf{1}_{\hat M_0 > 0} \big]. \]

This can be seen in the derivation of the bound with lost levels, eq. (15), where the part min(1, M̂₀^{−1/2}) comes from the two inequalities (2) and (14), which imply two unknown constants C′ ≥ 1 and C″ ≥ 1.

The constant C′ describes the error in case a level is lost, while C″ describes the error when samples are computed. Levels are mostly lost for small MTBFs; hence choosing a large constant C′ would mainly influence the error bound for small MTBFs, leading to a steeper rate of error explosion beyond the critical failure rate for the error bound. Several other implied parameters could be used to tune the error bound towards the measured error. However, even without tuning the parameters we obtain an error bound which shows good qualitative agreement with the measured error.

Also keep in mind that we are comparing two different things: a possibly pessimistic asymptotic error bound with an actual discretization resp. sampling error.

5.4. Analysis of FT-MLMC error bounds

Four error bounds with different parameters are analyzed, see Table 2. The error bounds for the two dimensional problems are shown in Fig. 4 and the three dimensional ones in Fig. 5. Two plots are shown for each case: in the left one the problem size is scaled while the failure rate (Weibull parameter λ) remains fixed, and vice versa in the right one. One particular run is exactly the same in both plots; its results are marked with a yellow rhombus in either plot.

                                   Fig. 4(a,b)     Fig. 4(c,d)     Fig. 5(a,b)     Fig. 5(c,d)
    M_L                            8               8               8               8
    d; s                           2; 1/2          2; 1            3; 1/2          3; 1
    # cells on level 0 (∝ 1/h_0)   32 × 32         16 × 16         8 × 8 × 8       4 × 4 × 4
    W_0 (time for one sample)      0.05 s          0.04 s          0.01 s          0.007 s
    simulations with levels        L = 5, ..., 13  L = 5, ..., 13  L = 4, ..., 11  L = 4, ..., 11
    one core per task              b = 4           b = 3           b = 4           b = 3
    Weibull parameters λ; k        7.6 · 10⁵; 0.7  7.6 · 10⁵; 0.7  7.6 · 10⁵; 0.7  7.6 · 10⁵; 0.7

Table 2: The parameters used in Figs. 4 and 5.

We determine the entry failure rate and the critical failure rate in all the measurements of Figs. 4(b,d) and 5(b,d). In all these cases the entry failure rate is at around λ = 5 · 10⁷ and the critical failure rate at around λ = 10⁵.


Figure 4: Behavior of the 2D FT-MLMC with different parameters under Weibull distributed failures; parameters from Table 2. Panels (a,c): error bound vs. cumulated runtime (“FT-MLMC convergence analysis”); panels (b,d): error bound vs. MTBF per node (“FT-MLMC analysis”); each panel additionally shows the corresponding probabilities. Curves: “immediate save”, “late save / 4 cores”, “reference”, “at least a failure”, “lose finest sample”.

Again we observe that between the entry failure rate and the critical failure rate the error bound increases only slightly. Hence, this desirable property is found also in large runs, and in higher-dimensional runs. Only beyond the critical failure rate does the error explode.

Similar to the entry failure rate (the failure rate from which on fault tolerance is useful), there is an entry problem size, the problem size from which on fault tolerance is useful when operating on a given computer. Again we define this point as reached when 10% of all FT-MLMC runs encounter a failure. The entry problem size is at a total execution time of around 5 · 10⁶ s for Figs. 4(a) and 5(a,c); in Fig. 4(c) the entry problem size is around 5 · 10⁵ s. Analogous to the critical failure rate, we use the term critical problem size in Figs. 4(a,c) and 5(a,c). This problem size is reached in our measurements at around 10⁹ s. Also in these figures we observe that the convergence behavior of FT-MLMC is only negligibly affected by failures between the entry problem size and the critical problem size. After the critical problem size, however, the FT-MLMC method experiences many


Figure 5: Behavior of the 3D FT-MLMC with different parameters under Weibull distributed failures; parameters from Table 2. Panels (a,c): error bound vs. cumulated runtime; panels (b,d): error bound vs. MTBF per node. Curves as in Fig. 4.

sample losses, such that the performance drops significantly. This is apparent from the increasing disparity of the two FT-MLMC graphs “late save/4 tasks” and “immediate save” with the fault-free MLMC “reference” graph. Hence, under the failure distribution of Section 4, the FT-MLMC method allows solving problems which use more than two orders of magnitude more runtime than the largest problem solvable with a standard MLMC implementation.

The “late save/4 tasks” error not only stops decreasing but grows again after the critical problem size, Figs. 4(a,c) and 5(a). At first this may be surprising. As the number of levels increases, the runtime of all tasks increases. This allows computing more and more samples of a fixed level in the same task; hence the failure probability of such a task increases. In the “late save/4 tasks” mode failures lead to the loss of all samples of a task. Therefore the probability of level losses increases not only for the high levels, but even for medium and low levels.

In Figs. 4(a,c) and 5(a,c) it is further observed that the performance of the FT-MLMC method drops significantly as soon as the probability of losing all samples on the finest level approaches


values close to one. This is to be expected, since then much of the computational work is spent in the attempt to compute samples of which, in the end, only very few to none survive. This leads to a high error-to-work ratio, or, in other words, to a poor FT-MLMC performance.

5.5. Practical applicability

To employ the theoretical error bounds derived in this paper, it is essential to have an accurate, parametric statistical node failure model of the compute platform used. Once such a model has been verified and calibrated to the particular hardware of interest, the computational work could be redistributed among nodes and cores in order to minimize the overall impact of node failures on the accuracy of the computation. Finally, based on a statistical node failure model of the type considered here, an FT-MLMC error bound can be computed, which allows one to infer precise statements on the contributions to the overall error stemming from failures, and on whether the problem of interest can be solved using FT-MLMC or plain MLMC on the given system.

The weak point of the presently proposed approach is that accurate, detailed, and validated statistical failure models, and the data to calibrate them, are currently not available. Hopefully the situation will improve over time, as the awareness of faults grows. For the time being we opted to use failure models that allow determining rough estimates of the entry failure rate, critical failure rate, entry problem size, and critical problem size.

We expect that in practice the entry problem size and the entry failure rate are not computed from a known failure distribution; rather, we expect that users will be bothered when failures terminate the application prematurely in at least 10% of the runs, which coincides with the mentioned quantities. Only then does fault tolerance appear to become an issue for the user (this figure might, however, be problem and application dependent). It is, however, not practical for the user to decide when the performance of FT-MLMC deteriorates, i.e., the critical problem size and the critical failure rate cannot be determined without some knowledge of the failure distribution. In our measurements we observed that there is a range between the entry failure rate and the critical failure rate, and a range between the entry problem size and the critical problem size, where the performance of FT-MLMC appears to be unaffected by faults. In our problems, this observed fault tolerance, obtained while obviating checkpointing, extends the range of applicability by two orders of magnitude: on a given computer the FT-MLMC method is applicable to problems that take around 100 times longer than with the “plain” fault-free MLMC method, or, equivalently, a given problem can be solved on a computer with an around 100 times smaller MTBF thanks to fault tolerance.

We are aware that this estimated range may be specific to our assumptions, and may change when other failure distributions and problems are considered. We are, however, confident that for a number of applications the potential for improvement of Monte Carlo methods is large.

Our failure model does not take into account that, in practice, random node failures may be correlated among different nodes. This has adverse effects on the error bound. For large samples running on multiple nodes, correlated node failures could be beneficial in the following sense: as soon as one node of a sample fails, the computation of this sample is stopped anyway, hence it does not matter if additional nodes fail simultaneously. For most problems which occur in engineering practice, and in particular for the Finite Volume simulations considered in the present paper, MC samples on low levels easily fit on a single node. In the “late save/4 tasks” strategy at least 4 cores on different nodes are used to compute samples of such a level. Node failure correlation might increase the risk of losing all 4 involved nodes simultaneously, which has a clearly negative effect on the error bound. But we do not see any parameter which would ruin the convenient property that the error increases only slightly between the entry failure rate and the critical failure rate, and between the entry problem size and the critical problem size, and that there is a considerably large range where FT-MLMC is applicable.
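To see why correlation can hurt here, consider a toy common-shock model (purely hypothetical, not part of our analysis): with a small probability ρ a shared event, say a rack failure, takes down all 4 nodes at once, while otherwise the nodes fail independently. Even a tiny ρ then dominates the probability of losing all 4 nodes.

```python
import random

def p_all_fail(p, rho, n_nodes=4, trials=200_000):
    """Monte Carlo estimate of P(all n_nodes fail) under a toy
    common-shock model: with probability rho a shared event kills
    all nodes, otherwise nodes fail independently with probability p.
    rho = 0 recovers the uncorrelated model assumed in the paper."""
    hits = 0
    for _ in range(trials):
        if random.random() < rho:
            hits += 1
        elif all(random.random() < p for _ in range(n_nodes)):
            hits += 1
    return hits / trials

print(p_all_fail(0.01, rho=0.0))    # independent: of order 1e-8
print(p_all_fail(0.01, rho=0.001))  # correlated: dominated by rho
```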

Measuring the number of failures in one single FT-MLMC run on a computer with unknown failure distribution does not provide any statistical information: it is not possible to estimate the range of applicability of the FT-MLMC method from a single run. The same holds for attempts to estimate the constants in the a priori error estimate from data obtained in one single run.

6. Conclusions and future work

We introduced and analyzed a checkpoint-free and fault-tolerant Multi Level Monte Carlo strategy, termed FT-MLMC. The approach is based on disregarding all samples affected by a node failure at run-time. It is assumed that MPI is extended such that processes unaffected by a failure can continue working and communicating. The MLMC estimate is computed with the remaining samples. By incorporating general statistical models of sample losses into the mathematical failure model and into the error bound of the FT-MLMC method, a new MLMC error bound conditional on prescribed node-failure statistics is derived.
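A minimal sketch of the resulting estimator (assuming the usual telescoping MLMC form; the variable names are illustrative, and the paper's analysis additionally quantifies the bias incurred when an entire level is lost):

```python
import numpy as np

def ft_mlmc_estimate(surviving):
    """FT-MLMC mean estimate from the samples that survived node
    failures: surviving[l] holds the completed level differences
    Y_l = u_l - u_{l-1} (with Y_0 = u_0). Lost samples are simply
    discarded; there is no checkpoint/restart or recomputation."""
    estimate = 0.0
    for y_l in surviving:
        if len(y_l) == 0:
            continue              # whole level lost: its term is missing
        estimate += np.mean(y_l)  # sample average over the survivors
    return estimate
```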

The principal conclusion of the present analysis is that, up to a certain rate of node failures per sample, the FT-MLMC error bound and, hence, the performance of FT-MLMC, is provably of the same type as that of the standard, fault-free MLMC. We therefore conclude that in this range of failure rates there is no need in MC and MLMC PDE simulations for additional fault tolerance strategies such as, e.g., checkpoint/restart or the re-computation of lost samples.

Due to the increasing likelihood of node failures at run-time on emerging massively parallel hardware, the obviation of checkpointing may afford substantial increases in parallel efficiency and scalability. In our analysis, we paid particular attention to the case of Weibull distributed failures. We considered, exemplarily, the case that the samples were computed with the Finite Volume Method (FVM). We emphasize, however, that neither the Weibull distribution nor the FVM are essential for the principal conclusions about the performance of the FT-MLMC method.

We compared the derived FT-MLMC error bound with measured errors of a fault tolerant implementation [18]. We showed by a number of examples that, on a given hardware platform, the FT-MLMC method can solve considerably larger and more time-consuming problems than the standard fault-free MLMC method.

The proposed approach of simply discarding samples affected by a node failure at run-time has the additional benefit that failures do not extend the run-time, as, for instance, checkpoint/restart or the re-computation of lost samples do. This leads to a bounded run-time, even if failures occur.

We conclude by pointing out the need for further research to develop better and more realistic stochastic and/or deterministic failure models for massively parallel (in particular, exascale) computing platforms. It would be very useful to know the failure distribution of the computer one is using. In the best case a failure model would be provided by the hardware vendor, ideally with failure intensity parameters. Another line of research is the development of effective self-calibrating statistical failure models which, when run on a given computing platform, would “probe” the hardware for failures and accumulate estimates of the failure parameters at runtime.
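Such a self-calibrating model could be as simple as the following toy sketch (our own illustration, assuming exponentially distributed failure times; fitting a Weibull model would require an iterative maximum-likelihood solver):

```python
class OnlineFailureRate:
    """Toy self-calibrating failure model: accumulates observed node
    up-times at runtime and maintains the maximum-likelihood failure
    rate of an exponential model. Up-times of nodes that did not fail
    (censored observations) still enter the denominator."""

    def __init__(self):
        self.total_uptime = 0.0
        self.n_failures = 0

    def observe(self, uptime, failed):
        self.total_uptime += uptime
        self.n_failures += int(failed)

    def rate(self):
        """MLE of the exponential failure rate: failures per unit time."""
        return self.n_failures / self.total_uptime if self.total_uptime else 0.0
```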

21

The presently proposed models and their analysis completely disregard silent errors. In future work we may cover some silent errors by detecting them by statistical means [10].

We further emphasize that the presently proposed failure models are uncorrelated across the hardware platform: failures occur independently of the relative position of the node within the compute platform. There is, however, evidence that the ageing of hardware, as well as the time of day and the day of the week, can affect fault arrival intensity parameters, and it is plausible that spatial correlation (e.g. due to overheating) can play a role as well.

Further work also includes investigations of the resilience of parallel random number generators, including the preservation of statistical quality under partial loss of streams. Once realistic and, ideally, self-calibrating fault models are in place, adaptive processor load distribution at runtime to minimize the impact of faults on the numerical simulation quality would be a major milestone.
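One way to decouple stream quality from failures is to derive one independent stream per (level, sample) pair, so that the survivors' statistics are unaffected by whichever streams were lost with failed nodes; below is a sketch using NumPy seed sequences (counter-based generators such as those of [21] achieve the same decoupling).

```python
import numpy as np

def sample_stream(level, sample_id, master_seed=2012):
    """One independent random stream per (level, sample) pair.
    Surviving samples keep their statistical quality regardless of
    which other streams were lost together with failed nodes."""
    seq = np.random.SeedSequence(entropy=master_seed,
                                 spawn_key=(level, sample_id))
    return np.random.default_rng(seq)

rng = sample_stream(level=3, sample_id=17)
print(rng.standard_normal())
```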

Acknowledgment

The authors thank Jonas Sukys for his support with ALSVID-UQ, and the anonymous reviewers for their valuable comments and suggestions that improved the quality of the paper.

References

[1] ALSVID-UQ. Available from http://mlmc.origo.ethz.ch/, Aug. 2012.
[2] A. Barth, Ch. Schwab, and N. Zollinger. Multi-level Monte Carlo finite element method for elliptic PDEs with stochastic coefficients. Numer. Math., 119(1):123–161, 2011.
[3] W. Bland, A. Bouteiller, T. Herault, J. Hursey, G. Bosilca, and J. Dongarra. An evaluation of user-level failure mitigation support in MPI. In J. L. Traff, S. Benkner, and J. Dongarra, editors, Recent Advances in the Message Passing Interface, pages 193–203. Springer, 2012. (Lecture Notes in Computer Science, 7490).
[4] F. Cappello. Fault tolerance in petascale/exascale systems: Current knowledge, challenges and research opportunities. Int. J. High Perform. Comput. Appl., 23(3):212–226, 2009.
[5] F. Cappello, A. Geist, B. Gropp, L. Kale, B. Kramer, and M. Snir. Toward exascale resilience. Int. J. High Perform. Comput. Appl., 23(4):374–388, 2009.
[6] G. S. Fishman. Monte Carlo: Concepts, Algorithms, and Applications. Springer, New York, 5th edition, 2003.
[7] M. B. Giles. Multilevel Monte Carlo path simulation. Operations Research, 56(3):607–617, 2008.
[8] M. Hoemmen and M. Heroux. Fault-tolerant iterative methods via selective reliability. Technical Report SAND2011-3915 C, Sandia National Laboratories, Albuquerque NM, 2011.
[9] P. L'Ecuyer and F. Panneton. Fast random number generators based on linear recurrences modulo 2: overview and comparison. In M. E. Kuhl, N. M. Steiger, F. B. Armstrong, and J. A. Joines, editors, Proceedings of the 2005 Winter Simulation Conference, pages 110–119, 2005.
[10] Y. Li and M. Mascagni. Analysis of large-scale grid-based Monte Carlo applications. Int. J. High Perform. Comput. Appl., 17:369–382, 2003.
[11] M. Mascagni, D. Ceperley, and A. Srinivasan. SPRNG: A scalable library for pseudorandom number generation. ACM Trans. Math. Softw., 26:436–461, 2000.
[12] S. Mishra and Ch. Schwab. Sparse tensor multi-level Monte Carlo finite volume methods for hyperbolic conservation laws with random initial data. Math. Comp., 81(280):1979–2018, 2012.
[13] S. Mishra, Ch. Schwab, and J. Sukys. Multi-level Monte Carlo finite volume methods for nonlinear systems of conservation laws in multi-dimensions. J. Comput. Phys., 231(8):3365–3388, 2012.
[14] S. Mishra, Ch. Schwab, and J. Sukys. Multi-level Monte Carlo finite volume methods for shallow water equations with uncertain topography in multi-dimensions. SIAM J. Sci. Comput., 34(6):B761–B784, 2012.
[15] MPI Forum. MPI: A Message-Passing Interface Standard, Version 3.0. Available at http://www.mpi-forum.org (Oct. 2013).
[16] D. Nurmi, J. Brevik, and R. Wolski. Modeling machine availability in enterprise and wide-area distributed computing environments. In J. Cunha and P. Medeiros, editors, Euro-Par 2005. Parallel Processing, volume 3648 of Lecture Notes in Computer Science, pages 432–441. Springer, Berlin, 2005. doi:10.1007/11549468_50.
[17] A. Papoulis and S. U. Pillai. Probability, Random Variables, and Stochastic Processes. McGraw-Hill Electrical and Electronic Engineering Series. McGraw-Hill, 2002.
[18] S. Pauli, M. Kohler, and P. Arbenz. A fault tolerant implementation of Multi-Level Monte Carlo methods. In Proceedings of the International Conference on Parallel Computing (ParCo 2013), Munich, Germany, 2013.
[19] W. P. Petersen and P. Arbenz. Introduction to Parallel Computing. Oxford University Press, Oxford, 2004.
[20] H. Rinne. The Weibull Distribution: A Handbook. CRC Press, 2009.
[21] J. K. Salmon, M. A. Moraes, R. O. Dror, and D. E. Shaw. Parallel random numbers: as easy as 1, 2, 3. In Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC'11, pages 16:1–16:12, New York, NY, USA, 2011. ACM.
[22] B. Schroeder and G. A. Gibson. A large-scale study of failures in high-performance computing systems. In Proceedings of the International Conference on Dependable Systems and Networks, pages 249–258, Washington, DC, USA, 2006. IEEE Computer Society.
[23] Ch. Schwab. Numerical analysis of stochastic partial differential equations. Lecture notes, 2010.
[24] Ch. Schwab and C. J. Gittelson. Sparse tensor discretizations of high-dimensional parametric and stochastic PDEs. Acta Numer., 20:291–467, 2011.
[25] J. Sukys, S. Mishra, and Ch. Schwab. Static load balancing for multi-level Monte Carlo finite volume solvers. In R. Wyrzykowski et al., editors, Parallel Processing and Applied Mathematics (PPAM 2011), pages 245–254, Berlin, 2012. Springer. (Lecture Notes in Computer Science, 7203).
[26] Y. Zhao. Parametric inference from window censored renewal process data. PhD thesis, The Ohio State University, Columbus, Ohio, 2006.
[27] The raw data and more information are available at http://www.pdl.cmu.edu/FailureData/ and http://www.lanl.gov/projects/computerscience/data/.
