
Future Generation Computer Systems 22 (2006) 293–302, doi:10.1016/j.future.2004.11.015

Reliability challenges in large systems

Daniel A. Reed a,∗, Charng-da Lu b, Celso L. Mendes b

a Renaissance Computing Institute, University of North Carolina, Chapel Hill, NC 27599, USA
b Department of Computer Science, University of Illinois, Urbana, IL 61801, USA

∗ Corresponding author. E-mail addresses: dan [email protected] (D.A. Reed); [email protected] (C. Lu); [email protected] (C.L. Mendes).

Available online 1 January 2005

Abstract

Clusters built from commodity PCs dominate high-performance computing today, with systems containing thousands of processors now being deployed. As node counts for multi-teraflop systems grow to tens of thousands, with proposed petaflop systems likely to contain hundreds of thousands of nodes, the assumption of fully reliable hardware and software becomes much less credible. In this paper, after presenting examples and experimental data that quantify the reliability of current systems, we describe possible approaches for effective system use. In particular, we present techniques for detecting imminent failures in the environment and for allowing an application to run successfully despite such failures. We also show how intelligent and adaptive software can lead to failure resilience and efficient system usage.
© 2004 Elsevier B.V. All rights reserved.

Keywords: System reliability; Fault-tolerance; Adaptive software

1. Introduction

Clusters built from commodity PCs dominate high-performance computing today, with systems containing thousands of processors now being deployed. As an example, the National Center for Supercomputing Applications (NCSA) has deployed a 17.9 teraflop Linux cluster containing 2938 Intel Xeon processors. Even larger clustered systems are being designed and deployed – the 40 teraflop Cray/SNL Red Storm cluster and the 180 teraflop IBM/LLNL Blue Gene/L system will contain 10,368 and 65,536 nodes, respectively. As node counts for multi-teraflop systems grow to tens of thousands, with proposed petaflop systems likely to contain even larger numbers of nodes, the assumption of fully reliable hardware and software becomes much less credible.


Although the mean time before failure (MTBF) for the individual components (i.e., processors, disks, memories, power supplies, fans and networks) is high, the large overall component count means the system itself may fail frequently. For example, a system containing 10,000 nodes, each with a mean time to failure of 10^6 h, has a system mean time to failure of only 100 h, under the generous assumption of failure independence. Hence, the mean time before system failure for today's 10–20 teraflop commodity systems can be only 10–40 h, due to a combination of hardware component and software failures.



Indeed, operation of the large IBM/LLNL ASCI White system has revealed that its MTBF is only slightly more than 40 h, despite continuing improvements.

For scientific applications, MPI is the most popular parallel programming model. However, the MPI standard does not specify mechanisms or interfaces for fault-tolerance – normally, all of an MPI application's tasks are killed when any of the underlying nodes fails or becomes inaccessible. Given the standard domain decompositions and data distributions used in message-based parallel programs, there are few alternatives to this approach without underlying support for recovery. Intuitively, because the program's data set is partitioned across the nodes, any node loss implies irrecoverable data loss.

Several research efforts have attempted to address aspects of this domain decomposition limitation. Historically, large-scale scientific applications have used application-mediated checkpoint and restart techniques to deal with failures [6]. However, these schemes can be problematic in an environment where the interval between checkpoints is comparable to the MTBF.

In this paper, we describe possible approaches for the effective usage of multi-teraflop and petaflop systems. In particular, we present techniques that allow an MPI application to continue execution despite component failures. One of these techniques is diskless checkpointing, which enables more frequent checkpoints by redundantly saving checkpoint data in memory. Another technique uses inexpensive mechanisms to collect global status information, based on statistical sampling techniques. We also present a fault-injection methodology that enables systematic test and concrete evaluation of our proposed ideas. Finally, we describe plans to create an infrastructure for real-time collection of failure predictors.

The remainder of this paper is organized as follows. After presenting, in Section 2, experimental data to quantify the reliability problem, we describe our fault injection techniques in Section 3. We discuss fault-tolerance for MPI applications in Section 4 and system monitoring issues in Section 5. We show how to use such techniques for adaptive control in Section 6, discuss related work in Section 7, and conclude by describing our future work in Section 8.

2. Reliability of large systems

The mean time before failure (MTBF) of commodity hardware components continues to rise. PCs, built using commodity processors, DRAM chips, large capacity disks, power supplies and fans, have an operational lifetime measured in years (i.e., substantially longer than their typical use). For example, the mean time before failure for today's commodity disks exceeds 1 million hours. However, when components are assembled, the aggregate system MTBF is given by $\left(\sum_{k=1}^{N} 1/R_k\right)^{-1}$, where $R_k$ is the MTBF of component $k$. Intuitively, the least reliable component determines the overall system reliability.
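As a worked instance of this formula, using the node count and per-component MTBF quoted in the introduction (identical components are an assumption made only to simplify the arithmetic):

```latex
% System MTBF for N components; for N identical components with MTBF R
% the sum collapses to N/R, so the system MTBF is R/N.
\[
  \mathrm{MTBF}_{\mathrm{sys}}
    = \Bigl(\sum_{k=1}^{N} \tfrac{1}{R_k}\Bigr)^{-1}
    = \frac{R}{N}
  \qquad\text{(identical components)}
\]
\[
  N = 10{,}000,\quad R = 10^{6}\,\mathrm{h}
  \;\Longrightarrow\;
  \mathrm{MTBF}_{\mathrm{sys}} = \frac{10^{6}\,\mathrm{h}}{10^{4}} = 100\,\mathrm{h}
\]
```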

This simple formula also assumes, optimistically, that component failures are independent. In reality, failure modes are strongly coupled. For example, the failure of a PC fan is likely to lead to other failures or system shutdown due to overheating.

Although component reliabilities continue to increase, so do the sizes of systems being built using these components. Even though the individual component reliabilities may be high, the large number of components can make system reliability low. As an example, Fig. 1 shows the expected system reliability for systems of varying sizes when constructed using components with three different reliabilities. In the figure, the individual components have one-hour failure probabilities of 10^-4, 10^-5 and 10^-6 (i.e., MTBFs of 10,000, 100,000 and 1,000,000 h).

Even if one uses components whose one-hour probability of a failure is less than 10^-6, the mean time to system failure is only a few hours if the system is large. By this assessment, the IBM/LLNL Blue Gene/L system, which is expected to have 65,536 nodes, would have an average "up time" of less than 24 h.¹ Moreover, these estimates are optimistic; they do not include coupled failure modes, network switches, or other extrinsic factors.

Because today's large-scale applications typically execute in batch-scheduled, 4–8 h blocks, the probability of successfully completing a scheduled execution before a component fails can be quite low. Proposed petascale systems will require mechanisms for fault-tolerance; otherwise, there is some danger they may be unusable – their frequent failures would prevent any major application from executing to completion on a substantial fraction of the system resources.
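To make this concrete under one common simplifying assumption (exponentially distributed, independent failures; the model is ours, not a measurement), the probability that an allocation of length t finishes before the system fails is:

```latex
% Survival probability of a t-hour allocation under an exponential
% failure model, evaluated at the system MTBF values quoted above.
\[
  P_{\mathrm{success}}(t) = e^{-t/\mathrm{MTBF}_{\mathrm{sys}}}
\]
\[
  \mathrm{MTBF}_{\mathrm{sys}} = 20\,\mathrm{h},\; t = 8\,\mathrm{h}
    \;\Rightarrow\; P \approx e^{-0.4} \approx 0.67;
  \qquad
  \mathrm{MTBF}_{\mathrm{sys}} = 5\,\mathrm{h},\; t = 8\,\mathrm{h}
    \;\Rightarrow\; P \approx e^{-1.6} \approx 0.20
\]
```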

¹ IBM has focused on careful engineering to reduce component and system failures below these levels in Blue Gene/L.


Fig. 1. System MTBF scaling for three component reliability levels.


As noted earlier, the data in Fig. 1 are drawn from a simplistic theoretical model. However, experimental data from existing systems confirm that hardware and software failures regularly disrupt operation of large systems. As an example, Fig. 2 depicts the percentage of accessible nodes on the NCSA's Origin Array, composed of 12 shared-memory SGI Origin 2000 systems. This data reflects outages that occurred during a 2-year period between April 2000 and March 2002.

For this system, the NCSA staff defined downtime as either scheduled (e.g., regular maintenance) or unscheduled (e.g., software or hardware halts). The overall availability is the ratio of total uptime to total time, whereas the scheduled availability is the ratio of uptime to scheduled uptime (i.e., total time minus scheduled downtime). For the NCSA system, hardware failures accounted for only 13% of the overall failures, whereas software failures represented nearly 59% of the total.

As this data shows, failures are already a noticeable cause of service disruption in today's systems. Given the larger number of components expected in petascale systems, such failures could make those systems unusable unless current approaches are modified. Hence, fault-tolerance mechanisms become a critical component for the success of large high-performance systems.

Fig. 2. NCSA Origin Array availability (system now retired).

3. Fault injection and assessment

Because the relative frequency of hardware and software failures for individual components is so low, it is only on larger systems that failure frequencies become large enough for statistically valid measurement. However, even in such cases, the errors are neither systematic nor reproducible. This makes obtaining statistically valid data arduous and expensive. Instead, we need an infrastructure that can be used to generate synthetic faults and assess their effects.

Fault injection is the most widely used approach to simulate faults; it can be either hardware-based or software-based [11], and each has associated advantages and disadvantages. Hardware fault injection techniques range from subjecting chips to heavy ion radiation to simulate the effects of alpha particles, to inserting a socket between the target chip and the circuit board to simulate stuck-at, open, or more complex logic faults. Although effective, the cost of these techniques is high relative to the components being tested.


Table 1
Results from fault injection experiments (Cactus code)

Injection                  Crash  Hang  Wrong output  Correct output  Total no. of injections
Text memory                   49    12             6             933                     1000
Data memory                   12    31             0             959                     1002
Heap memory                    4    36            10             950                     1000
Stack memory                  82    43             0             877                     1002
Regular registers            139   180             0             189                      508
Floating point registers       0    10            10             480                      500
MPI_Allgather                 15     0             5              70                       90
MPI_Gather                    17    16            41               0                       74
MPI_Gatherv                    0     0            23              13                       46
MPI_Isend                      0     0             0              90                       90


In contrast, software-implemented fault injection (SWIFI) does not require expensive equipment and can target specific software components (e.g., operating systems, software libraries or applications). Therefore, we have developed a parameterizable SWIFI infrastructure that can inject two types of component faults into large-scale clusters: computation and communication. For each fault type, we have created models of fault behavior and produced instances of each. Thus, we can inject faults into parallel applications during execution and observe their effect.

Our fault injection infrastructure emulates computation errors inside a processing node via random bit flips in the address space of the application and in the register file. These bit errors can corrupt the data in memory or introduce transient errors in the processor core. In our experiments, we injected faults in three memory regions (the text segment, the stack and the heap) and in the register file.

To maximize portability and usability across systems, we intentionally restricted fault injection to the application rather than the system software. Although memory and processor faults can be manifest in both, perturbing system address spaces is not feasible on large-scale, production systems.
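The following sketch illustrates the basic mechanism (a random single-bit flip inside a region of the application's own address space). It is an illustration of the idea, not the authors' injector; in practice the region boundaries would be taken from the link map or /proc/self/maps, and flipping bits in the text segment would additionally require changing page protections.

```c
/* Minimal sketch of an in-process computation-fault injector:
 * flip one randomly chosen bit inside a caller-supplied memory
 * region of the running application (illustrative only). */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static void flip_random_bit(void *region, size_t bytes)
{
    uint8_t *base = (uint8_t *) region;
    size_t byte = rand() % bytes;            /* which byte to corrupt */
    int bit = rand() % 8;                    /* which bit within it   */
    base[byte] ^= (uint8_t) (1u << bit);     /* single-bit upset      */
}

int main(void)
{
    /* Stand-in for application state (e.g., part of the heap or stack). */
    static double field[1024];
    srand(12345);                            /* reproducible fault locations */
    flip_random_bit(field, sizeof field);
    printf("injected one bit flip into a %zu-byte region\n", sizeof field);
    return 0;
}
```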

Communication errors, both hardware and software, can also occur at many levels, ranging from the link transport, through switches, NICs and communication libraries, to application code. Many of the hardware errors can be simulated by appropriate perturbations of application or library code. Hence, our infrastructure injects faults using the standard MPI library. By intercepting MPI calls via the MPI profiling interface, we can arbitrarily manipulate all MPI function calls. The infrastructure simulates four types of communication faults: redundant packets, packet loss, payload perturbation and software errors.
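A minimal sketch of this interception approach is shown below: the profiling interface lets a wrapper library define MPI_Send and forward to PMPI_Send after (occasionally) perturbing the payload. The wrapper here covers a single call, and the injection rate and corruption model are placeholders rather than the parameters used in our infrastructure; the wrapper would be compiled into a library linked ahead of the MPI library.

```c
/* Sketch of communication-fault injection through the MPI profiling
 * interface: occasionally flip one bit of an outgoing payload before
 * handing the message to the real send (PMPI_Send). */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define INJECTION_RATE 0.001   /* placeholder probability per message */

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    int size;
    MPI_Type_size(datatype, &size);

    if (count > 0 && (double) rand() / RAND_MAX < INJECTION_RATE) {
        /* Copy the payload, corrupt one bit, and send the corrupted copy. */
        size_t bytes = (size_t) count * (size_t) size;
        unsigned char *copy = malloc(bytes);
        if (copy == NULL)
            return PMPI_Send(buf, count, datatype, dest, tag, comm);
        memcpy(copy, buf, bytes);
        copy[rand() % bytes] ^= (unsigned char) (1u << (rand() % 8));
        int rc = PMPI_Send(copy, count, datatype, dest, tag, comm);
        free(copy);
        return rc;
    }
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}
```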

As an example of this fault injection methodology, we injected faults into the execution of a computational astrophysics code based on the Cactus package [9], which simulates the three-dimensional scalar field produced by two orbiting stellar sources. We injected both computation and communication faults, and we classified their effects on observed program behavior. All tests were conducted on an x86 Linux cluster. Execution outcomes were classified as follows: an application crash, an application hang (i.e., an apparent infinite loop), a complete execution producing incorrect data, or a complete execution producing the correct output.

Table 1 presents a portion of the observed results, corresponding to the effects of random bit flips in the computation and communication components of the Cactus code. Although the majority of the executions completed correctly, a non-negligible number of the injected faults resulted in some incorrect behavior. In particular, errors in the registers produced numerous crashes and hangs. Similarly, bit errors in the MPI_Gather collective communication routine always caused some type of application error; this communication primitive carries data critical to program behavior. A more detailed experimental analysis can be found in [15].


4. Fault-tolerance in MPI codes

Historically, large-scale scientific applications have used application-mediated checkpoint and restart techniques to deal with failures. However, as we have seen, the MTBF for a large system can be comparable to the time needed to restart the application. Consequently, the next failure can occur before the application has recovered from the last failure.

In general, application fault-tolerance techniques are of two types. In the first, the application is modified, and fault-tolerance is explicitly incorporated into the application's algorithms. Although flexible, this application-specific approach must be tailored to each application code. Alternatively, fault-tolerance can be introduced in a semi-transparent way (e.g., via an underlying library).

For scientific applications, MPI is the most popular parallel programming model. However, the MPI standard does not specify either mechanisms or interfaces for fault-tolerance. Normally, all of an MPI application's tasks are killed when any of the underlying nodes fails or becomes inaccessible. Given the standard domain decompositions and data distributions used in message-based parallel programs, there are few alternatives to this approach without underlying support for recovery.

For the foreseeable future, we believe library-based techniques, which combine message fault-tolerance with application-mediated checkpointing, are likely to be the most useful for large-scale applications and systems. Hence, we are extending LA-MPI by incorporating diskless checkpointing [19] as a complement to its support for transient and catastrophic network failures. In this scheme, fault-tolerance is provided by three mechanisms: (a) LA-MPI's fault-tolerant communication, (b) normal, application-specific disk checkpoints and (c) library-managed, diskless intermediate checkpoints.

Disk checkpoints are under application control, and they write application-specific data for recovery and restart, based on application behavior and batch-scheduled computing allocations. Diskless checkpoints are intended to ensure that applications successfully reach their disk checkpoints (i.e., that node and other component failures do not prevent use of a typical 4–8 h batch-scheduled allocation with disk checkpoints every 0.5–1 h).

Diskless checkpoints use the same interfaces as disk-based checkpoints, but occur more frequently. For example, if the disk-based checkpoint interval is 1 h, diskless checkpointing might occur every 15 min.

For diskless checkpointing, an application's nodes are partitioned into groups. Each group contains both the nodes executing the application code and some number of spares. Intuitively, enough extra nodes are allocated in each group to offset the probability of one or more node failures within the group. When an application checkpoint is reached, checkpoint data is written to the local memory in each node, and redundancy data (parity or Reed–Solomon codes [18]) is computed and stored in the memories of the spare nodes assigned to each group.
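A minimal sketch of the redundancy step under simple assumptions (one spare per group, XOR parity rather than Reed–Solomon codes, and the checkpoint treated as a contiguous byte buffer): each compute node contributes its checkpoint buffer to a bitwise-XOR reduction whose result lands on the group's spare. Any single lost buffer can then be rebuilt by XOR-ing the parity with the surviving buffers. The function and rank layout below are illustrative, not the LA-MPI implementation.

```c
/* Sketch: XOR-parity redundancy for one checkpoint group.
 * group_comm holds the group's compute ranks plus one spare.
 * Compute ranks contribute their checkpoint bytes; the spare ends up
 * holding the bitwise XOR of all of them (MPI_BXOR is valid for MPI_BYTE). */
#include <mpi.h>
#include <string.h>

#define SPARE_RANK 0   /* placeholder: which rank in the group holds parity */

void save_group_parity(const unsigned char *ckpt, unsigned char *parity,
                       int bytes, MPI_Comm group_comm)
{
    int rank;
    MPI_Comm_rank(group_comm, &rank);

    if (rank == SPARE_RANK) {
        /* Spare: start from zeros (XOR identity) and accumulate in place;
         * it contributes no checkpoint data of its own. */
        memset(parity, 0, (size_t) bytes);
        MPI_Reduce(MPI_IN_PLACE, parity, bytes, MPI_BYTE, MPI_BXOR,
                   SPARE_RANK, group_comm);
    } else {
        /* Compute node: contribute its local checkpoint buffer. */
        MPI_Reduce(ckpt, NULL, bytes, MPI_BYTE, MPI_BXOR,
                   SPARE_RANK, group_comm);
    }
}
```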

If a node fails, the modified MPI layer directs the application to roll back and restart the failing process on a spare node, using data regenerated from checkpoint and redundancy bits. This scheme avoids both the long waits for job resubmission due to failures and the loss of data between disk-based checkpoint intervals. It also capitalizes on high-speed interconnects for rapid data movement, as a complement to lower-speed I/O systems.

To assess the efficacy of diskless checkpointing, we simulated this scheme for a hypothetical high-performance system containing 10,000 computation nodes with 500 spare nodes, under different configurations of spares per group. We assumed an inter-checkpoint interval of 30 min and a checkpoint duration between 0.5 and 2 min, proportional to the size of each group.

Fig. 3. Probability of successful execution with diskless checkpointing.



Fig. 3 shows the probability of successful execution (i.e., no catastrophic failure) as a function of application duration, expressed in checkpoint periods. Obviously, the longer the program runs, the more likely it is to terminate due to component failures. Comparing the simulations with one and two spares per group shows that assigning two spares per group can allow the application to successfully continue execution four times longer (at 90% success probability). These preliminary results indicate that diskless checkpointing can significantly improve the application's tolerance to failures.

5. System monitoring for reliability

Many system failures are caused by hardware component failures or software errors. Others are triggered by external stimuli, either operator errors or environmental conditions (e.g., [7] predicts that the failure rate in a given system doubles with every 10 °C rise in temperature). Rather than waiting for failures to occur and then attempting to recover from them, it is far better to monitor the environment and respond to indicators of impending failure.
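Written as a formula, that rule of thumb (our restatement of the claim attributed to [7]) is:

```latex
% Failure-rate doubling with temperature: a 10 C rise doubles the rate,
% so a node running 20 C hotter fails roughly four times as often.
\[
  \lambda(T) = \lambda(T_0)\, 2^{(T - T_0)/10\,^{\circ}\mathrm{C}}
\]
```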

One attractive alternative is creation of an integrated set of fault indicator monitoring and measurement tools. As Fig. 4 shows, this fault monitoring and measurement toolkit would leverage extant, device-specific fault monitoring software to create a flexible, integrated and scalable suite of capture and characterization tools. As part of a larger research effort, we plan to integrate three sets of failure indicators: disk warnings based on SMART protocols [3], switch and network interface (NIC) status and errors, and node motherboard health, including temperature and fan status. Each set of failure indicators is described below.

Fig. 4. Proposed fault indicator monitoring software.

5.1. Disk monitoring

Disk vendors have incorporated on-board logic that monitors disk status and health. This system, called Self-Monitoring, Analysis and Reporting Technology (SMART), supports both ATA and SCSI disks. SMART typically monitors disk head flying height, the number of remapped sectors, soft retries, spin up time, temperature and transfer rates. Changes in any or all of these internal metrics can signal impending failure. Several open source implementations of SMART for Linux clusters provide access to disk monitoring and status data.

5.2. NIC and network status

Unusually high network packet losses can be a symptom of network congestion, network switch errors or interface card problems. Networks for multi-teraflop and petaflop systems contain many kilometers of cabling, thousands of network interface cards (NICs) and large numbers of network switches. Several common cluster interconnects support fault indicators. For example, the drivers for Myrinet [21] NICs maintain counters, accessible via user-level software, for a variety of network events, including packet transmissions, connection startups and shutdowns, and CRC errors.

5.3. Motherboard status

For Linux systems, there are two popular approaches to assessing system health: the Advanced Configuration and Power Interface (ACPI) [1] and the lm_sensors toolkit [2]. ACPI is an open industry specification, co-developed by Hewlett-Packard, Intel, Microsoft, Phoenix and Toshiba, for power and device configuration management and measurement. In turn, the lm_sensors toolkit targets hardware and temperature sensors on PC motherboards. The toolkit takes its name from one of the first chips of this kind, the LM78 from National Semiconductor, which could monitor seven voltages, had three fan speed monitoring inputs and contained an internal temperature sensor. Most other sensor chips have comparable functionality. ACPI and tools like lm_sensors serve complementary roles, providing data on both power consumption and motherboard health.
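As a concrete illustration of the kind of data these tools expose, the sketch below reads one motherboard temperature through the Linux hwmon sysfs files used by current lm_sensors releases. The specific path is an assumption and varies by system; in practice sensors(1) or libsensors would be used instead of a hand-coded reader.

```c
/* Sketch: read one motherboard temperature from the hwmon sysfs tree
 * (the kernel interface behind lm_sensors). Values are reported in
 * millidegrees Celsius. The path below is illustrative only. */
#include <stdio.h>

int main(void)
{
    const char *path = "/sys/class/hwmon/hwmon0/temp1_input"; /* assumed path */
    FILE *f = fopen(path, "r");
    long millideg;

    if (f == NULL || fscanf(f, "%ld", &millideg) != 1) {
        fprintf(stderr, "no sensor value at %s\n", path);
        return 1;
    }
    fclose(f);
    printf("temperature: %.1f C\n", millideg / 1000.0);
    return 0;
}
```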

As Fig. 4 indicates, we envision capture of failure indicators from those three sources through sensors from our Autopilot toolkit [20]. Besides promoting portability across different systems, those sensors export data in a flexible format for analysis and fault detection.

Because collecting and analyzing data from every single node in a large system can become prohibitively expensive, we plan to adopt low-cost data collection mechanisms. One such mechanism, with which we have obtained promising results, uses statistical sampling techniques [16]. Instead of checking every system component individually, we select a statistically valid subset of components, inspect this subset in detail, and derive estimates for the whole system based on the properties found in the subset. The quality of these estimates depends on the components in the chosen subset. Statistical sampling provides a formal basis for quantifying the resulting accuracy of the estimation, and guides the selection of a subset that meets accuracy specifications.
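As one standard illustration of how such a subset can be sized (simple random sampling of a proportion with finite-population correction; the estimators actually used in [16] differ in detail), the required sample size n for a system of N nodes, target error e, confidence coefficient z and conservative proportion p = 0.5 is:

```latex
% Cochran-style sample size with finite-population correction.
\[
  n_0 = \frac{z^2\, p(1-p)}{e^2}, \qquad
  n   = \frac{n_0}{1 + (n_0 - 1)/N}
\]
\[
  N = 10{,}000,\; z = 1.96,\; e = 0.05,\; p = 0.5
  \;\Longrightarrow\; n_0 \approx 384,\; n \approx 370
\]
```

That is, under these assumptions fewer than 4% of the nodes need to be inspected to bound a system-wide proportion within ±5% at 95% confidence.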

In some cases, reducing the number of nodes for data collection may not be sufficient, as the amount of data produced in each node could still be excessively large. For these situations, we plan to implement data reduction via application signature modeling [14]. For each collected metric, we dynamically construct a signature that approximates the observed values for that metric. These signatures represent a lossy data compression of the original metrics that still captures the salient features of the underlying metric, but are considerably less expensive than regular tracing techniques.
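A minimal sketch of one way such a signature could be built: a simple dead-band compressor that retains a sample only when it drifts from the last retained value. The actual signature construction in [14] is more sophisticated; this only illustrates the lossy-compression idea.

```c
/* Sketch: lossy "signature" of a metric stream. A sample is retained
 * only when it differs from the last retained value by more than a
 * tolerance, so smooth phases collapse to a few points. */
#include <math.h>
#include <stdio.h>

/* Returns the number of samples kept; indices of kept samples go in keep[]. */
static int build_signature(const double *metric, int n,
                           double tolerance, int *keep)
{
    int kept = 0;
    for (int i = 0; i < n; i++) {
        if (kept == 0 || fabs(metric[i] - metric[keep[kept - 1]]) > tolerance)
            keep[kept++] = i;
    }
    return kept;
}

int main(void)
{
    double trace[8] = { 1.0, 1.01, 0.99, 1.02, 5.0, 5.1, 5.05, 1.0 };
    int keep[8];
    int kept = build_signature(trace, 8, 0.5, keep);
    printf("kept %d of 8 samples\n", kept);   /* prints: kept 3 of 8 samples */
    return 0;
}
```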

6. Intelligent adaptive control

Given the sources of faults described earlier, it is highly unlikely that a large system will operate without faults for an extended period. Thus, a long running application will encounter a system environment whose capabilities vary throughout the application's execution. To respond to changing execution conditions due to failures, applications and systems must be nimble and adaptive.

Performance contracts provide one mechanism for adaptation. Intuitively, a performance contract specifies that, given a set of resources with certain characteristics and for particular problem parameters, an application will achieve a specified performance during its execution [23]. To validate a contract, one must monitor both the allocated resources and the application behavior to verify that the contract specifications are met. Hence, the monitoring infrastructure must be capable of monitoring a large number of system components without unduly perturbing behavior or performance.

The notion of a performance contract is based on its analogue in civil law. Each party to a contract has certain obligations, which are described in the contract. Case law, together with the contract, also specifies penalties and possible remediations if either or both parties fail to honor the contract terms. Witnesses and evidence provide mechanisms to assess contract validity. Finally, the law recognizes that small deviations from the contract terms are unlikely to trigger major penalties (i.e., the principle of proportional response).

Performance contracts are similar. They specify that an application will behave in a specified way (i.e., consume resources in a certain manner) given the availability of the requested resources. Hence, a performance contract can fail to be satisfied because either the application did not behave in the expected way or the resources requested were not available. Equally importantly, performance contracts embody the notion of flexible validation. For example, if a contract specifies that an application will deliver 3 gigaflops/processor for 2 h and measurement shows that the application actually achieved 2.97 gigaflops/processor for 118 min, one would normally consider such behavior as satisfying the contract. Intuitively, small perturbations about expected norms, either in metric values or in their expected duration, should be acceptable.


Combining instrumentation and metrics, a contract is said to be violated if any of the contract attributes do not hold during application execution (i.e., the application behaves in unexpected ways or the performance of one or more resources fails to match expectations). Any contract validation mechanism must manage both measurement uncertainty and temporal variability (i.e., determining if the contract is satisfied a sufficiently large fraction of the time to be acceptable). Reported contract violations can trigger several possible actions, including identification of the cause (either application or resource) and possible remediation (e.g., application termination, application or library reconfiguration, or rescheduling on other resources).
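A minimal sketch of such a validation step under assumed contract terms (the names, tolerance and acceptance fraction are placeholders, not the Autopilot policy): each measurement interval is compared against the contracted value with a relative tolerance, and a violation is reported only if too small a fraction of intervals pass.

```c
/* Sketch: tolerant validation of one performance-contract attribute.
 * An interval passes if it lies within `tolerance` (relative) of the
 * contracted value; the contract holds if at least `required` fraction
 * of intervals pass. */
#include <math.h>
#include <stdio.h>

static int contract_satisfied(const double *measured, int intervals,
                              double contracted, double tolerance,
                              double required)
{
    int passed = 0;
    for (int i = 0; i < intervals; i++) {
        if (fabs(measured[i] - contracted) <= tolerance * contracted)
            passed++;
    }
    return (double) passed / intervals >= required;
}

int main(void)
{
    /* Contracted: 3.0 gigaflops/processor; one measurement per interval. */
    double gflops[12] = { 2.97, 3.02, 2.91, 2.96, 3.05, 2.99,
                          2.88, 2.95, 3.01, 2.93, 2.97, 2.90 };
    int ok = contract_satisfied(gflops, 12, 3.0, 0.05, 0.9);
    printf("contract %s\n", ok ? "satisfied" : "violated");
    return 0;
}
```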


Fig. 5. Contract monitor infrastructure based on Autopilot.


Autopilot is a toolkit for real-time application and resource monitoring and control built atop the Globus infrastructure. Using Autopilot, we have developed a contract monitoring infrastructure [12] that includes distributed sensors for performance data acquisition, actuators for implementing performance optimization decisions, and a decision making mechanism for evaluating sensor inputs and controlling actuator outputs. The main modules of this contract monitor are presented in Fig. 5.

In concert with development of the contract monitor, we created tools for visualizing the results of contract validation. As an example, Fig. 6(a) displays the contract output for an execution of the Cactus code on a set of distributed resources.

Fig. 6. Visualization of (a) node contract evaluation and (b) raw metric values.

Each bar in the figure corresponds to the contract evaluation in one node. This node contract results from the combination of one contract for each measured metric. Using other controls in the GUI, the user can request visualization of values for specific metrics, as displayed in Fig. 6(b). Observed values correspond to points in the metric space, and rectangles represent the range of acceptable metric values determined in the contract.

In our first implementation of performance contracts, we captured metrics corresponding to computation and communication activity in the application. Computation performance was characterized by collecting values from hardware performance counters, through the PAPI interface [5]. For communication performance, we used the MPI profiling interface to capture message counts and data volumes. We are currently extending the set of possible metrics by including operating system events, which can be captured by the MAGNET tool [8].
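A minimal sketch of the counter-collection side using PAPI's low-level API (error handling and event selection are reduced to the bare minimum, and the production instrumentation is richer than this; PAPI_FP_OPS availability depends on the processor):

```c
/* Sketch: count floating-point operations for one code region with PAPI.
 * Compile with -lpapi. */
#include <papi.h>
#include <stdio.h>

int main(void)
{
    int eventset = PAPI_NULL;
    long long fp_ops = 0;
    double x = 0.0;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;
    PAPI_create_eventset(&eventset);
    PAPI_add_event(eventset, PAPI_FP_OPS);

    PAPI_start(eventset);
    for (int i = 1; i <= 1000000; i++)       /* instrumented region */
        x += 1.0 / i;
    PAPI_stop(eventset, &fp_ops);

    printf("sum = %f, floating-point ops counted = %lld\n", x, fp_ops);
    return 0;
}
```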



7. Related work

Several research efforts are exploring more fault-tolerant MPI implementations. These include Los Alamos MPI (LA-MPI) [10], which includes end-to-end network fault tolerance; compile-time and runtime approaches [4] that redirect MPI calls through a fault-tolerant library based on compile-time analysis of application source code; and message logging and replay techniques [13]. Each of these approaches potentially solves portions of the large-scale fault-tolerance problem, but not all. Compile time techniques require source code access, and message recording techniques often require storing very large volumes of message data.

Real-time performance monitoring and dynamic adaptive control have been used primarily with networks and distributed systems [22,17]. However, adaptation to failing resources must deal not only with poor performance, but also with conditions that can lead to application failure.

8. Conclusion

As node counts for multi-teraflop systems grow to tens of thousands, with proposed petaflop systems likely to contain hundreds of thousands of nodes, the assumption of fully reliable hardware and software becomes much less credible. Although the mean time before failure (MTBF) for the individual components is high, the large overall component count means the system itself may fail frequently. To be successful in this scenario, applications and systems must employ fault-tolerance capabilities that allow execution to proceed despite the presence of failures.

We have presented mechanisms that can be used to cope with failures in terascale and petascale systems. These mechanisms rest on four approaches: (a) real-time monitoring for failure detection and recovery, (b) fault injection analysis of software resilience, (c) support of diskless checkpointing for improved application tolerance to failures and (d) development of adaptive software subsystems based on the concept of performance contracts.

Looking forward, our preliminary tests with performance contracts can be extended in both spatial and temporal scope. On the spatial side, one could consider distributed processing of contract validation; different instances of a contract monitor would validate contracts for a subset of the nodes participating in a given execution. This distribution is essential to ensure scalability of the contract validation process.

From a temporal view, there are also several issues to be explored. The contract validation example that we presented was evaluated at a single moment. That validation does not consider previous application or system status. A more intelligent validation scheme should incorporate previous validations, and make decisions based on both current and past state.

Acknowledgements

This work was supported in part by Contract No. 74837-001-0349 from the Regents of the University of California (Los Alamos National Laboratory) to William Marsh Rice University, by the Department of Energy under grants DE-FC02-01ER41205 and DE-FC02-01ER25488, by the National Science Foundation under grant EIA-99-75020, and by the NSF Alliance PACI Cooperative Agreement.




References

[1] Advanced Configuration and Power Interface Specification, V2.0a. http://www.acpi.info.

[2] Hardware Monitoring via lm_sensors. http://secure.netroedge.com/~lm78.

[3] B. Allen, Monitoring hard disks with SMART, Linux J. (2004).

[4] G. Bronevetsky, D. Marques, K. Pingali, P. Stodghill, Collective operations in an application-level fault tolerant MPI, in: Proceedings of the ICS'03, International Conference on Supercomputing, San Francisco, 2003, pp. 234–243.

[5] S. Browne, J. Dongarra, N. Garner, K. London, P. Mucci, A scalable cross-platform infrastructure for application performance tuning using hardware counters, in: Proceedings of the SuperComputing 2000 (SC'00), Dallas, TX, 2000.

[6] J. Daly, A model for predicting the optimum checkpoint interval for restart dumps, Lecture Notes in Computer Science 2660 (2003) 3–12.

[7] W. Feng, M. Warren, E. Weigle, The bladed beowulf: a cost-effective alternative to traditional beowulfs, in: Proceedings of the CLUSTER'2002, Chicago, 2002, pp. 245–254.


[8] M. Gardner, W. Feng, M. Broxton, A. Engelhart, G. Hurwitz, MAGNET: a tool for debugging, analysis and reflection in computing systems, in: Proceedings of the CCGrid'2003, Third IEEE/ACM International Symposium on Cluster Computing and the Grid, 2003, pp. 310–317.

[9] T. Goodale, G. Allen, G. Lanfermann, J. Masso, T. Radke, E. Seidel, J. Shalf, The Cactus framework and toolkit: design and applications, in: Proceedings of the VECPAR'2002, Lecture Notes in Computer Science, vol. 2565, Springer, Berlin, 2003, pp. 197–227.

[10] R. Graham, et al., A network-failure-tolerant message-passing system for terascale clusters, in: Proceedings of the ICS'02, International Conference on Supercomputing, New York, 2002, pp. 77–83.

[11] M. Hsueh, T.K. Tsai, R.K. Iyer, Fault injection techniques and tools, IEEE Comput. 30 (1997) 75–82.

[12] K. Kennedy, et al., Toward a framework for preparing and executing adaptive grid programs, in: NGS Workshop, International Parallel and Distributed Processing Symposium, Fort Lauderdale, 2002.

[13] T. LeBlanc, J.M. Mellor-Crummey, Debugging parallel programs with instant replay, IEEE Trans. Comput. 36 (1987) 471–482.

[14] C. Lu, D.A. Reed, Compact application signatures for parallel and distributed scientific codes, in: Proceedings of the SuperComputing 2002 (SC'02), Baltimore, 2002.

[15] C. Lu, D.A. Reed, Assessing fault sensitivity in MPI applications, in: Proceedings of the SuperComputing 2004 (SC'04), Pittsburgh, 2004.

[16] C.L. Mendes, D.A. Reed, Monitoring large systems via statistical sampling, Int. J. High Perform. Comput. Appl. 18 (2004) 267–277.

[17] K. Nahrstedt, H. Chu, S. Narayan, QoS-aware resource management for distributed multimedia applications, J. High-Speed Networking 8 (1998) 227–255, IOS Press.

[18] J.S. Plank, A tutorial on Reed–Solomon coding for fault-tolerance in RAID-like systems, Software – Pract. Exp. 27 (1997) 995–1012.

[19] J.S. Plank, K. Li, M.A. Puening, Diskless checkpointing, IEEE Trans. Parallel Distribut. Syst. 9 (1998) 972–986.

[20] R.L. Ribler, J.S. Vetter, H. Simitci, D.A. Reed, Autopilot: adaptive control of distributed applications, in: Proceedings of the Seventh IEEE Symposium on High-Performance Distributed Computing, Chicago, 1998, pp. 172–179.

[21] N.J. Boden, D. Cohen, R.E. Felderman, A.E. Kulawik, C.L. Seitz, J.N. Seizovic, W. Su, Myrinet: a gigabit per second local area network, IEEE Micro 15 (1995) 29–36.

[22] J.S. Vetter, D.A. Reed, Real-time performance monitoring, adaptive control and interactive steering of computational grids, Int. J. Supercomput. Appl. High Perform. Comput. 14 (2000) 357–366.

[23] F. Vraalsen, R.A. Aydt, C.L. Mendes, D.A. Reed, Performance contracts: predicting and monitoring grid application behavior, in: Proceedings of the Grid'2001, Lecture Notes in Computer Science, vol. 2242, Springer, Berlin, 2002, pp. 154–165.

Dan Reed is the Chancellor's Eminent Professor at the University of North Carolina at Chapel Hill, as well as the Director of the Renaissance Computing Institute (RENCI), a venture supported by the three universities – the University of North Carolina at Chapel Hill, Duke University and North Carolina State University – that is exploring the interactions of computing technology with the sciences, arts and humanities. Reed also serves as Vice-Chancellor for Information Technology and Chief Information Officer for the University of North Carolina at Chapel Hill.

Dr. Reed is a member of President George W. Bush's Information Technology Advisory Committee, charged with providing advice on information technology issues and challenges to the President, and he chairs the subcommittee on computational science. He is a board member for the Computing Research Association, which represents the interests of the major academic departments and industrial research laboratories. He was previously Director of the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign, where he also led the National Computational Science Alliance, a consortium of roughly 50 academic institutions and national laboratories that is developing next-generation software infrastructure for scientific computing. He was also one of the principal investigators and chief architect for the NSF TeraGrid. He received his Ph.D. in computer science in 1983 from Purdue University.

Charng-da Lu is a Ph.D. candidate at the Department of Computer Science, University of Illinois at Urbana-Champaign. He obtained his M.S. degree, also at Illinois, in 2002. His main research interests are parallel computing, fault-tolerance and fault sensitivity analysis.



Celso L. Mendes is a Research Scientist at the Department of Computer Science, University of Illinois at Urbana-Champaign. Prior to his current position, he worked for several years as an Engineer at the Institute for Space Research in Brazil. He obtained the degrees of Electronics Engineer and Master in Electronics Engineering, both at the Aeronautics Technological Institute (ITA), in Brazil, and a Ph.D. in Computer Science at the University of Illinois. His main research interests are parallel computing, Grid environments and performance monitoring. Dr. Mendes is a member of the IEEE Computer Society and of the Association for Computing Machinery.