


ACHIEVING HIGHER DEPENDABILITY THROUGH HOST AND NIC PROCESSOR COLLABORATION

A Dissertation Presented

by

YIZHENG ZHOU

Submitted to the Graduate School of the University of Massachusetts Amherst in partial fulfillment

of the requirements for the degree of

DOCTOR OF PHILOSOPHY

September 2008

Electrical and Computer Engineering


© Copyright by Yizheng Zhou 2008

All Rights Reserved

ACHIEVING HIGHER DEPENDABILITY THROUGH HOST AND NIC PROCESSOR COLLABORATION

A Dissertation Presented

by

YIZHENG ZHOU

Approved as to style and content by:

Israel Koren, Chair

C. Mani Krishna, Member

Tilman Wolf, Member

Charles C. Weems, Member

Christopher V. Hollot, Department Head
Electrical and Computer Engineering

ABSTRACT

ACHIEVING HIGHER DEPENDABILITY THROUGH HOST AND NIC PROCESSOR COLLABORATION

SEPTEMBER 2008

YIZHENG ZHOU

B.Sc., TSINGHUA UNIVERSITY

M.Sc., NORTH CAROLINA STATE UNIVERSITY

Ph.D., UNIVERSITY OF MASSACHUSETTS AMHERST

Directed by: Professor Israel Koren

Traditionally, distributed systems requiring high dependability were designed us-

ing custom hardware with massive amounts of redundancy. Not only the nodes, but

also the network was replicated in most of these systems. Recently, the need for cost

reduction and access to the latest commercial technologies has prompted the use of

commercial off-the-shelf (COTS) hardware and software products in the design of

such systems. On the other hand, reliance on COTS technology brings about new

challenges in system reliability. This dissertation attempts to address these challenges

by developing fault tolerance techniques for modern high-speed networking-based sys-

tems.

Being driven by the demand for greater network performance, emerging network

technologies have complex network interfaces with a Network Interface Card (NIC)


processor and large local memory. However, increasing complexity results in a larger

set of failure points and a potential increase in the network failure rate. This is in

addition to the system failures that can be caused by faults that strike the host system.

In this dissertation, we propose to achieve higher dependability of distributed systems

through host and NIC processor collaboration. The host processor will detect and

recover a failed network interface, and in addition, the symbiotic relationship allows

the NIC processor to aid in the recovery of a failed host system or application. More

specifically, we present an effective low-overhead adaptive and concurrent self-testing

technique to protect programmable high-speed network interfaces, and a low-overhead

message logging protocol to achieve fast recovery from host application crashes.


TABLE OF CONTENTS

Page

ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv

LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .viii

LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix

CHAPTER

1. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1 COTS Products . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Single Event Upset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Research Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 Contribution of the Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.6 Organization of the Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . 8

2. ADAPTIVE AND CONCURRENT SELF-TESTING . . . . . . . . . . . . . . 9

2.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Myrinet: An Example Programmable Network Interface . . . . . . . . . . . . 12

    2.2.1 Myrinet NIC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
    2.2.2 Myrinet Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.3 Failure Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

    2.3.1 Failure Detection Strategy . . . . . . . . . . . . . . . . . . . . . . . 17
    2.3.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

    2.4.1 Failure Coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
    2.4.2 Performance Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . 27


3. PROGRAMMABLE-NIC-ASSISTED MESSAGE LOGGING . . . . . 30

3.1 System Model and Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2 Message Logging Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3 Programmable-NIC-Assisted Message Logging . . . . . . . . . . . . . . . . . 37

    3.3.1 The Coordination During Failure-Free Execution . . . . . . . . . . . 39
    3.3.2 Failure Recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

3.4 Implementation Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

    3.4.1 IXP1200-based Board and Programmable NIC . . . . . . . . . . . . 47
    3.4.2 The MPICH-V framework and the Berkeley Lab Checkpoint/Restart . . 48
    3.4.3 Implementation Issues of NMLP . . . . . . . . . . . . . . . . . . . . 52

3.5 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

    3.5.1 Raw Communication Performance . . . . . . . . . . . . . . . . . . . 56
    3.5.2 The NAS Parallel Benchmark . . . . . . . . . . . . . . . . . . . . . 56

4. SUMMARY AND FUTURE WORK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66


LIST OF TABLES

Table Page

2.1 Results of Fault Injection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25


LIST OF FIGURES

Figure Page

1.1 The Fault Tolerance Architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.1 Example Myrinet Network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.2 Simplified block diagram of the Myrinet NIC. . . . . . . . . . . . . . . . . . . . . . . . 13

2.3 Simplified View of the Myrinet Control Program (MCP). . . . . . . . . . . . . . 14

2.4 Examples of Fault Effects on Myrinet’s GM. . . . . . . . . . . . . . . . . . . . . . . . . 16

2.5 Logical Modules and Routines. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.6 Data Flow of Self-Testing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.7 Comparison of the Original GM and FDGM. . . . . . . . . . . . . . . . . . . . . . . . . 28

2.8 Performance Impact for Different Self-Testing Intervals. . . . . . . . . . . . . . . . 28

3.1 System Architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

3.2 The Coordination During Failure-Free Execution. . . . . . . . . . . . . . . . . . . . . 40

3.3 An application process fails with no checkpointing running. . . . . . . . . . . . 44

3.4 An application process fails during checkpointing. . . . . . . . . . . . . . . . . . . . . 45

3.5 Simplified block diagram of the IXP1200-based RadiSys ENP2505 board . . . . 46

3.6 Simplified block diagram of the NIC software. . . . . . . . . . . . . . . . . . . . . . . . 48

3.7 General architecture of MPICH and the MPICH-V framework. . . . . . . . . 49

3.8 Packet encapsulations in the MPICH-V framework . . . . . . . . . . . . . . . . . . . 51


3.9 Latency of rollback recovery protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

3.10 Latency difference between message logging protocols and the coordinated checkpointing (VCL) . . . . . . . . . . . . . . . . . . . . . . . . . 57

3.11 Bandwidth of rollback recovery protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

3.12 Bandwidth difference between message logging protocols and the coordinated checkpointing (VCL) . . . . . . . . . . . . . . . . . . . . . . . 58

3.13 Performance comparison of protocols for NPB Class W . . . . . . . . . . . . . . . 59

3.14 Performance comparison of protocols for NPB Class A . . . . . . . . . . . . . . . . 59


CHAPTER 1

INTRODUCTION

Historically, designers of distributed systems developed custom hardware and/or

software with massive amounts of redundancy to improve system dependability. Due

to development cost and time constraints, traditional solutions are now giving way to

designs based on Commercial Off-The-Shelf (COTS) hardware and software products.

The increased reliance on COTS technology has created a growing need for lightweight

fault tolerance. This dissertation attempts to address this problem for modern COTS-

based distributed systems.

1.1 COTS Products

COTS products are software and hardware components that already exist and are

available to the general public. The use of COTS is often an alternative to in-

house development or one-time development. Over the past decade, the use of COTS

products as elements of distributed systems has become increasingly commonplace,

due to the expected lower cost and faster system construction. Organizations that

adopt a COTS-based systems approach also expect to stay in step with the advances

in commercial technologies that occur in the competitive marketplace.

The pros of using COTS products are [29]:

• Functionality is ready and available. COTS products come as ready-made and

ready-to-use applications. There is no need to “reinvent the wheel.”


• Functionality is tested and working. COTS products have undergone a consid-

erable amount of testing through a dedicated team. Also, they have been used

by a large community of users and experts.

• Functionality is rich. Since system requirements are derived in the narrow

confines of a specific problem domain, COTS products may offer greater func-

tionality not previously considered.

• Support is available. Maintenance is provided by the vendor or available in the

marketplace, and the cost is usually a fraction of doing it in-house. Also, using

COTS allows one to take advantage of product upgrades and keep current with

advances in technology.

There are cons as well in using COTS products [29]:

• No control over the requirements for the COTS. Requirements are subject to

market forces; therefore, the COTS components may not address all of the

product requirements. Additional effort may be needed to meet reliability,

security, and safety requirements to protect against COTS vulnerability.

• No control over the quality of the COTS. The quality of COTS products, doc-

umentation, and support are in the hands of vendors.

• Learning curve. Although COTS products offer ready-made functionality, de-

velopers have to become familiar with it before being able to use it.

The advantages of COTS products, in fact, outweigh the disadvantages. But as

mentioned above, no COTS products have been designed to meet one’s unique set

of requirements, and as a result there will be a gap between the requirements and

those met by the COTS products. Developers must understand this problem well

before the implementation and ensure that the COTS products can be customized

and modified to bridge the gap.


1.2 Single Event Upset

The use of COTS products in mission-critical applications is a growing trend,

mainly due to development cost and time constraints. However, COTS products are

not usually designed to meet the stringent requirements of mission-critical applica-

tions. Furthermore, as a consequence of the technological progress in microelectronics,

COTS-based systems become increasingly sensitive to different effects of the environ-

ment. Particularly, COTS hardware components used in space are susceptible to

transient faults due to Single Event Upsets (SEUs).

SEUs have emerged as a key challenge in the design of COTS-based systems

for critical applications even at ground level. SEUs arise from energetic particles,

such as neutrons from cosmic rays and alpha particles from packaging material. As

energetic particles pass through a semiconductor device, they lose energy by ionizing

the medium and generate electron-hole pairs. These charges accumulate in transistor

source and diffusion nodes. A sufficient amount of charges may introduce a fault into

circuit operations, such as transient pulses in logic or support circuitry, or bit flips in

memory cells or registers [55]. Because this type of event is non-destructive, it is

termed soft or transient. Typically, a reset of the device or a rewriting of the memory

cell results in normal device behavior thereafter.

As transistor counts continue to increase exponentially, SEUs will be an increasing

burden for COTS-based system designers. The raw error rate per SRAM or latch bit

is projected to be roughly constant or slightly decrease for the next several technology

generations [21, 24]. This means that a COTS-based system’s error rate will grow in

direct proportion to the complexity of COTS components in each succeeding gener-

ation, unless we strengthen the COTS products used in mission-critical applications

with specific fault tolerance techniques.


1.3 Research Goals

This dissertation attempts to address the challenges raised by the use of COTS

components and the occurrence of SEUs by developing fault tolerance techniques for

modern high-speed networking-based distributed systems. Although numerous fault

tolerance techniques have been developed to improve the reliability of distributed sys-

tems, as far as we know, none of them take advantage of modern network interfaces

that include a NIC processor and large local memory. In this dissertation, our objec-

tive is to take advantage of a collaboration between the host processor and the NIC

processor to develop fault tolerant techniques that will allow the system to quickly

recover from failures without a significant penalty in performance.

Networking hardware has made big strides over the past decade in both perfor-

mance and cost efficiency. Many modern network interfaces have in common a dedi-

cated processor that relieves the host processor of networking chores such as packet

creation, packet scheduling and ensuring in-order delivery of messages. Most of these

networking technologies are I/O attached (i.e., the internal connection is from the

I/O bus rather than from memory) and message-based (i.e., communications take

place through explicit messages rather than through shared storage). Examples of

such technology include Myrinet, Infiniband, Gigabit Ethernet and IBM PowerNP.

While a NIC processor is typically meant to aid in networking functions, it can ad-

vantageously aid in fault tolerance as well.

Our fault tolerance techniques exploit the co-existence of the host processor and

the NIC processor in a symbiotic relationship to achieve a better fault tolerant sys-

tem. The host processor will detect and recover a failed network interface and the

NIC processor will aid the failure detection and recovery of a failed host system or

application. Though a dual-processor mechanism for fault detection has been pro-

posed and used earlier, it involved complete duplication of functionality. In our case,

the degree of autonomy between the network NIC and the host processor allows the

two to carry out their very own functions, while still being able to provide the fault

tolerance functionality.

[Figure: two hosts, each running an Application Process and a Host FT Element, with their NICs each running a Network FT Element, connected through an Interconnection Network.]

Figure 1.1. The Fault Tolerance Architecture.

We propose a layered fault tolerance architecture that is composed of two fault

tolerance elements, one in the host system and the other in the network interface, as

shown in Figure 1.1. Each element keeps track of the health of the other and takes a

corrective action upon a failure. Such an approach takes advantage of the autonomy

between the host system and the I/O attached network interface; thus, a fault that

typically affects the host system, if taken care of quickly, will not affect the network

interface and vice-versa. In this dissertation, we investigate how to improve system

reliability with minimal performance overhead through the proposed fault tolerance

architecture.
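
To make the mutual monitoring concrete, the sketch below (in C, with all names and the heartbeat mechanism purely illustrative; the concrete mechanisms used in this dissertation are the watchdog timer and self-testing of Chapter 2 and the NIC-assisted logging of Chapter 3) shows how each fault tolerance element could track the health of its peer and invoke a corrective action when the peer stops making progress.

    /* Hypothetical sketch of the two fault tolerance (FT) elements of Figure 1.1.
     * Each side periodically inspects a heartbeat counter updated by its peer; if
     * the counter stops changing for too long, a recovery action is started.
     * All names below are illustrative, not part of GM or the actual prototype. */
    #include <stdint.h>

    #define FT_TIMEOUT_TICKS 5          /* missed checks tolerated before recovery */

    struct ft_element {
        volatile uint32_t peer_heartbeat;   /* incremented by the peer            */
        uint32_t          last_seen;        /* snapshot taken at our last check   */
        uint32_t          missed;           /* consecutive checks with no change  */
        void (*recover_peer)(void);         /* e.g., reset NIC / restart process  */
    };

    /* Called from a periodic timer on the host (Host FT Element) or from the
     * NIC control program's timer routine (Network FT Element). */
    static void ft_check_peer(struct ft_element *ft)
    {
        uint32_t now = ft->peer_heartbeat;
        if (now != ft->last_seen) {
            ft->last_seen = now;            /* peer is making progress */
            ft->missed = 0;
        } else if (++ft->missed >= FT_TIMEOUT_TICKS) {
            ft->missed = 0;
            ft->recover_peer();             /* corrective action upon a failure */
        }
    }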


1.4 Previous Work

Lakamraju has successfully implemented one half of the symbiotic relationship,

that is, having the host processor rescue the network interface [27]. The fault tolerance

techniques have been demonstrated in the context of Myrinet, but are generic in

nature, and are applicable to many other modern networking devices that have a

NIC processor and local memory.

The failure detection scheme is based on a fairly simple watchdog timer, imple-

mented in software using the low-granularity interval timers present in most interfaces.

During normal operation, the Network Control Program (NCP) resets the timer pe-

riodically. If the network interface stops responding due to a fault, the timer expires

and an interrupt is raised. This interrupt is then processed by the host processor

as the first indication that something might be wrong with the network interface.

This scheme has been implemented on Myrinet with minimal changes to the NCP.

The worst case fault detection time was measured to be around 5 milliseconds. The

implementation also showed that such a scheme can be used to detect network in-

terface hangs with virtually no overhead, so the performance of the network is not

compromised.

The interrupt caused by the expiration of the timer is handled by a fault recov-

ery daemon. Since messages can be lost or a host can incorrectly accept duplicate

messages, simply resetting the interface card, reloading/restarting the NCP and re-

sending the unacknowledged messages cannot ensure correct recovery. To address

this problem, Lakamraju proposed to checkpoint the state of each transaction includ-

ing sequence numbers and restore this state after the NCP is reloaded. To minimize

checkpointing overhead, the application keeps just the right amount of state infor-

mation to completely recover from a failure. The fault recovery scheme has been

validated using software-based fault injection. The entire failure detection and re-

covery time was under 2 seconds. Moreover, the bandwidth was almost unaffected,


while the round-trip latency, during normal operation, has increased by only 1.5μs.

Depending on the application, this small overhead could be well worth paying for,

considering the high availability that can be obtained.

The proposed failure detection technique is very effective against faults that cause

the network interface to stop responding, but cannot detect other types of failures,

such as those that cause data corruption or bandwidth reduction. Because of the

complexity of the network interface and the NCP, it is challenging to efficiently detect

these non-interface-hang failures. This is the first objective we wish to achieve in this

research.

1.5 Contribution of the Dissertation

This dissertation makes contributions to two fault tolerant issues in distributed

systems. The proposed techniques are marked by their effectiveness and small over-

head.

The first contribution is a transparent low-overhead software-based Adaptive and

Concurrent Self-Testing (ACST) technique to detect non-interface-hang failures, such

as data corruption and bandwidth reduction. The proposed scheme achieves failure

detection by periodically directing the control flow to go through only active software

modules in order to detect errors that affect instructions in the local memory of the

network interface. The experimental results have shown that over 95% of the bit-flip

errors that may affect applications can be detected by the proposed ACST scheme

in conjunction with a software watchdog timer, imposing no appreciable performance

degradation with respect to latency and bandwidth.

The second contribution is a Programmable-NIC-assisted Message Logging (PNML)

approach to combine the efficiency and simplicity aspects of existing log-based rollback-

recovery protocols. We take advantage of the autonomy between the host system and

the I/O attached programmable network interface, and have the NIC processor log


nondeterministic events to the NIC’s local memory, and periodically flush them to

stable storage in parallel with the host computing. We expect such an approach to re-

duce failure-free performance overhead, and achieve fast recovery and output commit.

The resulting protocol provides attractive features like low failure-free performance

overhead, fast recovery and fast interaction with I/O devices.
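
As a rough illustration of this division of labor (a sketch only, not the PNML implementation presented in Chapter 3; the structures, sizes, and function names are assumptions), the NIC-side logging and flushing could look as follows.

    /* Hypothetical sketch of NIC-assisted message logging: the NIC processor
     * appends one determinant per delivered message to a ring buffer in its
     * local memory and periodically flushes the logged entries toward stable
     * storage, overlapping the flush with host computation. Overflow handling
     * is omitted; all names are illustrative. */
    #include <stdint.h>

    #define LOG_SLOTS 1024

    struct determinant {            /* what is needed to replay a receive */
        uint32_t sender_id;
        uint32_t send_seq;          /* sender's sequence number           */
        uint32_t recv_seq;          /* order in which it was delivered    */
    };

    static struct determinant log_buf[LOG_SLOTS];
    static uint32_t log_head, log_tail;          /* tail: next free slot */

    /* Called by the NIC control program for every delivered message. */
    static void nic_log_event(const struct determinant *d)
    {
        log_buf[log_tail % LOG_SLOTS] = *d;
        log_tail++;
    }

    /* Called periodically (e.g., from the NIC's timer routine): drain any
     * logged determinants to stable storage without involving the host CPU. */
    static void nic_flush_log(void (*write_stable)(const void *, uint32_t))
    {
        while (log_head != log_tail) {
            write_stable(&log_buf[log_head % LOG_SLOTS],
                         (uint32_t)sizeof(struct determinant));
            log_head++;
        }
    }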

1.6 Organization of the Dissertation

The rest of this document is organized as follows. In Chapter 2, we introduce an

example programmable network interface, Myrinet, describe the types of failures we

observed, then propose our failure detection scheme, and finally show the experimen-

tal results. In Chapter 3, we introduce existing rollback-recovery protocols and our

system model, then propose the PNML protocol, discuss its implementation issues,

and compare its performance with two existing protocols. Finally, in Chapter 4, we

summarize the dissertation and discuss future work.


CHAPTER 2

ADAPTIVE AND CONCURRENT SELF-TESTING

Nowadays, interfaces with a network processor and large local memory are widely

used [5, 8, 42, 48, 49, 52, 56]. The complexity of network interfaces has increased

tremendously over the past few years. A typical dual-speed Ethernet controller uses

around 10K gates whereas a more complex high-speed network processor such as the

Intel IXP1200 [20] uses over 5 million transistors. As transistor counts increase, single

bit upsets from transient faults, which arise from energetic particles such as neutrons

from cosmic rays and alpha particles from packaging material, have become a major

reliability concern [30, 45], especially in harsh environments [25, 57] such as deep

space. The typical fault rate in deep space for two Myrinet Network Interface Cards

(NICs) is 0.35 faults/hour [25]. When a solar flare is in progress, the fault rate in

interplanetary space can be as great as 6.87 faults/hour for two Myrinet NICs [25].

These also affect systems on earth, especially far away from the equator [59]. Because

this type of fault does not cause a permanent failure of the device, it is termed soft.

Typically, a reset of the device or a rewriting of the memory cell results in normal

device behavior thereafter. Soft-error-induced network interface failures can be quite

detrimental to the reliability of a distributed system. The failure data analysis re-

ported in [43] indicates that network-related problems contributed to approximately

40% of the system failures observed in distributed environments. As we will see in

the following sections, soft errors can cause the network interface to completely stop

responding, function improperly, or greatly reduce network performance. Quickly

detecting and recovering from such failures is therefore crucial for a system requir-


ing high reliability. We need to provide fault tolerance for not only the hardware in

the network interface, but also its local memory where the network control program

(NCP) resides.

In this dissertation, we present an efficient software-based fault tolerance technique

for network failures. Software-based fault tolerance approaches allow the implementa-

tion of dependable systems without incurring the high costs resulting from designing

custom hardware or using massive hardware redundancy. However, these approaches

impose some overhead in terms of reduced performance and increased code size: it is

important to ensure that this overhead has a minimal performance impact.

Our failure detection is based on a software-implemented watchdog timer to de-

tect network processor hangs, and a software-implemented concurrent self-testing

technique to detect other failures. The proposed self-testing scheme detects failures

by periodically directing the control flow to go through program paths in specific por-

tions of the NCP in order to detect errors that affect instructions or data in the local

memory as well as other parts of the network interface. The key to our technique

is that the NCP is partitioned into various logical modules and only active logical

modules are tested, where an active logical module is the collection of all basic blocks

that participate in providing a service to a running application. When compared with

testing the whole NCP, testing only active logical modules can limit significantly the

impact on application performance while still achieving good failure detection cover-

age. When a failure is detected by the watchdog timer or the self-testing, the host

system is interrupted and a fault tolerance daemon is woken up to start a recovery

process [27].

In this dissertation, we show how the proposed failure detection and recovery

techniques can be made completely transparent to the user. We demonstrate these

techniques in the context of Myrinet, but as we will see, the approaches are generic

in nature and are applicable to many modern networking technologies.


The remainder of this chapter is organized as follows. Section 2.1 discusses related

work. A brief overview of Myrinet is given in Section 2.2. We keep the description

sufficiently general so as to highlight the more generic applicability of our work.

Section 2.3 then details our failure detection technique. In Section 2.4, we discuss the

results and performance impact of our failure detection scheme.

2.1 Related Work

Chillarege [14] proposes the idea of a software probe to help detect failed software

components in a running software system by requesting service, or a certain level of

service, from a set of functions, modules and/or subsystems and checking the response

to the request. That paper, however, presents no experimental results to evaluate its

efficiency and performance impact. Moreover, since the author considers general

systems or large operating systems, there is no discussion devoted to minimizing the

performance impact and improving the failure coverage as we did in this dissertation.

Several approaches have been proposed in the past to achieve fault tolerance

by modifying only the software. These approaches include Self-Checking Program-

ming [34], Algorithm Based Fault Tolerance (ABFT) [18], Assertion [1], Control Flow

Checking [44], Procedure Duplication [33], Software Implemented Error Detection and

Correction (EDAC) code [41], Error Detection by Duplicated Instructions (EDDI)

[32], and Error Detection by Code Transformations (EDCT) [31]. Self-Checking Pro-

gramming uses program redundancy to check its own behavior during execution. It

results from either the application of an acceptance test or from the application of

a comparator to the results of two duplicated runs. Since the message passed to a

network interface is completely nondeterministic, an acceptance test is likely to ex-

hibit low sensitivity. ABFT is a very effective approach, but can only be applied to

a limited set of problems. Assertions perform consistency checks on software objects

and reflect invariant properties for an object or set of objects, but the effectiveness of

assertions strongly depends on how well the invariant properties of an application

are defined.

[Figure: an example Myrinet network in which hosts, each with a Myrinet host interface card, are connected through Myrinet switches (S).]

Figure 2.1. Example Myrinet Network.

Control Flow Checking cannot detect some types of errors, such as data

corruption, while Procedure Duplication only protects the most critical procedures.

Software Implemented EDAC code provides protection for code segments by periodi-

cally encoding and decoding instructions. Such an approach, however, would involve

a substantial overhead for a NIC processor because the code size of an NCP might

be several hundreds of thousands of bytes. Although it can detect all the single

bit faults, it is overkill because many faults are harmless. Moreover, it cannot de-

tect hardware unit errors. EDDI and EDCT have a high error coverage, but have

substantial execution and memory overheads.

2.2 Myrinet: An Example Programmable Network Interface

Myrinet [8] is a high bandwidth (2Gb/s) and low latency (∼6.5μs) local area

network technology. A Myrinet network consists of point-to-point, full-duplex links

that connect Myrinet switches to Myrinet host interfaces and other switches.


2.2.1 Myrinet NIC

Fig. 2.2 shows the organization and location of the Myrinet NIC in a typical

architecture. The card provides a flexible and high performance interface between

a generic bus, such as PCI and S-Bus, and the high-speed Myrinet link. It has an

instruction-interpreting RISC processor, a DMA interface to/from the host, a link

interface to/from the network and a fast local memory (SRAM) which is used for

storing the Myrinet’s NCP and for packet buffering. The Myrinet’s NCP is respon-

sible for buffering and transferring messages between the host and the network and

providing all network services.

[Figure: the Myrinet host interface card, containing a RISC processor, a DMA interface, a link interface, and fast local memory, sits on the I/O bus next to the system bridge, host processor, and system memory, and connects to the Myrinet LAN link.]

Figure 2.2. Simplified block diagram of the Myrinet NIC.

2.2.2 Myrinet Software

Basic Myrinet-related software is freely available from Myricom [54]. The software,

called GM, includes a driver for the host OS, the Myrinet’s NCP (GM NCP), a

network mapping program, a user library and Application Program Interfaces (APIs).

GM achieves its high performance through a technique known as “operating-system

13

SDMA

RDMA

L_timer

Sending Queue

Receiving Queue

1

2

3

6

7

8

Send buffer 1

Send buffer 2

Receive buffer 2

Receive buffer 1RECV

SEND 4

5

MCP

Sequence of sending a packet

Sequence of receiving a packet

Figure 2.3. Simplified View of the Myrinet Control Program (MCP).

bypass” (OS-bypass) [42]. After initial operating-system calls to allocate and register

memory for communication, the application programs can send and receive messages

without system calls. Instead, the GM API functions communicate through common

memory with the MCP which executes continuously on the processor in the Myrinet

NIC. It is the vulnerability to faults in the GM NCP that is the focus of this work,

so we now provide a brief description of it.

The GM NCP [54] can be viewed broadly as consisting of four interfaces: Send

DMA (SDMA), SEND, Receive (RECV) and Receive DMA (RDMA), as depicted in

Fig. 2.3. The sequence of steps during sending and receiving is illustrated in Fig.

2.3. When an application wants to send a message, it posts a send token in the

sending queue (step 1) through GM API functions. The SDMA interface polls the

sending queue, and processes each send token (step 2) that it finds. It then divides

the message into chunks (if required), fetches them via the DMA interface, and puts

the data in an available send buffer (step 3). When data is ready in a send buffer, the

SEND interface sends it out, prepending the correct route at the head of the packet


(step 4). Performance is improved by using two send buffers: while one is being filled

through SDMA, the packet interface can send out the contents of the other buffer.

Similarly, two receive buffers are present. One of the receive buffers is made

available for receiving an incoming message by the RECV interface (step 5), while

the other could be used by RDMA to transfer the contents of a previously received

message to the host memory (step 6). The RDMA then posts a receive token into the

receiving queue of the host application (step 7). A receiving application on the host

asynchronously polls its receiving queue and carries out the required action upon the

receipt of a message (step 8).

The GM NCP is implemented as a tight event-driven loop. It consists of around

30 routines. A routine is called when a given set of events occur and a specified

set of conditions are satisfied. For example, when a send buffer is ready with data

and the packet interface is free, a routine called send chunk is called. It is also worth

mentioning here that a timer routine (L timer) is called periodically, when an interval

timer present on the interface card expires.
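
The dispatch structure described above can be pictured as in the following minimal C sketch, in which each routine is guarded by a predicate over events and conditions; the guard and routine names (and the stub bodies) are illustrative only, and the real GM NCP is considerably more involved.

    /* Minimal sketch of a tight event-driven loop in the style described above. */
    struct dispatch_entry {
        int  (*ready)(void);        /* events occurred and conditions satisfied */
        void (*routine)(void);      /* routine to run when the guard holds      */
    };

    /* Illustrative stubs standing in for real NCP conditions and routines. */
    static int  send_buffer_ready_and_link_free(void) { return 0; }
    static void send_chunk(void) { /* move one chunk to the link interface */ }
    static int  interval_timer_expired(void) { return 0; }
    static void l_timer(void) { /* periodic housekeeping on the NIC */ }

    static const struct dispatch_entry table[] = {
        { send_buffer_ready_and_link_free, send_chunk },
        { interval_timer_expired,          l_timer    },
        /* ... the real GM NCP has roughly 30 such routines ... */
    };

    static void ncp_main_loop(void)
    {
        for (;;) {
            for (unsigned i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
                if (table[i].ready())
                    table[i].routine();
            }
        }
    }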

Flow control in GM is managed through a token system. Both sends and receives

are regulated by implicit tokens, which represent space allocated to the user process in

various internal GM queues. A send token consists of information about the location,

size and priority of the send buffer and the intended destination for the message. A

receive token contains information about the receive buffer such as its size and the

priority of the message that it can accept. A process starts out with a fixed number

of send and receive tokens. It relinquishes a send token each time it calls GM to send

a message, and a receive token with a call to GM to receive a message. A send token

is implicitly passed back to the process when a callback function is executed upon

the completion of the sending, and a receive token is passed back when a message is

received from the receive queue.
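
A rough sketch of what these tokens and the implicit credit accounting might look like is given below; the field and function names are guesses based on the description above, not GM's actual data structures.

    /* Illustrative (not GM's actual) layout of send and receive tokens and of
     * the implicit token accounting described above. A process starts with a
     * fixed number of each kind of token; sending or posting a receive consumes
     * one, and the completion callback or message receipt gives it back. */
    #include <stdint.h>

    struct send_token {               /* built for each send request            */
        void    *buffer;              /* location of the message in host memory */
        uint32_t length;              /* size of the send buffer                */
        uint32_t priority;
        uint32_t dest_node;           /* intended destination                   */
    };

    struct recv_token {               /* posted by the receiving application    */
        void    *buffer;              /* receive buffer                         */
        uint32_t size;                /* its size                               */
        uint32_t priority;            /* priority of messages it may accept     */
    };

    struct port_credits {
        unsigned send_tokens;         /* implicit credits for outstanding sends */
        unsigned recv_tokens;         /* implicit credits for posted receives   */
    };

    static int take_send_token(struct port_credits *p)
    {
        if (p->send_tokens == 0)
            return -1;                /* no credit: the caller must wait        */
        p->send_tokens--;
        return 0;
    }
    static void send_callback_done(struct port_credits *p) { p->send_tokens++; }

    static int take_recv_token(struct port_credits *p)
    {
        if (p->recv_tokens == 0)
            return -1;
        p->recv_tokens--;
        return 0;
    }
    static void message_delivered(struct port_credits *p) { p->recv_tokens++; }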

[Figure: latency (usec) and bandwidth (MBytes/s) versus message length (bytes), comparing fault-free GM with GM with an injected fault. (a) Unusually long latencies caused by a fault. (b) Bandwidth reduction caused by a fault.]

Figure 2.4. Examples of Fault Effects on Myrinet's GM.

2.3 Failure Detection

In the context of the Myrinet card, soft errors in the form of random bit flips can

affect any of the following units: the processor, the interfaces and more importantly,

the local SRAM, containing the instructions and data of the GM NCP. Bit flips may

result in any of the following events:

• Network interface hangs – The entire network interface stops responding.

• Send/Receive failures – Some or all packets cannot be sent out, or cannot be

received.

• DMA failures – Some or all messages cannot be transferred to or/and from host

memory.

• Corrupted control information – A packet header or a token is corrupted.

• Corrupted messages.

• Unusually long latencies.

The above list is not comprehensive. For example, a bit flip occurring in the

region of the SRAM corresponding to the resending path will cause a message to

not be resent when a corresponding acknowledgment was not received. Experiments

also reveal that faults can propagate from the network interface and cause the host

computer to crash. Such failures are outside the scope of this dissertation.


Fig. 2.4 shows how a bit-flip fault may affect message latency and network band-

width. The error was caused by a bit-flip that was injected into a sending path of

the GM NCP. More specifically, one of the two sending paths associated with the two

message buffers was impacted, causing the effective bandwidth to be greatly reduced.

To achieve reliable in-order delivery of messages, the GM NCP generates more mes-

sage resends, and this greatly increases the effective latency of messages. Since no

error is reported by the GM NCP, all host applications will continue as if nothing

happened. This can significantly hurt the performance of applications, and in some

situations deadlines may be missed.

Some of the effects of soft-error-induced bit flips are subtle. For example, although

cyclic-redundancy-checks (CRC) are computed for the entire packet, including the

header, there are still some faults that may cause data corruption. When an appli-

cation wants to send a message, it builds a send token containing the pointer to the

message and copies it to the sending queue. If the pointer is affected by a bit flip

before the GM NCP transfers the message from the host, an incorrect message will be

sent out. Such errors are difficult to detect and are invisible to normal applications.

Even though the above discussion was related to Myrinet, we believe that such

effects are generic and apply to other high-speed network interfaces having similar

features, i.e., a network processor, a large local memory and an NCP running on the

interface card.

2.3.1 Failure Detection Strategy

Our approach to detecting interface hangs is based on a simple watchdog [27], but

one which is implemented in software and uses the low-granularity interval timers

present in most interfaces.

Since the code size of the NCP is quite large, it is challenging to efficiently test this

software to detect non-interface-hang failures. We exploit the fact that applications


generally use only a small portion of the NCP. For instance, the GM NCP is designed

to provide various services to applications, including reliable ordered message deliv-

ery (Normal Delivery), directed reliable ordered message delivery which allows direct

remote memory access (Directed Delivery), unreliable message delivery (Datagram

Delivery), setting an alarm, etc. Only a few of the services are concurrently re-

quested by an application. For example, Directed Delivery is used for tightly-coupled

systems, while Normal Delivery has a somewhat larger communication overhead and

is used for general systems; it is rare for an application to use both of them. Typ-

ically an application only requests one transport service out of the seven types of

transport services provided by the GM NCP. Consequently, only about 10% to 20%

of the GM NCP instructions are “active” when serving a specific application. Other

programmable NICs, such as the IBM PowerNP [5], have similar characteristics.

Based on this observation, we propose to test the functionalities of only that part

of the NCP which corresponds to the services currently requested by the application:

this can considerably reduce failure detection overhead. Moreover, because a fault

affecting an instruction which is not involved in serving requests from an application

would not change the final outcome of the execution, our scheme avoids signaling

these harmless faults. This reduces significantly the performance impact, compared

to other techniques such as those that periodically encode and decode the entire code

segment [41].

To implement this failure detection scheme we must identify the “active” parts of

the NCP for a specific application. To assist the identification process, we partition

the NCP into various logical modules based on the type of services they provide.

A logical module is the collection of all basic blocks that participate in providing

a service. A basic block, or even an entire routine, can be shared among multiple

logical modules. Fig. 2.5 shows a sample NCP which consists of three routines.

The dotted arrow represents a possible program path of a logical module and an

18

...

...

A

B ...

Routine 1

...

...

D

E ...

F

Routine 2

G

Routine 3

C

Network Control Program

Figure 2.5. Logical Modules and Routines.

octagon represents a basic block. All the shaded blocks on the program path belong

to the logical module. In our implementation, we examined the source code of the

GM NCP and followed all possible control flows to identify the basic blocks of each

logical module. This time-consuming analysis has been done manually, but could be

automated by using a code profiling tool similar to GNU gprof.
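
One simple way to record the result of this analysis, sketched below under the assumption that the NCP's basic blocks are numbered consecutively, is a per-module bitmap of member blocks; the structures are illustrative rather than those of the actual implementation.

    /* Illustrative bookkeeping for logical modules: each module records which
     * numbered basic blocks of the NCP belong to it, so the self-testing code
     * knows which paths must be exercised. Not the actual implementation. */
    #include <stdint.h>

    #define MAX_BLOCKS 4096                       /* basic blocks in the NCP   */

    struct logical_module {
        const char *service;                      /* e.g., "Normal Delivery"   */
        uint8_t     block_map[MAX_BLOCKS / 8];    /* one bit per basic block   */
        int         active;                       /* currently serving an app? */
    };

    static void module_add_block(struct logical_module *m, unsigned block_id)
    {
        m->block_map[block_id / 8] |= (uint8_t)(1u << (block_id % 8));
    }

    /* A basic block may be shared among several modules; adding it to another
     * module is simply another call to module_add_block() on that module. */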

For each of the logical modules, we must choose and trigger several requests/events

to direct the control flow to go through all its basic blocks at least once in each self-

testing cycle so that the functionality of the network interface is tested and errors are

detected. For example, in Myrinet interfaces, large and small messages would direct

the control flow to go through different branches of routines because large messages

would be fragmented into small pieces at the sender side and assembled at the receiver

side, while small messages would be sent and received without the fragmenting and

assembling process. We use loopback messages of various sizes to test the sending

and receiving paths of the NCP concurrently. During this procedure, the hardware

of the network interface involved in serving an application is also tested for errors.


The technique can, in addition, be used to test other services provided by network

interfaces such as setting an alarm, by directing the control flow to go through basic

blocks providing these services. Such tests are interleaved with the application’s use

of the network interface.

To reduce the overhead of self-testing, we implement an Adaptive and Concurrent

Self-Testing (ACST) scheme. We insert a piece of code at the beginning of the NCP

to identify the requested types of services and start self-testing for the corresponding

logical modules. The periodic self-testing of a logical module should start before it

serves the first request from the application(s) to detect possible failures; this causes

a small delay for the first request. For a low-latency NIC such as Myrinet, this delay

would be negligible. Furthermore, we can reduce this delay by letting the application

packets follow on the heels of the self-testing packets. If a logical module is idle for

a given time period, the NCP would stop self-testing it. A better solution can be

achieved by letting the NCP create lists for each application to track the type of

services it has requested, so that when an application completes and releases network

resources, which can be detected by the NCP, the NCP could check the lists and stop

the self-testing for the logical modules that provide services only to this completed

application.
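
A minimal sketch of this adaptive start/stop policy, with an assumed idle limit and hypothetical helper names, is shown below.

    /* Sketch of the adaptive policy described above: self-testing of a logical
     * module starts when the first request for its service arrives and stops
     * after the module has been idle for a while. Names and the idle limit are
     * assumptions, not the actual GM NCP code. */
    #define IDLE_LIMIT 16              /* self-testing cycles with no requests */

    struct module_test_state {
        int      testing;              /* periodic self-testing enabled?      */
        unsigned idle_cycles;          /* cycles since the last real request  */
    };

    /* Called when the NCP sees a request that needs this module's service. */
    static void on_service_request(struct module_test_state *s)
    {
        s->idle_cycles = 0;
        if (!s->testing)
            s->testing = 1;            /* start self-testing before serving   */
    }

    /* Called once per self-testing cycle (e.g., from the timer routine). */
    static void on_test_cycle(struct module_test_state *s)
    {
        if (!s->testing)
            return;
        if (++s->idle_cycles >= IDLE_LIMIT)
            s->testing = 0;            /* module idle for a while: stop tests */
    }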

[Figure: self-testing data flow among the host, the LANai NIC, and the switch: loopback messages move from a DMA region in host memory through SDMA to SEND, are looped back to RECV at the same node, and are placed by RDMA into a second DMA region; the L_timer routine drives each cycle.]

Figure 2.6. Data Flow of Self-Testing.

2.3.2 Implementation

The software-implemented watchdog timer [27] makes use of a spare interval timer

to detect interface hangs. This timer, say IT1, is first initialized to a value just

slightly greater than 800μs, which is the maximum time between the L timer routine

invocations during normal operation. The L timer routine is modified to reset IT1

whenever it is called. The interrupt mask register provided by the Myrinet NIC is

modified to raise an interrupt when IT1 expires. Thus, during normal operation,

L timer resets IT1 just in time to avoid an interrupt from being raised. When the

NIC crashes/hangs, the L timer routine is not executed, causing IT1 to expire and

an interrupt to be raised, signaling to the host that something may be wrong with

the network interface. Such a scheme allows the host to detect NIC failures with

virtually no overhead.
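
In outline, the scheme amounts to the code below; the register names, the interrupt-enable bit, and the exact reload value are placeholders for the corresponding LANai registers and are not the actual GM code.

    /* Sketch of the software watchdog: the NCP's periodic timer routine keeps
     * re-arming the spare interval timer IT1; if the NCP hangs, IT1 expires and
     * the pre-enabled interrupt tells the host something may be wrong. */
    #include <stdint.h>

    /* Stand-ins for the memory-mapped interval timer and interrupt mask. */
    static volatile uint32_t IT1;
    static volatile uint32_t INT_MASK;
    #define IT1_EXPIRED_IRQ (1u << 3)      /* assumed interrupt-enable bit       */
    #define WATCHDOG_RELOAD 850            /* usec; just above the 800us period  */
                                           /* of L_timer mentioned in the text   */

    static void watchdog_init(void)
    {
        IT1 = WATCHDOG_RELOAD;
        INT_MASK |= IT1_EXPIRED_IRQ;       /* raise an interrupt if IT1 expires  */
    }

    /* Added at the end of the NCP's periodic timer routine: during normal
     * operation it runs at least every 800us and re-arms IT1 just in time, so
     * the interrupt is never raised. If the NCP hangs, IT1 expires and the host
     * is interrupted, waking the fault tolerance daemon that starts recovery. */
    static void l_timer_watchdog_kick(void)
    {
        IT1 = WATCHDOG_RELOAD;
    }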

This detection technique works as long as a network interface hang does not

affect the timer or the interrupt logic. This is supported by our experiments: over

an extensive period of testing, we did not encounter a single case of a fault that

affected the timer or the interrupt logic. In fact, this simple failure detection

mechanism was able to detect all the interface hangs in our experiments. While it is

not impossible that a fault might affect these circuits, our experience has shown this

to be extremely unlikely.

In what follows, we demonstrate and evaluate our self-testing scheme for one

of the most frequently used logical modules in the GM NCP, the Normal Delivery

module. Other modules have a similar structure with no essential difference, and the

self-testing of an individual logical module is independent of the self-testing of other

modules.

To check a logical module providing a communication service, several loopback

messages of a specific bit pattern are sent through the DMA and link interfaces and

back so that both the sending and receiving paths are checked. Received messages


are compared with the original messages, and the latency is measured and compared

with normal latencies. If all of the loopback messages are received without errors and

without experiencing unusually long latencies, we conclude that the network interface

works properly.

We have implemented such a scheme in the GM NCP. We emulate normal sending

and receiving behavior in the Normal Delivery module. This is done by posting

send and receive tokens into the sending and receiving queues, respectively, from

within the network interface, rather than from the host. The posting of the tokens

causes the execution control to go through basic blocks in the corresponding logical

module, so that errors in the control flow path are detected. Similarly, some events

such as message loss or software-implemented translation look-aside buffer misses,

which might concurrently happen during the sending/receiving process of the Normal

Delivery module, are also triggered within the NIC to test the corresponding basic

blocks. We can emulate different sets of various requests/events to go through most

of the basic blocks. To reduce the overhead, we made an attempt to trigger as few

requests/events as possible.

Fig. 2.6 shows the data flow of the self-testing procedure. When the GM driver

is loaded, two extra DMA regions are allocated for self-testing purposes. The shaded

DMA region is initialized with predefined data. We added some code at the end of

the timer routine (L timer) to trigger requests/events for each self-testing cycle. The

SDMA interface polls the sending queues, and when some tokens for self-testing are

found, the interface starts to fetch the message from the initialized DMA region, and

passes chunks of data to the SEND interface. For our self-testing, messages are sent

out by the SEND interface to the RECV interface at the same node. Then, messages

are transferred to the other DMA region. Finally, after a predetermined interval when

L timer is called, messages are transferred back to the network interface. During this

procedure, we can check the number of received messages, messages’ contents, and


latencies. Such a design ensures that both directions of the DMA interface and link

interface are tested as well as the network processor and NCP. Note that such a

scheme does not interact with the host processor and hence has minimal overhead.

Because the size of the self-testing code is negligible when compared with the size of

the GM NCP, the performance impact is minor.
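
The checking step at the end of each cycle can be pictured as follows; this is a simplified sketch with assumed names and an assumed latency bound, whereas the actual code lives inside the GM NCP and raises a host interrupt on failure.

    /* Sketch of one self-testing cycle's verification: the loopback messages
     * written to the second DMA region are compared against the predefined
     * pattern, counted, and timed. All names and bounds are illustrative. */
    #include <stdint.h>
    #include <string.h>

    #define TEST_MSGS        4             /* loopback messages per cycle (assumed)  */
    #define MAX_TEST_LATENCY 2000          /* usec; "unusually long" bound (assumed) */

    struct loopback_result {
        const uint8_t *sent;               /* predefined pattern that was sent       */
        const uint8_t *received;           /* contents of the second DMA region      */
        uint32_t       length;
        uint32_t       latency_us;         /* measured for this message              */
    };

    /* Returns 0 if the interface looks healthy, nonzero if the host should be
     * interrupted so its fault tolerance daemon can start recovery. */
    static int check_self_test(const struct loopback_result *r, unsigned received)
    {
        if (received != TEST_MSGS)
            return -1;                     /* messages were lost or duplicated       */
        for (unsigned i = 0; i < received; i++) {
            if (r[i].latency_us > MAX_TEST_LATENCY)
                return -1;                 /* unusually long latency                 */
            if (memcmp(r[i].sent, r[i].received, r[i].length) != 0)
                return -1;                 /* data corruption                        */
        }
        return 0;                          /* all loopback messages intact, on time  */
    }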

Self-testing can also be implemented using an application running in the host with

no modification to the GM NCP. Such an implementation would impose an overhead

to the host system that we avoid with our approach. Also, a pure application-level

self-testing would be unable to test some basic blocks that would otherwise be tested

with our self-testing implemented in the GM NCP, such as the resending path, because

of its inability to trigger such a resending event.

Clearly, it is only when the injected faults manifest themselves as errors that this

approach can detect them. Faults which are “silent” and simply lurk in the data

structures would require a traditional redundancy approach, which is outside the

scope of our work.

Since all the modifications are within the GM NCP, the API used by an application

is unchanged so that no modification to the application source code is required.

2.4 Experimental Results

Our experimental setup consisted of two Pentium III machines each with 256MB

of memory, a 33MHz PCI bus and running Redhat Linux 7.2. The Myrinet NICs

were LANai9-based PCI64B cards and the Myrinet switch was type M3M-SW8.

2.4.1 Failure Coverage

We used as our workload a program provided by GM to send and receive messages

of random lengths between processes in two machines. To evaluate the coverage

of the self-testing of the modified GM, we developed a host program which sends


loopback messages of various lengths to test latency and check for data corruption. We

call it application-level self-testing to distinguish it from our NCP-level self-testing.

This program follows the same approach as the NCP-level self-testing, that is, it

attempts to check as many basic blocks as possible for the Normal Delivery module.

The application-level self-testing program sends and receives messages by issuing GM

library calls, in much the same way as normal applications do. We assume that, if

such a test application is run in the presence of faults, it will experience the same

number of faults that would affect normal applications. Based on this premise, we use

the application-level self-testing as baseline and calculate the failure coverage ratio

to evaluate our NCP-level self-testing. The failure coverage ratio is defined as the

number of failures detected by the NCP-level self-testing divided by the number of

failures detected by the application-level self-testing. When calculating the failure

coverage ratio, we did not count the failures that are not covered by the proposed

technique, such as host crashes. To make the baseline application comparable to

the NCP-level self-testing, we concurrently trigger exception events within the GM

NCP to direct the control flow to cover basic blocks handling exceptions, so that the

baseline application can detect all the failures that can be detected by the NCP-level

self-testing.

The underlying fault model used in the experiments was primarily motivated by

SEUs which were simulated by flipping bits in the SRAM. Such faults disappear on

reset or when a new value is written to the SRAM cell. Since the probability of

multiple SEUs is low, we focus on single SEUs. To emulate a fault that may cause

the hardware to stop responding, we injected stuck-at-0 and stuck-at-1 faults into

the special registers in the NIC. The time instances at which faults were injected

were randomly selected. After each fault injection run, the GM NCP was reloaded

to eliminate any interference between two experiments.
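
In essence, this software-implemented injection reduces to single-bit manipulations of the target word, as in the generic sketch below; the targets in the experiments are words of the NIC's SRAM and its special registers (how stuck-at faults were held in place in the actual experiments is not detailed here, and the sketch simply forces the bit once).

    /* Sketch of software-implemented fault injection: an SEU is emulated by
     * XOR-ing one bit of the target word, and stuck-at faults by forcing the
     * bit to 1 (OR) or to 0 (AND with the complement). */
    #include <stdint.h>

    static void inject_bit_flip(volatile uint32_t *word, unsigned bit)
    {
        *word ^= (1u << bit);              /* transient: disappears on rewrite */
    }

    static void inject_stuck_at_1(volatile uint32_t *word, unsigned bit)
    {
        *word |= (1u << bit);
    }

    static void inject_stuck_at_0(volatile uint32_t *word, unsigned bit)
    {
        *word &= ~(1u << bit);
    }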


To evaluate the effectiveness of our NCP-level loopback without testing exhaus-

tively each bit in the SRAM and registers, we performed the following three experi-

ments:

• Exhaustive fault injection into a single routine (the frequently executed send chunk).

• Injecting faults into the special registers.

• Random fault injection into the entire code segment.

The data structures which can make up a significant fraction of the GM NCP

state were not subjected to fault injection because the proposed technique does not

provide adequate coverage for them. This kind of fault would need a traditional

redundancy approach.

In all the experiments mentioned in this section, only the Normal Delivery logical

module was active and checked. The workload program and the application-level self-

testing program requested service only from this module. If a fault was injected in

the Normal Delivery module, it would be activated by the workload program; if not,

the fault would be harmless and have no impact on the application. The injection of

each fault was repeated 10 times and the results averaged.

                           Send chunk                    Registers                  Entire Code Seg.
                      Failures %Faults %Failures  Failures %Faults %Failures  Failures %Faults %Failures
Host Computer Crash        7     0.7      1.7        46    24.0     35.4          8     0.56     9.09
NCP Hung (By WT)         128    12.1     30.5        10     5.2      7.7         24     1.68    27.27
Send/Recv Failures       151    14.3     36.0         0     0.0      0.0         21     1.47    23.86
DMA Failures              21     2.0      5.0        26    13.5     20.0         12     0.84    13.64
Corrupted Ctrl Info.       0     0.0      0.0         3     1.6      2.3          1     0.07     1.14
Corrupted Message          5     0.5      1.2        45    23.4     34.6          8     0.56     9.09
Unusually Long Latency   107    10.1     25.5         0     0.0      0.0         14     0.98    15.91
No Impact                637    60.3      –          62    32.3      –         1342    93.85      –
Total                   1056   100.0    100.0       192   100.0    100.0       1430   100.00   100.00

Table 2.1. Results of Fault Injection.


The routine send chunk is responsible for initializing the packet interface and

setting some special registers to send messages out on the Myrinet link. The entire

routine is part of the Normal Delivery module.

There are 33 instructions in this routine, totaling 1056 bits. Faults were sequen-

tially injected at every bit location in this routine. Columns 2 to 4 of Table 2.1 show

a summary of the results reported by NCP-level self-testing for these experiments.

Column 2 shows the number of detected failures, column 3 shows the failures as a

fraction of the total faults injected, and column 4 the failures as a fraction of the total

failures observed. About 40% of the bit-flip faults caused various types of failures.

Out of these, 30.5% were network interface hangs, which were detected by our watch-

dog timer, 1.7% of these failures caused a host crash, and the remaining 67.8% were

detected by our NCP-level self-testing. The failure coverage ratio of the NCP-level

self-testing for this routine is 99.3%.

For our next set of experiments, we injected faults into the special registers as-

sociated with DMA. Columns 5 to 7 of Table 2.1 show a summary of the results.

The GM NCP sets these registers to fetch messages from the host memory to the

SRAM via the DMA interface. There are a total of 192 bits in the SDMA registers,

containing information about source address, destination address, DMA length and

some flags. We sequentially injected faults at every bit location. From the results, it

is clear that the memory-mapped region corresponding to the DMA special registers

is very sensitive to faults. In these experiments, faults propagated to the DMA hard-

ware or even the host computer and caused fatal failures. Since the total number of

register bits is only several hundred, orders of magnitude smaller than the number

of instruction bits, the probability that a fault hits a register bit and causes a host

crash is very low. Even though 35.4% of the failures from injecting faults in registers

resulted in a host crash, they account for a very small fraction of the total number of

failures. The failure coverage ratio of this set of experiments is 99.2%.


The third set of results (columns 8 to 10 of Table 2.1) shows how the NCP-

level self-testing performs when faults are randomly injected into the entire code

segment of the GM NCP. We injected 1430 faults at random bit locations, but only

88 caused failures. 27.3% of these failures were network interface hangs detected

by our watchdog timer, 9.1% caused a host crash, and the remaining 63.6% of the

failures were detected by our NCP-level self-testing. The failure coverage ratio is

about 95.6%. From the table we see that a substantial fraction of the faults do not

cause any failures and thus have no impact on the application. This is because the

active logical module, i.e., Normal Delivery, is only one part of the GM NCP. This

reinforces the fact that self-testing for the entire NCP is mostly unnecessary. By

focusing on the active logical module(s), our self-testing scheme can considerably

reduce the overhead.

Due to uncertainties in the state of the interface when injecting a fault, repeated

injections of the same fault are not guaranteed to have the same effect. However, the

majority of failures displayed a high degree of repeatability. Such repeatability has

also been reported elsewhere [40].

2.4.2 Performance Impact

We measure the network performance using two metrics. One is latency, which

is usually calculated as the time to transmit small messages from source to desti-

nation; the other is bandwidth, which is the sustained data rate available for large

messages. Measurements were performed as bi-directional exchanges of messages of

different length between processes in the two machines. For each message length of

the workload, messages were sent repeatedly for at least 10 seconds and the results

averaged.

We experimented with the failure detection scheme and evaluated its performance impact; in this section we will refer to this modified GM software as Failure Detection GM (FDGM).

[Figure 2.7. Comparison of the Original GM and FDGM: (a) Bandwidth and (b) Latency versus Message Length (bytes), for GM and FDGM.]

[Figure 2.8. Performance Impact for Different Self-Testing Intervals: (a) Bandwidth Difference and (b) Latency Difference between GM and FDGM versus the Interval of Self-Testing (seconds).]


Fig. 2.7(a) compares the bandwidth obtained with GM and FDGM for different

message lengths. The reason for the jagged pattern in the middle of the curve is

that GM partitions large messages into packets of at most 4KB at the sender and

reassembles them at the receiver. Fig. 2.7(b) compares the point-to-point half-round-

trip latency for messages of different lengths. For this experiment, the NCP-level

self-testing interval was set to 5 seconds. The figures show that FDGM imposes no

appreciable performance degradation with respect to latency and bandwidth.


We also studied the overhead of the NCP-level self-testing when the test interval

is reduced from 5 to 0.5 seconds. Experiments were performed for a message length

of 2KB. The latency of the original GM software is 69.39μs, and its bandwidth is

14.71MB/s. Fig. 2.8 shows the bandwidth and latency differences between GM and

FDGM. There is no significant performance degradation with respect to latency and

bandwidth. For the interval of 0.5 seconds, the bandwidth is reduced by 3.4%, and

the latency is increased by 1.6%, when compared with the original GM.

Such results agree with expectations. The total size of our self-testing messages is

about 24KB, which is negligible relative to the high bandwidth of the NIC. Users can accordingly choose the NCP-level self-testing interval, trading off performance against failure detection latency.


CHAPTER 3

PROGRAMMABLE-NIC-ASSISTED MESSAGE LOGGING

COTS-based clusters are widely used as large-scale high performance computing

infrastructures due to their impressive price/performance ratio. Currently, clusters

with thousands of nodes are not rare, and in the future, these infrastructures will

become even larger. As a consequence, the trend of increasing software and hardware

complexity has made the issue of system reliability prominent.

Long duration parallel applications on a large-scale cluster may be stopped at

any time during their execution due to unpredictable failures. Failures may cause significant loss of computation time, massive waste of energy, and/or violation of time constraints for critical applications. Even though the loss

of computation time might be acceptable once, the same application may encounter

another failure during its reexecution. Such uncertainty is unacceptable for most

users.

This risk has reactivated the research on fault tolerant environments for parallel

applications. There are two main approaches to fully automatic and transparent fault

tolerance: checkpoint-based rollback recovery and log-based rollback recovery. The

first relies only on checkpoints, and the latter combines checkpointing with logging of

nondeterministic events, encoded in tuples called determinants [3]. For example, for

a message-receipt event, a determinant includes the identity of the sender process, a

unique identifier assigned to the message by the sender, the identity of the receiver

process, and the order in which the message is delivered. Log-based rollback recovery


in general allows a parallel application to recover beyond the most recent set of

checkpoints up to the maximum recoverable state, and to frequently receive input

data or show its outcome from/to I/O devices, which cannot roll back. The existing

three classes of nondeterministic event logging or message logging protocols differ in

how determinants are logged to stable storage. Pessimistic protocols require a process

to block waiting for the determinant of each nondeterministic event to be stored on

stable storage before sending a message. Pessimistic protocols significantly increase

the overhead of failure-free execution, but simplify and speedup recovery. Optimistic

protocols only require that determinants reach stable storage eventually, and thus

reduce failure-free overhead. However, if any of the determinants are lost in case a

process crashes, it is complicated to reconstruct a consistent state for the system.

Causal protocols require every process to piggyback its volatile log of determinants

on every outgoing message. This feature allows causal protocols to combine some of

the positive aspects of pessimistic and optimistic protocols at the expense of a more

complex recovery protocol.
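To make the notion of a determinant concrete, the following C sketch shows the fields listed above for a message-receipt event; the type and field names are illustrative only and are not taken from any particular implementation.

#include <stdint.h>

/* Determinant of a message-receipt event, following the description above. */
typedef struct determinant {
    uint32_t sender_id;      /* identity of the sender process               */
    uint32_t message_id;     /* unique identifier assigned by the sender     */
    uint32_t receiver_id;    /* identity of the receiver process             */
    uint64_t delivery_order; /* order in which the message is delivered      */
} determinant_t;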

In this dissertation, we take advantage of the autonomy between the host system

and the I/O attached programmable network interface, and have the NIC processor

log messages and determinants to the NIC’s local memory, and periodically flush them

to stable storage in parallel with the host computing. The proposed PNML protocol

combines the efficiency and simplicity aspects of existing message logging protocols,

and thus provides attractive features like low failure-free performance overhead, fast

recovery and fast interaction with I/O devices. In the following sections, we introduce

the system model, and then detail our NIC-assisted logging protocol.


3.1 System Model and Assumptions

In a message-passing system, a fixed number of processes cooperate with each

other by sending and receiving messages to execute a distributed program and interact

with I/O devices.

A process execution can be modeled as a sequence of state intervals, each starting

and ending with a nondeterministic event [38]. During each state interval, execution

is deterministic. A process will always generate the same output if it starts from the

same state and is subjected to the same sequence of nondeterministic events at the

same points within the execution. Log-based rollback-recovery relies on the piecewise

deterministic assumption [38], which postulates that the log-based rollback-recovery

protocols can identify all the nondeterministic events executed by each process, and

log all information necessary to replay the events in case of a failure. If this assumption

holds, the log-based rollback-recovery protocols enable a process to recover from a

failure by replaying its execution as it occurred before the failure.

We assume a fail-stop model [37], more precisely, a process may fail by crashing,

in which case it stops execution and loses all its volatile state, but the rest of the

system is robust, that is, the failure will not affect the host operating system or the

NIC. State information saved on stable storage devices during failure-free execution

can survive process failures, and this information can be used for recovery.

To achieve a correct recovery in case of a failure, a log-based rollback-recovery

protocol must ensure that the observable behavior of a distributed system is equiva-

lent to some failure-free execution, and the internal state of the system is consistent

with the observable behavior of the system before the failure [16, 38]. To meet the

correctness criterion, a log-based rollback-recovery protocol must log state informa-

tion about every internal interaction between processes and every external interaction

with I/O devices.


A distributed message-passing system often interacts with I/O devices to receive

input data or show outcome. But in case of a failure, output to I/O devices cannot

be revoked. This is commonly called the output commit problem [38]. Thus, before

sending an output message, the system must ensure that the state generating the

message can be recovered in spite of any failure. The output commit must wait until

all the state information has been identified and saved to stable storage, delaying

the output. If the system frequently sends output messages, the overhead caused

by the output commit problem can severely degrade the performance of the system,

especially for a system incorporating checkpoint-based rollback-recovery protocols.
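As an illustration of the output commit rule, the following C sketch forces all pending recovery information to stable storage before releasing an output message to an I/O device. It is a sketch of the general rule only; the file descriptors and the helper flush_pending_recovery_info() are placeholders, not part of any protocol discussed later.

#include <unistd.h>
#include <stddef.h>

extern int stable_log_fd;                    /* log of determinants and messages  */
extern int output_fd;                        /* descriptor of the external device */

void flush_pending_recovery_info(int fd);    /* placeholder: write buffered logs  */

/* Release an output message only after the state that generated it is recoverable. */
void commit_output(const void *data, size_t len)
{
    flush_pending_recovery_info(stable_log_fd);
    fsync(stable_log_fd);            /* wait until the log is on stable storage  */
    write(output_fd, data, len);     /* only now can the output leave the system */
}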

Rollback recovery uses stable storage to save checkpoints, logs for nondeterministic

events and other state information for recovery. Stable storage must ensure that

the state information for recovery persists through failures and the corresponding recoveries. Hard disks are often used to implement stable storage. In this work, we

assume the volatile memory on a programmable NIC is not subject to failures, and

thus use it together with the host hard disk as the stable storage. We will later relax

this assumption in a discussion. Similarly, it may not be possible for I/O devices to

regenerate input data, so rollback recovery protocols should save the input data on

stable storage before allowing an application process to access it.

During the execution of a parallel application, checkpoints and logs consume the

capacity of stable storage. As the computation progresses and more state information

for recovery is collected and saved, rollback-recovery protocols must identify and

remove useless recovery information, which is called garbage collection. A common

approach is to identify the most recent consistent set of states, and delete all recovery information relating to nondeterministic events that precede this set of states. Here

a consistent state is one in which if the state of a process reflects receiving of a message,

the state of the corresponding sender reflects sending of that message [13]. For some

rollback-recovery protocols, garbage collection is an important issue, because of the


nontrivial overhead of running a special algorithm to identify and discard useless

information. Rollback protocols differ in the complexity and performance impact of

their garbage collection algorithm.

Log-based rollback-recovery protocols use checkpointing and logging of nondeter-

ministic events to enable a process to recover from a failure. Nondeterministic events

include receiving messages, receiving input from I/O devices, system calls, and asyn-

chronous signals. Because message logging introduces a major source of overhead,

this work focuses on receiving messages. Like most protocols in the literature, we

assume that the reception events are the only possible non-deterministic events in

an execution. Under this assumption, the proposed scheme cannot recover a failed

process that is subjected to other forms of nondeterministic events. The range of

nondeterministic events covered is an implementation issue. For more information

about this issue, please refer to [16].

3.2 Message Logging Protocols

As mentioned earlier in this chapter, there are three classes of message logging

protocols: pessimistic, optimistic and causal message logging [2]. Pessimistic message

logging protocols ensure that the determinant of each nondeterministic event is safely

logged on stable storage before the event is allowed to affect the computation or be-

fore the receiver is allowed to communicate with any other process [2, 23]. Pessimistic

message logging protocols allow fast and simple recovery, output commit and garbage

collection. In a message-passing system incorporating a pessimistic message logging

protocol, a failed process can restart from the most recent checkpoint and no other

functioning processes need to be rolled back. [36] reports that communication time

accounts for about 5% to 20% of the overall computation time, and the so called

basic computation time accounts for the rest. If all state information for recovery can

be readily retrieved from local stable storage, send and receive operations incur no


blocking during recovery, and thus the recovery time is very close to the lower bound,

that is, the basic computation time. In contrast, checkpoint-based rollback recovery protocols do not perform as well, because send and receive operations are performed in the same way during recovery as during failure-free execution. Moreover, in such a system, all processes

can send an output message or remove useless recovery information without running a

complex algorithm. In contrast to pessimistic message logging protocols, checkpoint-

based rollback-recovery protocols require a global checkpoint before committing any

output to the external system. Quantitatively, it took several seconds to checkpoint

a single process in our experiments, but the time depends on the amount of state

information to be checkpointed and the implementation. The checkpoint-based pro-

tocols introduce a high output latency, and if the system frequently interacts with the

outside world, this is also a costly solution. However, all the above advantages of pes-

simistic message logging protocols come at the expense of a significant performance

penalty incurred by synchronous logging.
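One common (receiver-based) realization of this synchronous logging is sketched below in C: the determinant and the message content are forced to stable storage before the message is delivered. All helpers and the log descriptor are placeholders, and the determinant fields follow the sketch given earlier in this chapter.

#include <unistd.h>
#include <stdint.h>
#include <stddef.h>

typedef struct {                    /* determinant fields, as sketched earlier */
    uint32_t sender_id, message_id, receiver_id;
    uint64_t delivery_order;
} determinant_t;

extern int stable_log_fd;                                 /* open log on stable storage */
void deliver_to_application(const void *msg, size_t len); /* placeholder                */

void pessimistic_receive(const determinant_t *det, const void *msg, size_t len)
{
    write(stable_log_fd, det, sizeof *det);  /* log the determinant ...                 */
    write(stable_log_fd, msg, len);          /* ... and the message content             */
    fsync(stable_log_fd);                    /* block until both reach stable storage   */
    deliver_to_application(msg, len);        /* only now may the event affect the
                                                computation                             */
}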

Optimistic message logging protocols ensure that the determinant of each non-

deterministic event is saved in volatile memory before the event is allowed to affect

the computation. Determinants kept in volatile memory are periodically flushed to

stable storage. As a result of the asynchronous logging of determinants to stable stor-

age, optimistic message logging protocols do not require processes to block waiting

for nondeterministic events, such as receiving messages, and thus incur small over-

head during failure-free execution. However, the price to be paid for this advantage

includes complicated and slow recovery, output commit and garbage collection. In

a message-passing system incorporating an optimistic message logging protocol, if a process fails, all the determinants kept in volatile memory will be lost, and all the

corresponding state intervals cannot be recovered. Moreover, if the failed process

sent a message during any of the lost state intervals, the receiver of the message must

roll back until its states do not depend on any message whose determinant was lost.


Otherwise, after replaying the execution of the failed process, the system will not be

in a consistent state. As a consequence, optimistic message logging protocols must

track inter-process dependencies during failure-free execution. In case of a failure,

the optimistic message logging protocols use the dependency information to calculate

a global consistent state and recover the pre-failure execution. To ensure that no

output messages may be revoked upon a failure, output commit in optimistic mes-

sage logging protocols requires a global coordination, which introduces a high output

latency, and for applications frequently interacting with the outside world, a high

failure-free execution overhead.

Causal message logging protocols have all processes of a message-passing system

piggyback the determinants in their volatile memory on the outgoing messages sent

to other processes. For every incoming message, a process saves the piggybacked

determinants in its volatile memory before delivering the message to the application.

Therefore, causal message logging protocols ensure that in case of a failure, the de-

terminant of each nondeterministic event is either on stable storage or available in

the volatile memory of a surviving process. Like optimistic message logging proto-

cols, causal message logging protocols avoid synchronous logging of determinants to

stable storage. They also allow each process to commit output without a global co-

ordination. However, the recovery of a message-passing system incorporating causal

message logging protocols is more complex. The process being recovered must obtain

its determinants and the content of messages delivered before the failure from the

surviving processes, and then replay the collected events. Furthermore, the causal-

ity tracking of causal message logging protocols is also complex. For example, the

Manetho system propagates causal information in an antecedence graph [15]. Ev-

ery process in a message-passing system keeps in its volatile memory an antecedence

graph, providing a complete history of the nondeterministic events that have causal

effects on the state of the process. In practice, it is a costly solution for each outgoing


message to carry the entire antecedence graph. Some optimizations have been pro-

posed and implemented to reduce the amount of information carried on application

messages [15].

Despite the considerable amount of research work in the area of log-based rollback-

recovery protocols in distributed systems, there are only a few commercial systems

that have actually adopted them [16]. The two known commercial implementations [6,

19] use pessimistic message logging protocols, and we are unaware of any commercial

systems incorporating optimistic or causal message logging protocols. This is possibly

because of the difficulties in implementing recovery. Furthermore, the two commercial

implementations using pessimistic message logging protocols are used for applications

where the performance overhead incurred by synchronous logging of determinants to

stable storage can be tolerated.

Sender-based pessimistic message logging protocols have been proposed to lower

the overhead of synchronous logging to stable storage. Unlike receiver-based pes-

simistic message logging protocols, which synchronously log the determinant and the

content of each message on stable storage, sender-based pessimistic message logging

protocols only synchronously log determinants, and keep the contents of messages in

volatile memory at the corresponding senders. Consequently, sender-based schemes

reduce logging overhead during failure-free executions. But the recovery in sender-

based protocols is 2% to 20% slower than in receiver-based protocols in case of one

failure. If the number of failures increases, however, sender-based protocols become significantly slower [35].

3.3 Programmable-NIC-Assisted Message Logging

In this dissertation, we propose the PNML protocol, which has the advantage of

low failure-free execution overhead while retaining the advantages of receiver-based

pessimistic message logging protocols. In the PNML protocol, recovery is achieved

through the cooperation of the Host Message-Passing Process (HMPP) and NIC Message-Logging Process (NMLP), as shown in Figure 3.1.

[Figure 3.1. System Architecture: on each node, an Application Process and an HMPP run on the host and an NMLP runs on the NIC; the nodes are connected through an Interconnection Network.]

An HMPP connects directly with its peers on other nodes and the local application

process. It is a communication agent, which serves send and receive requests from

the local application process, and exchanges messages with other HMPPs. Besides

passing messages back and forth, an HMPP is the fault tolerance element on the host

side. An HMPP periodically sends checkpoint requests to its local application process

and its local NMLP, coordinates the checkpointing and message logging operations

with the NMLP, reads the checkpoint image from the application process, and saves

the image together with necessary recovery information as a checkpoint file on stable

storage, generally a local hard disk.

To reduce the failure-free execution overhead, the PNML protocol offloads receiver-

based message logging from the host to a programmable NIC. An NMLP on a pro-

grammable NIC monitors network traffic, and upon a request from the associated

HMPP on the host side, it logs, on behalf of the application process, all messages arriving over the communication channels connected to that HMPP in the local volatile memory on the NIC. Based on the fail-stop assumption, the local volatile memory of


a programmable NIC is not subject to failures, and thus it can be used together with

the host hard disk as stable storage. An NMLP synchronously logs messages to the

local memory of a programmable NIC and periodically flushes determinants and the

content of saved messages to the local hard disk.

Because the major source of overhead in receiver-based pessimistic message logging

is the synchronous saving of determinants and the content of received messages to a

slow stable storage, the PNML protocol can notably reduce the failure-free execution

overhead by synchronously saving messages in fast volatile memory. Moreover, an

NMLP logs messages in parallel with the computation of the associated application

process, which may reduce competing for the resources of the host system and further

contribute to higher performance.

Like the receiver-based pessimistic message logging protocols, the proposed PNML

protocol doesn’t require a global coordination for recovery, output commit, and

garbage collection, which is highly desirable in practical systems. But in the PNML

protocol, an HMPP and an NMLP perform the checkpointing and message logging operations separately, and therefore the proposed protocol requires local coordination between an HMPP and the associated NMLP during checkpointing and recovery, as

will be discussed in the following subsections.

3.3.1 The Coordination During Failure-Free Execution

During failure-free execution, an HMPP must coordinate message logging and

checkpointing operations with the associated NMLP. Figure 3.2 shows that in the

proposed PNML protocol, the HMPP and the NMLP on a single node cooperate

to execute a distributed application program during failure-free execution, where

horizontal lines extending toward the right-hand side represent the execution of each

process, and arrows between processes represent messages. In all of our figures,

ci denotes the ith control message and mj denotes the jth regular communication message.

[Figure 3.2. The Coordination During Failure-Free Execution: the Application Process, the HMPP, and the NMLP exchange control messages (c1 to c7) and regular messages (m1 to m5); checkpointing proceeds in Phase 1 and Phase 2, in-transmit messages are saved with the checkpoint image, and logged messages are split between Message Log 1 and Message Log 2.]

Before the application process starts to execute, the NMLP must know which

messages to log, i.e., which communication channel should be monitored for message

logging. After being launched on one of the participating nodes, the HMPP should

first create communication channels with its peers, and then register these channels

with the local NMLP by sending message c1, as shown in Figure 3.2. After the

NMLP gets ready for message logging, the HMPP sends message c2 to the associated

application process to initiate the execution. At the end of the execution, upon

receiving the exit message c6 from the application, the HMPP should send message

c7 to inform the NMLP to stop message logging for the communication channels

monitored on behalf of the terminating application process.
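For concreteness, the control messages exchanged between an HMPP and its local NMLP can be summarized as in the following C sketch. The enumeration mirrors the labels c1, c3, c5, c7, c8, and c9 used in Figures 3.2 and 3.3 (c8 and c9 are described in Section 3.3.2), while the registration entry reflects the per-connection information described in Section 3.4.3; the names and the wire format are illustrative assumptions, not the actual implementation.

#include <stdint.h>

/* Control messages between an HMPP and its local NMLP; c2, c4, and c6 are
 * exchanged with the application process and are therefore not listed. */
enum hmpp_nmlp_control {
    C1_REGISTER_CHANNELS  = 1,  /* HMPP -> NMLP: start logging these channels       */
    C3_CHECKPOINT_BEGIN   = 3,  /* HMPP -> NMLP: flush and switch message log files  */
    C5_LAST_LOGGED_IDS    = 5,  /* NMLP -> HMPP: ID of last message in closed logs   */
    C7_STOP_LOGGING       = 7,  /* HMPP -> NMLP: application process has exited      */
    C8_FLUSH_FOR_RECOVERY = 8,  /* HMPP -> NMLP: flush volatile log before recovery  */
    C9_FLUSH_COMPLETE     = 9   /* NMLP -> HMPP: volatile log is on stable storage   */
};

/* One registration entry per monitored communication channel (a TCP
 * connection in our implementation, cf. Section 3.4.3). */
typedef struct {
    uint32_t remote_ip;        /* IP address of the peer node      */
    uint16_t local_tcp_port;   /* local port of the TCP connection */
} channel_registration_t;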


The major coordination issue between an HMPP and the associated NMLP is

the synchronization of the checkpointing and the message logging operation. Upon

a failure, a log-based rollback recovery protocol restarts the execution of the failed

application process from the last checkpoint, replays the deliveries of all the messages

delivered after the checkpoint, and thus replays the pre-failure execution up to the

maximum recoverable state. Therefore, we must know or determine which message

is the first one to replay, and which is the last. In the existing log-based rollback

recovery protocols, because the system performs checkpointing and message logging

in a centralized fashion on the host side, the synchronization is unnecessary. How-

ever, in the proposed PNML protocol, because an HMPP and an NMLP perform

checkpointing and message logging operations separately, they must cooperate during

the checkpointing phase.

As shown in Figure 3.2, upon the initiation of a checkpointing operation, the

HMPP sends message c3 and c4 to the local NMLP and application process, respec-

tively. As soon as message c4 (the checkpoint request) is received, the application

process immediately starts to checkpoint all the state information in its process space.

But at the time instant of sending message c4, there may be some messages, somewhere in the communication channel or in the buffer of the HMPP, that have already been logged by the NMLP, are still in transit, and will be delivered after the initiation of the ongoing checkpoint. These in-transmit messages either have already been saved

in the current open log files, or will be flushed to these files. Upon any future failure

after the completion of the ongoing checkpoint, these messages have to be replayed

after restarting the execution of the application process. It is, however, difficult to retrieve them from the log file preceding the most recent one, and tracking their positions in that log file is inefficient once garbage collection is taken into consideration. Another option is to postpone the sending of message


c4 until all the in-transmit messages have been delivered. This, however, may introduce an arbitrarily long delay in the checkpointing operation.

The proposed PNML protocol saves those in-transmit messages together with

the corresponding checkpoint image as a checkpoint file, and thus the system can

efficiently retrieve them in case of a failure. To identify communication messages

received over a specific communication channel, the proposed PNML protocol asso-

ciates a unique integer to every message sent by the same process as the message ID.

Upon receiving the message c3, the NMLP flushes the determinants and the content

of the messages saved in the volatile memory of the NIC to the corresponding message

log files, closes these log files, opens new message log files associated with the new

checkpoint image, collects the IDs of the last messages in each of the closed message

log files, packs the message IDs into a message (message c5 in Figure 3.2) and sends

it to the HMPP.

With the information in the control message from the NMLP (message c5 ), the

HMPP can differentiate the in-transmit messages (message m2 and m3 ) to be saved

in the checkpoint file from those (message m4 ) saved in the new message logs (message

log 2) during the checkpointing process. For the convenience of our discussion, we

divide the checkpointing process into two successive phases. Phase 1 starts when the

HMPP sends message c3 to the NMLP and lasts until the HMPP receives message

c5 from the NMLP. Phase 2 starts immediately after phase 1 and lasts until the

completion of the checkpointing. The two phases partition the messages into three

groups. The first group of messages are received by the HMPP before phase 1,

but they are not delivered when the HMPP sends the checkpoint request to the

application process. The proposed PNML protocol duplicates all of these messages

and saves them with the checkpoint image as in-transmit messages.

The second group of messages are received by the HMPP during phase 1. Here

we make no assumption about the communication delay of message c3 and message


c5, nor about the duration of phase 1; as a result, both messages saved in message log 1 and

those saved in message log 2 may be received by the HMPP during phase 1. For

example, message c5 may be received by the HMPP after the receipt of messages m3

and m4. All messages received in phase 1 may be delivered before the receipt of

message c5, and once delivered, they will be removed from the process memory space

of the HMPP. One might want to address this issue by duplicating all the messages

received during phase 1, and differentiating them after the receipt of message c5. But for

simplicity, the proposed PNML protocol bypasses the issue by blocking the receiving

of all communication channels during phase 1, and thus the second group of messages

become part of the third group, in which messages are received by the HMPP during

phase 2. The performance impact incurred by the blocking operation during phase 1

would be very small, because message c5 is usually only a few tens or hundreds of

bytes of data, and the communication delay between a host and its NIC is usually

very short. In our experiments, the duration of phase 1 is less than one millisecond,

which is negligible in practice.

Upon the receipt of message c5, the HMPP starts to check and monitor all the com-

munication channels for in-transmit messages. Because some communication chan-

nels may be idle around the initiation of the checkpointing, the HMPP first checks

the recorded message ID of the last received message of every communication chan-

nel against the corresponding one in message c5, and if they are equal, marks the

communication channel as flushed. Then the HMPP monitors each of those commu-

nication channels not yet flushed for in-transmit messages till the ID of a received

message equals the corresponding one in message c5. Only when all the communi-

cation channels are marked as flushed, can the HMPP save the in-transmit messages

together with the checkpoint image of the application process as a checkpoint file on

stable storage.
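A C sketch of this HMPP-side handling of message c5 during phase 2 is given below; the channel bookkeeping, the receive routine, and the checkpoint-saving routine are placeholders, and the in-transmit messages received before phase 1 (the first group) are assumed to have been buffered already.

#include <stdint.h>
#include <stdbool.h>

#define NUM_CHANNELS 4                 /* illustrative number of channels        */

typedef struct {
    uint32_t last_recv_id;             /* ID of the last message received        */
    bool     flushed;                  /* channel has caught up with message c5  */
} channel_state_t;

extern channel_state_t channels[NUM_CHANNELS];

/* Receive the next message on a channel, append it to the set of in-transmit
 * messages to be saved with the checkpoint image, and return its ID. */
uint32_t recv_and_buffer_in_transmit(int ch);

void save_checkpoint_file(void);       /* checkpoint image + in-transmit messages */

void handle_c5(const uint32_t last_logged_id[NUM_CHANNELS])
{
    /* Channels that were idle around the checkpoint initiation have already
     * received the last message saved in the closed log files. */
    for (int ch = 0; ch < NUM_CHANNELS; ch++)
        channels[ch].flushed = (channels[ch].last_recv_id == last_logged_id[ch]);

    /* Monitor the remaining channels until each catches up with message c5. */
    for (int ch = 0; ch < NUM_CHANNELS; ch++)
        while (!channels[ch].flushed) {
            channels[ch].last_recv_id = recv_and_buffer_in_transmit(ch);
            channels[ch].flushed = (channels[ch].last_recv_id == last_logged_id[ch]);
        }

    /* All channels are marked as flushed: save the in-transmit messages
     * together with the checkpoint image on stable storage. */
    save_checkpoint_file();
}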

[Figure 3.3. An application process fails with no checkpointing running: recovery is coordinated through control messages c8 and c9, and the messages to replay are the in-transmit messages (m2, m3) saved with checkpoint CKPT1 and the messages (m4, m5) in Message Log 2.]

3.3.2 Failure Recovery

When an HMPP detects the failure of the local application process, which can be

achieved by detecting the disconnection of the communication channel between the

two processes, it will start failure recovery. The proposed PNML protocol needs to

cover two cases in the failure recovery, one is that an application process fails with no

checkpointing operation running, the other is that an application process fails during

a checkpointing operation.

Figure 3.3 shows an example, in which an application process fails when no check-

pointing operation is going on. Shortly after the detection of the failure, the HMPP

sends a message c8 to the local NMLP to request cooperation. Upon the receipt of c8,

the NMLP flushes all the determinants and the content of the received messages saved

in the volatile memory to the stable storage, and informs the HMPP of the completion

of the flushing by sending message c9.

[Figure 3.4. An application process fails during checkpointing: the incomplete checkpoint is discarded, and recovery is coordinated through control messages c11 and c12.]

Thereafter, the HMPP restarts the failed application process, retrieves the saved messages from stable storage, and replays these

messages in the order described by their determinants. Messages to replay include

the in-transmit messages saved in the most recent checkpoint file (message m2 and

m3 ), and the messages saved in the current message log (message log 2). During the

recovery process, the HMPP drops all the outgoing messages from the application

process. Since the HMPP retrieves messages directly from stable storage, no send,

receive, or message logging operations are involved in the recovery, and thus the time to

replay the pre-failure execution is shorter than that of the original execution, similar

to some traditional receiver-based pessimistic message logging protocols, as shown in

[36]. This is a desirable feature for systems requiring high availability.
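The recovery sequence just described can be summarized by the following C sketch, which runs after the NMLP has confirmed (message c9) that its volatile log has been flushed; all helper routines and the array bound are placeholders.

#include <stddef.h>

void   restart_application_from(const char *ckpt_file);           /* placeholder          */
size_t load_in_transmit_msgs(const char *ckpt_file, void **out);   /* saved with the image */
size_t load_message_log(const char *log_file, void **out);         /* current message log  */
void   replay_in_determinant_order(void **msgs, size_t n);         /* re-deliver messages  */

void recover(const char *ckpt_file, const char *log_file)
{
    void  *msgs[4096];                 /* illustrative bound on replayed messages */
    size_t n;

    restart_application_from(ckpt_file);

    /* Replay the in-transmit messages saved in the most recent checkpoint file
     * (m2 and m3 in Figure 3.3), then those in the current message log (log 2). */
    n = load_in_transmit_msgs(ckpt_file, msgs);
    replay_in_determinant_order(msgs, n);

    n = load_message_log(log_file, msgs);
    replay_in_determinant_order(msgs, n);

    /* During the replay, all outgoing messages from the application process
     * are dropped by the HMPP. */
}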

Figure 3.4 shows another example, in which an application initiates a checkpoint-

ing operation, but fails before its completion. In this situation, the HMPP simply

discards the incomplete checkpoint file, and restarts the application process from the last available checkpoint, or relaunches it if no checkpoint has been made since the start of the execution.

[Figure 3.5. Simplified block diagram of the IXP1200-based RadiSys ENP2505 board.]

Again the HMPP coordinates with the NMLP by sending

and receiving control messages (messages c11 and c12), and retrieves and replays saved messages. What differentiates the recovery of this example from the preceding one

is the reinitiation of the checkpointing operation. Coordination between the HMPP

and the NMLP is unnecessary in this case.

3.4 Implementation Issues

We have implemented the proposed PNML protocol on IXP1200-based RadiSys

ENP2505 PCI boards within the MPICH-V framework. In this section, we describe

the underlying hardware and software for the implementation of the proposed PNML

protocol, and address implementation issues in some detail.


3.4.1 IXP1200-based Board and Programmable NIC

The IXP1200 network processor [51] forms the core of the RadiSys ENP2505

board. It includes a 232MHz StrongARM core, and six 4-threaded micro-engines

for data movement. Figure 3.5 shows a simplified block diagram of the IXP1200-

based RadiSys ENP2505 board [58] residing on a host system. Besides the IXP1200

chip, the board also includes 256MB local SDRAM, 8MB local SRAM, a PCI-to-PCI

bridge to connect to the host system, and a 100T MAC with four ports to connect to

the network. In our design the StrongARM core runs an embedded Linux operating

system.

The IXP1200-based platform helps developers focus on the most value-added por-

tion of their work. The Intel Internet Exchange Architecture (Intel IXA) includes

a programming framework called Active Computing Element (ACE), which is a C

program encapsulated by the IXA Software Development Kit [50]. The ACE in-

frastructure provides many common elements that developers may need for typical

networking applications, and thus helps eliminate the need to implement and optimize

these common mechanisms.

We have implemented a network interface to evaluate the proposed PNML pro-

tocol. Figure 3.6 shows a simplified block diagram of the NIC software. To focus on

the particular purpose of our research, our implementation makes use of the existing

Ingress ACE, Layer 3 Forwarder ACE (L3Fwdr), Egress ACE and Stack ACE pro-

vided by the IXA Software Development Kit. We made modifications to the L3Fwdr

ACE, the Stack ACE, and the device drivers on the NIC side. As shown in the figure,

upstream packets go through the Ingress ACE, the L3Fwdr ACE, the Stack ACE

and the drivers all the way to the application process on the host, whereas general

downstream packets are passed from the drivers directly to the Egress ACE, and only

downstream control messages are passed to the L3Fwdr ACE. Because the L3Fwdr

ACE runs in the user space of the embedded Linux operating system, where code

development is easier due to extended library and debugging support, we have added new functionality to the L3Fwdr and made use of it as the NMLP.

[Figure 3.6. Simplified block diagram of the NIC software.]

The L3Fwdr ACE

receives upstream raw packets forwarded from the Ingress ACE, processes them, and

then passes them to the Stack ACE in the kernel space. It also receives some down-

stream packets containing control messages forwarded from the Stack ACE. Because

all communication traffic goes through the embedded Linux, and the L3Fwdr ACE

executes in the user space, our implementation introduces a considerable communi-

cation overhead, but it still allows the verification and preliminary evaluation of the

proposed PNML protocol.

3.4.2 The MPICH-V framework and the Berkeley Lab Checkpoint/Restart

We have implemented the proposed PNML protocol within the MPICH-V frame-

work [11, 12, 28], because it allows implementing new transparent rollback recovery

protocols without significant programming efforts. The MPICH-V framework is based on the MPICH library, a widely used implementation of the MPI standard [53].

[Figure 3.7. General architecture of MPICH and the MPICH-V framework: the Protocol Layer and the Abstract Device Interface sit above the Channel Interface; the Ch_P4 generic device relies on the P4 daemon, while the Ch_V generic device provides a Generic Communication Layer (Vcl, PNML, TPML) on top of the Vdaemon.]

The use of MPI middleware based on the MPI standard as message-passing en-

vironment is becoming popular. The MPI standard, which defines the APIs for a

message-passing programming model, is designed to achieve portable communication

for high performance parallel applications. A number of implementations of the MPI

standard have been realized, and they are ubiquitously used in academia and industry.

Among the existing implementations, MPICH [17] is unique in its design goal of

combining portability with high performance. Its targets were to include all systems

capable of supporting the message-passing model, while giving up as little efficiency

as possible for portability. MPICH is freely available, and constitutes a

complete implementation of the MPI standard specification. The suffix “CH” in

MPICH stands for “Chameleon,” the symbol of adaptability to one’s environment,

and thus of portability. MPICH includes multiple implementations of the Channel


Interface. A channel implements the basic communication routines for a specific

hardware or for a new protocol.

The MPICH-V framework consists of a set of runtime components and a channel,

Ch V. Figure 3.7 shows the layered architecture of the MPICH-V framework. The

Ch V channel is implemented as a generic device layer on top of a specific communi-

cation daemon (Vdaemon). The generic device, independent of any rollback recovery

protocol, implements a set of six primitives used by the Protocol Layer. The commu-

nication daemon connects its peers on all the other computing nodes and the local

MPI application process, and provides all communication routines between different

components involved in the MPICH-V framework. Rollback recovery protocols are

implemented as hooks in relevant communication routines. A set of hooks is called

a V-protocol. For example, VCL is one of the V-protocols, which implements the

Chandy-Lamport Algorithm [13]. The communication daemons in a system incorpo-

rating the VCL protocol periodically perform coordinated checkpoint of distributed

applications. Users can develop new rollback recovery protocols in the Ch V channel.

In the MPICH-V framework, the checkpoint of an MPI application can be per-

formed using the Condor Standalone Checkpoint Library (CSCL) [26] or the Berkeley

Lab Checkpoint/Restart (BLCR) [22]. The CSCL is a user-level solution, whereas

BLCR is a system-level approach. A user-level implementation by its very nature cannot fully support the restoration of certain resources, for example, the process ID of a job. This rules out a wide range of applications. In contrast, a kernel-level solution

reduces potential inconsistency, and is able to support more applications.

In our implementation, the checkpointing of MPI application processes uses BLCR.

It is an open source checkpointer, implemented as a loadable kernel module for Linux

2.4.x and 2.6.x kernels on the x86 and x86-64 architectures, and a small library. In

the MPICH-V framework, when an MPI application process receives a checkpoint re-

quest from the associated local communication daemon, the application process forks

a child process. The child process sends its process image generated by BLCR to the communication daemon, while the forking process continues its execution.

[Figure 3.8. Packet encapsulations in the MPICH-V framework: the Channel header and the ADI header are carried in the TCP payload, inside the Ethernet, IP, and TCP headers and the Ethernet trailer.]

In the

meantime, the communication daemon concurrently communicates with its peers on

other nodes, and receives the checkpoint image of the application process. Thus, the

checkpointing overlaps with the computation of an MPI application.
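This fork-based pattern can be sketched in C as follows; generate_and_send_image() stands in for the BLCR call and for the socket to the communication daemon, neither of which is shown here.

#include <unistd.h>
#include <stdlib.h>

void generate_and_send_image(void);   /* placeholder: BLCR image -> daemon socket */

/* Invoked when the MPI process receives a checkpoint request from its daemon. */
void on_checkpoint_request(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        /* Child: capture and ship the process image, then exit quietly. */
        generate_and_send_image();
        _exit(EXIT_SUCCESS);
    }
    /* Parent: return immediately and continue the MPI computation; the
     * checkpoint transfer proceeds concurrently in the child (which is
     * reaped later, e.g. by a SIGCHLD handler). */
}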

We have implemented the proposed PNML protocol within the MPICH-V frame-

work as a set of hooks in relevant communication routines of the communication

daemons, and made use of the daemons as HMPPs. In addition to communicating

with its peers and the local application process, a communication daemon exchanges

control messages with the local modified L3Fwdr ACE on the programmable NIC,

which is modeled as NMLP in our fault tolerant architecture, to accomplish the co-

ordination between checkpointing and message logging.


3.4.3 Implementation Issues of NMLP

In the proposed PNML protocol, an NMLP offloads message logging from a host

system to a programmable NIC on behalf of an MPI application process running on

the host system. It performs two tasks, one is to save the determinant and the content

of every received message, and the other is to coordinate checkpointing and message

logging with an HMPP running on the host side.

To accomplish the two tasks, an NMLP should save the determinants and the

received messages in a format that can be readily recognized by an HMPP to as-

sist message replaying during failure recovery, and should trace the IDs of received

messages to assist the identification of in-transmit messages during checkpointing.

Therefore, an NMLP cannot simply save received raw packets, but need to process

the received messages and get access to the information in the header of Ch V channel

messages.

An NMLP should first unpack all received packets, and assemble packet payloads

into messages. Figure 3.8 shows the packet encapsulations in the MPICH-V frame-

work. On the IXP1200-based ENP2505 boards, the buffer passed to an L3Fwdr ACE

contains an entire Ethernet frame. As we can see in the figure, the payload size is

limited by the Maximum Transmission Unit (MTU) of Ethernet, which is typically

1,500 bytes. But an MPI message may be a few hundred kilobytes, or even megabytes.

Large MPI messages are usually fragmented by the TCP layer at the sender side. To

access information in the header of a Ch V channel message, an NMLP must assemble

packet payloads into messages. Only after the assembling, can an NMLP save the

determinants and the received Ch V channel messages, and trace the ID of the last

message of every communication channel, in this case every TCP connection.
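The reassembly and ID tracking just described can be sketched in C as follows; the header layout is an illustrative assumption and does not reflect the real Ch_V channel header, log_message() is a placeholder, and the payload is assumed to arrive in order (error handling is discussed next).

#include <stdint.h>
#include <stddef.h>
#include <string.h>

typedef struct {                         /* illustrative channel-message header  */
    uint32_t message_id;
    uint32_t payload_len;
} chv_header_t;

typedef struct {
    uint8_t  buf[1 << 20];               /* per-connection reassembly buffer     */
    size_t   used;
    uint32_t last_complete_id;           /* reported to the HMPP in message c5   */
} connection_t;

void log_message(const chv_header_t *hdr, const uint8_t *payload);   /* placeholder */

/* Append one in-order TCP payload and peel off every complete message. */
void on_tcp_payload(connection_t *c, const uint8_t *payload, size_t len)
{
    if (c->used + len > sizeof c->buf)
        return;                          /* sketch only: a real NMLP would flush or grow */
    memcpy(c->buf + c->used, payload, len);
    c->used += len;

    while (c->used >= sizeof(chv_header_t)) {
        chv_header_t hdr;
        memcpy(&hdr, c->buf, sizeof hdr);
        size_t total = sizeof hdr + hdr.payload_len;
        if (c->used < total)
            break;                       /* the current message is still partial */
        log_message(&hdr, c->buf + sizeof hdr);
        c->last_complete_id = hdr.message_id;
        memmove(c->buf, c->buf + total, c->used - total);
        c->used -= total;
    }
}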

An NMLP should also be able to handle communication errors, such as packet

duplication, loss, reordering and corruption. The TCP/IP stack in OS kernels usually

handles these errors. But because an L3Fwdr ACE on an ENP2505 board receives


raw Ethernet frames, our implementation of the NMLP based on the L3Fwdr ACE

must correctly handle all communication errors.

Besides error handling and packet assembling, an NMLP should address another

issue associated with the coordination between checkpointing and message logging,

that is, to correctly handle partially received messages during the coordination phase.

As mentioned above, the size of a Ch V channel message may be much larger than

MTU. On the receipt of a checkpointing coordination message from the local HMPP,

an NMLP may only receives a small portion of a message. One solution is to wait

until the NMLP receives the entire message. But this may introduce an arbitrary long

delay to the saving of the in-transmit messages on the host side, especially when the

partially received message is very long. Thus, in our implementation, upon the receipt

of a checkpointing coordination message, an NMLP packs into a control message all

the IDs of the last completely received message of all connections and sends it back

to the local HMPP. If the received portion of a partially received message has already

been flushed to a message log file, then the NMLP retrieves the saved portion from the

message log file, and saves it to the new message log file opened after the checkpointing

coordination.

In our implementation, before launching an MPI application process, a commu-

nication daemon creates TCP connections with its peers, and registers the remote

IP address and the local TCP port number of every connection with the L3Fwdr-

ACE-based NMLP on the local programmable NIC. After the registration, for each

received packet, the NMLP checks its destination IP address, source IP address and

destination TCP port number against the local IP address and the pairs of registered

remote IP address and local TCP port number. Then if it is a matching packet,

the NMLP saves all information of the packet necessary for a future recovery. Our

implementation allocates two message logging buffers for each connection. When one

is flushing its content to the local hard disk, the other is used to save new arriving


packet payloads, and thus the periodical flushing never blocks message logging. The

NMLP copies the TCP payload of a packet to one of the two message logging buffers,

forwards the original packet buffer to the Stack ACE, and then starts to handle pos-

sible communication errors, assemble received packets into messages, and save all

information necessary for a recovery.
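The registration check and the double-buffered logging described above can be sketched in C as follows; the structure layout, the buffer size, and the helpers are illustrative assumptions, and the flush is shown as a synchronous call although in the implementation it overlaps with the logging of new packets.

#include <stdint.h>
#include <string.h>
#include <stddef.h>

#define LOG_BUF_SIZE (256 * 1024)        /* illustrative buffer size             */

typedef struct {
    uint32_t remote_ip;                  /* registered by the HMPP               */
    uint16_t local_tcp_port;
    uint8_t  buf[2][LOG_BUF_SIZE];       /* two logging buffers per connection   */
    size_t   used[2];
    int      active;                     /* buffer currently receiving payloads  */
} logged_connection_t;

extern uint32_t local_ip;

void flush_to_host_disk(const uint8_t *buf, size_t len);   /* placeholder */

/* Return the registered connection a packet belongs to, or NULL if the packet
 * is ordinary traffic that is simply passed on to the Stack ACE. */
logged_connection_t *match_packet(logged_connection_t *conns, int nconns,
                                  uint32_t dst_ip, uint32_t src_ip, uint16_t dst_port)
{
    if (dst_ip != local_ip)
        return NULL;
    for (int i = 0; i < nconns; i++)
        if (conns[i].remote_ip == src_ip && conns[i].local_tcp_port == dst_port)
            return &conns[i];
    return NULL;
}

/* Copy a TCP payload into the active buffer; when it fills, switch buffers so
 * that flushing to the host hard disk does not block the logging of new packets. */
void log_payload(logged_connection_t *c, const uint8_t *payload, size_t len)
{
    if (c->used[c->active] + len > LOG_BUF_SIZE) {
        int full = c->active;
        c->active = 1 - c->active;       /* new payloads go to the idle buffer   */
        flush_to_host_disk(c->buf[full], c->used[full]);
        c->used[full] = 0;
    }
    memcpy(c->buf[c->active] + c->used[c->active], payload, len);
    c->used[c->active] += len;
}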

One problem in implementing message logging is the size of the two message log-

ging buffers. How the size of these buffers affects system performance is an unknown

factor. In our experiments, we changed the buffer size from a few hundred kilobytes to

one megabyte, but there was no notable performance impact. This might be because

the impact is negligible when compared with the performance impacts of some more

important factors in our implementation. Another problem is the scalability of our

implementation of the proposed PNML protocol. Because, in this implementation,

an NMLP allocates two large message logging buffers for every connection, it does not scale to large distributed applications.

Our implementation of the proposed PNML protocol introduces three sources of

overhead. First, each message must in general be copied to the local memory of a

programmable NIC. Second, the volatile log in the local memory is regularly flushed

to a local hard disk to free space. Third, the handling of communication errors is

duplicated on programmable NICs. The first source of overhead occurs in the criti-

cal path of the inter-process communication, and may affect directly communication

throughput and latency, even though it is better than traditional pessimistic message

logging protocols. The second source of overhead may double the required commu-

nication bandwidth between a host system and its programmable NIC when running

applications that require high bandwidth.


3.5 Performance Evaluation

Our experimental setup consisted of five Pentium III workstations each with a

processor running at 730MHz, 512 MB of main memory, 1GB of swap on 7200RPM

IDE hard disk, and a 33MHz PCI bus. All the workstations run Linux 2.4.17 (Redhat

Linux 7.2). We used one of the five workstations as a monitor node, and the other four

as computing nodes. In our experiments, we started the execution of an MPI program

by initiating a management process called Dispatcher [12] on the monitor node. The

Dispatcher’s task is primarily to launch communication daemons on all participating

computing nodes, each communication daemon instantiating an MPI process on its

node. We installed an IXP1200-based RadiSys ENP2505 board on each of the four

computing nodes, and loaded the L3Fwdr-ACE-based NMLP on each of the four

boards. All the workstations are connected via Ethernet cables and a switch.

In the experiments, we used MPICH-V version 1.2.7p1. Most test programs were

compiled using GNU GCC version 2.96 and G77 version 2.96. Because one test program, NPB FT, could not be compiled with G77, we compiled it using a commercial Fortran 77 compiler, PGI PGF77 version 5.2-4.

Our performance evaluation considers coordinated checkpoint based on the Chandy-

Lamport algorithm (VCL [12]), PNML and Traditional Pessimistic Message Logging

(TPML). We implemented both PNML and TPML in the MPICH-V framework. All

tests were run in dedicated mode.

For a fair comparison between different protocols, we disabled checkpointing in

all of our experiments. This is because all three protocols use a similar checkpointing mechanism, and the checkpointing overhead is primarily determined by the checkpoint interval and the execution time of an MPI application. To justify the experimental

setup, we also did experiments to evaluate the overhead introduced by the coordi-

nation between checkpointing and message logging in the proposed PNML protocol.

We emulated a 4-process system and a 64-process system on a workstation with an


ENP2505 board, and the average blocking time incurred by the coordination increased

from 737.0 to 827.8 μs, which has nearly no impact on the overall performance in practice.

3.5.1 Raw Communication Performance

In the first set of experiments, we used the NetPIPE [39] utility to measure the

raw communication performance of the three protocols. We performed a ping-pong

test to evaluate the effective latency and bandwidth of message transmissions between

processes on two nodes.
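For reference, a minimal MPI ping-pong in the spirit of this test is sketched below; the message length and repetition count are illustrative, and NetPIPE itself sweeps over many message sizes. Compiled with mpicc and run on two processes, it reports half the round-trip time as the latency and the corresponding effective bandwidth.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int len = 16 * 1024, reps = 1000;   /* illustrative parameters */
    char *buf = malloc(len);
    int rank;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {                      /* send, then wait for the echo */
            MPI_Send(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
        } else if (rank == 1) {               /* echo every message back      */
            MPI_Recv(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double half_rtt = (MPI_Wtime() - t0) / (2.0 * reps);   /* seconds */

    if (rank == 0)
        printf("length %d B: latency %.1f us, bandwidth %.2f MB/s\n",
               len, half_rtt * 1e6, (double)len / half_rtt / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}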

Figures 3.9 and 3.10 present the comparison of the point-to-point half-round-trip

latency obtained with the three protocols for different message lengths. Figures 3.11

and 3.12 compare bandwidth. As we can see from these figures, the performance of

the programmable NICs used in our experiments is quite low, for the reasons we have discussed in Section 3.4.1. The figures show that when the message length is less

than or equal to 256 KB, PNML remains close to VCL, and is notably better than

TPML. But for messages larger than 256 KB, the performance of PNML significantly

drops, even worse than TPML. This slowdown is because the message logging opera-

tions of PNML compete for bandwidth with regular communication traffic. The slow

programmable NICs in some sense magnify the performance impact of PNML, as

faster NICs with much higher bandwidth may start to slow down when the message

length is much larger than 1 MB. However, because typically the average message

size of an MPI application is only a few tens of KB, and very large messages are rare,

our implementation is still adequate to preliminarily evaluate the proposed PNML

protocol.

3.5.2 The NAS Parallel Benchmarks

In the second set of experiments, we tested the performance of the three proto-

cols on well established and optimized MPI programs, the NAS Parallel Benchmarks

(NPB) [9], version 2.3. NPB is widely used to evaluate the performance of parallel

systems, because it mirrors real-world parallel scientific applications better than most other parallel benchmarks.

[Figure 3.9. Latency of the rollback recovery protocols (Coordinated Checkpoint, Programmable-NIC-assisted ML, Traditional Pessimistic ML) as a function of message length.]

[Figure 3.10. Latency difference between the message logging protocols and the coordinated checkpointing (VCL) as a function of message length.]

[Figure 3.11. Bandwidth of the rollback recovery protocols as a function of message length.]

[Figure 3.12. Bandwidth difference between the message logging protocols and the coordinated checkpointing (VCL) as a function of message length.]

[Figure 3.13. Performance comparison of protocols for NPB Class W.]

[Figure 3.14. Performance comparison of protocols for NPB Class A.]

For each benchmark program, NPB specifies five classes of increasing workloads,

called S, W, A, B, and C. Class S is suitable for testing, Class W for single-processor

desktop workstations, Class A for systems with up to 32 processors, while Classes B

and C are suited for multiprocessor systems.

Figures 3.13 and 3.14 present the performance comparison of the three protocols

for the NPB 2.3, for Class W and Class A size problems, respectively. The NAS

EP program measures the upper limits achievable for floating point performance,

without significant interprocessor communication, so that the execution time of the

three protocols for EP is almost the same. For all the other benchmark programs, the

performance of PNML is worse than VCL, but better than TPML. For Class W size

problem, the overhead introduced by PNML is between 3.6% and 14.9% compared

to VCL, and for Class A size problem, the overhead is between 3.4% and 16.3%.

As we can see, the performance of TPML is close to that of VCL. This is again

because of the slow programmable NICs used in the experiments. Since we disabled

checkpointing, the overhead imposed by VCL is negligible. However, in a system with

relatively faster interconnection, the failure-free overhead imposed by TPML is over

70% [36]. The long communication delay hides the message logging overhead.

The largest performance difference appeared when we tested the execution time of the three protocols on the NPB LU benchmark. The LU benchmark includes a large

number of small communications. Because of the largest message logging overhead

among all the three protocols, TPML causes a significant performance degradation

compared to the other two. Related research has shown that the communication

daemon of a sender-based pessimistic message logging protocol, also in charge of

message logging, competes for CPU resources with the MPI process in this case [10].


The performance advantage of PNML over TPML is not quite evident for the BT

and SP benchmarks. Over and above the slow NICs, this is because the long-duration

BT and SP benchmarks exchange many large messages, and stress communication

bandwidth. If we compare Figure 3.13 with Figure 3.14, we find that the performance advantage of PNML over TPML becomes relatively small. Similarly, this

is because the message sizes scale up when the problem sizes of these benchmarks

increase from Class W to Class A.


CHAPTER 4

SUMMARY AND FUTURE WORK

In this dissertation, we proposed a layered fault tolerance architecture that is

composed of two fault tolerant elements on each of the computing nodes in a dis-

tributed/parallel system, one resides on the host system and the other on the I/O

attached programmable NIC. Each fault tolerant element keeps track of the health

of the other and start recovery upon a failure. The proposed fault tolerance archi-

tecture takes advantage of the collaboration between the host processor and the NIC

processor to develop fault tolerant techniques that will allow the system to quickly re-

cover from failures without a significant penalty in performance. In this dissertation,

we investigated how to improve system reliability with minimal performance overhead

through the proposed fault tolerance architecture.

In the first half of our work, we proposed and developed a software-based, low-overhead failure detection scheme for programmable network interfaces. Failure detection is achieved by a watchdog timer that detects network interface hangs, and by a built-in ACST scheme that detects non-interface-hang failures. The proposed ACST scheme directs the control flow through all the basic blocks in the active logical modules; during this procedure, the functionalities of the network interface, essentially the hardware and the active logical modules of the software, are tested. Our experimental results are very promising: over 95% of the bit-flip errors in the local memory of the Myrinet interface card that may affect applications can be detected by our self-testing scheme in conjunction with a software watchdog timer. The proposed ACST scheme can be implemented transparently to applications. To the best of our knowledge, this is the first effort that applies self-testing to a programmable network interface to detect bit flips in memory cells. The basic idea underlying the presented failure detection scheme is quite generic and can be applied to other modern high-speed programmable networking devices that contain a microprocessor and local memory, such as IBM PowerNP [5], Infiniband [52], Gigabit Ethernet [42, 49], QsNet [56] and ATM [48]. Such a failure detection scheme can take advantage of the high bandwidth available in these systems, thereby achieving its failure detection goals with very little overhead.
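
To make the host/NIC division of labor concrete, the following sketch shows one possible shape of such a host-side watchdog loop. It is an illustrative sketch only; the accessor and recovery routines (nic_read_heartbeat, nic_selftest_status, trigger_nic_recovery) are hypothetical placeholders and not the routines used in our implementation.

/*
 * Minimal sketch of a host-side software watchdog for a programmable NIC.
 * All external routines are hypothetical placeholders, not dissertation code.
 */
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

extern uint32_t nic_read_heartbeat(void);   /* counter advanced by NIC firmware */
extern int      nic_selftest_status(void);  /* 0 = last ACST pass was clean     */
extern void     trigger_nic_recovery(void); /* reset and reload NIC firmware    */

#define CHECK_INTERVAL_SEC  1
#define MAX_MISSED_BEATS    3

void host_watchdog_loop(void)
{
    uint32_t last_beat = nic_read_heartbeat();
    int missed = 0;

    for (;;) {
        sleep(CHECK_INTERVAL_SEC);
        uint32_t beat = nic_read_heartbeat();

        if (beat == last_beat) {
            /* Firmware control flow is stuck: a network interface hang. */
            if (++missed >= MAX_MISSED_BEATS) {
                fprintf(stderr, "NIC hang detected, starting recovery\n");
                trigger_nic_recovery();
                missed = 0;
            }
        } else if (nic_selftest_status() != 0) {
            /* Heartbeat advances, but the concurrent self-test has flagged a
             * fault (e.g., a bit flip in an active logical module). */
            fprintf(stderr, "NIC self-test failure detected, starting recovery\n");
            trigger_nic_recovery();
        } else {
            missed = 0;
        }
        last_beat = beat;
    }
}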

In the second half of our work, we proposed the PNML protocol. We assume that the local memory of a programmable NIC is a safe place to save recovery information. During failure-free execution, instead of synchronously saving recovery information to slow stable storage in the critical path of inter-process communication, as traditional receiver-based pessimistic message logging protocols do, an NMLP on a programmable NIC saves the determinants and the content of messages in the local memory on behalf of an MPI application process running on the host, offloading message logging operations from the host to the programmable NIC. Thus, the PNML protocol can notably reduce failure-free execution overhead. During the checkpointing phase, the NMLP cooperates with its HMPP on the host side to coordinate checkpointing and message logging. The coordination is simple and quite efficient, introducing a blocking delay on the order of one millisecond. The proposed PNML combines the efficiency and simplicity of existing message logging protocols, and thus provides attractive features such as low failure-free performance overhead, fast recovery, and fast interaction with I/O devices. Furthermore, in the PNML protocol, no operations require global coordination, a highly desirable feature in practice, especially for large-scale systems. We have implemented the proposed PNML protocol on IXP1200-based RadiSys ENP2505 boards. The PNML protocol outperforms the baseline receiver-based pessimistic message logging protocol, especially in cases where the processes of an MPI application exchange a large number of small messages.
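
As a rough illustration of this receiver-side logging path, the sketch below shows how an NMLP-style handler might log a determinant and message content into NIC local memory before the message becomes visible to the host. The data structures and helper routines are hypothetical and greatly simplified relative to the actual IXP1200 implementation.

/*
 * Simplified sketch of NIC-side receiver-based message logging (PNML style).
 * Structures and helpers (log_reserve, deliver_to_host, ...) are hypothetical.
 */
#include <stdint.h>
#include <string.h>

struct determinant {        /* what must be replayed deterministically        */
    uint32_t sender_rank;
    uint32_t send_seq;      /* sender-side sequence number                    */
    uint32_t recv_seq;      /* order in which the receiver consumed it        */
};

struct log_entry {
    struct determinant det;
    uint32_t length;
    uint8_t  payload[];     /* message content follows the header             */
};

/* Hypothetical helpers backed by the NIC's local memory and DMA engine. */
extern struct log_entry *log_reserve(uint32_t payload_len); /* NULL if log full */
extern void deliver_to_host(const uint8_t *msg, uint32_t len);
extern uint32_t next_recv_seq(void);

/* Called by the NIC firmware for every received application message. */
int nmlp_on_receive(uint32_t sender_rank, uint32_t send_seq,
                    const uint8_t *msg, uint32_t len)
{
    struct log_entry *e = log_reserve(len);
    if (e == NULL)
        return -1;          /* log full: stall or fall back until space frees */

    /* Log determinant and content in NIC local memory, off the host's
     * critical path, before the message becomes visible to the host.        */
    e->det.sender_rank = sender_rank;
    e->det.send_seq    = send_seq;
    e->det.recv_seq    = next_recv_seq();
    e->length          = len;
    memcpy(e->payload, msg, len);

    deliver_to_host(msg, len);
    return 0;
}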

In this dissertation, we assume that only application processes may fail. We can relax this assumption and allow a communication daemon or the host operating system to fail as well. If the host operating system fails, we can have the programmable NIC raise an interrupt after detecting the failure and, in the kernel space of the failed host operating system, have the interrupt handler call the restart routine to reboot the failed operating system. If the interrupt logic is corrupted, we can have a reset signal sent from a pin on the programmable NIC to the reset button of the workstation hosting the NIC and reboot it. The recovery of a failed communication daemon is simpler, because a daemon is stateless in our implementation: a Dispatcher can simply relaunch the failed communication daemon and retrieve all recovery information from either a hard disk or the volatile memory of a programmable NIC.
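
As an illustration of the reboot path sketched above, assuming a Linux host and a hypothetical interrupt line NIC_FAILURE_IRQ asserted by the NIC, a minimal kernel-side handler could look as follows; this is a sketch of the idea, not code from our implementation.

/*
 * Sketch of a kernel module that reboots the host when the NIC signals a
 * host failure. NIC_FAILURE_IRQ and the wiring are hypothetical.
 */
#include <linux/module.h>
#include <linux/interrupt.h>
#include <linux/reboot.h>

#define NIC_FAILURE_IRQ 42   /* hypothetical IRQ line asserted by the NIC */

static irqreturn_t nic_failure_isr(int irq, void *dev_id)
{
    /* The NIC has decided the host OS is no longer making progress. */
    pr_emerg("NIC reported host failure, rebooting\n");
    emergency_restart();     /* does not return */
    return IRQ_HANDLED;
}

static int __init nic_reboot_init(void)
{
    return request_irq(NIC_FAILURE_IRQ, nic_failure_isr, 0,
                       "nic-failure", NULL);
}

static void __exit nic_reboot_exit(void)
{
    free_irq(NIC_FAILURE_IRQ, NULL);
}

module_init(nic_reboot_init);
module_exit(nic_reboot_exit);
MODULE_LICENSE("GPL");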

To further relax the failure assumption, tolerate more types of failures, and attack the major sources of overhead of the proposed PNML protocol, we need to develop an improved PNML. In the new model, we can assume that either a host system or a programmable NIC may partially or completely fail, but that the two never fail simultaneously. This is a reasonable assumption, because a host system and a NIC system usually run in their own, physically separate, memory spaces. Furthermore, we can strengthen the assumption by introducing fault isolation techniques that prevent faults from propagating from the host side to the NIC side, or vice versa. We can then offload the TCP/IP stack from the host system to the programmable NIC. On one hand, the offloading improves the degree of I/O and CPU parallelism and thus the performance; on the other hand, it eliminates one of the two duplicated communication error handling operations present in a system incorporating PNML, and completely removes the third source of overhead introduced by PNML. To tolerate more types of failures and attack the second source of overhead, we can have both the host system and the NIC save a copy of each received message and the corresponding determinant. If either of the two fails, all necessary recovery information can still be found in the surviving one, so NIC failures can be tolerated in addition to host failures. During failure-free execution, the host-side HMPP asynchronously saves recovery information to stable storage off the critical path and, after each save, informs the NMLP to release the corresponding buffer. This solution may require frequent exchange of very short control messages between an HMPP and its NMLP, but it eliminates the second source of overhead, which stems from the periodic flushing of large communication messages, and is thus highly promising. Finally, we can have the improved PNML protocol share the message log with the transmission buffers, thus eliminating the first source of overhead, the copying in the critical path of inter-process communication. The resulting improved PNML may approach the performance of coordinated checkpointing, retain all the advantages of receiver-based pessimistic message logging protocols, and become a highly desirable log-based rollback-recovery protocol in practice.
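
One possible shape of this asynchronous flush-and-release handshake is sketched below; all types and helper routines (dequeue_logged_msg, flush_to_stable_storage, send_release_to_nmlp) are hypothetical placeholders used only to illustrate the design, not part of the current implementation.

/*
 * Sketch of the host-side (HMPP) asynchronous flushing loop for the improved
 * PNML outlined above. All helpers and types are hypothetical placeholders.
 */
#include <stdint.h>
#include <stddef.h>

struct logged_msg {
    uint64_t log_id;        /* identifies the matching buffer on the NIC      */
    const void *data;
    size_t len;
};

/* Hypothetical helpers. */
extern int dequeue_logged_msg(struct logged_msg *out);      /* 0 when empty     */
extern void flush_to_stable_storage(const struct logged_msg *m);
extern void send_release_to_nmlp(uint64_t log_id);           /* short control msg */

/*
 * Runs in a background thread, off the critical path of MPI communication:
 * it drains whatever has been logged so far, writes each copy to stable
 * storage, and tells the NIC to reclaim its duplicate once the copy is durable.
 */
void hmpp_flush_loop(void)
{
    struct logged_msg m;

    while (dequeue_logged_msg(&m)) {
        flush_to_stable_storage(&m);    /* durable copy on the host side      */
        send_release_to_nmlp(m.log_id); /* NIC may now free its buffer        */
    }
}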


BIBLIOGRAPHY

[1] D. Andrews, “Using Executable Assertions for Testing and Fault Tolerance,”

Proceedings of the Ninth International Symposium on Fault-Tolerant Computing,

pp. 102-105, Jun. 1979.

[2] L. Alvisi, K. Marzullo, “Message logging: Pessimistic, optimistic, and causal,”

Proceedings of the Fifteenth International Conference on Distributed Computing

Systems (ICDCS 1995), pp. 229-236, May-Jun. 1995.

[3] L. Alvisi, K. Marzullo, “Tradeoffs in implementing causal message logging pro-

tocols,” Proceedings of the ACM SIGACT-SIGOPS Symposium on Principles of

Distributed Computing (PODC), pp. 58-67, 1996.

[4] L. Alvisi, K. Marzullo, “Message Logging: Pessimistic, Optimistic, Causal and

Optimal,” IEEE Transactions on Software Engineering, vol. 24, No. 2, pp. 149-

159, 1998.

[5] J. R. Allen, Jr., B. M. Bass, C. Basso, R. H. Boivie, et al., “IBM PowerNP

Network Processor: Hardware, Software, and Applications,” IBM Journal of

Research and Development, vol. 47, No. 2/3, pp. 177-194, Mar./May 2003.

[6] A. Borg, W. Blau, W. Graetsch, F. Hermann, W. Oberle, “Fault tolerance under

UNIX,” ACM Transactions on Computer Systems, vol. 7, No. 1, pp. 1-24, 1989.

[7] R. M. Butler, E. L. Lusk, “Monitors, Messages, and Clusters: The p4 Parallel

Programming System,” Parallel Computing, vol. 20, No. 4, pp. 547-564, Apr.

1994.


[8] N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. L. Seitz, J. N.

Seizovic, W.-K. Su, “Myrinet – A Gigabit-per-Second Local-Area Network,”

IEEE Micro, vol. 15, No. 1, pp. 29-36, Feb. 1995.

[9] D. Bailey, T. Harris, W. Saphir, R. Wijngaart, A. Woo, M. Yarrow, “The NAS

Parallel Benchmarks 2.0,” Numerical Aerodynamic Simulation Facility, NASA

Ames Research Center, Report NAS-95-020, 1995.

[10] A. Bouteiller, F. Cappello, T. Herault, G. Krawezik, P. Lemarinier, F. Magniette,

“MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on the Pessimistic

Sender Based Message Logging,” Proceedings of The IEEE/ACM Supercomput-

ing Conference (SC2003), Nov. 2003.

[11] A. Bouteiller, P. Lemarinier, G. Krawezik, F. Cappello, “Coordinated checkpoint

versus message log for fault tolerant MPI,” IEEE International Conference on

Cluster Computing, Dec. 2003.

[12] A. Bouteiller, T. Herault, G. Krawezik, P. Lemarinier, F. Cappello, “MPICH-V

Project: A Multiprotocol Automatic Fault-Tolerant MPI,” International Journal

of High Performance Computing and Applications, SAGE publications, vol. 20,

No. 3, pp. 319-333, 2006.

[13] K. Chandy, L. Lamport, “Distributed Snapshots: Determining Global States of

a Distributed System,” ACM Transactions on Computer Systems vol. 3, No. 1,

pp. 63-75, Feb. 1985.

[14] R. Chillarege, “Self-testing Software Probe System for Failure Detection and

Diagnosis,” Proceedings of the 1994 conference of the Centre for Advanced Studies

on Collaborative Research, pp. 10, 1994.


[15] E. N. Elnozahy, W. Zwaenepoel, “Manetho: Transparent Roll Back-Recovery

with Low Overhead, Limited Rollback, and Fast Output Commit,” IEEE Trans-

actions on Computers, vol. 41, No. 5, pp. 526-531, May 1992.

[16] E. N. Elnozahy, L. Alvisi, Y.-M. Wang, D. B. Johnson, “A Survey of Rollback-

Recovery Protocols in Message-Passing Systems,” ACM Computing Surveys

(CSUR), vol. 34, No. 3, pp. 375-408, Sep. 2002.

[17] W. Gropp, E. Lusk, N. Doss, A. Skjellum, “A High-Performance, Portable Imple-

mentation of the MPI Message Passing Interface Standard,” Parallel Computing,

North-Holland, vol. 22, pp. 789-828, 1996.

[18] K. H. Huang, J. A. Abraham, “Algorithm-Based Fault Tolerance for Matrix

Operations,” IEEE Transactions on Computers, vol. 33, pp. 518-528, Dec. 1984.

[19] Y. Huang, C. M. R. Kintala, “Software implemented fault tolerance: Tech-

nologies and experience,” Proceedings of the Twenty-Third Annual International

Symposium on Fault-Tolerant Computing (FTCS-23), pp. 2-9, 1993.

[20] T. Halfhill, “Intel Network Processor Targets Routers,” Microprocessor Report,

vol. 13, No. 12, pp. 66-68, Sep. 1999.

[21] S. Hareland, J. Maiz, M. Alavi, K. Mistry, S. Walstra, C. Dai, “Impact of CMOS

Scaling and SOI on Soft Error Rates of Logic Processes,” Proceedings of Sympo-

sium on VLSI Technology, pp. 73-74, 2001.

[22] P. H. Hargrove, J. C. Duell, “Berkeley Lab Checkpoint/Restart (BLCR) for

Linux Clusters,” Proceedings of Scientific Discovery through Advanced Comput-

ing Program (SciDAC) Conference, Jun. 2006.


[23] D. B. Johnson, W. Zwaenepoel, “Sender-based message logging,” Proceedings of

the Seventeenth Annual International Symposium on Fault-Tolerant Computing

(FTCS-17), pp. 14-19, 1987.

[24] T. Karnik, B. Bloechel, K. Soumyanath, V. De, S. Borkar, “Scaling Trends of

Cosmic Rays Induced Soft Errors in Static Latches Beyond 0.18μ,” Proceedings

of Symposium on VLSI Circuits, pp. 61-62, 2001.

[25] A. V. Karapetian, R. R. Some, J. J. Beahan, “Radiation Fault Modeling and

Fault Rate Estimation for a COTS Based Space-borne Supercomputer,” Pro-

ceedings of IEEE Aerospace Conference, vol. 5, pp. 9-16, Mar. 2002.

[26] M. Litzkow, T. Tannenbaum, J. Basney, M. Livny, “Checkpoint and migration

of unix processes in the condor distributed processing system,” Technical Report

CS-TR-199701346, University of Wisconsin-Madison, Apr. 1997.

[27] V. Lakamraju, I. Koren, C. M. Krishna, “Low Overhead Fault Tolerant Network-

ing in Myrinet,” Proceedings of the Dependable Computing and Communications

Symposium, pp. 193-202, Jun. 2003.

[28] P. Lemarinier, A. Bouteiller, T. Herault, G. Krawezik, F. Cappello, “Improved

Message Logging versus Improved Coordinated Checkpointing for Fault Tolerant

MPI,” Proceedings IEEE International Conference on Cluster Computing, Sep.

2004.

[29] M. Morisio, N. L. Sunderhaft, “Commercial-Off-The-Shelf (COTS): A Survey,”

Data and Analysis Center for Software, Technical Report, Dec. 2000.

[30] S. S. Mukherjee, J. Emer, S. K. Reinhardt, “The Soft Error Problem: An Archi-

tectural Perspective,” Proceedings of the Eleventh International Symposium on

High-Performance Computer Architecture, pp. 243-247, Feb. 2005.


[31] B. Nicolescu, R. Velazco, “Detecting Soft Errors by a Purely Software Approach:

Method, Tools and Experimental Results,” Proceedings of Design, Automation

and Test in Europe Conference and Exhibition (DATE’03 Designers’ Forum),

vol. 2, pp. 57-62, Mar. 2003.

[32] N. Oh, P.P. Shirvani, E.J. McCluskey, “Error Detection by Duplicated Instruc-

tions in Super-scalar Processors,” IEEE Transactions on Reliability, vol. 51, No.

1, pp. 63-75, Mar. 2002.

[33] D. K. Pradhan, Fault-Tolerant Computer System Design, Prentice Hall PTR,

1996.

[34] L. L. Pullum, Software Fault Tolerance Techniques and Implementation, Artech

House, 2001.

[35] S. Rao, L. Alvisi, H. M. Vin, “Hybrid Message-Logging Protocols for Fast Re-

covery,” Digest of Fast Abstracts of The Twenty-Eighth International Symposium

on Fault-Tolerant Computing, pp. 41-42, Jun. 1998.

[36] S. Rao, L. Alvisi, H. M. Vin, “The Cost of Recovery in Message Logging Proto-

cols,” IEEE Transactions on Knowledge and Data Engineering, vol. 12, No. 2,

pp. 160-173, Mar./Apr. 2000.

[37] R. D. Schlichting, F. B. Schneider, “Fail-stop Processors: An Approach to De-

signing Fault-Tolerant Computing Systems,” ACM Transactions on Computer

Systems, vol. 1, No. 3, pp. 222-238, 1983.

[38] R. Strom, S. Yemini, “Optimistic Recovery in Distributed Systems,” ACM Trans-

actions on Computer Systems, vol. 3, No. 3, pp. 204-226, 1985.


[39] Q. O. Snell, A. R. Mikler, J. L. Gustafson, “NetPIPE: A Network Protocol

Independent Performance Evaluator,” Proceedings of The IASTED International

Conference on Intelligent Information Management and Systems, Jun. 1996.

[40] D. T. Stott, M.-C. Hsueh, G. L. Ries, R. K. Iyer, “Dependability Analysis of

a Commercial Highspeed Network,” Proceedings of the Twenty-Seventh Annual

International Symposium on Fault-Tolerant Computing, pp. 248-257, Jun. 1997.

[41] P. P. Shirvani, N. R. Saxena, E. J. McCluskey, “Software-Implemented EDAC

Protection Against SEUs,” IEEE Transactions on Reliability, vol. 49, No. 3, pp.

273-284, Sep. 2000.

[42] P. Shivam, P. Wyckoff, D. Panda, “EMP: Zero-copy OS-bypass NIC-driven Giga-

bit Ethernet Message Passing,” Proceedings of The IEEE/ACM Supercomputing

Conference (SC2001), pp. 57-57, Nov. 2001.

[43] A. Thakur, R. K. Iyer, “Analyze-NOW: an environment for collection and analysis

of failures in a network of workstations,” Proceedings of the Seventh International

Symposium on Software Reliability Engineering, pp. 14-23, Oct. 1996.

[44] S. S. Yau, F.-C. Chen, “An Approach to Concurrent Control Flow Checking,”

IEEE Transactions on Software Engineering, vol. 6, No. 2, pp. 126-137, Mar.

1980.

[45] J. F. Ziegler, et al., “IBM Experiments in Soft Fails in Computer Electronics

(1978-1994),” IBM Journal of Research and Development, vol. 40, No. 1, pp.

3-18, Jan. 1996.

[46] Y. Zhou, V. Lakamraju, I. Koren, C. M. Krishna, “Software-Based Adaptive

and Concurrent Self-Testing in Programmable Network Interfaces,” Proceedings

of the Twelfth International Conference on Parallel and Distributed Systems (IC-

PADS’06), vol. 1, pp. 525-532, 2006.


[47] Y. Zhou, V. Lakamraju, I. Koren, C. M. Krishna, “Software-Based Failure De-

tection and Recovery in Programmable Network Interfaces,” IEEE Transactions

on Parallel and Distributed Systems, vol. 18, No. 11, pp. 1539-1550, Nov. 2007.

[48] ATM Forum, ATM User-Network Interface Specification, Prentice Hall, 1998.

[49] The Gigabit Ethernet Alliance, http://www.gigabit-ethernet.com/.

[50] Intel Corporation, “Intel IXA SDK ACE Programming Framework - IXA SDK

2.01 Developer’s Guide,”

http://www.intel.com/design/network/products/npfamily/index.htm, Dec. 2001.

[51] Intel Corporation, “Intel IXP1200 Network Processor Family - Hardware Refer-

ence Manual,”

http://www.intel.com/design/network/products/npfamily/index.htm, Dec. 2001.

[52] Infiniband Trade Association, http://www.infinibandta.com/.

[53] Message Passing Interface Forum, http://www.mpi-forum.org/.

[54] Myricom Inc, http://www.myri.com/.

[55] “Single Event Effects Specification,”

http://radhome.gsfc.nasa.gov/radhome/papers/seespec.htm.

[56] The QsNet High Performance Interconnect, http://www.quadrics.com/.

[57] “Remote Exploration and Experimentation (REE) Project,” http://www-

ree.jpl.nasa.gov/.

[58] RadiSys Corporation, “ENP2505 Hardware Reference,” Nov. 2002.

[59] The Human Impacts of Solar Storms and Space Weather.

http://www.solarstorms.org/Scomputers.html.
