Integrating Fault-Tolerance Techniques in Grid Applications

A Dissertation
Presented to
the Faculty of the School of Engineering and Applied Science
at the
University of Virginia

In Partial Fulfillment
of the Requirements for the Degree
Doctor of Philosophy (Computer Science)

by
Anh Nguyen-Tuong
August 2000
© Copyright by
Anh Nguyen-Tuong
August 2000
All Rights Reserved
Abstract
The contribution of this thesis is the development of a framework for simplifying the
construction of grid computational applications. The framework provides a generic
extension mechanism for incorporating functionality into applications and consists of two
models: (1) the reflective graph and event model, and (2) the exoevent notification model.
These models provide a platform for extending user applications with additional
capabilities via composition. While the models are generic and can be used for a variety of
purposes, including security, resource accounting, debugging, and application monitoring
[VILE97, FERR99, LEGI99, MORG99], we apply the models in this dissertation towards the
integration of fault-tolerance techniques.
Using the framework, fault-tolerance experts can encapsulate algorithms using the two
reflective models developed in this dissertation. Developers incorporate these algorithms
into their tools and augment the set of services provided to application programmers.
Application programmers then use these augmented tools to increase the likelihood that
their programs will complete successfully.
We claim that the framework enables the easy integration of fault-tolerance techniques
into object-based grid applications. To support this claim, we have mapped onto our
models five different fault-tolerance algorithms from the literature: 2PCDC and SPMD
checkpointing, passive and stateless replication, and pessimistic method logging. We
incorporated these algorithms into three common grid programming tools: Message
Passing Interface (MPI), Mentat, and Stub Generator (SG). MPI is the de facto standard
for message passing; Mentat is a C++-based parallel programming environment; and SG
is a popular tool for writing client/server applications.
We measured the ease with which techniques can be integrated into applications based
on the number of additional lines of code that a programmer would have to write. In the
best case, programmers needed to add three lines of code. In the worst case, programmers
had to write functions to save and restore the local state of their objects. However, such
functions are simple to write and exploit programmers’ knowledge of their applications.
Acknowledgements
To my ancestors, who have trekked down this path, and cleared a road for others to follow, three centuries is not that long after all
To that turtle in Hanoi, forever gazing at the pond, the smell of incense on a hot summer day
To the committee, for helping me to ascertain, the inside from the outside, the lines delicately drawn
To John Knight, for ensuring a smooth landing
To Andrew, my advisor and mentor, for showing me the difference between a millisecond and a microsecond, and for taking me along on his adventures
To Karine, my eternal accomplice, whose support and love, are the real foundation of this research
To my parents, whose journey I have yet to fully appreciate, cam on nhieu
To my sister, Vi, the dancer, the musician, the pharmacist, the photographer, who never ceases to amaze me, may she appreciate her roots on her voyage home
To Madgy, Bootsy, Noushka, Kona, rain or shine, eyes always sparkling, heart purring and tail wagging
Special thanks to Nuts, whose wit is as sharp as his intellect, for all his insights, technical, culinary and otherwise
And to all my friends, Chenxi, Dave, John, Karp, Glenn, Matt, Mike, Paco, Rashmi, the Dinner Gang, who have made this trip so enjoyable
Table of Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
Chapter 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Current support for fault tolerance in grids . . . . . . . . . . . . . . . . . . . . 4
1.2 Properties of the framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4.1 Grid models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4.2 Reflection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4.3 Legion grid environment . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.5 Framework foundation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.5.1 Framework summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.6 Constraints and assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.7 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Chapter 2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1 Computational grids . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.1 PVM and MPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.1.1.1 DOME . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1.1.2 CVMULUS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1.1.3 Other extensions to PVM and MPI . . . . . . . . . . 20
2.1.2 Isis, Horus and Ensemble . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.1.3 Linda, Pirhana and JavaSpaces . . . . . . . . . . . . . . . . . . . . . . 22
2.2 Reflection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3 Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.1 Local events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.1.1 Protocol stacks . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.1.2 Graphical user interface . . . . . . . . . . . . . . . . . . . 27
2.3.1.3 JavaBeans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3.2 Distributed events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4 Aspect-oriented programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.5 Integrating fault tolerance in distributed systems . . . . . . . . . . . . . . . 30
2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Chapter 3 Reflective Graph and Event Model . . . . . . . . . . . . . . . . . . . . 33
3.1 Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.1.1 Graph API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2 Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2.1 Event API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3 Overhead for graphs and events . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4 Structure of an object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.4.1 Overview of a protocol stack . . . . . . . . . . . . . . . . . . . . . . . 45
3.4.2 Example of incorporating new functionality . . . . . . . . . . . 47
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Chapter 4 Exoevent Notification Model . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.1 Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.1.1 Registering interest in an exoevent . . . . . . . . . . . . . . . . . . . 52
4.1.2 Object scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.1.3 Method scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.2 Policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.2.1 The notify-root policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2.2 The notify-client policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.2.3 The notify-third-party policy . . . . . . . . . . . . . . . . . . . . . . . 57
4.2.4 The notify-hybrid policy . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3 Application programmer interface . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.4 Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.5 Example exoevents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.6 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.6.1 Failure detection – push model . . . . . . . . . . . . . . . . . . . . . . 64
4.6.2 Failure detection – pull model . . . . . . . . . . . . . . . . . . . . . . 65
4.6.3 Failure detection – service model . . . . . . . . . . . . . . . . . . . . 66
4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Chapter 5 Mappings of Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.1 Checkpointing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.1.1 SPMD checkpointing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.1.1.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.1.1.2 Mapping SPMD checkpointing . . . . . . . . . . . . . 77
5.1.1.3 Summary of SPMD checkpointing . . . . . . . . . . . 80
5.1.2 2-phase commit distributed checkpointing . . . . . . . . . . . . . 80
5.1.2.1 Checkpointing . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.1.2.2 Recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.1.2.3 Mapping 2-phase commit distributed checkpointing . . 83
5.1.2.4 Summary of 2PCDC algorithm . . . . . . . . . . . . . 86
5.2 Logging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.2.1 Pessimistic message logging . . . . . . . . . . . . . . . . . . . . . . . . 89
5.2.2 Mapping pessimistic message logging . . . . . . . . . . . . . . . . 91
5.2.3 Optimization: pessimistic method logging . . . . . . . . . . . . . 94
5.2.4 Legion system-level support . . . . . . . . . . . . . . . . . . . . . . . . 95
5.2.5 Summary of pessimistic logging . . . . . . . . . . . . . . . . . . . . . 96
5.3 Replication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.3.1 Passive replication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.3.1.1 Mapping passive replication . . . . . . . . . . . . . . . 100
5.3.1.2 Legion system-level support . . . . . . . . . . . . . . . 101
5.3.1.3 Summary of passive replication . . . . . . . . . . . . 102
5.3.2 Stateless replication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.3.2.1 Mapping stateless replication . . . . . . . . . . . . . . 105
5.3.2.2 Duplicate method suppression . . . . . . . . . . . . . 108
5.3.2.3 Summary of stateless replication . . . . . . . . . . . 108
5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Chapter 6 Integration into Programming Tools . . . . . . . . . . . . . . . . . . 111
6.1 MPI (SPMD and 2PCDC Checkpointing) . . . . . . . . . . . . . . . . . . . 112
6.1.1 Legion MPI (LMPI) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.1.2 Legion MPI-FT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
6.1.3 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
6.1.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.2 Stub generator (passive replication and pessimistic method logging) . . 121
6.2.1 Modifications to the stub generator . . . . . . . . . . . . . . . . . 122
6.2.2 Integration with pessimistic method logging . . . . . . . . . . 123
6.2.3 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
6.2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
6.2.5 Integration with passive replication . . . . . . . . . . . . . . . . . 127
6.2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
6.3 MPL – Stateless replication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
6.3.1 Stateless replication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6.3.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
6.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
Chapter 7 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
7.1 Stub Generator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
7.1.1 RPC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
7.1.2 Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
7.2 MPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
7.2.1 RPC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
7.2.2 BT-MED . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
7.3 Mentat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
7.3.1 RPC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
7.3.2 Complib . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
7.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
Chapter 8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
8.1 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
8.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
List of Figures
Figure 1: Grid layered implementation models (adapted from [FOST99], pg. 30) . . . 7
Figure 2: Code fragment and RGE graph. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Figure 3: Example use of the graph API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Figure 4: Graph interface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Figure 5: Example use of events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Figure 6: Event interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Figure 7: Structure of an object: sample protocol stack. . . . . . . . . . . . . . . . . . . . . . 47
Figure 8: Adding a handler for logging methods (pseudo-code) . . . . . . . . . . . . . . . 48
Figure 9: The notify-client policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Figure 10: Propagating exoevents to a catcher object . . . . . . . . . . . . . . . . . . . . . . . . 58
Figure 11: Example propagation of exoevents in the notify-hybrid policy . . . . . . . . 59
Figure 12: API for exoevents. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Figure 13: Failure detection using the push model . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Figure 14: Failure detection using a pull model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Figure 15: Generic failure detection service . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Figure 16: Structure of a fault-tolerant application . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Figure 17: Lost and orphan messages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Figure 18: Insertion of checkpoint in SPMD code. . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Figure 19: Recovery example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Figure 20: Interface for checkpoint server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Figure 21: Interface for application manager . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Figure 22: Raising the “CheckpointTaken” exoevent . . . . . . . . . . . . . . . . . . . . . . . . 78
Figure 23: Interface for participants. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Figure 24: Interface for coordinator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Figure 25: 2PCDC code. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Figure 26: Interface for participants. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Figure 27: Pessimistic message logging (PML). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Figure 28: Interface for pessimistic message logging . . . . . . . . . . . . . . . . . . . . . . . . 92
Figure 29: Handlers for pessimistic message logging . . . . . . . . . . . . . . . . . . . . . . . . 93
Figure 30: Handler for intercepting outgoing communication. . . . . . . . . . . . . . . . . . 94
Figure 31: Pessimistic method logging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Figure 32: Passive replication example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Figure 33: Passive replication interface (primary) . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Figure 34: Handlers for passive replication (primary) . . . . . . . . . . . . . . . . . . . . . . . 101
Figure 35: Server lookup with primary replication . . . . . . . . . . . . . . . . . . . . . . . . . 102
Figure 36: Stateless replication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Figure 37: Interface for proxy object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Figure 38: Sending a method to a replica. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Figure 39: Simple MPI program (myprogram) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Figure 40: Legion MPI architecture augmented with FT modules. . . . . . . . . . . . . . 116
Figure 41: Example of MPI application with checkpointing. . . . . . . . . . . . . . . . . . 119
Figure 42: Example of saving and restoring user state . . . . . . . . . . . . . . . . . . . . . . 120
Figure 43: Creating objects using the stub generator . . . . . . . . . . . . . . . . . . . . . . . . 122
Figure 44: Specification of READONLY methods . . . . . . . . . . . . . . . . . . . . . . . . . 123
Figure 45: Modified client-side stubs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
Figure 46: Interface and code for myApp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
Figure 47: Example of MPL application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
Figure 48: Declaring a Mentat class as stateless . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
Figure 49: Specifying parameters for the stateless replication policy . . . . . . . . . . . 131
Figure 50: Interface for context object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
Figure 51: Context application structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
Figure 52: BT-MED application structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
Figure 53: Complib application structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
Figure 54: Complib main loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
List of Tables
Table 1: Overhead of graphs and events. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Table 2: Sample set of events for building protocol stack of an object . . . . . . . . . 45
Table 3: Example of typical exoevent. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Table 4: Exoevent interest for notify-root policy . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Table 5: Exoevent interest for notify-client policy . . . . . . . . . . . . . . . . . . . . . . . . . 57
Table 6: Exoevent interest for notify-client policy . . . . . . . . . . . . . . . . . . . . . . . . . 58
Table 7: Exoevent interest for notify-hybrid policy for object AppA . . . . . . . . . . . 59
Table 8: Exoevent interest for notify-hybrid policy for object catcher . . . . . . . . . 60
Table 9: Exoevent interest for notify-hybrid policy for object B . . . . . . . . . . . . . . 60
Table 10: Overhead in creating and raising exoevents . . . . . . . . . . . . . . . . . . . . . . . 63
Table 11: Sample exoevents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Table 12: “I am Alive” exoevent raised by application objects . . . . . . . . . . . . . 64
Table 13: Exoevent raised on object creation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Table 14: Exoevent raised by failure detector. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Table 15: Data structures for FT modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Table 16: Summary SPMD checkpointing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Table 17: 2PCDC algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Table 18: Recovery in 2PCDC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Table 19: Summary 2PCDC algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Table 20: Summary of pessimistic logging algorithm . . . . . . . . . . . . . . . . . . . . . . . 96
Table 21: Summary of the passive replication algorithm . . . . . . . . . . . . . . . . . . . . 102
Table 22: “Object:MethodDone” notification by replica . . . . . . . . . . . . . . . . . . . . 106
Table 23: Summary of the passive replication algorithm . . . . . . . . . . . . . . . . . . . . 108
Table 24: Summary of algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Table 25: Sample MPI functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
Table 26: Functions to support checkpoint/restart . . . . . . . . . . . . . . . . . . . . . . . . . 116
Table 27: Options for legion_mpi_run . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
Table 28: Summary of work required for integration of checkpointing algorithms . . 120
Table 29: Parameters for legion_set_ft . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
Table 30: Summary of work required for integration of PML . . . . . . . . . . . . . . . . 126
Table 31: Parameters for legion_set_ft . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Table 32: Summary of work required for integration of passive replication . . . . . 128
Table 33: Summary of work required for integration of stateless replication . . . . 132
Table 34: Stub generator – RPC performance (n = 100, α = 0.05) . . . . . . . . . . 136
Table 35: Context performance (n = 100, α = 0.05) . . . . . . . . . . . . . . . . . . . . . 139
Table 36: Context performance with one induced failure (n = 5, α = 0.05) . . . 140
Table 37: Send and receive performance (n = 20, α = 0.05) . . . . . . . . . . . . . . 142
Table 38: BT-MED performance (n = 20, α = 0.05) . . . . . . . . . . . . . . . . . . . . 143
Table 39: Performance with one induced failure (n = 10, α = 0.05) . . . . . . . . 145
Table 40: RPC performance (1 worker, n = 100, α = 0.05) . . . . . . . . . . . . . . . 146
Table 41: Complib performance (n = 20, α = 0.05) . . . . . . . . . . . . . . . . . . . . . 149
Table 42: Complib performance with failure induced (n = 10, α = 0.05) . . . . 149
Table 43: Application summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
Table 44: Framework overhead based on RPC application . . . . . . . . . . . . . . . . . . 151
in-fra-struc-ture \'in-fre-,strek-cher\ n (1927)
The basic facilities, services, and installations needed for the functioning of a community
or society, such as transportation and communications systems, water and power lines, and public institutions including schools, post offices, and prisons.
— American Heritage Dictionary
Chapter 1
Introduction
Throughout history, the development of infrastructures has catalyzed and shaped the
evolution of human progress. The construction of Roman roads, the telegraph, the
telephone, the modern banking system, the railroad, the interstate highway system, the
electrical power grids, and the Internet, are all successful infrastructures that have
revolutionized how people communicate and interact. At the dawn of the new millennium,
we are witnessing the birth of what promises to be the next revolutionary infrastructure.
Funded in the United States by several governmental agencies, including the National
Science Foundation (NSF), the Defense Advanced Research Project Agency (DARPA),
the Department of Energy (DOE), and the National Aeronautics and Space Administration
(NASA), this new infrastructure is often referred to as a metasystem or computational grid
[GRIM97A, SMAR97, GRIM98, FOST99, LEIN99].
A computational grid is a specialized instance of a distributed system [MULL93,
TANE94] with the following characteristics: compute and data resources are
geographically distributed; they are under the control of different administrative domains
with different security and accounting policies; and the hardware resource base is
heterogeneous and consists of PCs, workstations and supercomputers from different
manufacturers. The ability to develop applications over this environment is sometimes
referred to as the wide-area computing problem [GRIM99].
Computational grids present a complex environment in which to develop applications.
Writing a grid application is at least as difficult as writing an application for traditional
distributed systems: since both are fundamentally distributed-memory systems,
programmers must deal with issues of application distribution, communication, and
synchronization. Furthermore, grids present additional challenges as programmers may be
required to deal with issues such as security, disjoint file systems, fault tolerance and
placement, to name only a few [GRIM98, FOST99, GRIM99]. Without additional
higher-level abstractions, all but the best programmers will be overwhelmed by the complexity of
the environment.
The contribution of this work is the development of a framework for simplifying the
construction of grid applications. The framework provides a generic extension mechanism
for incorporating functionality into applications and consists of two models: (1) the
reflective graph and event model, and (2) the exoevent notification model. These models
provide a platform for extending user applications with additional capabilities via
composition. While the models are generic and can be used for a variety of purposes,
including security, resource accounting, debugging, and application monitoring [VILE97,
FERR99, LEGI99, MORG99], we apply the models in this dissertation towards the
integration of fault-tolerance techniques. Support for the development of fault-tolerant
applications has been identified as one of the major technical challenges to address for the
successful deployment of computational grids [GRIM98, FOST99, LEIN99].
Consider application reliability in a grid. As applications scale to take advantage of a
grid’s vast available resources, the probability of failure is no longer negligible and must
be taken into account. For example, consider an application decomposed into 100 objects,
with each object requiring one week of processing time and placed on its own workstation.
Assuming that each workstation has an exponentially distributed failure mode with a
mean-time-to-failure of 120 days, the mean-time-to-failure of the entire application would
be only 1.2 days; thus, the application would rarely finish!
Using the framework, fault-tolerance experts can encapsulate algorithms using the two
reflective models developed in this dissertation. Developers incorporate these algorithms
into their tools and augment the set of services provided to application programmers.
Application programmers then use these augmented tools to increase the likelihood that
their programs will complete successfully.
We claim that the framework enables the easy integration of fault-tolerance techniques
into object-based grid applications. To support this claim, we have mapped onto our
models five different fault-tolerance algorithms from the literature: 2PCDC and SPMD
checkpointing, passive and stateless replication, and pessimistic method logging. We
chose these algorithms to illustrate the applicability of our framework to a range of fault-tolerance
techniques. Furthermore, we selected these algorithms because we believe that
they are likely to be used in grid applications. We incorporated these algorithms into three
common grid programming tools: Message Passing Interface (MPI), Mentat, and Stub
Generator (SG). MPI is the de facto standard for message passing; Mentat is a C++-based
parallel programming environment; and SG is a popular tool for writing client/server
applications.
We measured the ease with which techniques can be integrated into applications based
on the number of additional lines of code that a programmer would have to write. In the
best case, programmers needed to add three lines of code. In the worst case, programmers
had to write functions to save and restore the local state of their objects. However, such
functions are simple to write and exploit programmers’ knowledge of their applications.
Furthermore, tools to automate save and restore state functions have already been
demonstrated in the literature [BEGU97, FERR97, FABR98].
To the best of our knowledge, we are the first to advocate and use a reflective
architecture to structure applications in computational grids. Moreover, we are the first to
demonstrate the integration of a wide range of fault-tolerance techniques into grid
applications using a single framework.
1.1 Current support for fault tolerance in grids
Until recently, the foremost priority for grid developers has been to develop working
prototypes and to show that applications can be written over a grid environment
[GRIM97B, BRUN98, FOST98]. To date, there has been limited support for application-level
fault tolerance in computational grids. Support has consisted mainly of failure detection
services [STEL98, GROP99] or fault-tolerance capabilities in specialized grid toolkits
[NGUY96, CASA97]. Neither solution is satisfactory in the long run. The former places the
burden of incorporating fault-tolerance techniques in the hands of application
programmers, while the latter only works for specialized applications. Even in cases
where fault-tolerance techniques have been integrated into programming tools, these
solutions have generally been point solutions, i.e., tool developers have started from
scratch in implementing their solution and have not shared, nor reused, any fault-tolerance
code.
As these tools are ported to grid environments, or as new tools are developed for grid
environments, the continued development of fault-tolerant tools as point solutions
represents a wasteful expenditure of effort. We believe a better approach is to provide a
structural framework that supports composition: fault-tolerance experts write algorithms
and encapsulate them into reusable code artifacts, or modules, and tool developers then
integrate these modules into their environments.
1.2 Properties of the framework
Our long-term goal is to simplify the construction of fault-tolerant grid applications.
We believe that a good solution for achieving this goal should exhibit the following
properties:
• P1. Separation of concerns and composition. Designing and writing fault-
tolerance code are complex and error-prone tasks and should be done by experts,
not application programmers or tool developers. Thus, fault-tolerance experts
should be able to encapsulate algorithms into reusable and composable code
artifacts [NGUY99]. Furthermore, the incorporation of fault-tolerance techniques
should not interfere with other non-functional concerns such as security or
accounting.
• P2. Localized cost. By localized cost, we mean that the use of resources or services
to implement fault-tolerance techniques should not be charged to applications that
do not require those resources or services—users should pay only for the level of
services that they need. In general, localized cost is an important attribute for any
grid service [GRIM97A].
• P3. Working proof of concept. We should be able to demonstrate the integration of
fault-tolerance techniques in running applications on a working grid prototype and
using multiple programming tools. Further, applications with fault-tolerance
techniques integrated should be able to tolerate more failures than applications that
do not use any fault-tolerance techniques.
1.3 Evaluation
Based on our goal of simplifying the construction of fault-tolerant applications and the
properties listed in §1.2, we have derived several criteria by which to evaluate our
framework (next to each criterion, we note in parentheses its related property):
• Multiple programming tools. A successful solution should promote and enable the
incorporation of fault-tolerance techniques into multiple programming tools,
including legacy tools such as MPI or PVM. Legacy tools are already familiar to
programmers and should ease the transition from traditional distributed systems to
grid environments. (P1, P3)
• Breadth of fault-tolerance techniques. A successful solution should support a wide
range of fault-tolerance techniques so that application programmers may use the
one that is most appropriate for their needs. (P1, P2)
• Ease of use. Incorporating fault-tolerance techniques should require only trivial
or small modifications to applications. (P1, P3)
• Localized cost. Application programmers should select and pay only for the level
of fault tolerance that they require. A good framework should not impose a
system-wide solution. Instead, the cost of using fault-tolerance techniques should
be localized to the applications that use these techniques. (P2)
• Overhead. Is the overhead of using fault-tolerance techniques due to the algorithm
or to the framework itself? In deciding whether to incorporate a fault-tolerance
technique, users should only worry about the algorithmic overhead, i.e., the cost of
the algorithm itself. (P2, P3)
1.4 Background
1.4.1 Grid models
Before describing our framework, we present the implementation models of
computational grids. As shown in Figure 1, a grid consists of services that run on top of
native operating systems. These services provide functionality such as authentication,
failure detection, object and process management, and remote input/output, and are
accessed via grid libraries. Typically, an application programmer will not access these
libraries directly, but will use a programming tool such as MPI [GROP99],
NetSolve [CASA97], Ninf [SATO97] or MPL [GRIM97B], which in turn will call the
underlying grid libraries. The advantage of this layered model is that application
programmers can use familiar programming tools and interfaces and are shielded from the
complexity of accessing grid services.
FIGURE 1: Grid layered implementation models (adapted from [FOST99], pg. 30). The figure shows five layers, top to bottom: Applications; Programming Tools (MPI, PVM, NetSolve, DOME, MPL, Fortran); Grid Libraries (Globus API, Legion API); Grid Services (security, object/process management, scheduling, failure detection, storage); and Native Operating Systems (Windows NT, Unix).
There are currently three approaches to building grids: the commodity approach, the
service approach, and the integrated architecture approach [FOST99]. In the commodity
approach, existing commodity technologies, e.g. HTTP, CORBA, COM, Java, serve as the
basic building blocks of the grid [ALEX96, BALD96, FOX96, CHRI97]. The primary
advantage of this approach is the use of industry-standard protocols, which allows
programmers to ride the technology curve as improvements are made to these protocols.
Furthermore, standard protocols stand a better chance of being adopted by a large
community of developers. The problem with this approach is that the current set of
protocols may not be adequate to meet the requirements of computational grids. In the
service approach, as exemplified by the Globus project, a set of basic services such as
security, communication, and process management are provided and exported to
developers in the form of a toolkit [FOST97]. In the integrated architecture approach,
resources are accessed through a uniform model of abstraction [GRIM98]. As
we describe in §1.4.3, our framework targets the integrated approach.
1.4.2 Reflection
Our framework relies on the observation that although fault-tolerance techniques are
diverse by nature, their implementation is not. Indeed, the implementation of the major
families of fault-tolerance techniques relies on common basic primitives such as:
• intercepting the message stream
• piggybacking information on the message stream
• acting upon the information contained in the message stream
• saving and restoring state
• detecting failure
• exchanging protocol information between participants of an algorithm
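To make this concrete, here is one way such primitives might be exposed as first-class hooks. The `MessageLayer` class and its `on_send` list are our own illustrative names, not an interface from the dissertation: the point is that a fault-tolerance module can intercept the message stream and piggyback protocol information without touching application code.

```python
# A sketch of message-stream interception and piggybacking as first-class
# handlers (invented names, for illustration only).
class MessageLayer:
    def __init__(self):
        self.on_send = []                 # interception points

    def send(self, msg):
        for hook in self.on_send:
            msg = hook(msg)               # hooks may rewrite the message
        return msg                        # hand off to the real transport here

layer = MessageLayer()

# Example: a message-logging protocol piggybacks a sequence number.
state = {"seq": 0}
def piggyback_seq(msg):
    state["seq"] += 1
    return {**msg, "seq": state["seq"]}

layer.on_send.append(piggyback_seq)
print(layer.send({"body": "compute"}))    # {'body': 'compute', 'seq': 1}
```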
Thus, by providing an execution model whereby these primitives can be expressed and
manipulated as first-class entities, it is possible to achieve our goals of developing fault-tolerance capabilities independently and integrating them into programming tools.
We use reflection as the architectural principle behind our execution models. Smith
introduced the concept of reflection as a computational process that can reason about itself
and manipulate representations of its own internal structure [SMIT82]. Two properties
characterize reflective systems: introspection and causal connection.* Introspection
allows a computational process to have access to its own internal structures. Causal
connection enables the process to modify its behavior directly by modifying its internal
data structures—there is a cause-and-effect relationship between changing the values of
the data structures and the behavior of the process. The internal data structures are said to
reside at the metalevel while the computation itself resides at the baselevel. The metalevel
controls the behavior at the baselevel. In our case, the fault-tolerance capabilities are
expressed at the metalevel and control the underlying baselevel computation.
1.4.3 Legion grid environment
Our work targets the Legion environment for multiple reasons: (1) Legion is object-
based, (2) it already uses graphs for inter-object communication, (3) it is an existing grid
prototype, and (4), multiple programming tools are available. None of the other
environments considered, such as Globus and CORBA-based systems, possess all these
attributes. However, our framework is also relevant to these other environments. For
example, it could be used to structure CORBA applications. Recent research has been
* Note that the term causal is used differently in the distributed systems literature, where it refers to the “happened-before” relationship as defined by Lamport [LAMP78].
oriented towards extending the functionality of CORBA systems through a reflective
architecture [BLAI98, HAYT98, LEDO99]. Our work suggests that structuring CORBA-
reflective architectures using an event-based and/or graph-based paradigm is an idea
worth pursuing.
Legion treats all resources in a computational grid as objects that communicate via
asynchronous method invocations. Objects are address-space-disjoint, i.e., they are
logically independent collections of data and associated methods. Objects contain a thread
of control, and are named entities identified by a Legion Object IDentifier (LOID). Objects
are persistent and can be in one of two states: active or inert. Active objects contain a
thread of control and are ready to service method calls. They are implemented with
running processes over a message passing layer. Inert objects exist as passive object state
representations on persistent storage. Legion moves objects between active and inert states
to use resources efficiently, to support object mobility, and to enable failure resilience.
Legion objects are under the control of a Class Manager object that is responsible for
the management of its instances. A Class Manager defines policies for its instances and
regulates how an object is created or deleted, and when it should be migrated, activated, or
deactivated. By defining new Class Managers, grid developers can change the
management policies of object instances. Class Managers themselves are managed by
higher-order class managers, forming a rooted hierarchy.
Legion provides several default objects to manage its resource base. The two basic
objects are Host Objects and Vault Objects, which correspond to processor and storage
resources in a traditional operating system. Host objects are responsible for running an
active object while vault objects are used to store inert objects. Legion allows
customization of all its objects. Thus, a host object could represent compute resources that
exhibit varying degrees of reliability and performance, e.g., a personal computer, a
workstation, a server, a cluster, or a queue-controlled supercomputer. Similarly a vault
object could represent a local disk, a RAID disk, or tertiary storage. A full description of
the Legion object model can be found in the literature [GRIM98].
1.5 Framework foundation
The key contribution of this work is the development of two reflective models that are
the foundations of our framework, the reflective graph and event model, and the exoevent
notification model. Together these models provide flexible mechanisms for structuring
applications and specifying the flow of information between objects that comprise an
application. Furthermore, the models enable information propagation policies to be bound
to applications at run-time. The flexibility of the models and the ability to defer the
binding of policy decisions are the differentiating features of our framework.
The reflective graph and event model (RGE) reflects our target environment: (1) an
environment in which objects are implemented by running processes that communicate
via message passing, and (2) an object-based environment in which an application consists
of a set of cooperating objects. The RGE model employs graphs and events to expose the
structure of objects to fault-tolerance developers. It specifies both their external aspect
(interactions between objects) and their internal aspect (interactions inside objects). Graphs
and events are the building blocks with which fault-tolerance implementors can
incorporate functionality inside objects and exchange fault-tolerance protocol information
between objects. Graphs represent interactions between objects; a graph node is either a
member function call on an object or another graph; arcs model data or control
dependencies; and each input to a node corresponds to a formal parameter of the member
function. Events specify interactions inside objects and are used to structure their protocol
stack.
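The graph abstraction just described can be sketched as follows; this is our own toy rendering, not Legion’s actual data structures. Each node names a method call on an object, and arcs feed one node’s result into a formal parameter of a downstream node.

```python
from collections import namedtuple

Node = namedtuple("Node", ["obj", "method", "deps"])   # deps: upstream node ids

def run_graph(graph, order):
    """Execute nodes in a topological order of the dependency arcs."""
    results = {}
    for node_id in order:
        node = graph[node_id]
        args = [results[d] for d in node.deps]          # values carried by arcs
        results[node_id] = getattr(node.obj, node.method)(*args)
    return results

class Source:
    def produce(self):
        return 21

class Doubler:
    def double(self, x):
        return 2 * x

graph = {"a": Node(Source(), "produce", []),
         "b": Node(Doubler(), "double", ["a"])}         # arc: a -> b
print(run_graph(graph, ["a", "b"])["b"])                # 42
```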
Our second model, the exoevent notification model, is a distributed event model.
Similarly to the event model defined by CORBA [BENN95] and the Java Distributed Event
Specification [SUN99A], the exoevent notification model provides a flexible mechanism
for objects to communicate. However, unlike the CORBA and Java models, the salient and
distinguishing features of the exoevent notification model are that it unifies the concept of
exceptions and events—an exception is a special case of an event—and it allows the
specification of event propagation policies to be set on a per-application, per-object or per-
method basis, at run-time. In our model, exoevents denote object state transitions and are
associated with program graphs. Raising an exoevent results in the execution of method
invocations on remote objects through the execution of associated program graphs—
hence the term exoevent. The ability to specify handlers as program graphs allows
developers to specify more complex policies than with a traditional event model.
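The exoevent mechanism might be sketched as follows. The names `bind` and `raise_exoevent` are invented for illustration, and handler graphs are reduced to plain callables; in the real model, raising an exoevent would execute an associated program graph of remote method invocations.

```python
# Hedged sketch: raising a named event executes whatever handler graphs were
# bound to it at run-time; an exception is simply one kind of event.
handlers = {}          # event name -> handler graphs (plain callables here)

def bind(event, handler_graph):
    handlers.setdefault(event, []).append(handler_graph)

def raise_exoevent(event, payload):
    for graph in handlers.get(event, []):   # policy chosen per application
        graph(payload)                      # stands in for remote method calls

actions = []
# Bind a recovery policy at "run-time"; a different application could bind none.
bind("ObjectFailed", lambda obj: actions.append(f"restart {obj}"))
raise_exoevent("ObjectFailed", "worker-7")
print(actions)    # ['restart worker-7']
```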
The use of reflection to incorporate non-functional requirements has been proposed by
Stroud [STRO96]. Its use for integrating fault-tolerance capabilities into systems has been
successfully employed in many object-based systems, including FRIENDS [FABR98] and
GARF [GUER97]. Reflection has also been used as the basis for extending object
functionality in CORBA-based systems (OpenORB [BLAI98], FlexiNet [HAYT98],
OpenCorba [LEDO99]). The novelty of this dissertation is to suggest the use of events as
the primary structuring mechanism for designing object request brokers, the use of generic
program graphs to describe distributed event propagation policy and bind policy at run-
time, and the use of reflection to specify inter- and intra-object communication as generic
and flexible means of extending grid applications with additional functionality. In
particular, we focus on using the models to extend applications with fault-tolerance
capabilities.
1.5.1 Framework summary
In order to enable the integration of fault-tolerance techniques with applications, our
framework requires that both fault-tolerance experts and tool developers target the
reflective graph and event model and the exoevent notification model. Note that the
framework does not make any assumptions about the failure model used by the underlying
system, or the failure assumptions made by a given fault-tolerance algorithm. The
framework is an integration framework only; the decision as to whether a given algorithm
is suitable for a given application is not part of the framework proper.
Our framework imposes a unified structure on the way grid libraries are organized.
Specifically, our framework requires that library components use an event paradigm for
intra-object communication. The advantages of events in terms of flexibility and
extensibility are well-known. Events have been used in such diverse areas as graphical
user interfaces [NYE92], protocol stacks [BHAT97, HAYD98], operating system kernels
[BERS95] and integrated systems [SULL96]. Using events for building the protocol stack
of an object provides natural hooks for inserting fault-tolerance capabilities. In fact, the
events required to build a protocol stack for objects are those that are needed for
incorporating fault-tolerance functionality.
For inter-object communications, our model provides a data-driven, graph-based
abstraction. Graphs have been used successfully in parallel and distributed systems
[BABA92, BEGU92, GRIM96A]. Graphs enable the expression of traditional client/server
interactions, such as CORBA, as well as more complex interactions, such as pipelined
flow.
1.6 Constraints and assumptions
The fault-tolerance algorithms discussed in this dissertation make use of three
common assumptions: fail-stop, availability of reliable storage, and reliable networks.
However, Legion only provides an approximation of these assumptions. Detecting a
crashed object is approximated using conservatively-set timeouts; reliable storage is
approximated with standard disks; and a high-level retry mechanism for sending
messages masks transient network partitions. Thus, it is possible for an
application using a given fault-tolerance technique to violate its failure assumptions. To
increase the likelihood that these assumptions are met, Legion could be configured to use
hosts and storage devices with higher reliability, e.g., hosts such as those provided by the
Compaq® NonStop™† or Stratus® architectures, storage such as RAID disks, and
possibly hosts configured with redundant network paths. However, we do not expect this
configuration to be common in grids in the near future. Thus, application developers
should be aware of the possibility of violating the failure assumptions—if the cost of
violating these assumptions is too high, e.g., as would be the case with safety-critical
applications, then these applications should not be used on Legion.‡ The framework
† Formerly known as Tandem®, acquired by Compaq Corporation.
‡ Note that this comment applies to any computational grid.
described here is an integration framework only, and does not make any guarantees as to
the suitability of using a given algorithm. However, to increase the likelihood that the
failure assumptions are met, we configured applications to run within a site [DOCT99].
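The approximations above can be sketched roughly as follows. This is illustrative only, not Legion’s actual mechanism, and the transport call is hypothetical: a conservatively-set timeout stands in for crash detection, and a bounded retry loop masks transient partitions before a fail-stop verdict is reached.

```python
import time

def call_with_retry(send, timeout_s=30.0, retries=3, backoff_s=1.0):
    for attempt in range(retries):
        try:
            return send(timeout=timeout_s)         # hypothetical transport call
        except TimeoutError:
            time.sleep(backoff_s * attempt)        # maybe a transient partition
    raise RuntimeError("peer presumed crashed")    # approximate fail-stop

# A fake transport that times out twice, then answers: the retries mask it.
attempts = {"n": 0}
def flaky_send(timeout):
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError
    return "reply"

print(call_with_retry(flaky_send, backoff_s=0.0))  # reply
```

If the peer never answers within the retry budget, the caller can only presume a crash, which is exactly why the fail-stop assumption may be violated by a slow-but-alive object.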
In this dissertation the algorithms we have mapped onto our framework are designed
to tolerate host failures. Computational grids use hardware resources owned by various
entities, including research labs, governmental agencies, and universities. At any moment,
it is thus not surprising to find that some hosts used by a grid system have crashed,
whether because someone rebooted a machine or tripped on a power cord, or because a
host is simply down for maintenance. While the crash failure of hosts represents an
important class of failures in grids, we note that it is not the only source of failures—
unreliable software or operator error could also result in the failure of applications
[GRAY85]. Furthermore, we do not concern ourselves with non-fault-masking techniques
such as reconfiguration and presentation of alternative services to cope with failures
[HOFM94, KNIG98, GART99]. We are only concerned with the integration of fault-masking
techniques in grid applications. Once a host fails, we assume that it does not recover.
Furthermore, we seek only to integrate fault-tolerance techniques into user applications
and do not address the case of fault-tolerance for system-level objects.** We assume that
Legion services are always available.
1.7 Outline
We have organized the rest of the dissertation as follows. In Chapter 2, we present an
overview of related work in the areas of computational grids, reflection, event-driven
** Legion system-level objects already tolerate transient host failures.
systems, aspect-oriented programming and integration of fault-tolerance techniques in
distributed systems. In Chapter 3, we provide an overview of our execution model, the
reflective graph and event model. In Chapter 4, we describe the development of a
distributed event notification model that is used as a flexible communication model to
exchange protocol information between objects. In Chapter 5, we illustrate mappings from
several well-known fault-tolerance techniques onto the reflective graph and event model
and the distributed event notification model. In Chapter 6, we present the integration of
several mappings described in Chapter 5 into several programming tools available in the
Legion grid. In Chapter 7, we tie the previous chapters together and provide a working
proof that our models have been successfully integrated into several tools and
applications. We also evaluate the performance of these applications. In Chapter 8, we
conclude by presenting lessons we learned and opportunities for future research.
There is only one nature – the division into science and engineering is a human imposition, not a natural one. Indeed, the division is a human failure; it reflects our limited capacity to comprehend the whole. — Bill Wulf
Chapter 2
Related Work
We present a broad overview of computational grids and potential grid tools to provide
context for our work (§2.1). We discuss reflective systems (§2.2) as our reflective graph
and event model is based on a reflective architecture. We discuss the event model and its
use in various settings to support extensibility and flexibility (§2.3). We consider aspect-oriented programming and its potential relationship with event-based extension
mechanisms (§2.4). Finally, we present several approaches to integrating fault-tolerance
techniques into distributed systems, including CORBA-based systems (§2.5).
2.1 Computational grids
Foster et al. have identified three approaches to building computational grids: the
commodity approach, the service approach, and the integrated architecture approach
[FOST99]. In the commodity approach, existing commodity technologies, e.g., HTTP,
CORBA, COM, Java, serve as the basic building blocks of the grid [ALEX96, BALD96,
FOX96, CHRI97]. In the service approach, as exemplified by the Globus project, a set of
basic services such as security, communication, and process management are provided and
exported to developers in the form of a toolkit [FOST97]. In the integrated architecture
approach, resources are accessed through a uniform model of abstraction [GRIM98]. For
example, Legion enables the development of grid applications by providing a uniform
object abstraction to encapsulate and represent grid resources, e.g., compute, data, and
people resources. A motivating factor for both the service and integrated architecture
approach is that the set of commodity services provided by current technology does not
suffice to meet the requirements of computational grids [FOST99].
We present several systems below and comment on the suitability of these systems for
developing grid applications.
2.1.1 PVM and MPI
PVM (Parallel Virtual Machine) and MPI (Message Passing Interface) are the two
best-known message passing environments in grid computing [GEIS94, GEIS97]. They
provide programmers with library support for writing applications with explicit message
send and receive operations. In addition to message passing, PVM and MPI provide the
illusion of an abstract virtual machine that supports the creation and deletion of processes
or tasks. As of this writing, MPI has eclipsed PVM to become the primary message
passing standard, and is supported by all major computer manufacturers.
Both Legion and Globus provide support for MPI [FOST99]. Legion also provides
support for PVM. We describe below several systems layered on top of PVM or MPI that
provide fault-tolerance capabilit ies. While these systems have not yet been ported to grid
prototypes, they are representative of the kind of systems that are likely to be incorporated
into grids. It is interesting to note that many of these systems are geared towards scientific
computing; they provide support for a style of application known as SPMD (Single
Program Multiple Data), in which identical processes each operate on a subdomain of the
application data. SPMD applications are often time-stepped, with periodic exchanges of
information at well-defined intervals.
2.1.1.1 DOME
DOME (Distributed Object Migration Environment) runs on top of PVM and
supports application-level fault-tolerance in heterogeneous networks of workstations
[BEGU97]. DOME defines a collection of data parallel objects such as arrays of integers or
floats that are automatically distributed over a network of workstations. DOME supports
the writing of SPMD applications in which a process is replicated on multiple nodes and
executes its computation over a different subset of the data. DOME provides support for
the checkpointing of SPMD applications. Similarly to the checkpointing techniques that
we use, DOME’s checkpoints support the recovery of applications on heterogeneous
architectures.
2.1.1.2 CVMULUS
CVMULUS is a library package for visualization and steering of fault-tolerant SPMD
applications for use on top of PVM [GEIS97]. In CVMULUS, programmers specify the
data decomposition of their applications. CVMULUS automatically uses this information
for checkpoint/recovery and is able to reconfigure applications even if the recovered
application uses fewer workers or tasks. Since CVMULUS is geared towards SPMD
applications, the consistency of application-wide checkpoints is easily maintained.
2.1.1.3 Other extensions to PVM and MPI
Fail-Safe PVM is an extension of PVM to provide application-transparent fault
tolerance based on checkpoint and recovery [LEON93]. While it achieves transparency,
Fail-Safe PVM requires modifications to the PVM daemons to monitor the flow of
messages between PVM tasks. Silva et al. provide a user-level library called PUL-RD to
support checkpointing and recovery of SPMD applications on top of MPI [SILV95].
Programmers are responsible for describing the data layout of their applications. Similarly
to CVMULUS, the PUL-RD library supports the recovery of applications with fewer
processes.
2.1.2 Isis, Horus and Ensemble
Isis, Horus and Ensemble are representative of systems that use a process group
abstraction to structure distributed applications [BIRM93, RENE96, HAYD98]. The central
tenet of such systems is that support for programming with distributed groups is the key to
writing reliable applications.
Process groups enable the realization of a virtually synchronous model of computation
wherein the notion of time is defined based on the ordering of messages [LAMP78].
Typically, a programmer uses various forms of multicast primitives for communication
with members of a group, e.g., causal multicast or totally ordered multicast. The receipt of
messages within a group may be ordered with respect to group membership changes,
thereby enabling programmers to write algorithms such that group members can logically
take some actions “at the same time” with respect to failures. Failures of processes are
treated as changes in the membership of a group. Only processes that are members of a
group are allowed to process messages. Thus, group membership, as seen in Isis, simulates
a fail-stop model in which processes fail by halting [SCHN83, SABE94].
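The failure-as-membership-change idea can be illustrated with a toy sketch (our own, not Isis code): once a view change removes a process, its messages are no longer delivered, so the survivors see it as having halted.

```python
# Toy illustration of a simulated fail-stop model via group membership.
view = {"id": 1, "members": {"p1", "p2", "p3"}}

def deliver(sender, msg, inbox):
    if sender in view["members"]:       # only current members get delivered
        inbox.append((sender, msg))

def view_change(failed):
    view["members"].discard(failed)     # failure == leaving the group
    view["id"] += 1

inbox = []
deliver("p2", "update", inbox)
view_change("p2")                       # p2 declared failed by the group
deliver("p2", "late-update", inbox)     # dropped: p2 is out of the view
print(inbox)                            # [('p2', 'update')]
```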
The process group model has often been criticized on the basis of the end-to-end
argument [SALT90]. Critics of the model argue that the ordering properties guaranteed by
group communication primitives are provided at too low a level of abstraction, and in
some cases, may be unnecessary to meet the specifications of an application [CHER93].
Proponents of the model argue that the services provided by the model are invaluable in
developing fault-tolerant distributed applications [RENE93, BIRM94, RENE94].
It is interesting to view the progression of systems developed at Cornell University,
from Isis to Horus, and then to Ensemble, as a response to the end-to-end argument. While
Isis was a monolithic system, both Horus and Ensemble allow developers to configure and
customize the protocol stacks of processes to meet the needs of applications. In Ensemble,
the protocol stack of processes can be configured at run-time using an event-driven
paradigm, unlike the protocol stack of Horus which has to be configured statically.
The process group model has found acceptance in several domain areas, including
finance, groupware applications, telecommunication, military systems, factory automation
and production control [BIRM93]. For more information on the model and its applications
to Internet applications, please see the recent book by Birman [BIRM96].
Our framework differs in that its focus is on integrating fault-tolerance techniques in
object-based systems whereas the focus of Isis, Horus and Ensemble, is in supporting the
process group abstraction. The two are not mutually exclusive; it is possible to layer a
reflective framework on top of ordered group communication primitives [FABR98].
For grid applications, it is too early to determine how much of a role the process group
model will play. However, the evolution from Isis to Ensemble points to a common design
goal of supporting flexibility and extensibility (§2.3).
2.1.3 Linda, Piranha and JavaSpaces
In Linda, processes in an application cooperate by communicating through an
associative shared memory abstraction called tuple space [CARR89]. A tuple in tuple space
names a data element that consists of a sequence of basic data types such as integers,
floats, characters and arrays. Linda defines four basic operations, out, in, rd and eval, to
access tuple space. Out deposits tuples in tuple space; in and rd search tuple space (in also
removes the matched tuple, while rd leaves it in place). A nice property of in and rd is that
they can specify a generic pattern with which to search tuple space. Finally, eval is used to
create a new process. The primary advantages of Linda are that its four operations are
simple to learn and that its shared memory abstraction is easy for programmers to use.
PLinda is an extension to Linda to provide fault-tolerance through
the checkpointing and recovery of tuple space and the use of a commit protocol to deposit
and read tuples from tuple space [JEON94]. Another fault-tolerant version of Linda is
Piranha [CARR95]. Piranha supports a style of computation known as master-worker
parallelism, in which a master process generates a set of tasks to be consumed by workers.
Piranha enables users to treat a collection of hosts as a computational resource base on
which to assign tasks. When a user reclaims a host, e.g., by pressing a key or clicking the
mouse, Piranha automatically reassigns the task to another host, thus ensuring that an
application eventually completes. The act of reclaiming a host can be treated as a failure
and is analogous to leaving a group in a system with group membership.
Linda and its derivatives are particularly well-suited to a master-worker style of
computation—a style that is prevalent in grid applications. We expect that, over time, a
Linda-like abstraction will be ported to computational grids. We note that Linda is
currently a commercial product supported by Scientific Computing Associates, Inc.,
under the trade name Paradise®.
The Linda tuple model heavily influenced the development of the Jini JavaSpaces™
Specification [SUN99A]. Similarly to Linda, JavaSpaces provide the abstraction of an
associative shared memory in which Java programs can deposit and retrieve information.
JavaSpaces improve upon the Linda model in that Java programs can be automatically
notified of changes in the JavaSpace through events [SUN99A]. Both the Linda tuple space
and JavaSpaces can be viewed as instances of a blackboard architecture in which
different components interact and coordinate actions based on state changes in a shared
repository [SHAW96].
2.2 Reflection
Smith introduced the concept of reflection: that of a computational process that can
reason about itself and manipulate representations of its own internal structure [SMIT82].
Two properties characterize reflective systems: introspection and causal connection.
Introspection enables a computational process to have access to its own internal structures.
Causal connection enables the computational process to modify its behavior directly by
modifying its internal data structures, i.e., there is a cause-and-effect relationship between
changing the values of the data structures and the behavior of the process. The internal
data structures are said to reside at the metalevel while the computation itself resides at the
baselevel; thus the metalevel controls the behavior of the baselevel.
Reflection provides a principled means of achieving open engineering, i.e., of
extending the functionality of a system in a disciplined manner [BLAI98]. A key attribute
of reflective systems is the separation of concerns between the metalevel and the
baselevel. For example, Fabre et al. incorporated replication techniques into objects using
the reflective programming language Open-C++ [FABR95]. The implementation of the
replication techniques was performed at the metalevel with little change to the underlying
baselevel application. The design and implementation of the replication techniques were
separated from the design and implementation of the actual application, thus allowing the
replication techniques to be composable with many applications. In general, reflective
architectures enable the composition of non-functional concerns with the underlying
computational process [STRO96].
Another advantage of reflective architectures is that they enable flexibility and
extensibility of functionality. Reflective architectures have been used in such diverse areas
as programming languages [MAES87, WATA88, KICZ91, AKSI98, TATS98, MOSS99,
WELC99], operating systems [YOKO92], real-time systems [SING97, STAN98, STAN99],
fault-tolerant real-time systems [BOND93], agent-based systems [CHAR96], dependable
systems [AGHA94], and distributed middleware systems, e.g., OpenORB [BLAI98],
FlexiNet [HAYT98], OpenCorba [LEDO99] and Legion [NGUY99].
A feature common to all reflective systems is that they answer two questions: What
internal structure or metalevel information (meta-information) is exposed to developers?
How does one access the metalevel? The answer to the first question is application-
dependent. For example, in real-time systems such as FERT or Spring [BOND93, STAN98]
the meta-information includes timing constraints of tasks, deadlines, and precedence
constraints. In a programming language such as CLOS, the meta-information includes
slots and methods [KICZ91]. In object-based distributed systems, meta-information can
include methods, arguments and replies [BLAI98, HAYT98, LEDO99, VILE97]. The answer
to the second question also varies. A popular method of programming the metalevel is
through an object-oriented paradigm in which a metalevel object defines and controls the
behavior of baselevel objects [MAES87, KICZ91]. Other means of accessing meta-
information include using compiler technology [FABR95, CHIB95, TATS98], configuration
files [MOSS99, WELC99], and events [NGUY98, PAWL98].
The reflective models developed in this dissertation reflect our target environment of a
computational grid. Incorporating fault-tolerance techniques in a distributed application—
a set of cooperating objects—requires manipulation of the internal as well as external
aspects of an object. Our models regulate both intra-object interactions, i.e., interactions
between modules inside an object, and inter-object interactions, i.e., interactions between
objects. This dual aspect of our models enables the integration of application-wide
algorithms such as checkpointing, in contrast to other reflective systems, whose focus has
been on integrating techniques such as replication in server objects [FABR95, GUER97,
BLAI98, HAYT98].
A further difference between our architecture and other reflective middleware
architectures is that we do not use a metaobject protocol to control the behavior of the
baselevel [AGHA94, FABR95, GUER97, FABR98, HAYT98, LEDO99]. Instead, we present a
graph-and-event-based interface accessible through simple C++ library calls. In contrast,
other reflective approaches such as OpenCorba [LEDO99] and Garf [GUER97] rely on the
Smalltalk programming language. We believe that presenting a C++ based interface
expands our potential community of developers.
2.3 Events
Events have been used in a variety of contexts [SHAW96]: in graphical user interfaces,
to build protocol stacks [BERS95, BHAT97, HAYD98, VILE97], in integrated systems
[SULL96], and as a generic mechanism for component interactions [BENN95]. We separate
our discussion of events into two sections: local events and distributed events. Local events
propagate within the same address space whereas distributed events propagate to a
different address space.
2.3.1 Local events
2.3.1.1 Protocol stacks
Many projects, such as SPIN [BERS95], Coyote [BHAT97] and Ensemble [HAYD98],
use an event-based paradigm for flexibility and extensibility. SPIN is a dynamically
extensible operating system that uses events as its extension mechanism. A SPIN event is
used to notify the system of a state change or to request a service. For example, an IP
extension to the kernel could announce the event PacketArrived. Events in SPIN are fine-
grained, reflecting their use in an operating system. Likewise, events in the Coyote project
are fine-grained, reflecting their use in a kernel designed for network protocols. Coyote
extends the x-kernel [HUTC91] and enables the construction of micro-protocols that
communicate via events. Micro-protocols implement low-level properties, e.g.,
acknowledging that a message has been received or maintaining a membership list of live
processes. By composing micro-protocols, the Coyote protocol stack can be easily
configured to implement higher-level properties, e.g., group remote procedure calls with
acknowledgment. Coyote was designed primarily for network protocols and so the set of
pre-defined events relate mostly to messages, e.g., Message_Inserted_Into_Bag or
Message_Ready_To_Be_Sent. Ensemble uses events as the primary mechanism for
composing micro-protocols and supporting the process group abstraction. Example events
in Ensemble include Send-Message and Leave-Group.
The set of events exported by a system depends on the target environment and defines
the extension vocabulary with which developers can extend functionality. Since we target
an object-based system implemented over a message-passing communication layer, we
export events such as MessageSend and MethodReceived. Approaches such as Coyote or
our own in which events manipulate data structures (e.g., messages) contained in shared
data structures (e.g., message repository), can be viewed as a blackboard architecture
augmented with implicit invocations [SHAW96].
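The micro-protocol composition style described above can be sketched with a small event bus. The EventBus class and the event name below are illustrative assumptions, not Coyote's or Ensemble's actual API: each micro-protocol registers a handler for a named event, and raising the event implicitly invokes every registered handler, so protocols compose without referring to one another.

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical event bus: micro-protocols subscribe handlers to named
// events; raise() performs the implicit invocation of all subscribers.
class EventBus {
public:
    using Handler = std::function<void(std::string&)>;

    void subscribe(const std::string& event, Handler h) {
        handlers_[event].push_back(std::move(h));
    }

    void raise(const std::string& event, std::string& message) {
        for (auto& h : handlers_[event])
            h(message);   // each micro-protocol sees (and may edit) the message
    }

private:
    std::map<std::string, std::vector<Handler>> handlers_;
};
```

An acknowledgment micro-protocol and a logging micro-protocol can then be stacked simply by subscribing both to the same message event.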
2.3.1.2 Graphical user interface
Events have been widely used in implementing graphical user interfaces, e.g., the
MacOS®, Microsoft Windows®, and Java’s Abstract Window Toolkit. Events enable the
separation of the visual aspects of a program from the actual computation. Typical events
in these systems deal with various aspects of the desktop metaphor, e.g., mouse, windows,
buttons, menus, keyboard input. Programmers can register event handlers to be notified of
user actions and take appropriate action. However, coordinating events may be a difficult
task. Thus, most environments provide tools to facilitate the development of graphical
user interfaces, e.g., Java Swing, Visual Basic.
2.3.1.3 JavaBeans
JavaBeans™ is the component technology developed by Sun Microsystems for use
within the Java platform [SUN99B]. A bean is a reusable software artifact that can be
manipulated visually using a builder tool. Beans can communicate with one another using
an event paradigm. The advantages of using beans are that they are portable across
heterogeneous architectures and that many tool builders are actively developing products
to support the development of JavaBeans.
2.3.2 Distributed events
Distributed events are used to communicate information between remote objects or
processes. In CORBA, the Event Service allows an object to register its interest in events
raised by other objects [BENN95]. CORBA defines two roles for objects: suppliers and
consumers. Suppliers produce events; consumers process them. Suppliers and
consumers may be directly linked, in which case events flow directly from the suppliers to
the consumers. Alternatively, an event channel may be defined to serve as an intermediary
object between suppliers and consumers. Using an event channel fully decouples suppliers
from consumers—consumers need not be active when suppliers deposit events on an
event channel. Furthermore, event channels may provide added functionality such as
filtering and persistence. The Jini Distributed Event Specification provides functionality
similar to that of CORBA’s Event Service [SUN99A]. It also provides additional features,
such as the ability to bound the time during which an object is interested in an event raised
by some other object via leasing [SUN99A]. In Jini terminology, an event listener may
register to be notified of an event on a one-time basis, forever, or for a specified time
period.
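The decoupling that an event channel provides can be sketched as follows. The EventChannel class is a hypothetical, single-address-space stand-in for a CORBA-style event channel: it ignores the network, concurrency, and the real service's push/pull variants, but shows how buffering lets consumers attach after suppliers have already deposited events.

```cpp
#include <deque>
#include <functional>
#include <string>
#include <vector>

// Hypothetical event channel: suppliers push(); consumers attach() later.
// Neither side ever holds a reference to the other.
class EventChannel {
public:
    using Consumer = std::function<void(const std::string&)>;

    void push(const std::string& event) {      // supplier side
        if (consumers_.empty())
            buffer_.push_back(event);          // persistence: no consumer yet
        else
            deliver(event);
    }

    void attach(Consumer c) {                  // consumer side
        consumers_.push_back(std::move(c));
        while (!buffer_.empty()) {             // replay events sent earlier
            deliver(buffer_.front());
            buffer_.pop_front();
        }
    }

private:
    void deliver(const std::string& event) {
        for (auto& c : consumers_) c(event);
    }

    std::deque<std::string> buffer_;
    std::vector<Consumer> consumers_;
};
```

Because events are buffered, a consumer need not be active when the supplier deposits them, which is exactly the full decoupling described above.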
The exoevent notification model developed in this dissertation is similar to both the
CORBA and the Java Distributed Event specifications in that it supports the flexible
propagation of events between objects. The distinguishing features of our model are that it
unifies the concept of exceptions and events, i.e., an exception is simply a special kind of
event, and it allows programmers to specify the propagation of events on a per-
application, per-object or per-method basis. The exoevent notification model does not
support the concept of leasing.
While we use distributed events in our work for the dissemination of data to support
fault-tolerance algorithms, we note that the publish/subscribe model supported by events
is generic. As an example, the Department of Defense’s High Level Architecture uses the
publish/subscribe model to propagate information about entities in distributed simulations
[DMSO98]. As another example, the Jini Discovery and Join Specification regulates how
devices can discover the presence of other devices on a network [SUN99A].
2.4 Aspect-oriented programming
The use of the event paradigm to extend functionality for middleware systems is
related to the issue of crosscutting and weaving in aspect-oriented programming [KICZ97].
Crosscutting is the concept that extensions to a modularly-designed program cannot be
constrained within the bounds of the original program decomposition. An example of
crosscutting in an object-oriented program would be the addition of synchronization
primitives at the beginning of each method. Kiczales’ thesis is that crosscutting is
common in large software systems. Our experiences with middleware systems corroborate
his thesis; aside from implementing its functional requirements, an object may also handle
issues such as argument marshalling, security, debugging, performance monitoring and
synchronization. In aspect-oriented programming technology, these issues are called
aspects. Aspect-oriented programming languages elevate aspects to first-class status and
provide a clean separation between the functional decomposition of a program—objects
or modules—and non-functional requirements, which pertain to the way objects and
modules relate to one another [HIGH99].
After aspects are elevated to first-class status they must be composed with the
underlying program. This process is known as weaving and seems closely related to events
in the sense that events can be used to implement weaving. For example, an aspect for
debugging could be implemented easily in an object-based system by inserting an event
handler that intercepts methods and logs them to storage for future replay. An interesting
avenue of research would be to investigate the use of an aspect-oriented programming
language to extend the functionality of objects in computational grids, or alternatively, to
investigate the suitability of the event paradigm for weaving aspects. Pawlak et al. are
currently investigating this line of research [PAWL98].
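As a concrete illustration of weaving an aspect through an event-style hook, consider the sketch below. The Dispatcher class and its methods are invented for this example (not drawn from any AOP system): a tracing handler is installed independently of the object and sees every method invocation, without the base functionality being modified.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical dispatcher: every invocation first raises a
// MethodReceived-style hook, which is where an aspect is woven in.
class Dispatcher {
public:
    using MethodHook = std::function<void(const std::string&, int)>;

    void setHook(MethodHook h) { hook_ = std::move(h); }

    int dispatch(const std::string& method, int arg) {
        if (hook_) hook_(method, arg);   // the woven aspect sees every call
        if (method == "square") return arg * arg;
        return arg;                       // base behavior for other methods
    }

private:
    MethodHook hook_;
};
```

A debugging aspect then reduces to installing one hook that records each intercepted call for later replay; the base-level methods are untouched, which is the crosscutting separation aspect-oriented programming aims for.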
2.5 Integrating fault tolerance in distributed systems
Fabre et al. present an excellent analysis of different approaches for integrating fault-
tolerance in distributed systems [FABR95, FABR98]. They distinguish between three main
approaches: the system approach, the library approach and the inheritance approach. In
the system approach, the runtime system provides support for fault-tolerance. For
example, Delta-4 [POWE94] offers several replication strategies such as passive, semi-
active and active replication to Delta-4 application programmers. In the library approach,
a set of functions is provided at the application-level to support a set of fault-tolerance
algorithms. For example, ISIS [BIRM93], Horus [RENE96] and Ensemble [HAYD98],
provide developers with various forms of ordered communication primitives. In the
inheritance approach, an object can inherit fault-tolerance properties such as persistence
and recoverabilit y from a base class. Examples of this approach include Avalon/C++
[DETL88] and Arjuna [ARJU92]. Fabre analyzes these approaches in terms of transparency,
reusability and composability, and argues that none meets all three criteria simultaneously.
Fabre proposes the use of reflective techniques to meet these criteria and shows how to
integrate replication techniques into distributed objects using the reflective language
Open-C++ [FABR95, FABR98]. Other systems that advocate the use of reflection to
incorporate fault-tolerance techniques include MAUD [AGHA94] and Garf [GUER97].
A fertile area of research has been to integrate fault-tolerance techniques into CORBA.
Moser et al. propose a fault-tolerance framework that implements fault-tolerance
management services both above and below an object request broker (ORB) [MOSE99].
Other projects such as Electra and Orbix+Isis integrate replication and group mechanisms
inside the ORB itself [MAFF95, LAND97]. DOORS (Distributed Object-Oriented Reliable
Service) provides fault-tolerance services as CORBA horizontal services [SCHO98].
Elnozahy et al. provide a library of fault-tolerance techniques that can be used in both
CORBA and DCE environments [ELNO95]. Except for DOORS, which is implemented
above the ORB layer, all the other projects use interception methods to implement
replication services. Interception is implemented by modifying the ORB itself [LAND97],
by providing a library to be called from within the ORB [ELNO95], or by using features of
the operating system [MOSE99]. The Orbix ORB includes the notion of filters to intercept
method calls. However, Marzullo’s group at the University of California, San Diego,
reported difficulties in integrating the message-logging fault-tolerance technique with
Orbix [NAMP99]. Marzullo et al. suggest that an event-driven model would have
alleviated the reported difficulties [NAMP99].
The need to extend the functionality of ORBs has led several researchers to adopt a
reflective architecture to structure ORB implementations [BLAI98, HAYT98, LEDO99]. Our
development of the RGE and exoevent notification models also provides an extension
mechanism. The novelty of this work is to suggest the use of events as the primary
structuring mechanism for designing object request brokers and to specify both inter- and
intra-object communication within a unified model.
2.6 Summary
In designing our models, we drew inspiration from reflective systems as well as
previous work on flexible protocol stacks. Our approach differs in two respects from most
CORBA-based reflective middleware approaches: (1) we use a simple graph and event-
based interface for extending object functionality instead of a metaobject protocol, and
(2), our reflective models are designed to extend the functionality of applications, not just
single server objects. In the next chapter, we present the cornerstone of our framework, the
reflective graph and event model. We show an application of our model in designing a
protocol stack and extending it with new functionality.
Make everything as simple as possible, but not simpler. — Albert Einstein (1879-1955)
Chapter 3
Reflective Graph and Event Model
The cornerstone of our framework is the specification of the reflective graph and event
(RGE) execution model. It provides a structural framework for providing basic object
functionality such as invoking methods and marshalling and unmarshalling parameters,
similar to an object request broker (ORB) in CORBA systems [OMG95]. In addition, the
model provides a generic extension mechanism for incorporating new functionality into
objects—such functionality is encapsulated into reusable code artifacts, or modules. Thus,
the RGE model provides a common framework for fault-tolerance designers and tool
developers, and enables the integration and composition of fault-tolerance modules into
programming tools.
The novelty of this work is to suggest the use of events as the primary structuring
mechanism for designing object request brokers and to use a single model to specify both
inter- and intra-object communication. The RGE model employs graphs for inter-object
communication and events for intra-object interactions. Graphs represent interactions
between objects; a graph node is either a member function call on an object or another
graph, arcs model data and control dependencies, and each input to a node corresponds to
a formal parameter of the member function. Events specify interactions between modules
inside objects. Graphs and events are the building blocks with which fault-tolerance
developers can incorporate functionality inside objects and exchange protocol information
between objects.
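The graph side of the model can be illustrated with a minimal sketch. The Graph and Node types below are our invention for exposition (not the RGE implementation): nodes stand for method calls, arcs for data dependencies, and evaluating a node first evaluates the producer nodes its inputs depend on.

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical RGE-style program graph. A node is a method call; its
// inputs name the producer nodes whose results feed its parameters.
struct Node {
    std::function<int(const std::vector<int>&)> method;  // member-function call
    std::vector<std::string> inputs;                     // arcs to producers
};

class Graph {
public:
    void add(const std::string& name, Node n) { nodes_[name] = std::move(n); }

    // Evaluate a node, recursively resolving its data dependencies first.
    int eval(const std::string& name) {
        const Node& n = nodes_.at(name);
        std::vector<int> args;
        for (const auto& dep : n.inputs)
            args.push_back(eval(dep));   // follow the arc to the producer
        return n.method(args);
    }

private:
    std::map<std::string, Node> nodes_;
};
```

Control dependencies, sub-graphs as nodes, and remote invocation are omitted here; the sketch only shows how arcs carry each producer's result into a consumer's formal parameters.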
The RGE model is reflective because it exposes the structure of objects (i