Integrating Fault-Tolerance Techniques in Grid Applications

A Dissertation
Presented to
the Faculty of the School of Engineering and Applied Science
at the
University of Virginia

In Partial Fulfillment
of the Requirements for the Degree
Doctor of Philosophy (Computer Science)

by
Anh Nguyen-Tuong
August 2000
© Copyright by
Anh Nguyen-Tuong
August 2000
All Rights Reserved
Abstract
The contribution of this thesis is the development of a framework for simplifying the
construction of grid computational applications. The framework provides a generic
extension mechanism for incorporating functionality into applications and consists of two
models: (1) the reflective graph and event model, and (2) the exoevent notification model.
These models provide a platform for extending user applications with additional
capabilities via composition. While the models are generic and can be used for a variety of
purposes, including security, resource accounting, debugging, and application monitoring
[VILE97, FERR99, LEGI99, MORG99], we apply the models in this dissertation towards the
integration of fault-tolerance techniques.
Using the framework, fault-tolerance experts can encapsulate algorithms using the two
reflective models developed in this dissertation. Developers incorporate these algorithms
into their tools and augment the set of services provided to application programmers.
Application programmers then use these augmented tools to increase the likelihood that
their programs will complete successfully.
We claim that the framework enables the easy integration of fault-tolerance techniques
into object-based grid applications. To support this claim, we have mapped onto our
models five different fault-tolerance algorithms from the literature: 2PCDC and SPMD
checkpointing, passive and stateless replication, and pessimistic method logging. We
incorporated these algorithms into three common grid programming tools: Message
Passing Interface (MPI), Mentat, and Stub Generator (SG). MPI is the de facto standard
for message passing; Mentat is a C++-based parallel programming environment; and SG
is a popular tool for writing client/server applications.
We measured the ease with which techniques can be integrated into applications based
on the number of additional lines of code that a programmer would have to write. In the
best case, programmers needed to add three lines of code. In the worst case, programmers
had to write functions to save and restore the local state of their objects. However, such
functions are simple to write and exploit programmers’ knowledge of their applications.
Acknowledgements
To my ancestors, who have trekked down this path, and cleared a road for others to follow, three centuries is not that long after all
To that turtle in Hanoi, forever gazing at the pond, the smell of incense on a hot summer day
To the committee, for helping me to ascertain, the inside from the outside, the lines delicately drawn
To John Knight, for ensuring a smooth landing
To Andrew, my advisor and mentor, for showing me the difference between a millisecond and a microsecond, and for taking me along on his adventures
To Karine, my eternal accomplice, whose support and love, are the real foundation of this research
To my parents, whose journey I have yet to fully appreciate, cam on nhieu
To my sister, Vi, the dancer, the musician, the pharmacist, the photographer, who never ceases to amaze me, may she appreciate her roots on her voyage home
To Madgy, Bootsy, Noushka, Kona, rain or shine, eyes always sparkling, heart purring and tail wagging
Special thanks to Nuts, whose wit is as sharp as his intellect, for all his insights, technical, culinary and otherwise
And to all my friends, Chenxi, Dave, John, Karp, Glenn, Matt, Mike, Paco, Rashmi, the Dinner Gang, who have made this trip so enjoyable
Table of Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
Chapter 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Current support for fault tolerance in grids . . . . . . . . . . . . . . . . . . . . 4
1.2 Properties of the framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4.1 Grid models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4.2 Reflection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4.3 Legion grid environment . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.5 Framework foundation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.5.1 Framework summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.6 Constraints and assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.7 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Chapter 2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1 Computational grids . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.1 PVM and MPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.1.1.1 DOME . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1.1.2 CVMULUS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1.1.3 Other extensions to PVM and MPI . . . . . . . . . . 20
2.1.2 Isis, Horus and Ensemble . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.1.3 Linda, Pirhana and JavaSpaces . . . . . . . . . . . . . . . . . . . . . . 22
2.2 Reflection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3 Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.1 Local events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.1.1 Protocol stacks . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.1.2 Graphical user interface . . . . . . . . . . . . . . . . . . . 27
2.3.1.3 JavaBeans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3.2 Distributed events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4 Aspect-oriented programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.5 Integrating fault tolerance in distributed systems . . . . . . . . . . . . . . . 30
2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Chapter 3 Reflective Graph and Event Model . . . . . . . . . . . . . . . . . . . . 33
3.1 Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.1.1 Graph API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2 Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2.1 Event API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3 Overhead for graphs and events . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4 Structure of an object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.4.1 Overview of a protocol stack . . . . . . . . . . . . . . . . . . . . . . . 45
3.4.2 Example of incorporating new functionality . . . . . . . . . . . 47
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Chapter 4 Exoevent Notification Model . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.1 Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.1.1 Registering interest in an exoevent . . . . . . . . . . . . . . . . . . . 52
4.1.2 Object scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.1.3 Method scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.2 Policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.2.1 The notify-root policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2.2 The notify-client policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.2.3 The notify-third-party policy . . . . . . . . . . . . . . . . . . . . . . . 57
4.2.4 The notify-hybrid policy . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3 Application programmer interface . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.4 Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.5 Example exoevents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.6 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.6.1 Failure detection – push model . . . . . . . . . . . . . . . . . . . . . . 64
4.6.2 Failure detection – pull model . . . . . . . . . . . . . . . . . . . . . . 65
4.6.3 Failure detection – service model . . . . . . . . . . . . . . . . . . . . 66
4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Chapter 5 Mappings of Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.1 Checkpointing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.1.1 SPMD checkpointing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.1.1.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.1.1.2 Mapping SPMD checkpointing . . . . . . . . . . . . . 77
5.1.1.3 Summary of SPMD checkpointing . . . . . . . . . . . 80
5.1.2 2-phase commit distributed checkpointing . . . . . . . . . . . . . 80
5.1.2.1 Checkpointing . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.1.2.2 Recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.1.2.3 Mapping 2-phase commit distributed checkpointing . . 83
5.1.2.4 Summary of 2PCDC algorithm . . . . . . . . . . . . . 86
5.2 Logging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.2.1 Pessimistic message logging . . . . . . . . . . . . . . . . . . . . . . . . 89
5.2.2 Mapping pessimistic message logging . . . . . . . . . . . . . . . . 91
5.2.3 Optimization: pessimistic method logging . . . . . . . . . . . . . 94
5.2.4 Legion system-level support . . . . . . . . . . . . . . . . . . . . . . . . 95
5.2.5 Summary of pessimistic logging . . . . . . . . . . . . . . . . . . . . . 96
5.3 Replication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.3.1 Passive replication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.3.1.1 Mapping passive replication . . . . . . . . . . . . . . . 100
5.3.1.2 Legion system-level support . . . . . . . . . . . . . . . 101
5.3.1.3 Summary of passive replication . . . . . . . . . . . . 102
5.3.2 Stateless replication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.3.2.1 Mapping stateless replication . . . . . . . . . . . . . . 105
5.3.2.2 Duplicate method suppression . . . . . . . . . . . . . 108
5.3.2.3 Summary of stateless replication . . . . . . . . . . . 108
5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Chapter 6 Integration into Programming Tools . . . . . . . . . . . . . . . . . . 111
6.1 MPI (SPMD and 2PCDC Checkpointing) . . . . . . . . . . . . . . . . . . . 112
6.1.1 Legion MPI (LMPI) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.1.2 Legion MPI-FT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
6.1.3 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
6.1.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.2 Stub generator (passive replication and pessimistic method logging) . . 121
6.2.1 Modifications to the stub generator . . . . . . . . . . . . . . . . . 122
6.2.2 Integration with pessimistic method logging . . . . . . . . . . 123
6.2.3 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
6.2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
6.2.5 Integration with passive replication . . . . . . . . . . . . . . . . . 127
6.2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
6.3 MPL – Stateless replication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
6.3.1 Stateless replication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6.3.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
6.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
Chapter 7 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
7.1 Stub Generator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
7.1.1 RPC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
7.1.2 Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
7.2 MPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
7.2.1 RPC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
7.2.2 BT-MED . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
7.3 Mentat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
7.3.1 RPC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
7.3.2 Complib . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
7.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
Chapter 8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
8.1 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
8.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
List of Figures
Figure 1: Grid layered implementation models (adapted from [FOST99], pg. 30) . . . 7
Figure 2: Code fragment and RGE graph. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Figure 3: Example use of the graph API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Figure 4: Graph interface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Figure 5: Example use of events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Figure 6: Event interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Figure 7: Structure of an object: sample protocol stack. . . . . . . . . . . . . . . . . . . . . . 47
Figure 8: Adding a handler for logging methods (pseudo-code) . . . . . . . . . . . . . . . 48
Figure 9: The notify-client policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Figure 10: Propagating exoevents to a catcher object . . . . . . . . . . . . . . . . . . . . . . . . 58
Figure 11: Example propagation of exoevents in the notify-hybrid policy . . . . . . . . 59
Figure 12: API for exoevents. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Figure 13: Failure detection using the push model . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Figure 14: Failure detection using a pull model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Figure 15: Generic failure detection service . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Figure 16: Structure of a fault-tolerant application . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Figure 17: Lost and orphan messages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Figure 18: Insertion of checkpoint in SPMD code. . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Figure 19: Recovery example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Figure 20: Interface for checkpoint server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Figure 21: Interface for application manager . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Figure 22: Raising the “CheckpointTaken” exoevent . . . . . . . . . . . . . . . . . . . . . . . . 78
Figure 23: Interface for participants. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Figure 24: Interface for coordinator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Figure 25: 2PCDC code. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Figure 26: Interface for participants. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Figure 27: Pessimistic message logging (PML). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Figure 28: Interface for pessimistic message logging . . . . . . . . . . . . . . . . . . . . . . . . 92
Figure 29: Handlers for pessimistic message logging . . . . . . . . . . . . . . . . . . . . . . . . 93
Figure 30: Handler for intercepting outgoing communication. . . . . . . . . . . . . . . . . . 94
Figure 31: Pessimistic method logging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Figure 32: Passive replication example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Figure 33: Passive replication interface (primary) . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Figure 34: Handlers for passive replication (primary) . . . . . . . . . . . . . . . . . . . . . . . 101
Figure 35: Server lookup with primary replication . . . . . . . . . . . . . . . . . . . . . . . . . 102
Figure 36: Stateless replication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Figure 37: Interface for proxy object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Figure 38: Sending a method to a replica. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Figure 39: Simple MPI program (myprogram) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Figure 40: Legion MPI architecture augmented with FT modules. . . . . . . . . . . . . . 116
Figure 41: Example of MPI application with checkpointing. . . . . . . . . . . . . . . . . . 119
Figure 42: Example of saving and restoring user state . . . . . . . . . . . . . . . . . . . . . . 120
Figure 43: Creating objects using the stub generator . . . . . . . . . . . . . . . . . . . . . . . . 122
Figure 44: Specification of READONLY methods . . . . . . . . . . . . . . . . . . . . . . . . . 123
Figure 45: Modified client-side stubs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
Figure 46: Interface and code for myApp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
Figure 47: Example of MPL application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
Figure 48: Declaring a Mentat class as stateless . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
Figure 49: Specifying parameters for the stateless replication policy . . . . . . . . . . . 131
Figure 50: Interface for context object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
Figure 51: Context application structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
Figure 52: BT-MED application structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
Figure 53: Complib application structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
Figure 54: Complib main loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
List of Tables
Table 1: Overhead of graphs and events. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Table 2: Sample set of events for building protocol stack of an object . . . . . . . . . 45
Table 3: Example of typical exoevent. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Table 4: Exoevent interest for notify-root policy . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Table 5: Exoevent interest for notify-client policy . . . . . . . . . . . . . . . . . . . . . . . . . 57
Table 6: Exoevent interest for notify-client policy . . . . . . . . . . . . . . . . . . . . . . . . . 58
Table 7: Exoevent interest for notify-hybrid policy for object AppA . . . . . . . . . . . 59
Table 8: Exoevent interest for notify-hybrid policy for object catcher . . . . . . . . . 60
Table 9: Exoevent interest for notify-hybrid policy for object B . . . . . . . . . . . . . . 60
Table 10: Overhead in creating and raising exoevents . . . . . . . . . . . . . . . . . . . . . . . 63
Table 11: Sample exoevents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Table 12: “I am Alive” exoevent raised by application objects . . . . . . . . . . . . . 64
Table 13: Exoevent raised on object creation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Table 14: Exoevent raised by failure detector. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Table 15: Data structures for FT modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Table 16: Summary SPMD checkpointing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Table 17: 2PCDC algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Table 18: Recovery in 2PCDC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Table 19: Summary 2PCDC algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Table 20: Summary of pessimistic logging algorithm . . . . . . . . . . . . . . . . . . . . . . . 96
Table 21: Summary of the passive replication algorithm . . . . . . . . . . . . . . . . . . . . 102
Table 22: “Object:MethodDone” notification by replica . . . . . . . . . . . . . . . . . . . . 106
Table 23: Summary of the passive replication algorithm . . . . . . . . . . . . . . . . . . . . 108
Table 24: Summary of algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Table 25: Sample MPI functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
Table 26: Functions to support checkpoint/restart . . . . . . . . . . . . . . . . . . . . . . . . . 116
Table 27: Options for legion_mpi_run . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
Table 28: Summary of work required for integration of checkpointing algorithms . . 120
Table 29: Parameters for legion_set_ft . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
Table 30: Summary of work required for integration of PML . . . . . . . . . . . . . . . . 126
Table 31: Parameters for legion_set_ft . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Table 32: Summary of work required for integration of passive replication . . . . . 128
Table 33: Summary of work required for integration of stateless replication . . . . 132
Table 34: Stub generator – RPC performance (n = 100, α = 0.05) . . . . . . . . . . 136
Table 35: Context performance (n = 100, α = 0.05) . . . . . . . . . . . . . . . . . . . . . 139
Table 36: Context performance with one induced failure (n = 5, α = 0.05) . . . 140
Table 37: Send and receive performance (n = 20, α = 0.05) . . . . . . . . . . . . . . 142
Table 38: BT-MED performance (n = 20, α = 0.05) . . . . . . . . . . . . . . . . . . . . 143
Table 39: Performance with one induced failure (n = 10, α = 0.05) . . . . . . . . 145
Table 40: RPC performance (1 worker, n = 100, α = 0.05) . . . . . . . . . . . . . . . 146
Table 41: Complib performance (n = 20, α = 0.05) . . . . . . . . . . . . . . . . . . . . . 149
Table 42: Complib performance with failure induced (n = 10, α = 0.05) . . . . 149
Table 43: Application summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
Table 44: Framework overhead based on RPC application . . . . . . . . . . . . . . . . . . 151
in-fra-struc-ture \'in-fre-,strek-cher\ n (1927)
The basic facilities, services, and installations needed for the functioning of a community
or society, such as transportation and communications systems, water and power lines, and public institutions including schools, post offices, and prisons.
— American Heritage Dictionary
Chapter 1
Introduction
Throughout history, the development of infrastructures has catalyzed and shaped the
evolution of human progress. The construction of Roman roads, the telegraph, the
telephone, the modern banking system, the railroad, the interstate highway system, the
electrical power grids, and the Internet, are all successful infrastructures that have
revolutionized how people communicate and interact. At the dawn of the new millennium,
we are witnessing the birth of what promises to be the next revolutionary infrastructure.
Funded in the United States by several governmental agencies, including the National
Science Foundation (NSF), the Defense Advanced Research Project Agency (DARPA),
the Department of Energy (DOE), and the National Aeronautics and Space Administration
(NASA), this new infrastructure is often referred to as a metasystem or computational grid
[GRIM97A, SMAR97, GRIM98, FOST99, LEIN99].
A computational grid is a specialized instance of a distributed system [MULL93,
TANE94] with the following characteristics: compute and data resources are
geographically distributed; they are under the control of different administrative domains
with different security and accounting policies; and the hardware resource base is
heterogeneous and consists of PCs, workstations and supercomputers from different
manufacturers. The ability to develop applications over this environment is sometimes
referred to as the wide-area computing problem [GRIM99].
Computational grids present a complex environment in which to develop applications.
Writing a grid application is at least as difficult as writing an application for traditional
distributed systems: since both are fundamentally distributed-memory systems,
programmers must deal with issues of application distribution, communication, and
synchronization. Furthermore, grids present additional challenges as programmers may be
required to deal with issues such as security, disjoint file systems, fault tolerance and
placement, to name only a few [GRIM98, FOST99, GRIM99]. Without additional
higher-level abstractions, all but the best programmers will be overwhelmed by the complexity of
the environment.
The contribution of this work is the development of a framework for simplifying the
construction of grid applications. The framework provides a generic extension mechanism
for incorporating functionality into applications and consists of two models: (1) the
reflective graph and event model, and (2) the exoevent notification model. These models
provide a platform for extending user applications with additional capabilities via
composition. While the models are generic and can be used for a variety of purposes,
including security, resource accounting, debugging, and application monitoring [VILE97,
FERR99, LEGI99, MORG99], we apply the models in this dissertation towards the
integration of fault-tolerance techniques. Support for the development of fault-tolerant
applications has been identified as one of the major technical challenges to address for the
successful deployment of computational grids [GRIM98, FOST99, LEIN99].
Consider application reliability in a grid. As applications scale to take advantage of a
grid’s vast available resources, the probability of failure is no longer negligible and must
be taken into account. For example, consider an application decomposed into 100 objects,
with each object requiring one week of processing time and placed on its own workstation.
Assuming that each workstation has an exponentially distributed failure mode with a
mean-time-to-failure of 120 days, the mean-time-to-failure of the entire application would
be only 1.2 days; thus, the application would rarely finish!
Using the framework, fault-tolerance experts can encapsulate algorithms using the two
reflective models developed in this dissertation. Developers incorporate these algorithms
into their tools and augment the set of services provided to application programmers.
Application programmers then use these augmented tools to increase the likelihood that
their programs will complete successfully.
We claim that the framework enables the easy integration of fault-tolerance techniques
into object-based grid applications. To support this claim, we have mapped onto our
models five different fault-tolerance algorithms from the literature: 2PCDC and SPMD
checkpointing, passive and stateless replication, and pessimistic method logging. We
chose these algorithms to illustrate the applicability of our framework to a range of fault-tolerance
techniques. Furthermore, we selected these algorithms because we believe that
they are likely to be used in grid applications. We incorporated these algorithms into three
common grid programming tools: Message Passing Interface (MPI), Mentat, and Stub
Generator (SG). MPI is the de facto standard for message passing; Mentat is a C++-based
parallel programming environment; and SG is a popular tool for writing client/server
applications.
We measured the ease with which techniques can be integrated into applications based
on the number of additional lines of code that a programmer would have to write. In the
best case, programmers needed to add three lines of code. In the worst case, programmers
had to write functions to save and restore the local state of their objects. However, such
functions are simple to write and exploit programmers’ knowledge of their applications.
Furthermore, tools to automate save and restore state functions have already been
demonstrated in the literature [BEGU97, FERR97, FABR98].
To the best of our knowledge, we are the first to advocate and use a reflective
architecture to structure applications in computational grids. Moreover, we are the first to
demonstrate the integration of a wide range of fault-tolerance techniques into grid
applications using a single framework.
1.1 Current support for fault tolerance in grids
Until recently, the foremost priority for grid developers has been to develop working
prototypes and to show that applications can be written over a grid environment
[GRIM97B, BRUN98, FOST98]. To date, there has been limited support for application-level
fault tolerance in computational grids. Support has consisted mainly of failure detection
services [STEL98, GROP99] or fault-tolerance capabilities in specialized grid toolkits
[NGUY96, CASA97]. Neither solution is satisfactory in the long run. The former places the
burden of incorporating fault-tolerance techniques in the hands of application
programmers, while the latter only works for specialized applications. Even in cases
where fault-tolerance techniques have been integrated into programming tools, these
solutions have generally been point solutions, i.e., tool developers have started from
scratch in implementing their solution and have not shared, nor reused, any fault-tolerance
code.
As these tools are ported to grid environments, or as new tools are developed for grid
environments, the continued development of fault-tolerant tools as point solutions
represents a wasteful expenditure of effort. We believe a better approach is to provide a
structural framework that supports composition: fault-tolerance experts write algorithms
and encapsulate them into reusable code artifacts, or modules, and tool developers then
integrate these modules into their environments.
1.2 Properties of the framework
Our long-term goal is to simplify the construction of fault-tolerant grid applications.
We believe that a good solution for achieving this goal should exhibit the following
properties:
• P1. Separation of concerns and composition. Designing and writing fault-
tolerance code are complex and error-prone tasks and should be done by experts,
not application programmers or tool developers. Thus, fault-tolerance experts
should be able to encapsulate algorithms into reusable and composable code
artifacts [NGUY99]. Furthermore, the incorporation of fault-tolerance techniques
should not interfere with other non-functional concerns such as security or
accounting.
• P2. Localized cost. By localized cost, we mean that the use of resources or services
to implement fault-tolerance techniques should not be charged to applications that
do not require those resources or services—users should pay only for the level of
services that they need. In general, localized cost is an important attribute for any
grid service [GRIM97A].
• P3. Working proof of concept. We should be able to demonstrate the integration of
fault-tolerance techniques in running applications on a working grid prototype and
using multiple programming tools. Further, applications with fault-tolerance
techniques integrated should be able to tolerate more failures than applications that
do not use any fault-tolerance techniques.
1.3 Evaluation
Based on our goal of simplifying the construction of fault-tolerant applications and the
properties listed in §1.2, we have derived several criteria by which to evaluate our
framework (next to each criterion, we note in parentheses its related property):
• Multiple programming tools. A successful solution should promote and enable the
incorporation of fault-tolerance techniques into multiple programming tools,
including legacy tools such as MPI or PVM. Legacy tools are already familiar to
programmers and should ease the transition from traditional distributed systems to
grid environments. (P1, P3)
• Breadth of fault-tolerance techniques. A successful solution should support a wide
range of fault-tolerance techniques so that application programmers may use the
one that is most appropriate for their needs. (P1, P2)
• Ease of use. Incorporating fault-tolerance techniques should require only trivial
or small modifications to applications. (P1, P3)
• Localized cost. Application programmers should select and pay only for the level
of fault tolerance that they require. A good framework should not impose a
system-wide solution. Instead, the cost of using fault-tolerance techniques should
be localized to the applications that use these techniques. (P2)
• Overhead. Is the overhead of using fault-tolerance techniques due to the algorithm
or to the framework itself? In deciding whether to incorporate a fault-tolerance
technique, users should only worry about the algorithmic overhead, i.e., the cost of
the algorithm itself. (P2, P3)
1.4 Background
1.4.1 Grid models
Before describing our framework, we present the implementation models of
computational grids. As shown in Figure 1, a grid consists of services that run on top of
native operating systems. These services provide functionality such as authentication,
failure detection, object and process management, and remote input/output, and are
accessed via grid libraries. Typically, an application programmer will not access these
libraries directly, but will use a programming tool such as MPI [GROP99],
NetSolve [CASA97], Ninf [SATO97] or MPL [GRIM97B], which in turn will call the
underlying grid libraries. The advantage of this layered model is that application
programmers can use familiar programming tools and interfaces and are shielded from the
complexity of accessing grid services.
FIGURE 1: Grid layered implementation models (adapted from [FOST99], pg. 30). The figure shows five layers, top to bottom: Applications; Programming Tools (MPI, PVM, NetSolve, DOME, MPL, Fortran); Grid Libraries (Globus API, Legion API); Grid Services (security, object/process management, scheduling, failure detection, storage); and Native Operating Systems (Windows NT, Unix).
There are currently three approaches to building grids: the commodity approach, the
service approach, and the integrated architecture approach [FOST99]. In the commodity
approach, existing commodity technologies, e.g. HTTP, CORBA, COM, Java, serve as the
basic building blocks of the grid [ALEX96, BALD96, FOX96, CHRI97]. The primary
advantage of this approach is the use of industry-standard protocols, which allows
programmers to ride the technology curve as improvements are made to these protocols.
Furthermore, standard protocols stand a better chance of being adopted by a large
community of developers. The problem with this approach is that the current set of
protocols may not be adequate to meet the requirements of computational grids. In the
service approach, as exemplified by the Globus project, a set of basic services such as
security, communication, and process management are provided and exported to
developers in the form of a toolkit [FOST97]. In the integrated architecture approach,
resources are accessed through a uniform model of abstraction [GRIM98]. As
we describe in §1.4.3, our framework targets the integrated approach.
1.4.2 Reflection
Our framework relies on the observation that although fault-tolerance techniques are
diverse by nature, their implementation is not. Indeed, the implementation of the major
families of fault-tolerance techniques relies on common basic primitives such as:
• intercepting the message stream
• piggybacking information on the message stream
• acting upon the information contained in the message stream
• saving and restoring state
• detecting failure
• exchanging protocol information between participants of an algorithm
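To make this concrete, here is one way such primitives might be exposed as first-class hooks. The `MessageLayer` class and its `on_send` list are our own illustrative names, not an interface from the dissertation: the point is that a fault-tolerance module can intercept the message stream and piggyback protocol information without touching application code.

```python
# A sketch of message-stream interception and piggybacking as first-class
# handlers (invented names, for illustration only).
class MessageLayer:
    def __init__(self):
        self.on_send = []                 # interception points

    def send(self, msg):
        for hook in self.on_send:
            msg = hook(msg)               # hooks may rewrite the message
        return msg                        # hand off to the real transport here

layer = MessageLayer()

# Example: a message-logging protocol piggybacks a sequence number.
state = {"seq": 0}
def piggyback_seq(msg):
    state["seq"] += 1
    return {**msg, "seq": state["seq"]}

layer.on_send.append(piggyback_seq)
print(layer.send({"body": "compute"}))    # {'body': 'compute', 'seq': 1}
```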
Thus, by providing an execution model whereby these primitives can be expressed and
manipulated as first-class entities, it is possible to achieve our goals of developing fault-tolerance capabilities independently and integrating them into programming tools.
We use reflection as the architectural principle behind our execution models. Smith
introduced the concept of reflection as a computational process that can reason about itself
and manipulate representations of its own internal structure [SMIT82]. Two properties
characterize reflective systems: introspection and causal connection.* Introspection
allows a computational process to have access to its own internal structures. Causal
connection enables the process to modify its behavior directly by modifying its internal
data structures—there is a cause-and-effect relationship between changing the values of
the data structures and the behavior of the process. The internal data structures are said to
reside at the metalevel while the computation itself resides at the baselevel. The metalevel
controls the behavior at the baselevel. In our case, the fault-tolerance capabilities are
expressed at the metalevel and control the underlying baselevel computation.
1.4.3 Legion grid environment
Our work targets the Legion environment for multiple reasons: (1) Legion is object-
based, (2) it already uses graphs for inter-object communication, (3) it is an existing grid
prototype, and (4), multiple programming tools are available. None of the other
environments considered, such as Globus and CORBA-based systems, possess all these
attributes. However, our framework is also relevant to these other environments. For
example, it could be used to structure CORBA applications. Recent research has been
* Note that the term causal is used differently in the distributed systems literature, where it refers to the “happened-before” relationship as defined by Lamport [LAMP78].
oriented towards extending the functionality of CORBA systems through a reflective
architecture [BLAI98, HAYT98, LEDO99]. Our work suggests that structuring CORBA-
reflective architectures using an event-based and/or graph-based paradigm is an idea
worth pursuing.
Legion treats all resources in a computational grid as objects that communicate via
asynchronous method invocations. Objects are address-space-disjoint, i.e., they are
logically independent collections of data and associated methods. Objects contain a thread
of control, and are named entities identified by a Legion Object IDentifier (LOID). Objects
are persistent and can be in one of two states: active or inert. Active objects contain a
thread of control and are ready to service method calls. They are implemented with
running processes over a message passing layer. Inert objects exist as passive object state
representations on persistent storage. Legion moves objects between active and inert states
to use resources efficiently, to support object mobility, and to enable failure resilience.
Legion objects are under the control of a Class Manager object that is responsible for
the management of its instances. A Class Manager defines policies for its instances and
regulates how an object is created or deleted, and when it should be migrated, activated, or
deactivated. By defining new Class Managers, grid developers can change the
management policies of object instances. Class Managers themselves are managed by
higher-order class managers, forming a rooted hierarchy.
Legion provides several default objects to manage its resource base. The two basic
objects are Host Objects and Vault Objects, which correspond to processor and storage
resources in a traditional operating system. Host objects are responsible for running an
active object while vault objects are used to store inert objects. Legion allows
customization of all its objects. Thus, a host object could represent compute resources that
exhibit varying degrees of reliability and performance, e.g., a personal computer, a
workstation, a server, a cluster, or a queue-controlled supercomputer. Similarly a vault
object could represent a local disk, a RAID disk, or tertiary storage. A full description of
the Legion object model can be found in the literature [GRIM98].
1.5 Framework foundation
The key contribution of this work is the development of two reflective models that are
the foundations of our framework, the reflective graph and event model, and the exoevent
notification model. Together these models provide flexible mechanisms for structuring
applications and specifying the flow of information between objects that comprise an
application. Furthermore, the models enable information propagation policies to be bound
to applications at run-time. The flexibility of the models and the ability to defer the
binding of policy decisions are the differentiating features of our framework.
The reflective graph and event model (RGE) reflects our target environment: (1) an
environment in which objects are implemented by running processes that communicate
via message passing, and (2) an object-based environment in which an application consists
of a set of cooperating objects. The RGE model employs graphs and events to expose the
structure of objects to fault-tolerance developers. It specifies both their external aspect
(interactions between objects) and their internal aspect (interactions inside objects). Graphs
and events are the building blocks with which fault-tolerance implementors can
incorporate functionality inside objects and exchange fault-tolerance protocol information
between objects. Graphs represent interactions between objects; a graph node is either a
member function call on an object or another graph; arcs model data or control
dependencies; and each input to a node corresponds to a formal parameter of the member
function. Events specify interactions inside objects and are used to structure their protocol
stack.
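The graph abstraction just described can be sketched as follows; this is our own toy rendering, not Legion’s actual data structures. Each node names a method call on an object, and arcs feed one node’s result into a formal parameter of a downstream node.

```python
from collections import namedtuple

Node = namedtuple("Node", ["obj", "method", "deps"])   # deps: upstream node ids

def run_graph(graph, order):
    """Execute nodes in a topological order of the dependency arcs."""
    results = {}
    for node_id in order:
        node = graph[node_id]
        args = [results[d] for d in node.deps]          # values carried by arcs
        results[node_id] = getattr(node.obj, node.method)(*args)
    return results

class Source:
    def produce(self):
        return 21

class Doubler:
    def double(self, x):
        return 2 * x

graph = {"a": Node(Source(), "produce", []),
         "b": Node(Doubler(), "double", ["a"])}         # arc: a -> b
print(run_graph(graph, ["a", "b"])["b"])                # 42
```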
Our second model, the exoevent notification model, is a distributed event model.
Similarly to the event model defined by CORBA [BENN95] and the Java Distributed Event
Specification [SUN99A], the exoevent notification model provides a flexible mechanism
for objects to communicate. However, unlike the CORBA and Java models, the salient and
distinguishing features of the exoevent notification model are that it unifies the concept of
exceptions and events—an exception is a special case of an event—and it allows the
specification of event propagation policies to be set on a per-application, per-object or per-
method basis, at run-time. In our model, exoevents denote object state transitions and are
associated with program graphs. Raising an exoevent results in the execution of method
invocations on remote objects through the execution of associated program graphs—
hence the term exoevent. The ability to specify handlers as program graphs allows
developers to specify more complex policies than with a traditional event model.
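The exoevent mechanism might be sketched as follows. The names `bind` and `raise_exoevent` are invented for illustration, and handler graphs are reduced to plain callables; in the real model, raising an exoevent would execute an associated program graph of remote method invocations.

```python
# Hedged sketch: raising a named event executes whatever handler graphs were
# bound to it at run-time; an exception is simply one kind of event.
handlers = {}          # event name -> handler graphs (plain callables here)

def bind(event, handler_graph):
    handlers.setdefault(event, []).append(handler_graph)

def raise_exoevent(event, payload):
    for graph in handlers.get(event, []):   # policy chosen per application
        graph(payload)                      # stands in for remote method calls

actions = []
# Bind a recovery policy at "run-time"; a different application could bind none.
bind("ObjectFailed", lambda obj: actions.append(f"restart {obj}"))
raise_exoevent("ObjectFailed", "worker-7")
print(actions)    # ['restart worker-7']
```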
The use of reflection to incorporate non-functional requirements has been proposed by
Stroud [STRO96]. Its use for integrating fault-tolerance capabilities into systems has been
successfully employed in many object-based systems, including FRIENDS [FABR98] and
GARF [GUER97]. Reflection has also been used as the basis for extending object
functionality in CORBA-based systems (OpenORB [BLAI98], FlexiNet [HAYT98],
OpenCorba [LEDO99]). The novelty of this dissertation is to suggest the use of events as
the primary structuring mechanism for designing object request brokers, the use of generic
program graphs to describe distributed event propagation policy and bind policy at run-
time, and the use of reflection to specify inter- and intra-object communication as generic
and flexible means of extending grid applications with additional functionality. In
particular, we focus on using the models to extend applications with fault-tolerance
capabilities.
1.5.1 Framework summary
In order to enable the integration of fault-tolerance techniques with applications, our
framework requires that both fault-tolerance experts and tool developers target the
reflective graph and event model and the exoevent notification model. Note that the
framework does not make any assumptions about the failure model used by the underlying
system, or the failure assumptions made by a given fault-tolerance algorithm. The
framework is an integration framework only; the decision as to whether a given algorithm
is suitable for a given application is not part of the framework proper.
Our framework imposes a unified structure on the way grid libraries are organized.
Specifically, our framework requires that library components use an event paradigm for
intra-object communication. The advantages of events in terms of flexibility and
extensibility are well-known. Events have been used in such diverse areas as graphical
user interfaces [NYE92], protocol stacks [BHAT97, HAYD98], operating system kernels
[BERS95] and integrated systems [SULL96]. Using events for building the protocol stack
of an object provides natural hooks for inserting fault-tolerance capabilities. In fact, the
events required to build a protocol stack for objects are those that are needed for
incorporating fault-tolerance functionality.
For inter-object communications, our model provides a data-driven, graph-based
abstraction. Graphs have been used successfully in parallel and distributed systems
[BABA92, BEGU92, GRIM96A]. Graphs enable the expression of traditional client/server
interactions, such as CORBA, as well as more complex interactions, such as pipelined
flow.
1.6 Constraints and assumptions
The fault-tolerance algorithms discussed in this dissertation make use of three
common assumptions: fail-stop, availability of reliable storage, and reliable networks.
However, Legion only provides an approximation of these assumptions. Detecting a
crashed object is approximated using conservatively-set timeouts; reliable storage is
approximated with standard disks; and a high-level retry mechanism for sending
messages masks transient network partitions. Thus, it is possible for an
application using a given fault-tolerance technique to violate its failure assumptions. To
increase the likelihood that these assumptions are met, Legion could be configured to use
hosts and storage devices with higher reliability, e.g., hosts such as those provided by the
Compaq® NonStop™† or Stratus® architectures, storage such as RAID disks, and
possibly hosts configured with redundant network paths. However, we do not expect this
configuration to be common in grids in the near future. Thus, application developers
should be aware of the possibility of violating the failure assumptions—if the cost of
violating these assumptions is too high, e.g., as would be the case with safety-critical
applications, then these applications should not be used on Legion.‡ The framework
† Formerly known as Tandem®, acquired by Compaq Corporation.
‡ Note that this comment applies to any computational grid.
described here is an integration framework only, and does not make any guarantees as to
the suitability of using a given algorithm. However, to increase the likelihood that the
failure assumptions are met, we configured applications to run within a site [DOCT99].
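The approximations above can be sketched roughly as follows. This is illustrative only, not Legion’s actual mechanism, and the transport call is hypothetical: a conservatively-set timeout stands in for crash detection, and a bounded retry loop masks transient partitions before a fail-stop verdict is reached.

```python
import time

def call_with_retry(send, timeout_s=30.0, retries=3, backoff_s=1.0):
    for attempt in range(retries):
        try:
            return send(timeout=timeout_s)         # hypothetical transport call
        except TimeoutError:
            time.sleep(backoff_s * attempt)        # maybe a transient partition
    raise RuntimeError("peer presumed crashed")    # approximate fail-stop

# A fake transport that times out twice, then answers: the retries mask it.
attempts = {"n": 0}
def flaky_send(timeout):
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError
    return "reply"

print(call_with_retry(flaky_send, backoff_s=0.0))  # reply
```

If the peer never answers within the retry budget, the caller can only presume a crash, which is exactly why the fail-stop assumption may be violated by a slow-but-alive object.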
In this dissertation the algorithms we have mapped onto our framework are designed
to tolerate host failures. Computational grids use hardware resources owned by various
entities, including research labs, governmental agencies, and universities. At any moment,
it is thus not surprising to find that some hosts used by a grid system have crashed,
whether because someone rebooted a machine or tripped on a power cord, or because a
host is simply down for maintenance. While the crash failure of hosts represents an
important class of failures in grids, we note that it is not the only source of failures—
unreliable software or operator error could also result in the failure of applications
[GRAY85]. Furthermore, we do not concern ourselves with non-fault-masking techniques
such as reconfiguration and presentation of alternative services to cope with failures
[HOFM94, KNIG98, GART99]. We are only concerned with the integration of fault-masking
techniques in grid applications. Once a host fails, we assume that it does not recover.
Furthermore, we seek only to integrate fault-tolerance techniques into user applications
and do not address the case of fault-tolerance for system-level objects.** We assume that
Legion services are always available.
1.7 Outline
We have organized the rest of the dissertation as follows. In Chapter 2, we present an
overview of related work in the areas of computational grids, reflection, event-driven
** Legion system-level objects already tolerate transient host failures.
systems, aspect-oriented programming and integration of fault-tolerance techniques in
distributed systems. In Chapter 3, we provide an overview of our execution model, the
reflective graph and event model. In Chapter 4, we describe the development of a
distributed event notification model that is used as a flexible communication model to
exchange protocol information between objects. In Chapter 5, we illustrate mappings from
several well-known fault-tolerance techniques onto the reflective graph and event model
and the distributed event notification model. In Chapter 6, we present the integration of
several mappings described in Chapter 5 into several programming tools available in the
Legion grid. In Chapter 7, we tie the previous chapters together and provide a working
proof that our models have been successfully integrated into several tools and
applications. We also evaluate the performance of these applications. In Chapter 8, we
conclude by presenting lessons we learned and opportunities for future research.
There is only one nature – the division into science and engineering is a human imposition, not a natural one. Indeed, the division is a human failure; it reflects our limited capacity to comprehend the whole. — Bill Wulf
Chapter 2
Related Work
We present a broad overview of computational grids and potential grid tools to provide
context for our work (§2.1). We discuss reflective systems (§2.2) as our reflective graph
and event model is based on a reflective architecture. We discuss the event model and its
use in various settings to support extensibility and flexibility (§2.3). We consider aspect-oriented programming and its potential relationship with event-based extension
mechanisms (§2.4). Finally, we present several approaches to integrating fault-tolerance
techniques into distributed systems, including CORBA-based systems (§2.5).
2.1 Computational grids
Foster et al. have identified three approaches to building computational grids: the
commodity approach, the service approach, and the integrated architecture approach
[FOST99]. In the commodity approach, existing commodity technologies, e.g., HTTP,
CORBA, COM, Java, serve as the basic building blocks of the grid [ALEX96, BALD96,
FOX96, CHRI97]. In the service approach, as exemplified by the Globus project, a set of
basic services such as security, communication, and process management are provided and
exported to developers in the form of a toolkit [FOST97]. In the integrated architecture
approach, resources are accessed through a uniform model of abstraction [GRIM98]. For
example, Legion enables the development of grid applications by providing a uniform
object abstraction to encapsulate and represent grid resources, e.g., compute, data, and
people resources. A motivating factor for both the service and integrated architecture
approach is that the set of commodity services provided by current technology does not
suffice to meet the requirements of computational grids [FOST99].
We present several systems below and comment on the suitability of these systems for
developing grid applications.
2.1.1 PVM and MPI
PVM (Parallel Virtual Machine) and MPI (Message Passing Interface) are the two
best-known message passing environments in grid computing [GEIS94, GEIS97]. They
provide programmers with library support for writing applications with explicit message
send and receive operations. In addition to message passing, PVM and MPI provide the
illusion of an abstract virtual machine that supports the creation and deletion of processes
or tasks. As of this writing, MPI has eclipsed PVM to become the primary message
passing standard, and is supported by all major computer manufacturers.
Both Legion and Globus provide support for MPI [FOST99]. Legion also provides
support for PVM. We describe below several systems layered on top of PVM or MPI that
provide fault-tolerance capabilit ies. While these systems have not yet been ported to grid
prototypes, they are representative of the kind of systems that are likely to be incorporated
into grids. It is interesting to note that many of these systems are geared towards scientific
computing; they provide support for a style of application known as SPMD (Single
Program Multiple Data), in which identical processes each operate on a subdomain of the
application data. SPMD applications are often time-stepped, with periodic exchanges of
information at well-defined intervals.
2.1.1.1 DOME
DOME (Distributed Object Migration Environment) runs on top of PVM and
supports application-level fault-tolerance in heterogeneous networks of workstations
[BEGU97]. DOME defines a collection of data parallel objects such as arrays of integers or
floats that are automatically distributed over a network of workstations. DOME supports
the writing of SPMD applications in which a process is replicated on multiple nodes and
executes its computation over a different subset of the data. DOME provides support for
the checkpointing of SPMD applications. Similarly to the checkpointing techniques that
we use, DOME’s checkpoints support the recovery of applications on heterogeneous
architectures.
2.1.1.2 CVMULUS
CVMULUS is a library package for visualization and steering of fault-tolerant SPMD
applications for use on top of PVM [GEIS97]. In CVMULUS, programmers specify the
data decomposition of their applications. CVMULUS automatically uses this information
for checkpoint/recovery and is able to reconfigure applications even if the recovered
application uses fewer workers or tasks. Since CVMULUS is geared towards SPMD
applications, the consistency of application-wide checkpoints is easily maintained.
2.1.1.3 Other extensions to PVM and MPI
Fail-Safe PVM is an extension of PVM to provide application-transparent fault
tolerance based on checkpoint and recovery [LEON93]. While it achieves transparency,
Fail-Safe PVM requires modifications to the PVM daemons to monitor the flow of
messages between PVM tasks. Silva et al. provide a user-level library called PUL-RD to
support checkpointing and recovery of SPMD applications on top of MPI [SILV95].
Programmers are responsible for describing the data layout of their applications. Similarly
to CVMULUS, the PUL-RD library supports the recovery of applications with fewer
processes.
2.1.2 Isis, Horus and Ensemble
Isis, Horus and Ensemble are representative of systems that use a process group
abstraction to structure distributed applications [BIRM93, RENE96, HAYD98]. The central
tenet of such systems is that support for programming with distributed groups is the key to
writing reliable applications.
Process groups enable the realization of a virtually synchronous model of computation
wherein the notion of time is defined based on the ordering of messages [LAMP78].
Typically, a programmer uses various forms of multicast primitives for communication
with members of a group, e.g., causal multicast or totally ordered multicast. The receipt of
messages within a group may be ordered with respect to group membership changes,
thereby enabling programmers to write algorithms such that group members can logically
take some actions “at the same time” with respect to failures. Failures of processes are
treated as changes in the membership of a group. Only processes that are members of a
group are allowed to process messages. Thus, group membership, as seen in Isis, simulates
a fail-stop model in which processes fail by halting [SCHN83, SABE94].
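The failure-as-membership-change idea can be illustrated with a toy sketch (our own, not Isis code): once a view change removes a process, its messages are no longer delivered, so the survivors see it as having halted.

```python
# Toy illustration of a simulated fail-stop model via group membership.
view = {"id": 1, "members": {"p1", "p2", "p3"}}

def deliver(sender, msg, inbox):
    if sender in view["members"]:       # only current members get delivered
        inbox.append((sender, msg))

def view_change(failed):
    view["members"].discard(failed)     # failure == leaving the group
    view["id"] += 1

inbox = []
deliver("p2", "update", inbox)
view_change("p2")                       # p2 declared failed by the group
deliver("p2", "late-update", inbox)     # dropped: p2 is out of the view
print(inbox)                            # [('p2', 'update')]
```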
The process group model has often been criticized on the basis of the end-to-end
argument [SALT90]. Critics of the model argue that the ordering properties guaranteed by
group communication primitives are provided at too low a level of abstraction, and in
some cases, may be unnecessary to meet the specifications of an application [CHER93].
Proponents of the model argue that the services provided by the model are invaluable in
developing fault-tolerant distributed applications [RENE93, BIRM94, RENE94].
It is interesting to view the progression of systems developed at Cornell University,
from Isis to Horus, and then to Ensemble, as a response to the end-to-end argument. While
Isis was a monolithic system, both Horus and Ensemble allow developers to configure and
customize the protocol stacks of processes to meet the needs of applications. In Ensemble,
the protocol stack of processes can be configured at run-time using an event-driven
paradigm, unlike the protocol stack of Horus which has to be configured statically.
The process group model has found acceptance in several domain areas, including
finance, groupware applications, telecommunication, military systems, factory automation
and production control [BIRM93]. For more information on the model and its applications
to Internet applications, please see the recent book by Birman [BIRM96].
Our framework differs in that its focus is on integrating fault-tolerance techniques in
object-based systems whereas the focus of Isis, Horus and Ensemble, is in supporting the
process group abstraction. The two are not mutually exclusive; it is possible to layer a
reflective framework on top of ordered group communication primitives [FABR98].
For grid applications, it is too early to determine how much of a role the process group
model will play. However, the evolution from Isis to Ensemble points to a common design
goal of supporting flexibility and extensibility (§2.3).
2.1.3 Linda, Piranha and JavaSpaces
In Linda, processes in an application cooperate by communicating through an
associative shared memory abstraction called tuple space [CARR89]. A tuple in tuple space
names a data element that consists of a sequence of basic data types such as integers,
floats, characters and arrays. Linda defines four basic operations, out, in, rd and eval, to
access tuple space. Out deposits tuples in tuple space; in and rd search tuple space (in also
removes the matched tuple, while rd leaves it in place). A nice property of in and rd is that
they can specify a generic pattern with which to search tuple space. Finally, eval is used to
create a new process. The primary advantages of Linda are that its four operations are
simple to learn and that its shared memory abstraction is easy for programmers to use.
PLinda is an extension to Linda to provide fault-tolerance through
the checkpointing and recovery of tuple space and the use of a commit protocol to deposit
and read tuples from tuple space [JEON94]. Another fault-tolerant version of Linda is
Piranha [CARR95]. Piranha supports a style of computation known as master-worker
parallelism, in which a master process generates a set of tasks to be consumed by workers.
Piranha enables users to treat a collection of hosts as a computational resource base on
which to assign tasks. When a user reclaims a host, e.g., by pressing a key or clicking the
mouse, Piranha automatically reassigns the task to another host, thus ensuring that an
application eventually completes. The act of reclaiming a host can be treated as a failure
and is analogous to leaving a group in a system with group membership.
Linda and its derivatives are particularly well-suited to a master-worker style of
computation—a style that is prevalent in grid applications. We expect that, over time, a
Linda-like abstraction will be ported to computational grids. We note that Linda is
currently a commercial product supported by Scientific Computing Associates, Inc.,
under the trade name Paradise®.
The Linda tuple model heavily influenced the development of the Jini JavaSpaces™
Specification [SUN99A]. Similarly to Linda, JavaSpaces provide the abstraction of an
associative shared memory in which Java programs can deposit and retrieve information.
JavaSpaces improve upon the Linda model in that Java programs can be automatically
notified of changes in the JavaSpace through events [SUN99A]. Both the Linda tuple space
and JavaSpaces can be viewed as instances of a blackboard architecture in which
different components interact and coordinate actions based on state changes in a shared
repository [SHAW96].
2.2 Reflection
Smith introduced the concept of reflection: that of a computational process that can
reason about itself and manipulate representations of its own internal structure [SMIT82].
Two properties characterize reflective systems: introspection and causal connection.
Introspection enables a computational process to have access to its own internal structures.
Causal connection enables the computational process to modify its behavior directly by
modifying its internal data structures, i.e., there is a cause-and-effect relationship between
changing the values of the data structures and the behavior of the process. The internal
data structures are said to reside at the metalevel while the computation itself resides at the
baselevel; thus the metalevel controls the behavior of the baselevel.
Reflection provides a principled means of achieving open engineering, i.e., of
extending the functionality of a system in a disciplined manner [BLAI98]. A key attribute
of reflective systems is the separation of concerns between the metalevel and the
baselevel. For example, Fabre et al. incorporated replication techniques into objects using
the reflective programming language Open-C++ [FABR95]. The implementation of the
replication techniques was performed at the metalevel with little change to the underlying
baselevel application. The design and implementation of the replication techniques were
separated from the design and implementation of the actual application, thus allowing the
replication techniques to be composable with many applications. In general, reflective
architectures enable the composition of non-functional concerns with the underlying
computational process [STRO96].
Another advantage of reflective architectures is that they enable flexibility and
extensibility of functionality. Reflective architectures have been used in such diverse areas
as programming languages [MAES87, WATA88, KICZ91, AKSI98, TATS98, MOSS99,
WELC99], operating systems [YOKO92], real-time systems [SING97, STAN98, STAN99],
fault-tolerant real-time systems [BOND93], agent-based systems [CHAR96], dependable
systems [AGHA94], and distributed middleware systems, e.g., OpenORB [BLAI98],
FlexiNet [HAYT98], OpenCorba [LEDO99] and Legion [NGUY99].
A feature common to all reflective systems is that they answer two questions: What
internal structure or metalevel information (meta-information) is exposed to developers?
How does one access the metalevel? The answer to the first question is application-
dependent. For example, in real-time systems such as FERT or Spring [BOND93, STAN98]
the meta-information includes timing constraints of tasks, deadlines, and precedence
constraints. In a programming language such as CLOS, the meta-information includes
slots and methods [KICZ91]. In object-based distributed systems, meta-information can
include methods, arguments and replies [BLAI98, HAYT98, LEDO99, VILE97]. The answer
to the second question also varies. A popular method of programming the metalevel is
through an object-oriented paradigm in which a metalevel object defines and controls the
behavior of baselevel objects [MAES87, KICZ91]. Other means of accessing meta-
information include using compiler technology [FABR95, CHIB95, TATS98], configuration
files [MOSS99, WELC99], and events [NGUY98, PAWL98].
The reflective models developed in this dissertation reflect our target environment of a
computational grid. Incorporating fault-tolerance techniques in a distributed application—
a set of cooperating objects—requires manipulation of the internal as well as external
aspects of an object. Our models regulate both intra-object interactions, i.e., interactions
between modules inside an object, and inter-object interactions, i.e., interactions between
objects. This dual aspect of our models enables the integration of application-wide
algorithms such as checkpointing, in contrast to other reflective systems, whose focus has
been on integrating techniques such as replication in server objects [FABR95, GUER97,
BLAI98, HAYT98].
A further difference between our architecture and other reflective middleware
architectures is that we do not use a metaobject protocol to control the behavior of the
baselevel [AGHA94, FABR95, GUER97, FABR98, HAYT98, LEDO99]. Instead, we present a
graph-and-event-based interface accessible through simple C++ library calls. In contrast,
other reflective approaches such as OpenCorba [LEDO99] and Garf [GUER97] rely on the
Smalltalk programming language. We believe that presenting a C++ based interface
expands our potential community of developers.
2.3 Events
Events have been used in a variety of contexts [SHAW96]: in graphical user interfaces,
to build protocol stacks [BERS95, BHAT97, HAYD98, VILE97], in integrated systems
[SULL96], and as a generic mechanism for component interactions [BENN95]. We separate
our discussion of events into two sections: local events and distributed events. Local events
propagate within the same address space whereas distributed events propagate to a
different address space.
2.3.1 Local events
2.3.1.1 Protocol stacks
Many projects, such as SPIN [BERS95], Coyote [BHAT97] and Ensemble [HAYD98],
use an event-based paradigm for flexibility and extensibility. SPIN is a dynamically
extensible operating system that uses events as its extension mechanism. A SPIN event is
used to notify the system of a state change or to request a service. For example, an IP
extension to the kernel could announce the event PacketArrived. Events in SPIN are fine-
grained, reflecting their use in an operating system. Likewise, events in the Coyote project
are fine-grained, reflecting their use in a kernel designed for network protocols. Coyote
extends the x-kernel [HUTC91] and enables the construction of micro-protocols that
communicate via events. Micro-protocols implement low-level properties, e.g.,
acknowledging that a message has been received or maintaining a membership list of live
processes. By composing micro-protocols, the Coyote protocol stack can be easily
configured to implement higher-level properties, e.g., group remote procedure calls with
acknowledgment. Coyote was designed primarily for network protocols and so the set of
pre-defined events relate mostly to messages, e.g., Message_Inserted_Into_Bag or
Message_Ready_To_Be_Sent. Ensemble uses events as the primary mechanism for
composing micro-protocols and supporting the process group abstraction. Example events
in Ensemble include Send-Message and Leave-Group.
The set of events exported by a system depends on the target environment and defines
the extension vocabulary with which developers can extend functionality. Since we target
an object-based system implemented over a message-passing communication layer, we
export events such as MessageSend and MethodReceived. Approaches such as Coyote or
our own in which events manipulate data structures (e.g., messages) contained in shared
data structures (e.g., message repository), can be viewed as a blackboard architecture
augmented with implicit invocations [SHAW96].
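The micro-protocol composition style described above can be sketched with a small event bus. The EventBus class and the event name below are illustrative assumptions, not Coyote's or Ensemble's actual API: each micro-protocol registers a handler for a named event, and raising the event implicitly invokes every registered handler, so protocols compose without referring to one another.

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical event bus: micro-protocols subscribe handlers to named
// events; raise() performs the implicit invocation of all subscribers.
class EventBus {
public:
    using Handler = std::function<void(std::string&)>;

    void subscribe(const std::string& event, Handler h) {
        handlers_[event].push_back(std::move(h));
    }

    void raise(const std::string& event, std::string& message) {
        for (auto& h : handlers_[event])
            h(message);   // each micro-protocol sees (and may edit) the message
    }

private:
    std::map<std::string, std::vector<Handler>> handlers_;
};
```

An acknowledgment micro-protocol and a logging micro-protocol can then be stacked simply by subscribing both to the same message event.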
2.3.1.2 Graphical user interface
Events have been widely used in implementing graphical user interfaces, e.g., the
MacOS®, Microsoft Windows®, and Java’s Abstract Window Toolkit. Events enable the
separation of the visual aspects of a program from the actual computation. Typical events
in these systems deal with various aspects of the desktop metaphor, e.g., mouse, windows,
buttons, menus, keyboard input. Programmers can register event handlers to be notified of
user actions and take appropriate action. However, coordinating events may be a difficult
task. Thus, most environments provide tools to facilitate the development of graphical
user interfaces, e.g., Java Swing, Visual Basic.
2.3.1.3 JavaBeans
JavaBeans™ is the component technology developed by Sun Microsystems for use
within the Java platform [SUN99B]. A bean is a reusable software artifact that can be
manipulated visually using a builder tool. Beans can communicate with one another using
an event paradigm. The advantages of using beans are that they are portable across
heterogeneous architectures and that many tool builders are actively developing products
to support the development of JavaBeans.
2.3.2 Distributed events
Distributed events are used to communicate information between remote objects or
processes. In CORBA, the Event Service allows an object to register its interest in events
raised by other objects [BENN95]. CORBA defines two roles for objects: suppliers and
consumers. Suppliers produce events; consumers process them. Suppliers and
consumers may be directly linked, in which case events flow directly from the suppliers to
the consumers. Alternatively, an event channel may be defined to serve as an intermediary
object between suppliers and consumers. Using an event channel fully decouples suppliers
from consumers—consumers need not be active when suppliers deposit events on an
event channel. Furthermore, event channels may provide added functionality such as
filtering and persistence. The Jini Distributed Event Specification provides functionality
similar to that of CORBA’s Event Service [SUN99A]. It also provides additional features,
such as the ability to bound the time during which an object is interested in an event raised
by some other object via leasing [SUN99A]. In Jini terminology, an event listener may
register to be notified of an event on a one-time basis, forever, or for a specified time
period.
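The decoupling that an event channel provides can be sketched as follows. The EventChannel class is a hypothetical, single-address-space stand-in for a CORBA-style event channel: it ignores the network, concurrency, and the real service's push/pull variants, but shows how buffering lets consumers attach after suppliers have already deposited events.

```cpp
#include <deque>
#include <functional>
#include <string>
#include <vector>

// Hypothetical event channel: suppliers push(); consumers attach() later.
// Neither side ever holds a reference to the other.
class EventChannel {
public:
    using Consumer = std::function<void(const std::string&)>;

    void push(const std::string& event) {      // supplier side
        if (consumers_.empty())
            buffer_.push_back(event);          // persistence: no consumer yet
        else
            deliver(event);
    }

    void attach(Consumer c) {                  // consumer side
        consumers_.push_back(std::move(c));
        while (!buffer_.empty()) {             // replay events sent earlier
            deliver(buffer_.front());
            buffer_.pop_front();
        }
    }

private:
    void deliver(const std::string& event) {
        for (auto& c : consumers_) c(event);
    }

    std::deque<std::string> buffer_;
    std::vector<Consumer> consumers_;
};
```

Because events are buffered, a consumer need not be active when the supplier deposits them, which is exactly the full decoupling described above.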
The exoevent notification model developed in this dissertation is similar to both the
CORBA and the Java Distributed Event specifications in that it supports the flexible
propagation of events between objects. The distinguishing features of our model are that it
unifies the concept of exceptions and events, i.e., an exception is simply a special kind of
event, and it allows programmers to specify the propagation of events on a per-
application, per-object or per-method basis. The exoevent notification model does not
support the concept of leasing.
While we use distributed events in our work for the dissemination of data to support
fault-tolerance algorithms, we note that the publish/subscribe model supported by events
is generic. As an example, the Department of Defense’s High Level Architecture uses the
publish/subscribe model to propagate information about entities in distributed simulations
[DMSO98]. As another example, the Jini Discovery and Join Specification regulates how
devices can discover the presence of other devices on a network [SUN99A].
2.4 Aspect-oriented programming
The use of the event paradigm to extend functionality for middleware systems is
related to the issue of crosscutting and weaving in aspect-oriented programming [KICZ97].
Crosscutting is the concept that extensions to a modularly-designed program cannot be
constrained within the bounds of the original program decomposition. An example of
crosscutting in an object-oriented program would be the addition of synchronization
primitives at the beginning of each method. Kiczales’ thesis is that crosscutting is
common in large software systems. Our experiences with middleware systems corroborate
his thesis; aside from implementing its functional requirements, an object may also handle
issues such as argument marshalling, security, debugging, performance monitoring and
synchronization. In aspect-oriented programming technology, these issues are called
aspects. Aspect-oriented programming languages elevate aspects to first-class status and
provide a clean separation between the functional decomposition of a program—objects
or modules—and non-functional requirements, which pertain to the way objects and
modules relate to one another [HIGH99].
After aspects are elevated to first-class status they must be composed with the
underlying program. This process is known as weaving and seems closely related to events
in the sense that events can be used to implement weaving. For example, an aspect for
debugging could be implemented easily in an object-based system by inserting an event
handler that intercepts methods and logs them to storage for future replay. An interesting
avenue of research would be to investigate the use of an aspect-oriented programming
language to extend the functionality of objects in computational grids, or alternatively, to
investigate the suitability of the event paradigm for weaving aspects. Pawlak et al. are
currently investigating this line of research [PAWL98].
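As a concrete illustration of weaving an aspect through an event-style hook, consider the sketch below. The Dispatcher class and its methods are invented for this example (not drawn from any AOP system): a tracing handler is installed independently of the object and sees every method invocation, without the base functionality being modified.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical dispatcher: every invocation first raises a
// MethodReceived-style hook, which is where an aspect is woven in.
class Dispatcher {
public:
    using MethodHook = std::function<void(const std::string&, int)>;

    void setHook(MethodHook h) { hook_ = std::move(h); }

    int dispatch(const std::string& method, int arg) {
        if (hook_) hook_(method, arg);   // the woven aspect sees every call
        if (method == "square") return arg * arg;
        return arg;                       // base behavior for other methods
    }

private:
    MethodHook hook_;
};
```

A debugging aspect then reduces to installing one hook that records each intercepted call for later replay; the base-level methods are untouched, which is the crosscutting separation aspect-oriented programming aims for.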
2.5 Integrating fault tolerance in distributed systems
Fabre et al. present an excellent analysis of different approaches for integrating fault-
tolerance in distributed systems [FABR95, FABR98]. They distinguish between three main
approaches: the system approach, the library approach and the inheritance approach. In
the system approach, the runtime system provides support for fault-tolerance. For
example, Delta-4 [POWE94] offers several replication strategies such as passive, semi-
active and active replication to Delta-4 application programmers. In the library approach,
a set of functions is provided at the application-level to support a set of fault-tolerance
algorithms. For example, ISIS [BIRM93], Horus [RENE96] and Ensemble [HAYD98],
provide developers with various forms of ordered communication primitives. In the
inheritance approach, an object can inherit fault-tolerance properties such as persistence
and recoverabilit y from a base class. Examples of this approach include Avalon/C++
[DETL88] and Arjuna [ARJU92]. Fabre analyzes these approaches in terms of transparency,
reusability and composability, and argues that none meets all three criteria simultaneously.
Fabre proposes the use of reflective techniques to meet these criteria and shows how to
integrate replication techniques into distributed objects using the reflective language
Open-C++ [FABR95, FABR98]. Other systems that advocate the use of reflection to
incorporate fault-tolerance techniques include MAUD [AGHA94] and Garf [GUER97].
A fertile area of research has been to integrate fault-tolerance techniques into CORBA.
Moser et al. propose a fault-tolerance framework that implements fault-tolerance
management services both above and below an object request broker (ORB) [MOSE99].
Other projects such as Electra and Orbix+Isis integrate replication and group mechanisms
inside the ORB itself [MAFF95, LAND97]. DOORS (Distributed Object-Oriented Reliable
Service) provides fault-tolerance services as CORBA horizontal services [SCHO98].
Elnozahy et al. provide a library of fault-tolerance techniques that can be used in both
CORBA and DCE environments [ELNO95]. Except for DOORS, which is implemented
above the ORB layer, all the other projects use interception methods to implement
replication services. Interception is implemented by modifying the ORB itself [LAND97],
by providing a library to be called from within the ORB [ELNO95], or by using features of
the operating system [MOSE99]. The Orbix ORB includes the notion of filters to intercept
method calls. However, Marzullo’s group at the University of California, San Diego,
reported difficulties in integrating the message-logging fault-tolerance technique with
Orbix [NAMP99]. Marzullo et al. suggest that an event-driven model would have
alleviated the reported difficulties [NAMP99].
The need to extend the functionality of ORBs has led several researchers to adopt a
reflective architecture to structure ORB implementations [BLAI98, HAYT98, LEDO99]. Our
development of the RGE and exoevent notification models also provides an extension
mechanism. The novelty of this work is to suggest the use of events as the primary
structuring mechanism for designing object request brokers and to specify both inter- and
intra-object communication within a unified model.
2.6 Summary
In designing our models, we drew inspiration from reflective systems as well as
previous work on flexible protocol stacks. Our approach differs in two respects from most
CORBA-based reflective middleware approaches: (1) we use a simple graph and event-
based interface for extending object functionality instead of a metaobject protocol, and
(2), our reflective models are designed to extend the functionality of applications, not just
single server objects. In the next chapter, we present the cornerstone of our framework, the
reflective graph and event model. We show an application of our model in designing a
protocol stack and extending it with new functionality.
Make everything as simple as possible, but not simpler. — Albert Einstein (1879-1955)
Chapter 3
Reflective Graph and Event Model
The cornerstone of our framework is the specification of the reflective graph and event
(RGE) execution model. It provides a structural framework for providing basic object
functionality such as invoking methods and marshalling and unmarshalling parameters,
similar to an object request broker (ORB) in CORBA systems [OMG95]. In addition, the
model provides a generic extension mechanism for incorporating new functionality into
objects—such functionality is encapsulated into reusable code artifacts, or modules. Thus,
the RGE model provides a common framework for fault-tolerance designers and tool
developers, and enables the integration and composition of fault-tolerance modules into
programming tools.
The novelty of this work is to suggest the use of events as the primary structuring
mechanism for designing object request brokers and to use a single model to specify both
inter- and intra-object communication. The RGE model employs graphs for inter-object
communication and events for intra-object interactions. Graphs represent interactions
between objects; a graph node is either a member function call on an object or another
graph, arcs model data and control dependencies, and each input to a node corresponds to
a formal parameter of the member function. Events specify interactions between modules
inside objects. Graphs and events are the building blocks with which fault-tolerance
developers can incorporate functionality inside objects and exchange protocol information
between objects.
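The graph side of the model can be illustrated with a minimal sketch. The Graph and Node types below are our invention for exposition (not the RGE implementation): nodes stand for method calls, arcs for data dependencies, and evaluating a node first evaluates the producer nodes its inputs depend on.

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical RGE-style program graph. A node is a method call; its
// inputs name the producer nodes whose results feed its parameters.
struct Node {
    std::function<int(const std::vector<int>&)> method;  // member-function call
    std::vector<std::string> inputs;                     // arcs to producers
};

class Graph {
public:
    void add(const std::string& name, Node n) { nodes_[name] = std::move(n); }

    // Evaluate a node, recursively resolving its data dependencies first.
    int eval(const std::string& name) {
        const Node& n = nodes_.at(name);
        std::vector<int> args;
        for (const auto& dep : n.inputs)
            args.push_back(eval(dep));   // follow the arc to the producer
        return n.method(args);
    }

private:
    std::map<std::string, Node> nodes_;
};
```

Control dependencies, sub-graphs as nodes, and remote invocation are omitted here; the sketch only shows how arcs carry each producer's result into a consumer's formal parameters.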
The RGE model is reflective because it exposes the structure of objects (i