
  • Integrating Fault-Tolerance Techniques in Grid Applications

    A Dissertation

    Presented to

    the Faculty of the School of Engineering and Applied Science

    at the

    University of Virginia

    In Partial Fulfillment

    of the Requirements for the Degree

    Doctor of Philosophy (Computer Science)

    by

    Anh Nguyen-Tuong

    August 2000

  • © Copyright by

    Anh Nguyen-Tuong

    August 2000

    All Rights Reserved

  • Abstract

    The contribution of this thesis is the development of a framework for simplifying the construction of grid computational applications. The framework provides a generic extension mechanism for incorporating functionality into applications and consists of two models: (1) the reflective graph and event model, and (2) the exoevent notification model. These models provide a platform for extending user applications with additional capabilities via composition. While the models are generic and can be used for a variety of purposes, including security, resource accounting, debugging, and application monitoring [VILE97, FERR99, LEGI99, MORG99], we apply the models in this dissertation towards the integration of fault-tolerance techniques.

    Using the framework, fault-tolerance experts can encapsulate algorithms using the two reflective models developed in this dissertation. Developers incorporate these algorithms into their tools and augment the set of services provided to application programmers. Application programmers then use these augmented tools to increase the likelihood that their programs will complete successfully.

    We claim that the framework enables the easy integration of fault-tolerance techniques into object-based grid applications. To support this claim, we have mapped onto our models five different fault-tolerance algorithms from the literature: 2PCDC and SPMD checkpointing, passive and stateless replication, and pessimistic method logging. We incorporated these algorithms into three common grid programming tools: the Message Passing Interface (MPI), Mentat, and the Stub Generator (SG). MPI is the de facto standard for message passing; Mentat is a C++-based parallel programming environment; and SG is a popular tool for writing client/server applications.

    We measured the ease with which techniques can be integrated into applications by the number of additional lines of code that a programmer would have to write. In the best case, programmers needed to add three lines of code. In the worst case, programmers had to write functions to save and restore the local state of their objects. However, such functions are simple to write and exploit programmers' knowledge of their applications.

  • Acknowledgements

    To my ancestors, who have trekked down this path, and cleared a road for others to follow, three centuries is not that long after all

    To that turtle in Hanoi, forever gazing at the pond, the smell of incense on a hot summer day

    To the committee, for helping me to ascertain, the inside from the outside, the lines delicately drawn

    To John Knight, for ensuring a smooth landing

    To Andrew, my advisor and mentor, for showing me the difference between a millisecond and a microsecond, and for taking me along on his adventures

    To Karine, my eternal accomplice, whose support and love, are the real foundation of this research

    To my parents, whose journey I have yet to fully appreciate, cam on nhieu

    To my sister, Vi, the dancer, the musician, the pharmacist, the photographer, who never ceases to amaze me, may she appreciate her roots on her voyage home

    To Madgy, Bootsy, Noushka, Kona, rain or shine, eyes always sparkling, heart purring and tail wagging

    Special thanks to Nuts, whose wit is as sharp as his intellect, for all his insights, technical, culinary and otherwise

    And to all my friends, Chenxi, Dave, John, Karp, Glenn, Matt, Mike, Paco, Rashmi, the Dinner Gang, who have made this trip so enjoyable

  • Table of Contents

    Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i

    Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii

    Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv

    List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii

    List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x

    Chapter 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
    1.1 Current support for fault tolerance in grids . . . . . . . . . . . . . . . . . 4
    1.2 Properties of the framework . . . . . . . . . . . . . . . . . . . . . . . . . 5
    1.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
    1.4 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
    1.4.1 Grid models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
    1.4.2 Reflection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
    1.4.3 Legion grid environment . . . . . . . . . . . . . . . . . . . . . . . . . . 9
    1.5 Framework foundation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
    1.5.1 Framework summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
    1.6 Constraints and assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . 14
    1.7 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

    Chapter 2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
    2.1 Computational grids . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
    2.1.1 PVM and MPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
    2.1.1.1 DOME . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
    2.1.1.2 CUMULVS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
    2.1.1.3 Other extensions to PVM and MPI . . . . . . . . . . . . . . . . . . . . . 20
    2.1.2 Isis, Horus and Ensemble . . . . . . . . . . . . . . . . . . . . . . . . . . 20
    2.1.3 Linda, Piranha and JavaSpaces . . . . . . . . . . . . . . . . . . . . . . . 22
    2.2 Reflection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
    2.3 Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
    2.3.1 Local events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
    2.3.1.1 Protocol stacks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
    2.3.1.2 Graphical user interface . . . . . . . . . . . . . . . . . . . . . . . . . 27
    2.3.1.3 JavaBeans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
    2.3.2 Distributed events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
    2.4 Aspect-oriented programming . . . . . . . . . . . . . . . . . . . . . . . . . . 29
    2.5 Integrating fault tolerance in distributed systems . . . . . . . . . . . . . . . 30
    2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

    Chapter 3 Reflective Graph and Event Model . . . . . . . . . . . . . . . . . . . . 33
    3.1 Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
    3.1.1 Graph API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
    3.2 Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
    3.2.1 Event API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
    3.3 Overhead for graphs and events . . . . . . . . . . . . . . . . . . . . . . . . . 43
    3.4 Structure of an object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
    3.4.1 Overview of a protocol stack . . . . . . . . . . . . . . . . . . . . . . . . 45
    3.4.2 Example of incorporating new functionality . . . . . . . . . . . . . . . . . 47
    3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

    Chapter 4 Exoevent Notification Model . . . . . . . . . . . . . . . . . . . . . . . 50
    4.1 Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
    4.1.1 Registering interest in an exoevent . . . . . . . . . . . . . . . . . . . . 52
    4.1.2 Object scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
    4.1.3 Method scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
    4.2 Policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
    4.2.1 The notify-root policy . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
    4.2.2 The notify-client policy . . . . . . . . . . . . . . . . . . . . . . . . . . 56
    4.2.3 The notify-third-party policy . . . . . . . . . . . . . . . . . . . . . . . 57
    4.2.4 The notify-hybrid policy . . . . . . . . . . . . . . . . . . . . . . . . . . 58
    4.3 Application programmer interface . . . . . . . . . . . . . . . . . . . . . . . . 61
    4.4 Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
    4.5 Example exoevents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
    4.6 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
    4.6.1 Failure detection – push model . . . . . . . . . . . . . . . . . . . . . . . 64
    4.6.2 Failure detection – pull model . . . . . . . . . . . . . . . . . . . . . . . 65
    4.6.3 Failure detection – service model . . . . . . . . . . . . . . . . . . . . . 66
    4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

    Chapter 5 Mappings of Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . 69
    5.1 Checkpointing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
    5.1.1 SPMD checkpointing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
    5.1.1.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
    5.1.1.2 Mapping SPMD checkpointing . . . . . . . . . . . . . . . . . . . . . . . 77
    5.1.1.3 Summary of SPMD checkpointing . . . . . . . . . . . . . . . . . . . . . . 80
    5.1.2 2-phase commit distributed checkpointing . . . . . . . . . . . . . . . . . . 80
    5.1.2.1 Checkpointing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
    5.1.2.2 Recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
    5.1.2.3 Mapping 2-phase commit distributed checkpointing . . . . . . . . . . . . . 83
    5.1.2.4 Summary of 2PCDC algorithm . . . . . . . . . . . . . . . . . . . . . . . . 86
    5.2 Logging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
    5.2.1 Pessimistic message logging . . . . . . . . . . . . . . . . . . . . . . . . 89
    5.2.2 Mapping pessimistic message logging . . . . . . . . . . . . . . . . . . . . 91
    5.2.3 Optimization: pessimistic method logging . . . . . . . . . . . . . . . . . . 94
    5.2.4 Legion system-level support . . . . . . . . . . . . . . . . . . . . . . . . 95
    5.2.5 Summary of pessimistic logging . . . . . . . . . . . . . . . . . . . . . . . 96
    5.3 Replication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
    5.3.1 Passive replication . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
    5.3.1.1 Mapping passive replication . . . . . . . . . . . . . . . . . . . . . . . 100
    5.3.1.2 Legion system-level support . . . . . . . . . . . . . . . . . . . . . . . 101
    5.3.1.3 Summary of passive replication . . . . . . . . . . . . . . . . . . . . . . 102
    5.3.2 Stateless replication . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
    5.3.2.1 Mapping stateless replication . . . . . . . . . . . . . . . . . . . . . . 105
    5.3.2.2 Duplicate method suppression . . . . . . . . . . . . . . . . . . . . . . . 108
    5.3.2.3 Summary of stateless replication . . . . . . . . . . . . . . . . . . . . . 108
    5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

    Chapter 6 Integration into Programming Tools . . . . . . . . . . . . . . . . . . . 111
    6.1 MPI (SPMD and 2PCDC Checkpointing) . . . . . . . . . . . . . . . . . . . . . . . 112
    6.1.1 Legion MPI (LMPI) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
    6.1.2 Legion MPI-FT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
    6.1.3 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
    6.1.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
    6.2 Stub generator (passive replication and pessimistic method logging) . . . . . . 121
    6.2.1 Modifications to the stub generator . . . . . . . . . . . . . . . . . . . . 122
    6.2.2 Integration with pessimistic method logging . . . . . . . . . . . . . . . . 123
    6.2.3 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
    6.2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
    6.2.5 Integration with passive replication . . . . . . . . . . . . . . . . . . . . 127
    6.2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
    6.3 MPL – Stateless replication . . . . . . . . . . . . . . . . . . . . . . . . . . 128
    6.3.1 Stateless replication . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
    6.3.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
    6.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132

    Chapter 7 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
    7.1 Stub Generator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
    7.1.1 RPC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
    7.1.2 Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
    7.2 MPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
    7.2.1 RPC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
    7.2.2 BT-MED . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
    7.3 Mentat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
    7.3.1 RPC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
    7.3.2 Complib . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
    7.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150

    Chapter 8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
    8.1 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
    8.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

    References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157

  • List of Figures

    Figure 1: Grid layered implementation models (adapted from [FOST99], pg. 30) . . . 7

    Figure 2: Code fragment and RGE graph. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

    Figure 3: Example use of the graph API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

    Figure 4: Graph interface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

    Figure 5: Example use of events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

    Figure 6: Event interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

    Figure 7: Structure of an object: sample protocol stack. . . . . . . . . . . . . . . . . . . . . . 47

    Figure 8: Adding a handler for logging methods (pseudo-code) . . . . . . . . . . . . . . . 48

    Figure 9: The notify-client policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

    Figure 10: Propagating exoevents to a catcher object . . . . . . . . . . . . . . . . . . . . . . . . 58

    Figure 11: Example propagation of exoevents in the notify-hybrid policy . . . . . . . . 59

    Figure 12: API for exoevents. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

    Figure 13: Failure detection using the push model . . . . . . . . . . . . . . . . . . . . . . . . . . 65

    Figure 14: Failure detection using a pull model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

    Figure 15: Generic failure detection service . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

    Figure 16: Structure of a fault-tolerant application . . . . . . . . . . . . . . . . . . . . . . . . . . 70

    Figure 17: Lost and orphan messages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

    Figure 18: Insertion of checkpoint in SPMD code. . . . . . . . . . . . . . . . . . . . . . . . . . . 76

    Figure 19: Recovery example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

    Figure 20: Interface for checkpoint server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

    Figure 21: Interface for application manager . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

    Figure 22: Raising the “CheckpointTaken” exoevent . . . . . . . . . . . . . . . . . . . . . . . . 78


    Figure 23: Interface for participants. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

    Figure 24: Interface for coordinator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

    Figure 25: 2PCDC code. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

    Figure 26: Interface for participants. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

    Figure 27: Pessimistic message logging (PML). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

    Figure 28: Interface for pessimistic message logging . . . . . . . . . . . . . . . . . . . . . . . . 92

    Figure 29: Handlers for pessimistic message logging . . . . . . . . . . . . . . . . . . . . . . . . 93

    Figure 30: Handler for intercepting outgoing communication. . . . . . . . . . . . . . . . . . 94

    Figure 31: Pessimistic method logging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

    Figure 32: Passive replication example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

    Figure 33: Passive replication interface (primary) . . . . . . . . . . . . . . . . . . . . . . . . . . 100

    Figure 34: Handlers for passive replication (primary) . . . . . . . . . . . . . . . . . . . . . . . 101

    Figure 35: Server lookup with primary replication . . . . . . . . . . . . . . . . . . . . . . . . . 102

    Figure 36: Stateless replication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

    Figure 37: Interface for proxy object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

    Figure 38: Sending a method to a replica. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

    Figure 39: Simple MPI program (myprogram) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

    Figure 40: Legion MPI architecture augmented with FT modules. . . . . . . . . . . . . . 116

    Figure 41: Example of MPI application with checkpointing. . . . . . . . . . . . . . . . . . 119

    Figure 42: Example of saving and restoring user state . . . . . . . . . . . . . . . . . . . . . . 120

    Figure 43: Creating objects using the stub generator . . . . . . . . . . . . . . . . . . . . . . . . 122

    Figure 44: Specification of READONLY methods . . . . . . . . . . . . . . . . . . . . . . . . . 123

    Figure 45: Modified client-side stubs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

    Figure 46: Interface and code for myApp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

    Figure 47: Example of MPL application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130

    Figure 48: Declaring a Mentat class as stateless . . . . . . . . . . . . . . . . . . . . . . . . . . . 131

    Figure 49: Specifying parameters for the stateless replication policy . . . . . . . . . . . 131

    Figure 50: Interface for context object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138

    Figure 51: Context application structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139

    Figure 52: BT-MED application structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142

    Figure 53: Complib application structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147

    Figure 54: Complib main loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

  • List of Tables

    Table 1: Overhead of graphs and events. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

    Table 2: Sample set of events for building protocol stack of an object . . . . . . . . . 45

    Table 3: Example of typical exoevent. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

    Table 4: Exoevent interest for notify-root policy . . . . . . . . . . . . . . . . . . . . . . . . . . 55

    Table 5: Exoevent interest for notify-client policy . . . . . . . . . . . . . . . . . . . . . . . . . 57

    Table 6: Exoevent interest for notify-third-party policy . . . . . . . . . . . . . . 58

    Table 7: Exoevent interest for notify-hybrid policy for object AppA . . . . . . . . . . . 59

    Table 8: Exoevent interest for notify-hybrid policy for object catcher . . . . . . . . . 60

    Table 9: Exoevent interest for notify-hybrid policy for object B . . . . . . . . . . . . . . 60

    Table 10: Overhead in creating and raising exoevents . . . . . . . . . . . . . . . . . . . . . . . 63

    Table 11: Sample exoevents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

    Table 12: “I am Alive” exoevent raised by application objects . . . . . . . . . . . 64

    Table 13: Exoevent raised on object creation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

    Table 14: Exoevent raised by failure detector. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

    Table 15: Data structures for FT modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

    Table 16: Summary SPMD checkpointing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

    Table 17: 2PCDC algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

    Table 18: Recovery in 2PCDC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

    Table 19: Summary 2PCDC algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

    Table 20: Summary of pessimistic logging algorithm . . . . . . . . . . . . . . . . . . . . . . . 96

    Table 21: Summary of the passive replication algorithm . . . . . . . . . . . . . . . . . . . . 102

    Table 22: “Object:MethodDone” notification by replica . . . . . . . . . . . . . . . . . . . . 106

    Table 23: Summary of the stateless replication algorithm . . . . . . . . . . . . . 108

    Table 24: Summary of algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

    Table 25: Sample MPI functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113

    Table 26: Functions to support checkpoint/restart . . . . . . . . . . . . . . . . . . . . . . . . . 116

    Table 27: Options for legion_mpi_run . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

    Table 28: Summary of work required for integration of checkpointing algorithms . . 120

    Table 29: Parameters for legion_set_ft . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

    Table 30: Summary of work required for integration of PML . . . . . . . . . . . . . . . . 126

    Table 31: Parameters for legion_set_ft . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

    Table 32: Summary of work required for integration of passive replication . . . . . 128

    Table 33: Summary of work required for integration of stateless replication . . . . 132

    Table 34: Stub generator – RPC performance (n = 100, α = 0.05) . . . . . . . . . . 136
    Table 35: Context performance (n = 100, α = 0.05) . . . . . . . . . . . . . . . . . 139
    Table 36: Context performance with one induced failure (n = 5, α = 0.05) . . . . . 140
    Table 37: Send and receive performance (n = 20, α = 0.05) . . . . . . . . . . . . . 142
    Table 38: BT-MED performance (n = 20, α = 0.05) . . . . . . . . . . . . . . . . . . 143
    Table 39: Performance with one induced failure (n = 10, α = 0.05) . . . . . . . . . 145
    Table 40: RPC performance (1 worker, n = 100, α = 0.05) . . . . . . . . . . . . . . 146
    Table 41: Complib performance (n = 20, α = 0.05) . . . . . . . . . . . . . . . . . . 149
    Table 42: Complib performance with failure induced (n = 10, α = 0.05) . . . . . . . 149
    Table 43: Application summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 150

    Table 44: Framework overhead based on RPC application . . . . . . . . . . . . . . . . . . 151

    in·fra·struc·ture \ˈin-frə-ˌstrək-chər\ n (1927): The basic facilities, services, and installations needed for the functioning of a community or society, such as transportation and communications systems, water and power lines, and public institutions including schools, post offices, and prisons.

    — American Heritage Dictionary

    Chapter 1

    Introduction

    Throughout history, the development of infrastructures has catalyzed and shaped the evolution of human progress. The construction of Roman roads, the telegraph, the telephone, the modern banking system, the railroad, the interstate highway system, the electrical power grid, and the Internet are all successful infrastructures that have revolutionized how people communicate and interact. At the dawn of the new millennium, we are witnessing the birth of what promises to be the next revolutionary infrastructure. Funded in the United States by several governmental agencies, including the National Science Foundation (NSF), the Defense Advanced Research Projects Agency (DARPA), the Department of Energy (DOE), and the National Aeronautics and Space Administration (NASA), this new infrastructure is often referred to as a metasystem or computational grid [GRIM97A, SMAR97, GRIM98, FOST99, LEIN99].

    A computational grid is a specialized instance of a distributed system [MULL93, TANE94] with the following characteristics: compute and data resources are geographically distributed; they are under the control of different administrative domains with different security and accounting policies; and the hardware resource base is heterogeneous, consisting of PCs, workstations, and supercomputers from different manufacturers. The ability to develop applications over this environment is sometimes referred to as the wide-area computing problem [GRIM99].

    Computational grids present a complex environment in which to develop applications. Writing a grid application is at least as difficult as writing an application for a traditional distributed system: since both are fundamentally distributed-memory systems, programmers must deal with issues of application distribution, communication, and synchronization. Furthermore, grids present additional challenges, as programmers may be required to deal with issues such as security, disjoint file systems, fault tolerance, and placement, to name only a few [GRIM98, FOST99, GRIM99]. Without additional higher-level abstractions, all but the best programmers will be overwhelmed by the complexity of the environment.

    The contribution of this work is the development of a framework for simplifying the construction of grid applications. The framework provides a generic extension mechanism for incorporating functionality into applications and consists of two models: (1) the reflective graph and event model, and (2) the exoevent notification model. These models provide a platform for extending user applications with additional capabilities via composition. While the models are generic and can be used for a variety of purposes, including security, resource accounting, debugging, and application monitoring [VILE97, FERR99, LEGI99, MORG99], we apply the models in this dissertation towards the integration of fault-tolerance techniques. Support for the development of fault-tolerant applications has been identified as one of the major technical challenges to address for the successful deployment of computational grids [GRIM98, FOST99, LEIN99].

    Consider application reliabilit y in a grid. As applications scale to take advantage of a

    grid’s vast available resources, the probabilit y of failure is no longer negligible and must

    be taken into account. For example, consider an application decomposed into 100 objects,

    with each object requiring one week of processing time and placed on its own workstation.

    Assuming that each workstation has an exponentially distributed failure mode with a

    mean-time-to-failure of 120 days, the mean-time-to-failure of the entire application would

only be 1.2 days; thus, the application would rarely finish!
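The arithmetic in this example follows from a standard property of the exponential distribution: the minimum of n independent exponential lifetimes is itself exponential, with mean MTTF/n. A minimal sketch checking the figures (illustrative only; the function names are ours, not part of any grid toolkit):

```python
import math

def application_mttf(host_mttf_days: float, n_objects: int) -> float:
    """Mean time to first failure among n independent hosts, each with an
    exponentially distributed lifetime: min of n exponentials is exponential
    with rate n / host_mttf, hence mean host_mttf / n."""
    return host_mttf_days / n_objects

def prob_completion(host_mttf_days: float, n_objects: int, run_days: float) -> float:
    """Probability that no host fails during the run (memoryless model)."""
    return math.exp(-run_days * n_objects / host_mttf_days)

mttf = application_mttf(120, 100)   # 1.2 days, matching the text
p = prob_completion(120, 100, 7)    # chance a one-week run finishes unaided
```

For the example above, `p` is below one percent, which is why fault tolerance is essential at this scale.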

    Using the framework, fault-tolerance experts can encapsulate algorithms using the two

    reflective models developed in this dissertation. Developers incorporate these algorithms

    into their tools and augment the set of services provided to application programmers.

    Application programmers then use these augmented tools to increase the likelihood that

    their programs will complete successfully.

    We claim that the framework enables the easy integration of fault-tolerance techniques

    into object-based grid applications. To support this claim, we have mapped onto our

    models five different fault-tolerance algorithms from the literature: 2PCDC and SPMD

    checkpointing, passive and stateless replication, and pessimistic method logging. We

chose these algorithms to illustrate the applicability of our framework to a range of fault-

tolerance techniques. Furthermore, we selected these algorithms because we believe that

    they are likely to be used in grid applications. We incorporated these algorithms into three

    common grid programming tools: Message Passing Interface (MPI), Mentat, and Stub

    Generator (SG). MPI is the de facto standard for message passing; Mentat is a C++-based


    parallel programming environment; and SG is a popular tool for writing client/server

    applications.

We measured the ease with which techniques can be integrated into applications based

on the number of additional lines of code that a programmer would have to write. In the

    best case, programmers needed to add three lines of code. In the worst case, programmers

    had to write functions to save and restore the local state of their objects. However, such

    functions are simple to write and exploit programmers’ knowledge of their applications.

    Furthermore, tools to automate save and restore state functions have already been

    demonstrated in the literature [BEGU97, FERR97, FABR98].
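To make the worst case concrete, the save and restore functions amount to serializing an object's local state for the checkpointing layer. The sketch below uses hypothetical object and field names and Python's pickle module for serialization; it is not the actual Mentat, MPI, or SG interface:

```python
import pickle

class HeatSolver:
    """Hypothetical SPMD worker object holding a subdomain of the data."""
    def __init__(self, subdomain, timestep=0):
        self.subdomain = subdomain
        self.timestep = timestep

    # The two functions a programmer writes in the worst case:
    def save_state(self) -> bytes:
        """Serialize local state so the checkpointing layer can store it."""
        return pickle.dumps({"subdomain": self.subdomain,
                             "timestep": self.timestep})

    def restore_state(self, blob: bytes) -> None:
        """Rebuild local state from a checkpoint after a failure."""
        state = pickle.loads(blob)
        self.subdomain = state["subdomain"]
        self.timestep = state["timestep"]
```

The functions exploit the programmer's knowledge of which fields constitute the object's state, which is precisely why they are simple to write by hand and amenable to automation.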

    To the best of our knowledge, we are the first to advocate and use a reflective

    architecture to structure applications in computational grids. Moreover, we are the first to

    demonstrate the integration of a wide range of fault-tolerance techniques into grid

    applications using a single framework.

1.1 Current support for fault tolerance in grids

    Until recently, the foremost priority for grid developers has been to develop working

    prototypes and to show that applications can be written over a grid environment

    [GRIM97B, BRUN98, FOST98]. To date, there has been limited support for application-level

    fault tolerance in computational grids. Support has consisted mainly of failure detection

    services [STEL98, GROP99] or fault-tolerance capabilities in specialized grid toolkits

    [NGUY96, CASA97]. Neither solution is satisfactory in the long run. The former places the

    burden of incorporating fault-tolerance techniques into the hands of application

    programmers, while the latter only works for specialized applications. Even in cases


    where fault-tolerance techniques have been integrated into programming tools, these

    solutions have generally been point solutions, i.e., tool developers have started from

scratch in implementing their solution and have neither shared nor reused any fault-tolerance

    code.

    As these tools are ported to grid environments, or as new tools are developed for grid

    environments, the continued development of fault-tolerant tools as point solutions

    represents wasteful expenditure. We believe a better approach is to provide a structural

    framework in which tool developers can integrate fault-tolerance solutions via a

    compositional approach in which fault-tolerance experts write algorithms and encapsulate

    them into reusable code artifacts, or modules. Tool developers can then integrate these

    modules in their environments.

    1.2 Properties of the framework

Our long-term goal is to simplify the construction of fault-tolerant grid applications.

    We believe that a good solution for achieving this goal should exhibit the following

    properties:

    • P1. Separation of concerns and composition. Designing and writing fault-

    tolerance code are complex and error-prone tasks and should be done by experts,

    not application programmers or tool developers. Thus, fault-tolerance experts

    should be able to encapsulate algorithms into reusable and composable code

    artifacts [NGUY99]. Furthermore, the incorporation of fault-tolerance techniques

    should not interfere with other non-functional concerns such as security or

    accounting.

    • P2. Localized cost. By localized cost, we mean that the use of resources or services

    to implement fault-tolerance techniques should not be charged to applications that


    do not require those resources or services—users should pay only for the level of

    services that they need. In general, localized cost is an important attribute for any

grid service [GRIM97A].

    • P3. Working proof of concept. We should be able to demonstrate the integration of

    fault-tolerance techniques in running applications on a working grid prototype and

    using multiple programming tools. Further, applications with fault-tolerance

    techniques integrated should be able to tolerate more failures than applications that

    do not use any fault-tolerance techniques.

    1.3 Evaluation

    Based on our goal of simpli fying the construction of fault-tolerant applications and the

    properties listed in §1.2, we have derived several criteria by which to evaluate our

framework (next to each criterion, we note in parentheses its related properties):

    • Multiple programming tools. A successful solution should promote and enable the

    incorporation of fault-tolerance techniques into multiple programming tools,

    including legacy tools such as MPI or PVM. Legacy tools are already familiar to

    programmers and should ease the transition from traditional distributed systems to

    grid environments. (P1, P3)

    • Breadth of fault-tolerance techniques. A successful solution should support a wide

    range of fault-tolerance techniques so that application programmers may use the

    one that is most appropriate for their needs. (P1, P2)

• Ease of use. Incorporating fault-tolerance techniques should require only trivial

    or small modifications to applications. (P1, P3)

    • Localized cost. Application programmers should select and pay only for the level

    of fault tolerance that they require. A good framework should not impose a

system-wide solution. Instead, the cost of using fault-tolerance techniques should

    be localized to the applications that use these techniques. (P2)

    • Overhead. Is the overhead of using fault-tolerance techniques due to the algorithm

    or to the framework itself? In deciding whether to incorporate a fault-tolerance


    technique, users should only worry about the algorithmic overhead, i.e., the cost of

    the algorithm itself. (P2, P3)

    1.4 Background

1.4.1 Grid models

    Before describing our framework, we present the implementation models of

    computational grids. As shown in Figure 1, a grid consists of services that run on top of

    native operating systems. These services provide functionality such as authentication,

    failure detection, object and process management, and remote input/output, and are

    accessed via grid libraries. Typically, an application programmer will not access these

    libraries directly, but will use a programming tool such as MPI [GROP99],

    NetSolve [CASA97], Ninf [SATO97] or MPL [GRIM97B], which in turn will call the

    underlying grid libraries. The advantage of this layered model is that application

    programmers can use familiar programming tools and interfaces and are shielded from the

    complexity of accessing grid services.

FIGURE 1: Grid layered implementation models (adapted from [FOST99], pg. 30). The figure shows five layers, from top to bottom: Applications; Programming Tools (MPI, PVM, NetSolve, DOME, MPL, Fortran); Grid Libraries (Globus API, Legion API); Grid Services (security, object/process management, scheduling, failure detection, storage); and Native Operating Systems (Windows NT, Unix).


    There are currently three approaches to building grids: the commodity approach, the

    service approach, and the integrated architecture approach [FOST99]. In the commodity

approach, existing commodity technologies, e.g., HTTP, CORBA, COM, Java, serve as the

    basic building blocks of the grid [ALEX96, BALD96, FOX96, CHRI97]. The primary

advantage of this approach is the use of industry-standard protocols, allowing

    programmers to ride the technology curve as improvements are made to these protocols.

    Furthermore, standard protocols stand a better chance of being adopted by a large

    community of developers. The problem with this approach is that the current set of

    protocols may not be adequate to meet the requirements of computational grids. In the

    service approach, as exempli fied by the Globus project, a set of basic services such as

    security, communication, and process management are provided and exported to

    developers in the form of a toolkit [FOST97]. In the integrated architecture approach,

    resources are treated and accessed through a uniform model of abstraction [GRIM98]. As

    we describe in §1.4.3, our framework targets the integrated approach.

    1.4.2 Reflection

    Our framework relies on the observation that although fault-tolerance techniques are

    diverse by nature, their implementation is not. Indeed, the implementation of the major

families of fault-tolerance techniques relies on common basic primitives such as:

    • intercepting the message stream

    • piggybacking information on the message stream

    • acting upon the information contained in the message stream

    • saving and restoring state

    • detecting failure

    • exchanging protocol information between participants of an algorithm
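Two of these primitives, intercepting the message stream and piggybacking information on it, can be sketched as first-class handlers on a send path. The sketch below is illustrative only; it does not reflect Legion's actual interfaces, and the handler and header names are hypothetical:

```python
class SendPath:
    """Minimal sketch: an outgoing message passes through a chain of
    handlers, each of which may inspect, annotate, or act on it."""
    def __init__(self):
        self.handlers = []

    def add_handler(self, fn):
        """Install an interception point as a first-class entity."""
        self.handlers.append(fn)

    def send(self, msg: dict) -> dict:
        for fn in self.handlers:   # each handler sees and may rewrite msg
            msg = fn(msg)
        return msg                 # would hand off to the transport here

def piggyback_epoch(msg):
    """Piggyback protocol information (e.g., a checkpoint epoch) on the
    message stream, as a checkpointing or logging algorithm would."""
    msg.setdefault("ft_headers", {})["epoch"] = 42   # illustrative value
    return msg

path = SendPath()
path.add_handler(piggyback_epoch)
out = path.send({"body": "method-call"})
```

A receive path would mirror this structure, with handlers that strip the piggybacked headers and act on them.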


    Thus, by providing an execution model whereby these primitives can be expressed and

manipulated as first-class entities, it is possible to achieve our goals of developing fault-

tolerance capabilities independently and integrating them into programming tools.

    We use reflection as the architectural principle behind our execution models. Smith

    introduced the concept of reflection as a computational process that can reason about itself

    and manipulate representations of its own internal structure [SMIT82]. Two properties

    characterize reflective systems: introspection and causal connection.* Introspection

    allows a computational process to have access to its own internal structures. Causal

    connection enables the process to modify its behavior directly by modifying its internal

    data structures—there is a cause-and-effect relationship between changing the values of

    the data structures and the behavior of the process. The internal data structures are said to

    reside at the metalevel while the computation itself resides at the baselevel. The metalevel

controls the behavior at the baselevel. In our case, the fault-tolerance capabilities are

    expressed at the metalevel and control the underlying baselevel computation.
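The two properties can be rendered as a toy example (not Legion code; the class and field names are ours): the metalevel is a data structure describing how the baselevel behaves, and mutating it directly changes behavior.

```python
class ReflectiveCounter:
    """Baselevel computation (counting) governed by a metalevel data
    structure; mutating the metalevel changes baselevel behavior."""
    def __init__(self):
        self.meta = {"step": 1}   # metalevel: exposed, causally connected
        self.value = 0            # baselevel state

    def introspect(self):
        """Introspection: the process reads its own internal structure."""
        return dict(self.meta)

    def tick(self):
        """Causal connection: baselevel behavior follows the metalevel."""
        self.value += self.meta["step"]

c = ReflectiveCounter()
c.tick()               # advances by 1, per the current metalevel
c.meta["step"] = 10    # modifying the metalevel data structure...
c.tick()               # ...directly changes the baselevel behavior
```

In the framework, the metalevel plays the role this dictionary plays here: fault-tolerance code installed at the metalevel governs how the baselevel object computes and communicates.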

1.4.3 Legion grid environment

    Our work targets the Legion environment for multiple reasons: (1) Legion is object-

    based, (2) it already uses graphs for inter-object communication, (3) it is an existing grid

prototype, and (4) multiple programming tools are available. None of the other

    environments considered, such as Globus and CORBA-based systems, possess all these

    attributes. However, our framework is also relevant to these other environments. For

    example, it could be used to structure CORBA applications. Recent research has been

* Note that the term causal is used differently in the distributed systems literature, where it refers to the “happen-before” relationship as defined by Lamport [LAMP78].


    oriented towards extending the functionality of CORBA systems through a reflective

    architecture [BLAI98, HAYT98, LEDO99]. Our work suggests that structuring CORBA-

    reflective architectures using an event-based and/or graph-based paradigm is an idea

    worth pursuing.

Legion treats all resources in a computational grid as objects that communicate via

    asynchronous method invocations. Objects are address-space-disjoint, i.e., they are

    logically-independent collections of data and associated methods. Objects contain a thread

of control and are named entities identified by a Legion Object IDentifier (LOID). Objects

    are persistent and can be in one of two states: active or inert. Active objects contain a

    thread of control and are ready to service method calls. They are implemented with

    running processes over a message passing layer. Inert objects exist as passive object state

    representations on persistent storage. Legion moves objects between active and inert states

to use resources efficiently, to support object mobility, and to enable failure resilience.

    Legion objects are under the control of a Class Manager object that is responsible for

    the management of its instances. A Class Manager defines policies for its instances and

    regulates how an object is created, or deleted, and when it should be migrated, activated or

    deactivated. By defining new Class Managers, grid developers can change the

    management policies of object instances. Class Managers themselves are managed by

    higher-order class managers, forming a rooted hierarchy.

    Legion provides several default objects to manage its resource base. The two basic

    objects are Host Objects and Vault Objects, which correspond to processor and storage

    resources in a traditional operating system. Host objects are responsible for running an

    active object while vault objects are used to store inert objects. Legion allows


customization of all its objects. Thus, a host object could represent compute resources that

exhibit varying degrees of reliability and performance, e.g., a personal computer, a

    workstation, a server, a cluster, or a queue-controlled supercomputer. Similarly a vault

    object could represent a local disk, a RAID disk, or tertiary storage. A full description of

    the Legion object model can be found in the literature [GRIM98].

    1.5 Framework foundation

    The key contribution of this work is the development of two reflective models that are

    the foundations of our framework, the reflective graph and event model, and the exoevent

    notification model. Together these models provide flexible mechanisms for structuring

    applications and specifying the flow of information between objects that comprise an

    application. Furthermore, the models enable information propagation policies to be bound

to applications at run-time. The flexibility of the models and the ability to defer the

    binding of policy decisions are the differentiating features of our framework.

The reflective graph and event model (RGE) reflects our target environment: (1) an

    environment in which objects are implemented by running processes that communicate

    via message passing, and (2) an object-based environment in which an application consists

    of a set of cooperating objects. The RGE model employs graphs and events to expose the

    structure of objects to fault-tolerance developers. It specifies both its external aspect

(interactions between objects) and its internal aspect (interactions inside objects). Graphs

    and events are the building blocks with which fault-tolerance implementors can

incorporate functionality inside objects and exchange fault-tolerance protocol information

    between objects. Graphs represent interactions between objects; a graph node is either a


    member function call on an object or another graph, arcs model data or control

    dependencies, and each input to a node corresponds to a formal parameter of the member

    function. Events specify interactions inside objects and are used to structure their protocol

    stack.
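The graph abstraction can be sketched as a small data-driven structure (the names are ours, and the actual RGE representation differs): nodes are method invocations, arcs model data dependencies, and a node fires once all of its formal parameters have arrived.

```python
class GraphNode:
    """A program-graph node: a method call whose inputs correspond to the
    formal parameters of the member function."""
    def __init__(self, name, method, n_params):
        self.name = name
        self.method = method
        self.inputs = [None] * n_params
        self.arcs = []            # (param_index, target) data-dependency arcs

    def connect(self, target, param_index):
        """Arc modeling a data dependency: our result feeds target's input."""
        self.arcs.append((param_index, target))

    def supply(self, index, value, fired):
        """Data-driven execution: fire once all inputs are present."""
        self.inputs[index] = value
        if all(v is not None for v in self.inputs):
            result = self.method(*self.inputs)
            fired.append(self.name)
            for idx, target in self.arcs:
                target.supply(idx, result, fired)

# A two-node pipeline: the result of square flows along an arc into negate.
fired = []
square = GraphNode("square", lambda x: x * x, 1)
negate = GraphNode("negate", lambda x: -x, 1)
square.connect(negate, 0)
square.supply(0, 3, fired)
```

This data-driven firing rule is what lets graphs express both simple client/server calls (a one-node graph) and richer pipelined interactions.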

    Our second model, the exoevent notification model, is a distributed event model.

    Similarly to the event model defined by CORBA [BENN95] and the Java Distributed Event

    Specification [SUN99A], the exoevent notification model provides a flexible mechanism

    for objects to communicate. However, unlike the CORBA and Java models, the salient and

    distinguishing features of the exoevent notification model are that it unifies the concept of

exceptions and events—an exception is a special case of an event—and it allows event

propagation policies to be set on a per-application, per-object, or per-

method basis, at run-time. In our model, exoevents denote object state transitions and are

    associated with program graphs. Raising an exoevent results in the execution of method

    invocations on remote objects through the execution of associated program graphs—

hence the term exoevent. The ability to specify handlers as program graphs allows

    developers to specify more complex policies than with a traditional event model.
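The unification of exceptions and events, and the run-time binding of propagation policy, can be caricatured as follows (a hypothetical API, not the actual model; handler graphs are reduced to plain callables here): raising an exoevent looks up the graph bound to that event type on that object and executes it, and an exception is simply one event type among others.

```python
class ExoEventSource:
    """Sketch: an object binds event types to handler 'graphs' at run-time;
    an exception is treated as a special case of an event."""
    def __init__(self):
        self.policies = {}   # event type -> handler graph (a callable here)

    def bind(self, event_type, handler_graph):
        """Bind a propagation policy on a per-object basis, at run-time."""
        self.policies[event_type] = handler_graph

    def raise_exoevent(self, event_type, payload):
        """Raising an exoevent executes the associated program graph, which
        in the real model invokes methods on remote objects."""
        handler = self.policies.get(event_type)
        return handler(payload) if handler else None

log = []
obj = ExoEventSource()
obj.bind("object-failed", lambda p: log.append(("notify-manager", p)))
obj.bind("exception", lambda p: log.append(("notify-caller", p)))  # exception = event
obj.raise_exoevent("object-failed", "host-42 down")
```

Because the binding happens at run-time and per object, two instances of the same class can propagate the same exoevent to entirely different parties.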

    The use of reflection to incorporate non-functional requirements has been proposed by

Stroud [STRO96]. Its use for integrating fault-tolerance capabilities into systems has been

    successfully employed in many object-based systems, including FRIENDS [FABR98] and

    GARF [GUER97]. Reflection has also been used as the basis for extending object

    functionality in CORBA-based systems (OpenORB [BLAI98], FlexiNet [HAYT98],

    OpenCorba [LEDO99]). The novelty of this dissertation is to suggest the use of events as

    the primary structuring mechanism for designing object request brokers, the use of generic


    program graphs to describe distributed event propagation policy and bind policy at run-

    time, and the use of reflection to specify inter- and intra-object communication as generic

    and flexible means of extending grid applications with additional functionality. In

    particular, we focus on using the models to extend applications with fault-tolerance

capabilities.

    1.5.1 Framework summary

    In order to enable the integration of fault-tolerance techniques with applications, our

    framework requires that both fault-tolerance experts and tool developers target the

    reflective graph and event model and the exoevent notification model. Note that the

    framework does not make any assumptions about the failure model used by the underlying

    system, or the failure assumptions made by a given fault-tolerance algorithm. The

    framework is an integration framework only; the decision as to whether a given algorithm

    is suitable for a given application is not part of the framework proper.

    Our framework imposes a unified structure on the way grid libraries are organized.

    Specifically, our framework requires that library components use an event paradigm for

intra-object communication. The advantages of events in terms of flexibility and

extensibility are well-known. Events have been used in such diverse areas as graphical

    user interfaces [NYE92], protocol stacks [BHAT97, HAYD98], operating system kernels

    [BERS95] and integrated systems [SULL96]. Using events for building the protocol stack

of an object provides natural hooks for inserting fault-tolerance capabilities. In fact, the

    events required to build a protocol stack for objects are those that are needed for

    incorporating fault-tolerance functionality.


    For inter-object communications, our model provides a data-driven, graph-based

    abstraction. Graphs have been used successfully in parallel and distributed systems

    [BABA92, BEGU92, GRIM96A]. Graphs enable the expression of traditional client/server

    interactions, such as CORBA, as well as more complex interactions, such as pipelined

    flow.

    1.6 Constraints and assumptions

    The fault-tolerance algorithms discussed in this dissertation make use of three

common assumptions: fail-stop, availability of reliable storage, and reliable networks.

    However, Legion only provides an approximation of these assumptions. Detecting a

    crashed object is approximated using conservatively-set timeouts; reliable storage is

    approximated with standard disks; and the use of a high-level retry mechanism for sending

    messages is used to mask transient network partitions. Thus, it is possible for an

    application using a given fault-tolerance technique to violate its failure assumptions. To

    increase the likelihood that these assumptions are met, Legion could be configured to use

hosts and storage devices with higher reliability, e.g., hosts such as those provided by the

Compaq® NonStop™† or Stratus® architectures, storage such as RAID disks, and

    possibly hosts configured with redundant network paths. However, we do not expect this

    configuration to be common in grids in the near future. Thus, application developers

should be aware of the possibility of violating the failure assumptions—if the cost of

violating these assumptions is too high, e.g., as would be the case with safety-critical

    applications, then these applications should not be used on Legion.‡ The framework

† Formerly known as Tandem®, acquired by Compaq Corporation.
‡ Note that this comment applies to any computational grid.


    described here is an integration framework only, and does not make any guarantees as to

    the suitability of using a given algorithm. However, to increase the likelihood that the

    failure assumptions are met, we configured applications to run within a site [DOCT99].

    In this dissertation the algorithms we have mapped onto our framework are designed

    to tolerate host failures. Computational grids use hardware resources owned by various

    entities, including research labs, governmental agencies, and universities. At any moment

in time, it is thus not surprising to find that some hosts used by a grid system have crashed

because someone rebooted the machine or tripped over a power cord, or that a host is

simply down for maintenance. While the crash failure of hosts represents an

    important class of failures in grids, we note that they are not the only source of failures—

    unreliable software or operator error could also result in the failure of applications

    [GRAY85]. Furthermore, we do not concern ourselves with non-fault-masking techniques

    such as reconfiguration and presentation of alternative services to cope with failures

    [HOFM94, KNIG98, GART99]. We are only concerned with the integration of fault-masking

    techniques in grid applications. Once a host fails, we assume that it does not recover.

    Furthermore, we seek only to integrate fault-tolerance techniques into user applications

    and do not address the case of fault-tolerance for system-level objects.** We assume that

    Legion services are always available.

    1.7 Outline

    We have organized the rest of the dissertation as follows. In Chapter 2, we present an

    overview of related work in the areas of computational grids, reflection, event-driven

    ** Legion system-level objects already tolerate transient host failures.


    systems, aspect-oriented programming and integration of fault-tolerance techniques in

    distributed systems. In Chapter 3, we provide an overview of our execution model, the

    reflective graph and event model. In Chapter 4, we describe the development of a

    distributed event notification model that is used as a flexible communication model to

    exchange protocol information between objects. In Chapter 5, we illustrate mappings from

several well-known fault-tolerance techniques onto the reflective graph and event model

    and the distributed event notification model. In Chapter 6, we present the integration of

    several mappings described in Chapter 5 into several programming tools available in the

    Legion grid. In Chapter 7, we tie the previous chapters together and provide a working

    proof that our models have been successfully integrated into several tools and

    applications. We also evaluate the performance of these applications. In Chapter 8, we

    conclude by presenting lessons we learned and opportunities for future research.


There is only one nature – the division into science and engineering is a human imposition, not a natural one. Indeed, the division is a human failure; it reflects our limited capacity to comprehend the whole.
— Bill Wulf

    Chapter 2

    Related Work

    We present a broad overview of computational grids and potential grid tools to provide

    context for our work (§2.1). We discuss reflective systems (§2.2) as our reflective graph

    and event model is based on a reflective architecture. We discuss the event model and its

use in various settings to support extensibility and flexibility (§2.3). We consider aspect-

oriented programming and its potential relationship with event-based extension

    mechanisms (§2.4). Finally, we present several approaches to integrating fault-tolerance

    techniques into distributed systems, including CORBA-based systems (§2.5).

2.1 Computational grids

    Foster et al. have identified three approaches to building computational grids: the

    commodity approach, the service approach, and the integrated architecture approach

    [FOST99]. In the commodity approach, existing commodity technologies, e.g., HTTP,

    CORBA, COM, Java, serve as the basic building blocks of the grid [ALEX96, BALD96,

FOX96, CHRI97]. In the service approach, as exemplified by the Globus project, a set of


    basic services such as security, communication, and process management are provided and

    exported to developers in the form of a toolkit [FOST97]. In the integrated architecture

    approach, resources are accessed through a uniform model of abstraction [GRIM98]. For

    example, Legion enables the development of grid applications by providing a uniform

    object abstraction to encapsulate and represent grid resources, e.g., compute, data, and

    people resources. A motivating factor for both the service and integrated architecture

    approach is that the set of commodity services provided by current technology does not

    suffice to meet the requirements of computational grids [FOST99].

We present several systems below and comment on the suitability of these systems for

    developing grid applications.

    2.1.1 PVM and MPI

    PVM (Parallel Virtual Machine) and MPI (Message Passing Interface) are the two

    best-known message passing environments in grid computing [GEIS94, GEIS97]. They

    provide programmers with library support for writing applications with explicit message

    send and receive operations. In addition to message passing, PVM and MPI provide the

illusion of an abstract virtual machine that supports the creation and deletion of processes

    or tasks. As of this writing, MPI has eclipsed PVM to become the primary message

    passing standard, and is supported by all major computer manufacturers.

    Both Legion and Globus provide support for MPI [FOST99]. Legion also provides

    support for PVM. We describe below several systems layered on top of PVM or MPI that

provide fault-tolerance capabilities. While these systems have not yet been ported to grid

    prototypes, they are representative of the kind of systems that are likely to be incorporated


    into grids. It is interesting to note that many of these systems are geared towards scientific

    computing; they provide support for a style of application known as SPMD applications

(Single Program Multiple Data), in which identical processes each operate on a subdomain of the

    application data. SPMD applications are often time-stepped, with periodic exchange of

information at well-defined intervals.

    2.1.1.1 DOME

DOME (Distributed Object Migration Environment) runs on top of PVM and

    supports application-level fault-tolerance in heterogeneous networks of workstations

    [BEGU97]. DOME defines a collection of data parallel objects such as arrays of integers or

    floats that are automatically distributed over a network of workstations. DOME supports

    the writing of SPMD applications in which a process is replicated on multiple nodes and

    executes its computation over a different subset of the data. DOME provides support for

    the checkpointing of SPMD applications. Similarly to the checkpointing techniques that

    we use, DOME’s checkpoints support the recovery of applications on heterogeneous

    architectures.

2.1.1.2 CUMULVS

CUMULVS is a library package for visualization and steering of fault-tolerant SPMD

applications for use on top of PVM [GEIS97]. In CUMULVS, programmers specify the

data decomposition of their applications. CUMULVS automatically uses this information

    for checkpoint/recovery and is able to reconfigure applications even if the recovered

application uses fewer workers or tasks. Since CUMULVS is geared towards SPMD

    applications, the consistency of application-wide checkpoints is easily maintained.


    2.1.1.3 Other extensions to PVM and MPI

    Fail-Safe PVM is an extension of PVM to provide application-transparent fault

    tolerance based on checkpoint and recovery [LEON93]. While it achieves transparency,

    Fail-Safe PVM required modifications to the PVM daemons to monitor the flow of

messages between PVM tasks. Silva et al. provide a user-level library called PUL-RD to

    support checkpointing and recovery of SPMD applications on top of MPI [SILV95].

    Programmers are responsible for describing the data layout of their applications. Similarly

to CUMULVS, the PUL-RD library supports the recovery of applications with fewer

    processes.

    2.1.2 Isis, Horus and Ensemble

    Isis, Horus and Ensemble are representative of systems that use a process group

    abstraction to structure distributed applications [BIRM93, RENE96, HAYD98]. The central

    tenet of such systems is that support for programming with distributed groups is the key to

    writing reliable applications.

    Process groups enable the realization of a virtually synchronous model of computation

    wherein the notion of time is defined based on the ordering of messages [LAMP78].

    Typically, a programmer uses various forms of multicast primitives for communication

    with members of a group, e.g., causal multicast or totally ordered multicast. The receipt of

    messages within a group may be ordered with respect to group membership changes,

    thereby enabling programmers to write algorithms such that group members can logically

    take some actions “at the same time” with respect to failures. Failures of processes are

    treated as changes in the membership of a group. Only processes that are members of a


    group are allowed to process messages. Thus, group membership, as seen in Isis, simulates

    a fail-stop model in which processes fail by halting [SCHN83, SABE94].

    The process group model has often been criticized on the basis of the end-to-end

    argument [SALT90]. Critics of the model argue that the ordering properties guaranteed by

    group communication primitives are provided at too low a level of abstraction, and in

    some cases, may be unnecessary to meet the specifications of an application [CHER93].

    Proponents of the model argue that the services provided by the model are invaluable in

    developing fault-tolerant distributed applications [RENE93, BIRM94, RENE94].

    It is interesting to view the progression of systems developed at Cornell University,

    from Isis to Horus, and then to Ensemble, as a response to the end-to-end argument. While

    Isis was a monolithic system, both Horus and Ensemble allow developers to configure and

    customize the protocol stacks of processes to meet the needs of applications. In Ensemble,

    the protocol stack of processes can be configured at run-time using an event-driven

    paradigm, unlike the protocol stack of Horus, which has to be configured statically.

    The process group model has found acceptance in several domain areas, including

    finance, groupware applications, telecommunication, military systems, factory automation

    and production control [BIRM93]. For more information on the model and its applications

    to Internet applications, please see the recent book by Birman [BIRM96].

    Our framework differs in that its focus is on integrating fault-tolerance techniques in

    object-based systems, whereas the focus of Isis, Horus and Ensemble is on supporting the

    process group abstraction. The two are not mutually exclusive; it is possible to layer a

    reflective framework on top of ordered group communication primitives [FABR98].


    For grid applications, it is too early to determine how much of a role the process group

    model will play. However, the evolution from Isis to Ensemble points to a common design

    goal of supporting flexibility and extensibility (§2.3).

    2.1.3 Linda, Piranha and JavaSpaces

    In Linda, processes in an application cooperate by communication through an

    associative shared memory abstraction called tuple space [CARR89]. A tuple in tuple space

    names a data element that consists of a sequence of basic data types such as integers,

    floats, characters and arrays. Linda defines four basic operations, out, in, rd and eval, to

    access tuple space. Out deposits tuples in tuple space; in and rd search it. A useful

    property of in and rd is that they can specify a generic pattern against which tuples are

    matched. Finally, eval creates a new process. The primary advantages of Linda are that its

    four operations are simple to learn and that a shared memory abstraction is easy for

    programmers to use. PLinda is an extension to Linda that provides fault-tolerance through

    the checkpointing and recovery of tuple space and the use of a commit protocol to deposit

    and read tuples from tuple space [JEON94]. Another fault-tolerant version of Linda is

    Piranha [CARR95]. Piranha supports a style of computation known as master-worker

    parallelism, in which a master process generates a set of tasks to be consumed by workers.

    Piranha enables users to treat a collection of hosts as a computational resource base on

    which to assign tasks. When a user reclaims a host, e.g., by pressing a key or clicking on

    the mouse, Piranha automatically reassigns the task to another host, thus ensuring that an

    application eventually completes. The act of reclaiming a host can be treated as a failure

    and is analogous to leaving a group in a system with group membership.
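    To make the four operations concrete, the following is a minimal, single-process C++ sketch of a tuple space. The class and type names are ours, not Linda's; real Linda systems are distributed, include eval for process creation, and block in/rd until a matching tuple appears, none of which this sketch attempts.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <variant>
#include <vector>

// One field of a tuple: an integer, float, or string.
using Field = std::variant<int, double, std::string>;
using Tuple = std::vector<Field>;

// A pattern field is either a concrete value or a wildcard (nullopt),
// a simplification of Linda's typed formal parameters.
using Pattern = std::vector<std::optional<Field>>;

class TupleSpace {
public:
    // out: deposit a tuple into tuple space.
    void out(const Tuple& t) { tuples_.push_back(t); }

    // rd: non-destructively search for a tuple matching the pattern.
    std::optional<Tuple> rd(const Pattern& p) const {
        for (const Tuple& t : tuples_)
            if (matches(t, p)) return t;
        return std::nullopt;
    }

    // in: like rd, but removes the matched tuple from tuple space.
    std::optional<Tuple> in(const Pattern& p) {
        for (auto it = tuples_.begin(); it != tuples_.end(); ++it)
            if (matches(*it, p)) {
                Tuple t = *it;
                tuples_.erase(it);
                return t;
            }
        return std::nullopt;
    }

private:
    static bool matches(const Tuple& t, const Pattern& p) {
        if (t.size() != p.size()) return false;
        for (size_t i = 0; i < t.size(); ++i)
            if (p[i].has_value() && t[i] != *p[i]) return false;
        return true;
    }
    std::vector<Tuple> tuples_;
};
```

    The associative matching in rd and in is what makes the abstraction convenient for master-worker parallelism: workers repeatedly in a tuple matching ("task", wildcard) without knowing which master deposited it.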


    Linda and its derivatives are particularly well-suited to a master-worker style of

    computation—a style that is prevalent in grid applications. We expect that, over time, a

    Linda-like abstraction will be ported to computational grids. We note that Linda is

    currently a commercial product supported by Scientific Computing Associates, Inc., under

    the trade name Paradise®.

    The Linda tuple model heavily influenced the development of the Jini JavaSpaces™

    Specification [SUN99A]. Like Linda, JavaSpaces provide the abstraction of an

    associative shared memory in which Java programs can deposit and retrieve information.

    JavaSpaces improve upon the Linda model in that Java programs can be automatically

    notified of changes in the JavaSpace through events [SUN99A]. Both Linda tuple space

    and JavaSpaces can be viewed as instances of a blackboard architecture in which

    different components interact and coordinate actions based on state changes in a shared

    repository [SHAW96].

    2.2 Reflection

    Smith introduced the concept of reflection: that of a computational process that can

    reason about itself and manipulate representations of its own internal structure [SMIT82].

    Two properties characterize reflective systems: introspection and causal connection.

    Introspection enables a computational process to access its own internal structures.

    Causal connection enables the computational process to modify its behavior directly by

    modifying its internal data structures, i.e., there is a cause-and-effect relationship between

    changing the values of the data structures and the behavior of the process. The internal


    data structures are said to reside at the metalevel while the computation itself resides at the

    baselevel; thus the metalevel controls the behavior of the baselevel.
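    The two properties can be illustrated with a small C++ sketch (the class and method names are hypothetical, not part of any system discussed here). The metalevel is an explicit dispatch table that the baselevel consults on every invocation: enumerating the table is introspection, and rebinding an entry is a causally connected modification—the object's behavior changes immediately.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// An object whose behavior is controlled by an explicit metalevel:
// a table mapping method names to implementations.
class ReflectiveObject {
public:
    using Method = std::function<int(int)>;

    // Metalevel write: (re)bind a method name to an implementation.
    void define(const std::string& name, Method m) { meta_[name] = m; }

    // Introspection: the object can enumerate its own methods.
    std::vector<std::string> methods() const {
        std::vector<std::string> names;
        for (const auto& kv : meta_) names.push_back(kv.first);
        return names;
    }

    // Baselevel invocation, routed through the metalevel table, so
    // changing the table is causally connected to behavior.
    int invoke(const std::string& name, int arg) const {
        return meta_.at(name)(arg);
    }

private:
    std::map<std::string, Method> meta_;  // the metalevel
};
```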

    Reflection provides a principled means of achieving open engineering, i.e., of

    extending the functionality of a system in a disciplined manner [BLAI98]. A key attribute

    of reflective systems is that of separation of concerns between the metalevel and the

    baselevel. For example, Fabre et al. incorporated replication techniques into objects using

    the reflective programming language Open-C++ [FABR95]. The implementation of the

    replication techniques was performed at the metalevel with little change to the underlying

    baselevel application. The design and implementation of the replication techniques were

    separated from the design and implementation of the actual application, thus allowing the

    replication techniques to be composable with many applications. In general, reflective

    architectures enable the composition of non-functional concerns with the underlying

    computational process [STRO96].

    Another advantage of reflective architectures is that they enable flexibility and

    extensibility of functionality. Reflective architectures have been used in such diverse areas

    as programming languages [MAES87, WATA88, KICZ91, AKSI98, TATS98, MOSS99,

    WELC99], operating systems [YOKO92], real-time systems [SING97, STAN98, STAN99],

    fault-tolerant real-time systems [BOND93], agent-based systems [CHAR96], dependable

    systems [AGHA94], and distributed middleware systems, e.g., OpenORB [BLAI98],

    FlexiNet [HAYT98], OpenCorba [LEDO99] and Legion [NGUY99].

    A feature common to all reflective systems is that they answer two questions: What

    internal structure or metalevel information (meta-information) is exposed to developers?

    How does one access the metalevel? The answer to the first question is application-


    dependent. For example, in real-time systems such as FERT or Spring [BOND93, STAN98]

    the meta-information includes timing constraints of tasks, deadlines, and precedence

    constraints. In a programming language such as CLOS, the meta-information includes

    slots and methods [KICZ91]. In object-based distributed systems, meta-information can

    include methods, arguments and replies [BLAI98, HAYT98, LEDO99, VILE97]. The answer

    to the second question also varies. A popular method of programming the metalevel is

    through an object-oriented paradigm in which a metalevel object defines and controls the

    behavior of baselevel objects [MAES87, KICZ91]. Other means of accessing meta-

    information include using compiler technology [FABR95, CHIB95, TATS98], configuration

    files [MOSS99, WELC99], and events [NGUY98, PAWL98].

    The reflective models developed in this dissertation reflect our target environment of a

    computational grid. Incorporating fault-tolerance techniques in a distributed application—

    a set of cooperating objects—requires manipulation of the internal as well as external

    aspects of an object. Our models regulate both intra-object interactions, i.e., interactions

    between modules inside an object, and inter-object interactions, i.e., interactions between

    objects. The dual aspect of our models enables the integration of application-wide

    algorithms such as checkpointing, in contrast to other reflective systems whose focus has

    been on integrating techniques such as replication in server objects [FABR95, GUER97,

    BLAI98, HAYT98].

    A further difference between our architecture and other reflective middleware

    architectures is that we do not use a metaobject protocol to control the behavior of the

    baselevel [AGHA94, FABR95, GUER97, FABR98, HAYT98, LEDO99]. Instead, we present a

    graph-and-event-based interface accessible through simple C++ library calls. In contrast,


    other reflective approaches such as OpenCorba [LEDO99] and Garf [GUER97] rely on the

    Smalltalk programming language. We believe that presenting a C++ based interface

    expands our potential community of developers.

    2.3 Events

    Events have been used in a variety of contexts [SHAW96]: in graphical user interfaces,

    to build protocol stacks [BERS95, BHAT97, HAYD98, VILE97], in integrated systems

    [SULL96], and as a generic mechanism for component interactions [BENN95]. We separate

    our discussion of events into two sections: local events and distributed events. Local events

    propagate within the same address space whereas distributed events propagate to a

    different address space.

    2.3.1 Local events

    2.3.1.1 Protocol stacks

    Many projects such as SPIN [BERS95], Coyote [BHAT97] and Ensemble [HAYD98],

    use an event-based paradigm for flexibility and extensibility. SPIN is a dynamically

    extensible operating system that uses events as its extension mechanism. A SPIN event is

    used to notify the system of a state change or to request a service. For example, an IP

    extension to the kernel could announce the event PacketArrived. Events in SPIN are fine-

    grained, reflecting their use in an operating system. Likewise, events in the Coyote project

    are fine-grained, reflecting their use in a kernel designed for network protocols. Coyote

    extends the x-kernel [HUTC91] and enables the construction of micro-protocols that

    communicate via events. Micro-protocols implement low-level properties, e.g.,


    acknowledging that a message has been received or maintaining a membership list of live

    processes. By composing micro-protocols, the Coyote protocol stack can be easily

    configured to implement higher-level properties, e.g., group remote procedure calls with

    acknowledgment. Coyote was designed primarily for network protocols and so the set of

    pre-defined events relate mostly to messages, e.g., Message_Inserted_Into_Bag or

    Message_Ready_To_Be_Sent. Ensemble uses events as the primary mechanism for

    composing micro-protocols and supporting the process group abstraction. Example events

    in Ensemble include Send-Message and Leave-Group.

    The set of events exported by a system depends on the target environment and defines

    the extension vocabulary with which developers can extend functionality. Since we target

    an object-based system implemented over a message-passing communication layer, we

    export events such as MessageSend and MethodReceived. Approaches such as Coyote or

    our own in which events manipulate data structures (e.g., messages) contained in shared

    data structures (e.g., a message repository) can be viewed as a blackboard architecture

    augmented with implicit invocations [SHAW96].
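    A minimal C++ sketch of such a local-event mechanism follows. The dispatcher API and the MessageSend event name are illustrative only—they are not the actual interfaces of SPIN, Coyote, Ensemble, or our system. Handlers registered on an event name form a chain that runs in order, so a new module extends the system simply by registering another handler.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// An event names an occurrence and carries mutable data, here the
// message being sent, that handlers in the chain may transform.
struct Event {
    std::string name;
    std::string payload;
};

class EventDispatcher {
public:
    using Handler = std::function<void(Event&)>;

    // Register a handler for a named event; handlers run in
    // registration order when the event is announced.
    void subscribe(const std::string& name, Handler h) {
        handlers_[name].push_back(h);
    }

    // Announce an event: each handler sees modifications made by
    // the handlers that ran before it.
    void announce(Event e) {
        for (auto& h : handlers_[e.name]) h(e);
    }

private:
    std::map<std::string, std::vector<Handler>> handlers_;
};
```

    The chain is what makes composition work: for example, a security module that signs outgoing messages and a logging module that records them can both attach to MessageSend without knowing about each other.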

    2.3.1.2 Graphical user interface

    Events have been widely used in implementing graphical user interfaces, e.g., in

    MacOS®, Microsoft Windows®, and Java’s Abstract Window Toolkit. Events enable the

    separation of the visual aspects of a program from the actual computation. Typical events

    in these systems deal with various aspects of the desktop metaphor, e.g., mouse, windows,

    buttons, menus, and keyboard input. Programmers can register event handlers to be notified of

    user actions and respond appropriately. However, coordinating events may be a difficult


    task. Thus, most environments provide tools to facilitate the development of graphical

    user interfaces, e.g., Java Swing, Visual Basic.

    2.3.1.3 JavaBeans

    JavaBeans™ is the component technology developed by Sun Microsystems for use

    within the Java platform [SUN99B]. A bean is a reusable software artifact that can be

    manipulated visually using a builder tool. Beans can communicate with one another using

    an event paradigm. The advantages of using Beans are that they are portable across

    heterogeneous architectures and that many tool builders are actively developing products to

    support the development of JavaBeans.

    2.3.2 Distributed events

    Distributed events are used to communicate information between remote objects or

    processes. In CORBA, the Event Service allows an object to register its interest in events

    raised by other objects [BENN95]. CORBA defines two roles for objects: suppliers and

    consumers. Suppliers produce events; consumers process them. Suppliers and

    consumers may be directly linked in which case events flow directly from the suppliers to

    the consumers. Alternatively, an event channel may be defined to serve as an intermediary

    object between suppliers and consumers. Using an event channel fully decouples suppliers

    from consumers—consumers need not be active when suppliers deposit events on an

    event channel. Furthermore, event channels may provide added functionality such as

    filtering and persistence. The Jini Distributed Event Specification provides functionality

    similar to that of CORBA’s Event Service [SUN99A]. It also provides additional features,

    such as the ability to bound the time during which an object is interested in an event raised


    by some other objects via leasing [SUN99A]. In Jini terminology, an event listener may

    register to be notified of an event on a one-time basis, forever, or for a specified time

    period.
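    The decoupling provided by an event channel can be sketched in a few lines of C++. This is a deliberate simplification: real CORBA event channels are remote objects with IDL-defined push/pull interfaces, and persistence is an optional added feature rather than the default. The point illustrated is only the temporal decoupling—a consumer that connects after events were deposited still receives them.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <vector>

// An intermediary between suppliers and consumers. Suppliers push
// events without knowing who (or whether anyone) will consume them.
class EventChannel {
public:
    using Consumer = std::function<void(const std::string&)>;

    // Supplier side: deposit an event. Connected consumers see it
    // immediately; the channel also buffers it for late consumers.
    void push(const std::string& event) {
        for (auto& c : consumers_) c(event);
        buffer_.push_back(event);
    }

    // Consumer side: connect to the channel. Buffered events are
    // replayed first, so consumers need not be active when
    // suppliers deposit events.
    void connect(Consumer c) {
        for (const auto& e : buffer_) c(e);
        consumers_.push_back(std::move(c));
    }

private:
    std::deque<std::string> buffer_;
    std::vector<Consumer> consumers_;
};
```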

    The exoevent notification model developed in this dissertation is similar to both the

    CORBA and the Java Distributed Event specifications in that it supports the flexible

    propagation of events between objects. The distinguishing features of our model are that it

    unifies the concept of exceptions and events, i.e., an exception is simply a special kind of

    event, and it allows programmers to specify the propagation of events on a per-

    application, per-object or per-method basis. The exoevent notification model does not

    support the concept of leasing.

    While we use distributed events in our work for the dissemination of data to support

    fault-tolerance algorithms, we note that the publish/subscribe model supported by events

    is generic. As an example, the Department of Defense’s High Level Architecture uses the

    publish/subscribe model to propagate information about entities in distributed simulations

    [DMSO98]. As another example, the Jini Discovery and Join Specification regulates how

    devices can discover the presence of other devices on a network [SUN99A].

    2.4 Aspect-oriented programming

    The use of the event paradigm to extend functionality for middleware systems is

    related to the issue of crosscutting and weaving in aspect-oriented programming [KICZ97].

    Crosscutting is the concept that extensions to a modularly-designed program cannot be

    constrained within the bounds of the original program decomposition. An example of

    crosscutting in an object-oriented program would be the addition of synchronization


    primitives at the beginning of each method. Kiczales’ thesis is that crosscutting is

    common in large software systems. Our experiences with middleware systems corroborate

    his thesis; aside from implementing its functional requirements, an object may also handle

    issues such as argument marshalling, security, debugging, performance monitoring and

    synchronization. In aspect-oriented programming technology, these issues are called

    aspects. Aspect-oriented programming languages elevate aspects to first-class status and

    provide a clean separation between the functional decomposition of a program—objects

    or modules—and non-functional requirements which pertain to the way objects and

    modules relate to one another [HIGH99].

    After aspects are elevated to first-class status they must be composed with the

    underlying program. This process is known as weaving and seems closely related to events

    in the sense that events can be used to implement weaving. For example, an aspect for

    debugging could be implemented easily in an object-based system by inserting an event

    handler to intercept method invocations and log them to storage for future replay. An interesting

    avenue of research would be to investigate the use of an aspect-oriented programming

    language to extend the functionality of objects in computational grids, or alternatively, to

    investigate the suitability of the event paradigm for weaving aspects. Pawlak et al. are

    currently investigating this line of research [PAWL98].
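    As an illustration of weaving with events, the following C++ sketch (all names hypothetical) implements the debugging aspect described above: every method call flows through an interception point that raises an event, and the aspect is simply a registered handler that records calls for later replay, without touching the base class's functional code.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// The interception point: aspects are woven in by registering hooks,
// which are invoked on every method-call event.
class Interceptor {
public:
    using Hook = std::function<void(const std::string&)>;
    void weave(Hook h) { hooks_.push_back(std::move(h)); }
    void raise(const std::string& method) {
        for (auto& h : hooks_) h(method);
    }
private:
    std::vector<Hook> hooks_;
};

// A baselevel class whose methods announce themselves; its functional
// code is unaware of which aspects (if any) are woven in.
class Account {
public:
    explicit Account(Interceptor& i) : icpt_(i) {}
    void deposit(int amount) {
        icpt_.raise("Account::deposit");  // method-call event
        balance_ += amount;
    }
    int balance() const { return balance_; }
private:
    Interceptor& icpt_;
    int balance_ = 0;
};
```

    The synchronization example mentioned earlier would weave the same way: a hook acquiring a lock before each method, again without editing each method body.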

    2.5 Integrating fault tolerance in distributed systems

    Fabre et al. present an excellent analysis of different approaches for integrating fault-

    tolerance in distributed systems [FABR95, FABR98]. They distinguish between three main

    approaches: the system approach, the library approach and the inheritance approach. In


    the system approach, the runtime system provides support for fault-tolerance. For

    example, Delta-4 [POWE94] offers several replication strategies such as passive, semi-

    active and active replication to Delta-4 application programmers. In the library approach,

    a set of functions is provided at the application-level to support a set of fault-tolerance

    algorithms. For example, Isis [BIRM93], Horus [RENE96] and Ensemble [HAYD98]

    provide developers with various forms of ordered communication primitives. In the

    inheritance approach, an object can inherit fault-tolerance properties such as persistence

    and recoverability from a base class. Examples of this approach include Avalon/C++

    [DETL88] and Arjuna [ARJU92]. Fabre analyzes these approaches in terms of transparency,

    reusability and composability, and argues that none meets these criteria simultaneously.

    Fabre proposes the use of reflective techniques to meet these criteria and shows how to

    integrate replication techniques into distributed objects using the reflective language

    Open C++ [FABR95, FABR98]. Other systems that advocate the use of reflection to

    incorporate fault-tolerance techniques include MAUD [AGHA94] and Garf [GUER97].

    A fertile area of research has been to integrate fault-tolerance techniques into CORBA.

    Moser et al. propose a fault-tolerance framework that implements fault-tolerance

    management services both above and below an object request broker (ORB) [MOSE99].

    Other projects such as Electra and Orbix+Isis integrate replication and group mechanisms

    inside the ORB itself [MAFF95, LAND97]. DOORS (Distributed Object-Oriented Reliable

    Service) provides fault-tolerance services as CORBA horizontal services [SCHO98].

    Elnozahy et al. provide a library of fault-tolerance techniques that can be used in both

    CORBA and DCE environments [ELNO95]. Except for DOORS, which is implemented

    above the ORB layer, all the other projects use interception methods to implement


    replication services. Interception is implemented by modifying the ORB itself [LAND97],

    by providing a library to be called from within the ORB [ELNO95], or by using features of

    the operating system [MOSE99]. The Orbix ORB includes the notion of filters to intercept

    method calls. However, Marzullo’s group at the University of California, San Diego,

    reported difficulties in integrating the message-logging fault-tolerance technique with

    Orbix [NAMP99]. Marzullo et al. suggest that an event-driven model would have

    alleviated the reported difficulties [NAMP99].

    The need to extend the functionality of ORBs has led several researchers to adopt a

    reflective architecture to structure ORB implementations [BLAI98, HAYT98, LEDO99]. Our

    development of the RGE and exoevent notification models also provides an extension

    mechanism. The novelty of this work is to suggest the use of events as the primary

    structuring mechanism for designing object request brokers and to specify both inter- and

    intra-object communication within a unified model.

    2.6 Summary

    In designing our models, we drew inspiration from reflective systems as well as

    previous work on flexible protocol stacks. Our approach differs in two respects from most

    CORBA-based reflective middleware approaches: (1) we use a simple graph and event-

    based interface for extending object functionality instead of a metaobject protocol, and

    (2), our reflective models are designed to extend the functionality of applications, not just

    single server objects. In the next chapter, we present the cornerstone of our framework, the

    reflective graph and event model. We show an application of our model in designing a

    protocol stack and extending it with new functionality.


    Make everything as simple as possible, but not simpler. — Albert Einstein (1879-1955)

    Chapter 3

    Reflective Graph and Event Model

    The cornerstone of our framework is the specification of the reflective graph and event

    (RGE) execution model. It provides a structural framework for basic object

    functionality such as invoking methods and marshalling and unmarshalling parameters,

    similar to an object request broker (ORB) in CORBA systems [OMG95]. In addition, the

    model provides a generic extension mechanism for incorporating new functionality into

    objects—such functionality is encapsulated into reusable code artifacts, or modules. Thus,

    the RGE model provides a common framework for fault-tolerance designers and tool

    developers, and enables the integration and composition of fault-tolerance modules into

    programming tools.

    The novelty of this work is to suggest the use of events as the primary structuring

    mechanism for designing object request brokers and to use a single model to specify both

    inter- and intra-object communication. The RGE model employs graphs for inter-object

    communication and events for intra-object interactions. Graphs represent interactions

    between objects; a graph node is either a member function call on an object or another

    graph; arcs model data and control dependencies; and each input to a node corresponds to

    a formal parameter of the member function. Events specify interactions between modules

    inside objects. Graphs and events are the building blocks with which fault-tolerance

    developers can incorporate functionality inside objects and exchange protocol information

    between objects.
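    The graph half of the model can be sketched as a small dataflow structure in C++. This is a simplification with names of our own choosing, not the actual RGE interface: a node stands for a member-function call, each arc delivers one value to a formal parameter, and the node fires only when all of its inputs have arrived—the data dependencies the arcs represent.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// A node in a dataflow graph. Its arity is the number of formal
// parameters of the member-function call it represents; each incoming
// arc fills one parameter slot.
class GraphNode {
public:
    GraphNode(int arity, std::function<int(const std::vector<int>&)> fn)
        : inputs_(arity, 0), pending_(arity), fn_(std::move(fn)) {}

    // Deliver a value along the arc feeding formal parameter `slot`.
    // The node fires (invokes the call) once every arc is satisfied.
    void supply(int slot, int value) {
        inputs_[slot] = value;
        if (--pending_ == 0) result_ = fn_(inputs_);
    }

    bool fired() const { return pending_ == 0; }
    int result() const { return result_; }

private:
    std::vector<int> inputs_;
    int pending_;  // arcs still awaiting a value
    std::function<int(const std::vector<int>&)> fn_;
    int result_ = 0;
};
```

    A full graph would wire one node's result to arcs feeding other nodes, giving the control and data dependencies described above; this sketch shows only the firing rule for a single node.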

    The RGE model is reflective because it exposes the structure of objects (i