2008 chen - towards fault-tolerant hla-based distributed simulations

SIMULATION

2008; 84; 493. DOI: 10.1177/0037549708095518

Dan Chen, Stephen J. Turner and Wentong Cai

Towards Fault-tolerant HLA-based Distributed Simulations

The online version of this article can be found at: http://sim.sagepub.com/cgi/content/abstract/84/10-11/493

Published by: http://www.sagepublications.com

On behalf of: Society for Modeling and Simulation International (SCS)



Towards Fault-tolerant HLA-based Distributed Simulations

Dan Chen
Institute of Electrical Engineering
Yanshan University
Qinhuangdao 066004
[email protected]

Stephen J. Turner
Wentong Cai
School of Computer Engineering
Nanyang Technological University
639798 Singapore

Large scale High Level Architecture (HLA)-based simulations are built to study complex problems, and they often involve a large number of federates and vast computing resources. Simulation federates running at different locations are subject to failure. The failure of one federate can lead to the crash of the overall simulation execution. Such risk increases with the scale of a distributed simulation. Hence, fault tolerance is required to support runtime robustness. This paper introduces a framework for robust HLA-based distributed simulations using a ‘Decoupled Federate Architecture’. The framework provides a generic fault-tolerant model, which deals with failure with a dynamic substitution approach. A sender-based method is designed to ensure reliable in-transit message delivery, which is coupled with a novel algorithm to perform effective fossil collection. The fault-tolerant model also avoids any unnecessary repeated computation when handling failure. Using a middleware approach, the framework supports reusability of legacy federate code and it is platform-neutral and independent of federate modeling approaches. Experiments have been carried out to validate and benchmark the fault-tolerant federates using an example of a supply-chain simulation. The experimental results show that the framework provides correct failure recovery. The results also indicate that the framework only incurs minimal overhead for facilitating fault tolerance and has a promising scalability.

Keywords: High Level Architecture, runtime infrastructure, fault tolerance, Decoupled Federate Architecture

1. Introduction

Distributed simulation technology facilitates the construction of a large-scale simulation with component models that can be developed independently on heterogeneous platforms and distributed geographically. The High Level Architecture (HLA) defines the rules, interface specification and object model template to support reusability and interoperability among the simulation components, known as federates. The Runtime Infrastructure (RTI) software supports and synchronizes the interactions among different federates conforming to the HLA standard [1] to give an overall simulation application, known as a federation.

SIMULATION, Vol. 84, Issue 10/11, Oct./Nov. 2008 493–509. © 2008 The Society for Modeling and Simulation International. DOI: 10.1177/0037549708095518. Figures 1–4, 6–15 appear in color online: http://sim.sagepub.com

In the case where the problem domain is particularly complex or involves multiple collaborative parties, the analysts often need to construct a large-scale federation with individual simulation federates interacting over the Internet. Some typical examples include military commission rehearsal, Internet gaming and supply-chain simulation. Those applications are usually time consuming and computationally intensive and require vast distributed computing resources. Simulation federates running at different locations are subject to failure. As the current IEEE 1516 HLA standard does not support a formal fault-tolerant model [2], crash of a federate or a part of a federation may lead to the failure of the whole federation. When failure occurs, even if it is feasible to restart the simulation from a previous checkpoint [3], repeating the execution could either be costly or result in the loss of functions of the failed simulation. (For example, a random event may not be regenerated in the new ‘recovered’ simulation execution.) The risk of such failure increases with the number of federates inside one single federation. Hence, there exists a pressing need for a mechanism to support runtime robustness in HLA-based distributed simulations.

A normal federate usually exists as a single process at runtime, and the simulation model shares the same memory space with the Local RTI Component (LRC) [4, 5]. In the case where the RTI crashes or meets congestion, the failure of any LRC prevents the simulation execution from proceeding correctly even though the simulation model contains no error at all. Thus, providing fault tolerance to federates requires an approach to ‘isolate’ the error of the LRC from the simulation model, in addition to the challenge of developing a generic state-saving and recovery mechanism.

We have proposed a Decoupled Federate Architecture approach to enable state saving and recovery for federate cloning [6], and we have also suggested a preliminary scheme to achieve fault tolerance using the architecture [7]. In this paper, we focus on the investigation of the fault-tolerance issue, and whether the Decoupled Federate Architecture can be used for this purpose. This extended paper (see also Chen et al. [8]) also presents a scalability study.

This study aims to explore a solution to runtime robustness upon existing RTI implementations, and it also provides the designers of the future RTI software with a viable direction to address the fault tolerance issues. Our framework has been designed with the following objectives and scope:

1. tackling unpredictable failure of RTI services regardless of cause;

2. minimizing overheads for providing runtime robustness to ensure execution efficiency;

3. resuming normal execution from exactly where a failure occurs without repeating or disrupting the global simulation execution;

4. providing user transparency, which (1) avoids the need for developers to include extra fault-tolerant codes in modeling federates, to minimize development cost and support reuse of legacy federates, (2) allows developers to model their federates freely using various software packages and on different platforms, (3) masks failure from the users at runtime, and (4) allows users to deploy/execute fault-tolerant federations in the same way as normal federations.

This paper proposes a framework that takes advantage of the Decoupled Federate Architecture to handle an RTI failure. The basic idea is to prevent a local failure from affecting the overall distributed computation (simulation). A generic fault-tolerant model has been developed as middleware transparent to the user. The model dynamically substitutes the crashed RTI components with backups while the simulation federates still continue to operate as normal without being disrupted. The fault-tolerant model avoids repeating the execution of federates when handling failure. Furthermore, the framework uses a sender-based method to ensure reliable in-transit message delivery in case of failure. We have also designed a novel algorithm to dispose of buffered events after they have been successfully delivered to the subscribers. A series of experiments has been performed to validate and benchmark the fault-tolerant model.

The rest of this paper is organized as follows. Section 2 discusses related work and analyzes the problems to be solved. Section 3 gives an overview of the Decoupled Federate Architecture. Section 4 details the functionalities and design of the framework as well as the algorithms for dealing with in-transit messages. Section 5 presents the experiments based on a distributed supply-chain simulation example, which examine the correctness of the fault-tolerant model and compare the robust federates with normal federates in terms of execution efficiency and scalability. In Section 6, we conclude with a summary and proposals on future work.

2. Related Work

Many technologies have been developed for facilitating fault tolerance in distributed applications. Cristian highlighted some principles of fault tolerance in distributed system architectures [9]: understanding failure semantics, masking failure and balancing design cost.

The checkpoint and message-logging approach is commonly used. For example, as proposed by Johnson [3], a process records each message received in a message log while the state of each process is occasionally saved as a checkpoint. A failed process can be restored using some previous checkpoint of the process and the log of messages. The HLA federation save and restore services [5] could be used to save the RTI states at some checkpoints. In the case of failure, a new federation could be created to restore the federation with the saved states. However, in the checkpoint approach, the simulation model should have the functionality to manipulate the states at the model level, and it repeats the computation from one of the checkpoints onwards. Moreover, the overhead for executing federation save and restore can be significant [10].


Fault-tolerant techniques often employ redundant/backup components to achieve system robustness. Birman used backup to ensure fault tolerance in building reliable network applications [11]. Fault tolerance was enabled in a distributed system using rollback-recovery and process replication [12].

These methods take advantage of replication to ensure reliability of distributed applications. Another typical example is the proposed Replica Federate approach [13]. This approach produces multiple identical instances of one single federate, and failures can be detected and recovered upon the outputs of those identical instances. However, replication consumes extra resources and requires synchronization of the replicas to maintain consistency. As such, it results in lowered system performance.

Furthermore, extra federate replicas in a single federation increase the probability of overall system fault due to an RTI failure, and this may also limit the scalability of the approach. Our fault-tolerant framework adopts both state saving and replication in the design, and it avoids the drawbacks of the above approaches. By separating the execution of simulation models from RTI failures, the framework does not require rollback support from simulation models. The lightweight physical federation consumes minimal system resources and operates independently of the simulation model execution. The framework makes replicas of physical federates only when failure occurs while leaving simulation models intact. The redundancy incurred by common replication approaches is therefore also minimized.

Although fault-tolerance support has been informally proposed in the latest HLA-evolved specification and some design patterns for fault-tolerant federations have been suggested [14], there are only a few preliminary and non-standard implementations for this purpose. In addition, the ‘failure-over RTI’ design pattern suggested by Möller et al. [14] is similar to the scheme proposed by the authors [6]. The design pattern provides federates with a prioritized list of RTIs with an active RTI servicing the federates, and they connect to another when the active RTI fails. Our scheme [6] suggests replacing a failed RTI with a new RTI using the decoupled federate architecture.

Eklöf et al. [15, 16] proposed a framework – Distributed Resource Management System (DRMS) – for robust execution of federations. Their framework deals with failure by migrating federates to new hosts upon failure, while using the checkpoint approach for state saving/recovery. In the context of this approach, the federates should be specially developed to save/recover the simulation models’ internal states. It is also assumed that federates executed within the scope of DRMS are portable, meaning that they should not be bound to a specific piece of hardware and can be easily migrated between different host environments. The experimental results indicate a significant overhead for providing fault tolerance in terms of extra messages in some scenarios [16]. In contrast, our framework is not subject to the above constraints. It provides a relatively generic fault-tolerance solution to the HLA-based distributed simulations.

However, it is a challenge to develop a generic fault-tolerant model. One of the difficulties is due to the assumption that developers can model their federates in a totally free manner. It is unlikely that a generic state-saving and replication mechanism can be provided that will be suitable for any federate. Even given such a mechanism, it is unlikely that all developers will use the same standard package to model their simulations. Without the ability to customize the user’s simulation code, it is almost impossible to make snapshots of all system states of any federate. The principle of reusing existing federate code increases the difficulty of this task.

However, the HLA standard makes it relatively easy to intercept the system states at the RTI level using a middleware approach. Furthermore, we can see that the simulation model and the Local RTI Component have very different characteristics. A distinction should therefore be made between these two modules when dealing with failure. This study aims to develop a generic framework to handle the failures of the RTI rather than the faults of the simulation models.

3. Decoupled Federate Architecture

As shown in Figure 1(a), a normal simulation federate can be viewed as an integrated program consisting of a simulation model and Local RTI Component (LRC) in an HLA-based distributed simulation [5]. The simulation model executes the representation of the system being analyzed, whereas the LRC services it by interacting and synchronizing with other federates. In a sense, the simulation model performs local computing while the LRC carries out distributed computing for the model.

The Decoupled Federate Architecture [6] was initially designed to tackle the problems involved in replicating running federates for distributed simulation cloning. It separates a federate simulation model from the Local RTI Component. A virtual federate is built with the same code as the original federate. Figure 1(c) gives the abstract model of the virtual federate. Compared with the original federate, the only difference is in the module below the RTI interface, which remains transparent to the users.

A physical federate (PhyFed) is designed as shown in Figure 1(b), and associates itself with a real LRC. Physical federates interact with each other via a common RTI and form a ‘physical federation’ serving the overall simulation. Both virtual federate and physical federate operate as independent processes. Reliable external communication channels link the two modules into a single federate executive. The virtual federate and the physical federate may operate within the same address space or in different machines in a networking environment, depending on the developers’ requirements.
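The split between a virtual federate and a physical federate can be pictured with a small sketch. It is only an illustration under stated assumptions: Python is used purely for exposition, a thread and two in-memory queues stand in for the independent processes and the reliable external communication channel, and the class names `VirtualFederate` and `PhysicalFederate` as well as the echoed reply format are hypothetical.

```python
import queue
import threading


class PhysicalFederate(threading.Thread):
    """Hypothetical stand-in for the process that owns the real LRC."""

    def __init__(self, requests, replies):
        super().__init__(daemon=True)
        self.requests = requests
        self.replies = replies

    def run(self):
        while True:
            service, args = self.requests.get()
            if service == "shutdown":
                break
            # A real PhyFed would invoke the LRC here; this sketch only echoes.
            self.replies.put((service, "ok", args))


class VirtualFederate:
    """Keeps the RTI-style call interface but forwards every call to the PhyFed."""

    def __init__(self):
        self.requests = queue.Queue()   # stands in for the external channel
        self.replies = queue.Queue()
        self.phyfed = PhysicalFederate(self.requests, self.replies)
        self.phyfed.start()

    def call(self, service, *args):
        self.requests.put((service, args))
        return self.replies.get(timeout=1.0)

    def close(self):
        self.requests.put(("shutdown", ()))
        self.phyfed.join(timeout=1.0)
```

The simulation model only ever talks to `VirtualFederate.call`, so the object behind the channel could in principle be exchanged without touching model code, which is the property the fault-tolerant model relies on.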





Figure 1. Normal federate and decoupled federate architecture

A well-designed Decoupled Federate Architecture can provide federated simulations with almost equivalent execution efficiency to that obtained using normal federates in terms of both latency and time advancement performance [6, 17]. As the Decoupled Federate Architecture keeps the standard HLA interface, we can customize our own RTI++ library (middleware) to expand the functionalities of the original RTI software without altering the semantics of RTI services. With these merits, the architecture seems to be an infrastructure suitable for developing the fault-tolerant model (see Section 4.1).

4. Framework for Supporting Robust HLA-based Simulations

This section introduces the internal design of the fault-tolerant model and related issues. No implementation can ensure that any program is immune from all faults, and the focus of this study is to develop a robust infrastructure for facilitating distributed simulations rather than to release developers from validating their simulation models.

Figure 2. Fault-tolerant model upon dynamic LRC substitution

The fault-tolerant model therefore does not consider federate crashes due to the incorrect implementation of its simulation model or address deadlock in federation synchronization. It also assumes that (1) the underlying RTI software is properly implemented and is unlikely to contain bugs, and (2) the messages sent and received in the network are uncorrupted. An RTI failure in the current implementation can be: (1) time-out of an RTI invocation, (2) a critical RTI exception (e.g. RTI::RTIinternalError or any other exception specified as critical by the user), (3) any other unknown error (e.g. a runtime error of system libraries) from the RTI or (4) crash of the physical federate or RTIEXEC/FEDEXEC. Note that a crash of RTIEXEC/FEDEXEC only concerns the DMSO RTI software. In the rest of this paper, a federate means one that contains a virtual federate and a physical federate, and we will explicitly refer to a traditional federate that directly interacts with the real RTI as a ‘normal federate’.
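The four kinds of RTI failure listed above can be expressed as a small classification routine. This is a minimal sketch, not the framework's detector: `RTIInternalError` is a hypothetical Python stand-in for `RTI::RTIinternalError`, the `phyfed_alive` probe is an assumed callback, and for brevity every exception that is not marked critical is treated as an unknown error, whereas a real middleware would pass ordinary non-critical RTI exceptions back to the model.

```python
from enum import Enum


class RTIFailure(Enum):
    TIMEOUT = 1              # (1) time-out of an RTI invocation
    CRITICAL_EXCEPTION = 2   # (2) exception marked critical by the user
    UNKNOWN_ERROR = 3        # (3) any other unexpected error
    CRASH = 4                # (4) PhyFed or RTIEXEC/FEDEXEC is gone


class RTIInternalError(Exception):
    """Hypothetical stand-in for RTI::RTIinternalError."""


# Exceptions the user has declared critical (assumed configuration).
CRITICAL = (RTIInternalError,)


def classify(call, phyfed_alive=lambda: True):
    """Run one RTI call and return (result, failure or None)."""
    if not phyfed_alive():           # active check, needed for case (4)
        return None, RTIFailure.CRASH
    try:
        return call(), None
    except TimeoutError:
        return None, RTIFailure.TIMEOUT
    except CRITICAL:
        return None, RTIFailure.CRITICAL_EXCEPTION
    except Exception:
        return None, RTIFailure.UNKNOWN_ERROR
```

Cases (1) to (3) surface as a side effect of making the call, while case (4) needs a separate liveness probe, which mirrors the passive/active distinction drawn for the Failure Detector.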

4.1 Fault-tolerant Model

In the framework, the fault-tolerant model is embedded in the customized RTI++ library (middleware of the Decoupled Federate Architecture). As shown in Figure 2, the model contains a Management Module and a Failure Detector in the middleware. The Management Module comprises an RTI States Manipulator and a Buffer Manager.

At runtime, the middleware intercepts the invocation of each RTI service method. The RTI States Manipulator saves RTI states immediately before passing the RTI call to the physical federate to execute it. For example, when the virtual federate invokes publishObjectClass, the RTI States Manipulator intercepts this call and saves the information, after which it will call the physical federate via the External Communication channel. In this way, the RTI States Manipulator logs all the RTI system states into local stable storage.
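The save-then-forward behaviour can be sketched as follows. The sketch rests on several assumptions that are not taken from the paper: the log is a JSON-lines file, recovery is modelled as a naive replay of every logged call, and the names `RTIStatesManipulator`, `invoke` and `replay` are hypothetical. The `phyfed_call` argument stands in for the external communication channel.

```python
import json
import os
import tempfile


class RTIStatesManipulator:
    """Logs each intercepted RTI call to stable storage, then forwards it."""

    def __init__(self, phyfed_call, log_path):
        self.phyfed_call = phyfed_call   # assumed: callable(service, *args)
        self.log_path = log_path

    def invoke(self, service, *args):
        record = {"service": service, "args": list(args)}
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps(record) + "\n")
            log.flush()
            os.fsync(log.fileno())       # make the record durable first
        return self.phyfed_call(service, *args)

    def replay(self, new_phyfed_call):
        """Push the logged states to a replacement PhyFed (naive replay)."""
        with open(self.log_path, encoding="utf-8") as log:
            for line in log:
                record = json.loads(line)
                new_phyfed_call(record["service"], *record["args"])


# Usage: one call is intercepted, logged and forwarded, then replayed.
log_path = os.path.join(tempfile.mkdtemp(), "rti_states.log")
seen = []
manipulator = RTIStatesManipulator(lambda s, *a: seen.append((s, a)), log_path)
manipulator.invoke("publishObjectClass", "Order")
recovered = []
manipulator.replay(lambda s, *a: recovered.append((s, a)))
```

Writing the record before forwarding the call means that a failure during the call itself still leaves the intended state on disk.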

Some RTI states are relatively static, such as the federate identity, federation information, the published/subscribed classes and time constrained/regulating status. Other states include the registered or deleted object instances, and granted federate time. Some event data may also need to be saved, such as sent and received interactions, updated and reflected attribute values of object instances, etc. The RTI States Manipulator logs those states through the standard RTI interface, and its design is transparent to and independent of the underlying RTI implementation. The Buffer Manager makes use of saved attribute updates and interactions for dealing with in-transit events (see Section 4.2 for details).

The Failure Detector monitors the status of the LRC or even the RTIEXEC/FEDEXEC if necessary. In the four cases of RTI failure (see above), the first three cases can be detected passively via the physical federate while the fourth requires the Failure Detector to actively check the status of the physical federate or RTIEXEC/FEDEXEC. After confirming the occurrence of an RTI failure, the Management Module will start a failure recovery procedure. Management Modules of other federates will eventually detect the ‘remote’ failure. In this section, we describe a straightforward recovery scheme (as shown in Figure 3) from the perspective of the first failed federate(s) using the following steps.

1. Preparation for recovery. The Management Module cuts off the connection from its PhyFed and terminates it, while other federates’ middleware attempt to extract received events before doing this.

2. Initiation of new physical federation. Since the original physical federation cannot function properly due to the RTI failure, the Management Module has to create a new physical federation and initiate a new PhyFed instance. Other federates’ middleware also perform exactly the same operation. All virtual federates switch to the new PhyFeds and form a new workable federation together.

3. State recovery. All RTI States Manipulators recover RTI states from stable storage to the PhyFeds.

4. Handling in-transit events. All Buffer Managers ensure in-transit events are delivered properly to the subscribers.

5. Coordination among Management Modules. The Management Module synchronizes the recovered federation to guarantee that all federates are fully reinitialized and ready to proceed.
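The five steps above can be sketched as a single routine run by each Management Module. This is an illustrative outline only: the PhyFed is modelled as a plain list that receives restored states, and `create_phyfed`, `resend_in_transit` and `barrier` are hypothetical callbacks standing in for federation creation, the Buffer Manager and federation-wide synchronization.

```python
class ManagementModule:
    """Illustrative driver for the straightforward recovery scheme."""

    def __init__(self, name, saved_states, create_phyfed):
        self.name = name
        self.saved_states = saved_states    # as logged by the states manipulator
        self.create_phyfed = create_phyfed  # hypothetical factory for a PhyFed
        self.phyfed = create_phyfed()
        self.trace = []

    def recover(self, resend_in_transit, barrier):
        # Step 1: disconnect from the failed PhyFed and terminate it.
        self.phyfed = None
        self.trace.append("prepare")
        # Step 2: join a new physical federation through a new PhyFed.
        self.phyfed = self.create_phyfed()
        self.trace.append("new PhyFed")
        # Step 3: restore the saved RTI states into the new PhyFed.
        for state in self.saved_states:
            self.phyfed.append(state)
        self.trace.append("states restored")
        # Step 4: let the Buffer Managers redeliver in-transit events.
        resend_in_transit(self)
        self.trace.append("in-transit handled")
        # Step 5: wait until every federate has finished re-initializing.
        barrier(self)
        self.trace.append("synchronized")


module = ManagementModule("Fed[2]", ["publish Order", "subscribe Invoice"], list)
module.recover(resend_in_transit=lambda m: None, barrier=lambda m: None)
```

The virtual federate is not involved in any step, which is why the simulation model does not need to roll back or repeat computation.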





Figure 3. Illustration of straightforward failure recovery procedure

Finally, the virtual federates obtain control again and continue execution with the support of a new physical federation. Therefore, physical federates work as plug-and-play components, and they can be replaced at runtime. The fault-tolerant model functions as a firewall to prevent failure of local or remote LRCs from stopping the execution of the simulation model.

4.2 Dealing with In-transit Events

The current design of the fault-tolerant model supports the conservative time synchronization scheme [2]. The RTI does not keep the events, such as updating of attributes and sending interactions, after they have been processed. The recovered federates may therefore miss some events previously generated with a timestamp greater than the federation time on failure, which should be delivered to them. Examples of this problem are shown in Figure 4. Although the example is discussed based on the timestamp ordered (TSO) events, it is similar for receive order (RO) events.

As illustrated in Figure 4(a), at simulation time $T$, Fed[1] sends a TSO event EvX with timestamp $T + \Delta t_1$ ($\Delta t_1 > \Delta t_0 \geq \text{Lookahead}$) to subscribers (e.g. Fed[2]). In the case that Fed[2] encounters RTI failure at time $T + \Delta t_0$, Fed[2] resumes with a new PhyFed. However, the recovered federate will never receive the event EvX as it has already been lost due to the failure. In another case (Figure 4(b)), Fed[3] encounters RTI failure at time $T$ immediately after sending a TSO event EvY with timestamp $T + \Delta t_0$. Refer to the failure recovery procedure shown in Figure 3; in this case the message EvY may or may not be received by Fed[1] before it flushes its original PhyFed’s TSO queue to initiate a new PhyFed.

In order to ensure that in-transit events are delivered to the receivers when the simulation resumes from failure, a solution is proposed to resend ‘image’ events with an identical content/timestamp to the corresponding in-transit events generated previously. The Buffer Manager (Figure 2) at the sender side records each outgoing TSO event and indexes the event in time order. The buffer can be flushed to stable storage from time to time. This approach is similar to the commonly used message logging approach in the sense of recording events, but it does not require rollback of the model’s execution [18]. The approach needs to make a tradeoff between redundancy in message passing and complexity of the control mechanism under the condition that the new PhyFed must not miss any event that ought to be received. A general principle in designing the resending approach is to ensure that all federates operate in the same way as normal federates that have not encountered a fault.
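A sender-side buffer that keeps outgoing TSO events indexed in time order might look like the sketch below. It is a minimal in-memory illustration: the class name `BufferManager` follows the text, but the method names, the tuple layout and the use of a sequence counter as a tie-breaker are assumptions, and flushing to stable storage is omitted.

```python
import bisect
import itertools


class BufferManager:
    """Keeps outgoing TSO events sorted by timestamp (in-memory sketch)."""

    def __init__(self):
        self._events = []               # sorted (timestamp, seq, payload)
        self._seq = itertools.count()   # unique tie-breaker, preserves send order

    def record(self, timestamp, payload):
        bisect.insort(self._events, (timestamp, next(self._seq), payload))

    def events_after(self, granted_time):
        """'Image' events a recovering receiver may have missed."""
        # (granted_time, inf) sorts after every entry stamped granted_time.
        start = bisect.bisect_right(self._events, (granted_time, float("inf")))
        return [(ts, payload) for ts, _, payload in self._events[start:]]


buf = BufferManager()
buf.record(12.0, "EvY")
buf.record(10.0, "EvX")
buf.record(15.0, "EvZ")
```

Because the list stays sorted, both the lookup for resending and a later prefix deletion for fossil collection reduce to a binary search.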

Figure 4. Illustration of the problems in dealing with in-transit events

To minimize extra networking overhead, the proposed approach requires the sender only resends those events that (1) have been subscribed and (2) have not been received or buffered in the subscriber’s TSO queue. The middleware can be designed to help the subscribers notify the particular sender(s) about the reception status of the events originating from the sender. According to the feedback, the sender can selectively generate the required events. The procedure is as follows, including preparation before a crash and the action on failure recovery.

1. Collecting Subscription/Publication/Registration Data. Each federate builds a Federate Subscription/Publication/Registration (FSPR) Table, which records the classes subscribed/published and the objects registered by other federates. Each federate broadcasts its subscription/publication information, which enables other federates to update the corresponding entries in their own FSPR Tables. When an object instance is registered (with the federate ID encoded by the middleware), each subscriber updates the table according to the object class and the federate to which the instance belongs; the table is also updated when ownership is transferred. Thus, when attribute updates of an object instance (events) are received, the receiver can trace the source of this event. When dealing with interactions, the middleware simply codes the federate ID in the tag of an interaction and decodes the ID on reception.

2. Buffering Events. Each sender records its local updates in timestamp order according to their associated object classes. Thus each sender records what events it has generated. Referring to the FSPR Table, each sender also knows which federates should receive these events.

3. Regenerating Events. On recovering from failure, the PhyFed being recovered requests the senders to deliver those events with a timestamp greater than its current granted time. (It is possible that the timestamps of some events to be resent are less than the sender’s granted time plus lookahead. In this case, the RTI++ middleware can be designed to encode the content and timestamps of these events in a special RO message, which can then be decoded in the form of TSO events at the receiver’s end.) The recovered federate can therefore receive those events and pass them to the simulation model with the advance of time.
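The three steps above can be condensed into a sketch of how a sender might select the events to regenerate. The data layout is hypothetical: the table only tracks subscriptions, events are plain dictionaries with `cls` and `ts` keys, and the special RO encoding for events below the sender's granted time plus lookahead is left out.

```python
from collections import defaultdict


class FSPRTable:
    """Subscription part of a Federate Subscription/Publication/Registration table."""

    def __init__(self):
        self.subscriptions = defaultdict(set)   # federate id -> object classes

    def on_subscribe(self, federate_id, object_class):
        self.subscriptions[federate_id].add(object_class)


def events_to_resend(logged, table, recovering_fed, granted_time):
    """Select buffered events the recovering federate subscribed to and
    whose timestamp is greater than its current granted time."""
    wanted = table.subscriptions.get(recovering_fed, set())
    return [ev for ev in logged
            if ev["cls"] in wanted and ev["ts"] > granted_time]


table = FSPRTable()
table.on_subscribe("Fed[2]", "Order")
logged = [
    {"cls": "Order", "ts": 8, "data": "o1"},
    {"cls": "Order", "ts": 12, "data": "o2"},
    {"cls": "Invoice", "ts": 13, "data": "i1"},
]
resend = events_to_resend(logged, table, "Fed[2]", granted_time=10)
```

Filtering on both the subscription and the granted time is what keeps the extra network traffic proportional to what the recovering federate could actually have missed.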

The re-sending approach is also applicable for processing interactions. For processing RO events, the Buffer Manager logs and indexes all the outgoing RO events according to the sequence in which they are created. The index can therefore be used to identify the RO events. The subscribers simply keep the indexes of the RO events they have received. When failure occurs, the PhyFed being recovered requests the senders to deliver the missing RO events. The RO events to be resent can be easily identified by comparing the indexes maintained by the sender and the receiver. After that, the senders resend the missing RO events and dispose of the logged RO events accordingly after successful delivery. This approach is similar to the counter mechanism [19]. The Data Distribution Management (DDM) method [20] can also be adopted in our case, to optimize the delivery of missing events.
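The index comparison for RO events can be illustrated with a short sketch. The class and method names are hypothetical, a single receiver is assumed, and the log lives in memory only.

```python
class ROSender:
    """Logs outgoing receive-order events under a running index."""

    def __init__(self):
        self.log = {}          # index -> payload
        self.next_index = 0

    def send(self, payload):
        index = self.next_index
        self.log[index] = payload
        self.next_index += 1
        return index, payload

    def missing_for(self, received_indexes):
        """Events whose index the receiver has not reported."""
        return [(i, p) for i, p in sorted(self.log.items())
                if i not in received_indexes]

    def dispose(self, delivered_indexes):
        """Drop logged events once delivery has been confirmed."""
        for i in delivered_indexes:
            self.log.pop(i, None)


sender = ROSender()
for name in ("ro-a", "ro-b", "ro-c"):
    sender.send(name)
received = {0}                       # the receiver only saw the first event
missing = sender.missing_for(received)
sender.dispose([0] + [i for i, _ in missing])
```

With several subscribers, a sender would have to keep one received-index set per subscriber and dispose of an event only when all of them have confirmed it.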





Figure 5. Example for calculating timestamp of events to be disposed

4.3 Fossil Collection

Using the scheme described in the previous section, sent events are buffered at the senders’ side against any potential unpredictable failure. As the simulation execution proceeds, the buffered data will accumulate indefinitely. At some stage this will become a bottleneck as system resources are wasted in maintaining a huge amount of redundant data. It is therefore necessary to perform fossil collection on the logged events. The fossil collection should (1) ensure events that any subscriber might miss in case of failure are always available, as well as (2) dispose of events that have been received by all subscribers as soon as possible.

The RTI ensures that a federate receives all events with timestamp less than its granted time. Therefore, senders do not need to keep events with timestamp less than the granted time of a receiver. Based on this fundamental assumption, the main task of fossil collection is to determine which logged TSO events are safe for a sender to dispose of according to its current granted time.

A time-constrained federate has an associated Lower Bound Time Stamp (LBTS), which is the timestamp of the earliest possible TSO event that may be generated by any other regulating federate [4]. In the scenario depicted in Figure 5, we write the lookahead of the ith federate (Fed[i]) as $La_i$, its current granted time as $T_i$ and the time of the next request this federate may make to the RTI to advance time as $T'_i$. We define a maximum timestep by which Fed[i] advances its time in each loop as $\Delta_i = T'_i - T_i$.

Considering the simplest scenario consisting of only two federates, from Fed[1]’s perspective, failure may occur in Fed[2] either (1) after Fed[2] has been granted time $T_2$ but before Fed[2] makes another request to advance time, or (2) after Fed[2] has made a request to advance time to $T_2 + \Delta_2$ but before the request is granted.

Fed[1]’s LBTS is $T_2 + La_2$ in the first case (hence $T_1 \leq T_2 + La_2$), and its LBTS is $T_2 + La_2 + \Delta_2$ in the second case (hence $T_1 \leq T_2 + La_2 + \Delta_2$). It is safe for Fed[1] to dispose of buffered events earlier than Fed[2]’s current granted time $T_2$, which means any event with timestamp less than $T_1 - (La_2 + \Delta_2)$ can be removed immediately.

Generalizing to $n$ federates, suppose that Fed[k] ($k \neq 1$) has the smallest federate time $T_k$ of the other federates, so that it is safe for Fed[1] to dispose of events with time earlier than $T_k$. It is obvious that $La_k + \Delta_k \leq \max\{(La_i + \Delta_i) \mid i \neq 1\}$. Thus, in the worst case, it is safe for Fed[1] to dispose of all logged TSO events with timestamp less than $T_1 - \max\{(La_i + \Delta_i) \mid i \neq 1\}$, and we define this value as Fed[1]’s safe lower bound. The fossil collection algorithm can determine this safe lower bound easily given that the lookahead and timestep of other federates are available, and this can be achieved easily using a middleware approach.
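The safe lower bound lends itself to a short worked example. The sketch below assumes the bound is Fed[1]'s granted time minus the largest sum of lookahead and maximum timestep over all other federates, as derived above; the function names and the sample numbers are illustrative only.

```python
def safe_lower_bound(t1, others):
    """Fed[1]'s safe lower bound.

    t1     -- Fed[1]'s current granted time
    others -- (lookahead, max_timestep) pairs of every other federate
    """
    return t1 - max(lookahead + step for lookahead, step in others)


def fossil_collect(logged, bound):
    """Keep only logged (timestamp, event) pairs at or above the bound."""
    return [(ts, ev) for ts, ev in logged if ts >= bound]


# Fed[1] is at time 100; Fed[2] has lookahead 2 and timestep 5,
# Fed[3] has lookahead 1 and timestep 10, so the bound is 100 - 11 = 89.
bound = safe_lower_bound(100, [(2, 5), (1, 10)])
kept = fossil_collect([(80, "a"), (89, "b"), (95, "c")], bound)
```

Events stamped exactly at the bound are kept, because only events with a timestamp strictly less than the bound are guaranteed to have been received by every subscriber.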

Furthermore, any federate’s timestep $\Delta_i$ may change from time to time. To minimize global propagations, a simulation time window can be defined with an upper and lower bound specified. The window of a federate (say Fed[i]) is an interval around $T_i + \Delta_i$, i.e. $[(T_i + \Delta_i) - \delta_1, (T_i + \Delta_i) + \delta_2]$, for some $\delta_1 < \Delta_i$ and $\delta_1, \delta_2 > 0$.

When Fed[i] requests to advance its time to $T'_i$, there are four cases.

1. If $T'_i \leq (T_i + \Delta_i) - \delta_1$, set new $\Delta_i = (\Delta_i - \delta_1)$.

2. If $(T_i + \Delta_i) - \delta_1 < T'_i \leq (T_i + \Delta_i)$, $\Delta_i$ is unchanged.

3. If $(T_i + \Delta_i) < T'_i \leq (T_i + \Delta_i) + \delta_2$, set new $\Delta_i = (\Delta_i + \delta_2)$.

4. If $T'_i > (T_i + \Delta_i) + \delta_2$, set new $\Delta_i = (T'_i - T_i)$.

When $\Delta_i$ decreases, it is safe for the other federates to calculate safe lower bounds using a larger $\Delta$ value for Fed[i]. In case (1), we still need to send other federates the new $\Delta_i$, as it is out of the window. For cases (3) and (4), using middleware can ensure other federates have received the new $\Delta_i$ before Fed[i] requests the RTI to advance time. After Fed[i] is granted a new time, the time window will be moved forward to adapt to the change.
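The four window cases can be written as one function. The sketch is a reading of the case analysis above and makes its assumptions explicit: the window is taken to span from the expected next time minus a lower margin `d1` to the expected next time plus an upper margin `d2`, the lower boundary belongs to case (1), and the returned flag only says whether other federates need to be told about a new timestep. All names are hypothetical.

```python
def update_timestep(t_i, step, t_next, d1, d2):
    """Return (new_step, must_notify) for a request to advance to t_next.

    t_i    -- current granted time of the federate
    step   -- timestep currently advertised to other federates
    d1, d2 -- lower and upper margins of the time window
    """
    expected = t_i + step
    if t_next <= expected - d1:      # case 1: below the window
        return step - d1, True
    if t_next <= expected:           # case 2: inside, at or below expected
        return step, False
    if t_next <= expected + d2:      # case 3: inside, above expected
        return step + d2, True
    return t_next - t_i, True        # case 4: beyond the window


# Granted time 10, advertised timestep 5, margins 1 and 2:
# the window is [14, 17] around the expected next time 15.
results = [update_timestep(10, 5, t, 1, 2) for t in (13, 15, 16, 20)]
```

Only requests that fall inside the lower half of the window leave the advertised timestep untouched, so small fluctuations in a federate's step size do not trigger any federation-wide message.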

4.4 Optimizing the Failure Recovery Procedure

The failure recovery procedure starts from the point where a failure is detected by the first federate and ends at the point where all federates are completely re-initialized and ready for resuming normal execution. The straightforward recovery scheme (see Figure 3) requires two time-consuming RTI related operations to be performed, which are (1) to create the physical federation and (2) for each federate, to join the existing federation.

The joinFederationExecution call incurs costly federation-wide operations. For example in DMSO RTI-NG, this operation usually requires opening TCP sockets to all other federates in the federation, which is expensive [21]. To minimize the overhead (which can be greater than 20 s; see Section 5), a possible solution is to avoid these calls during the procedure itself. We attempt to solve this problem using a Physical Federate Pool approach as shown in Figure 6.

Figure 6. Physical federate pool approach

This approach creates one or multiple ‘backup’ physical federations concurrently with the normal simulation execution. Depending on the fault-tolerance requirements, these multiple physical federations can be supported by one or multiple RTIEXEC processes, and the backup physical federates can be executed on the same or different machines as the active physical federate. An appropriate number of PhyFed instances are created, which join their respective backup federations and form PhyFed instance pools (one pool for each federation).

In the context of the pool approach, a PhyFed instance may operate in two modes: (1) working mode, servicing a virtual federate as normal, and (2) idle mode, calling tick regularly to maintain its connection session with the RTI while checking for invocation from a virtual federate. On startup, a virtual federate connects to a PhyFed from the pool and the PhyFed operates in working mode from then onwards.

The backup physical federations consist purely of idle PhyFed instances, which are neither time-regulating nor time-constrained, and only have minimum interaction with each other. The backup physical federations potentially serve for recovery in the future. On failure recovery, an idle PhyFed instance can be fetched from the pool by the virtual federate to provide the required RTI services immediately. This approach therefore avoids consuming time for creating the federation and joining the federation execution prior to state replication. Maintaining spare PhyFed instances consumes extra system resources, and we need to investigate the overhead this may cause. Correspondingly, the straightforward fault recovery scheme (Section 4.1) can be optimized using the pool approach as in Figure 7.
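The pool mechanism can be pictured with a minimal in-memory sketch. The classes below (`PhyFed`, `PhyFedPool`, `VirtualFederate`) are hypothetical stand-ins written for illustration; they do not call any RTI and omit the tick loop and state replication. The point is only that recovery fetches an already joined idle instance instead of creating and joining a federation.

```python
from collections import deque
from enum import Enum


class Mode(Enum):
    IDLE = "idle"        # joined to a backup federation, waiting to be invoked
    WORKING = "working"  # servicing a virtual federate


class PhyFed:
    def __init__(self, name: str) -> None:
        self.name = name
        self.mode = Mode.IDLE
        self.alive = True


class PhyFedPool:
    """Idle PhyFed instances that have already joined a backup federation."""

    def __init__(self, names) -> None:
        self._idle = deque(PhyFed(n) for n in names)

    def acquire(self) -> PhyFed:
        if not self._idle:
            raise RuntimeError("PhyFed pool exhausted")
        phyfed = self._idle.popleft()
        phyfed.mode = Mode.WORKING
        return phyfed

    def idle_count(self) -> int:
        return len(self._idle)


class VirtualFederate:
    def __init__(self, pool: PhyFedPool) -> None:
        self.pool = pool
        self.phyfed = pool.acquire()   # startup: connect to a PhyFed from the pool

    def recover(self) -> PhyFed:
        """Replace the failed PhyFed with an idle one; no create/join is needed."""
        self.phyfed.alive = False
        self.phyfed = self.pool.acquire()
        return self.phyfed
```

The cost of this design is the set of spare instances held in idle mode, which is why the resource overhead of the pool is examined in the experiments.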

Another uncertain factor is the time needed for the remaining federates to detect the failure propagated from the origin. It depends on the form in which the failure appears and how the fault-tolerant model handles it. If the failure is detected as one of the last three cases defined in Section 4, other federates’ middleware need only immediately initiate a passive failure recovery. For the first case, the time required to confirm the occurrence of a failure must be longer than the specified ‘time-out’ period.

The situation becomes even more complicated if the ‘symptoms’ of failure cannot be explicitly identified at all. For example, suppose a federate does not receive a timeAdvanceGranted (TAG) for a significantly long period after it makes the request to advance its time from the RTI [5]. Basically, this may be due to the fact that (1) some LRCs have failed, (2) the condition for granting its request has not been met yet, or (3) some other reason not related to failure has occurred, e.g. an unexpected communication delay for the RTI to convey callbacks. There needs to be a method to distinguish the first case from the others.

The PhyFed pool approach can be used to solve this issue: a pre-selected backup physical federation can also serve as an out-of-band channel for a failed federate to notify the remaining federates of the occurrence of failure. We define a special ‘system’ object class (RTI_FAIL) and have all idle PhyFed instances subscribe to and publish this object class. The Management Module of the first failed federate registers an RTI_FAIL object instance in the selected backup physical federation. The remaining federates’ Management Modules periodically check the existence of such an object from the selected backup physical federation to decide whether to start a passive fault recovery procedure. It is therefore possible for the whole federation to quickly respond to a local failure.

Figure 7. Illustration of the optimized failure recovery procedure using PhyFed pool
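The out-of-band notification can be illustrated with an in-memory stand-in for the backup physical federation. Everything except the class name RTI_FAIL is hypothetical here (`BackupFederation`, `ManagementModule`, `report_local_failure`, `poll`); a real Management Module would register and discover the object instance through an idle PhyFed rather than through a shared Python object.

```python
class BackupFederation:
    """In-memory stand-in for the pre-selected backup physical federation."""

    def __init__(self) -> None:
        self._instances = []  # list of (object_class_name, owner_federate)

    def register_object_instance(self, class_name: str, owner: str) -> None:
        self._instances.append((class_name, owner))

    def discover(self, class_name: str):
        return [owner for cls, owner in self._instances if cls == class_name]


class ManagementModule:
    def __init__(self, federate_name: str, backup: BackupFederation) -> None:
        self.federate_name = federate_name
        self.backup = backup
        self.recovering = False

    def report_local_failure(self) -> None:
        """First failed federate: publish the failure on the out-of-band channel."""
        self.backup.register_object_instance("RTI_FAIL", self.federate_name)
        self.recovering = True

    def poll(self) -> bool:
        """Periodic check; True means a passive recovery should start now."""
        if not self.recovering and self.backup.discover("RTI_FAIL"):
            self.recovering = True
            return True
        return False
```

Because the check does not depend on the failed active federation, a federate that is merely waiting for a slow timeAdvanceGranted callback sees no RTI_FAIL instance and keeps waiting.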

5. Experiments and results

In order to verify the correctness and investigate the overhead incurred in the proposed fault-tolerant model, we perform a series of experiments to compare the robust federates with normal federates using a simple distributed supply-chain simulation.

5.1 Configuration of Experiments

The simulated supply chain comprises an agent company, a factory and a transportation company. The agent keeps issuing orders to the factory, and the latter processes these orders and plans production accordingly. The transportation company is responsible for delivering products of the factory and reporting the delivery status.

The three nodes in the supply chain can be modeled as three federates as shown in Figure 8, namely simAgent, simFactory and simTransportation. These federates form a simple distributed simulation to simulate the supply-chain operation in almost a year (from simulation time 0 to 361). Two object classes ‘Order’ and ‘Products’ and one interaction class deliveryReport are defined in the Federation Object Model (FOM) [2] to represent the types of events exchanged among the federates. Table 1 gives the classes published and/or subscribed to by the federates.

The simFactory reports the cost incurred for each order at the end of the simulation. The simulation starts with an initialization procedure then enters the ‘real’ simulation procedure after a global synchronization. The initialization procedure denotes the interval from the point a federate is started to the exact point where it has completed the operations: create/join the federation; enable time regulating/constrained; publish/subscribe object/interaction classes; and register object instances. During the simulation procedure, federates interact and coordinate time advancement with each other using the conservative synchronization scheme. In this paper, the elapsed times of the initialization procedure and the simulation procedure of each run are referred to as its initialization time and simulation execution time, respectively.

Using the same code for the simulation models, the federates are built into two versions by linking to: (1) the DMSO RTI library directly (normal) and (2) the RTI++ middleware library supporting fault tolerance (robust). The RTI++ in these experiments adopts the PhyFed pool approach and uses the IPC Message Queue [22] as the external communication to bridge the virtual federate and its PhyFed. The PhyFed pool maintains one backup physical federation consisting of three idle PhyFeds.

Figure 8. A simple distributed supply-chain simulation

Table 1. Declaration information of the federates

| Federate          | Object class Order (attributes: Index, Size) | Object class Products (attributes: Amount, Index, Date) | Interaction class deliveryReport (parameters: Index, Status) |
|-------------------|----------------------------------------------|---------------------------------------------------------|--------------------------------------------------------------|
| simAgent          | Publish                                      | NIL                                                     | NIL                                                          |
| simFactory        | Subscribe                                    | Publish                                                 | Subscribe                                                    |
| simTransportation | NIL                                          | Subscribe                                               | Publish                                                      |

Table 2. Configuration of experiment test bed (WS: workstation)

| Computers         | WS 1, 2                       | WS 3                         | Server                       | WS 4–12                       |
|-------------------|-------------------------------|------------------------------|------------------------------|-------------------------------|
| Operating System  | Sun Solaris OS 5.8            | Sun Solaris OS 5.8           | Sun Solaris OS 5.8           | Sun Solaris OS 5.9            |
| CPU               | Sparcv9 CPU, at 900 MHz       | Sparcv9 CPU, at 360 MHz * 2  | Sparcv9 CPU * 6, at 248 MHz  | Sparc II CPU, at 400 MHz      |
| RAM (MB)          | 1024                          | 512                          | 2048                         | 512                           |
| Compiler          | GCC 2.95.3                    | GCC 2.95.3                   | GCC 2.95.3                   | GCC 2.95.3                    |
| Underlying RTI    | DMSO NG 1.3 V6                | DMSO NG 1.3 V6               | DMSO NG 1.3 V6               | DMSO NG 1.3 V6                |
| Processes running | simAgent or simTransportation | simFactory                   | RTIEXEC and FEDEXEC          | simAgent or simTransportation |

The experiment architecture and platform specification are listed in Table 2. The experiments use three to twelve workstations and one server, which are interlinked via a 100 Mbps backbone. Workstations four to twelve are only used in scalability studies (Section 5.4). Each federate occupies one individual workstation, with the RTIEXEC and FEDEXEC processes running on the server.

5.2 Correctness of Fault-tolerant Model

To verify the correctness of the fault-tolerant model, we specify federate simAgent to generate the same set of orders in different runs. There are three sets of experiments in this section. We first execute the normal federates, in which the outputs are used as a reference in subsequent experiments. Secondly, we repeat the simulation using the robust federates without introducing failure (FAULT_FREE). The last experiment also uses robust federates but with failure abruptly triggered once by manually terminating a working PhyFed during the simulation procedure (FAULT_INCURRED). The outputs obtained using normal federates are summarized as follows.

1. simAgent issues 240 orders, in which the first and the last orders carry timestamps 2.5 and 362.5, respectively.

2. simFactory receives 239 orders (note the last order is not received as it is after the simulation end time) and makes products accordingly.

3. simTransportation receives all product updates issued earlier than the end time and sends deliveryReport interactions with respect to these updates.

From the FAULT_FREE and FAULT_INCURRED experiments, we check the orders issued and received, products produced and delivered as well as the calculation of costs. Outputs (including the timestamps and values of all events) in these experiments exactly match those using normal federates. This indicates that the fault-tolerant model does not introduce any variation to the simulation results, and our framework provides a correct robustness mechanism for HLA-based distributed simulations. The FAULT_INCURRED experiments also show the benefit of the decoupled architecture. The failure and the recovery procedure are properly handled and executed by the middleware during the runtime. The fault handling and recovery is transparent to the simulation model execution. The user’s simulation model was executed in exactly the same way as if the failure had not occurred.

Figure 9. Simulation execution time

5.3 Efficiency of Fault-tolerant Model

To investigate the performance of the fault-tolerance mechanism, another set of experiments is performed to collect the overall execution time using normal and robust federates. We specify federate simAgent to generate orders randomly in each run. For normal federates, we have a number of runs and the average execution time of these is referred to as the NORMAL time of executing one simulation session. As for the robust federates, we first repeat the FAULT_FREE experiments then carry out a number of the FAULT_INCURRED experiments. From the FAULT_INCURRED experiments, we select three runs in which the failure of the PhyFed corresponding to federate simFactory occurs only once at simulation time 43, 182 or 320. These points represent failure at the start (FI_S), middle (FI_M) and end (FI_E) stages, respectively.

The average CPU utilization of a single normal federate or a virtual federate (in workstation 1 or 2) is reported as above 80%. A PhyFed has an average CPU utilization below 0.5% in working mode and below 0.02% in idle mode.

The initialization time of normal federates varies from 19 s to 27 s in different runs, and it varies from 21 s to 27 s using robust federates. The latency for initiating the PhyFed pool is well hidden. The simulation execution times of different experiments are reported in Figure 9. The normal simulation execution time is 584 s using normal federates, which is almost the same as the average simulation execution time in FAULT_FREE experiments. This means that the overhead for federate decoupling and maintaining the PhyFed pool has little influence on execution efficiency.

In the FAULT_INCURRED experiments, the simulation execution time is only 11–13 s longer than the normal case. The overhead does not fluctuate much for failure at different stages of the simulation execution. In the experiments, if an RTI call issued from the virtual federate has not been returned by the PhyFed after more than 6 s, a time-out will occur. Because of this, a large part of this slight overhead is due to the failure detection procedure.
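The time-out based detection can be sketched as follows. The experiments use an IPC Message Queue between the virtual federate and its PhyFed; the sketch below substitutes Python’s in-process `queue.Queue` for that channel, and the names `call_phyfed` and `PhyFedTimeout` are hypothetical. Only the 6 s default is taken from the text.

```python
import queue


class PhyFedTimeout(Exception):
    """Raised when the PhyFed does not answer an RTI call in time."""


def call_phyfed(request_q: "queue.Queue", reply_q: "queue.Queue",
                request: str, timeout_s: float = 6.0):
    """Forward an RTI call to the PhyFed and wait for its reply.

    A missing reply after timeout_s seconds is treated as a failure symptom,
    at which point the middleware would start the recovery procedure.
    """
    request_q.put(request)
    try:
        return reply_q.get(timeout=timeout_s)
    except queue.Empty:
        raise PhyFedTimeout(f"no reply to {request!r} within {timeout_s} s") from None
```

With a 6 s time-out, detection alone accounts for roughly half of the 11–13 s of extra execution time observed in the FAULT_INCURRED runs.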

When failure occurs, we assume that the normal federates have to start from the beginning, and the sum of the elapsed times of both the failed and repeated simulation executions is used for comparison with the simulation execution times using robust federates. The percentage of saved execution time is shown in Figure 10. Obviously, the later the failure occurs, the more execution time can be saved (up to 50%).
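The comparison can be reproduced approximately with a few lines of arithmetic. The sketch below rests on two assumptions that are not stated in the experiments: wall-clock time is taken to progress linearly with simulation time, so the failed run has consumed the fraction `failure_time / end_time` of a normal run, and the robust execution time is taken as 584 s plus 12 s (the midpoint of the reported 11–13 s). The function name `saved_fraction` is hypothetical.

```python
def saved_fraction(normal_s: float, robust_s: float,
                   failure_time: float, end_time: float) -> float:
    """Fraction of execution time saved by robust federates.

    Normal federates are charged the failed partial run plus one full rerun;
    the partial run is estimated under a linear-progress assumption.
    """
    failed_part = normal_s * failure_time / end_time
    restart_total = failed_part + normal_s
    return 1.0 - robust_s / restart_total


NORMAL_S = 584.0          # reported normal simulation execution time
ROBUST_S = 584.0 + 12.0   # assumed: midpoint of the reported 11-13 s overhead
END_TIME = 361.0          # simulation end time

for stage, t in (("FI_S", 43.0), ("FI_M", 182.0), ("FI_E", 320.0)):
    print(stage, f"{saved_fraction(NORMAL_S, ROBUST_S, t, END_TIME):.1%}")
```

Under these assumptions the estimate is roughly 9%, 32% and 46% for the start, middle and end stages. The measured values are those in Figure 10; the middle-stage estimate is in line with the figure of about 32% reported for the scalability experiments in Section 5.4.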





Figure 10. Percentage of saved execution time with failure occurring at different stages

Figure 11. Initial federation for the ten sets of experiments for the scalability test

5.4 Scalability of Fault-tolerant Model

A third series of experiments is performed to test the scalability of the fault-tolerant model. This consists of ten sets of experiments in total, with the number of federates varying from 3 to 12.

Each set of experiments uses a federation which always contains one simFactory and one or multiple instances of simAgent and simTransportation. In the first set of experiments, the federation has one simFactory, one simAgent and one simTransportation. In each subsequent set of experiments, we always introduce one new instance of simAgent or simTransportation alternately to the previous set of experiments. As shown in Figure 11, each added simAgent (simTransportation) is marked with the total number of federates in the federation after it is added.

Similar to the experiments reported in Section 5.3, the execution time is measured and compared using different types of federates, i.e. normal and robust (FAULT_FREE and FAULT_INCURRED) federates. For FAULT_INCURRED experiments, the failure of federate simFactory is configured to occur only once at simulation time 182 (i.e. always at the middle of the simulation execution). The experiment architecture and platform specification for the scalability study are listed in Table 2 and the workstation on which each federate operates is also illustrated in Figure 11.

The average CPU utilization of each federate is the same as in previous experiments. The execution times for all experiments are recorded in Figure 12. The execution times using normal federates increase smoothly when the federation contains an increasing number of federates (starting from 584 seconds for 3 federates to 714 seconds for 12 federates), and the same trend can be observed when using robust federates. The extra execution time (versus normal federates) consumed by FAULT_INCURRED federates is highlighted in Figure 13. Note that the percentage increment of the execution time remains invariant to the number of federates in the federation; it is about 2% for all the cases.

The experimental results show that (1) the fault-tolerant model scales well with increasing distributed simulation size, (2) in the case of no fault, robust federates perform almost the same as the normal federates and (3) in the case of a fault occurring, the model’s overhead remains negligible compared to normal federates encountering no fault.





Figure 12. Simulation execution times with increasing number of federates

Figure 13. Overhead of robust federates (fault incurred) versus normal federates (fault free)

Figure 14 shows the overall simulation execution times of normal federates and robust federates in the presence of the fault. Similar to the experiments reported in Section 5.3, we assume that the normal federates have to restart from the beginning when the fault occurs. Using the same calculation method as in Figure 10, the percentages of saved execution time are given in Figure 15. The percentage remains steady at about 32% with increasing federation scale. The results indicate that the robust federates can significantly reduce execution time when dealing with failure compared to normal federates without fault-tolerance functionalities.





Figure 14. Simulation execution times with incurred failure at the middle stage

Figure 15. Percentage of saved execution time with increasing number of federates

5.5 User Transparency and Related Issues

As introduced previously, the extra effort for building robust federates is minimal as users only need to link normal federates’ code to the RTI++ library. Our framework does not require a federate to be modeled on any particular software package or specially coded for the purpose of fault tolerance. The FOM defined for a normal federate only needs to be slightly extended for the use of the corresponding robust federate, which includes several extra ‘system’ object/interaction classes (Section 4.4). Table 3 lists the system object/interaction classes added into the FOM.

Table 3. System object/interaction classes in the extended FOM

| Object/Interaction classes        | Declaration         | Functionalities                                                                                                    |
|-----------------------------------|---------------------|--------------------------------------------------------------------------------------------------------------------|
| RTI_FAIL (Object)                 | Publish & Subscribe | Notifying other federates about the occurrence of an RTI failure                                                   |
| SYS_FED_DECLARATION (Interaction) | Publish & Subscribe | Broadcasting the information about which object/interaction classes the local federate has published/subscribed to |
| SYS_IN_TRAN_MSG (Interaction)     | Publish & Subscribe | Delivering the content of in-transit TSO or RO events to the receivers during fault recovery                       |

To examine whether robust federates can interoperate properly with normal federates, we repeat the same simulation scenario as described in Section 5.2 using both robust federates and normal federates in each session (there are $2^3 = 8$ possible combinations in total, inclusive of two constructed by pure normal/robust federates). The extended FOM applies for both types of federates. The six ‘hybrid’ sessions have the same outputs as those reported in Section 5.2. Robust federates built upon our framework can interact properly with normal federates. The extended FOM does not cause any semantic problem or any difference in simulation execution. The experiments further show that the fault-tolerant model provides reuse of federate code and user transparency.
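The count of sessions follows from choosing one of two builds for each of the three federates. As a small illustration (the helper names `all_sessions` and `hybrid_sessions` are hypothetical), the combinations can be enumerated directly:

```python
from itertools import product

FEDERATES = ("simAgent", "simFactory", "simTransportation")
BUILDS = ("normal", "robust")


def all_sessions():
    """Every assignment of a build type to each of the three federates."""
    return [dict(zip(FEDERATES, builds))
            for builds in product(BUILDS, repeat=len(FEDERATES))]


def hybrid_sessions():
    """Sessions that mix normal and robust federates."""
    return [s for s in all_sessions() if len(set(s.values())) == 2]
```

`all_sessions()` yields eight assignments; removing the all-normal and all-robust sessions leaves the six hybrid ones examined here.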

6. Conclusions and future work

We have introduced a framework for supporting runtime robustness to HLA-based distributed simulations. We have investigated the issues and design of a generic fault-tolerant federate model. Based upon the Decoupled Federate Architecture, the model is developed to prevent an RTI error from disrupting the execution of simulation federates and ensures correct recovery of a distributed simulation session. Algorithms have been presented to ensure reliable delivery of in-transit messages and to perform safe fossil collection.

The fault-tolerant model supports the reuse of legacy federates while enabling robustness and reduces the effort required by developers to model robust federates. The model is platform neutral and model independent. User transparency has been provided with failure properly masked. Robust federates do not require rollback of simulation execution in the case of failure.

A series of experiments has been performed to investigate the accuracy and performance of the fault-tolerant model using an example of a distributed supply-chain simulation. The experimental results are compared for normal and robust federates in terms of uniformity of output statistics and computing efficiency. The output statistics indicate that the model provides correct fault recovery. The results show that robust federates have a very close performance to normal federates and only incur minimal extra overhead. Our work indicates that the fault-tolerant model is a feasible and efficient solution to the support of runtime robustness in HLA-based distributed simulations, which can be used in the design of robust RTI software in the future.

For future work, it is necessary to extend the current design to support an optimistic time synchronization scheme. Other work involves the benchmarking of alternative External Communication backbones between the virtual and physical federates such as Sockets, MPI or Web/Grid services and on various platforms. Another issue is to test the implementation and performance of the framework with federates developed using commercial off-the-shelf simulation packages.

7. References

[1] Dahmann, J. S., F. Kuhl, and R. Weatherly. 1998. Standards for Simulation: As Simple As Possible But Not Simpler, The High Level Architecture for Simulation. Simulation 71(6): 378–387.

[2] IEEE 1516. 2000. IEEE Standard for High Level Architecture.

[3] Johnson, D. B. 1989. Distributed System Fault-tolerance Using Message Logging and Checkpointing. Ph.D. Thesis, Rice University, Texas.

[4] Kuhl, F., R. Weatherly, and J. Dahmann. 1999. Creating Computer Simulation Systems: An Introduction to HLA. Prentice Hall: US.

[5] DMSO. 2002. RTI 1.3-Next Generation Programmer’s Guide Version 5, DoD, DMSO, USA.

[6] Chen, D., S. J. Turner, B. P. Gan, W. Cai, and J. Wei. 2003. A Decoupled Federate Architecture for Distributed Simulation Cloning. In Proceedings of the 15th European Simulation Symposium, Delft, Netherlands, October 2003. A. Verbraeck, V. Hlupic (eds). SCS European Publishing House: Germany, pp. 131–140.

[7] Chen, D. 2006. Cloning Mechanisms for HLA-based Distributed Simulations. Ph.D. Thesis, School of Computer Engineering, Nanyang Technological University.

[8] Chen, D., S. J. Turner, and W. Cai. 2006. A Framework for Robust HLA-Based Distributed Simulation. In Proceedings of the 20th ACM/IEEE/SCS Workshop on Principles of Advanced and Distributed Simulation (PADS 2006), May 2006, Singapore, pp. 183–192.

[9] Cristian, F. 1991. Understanding Fault-Tolerant Distributed Systems. Communications of the ACM 34(2): 57–78.

[10] Zając, K., M. Bubak, M. Malawski, and P. Sloot. 2003. Towards a Grid Management System for HLA-based Interactive Simulations. In Proceedings of the 7th IEEE International Symposium on Distributed Simulation and Real Time Applications, Delft, Netherlands, pp. 4–11.

[11] Birman, K. P. 1997. Building Secure and Reliable Network Applications. Prentice Hall and Manning Publishing Company, Greenwich, CT, USA.

[12] Elnozahy, E. N. 1993. Fault-tolerance in Distributed Systems Using Rollback-Recovery and Process Replication. Ph.D. Thesis, Rice University, Texas.





[13] Berchtold, C., and M. Hezel. 2001. An Architecture for Fault-tolerant HLA-based Simulation. In Proceedings of the 15th International European Simulation Multi-Conference (ESM) 2001, June 2001, Prague, Czech Republic, pp. 616–620.

[14] Möller, B., B. Löfstrand, and M. Karlsson. 2005. Developing Fault Tolerant Federations Using HLA Evolved. In Proceedings of 2005 Spring Simulation Interoperability Workshop, San Diego, California, USA, paper no. 05S-SIW-048.

[15] Eklöf, M., F. Moradi, and R. Ayani. 2005. A Framework for Fault-Tolerance in HLA-Based Distributed Simulations. In Proceedings of the 2005 Winter Simulation Conference, Orlando, Florida, December 2005. M. E. Kuhl, N. M. Steiger, F. B. Armstrong, J. A. Joines (eds), pp. 1182–1189.

[16] Eklöf, M., R. Ayani, and F. Moradi. 2006. Evaluation of a Fault-Tolerance Mechanism for HLA-Based Distributed Simulations. In Proceedings of the 20th ACM/IEEE/SCS Workshop on Principles of Advanced and Distributed Simulation (PADS 2006), May 2006, Singapore, pp. 175–182.

[17] Chen, D., S. J. Turner, W. Cai, B. P. Gan, and M. Y. H. Low. 2005. Algorithms for HLA-based Distributed Simulation Cloning. ACM Transactions on Modeling and Computer Simulation 15(4): 316–345.

[18] Tanenbaum, A. S., and M. van Steen. 2002. Distributed Systems: Principles and Paradigms. Prentice Hall: US.

[19] Cai, W., Z. Yuan, M. Y. H. Low, and S. J. Turner. 2005. Federate Migration in HLA-based Distributed Simulation. Future Generation Computer System 21(1): 87–95.

[20] Yuan, Z., W. Cai, and Y. H. Low. 2003. A Framework for Executing Parallel Simulation using RTI. In Proceedings of 7th IEEE International Symposium on Distributed Simulation and Real Time Applications (DSRT 2003), Oct 2003, Delft, Netherlands, pp. 12–19.

[21] Rycerz, K., M. Bubak, M. Malawski, and P. Sloot. 2005. A Framework for HLA-Based Interactive Simulations on the Grid. Simulation: Transactions of the Society for Modeling and Simulation International 81(1): 67–76.

[22] Stevens, W. R. 1999. UNIX Network Programming, Inter-Process Communications, Vol. 2, 2nd Edition. Prentice Hall: US.

Dan Chen is a Professor with the Institute of Electrical Engineering, Yanshan University (China). He was a Postdoctoral Research Fellow at the School of Computer Science at the University of Birmingham and the School of Computer Engineering at Nanyang Technological University, Singapore. He received a B.Sc. in applied physics from Wuhan University (China) and an M.Eng. in computer science from Huazhong University of Science and Technology (China). He also received an M.Eng. and a Ph.D. at NTU. His research interests include computer-based modeling and simulation, distributed computing, multi-agent systems and grid computing. Recently, he has also been studying large scale crowd simulation and neuroinformatics.

Stephen John Turner joined Nanyang Technological University (NTU, Singapore) in 1999 and is Director of the Parallel and Distributed Computing Centre in the School of Computer Engineering. Previously, he was a Senior Lecturer in Computer Science at Exeter University (UK). He received his M.A. in Mathematics and Computer Science from Cambridge University and his M.Sc. and Ph.D. in Computer Science from Manchester University. His current research interests include: parallel and distributed simulation, distributed virtual environments, grid computing and multi-agent systems. He is steering committee chair of the Principles of Advanced and Distributed Simulation conference and advisory committee member of the Distributed Simulation and Real Time Applications symposium.

Wentong Cai is an Associate Professor with the School of Computer Engineering at Nanyang Technological University (NTU, Singapore), and head of the Computer Science Division. He received his B.Sc. in Computer Science from Nankai University (P. R. China) and Ph.D., also in Computer Science, from University of Exeter. He was a Postdoctoral Research Fellow at Queen’s University (Canada) before joining NTU in February 1993. Dr Cai’s research interests include parallel and distributed simulation and programming environments and grid and cluster computing. His main areas of expertise are the design and analysis of scalable architecture, framework and protocols to support parallel and distributed simulation, and the development of models and software tools for programming parallel/distributed systems.