
CONCURRENCY: PRACTICE AND EXPERIENCE
Concurrency: Pract. Exper., Vol. 11(4), 205–219 (1999)

A co-ordination framework for distributed financial applications

J. A. KEANE

Department of Computation, UMIST, P.O. Box 88, Manchester, M60 1QD, UK

SUMMARY
This paper presents the design and development of a co-ordination framework that enables a distributed, heterogeneous environment to be used as a single integrated system. The design has been motivated by the requirements of financial applications: in particular by a computationally intensive actuarial system that for large schemes requires computing and data resources.

The framework is structured into two levels: an interprocess communication layer for application programmers, and an underlying communication management layer. The envisaged distributed environment includes MPPs, SMPs and workstations, connected via LANs and WANs. Copyright 1999 John Wiley & Sons, Ltd.

1. INTRODUCTION

As the need for information management and decision support becomes more crucial to ensure organisation profitability, applications become correspondingly more complex. In turn, there comes a commensurate need for an increase in computing resource. One approach to this is parallel systems and their increasing use in the commercial sector[1–3].

A complementary solution, which incorporates parallel systems, is to co-ordinate the existing resources of an organisation. For a number of years it has been realised that most computing departments have much more resource than is actually utilised most of the time. This is particularly true as organisations that are naturally distributed install systems at individual sites which are often used only intermittently. Further, with the proliferation of multinational companies, some sites are operational at precisely the times that other sites are not.

Distributed systems[4,5] are made up of multiple computers (nodes) with interconnections between them. The distributed environment within a multinational organisation is likely to be heterogeneous, and consist of a wide area network (WAN) linking together local area networks (LANs), which themselves consist of workstations and parallel systems – both shared-memory multiprocessors (SMPs) and massively parallel processors (MPPs).

Co-ordination can enable these nodes to be used as a single integrated whole to solve large problems. A number of approaches to providing general-purpose co-ordination frameworks have been developed: for example, PVM[6], MPI[7], Linda[8], BOB[9] and PCN[10].

The approach here is to look at the requirements of financial applications of sufficient economic significance to justify an application-specific framework. The key advantage of implementing such a framework, over a general-purpose one, is that a tailored system can be optimised for the requirements of the application.

∗Correspondence to: J. A. Keane, Department of Computation, UMIST, P.O. Box 88, Manchester, M60 1QD, UK.

CCC 1040–3108/99/040205–15$17.50          Received 13 January 1997
Copyright 1999 John Wiley & Sons, Ltd.    Accepted 3 October 1997


Specifically, in this paper, application characteristics are considered and used as requirements to determine what the framework must support to allow co-ordination of a distributed environment for these applications at an abstract level. The focus is then on how to support this abstract layer, i.e. with the design and implementation of a framework that supports interprocess communication (IPC) for applications to run in a distributed environment. The framework allows applications split into separate parent–child processes to communicate and co-ordinate independently of where processes reside in the system. The IPC protocol sits on top of a communication management infrastructure that accommodates processing nodes from different manufacturers, different versions of UNIX and different communication packages.

The structure of the paper is as follows: in Section 2 the type of applications under consideration is described; in Section 3 the requirements for the co-ordination framework for these applications are discussed; Sections 4 and 5 deal with the design of the interprocess communication layer and the communication management layer, respectively; Section 6 describes implementation considerations; in Section 7 the framework is considered in the context of existing general-purpose systems, in particular PVM and MPI; finally, conclusions are presented in Section 8.

2. APPLICATION CONSIDERATIONS

In this Section the need for a co-ordination framework is motivated by discussing the intended application area of financial systems. Most approaches to developing co-ordination frameworks have been motivated by computational science and the associated requirement for number-crunching. The work here is firmly in the commercial sector, and thus the applications of concern have larger data requirements than are typically encountered in computational science.

Two specific applications have motivated the co-ordination framework design:

• an actuarial system, which has both computation and data requirements; and
• a pension management system, which is primarily data-intensive.

As the actuarial system is more generic, it is focused upon here.

The purpose of an actuarial system is to evaluate the liabilities of a pension scheme by calculating, for each member, their expected cost to the scheme in the future. For large schemes with many members this requires considerable processing resources and has large data requirements[11,12].

This is an application where the amount of data required can vary enormously according to the processing power available; the more data that are used, the more accurate will be the result of the calculation. The data needed for the calculation are of two types:

1. details associated with a specific individual – these data will at least include age, salary and family details. Given enough resources, potentially this can include every financial transaction, retail buying behaviour, leisure activities, complete medical history, ancestor details, etc.

2. general aggregation (geo-demographic and statistical) data contained in actuarial tables – these data can include postcode details, census data, life-stage characteristics, life-style questionnaires, etc.

Therefore, there is practically no limit to the amount of data that can be used within the calculation. The limiting factor is rather the time spent to perform the calculation: for pension schemes involving hundreds of thousands of members, the run time on a Sun SPARC 330 is of the order of days. Thus the commercial value of achieving performance speedup, to make such large schemes more tractable, is considerable. In addition, the ability to incorporate larger and more detailed datasets will produce more accurate, and thus more profitable, results.

An analysis of the structure of the system indicates that the main processing activity involves the application of a function to every member of a set, i.e.

∀m ∈ pension-scheme:
    liabilities[m] := process-member(individual-member-data, actuarial-tables);

Given access to the actuarial tables and the member's individual data, each member can be processed in parallel on a separate processor, with no communication necessary between the processors. The concurrency is coarse-grained and an example of agenda parallelism[8]; the agenda being the processing of all of the members, i.e. a set of tasks. The model, therefore, embraces a task queue (the members to be processed) and local computation.

The most flexible organisation for agenda parallelism is the master–worker (parent–child) paradigm[8,13]. In a master–worker program, the master process is responsible for initialising the computation and for creating a collection of identical worker processes. Each worker process is capable of carrying out any of the tasks in the agenda. A good load balance across tasks is inherent in the very nature of this type of problem decomposition because tasks are distributed on the fly.

With pension schemes of tens of thousands of members the number of tasks is substantially larger than the number of processors likely to be available. This is important as it is considered a good rule of thumb that the number of tasks in a master–worker scheme should be an order of magnitude larger than the number of processors to ensure high levels of efficiency[10].

The concurrent system, therefore, can be modelled as the composition of a single parent (master) process with a set of child (worker) processes, all running in parallel. In the message-passing model of the actuarial system, the master process acts as the co-ordinator, distributing work and collecting results. The worker processes carry out the member processing by receiving the relevant member data from the master and sending the calculated values back. No communication is required between the workers. To achieve a reasonable load balance and, in particular, to ensure that the master does not become a bottleneck, the number of tasks issued to each worker can be varied.
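To make the master–worker structure concrete, the sketch below gives one possible C rendering of this message-passing organisation. The ipc_send/ipc_receive calls, the message layout and the signature of process_member are illustrative assumptions standing in for the SEND/RECEIVE operations introduced in Section 4, not the framework's actual interface.

/* Minimal master–worker sketch of the actuarial calculation (assumed interface). */
#include <stddef.h>

typedef int proc_id;                          /* framework process identifier      */

extern void    ipc_send(proc_id to, const void *msg, size_t len);
extern proc_id ipc_receive(void *msg, size_t len);   /* returns the sender         */
extern double  process_member(int member);   /* per-member liability calculation  */

#define N_MEMBERS 100000                      /* scheme size (illustrative)        */
#define DONE      (-1)                        /* sentinel: no more work            */

struct result { int member; double value; };

/* Master: distributes member indices and collects liabilities. */
void master(void)
{
    static double liabilities[N_MEMBERS];
    int next = 0, received = 0;

    while (received < N_MEMBERS) {
        struct result r;
        proc_id w = ipc_receive(&r, sizeof r);        /* READY or a result          */
        if (r.member >= 0) {                          /* a computed liability       */
            liabilities[r.member] = r.value;
            received++;
        }
        int task = (next < N_MEMBERS) ? next++ : DONE;
        ipc_send(w, &task, sizeof task);              /* next member, or DONE       */
    }
}

/* Worker: repeatedly asks for a member, processes it, and returns the value. */
void worker(proc_id master_id)
{
    struct result r = { -1, 0.0 };                    /* first send acts as READY   */
    for (;;) {
        int task;
        ipc_send(master_id, &r, sizeof r);
        (void)ipc_receive(&task, sizeof task);
        if (task == DONE)
            break;
        r.member = task;
        r.value  = process_member(task);
    }
}

Batching several members per exchange, as the text suggests, only changes the payload of each message, not the structure of the loops.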

Work on parallelising the actuarial system has been reported[11]; this showed that, whilst parallelism was very beneficial, there was in addition potential to make use of other systems in the network for very large jobs.

A second, related application is that of a pension management system that involves a workstation-based system at various sites that occasionally, and at unpredictable times, requires data resources that may be held at many sites. In comparison to the actuarial system it is characterised by relatively small computation requirements but significant data access requirements.

3. REQUIREMENTS FOR THE CO-ORDINATION FRAMEWORK

Both applications are highly business-critical; hence their performance is economically significant enough to motivate the design of a designated co-ordination framework. In addition, given their critical nature, it is important that the framework, as well as improving performance, also provides some level of fault-tolerance. It is clear that the possibility of node or communication failure is higher in a distributed system; hence the requirement is that the framework is resilient to such failure. Special attention has been paid to the testing of the framework in this regard, as the nature of these failures is highly unpredictable. These issues are addressed in more detail in Section 5.3 (Message Acknowledgment; Retries) and Section 6 (Communication Modes; Testing). The issue of being able to re-use computation carried out before failure is not addressed here.

When the applications are considered from the viewpoint of making them multiprocess, they both correspond well to a parent/child configuration where one process acts as the co-ordinator (parent) for the application.

The envisaged normal scenario for execution is that the nodes on a LAN would be sufficient; hence a set of processes that enable distributed execution on the LAN can be established and left running idle for long periods until there is need for them. This results in the application not having to set up the co-ordination framework each time, thus minimising overhead. As the framework is hierarchic – nodes–LANs–WANs – if each LAN is configured as above then jobs that require the use of nodes across a WAN for execution should not cause significant overhead.

For the purposes of these applications, all nodes within the framework provide a UNIX operating system. In addition, both targeted applications are written in C; hence the framework only needs to operate within one programming language.

To make use of a distributed system for these types of applications a way is needed to partition an application into a set of concurrent processes. Both applications have a few well-known functions, such as process-member; hence each function can be compiled on each node at start-up time. As a result there are few concerns with regard to compatible binaries.

The performance advantage of the approach occurs because (some of) these processes can execute at the same time and thus, with more than one processor, the time to run the application should be decreased.

To allow separate processes, a mechanism is needed to create processes. Further, independent processes must have a way of communicating and co-ordinating their activity. Consideration of where the created process may execute is outside the scope of this level: it is necessary to separate concerns, as far as possible, into what the interprocess communication allows, and how a distributed environment can be co-ordinated to enable this.

The system model of the co-ordination environment has thus been split into two:

1. Inter Process Communication (IPC), that provides a protocol to the application level
2. Communication Management Layer (CML), that supports the IPC protocol.

A System Management activity is also necessary to loosely co-ordinate activity on a node and between nodes.

4. INTER-PROCESS COMMUNICATION (IPC)

The IPC layer is the one seen by the application programmers. Given the applications of concern, the programmers are primarily domain, rather than systems-level, experts. Hence the protocol should be high-level and abstract.


As outlined above, two facilities are required:

1. a means to create processes
2. a means to co-ordinate and communicate between processes.

4.1. Process creation

The most abstract way of providing this facility is a command CREATE called by an existing process; this command creates a new process and passes it the parent's (creating process') identifier, a function to execute, and any arguments for the function. This provides location transparency.

Given the applications, it is considered necessary to provide slightly more control over location than this. In particular, as the application programmer often has the most knowledge as to the amount of work to be carried out by the process to be created, the creating process can pass 'hints' (via a small message control structure) to the process creation mechanism. In particular, the parent can indicate if it wishes the process to be created on a particular node (including the local node) or on the same LAN rather than on another LAN. This may be useful if the creating process knows the created process requires access to files located on a particular LAN†. The hints can also include node characteristics for where the process should be created.

†The provision of a global name space is not considered in this model except for process identifiers. For the type of application of concern, the overhead of a global namespace is excessive.

When hints are not passed, the underlying system will determine where to create the process within the distributed system (WAN). Below we discuss how communication between the creating process and the created process is effected.
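As an illustration only, the CREATE operation and its hint structure might be declared in C along the following lines; the type and field names are assumptions, since the paper specifies the concept but not a concrete interface. Note that CREATE returns no value: the parent learns the child's identifier from the child's ID message.

/* Illustrative declaration of CREATE with placement hints (names assumed). */
#include <stddef.h>

typedef enum {
    PLACE_ANY,            /* let the system decide                             */
    PLACE_LOCAL_NODE,     /* create on the creating process' own node          */
    PLACE_NAMED_NODE,     /* create on a specific node                         */
    PLACE_SAME_LAN        /* stay on this LAN rather than another LAN          */
} placement;

typedef struct {
    placement   where;
    const char *node;          /* target node when PLACE_NAMED_NODE            */
    const char *node_class;    /* desired node characteristics, e.g. "SMP"     */
    long        expected_work; /* rough cost estimate supplied by the parent   */
} create_hints;

/* Creates a child executing `function(args)`; the parent's identifier is
   passed to the child, which later reports its own identity with ID.          */
extern void CREATE(const char *function,
                   const void *args, size_t arg_len,
                   const create_hints *hints);      /* NULL: no hints given     */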

4.2. Process communication

To achieve inter-process communication the application programmer should need to provide only the minimum information necessary. For a process to send a message, all that is required is a destination process identifier and the message content. Issues such as mapping a process identifier to a location, determining the communication mode, message sizes, etc., are not the concern of the application programmer or of the IPC layer.

As discussed above, the applications of concern correspond well to a parent/child process relationship. As a result, the design of the process communication protocol has been influenced by the communication patterns required between a parent process and its children.

Analysis suggests that only SEND and RECEIVE operations are necessary. These are used to communicate data of any size between processes. In addition, a number of interprocess commands are introduced; these are sent as data and interpreted by the receiver as commands. The SEND is non-blocking at the application level.

As part of a SEND the following commands can be passed between parties:

1. Child → Parent communication:

• ID, send the identity of the child to the parent. This is the first action of the child process after creation, as the CREATE protocol does not return a value to the parent. CREATE passes the identifier of the parent process to the created process.



• READY, to inform the parent that the child is ready to begin work. This is passed following the creation and identification of the new process, and is also used to inform the parent that the child has finished its previous task and thus may return a value to the parent.

• STATUS, provides details of the child's work to the parent.

• FAILURE, the child reports a fatal error to the parent.

• FINISHED, the child informs its parent that it has finished. This is the last action of the child process.

2. Parent → Child communication:

• REPORT, the parent asks for the status of the child. The child responds with READY or STATUS depending on circumstance.

• TERMINATE, the parent informs the child to terminate in an orderly fashion when its current task has finished. Notification that the child has finished will be done with FINISHED.

• ABORT, the parent tells the child to abort the task immediately and terminate. Notification of finish is once again via FINISHED.

3. System Manager → Parent:

• RELEASE, a request to the parent to terminate a specific child. This could be as a result of various system management issues, such as node failure or load balance requirements.
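As an illustration, these interprocess commands could be carried as a small enumeration in the data part of a SEND and decoded by the receiver; the constant names below mirror the protocol, but the encoding itself is an assumption, not the framework's actual representation.

/* Command codes carried as data in a SEND and interpreted by the receiver.
   Only the command set is taken from the text; the encoding is assumed.       */
typedef enum {
    /* Child -> Parent */
    CMD_ID,           /* child reports its identity (its first action)           */
    CMD_READY,        /* child is ready for work / has finished its previous task */
    CMD_STATUS,       /* details of the child's current work                     */
    CMD_FAILURE,      /* child reports a fatal error                             */
    CMD_FINISHED,     /* child's last action before terminating                  */
    /* Parent -> Child */
    CMD_REPORT,       /* ask the child for its status                            */
    CMD_TERMINATE,    /* terminate in an orderly fashion after the current task  */
    CMD_ABORT,        /* abort the current task immediately and terminate        */
    /* System Manager -> Parent */
    CMD_RELEASE       /* request that the parent terminate a specific child      */
} ipc_command;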

5. COMMUNICATION MANAGEMENT LAYER (CML)

The CML is responsible for the exchange of data between two nodes. At this level there is concern with the physical distribution of nodes, the different types of communication supported, and the potential for hardware failure, both permanent and transient. Each node in the system is able to both send and receive messages.

On each node there is a configuration file detailing the other nodes on the LAN. This provides information such as the type of communication supported by a node, its type of architecture, etc. A designated node on each LAN is the link to the WAN. Communication between nodes on different LANs occurs through the appropriate designated nodes‡.

‡A node can be designated as the backup entry-point to a LAN in case of failure.

Given the interface provided by the IPC layer, the CML must provide process creation, and process co-ordination and communication, in a distributed fashion.

Each node in the distributed system has the following:

1. a process creation component, termed a SCHEDULER;
2. a process communication component, made up of:

(a) a LISTENER process, which handles communication from other nodes;
(b) a ROUTER process, which handles communication to other nodes;
(c) a set of message queues, one for the ROUTER and one for each process which identifies itself as a message receiver.
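For orientation, the per-node arrangement just listed can be pictured roughly as the following C declarations; the struct layout, field names and the bound on local receivers are assumptions used only to summarise the components, not the implemented data structures.

/* Illustrative per-node view of the CML components (names and layout assumed). */
#include <sys/types.h>

#define MAX_RECEIVERS 64              /* illustrative bound on local receivers    */

typedef struct {
    char name[32];                    /* node identifier                          */
    char comm_modes[64];              /* e.g. "tli,sockets"                       */
    char architecture[32];            /* e.g. "workstation", "SMP", "MPP"         */
    int  is_wan_gateway;              /* designated link (or backup) to the WAN   */
} lan_config_entry;                   /* one entry per node in the LAN config file */

typedef struct msg_queue msg_queue;   /* bounded queue of received messages       */

typedef struct {
    pid_t      scheduler;             /* SCHEDULER: implements CREATE             */
    pid_t      listener;              /* LISTENER: communication from other nodes */
    pid_t      router;                /* ROUTER: communication to other nodes     */
    msg_queue *router_queue;          /* queue of messages for the ROUTER         */
    msg_queue *receiver_queues[MAX_RECEIVERS]; /* one per registered receiver     */
} node_components;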



5.1. Process creation

A SCHEDULER process exists on each node in the system. Its role is to implement the functionality of the IPC operation CREATE. As no value is returned by the SCHEDULER, the parent receives the child process identifier directly from the child (see Section 4). The CML at the child process' node will add the node-id and LAN-id of the child to the message control structure; at the parent process' node this information will be stored so that subsequent sends from the parent to the child have location information.

If the process is to be created on the local node then the SCHEDULER can make use of the local operating system's process creation routines; for example, on UNIX systems we could use fork() and then exec() the required program.
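For the local case, the SCHEDULER's use of the UNIX routines might look like the sketch below; the binary path convention and the way the parent's identifier is passed are assumptions.

#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Local process creation by the SCHEDULER: fork(), then exec() the binary for
   the requested function, handing the parent's identifier to the child on its
   command line. The path convention "/opt/fw/bin/<function>" is an assumption. */
static int create_local(const char *function, const char *parent_id)
{
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return -1;                        /* creation failed                     */
    }
    if (pid == 0) {                       /* child: become the worker program    */
        char path[256];
        snprintf(path, sizeof path, "/opt/fw/bin/%s", function);
        execl(path, function, parent_id, (char *)NULL);
        perror("execl");                  /* reached only if exec fails          */
        _exit(127);
    }
    return 0;                             /* parent: the SCHEDULER carries on    */
}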

If the decision is made to create a process off-node, then the SCHEDULER is responsible for determining where in the distributed system to create the new process. In terms of implementation, each SCHEDULER process knows all nodes on its local LAN, and knows the route to other LANs across the WAN. To achieve a system-wide load balance, each local SCHEDULER knows the load on its own node, and can communicate with other SCHEDULERs on the LAN to determine overall LAN load information. Each LAN can communicate to determine overall WAN load information. This can be provided by each node propagating its load information at regular intervals. Obviously communication across a WAN is more expensive than in a LAN, and communication across a LAN more expensive than performing processes locally. Hence it is important to only export tasks to other nodes on the LAN or WAN if there is sufficient work to overcome the communication cost; hence the use of hints to the SCHEDULER from the IPC CREATE protocol (see Section 4.1) is very important.

In general, the achievement of a 'good' load balancing scheme is very difficult. Often a round-robin allocation can be successful. For the pension management system the requirements are very unpredictable; with the actuarial application the parent–child structure provides a reasonable load balance (as described in Section 2). Even so, during execution, the loads of each node will change dynamically and so there is no ideal load balance. Information such as the typical workload of nodes at certain times, and the run times of particular functions, can be used to influence the choice of where to create a process.

During the period of time the control processes will live – of the order of months – it is likely that nodes within the system will go down, or new nodes will be added. Hence the SCHEDULER allows for these events; in the case of node failure such information can be detected by regular polling and broadcast to interested parties. This issue is considered further in the context of message acknowledgment and retries (Section 5.3).

5.2. Process communication

The CML process communication component must support the IPC operations SEND and RECEIVE. As described above, the concerns of process communication at this level can be separated into three components: handling communication from other nodes, handling communication to other nodes, and providing message queues.

The LISTENER process

The action of the LISTENER process is to check at regular intervals, for each communication mode supported on its node, whether a connection has been initiated from another node. If one has, the LISTENER creates a child process to handle the connection appropriately and to pass on any messages received to the appropriate input queues.

A further action of the LISTENER process is to check for System Management messages, such as an ECHO request where the System Manager is checking if the LISTENER is still alive, and to answer such messages.

The ROUTER process

Based on a pre-set interval, the ROUTER checks its message queue to see if a message has been received. The ROUTER can receive two types of messages:

• application messages to be sent off-node; this is considered in Section 5.3
• system management messages; for example, a status check on the ROUTER.

The ROUTER is further responsible for checking at regular intervals that the input queues are not full; if a queue is full the ROUTER may remove messages (a pre-set percentage of the queue). If a message is an ACKNOWLEDGEMENT (see Section 5.3), it is not removed, as this would only cause more message sends and receives.

The messages that are removed are dropped. The sending process is informed that the message has not been read by the receiving process. This is to handle the case of a process dying but still receiving messages, thus filling the queue. It is expected that such an occurrence will be rare.

It is possible to envisage an overflow queue being used in this case; however, at some point, as queues must be finite, the possibility of a full queue must be considered.

5.3. Design considerations

In this Section various options considered in the design are discussed.

5.3.1. Sending a message

At the application level messages are sent to process-ids. The CML level must determine whether the target of a message is on-node or off-node. If the destination process is on-node then the message is not sent via the ROUTER but is placed directly onto the input queue for the destination process.

When sending off-node, if there is a process-id destination the ROUTER should be able to send the message directly to its target via the route determined by the CML when all child processes contact their parent; see Section 5.1.

If there is no process identifier, then it is a CREATE message to be sent to the SCHEDULER on another LAN; the SCHEDULER on this node will determine which LAN, based on the application hints and the current load balance.

To send off-node the ROUTER must add the node-id and the LAN-id. For off-node messages, the data are encoded via well-known encode functions to ensure that different systems can interpret the data correctly.
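The on-node/off-node decision could be expressed as in the sketch below; the lookup and queue helpers are assumed interfaces standing in for the CML's internal routines.

#include <stddef.h>

/* Dispatch of a SEND once it reaches the CML (sketch; helpers are assumed).    */
typedef struct { int node; int lan; int proc; } proc_addr;

extern int  local_node_id(void);
extern void lookup_addr(int proc_id, proc_addr *out);  /* filled in when the child sent ID */
extern void queue_put(int proc_id, const void *data, size_t len);   /* local input queue */
extern void router_enqueue(const proc_addr *to, const void *data, size_t len);

void cml_send(int dest_proc, const void *data, size_t len)
{
    proc_addr addr;
    lookup_addr(dest_proc, &addr);

    if (addr.node == local_node_id()) {
        queue_put(dest_proc, data, len);    /* on-node: bypass the ROUTER         */
    } else {
        /* off-node: the ROUTER adds node-id and LAN-id and encodes the data
           with the well-known encode functions before transmission              */
        router_enqueue(&addr, data, len);
    }
}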

5.3.2. Message size

There is a physical limit to the amount of data that can be sent in one connection between nodes. The amount of data that can be sent in one go is termed a frame.


This physical limit should not affect the size of message that can be sent by a process. Based on the size of a frame, two types of messages are distinguished:

• SHORT, messages that fit into one frame, for example, the command messages in the IPC protocol.

• LONG, messages that are greater than one frame in size, for example, the data messages in the IPC protocol. LONG messages are further split based on the following analysis: there is a finite queue into which the message will be placed following receipt. If the receipt of a message will definitely cause the receiving queue to fill, then there is no point in sending the message. As a result LONG messages are split into:

(i) SPLIT messages, that are too long for a single frame but will not fill the queue on their own, and thus can be sent immediately as a sequence of frames.

(ii) WARN messages, that are too long for a single frame and exceed the size of the receiving queue. In this case, the receiver must allocate enough space into which the message can be copied directly (without being placed on the queue). To achieve this, an initial (SHORT) message is sent indicating that the receiver must allocate space within the process address area for the message to be copied into directly. Only then will the WARN message be sent, thus ensuring that the message will not fill the queue for the receiver. This will not cause a race condition as there is no check made of the receipt queue: as the size of message queues is constant, the sending node can determine if the message is larger than the receiver queue size. There remains the possibility that an accumulation of messages will cause a queue to fill; this is why the ROUTER process will occasionally remove messages that have remained on the queue for a long time.
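A sketch of the send-side classification is given below; the frame size and queue capacity are fixed, system-wide constants in the design, but the particular values and names here are assumptions.

#include <stddef.h>

#define FRAME_SIZE   4096                     /* bytes per frame (illustrative)    */
#define QUEUE_FRAMES 64                       /* receiving queue capacity, frames  */

typedef enum { MSG_SHORT, MSG_SPLIT, MSG_WARN } msg_class;

/* Classify an outgoing message by how many frames it needs relative to the
   (known, constant) size of the receiving queue.                                */
msg_class classify(size_t msg_len)
{
    size_t frames = (msg_len + FRAME_SIZE - 1) / FRAME_SIZE;

    if (frames <= 1)
        return MSG_SHORT;                     /* fits in a single frame            */
    if (frames < QUEUE_FRAMES)
        return MSG_SPLIT;                     /* several frames; queue can hold it */
    return MSG_WARN;                          /* would fill the queue: warn first  */
}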

5.3.3. Reading a message

When we read a message it can be one of the three types described above. In all cases we receive an initial frame which tells the receiver what message to expect.

The handling of the different message types is as follows:

1. For a SHORT message, there is only one frame; hence there is no more to receive.

2. For a SPLIT message, the first frame indicates the number of frames to be expected as part of this message. Note, this first frame constitutes the actual data at the start of a SPLIT message.

3. For a WARN message, the initial frame contains the total length of the message and the destination process is expected to allocate enough room for this length of data. Note, this first frame does not contain the actual message data.

5.3.4. Message acknowledgment

For the applications of concern here, it is important that message arrival is guaranteed. Hence, message receipt must be acknowledged. By this is meant receipt by the node on which the destination process resides; we do not mean the stronger condition that the destination process has read the message, though it is possible to extend the model to accommodate this requirement. In the present protocol, send is non-blocking at the application level.

A message is viewed as not having been sent until the receiving node acknowledges its receipt. Depending on application type, it may not be necessary to wait for an acknowledgment, and hence this property could be a parameter to the send. With this extension, send would be non-blocking at both the application and CML levels.

Assuming that acknowledgment is required, it could be done on a per-message basis, at the end of receipt, as a single discrete unit. However, for applications which send large messages, it is possible that only one frame may not have been received correctly. In this case, using a single acknowledgment per message may mean that the entire message must be resent. As a result, for certain messages we require an acknowledgment for each part of the message.

However, having an acknowledgment for every frame is expensive in terms of both communication and synchronisation. As a compromise between these conflicting performance considerations – we want acknowledgment at a finer grain than a single message because re-sending a long message for one lost frame is expensive, while at the same time an acknowledgment per frame is expensive – we introduce the concept of a window. A message is viewed as being a sequence of windows, and a window is a sequence of frames; a SHORT message consists of one frame within one window. A window, therefore, acts as the unit of acknowledgment, and thus of synchronisation, during a message send. If a window is not acknowledged then we re-send each frame within it.

Using a window as this unit also allows each frame within a window to be sent off asynchronously and potentially in parallel. If these frames are sent to their destination via a number of routes, as is likely on a WAN, we have the possibility of frame n+1 arriving before frame n, i.e. analogous to out-of-order message arrival. The approach we take is to allow frames within a window to arrive out of order and to be ordered at the destination. A window is acknowledged when all frames have arrived – note, because of the possibility of retrying frames, we must ensure that each frame is numbered, i.e. we need both the right number of frames and each frame in the sequence.

We could extend this approach to out-of-order window arrival as well. In this approach a message is acknowledged when the right number, and set, of windows have been received. In the initial implementation we expect all windows to arrive in sequence order. This is partly because of the complexity of keeping track of acknowledgments and retries (see below).
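Receiver-side bookkeeping for acknowledging a window might look like the following sketch; the window size, the state kept per window and send_ack() are assumptions, but the behaviour (frames accepted out of order, acknowledgment only when every numbered frame is present, acknowledgments named per try) follows the description above and in the next Section.

#include <stdbool.h>

#define FRAMES_PER_WINDOW 8                   /* illustrative window size           */

typedef struct {
    int  window_no;                           /* sequence number of this window     */
    int  try_no;                              /* which (re)try these frames belong to */
    bool present[FRAMES_PER_WINDOW];          /* frames received so far             */
    int  count;
} window_state;

extern void send_ack(int window_no, int try_no);   /* acknowledgment names the try  */

/* Record an arriving frame; frames may arrive out of order within the window.  */
void frame_arrived(window_state *w, int frame_no)
{
    if (frame_no < 0 || frame_no >= FRAMES_PER_WINDOW || w->present[frame_no])
        return;                               /* out of range, or a duplicate       */
    w->present[frame_no] = true;
    if (++w->count == FRAMES_PER_WINDOW)
        send_ack(w->window_no, w->try_no);    /* every frame present: acknowledge   */
}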

5.3.5. Retries

Because of the importance given to message guarantee at the application level, message sends can be retried when failure occurs. This is extended to allow window and frame sends to be retried. Once again, for different applications, different numbers of retries and different granularities of the retry unit can be allowed.

To enable retries for a unit, if an acknowledgment is not received within a certain interval the unit is resent. It is difficult to establish what this interval should be: if it is too long then time is wasted; if it is too short a resend is unnecessary. If there is more than one route to the destination then, if failure occurs on the route first chosen, the alternative can be used. This allows message delivery to occur where an intermediate node has failed and an alternative route exists.

More fundamentally, if a message is resent and then the acknowledgment of the earlier message is received, the acknowledgment for the resent message must also be received. Thus it must be possible to distinguish between acknowledgments sent for different tries of a window.
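Sender-side retries could then be organised roughly as follows; the time-out, the retry limit and the route-selection helpers are assumptions (and, as noted above, would in practice be per-application parameters), and the busy-wait is kept only to keep the sketch short.

#include <stdbool.h>
#include <time.h>

#define ACK_TIMEOUT_SECS 5                    /* illustrative; hard to choose well  */
#define MAX_TRIES        3

extern void send_window(int window_no, int try_no, int route);
extern bool ack_received(int window_no, int try_no);   /* ack for this specific try */
extern int  alternative_route(int route);              /* next route, if one exists */

/* Send one window, retrying (possibly on an alternative route) until it is
   acknowledged or the retry limit is reached.                                    */
bool send_with_retries(int window_no)
{
    int route = 0;
    for (int try_no = 0; try_no < MAX_TRIES; try_no++) {
        send_window(window_no, try_no, route);
        time_t deadline = time(NULL) + ACK_TIMEOUT_SECS;
        while (time(NULL) < deadline)
            if (ack_received(window_no, try_no))
                return true;                  /* acknowledged within the interval   */
        route = alternative_route(route);     /* try another path if one is known   */
    }
    return false;                             /* report failure to the layer above  */
}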

5.3.6. Authentication

As the applications of concern involve personal data, we need to ensure that access to messages is checked. To ensure that processes do not read messages belonging to others, a unique key is returned on initial identification of a process as a receiver, and must be authenticated on each message read.

6. IMPLEMENTATION CONSIDERATIONS

In this Section issues considered in the implementation of the framework, and its present status, are discussed. Of particular concern is that the framework should be flexible (in that it supports many communication modes), reliable (in that it has been thoroughly tested) and efficient (in that wherever possible work is done locally and not across a network).

6.1. Communication modes

Nodes within a distributed system are usually heterogeneous and likely to support different modes of communication. Between two nodes the most efficient form of communication is used, so between two UNIX systems if TLI (Transport Layer Interface) is supported it is used. However, the lowest common denominator must be used, so, for example, if one of the nodes effecting communication only supports sockets whilst the other supports both TLI and sockets, then socket communication must be used. Note that transitively messages can be sent between nodes that do not support a common communication mode provided that there is an intermediate route involving other nodes: for example, node A that supports mode X can send to node C that supports mode Y, via node B that supports both X and Y modes.

If a node supports more than one communication mode, for example both TLI and socket communication, then when the LISTENER process on a node starts it initialises both services.

For both TLI and socket communication the basic mechanism is quite similar, and consequently only the socket communication is discussed. When a read socket is initialised, a socket is created in the Ethernet domain, it is bound to a well-known address, and then a listen() is set on the socket. When the send socket is initialised, a socket is created, given a name, and then a connect() is attempted on this socket. To send, data are pumped down the socket connection once it has been set up.

To check for a connection on a socket, a select is done on the socket to do a blocking accept with time-out. If there is a connection pending it is accepted; otherwise the action is to poll for connections on the other communication modes supported.
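A minimal sketch of the socket side as just described (create, bind to a well-known address, listen, then a select-based accept with time-out) follows; the port number, back-log and poll interval are assumptions, and error handling is abbreviated.

#include <sys/types.h>
#include <sys/socket.h>
#include <sys/select.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>

#define LISTEN_PORT 7000                      /* well-known address (illustrative)  */

/* Read socket: socket(), bind() to the well-known address, then listen().       */
int init_read_socket(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { 0 };
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(LISTEN_PORT);
    if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0
               || listen(fd, 8) < 0)
        return -1;
    return fd;
}

/* select() with a time-out so that the accept does not block indefinitely;
   returns the accepted connection, or -1 if none is pending (in which case the
   LISTENER goes on to poll the other communication modes it supports).          */
int poll_for_connection(int listen_fd)
{
    fd_set rd;
    struct timeval tv = { 1, 0 };             /* one-second poll interval           */

    FD_ZERO(&rd);
    FD_SET(listen_fd, &rd);
    if (select(listen_fd + 1, &rd, NULL, NULL, &tv) > 0)
        return accept(listen_fd, NULL, NULL);
    return -1;
}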

When using both socket and TLI communication there is a time-out on a connection remaining open. Rather than discover this by sending a message via the network and finding that the time-out has occurred (with the consequent waste of time and effort), the sending node records the time of the last communication and determines locally, before sending, whether a new connection needs to be established, i.e. whether the connection at the other end will have timed out.

6.2. Testing

Because of the complexity of the framework software, combined with the importance of the applications to be supported, very thorough testing is essential. However, it is difficult to test system-level software of the type described. The difficulty arises because the types of fault anticipated are outside the scope of the software under development. Such faults are caused by hardware failure or network problems that are transient and unpredictable. It is extremely difficult to ensure that the system can tolerate such faults because they may not occur during the testing phase.

Further, as described, with the introduction of message acknowledgment and retries, the design has attempted to ensure tolerance to these faults; hence if the system works as designed, the occurrence of such problems may never be detected!

To allow the system to be tested in circumstances corresponding to the types of failure that may occur in a live system, a ROGUE process has been implemented. The purpose of the ROGUE is to steal frames, windows and entire messages from the message queues in order to simulate the loss of frames caused by network problems or hardware failure. To further simulate 'real' failure, the disruption caused by the ROGUE is triggered randomly. We allow this process to be parametrised by the amount of disruption, for example, the amount of the queue to be 'stolen'.
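In the same spirit, the disruption injected by the ROGUE could be sketched as below; the queue interface, the random trigger and the percentage parameter are illustrative assumptions.

#include <stdlib.h>

/* Fault injection in the spirit of the ROGUE process: when triggered (randomly),
   silently remove a percentage of the entries from a message queue to simulate
   frames, windows or whole messages lost to network or hardware failure.
   The queue helpers are assumed interfaces; rand() is assumed seeded at start-up. */
extern int  queue_length(int queue_id);
extern void queue_steal(int queue_id, int index);     /* drop one entry, silently   */

void rogue_disrupt(int queue_id, int percent_to_steal)
{
    if (rand() % 100 >= percent_to_steal)              /* random trigger             */
        return;

    int steal = (queue_length(queue_id) * percent_to_steal) / 100;
    for (int i = 0; i < steal; i++) {
        int remaining = queue_length(queue_id);
        if (remaining == 0)
            break;
        queue_steal(queue_id, rand() % remaining);     /* pick a victim at random    */
    }
}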

6.3. Present status

The framework as described has been implemented across a LAN made up of Sun workstation nodes, Sequent SMP parallel nodes, and a Meiko Computing Surface MPP node. Each of the systems runs a different variant of UNIX; it has proven relatively easy to port the framework on to each different type of node.

The messages sent to the Computing Surface make use of the SPARC host-board which runs UNIX; message sends within the Computing Surface make use of Meiko's CS Tools interprocess communication. Wherever possible the framework makes use of underlying system facilities.

At present the size of a window is fixed and corresponds to the maximum size of a SPLIT message; however, there is no restriction on this. The window size could be determined by the application and be a parameter to the send operation.

A pragmatic consideration is the number of nodes that can realistically be incorporated into the framework outlined here. From a technical perspective the framework is hierarchical, dealing with WANs, LANs and nodes within LANs, and hence should be extensible. However, there is a practical issue of performing separate compilation on each node. In normal processing it is likely that nodes on a LAN would be used, and these nodes would be centrally administered. When a WAN is used, a number of administrators are likely to be involved, which may cause co-ordination problems. The reason for using a WAN is that the applications are sufficiently business-critical to require such processing resource. It is in this context that separate compilation on each LAN becomes more practical in a business sense.

7. COMPARISON WITH RELATED WORK

As stated in the introduction, the motivation for this work has come from the economic significance of a particular application domain, which justifies the design and development of an application-specific framework. To repeat, the key advantage of implementing such a framework, over a general-purpose one, is that a tailored system can be optimised for the requirements of the application.

In this Section a technical comparison of the framework is made with PVM[6] and MPI[7], both more general-purpose and widely used frameworks. A recent paper[14] comparing these two systems forms a basis for the comparison here.

As would be expected there are many similarities between PVM, MPI and the approach here, and more generally between any such frameworks, their aims being to allow co-ordination and communication between multiprocess applications. Both PVM and MPI concentrate upon message-passing as the co-ordination mechanism. Both derive from the needs of computational scientists to achieve number-crunching. The requirements here are somewhat different as the applications of concern are commercial, and have both computation and data requirements. Hence an important design issue here has been to ensure that large messages can be accommodated; this has led to the concentration on different message types and particularly on LONG messages.

PVM is based on the notion of a 'virtual machine': a set of heterogeneous hosts connected by a network that appears as a single large parallel system. The major focus has been the accommodation and utilisation of heterogeneity; as a result portability is viewed as being more important than performance.

MPI, in contrast, has focused upon performance, and has been targeted at being a standard message-passing system implemented on MPP (Massively Parallel Processing) systems. Geist et al.[14] state that MPI has a much more sophisticated set of communications functions than PVM.

The approach here can be placed between those of PVM and MPI: PVM primarily focuses on portability, and MPI on performance. As with PVM, the focus here is on heterogeneous networks. It is felt that the approach here will achieve higher performance than PVM as it is directed towards a specific application domain rather than being general-purpose. Thus it has a much simpler child-process co-ordination framework, and far fewer types of interaction. In the same vein, the framework here has only been implemented to operate with the C language, unlike PVM, which is concerned with multilanguage interoperability. Finally, on performance, the nodes here all support UNIX, so the framework can be optimised for a particular operating system.

In terms of process control features, the approach here is much closer to that of PVM: in the context of the parent–child model we can start and stop processes, and enquire about their status. Geist et al. conclude that PVM lacks a non-blocking send, which is available in MPI. The approach here has a non-blocking send.

As the framework here targets heterogeneous networks and has been implemented on different types of architecture (networks, SMP and MPP) from different vendors, the approach here is more portable than MPI. In addition, the framework is far simpler, compared with the 250 functions of the proposed MPI-2 standard.


Geist et al.[14] conclude that it is difficult to use MPI in a heterogeneous environment as there are no standard methods to create processes on separate hosts: 'none of the existing MPI implementations can inter-operate'. The approach described here has been designed precisely to co-ordinate within a heterogeneous environment.

Geist et al. also conclude that MPI lacks provision to enable fault-tolerant applications to be written. As described, the possibility of failure at a number of levels has been considered here, motivated by the application requirements.

In summary, because the approach here is domain-specific, it is simpler and thus likely to be more efficient, for the applications of concern, than a general-purpose approach. As indicated, some of the weaknesses of both approaches are addressed here within a simpler child–parent framework.

8. CONCLUSIONS

The work described here is part of a continuing investigation of the requirements of financial applications and the utility of enabling technologies, such as parallel and distributed systems, in this area[11,12,15–17]. Previously the focus has been on the use of parallel systems; in this paper the focus has been on the provision of a framework in which all the distributed computing resource of an organisation, including parallel systems, can be co-ordinated. The techniques used in parallelising applications extend to distributed systems – it is simply that, because of the extra communication overheads, the granularity of the tasks must be larger.

The design of a co-ordination framework for a distributed system has been presented at two levels: IPC and CML. The aim of the framework is to enable a distributed heterogeneous system to be used as an integrated whole whenever extra processing resource is required. A subset of the design has been implemented.

Related work has investigated techniques to make use of intermediate results of transaction execution when node failure occurs[18]. Other work has addressed the issue of formally describing the requirements of a parallel system in terms of process interaction[19].

ACKNOWLEDGEMENTS

It is a great pleasure to acknowledge all members of the Systems Department at Noble Lowndes and Partners Ltd., in particular John Kay, Tony Harris and Paul Symons, who were involved in the design and development of the described system. The author also thanks the anonymous referees for their constructive and detailed comments on earlier drafts.

REFERENCES

1. F. Wray, High Performance Computing: Status and Markets, UNICOM Applied Information Technology Report, UNICOM Publishers, 1996.

2. F. Darema, 'On the parallel characteristics of engineering/scientific and commercial applications', in Parallel Information Processing, J. A. Keane (ed.), Stanley Thornes Publ., 1996, pp. 1–30.

3. F. N. Teskey, 'Parallel processing in a commercial open systems market', in Parallel Information Processing, J. A. Keane (ed.), Stanley Thornes Publ., 1996, pp. 31–38.

4. S. Mullender (ed.), Distributed Systems, Addison-Wesley, 2nd edn, 1993.

5. T. L. Casavant and M. Singhal (eds.), Readings in Distributed Systems, IEEE Press, 1994.

6. V. Sunderam, 'The PVM concurrent computing system', in Parallel Information Processing, J. A. Keane (ed.), Stanley Thornes Publ., 1996, pp. 133–150.

7. MPI Forum, 'MPI: a message-passing interface standard', Int. J. Supercomput. Appl., 8(3/4), 165–416 (1994).

8. N. Carriero and D. Gelernter, How to Write Parallel Programs – A First Course, MIT Press, 1990.

9. M. D. Schroeder, 'A state-of-the-art distributed system: computing with BOB', in S. Mullender (ed.), Distributed Systems, Addison-Wesley, 2nd edn, 1993, pp. 1–16.

10. K. M. Chandy and S. Taylor, An Introduction to Parallel Programming, Jones and Bartlett Publishers, 1992.

11. J. A. Keane, 'Parallelising a financial system', Future Gener. Comput. Syst., 9(1), 41–51 (1993).

12. J. A. Keane, 'Parallel systems in financial information processing', Concurrency: Pract. Exp., 8(10), 757–768 (1996).

13. A. J. G. Hey, 'Experiments in MIMD parallelism', in E. Odijk, M. Rem and J.-C. Syre (eds.), PARLE '89 Parallel Architectures and Languages Europe, Lecture Notes in Computer Science, Vol. 366, Springer-Verlag, 1989, pp. 28–42.

14. G. A. Geist et al., 'PVM and MPI: a comparison of features', Calculateurs Paralleles, 8(2) (1996).

15. J. A. Keane et al., 'Commercial users' requirements for parallel systems', in Proc. of 2nd DAGS Parallel Computation Symposium, F. Makedon et al. (eds.), 1993, pp. 15–25.

16. J. A. Keane et al., 'Benchmarking financial database queries on a parallel machine', in Proc. of 28th Hawaii International Conference on System Sciences, Vol. II, IEEE Press, 1995, pp. 402–411.

17. J. A. Keane, 'High performance banking', Proc. of RIDE '97, IEEE Press, 1997, pp. 66–69.

18. J. A. Keane and X. F. Ye, 'A fault tolerant model for a parallel database system', Proc. of EUROSIM Conference, L. Dekker et al. (eds.), North-Holland, 1996, pp. 127–134.

19. W. Hussak and J. A. Keane, 'Expressing requirements on a parallel system formally', Requirements Eng. J., 1(4), 199–209 (1996).
