the starfish system: providing intrusion detection and intrusion

The Starfish System: Providing Intrusion Detection andIntrusion Tolerance for Middleware Systems

Kim Potter KihlstromDepartment of Mathematics and Computer Science

Westmont College955 La Paz Road, Santa Barbara, CA 93108

[email protected]

Priya NarasimhanElectrical and Computer Engineering Department

Carnegie Mellon University5000 Forbes Avenue, Pittsburgh, PA 15213-3890

[email protected]

Abstract

We introduce the Starfish1 system, a new system thatprovides intrusion detection and intrusion tolerance formiddleware applications operating in an asynchronousdistributed system. The Starfish system contains a cen-tral, highly secure and tightly coupled core. This coreis augmented by “arms” that are less tightly coupledand that have less stringent security guarantees, each ofwhich can be removed from the core if a significant secu-rity breach occurs. New arms can be “grown” as needed.

The Starfish system aims to employ a number of tech-niques for providing intrusion detection and intrusion tol-erance. The specific challenges that we will address inthis paper are infrastructural support for voting and end-to-end intrusion detection.

1. Introduction

As middleware technologies such as Jini, EJB,CORBA, and DCOM have become more important fordesigning large software systems, it is has become in-creasingly important to design algorithms and develop

1Starfish are known to have small bodies, out of which spring fortha varying number of arms, which break off when damaged. Thesearms subsequently heal and re-grow. Detached starfish arms (also called”comets”) can also regenerate new bodies.

mechanisms to provide intrusion detection and intrusiontolerance capabilities in middleware systems. Intrusiontolerant systems are systems that continue to operate cor-rectly and reliably despite intrusions that cause somenodes to behave in an arbitrary or malicious manner.While traditional computer security concerns keeping in-truders out of a system, our goal is detection of an in-trusion and continued provision of useful services evenduring or after a successful intrusion.

It is of crucial importance to protect critical applica-tions such as electronic banking and commence systems,the national power control grid, air traffic control, andmedical monitoring and support systems, from maliciousintrusions and corruption. While some work has beendone in this area, much remains to be investigated andmany questions remain open. We must develop algo-rithms and mechanisms for providing intrusion toleranceto middleware systems, and provide a framework for un-derstanding intrusions in such systems. It is critical thatwe describe properties that must be achieved by intrusiontolerant systems, so that middleware systems can be eval-uated against these properties.

As shown in Figure 1, the Starfish system consists of atightly coupled and highly secure core containing criticalcomponents, as well as arms providing less stringent se-curity guarantees and containing less critical components.In the event of a security breach, an arm can be removed.Additionally, new arms can be grown.

Figure 1. The Starfish system.

Our work explores the challenges underlying provid-ing intrusion tolerance for complex middleware applica-tions in a distributed asynchronous system. The specificchallenges that we will address in this paper are infras-tructural support for voting and end-to-end intrusion de-tection. One of the goals of the mechanisms and the algo-rithms that we have developed, and are continuing to de-velop, is to be widely applicable across a variety of mid-dleware technologies and, where possible, to transcenddifferences in diverse middleware implementations andstandards. Drawing on our previous work on CORBA, wewill investigate, in this paper, generic algorithms, tech-niques and mechanisms that will be essential foundationsfor the intrusion tolerant middleware systems of the fu-ture.

2. System model

We consider an asynchronous distributed system con-sisting of processors that communicate via messages overa network. Processors have access to local clocks, butthese clocks are not synchronized.

The system is subject to communication, processor,and object faults. Communication between processors

is unreliable, and thus messages may need to be retrans-mitted and may be arbitrarily delayed. Communicationchannels are not assumed to be FIFO or authenticated.

Processors are either correct or faulty. Correct pro-cessors always behave according to their specification.Faulty processors exhibit arbitrary (Byzantine) behav-ior. Because a Byzantine processor can behave as if itcrashed, crash faults are considered as Byzantine faults.

A faulty processor may attempt to disrupt the systemby sending different information to different processors,purporting that it is the same information, or by selec-tively sending messages to some processors and not toothers. Further, a malicious processor can attempt tothwart the consistent ordering of messages by sendingmessages to different processors in a different order. Sucha processor can also masquerade as another processor byincluding the identifier of another processor in the mes-sages it sends. A faulty processor can refuse to send mes-sages, can send messages that are syntactically or seman-tically incorrect, or can claim that another processor isfaulty.

Objects are also subject to crash faults or may behavein an arbitrary or malicious manner. Correct objects al-

ways behave according to their specification and do notcrash. A faulty object may fail to send a message as re-quired by the application, which we refer to as a sendomission fault. A faulty object may also send a mes-sage that contains an incorrect (corrupted) invocation orresponse, which we refer to as a value fault.

3. Infrastructural support for voting

In an environment subject to malicious faults due tointrusions, a single object may be corrupted and provideincorrect semantics, including providing an incorrect re-sponse to an invocation. Thus, to provide intrusion tol-erance, it is crucial to replicate objects. We make useof state machine replication [17] to actively replicate ob-jects. In the case of a deterministic object, if all of theobject replicas begin in the same state and process thesame updates in the same order, then consistency will bemaintained.

However, maintaining replica consistency in an envi-ronment subject to malicious faults is a complex problem.Ensuring that all objects process the same updates in thesame order is very difficult. We must ensure, for exam-ple, that a malicious processor is unable to disrupt thesystem by sending different information to different pro-cessors, purporting that it is the same information. Se-cure reliable group communication systems can ensurethat all of the processors receive the same information,even if some of the processors in the system behave mali-ciously. Throughout the rest of this paper, we assume theexistence of an underlying group communication systemsuch as the SecureRing system [8] that provides securereliable totally-ordered message delivery and processorgroup membership services despite malicious processorfaults.

We make use of the object group abstraction, in whichthe replicas of an object form an object group and the de-gree of replication is the size of the object group. An ob-ject group interface can be employed to hide the detailsof the underlying group communication protocols. Theobject group interface allows an object to transparentlysend an invocation or response to all of the replicas of anobject; i.e., to the object group. The object group inter-face also allows replicas to join and leave (or be removedfrom) an object group, and for object groups to be createdand eliminated.

Fault-tolerant systems typically perform some type ofvoting. Most work on intrusion tolerance has focusedon faulty servers, with clients performing majority votingon responses from replicated server objects. However, in

an intrusion tolerant system we must also consider faultyclients, and ensure that such clients are unable to corruptthe state of server objects. We advocate replicating clientsas well as servers, and we thus perform majority votingon both invocations and responses, as shown in Figure 2.

There are key questions with regard to voting in a dy-namic environment, in which object replicas may be cre-ated and removed. In particular, it is crucial to be ableto determine how many votes from a particular replicatedobject constitute a majority, given the current status of thesystem. To send an invocation or a response to an objectgroup, the originating object does not need to be awareof the number or location of the replicas of the target ob-ject. Typically, while the object group layer on a particu-lar processor would process membership changes regard-ing object groups hosted by the processor, for reasons ofscalability it would not receive such membership changesregarding other object groups that it does not host. How-ever, to vote on incoming invocations or responses, thevoter at the target replica must know the number of thereplicas in the originating object group (in order to knowhow many votes to expect and, thus, how many votes con-stitutes a majority). To handle this, we propose, in addi-tion to prior work, a hierarchical membership structurethat consists of both object groups and voting groups, aswell as the use of a voting group interface, as shown inFigure 3.

The hierarchical group structure allows membershipchanges to be selectively disseminated only to interestedparties, and is thus scalable while providing the informa-tion necessary for voting. When a client object group��

sends an invocation to a server object group��

,a voting group

��is formed that contains both

��and��

. Membership information for both��

and��

is maintained locally and, when there is a change in ob-ject group membership of either

��or��

, all of thereplicas in the voting group

� �are made aware of those

changes. The voting group interface allows members ofa voting group to transparently vote on the invocationsor responses from other members of that voting group.Thus,

� ��is able to transparently vote on invocations

from��

, and��

is able to transparently vote on re-sponses from

��.

4. End-to-end intrusion detection

Various types of faults can be manifested as the re-sult of an intrusion. Some of these include communi-cation faults, processor faults, and object faults. Oncevarious types of faults have been identified and catego-

Client object group Goinvoking an operation

1 Server object group Govoting on invocations

2

v vv

Client object group Govoting on responses

1

Server object group Goresponding to the operation

2

v v v

Figure 2. Majority voting on invocations and responses.

rized, strategies must be developed to protect systemsfrom these types of faults.

As described above, membership occurs in several lev-els. At the lowest level, there is the collection of all pro-cessors that host objects in the system. We refer to this asprocessor membership. Next, there is voting group mem-bership, which encompasses the client and server objectgroups that are communicating at any particular point intime. Finally, there is object group membership, in whichan object group consists of all of the replicas of an object.All of these memberships are dynamic and change as in-trusions are detected, as new objects are created or modi-fied in their degree of replication, and as objects commu-nicate through invocations and responses.

Prior work on Byzantine fault detectors [9] can be ap-plied to intrusion detection in middleware. These tech-niques allow for the detection and removal of corruptedprocessors. Each processor has access to a local fault de-tector module that provides hints about which processorsare faulty. The local fault detector monitors messages thatare sent and received, and provides an output consistingof a list of processors that it currently suspects.

It is important for intrusion tolerant middleware sys-tems to be adaptive. It is crucial not only to detect Byzan-tine faults, but also to remove the faulty object or proces-sor. There are many issues to be resolved, such as whetherto remove only an object replica that has been found tobe faulty, or to remove the processor hosting the replica

as well. Another key question is how to ensure that a ma-licious object replica is not able to cause the removal of acorrect object or processor by claiming that it is faulty.

When an object replica exhibits a value fault thatis subsequently detected through the voting mechanism,two options are available. The first possibility is simply toremove the faulty replica from the object group of whichit is a member, and all voting groups of which its objectgroup is a member. This is the easiest course of action,and may be appropriate when there is reasonable confi-dence that one faulty object is unable to corrupt anotherobject on the same processor. The process of removinga faulty replica is illustrated in Figure 4. In Figure 4(a),a faulty replica � that is a member of object group

��

sends a vault fault to another object group��

.

In order to ensure that faulty replica � is removed fromits object group

��, it is necessary that all of the replicas

in��

receive evidence of the value fault. To ensure thatthis happens, a Value Fault Vote message is employed.This special message contains embedded votes from eachof the object replicas in group

��, including the faulty

vote from replica � . A means of authentication must beemployed in order to ensure that a malicious replica isnot able to forge an incorrect vote from a correct proces-sor. The Value Fault Vote message is sent from the targetreplicas in object group

� ��that detected the value fault

back to the originating object group��

, as shown inFigure 4(b). The object group layer makes use of a value

Secure Multicast

Secure Multicast

Secure MulticastSecure Multicast

Secure Multicast

ObjectGroups

Object ReplicasVotingGroups

Voting

Intrusion Detection/Intrusion Tolerance





Replication

Voting

Voting

Replication

Voting

Replication

Replication

Voting

Replication

Figure 3. Voting groups and object groups.

fault detector, which votes on the values in the messagesof a given set and identifies a corrupt sender replica aswell as the object group of which it is a member. The ob-ject group interface is then employed to remove the faultyobject replica from object group

��, as shown in Fig-

ure 4(c). Due to the voting group mechanism, the changein the object group membership is then propagated to allof the object groups that are members along with group��

in a voting group (in this example, object groups��and

�� in voting groups

��and

��). This pro-

cess is illustrated in Figure 4(d).

In addition to simply removing a faulty object replicafrom its object group and voting groups, the second op-tion is to remove the processor hosting the faulty objectfrom the system. This is the safest course of action be-cause the corruption of any object residing on a proces-sor may affect the integrity of other objects in residence.However, this semantic requires additional support mech-anisms.

To ensure that all processors handle a value fault dueto a corrupt replica as a malicious processor fault, the fol-lowing completeness property of the Byzantine fault de-tector module must be satisfied:

� There is a time after which every processor that hasexhibited a fault is suspected by every correct pro-cessor

This requires that all of the processors in the system,not merely those that host objects that are members ofobject or voting groups with the faulty replica, must votelocally on the same set of votes and reach the same deci-sion. To achieve this, a special base group containing allof the processors is employed. A value fault detector isused at the processor level and, on detecting an incorrectvote, the value fault detector sends a Value Fault Votemessage to the base group. The value fault detector ateach receiving processor compares this set of invocationsor responses to determine the corrupt client or serverreplica and hosting processor.

A subtle problem may occur here if care is not takenin the mechanism. If a malicious object is not removedimmediately at the application level, a problem may oc-cur with cascading in which the faulty object continuesto send incorrect values, each of which causes receivingprocessors to send a Value Fault Vote message to the basegroup. Effectively, this can result in a denial of serviceattack if not handled properly. It is important to handle

ObjectGroup Go1 Object

Group Go2

ObjectGroup Go3

VotingGroup Gv1

VotingGroup Gv2

(a) Faulty object replica r (shown shaded)sends value fault to object group Go3


Group Go2

ObjectGroup Go3

VotingGroup Gv1

VotingGroup Gv2

(c) Faulty object replica r is removedfrom object group Go2


Group Go2

ObjectGroup Go3

VotingGroup Gv1

VotingGroup Gv2

(d) Object groupsreceive membership update on

Go and GoGo

1 3

2


Group Go2

ObjectGroup Go3

VotingGroup Gv1

VotingGroup Gv2

(b) Object group Go detects fault and sendsmessage to object group Go

3

2Value_fault_vote

r r

Object Group Interface

Voting Group Interface

Secure Multicast

Voting


Replication

Object Group Interface

Voting Group Interface

Secure Multicast

Voting


ReplicationSecure Multicast

Voting


Replication

Secure Multicast

Voting


Replication

Secure Multicast

Voting


Replication

Secure Multicast

Voting


Replication

Secure Multicast

Voting


ReplicationSecure Multicast

Voting


Replication

Secure Multicast

Voting


Replication

Secure Multicast

Voting


Replication

Secure Multicast

Voting


Replication

Secure Multicast

Voting


Replication

Secure Multicast

Voting


Replication

Secure Multicast

Voting


Replication

Secure Multicast

Voting


Replication

Secure Multicast

Voting


Replication

Secure Multicast

Voting


Replication

Secure Multicast

Voting


Replication

Secure Multicast

Voting


Replication

Secure Multicast

Voting


Replication

Figure 4. Steps in removing a faulty replica.

the fault initially at the application level, by removingthe faulty replica from its object and voting groups as de-scribed above, before handling the fault at the processorlevel.

5. Related work

This work builds on the authors’ experience with theImmune system [15]. The Immune system aims to pro-vide survivability to CORBA applications transparently,enabling them to continue to operate despite intrusionsor accidents that damage the underlying distributed sys-tem, or faults that occur within the system. The Immunesystem can protect an existing unmodified CORBA ap-plication, running over an unmodified commercial ORB,against arbitrary faults, including those that arise frommalicious attacks within the system. Every object withinthe CORBA application is actively replicated by the Im-

mune system, with majority voting applied on incominginvocations and responses to each replica of the object.

The Immune system grew out of the authors’ work ontwo projects: the Eternal system and the SecureRing sys-tem. The SecureRing system [8] provides secure reliabletotally-ordered message delivery and group membershipservices despite the malicious corruption of some of theprocessors. The protocols multicast messages to groupsof processors and deliver messages in a consistent totalorder to all members of the group. The Eternal system[13] aims to provide support for fault tolerance and evo-lution to CORBA-based applications. The application ob-jects are replicated and distributed across the system, andthe consistency of the states of the replicas is maintainedby exploiting the facilities of the underlying group com-munication protocol for all inter-object communication.

Phalanx [11] is a software system that enables wide-area applications to survive the malicious corruption ofclients and servers. Phalanx is a persistent object store

that provides support for shared data abstractions such asfiles and locks. Phalanx is designed to scale to a largenumber of servers, and employs quorum constructions[10] that enable efficient scaling. Phalanx was developedfor use in an electronic election trial during the 1998 CostRican presidential elections.

Fleet [12] is a middleware system that implements adistributed object repository. Fleet generalizes Phalanxin several ways, including support for persistent objectsof arbitrary types, techniques for fault monitoring, andcapabilities for dynamically extending the infrastructurewith new object types. Fleet objects are persistent in thatthe object may outlive the client that created it. Fleet ob-jects are replicated and retain correct semantics despitethe malicious corruption of servers. Fleet is designed forJava/Jini applications and makes use of Byzantine quo-rum systems [10] to achieve scalability.

The Object Group Service [6] is a CORBA-compliantapproach that provides fault tolerance for CORBA appli-cations through a set of CORBA services. Mechanismsare provided to allow for the delivery of an invocation(response) to a server (client) object only if a majority ofcopies of the invocation (response) have been received,though not voted on for verification of their contents orvalue, from the replicated client (server) object.

The AQuA architecture [4] provides a CORBA-baseddependability framework that can tolerate processor crashfaults and replica crash faults, as well as send omissionfaults and value faults due to an corrupted object. TheAQuA architecture exploits the facilities of the underly-ing Ensemble and Maestro toolkits [18], and uses ma-jority voting at the application object level, to detect anincorrect value of an invocation (response) from a repli-cated client (server). AQuA does not provide for thedetection, or tolerance, of malicious processor faults ormessage corruption faults.

COPE [14] is a set of CORBA security policies andCORBA-based services that is being developed to pro-vide intrusion tolerance. An implementation has beendeveloped using the commercial ORB, OrbixWeb, fromIona Technologies. COPE allows a method of anyCORBA object in the application to be invoked only asdetermined by the application access policy for that ob-ject, based on permissions and authentication. Optimisticprotocols serve to contain the damage done by an intruderbefore it is detected.

The BFT library [2] is a state machine replication sys-tem that tolerates Byzantine faults. It achieves high per-formance because it uses digital signatures only for view-change and new-view messages, and authenticates other

messages using message authentication codes (MACs).A more recent extension [3] provides for replicas to berecovered proactively, allowing the system to tolerate anynumber of faults over its lifetime provided that fewer than1/3 of the replicas are faulty within a small time period.A state transfer mechanism is also provided. BASE [16]is a further extension of BFT that enables replicas to rundifferent or non-deterministic implementations. This isdone by defining a common abstract specification for aservice and then implementing a wrapper for each dis-tinct implementation to make it behave according to thecommon specification.

The FRIENDS [5] system aims to provide mecha-nisms for building fault-tolerant applications in a flexibleway through the use of libraries of metaobjects. Separatemetaobjects can be provided for fault tolerance, securityand group communication. FRIENDS is composed of anumber of subsystems, including a fault tolerance subsys-tem which provides support for replication and detectionof faults. The secure communication subsystem includessupport for authentication, authorization and audit. Whileevery method invocation on a server can be signed at theclient and authenticated at the server, majority voting isnot used to determine the delivery of the invocation at theserver.

The Cactus project [7] is a framework aimed at pro-viding a flexible way of customizing different levels andstrategies for survivability through the combination ofcomponents known as micro-protocols, each implement-ing a specific property. Fault tolerance is implemented asa micro-protocol which supports active and passive repli-cation. Security is provided through micro-protocols thatimplement various authentication and encryption meth-ods. GroupRPC is a middleware service, implementedusing the Cactus approach, which would allow an appli-cation to tolerate crash, omission, timing and Byzantinefaults.

The WAFT system [1] is a distributed object sys-tem that aims to support replication and security strate-gies suited to wide-area applications. The fault detectionmechanism that WAFT employs is able to handle perfor-mance faults, such as a failure to comply with a specificquality of service requirement.

6. Conclusions and future directions

The increasingly complex and unpredictable world inwhich we live requires a greater understand of intrusiondetection and the issues involved in the design of intru-sion tolerant middleware systems. We must continue to

develop mechanisms, tools, and techniques to aid in thedevelopment of such systems.

We have presented two general mechanisms for build-ing intrusion tolerant systems. The first is the use of hier-archical groups for scalability. We introduced the idea ofa voting group and a voting group interface for selectivepropagation of membership information and transparentvoting. Our second mechanism is that of end-to-end in-trusion detection at the object group, voting group, andprocessor level. These mechanisms, as well as others, areillustrated in the Starfish system currently under develop-ment.

Areas for future work include investigation into scal-able techniques to achieve replication. Byzantine quo-rums [10] are a recent advance in this direction, but morework is necessary. Another key question is how to dealwith denial of service attacks; very little progress hasbeen made in this area to date. Work is also needed in im-plementing diversity in servers so that a successful attackon one server cannot be easily extended to other servers.

Another area for future work lies in considerations forrecovery of processors and replicas. It is important thatrepaired objects and processors be allowed to rejoin thesystem. However, this is a difficult problem in systemssubject to malicious faults. Issues of what to recover andwhat state to transfer when a new or repaired processorcomes up, or when a new or repaired replica comes up,must be considered.

References

[1] L. Alvisi and K. Marzullo. WAFT: Support for fault-tolerance in wide-area object oriented systems. In Pro-ceedings of the 1998 Information Survivability Workshop,pages 5–10, Orlando, FL, Oct. 1998.

[2] M. Castro and B. Liskov. Practical Byzantine fault toler-ance. In Proceedings of the 3rd Symposium on OperatingSystems Design and Implementation, pages 173–86, NewOrleans, LA, Feb. 1999.

[3] M. Castro and B. Liskov. Proactive recovery in aByzantine-fault-tolerant system. In Proceedings of theFourth Symposium on Operating Systems Design and Im-plementation, pages 273–87, San Diego, CA, Oct. 2000.

[4] M. Cukier, J. Ren, C. Sabnis, W. H. Sanders, D. E.Bakken, M. E. Berman, D. A. Karr, and R. Schantz.AQuA: An adaptive architecture that provides dependabledistributed objects. In Proceedings of the 17th IEEE Sym-posium on Reliable Distributed Systems, pages 245–253,West Lafayette, IN, Oct. 1998.

[5] J.-C. Fabre and T. Perennou. A metaobject architecturefor fault-tolerant distributed systems: The FRIENDS ap-

proach. IEEE Transactions on Computers, 47(1):78–95,1998.

[6] P. Felber, R. Guerraoui, and A. Schiper. The implementa-tion of a CORBA object group service. Theory and Prac-tice of Object Systems, 4(2):93–105, 1998.

[7] M. A. Hiltunen, R. D. Schlichting, and C. A. Ugarte. Sur-vivability issues in Cactus. In Proceedings of the 1998 In-formation Survivability Workshop, pages 85–88, Orlando,FL, Oct. 1998.

[8] K. P. Kihlstrom, L. E. Moser, and P. M. Melliar-Smith. The SecureRing group communication system.ACM Transactions on Information and System Security,4(4):371–406, Nov. 2001.

[9] K. P. Kihlstrom, L. E. Moser, and P. M. Melliar-Smith.Byzantine fault detectors for solving consensus. TheComputer Journal, 46(1):16–35, 2003.

[10] D. Malkhi and M. Reiter. Byzantine quorum systems.Distributed Computing, 11(4):203–213, 1998.

[11] D. Malkhi and M. Reiter. Secure and scalable replica-tion in Phalanx. In Proceedings of the 17th IEEE Sympo-sium on Reliable Distributed Systems, pages 51–58, WestLafayette, IN, Oct. 1998.

[12] D. Malkhi, M. Reiter, D. Tulone, and E. Ziskind. Persis-tent object in the Fleet system. In Proceedings of the 2ndDARPA Information Survivability Conference and Expo-sition, pages 126–36, Anaheim, CA, June 2001.

[13] L. E. Moser, P. M. Melliar-Smith, and P. Narasimhan.Consistent object replication in the Eternal system. The-ory and Practice of Object Systems, 4(2):81–92, Apr.1998.

[14] C. Namprempre, J. Sussman, and K. Marzullo. Imple-menting causal logging using OrbixWeb interception. InProceedings of the Fifth USENIX Conference on Object-Oriented Technologies and Systems, pages 257–267, SanDiego, CA, May 1999.

[15] P. Narasimhan, K. P. Kihlstrom, L. E. Moser, and P. M.Melliar-Smith. Providing support for survivable CORBAapplications with the Immune system. In Proceedings ofthe 19th IEEE International Conference on DistributedComputing Systems, pages 507–516, Austin, TX, May1999.

[16] R. Rodrigues, M. Castro, and B. Liskov. BASE: Usingabstraction to improve fault tolerance. Operating SystemsReview (Proceedings of the 18th ACM Symposium on Op-erating System Principles), 35(5):15–28, Oct. 2001.

[17] F. B. Schneider. Implementing fault-tolerant services us-ing the state machine approach: A tutorial. ACM Com-puting Surveys, 22(4):299–319, Dec. 1990.

[18] A. Vaysburd and K. P. Birman. The Maestro approachto building reliable interoperable distributed applicationswith multiple execution styles. Theory and Practice ofObject Systems, 4(2):73–80, 1998.

the starfish system: providing intrusion detection and intrusion

Documents