
THE ADVANCED COMPUTING SYSTEMS ASSOCIATION

The following paper was originally published in the
Proceedings of the USENIX Annual Technical Conference
Monterey, California, USA, June 6-11, 1999

The MultiSpace: An Evolutionary Platform for Infrastructural Services

Steven D. Gribble, Matt Welsh, Eric A. Brewer, and David Culler
University of California at Berkeley

© 1999 by The USENIX Association. All Rights Reserved.

Rights to individual papers remain with the author or the author's employer. Permission is granted for noncommercial reproduction of the work for educational or research purposes. This copyright notice must be included in the reproduced paper. USENIX acknowledges all trademarks herein.

For more information about the USENIX Association:
Phone: 1 510 528 8649    FAX: 1 510 548 5738
Email: office@usenix.org    WWW: http://www.usenix.org


The MultiSpace: an Evolutionary Platform for Infrastructural Services

Steven D. Gribble, Matt Welsh, Eric A. Brewer, and David Culler
The University of California at Berkeley
{gribble,mdw,brewer,culler}@cs.berkeley.edu

Abstract

This paper presents the architecture for a Base, a clustered environment for building and executing highly available, scalable, but flexible and adaptable infrastructure services. Our architecture has three organizing principles: addressing all of the difficult service fault-tolerance, availability, and consistency problems in a carefully controlled environment, building that environment out of a collection of execution environments that are receptive to mobile code, and using dynamically generated code to introduce run-time-generated levels of indirection separating clients from services. We present a prototype Java implementation of a Base called the MultiSpace, and talk about two applications written on this prototype: the Ninja Jukebox (a cluster-based music warehouse), and Keiretsu (an instant messaging service that supports heterogeneous clients). We show that the MultiSpace implementation successfully reduces the complexity of implementing services, and that the platform is conducive to rapid service evolution.

1 Introduction

    The performance and utility of a personal computer will be defined less by faster Intel processors and new Microsoft software and increasingly by Internet services and software.

    c|net news article excerpt, 11/25/98

Once a disorganized collection of data repositories and web pages, the Internet has become a landscape populated with rich, industrial-strength applications. Many businesses and organizations have counterparts on the web: banks, restaurants, stock trading services, communities, and even governments and countries. These applications possess similar properties to traditional utilities such as the telephone network or power grid: they support large and potentially rapidly growing populations, they must be available 24x7, and they must abstract complex engineering behind simple interfaces. We believe that the Internet is evolving towards a service-oriented infrastructure, in which these high quality, utility-like applications will be commonplace. Unlike traditional utilities, Internet services tend to rapidly evolve, are typically customizable by the end-user, and may even be composable.

Although today's Internet services are mature, the process of erecting and modifying services is quite immature. Most authors of complex, new services are forced to engineer substantial amounts of custom, service-specific code, largely because of the diversity in the requirements of each service: it is difficult to conceive of a general-purpose, reusable, shrink-wrapped, adequately customizable and extensible service construction product.

Faced with a seemingly inevitable engineering task, authors tend to adopt one of two strategies for adding new services to the Internet landscape:

Inflexible, highly tuned, hand-constructed services: by far, this is the most dominant service construction strategy found on the Internet. Here, service authors carefully design a system targeted towards a specific application and feature set, operating system, and hardware platform. Examples of such systems are large, carrier-class web search engines, portals, and application-specific web sites such as news, stock trading, and shopping sites. The rationale for this approach is sound: it leads to robust and high-performance services. However, the software architectures of these systems are too restrictive; they result in a fixed service that performs a single, rigid function. The large amount of carefully crafted and hand-tuned code means that these services are difficult to evolve; consider, for example, how hard it would be to radically change the behavior of a popular search engine service, or to move the service into a new environment. These sorts of modifications would take massive engineering effort.

"Emergent services" in a world of distributed objects: this strategy is just beginning to become popularized with architectures such as Sun's JINI [31] and the ongoing CORBA effort [25]. In this world, instead of erecting complex, inflexible services, large numbers of components or objects are made available over the wide area, and services emerge through the composition of many such components. This approach has the benefit that adding to the Internet landscape is a much simpler task, since the granularity of contributed components is much smaller. Because of the explicit decomposition of the world into much smaller pieces, it is also simpler to retask or extend services by dropping one set of components and linking in others.

There are significant disadvantages to this approach. As a side-effect of the more evolutionary nature of services, it is difficult to manage the state of the system, as state may be arbitrarily replicated and distributed across the wide area. Wide-area network partitions are commonplace, meaning that it is nearly impossible to provide consistency guarantees while maintaining a reasonable amount of system availability. Furthermore, although it is possible to make incremental, localized changes to the system, it is difficult to make large, global changes because the system components may span many administrative domains.

In this paper, we advocate a third approach. We argue that we can reap many of the benefits of the distributed objects approach while avoiding difficult state management problems by encapsulating services and service state in a carefully controlled environment called a Base. To the outside world, a Base provides the appearance and guarantees of a non-distributed, robust, highly-available, high-performance service. Within a Base, services aren't constructed out of brittle, restrictive software architectures, but instead are "grown" out of multiple, smaller, reusable components distributed across a workstation cluster [3]. These components may be replicated across many nodes in the cluster for the purposes of fault tolerance and high performance. The Base provides the glue that binds the components together, keeping the state of replicated objects consistent, ensuring that all of the constituent components are available, and distributing traffic across the components in the cluster as necessary.

The rest of this paper discusses the design principles that we advocate for the architecture of a Base (section 2), and presents a preliminary Base implementation called the Ninja MultiSpace¹ (section 3) that uses techniques such as dynamic code generation and code mobility as mechanisms for demonstrating

¹ The MultiSpace implementation is available with the Ninja platform release; see http://ninja.cs.berkeley.edu.

Figure 1: Architecture of a Base: a Base is comprised of a cluster of workstations connected by a high-speed network. Each node houses an execution environment into which code can be pushed. Services have many instances within the cluster, but clients are shielded from this by a service "stub" through which they interact with the cluster.

and evaluating our hypotheses of service flexibility, rapid evolution, and robustness under change. While we have begun preliminary explorations into the scalability and high availability aspects of our prototype, that has not been the explicit focus of this initial implementation, and instead remains the subject of future work. Two example services running on our prototype Base are described in section 4. In section 5, we discuss some of the lessons we learned while building our prototype. Section 6 presents related work, and in section 7 we draw conclusions.

2 Organizing Principles

In this section we present three design principles that guided our service architecture development (shown at a high level in figure 1):

1. Solve the challenging service availability and scalability problems in carefully controlled environments (Bases),

2. Gain service flexibility by decomposing Bases into a number of receptive execution environments, and

3. Introduce a level of indirection between the clients and services through the use of dynamic code generation techniques.

We now discuss each of these principles in turn.


2.1 Solve Challenging Problems in a Base

The high availability and scalability "utility" requirements of Internet services are difficult to deliver; our first principle is an attempt to simplify the problem of meeting them by carefully choosing the environment in which we tackle these issues. As in [11], we argue that clusters of workstations provide the best platform on which to build Internet services. Clusters allow incremental scalability through the addition of extra nodes, high availability through replication and failover, and cost-performance by using commodity building blocks as the basis of the computing environment.

Clusters are the backbone of a Base. Physically, a Base must include everything necessary to keep a mission-critical cluster running: system administrators, a physically secure machine room, redundant internal networks and external network feeds, UPS systems, and so on. Logically, a cluster-wide software layer provides data consistency, availability, and fault tolerance mechanisms.

The power of locating services inside a Base arises from the assumptions that service authors can now make when designing their services. Communication is fast and local, and network partitions are exceptionally rare. Individual nodes can be forced to be as homogeneous as necessary, and if a node dies, there will always be an identical replacement available. Storage is local, cheap, plentiful, and well-guarded. Finally, everything is under a single domain, simplifying administration.

Nothing outside of the Base should try to duplicate the fault-tolerance or data consistency guarantees of the Base. For example, e-mail clients should not attempt to keep local copies of mail messages except in the capacity of a cache; all messages are permanently kept by an e-mail service in the Base. Because services promise to be highly available, a user can rely on being able to access her email through it while she is network connected.

2.2 Receptive Execution Environments

Internet services are generally built from a complex assortment of resources, including heterogeneous single-CPU and multiprocessor systems, disk arrays, and networks. In many cases these services are constructed by rigidly placing functionality on particular systems and statically partitioning resources and state. This approach represents the view that a service's design and implementation are "sanctified" and must be carefully planned and laid out across the available hardware. In such a regime, there is little tolerance for failures which disrupt the balance and structure of the service architecture.

To alleviate the problems associated with this approach, the Base architecture employs the principle of receptive execution environments: systems which can be dynamically configured to host a component of the service software. A collection of receptive execution environments can be constructed either from a set of homogeneous workstations or more diverse resources as required by the service. The distinguishing feature of a receptive execution environment, however, is that the service is "grown" on top of a fertile platform; functionality is pushed into each node as appropriate for the application. Each node in the Base can be remotely and dynamically configured by uploading service code components as needed, allowing us to delay the decision about the details of a particular node's specialization as far as possible into the service construction and maintenance lifecycle.²

As we will see in section 3, our approach has been to make a single assumption of homogeneity across systems in a Base: a Java Virtual Machine is available on each node. In doing so, we raise the bar of service construction by providing a common instruction set across all nodes, unified views on threading models, underlying system APIs (such as socket and filesystem access), as well as the usual strong typing and safety features afforded by the Java environment. Because of these provisions, any service component can be pushed into any node in our Base and be expected to execute, subject to local resource considerations (such as whether a particular node has access to a disk array or a CD drive). Assuming that every node is capable of receiving Java bytecodes, however, means that techniques generally applied to mobile code systems [21, 13, 28, 32, 18] can be employed internally to the Base: the administrator can deploy service components by uploading Java classes into nodes as needed, and the service can push itself towards resources by redistributing code amongst the participating nodes. Furthermore, because in this environment we are restricting our use of code mobility to deploying local code within the scope of a single, trusted administrative domain, some of the security difficulties of mobile code are reduced.

² We rely on two mechanisms for mobile code security: we restrict the use of mobile code inside the Base to code that originates from trusted sources within the Base itself, and we use the Java Security Manager mechanism to sandbox this mobile code. Our research goals, however, do not include solving the mobile code security problem.


2.3 Dynamic Redirector Stub Generation

One challenge for clustered servers is to present a single service interface to the outside world, and to mask load-balancing and failover mechanisms in the cluster. The naive solution is to have a single front-end machine that clients first contact; the front-end then dispatches these incoming requests to one of several back-end machines. Failures can be hidden through the selection of another back-end machine, and load-balancing can be directly controlled by the front-end's dispatch algorithm. Unfortunately, the front-end can become a performance bottleneck and a single point of failure [11, 8]. One solution to these problems is to use multiple front-end machines, but this introduces new problems of naming (how clients determine which front-end to use) and consistency (whether the front-ends mutually agree on the back-end state of the cluster). The naming problem can be addressed in a number of ways, such as round-robin DNS [6], static assignment of front-ends to clients, or "lightweight" redirection in the style of scalable Web servers [17]. The consistency problem can be solved through one of many distributed systems techniques [4] or ignored if consistent state is unimportant to the front-end nodes.

The Base architecture takes another approach to cluster access indirection: the use of dynamically generated Redirector Stubs. A stub is client-side code which provides access to a service; a common example is the stub code generated for CORBA/IIOP [25] and Java Remote Method Invocation (RMI) [23] systems. The stub code runs on the client and converts client requests for service functionality (such as Java method calls) into network messages, marshalling request parameters and unmarshalling results. In the case of Java RMI, clients download stubs on-demand from the server.

Base services employ a technique similar to RPC stub generation, except that the Redirector Stub for a service is dynamically generated at run-time and contains embedded logic to select from a set of nodes within the cluster (figure 2). Load balancing is implemented within this "Redirector Stub", and failover is accomplished by reissuing failed or timed-out service calls to an alternative back-end machine. The redirection logic and information about the state of the Base is built up by the Base and advertised to clients periodically; clients obtain the Redirector Stubs from a registry. This methodology has a number of significant implications about the nature of services, namely that they must be idempotent and maintain self-consistency

Figure 2: A "Redirector Stub": embedded inside a Redirector Stub are several RPC stubs with the same interface, each of which communicates with a different service instance inside a Base.

across a service's instances on different nodes in the Base. In section 3.3.1, we will discuss an implementation of a cluster-wide distributed data structure that simplifies the task of satisfying these implications.

Client applications can be coded without knowledge of the Redirector Stub logic; by moving failover and load-balancing functionality to the client, the use of front-end machines can be avoided altogether. This is similar to the notion of smart clients [34], but with the intelligence being injected into the client at run-time instead of being compiled in.
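To make the mechanism concrete, the following is a minimal sketch of the shape a Redirector Stub might take. The ServiceIF interface, the retry limit, and the random selection policy are our illustrative assumptions; the real Ninja stubs were generated dynamically at run-time rather than written by hand.

import java.rmi.RemoteException;
import java.util.Random;

// A hypothetical service interface; each back-end instance exports it via RMI.
interface ServiceIF {
    String handle(String request) throws RemoteException;
}

class RedirectorStub implements ServiceIF {
    private final ServiceIF[] stubs;   // one RMI stub per service instance in the Base
    private final Random rng = new Random();
    private static final int MAX_RETRIES = 3;

    RedirectorStub(ServiceIF[] stubs) { this.stubs = stubs; }

    // Invoke on a randomly chosen instance; on failure, reissue the
    // (idempotent) call against another instance before giving up.
    public String handle(String request) throws RemoteException {
        RemoteException last = null;
        for (int i = 0; i < MAX_RETRIES; i++) {
            ServiceIF target = stubs[rng.nextInt(stubs.length)];
            try {
                return target.handle(request);
            } catch (RemoteException e) {
                last = e;  // node failed or timed out; fail over to another stub
            }
        }
        throw last;  // many successive failures: surface an exception to the caller
    }
}

Because the retried call may already have executed on a failed node, this failover discipline is exactly why the text requires services to be idempotent.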

3 Implementation

Our prototype Base implementation (written in Java) is called the MultiSpace. It serves to demonstrate the effectiveness of our architecture in terms of facilitating the construction of flexible services, and to allow us to begin explorations into the issues of our platform's scalability. The MultiSpace implementation has three layers: the bottom layer is a set of communications primitives (NinjaRMI); the middle layer is a single-node execution environment (the iSpace); and the top layer is a set of multiple node abstractions (the MultiSpace layer). We describe each of these in turn, from the bottom up.

3.1 NinjaRMI

A rich set of high performance communications primitives is a necessary component of any clustered environment. We chose to make heavy use of Java's Remote Method Invocation (RMI) facilities for performing RPC-like [5] calls across nodes in the cluster, and between clients and services. When a caller invokes an RMI method, stub code intercepts the invocation, marshalls arguments, and sends them to a remote "skeleton" method handler for unmarshalling and execution. Using RMI as the finest granularity communication data unit in our clustered environment has many useful properties. Because method invocations have completely encapsulated, atomic semantics, retransmissions or communication failures are easy to reason about: they correspond to either successful or failed method invocations, rather than partial data transmissions.

However, from the point of view of clients, if a remote method invocation does not successfully return, it can be impossible for the client to know whether or not the method was successfully invoked on the server. The client has two choices: it can reinvoke the method call (and risk calling the same method twice), or it can assume that the method was not invoked, risking that the method was in fact invoked (successfully, or unsuccessfully with an exception) but the results were not returned to the client. Currently, on failure our Redirector Stubs will retry using a different, randomly chosen service stub; in the case of many successive failures, the Redirector Stub will return an exception to the caller. It is because of the at-least-once semantics implied by these client-side reinvocations that we must require services to be idempotent. The exploration of different retry policies inside the Redirector Stubs is an area of future research.

3.1.1 NinjaRMI enhancements to Sun's RMI

NinjaRMI is a ground-up reimplementation of Sun's Java Remote Method Invocation for use by components within the Ninja system. NinjaRMI was designed to permit maximum flexibility in implementation options. NinjaRMI provides three interesting transport-level RMI enhancements. First, it provides a unicast, UDP-based RMI that allows clients to call methods with "best-effort" semantics. If the UDP packet containing the marshalled arguments successfully arrives at the service, the method is invoked; if not, no retransmissions are attempted. Because of this, we enforce the requirement that such methods do not have any return values. This transport is useful for beacons, log entries, or other such side-effect oriented uses that do not require reliability. Our second enhancement is a multicast version of this unreliable transport. RMI services can associate themselves with a multicast group, and RMI calls into that multicast group result in method invocations on all listening services. Our third enhancement is to provide very flexible certificate-based authentication and encryption support for reliable, unicast RMI. Endpoints in an RMI session can associate themselves with digital certificates issued by a certification authority. When the TCP connection underlying an RMI session

is established, these certificates are exchanged by the RMI layer and verified at each endpoint. If the certificate verification succeeds, the remainder of the communication over that TCP connection is encrypted using a Triple DES session key obtained from a Diffie-Hellman key exchange. These security enhancements are described in [12].
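As a minimal sketch of what the best-effort transport implies for interface design, consider the following hypothetical beacon interface. The interface name and methods are assumptions, but the constraint is the one stated above: best-effort methods must carry no return values.

import java.rmi.Remote;
import java.rmi.RemoteException;

// A hypothetical one-way interface for the unreliable UDP transport.
// Methods return void: a lost packet simply means a missed invocation,
// which the caller never learns about (and must not depend on).
interface BeaconSinkIF extends Remote {
    void heartbeat(String nodeName, long timestamp) throws RemoteException;
    void logEntry(String message) throws RemoteException;
}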

Packaged along with our NinjaRMI implementation is an interface compiler which, when given an object that exports an RMI interface, generates the client-side stub and server-side skeleton stubs for that object. All source code for stubs and skeletons can be generated dynamically at run-time, allowing the Ninja system to leverage the use of an intelligent code-generation step when constructing wrappers for service components. We use this to introduce the level of indirection needed to implement Redirector Stubs: stubs can be overloaded to cause method invocations to occur on many remote nodes, or for method invocations to fail over to auxiliary nodes in the case of a primary node's failure.

3.1.2 Measurements of NinjaRMI

Method                  Local method    Sun RMI    Ninja RMI
f(void)                 0.19 µs         0.83 ms    0.82 ms
f(int)                  0.20 µs         0.84 ms    0.85 ms
int f(int)              0.18 µs         0.85 ms    0.84 ms
int f(int,int,int,int)  0.22 µs         0.88 ms    0.86 ms
f(byte[100])            0.19 µs         1.06 ms    1.05 ms
f(byte[1000])           0.20 µs         1.20 ms    1.09 ms
f(byte[10000])          0.21 µs         2.21 ms    2.23 ms
byte[100] f(int)        0.19 µs         1.07 ms    1.00 ms
byte[1000] f(int)       0.20 µs         1.17 ms    1.01 ms
byte[10000] f(int)      0.20 µs         2.20 ms    2.10 ms

Table 1: NinjaRMI microbenchmarks: These benchmarks were gathered on two 400 MHz Pentium II based machines, each with 128 MB of physical memory, connected by a switched 100 Mb/s Ethernet, and using Sun's JDK 1.1.6v2 with the TYA just-in-time compiler on Linux 2.0.36. For the sake of comparison, UDP round-trip times between two C programs were measured at 0.185 ms, and between two Java programs at 0.316 ms.

As shown in table 1, NinjaRMI performs as well as or better than Sun's Java RMI package. Given that a null RMI invocation cost 0.82 ms and that a round-trip across the network and through the JVMs cost 0.316 ms, we conclude that the difference (roughly 0.5 ms) is RMI marshalling and protocol overhead. Profiling the code shows that the main contributor to this overhead is object serialization, specifically the use of methods such as java.io.ObjectInputStream.read().

3.2 iSpace

A Base consists of a number of workstations, each running a suitable receptive execution environment for single-node service components. In our prototype, this receptive execution environment is the iSpace: a Java Virtual Machine (JVM) that runs a component loading service into which Java classes can be pushed as needed. The iSpace is responsible for managing component resources, naming, protection, and security. The iSpace exports the component loader interface via NinjaRMI; this interface allows a remote client to obtain a list of components running on the iSpace, obtain an RMI stub to a particular component, and upload a new service component or kill a component already running on the iSpace (subject to authentication). Service components running on the iSpace are protected from one another and from the surrounding execution environment in three ways:

1. each component is coded as a Java class, which provides protection from hard crashes (such as null pointer dereferences),

2. components are separated into thread groups; this limits the interaction one component can have with threads of another, and

3. all components are subject to the iSpace Security Manager, which traps certain Java API calls and determines whether the component has the credentials to perform the operation in question, such as file or network access.
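A hypothetical rendering of the component loader interface described above might look as follows; the method names, the Credentials placeholder, and the byte[] class transport are illustrative assumptions rather than the actual iSpace API.

import java.rmi.Remote;
import java.rmi.RemoteException;

// Opaque authentication token; a stand-in for whatever credential
// mechanism the iSpace actually uses.
class Credentials implements java.io.Serializable { }

interface ComponentLoaderIF extends Remote {
    // Enumerate the components currently running on this iSpace.
    String[] listComponents() throws RemoteException;
    // Obtain an RMI stub to a particular running component.
    Remote getStub(String componentName) throws RemoteException;
    // Upload a new service component (subject to authentication).
    void uploadComponent(String name, byte[] classFile, Credentials auth)
            throws RemoteException;
    // Kill a component already running on this iSpace.
    void killComponent(String name, Credentials auth) throws RemoteException;
}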

Other assumptions must be made in order to make this approach viable. In essence, we are relying on the JVM to behave and perform like a miniature operating system, even though it was not designed as such. For example, the Java Virtual Machine does not provide adequate protection between threads of multiple components running within the same JVM: one component could, for example, consume the entire CPU by running indefinitely within a non-blocking thread. Here, we must assume that the JVM employs a preemptive thread scheduler (as is true in Sun's Solaris Java environment) and that fairness can be guaranteed through its use. Likewise, the iSpace Security Manager must utilize a strategy for resource management which ensures both fairness and safety. In this respect iSpace has similar goals to other systems which provide multiple protection domains within a single JVM, such as the JKernel [16]. However, our approach does not necessitate a re-engineering of the Java runtime libraries, particularly because intra-JVM thread communication is not a high-priority feature.

3.3 MultiSpace

The highest layer in our implementation is the MultiSpace layer, which tethers together a collection of iSpaces (figure 3). A primary function of this layer is to provide each iSpace with a replicated registry of all service instances running in the cluster.

A MultiSpace service inherits from an abstract MultiSpaceService class, whose constructor registers each service instance with a "MultiSpaceLoader" running on the local iSpace. All MultiSpaceLoaders in a cluster cooperate to maintain the replicated registry; each one periodically sends out a multicast beacon³ that carries its list of local services to all other nodes in the MultiSpace, and also listens for multicast messages from other MultiSpace nodes. Each MultiSpaceLoader builds an independent version of the registry from these beacons. The registries are thus soft-state, similar in nature to the cluster state maintained in [11] and [1]: if an iSpace node goes down and comes back up, its MultiSpaceLoader simply has to listen to the multicast channel to rebuild its state. Registries on different nodes may see temporary periods of inconsistency as services are pushed into the MultiSpace or moved across nodes in the MultiSpace, but in steady state, all nodes asymptotically approach consistency. This consistency model is similar in nature to that of Grapevine [10].

The multicast beacons also carry RMI stubs for each local service component, which implies that any service instance running on any node in the MultiSpace can identify and contact any other service in the MultiSpace. It also means that the MultiSpaceLoader on every node has enough information to construct Redirector Stubs for all of the services in the MultiSpace, and advertise those Redirector Stubs to off-cluster clients through a "service discovery service" [9].⁴ Each RMI stub embedded in

³ Currently, IP multicast is used for this purpose; the multicast channel that beacons are sent over thus defines the logical scope and boundary of an individual MultiSpace. We intend to replace this transport with multicast NinjaRMI.

⁴ The service discovery service (or SDS) implementation consists of an XML search engine that allows client programs to locate services based on arbitrary XML predicates.


Figure 3: The MultiSpace implementation: MultiSpace services are instantiated on top of a sandbox (the Security Manager), and run inside the context of a Java virtual machine (JVM).

a multicast beacon is (based on our observations) roughly 500 bytes in average length; there is therefore an important tradeoff between the beacon frequency (and therefore the freshness of information in the MultiSpaceLoaders) and the number of services whose stubs are being beaconed that will ultimately affect the scalability of the MultiSpace.

Service instances can elect to receive multicast beacons from other MultiSpace nodes; a service can use this mechanism to become aware of its peers running elsewhere in the cluster. If a service overrides the standard beacon class, it can augment its beacons with additional information, such as the load that it is currently experiencing. Services that want to call out to other services could thus make coarse-grained load balancing decisions without requiring a centralized load manager.

3.3.1 Built-In MultiSpace Services

Included with the MultiSpace implementation are two services that enrich the functionality available to other MultiSpace services: the distributed hash table, and the uptime monitoring service.

Distributed hash table: as mentioned in section 2.3, due to the Redirector Stub mechanism MultiSpace service instances must maintain self-consistency across the nodes of the cluster. To make this task simpler, we have provided service authors with a distributed, replicated, fault-tolerant hash table that is implemented in C for the sake of efficiency. The hash table is designed to present a consistent view of data across all nodes in the cluster, and as such, services may use it to rendezvous in a style similar to Linda [2] or IBM's T-Spaces [33]. The current implementation is moderately fast (it can handle more than 1000 insertions per second of 500 byte entries on a 4 node 100 Mb/s MultiSpace cluster), is fault tolerant, and transparently masks multiple node failures. However, it does not yet provide all of the consistency guarantees that some services would ideally prefer, such as on-line recovery of crashed state or transactions across multiple operations. The current implementation is, however, suitable for many Internet-style services for which this level of consistency is not essential.
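Sketched as an interface, the rendezvous role of the hash table might look like the following; the name and signatures are assumptions for illustration, since the actual implementation is a C library exposed to Java services.

// A hypothetical view of the distributed hash table as seen by a Java
// service: put() is replicated across cluster nodes, and get() returns
// the same consistent view from any node, enabling Linda-style rendezvous.
interface DistributedHashtableIF {
    void put(byte[] key, byte[] value);
    byte[] get(byte[] key);
    void remove(byte[] key);
}

A pair of service instances on different nodes could, for example, rendezvous by agreeing on a key and polling it, rather than maintaining ad-hoc peer-to-peer state exchange.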

Uptime monitoring service: Even with the service beaconing mechanism, it is difficult to detect the failure of individual nodes in the cluster. This is partly because of the lack of a clear failure model in Java: a Java service is just a collection of objects, not necessarily even possessing a thread of execution. A service failure may just imply that a set of objects has entered a mutually inconsistent state. The absence of beacons doesn't necessarily mean that a service instance failure has occurred; the beacons may be lost due to congestion in the internal network, or the beacons may not have been generated because the service instance is overloaded and is busy processing other tasks. For this reason, we have provided an uptime monitoring abstraction to service authors. If a service running in the MultiSpace implements a well-known Java interface, the infrastructure automatically detects this and begins periodically calling the doProbe() method in that interface. By implementing this method, service authors promise to perform an application-level task that demonstrates that the service is accepting and successfully processing requests. By using this application-level uptime check, the infrastructure can explicitly detect when a service instance has failed. Currently, we only log this failure in order to generate uptime statistics, and we rely on the Redirector Stub failover mechanisms to mask these failures.
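The text names the probe method; the surrounding interface below is our sketch of the well-known interface it implies, with the interface name and exception behavior assumed.

// Hypothetical rendering of the well-known uptime-monitoring interface.
interface UptimeMonitoredIF {
    // Perform an application-level task (e.g. process a synthetic request)
    // demonstrating that the service is accepting and successfully
    // processing requests; the infrastructure calls this periodically and
    // logs a failure if the probe misbehaves.
    void doProbe() throws Exception;
}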

4 Applications

In this section of the paper, we discuss two applications that demonstrate the validity of our guiding principles and the efficacy of our MultiSpace implementation. The first application, the Ninja Jukebox, abstracts the many independent compact-disc players and local filesystems in the Berkeley Network of Workstations (NOW) cluster into a single pool of available music. The second, Keiretsu, is a three-tiered application that provides instant messaging across heterogeneous devices.


Figure 4: The Ninja Jukebox GUI: users are presented with a single Jukebox interface, even though songs in the Jukebox are scattered across multiple workstations, and may be either MP3 files on a local filesystem, or audio CDs in CD-ROM drives.

4.1 The Ninja Jukebox

The original goal of the Ninja Jukebox was to harness all of the audio CD players in the Berkeley NOW (a 100+ node cluster of Sun UltraSparc workstations) to provide a single, giant virtual music jukebox to the Berkeley CS graduate students. The most interesting features of the Ninja Jukebox arise from its implementation on top of iSpace: new nodes can be dynamically harnessed by pushing appropriate CD track "ripper" services onto them, and the features of the Ninja Jukebox are simple to evolve and customize, as evidenced by the seamless transformation of the service to the batch conversion of audio CDs to MP3 format, and the authenticated transmission of these MP3s over the network.

The Ninja Jukebox service is decomposed into three components: a master directory, a CD "ripper" and indexer, and a gateway to the online CDDB service [22] that provides artist and track title information given a CD serial number. The ability to push code around the cluster to grow the service proved to be exceptionally useful, since we didn't have to decide a priori which nodes in the cluster would house CDs; we could dynamically push the ripper/indexer component towards the CDs as the CDs were inserted into nodes in the cluster. When a new CD is added to a node in the NOW cluster, the master directory service pushes an instance of the ripper service into the iSpace resident on that node. The ripper scans the CD to determine what music is on it. It then contacts a local instance of the CDDB service to gather detailed information about the CD's artist and track titles; this information is put into a playlist which is periodically sent to the master directory service. The master directory incorporates playlists from all of the rippers running across the cluster into a single, global directory of music, and makes this directory available over both RMI (for song browsing and selection) and HTTP (for simple audio streaming).

After the Ninja Jukebox had been running for a while, our users expressed the desire to add MP3 audio files to the Jukebox. To add this new behavior, we only had to subclass the CD ripper/indexer to recognize MP3 audio files on the local filesystem, and create new Jukebox components that batch converted between audio CDs and MP3 files. Also, to protect the copyright of the music in the system we added access control lists to the MP3 repositories, with the policy that users could only listen to music that they had added to the Jukebox.⁵ We then began pushing this new subclass to nodes in the system, and our system evolved while it was running.

The performance of the Ninja Jukebox is completely dominated by the overhead of authentication and the network bandwidth consumed by streaming MP3 files. The first factor (authentication overhead) is currently benchmarked at a crippling 10 seconds per certificate exchange, entirely due to a pure Java implementation of a public key cryptosystem. The second factor (network consumption) is not quite as crippling, but still significant: each MP3 consumes at least 128 Kb/s, and since the MP3 files are streamed over HTTP, each transmission is characterized by a large burst as the MP3 is pushed over the network as quickly as possible. Both limitations can be remedied with significant engineering, but this would be beyond the scope of our research.

4.2 Keiretsu: The Ninja Instant-Messaging Service

Keiretsu⁶ is a MultiSpace service that provides instant messaging between heterogeneous devices: Web browsers, one- or two-way pagers, and PDAs such as the Palm Pilot (see Figure 5). Users are able to view a list of other users connected to the Keiretsu service, and can send short text messages to other users. The service component of Keiretsu exploits the MultiSpace features: Keiretsu service instances use the soft-state registry of peer nodes in order to exchange client routing information across the cluster, and automatically generated Redirector Stubs are handed out to clients for use in communicating with Keiretsu nodes.

⁵ This ACL policy is enforced using the authentication extensions to NinjaRMI described in 3.1.1.

⁶ Keiretsu is a Japanese concept in which a group of related companies work together for each other's mutual success.


Figure 5: The Keiretsu Service: (1) the user enters a message on the Pilot; (2) the Active Proxy converts the outgoing message to RMI; (3) the message is sent to the MultiSpace; (4) the message is routed to the destination Active Proxy; (5) the Active Proxy converts the message to Pilot format; (6) the receiving Pilot displays the message.

Keiretsu is a three-tiered application: simple client devices (such as pagers or Palm Pilots) that cannot run a JVM connect to an Active Proxy, which can be thought of as a simplified iSpace node meant to run soft-state mobile code. The Active Proxy converts simple text messages from devices into NinjaRMI calls into the Keiretsu MultiSpace service. The Active Proxies are assumed to have enough sophistication to run Java-based mobile code (the protocol conversion routines) and speak NinjaRMI, while rudimentary client devices need only speak a simple text-based protocol.

As described in section 2.3, Redirector Stubs are used to access the back-end service components within a MultiSpace by pushing load-balancing and failover logic towards the client; in the case of simple clients, Redirector Stubs execute in the Active Proxy. For each protocol message received by an Active Proxy from a user device (such as "send message M to user U"), the Redirector Stub is invoked to call into the MultiSpace.

Because the Keiretsu proxy is itself a mobile Java component that runs on an iSpace, the Keiretsu proxy service can be pushed into appropriate locations on demand, making it easy to bootstrap such an Active Proxy as needed. State management inside the Active Proxy is much simpler than state management inside a Base: the only state that Active Proxies maintain is the session state for connected clients. This session state is soft-state, and it does not need to be carefully guarded, as it can be regenerated given sufficient intelligence in the Base, or by having users manually recover their sessions.

Rudimentary devices are not the only allowable members of a Keiretsu. More complex clients that can run a JVM speak directly to the Keiretsu, instead of going through an Active Proxy. An example of such a client is our e-mail agent, which attaches itself to the Keiretsu and acts as a gateway, relaying Keiretsu messages to users over Internet e-mail.

4.2.1 The Keiretsu MultiSpace service

public void identifySelf(String clientName,
                         KeiretsuClientIF clientStub);
public void disconnectSelf(String clientName);
public void injectMessage(KeiretsuMessage msg);
public String[] getClientList();

Figure 6: The Keiretsu service API

The MultiSpace service that performs message routing is surprisingly simple. Figure 6 shows the API exported by the service to clients. Through the identifySelf method, a client periodically announces its presence to the Keiretsu, and hands the Keiretsu an RMI stub which the service will use to send it messages. If a client stops calling this method, the Keiretsu assumes the client has disconnected; in this way, participation in the Keiretsu is treated as a lease. Alternately, a client can invalidate its binding immediately by calling the disconnectSelf method. Messages are sent by calling the injectMessage method, and clients can obtain a list of other connected clients by calling the getClientList method.
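To illustrate the lease behavior, here is a hedged sketch of a client's renewal loop against the Figure 6 API. The supporting interface declarations, the deliver() callback, the message fields, and the 10-second renewal period are assumptions for illustration; only the four service methods come from Figure 6.

// Declarations inferred from Figure 6; deliver() and the message fields
// are assumed for illustration.
interface KeiretsuClientIF { void deliver(KeiretsuMessage msg) throws Exception; }
class KeiretsuMessage implements java.io.Serializable { String from, to, text; }
interface KeiretsuServiceIF {
    void identifySelf(String clientName, KeiretsuClientIF clientStub) throws Exception;
    void disconnectSelf(String clientName) throws Exception;
}

class KeiretsuClient implements Runnable {
    private final KeiretsuServiceIF keiretsu;  // in practice, a Redirector Stub
    private final KeiretsuClientIF myStub;     // callback stub the service uses
    private final String name;

    KeiretsuClient(KeiretsuServiceIF k, KeiretsuClientIF s, String n) {
        keiretsu = k; myStub = s; name = n;
    }

    // Participation is a lease: re-announce periodically, or the Keiretsu
    // assumes we have disconnected.
    public void run() {
        try {
            while (!Thread.interrupted()) {
                keiretsu.identifySelf(name, myStub);
                Thread.sleep(10_000);  // assumed renewal period
            }
        } catch (Exception e) {
            // fall through to eager release below
        }
        try { keiretsu.disconnectSelf(name); } catch (Exception ignored) { }
    }
}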

Inside the Keiretsu, all nodes maintain a soft-state table of other nodes by listening to MultiSpace beacons, as discussed in section 3.3. When a client connects to a Keiretsu node, that node sends the client's RMI stub to all other nodes; all Keiretsu nodes maintain individual tables of these client bindings. This means that in steady state, each node can route messages to any client.

Because clients access the Keiretsu service through Redirector Stubs, and because Keiretsu nodes replicate service state, individual nodes in the Keiretsu can fail and service will continue uninterrupted, at the cost of capacity and perhaps performance. In an experiment on a 4-node cluster, we demonstrated that the service continued uninterrupted even when 3 of the 4 nodes went down. The Keiretsu source code consists of 5 pages of Java code; however, most of the code deals with managing the soft-state tables of the other Keiretsu nodes in the cluster and the client RMI stub bindings. The actual business of routing messages to clients consists of only half a page of Java code; the rest of the service functionality (namely, building and advertising Redirector Stubs, tracking service implementations across the cluster, and load balancing and failover across nodes) is hidden inside the MultiSpace layer. We believe that the MultiSpace implementation is quite successful in shielding service authors from a significant amount of complexity.

4.2.2 Keiretsu Performance

We ran an experiment to measure the performance and scalability of our MultiSpace implementation and the Keiretsu service. We used a cluster of 400 MHz Pentium II machines, each with 128 MB of physical memory, connected by a 100 Mb/s switched Ethernet. We implemented two Keiretsu clients: the "speedometer", which opens up a parameterizable number of identities in the Keiretsu and then waits to receive messages, and the "driver", which grabs a parameterizable number of Redirector Stubs to the Keiretsu, downloads a list of clients in the Keiretsu, and then blasts 75 byte messages to randomly selected clients as fast as it can.

We started our Keiretsu service on a single node, and incrementally grew the cluster to 4 nodes, measuring the maximum message throughput obtained for 10x, 50x, and 100x "speedometer" receivers, where x is the number of nodes in the cluster. To achieve maximum throughput, we added incrementally more "driver" connections until message delivery saturated. The drivers and speedometers were located on many dedicated machines, connected to the Keiretsu cluster by the same 100 Mb/s switched Ethernet. Table 2 shows our results.

For a small number of receivers (10 per node), we observed linear scaling in the message throughput. This is because each node in the Keiretsu is essentially independent: only a small amount of state is shared (the client stubs for the 10 receivers per node). In this case, the CPU was the bottleneck, likely due to Java overhead in message processing and argument marshalling and unmarshalling.

For larger numbers of receivers, we observed a

# Nodes    # Clients per node    Max. message throughput (msgs/s)
1          10                    246 ± 4
1          50                    200 ± 8
1          100                   195 ± 10
2          10                    420 ± 10
2          50                    300 ± 20
2          100                   260 ± 20
3          10                    490 ± 15
3          50                    370 ± 20
3          100                   160 ± 15
4          10                    570 ± 15
4          50                    210 ± 10
4          100                   120 ± 10

Table 2: Keiretsu performance: These benchmarks were run on 400 MHz Pentium II machines, each with 128 MB of physical memory, connected by a 100 Mb/s switched Ethernet, using Sun's JDK 1.1.6v2 with the TYA just-in-time compiler on Linux 2.2.1, and sending 75 byte Keiretsu messages.

breakdown in scaling when the total number of receivers reached roughly 200 (i.e. 3-4 nodes at 50 receivers per node, or 2 nodes at 100 receivers per node). The CPU was still the bottleneck in these cases, but most of the CPU time was spent processing the client stubs exchanged between Keiretsu nodes, rather than processing the clients' messages. This is due to poor design of the Keiretsu service; we did not need to exchange client stubs as frequently as we did, and we should have simply exchanged timestamps for previously distributed stubs rather than repeatedly sending the same stub. This limitation could also be removed by modifying the Keiretsu service to inject client stubs into a distributed hash table, and rely on service instances to pull the stubs out of the table as needed. However, a similar N² state exchange happens at the MultiSpace layer with the multicast exchange of service instance stubs; this could potentially become another scaling bottleneck for large clusters.

5 Discussion and Future Work

Our MultiSpace and service implementation efforts have given us some insights into our original design principles, and into the use of Java as an Internet service construction language. In this section of the paper, we delve into some of these insights.


5.1 Code Mobility as a Service Construction Primitive

When we were designing the MultiSpace, we knew that code mobility would be a powerful tool. We originally intended to use code mobility for delivering code to clients (which we do in the form of Redirector Stubs), and for it to be used by clients to upload customization code into the MultiSpace (which has not been implemented). However, code mobility turned out to be useful inside the MultiSpace as a mechanism for structuring services, and distributing service components across the cluster.

Code mobility solved a software distribution problem in the Jukebox, without us realizing that software distribution might become a problem. When we updated the ripper service, we needed to distribute the new functionality to all of the nodes in the cluster that would potentially have CDs inserted into them. Code mobility also partially solved the service location problem in the Jukebox: the ripper services depend on the CDDB service to gather detailed track and album information, but the ripper has no easy way to know where in the cluster the CDDB service is running. Using code mobility to push the CDDB service onto the same node as the ripper, we enforced the invariant that the CDDB service is colocated with the ripper.

5.2 Bases are a Simplifying Principle, but not a Complete Solution

The principle of solving complex service problems in a Base makes it easier to reason about the interactions between services and clients, and to ensure that difficult tasks like state management are dealt with in an environment in which there is a chance of success. However, this organizational principle alone is not enough to solve the problem of constructing highly available services. Given the controlled environment of a Base, service authors must still construct services to ensure consistency and availability. We believe that a Base can provide primitives that further simplify service authors' jobs; the Redirector Stub is an example of such a primitive.

While building the Ninja Jukebox and Keiretsu, we made observations about how we achieved availability and consistency. Most of the code in these services dealt with distributing and maintaining tables of shared state. In the Ninja Jukebox, this state was the list of available music. In Keiretsu, this state was the list of other Keiretsu nodes, and the tables of client stub bindings. The distributed hash table was not yet complete when these two services were being implemented. If we had relied on it instead of our ad-hoc peer state exchange mechanisms, much of the services' source code would have been eliminated.

In both of these services, work queues are not explicitly exposed to service authors. These queues are hidden inside the thread scheduler, since NinjaRMI spawns a thread per connected client. This design decision had repercussions on service structure: each service had to be written to handle multithreading, since service authors must handle consistency within a single service instance as well as across instances throughout the cluster. Providing a mechanism to expose work queues to service authors may simplify the structure of some services, for example if services serialize requests to avoid issues associated with multithreading.
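As a sketch of the request-serialization idea, the following shows an explicit work queue drained by a single worker thread, so service logic never runs concurrently. The class name and Runnable payload are illustrative, and the java.util.concurrent types postdate the JDK 1.1 environment of the paper.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class SerializedService {
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();

    SerializedService() {
        Thread worker = new Thread(() -> {
            try {
                while (true) queue.take().run();  // one request at a time
            } catch (InterruptedException e) { /* shutting down */ }
        });
        worker.setDaemon(true);
        worker.start();
    }

    // RMI handler threads enqueue work here instead of touching service
    // state directly, so per-instance consistency comes for free.
    void submit(Runnable request) { queue.add(request); }
}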

In the current MultiSpace, service instances must explicitly keep track of their counterparts on other nodes and spawn new services when load or availability demands it. A useful primitive would be to allow authors to specify conditions that dictate when service instances are spawned or pushed to a specific node, and to allow the MultiSpace infrastructure to handle the triggering of these conditions.

5.3 Java as an Internet-service construction environment

Java has proven to be an invaluable tool in the development of the Ninja infrastructure. The ability to rapidly deploy cross-platform code components simply by assuming the existence of a Java Virtual Machine made it easy to construct complex distributed services without concerning oneself with the heterogeneity of the systems involved. The use of RMI as a strongly-typed RPC, tied very closely to the Java language semantics, makes distributed programming comparably simple to single node development. The protection, modularization, and safety guarantees provided by the Java runtime environment make dynamic dissemination of code components a natural activity. Similarly, the use of Java class reflection to generate new code wrappers for existing components (as with Redirector Stubs) provides automatic indirection at the object level.

Java has a number of drawbacks in its current form, however. Performance is always an issue, and work on just-in-time [14, 24] and ahead-of-time [27] compilation is addressing many of these problems. The widely-used JVM from Sun Microsystems exhibits a large memory footprint (we have observed 3-4 MB for "Hello World", and up to 30 MB for a relatively simple application that performs many memory allocations and deallocations⁷), and crossing the boundary from Java to native code remains an expensive operation. In addition, the Java threading model permits threads to be non-preemptive, which has serious implications for components which must run in a protected environment. Our approach has been to use only Java Virtual Machines which employ preemptive threads.

We have started an effort to improve the performance of the Java runtime environment. Our initial prototype, called Jaguar, permits direct Java access to hardware resources through the use of a modified just-in-time compiler. Rather than going through the relatively expensive Java Native Interface for access to devices, Jaguar generates machine code for direct hardware access which is inlined with compiled Java bytecodes. We have implemented a Jaguar interface to a VIA⁸-enabled fast system area network, obtaining performance equivalent to VIA access from C (80 microseconds round-trip time for small messages and over 400 megabits/second peak bandwidth). We believe that this approach is a viable way to tailor the Java environment for high-performance use in a clustered environment.

6 Related Work

Directly related to the Base architecture is the TACC [11] platform, which provides a cluster-based environment for scalable Internet services. In TACC, service components (or "workers") can be written in a number of languages and are controlled by a front-end machine which dispatches incoming requests to back-end cluster machines, incorporating load-balancing and restart in the case of node failure. TACC workers may be chained across the cluster for composable tasks. TACC was designed to support Internet services which perform data transformation and aggregation tasks. Base services can additionally implement long-lived and persistent services; the result is that the Ninja approach addresses a wider set of potential applications and system-support issues. Furthermore, Base services can be dynamically created and destroyed through the iSpace loader interface on each MultiSpace node; TACC did not have this functionality.

⁷ There are no explicit deallocations in Java; by "deallocation", we mean discarding all references to an object, thus enabling it to be garbage collected.

⁸ VIA is the Virtual Interface Architecture [26], which specifies an industry-standard architecture for high-bandwidth, low-latency communication within clusters.

Sun's JINI [31] architecture is similar to the Base architecture in that it proposes to develop a Java-based lingua franca for binding users, devices, and services together in an intelligent, programmable Internet-wide infrastructure. JINI's use of RMI as the basic communication substrate, and its use of code mobility for distributing service and device interfaces, has a great deal in common with our approach. However, we believe that we are addressing problems which JINI does not directly solve: providing a hardware and software environment supporting scalable, fault-tolerant services is not within the JINI problem domain, nor is the use of dynamically-generated code components to act as interfaces into services. Conversely, JINI has touched on issues such as service discovery and naming which have not yet been dealt with by the Base architecture; likewise, JINI's use of JavaSpaces and "leases" as a persistent storage model may interact well with the Base service model.

ANTS [30, 29] is a system that enables the dynamic deployment of mobile code which implements network protocols within Active Routers. Coded in Java, and utilizing techniques similar to those in the iSpace environment, ANTS has a set of goals similar to those of the Base architecture. However, ANTS uses mobile code for processing each packet passing through a router, whereas Base service components are executed on the granularity of an RMI call. Liquid Software [19] and Joust [15] are similar to ANTS in that they propose an environment which uses mobile code to customize nodes in the network for communications-oriented tasks. These systems focused on adapting systems at the level of network protocol code, while the Base architecture uses code mobility for distribution of service components both internally to and externally from a Base.

SanFrancisco [7] is a platform for building distributed, object-oriented business applications. Its primary goal is to simplify the development of these applications by providing developers with a set of industrial-strength "Foundation" objects which implement common functionality. As such, SanFrancisco is very similar to Sun's Enterprise Java Beans in that it provides a framework for constructing applications using reusable components, with SanFrancisco providing a number of generic components to start with. MultiSpace addresses a different set of goals than Enterprise Java Beans and SanFrancisco in that it defines a flexible runtime environment for services, and MultiSpace intends to provide scalability and fault-tolerance by leveraging the flexibility of a component architecture. MultiSpace services could be built using the EJB or SanFrancisco model (extended to expose the MultiSpace functionality), but these issues appear to be orthogonal.

The Distributed Computing Environment (DCE) [20] is a software suite that provides a middleware platform operating across many operating systems and environments. DCE abstracts away many OS and network services (such as threads, security, a directory service, and RPC) and therefore allows programmers to implement DCE middleware independent of the vagaries of particular operating systems. DCE is rich and robust, but notoriously heavyweight, and its focus is on providing interoperable, wide-area middleware. MultiSpace is far less mature, but focuses instead on providing a platform for rapidly adaptable services that are housed within a single administrative domain (the Base).

7 Conclusions

In this paper, we presented an architecture for a Base, a clustered hardware and software platform for building and executing flexible and adaptable infrastructure services. The Base architecture was designed to adhere to three organizing principles: (1) solve the challenging service availability and scalability problems in carefully controlled environments, allowing service authors to make many assumptions that would not otherwise be valid in an uncontrolled or wide-area environment; (2) gain service flexibility by decomposing Bases into a number of receptive execution environments; and (3) introduce a level of indirection between the clients and services through the use of dynamic code generation techniques.

We built a prototype implementation (MultiSpace) of the Base architecture using Java as a foundation. This implementation took advantage of code mobility and dynamic compilation techniques to help in the structuring and deployment of services inside the cluster. The MultiSpace abstracts load balancing, failover, and service instance detection and naming away from service authors. Using the MultiSpace platform, we implemented two novel services: the Ninja Jukebox and Keiretsu. The Ninja Jukebox implementation demonstrated that code mobility is valuable inside a cluster environment, as it permits rapid evolution of services and run-time binding of service components to available resources in the cluster. The Keiretsu application demonstrated that our MultiSpace layer successfully reduced the complexity of building new services: the core Keiretsu service functionality was implemented in less than a page of code, yet the application was demonstrably fault-tolerant. We also demonstrated that it was the Keiretsu service code, rather than any inherent limitation in the MultiSpace layer, that limited the scalability of this service, although we hypothesized that our use of multicast beacons would ultimately limit the scalability of the current MultiSpace implementation.

8 Acknowledgements

The authors thank Ian Goldberg and David Wagner for their contributions to iSpace, NinjaRMI, and the Jukebox application. We would also like to thank all of the graduate and undergraduate students in the Ninja project for agreeing to be guinea pigs for our platform, and Brian Shiratsuki and Keith Sklower for their help in the system administration of the MultiSpace machines and networks. Finally, we would like to thank our anonymous reviewers and our shepherd, Charles Antonelli, for their insight and suggestions for improvement.

References

[1] Elan Amir, Steven McCanne, and Randy Katz. An Active Service Framework and its Application to Real-Time Multimedia Transcoding. In Proceedings of ACM SIGCOMM '98, volume 28, pages 178–189, October 1998.

[2] B. Anderson and D. Shasha. Persistent Linda: Linda + Transactions + Query Processing. In Springer-Verlag Lecture Notes in Computer Science 574, Mont-Saint-Michel, France, June 1991.

[3] Thomas E. Anderson, David E. Culler, and David Patterson. A Case for NOW (Networks of Workstations). IEEE Micro, 12(1):54–64, February 1995.

[4] Ken Birman, Andre Schiper, and Pat Stephenson. Lightweight Causal and Atomic Group Multicast. ACM Transactions on Computer Systems, 9(3):272–314, 1991.

[5] Andrew D. Birrell and Bruce Jay Nelson. Implementing Remote Procedure Call. ACM Transactions on Computing Systems, 2(1):39–59, February 1984.

[6] T. Brisco. RFC 1764: DNS Support for Load Balancing, April 1995.

[7] IBM Corporation. IBM SanFrancisco product homepage. http://www.software.ibm.com/ad/sanfrancisco/.

[8] Inktomi Corporation. The Technology Behind HotBot. http://www.inktomi.com/whitepap.html, May 1996.

[9] Steven Czerwinski, Ben Y. Zhao, Todd Hodes, Anthony Joseph, and Randy Katz. An Architecture for a Secure Service Discovery Service. In Proceedings of MobiCom '99, Seattle, WA, August 1999. ACM.

[10] A. D. Birrell et al. Grapevine: An Exercise in Distributed Computing. Communications of the Association for Computing Machinery, 25(4):3–23, February 1984.

[11] Armando Fox, Steven D. Gribble, Yatin Chawathe, Eric A. Brewer, and Paul Gauthier. Cluster-Based Scalable Network Services. In Proceedings of the 16th ACM Symposium on Operating Systems Principles, St.-Malo, France, October 1997.

[12] Ian Goldberg, Steven D. Gribble, David Wagner, and Eric A. Brewer. The Ninja Jukebox. Submitted to the 2nd USENIX Symposium on Internet Technologies and Systems, Boulder, Colorado, USA, October 1999.

[13] Robert S. Gray. Agent Tcl: A Flexible and Secure Mobile-Agent System. In Proceedings of the Fourth Annual Usenix Tcl/Tk Workshop. USENIX Association, 1996.

[14] The Open Group. The Fajita Compiler Project. http://www.gr.opengroup.org/java/compiler/fajita/index-b.htm, 1998.

[15] J. Hartman, L. Peterson, A. Bavier, P. Bigot, P. Bridges, B. Montz, R. Piltz, T. Proebsting, and O. Spatscheck. Joust: A Platform for Liquid Software. In IEEE Network (Special Edition on Active and Programmable Networks), July 1998.

[16] Chris Hawblitzel, Chi-Chao Chang, Grzegorz Czajkowski, Deyu Hu, and Thorsten von Eicken. Implementing Multiple Protection Domains in Java. In Proceedings of the 1998 Usenix Annual Technical Conference, June 1998.

[17] Open Group Research Institute. Scalable Highly Available Web Server Project (SHAWS). http://www.osf.org/RI/PubProjPgs/SFTWWW.htm.

[18] Dag Johansen, Robbert van Renesse, and Fred B. Schneider. Operating System Support for Mobile Agents. In Proceedings of the 5th IEEE Workshop on Hot Topics in Operating Systems, 1995.

[19] John Hartman, Udi Manber, Larry Peterson, and Todd Proebsting. Liquid Software: A New Paradigm for Networked Systems. Technical report, Department of Computer Science, University of Arizona, June 1996.

[20] Brad Curtis Johnson. A Distributed Computing Environment Framework: an OSF Perspective. Technical Report DEV-DCE-TP6-1, The Open Group, June 1991.

[21] Eric Jul, Henry M. Levy, Norman C. Hutchinson, and Andrew P. Black. Fine-Grained Mobility in the Emerald System. ACM Transactions on Computing Systems, 6(1):109–133, 1988.

[22] Ti Kan and Steve Scherf. CDDB Specification. http://www.cddb.com/ftp/cddb-docs/cddb_howto.gz.

[23] Sun Microsystems. Java Remote Method Invocation: Distributed Computing for Java. http://java.sun.com/.

[24] Sun Microsystems. The Solaris JIT Compiler. http://www.sun.com/solaris/jit, 1998.

[25] The Object Management Group (OMG). The Common Object Request Broker: Architecture and Specification, February 1998. http://www.omg.org/library/c2indx.html.

[26] Virtual Interface Architecture Organization. Virtual Interface Architecture Specification version 1.0, December 1997. http://www.viarch.org.

[27] Todd A. Proebsting, Gregg Townsend, Patrick Bridges, John H. Hartman, Tim Newsham, and Scott A. Watterson. Toba: Java for Applications – A Way Ahead of Time (WAT) Compiler. In Proceedings of the Third USENIX Conference on Object-Oriented Technologies (COOTS), Portland, Oregon, USA, June 1997.

[28] Joseph Tardo and Luis Valente. Mobile Agent Security and Telescript. In Proceedings of the 41st International Conference of the IEEE Computer Society (CompCon '96), February 1996.

[29] David L. Tennenhouse, Jonathan M. Smith, W. David Sincoskie, David J. Wetherall, and Gary J. Minden. A Survey of Active Network Research. Active Networks home page (MIT Telemedia, Networks and Systems group), 1996.

[30] David L. Tennenhouse and David J. Wetherall. Towards an Active Network Architecture. In ACM SIGCOMM '96 (Computer Communications Review). ACM, 1996.

[31] Jim Waldo. Jini Architecture Overview. Available at http://java.sun.com/products/jini/whitepapers.

[32] James E. White. Telescript Technology: The Foundation for the Electronic Marketplace, 1991. http://www.generalmagic.com.

[33] P. Wyckoff, S. W. McLaughry, T. J. Lehman, and D. A. Ford. TSpaces. IBM Systems Journal, 37(3), April 1998.

[34] C. Yoshikawa, B. Chun, P. Eastham, A. Vahdat, T. Anderson, and D. Culler. Using Smart Clients to Build Scalable Services. In Proceedings of the Winter 1997 USENIX Technical Conference, January 1997.

Page 16: USENIX | The Advanced Computing Systems …...t for building and executing highly a v ailable, scalable, but exible and adaptable infras-tructure services. Our arc hitecture has three

The MultiSpace: an Evolutionary Platform

for Infrastructural Services

Steven D. Gribble, Matt Welsh, Eric A. Brewer, and David Culler

The University of California at Berkeley

fgribble,mdw,brewer,[email protected]

Abstract

This paper presents the architecture for a Base, a

clustered environment for building and executing highly

available, scalable, but exible and adaptable infras-

tructure services. Our architecture has three organizing

principles: addressing all of the diÆcult service fault-

tolerance, availability, and consistency problems in a

carefully controlled environment, building that environ-

ment out of a collection of execution environments that

are receptive to mobile code, and using dynamically gen-

erated code to introduce run-time-generated levels of in-

direction separating clients from services. We present

a prototype Java implementation of a Base called the

MultiSpace, and talk about two applications written on

this prototype: the Ninja Jukebox (a cluster based mu-

sic warehouse), and Keiretsu (an instant messaging ser-

vice that supports heterogeneous clients). We show that

the MultiSpace implementation successfully reduces the

complexity of implementing services, and that the plat-

form is conducive to rapid service evolution.

1 Introduction

The performance and utility of a personal

computer will be defined less by faster Intel

processors and new Microsoft software and

increasingly by Internet services and software.

cjnet news article excerpt, 11/25/98

Once a disorganized collection of data reposi-tories and web pages, the Internet has become alandscape populated with rich, industrial-strengthapplications. Many businesses and organizationshave counterparts on the web: banks, restaurants,stock trading services, communities, and even gov-ernments and countries. These applications possesssimilar properties to traditional utilities such as thetelephone network or power grid: they support largeand potentially rapidly growing populations, they

must be available 24x7, and they must abstract com-plex engineering behind simple interfaces. We be-lieve that the Internet is evolving towards a service-oriented infrastructure, in which these high qualityutility-like applications will be commonplace. Un-like traditional utilities, Internet services tend torapidly evolve, are typically customizable by theend-user, and may even be composable.

Although today's Internet services are mature,the process of erecting and modifying services isquite immature. Most authors of complex, new ser-vices are forced to engineer substantial amounts ofcustom, service-speci�c code, largely because of thediversity in the requirements of each service|it isdiÆcult to conceive of a general-purpose, reusable,shrink-wrapped, adequately customizable and ex-tensible service construction product.

Faced with a seemingly inevitable engineeringtask, authors tend to adopt one of two strategiesfor adding new services to the Internet landscape:

In exible, highly tuned, hand-constructedservices: by far, this is the most dominant serviceconstruction strategy found on the Internet. Here,service authors carefully design a system targetedtowards a speci�c application and feature set, op-erating system, and hardware platform. Examplesof such systems are large, carrier-class web searchengines, portals, and application-speci�c web sitessuch as news, stock trading, and shopping sites. Therationale for this approach is sound: it leads to ro-bust and high-performance services. However, thesoftware architectures of these systems are too re-strictive; they result in a �xed service that performsa single, rigid function. The large amount of care-fully crafted and hand-tuned code means that theseservices are diÆcult to evolve; consider, for example,how hard it would be to radically change the behav-ior of a popular search engine service, or to movethe service into a new environment|these sorts ofmodi�cations would take massive engineering e�ort.

\Emergent services" in a world of dis-tributed objects: this strategy is just beginning

Page 17: USENIX | The Advanced Computing Systems …...t for building and executing highly a v ailable, scalable, but exible and adaptable infras-tructure services. Our arc hitecture has three

to become popularized with architectures such asSun's JINI [31] and the ongoing CORBA e�ort [25].In this world, instead of erecting complex, in exibleservices, large numbers of components or objectsare made available over the wide area, and servicesemerge through the composition of many such com-ponents. This approach has the bene�t that addingto the Internet landscape is a much simpler task,since the granularity of contributed components ismuch smaller. Because of the explicit decomposi-tion of the world into much smaller pieces, it is alsosimpler to retask or extend services by dropping oneset of components and linking in others.

There are signi�cant disadvantages to this ap-proach. As a side-e�ect of the more evolutionarynature of services, it is diÆcult to manage the stateof the system, as state may be arbitrarily replicatedand distributed across the wide area. Wide-areanetwork partitions are commonplace, meaning thatit is nearly impossible to provide consistency guar-antees while maintaining a reasonable amount ofsystem availability. Furthermore, although it is pos-sible to make incremental, localized changes to thesystem, it is diÆcult to make large, global changesbecause the system components may span many ad-ministrative domains.

In this paper, we advocate a third approach.We argue that we can reap many of the bene�tsof the distributed objects approach while avoidingdiÆcult state management problems by encapsulat-ing services and service state in a carefully con-trolled environment called a Base. To the outsideworld, a Base provides the appearance and guaran-tees of a non-distributed, robust, highly-available,high-performance service. Within a Base, servicesaren't constructed out of brittle, restrictive softwarearchitectures, but instead are \grown" out multiple,smaller, reusable components distributed across aworkstation cluster [3]. These components may bereplicated across many nodes in the cluster for thepurposes of fault tolerance and high performance.The Base provides the glue that binds the compo-nents together, keeping the state of replicated ob-jects consistent, ensuring that all of the constituentcomponents are available, and distributing traÆcacross the components in the cluster as necessary.

The rest of this paper discusses the design princi-ples that we advocate for the architecture of a Base(section 2), and presents a preliminary Base imple-mentation called the Ninja MultiSpace1 (section 3)that uses techniques such as dynamic code genera-tion and code mobility as mechanisms for demon-

1The MultiSpace implementation is available with theNinja platform release - see http://ninja.cs.berkeley.edu.

service #2

service #1

clie

nt c

ode

node 1 node 2 node 3

node 5

node 8node 7node 6

node 4

client

Base

stub for

stub for

= service #2= service #1

Figure 1: Architecture of a Base: a Base is com-prised of a cluster of workstations connected by ahigh-speed network. Each node houses an executionenvironment into which code can be pushed. Ser-vices have many instances within the cluster, butclients are shielded from this by a service \stub"through which they interact with the cluster.

strating and evaluating our hypotheses of service exibility, rapid evolution, and robustness underchange. While we have begun preliminary explo-rations into the scalability and high availability as-pects of our prototype, that has not been the ex-plicit focus of this initial implementation, and in-stead remains the subject of future work. Two ex-ample services running on our prototype Base aredescribed in section 4. In section 5, we discuss someof the lessons we learned while building our proto-type. Section 6 presents related work, and in Section7 we draw conclusions.

2 Organizing Principles

In this section we present three design principlesthat guided our service architecture development(shown at a high level in �gure 1):

1. Solve the challenging service availability andscalability problems in carefully controlledenvironments (Bases),

2. Gain service exibility by decomposing Basesinto a number of receptive execution envi-ronments, and

3. Introduce a level of indirection between theclients and services through the use of dy-namic code generation techniques.

We now discuss each of these principles in turn.

Page 18: USENIX | The Advanced Computing Systems …...t for building and executing highly a v ailable, scalable, but exible and adaptable infras-tructure services. Our arc hitecture has three

2.1 Solve Challenging Problems in aBase

The high availability and scalability \utility" re-quirements that Internet services require are diÆ-cult to deliver; our �rst principle is an attempt tosimplify the problem of meeting them by carefullychoosing the environment in which we tackle theseissues. As in [11], we argue that clusters of worksta-tions provide the best platform on which to build In-ternet services. Clusters allow incremental scalabil-ity through the addition of extra nodes, high avail-ability through replication and failover, and cost-performance by using commodity building blocks asthe basis of the computing environment.

Clusters are the backbone of a Base. Physically,a Base must include everything necessary to keepa mission-critical cluster running: system adminis-trators, a physically secure machine room, redun-dant internal networks and external network feeds,UPS systems, and so on. Logically, a cluster-widesoftware layer provides data consistency, availabil-ity, and fault tolerance mechanisms.

The power of locating services inside a Base arisesfrom the assumptions that service authors can nowmake when designing their services. Communica-tion is fast and local, and network partitions areexceptionally rare. Individual nodes can be forcedto be as homogeneous as necessary, and if a nodedies, there will always be an identical replacementavailable. Storage is local, cheap, plentiful, and well-guarded. Finally, everything is under a single do-main, simplifying administration.

Nothing outside of the Base should try to dupli-cate the fault-tolerance or data consistency guaran-tees of the Base. For example, e-mail clients shouldnot attempt to keep local copies of mail messages ex-cept in the capacity of a cache; all messages are per-manently kept by an e-mail service in the Base. Be-cause services promise to be highly available, a usercan rely on being able to access her email throughit while she is network connected.

2.2 Receptive Execution Environments

Internet services are generally built from a com-plex assortment of resources, including heteroge-neous single-CPU and multiprocessor systems, diskarrays, and networks. In many cases these ser-vices are constructed by rigidly placing functional-ity on particular systems and statically partitioningresources and state. This approach represents theview that a service's design and implementation are\sancti�ed" and must be carefully planned and laid

out across the available hardware. In such a regime,there is little tolerance for failures which disrupt thebalance and structure of the service architecture.

To alleviate the problems associated with this ap-proach, the Base architecture employs the principleof receptive execution environments|systems whichcan be dynamically con�gured to host a componentof the service software. A collection of receptiveexecution environments can be constructed eitherfrom a set of homogeneous workstations or morediverse resources as required by the service. Thedistinguishing feature of a receptive execution envi-ronment, however, is that the service is \grown" ontop of a fertile platform; functionality is pushed into

each node as appropriate for the application. Eachnode in the Base can be remotely and dynamicallycon�gured by uploading service code components asneeded, allowing us to delay the decision about thedetails of a particular node's specialization as far aspossible into the service construction and mainte-nance lifecycle.2

As we will see in section 3, our approach has beento make a single assumption of homogeneity acrosssystems in a Base: a Java Virtual Machine is avail-able on each node. In doing so, we raise the bar ofservice construction by providing a common instruc-tion set across all nodes, uni�ed views on thread-ing models, underlying system APIs (such as socketand �lesystem access), as well as the usual strongtyping and safety features a�orded by the Java en-vironment. Because of these provisions, any servicecomponent can be pushed into any node in our Baseand be expected to execute, subject to local resourceconsiderations (such as whether a particular nodehas access to a disk array or a CD drive). Assumingthat every node is capable of receiving Java byte-codes, however, means that techniques generally ap-plied to mobile code systems [21, 13, 28, 32, 18] canbe employed internally to the Base: the adminis-trator can deploy service components by uploadingJava classes into nodes as needed, and the servicecan push itself towards resources redistributing codeamongst the participating nodes. Furthermore, be-cause in this environment we are restricting our useof code mobility to deploying local code within thescope of a single, trusted administrative domain,some of the security diÆculties of mobile code arereduced.

2We rely on two mechanisms for mobile code security:werestrict the use of mobile code inside the Base to code thatoriginates from trusted sources within the Base itself, and weuse the Java Security Manager mechanism to sandbox thismobile code. Our research goals, however, do not includesolving the mobile code security problem.

Page 19: USENIX | The Advanced Computing Systems …...t for building and executing highly a v ailable, scalable, but exible and adaptable infras-tructure services. Our arc hitecture has three

2.3 Dynamic Redirector Stub Genera-tion

One challenge for clustered servers is to present asingle service interface to the outside world, and tomask load-balancing and failover mechanisms in thecluster. The naive solution is to have a single front-end machine that clients �rst contact; the front-endthen dispatches these incoming requests to one ofseveral back-end machines. Failures can be hiddenthrough the selection of another back-end machine,and load-balancing can be directly controlled by thefront-end's dispatch algorithm. Unfortunately, thefront-end can become a performance bottleneck anda single point of failure [11, 8]. One solution to theseproblems is to use multiple front-end machines, butthis introduces new problems of naming (how clientsdetermine which front-end to use) and consistency(whether the front-ends mutually agree on the back-end state of the cluster). The naming problem canbe addressed in a number of ways, such as round-robin DNS [6], static assignment of front-ends toclients, or \lightweight" redirection in the style ofscalable Web servers [17]. The consistency problemcan be solved through one of many distributed sys-tems techniques [4] or ignored if consistent state isunimportant to the front-end nodes.

The Base architecture takes another approach tocluster access indirection: the use of dynamically-

generated Redirector Stubs. A stub is client-side code which provides access to a service; acommon example is the stub code generated forCORBA/IIOP [25] and Java Remote Method In-vocation (RMI) [23] systems. The stub code runson the client and converts client requests for ser-vice functionality (such as Java method calls) intonetwork messages, marshalling request parametersand unmarshalling results. In the case of Java RMI,clients download stubs on-demand from the server.

Base services employ a similar technique to RPCstub generation except that the Redirector Stubfor a service is dynamically generated at run-timeand contains embedded logic to select from a set ofnodes within the cluster (�gure 2). Load balancingis implemented within this \Redirector Stub", andfailover is accomplished by reissuing failed or timed-out service calls to an alternative back-end machine.The redirection logic and information about thestate of the Base is built up by the Base and ad-vertised to clients periodically; clients obtain theRedirector Stubs from a registry. This methodol-ogy has a number of signi�cant implications aboutthe nature of services, namely that they must beidempotent and maintain self-consistency across a

����

�������

���

���

���

�������������

�������������

3

2

1

policyClientfrom

Serverto

Figure 2: A \Redirector Stub": embedded insidea Redirectory Stub are several RPC stubs with thesame interface, each of which communicates with adi�erent service instance inside a Base.

service's instances on di�erent nodes in the Base. Insection 3.3.1, we will discuss an implementation ofa cluster-wide distributed data structure that sim-pli�es the task of satisfying these implications.

Client applications can be coded without knowl-

edge of the Redirector Stub logic; by moving failoverand load-balancing functionality to the client, theuse of front-end machines can be avoided altogether.This is similar to the notion of smart clients [34], butwith the intelligence being injected into the client atrun-time instead of being compiled in.

3 Implementation

Our prototype Base implementation (written inJava) is called the MultiSpace. It serves to demon-strate the e�ectiveness of our architecture in termsof facilitating the construction of exible services,and to allow us to begin explorations into the issuesof our platform's scalability. The MultiSpace im-plementation has three layers: the bottom layer isa set of communications primitives (NinjaRMI); themiddle layer is a single-node execution environment(the iSpace); and the top layer is a set of multiplenode abstractions (the MultiSpace layer). We de-scribe each of these in turn, from the bottom up.

3.1 NinjaRMI

A rich set of high performance communicationsprimitives is a necessary component of any clusteredenvironment. We chose to make heavy use of Java'sRemote Method Invocation (RMI) facilities for per-forming RPC-like [5] calls across nodes in the clus-ter, and between clients and services. When a callerinvokes an RMI method, stub code intercepts theinvocation, marshalls arguments, and sends themto a remote \skeleton" method handler for unmar-shalling and execution. Using RMI as the �nestgranularity communication data unit in our clus-tered environment has many useful properties. Be-

Page 20: USENIX | The Advanced Computing Systems …...t for building and executing highly a v ailable, scalable, but exible and adaptable infras-tructure services. Our arc hitecture has three

cause method invocations have completely encap-sulated, atomic semantics, retransmissions or com-munication failures are easy to reason about|theycorrespond to either successful or failed method in-vocations, rather than partial data transmissions.

However, from the point of view of clients, if aremote method invocation does not successfully re-turn, it can be impossible for the client to knowwhether or not the method was successfully invokedon the server. The client has two choices: it canreinvoke the method call (and risk calling the samemethod twice), or it can assume that the methodwas not invoked, risking that the method was in factinvoked, successfully or unsuccessfully with an ex-ception, but results were not returned to the client.Currently, on failure our Redirector Stubs will retryusing a di�erent, randomly chosen service stub; inthe case of many successive failures, the Redirec-tor Stub will return an exception to the caller. Itis because of the at-least-once semantics implied bythese client-side reinvocations that we must requireservices to be idempotent. The exploration of dif-ferent retry policies inside the Redirector Stubs isan area of future research.

3.1.1 NinjaRMI enhancements to Sun'sRMI

NinjaRMI is a ground-up reimplementation of Sun'sJava Remote Method Invocation for use by com-ponents within the Ninja system. NinjaRMI wasdesigned to permit maximum exibility in imple-mentation options. NinjaRMI provides three in-teresting transport-level RMI enhancements. First,it provides a unicast, UDP-based RMI that allowsclients to call methods with \best-e�ort" seman-tics. If the UDP packet containing the marshalledarguments successfully arrives at the service, themethod is invoked; if not, no retransmissions areattempted. Because of this, we enforce the require-ment that such methods do not have any returnvalues. This transport is useful for beacons, logentries, or other such side-e�ect oriented uses thatdo not require reliability. Our second enhancementis a multicast version of this unreliable transport.RMI services can associate themselves with a multi-cast group, and RMI calls into that multicast groupresult in method invocations on all listening ser-vices. Our third enhancement is to provide very exible certi�cate-based authentication and encryp-tion support for reliable, unicast RMI. Endpoints inan RMI session can associate themselves with dig-ital certi�cates issued by a certi�cation authority.When the TCP-connection underlying an RMI ses-

sion is established, these certi�cates are exchangedby the RMI layer and veri�ed at each endpoint. Ifthe certi�cate veri�cation succeeds, the remainderof the communication over that TCP connection isencrypted using a Triple DES session key obtainedfrom a DiÆe-Hellman key exchange. These securityenhancements are described in [12].

Packaged along with our NinjaRMI implementa-tion is an interface compiler which, when given anobject that exports an RMI interface, generates theclient-side stub and server-side skeleton stubs forthat object. All source code for stubs and skele-tons can be generated dynamically at run-time, al-lowing the Ninja system to leverage the use of anintelligent code-generation step when constructingwrappers for service components. We use this to in-troduce the level of indirection needed to implementRedirector Stubs|stubs can be overloaded to causemethod invocations to occur on many remote nodes,or for method invocations to fail over to auxiliarynodes in the case of a primary node's failure.

3.1.2 Measurements of NinjaRMI

Method Local Sun Ninjamethod RMI RMI

f(void) 0.19 �s 0.83 ms 0.82 ms

f(int) 0.20 �s 0.84 ms 0.85 ms

int f(int) 0.18 �s 0.85 ms 0.84 ms

int f(

int,int,int,int) 0.22 �s 0.88 ms 0.86 ms

f(byte[100]) 0.19 �s 1.06 ms 1.05 ms

f(byte[1000]) 0.20 �s 1.20 ms 1.09 ms

f(byte[10000]) 0.21 �s 2.21 ms 2.23 ms

byte[100] f(int) 0.19 �s 1.07 ms 1.00 ms

byte[1000] f(int) 0.20 �s 1.17 ms 1.01 ms

byte[10000] f(int) 0.20 �s 2.20 ms 2.10 ms

Table 1: NinjaRMI microbenchmarks: Thesebenchmarks were gathered on two 400Mhz PentiumII based machines, each with 128MB of physicalmemory, connected by a switched 100 Mb/s Eth-ernet, and using Sun's JDK 1.1.6v2 with the TYAjust-in-time compiler on Linux 2.0.36. For the sakeof comparison, UDP round-trip times between twoC programs were measured at 0.185 ms, and be-tween two Java programs at 0.316 ms.

As shown in table 1, NinjaRMI performs as wellas or better than Sun's Java RMI package. Giventhat a null RMI invocation cost 0.82 ms and thata round-trip across the network and through theJVMs cost 0.316 ms, we conclude that the di�er-ence (roughly 0.5 ms) is RMI marshalling and pro-

Page 21: USENIX | The Advanced Computing Systems …...t for building and executing highly a v ailable, scalable, but exible and adaptable infras-tructure services. Our arc hitecture has three

tocol overhead. Pro�ling the code shows that themain contributor to this overhead is object seri-alization, speci�cally the use of methods such asjava.io.ObjectInputStream.read().

3.2 iSpace

A Base consists of a number of workstations, eachrunning a suitable receptive execution environmentfor single-node service components. In our proto-type, this receptive execution environment is theiSpace: a Java Virtual Machine (JVM) that runsa component loading service into which Java classescan be pushed as needed. The iSpace is responsiblefor managing component resources, naming, protec-tion, and security. The iSpace exports the compo-nent loader interface via NinjaRMI; this interfaceallows a remote client to obtain a list of compo-nents running on the iSpace, obtain an RMI stub toa particular component, and upload a new servicecomponent or kill a component already running onthe iSpace (subject to authentication). Service com-ponents running on the iSpace are protected fromone another and from the surrounding execution en-vironment in three ways:

1. each component is coded as a Java class whichprovides protection from hard crashes (such asnull pointer dereferences),

2. components are separated into thread groups;this limits the interaction one component canhave with threads of another, and

3. all components are subject to the iSpace Se-curity Manager, which traps certain Java APIcalls and determines whether the componenthas the credentials to perform the operation inquestion, such as �le or network access.

Other assumptions must be made in order tomake this approach viable. In essence, we are re-lying on the JVM to behave and perform like aminiature operating system, even though it was notdesigned as such. For example, the Java Virtual Ma-chine does not provide adequate protection betweenthreads of multiple components running within thesame JVM: one component could, for example, con-sume the entire CPU by running inde�nitely withina non-blocking thread. Here, we must assume thatthe JVM employs a preemptive thread scheduler(as is true in the Sun's Solaris Java environment)and that fairness can be guaranteed through its use.Likewise, the iSpace Security Manager must utilizea strategy for resource management which ensures

both fairness and safety. In this respect iSpace hassimilar goals to other systems which provide multi-ple protection domains within a single JVM, such asthe JKernel [16]. However, our approach does notnecessitate a re-engineering of the Java runtime li-braries, particularly because intra-JVM thread com-munication is not a high-priority feature.

3.3 MultiSpace

The highest layer in our implementation is theMultiSpace layer, which tethers together a collec-tion of iSpaces (�gure 3). A primary function ofthis layer is to provide each iSpace with a repli-cated registry of all service instances running in thecluster.

A MultiSpace service inherits from an abstractMultiSpaceService class, whose constructor registerseach service instance with a \MultiSpaceLoader"running on the local iSpace. All MultiSpaceLoad-ers in a cluster cooperate to maintain the replicatedregistry; each one periodically sends out a multicastbeacon3 that carries its list of local services to allother nodes in the MultiSpace, and also listens formulticast messages from other MultiSpace nodes.Each MultiSpaceLoader builds an independent ver-sion of the registry from these beacons. The reg-istries are thus soft-state, similar in nature to thecluster state maintained in [11] and [1]|if an iS-pace node goes down and comes back up, its Mul-tiSpaceLoader simply has to listen to the multicastchannel to rebuild its state. Registries on di�erentnodes may see temporary periods of inconsistencyas services are pushed into the MultiSpace or movedacross nodes in the MultiSpace, but in steady state,all nodes asymptotically approach consistency. Thisconsistency model is similar in nature to that ofGrapevine [10].

The multicast beacons also carry RMI stubs foreach local service component, which implies thatany service instance running on any node in theMultiSpace can identify and contact any other ser-vice in the MultiSpace. It also means that the Mul-tiSpaceLoader on every node has enough informa-tion to construct Redirector Stubs for all of the ser-vices in the MultiSpace, and advertise those Redi-rector Stubs to o�-cluster clients through a \servicediscovery service"[9].4 Each RMI stub embedded in

3Currently, IP multicast is used for this purpose | themulticast channel that beacons are sent over thus de�nes thelogical scope and boundary of an individual MultiSpace. Weintend to replace this transport with multicast NinjaRMI.

4The service discovery service (or SDS) implementationconsists of an XML search engine that allows client programsto locate services based on arbitrary XML predicates.

Page 22: USENIX | The Advanced Computing Systems …...t for building and executing highly a v ailable, scalable, but exible and adaptable infras-tructure services. Our arc hitecture has three

...svc 2

svc N

multicast

beaconSystem

Area

Netw

ork

iSpace Loader

MultiSpaceLoader

Service 1

Service 2

Service N

Security Mgr.

Java Virtual M

achineiSpace node M

iSpace Loader

MultiSpaceLoader

Service 1

Service 2

Service N

Security Mgr.

Java Virtual M

achine

svc 1iSpace node 1

Figure

3:

The

MultiS

paceim

plementatio

n:

MultiS

pace

services

are

insta

ntia

tedontopofa

sandbox

(theSecu

rityManager),

andruninsid

ethe

contex

tofaJava

virtu

almachine(JVM).

amultica

stbeaconis(based

onourobserva

tions)

roughly

500bytes

inavera

gelen

gth;there

isthere-

fore

animporta

nttra

deo�betw

eenthebeaconfre-

quency

(andtherefo

refresh

ness

ofinform

atio

nin

theMultiS

paceL

oaders)

andthenumber

ofserv

iceswhosestu

bsarebein

gbeaconed

thatwillultim

ately

a�ect

thesca

lability

oftheMultiS

pace.

Serv

iceinsta

nces

canelect

toreceiv

emultica

stbeaconsfro

mother

MultiS

pace

nodes;

aserv

icecan

use

this

mech

anism

tobeco

meaw

are

ofits

peers

runningelsew

here

intheclu

ster.Ifaserv

iceover-

rides

thesta

ndard

beaconcla

ss,itcanaugmentits

beaconswith

additio

nalinform

atio

n,such

asthe

loadthatitiscurren

tlyexperien

cing.Serv

icesthat

wantto

calloutto

other

services

could

thusmake

coarse

grained

loadbalancin

gdecisio

nswith

outre-

quirin

gacen

tralized

loadmanager.

3.3.1

Built-In

MultiS

paceServices

Inclu

ded

with

theMultiS

pace

implem

entatio

nare

twoserv

icesthatenrich

thefunctio

nality

available

toother

MultiS

pace

services:

thedistrib

uted

hash

table,

andtheuptim

emonito

ringserv

ice.

Distrib

utedhash

table:asmentio

ned

insec-

tion2.3,dueto

theRedirecto

rStubmech

anism

MultiS

pace

service

insta

nces

must

maintain

self-consisten

cyacro

ssthenodesoftheclu

ster.Tomake

thistask

simpler,

wehaveprov

ided

service

authors

with

adistrib

uted

,rep

licated

,fault-to

leranthash

table

thatisimplem

ented

inC

forthesakeofef-

�cien

cy.Thehash

table

isdesig

ned

topresen

ta

consisten

tview

ofdata

acro

ssallnodes

intheclu

s-ter,

andassuch,serv

icesmay

useitto

rendezv

ousin

asty

lesim

ilarto

Linda[2]orIBM'sT-Spaces

[33].

Thecurren

timplem

entatio

nismodera

telyfast

(itcanhandlemore

than1000insertio

nsper

secondof

500byte

entries

ona4node100Mb/sMultiS

pace

cluster),

isfaulttolera

nt,

andtra

nsparen

tlymask

multip

lenodefailures.

How

ever,

itdoes

notyet

prov

idealloftheconsisten

cyguarantees

thatsome

services

would

ideally

prefer,

such

ason-lin

erecov

-ery

ofcra

shed

state

ortra

nsactio

nsacro

ssmultip

leopera

tions.

Thecurren

tlyimplem

entatio

nis,

how

-ever,

suita

ble

formanyIntern

et-style

services

for

which

thislev

elofconsisten

cyisnotessen

tial.

Uptim

emonito

ringservice:Even

with

the

service

beaconingmech

anism

,itis

diÆcult

tode-

tectthefailure

ofindividualnodes

intheclu

ster.Thisispartly

beca

use

ofthelack

ofaclea

rfailure

modelin

Java

:aJava

service

isjust

acollectio

nof

objects,

notnecessa

rilyeven

possessin

gathrea

dof

execu

tion.A

service

failure

may

just

imply

thata

setofobjects

hasentered

amutually

inconsisten

tsta

te.Theabsen

ceofbeaconsdoesn

'tnecessa

rilymeanthataserv

iceinsta

nce

failure

hasoccu

rred;

thebeaconsmay

belost

dueto

congestio

nin

the

intern

alnetw

ork,orthebeaconsmay

nothavebeen

genera

tedbeca

usetheserv

iceinsta

nce

isoverlo

aded

andisbusy

processin

gother

tasks.

Forthis

reason,wehaveprov

ided

an

uptim

emonito

ringabstra

ctionto

service

authors.

Ifaser-

vice

runningin

theMultiS

pace

implem

ents

awell-

know

nJava

interfa

ce,theinfra

structu

reautomat-

ically

detects

this

andbeginsperio

dica

llycallin

gthedoProbe()meth

odin

thatinterfa

ce.Byim-

plem

entin

gthis

meth

od,serv

iceauthors

promise

toperfo

rmanapplica

tion-lev

eltask

thatdem

on-

strates

thattheserv

iceisaccep

tingandsuccessfu

llyprocessin

greq

uests.

Byusin

gthisapplica

tion-lev

eluptim

echeck

,theinfra

structu

recanexplicitly

de-

tectwhen

aserv

iceinsta

nce

hasfailed

.Curren

tly,weonly

logthis

failure

inorder

togenera

teup-

timesta

tistics,andwerely

ontheRedirecto

rStub

failov

ermech

anism

sto

mask

these

failures.

4Applicatio

ns

Inthissectio

nofthepaper,

wediscu

sstwoappli-

catio

nsthatdem

onstra

tethevalidity

ofourguiding

prin

ciples

andtheeÆ

cacy

ofourMultiS

pace

imple-

mentatio

n.The�rst

applica

tion,theNinja

Juke-

box,abstra

ctsthemanyindependentcompact-d

iscplay

ersandlocal�lesy

stemsin

theBerk

eleyNet-

work

ofWorksta

tions(N

OW)clu

sterinto

asin

gle

poolofava

ilablemusic.

Theseco

nd,Keiretsu

,isa

three-tiered

applica

tionthatprov

ides

insta

ntmes-

sagingacro

sshetero

geneousdevices.

Page 23: USENIX | The Advanced Computing Systems …...t for building and executing highly a v ailable, scalable, but exible and adaptable infras-tructure services. Our arc hitecture has three

Figure 4: The Ninja Jukebox GUI: users are pre-sented with a single Jukebox interface, even thoughsongs in the Jukebox are scattered across multipleworkstations, and may be either MP3 �les on a local�lesystem, or audio CDs in CD-ROM drives.

4.1 The Ninja Jukebox

The original goal of the Ninja jukebox was toharness all of the audio CD players in the BerkeleyNOW (a 100+ node cluster of Sun UltraSparc work-stations) to provide a single, giant virtual musicjukebox to the Berkeley CS graduate students. Themost interesting features of the Ninja Jukebox arisefrom its implementation on top of iSpace: new nodescan be dynamically harnessed by pushing appropri-ate CD track \ripper" services onto them, and thefeatures of the Ninja Jukebox are simple to evolveand customize, as evidenced by the seamless trans-formation of the service to the batch conversion ofaudio CDs to MP3 format, and the authenticatedtransmission of these MP3s over the network.

The Ninja Jukebox service is decomposed intothree components: a master directory, a CD \rip-per" and indexer, and a gateway to the onlineCDDB service [22] that provides artist and tracktitle information given a CD serial number. Theability to push code around the cluster to growthe service proved to be exceptionally useful, sincewe didn't have to decide a priori which nodes inthe cluster would house CDs|we could dynami-cally push the ripper/indexer component towardsthe CDs as the CDs were inserted into nodes in thecluster. When a new CD is added to a node in theNOW cluster, the master directory service pushesan instance of the ripper service into the iSpace res-ident on that node. The ripper scans the CD todetermine what music is on it. It then contacts a lo-cal instance of the CDDB service to gather detailedinformation about the CD's artist and track titles;this information is put into a playlist which is pe-riodically sent to the master directory service. Themaster directory incorporates playlists from all of

the rippers running across the cluster into a single,global directory of music, and makes this directoryavailable over both RMI (for song browsing and se-lection) and HTTP (for simple audio streaming).

After the Ninja Jukebox had been running for awhile, our users expressed the desire to add MP3audio �les to the Jukebox. To add this new behav-ior, we only had to subclass the CD ripper/indexerto recognize MP3 audio �les on the local �lesystem,and create new Jukebox components that batch con-verted between audio CDs and MP3 �les. Also, toprotect the copyright of the music in the system weadded access control lists to the MP3 repositories,with the policy that users could only listen to mu-sic that they had added to the Jukebox.5 We thenbegan pushing this new subclass to nodes in the sys-tem, and our system evolved while it was running.

The performance of the Ninja Jukebox is com-pletely dominated by the overhead of authenticationand the network bandwidth consumed by stream-ing MP3 �les. The �rst factor (authentication over-head) is currently benchmarked at a crippling 10seconds per certi�cate exchange, entirely due to apure Java implementation of a public key cryptosys-tem. The second factor (network consumption) isnot quite as crippling, but still signi�cant: eachMP3 consumes at least 128 Kb/s, and since the MP3�les are streamed over HTTP, each transmission ischaracterized by a large burst as the MP3 is pushedover the network as quickly as possible. Both limita-tions can be remedied with signi�cant engineering,but this would be beyond the scope of our research.

4.2 Keiretsu: The Ninja Instant-Messaging Service

Keiretsu6 is a MultiSpace service that providesinstant messaging between heterogeneous devices:Web browsers, one- or two-way pagers, and PDAssuch as the Palm Pilot (see Figure 5). Users areable to view a list of other users connected to theKeiretsu service, and can send short text messagesto other users. The service component of Keiretsuexploits the MultiSpace features: Keiretsu serviceinstances use the soft-state registry of peer nodes inorder to exchange client routing information acrossthe cluster, and automatically generated RedirectorStubs are handed out to clients for use in commu-nicating with Keiretsu nodes.

5This ACL policy is enforced using the authentication ex-tensions to NinjaRMI described in 3.1.1.

6Keiretsu is a Japanese concept in which a group of re-lated companies work together for each other's mutual suc-cess.

Page 24: USENIX | The Advanced Computing Systems …...t for building and executing highly a v ailable, scalable, but exible and adaptable infras-tructure services. Our arc hitecture has three

Keiretsuservice

(3) Message sentto MultiSpace

(4) Message routed to destination Active Proxy

(2) Active Proxyconverts outgoingmsg to RMI (5) Active Proxy

converts messageto Pilot format

(6) ReceivingPilot displaysmessage

(1) User entersmessage on Pilot

MultiSpace

Figure 5: The Keiretsu Service

Keiretsu is a three-tired application: simpleclient devices (such as pagers or Palm Pilots) thatcannot run a JVM connect to an Active Proxy,which can be thought of as a simpli�ed iSpacenode meant to run soft-state mobile code. The Ac-tive Proxy converts simple text messages from de-vices into NinjaRMI calls into the Keiretsu Mul-tiSpace service. The Active Proxies are assumedto have enough sophistication to run Java-basedmobile code (the protocol conversion routines) andspeak NinjaRMI, while rudimentary client devicesneed only speak a simple text-based protocol.

As described in section 2.3, Redirector Stubsare used to access the back-end service componentswithin a MultiSpace by pushing load-balancing andfailover logic towards the client|in the case of sim-ple clients, Redirector Stubs execute in the ActiveProxy. For each protocol message received by anActive Proxy from a user device (such as \send mes-sage M to user U"), the Redirector Stub is invokedto call into the MultiSpace.

Because the Keiretsu proxy is itself a mobile Javacomponent that runs on an iSpace, the Keiretsuproxy service can be pushed into appropriate loca-tions on demand, making it easy to bootstrap suchan Active Proxy as needed. State management in-side the Active Proxy is much simpler than statemanagement inside a Base|the only state that Ac-tive Proxies maintain is the session state for con-nected clients. This session state is soft-state, andit does not need to be carefully guarded, as it can beregenerated given suÆcient intelligence in the Base,or by having users manually recover their sessions.

Rudimentary devices are not the only allowablemembers of a Keiretsu. More complex clients thatcan run a JVM speak directly to the Keiretsu, in-stead of going through an Active Proxy. An exampleof such a client is our e-mail agent, which attachesitself to the Keiretsu and acts as a gateway, relayingKeiretsu messages to users over Internet e-mail.

4.2.1 The Keiretsu MultiSpace service

public void identifySelf(

String clientName,

KeiretsuClientIF clientStub);

public void disconnectSelf(String clientName);

public void injectMessage(KeiretsuMessage msg);

public String[] getClientList();

Figure 6: The Keiretsu service API

The MultiSpace service that performs messagerouting is surprisingly simple. Figure 6 shows theAPI exported by the service to clients. Throughthe identifySelfmethod, a client periodically an-nounces its presence to the Keiretsu, and handsthe Keiretsu an RMI stub which the service willuse to send it messages. If a client stops call-ing this method, the Keiretsu assumes the clienthas disconnected; in this way, participation in theKeiretsu is treated as a lease. Alternately, a clientcan invalidate its binding immediately by callingthe disconnectSelfmethod. Messages are sent bycalling the injectMessage method, and clients canobtain a list of other connected clients by calling thegetClientList method.

Inside the Keiretsu, all nodes maintain a soft-state table of other nodes by listening to Multi-Space beacons, as discussed in section 3.3. Whena client connects to a Keiretsu node, that nodesends the client's RMI stub to all other nodes; allKeiretsu nodes maintain individual tables of theseclient bindings. This means that in steady state,each node can route messages to any client.

Because clients access the Keiretsu servicethrough Redirector Stubs, and because Keiretsunodes replicate service state, individual nodes in the

Page 25: USENIX | The Advanced Computing Systems …...t for building and executing highly a v ailable, scalable, but exible and adaptable infras-tructure services. Our arc hitecture has three

Keiretsu can fail and service will continue uninter-rupted, at the cost of capacity and perhaps per-formance. In an experiment on a 4-node cluster,we demonstrated that the service continued unin-terrupted even when 3 of the 4 nodes went down.The Keiretsu source code consists of 5 pages of Javacode; however, most of the code deals with manag-ing the soft-state tables of the other Keiretsu nodesin the cluster and the client RMI stub bindings. Theactual business of routing messages to clients con-sists of only half a page of Java code|the rest ofthe service functionality (namely, building and ad-vertising Redirector Stubs, tracking service imple-mentations across the cluster, and load balancingand failover across nodes) is hidden inside the Mul-tiSpace layer. We believe that the MultiSpace im-plementation is quite successful in shielding serviceauthors from a signi�cant amount of complexity.

4.2.2 Keiretsu Performance

We ran an experiment to measure the performanceand scalability of our MultiSpace implementationand the Keiretsu service. We used a cluster of 400MHz Pentium II machines, each with 128 MB ofphysical memory, connected by a 100 Mb/s switchedEthernet. We implemented two Keiretsu clients:the \speedometer", which open up a parameteriz-able number of identities in the Keiretsu and thenwaits to receive messages, and the \driver", whichgrabs a parameterizable number of Redirector Stubsto the Keiretsu, downloads a list of clients in theKeiretsu, and then blasts 75 byte messages to ran-domly selected clients as fast as it can.

We started our Keiretsu service on a single node,and incrementally grew the cluster to 4 nodes, mea-suring the maximum message throughput obtainedfor 10x, 50x, and 100x \speedometer" receivers,where x is the number of nodes in the cluster. Toachieve maximum throughput, we added incremen-tally more \driver" connections until message deliv-ery saturated. The drivers and speedometers werelocated on many dedicated machines, connected tothe Keiretsu cluster by the same 100 Mb/s switchedEthernet. Table 2 shows our results.

For a small number of receivers (10 per node), weobserved linear scaling in the message throughput.This is because each node in the Keiretsu is essen-tially independent: only a small amount of stateis shared (the client stubs for the 10 receivers pernode). In this case, the CPU was the bottleneck,likely due to Java overhead in message processingand argument marshalling and unmarshalling.

# Nodes   # Clients per node   Max. message throughput (msgs/s)
1         10                   246 ± 4
1         50                   200 ± 8
1         100                  195 ± 10
2         10                   420 ± 10
2         50                   300 ± 20
2         100                  260 ± 20
3         10                   490 ± 15
3         50                   370 ± 20
3         100                  160 ± 15
4         10                   570 ± 15
4         50                   210 ± 10
4         100                  120 ± 10

Table 2: Keiretsu performance: These benchmarks were run on 400 MHz Pentium II machines, each with 128 MB of physical memory, connected by a 100 Mb/s switched Ethernet, using Sun's JDK 1.1.6v2 with the TYA just-in-time compiler on Linux 2.2.1, and sending 75-byte Keiretsu messages.

For larger numbers of receivers, we observed a breakdown in scaling when the total number of receivers reached roughly 200 (i.e., 3-4 nodes at 50 receivers per node, or 2 nodes at 100 receivers per node). The CPU was still the bottleneck in these cases, but most of the CPU time was spent processing the client stubs exchanged between Keiretsu nodes, rather than processing the clients' messages. This is due to poor design of the Keiretsu service: we did not need to exchange client stubs as frequently as we did, and we should have simply exchanged timestamps for previously distributed stubs rather than repeatedly sending the same stub. This limitation could also be removed by modifying the Keiretsu service to inject client stubs into a distributed hash table, and to rely on service instances to pull the stubs out of the table as needed. However, a similar N² state exchange happens at the MultiSpace layer with the multicast exchange of service instance stubs; this could potentially become another scaling bottleneck for large clusters.
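The timestamp optimization suggested above is simple to realize: instead of re-sending stubs on every exchange, each node gossips lightweight (client, timestamp) summaries, and a peer fetches a full stub only when its own copy is missing or stale. All of the names in this sketch are hypothetical; it illustrates the idea, not the Keiretsu implementation.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative stub-exchange optimization: gossip (client, timestamp)
    // pairs and transfer a full stub only when a peer's copy is stale.
    public class StubDirectory {
        private final Map<String, Object> stubs = new HashMap<String, Object>();
        private final Map<String, Long> timestamps = new HashMap<String, Long>();

        // Record a (possibly updated) binding for a client.
        public synchronized void store(String client, Object stub, long ts) {
            stubs.put(client, stub);
            timestamps.put(client, ts);
        }

        // Cheap summary to gossip instead of the stubs themselves.
        public synchronized Map<String, Long> summary() {
            return new HashMap<String, Long>(timestamps);
        }

        // Given a peer's summary, list the clients whose stubs we lack
        // or hold in a stale version; only these need to be fetched.
        public synchronized List<String> staleOrMissing(Map<String, Long> peer) {
            List<String> wanted = new ArrayList<String>();
            for (Map.Entry<String, Long> e : peer.entrySet()) {
                Long mine = timestamps.get(e.getKey());
                if (mine == null || mine < e.getValue()) {
                    wanted.add(e.getKey());
                }
            }
            return wanted;
        }
    }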

5 Discussion and Future Work

Our MultiSpace and service implementation efforts have given us some insights into our original design principles, and into the use of Java as an Internet service construction language. In this section of the paper, we delve into some of these insights.


5.1 Code Mobility as a Service Construction Primitive

When we were designing the MultiSpace, we knew that code mobility would be a powerful tool. We originally intended to use code mobility for delivering code to clients (which we do in the form of Redirector Stubs), and for clients to upload customization code into the MultiSpace (which has not been implemented). However, code mobility also turned out to be useful inside the MultiSpace as a mechanism for structuring services and distributing service components across the cluster.

Code mobility solved a software distribution problem in the Jukebox, before we realized that software distribution might become a problem. When we updated the ripper service, we needed to distribute the new functionality to all of the nodes in the cluster that would potentially have CDs inserted into them. Code mobility also partially solved the service location problem in the Jukebox: the ripper services depend on the CDDB service to gather detailed track and album information, but the ripper has no easy way to know where in the cluster the CDDB service is running. By using code mobility to push the CDDB service onto the same node as the ripper, we enforced the invariant that the CDDB service is colocated with the ripper.

5.2 Bases are a Simplifying Principle, but not a Complete Solution

The principle of solving complex service problems in a Base makes it easier to reason about the interactions between services and clients, and to ensure that difficult tasks like state management are dealt with in an environment in which there is a chance of success. However, this organizational principle alone is not enough to solve the problem of constructing highly available services. Given the controlled environment of a Base, service authors must still construct services to ensure consistency and availability. We believe that a Base can provide primitives that further simplify service authors' jobs; the Redirector Stub is an example of such a primitive.

While building the Ninja Jukebox and Keiretsu, we made observations about how we achieved availability and consistency. Most of the code in these services dealt with distributing and maintaining tables of shared state. In the Ninja Jukebox, this state was the list of available music. In Keiretsu, this state was the list of other Keiretsu nodes, and the tables of client stub bindings. The distributed hash table was not yet complete when these two services were being implemented; if we had relied on it instead of our ad-hoc peer state exchange mechanisms, much of the services' source code would have been eliminated.

In both of these services, work queues are not explicitly exposed to service authors. These queues are hidden inside the thread scheduler, since NinjaRMI spawns a thread per connected client. This design decision had repercussions on service structure: each service had to be written to handle multithreading, since service authors must handle consistency within a single service instance as well as across instances throughout the cluster. Providing a mechanism to expose work queues to service authors may simplify the structure of some services, for example if services serialize requests to avoid issues associated with multithreading.
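An explicit work queue of the kind suggested here can serialize all requests onto a single worker thread, so that handler code needs no locking within a service instance. The following is a minimal sketch in modern Java with hypothetical names; it is not a MultiSpace primitive.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Illustrative explicit work queue: many RMI threads enqueue
    // requests, one worker drains them, so per-request handler state
    // needs no locks.
    public class SerializedService {
        private final BlockingQueue<Runnable> queue =
            new LinkedBlockingQueue<Runnable>();

        public SerializedService() {
            Thread worker = new Thread(new Runnable() {
                public void run() {
                    try {
                        while (true) {
                            queue.take().run(); // one request at a time
                        }
                    } catch (InterruptedException e) {
                        // shut down quietly when interrupted
                    }
                }
            });
            worker.setDaemon(true);
            worker.start();
        }

        // Called from many RMI threads; actual handling is serialized.
        public void submit(Runnable request) {
            queue.add(request);
        }
    }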

In the current MultiSpace, service instances must explicitly keep track of their counterparts on other nodes and spawn new services when load or availability demands it. A useful primitive would be to allow authors to specify conditions that dictate when service instances are spawned or pushed to a specific node, and to allow the MultiSpace infrastructure to handle the triggering of these conditions.
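Such a primitive might take the form of a small condition interface that the infrastructure evaluates against per-node statistics, spawning a new instance when the condition fires. Everything in this sketch is hypothetical, since the paper leaves the primitive as future work.

    // Hypothetical spawn-condition primitive; none of these types
    // exist in the MultiSpace, they only suggest a possible shape.
    public interface SpawnCondition {
        // Evaluated periodically by the infrastructure for each node.
        boolean shouldSpawn(NodeStats stats);
    }

    // Per-node statistics the infrastructure might expose.
    class NodeStats {
        double cpuLoad;        // utilization, 0.0 - 1.0
        int    liveInstances;  // instances of this service on the node
        int    queuedRequests; // request backlog observed at this node
    }

    // Example: spawn another instance when a node is busy and has a
    // deep request backlog.
    class Overloaded implements SpawnCondition {
        public boolean shouldSpawn(NodeStats stats) {
            return stats.cpuLoad > 0.8 && stats.queuedRequests > 100;
        }
    }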

5.3 Java as an Internet-Service Construction Environment

Java has proven to be an invaluable tool in the development of the Ninja infrastructure. The ability to rapidly deploy cross-platform code components simply by assuming the existence of a Java Virtual Machine made it easy to construct complex distributed services without concerning oneself with the heterogeneity of the systems involved. The use of RMI as a strongly-typed RPC, tied very closely to the Java language semantics, makes distributed programming comparably simple to single-node development. The protection, modularization, and safety guarantees provided by the Java runtime environment make dynamic dissemination of code components a natural activity. Similarly, the use of Java class reflection to generate new code wrappers for existing components (as with Redirector Stubs) provides automatic indirection at the object level.
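As one illustration of reflection-driven wrapping, the dynamic proxy facility of later JDKs (java.lang.reflect.Proxy, introduced in JDK 1.3) can interpose a generic handler in front of any interface. The Redirector Stubs described in this paper generated stub code at run time rather than using this facility, so the sketch below is only analogous, and the per-call instance-selection policy shown is an assumption.

    import java.lang.reflect.InvocationHandler;
    import java.lang.reflect.Proxy;
    import java.util.List;

    // Analogous sketch: object-level indirection via a dynamic proxy.
    // Every call on the returned object is routed to some live service
    // instance, much as a Redirector Stub re-binds calls to instances.
    public class Redirector {
        @SuppressWarnings("unchecked")
        public static <T> T wrap(Class<T> iface, final List<T> instances) {
            InvocationHandler handler = (proxy, method, args) -> {
                // Pick an instance per call (here, at random); a real
                // redirector could retry another instance on failure.
                T target = instances.get(
                    (int) (Math.random() * instances.size()));
                return method.invoke(target, args);
            };
            return (T) Proxy.newProxyInstance(
                iface.getClassLoader(), new Class<?>[] { iface }, handler);
        }
    }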

Java has a number of drawbacks in its current form, however. Performance is always an issue, and work on just-in-time [14, 24] and ahead-of-time [27] compilation is addressing many of these problems. The widely-used JVM from Sun Microsystems exhibits a large memory footprint (we have observed 3-4 MB for "Hello World", and up to 30 MB for a relatively simple application that performs many memory allocations and deallocations7), and crossing the boundary from Java to native code remains an expensive operation. In addition, the Java threading model permits threads to be non-preemptive, which has serious implications for components which must run in a protected environment. Our approach has been to use only Java Virtual Machines which employ preemptive threads.

We have started an effort to improve the performance of the Java runtime environment. Our initial prototype, called Jaguar, permits direct Java access to hardware resources through the use of a modified just-in-time compiler. Rather than going through the relatively expensive Java Native Interface for access to devices, Jaguar generates machine code for direct hardware access which is inlined with compiled Java bytecodes. We have implemented a Jaguar interface to a VIA-enabled8 fast system area network, obtaining performance equivalent to VIA access from C (80 microseconds round-trip time for small messages and over 400 megabits/second peak bandwidth). We believe that this approach is a viable way to tailor the Java environment for high-performance use in a clustered environment.

6 Related Work

Directly related to the Base architecture is the TACC [11] platform, which provides a cluster-based environment for scalable Internet services. In TACC, service components (or "workers") can be written in a number of languages and are controlled by a front-end machine which dispatches incoming requests to back-end cluster machines, incorporating load balancing and restart in the case of node failure. TACC workers may be chained across the cluster for composable tasks. TACC was designed to support Internet services which perform data transformation and aggregation tasks. Base services can additionally implement long-lived and persistent services; the result is that the Ninja approach addresses a wider set of potential applications and system-support issues. Furthermore, Base services can be dynamically created and destroyed through the iSpace loader interface on each MultiSpace node; TACC did not have this functionality.

7 There are no explicit deallocations in Java; by "deallocation", we mean discarding all references to an object, thus enabling it to be garbage collected.

8 VIA is the Virtual Interface Architecture [26], which specifies an industry-standard architecture for high-bandwidth, low-latency communication within clusters.

Sun's JINI [31] architecture is similar to the Base architecture in that it proposes to develop a Java-based lingua franca for binding users, devices, and services together in an intelligent, programmable Internet-wide infrastructure. JINI's use of RMI as the basic communication substrate, and its use of code mobility for distributing service and device interfaces, has a great deal in common with our approach. However, we believe that we are addressing problems which JINI does not directly solve: providing a hardware and software environment supporting scalable, fault-tolerant services is not within the JINI problem domain, nor is the use of dynamically-generated code components to act as interfaces into services. However, JINI has touched on issues such as service discovery and naming which have not yet been dealt with by the Base architecture; likewise, JINI's use of JavaSpaces and "leases" as a persistent storage model may interact well with the Base service model.

ANTS [30, 29] is a system that enables the dynamic deployment of mobile code which implements network protocols within Active Routers. Coded in Java, and utilizing techniques similar to those in the iSpace environment, ANTS has a similar set of goals in mind as the Base architecture. However, ANTS uses mobile code for processing each packet passing through a router; Base service components are executed at the granularity of an RMI call. Liquid Software [19] and Joust [15] are similar to ANTS in that they propose an environment which uses mobile code to customize nodes in the network for communications-oriented tasks. These systems focused on adapting systems at the level of network protocol code, while the Base architecture uses code mobility for distribution of service components both internally to and externally from a Base.

SanFrancisco [7] is a platform for building distributed, object-oriented business applications. Its primary goal is to simplify the development of these applications by providing developers a set of industrial-strength "Foundation" objects which implement common functionality. As such, SanFrancisco is very similar to Sun's Enterprise Java Beans in that it provides a framework for constructing applications using reusable components, with SanFrancisco providing a number of generic components to start with. MultiSpace addresses a different set of goals than Enterprise Java Beans and SanFrancisco in that it defines a flexible runtime environment for services, and MultiSpace intends to provide scalability and fault-tolerance by leveraging the flexibility of a component architecture. MultiSpace services could be built using the EJB or SanFrancisco model (extended to expose the MultiSpace functionality), but these issues appear to be orthogonal.

The Distributed Computing Environment (DCE) [20] is a software suite that provides a middleware platform operating on many operating systems and environments. DCE abstracts away many OS and network services (such as threads, security, a directory service, and RPC) and therefore allows programmers to implement DCE middleware independent of the vagaries of particular operating systems. DCE is rich and robust, but notoriously heavyweight, and its focus is on providing interoperable, wide-area middleware. MultiSpace is far less mature, but focuses instead on providing a platform for rapidly adaptable services that are housed within a single administrative domain (the Base).

7 Conclusions

In this paper, we presented an architecture for a Base, a clustered hardware and software platform for building and executing flexible and adaptable infrastructure services. The Base architecture was designed to adhere to three organizing principles: (1) solve the challenging service availability and scalability problems in carefully controlled environments, allowing service authors to make many assumptions that would not otherwise be valid in an uncontrolled or wide-area environment; (2) gain service flexibility by decomposing Bases into a number of receptive execution environments; and (3) introduce a level of indirection between the clients and services through the use of dynamic code generation techniques.

We built a prototype implementation (MultiSpace) of the Base architecture using Java as a foundation. This implementation took advantage of code mobility and dynamic compilation techniques to help in the structuring and deployment of services inside the cluster. The MultiSpace abstracts away load balancing, failover, and service instance detection and naming from service authors. Using the MultiSpace platform, we implemented two novel services: the Ninja Jukebox (a cluster-based music warehouse) and Keiretsu (an instant messaging service). The Ninja Jukebox implementation demonstrated that code mobility is valuable inside a cluster environment, as it permits rapid evolution of services and run-time binding of service components to available resources in the cluster. The Keiretsu application demonstrated that our MultiSpace layer successfully reduced the complexity of building new services: the core Keiretsu service functionality was implemented in less than a page of code, but the application was demonstrably fault-tolerant. We also demonstrated that Keiretsu code, rather than any inherent limitation in the MultiSpace layer, limited the scalability of this service, although we hypothesized that our use of multicast beacons would ultimately limit the scalability of the current MultiSpace implementation.

8 Acknowledgements

The authors thank Ian Goldberg and David Wagner for their contributions to iSpace, NinjaRMI, and the Jukebox application. We would also like to thank all of the graduate and undergraduate students in the Ninja project for agreeing to be guinea pigs for our platform, and Brian Shiratsuki and Keith Sklower for their help in setting up the MultiSpace machines and networks. Finally, we would like to thank our anonymous reviewers and our shepherd, Charles Antonelli, for their insight and suggestions for improvement.

References

[1] Elan Amir, Steven McCanne, and Randy Katz. An Active Service Framework and its Application to Real-Time Multimedia Transcoding. In Proceedings of ACM SIGCOMM '98, volume 28, pages 178-189, October 1998.

[2] B. Anderson and D. Shasha. Persistent Linda: Linda + Transactions + Query Processing. In Springer-Verlag Lecture Notes in Computer Science 574, Mont-Saint-Michel, France, June 1991.

[3] Thomas E. Anderson, David E. Culler, and David Patterson. A Case for NOW (Networks of Workstations). IEEE Micro, 12(1):54-64, February 1995.

[4] Ken Birman, Andre Schiper, and Pat Stephenson. Lightweight Causal and Atomic Group Multicast. ACM Transactions on Computer Systems, 9(3):272-314, 1991.

[5] Andrew D. Birrell and Bruce Jay Nelson. Implementing Remote Procedure Call. ACM Transactions on Computer Systems, 2(1):39-59, February 1984.

[6] T. Brisco. RFC 1794: DNS Support for Load Balancing, April 1995.

[7] IBM Corporation. IBM SanFrancisco product homepage. http://www.software.ibm.com/ad/sanfrancisco/.

[8] Inktomi Corporation. The Technology Behind HotBot. http://www.inktomi.com/whitepap.html, May 1996.

[9] Steven Czerwinski, Ben Y. Zhao, Todd Hodes, Anthony Joseph, and Randy Katz. An Architecture for a Secure Service Discovery Service. In Proceedings of MobiCom '99, Seattle, WA, August 1999. ACM.

[10] A. D. Birrell et al. Grapevine: An Exercise in Distributed Computing. Communications of the Association for Computing Machinery, 25(4):3-23, February 1984.

[11] Armando Fox, Steven D. Gribble, Yatin Chawathe, Eric A. Brewer, and Paul Gauthier. Cluster-Based Scalable Network Services. In Proceedings of the 16th ACM Symposium on Operating Systems Principles, St.-Malo, France, October 1997.

[12] Ian Goldberg, Steven D. Gribble, David Wagner, and Eric A. Brewer. The Ninja Jukebox. Submitted to the 2nd USENIX Symposium on Internet Technologies and Systems, Boulder, Colorado, USA, October 1999.

[13] Robert S. Gray. Agent Tcl: A Flexible and Secure Mobile-Agent System. In Proceedings of the Fourth Annual Usenix Tcl/Tk Workshop. USENIX Association, 1996.

[14] The Open Group. The Fajita Compiler Project. http://www.gr.opengroup.org/java/compiler/fajita/index-b.htm, 1998.

[15] J. Hartman, L. Peterson, A. Bavier, P. Bigot, P. Bridges, B. Montz, R. Piltz, T. Proebsting, and O. Spatscheck. Joust: A Platform for Liquid Software. In IEEE Network (Special Edition on Active and Programmable Networks), July 1998.

[16] Chris Hawblitzel, Chi-Chao Chang, Grzegorz Czajkowski, Deyu Hu, and Thorsten von Eicken. Implementing Multiple Protection Domains in Java. In Proceedings of the 1998 Usenix Annual Technical Conference, June 1998.

[17] Open Group Research Institute. Scalable Highly Available Web Server Project (SHAWS). http://www.osf.org/RI/PubProjPgs/SFTWWW.htm.

[18] Dag Johansen, Robbert van Renesse, and Fred B. Schneider. Operating System Support for Mobile Agents. In Proceedings of the 5th IEEE Workshop on Hot Topics in Operating Systems, 1995.

[19] John Hartman, Udi Manber, Larry Peterson, and Todd Proebsting. Liquid Software: A New Paradigm for Networked Systems. Technical report, Department of Computer Science, University of Arizona, June 1996.

[20] Brad Curtis Johnson. A Distributed Computing Environment Framework: an OSF Perspective. Technical Report DEV-DCE-TP6-1, The Open Group, June 1991.

[21] Eric Jul, Henry M. Levy, Norman C. Hutchinson, and Andrew P. Black. Fine-Grained Mobility in the Emerald System. ACM Transactions on Computer Systems, 6(1):109-133, 1988.

[22] Ti Kan and Steve Scherf. CDDB Specification. http://www.cddb.com/ftp/cddb-docs/cddb_howto.gz.

[23] Sun Microsystems. Java Remote Method Invocation - Distributed Computing for Java. http://java.sun.com/.

[24] Sun Microsystems. The Solaris JIT Compiler. http://www.sun.com/solaris/jit, 1998.

[25] The Object Management Group (OMG). The Common Object Request Broker: Architecture and Specification, February 1998. http://www.omg.org/library/c2indx.html.

[26] Virtual Interface Architecture Organization. Virtual Interface Architecture Specification version 1.0, December 1997. http://www.viarch.org.

[27] Todd A. Proebsting, Gregg Townsend, Patrick Bridges, John H. Hartman, Tim Newsham, and Scott A. Watterson. Toba: Java for Applications - A Way Ahead of Time (WAT) Compiler. In Proceedings of the Third USENIX Conference on Object-Oriented Technologies (COOTS), Portland, Oregon, USA, June 1997.

[28] Joseph Tardo and Luis Valente. Mobile Agent Security and Telescript. In Proceedings of the 41st International Conference of the IEEE Computer Society (CompCon '96), February 1996.

[29] David L. Tennenhouse, Jonathan M. Smith, W. David Sincoskie, David J. Wetherall, and Gary J. Minden. A Survey of Active Network Research. Active Networks home page (MIT Telemedia, Networks and Systems group), 1996.

[30] David L. Tennenhouse and David J. Wetherall. Towards an Active Network Architecture. In ACM SIGCOMM '96 (Computer Communications Review). ACM, 1996.

[31] Jim Waldo. Jini Architecture Overview. Available at http://java.sun.com/products/jini/whitepapers.

[32] James E. White. Telescript Technology: The Foundation for the Electronic Marketplace, 1991. http://www.generalmagic.com.

[33] P. Wyckoff, S. W. McLaughry, T. J. Lehman, and D. A. Ford. TSpaces. IBM Systems Journal, 37(3), April 1998.

[34] C. Yoshikawa, B. Chun, P. Eastham, A. Vahdat, T. Anderson, and D. Culler. Using Smart Clients to Build Scalable Services. In Proceedings of the Winter 1997 USENIX Technical Conference, January 1997.