


J Grid Computing (2009) 7:73–89
DOI 10.1007/s10723-008-9107-y

DynaGrid: An Adaptive, Scalable, and Reliable Resource Provisioning Framework for WSRF-Compliant Applications

Eun-Kyu Byun · Jin-Soo Kim

Received: 8 October 2007 / Accepted: 3 July 2008 / Published online: 9 August 2008
© Springer Science + Business Media B.V. 2008

Abstract The Web Services Resource Framework (WSRF) is a set of specifications which define a generic and open framework for modeling and accessing stateful resources using Web services. This paper proposes DynaGrid, a new framework for WSRF-compliant applications. Several new components, such as ServiceDoor, Dynamic Service Launcher, ClientProxy, and PartitionManager, have been introduced to offer adaptive, scalable, and reliable resource provisioning. All of these components are implemented as standard WSRF-compliant Web services, hence DynaGrid is complementary to the existing GT4. Our experimental evaluation shows that DynaGrid effectively utilizes Grid resources by adaptively allocating only the required number of resources according to the amount of incoming requests, providing both scalability and reliability at the same time.

E.-K. Byun
Division of Computer Science, Korea Advanced Institute of Science and Technology (KAIST), 373-1 Guseong-dong, Yuseong-gu, Daejeon 305-701, South Korea
e-mail: [email protected]

J.-S. Kim (B)
Sungkyunkwan University, 300 Cheoncheon-dong, Jangan-gu, Suwon, Gyeonggi-do 440-746, South Korea
e-mail: [email protected]

Keywords WSRF · SOA · GT4 · Adaptive resource provisioning · Distributed system

1 Introduction

Grid computing is the technology for building an Internet-wide computing environment that integrates distributed and heterogeneous resources [1]. Grid computing has provided many scientific projects with large computational power and storage capacity. In addition, Grid computing has started to provide the computational infrastructure for distributed applications based on SOA (Service-Oriented Architecture). The recently proposed OGSA (Open Grid Services Architecture) [2] and WSRF (Web Services Resource Framework) [3] specify the base architecture and interfaces for the Grid as the infrastructure of SOA.

WSRF is a group of specifications which define a generic and open framework for modeling and accessing stateful resources using Web services. A stateful resource (or a ServiceResource¹) is a set of data values that persist across, and evolve as a result of, Web service interactions. The execution model of stateful Web services is explained in Section 3.1.

¹ Note that the term resource used in WSRF should not be confused with the resource in Grid computing, a general term denoting a computational or storage resource. In the rest of this paper, we use the term "ServiceResource" to explicitly indicate the stateful resource defined in the WSRF specifications.




In this paper, we argue that any Grid system should meet the following requirements in order to provide sufficient resources to Grid applications effectively and stably.

– Adaptability
A Grid system is composed of a large number of heterogeneous resources on which diverse applications are executed. Due to the dynamic nature of the Grid, the amount of resources demanded by each application may change over time. Currently, businesses must acquire more processing power and storage than they usually need simply to cover peak times. For example, Amazon.com and other web retailers must ensure they can handle peak demand during the holiday season, while the IRS (Internal Revenue Service) has a peak in demand in April. For both entities, the hardware is under-utilized the rest of the time [12]. Adaptive resource provisioning is necessary to handle such situations elegantly. New resources should be allocated adaptively to the application that requires more computing power or storage. If the service request rate decreases, deployed services can be removed in order to save disk space or to reduce management costs. Unfortunately, most of the current WSRF-based hosting environments do not support adaptive resource provisioning since each service needs to be deployed in advance on statically-partitioned resources.

– Scalability
One of the most important mechanisms for realizing adaptive resource provisioning is dynamic service deployment, with which a new service can be deployed on any resource in the Grid without restarting the container on the target resource. Dynamic service deployment mechanisms would appear to necessitate a service-specific centralized manager which maintains the locations of currently deployed ServiceResources and monitors the status of the Grid resources used by the service. Special attention should be paid when designing the Grid system, since such a centralized manager easily becomes a performance bottleneck and potentially harms the scalability of the system.

– Reliability
In a large-scale Grid system, the availability of resources also tends to vary dynamically, as each resource may leave the system or crash unpredictably at any time. In order to achieve reliable services, the state of running services, i.e. ServiceResources, should be kept available even when the current host leaves the system or experiences instability, overload, or failure.

Globus Toolkit version 4 (GT4) [4] is a representative Grid middleware supporting WSRF in order to execute or access WSRF-compliant Web services in a standard way. On GT4, Web services can exploit heterogeneous Grid environments through platform-independent Web services protocols and a Java-based hosting environment. However, since GT4 does not satisfy the requirements discussed above, a complementary architecture is required.

Therefore, this paper presents DynaGrid, a new framework which offers adaptive, scalable, and reliable resource provisioning for WSRF-compliant applications. We develop a new dynamic service deployment mechanism to support adaptive resource provisioning. To enhance scalability, our adaptive resource provisioning mechanism does not rely on any centralized manager. DynaGrid achieves this by providing ClientProxy on the client side, a modified client stub which transparently redirects the client's requests to the actual location of the corresponding ServiceResource. DynaGrid also distributes the cost of management and recovery by partitioning the resources allocated to the service. For reliability, every ServiceResource in DynaGrid is replicated to another resource and all the execution requests for the ServiceResource are logged within the replica. In case the original resource crashes, DynaGrid recovers the ServiceResource by replaying the logged requests on the replica.

We implemented the DynaGrid framework as a set of WSRF-compliant Web services so that they can be executed on the GT4 container without any modification of the container core. Our experimental evaluations are performed on a real testbed with practical applications, including a MapReduce application [11]. The results indicate that DynaGrid effectively utilizes Grid resources with scalability and reliability.




The rest of the paper is organized as follows. In Section 2, we briefly overview related work. Section 3 presents the execution model and the overall architecture of DynaGrid. The proposed mechanisms for adaptive resource provisioning are explained in Section 4. DynaGrid's architectures for scalability and reliability are described in Section 5 and Section 6, respectively. Section 7 presents the evaluation results. Finally, we conclude in Section 8.

2 Related Work

M. Welsh et al. proposed SEDA (staged event-driven architecture) in [13]. SEDA enables Internet services to be executed concurrently on distributed resources, with resource allocation and load balancing handled automatically. Hiding complex resource management from clients is similar to DynaGrid. However, SEDA supports only a specific application model, called stages, rather than a standard model such as Web services.

Djilali et al. investigated a fault-tolerant environment for RPC programming in Internet-connected Desktop Grids [14]. They exploit a three-tier architecture (client, coordinator, server), message logging, hierarchical fault detection, and passive replication. These schemes are well known, and DynaGrid also exploits them. DynaGrid especially focuses on keeping ServiceResources alive in a WSRF environment under Grid resource failures.

The latest version of GT4 provides a remote and dynamic service deployment mechanism called HAND [5]. HAND adds an internal function to the GT4 container which allows the deployed service list to be reloaded dynamically. However, this approach is not generally applicable to other hosting environments, since it is a GT4-specific solution and requires administrative privilege to access the internal data structures of the GT4 container.

DynaGrid is the first attempt to build an adaptive resource provisioning framework for WSRF-compliant applications with an emphasis on scalability and reliability issues. Our dynamic service deployment mechanism differs from HAND in that our approach requires no modification to the GT4 container. Moreover, ServiceResource replication and recovery are unique to DynaGrid and, to the best of our knowledge, no similar mechanism has been investigated for WSRF in previous work.

3 WSRF and DynaGrid Overview

3.1 Web Service Execution Model in WSRF

In this section, prior to explaining the architecture of DynaGrid, we describe how stateful Web services are executed in a WSRF environment, how Web services are accessed by clients, and eventually how they form distributed applications. Web services require a hosting environment called a Web service container, for example Apache Tomcat or IBM WebSphere, to be executed on a network-accessible host. The GT4 container is a Java-based hosting environment for WSRF-compliant Web services. A GT4 container is executed on each host and allows Web services to utilize the underlying resources of the host, such as CPU cycles, memory, and storage. The typical way to run an application on a specific host is to implement the application as a WSRF-compliant service and deploy it into the container of that host. Clients then access the container to execute functions of the Web service using SOAP and HTTP.

The most important feature of WSRF compared to traditional Web services is that WSRF defines a standard interface for managing the state of resources through Web services. According to the WSRF specification, each client can create its own ServiceResource and store the result data of subsequent Web service executions in it, thereby keeping the client's context in the Web service. All information about a client's Web service executions is stored in ServiceResources, which are completely independent from the Web service code. A Web service is accessed by a client with an endpoint reference containing a URI, which points to both the Web service code and the key representing a specific ServiceResource.
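To make the addressing scheme concrete, the following minimal Java sketch models an endpoint reference as a service URI paired with a resource key. The class and field names are illustrative assumptions and do not reflect the actual GT4 API.

    import java.net.URI;

    // Hypothetical sketch of WSRF-style addressing: an endpoint reference
    // couples the service URI (which Web service code to run) with a key
    // (which ServiceResource holds this client's state).
    public class EndpointReference {
        private final URI serviceUri;      // points to the deployed Web service code
        private final String resourceKey;  // identifies one client's ServiceResource

        public EndpointReference(URI serviceUri, String resourceKey) {
            this.serviceUri = serviceUri;
            this.resourceKey = resourceKey;
        }

        public URI getServiceUri() { return serviceUri; }
        public String getResourceKey() { return resourceKey; }

        @Override
        public String toString() {
            return serviceUri + "?key=" + resourceKey;
        }
    }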



This separation of Web service code and ServiceResource makes it possible for a client's context in the Web service to be resumed on any hosting environment in which the corresponding Web service code is deployed and the ServiceResource is located. DynaGrid exploits this by allowing ServiceResources to be freely moved to any container for adaptive and reliable execution of Web services.

Web services can form various types of distributed applications. First, a Web service by itself can be a simple application which handles requests from several clients, as in traditional Web servers. Second, a Web service can be used in a parallel application which performs the same computation on diverse input data at the same time. In this case, the core of the computation is implemented as Web service code, and a special management agent creates the necessary number of ServiceResources and executes the Web service on the different inputs. The executions may be spread over distributed resources for performance. Finally, Web services can be components of a multi-tier distributed application. In this case, one Web service takes the role of the client of the next tier's Web service. In each tier, multiple ServiceResources may be active concurrently.

3.2 Overall Architecture of DynaGrid

In order to execute a Web service on GT4, the Web service has to be deployed on the GT4 container of a statically assigned Grid server, so that all requests from clients are handled by the pre-assigned container. DynaGrid removes this limitation of the existing WSRF-compliant Grid environment, especially GT4: it enables Web services to exploit any amount of Grid resources as the amount of necessary computational resources varies.

Figure 1 shows the overall architecture of DynaGrid. DynaGrid is composed of four components: ServiceDoor, Dynamic Service Launcher (DSL), PartitionManager, and ClientProxy. In the DynaGrid framework, these components are independent Web services executed on GT4 containers; DSL and PartitionManager are deployed on every container in advance.

In order to be executed in DynaGrid, every Web service has to deploy a specific Web service named ServiceDoor on a trusted resource and publish its address to clients. The ServiceDoor is the entry point through which clients access the Web service. Only through the ServiceDoor can clients create new ServiceResources. To clients, the ServiceDoor appears to be the Web service itself, and they interact with it just as they would with a standard Web service.

The ServiceDoor customized for each Web service is automatically created by DoorCreator, a utility which DynaGrid provides. DoorCreator interprets the WSDL (Web Services Description Language), the WSDD (Web Services Deployment Descriptor), and the corresponding Web service code.

Behind the ServiceDoor, the Web service is actually executed on the GT4 container of a dynamically allocated Grid resource. Dynamic Service Launcher (DSL) is a special Web service running on every resource in DynaGrid. Through DSL, DynaGrid can dynamically deploy new Web services, create ServiceResources, and handle Web service execution requests on any GT4 container. DSL also performs ServiceResource replication and request logging.

For scalability, DynaGrid groups the DSLs allocated to a specific Web service into several ServicePartitions. Grouping DSLs is equivalent to grouping GT4 containers, since there is exactly one DSL per GT4 container. Each ServicePartition consists of at least two DSLs and one PartitionManager. PartitionManager performs creation, replication, load balancing, and recovery of ServiceResources. In addition, it monitors the status of the DSLs and ServiceResources that belong to the same ServicePartition.

ClientProxy is a modified client stub. It transparently redirects each service execution request to the actual location of the ServiceResource. DynaGrid also provides a utility called ProxyCreator to automatically generate ClientProxy code from the WSDL of the target Web service.

In a service-oriented workflow system, a service can be a client of another service. In such a case, the client-side service should use the Web service interface to connect to the server-side service. Therefore, in the DynaGrid framework, communication between the component services of a workflow system should pass through the ServiceDoor of the server-side service.

Fig. 1 The overall architecture of DynaGrid

In the following sections, we describe the details of these components and how they interact with each other to achieve adaptive, scalable, and reliable resource provisioning.

4 Adaptive Resource Provisioning

The most distinctive feature of DynaGrid is that Web services can be executed on any resource whenever needed. To realize this, we developed a new dynamic service deployment mechanism and implemented it as the main function of DSL. Its details are discussed in Section 4.1.

DynaGrid dynamically allocates additional resources when the current resources cannot properly support the increasing demand for Web services and ServiceResources. Resource inadequacy arises in two situations. The first is when there are too many ServiceResource creation requests to accommodate within the allocated resources. The second is when a ServicePartition has no available resource on which to replicate ServiceResources and replicas from failed resources (cf. Section 6). In both cases, ServiceDoor dynamically deploys the service into the DSL of a new resource and inserts that DSL into the current ServicePartition.

Because of the additional management step introduced by the ServiceDoor, the internal steps of ServiceResource creation and Web service execution are modified. Sections 4.2 and 4.3 explain the details.

4.1 Dynamic Service Deployment

Dynamic service deployment in DynaGrid is distinct from that of GT4. Typical Web service containers, including the GT4 container, launch the service code directly when requests for a specific Web service arrive. Each container maintains a list of currently deployed services, and deploying a new service requires adding a new entry to this list. Since changes to the list are usually reflected only after the container is restarted, the latest version of GT4 includes a special mechanism which accesses the internal data structures of GT4 in order to reload the deployed service list dynamically [5]. This is why the dynamic service deployment mechanism in GT4 depends on the implementation of the GT4 container. In DynaGrid, on the other hand, deploying a new service does not require any interaction with the underlying Web service container. Every service deployment request is first forwarded to an appropriate DSL. The DSL then changes the execution environment to that of the target service and launches the service code. In this way, DynaGrid's dynamic service deployment mechanism conforms to the WSRF specifications and thus can be used in any WSRF-compliant hosting environment.
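As a rough sketch of this idea, the following Java fragment loads a newly transferred service package under its own ClassLoader, so the container's deployed-service list is never touched. The class and method names are hypothetical; DynaGrid's actual DSL implementation is not published in this form.

    import java.io.File;
    import java.net.URL;
    import java.net.URLClassLoader;

    // Simplified sketch: load a dynamically transferred service package
    // without modifying the container's deployed-service list.
    public class DynamicDeployer {
        // Loads the service implementation class from a transferred package
        // file (e.g., a GAR/JAR) under a dedicated ClassLoader, so the
        // container itself never needs to be restarted or reconfigured.
        public static Object deploy(File servicePackage, String serviceClassName)
                throws Exception {
            URL[] codeBase = { servicePackage.toURI().toURL() };
            // The parent is the container's ClassLoader so shared WSRF types
            // stay compatible; child-first loading is omitted for brevity.
            URLClassLoader loader =
                    new URLClassLoader(codeBase, DynamicDeployer.class.getClassLoader());
            Class<?> serviceClass = loader.loadClass(serviceClassName);
            return serviceClass.getDeclaredConstructor().newInstance();
        }
    }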

In order for DSL to manage dynamically deployed services, a new type of ServiceResource named Meta ServiceResource is introduced.



Once a new service is deployed, a new Meta ServiceResource is created. It stores information about the deployed service, including the service ID, interface class, service options, and ClassLoader. With the endpoint reference (EPR) of the Meta ServiceResource, other components in DynaGrid can identify the target Web service for ServiceResource creation and service execution.
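A minimal sketch of this bookkeeping might look as follows; the field set mirrors the list in the text (service ID, interface class, service options, ClassLoader), but the class shape is an assumption.

    import java.util.Map;

    // Hypothetical sketch of the bookkeeping a Meta ServiceResource holds
    // for one dynamically deployed service (names are illustrative).
    public class MetaServiceResource {
        private final String serviceId;            // unique ID of the deployed service
        private final Class<?> interfaceClass;     // service interface to invoke
        private final Map<String, String> serviceOptions; // options from the WSDD
        private final ClassLoader classLoader;     // loads the service's own classes

        public MetaServiceResource(String serviceId, Class<?> interfaceClass,
                                   Map<String, String> serviceOptions,
                                   ClassLoader classLoader) {
            this.serviceId = serviceId;
            this.interfaceClass = interfaceClass;
            this.serviceOptions = serviceOptions;
            this.classLoader = classLoader;
        }

        public String getServiceId() { return serviceId; }
        public Class<?> getInterfaceClass() { return interfaceClass; }
        public Map<String, String> getServiceOptions() { return serviceOptions; }
        public ClassLoader getClassLoader() { return classLoader; }
    }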

The typical steps of dynamic service deployment in DynaGrid are as follows. (1) ServiceDoor searches for an available resource using the information service of the Grid. (2) It transfers the package file containing the service code and the ServiceResource class to the resource, and deploys the service through the DSL on the new resource. (3) The DSL returns the EPR of the created Meta ServiceResource. (4) Finally, the returned EPR is inserted into a ServicePartition and the PartitionManager begins to monitor the status of the DSL.

4.2 ServiceResource Creation

The first step in invoking a Web service is to create a ServiceResource and obtain its EPR, which is composed of the key and the URI of the service. Clients ask ServiceDoor to create a ServiceResource through its Web service interface. The addresses of all created ServiceResources are kept by ServiceDoor so that clients can query it for the actual locations of ServiceResources.

Figure 2 depicts the typical steps to create a new ServiceResource in DynaGrid. When ServiceDoor receives a creation request, it assigns a global key and selects the ServicePartition in which the new ServiceResource will be created. ServiceDoor periodically gathers state information from every PartitionManager to check how much room each ServicePartition has for new ServiceResources. Based on this information, ServiceDoor relays the creation request to the PartitionManager of the ServicePartition with the most available capacity.

The PartitionManager then selects the container in which the new ServiceResource will be created. Since the PartitionManager also keeps the load information of each resource, including CPU utilization, the amount of free memory, and the number of concurrent processes, it can easily select the most under-loaded resource. The ServiceResource is created through the DSL on the selected resource, and the local EPR, which specifies the actual location of the newly created ServiceResource, is returned to ServiceDoor. ServiceDoor then inserts a global-key-to-local-EPR mapping entry into its mapping table. Note that the local EPR cannot be delivered to clients directly because DynaGrid may dynamically relocate the ServiceResource if necessary. Instead, ServiceDoor returns to the client a new EPR composed of the address of ServiceDoor and the global key as the creation result.
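The following sketch illustrates this key-management logic under assumed names: the door hands out only (door address, global key) pairs and keeps the global-key-to-local-EPR mapping private, so that relocation stays transparent to clients.

    import java.util.UUID;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative sketch of ServiceDoor's mapping table (hypothetical names).
    public class ServiceDoorTable {
        private final String doorAddress;
        private final ConcurrentHashMap<String, String> globalToLocalEpr =
                new ConcurrentHashMap<>();

        public ServiceDoorTable(String doorAddress) { this.doorAddress = doorAddress; }

        // Called after a PartitionManager reports where the resource was created.
        public String registerResource(String localEpr) {
            String globalKey = UUID.randomUUID().toString();
            globalToLocalEpr.put(globalKey, localEpr);
            // The EPR returned to the client names the door, not the DSL.
            return doorAddress + "?key=" + globalKey;
        }

        // Used by ClientProxy (and after relocation) to resolve the real location.
        public String resolve(String globalKey) {
            return globalToLocalEpr.get(globalKey);
        }

        // Called when load balancing or recovery moves the ServiceResource.
        public void relocate(String globalKey, String newLocalEpr) {
            globalToLocalEpr.put(globalKey, newLocalEpr);
        }
    }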

4.3 Web Service Execution

In general, clients use client stubs to invoke Web services. A client stub translates Web service invocation requests from the client into SOAP messages and sends them to remote Web service engines (for example, the GT4 container).

Fig. 2 Steps for ServiceResource creation



Fig. 3 The comparison of service execution mechanisms in GT4 and DynaGrid: (a) service execution in GT4, (b) service execution in DynaGrid

Since DynaGrid has an enhanced service deployment mechanism, the Web service execution procedure is slightly different from that of traditional Web service engines.

Figure 3a shows how a service is executed in the GT4 container. When a container receives a request from a client, it parses the EPR, which is composed of the URI of the service and the ServiceResource key. It then finds the corresponding service in the deployed service list. The ResourceHome of the service retrieves the corresponding ServiceResource according to the key. Finally, a thread called a context is started with the ServiceResource.

Service execution in DSL, on the other hand, proceeds as shown in Fig. 3b. When a container receives a request, it starts a context with a key composed of both the service ID and a local ServiceResource key. DSLResourceHome parses the key to retrieve the Meta ServiceResource related to the service ID and the ServiceResource corresponding to the local ServiceResource key. DSL changes the context's ClassLoader and service options according to the information stored in the Meta ServiceResource, configuring the context for executing the service. Finally, DSL executes the service with the ServiceResource and returns the results to the client.
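A simplified sketch of this dispatch step is shown below, reusing the MetaServiceResource sketch from Section 4.1. The composite-key format and the context-ClassLoader swap are assumptions that only illustrate the mechanism described in the text.

    import java.util.Map;

    // Simplified sketch of the DSL-side dispatch (hypothetical types): the
    // composite key selects both the Meta ServiceResource (service ID) and
    // the client's ServiceResource (local key); the thread's context
    // ClassLoader is swapped before the service code runs.
    public class DslDispatcher {
        private final Map<String, MetaServiceResource> metaResources;
        private final Map<String, Object> serviceResources;

        public DslDispatcher(Map<String, MetaServiceResource> metaResources,
                             Map<String, Object> serviceResources) {
            this.metaResources = metaResources;
            this.serviceResources = serviceResources;
        }

        public Object execute(String compositeKey, ServiceCall call) throws Exception {
            // Composite key format assumed here: "<serviceId>_<localResourceKey>".
            int sep = compositeKey.indexOf('_');
            MetaServiceResource meta = metaResources.get(compositeKey.substring(0, sep));
            Object resource = serviceResources.get(compositeKey.substring(sep + 1));

            Thread current = Thread.currentThread();
            ClassLoader previous = current.getContextClassLoader();
            current.setContextClassLoader(meta.getClassLoader()); // enter service context
            try {
                return call.invoke(resource); // run the service code on the resource
            } finally {
                current.setContextClassLoader(previous); // restore container context
            }
        }

        // Minimal stand-in for a parsed SOAP invocation.
        public interface ServiceCall {
            Object invoke(Object serviceResource) throws Exception;
        }
    }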

4.4 Discussion About Independent DSL

DSL can be viewed as a service container running on top of a host service container, which may seem inefficient. Indeed, DSL could be implemented as an independent host container, such as the GT4 container or Apache Tomcat, instead of as a Web service. In that approach, DSL's core functionalities, such as dynamic service deployment and ServiceResource management, could be implemented more efficiently, without unnecessarily complex mechanisms such as the Meta ServiceResource and ClassLoader relocation. The advantage of the Web service form of DSL is that DynaGrid can be easily applied to a VO constructed with GT4 containers, because deploying a new Web service is much easier than deploying a new container on every host. The contribution of the Web service implementation of DSL is that it adds the functionality required by DynaGrid without any modification of the GT4 container.

5 Scalability in DynaGrid

Although ServiceDoor and the dynamic service deployment mechanism successfully enable adaptive resource provisioning, the presence of the centralized ServiceDoor causes a long service execution path and a potential bottleneck. In this section, we describe two design decisions that improve the scalability of DynaGrid. First, distributed PartitionManagers handle most of the complex management tasks, such as creation, replication, load balancing, and recovery, on behalf of ServiceDoor. Second, ClientProxy allows clients to bypass ServiceDoor during Web service invocations. The following subsections elaborate on the features provided by PartitionManager and ClientProxy.

5.1 ServicePartition

In DynaGrid, there must be a central manager to handle ServiceResource creation and to continuously monitor the status of resources for adaptive resource provisioning. If all of these tasks were performed by ServiceDoor, it could easily become overloaded, and DynaGrid could not scale as the demand for resources increases. To prevent ServiceDoor from being overburdened, DynaGrid delegates some management tasks, such as ServiceResource creation, replication, recovery, and load balancing, to each ServicePartition. A ServicePartition is composed of a PartitionManager and the DSLs on which Web services are deployed by the corresponding ServiceDoor. Each PartitionManager independently manages its own DSLs and ServiceResources.

When a PartitionManager receives a ServiceResource creation request from ServiceDoor, it first selects the two most under-utilized DSLs in the ServicePartition: one for the original and the other for its replica. After successful creation of the new ServiceResource and its replica on the selected DSLs, the PartitionManager starts to keep track of the local EPR and the replica location of the new ServiceResource. The local EPR of the newly created ServiceResource is returned to the ServiceDoor.
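This placement rule can be sketched as follows, with load abstracted to a single number (the text also mentions memory and process counts); all names are illustrative.

    import java.util.Comparator;
    import java.util.List;
    import java.util.stream.Collectors;

    // Hypothetical sketch of the placement step: pick the two least-loaded
    // DSLs, host the original on one and its replica on the other.
    public class ReplicaPlacement {
        public static final class Dsl {
            final String address;
            final double load; // e.g., CPU utilization in [0, 1]
            public Dsl(String address, double load) {
                this.address = address;
                this.load = load;
            }
        }

        public static final class Placement {
            public final Dsl original, replica;
            Placement(Dsl original, Dsl replica) {
                this.original = original;
                this.replica = replica;
            }
        }

        public static Placement choose(List<Dsl> partition) {
            if (partition.size() < 2)
                throw new IllegalStateException("a ServicePartition needs at least two DSLs");
            List<Dsl> byLoad = partition.stream()
                    .sorted(Comparator.comparingDouble(d -> d.load))
                    .collect(Collectors.toList());
            // Original and replica must live on different DSLs so that a
            // single failure never destroys both copies.
            return new Placement(byLoad.get(0), byLoad.get(1));
        }
    }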

PartitionManager periodically monitors its DSLs to see whether a resource is overloaded and to collect the list of destroyed ServiceResources. These data are used to choose the best DSL when the PartitionManager creates new ServiceResources or replicas. When it detects the failure of a DSL, the recovery mechanism is invoked (cf. Section 6.2).

Along with managing the ServicePartition, PartitionManager also reports its status to ServiceDoor periodically. In order to prevent a PartitionManager from becoming a performance bottleneck, the size of a ServicePartition is limited. When a ServicePartition is overloaded, ServiceDoor dynamically creates a new ServicePartition to handle the increased service requests. Through this hierarchical resource management structure, we improve the scalability of DynaGrid.

5.2 Direct Web Service Invocation

In DynaGrid, the ClientProxy generated by ProxyCreator replaces the client stub. Every Web service client loads a ClientProxy instance for a specific ServiceResource. ClientProxy transparently redirects requests to the DSL where the ServiceResource actually resides. Without ClientProxy, every service execution request would pass through ServiceDoor, which would limit the scalability of DynaGrid.

Figure 4 depicts the steps for Web service invocation in DynaGrid. Every EPR known to clients is composed of the address of the ServiceDoor and a global key, as described in Section 4.2.

Fig. 4 Steps for Web service invocation



On the first access, ClientProxy obtains from ServiceDoor the local EPR corresponding to the global key and caches the returned EPR internally. Thereafter, successive execution requests are delivered directly to the ServiceResource, bypassing ServiceDoor.

Since the location of the ServiceResource may change, ClientProxy also obtains the EPR of the PartitionManager from ServiceDoor and subscribes to the PartitionManager in order to be notified when the local EPR changes. If the cached local EPR becomes invalid due to the failure of the corresponding DSL, the up-to-date EPR is immediately fetched from ServiceDoor.
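Putting the two preceding paragraphs together, a ClientProxy's redirection logic might be sketched as below; the transport and the ServiceDoor lookup interface are hypothetical stand-ins, not DynaGrid's published API.

    // Hypothetical sketch of ClientProxy: the local EPR is cached after the
    // first lookup, requests go straight to the DSL, and the cache is
    // refreshed from ServiceDoor when the DSL turns out to be unreachable.
    public abstract class ClientProxy {
        private final ServiceDoorClient door; // stand-in stub for ServiceDoor
        private final String globalKey;
        private volatile String cachedLocalEpr; // address of the DSL + local key

        protected ClientProxy(ServiceDoorClient door, String globalKey) {
            this.door = door;
            this.globalKey = globalKey;
        }

        public Object invoke(Object request) throws Exception {
            if (cachedLocalEpr == null) {
                cachedLocalEpr = door.queryLocalEpr(globalKey); // first access only
            }
            try {
                return sendTo(cachedLocalEpr, request); // bypasses ServiceDoor
            } catch (java.io.IOException dslUnreachable) {
                // The DSL may have failed and the resource been recovered
                // elsewhere: fetch the up-to-date location and retry once.
                cachedLocalEpr = door.queryLocalEpr(globalKey);
                return sendTo(cachedLocalEpr, request);
            }
        }

        // The actual SOAP transport to the DSL is left abstract in this sketch.
        protected abstract Object sendTo(String localEpr, Object request)
                throws java.io.IOException;

        public interface ServiceDoorClient {
            String queryLocalEpr(String globalKey) throws Exception;
        }
    }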

6 Reliability in DynaGrid

The failure of a resource generally leads to the total loss of the ServiceResources served on that resource. If we knew in advance that a ServiceResource was about to become inaccessible, we could migrate it to another stable resource to provide uninterrupted service. However, all existing WSRF-compliant systems are designed in such a way that each ServiceResource is handled only at its birthplace, since the WSRF specifications do not define any feature related to ServiceResource relocation. DynaGrid resolves this problem with a recovery mechanism coupled with ServiceResource replication and request logging.

We assume that ServiceDoor is executed on a reliable host; thus, DynaGrid currently provides no recovery mechanism for ServiceDoor. This assumption is acceptable since ServiceDoor is designed not to be overloaded and it occupies only one resource even when there is huge demand for the service.

6.1 ServiceResource Replication

DynaGrid provides a mechanism for ServiceResource replication. Every ServiceResource has its own replica on another DSL in the same ServicePartition. When a ServiceResource is created, PartitionManager makes its replica on another DSL by storing the first snapshot of the ServiceResource. All following execution requests for the ServiceResource are logged within the replica. By replaying the logged requests on the snapshot, a ServiceResource can be recovered from its replica. A new snapshot of the ServiceResource is replicated whenever a specified number of requests has been logged. Consequently, each replica stores the most recent snapshot plus the logs of the requests issued after that snapshot. This policy reduces the time for replaying logged requests and saves storage for logs. A snapshot is also taken whenever an execution request causes side-effects, since replaying such a request could break consistency. We add a field to the operation description in WSDL to indicate whether an operation causes side-effects; unfortunately, application developers must set this field manually in the current version of DynaGrid.

DSL provides a Web service method for transferring a serialized ServiceResource object and storing it in the corresponding replica. Before making a snapshot of a ServiceResource, DSL acquires a lock on the ServiceResource to make sure it is not being modified by any execution request. DynaGrid also provides a group of APIs which help service developers implement their ServiceResources as Serializable objects.
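A minimal sketch of this snapshot step, assuming a read-write lock discipline (requests hold the read lock, snapshots the write lock), is shown below; the helper names are not from DynaGrid's code.

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;
    import java.util.concurrent.locks.ReentrantReadWriteLock;
    import java.util.function.Supplier;

    // Illustrative sketch (assumed names): requests take the read lock while
    // they run, and snapshotting takes the write lock so the ServiceResource
    // is never serialized while an execution request is modifying it.
    public class SnapshotHelper {
        private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

        // Wrap each execution request with the read lock.
        public <T> T executeRequest(Supplier<T> request) {
            lock.readLock().lock();
            try { return request.get(); } finally { lock.readLock().unlock(); }
        }

        // Serialize a consistent snapshot to send to the replica DSL.
        public byte[] takeSnapshot(Serializable serviceResource) throws IOException {
            lock.writeLock().lock();
            try {
                ByteArrayOutputStream bytes = new ByteArrayOutputStream();
                try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                    out.writeObject(serviceResource);
                }
                return bytes.toByteArray();
            } finally {
                lock.writeLock().unlock();
            }
        }
    }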

Request logging is processed in the background while the request is executed. The result is returned to the client only after both the logging and the service execution have finished successfully. Note that request logging may take more time than the request execution itself, since it requires one more Web service invocation to the replica. Moreover, if the DSL of the replica crashes, the logging is delayed further by the recovery of the replica. In spite of this possibility of long delays, our synchronous logging scheme is essential for reliability. Without synchronous logging, a request log may be permanently lost: the client, unaware of the loss of the replica, would fail to re-issue the lost request, and the ServiceResource could not be properly recovered. In brief, all Web service requests from clients are executed atomically.
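The execute-and-log rule can be sketched as follows: logging is started in the background, the request is executed, and the reply is released only after both have succeeded. The names and the executor-based structure are assumptions, not DynaGrid's actual code.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.function.Supplier;

    // Sketch of the atomic execute-and-log rule (assumed names): the log is
    // shipped to the replica concurrently with the execution, but the reply
    // is released only after BOTH have succeeded, so a logged request can
    // always be replayed and an unlogged request is always visible to the
    // client as a failure it must re-issue.
    public class LoggedExecutor {
        private final ExecutorService background = Executors.newSingleThreadExecutor();
        private final ReplicaLogger replica;

        public LoggedExecutor(ReplicaLogger replica) { this.replica = replica; }

        public <T> T executeAtomically(byte[] requestLog, Supplier<T> execution)
                throws Exception {
            // Start logging in the background, then execute the request itself.
            Future<?> logged = background.submit(() -> replica.appendLog(requestLog));
            T result = execution.get();
            logged.get(); // block until the replica has durably stored the log
            return result; // only now does the client see a result
        }

        // Stand-in for the Web service call that stores the log on the replica DSL.
        public interface ReplicaLogger {
            void appendLog(byte[] requestLog);
        }
    }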

We set the number of replicas to one in order to minimize overhead, including network traffic and storage. Section 7.4 verifies that the single-replica policy adequately preserves the reliability of Web services. Nonetheless, DynaGrid can be extended to manage multiple replicas for Grid environments composed of very volatile resources and networks. In such cases, each Web service execution may take more time because it waits until the log is stored in all replicas.

6.2 Recovery of DSL

In DynaGrid, the information about a ServiceResource is distributed among DSL, PartitionManager, and ServiceDoor. DSL has all the information about the original ServiceResources as well as the snapshots and request logs of the replicas. PartitionManager keeps a mapping table between each ServiceResource key and the addresses of the DSLs where the ServiceResource and its replica reside. ServiceDoor maintains the local EPR of every ServiceResource.

The failure of a DSL can be detected in two ways. First, since every DSL periodically reports its status to PartitionManager, the PartitionManager concludes that a DSL has failed if no notification arrives within the configured time; in the current DynaGrid, notifications are sent every 6 s and the timeout value is 20 s. Second, the failure of a DSL can be detected by ClientProxy when it accesses the ServiceResource, or by other DSLs when they send request logs to the failed DSL. As soon as ClientProxies or other DSLs observe errors, they ask PartitionManager to check the availability of the ServiceResource or the replica. If PartitionManager confirms the DSL failure, it starts to recover all the affected ServiceResources and replicas based on its mapping table. Some errors can be caused by incorrect location information in ClientProxies or DSLs, in which case PartitionManager returns the correct information and lets them retry.
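The heartbeat-based half of this detection can be sketched as below, using the 20-s timeout quoted in the text; the class is a hypothetical stand-in for PartitionManager's monitoring loop.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch of the heartbeat detector: DSLs report every 6 s
    // (in the paper's configuration) and are declared failed when nothing
    // has arrived for TIMEOUT_MS (20 s).
    public class DslFailureDetector {
        static final long TIMEOUT_MS = 20_000;

        private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

        // Invoked by the notification handler whenever a DSL reports its status.
        public void onNotification(String dslAddress) {
            lastHeartbeat.put(dslAddress, System.currentTimeMillis());
        }

        // Invoked periodically by PartitionManager's monitoring loop.
        public List<String> findFailedDsls() {
            long now = System.currentTimeMillis();
            List<String> failed = new ArrayList<>();
            for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
                if (now - e.getValue() > TIMEOUT_MS) {
                    failed.add(e.getKey()); // trigger recovery for this DSL
                }
            }
            return failed;
        }
    }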

If a DSL fails, PartitionManager recovers all of its ServiceResources and replicas simultaneously. Figure 5a illustrates the recovery steps for a DSL failure. To recover a replica, PartitionManager creates a new replica on another DSL (step 2-2) and immediately replicates the current snapshot of the ServiceResource (step 3). PartitionManager then updates its ServiceResource mapping table (step 4) and informs the original DSL that the location of the replica has changed. The recovery process for a ServiceResource is similar, but PartitionManager first needs to order the DSL of the replica to replay the logged requests to reconstruct the up-to-date ServiceResource (step 2-1). It then creates a new replica (step 2-2) and replicates the snapshot of the ServiceResource from the replica (step 3). Finally, the mapping table is updated (step 4).

Fig. 5 The recovery process in DynaGrid: (a) DSL recovery steps, (b) PartitionManager recovery steps




PartitionManager can misjudge the failure of a DSL due to network problems. Moreover, a DSL may look alive to clients while it looks failed to PartitionManager or ServiceDoor. In such cases, requests already issued to the misjudged DSL finish properly, and subsequent requests are executed on the new DSL, since PartitionManager notifies the client's ClientProxy of the new DSL's address. The misjudged DSL is removed from the allocated DSL list of PartitionManager. After the network problem is resolved, the DSL may be allocated to the same Web service again.

Clients can also crash. If some clients become unavailable, their ServiceResources remain uncontrolled. These orphan ServiceResources are swept according to WS-ResourceLifetime, one of the WSRF standard specifications: every ServiceResource has its own lifetime, and in DynaGrid, DSLs and PartitionManagers automatically remove old ServiceResources together with their replicas and all logged requests.
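A sketch of such lifetime-based sweeping, in the spirit of WS-ResourceLifetime, might look as follows; the termination-time bookkeeping and the interface are assumptions made for illustration.

    import java.util.Iterator;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Sketch of lifetime-based garbage collection (assumed names): each
    // ServiceResource carries a termination time, and expired entries are
    // swept together with their replica and logged requests.
    public class LifetimeSweeper {
        public static final class Entry {
            final long terminationTimeMs; // set at creation, extensible by clients
            Entry(long terminationTimeMs) { this.terminationTimeMs = terminationTimeMs; }
        }

        private final Map<String, Entry> resources = new ConcurrentHashMap<>();

        public void register(String key, long lifetimeMs) {
            resources.put(key, new Entry(System.currentTimeMillis() + lifetimeMs));
        }

        // Called periodically by DSL and PartitionManager.
        public void sweep(ReplicaStore replicaStore) {
            long now = System.currentTimeMillis();
            for (Iterator<Map.Entry<String, Entry>> it = resources.entrySet().iterator();
                 it.hasNext(); ) {
                Map.Entry<String, Entry> e = it.next();
                if (e.getValue().terminationTimeMs < now) {
                    it.remove();                                   // drop the orphan
                    replicaStore.removeReplicaAndLogs(e.getKey()); // and its replica
                }
            }
        }

        public interface ReplicaStore {
            void removeReplicaAndLogs(String resourceKey);
        }
    }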

6.3 Recovery of PartitionManager

ServiceDoor can detect the failure of a PartitionManager when the notification from that PartitionManager times out. ServiceDoor can also sense the failure if a creation request sent to the PartitionManager throws an error.

A PartitionManager has the list of ServiceResources in its ServicePartition and the mapping table which maintains the locations of ServiceResources and their replicas, keyed by ServiceResource key. This information is duplicated in ServiceDoor and the DSLs so that DynaGrid can reconstruct it even if a PartitionManager fails. Recall that ServiceDoor knows the list of DSLs of every ServicePartition and that each DSL keeps the locations of the replicas of all the ServiceResources it hosts.

The recovery steps for a PartitionManager failure are depicted in Fig. 5b. When ServiceDoor observes the failure of a PartitionManager (step 1), it spawns a new PartitionManager (step 2) and sends it the list of DSLs which were managed by the failed PartitionManager (step 3). The new PartitionManager contacts all of these DSLs to gather the list of ServiceResources and replicas (step 4). Based on the collected information, the new PartitionManager rebuilds the mapping table (step 5). After the mapping table is rebuilt, the PartitionManager searches for ServiceResources and replicas that failed during its recovery and recovers them if any are found. Finally, it starts to manage the ServicePartition.

6.4 Load Balancing

Although PartitionManager always tries to dispatch ServiceResource creation requests to lightly-loaded DSLs, imbalance in resource utilization between DSLs is unavoidable, since the usage pattern of ServiceResources changes over time. To cope with such situations, DynaGrid provides a load balancing mechanism based on ServiceResource replication.

All DSLs in each ServicePartition periodically report their resource status to their PartitionManager. If the status violates the configured load balancing threshold, the PartitionManager starts the load balancing mechanism. In the current implementation of DynaGrid, the service provider can set load balancing thresholds based on CPU utilization, occupied memory size, and the number of concurrent ServiceResources serviced per DSL.

When the load balancing mechanism is activated, PartitionManager selects one third of the ServiceResources in the busy DSL. The ratio of one third was chosen to balance the overhead of ServiceResource migration against the room freed in the busy DSL; of course, the optimal ratio can vary with the changing resource demand of the Web service. The selection criterion is the load of the DSL on which each replica is located: ServiceResources whose replicas reside in lightly-loaded DSLs are selected. Once a ServiceResource is selected for load balancing, it is moved from the overloaded DSL to another DSL in the same way as if the ServiceResource had failed. If the ServiceResources cannot be accommodated in the existing DSLs, PartitionManager asks ServiceDoor to allocate additional DSLs.
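The selection rule reads almost directly as code; the sketch below takes one third of the busy DSL's ServiceResources, preferring those whose replicas sit on lightly-loaded DSLs. The types are illustrative assumptions.

    import java.util.Comparator;
    import java.util.List;
    import java.util.stream.Collectors;

    // Hypothetical sketch of the migration selection rule: migration reuses
    // the replication path, so resources whose replicas live on the least
    // loaded DSLs are the cheapest to move.
    public class LoadBalancer {
        public static final class Candidate {
            final String resourceKey;
            final double replicaHostLoad; // load of the DSL holding the replica
            public Candidate(String resourceKey, double replicaHostLoad) {
                this.resourceKey = resourceKey;
                this.replicaHostLoad = replicaHostLoad;
            }
        }

        public static List<String> selectForMigration(List<Candidate> onBusyDsl) {
            int count = onBusyDsl.size() / 3; // the "one third" ratio from the text
            return onBusyDsl.stream()
                    .sorted(Comparator.comparingDouble(c -> c.replicaHostLoad))
                    .limit(count)
                    .map(c -> c.resourceKey)
                    .collect(Collectors.toList());
        }
    }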


7 Evaluation

7.1 Environment and Benchmarks

We constructed a testbed with 22 servers connected by 100 Mbps Ethernet. Table 1 summarizes the servers used in our testbed. On each server, Linux 2.4.20 and JDK 1.4.2 are installed, and the container of Globus Toolkit version 4.0.1 is used as the hosting environment.

For our evaluation, we implemented three practical applications as WSRF-compliant Web services. The ray tracing service is a Web service implementation of RAJA [7], an open-source ray tracer. Its ServiceResource maintains a result buffer and a request queue on which scene data and the required quality operations are placed. The streaming buffer service exploits the server's physical memory to buffer streaming data, similar to RBNB (Ring Buffered Network Bus) [8]; RBNB has been used in many Grid systems, including NEESGrid [9], for data aggregation, streaming, and synchronization. The ServiceResource of the streaming buffer service contains the buffered data received from a source channel. Finally, in order to evaluate the overall effectiveness of DynaGrid, we developed a MapReduce [11] system called GridMR on DynaGrid and ran Apache Nutch [10], an open-source search engine written in Java, on GridMR.

As a microbenchmark, we also implemented the Add Web service, which simply performs an add operation and keeps the result of the calculation in its ServiceResource. The Add Web service is used to analyze the overheads of DynaGrid.

7.2 Adaptive Resource Provisioning

We carried out two experiments to show that DynaGrid adaptively manages its resources even when the request rate from clients increases sharply. In these experiments, DSLs are deployed only on the Pentium-4 servers of our testbed.

Table 1 List of servers in our testbed

Count  CPU                RAM     SMP
8      Pentium-4 2.8 GHz  1 GB    2-way
6      Pentium-4 2.8 GHz  512 MB  No
8      Pentium-3 850 MHz  1.5 GB  2-way

Fig. 6 The comparison of the request completion rate in the ray tracing service

First, we compare the request completion rate of the ray tracing service as the number of clients increases from 1 to 12, as shown in Fig. 6. We use a total of 11 DSLs in this experiment. For static provisioning, we allocate only five DSLs, assuming that the initial administrative decision deemed five resources to be enough. For adaptive provisioning, we allow DynaGrid to use all the DSLs, and each DSL is limited to serving only one tracing request at a time so that superfluous requests are directed to other DSLs. Each client repeats the same request, which takes about 10 min. As Fig. 6 shows, once the number of clients exceeds five, static provisioning is not resilient enough to support the unexpected resource demand. Adaptive provisioning, however, makes use of all the resources, effectively handling the increased request rate.

Second, we measure the amount of memory occupied by the streaming buffer service. Figure 7 depicts the change in the aggregated buffer size of 24 streaming channels, on which data are generated at 200 KB/s by the sources.

Fig. 7 The comparison of the aggregated buffer size in the streaming buffer service



In the case of adaptive provisioning, the service is initially deployed on two DSLs and all channels are created on them. On each DSL, the total memory capacity allocated to buffers is limited to 100 MB. When the total size of the buffers on a DSL reaches 100 MB, the load balancing mechanism of DynaGrid is triggered to relocate one third of the buffers. DynaGrid may allocate additional DSLs (up to eight) to the streaming buffer service to obtain more memory. In the static provisioning case, the channels are evenly distributed over four DSLs. Figure 7 shows that, with the adaptive provisioning scheme, the aggregated buffer size steadily expands until the streaming buffer service fully occupies the configured memory of every DSL, whereas the total buffer size is limited to 400 MB in the static provisioning case. Adaptive provisioning shows a slightly slower growth rate because the streaming buffer service is temporarily blocked during the relocation of ServiceResources.

Although DynaGrid enables flexible resource management, it introduces extra overhead compared to normal Web services. Table 2 presents the overhead of dynamic service deployment measured during the experiment shown in Fig. 6. Initializing a ServicePartition takes 1,009 ms, of which starting the PartitionManager instance and adding two DSLs into the ServicePartition require 530 and 479 ms, respectively. We can observe that most of the time during dynamic service deployment is spent transferring the service code, whose size is 1,890 KB for the ray tracing service. Although the overhead appears high, it is amortized because dynamic service deployment does not happen frequently. The two previous experiments confirm this: the overall throughput is not seriously affected by dynamic service deployment.

Table 2 Overhead of dynamic service deployment in the ray tracing service

Operation                                 Time (ms)
Initializing a ServicePartition           1,009
  Starting PartitionManager instance        530
  Adding two DSLs into ServicePartition     479
Dynamic service deployment                2,050
  Transferring service code (1,890 KB)    1,699

7.3 Scalability

7.3.1 Overhead of Service Execution in DynaGrid

Table 3 summarizes the ServiceResource creation and execution times compared to a normal Web service deployed on the GT4 container. ServiceResource creation in DynaGrid requires 654 ms on average, which is almost five times longer than direct creation. This overhead is mainly caused by the additional network steps required for contacting ServiceDoor and PartitionManager, creating a replica, and transferring the first snapshot. Again, this overhead is amortized since ServiceResource creation occurs only once.

A Web service invocation takes 174 ms including the request logging in the replica. Note that, without ServiceResource replication, DynaGrid shows the same performance as direct execution thanks to ClientProxy. If ClientProxy has an invalid local EPR, DynaGrid requires an additional 104 ms to query ServiceDoor. The logging overhead of 70 ms can be hidden for time-consuming Web services, since the logging is a simple Web service call to the replica DSL and is performed concurrently with the actual Web service execution.

7.3.2 Improvement by ServicePartition

This experiment shows the effectiveness of PartitionManager in ServiceResource creation.

Table 3 Overhead of ServiceResource creation and execution on DynaGrid

Operation                         Time (ms)  Ratio
Direct creation on GT4               130      1
Creation in DynaGrid                 654      5.03
  ServiceDoor/P.Manager latency      193      1.48
  Replica creation                   127      0.97
  Copying the first snapshot         204      1.57
  ServiceResource creation           130      1
Direct execution on GT4              105      1
Execution in DynaGrid                174      1.65
  Request logging                     70      0.67
Via ServiceDoor                      275      2.62
  Query ServiceDoor                  104      0.99



ServiceDoor and the PartitionManagers run on Pentium-4 servers, and the DSLs on eight Pentium-3 servers. We compare the ServiceResource creation throughput using one PartitionManager and using four PartitionManagers. Figure 8 shows that four PartitionManagers achieve better throughput than a single PartitionManager. This is because, even though all ServiceResource creation requests must inevitably pass through ServiceDoor, the PartitionManagers perform most of the creation procedure, and this work is distributed among the four of them. The figure also shows the maximum throughput of ServiceDoor for ServiceResource creation: assuming no overhead is spent in PartitionManager and DSL during creation, 15 ServiceResources can be created per second in our testbed.

ServicePartitions also make DynaGrid scalable by distributing the monitoring overhead. Each PartitionManager constantly collects monitoring messages from the DSLs in its ServicePartition to check the status and availability of each DSL. If ServiceDoor directly monitored all the DSLs, it could easily be overloaded by a large number of DSLs. In Fig. 9, we measure the average CPU load of PartitionManager caused by handling notification messages from DSLs. As shown in Fig. 9, the monitoring overhead is not negligible when there are hundreds of DSLs or more. By using several ServicePartitions, however, DynaGrid avoids the situation in which one ServicePartition becomes overloaded by too many DSLs.

Fig. 8 The comparison of the ServiceResource creation throughput

Fig. 9 The overhead of monitoring according to the number of monitored DSLs, each DSL sending a notification every 5 s (x-axis: the number of DSLs monitored, 0–400; y-axis: monitoring overhead as CPU load, %)

7.3.3 Improvement by ClientProxy

As described in Section 5.2, ClientProxy caches the local EPR of a ServiceResource and sends requests to the ServiceResource directly. In this subsection, we quantitatively measure the scalability improvement achieved by direct service invocation.
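The caching behavior can be sketched as follows; the names (ClientProxySketch, lookupEpr, invokeAt) are hypothetical stand-ins, not the actual ClientProxy interfaces.

    // Hypothetical sketch of ClientProxy's EPR caching. A cached local EPR lets
    // the request go straight to the DSL hosting the ServiceResource; only when
    // the cached EPR turns out to be invalid (e.g., the ServiceResource moved)
    // does ClientProxy pay the extra ~104 ms to query ServiceDoor again.
    class ClientProxySketch {
        private String cachedEpr;                 // local EPR of the ServiceResource
        private final ServiceDoorStub door;

        ClientProxySketch(ServiceDoorStub door) { this.door = door; }

        Object invoke(String resourceKey, Object request) {
            if (cachedEpr == null) {
                cachedEpr = door.lookupEpr(resourceKey);  // extra round trip
            }
            try {
                return invokeAt(cachedEpr, request);      // direct invocation
            } catch (InvalidEprException e) {
                cachedEpr = door.lookupEpr(resourceKey);  // refresh and retry once
                return invokeAt(cachedEpr, request);
            }
        }

        Object invokeAt(String epr, Object request) { /* Web service call */ return null; }
    }

    interface ServiceDoorStub { String lookupEpr(String resourceKey); }
    class InvalidEprException extends RuntimeException { }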

Figure 10 illustrates the throughput achieved with the simple Add service on various worker pools composed of Pentium-4 and Pentium-III servers. We use Pentium-4 servers for ServiceDoor and PartitionManager. In the first three cases, every request from the client passes through ServiceDoor; the remaining cases use direct Web service invocation through ClientProxy.

The results show that, without direct Web service invocation, DynaGrid cannot use the whole resource pool effectively, since ServiceDoor becomes a bottleneck.

Fig. 10 The effect of ClientProxy on scalability, comparing aggregated throughput (y-axis: throughput in executions/sec, 0–120) across worker pools: 5, 11, and 19 DSLs invoked via ServiceDoor, and pools of 5–19 DSLs on various Pentium-4/Pentium-III mixes invoked directly through ClientProxy


Even when DynaGrid uses the same number of DSLs, we observe much higher throughput with the direct Web service invocation technique. The results also show that the throughput with ClientProxy scales with the number of DSLs hosting the service.

7.4 Reliability

Table 4 shows the recovery costs incurred by the failure of a PartitionManager or a DSL under various conditions. Recovering a DSL that hosts two ServiceResources and two replicas takes only 358 ms. For a ServicePartition hosting 36 ServiceResources and 36 replicas on 18 DSLs, the PartitionManager can be recovered from failure in 2,039 ms. Table 4 shows that most recoveries complete within a couple of seconds.

A failed DSL remains unrecovered until its PartitionManager detects the failure, and a failed PartitionManager until ServiceDoor detects it. In the current configuration, ServiceDoor checks the availability of each PartitionManager, and each PartitionManager checks its DSLs, every 20 s. If both a PartitionManager and DSLs in the same ServicePartition fail at the same time, a few more seconds are needed to recover the PartitionManager first. Therefore, the time a ServiceResource remains unavailable can exceed 20 s in the worst case.
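To make the worst case concrete: a DSL that fails immediately after an availability check is noticed only at the next check, up to 20 s later. If its PartitionManager fails at the same moment, ServiceDoor must first detect that failure (the two 20 s check intervals run concurrently) and spend roughly 0.7–2 s recovering the PartitionManager (Table 4), which then detects the dead DSL; hence the unavailability window is bounded by roughly the 20 s check interval plus a few seconds of recovery time.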

The latencies experienced by a client due to the recovery of the nodes holding a ServiceResource and its replica are 1,085 ms and 631 ms, respectively. These latencies include the time to confirm the failure through PartitionManager and the delay caused by the recovery process itself.

Table 4 Summary of recovery costs (SRs: ServiceResources, reps: replicas)

Target             Size                        Time (ms)
DSL                2 SRs, 2 reps                     358
                   16 SRs, 16 reps                 1,886
                   20 SRs, 20 reps                 2,474
PartitionManager   4 DSLs, 4 SRs, 4 reps             657
                   4 DSLs, 40 SRs, 40 reps           825
                   18 DSLs, 36 SRs, 36 reps        2,039

Since the ServiceResource recovery process involves checkpointing and replicating the ServiceResource snapshot, the recovery time depends mainly on the size of the ServiceResource.
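A simplified view of this recovery path is sketched below; all names are hypothetical, since the paper does not publish the recovery API, and only the checkpoint/replay/re-replicate structure comes from the text above.

    // Hypothetical sketch of ServiceResource recovery. When the DSL hosting a
    // ServiceResource fails, the PartitionManager restores it from the replica's
    // snapshot on a surviving DSL, replays the logged requests, and re-establishes
    // a fresh replica. The cost is dominated by moving the snapshot, which is why
    // recovery time grows with ServiceResource size.
    class RecoverySketch {
        void recover(ResourceRecord rec, DslStub newHost, DslStub newReplicaHost) {
            byte[] snapshot = rec.replica.checkpoint();        // checkpoint the replica's state
            newHost.restore(rec.resourceId, snapshot);         // recreate the ServiceResource
            newHost.replay(rec.replica.loggedRequests());      // reapply requests logged since the snapshot
            newReplicaHost.restore(rec.resourceId, snapshot);  // re-establish a fresh replica
        }
    }

    interface DslStub {
        void restore(String resourceId, byte[] snapshot);
        void replay(java.util.List<byte[]> requests);
    }

    interface ReplicaStub {
        byte[] checkpoint();
        java.util.List<byte[]> loggedRequests();
    }

    class ResourceRecord {
        final String resourceId;
        final ReplicaStub replica;
        ResourceRecord(String resourceId, ReplicaStub replica) {
            this.resourceId = resourceId;
            this.replica = replica;
        }
    }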

In general, the size of a ServiceResource rarely exceeds tens of megabytes, so the recovery time stays below 10 s. Moreover, since failures are infrequent, this overhead is acceptably small compared to the overall execution time.

The survival duration of resources participating in the Grid can be inferred from previous studies of peer-to-peer systems. From observations of Kazaa and Gnutella, Saroiu et al. found that the average session duration in peer-to-peer systems was about 60 min [6]. Note that, for a Grid system composed of under-utilized resources at universities or research institutions, the average survival duration should be much longer. Moreover, if a resource leaves the Grid intentionally rather than by accident, it can ask its PartitionManager to relocate its ServiceResources safely. Therefore, 60 min is a very conservative value for the average survival duration in a Grid environment.

In DynaGrid, a ServiceResource is lost only when the DSL holding its replica crashes during the recovery of the original ServiceResource, or vice versa. Assume that the recovery of a ServiceResource takes at most 30 s and that failures follow a Poisson process with the average survival time of 60 min discussed above. Under these assumptions, the probability of losing a ServiceResource is estimated at only 0.8%. Remember that these parameter values represent a very unstable environment. If the average resource lifetime is extended to 12 h, the ServiceResource loss ratio drops to 0.07%, and DynaGrid provides 99.93% ServiceResource availability.
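As a sanity check on these figures: with Poisson failures, the time to the next failure of the surviving copy is exponentially distributed, so the probability that it also fails within the recovery window t is

    P(loss) = 1 - exp(-t / tau),

where tau is the average survival time. With t = 30 s and tau = 3,600 s (60 min), P(loss) = 1 - exp(-30/3600) ≈ 0.0083 ≈ 0.8%; with tau = 43,200 s (12 h), P(loss) = 1 - exp(-30/43200) ≈ 0.0007 ≈ 0.07%, matching the values quoted above.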

7.5 MapReduce on DynaGrid

We selected Apache Nutch [10] to evaluate the effectiveness of DynaGrid in a practical setting. Nutch generates indexes of web sites crawled from input URLs. The pseudo-code for Nutch is as follows.


Inject initial URLs into fetch list;
loop for depths
    Filter fetch list to remove recently fetched web pages;
    Fetch web pages in fetch list;
    Update fetch list with outgoing links in fetched web pages;
end loop;
Generate indexes from fetched web pages;

The latest version of Nutch exploits MapReduce [11], proposed by Google. MapReduce is a programming model for processing large data sets in a distributed environment. In MapReduce, a set of key/value pairs is generated from another set of key/value pairs in a two-step data-processing flow: Map and Reduce. Each Map and Reduce function can be executed on distributed servers in parallel.
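In Java terms, the model boils down to two function signatures; the interfaces below are a minimal sketch of the programming model described in [11], not GridMR's actual API.

    // A minimal sketch of the MapReduce programming model [11].
    // The interface names are illustrative.
    import java.util.List;

    interface Mapper<K1, V1, K2, V2> {
        // Emit intermediate key/value pairs for one input pair.
        void map(K1 key, V1 value, Emitter<K2, V2> out);
    }

    interface Reducer<K2, V2, V3> {
        // Merge all intermediate values that share the same key.
        void reduce(K2 key, List<V2> values, Emitter<K2, V3> out);
    }

    interface Emitter<K, V> {
        void emit(K key, V value);
    }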

We developed GridMR, a distributed application built from WSRF-based Web services that implement the Map and Reduce functions, together with a GridMR library, a central manager implemented as an attachable Java library. The GridMR library creates the necessary number of ServiceResources, named MRWorkers, according to the size of the input data, and starts the execution. Each MRWorker represents the intermediate data of a Map or Reduce function. Together, a GridMR library and several MRWorkers constitute a complete MapReduce application. Nutch can then inject input into, and obtain output from, GridMR through the GridMR library.

Fig. 11 The execution of GridMR on DynaGrid (Nutch exchanges input and output data with the GridMR library, which drives MRWorkers for Map and MRWorkers for Reduce hosted on DSLs; numbered steps 1–5 trace the input data through Map, the intermediate data through Reduce, and the output data back to Nutch)

Table 5 The execution time (seconds) of Nutch with GridMR on DynaGrid

Environment                    Input 1   Input 2   Input 3
Original (single server)           591     1,435     2,862
GridMR (5 DSLs)                    630     1,386     2,497
GridMR (11 DSLs)                   383       997     1,832
GridMR (11 DSLs, 1 failure)        407     1,059     1,906

Figure 11 shows the detailed execution steps of GridMR on DynaGrid. On receiving the input data from Nutch (step 1), the GridMR library splits the input data and transfers the pieces to a set of MRWorkers in charge of the Map function (step 2). These MRWorkers execute the Map function on their input and transfer the results, called the intermediate data [11], to the MRWorkers responsible for the Reduce function (step 3). After all Reduce MRWorkers finish their data processing, the GridMR library collects the output data from them (step 4) and returns it to Nutch (step 5).
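These steps can be condensed into a driver loop like the following sketch; MRWorker, startMap, and waitForOutput are hypothetical names, since the GridMR API is not published in the paper.

    // Hypothetical sketch of the GridMR library's driver for steps 1-5 in Fig. 11.
    import java.util.ArrayList;
    import java.util.List;

    class GridMRDriverSketch {
        List<byte[]> run(byte[] input, List<MRWorker> mappers, List<MRWorker> reducers) {
            // Steps 1-2: split the input received from Nutch and send one piece
            // to each Map MRWorker.
            List<byte[]> splits = split(input, mappers.size());
            for (int i = 0; i < splits.size(); i++) {
                // Map workers push their intermediate data to the reducers (step 3).
                mappers.get(i).startMap(splits.get(i), reducers);
            }
            // Step 4: collect the output from every Reduce MRWorker.
            List<byte[]> output = new ArrayList<>();
            for (MRWorker r : reducers) {
                output.add(r.waitForOutput());
            }
            return output; // Step 5: the caller hands this back to Nutch.
        }

        private List<byte[]> split(byte[] input, int n) {
            List<byte[]> parts = new ArrayList<>();
            int chunk = (input.length + n - 1) / n; // ceiling division
            for (int off = 0; off < input.length; off += chunk) {
                int len = Math.min(chunk, input.length - off);
                byte[] part = new byte[len];
                System.arraycopy(input, off, part, 0, len);
                parts.add(part);
            }
            return parts;
        }
    }

    interface MRWorker {
        void startMap(byte[] split, List<MRWorker> reducers); // pushes intermediate data (step 3)
        byte[] waitForOutput();                               // blocks until Reduce output is ready
    }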

Table 5 reports the execution times of Nutch with GridMR for three different inputs, compared to the performance of the original Nutch on a single server. For GridMR, the number of MRWorkers for the Map or Reduce function is configured to equal the number of DSLs. Table 5 shows that the performance of Nutch improves as the number of DSLs increases. However, the benefit over the original Nutch is limited, because GridMR uses the time-consuming Web services protocols and transfers a large amount of data over the network. Note that GridMR successfully finishes the data processing with only negligible performance degradation even when one of the DSLs fails.

8 Conclusions

In this paper, we proposed DynaGrid, a new framework that offers adaptive, scalable, and reliable resource provisioning for WSRF-compliant applications. Adaptive resource provisioning is realized by our new dynamic service deployment mechanism and ServiceDoor. ClientProxy and distributed PartitionManagers enhance the scalability of DynaGrid.


ServiceResource replication and recovery improve the reliability of DynaGrid. DynaGrid is complementary to existing Grid middleware such as GT4, and can be used with any Java-based WSRF-compliant hosting environment.

Currently, DynaGrid does not provide any additional security mechanism. We assume that all containers in the same VO trust each other and that only certified users can access the host container, enforced by security mechanisms at other layers. We have focused on the adaptability, scalability, and reliability of WSRF services and ServiceResources; security issues will be addressed in a future version of DynaGrid.

In the current version of DynaGrid, ServiceDoor is assumed to reside on a reliable node. In practice, however, a centralized ServiceDoor can affect the stability of the Web service. As future work, we will enhance DynaGrid to be resilient against ServiceDoor failure; for example, duplicating ServiceDoor on one or more backup nodes could be a solution. We also plan to release the source code of DynaGrid to the public in the near future.

Acknowledgement This work was supported by the Korea Science and Engineering Foundation (KOSEF) grant funded by the Korea government (MEST) (No. R01-2007-000-11832-0).

References

1. Foster, I., Kesselman, C., Tuecke, S.: The anatomy of the Grid: enabling scalable virtual organizations. Lecture Notes in Computer Science 2150. Springer-Verlag (2001)
2. Foster, I., Kesselman, C., Nick, J., Tuecke, S.: The physiology of the Grid: an open Grid services architecture for distributed systems integration. Globus Project, www.globus.org/research/papers/ogsa.pdf (2002)
3. OASIS: Web Services Resource Framework (WSRF) TC, http://www.oasis-open.org/committees/wsrf/
4. Foster, I.: Globus Toolkit version 4: software for service-oriented systems. Lecture Notes in Computer Science, vol. 3779, pp. 2–13. Springer-Verlag (2005)
5. Qi, L., Jin, H., Foster, I., Gawor, J.: HAND: highly available dynamic deployment infrastructure for Globus Toolkit 4. In: Proceedings of the 15th EUROMICRO International Conference on Parallel, Distributed and Network-Based Processing (2007)
6. Saroiu, S., Gummadi, P., Gribble, S.: A measurement study of peer-to-peer file sharing systems. In: Proceedings of Multimedia Computing and Networking (2002)
7. The Raja Project: http://raja.sourceforge.net (2007)
8. Creare Inc.: Ring Buffered Network Bus, http://rbnb.creare.com/RBNB (2007)
9. The NEESGrid Project: http://it.nees.org/ (2007)
10. Apache Nutch Project: http://nutch.apache.org/nutch (2007)
11. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI), pp. 137–150 (2004)
12. Beckett, J.: Scaling IT for the Planet, http://www.hpl.hp.com/news/2001/oct-dec/planetary.html (2002)
13. Welsh, M., Culler, D., Brewer, E.: SEDA: an architecture for well-conditioned, scalable internet services. In: Proceedings of the Eighteenth Symposium on Operating Systems Principles (SOSP-18) (2001)
14. Djilali, S., Herault, T., Lodygensky, O., Morlier, T., Fedak, G., Cappello, F.: RPC-V: toward fault-tolerant RPC for internet connected desktop Grids with volatile nodes. In: Proceedings of the 2004 ACM/IEEE Conference on Supercomputing (2004)
14. Djilali, S., Herault, T., Lodygensky, O., Morlier, T.,Fedak, G., Cappello, F.: RPC-V: toward fault-tolerantRPC for internet connected desktop Grids with volatilenodes. In: Proceedings of the 2004 ACM/IEEE confer-ence on Supercomputing (2004)