
Computing (2013) 95:611–632. DOI 10.1007/s00607-012-0250-8

A reliable checkpoint storage strategy for grid

Sana Malik · Babar Nazir · Kalim Qureshi · Imran Ali Khan

Received: 20 July 2012 / Accepted: 28 November 2012 / Published online: 14 December 2012
© Springer-Verlag Wien 2012

Abstract Computational grids are composed of heterogeneous, autonomously managed resources. In such an environment, any resource can join or leave the grid at any time, which makes the grid infrastructure inherently unreliable and results in delays and failures of executing jobs. Fault tolerance therefore becomes a vital aspect of the grid for realizing reliability, availability and quality of service. The most common technique for achieving fault tolerance in High Performance Computing is rollback recovery. It relies on the availability of checkpoints and the stability of storage media; the checkpoints are therefore replicated on storage media. If replication is not done in a proper manner, it increases the job execution time. Furthermore, dedicating powerful resources solely to checkpoint storage results in a loss of the computational power of these resources and may create bottlenecks when the load on the network is high. To address these problems, this paper proposes a checkpoint-replication-based fault tolerance strategy named the Reliable Checkpoint Storage Strategy (RCSS). In RCSS, the checkpoints are replicated on all checkpoint servers in the grid in a distributed manner. This decreases the checkpoint replication time and in turn improves the overall job execution time. Additionally, if a resource fails during the execution of a job, RCSS restarts the job from its last valid checkpoint taken from any checkpoint server in the grid. Furthermore, to increase grid performance, the CPU cycles of checkpoint servers are also utilized during high load on the network. To evaluate the performance of RCSS, simulations are carried out using GridSim. The simulation results show that RCSS reduces intra-cluster checkpoint wave completion time by 12.5 % with a varying number of checkpoint servers. RCSS also reduces checkpoint wave completion time by 50 % with a varying number of clusters. Additionally, RCSS reduces replication time within a cluster by 39.5 %.

S. Malik · B. Nazir (B) · I. A. Khan
Department of Computer Science, COMSATS Institute of IT, Abbottabad, Pakistan
e-mail: [email protected]

K. Qureshi
Department of Computer Science, Kuwait University, Kuwait City, Kuwait


Keywords Checkpoint storage · Fault tolerance in grid · Checkpoint replication · Grid computing

Mathematics Subject Classification 65C99

1 Introduction

The development of Wide Area Networks and the availability of powerful resources are changing the way servers interact. Technological advancements make it possible to utilize geographically distributed resources in multiple owner domains to solve large-scale problems in the fields of science, engineering and commerce [1]. Therefore, grid and cluster computing have gained popularity in High Performance Computing [2]. For example, the ImmunoGrid project of the European Union aims to simulate the immune system at various levels, i.e. the molecular, cellular and organ levels [3]. SAS grid computing facilitates businesses in developing a shared environment to process huge amounts of data and analytical programs efficiently [4]. Folding@home, consisting of the idle computing resources of thousands of volunteered PCs, aims to simulate protein folding and drug design for disease research [5].

These technologies pose many challenges with respect to computational nodes, storage media and interconnection mechanisms that affect overall system reliability. The probability of resource failure is much greater than in conventional parallel computing, and job execution suffers fatally in case of a resource crash [6,7]. Consequently, large applications (e.g. scientific and engineering) that require the computing power of hundreds or thousands of nodes create problems with respect to reliability [8,11]. Thus, fault tolerance is an indispensable characteristic of the grid [9]. In traditional implementations a failure causes the whole distributed application to shut down and to be restarted manually [10]. To avoid restarting the application from the beginning, a technique called rollback recovery is used, which is based on the checkpointing concept. In checkpoint-based strategies, the executed portion of the process is periodically saved as a checkpoint to a stable storage medium that is not subject to failures [6]. In the case of a failure, further computation is started from one of the previously saved states. Checkpointing can be classified into three categories: uncoordinated checkpointing, coordinated checkpointing and communication-induced checkpointing [12].

In uncoordinated checkpointing, each process independently saves its checkpoints. In case of failure, the processes search the set of saved checkpoints for a consistent state from which computation can be resumed. The main advantage of this technique is that each process can take a checkpoint whenever it is most convenient. However, uncoordinated checkpointing suffers from rollback propagation, called the domino effect. The domino effect is the worst-case scenario in achieving a consistent state of the system: it may cause the system to roll back to its initial state, losing all the executed portions of the jobs before the failure [13]. Rollback propagation also requires each processor to store multiple checkpoints, which leads to a large storage overhead. Due to the high cost of rollback, this paper considers uncoordinated checkpointing unsuitable for the subject environment.


Fig. 1 Fault tolerance by checkpoint/restart: processes p1, p2, …, pn on processing nodes 1 to n take checkpoints Cp1, Cp2, …, Cpn at successive time intervals and save them to stable storage

In coordinated checkpointing, the processes coordinate with each other in order to form a consistent global state of checkpoints. Coordinated checkpointing can be blocking, as in [21], or non-blocking, as in [16]. It does not suffer from rollback propagation, so it streamlines the recovery process; it also minimizes storage overhead, since just a single checkpoint is needed. As a consistent checkpoint needs to be determined prior to writing the checkpoint to stable storage, the technique suffers from large latency. However, many scientific applications, such as finite element method simulations and molecular dynamics simulations, are iterative in nature and allow checkpoints to be taken between iterations. In communication-induced checkpointing, the processes initiate some of their checkpoints independently. To prevent the domino effect, this strategy forces additional checkpoints.

The fault-tolerance mechanism must have the ability to save generated checkpoints on stable storage [14]. Usually this can be achieved by installing dedicated checkpoint servers. Figure 1 illustrates the concept of checkpoint storage on a stable storage medium (a dedicated checkpoint server).

However, these dedicated servers may become a bottleneck as the grid size increases. To overcome this problem, shared grid nodes can be used to store checkpoint data. One way to provide checkpoint storage reliability is replication. In replication, multiple copies of a checkpoint are stored on different nodes; thus, data can be recovered even when part of the system is unavailable. Another approach is to break the data into fragments and add some redundancy, so that the data can be recovered from a subset of the fragments. The most common technique used to break data into redundant fragments is the addition of parity information.

Parity calculation is less reliable than replication, since only the failure of one checkpoint can be tolerated; the failure of two consecutive checkpoints results in loss of data. Replication can provide N − 1 failure tolerance, where N is the total number


of nodes in the system. Although replication provides greater fault tolerance, it can have adverse effects on the system if it is not done in a proper manner. For example, it can increase the checkpoint replication time, which in turn results in a longer job execution time. Moreover, dedicating powerful resources solely as checkpoint storage media results in a loss of the computational power of these resources, and they may become bottlenecks when the load on the network is high. Therefore an efficient and reliable checkpoint storage strategy is needed, which provides fault tolerance, ensures job execution within an acceptable time limit, and uses the powerful resources of the grid in an optimal manner.
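The difference between the two approaches can be seen in a minimal sketch (an illustration under assumed byte-string checkpoints, not the paper's implementation): with XOR parity, exactly one lost checkpoint can be rebuilt from the survivors and the parity block, whereas replication survives as long as one copy remains.

    # Illustrative sketch: XOR parity rebuilds one missing checkpoint; replication tolerates N-1 losses.
    from functools import reduce

    def xor_blocks(blocks):
        # XOR equal-length byte blocks together
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

    checkpoints = [b"ckpt-A1", b"ckpt-B2", b"ckpt-C3"]   # equal-sized checkpoints
    parity = xor_blocks(checkpoints)                     # stored on a parity node

    # Recover the second checkpoint after a single failure: XOR the survivors with the parity.
    recovered = xor_blocks([checkpoints[0], checkpoints[2], parity])
    assert recovered == checkpoints[1]                   # one failure: recoverable
    # If two checkpoints are lost at once, the parity alone cannot rebuild them,
    # while replication would still succeed as long as one replica of each survives.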

Besides checkpoint storage, a fault tolerant strategy must also address the heterogeneity and dynamic nature of resources, which is unavoidable in a grid environment [15].

In this paper the Reliable Checkpoint Storage Strategy (RCSS) is proposed. To deal with checkpoint server or whole-cluster failure, it replicates the checkpoints over checkpoint servers. Replicating a checkpoint over all checkpoint servers in the grid may take a long time, which results in a longer execution time. RCSS shortens the replication time in two ways: firstly, by replicating checkpoints in a distributed manner, and secondly, by having the checkpoint servers partially acknowledge the client after recording the chunk. Upon receiving the partial acknowledgement the client continues with execution; meanwhile the checkpoint server replicates the chunk over the other servers as well as on other clusters. RCSS also utilizes the CPU cycles of dedicated servers in the case of high network load, and hence minimizes the wastage of processing power of these most stable nodes of the clusters.

The rest of the paper is organized as follows: Sect. 2 discusses related work; Sect. 3 presents the proposed strategy; Sect. 4 discusses experimental results; and Sect. 5 concludes the paper.

2 Related work

In order to tolerate faults, strategies based on checkpointing save the executed portion of processes on stable storage. Hence, the executed portion can be retrieved in case of failure and further computation can be carried out. Checkpoints can be initiated in three ways, which have been discussed in the third paragraph of the introduction. Chandy and Lamport [16] were the first to introduce a coordinated checkpointing protocol for distributed applications.

The forthcoming subsections present the diskless and disk-based checkpointing strategies proposed so far.

2.1 Disk-based checkpointing

Disk-based checkpointing is advantageous because it can tolerate the failure of N − 1 nodes, where N is the total number of nodes in the system. Bouabache et al. [10] consider the scenario of a cluster of clusters. Replication of checkpoints is done over stable storage called checkpoint servers. The number of checkpoint servers is fixed in each cluster. Checkpoint servers are responsible for replicating the received chunk


of the checkpoints on all other checkpoint servers. Two variants of replication are presented in [10]: the Simple Hierarchical Replication Strategy and the Greedy Hierarchical Replication Strategy.

Fig. 2 Checkpoint replication in GHRS: a client in cluster 1 sends chunk f to the primary CS, which forwards it to intermediary CSs in its own cluster and to a pseudo-primary CS in cluster 2; request (Rq), reply (Rp/Rn) and acknowledgement (ACK1, ACK2f, ACK3f, ACKg) messages coordinate the replication

2.1.1 Simple hierarchical replication strategy (SHRS)

The checkpoint server receiving a chunk of a checkpoint from a client becomes the primary checkpoint server for that chunk. The primary checkpoint server becomes responsible for replicating the checkpoint on all the checkpoint servers of its group as well as on the checkpoint servers of other groups. Each primary server replicates the chunk to the nodes (s + i) mod 2^m (where 's' represents the primary checkpoint server, 'i' is the bit identifier of a checkpoint server, and 'm' is the total number of checkpoint servers in a cluster). In this technique, however, the intermediary servers have no role to play.

2.1.2 Greedy hierarchical replication strategy (GHRS)

In GHRS, the replication process is accelerated by involving the intermediary servers in the replication. For this, a set of checkpoint servers, called the children of checkpoint server 's', is defined for each checkpoint server using the formula s, s + 2^0, s + 2^1, …, s + 2^(m−1) (where 'm' represents the total number of checkpoint servers in the cluster). However, this technique suffers from the overhead of the time to store the checkpoint on stable storage [6].

Figure 2 shows the transition steps involved in checkpoint replication, where f is the chunk to be replicated, CS is a checkpoint server, Rq is a request to inquire whether the chunk has been received or not, and Rp and Rn are the responses to the request Rq. The client sends its checkpoint to a checkpoint server. The receiving server becomes primary for that chunk. The primary checkpoint server then replicates the chunk to its children according to the formula mentioned in the previous paragraph and depicted in Fig. 2. Each intermediary server then sends a request to its children to inquire about the arrival of the chunk; if a child sends a negative reply, the checkpoint is forwarded to that child.
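A rough sketch of this inquiry-driven forwarding follows (illustrative only, not the protocol of [10]; the children() helper here is a simplified placeholder, whereas GHRS derives the child set from the formula above): a server asks each child whether the chunk has already arrived (Rq) and forwards the chunk only on a negative reply (Rn).

    # Sketch of GHRS-style inquiry before forwarding (illustrative only).
    def children(s, m):
        # Placeholder child set: the next two servers in the cluster
        # (the real tree follows the formula s + 2^0, s + 2^1, ... given above).
        return [(s + 1) % m, (s + 2) % m] if m > 2 else [i for i in range(m) if i != s]

    def replicate_greedy(server, chunk_id, chunk, stores, m):
        stores[server][chunk_id] = chunk                  # this server now holds chunk f
        for child in children(server, m):
            # Rq: ask the child whether it already received the chunk
            if chunk_id in stores[child]:                 # Rp: positive reply, skip the transfer
                continue
            # Rn: negative reply, so forward the chunk and let the child replicate further
            replicate_greedy(child, chunk_id, chunk, stores, m)

    m = 6
    stores = {s: {} for s in range(m)}                    # per-server chunk store
    replicate_greedy(0, "job7-chunk3", b"...", stores, m) # the client sent the chunk to server 0

Each inquiry saves a transfer when the child already has the chunk, but costs a request/reply round trip; RCSS later removes these control messages altogether.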

Similar examples from the data grid domain are OceanStore [22] and Farsite [23]. OceanStore allows global access to persistent data through a peer-to-peer network. These networks


are collections of untrusted servers; OceanStore protects data through redundancy and cryptographic techniques. The persistent object is the fundamental unit of OceanStore, and each object is assigned a globally unique identifier. To improve performance, objects can be replicated anywhere at any time. OceanStore uses a Byzantine-fault-tolerant commit protocol to ensure consistency among replicas. The objects are modified through updates and exist in two forms, active and archival. The latest version of the data with a handle for updates is called the active replica, while the archival form is a permanent, read-only form of the object. Archival forms of objects are encoded through an erasure code and replicated over thousands of servers. The replica for an object is retrieved through a probabilistic algorithm; if this algorithm fails, the retrieval is left to a deterministic algorithm.

Farsite is a protected and scalable file system. It is physically distributed among a group of untrusted desktop workstations, but logically provides the functionality of a centralized file server. Machines in the Farsite system can perform three roles: client, member of a directory group, and file host. Farsite provides reliability and availability mainly through replicating directory metadata and file data. File hosts use raw replication and directory groups use Byzantine-fault-tolerant replication. The system is made scalable by employing a distributed hint mechanism and delegation certificates for pathname translations. It caches file data locally to improve performance.

2.2 Diskless checkpointing

To reduce the overhead of disk-based checkpointing, Plank et al. [17] introduced the concept of diskless checkpointing, in which stable storage media are replaced by memory for checkpoint storage. Chen et al. [18] presented a checksum-based checkpointing strategy that relies on diskless checkpointing. The technique is scalable, as the overhead to survive K failures does not increase with an increasing number of application processes. The key idea in this technique is the pipelining of data segments. All the computational processors and the checkpoint processor are organized in a chain. Each processor receives a data segment from its predecessor, calculates the checksum and sends it to the next processor in the chain. The process continues until the segment reaches the checkpoint server, which is at the end of the chain. The checkpoint server receives a segment at the end of each step, and the checkpoint is complete as soon as the last segment is received at the checkpoint server.
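A minimal sketch of this pipelining idea follows (assumed interfaces; the encoding in [18] is more elaborate): each processor folds its own segment into the running checksum it received from its predecessor and passes the result down the chain, so the checkpoint processor at the end only ever receives already-encoded segments.

    # Sketch of chain-pipelined checksum encoding (illustrative, not the algorithm of [18]).
    def fold_segment(running, segment):
        # XOR the local segment into the checksum received from the predecessor.
        return bytes(a ^ b for a, b in zip(running, segment))

    def pipeline_checksum(segments_per_processor):
        # segments_per_processor[p][k] is the k-th equal-sized segment held by processor p.
        num_segments = len(segments_per_processor[0])
        seg_len = len(segments_per_processor[0][0])
        checkpoint_node = []                              # what the node at the end of the chain stores
        for k in range(num_segments):
            running = bytes(seg_len)                      # all-zero start of the chain for segment k
            for proc_segments in segments_per_processor:  # pass through every processor in order
                running = fold_segment(running, proc_segments[k])
            checkpoint_node.append(running)               # one encoded segment arrives per step
        return checkpoint_node

    encoded = pipeline_checksum([[b"aa", b"bb"], [b"cc", b"dd"], [b"ee", b"ff"]])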

In checkpoint mirroring (MIR) [17], each processor saves a copy of its checkpoint to another processor's local disk. In the case of a processor failure, the copy of that checkpoint is available for a spare processor to continue the execution of that process. The drawback of this technique is the need for space to store m + 1 checkpoints per processor (see Fig. 3).

N+1 Parity (PAR) [19] is a diskless checkpointing technique. It overcomes the space overhead of checkpoint mirroring with a parity calculation whose result is stored on a central disk. The PAR checkpointing process is presented in Fig. 4.

Sobe [19] presented two variants of parity calculation based on a Redundant Array of Independent Disks (RAID)-like storage scheme: Parity Grouping over Local Checkpoints (PG-LCP) and Intra Checkpoint Distribution (ICPD). In PG-LCP, parity is calculated over local checkpoints and stored on an additional node.


Fig. 3 Checkpoint mirroring: the N application processors write checkpoints to local disks and then copy each checkpoint to a neighbour

Fig. 4 Parity checkpointing: the N application processors write checkpoints to local disks and a parity block is stored on a central disk

If the size of the local checkpoint differs from the checkpoints received, then to calculate the parity the size of each checkpoint has to be enlarged to the size of the biggest checkpoint; the unused bits are assumed to be zero in the calculation. To restart a single failed process, its last saved state has to be reconstructed by XOR-ing all other checkpoints and the parity. This process requires the transfer of N − 1 checkpoints and the parity. The unit of the parity scheme is the entire checkpoint (see Fig. 5).

In ICPD, the checkpoint is divided into chunks at the local nodes and the parity is calculated over these chunks (as shown in Fig. 6). The distributed software system transfers and writes the parity and the chunks to disks. The unit of parity is a chunk. In addition to the parity information, the length of the checkpoint is also stored at the parity node. The checkpoint can therefore be recovered in its original length even if a chunk and the information related to it are lost.

Diskless checkpointing incurs a high memory overhead for storing checkpoints. In calculating the parity, each computing node has to communicate with the parity node, which may cause a communication bottleneck.


Fig. 5 Parity grouping of local checkpoints

Fig. 6 Intra-checkpoint distribution

Table 1 Comparison of checkpointing techniques

Checkpoint storage technique | Type | Degree of fault tolerance | Encoding required | Disk-based/diskless | Intra-client communication during checkpointing
Checkpoint replication | Task level | Up to required level | No | Disk-based | No
Parity calculation | Task level | At maximum of 1 node or checkpoint | Yes | Diskless | Yes

To recover a failed computing node, checkpoints from all other computing nodes and from the parity node are required. Retrieving checkpoints in this way is also an expensive task in terms of communication.

Table 1 summarizes the features on the basis of which checkpoint replication is preferred over parity calculation for providing fault tolerance.

It is evident from the table that checkpoint replication (disk-based checkpointing) is the preferred technique for providing fault tolerance.


In [6], we proposed a fault tolerant job scheduling strategy for an economy-driven grid. The strategy is based on adaptive task replication. It maintains a fault history of grid resources called the fault index. The fault index of each resource is updated whenever the resource completes a job or fails. The Grid Resource Broker replicates a job on multiple grid resources, based on the susceptibility of each grid resource to faults as indicated by its fault index. Thus, if a fault occurs at a grid resource, the result of the replicated job on another grid resource can be used, and user jobs can be completed within the specified deadline and budget.

We also evaluated different fault tolerant techniques (retrying, alternate resource, alternative task and checkpointing) in [9]. The performance of these techniques is measured by standard metrics (throughput, turnaround time, waiting time and transmission delay). Through experimentation, it is concluded that checkpointing gives good results as compared to the other fault tolerant techniques.

In our paper “A hybrid fault tolerance technique in grid computing system” [11], we proposed two hybrid fault tolerant techniques, i.e. alternate task with checkpointing and alternate task with retry. The two techniques adopt the good features of workflow-level and task-level fault tolerant techniques and overcome their limitations. After experimentation we concluded that alternate task with checkpointing improves the grid's reliability significantly more than alternate task with retry.

In [15] we proposed a fault tolerant job scheduling strategy that is based on adaptive task checkpointing. This strategy also maintains a fault index for grid resources. Whenever there is a job to submit, the grid resource broker uses the fault index to apply a different amount of task checkpointing, i.e. inserting checkpoints into a task at different intervals. This strategy successfully schedules jobs and tolerates faults gracefully. Furthermore, it improves the overall job execution time and reduces the job execution cost. However, we did not consider the domino effect in this strategy, which may affect its performance adversely.

This paper focuses on a disk-based checkpoint replication strategy along with the optimal utilization of the stable resources of a computational grid. The total time to replicate the checkpoints in the grid is analyzed. RCSS replicates the checkpoints in the grid in a distributed manner and hence decreases the overall replication time notably. It also utilizes the stable storage devices of the grid for computational purposes under high network load.

3 Reliable checkpoint storage strategy (RCSS)

The system model, assumptions and the mechanism of RCSS, along with its algorithms and an architectural diagram, are discussed below.

3.1 System model

As the Grid is a collection of powerful resources that are managed by different administrative domains, in this paper the Grid is considered as a cluster of clusters (Fig. 7).


Fig. 7 System architecture: clusters of clients, dedicated checkpoint servers and dual-role servers, each connected by a local network and linked to the other clusters through the Internet

The Grid architecture is defined as follows:

• There are 'K' clusters in the grid.
• Each cluster has 'N' nodes.
• Among these 'N' nodes, some are dedicated checkpoint servers and some have dual roles to play, i.e. under high load these nodes perform processing and also store chunks, while under low load they only perform processing; the rest of the machines in the cluster perform processing only.
• The clusters are connected through front-end machines.
• It is quite possible for any component to fail at any time in a grid environment. A coordinated checkpoint protocol handles client failures.

A failure can be of any of the following types:

• A checkpoint server in a cluster may be disconnected due to a link failure.
• A cluster may be disconnected from the rest of the grid due to failure of the front-end machine of the cluster.
• Simultaneous failure of all the components of a cluster could occur.

3.1.1 Assumptions

• A group failure will occur only if any connection to the checkpoint server is lost, for example a front-end machine failure due to a crash.
• The system will crash only if K − 1 clusters get disconnected.
• In the case of a cluster disconnection or group failure, the processes that were being executed in that cluster will be restarted in another one.
• There will be no more than x − 1 checkpoint-server failures in a cluster, where 'x' is the total number of checkpoint servers in that cluster.


3.2 Proposed strategy

Our strategy is based on disk-based checkpointing. It works in two phases: in the storage phase the checkpoints are stored on checkpoint servers, and the recovery phase is executed when any of the computing nodes fails and the last valid checkpoint images are downloaded.

3.2.1 The storage phase

In this phase, all the compute nodes take a checkpoint of the executed portion of the job. To accelerate the storage process, all the checkpoints are divided into chunks. All compute nodes then send their chunks to the checkpoint servers in a distributed manner, i.e., in round-robin fashion. The server receiving a chunk from a compute node becomes the primary server for that chunk. The primary servers store their respective chunks and send partial acknowledgements back to the compute nodes. Upon receiving the partial acknowledgements, the compute nodes resume processing. Meanwhile, the primary servers replicate their chunks on all the servers in the same cluster as well as on other clusters in the grid. Each checkpoint server receiving the image from another checkpoint server sends back an ACK after properly storing that chunk. The intra-cluster replication is done in a hierarchical way according to the following formula:

s, (s + 2^0) mod m, (s + 2^1) mod m, …, (s + 2^(n−1)) mod m    (1)

where s is the primary checkpoint server (the server receiving the chunk from the compute node), m is the total number of checkpoint servers in a cluster, and n is the bit identifier of the checkpoint server (e.g. for seven checkpoint servers, three bits are used as their identifiers).

Meanwhile, the checkpoint server also sends its chunk to checkpoint servers in other clusters. The checkpoint server that receives a chunk from a checkpoint server of another cluster becomes primary for that chunk. This primary checkpoint server stores the chunk and sends a partial acknowledgement back to the checkpoint server from which it received the chunk. Replication of the chunk on other checkpoint servers then continues. When all the chunks are replicated on all the checkpoint servers in the grid, the checkpoint wave is said to be completed, and this is announced globally. The inter-cluster replication is also done in a hierarchical manner according to the following formula:

c, (c + 2^0) mod m, (c + 2^1) mod m, …, (c + 2^(n−1)) mod m    (2)

where c is the cluster number (here c is the number of the cluster from which the checkpoint server is sending the chunk to checkpoint servers in other clusters in the grid), m is the total number of clusters in the grid, and n is the bit identifier of a cluster (e.g. for four clusters, two bits are used as their identifiers). The compute nodes in a cluster and all clusters in the grid are logically arranged in a tree shape according to Eqs. (1) and (2). Arranging compute nodes and clusters in such a way makes the overall process much more efficient.
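The two target sets can be computed directly from formulas (1) and (2); the short sketch below (hypothetical helper names, assuming zero-based identifiers) lists, for a primary server s or an originating cluster c, the peers that receive the chunk in the hierarchical order described above.

    # Sketch of the replication targets given by formulas (1) and (2) (illustrative helper names).
    def replication_targets(start, total, bits):
        # start: primary checkpoint server s (or originating cluster c)
        # total: number of checkpoint servers in the cluster (or clusters in the grid), m
        # bits : bit identifier length n (e.g. 3 bits for seven servers, 2 bits for four clusters)
        return [start] + [(start + 2**i) % total for i in range(bits)]

    # Intra-cluster replication, formula (1): 7 checkpoint servers, 3-bit identifiers, primary s = 5
    print(replication_targets(5, 7, 3))   # [5, 6, 0, 2]
    # Inter-cluster replication, formula (2): 4 clusters, 2-bit identifiers, originating cluster c = 1
    print(replication_targets(1, 4, 2))   # [1, 2, 3]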


3.2.2 The recovery phase

At the start of the recovery phase, the checkpoint servers conduct an agreement on the last valid checkpoint wave. In this agreement process, all checkpoint servers send their last valid checkpoint wave number. The greatest number agreed by the majority of the servers becomes the result, and processing then starts from that point onward. For this process, model B, also used by Bouabache et al. in [10], is used. The model has the following properties:

• After some time 'x', all correct processes will detect a process that has failed.
• After some time 'y', most of the processes will recognize a process that is correct.
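A simplified sketch of the agreement step is given below (illustrative only; the paper relies on the failure-detector model B of [10] rather than this exact procedure): every reachable checkpoint server reports its last valid wave number, and the largest number reported by a majority of servers is taken as the wave to restart from.

    # Sketch of agreeing on the last valid checkpoint wave (illustrative only).
    from collections import Counter

    def agree_last_valid_wave(reported_waves):
        # reported_waves: last valid wave number reported by each reachable checkpoint server
        majority = len(reported_waves) // 2 + 1
        counts = Counter(reported_waves)
        agreed = [wave for wave, count in counts.items() if count >= majority]
        return max(agreed) if agreed else None   # greatest wave agreed by a majority

    # Five servers report their last valid wave; wave 12 is the greatest with majority support.
    print(agree_last_valid_wave([12, 12, 12, 11, 13]))   # 12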

The storage and recovery phases are described in the RCSS algorithm (Algorithm 1).

Algorithm 1: RCSS Algorithm

Input: Jobs (gridlets) to be processed
Output: Chunks replicated

Begin

1. Collect available resource information from the scheduling advisor
2. Submit jobs to available resources
3. When the timer expires, pause the jobs
4. for each job do
5.     Take checkpoint
6.     Store checkpoint on a checkpoint server according to the Checkpoint Storage Algorithm
7. end for
8. Send partial acknowledgement to executing resources
9. Resume the jobs on all resources
10. Replicate the checkpoints within and among clusters according to the Checkpoint Replication Algorithm
11. Send full acknowledgement to executing resources
12. if a job or a resource fails do
13.     Submit the job to another resource in the grid
14.     Restart the job on that resource according to the Checkpoint Recovery Algorithm
15. end if

End

3.3 RCSS architecture diagram

Figure 8 describes the working of the proposed strategy. The components of the architecture diagram are explained below.

3.3.1 Grid user

The grid user submits jobs to grid resources by specifying the characteristics of the jobs (i.e. the number of gridlets and their lengths) and quality of service requirements (i.e. processing time). A gridlet here is a package that holds all the information related to a job, i.e. its length expressed in Millions of Instructions Per Second (MIPS), its input and output file sizes, and the initiator of the job.
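For illustration, such a job package can be thought of as a small record like the one below (a hypothetical stand-in, not GridSim's actual Gridlet class):

    # Hypothetical job record mirroring the information a gridlet carries (not GridSim's API).
    from dataclasses import dataclass

    @dataclass
    class JobPackage:
        job_id: int
        length: float            # job length (the paper expresses this in MIPS)
        input_file_size: int     # bytes
        output_file_size: int    # bytes
        initiator: str           # the grid user that submitted the job

    job = JobPackage(job_id=1, length=42000.0, input_file_size=300, output_file_size=300, initiator="U1")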

3.3.2 Grid resource

Grid resources register themselves with the Grid Information Service (GIS). At the time of registration the resources declare their capabilities.

Fig. 8 Architecture diagram for RCSS: (1) the grid user requests resource information from the scheduling advisor, (2) receives the available resource information, (3) submits the job, and (4) the job dispatcher dispatches jobs to resources 1 … n; the checkpoint manager (5) pauses the jobs on timer expiry, (6) gets the checkpoints, (7) stores them on the checkpoint server, (8) resumes the jobs on partial acknowledgement, (9) replicates the checkpoints and (10) receives the full acknowledgement; on a resource/job failure (11–13) the failed job is resubmitted, (14–15) the most recent checkpoint is requested and returned, and (16–17) the completed job details and result are passed back to the user

These capabilities include the number of processors, processing speed, processing cost, time zone, etc. These basic parameters determine the execution time of the job.

3.3.3 Scheduling advisor

The scheduling advisor is responsible for collecting details of the available resources. When a user comes with a job to execute, the user requests the scheduling advisor for details of the available resources. The scheduling advisor prepares the list of available resources based on the requirements of the user and sends it to the grid user.

3.3.4 Job dispatcher

The job dispatcher dispatches the jobs from the queue one by one to the grid resources for which they are intended.

3.3.5 Checkpoint manager

The checkpoint manager keeps track of time and initiates checkpoints when required. It prompts all the grid resources that are executing jobs to stop processing. While the jobs in the grid are paused, the checkpoint manager takes images of the executed parts of the jobs. It divides the checkpoints into chunks and sends them to checkpoint servers for storage.


Algorithm 2: Checkpoint Storage Algorithm

Input: Checkpoints
Output: Chunks stored

Begin

1. for each job in the grid do
2.     Take checkpoint of the job
3.     Divide checkpoint into chunks
4.     Send chunks to checkpoint servers in round-robin fashion
5. end for

End
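A compact sketch of steps 3–4 is given below (illustrative; chunk size and server numbering are assumptions): the checkpoint image is cut into fixed-size chunks, and chunk k is dealt to checkpoint server k mod (number of servers).

    # Sketch of chunking a checkpoint and assigning chunks round-robin (illustrative only).
    def split_into_chunks(image, chunk_size):
        return [image[i:i + chunk_size] for i in range(0, len(image), chunk_size)]

    def assign_round_robin(chunks, num_servers):
        # Returns {server_index: [chunks]} with chunk k sent to server k mod num_servers.
        assignment = {s: [] for s in range(num_servers)}
        for k, chunk in enumerate(chunks):
            assignment[k % num_servers].append(chunk)
        return assignment

    chunks = split_into_chunks(b"checkpoint-image-bytes" * 10, chunk_size=64)
    print(assign_round_robin(chunks, num_servers=3))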

3.3.6 Checkpoint server

The checkpoint server is a vital component of the grid. It stores chunks and replicates the chunks on other checkpoint servers in the grid. After storing a chunk, it sends a partial acknowledgement to the grid resource so that it may continue with processing.

Algorithm 3: Checkpoint Replication Algorithm

Input: Stored chunks
Output: Chunks replicated

Begin

1. for each chunk stored do
2.     Replicate the chunk to the rest of the checkpoint servers in the cluster according to formula (1)
3. end for
4. for each chunk stored do
5.     Replicate the chunk to other clusters according to formula (2)
6. end for

End

The second phase of RCSS is checkpoint recovery. If a job fails, it is assigned to another computational node in the grid and processing is restarted from the last checkpoint. The Checkpoint Recovery Algorithm shows the steps involved in this process.

Algorithm 4: Checkpoint Recovery Algorithm

Input: Failed job
Output: Job resumed on another node from the last checkpoint

Begin

1. for each failed job do
2.     Send a request to all checkpoint servers in the grid for the chunks of the failed job
3.     Receive chunks from the checkpoint servers
4.     Reconstruct the image of the last valid checkpoint
5.     Restart the job from that checkpoint onwards
6. end for

End
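The reconstruction in steps 3–4 can be sketched as follows (illustrative; the chunk indexing and wave selection are assumptions consistent with the storage sketch above): chunks of the failed job are gathered from whichever servers hold them and concatenated back into the checkpoint image in index order.

    # Sketch of rebuilding a checkpoint image from chunks spread over checkpoint servers.
    def collect_chunks(job_id, wave, servers):
        # servers: list of per-server stores mapping (job_id, wave, chunk_index) -> chunk bytes
        gathered = {}
        for store in servers:                       # any server holding a chunk can supply it
            for (jid, w, idx), chunk in store.items():
                if jid == job_id and w == wave:
                    gathered.setdefault(idx, chunk)
        return gathered

    def rebuild_image(job_id, wave, servers):
        gathered = collect_chunks(job_id, wave, servers)
        # Concatenate chunks in index order to obtain the last valid checkpoint image.
        return b"".join(gathered[i] for i in sorted(gathered))

    servers = [{("job7", 3, 0): b"AAAA"}, {("job7", 3, 1): b"BBBB"}, {("job7", 3, 0): b"AAAA"}]
    print(rebuild_image("job7", 3, servers))        # b"AAAABBBB"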

3.3.7 Illustrative example

Let’s say

U1, U2, …, Un are grid users,
T1, T2, …, Tn are tasks to be submitted to the grid,
Sa is the Scheduling Advisor,
Jb is the Job Dispatcher,
CS1, CS2, …, CSn are the checkpoint servers,
Cm is the Checkpoint Manager, and
R1, R2, …, Rn are the resources in the grid.


1. U1 requests Sa for available resource information.
2. Sa sends the required information to U1.
3. U1 then submits the job to Jb.
4. Jb dispatches jobs to R1, R2, …, Rn.
5. When the timer expires, all the jobs in the grid are paused.
6. The executed portion of each process is recorded as a checkpoint.
7. The checkpoints are stored on CS1, CS2, …, CSn.
8. CS1, CS2, …, CSn send partial acknowledgements to R1, R2, …, Rn, and computation is resumed.
9. CS1, CS2, …, CSn then replicate their checkpoints over each other.
10. When replication is done, CS1, CS2, …, CSn send full acknowledgements to R1, R2, …, Rn.
11. If a job fails, Cm is intimated about it.
12. The failed job is sent back to Jb.
13. Jb assigns the failed job to another Rn (here n is equal to 1, 2, 3, …) in the grid.
14. Rn requests a CS for the last valid checkpoint of that job.
15. The requested CS sends back the requested data and Rn restarts computation.
16. The result of the completed job is sent to Sa.
17. Sa sends the result back to U1.

4 Simulation results and discussion

The following subsections present the simulation setup, the experiments conducted, and the results and discussion.

4.1 Simulation setup

The experiments were conducted in the GridSim simulator [20]. We used Windows XP SP2 on an Intel Pentium IV 2.4 GHz machine with 512 MB of RAM and a 40 GB hard disk. For the simulation, it was assumed that within each cluster all the checkpoint servers are connected in a complete graph. During the experiments we used 1–30 clusters, 1–30 checkpoint servers, and 50–1,000 clients (jobs).

4.2 Results and discussion

This subsection presents the simulation results for different performance evaluation parameters and their discussion.

4.2.1 Replication time within a cluster

The replication time is the time it takes for all the chunks in the grid to be replicated on all checkpoint servers in the grid. In Fig. 9 the numbers on the x-axis represent the number of checkpoint servers and the numbers on the y-axis represent time in seconds.


Fig. 9 Replication time (in seconds) with respect to different numbers of checkpoint servers (1–9) for the Reliable Checkpoint Storage Strategy and the Greedy Replication Strategy

Initially, to compare the performance of the two strategies, i.e. the Greedy Hierarchical Replication Strategy and the Reliable Checkpoint Storage Strategy, the number of clients was set to 200, the number of clusters was set to one, the checkpoint size per client to 1 MB, and the number of checkpoint servers was varied.

Figure 9 shows the time required for replication within a cluster with respect to different numbers of checkpoint servers. It can be seen from Fig. 9 that RCSS replicates checkpoints to all of the servers in the group much faster than GHRS. The reason is that in Greedy Hierarchical Replication the servers exchange request and reply messages to inquire about the arrival of a chunk and then act accordingly, which results in a longer checkpoint replication time. RCSS, in contrast, does not exchange control messages among checkpoint servers to inquire about chunk arrival; rather, each checkpoint server sends the chunk to its peers with a sequence number. If the receiver has already received a chunk with the same sequence number, it discards the received chunk; otherwise it saves it. This clearly decreases the checkpoint replication time, as is evident from Fig. 9.
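A small sketch of this duplicate suppression follows (illustrative; field names are assumptions): each forwarded chunk carries a sequence number, and a server that has already stored that sequence number simply drops the copy instead of exchanging request/reply messages.

    # Sketch of sequence-number based duplicate suppression during replication (illustrative).
    class CheckpointServer:
        def __init__(self):
            self.seen = set()        # sequence numbers already stored
            self.chunks = {}

        def receive(self, seq_no, chunk):
            if seq_no in self.seen:  # duplicate: discard silently, no Rq/Rp exchange needed
                return False
            self.seen.add(seq_no)
            self.chunks[seq_no] = chunk
            return True

    cs = CheckpointServer()
    print(cs.receive(42, b"chunk"))   # True  (stored)
    print(cs.receive(42, b"chunk"))   # False (duplicate discarded)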

4.2.2 Checkpoint wave completion time in single cluster

The checkpoint wave completion time is the total time from pausing the jobs for taking checkpoints until the checkpoints have been replicated on all the checkpoint servers in the grid.

To investigate the scalability of clients’ numbers and time required to replicatecheckpoints within a cluster, the cluster and checkpoint server number were fixedat one and six, respectively. It was observed during experiments that the checkpointwave completion time depends on the number of clients. More the number of clients,greater is the data to be stored. Figure 10 shows the experimental result. To identifythe step (client communication time and replication time) that affects the checkpointwave completion time the most, the two steps are isolated in Figs. 11 and 12.


Fig. 10 Checkpoint wave completion time (in seconds) with varying numbers of clients (50–1,000) for the Reliable Checkpoint Storage Strategy and the Greedy Replication Strategy

Fig. 11 Checkpoint storage (client communication) time in seconds with respect to varying numbers of clients (50–1,000) for the Reliable Checkpoint Storage Strategy and the Greedy Replication Strategy

a. Client communication time. Client communication time is the time it takes the clients to send their chunks to the checkpoint servers in round-robin fashion.

Figure 11 shows the effect of client communication time. When the number of clients was greater than 400, dual-role nodes were used to store their own checkpoints locally, which reduced the clients' communication time significantly. For more than 500 clients, some stable nodes store their own checkpoints locally, and for more than 600 clients some checkpoint servers perform processing too.


Fig. 12 Replication time (in seconds) with varying numbers of clients (50–1,000) for the Reliable Checkpoint Storage Strategy and the Greedy Replication Strategy

b. Replication time. In Fig. 12, the replication time is the total time required to replicate the chunks on all checkpoint servers in the cluster with respect to an increasing number of clients.

The graph clearly depicts that replication time influences the checkpoint wave completion time the most. Compared to GHRS, RCSS performed well for two reasons:

• There is no use of control packets.
• Dual-role nodes are used for checkpoint storage, which decreases the overall checkpoint wave completion time.

4.2.3 Checkpoint wave completion time with varying number of clusters

Figure 13 illustrates that by using a tree topology RCSS outperforms GHRS, which replicates the chunks in all clusters in a flat order (the primary checkpoint server in a cluster sends its chunks to all the clusters in the grid). For this experiment the number of clients 'c' was set to 100, the number of checkpoint servers 'cs' was set to 30, and the number of clusters 'k' was varied. There were c/k clients and cs/k checkpoint servers per cluster. The initial increase in checkpoint wave completion time is due to the increase in inter-cluster communication links, but after 20–22 clusters RCSS gave stable results because of the topology. To make the graph easier to understand, client communication time and replication time are shown in two separate graphs as follows.

Figure 14 shows that as the number of clusters increases, the number of clients per cluster decreases, and hence the client communication time decreases too. In this graph the communication time of RCSS and GHRS is the same, because the number of clients per cluster is small and RCSS does not use the checkpoint servers for job execution.

Figure 15 shows the impact of the topology on the replication step with a decreasing number of clients and checkpoint servers per cluster.


Fig. 13 Checkpoint wave completion time (in seconds) with varying numbers of clusters (1–30) for the Reliable Checkpoint Storage Strategy and the Greedy Replication Strategy

Fig. 14 Checkpoint storage time (in seconds) with varying numbers of clusters (1–30) for the Reliable Checkpoint Storage Strategy and the Greedy Replication Strategy

It is evident from the figure that RCSS performs better than GHRS. The reason is that the checkpoint servers in RCSS do not exchange request and reply messages for chunks. Furthermore, when the number of clusters increases, the topology used in RCSS overcomes the drawback of slower inter-cluster communication links.

For this experiment the number of clients per cluster was set at 100 and the number of checkpoint servers per cluster at 20. In Fig. 16, RCSS outperforms GHRS because of its intra-cluster and inter-cluster replication strategy. GHRS takes more time to replicate chunks for two reasons: firstly, it exchanges request and reply messages for replicating chunks, and secondly, it resumes computation only when the chunk is replicated on all checkpoint servers in the grid.


Fig. 15 Checkpoint replication time (in seconds) with varying numbers of clusters (1–30) for the Reliable Checkpoint Storage Strategy and the Greedy Replication Strategy

Fig. 16 Checkpoint wave completion time (in seconds) with a fixed number of clients and checkpoint servers per cluster and 1–8 clusters, for the Reliable Checkpoint Storage Strategy and the Greedy Replication Strategy

In contrast, RCSS eliminates the need for request and reply messages and resumes computation on receiving the partial acknowledgement, which results in a shorter checkpoint wave completion time.

A comparison of the mechanisms involved in RCSS and GHRS is given in Table 2.

5 Conclusion

Fault tolerance is an important characteristic of the grid. The system is made fault tolerant by using a rollback recovery mechanism, which relies on the availability of checkpoints. Most often, special devices are dedicated to storing checkpoints. It is assumed that these devices can never fail, but reality contradicts this assumption: in a grid or cluster environment any device can fail at any time.


Table 2 Mechanism involved

Checkpoint strategy | Control message exchange | Computation at checkpoint servers | Replication among clusters | Task resuming | Checkpoint wave completion time
Reliable checkpoint storage strategy | No | Yes | Distributed manner | After partial ack | Becomes stable after some points
Greedy hierarchical replication strategy | Yes | No | Flat replication | At full ack | Increases proportionally

Table 3 Time comparison of RCSS and GHRS

Performance evaluation parameter | Reliable checkpoint storage strategy (s) | Greedy hierarchical replication strategy (s)
Average checkpoint replication time (1 cluster, 200 clients, 1–9 checkpoint servers) | 40 | 68.8
Average checkpoint wave completion time (1 cluster, 6 checkpoint servers, 50–1,000 clients) | 0.013 | 0.028
Average client communication time (1 cluster, 6 checkpoint servers, 50–1,000 clients) | 0.004 | 0.005
Average checkpoint replication time (1 cluster, 6 checkpoint servers, 50–1,000 clients) | 0.009 | 0.023
Checkpoint storage time (30 checkpoint servers, 100 clients, 1–30 clusters) | 0.004 | 0.004
Checkpoint replication time (30 checkpoint servers, 100 clients, 1–30 clusters) | 0.522 | 1.137
Checkpoint wave completion time (30 checkpoint servers, 100 clients, 1–30 clusters) | 0.526 | 1.141
Checkpoint wave completion time (100 clients/cluster, 20 checkpoint servers/cluster, 1–8 clusters) | 3.5 | 7.875

Furthermore, dedicating powerful resources to checkpoint storage only results in wastage of these resources when they are needed the most.

RCSS ensures the availability of checkpoints in the case of checkpoint server failure or even in the case of cluster failure. RCSS also utilizes the CPU cycles of some of the dedicated checkpoint servers for computation when the load on the network is high. The results show that RCSS reduced the intra-cluster checkpoint wave completion time by 12.5 % as compared to GHRS.


The checkpoint wave completion time with a varying number of clusters is reduced by 50 % in RCSS, and for intra-cluster replication RCSS outperformed GHRS with a 39.5 % reduction in time. Table 3 presents some key findings of the comparison of RCSS and GHRS.

References

1. Foster I, Kesselman C, Tuecke S (2001) The anatomy of the grid: enabling scalable virtual organizations. Int J Supercomput Appl 15:200–222
2. Nandagopal M, Uthariaraj VR (2010) Fault tolerant scheduling strategy for computational grid environment. Int J Eng Sci Technol 2(9):4361–4372
3. Halling-Brown MD, Moss DS, Shepherd AJ et al (2009) A computational grid framework for immunological applications. Philos Trans A Math Phys Eng Sci 367(1898):2705–2716
4. SAS grid computing. http://www.sas.com/technologies/architecture/grid/index.html#section=1
5. Pande lab, Stanford University. Folding@home. http://folding.stanford.edu
6. Nazir B, Qureshi K, Manuel P (2012) Replication based fault tolerant job scheduling strategy for economy driven grid. J Supercomput 1–19
7. Yu J, Buyya R (2005) A taxonomy of workflow management systems for grid computing. J Grid Comput 3:29
8. Latchoumy P, Khader PSA (2011) Survey on fault tolerance in grid computing. Int J Comput Sci Eng Surv 2(4):97–110
9. Khan FG, Qureshi K, Nazir B (2010) Performance evaluation of fault tolerance techniques in grid computing system. Comput Electr Eng 36:1110–1122
10. Bouabache F, Herault T, Fedak G (2008) Hierarchical replication techniques to ensure checkpoint storage reliability in grid environment. In: 8th IEEE international symposium on cluster computing and the grid, pp 475–483
11. Qureshi K, Khan FG, Manuel P, Nazir B (2011) A hybrid fault tolerance technique in grid computing system. J Supercomput 56(1):106–128
12. De Camargo RY, Kon F (2006) Strategies for checkpoint storage on opportunistic grids. IEEE Comput Soc 7:1
13. Gupta B, Rahimi S, Allam V, Jupally V (2008) Domino effect free crash recovery for concurrent failures in cluster federation. In: Proceedings of the 3rd international conference on advances in grid and pervasive computing, pp 4–17
14. Cheng CW, Wu JJ, Liu P (2008) QoS-aware, access-efficient, and storage-efficient replica placement in grid environments. J Supercomput 49:1614–1627
15. Nazir B, Qureshi K, Manuel P (2009) Adaptive checkpointing strategy to tolerate faults in economy based grid. J Supercomput 50(1):1–18
16. Chandy KM, Lamport L (1985) Distributed snapshots: determining global states of distributed systems. ACM Trans Comput Syst 3:63–75
17. Plank JS (1996) Improving the performance of coordinated checkpointers on networks of workstations using RAID techniques. IEEE Trans Parallel Distrib Syst, pp 76–85
18. Chen Z, Dongarra J (2008) A scalable checkpoint encoding algorithm for diskless checkpointing. In: 11th IEEE high assurance systems engineering symposium, pp 71–79
19. Sobe P (2003) Stable checkpointing in distributed systems without shared disks. In: Parallel and distributed processing symposium, p 8
20. Buyya R, Murshed M (2002) GridSim: a toolkit for the modeling and simulation of distributed resource management and scheduling for grid computing. Concurr Comput Pract Exp 14(13–15):1175–1220
21. Tamir Y, Sequin C (1984) Error recovery in multicomputers using global checkpoints. In: 13th international conference on parallel processing, pp 32–41
22. Kubiatowicz J et al (2000) OceanStore: an architecture for global-scale persistent storage. SIGPLAN Not 35:11
23. Adya A et al (2002) Farsite: federated, available, and reliable storage for an incompletely trusted environment. SIGOPS Oper Syst Rev 36:299–314
