[ieee 2007 international conference on information and communication technology - dhaka, bangladesh...

International Conference on Information and Communication Technology, ICICT 2007, 7-9 March 2007, Dhaka, Bangladesh

A BROADCAST FAULT-TOLERANT HIERARCHICAL TOKEN-BASED MUTUAL EXCLUSION ALGORITHM

Yasser Mansouri, Mohammad Moallemi, Amin Rasoulifard, and Hossain Deldari

Department of Computer Science and Engineering, Faculty of Engineering
Ferdowsi University of Mashhad, Mashhad, Iran
E-mail: { ya_ma2O, mo_mol6, am_ra84}@stu-mail.um.ac.ir, and [email protected]

ABSTRACT

Fault tolerance is a key feature for every grid-based algorithm. In this paper we propose a fault tolerance technique for a hierarchical mutual exclusion algorithm. The algorithm is based on Naimi-Trehel's token-based mutual exclusion algorithm. It uses intra-cluster broadcasts to achieve this end and tolerates N-1 permanent crashes of N nodes. We also give a sketch of the proof of the algorithm and its integrity.

1. INTRODUCTION

Obtaining exclusive access to a resource is a fundamental problem in distributed systems. Many problems in operating systems, data replication, distributed shared memory, distributed databases, etc., require that a resource be allocated to at most a single process at a time. For almost two decades, numerous algorithms have been proposed to implement mutual exclusion [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. These algorithms are divided into two categories: permission-based [1, 2, 3, 8] and token-based [4, 5, 6, 9, 10]. In permission-based algorithms, a node requesting entry to the critical section is granted access if it receives the permission of all other nodes in the system. The drawback of these algorithms is high communication overhead.

In token-based algorithms, a unique token is maintained, so that only the process holding the token may enter the critical section, which ensures the safety property. If a process is already holding the token at the time it makes a critical section request, then no communication is necessary and there is no delay in entering the critical section. Some token-based algorithms [6, 10] consider that nodes are organized in a logical tree. There are two types of logical tree: static and dynamic. For node i, define NEIGHBORi to be the set of nodes j such that there is an edge from node i to node j and vice versa. In a dynamic structure, NEIGHBORi changes from time to time at each node i. In a static logical structure, on the other hand, NEIGHBORi is fixed at every node i in the system.

Token-based algorithms are suitable for large-scale environments such as grid and peer-to-peer systems. Another benefit of these algorithms is their use of simple data structures. However, node failure in a logical tree is a serious problem because it destroys the token request path.

We have adapted the fault-tolerant algorithm of [11] to the cluster computing and grid environment of the algorithm in [12]. The hierarchical mutual exclusion algorithm of [12] takes into account the latency between local and remote nodes and tries to reduce the number of messages between clusters. We have added two broadcasts to the hierarchical algorithm (the first when the token is transferred from one cluster to another, the second when the first request from a remote cluster goes to the token holder cluster). Both are performed in the token holder cluster while there are no faults, and failure recovery is also accomplished by broadcasts. With these modifications we are able to tolerate N-1 failures of N nodes. Moreover, if all the nodes in a cluster which does not own the token fail, our algorithm is able to recover from the failure and continue. In the token holder cluster, however, at least one node must be alive. In our algorithm the fault recovery process is executed simultaneously in all clusters.

We consider a finite number of nodes which are divided into clusters based on their distances. Each node has a local memory and can exchange messages with the other nodes. Communication between nodes is assumed to be perfect. A node fails only by a permanent crash.

The rest of this paper is organized as follows. Section 2 discusses the hierarchical extension of Naimi-Trehel's mutual exclusion algorithm. Section 3 presents our algorithm and its mechanisms for fault recovery. Section 4 is the sketch of the proof of our algorithm, and Section 5 concludes the paper.

2. HIERARCHICAL MUTUAL EXCLUSION ALGORITHM

This algorithm is based on Naimi-Trehel's algorithm and considers the delay and distance between nodes in order to place them in different clusters. It reduces the number of inter-cluster messages and gives a higher priority to local nodes in a cluster for entering the critical section [12].

This algorithm applies three extensions to Naimi-Trehel's algorithm, based on the idea of limiting the propagation of requests between nodes of different clusters. The PreemptAggregation extension is presented here. It improves Naimi-Trehel's algorithm through aggregation of messages and preemption of the token by local nodes. This method uses the following variables in addition to the variables a node already has in Naimi-Trehel's algorithm:

* Each node has a local cluster variable which determines which cluster node i belongs to.
* Each cluster Ci has a Proxy node, except the cluster which has the Elected_node and the token.
* There is an R_Queue variable at the local_root of the cluster which has the token. This queue holds the requests of remote clusters.
* There is an nb_preempt variable in the system configuration which specifies the number of local requests that would be served first.

At first each node sets its last variable to the proxy of its cluster and, as in Naimi-Trehel's algorithm, sends its request to the proxy. Then the proxy sends the request to the Elected_node in the remote cluster and changes its last to the requester.

As in Naimi-Trehel's algorithm, each request follows the path of last variables until it reaches its local_root (the node of the same cluster whose last variable is set to 0).

When a node receives a request, if it is not a local_root node (last ≠ 0), it forwards the request and updates its last (only if the request is issued from the local cluster, in order to avoid redirection to a remote cluster). If the receiver is a local_root node (last = 0) which waits for the token (requesting = true), there are two cases: (1) The received request is the first one since the node started waiting for the token (i.e. next = 0). Then next is set to the requester, because after the node obtains and releases the critical section it will have to send the token to the requester. The last is also updated, but only if the request comes from the local cluster. (2) The next is already set. Since the receiver is a local_root, next inevitably points to a remote node. In this case, if the requester is local and the number of preemptions is below the threshold (nb_preempt), a local preemption of the token is performed by setting next to the requester and storing the old next at the beginning of the R_Queue. Each time a node becomes the new local_root, the R_Queue is sent to it. The R_Queue is also included in the token message.

3. OUR ALGORITHM

The principle of our algorithm is based on reconstructing the N_Queue (the queue formed by the chain of next variables) and protecting the R_Queue which remains in the cluster after the failure. The goal of reconstruction is to maintain the order of token requests and to avoid retransmitting the requests. If reconstruction is impossible, a new N_Queue based on last_tree is built. However, the assumption is that in the cluster holding the token at least one live node exists, so the R_Queue will survive the failure.

In our algorithm, we added to each node a token_pos variable, which points to the cluster in which the token exists. Each time the token is transferred from one cluster to another, the owner of the token broadcasts a TOKEN_POS_CHANGE message to all the nodes, which update their token_pos variable and receive the R_Queue. In addition, we keep a copy of the R_Queue in all nodes of the cluster in which the token exists.

During the initialization process the value of the last variable of each proxy node is set to null. Each node in a cluster which does not hold the token sends its token request to its proxy, sets a timer for a period of Twait (Twait depends on Tcmg, which is the latency of communication between two nodes of different clusters), and waits for the COMMIT message. The proxy then refers to its last variable. If it is null, the proxy sets its last variable to the requester, broadcasts the request in the cluster which holds the token (indicated by the token_pos variable), sets a timer for a period of Twait (which depends on Tcmg), and waits for the COMMIT message. The proxy resends the request until it receives a reply.

In the token holder cluster, all the nodes add this node to their R_Queue, and the node which owns the token sends the COMMIT message to the proxy of the remote cluster. The proxy then redirects this COMMIT message to the requester. On the other hand, if the last variable of the proxy refers to a node in the local cluster, the proxy redirects the request to that node, the request travels along the N_Queue path, and the local_root sends the COMMIT message to the requester. Finally, in both cases the requester is added to the end of the N_Queue and becomes the local_root of the cluster. The procedure of token requesting in the token holder cluster is the same as in the PreemptAggregation method described earlier, except that the next variable of the local_root does not point to a remote node and all the requests from remote clusters are maintained in the R_Queue.

In our algorithm the node which wants to send the token to a node in N_Queue or R_Queue must first send an ARE_YOU_ALIVE message to that node, start a timer of 2*Tmsg or 2*Tcmg respectively, and wait for the I_AM_ALIVE message. Only then can it send the token (Tmsg is the latency of communication between two nodes).

In Fig. 1(a), node D asks for the token. It sends a request to its proxy P1, which broadcasts it in C0 and changes its last to D. The local_root of C0 (B) sends a COMMIT message to P1 and D is inserted in the R_Queue (the lasts of A and B are not updated since the requester is a remote node). After that, F, belonging to cluster C2, sends a request to its proxy P2, then P2 broadcasts it in C0 and finally F is inserted in the R_Queue.

In Fig. 1(b), node E of C1 asks for the token. The proxy P1 locally redirects the request to D (the local_root of C1), which updates its next and last to E.

[Fig. 1. Example of our algorithm: (a) D and F request the token; (b) E requests the token.]

In the PreemptAggregation algorithm, each node in the N_Queue knows its successor in the N_Queue (by its next) but does not know its predecessor. We added a confirmation mechanism for each request to inform a node about its predecessors. Every time a node Si sends a token request, node Sj, which is the local_root or proxy in the local cluster, sends a COMMIT message to Si. In this way, Sj tells Si its position in N_Queue and its predecessors. As a result, N_Queue is ordered so that the lowest position belongs to the node which owns the token (if the token exists in this cluster), or to the node in the cluster which will receive the token first (if the token does not exist in this cluster). The COMMIT message contains the following information:

* The closest predecessor of Si in its local cluster.
* Si's position in N_Queue, which is Si's predecessor's position plus one.

After receiving the COMMIT message, Si periodically checks whether its closest predecessor is alive.

Based on the modifications that we applied to the PreemptAggregation algorithm, we are able to tolerate M faults in the clusters which do not own the token and M-1 faults in the cluster which owns the token. In general, we cover N-1 faults in the whole system (M is the number of nodes in a cluster and N is the number of all the nodes in all clusters).

Every cluster has its own N_Queue. Position zero of every N_Queue is assigned to the first requesting node (by the local proxy).

Every time node Si detects a failure, failure recovery is performed by one of the following three mechanisms. Mechanisms M1 and M2 are executed in all clusters simultaneously, and mechanism M3 is only applied in the token holder cluster.

Mechanism 1 (M1): node Si detects a failure in its closest predecessor.
Mechanism 2 (M2): node Si has not received any COMMIT message.
Mechanism 3 (M3): the node which wants to send the token to the next node in N_Queue or R_Queue has not received any I_AM_ALIVE message.

Each of these three mechanisms is explained in detail below.

M1. If none of Si's predecessors replied to the ARE_YOU_ALIVE message, Si tries to connect itself to N_Queue by broadcasting a SEARCH_PREV message in its local cluster. Si then starts a timer of 2*Tmsg and waits for replies. All the nodes in that cluster which have a position lower than Si's in the N_Queue reply to it. At the end of this time, Si selects the highest position among the received replies and connects itself to N_Queue by sending a CONNECTION message to that node (which changes its next variable to Si). If Si does not receive any reply within this time, it concludes that all its predecessors in N_Queue have failed. It then refers to its token_pos variable. There are two cases to consider:

The token exists in Si's local cluster: Si concludes that the token is lost, regenerates the token, and changes its position to zero.

The token is held by another cluster: Si concludes that it is the first node in its cluster that will receive the token from a remote cluster. It then changes its position to zero and broadcasts a SUBSTITUTE_R_QUEUE message in the cluster which holds the token. All the nodes in that cluster substitute Si for the node of Si's cluster in their R_Queues. If no node of Si's cluster exists in the R_Queue, they add Si to the end of the R_Queue.

M2. We consider a situation in which a node has not received any COMMIT message after sending its request. In fact it does not have a position in N_Queue yet. Thus it is possible that more than one node detects this failure simultaneously.

M2.a. In the first situation we consider that only node Si detects the failure. To connect itself to N_Queue it diffuses a SEARCH_QUEUE message in its local cluster, starts a timer of 2*Tmsg, and waits for replies. Every node that has a position in the N_Queue sends an ACK_SEARCH_QUEUE message to Si. This message contains the position of the sender in N_Queue and whether or not it has a next in N_Queue. Among all the replies that arrive at Si within 2*Tmsg, Si selects the node Sj that has the greatest position. There are three possibilities:

(i) Sj knows that it is the last node in N_Queue in that cluster: in this case Si sends its token request to Sj and the latter updates its next and last variables to Si, so the structure of last_tree remains consistent.

(ii) Sj knows that it has a next: Si concludes that there are some faulty nodes after Sj in the N_Queue. Si then sends a CONNECTION message to Sj to connect itself to Sj.

(iii) Si has not received any reply and concludes that the N_Queue has completely collapsed in that cluster: Si therefore refers to its token_pos variable. If token_pos refers to Si's cluster, Si concludes that the token is also lost. It therefore regenerates the token and takes position zero for itself. Since the R_Queue is replicated in all the nodes of this cluster, it survives. If Si's token_pos variable refers to another cluster, Si broadcasts a SUBSTITUTE_R_QUEUE message in that cluster (as described in M1).

M2.b. Here we consider the situation in which more than one node detects this failure simultaneously in a cluster. In this case, to avoid creating more than one token and inconsistency in N_Queue, we must have a voting mechanism. Suppose node Si detects this failure, broadcasts a SEARCH_QUEUE message, and is waiting for a reply. If it receives another SEARCH_QUEUE message during this time, it concludes that another node Sj has also detected this failure in the local cluster. The node which has had fewer accesses to the CS (this information is included in SEARCH_QUEUE) wins the voting. If the numbers of accesses to the CS are equal, the node which has the greater ID wins. Say Sj wins. The loser (Si) sends its token request to Sj, and Sj finds itself in M2.a and pursues that mechanism.

To maintain the order of N_Queue, it should be reconstructed. This procedure is dynamic and does not incur any latency or message overhead, since all the required information has been transmitted via the SEARCH_QUEUE message. Suppose node Si has detected the failure and has also won the voting. The last_tree in the local cluster is then reconstructed as follows:

I: all the nodes in this cluster which do not wait for the token change their last variables to Si.
II: all the nodes in this cluster which have a position in N_Queue change their last variables to Si.
III: all the nodes in this cluster which did not have a position in N_Queue but are waiting for the token (in fact they sent a request but did not receive a COMMIT) set their last variables equal to their next variables.

In Fig. 2(a), nodes G and F have a position in the local N_Queue, but nodes I, J and H do not. Node K sends a token request to one of the faulty nodes. Node L sends a token request to K, then K updates its next and last variables to L. Suppose nodes H and K detect the failure simultaneously, so both of them broadcast a SEARCH_QUEUE message in cluster C2, and suppose H wins the voting. Fig. 2(b) illustrates how the nodes update their lasts. Nodes F and G, which have a position in N_Queue, update their last to H (see II), and so does node P2, which was not waiting for the token (see I). Nodes I, J and L set their lasts equal to their nexts (see III). Node H, which is the winner of the voting, places itself in M2.a and sends a CONNECTION message to G. Node K, which is the loser of the voting, sends a token request to H, and the latter forwards it to the local_root. All the nodes on this path update their last to K. Thus, as shown in Fig. 2(c), the N_Queue is reconstructed.

[Fig. 2. Example of mechanism 2: (a) initial state; (b) after failure detection, nodes update their last variables; (c) nodes update their next variables.]

M3. This mechanism happens in two cases:

I: The first situation is when the local_root in the token holder cluster fails. This node cannot be checked by its successor in the remote cluster, so the failure cannot be detected that way, and we suppose M2 has not detected it yet.

II: The second situation is when the owner of the token wants to send the token to a node in R_Queue (in fact to a node of a remote cluster), but the N_Queue in that cluster has completely collapsed and this has not been detected by M2 yet.

In both cases, if the token holder which wants to deliver the token has not received an I_AM_ALIVE message after 2*Tmsg or 2*Tcmg, respectively, it sends the token to the next node of R_Queue.

[Fig. 3. Example of mechanism 3: (a) C wants to send the token to D; (b) finally C sends the token and R_Queue to G.]

In Fig. 3(a), node C, after finishing the CS, wants to send the token to the first node of R_Queue (D), but the latter has failed. In Fig. 3(b), when C has not received any I_AM_ALIVE message from D after 2*Tcmg, it sends the token to the next node of R_Queue (G).

4. SKETCH OF PROOF

We give the outline of the proof that our solution solves the mutual exclusion problem and the sketch of the proof of the starvation-freeness of the solution.

Theorem 1: A node requesting entry to its critical section will eventually succeed within a bounded time.

Proof: This proof comprises two parts. Firstly, we have to prove starvation-freeness for a node which has a position, and then that a node will eventually obtain a position within a bounded time.

In the absence of failure, we can identify the following four invariants, which are easily proven by induction.

I1: in the token holder cluster, the node with the lowest position in N_Queue owns the token, and in other clusters the node with the lowest position is the first node which receives the token.
I2: the position ordering respects the order of N_Queue in each cluster.
I3: after Si gets its position, no node in that cluster can get a new position which is smaller than Si's.
I4: two nodes in the same cluster cannot have the same position.

In the absence of failure, these invariants ensure that a node Si holding a position will receive the token within a finite time. Indeed, I1 and I2 ensure that the token is held by one of Si's predecessors. I2 and I3 ensure that no node can be inserted before Si. If a failure occurs and the token is lost, I1 is no longer true. Therefore, we must prove that mechanism M1 is able to make the invariant I1 true again in a bounded time, since this failure is handled by M1. In the token holder cluster, the node with the lowest position eventually detects the loss of the token and regenerates it in a bounded time, and in the other clusters the node with the lowest position likewise detects the failure of the first node of the local N_Queue in a bounded time.

Secondly, we have to prove that a requester node will eventually obtain its position in the N_Queue in a bounded time. To do so, we have to prove the following.

A node whose request has been lost will resend it in a bounded time: mechanism M2 starts after at most the maximum of m*Tmsg and 2*Tcmg (m is the maximum number of nodes in a cluster). Aside from this, when mechanism M2 rebuilds last_tree and N_Queue in a cluster, the order after reconstruction is the same as before. Also, when the local_root of the token holder cluster fails and mechanism M3 has not started yet, if a remote request is sent, the proxy of the remote cluster will resend it until it gets a reply, since mechanism M3 eventually starts when the token is about to be handed to the local_root.

A request can be lost only a finite number of times: this is ensured by our model, i.e. there can be at most N-1 permanent crashes. However, if there is an infinite number of failures, this property remains true if and only if the system has periods of stability lasting at least the maximum of m*Tmsg and 2*Tcmg.

Theorem 2: There is always at most one token in the system, which guarantees that at most one node can execute the CS at any time.

Proof: Since a node regenerates the token only when it is lost, token uniqueness is ensured in the token holder cluster. In mechanism M1 or M2 a node regenerates the token if and only if it did not receive any reply after sending SEARCH_PREV or SEARCH_QUEUE, respectively, which implies that the token is lost. In any cluster, if a node reaches mechanism M1 or M2 and wants to regenerate the token, it first checks its token_pos variable. If the token is supposed to exist in its local cluster, it regenerates it; otherwise it does not. Besides, it is easily proven by contradiction that M1 and M2 are not compatible, so two different nodes will not regenerate the token simultaneously.

5. CONCLUSION

In this paper we proposed a fault-tolerant hierarchical token-based mutual exclusion algorithm. Our algorithm uses the fault tolerance idea of [11] for the Naimi-Trehel token-based mutual exclusion algorithm. We have used the hierarchical PreemptAggregation algorithm and modified it so that it supports fault tolerance. It has three mechanisms to recover from failures. We added two intra-cluster broadcasts to this algorithm and one overall broadcast for recovering from a fault in mechanisms M1 and M2. The proposed algorithm preserves the order of the token requests after a failure and can tolerate N-1 permanent faults of N nodes. Finally, a sketch of the proofs was presented to highlight the starvation-freeness and mutual exclusion aspects of our algorithm.

6. REFERENCES

[1] D. Agrawal and A. El Abbadi. An efficient solution to the distributed mutual exclusion problem. In Proc. 8th ACM Symposium on Principles of Distributed Computing, pp. 193-200, 1989.
[2] L. Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558-565, 1978.
[3] M. Maekawa. A √N algorithm for mutual exclusion in decentralized systems. ACM Transactions on Computer Systems, 3(2):145-159, 1985.
[4] A.J. Martin. Distributed mutual exclusion on a ring of processors. Science of Computer Programming, 5:265-276, 1985.
[5] M. Naimi and M. Trehel. How to detect a failure and regenerate the token in the log(n) distributed algorithm for mutual exclusion. Lecture Notes in Computer Science, 312:155-166, 1987.
[6] K. Raymond. A tree-based algorithm for distributed mutual exclusion. ACM Transactions on Computer Systems, 7(1):61-77, 1989.
[7] M. Raynal. Prime numbers as a tool to design distributed algorithm. Information Processing Letters, 33:53-85, 1989.
[8] M. Raynal. A simple taxonomy for distributed mutual exclusion algorithms. Technical Report 560, INRIA, 78153 Le Chesnay Cedex, France, 1990.
[9] I. Suzuki and T. Kasami. A distributed mutual exclusion algorithm. ACM Transactions on Computer Systems, 3(4):344-349, 1985.
[10] M. Trehel and M. Naimi. A distributed algorithm for mutual exclusion based on data structures and fault tolerance. In Proc. IEEE 6th International Conference on Computers and Communications, pp. 35-39, 1987.
[11] J. Sopena, L. Arantes, M. Bertier, P. Sens. A fault-tolerant token-based mutual exclusion algorithm using a dynamic tree. EuroPar 2005, Lisboa, Portugal, September 2005. LNCS.
[12] M. Bertier, L. Arantes, and P. Sens. Hierarchical token based mutual exclusion algorithm. In IEEE/ACM CCGrid04, 10 April 2004.
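As an illustration of the three possibilities of M2.a in Section 3 and of the SUBSTITUTE_R_QUEUE handling, the decision logic can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `Reply`, `decide_m2a` and `substitute_in_r_queue`, the string action codes, and the representation of R_Queue entries as (cluster, node) pairs are hypothetical. The substitution rule is one reading of the text, namely that the entry of the sender's cluster is replaced by the sender.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Reply:
    """Hypothetical ACK_SEARCH_QUEUE content received within 2*Tmsg."""
    sender: str
    position: int   # sender's position in the local N_Queue
    has_next: bool  # whether the sender already has a successor


def decide_m2a(my_cluster: str, token_pos: str,
               replies: List[Reply]) -> Tuple[str, Optional[str]]:
    """Return (action, target) for a node that has received no COMMIT."""
    if replies:
        best = max(replies, key=lambda r: r.position)
        if best.has_next:
            # (ii) nodes after the best replier failed: attach directly.
            return "CONNECTION", best.sender
        # (i) the best replier is the tail: send it a normal token request.
        return "REQUEST", best.sender
    if token_pos == my_cluster:
        # (iii) local N_Queue collapsed and the token was here: regenerate.
        return "REGENERATE_TOKEN", None
    # (iii) token is elsewhere: notify the token holder cluster.
    return "SUBSTITUTE_R_QUEUE", token_pos


def substitute_in_r_queue(r_queue: List[Tuple[str, str]],
                          cluster: str, node: str) -> List[Tuple[str, str]]:
    """Replace the entry of `cluster` by `node`, or append if none exists."""
    for i, (entry_cluster, _) in enumerate(r_queue):
        if entry_cluster == cluster:
            return r_queue[:i] + [(cluster, node)] + r_queue[i + 1:]
    return r_queue + [(cluster, node)]
```

For example, `decide_m2a("C1", "C0", [])` yields the SUBSTITUTE_R_QUEUE action aimed at C0, after which every node of C0 would apply `substitute_in_r_queue` to its own copy of the R_Queue.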

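The voting rule of M2.b and rules I to III of the last_tree reconstruction in Section 3 can be illustrated with the following Python sketch. It is an assumption-laden illustration rather than the paper's code: the `Node` record, its field names, and the functions `vote` and `rebuild_last_tree` are hypothetical, and the count of critical-section accesses is assumed to be available locally because it is carried in SEARCH_QUEUE.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Node:
    """Hypothetical per-node state; fields mirror the paper's variables."""
    ident: int
    cs_accesses: int = 0            # critical-section entries so far
    position: Optional[int] = None  # position in the local N_Queue, if any
    requesting: bool = False        # True while waiting for the token
    last: Optional[int] = None
    next: Optional[int] = None


def vote(a: Node, b: Node) -> Node:
    """M2.b: fewer CS accesses wins; on a tie the greater ID wins."""
    if a.cs_accesses != b.cs_accesses:
        return a if a.cs_accesses < b.cs_accesses else b
    return a if a.ident > b.ident else b


def rebuild_last_tree(nodes: Dict[int, Node], winner: Node) -> None:
    """Apply reconstruction rules I-III in place for one cluster."""
    for node in nodes.values():
        if node.ident == winner.ident:
            continue
        if not node.requesting:
            node.last = winner.ident   # rule I: not waiting for the token
        elif node.position is not None:
            node.last = winner.ident   # rule II: already has a position
        else:
            node.last = node.next      # rule III: waiting, no COMMIT yet
```

In the scenario of Fig. 2, calling `vote` on H and K followed by `rebuild_last_tree` with H as winner would point F, G and P2 at H, while I, J and L would fall back to their own next variables.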

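The liveness check that precedes every token transfer, together with the fallback of mechanism M3 in Section 3, can be sketched in Python as below. This is only a sketch under assumptions: the function `choose_token_target`, the `probe` callback (standing in for an ARE_YOU_ALIVE / I_AM_ALIVE exchange with a timeout), and the candidate list of (node, is_local) pairs are hypothetical, and the sketch assumes that each fallback node taken from the R_Queue is probed in the same way before the token is sent.

```python
from typing import Callable, List, Optional, Tuple


def choose_token_target(candidates: List[Tuple[str, bool]],
                        probe: Callable[[str, float], bool],
                        t_msg: float, t_cmg: float) -> Optional[str]:
    """Return the first candidate that answers the liveness probe.

    candidates: intended receiver first, then the following R_Queue entries,
                each as (node, is_local).
    probe:      returns True if I_AM_ALIVE arrives before the timeout.
    t_msg:      latency between two nodes (Tmsg in the text).
    t_cmg:      latency between nodes of different clusters (Tcmg).
    """
    for node, is_local in candidates:
        # The text prescribes 2*Tmsg for N_Queue targets, 2*Tcmg for R_Queue.
        timeout = 2 * t_msg if is_local else 2 * t_cmg
        if probe(node, timeout):
            return node
    return None  # nobody answered; the caller keeps the token
```

With the situation of Fig. 3, the candidate list would be D followed by G, both remote, and the function would return G once the probe of D times out.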