
November 2005 Distributed systems: distributed algorithms

1

Distributed Systems:

Distributed algorithms


2

Overview of chapters

• Introduction
• Co-ordination models and languages
• General services
• Distributed algorithms
  – Ch 10 Time and global states, 11.4-11.5
  – Ch 11 Coordination and agreement, 12.1-12.5
• Shared data
• Building distributed services


3

This chapter: overview
• Introduction
• Logical clocks
• Global states
• Failure detectors
• Mutual exclusion
• Elections
• Multicast communication
• Consensus and related problems


4

Logical clocks
• Problem: ordering of events
  – requirement for many algorithms
  – physical clocks cannot be used
• Use causality:
  – within a single process: observation
  – between different processes: sending of a message happens before receiving the same message


5

Logical clocks (cont.)

• Formalization: happens-before relation, written x → y
• Rules:
  – if x happens before y in any process p, then x → y
  – for any message m: send(m) → receive(m)
  – if x → y and y → z, then x → z
• Implementation: logical clocks


6


• Logical clock
  – counter, appropriately incremented
  – one counter per process
• Physical clock
  – counts oscillations occurring in a crystal at a definite frequency


7


• Rules for incrementing the local logical clock
  1. for each event (including send) in process p: Cp := Cp + 1
  2. when a process sends a message m, it piggybacks on m the value of Cp
  3. on receiving (m, t), a process q
     • computes Cq := max(Cq, t)
     • applies rule 1: Cq := Cq + 1
     Cq is then the logical time for the event receive(m)
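The three rules above can be sketched in a few lines of Python. This is only an illustration: the class name `LamportClock` and its method names are assumptions, not part of the slides.

```python
class LamportClock:
    """Logical clock of one process, following rules 1-3 above."""

    def __init__(self):
        self.time = 0

    def event(self):
        # Rule 1: every local event (including a send) increments the counter.
        self.time += 1
        return self.time

    def send(self):
        # Rule 2: the returned value is piggybacked on the outgoing message.
        return self.event()

    def receive(self, t):
        # Rule 3: take the maximum of both clocks, then apply rule 1.
        self.time = max(self.time, t)
        return self.event()


p, q = LamportClock(), LamportClock()
a = p.event()      # local event at p
t = p.send()       # p sends m, carrying timestamp t
r = q.receive(t)   # q receives (m, t)
```

Here the send gets timestamp 2 and the receive gets timestamp 3, so the timestamps respect send(m) → receive(m).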


8

• Logical timestamps: example
[Figure: space-time diagram of processes P1, P2 and P3 with events a to g, each annotated with its logical timestamp.]


9


• C(x): logical clock value for event x
• Correct usage: if x → y then C(x) < C(y)
• Incorrect usage: if C(x) < C(y) then x → y
• Solution: logical vector clocks


10


• Vector clocks for N processes:
  – at process Pi: Vi[j] for j = 1, 2, …, N
  – Properties:
    • if x → y then V(x) < V(y)
    • if V(x) < V(y) then x → y


11


• Rules for incrementing the logical vector clock
  1. for each event (including send) in process Pi: Vi[i] := Vi[i] + 1
  2. when a process Pi sends a message m, it piggybacks on m the value of Vi
  3. on receiving (m, t), a process Pi
     • applies rule 1
     • sets Vi[j] := max(Vi[j], t[j]) for j = 1, 2, …, N
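A small Python sketch of these vector clock rules follows. The names `VectorClock` and `happened_before` are illustrative assumptions, and processes are indexed from 0 rather than 1. The events are chosen to mirror a three-process run with two messages m1 and m2.

```python
class VectorClock:
    """Vector clock of process i in a group of n processes."""

    def __init__(self, i, n):
        self.i = i
        self.v = [0] * n

    def event(self):
        # Rule 1: increment the own component for every event.
        self.v[self.i] += 1
        return tuple(self.v)

    def send(self):
        # Rule 2: the returned vector is piggybacked on the message.
        return self.event()

    def receive(self, t):
        # Rule 3: apply rule 1, then take the component-wise maximum.
        self.v[self.i] += 1
        self.v = [max(own, other) for own, other in zip(self.v, t)]
        return tuple(self.v)


def happened_before(u, v):
    # V(x) < V(y): every component <= and the vectors differ.
    return all(a <= b for a, b in zip(u, v)) and u != v


p1, p2, p3 = VectorClock(0, 3), VectorClock(1, 3), VectorClock(2, 3)
a = p1.event()
b = p1.send()        # m1
c = p2.receive(b)
d = p2.send()        # m2
e = p3.event()
f = p3.receive(d)
```

With these timestamps, `happened_before(a, f)` holds, while `e` and `c` are incomparable, which is exactly how concurrent events show up with vector clocks.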


12


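• Logical vector clocks: example
[Figure: space-time diagram against physical time. p1: a = (1,0,0), b = (2,0,0); p2: c = (2,1,0), d = (2,2,0); p3: e = (0,0,1), f = (2,2,2). Message m1 goes from p1 to p2, message m2 from p2 to p3.]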


13



14

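Global states
• Detect global properties
[Figure: three examples between processes p1 and p2. a. Garbage collection: object references, a message in transit and a garbage object. b. Deadlock: p1 and p2 wait for each other. c. Termination: p1 and p2 are both passive while an activate message is in transit.]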


15

Global states (cont.)

• Local states & events
  – Process Pi:
    • e_i^k: events
    • s_i^k: state, before event k
  – History of Pi: h_i = <e_i^0, e_i^1, e_i^2, …>
  – Finite prefix of history of Pi: h_i^k = <e_i^0, e_i^1, e_i^2, …, e_i^k>


16


• Global states & events
  – Global history: H = h_1 ∪ h_2 ∪ h_3 ∪ … ∪ h_n
  – Global state (when?): S = (s_1^p, s_2^q, …, s_n^u); consistent?
  – Cut of the system's execution: C = h_1^c1 ∪ h_2^c2 ∪ … ∪ h_n^cn


17


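• Example of cuts:
[Figure: processes p1 (events e_1^0, e_1^1, e_1^2, e_1^3) and p2 (events e_2^0, e_2^1, e_2^2) on a physical time axis, exchanging messages m1 and m2. An inconsistent cut and a consistent cut are drawn across both time lines.]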


18


• Finite prefix of history of Pi: h_i^k = <e_i^0, e_i^1, e_i^2, …, e_i^k>
• Cut of the system's execution: C = h_1^c1 ∪ h_2^c2 ∪ … ∪ h_n^cn
• Consistent cut C: ∀ e ∈ C, f → e ⇒ f ∈ C
• Consistent global state: corresponds to a consistent cut
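The definition of a consistent cut can be checked mechanically. The sketch below assumes a cut is given as the index of the last included event per process, and that happened-before across processes is given as (send, receive) event pairs. Since each process contributes a prefix of its history, it is enough to check the message pairs. The function names and the two-message example are assumptions made for illustration.

```python
def in_cut(cut, event):
    """event = (process, index); cut maps process -> index of last included event."""
    process, index = event
    return index <= cut[process]


def is_consistent(cut, messages):
    # For every receive event inside the cut, the matching send must be inside too.
    return all(in_cut(cut, send)
               for send, receive in messages
               if in_cut(cut, receive))


# Assumed example: m1 is sent at e_1^1 and received at e_2^0,
# m2 is sent at e_2^1 and received at e_1^2.
messages = [(("p1", 1), ("p2", 0)),
            (("p2", 1), ("p1", 2))]

inconsistent = {"p1": 0, "p2": 0}   # contains receive(m1) but not send(m1)
consistent = {"p1": 2, "p2": 2}
```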


19


• Model execution of a (distributed) system: S0 → S1 → S2 → S3 → …
  – Series of transitions between consistent states
  – Each transition corresponds to one single event
    • Internal event
    • Sending message
    • Receiving message
  – Simultaneous events ⇒ order events


20


• Definitions:
  – Run = ordering of all events (in a global history) consistent with each local history's ordering
  – Linearization = consistent run = run that is also consistent with the happens-before relation →
  – S' reachable from S ⇔ ∃ linearization: … S … S' …


21


• Kinds of global state predicates:
  – Stable: predicate = true in S ⇒ ∀ S', S → … → S': predicate = true in S'
  – Safety: α = undesirable property; S0 = initial state of system
    ∀ S, S0 → … → S: α = false in S
  – Liveness: β = desirable property; S0 = initial state of system
    on every linearization from S0, ∃ S, S0 → … → S: β = true in S


22


• Snapshot algorithm of Chandy & Lamport
  – Records a consistent global state
  – Assumptions:
    • Neither channels nor processes fail
    • Channels are unidirectional and provide FIFO-ordered message delivery
    • Graph of channels and processes is strongly connected
    • Any process may initiate a global snapshot
    • Processes may continue their execution during the snapshot


23


• Snapshot algorithm of Chandy & Lamport
  – Elements of algorithm
    • Players: processes Pi with
      – Incoming channels
      – Outgoing channels
    • Marker messages
    • 2 rules
      – Marker receiving rule
      – Marker sending rule
  – Start of algorithm
    • A process acts as if it had received a marker message


24


Marker receiving rule for process pi
    On pi's receipt of a marker message over channel c:
        if (pi has not yet recorded its state) it
            records its process state now;
            records the state of c as the empty set;
            turns on recording of messages arriving over other incoming channels;
        else
            pi records the state of c as the set of messages it has received over c since it saved its state.
        end if

Marker sending rule for process pi
    After pi has recorded its state, for each outgoing channel c:
        pi sends one marker message over c (before it sends any other message over c).
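The two marker rules can be sketched as a small single-threaded Python simulation. It is a minimal illustration under stated assumptions: channels are modelled as `deque` objects, process states are immutable tuples of (account, widgets), and the class and attribute names are invented here. The run below replays a two-process trading scenario with channel c2 from p1 to p2 and channel c1 from p2 to p1.

```python
from collections import deque

MARKER = "M"


class Process:
    def __init__(self, state, incoming, outgoing):
        self.state = state            # current application state (immutable tuple)
        self.incoming = incoming      # names of the incoming channels
        self.outgoing = outgoing      # FIFO queues of the outgoing channels
        self.recorded_state = None    # process state saved by the snapshot
        self.channel_state = {}       # incoming channel -> recorded messages
        self.recording = set()        # incoming channels still being recorded

    def _record(self, marker_channel=None):
        self.recorded_state = self.state
        for c in self.incoming:
            self.channel_state[c] = []          # the marker channel stays empty
            if c != marker_channel:
                self.recording.add(c)
        for queue in self.outgoing:             # marker sending rule
            queue.append(MARKER)

    def start_snapshot(self):
        self._record()                          # act as if a marker was received

    def receive(self, channel, message):
        if message == MARKER:                   # marker receiving rule
            if self.recorded_state is None:
                self._record(marker_channel=channel)
            else:
                self.recording.discard(channel)
        elif channel in self.recording:
            self.channel_state[channel].append(message)


c1, c2 = deque(), deque()                       # c2: p1 -> p2, c1: p2 -> p1
p1 = Process((1000, 0), ["c1"], [c2])
p2 = Process((50, 2000), ["c2"], [c1])

p1.start_snapshot()                             # p1 records <$1000, 0>, sends M on c2
p1.state = (900, 0); c2.append("Order 10, $100")
p2.state = (50, 1995); c1.append("five widgets")
p1.receive("c1", c1.popleft()); p1.state = (900, 5)
p2.receive("c2", c2.popleft())                  # marker: p2 records <$50, 1995>
p2.receive("c2", c2.popleft())                  # the order itself is not recorded
p1.receive("c1", c1.popleft())                  # marker: p1 stops recording c1
```

The recorded global state is p1 = <$1000, 0>, p2 = <$50, 1995>, c1 = <five widgets>, c2 = <>, a state the system never was in at one instant but which is consistent.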


25


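• Example:
[Figure: two processes p1 and p2 connected by channel c2 (from p1 to p2) and channel c1 (from p2 to p1). p1: account $1000, widgets (none). p2: account $50, widgets 2000.]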


26


(M = marker message; process states are <account, widgets>)
1. Global state S0: p1 <$1000, 0>; p2 <$50, 2000>; c2: (empty); c1: (empty)
2. Global state S1: p1 <$900, 0>; p2 <$50, 2000>; c2: (Order 10, $100), M; c1: (empty)
3. Global state S2: p1 <$900, 0>; p2 <$50, 1995>; c2: (Order 10, $100), M; c1: (five widgets)
4. Global state S3: p1 <$900, 5>; p2 <$50, 1995>; c2: (Order 10, $100); c1: (empty)
Recorded channel states: C2 = <>, C1 = <(five widgets)>


27





28


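• Observed state
  – Corresponds to consistent cut
  – Reachable!
[Figure: the actual execution e_0, e_1, ... leads from S_init to S_final, with the points where recording begins and ends marked. A permuted execution runs the pre-snap events e'_0, e'_1, ..., e'_(R-1) from S_init to S_snap, and the post-snap events e'_R, e'_(R+1), ... from S_snap to S_final.]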


29



30

Failure detectors
• Properties
  – Unreliable failure detector: answers with
    • Suspected
    • Unsuspected
  – Reliable failure detector: answers with
    • Failed
    • Unsuspected
• Implementation
  – Every T sec: multicast by P of "P is here"
  – Maximum on message transmission time:
    • Asynchronous system: estimate E
      No "P is here" within T + E sec ⇒ P is Suspected
    • Synchronous system: absolute bound A
      No "P is here" within T + A sec ⇒ P has Failed
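The heartbeat scheme above might be sketched as follows. This is an assumption-laden illustration: time is passed in explicitly as a number of seconds instead of being read from a real clock, and the class `FailureDetector` with its `reliable` flag is invented for the example (with `reliable=True` standing for the synchronous case, where the bound A is absolute).

```python
class FailureDetector:
    def __init__(self, period, bound, reliable=False):
        self.period = period      # T: heartbeat interval in seconds
        self.bound = bound        # E (estimate) or A (absolute bound) in seconds
        self.reliable = reliable  # True only if the bound is absolute
        self.last_seen = {}       # process -> time of its last "P is here"

    def heartbeat(self, process, now):
        self.last_seen[process] = now

    def query(self, process, now):
        silence = now - self.last_seen.get(process, 0.0)
        if silence <= self.period + self.bound:
            return "Unsuspected"
        return "Failed" if self.reliable else "Suspected"


fd = FailureDetector(period=1.0, bound=0.5)
fd.heartbeat("P", now=10.0)
early = fd.query("P", now=11.0)   # within T + E
late = fd.query("P", now=12.0)    # silent for longer than T + E
```

In the asynchronous case a "Suspected" answer may be wrong, because E is only an estimate and a slow message can still arrive later.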


31



32

Mutual exclusion
• Problem: how to give a single process temporarily a privilege?
  – Privilege = the right to access a (shared) resource
  – Resource = file, device, window, …
• Assumptions
  – clients execute the mutual exclusion algorithm
  – the resource itself might be managed by a server
  – reliable communication


33

Mutual exclusion (cont.)

• Basic requirements:
  – ME1: at most one process may execute in the shared resource at any time (Safety)
  – ME2: a process requesting access to the shared resource is eventually granted it (Liveness)
  – ME3: access to the shared resource should be granted in happened-before order (Ordering or fairness)


34


• Solutions:
  – central server algorithm
  – distributed algorithm using logical clocks
  – ring-based algorithm
  – voting algorithm
• Evaluation
  – Bandwidth (= #messages to enter and exit)
  – Client delay (incurred by a process at enter and exit)
  – Synchronization delay (delay between exit and enter)


35


central server algorithm
• Central server offering 2 operations:
  – enter()
    • if resource free
      then operation returns without delay
      else request is queued and return from operation is delayed
  – exit()
    • if request queue is empty
      then resource is marked free
      else return for a selected request is executed
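A minimal Python sketch of such a server is given below. The class name `LockServer` is an assumption, `enter` returns a flag instead of blocking the caller, and the "selected request" is taken to be the oldest one (FIFO), which the slides do not prescribe.

```python
from collections import deque


class LockServer:
    def __init__(self):
        self.holder = None       # client currently using the resource
        self.queue = deque()     # delayed enter() requests

    def enter(self, client):
        """Return True if granted at once, False if the request is queued."""
        if self.holder is None:
            self.holder = client
            return True
        self.queue.append(client)
        return False

    def exit(self, client):
        """Release the resource; return the client whose enter() now completes."""
        if client != self.holder:
            raise ValueError("only the holder may exit")
        self.holder = self.queue.popleft() if self.queue else None
        return self.holder


server = LockServer()
granted_3 = server.enter(3)    # resource free: granted
granted_4 = server.enter(4)    # queued
granted_2 = server.enter(2)    # queued
next_user = server.exit(3)     # oldest queued request is served
```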


36


central server algorithm
• Example:
[Figure sequence: a server with a request queue and a current user, and four client processes P1, P2, P3, P4.]
  1. Resource free, queue empty
  2. P3 calls enter(): granted at once (user: 3)
  3. P4 calls enter(): request is queued (queue: 4)
  4. P2 calls enter(): request is queued (queue: 4, 2)
  5. P3 calls exit(): resource is released (user: none, queue: 4, 2)
  6. The enter() of P4 returns (user: 4, queue: 2); the enter() of P2 is still delayed


46


central server algorithm
• Evaluation:
  – ME3 not satisfied!
  – Performance:
    • single server is performance bottleneck
    • Enter critical section: 2 messages
    • Synchronization: 2 messages between exit of one process and enter of next
  – Failure:
    • Central server is single point of failure
    • what if a client, holding the resource, fails?
    • Reliable communication required


47


ring-based algorithm
• All processes arranged in a unidirectional, logical ring
• token passed in ring
• process with token has access to resource


48


ring-based algorithm
[Figure sequence: processes P1 to P6 arranged in a ring, with the token travelling around it.]
  1. P2 holds the token: P2 can use resource
  2. P2 stopped using resource and forwarded token
  3. P3 doesn't need resource and forwards token


53


ring-based algorithm
• Evaluation:

– ME3 not satisfied

– efficiency

• high when high usage of resource

• high overhead when very low usage

– failure

• Process failure: loss of ring!



54


distributed algorithm using logical clocks
• Distributed agreement algorithm
  – multicast requests to all participating processes
  – use resource when all other participants agree (= reply received)
• Processes
  – keep logical clock; included in all request messages
  – behave as finite state machine:
    • released
    • wanted
    • held


55


distributed algorithm using logical clocks
• Ricart and Agrawala's algorithm: process Pj
  – on initialization:
    • state := released;
  – to obtain resource:
    • state := wanted;
    • T = logical clock value for next event;
    • multicast request <T, Pj> to other processes;
    • wait for n-1 replies;
    • state := held;


56


distributed algorithm using logical clocks
• Ricart and Agrawala's algorithm: process Pj
  – on receipt of request <Ti, Pi>:
    • if (state = held) or (state = wanted and (T, Pj) < (Ti, Pi))
      then queue request from Pi
      else reply immediately to Pi
  – to release resource:
    • state := released;
    • reply to any queued requests;
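The per-process logic of Ricart and Agrawala's algorithm can be sketched in Python as below. Message transport is left out: the caller delivers requests and replies by calling methods directly. The class `RAProcess`, its method names and the preset clock values are assumptions for the example; the scenario uses three processes where P1 requests with timestamp 41 and P2 with timestamp 34.

```python
class RAProcess:
    def __init__(self, pid, n, clock=0):
        self.pid, self.n = pid, n
        self.state = "released"
        self.clock = clock       # logical clock
        self.request = None      # own pending request <T, pid>
        self.replies = 0
        self.queue = []          # ids of processes whose requests are deferred

    def want(self):
        self.state = "wanted"
        self.clock += 1
        self.request = (self.clock, self.pid)
        self.replies = 0
        return self.request      # to be multicast to the other processes

    def on_request(self, req):
        """Return True to reply immediately, False if the request is queued."""
        self.clock = max(self.clock, req[0]) + 1
        if self.state == "held" or (self.state == "wanted" and self.request < req):
            self.queue.append(req[1])
            return False
        return True

    def on_reply(self):
        self.replies += 1
        if self.replies == self.n - 1:
            self.state = "held"

    def release(self):
        self.state = "released"
        queued, self.queue = self.queue, []
        return queued            # reply to every process in this list


p1, p2, p3 = RAProcess(1, 3, clock=40), RAProcess(2, 3, clock=33), RAProcess(3, 3)
r1, r2 = p1.want(), p2.want()                 # <41, P1> and <34, P2>
if p3.on_request(r1): p1.on_reply()
if p3.on_request(r2): p2.on_reply()
deferred = not p2.on_request(r1)              # (34, P2) < (41, P1): P2 queues P1
if p1.on_request(r2): p2.on_reply()           # P1 yields to the older request
state_after_replies = (p1.state, p2.state)
for pid in p2.release():                      # P2 leaves and answers P1
    p1.on_reply()
```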


57


distributed algorithm using logical clocks
• Ricart and Agrawala's algorithm: example
  – 3 processes
  – P1 and P2 will request the resource concurrently
  – P3 not interested in using resource


58


distributed algorithm using logical clocks
• Ricart and Agrawala's algorithm: example
[Figure sequence: processes P1, P2 and P3, each shown with its state and request queue; P1 and P2 also show the number of replies received.]
  1. P1, P2 and P3 are released; all queues are empty
  2. P1 becomes wanted and multicasts request <41,P1>
  3. P2 becomes wanted and multicasts request <34,P2>
  4. P3 replies to both requests (messages <43,P3> and <45,P3>); P1 and P2 have 1 reply each
  5. P2 queues the request of P1, since (34,P2) < (41,P1) (queue of P2: P1)
  6. P2 receives its second reply and becomes held (replies: P1 = 1, P2 = 2)
  7. P2 releases the resource and becomes released; P1 is still wanted with 1 reply


70


distributed algorithm using logical clocks
• Evaluation
  – Performance:
    • expensive algorithm: 2 * (n - 1) messages to get resource
    • Client delay: round trip
    • Synchronization delay: 1 message to pass section to another process
    • does not solve the performance bottleneck
  – Failure
    • each process must know all other processes



71


voting algorithm
• Approach of Maekawa:
  – Communication with subset of partners should suffice
  – Candidate collects sufficient votes
• Voting set: Vi = voting set for pi
  – ∀ i, j: Vi ∩ Vj ≠ ∅
  – | Vi | = K
  – Pj contained in M voting sets
  – Optimal solution
    • K ~ √N
    • M = K


72


voting algorithm
Maekawa's algorithm

On initialization
    state := RELEASED;
    voted := FALSE;

For pi to enter the critical section
    state := WANTED;
    Multicast request to all processes in Vi – {pi};
    Wait until (number of replies received = (K – 1));
    state := HELD;

On receipt of a request from pi at pj (i ≠ j)
    if (state = HELD or voted = TRUE)
    then
        queue request from pi without replying;
    else
        send reply to pi;
        voted := TRUE;
    end if


73


voting algorithm
Maekawa's algorithm (cont.)

For pi to exit the critical section
    state := RELEASED;
    Multicast release to all processes in Vi – {pi};

On receipt of a release from pi at pj (i ≠ j)
    if (queue of requests is non-empty)
    then
        remove head of queue – from pk, say;
        send reply to pk;
        voted := TRUE;
    else
        voted := FALSE;
    end if


74


voting algorithm
• Evaluation
  – Properties
    • ME1: OK
    • ME2: NOK, deadlock possible
      solution: process requests in (happened-before) order
    • ME3: OK
  – Performance
    • Bandwidth: on enter 2√N messages + on exit √N messages
    • Client delay: round trip
    • Synchronization delay: round trip
  – Failure:
    • Crash of a process in another voting set can be tolerated



75


• Discussion
  – algorithms are expensive and not practical
  – algorithms are extremely complex in the presence of failures
  – better solution in most cases:
    • let the server, managing the resource, perform concurrency control
    • gives more transparency for the clients


76



77

Elections
• Problem statement:
  – select a process from a group of processes
  – several processes may start election concurrently
• Main requirement:
  – unique choice
  – Select process with highest id
    • Process id
    • <1/load, process id>


78

Elections (cont.)

• Basic requirements:
  – E1: participant pi sets elected_i = ⊥ (not yet defined) or elected_i = P, with P the process with the highest id (Safety)
  – E2: all processes pi participate and eventually set elected_i ≠ ⊥, or crash (Liveness)


79

Elections (cont.)

• Solutions:
  – Bully election algorithm
  – Ring based election algorithm
• Evaluation
  – Bandwidth (~ total number of messages)
  – Turnaround time (the number of serialized message transmission times between initiation and termination of a single run)


80

Elections (cont.)

Bully election
• Assumptions:
  – each process has identifier
  – processes can fail during an election
  – communication is reliable
• Goal:
  – surviving member with the largest identifier is elected as coordinator


81

Elections (cont.)

Bully election

• Roles for processes:

– coordinator

• elected process

• has highest identifier, at the time of election

– initiator

• process starting the election for some reason


82

Elections (cont.)

Bully election
• Three types of messages:
  – election message
    • sent by an initiator of the election to all other processes with a higher identifier
  – answer message
    • a reply message sent by the receiver of an election message
  – coordinator message
    • sent by the process becoming the coordinator to all other processes with lower identifiers


83

Elections (cont.)

Bully election
• Algorithm:
  – send election message:
    • process doing it is called initiator
    • any process may do it at any time
    • when a failed process is restarted, it starts an election, even though the current coordinator is functioning (bully)
  – a process receiving an election message
    • replies with an answer message
    • will start an election itself (why?)


84

Elections (cont.)

Bully election

• Algorithm:

  – actions of an initiator
    • when not receiving an answer message within a certain time (2 Ttrans + Tprocess): becomes coordinator
    • when having received an answer message (a process with a higher identifier is active) and not receiving a coordinator message (after x time units): will restart elections
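The message flow of the bully algorithm can be sketched with a recursive Python function. It is a simplification under stated assumptions: timeouts are abstracted away (a crashed process simply never answers), no process crashes while the election runs, and the function name `bully` and the message log format are invented for the example.

```python
def bully(initiator, ids, alive, log):
    """Run an election started by `initiator`; return the elected coordinator.

    ids   -- identifiers of all processes in the group
    alive -- identifiers of the processes that currently respond
    log   -- list collecting (message type, sender, receiver) tuples
    """
    answers = []
    for p in sorted(ids):
        if p > initiator:
            log.append(("election", initiator, p))
            if p in alive:
                log.append(("answer", p, initiator))
                answers.append(p)
    if not answers:
        # No higher process answered in time: the initiator becomes coordinator.
        for p in sorted(ids):
            if p < initiator:
                log.append(("coordinator", initiator, p))
        return initiator
    # Every process that answered starts an election of its own.
    return max(bully(p, ids, alive, log) for p in answers)


log = []
coordinator = bully(1, ids={1, 2, 3, 4}, alive={1, 2}, log=log)
```

With P3 and P4 crashed, P1's election makes P2 answer and start its own election; P2 gets no answer and announces itself as coordinator.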


85

Elections (cont.)

Bully election

• Example: election of P2 after failure of P3 and P4
[Figure sequence: processes P1, P2, P3 and P4 exchanging election and answer messages.]
  1. P1 is initiator and sends election messages
  2. P2 and P3 reply and start election
  3. Election messages of P2 arrive
  4. P3 replies
  5. P3 fails
  6. Timeout at P1: election starts again
  7. P2 replies and starts election
  8. P2 receives no replies ⇒ P2 becomes coordinator


94

Elections (cont.)

Bully election
• Evaluation
  – Correctness: E1 & E2 ok, if
    • Reliable communication
    • No process replaces crashed process
  – Correctness: no guarantee for E1, if
    • Crashed process is replaced by process with same id
    • Assumed timeout values are inaccurate (= unreliable failure detector)
  – Performance
    • Worst case: O(n^2) messages
    • Optimal: bandwidth: n-2 messages; turnaround: 1 message


95

Elections (cont.)
Ring based election
• Assumptions:
  – processes arranged in a logical ring
  – each process has an identifier: i for Pi
  – processes remain functional and reachable during the algorithm


96


• Messages:
  – forwarded over logical ring
  – 2 types:
    • election: used during election, contains identifier
    • elected: used to announce new coordinator
• Process states:
  – participant
  – non-participant


97


• Algorithm

– process initiating an election

• becomes participant

• sends election message to its neighbour


98


[Figure: ring of processes P1, P21, P8, P11, P14 and P5; the initiator sends an election message containing identifier 11.]


99


• Algorithm
  – upon receiving an election message, a process compares identifiers:
    • Received: identifier in message
    • own: identifier of process
  – 3 cases:
    • Received > own
    • Received < own
    • Received = own


100


• Algorithm
  – receive election message
    • Received > own
      – message forwarded
      – process becomes participant


101


[Figure sequence: the election message containing 11 is forwarded along the ring.]


103


• Algorithm
  – receive election message
    • Received > own
      – message forwarded
    • Received < own and process is non-participant
      – substitutes own identifier in message
      – message forwarded



104


[Figure sequence: the election message containing 11 reaches a process with a higher identifier, which substitutes its own; the message then contains 21 and is forwarded further along the ring.]


108


• Algorithm:
  – receive election message
    • Received > own
      – ...
    • Received < own and process is non-participant
      – ...
    • Received = own
      – identifier must be greatest
      – process becomes coordinator
      – new state: non-participant
      – sends elected message to neighbour


109


[Figure: the election message containing 21 has travelled around the ring and returns to P21, which becomes coordinator.]


110


• Algorithm: receive election message
  • Received > own
    – message forwarded
    – process becomes participant
  • Received < own and process is non-participant
    – substitutes own identifier in message
    – message forwarded
    – process becomes participant
  • Received = own
    – identifier must be greatest
    – process becomes coordinator
    – new state: non-participant
    – sends elected message to neighbour
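The three cases can be traced with a short Python simulation of a single election. The function name `ring_election`, the list-based ring and the ring order used in the example are assumptions; concurrent elections are not modelled, so the branch that extinguishes a message is never taken here.

```python
def ring_election(ring, initiator):
    """ring: identifiers in ring order; initiator: index of the starting process.

    Returns (coordinator, total number of election and elected messages).
    """
    n = len(ring)
    participant = [False] * n
    participant[initiator] = True
    msg = ring[initiator]               # election message carries an identifier
    pos = (initiator + 1) % n
    messages = 1
    while ring[pos] != msg:             # Received = own ends the election phase
        own = ring[pos]
        if msg > own:                   # Received > own: forward unchanged
            participant[pos] = True
        elif participant[pos]:          # Received < own, already participant:
            return None, messages       # message is extinguished
        else:                           # Received < own, non-participant
            msg = own                   # substitute own identifier
            participant[pos] = True
        pos = (pos + 1) % n
        messages += 1
    coordinator = ring[pos]
    messages += n                       # elected message circulates once
    return coordinator, messages


ring = [1, 21, 8, 11, 14, 5]            # assumed ring order
coordinator, messages = ring_election(ring, initiator=3)   # P11 starts
```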


111


• Algorithm
  – receive elected message
    • participant:
      – new state: non-participant
      – forwards message
    • coordinator:
      – election process completed


112


[Figure sequence: the elected message containing 21 travels once around the ring; each participant becomes non-participant and forwards it, until the message is back at the coordinator.]


119


• Evaluation
  – Why is the condition "Received < own and process is non-participant" necessary? (see next slide for full algorithm)
  – Number of messages:
    • worst case: 3 * n - 1
    • best case: 2 * n
  – concurrent elections: messages are extinguished
    • as soon as possible
    • before winning result is announced


120


• Algorithm: receive election message
  • Received > own
    – message forwarded
    – process becomes participant
  • Received < own and process is non-participant
    – substitutes own identifier in message
    – message forwarded
    – process becomes participant
  • Received = own
    – identifier must be greatest
    – process becomes coordinator
    – new state: non-participant
    – sends elected message to neighbour


121



122

Multicast communication
• Essential property:
  – 1 multicast operation ≠ multiple sends
    • Higher efficiency
    • Stronger delivery guarantees
• Operations: g = group, m = message
  – X-multicast(g, m)
  – X-deliver(m)
    • ≠ receive(m)
  – X: additional property (Basic, Reliable, FIFO, …)


123

Multicast communication (cont.)
IP multicast
• Datagram operations
  – with multicast IP address
• Failure model cf. UDP
  – Omission failures
  – No ordering or reliability guarantees


124

Multicast communication (cont.)
Basic multicast
• = IP multicast + delivery guarantee if multicasting process does not crash
• Straightforward algorithm (with reliable send):
    To B-multicast(g, m):
        ∀ p ∈ g: send(p, m)
    On receive(m) at p:
        B-deliver(m)
• Ex. practical algorithm using IP-multicast


125

Multicast communication (cont.)
Reliable multicast
• Properties:
  – Integrity (safety)
    • A correct process delivers a message at most once
  – Validity (liveness)
    • Correct process p multicasts m ⇒ p delivers m
  – Agreement (liveness)
    • ∃ correct process p delivering m ⇒ all correct processes will deliver m
  – Uniform agreement (liveness)
    • ∃ process p (correct or failing) delivering m ⇒ all correct processes will deliver m


126


• 2 algorithms:

1. Using B-multicast

2. Using IP-multicast + piggy backed acks


127


Algorithm 1 with B-multicast


128


Algorithm 1 with B-multicast
• Correct?
  – Integrity
  – Validity
  – Agreement
• Efficient?
  – NO: each message transmitted |g| times


129


Algorithm 2 with IP-multicast
[Figure: message processing. Incoming messages are placed in the hold-back queue; when delivery guarantees are met they move to the delivery queue, from which they are delivered.]


130



Algorithm 2 with IP-multicast

Data structures at process p:
    S_g^p: sequence number
    R_g^q: sequence number of the latest message it has delivered from q

On initialization:
    S_g^p = 0; R_g^q = -1

For process p to R-multicast message m to group g
    IP-multicast(g, <m, S_g^p, <q, R_g^q>>)
    S_g^p++


131


Algorithm 2 with IP-multicast

On IP-deliver (<m, S, <q, R>>) at q from p
    if S = R_g^p + 1
    then R-deliver(m)
         R_g^p++
         check hold-back queue
    else if S > R_g^p + 1
         then store m in hold-back queue
              request missing messages
         endif
    endif
    if R > R_g^q then request missing messages endif


132


• Correct?
  – Integrity: seq numbers + checksums
  – Validity: if missing messages are detected
  – Agreement: if copy of message remains available



133


Algorithm 2 with IP-multicast: example
• 3 processes in group: P, Q, R
• State of process:
  – S: Next_sequence_number
  – Rq: Already_delivered from Q
  – Stored messages
• Presentation (process: S | R values | stored messages):
    P: 2 | Q: 3  R: 5 | < >


134


• Initial state:
    P: 0 | Q: -1  R: -1 | < >
    Q: 0 | P: -1  R: -1 | < >
    R: 0 | P: -1  Q: -1 | < >

• First multicast by P: message <mp0, 0, <Q:-1, R:-1>>
    P: 1 | Q: -1  R: -1 | < mp0 >
    Q: 0 | P: -1  R: -1 | < >
    R: 0 | P: -1  Q: -1 | < >

• Arrival of multicast by P at Q, new state:
    P: 1 | Q: -1  R: -1 | < mp0 >
    Q: 0 | P: 0   R: -1 | < mp0 >
    R: 0 | P: -1  Q: -1 | < >

• Multicast by Q: message <mq0, 0, <P:0, R:-1>>
    P: 1 | Q: -1  R: -1 | < mp0 >
    Q: 1 | P: 0   R: -1 | < mp0, mq0 >
    R: 0 | P: -1  Q: -1 | < >

• Arrival of multicast by Q (the piggybacked <P:0> tells R that it has missed mp0):
    P: 1 | Q: 0   R: -1 | < mp0, mq0 >
    Q: 1 | P: 0   R: -1 | < mp0, mq0 >
    R: 0 | P: -1  Q: 0  | < mq0 >

• When to delete stored messages?


141

Multicast communication (cont.)
Ordered multicast
• FIFO
    if a correct process P: multicast(g, m); multicast(g, m');
    then for all correct processes: deliver(m') ⇒ deliver(m) before deliver(m')
• Causal
    if multicast(g, m) → multicast(g, m')
    then for all correct processes: deliver(m') ⇒ deliver(m) before deliver(m')
• Total


142


• Total
    if ∃ p: deliver(m) before deliver(m')
    then for all correct processes: deliver(m') ⇒ deliver(m) before deliver(m')
• FIFO-Total = FIFO + Total
• Causal-Total = Causal + Total
• Atomic = reliable + Total


143


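[Figure: processes P1, P2 and P3 against time, with totally ordered messages T1 and T2, FIFO-related messages F1, F2 and F3, and causally related messages C1, C2 and C3.]
Notice the consistent ordering of totally ordered messages T1 and T2, the FIFO-related messages F1 and F2 and the causally related messages C1 and C3, and the otherwise arbitrary delivery ordering of messages.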


144

Multicast communication (cont.)
FIFO multicast
• Alg. 1: R-multicast using IP-multicast
  – Correct?
    • Sender assigns S_g^p
    • Receivers deliver in this order
• Alg. 2 on top of any B-multicast


145


Algorithm 2 on top of any B-multicast

    S_g^p: sequence number
    R_g^q: sequence number of the latest message it has delivered from q

On initialization:
    S_g^p = 0; R_g^q = -1

For process p to FO-multicast message m to group g
    B-multicast(g, <m, S_g^p>)
    S_g^p++


146


Algorithm 2 on top of any B-multicast

On B-deliver (<m, S>) at q from p
    if S = R_g^p + 1
    then FO-deliver(m)
         R_g^p++
         check hold-back queue
    else if S > R_g^p + 1
         then store m in hold-back queue
         endif
    endif
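The receiving side of this FIFO algorithm fits in a few lines of Python. The sketch assumes some B-multicast layer calls `b_deliver` with the sender, the sequence number S and the message; the class name `FifoReceiver` and the dictionary-based hold-back queue are illustrative choices.

```python
class FifoReceiver:
    def __init__(self):
        self.latest = {}       # sender -> R: latest delivered sequence number
        self.hold_back = {}    # sender -> {sequence number: message}
        self.delivered = []    # FO-delivered (sender, message) pairs, in order

    def b_deliver(self, sender, seq, message):
        latest = self.latest.get(sender, -1)      # R is initialised to -1
        if seq <= latest:
            return                                # duplicate, already delivered
        held = self.hold_back.setdefault(sender, {})
        held[seq] = message
        # Deliver S = R + 1, then keep checking the hold-back queue.
        while latest + 1 in held:
            latest += 1
            self.delivered.append((sender, held.pop(latest)))
        self.latest[sender] = latest


r = FifoReceiver()
r.b_deliver("p", 1, "second")   # arrives early: held back
held_first = list(r.delivered)
r.b_deliver("p", 0, "first")    # fills the gap: both are delivered in order
```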


147

Multicast communication (cont.)
TOTAL multicast
• Basic approach:
  – Sender: assign totally ordered identifiers instead of process ids
  – Receiver: deliver as for FIFO ordering
• Alg. 1: use a (single) sequencer process
• Alg. 2: participants collectively agree on the assignment of sequence numbers


148

Multicast communication(cont.) TOTAL multicast: sequencer process


149

Multicast communication (cont.)
TOTAL multicast: sequencer process
• Correct?
• Problems?
  – A single sequencer process
    • bottleneck
    • single point of failure


150

Multicast communication (cont.)
TOTAL multicast: ISIS algorithm
• Approach:
  – Sender:
    • B-multicasts message
  – Receivers:
    • Propose sequence numbers to sender
  – Sender:
    • uses returned sequence numbers to generate agreed sequence number


151


[Figure: processes P1, P2, P3 and P4 and three message phases: 1 Message, 2 Proposed Seq, 3 Agreed Seq.]


152



    A_g^p: largest agreed sequence number
    P_g^p: largest proposed sequence number by p

On initialization:
    P_g^p = 0

For process p to TO-multicast message m to group g
    B-multicast(g, <m, i>)    / i = unique id for m

On B-deliver (<m, i>) at q from p
    P_g^q = max(A_g^q, P_g^q) + 1
    send(p, <i, P_g^q>)
    store <m, i, P_g^q> in hold-back queue


153


On receive(i, P) at p from q
    wait for all replies; a is the largest reply
    B-multicast(g, <"order", i, a>)

On B-deliver (<"order", i, a>) at q from p
    A_g^q = max(A_g^q, a)
    attach a to message i in hold-back queue
    reorder messages in hold-back queue (increasing sequence numbers)
    while message m in front of hold-back queue has been assigned an agreed sequence number
    do remove m from hold-back queue
       TO-deliver(m)
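The receiver side of this agreement scheme can be sketched in Python as below. Several things are assumptions made for the example: the class name `IsisProcess`, the direct method calls standing in for B-multicast and send, and the tie-break on message id when two messages end up with the same sequence number (the pseudocode above does not say how ties are resolved).

```python
class IsisProcess:
    def __init__(self):
        self.agreed = 0        # A: largest agreed sequence number seen
        self.proposed = 0      # P: largest sequence number proposed here
        self.hold_back = {}    # message id -> [sequence number, is_agreed, message]
        self.delivered = []

    def on_message(self, msg_id, message):
        """B-deliver of <m, i>: return the proposed sequence number."""
        self.proposed = max(self.agreed, self.proposed) + 1
        self.hold_back[msg_id] = [self.proposed, False, message]
        return self.proposed

    def on_order(self, msg_id, agreed):
        """B-deliver of <"order", i, a>: fix the number, deliver what is ready."""
        self.agreed = max(self.agreed, agreed)
        self.hold_back[msg_id][0] = agreed
        self.hold_back[msg_id][1] = True
        while self.hold_back:
            # Front of the queue: smallest sequence number, ties broken by id.
            front = min(self.hold_back, key=lambda i: (self.hold_back[i][0], i))
            seq, is_agreed, message = self.hold_back[front]
            if not is_agreed:
                break
            del self.hold_back[front]
            self.delivered.append(message)


p1, p2 = IsisProcess(), IsisProcess()
# The two messages arrive in different orders at p1 and p2.
proposals_x = [p1.on_message("x", "msg x")]
proposals_y = [p1.on_message("y", "msg y")]
proposals_y.append(p2.on_message("y", "msg y"))
proposals_x.append(p2.on_message("x", "msg x"))
for p in (p1, p2):
    p.on_order("x", max(proposals_x))   # sender of x announces the largest reply
    p.on_order("y", max(proposals_y))
```

Although p1 and p2 received the messages in opposite orders, both deliver them in the same order.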


154


• Correct?
  – Processes will agree on sequence number for a message
  – Sequence numbers are monotonically increasing
  – No process can prematurely deliver a message
• Performance
  – 3 serial messages!
• Total ordering
  – ≠ FIFO
  – ≠ causal


155

Multicast communication (cont.)
Causal multicast
• Limitations:
  – Causal order only by multicast operations
  – Non-overlapping, closed groups
• Approach:
  – Use vector timestamps
  – Timestamp = count number of multicast messages


156

Multicast communication(cont.) Causal multicast: vector timestamps

Meaning?


157

Multicast communication (cont.)
Causal multicast: vector timestamps
• Correct?
  – Message timestamps: m has timestamp V, m' has timestamp V'
  – Given multicast(g, m) → multicast(g, m'), prove V < V'


158



159

Consensus & related problems
• System model
  – N processes pi
  – Communication is reliable
  – Processes may fail
    • Crash
    • Byzantine
  – No message signing
    • Message signing limits the harm a faulty process can do
• Problems
  – Consensus
  – Byzantine generals
  – Interactive consistency


160

Consensus

• Problem statement
  – ∀ pi: undecided state & pi proposes vi
  – Message exchanges
  – Finally: each process pi sets decision variable di, enters decided state and may not change di
• Requirements:
  – Termination: eventually each correct process pi sets di
  – Agreement: pi and pj correct & in decided state ⇒ di = dj
  – Integrity: correct processes all propose same value d ⇒ any correct process in decided state has chosen d


161

Consensus

• Simple algorithm (no process failures)
  – Collect processes in group g
  – For each process pi:
    • R-multicast(g, vi)
    • Collect N values
    • d = majority(v1, v2, ..., vN)
• Problems with failures:
  – Crash: detected? not in asynchronous systems
  – Byzantine? Faulty process can send around different values


162

Byzantine generals
• Problem statement
  – Informal:
    • agree to attack or to retreat
    • commander issues the order
    • lieutenants are to decide to attack or retreat
    • all can be 'treacherous'
  – Formal
    • One process proposes value
    • Others are to agree
• Requirements:
  – Termination: each process eventually decides
  – Agreement: all correct processes select the same value
  – Integrity: commander correct ⇒ other correct processes select value of commander


163

Interactive consistency

• Problem statement
  – Correct processes agree upon a vector of values (one value for each process)
• Requirements:
  – Termination: each process eventually decides
  – Agreement: all correct processes select the same vector of values
  – Integrity: if pi correct, then all correct processes decide on vi as the i-th component of their vector


164

Related problems & solutions

• Basic solutions:
  – Consensus:
    • Ci(v1, v2, ..., vN) = decision value of pi
  – Byzantine generals
    • j is commander, proposing v
    • BGi(j, v) = decision value of pi
  – Interactive consistency
    • ICi(v1, v2, ..., vN)[j] = jth value in the decision vector of pi, with v1, v2, ..., vN the values proposed by the processes


165


• Derived solutions:
  – IC from BG:
    • Run BG N times, once with each process as commander
    • ICi(v1, v2, ..., vN)[j] = BGi(j, vj)
  – C from IC:
    • Run IC and produce vector of values
    • Derive single value with appropriate function
  – BG from C:


166


• Derived solutions:
  – IC from BG:
  – C from IC:
  – BG from C:
    • Commander pj sends value v to itself & other processes
    • Each process runs C with the values v1, v2, ..., vN they received
    • BGi(j, v) = Ci(v1, v2, ..., vN)


167

Consensus in a synchronous system
• Assumptions
  – Use B-multicast
  – f of N processes may fail
• Approach
  – Proceed in f + 1 rounds
  – In each round correct processes B-multicast values
• Variables
  – Values_i^r = set of proposed values known to process pi at the beginning of round r
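The round structure can be simulated in Python under some explicit assumptions: only crash failures occur, a crashing process may reach just a subset of the group in its last round, each process multicasts only the values it has not sent before, and the final decision function is `min` (the approach above only requires that all processes apply the same function). The function name and the `crashes` parameter are invented for the sketch.

```python
def synchronous_consensus(proposals, f, crashes=None):
    """proposals: pid -> proposed value.
    crashes: pid -> (round in which it crashes, set of pids it still reaches).
    Returns pid -> decision for the processes that did not crash."""
    crashes = crashes or {}
    values = {p: {v} for p, v in proposals.items()}   # Values_i
    sent = {p: set() for p in proposals}              # values already multicast
    alive = set(proposals)
    for r in range(1, f + 2):                         # rounds 1 .. f + 1
        inbox = {p: set() for p in proposals}
        for p in sorted(alive):
            new = values[p] - sent[p]
            if p in crashes and crashes[p][0] == r:
                receivers = crashes[p][1]             # crash in mid-multicast
            else:
                receivers = set(proposals)
            for q in receivers:
                inbox[q] |= new
            sent[p] |= new
        alive -= {p for p, (crash_round, _) in crashes.items() if crash_round == r}
        for p in alive:
            values[p] |= inbox[p]
    return {p: min(values[p]) for p in alive}


proposals = {1: 5, 2: 3, 3: 7}
crashes = {2: (1, {1})}          # p2 crashes in round 1 after reaching only p1
decisions = synchronous_consensus(proposals, f=1, crashes=crashes)
too_few_rounds = synchronous_consensus(proposals, f=0, crashes=crashes)
```

With f + 1 = 2 rounds, p1 relays the value it alone received from p2, so p1 and p3 agree. Running only one round leaves them with different sets, which shows why f + 1 rounds are needed.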


168



169


• Termination?
  – Synchronous system
• Correct?
  – Each process arrives at the same set of values at the end of the final round
  – Proof?
    • Assume sets are different ...
    • Pi has value v & Pj doesn't have value v ⇒ ∃ Pk that managed to send v to Pi and not to Pj
• Agreement & Integrity?
  – Processes apply the same function


170

Byzantine generals in a synchronous system

• Assumptions
  – Arbitrary failures
  – f of N processes may be faulty
  – Channels are reliable: no message injections
  – Unsigned messages


171


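• Impossibility with 3 processes
• Impossibility with N <= 3f processes
[Figure: two scenarios with commander p1 and lieutenants p2 and p3; faulty processes are shown shaded. Left: p1 sends 1:v to p2 and p3; p2 relays 2:1:v; p3 relays 3:1:u. Right: p1 sends 1:w to p2 and 1:x to p3; p2 relays 2:1:w; p3 relays 3:1:x.]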


172


• Solution with one faulty process
  – N = 4, f = 1
  – 2 rounds of messages:
    • Commander sends its value to each of the lieutenants
    • Each of the lieutenants sends the value it received to its peers
  – Lieutenant receives
    • Value of commander
    • N-2 values of peers
    ⇒ Use majority function
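The decision step of a lieutenant can be sketched in Python. The helper names are assumptions, and `None` stands in for the special "no majority" outcome. The two scenarios below use N = 4: one with a faulty lieutenant p3 that relays wrong values, one with a faulty commander that sends a different value to each lieutenant.

```python
from collections import Counter


def majority(values):
    """Return the value held by a strict majority, or None if there is none."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else None


def lieutenant_decision(from_commander, relayed_by_peers):
    # One value from the commander plus N - 2 values relayed by the peers.
    return majority([from_commander] + list(relayed_by_peers))


# Scenario 1: lieutenant p3 is faulty; the commander sent v to everybody.
p2_decides = lieutenant_decision("v", ["u", "v"])   # p3 relayed u, p4 relayed v
p4_decides = lieutenant_decision("v", ["v", "w"])   # p2 relayed v, p3 relayed w

# Scenario 2: the commander is faulty and sent u, w and v to p2, p3 and p4.
faulty_commander = [
    lieutenant_decision("u", ["w", "v"]),   # at p2
    lieutenant_decision("w", ["u", "v"]),   # at p3
    lieutenant_decision("v", ["u", "w"]),   # at p4
]
```

In the first scenario the correct lieutenants still decide on the commander's value; in the second they all find no majority, so they agree with each other, which is all that is required when the commander is faulty.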


173


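• Solution with one faulty process
[Figure: two scenarios with commander p1 and lieutenants p2, p3 and p4; faulty processes are shown shaded. Left: p1 sends 1:v to p2, p3 and p4; p2 relays 2:1:v and p4 relays 4:1:v, while p3 relays 3:1:u and 3:1:w. Right: p1 sends 1:u to p2, 1:w to p3 and 1:v to p4; the lieutenants relay 2:1:u, 3:1:w and 4:1:v.]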


174

Consensus & related problems

• Impossibility in asynchronous systems
  – Proven: no algorithm exists that is guaranteed to reach consensus
  – No guaranteed solution to
    • Byzantine generals problem
    • Interactive consistency
    • Reliable & totally ordered multicast
• Approaches for work-around:
  – Masking faults: restart crashed process and use persistent storage
  – Use failure detectors: make failures fail-silent by discarding messages


175



176

Distributed Systems:

Distributed algorithms
