
Upload: delilah-jones

Post on 04-Jan-2016


November 2005 Distributed systems: distributed algorithms


Distributed Systems:

Distributed algorithms


Overview of chapters

• Introduction
• Co-ordination models and languages
• General services
• Distributed algorithms
  – Ch 10 Time and global states, 11.4-11.5
  – Ch 11 Coordination and agreement, 12.1-12.5

• Shared data

• Building distributed services


This chapter: overview

• Introduction
• Logical clocks
• Global states
• Failure detectors
• Mutual exclusion
• Elections
• Multicast communication
• Consensus and related problems


Logical clocks

• Problem: ordering of events
  – requirement for many algorithms
  – physical clocks cannot be used
• Use causality:
  – within a single process: observation
  – between different processes: sending of a message happens before receiving the same message


Logical clocks (cont.)

• Formalization: happens-before relation, written x → y
• Rules:
  – if x happens before y in any process p, then x → y
  – for any message m: send(m) → receive(m)
  – if x → y and y → z, then x → z
• Implementation: logical clocks


Logical clocks (cont.)

• Logical clock
  – counter appropriately incremented
  – one counter per process
• Physical clock
  – counts oscillations occurring in a crystal at a definite frequency


Logical clocks (cont.)

• Rules for incrementing the local logical clock:
  1. for each event (including send) in process p: Cp := Cp + 1
  2. when a process sends a message m, it piggybacks on m the value of Cp
  3. on receiving (m, t), a process q
     • computes Cq := max(Cq, t)
     • applies rule 1: Cq := Cq + 1
     Cq is the logical time for the event receive(m)
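The three rules can be sketched in a few lines of Python. This is only an illustration: the class name `LamportClock` and its method names are assumptions made for this example, not part of the slides.

```python
class LamportClock:
    """Logical clock of one process: a single counter (hypothetical helper)."""

    def __init__(self):
        self.value = 0

    def tick(self):
        # Rule 1: increment the counter for each event, including a send.
        self.value += 1
        return self.value

    def send(self):
        # Rule 2: the value returned here is piggybacked on the message.
        return self.tick()

    def receive(self, t):
        # Rule 3: take the maximum of both clocks, then apply rule 1.
        self.value = max(self.value, t)
        return self.tick()


p, q = LamportClock(), LamportClock()
p.tick()                      # local event at p
t = p.send()                  # send event at p, timestamp t travels with m
received_at = q.receive(t)    # logical time of receive(m) at q
```

Because the receiver first takes the maximum and then increments, `received_at` is always larger than the timestamp carried by the message.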


Logical clocks (cont.)

• Logical timestamps: example

[Figure: time lines of processes P1, P2 and P3 with events a to g, each annotated with its logical timestamp]


Logical clocks (cont.)

• C(x) logical clock value for event x

• Correct usage: if x → y then C(x) < C(y)

• Incorrect usage: if C(x) < C(y) then x → y

• Solution: Logical vector clocks


Logical clocks (cont.)

• Vector clocks for N processes:
  – at process Pi: Vi[j] for j = 1, 2, …, N
  – Properties:
    if x → y then V(x) < V(y)
    if V(x) < V(y) then x → y


Logical clocks (cont.)

• Rules for incrementing the logical vector clock:
  1. for each event (including send) in process Pi: Vi[i] := Vi[i] + 1
  2. when a process Pi sends a message m, it piggybacks on m the value of Vi
  3. on receiving (m, t), a process Pi
     • applies rule 1
     • Vi[j] := max(Vi[j], t[j]) for j = 1, 2, …, N
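A minimal sketch of these rules, assuming 0-based process indices and hypothetical names (`VectorClock`, `happened_before`). The comparison V(x) < V(y) is taken to mean that every component is less than or equal and the vectors differ.

```python
class VectorClock:
    """Vector clock of process `index` among `n` processes (hypothetical helper)."""

    def __init__(self, index, n):
        self.i = index
        self.v = [0] * n

    def tick(self):
        # Rule 1: increment the own component for each event.
        self.v[self.i] += 1
        return tuple(self.v)

    def send(self):
        # Rule 2: the returned vector is piggybacked on the message.
        return self.tick()

    def receive(self, t):
        # Rule 3: apply rule 1, then take the component-wise maximum.
        self.tick()
        self.v = [max(a, b) for a, b in zip(self.v, t)]
        return tuple(self.v)


def happened_before(vx, vy):
    """V(x) < V(y): all components <= and the vectors are not equal."""
    return all(a <= b for a, b in zip(vx, vy)) and vx != vy


p1, p2, p3 = VectorClock(0, 3), VectorClock(1, 3), VectorClock(2, 3)
a = p1.tick()         # local event at p1
b = p1.send()         # p1 sends m1
c = p2.receive(b)     # p2 receives m1
d = p2.send()         # p2 sends m2
e = p3.tick()         # local event at p3
f = p3.receive(d)     # p3 receives m2
```

In this run `a` precedes `f`, while `c` and `e` are incomparable in both directions, which is how vector clocks expose concurrent events.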


Logical clocks (cont.)

• Logical vector clocks : example

[Figure: time lines of p1, p2 and p3 along a physical time axis]
p1: a (1,0,0), b (2,0,0)
p2: c (2,1,0), d (2,2,0)
p3: e (0,0,1), f (2,2,2)
Messages: m1 from b to c, m2 from d to f


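Global states

• Detect global properties

[Figure: three examples, each with processes p1 and p2]
a. Garbage collection: object references, a garbage object and a message in transit
b. Deadlock: p1 and p2 each wait for the other (wait-for)
c. Termination: p1 and p2 are passive while an activate message is in transit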


Global states (cont.)

• Local states & events
  – Process Pi:
    e_i^k: events
    s_i^k: state, before event k
  – History of Pi:
    h_i = <e_i^0, e_i^1, e_i^2, …>
  – Finite prefix of history of Pi:
    h_i^k = <e_i^0, e_i^1, e_i^2, …, e_i^k>


Global states (cont.)

• Global states & events
  – Global history
    H = h_1 ∪ h_2 ∪ h_3 ∪ … ∪ h_n
  – Global state (when?)
    S = (s_1^p, s_2^q, …, s_n^u)
    consistent?
  – Cut of the system’s execution
    C = h_1^c1 ∪ h_2^c2 ∪ … ∪ h_n^cn


Global states (cont.)

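• Example of cuts:

[Figure: processes p1 (events e_1^0, e_1^1, e_1^2, e_1^3) and p2 (events e_2^0, e_2^1, e_2^2) exchange messages m1 and m2 along a physical time axis; an inconsistent cut and a consistent cut are marked]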


Global states (cont.)

• Finite prefix of history of Pi:
  h_i^k = <e_i^0, e_i^1, e_i^2, …, e_i^k>
• Cut of the system’s execution
  C = h_1^c1 ∪ h_2^c2 ∪ … ∪ h_n^cn
• Consistent cut C
  ∀ e ∈ C: f → e ⇒ f ∈ C
• Consistent global state
  corresponds to consistent cut
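The consistency condition can be checked mechanically. The sketch below is an illustration under stated assumptions: the function name `is_consistent_cut`, the event labels and the tiny two-process history are all hypothetical. The relation is given as direct happened-before pairs (local order plus one message); checking direct pairs suffices, because a cut closed under direct predecessors is also closed under the transitive relation.

```python
def is_consistent_cut(cut, happened_before):
    """cut: set of events; happened_before: set of (f, e) pairs meaning f -> e."""
    # For every event e in the cut, each f with f -> e must also be in the cut.
    return all(f in cut for (f, e) in happened_before if e in cut)


# Hypothetical history: p1 performs e1_0 then e1_1 (send of m1),
# p2 performs e2_0 (receipt of m1).
relation = {("e1_0", "e1_1"), ("e1_1", "e2_0")}

inconsistent = {"e1_0", "e2_0"}           # receipt included, send missing
consistent = {"e1_0", "e1_1", "e2_0"}     # send and receipt both included
```

A cut that contains the receipt of a message but not its sending shows an effect without its cause, which is exactly what the condition rules out.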


Global states (cont.)

• Model execution of a (distributed) system

S0 → S1 → S2 → S3 → …

– Series of transitions between consistent states
– Each transition corresponds to one single event
  • Internal event
  • Sending message
  • Receiving message
– Simultaneous events ⇒ order events


Global states (cont.)

• Definitions:
  – Run = ordering of all events (in a global history) consistent with each local history’s ordering
  – Linearization = consistent run: a run that is also consistent with →
  – S’ reachable from S ⇔ ∃ linearization: … → S → … → S’ → …


Global states (cont.)

• Kinds of global state predicates:
  – Stable
    predicate = true in S ⇒ ∀ S’, S → … → S’: predicate = true in S’
  – Safety
    predicate = undesirable property
    S0 = initial state of system
    ∀ S, S0 → … → S: predicate = false in S
  – Liveness
    predicate = desirable property
    S0 = initial state of system
    ∃ S, S0 → … → S: predicate = true in S


Global states (cont.)

• Snapshot algorithm of Chandy & Lamport
  – Record consistent global state
  – Assumptions:
    • Neither channels nor processes fail
    • Channels are unidirectional and provide FIFO-ordered message delivery
    • Graph of channels and processes is strongly connected
    • Any process may initiate a global snapshot
    • Processes may continue their execution during the snapshot


Global states (cont.)

• Snapshot algorithm of Chandy & Lamport
  – Elements of algorithm
    • Players: processes Pi with
      – Incoming channels
      – Outgoing channels
    • Marker messages
    • 2 rules
      – Marker receiving rule
      – Marker sending rule
  – Start of algorithm
    • A process acts as if it had received a marker message


Global states (cont.)

Marker receiving rule for process pi
  On pi’s receipt of a marker message over channel c:
    if (pi has not yet recorded its state) it
      records its process state now;
      records the state of c as the empty set;
      turns on recording of messages arriving over other incoming channels;
    else
      pi records the state of c as the set of messages it has received over c since it saved its state.
    end if

Marker sending rule for process pi
  After pi has recorded its state, for each outgoing channel c:
    pi sends one marker message over c (before it sends any other message over c).
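The two rules can be exercised in a small single-threaded simulation. This is a sketch under assumptions, not a definitive implementation: the `Process` class, its attribute names and the use of `deque` objects as FIFO channels are choices made for this example. The scenario uses two processes, p1 with state ($1000, 0 widgets) and p2 with state ($50, 2000 widgets), with channel c2 from p1 to p2 and c1 from p2 to p1.

```python
from collections import deque

MARKER = "MARKER"


class Process:
    def __init__(self, state, incoming, outgoing):
        self.state = state              # application state
        self.incoming = incoming        # names of incoming channels
        self.outgoing = outgoing        # FIFO queues of outgoing channels
        self.recorded_state = None      # process state saved by the snapshot
        self.channel_state = {}         # channel name -> recorded messages
        self.recording = set()          # incoming channels still being recorded

    def record_state(self):
        self.recorded_state = self.state
        self.recording = set(self.incoming)
        self.channel_state = {name: [] for name in self.incoming}
        # Marker sending rule: one marker on each outgoing channel.
        for queue in self.outgoing:
            queue.append(MARKER)

    def receive(self, channel, message):
        if message == MARKER:
            # Marker receiving rule.
            if self.recorded_state is None:
                self.record_state()     # state of `channel` stays empty
            self.recording.discard(channel)
        elif channel in self.recording:
            self.channel_state[channel].append(message)


c1, c2 = deque(), deque()               # c1: p2 -> p1, c2: p1 -> p2
p1 = Process((1000, 0), ["c1"], [c2])
p2 = Process((50, 2000), ["c2"], [c1])

p1.record_state()                       # p1 initiates the snapshot
p1.state = (900, 0)
c2.append(("Order 10", 100))            # p1 orders 10 widgets for $100
p2.state = (50, 1995)
c1.append("five widgets")               # p2 ships an earlier order
p1.receive("c1", c1.popleft())          # widgets arrive after p1 recorded
p1.state = (900, 5)
p2.receive("c2", c2.popleft())          # marker reaches p2
p1.receive("c1", c1.popleft())          # marker reaches p1
```

The recorded global state is p1 = ($1000, 0), p2 = ($50, 1995), c1 = <five widgets>, c2 = <>. No single instant of the run had exactly this state, yet it corresponds to a consistent cut.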


Global states (cont.)

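• Example:

[Figure: processes p1 and p2 connected by channel c2 (from p1 to p2) and channel c1 (from p2 to p1)]
p1: account $1000, widgets (none)
p2: account $50, widgets 2000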


Global states (cont.)

(M = marker message)

1. Global state S0: p1 <$1000, 0>; p2 <$50, 2000>; c2: (empty); c1: (empty)
2. Global state S1: p1 <$900, 0>; p2 <$50, 2000>; c2: (Order 10, $100), M; c1: (empty)
3. Global state S2: p1 <$900, 0>; p2 <$50, 1995>; c2: (Order 10, $100), M; c1: (five widgets)
4. Global state S3: p1 <$900, 5>; p2 <$50, 1995>; c2: (Order 10, $100); c1: (empty); recorded channel states: C1 = <(five widgets)>, C2 = <>


Global states (cont.)

(M = marker message)

1. Global state S0: p1 <$1000, 0>; p2 <$50, 2000>; c2: (empty); c1: (empty)
4. Global state S3: p1 <$900, 5>; p2 <$50, 1995>; c2: (Order 10, $100); c1: (empty); recorded channel states: C1 = <(five widgets)>, C2 = <>
5. Global state S4: p1 <$900, 5>; p2 <$50, 1995>; c2: (Order 10, $100); c1: M; recorded channel states: C1 = <(five widgets)>, C2 = <>
6. Global state S5: p1 <$900, 5>; p2 <$50, 1995>; c2: (Order 10, $100); c1: (empty); recorded channel states: C1 = <(five widgets)>, C2 = <>


Global states (cont.)

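• Observed state
  – Corresponds to consistent cut
  – Reachable!

[Figure: the actual execution e0, e1, ... leads from Sinit (recording begins) to Sfinal (recording ends); a reordered execution passes through Ssnap, with pre-snap events e'0, e'1, ..., e'R-1 before it and post-snap events e'R, e'R+1, ... after it]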


Failure detectors

• Properties
  – Unreliable failure detector: answers with
    • Suspected
    • Unsuspected
  – Reliable failure detector: answers with
    • Failed
    • Unsuspected
• Implementation
  – Every T sec: multicast by P of “P is here”
  – Maximum on message transmission time:
    • Asynchronous system: estimate E
    • Synchronous system: absolute bound A
  – Asynchronous system: no “P is here” within T + E sec ⇒ P is Suspected
  – Synchronous system: no “P is here” within T + A sec ⇒ P is Failed
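A timeout-based detector along these lines might look as follows. The class `FailureDetector` and its parameters are hypothetical; time is passed in explicitly so that the sketch stays deterministic. With an estimate E the detector can only answer Suspected, with an absolute bound A it may answer Failed.

```python
class FailureDetector:
    """Heartbeat-based detector (a sketch; names are assumptions)."""

    def __init__(self, period_t, bound, reliable=False):
        self.timeout = period_t + bound   # T + E or T + A
        self.reliable = reliable          # True only if `bound` is absolute
        self.last_heard = {}              # process -> time of last heartbeat

    def heartbeat(self, process, now):
        # Called when a "P is here" message from `process` arrives.
        self.last_heard[process] = now

    def status(self, process, now):
        last = self.last_heard.get(process)
        if last is not None and now - last <= self.timeout:
            return "Unsuspected"
        return "Failed" if self.reliable else "Suspected"


unreliable = FailureDetector(period_t=5, bound=2)                # T + E = 7
reliable = FailureDetector(period_t=5, bound=2, reliable=True)   # T + A = 7
unreliable.heartbeat("P", now=0)
reliable.heartbeat("P", now=0)
```

In this sketch a process that was never heard from is reported like one whose heartbeat timed out; that choice is an assumption of the example.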


Mutual exclusion

• Problem: how to give a single process temporarily a privilege?
  – Privilege = the right to access a (shared) resource
  – Resource = file, device, window, …
• Assumptions
  – clients execute the mutual exclusion algorithm
  – the resource itself might be managed by a server
  – reliable communication


Mutual exclusion (cont.)

• Basic requirements:
  – ME1: at most one process may execute in the shared resource at any time (Safety)
  – ME2: a process requesting access to the shared resource is eventually granted it (Liveness)
  – ME3: access to the shared resource should be granted in happened-before order (Ordering or fairness)


Mutual exclusion (cont.)

• Solutions:
  – central server algorithm
  – distributed algorithm using logical clocks
  – ring-based algorithm
  – voting algorithm
• Evaluation
  – Bandwidth (= number of messages to enter and exit)
  – Client delay (incurred by a process at enter and exit)
  – Synchronization delay (delay between exit and enter)


Mutual exclusion (cont.)

central server algorithm

• Central server offering 2 operations:
  – enter()
    • if resource free
      then operation returns without delay
      else request is queued and return from operation is delayed
  – exit()
    • if request queue is empty
      then resource is marked free
      else return for a selected request is executed
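The server side of the two operations can be sketched as below. The class name `LockServer` is hypothetical, and serving the queue in FIFO order is an assumption: the slide only says that "a selected request" is answered.

```python
from collections import deque


class LockServer:
    """Central server granting access to one resource (a sketch)."""

    def __init__(self):
        self.holder = None       # process currently using the resource
        self.queue = deque()     # delayed enter() requests

    def enter(self, client):
        """Return True if granted at once, False if the request is queued."""
        if self.holder is None:
            self.holder = client
            return True
        self.queue.append(client)
        return False

    def exit(self, client):
        """Release the resource; return the client granted next, or None."""
        if client != self.holder:
            raise ValueError("only the holder may exit")
        self.holder = self.queue.popleft() if self.queue else None
        return self.holder


server = LockServer()
granted_p3 = server.enter("P3")   # resource free: granted without delay
granted_p4 = server.enter("P4")   # queued
granted_p2 = server.enter("P2")   # queued behind P4
next_holder = server.exit("P3")   # P4 is served next
```

A FIFO queue at the server orders requests by arrival, not by happened-before, which is one way to see why such a server does not guarantee ordering across processes.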


Mutual exclusion (cont.)

central server algorithm

• Example: server with processes P1, P2, P3 and P4 (animation frames summarized)

  Frame  Queue  User  Messages shown
  1      –      –     enter()
  2      –      3     enter()
  3      –      3     –
  4      –      3     enter()
  5      4      3     enter()
  6      4      3     enter(), enter()
  7      4, 2   3     enter(), enter()
  8      4, 2   3     enter(), enter(), exit()
  9      4, 2   –     enter(), enter()
  10     2      4     enter()


Mutual exclusion (cont.)

central server algorithm

• Evaluation:
  – ME3 not satisfied!
  – Performance:
    • single server is performance bottleneck
    • Enter critical section: 2 messages
    • Synchronization: 2 messages between exit of one process and enter of next
  – Failure:
    • Central server is single point of failure
    • what if a client, holding the resource, fails?
    • Reliable communication required


Mutual exclusion (cont.)

ring-based algorithm

• All processes arranged in a unidirectional, logical ring
• token passed in ring
• process with token has access to resource
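One trip of the token around the ring can be simulated as follows. The function `circulate_token` and the process names are assumptions for this sketch; real processes would forward the token in messages.

```python
def circulate_token(ring, holder, wants):
    """Yield (process, used) for one full trip of the token.

    ring:   processes in ring order (unidirectional)
    holder: process that currently has the token
    wants:  set of processes that want the resource
    """
    start = ring.index(holder)
    for step in range(len(ring)):
        process = ring[(start + step) % len(ring)]
        # A process uses the resource only while it holds the token;
        # otherwise it forwards the token to its successor at once.
        yield process, process in wants


ring = ["P1", "P2", "P3", "P4", "P5", "P6"]
trip = list(circulate_token(ring, holder="P2", wants={"P2", "P5"}))
forwards = len(trip)   # one message per hop, even when nobody wants the resource
```

The hop count is the same whether or not any process uses the resource, which is why the overhead is high when usage is low.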


Mutual exclusion (cont.)

ring-based algorithm

• Example: processes P1 to P6 arranged in a ring (animation frames summarized; captions as shown on the slides)
  – P2 can use resource
  – P2 stopped using resource and forwarded token
  – P3 doesn’t need resource and forwards token


Mutual exclusion (cont.)

ring-based algorithm

• Evaluation:
  – ME3 not satisfied
  – efficiency
    • high when high usage of resource
    • high overhead when very low usage
  – failure
    • Process failure: loss of ring!
    • Reliable communication required


Mutual exclusion (cont.)

distributed algorithm using logical clocks

• Distributed agreement algorithm
  – multicast requests to all participating processes
  – use resource when all other participants agree (= reply received)
• Processes
  – keep logical clock; included in all request messages
  – behave as finite state machine:
    • released
    • wanted
    • held


Mutual exclusion (cont.)

distributed algorithm using logical clocks

• Ricart and Agrawala’s algorithm: process Pj
  – on initialization:
    • state := released;
  – to obtain resource:
    • state := wanted;
    • T = logical clock value for next event;
    • multicast request to other processes <T, Pj>;
    • wait for n-1 replies;
    • state := held;


Mutual exclusion (cont.)

distributed algorithm using logical clocks

• Ricart and Agrawala’s algorithm: process Pj
  – on receipt of request <Ti, Pi>:
    • if (state = held) or (state = wanted and (T, Pj) < (Ti, Pi))
      then queue request from Pi
      else reply immediately to Pi
  – to release resource:
    • state := released;
    • reply to any queued requests;
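The decision each process takes on receipt of a request can be sketched as follows. The class `Participant` and its method names are hypothetical, process identifiers are plain integers so that (T, Pj) pairs compare lexicographically, and the logical clock updates are left out to keep the example short. The run uses three processes: 1 requests with timestamp 41, 2 requests with timestamp 34, and 3 is not interested.

```python
class Participant:
    """One process in Ricart and Agrawala's algorithm (a sketch)."""

    def __init__(self, pid):
        self.pid = pid
        self.state = "released"
        self.request = None    # own pending request (T, pid)
        self.queue = []        # requesters whose reply is deferred

    def want(self, timestamp):
        self.state = "wanted"
        self.request = (timestamp, self.pid)
        return self.request    # multicast to all other processes

    def on_request(self, request):
        """Return True to reply immediately, False if the request is queued."""
        mine_first = self.state == "wanted" and self.request < request
        if self.state == "held" or mine_first:
            self.queue.append(request[1])
            return False
        return True

    def enter(self):
        self.state = "held"    # to be called after n-1 replies arrived

    def release(self):
        self.state = "released"
        deferred, self.queue = self.queue, []
        return deferred        # processes that now receive their reply


p1, p2, p3 = Participant(1), Participant(2), Participant(3)
r1, r2 = p1.want(41), p2.want(34)
replies_to_p1 = [p2.on_request(r1), p3.on_request(r1)]
replies_to_p2 = [p1.on_request(r2), p3.on_request(r2)]
if all(replies_to_p2):
    p2.enter()                 # lower timestamp wins
deferred = p2.release()        # the queued process gets its reply now
```

Ties on the timestamp are broken by the process identifier, so two concurrent requests are never both deferred by each other.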


Mutual exclusion (cont.)

distributed algorithm using logical clocks

• Ricart and Agrawala’s algorithm: example
  – 3 processes
  – P1 and P2 will request the resource concurrently
  – P3 not interested in using resource


Mutual exclusion (cont.)

distributed algorithm using logical clocks

• Ricart and Agrawala’s algorithm: example (animation frames summarized; the number after a state is the counter shown next to that process on the slide)

  Frame  P1         P2                     P3        Messages shown
  1      released   released               released  –
  2      wanted, 0  released               released  <41,P1> (twice)
  3      wanted, 0  wanted, 0              released  <41,P1> (twice), <34,P2> (twice)
  4      wanted, 0  wanted, 0              released  as frame 3, plus <43,P3>
  5      wanted, 0  wanted, 0              released  as frame 4, plus <45,P3>
  6      wanted, 1  wanted, 1              released  as frame 5
  7-9    wanted, 1  wanted, 1, Queue: P1   released  as frame 5
  10     wanted, 1  held, 2, Queue: P1     released  as frame 5
  11     wanted, 1  held, 2, Queue: P1     released  –
  12     wanted, 1  released, 0            released  –


Mutual exclusion (cont.)

distributed algorithm using logical clocks

• Evaluation
  – Performance:
    • expensive algorithm: 2 * (n - 1) messages to get resource
    • Client delay: round trip
    • Synchronization delay: 1 message to pass section to another process
    • does not solve the performance bottleneck
  – Failure
    • each process must know all other processes
    • Reliable communication required


Mutual exclusion (cont.)

voting algorithm

• Approach of Maekawa:
  – Communication with subset of partners should suffice
  – Candidate collects sufficient votes
• Voting set: Vi = voting set for pi
  – ∀ i, j: Vi ∩ Vj ≠ ∅
  – | Vi | = K
  – Pj contained in M voting sets
  – Optimal solution
    • K ~ √N
    • M = K
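The intersection property can be illustrated with a simple grid construction. It is not the optimal solution: with N = k × k processes laid out in a square, each voting set is the row plus the column of its owner, so K = 2k − 1, about 2√N. The function name `grid_voting_sets` and the 0-based numbering are assumptions of this sketch.

```python
import math


def grid_voting_sets(n):
    """Voting sets for n = k*k processes numbered 0..n-1 (row plus column)."""
    side = math.isqrt(n)
    if side * side != n:
        raise ValueError("this sketch needs a square number of processes")
    sets = []
    for p in range(n):
        row, col = divmod(p, side)
        row_members = {row * side + c for c in range(side)}
        col_members = {r * side + col for r in range(side)}
        sets.append(row_members | col_members)
    return sets


voting_sets = grid_voting_sets(9)
sizes = {len(v) for v in voting_sets}
# Any two sets overlap: the row of one crosses the column of the other.
all_intersect = all(a & b for a in voting_sets for b in voting_sets)
```

Every pair of voting sets shares at least one process, so two candidates can never both collect all votes of their sets at the same time.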


Mutual exclusion (cont.)

voting algorithm

Maekawa’s algorithm

On initialization
  state := RELEASED;
  voted := FALSE;

For pi to enter the critical section
  state := WANTED;
  Multicast request to all processes in Vi – {pi};
  Wait until (number of replies received = (K – 1));
  state := HELD;

On receipt of a request from pi at pj (i ≠ j)
  if (state = HELD or voted = TRUE)
  then
    queue request from pi without replying;
  else
    send reply to pi;
    voted := TRUE;
  end if


Mutual exclusion (cont.)

voting algorithm

Maekawa’s algorithm (cont.)

For pi to exit the critical section
  state := RELEASED;
  Multicast release to all processes in Vi – {pi};

On receipt of a release from pi at pj (i ≠ j)
  if (queue of requests is non-empty)
  then
    remove head of queue – from pk, say;
    send reply to pk;
    voted := TRUE;
  else
    voted := FALSE;
  end if
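The request and release handlers of a single voter can be sketched as below. The class `Voter` and its return conventions are assumptions for this example; sending the reply is modelled by returning the name of the process that receives the vote.

```python
from collections import deque


class Voter:
    """Request and release handling at one process pj (a sketch)."""

    def __init__(self):
        self.state = "RELEASED"
        self.voted = False
        self.queue = deque()   # requests waiting for this voter's vote

    def on_request(self, sender):
        """Return the process that gets a reply now, or None if queued."""
        if self.state == "HELD" or self.voted:
            self.queue.append(sender)
            return None
        self.voted = True
        return sender

    def on_release(self):
        """Return the process that gets the vote next, or None."""
        if self.queue:
            self.voted = True
            return self.queue.popleft()
        self.voted = False
        return None


voter = Voter()
first = voter.on_request("p1")    # vote is free: reply to p1
second = voter.on_request("p2")   # vote already given: p2 is queued
passed_on = voter.on_release()    # p1 releases: the vote moves to p2
```

Each voter gives out at most one vote at a time, and because any two voting sets overlap, that single vote is what keeps two candidates from entering together.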


Mutual exclusion (cont.)

voting algorithm

• Evaluation
  – Properties
    • ME1: OK
    • ME2: NOK, deadlock possible
      solution: process requests in happened-before (→) order
    • ME3: OK
  – Performance
    • Bandwidth: on enter 2√N messages + on exit √N messages
    • Client delay: round trip
    • Synchronization delay: round trip
  – Failure:
    • Crash of a process in another voting set can be tolerated
    • Reliable communication required


Mutual exclusion (cont.)

• Discussion

– algorithms are expensive and not practical

– algorithms are extremely complex in the presence of failures

– better solution in most cases:

• let the server, managing the resource, perform concurrency control

• gives more transparency for the clients


Elections

• Problem statement:
  – select a process from a group of processes
  – several processes may start election concurrently
• Main requirement:
  – unique choice
  – Select process with highest id
    • Process id
    • <1/load, process id>


Elections (cont.)

• Basic requirements:
  – E1: participant pi sets electedi = ⊥ (not yet defined) or electedi = P, where P is the process with the highest id (Safety)
  – E2: all processes pi participate and set electedi ≠ ⊥, or crash (Liveness)


Elections (cont.)

• Solutions:
  – Bully election algorithm
  – Ring based election algorithm
• Evaluation
  – Bandwidth (~ total number of messages)
  – Turnaround time (the number of serialized message transmission times between initiation and termination of a single run)


Elections (cont.)

Bully election• Assumptions:

– each process has identifier– processes can fail during an election– communication is reliable

• Goal: – surviving member with the largest identifier is

elected as coordinator


Elections (cont.)

Bully election

• Roles for processes:

– coordinator

• elected process

• has highest identifier, at the time of election

– initiator

• process starting the election for some reason

Elections (cont.)

Bully election

• Three types of messages:
  – election message
    • sent by an initiator of the election to all other processes with a higher identifier
  – answer message
    • a reply message sent by the receiver of an election message
  – coordinator message
    • sent by the process becoming the coordinator to all other processes with lower identifiers

Elections (cont.)

Bully election

• Algorithm:
  – send election message:
    • the process doing it is called the initiator
    • any process may do it at any time
    • when a failed process is restarted, it starts an election, even though the current coordinator is functioning (bully)
  – a process receiving an election message
    • replies with an answer message
    • will start an election itself (why?)

Elections (cont.)

Bully election

• Algorithm:
  – actions of an initiator
    • when it receives no answer message within a certain time (2 T_trans + T_process), it becomes coordinator
    • when it has received an answer message (a process with a higher identifier is active) but receives no coordinator message (after x time units), it restarts the election
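The rules above can be sketched as a small single-threaded simulation. This is an illustration under simplifying assumptions, not the slides' own code: no process fails during the run, the timeout is modelled as "no live process with a higher identifier exists", and the function name and message tuples are hypothetical.

```python
def bully_election(initiator, ids, alive):
    """Simulate one bully election; no process fails during the run.

    ids   : identifiers of all processes in the group
    alive : set of identifiers that have not crashed (includes initiator)
    Returns (coordinator, messages); messages is a list of
    (kind, sender, receiver) tuples in sending order.
    """
    messages = []
    started = set()        # processes that already started an election
    pending = [initiator]  # processes that still have to start one
    coordinator = None
    while pending:
        p = pending.pop(0)
        if p in started:
            continue
        started.add(p)
        answered = False
        for q in sorted(i for i in ids if i > p):
            messages.append(("election", p, q))
            if q in alive:
                # A live receiver answers and starts its own election.
                messages.append(("answer", q, p))
                answered = True
                pending.append(q)
        if not answered:
            # Models the timeout: nobody with a higher identifier is
            # alive, so p announces itself to all lower identifiers.
            coordinator = p
            for q in sorted(i for i in ids if i < p):
                messages.append(("coordinator", p, q))
    return coordinator, messages


coordinator, log = bully_election(1, [1, 2, 3, 4], {1, 2})
```

In this run P3 and P4 are down, P1 initiates, P2 answers and then elects itself because its own election messages go unanswered.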

Elections (cont.)

Bully election

• Example: election of P2 after failure of P3 and P4

[Figure sequence: processes P1, P2, P3 and P4 exchanging election and answer messages; the frame captions are listed below in order]

  – P1 is the initiator
  – P2 and P3 reply and start an election
  – Election messages of P2 arrive
  – P3 replies
  – P3 fails
  – Timeout at P1: the election starts again
  – P2 replies and starts an election
  – P2 receives no replies and becomes coordinator

Elections (cont.)

Bully election

• Evaluation
  – Correctness: E1 and E2 hold if
    • communication is reliable
    • no process replaces a crashed process
  – Correctness: no guarantee for E1 if
    • a crashed process is replaced by a process with the same id
    • the assumed timeout values are inaccurate (= unreliable failure detector)
  – Performance
    • Worst case: O(n^2) messages
    • Optimal: bandwidth: n-2 messages; turnaround: 1 message

Elections (cont.) Ring based election

• Assumptions:
  – processes are arranged in a logical ring
  – each process has an identifier: i for Pi
  – processes remain functional and reachable during the algorithm

Elections (cont.) Ring based election

• Messages:
  – forwarded over the logical ring
  – 2 types:
    • election: used during the election; contains an identifier
    • elected: used to announce the new coordinator
• Process states:
  – participant
  – non-participant


Elections (cont.) Ring based election

• Algorithm

– process initiating an election

• becomes participant

• sends election message to its neighbour

Elections (cont.) Ring based election

[Figure: logical ring of processes P1, P21, P8, P11, P14 and P5; one process is marked initiator and the election message in flight carries identifier 11]

Elections (cont.) Ring based election

• Algorithm
  – upon receiving an election message, a process compares identifiers:
    • Received: identifier in the message
    • own: identifier of the process
  – 3 cases:
    • Received > own
    • Received < own
    • Received = own

Elections (cont.) Ring based election

• Algorithm
  – receive election message
    • Received > own
      – message forwarded
      – process becomes participant

Elections (cont.) Ring based election

[Figure, two frames: the same ring; the election message carrying identifier 11 is forwarded along the ring]

Elections (cont.) Ring based election

• Algorithm
  – receive election message
    • Received > own
      – message forwarded
      – process becomes participant
    • Received < own and process is non-participant
      – substitutes own identifier in message
      – message forwarded
      – process becomes participant

Elections (cont.) Ring based election

[Figure, four frames: the same ring; the election message in flight carries identifier 11 in the first frame and identifier 21 in the following three frames]

Elections (cont.) Ring based election

• Algorithm:
  – receive election message
    • Received > own
      – ...
    • Received < own and process is non-participant
      – ...
    • Received = own
      – identifier must be greatest
      – process becomes coordinator
      – new state: non-participant
      – sends elected message to neighbour

Elections (cont.) Ring based election

[Figure: the same ring with one process marked coordinator; the message in flight carries identifier 21]

Elections (cont.) Ring based election

• Algorithm: receive election message
  • Received > own
    – message forwarded
    – process becomes participant
  • Received < own and process is non-participant
    – substitutes own identifier in message
    – message forwarded
    – process becomes participant
  • Received = own
    – identifier must be greatest
    – process becomes coordinator
    – new state: non-participant
    – sends elected message to neighbour

Elections (cont.) Ring based election

• Algorithm
  – receive elected message
    • participant:
      – new state: non-participant
      – forwards message
    • coordinator:
      – election process completed

Elections (cont.) Ring based election

[Figure, seven frames: the same ring with the coordinator marked; the elected message carrying identifier 21 travels around the ring, and no message is in flight in the final frame]

Elections (cont.) Ring based election

• Evaluation
  – Why is the condition "Received < own and process is non-participant" necessary? (see next slide for full algorithm)
  – Number of messages:
    • worst case: 3 * n - 1
    • best case: 2 * n
  – concurrent elections: messages are extinguished
    • as soon as possible
    • before the winning result is announced

Elections (cont.) Ring based election

• Algorithm: receive election message
  • Received > own
    – message forwarded
    – process becomes participant
  • Received < own and process is non-participant
    – substitutes own identifier in message
    – message forwarded
    – process becomes participant
  • Received = own
    – identifier must be greatest
    – process becomes coordinator
    – new state: non-participant
    – sends elected message to neighbour
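The full algorithm can be traced with a short simulation. This is a sketch assuming a failure-free ring and a single initiator; the function name, the list representation of the ring and the message counter are hypothetical. It can be used to check the message counts quoted in the evaluation (2 * n and 3 * n - 1).

```python
def ring_election(ring, initiator):
    """Run one election on a failure-free logical ring.

    ring      : list of distinct identifiers in ring order
    initiator : index in `ring` of the process that starts the election
    Returns (coordinator, number_of_messages_sent).
    """
    n = len(ring)
    participant = [False] * n
    participant[initiator] = True
    kind, value = "election", ring[initiator]
    pos = (initiator + 1) % n   # process receiving the message in flight
    messages = 1                # the initiator's first election message
    while True:
        own = ring[pos]
        if kind == "election":
            if value > own:
                participant[pos] = True      # forward unchanged
            elif value < own:
                if participant[pos]:
                    # Only arises with concurrent elections: the
                    # message is extinguished.
                    return None, messages
                value = own                  # substitute own identifier
                participant[pos] = True
            else:
                # Own identifier came back: it must be the greatest.
                participant[pos] = False
                kind = "elected"
        else:
            if value == own:
                return own, messages         # elected message is back
            participant[pos] = False
        pos = (pos + 1) % n
        messages += 1                        # forward to the neighbour


best = ring_election([3, 7, 5, 2], initiator=1)   # initiator has the highest id
worst = ring_election([3, 7, 5, 2], initiator=2)  # initiator follows the highest id
```

With n = 4 the two runs send 2 * n = 8 and 3 * n - 1 = 11 messages respectively.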


Multicast communication

• Essential property:
  – 1 multicast operation ≠ multiple sends
    • higher efficiency
    • stronger delivery guarantees
• Operations: g = group, m = message
  – X-multicast(g, m)
  – X-deliver(m)
    • ≠ receive(m)
  – X: additional property: Basic, Reliable, FIFO, ...

Multicast communication (cont.) IP multicast

• Datagram operations
  – with a multicast IP address
• Failure model: cf. UDP
  – omission failures
  – no ordering or reliability guarantees

Multicast communication (cont.) Basic multicast

• = IP multicast + delivery guarantee if the multicasting process does not crash
• Straightforward algorithm (with reliable send):

  To B-multicast(g, m):
    for each p ∈ g: send(p, m)
  On receive(m) at p:
    B-deliver(m)

• Ex.: practical algorithm using IP-multicast

Multicast communication (cont.) Reliable multicast

• Properties:
  – Integrity (safety)
    • a correct process delivers a message at most once
  – Validity (liveness)
    • a correct process p multicasts m ⇒ p eventually delivers m
  – Agreement (liveness)
    • ∃ correct process p delivering m ⇒ all correct processes will deliver m
  – Uniform agreement (liveness)
    • ∃ process p (correct or failing) delivering m ⇒ all correct processes will deliver m

Multicast communication (cont.) Reliable multicast

• 2 algorithms:
  1. Using B-multicast
  2. Using IP-multicast + piggybacked acks

Multicast communication (cont.) Reliable multicast

Algorithm 1 with B-multicast

Multicast communication (cont.) Reliable multicast

Algorithm 1 with B-multicast

• Correct?
  – Integrity
  – Validity
  – Agreement
• Efficient?
  – NO: each message is transmitted |g| times
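The listing for Algorithm 1 did not survive in this transcript. The sketch below follows the commonly taught construction (every receiver re-multicasts a message the first time it sees it) and is an assumption about what the slide showed, not a reproduction of it. The class, the in-memory "network" and the `stats` counter are hypothetical.

```python
stats = {"sends": 0}   # counts point-to-point sends in the simulation


class Process:
    def __init__(self, name, group):
        self.name = name
        self.group = group      # shared list of all Process objects
        self.received = set()   # ids of messages already seen
        self.delivered = []     # payloads handed to the application

    def b_multicast(self, msg_id, payload):
        # Basic multicast: one reliable send per group member.
        for p in list(self.group):
            stats["sends"] += 1
            p.on_b_deliver(self, msg_id, payload)

    def r_multicast(self, msg_id, payload):
        self.b_multicast(msg_id, payload)

    def on_b_deliver(self, sender, msg_id, payload):
        if msg_id in self.received:
            return                      # duplicate: integrity
        self.received.add(msg_id)
        if sender is not self:
            # Relay before delivering, so that every correct process
            # gets the message if any process delivers it (agreement).
            self.b_multicast(msg_id, payload)
        self.delivered.append(payload)


group = []
group.extend(Process(name, group) for name in ("P", "Q", "R"))
group[0].r_multicast("m1", "hello")
```

In this run of three processes each process B-multicasts the message once, so it is sent 3 * 3 = 9 times in total, which illustrates the efficiency remark above.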

Multicast communication (cont.) Reliable multicast

Algorithm 2 with IP-multicast

[Figure: message processing. Incoming messages enter a hold-back queue; when delivery guarantees are met they move to the delivery queue and are delivered]

Multicast communication (cont.) Reliable multicast

Algorithm 2 with IP-multicast

Data structures at process p:
  S_p^g : sequence number
  R_q^g : sequence number of the latest message it has delivered from q
On initialization:
  S_p^g = 0; R_q^g = -1
For process p to R-multicast message m to group g:
  IP-multicast(g, <m, S_p^g, <q, R_q^g>>)
  S_p^g++
On IP-deliver(<m, S, <q, R>>) at q from p:
  if S = R_p^g + 1
  then R-deliver(m)
       R_p^g++
       check hold-back queue
  else if S > R_p^g + 1
       then store m in hold-back queue
            request missing messages
       endif
  endif
  if R > R_q^g for a piggybacked pair <q, R> then request missing messages endif
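A minimal sketch of the receiving side of Algorithm 2, assuming the receiver records which sequence numbers it still has to request rather than sending real negative acknowledgements. The class name, the dictionary layout and the `requests` list are hypothetical.

```python
class Receiver:
    """State that one process keeps for a group g."""

    def __init__(self, peers):
        self.R = {p: -1 for p in peers}          # latest delivered seq per sender
        self.holdback = {p: {} for p in peers}   # sender -> {seq: message}
        self.delivered = []
        self.requests = []                       # (sender, seq) still missing

    def on_ip_deliver(self, sender, m, S, acks):
        if S == self.R[sender] + 1:
            self._deliver(sender, m)
            held = self.holdback[sender]
            # Check the hold-back queue for messages that are now in order.
            while self.R[sender] + 1 in held:
                self._deliver(sender, held.pop(self.R[sender] + 1))
        elif S > self.R[sender] + 1:
            self.holdback[sender][S] = m
            self._request(sender, range(self.R[sender] + 1, S))
        # S <= R[sender]: duplicate, silently dropped.
        for q, R in acks.items():
            # Piggybacked acks reveal messages from q that we never saw.
            if q in self.R and R > self.R[q]:
                self._request(q, range(self.R[q] + 1, R + 1))

    def _deliver(self, sender, m):
        self.delivered.append(m)
        self.R[sender] += 1

    def _request(self, sender, seqs):
        for s in seqs:
            held = s in self.holdback[sender]
            if not held and (sender, s) not in self.requests:
                self.requests.append((sender, s))


q = Receiver(["P", "R"])
q.on_ip_deliver("P", "mp1", 1, {})         # arrives early: held back
q.on_ip_deliver("P", "mp0", 0, {"R": 0})   # fills the gap, acks a message from R
```

After these two arrivals the receiver has delivered mp0 and mp1 in order and has noted that message 0 from R is missing.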

Multicast communication (cont.) Reliable multicast

Algorithm 2 with IP-multicast

• Correct?
  – Integrity: sequence numbers + checksums
  – Validity: if missing messages are detected
  – Agreement: if a copy of the message remains available

Multicast communication (cont.) Reliable multicast

Algorithm 2 with IP-multicast: example

• 3 processes in group: P, Q, R
• State of process:
  – S: Next_sequence_number
  – Rq: Already_delivered from Q
  – Stored messages
• Presentation: one row per process, giving the process name with S, the R values for the other two processes, and the stored messages:

  P: 2 | Q: 3  R: 5 | < >

• Initial state:

  P: 0 | Q: -1  R: -1 | < >
  Q: 0 | P: -1  R: -1 | < >
  R: 0 | P: -1  Q: -1 | < >

• First multicast by P (message in transit: P: mp0, 0, <Q:-1, R:-1>):

  P: 1 | Q: -1  R: -1 | < mp0 >
  Q: 0 | P: -1  R: -1 | < >
  R: 0 | P: -1  Q: -1 | < >

• Arrival of multicast by P at Q, new state:

  P: 1 | Q: -1  R: -1 | < mp0 >
  Q: 0 | P: 0   R: -1 | < mp0 >
  R: 0 | P: -1  Q: -1 | < >

• Multicast by Q (message in transit: Q: mq0, 0, <P:0, R:-1>):

  P: 1 | Q: -1  R: -1 | < mp0 >
  Q: 1 | P: 0   R: -1 | < mp0, mq0 >
  R: 0 | P: -1  Q: -1 | < >

• Arrival of multicast by Q:

  P: 1 | Q: 0   R: -1 | < mp0, mq0 >
  Q: 1 | P: 0   R: -1 | < mp0, mq0 >
  R: 0 | P: -1  Q: 0  | < mq0 >

• When to delete stored messages?

Multicast communication (cont.) Ordered multicast

• FIFO
  if a correct process P: multicast(g, m); multicast(g, m')
  then for all correct processes: deliver(m') ⇒ deliver(m) before deliver(m')
• Causal
  if multicast(g, m) → multicast(g, m')
  then for all correct processes: deliver(m') ⇒ deliver(m) before deliver(m')
• Total

Multicast communication (cont.) Ordered multicast

• Total
  if ∃ p: deliver(m) before deliver(m')
  then for all correct processes: deliver(m') ⇒ deliver(m) before deliver(m')
• FIFO-Total = FIFO + Total
• Causal-Total = Causal + Total
• Atomic = reliable + Total

Multicast communication (cont.) Ordered multicast

[Figure: time lines of processes P1, P2 and P3 with messages T1, T2, F1, F2, F3, C1, C2 and C3]

Notice the consistent ordering of totally ordered messages T1 and T2, the FIFO-related messages F1 and F2 and the causally related messages C1 and C3, and the otherwise arbitrary delivery ordering of messages.

Multicast communication (cont.) FIFO multicast

• Alg. 1: R-multicast using IP-multicast
  – Correct?
    • Sender assigns S_p^g
    • Receivers deliver in this order
• Alg. 2: on top of any B-multicast

Multicast communication (cont.) FIFO multicast

Algorithm 2 on top of any B-multicast

Data structures at process p:
  S_p^g : sequence number
  R_q^g : sequence number of the latest message it has delivered from q
On initialization:
  S_p^g = 0; R_q^g = -1
For process p to FO-multicast message m to group g:
  B-multicast(g, <m, S_p^g>)
  S_p^g++
On B-deliver(<m, S>) at q from p:
  if S = R_p^g + 1
  then FO-deliver(m)
       R_p^g++
       check hold-back queue
  else if S > R_p^g + 1
       then store m in hold-back queue
       endif
  endif
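The FIFO algorithm above can be sketched in a few lines. This is an illustration, not the slides' own code: the two branches of the pseudocode are folded into "store, then drain the hold-back queue", which delivers the same messages in the same order. Class and method names are hypothetical.

```python
class FifoSender:
    def __init__(self):
        self.S = 0                       # next sequence number

    def fo_multicast(self, m):
        packet = (m, self.S)             # what B-multicast would carry
        self.S += 1
        return packet


class FifoReceiver:
    def __init__(self):
        self.R = {}                      # sender -> latest delivered seq
        self.holdback = {}               # (sender, seq) -> message
        self.delivered = []

    def on_b_deliver(self, sender, m, S):
        expected = self.R.get(sender, -1) + 1
        if S < expected:
            return                       # already delivered: ignore
        self.holdback[(sender, S)] = m
        # Deliver every message that is now next in sequence.
        while (sender, expected) in self.holdback:
            self.delivered.append(self.holdback.pop((sender, expected)))
            self.R[sender] = expected
            expected += 1


sender, receiver = FifoSender(), FifoReceiver()
a, b, c = (sender.fo_multicast(x) for x in ("a", "b", "c"))
for m, S in (c, a, b):                   # arrival order differs from sending order
    receiver.on_b_deliver("p", m, S)
```

Although the packets arrive in the order c, a, b, the receiver delivers them in sending order.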

Multicast communication (cont.) TOTAL multicast

• Basic approach:
  – Sender: assign totally ordered identifiers instead of process ids
  – Receiver: deliver as for FIFO ordering
• Alg. 1: use a (single) sequencer process
• Alg. 2: participants collectively agree on the assignment of sequence numbers

Multicast communication (cont.) TOTAL multicast: sequencer process

Multicast communication (cont.) TOTAL multicast: sequencer process

• Correct?
• Problems?
  – A single sequencer process
    • bottleneck
    • single point of failure
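The sequencer listing itself is missing from this transcript. The following sketch shows one common way to realise the idea, assuming the sequencer answers every message id with an order announcement and each member delivers strictly by sequence number. All names are hypothetical.

```python
class Sequencer:
    def __init__(self):
        self.s = 0                       # next sequence number to hand out

    def on_b_deliver(self, msg_id):
        order = ("order", msg_id, self.s)
        self.s += 1
        return order                     # would be B-multicast to the group


class Member:
    def __init__(self):
        self.r = 0                       # next sequence number to deliver
        self.messages = {}               # msg_id -> payload (hold-back queue)
        self.orders = {}                 # seq -> msg_id
        self.delivered = []

    def on_message(self, msg_id, payload):
        self.messages[msg_id] = payload
        self._try_deliver()

    def on_order(self, order):
        _, msg_id, seq = order
        self.orders[seq] = msg_id
        self._try_deliver()

    def _try_deliver(self):
        # Deliver while both the order and the message for slot r are known.
        while self.r in self.orders and self.orders[self.r] in self.messages:
            msg_id = self.orders.pop(self.r)
            self.delivered.append(self.messages.pop(msg_id))
            self.r += 1


seq = Sequencer()
o_x, o_y = seq.on_b_deliver("x"), seq.on_b_deliver("y")
m1, m2 = Member(), Member()
m1.on_message("y", "Y"); m1.on_message("x", "X"); m1.on_order(o_y); m1.on_order(o_x)
m2.on_order(o_x); m2.on_message("x", "X"); m2.on_message("y", "Y"); m2.on_order(o_y)
```

Both members see messages and order announcements in different interleavings, yet they deliver in the order fixed by the sequencer.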

Multicast communication (cont.) TOTAL multicast: ISIS algorithm

• Approach:
  – Sender:
    • B-multicasts message
  – Receivers:
    • propose sequence numbers to the sender
  – Sender:
    • uses the returned sequence numbers to generate the agreed sequence number

Multicast communication (cont.) TOTAL multicast: ISIS algorithm

[Figure: processes P1, P2, P3 and P4 with arrows labelled 1 Message, 2 Proposed Seq and 3 Agreed Seq]

Multicast communication (cont.) TOTAL multicast: ISIS algorithm

Data structures at process p:
  A_p^g : largest agreed sequence number
  P_p^g : largest sequence number proposed by p
On initialization:
  P_p^g = 0
For process p to TO-multicast message m to group g:
  B-multicast(g, <m, i>)      // i = unique id for m
On B-deliver(<m, i>) at q from p:
  P_q^g = max(A_q^g, P_q^g) + 1
  send(p, <i, P_q^g>)
  store <m, i, P_q^g> in hold-back queue
On receive(<i, P>) at p from q:
  wait for all replies; a is the largest reply
  B-multicast(g, <"order", i, a>)
On B-deliver(<"order", i, a>) at q from p:
  A_q^g = max(A_q^g, a)
  attach a to message i in hold-back queue
  reorder messages in hold-back queue (increasing sequence numbers)
  while message m in front of hold-back queue has been assigned an agreed sequence number
  do remove m from hold-back queue
     TO-deliver(m)
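A compact sketch of the receiver side of the ISIS algorithm. Two assumptions go beyond the slides: A is initialised to 0 like P, and messages with equal sequence numbers are ordered by message id (the slides do not say how ties are broken). Names are hypothetical.

```python
class IsisProcess:
    def __init__(self):
        self.A = 0            # largest agreed sequence number seen (assumed 0)
        self.P = 0            # largest sequence number proposed here
        self.holdback = {}    # msg_id -> [seq, agreed?, payload]
        self.delivered = []

    def on_message(self, msg_id, payload):
        self.P = max(self.A, self.P) + 1
        self.holdback[msg_id] = [self.P, False, payload]
        return self.P         # proposal returned to the sender

    def on_order(self, msg_id, a):
        self.A = max(self.A, a)
        self.holdback[msg_id][:2] = [a, True]
        while self.holdback:
            # Front of the queue: smallest sequence number; ties are
            # broken by message id (an assumption of this sketch).
            first = min(self.holdback, key=lambda i: (self.holdback[i][0], i))
            seq, agreed, payload = self.holdback[first]
            if not agreed:
                break         # a premature delivery would break total order
            del self.holdback[first]
            self.delivered.append(payload)


def agreed_sequence(proposals):
    """Sender side: the agreed number is the largest proposal."""
    return max(proposals)


p, q, r = IsisProcess(), IsisProcess(), IsisProcess()
prop1 = [p.on_message("m1", "one"), r.on_message("m1", "one")]
prop2 = [q.on_message("m2", "two"), p.on_message("m2", "two"), r.on_message("m2", "two")]
prop1.append(q.on_message("m1", "one"))
a1, a2 = agreed_sequence(prop1), agreed_sequence(prop2)
p.on_order("m1", a1); p.on_order("m2", a2)
q.on_order("m2", a2); q.on_order("m1", a1)
r.on_order("m2", a2); r.on_order("m1", a1)
```

The processes receive the two messages and the two order announcements in different orders, but all three deliver them identically.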

Multicast communication (cont.) TOTAL multicast: ISIS algorithm

• Correct?
  – Processes will agree on the sequence number for a message
  – Sequence numbers are monotonically increasing
  – No process can prematurely deliver a message
• Performance
  – 3 serial messages!
• Total ordering
  – ≠ FIFO
  – ≠ causal

Multicast communication (cont.) Causal multicast

• Limitations:
  – Causal order only by multicast operations
  – Non-overlapping, closed groups
• Approach:
  – Use vector timestamps
  – Timestamp = count of the number of multicast messages

Multicast communication (cont.) Causal multicast: vector timestamps

Meaning?

Multicast communication (cont.) Causal multicast: vector timestamps

• Correct?
  – Message timestamps: m has timestamp V, m' has timestamp V'
  – Given multicast(g, m) → multicast(g, m'), prove V < V'
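The algorithm listing for causal multicast is not present in this transcript. The sketch below follows the usual vector-timestamp rule and is an assumption about the intended algorithm: a message from process j with timestamp Vj is delivered at process i once Vj[j] = Vi[j] + 1 and Vj[k] <= Vi[k] for every other k. Names are hypothetical.

```python
class CausalProcess:
    def __init__(self, index, n):
        self.i = index
        self.V = [0] * n        # V[k] = multicasts from process k delivered here
        self.holdback = []
        self.delivered = []

    def co_multicast(self, m):
        self.V[self.i] += 1
        self.delivered.append(m)             # the sender delivers to itself at once
        return (self.i, list(self.V), m)     # packet handed to B-multicast

    def on_b_deliver(self, packet):
        if packet[0] == self.i:
            return                           # own message, already delivered
        self.holdback.append(packet)
        progress = True
        while progress:                      # one delivery may enable others
            progress = False
            for pkt in list(self.holdback):
                j, Vj, m = pkt
                if self._deliverable(j, Vj):
                    self.holdback.remove(pkt)
                    self.delivered.append(m)
                    self.V[j] += 1
                    progress = True

    def _deliverable(self, j, Vj):
        next_from_j = Vj[j] == self.V[j] + 1
        seen_causes = all(Vj[k] <= self.V[k] for k in range(len(Vj)) if k != j)
        return next_from_j and seen_causes


p0, p1, p2 = (CausalProcess(i, 3) for i in range(3))
a = p0.co_multicast("a")
p1.on_b_deliver(a)
b = p1.co_multicast("b")     # b causally follows a
p2.on_b_deliver(b)           # arrives first: held back
p2.on_b_deliver(a)           # a is delivered, then b
```

The timestamps of a and b in this run are [1, 0, 0] and [1, 1, 0], in line with the V < V' property asked for above.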


Consensus & related problems

• System model
  – N processes pi
  – Communication is reliable
  – Processes may fail
    • Crash
    • Byzantine
  – No message signing
    • Message signing limits the harm a faulty process can do
• Problems
  – Consensus
  – Byzantine generals
  – Interactive consistency

Consensus

• Problem statement
  – ∀ pi: undecided state & pi proposes vi
  – Message exchanges
  – Finally: each process pi sets decision variable di, enters the decided state and may not change di
• Requirements:
  – Termination: eventually each correct process pi sets di
  – Agreement: pi and pj correct & in decided state ⇒ di = dj
  – Integrity: correct processes all propose the same value d ⇒ any process in decided state has chosen d

Consensus

• Simple algorithm (no process failures)
  – Collect processes in group g
  – For each process pi:
    • R-multicast(g, vi)
    • Collect N values
    • d = majority(v1, v2, ..., vN)
• Problems with failures:
  – Crash: detected? Not in asynchronous systems
  – Byzantine? A faulty process can send around different values
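The failure-free algorithm can be sketched as follows. Two assumptions are not in the slides: `majority` returns `None` when no value has a strict majority, and R-multicast is modelled by simply giving every process the full list of proposals.

```python
from collections import Counter


def majority(values, default=None):
    """Return the value proposed by more than half of the processes.

    Returning `default` when there is no strict majority is an
    assumption of this sketch. `values` must not be empty.
    """
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else default


def simple_consensus(proposals):
    """proposals: process name -> proposed value; no failures occur."""
    # Reliable multicast without failures: everybody collects all N values.
    collected = {p: list(proposals.values()) for p in proposals}
    return {p: majority(values) for p, values in collected.items()}


decisions = simple_consensus({"p1": "attack", "p2": "attack", "p3": "retreat"})
```

Because every process applies the same deterministic function to the same N values, all decisions coincide, which gives agreement; if all processes propose the same value, that value is the majority, which gives integrity.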

Page 162: November 2005Distributed systems: distributed algorithms 1 Distributed Systems: Distributed algorithms

Byzantine generals

• Problem statement
  – Informal:
    • agree to attack or to retreat
    • commander issues the order
    • lieutenants are to decide to attack or retreat
    • any of them (commander included) can be ‘treacherous’
  – Formal
    • One process proposes a value
    • Others are to agree
• Requirements:
  – Termination: each correct process eventually decides
  – Agreement: all correct processes select the same value
  – Integrity: commander correct ⇒ other correct processes select the value of the commander

Page 163: November 2005Distributed systems: distributed algorithms 1 Distributed Systems: Distributed algorithms

Interactive consistency

• Problem statement
  – Correct processes agree upon a vector of values (one value for each process)
• Requirements:
  – Termination: each correct process eventually decides
  – Agreement: all correct processes select the same vector
  – Integrity: if pi is correct, then all correct processes decide on vi as the i-th component of their vector

Page 164: November 2005Distributed systems: distributed algorithms 1 Distributed Systems: Distributed algorithms

Related problems & solutions

• Basic solutions:
  – Consensus:
    • Ci(v1, v2, ..., vN) = decision value of pi
  – Byzantine generals
    • j is commander, proposing v
    • BGi(j, v) = decision value of pi
  – Interactive consistency
    • ICi(v1, v2, ..., vN)[j] = jth value in the decision vector of pi,
      with v1, v2, ..., vN the values proposed by the processes

Page 165: November 2005Distributed systems: distributed algorithms 1 Distributed Systems: Distributed algorithms

Related problems & solutions

• Derived solutions:
  – IC from BG:
    • Run BG N times, once with each process as commander
    • ICi(v1, v2, ..., vN)[j] = BGi(j, vj)
  – C from IC:
    • Run IC and produce a vector of values
    • Derive a single value with an appropriate function (e.g. majority)

Page 166: November 2005Distributed systems: distributed algorithms 1 Distributed Systems: Distributed algorithms

Related problems & solutions

• Derived solutions (cont.):
  – BG from C:
    • Commander pj sends value v to itself & the other processes
    • Each process pi runs C, proposing the value vi it received from the commander
    • BGi(j, v) = Ci(v1, v2, ..., vN)
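
The three derivations compose like ordinary function wrappers. The sketch below is only an illustration of that structure: `ideal_bg` is a hypothetical, failure-free stand-in for a real Byzantine generals solver, `majority` is one possible choice for the "appropriate function", and process indices are 0-based here for convenience.

```python
from collections import Counter


def majority(values):
    """Return the value held by more than half of the entries, else None."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else None


def ideal_bg(i, j, v):
    """Hypothetical failure-free BG solver: BGi(j, v) is simply the commander's value."""
    return v


def ic_from_bg(bg, i, proposals):
    """IC from BG: run BG once per process j as commander; component j is BGi(j, vj)."""
    return [bg(i, j, v_j) for j, v_j in enumerate(proposals)]


def c_from_ic(ic, i, proposals):
    """C from IC: obtain the agreed vector, then reduce it to a single value."""
    return majority(ic(i, proposals))


def bg_from_c(c, i, j, v, n):
    """BG from C: commander j sends v to all n processes, which then run C on it."""
    received = [v] * n  # a correct commander sends the same v to every process
    return c(i, received)


def ic_solver(i, proposals):
    return ic_from_bg(ideal_bg, i, proposals)


def c_solver(i, proposals):
    return c_from_ic(ic_solver, i, proposals)


print(ic_solver(0, ["a", "b", "a"]))            # decision vector of process 0
print(c_solver(1, ["a", "b", "a"]))             # consensus value of process 1
print(bg_from_c(c_solver, 2, 0, "attack", 4))   # process 2, commander 0
```

With real (fault-tolerant) solvers substituted for `ideal_bg`, the wrappers stay the same; only the innermost solver has to cope with failures.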

Page 167: November 2005Distributed systems: distributed algorithms 1 Distributed Systems: Distributed algorithms

Consensus in a synchronous system

• Assumptions
  – Use B-multicast
  – Up to f of the N processes may fail (crash failures)
• Approach
  – Proceed in f + 1 rounds
  – In each round correct processes B-multicast values
• Variables
  – Values_i^r = set of proposed values known to process pi at the beginning of round r

Page 168: November 2005Distributed systems: distributed algorithms 1 Distributed Systems: Distributed algorithms

Consensus in a synchronous system
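
A minimal simulation of the f + 1 round approach, assuming crash failures only. Everything specific in it is an assumption for illustration: the function name, the `crashes` parameter (which says in which round a process crashes and which receivers its last multicast still reached), sending only the values not sent before, and deciding on the minimum of the final set.

```python
def synchronous_consensus(proposals, f, crashes=None):
    """Simulate f + 1 rounds of B-multicast among processes that may crash.

    proposals: dict process -> proposed value (values must be orderable)
    f:         maximum number of crashes tolerated
    crashes:   dict process -> (round, receivers reached before crashing)
    Returns the decisions of the processes that never crashed.
    """
    crashes = crashes or {}
    procs = sorted(proposals)
    values = {p: {proposals[p]} for p in procs}   # Values_i^1
    previous = {p: set() for p in procs}          # Values_i^0
    alive = set(procs)

    for r in range(1, f + 2):                     # rounds 1 .. f + 1
        inbox = {p: set() for p in procs}
        for p in sorted(alive):
            new = values[p] - previous[p]         # values not multicast before
            if p in crashes and crashes[p][0] == r:
                receivers = crashes[p][1]         # partial multicast, then crash
            else:
                receivers = procs
            for q in receivers:
                inbox[q] |= new
        # Processes that crashed during this round take no further part.
        alive -= {p for p, (crash_round, _) in crashes.items() if crash_round == r}
        for p in alive:
            previous[p] = set(values[p])
            values[p] = values[p] | inbox[p]      # Values_i^(r+1)

    return {p: min(values[p]) for p in sorted(alive)}


# p1 crashes in round 1 after reaching only p2; with f = 1 two rounds are run.
print(synchronous_consensus({"p1": 1, "p2": 2, "p3": 3}, f=1,
                            crashes={"p1": (1, ["p2"])}))
```

Running the same scenario with `f=0` (a single round) leaves p2 and p3 with different sets, which is why one round more than the number of tolerated crashes is needed.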

Page 169: November 2005Distributed systems: distributed algorithms 1 Distributed Systems: Distributed algorithms

Consensus in a synchronous system

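• Termination?
  – Guaranteed: the system is synchronous, so every round completes within a bounded time
• Correct?
  – Each correct process arrives at the same set of values at the end of the final round
  – Proof (by contradiction)?
    • Assume two correct processes end with different sets
    • pi has value v & pj doesn't have value v
      ⇒ some pk managed to send v to pi and not to pj in the final round, so pk crashed during that round
    • pj never received v, so whoever passed v on to pk in the previous round also crashed in that round, and so on back to round 1
    • That requires at least one crash in each of the f + 1 rounds, contradicting the assumption of at most f crashes
• Agreement & Integrity?
  – Correct processes apply the same function to the same set of values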

Page 170: November 2005Distributed systems: distributed algorithms 1 Distributed Systems: Distributed algorithms

Byzantine generals in a synchronous system

• Assumptions
  – Arbitrary failures
  – Up to f of the N processes may be faulty
  – Channels are reliable: no message injections
  – Unsigned messages

Page 171: November 2005Distributed systems: distributed algorithms 1 Distributed Systems: Distributed algorithms

Byzantine generals in a synchronous system

• Impossibility with 3 processes

• Impossibility with N <= 3f processes

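[Figure: two scenarios with three processes; faulty processes are shown shaded. Message "3:1:u" reads "p3 says p1 says u".]
  Scenario 1 (p3 faulty): p1 (Commander) sends 1:v to p2 and 1:v to p3; p2 sends 2:1:v to p3; p3 sends 3:1:u to p2
  Scenario 2 (p1 faulty): p1 (Commander) sends 1:w to p2 and 1:x to p3; p2 sends 2:1:w to p3; p3 sends 3:1:x to p2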
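
The core of the three-process impossibility argument is that lieutenant p2 cannot tell the two scenarios apart. The sketch below makes this concrete; the helper `run_three` and its dictionary layout are assumptions for illustration, and the second scenario is instantiated with w = v and x = u.

```python
def run_three(commander_sends, relays):
    """Return what each lieutenant sees after both message rounds.

    commander_sends: value p1 sends to each lieutenant (round 1)
    relays:          value each lieutenant claims it got from p1 (round 2)
    """
    views = {}
    for me, peer in (("p2", "p3"), ("p3", "p2")):
        views[me] = {"from p1": commander_sends[me],
                     "from " + peer: relays[peer]}
    return views


# Scenario 1: commander correct (sends v to both), p3 faulty (relays u).
scenario_1 = run_three({"p2": "v", "p3": "v"}, {"p2": "v", "p3": "u"})

# Scenario 2: commander faulty (sends v to p2, u to p3), lieutenants relay honestly.
scenario_2 = run_three({"p2": "v", "p3": "u"}, {"p2": "v", "p3": "u"})

# p2 receives exactly the same messages in both scenarios.
print(scenario_1["p2"] == scenario_2["p2"])
```

In scenario 1 integrity forces p2 to choose v, so it must also choose v in scenario 2; by the symmetric argument p3 must choose u there, which violates agreement.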

Page 172: November 2005Distributed systems: distributed algorithms 1 Distributed Systems: Distributed algorithms

Byzantine generals in a synchronous system

• Solution with one faulty process
  – N = 4, f = 1
  – 2 rounds of messages:
    • Commander sends its value to each of the lieutenants
    • Each of the lieutenants sends the value it received to its peers
  – Lieutenant receives
    • Value of commander
    • N-2 values of peers
  – Each lieutenant applies the majority function to the values it received
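
The two-round scheme for N = 4, f = 1 can be simulated directly. This is a sketch under assumptions: the helper names `honest_relays` and `byzantine_generals_4` are made up for illustration, a faulty process is modelled by overriding what it sends, and `majority` returns `None` when no value has a majority.

```python
from collections import Counter


def majority(values):
    """Return the value held by more than half of the entries, else None."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else None


def honest_relays(commander_sends, overrides=None):
    """Round 2: each lieutenant forwards what it received; overrides model a liar."""
    relays = {(s, r): commander_sends[s]
              for s in commander_sends for r in commander_sends if s != r}
    relays.update(overrides or {})
    return relays


def byzantine_generals_4(commander_sends, relays):
    """Each lieutenant decides by majority over the commander's value and its peers' relays."""
    lieutenants = sorted(commander_sends)
    decisions = {}
    for me in lieutenants:
        received = [commander_sends[me]]
        received += [relays[(peer, me)] for peer in lieutenants if peer != me]
        decisions[me] = majority(received)
    return decisions


# Faulty lieutenant p3: commander sends v to all, p3 relays u to p2 and w to p4.
sends = {"p2": "v", "p3": "v", "p4": "v"}
lies = {("p3", "p2"): "u", ("p3", "p4"): "w"}
print(byzantine_generals_4(sends, honest_relays(sends, lies)))

# Faulty commander: sends u, w, v to p2, p3, p4; lieutenants relay honestly.
sends = {"p2": "u", "p3": "w", "p4": "v"}
print(byzantine_generals_4(sends, honest_relays(sends)))
```

In the first run the correct lieutenants p2 and p4 both obtain v (the entry computed for the faulty p3 is irrelevant); in the second run all three lieutenants see the same three distinct values and agree on the "no majority" outcome.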

Page 173: November 2005Distributed systems: distributed algorithms 1 Distributed Systems: Distributed algorithms

Byzantine generals in a synchronous system

• Solution with one faulty process

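[Figure: two scenarios with four processes; faulty processes are shown shaded.]
  Scenario 1 (p3 faulty): p1 (Commander) sends 1:v to p2, p3 and p4; p2 sends 2:1:v to p3 and p4; p3 sends 3:1:u to p2 and 3:1:w to p4; p4 sends 4:1:v to p2 and p3
  Scenario 2 (p1 faulty): p1 (Commander) sends 1:u to p2, 1:w to p3 and 1:v to p4; p2 sends 2:1:u to p3 and p4; p3 sends 3:1:w to p2 and p4; p4 sends 4:1:v to p2 and p3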

Page 174: November 2005Distributed systems: distributed algorithms 1 Distributed Systems: Distributed algorithms

Consensus & related problems

• Impossibility in asynchronous systems
  – Proven: no algorithm can guarantee to reach consensus in an asynchronous system, even with a single crash failure
  – Hence no guaranteed solution to
    • Byzantine generals problem
    • Interactive consistency
    • Reliable & totally ordered multicast
• Approaches for work-around:
  – Masking faults: restart crashed processes and use persistent storage
  – Use failure detectors: deem an unresponsive process failed and make it fail-silent by discarding its subsequent messages

Page 175: November 2005Distributed systems: distributed algorithms 1 Distributed Systems: Distributed algorithms

This chapter: overview
• Introduction
• Logical clocks
• Global states
• Failure detectors
• Mutual exclusion
• Elections
• Multicast communication
• Consensus and related problems

Page 176: November 2005Distributed systems: distributed algorithms 1 Distributed Systems: Distributed algorithms

Distributed Systems:

Distributed algorithms