Distributed Systems: Distributed algorithms

November 2005 Distributed systems: distributed algorithms


Page 1: Distributed Systems:  Distributed algorithms


Distributed Systems:

Distributed algorithms

Page 2: Distributed Systems:  Distributed algorithms


Overview of chapters
• Introduction
• Co-ordination models and languages
• General services
• Distributed algorithms
  – Ch 10 Time and global states, 11.4-11.5
  – Ch 11 Coordination and agreement, 12.1-12.5

• Shared data

• Building distributed services

Page 3: Distributed Systems:  Distributed algorithms


This chapter: overview
• Introduction
• Logical clocks
• Global states
• Failure detectors
• Mutual exclusion
• Elections
• Multicast communication
• Consensus and related problems

Page 4: Distributed Systems:  Distributed algorithms


Logical clocks
• Problem: ordering of events
  – a requirement for many algorithms
  – physical clocks cannot be used
• Use causality:
  – within a single process: by observation
  – between different processes: sending a message happens before receiving the same message

Page 5: Distributed Systems:  Distributed algorithms


Logical clocks (cont.)
• Formalization: the happens-before relation →
• Rules:
  – if x happens before y in any process p, then x → y
  – for any message m: send(m) → receive(m)
  – if x → y and y → z, then x → z
• Implementation: logical clocks

Page 6: Distributed Systems:  Distributed algorithms


Logical clocks (cont.)
• Logical clock
  – one counter per process
  – counter appropriately incremented
• Physical clock
  – counts oscillations occurring in a crystal at a definite frequency

Page 7: Distributed Systems:  Distributed algorithms


Logical clocks (cont.)
• Rules for incrementing the local logical clock:
  1. For each event (including send) in process p: Cp := Cp + 1
  2. When a process sends a message m, it piggybacks on m the value of Cp
  3. On receiving (m, t), a process q
     • computes Cq := max(Cq, t)
     • applies rule 1: Cq := Cq + 1
     Cq is the logical time of event receive(m)
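The three rules above can be sketched in a few lines of Python (a minimal sketch; the class and the two-process usage are illustrative, not from the slides):

```python
class LamportClock:
    def __init__(self):
        self.c = 0

    def event(self):
        # Rule 1: increment on every local event (including send)
        self.c += 1
        return self.c

    def send(self):
        # Rule 2: piggyback the clock value on the outgoing message
        return self.event()

    def receive(self, t):
        # Rule 3: take the max of local clock and timestamp, then apply rule 1
        self.c = max(self.c, t)
        return self.event()

p = LamportClock()
q = LamportClock()
t = p.send()          # p's send event: Cp = 1
rq = q.receive(t)     # q: max(0, 1) + 1 = 2
```

Note that receive(m) is always timestamped later than the matching send(m), which is exactly the send(m) → receive(m) rule.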

Page 8: Distributed Systems:  Distributed algorithms


Logical clocks (cont.)

• Logical timestamps: example
(figure: three processes P1, P2, P3, all starting at clock 0; events a–g carry Lamport timestamps a=1, b=1, c=2, d=3, e=4, f=5, g=3)

Page 9: Distributed Systems:  Distributed algorithms


Logical clocks (cont.)
• C(x): logical clock value for event x
• Correct usage: if x → y then C(x) < C(y)
• Incorrect usage: if C(x) < C(y) then x → y (does not hold)
• Solution: logical vector clocks

Page 10: Distributed Systems:  Distributed algorithms


Logical clocks (cont.)
• Vector clocks for N processes:
  – at process Pi: Vi[j] for j = 1, 2, …, N
  – Properties:
    • if x → y then V(x) < V(y)
    • if V(x) < V(y) then x → y

Page 11: Distributed Systems:  Distributed algorithms


Logical clocks (cont.)
• Rules for incrementing the logical vector clock:
  1. For each event (including send) in process Pi: Vi[i] := Vi[i] + 1
  2. When a process Pi sends a message m, it piggybacks on m the value of Vi
  3. On receiving (m, t), a process Pi
     • applies rule 1
     • Vi[j] := max(Vi[j], t[j]) for j = 1, 2, …, N
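As with the scalar clock, the vector rules fit in a short Python sketch (process indices and the two-process usage are illustrative assumptions):

```python
class VectorClock:
    def __init__(self, i, n):
        self.i = i                 # this process's index (0-based)
        self.v = [0] * n

    def event(self):
        # Rule 1: increment own component on every event (including send)
        self.v[self.i] += 1

    def send(self):
        # Rule 2: piggyback a copy of the vector on the message
        self.event()
        return list(self.v)

    def receive(self, t):
        # Rule 3: apply rule 1, then take component-wise maxima
        self.event()
        self.v = [max(a, b) for a, b in zip(self.v, t)]

p1, p2 = VectorClock(0, 3), VectorClock(1, 3)
m = p1.send()      # p1's vector becomes [1, 0, 0]
p2.receive(m)      # p2's vector becomes [1, 1, 0]
```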

Page 12: Distributed Systems:  Distributed algorithms


Logical clocks (cont.)

• Logical vector clocks: example
(figure: p1 has events a=(1,0,0) and b=(2,0,0), with message m1 sent at b to p2; p2 has c=(2,1,0) and d=(2,2,0), with message m2 sent at d to p3; p3 has e=(0,0,1) and f=(2,2,2); the horizontal axis is physical time)

Page 13: Distributed Systems:  Distributed algorithms


This chapter: overview
• Introduction
• Logical clocks
• Global states
• Failure detectors
• Mutual exclusion
• Elections
• Multicast communication
• Consensus and related problems

Page 14: Distributed Systems:  Distributed algorithms


Global states
• Detect global properties
(figure: a. garbage collection: p2 holds a reference to an object at p1, another object is garbage, and a reference may be in transit as a message; b. deadlock: p1 and p2 each wait for the other; c. termination: p1 and p2 are both passive while an activate message is in transit)

Page 15: Distributed Systems:  Distributed algorithms


Global states (cont.)
• Local states & events
  – Process Pi: events ei^k; state si^k, the state before event k
  – History of Pi: hi = <ei^0, ei^1, ei^2, …>
  – Finite prefix of the history of Pi: hi^k = <ei^0, ei^1, ei^2, …, ei^k>

Page 16: Distributed Systems:  Distributed algorithms


Global states (cont.)
• Global states & events
  – Global history: H = h1 ∪ h2 ∪ h3 ∪ … ∪ hn
  – Global state (taken when?): S = (s1^p, s2^q, …, sn^u); is it consistent?
  – Cut of the system's execution: C = h1^c1 ∪ h2^c2 ∪ … ∪ hn^cn

Page 17: Distributed Systems:  Distributed algorithms


Global states (cont.)

• Example of cuts:
(figure: p1 with events e1^0 … e1^3 and p2 with events e2^0 … e2^2 exchange messages m1 and m2 over physical time; one consistent and one inconsistent cut are drawn)

Page 18: Distributed Systems:  Distributed algorithms


Global states (cont.)
• Finite prefix of the history of Pi: hi^k = <ei^0, ei^1, ei^2, …, ei^k>
• Cut of the system's execution: C = h1^c1 ∪ h2^c2 ∪ … ∪ hn^cn
• Consistent cut: C is consistent ⇔ for each event e ∈ C, f → e implies f ∈ C
• A consistent global state corresponds to a consistent cut
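The consistency condition can be checked directly on a set of events (a toy sketch; the event names and the happened-before pairs are invented for illustration):

```python
def consistent(cut, happened_before):
    # happened_before: set of (f, e) pairs meaning f -> e.
    # A cut is consistent iff it is closed under happened-before:
    # whenever it contains e, it also contains every f with f -> e.
    return all(f in cut for (f, e) in happened_before if e in cut)

# send(m1) -> receive(m1): a cut containing the receive but not
# the send is inconsistent.
hb = {("send_m1", "recv_m1")}
ok = consistent({"send_m1", "recv_m1"}, hb)   # True
bad = consistent({"recv_m1"}, hb)             # False
```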

Page 19: Distributed Systems:  Distributed algorithms


Global states (cont.)
• Model the execution of a (distributed) system as S0 → S1 → S2 → S3 → …
  – a series of transitions between consistent states
  – each transition corresponds to one single event:
    • an internal event
    • sending a message
    • receiving a message
  – simultaneous events are placed in some arbitrary order

Page 20: Distributed Systems:  Distributed algorithms


Global states (cont.)
• Definitions:
  – Run = an ordering of all events (in a global history) consistent with each local history's ordering
  – Linearization = a consistent run that is also consistent with the happened-before relation →
  – S' is reachable from S ⇔ there is a linearization … S … S' …

Page 21: Distributed Systems:  Distributed algorithms


Global states (cont.)
• Kinds of global state predicates:
  – Stable: once true in S, it remains true in every S' reachable from S
  – Safety: an undesirable property never holds; with S0 the initial state of the system, for every S reachable from S0 the property is false in S
  – Liveness: a desirable property eventually holds; with S0 the initial state of the system, each linearization from S0 reaches some S in which the property is true

Page 22: Distributed Systems:  Distributed algorithms


Global states (cont.)
• Snapshot algorithm of Chandy & Lamport
  – Records a consistent global state
  – Assumptions:
    • neither channels nor processes fail
    • channels are unidirectional and provide FIFO-ordered message delivery
    • the graph of channels and processes is strongly connected
    • any process may initiate a global snapshot
    • processes may continue their execution during the snapshot

Page 23: Distributed Systems:  Distributed algorithms


Global states (cont.)
• Snapshot algorithm of Chandy & Lamport
  – Elements of the algorithm:
    • players: processes Pi with incoming channels and outgoing channels
    • marker messages
    • 2 rules: the marker receiving rule and the marker sending rule
  – Start of the algorithm:
    • a process acts as if it had received a marker message

Page 24: Distributed Systems:  Distributed algorithms


Global states (cont.)

Marker receiving rule for process pi
  On pi's receipt of a marker message over channel c:
    if (pi has not yet recorded its state) it
      records its process state now;
      records the state of c as the empty set;
      turns on recording of messages arriving over other incoming channels;
    else
      pi records the state of c as the set of messages it has received over c since it saved its state.
    end if

Marker sending rule for process pi
  After pi has recorded its state, for each outgoing channel c:
    pi sends one marker message over c (before it sends any other message over c).
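The two rules can be sketched for a single process observing its incoming channels (a simplified single-threaded model; the marker sending rule is only noted in a comment, and all process/channel names are illustrative):

```python
MARKER = "marker"

class Snapshot:
    """State recorded by one process during Chandy & Lamport's algorithm."""

    def __init__(self, name, state, incoming):
        self.name, self.state = name, state
        self.incoming = incoming          # names of incoming channels
        self.recorded_state = None
        self.channel_state = {}           # channel -> messages recorded for it
        self.recording = set()            # channels still being recorded

    def _record_now(self):
        # First marker seen: record own state and start recording the other
        # incoming channels. (The marker sending rule, not modelled here,
        # would now send a marker on every outgoing channel.)
        self.recorded_state = self.state
        self.channel_state = {c: [] for c in self.incoming}
        self.recording = set(self.incoming)

    def on_message(self, channel, msg):
        if msg == MARKER:
            if self.recorded_state is None:
                self._record_now()            # state of `channel` stays empty
            self.recording.discard(channel)   # stop recording this channel
        elif channel in self.recording:
            # Message was in flight when the snapshot started:
            # it belongs to the channel's recorded state.
            self.channel_state[channel].append(msg)

p = Snapshot("p2", "<$50, 2000>", incoming=["c1", "c2"])
p.on_message("c1", MARKER)            # records state; c1 recorded as empty
p.on_message("c2", "five widgets")    # in-flight message, recorded for c2
p.on_message("c2", MARKER)            # recording complete
```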

Page 25: Distributed Systems:  Distributed algorithms


Global states (cont.)

• Example:
(figure: two processes trading widgets; p1 holds account $1000 and no widgets, p2 holds account $50 and 2000 widgets; channel c1 carries messages from p1 to p2, channel c2 from p2 to p1)

Page 26: Distributed Systems:  Distributed algorithms


Global states (cont.)

1. Global state S0: p1 = <$1000, 0>, p2 = <$50, 2000>; c1 = (empty), c2 = (empty)
2. Global state S1: p1 = <$900, 0>, p2 = <$50, 2000>; c1 = <(Order 10, $100), M>, c2 = (empty)
3. Global state S2: p1 = <$900, 0>, p2 = <$50, 1995>; c1 = <(Order 10, $100), M>, c2 = <(five widgets)>
4. Global state S3: p1 = <$900, 5>, p2 = <$50, 1995>; c1 = <(Order 10, $100)>, c2 = (empty); recorded: C1 = <(five widgets)>, C2 = <>
(M = marker message)

Page 27: Distributed Systems:  Distributed algorithms


Global states (cont.)

1. Global state S0: p1 = <$1000, 0>, p2 = <$50, 2000>; c1 = (empty), c2 = (empty)
4. Global state S3: p1 = <$900, 5>, p2 = <$50, 1995>; c1 = <(Order 10, $100)>, c2 = (empty); recorded: C1 = <(five widgets)>, C2 = <>
5. Global state S4: p1 = <$900, 5>, p2 = <$50, 1995>; c1 = <(Order 10, $100)>, c2 = <M>; recorded: C1 = <(five widgets)>, C2 = <>
6. Global state S5: p1 = <$900, 5>, p2 = <$50, 1995>; c1 = <(Order 10, $100)>, c2 = (empty); recorded: C1 = <(five widgets)>, C2 = <>
(M = marker message)

Page 28: Distributed Systems:  Distributed algorithms


Global states (cont.)
• Observed state
  – corresponds to a consistent cut
  – is reachable!
(figure: the actual execution e0, e1, … runs from Sinit to Sfinal; recording begins and ends during it; the snapshot state Ssnap is reached from Sinit by the pre-snap events e'0, e'1, …, e'R-1 and leads to Sfinal by the post-snap events e'R, e'R+1, …)

Page 29: Distributed Systems:  Distributed algorithms


This chapter: overview
• Introduction
• Logical clocks
• Global states
• Failure detectors
• Mutual exclusion
• Elections
• Multicast communication
• Consensus and related problems

Page 30: Distributed Systems:  Distributed algorithms


Failure detectors
• Properties
  – unreliable failure detector: answers with Suspected or Unsuspected
  – reliable failure detector: answers with Failed or Unsuspected
• Implementation
  – every T sec: each process P multicasts "P is here"
  – maximum message transmission time:
    • asynchronous system: estimate E; P is Suspected after no "P is here" within T + E sec
    • synchronous system: absolute bound A; P has Failed after no "P is here" within T + A sec
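The timeout scheme above can be sketched as a small class (a sketch; the T and E values and process names are illustrative):

```python
class FailureDetector:
    """Unreliable heartbeat detector: suspect after T + E sec of silence."""

    def __init__(self, period_t, estimate_e):
        self.timeout = period_t + estimate_e
        self.last_seen = {}               # pid -> time of last "P is here"

    def heartbeat(self, pid, now):
        self.last_seen[pid] = now

    def status(self, pid, now):
        last = self.last_seen.get(pid)
        if last is not None and now - last <= self.timeout:
            return "Unsuspected"
        # may be wrong in an asynchronous system: the message
        # could merely be delayed, hence "Suspected", not "Failed"
        return "Suspected"

fd = FailureDetector(period_t=5, estimate_e=2)
fd.heartbeat("P", now=0)
```

With a synchronous bound A instead of the estimate E, the same check would justify answering "Failed" rather than "Suspected".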

Page 31: Distributed Systems:  Distributed algorithms


This chapter: overview
• Introduction
• Logical clocks
• Global states
• Failure detectors
• Mutual exclusion
• Elections
• Multicast communication
• Consensus and related problems

Page 32: Distributed Systems:  Distributed algorithms


Mutual exclusion
• Problem: how to give a single process temporarily a privilege?
  – privilege = the right to access a (shared) resource
  – resource = file, device, window, …
• Assumptions
  – clients execute the mutual exclusion algorithm
  – the resource itself might be managed by a server
  – reliable communication

Page 33: Distributed Systems:  Distributed algorithms


Mutual exclusion (cont.)
• Basic requirements:
  – ME1 (safety): at most one process may execute in the shared resource at any time
  – ME2 (liveness): a process requesting access to the shared resource is eventually granted it
  – ME3 (ordering or fairness): access to the shared resource is granted in happened-before order

Page 34: Distributed Systems:  Distributed algorithms


Mutual exclusion (cont.)
• Solutions:
  – central server algorithm
  – distributed algorithm using logical clocks
  – ring-based algorithm
  – voting algorithm
• Evaluation criteria:
  – bandwidth (= number of messages to enter and exit)
  – client delay (incurred by a process at enter and exit)
  – synchronization delay (delay between one process's exit and the next process's enter)

Page 35: Distributed Systems:  Distributed algorithms


Mutual exclusion (cont.)
Central server algorithm
• Central server offering 2 operations:
  – enter()
    • if the resource is free, the operation returns without delay
    • else the request is queued and the return from the operation is delayed
  – exit()
    • if the request queue is empty, the resource is marked free
    • else the return for a selected queued request is executed
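The server's logic is just a holder variable plus a FIFO queue; a minimal Python sketch (client identifiers are illustrative):

```python
from collections import deque

class LockServer:
    def __init__(self):
        self.holder = None        # client currently granted the resource
        self.queue = deque()      # pending enter() requests

    def enter(self, client):
        # grant immediately if free, otherwise queue the request
        if self.holder is None:
            self.holder = client
            return True           # reply sent: client may proceed
        self.queue.append(client)
        return False              # reply delayed until exit()

    def exit(self):
        # pass the resource to the next queued client, if any
        self.holder = self.queue.popleft() if self.queue else None
        return self.holder        # client now granted (None if free)

s = LockServer()
s.enter(3)        # P3 granted
s.enter(4)        # P4 queued
s.enter(2)        # P2 queued
nxt = s.exit()    # P3 releases; P4 is granted next
```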

Page 36: Distributed Systems:  Distributed algorithms


Mutual exclusion (cont.)
Central server algorithm: example
(figure: server with empty queue; P3 sends enter())

Page 37: Distributed Systems:  Distributed algorithms


Mutual exclusion (cont.)
Central server algorithm: example
(figure: the server grants the resource; user = P3)

Page 38: Distributed Systems:  Distributed algorithms


Mutual exclusion (cont.)
Central server algorithm: example
(figure: P3 holds the resource; queue empty)

Page 39: Distributed Systems:  Distributed algorithms


Mutual exclusion (cont.)
Central server algorithm: example
(figure: P4 sends enter() while P3 holds the resource)

Page 40: Distributed Systems:  Distributed algorithms


Mutual exclusion (cont.)
Central server algorithm: example
(figure: P4's request is queued; queue = 4)

Page 41: Distributed Systems:  Distributed algorithms


Mutual exclusion (cont.)
Central server algorithm: example
(figure: P2 also sends enter(); queue = 4)

Page 42: Distributed Systems:  Distributed algorithms


Mutual exclusion (cont.)
Central server algorithm: example
(figure: P2's request is queued too; queue = 4, 2)

Page 43: Distributed Systems:  Distributed algorithms


Mutual exclusion (cont.)
Central server algorithm: example
(figure: P3 sends exit(); queue = 4, 2)

Page 44: Distributed Systems:  Distributed algorithms


Mutual exclusion (cont.)
Central server algorithm: example
(figure: P3's exit() is processed; the resource is momentarily free, queue = 4, 2)

Page 45: Distributed Systems:  Distributed algorithms


Mutual exclusion (cont.)
Central server algorithm: example
(figure: the server grants the resource to P4; queue = 2, user = P4)

Page 46: Distributed Systems:  Distributed algorithms


Mutual exclusion (cont.)
Central server algorithm: evaluation
• ME3 not satisfied!
• Performance:
  – the single server is a performance bottleneck
  – enter critical section: 2 messages
  – synchronization: 2 messages between the exit of one process and the enter of the next
• Failure:
  – the central server is a single point of failure
  – what if a client, holding the resource, fails?
  – reliable communication required

Page 47: Distributed Systems:  Distributed algorithms


Mutual exclusion (cont.)
Ring-based algorithm
• All processes are arranged in a unidirectional logical ring
• A token is passed around the ring
• The process holding the token has access to the resource

Page 48: Distributed Systems:  Distributed algorithms


Mutual exclusion (cont.)
Ring-based algorithm
(figure: processes P1–P6 arranged in a unidirectional ring)

Page 49: Distributed Systems:  Distributed algorithms


Mutual exclusion (cont.)
Ring-based algorithm
(figure: the token is at P2; P2 can use the resource)

Page 50: Distributed Systems:  Distributed algorithms


Mutual exclusion (cont.)
Ring-based algorithm
(figure: P2 stopped using the resource and forwarded the token)

Page 51: Distributed Systems:  Distributed algorithms


Mutual exclusion (cont.)
Ring-based algorithm
(figure: P3 doesn't need the resource and forwards the token)

Page 52: Distributed Systems:  Distributed algorithms


Mutual exclusion (cont.)
Ring-based algorithm
(figure: the token continues around the ring)

Page 53: Distributed Systems:  Distributed algorithms


Mutual exclusion (cont.)
Ring-based algorithm: evaluation
• ME3 not satisfied
• Efficiency:
  – high when the resource is used heavily
  – high overhead when usage is very low (the token circulates anyway)
• Failure:
  – process failure: loss of the ring!
  – reliable communication required

Page 54: Distributed Systems:  Distributed algorithms


Mutual exclusion (cont.)
Distributed algorithm using logical clocks
• Distributed agreement algorithm
  – multicast requests to all participating processes
  – use the resource only when all other participants agree (= a reply is received from each)
• Processes
  – keep a logical clock; its value is included in all request messages
  – behave as a finite state machine with states: released, wanted, held

Page 55: Distributed Systems:  Distributed algorithms


Mutual exclusion (cont.)
Distributed algorithm using logical clocks
• Ricart and Agrawala's algorithm: process Pj
  – on initialization:
    state := released;
  – to obtain the resource:
    state := wanted;
    T := logical clock value for the next event;
    multicast the request <T, Pj> to the other processes;
    wait for n-1 replies;
    state := held;

Page 56: Distributed Systems:  Distributed algorithms


Mutual exclusion (cont.)
Distributed algorithm using logical clocks
• Ricart and Agrawala's algorithm: process Pj
  – on receipt of a request <Ti, Pi>:
    if (state = held) or (state = wanted and (T, Pj) < (Ti, Pi))
    then queue the request from Pi
    else reply immediately to Pi
  – to release the resource:
    state := released;
    reply to all queued requests;
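Both halves of the algorithm can be condensed into one class; message passing is simulated by direct method calls, so this is a single-threaded sketch rather than a real distributed implementation (all names are illustrative):

```python
RELEASED, WANTED, HELD = "released", "wanted", "held"

class RAProcess:
    def __init__(self, pid):
        self.pid = pid
        self.peers = []          # the other processes (wired up after creation)
        self.state = RELEASED
        self.clock = 0           # Lamport clock
        self.stamp = None        # (T, pid) of our own pending request
        self.queued = []         # requesters whose reply we deferred
        self.replies = 0

    def request(self):
        self.state = WANTED
        self.clock += 1
        self.stamp = (self.clock, self.pid)
        self.replies = 0
        for p in self.peers:                 # multicast <T, Pj>
            p.on_request(self.stamp, self)   # (the real version now waits)

    def on_request(self, stamp, sender):
        self.clock = max(self.clock, stamp[0]) + 1
        if self.state == HELD or (self.state == WANTED and self.stamp < stamp):
            self.queued.append(sender)       # defer the reply
        else:
            sender.on_reply()                # reply immediately

    def on_reply(self):
        self.replies += 1
        if self.replies == len(self.peers):  # n-1 replies collected
            self.state = HELD

    def release(self):
        self.state = RELEASED
        for p in self.queued:                # answer the deferred requests
            p.on_reply()
        self.queued = []

p1, p2, p3 = RAProcess(1), RAProcess(2), RAProcess(3)
for a in (p1, p2, p3):
    a.peers = [b for b in (p1, p2, p3) if b is not a]

p1.request()     # both peers reply at once: p1 becomes held
p2.request()     # p1 defers (held), p3 replies: p2 waits
p1.release()     # the deferred reply is sent: p2 becomes held
```

Tuple comparison on (T, pid) gives the total order the algorithm needs: ties on T are broken by process id.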

Page 57: Distributed Systems:  Distributed algorithms


Mutual exclusion (cont.)
Distributed algorithm using logical clocks
• Ricart and Agrawala's algorithm: example
  – 3 processes
  – P1 and P2 request the resource concurrently
  – P3 is not interested in using the resource

Page 58: Distributed Systems:  Distributed algorithms


Mutual exclusion (cont.)
Ricart and Agrawala's algorithm: example
(figure: P1, P2 and P3 are all released, queues empty)

Page 59: Distributed Systems:  Distributed algorithms


Mutual exclusion (cont.)
Ricart and Agrawala's algorithm: example
(figure: P1 becomes wanted and multicasts <41,P1>; 0 replies received so far)

Page 60: Distributed Systems:  Distributed algorithms


Mutual exclusion (cont.)
Ricart and Agrawala's algorithm: example
(figure: P2 also becomes wanted and multicasts <34,P2>; P1 and P2 each have 0 replies)

Page 61: Distributed Systems:  Distributed algorithms


Mutual exclusion (cont.)
Ricart and Agrawala's algorithm: example
(figure: the requests reach P3, whose clock advances to 43; P1 and P2 still have 0 replies)

Page 62: Distributed Systems:  Distributed algorithms


Mutual exclusion (cont.)
Ricart and Agrawala's algorithm: example
(figure: P3, now at clock 45, replies to both requests; the replies are in transit)

Page 63: Distributed Systems:  Distributed algorithms


Mutual exclusion (cont.)
Ricart and Agrawala's algorithm: example
(figure: P3's replies arrive: P1 and P2 each have 1 reply)

Page 64: Distributed Systems:  Distributed algorithms


Mutual exclusion (cont.)
Ricart and Agrawala's algorithm: example
(figure: P2 receives P1's request <41,P1>; since <34,P2> < <41,P1>, P2 queues P1)

Page 65: Distributed Systems:  Distributed algorithms


Mutual exclusion (cont.)
Ricart and Agrawala's algorithm: example
(figure: P1 receives P2's request <34,P2>; since <34,P2> < <41,P1>, P1 replies to P2)

Page 66: Distributed Systems:  Distributed algorithms


Mutual exclusion (cont.)
Ricart and Agrawala's algorithm: example
(figure: P1's reply to P2 is in transit)

Page 67: Distributed Systems:  Distributed algorithms


Mutual exclusion (cont.)
Ricart and Agrawala's algorithm: example
(figure: P2 now has 2 replies and becomes held; P1 still waits with 1 reply)

Page 68: Distributed Systems:  Distributed algorithms


Mutual exclusion (cont.)
Ricart and Agrawala's algorithm: example
(figure: P2 holds the resource; P1's request is queued at P2)

Page 69: Distributed Systems:  Distributed algorithms


Mutual exclusion (cont.)
Ricart and Agrawala's algorithm: example
(figure: P2 has released the resource; its reply to the queued request from P1 is in transit)

Page 70: Distributed Systems:  Distributed algorithms


Mutual exclusion (cont.)
Distributed algorithm using logical clocks: evaluation
• Performance:
  – expensive algorithm: 2 * (n - 1) messages to get the resource
  – client delay: one round trip
  – synchronization delay: 1 message to pass the resource to another process
  – does not solve the performance bottleneck
• Failure:
  – each process must know all other processes
  – reliable communication required

Page 71: Distributed Systems:  Distributed algorithms


Mutual exclusion (cont.)
Voting algorithm
• Approach of Maekawa:
  – communication with a subset of the partners should suffice
  – a candidate collects sufficient votes
• Voting sets: Vi = voting set for pi
  – ∀ i, j: Vi ∩ Vj ≠ ∅
  – |Vi| = K
  – each pj is contained in M voting sets
• Optimal solution:
  – K ~ √N
  – M = K

Page 72: Distributed Systems:  Distributed algorithms


Mutual exclusion (cont.)
Voting algorithm: Maekawa's algorithm

On initialization
  state := RELEASED;
  voted := FALSE;

For pi to enter the critical section
  state := WANTED;
  Multicast request to all processes in Vi – {pi};
  Wait until (number of replies received = (K – 1));
  state := HELD;

On receipt of a request from pi at pj (i ≠ j)
  if (state = HELD or voted = TRUE)
  then
    queue request from pi without replying;
  else
    send reply to pi;
    voted := TRUE;
  end if

Page 73: Distributed Systems:  Distributed algorithms


Mutual exclusion (cont.)
Voting algorithm: Maekawa's algorithm (cont.)

For pi to exit the critical section
  state := RELEASED;
  Multicast release to all processes in Vi – {pi};

On receipt of a release from pi at pj (i ≠ j)
  if (queue of requests is non-empty)
  then
    remove head of queue – from pk, say;
    send reply to pk;
    voted := TRUE;
  else
    voted := FALSE;
  end if
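One common way to build voting sets with K ~ √N is a grid construction (a sketch under the assumption that N is a perfect square; it approximates the optimal K and is not Maekawa's exact construction):

```python
import math

def grid_voting_set(i, n):
    """Arrange n = k*k processes in a k x k grid; Vi = pi's row + column.

    Any two sets intersect (a row always meets a column), and
    |Vi| = 2k - 1, i.e. about 2 * sqrt(n)."""
    k = math.isqrt(n)
    assert k * k == n, "sketch assumes n is a perfect square"
    r, c = divmod(i, k)
    row = {r * k + j for j in range(k)}
    col = {j * k + c for j in range(k)}
    return row | col

sets = [grid_voting_set(i, 9) for i in range(9)]   # 3x3 grid, |Vi| = 5
```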

Page 74: Distributed Systems:  Distributed algorithms


Mutual exclusion (cont.)
Voting algorithm: evaluation
• Properties:
  – ME1: OK
  – ME2: not OK, deadlock possible; solution: process requests in happened-before order
  – ME3: OK
• Performance:
  – bandwidth: 2√N messages on enter + √N messages on exit
  – client delay: one round trip
  – synchronization delay: one round trip
• Failure:
  – a crash of a process in another voting set can be tolerated
  – reliable communication required

Page 75: Distributed Systems:  Distributed algorithms


Mutual exclusion (cont.)
• Discussion
  – the algorithms are expensive and not practical
  – the algorithms become extremely complex in the presence of failures
  – a better solution in most cases:
    • let the server managing the resource perform the concurrency control
    • this gives more transparency for the clients

Page 76: Distributed Systems:  Distributed algorithms


This chapter: overview
• Introduction
• Logical clocks
• Global states
• Failure detectors
• Mutual exclusion
• Elections
• Multicast communication
• Consensus and related problems

Page 77: Distributed Systems:  Distributed algorithms


Elections
• Problem statement:
  – select a process from a group of processes
  – several processes may start an election concurrently
• Main requirement:
  – unique choice: select the process with the highest id
  – the id may be e.g. the process id, or <1/load, process id>

Page 78: Distributed Systems:  Distributed algorithms


Elections (cont.)
• Basic requirements:
  – E1 (safety): each participant pi sets electedi = ⊥ ("not yet defined") or electedi = P, where P is the process with the highest id
  – E2 (liveness): all processes pi participate and eventually set electedi ≠ ⊥, or crash

Page 79: Distributed Systems:  Distributed algorithms


Elections (cont.)
• Solutions:
  – bully election algorithm
  – ring-based election algorithm
• Evaluation criteria:
  – bandwidth (~ total number of messages)
  – turnaround time (the number of serialized message transmission times between initiation and termination of a single run)

Page 80: Distributed Systems:  Distributed algorithms


Elections (cont.)
Bully election
• Assumptions:
  – each process has an identifier
  – processes can fail during an election
  – communication is reliable
• Goal:
  – the surviving member with the largest identifier is elected as coordinator

Page 81: Distributed Systems:  Distributed algorithms


Elections (cont.)
Bully election
• Roles for processes:
  – coordinator: the elected process; it has the highest identifier at the time of the election
  – initiator: a process starting the election for some reason

Page 82: Distributed Systems:  Distributed algorithms


Elections (cont.)
Bully election
• Three types of messages:
  – election message: sent by an initiator of the election to all processes with a higher identifier
  – answer message: a reply sent by the receiver of an election message
  – coordinator message: sent by the process becoming the coordinator to all processes with lower identifiers

Page 83: Distributed Systems:  Distributed algorithms


Elections (cont.)
Bully election
• Algorithm:
  – sending an election message:
    • the process doing it is called the initiator
    • any process may do it at any time
    • when a failed process is restarted, it starts an election, even though the current coordinator is functioning (hence "bully")
  – a process receiving an election message:
    • replies with an answer message
    • will start an election itself (why?)

Page 84: Distributed Systems:  Distributed algorithms


Elections (cont.)
Bully election
• Algorithm:
  – actions of an initiator:
    • when not receiving an answer message within a certain time (2·Ttrans + Tprocess), it becomes coordinator
    • when having received an answer message (so a process with a higher identifier is active) but no coordinator message (after x time units), it restarts the election
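Ignoring timeouts and message transport, the outcome of a bully election can be sketched as a recursion over the alive identifiers (a toy model: answer and coordinator messages are implicit, and the ids are illustrative):

```python
def bully_elect(initiator, alive_ids):
    # The initiator sends `election` to all higher ids; any alive higher
    # process answers and takes over the election itself.
    higher = [p for p in alive_ids if p > initiator]
    if not higher:
        return initiator          # no answer: the initiator becomes coordinator
    # The highest alive process eventually wins its own election.
    return bully_elect(max(higher), alive_ids)

# P1 starts an election while P3 and P4 have failed:
coord = bully_elect(1, alive_ids={1, 2})
```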

Page 85: Distributed Systems:  Distributed algorithms


Elections (cont.)
Bully election
• Example: election of P2 after failure of P3 and P4
(figure: P1 is the initiator and sends election messages to P2, P3 and P4)

Page 86: Distributed Systems:  Distributed algorithms


Elections (cont.)
Bully election
• Example: election of P2 after failure of P3 and P4
(figure: P2 and P3 reply with answer messages and start elections of their own)

Page 87: Distributed Systems:  Distributed algorithms


Elections (cont.)
Bully election
• Example: election of P2 after failure of P3 and P4
(figure: the election messages of P2 arrive)

Page 88: Distributed Systems:  Distributed algorithms


Elections (cont.)
Bully election
• Example: election of P2 after failure of P3 and P4
(figure: P3 replies to P2's election message)

Page 89: Distributed Systems:  Distributed algorithms


Elections (cont.)
Bully election
• Example: election of P2 after failure of P3 and P4
(figure: P3 fails before announcing itself as coordinator)

Page 90: Distributed Systems:  Distributed algorithms


Elections (cont.)
Bully election
• Example: election of P2 after failure of P3 and P4
(figure: timeout at P1: no coordinator message received, so the election starts again)

Page 91: Distributed Systems:  Distributed algorithms


Elections (cont.)
Bully election
• Example: election of P2 after failure of P3 and P4
(figure: P1's new election messages reach P2, P3 and P4)

Page 92: Distributed Systems:  Distributed algorithms


Elections (cont.)
Bully election
• Example: election of P2 after failure of P3 and P4
(figure: P2 replies and starts an election of its own)

Page 93: Distributed Systems:  Distributed algorithms


Elections (cont.)
Bully election
• Example: election of P2 after failure of P3 and P4
(figure: P2 receives no replies and becomes coordinator)

Page 94: Distributed Systems:  Distributed algorithms

November 2005 Distributed systems: distributed algorithms

94

Elections (cont.) Bully election

• Evaluation
  – Correctness: E1 & E2 hold, if
    • communication is reliable
    • no process replaces a crashed process
  – Correctness: no guarantee for E1, if
    • a crashed process is replaced by a process with the same id
    • the assumed timeout values are inaccurate (= unreliable failure detector)
  – Performance
    • worst case: O(n²) messages
    • optimal: bandwidth: n − 2 messages; turnaround: 1 message
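The evaluation above can be illustrated with a tiny simulation of a bully round. This is a hedged sketch under our own assumptions: the message exchange and timeouts are abstracted into an `alive` map, and all names are ours, not from the slides.

```python
# Hedged sketch of one bully-election round; the `alive` map stands in
# for timeout-based failure detection.
def bully_election(initiator, ids, alive):
    """Return the coordinator id the initiator ends up accepting."""
    higher = [p for p in ids if p > initiator and alive[p]]
    if not higher:
        return initiator          # no higher process answers: initiator wins
    # otherwise the highest surviving process eventually announces itself
    return max(p for p in ids if alive[p])

ids = [1, 2, 3, 4]
alive = {1: True, 2: True, 3: False, 4: False}   # P3 and P4 have failed
print(bully_election(1, ids, alive))             # P1 starts; P2 wins: 2
```

With all four processes alive, any initiator ends up with P4 as coordinator, matching the "highest identifier wins" rule.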

Elections (cont.) Ring based election

• Assumptions:
  – processes are arranged in a logical ring
  – each process has an identifier: i for Pi
  – processes remain functional and reachable during the algorithm

Elections (cont.) Ring based election

• Messages:
  – forwarded over the logical ring
  – 2 types:
    • election: used during the election; contains an identifier
    • elected: used to announce the new coordinator
• Process states:
  – participant
  – non-participant

Elections (cont.) Ring based election

• Algorithm
  – a process initiating an election
    • becomes participant
    • sends an election message to its neighbour

Elections (cont.) Ring based election

[Figure: logical ring of processes P1, P5, P8, P11, P14, P21; the initiator P11 sends an election message carrying identifier 11 to its neighbour]

Elections (cont.) Ring based election

• Algorithm
  – upon receiving an election message, a process compares identifiers:
    • received: the identifier in the message
    • own: the identifier of the process
  – 3 cases:
    • received > own
    • received < own
    • received = own

Elections (cont.) Ring based election

• Algorithm – receive election message
  • received > own
    – message forwarded
    – process becomes participant

[Figure: the election message carrying identifier 11 is forwarded along the ring; each forwarding process becomes a participant]

Elections (cont.) Ring based election

• Algorithm – receive election message
  • received > own
    – message forwarded
    – process becomes participant
  • received < own and process is non-participant
    – substitutes its own identifier in the message
    – message forwarded
    – process becomes participant

[Figure: when the election message ⟨11⟩ reaches P21, P21 substitutes its own identifier and the message ⟨21⟩ is forwarded around the ring]

Elections (cont.) Ring based election

• Algorithm: – receive election message
  • received > own
    – ...
  • received < own and process is non-participant
    – ...
  • received = own
    – its identifier must be the greatest
    – process becomes coordinator
    – new state: non-participant
    – sends an elected message to its neighbour

[Figure: the election message ⟨21⟩ returns to P21, which becomes coordinator]

Elections (cont.) Ring based election

• Algorithm: receive election message
  • received > own
    – message forwarded
    – process becomes participant
  • received < own and process is non-participant
    – substitutes its own identifier in the message
    – message forwarded
    – process becomes participant
  • received = own
    – its identifier must be the greatest
    – process becomes coordinator
    – new state: non-participant
    – sends an elected message to its neighbour
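The receive rules above can be sketched as a small simulation. This is a hedged illustration with our own names; identifiers and ring order are arbitrary, and the separate elected round is omitted.

```python
# Hedged simulation of the ring-based election message round: forward
# larger identifiers, substitute your own when it is larger, stop when
# a process sees its own identifier come back.
def ring_election(ring, start):
    """ring: list of ids in ring order; start: index of the initiator."""
    n = len(ring)
    msg = ring[start]                      # initiator sends its own id
    i = (start + 1) % n
    while msg != ring[i]:                  # until a process sees its own id
        if ring[i] > msg:                  # received < own: substitute
            msg = ring[i]
        i = (i + 1) % n                    # forward to the neighbour
    return msg                             # this process becomes coordinator

print(ring_election([1, 21, 8, 11, 14, 5], 3))  # initiator P11; prints 21
```

Whoever initiates, the message carrying the greatest identifier is the only one that survives a full circuit, so the result is always the maximum id.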

Elections (cont.) Ring based election

• Algorithm – receive elected message
  • participant:
    – new state: non-participant
    – forwards the message
  • coordinator:
    – election process completed

[Figure: the elected message ⟨21⟩ circulates once around the ring; each participant notes P21 as coordinator and becomes non-participant]

Elections (cont.) Ring based election

• Evaluation
  – Why is the condition
    "received < own and process is non-participant"
    necessary? (see the next slide for the full algorithm)
  – Number of messages:
    • worst case: 3n − 1
    • best case: 2n
  – concurrent elections: duplicate election messages are extinguished
    • as soon as possible
    • before the winning result is announced

Elections (cont.) Ring based election

• Algorithm: receive election message
  • received > own
    – message forwarded
    – process becomes participant
  • received < own and process is non-participant
    – substitutes its own identifier in the message
    – message forwarded
    – process becomes participant
  • received = own
    – its identifier must be the greatest
    – process becomes coordinator
    – new state: non-participant
    – sends an elected message to its neighbour

This chapter: overview

• Introduction
• Logical clocks
• Global states
• Failure detectors
• Mutual exclusion
• Elections
• Multicast communication
• Consensus and related problems

Multicast communication

• Essential property:
  – 1 multicast operation <> multiple sends
    • higher efficiency
    • stronger delivery guarantees
• Operations: g = group, m = message
  – X-multicast(g, m)
  – X-deliver(m)
    • <> receive(m)
  – X = additional property: Basic, Reliable, FIFO, ...

Multicast communication(cont.) IP multicast

• Datagram operations
  – with a multicast IP address
• Failure model: cf. UDP
  – omission failures
  – no ordering or reliability guarantees

Multicast communication(cont.) Basic multicast

• = IP multicast + delivery guarantee if the multicasting process does not crash
• Straightforward algorithm (with reliable send):

  To B-multicast(g, m):
    for each p in g: send(p, m)
  On receive(m) at p:
    B-deliver(m)

• A practical algorithm uses IP-multicast
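The straightforward algorithm can be mirrored directly in code. This is a minimal sketch with our own `Process` stand-in; a reliable one-to-one send is simulated by a direct method call.

```python
# Minimal sketch of B-multicast as "one reliable send per member".
class Process:
    def __init__(self, name):
        self.name, self.delivered = name, []
    def receive(self, m):        # on receive(m) at p: B-deliver(m)
        self.delivered.append(m)

def b_multicast(group, m):
    for p in group:              # for each p in g: send(p, m)
        p.receive(m)

g = [Process("p"), Process("q"), Process("r")]
b_multicast(g, "hello")
print([p.delivered for p in g])  # every member delivers once
```

The sketch also shows why the delivery guarantee is conditional: if the sender crashed partway through the loop, only a prefix of the group would deliver.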

Multicast communication(cont.) Reliable multicast

• Properties:
  – Integrity (safety)
    • a correct process delivers a message at most once
  – Validity (liveness)
    • a correct process p multicasts m ⇒ p delivers m
  – Agreement (liveness)
    • a correct process p delivers m ⇒ all correct processes will deliver m
  – Uniform agreement (liveness)
    • any process p (correct or failing) delivers m ⇒ all correct processes will deliver m

Multicast communication(cont.) Reliable multicast

• 2 algorithms:
  1. using B-multicast
  2. using IP-multicast + piggybacked acknowledgements

Multicast communication(cont.) Reliable multicast

Algorithm 1 with B-multicast

Multicast communication(cont.) Reliable multicast

Algorithm 1 with B-multicast
• Correct?
  – Integrity
  – Validity
  – Agreement
• Efficient?
  – NO: each message is transmitted |g| times
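The figure for algorithm 1 is not reproduced in this extract; the sketch below follows the usual construction over B-multicast, in which a receiver re-multicasts each message the first time it B-delivers it, so that delivery by any correct process implies delivery by all (Agreement). Class and function names are ours.

```python
# Hedged sketch of reliable multicast built on B-multicast: echo each
# message on first B-delivery, then R-deliver. Deliberately inefficient:
# every message crosses the network |g| times.
class RProcess:
    def __init__(self, name, group):
        self.name, self.group = name, group
        self.received, self.rdelivered = set(), []
    def b_deliver(self, m, sender):
        if m in self.received:            # Integrity: deliver at most once
            return
        self.received.add(m)
        if sender is not self:            # re-multicast before delivering,
            r_multicast(self, m)          # so Agreement holds if we crash later
        self.rdelivered.append(m)

def r_multicast(p, m):
    for q in p.group:                     # B-multicast(g, m)
        q.b_deliver(m, p)

group = []
group += [RProcess(n, group) for n in "pqr"]
r_multicast(group[0], "m1")               # p R-multicasts m1
print([q.rdelivered for q in group])      # [['m1'], ['m1'], ['m1']]
```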

Multicast communication(cont.) Reliable multicast

Algorithm 2 with IP-multicast
[Figure: incoming messages enter a hold-back queue; they are moved to the delivery queue and delivered when the delivery guarantees are met]

Multicast communication(cont.) Reliable multicast

Algorithm 2 with IP-multicast

Data structures at process p:
  S_g^p : sequence number of p for group g
  R_g^q : sequence number of the latest message p has delivered from q

On initialization:
  S_g^p = 0

For process p to R-multicast message m to group g:
  IP-multicast(g, <m, S_g^p, <q, R_g^q>>)
  S_g^p ++

On IP-deliver(<m, S, <q, R>>) at q from p

Multicast communication(cont.) Reliable multicast

Algorithm 2 with IP-multicast
On IP-deliver(<m, S, <q, R>>) at q from p:
  if S = R_g^p + 1
  then R-deliver(m)
       R_g^p ++
       check hold-back queue
  else if S > R_g^p + 1
       then store m in hold-back queue
            request missing messages
       endif
  endif
  if R > R_g^q then request missing messages endif
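The sequence-number check and hold-back queue above can be sketched as follows; acknowledgement piggybacking and the retransmission requests are left out, and all names are ours.

```python
# Hedged sketch of the per-sender sequence-number check with a
# hold-back queue, as in algorithm 2.
class Receiver:
    def __init__(self):
        self.R = {}            # R[q]: latest seq delivered from sender q
        self.holdback = {}     # (q, S) -> message
        self.delivered = []
    def ip_deliver(self, q, S, m):
        r = self.R.get(q, -1)
        if S == r + 1:
            self.delivered.append(m)
            self.R[q] = S
            # check the hold-back queue for the next messages from q
            while (q, self.R[q] + 1) in self.holdback:
                self.delivered.append(self.holdback.pop((q, self.R[q] + 1)))
                self.R[q] += 1
        elif S > r + 1:
            self.holdback[(q, S)] = m   # too early: hold back

rx = Receiver()
rx.ip_deliver("p", 1, "m1")   # arrives out of order: held back
rx.ip_deliver("p", 0, "m0")   # now both can be delivered
print(rx.delivered)           # ['m0', 'm1']
```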

Multicast communication(cont.) Reliable multicast

Algorithm 2 with IP-multicast
• Correct?
  – Integrity: sequence numbers + checksums
  – Validity: if missing messages are detected
  – Agreement: if a copy of the message remains available

Multicast communication(cont.) Reliable multicast

Algorithm 2 with IP-multicast: example
• 3 processes in the group: P, Q, R
• State of a process:
  – S: next sequence number
  – Rq: already delivered from Q
  – stored messages
• Presentation:

  P: 2   Q: 3   R: 5   < >

Multicast communication(cont.) Reliable multicast

Algorithm 2 with IP-multicast: example
• Initial state:

  P: 0   Q: -1  R: -1   < >
  Q: 0   P: -1  R: -1   < >
  R: 0   P: -1  Q: -1   < >

Multicast communication(cont.) Reliable multicast

Algorithm 2 with IP-multicast: example
• First multicast by P:

  P: 1   Q: -1  R: -1   < mp0 >
  Q: 0   P: -1  R: -1   < >
  R: 0   P: -1  Q: -1   < >

  message from P: mp0, 0, <Q:-1, R:-1>

Multicast communication(cont.) Reliable multicast

Algorithm 2 with IP-multicast: example
• Arrival of the multicast by P at Q:

  P: 1   Q: -1  R: -1   < mp0 >
  Q: 0   P: 0   R: -1   < mp0 >
  R: 0   P: -1  Q: -1   < >

  message from P: mp0, 0, <Q:-1, R:-1>

Multicast communication(cont.) Reliable multicast

Algorithm 2 with IP-multicast: example
• New state:

  P: 1   Q: -1  R: -1   < mp0 >
  Q: 0   P: 0   R: -1   < mp0 >
  R: 0   P: -1  Q: -1   < >

Multicast communication(cont.) Reliable multicast

Algorithm 2 with IP-multicast: example
• Multicast by Q:

  P: 1   Q: -1  R: -1   < mp0 >
  Q: 1   P: 0   R: -1   < mp0, mq0 >
  R: 0   P: -1  Q: -1   < >

  message from Q: mq0, 0, <P:0, R:-1>

Multicast communication(cont.) Reliable multicast

Algorithm 2 with IP-multicast: example
• Arrival of the multicast by Q:

  P: 1   Q: 0   R: -1   < mp0, mq0 >
  Q: 1   P: 0   R: -1   < mp0, mq0 >
  R: 0   P: -1  Q: 0    < mq0 >

  message from Q: mq0, 0, <P:0, R:-1>

Multicast communication(cont.) Reliable multicast

Algorithm 2 with IP-multicast: example
• When can stored messages be deleted?

  P: 1   Q: 0   R: -1   < mp0, mq0 >
  Q: 1   P: 0   R: -1   < mp0, mq0 >
  R: 0   P: -1  Q: 0    < mq0 >

Multicast communication(cont.) Ordered multicast

• FIFO
  if a correct process P: multicast(g, m); multicast(g, m')
  then for all correct processes:
    deliver(m') ⇒ deliver(m) before deliver(m')

• Causal
  if multicast(g, m) → multicast(g, m')
  then for all correct processes:
    deliver(m') ⇒ deliver(m) before deliver(m')

• Total
  (defined on the next slide)

Multicast communication(cont.) Ordered multicast

• Total
  if at any process p: deliver(m) before deliver(m')
  then for all correct processes:
    deliver(m') ⇒ deliver(m) before deliver(m')

• FIFO-Total = FIFO + Total
• Causal-Total = Causal + Total
• Atomic = Reliable + Total

Multicast communication(cont.) Ordered multicast

[Figure: processes P1, P2, P3 over time, exchanging messages F1–F3, T1–T2 and C1–C3. Notice the consistent ordering of totally ordered messages T1 and T2, the FIFO-related messages F1 and F2 and the causally related messages C1 and C3 – and the otherwise arbitrary delivery ordering of messages.]

Multicast communication(cont.) FIFO multicast

• Alg. 1: R-multicast using IP-multicast
  – correct?
    • the sender assigns S_g^p
    • receivers deliver in this order
• Alg. 2: on top of any B-multicast

Multicast communication(cont.) FIFO multicast

Algorithm 2 on top of any B-multicast

Data structures at process p:
  S_g^p : sequence number
  R_g^q : sequence number of the latest message p has delivered from q

On initialization:
  S_g^p = 0; R_g^q = -1

For process p to FO-multicast message m to group g:
  B-multicast(g, <m, S_g^p>)
  S_g^p ++

On B-deliver(<m, S>) at q from p

Multicast communication(cont.) FIFO multicast

Algorithm 2 on top of any B-multicast
On B-deliver(<m, S>) at q from p:
  if S = R_g^p + 1
  then FO-deliver(m)
       R_g^p ++
       check hold-back queue
  else if S > R_g^p + 1
       then store m in hold-back queue
       endif
  endif
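The FIFO delivery rule can be exercised on a shuffled arrival order. This is a minimal single-sender sketch with our own names; messages stamped 0–4 arrive out of order but are FO-delivered in sending order via the hold-back queue.

```python
# Hedged sketch of FIFO delivery from one sender over an unordered
# channel: deliver when the stamp is the next expected one, otherwise
# hold back; drain the hold-back queue after each delivery.
def fo_deliver_all(arrivals):
    R, hold, out = -1, {}, []
    for S, m in arrivals:
        if S == R + 1:
            out.append(m)
            R = S
            while R + 1 in hold:        # check the hold-back queue
                R += 1
                out.append(hold.pop(R))
        elif S > R + 1:
            hold[S] = m                 # arrived too early
    return out

arrivals = [(2, "m2"), (0, "m0"), (3, "m3"), (1, "m1"), (4, "m4")]
print(fo_deliver_all(arrivals))  # ['m0', 'm1', 'm2', 'm3', 'm4']
```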

Multicast communication(cont.) TOTAL multicast

• Basic approach:
  – sender: assigns totally ordered identifiers instead of process ids
  – receiver: delivers as for FIFO ordering
• Alg. 1: use a (single) sequencer process
• Alg. 2: the participants collectively agree on the assignment of sequence numbers

Multicast communication(cont.) TOTAL multicast: sequencer process
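The sequencer figure is not reproduced in this extract; the sketch below follows the usual construction: senders B-multicast the message, the sequencer assigns consecutive global numbers in separate order messages, and members deliver in that order. All names are ours, and out-of-order arrival of the order messages themselves is not handled here.

```python
# Hedged sketch of sequencer-based total ordering.
class Sequencer:
    def __init__(self):
        self.next = 0
    def order(self, msg_id):
        n = self.next               # assign the next global number
        self.next += 1
        return n                    # conceptually: B-multicast <"order", msg_id, n>

class Member:
    def __init__(self):
        self.pending, self.next, self.delivered = {}, 0, []
    def on_message(self, msg_id, m):
        self.pending[msg_id] = m    # hold until its order number arrives
    def on_order(self, msg_id, n):
        # deliver when n is the next expected global sequence number
        if n == self.next:
            self.delivered.append(self.pending.pop(msg_id))
            self.next += 1

seq, m1 = Sequencer(), Member()
m1.on_message("a", "attack")
m1.on_message("b", "retreat")
m1.on_order("a", seq.order("a"))
m1.on_order("b", seq.order("b"))
print(m1.delivered)  # ['attack', 'retreat']
```

Because every member follows the same global numbering, all members deliver in the same order, which is the total-ordering property (but the sequencer is a bottleneck and a single point of failure, as the next slide notes).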

Multicast communication(cont.) TOTAL multicast: sequencer process

• Correct?
• Problems? A single sequencer process is
  • a bottleneck
  • a single point of failure

Multicast communication(cont.) TOTAL multicast: ISIS algorithm

• Approach:
  – sender: B-multicasts the message
  – receivers: propose sequence numbers to the sender
  – sender: uses the returned proposals to generate the agreed sequence number

Multicast communication(cont.) TOTAL multicast: ISIS algorithm

[Figure: P1 (1) B-multicasts the message to P2, P3, P4; (2) each receiver sends back a proposed sequence number; (3) P1 B-multicasts the agreed sequence number]

Multicast communication(cont.) TOTAL multicast: ISIS algorithm

Data structures at process p:
  A_g^p : largest agreed sequence number
  P_g^p : largest sequence number proposed by p

On initialization:
  P_g^p = 0

For process p to TO-multicast message m to group g:
  B-multicast(g, <m, i>)            // i = unique id for m

On B-deliver(<m, i>) at q from p:
  P_g^q = max(A_g^q, P_g^q) + 1
  send(p, <i, P_g^q>)
  store <m, i, P_g^q> in hold-back queue

Multicast communication(cont.) TOTAL multicast: ISIS algorithm

On receive(i, P) at p from q:
  wait for all replies; a is the largest reply
  B-multicast(g, <"order", i, a>)

On B-deliver(<"order", i, a>) at q from p:
  A_g^q = max(A_g^q, a)
  attach a to message i in the hold-back queue
  reorder the messages in the hold-back queue (increasing sequence numbers)
  while the message m at the front of the hold-back queue
        has been assigned an agreed sequence number
  do remove m from the hold-back queue
     TO-deliver(m)
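The proposal/agreement negotiation can be sketched for a single message. This is a hedged illustration with our own names, where the B-multicasts and replies are direct calls.

```python
# Hedged sketch of the ISIS agreed-sequence-number negotiation for one
# message among three receivers.
class IsisProcess:
    def __init__(self):
        self.A = 0            # largest agreed sequence number seen
        self.P = 0            # largest sequence number proposed
    def propose(self):
        self.P = max(self.A, self.P) + 1   # propose past everything known
        return self.P
    def on_order(self, a):
        self.A = max(self.A, a)            # record the agreed number

group = [IsisProcess() for _ in range(3)]
group[0].A = 4                       # one member has already agreed up to 4
proposals = [q.propose() for q in group]
agreed = max(proposals)              # the sender picks the largest proposal
for q in group:
    q.on_order(agreed)
print(agreed, [q.A for q in group])  # 5 [5, 5, 5]
```

Taking the maximum of all proposals is what makes the agreed numbers monotonically increasing at every member, which is the core of the correctness argument on the next slide.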

Multicast communication(cont.) TOTAL multicast: ISIS algorithm

• Correct?
  – processes agree on the sequence number for a message
  – sequence numbers are monotonically increasing
  – no process can prematurely deliver a message
• Performance
  – 3 serial messages!
• Total ordering
  – <> FIFO
  – <> causal

Multicast communication(cont.) Causal multicast

• Limitations:
  – causal order only between multicast operations
  – non-overlapping, closed groups
• Approach:
  – use vector timestamps
  – timestamp = count of multicast messages
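The delivery condition that the (omitted) algorithm figure encodes can be sketched as follows: deliver a message from sender j when its vector timestamp says it is the next message from j and everything it causally depends on has already been delivered locally. Names are ours.

```python
# Hedged sketch of the causal-delivery test with vector timestamps.
def can_deliver(V_msg, j, V_local):
    """V_msg: vector attached to the message; j: sender index;
    V_local: receiver's vector of messages delivered per process."""
    return (V_msg[j] == V_local[j] + 1 and          # next message from j
            all(V_msg[k] <= V_local[k]              # all causal
                for k in range(len(V_msg)) if k != j))   # dependencies met

V_local = [1, 0, 0]                        # one message from p0 delivered
print(can_deliver([1, 1, 0], 1, V_local))  # next from p1, deps met: True
print(can_deliver([2, 1, 0], 1, V_local))  # depends on p0's 2nd: False
```

A message failing the test sits in the hold-back queue until the missing causally preceding messages are delivered.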

Multicast communication(cont.) Causal multicast: vector timestamps

Meaning?

Multicast communication(cont.) Causal multicast: vector timestamps

• Correct?
  – message timestamps: m carries V, m' carries V'
  – given multicast(g, m) → multicast(g, m'),
    prove that V < V'

This chapter: overview

• Introduction
• Logical clocks
• Global states
• Failure detectors
• Mutual exclusion
• Elections
• Multicast communication
• Consensus and related problems

Consensus & related problems

• System model
  – N processes pi
  – communication is reliable
  – processes may fail
    • crash
    • Byzantine
  – no message signing
    • message signing limits the harm a faulty process can do
• Problems
  – consensus
  – Byzantine generals
  – interactive consistency

Consensus

• Problem statement
  – each pi starts undecided and proposes a value vi
  – message exchanges
  – finally: each process pi sets its decision variable di, enters the
    decided state and may not change di
• Requirements:
  – Termination: eventually each correct process pi sets di
  – Agreement: pi and pj correct & in decided state ⇒ di = dj
  – Integrity: if the correct processes all propose the same value d,
    then any correct process in the decided state has chosen d

Consensus

• Simple algorithm (no process failures)
  – collect the processes in a group g
  – each process pi:
    • R-multicast(g, vi)
    • collect N values
    • di = majority(v1, v2, ..., vN)
• Problems with failures:
  – crash: detected? not in asynchronous systems
  – Byzantine: a faulty process can send different values to different processes
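The failure-free decision step reduces to every process applying the same majority function to the same N collected values. A minimal sketch (the function name is ours):

```python
# Majority decision over the collected proposals; returns None when no
# strict majority exists (the processes would need a tie-breaking rule).
from collections import Counter

def majority(values):
    v, count = Counter(values).most_common(1)[0]
    return v if count > len(values) / 2 else None

proposals = ["attack", "attack", "retreat"]
print(majority(proposals))  # attack
```

Since R-multicast guarantees every correct process collects the same multiset, applying the same deterministic function yields the same decision everywhere, which gives Agreement and Integrity in the failure-free case.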

Byzantine generals

• Problem statement
  – informal:
    • agree to attack or to retreat
    • the commander issues the order
    • the lieutenants are to decide to attack or retreat
    • any of them can be 'treacherous'
  – formal:
    • one process proposes a value
    • the others are to agree
• Requirements:
  – Termination: each correct process eventually decides
  – Agreement: all correct processes select the same value
  – Integrity: commander correct ⇒ the other correct processes select
    the commander's value

Interactive consistency

• Problem statement
  – correct processes agree upon a vector of values (one value for each process)
• Requirements:
  – Termination: each correct process eventually decides
  – Agreement: all correct processes decide on the same vector
  – Integrity: if pi is correct, then all correct processes decide on vi as
    the i-th component of their vector

Related problems & solutions

• Basic solutions:
  – Consensus:
    • Ci(v1, v2, ..., vN) = decision value of pi
  – Byzantine generals:
    • j is the commander, proposing v
    • BGi(j, v) = decision value of pi
  – Interactive consistency:
    • ICi(v1, v2, ..., vN)[j] = j-th value in the decision vector of pi,
      with v1, v2, ..., vN the values proposed by the processes

Related problems & solutions

• Derived solutions:
  – IC from BG:
    • run BG N times, once with each process as commander
    • ICi(v1, v2, ..., vN)[j] = BGi(j, vj)
  – C from IC:
    • run IC and produce a vector of values
    • derive a single value with an appropriate function
  – BG from C:

Related problems & solutions

• Derived solutions:
  – BG from C:
    • the commander pj sends its value v to itself & the other processes
    • each process runs C with the values v1, v2, ..., vN it received
    • BGi(j, v) = Ci(v1, v2, ..., vN)

Consensus in a synchronous system

• Assumptions
  – use B-multicast
  – f of the N processes may fail
• Approach
  – proceed in f + 1 rounds
  – in each round the correct processes B-multicast their values
• Variables
  – Values_i^r = set of proposed values known to process pi at the
    beginning of round r
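The f + 1-round exchange can be simulated. This is a hedged sketch with our own names, in which one process may crash mid-round and reach only a prefix of the group; the deterministic decision function here is `min`.

```python
# Hedged simulation of the f+1-round synchronous consensus algorithm.
def sync_consensus(initial, f, crash_round=None, crash_reach=0):
    n = len(initial)
    values = [{v} for v in initial]             # Values_i at each process
    alive = [True] * n
    for r in range(f + 1):                      # f + 1 rounds
        outgoing = [set(v) for v in values]     # snapshot at round start
        for i in range(n):
            if not alive[i]:
                continue
            targets = range(n)
            if r == crash_round and i == 0:     # p0 crashes mid-multicast,
                targets = range(crash_reach)    # reaching only a prefix
                alive[0] = False
            for j in targets:
                values[j] |= outgoing[i]
        # (values not yet sent are re-multicast in the next round)
    return [min(v) for i, v in enumerate(values) if alive[i]]

# p0 proposes 1 but crashes in round 0 after reaching only p1; with
# f = 1 (two rounds) the survivors still end up with the same set.
print(sync_consensus([1, 2, 3], f=1, crash_round=0, crash_reach=2))  # [1, 1]
```

This mirrors the proof sketch: for the sets to differ, some process would have to crash while relaying in every one of the f + 1 rounds, which contradicts at most f failures.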

Consensus in a synchronous system

Consensus in a synchronous system

• Termination?
  – guaranteed: the system is synchronous
• Correct?
  – each process arrives at the same set of values at the end of the final round
  – proof sketch:
    • assume the sets differ ...
    • pi has value v & pj does not have value v
      ⇒ some pk managed to send v to pi but not to pj, so pk crashed in
        that round; with f + 1 rounds and at most f failures, at least
        one round is crash-free, contradiction
• Agreement & Integrity?
  – all processes apply the same function to the same set

Byzantine generals in a synchronous system

• Assumptions
  – arbitrary (Byzantine) failures
  – f of the N processes may be faulty
  – channels are reliable: no message injections
  – unsigned messages

Byzantine generals in a synchronous system

• Impossibility with 3 processes
• Impossibility with N <= 3f processes

[Figure: two scenarios with commander p1 and lieutenants p2, p3; faulty processes are shown shaded. Left: a correct commander sends v ("1:v"), p2 relays "2:1:v", but faulty p3 relays "3:1:u". Right: a faulty commander sends w to p2 and x to p3 ("1:w", "1:x"), which are relayed as "2:1:w" and "3:1:x". A correct lieutenant cannot distinguish the two scenarios.]

Byzantine generals in a synchronous system

• Solution with one faulty process
  – N = 4, f = 1
  – 2 rounds of messages:
    • the commander sends its value to each of the lieutenants
    • each lieutenant sends the value it received to its peers
  – each lieutenant receives:
    • the value of the commander
    • N − 2 values from its peers
    and applies the majority function
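The lieutenant's decision step can be sketched directly; `majority3` is our own helper for the three values a lieutenant holds (the commander's value plus two relayed values).

```python
# Hedged sketch of the N = 4, f = 1 decision step: each correct
# lieutenant takes the majority of the commander's value and the two
# values its peers relayed.
def majority3(a, b, c):
    for x in (a, b, c):
        if [a, b, c].count(x) >= 2:
            return x
    return None                      # no majority

# Faulty lieutenant p3 relays wrong values; the commander sent v to all:
p2_view = ("v", "v", "u")            # from commander, from p4, from faulty p3
p4_view = ("v", "v", "w")
print(majority3(*p2_view), majority3(*p4_view))  # v v
```

With a faulty commander that sends three different values instead, every lieutenant computes `None`: they still agree, just on "no majority".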

Byzantine generals in a synchronous system

• Solution with one faulty process

[Figure: two four-process scenarios with commander p1 and lieutenants p2, p3, p4; faulty processes are shown shaded. Left: lieutenant p3 is faulty and relays u and w; p2 still decides majority(v, v, u) = v and p4 decides majority(v, v, w) = v. Right: the commander is faulty and sends u, w, v to p2, p3, p4; every lieutenant collects {u, w, v} and finds no majority, so all correct lieutenants still agree.]

Consensus & related problems

• Impossibility in asynchronous systems
  – proved: no algorithm exists that is guaranteed to reach consensus
  – hence no guaranteed solution to
    • the Byzantine generals problem
    • interactive consistency
    • reliable & totally ordered multicast
• Approaches to work around it:
  – masking faults: restart crashed processes and use persistent storage
  – use failure detectors: make failures fail-silent by discarding messages

This chapter: overview

• Introduction
• Logical clocks
• Global states
• Failure detectors
• Mutual exclusion
• Elections
• Multicast communication
• Consensus and related problems


Distributed Systems:

Distributed algorithms