SEMINAR 236825 OPEN PROBLEMS IN DISTRIBUTED COMPUTING
Winter 2013-14, Hagit Attiya & Faith Ellen

TRANSCRIPT

Page 1: SEMINAR 236825  OPEN PROBLEMS IN  DISTRIBUTED COMPUTING

Introduction 1

SEMINAR 236825 OPEN PROBLEMS IN

DISTRIBUTED COMPUTING

Winter 2013-14

Hagit Attiya & Faith Ellen

236825

Page 2: SEMINAR 236825  OPEN PROBLEMS IN  DISTRIBUTED COMPUTING

INTRODUCTION

Page 3: SEMINAR 236825  OPEN PROBLEMS IN  DISTRIBUTED COMPUTING

Distributed Systems
• Distributed systems are everywhere:
  – share resources
  – communicate
  – increase performance (speed & fault tolerance)
• Characterized by:
  – independent activities (concurrency)
  – loosely coupled parallelism (heterogeneity)
  – inherent uncertainty
• Examples: operating systems, (distributed) database systems, software fault-tolerance, communication networks, multiprocessor architectures

Page 4: SEMINAR 236825  OPEN PROBLEMS IN  DISTRIBUTED COMPUTING

Main Admin Issues
• Goal: read some interesting papers related to open problems in the area
• Mandatory (active) participation
  – 1 absence without explanation
• Tentative list of papers already published
  – first come, first served
• Lectures in English

Page 5: SEMINAR 236825  OPEN PROBLEMS IN  DISTRIBUTED COMPUTING

Course Overview: Basic Models

[Diagram: the basic models, message passing and shared memory, each synchronous or asynchronous; PRAM shown as the synchronous shared-memory case]

Page 6: SEMINAR 236825  OPEN PROBLEMS IN  DISTRIBUTED COMPUTING

Message-Passing Model
• Processors p0, p1, …, pn-1 are the nodes of a graph. Each is a state machine with a local state.
• Bidirectional point-to-point channels are the undirected edges of the graph.
• The channel from pi to pj is modeled in two pieces:
  – outbuf variable of pi (physical channel)
  – inbuf variable of pj (incoming message queue)

[Diagram: an example graph on p0, p1, p2, p3 with numbered channels]

Page 7: SEMINAR 236825  OPEN PROBLEMS IN  DISTRIBUTED COMPUTING

Modeling Processors and Channels

[Diagram: p1 and p2 joined by a channel; p1's local variables with outbuf[1] and inbuf[1], p2's local variables with outbuf[2] and inbuf[2]]

• Processors p0, p1, …, pn-1 are the nodes of the graph. Each is a state machine with a local state.
• Bidirectional point-to-point channels are the undirected edges of the graph.
• The channel from pi to pj is modeled in two pieces:
  – outbuf variable of pi (physical channel)
  – inbuf variable of pj (incoming message queue)

Page 8: SEMINAR 236825  OPEN PROBLEMS IN  DISTRIBUTED COMPUTING

Configuration

A snapshot of the entire system: accessible processor states (local variables & incoming msg queues) as well as communication channels.

Formally, a vector of processor states (including outbufs, i.e., channels), one per processor.

Page 9: SEMINAR 236825  OPEN PROBLEMS IN  DISTRIBUTED COMPUTING

Deliver Event

Moves a message from sender's outbuf to receiver's inbuf; message will be available next time receiver takes a step

[Diagram: messages m1, m2, m3 in the channel from p1 to p2, before and after a deliver event moves m1 into p2's inbuf]

Page 10: SEMINAR 236825  OPEN PROBLEMS IN  DISTRIBUTED COMPUTING

Computation Event
Occurs at one processor:
• Start with the old accessible state (local vars + incoming messages)
• Apply the processor's state machine transition function; handle all incoming messages
• End with a new accessible state with empty inbufs & new outgoing messages

[Diagram: incoming messages c, d, e and the old local state mapped by the transition function to a new local state and outgoing messages a, b]

Page 11: SEMINAR 236825  OPEN PROBLEMS IN  DISTRIBUTED COMPUTING

Execution

• In the first configuration: each processor is in its initial state and all inbufs are empty
• For each consecutive triple (configuration, event, configuration), the new configuration is the same as the old configuration except:
  – if delivery event: the specified msg is transferred from the sender's outbuf to the receiver's inbuf
  – if computation event: the specified processor's state (including outbufs) changes according to the transition function

configuration, event, configuration, event, configuration, …

Page 12: SEMINAR 236825  OPEN PROBLEMS IN  DISTRIBUTED COMPUTING

Asynchronous Executions

• An execution is admissible in the asynchronous model if:
  – every message in an outbuf is eventually delivered
  – every processor takes an infinite number of steps
• No constraints on when these events take place: arbitrary message delays and relative processor speeds are not ruled out
• Models a reliable system (no message is lost and no processor stops working)

Page 13: SEMINAR 236825  OPEN PROBLEMS IN  DISTRIBUTED COMPUTING

Example: Simple Flooding Algorithm

• Each processor's local state consists of a variable color, either red or green
• Initially:
  – p0: color = green, all outbufs contain M
  – others: color = red, all outbufs empty
• Transition: if M is in an inbuf and color = red, then change color to green and send M on all outbufs
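The transition rule above can be exercised with a small simulation. This is a hypothetical sketch, not part of the slides: the function name `flood` and the adjacency-list representation are my own. It runs the rule in synchronous rounds and counts messages and rounds.

```python
# Hypothetical sketch (names are mine, not the slides'): simulate the
# flooding algorithm in the synchronous model and count complexity.
def flood(adj, root=0):
    """adj is an adjacency list of an undirected graph.
    Returns (messages sent, rounds until no message is in transit)."""
    n = len(adj)
    color = ["red"] * n
    color[root] = "green"
    # Initially p0's outbufs contain M: one copy per neighbor.
    pending = [(root, v) for v in adj[root]]
    messages = len(pending)
    rounds = 0
    while pending:
        rounds += 1
        nxt = []
        for (_, v) in pending:
            if color[v] == "red":  # transition: turn green, send M on all outbufs
                color[v] = "green"
                for w in adj[v]:
                    nxt.append((v, w))
                    messages += 1
        pending = nxt
    return messages, rounds
```

On a triangle (m = 3, diameter 1) this sends 6 = 2m messages over 2 rounds, matching the 2m and diameter + 1 bounds discussed on the complexity slides.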

Page 14: SEMINAR 236825  OPEN PROBLEMS IN  DISTRIBUTED COMPUTING

Example: Flooding

[Diagram: four configurations of a triangle p0, p1, p2, connected by events as M propagates: deliver event at p1 from p0; computation event by p1; deliver event at p2 from p1; computation event by p2]

Page 15: SEMINAR 236825  OPEN PROBLEMS IN  DISTRIBUTED COMPUTING

Example: Flooding (cont'd)

[Diagram: continued: deliver event at p1 from p2; computation event by p1; deliver event at p0 from p1; etc., delivering the rest of the msgs]

Page 16: SEMINAR 236825  OPEN PROBLEMS IN  DISTRIBUTED COMPUTING

(Worst-Case) Complexity Measures
• Message complexity: maximum number of messages sent in any admissible execution
• Time complexity: maximum "time" until all processes terminate in any admissible execution
• How to measure time in an asynchronous execution?
  – Produce a timed execution by assigning non-decreasing real times to events so that the time between sending and receiving any message is at most 1
  – Time complexity: maximum time until termination in any timed admissible execution

Page 17: SEMINAR 236825  OPEN PROBLEMS IN  DISTRIBUTED COMPUTING

Complexities of Flooding Algorithm

A state is terminated if color = green.
• One message is sent over each edge in each direction ⇒ message complexity is 2m, where m = number of edges
• A node turns green once a "chain" of messages reaches it from p0 ⇒ time complexity is diameter + 1 time units

Page 18: SEMINAR 236825  OPEN PROBLEMS IN  DISTRIBUTED COMPUTING

Synchronous Message-Passing Systems

An execution is admissible for the synchronous model if it is an infinite sequence of rounds
– A round is a sequence of deliver events moving all msgs in transit into inbufs, followed by a sequence of computation events, one for each processor

Captures the lockstep behavior of the model. Also implies:
– every message sent is delivered
– every processor takes an infinite number of steps

Time is the number of rounds until termination

Page 19: SEMINAR 236825  OPEN PROBLEMS IN  DISTRIBUTED COMPUTING

Example: Flooding in the Synchronous Model

[Diagram: flooding on a triangle p0, p1, p2 in two synchronous rounds: round 1 events, then round 2 events]

Time complexity is diameter + 1
Message complexity is 2m

Page 20: SEMINAR 236825  OPEN PROBLEMS IN  DISTRIBUTED COMPUTING

Broadcast Over a Rooted Spanning Tree

• Processors have information about a rooted spanning tree of the communication topology
  – parent and children local variables at each processor
• root initially sends M to its children
• when a processor receives M from its parent:
  – it sends M to its children
  – it terminates (sets a local Boolean to true)
• Complexities (synchronous and asynchronous models):
  – time is the depth of the spanning tree, which is at most n - 1
  – number of messages is n - 1, since one message is sent over each spanning-tree edge
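The broadcast above can be sketched round by round. This is a hypothetical simulation (the name `broadcast` and the `children` list encoding are mine): it counts the n - 1 messages and the depth-many rounds claimed on the slide.

```python
# Hypothetical sketch: broadcast over a known rooted spanning tree,
# counting messages (n - 1) and rounds (depth of the tree).
def broadcast(children, root=0):
    """children[i] is the list of spanning-tree children of processor i.
    Returns (messages sent, rounds until every processor has M)."""
    messages = 0
    rounds = 0
    frontier = [root]  # the root sends M to its children in round 1
    while any(children[v] for v in frontier):
        rounds += 1
        nxt = []
        for v in frontier:
            for c in children[v]:  # v sends M to each child; the child
                messages += 1      # forwards it and terminates
                nxt.append(c)
        frontier = nxt
    return messages, rounds
```

A star rooted at its center finishes in 1 round with n - 1 messages; a chain of n nodes takes n - 1 rounds with the same message count.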

Page 21: SEMINAR 236825  OPEN PROBLEMS IN  DISTRIBUTED COMPUTING

Finding a Spanning Tree from a Root

• root sends M to all its neighbors
• when a non-root first gets M:
  – set the sender as its parent
  – send a "parent" msg to the sender
  – send M to all other neighbors (if there are no other neighbors, terminate)
• when it gets M otherwise:
  – send a "reject" msg to the sender
• use the "parent" and "reject" msgs to set the children variables and terminate (after hearing from all neighbors)
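In the synchronous model this construction yields a BFS tree (as the next slide notes, asynchronous executions need not). A hypothetical sketch of the synchronous case, with the "parent"/"reject" bookkeeping elided:

```python
# Hypothetical sketch: the spanning-tree algorithm run synchronously,
# where "first M received" coincides with BFS order.
from collections import deque

def spanning_tree(adj, root=0):
    """Returns parent[] of the BFS spanning tree rooted at `root`
    (parent[root] == root)."""
    parent = [None] * len(adj)
    parent[root] = root
    q = deque([root])
    while q:
        u = q.popleft()
        for v in adj[u]:           # u sends M to its neighbors
            if parent[v] is None:  # first M received: adopt sender as parent
                parent[v] = u      # (a real run also sends "parent"/"reject"
                q.append(v)        #  msgs to fill in the children variables)
    return parent
```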

Page 22: SEMINAR 236825  OPEN PROBLEMS IN  DISTRIBUTED COMPUTING

Execution of Spanning Tree Algorithm

[Diagram: a graph with root and nodes a–h, shown twice with different tree edges]

Synchronous: always gives a breadth-first search (BFS) tree
Asynchronous: not necessarily a BFS tree
Both models: O(m) messages, O(diam) time

Page 23: SEMINAR 236825  OPEN PROBLEMS IN  DISTRIBUTED COMPUTING

Execution of Spanning Tree Algorithm

[Diagram: the same graph with two more asynchronous executions' trees]

An asynchronous execution gave a depth-first search (DFS) tree. Is the DFS property guaranteed? No!
Another asynchronous execution results in a tree that is neither BFS nor DFS.

Page 24: SEMINAR 236825  OPEN PROBLEMS IN  DISTRIBUTED COMPUTING

Shared Memory Model

Processors (also called processes) communicate via a set of shared variables.
Each shared variable has a type, defining a set of primitive operations (performed atomically):
• read, write
• compare&swap (CAS)
• LL/SC, DCAS, kCAS, …
• read-modify-write (RMW), kRMW

[Diagram: processes p0, p1, p2 applying read, write, and RMW to shared variables X and Y]

Page 25: SEMINAR 236825  OPEN PROBLEMS IN  DISTRIBUTED COMPUTING

Changes from the Message-Passing Model

• no inbuf and outbuf state components
• a configuration includes values for the shared variables
• one event type: a computation step by a process
  – pi's state in the old configuration specifies which shared variable is to be accessed and with which primitive
  – the shared variable's value in the new configuration changes according to the primitive's semantics
  – pi's state in the new configuration changes according to its old state and the result of the primitive

An execution is admissible if every processor takes an infinite number of steps

Page 26: SEMINAR 236825  OPEN PROBLEMS IN  DISTRIBUTED COMPUTING

Abstract Data Types
• Abstract representation of data & a set of methods (operations) for accessing it
• Implemented using primitives on base objects
• Sometimes, a hierarchy of implementations: primitive operations implemented from more low-level ones

[Diagram: methods wrapped around the data]

Page 27: SEMINAR 236825  OPEN PROBLEMS IN  DISTRIBUTED COMPUTING

Executing Operations

[Diagram: timelines of P1, P2, P3; P1 performs enq(1) (invocation to response "ok"), P2 performs deq returning 1, P3 performs enq(2); the operations overlap in time]

Page 28: SEMINAR 236825  OPEN PROBLEMS IN  DISTRIBUTED COMPUTING

Interleaving Operations, or Not

[Diagram: the same operations laid out without overlap: enq(1) ok, deq 1, enq(2)]

Sequential behavior: invocations & responses alternate and match (on process & object)
Sequential specification: all legal sequential behaviors

Page 29: SEMINAR 236825  OPEN PROBLEMS IN  DISTRIBUTED COMPUTING

Correctness: Sequential Consistency [Lamport, 1979]

For every concurrent execution there is a sequential execution that:
– contains the same operations
– is legal (obeys the sequential specification)
– preserves the order of operations by the same process

Page 30: SEMINAR 236825  OPEN PROBLEMS IN  DISTRIBUTED COMPUTING

Example 1: Multi-Writer Registers
Using (multi-reader) single-writer registers: add logical time to the values.

Write(v,X):
  read TS1, …, TSn
  TSi = max TSj + 1
  write (v, TSi)

Read(X):
  read (v, TSi)
  return v          (read only your own value)

Once in a while, read TS1, …, TSn and write to TSi (needed to ensure writes are eventually visible).
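A minimal single-threaded sketch of this scheme (class and method names are mine, and `refresh` stands in for the slide's "once in a while" step; real processes would access the single-writer registers concurrently):

```python
# Hypothetical sketch of the slide's timestamp scheme: process i owns a
# single-writer register R[i] = (value, timestamp, writer id).
class SeqConsistentRegister:
    def __init__(self, n):
        self.R = [(None, 0, i) for i in range(n)]  # SWMR registers

    def write(self, i, v):
        # read TS1..TSn, take max + 1, then write own register
        ts = max(t for (_, t, _) in self.R) + 1
        self.R[i] = (v, ts, i)

    def read(self, i):
        # a process returns its own register's value ("read only own value")
        return self.R[i][0]

    def refresh(self, i):
        # the "once in a while" step: adopt the value with the max
        # (timestamp, id) pair so other processes' writes become visible
        v, ts, _ = max(self.R, key=lambda r: (r[1], r[2]))
        self.R[i] = (v, ts, i)
```

This gives sequential consistency but not linearizability: until a refresh, a reader can return its own stale value even after a newer write completed elsewhere.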

Page 31: SEMINAR 236825  OPEN PROBLEMS IN  DISTRIBUTED COMPUTING

Timestamps
1. The timestamps of two write operations by the same process are ordered
2. If a write operation completes before another one starts, it has a smaller timestamp

Write(v,X):
  read TS1, …, TSn
  TSi = max TSj + 1
  write (v, TSi)

Page 32: SEMINAR 236825  OPEN PROBLEMS IN  DISTRIBUTED COMPUTING

Multi-Writer Registers: Proof

Create a sequential execution:
– place writes in timestamp order
– insert reads after the appropriate write

Page 33: SEMINAR 236825  OPEN PROBLEMS IN  DISTRIBUTED COMPUTING

Multi-Writer Registers: Proof (cont'd)

Create a sequential execution:
– place writes in timestamp order
– insert reads after the appropriate write

Legality is immediate. Per-process order is preserved since a read returns a value (with timestamp) at least as large as the preceding write by the same process.

Page 34: SEMINAR 236825  OPEN PROBLEMS IN  DISTRIBUTED COMPUTING

Correctness: Linearizability [Herlihy & Wing, 1990]

For every concurrent execution there is a sequential execution that:
– contains the same operations
– is legal (obeys the specification of the ADTs)
– preserves the real-time order of non-overlapping operations

Each operation appears to take effect instantaneously at some point between its invocation and its response (atomicity)

Page 35: SEMINAR 236825  OPEN PROBLEMS IN  DISTRIBUTED COMPUTING

Example 2: Linearizable Multi-Writer Registers
Using (multi-reader) single-writer registers [Vitanyi & Awerbuch, 1987]: add logical time to the values.

Write(v,X):
  read TS1, …, TSn
  TSi = max TSj + 1
  write (v, TSi)

Read(X):
  read TS1, …, TSn
  return the value with the max TS
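The change from Example 1 is only in Read: it now collects all registers and returns the value with the maximum (timestamp, id) pair, which is what makes the construction linearizable. A hypothetical single-threaded sketch (names are mine):

```python
# Hypothetical sketch of the Vitanyi-Awerbuch-style construction:
# same timestamped writes, but a read collects every register and
# returns the value with the maximum (timestamp, writer id) pair.
class LinearizableRegister:
    def __init__(self, n):
        self.R = [(None, 0, i) for i in range(n)]  # SWMR registers

    def write(self, i, v):
        ts = max(t for (_, t, _) in self.R) + 1    # read TS1..TSn, max + 1
        self.R[i] = (v, ts, i)

    def read(self, i):
        # break timestamp ties by writer id so all readers agree
        v, _, _ = max(self.R, key=lambda r: (r[1], r[2]))
        return v
```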

Page 36: SEMINAR 236825  OPEN PROBLEMS IN  DISTRIBUTED COMPUTING

Multi-Writer Registers: Linearization Order

Create a linearization:
– place writes in timestamp order
– insert each read after the appropriate write

Page 37: SEMINAR 236825  OPEN PROBLEMS IN  DISTRIBUTED COMPUTING

Multi-Writer Registers: Proof

Create a linearization:
– place writes in timestamp order
– insert each read after the appropriate write

Legality is immediate. Real-time order is preserved since a read returns a value (with timestamp) larger than all preceding operations.

Page 38: SEMINAR 236825  OPEN PROBLEMS IN  DISTRIBUTED COMPUTING

Example 3: Atomic Snapshot
• n components
• Update a single component
• Scan all the components "at once" (atomically)

Provides an instantaneous view of the whole memory

[Diagram: update → ok; scan → v1, …, vn]

Page 39: SEMINAR 236825  OPEN PROBLEMS IN  DISTRIBUTED COMPUTING

Atomic Snapshot Algorithm [Afek, Attiya, Dolev, Gafni, Merritt, Shavit, JACM 1993]

Update(v,k):
  A[k] = (v, seqi, i)

Scan():
  repeat
    read A[1], …, A[n]
    read A[1], …, A[n]      (double collect)
  until the two collects are equal
  return A[1, …, n]

Linearize:
• updates with their writes
• scans inside the double collects
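The double-collect scan above can be sketched directly. This is a hypothetical single-threaded rendering (class name and layout are mine); in a single thread the two collects always agree, so the loop returns immediately, while under concurrency it retries until it sees a clean double collect.

```python
# Hypothetical sketch of the double-collect snapshot: an update tags its
# value with a per-process sequence number, and a scan repeats until two
# consecutive collects are identical.
class Snapshot:
    def __init__(self, n):
        self.A = [(None, 0, i) for i in range(n)]  # (value, seq, writer id)
        self.seq = [0] * n

    def update(self, i, v):
        self.seq[i] += 1                 # new seq# so every write is "new"
        self.A[i] = (v, self.seq[i], i)

    def scan(self):
        while True:
            first = list(self.A)         # first collect
            second = list(self.A)        # second collect
            if first == second:          # clean double collect: no write in
                return [v for (v, _, _) in first]  # between, safe to return
```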

Page 40: SEMINAR 236825  OPEN PROBLEMS IN  DISTRIBUTED COMPUTING

Atomic Snapshot: Linearizability

Double collect (read the set of values twice). If the collects are equal, there is no write between them
– assuming each write has a new value (seq#)

This creates a "safe zone", where the scan can be linearized.

[Diagram: two collects of A[1], …, A[n] with no write A[j] in between]

Page 41: SEMINAR 236825  OPEN PROBLEMS IN  DISTRIBUTED COMPUTING

Liveness Conditions

• Wait-free: every operation completes within a finite number of (its own) steps ⇒ no starvation for mutex
• Nonblocking: some operation completes within a finite number of (some other process's) steps ⇒ deadlock-freedom for mutex
• Obstruction-free: an operation (eventually) running solo completes within a finite number of (its own) steps
  – also called solo termination

wait-free ⇒ nonblocking ⇒ obstruction-free
bounded wait-free ⇒ bounded nonblocking ⇒ bounded obstruction-free

Page 42: SEMINAR 236825  OPEN PROBLEMS IN  DISTRIBUTED COMPUTING

Wait-Free Atomic Snapshot [Afek, Attiya, Dolev, Gafni, Merritt, Shavit, JACM 1993]

Embed a scan within the Update:

Update(v,k):
  V = scan()
  A[k] = (v, seqi, i, V)

Scan():
  repeat
    read A[1], …, A[n]
    read A[1], …, A[n]
    if equal: return A[1, …, n]                  (direct scan)
    else: record the differences;
          if some pj changed twice: return Vj    (borrowed scan)

Linearize:
• updates with their writes
• direct scans as before
• borrowed scans in place

Page 43: SEMINAR 236825  OPEN PROBLEMS IN  DISTRIBUTED COMPUTING

Atomic Snapshot: Borrowed Scans

Interference by process pj, and then another interference: pj's embedded scan runs in between, inside the scanner's interval.

Linearizing with the borrowed scan is OK.

[Diagram: a scanner's repeated collects of A[j] interleaved with two writes to A[j]; pj's embedded scan lies between the two writes]

Page 44: SEMINAR 236825  OPEN PROBLEMS IN  DISTRIBUTED COMPUTING


List of Topics (Indicative)

• Atomic snapshots

• Space complexity of consensus

• Dynamic storage

• Vector agreement

• Renaming

• Maximal independent set

• Routing

and possibly others…
