SEMINAR 236825 OPEN PROBLEMS IN
DISTRIBUTED COMPUTING
Winter 2013-14
Hagit Attiya & Faith Ellen
INTRODUCTION
Distributed Systems

• Distributed systems are everywhere:
  – share resources
  – communicate
  – increase performance (speed & fault tolerance)
• Characterized by
  – independent activities (concurrency)
  – loosely coupled parallelism (heterogeneity)
  – inherent uncertainty
• Examples:
  – operating systems
  – (distributed) database systems
  – software fault tolerance
  – communication networks
  – multiprocessor architectures
Main Admin Issues

• Goal: read some interesting papers related to open problems in the area
• Mandatory (active) participation
  – at most 1 absence w/o explanation
• Tentative list of papers already published
  – first come, first served
• Lectures in English
Course Overview: Basic Models
[Diagram: the basic models — message passing vs. shared memory, each synchronous or asynchronous; PRAM is the synchronous shared-memory model]
Message-Passing Model

• Processors p0, p1, …, pn-1 are nodes of the graph. Each is a state machine with a local state.
• Bidirectional point-to-point channels are the undirected edges of the graph.
• The channel from pi to pj is modeled in two pieces:
  – outbuf variable of pi (physical channel)
  – inbuf variable of pj (incoming message queue)
[Figure: four processors p0, p1, p2, p3 as nodes of a graph, with locally numbered links]
Modeling Processors and Channels
[Figure: p1's and p2's local variables, together with the inbuf/outbuf variables modeling the channel between them]
Configuration
A snapshot of the entire system: accessible processor states (local variables & incoming msg queues) as well as communication channels.
Formally, a vector of processor states (including outbufs, i.e., channels), one per processor.
Deliver Event
Moves a message from sender's outbuf to receiver's inbuf; message will be available next time receiver takes a step
[Figure: messages m1, m2, m3 in transit from p1 to p2, before and after a deliver event]
Computation Event

Occurs at one processor:
• Start with the old accessible state (local vars + incoming messages)
• Apply the processor's state machine transition function; handle all incoming messages
• End with the new accessible state, with empty inbufs & new outgoing messages

[Figure: a computation event transforms the old local state and incoming messages into a new local state and outgoing messages]
Execution

configuration, event, configuration, event, configuration, …

• In the first configuration: each processor is in its initial state and all inbufs are empty
• For each consecutive triple (configuration, event, configuration), the new configuration is the same as the old configuration except:
  – if a delivery event: the specified msg is transferred from the sender's outbuf to the receiver's inbuf
  – if a computation event: the specified processor's state (including outbufs) changes according to the transition function
Asynchronous Executions

• An execution is admissible in the asynchronous model if
  – every message in an outbuf is eventually delivered
  – every processor takes an infinite number of steps
• No constraints on when these events take place: arbitrary message delays and relative processor speeds are not ruled out
• Models a reliable system (no message is lost and no processor stops working)
Example: Simple Flooding Algorithm

• Each processor's local state consists of a variable color, either red or green
• Initially:
  – p0: color = green, all outbufs contain M
  – others: color = red, all outbufs empty
• Transition: if M is in an inbuf and color = red, then change color to green and send M on all outbufs
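In the synchronous model, this rule can be simulated round by round. A minimal sketch in Python (the `flood` helper and the `adj` adjacency map are illustrative names, not from the slides):

```python
def flood(adj, root=0):
    """Simulate the flooding algorithm in the synchronous model.

    adj: dict mapping each node to a list of its neighbours.
    Returns the round in which each node's color turned green.
    """
    color = {v: "red" for v in adj}
    color[root] = "green"          # p0 starts green with M in all outbufs
    frontier = [root]              # nodes whose outbufs contain M
    round_green = {root: 0}
    r = 0
    while frontier:
        r += 1
        nxt = []
        for u in frontier:              # deliver M on every outgoing channel
            for v in adj[u]:
                if color[v] == "red":   # red node receiving M turns green...
                    color[v] = "green"
                    round_green[v] = r
                    nxt.append(v)       # ...and relays M in the next round
        frontier = nxt
    return round_green
```

On a triangle, p1 and p2 both turn green in round 1; on a path p0–p1–p2, p2 turns green in round 2, illustrating the "chain of messages from p0" that determines the time complexity.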
Example: Flooding

[Figure: snapshots of p0, p1, p2 — deliver event at p1 from p0, computation event by p1, deliver event at p2 from p1, computation event by p2]
Example: Flooding (cont'd)

[Figure: deliver event at p1 from p2, computation event by p1, deliver event at p0 from p1, etc., to deliver the rest of the msgs]
(Worst-Case) Complexity Measures

• Message complexity: maximum number of messages sent in any admissible execution
• Time complexity: maximum "time" until all processes terminate in any admissible execution
• How to measure time in an asynchronous execution?
  – Produce a timed execution by assigning non-decreasing real times to events so that the time between sending and receiving any message is at most 1
  – Time complexity: maximum time until termination in any timed admissible execution
Complexities of the Flooding Algorithm

A state is terminated if color = green.
• One message is sent over each edge in each direction ⇒ message complexity is 2m, where m = number of edges
• A node turns green once a "chain" of messages reaches it from p0 ⇒ time complexity is diameter + 1 time units
Synchronous Message-Passing Systems

An execution is admissible for the synchronous model if it is an infinite sequence of rounds.
– A round is a sequence of deliver events moving all msgs in transit into inbufs, followed by a sequence of computation events, one for each processor.
Captures the lockstep behavior of the model. Also implies:
– every message sent is delivered
– every processor takes an infinite number of steps
Time is the number of rounds until termination.
Example: Flooding in the Synchronous Model

[Figure: round 1 and round 2 events among p0, p1, p2]

Time complexity is diameter + 1. Message complexity is 2m.
Broadcast Over a Rooted Spanning Tree

• Processors have information about a rooted spanning tree of the communication topology
  – parent and children local variables at each processor
• root initially sends M to its children
• when a processor receives M from its parent
  – sends M to its children
  – terminates (sets a local Boolean to true)
• Complexities (synchronous and asynchronous models)
  – time is the depth of the spanning tree, which is at most n - 1
  – number of messages is n - 1, since one message is sent over each spanning tree edge
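The broadcast above can be sketched as a small simulation; the `children` map is an assumed encoding of each processor's children variable, and the returned counts match the stated complexities:

```python
def broadcast(children, root):
    """Broadcast M over a rooted spanning tree given by the children map.

    Returns (number of messages sent, depth of the tree).
    """
    msgs = 0
    depth = {root: 0}
    stack = [root]                 # processors that have received M
    while stack:
        u = stack.pop()
        for c in children.get(u, []):
            msgs += 1              # u sends M to child c over a tree edge
            depth[c] = depth[u] + 1
            stack.append(c)        # c in turn forwards M to its children
    return msgs, max(depth.values())
```

For a tree on n nodes this sends exactly n - 1 messages (one per tree edge), and the number of rounds in the synchronous model equals the returned depth.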
Finding a Spanning Tree from a Root

• root sends M to all its neighbors
• when a non-root first gets M
  – set the sender as its parent
  – send a "parent" msg to the sender
  – send M to all other neighbors (if no other neighbors, then terminate)
• when M is received otherwise
  – send a "reject" msg to the sender
• use "parent" and "reject" msgs to set the children variables and terminate (after hearing from all neighbors)
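A sketch of one execution of this algorithm, under the assumption of round-by-round (BFS-order) delivery as in the synchronous model; an asynchronous schedule could produce a different tree:

```python
from collections import deque

def spanning_tree(adj, root):
    """One synchronous execution of the spanning-tree algorithm.

    adj: dict mapping each node to its neighbours.
    Each processor's parent is the first neighbour it receives M from;
    with BFS-order delivery this yields a BFS tree.
    """
    parent = {root: None}
    children = {v: [] for v in adj}
    q = deque([root])              # processors about to send M to neighbours
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in parent:    # first M wins: v sends "parent" to u
                parent[v] = u
                children[u].append(v)
                q.append(v)
            # later copies of M would be answered with "reject" (not modeled)
    return parent, children
```

On a triangle rooted at p0, both other nodes pick p0 as their parent, a BFS tree; replaying the deliveries in a different order models the asynchronous executions on the next slides.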
Execution of the Spanning Tree Algorithm

[Figure: the same graph, rooted at "root", explored in both models]

Synchronous: always gives a breadth-first search (BFS) tree.
Asynchronous: not necessarily a BFS tree.
Both models: O(m) messages, O(diam) time.
Execution of the Spanning Tree Algorithm (cont'd)

[Figure: two asynchronous executions on the same rooted graph]

An asynchronous execution gave a depth-first search (DFS) tree. Is the DFS property guaranteed?
No! Another asynchronous execution results in a tree that is neither BFS nor DFS.
Shared Memory Model

Processors (also called processes) communicate via a set of shared variables.
Each shared variable has a type, defining a set of primitive operations (performed atomically):
• read, write
• compare&swap (CAS)
• LL/SC, DCAS, kCAS, …
• read-modify-write (RMW), kRMW

[Figure: processes p0, p1, p2 applying read, write, and RMW primitives to shared variables X and Y]
Changes from the Message-Passing Model

• no inbuf and outbuf state components
• a configuration includes values for the shared variables
• one event type: a computation step by a process
  – pi's state in the old configuration specifies which shared variable is to be accessed and with which primitive
  – the shared variable's value in the new configuration changes according to the primitive's semantics
  – pi's state in the new configuration changes according to its old state and the result of the primitive

An execution is admissible if every processor takes an infinite number of steps.
Abstract Data Types

• Abstract representation of data & a set of methods (operations) for accessing it
• Implemented using primitives on base objects
• Sometimes, a hierarchy of implementations: primitive operations implemented from more low-level ones
Executing Operations

[Figure: timelines of processes P1, P2, P3 with overlapping operation intervals from invocation to response: enq(1), enq(2), and deq returning 1]
Interleaving Operations, or Not

[Figure: a sequential interleaving of enq(1), enq(2), and deq]

Sequential behavior: invocations & responses alternate and match (on process & object).
Sequential specification: all legal sequential behaviors.
Correctness: Sequential Consistency [Lamport, 1979]

• For every concurrent execution there is a sequential execution that
  – contains the same operations
  – is legal (obeys the sequential specification)
  – preserves the order of operations by the same process
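The definition can be checked by brute force on small histories of a single read/write register. A sketch; the flat (process, op, value) encoding and the `seq_consistent` helper are illustrative simplifications, not from the slides:

```python
from itertools import permutations

def seq_consistent(history):
    """Brute-force test of sequential consistency for one register.

    history: list of (process, op, value) events, where each process's
    events appear in its program order; op is 'write' or 'read'.
    Returns True iff some interleaving preserves every process's order
    and each read returns the most recently written value (None if none).
    """
    n = len(history)
    for perm in permutations(range(n)):
        last = {}                       # last history index seen per process
        val, ok = None, True
        for idx in perm:
            p, op, v = history[idx]
            if last.get(p, -1) > idx:   # violates that process's order
                ok = False
                break
            last[p] = idx
            if op == "write":
                val = v
            elif v != val:              # read not legal for a register
                ok = False
                break
        if ok:
            return True
    return False
```

For example, the history where p0 does write(1); read()=2 and p1 does write(2); read()=1 admits no such interleaving: each read forces the other process's write to come last, a contradiction.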
Example 1: Multi-Writer Registers

Using (multi-reader) single-writer registers: add logical time to values.

Write(v,X):
  read TS1, …, TSn
  TSi = max TSj + 1
  write (v, TSi)

Read(X):
  read (v, TSi); return v   (read only own value)
  Once in a while, read TS1, …, TSn and write to TSi
  (needed to ensure writes are eventually visible)
Timestamps

1. The timestamps of two write operations by the same process are ordered
2. If a write operation completes before another one starts, it has a smaller timestamp

Write(v,X):
  read TS1, …, TSn
  TSi = max TSj + 1
  write (v, TSi)
Multi-Writer Registers: Proof

Create a sequential execution:
– Place writes in timestamp order
– Insert reads after the appropriate write

Legality is immediate.
Per-process order is preserved, since a read returns a value whose timestamp is at least that of the preceding write by the same process.
Correctness: Linearizability [Herlihy & Wing, 1990]

• For every concurrent execution there is a sequential execution that
  – contains the same operations
  – is legal (obeys the specifications of the ADTs)
  – preserves the real-time order of non-overlapping operations
• Each operation appears to take effect instantaneously at some point between its invocation and its response (atomicity)
Example 2: Linearizable Multi-Writer Registers

Using (multi-reader) single-writer registers [Vitanyi & Awerbuch, 1987]: add logical time to values.

Write(v,X):
  read TS1, …, TSn
  TSi = max TSj + 1
  write (v, TSi)

Read(X):
  read TS1, …, TSn
  return the value with max TS
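A sequential sketch of this construction (the class name `MWRegister` is illustrative; under real concurrency each cell must be an atomic single-writer register, and ties on timestamps are broken by process id):

```python
class MWRegister:
    """Multi-writer register from n single-writer cells, run sequentially.

    Cell i holds (timestamp, writer id, value); only process i writes it.
    """
    def __init__(self, n):
        self.cells = [(0, i, None) for i in range(n)]

    def write(self, i, v):
        max_ts = max(ts for ts, _, _ in self.cells)  # read TS1, ..., TSn
        self.cells[i] = (max_ts + 1, i, v)           # TSi = max TSj + 1

    def read(self, i):
        ts, j, v = max(self.cells)   # value with the max (TS, id) pair
        return v
```

Because every write reads all cells first, its timestamp exceeds those of all completed writes, which is exactly the timestamp property used in the linearization proof below.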
Multi-Writer Registers: Linearization Order

Create the linearization:
– Place writes in timestamp order
– Insert each read after the appropriate write

Legality is immediate.
Real-time order is preserved, since a read returns a value whose timestamp is at least as large as that of any preceding operation.
Example 3: Atomic Snapshot

• n components
• Update a single component
• Scan all the components "at once" (atomically)

Provides an instantaneous view of the whole memory.

[Figure: update returns ok; scan returns v1, …, vn]
Atomic Snapshot Algorithm [Afek, Attiya, Dolev, Gafni, Merritt, Shavit, JACM 1993]

Update(v,k):
  A[k] = (v, seqi, i)

Scan():
  repeat
    read A[1], …, A[n]
    read A[1], …, A[n]
  until the two collects are equal ("double collect")
  return A[1, …, n]

Linearize:
• Updates with their writes
• Scans inside the double collects
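A sequential sketch of the double-collect scan (the class name `Snapshot` is illustrative; run concurrently, the loop may retry, and this simple variant is not wait-free, which motivates the embedded-scan version below):

```python
class Snapshot:
    """Atomic snapshot via double collects, exercised here sequentially."""
    def __init__(self, n):
        self.A = [(None, 0, i) for i in range(n)]  # (value, seq, writer id)
        self.seq = [0] * n

    def update(self, i, v):
        self.seq[i] += 1                 # fresh sequence number per write
        self.A[i] = (v, self.seq[i], i)

    def scan(self):
        while True:
            c1 = list(self.A)            # first collect
            c2 = list(self.A)            # second collect
            if c1 == c2:                 # no update in between: a safe zone
                return [v for v, _, _ in c1]
```

The sequence numbers are what make the equality test sound: without them, an updater writing the same value twice could slip between the collects undetected (the "ABA" issue).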
Atomic Snapshot: Linearizability

Double collect: read the set of values twice. If the two collects are equal, there is no write between them
– assuming each write has a new value (seq#)

This creates a "safe zone", where the scan can be linearized.

[Figure: two collects of A[1], …, A[n], with a possible write to A[j] between them]
Liveness Conditions

• Wait-free: every operation completes within a finite number of (its own) steps ⇒ no starvation for mutex
• Nonblocking: some operation completes within a finite number of (some other process's) steps ⇒ deadlock-freedom for mutex
• Obstruction-free: an operation (eventually) running solo completes within a finite number of (its own) steps
  – also called solo termination

wait-free ⇒ nonblocking ⇒ obstruction-free
bounded wait-free ⇒ bounded nonblocking ⇒ bounded obstruction-free
Wait-Free Atomic Snapshot [Afek, Attiya, Dolev, Gafni, Merritt, Shavit, JACM 1993]

• Embed a scan within the Update.

Update(v,k):
  V = scan()
  A[k] = (v, seqi, i, V)

Scan():
  repeat
    read A[1], …, A[n]
    read A[1], …, A[n]
    if equal, return A[1, …, n]   (a direct scan)
    else record the difference;
    if some pj changed twice, return Vj   (a borrowed scan)

Linearize:
• Updates with their writes
• Direct scans as before
• Borrowed scans in place
Atomic Snapshot: Borrowed Scans

Interference by process pj, and then another one: pj performs an embedded scan in between.

[Figure: the scanner's repeated reads of A[j] observe two writes by pj; pj's second Update contains an embedded scan that took place between the scanner's reads]

Linearizing with the borrowed scan is OK.
List of Topics (Indicative)
• Atomic snapshots
• Space complexity of consensus
• Dynamic storage
• Vector agreement
• Renaming
• Maximal independent set
• Routing
and possibly others…