SEMINAR 236825 OPEN PROBLEMS IN
DISTRIBUTED COMPUTING
Winter 2013-14
Hagit Attiya & Faith Ellen
INTRODUCTION
Distributed Systems

• Distributed systems are everywhere:
  – share resources
  – communicate
  – increase performance (speed & fault tolerance)
• Characterized by
  – independent activities (concurrency)
  – loosely coupled parallelism (heterogeneity)
  – inherent uncertainty
• Examples:
  – operating systems
  – (distributed) database systems
  – software fault tolerance
  – communication networks
  – multiprocessor architectures
Main Admin Issues

• Goal: read some interesting papers related to open problems in the area
• Mandatory (active) participation
  – at most 1 absence w/o explanation
• Tentative list of papers already published
  – first come, first served
• Lectures in English
Course Overview: Basic Models
[Diagram: the basic models — message passing vs. shared memory, each synchronous or asynchronous; PRAM is the synchronous shared-memory model]
Message-Passing Model

• Processors p0, p1, …, pn-1 are nodes of the graph. Each is a state machine with a local state.
• Bidirectional point-to-point channels are the undirected edges of the graph.
• The channel from pi to pj is modeled in two pieces:
  – outbuf variable of pi (physical channel)
  – inbuf variable of pj (incoming message queue)
[Figure: four processors p0, p1, p2, p3 as nodes of a graph, with locally numbered links]
Modeling Processors and Channels
[Figure: p1's and p2's local variables, together with the inbuf/outbuf variables modeling the channel between them]
Configuration
A snapshot of the entire system: accessible processor states (local variables & incoming msg queues) as well as communication channels.
Formally, a vector of processor states (including outbufs, i.e., channels), one per processor.
Deliver Event
Moves a message from sender's outbuf to receiver's inbuf; message will be available next time receiver takes a step
[Figure: messages m1, m2, m3 in transit from p1 to p2, before and after a deliver event]
Computation Event

Occurs at one processor:
• Start with the old accessible state (local vars + incoming messages)
• Apply the processor's state machine transition function; handle all incoming messages
• End with the new accessible state, with empty inbufs & new outgoing messages

[Figure: a computation event transforms the old local state and incoming messages into a new local state and outgoing messages]
Execution

configuration, event, configuration, event, configuration, …

• In the first configuration: each processor is in its initial state and all inbufs are empty
• For each consecutive triple (configuration, event, configuration), the new configuration is the same as the old configuration except:
  – if a delivery event: the specified msg is transferred from the sender's outbuf to the receiver's inbuf
  – if a computation event: the specified processor's state (including outbufs) changes according to the transition function
Asynchronous Executions

• An execution is admissible in the asynchronous model if
  – every message in an outbuf is eventually delivered
  – every processor takes an infinite number of steps
• No constraints on when these events take place: arbitrary message delays and relative processor speeds are not ruled out
• Models a reliable system (no message is lost and no processor stops working)
Example: Simple Flooding Algorithm

• Each processor's local state consists of a variable color, either red or green
• Initially:
  – p0: color = green, all outbufs contain M
  – others: color = red, all outbufs empty
• Transition: if M is in an inbuf and color = red, then change color to green and send M on all outbufs
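In the synchronous model, this rule can be simulated round by round. A minimal sketch in Python (the `flood` helper and the `adj` adjacency map are illustrative names, not from the slides):

```python
def flood(adj, root=0):
    """Simulate the flooding algorithm in the synchronous model.

    adj: dict mapping each node to a list of its neighbours.
    Returns the round in which each node's color turned green.
    """
    color = {v: "red" for v in adj}
    color[root] = "green"          # p0 starts green with M in all outbufs
    frontier = [root]              # nodes whose outbufs contain M
    round_green = {root: 0}
    r = 0
    while frontier:
        r += 1
        nxt = []
        for u in frontier:              # deliver M on every outgoing channel
            for v in adj[u]:
                if color[v] == "red":   # red node receiving M turns green...
                    color[v] = "green"
                    round_green[v] = r
                    nxt.append(v)       # ...and relays M in the next round
        frontier = nxt
    return round_green
```

On a triangle, p1 and p2 both turn green in round 1; on a path p0–p1–p2, p2 turns green in round 2, illustrating the "chain of messages from p0" that determines the time complexity.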
Example: Flooding

[Figure: snapshots of p0, p1, p2 — deliver event at p1 from p0, computation event by p1, deliver event at p2 from p1, computation event by p2]
Example: Flooding (cont'd)

[Figure: deliver event at p1 from p2, computation event by p1, deliver event at p0 from p1, etc., to deliver the rest of the msgs]
(Worst-Case) Complexity Measures

• Message complexity: maximum number of messages sent in any admissible execution
• Time complexity: maximum "time" until all processes terminate in any admissible execution
• How to measure time in an asynchronous execution?
  – Produce a timed execution by assigning non-decreasing real times to events so that the time between sending and receiving any message is at most 1
  – Time complexity: maximum time until termination in any timed admissible execution
Complexities of the Flooding Algorithm

A state is terminated if color = green.
• One message is sent over each edge in each direction ⇒ message complexity is 2m, where m = number of edges
• A node turns green once a "chain" of messages reaches it from p0 ⇒ time complexity is diameter + 1 time units
Synchronous Message-Passing Systems

An execution is admissible for the synchronous model if it is an infinite sequence of rounds.
– A round is a sequence of deliver events moving all msgs in transit into inbufs, followed by a sequence of computation events, one for each processor.
Captures the lockstep behavior of the model. Also implies:
– every message sent is delivered
– every processor takes an infinite number of steps
Time is the number of rounds until termination.
Example: Flooding in the Synchronous Model

[Figure: round 1 and round 2 events among p0, p1, p2]

Time complexity is diameter + 1. Message complexity is 2m.
Broadcast Over a Rooted Spanning Tree

• Processors have information about a rooted spanning tree of the communication topology
  – parent and children local variables at each processor
• root initially sends M to its children
• when a processor receives M from its parent
  – sends M to its children
  – terminates (sets a local Boolean to true)
• Complexities (synchronous and asynchronous models)
  – time is the depth of the spanning tree, which is at most n - 1
  – number of messages is n - 1, since one message is sent over each spanning tree edge
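The broadcast above can be sketched as a small simulation; the `children` map is an assumed encoding of each processor's children variable, and the returned counts match the stated complexities:

```python
def broadcast(children, root):
    """Broadcast M over a rooted spanning tree given by the children map.

    Returns (number of messages sent, depth of the tree).
    """
    msgs = 0
    depth = {root: 0}
    stack = [root]                 # processors that have received M
    while stack:
        u = stack.pop()
        for c in children.get(u, []):
            msgs += 1              # u sends M to child c over a tree edge
            depth[c] = depth[u] + 1
            stack.append(c)        # c in turn forwards M to its children
    return msgs, max(depth.values())
```

For a tree on n nodes this sends exactly n - 1 messages (one per tree edge), and the number of rounds in the synchronous model equals the returned depth.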
Finding a Spanning Tree from a Root

• root sends M to all its neighbors
• when a non-root first gets M
  – set the sender as its parent
  – send a "parent" msg to the sender
  – send M to all other neighbors (if no other neighbors, then terminate)
• when M is received otherwise
  – send a "reject" msg to the sender
• use "parent" and "reject" msgs to set the children variables and terminate (after hearing from all neighbors)
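A sketch of one execution of this algorithm, under the assumption of round-by-round (BFS-order) delivery as in the synchronous model; an asynchronous schedule could produce a different tree:

```python
from collections import deque

def spanning_tree(adj, root):
    """One synchronous execution of the spanning-tree algorithm.

    adj: dict mapping each node to its neighbours.
    Each processor's parent is the first neighbour it receives M from;
    with BFS-order delivery this yields a BFS tree.
    """
    parent = {root: None}
    children = {v: [] for v in adj}
    q = deque([root])              # processors about to send M to neighbours
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in parent:    # first M wins: v sends "parent" to u
                parent[v] = u
                children[u].append(v)
                q.append(v)
            # later copies of M would be answered with "reject" (not modeled)
    return parent, children
```

On a triangle rooted at p0, both other nodes pick p0 as their parent, a BFS tree; replaying the deliveries in a different order models the asynchronous executions on the next slides.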
Execution of the Spanning Tree Algorithm

[Figure: the same graph, rooted at "root", explored in both models]

Synchronous: always gives a breadth-first search (BFS) tree.
Asynchronous: not necessarily a BFS tree.
Both models: O(m) messages, O(diam) time.
Execution of the Spanning Tree Algorithm (cont'd)

[Figure: two asynchronous executions on the same rooted graph]

An asynchronous execution gave a depth-first search (DFS) tree. Is the DFS property guaranteed?
No! Another asynchronous execution results in a tree that is neither BFS nor DFS.
Shared Memory Model

Processors (also called processes) communicate via a set of shared variables.
Each shared variable has a type, defining a set of primitive operations (performed atomically):
• read, write
• compare&swap (CAS)
• LL/SC, DCAS, kCAS, …
• read-modify-write (RMW), kRMW

[Figure: processes p0, p1, p2 applying read, write, and RMW primitives to shared variables X and Y]
Changes from the Message-Passing Model

• no inbuf and outbuf state components
• a configuration includes values for the shared variables
• one event type: a computation step by a process
  – pi's state in the old configuration specifies which shared variable is to be accessed and with which primitive
  – the shared variable's value in the new configuration changes according to the primitive's semantics
  – pi's state in the new configuration changes according to its old state and the result of the primitive

An execution is admissible if every processor takes an infinite number of steps.
Abstract Data Types

• Abstract representation of data & a set of methods (operations) for accessing it
• Implemented using primitives on base objects
• Sometimes, a hierarchy of implementations: primitive operations implemented from more low-level ones
Executing Operations

[Figure: timelines of processes P1, P2, P3 with overlapping operation intervals from invocation to response: enq(1), enq(2), and deq returning 1]
Interleaving Operations, or Not

[Figure: a sequential interleaving of enq(1), enq(2), and deq]

Sequential behavior: invocations & responses alternate and match (on process & object).
Sequential specification: all legal sequential behaviors.
Correctness: Sequential Consistency [Lamport, 1979]

• For every concurrent execution there is a sequential execution that
  – contains the same operations
  – is legal (obeys the sequential specification)
  – preserves the order of operations by the same process
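The definition can be checked by brute force on small histories of a single read/write register. A sketch; the flat (process, op, value) encoding and the `seq_consistent` helper are illustrative simplifications, not from the slides:

```python
from itertools import permutations

def seq_consistent(history):
    """Brute-force test of sequential consistency for one register.

    history: list of (process, op, value) events, where each process's
    events appear in its program order; op is 'write' or 'read'.
    Returns True iff some interleaving preserves every process's order
    and each read returns the most recently written value (None if none).
    """
    n = len(history)
    for perm in permutations(range(n)):
        last = {}                       # last history index seen per process
        val, ok = None, True
        for idx in perm:
            p, op, v = history[idx]
            if last.get(p, -1) > idx:   # violates that process's order
                ok = False
                break
            last[p] = idx
            if op == "write":
                val = v
            elif v != val:              # read not legal for a register
                ok = False
                break
        if ok:
            return True
    return False
```

For example, the history where p0 does write(1); read()=2 and p1 does write(2); read()=1 admits no such interleaving: each read forces the other process's write to come last, a contradiction.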
Example 1: Multi-Writer Registers

Using (multi-reader) single-writer registers: add logical time to values.

Write(v,X):
  read TS1, …, TSn
  TSi = max TSj + 1
  write (v, TSi)

Read(X):
  read (v, TSi); return v   (read only own value)
  Once in a while, read TS1, …, TSn and write to TSi
  (needed to ensure writes are eventually visible)
Timestamps

1. The timestamps of two write operations by the same process are ordered
2. If a write operation completes before another one starts, it has a smaller timestamp

Write(v,X):
  read TS1, …, TSn
  TSi = max TSj + 1
  write (v, TSi)
Multi-Writer Registers: Proof

Create a sequential execution:
– Place writes in timestamp order
– Insert reads after the appropriate write

Legality is immediate.
Per-process order is preserved, since a read returns a value whose timestamp is at least that of the preceding write by the same process.
Correctness: Linearizability [Herlihy & Wing, 1990]

• For every concurrent execution there is a sequential execution that
  – contains the same operations
  – is legal (obeys the specifications of the ADTs)
  – preserves the real-time order of non-overlapping operations
• Each operation appears to take effect instantaneously at some point between its invocation and its response (atomicity)
Example 2: Linearizable Multi-Writer Registers

Using (multi-reader) single-writer registers [Vitanyi & Awerbuch, 1987]: add logical time to values.

Write(v,X):
  read TS1, …, TSn
  TSi = max TSj + 1
  write (v, TSi)

Read(X):
  read TS1, …, TSn
  return the value with max TS
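A sequential sketch of this construction (the class name `MWRegister` is illustrative; under real concurrency each cell must be an atomic single-writer register, and ties on timestamps are broken by process id):

```python
class MWRegister:
    """Multi-writer register from n single-writer cells, run sequentially.

    Cell i holds (timestamp, writer id, value); only process i writes it.
    """
    def __init__(self, n):
        self.cells = [(0, i, None) for i in range(n)]

    def write(self, i, v):
        max_ts = max(ts for ts, _, _ in self.cells)  # read TS1, ..., TSn
        self.cells[i] = (max_ts + 1, i, v)           # TSi = max TSj + 1

    def read(self, i):
        ts, j, v = max(self.cells)   # value with the max (TS, id) pair
        return v
```

Because every write reads all cells first, its timestamp exceeds those of all completed writes, which is exactly the timestamp property used in the linearization proof below.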
Multi-Writer Registers: Linearization Order

Create the linearization:
– Place writes in timestamp order
– Insert each read after the appropriate write

Legality is immediate.
Real-time order is preserved, since a read returns a value whose timestamp is at least as large as that of any preceding operation.
Example 3: Atomic Snapshot

• n components
• Update a single component
• Scan all the components "at once" (atomically)

Provides an instantaneous view of the whole memory.

[Figure: update returns ok; scan returns v1, …, vn]
Atomic Snapshot Algorithm [Afek, Attiya, Dolev, Gafni, Merritt, Shavit, JACM 1993]

Update(v,k):
  A[k] = (v, seqi, i)

Scan():
  repeat
    read A[1], …, A[n]
    read A[1], …, A[n]
  until the two collects are equal ("double collect")
  return A[1, …, n]

Linearize:
• Updates with their writes
• Scans inside the double collects
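A sequential sketch of the double-collect scan (the class name `Snapshot` is illustrative; run concurrently, the loop may retry, and this simple variant is not wait-free, which motivates the embedded-scan version below):

```python
class Snapshot:
    """Atomic snapshot via double collects, exercised here sequentially."""
    def __init__(self, n):
        self.A = [(None, 0, i) for i in range(n)]  # (value, seq, writer id)
        self.seq = [0] * n

    def update(self, i, v):
        self.seq[i] += 1                 # fresh sequence number per write
        self.A[i] = (v, self.seq[i], i)

    def scan(self):
        while True:
            c1 = list(self.A)            # first collect
            c2 = list(self.A)            # second collect
            if c1 == c2:                 # no update in between: a safe zone
                return [v for v, _, _ in c1]
```

The sequence numbers are what make the equality test sound: without them, an updater writing the same value twice could slip between the collects undetected (the "ABA" issue).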
Atomic Snapshot: Linearizability

Double collect: read the set of values twice. If the two collects are equal, there is no write between them
– assuming each write has a new value (seq#)

This creates a "safe zone", where the scan can be linearized.

[Figure: two collects of A[1], …, A[n], with a possible write to A[j] between them]
Liveness Conditions

• Wait-free: every operation completes within a finite number of (its own) steps ⇒ no starvation for mutex
• Nonblocking: some operation completes within a finite number of (some other process's) steps ⇒ deadlock-freedom for mutex
• Obstruction-free: an operation (eventually) running solo completes within a finite number of (its own) steps
  – also called solo termination

wait-free ⇒ nonblocking ⇒ obstruction-free
bounded wait-free ⇒ bounded nonblocking ⇒ bounded obstruction-free
Wait-Free Atomic Snapshot [Afek, Attiya, Dolev, Gafni, Merritt, Shavit, JACM 1993]

• Embed a scan within the Update.

Update(v,k):
  V = scan()
  A[k] = (v, seqi, i, V)

Scan():
  repeat
    read A[1], …, A[n]
    read A[1], …, A[n]
    if equal, return A[1, …, n]   (a direct scan)
    else record the difference;
    if some pj changed twice, return Vj   (a borrowed scan)

Linearize:
• Updates with their writes
• Direct scans as before
• Borrowed scans in place
Atomic Snapshot: Borrowed Scans

Interference by process pj, and then another one: pj performs an embedded scan in between.

[Figure: the scanner's repeated reads of A[j] observe two writes by pj; pj's second Update contains an embedded scan that took place between the scanner's reads]

Linearizing with the borrowed scan is OK.
List of Topics (Indicative)
• Atomic snapshots
• Space complexity of consensus
• Dynamic storage
• Vector agreement
• Renaming
• Maximal independent set
• Routing
and possibly others…