Outwards from the middle of the maze
Peter Alvaro, UC Berkeley
Outline
1. Mourning the death of transactions
2. What is so hard about distributed systems?
3. Distributed consistency: managing asynchrony
4. Fault-tolerance: progress despite failures
The transaction concept

DEBIT_CREDIT:
  BEGIN_TRANSACTION;
  GET MESSAGE;
  EXTRACT ACCOUNT_NUMBER, DELTA, TELLER, BRANCH FROM MESSAGE;
  FIND ACCOUNT(ACCOUNT_NUMBER) IN DATA BASE;
  IF NOT_FOUND | ACCOUNT_BALANCE + DELTA < 0 THEN
    PUT NEGATIVE RESPONSE;
  ELSE DO;
    ACCOUNT_BALANCE = ACCOUNT_BALANCE + DELTA;
    POST HISTORY RECORD ON ACCOUNT (DELTA);
    CASH_DRAWER(TELLER) = CASH_DRAWER(TELLER) + DELTA;
    BRANCH_BALANCE(BRANCH) = BRANCH_BALANCE(BRANCH) + DELTA;
    PUT MESSAGE ('NEW BALANCE =' ACCOUNT_BALANCE);
  END;
  COMMIT;
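For a modern reader, here is a minimal sketch of the same debit/credit logic as a single ACID transaction, using Python's sqlite3. The schema (accounts, history, tellers, branches) is an illustrative assumption, not part of the talk.

```python
import sqlite3

def debit_credit(db, account, delta, teller, branch):
    try:
        with db:  # one transaction: commits on success, rolls back on error
            row = db.execute(
                "SELECT balance FROM accounts WHERE id = ?", (account,)
            ).fetchone()
            if row is None or row[0] + delta < 0:
                return "NEGATIVE RESPONSE"
            db.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                       (delta, account))
            db.execute("INSERT INTO history (account, delta) VALUES (?, ?)",
                       (account, delta))
            db.execute("UPDATE tellers  SET cash    = cash    + ? WHERE id = ?",
                       (delta, teller))
            db.execute("UPDATE branches SET balance = balance + ? WHERE id = ?",
                       (delta, branch))
            return f"NEW BALANCE = {row[0] + delta}"
    except sqlite3.Error:
        return "NEGATIVE RESPONSE"  # the whole transaction is rolled back

db = sqlite3.connect(":memory:")
db.executescript("""
  CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER);
  CREATE TABLE history  (account INTEGER, delta INTEGER);
  CREATE TABLE tellers  (id INTEGER PRIMARY KEY, cash INTEGER);
  CREATE TABLE branches (id INTEGER PRIMARY KEY, balance INTEGER);
  INSERT INTO accounts VALUES (1, 100);
  INSERT INTO tellers  VALUES (7, 0);
  INSERT INTO branches VALUES (3, 0);
""")
print(debit_credit(db, 1, 50, 7, 3))    # NEW BALANCE = 150
print(debit_credit(db, 1, -500, 7, 3))  # NEGATIVE RESPONSE
```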
The “top-down” ethos
Transactions: a holistic contract

[Figure: the Application issues Writes and Reads against an Opaque store through the Transactions interface, and can assert application-level invariants such as "balance > 0".]
Incidental complexities
• The “Internet.” Searching it.
• Cross-datacenter replication schemes
• CAP Theorem
• Dynamo & MapReduce
• “Cloud”
Fundamental complexity
“[…] distributed systems require that the programmer be aware of latency, have a different model of memory access, and take into account issues of concurrency and partial failure.”
Jim Waldo et al., A Note on Distributed Computing (1994)
A holistic contract… stretched to the limit

[Figure: the same Application / Transactions / Opaque store picture, now stretched across a distributed, replicated setting.]
Are you blithely asserting that transactions aren’t webscale?
Some people just want to see the world burn. Those same people want to see the world use inconsistent databases.
- Emin Gun Sirer
Alternative to top-down design?
The “bottom-up,” systems tradition: Simple, reusable components first. Semantics later.
Alternative: the “bottom-up,” systems ethos
The “bottom-up” ethos
“‘Tis a fine barn, but sure ‘tis no castle, English”
The “bottom-up” ethos
Simple, reusable components first. Semantics later.
This is how we live now.
Question: Do we ever get those application-level guarantees back?
Low-level contracts

[Figure: the Application issues Writes and Reads directly against a distributed store (KVS), and still wants to assert "balance > 0".]

Histories the store may permit:
R1(X=1) R2(X=1) W1(X=2) W2(X=0)
W1(X=1) W1(Y=2) R2(Y=2) R2(X=0)

Which contract does the store offer? causal? PRAM? delta? fork/join? red/blue? Release?
When do contracts compose?

[Figure: the Application now spans multiple distributed services, each with its own guarantees.]
ew, did I get mongo in my riak?
Assert: balance > 0
Composition is the last hard problem
Composing modules is hard enough.
We must learn how to compose guarantees.
Outline
1. Mourning the death of transactions
2. What is so hard about distributed systems?
3. Distributed consistency: managing asynchrony
4. Fault-tolerance: progress despite failures
Why distributed systems are hard²

Asynchrony × Partial Failure → Fundamental Uncertainty
Asynchrony isn’t that hard

Amelioration:
• Logical timestamps
• Deterministic interleaving
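A minimal sketch of the first amelioration, logical timestamps, as a Lamport clock (my illustration, not code from the talk):

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send(self):
        # attach the current logical time to an outgoing message
        self.time += 1
        return self.time

    def receive(self, msg_time):
        # merge the sender's clock, so causally later events get later stamps
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t_send = a.send()           # a -> b
t_recv = b.receive(t_send)
assert t_recv > t_send      # causal order is reflected in the timestamps
```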
Partial failure isn’t that hard

Amelioration:
• Replication
• Replay
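A minimal sketch of replay: persist operations in a log and rebuild state after a crash by replaying it (my illustration, not code from the talk; the KV class is hypothetical):

```python
class KV:
    def __init__(self, log=None):
        self.log = list(log) if log else []
        self.state = {}
        for k, v in self.log:          # replay on recovery
            self.state[k] = v

    def put(self, k, v):
        self.log.append((k, v))        # write-ahead: log first,
        self.state[k] = v              # then apply

durable_log = []
node = KV(durable_log)
node.put("x", 1); node.put("y", 2)
durable_log = node.log                 # the log survives the crash

node = KV(durable_log)                 # "crash", then recover by replay
assert node.state == {"x": 1, "y": 2}
```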
(asynchrony × partial failure) = hard²

Logical timestamps and deterministic interleaving; replication and replay:
each amelioration for one problem is undermined by the other.
Tackling one clown at a time
A poor strategy for programming distributed systems.
A winning strategy for analyzing distributed programs.
Outline
1. Mourning the death of transactions
2. What is so hard about distributed systems?
3. Distributed consistency: managing asynchrony
4. Fault-tolerance: progress despite failures
Distributed consistency
Today: A quick summary of some great work.
Consider a (distributed) graph
[Figure: a graph of nodes T1–T14.]
Partitioned, for scalability
[Figure: the same graph, partitioned across machines for scalability.]
Replicated, for availability
[Figure: the same graph, replicated on two machines for availability.]
Deadlock detection
Task: Identify strongly-connected components
Waits-for graph
[Figure: the graph, read as a waits-for graph among transactions T1–T14.]
Garbage collection
Task: Identify nodes not reachable from Root.
Refers-to graph
[Figure: the graph, read as a refers-to graph rooted at Root.]
Correctness

Deadlock detection
• Safety: No false positives
• Liveness: Identify all deadlocks

Garbage collection
• Safety: Never GC live memory!
• Liveness: GC all orphaned memory
Consistency at the extremes

(spectrum: Storage / Object / Flow / Language / Application)
At the storage extreme: a linearizable key-value store? (correct)
At the application extreme: custom solutions? (efficient)
Object-level consistency
(spectrum: Storage / Object / Flow / Language / Application)

Capture semantics of data structures that
• allow greater concurrency
• maintain guarantees (e.g., convergence)

A convergent data structure (e.g., a Set CRDT) exposes Insert and Read.
It relies on: commutativity, associativity, idempotence.
It is therefore tolerant to: reordering, batching, retry/duplication.
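A minimal sketch of such an object: a grow-only Set CRDT whose merge is set union, hence commutative, associative, and idempotent (my illustration, not code from the talk):

```python
class GSet:
    def __init__(self):
        self.elems = set()

    def insert(self, x):          # "Insert"
        self.elems.add(x)

    def read(self):               # "Read"
        return frozenset(self.elems)

    def merge(self, other):       # join with another replica's state
        self.elems |= other.elems

r1, r2 = GSet(), GSet()
r1.insert("a"); r2.insert("b"); r2.insert("a")   # concurrent updates

# deliver states in different orders, with duplication
r1.merge(r2); r1.merge(r2)
r2.merge(r1)
assert r1.read() == r2.read() == {"a", "b"}      # replicas converge
```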
Object-level composition?

[Figure: an application built from convergent data structures.]
Assert: Graph replicas converge
GC assert: No live nodes are reclaimed
Do the object-level guarantees compose into these application-level assertions?
Flow-level consistency
(spectrum: Storage / Object / Flow / Language / Application)

Capture semantics of data in motion
• Asynchronous dataflow model
• component properties → system-wide guarantees

[Figure: dataflow from a Graph store, through Transitive closure, to a Deadlock detector, driven by a Transaction manager.]

Order-insensitivity (confluence):
output set = f(input set)
A confluent component produces the same output set for every ordering and batching of its input set.
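A minimal sketch of confluence as order-insensitivity, using transitive closure as the set-to-set operator (my illustration, not code from the talk):

```python
from itertools import permutations

def transitive_closure(edges):
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

edges = [("t1", "t2"), ("t2", "t3"), ("t3", "t1")]
finals = set()
for order in permutations(edges):         # every delivery order
    out = set()
    for e in order:                       # edges arrive one at a time
        out = transitive_closure(out | {e})
    finals.add(frozenset(out))
assert len(finals) == 1                   # output set = f(input set)
```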
Confluence is compositional

output set = (f ∘ g)(input set)
A composition of confluent components is itself confluent.
Graph queries as dataflow

Deadlock detection: Transaction manager → Graph store (confluent) → Transitive closure (confluent) → Deadlock detector (confluent)
Garbage collection: Memory allocator → Graph store (confluent) → Transitive closure (confluent) → Garbage collector (NOT confluent)
Coordination: what is that?

In the garbage-collection dataflow, coordinate at the non-confluent step, in front of the Garbage collector.
Strategy 1: Establish a total order.
Strategy 2: Establish a producer-consumer barrier.
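A minimal sketch of Strategy 2, a producer-consumer barrier in front of the non-confluent step (my illustration under assumed inputs, not code from the talk): the collector must not act on "unreachable" until every producer has sealed its input, because a late edge can retract that conclusion.

```python
import threading

refs, roots = set(), {"Root"}
num_producers = 2
barrier = threading.Barrier(num_producers + 1)   # producers + collector

def producer(edges):
    for e in edges:
        refs.add(e)          # stream in refers-to edges
    barrier.wait()           # seal: "I will send no more edges"

def collect():
    barrier.wait()           # wait for every producer to seal its input
    live = set(roots)
    frontier = set(roots)
    while frontier:
        frontier = {b for (a, b) in refs if a in live} - live
        live |= frontier
    all_nodes = {n for e in refs for n in e} | roots
    return all_nodes - live  # only now is "unreachable" safe to act on

threads = [threading.Thread(target=producer, args=(e,)) for e in
           [[("Root", "T1"), ("T1", "T2")], [("T5", "T6")]]]
for t in threads: t.start()
garbage = collect()
for t in threads: t.join()
print(garbage)               # {'T5', 'T6'}
```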
Fundamental costs: FT via replication

A fully confluent flow (e.g., deadlock detection): replicate the entire dataflow and run both copies. (mostly) free!
A flow with a non-confluent step (e.g., garbage collection): the replicas must agree at that step. global synchronization! (e.g., run Paxos at the Garbage collector, or place a Barrier in front of it; both are themselves non-confluent.)
The first principle of successful scalability is to batter the consistency mechanisms down to a minimum. – James Hamilton
Language-level consistency
(spectrum: Storage / Object / Flow / Language / Application)

DSLs for distributed programming?
• Capture consistency concerns in the type system
Language-level consistency

CALM Theorem: monotonic → confluent
A conservative, syntactic test for confluence.
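A minimal sketch of the CALM intuition, contrasting a monotonic query (reachability) with a non-monotonic one (unreachability, i.e., garbage); my illustration, not code from the talk:

```python
def reachable(edges, root="Root"):          # monotonic
    live, frontier = {root}, {root}
    while frontier:
        frontier = {b for (a, b) in edges if a in frontier} - live
        live |= frontier
    return live

def unreachable(nodes, edges, root="Root"): # non-monotonic (uses negation)
    return nodes - reachable(edges, root)

nodes   = {"Root", "T1", "T2"}
e_small = {("Root", "T1")}
e_big   = e_small | {("T1", "T2")}          # more input arrives

assert reachable(e_small) <= reachable(e_big)           # output only grows
assert not (unreachable(nodes, e_small) <= unreachable(nodes, e_big))
# "T2" was reported unreachable, then retracted when the new edge arrived.
```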
Language-level consistency

Deadlock detector: monotonic
Garbage collector: nonmonotonic
Let’s review
• Consistency is tolerance to asynchrony
• Tricks:
  – focus on data in motion, not at rest
  – avoid coordination when possible
  – choose coordination carefully otherwise
(Tricks are great, but tools are better)
Outline
1. Mourning the death of transactions
2. What is so hard about distributed systems?
3. Distributed consistency: managing asynchrony
4. Fault-tolerance: progress despite failures
Grand challenge: composition
Hard problem: Is a given component fault-tolerant?
Much harder: Is this system (built up from components) fault-tolerant?
Example: Atomic multi-partition update
[Figure: the partitioned graph of transactions T1–T14; an update spans partitions.]
Two-phase commit
Example: replication
[Figure: the graph replicated across two replicas.]
Reliable broadcast
Popular wisdom: don’t reinvent
Example: Kafka replication bug
Three “correct” components:
1. Primary/backup replication
2. Timeout-based failure detectors
3. Zookeeper
One nasty bug: Acknowledged writes are lost
A guarantee would be nice
Bottom-up approach:
• Use formal methods to verify individual components (e.g., protocols)
• Build systems from verified components

Shortcomings:
• Hard to use
• Hard to compose

[Chart: investment vs. returns for the bottom-up approach.]
Bottom-up assurances

Formal verification
[Figure: verification takes a Program, an Environment, and a Correctness Spec.]
Composing bottom-up assurances

Issue 1: Incompatible failure models (e.g., crash failure vs. omissions)
Issue 2: Specs do not compose (FT is an end-to-end property)

If you take 10 components off the shelf, you are putting 10 world views together, and the result will be a mess. -- Butler Lampson
Top-down “assurances”

Testing + fault injection
End-to-end testing would be nice
Top-down approach:
• Build a large-scale system
• Test the system under faults

Shortcomings:
• Hard to identify complex bugs
• Fundamentally incomplete

[Chart: investment vs. returns for the top-down approach.]
Lineage-driven fault injection
Goal: top-down testing that
• finds all of the fault-tolerance bugs, or
• certifies that none exist
Lineage-driven fault injection

Molly = Correctness Specification + Malevolent sentience
Lineage-driven fault injection (LDFI)
Approach: think backwards from outcomes.
Question: could a bad thing ever happen?
Reframe:
• Why did a good thing happen?
• What could have gone wrong along the way?
Thomasina: What a faint-heart! We must work outward from the middle of the maze. We will start with something simple.
The game
• Both players agree on a failure model
• The programmer provides a protocol
• The adversary observes executions and chooses failures for the next execution
Dedalus: it’s about data
log(B, “data”)@5
• What: the fact itself (some data: “data”)
• Where: node B
• When: time 5
Dedalus: it’s like Datalog

consequence :- premise[s]

log(Node, Pload) :- bcast(Node, Pload);

(Which is like SQL)

create view log as
select Node, Pload from bcast;
Dedalus: it’s about time

consequence@when :- premise[s]

node(Node, Neighbor)@next :- node(Node, Neighbor);    (state change)

log(Node2, Pload)@async :- bcast(Node1, Pload),
                           node(Node1, Node2);        (communication; natural join on bcast.Node1 == node.Node1)
The match
Protocol: Reliable broadcast
Specification:
Pre: A correct process delivers a message m
Post: All correct processes deliver m

Failure Model:
(Permanent) crash failures
Message loss / partitions
Round 1: “An effort” delivery protocol

node(Node, Neighbor)@next :- node(Node, Neighbor);
log(Node, Pload)@next     :- log(Node, Pload);

log(Node, Pload)          :- bcast(Node, Pload);
log(Node2, Pload)@async   :- bcast(Node1, Pload),
                             node(Node1, Node2);
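To make Round 1 concrete, here is a minimal simulation sketch of the best-effort protocol (my illustration in Python, not the talk's Dedalus runtime), showing that a single lost message defeats delivery:

```python
def round1(drop_first_send=False):
    node = {"A": ["B", "C"]}
    log = {"A": {"data"}, "B": set(), "C": set()}  # log(Node, Pload) :- bcast(Node, Pload)
    bcast = {"A": {"data"}}
    for t in range(1, 5):
        # log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2)
        for sender, payloads in bcast.items():
            for receiver in node.get(sender, []):
                for p in payloads:
                    lost = drop_first_send and receiver == "B" and t == 1
                    if not lost:
                        log[receiver].add(p)
        bcast = {}            # no retry: bcast is not persisted to @next
    return log

print(round1())                        # B and C both deliver
print(round1(drop_first_send=True))    # drop A->B at time 1: B never delivers
```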
Round 1 in space / time
[Space/time diagram: process a broadcasts once at time 1; processes b and c log the message at time 2.]
Round 1: Lineage

Work backwards from the outcome log(B, data)@5:

log(B, data)@5 ← log(B, data)@4 ← log(B, data)@3 ← log(B, data)@2 ← log(A, data)@1

via the persistence rule
log(Node, Pload)@next :- log(Node, Pload);
e.g., log(B, data)@5 :- log(B, data)@4;

and the communication rule
log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2);
e.g., log(B, data)@2 :- bcast(A, data)@1, node(A, B)@1;
An execution is a (fragile) “proof” of an outcome

[Proof tree: log(B, data)@5 follows, by repeated persistence (r1), from log(B, data)@2, which follows (r2) from bcast/log(A, data)@1 and node(A, B)@1. The whole proof hinges on AB1: the message from A to B at time 1.]
Valentine: “The unpredictable and the predetermined unfold together to make everything the way it is.”
Round 1: counterexample
The adversary wins!
[Space/time diagram: the message from a to b is LOST; c logs the message, b never does.]
Round 2
Same as Round 1, but A retries.
bcast(N, P)@next :- bcast(N, P);
Round 2 in space / time
[Space/time diagram: process a rebroadcasts at every timestep; b and c receive and log the message repeatedly.]
Round 2: Lineage

log(B, data)@5 now has many derivations, one through each retry:

log(B, data)@5 ← log(B, data)@4 ← log(B, data)@3 ← bcast(A, data)@2, node(A, B)@2
and likewise via bcast(A, data)@1, @3, and @4.

log(Node, Pload)@next :- log(Node, Pload);
e.g., log(B, data)@5 :- log(B, data)@4;

log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2);
e.g., log(B, data)@3 :- bcast(A, data)@2, node(A, B)@2;
Retry provides redundancy in time
Traces are forests of proof trees

[Proof forest: log(B, data)@5 has four proof trees, one per retry, requiring message AB1, AB2, AB3, or AB4 respectively. To falsify the outcome, the adversary must break them all: AB1 ∧ AB2 ∧ AB3 ∧ AB4.]
Round 2: counterexample

[Space/time diagram: a’s message to b is LOST, and a CRASHES before it can retry; b never logs the message.]
The adversary wins!
Round 3
Same as in Round 2, but symmetrical.
bcast(N, P)@next :- log(N, P);
Round 3 in space / time
[Space/time diagram: every process that has logged the message rebroadcasts it at every timestep; a, b, and c all log and relay.]
Redundancy in space and time
Round 3: Lineage

log(B, data)@5 now has derivations through every process and every timestep: through A’s broadcasts, through C’s relays, and through B’s own persistence.
Round 3
The programmer wins!
Let’s reflect
Fault-tolerance is redundancy in space and time.
Best strategy for both players: reason backwards from outcomes using lineage.
Finding bugs: find a set of failures that “breaks” all derivations.
Fixing bugs: add additional derivations.
The role of the adversary can be automated

1. Break a proof by dropping any contributing message: a disjunction, e.g. (AB1 ∨ BC2).
2. Find a set of failures that breaks all proofs of a good outcome: a conjunction of disjunctions (AKA CNF), e.g.
   (AB1 ∨ BC2) ∧ (AC1) ∧ (AC2)
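A minimal sketch of the automated adversary's job: treat each proof as the set of messages it depends on, and search for a smallest failure set that intersects every proof (my illustration; the real Molly hands the CNF to a solver):

```python
from itertools import combinations

# Round 2: four proofs of log(B, data)@5, one per retry.
proofs = [{"AB1"}, {"AB2"}, {"AB3"}, {"AB4"}]
messages = sorted(set().union(*proofs))

def smallest_failure_set(proofs, messages):
    for k in range(1, len(messages) + 1):
        for drops in combinations(messages, k):
            if all(proof & set(drops) for proof in proofs):  # breaks every proof
                return set(drops)
    return None  # no failure set breaks the outcome for this trace

print(smallest_failure_set(proofs, messages))
# {'AB1', 'AB2', 'AB3', 'AB4'}: the adversary must drop every retry
# (which, within the failure model, it can do by crashing A after time 1).
```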
Molly, the LDFI prototype
Molly finds fault-tolerance violations quickly or guarantees that none exist.
Molly finds bugs by explaining good outcomes; then it explains the bugs.
Bugs identified: 2PC, 2PC-CTP, 3PC, Kafka
Certified correct: Paxos (synod), Flux, bully leader election, reliable broadcast
Commit protocols

Problem: Atomically change things
Correctness properties:
1. Agreement (All or nothing)
2. Termination (Something)
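A minimal sketch of a two-phase-commit coordinator to make the two properties concrete (my illustration, not code from the talk); a coordinator crash between the phases leaves agents blocked, which is the termination weakness exploited below:

```python
def two_phase_commit(coordinator_alive, agents):
    # Phase 1: prepare / collect votes
    votes = {name: agent("prepare") for name, agent in agents.items()}
    decision = "commit" if all(v == "yes" for v in votes.values()) else "abort"

    if not coordinator_alive:       # crash before broadcasting the decision
        return {name: "blocked" for name in agents}   # termination violated

    # Phase 2: broadcast the decision
    return {name: decision for name in agents}

agents = {"a": lambda m: "yes", "b": lambda m: "yes", "d": lambda m: "yes"}
print(two_phase_commit(True, agents))    # everyone commits (agreement holds)
print(two_phase_commit(False, agents))   # coordinator crash: everyone blocks
```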
Two-phase commit

[Space/time diagram: the Coordinator sends prepare to agents a, b, and d (“Can I kick it?”); each agent replies with its vote (“YES YOU CAN”); the Coordinator then sends commit to all (“Well I’m gone”).]
Two-phase commit

[Space/time diagram: the Coordinator sends prepare and collects votes, then CRASHES before sending the decision; the agents block forever.]
Violation: Termination
The collaborative termination protocol (CTP)

Basic idea: Agents talk amongst themselves when the coordinator fails.
Protocol: On timeout, ask the other agents about the decision.
2PC - CTP

[Space/time diagram: the Coordinator crashes after collecting votes; on timeout, agents a, b, and d exchange decision_req messages, but since no agent ever learned the decision, they still block.]
3PC
Basic idea: Add a round, a state, and simple failure detectors (timeouts).
Protocol:
1. Phase 1: Just like in 2PC (agent timeout → abort)
2. Phase 2: send canCommit, collect acks (agent timeout → commit)
3. Phase 3: Just like phase 2 of 2PC
3PC

[Space/time diagram: the Coordinator sends cancommit, collects vote_msg, sends precommit, collects ack, and finally sends commit. Before precommit, an agent timeout means Abort; after precommit, an agent timeout means Commit.]
Network partitions make 3PC act crazy

[Space/time diagram:
1. Agent d votes and then crashes; agents a and b receive precommit and so learn the commit decision.
2. Seeing that d is dead, the coordinator decides to abort.
3. A brief network partition causes the abort messages to a and b to be LOST.
4. On timeout after precommit, agents a and b decide to commit, while the coordinator has aborted: agreement is violated.]
Kafka durability bug

[Space/time diagram:
1. A brief network partition separates replica a from replicas b and c.
2. Via Zookeeper, a becomes leader and the sole in-sync replica.
3. a ACKs a client write.
4. a crashes before the other replicas see the write: the acknowledged write is lost. Data loss.]
Molly summary
Lineage allows us to reason backwards from good outcomes.
Molly: surgically-targeted fault injection.
Investment similar to testing; returns similar to formal methods.
Where we’ve been; where we’re headed

1. Mourning the death of transactions → We need application-level guarantees
2. What is so hard about distributed systems? → asynchrony × partial failure = too hard to hide! We need tools to manage it.
3. Distributed consistency: managing asynchrony → Focus on flow: data in motion
4. Fault-tolerance: progress despite failures → Backwards from outcomes
Remember

1. We need application-level guarantees
2. asynchrony × partial failure = too hard to hide! We need tools to manage it.
3. Focus on flow: data in motion
4. Backwards from outcomes
Composition is the hardest problem
A happy crisis
Valentine: “It makes me so happy. To be at the beginning again, knowing almost nothing.... It's the best possible time of being alive, when almost everything you thought you knew is wrong.”