Outwards from the middle of the maze
Peter Alvaro, UC Berkeley
Outline
1. Mourning the death of transactions
2. What is so hard about distributed systems?
3. Distributed consistency: managing asynchrony
4. Fault-tolerance: progress despite failures
The transaction concept

DEBIT_CREDIT:
  BEGIN_TRANSACTION;
  GET MESSAGE;
  EXTRACT ACCOUNT_NUMBER, DELTA, TELLER, BRANCH FROM MESSAGE;
  FIND ACCOUNT(ACCOUNT_NUMBER) IN DATA BASE;
  IF NOT_FOUND | ACCOUNT_BALANCE + DELTA < 0 THEN
    PUT NEGATIVE RESPONSE;
  ELSE DO;
    ACCOUNT_BALANCE = ACCOUNT_BALANCE + DELTA;
    POST HISTORY RECORD ON ACCOUNT (DELTA);
    CASH_DRAWER(TELLER) = CASH_DRAWER(TELLER) + DELTA;
    BRANCH_BALANCE(BRANCH) = BRANCH_BALANCE(BRANCH) + DELTA;
    PUT MESSAGE ('NEW BALANCE =' ACCOUNT_BALANCE);
  END;
  COMMIT;
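For a modern reader, here is a minimal sketch of the same debit/credit logic as a single ACID transaction, using Python's sqlite3. The schema (accounts, history, tellers, branches) is an illustrative assumption, not part of the talk.

```python
import sqlite3

def debit_credit(db, account, delta, teller, branch):
    try:
        with db:  # one transaction: commits on success, rolls back on error
            row = db.execute(
                "SELECT balance FROM accounts WHERE id = ?", (account,)
            ).fetchone()
            if row is None or row[0] + delta < 0:
                return "NEGATIVE RESPONSE"
            db.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                       (delta, account))
            db.execute("INSERT INTO history (account, delta) VALUES (?, ?)",
                       (account, delta))
            db.execute("UPDATE tellers  SET cash    = cash    + ? WHERE id = ?",
                       (delta, teller))
            db.execute("UPDATE branches SET balance = balance + ? WHERE id = ?",
                       (delta, branch))
            return f"NEW BALANCE = {row[0] + delta}"
    except sqlite3.Error:
        return "NEGATIVE RESPONSE"  # the whole transaction is rolled back

db = sqlite3.connect(":memory:")
db.executescript("""
  CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER);
  CREATE TABLE history  (account INTEGER, delta INTEGER);
  CREATE TABLE tellers  (id INTEGER PRIMARY KEY, cash INTEGER);
  CREATE TABLE branches (id INTEGER PRIMARY KEY, balance INTEGER);
  INSERT INTO accounts VALUES (1, 100);
  INSERT INTO tellers  VALUES (7, 0);
  INSERT INTO branches VALUES (3, 0);
""")
print(debit_credit(db, 1, 50, 7, 3))    # NEW BALANCE = 150
print(debit_credit(db, 1, -500, 7, 3))  # NEGATIVE RESPONSE
```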
The “top-down” ethos
Transactions: a holistic contract

[Figure: the Application issues Writes and Reads against an Opaque store through the Transactions interface, and can assert application-level invariants such as "balance > 0".]
Incidental complexities
• The “Internet.” Searching it.
• Cross-datacenter replication schemes
• CAP Theorem
• Dynamo & MapReduce
• “Cloud”
Fundamental complexity
“[…] distributed systems require that the programmer be aware of latency, have a different model of memory access, and take into account issues of concurrency and partial failure.”
Jim Waldo et al., A Note on Distributed Computing (1994)
A holistic contract… stretched to the limit

[Figure: the same Application / Transactions / Opaque store picture, now stretched across a distributed, replicated setting.]
Are you blithely asserting that transactions aren’t webscale?
Some people just want to see the world burn. Those same people want to see the world use inconsistent databases.
- Emin Gun Sirer
Alternative to top-down design?
The “bottom-up,” systems tradition: Simple, reusable components first. Semantics later.
Alternative: the “bottom-up,” systems ethos
The “bottom-up” ethos
“‘Tis a fine barn, but sure ‘tis no castle, English”
The “bottom-up” ethos
Simple, reusable components first. Semantics later.
This is how we live now.
Question: Do we ever get those application-level guarantees back?
Low-level contracts

[Figure: the Application issues Writes and Reads directly against a distributed store (KVS), and still wants to assert "balance > 0".]

Histories the store may permit:
R1(X=1) R2(X=1) W1(X=2) W2(X=0)
W1(X=1) W1(Y=2) R2(Y=2) R2(X=0)

Which contract does the store offer? causal? PRAM? delta? fork/join? red/blue? Release?
When do contracts compose?

[Figure: the Application now spans multiple distributed services, each with its own guarantees.]
ew, did I get mongo in my riak?
Assert: balance > 0
Composition is the last hard problem
Composing modules is hard enough.
We must learn how to compose guarantees.
Outline
1. Mourning the death of transactions
2. What is so hard about distributed systems?
3. Distributed consistency: managing asynchrony
4. Fault-tolerance: progress despite failures
Why distributed systems are hard²

Asynchrony × Partial Failure → Fundamental Uncertainty
Asynchrony isn’t that hard

Amelioration:
• Logical timestamps
• Deterministic interleaving
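A minimal sketch of the first amelioration, logical timestamps, as a Lamport clock (my illustration, not code from the talk):

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send(self):
        # attach the current logical time to an outgoing message
        self.time += 1
        return self.time

    def receive(self, msg_time):
        # merge the sender's clock, so causally later events get later stamps
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t_send = a.send()           # a -> b
t_recv = b.receive(t_send)
assert t_recv > t_send      # causal order is reflected in the timestamps
```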
Partial failure isn’t that hard

Amelioration:
• Replication
• Replay
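A minimal sketch of replay: persist operations in a log and rebuild state after a crash by replaying it (my illustration, not code from the talk; the KV class is hypothetical):

```python
class KV:
    def __init__(self, log=None):
        self.log = list(log) if log else []
        self.state = {}
        for k, v in self.log:          # replay on recovery
            self.state[k] = v

    def put(self, k, v):
        self.log.append((k, v))        # write-ahead: log first,
        self.state[k] = v              # then apply

durable_log = []
node = KV(durable_log)
node.put("x", 1); node.put("y", 2)
durable_log = node.log                 # the log survives the crash

node = KV(durable_log)                 # "crash", then recover by replay
assert node.state == {"x": 1, "y": 2}
```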
(asynchrony × partial failure) = hard²

Logical timestamps and deterministic interleaving; replication and replay:
each amelioration for one problem is undermined by the other.
Tackling one clown at a time
A poor strategy for programming distributed systems.
A winning strategy for analyzing distributed programs.
Outline
1. Mourning the death of transactions
2. What is so hard about distributed systems?
3. Distributed consistency: managing asynchrony
4. Fault-tolerance: progress despite failures
Distributed consistency
Today: A quick summary of some great work.
Consider a (distributed) graph
[Figure: a graph of nodes T1–T14.]
Partitioned, for scalability
[Figure: the same graph, partitioned across machines for scalability.]
Replicated, for availability
[Figure: the same graph, replicated on two machines for availability.]
Deadlock detection
Task: Identify strongly-connected components
Waits-for graph
[Figure: the graph, read as a waits-for graph among transactions T1–T14.]
Garbage collection
Task: Identify nodes not reachable from Root.
Refers-to graph
[Figure: the graph, read as a refers-to graph rooted at Root.]
Correctness

Deadlock detection
• Safety: No false positives
• Liveness: Identify all deadlocks

Garbage collection
• Safety: Never GC live memory!
• Liveness: GC all orphaned memory
Consistency at the extremes

(spectrum: Storage / Object / Flow / Language / Application)
At the storage extreme: a linearizable key-value store? (correct)
At the application extreme: custom solutions? (efficient)
Object-level consistency
(spectrum: Storage / Object / Flow / Language / Application)

Capture semantics of data structures that
• allow greater concurrency
• maintain guarantees (e.g., convergence)

A convergent data structure (e.g., a Set CRDT) exposes Insert and Read.
It relies on: commutativity, associativity, idempotence.
It is therefore tolerant to: reordering, batching, retry/duplication.
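A minimal sketch of such an object: a grow-only Set CRDT whose merge is set union, hence commutative, associative, and idempotent (my illustration, not code from the talk):

```python
class GSet:
    def __init__(self):
        self.elems = set()

    def insert(self, x):          # "Insert"
        self.elems.add(x)

    def read(self):               # "Read"
        return frozenset(self.elems)

    def merge(self, other):       # join with another replica's state
        self.elems |= other.elems

r1, r2 = GSet(), GSet()
r1.insert("a"); r2.insert("b"); r2.insert("a")   # concurrent updates

# deliver states in different orders, with duplication
r1.merge(r2); r1.merge(r2)
r2.merge(r1)
assert r1.read() == r2.read() == {"a", "b"}      # replicas converge
```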
Object-level composition?

[Figure: an application built from convergent data structures.]
Assert: Graph replicas converge
GC assert: No live nodes are reclaimed
Do the object-level guarantees compose into these application-level assertions?
Flow-level consistency
(spectrum: Storage / Object / Flow / Language / Application)

Capture semantics of data in motion
• Asynchronous dataflow model
• component properties → system-wide guarantees

[Figure: dataflow from a Graph store, through Transitive closure, to a Deadlock detector, driven by a Transaction manager.]

Order-insensitivity (confluence):
output set = f(input set)
A confluent component produces the same output set for every ordering and batching of its input set.
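A minimal sketch of confluence as order-insensitivity, using transitive closure as the set-to-set operator (my illustration, not code from the talk):

```python
from itertools import permutations

def transitive_closure(edges):
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

edges = [("t1", "t2"), ("t2", "t3"), ("t3", "t1")]
finals = set()
for order in permutations(edges):         # every delivery order
    out = set()
    for e in order:                       # edges arrive one at a time
        out = transitive_closure(out | {e})
    finals.add(frozenset(out))
assert len(finals) == 1                   # output set = f(input set)
```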
Confluence is compositional

output set = (f ∘ g)(input set)
A composition of confluent components is itself confluent.
Graph queries as dataflow

Deadlock detection: Transaction manager → Graph store (confluent) → Transitive closure (confluent) → Deadlock detector (confluent)
Garbage collection: Memory allocator → Graph store (confluent) → Transitive closure (confluent) → Garbage collector (NOT confluent)
Coordination: what is that?

In the garbage-collection dataflow, coordinate at the non-confluent step, in front of the Garbage collector.
Strategy 1: Establish a total order.
Strategy 2: Establish a producer-consumer barrier.
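A minimal sketch of Strategy 2, a producer-consumer barrier in front of the non-confluent step (my illustration under assumed inputs, not code from the talk): the collector must not act on "unreachable" until every producer has sealed its input, because a late edge can retract that conclusion.

```python
import threading

refs, roots = set(), {"Root"}
num_producers = 2
barrier = threading.Barrier(num_producers + 1)   # producers + collector

def producer(edges):
    for e in edges:
        refs.add(e)          # stream in refers-to edges
    barrier.wait()           # seal: "I will send no more edges"

def collect():
    barrier.wait()           # wait for every producer to seal its input
    live = set(roots)
    frontier = set(roots)
    while frontier:
        frontier = {b for (a, b) in refs if a in live} - live
        live |= frontier
    all_nodes = {n for e in refs for n in e} | roots
    return all_nodes - live  # only now is "unreachable" safe to act on

threads = [threading.Thread(target=producer, args=(e,)) for e in
           [[("Root", "T1"), ("T1", "T2")], [("T5", "T6")]]]
for t in threads: t.start()
garbage = collect()
for t in threads: t.join()
print(garbage)               # {'T5', 'T6'}
```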
Fundamental costs: FT via replication

A fully confluent flow (e.g., deadlock detection): replicate the entire dataflow and run both copies. (mostly) free!
A flow with a non-confluent step (e.g., garbage collection): the replicas must agree at that step. global synchronization! (e.g., run Paxos at the Garbage collector, or place a Barrier in front of it; both are themselves non-confluent.)
The first principle of successful scalability is to batter the consistency mechanisms down to a minimum. – James Hamilton
Language-level consistency
(spectrum: Storage / Object / Flow / Language / Application)

DSLs for distributed programming?
• Capture consistency concerns in the type system
Language-level consistency

CALM Theorem: monotonic → confluent
A conservative, syntactic test for confluence.
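A minimal sketch of the CALM intuition, contrasting a monotonic query (reachability) with a non-monotonic one (unreachability, i.e., garbage); my illustration, not code from the talk:

```python
def reachable(edges, root="Root"):          # monotonic
    live, frontier = {root}, {root}
    while frontier:
        frontier = {b for (a, b) in edges if a in frontier} - live
        live |= frontier
    return live

def unreachable(nodes, edges, root="Root"): # non-monotonic (uses negation)
    return nodes - reachable(edges, root)

nodes   = {"Root", "T1", "T2"}
e_small = {("Root", "T1")}
e_big   = e_small | {("T1", "T2")}          # more input arrives

assert reachable(e_small) <= reachable(e_big)           # output only grows
assert not (unreachable(nodes, e_small) <= unreachable(nodes, e_big))
# "T2" was reported unreachable, then retracted when the new edge arrived.
```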
Language-level consistency

Deadlock detector: monotonic
Garbage collector: nonmonotonic
Let’s review
• Consistency is tolerance to asynchrony
• Tricks:
  – focus on data in motion, not at rest
  – avoid coordination when possible
  – choose coordination carefully otherwise
(Tricks are great, but tools are better)
Outline
1. Mourning the death of transactions
2. What is so hard about distributed systems?
3. Distributed consistency: managing asynchrony
4. Fault-tolerance: progress despite failures
Grand challenge: composition
Hard problem: Is a given component fault-tolerant?
Much harder: Is this system (built up from components) fault-tolerant?
Example: Atomic multi-partition update
[Figure: the partitioned graph of transactions T1–T14; an update spans partitions.]
Two-phase commit
Example: replication
[Figure: the graph replicated across two replicas.]
Reliable broadcast
Popular wisdom: don’t reinvent
Example: Kafka replication bug
Three “correct” components:
1. Primary/backup replication
2. Timeout-based failure detectors
3. Zookeeper
One nasty bug: Acknowledged writes are lost
A guarantee would be nice
Bottom-up approach:
• Use formal methods to verify individual components (e.g., protocols)
• Build systems from verified components

Shortcomings:
• Hard to use
• Hard to compose

[Chart: investment vs. returns for the bottom-up approach.]
Bottom-up assurances

Formal verification
[Figure: verification takes a Program, an Environment, and a Correctness Spec.]
Composing bottom-up assurances

Issue 1: Incompatible failure models (e.g., crash failure vs. omissions)
Issue 2: Specs do not compose (FT is an end-to-end property)

If you take 10 components off the shelf, you are putting 10 world views together, and the result will be a mess. -- Butler Lampson
Top-down “assurances”

Testing + fault injection
End-to-end testing would be nice
Top-down approach:
• Build a large-scale system
• Test the system under faults

Shortcomings:
• Hard to identify complex bugs
• Fundamentally incomplete

[Chart: investment vs. returns for the top-down approach.]
Lineage-driven fault injection
Goal: top-down testing that
• finds all of the fault-tolerance bugs, or
• certifies that none exist
Lineage-driven fault injection

Molly = Correctness Specification + Malevolent sentience
Lineage-driven fault injection (LDFI)
Approach: think backwards from outcomes.
Question: could a bad thing ever happen?
Reframe:
• Why did a good thing happen?
• What could have gone wrong along the way?
Thomasina: What a faint-heart! We must work outward from the middle of the maze. We will start with something simple.
The game
• Both players agree on a failure model
• The programmer provides a protocol
• The adversary observes executions and chooses failures for the next execution
Dedalus: it’s about data
log(B, “data”)@5
• What: the fact itself (some data: “data”)
• Where: node B
• When: time 5
Dedalus: it’s like Datalog

consequence :- premise[s]

log(Node, Pload) :- bcast(Node, Pload);

(Which is like SQL)

create view log as
select Node, Pload from bcast;
Dedalus: it’s about time

consequence@when :- premise[s]

node(Node, Neighbor)@next :- node(Node, Neighbor);    (state change)

log(Node2, Pload)@async :- bcast(Node1, Pload),
                           node(Node1, Node2);        (communication; natural join on bcast.Node1 == node.Node1)
The match
Protocol: Reliable broadcast
Specification:
Pre: A correct process delivers a message m
Post: All correct processes deliver m

Failure Model:
(Permanent) crash failures
Message loss / partitions
Round 1: “An effort” delivery protocol

node(Node, Neighbor)@next :- node(Node, Neighbor);
log(Node, Pload)@next     :- log(Node, Pload);

log(Node, Pload)          :- bcast(Node, Pload);
log(Node2, Pload)@async   :- bcast(Node1, Pload),
                             node(Node1, Node2);
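To make Round 1 concrete, here is a minimal simulation sketch of the best-effort protocol (my illustration in Python, not the talk's Dedalus runtime), showing that a single lost message defeats delivery:

```python
def round1(drop_first_send=False):
    node = {"A": ["B", "C"]}
    log = {"A": {"data"}, "B": set(), "C": set()}  # log(Node, Pload) :- bcast(Node, Pload)
    bcast = {"A": {"data"}}
    for t in range(1, 5):
        # log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2)
        for sender, payloads in bcast.items():
            for receiver in node.get(sender, []):
                for p in payloads:
                    lost = drop_first_send and receiver == "B" and t == 1
                    if not lost:
                        log[receiver].add(p)
        bcast = {}            # no retry: bcast is not persisted to @next
    return log

print(round1())                        # B and C both deliver
print(round1(drop_first_send=True))    # drop A->B at time 1: B never delivers
```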
Round 1 in space / time
[Space/time diagram: process a broadcasts once at time 1; processes b and c log the message at time 2.]
Round 1: Lineage

Work backwards from the outcome log(B, data)@5:

log(B, data)@5 ← log(B, data)@4 ← log(B, data)@3 ← log(B, data)@2 ← log(A, data)@1

via the persistence rule
log(Node, Pload)@next :- log(Node, Pload);
e.g., log(B, data)@5 :- log(B, data)@4;

and the communication rule
log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2);
e.g., log(B, data)@2 :- bcast(A, data)@1, node(A, B)@1;
An execution is a (fragile) “proof” of an outcome

[Proof tree: log(B, data)@5 follows, by repeated persistence (r1), from log(B, data)@2, which follows (r2) from bcast/log(A, data)@1 and node(A, B)@1. The whole proof hinges on AB1: the message from A to B at time 1.]
Valentine: “The unpredictable and the predetermined unfold together to make everything the way it is.”
Round 1: counterexample
The adversary wins!
[Space/time diagram: the message from a to b is LOST; c logs the message, b never does.]
Round 2
Same as Round 1, but A retries.
bcast(N, P)@next :- bcast(N, P);
Round 2 in space / time
[Space/time diagram: process a rebroadcasts at every timestep; b and c receive and log the message repeatedly.]
Round 2: Lineage

log(B, data)@5 now has many derivations, one through each retry:

log(B, data)@5 ← log(B, data)@4 ← log(B, data)@3 ← bcast(A, data)@2, node(A, B)@2
and likewise via bcast(A, data)@1, @3, and @4.

log(Node, Pload)@next :- log(Node, Pload);
e.g., log(B, data)@5 :- log(B, data)@4;

log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2);
e.g., log(B, data)@3 :- bcast(A, data)@2, node(A, B)@2;
Retry provides redundancy in time
Traces are forests of proof trees

[Proof forest: log(B, data)@5 has four proof trees, one per retry, requiring message AB1, AB2, AB3, or AB4 respectively. To falsify the outcome, the adversary must break them all: AB1 ∧ AB2 ∧ AB3 ∧ AB4.]
Round 2: counterexample

[Space/time diagram: a’s message to b is LOST, and a CRASHES before it can retry; b never logs the message.]
The adversary wins!
Round 3
Same as in Round 2, but symmetrical.
bcast(N, P)@next :- log(N, P);
Round 3 in space / time
[Space/time diagram: every process that has logged the message rebroadcasts it at every timestep; a, b, and c all log and relay.]
Redundancy in space and time
Round 3: Lineage

log(B, data)@5 now has derivations through every process and every timestep: through A’s broadcasts, through C’s relays, and through B’s own persistence.
Round 3
The programmer wins!
Let’s reflect
Fault-tolerance is redundancy in space and time.
Best strategy for both players: reason backwards from outcomes using lineage.
Finding bugs: find a set of failures that “breaks” all derivations.
Fixing bugs: add additional derivations.
The role of the adversary can be automated

1. Break a proof by dropping any contributing message: a disjunction, e.g. (AB1 ∨ BC2).
2. Find a set of failures that breaks all proofs of a good outcome: a conjunction of disjunctions (AKA CNF), e.g.
   (AB1 ∨ BC2) ∧ (AC1) ∧ (AC2)
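A minimal sketch of the automated adversary's job: treat each proof as the set of messages it depends on, and search for a smallest failure set that intersects every proof (my illustration; the real Molly hands the CNF to a solver):

```python
from itertools import combinations

# Round 2: four proofs of log(B, data)@5, one per retry.
proofs = [{"AB1"}, {"AB2"}, {"AB3"}, {"AB4"}]
messages = sorted(set().union(*proofs))

def smallest_failure_set(proofs, messages):
    for k in range(1, len(messages) + 1):
        for drops in combinations(messages, k):
            if all(proof & set(drops) for proof in proofs):  # breaks every proof
                return set(drops)
    return None  # no failure set breaks the outcome for this trace

print(smallest_failure_set(proofs, messages))
# {'AB1', 'AB2', 'AB3', 'AB4'}: the adversary must drop every retry
# (which, within the failure model, it can do by crashing A after time 1).
```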
Molly, the LDFI prototype
Molly finds fault-tolerance violations quickly or guarantees that none exist.
Molly finds bugs by explaining good outcomes; then it explains the bugs.
Bugs identified: 2PC, 2PC-CTP, 3PC, Kafka
Certified correct: Paxos (synod), Flux, bully leader election, reliable broadcast
Commit protocols

Problem: Atomically change things
Correctness properties:
1. Agreement (All or nothing)
2. Termination (Something)
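A minimal sketch of a two-phase-commit coordinator to make the two properties concrete (my illustration, not code from the talk); a coordinator crash between the phases leaves agents blocked, which is the termination weakness exploited below:

```python
def two_phase_commit(coordinator_alive, agents):
    # Phase 1: prepare / collect votes
    votes = {name: agent("prepare") for name, agent in agents.items()}
    decision = "commit" if all(v == "yes" for v in votes.values()) else "abort"

    if not coordinator_alive:       # crash before broadcasting the decision
        return {name: "blocked" for name in agents}   # termination violated

    # Phase 2: broadcast the decision
    return {name: decision for name in agents}

agents = {"a": lambda m: "yes", "b": lambda m: "yes", "d": lambda m: "yes"}
print(two_phase_commit(True, agents))    # everyone commits (agreement holds)
print(two_phase_commit(False, agents))   # coordinator crash: everyone blocks
```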
Two-phase commit

[Space/time diagram: the Coordinator sends prepare to agents a, b, and d (“Can I kick it?”); each agent replies with its vote (“YES YOU CAN”); the Coordinator then sends commit to all (“Well I’m gone”).]
Two-phase commit

[Space/time diagram: the Coordinator sends prepare and collects votes, then CRASHES before sending the decision; the agents block forever.]
Violation: Termination
The collaborative termination protocol (CTP)

Basic idea: Agents talk amongst themselves when the coordinator fails.
Protocol: On timeout, ask the other agents about the decision.
2PC - CTP

[Space/time diagram: the Coordinator crashes after collecting votes; on timeout, agents a, b, and d exchange decision_req messages, but since no agent ever learned the decision, they still block.]
3PC
Basic idea: Add a round, a state, and simple failure detectors (timeouts).
Protocol:
1. Phase 1: Just like in 2PC (agent timeout → abort)
2. Phase 2: send canCommit, collect acks (agent timeout → commit)
3. Phase 3: Just like phase 2 of 2PC
3PC

[Space/time diagram: the Coordinator sends cancommit, collects vote_msg, sends precommit, collects ack, and finally sends commit. Before precommit, an agent timeout means Abort; after precommit, an agent timeout means Commit.]
Network partitions make 3PC act crazy

[Space/time diagram:
1. Agent d votes and then crashes; agents a and b receive precommit and so learn the commit decision.
2. Seeing that d is dead, the coordinator decides to abort.
3. A brief network partition causes the abort messages to a and b to be LOST.
4. On timeout after precommit, agents a and b decide to commit, while the coordinator has aborted: agreement is violated.]
Kafka durability bug

[Space/time diagram:
1. A brief network partition separates replica a from replicas b and c.
2. Via Zookeeper, a becomes leader and the sole in-sync replica.
3. a ACKs a client write.
4. a crashes before the other replicas see the write: the acknowledged write is lost. Data loss.]
Molly summary
Lineage allows us to reason backwards from good outcomes.
Molly: surgically-targeted fault injection.
Investment similar to testing; returns similar to formal methods.
Where we’ve been; where we’re headed

1. Mourning the death of transactions → We need application-level guarantees
2. What is so hard about distributed systems? → asynchrony × partial failure = too hard to hide! We need tools to manage it.
3. Distributed consistency: managing asynchrony → Focus on flow: data in motion
4. Fault-tolerance: progress despite failures → Backwards from outcomes
Remember

1. We need application-level guarantees
2. asynchrony × partial failure = too hard to hide! We need tools to manage it.
3. Focus on flow: data in motion
4. Backwards from outcomes
Composition is the hardest problem
A happy crisis
Valentine: “It makes me so happy. To be at the beginning again, knowing almost nothing.... It's the best possible time of being alive, when almost everything you thought you knew is wrong.”