consensus in distributed computing
LET’S TALK ABOUT…
CONSENSUS IN DISTRIBUTED COMPUTING
RUBEN TAN LONG ZHENG
▸ CTO of Neuroware, Inc
▸ We Do Blockchain Stuff™
▸ Co-founder of Javascript Developers Malaysia
▸ Proud owner of 2 useless cats
▸ @roguejs
SUPER HIGH-LEVEL OVERVIEW
▸ Consensus in Distributed Computing
▸ Consensus
▸ Agreeing that something is the truth
▸ Distributed Computing
▸ A network of nodes operating together
FAILURE MODES
▸ Fail-stop = a node dies
▸ Fail-recover = a node dies and comes back later (Jesus/Zombie)
▸ Byzantine = a node misbehaves
BYZANTINE GENERALS’ PROBLEM
▸ One of the first impossibility proofs in computer communications
▸ Impossible to solve in a perfect manner
▸ Originated from the Two Generals’ Problem (1975)
▸ Explored in detail in the paper by Leslie Lamport, Robert Shostak and Marshall Pease: The Byzantine Generals Problem (1982)
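[Diagram 1: generals A, B, C, D, E and F around the ENEMY, with a TRAITOR among the generals. Three ATTACK! votes, three RETREAT! votes, and the traitor sends "ATTACK! RETREAT!"]
[Diagram 2: same layout. TRAITOR: "MUAHAHA, NO CONSENSUS!" The attackers have insufficient force and are destroyed; the enemy routs the fleeing army.]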
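The traitor scenario above can be played out in a few lines of Python. This is only an illustrative sketch, not the algorithm from the 1982 paper: the even split of loyal votes, the extra traitor "T", and the helper names `traitor_vote` and `decide` are all assumptions made for the example.

```python
from collections import Counter

# Assumption: six loyal generals, split evenly, plus one traitor "T".
LOYAL_VOTES = {"A": "ATTACK", "B": "ATTACK", "C": "ATTACK",
               "D": "RETREAT", "E": "RETREAT", "F": "RETREAT"}


def traitor_vote(receiver):
    """The traitor tells each general whatever that general already wants to hear."""
    return LOYAL_VOTES[receiver]


def decide(general):
    """A loyal general follows the majority of all seven votes it has seen."""
    seen = list(LOYAL_VOTES.values())    # honest votes look the same to everyone
    seen.append(traitor_vote(general))   # the traitor's vote differs per receiver
    return Counter(seen).most_common(1)[0][0]


decisions = {general: decide(general) for general in LOYAL_VOTES}
consensus = len(set(decisions.values())) == 1
print(decisions)
print("consensus reached:", consensus)   # consensus reached: False
```

Every loyal general applies the same honest rule, yet A, B and C attack while D, E and F retreat, because the single traitor showed a different face to each side.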
BYZANTINE FAULT TOLERANCE
▸ Byzantine Fault
▸ Any fault that presents different symptoms to different observers (some generals attack, some generals retreat)
▸ Byzantine Failure
▸ The loss of a system service reliant on consensus due to a Byzantine Fault
▸ Byzantine Fault Tolerance
▸ The ability of a system to stay resilient/tolerant when a Byzantine Fault occurs
ON A SIDENOTE…
▸ Distributed computing is inherently unreliable
▸ Peter Deutsch, Bill Joy, Tom Lyon and James Gosling
▸ The Eight Fallacies of Distributed Computing (1994-1997)
▸ Today, we still have engineers who believe in some, if not all, of the fallacies
EIGHT FALLACIES OF DISTRIBUTED COMPUTING
▸ The network is reliable
▸ Latency is zero
▸ Bandwidth is infinite
▸ The network is secure
▸ Topology does not change
▸ There is only one administrator
▸ Transport cost is zero
▸ The network is homogeneous (same platform)
When you believe in any of the eight fallacies…
CONSENSUS
The Real Talk Begins™
CONSENSUS OVERVIEW
▸ Achieving Consensus = a distributed system acting as one entity
▸ Consensus Problem = getting nodes in a distributed system to agree on something (value, operation, etc.)
▸ Basically… consensus = THE HIVE MIND
▸ Common Examples
▸ Commit transactions to a database
▸ Synchronising clocks
FLP IMPOSSIBILITY PROOF
▸ Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson
▸ Impossibility of Distributed Consensus with One Faulty Process (1985) - Dijkstra (dike-stra) Prize (2001)
▸ In synchronous settings, it is possible to reach consensus at the cost of time
▸ No deterministic protocol can guarantee consensus in an asynchronous setting, even if only 1 node may crash
SOLVING THE CONSENSUS PROBLEM
▸ Strong consensus follows these properties:
▸ Termination - all non-faulty nodes eventually decide on a value
▸ Agreement - all nodes that decide choose the same value
▸ Validity - the value decided must have been proposed by a node (AKA no default value to fall back on)
▸ Termination + Agreement + Validity = Consensus
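The three properties can be turned into a small checker over a snapshot of the system. This is a sketch under assumptions: the function name `check_consensus` and the dictionary layout are made up for the example, and since termination is about "eventually", a snapshot can only tell whether everyone has decided so far.

```python
def check_consensus(proposed, decided):
    """Check the three properties on a snapshot.

    proposed: node name -> value that node proposed
    decided:  node name -> value that node decided, or None if still undecided
    """
    values = [v for v in decided.values() if v is not None]
    return {
        # Termination: every node has reached a decision (in this snapshot).
        "termination": all(v is not None for v in decided.values()),
        # Agreement: no two nodes decided on different values.
        "agreement": len(set(values)) <= 1,
        # Validity: every decided value was proposed by some node.
        "validity": all(v in proposed.values() for v in values),
    }


proposed = {"n1": 7, "n2": 5, "n3": 7}
good = check_consensus(proposed, {"n1": 7, "n2": 7, "n3": 7})
bad = check_consensus(proposed, {"n1": 7, "n2": 5, "n3": None})
print(good)  # all three properties hold
print(bad)   # termination and agreement fail, validity still holds
```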
CONSENSUS PROTOCOLS
▸ 2 Phase Commit
▸ 3 Phase Commit
▸ Basic Paxos
▸ The Future…
2 PHASE COMMIT
▸ Simplest consensus protocol
▸ Phase 1 - Proposal
▸ A node (called coordinator) proposes a value to all other nodes, then gathers votes
▸ Phase 2 - Commit-or-abort
▸ The coordinator sends:
▸ Commit if all nodes voted yes. All nodes commit the new value
▸ Abort if 1 or more nodes voted no. All nodes abort the value
[Diagram 1: a coordinator (COOR.) and four nodes. Coordinator proposes a value.]
[Diagram 2: All nodes vote yes or no.]
[Diagram 3: Coordinator sends commit if all nodes voted yes; sends abort otherwise. All nodes now update themselves to contain the proposed value, or all nodes abort.]
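The two phases above can be sketched as a single in-process function. It is a minimal illustration that assumes no crashes and no lost messages; the `Node` class, its `will_vote_yes` flag and the function `two_phase_commit` are hypothetical names invented for this example.

```python
class Node:
    """Hypothetical participant: votes, then applies whatever the coordinator decides."""

    def __init__(self, name, will_vote_yes=True):
        self.name = name
        self.will_vote_yes = will_vote_yes
        self.pending = None   # value proposed but not yet committed
        self.value = None     # last committed value

    def propose(self, value):
        """Phase 1: remember the proposal and return this node's vote."""
        self.pending = value
        return self.will_vote_yes

    def commit(self):
        """Phase 2, commit branch: the pending value becomes the real value."""
        self.value, self.pending = self.pending, None

    def abort(self):
        """Phase 2, abort branch: forget the pending value."""
        self.pending = None


def two_phase_commit(nodes, value):
    """Play the coordinator: gather votes, then send commit or abort to everyone."""
    votes = [node.propose(value) for node in nodes]   # Phase 1 - Proposal
    if all(votes):                                    # Phase 2 - Commit-or-abort
        for node in nodes:
            node.commit()
        return "commit"
    for node in nodes:
        node.abort()
    return "abort"


happy = [Node(f"n{i}") for i in range(4)]
happy_outcome = two_phase_commit(happy, 7)
mixed = [Node("n0"), Node("n1", will_vote_yes=False)]
mixed_outcome = two_phase_commit(mixed, 9)
print(happy_outcome, [n.value for n in happy])   # commit [7, 7, 7, 7]
print(mixed_outcome, [n.value for n in mixed])   # abort [None, None]
```

A single "no" vote is enough to abort for everyone, which is what keeps the nodes in agreement.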
2 PHASE COMMIT
▸ Agreement - every node accepts the value from the coordinator at phase 2 = YES
▸ Validity - commit/abort originated from the coordinator = YES
▸ Termination - no loops in the steps, doesn’t run forever = YES
▸ Therefore, 2 phase commit fulfils the requirements of a consensus protocol
2 PHASE COMMIT
▸ Blocking failure when coordinator fails before sending proposal to all nodes
[Diagram 1: a coordinator (COOR.) and three nodes. Coordinator proposes a value.]
[Diagram 2: a node receives the proposed value, votes yes, and is now waiting for commit.]
[Diagram 3: Coordinator crashes… and a different coordinator (NEW COOR.) comes in to propose a different value.]
[Diagram 4: Node cannot accept new proposal because waiting on commit. Cannot abort because first Coordinator might recover.]
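The blocked node can be modelled as a tiny state machine. This is a sketch under assumptions: the `Participant` class, its state names and the string replies are invented for the example and do not come from any particular 2PC implementation.

```python
class Participant:
    """Hypothetical 2PC participant that remembers which coordinator it voted for."""

    def __init__(self):
        self.state = "IDLE"       # IDLE or WAITING_FOR_COMMIT
        self.coordinator = None   # who this node promised its yes vote to
        self.pending = None       # value this node voted yes on
        self.value = None         # last committed value

    def on_propose(self, coordinator, value):
        if self.state == "WAITING_FOR_COMMIT":
            # Already voted yes for someone else: cannot vote again, cannot abort alone.
            return "blocked"
        self.state = "WAITING_FOR_COMMIT"
        self.coordinator, self.pending = coordinator, value
        return "yes"

    def on_commit(self, coordinator):
        """Only the coordinator that collected the vote may finish the round."""
        if self.state == "WAITING_FOR_COMMIT" and coordinator == self.coordinator:
            self.value, self.pending = self.pending, None
            self.state, self.coordinator = "IDLE", None
            return True
        return False


node = Participant()
first = node.on_propose("COOR.", 7)        # votes yes, then COOR. crashes
second = node.on_propose("NEW COOR.", 5)   # the new coordinator is turned away
print(first, second, node.state)           # yes blocked WAITING_FOR_COMMIT
```

The node stays in `WAITING_FOR_COMMIT` until the original coordinator returns, which is exactly the blocking behaviour described above.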
2 PHASE COMMIT
▸ Guarantees safety, but not liveness
▸ Safety = all nodes agree on a value proposed by a node
▸ Liveness = should still be able to function when some nodes crash
3 PHASE COMMIT
▸ Similar to 2 Phase Commit, with an extra phase (duh)
▸ Phase 1 - Proposal - same as 2PC
▸ Phase 2 - Pre-approve - similar to 2PC commit-or-abort, but nodes reply with ACK instead of committing
▸ Phase 3 - Do Commit - now the nodes commit
▸ Tolerant of node crashes, but not network partitions
▸ Won’t cover in detail
PAXOS
▸ Presented by Leslie Lamport in The Part-Time Parliament (written around 1989, published in 1998)
▸ Named after a fictional legislature on the Greek island of Paxos
▸ Remains as:
▸ The hardest to understand in theory
▸ The hardest to implement
▸ The closest we get to reaching ideal consensus
PAXOS
▸ Used in:
▸ Apache ZooKeeper (via Zab, a Paxos-like protocol)
▸ Google Chubby (BigTable)
▸ Google Spanner
▸ Apache Mesos
▸ Apache Cassandra
▸ etc
PAXOS
▸ Components:
▸ Proposers
▸ Propose values to other nodes
▸ Acceptors
▸ Respond to proposers with votes
▸ Commit the chosen value & decision state
▸ A server can host both 1 Proposer & 1 Acceptor
PAXOS
▸ Uses a two-phase approach:
▸ Broadcast Prepare
▸ Find out if there’s already a chosen value
▸ Block older proposals that have yet to be completed
▸ Broadcast Accept
▸ Ask acceptors to accept a value
PAXOS
▸ Prepare(n)
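▸ n = proposal number, built as [max++]~[server id]: the highest round seen so far plus one, joined with the server id so no two proposers share a number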
▸ Return(p, v)
▸ p = proposal number
▸ v = current accepted value (if any)
▸ Accept(p, v)
▸ p = proposal number
▸ v = value to be accepted
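One way to build such proposal numbers is a (round, server id) pair. This is just a sketch: the tuple representation and the name `next_proposal_number` are assumptions for illustration, since the slide only says that a counter and the server id are combined.

```python
def next_proposal_number(max_round_seen, server_id):
    """Return (round, server_id). Python compares tuples round first, then server id."""
    return (max_round_seen + 1, server_id)


n1 = next_proposal_number(0, 1)       # server 1, first attempt
n2 = next_proposal_number(0, 2)       # server 2 picks the same round
n3 = next_proposal_number(n2[0], 1)   # server 1 retries after seeing round 1
print(n1, n2, n3)                     # (1, 1) (1, 2) (2, 1)
```

Two servers can land on the same round, but the server id breaks the tie, so the numbers stay unique and totally ordered.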
PAXOS
▸ Proposal Phase
▸ Proposer generates a proposal number p
▸ Proposer broadcasts p and a value v
▸ Acceptor checks if p is higher than its min-p, and updates min-p if so
▸ Acceptor replies with any acc-p and acc-v it already holds
▸ Proposer waits for replies from a majority
▸ If any reply carries an acc-p, proposer replaces v with the acc-v of the highest acc-p
PAXOS
▸ Accept Phase
▸ Proposer sends p and v to all acceptors
▸ Acceptors check if p is lower than min-p, and ignore it if so. Otherwise, acc-p = min-p = p and acc-v = v
▸ Acceptors reply accepted or rejected
▸ If majority accepted, terminate with v. Otherwise, restart Proposal Phase with new p
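The acceptor side of both phases fits in one small class. This sketch follows the rules as the slides state them (min-p, acc-p, acc-v); the class name `Acceptor` and the use of plain integers as proposal numbers are assumptions, and a real acceptor would also persist its state and talk over a network.

```python
class Acceptor:
    """Acceptor state as described on the slides: min-p, acc-p and acc-v."""

    def __init__(self):
        self.min_p = 0       # proposals numbered below this are ignored
        self.acc_p = None    # number of the proposal accepted so far, if any
        self.acc_v = None    # value of the proposal accepted so far, if any

    def prepare(self, p):
        """Proposal phase: raise min-p if p is higher, then report acc-p and acc-v."""
        if p > self.min_p:
            self.min_p = p
        return self.acc_p, self.acc_v

    def accept(self, p, v):
        """Accept phase: reject proposals below min-p, otherwise accept and record."""
        if p < self.min_p:
            return False
        self.min_p = self.acc_p = p
        self.acc_v = v
        return True


a = Acceptor()
reply_1 = a.prepare(1)       # nothing accepted yet
accepted_1 = a.accept(1, 7)  # p1 is not below min-p, so 7 is accepted
reply_2 = a.prepare(2)       # a later proposer learns about (1, 7)
stale = a.accept(1, 9)       # p1 is now below min-p 2, so it is rejected
print(reply_1, accepted_1, reply_2, stale)   # (None, None) True (1, 7) False
```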
[Diagram sequence: one proposer P and acceptors A1, A2, A3, each starting with MIN-P 0, ACC-P -, ACC-V -.]
[1. v7 is proposed with p1. As the proposal (P1, 7) reaches each acceptor in turn, that acceptor sets MIN-P to 1.]
[2. Two acceptors reply with ACC-P - and ACC-V -.]
[3. Has majority! Since acc-p and acc-v are both null, we know that we are the only proposers in the network so far.]
[4. Now, we send out p and v in the accept phase.]
[5. Acceptors update acc-p and acc-v: every acceptor now holds MIN-P 1, ACC-P 1, ACC-V 7.]
[6. Two acceptors reply "Accept!". Oh look, we have majority! v7 is the terminated value then!]
[7. A late "Accept? :(" arrives after the majority. P: "Shuddup, nobody loves you".]
PAXOS - MULTI PROPOSERS
▸ What if there were multiple proposers?
▸ Brace yourself, It’s Complicated™ (not really)
[Diagram sequence: proposers P1 and P2 with acceptors A1, A2, A3.]
[1. P1 proposes v7 with p1; every acceptor sets MIN-P to 1, and a majority replies with ACC-P - and ACC-V -.]
[2. P1 completes its accept phase: every acceptor now holds MIN-P 1, ACC-P 1, ACC-V 7.]
[3. v5 is proposed with p2.]
[4. An acceptor replies to P2 with ACC-P 1, ACC-V 7.]
[5. value of p2 is changed to 7.]
[6. Broadcast accept phase with p2 and v7; every acceptor now shows MIN-P 2, ACC-P 1, ACC-V 7.]
Both proposers succeed! No blocking here.
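The two-proposer walkthrough can be replayed in code. This is a simplified sketch, not a production Paxos: everything runs in one process, messages are never lost, the proposer hears from all acceptors instead of just a majority, and the names `Acceptor` and `run_round` are made up for the example.

```python
class Acceptor:
    """Acceptor state from the slides: min-p, acc-p and acc-v."""

    def __init__(self):
        self.min_p, self.acc_p, self.acc_v = 0, None, None

    def prepare(self, p):
        if p > self.min_p:
            self.min_p = p
        return self.acc_p, self.acc_v

    def accept(self, p, v):
        if p < self.min_p:
            return False
        self.min_p = self.acc_p = p
        self.acc_v = v
        return True


def run_round(p, v, acceptors):
    """One proposer round: returns the value accepted by a majority, or None."""
    majority = len(acceptors) // 2 + 1
    replies = [a.prepare(p) for a in acceptors]            # Proposal Phase
    accepted = [(acc_p, acc_v) for acc_p, acc_v in replies if acc_p is not None]
    if accepted:
        # Someone already accepted a value: adopt the one with the highest acc-p.
        v = max(accepted, key=lambda reply: reply[0])[1]
    votes = sum(a.accept(p, v) for a in acceptors)         # Accept Phase
    return v if votes >= majority else None


acceptors = [Acceptor() for _ in range(3)]
first = run_round(1, 7, acceptors)    # P1 proposes v7 with p1
second = run_round(2, 5, acceptors)   # P2 proposes v5 with p2
print(first, second)                  # 7 7
```

P2 wanted 5 but finishes with 7, because the Proposal Phase told it that 7 had already been accepted.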
BASIC PAXOS
▸ This is BASIC Paxos: 2PC with a twist (Quorum)
▸ It has vulnerabilities!
▸ Best of 2PC (safety), with stronger liveness: a majority of acceptors is enough to make progress
▸ Most consensus algorithms are variants of Paxos
▸ Forms the basis of Distributed Computing research
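The Quorum twist works because any two majorities of the same group share at least one member, so a later proposer always meets an acceptor that saw the earlier value. A brute-force check for five acceptors might look like this; the helper name `majorities` and the node names are assumptions for the example.

```python
from itertools import combinations


def majorities(nodes):
    """All subsets of `nodes` that contain more than half of them."""
    need = len(nodes) // 2 + 1
    return [set(group)
            for size in range(need, len(nodes) + 1)
            for group in combinations(nodes, size)]


nodes = ["A1", "A2", "A3", "A4", "A5"]
quorums = majorities(nodes)
# Every pair of majorities has a non-empty intersection.
always_overlap = all(q1 & q2 for q1 in quorums for q2 in quorums)
print(len(quorums), always_overlap)   # 16 True
```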
CLOSING…
▸ Basic Paxos is not Byzantine Fault Tolerant
▸ It is a challenge to create a consensus protocol (termination, agreement, validity) that is Byzantine Fault Tolerant
▸ Nakamoto Consensus (aka Bitcoin consensus) skirts around Byzantine problems by imposing proof-of-work
▸ Raft is a separate consensus algorithm, designed as an easier-to-understand alternative to Paxos, used in etcd and Consul
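To give a feel for proof-of-work, here is a toy hash puzzle. It is a sketch only and not Bitcoin's actual scheme (which double-hashes a block header against a numeric target); the function name `proof_of_work`, the string input and the "leading zero hex digits" rule are assumptions chosen to keep the example short.

```python
import hashlib


def proof_of_work(data, difficulty):
    """Find a nonce so that sha256(data + nonce) starts with `difficulty` zero hex digits."""
    prefix = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{data}{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce, digest
        nonce += 1


nonce, digest = proof_of_work("block 1", 3)
print(nonce, digest)
```

Finding the nonce takes thousands of hash attempts on average at this difficulty, while checking it takes one, and that asymmetry is what makes misbehaving expensive.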
PAXOS - BEST GEEKY PICKUP LINE NEVER
Ruben Tan