TRANSCRIPT
THE RAFT CONSENSUS ALGORITHM
Diego Ongaro and John Ousterhout
STANFORD UNIVERSITY
APRIL 2015
raftconsensus.github.io
MOTIVATION
Goal: a shared hash table (state machine)
Put it on a single machine attached to the network
Pros: easy, consistent
Cons: prone to failure
With Raft, keep consistency yet deal with failures
WHAT IS CONSENSUS?
Agreement on shared state (single system image)
Recovers from server failures autonomously
Minority of servers fail: no problem
Majority fail: lose availability, retain consistency
Key to building consistent storage systems
REPLICATED STATE MACHINES
[Figure: clients send commands to servers; on each server a consensus module appends the commands (x←3, y←2, x←1, z←6) to a log, and a state machine applies the log in order, so every server reaches the same state: x=1, y=2, z=6.]
Replicated log ⇒ replicated state machine
All servers execute same commands in same order
Consensus module ensures proper log replication
System makes progress as long as any majority of servers up
Failure model: fail-stop (not Byzantine), delayed/lost messages
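The "same commands in same order" property can be sketched in a few lines. This is an illustrative toy, not code from the talk: applying the slide's log of assignments deterministically yields the same state on any server.

```python
# Toy replicated state machine: every server applies the same log of
# commands in the same order, so all servers end with identical state.

def apply_log(log):
    """Apply a list of (key, value) assignment commands in order."""
    state = {}
    for key, value in log:
        state[key] = value
    return state

# The log from the slide: x<-3, y<-2, x<-1, z<-6
log = [("x", 3), ("y", 2), ("x", 1), ("z", 6)]

print(apply_log(log))  # {'x': 1, 'y': 2, 'z': 6}
```

Determinism is the whole point: consensus only has to agree on the log, and identical state machines fed identical logs agree on everything else for free.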
HOW IS CONSENSUS USED?
Top-level system configuration
[Figure: a small replicated state machine (S1, S2, S3) with one leader and two standbys coordinates a large set of worker nodes N N N N...]
Replicate entire database state
[Figure: a replicated state machine (S1, S2, S3) in which each server holds a full copy of the database, serving nodes N N N N...]
PAXOS PROTOCOL
Leslie Lamport, 1989
Nearly synonymous with consensus

“The dirty little secret of the NSDI community is that at most five people really, truly understand every part of Paxos ;-).” —NSDI reviewer

“There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system...the final system will be based on an unproven protocol.” —Chubby authors
RAFT'S DESIGN FOR UNDERSTANDABILITY
We wanted an algorithm optimized for building real systems
Must be correct, complete, and perform well
Must also be understandable
“What would be easier to understand or explain?”
Fundamentally different decomposition than Paxos
Less complexity in state space
Less mechanism
RAFT USER STUDY
[Figure: scatter plot of Raft grade vs. Paxos grade (0–60 each), one point per participant, grouped by lecture order: “Raft then Paxos” vs. “Paxos then Raft”.]
[Figure: bar chart, number of participants (0–20) asked which algorithm would be easier to implement or to explain; answer choices: Paxos much easier, Paxos somewhat easier, roughly equal, Raft somewhat easier, Raft much easier.]
RAFT OVERVIEW
1. Leader election
   Select one of the servers to act as cluster leader
   Detect crashes, choose new leader
2. Log replication (normal operation)
   Leader takes commands from clients, appends to its log
   Leader replicates its log to other servers (overwriting inconsistencies)
3. Safety
   Only a server with an up-to-date log can become leader
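Leader election rests on majority voting: since two disjoint majorities cannot exist, at most one server can win a given term. A minimal sketch of that rule (function name is illustrative, not from any real implementation):

```python
# Raft's majority-vote rule: a candidate becomes leader for a term only
# if it collects votes from a strict majority of the full cluster.

def wins_election(votes_granted, cluster_size):
    """True if votes_granted is a strict majority of cluster_size."""
    return votes_granted > cluster_size // 2

# In a 5-server cluster, 3 votes is a majority; 2 is not.
print(wins_election(3, 5))  # True
print(wins_election(2, 5))  # False
```

Because each server grants at most one vote per term, two candidates can never both reach a majority in the same term.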
RAFTSCOPE VISUALIZATION
CORE RAFT REVIEW
1. Leader election
   Heartbeats and timeouts to detect crashes
   Randomized timeouts to avoid split votes
   Majority voting to guarantee at most one leader per term
2. Log replication (normal operation)
   Leader takes commands from clients, appends to its log
   Leader replicates its log to other servers (overwriting inconsistencies)
   Built-in consistency check simplifies how logs may differ
3. Safety
   Only elect leaders with all committed entries in their logs
   New leader defers committing entries from prior terms
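The "built-in consistency check" in step 2 can be sketched as follows. This is an illustrative simplification, not the talk's code: a follower accepts new entries only if its log matches the leader's at the entry just before them, and on a match any conflicting suffix is overwritten, so the follower's log converges to the leader's.

```python
# Sketch of the AppendEntries consistency check. A log is a list of
# (term, command) pairs; prev_index is 1-based, with 0 meaning "empty log".

def append_entries(log, prev_index, prev_term, entries):
    """Return (new_log, accepted)."""
    # Consistency check: the follower must hold the leader's previous
    # entry with the same term, or it rejects the request.
    if prev_index > 0:
        if len(log) < prev_index or log[prev_index - 1][0] != prev_term:
            return log, False  # leader retries with an earlier prev_index
    # Truncate any conflicting suffix, then append the leader's entries.
    return log[:prev_index] + entries, True

follower = [(1, "x<-3"), (1, "y<-2"), (2, "z<-9")]   # diverged at index 3
log, ok = append_entries(follower, 2, 1, [(3, "x<-1"), (3, "z<-6")])
print(ok, log)  # True [(1, 'x<-3'), (1, 'y<-2'), (3, 'x<-1'), (3, 'z<-6')]
```

The check gives an inductive guarantee: if the follower matches the leader at prev_index, it matches on everything before it too, so overwriting from there on is safe.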
RANDOMIZED TIMEOUTS
How much randomization is needed to avoid split votes?
[Figure: cumulative percent of elections vs. time without leader (ms, log scale 100–100000), one curve per election timeout range: 150–150ms, 150–151ms, 150–155ms, 150–175ms, 150–200ms, 150–300ms.]
Conservatively, use random range ~10x network latency
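A minimal sketch of that rule of thumb, with the 150–300ms range taken from the slide's plot (parameter names are illustrative): each server samples its own timeout, so one server usually times out first and wins before the others even start.

```python
import random

# Randomized election timeout: pick uniformly from a range whose width
# is large relative to network latency, so servers rarely time out
# simultaneously and split the vote.

def election_timeout(base_ms=150, spread_ms=150):
    """Pick a timeout uniformly in [base_ms, base_ms + spread_ms)."""
    return base_ms + random.random() * spread_ms

t = election_timeout()
print(150 <= t < 300)  # True
```

With no spread (spread_ms=0), all servers time out together and split votes repeatedly, which is the pathological 150–150ms curve on the plot.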
RAFT IMPLEMENTATIONS

Name             Primary Authors                                                          Language    License
etcd/raft        Blake Mizerany, Xiang Li and Yicheng Qin (CoreOS)                        Go          Apache2.0
go-raft          Ben Johnson (Sky) and Xiang Li (CMU, CoreOS)                             Go          MIT
hashicorp/raft   Armon Dadgar (hashicorp)                                                 Go          MPL-2.0
copycat          Jordan Halterman                                                         Java        Apache2
LogCabin         Diego Ongaro (Stanford, Scale Computing)                                 C++         ISC
akka-raft        Konrad Malawski                                                          Scala       Apache2
kanaka/raft.js   Joel Martin                                                              Javascript  MPL-2.0
rafter           Andrew Stone (Basho)                                                     Erlang      Apache2
OpenDaylight     Moiz Raja, Kamal Rameshan, Robert Varga (Cisco), Tom Pantelis (Brocade)  Java        Eclipse
liferaft         Arnout Kazemier                                                          Javascript  MIT
skiff            Pedro Teixeira                                                           Javascript  ISC
ckite            Pablo Medina                                                             Scala       Apache2
willemt/raft     Willem-Hendrik Thiart                                                    C           BSD

Copied from the Raft website, probably stale.
LOGCABIN
C++, permissive license
Started as research platform for Raft at Stanford
Working with Scale Computing to make production-ready
1.0 release later this month
Rolling upgrades for 1.0
Looking for more users!
github.com/logcabin
CONCLUSION
Consensus widely regarded as difficult
Raft designed for understandability
Easier to teach in classrooms
Better foundation for building practical systems
Pieces needed for a practical system:
   Cluster membership changes (simpler in dissertation)
   Log compaction (expanded tech report/dissertation)
   Client interaction (expanded tech report/dissertation)
   Evaluation (dissertation: understandability, correctness, leader election & replication performance)
QUESTIONS
raftconsensus.github.io
Mailing list: raft-dev