InfluxDB and the Raft consensus protocol
Philip O'Toole, Director of Engineering (SF)
InfluxDB San Francisco Meetup, December 2015
Agenda
● What is InfluxDB and why is it distributed?
● What is the problem with distribution?
● How can it be solved?
● InfluxDB and distributed consensus
What is InfluxDB?
● Open-source time-series database
● Distributed by design
● Written in Go
Why distribute InfluxDB?
● A distributed database provides reliability
– Your data is located in multiple places
● A distributed database offers scalability
– For both write and query load
What is the problem?
It's easy to set the value of a single node reliably
[Diagram: a client sets X=5 on a single node and receives an ACK]
What if the value should be replicated?
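[Diagram: the client sets X=5 on Node A, which replicates X=5 to Node B; an ACK is returned]
Multi-node system - receiving node replicates data
What if nodes can't communicate?
[Diagram: the link between Node A and Node B is down (X); Node A holds X=5 but Node B's value is unknown (??)]
This failure is known as a partition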
What if the value should be replicated?
Client replication: multi-node system - each node must support changing state
[Diagram: the client sets X=5 on both Node A and Node B and receives an ACK]
What if nodes can't communicate?
Multi-node system - each node receives mutually-exclusive operation
[Diagram: Node A and Node B are partitioned (X); Client 1 and Client 2 write conflicting values (X=5 and X=6) to different nodes, and both receive an ACK]
Communication restored
[Diagram: Node A and Node B hold different values (X=5 and X=6) while a client attempts a read]
Which value should the client read?
Real world scenario
[Diagram: Node A and Node B are partitioned (X); one client modifies a database while the other backs up and deletes it, and both receive an ACK]
This problem is known as Distributed Consensus
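The partition scenario above can be reproduced in a few lines. This is a minimal sketch, not InfluxDB code: the `Node` class and the `replicate` helper are hypothetical names invented for illustration, and each node simply accepts whatever write it receives.

```python
class Node:
    """A single node that accepts any write it receives (no consensus)."""

    def __init__(self, name):
        self.name = name
        self.x = None

    def set_x(self, value):
        self.x = value
        return "ACK"


def replicate(source, target, partitioned):
    """Copy the source's value to the target unless the link is down."""
    if partitioned:
        return False
    target.x = source.x
    return True


node_a, node_b = Node("A"), Node("B")

# During the partition each client is acknowledged by a different node,
# but neither write reaches the other side.
ack_1 = node_a.set_x(5)
replicated_1 = replicate(node_a, node_b, partitioned=True)
ack_2 = node_b.set_x(6)
replicated_2 = replicate(node_b, node_a, partitioned=True)

# Communication is restored, but the nodes now disagree about X.
values = {node_a.name: node_a.x, node_b.name: node_b.x}
print(values)
```

Both clients were told their write succeeded, yet no single value of X is correct for the system as a whole. A consensus protocol exists to prevent this outcome.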
What is the solution?
● Paxos
– Famously difficult to understand
● ZAB (ZooKeeper Atomic Broadcast)
– Created by Yahoo! Research
● Raft
– Diego Ongaro and John Ousterhout at Stanford
Secret Lives of Data
By Ben Johnson @benbjohnson
What is the cost?
● Latency in decision making
● Low throughput
● System rigidity
● Consensus under load is tricky
InfluxDB and Raft
What follows is subject to change
Accurate as of 0.9.6
What is stored via consensus?
● Cluster membership
● Databases
● Retention Policies
– Enforcement run by leader
● Users
● Continuous Queries
– Queries initiated by leader
● Shard Metadata
– Locations on cluster (assignment performed by leader)
– Start and end times
[Diagram: a three-node Raft cluster with one Leader and two Followers]
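To make the idea of "stored via consensus" concrete, here is a toy model of a metadata change being committed only when a majority of nodes can be reached. It is a sketch under heavy simplification: `MetaStore` and `ToyCluster` are hypothetical names, there is no leader election or networking, and it is not how the InfluxDB meta package or hashicorp/raft is implemented.

```python
class MetaStore:
    """Toy state machine holding cluster metadata (hypothetical shape)."""

    def __init__(self):
        self.databases = set()

    def apply(self, command):
        op, name = command
        if op == "create_database":
            self.databases.add(name)
        elif op == "drop_database":
            self.databases.discard(name)


class ToyCluster:
    def __init__(self, size):
        self.stores = [MetaStore() for _ in range(size)]
        self.log = []  # committed commands, in commit order

    def propose(self, command, reachable):
        """Commit only if a majority of nodes (leader included) is reachable."""
        if reachable <= len(self.stores) // 2:
            return False
        self.log.append(command)
        # Simplification: every store applies the command immediately.
        # A real follower that was unreachable would catch up from the log.
        for store in self.stores:
            store.apply(command)
        return True


cluster = ToyCluster(3)
created = cluster.propose(("create_database", "foo"), reachable=2)
dropped = cluster.propose(("drop_database", "foo"), reachable=1)
print(created, dropped, cluster.log)
```

With two of three nodes reachable the database is created everywhere; with only one reachable the drop is refused and the log is left unchanged.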
What is not stored via consensus?
● Time-series data
– And time-series indexes
● Time-series data schema
System Implications
Data Ingestion
● Write traffic can stall if a shard does not exist
– Because the cluster must reach consensus
– This delay is usually very short
– Shard pre-creation mitigates this issue
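The effect of shard pre-creation can be illustrated with a small model. This is a sketch only: `ShardMetadata`, `SHARD_DURATION`, and the `consensus_rounds` counter are hypothetical names and values chosen for illustration, and the counter merely stands in for a Raft round trip.

```python
SHARD_DURATION = 3600  # seconds per shard; an assumed value for illustration


class ShardMetadata:
    def __init__(self):
        self.shards = {}  # shard start time -> (start, end)
        self.consensus_rounds = 0

    def _start_for(self, timestamp):
        return timestamp - (timestamp % SHARD_DURATION)

    def create_shard(self, timestamp):
        start = self._start_for(timestamp)
        if start not in self.shards:
            self.consensus_rounds += 1  # stands in for a Raft round trip
            self.shards[start] = (start, start + SHARD_DURATION)
        return self.shards[start]

    def write(self, timestamp):
        """Return True if the write had to wait for shard creation."""
        stalled = self._start_for(timestamp) not in self.shards
        self.create_shard(timestamp)
        return stalled

    def precreate(self, now):
        """Create the shard for the next time window ahead of need."""
        self.create_shard(now + SHARD_DURATION)


meta = ShardMetadata()
first_stalled = meta.write(100)    # no shard covers t=100 yet
meta.precreate(100)                # background task creates the next shard
next_stalled = meta.write(3700)    # the shard for t=3700 already exists
print(first_stalled, next_stalled, meta.consensus_rounds)
```

The consensus cost is still paid once per shard, but pre-creation moves it out of the write path.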
Schema varies by shard and time
Shard A: cpu,host=a load=100
Shard B: cpu,host=b load=99.4
[Diagram labels: float, integer. The field load has a different type in each shard]
What is the type of the field load on the measurement cpu? It depends!
Each shard is its own local authority
No global schema exists
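The consequence of per-shard schema can be sketched as follows. This is an illustrative model, not InfluxDB's storage engine: the `Shard` class is hypothetical, Python's own `int` and `float` stand in for the database's field types, and the rule that a shard rejects a conflicting type after the first write is an assumption of the sketch.

```python
class Shard:
    """Each shard records field types independently of every other shard."""

    def __init__(self, name):
        self.name = name
        self.field_types = {}  # (measurement, field) -> type name

    def write(self, measurement, field, value):
        key = (measurement, field)
        kind = type(value).__name__
        # The first write to a field fixes its type within this shard only.
        known = self.field_types.setdefault(key, kind)
        if known != kind:
            raise TypeError(
                f"shard {self.name}: {field} is {known}, got {kind}"
            )


shard_a, shard_b = Shard("A"), Shard("B")
shard_a.write("cpu", "load", 100)
shard_b.write("cpu", "load", 99.4)

types = {s.name: s.field_types[("cpu", "load")] for s in (shard_a, shard_b)}
print(types)
```

Because no shard consults the others, the same field ends up with two types, and the answer to "what is the type of load?" depends on which shard is asked.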
> CREATE DATABASE foo
> DROP DATABASE fooer
ERR: database not found: fooer
> USE foo
Using database foo
> INSERT cpu value=100
> SELECT value FROM cpu
name: cpu
time value
2015-12-12T05:39:11.51832836Z 100
> SELECT valuexxx FROM cpu
> SELECT value FROM cpuxxx
>
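Consensus regarding existing databases: dropping the nonexistent database fooer returns an error
No consensus, i.e. no central authority, for schema: selecting the nonexistent field valuexxx or measurement cpuxxx returns no error
These are unexpected error-free cases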
Fault Tolerance
● 3-node Raft clusters can tolerate failure of 1 node
● 5-node Raft clusters can tolerate failure of 2 nodes
– 0.9.6 clusters restricted to 3 Raft nodes
● Write and query traffic may tolerate further failure
– Until the data held in consensus must be changed
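The failure-tolerance figures above follow from Raft's majority rule: a cluster makes progress only while more than half of its nodes are available. A short sketch of the arithmetic, with hypothetical function names:

```python
def quorum(nodes):
    """Smallest number of nodes that forms a majority."""
    return nodes // 2 + 1


def tolerated_failures(nodes):
    """Nodes that can fail while a majority remains available."""
    return nodes - quorum(nodes)


for size in (3, 4, 5):
    print(size, quorum(size), tolerated_failures(size))
```

A 4-node cluster tolerates no more failures than a 3-node cluster, which is why Raft clusters are normally sized with an odd number of nodes.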
Previous Approaches
v0.9-rc20
● All data, including write load, went through Raft
– A streaming model (AppendEntries decomposed into two calls)
– Potentially high performance
– Global schema, which meant improved user experience
– Easier to reason about when all data is in consensus
● In practice it was inefficient and complex to implement correctly
– Approach was abandoned by 0.9.0
Future Designs
0.9.6 Design
[Diagram: one Raft Leader and two Followers within a cluster of Data nodes]
● Each node may be data, Raft, or both
0.10.0 Design
[Diagram: a Raft Leader and two Followers, separate from the Data nodes]
● Distinct Raft sub-system
● Data-only nodes
A lightly loaded consensus system may result in more robust clusters.
References
● https://groups.google.com/forum/#!msg/raft-dev/6GK22lL_Y1c/vSR3_XrRVpMJ
● http://thesecretlivesofdata.com/raft/
● https://raft.github.io/
● https://github.com/hashicorp/raft
● https://github.com/otoolep/hraftd
● https://github.com/influxdb/influxdb/tree/0.9.6/meta