InfluxDB and the Raft consensus protocol Philip O'Toole, Director of Engineering (SF) InfluxDB San Francisco Meetup, December 2015



Agenda

● What is InfluxDB and why is it distributed?

● What is the problem with distribution?

● How can it be solved?

● InfluxDB and distributed consensus


What is InfluxDB?

● Open-source time-series database

● Distributed by design

● Written in Go

Why distribute InfluxDB?

● A distributed database provides reliability

– Your data is located in multiple places

● A distributed database offers scalability

– For both write and query load


What is the problem?

It's easy to set the value of a single node reliably

[Diagram: a client sets X=5 on a single node and receives an ACK]

What if the value should be replicated?

Multi-node system - receiving node replicates data

[Diagram: a client sets X=5 on one node; that node replicates X=5 to the other node (Node A, Node B) and the client receives an ACK]

What if nodes can't communicate?

This failure is known as a partition

[Diagram: a client sets X=5 on one node and receives an ACK; the link between Node A and Node B is broken (X), so the value on the other node is unknown (??)]

What if the value should be replicated?

Client replication: multi-node system - each node must support changing state

[Diagram: the client itself writes X=5 to both Node A and Node B]

What if nodes can't communicate?

Multi-node system - each node receives mutually-exclusive operation

[Diagram: Client 1 and Client 2 each write a different value (X=6 and X=5) to a different node (Node A, Node B); the link between the nodes is broken (X); both writes receive an ACK]

Communication restored

[Diagram: the link between Node A and Node B is restored; one node holds X=6 and the other holds X=5; a client reads]

What value should be read for the node?
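The partition scenario on the preceding slides can be sketched in a few lines of Python. This is a toy model, not InfluxDB code: the `Node` class, the `link_up` flag, and the choice of which client writes which value to which node are all assumptions made for illustration.

```python
class Node:
    """A single node holding one value, with naive peer replication (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.value = None
        self.peer = None       # the other node in the two-node system
        self.link_up = True    # False models a network partition

    def write(self, value):
        """Accept a client write, replicate if the link is up, and always ACK."""
        self.value = value
        if self.link_up and self.peer is not None:
            self.peer.value = value
        return "ACK"


def connect(a, b):
    """Make the two nodes peers of each other."""
    a.peer, b.peer = b, a


def set_link(a, b, up):
    """Break or restore the link between the two nodes."""
    a.link_up = b.link_up = up


node_a, node_b = Node("A"), Node("B")
connect(node_a, node_b)

node_a.write(5)                      # replicated: both nodes hold 5
set_link(node_a, node_b, up=False)   # partition
ack_1 = node_a.write(6)              # Client 1 (assumed to reach Node A)
ack_2 = node_b.write(5)              # Client 2 (assumed to reach Node B)
set_link(node_a, node_b, up=True)    # communication restored

diverged = node_a.value != node_b.value
print(node_a.value, node_b.value, diverged)
```

Both clients were told their write succeeded, yet the nodes disagree once the link returns. Nothing in this naive scheme says which value is correct.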

Real world scenario

[Diagram: one client modifies a database while the other backs up and deletes the database; Client 1 and Client 2 each talk to a different node (Node A, Node B); the link between the nodes is broken (X); both operations receive an ACK]


This problem is known as Distributed Consensus

What is the solution?

● Paxos

– Famously difficult to understand

● ZAB (ZooKeeper Atomic Broadcast)

– Created by Yahoo! Research

● Raft

– Diego Ongaro and John Ousterhout at Stanford

Secret Lives of Data

By Ben Johnson @benbjohnson


What is the cost?

● Latency in decision making

● Low throughput

● System rigidity

● Consensus under load is tricky

InfluxDB and Raft

What follows is subject to change

Accurate as of 0.9.6

What is stored via consensus?

● Cluster membership

● Databases

● Retention Policies

– Enforcement run by leader

● Users

● Continuous Queries

– Queries initiated by leader

● Shard Metadata

– Locations on cluster (assignment performed by leader)

– Start and end times

[Diagram: one Leader and two Followers]

What is not stored via consensus?

● Time-series data

– And time-series indexes

● Time-series data schema


System Implications

Data Ingestion

● Write traffic can stall if a shard does not exist

– Because the cluster must reach consensus

– This delay is usually very short

– Shard pre-creation mitigates this issue
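Why pre-creation helps can be shown with a toy model. The sketch below assumes shards cover fixed-length time windows and that creating one costs a round of cluster agreement. `MetaStore`, `SHARD_DURATION`, `write_point`, and `precreate` are hypothetical names for illustration, not InfluxDB's actual API or defaults.

```python
from datetime import datetime, timedelta, timezone

SHARD_DURATION = timedelta(days=7)   # assumed window length, for illustration only
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def window_for(ts):
    """Return the (start, end) of the fixed-size window containing ts."""
    index = (ts - EPOCH) // SHARD_DURATION
    start = EPOCH + index * SHARD_DURATION
    return start, start + SHARD_DURATION


class MetaStore:
    """Stand-in for consensus-backed shard metadata (hypothetical)."""

    def __init__(self):
        self.shards = set()
        self.consensus_rounds = 0

    def create_shard(self, window):
        if window not in self.shards:
            self.consensus_rounds += 1   # each creation needs cluster agreement
            self.shards.add(window)


def write_point(meta, ts):
    """Return True if the write had to wait for shard creation."""
    window = window_for(ts)
    stalled = window not in meta.shards
    if stalled:
        meta.create_shard(window)
    return stalled


def precreate(meta, now, lookahead=timedelta(hours=1)):
    """Create the shard for the near future before any write needs it."""
    meta.create_shard(window_for(now + lookahead))


meta = MetaStore()
now = datetime(2015, 12, 12, 5, 39, tzinfo=timezone.utc)
first = write_point(meta, now)     # shard missing: the write waits on consensus
second = write_point(meta, now)    # shard exists: no wait

boundary = window_for(now)[1]      # end of the current window
precreate(meta, boundary - timedelta(minutes=30))
third = write_point(meta, boundary + timedelta(seconds=1))
print(first, second, third, meta.consensus_rounds)
```

In this model the consensus round for the next window still happens, but it happens in the background ahead of time rather than on the write path.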

Schema varies by shard and time

Shard A: cpu,host=a load=100

Shard B: cpu,host=b load=99.4

The field load is an integer in one shard and a float in the other.

What is the type of the field load on the measurement cpu? It depends! Each shard is its own local authority
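The idea that each shard is its own local authority can be sketched as a per-shard type map. This is a simplified model, not InfluxDB's storage engine: the `Shard` class and `FieldTypeConflict` exception are hypothetical, and Python's `int` and `float` stand in for the integer and float field types.

```python
class FieldTypeConflict(Exception):
    """Raised when a write disagrees with the type a shard already recorded."""


class Shard:
    """Each shard tracks field types on its own; nothing is shared (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.field_types = {}   # (measurement, field) -> Python type

    def write(self, measurement, field, value):
        key = (measurement, field)
        recorded = self.field_types.setdefault(key, type(value))
        if recorded is not type(value):
            raise FieldTypeConflict(
                f"shard {self.name}: {measurement}.{field} is "
                f"{recorded.__name__}, got {type(value).__name__}"
            )


shard_a, shard_b = Shard("A"), Shard("B")
shard_a.write("cpu", "load", 100)     # shard A records load as an integer
shard_b.write("cpu", "load", 99.4)    # shard B records load as a float

types = {s.name: s.field_types[("cpu", "load")].__name__ for s in (shard_a, shard_b)}
print(types)

try:
    shard_a.write("cpu", "load", 99.4)   # conflicts only within shard A
    conflict = False
except FieldTypeConflict:
    conflict = True
print(conflict)
```

Because no cluster-wide authority reconciles the two maps, the answer to "what type is load?" depends on which shard, and therefore which time range, is asked.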

No global schema exists

    > CREATE DATABASE foo
    > DROP DATABASE fooer
    ERR: database not found: fooer
    > USE foo
    Using database foo
    > INSERT cpu value=100
    > SELECT value FROM cpu
    name: cpu
    ---------
    time                            value
    2015-12-12T05:39:11.51832836Z   100
    > SELECT valuexxx FROM cpu
    > SELECT value FROM cpuxxx
    >

Consensus regarding existing databases

No consensus, i.e. no central authority, for schema

Unexpected error-free cases
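The asymmetry in the session above can be modelled with a small sketch: database names are checked against cluster-wide metadata, so an unknown name is an error, while fields and measurements are looked up only in stored data, so an unknown name simply matches nothing. The `Cluster` class and its methods are hypothetical and exist only to illustrate the behaviour.

```python
class DatabaseNotFound(Exception):
    """Raised when a database name is absent from the cluster-wide metadata."""


class Cluster:
    """Toy model: databases are known cluster-wide, schema is not (hypothetical)."""

    def __init__(self):
        self.databases = set()   # stands in for metadata held via consensus
        self.points = {}         # database -> list of (measurement, fields)

    def create_database(self, name):
        self.databases.add(name)
        self.points.setdefault(name, [])

    def drop_database(self, name):
        if name not in self.databases:
            raise DatabaseNotFound(f"database not found: {name}")
        self.databases.remove(name)
        del self.points[name]

    def insert(self, database, measurement, **fields):
        self.points[database].append((measurement, fields))

    def select(self, database, field, measurement):
        """No schema check: unknown names just produce an empty result."""
        return [
            fields[field]
            for name, fields in self.points[database]
            if name == measurement and field in fields
        ]


cluster = Cluster()
cluster.create_database("foo")

try:
    cluster.drop_database("fooer")
    drop_error = None
except DatabaseNotFound as exc:
    drop_error = str(exc)

cluster.insert("foo", "cpu", value=100)
known = cluster.select("foo", "value", "cpu")
unknown_field = cluster.select("foo", "valuexxx", "cpu")
unknown_measurement = cluster.select("foo", "value", "cpuxxx")
print(drop_error, known, unknown_field, unknown_measurement)
```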


Fault Tolerance

Fault Tolerance

● 3-node Raft clusters can tolerate failure of 1 node

● 5-node Raft clusters can tolerate failure of 2 nodes

– 0.9.6 clusters restricted to 3 Raft nodes

● Write and query traffic may tolerate further failure

– Until the data held in consensus must be changed
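The node counts above follow from Raft's majority rule: a change commits only when more than half of the Raft nodes agree, so a cluster keeps working as long as a majority remains. A short sketch of the arithmetic, with function names chosen here for illustration:

```python
def quorum(raft_nodes):
    """Smallest number of nodes that forms a majority."""
    return raft_nodes // 2 + 1


def tolerated_failures(raft_nodes):
    """Nodes that can fail while a majority is still reachable."""
    return raft_nodes - quorum(raft_nodes)


table = {n: (quorum(n), tolerated_failures(n)) for n in (3, 4, 5)}
for n, (q, f) in table.items():
    print(f"{n} Raft nodes: quorum {q}, tolerates {f} failure(s)")
```

Note that a fourth node raises the quorum to three without raising the number of tolerated failures, which is why odd cluster sizes are the usual choice.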

Previous Approaches

v0.9-rc20

Previous Approaches

v0.9-rc20

● All data, including write load, went through Raft.

– A streaming model (AppendEntries decomposed into two calls)

– Potentially high performance

– Global schema, which meant improved user experience

– Easier to reason about when all data is in consensus

● In practice, it was inefficient and complex to implement correctly.

– Approach was abandoned by 0.9.0


Future Designs

0.9.6 Design

[Diagram: a cluster containing a Raft Leader, two Followers, and several Data nodes]

● Each node may be data, Raft, or both

0.10.0 Design

[Diagram: a Raft Leader and two Followers, shown separately from several Data nodes]

● Distinct Raft sub-system

● Data-only nodes

A lightly-loaded consensus system may result in more robust clusters.


References

● https://groups.google.com/forum/#!msg/raft-dev/6GK22lL_Y1c/vSR3_XrRVpMJ

● http://thesecretlivesofdata.com/raft/

● https://raft.github.io/

● https://github.com/hashicorp/raft

● https://github.com/otoolep/hraftd

● https://github.com/influxdb/influxdb/tree/0.9.6/meta