OSDC 2014: Fabrizio Manfredi - Data Replication
DESCRIPTION
Data replication is a crucial component of distributed services deployed in a multi-data-center environment. The replication scheme needs to be carefully evaluated before it is implemented: a wrong design or misuse in most cases ends in a major service outage. To understand replication, you need to understand the algorithms behind it. For this reason, the session starts by explaining the algorithms most commonly used to deal with the CAP theorem (Consistency, Availability and Partition Tolerance), such as consistent hashing, vector clocks, gossip protocols, Paxos and Raft. The second part of the talk analyzes how products on the market handle replication (replication in action), with their advantages and disadvantages. It covers distributed filesystems (Ceph, Tahoe-LAFS, XtreemFS, ..), distributed databases (DB replication primitives and external tools like Tungsten), NoSQL (Riak, Cassandra, MongoDB, CouchDB) and frameworks for in-house solutions (beardb, open replication, ..). The talk also shows the evaluation methods and testing process for identifying the best solution for your environment.
TRANSCRIPT
Beolink.org
Data replication, Fabrizio Manfredi Furuholmen
FOSDEM 2014
Agenda
• Introduction: Overview, Theorem, Common Pattern
• Implementation: Filesystem, RDBMS, NoSQL, Framework
• Example
Data Replication
http://blog.open-e.com/in-a-nutshell-data-replication-snapshots-and-backup/
Data Replication
http://www.dreamstime.com/stock-images-cloud-computing-scalability-reliability-background-concept-word-image34898574
Introduction
World Connection
Main Problem
VS
Main Problem
CAP theorem
According to Brewer’s CAP theorem, it is impossible for any distributed computer system to simultaneously provide all three of Consistency, Availability and Partition Tolerance.
You can’t have all three at the same time and get an acceptable latency.
CAP
RDBMS: ACID
Atomic: Everything in a transaction succeeds or the entire transaction is rolled back.
Consistent: A transaction cannot leave the database in an inconsistent state.
Isolated: Transactions cannot interfere with each other.
Durable: Completed transactions persist, even when servers restart, etc.
- Strong consistency for transactions is the highest priority
- Pessimistic
- Complex mechanisms
NoSQL: BASE
Basically Available
Soft state
Eventual consistency
- Availability and scaling are the highest priorities
- Weak consistency
- Optimistic
- Best effort
- Simple and fast
Data Distribution
Business Decision
Start with some Algorithms
Data Distribution
Replication
Data Placement
Data Consistency
System Coordination
Data Transmission
Data Placement
Better distribution = partitioning
Parallel operation = parallel stream/multi core
Data Placement
Data placement by HASH
It isn’t rocket science!
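As a rough illustration of placement by hash, the sketch below maps each key to a node with a hash taken modulo the node count. The node names, key names and the helper functions `place` and `moved_fraction` are hypothetical and not taken from the talk. It also measures how many keys change owner when one node is added, which is the weakness that consistent hashing is meant to reduce.

```python
import hashlib


def place(key, nodes):
    """Pick a node by hashing the key and taking the result modulo the node count."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest, "big") % len(nodes)]


def moved_fraction(keys, before, after):
    """Fraction of keys whose owner differs between two node lists."""
    moved = sum(1 for k in keys if place(k, before) != place(k, after))
    return moved / len(keys)


# Hypothetical cluster and key set, for illustration only.
nodes = ["node-a", "node-b", "node-c", "node-d"]
keys = [f"object-{i}" for i in range(1000)]

# Adding a fifth node changes the modulus, so most keys get a new owner.
fraction = moved_fraction(keys, nodes, nodes + ["node-e"])
print(f"keys moved after adding one node: {fraction:.0%}")
```

With plain modulo placement, growing the cluster reshuffles most of the data, which is expensive for a replicated store.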
Data Distribution
http://www.cs.rutgers.edu/~pxk/417/notes/23-lookup.html
Consistent hash
Chord
Space based/multi-dimensional
Data placement
http://highlyscalable.wordpress.com/2012/09/18/distributed-algorithms-in-nosql-databases/
Vnode based
Proximity based
Replication
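A minimal sketch of vnode-based consistent hashing with replication follows. It is not the implementation of any product named in the talk: the `HashRing` class, the 64 virtual nodes per physical node and the replica count are illustrative assumptions. Each physical node owns many tokens on the ring, and a key is stored on the first distinct nodes found clockwise from its hash.

```python
import bisect
import hashlib


def _token(value):
    """Position on the ring: first 8 bytes of a SHA-1 digest as an integer."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes          # tokens per physical node (assumed value)
        self._ring = []               # sorted list of (token, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_token(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(t, n) for t, n in self._ring if n != node]

    def replicas(self, key, n=3):
        """Return the first n distinct nodes clockwise from the key's token."""
        distinct = {node for _, node in self._ring}
        n = min(n, len(distinct))
        i = bisect.bisect_right(self._ring, (_token(key), ""))
        owners = []
        while len(owners) < n:
            node = self._ring[i % len(self._ring)][1]
            if node not in owners:
                owners.append(node)
            i += 1
        return owners


ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
keys = [f"object-{i}" for i in range(1000)]
before = {k: ring.replicas(k, 1)[0] for k in keys}
ring.add("node-e")
after = {k: ring.replicas(k, 1)[0] for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"keys moved after adding one node: {len(moved) / len(keys):.0%}")
```

Only keys that now fall on the new node's tokens change owner, so the amount of data to move is roughly proportional to the new node's share of the ring.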
Data Consistency
http://highlyscalable.wordpress.com/2012/09/18/distributed-algorithms-in-nosql-databases/
To avoid an ACID implementation while still guaranteeing consistency, some solutions leave ownership of the algorithm to the client.
- Read and write quorum
- Write quorum, read all
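The quorum idea can be checked with a small brute-force sketch. The names `overlaps` and `read_sees_latest_write` are hypothetical, and the model is deliberately simple: N replicas, a write acknowledged by W of them, and a read that contacts R of them and keeps the highest version it sees. The read is guaranteed to see the latest write exactly when every read set intersects every write set, which is the condition R + W > N.

```python
from itertools import combinations


def overlaps(n, r, w):
    """Quorum condition: any R replicas must share a member with any W replicas."""
    return r + w > n


def read_sees_latest_write(n, r, w):
    """Try every write set and read set and look for a stale read."""
    replicas = range(n)
    for write_set in combinations(replicas, w):
        # Replicas in write_set hold version 2; the others still hold version 1.
        versions = {i: (2 if i in write_set else 1) for i in replicas}
        for read_set in combinations(replicas, r):
            if max(versions[i] for i in read_set) != 2:
                return False
    return True


# Read and write quorum on three replicas.
print(read_sees_latest_write(3, 2, 2))
# Write quorum, read all.
print(read_sees_latest_write(3, 3, 2))
# One replica for reads and one for writes: stale reads are possible.
print(read_sees_latest_write(3, 1, 1))
```

Lowering R or W below the overlap threshold trades consistency for latency and availability, which is the tunable trade-off exposed by several NoSQL stores.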
Coordination Protocol
Consensus protocol
Paxos, Raft, etc.
Based on the state machine approach (a technique for converting an algorithm into a fault-tolerant, distributed implementation).
Epidemic (Gossip)
Epidemic: anybody can infect anyone else with equal probability.
Anti-entropy protocols assume that synchronization is performed on a fixed schedule: every node regularly chooses another node at random or by some rule and exchanges database contents, resolving differences.
Propagation in O(log n) rounds
http://www.cis.cornell.edu/IAI/events/Gossip_Tutorial.pdf
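The logarithmic spread of an epidemic protocol can be observed with a toy simulation. This is a sketch under simplifying assumptions (synchronous rounds, push-only gossip, no failures), and the function name `gossip_rounds` is hypothetical. In each round every informed node tells one peer chosen uniformly at random, so the informed set can at most double per round.

```python
import math
import random


def gossip_rounds(n, seed=0):
    """Count rounds until all n nodes are informed, starting from node 0."""
    rng = random.Random(seed)
    informed = {0}
    rounds = 0
    while len(informed) < n:
        # Every informed node pushes the update to one random peer.
        targets = {rng.randrange(n) for _ in informed}
        informed |= targets
        rounds += 1
    return rounds


for size in (10, 100, 1000, 10000):
    print(size, gossip_rounds(size), "rounds, log2(n) =", round(math.log2(size), 1))
```

The round count grows slowly with the cluster size, which is why gossip is attractive for membership and failure detection in large clusters.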
Transmission Protocol
Optimization
- Reorder
- Deduplication
Transmission
- By difference (Merkle tree)
- Callback
- Compression
- Auto correction
Locking
- Distributed locking
- Multiversioning
- …
Mitosis
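Transfer by difference with a Merkle tree can be sketched as follows. The helpers `build_tree` and `changed_blocks` are hypothetical, and the sketch assumes both replicas hold the same number of fixed-size blocks. Each replica hashes its blocks into a tree; comparing from the root down, identical subtrees are skipped and only the differing leaves need to be transferred.

```python
import hashlib


def _h(data):
    return hashlib.sha256(data).digest()


def build_tree(blocks):
    """Return the tree as a list of levels: leaf hashes first, root last."""
    level = [_h(b) for b in blocks]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]   # pad odd levels by repeating the last hash
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def changed_blocks(a, b):
    """Walk both trees from the root and return the indices of differing leaves."""
    if len(a[0]) != len(b[0]):
        raise ValueError("this sketch assumes the same block count on both sides")
    if not a[0]:
        return []
    pending = [(len(a) - 1, 0)]           # (level, index), starting at the root
    changed = []
    while pending:
        depth, idx = pending.pop()
        if a[depth][idx] == b[depth][idx]:
            continue                      # identical subtree: nothing to transfer
        if depth == 0:
            changed.append(idx)
            continue
        for child in (2 * idx, 2 * idx + 1):
            if child < len(a[depth - 1]):
                pending.append((depth - 1, child))
    return sorted(changed)


local = [f"block-{i}".encode() for i in range(8)]
remote = list(local)
remote[5] = b"block-5 modified"
print(changed_blocks(build_tree(local), build_tree(remote)))
```

When the replicas are nearly identical, the comparison touches only a logarithmic number of hashes per changed block instead of every block.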
Implementation
Answer …no Answer
Block replication, file → Distributed file system
Information → RDBMS
Document, blog, session → NoSQL
Content with a TTL over 1m → Caching system
Distributed Filesystem
DFS is a service that provides a single point of reference and a logical tree structure for file system resources that may be physically located anywhere on the network.
One significant responsibility of a file system is to ensure that, regardless of the actions by programs accessing the data, the structure remains consistent…
Filesystem
Properties of DFS
• Simple from the application point of view
• Data consistency
Based on the solution
• Partitioning Tolerance
• Scalability
• High Availability
Filesystem DRBD
DRBD
Replication mode: Asynchronous, Memory synchronous, Synchronous
Transfer optimization: DRBD Proxy
Main Goals
Disk replication, single service availability
Disaster Recovery
Filesystem CEPH
Ceph
Data distribution: Hash based
Consensus protocol: Paxos (monitor cluster)
Write mode: Write one, read one, client is notified when all replicas have been written
Weak consistency with cache pool
OpenStack backend at CERN
1128 OSDs
3PB
XXX VMs
http://www.slideshare.net/Inktank_Ceph/scaling-ceph-at-cern
Main Goals
- Block device/base for other filesystems
- Cloud support, image storage and VM storage
CEPH
Users: > 5000
VMs > 7000
> 250k VMs spawned
http://www.synnefo.org/resources.html
RDBMS
Properties of RDBMS
• Quite simple from the application point of view
• Data consistency
Based on the solution
• Low Partitioning Tolerance
• Low Scalability
• Low Availability
RDBMS
Asynchronous replication
Semi-synchronous
Postgres
Synchronous
Asynchronous
NoSQL
Properties of NoSQL
• Fast
Based on the solution
• Partitioning Tolerance
• Scalability
• High Availability
• Simple
NoSQL Performance
http://planetcassandra.org/nosql-performance-benchmarks/
Riak
Geo Replication
Tunable trade-offs for distribution and replication (N, R, W)
Distributed Hash Table
Filesystem over NoSQL
FUSE
In most cases not stable
S3 Interface
De facto Internet standard
Filesystem over NoSQL
Wooga
http://www.slideshare.net/wooga/riak-at-woogariak-meetup-sept-2013?qid=4809eca2-8378-4e70-8e75-0db29b635fa5&v=qf1&b=&from_search=3
https://fosdem.org/2014/schedule/event/nyt_cassandra/
Combine different solutions
Edge node (Varnish)
NoSQL
Local cache
Centralized cache
Info
Storage
DFS
Origin (Distributed cache)
Local DB
NoSQL
Decrease the number of requests
Increase of the age of the data
Framework
Build your system if you need … do you really need?
CERN
Framework
Don’t forget rsync!
Framework
Replication or caching?
Build a solution
• Split in pieces
• Track version
• Transfer when needed
• Transfer the difference
• Use notification when possible
• Move data close to computation
• Move master close to write operation
• Split counters to avoid deadlock
• In HTTP, don’t forget the ETag and Last-Modified headers
openkad
open-chord
openReplica
Raft
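The ETag advice from the list above can be illustrated with a small server-side sketch. The functions `make_etag` and `respond` are hypothetical, and the logic is simplified: it handles a single strong ETag in `If-None-Match` and ignores `Last-Modified`, weak validators and lists of tags. The point is that a client holding a current copy receives a 304 with no body, so the data is transferred only when it has changed.

```python
import hashlib


def make_etag(body):
    """Derive a quoted entity tag from the content (an assumed scheme)."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def respond(body, request_headers):
    """Return (status, headers, payload) for a conditional GET."""
    etag = make_etag(body)
    if request_headers.get("If-None-Match") == etag:
        return 304, {"ETag": etag}, b""      # client copy is current: send nothing
    return 200, {"ETag": etag}, body


stored = b"replicated object, revision 7"

# First request: no validator, full transfer.
status, headers, payload = respond(stored, {})
print(status, len(payload))

# Second request: the client presents the ETag it received earlier.
status2, _, payload2 = respond(stored, {"If-None-Match": headers["ETag"]})
print(status2, len(payload2))
```

The same pattern applies between replicas: exchanging a short validator first avoids shipping unchanged content.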
Build a solution
Five pylons
Objects
• Separation between data and metadata
• Each element is marked with a revision
• Each element is marked with a hash
Cache
• Client side
• Callback/Notify
• Persistent
Transmission
• Parallel operation
• HTTP-like protocol
• Compression
• Transfer by difference
Distribution
• Resource discovery by DNS
• Data spread on multi-node cluster
• Decentralized
• Independent clusters
• Data replication
Security
• Secure connection
• Client-side encryption
• Extended ACL
• Delegation/Federation
• Admin delegation
Build a solution
- Consistent hash
- ZMQ transport protocol
- Gossip protocol for failure detection
- Tunable trade-offs
Pisa is a simple block data replication system across a wide range of nodes.
And …
“There is always a failure waiting around the corner”
Werner Vogels
Thank you http://[email protected]