ClusterOn: Building highly configurable and reusable clustered
data services using simple data nodes
Ali Anwar★  Yue Cheng★  Hai Huang†  Ali R. Butt★
★Virginia Tech  †IBM Research
Growing data needs → new storage applications
K-V store: Redis, Memcached, DynamoDB
NoSQL DB: BigTable, Cassandra, MongoDB, HyperTable, HyperDex
Object store: Swift, Amazon S3
DFS: HDFS, LustreFS
A new storage system is born daily*
[Chart: number of storage-systems papers in SOSP, OSDI, ATC, and EuroSys over the last decade.
2006: 5, 2007: 6, 2008: 5, 2009: 7, 2010: 9, 2011: 12, 2012: 10, 2013: 16, 2014: 17, 2015: 16]
*We exaggerate here, but new storage applications are developed fairly regularly!
Distributed storage systems are notoriously hard to implement!
"Fault-tolerant algorithms are notoriously hard to express correctly, even as pseudo-code. This problem is worse when the code for such an algorithm is intermingled with all the other code that goes into building a complete system..."
Tushar Chandra, Robert Griesemer, and Joshua Redstone, Paxos Made Live, PODC '07
Case study: Redis 3.0.1
replication.c – replicates data from master to slave, or slave to slave
sentinel.c – monitors nodes and handles failover from master to slave
cluster.c – supports cluster mode with multiple masters/shards
These files are 20% of the code base (450K of 2,100K in size).
Quantifying LoC
• Core IO
• Distributed management
• Etc: config/auth/stats/compatibility ...

[Chart: %LoC breakdown per system. Underlying lines of code:]

                          Redis     HDFS
Core IO                   8,237   56,208
Distributed management   12,955   31,117
Etc                      20,202   25,185
Abstracting and generalizing common features and functionality can simplify the development of new distributed storage applications.
(a) Vanilla

    void Put(Str key, Obj val) {
      if (this.master):
        Lock(key)
        HashTbl.insert(key, val)
        Unlock(key)
        Sync(master.slaves)
    }

    Obj Get(Str key) {
      if (this.master):
        Obj val = Quorum(key)
        Sync(master.slaves)
      return val
    }

    void Lock(Str key)        { ... }  // Acquire lock
    void Unlock(Str key)      { ... }  // Release lock
    void Sync(Replicas peers) { ... }  // Update replicas
    Obj Quorum(Str key)       { ... }  // Select a node

(b) ZooKeeper-based

    void Put(Str key, Obj val) {
      if (this.master):
        zk.Lock(key)    // ZooKeeper
        HashTbl.insert(key, val)
        zk.Unlock(key)  // ZooKeeper
        Sync(master.slaves)
    }

    Obj Get(Str key) {
      if (this.master):
        Obj val = Quorum(key)
        Sync(master.slaves)
      return val
    }

    void Sync(Replicas peers) { ... }  // Update replicas
    Obj Quorum(Str key)       { ... }  // Select a node

(c) Vsync

    #include <vsynclib>

    void Put(Str key, Obj val) {
      if (this.master):
        zk.Lock(key)    // ZooKeeper
        HashTbl.insert(key, val)
        zk.Unlock(key)  // ZooKeeper
        Vsync.Sync(master.slaves)
    }

    Obj Get(Str key) {
      if (this.master):
        Obj val = Vsync.Quorum(key)
        Vsync.Sync(master.slaves)
      return val
    }

(d) ClusterOn

    void Put(Str key, Obj val) {
      HashTbl.insert(key, val)
    }

    Obj Get(Str key) {
      return HashTbl(key)
    }

    + ClusterOn = Distributed application
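To make the endpoint of this progression concrete, here is a minimal C++ sketch (our illustration, not code from the talk) of the same idea: the datalet implements only single-node Put/Get, while locking and replica fan-out live in a reusable wrapper that any storage application could share. All class and function names are hypothetical.

    #include <functional>
    #include <mutex>
    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    // Developer-written, non-distributed datalet: single-node Put/Get only.
    class LocalKV {
     public:
      void Put(const std::string& key, const std::string& val) { tbl_[key] = val; }
      std::string Get(const std::string& key) { return tbl_[key]; }
     private:
      std::unordered_map<std::string, std::string> tbl_;
    };

    // Reusable distribution wrapper: locking and replica fan-out are written
    // once here instead of being intermingled with every application's code.
    class ReplicatedKV {
     public:
      // Each SyncFn stands in for an RPC that updates one slave replica.
      using SyncFn = std::function<void(const std::string&, const std::string&)>;
      explicit ReplicatedKV(std::vector<SyncFn> slaves) : slaves_(std::move(slaves)) {}

      void Put(const std::string& key, const std::string& val) {
        std::lock_guard<std::mutex> g(mu_);         // plays the role of Lock/Unlock(key)
        local_.Put(key, val);
        for (auto& sync : slaves_) sync(key, val);  // plays the role of Sync(master.slaves)
      }

      std::string Get(const std::string& key) {
        std::lock_guard<std::mutex> g(mu_);
        return local_.Get(key);  // a real system would add Quorum() selection here
      }

     private:
      LocalKV local_;
      std::vector<SyncFn> slaves_;
      std::mutex mu_;
    };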
Design goals
• Minimize framework overhead
• Enable effective service differentiation
• Realize reusable distributed storage platform components
Design challenges
• Diversity of applications: replication policies, sharding policies, membership management, failover recovery, client connector
• CAP tradeoffs: latency, throughput, availability
• Consistency: none, strong, eventual
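As a rough illustration of how one such axis, consistency, could become a configuration knob rather than hand-written logic, this hypothetical C++ sketch maps each level to a concrete write behavior. The names and the two-field policy are our assumptions, not ClusterOn's API.

    #include <cstddef>

    enum class Consistency { kNone, kStrong, kEventual };

    // How a write is acknowledged and propagated under a given level.
    struct WritePolicy {
      std::size_t acks_required;  // replicas that must ack before replying
      bool async_fanout;          // propagate to remaining replicas later
    };

    WritePolicy PolicyFor(Consistency c, std::size_t replication_factor) {
      switch (c) {
        case Consistency::kStrong:   return {replication_factor, false};  // wait for all
        case Consistency::kEventual: return {1, true};   // ack after the local write
        case Consistency::kNone:     return {0, true};   // fire and forget
      }
      return {replication_factor, false};
    }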
Using ClusterOn

The application developer writes only a non-distributed KV store (the datalet):

    void Put(Str key, Obj val) {
      HashTbl.insert(key, val)
    }

    Obj Get(Str key) {
      return HashTbl(key)
    }

...and supplies bootstrap info:

    {
      "Topology":    "Master-Slave",
      "Consistency": "Strong",
      "Replication": 3,
      "Sharding":    false
    }

[Diagram: ClusterOn middleware (replication, consistency, topology, load balancing, autoscaling, failover; backed by a metadata server) manages a cluster of kvs datalets. The datalet can be an object store, file system, database, etc.]
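The bootstrap info maps naturally onto a typed configuration. The sketch below is a hypothetical C++ rendering of it; the talk does not show ClusterOn's actual bootstrap API, so every name here is illustrative.

    #include <cstddef>
    #include <string>

    // Typed mirror of the JSON bootstrap info above; field names map
    // one-to-one onto the JSON keys.
    struct BootstrapInfo {
      std::string topology    = "Master-Slave";
      std::string consistency = "Strong";
      std::size_t replication = 3;
      bool        sharding    = false;
    };

    int main() {
      BootstrapInfo cfg;  // in practice, parsed from the bootstrap file
      // A ClusterOn-like runtime would use cfg to pick the topology module,
      // interpose the consistency policy, and launch cfg.replication datalets.
      return cfg.replication > 0 ? 0 : 1;
    }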
ClusterOn architecture

[Diagram: a user request enters the ClusterOn middleware, which consults the metadata server and its replication, consistency, topology, load-balancing, autoscaling, and failover modules, then routes the request to a master/slave group of kvs datalets (M, S, S) before replying to the user. Two scenarios are shown for a master/slave topology with strong consistency: a write request, and failover, where a new slave datalet replaces a failed one.]
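A minimal sketch, under our reading of the diagram, of the write path it describes: the middleware forwards the write to the master datalet, synchronously updates every slave (strong consistency), and only then acks the user. The types and the RPC stub are assumptions for illustration, not ClusterOn's code.

    #include <string>
    #include <vector>

    // Stand-in for an RPC handle to one kvs datalet process.
    struct Datalet {
      bool Put(const std::string& key, const std::string& val) {
        (void)key; (void)val;  // real code would issue the network call here
        return true;
      }
    };

    struct ShardGroup {
      Datalet master;
      std::vector<Datalet> slaves;
    };

    // Strong consistency: reply to the user only after the master and all
    // slaves have applied the write.
    bool HandleWrite(ShardGroup& group, const std::string& key, const std::string& val) {
      if (!group.master.Put(key, val)) return false;  // 1. write to the master
      for (auto& slave : group.slaves)                // 2. synchronously update slaves
        if (!slave.Put(key, val)) return false;
      return true;                                    // 3. ack the user
    }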
Preliminary evaluation
• Proof-of-concept prototype implementation
  – ClusterOn framework (so far: 5,000+ lines of C++, not including header files)
    • Event handling, Protobuf
    • Strong consistency with ZooKeeper
  – Storage apps
    • Redis: in-memory KV cache/store
    • LevelDB: persistent KV store
• Measure overhead & scalability
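The next slides vary a "batch size" parameter, which we read as the proxy pipelining multiple client requests into one backend round trip. A hypothetical sketch of such a batching forwarder (not code from the prototype):

    #include <cstddef>
    #include <string>
    #include <utility>
    #include <vector>

    // Buffers client requests and forwards them to the backend in one
    // pipelined write, amortizing the per-request forwarding cost.
    class BatchingForwarder {
     public:
      explicit BatchingForwarder(std::size_t batch_size) : batch_size_(batch_size) {}

      void Enqueue(std::string request) {
        buffer_.push_back(std::move(request));
        if (buffer_.size() >= batch_size_) Flush();
      }

      void Flush() {
        // One backend round trip for the whole batch instead of one per
        // request; the actual socket write is elided in this sketch.
        buffer_.clear();
      }

     private:
      std::size_t batch_size_;
      std::vector<std::string> buffer_;
    };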
Data forwarding overhead: Redis

[Chart: throughput (10^3 RPS) vs. batch size for Direct Redis, Twemproxy + Redis, and ClusterOn + Redis. Setup: 1 ClusterOn proxy + 1 Redis backend; memory cache; 16 B keys, 32 B values; 10 million KV requests; YCSB, 100% GET.]
Scaling up: LevelDB

Throughput (10^3 QPS):

                       SET 1KB   SET 10KB   GET 1KB   GET 10KB
  Direct LDB              24.1        3.2      33.2       23.4
  ClusterOn + 1 LDB       22.6        2.9      29.0       22.1
  ClusterOn + 2 LDB       40.0        5.5      44.1       42.3
  ClusterOn + 4 LDB       72.9       11.0      81.1       57.1

Setup: 1 ClusterOn proxy + N LevelDB backends; persistent store on SATA SSD; 1 replica; 1 million KV requests.
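The near-linear growth with backend count suggests the proxy shards keys across the N LevelDB datalets (our assumption, consistent with the 1-replica setting), so independent keys are served by different backends in parallel. Hash-based routing, sketched below with illustrative names, is one simple way to do this:

    #include <cstddef>
    #include <functional>
    #include <string>
    #include <vector>

    struct Backend { /* connection handle to one LevelDB datalet */ };

    // Each key consistently maps to one shard, so independent keys are
    // served by different backends in parallel.
    Backend& Route(std::vector<Backend>& backends, const std::string& key) {
      std::size_t idx = std::hash<std::string>{}(key) % backends.size();
      return backends[idx];
    }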
Related work
• Chuang, W.-C., Sang, B., Yoo, S., Gu, R., Kulkarni, M., and Killian, C. EventWave: Programming model and runtime support for tightly-coupled elastic cloud applications. In ACM SoCC '13.
• Hunt, G. C., Scott, M. L., et al. The Coign automatic distributed partitioning system. In USENIX OSDI '99.
• Vsync: Consistent data replication for cloud computing. https://vsync.codeplex.com/
Summary
• Modern distributed storage applications share a large portion of common functionality, which can be generalized and abstracted
• ClusterOn can automatically provide core functionalities, such as service distribution, to reduce development effort
  – Faster realization of new storage applications
  – Easier code maintenance
  – Flexible and extensible service differentiation
• Questions & contact: Ali Anwar, [email protected], http://research.cs.vt.edu/dssl/