TRANSCRIPT
Morphus: Supporting Online Reconfigurations
in Sharded NoSQL Systems
Mainak Ghosh, Wenting Wang, Gopalakrishna Holla, Indranil Gupta
2
NoSQL Databases
Predicted to become a $3.4B industry by 2018
3
Database Reconfiguration
Problem: Changing database- or table-level configuration parameters
  o Primary/Shard Key – MongoDB, Cassandra
  o Ring size – Cassandra
Challenge: Affects a lot of data at once
Motivation:
  o Initially the DB is configured based on guesses
  o Sys admin needs to play around with different parameters
    • Seamlessly and efficiently
  o Later, as workload changes and the business/use case evolves:
    • Change parameters, but do it in a live database
4
Today's Prevalent Solution
Existing Solution: Create a new schema, export and re-import data
Why bad?
  o Massive unavailability of data: every second of outage costs $1.1K at Amazon and $1.5K at Google
  o A reconfiguration change caused an outage at Foursquare
  o A manual primary-key change at Google took 2 years and involved 2 dozen teams
Need a solution that is automated, seamless and efficient
5
Goals
Fast completion time: minimize the amount of data transfer required
Highly available: CRUD operations should be answered as much as possible, with little impact on latency
Network-aware: reconfiguration should adapt to underlying network latencies
6
Assumptions
Master-Slave Replication
Range-based Partitioning
Flexibility in data assignment
MongoDB, RethinkDB, CouchDB, etc.
Optimal Algorithms
8
Notations
Servers: S1, S2, S3. Ko: a record's old shard key; Kn: the same record's new shard key. Old chunks C1-C3 partition records by Ko; new chunks C'1-C'3 partition them by Kn.

Old Arrangement:
  S1 hosts C1: Ko = {1, 2, 3}, Kn = {2, 4, 8}
  S2 hosts C2: Ko = {4, 5, 6}, Kn = {1, 6, 3}
  S3 hosts C3: Ko = {7, 8, 9}, Kn = {9, 5, 7}

New Arrangement (server placement still to be decided):
  C'1: Kn = {1, 2, 3}, i.e. records with Ko = {4, 1, 6}
  C'2: Kn = {4, 5, 6}, i.e. records with Ko = {2, 8, 5}
  C'3: Kn = {7, 8, 9}, i.e. records with Ko = {9, 3, 7}
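To make the notation concrete, here is a small sketch (illustrative Python, not from the talk) that derives the per-server overlap counts used by the placement algorithms that follow:

# Sketch: derive the overlap matrix from the example arrangements.
# old_placement: server -> list of (Ko, Kn) records it currently holds.
old_placement = {
    "S1": [(1, 2), (2, 4), (3, 8)],
    "S2": [(4, 1), (5, 6), (6, 3)],
    "S3": [(7, 9), (8, 5), (9, 7)],
}
# new_chunks: new chunk -> set of Kn values it covers.
new_chunks = {"C'1": {1, 2, 3}, "C'2": {4, 5, 6}, "C'3": {7, 8, 9}}

# W[s][c] = how many of new chunk c's records server s already stores.
W = {
    s: {c: sum(kn in kns for _, kn in recs) for c, kns in new_chunks.items()}
    for s, recs in old_placement.items()
}
print(W)  # S1: all 1s; S2: 2 records of C'1; S3: 2 records of C'3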
9
Greedy Approach
[Figure: bipartite graph between servers S1-S3 and new chunks C'1-C'3. Edge weights count how many of each new chunk's records already reside on each server: S1 -> (1, 1, 1), S2 -> (2, 1, 0), S3 -> (0, 1, 2). Greedy assigns each new chunk to the server that already holds the most of its records.]

Lemma 1: The greedy algorithm is optimal in total network transfer volume.
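A minimal sketch of the greedy placement on the example above (illustrative Python, not Morphus's actual code):

# W[c][s] = records of new chunk c already on server s (from the figure).
W = {
    "C'1": {"S1": 1, "S2": 2, "S3": 0},
    "C'2": {"S1": 1, "S2": 1, "S3": 1},
    "C'3": {"S1": 1, "S2": 0, "S3": 2},
}

def greedy_assign(W):
    """Send each new chunk to the server that already holds most of it."""
    return {chunk: max(hosts, key=hosts.get) for chunk, hosts in W.items()}

print(greedy_assign(W))  # {"C'1": 'S2', "C'2": 'S1', "C'3": 'S3'}

Note that greedy optimizes only transfer volume; as the next slide shows, it can leave the load across servers unbalanced.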
10
Unbalanced Greedy
[Figure: the same old and new arrangements as on the Notations slide. Greedy's assignment can be unbalanced: one server may receive several new chunks while another receives none.]
11
Bipartite Matching Approach
[Figure: bipartite graph between servers S1-S3 and new chunks C'1-C'3, with per-edge record overlaps.]

Lemma 2: Among all load-balanced strategies that assign at most V = ⌈(number of new chunks) / (number of servers)⌉ new chunks to any server, the Hungarian algorithm is optimal in total network transfer volume.

12
[Figure: Greedy vs. Hungarian assignment comparison.]
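A sketch of the load-balanced matching using SciPy's Hungarian solver (linear_sum_assignment); replicating each server's row V times to enforce the capacity is an illustrative trick, not necessarily how Morphus implements it:

import math
import numpy as np
from scipy.optimize import linear_sum_assignment

servers = ["S1", "S2", "S3"]
chunks = ["C'1", "C'2", "C'3"]
CHUNK_SIZE = 3  # every chunk in the example holds 3 records

# W[i][j] = records of new chunk j already on server i.
W = np.array([[1, 1, 1],
              [2, 1, 0],
              [0, 1, 2]])

# Capacity: at most V = ceil(#chunks / #servers) new chunks per server,
# enforced by giving each server V replicated rows in the cost matrix.
V = math.ceil(len(chunks) / len(servers))
cost = np.repeat(CHUNK_SIZE - W, V, axis=0)  # cost = records to transfer

rows, cols = linear_sum_assignment(cost)
plan = {chunks[c]: servers[r // V] for r, c in zip(rows, cols)}
print(plan)  # e.g. C'1 -> S2, C'2 -> S1, C'3 -> S3: 4 records moved in total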
System Design
13
Typical Sharded Deployment
[Figure: typical sharded deployment. Replica sets RS0-RS2 each have a primary; config servers hold the routing metadata; front ends route queries. A query on the shard key, e.g. select * from table where user_id = 20, is routed to the single owning replica set (RS1); a query on another attribute, e.g. select * from table where product_id = 20, must be sent to all replica sets.]
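As a rough illustration of shard-key routing under range-based partitioning (the chunk map below is hypothetical, not from the talk):

import bisect

# Hypothetical chunk map: sorted lower bounds of user_id ranges -> owner.
range_bounds = [0, 15, 30]           # chunk ranges: [0,15), [15,30), [30,inf)
range_owners = ["RS0", "RS1", "RS2"]

def route(user_id):
    """Return the single replica set owning this shard-key value."""
    return range_owners[bisect.bisect_right(range_bounds, user_id) - 1]

print(route(20))  # -> RS1: a query on the shard key touches one replica set
# A query on product_id (not the shard key) has no entry in the chunk map
# and must be broadcast to every replica set.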
14
Isolation Phase
[Figure: replica sets RS0-RS2, each with a primary and a secondary, plus config servers and front ends; one secondary per replica set is set aside for the reconfiguration while the primaries keep serving traffic.]
db.collection.changeShardKey(product_id:1)
15
Execution Phase
db.collection.changeShardKey(product_id:1)
1. Runs the placement algorithm
2. Generates a placement plan
[Figure: the isolated secondaries migrate data between replica sets according to the plan, while writes on the old shard key, e.g. update table set price = 20 where user_id = 20, continue to be served by the primaries.]
16
Recovery Phase
db.collection.changeShardKey(product_id:1)
[Figure: writes that arrived during Execution are replayed to the reconfigured secondaries in successive, shrinking rounds (Iteration 1, Iteration 2, ...).]
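A toy simulation of why the recovery iterations converge (all rates, thresholds, and names below are invented for illustration, not Morphus internals):

# Each replay iteration shrinks the backlog as long as replay is faster
# than the incoming write rate; then a brief write lock finishes the tail.
backlog = 10_000      # writes that accumulated during the Execution phase
REPLAY_RATE = 5_000   # ops replayed per iteration
WRITE_RATE = 1_000    # new ops arriving per iteration (writes stay enabled)
THRESHOLD = 500       # small enough to finish under a brief write lock

iteration = 0
while backlog > THRESHOLD:
    iteration += 1
    backlog = max(0, backlog - REPLAY_RATE + WRITE_RATE)
    print(f"Iteration {iteration}: backlog = {backlog}")
# Iteration 1: 6000, Iteration 2: 2000, Iteration 3: 0 -> ready to commit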
17
Commit Phase
db.collection.changeShardKey(product_id:1)
[Figure: after a final catch-up, the reconfigured secondaries take over as primaries and the old primaries become secondaries; queries on the new shard key, e.g. update table set price = 20 where product_id = 20, are now routed directly to the owning replica set.]
Network Awareness
19
Chunk-based scheme: assigns one socket per chunk per source-destination pair of communicating servers.
[Figure: hierarchical topology experiment, with varying inter-rack latency and unequal data sizes.]
20
Weighted Fair Sharing
Between a pair of source server i and destination server j, the number of sockets assigned is

  n_ij ∝ D_ij × RTT_ij

where D_ij is the total amount of data that i needs to transfer to j, and RTT_ij is the observed round-trip latency between i and j.

This scheme is in contrast to Orchestra [Chowdhury et al.], which fair-shares based only on data size.
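A minimal sketch of such a weighted allocation under an assumed per-source socket budget (the budget, the rounding, and the function name wfs_sockets are illustrative assumptions, not Morphus's exact scheme):

def wfs_sockets(data_bytes, rtt_ms, total_sockets):
    """data_bytes[j], rtt_ms[j]: data and observed RTT from this source to
    each destination j. Returns sockets per destination (>= 1 if data > 0)."""
    weights = {j: data_bytes[j] * rtt_ms[j] for j in data_bytes}
    total = sum(weights.values())
    return {
        j: max(1, round(total_sockets * w / total)) if data_bytes[j] else 0
        for j, w in weights.items()
    }

# Example: with equal data, the slower inter-rack link gets more sockets.
print(wfs_sockets({"S2": 100e6, "S3": 100e6}, {"S2": 0.2, "S3": 2.0}, 12))
# -> {'S2': 1, 'S3': 11}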
21
Improvement
The WFS strategy is 30% better than the naïve chunk-based scheme and 9% better than Orchestra [Chowdhury et al.].
22
Geo-Distributed Optimization
Morphus chooses slaves for reconfiguration during the first Isolation phase.
In a geo-distributed setting, a naïve choice can lead to bulk transfers over the wide-area network.
Solution: localize bulk transfers by choosing replicas in the same datacenter.
  o Morphus extracts the datacenter information from each replica's metadata.
A 2x-3x improvement was observed in experiments.
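A sketch of that replica choice under an assumed metadata layout (the dict fields below are a guess at MongoDB-style replica tags; this is not Morphus's actual code):

replicas = [
    {"host": "rs0-a", "role": "secondary", "dc": "us-east"},
    {"host": "rs0-b", "role": "secondary", "dc": "eu-west"},
]

def pick_local_secondary(replicas, dest_dc):
    """Prefer a secondary in the destination datacenter to keep bulk
    transfers off the WAN; fall back to any secondary otherwise."""
    local = [r for r in replicas
             if r["role"] == "secondary" and r["dc"] == dest_dc]
    pool = local or [r for r in replicas if r["role"] == "secondary"]
    return pool[0]["host"]

print(pick_local_secondary(replicas, "eu-west"))  # -> rs0-b (no WAN bulk copy)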
Evaluation
24
Setup
Dataset: Amazon Reviews [SNAP].
Cluster: Emulab d710 nodes with a 100 Mbps LAN switch, and Google Cloud (n1-standard-4 VMs) with a 1 Gbps network.
Workload: custom generator similar to YCSB; implements Uniform, Zipfian, and Latest distributions for key access.
Morphus: implemented on top of MongoDB.
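For concreteness, here is one way the three key-access distributions could be generated (a numpy-based illustration; the paper's generator is a custom YCSB-like tool whose details are not in the slides):

import numpy as np

rng = np.random.default_rng(0)
N = 1_000  # key space size

def uniform_key():
    return rng.integers(0, N)

def zipfian_key(s=1.1):
    k = rng.zipf(s)              # heavy-tailed rank, 1 = most popular
    return min(k - 1, N - 1)     # clamp into the key space

def latest_key(num_inserted):
    # "Latest": skew accesses toward the most recently inserted keys.
    return num_inserted - 1 - min(rng.zipf(1.1) - 1, num_inserted - 1)

print(uniform_key(), zipfian_key(), latest_key(500))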
25
Data Availability
Access Distribution | Read Success Rate (%) | Write Success Rate (%)
Read Only           | 99.9                  | -
Uniform             | 99.9                  | 98.5
Latest              | 97.2                  | 96.8
Zipf                | 99.9                  | 98.3
26
Impact on Read Latency
27
Data Availability - Summary
(table repeated from the Data Availability slide)
Morphus has a small impact on data availability.
28
Algorithm Comparison
Hungarian performs well in both scenarios and should be preferred over Greedy and Random schemes
29
Scalability
Sub-linear increase in reconfiguration time as data and cluster size increase.
30
Related Work
Online schema change [Rae et al.]: resultant availabilities are smaller.
Live data migration: attempted in databases [Albatross, ShuttleDB] and in VMs [Bradford et al.]; similar in approach to ours.
  o Albatross and ShuttleDB ship the whole state to a set of empty servers.
Reactive reconfiguration proposed by Squall [Elmore et al.]: fully available, but takes longer.
31
Takeaways
Morphus is a system that allows live reconfiguration of a sharded NoSQL database.
Morphus is implemented on top of MongoDB.
Morphus uses network-efficient algorithms for placing chunks while changing the shard key.
Morphus minimally affects data availability during reconfiguration.
Morphus mitigates stragglers during data migration by optimizing for the underlying network latencies.