
Page 1: Morphus: Supporting Online Reconfigurations in Sharded NoSQL Systems

Mainak Ghosh, Wenting Wang, Gopalakrishna Holla, Indranil Gupta

Page 2: NoSQL Databases

Predicted to become a $3.4B industry by 2018

Page 3: Database Reconfiguration

Problem: Changing database- or table-level configuration parameters
o Primary/shard key – MongoDB, Cassandra
o Ring size – Cassandra

Challenge: Affects a large amount of data at once

Motivation:
o Initially, the DB is configured based on guesses
o The sysadmin needs to experiment with different parameters
  • Seamlessly and efficiently
o Later, as the workload changes and the business/use case evolves:
  • Change parameters, but do it on a live database

Page 4: Today's Prevalent Solution

Existing Solution: Create a new schema, then export and re-import the data

Why is this bad?
o Massive unavailability of data: every second of outage costs $1.1K at Amazon and $1.5K at Google
o A reconfiguration change caused an outage at Foursquare
o A manual primary-key change at Google took 2 years and involved 2 dozen teams

Need a solution that is automated, seamless and efficient

Page 5: Goals

Fast completion time: minimize the amount of data transfer required

Highly available: CRUD operations should be answered as much as possible, with little impact on latency

Network aware: reconfiguration should adapt to underlying network latencies

Page 6: Assumptions

Master-Slave Replication

Range-based Partitioning

Flexibility in data assignment

MongoDB, RethinkDB, CouchDB, etc.

Page 7: Optimal Algorithms

Page 8: Notations

Ko: old shard key; Kn: new shard key. Servers S1–S3 host the chunks.

Old arrangement (chunks C1–C3 partitioned on Ko):

  Server  Chunk  Ko       Kn
  S1      C1     1 2 3    2 4 8
  S2      C2     4 5 6    1 6 3
  S3      C3     7 8 9    9 5 7

New arrangement (chunks C'1–C'3 partitioned on Kn):

  Chunk   Kn       Ko
  C'1     1 2 3    4 1 6
  C'2     4 5 6    2 8 5
  C'3     7 8 9    9 3 7

Page 9: Greedy Approach

[Figure: bipartite graph between servers S1–S3 and new chunks C'1–C'3. Edge weight W(i, j) counts the keys of new chunk C'j already resident on server Si; from the example: S1 → (1, 1, 1), S2 → (2, 1, 0), S3 → (0, 1, 2). Greedy assigns each new chunk along its heaviest edge.]

Lemma 1: The greedy algorithm is optimal in total network transfer volume.
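The greedy rule fits in a few lines. Below is a minimal illustrative sketch (not Morphus's actual code) that assigns each new chunk to the server already holding the most of its keys, using the edge weights from the running example:

```python
# Sketch of the greedy placement (illustrative names, not Morphus source code).
# W[i][j] = number of keys of new chunk j already resident on server i; placing
# a chunk on the server that already holds most of its data minimizes the data
# that must cross the network for that chunk.

# Edge weights from the running example (servers S1-S3 x new chunks C'1-C'3).
W = [
    [1, 1, 1],  # S1
    [2, 1, 0],  # S2
    [0, 1, 2],  # S3
]

def greedy_assign(W):
    """Assign each new chunk to the server holding most of its keys."""
    assignment = {}
    for chunk in range(len(W[0])):
        best_server = max(range(len(W)), key=lambda s: W[s][chunk])
        assignment[chunk] = best_server
    return assignment

print(greedy_assign(W))  # {0: 1, 1: 0, 2: 2}, i.e., C'1->S2, C'2->S1, C'3->S3
```

Note that each chunk is placed independently, so nothing stops greedy from piling many chunks onto one server, which motivates the next slide.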

Page 10: Unbalanced Greedy

[Figure: the same old/new arrangements as on the Notations slide, with the greedy assignment overlaid; greedy can assign several new chunks to one server, leaving the cluster load-unbalanced.]

Page 11: Bipartite Matching Approach

[Figure: the same bipartite graph between servers S1–S3 and new chunks C'1–C'3, comparing the greedy and the Hungarian assignments.]

Lemma 2: Among all load-balanced strategies that assign at most V new chunks to any server (V = ⌈number of new chunks / number of servers⌉), the Hungarian algorithm is optimal in total network transfer volume.
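A hedged sketch of this load-balanced variant: each server is duplicated V times so no server can exceed the cap, and an off-the-shelf Hungarian solver (scipy.optimize.linear_sum_assignment) maximizes the total weight, i.e., the data that stays in place. The names and the scipy dependency are illustrative choices, not Morphus internals:

```python
# Load-balanced (Hungarian) placement sketch using SciPy's assignment solver.
import math
import numpy as np
from scipy.optimize import linear_sum_assignment

W = np.array([
    [1, 1, 1],  # S1: keys of C'1, C'2, C'3 already on this server
    [2, 1, 0],  # S2
    [0, 1, 2],  # S3
])

def balanced_assign(W):
    n_servers, n_chunks = W.shape
    cap = math.ceil(n_chunks / n_servers)   # V: max new chunks per server
    # Duplicate each server row 'cap' times so the matching can give a server
    # at most 'cap' chunks, then maximize the weight of data left in place.
    expanded = np.repeat(W, cap, axis=0)
    rows, cols = linear_sum_assignment(expanded, maximize=True)
    return {chunk: row // cap for row, chunk in zip(rows, cols)}

print(balanced_assign(W))  # chunk -> server, e.g., C'1->S2, C'2->S1, C'3->S3
```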

Page 12: System Design

Page 13: Typical Sharded Deployment

[Figure: a sharded deployment with three replica sets (RS0–RS2), config servers, and front ends. The query "select * from table where user_id = 20" on the shard key user_id is routed to RS1 alone; the query "select * from table where product_id = 20" on a non-shard-key field must go to all replica sets.]
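A minimal sketch of the range-based routing depicted above, assuming hypothetical split points and replica-set names (in MongoDB this metadata lives on the config servers):

```python
import bisect

# Hypothetical range-partitioned routing table: shard-key split points and
# the replica set owning each range, as a config server would record them.
SPLIT_POINTS = [10, 30]          # user_id < 10 -> RS0, 10..29 -> RS1, >= 30 -> RS2
REPLICA_SETS = ["RS0", "RS1", "RS2"]

def route(shard_key_value=None):
    """Return the replica sets a query must touch."""
    if shard_key_value is None:          # predicate not on the shard key:
        return REPLICA_SETS              # scatter-gather to every replica set
    idx = bisect.bisect_right(SPLIT_POINTS, shard_key_value)
    return [REPLICA_SETS[idx]]           # targeted to a single replica set

print(route(20))  # ['RS1']                 -- select ... where user_id = 20
print(route())    # ['RS0', 'RS1', 'RS2']   -- select ... where product_id = 20
```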

Page 14: Isolation Phase

[Figure: each replica set RS0–RS2 has a primary and secondaries, plus config servers and front ends. On receiving db.collection.changeShardKey(product_id:1), Morphus isolates one secondary from each replica set; the primaries continue serving traffic.]

Page 15: Execution Phase

[Figure: for db.collection.changeShardKey(product_id:1), Morphus (1) runs the placement algorithm and (2) generates a placement plan, then migrates chunks among the isolated secondaries. Meanwhile, the primaries keep serving writes such as "update table set price = 20 where user_id = 20".]

Page 16: Recovery Phase

[Figure: while db.collection.changeShardKey(product_id:1) proceeds, the isolated secondaries replay the writes the primaries accepted during migration, over multiple catch-up iterations (Iteration 1, Iteration 2, ...).]

Page 17: Commit Phase

[Figure: the reconfigured secondaries are promoted to primaries for the new shard key, and the old primaries rejoin as secondaries; a write "update table set price = 20 where product_id = 20" now routes by product_id.]
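Putting the four phases together, here is a hedged toy simulation of the pipeline the slides depict; every class and helper name is illustrative, not Morphus's API:

```python
# Toy simulation of the four phases (illustrative stubs, not Morphus's API).

class ReplicaSet:
    def __init__(self, name):
        self.name = name
        self.primary = f"{name}-primary"
        self.secondaries = [f"{name}-secondary"]
        self.pending_writes = []  # oplog entries accepted during migration

def change_shard_key(replica_sets, new_key):
    # Isolation: detach one secondary per replica set; primaries keep serving.
    isolated = [rs.secondaries.pop() for rs in replica_sets]

    # Execution: compute a placement plan (greedy or Hungarian) and migrate
    # chunks among the isolated secondaries (details elided in this sketch).
    print(f"migrating chunks among {isolated} for new key {new_key}")

    # Recovery: replay writes the primaries accepted in the meantime,
    # iterating until the isolated replicas have caught up.
    for rs in replica_sets:
        while rs.pending_writes:
            print(f"replaying: {rs.pending_writes.pop(0)}")

    # Commit: the reconfigured replicas take over as primaries for the new
    # shard key; the old primaries re-sync and rejoin as secondaries.
    for rs, new_primary in zip(replica_sets, isolated):
        rs.secondaries.append(rs.primary)
        rs.primary = new_primary

rsets = [ReplicaSet(f"RS{i}") for i in range(3)]
rsets[1].pending_writes.append("update table set price = 20 where user_id = 20")
change_shard_key(rsets, "product_id")
print([rs.primary for rs in rsets])  # the former secondaries are now primaries
```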

Page 18: Network Awareness

Page 19: Chunk-Based Scheme

Assigns a socket per chunk, per source-destination pair of communicating servers (chunk-based)

[Figure: hierarchical topology experiment, with inter-rack latency and unequal data sizes]

Page 20: Weighted Fair Sharing

Between a pair of source server i and destination server j, the number of sockets assigned is

  X_ij ∝ D_ij × RTT_ij

where D_ij is the total amount of data that i needs to transfer to j, and RTT_ij is the observed round-trip latency between i and j.

This scheme contrasts with Orchestra [Chowdhury et al], which fair-shares based on data size alone.
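A small sketch of this allocation under the formula above; the socket budget K, the flow sizes, and the RTTs are made-up values:

```python
# Weighted fair sharing sketch: sockets per (source, destination) flow are
# proportional to D_ij * RTT_ij, so slow, data-heavy links get more sockets
# (per-socket TCP throughput falls as RTT grows). Illustrative values only.

K = 12  # total socket budget across all flows (assumed)

# (source, dest) -> (bytes to transfer D_ij, round-trip time RTT_ij in seconds)
flows = {
    ("S1", "S2"): (4e9, 0.2e-3),  # same rack: low RTT
    ("S1", "S3"): (4e9, 2.0e-3),  # across racks: 10x the RTT
    ("S2", "S3"): (2e9, 2.0e-3),
}

total_weight = sum(d * rtt for d, rtt in flows.values())
sockets = {
    pair: max(1, round(K * d * rtt / total_weight))
    for pair, (d, rtt) in flows.items()
}
print(sockets)  # the cross-rack, data-heavy flow gets the most sockets
```

The rounding makes the split approximate; the point is that a high-RTT, high-volume flow no longer straggles behind the fast local flows.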

Page 21: Improvement

The WFS strategy is 30% better than the naïve chunk-based scheme and 9% better than Orchestra [Chowdhury et al].

Page 22: Geo-Distributed Optimization

Morphus chooses slaves for reconfiguration during the initial Isolation phase

In a geo-distributed setting, a naïve choice can lead to bulk transfers over the wide-area network

Solution: Localize the bulk transfer by choosing replicas in the same datacenter
o Morphus extracts the datacenter information from each replica's metadata

2x-3x improvement observed in experiments
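A minimal sketch of that replica choice, assuming a hypothetical metadata layout with a datacenter tag per replica:

```python
# Sketch: prefer an isolated secondary in the coordinator's own datacenter so
# bulk chunk transfers stay on the local network. The replica metadata layout
# here is an assumption for illustration, not MongoDB's actual schema.

replicas = [
    {"host": "rs0-a", "role": "secondary", "datacenter": "us-east"},
    {"host": "rs0-b", "role": "secondary", "datacenter": "eu-west"},
]

def pick_secondary(replicas, local_dc):
    """Choose a local-datacenter secondary if one exists, else any secondary."""
    secondaries = [r for r in replicas if r["role"] == "secondary"]
    local = [r for r in secondaries if r["datacenter"] == local_dc]
    return (local or secondaries)[0]

print(pick_secondary(replicas, "us-east")["host"])  # rs0-a
```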

Page 23: Evaluation

Page 24: Setup

Dataset: Amazon Reviews [Snap].

Cluster: Emulab d710 nodes on a 100 Mbps LAN switch, and Google Cloud n1-standard-4 VMs on a 1 Gbps network

Workload: Custom generator similar to YCSB; implements Uniform, Zipfian, and Latest distributions for key access

Morphus: Implemented on top of MongoDB

Page 25: Data Availability

  Access Distribution  Read Success Rate (%)  Write Success Rate (%)
  Read Only            99.9                   -
  Uniform              99.9                   98.5
  Latest               97.2                   96.8
  Zipf                 99.9                   98.3

Page 26: Impact on Read Latency

Page 27: Data Availability - Summary

  Access Distribution  Read Success Rate (%)  Write Success Rate (%)
  Read Only            99.9                   -
  Uniform              99.9                   98.5
  Latest               97.2                   96.8
  Zipf                 99.9                   98.3

Morphus has a small impact on data availability

Page 28: Algorithm Comparison

Hungarian performs well in both scenarios and should be preferred over Greedy and Random schemes

Page 29: Scalability

Sub-linear increase in reconfiguration time as data and cluster size increase

Page 30: Related Work

Online schema change [Rae et al.]: resultant availabilities are smaller.

Live data migration: attempted in databases [Albatross, ShuttleDB] and VMs [Bradford et al.]; similar approach to ours.
o Albatross and ShuttleDB ship the whole state to a set of empty servers.

Reactive reconfiguration proposed by Squall [Elmore et al]: fully available, but takes longer.

Page 31: Takeaways

Morphus is a system that allows live reconfiguration of a sharded NoSQL database.

Morphus is implemented on top of MongoDB.

Morphus uses network-efficient algorithms for placing chunks while changing the shard key.

Morphus minimally affects data availability during reconfiguration.

Morphus mitigates stragglers during data migration by optimizing for the underlying network latencies.