Providing Big Data Applications with Fault-Tolerant Data Migration Across Heterogeneous NoSQL Databases
Marco Scavuzzo, Damian A. Tamburri, Elisabetta Di Nitto
Politecnico di Milano, Italy
BIGDSE ‘16 – May 16th 2016, Austin
NoSQLs and Big Data applications
Highly-available, big data applications need specific storage technologies:
- Distributed File Systems – DFSs (e.g., HDFS, Ceph, etc.)
- NoSQL databases (e.g., Riak, Cassandra, MongoDB, Neo4j, etc.)
NoSQLs are preferred to DFSs for:
- Efficient data access (for reads and/or writes)
- Concurrent data access
- Adjustable data consistency and integrity policies
- Logic (filtering, grouping, aggregation) in the data layer instead of the application layer (Hive, Pig, etc.)
NoSQLs heterogeneity
- Lack of standard data access interfaces and languages
- Lack of common data models (e.g., data types, secondary indexes, integrity constraints, etc.)
- Different architectures leading to different ways of approaching important problems (e.g., concurrency control, replication, transactions, etc.)
Vendor lock-in
“The lack of standards due to most NoSQLs creating their own APIs [..] is going to be a nightmare in due course of time w.r.t. porting applications from one NoSQL to another. Whether it is an open source system or a proprietary one, users will feel locked in.”
— C. Mohan
Research objective
Provide a method and supporting architecture to aid fault-tolerant data migration across heterogeneous NoSQL databases for Big Data applications
Hegira4Cloud
Hegira4Cloud requirements
1. Big Data migration across any NoSQL database and Database as a Service (DaaS)
2. High-performance data migration
3. Fault-tolerant data migration
Hegira4Cloud approach

[Architecture diagram (Migration System Core): Source DB -> SRC (conversion to the Metamodel format) -> Migration Queue -> TWC (conversion from the Metamodel format) -> Target DB]
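The two conversion steps can be sketched as a pair of translators going through a database-neutral intermediate representation. This is only an illustrative sketch: the class and function names (`MetamodelEntity`, `to_metamodel`, `from_metamodel`) are hypothetical, not Hegira4Cloud's actual API.

```python
# Illustrative sketch (not Hegira4Cloud's real API): every source entity is
# first converted into a database-neutral metamodel entity; a target-specific
# writer then converts it into the target database's native format.

class MetamodelEntity:
    """Database-neutral representation: a key plus typed properties."""
    def __init__(self, key, properties):
        self.key = key
        self.properties = properties  # name -> (declared_type, value)

def to_metamodel(row):
    """SRC side: wrap a source record, keeping explicit type information
    so the target writer can re-encode values without guessing."""
    props = {name: (type(value).__name__, value)
             for name, value in row["columns"].items()}
    return MetamodelEntity(row["id"], props)

def from_metamodel(entity):
    """TWC side: re-encode the metamodel entity for a (fictional) target
    store that wants string keys and plain values."""
    return {"key": str(entity.key),
            "values": {n: v for n, (_t, v) in entity.properties.items()}}
```

Keeping the type annotation in the intermediate format is what lets the writer map, say, a source integer onto whatever numeric type the target database supports.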
Monolithic architecture data migration: GAE Datastore -> Azure Tables (Hegira4Cloud V1)

                              dataset #1   dataset #2   dataset #3
Source size (MB)                      16           64          512
# of Entities                     36,940      147,758    1,182,062
Migration time (sec)               1,098        4,270       34,111
                                  (~18m)       (~71m)      (~568m)
Entities throughput (ent/s)       33.643       34.604       34.653
Avg. %CPU usage                    4.749        3.947        4.111
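The throughput row follows directly from the other two: entities divided by migration time. A quick arithmetic check:

```python
# Verify the throughput figures in the table above (entities / seconds).
entities = [36940, 147758, 1182062]
seconds = [1098, 4270, 34111]

throughput = [round(e / s, 3) for e, s in zip(entities, seconds)]
# matches the table: 33.643, 34.604, 34.653 ent/s
```

Note that throughput stays essentially flat (~34 ent/s) as the dataset grows, which is what motivates the parallel architecture in the following slides.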
Improving performance: components decoupling
Components decoupling helps in:
- distributing the computation (conversion to/from the intermediate metamodel);
- isolating possible bottlenecks;
- finding (and solving) errors.
Improving performance: parallelization
Operations to be executed can be parallelized:
- data extraction (from the source database): data should be partitionable
- data load (to the target database)
Improving performance: TWC parallelization
Challenges:
- avoid duplicating data (i.e., process disjoint data only once)
- avoid thread starvation
- in case of a fault, already extracted data should not be lost
Solution: RabbitMQ
- messages are distributed (disjointly) in round-robin fashion
- correctly processed messages are acknowledged and removed
- messages are persisted on disk
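The three guarantees above can be modeled in a few lines. This is a toy in-memory sketch of the broker behaviour the migration relies on (the real system uses RabbitMQ); the class and method names are illustrative:

```python
from collections import deque

class MigrationQueue:
    """Toy model of the broker guarantees Hegira4Cloud relies on:
    disjoint round-robin delivery, explicit acknowledgements, and
    redelivery of unacknowledged messages after a consumer fault."""
    def __init__(self):
        self.ready = deque()   # messages waiting for delivery
        self.unacked = {}      # delivery_tag -> message
        self.next_tag = 0

    def publish(self, msg):
        self.ready.append(msg)

    def deliver(self):
        """Hand the next message to a consumer; it stays 'unacked' until
        the consumer confirms it was written to the target database."""
        msg = self.ready.popleft()
        self.next_tag += 1
        self.unacked[self.next_tag] = msg
        return self.next_tag, msg

    def ack(self, tag):
        del self.unacked[tag]  # safely processed: remove for good

    def requeue_unacked(self):
        """On consumer crash, redeliver everything not yet acked,
        so extracted data is never lost."""
        for _tag, msg in sorted(self.unacked.items()):
            self.ready.append(msg)
        self.unacked.clear()
```

Because each message is delivered to exactly one consumer and only removed on ack, a crashed TWC thread loses no data and no entity is written twice by competing threads.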
Improving performance: SRC parallelization
Challenges:
- complete knowledge of the stored data is needed to partition it
- partitions should be processed at most once (to avoid duplicates)
[Diagram: source keys 1–10, 11–20, 21–30 grouped into VDP1, VDP2, VDP3]
Let's assume that data are associated with a unique, incremental primary key (or an indexed property).
References to the VDPs are stored in persistent storage.
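Under that assumption, computing the Virtual Data Partitions (VDPs) reduces to splitting the key space into fixed-size ranges. A minimal sketch (the function name is illustrative):

```python
def make_vdps(first_key, last_key, vdp_size):
    """Split the key space [first_key, last_key] into Virtual Data
    Partitions (VDPs) of at most `vdp_size` consecutive keys, so each
    partition can be extracted independently and exactly once."""
    vdps = []
    start = first_key
    while start <= last_key:
        end = min(start + vdp_size - 1, last_key)
        vdps.append((start, end))
        start = end + 1
    return vdps
```

With keys 1–30 and a partition size of 10 this yields exactly the three VDPs shown in the diagram: (1, 10), (11, 20), (21, 30).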
Addressing faults
Types of (non-trivial) faults:
- Database faults
- Component faults
- Network faults
On connection loss, not all databases guarantee a unique pointer to the data (e.g., Google Datastore)
[Diagram: Source DB -> SRC -> Migration Queue -> TWC -> Target DB, with a connection loss between the Source DB and SRC]
Virtual data partitioning
[Diagram: source keys (1–10, 11–20, 21–30) partitioned into VDP1, VDP2, VDP3, each tracked in the Status Log]
Status Log:

VDPid   Status
1       migrated
2       under_mig
3       not_mig
Status transitions: not_mig --migrate--> under_mig --finish_mig--> migrated
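The status log can be read as a small state machine per VDP: only legal transitions are accepted, which is what enforces at-most-once processing of each partition. A minimal sketch (class and event names are illustrative):

```python
# Illustrative sketch of the per-VDP status log. Rejecting illegal
# transitions (e.g., migrating an already-migrated partition) is what
# prevents a partition from being processed twice after a fault.
TRANSITIONS = {
    ("not_mig", "migrate"): "under_mig",
    ("under_mig", "finish_mig"): "migrated",
}

class StatusLog:
    def __init__(self, n_vdps):
        self.status = {i: "not_mig" for i in range(1, n_vdps + 1)}

    def apply(self, vdp_id, event):
        """Advance a VDP's state, or raise if the transition is illegal."""
        key = (self.status[vdp_id], event)
        if key not in TRANSITIONS:
            raise ValueError(f"illegal transition {key} for VDP {vdp_id}")
        self.status[vdp_id] = TRANSITIONS[key]
        return self.status[vdp_id]
```

After a crash, a recovering SRC thread simply resumes from partitions still marked `not_mig` or `under_mig`, skipping everything already `migrated`.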
Hegira4Cloud V2

[Architecture diagram: Source DB -> SRC -> Migration Queue -> TWC -> Target DB, with ZooKeeper maintaining the partition status in the Status Log]
Hegira4Cloud V2: Evaluation
- 1 Source Reading Thread
- 40 Target Writing Threads

                              Monolithic architecture              Parallel distributed architecture
                              dataset #1  dataset #2  dataset #3   dataset #1
Source size (MB)                      16          64         512   318,464 (311 GB)
# of Entities                     36,940     147,758   1,182,062   ~107M
Migration time (sec)               1,098       4,270      34,111   124,867 (~34.5 h)
Entities throughput (ent/s)       33.643      34.604      34.653   856.41
Avg. %CPU usage                    4.749       3.947       4.111   49.87
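The gain of the parallel distributed architecture over the monolithic one can be checked directly from the throughput column:

```python
# Throughput improvement of the parallel distributed architecture
# over the monolithic one (dataset #3 vs. the 311 GB dataset).
mono_tput = 34.653   # ent/s, monolithic, dataset #3
par_tput = 856.41    # ent/s, parallel distributed, 40 writer threads

speedup = par_tput / mono_tput
# roughly a 24.7x throughput improvement
```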
Conclusions
- Efficient, fault-tolerant method for data migration
- Architecture supporting data migration across NoSQL databases
  - Supports several databases (Azure Tables, Cassandra, Google Datastore, HBase)
  - Evaluated on an industrial case study
Future work
- Support online data migrations
- Rigorous tests for assessing data completeness and correctness
Marco Scavuzzo
PhD student @ Politecnico di Milano
You can find me at: marco.scavuzzo@polimi.it
Credits: Presentation template by SlidesCarnival