Providing Big Data Applications with Fault-Tolerant Data Migration Across Heterogeneous NoSQL Databases
Marco Scavuzzo, Damian A. Tamburri, Elisabetta Di Nitto
Politecnico di Milano, Italy
BIGDSE ‘16 – May 16th 2016, Austin
NoSQLs and Big Data applications
Highly-available, big data applications need specific storage technologies:
- Distributed File Systems – DFSs (e.g., HDFS, Ceph, etc.)
- NoSQL databases (e.g., Riak, Cassandra, MongoDB, Neo4j, etc.)
NoSQLs are preferred to DFSs for:
- Efficient data access (for reads and/or writes)
- Concurrent data access
- Adjustable data consistency and integrity policies
- Logic (filtering, grouping, aggregation) in the data layer instead of the application layer (Hive, Pig, etc.)
NoSQLs heterogeneity
- Lack of standard data access interfaces and languages
- Lack of common data models (e.g., data types, secondary indexes, integrity constraints, etc.)
- Different architectures leading to different ways of approaching important problems (e.g., concurrency control, replication, transactions, etc.)
Vendor lock-in
“The lack of standards due to most NoSQLs creating their own APIs [..] is going to be a nightmare in due course of time w.r.t. porting applications from one NoSQL to another. Whether it is an open source system or a proprietary one, users will feel locked in.”
— C. Mohan
Research objective
Provide a method and supporting architecture to aid fault-tolerant data migration across heterogeneous NoSQL databases for Big Data applications
Hegira4Cloud
Hegira4Cloud requirements
1. Big Data migration across any NoSQL database and Database as a Service (DaaS)
2. High-performance data migration
3. Fault-tolerant data migration
Hegira4Cloud approach

[Architecture diagram (Migration System Core): Source DB -> SRC (conversion to the Metamodel format) -> Migration Queue -> TWC (conversion from the Metamodel format) -> Target DB]
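The two conversion steps can be sketched as a pair of translators going through a database-neutral intermediate representation. This is only an illustrative sketch: the class and function names (`MetamodelEntity`, `to_metamodel`, `from_metamodel`) are hypothetical, not Hegira4Cloud's actual API.

```python
# Illustrative sketch (not Hegira4Cloud's real API): every source entity is
# first converted into a database-neutral metamodel entity; a target-specific
# writer then converts it into the target database's native format.

class MetamodelEntity:
    """Database-neutral representation: a key plus typed properties."""
    def __init__(self, key, properties):
        self.key = key
        self.properties = properties  # name -> (declared_type, value)

def to_metamodel(row):
    """SRC side: wrap a source record, keeping explicit type information
    so the target writer can re-encode values without guessing."""
    props = {name: (type(value).__name__, value)
             for name, value in row["columns"].items()}
    return MetamodelEntity(row["id"], props)

def from_metamodel(entity):
    """TWC side: re-encode the metamodel entity for a (fictional) target
    store that wants string keys and plain values."""
    return {"key": str(entity.key),
            "values": {n: v for n, (_t, v) in entity.properties.items()}}
```

Keeping the type annotation in the intermediate format is what lets the writer map, say, a source integer onto whatever numeric type the target database supports.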
Monolithic architecture data migration: GAE Datastore -> Azure Tables (Hegira4Cloud V1)

                              dataset #1   dataset #2   dataset #3
Source size (MB)                      16           64          512
# of Entities                     36,940      147,758    1,182,062
Migration time (sec)               1,098        4,270       34,111
                                  (~18m)       (~71m)      (~568m)
Entities throughput (ent/s)       33.643       34.604       34.653
Avg. %CPU usage                    4.749        3.947        4.111
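The throughput row follows directly from the other two: entities divided by migration time. A quick arithmetic check:

```python
# Verify the throughput figures in the table above (entities / seconds).
entities = [36940, 147758, 1182062]
seconds = [1098, 4270, 34111]

throughput = [round(e / s, 3) for e, s in zip(entities, seconds)]
# matches the table: 33.643, 34.604, 34.653 ent/s
```

Note that throughput stays essentially flat (~34 ent/s) as the dataset grows, which is what motivates the parallel architecture in the following slides.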
Improving performance: components decoupling
Components decoupling helps in:
- distributing the computation (conversion to/from the intermediate metamodel);
- isolating possible bottlenecks;
- finding (and solving) errors.
Improving performance: parallelization
Operations to be executed can be parallelized:
- data extraction (from the source database): data should be partitionable
- data load (to the target database)
Improving performance: TWC parallelization
Challenges:
- avoid duplicating data (i.e., process disjoint data only once)
- avoid thread starvation
- in case of a fault, already extracted data should not be lost
Solution: RabbitMQ
- messages are distributed (disjointly) in round-robin fashion
- correctly processed messages are acknowledged and removed
- messages are persisted on disk
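The three guarantees above can be modeled in a few lines. This is a toy in-memory sketch of the broker behaviour the migration relies on (the real system uses RabbitMQ); the class and method names are illustrative:

```python
from collections import deque

class MigrationQueue:
    """Toy model of the broker guarantees Hegira4Cloud relies on:
    disjoint round-robin delivery, explicit acknowledgements, and
    redelivery of unacknowledged messages after a consumer fault."""
    def __init__(self):
        self.ready = deque()   # messages waiting for delivery
        self.unacked = {}      # delivery_tag -> message
        self.next_tag = 0

    def publish(self, msg):
        self.ready.append(msg)

    def deliver(self):
        """Hand the next message to a consumer; it stays 'unacked' until
        the consumer confirms it was written to the target database."""
        msg = self.ready.popleft()
        self.next_tag += 1
        self.unacked[self.next_tag] = msg
        return self.next_tag, msg

    def ack(self, tag):
        del self.unacked[tag]  # safely processed: remove for good

    def requeue_unacked(self):
        """On consumer crash, redeliver everything not yet acked,
        so extracted data is never lost."""
        for _tag, msg in sorted(self.unacked.items()):
            self.ready.append(msg)
        self.unacked.clear()
```

Because each message is delivered to exactly one consumer and only removed on ack, a crashed TWC thread loses no data and no entity is written twice by competing threads.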
Improving performance: SRC parallelization
Challenges:
- complete knowledge of the stored data is needed to partition it
- partitions should be processed at most once (to avoid duplicates)
[Diagram: source keys 1–10, 11–20, 21–30 grouped into VDP1, VDP2, VDP3]
Let's assume that data are associated with a unique, incremental primary key (or an indexed property).
References to the VDPs are stored in persistent storage.
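Under that assumption, computing the Virtual Data Partitions (VDPs) reduces to splitting the key space into fixed-size ranges. A minimal sketch (the function name is illustrative):

```python
def make_vdps(first_key, last_key, vdp_size):
    """Split the key space [first_key, last_key] into Virtual Data
    Partitions (VDPs) of at most `vdp_size` consecutive keys, so each
    partition can be extracted independently and exactly once."""
    vdps = []
    start = first_key
    while start <= last_key:
        end = min(start + vdp_size - 1, last_key)
        vdps.append((start, end))
        start = end + 1
    return vdps
```

With keys 1–30 and a partition size of 10 this yields exactly the three VDPs shown in the diagram: (1, 10), (11, 20), (21, 30).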
Addressing faults
Types of (non-trivial) faults:
- Database faults
- Component faults
- Network faults
On connection loss, not all databases guarantee a unique pointer to the data (e.g., Google Datastore)
[Diagram: Source DB -> SRC -> Migration Queue -> TWC -> Target DB, with a connection loss between the Source DB and SRC]
Virtual data partitioning
[Diagram: source keys (1–10, 11–20, 21–30) partitioned into VDP1, VDP2, VDP3, each tracked in the Status Log]
Status Log:

VDPid   Status
1       migrated
2       under_mig
3       not_mig
Status transitions: not_mig --migrate--> under_mig --finish_mig--> migrated
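The status log can be read as a small state machine per VDP: only legal transitions are accepted, which is what enforces at-most-once processing of each partition. A minimal sketch (class and event names are illustrative):

```python
# Illustrative sketch of the per-VDP status log. Rejecting illegal
# transitions (e.g., migrating an already-migrated partition) is what
# prevents a partition from being processed twice after a fault.
TRANSITIONS = {
    ("not_mig", "migrate"): "under_mig",
    ("under_mig", "finish_mig"): "migrated",
}

class StatusLog:
    def __init__(self, n_vdps):
        self.status = {i: "not_mig" for i in range(1, n_vdps + 1)}

    def apply(self, vdp_id, event):
        """Advance a VDP's state, or raise if the transition is illegal."""
        key = (self.status[vdp_id], event)
        if key not in TRANSITIONS:
            raise ValueError(f"illegal transition {key} for VDP {vdp_id}")
        self.status[vdp_id] = TRANSITIONS[key]
        return self.status[vdp_id]
```

After a crash, a recovering SRC thread simply resumes from partitions still marked `not_mig` or `under_mig`, skipping everything already `migrated`.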
Hegira4Cloud V2

[Architecture diagram: Source DB -> SRC -> Migration Queue -> TWC -> Target DB, with ZooKeeper maintaining the partition status in the Status Log]
Hegira4Cloud V2: Evaluation
- 1 Source Reading Thread
- 40 Target Writing Threads

                              Monolithic architecture              Parallel distributed architecture
                              dataset #1  dataset #2  dataset #3   dataset #1
Source size (MB)                      16          64         512   318,464 (311 GB)
# of Entities                     36,940     147,758   1,182,062   ~107M
Migration time (sec)               1,098       4,270      34,111   124,867 (~34.5 h)
Entities throughput (ent/s)       33.643      34.604      34.653   856.41
Avg. %CPU usage                    4.749       3.947       4.111   49.87
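The gain of the parallel distributed architecture over the monolithic one can be checked directly from the throughput column:

```python
# Throughput improvement of the parallel distributed architecture
# over the monolithic one (dataset #3 vs. the 311 GB dataset).
mono_tput = 34.653   # ent/s, monolithic, dataset #3
par_tput = 856.41    # ent/s, parallel distributed, 40 writer threads

speedup = par_tput / mono_tput
# roughly a 24.7x throughput improvement
```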
Conclusions
- Efficient, fault-tolerant method for data migration
- Architecture supporting data migration across NoSQL databases
  - Supports several databases (Azure Tables, Cassandra, Google Datastore, HBase)
  - Evaluated on an industrial case study
Future work
- Support online data migrations
- Rigorous tests for assessing data completeness and correctness
Marco Scavuzzo
PhD student @ Politecnico di Milano
You can find me at: marco.scavuzzo@polimi.it
Credits: Presentation template by SlidesCarnival