HUG France Feb 2016 - Migrating structured data between Hadoop and RDBMS, by Louis Rabiet...

February 16th 2016

Migrating structured data between Hadoop and RDBMS

Who am I?

• Full Stack engineer at Squid Solutions.
• Specialised in Big Data.
• Fun fact: sleeping by myself in my tent on top of the highest mountains in the world.

What do I do?

• Development of an analytics toolbox.
• No setup. No SQL. No compromise.
• Generate SQL with a REST API.

It is open source! https://github.com/openbouquet

Topic of today

• You need scalability?
• You need a machine learning toolbox?
Hadoop is the solution.

• But you still need structured data?
Our tool provides a solution.

=> We need both!

What does that mean?

• Creation of a dataset in Bouquet
• Send the dataset to Spark
• Enrich it inside Spark
• Re-injection into the original database

How do we do it?

[Diagram: User input, Relational DB, Bouquet, Spark; step "Create and Send"]

How does it work?

[Diagram: Bouquet, Relational DB, Spark, HDFS/Tachyon, Hive Metastore, Kafka]

The user selects the data. Bouquet generates the corresponding SQL code.
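As a rough illustration of this step, the sketch below turns a user selection into a SQL string. It is not Bouquet's generator: the helper `build_select`, the table name `artist`, and the selection format are assumptions made for the example, and identifiers are not escaped.

```python
def build_select(table, dimensions, metrics):
    """Build a simple aggregate query from a selection (hypothetical helper).

    dimensions: column names to group by.
    metrics: (alias, SQL expression) pairs.
    Identifiers are inserted verbatim, so this is only safe for trusted input.
    """
    select_parts = list(dimensions) + [f"{expr} AS {alias}" for alias, expr in metrics]
    sql = f"SELECT {', '.join(select_parts)} FROM {table}"
    if dimensions:
        sql += f" GROUP BY {', '.join(dimensions)}"
    return sql


# Assumed selection: number of artists per gender.
query = build_select("artist", ["gender"], [("count", "COUNT(*)")])
print(query)
```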

How does it work?


Data is read from the SQL database.
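A minimal sketch of reading query results in batches, using Python's built-in `sqlite3` as a stand-in for the JDBC connection used by Bouquet. The table, the sample rows, and the helper `read_rows` are assumptions for illustration only.

```python
import sqlite3


def read_rows(conn, sql, batch_size=1000):
    """Yield each result row as a dict, fetching batch_size rows at a time."""
    cur = conn.execute(sql)
    columns = [d[0] for d in cur.description]
    while True:
        batch = cur.fetchmany(batch_size)
        if not batch:
            break
        for row in batch:
            yield dict(zip(columns, row))


# In-memory database with made-up sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE artist (name TEXT, gender TEXT)")
conn.executemany(
    "INSERT INTO artist VALUES (?, ?)",
    [("a", "F"), ("b", "M"), ("c", "F")],
)

rows = list(read_rows(
    conn,
    "SELECT gender, COUNT(*) AS count FROM artist GROUP BY gender ORDER BY gender",
    batch_size=2,
))
print(rows)
```

Fetching in batches keeps memory bounded on large result sets, whatever the driver.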

How does it work?


The BI tool creates an Avro schema and sends the data to Kafka.
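One way to picture the schema creation is a mapping from SQL column types to Avro primitive types. The sketch below covers only that part (it does not send anything to Kafka); the type table `SQL_TO_AVRO` and the helper `avro_schema` are assumptions, not Bouquet's actual conversion rules.

```python
import json

# Assumed, partial mapping from SQL types to Avro primitive types.
SQL_TO_AVRO = {
    "INTEGER": "int",
    "BIGINT": "long",
    "DOUBLE": "double",
    "BOOLEAN": "boolean",
    "VARCHAR": "string",
}


def avro_schema(name, columns):
    """Build an Avro record schema (as a dict) from (column, SQL type) pairs."""
    fields = [{"name": col, "type": SQL_TO_AVRO[sql_type]} for col, sql_type in columns]
    return {"type": "record", "name": name, "fields": fields}


schema = avro_schema("ArtistGender", [("count", "BIGINT"), ("gender", "VARCHAR")])
print(json.dumps(schema, indent=2))
```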

How does it work?


The Kafka broker(s) receive the data.

How does it work?


The Hive metastore is updated and the HDFS connector writes into HDFS.
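In the pipeline described here the connector takes care of the Hive side. To give an idea of the kind of table definition that corresponds to an Avro schema, the sketch below derives a HiveQL statement from one. The type mapping, the lower-cased table name, and the location `/topics/artistgender` are assumptions for illustration, not the connector's actual output.

```python
# Assumed, partial mapping from Avro primitive types to Hive column types.
AVRO_TO_HIVE = {
    "int": "INT",
    "long": "BIGINT",
    "float": "FLOAT",
    "double": "DOUBLE",
    "boolean": "BOOLEAN",
    "string": "STRING",
}


def hive_ddl(schema, location):
    """Return a CREATE TABLE statement for a flat Avro record schema."""
    cols = ", ".join(
        f"`{f['name']}` {AVRO_TO_HIVE[f['type']]}" for f in schema["fields"]
    )
    table = schema["name"].lower()
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} ({cols}) "
        f"STORED AS AVRO LOCATION '{location}'"
    )


schema = {
    "type": "record",
    "name": "ArtistGender",
    "fields": [
        {"name": "count", "type": "long"},
        {"name": "gender", "type": "string"},
    ],
}
ddl = hive_ddl(schema, "/topics/artistgender")  # hypothetical HDFS path
print(ddl)
```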

How to keep the data structured?

Use a schema registry (Avro in Kafka). Each schema has a corresponding Kafka topic and a distinct Hive table.

{
  "type": "record",
  "name": "ArtistGender",
  "fields": [
    {"name": "count", "type": "long"},
    {"name": "gender", "type": "string"}
  ]
}
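To show what "keeping the data structured" buys, the sketch below checks records against the ArtistGender schema before they would be sent. It is a simplified stand-in for what an Avro serializer enforces: the helper `conforms` and the Python type table are assumptions, and only flat records with primitive fields are handled.

```python
# Assumed mapping from Avro primitive types to Python types.
PYTHON_TYPES = {"int": int, "long": int, "double": float, "boolean": bool, "string": str}

SCHEMA = {
    "type": "record",
    "name": "ArtistGender",
    "fields": [
        {"name": "count", "type": "long"},
        {"name": "gender", "type": "string"},
    ],
}


def conforms(record, schema):
    """Return True if record has exactly the schema's fields with matching types."""
    fields = schema["fields"]
    if set(record) != {f["name"] for f in fields}:
        return False
    for f in fields:
        value = record[f["name"]]
        expected = PYTHON_TYPES[f["type"]]
        # bool is a subclass of int in Python, so reject it for numeric fields.
        if expected is int and isinstance(value, bool):
            return False
        if not isinstance(value, expected):
            return False
    return True


print(conforms({"count": 2, "gender": "F"}, SCHEMA))
print(conforms({"count": "2", "gender": "F"}, SCHEMA))
```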

Challenges

- Automatic creation of topics/tables in Hive for each dataset from Bouquet.
- JDBC reads are too slow for something like Kafka.
- Issues with type conversion: for example, null is not supported in all cases (issue 272 on schema-registry).
- Versions: Kafka 0.9.0, Tachyon 0.7.1, Spark 1.5.2 with HortonWorks 2.3.4 (Dec 2015).
- Hive: setting the warehouse directory.
- Tachyon: setting up the hostname.
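On the null point: in Avro, a field that may be null is declared as a union with `"null"`, and when the default is null the `"null"` branch comes first. The sketch below rewrites a flat schema that way. It only illustrates the schema side and is not a fix for the schema-registry issue cited above; the helper `make_nullable` is an assumption.

```python
def make_nullable(schema):
    """Return a copy of a flat record schema where every field accepts null."""
    fields = []
    for f in schema["fields"]:
        t = f["type"]
        if isinstance(t, list):
            # Already a union: make sure "null" is present and listed first.
            union = ["null"] + [branch for branch in t if branch != "null"]
        else:
            union = ["null", t]
        fields.append({**f, "type": union, "default": None})
    return {**schema, "fields": fields}


SCHEMA = {
    "type": "record",
    "name": "ArtistGender",
    "fields": [
        {"name": "count", "type": "long"},
        {"name": "gender", "type": "string"},
    ],
}

nullable = make_nullable(SCHEMA)
print(nullable["fields"])
```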

Tachyon?

• Used as an in-memory filesystem to replace HDFS.
• Interacts with Spark through the HDFS plugin.
• Transparent from the user's point of view.

Status

Injection SQL -> Spark: OK
Spark usage: OK
Re-injection: in alpha stage.

Re-injection

Two solutions:
• The Spark user notifies Bouquet that data has changed (using a custom function).
• Bouquet pulls the data from Spark.
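The two options can be contrasted with a toy model: a push style where the data side calls back when something changes, and a pull style where the consumer polls and compares versions. Everything in this sketch (`DatasetStore`, `Puller`, the version counter) is hypothetical and only illustrates the trade-off; it says nothing about how Bouquet or Spark implement it.

```python
class DatasetStore:
    """Stand-in for the Spark side holding an enriched dataset."""

    def __init__(self):
        self.version = 0
        self.rows = []
        self._listeners = []

    def subscribe(self, callback):
        """Push style: register a function to call after each change."""
        self._listeners.append(callback)

    def update(self, rows):
        self.rows = list(rows)
        self.version += 1
        for callback in self._listeners:
            callback(self.version, self.rows)


class Puller:
    """Pull style: the consumer polls and fetches only when the version moved."""

    def __init__(self, store):
        self.store = store
        self.seen_version = 0

    def poll(self):
        if self.store.version > self.seen_version:
            self.seen_version = self.store.version
            return self.store.rows
        return None


store = DatasetStore()
notifications = []
store.subscribe(lambda version, rows: notifications.append((version, len(rows))))
puller = Puller(store)

first_poll = puller.poll()          # nothing has changed yet
store.update([{"gender": "F", "count": 2}])
second_poll = puller.poll()         # picks up the new rows
third_poll = puller.poll()          # no change since the last poll
```

Push gives immediate updates but requires the Spark user to call the notification function; pull needs no cooperation but adds polling latency.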

We use it for real!

We are collaborating with La Poste to use Spark and the re-injection mechanism together with Bouquet and a geographical visualisation.

In the future

• Notebook integration
• We have a DSL for the Bouquet API; we may want built-in Spark support.
• Improve scalability (bulk unload and Kafka fine-tuning)

QUESTIONS OPENBOUQUET.IO

Bouquet Architecture

[Diagram labels: DB, HD, Bouquet Server, SQL DATA, JDBC, Dynamic Caching & Indexing, REST API, Business Modeling, OAuth2, Generic Apps, Multi-Tenant, REDIS, Elastic, MongoDB, JS/SDK Custom Apps, Internet]