HUG France Feb 2016 - Migrating structured data between Hadoop and RDBMS, by Louis Rabiet...
TRANSCRIPT
February 16th 2016 [email protected]
Migrating structured data between Hadoop and RDBMS
Who am I?
• Full-stack engineer at Squid Solutions.
• Specialised in big data.
• Fun fact: I sleep by myself in my tent on top of the highest mountains in the world.
What do I do?
• Develop an analytics toolbox.
• No setup. No SQL. No compromise.
• Generate SQL with a REST API.
It is open source! https://github.com/openbouquet
Topic of today
• You need scalability?
• You need a machine learning toolbox? Hadoop is the solution.
• But you still need structured data? Our tool provides a solution.
=> We need both!
What does that mean?
• Create a dataset in Bouquet
• Send the dataset to Spark
• Enrich it inside Spark
• Re-inject it into the original database
How do we do it?
[Diagram: User input, Relational DB, Bouquet, Spark; action: Create and Send]
How does it work?
[Diagram: Bouquet, Relational DB, Kafka, Spark, HDFS/Tachyon, Hive Metastore]
The user selects the data. Bouquet generates the corresponding SQL code.
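As a rough illustration of this step, the sketch below turns a user selection into a SQL query. It is not Bouquet's actual generator: the `build_sql` helper, the `artist` table, and the selection format are all hypothetical, and identifiers are neither quoted nor escaped.

```python
def build_sql(table, dimensions, metrics):
    """Build a simple aggregate query from a selection.

    table:      source table name (hypothetical)
    dimensions: column names to group by
    metrics:    mapping of output alias -> SQL aggregate expression
    """
    select_parts = list(dimensions)
    select_parts += [f"{expr} AS {alias}" for alias, expr in metrics.items()]
    sql = f"SELECT {', '.join(select_parts)} FROM {table}"
    if dimensions:
        sql += f" GROUP BY {', '.join(dimensions)}"
    return sql


# Hypothetical selection: number of artists per gender.
query = build_sql("artist", ["gender"], {"count": "COUNT(*)"})
print(query)
```

A real generator would also need to handle SQL dialects, joins, and filters, none of which are shown here.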
How does it work?
Data is read from the SQL database.
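A minimal sketch of the read step, using Python's built-in `sqlite3` module as a stand-in for the JDBC source (Bouquet itself reads over JDBC). The `artist` table and the `read_rows` helper are assumptions made for illustration. Rows are fetched in batches so the whole result set is never held in memory at once.

```python
import sqlite3


def read_rows(conn, sql, batch_size=1000):
    """Yield each result row as a dict, fetching batch_size rows at a time."""
    cursor = conn.execute(sql)
    columns = [description[0] for description in cursor.description]
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        for row in batch:
            yield dict(zip(columns, row))


# In-memory database with made-up sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE artist (name TEXT, gender TEXT)")
conn.executemany(
    "INSERT INTO artist VALUES (?, ?)",
    [("a", "female"), ("b", "male"), ("c", "female")],
)

rows = list(read_rows(
    conn,
    "SELECT gender, COUNT(*) AS count FROM artist GROUP BY gender ORDER BY gender",
    batch_size=2,
))
print(rows)
```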
How does it work?
The BI tool creates an Avro schema and sends the data to Kafka.
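The schema-creation part of this step could look roughly like the sketch below, which derives an Avro record schema from SQL column types. The `SQL_TO_AVRO` mapping is a partial, hypothetical one and `avro_schema` is an illustrative helper, not Bouquet's code. Producing to Kafka is left out because it needs a client library and a running broker.

```python
import json

# Partial, hypothetical mapping from SQL types to Avro primitive types.
SQL_TO_AVRO = {
    "BIGINT": "long",
    "INTEGER": "int",
    "DOUBLE": "double",
    "BOOLEAN": "boolean",
    "VARCHAR": "string",
}


def avro_schema(record_name, columns):
    """Build an Avro record schema from (column name, SQL type) pairs."""
    fields = [
        {"name": name, "type": SQL_TO_AVRO[sql_type]}
        for name, sql_type in columns
    ]
    return {"type": "record", "name": record_name, "fields": fields}


schema = avro_schema("ArtistGender", [("count", "BIGINT"), ("gender", "VARCHAR")])
print(json.dumps(schema, indent=2))
```

An unmapped SQL type raises `KeyError` here, which makes gaps in the mapping visible instead of silently guessing a type.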
How does it work?
The Kafka broker(s) receive the data.
How does it work?
The Hive metastore is updated and the HDFS connector writes into HDFS.
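For reference, a configuration for this step might look like the sketch below. It assumes the HDFS connector is the Confluent HDFS sink connector for Kafka Connect, which the slides do not state explicitly. The connector name, topic, host names, and ports are placeholders. The small `to_properties` helper only renders the settings as a properties file.

```python
# Settings for a Kafka Connect HDFS sink with Hive integration.
# All values are placeholders; the connector choice is an assumption.
connector = {
    "name": "bouquet-hdfs-sink",
    "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
    "tasks.max": "1",
    "topics": "artist_gender",
    "hdfs.url": "hdfs://namenode:8020",
    "flush.size": "1000",
    # Create and update the Hive table as data is written.
    "hive.integration": "true",
    "hive.metastore.uris": "thrift://metastore:9083",
    # Hive integration needs a schema compatibility mode other than NONE.
    "schema.compatibility": "BACKWARD",
}


def to_properties(config):
    """Render a flat dict as Java-style key=value lines."""
    return "\n".join(f"{key}={value}" for key, value in config.items())


properties_text = to_properties(connector)
print(properties_text)
```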
How to keep the data structured?
Use a schema registry (Avro in Kafka). Each schema has a corresponding Kafka topic and a distinct Hive table.
{ "type": "record", "name": "ArtistGender", "fields": [ {"name": "count", "type": "long"}, {"name": "gender", "type": "string"} ] }
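One way to keep the schema, topic, and table aligned is to derive all three names from the dataset name, as in the sketch below. The `kafka_names` helper and its sanitising rule are hypothetical. The `<topic>-value` subject name follows the schema registry's default convention for value schemas.

```python
import re


def kafka_names(dataset_name):
    """Derive a topic, registry subject, and Hive table name from a dataset name.

    Only lowercase letters, digits, and underscores are kept, so the same
    string is acceptable both as a Kafka topic and as a Hive table name.
    """
    safe = re.sub(r"[^a-z0-9_]+", "_", dataset_name.lower()).strip("_")
    if not safe:
        raise ValueError(f"cannot derive a name from {dataset_name!r}")
    return {"topic": safe, "subject": f"{safe}-value", "hive_table": safe}


names = kafka_names("Artist Gender")
print(names)
```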
Challenges
- Auto-creation of topics/tables in Hive for each dataset from Bouquet.
- JDBC reads are too slow for something like Kafka.
- Issues with type conversion: for example, null is not supported in all cases (issue 272 on schema-registry).
- Versions: Kafka 0.9.0, Tachyon 0.7.1, Spark 1.5.2 with HortonWorks 2.3.4 (Dec 2015).
- Hive: setting the warehouse directory.
- In Tachyon: setting up the hostname.
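On the null handling point: a common way to represent a nullable SQL column in Avro is a union with `"null"` as the first branch and a `null` default. The sketch below applies that to every field of a record schema. It illustrates the general technique only and is not presented as the fix for the issue mentioned above. `make_nullable` is a hypothetical helper.

```python
def make_nullable(schema):
    """Return a copy of an Avro record schema in which every field accepts null.

    "null" is placed first in each union because an Avro default value
    must match the first branch of the union.
    """
    fields = []
    for field in schema["fields"]:
        field_type = field["type"]
        if isinstance(field_type, list):
            # Already a union: move "null" to the front (adding it if absent).
            union = ["null"] + [branch for branch in field_type if branch != "null"]
        else:
            union = ["null", field_type]
        fields.append({**field, "type": union, "default": None})
    return {**schema, "fields": fields}


schema = {
    "type": "record",
    "name": "ArtistGender",
    "fields": [
        {"name": "count", "type": "long"},
        {"name": "gender", "type": "string"},
    ],
}
nullable = make_nullable(schema)
print(nullable["fields"])
```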
Tachyon?
• Use it as an in-memory filesystem to replace HDFS.
• Interact with Spark using the HDFS plugin.
• Transparent from the user's point of view.
Status
Injection SQL -> Spark: OK
Spark usage: OK
Re-injection: in alpha stage.
Re-injection
Two solutions:
• The Spark user notifies Bouquet that data has changed (using a custom function)
• Bouquet pulls the data from Spark
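The difference between the two options can be sketched with plain Python objects. Everything here is hypothetical: `SparkDataset` stands in for data enriched in Spark, `BouquetSide` for the re-injection target, and the version counter is just one possible way to detect changes.

```python
class SparkDataset:
    """Stand-in for a dataset enriched in Spark; version grows on every change."""

    def __init__(self):
        self.version = 0
        self.rows = []
        self.listeners = []

    def update(self, rows):
        self.rows = list(rows)
        self.version += 1
        # Option 1: the Spark side notifies every registered listener.
        for notify in self.listeners:
            notify(self)


class BouquetSide:
    """Stand-in for the side that re-injects data into the database."""

    def __init__(self):
        self.loaded_version = 0
        self.rows = []

    def on_changed(self, dataset):
        """Option 1 (push): called by the Spark side when data has changed."""
        self._load(dataset)

    def pull(self, dataset):
        """Option 2 (pull): check for a newer version; return True if loaded."""
        if dataset.version > self.loaded_version:
            self._load(dataset)
            return True
        return False

    def _load(self, dataset):
        self.rows = list(dataset.rows)
        self.loaded_version = dataset.version


dataset = SparkDataset()
pushed = BouquetSide()
pulled = BouquetSide()
dataset.listeners.append(pushed.on_changed)

dataset.update([{"gender": "female", "count": 2}])
print(pushed.rows)  # already loaded through the notification
print(pulled.rows)  # still empty until pull() is called
```

The push variant reacts immediately but requires the Spark user to call the notification. The pull variant needs no cooperation from Spark but only sees changes when it polls.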
We use it for real!
We are collaborating with La Poste to use Spark and the re-injection mechanism together with Bouquet and a geographical visualisation.
In the future
• Notebook integration
• We have a DSL for the Bouquet API; we may want built-in Spark support.
• Improve scalability (bulk unload and Kafka fine-tuning)
QUESTIONS OPENBOUQUET.IO
Bouquet Architecture
[Architecture diagram. Labels: DB, HD, SQL data, JDBC, Bouquet Server, Dynamic Caching & Indexing, REST API, Business Modeling, OAuth2, Multi-Tenant, Redis, Elastic, MongoDB, JS/SDK, Generic Apps, Custom Apps]