MySQL Applier for Apache Hadoop: Real-Time Event Streaming to HDFS



DESCRIPTION

This presentation from MySQL Connect gives a brief introduction to Big Data and the tooling used to gain insights into your data. It also introduces an experimental prototype of the MySQL Applier for Hadoop, which can be used to incorporate changes from MySQL into HDFS using the replication protocol.

TRANSCRIPT

Page 1: MySQL Applier for Apache Hadoop: Real-Time Event Streaming to HDFS

Copyright © 2013, Oracle and/or its affiliates. All rights reserved.

MySQL Applier for Apache Hadoop

Real-Time Event Streaming to HDFS

Mats Kindahl, Neha Kumari, Shubhangi Garg

2013-09-21

Page 2: MySQL Applier for Apache Hadoop: Real-Time Event Streaming to HDFS

Safe Harbor Statement

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.

Page 3: MySQL Applier for Apache Hadoop: Real-Time Event Streaming to HDFS

Presentation Outline

● Why Big Data?

● Working with Big Data

● MySQL Applier for Hadoop

● Road map

Page 4: MySQL Applier for Apache Hadoop: Real-Time Event Streaming to HDFS

Why Big Data?

Traditional Approach

● Reporting
  ● Predefined data

● Viewing history
  ● Past occurrences

● Using Sales Data
  ● Typically in database

Big Data

● Analytics
  ● Data mining

● Predicting future
  ● Trends

● Using all available data
  ● Sales
  ● Click stream
  ● Likes/Tweets

Page 5: MySQL Applier for Apache Hadoop: Real-Time Event Streaming to HDFS

Why Big Data?

● Web Recommendations

● Sentiment Analysis

● Marketing Campaign Analysis

● Customer Churn Modeling

● Fraud Detection

● Research and Development

● Risk Modeling

● Machine Learning

90% with Pilot Projects at end of 2012

Poor Data Costs 35% in Annual Revenues

10% Improvement in Data Usability Drives $2bn in Revenue

Source: http://wikibon.org/blog/big-data-statistics/

Page 6: MySQL Applier for Apache Hadoop: Real-Time Event Streaming to HDFS

Why Hadoop?

● Scales to thousands of nodes
  ● Combines data from multiple sources
  ● Handles unstructured data
  ● Run queries against all of the data

● Runs on commodity servers
  ● Easy to set up
  ● Affordable

● Fault-tolerant
  ● File block replication
  ● Self-healing

● Map/Reduce
  ● Distributed processing model
  ● Good for large data sets

Page 7: MySQL Applier for Apache Hadoop: Real-Time Event Streaming to HDFS

Example Use-Case: On-Line Retail

[Diagram labels: Customers, Browsing, Purchases, Purchase History, Recommendations, Updates, Preferences, Brands “Liked”, Web Logs, Page Views, Comments]

Page 8: MySQL Applier for Apache Hadoop: Real-Time Event Streaming to HDFS

Big Data Lifecycle

[Diagram: lifecycle stages Acquire, Organize, Analyze, Decide, with the Applier marked in the cycle]

Page 9: MySQL Applier for Apache Hadoop: Real-Time Event Streaming to HDFS

Hadoop Tools: In the Lifecycle

● Apache Sqoop

● MySQL Applier for Hadoop

● Apache Flume

● Apache Drill

● Apache Hive

● Apache Pig

Page 10: MySQL Applier for Apache Hadoop: Real-Time Event Streaming to HDFS

Hadoop Tools: Apache Sqoop

● Apache top-level project
  ● Part of Hadoop project
  ● Developed by Cloudera

● Bulk data import and export
  ● Between Hadoop HDFS and external data stores

● Supports JDBC connector architecture
  ● Supports plug-ins for specific functionality
  ● “Fast-path” connector for MySQL

Page 11: MySQL Applier for Apache Hadoop: Real-Time Event Streaming to HDFS

Hadoop Tools: Apache Sqoop

[Diagram: multiple Sqoop jobs and the Hadoop cluster]

Page 12: MySQL Applier for Apache Hadoop: Real-Time Event Streaming to HDFS

Hadoop Tools: Apache Flume

● Apache top-level project
  ● Part of Hadoop project

● Collecting log data
  ● Various sources: Avro, Thrift, Syslog, Netcat
  ● Can aggregate and consolidate data

● Data typically sent to HDFS
  ● Can store data in other “sinks” as well

Page 13: MySQL Applier for Apache Hadoop: Real-Time Event Streaming to HDFS

Hadoop Tools: Apache Flume

[Diagram: Source, Channel, Sink, HDFS]

Page 14: MySQL Applier for Apache Hadoop: Real-Time Event Streaming to HDFS

New Tool: MySQL Applier for Hadoop

● Using Binlog API
  ● Proof of concept

● Replication from MySQL to HDFS
  ● Exploit replication protocol
  ● Read server binary log

● Fetches changes from MySQL
  ● Using Binary Log API
  ● Row-based replication
  ● Caveat: DDL not handled

● Stores changes into HDFS
  ● Consumable by other tools
  ● Caveat: only row inserts
  ● Considering update/delete

Page 15: MySQL Applier for Apache Hadoop: Real-Time Event Streaming to HDFS

New Tool: MySQL Applier for Hadoop

[Diagram labels: Binary Log Events, Binlog API, MySQL Applier for Hadoop, Decode Row (Timestamp, Primary Key, Data), libhdfs, HDFS]

Page 16: MySQL Applier for Apache Hadoop: Real-Time Event Streaming to HDFS

MySQL Applier for Hadoop: Requirements

● MySQL 5.6 or later
  ● Available at http://dev.mysql.com/downloads/mysql

● MySQL Applier for Hadoop
  ● Available at http://labs.mysql.com

● Apache Hadoop 1.0.4 or later
  ● Available at http://hadoop.apache.org/releases.html

● Apache Hive or other Hadoop Tool for analysis
  ● Available at http://hive.apache.org/releases.html

Page 17: MySQL Applier for Apache Hadoop: Real-Time Event Streaming to HDFS

MySQL Applier for Hadoop: Mapping Rows

● Timestamp column is added first in table

● Timestamp from binary log

MySQL:

INSERT INTO test.tbl VALUES
  (23456,'Sanjai','Feldhoffer'),
  (23457,'Manohar','Kakkar'),
  (23458,'Christ','Kalefeld'),
  (23459,'Gretta','Varker'),
  (23460,'Masato','Steinauer'),
  (23461,'Baruch','Uchoa');

HDFS:

1379361681,23456,Sanjai,Feldhoffer
1379361685,23457,Manohar,Kakkar
1379361692,23458,Christ,Kalefeld
1379361693,23459,Gretta,Varker
1379361699,23460,Masato,Steinauer
1379361703,23461,Baruch,Uchoa
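The row mapping on this slide can be illustrated with a small Python sketch. It is not the Applier's code: the function name `format_row` and its parameters are hypothetical, and it only assumes what the slide shows, namely that the binary-log timestamp is placed first and the fields are joined with the field delimiter.

```python
def format_row(timestamp, values, field_delimiter=","):
    """Build one text line: binary-log timestamp first, then the row's fields.

    Hypothetical helper for illustration; not part of the Applier.
    """
    fields = [str(timestamp)] + [str(v) for v in values]
    return field_delimiter.join(fields)


# First row from the slide's example.
line = format_row(1379361681, (23456, "Sanjai", "Feldhoffer"))
print(line)  # 1379361681,23456,Sanjai,Feldhoffer
```

A different delimiter can be passed as `field_delimiter`, which mirrors the idea of a configurable field delimiter without claiming to match the tool's option handling.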

Page 18: MySQL Applier for Apache Hadoop: Real-Time Event Streaming to HDFS

MySQL Applier for Hadoop: Using Hive

SQL:

CREATE TABLE tbl (
  user_id INT PRIMARY KEY,
  first CHAR(60),
  last CHAR(60)
);

HDFS:

CREATE TABLE tbl (
  ts INT,
  user_id INT,
  first STRING,
  last STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  STORED AS TEXTFILE;

● Does not handle DDL
  ● Create table manually as above

● MySQL Applier field and row delimiter can be controlled
  ● --field-delimiter
  ● --row-delimiter

Page 19: MySQL Applier for Apache Hadoop: Real-Time Event Streaming to HDFS

MySQL Applier for Hadoop

● Start MySQL Applier for Hadoop

happlier --field-delimiter=, \
  mysql://[email protected] hdfs://example.com:9000

● Inserts written to files in warehouse directory
  ● Default: /user/hive/warehouse

● MySQL Table: test.tbl
  HDFS: /user/hive/warehouse/test.db/tbl/datafile1.txt
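The table-to-file mapping shown on this slide can be sketched in Python. The helper name `hdfs_data_path` and its parameters are hypothetical; the layout (warehouse directory, `<database>.db`, table name, data file) is taken from the single example on the slide, so other cases are an assumption.

```python
import posixpath


def hdfs_data_path(database, table,
                   warehouse="/user/hive/warehouse",
                   datafile="datafile1.txt"):
    """Return the HDFS path for a MySQL table, following the slide's example.

    Hypothetical helper; the real tool may name its data files differently.
    """
    return posixpath.join(warehouse, database + ".db", table, datafile)


print(hdfs_data_path("test", "tbl"))
# /user/hive/warehouse/test.db/tbl/datafile1.txt
```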

Page 20: MySQL Applier for Apache Hadoop: Real-Time Event Streaming to HDFS

MySQL Applier for Hadoop: Update and Delete?

● Batch import using Sqoop
  ● Transfer all data each time
  ● If changes are small, bandwidth is wasted

[Diagram: Sqoop and the Hadoop rack]

Page 21: MySQL Applier for Apache Hadoop: Real-Time Event Streaming to HDFS

MySQL Applier for Hadoop: Update and Delete?

● Batch import using Sqoop
  ● Transfer all data each time
  ● If changes are small, bandwidth is wasted

● Incremental import using Applier
  ● Only changes imported
  ● Bandwidth is used efficiently
  ● … but what about updates and deletes?

[Diagram: Applier and the Hadoop rack]

Page 22: MySQL Applier for Apache Hadoop: Real-Time Event Streaming to HDFS

MySQL Applier for Hadoop: Update and Delete?

● Problem:
  ● HDFS is append-only
  ● Rows inserted are appended to file
  ● How can rows be updated or deleted?

● Idea:
  ● Rows updated are appended to file
  ● Rows have primary key
  ● Rows contain after-image and timestamp of update
  ● For each primary key, pick row with latest timestamp
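The "pick row with latest timestamp" idea can be sketched in a few lines of Python. This is only an illustration of the selection rule, not the Applier's behavior: the function `latest_rows` is hypothetical, rows are assumed to be tuples of (timestamp, primary key, data...), and the update row in the example is invented for the demonstration.

```python
def latest_rows(rows):
    """Map each primary key to its row with the latest timestamp.

    rows: iterable of (ts, key, *data) tuples in append order.
    On equal timestamps the row appended later wins.
    """
    latest = {}
    for row in rows:
        ts, key = row[0], row[1]
        if key not in latest or ts >= latest[key][0]:
            latest[key] = row
    return latest


appended = [
    (1379361681, 23456, "Sanjai", "Feldhoffer"),   # insert
    (1379361685, 23457, "Manohar", "Kakkar"),      # insert
    (1379361800, 23456, "Sanjay", "Feldhoffer"),   # hypothetical update (after-image)
]
current = latest_rows(appended)
print(current[23456])  # (1379361800, 23456, 'Sanjay', 'Feldhoffer')
```

The file itself is never rewritten; readers derive the current state from the appended history, which is what makes the approach compatible with append-only storage.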

Page 23: MySQL Applier for Apache Hadoop: Real-Time Event Streaming to HDFS

MySQL Applier for Hadoop: Update and Delete?

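[Diagram: Applier and the Hadoop rack]

● Timestamped rows to HDFS
  ● After image for updates
  ● Flag deletes

● Customized HiveQL queries

SELECT … FROM tbl t
JOIN (SELECT key, MAX(ts) AS max_ts FROM tbl GROUP BY key) latest
  ON t.key = latest.key AND t.ts = latest.max_ts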

Page 24: MySQL Applier for Apache Hadoop: Real-Time Event Streaming to HDFS

MySQL Applier for Hadoop: Update and Delete?

[Diagram labels: Applier, Dirty, Cleaning Job, Clean, Hadoop Rack]

● Timestamped rows to HDFS
  ● After image for updates
  ● Flag deletes

● Special “cleaning” job
  ● Read dirty files
  ● Write clean files
  ● Moving data inside the rack uses bandwidth efficiently
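A cleaning job of this kind might look like the following Python sketch. The slides do not define the dirty-file format, so everything here is an assumption made for illustration: the function name `clean`, the extra flag column after the timestamp (`I` insert, `U` update, `D` delete), and the sample update and delete lines.

```python
def clean(dirty_lines, delimiter=","):
    """Compact dirty lines into clean lines: one line per live primary key.

    Assumed dirty format (hypothetical): ts,flag,key,data...
    Assumed clean format: ts,key,data...
    """
    latest = {}
    for line in dirty_lines:
        ts, flag, key, *data = line.rstrip("\n").split(delimiter)
        ts = int(ts)
        # Keep the newest entry per key; later lines win on equal timestamps.
        if key not in latest or ts >= latest[key][0]:
            latest[key] = (ts, flag, data)
    clean_lines = []
    for key, (ts, flag, data) in sorted(latest.items()):
        if flag != "D":  # rows whose newest entry is a delete are dropped
            clean_lines.append(delimiter.join([str(ts), key] + data))
    return clean_lines


dirty = [
    "1379361681,I,23456,Sanjai,Feldhoffer",
    "1379361685,I,23457,Manohar,Kakkar",
    "1379361800,U,23456,Sanjay,Feldhoffer",  # hypothetical update
    "1379361900,D,23457,Manohar,Kakkar",     # hypothetical delete
]
print(clean(dirty))  # ['1379361800,23456,Sanjay,Feldhoffer']
```

In a real deployment the same logic would run as a job inside the cluster, reading and writing HDFS files rather than Python lists, so that only the dirty data crosses the network from MySQL.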

Page 25: MySQL Applier for Apache Hadoop: Real-Time Event Streaming to HDFS

MySQL and Hadoop: Resources and Information

● MySQL and Hadoop: Guide to Big Data Integration

http://www.mysql.com/why-mysql/white-papers/mysql-and-hadoop-guide-to-big-data-integration

● MySQL Applier for Hadoop

http://dev.mysql.com/tech-resources/articles/mysql-hadoop-applier.html

● Developer Blogs
  ● Mats Kindahl: http://mysqlmusings.blogspot.com
  ● Shubhangi Garg: http://innovating-technology.blogspot.in
  ● Neha Kumari: http://nehakumari19.blogspot.in

Page 26: MySQL Applier for Apache Hadoop: Real-Time Event Streaming to HDFS

Thank you!