From Oracle to Hadoop: Unlocking Hadoop for Your RDBMS with
Apache Sqoop and Other Tools
Guy Harrison, David Robson, Kate Ting
{guy.harrison, david.robson}@software.dell.com, [email protected]
October 16, 2014
About Guy, David, & Kate
Guy Harrison @guyharrison
- Executive Director of R&D @ Dell
- Author of Oracle Performance Survival Guide & MySQL Stored Procedure Programming
David Robson @DavidR021
- Principal Technologist @ Dell
- Sqoop Committer, Lead on Toad for Hadoop & OraOop
Kate Ting @kate_ting
- Technical Account Mgr @ Cloudera
- Sqoop Committer/PMC, Co-author of Apache Sqoop Cookbook
RDBMS and Hadoop
The relational database reigned supreme for more than two decades
Hadoop and other non-relational tools have overthrown that hegemony
We are unlikely to return to a “one size fits all” model based on Hadoop
- Though some will try
For the foreseeable future, enterprise information architectures will include relational and non-relational stores
Scenarios
1. We need to access RDBMS to make sense of Hadoop data
[Diagram labels: HDFS, Analytic output, Weblogs, RDBMS, Products, Flume, SQOOP, YARN/MR1]
Scenarios
1. Reference data is in the RDBMS
2. We want to run analysis outside of the RDBMS
[Diagram labels: HDFS, Analytic output, RDBMS, Products, Sales, SQOOP, YARN/MR1]
Scenarios
1. Reference data is in the RDBMS
2. We want to run analysis outside of the RDBMS
3. Feeding YARN/MR output into RDBMS
[Diagram labels: HDFS, Analytic output, Weblogs, RDBMS, Weblog Summary, Flume, SQOOP, YARN/MR1]
Scenarios
1. We need to access RDBMS to make sense of Hadoop data
2. We want to use Hadoop to analyse RDBMS data
3. Hadoop output belongs in RDBMS Data warehouse
4. We archive old RDBMS data to Hadoop
[Diagram labels: HDFS, BI platform, RDBMS, Sales, SQOOP, HQL, Old Sales, SQL]
SQOOP
SQOOP was created in 2009 by Aaron Kimball as a means of moving data between SQL databases and Hadoop
It provided a generic implementation for moving data
It also provided a framework for implementing database-specific optimized connectors
How SQOOP works (import)
[Diagram: SQOOP reads table metadata from the RDBMS and generates Table.java; map tasks read table data through DataDrivenDBInputFormat and write HDFS files through FileOutputFormat; Hive DDL defines the Hive table over the HDFS files]
SQOOP & Oracle
SQOOP issues with Oracle
SQOOP uses primary key ranges to divide up data between mappers
However, deletes hit older key values harder, making key ranges unbalanced.
Data is almost never arranged on disk in key order, so index scans collide on disk
Load is unbalanced, and IO block requests >> blocks in the table.
[Diagram: ORACLE TABLE on DISK. Two mappers, each with its own Oracle session, run index range scans over the index blocks: one with ID > 0 and ID < MAX/2, the other with ID > MAX/2]
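The imbalance described above can be illustrated with a small sketch. This is not Sqoop's code: the function names and the toy table (IDs 1 to 1000, with 90% of the older half deleted) are assumptions made purely for illustration. It splits the key space into equal-width ranges and counts how many surviving rows land in each range.

```python
def key_range_splits(min_id, max_id, num_mappers):
    """Divide [min_id, max_id] into contiguous, equal-width half-open key ranges."""
    width = (max_id - min_id + 1) / num_mappers
    bounds = [min_id + round(i * width) for i in range(num_mappers)]
    bounds.append(max_id + 1)
    return [(bounds[i], bounds[i + 1]) for i in range(num_mappers)]


def rows_per_split(ids, splits):
    """Count how many existing key values fall inside each [lo, hi) range."""
    return [sum(lo <= k < hi for k in ids) for lo, hi in splits]


# Hypothetical table: keys 1..1000 were inserted, then 90% of the
# older half (keys <= 500) was deleted, leaving only every tenth row there.
surviving_ids = [k for k in range(1, 1001) if k > 500 or k % 10 == 0]

splits = key_range_splits(1, 1000, 2)
counts = rows_per_split(surviving_ids, splits)
print(splits, counts)
```

Both mappers get a key range of the same width, yet in this toy data the second mapper has ten times as many rows to read as the first.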
Other problems
Oracle might run each mapper using a full scan – clobbering the database
Oracle might run each mapper in parallel – clobbering the database
Sqoop may clobber the database cache
[Chart: Elapsed time (s) vs. number of mappers]
[Chart: Database load, database time (s) vs. number of mappers]
High speed connector design
Partition data based on physical storage
By-pass Oracle buffering
By-pass Oracle parallelism
Do not require or use indexes
Never read the same data block more than once
Support Oracle datatypes
[Diagram: ORACLE TABLE, three Hadoop mappers each with its own Oracle session, HDFS]
Imports (Oracle->Hadoop)
Uses Oracle block/extent map to equally divide IO
Uses Oracle direct path (non-buffered) IO for all reads
Round-robin, sequential or random allocation
All mappers get an equal number of blocks & no block is read twice
If table is partitioned, each mapper can work on a separate partition – results in partitioned output
[Diagram: ORACLE TABLE, three Hadoop mappers each with its own Oracle session, HDFS]
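The three allocation strategies named above can be sketched as follows. This is a simplified illustration, not the connector's implementation: `allocate_chunks`, its parameters, and the use of plain integers to stand in for chunks of data blocks are all assumptions. The point is that every chunk goes to exactly one mapper, so no block is read twice and the per-mapper counts stay even.

```python
import random


def allocate_chunks(chunks, num_mappers, method="roundrobin", seed=0):
    """Assign each chunk of data blocks to exactly one mapper."""
    assignment = [[] for _ in range(num_mappers)]
    if method == "roundrobin":
        # Deal chunks out like cards: 0, 1, 2, 0, 1, 2, ...
        for i, chunk in enumerate(chunks):
            assignment[i % num_mappers].append(chunk)
    elif method == "sequential":
        # Give each mapper one contiguous run of chunks.
        per_mapper = -(-len(chunks) // num_mappers)  # ceiling division
        for i, chunk in enumerate(chunks):
            assignment[i // per_mapper].append(chunk)
    elif method == "random":
        # Shuffle first, then deal out evenly.
        shuffled = list(chunks)
        random.Random(seed).shuffle(shuffled)
        for i, chunk in enumerate(shuffled):
            assignment[i % num_mappers].append(chunk)
    else:
        raise ValueError(f"unknown allocation method: {method}")
    return assignment


chunks = list(range(12))  # hypothetical chunk ids taken from an extent map
round_robin = allocate_chunks(chunks, 3, "roundrobin")
sequential = allocate_chunks(chunks, 3, "sequential")
randomized = allocate_chunks(chunks, 3, "random")
print(round_robin)
print(sequential)
```

Round-robin spreads neighbouring chunks across mappers, while sequential keeps each mapper on one contiguous region of the table; which works better would depend on the storage layout.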
Exports (Hadoop->Oracle)
Optionally leverages Oracle partitions and temporary tables for parallel writes
Performs MERGE into Oracle table (updates existing rows, inserts new rows)
Optionally uses Oracle NOLOGGING (faster but unrecoverable)
[Diagram: HDFS, three Hadoop mappers each with its own Oracle session, ORACLE TABLE]
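The MERGE behaviour mentioned above (update rows whose key already exists, insert the rest) can be modelled in a few lines. This is only an in-memory illustration of the semantics, not what the connector executes against Oracle; `merge_rows` and the sample rows are hypothetical.

```python
def merge_rows(target, incoming, key="id"):
    """Upsert: update rows whose key already exists, insert rows whose key is new."""
    merged = {row[key]: dict(row) for row in target}
    for row in incoming:
        merged.setdefault(row[key], {}).update(row)
    return sorted(merged.values(), key=lambda r: r[key])


# Hypothetical table contents and export batch
oracle_table = [{"id": 1, "city": "Perth"}, {"id": 2, "city": "Sydney"}]
hdfs_rows = [{"id": 2, "city": "Melbourne"}, {"id": 3, "city": "Brisbane"}]

result = merge_rows(oracle_table, hdfs_rows)
print(result)
```

Row 2 is updated in place and row 3 is inserted, while row 1 is left untouched.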
Import – Oracle to Hadoop
When data is unclustered (randomly distributed by PK), old SQOOP scales poorly
Clustered data shows better scalability but is still much slower than the direct approach.
New SQOOP typically outperforms old SQOOP by 5-20 times
We’ve seen the limiting factor as:
- Data IO bandwidth, or
- Network out of DB, or
- Hadoop CPU
[Chart: Elapsed time (s) vs. number of mappers, for direct=false with unclustered data, direct=false with clustered data, and direct=true]
Import - Database overhead
As you increase mappers in old Sqoop, database load increases rapidly
- (sometimes non-linear)
In new Sqoop, queuing occurs only after IO bandwidth is exceeded
[Chart: DB time (minutes) vs. number of mappers, Sqoop vs. Direct]
Export – Hadoop to Oracle
On export, old SQOOP would hit the database writer bottleneck early on and fail to parallelize.
New SQOOP uses partitioning and direct path inserts.
Typically bottlenecks on write IO on the Oracle side
[Chart: Elapsed time (minutes) vs. number of mappers, Sqoop vs. Direct]
Reduction in database load
45% reduction in DB CPU
83% reduction in elapsed time
90% reduction in total database time
99.9% reduction in database IO
[Chart: % reduction, 8 node Hadoop cluster, 1B rows, 310GB]
- CPU time: 55.31
- Elapsed time: 83.45
- DB time: 90.59
- IO requests: 99.28
- IO time: 99.98
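A percentage reduction can be restated as an improvement factor: a reduction of r% leaves (100 - r)% of the original cost, so the factor is 100 / (100 - r). The sketch below applies that arithmetic to the chart figures; the function name is illustrative, and the pairing of each value with its label follows the order in which they appear in the chart.

```python
def improvement_factor(percent_reduction):
    """Return how many times smaller the 'after' cost is than the 'before' cost."""
    remaining = 100.0 - percent_reduction
    if remaining <= 0:
        raise ValueError("a reduction of 100% or more has no finite factor")
    return 100.0 / remaining


# % reduction values as read from the chart
chart_reductions = {
    "CPU time": 55.31,
    "Elapsed time": 83.45,
    "DB time": 90.59,
    "IO requests": 99.28,
    "IO time": 99.98,
}

factors = {
    name: round(improvement_factor(r), 1)
    for name, r in chart_reductions.items()
}
for name, factor in factors.items():
    print(f"{name}: about {factor}x")
```

On these figures, the 83.45% reduction in elapsed time corresponds to roughly a 6x speedup, which sits inside the 5x-20x range quoted for the direct connector.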
Replication
No matter how fast we make SQOOP, it’s a drag to have to run a SQOOP job before every Hadoop job.
Replicating data into Hadoop cuts down on SQOOP overhead on both sides and avoids stale data.
Shareplex® for Oracle and Hadoop
Sqoop 1.4.5 Summary
Sqoop 1.4.5 without --direct
- Minimal privileges required
- Works on most object types, e.g. IOT
- Favors Sqoop terminology
- Database load increases non-linearly
Sqoop 1.4.5 with --direct
- Access to DBA views required
- 5x-20x faster performance on tables
- Favors Oracle terminology
- Up to 99% reduction in database IO
Future of SQOOP
Sqoop 1 Import Architecture
sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop --password sqoop \
--table cities
Sqoop 1 Export Architecture
sqoop export \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop --password sqoop \
--table cities \
--export-dir /temp/cities
Sqoop 1 Challenges
Concerns with usability
- Cryptic, contextual command line arguments
Concerns with security
- Client access to Hadoop bin/config, DB
Concerns with extensibility
- Connectors tightly coupled with data format
Sqoop 2 Design Goals
Ease of use
- REST API and Java API
Ease of security
- Separation of responsibilities
Ease of extensibility
- Connector SDK, focus on pluggability
Ease of Use
sqoop import \
-Dmapred.child.java.opts="-Djava.security.egd=file:///dev/urandom" \
-Ddfs.replication=1 \
-Dmapred.map.tasks.speculative.execution=false \
--num-mappers 4 \
--hive-import --hive-table CUSTOMERS --create-hive-table \
--connect jdbc:oracle:thin:@//localhost:1521/g12c \
--username OPSG --password opsg --table OPSG.CUSTOMERS \
--target-dir CUSTOMERS.CUSTOMERS
Sqoop 1 Sqoop 2
Ease of Security
sqoop import \
-Dmapred.child.java.opts="-Djava.security.egd=file:///dev/urandom" \
-Ddfs.replication=1 \
-Dmapred.map.tasks.speculative.execution=false \
--num-mappers 4 \
--hive-import --hive-table CUSTOMERS --create-hive-table \
--connect jdbc:oracle:thin:@//localhost:1521/g12c \
--username OPSG --password opsg --table OPSG.CUSTOMERS \
--target-dir CUSTOMERS.CUSTOMERS
Sqoop 1 Sqoop 2
• Role-based access to connection objects
• Prevents misuse and abuse
• Administrators create, edit, delete
• Operators use
Ease of Extensibility
Sqoop 1 Sqoop 2
Tight Coupling
• Connectors fetch and store data from db
• Framework handles serialization, format conversion, integration
Takeaway
Apache Sqoop
- Bulk data transfer tool between external structured datastores and Hadoop
Sqoop 1.4.5 now with a --direct parameter option for Oracle
- 5x-20x performance improvement on Oracle table imports
Sqoop 2
- Ease of use, security, extensibility
Questions?
Guy Harrison @guyharrison
David Robson @DavidR021
Kate Ting @kate_ting
Visit Dell at Booth #102
Visit Cloudera at Booth #305
Book Signing: Today @ 3:15pm
Office Hours: Tomorrow @ 11am