Apache Sqoop: Unlocking Hadoop for Your Relational Database
DESCRIPTION
Kathleen Ting, Technical Account Manager @ Cloudera and Sqoop Committer
Unlocking data stored in an organization's RDBMS and transferring it to Apache Hadoop is a major concern in the big data industry. Apache Sqoop enables users with information stored in existing SQL tables to use new analytic tools like Apache HBase and Apache Hive. This talk will go over how to deploy and apply Sqoop in your environment, as well as how to transfer data from MySQL, Oracle, PostgreSQL, SQL Server, Netezza, Teradata, and other relational systems. In addition, we'll show you how to keep table data and Hadoop in sync by importing data incrementally, as well as how to customize transferred data by calling various database functions.
TRANSCRIPT
![Page 1: Apache Sqoop: Unlocking Hadoop for Your Relational Database](https://reader034.vdocuments.site/reader034/viewer/2022051312/546f10a5b4af9f130c8b45a9/html5/thumbnails/1.jpg)
Unlocking Hadoop for Your Relational DB
Kathleen Ting | @kate_ting
Technical Account Manager, Cloudera | Sqoop PMC Member
Hadoop User Group UK
10 April 2014
![Page 2: Apache Sqoop: Unlocking Hadoop for Your Relational Database](https://reader034.vdocuments.site/reader034/viewer/2022051312/546f10a5b4af9f130c8b45a9/html5/thumbnails/2.jpg)
Who Am I?
• Started 3 yr ago as 1st Cloudera Support Eng
• Now manages Cloudera’s 2 largest customers
• Sqoop Committer, PMC Member
• Co-Author of the Apache Sqoop Cookbook
![Page 3: Apache Sqoop: Unlocking Hadoop for Your Relational Database](https://reader034.vdocuments.site/reader034/viewer/2022051312/546f10a5b4af9f130c8b45a9/html5/thumbnails/3.jpg)
What is Sqoop?
• Apache Top-Level Project
• SQl to hadOOP
• Tool to transfer data from relational databases
  • Teradata, MySQL, PostgreSQL, Oracle, Netezza
• To/From Hadoop ecosystem
  • HDFS (text, sequence file), Hive, HBase, Avro
![Page 4: Apache Sqoop: Unlocking Hadoop for Your Relational Database](https://reader034.vdocuments.site/reader034/viewer/2022051312/546f10a5b4af9f130c8b45a9/html5/thumbnails/4.jpg)
Why Sqoop?
• Efficient/Controlled resource utilization
  • Concurrent connections, Time of operation
• Datatype mapping and conversion
  • Automatic, and User override
• Metadata propagation
  • Sqoop Record
  • Hive Metastore
  • Avro
![Page 5: Apache Sqoop: Unlocking Hadoop for Your Relational Database](https://reader034.vdocuments.site/reader034/viewer/2022051312/546f10a5b4af9f130c8b45a9/html5/thumbnails/5.jpg)
Agenda
Sqoop 1
• Sqoop 1 Architecture
• Sqoop 1 Command Line
• Sqoop 1 Examples
• Sqoop 1 Challenges
• Troubleshooting Sqoop 1
• Common Sqoop 1 Issues
  • Protecting Your Password
  • Sqoop Works on CLI Not in Oozie
  • Choosing Proper Connector
  • Overriding Type Mapping
Sqoop 2
• Sqoop 2 Architecture
• Sqoop 2 Design Goals
• Sqoop 2 UI in Hue
Resources
![Page 7: Apache Sqoop: Unlocking Hadoop for Your Relational Database](https://reader034.vdocuments.site/reader034/viewer/2022051312/546f10a5b4af9f130c8b45a9/html5/thumbnails/7.jpg)
Sqoop 1 Architecture
![Page 8: Apache Sqoop: Unlocking Hadoop for Your Relational Database](https://reader034.vdocuments.site/reader034/viewer/2022051312/546f10a5b4af9f130c8b45a9/html5/thumbnails/8.jpg)
Sqoop 1 Command Line
sqoop TOOL PROPS ARG [-- EXTRA]
• TOOL: import, export
• PROPS
  • Hadoop (Java) properties
  • -Dwhatever.whenever=yes
• ARG
  • Generic Sqoop arguments
  • --table, --connect, ...
• EXTRA
  • connector specific
  • --schema (PostgreSQL and Microsoft SQL Server)
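To make the four slots of the command line concrete, the sketch below assembles an import command piece by piece and prints it instead of running it, so no cluster is needed. The job name, host, database, and schema values are illustrative assumptions and do not come from the talk.

```shell
#!/bin/sh
# Sketch: assemble a Sqoop 1 command from its four parts and print it.
# Nothing is executed against a cluster; all values are placeholders.

TOOL="import"                                # TOOL: import or export
PROPS="-Dmapreduce.job.name=cities_import"   # PROPS: Hadoop (Java) properties, right after the tool
ARG="--connect jdbc:postgresql://db.example.com/sqoop --username sqoop -P --table cities"
EXTRA="--schema public"                      # EXTRA: connector-specific, after a bare "--"

SQOOP_CMD="sqoop $TOOL $PROPS $ARG -- $EXTRA"
echo "$SQOOP_CMD"
```

The bare `--` is what separates the generic arguments from the connector-specific ones; everything after it is handed to the connector.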
![Page 9: Apache Sqoop: Unlocking Hadoop for Your Relational Database](https://reader034.vdocuments.site/reader034/viewer/2022051312/546f10a5b4af9f130c8b45a9/html5/thumbnails/9.jpg)
Sqoop 1 Example
sqoop import \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop --password sqoop \
  --table cities

sqoop export \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop --password sqoop \
  --table cities \
  --export-dir /temp/cities
![Page 10: Apache Sqoop: Unlocking Hadoop for Your Relational Database](https://reader034.vdocuments.site/reader034/viewer/2022051312/546f10a5b4af9f130c8b45a9/html5/thumbnails/10.jpg)
Sqoop 1 Challenges
• Cryptic, contextual command line arguments
• Security concerns
• Type mapping is not clearly defined
• Client needs access to Hadoop binaries/configuration and database
• JDBC model is enforced
![Page 11: Apache Sqoop: Unlocking Hadoop for Your Relational Database](https://reader034.vdocuments.site/reader034/viewer/2022051312/546f10a5b4af9f130c8b45a9/html5/thumbnails/11.jpg)
Troubleshooting Sqoop 1
• Versions: Sqoop, Hadoop, OS, JDBC
• Console log after running with the --verbose flag
  • Capture the entire output via sqoop import … &> sqoop.log
• Entire Sqoop command including the options-file if applicable
• Expected output and actual output
• Table definition
• Small input data set that triggers the problem
  • Especially with export, malformed data is often the culprit
• Hadoop task logs
  • Often the task logs contain further information describing the problem
• Permissions on input files
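As one possible way to collect the version details from the checklist above, the following sketch writes them into a single file. The output file name is an arbitrary choice for this example, and the script only reports what it finds on the PATH rather than assuming Sqoop or Hadoop is installed.

```shell
#!/bin/sh
# Sketch: gather OS, Sqoop, and Hadoop version details into one file.
# DIAG_FILE is a hypothetical name chosen for this example.
DIAG_FILE="sqoop-diagnostics.txt"

{
  echo "== OS =="
  uname -a
  echo "== Sqoop =="
  if command -v sqoop >/dev/null 2>&1; then
    sqoop version
  else
    echo "sqoop not found on PATH"
  fi
  echo "== Hadoop =="
  if command -v hadoop >/dev/null 2>&1; then
    hadoop version
  else
    echo "hadoop not found on PATH"
  fi
} > "$DIAG_FILE" 2>&1

# "&>" is a bash shorthand; in plain sh the equivalent redirection
# for capturing both stdout and stderr is "> sqoop.log 2>&1".
```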
![Page 12: Apache Sqoop: Unlocking Hadoop for Your Relational Database](https://reader034.vdocuments.site/reader034/viewer/2022051312/546f10a5b4af9f130c8b45a9/html5/thumbnails/12.jpg)
Troubleshooting Sqoop 1
Imported table has more rows than source table?
• Data contains char used as Hive’s delimiters
  • Clean up data
  • --hive-drop-import-delims
    • Removes \n, \t, and \01 char
  • --hive-delims-replacement "SPECIAL"
    • Replaces \n, \t, and \01 char with string SPECIAL
• Not restricted to Hive - any import job using text files
• Ensure output files have one line per imported row
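The row-count mismatch is easy to reproduce without a cluster. The sketch below uses made-up sample data and file names: it writes two logical rows where one text field contains a newline, counts the lines, and then compares that with the same rows once the embedded newline has been dropped, which is the kind of cleanup --hive-drop-import-delims performs during import.

```shell
#!/bin/sh
# Sketch: why an embedded newline inflates the row count of a text import.
# Two source rows; the second has a newline inside its text column.
printf '1,London\n2,New\nYork\n' > cities_raw.txt
RAW_LINES=$(wc -l < cities_raw.txt | tr -d ' ')      # 3 lines for 2 rows

# The same two rows with the embedded newline removed from the field.
printf '1,London\n2,NewYork\n' > cities_clean.txt
CLEAN_LINES=$(wc -l < cities_clean.txt | tr -d ' ')  # 2 lines for 2 rows

echo "raw=$RAW_LINES clean=$CLEAN_LINES"
```

Any tool that reads the raw file line by line sees three records, which is why the imported table appears to have more rows than the source.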
![Page 14: Apache Sqoop: Unlocking Hadoop for Your Relational Database](https://reader034.vdocuments.site/reader034/viewer/2022051312/546f10a5b4af9f130c8b45a9/html5/thumbnails/14.jpg)
Common Sqoop 1 Issues
• Protecting Your Password
• Sqoop Works on CLI Not in Oozie
• Choosing Proper Connector
• Overriding Type Mapping
![Page 16: Apache Sqoop: Unlocking Hadoop for Your Relational Database](https://reader034.vdocuments.site/reader034/viewer/2022051312/546f10a5b4af9f130c8b45a9/html5/thumbnails/16.jpg)
Protecting Your Password
sqoop import \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop \
  --table cities \
  -P

sqoop import \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop \
  --table cities \
  --password-file my-sqoop-password
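For the --password-file variant, the file has to contain the password and nothing else. A minimal sketch of creating such a file follows; the file name matches the example above, while the password value is a placeholder.

```shell
#!/bin/sh
# Sketch: create a password file suitable for --password-file.
# The password value below is a placeholder.
PASSWORD_FILE="my-sqoop-password"

# Remove any earlier read-only copy so the write below cannot fail.
rm -f "$PASSWORD_FILE"

# printf '%s' writes no trailing newline; Sqoop reads the whole file,
# so a stray newline would become part of the password.
printf '%s' 'sqoop' > "$PASSWORD_FILE"

# Make the file readable by its owner only.
chmod 400 "$PASSWORD_FILE"
```

Editors and `echo` usually append a newline, which is a common reason a password file is rejected even though the visible text is correct.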
![Page 18: Apache Sqoop: Unlocking Hadoop for Your Relational Database](https://reader034.vdocuments.site/reader034/viewer/2022051312/546f10a5b4af9f130c8b45a9/html5/thumbnails/18.jpg)
Sqoop Works on CLI Not in Oozie
Character parameter '|' has multiple characters; only the first will be used.
Got error creating database manager: java.io.IOException:
No manager for connect string: "jdbc:teradata..."
![Page 19: Apache Sqoop: Unlocking Hadoop for Your Relational Database](https://reader034.vdocuments.site/reader034/viewer/2022051312/546f10a5b4af9f130c8b45a9/html5/thumbnails/19.jpg)
Sqoop Works on CLI Not in Oozie
sqoop import --password "spEci@l\$" \
  --connect 'jdbc:x:/yyy;db=sqoop'
• Remove all escaping that you’ve added for the shell
• Use <arg> tags rather than <command>: the content of each <arg> is treated as one parameter
• Put all -D parameters into the configuration section
• Install the driver into the workflow’s lib/ directory or the shared action library /user/oozie/share/lib/sqoop/
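Translated into a workflow, the advice above might look like the following sketch, which writes an Oozie Sqoop action to a file. The action name, transitions, queue property, table, and file name are illustrative assumptions; the connect string and password are the ones from the command above, with the shell quoting and the backslash removed.

```shell
#!/bin/sh
# Sketch: write an Oozie Sqoop action that uses one <arg> per parameter.
# The quoted 'EOF' keeps the shell from expanding ${...} and $ in the XML.
cat > sqoop-action-snippet.xml <<'EOF'
<action name="sqoop-import">
  <sqoop xmlns="uri:oozie:sqoop-action:0.2">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
      <!-- what would be a -D parameter on the command line goes here -->
      <property>
        <name>mapred.job.queue.name</name>
        <value>default</value>
      </property>
    </configuration>
    <arg>import</arg>
    <arg>--connect</arg>
    <arg>jdbc:x:/yyy;db=sqoop</arg>
    <arg>--password</arg>
    <arg>spEci@l$</arg>
    <arg>--table</arg>
    <arg>cities</arg>
  </sqoop>
  <ok to="end"/>
  <error to="fail"/>
</action>
EOF
```

Because each `<arg>` is passed through as a single parameter, values containing spaces or other characters that a shell would treat specially need no quoting here.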
![Page 21: Apache Sqoop: Unlocking Hadoop for Your Relational Database](https://reader034.vdocuments.site/reader034/viewer/2022051312/546f10a5b4af9f130c8b45a9/html5/thumbnails/21.jpg)
Choosing Proper Connector
• JDBC driver is a dependency for all three connectors
• Sqoop automatically chooses the optimal connector (OraOop, built-in, Generic JDBC Connector)
• Or explicitly choose: --connection-manager com.quest.oraoop.OraOopConnManager
![Page 23: Apache Sqoop: Unlocking Hadoop for Your Relational Database](https://reader034.vdocuments.site/reader034/viewer/2022051312/546f10a5b4af9f130c8b45a9/html5/thumbnails/23.jpg)
Overriding Type Mapping
--map-column-java parameter
• comma separated list of key-value pairs
  • key = exact column name
  • value = target Java type
sqoop import \
--map-column-java \
c1=Float,c2=String,c3=String ...
![Page 25: Apache Sqoop: Unlocking Hadoop for Your Relational Database](https://reader034.vdocuments.site/reader034/viewer/2022051312/546f10a5b4af9f130c8b45a9/html5/thumbnails/25.jpg)
Sqoop 2 Architecture
![Page 26: Apache Sqoop: Unlocking Hadoop for Your Relational Database](https://reader034.vdocuments.site/reader034/viewer/2022051312/546f10a5b4af9f130c8b45a9/html5/thumbnails/26.jpg)
Sqoop 2 Design Goals
• Security and Separation of Concerns
  • Role based access and use
• Ease of extension
  • No low-level Hadoop knowledge needed
  • No functional overlap between Connectors
• Ease of Use
  • Uniform functionality
  • Domain specific interactions
![Page 27: Apache Sqoop: Unlocking Hadoop for Your Relational Database](https://reader034.vdocuments.site/reader034/viewer/2022051312/546f10a5b4af9f130c8b45a9/html5/thumbnails/27.jpg)
Sqoop 2 UI in Hue
• Troubleshooting
  • sqoop.log file is located in @LOGDIR@ and the rest should be in server/logs/*
  • Look for catalina.out, catalina.log, localhost-*.log
![Page 28: Apache Sqoop: Unlocking Hadoop for Your Relational Database](https://reader034.vdocuments.site/reader034/viewer/2022051312/546f10a5b4af9f130c8b45a9/html5/thumbnails/28.jpg)
![Page 29: Apache Sqoop: Unlocking Hadoop for Your Relational Database](https://reader034.vdocuments.site/reader034/viewer/2022051312/546f10a5b4af9f130c8b45a9/html5/thumbnails/29.jpg)
![Page 30: Apache Sqoop: Unlocking Hadoop for Your Relational Database](https://reader034.vdocuments.site/reader034/viewer/2022051312/546f10a5b4af9f130c8b45a9/html5/thumbnails/30.jpg)
![Page 31: Apache Sqoop: Unlocking Hadoop for Your Relational Database](https://reader034.vdocuments.site/reader034/viewer/2022051312/546f10a5b4af9f130c8b45a9/html5/thumbnails/31.jpg)
![Page 32: Apache Sqoop: Unlocking Hadoop for Your Relational Database](https://reader034.vdocuments.site/reader034/viewer/2022051312/546f10a5b4af9f130c8b45a9/html5/thumbnails/32.jpg)
![Page 33: Apache Sqoop: Unlocking Hadoop for Your Relational Database](https://reader034.vdocuments.site/reader034/viewer/2022051312/546f10a5b4af9f130c8b45a9/html5/thumbnails/33.jpg)
![Page 34: Apache Sqoop: Unlocking Hadoop for Your Relational Database](https://reader034.vdocuments.site/reader034/viewer/2022051312/546f10a5b4af9f130c8b45a9/html5/thumbnails/34.jpg)
![Page 35: Apache Sqoop: Unlocking Hadoop for Your Relational Database](https://reader034.vdocuments.site/reader034/viewer/2022051312/546f10a5b4af9f130c8b45a9/html5/thumbnails/35.jpg)
![Page 36: Apache Sqoop: Unlocking Hadoop for Your Relational Database](https://reader034.vdocuments.site/reader034/viewer/2022051312/546f10a5b4af9f130c8b45a9/html5/thumbnails/36.jpg)
![Page 37: Apache Sqoop: Unlocking Hadoop for Your Relational Database](https://reader034.vdocuments.site/reader034/viewer/2022051312/546f10a5b4af9f130c8b45a9/html5/thumbnails/37.jpg)
![Page 39: Apache Sqoop: Unlocking Hadoop for Your Relational Database](https://reader034.vdocuments.site/reader034/viewer/2022051312/546f10a5b4af9f130c8b45a9/html5/thumbnails/39.jpg)
Resources
Sqoop 2 http://archive-primary.cloudera.com/cdh5/cdh/5/sqoop2/
Sqoop 1