bridging the gap of relational to hadoop using sqoop @ expedia
TRANSCRIPT
![Page 1: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia](https://reader035.vdocuments.site/reader035/viewer/2022062412/58a4828f1a28ab58738b4f33/html5/thumbnails/1.jpg)
Bridging the gap of Relational to Hadoop using Sqoop@Expedia(Enhancing Sqoop for Synchronization)
Shashank Tandon, Expedia
Kopal Niranjan, Expedia
![Page 2: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia](https://reader035.vdocuments.site/reader035/viewer/2022062412/58a4828f1a28ab58738b4f33/html5/thumbnails/2.jpg)
| Expedia Inc. Proprietary & Confidential2
Agenda• Problem statement
• Why- Sqoop
• Expedia Enhancements for Sqoop.
• New Tool : Hive Merge
• Data Synchronization
• Demo
![Page 3: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia](https://reader035.vdocuments.site/reader035/viewer/2022062412/58a4828f1a28ab58738b4f33/html5/thumbnails/3.jpg)
| Expedia Inc. Proprietary & Confidential3
Data Synchronization
![Page 4: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia](https://reader035.vdocuments.site/reader035/viewer/2022062412/58a4828f1a28ab58738b4f33/html5/thumbnails/4.jpg)
| Expedia Inc. Proprietary & Confidential4
Problem Statement• Import huge amount of data available on RDBMS to Hive
table
• Support multiple partitions on Hive while importing.
• Regular updates happening on RDBMS.–Merge the new/updated data to hive tables.–Merge the data in parallel.
![Page 5: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia](https://reader035.vdocuments.site/reader035/viewer/2022062412/58a4828f1a28ab58738b4f33/html5/thumbnails/5.jpg)
| Expedia Inc. Proprietary & Confidential5
Community Solution - Sqoop• Sqoop is an open source tool designed to efficiently
transfer bulk data between Hadoop and structured data stores such as relational databases.
• Support various relational databases like Teradata, SQL Server, Oracle,Mysql,DB2 etc.
![Page 6: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia](https://reader035.vdocuments.site/reader035/viewer/2022062412/58a4828f1a28ab58738b4f33/html5/thumbnails/6.jpg)
| Expedia Inc. Proprietary & Confidential6
Enhanced Sqoop Features • Enhanced Sqoop Features for community business needs.
- Hive Merge - Merges the incremental data migrated to hdfs into your
existing hive tables.- Supports merge based on composite keys- Merges older partitions as well as add new partitions.
![Page 7: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia](https://reader035.vdocuments.site/reader035/viewer/2022062412/58a4828f1a28ab58738b4f33/html5/thumbnails/7.jpg)
| Expedia Inc. Proprietary & Confidential7
Enhanced Sqoop Features - Hive Dynamic Partition
- Hive Dynamic Partition with Partition Format
- Hive External Table
- Compression like Snappy
![Page 8: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia](https://reader035.vdocuments.site/reader035/viewer/2022062412/58a4828f1a28ab58738b4f33/html5/thumbnails/8.jpg)
| Expedia Inc. Proprietary & Confidential8
Hcatalog for Hive- Hcatalog is a java wrapper on top of Hive metastore.
- Sqoop supports all the latest Hive features using Hcatalog.
![Page 9: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia](https://reader035.vdocuments.site/reader035/viewer/2022062412/58a4828f1a28ab58738b4f33/html5/thumbnails/9.jpg)
| Expedia Inc. Proprietary & Confidential9
External tables with HCatalog
![Page 10: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia](https://reader035.vdocuments.site/reader035/viewer/2022062412/58a4828f1a28ab58738b4f33/html5/thumbnails/10.jpg)
| Expedia Inc. Proprietary & Confidential10
Sqoop Import to Hive Managed Table
• Sqoop connects to mysql database test
• Import table MYTABLE in a hive managed table test_part1
• The hive managed table is located in /apps/hive/warehouse
![Page 11: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia](https://reader035.vdocuments.site/reader035/viewer/2022062412/58a4828f1a28ab58738b4f33/html5/thumbnails/11.jpg)
| Expedia Inc. Proprietary & Confidential11
![Page 12: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia](https://reader035.vdocuments.site/reader035/viewer/2022062412/58a4828f1a28ab58738b4f33/html5/thumbnails/12.jpg)
| Expedia Inc. Proprietary & Confidential12
New Enhancement :Import to Hive External Table
• The above command creates a hive table in the user managedDirectory /user/root/test_part2
![Page 13: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia](https://reader035.vdocuments.site/reader035/viewer/2022062412/58a4828f1a28ab58738b4f33/html5/thumbnails/13.jpg)
| Expedia Inc. Proprietary & Confidential13
![Page 14: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia](https://reader035.vdocuments.site/reader035/viewer/2022062412/58a4828f1a28ab58738b4f33/html5/thumbnails/14.jpg)
| Expedia Inc. Proprietary & Confidential14
Dynamic Partitioning with HCatalog
![Page 15: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia](https://reader035.vdocuments.site/reader035/viewer/2022062412/58a4828f1a28ab58738b4f33/html5/thumbnails/15.jpg)
| Expedia Inc. Proprietary & Confidential15
Sqoop Import to Hive Static Partition• Can pass only 1 static partition as sqoop argument
![Page 16: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia](https://reader035.vdocuments.site/reader035/viewer/2022062412/58a4828f1a28ab58738b4f33/html5/thumbnails/16.jpg)
| Expedia Inc. Proprietary & Confidential16
Sqoop Import to Hive Static Partition• Check Hive Partition
![Page 17: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia](https://reader035.vdocuments.site/reader035/viewer/2022062412/58a4828f1a28ab58738b4f33/html5/thumbnails/17.jpg)
| Expedia Inc. Proprietary & Confidential17
Sqoop Import to Hive Static Partition on Date column
• Can pass only 1 static partition as sqoop argument with date value specified manually.
![Page 18: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia](https://reader035.vdocuments.site/reader035/viewer/2022062412/58a4828f1a28ab58738b4f33/html5/thumbnails/18.jpg)
| Expedia Inc. Proprietary & Confidential18
Questions
How to Import Data if there are more than 200 partitions ? Should I manually run these jobs again and again ?
How to Import Data if the date format is month or day or year?Is there any way that I can pass the format ?
![Page 19: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia](https://reader035.vdocuments.site/reader035/viewer/2022062412/58a4828f1a28ab58738b4f33/html5/thumbnails/19.jpg)
| Expedia Inc. Proprietary & Confidential19
New Enhancement : Import to Hive Dynamic Partition
• A new argument is passed –hcatalog-dynamic-partition-keys in sqoop.
• It works along with current static partition key.
• If both are passed then it will give more preference to static partition key.
![Page 20: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia](https://reader035.vdocuments.site/reader035/viewer/2022062412/58a4828f1a28ab58738b4f33/html5/thumbnails/20.jpg)
| Expedia Inc. Proprietary & Confidential20
![Page 21: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia](https://reader035.vdocuments.site/reader035/viewer/2022062412/58a4828f1a28ab58738b4f33/html5/thumbnails/21.jpg)
| Expedia Inc. Proprietary & Confidential21
New Enhancement : Import to Hive Dynamic Partition with Date Format
• A new argument is passed –hcatalog-dynamic-partition-key-format with argument –hcatalog-dynamic-partition-keys.
• Check the Hive Partitions after the Sqoop Import.
• The partitions created will be in the user-specified format.
![Page 22: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia](https://reader035.vdocuments.site/reader035/viewer/2022062412/58a4828f1a28ab58738b4f33/html5/thumbnails/22.jpg)
| Expedia Inc. Proprietary & Confidential22
![Page 23: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia](https://reader035.vdocuments.site/reader035/viewer/2022062412/58a4828f1a28ab58738b4f33/html5/thumbnails/23.jpg)
| Expedia Inc. Proprietary & Confidential23
Password encrypted in Sqoop Metastore• Password will now be saved in Sqoop metastore in
encrypted manner.
• The logic is same as done in file encryption where generic passkey and algorithm is passed in command line.
![Page 24: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia](https://reader035.vdocuments.site/reader035/viewer/2022062412/58a4828f1a28ab58738b4f33/html5/thumbnails/24.jpg)
| Expedia Inc. Proprietary & Confidential24
Issues with Sqoop Merge Tool• Designed to merge two directories on HDFS. Will need
modification to support merging of Hive tables.
• The output directory must be specified while performing the merge.
• Supports merge based on a single column.
• To merge many partitions, each will require separate sequential Sqoop jobs.
![Page 25: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia](https://reader035.vdocuments.site/reader035/viewer/2022062412/58a4828f1a28ab58738b4f33/html5/thumbnails/25.jpg)
| Expedia Inc. Proprietary & Confidential25
Merge Incremental data using Sqoop and Hive External Table
• Import records from base table to a HDFS directory.
• Import updates using incremental imports to another HDFS directory.
• Create a hive external table for both the directories.
• Create a view that combines record sets from both the Base (base_table) and Change (incremental_table) tables.
![Page 26: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia](https://reader035.vdocuments.site/reader035/viewer/2022062412/58a4828f1a28ab58738b4f33/html5/thumbnails/26.jpg)
| Expedia Inc. Proprietary & Confidential26
Merge Incremental data using Sqoop and Hive External Table
• The view now contains the most up-to-date set of records.
• Generate a table from the view created in above step.
• Replace the base table with the entries from the above generated table.
![Page 27: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia](https://reader035.vdocuments.site/reader035/viewer/2022062412/58a4828f1a28ab58738b4f33/html5/thumbnails/27.jpg)
| Expedia Inc. Proprietary & Confidential27
New Tool: Hive Merge• Import original base table into Hive
![Page 28: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia](https://reader035.vdocuments.site/reader035/viewer/2022062412/58a4828f1a28ab58738b4f33/html5/thumbnails/28.jpg)
| Expedia Inc. Proprietary & Confidential28
New Tool : Hive merge • Import incremental data into Hive
![Page 29: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia](https://reader035.vdocuments.site/reader035/viewer/2022062412/58a4828f1a28ab58738b4f33/html5/thumbnails/29.jpg)
| Expedia Inc. Proprietary & Confidential29
• Finally merge data using tool hive-merge.
New Tool : Hive merge
![Page 30: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia](https://reader035.vdocuments.site/reader035/viewer/2022062412/58a4828f1a28ab58738b4f33/html5/thumbnails/30.jpg)
| Expedia Inc. Proprietary & Confidential30
Acquiring locks during Hive Merge• In order to allow only single Hive merge happen on same
table, tool acquire lock in the start and release lock once it finishes.
![Page 31: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia](https://reader035.vdocuments.site/reader035/viewer/2022062412/58a4828f1a28ab58738b4f33/html5/thumbnails/31.jpg)
| Expedia Inc. Proprietary & Confidential31
Performance metrics : Hive Merge tool
![Page 32: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia](https://reader035.vdocuments.site/reader035/viewer/2022062412/58a4828f1a28ab58738b4f33/html5/thumbnails/32.jpg)
| Expedia Inc. Proprietary & Confidential32
Other Key Enhancements• Save encrypted password in Sqoop Metastore
• Teradata varchar/char support
• Teradata current timestamp support
• Sqoop Job runs for Incremental Import
• Snappy compression support in Hcatalog
![Page 33: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia](https://reader035.vdocuments.site/reader035/viewer/2022062412/58a4828f1a28ab58738b4f33/html5/thumbnails/33.jpg)
| Expedia Inc. Proprietary & Confidential33
Apache Sqoop Jiras These are the few jiras for which the patch has been provided by us:
• SQOOP-2332: Dynamic Partition in Sqoop HCatalog- if Hive table does not exists & add support for Partition Date Format
• SQOOP-2335 :Support for Hive External Table in Sqoop – Hcatalog
![Page 34: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia](https://reader035.vdocuments.site/reader035/viewer/2022062412/58a4828f1a28ab58738b4f33/html5/thumbnails/34.jpg)
| Expedia Inc. Proprietary & Confidential34
• SQOOP-2585: Merging hive tables using sqoop
• SQOOP-2596:Precision of varchar/char column cannot be retrieved from teradata database during sqoop import
• SQOOP-2801: Secure RDBMS password in Sqoop Metastore in a encrypted form.
• SQOOP-2331: Snappy Compression Support in Sqoop-Hcatalog
![Page 35: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia](https://reader035.vdocuments.site/reader035/viewer/2022062412/58a4828f1a28ab58738b4f33/html5/thumbnails/35.jpg)
35
Demo
![Page 36: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia](https://reader035.vdocuments.site/reader035/viewer/2022062412/58a4828f1a28ab58738b4f33/html5/thumbnails/36.jpg)
| Expedia Inc. Proprietary & Confidential36
Questions
![Page 37: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia](https://reader035.vdocuments.site/reader035/viewer/2022062412/58a4828f1a28ab58738b4f33/html5/thumbnails/37.jpg)
| Expedia Inc. Proprietary & Confidential37
Hive Merge Internal ArchitectureStep 1: Identify partitions to update. Skip this step for non-partitioned tables.
![Page 38: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia](https://reader035.vdocuments.site/reader035/viewer/2022062412/58a4828f1a28ab58738b4f33/html5/thumbnails/38.jpg)
| Expedia Inc. Proprietary & Confidential38
Hive Merge Internal ArchitectureStep 2: Merge the new partitions with the old partitions(only for partitioned tables).
![Page 39: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia](https://reader035.vdocuments.site/reader035/viewer/2022062412/58a4828f1a28ab58738b4f33/html5/thumbnails/39.jpg)
| Expedia Inc. Proprietary & Confidential39
Hive Merge Internal ArchitectureStep 3: Delete older versions.