Apache Big Data Europe 2016: Power Pig with Spark
Agenda
● Background
● Why Pig on Spark?
● Design Architecture
● Benchmark
● Optimization
● Current Status & Future Work
● Q&A
Background
Apache Pig
● Procedural scripting language
● Pig Latin: similar to SQL
● Heavily used for ETL
● Schema / No schema data, Pig eats everything
Spark
● Faster
● Generality
● Ease of use
Why Pig on Spark
● Better Performance
○ No intermediate data between stages
○ In-memory caching abstraction
○ Executor JVM Reuse
● Let existing Pig users try Spark conveniently
Design Architecture
Pig Latin to RDD<Tuple> transformations
Operator Mapping
| Pig Operator | Spark Operator |
| --- | --- |
| Load | newAPIHadoopFile |
| Store | saveAsNewAPIHadoopFile |
| Filter | filter |
| GroupBy | groupBy / reduceByKey |
| Join | CoGroupedRDD |
| ForEach | mapPartitions |
| Sort | sortByKey |
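As a rough illustration of the mapping above, the sketch below runs a small filter / group / foreach / sort pipeline over an in-memory list of tuples in plain Python. It is an analogy only: it uses neither Pig nor Spark, and the sample records and the `pig_*` helper names are hypothetical.

```python
from collections import defaultdict

# Hypothetical input: (user, page, seconds) tuples standing in for a loaded relation.
records = [
    ("alice", "home", 12),
    ("bob", "search", 7),
    ("alice", "search", 30),
    ("carol", "home", 3),
    ("bob", "home", 21),
]

def pig_filter(rows, predicate):
    # FILTER: keep only the tuples that satisfy the predicate (Spark: filter).
    return [row for row in rows if predicate(row)]

def pig_group_by(rows, key_fn):
    # GROUP BY: collect tuples that share a key (Spark: groupBy).
    groups = defaultdict(list)
    for row in rows:
        groups[key_fn(row)].append(row)
    return list(groups.items())

def pig_foreach(rows, fn):
    # FOREACH ... GENERATE: transform each tuple (Spark: mapPartitions).
    return [fn(row) for row in rows]

def pig_order_by(rows, key_fn):
    # ORDER BY: sort tuples by key (Spark: sortByKey).
    return sorted(rows, key=key_fn)

long_visits = pig_filter(records, lambda r: r[2] >= 7)
by_user = pig_group_by(long_visits, lambda r: r[0])
totals = pig_foreach(by_user, lambda kv: (kv[0], sum(r[2] for r in kv[1])))
result = pig_order_by(totals, lambda t: t[0])
print(result)
```

Each helper takes a relation and returns a new one, which is the same shape as chaining transformations on an RDD of tuples.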
Benchmark Overview
| Component | Version |
| --- | --- |
| Pig | Spark branch |
| Hadoop | 2.6.0 |
| Spark | 1.6.2 |
| PigMix | Trunk |
Basic Configuration
spark.master=yarn-client
spark.executor.memory=6553m
spark.yarn.executor.memoryOverhead=1638
spark.executor.cores=8
spark.dynamicAllocation.enabled=true
spark.network.timeout=1200000
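On YARN, Spark 1.x requests each executor container as the executor memory plus `spark.yarn.executor.memoryOverhead` (given in MB). The sketch below parses the settings above and adds the two values. The `parse_properties` and `memory_to_mb` helpers are hypothetical, and only the `m` and `g` suffixes are handled.

```python
# The settings from the slide, kept verbatim as key=value text.
CONFIG_TEXT = """
spark.master=yarn-client
spark.executor.memory=6553m
spark.yarn.executor.memoryOverhead=1638
spark.executor.cores=8
spark.dynamicAllocation.enabled=true
spark.network.timeout=1200000
"""

def parse_properties(text):
    # Hypothetical helper: turn key=value lines into a dict.
    props = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

def memory_to_mb(value):
    # Convert '6553m' or '8g' to MB; a bare number is taken as MB already.
    value = value.lower()
    if value.endswith("g"):
        return int(value[:-1]) * 1024
    if value.endswith("m"):
        return int(value[:-1])
    return int(value)

props = parse_properties(CONFIG_TEXT)
executor_mb = memory_to_mb(props["spark.executor.memory"])
overhead_mb = memory_to_mb(props["spark.yarn.executor.memoryOverhead"])
container_mb = executor_mb + overhead_mb
cores = int(props["spark.executor.cores"])

print("container request (MB):", container_mb)
print("memory per core (MB):", container_mb / cores)
```

With these values the request comes to 8191 MB, just under 8 GB per executor container.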
Benchmark Overview (cont’d)
Optimize GroupBy/Join
Skewed Key Sort
Salted Key Solution
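The slide body for this solution did not survive in the transcript, so the following is a sketch of the general key-salting technique, not necessarily what Pig on Spark implements. It assumes the problem is that a range partitioner sends all equal keys to one partition. Pairing each key with a random salt makes equal keys distinct, so a hot key can be split across partitions, and the salt is dropped after the sort. The partitioner here is a simplified stand-in that uses exact quantiles, and the data is hypothetical.

```python
import random
from bisect import bisect_left

def range_partition(items, num_partitions):
    # Simplified range partitioner: take split points at even quantiles of the
    # sorted items, then send each item to the first range whose upper bound
    # is >= the item. Equal items therefore always share a partition.
    ordered = sorted(items)
    bounds = [ordered[i * len(ordered) // num_partitions]
              for i in range(1, num_partitions)]
    partitions = [[] for _ in range(num_partitions)]
    for item in items:
        partitions[bisect_left(bounds, item)].append(item)
    return partitions

def sort_partitions(partitions):
    # Sort each partition locally; concatenating them in order is globally sorted.
    return [x for part in partitions for x in sorted(part)]

# Hypothetical skewed input: key 5 dominates the data set.
keys = [5] * 900 + list(range(100))
NUM_PARTITIONS = 4

plain = range_partition(keys, NUM_PARTITIONS)
plain_sizes = [len(p) for p in plain]

# Salting: pair every key with a random salt so that equal keys become
# distinct composite keys and can be split across partitions.
rng = random.Random(0)  # fixed seed so the run is repeatable
salted_keys = [(k, rng.randrange(1000)) for k in keys]
salted = range_partition(salted_keys, NUM_PARTITIONS)
salted_sizes = [len(p) for p in salted]

# Drop the salt after sorting; the original key order is preserved because
# the salt is only the secondary component of the composite key.
sorted_keys = [k for k, _salt in sort_partitions(salted)]

print("partition sizes without salt:", plain_sizes)
print("partition sizes with salt:   ", salted_sizes)
```

Without the salt, one partition receives every copy of the hot key. With it, the same records spread across partitions while the final output is still ordered by the original key.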
Skewed Key Sort Performance
There is a significant performance improvement in the sort case (L10) and the skewed key sort case (L9).
Current Status: Nearing end of Milestone 1
● Functional completeness: DONE
● All Unit Tests Pass: DONE
● Merge Spark Branch to Master: In Code Review
Ongoing Work towards Milestone 2
● Implement Optimizations
○ Optimize Group by/Join - PIG-4797: DONE
○ FR Join - PIG-4771: DONE
○ Merge Join - PIG-4810: DONE
○ Skewed Join: UNDER REVIEW
● Enhance Test Infrastructure
○ Use “local-cluster” mode to run unit tests
● Spark Integration
○ Improved error, progress, stats reporting
○ YARN Cluster Mode
Future work: Milestone 3
● Implement More Optimizations
○ Split / MultiQuery using RDD.cache()
○ Compute optimal Shuffle Parallelism
○ Optimize/Redesign Spark Plan
● Code Stabilization, Bug Fixes
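The Split / MultiQuery item relies on the fact that a cached RDD is computed once and reused by later consumers. The sketch below shows that idea in plain Python, with no Spark involved. `LazyDataset` is a hypothetical stand-in for a lazily evaluated dataset, and `load_and_clean` stands in for an expensive prefix shared by both branches of a split.

```python
class LazyDataset:
    # Hypothetical stand-in for a lazily evaluated dataset; not a Spark class.
    def __init__(self, compute):
        self._compute = compute
        self._cached = None
        self._use_cache = False
        self.evaluations = 0

    def cache(self):
        # Mark the dataset so its first materialized result is kept.
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        self.evaluations += 1
        result = self._compute()
        if self._use_cache:
            self._cached = result
        return result

def load_and_clean():
    # Stand-in for an expensive shared prefix of a script (load + filter).
    return [n for n in range(10) if n % 2 == 0]

def run_split(shared):
    # Two branches of a split both consume the same upstream relation.
    small = [n for n in shared.collect() if n < 5]
    large = [n for n in shared.collect() if n >= 5]
    return small, large

uncached = LazyDataset(load_and_clean)
run_split(uncached)

cached = LazyDataset(load_and_clean).cache()
small, large = run_split(cached)

print("evaluations without cache:", uncached.evaluations)
print("evaluations with cache:", cached.evaluations)
```

Without caching, the shared prefix is evaluated once per branch. With caching, it is evaluated once in total.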
Contributions welcome
● Git:
○ https://github.com/apache/pig/tree/spark
● Wiki:
○ https://cwiki.apache.org/confluence/display/PIG/Pig+on+Spark
● Umbrella JIRA:
○ PIG-4059
Q&A