chicago spark meetup-april2017-public
TRANSCRIPT
© Cloudera, Inc. All rights reserved.
Building Efficient Pipelines in Apache Spark
Guru Medasani
Agenda
• Introduction
  • Myself
  • Cloudera
• Spark Pipeline Essentials
  • Using Spark UI
  • Resource Allocation
  • Tuning
  • Data Formats
  • Streaming
• Questions
Introduction: Myself
• Current: Senior Solutions Architect at Cloudera (Chicago, IL)
• Past: Big Data Engineer at Monsanto Research & Development (St. Louis, MO)
Introduction: Cloudera
The modern platform for data management, machine learning and advanced analytics
• Founded in 2008
• Product: CDH, the first commercial distribution of Hadoop (shipped 2009)
• World-class support: 24x7 global staff, operations in 27 countries
• Proactive & predictive support programs using our EDH
• Mission-critical production deployments in run-the-business applications worldwide: Financial Services, Retail, Telecom, Media, Health Care, Energy, Government
• The largest ecosystem: 2,500+ partners
• Cloudera University: over 45,000 trained
• Open source leaders: Cloudera employees are leading developers of & contributors to the complete Apache Hadoop ecosystem of projects
Spark Pipeline Essentials: Using Spark UI
UI: Event Timeline
UI: Job Details - DAG
UI: Stage Details
UI: Stage Metrics
UI: Skewed Data Metrics - Example
UI: Job Labels and Storage
UI: Job Labels and RDD Names
UI: DataFrame and Dataset Names
https://issues.apache.org/jira/browse/SPARK-8480
UI: Skipped Stages
http://stackoverflow.com/questions/34580662/what-does-stage-skipped-mean-in-apache-spark-web-ui
UI: Using Shuffle Metrics
Lots more in the UI
• SQL Queries
• Environment Variables
• Executor Aggregates
Spark Pipeline Essentials: Resource Allocation
Resources: Basics
• If running Spark on YARN:
  • First step: set up proper YARN resource queues and dynamic resource pools
Resources: Dynamic Allocation
• Dynamic allocation allows Spark to scale the cluster resources allocated to your application up and down based on the workload
• Originally Spark-on-YARN only; now supported by all cluster managers
Static Allocation vs Dynamic Allocation
• Static Allocation
  • --num-executors NUM
• Dynamic Allocation
  • Enabled by default in CDH
  • Good starting point
  • Not the final solution
Dynamic Allocation in Spark Streaming
• Enabled by default in CDH
• Cloudera recommends disabling dynamic allocation for Spark Streaming
• Why?
  • Dynamic allocation removes executors when they are idle, but in streaming, data comes in every batch and executors run whenever data is available
  • If the executor idle timeout is less than the batch duration, executors are constantly being added and removed
  • If the executor idle timeout is greater than the batch duration, executors are never removed
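At submit time, disabling dynamic allocation and sizing the streaming job statically might look like the sketch below (executor counts, class, and jar names are placeholders; the configuration flag is standard Spark configuration):

```shell
spark-submit \
  --conf spark.dynamicAllocation.enabled=false \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 4G \
  --class com.example.StreamingApp \
  streaming-app.jar
```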
Resources: # Executors, cores, memory !?!
• 6 nodes
• 16 cores each
• 64 GB of RAM each
Decisions, decisions, decisions
• Number of executors (--num-executors)
• Cores for each executor (--executor-cores)
• Memory for each executor (--executor-memory)

• 6 nodes
• 16 cores each
• 64 GB of RAM each
Spark Architecture recap
Answer #1 – Most granular
• Have the smallest-sized executors possible
  • 1 core each
  • 64 GB/node / 16 executors/node = 4 GB/executor
  • Total of 16 cores x 6 nodes = 96 cores => 96 executors
(Diagram: a worker node packed with many small single-core executors)
Why?
• Not using the benefits of running multiple tasks in the same executor
• Missing the benefits of shared broadcast variables: each executor needs its own copy of the data
Answer #2 – Least granular
• 6 executors in total => 1 executor per node
• 64 GB memory each
• 16 cores each
(Diagram: one 16-core, 64 GB executor occupying an entire worker node)
Why?
• Need to leave some memory overhead for OS/Hadoop daemons
Answer #3 – with overhead
• 6 executors – 1 executor/node
• 63 GB memory each
• 15 cores each
(Diagram: one executor per worker node, with 1 GB and 1 core set aside as overhead)
Let’s assume…
• You are running Spark on YARN, from here on…
• 4 other things to keep in mind
#1 – Memory overhead
• --executor-memory controls the heap size
• You need some overhead (controlled by spark.yarn.executor.memoryOverhead) for off-heap memory
• Default is max(384 MB, 0.10 * spark.executor.memory)
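For example (values, class, and jar names are illustrative; in this era of Spark the overhead property is given in megabytes):

```shell
# Container size requested from YARN = 19 GB heap + 2 GB off-heap overhead = 21 GB
spark-submit \
  --executor-memory 19G \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  --class com.example.MyApp \
  my-app.jar
```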
#2 - YARN AM needs a core: Client mode
#2 YARN AM needs a core: Cluster mode
#3 HDFS Throughput
• 15 cores per executor can lead to bad HDFS I/O throughput
• Best to keep it to 5 or fewer cores per executor
#4 Garbage Collection
• Too much executor memory can cause excessive garbage-collection delays
• 64 GB is a rough guess at a good upper limit for a single executor
• When you reach this level, start looking at GC tuning
Calculations
• 5 cores per executor for max HDFS throughput
• Cluster has 6 x 15 = 90 cores in total (after taking out cores for Hadoop/YARN daemons)
• 90 cores / 5 cores/executor = 18 executors
• Each node hosts 3 executors
• 63 GB/3 = 21 GB; 21 x (1 - 0.07) ~ 19 GB per executor (leaving ~7% for memory overhead)
• 1 executor reserved for the AM => 17 executors
(Diagram: three executors per worker node, plus reserved overhead)
Correct answer
• 17 executors in total
• 19 GB memory/executor
• 5 cores/executor
* Not etched in stone
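On the example cluster, the submit command would be along these lines (class and jar names are placeholders):

```shell
spark-submit \
  --num-executors 17 \
  --executor-cores 5 \
  --executor-memory 19G \
  --class com.example.MyApp \
  my-app.jar
```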
Dynamic allocation helps with this though, right?
• Number of executors (--num-executors)
• Cores for each executor (--executor-cores)
• Memory for each executor (--executor-memory)

• 6 nodes
• 16 cores each
• 64 GB of RAM each
Spark Pipeline Essentials: Tuning
Memory: Unified Memory Management
https://issues.apache.org/jira/browse/SPARK-10000
Memory: Example
• Let’s say you have a 64 GB executor
  • Default spark.memory.fraction: 0.6 => 0.6 x 64 = 38.4 GB
  • Default spark.memory.storageFraction: 0.5 => 0.5 x 38.4 = 19.2 GB
• Based on how much data is being spilled, GC pauses, and OOMEs, you can take the following actions:
  1. Increase the number of executors (increasing parallelism)
  2. Tweak spark.yarn.executor.memoryOverhead (avoid OOMEs)
  3. Tweak spark.memory.fraction (reduces memory pressure and spilling)
  4. Tweak spark.memory.storageFraction (to what you think is right, not excessive)
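The knobs above map to submit-time flags along these lines (values, class, and jar names are illustrative starting points, not recommendations):

```shell
spark-submit \
  --conf spark.memory.fraction=0.6 \
  --conf spark.memory.storageFraction=0.5 \
  --conf spark.yarn.executor.memoryOverhead=4096 \
  --class com.example.MyApp \
  my-app.jar
```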
Memory: Hidden Caches (GraphX)
org.apache.spark.graphx.lib.PageRank
Memory: Hidden Caches (MLlib)
Parallelism
• Number of tasks depends on the number of partitions
  • Too many partitions is usually better than too few
  • Very important parameter in determining performance
• Datasets read from HDFS rely on the number of HDFS blocks
  • Typically each HDFS block becomes a partition in the RDD
• The user can specify the number of partitions during input or transformations:

val rdd2 = rdd1.reduceByKey(_ + _, numPartitions = X)

• What should X be?
  • The most straightforward answer is experimentation
  • Look at the number of partitions in the parent RDD, then keep multiplying that by 1.5 until performance stops improving
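The 1.5x experiment can be sketched as a loop (illustrative only; `rdd1` is assumed to be an existing pair RDD, and timings should be taken on your real workload):

```scala
// Try the parent RDD's partition count, then ~1.5x more each round,
// and watch where the runtime stops improving.
val parent = rdd1.getNumPartitions
Iterator.iterate(parent)(n => math.ceil(n * 1.5).toInt).take(5).foreach { x =>
  val start = System.nanoTime()
  rdd1.reduceByKey(_ + _, numPartitions = x).count()
  val secs = (System.nanoTime() - start) / 1e9
  println(f"numPartitions=$x%d runtime=$secs%.1fs")
}
```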
How about the cluster?
• The two main resources that Spark (and YARN) think about are CPU and memory
• Disk and network I/O, of course, play a part in Spark performance as well
  • But neither Spark nor YARN currently does anything to actively manage them
Further Tuning
• Slim down your data structures
  • The in-memory footprint of your data structures greatly impacts performance
  • Kryo serialization is preferred over default Java serialization for custom objects
• Cache the data in memory to figure out the dataset size and estimate record sizes
  • Example: (total cached RDD size) / (number of records in RDD)
  • Gives a rough estimate of how much memory your records occupy
  • After several transformations that create custom objects, this is the easiest way to get the size
• Can also use SizeEstimator’s estimate method to find an object’s size
Spark Pipeline Essentials: Data Formats
Data Formats
• Parquet
• Avro
• JSON
  • Avoid if you can
  • Needless CPU cycles spent parsing large text files again and again
Storage: Parquet
• Popular columnar format for analytical workloads
  • Great performance
  • Efficient compression
  • Partition discovery & schema merging
• Writes files into HDFS
  • Small-files problem; needs monitoring and managed compactions
  • Makes the ETL pipeline complex when handling updates
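For example, converting a text-based dataset to Parquet once and reading the columnar copy thereafter (a sketch; the paths, `spark` session, and column names are illustrative):

```scala
import spark.implicits._

// One-time conversion: pay the JSON parsing cost once
val events = spark.read.json("/data/events_json")
events.write.mode("overwrite").parquet("/data/events_parquet")

// Subsequent jobs read only the columns they need from Parquet
val clicks = spark.read.parquet("/data/events_parquet")
  .filter($"eventType" === "click")
```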
Storage: Kudu
• Open source distributed columnar data store
• Runs on the native Linux filesystem
• Currently GA and ships with CDH
• Similar performance to Parquet
• Handles updates
  • No need to worry about files anymore
• Scales well
• Use from Spark via KuduContext
https://www.cloudera.com/products/open-source/apache-hadoop/apache-kudu.html
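With the kudu-spark integration, a write might be sketched like this (the master address, DataFrame `df`, and table name are examples):

```scala
import org.apache.kudu.spark.kudu.KuduContext

val kuduContext = new KuduContext("kudu-master:7051", spark.sparkContext)

// Upserts update existing rows and insert new ones; no file compaction to manage
kuduContext.upsertRows(df, "impala::default.events")
```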
Spark Pipeline Essentials: Streaming
Streaming: Spark & Kafka Integration
• Use the Direct Approach
  • Simplified parallelism
  • Efficient and more reliable
  • Exactly-once semantics
  • Requires offset management
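With the Kafka 0.10 integration, a direct stream is created roughly like this (broker address, group id, and topic are examples; `ssc` is assumed to be an existing StreamingContext):

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "spark-meetup-demo",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)  // manage offsets ourselves
)

// One Kafka partition maps to one Spark partition: simplified parallelism
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))
```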
Streaming: Kafka Offset Management
• Set the Kafka parameter ‘auto.offset.reset’
• Spark Streaming checkpoints
• Storing offsets in HBase
• Storing offsets in ZooKeeper
• Kafka itself
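Committing offsets back to Kafka itself, for instance, can be sketched as follows (assumes a direct stream from the Kafka 0.10 integration):

```scala
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  // Capture this batch's offset ranges before any shuffle
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ...process and write out the batch here...

  // Commit only after the output has succeeded (at-least-once)
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
```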
More Resources
• Top 5 Spark Mistakes
  • https://spark-summit.org/2016/events/top-5-mistakes-when-writing-spark-applications/
• Self-paced Spark workshop
  • https://github.com/deanwampler/spark-workshop
• Tips for Better Spark Jobs
  • http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs
  • http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
• Tuning & Debugging Spark (with another explanation of internals)
  • http://www.slideshare.net/pwendell/tuning-and-debugging-in-apache-spar
Questions?