dtcc '14 spark runtime internals
DESCRIPTION
An introduction to Apache Spark runtime internals.TRANSCRIPT
![Page 2: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/2.jpg)
What is
• A fast and general engine for large-scale data
processing
• An open source implementation of Resilient
Distributed Datasets (RDD)
• Has an advanced DAG execution engine that
supports cyclic data flow and in-memory
computing
![Page 3: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/3.jpg)
Why
• Fast
– Run machine learning like iterative programs up to
100x faster than Hadoop MapReduce in memory,
or 10x faster on disk
– Run HiveQL compatible queries 100x faster than
Hive (with Shark/Spark SQL)
![Page 4: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/4.jpg)
Why
• Easy to use
– Fluent Scala/Java/Python API
– Interactive shell
– 2-5x less code (than Hadoop MapReduce)
![Page 5: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/5.jpg)
Why
• Easy to use
– Fluent Scala/Java/Python API
– Interactive shell
– 2-5x less code (than Hadoop MapReduce)
sc.textFile("hdfs://...").flatMap(_.split(" ")).map(_ -> 1).reduceByKey(_ + _).collectAsMap()
![Page 6: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/6.jpg)
Why
• Easy to use
– Fluent Scala/Java/Python API
– Interactive shell
– 2-5x less code (than Hadoop MapReduce)
sc.textFile("hdfs://...").flatMap(_.split(" ")).map(_ -> 1).reduceByKey(_ + _).collectAsMap()
Can you write down
WordCount
in 30 seconds with
Hadoop MapReduce?
![Page 7: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/7.jpg)
Why
• Unified big data pipeline for:
– Batch/Interactive (Spark Core vs MR/Tez)
– SQL (Shark/Spark SQL vs Hive)
– Streaming (Spark Streaming vs Storm)
– Machine learning (MLlib vs Mahout)
– Graph (GraphX vs Giraph)
![Page 8: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/8.jpg)
A Bigger Picture...
![Page 9: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/9.jpg)
An Even Bigger Picture...
• Components of the Spark stack focus on big data analysis and are compatible with existing Hadoop storage systems
• Users don’t need to suffer expensive ETL cost to use the Spark stack
![Page 10: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/10.jpg)
“One Stack To Rule Them All”
• Well, mostly :-)
• And don't forget Shark/Spark SQL vs Hive
![Page 11: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/11.jpg)
Resilient Distributed Datasets
![Page 12: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/12.jpg)
Resilient Distributed Datasets
• Conceptually, RDDs can be roughly viewed as partitioned, locality aware distributed vectors
• An RDD…
– either points to a direct data source
– or applies some transformation to its parent RDD(s) to generate new data elements
– Computation can be represented by lazy evaluatedlineage DAGs composed by connected RDDs
![Page 13: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/13.jpg)
Resilient Distributed Datasets
• Frequently used RDDs can be materialized and
cached in-memory to accelerate computation
• Spark scheduler takes data locality into
account
![Page 14: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/14.jpg)
The In-Memory Magic
• “In fact, one study [1] analyzed the access
patterns in the Hive warehouses at Facebook
and discovered that for the vast majority (96%)
of jobs, the entire inputs could fit into a
fraction of the cluster’s total memory.”
[1] G. Ananthanarayanan, A. Ghodsi, S. Shenker, and I. Stoica. Disk-locality in
datacenter computing considered irrelevant. In HotOS ’11, 2011.
![Page 15: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/15.jpg)
The In-Memory Magic
• Without cache
– Elements are accessed in an iterator-based
streaming style
– One element a time, no bulk copy
– Space complexity is almost O(1) when there’s only
narrow dependencies
![Page 16: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/16.jpg)
The In-Memory Magic
• With cache
– One block per RDD partition
– LRU cache eviction
– Locality aware
– Evicted blocks can be recomputed in parallel with
the help of RDD lineage DAG
![Page 17: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/17.jpg)
Play with Error Logs
sc.textFile("hdfs://<input>").filter(_.startsWith("ERROR")).map(_.split(" ")(1)).saveAsTextFile("hdfs://<output>")
![Page 18: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/18.jpg)
Step by Step
The RDD.saveAsTextFile() action triggers a job. Tasks are started on
scheduled executors.
![Page 19: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/19.jpg)
The task pulls data lazily from the final RDD (the MappedRDD here).
Step by Step
![Page 20: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/20.jpg)
A record (text line) is pulled out from HDFS...
Step by Step
![Page 21: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/21.jpg)
... into a HadoopRDD
Step by Step
![Page 22: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/22.jpg)
Then filtered with the FilteredRDD
Step by Step
![Page 23: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/23.jpg)
Oops, not a match
Step by Step
![Page 24: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/24.jpg)
Another record is pulled out...
Step by Step
![Page 25: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/25.jpg)
Filtered again...
Step by Step
![Page 26: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/26.jpg)
Passed to the MappedRDD for next transformation
Step by Step
![Page 27: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/27.jpg)
Transformed by the MappedRDD (the error message is extracted)
Step by Step
![Page 28: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/28.jpg)
Saves the message to HDFS
Step by Step
![Page 29: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/29.jpg)
A Synthesized Iterator
new Iterator[String] {private var head: String = _private var headDefined: Boolean = false
def hasNext: Boolean = headDefined || {do {
try head = readOneLineFromHDFS(...) // (1) read from HDFScatch {
case _: EOFException => return false}
} while (!head.startsWith("ERROR")) // (2) filter closuretrue
}
def next: String = if (hasNext) {headDefined = falsehead.split(" ")(1) // (3) map closure
} else {throw new NoSuchElementException("...")
}}
Constant Space
Complexity!
![Page 30: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/30.jpg)
If Cached...
val cached = sc.textFile("hdfs://<input>").filter(_.startsWith("ERROR")).cache()
cached.map(_.split(" ")(1)).saveAsTextFile("hdfs://<output>")
![Page 31: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/31.jpg)
If Cached...
All filtered error messages are cached in memory before being passed to the
next RDD. LRU cache eviction is applied when memory is insufficient
![Page 32: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/32.jpg)
But… I Don’t Have Enough
Memory To Cache All Data
![Page 33: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/33.jpg)
Don’t Worry
• In most cases, you only need to cache a fraction of
hot data extracted/transformed from the whole input
dataset
• Spark performance downgrades linearly & gracefully
when memory decreases
![Page 34: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/34.jpg)
How Does The Job Get
Distributed?
![Page 35: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/35.jpg)
Spark Cluster Overview
![Page 36: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/36.jpg)
Driver, Cluster Manager,
Worker, Executor, ...
So... Where THE HELL
Does My Code Run?
![Page 37: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/37.jpg)
Well... It Depends
![Page 38: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/38.jpg)
Driver and SparkContext
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
val sc = new SparkContext(...)
• A SparkContext initializes the application driver,
the latter then registers the application to the cluster
manager, and gets a list of executors
• Since then, the driver takes full responsibilities
![Page 39: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/39.jpg)
WordCount Revisited
val lines = sc.textFile("input")
val words = lines.flatMap(_.split(" "))
val ones = words.map(_ -> 1)
val counts = ones.reduceByKey(_ + _)
val result = counts.collectAsMap()
• RDD lineage DAG is built on driver side with:
– Data source RDD(s)
– Transformation RDD(s), which are created by transformations
![Page 40: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/40.jpg)
WordCount Revisited
val lines = sc.textFile("input")
val words = lines.flatMap(_.split(" "))
val ones = words.map(_ -> 1)
val counts = ones.reduceByKey(_ + _)
val result = counts.collectAsMap()
• Once an action is triggered on driver side, a
job is submitted to the DAG scheduler of the
driver
![Page 41: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/41.jpg)
WordCount Revisited
• DAG scheduler cuts the DAG into stages and
turns each partition of a stage into a single task.
• DAG scheduler decides what to run
![Page 42: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/42.jpg)
WordCount Revisited
• Tasks are then scheduled to executors by
driver side task scheduler according to
resource and locality constraints
• Task scheduler decides where to run
![Page 43: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/43.jpg)
WordCount Revisited
val lines = sc.textFile("input")
val words = lines.flatMap(_.split(" "))
val ones = words.map(_ -> 1)
val counts = ones.reduceByKey(_ + _)
val result = counts.collectAsMap()
• Within a task, the lineage DAG of
corresponding stage is serialized together with
closures of transformations, then sent to and
executed on scheduled executors
![Page 44: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/44.jpg)
WordCount Revisited
val lines = sc.textFile("input")
val words = lines.flatMap(_.split(" "))
val ones = words.map(_ -> 1)
val counts = ones.reduceByKey(_ + _)
val result = counts.collectAsMap()
• The reduceByKey transformation introduces in a shuffle
• Shuffle outputs are written to local FS on the mapper side, then downloaded by reducers
![Page 45: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/45.jpg)
WordCount Revisited
val lines = sc.textFile("input")
val words = lines.flatMap(_.split(" "))
val ones = words.map(_ -> 1)
val counts = ones.reduceByKey(_ + _)
val result = counts.collectAsMap()
• ReduceByKey automatically combines values
within a single partition locally on the mapper
side and then reduce them globally on the
reducer side.
![Page 46: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/46.jpg)
WordCount Revisited
val lines = sc.textFile("input")
val words = lines.flatMap(_.split(" "))
val ones = words.map(_ -> 1)
val counts = ones.reduceByKey(_ + _)
val result = counts.collectAsMap()
• At last, results of the action are sent back to
the driver, then the job finishes.
![Page 47: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/47.jpg)
What About Parallelism?
![Page 48: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/48.jpg)
How To Control Parallelism?
• Can be specified in a number of ways
– RDD partition number• sc.textFile("input", minSplits = 10)
• sc.parallelize(1 to 10000, numSlices = 10)
– Mapper side parallelism
• Usually inherited from parent RDD(s)
– Reducer side parallelism• rdd.reduceByKey(_ + _, numPartitions = 10)
• rdd.reduceByKey(partitioner = p, _ + _)
• …
![Page 49: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/49.jpg)
How To Control Parallelism?
• “Zoom in/out”– RDD.repartition(numPartitions: Int)
– RDD.coalesce(numPartitions: Int,shuffle: Boolean)
![Page 50: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/50.jpg)
A Trap of Partitions
sc.textFile("input", minSplits = 2)
.map { line =>
val Array(key, value) = line.split(",")
key.toInt -> value.toInt
}
.reduceByKey(_ + _)
.saveAsText("output")
• Run in local mode
• 60+GB input file
![Page 51: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/51.jpg)
A Trap of Partitions
sc.textFile("input", minSplits = 2)
.map { line =>
val Array(key, value) = line.split(",")
key.toInt -> value.toInt
}
.reduceByKey(_ + _)
.saveAsText("output") Hmm...
Pretty cute?
![Page 52: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/52.jpg)
A Trap of Partitions
sc.textFile("input", minSplits = 2)
.map { line =>
val Array(key, value) = line.split(",")
key.toInt -> value.toInt
}
.reduceByKey(_ + _)
.saveAsText("output")
• Actually, this may
generate 4,000,000
temporary files...
• Why 4M? (Hint: 2K2)
!!!
![Page 53: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/53.jpg)
A Trap of Partitions
sc.textFile("input", minSplits = 2)
.map { line =>
val Array(key, value) = line.split(",")
key.toInt -> value.toInt
}
.reduceByKey(_ + _)
.saveAsText("output")
• 1 partition per HDFS split
• Similar to mapper task number in Hadoop
MapReduce, the minSplits parameter is
only a hint for split number
![Page 54: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/54.jpg)
A Trap of Partitions
sc.textFile("input", minSplits = 2)
.map { line =>
val Array(key, value) = line.split(",")
key.toInt -> value.toInt
}
.reduceByKey(_ + _)
.saveAsText("output")
• Actual split number is also controlled by some
other properties like minSplitSize and default
FS block size
![Page 55: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/55.jpg)
A Trap of Partitions
sc.textFile("input", minSplits = 2)
.map { line =>
val Array(key, value) = line.split(",")
key.toInt -> value.toInt
}
.reduceByKey(_ + _)
.saveAsText("output")
• In this case, the final split size equals to local
FS block size, which is 32MB by default, and
60GB / 32MB ≈ 2K
• ReduceByKey generates 2K2 shuffle outputs
![Page 56: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/56.jpg)
A Trap of Partitions
sc.textFile("input", minSplits = 2).coalesce(2)
.map { line =>
val Array(key, value) = line.split(",")
key.toInt -> value.toInt
}
.reduceByKey(_ + _)
.saveAsText("output")
• Use RDD.coalesce() to control partition
number precisely.
![Page 57: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/57.jpg)
Summary
• In-memory magic
– Iterator-based streaming style data access
– Usually you don’t need to cache all dataset
– When memory is not enough, Spark performance
downgrades linearly & gracefully
![Page 58: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/58.jpg)
Summary
• Distributed job
– Locality aware
– Resource aware
• Parallelism
– Reasonable default behavior
– Can be finely tuned
![Page 59: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/59.jpg)
References
• Resilient distributed datasets: a fault-tolerant
abstraction for in-memory cluster computing
• An Architecture for Fast and General Data
Processing on Large Clusters
• Spark Internals (video, slides)
• Shark: SQL and Rich Analytics at Scale
• Spark Summit 2013
![Page 60: DTCC '14 Spark Runtime Internals](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c66e9d4a795913618b45f0/html5/thumbnails/60.jpg)
Q&A