
Page 1: Think Like Spark

Think Like Spark: Some Spark Concepts & A Use Case

Page 2: Think Like Spark

Who am I?

• Software engineer, data scientist, and Spark enthusiast at Alpine Data (an SF-based analytics company)

• Co-author of High Performance Spark: http://shop.oreilly.com/product/0636920046967.do

• LinkedIn: https://www.linkedin.com/in/rachelbwarren
• SlideShare: http://www.slideshare.net/RachelWarren4
• GitHub: rachelwarren. Code for this talk: https://github.com/high-performance-spark/high-performance-spark-examples

• Twitter: @warre_n_peace

Page 3: Think Like Spark

Overview

• A little Spark architecture: How are Spark jobs evaluated? Why does that matter for performance?
• Execution context: driver, executors, partitions, cores
• Spark application hierarchy: jobs / stages / tasks
• Actions vs. transformations (lazy evaluation)
• Wide vs. narrow transformations (shuffles & data locality)

• Apply what we have learned with four versions of the same algorithm to find rank statistics

Page 4: Think Like Spark

What is Spark?

Distributed computing framework. Must run in tandem with a data storage system:
- Standalone (for local testing)
- Cloud (S3, EC2)
- Distributed storage with a cluster manager (Hadoop YARN, Apache Mesos)

Built around an abstraction called RDDs, "Resilient Distributed Datasets":
- Lazily evaluated, immutable, distributed collection of partition objects
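A minimal sketch (not from the slides) of what "lazily evaluated" means in practice, assuming a local SparkContext and made-up data:

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf().setAppName("think-like-spark").setMaster("local[2]")
  val sc = new SparkContext(conf)

  // Defining RDDs only builds up the lineage graph; no work runs yet.
  val numbers = sc.parallelize(1 to 1000000, numSlices = 8) // 8 partitions
  val squares = numbers.map(n => n.toLong * n)              // still lazy

  // The action finally triggers a distributed job over the partitions.
  val total = squares.reduce(_ + _)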

Page 5: Think Like Spark

What happens when I launch a Spark Application?

Page 6: Think Like Spark

Spark Driver

[Diagram: the Spark driver coordinating several executors, backed by stable storage, e.g. HDFS]

Page 7: Think Like Spark

One Spark Executor

• One JVM for in-memory computations

• Partitions are computed on executors

• Tasks correspond to partitions

• Dynamically allocated slots for running tasks (executor cores x executors)

• Caching takes up space on executors

[Diagram: Spark DAG, partitions / tasks]
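As a rough illustration of the "executor cores x executors" slot count (a sketch, not the talk's configuration; the numbers below are made up):

  import org.apache.spark.SparkConf

  // Illustrative only: 10 executors x 4 cores each = 40 task slots in the cluster.
  val conf = new SparkConf()
    .setAppName("think-like-spark")
    .set("spark.executor.instances", "10") // number of executors (YARN)
    .set("spark.executor.cores", "4")      // concurrent task slots per executor
    .set("spark.executor.memory", "4g")    // static memory per executor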

Page 8: Think Like Spark

Implications

Two most common kinds of failures:

1. Failure during the shuffle stage
• Moving data between partitions requires communication with the driver; failures often occur in the shuffle stage

2. Out-of-memory errors on executors and the driver
• The driver and each executor have a static amount of memory*
• It is easy to run out of memory on the executors or on the driver

*dynamic allocation allows changing the number of executors

Page 9: Think Like Spark

How are Jobs Evaluated?

API Call -> Execution Element

• Spark Context object -> Spark Application
• Action (e.g. collect, saveAsTextFile) -> Job
• Wide transformation (sort, groupByKey) -> Stage (stages are executed in sequence)
• Computation to evaluate one partition (combined narrow transforms) -> Task (tasks are executed in parallel)
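A small sketch of how this maps onto code (the input path is hypothetical; assumes an existing SparkContext `sc`): one action produces one job, the shuffle splits it into two stages, and each stage runs one task per partition.

  val counts = sc.textFile("hdfs:///some/input")   // hypothetical path
    .flatMap(_.split(" "))                         // narrow
    .map(word => (word, 1))                        // narrow: same stage as flatMap
    .reduceByKey(_ + _)                            // wide: stage boundary (shuffle)

  counts.collect()                                 // action: one job, two stages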

Page 10: Think Like Spark

Types Of Spark Operations

Actions
• RDD => not RDD
• Force execution: each job ends in exactly one action
• Three kinds:
  - Move data to the driver: collect, take, count
  - Move data to an external system: write / save
  - Force evaluation: foreach

Transformations
• RDD => RDD
• Lazily evaluated
• Can be combined and executed in one pass of the data
• Computed on Spark executors
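A sketch of the three kinds of actions, assuming an existing RDD[String] called `lines` and a hypothetical output path:

  val firstTen = lines.take(10)                // move data to the driver
  val total = lines.count()                    // move a summary to the driver
  lines.saveAsTextFile("hdfs:///some/output")  // move data to an external system
  lines.foreach(line => println(line))         // force evaluation; println runs on the executors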

Page 11: Think Like Spark

Implications of Lazy Evaluation

Frustrating:
• Debugging is harder: the lineage graph is built backwards from the action to reading in the data or to a persist / cache / checkpoint
• If you aren't careful you will repeat computations*

*sometimes we get help from shuffle files

Awesome:
• Spark can combine some types of transformations and execute them in a single task
• We only compute the partitions that we need
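A sketch of the "repeat computations" pitfall, assuming `sc` and a hypothetical `expensiveParse` function:

  val parsed = sc.textFile("hdfs:///logs").map(expensiveParse)  // lazy; expensiveParse is a stand-in

  parsed.count()   // job 1: reads the file and parses every record
  parsed.take(5)   // job 2: re-reads and re-parses the partitions it needs

  // Persisting breaks the lineage replay: later actions reuse cached partitions.
  parsed.cache()
  parsed.count()   // computes and caches
  parsed.take(5)   // served from the cache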

Page 12: Think Like Spark

Types of Transformations

Narrow
• Never require a shuffle
• map, mapPartitions, filter
• coalesce*
• Input partitions >= output partitions, & output partitions are known at design time
• A sequence of narrow transformations is combined and executed in one stage as several tasks

Wide
• May require a shuffle
• sort, groupByKey, reduceByKey, join
• Require data movement
• Partitioning depends on the data itself (not known at design time)
• Cause a stage boundary: Stage 2 cannot be completed until all the partitions in Stage 1 are computed
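One way to see the stage boundary, assuming a pair RDD `pairs: RDD[(String, Int)]` (a sketch, not from the talk):

  val narrowChain = pairs.filter(_._2 > 0).mapValues(_ * 2) // narrow: stays in one stage
  val wide = narrowChain.reduceByKey(_ + _)                 // wide: forces a shuffle

  // The lineage shows a ShuffledRDD where the new stage begins.
  println(wide.toDebugString)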

Page 13: Think Like Spark

Partition Dependencies for input and output partitions

[Diagram: partition dependencies for narrow vs. wide transformations]

Page 14: Think Like Spark

Implications of Shuffles

• Narrow transformations are faster / more parallelizable
• Narrow transformations must be written so that they can be computed on any subset of records
• Narrow transformations can rely on some partitioning information (partitioning remains constant within each stage)*
• Wide transformations may distribute data unevenly across machines (for example according to the hash value of the key)

*we can lose partitioning information with map or mapPartitions(preservesPartitioning = false)
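A sketch of how partitioning information is kept or lost, assuming `pairs: RDD[(String, Int)]`:

  import org.apache.spark.HashPartitioner

  val partitioned = pairs.partitionBy(new HashPartitioner(8))

  partitioned.mapValues(_ + 1).partitioner  // Some(HashPartitioner): keys unchanged, partitioning kept
  partitioned.map(identity).partitioner     // None: a plain map may change keys, so the partitioner is dropped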

Page 15: Think Like Spark

The “Goldilocks Use Case”

Page 16: Think Like Spark

Rank Statistics on Wide Data

Design an application that takes an arbitrary list of longs `n1`...`nk` and returns the `nth` largest element in each column of a DataFrame of doubles.

For example, if the input list is (8, 1000, 20 million), our function would need to return the 8th, 1000th and 20 millionth largest element in each column.

Page 17: Think Like Spark

Input Data

If we were looking for the 2nd and 4th elements, the result would be:

Page 18: Think Like Spark

V0: Iterative solution

Loop through each column:
• Map to the values in that one column
• Sort the column
• Zip with index and filter for the correct rank statistic (i.e. the nth element)
• Add the result for each column to a map

Page 19: Think Like Spark

def findRankStatistics(
    dataFrame: DataFrame,
    ranks: List[Long]): Map[Int, Iterable[Double]] = {

  val numberOfColumns = dataFrame.schema.length
  var i = 0
  var result = Map[Int, Iterable[Double]]()
  dataFrame.persist()

  while (i < numberOfColumns) {
    val col = dataFrame.rdd.map(row => row.getDouble(i))
    val sortedCol: RDD[(Double, Long)] = col.sortBy(v => v).zipWithIndex()
    val ranksOnly = sortedCol.filter {
      // rank statistics are indexed from one
      case (colValue, index) => ranks.contains(index + 1)
    }.keys
    val list = ranksOnly.collect()
    result += (i -> list)
    i += 1
  }
  result
}

Persist prevents multiple data reads

sortBy is Spark's sort

Page 20: Think Like Spark

V0 = Too Many Sorts


• One distributed sort per column (800 cols = 800 sorts)
• Each of these sorts is executed in sequence
• Cannot save partitioning data between sorts

300 million rows takes days!

Page 21: Think Like Spark

V1: Parallelize by Column

• The work to sort each column can be done without information about the other columns
• Map the data to (column index, value) pairs
• groupByKey on column index
• Sort each group
• Filter for the desired rank statistics

Page 22: Think Like Spark

Get Col Index, Value Pairs

private def getValueColumnPairs(dataFrame: DataFrame): RDD[(Double, Int)] = {
  dataFrame.rdd.flatMap { row: Row =>
    row.toSeq.zipWithIndex.map {
      case (v, index) => (v.toString.toDouble, index)
    }
  }
}

flatMap is a narrow transformation

Column Index | Value
1            | 15.0
1            | 2.0
...          | ...

Page 23: Think Like Spark

Group By Key Solution

def findRankStatistics(
    dataFrame: DataFrame,
    ranks: List[Long]): Map[Int, Iterable[Double]] = {
  require(ranks.forall(_ > 0))
  // Map to column index, value pairs
  val pairRDD: RDD[(Int, Double)] = mapToKeyValuePairs(dataFrame)

  val groupColumns: RDD[(Int, Iterable[Double])] = pairRDD.groupByKey()
  groupColumns.mapValues(iter => {
    // convert to an array and sort
    val sortedIter = iter.toArray.sorted

    sortedIter.toIterable.zipWithIndex.flatMap({
      case (colValue, index) =>
        if (ranks.contains(index + 1)) Iterator(colValue) else Iterator.empty
    })
  }).collectAsMap()
}

Page 24: Think Like Spark

V1: Faster on Small Data, Fails on Big Data

300 K rows = quick
300 M rows = fails

Page 25: Think Like Spark

Problems with V1

• groupByKey puts records from all the columns with the same hash value on the same partition, THEN loads them into memory

• All columns with the same hash value have to fit in memory on each executor

• Can't start the sorting until after the groupByKey phase has finished

Page 26: Think Like Spark

V2 : Secondary Sort Style

1. `repartitionAndSortWithinPartitions`: use the same hash partitioner as groupByKey
   - Partition by key and sort all records on each partition
   - Pushes some of the sorting work done on each partition into the shuffle stage

2. Use mapPartitions to filter for the correct rank statistics
   - Doesn't force each column to be stored as an in-memory data structure (each partition stays as one iterator)

Still fails on 300 M rows

Page 27: Think Like Spark

Iterator-Iterator-Transformation With Map Partitions

• Iterators are not collections; they are a routine for accessing each element

• Allows Spark to selectively spill to disk

• Don't need to put all elements into memory

In our case: prevents loading each column into memory after the sorting stage
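A sketch of the difference, assuming `pairRDD: RDD[(Int, Double)]`:

  // Iterator-to-iterator: the partition is consumed lazily, element by element,
  // so Spark can spill to disk instead of holding everything in memory.
  val filteredLazily = pairRDD.mapPartitions { iter =>
    iter.filter { case (_, value) => value > 0.5 }
  }

  // Anti-pattern: materializing the iterator pulls the whole partition into memory.
  val filteredEagerly = pairRDD.mapPartitions { iter =>
    iter.toList.filter { case (_, value) => value > 0.5 }.iterator
  }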

Page 28: Think Like Spark

def findRankStatisticsV2(
    pairRDD: RDD[(Int, Double)],
    targetRanks: List[Long],
    partitions: Int) = {

  val partitioner = new HashPartitioner(partitions)
  val sorted = pairRDD
    .repartitionAndSortWithinPartitions(partitioner)
  // ... continued on the next slide
}

V2: Secondary Sort

Repartition + sort using Hash Partitioner

Page 29: Think Like Spark

// findRankStatisticsV2, continued from the previous slide:
  val filterForTargetIndex = sorted.mapPartitions(iter => {
    var currentIndex = -1
    var elementsPerIndex = 0
    val filtered = iter.filter {
      case (colIndex, value) =>
        if (colIndex != currentIndex) {
          currentIndex = colIndex
          elementsPerIndex = 1
        } else {
          elementsPerIndex += 1
        }
        targetRanks.contains(elementsPerIndex)
    }
    groupSorted(filtered) // groups together ranks in the same column
  }, preservesPartitioning = true)

  filterForTargetIndex.collectAsMap()
}

V2: Secondary Sort

Iterator-to-iterator transformation

Page 30: Think Like Spark

V2: Still Fails

We don't have to put each column into memory, but columns with the same hash value still have to fit on one partition.

Page 31: Think Like Spark

Back to the drawing board

• Narrow transformations are quick and easy to parallelize
• Partition locality can be retained across narrow transformations
• Wide transformations are best with many unique keys
• Using iterator-to-iterator transforms in mapPartitions prevents whole partitions from being loaded into memory
• We can rely on shuffle files to prevent re-computation of a wide transformation by several subsequent actions

We can solve the problem with one sortByKey and three mapPartitions

Page 32: Think Like Spark

V3: Mo Parallel, Mo Better

1. Map to (cell value, column index) pairs
2. Do one very large sortByKey
3. Use mapPartitions to count the values per column on each partition
4. (Locally) use the results of step 3 to compute the location of each rank statistic on each partition
5. Revisit each partition and find the correct rank statistics using the information from step 4

e.g. if the first partition has 10 elements from one column, the 13th element will be the third element on the second partition for that column.
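A rough sketch of what step 3 might look like (the helper name matches the one used on the next slide, but this is a simplified reconstruction; the full version is in the linked repo):

  import org.apache.spark.rdd.RDD

  def getColumnsFreqPerPartition(
      sortedValueColumnPairs: RDD[(Double, Int)],
      numOfColumns: Int): Array[(Int, Array[Long])] = {
    sortedValueColumnPairs.mapPartitionsWithIndex { (partitionIndex, valueColumnPairs) =>
      // One pass over the partition, tallying how many values each column contributes.
      val columnsFreq = Array.fill[Long](numOfColumns)(0L)
      valueColumnPairs.foreach { case (_, columnIndex) =>
        columnsFreq(columnIndex) += 1
      }
      Iterator((partitionIndex, columnsFreq))
    }.collect()
  }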

Page 33: Think Like Spark

def findRankStatistics(
    dataFrame: DataFrame,
    targetRanks: List[Long]): Map[Int, Iterable[Double]] = {

  val valueColumnPairs: RDD[(Double, Int)] = getValueColumnPairs(dataFrame)
  val sortedValueColumnPairs = valueColumnPairs.sortByKey()
  sortedValueColumnPairs.persist(StorageLevel.MEMORY_AND_DISK)

  val numOfColumns = dataFrame.schema.length
  val partitionColumnsFreq =
    getColumnsFreqPerPartition(sortedValueColumnPairs, numOfColumns)
  val ranksLocations =
    getRanksLocationsWithinEachPart(targetRanks, partitionColumnsFreq, numOfColumns)

  val targetRanksValues =
    findTargetRanksIteratively(sortedValueColumnPairs, ranksLocations)
  targetRanksValues.groupByKey().collectAsMap()
}

Complete code here: https://github.com/high-performance-spark/high-performance-spark-examples/blob/master/src/main/scala/com/high-performance-spark-examples/GoldiLocks/GoldiLocksFirstTry.scala

1. Map to (value, column index) pairs
2. Sort
3. Count per partition
4. Compute the location of each rank statistic on each partition
5. Filter for the elements computed in step 4

Page 34: Think Like Spark

Complete code here:

https://github.com/high-performance-spark/high-performance-spark-examples/blob/master/src/main/scala/com/high-performance-spark-examples/GoldiLocks/GoldiLocksFirstTry.scala

Page 35: Think Like Spark

V3: Still Blows up!

• First partitions show lots of failures and straggler tasks

• The job lags in the sort stage and fails in the final mapPartitions stage

More digging revealed the data was not evenly distributed

Page 36: Think Like Spark

Data skew: ¼ of columns are zero

Page 37: Think Like Spark

V4: Distinct values per Partition

• Instead of mapping to (value, column index) pairs, map to ((value, column index), count) pairs on each partition

e.g. if on a given partition there are ten rows with 0.0 in the 2nd column, we can save just one tuple: ((0.0, 2), 10)

• Use the same sort and mapPartitions routines, but adjusted for counts

Page 38: Think Like Spark

Different Key

column0 | column1
2.0     | 3.0
0.0     | 3.0
0.0     | 1.0
0.0     | 0.0

((value, column index), count)

((2.0, 0), 1)
((0.0, 0), 3)
((3.0, 1), 2)
…

Page 39: Think Like Spark

V4: Get ((value, column index), count) pairs

• Code for V4

def getAggregatedValueColumnPairs(dataFrame: DataFrame): RDD[((Double, Int), Long)] = {
  val aggregatedValueColumnRDD = dataFrame.rdd.mapPartitions(rows => {
    val valueColumnMap = new mutable.HashMap[(Double, Int), Long]()
    rows.foreach(row => {
      row.toSeq.zipWithIndex.foreach { case (value, columnIndex) =>
        val key = (value.toString.toDouble, columnIndex)
        val count = valueColumnMap.getOrElseUpdate(key, 0)
        valueColumnMap.update(key, count + 1)
      }
    })
    valueColumnMap.toIterator
  })
  aggregatedValueColumnRDD
}

Map to ((value, column index), count)

Using a HashMap to keep track of uniques

Page 40: Think Like Spark

Code for V4

• Lots more code is needed to complete the whole algorithm: https://github.com/high-performance-spark/high-performance-spark-examples/blob/master/src/main/scala/com/high-performance-spark-examples/GoldiLocks/GoldiLocksWithHashMap.scala

Page 41: Think Like Spark

V4: Success!

• 4 times faster than the previous solution on small data

• More robust and more parallelizable! Scales to billions of rows!

Happy Goldilocks!

Page 42: Think Like Spark

Why is V4 Better?

Advantages:
• Sorting 75% of the original records
• Most keys are distinct
• No stragglers, easy to parallelize
• Can parallelize in many different ways

Page 43: Think Like Spark

Lessons

• Sometimes performance looks ugly
• Best unit of parallelization? Not always the most intuitive
• Shuffle less
  - Push work into narrow transformations
  - Leverage data locality to prevent shuffles
• Shuffle better
  - Shuffle fewer records
  - Use narrow transformations to filter or reduce when possible
  - Shuffle across keys that are well distributed
  - Best if records associated with one key fit in memory
• Be aware of data skew; know your data

Page 44: Think Like Spark

Before We Part …

• Alpine Data is hiring! Data scientists, engineers (Ruby, Java, as well as Hadoop / Scala), support, technical sales. "I continue to be amazed, Alpine has the nicest people ever" – former Alpine engineer. http://alpinedata.com/careers/

• Buy my book! http://shop.oreilly.com/product/0636920046967.do Also contact me if you are interested in being a reviewer.