5 Apache Spark Tips in 5 Minutes

Uploaded by cloudera-inc, 16-Apr-2017.

Page 1: 5 Apache Spark Tips in 5 Minutes

© Cloudera, Inc. All rights reserved.

5 Spark Tips in 5 Minutes. Imran Rashid | Cloudera Engineer, Apache Spark PMC

Page 2: 5 Apache Spark Tips in 5 Minutes


#1: Name Cached RDDs and Accumulators

rdd.cache()
rdd.setName(…)

BAD: sc.accumulator(0L)
GOOD: sc.accumulator(0L, "my counter")
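The tip above can be sketched as a runnable snippet (a minimal local-mode sketch; the RDD contents and names are made up for illustration, and it uses the Spark 2.x+ accumulator API, sc.longAccumulator, where the slide's sc.accumulator(0L, "my counter") is the 1.x equivalent):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object NamedThings {
  def run(): (String, Long) = {
    val sc = new SparkContext(
      new SparkConf().setAppName("tip1").setMaster("local[2]"))
    try {
      // Name the cached RDD so it is identifiable on the UI's Storage tab,
      // rather than showing up as an anonymous "MapPartitionsRDD".
      val lines = sc.parallelize(Seq("a", "b", "c")).setName("raw-lines").cache()

      // Named accumulators are shown per-stage in the UI; unnamed ones are not.
      val counter = sc.longAccumulator("my counter")
      lines.foreach(_ => counter.add(1))

      (lines.name, counter.sum)
    } finally sc.stop()
  }
}
```

The names ("raw-lines", "my counter") are what you will see on the Storage and Stages pages of the web UI.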

Page 3: 5 Apache Spark Tips in 5 Minutes


#1b: MEMORY_AND_DISK

• BAD: rdd.cache()
  • If a partition is dropped, it is recomputed from scratch
• GOOD: rdd.persist(StorageLevel.MEMORY_AND_DISK)

(Diagram: Huge Raw Data → Filter → FlatMap → … → cache)
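A minimal sketch of the point above, with a made-up filter/flatMap chain standing in for the expensive derivation: with MEMORY_AND_DISK, partitions evicted from memory spill to local disk instead of being recomputed from the raw input.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistTip {
  def run(): (Boolean, Boolean) = {
    val sc = new SparkContext(
      new SparkConf().setAppName("tip1b").setMaster("local[2]"))
    try {
      // After an expensive filter/flatMap chain, prefer MEMORY_AND_DISK so an
      // evicted partition is reread from disk, not recomputed from scratch.
      val derived = sc.parallelize(1 to 1000)
        .filter(_ % 2 == 0)
        .flatMap(n => Seq(n, n * 10))
        .setName("derived")
        .persist(StorageLevel.MEMORY_AND_DISK)

      derived.count() // materialize the cache
      val level = derived.getStorageLevel
      (level.useMemory, level.useDisk)
    } finally sc.stop()
  }
}
```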

Page 4: 5 Apache Spark Tips in 5 Minutes


#2: Use Spark's UI

• DAG Visualization
• Key Metrics
  • Data Read / Written
  • Shuffle Read / Write
  • Stragglers / Outliers
• Cache Utilization

Page 5: 5 Apache Spark Tips in 5 Minutes


#3: Debug Counters

• Use Sample Code
• Count Errors
• Sample Errors
• SparkListener to output updates
• https://gist.github.com/squito/2f7cc02c313e4c9e7df4

val parseErrors = ErrorTracker("parsing errors", sc)

val allParsed: RDD[T] = sc.textFile(inputFile).flatMap { line =>
  try {
    val r = Some(parser(line))
    parseErrors.localValue.ok()
    r
  } catch {
    case NonFatal(ex) =>
      parseErrors.localValue.error(line)
      None
  }
}

Page 6: 5 Apache Spark Tips in 5 Minutes


#4: Avoid Driver Bottlenecks

• rdd.collect()
  • GOOD: exploratory data analysis; merging a small set of results
  • BAD: sequentially scanning the entire data set on the driver: no parallelism, OOM on the driver (rdd.toLocalIterator is better, but still not good)
• rdd.reduce()
  • GOOD: summarizing the results from a small dataset
  • BAD: big data structures from lots of partitions
• sc.accumulator()
  • GOOD: small data types, e.g. counters
  • BAD: big data structures from lots of partitions, e.g. a set of a million "most interesting" user ids from each partition
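The GOOD column above can be illustrated with a small local-mode sketch (data and numbers invented): reduce the work down to one value on the executors, and when the driver really must see every record, pull one partition at a time with toLocalIterator instead of materializing everything with collect().

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DriverTip {
  def run(): (Long, Int) = {
    val sc = new SparkContext(
      new SparkConf().setAppName("tip4").setMaster("local[2]"))
    try {
      val nums = sc.parallelize(1L to 100L, numSlices = 4)

      // GOOD: compute a small summary in parallel; only one Long comes back.
      val total = nums.reduce(_ + _)

      // If the driver must scan everything, toLocalIterator fetches one
      // partition at a time, so only a partition (not the whole RDD) is
      // resident on the driver at once.
      val seen = nums.toLocalIterator.size

      (total, seen)
    } finally sc.stop()
  }
}
```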

Page 7: 5 Apache Spark Tips in 5 Minutes


#5: Dev Environment

• Try Scala!
  • Much simpler code
  • KISS
  • sbt: ~compile, ~test-quick
  • Template project with giter8
• Use spark-testing-base
  • Talk Wednesday by Holden K
• Run Spark locally
  • But try at scale periodically (you may hit bottlenecks)
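A build.sbt sketch of the setup above (the version strings are placeholders, not from the slides: pick the Spark version matching your cluster and the spark-testing-base release built against it):

```scala
// build.sbt sketch; versions below are placeholders.
name := "my-spark-job"
scalaVersion := "2.12.18"

libraryDependencies ++= Seq(
  // Provided: the cluster supplies Spark at runtime.
  "org.apache.spark" %% "spark-core" % "<your-spark-version>" % Provided,
  // Holden Karau's spark-testing-base; its version encodes the Spark version
  // it was built against.
  "com.holdenkarau" %% "spark-testing-base" % "<spark>_<stb-version>" % Test
)
```

With this in place, `sbt ~compile` and `sbt ~test-quick` recompile and rerun affected tests on every file save, which is what makes the iteration loop fast.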

Page 8: 5 Apache Spark Tips in 5 Minutes


#6: Code for Fast Iterations

• I write bugs
• You write bugs
• Spark has bugs
• Long pipelines should be restartable
  • BAD: a bug in stage 18 after 5 hours; rerun from scratch?
  • GOOD: write to stable storage (e.g., HDFS) periodically, restart from stage 17
• DiskCachedRDD
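One way to sketch the restart-from-stable-storage idea (DiskCachedRDD is not a stock Spark class, so this is an assumed minimal stand-in using saveAsObjectFile/objectFile and a made-up checkpoint path; real pipelines would write to HDFS):

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RestartableTip {
  // If this stage's output already exists on stable storage, reread it;
  // otherwise run the (expensive) computation and save it before returning.
  def stageOutput(sc: SparkContext, path: String): RDD[Int] = {
    if (new java.io.File(path).exists()) {
      sc.objectFile[Int](path) // restart: resume from the saved stage output
    } else {
      val computed = sc.parallelize(1 to 100).map(_ * 2) // stand-in pipeline
      computed.saveAsObjectFile(path)
      computed
    }
  }

  def run(): (Long, Long) = {
    val sc = new SparkContext(
      new SparkConf().setAppName("tip6").setMaster("local[2]"))
    try {
      val dir = Files.createTempDirectory("stage17").toString + "/out"
      val first  = stageOutput(sc, dir).count() // computes and saves
      val second = stageOutput(sc, dir).count() // rereads from disk
      (first, second)
    } finally sc.stop()
  }
}
```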

Page 9: 5 Apache Spark Tips in 5 Minutes


#7: Narrow Joins & HDFS

• Narrow joins
  • Much cheaper
  • Available anytime the RDDs share a Partitioner
• What about when reading from HDFS?
  • SPARK-1061
  • Read from HDFS
  • "Remember" that the data was written with a partitioner

(Diagram: wide join vs. narrow join)
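The shared-partitioner case above can be sketched as follows (a local-mode sketch with made-up key/value data): when both sides are pre-partitioned with the same Partitioner, the join is narrow and reuses that partitioner instead of shuffling both inputs.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object NarrowJoinTip {
  def run(): (Boolean, Seq[(Int, (String, String))]) = {
    val sc = new SparkContext(
      new SparkConf().setAppName("tip7").setMaster("local[2]"))
    try {
      val part = new HashPartitioner(4)
      // Pre-partition both sides with the SAME partitioner; the subsequent
      // join is then narrow: neither input is shuffled again.
      val left  = sc.parallelize(Seq(1 -> "a", 2 -> "b")).partitionBy(part)
      val right = sc.parallelize(Seq(1 -> "x", 2 -> "y")).partitionBy(part)

      val joined = left.join(right)
      // The join reuses the shared partitioner rather than creating a new one.
      (joined.partitioner.contains(part), joined.collect().toSeq.sortBy(_._1))
    } finally sc.stop()
  }
}
```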

Page 10: 5 Apache Spark Tips in 5 Minutes


Thank you