5 Apache Spark Tips in 5 Minutes

Uploaded by cloudera-inc, 16-Apr-2017.

Page 1: 5 Apache Spark Tips in 5 Minutes

© Cloudera, Inc. All rights reserved.

5 Spark Tips in 5 Minutes. Imran Rashid | Cloudera Engineer, Apache Spark PMC

Page 2: 5 Apache Spark Tips in 5 Minutes


#1: Name Cached RDDs and Accumulators

rdd.cache()
rdd.setName(…)

BAD: sc.accumulator(0L)
GOOD: sc.accumulator(0L, "my counter")
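The tip above can be sketched as a runnable snippet (a minimal local-mode sketch; the RDD contents and names are made up for illustration, and it uses the Spark 2.x+ accumulator API, sc.longAccumulator, where the slide's sc.accumulator(0L, "my counter") is the 1.x equivalent):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object NamedThings {
  def run(): (String, Long) = {
    val sc = new SparkContext(
      new SparkConf().setAppName("tip1").setMaster("local[2]"))
    try {
      // Name the cached RDD so it is identifiable on the UI's Storage tab,
      // rather than showing up as an anonymous "MapPartitionsRDD".
      val lines = sc.parallelize(Seq("a", "b", "c")).setName("raw-lines").cache()

      // Named accumulators are shown per-stage in the UI; unnamed ones are not.
      val counter = sc.longAccumulator("my counter")
      lines.foreach(_ => counter.add(1))

      (lines.name, counter.sum)
    } finally sc.stop()
  }
}
```

The names ("raw-lines", "my counter") are what you will see on the Storage and Stages pages of the web UI.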

Page 3: 5 Apache Spark Tips in 5 Minutes


#1b: MEMORY_AND_DISK

• BAD: rdd.cache()
  • If a partition is dropped, it is recomputed from scratch
• GOOD: rdd.persist(StorageLevel.MEMORY_AND_DISK)

(Diagram: Huge Raw Data → Filter → FlatMap → … → cache)
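A minimal sketch of the point above, with a made-up filter/flatMap chain standing in for the expensive derivation: with MEMORY_AND_DISK, partitions evicted from memory spill to local disk instead of being recomputed from the raw input.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistTip {
  def run(): (Boolean, Boolean) = {
    val sc = new SparkContext(
      new SparkConf().setAppName("tip1b").setMaster("local[2]"))
    try {
      // After an expensive filter/flatMap chain, prefer MEMORY_AND_DISK so an
      // evicted partition is reread from disk, not recomputed from scratch.
      val derived = sc.parallelize(1 to 1000)
        .filter(_ % 2 == 0)
        .flatMap(n => Seq(n, n * 10))
        .setName("derived")
        .persist(StorageLevel.MEMORY_AND_DISK)

      derived.count() // materialize the cache
      val level = derived.getStorageLevel
      (level.useMemory, level.useDisk)
    } finally sc.stop()
  }
}
```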

Page 4: 5 Apache Spark Tips in 5 Minutes


#2: Use Spark's UI

• DAG Visualization
• Key Metrics
  • Data Read / Written
  • Shuffle Read / Write
  • Stragglers / Outliers
• Cache Utilization

Page 5: 5 Apache Spark Tips in 5 Minutes


#3: Debug Counters

• Use Sample Code
• Count Errors
• Sample Errors
• SparkListener to output updates
• https://gist.github.com/squito/2f7cc02c313e4c9e7df4

val parseErrors = ErrorTracker("parsing errors", sc)

val allParsed: RDD[T] = sc.textFile(inputFile).flatMap { line =>
  try {
    val r = Some(parser(line))
    parseErrors.localValue.ok()
    r
  } catch {
    case NonFatal(ex) =>
      parseErrors.localValue.error(line)
      None
  }
}

Page 6: 5 Apache Spark Tips in 5 Minutes


#4: Avoid Driver Bottlenecks

• rdd.collect()
  • GOOD: exploratory data analysis; merging a small set of results
  • BAD: sequentially scanning the entire data set on the driver: no parallelism, OOM on the driver (rdd.toLocalIterator is better, but still not good)
• rdd.reduce()
  • GOOD: summarizing the results from a small dataset
  • BAD: big data structures from lots of partitions
• sc.accumulator()
  • GOOD: small data types, e.g. counters
  • BAD: big data structures from lots of partitions, e.g. a set of a million "most interesting" user ids from each partition
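The GOOD column above can be illustrated with a small local-mode sketch (data and numbers invented): reduce the work down to one value on the executors, and when the driver really must see every record, pull one partition at a time with toLocalIterator instead of materializing everything with collect().

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DriverTip {
  def run(): (Long, Int) = {
    val sc = new SparkContext(
      new SparkConf().setAppName("tip4").setMaster("local[2]"))
    try {
      val nums = sc.parallelize(1L to 100L, numSlices = 4)

      // GOOD: compute a small summary in parallel; only one Long comes back.
      val total = nums.reduce(_ + _)

      // If the driver must scan everything, toLocalIterator fetches one
      // partition at a time, so only a partition (not the whole RDD) is
      // resident on the driver at once.
      val seen = nums.toLocalIterator.size

      (total, seen)
    } finally sc.stop()
  }
}
```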

Page 7: 5 Apache Spark Tips in 5 Minutes


#5: Dev Environment

• Try Scala!
  • Much simpler code
  • KISS
  • sbt: ~compile, ~test-quick
  • Template project with giter8
• Use spark-testing-base
  • Talk Wednesday by Holden K
• Run Spark locally
  • But try at scale periodically (you may hit bottlenecks)
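A build.sbt sketch of the setup above (the version strings are placeholders, not from the slides: pick the Spark version matching your cluster and the spark-testing-base release built against it):

```scala
// build.sbt sketch; versions below are placeholders.
name := "my-spark-job"
scalaVersion := "2.12.18"

libraryDependencies ++= Seq(
  // Provided: the cluster supplies Spark at runtime.
  "org.apache.spark" %% "spark-core" % "<your-spark-version>" % Provided,
  // Holden Karau's spark-testing-base; its version encodes the Spark version
  // it was built against.
  "com.holdenkarau" %% "spark-testing-base" % "<spark>_<stb-version>" % Test
)
```

With this in place, `sbt ~compile` and `sbt ~test-quick` recompile and rerun affected tests on every file save, which is what makes the iteration loop fast.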

Page 8: 5 Apache Spark Tips in 5 Minutes


#6: Code for Fast Iterations

• I write bugs
• You write bugs
• Spark has bugs
• Long pipelines should be restartable
  • BAD: a bug in stage 18 after 5 hours; rerun from scratch?
  • GOOD: write to stable storage (e.g., HDFS) periodically, restart from stage 17
• DiskCachedRDD
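One way to sketch the restart-from-stable-storage idea (DiskCachedRDD is not a stock Spark class, so this is an assumed minimal stand-in using saveAsObjectFile/objectFile and a made-up checkpoint path; real pipelines would write to HDFS):

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RestartableTip {
  // If this stage's output already exists on stable storage, reread it;
  // otherwise run the (expensive) computation and save it before returning.
  def stageOutput(sc: SparkContext, path: String): RDD[Int] = {
    if (new java.io.File(path).exists()) {
      sc.objectFile[Int](path) // restart: resume from the saved stage output
    } else {
      val computed = sc.parallelize(1 to 100).map(_ * 2) // stand-in pipeline
      computed.saveAsObjectFile(path)
      computed
    }
  }

  def run(): (Long, Long) = {
    val sc = new SparkContext(
      new SparkConf().setAppName("tip6").setMaster("local[2]"))
    try {
      val dir = Files.createTempDirectory("stage17").toString + "/out"
      val first  = stageOutput(sc, dir).count() // computes and saves
      val second = stageOutput(sc, dir).count() // rereads from disk
      (first, second)
    } finally sc.stop()
  }
}
```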

Page 9: 5 Apache Spark Tips in 5 Minutes


#7: Narrow Joins & HDFS

• Narrow joins
  • Much cheaper
  • Available anytime the RDDs share a Partitioner
• What about when reading from HDFS?
  • SPARK-1061
  • Read from HDFS
  • "Remember" that the data was written with a partitioner

(Diagram: wide join vs. narrow join)
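The shared-partitioner case above can be sketched as follows (a local-mode sketch with made-up key/value data): when both sides are pre-partitioned with the same Partitioner, the join is narrow and reuses that partitioner instead of shuffling both inputs.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object NarrowJoinTip {
  def run(): (Boolean, Seq[(Int, (String, String))]) = {
    val sc = new SparkContext(
      new SparkConf().setAppName("tip7").setMaster("local[2]"))
    try {
      val part = new HashPartitioner(4)
      // Pre-partition both sides with the SAME partitioner; the subsequent
      // join is then narrow: neither input is shuffled again.
      val left  = sc.parallelize(Seq(1 -> "a", 2 -> "b")).partitionBy(part)
      val right = sc.parallelize(Seq(1 -> "x", 2 -> "y")).partitionBy(part)

      val joined = left.join(right)
      // The join reuses the shared partitioner rather than creating a new one.
      (joined.partitioner.contains(part), joined.collect().toSeq.sortBy(_._1))
    } finally sc.stop()
  }
}
```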

Page 10: 5 Apache Spark Tips in 5 Minutes


Thank you