Real-Life Apache Spark: Tips and Tricks from the Trenches
Noah Bieler
Wealthport AG
Zurich Spark Meetup, March 2016
#ZurichSparkUsers

Overview
• Spark Intro
• Spark Pitfalls: Joining, Persistence, Serialisation and more
• Pimp my Library Pattern
• User Defined Types
• Spark on AWS
• Spark & Cassandra
• Testing Spark

Spark and the MapReduce Model
Express your computations in terms of map (embarrassingly parallel) and reduce operations.
[Figure: Map and Reduce steps turning bread, tomato and cheese into a sandwich.]
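As a minimal sketch of the model, here is the same idea on plain Scala collections rather than on Spark. The object and method names are made up for illustration: map handles every element independently, reduce combines the results with an associative operation.

```scala
object MapReduceSketch {
  // map step: every word is processed independently of all the others
  def lengths(words: Seq[String]): Seq[Int] = words.map(_.length)

  // reduce step: combine the partial results with an associative operation
  // (reduce throws on an empty collection, so the input must be non-empty)
  def total(lengths: Seq[Int]): Int = lengths.reduce(_ + _)

  def main(args: Array[String]): Unit = {
    val words = Seq("bread", "tomato", "cheese")
    println(total(lengths(words)))
  }
}
```

Because the map step has no dependencies between elements, it can be spread over as many machines as there are chunks of data.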

Spark RDD
• RDDs (Resilient Distributed Datasets) are the abstraction Spark uses to model parallelism.
• They follow the MapReduce model (map, reduce, immutability).
• RDDs in the code are only instructions to compute something. What Spark actually does is not obvious (optimisations, predicate pushdown, etc.).
• Since the actual computation is “delayed”, you cannot use RDDs within RDDs.
In the code: rdd1 → map f → rdd2 → map g → rdd3 → map h → rdd4 → count → c: Long
In the VM: rdd1 → map f . g . h → rdd2 → count → c: Long
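The chain above could look roughly like this in code. This is only a sketch, assuming a local master; the object name and the three functions are placeholders standing in for f, g and h.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyPipeline {
  def pipelineCount(sc: SparkContext): Long = {
    val rdd1 = sc.parallelize(1 to 10)
    // Transformations only record what should be computed; no job runs here.
    val rdd2 = rdd1.map(_ + 1) // stands in for f
    val rdd3 = rdd2.map(_ * 2) // stands in for g
    val rdd4 = rdd3.map(_ - 3) // stands in for h
    // The lineage Spark has recorded so far
    println(rdd4.toDebugString)
    // count is an action: only now is a job scheduled and the maps are executed
    rdd4.count()
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("lazy-pipeline"))
    try println(pipelineCount(sc)) finally sc.stop()
  }
}
```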

RDD: Mental Model
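[Figure: on the driver, rdd1 → map f → rdd2 → action; on the nodes, every partition (Partition 1 … Partition N) applies map f to its part of rdd1 to produce its part of rdd2. Data gets from the driver to the nodes via parallelize.]
import org.apache.spark.rdd.RDD

object Main {
  def test = {
    val rdd: RDD[Int] = ...

    val a = 1 + 2 + 3 // happens on the driver

    rdd.map { i =>
      i + a // happens on the nodes
    }
  }
}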

RDDs vs. DataFrames (since 1.3) vs. Datasets (since 1.6)
RDDs are the most basic building block of Spark. They have a limited API but give full control and type safety.
DataFrames are RDDs of Rows (= Seq[Any], no type safety!) with a schema; basically like a table. They offer more methods but less control; for example, one cannot control the partitioning. They can also be queried with SQL statements.
The new Datasets (since Spark 1.6) are like RDDs (type safety) but with the (optimised) methods known from DataFrames (e.g. count).
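To make the difference concrete, here is the same small query written against all three APIs. This is a sketch only: it assumes the SparkSession entry point of Spark 2.x and later (which postdates this talk), and Person is a made-up record type.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type, used only for this illustration
case class Person(name: String, age: Int)

object ThreeApis {
  /** Names of everybody aged 30 or older, computed once per API. */
  def namesOver30(spark: SparkSession, people: Seq[Person]): (Seq[String], Seq[String], Seq[String]) = {
    import spark.implicits._

    // RDD: typed objects and plain Scala functions
    val viaRdd = spark.sparkContext.parallelize(people)
      .filter(_.age >= 30).map(_.name).collect().toSeq

    // DataFrame: untyped Rows with a schema, columns addressed by name
    val viaDataFrame = people.toDF()
      .filter($"age" >= 30).select("name").collect().map(_.getString(0)).toSeq

    // Dataset: typed objects again, but executed through Spark SQL
    val viaDataset = people.toDS()
      .filter((p: Person) => p.age >= 30).map((p: Person) => p.name).collect().toSeq

    (viaRdd, viaDataFrame, viaDataset)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("three-apis").getOrCreate()
    try println(namesOver30(spark, Seq(Person("Ada", 36), Person("Tim", 28)))) finally spark.stop()
  }
}
```

A typo in the DataFrame column name ("agee") only fails at runtime, while the same typo in the RDD or Dataset version does not compile.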

Spark Pitfalls: Join
join: RDD[(K, V)] with RDD[(K, W)]
[Figure: rdd1, rdd2 and the join result, all split into the same two partitions.]
Partition 1:
  rdd1: 1 -> “abc”, 2 -> “dfg”, …
  rdd2: 1 -> 3.142, 2 -> 2.718, …
  result: 1 -> (“abc”, 3.142), 2 -> (“dfg”, 2.718), …
Partition 2:
  rdd1: 3 -> “hij”, 4 -> “xzy”, …
  rdd2: 3 -> 1.618, 4 -> 8.314, …
  result: 3 -> (“hij”, 1.618), 4 -> (“xzy”, 8.314), …
No network traffic is needed between the partitions.
Before you join, make sure that the two RDDs are properly partitioned.
rdd.partitionBy(new HashPartitioner(4 * nodeCount))
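A minimal sketch of such a co-partitioned join, using the sample data from the figure. The object name and the partition count are arbitrary choices for illustration.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CoPartitionedJoin {
  def join(sc: SparkContext, partitions: Int): RDD[(Int, (String, Double))] = {
    // Both sides use the same partitioner, so equal keys end up in the same partition.
    val partitioner = new HashPartitioner(partitions)
    val rdd1 = sc.parallelize(Seq(1 -> "abc", 2 -> "dfg", 3 -> "hij", 4 -> "xzy"))
      .partitionBy(partitioner).persist() // persist, otherwise the shuffle is redone on reuse
    val rdd2 = sc.parallelize(Seq(1 -> 3.142, 2 -> 2.718, 3 -> 1.618, 4 -> 8.314))
      .partitionBy(partitioner).persist()
    // The join can reuse the existing partitioning of both inputs.
    rdd1.join(rdd2)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("co-partitioned-join"))
    try join(sc, 4).collect().foreach(println) finally sc.stop()
  }
}
```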

Spark Pitfalls: Join
Don’t use map on a partitioned PairRDD; use mapValues if possible. Otherwise the partitioning is destroyed.
map { case (k, v) => (k, v.size) }
  rdd1: 1 -> “abc”, 2 -> “dfgh”, …
  rdd2: 1 -> 3, 2 -> 4, …
  Spark cannot know whether the key was changed. → Partitioning is erased.
mapValues(_.size)
  rdd1: 1 -> “abc”, 2 -> “dfgh”, …
  rdd2: 1 -> 3, 2 -> 4, …
  Spark knows that the key was not changed. → Partitioning is kept.
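The effect can be checked directly on the partitioner field of the resulting RDDs. A small sketch, with a hypothetical object name and an arbitrary partition count:

```scala
import org.apache.spark.{HashPartitioner, Partitioner, SparkConf, SparkContext}

object PartitioningAfterMap {
  /** Returns the partitioner after map and after mapValues, in that order. */
  def partitioners(sc: SparkContext): (Option[Partitioner], Option[Partitioner]) = {
    val rdd1 = sc.parallelize(Seq(1 -> "abc", 2 -> "dfgh")).partitionBy(new HashPartitioner(4))
    val viaMap = rdd1.map { case (k, v) => (k, v.length) } // key might have changed
    val viaMapValues = rdd1.mapValues(_.length)            // key is guaranteed unchanged
    (viaMap.partitioner, viaMapValues.partitioner)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("partitioning-after-map"))
    try println(partitioners(sc)) finally sc.stop()
  }
}
```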

Spark Pitfalls: Joining a large and a small RDD
[Figure: the large rdd1 (1 -> “a”, 1 -> “b”, 2 -> “c”, 1 -> “c”, 3 -> “c”; not partitioned) is joined against the broadcast small RDD, sc.broadcast(rdd2.collect()) (1 -> 3.142, ...).]
When joining a large RDD with a small one, it might be better to broadcast the small one, especially if the RDDs would otherwise have to be partitioned first.
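One way to write such a broadcast join by hand is sketched below. The object name is made up, and the sketch assumes that the keys of the small RDD are unique, since collectAsMap keeps only one value per key.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object BroadcastJoin {
  def join(sc: SparkContext,
           large: RDD[(Int, String)],
           small: RDD[(Int, Double)]): RDD[(Int, (String, Double))] = {
    // Ship the small side to every executor once, as a lookup table.
    val lookup = sc.broadcast(small.collectAsMap())
    // Inner join: rows of the large side without a match are dropped.
    // The large RDD is neither shuffled nor repartitioned.
    large.flatMap { case (k, v) =>
      lookup.value.get(k).map(w => k -> (v, w)).toList
    }
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("broadcast-join"))
    try {
      val rdd1 = sc.parallelize(Seq(1 -> "a", 1 -> "b", 2 -> "c", 1 -> "c", 3 -> "c"))
      val rdd2 = sc.parallelize(Seq(1 -> 3.142, 2 -> 2.718))
      join(sc, rdd1, rdd2).collect().foreach(println)
    } finally sc.stop()
  }
}
```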

Spark Pitfalls: Persistence
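Persist DataFrames/RDDs that feed more than one branch of transformations.
[Figure: rdd1 → map f → rdd2 (.persist()); rdd2 → map g → rdd3 → …; rdd2 → map h → rdd4 → …]
import scala.util.Try
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object RDDs {
  /** Automatically persist and unpersist an RDD
    * before and after the calculation.
    * f should run its actions itself: the RDD is unpersisted as soon as f returns.
    */
  def withPersistedRDD[A, B](
    rdd: RDD[A],
    storageLevel: StorageLevel = StorageLevel.MEMORY_ONLY // assumed default
  )(f: RDD[A] => B): B = {
    val result = Try(f(rdd.persist(storageLevel)))
    rdd.unpersist() // also runs if f threw
    result.get      // rethrows the exception of f, if any
  }
}

import RDDs.withPersistedRDD

withPersistedRDD(rdd1.map(f)) { rdd2 =>
  val rdd3 = rdd2.map(g)
  val rdd4 = rdd2.map(h)
  /* ... */
  result
}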

Spark Pitfalls: Serialisation
import org.apache.spark.rdd.RDD

class Algorithm1(val primeNumber: Int) extends Serializable {
  def run(rdd: RDD[String]): RDD[Int] = {
    rdd.map { s =>
      s.foldLeft(0) { case (hash, c) => hash + primeNumber * c.toInt }
    }
  }

  val veryLargeTable = Seq(/* ... */)
}
You actually use this.primeNumber and therefore serialise the whole instance (including veryLargeTable).

class Algorithm2(val primeNumber: Int) {
  def run(rdd: RDD[String]): RDD[Int] = {
    val _primeNumber = primeNumber // local copy, so the closure does not need `this`
    rdd.map { s =>
      s.foldLeft(0) { case (hash, c) => hash + _primeNumber * c.toInt }
    }
  }

  val veryLargeTable = Seq(/* ... */)
}
A local copy of this.primeNumber avoids serialising the whole instance.
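The difference can be made visible without Spark by serialising the two kinds of closure with plain Java serialisation and comparing their sizes. This is only a sketch: Hasher, ClosureSize and the table size are invented for the illustration, and it relies on Scala function literals being serialisable.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

object ClosureSize {
  /** Number of bytes that Java serialisation needs for the given object. */
  def serialisedSize(obj: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    bytes.size()
  }
}

class Hasher(val primeNumber: Int) extends Serializable {
  // Stands in for veryLargeTable: 100000 Ints, i.e. about 400 kB when serialised
  val veryLargeTable: Array[Int] = Array.fill(100000)(0)

  // Reads this.primeNumber, so the closure drags the whole instance along
  def capturingThis: Char => Int = c => primeNumber * c.toInt

  // Reads a local copy, so the closure carries only a single Int
  def capturingLocal: Char => Int = {
    val p = primeNumber
    c => p * c.toInt
  }
}

object ClosureSizeDemo {
  def main(args: Array[String]): Unit = {
    val hasher = new Hasher(31)
    println(s"capturing this:  ${ClosureSize.serialisedSize(hasher.capturingThis)} bytes")
    println(s"capturing local: ${ClosureSize.serialisedSize(hasher.capturingLocal)} bytes")
  }
}
```

If Hasher did not extend Serializable, serialising capturingThis would fail with a NotSerializableException, while capturingLocal would still work.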

Spark Pitfalls: Serialisation
class Algorithm(val primeNumber: Int) extends Serializable {
  def hash(s: String) =
    s.foldLeft(0) { case (hash, c) => hash + primeNumber * c.toInt }

  def run(rdd: RDD[String]): RDD[Int] = {
    rdd.map { s => hash(s) }
  }

  val veryLargeTable = Seq(/* ... */)
}
You actually use this.hash and therefore serialise the whole instance (including veryLargeTable).

class Algorithm(val primeNumber: Int) {
  def hashFunction() = {
    val _primeNumber = primeNumber // local copy, captured instead of `this`
    (s: String) => s.foldLeft(0) { case (hash, c) => hash + _primeNumber * c.toInt }
  }

  def run(rdd: RDD[String]): RDD[Int] = {
    val hash = hashFunction()
    rdd.map { s => hash(s) }
  }

  val veryLargeTable = Seq(/* ... */)
}
A function factory for hash avoids serialising the whole instance.

Spark Pitfalls: MapLike not Serializable
object Main {
  val myMap = Map(1 -> "a", 2 -> "bc", 3 -> "def")
    .mapValues(_.size) // Produces MapLike, not Serializable
    .map(identity)     // Produces Map again

  val myOtherMap = /* ... */

  val totalSize = sc.parallelize(Seq(myMap, myOtherMap))
    .map(_.size)
    .reduce(_ + _) // Would fail without map(identity), SI-7005
}
After running mapValues on a Map, run map(identity) on it to avoid a NotSerializableException.

Spark Pitfalls: Avoid groupByKey followed by mapValues
The collection of values that groupByKey builds for each key can potentially be very large. Try to avoid it.
val rdd = sc.parallelize(Seq(
  "Hello", "World", "Bonjour", "Monde", "Guten Tag", "Welt"
))

// Collects all values of a key in one place before counting them
val histogram1 = rdd
  .map(_.size -> null)
  .groupByKey() // : RDD[(Int, Iterable[Null])]
  .mapValues(_.size)

// Adds up the counts per key without building a collection
val histogram2 = rdd
  .map(_.size -> 1)
  .reduceByKey(_ + _)

Spark Pitfalls: Rows and nulls
• Spark’s Row is nothing but a wrapper for Seq[Any]: no type safety!
• A Row will return null if there is no value present!
row.getAs[String](index) == null // no exception!
row(index) == null
• A Row can lose its schema.
A proper type hierarchy would not even define the function getAs(fieldName: String) for Rows!
dataFrame
  .map { row =>
    val newRow = Row.fromSeq(row.toSeq.updated(timeIndex, timeStamp))
    row.getAs[Int]("ID") -> newRow // Access element by field name
  }
  .map { case (id, row) =>
    id -> row.getAs[String]("First Name") // No schema!
  }

Pimp my Spark: The “Pimp my Library” Pattern
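import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.types.StructType

object RowImplicits {

  implicit class RowImplicit(row: Row) {

    // Returns a copy of the row with one value replaced; keeps the schema if there is one
    def updated[T](attributeId: AttributeId, value: T): Row = {
      val newRow = Row.fromSeq(row.toSeq.updated(row.fieldIndex(attributeId), value))
      Option(row.schema).map(newRow.withSchema).getOrElse(newRow)
    }

    def withSchema(schema: StructType): Row =
      new GenericRowWithSchema(row.toSeq.toArray, schema)

    // Wraps a possibly missing value in an Option instead of returning null
    def getStringOption(attributeIndex: Int): Option[String] = {
      if (row.isNullAt(attributeIndex)) None
      else Some(row.getString(attributeIndex))
    }
  }
}
Add functionality to any library you use.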
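The pattern itself does not depend on Spark. A minimal sketch on java.lang.String, with invented names (StringImplicits, nonBlankOption):

```scala
object StringImplicits {
  // The implicit class makes nonBlankOption available on every String in scope.
  implicit class RichText(val text: String) extends AnyVal {
    /** None for null, empty or whitespace-only strings, otherwise the trimmed text. */
    def nonBlankOption: Option[String] =
      Option(text).map(_.trim).filter(_.nonEmpty)
  }
}

object StringImplicitsDemo {
  import StringImplicits._

  def main(args: Array[String]): Unit = {
    println("  Zurich ".nonBlankOption) // the method looks as if String had defined it
    println("   ".nonBlankOption)
  }
}
```

The extension is only visible where StringImplicits._ is imported, so it cannot clash with code that does not ask for it.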

User Defined Types
Functional Programming stands on three pillars:
• Variables are immutable (no side effects)
• Functions are first class citizens (higher order functions)
• Algebraic datatypes (strongly typed)
A good type hierarchy ensures that each function has only valid input and sane output.
Thus, it is essential that Spark supports custom data types.
http://pt.slideshare.net/ScottWlaschin/fp-patterns-buildstufflt
// Only valid input: the type rules out a zero denominator
def div(nominator: Int, denominator: NonZeroInteger) =
  nominator / denominator.value

// Sane output: the invalid case is visible in the return type
def div(nominator: Int, denominator: Int) = denominator match {
  case 0 => None
  case _ => Some(nominator / denominator)
}
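The slides do not show how NonZeroInteger is defined. One possible definition, given here purely as an assumption, hides the constructor and hands out instances through a checking factory method (from is an invented name):

```scala
// Hypothetical definition: the private constructor makes `from` the only way in.
final class NonZeroInteger private (val value: Int)

object NonZeroInteger {
  /** Some(instance) for every Int except 0, None for 0. */
  def from(i: Int): Option[NonZeroInteger] =
    if (i == 0) None else Some(new NonZeroInteger(i))
}

object SafeDivision {
  // Cannot fail: a NonZeroInteger holding 0 cannot be constructed.
  def div(nominator: Int, denominator: NonZeroInteger): Int =
    nominator / denominator.value

  def main(args: Array[String]): Unit = {
    println(NonZeroInteger.from(4).map(div(12, _)))
    println(NonZeroInteger.from(0).map(div(12, _)))
  }
}
```

The check for zero happens once, where the value enters the system, instead of in every function that divides.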

User Defined Types
import java.util.UUID
import org.apache.spark.sql.types.SQLUserDefinedType

@SQLUserDefinedType(udt = classOf[EntityIdType])
case class EntityId(uuid: UUID) extends Serializable

object EntityId {
  def generate(): EntityId = EntityId(UUID.randomUUID())
}

case object EntityIdType extends EntityIdType
If you want to identify your rows with UUIDs, you need to use user defined types since Spark does not support UUIDs.

User Defined Types
import java.util.UUID
import org.apache.spark.sql.types.{DataType, StringType, UserDefinedType}
import org.apache.spark.unsafe.types.UTF8String

class EntityIdType private () extends UserDefinedType[EntityId] {
  // The UUID is stored as a string column
  override def sqlType: DataType = StringType

  override def serialize(obj: Any): UTF8String = obj match {
    case null        => null.asInstanceOf[UTF8String]
    case t: EntityId => UTF8String.fromString(t.uuid.toString)
    case _           => throw new IllegalArgumentException(/*...*/)
  }

  // Both representations have to be handled here
  override def deserialize(datum: Any): EntityId = datum match {
    case s: UTF8String => new EntityId(UUID.fromString(s.toString))
    case s: String     => new EntityId(UUID.fromString(s))
    case _             => throw new IllegalArgumentException(/*...*/)
  }

  override def userClass: Class[EntityId] = classOf[EntityId]
}
Sometimes Spark serialises using normal Strings, sometimes using UTF8Strings.

Running Spark in the Cloud (AWS)
Three cluster managers:
• Standalone
• Apache Mesos
• Hadoop’s YARN
Two possibilities:
• Create an EC2 instance and use the spark-ec2 scripts to manage the instances. Time-consuming, and not everything works out of the box; e.g. the encoding has to be set manually.
• Use Amazon EMR to have a managed environment. Pricier, and new releases arrive a bit later. Uses YARN.
Both methods let you access data on S3 (AWS storage).
$ cat spark/conf/spark-defaults.conf
spark.akka.frameSize 1000
spark.driver.memory 11g
spark.driver.extraJavaOptions -XX:+HeapDumpOnOutOfMemoryError
spark.executor.memory 55g
spark.executor.extraJavaOptions -XX:+HeapDumpOnOutOfMemoryError

Running Spark and Cassandra
Cassandra is a distributed NoSQL database technology optimised for fault tolerance. It was initially developed at Facebook and is now used world-wide (Twitter, Reddit, …).
We have just started to experiment with it and fixed some bugs with respect to user defined types and Scala 2.10 reflection.
We are using DataStax’s driver to connect Spark and Cassandra. It has recently added support for Spark 1.6 (previously 1.5).
We are using cassandra-unit to write our unit tests.

Testing Spark
import java.net.ServerSocket
import scala.collection.mutable
import com.datastax.driver.core.Session
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.scalatest.{BeforeAndAfterAll, BeforeAndAfterEach, FunSuite, Matchers}

abstract class TestBase extends FunSuite
  with BeforeAndAfterAll with BeforeAndAfterEach with Matchers {

  protected val sparkConfigProperties = mutable.Map[String, String]()
  protected implicit var sparkContext: SparkContext = _
  protected implicit var sqlContext: SQLContext = _
  protected implicit var cassandraSession: Session = _

  // Asks the OS for a free port; avoids port clashes with parallel tests running
  private def freePort(): String = {
    val socket = new ServerSocket(0)
    try socket.getLocalPort.toString finally socket.close()
  }

  override def beforeAll(): Unit = {
    System.clearProperty("spark.driver.port")
    System.clearProperty("spark.hostPort")

    val conf = new SparkConf()
      .setMaster("local")
      .set("spark.testing", "true")
      .set("spark.ui.enabled", "false")
      .set("spark.master.ui.port", freePort())
      .set("spark.worker.ui.port", freePort())
      .setAll(sparkConfigProperties)

    sparkContext = new SparkContext(conf)
    sqlContext = new SQLContext(sparkContext)
  }

  override def afterAll(): Unit = {
    sparkContext.stop()
    System.clearProperty("spark.driver.port")
    System.clearProperty("spark.hostPort")
  }
}

Testing Spark and Cassandra
import com.datastax.spark.connector.cql.CassandraConnector
import org.cassandraunit.CQLDataLoader
import org.cassandraunit.dataset.cql.ClassPathCQLDataSet
import org.cassandraunit.utils.EmbeddedCassandraServerHelper

class CassandraTest extends TestBase {
  sparkConfigProperties("spark.cassandra.connection.host") = "127.0.0.1"

  override def beforeAll(): Unit = {
    // Start the embedded server first, so that its port is known
    EmbeddedCassandraServerHelper.startEmbeddedCassandra("cassandra.yaml", 300000)

    sparkConfigProperties("spark.cassandra.connection.port") =
      EmbeddedCassandraServerHelper.getNativeTransportPort.toString

    // Creates the SparkContext with the properties set above
    super.beforeAll()

    cassandraSession = CassandraConnector(sparkContext.getConf).openSession()

    cassandraSession.execute("DROP KEYSPACE IF EXISTS test_keyspace")
    val dataLoader = new CQLDataLoader(cassandraSession)
    dataLoader.load(new ClassPathCQLDataSet("cassandra/create_schema.cql", true, "test_keyspace"))
  }

  override def afterAll(): Unit = {
    cassandraSession.close()
    EmbeddedCassandraServerHelper.cleanEmbeddedCassandra()
    super.afterAll()
  }
}

Wealthport AG, Rütistrasse 16, CH-8952 Schlieren, +41 43 508 50 96, www.wealthport.com
Getting your data back into shape.