lessons learned developing and managing massive (300tb+) apache spark pipelines in production with...

33
LESSONS LEARNED DEVELOPING AND MANAGING MASSIVE (300TB+) APACHE SPARK PIPELINES IN PRODUCTION Brandon Carl

Upload: spark-summit

Post on 21-Jan-2018

479 views

Category:

Data & Analytics


3 download

TRANSCRIPT

Page 1: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

LESSONS LEARNED DEVELOPING AND MANAGING MASSIVE (300TB+) APACHE SPARK PIPELINES IN PRODUCTIONBrandon Carl

Page 2: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl
Page 3: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

MARCH 15, 2016

"SEE THE MOMENTS YOU CARE ABOUT FIRST"

Page 4: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl
Page 5: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

MACHINE LEARNING

Page 6: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

MACHINE LEARNING LIFECYCLE

Training Examples

Machine Learning

ModelMake

PredictionsMeasure

Outcomes

Ranking Events

Client Events

Page 7: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

WHY SPARK?

• Performance • Testability • Modularity • Serialized Logging

Page 8: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

SERIALIZED LOGGING

{ "id": 123, "scores": { "modelA": 0.2345, "modelB": 0.0012 }, "features": { 1001: 0.9934, 1002: 0.1923 }}

Page 9: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

SERIALIZED LOGGING

struct Candidate { 1: i64 id; 2: map<string, double> scores; 3: map<i64, double> features;}

new Candidate() .setId(id) .setScores(scores) .setFeatures(features)

Page 10: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

CHANGES OVER TIME

Page 11: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

CHANGES OVER TIME

• RDD • Dataset • Training Data Joiner

Page 12: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

TRAINING DATA JOINER

class MyTrainingDataJoiner(spark: SparkSession) extends TrainingDataJoiner {val labels: Map[String, LabelFunction] = ???

}

case class Output(id: Long, label_value: Double)

Page 13: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

MANAGING MASSIVE SCALE

Page 14: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

MANAGING MASSIVE SCALE - PEOPLE

Page 15: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

AUTOMATE EVERYTHING

Page 16: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

SIMPLE INTERFACE

Page 17: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

SIMPLE INTERFACE

RankingEvent .read('input_table', '2017-10-25') .filter(...) .map(...) .write('output_table', '2017-10-25')

Page 18: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

MANAGING MASSIVE SCALE - DATA

Page 19: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

PLAN FOR GROWTH

Page 20: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

PERSIST TO HDFS

Page 21: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

PERSIST TO HDFS

Source Data Map/Filter

Join Output

Source Data Map/Filter

Page 22: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

PERSIST TO HDFS

Source Data Map/Filter Temporary

Table

Source Data Map/Filter Temporary

Table

Join Output

Page 23: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

KRYO SERIALIZATION

Page 24: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

KRYO SERIALIZATION

new SparkConf() .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") .set("spark.kryo.registrationRequired", "true") .registerKryoClasses(Array(classOf[...], ...))

Page 25: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

BIG-O MATTERS

Page 26: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

BIG-O MATTERS

final def withName(s: String): Value = values .find(_.toString == s) .getOrElse(throw new NoSuchElementException(...))

Page 27: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

BIG-O MATTERS

final def withName(s: String): Value = values .find(_.toString == s) .getOrElse(throw new NoSuchElementException(...))

Page 28: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

DATA STRUCTURES MATTER

Page 29: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

DATA STRUCTURES MATTER

• AnyRefMap • IntMap • LongMap • fastutil (http://fastutil.di.unimi.it)

Page 30: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

DATA SKEW MATTERS

Page 31: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

TEST ON SAMPLED DATA

Page 32: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl
Page 33: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl