lessons learned developing and managing massive (300tb+) apache spark pipelines in production with...

LESSONS LEARNED DEVELOPING AND MANAGING MASSIVE (300TB+) APACHE SPARK PIPELINES IN PRODUCTION Brandon Carl

Upload: spark-summit

Post on 21-Jan-2018

479 views

Category:

Data & Analytics

3 download

Report

Download

Embed Size (px):

TRANSCRIPT

Page 1: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

LESSONS LEARNED DEVELOPING AND MANAGING MASSIVE (300TB+) APACHE SPARK PIPELINES IN PRODUCTIONBrandon Carl

Page 2: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

Page 3: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

MARCH 15, 2016

"SEE THE MOMENTS YOU CARE ABOUT FIRST"

Page 4: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

Page 5: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

MACHINE LEARNING

Page 6: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

MACHINE LEARNING LIFECYCLE

Training Examples

Machine Learning

ModelMake

PredictionsMeasure

Outcomes

Ranking Events

Client Events

Page 7: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

WHY SPARK?

• Performance • Testability • Modularity • Serialized Logging

Page 8: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

SERIALIZED LOGGING

{ "id": 123, "scores": { "modelA": 0.2345, "modelB": 0.0012 }, "features": { 1001: 0.9934, 1002: 0.1923 }}

Page 9: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

SERIALIZED LOGGING

struct Candidate { 1: i64 id; 2: map<string, double> scores; 3: map<i64, double> features;}

new Candidate() .setId(id) .setScores(scores) .setFeatures(features)

Page 10: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

CHANGES OVER TIME

Page 11: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

CHANGES OVER TIME

• RDD • Dataset • Training Data Joiner

Page 12: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

TRAINING DATA JOINER

class MyTrainingDataJoiner(spark: SparkSession) extends TrainingDataJoiner {val labels: Map[String, LabelFunction] = ???

}

case class Output(id: Long, label_value: Double)

Page 13: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

MANAGING MASSIVE SCALE

Page 14: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

MANAGING MASSIVE SCALE - PEOPLE

Page 15: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

AUTOMATE EVERYTHING

Page 16: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

SIMPLE INTERFACE

Page 17: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

SIMPLE INTERFACE

RankingEvent .read('input_table', '2017-10-25') .filter(...) .map(...) .write('output_table', '2017-10-25')

Page 18: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

MANAGING MASSIVE SCALE - DATA

Page 19: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

PLAN FOR GROWTH

Page 20: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

PERSIST TO HDFS

Page 21: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

PERSIST TO HDFS

Source Data Map/Filter

Join Output

Source Data Map/Filter

Page 22: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

PERSIST TO HDFS

Source Data Map/Filter Temporary

Table

Source Data Map/Filter Temporary

Table

Join Output

Page 23: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

KRYO SERIALIZATION

Page 24: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

KRYO SERIALIZATION

new SparkConf() .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") .set("spark.kryo.registrationRequired", "true") .registerKryoClasses(Array(classOf[...], ...))

Page 25: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

BIG-O MATTERS

Page 26: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

BIG-O MATTERS

final def withName(s: String): Value = values .find(_.toString == s) .getOrElse(throw new NoSuchElementException(...))

Page 27: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

BIG-O MATTERS

final def withName(s: String): Value = values .find(_.toString == s) .getOrElse(throw new NoSuchElementException(...))

Page 28: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

DATA STRUCTURES MATTER

Page 29: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

DATA STRUCTURES MATTER

• AnyRefMap • IntMap • LongMap • fastutil (http://fastutil.di.unimi.it)

http://fastutil.di.unimi.it

Page 30: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

DATA SKEW MATTERS

Page 31: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

TEST ON SAMPLED DATA

Page 32: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

Page 33: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

David Pattersonanalysis software pipelines 2. Create massive, cheap, easy-to-use, privacy-protecting repository for cancer treatments showing tumor genomes over time, therapies, and

October 22, 2017 705 E. Brandon Blvd., Brandon, FL ... FINAL-1.pdf · E. Brandon Blvd., Brandon, FL October 22, 2017 1 Nativity Catholic Church 705 E. Brandon Blvd., Brandon, FL 33511

BRANDON AREA REALTORS · BRANDON AREA REALTORS® 312–10th Street • Brandon, MB R7A 4G1 • Phone 204-727-4672 CITY OF BRANDON MAP Leech Printing 226653 BRANDON AREA REALTORS

Gas Pipelines

Scalable Pipelines

Pipelines and Pipe Networks-I (Pipelines connecting two ...mimoza.marmara.edu.tr/~neslihan.semerci/ENVE204/L2.pdf · ENVE 204 Lecture -2 Pipelines and Pipe Networks-I (Pipelines connecting

Chapter 6 Pipelines. Pipelines Industry Overview Pipelines are limited in the markets they serve and very limited in the commodities they can haul. Pipelines

Transmission Pipelines

Offshore pipelines

Manitoba - ooohshinythings.com List Locations.pdf · Brandon Glenn Kotak 170049GK Brandon Kendra Ledoux 170276KL Brandon Michelle Meier 170242MM Brandon Krystal Morgan 170280KM Brandon

Oil and gas delivery to Europe - European Parliament · Oil and gas delivery to Europe ... Asia, Caspian Sea). ... which include interconnectors in addition to massive oil pipelines

Proposed Pipelines & Infrastructure Proposed Pipelines & Infrastructure

Gaming Control Device By: Brandon Gonzalez By: Brandon Gonzalez

Coal Institute Conference: Fundamentals Update · Financial. Reliable. Operating Pipelines Pipelines completed in 2014-2015 Pipelines under construction Awarded Pipelines CFE Planned

CHAPTER 5 Pipelines, risers and subsea systems og... · Pipelines, risers and subsea systems. 2 REVIEWED DOCUMENTS Pipelines. 3 For pipelines ISO (25.03.2008. 4 ... mitigation action

October 8, 2017 705 E. Brandon Blvd., Brandon, FL Nativity … FINAL.pdf · E. Brandon Blvd., Brandon, FL October 8, 2017 1 Nativity Catholic Church 705 E. Brandon Blvd., Brandon,

Commissioner District 1 Commissioner Brandon Johnson · Commissioner Brandon Johnson 629 604 7/13/2020. Commissioner Brandon Johnson

Pipelines 2

By: Brandon Head. A wildfire is a massive fire caused in the forest. Wildfires destroy huge amounts of land and can cause a large amounts of casualties

Massive Transfusion and Massive Transfusion Protocols · Massive Transfusion and Massive Transfusion Protocols Robert T. Russell, MD, MPH November 2019 Pediatric Trauma Society. Massive

Asynchronous Pipelines

LNG carriers called ﬂoa ng pipelines’ - arlis.org · April 2014 LNG carriers called ‘ﬂoa ng pipelines’ The story of LNG shipping is a tale of massive investment, sophisticated

Brandon Relocation Guide Reloc… · Brandon Relocation Guide 2012 1 Economic Development Brandon Bienvenidos a Brandon… un magnífico lugar para vivir, trabajar y criar una familia

Second Quarter 2017 Conference Call - TC PipeLinesTC PipeLines, LP Second Quarter 2017 Conference Call Brandon Anderson, President Janine Watson, VP and General Manager Nathan Brown,

ATG pipelines

Pipelines Presentation

aboutpipelinespstrust.org/wp...CEPA-Introductory-Presentation.pdf · CEPA member pipelines proposed pipelines non. member pipelines Pipelines Production: 14.7 billion cubic feet/day

Opinion, Page 4 Gun Violence Prevention Activist Rememberedconnectionarchives.com/PDF/2020/021920/Vienna.pdf · campaigning on massive data centers and fighting oil and gas pipelines

1 Clockless Logic: Dynamic Logic Pipelines (contd.) Drawbacks of Williams’ PS0 Pipelines Lookahead Pipelines

Analytical Pipelines

Brandon Valley High School January Newsletter...Brandon Valley High School January Newsletter Brandon Valley High School 301 South Splitrock Blvd. Brandon, SD 57005 Principal From

FIRST PARK BRANDON - images4.loopnet.com€¦ · 53 Brandon Centre South 54 Costco 55 Kohl’s Brandon 56 Lake Brandon Plaza 57 Lake Brandon Village 58 Lowe’s 59 Petsmart 60 Plaza

Brandon Sample Brandon Sample PLC

Delivery pipelines

Haunted places BY: Brandon ward BY: Brandon ward