spark ml pipeline serving

Spark Serving by Stepan Pushkarev CTO of Hydrosphere.io

Upload: stepan-pushkarev

Post on 22-Jan-2018


Category:

Data & Analytics



TRANSCRIPT

Page 1: Spark ML Pipeline serving

Spark Serving

by Stepan Pushkarev CTO of Hydrosphere.io

Page 2: Spark ML Pipeline serving

Spark Users here?

Page 3: Spark ML Pipeline serving

Data Scientists and Spark Users here?

Page 4: Spark ML Pipeline serving
Page 5: Spark ML Pipeline serving

Why do companies hire data scientists?

Page 6: Spark ML Pipeline serving

Why do companies hire data scientists?

To make products smarter.

Page 7: Spark ML Pipeline serving

What is a deliverable of data scientist and data engineer?

Page 8: Spark ML Pipeline serving

What is a deliverable of data scientist?

- Academic paper?
- ML model?
- R/Python script?
- Jupyter Notebook?
- BI dashboard?

Page 9: Spark ML Pipeline serving

Diagram labels: cluster, data, model, data scientist, ?, web app

Page 10: Spark ML Pipeline serving

val wordCounts = textFile
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey((a, b) => a + b)

[Diagram: five executors]

Page 11: Spark ML Pipeline serving

Machine Learning: training + serving

Page 12: Spark ML Pipeline serving

Training (Estimation) pipeline: preprocess, preprocess, train

Page 13: Spark ML Pipeline serving

tokenizer

Input (text, label):

apache spark                    1
hadoop mapreduce                0
spark machine learning          1

Output (words, label):

[apache, spark]                 1
[hadoop, mapreduce]             0
[spark, machine, learning]      1
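The tokenizer step above can be sketched in plain Scala, without Spark. This is a minimal stand-in, assuming the stage only lowercases the text and splits it on whitespace; the object and method names are hypothetical and not part of the talk.

```scala
// Minimal stand-in for the tokenizer stage (hypothetical names, no Spark needed).
object TokenizerSketch {
  // Lowercase the text, split on runs of whitespace, drop empty tokens.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq
}
```

For example, `TokenizerSketch.tokenize("spark machine learning")` yields the three words shown in the last row of the slide.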

Page 14: Spark ML Pipeline serving

hashing tf

Input (words, label):

[apache, spark]                         1
[hadoop, mapreduce]                     0
[spark, machine, learning]              1

Output (feature indices, feature values, label):

[105, 495], [1.0, 1.0]                  1
[6, 638, 655], [1.0, 1.0, 1.0]          0
[105, 72, 852], [1.0, 1.0, 1.0]         1
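The hashing step can be sketched as follows: each word is hashed into one of `numFeatures` buckets, and the count per bucket becomes a sparse feature vector. This is an illustration under assumptions, not Spark's implementation: it uses `scala.util.hashing.MurmurHash3` with its default seed, so the bucket indices will generally differ from the ones on the slide. All names are hypothetical.

```scala
import scala.util.hashing.MurmurHash3

// Sketch of the hashing trick (hypothetical names; indices will not match the slide).
object HashingTfSketch {
  // Map a term to a bucket in [0, numFeatures).
  def indexOf(term: String, numFeatures: Int): Int = {
    val raw = MurmurHash3.stringHash(term) % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }

  // Sparse term-frequency vector as (bucket index, count) pairs, sorted by index.
  def transform(words: Seq[String], numFeatures: Int = 1000): Seq[(Int, Double)] =
    words
      .groupBy(indexOf(_, numFeatures))
      .map { case (idx, ws) => idx -> ws.size.toDouble }
      .toSeq
      .sortBy(_._1)
}
```

Because the index depends only on the word and `numFeatures`, the same word lands in the same bucket at training time and at serving time, which is what lets a serving layer reproduce the features without any fitted vocabulary.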

Page 15: Spark ML Pipeline serving

logistic regression

Input (feature indices, feature values, label):

[105, 495], [1.0, 1.0]                  1
[6, 638, 655], [1.0, 1.0, 1.0]          0
[105, 72, 852], [1.0, 1.0, 1.0]         1

Model coefficients (row, feature index, value):

0    72     -2.7138781446090308
0    94      0.9042505436914775
0    105     3.0835670890496645
0    495     3.2071722417080766
0    722     0.9042505436914775
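Scoring a single row with these numbers needs only a dot product and a sigmoid. The sketch below assumes the three columns on the slide mean row, feature index, and weight, that the five entries shown are the only ones that matter for the example, and that the intercept is 0.0 (the slide does not show it). The names are hypothetical.

```scala
// Single-row logistic regression scoring (hypothetical names, no Spark needed).
object LogisticScoringSketch {
  // Weights copied from the slide: feature index -> coefficient.
  val coefficients: Map[Int, Double] = Map(
    72  -> -2.7138781446090308,
    94  -> 0.9042505436914775,
    105 -> 3.0835670890496645,
    495 -> 3.2071722417080766,
    722 -> 0.9042505436914775
  )

  // Assumption: the intercept is not shown on the slide; 0.0 is a placeholder.
  val intercept: Double = 0.0

  // Probability of label 1 for a sparse vector of (index, value) pairs.
  def probability(features: Seq[(Int, Double)]): Double = {
    val margin = intercept +
      features.map { case (i, x) => coefficients.getOrElse(i, 0.0) * x }.sum
    1.0 / (1.0 + math.exp(-margin))
  }
}
```

Under those assumptions, the first training row, indices 105 and 495 with values 1.0, scores well above 0.5, which agrees with its label of 1.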

Page 16: Spark ML Pipeline serving

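import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Stage 1: split the raw text into words.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")

// Stage 2: hash the words into a 1000-dimensional feature vector.
val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol(tokenizer.getOutputCol).setOutputCol("features")

// Stage 3: the classifier.
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.001)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// Fit all stages on the training DataFrame, then persist the fitted pipeline.
val model = pipeline.fit(training)
model.write.save("/tmp/spark-model")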

Page 17: Spark ML Pipeline serving

Prediction pipeline: preprocess, preprocess

Page 18: Spark ML Pipeline serving

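import org.apache.spark.ml.PipelineModel

// createDataFrame needs a Seq of Products, so each string is wrapped in Tuple1.
val test = spark.createDataFrame(Seq(
  Tuple1("spark hadoop"),
  Tuple1("hadoop learning")
)).toDF("text")

// Load the fitted pipeline saved by the training job and apply every stage.
val model = PipelineModel.load("/tmp/spark-model")

model.transform(test).collect()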

Page 19: Spark ML Pipeline serving

./bin/spark-submit …

Page 20: Spark ML Pipeline serving

Diagram labels: cluster, data, model, data scientist, ?, web app

Page 21: Spark ML Pipeline serving

Pipeline Serving - NOT Model Serving

A model-level API leads to code duplication and inconsistency at the pre-processing stages!

Diagram labels: Web App (Ruby/PHP: preprocess, check current user); User Logs; ML Pipeline: preprocess, train; Save; Score/serve model; Fraud Detection Model
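One way to picture the point of this slide: if the served unit is the whole pipeline, the web app sends raw input and never re-implements pre-processing. The sketch below is a toy in plain Scala; the stages, the weights, and every name in it are made up for illustration and are not the talk's implementation.

```scala
// Toy pipeline-level serving function (all names and weights are hypothetical).
object PipelineServingSketch {
  // Pre-processing stage 1: raw text -> words.
  val tokenize: String => Seq[String] =
    text => text.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq

  // Pre-processing stage 2: words -> term counts.
  val featurize: Seq[String] => Map[String, Double] =
    words => words.groupBy(identity).map { case (w, ws) => w -> ws.size.toDouble }

  // Made-up model weights, only to make the example executable.
  val weights: Map[String, Double] = Map("spark" -> 1.0, "hadoop" -> -1.0)

  // Model stage: term counts -> probability via a sigmoid.
  val score: Map[String, Double] => Double = feats => {
    val margin = feats.map { case (w, x) => weights.getOrElse(w, 0.0) * x }.sum
    1.0 / (1.0 + math.exp(-margin))
  }

  // The served unit is the composition, so callers pass raw text only.
  val servePipeline: String => Double = tokenize andThen featurize andThen score
}
```

A model-level API would expose only `score`, which forces the Ruby/PHP side to duplicate `tokenize` and `featurize` and keep them in sync with training.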

Page 22: Spark ML Pipeline serving

https://issues.apache.org/jira/browse/SPARK-16365

https://issues.apache.org/jira/browse/SPARK-13944

Page 23: Spark ML Pipeline serving
Page 24: Spark ML Pipeline serving

Diagram labels: cluster, data, model, data scientist, web app, PMML, PFA, MLeap

- Yet another Format Lock

- Code & state duplication

- Limited extensibility

- Inconsistency

- Extra moving parts

Page 25: Spark ML Pipeline serving

Diagram labels: cluster, data, model, data scientist, web app, docker (model, libs, deps)

- Fat, all-inclusive Docker image: bad practice

- Every new model requires the Docker image to be rebuilt

Page 26: Spark ML Pipeline serving

Diagram labels: cluster, data, model, data scientist, web app, API, API

- Needs Spark Running

- High latency, low throughput

Page 27: Spark ML Pipeline serving

Diagram labels: cluster, data, model, data scientist, web app, API, serving, API

+ Serving skips Spark

+ But re-uses ML algorithms

+ No new formats and APIs

+ Low Latency but not super tuned

+ Scalable

+ Stateless

Page 28: Spark ML Pipeline serving

Low level API Challenge

MS Azure

Page 29: Spark ML Pipeline serving

A deliverable for ML model

Single row Serving / Scoring layer

xml, json, parquet, pojo, other

Monitoring, testing, integration

Large-scale batch processing engine

Page 30: Spark ML Pipeline serving

Zooming out

Unified Serving/Scoring API

Repository

MLlib model, TensorFlow model, other model
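The "zoomed out" picture can be sketched as one interface in front of a repository of models from different frameworks. Everything below is hypothetical: the trait, the repository class, and the stub model are illustrations of the idea, not an API from the talk or from any library.

```scala
// Hypothetical unified serving interface with a model repository behind it.
object UnifiedServingSketch {
  // Common contract, regardless of which framework produced the model.
  trait ServableModel {
    def name: String
    def predict(input: Map[String, Any]): Map[String, Any]
  }

  // Stub standing in for a real backend (MLlib, TensorFlow, other).
  object TextLength extends ServableModel {
    val name = "text-length"
    def predict(input: Map[String, Any]): Map[String, Any] =
      Map("length" -> input.get("text").map(_.toString.length).getOrElse(0))
  }

  // Repository that looks a model up by name and scores a single row.
  final class Repository(models: Seq[ServableModel]) {
    private val byName = models.map(m => m.name -> m).toMap

    def serve(modelName: String, input: Map[String, Any]): Either[String, Map[String, Any]] =
      byName.get(modelName).map(_.predict(input)).toRight(s"unknown model: $modelName")
  }
}
```

The web app would then depend only on `serve`, so swapping an MLlib pipeline for a TensorFlow model changes the repository contents, not the caller.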

Page 31: Spark ML Pipeline serving

Real-time Prediction Pipelines

Page 32: Spark ML Pipeline serving

Starting from scratch - SystemML

Multiple execution modes, including Spark MLContext API, Spark Batch, Hadoop Batch, Standalone, and JMLC.

Page 33: Spark ML Pipeline serving

Demo Time

Page 34: Spark ML Pipeline serving

Thank you

Looking for

- Feedback

- Advisors, mentors & partners

- Pilots and early adopters

Stay in touch

- @hydrospheredata

- https://github.com/Hydrospheredata

- http://hydrosphere.io/
