spark ml pipeline serving

Post on 22-Jan-2018

Category: Data & Analytics

TRANSCRIPT

Spark Serving

by Stepan Pushkarev, CTO of Hydrosphere.io

Spark Users here?

Data Scientists and Spark Users here?

Why do companies hire data scientists?

To make products smarter.

What is the deliverable of a data scientist and a data engineer?

What is the deliverable of a data scientist?

An academic paper?

An ML model?

An R/Python script?

A Jupyter notebook?

A BI dashboard?

[Diagram: the data scientist builds the data and model on a cluster - how does it reach the web app?]

val wordCounts = textFile
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey((a, b) => a + b)

[Diagram: the driver distributes the computation across Spark executors]

Machine Learning: training + serving

Training (Estimation) pipeline

[Diagram: a pipeline of preprocess, preprocess, and train stages]

tokenizer

apache spark 1 -> [apache, spark] 1
hadoop mapreduce 0 -> [hadoop, mapreduce] 0
spark machine learning 1 -> [spark, machine, learning] 1
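The tokenizer step above can be sketched in plain Scala - a minimal stand-in for Spark's `ml.feature.Tokenizer`, which lowercases the text and splits it on whitespace:

```scala
// Minimal sketch of what the Tokenizer stage does to each "text" cell:
// lowercase the string and split it on whitespace, mirroring the
// behavior of org.apache.spark.ml.feature.Tokenizer.
object SimpleTokenizer {
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq
}
```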

hashing tf

[apache, spark] 1 -> [105, 495], [1.0, 1.0] 1
[hadoop, mapreduce] 0 -> [6, 638, 655], [1.0, 1.0, 1.0] 0
[spark, machine, learning] 1 -> [105, 72, 852], [1.0, 1.0, 1.0] 1
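The hashing tf step hashes each token into a bucket and counts hits per bucket, producing the sparse vectors above. A dependency-free sketch of that hashing trick (Spark's HashingTF uses its own hash function internally, so the indices produced here won't match the slide's numbers):

```scala
// Sketch of the hashing trick behind HashingTF: hash each token into
// one of numFeatures buckets and build a sparse vector of counts.
// String.hashCode is used only to keep the sketch self-contained;
// Spark's HashingTF uses its own hash function internally.
object SimpleHashingTF {
  def termIndex(term: String, numFeatures: Int): Int =
    ((term.hashCode % numFeatures) + numFeatures) % numFeatures

  // Returns the sparse vector as a map of bucket index -> term count.
  def transform(tokens: Seq[String], numFeatures: Int): Map[Int, Double] =
    tokens.groupBy(termIndex(_, numFeatures))
          .map { case (idx, ts) => idx -> ts.size.toDouble }
}
```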

logistic regression

inputs (features, label):

[105, 495], [1.0, 1.0] 1
[6, 638, 655], [1.0, 1.0, 1.0] 0
[105, 72, 852], [1.0, 1.0, 1.0] 1

learned coefficients (feature index -> weight):

0 72 -2.7138781446090308
0 94 0.9042505436914775
0 105 3.0835670890496645
0 495 3.2071722417080766
0 722 0.9042505436914775
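Once the coefficients above are learned, scoring a new document is just a dot product plus a sigmoid. A hedged sketch using the slide's coefficient values (the intercept is an assumed placeholder - the slide doesn't show one):

```scala
// Sketch: turning a hashed sparse feature vector into a probability
// with the logistic function. Coefficient values are copied from the
// slide above; the intercept defaults to an assumed placeholder of 0.0.
object SimpleLrScorer {
  val coefficients: Map[Int, Double] = Map(
    72  -> -2.7138781446090308,
    94  ->  0.9042505436914775,
    105 ->  3.0835670890496645,
    495 ->  3.2071722417080766,
    722 ->  0.9042505436914775
  )

  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // features: sparse vector as bucket index -> value
  def probability(features: Map[Int, Double], intercept: Double = 0.0): Double = {
    val z = features.map { case (i, v) => coefficients.getOrElse(i, 0.0) * v }.sum + intercept
    sigmoid(z)
  }
}
```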

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol(tokenizer.getOutputCol).setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.001)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)
model.write.save("/tmp/spark-model")

Prediction pipeline

[Diagram: the same preprocess stages are replayed at prediction time]

val test = spark.createDataFrame(Seq(
  Tuple1("spark hadoop"),
  Tuple1("hadoop learning")
)).toDF("text")

val model = PipelineModel.load("/tmp/spark-model")
model.transform(test).collect()

./bin/spark-submit …

[Diagram repeated: the data scientist's cluster, data, and model are still disconnected from the web app]

Pipeline Serving - NOT Model Serving

A model-level API leads to code duplication & inconsistency at the pre-processing stages!

[Diagram: a fraud-detection example. The web app (Ruby/PHP) has its own preprocess step and checks the current user; the ML pipeline separately preprocesses the user logs, trains, and saves the fraud detection model, which is then scored/served.]

https://issues.apache.org/jira/browse/SPARK-16365

https://issues.apache.org/jira/browse/SPARK-13944

[Diagram: exporting the model from the cluster to the web app via an exchange format]

PMML / PFA / MLeap

- Yet another format lock-in
- Code & state duplication
- Limited extensibility
- Inconsistency
- Extra moving parts

[Diagram: packaging the model, libraries, and dependencies into a Docker container shipped to the web app]

- A fat, all-inclusive Docker image is bad practice
- Every new model requires the Docker image to be rebuilt

[Diagram: the web app calls a prediction API backed by the Spark cluster]

- Needs Spark Running

- High latency, low throughput

[Diagram: a dedicated serving API sits between the web app and the cluster's data and model]

+ Serving skips Spark

+ But re-uses ML algorithms

+ No new formats and APIs

+ Low latency, though not heavily tuned yet

+ Scalable

+ Stateless

Low-level API challenge (MS Azure)

A deliverable for an ML model:

- Single-row serving / scoring layer
- xml, json, parquet, pojo, other
- Monitoring, testing, integration
- Large-scale batch processing engine

Zooming out

Unified Serving/Scoring API

Repository

MLlib model / TensorFlow model / Other model

Real-time Prediction Pipelines

Starting from scratch - Apache SystemML

Multiple execution modes, including Spark MLContext API, Spark Batch, Hadoop Batch, Standalone, and JMLC.

Demo Time

Thank you

Looking for

- Feedback

- Advisors, mentors & partners

- Pilots and early adopters

Stay in touch

- @hydrospheredata

- https://github.com/Hydrospheredata

- http://hydrosphere.io/

- spushkarev@hydrosphere.io
