![Page 1: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/1.jpg)
What is a Distributed Data Science Pipeline
How with Apache Spark and Friends.
by @DataFellas, @Noootsab, 23rd Nov. ‘15 @YaJUG
![Page 2: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/2.jpg)
Outline
● (Legacy) Data Science Pipeline/Product
● What changed since then
● Distributed Data Science (today)
● Challenges
● Going beyond (productivity)
![Page 3: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/3.jpg)
Data Fellas
6-month-old Belgian startup
Andy Petrella (@noootsab)
● Maths
● Geospatial
● Distributed Computing
● @SparkNotebook
● Spark/Scala trainer
● Machine Learning
Xavier Tordoir (@xtordoir)
● Physics
● Bioinformatics
● Distributed Computing
● Scala (& Perl)
● Spark trainer
● Machine Learning
![Page 4: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/4.jpg)
(Legacy) Data Science Pipeline
Or, so-called, Data Product
Sampling → Modelling → Tuning → Report → Interpret
● Static results
● Lots of information lost in translation
● Sounds like Waterfall
● ETL look and feel
![Page 5: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/5.jpg)
Mono machine!
CPU bounds
Memory bounds
Or resampling because small-ish data
![Page 6: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/6.jpg)
Our world Today
No, it wasn’t better before
Facts
● Data gets bigger or, more precisely, the number of available sources explodes
● Data gets faster (and faster): just consider watching Netflix on 4G
![Page 7: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/7.jpg)
Consequences
Sampling ⇒ HARD (or will be too big...)
Report ⇒ Ephemeral, Restricted View
![Page 8: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/8.jpg)
Interpretation ⇒ Too SLOW to get real ROI out of the overall system
How to work around that?
![Page 9: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/9.jpg)
Needs are
● Alerting system over descriptive charts
● More accurate results
● More or harder models (e.g. Deep Learning)
● More data
● Constant data flow
● Online interactions under control (e.g. direct feedback)
![Page 10: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/10.jpg)
So, we need...
Distributed Systems
![Page 11: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/11.jpg)
Distributed Data Science
System/Platform/SDK/Pipeline/Product/… whatever you call it
“Create” Cluster
Find available sources (context, content, quality, semantics, …)
Connect to sources (structure, schema/types, …)
Create distributed data pipeline/Model
Tune accuracy
Tune performance
Write results to Sinks
Access Layer
User Access
![Page 12: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/12.jpg)
![Page 13: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/13.jpg)
![Page 14: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/14.jpg)
![Page 15: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/15.jpg)
![Page 16: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/16.jpg)
YO! Aren’t we talking about “Big” Data? Fast Data?
So could (all) results really be neither big nor fast?
Actually, results are themselves becoming “Big” Data! Fast Data!
![Page 17: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/17.jpg)
How have we accessed data since the 90s? Remember SOA? → SERVICES!
Nowadays, we’re talking about microservices.
Here we are: one service for one result.
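The "one service for one result" idea can be sketched with nothing but the Python standard library. This is only an illustration, not how Shar3 generates its services: the `RESULT` payload, the `/result` path and the port are all invented for the example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical result produced by a pipeline run; the values are made up.
RESULT = {"name": "variant-counts", "rows": [{"chromosome": "1", "count": 42}]}


def render_result(path):
    """Map a request path to (status, JSON body) for the single result served."""
    if path == "/result":
        return 200, json.dumps(RESULT)
    return 404, json.dumps({"error": "unknown path"})


class ResultHandler(BaseHTTPRequestHandler):
    """HTTP handler exposing exactly one result and nothing else."""

    def do_GET(self):
        status, body = render_result(self.path)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Blocking call: serve the result until the process is interrupted."""
    HTTPServer(("0.0.0.0", port), ResultHandler).serve_forever()
```

Calling `serve()` starts the service; a GET on `/result` then returns the JSON payload, and any other path returns a 404.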
![Page 18: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/18.jpg)
C’mon, charts/tables cannot be the only views offered to customers/clients, right?
We need to open the capabilities to UIs (dashboards), connectors (third parties), other services (“SOA”) … OTHER pipelines!!!
![Page 19: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/19.jpg)
What about Productivity?
Streamlining the development lifecycle is most welcome
“Create” Cluster
Find available sources (context, content, quality, semantics, …)
Connect to sources (structure, schema/types, …)
Create distributed data pipeline/Model
Tune accuracy
Tune performance
Write results to Sinks
Access Layer
User Access
[Figure: each step is tagged with the roles involved. Tags as extracted, in order: ops; data; ops, data; sci; sci, ops; sci; ops, data; web, ops, data; web, ops, data, sci]
![Page 20: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/20.jpg)
![Page 21: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/21.jpg)
![Page 22: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/22.jpg)
So, how to have:
● results coming fast enough whilst keeping the accuracy level high?
● responsiveness to external/unpredictable events?
WHEN...
➔ Longer production line
➔ More constraints (resource sharing, time, …)
➔ More people
➔ More skills
Overlook these points and you’ll be kicked, soon or sooner.
![Page 23: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/23.jpg)
Warning
Team Fight: seen by members
![Page 24: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/24.jpg)
Team Fight: seen by managers
![Page 25: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/25.jpg)
Team Fight: seen by employers
![Page 26: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/26.jpg)
Team Fight: seen by customers
![Page 27: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/27.jpg)
At Data Fellas, we think that we need Interactivity and Reactivity to tighten the frontiers (within the team and in time).
Hence, Data Fellas
● extends the Spark Notebook (Interactivity)
● builds the Shar3 product around it (Integrated Reactivity)
![Page 28: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/28.jpg)
Concepts of Data Fellas’ Shar3
Shareable and Streamlined Data Science
[Diagram labels: Analysis, Production, Distribution, Rendering, Discovery, Catalog, Project Generator, Micro Service / Binary format, Schema for output, Metadata]
![Page 29: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/29.jpg)
Using Shar3
yeah \o/
Let’s take this example where some buddies from
● Datastax: Joel Jacobson (@joeljacobson), Simon Ambridge (@stratman1958)
● Mesosphere: Michael Hausenblas (@mhausenblas)
● Typesafe: Iulian Dragos (@jaguarul)
● Data Fellas: Xavier Tordoir (@xtordoir)
(and me)
![Page 30: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/30.jpg)
![Page 31: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/31.jpg)
![Page 32: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/32.jpg)
![Page 33: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/33.jpg)
![Page 34: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/34.jpg)
![Page 35: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/35.jpg)
What do we need to do now?
● Deploy
● Connect the dots
● Track
● Scale
Both the jobs and the services
![Page 36: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/36.jpg)
From notebook
● to SBT project
● to Docker
● to Marathon
[Diagram labels: SNB, SBT/JAR, Docker, Marathon]
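The last hop of that chain, handing a Docker image to Marathon, comes down to an app definition in JSON. Below is a minimal sketch of building one; the app id and image name are invented, and the field layout (`container.docker.network` in particular) follows Marathon app definitions of that era, which newer Marathon versions express differently. It is not the definition Shar3 produces.

```python
import json


def marathon_app(app_id, image, cpus=1.0, mem=1024, instances=1):
    """Build a Marathon app definition that runs one Docker image."""
    return {
        "id": app_id,
        "cpus": cpus,            # CPU shares per instance
        "mem": mem,              # memory per instance, in MB
        "instances": instances,  # number of instances to keep running
        "container": {
            "type": "DOCKER",
            "docker": {"image": image, "network": "BRIDGE"},
        },
    }


# Hypothetical app id and image name, for illustration only.
app = marathon_app("/pipelines/variant-analysis", "example/variant-analysis:0.1.0")
print(json.dumps(app, indent=2))
```

The resulting JSON is what would typically be POSTed to Marathon's `/v2/apps` endpoint; scaling then means changing `instances`.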
![Page 37: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/37.jpg)
From Notebook
● to output
● to Avro
[Diagram label: SNB]
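Going from an output to Avro means describing each output row with an Avro record schema. As a rough sketch (not Shar3's actual generator), such a schema can be derived from one sample row; the record name, namespace and field names below are made up, and only flat rows with primitive values are handled.

```python
import json

# Mapping from Python types to Avro primitive type names.
AVRO_TYPES = {str: "string", int: "long", float: "double", bool: "boolean"}


def avro_schema(name, namespace, sample):
    """Derive an Avro record schema from one sample output row (a flat dict)."""
    fields = [
        {"name": key, "type": AVRO_TYPES[type(value)]}
        for key, value in sample.items()
    ]
    return {"type": "record", "name": name, "namespace": namespace, "fields": fields}


# Hypothetical output row; the field names are invented for illustration.
row = {"chromosome": "1", "position": 12345, "frequency": 0.25, "rare": False}
schema = avro_schema("VariantFrequency", "org.example.pipeline", row)
print(json.dumps(schema, indent=2))
```

Having the schema as plain JSON is what lets other tools (services, dashboards, other pipelines) read the output without seeing the notebook code.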
![Page 38: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/38.jpg)
From Notebook
● to Avro
● to service
● to SBT
● to Docker
● to Marathon
[Diagram labels: SNB, SBT/JAR, Docker, Marathon]
![Page 39: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/39.jpg)
From Notebook
● to Avro
● to Tableau
● or QlikView
● or D3.JS
● or …
[Diagram label: SNB]
![Page 40: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/40.jpg)
So we have this information available:
● notebook’s markdown text
● notebook’s code/model
● data sources
● output/sinks
● output/services
● Avro schema
Shouldn’t they all be reused???
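One way to picture that reuse: gather the pieces listed above into a single metadata record per notebook and make the records searchable. The sketch below is purely hypothetical (the `NotebookMetadata` class, the `search` function and every name and URI in the sample entry are invented) and says nothing about how Shar3 stores its catalog.

```python
from dataclasses import dataclass, field


@dataclass
class NotebookMetadata:
    """Everything a notebook run leaves behind, gathered in one record."""
    title: str
    description: str                                   # from the markdown text
    sources: list = field(default_factory=list)        # data sources read
    sinks: list = field(default_factory=list)          # where outputs are written
    services: list = field(default_factory=list)       # services exposing outputs
    avro_schema: dict = field(default_factory=dict)    # schema of the output


def search(catalog, keyword):
    """Return entries whose text, sources or output fields mention the keyword."""
    needle = keyword.lower()
    hits = []
    for entry in catalog:
        field_names = [f["name"] for f in entry.avro_schema.get("fields", [])]
        haystack = [entry.title, entry.description, *entry.sources, *field_names]
        if any(needle in text.lower() for text in haystack):
            hits.append(entry)
    return hits


# Hypothetical catalog entry; all names and URIs are invented.
catalog = [
    NotebookMetadata(
        title="Variant analysis",
        description="Allele frequencies per population",
        sources=["s3://example-bucket/genomes"],
        sinks=["cassandra://example/variants"],
        services=["/variants/frequency"],
        avro_schema={"type": "record", "name": "Variant",
                     "fields": [{"name": "frequency", "type": "double"}]},
    )
]
print([entry.title for entry in search(catalog, "frequency")])
```

Searching on an output field name, as here, is what makes it possible to discover an existing analysis and its service before writing a new one.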
![Page 41: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/41.jpg)
Variant Analysis
![Page 42: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/42.jpg)
There is a service!
![Page 43: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/43.jpg)
Let’s use it…
![Page 44: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/44.jpg)
What was the process?
![Page 45: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/45.jpg)
Fine, and the output is in C* (Cassandra)
![Page 46: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/46.jpg)
Let’s check what’s in there
![Page 47: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/47.jpg)
Not what I need, let’s ADAPT
![Page 48: What is a distributed data science pipeline. how with apache spark and friends](https://reader033.vdocuments.site/reader033/viewer/2022050614/588b0c551a28abdf3b8b6095/html5/thumbnails/48.jpg)
Poke us on
@DataFellas
@Shar3_Fellas
@SparkNotebook
@Xtordoir & @Noootsab
Now @TypeSafe: http://t.co/o1Bt6dQtgH
If you wanna learn more about the different tools… Join us @ O’Reilly
Follow up Soon on http://NoETL.org (HI5 to @ChiefScientist for that)
That’s all folks
Thanks for listening/staying