Distributed Machine Learning 101 using Apache Spark from a Browser (devoxx.be 2015)
TRANSCRIPT
Distributed Machine Learning using Apache Spark from the Browser
Devoxx Belgium 2015, Antwerpen
Outline
● Distributed computing
● What is Machine Learning?
● Spark for machine learning?
● Spark MLlib by examples
● Spark and other libraries
● Wrap up
Data Fellas
Andy Petrella: Maths, Geospatial, Distributed Computing, Spark Notebook, Trainer Spark/Scala, Machine Learning
Xavier Tordoir: Physics, Bioinformatics, Distributed Computing, Scala (& Perl), Trainer Spark, Machine Learning
Distributed Computing: Why you must care, by Data Fellas
Andy Petrella & Xavier Tordoir
Traditionally, tasks are entirely performed on a single computer using three main resources.
Uba ga!
Computing
Processing Power, Memory, Storage
Computing
Oh no!
Hence performance is limited in time and space
Processing Power, Memory, Storage
TIME, SPACE
Distributed computing: [...] A distributed system is a software system in which components located on networked computers communicate and coordinate their actions by passing messages.
The components interact with each other in order to achieve a common goal. [...].
Ref: https://en.wikipedia.org/wiki/Distributed_computing
Distributing
Interesting
Consequences
Oh no!
Algorithms have to work on DATA Partitions and with partial results
The entire dataset cannot be accessed at once
New resource!
Damned
Processing Power, Memory, Storage, Network
TIME, SPACE
The network will impact performance...
Oops did it again
Distributing
[Diagram: four nodes, each with its own processing, memory and storage, connected by a network]
Drawback: Partition
Huh?
[Diagram: the same four-node cluster connected by a network]
Drawback: Partition
Hey, you sank my node!
[Diagram: the cluster with one of its four nodes knocked out: BOOM]
Ouch, my rack
Advantage: Elastic scaling
[Diagram: a four-node cluster, each node with processing, memory and storage, connected by a network]
What if this cluster happens to not be big enough?
That’s more reasonable
Advantage: Elastic scaling
[Diagram: two four-node clusters joined by a network]
What about HPC?
HPC: computationally intensive applications
Model: specialized hardware (CPU/GPU) and network
The machines are orchestrated by a scheduler that gathers their computing power and memory.
Yeah! What about it?
What about HPC?
Drawbacks:
● Costs and upgrades by large blocks
● Decoupled storage
Storage latency = no streaming / no iteration
Got no money and no time
Why process data if not to model?
Machine learning: iterative (streaming & batch)
Data is aggregated in the form of a model (parameters)
The data changes little, the model is small
Do that baby!
Iterate
Iterate
you gotta be kidding
[Diagram: four processing/memory nodes reading from separate storage]
Moving lots of data again and again...
Summary
Distributed computing allows cost-effective parallelism
Efficiency requires distributed storage, colocated with the processing units
What about programming models?
Interesting
Distributed storage
Partitions!
HDFS: Apache implementation of Google FS
● Natural fit for distributed storage
● Works as a service
Other chunked sources...
● Apache Cassandra, S3, Tachyon,...
Distributed storage
Split da
put /data/f256.txt (256 MB), replication factor 2
[Diagram: a Name Node and Data Nodes 1 to 4]
Distributed storage
Split da
put /data/f256.txt (256 MB), replication factor 2
[Diagram: the 256 MB file is split into four 64 MB blocks]
Distributed storage
Everywhere
put /data/f256.txt/part-r-00000 (64 MB)
[Diagram: the four 64 MB blocks are spread over Data Nodes 1 to 4]
Distributed storage
Replicate
put /data/f256.txt/part-r-00000 (64 MB), replication factor 2
[Diagram: each 64 MB block is stored on two Data Nodes]
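The split-and-replicate steps on these slides can be sketched in a few lines of plain Python. This is only an illustration: the function names are hypothetical, and the round-robin placement is an assumption made for simplicity (the real HDFS placement policy is rack-aware and decided by the Name Node).

```python
def split_into_blocks(file_size_mb, block_size_mb):
    """Return the sizes (in MB) of the blocks a file is cut into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # last block may be smaller
    return sizes


def place_replicas(num_blocks, data_nodes, replication):
    """Assign each block to `replication` distinct nodes.

    Round-robin placement is an illustrative assumption, not the HDFS policy.
    """
    if replication > len(data_nodes):
        raise ValueError("replication factor exceeds the number of data nodes")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            data_nodes[(block + r) % len(data_nodes)] for r in range(replication)
        ]
    return placement


# The example from the slides: a 256 MB file, 64 MB blocks, replication factor 2.
nodes = ["Data Node 1", "Data Node 2", "Data Node 3", "Data Node 4"]
blocks = split_into_blocks(256, 64)
placement = place_replicas(len(blocks), nodes, replication=2)
total_replicas = sum(len(holders) for holders in placement.values())
```

With these numbers the file becomes four blocks and the cluster stores eight block replicas in total, two per block.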
Map Reduce: High Level Execution
The rocket’s base
Load the data
[Diagram: four data parts]
Map Reduce: High Level Execution
The rocket’s engines
Map and Pair
[Diagram: each data part feeds a mapper]
Map Reduce: High Level Execution
The rocket’s trunk
Shuffle Pairs using Keys
[Diagram: the mapper outputs go through GroupByKey]
Map Reduce: High Level Execution
The rocket’s cockpit
Values per key are Reduced
[Diagram: the GroupByKey output feeds three reducers]
Map Reduce: High Level Execution
The rocket’s tip
We collect the results
[Diagram: the reducer outputs are gathered into Results]
Map Reduce: High Level Execution
To the infinite and beyond!
The whole#!
[Diagram: the full pipeline: data parts, mappers, GroupByKey, reducers, Results]
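The five stages above (load, map and pair, shuffle by key, reduce, collect) can be mimicked on a single machine. The sketch below is plain Python, not Hadoop or Spark code, and the function names and sample data are made up for illustration; it uses word count as the job.

```python
from collections import defaultdict


def mapper(part):
    """Map and pair: emit (word, 1) for every word in one data part."""
    for line in part:
        for word in line.split():
            yield (word.lower(), 1)


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)


def map_reduce(parts):
    """Run the stages in sequence; a real cluster runs mappers and reducers in parallel."""
    pairs = [pair for part in parts for pair in mapper(part)]
    groups = shuffle(pairs)
    return dict(reducer(key, values) for key, values in groups.items())


# Four data parts, as in the diagram.
data_parts = [
    ["spark is fast"],
    ["spark is distributed"],
    ["map reduce"],
    ["reduce by key"],
]
counts = map_reduce(data_parts)
```

Each mapper only ever sees its own part, and each reducer only the values of its own keys, which is what makes the model easy to distribute.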
Map Reduce: Matrix-Vector Product
How about word count?
[Equation image: matrix × vector = vector]
Map Reduce: Matrix-Vector Product
Back to school...
Map Reduce: Matrix-Vector Product
Wait, that’s maths
Map Reduce: Matrix-Vector Product
Where is the RAT?
Store Matrix as ordered
Vector V loaded in memory as ordered
Map function:
Each matrix element mapped on a product
Map Reduce: Matrix-Vector Product
OK … I TAKE OVER
MAP
Map Reduce: Matrix-Vector Product
just a sum …
REDUCE
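A minimal sketch of the map and reduce steps for the matrix-vector product, in plain Python. The storage layout is an assumption, since the formulas on the slides were images: the matrix is taken to be a list of (row, column, value) triples and the vector a list held in memory by every mapper.

```python
from collections import defaultdict


def matvec_map(element, v):
    """Map: one matrix element (i, j, m_ij) becomes the pair (i, m_ij * v_j)."""
    i, j, m_ij = element
    return (i, m_ij * v[j])


def matvec_reduce(pairs):
    """Reduce: just a sum of the products that share the same row index."""
    sums = defaultdict(float)
    for i, product in pairs:
        sums[i] += product
    return dict(sums)


# A 2 x 2 matrix stored element by element (assumed layout).
matrix = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)]
v = [10.0, 1.0]

pairs = [matvec_map(element, v) for element in matrix]
result = matvec_reduce(pairs)
```

Here `result` holds one entry per row: 1·10 + 2·1 = 12 for row 0 and 3·10 + 4·1 = 34 for row 1.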
Map Reduce: Summary
Summary == Reduce?
Simple abstraction of computations: Map and Reduce
Using a simple abstraction of data: key-value pairs
Map Reduce: Summary
So what?
Brings transparent:
● parallelization
● distribution
● fault tolerance
Why Apache Spark: MapReduce on steroids
Man… Finally!
Uses
● Functional paradigm
● Lazy computations
Creates dependencies between task definitions and optimizes execution
Why Apache Spark: MapReduce on steroids
Almost forgot that one
Can cache data in memory or local file system.
Far less IO or network.
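To make "lazy computations" and caching concrete, here is a toy sketch in plain Python. It is not the Spark API: `LazyDataset` is a hypothetical class that only records transformations, runs them when `collect()` is called, and keeps the result in memory once `cache()` has been requested.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not Spark's RDD)."""

    def __init__(self, source, transforms=()):
        self._source = source            # callable that loads the raw data
        self._transforms = tuple(transforms)
        self._cache_enabled = False
        self._cached = None

    def map(self, fn):
        # Only records the step; nothing is computed yet.
        return LazyDataset(self._source, self._transforms + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._transforms + (("filter", predicate),))

    def cache(self):
        self._cache_enabled = True
        return self

    def collect(self):
        """Action: run the recorded steps, or reuse the cached result."""
        if self._cached is not None:
            return list(self._cached)
        items = list(self._source())
        for kind, fn in self._transforms:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        if self._cache_enabled:
            self._cached = items
        return list(items)


reads = []  # counts how many times the source is actually read


def load():
    reads.append(1)
    return range(10)


even_squares = LazyDataset(load).map(lambda x: x * x).filter(lambda x: x % 2 == 0).cache()
reads_before_collect = len(reads)   # still 0: defining the pipeline reads nothing
first = even_squares.collect()      # reads the source once
second = even_squares.collect()     # served from the in-memory cache
```

The source is read once even though the result is collected twice, which is the "far less IO" effect in miniature.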
What is Machine Learning? Why you must care, by Data Fellas
Andy Petrella & Xavier Tordoir
you cannot prove a vague theory is wrong
[…] Also, if the process of computing the consequences is indefinite, then with a little skill any experimental result can be made to look like the expected consequences.
—Richard Feynman [1964]
What is Machine Learning? Science with data
Surely You’re Joking Mr…
● Modelling without first principles…
What is Machine Learning? Overview
2nd law neither...
● Modelling without first principles…
What is Machine Learning? Overview
Machine learning you do with a Learning Machine
Take that Newton...
● Modelling without first principles…
● Modelling dependencies from the data
What is Machine Learning? Overview
With some “a priori” knowledge
● What is the problem?
● Hypothesis?
● Data Generation Process?
● Collection and Preprocessing
● Interpretation
What is Machine Learning? Learning Machine…
You still need a domain expert…
Like me!
Learning Machine
● Estimate dependencies from data
What is Machine Learning? Overview
Machine learning you do with a Learning Machine
[Diagram: Samples Generator (x), System (y), Learning Machine (ỹ), z ?]
Learning Machine
● Estimate dependencies from data
● Minimize a risk functional over the set given the data
What is Machine Learning? Overview
I like them so much in LaTeX2e
[Diagram: Samples Generator (x), System (y), Learning Machine (ỹ), z ?]
Learning Machine
● Regression: continuous output
○ Risk = Prediction error
● Classification: categorical output
○ Risk = Probability of misclassification
What is Machine Learning? Supervised learning
$L(y, f(x, w)) = (y - f(x, w))^2$ …
WTF?
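The two risks named on this slide can be estimated from a finite sample by averaging a loss over it. The sketch below is plain Python with made-up numbers; squared error stands in for "prediction error" and the 0/1 loss for "probability of misclassification".

```python
def squared_loss(y, y_hat):
    """Regression loss: L(y, f(x, w)) = (y - f(x, w))^2."""
    return (y - y_hat) ** 2


def zero_one_loss(y, y_hat):
    """Classification loss: 1 for a wrong label, 0 for a correct one."""
    return 0 if y == y_hat else 1


def empirical_risk(loss, ys, predictions):
    """Average loss over the sample: the finite-data estimate of the risk."""
    if len(ys) != len(predictions):
        raise ValueError("ys and predictions must have the same length")
    return sum(loss(y, p) for y, p in zip(ys, predictions)) / len(ys)


# Illustrative values only.
regression_risk = empirical_risk(squared_loss, [1.0, 2.0, 3.0], [1.0, 2.5, 2.0])
classification_risk = empirical_risk(
    zero_one_loss, ["a", "b", "a", "b"], ["a", "a", "a", "b"]
)
```

The regression risk here is (0 + 0.25 + 1) / 3 and the classification risk is 1 error out of 4 samples.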
What is Machine Learning? Unsupervised learning: no output
I like clusters, specially with roasted nuts
● Clustering
○ Risk = Error Distortion (distances to center)
● Density estimation (probability densities)
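For clustering, the distortion can be computed directly from the points and the cluster centers. A small plain-Python sketch, assuming squared Euclidean distance to the nearest center (the slide only says "distances to center") and using made-up points:

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of the same dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def distortion(points, centers):
    """Mean squared distance from each point to its nearest center."""
    return sum(
        min(squared_distance(p, c) for c in centers) for p in points
    ) / len(points)


# Two obvious groups of points (illustrative data).
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
one_center = distortion(points, [(5.0, 1.0)])
two_centers = distortion(points, [(0.0, 1.0), (10.0, 1.0)])
```

With a single center every point is at squared distance 26; with one center per group it drops to 1, which is the quantity a clustering algorithm tries to minimize.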
What is Machine Learning? Bias - Variance, Regression illustration
Playtime!
Notebook!
What is Machine Learning? Inductive principle
In principle, it should work.
An inductive principle tells us what to do
[Diagram: Finite Data → Inductive principle → Model]
What is Machine Learning? Inductive principle
In principle, it should work.
Empirical risk minimization
[Diagram: Finite Data → Model]
• Function class not defined
• Loss not defined
• Optimization procedure not defined
What is Machine Learning? Inductive principle
In principle, it should work.
Regularization
[Diagram: Finite Data → Model]
• Control on penalty strength
• Penalize complexity / a priori knowledge
What is Machine Learning? Inductive principle
In principle, it should work.
Early stopping rules
[Diagram: Finite Data → Model]
• Iterative optimization
• Depends on initial params and algorithm
• Used for neural networks
• Penalize along a path
What is Machine Learning? Inductive principle
In principle, it should work.
Structural Risk
[Diagram: Finite Data → Model]
• Analytic estimates of empirical risk
What is Machine Learning? Inductive principle
In principle, it should work.
Bayesian inference
[Diagram: Finite Data → Model]
• Explicit a priori probabilities
• Learn mixtures
• Hard multidimensional integrations…
What is Machine Learning? Curse of dimensionality
In principle, it should work.
We want to control complexity
[Diagram: Finite Data → Model]
• Smoothness constraint in a neighborhood
What is Machine Learning? Curse of dimensionality
In principle, it should work.
Data density is key…
[Diagram: Finite Data in a Space → Inductive principle → Model Complexity]
What is Machine Learning? Curse of dimensionality
In principle, it should work.
Data density is key… e.g.
● 1-D 0.1 m interval => 10 points/m
● 2-D 0.1 m interval => 100 points/m^2
● d-D 0.1 m interval => 10^d points/m^d
Same smoothness requires lots of data in high dimensional spaces
What is Machine Learning? Curse of dimensionality
In principle, it should work.
Sampling is hard… e.g.
● 1-D 10% sample => 0.1 x size
● 2-D 10% sample => 0.31 x size
● 10-D 10% sample => 0.79 x size
=> local estimates from samples are difficult
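The numbers on the last two slides follow from two one-line formulas. The sketch below (plain Python, hypothetical function names) assumes a unit hypercube: keeping 10 points per unit length needs 10^d points, and a sub-cube holding a fraction s of uniformly spread data must span s^(1/d) of each edge.

```python
def points_needed(points_per_unit_length, d):
    """Points required to keep the same spacing in d dimensions."""
    return points_per_unit_length ** d


def edge_fraction(sample_fraction, d):
    """Edge length (relative to the full range) of a sub-cube
    that contains `sample_fraction` of uniformly distributed data."""
    return sample_fraction ** (1.0 / d)


density = {d: points_needed(10, d) for d in (1, 2, 10)}
edges = {d: edge_fraction(0.1, d) for d in (1, 2, 10)}
```

`edges` comes out at roughly 0.1, 0.316 and 0.794 for 1, 2 and 10 dimensions: in 10-D a "local" 10% sample already spans about 79% of the range of every feature.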
What is Machine Learning? Curse of dimensionality
In principle, it should work.
Data points are closer to edges…
Each data point “sees” itself as an outlier
=> Predictions require lots of extrapolation
What is Machine Learning? Curse of dimensionality
In principle, it should work.
Samples must increase exponentially
… or model complexity must be controlled
What is Machine Learning? Regularization in more detail
In principle, it should work.
Data driven penalized risk minimization
What is Machine Learning? Regularization in more detail
In principle, it should work.
Loss functions
What is Machine Learning? Regularization in more detail
In principle, it should work.
Regularizers
L2 (ridge)
L1 (lasso)
Elastic net
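The three regularizers can be written as penalty functions on the weight vector, added to the empirical risk with a strength lambda. This is a plain-Python sketch under one common convention; constant factors (such as a 1/2 in front of the L2 term) and the exact elastic net mixing differ between libraries, so treat the formulas as assumptions rather than MLlib's definitions.

```python
def l2_penalty(w):
    """Ridge: sum of squared weights."""
    return sum(x * x for x in w)


def l1_penalty(w):
    """Lasso: sum of absolute weights."""
    return sum(abs(x) for x in w)


def elastic_net_penalty(w, alpha):
    """Elastic net: a mix of L1 and L2, with alpha in [0, 1] (assumed convention)."""
    return alpha * l1_penalty(w) + (1.0 - alpha) * l2_penalty(w)


def penalized_risk(empirical_risk, w, lam, penalty):
    """Penalized risk = empirical risk + lambda * penalty(w)."""
    return empirical_risk + lam * penalty(w)


w = [3.0, -4.0]
ridge = l2_penalty(w)                    # 9 + 16
lasso = l1_penalty(w)                    # 3 + 4
mixed = elastic_net_penalty(w, 0.5)      # 0.5 * 7 + 0.5 * 25
total = penalized_risk(1.0, w, 0.1, l2_penalty)
```

Increasing `lam` pushes the optimizer toward smaller weights, which is how model complexity gets controlled.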
What is Machine Learning? Regularization in more detail
In principle, it should work.
Optimization (here comes the fun…)
Which algorithm to find a minimum in a distributed fashion?
Convex optimization methods (linear methods)
● Gradient descent
● Stochastic gradient descent
● Limited-memory BFGS
What is Machine Learning? Regularization in more detail
In principle, it should work.
Optimization (here comes the fun…)
Gradient descent
● Efficient steps but needs to read through the whole data
What is Machine Learning? Regularization in more detail
In principle, it should work.
Optimization (here comes the fun…)
Stochastic gradient descent
● Samples data for each step but converges very slowly
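The difference between the two methods is only in how much data each step reads. Below is a minimal single-machine sketch in plain Python for a one-parameter model y = w·x with squared loss; the step size, iteration counts and data are arbitrary choices for illustration, not values from the talk.

```python
import random


def gradient(w, x, y):
    """Derivative of the squared loss (y - w * x)^2 with respect to w."""
    return -2.0 * x * (y - w * x)


def gradient_descent(xs, ys, step, iterations):
    """Each step averages the gradient over the whole data set."""
    w = 0.0
    for _ in range(iterations):
        full_gradient = sum(gradient(w, x, y) for x, y in zip(xs, ys)) / len(xs)
        w -= step * full_gradient
    return w


def stochastic_gradient_descent(xs, ys, step, iterations, seed=0):
    """Each step uses the gradient of one randomly sampled data point."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(iterations):
        i = rng.randrange(len(xs))
        w -= step * gradient(w, xs[i], ys[i])
    return w


# Noise-free toy data generated with w = 3.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0 * x for x in xs]

w_gd = gradient_descent(xs, ys, step=0.05, iterations=100)
w_sgd = stochastic_gradient_descent(xs, ys, step=0.05, iterations=200)
```

Both recover w ≈ 3 here because the data is noise-free; on noisy data the stochastic version keeps jittering around the minimum unless the step size is decreased, which is where the slow convergence comes from.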
What is Machine Learning? Regularization in more detail
In principle, it should work.
Optimization (here comes the fun…)
L-BFGS
● Quadratic (second-derivative) estimates by keeping several previous gradients in memory
● Fast convergence
What is Machine Learning? Model selection
all work and no play makes Jack a dull boy
Model complexity control: Resampling
Selecting the right lambda…
… to minimize prediction risk
What is Machine Learning? Model selection
Enough theory boy!
The universe
What is Machine Learning? Model selection
Enough theory boy!
Our data
What is Machine Learning? Model selection
Enough theory boy!
Our data
Learning set (70%)
Validation set (30%)
What is Machine Learning? Model selection
Nice flag
K-Fold
K = 4
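Both resampling schemes boil down to choosing which row indices go to learning and which to validation. A plain-Python sketch with hypothetical helper names; it splits indices in order, whereas in practice the data would usually be shuffled first.

```python
def holdout_split(n, learning_fraction):
    """Single split: the first part for learning, the rest for validation."""
    cut = int(round(n * learning_fraction))
    return list(range(cut)), list(range(cut, n))


def k_fold_indices(n, k):
    """Return k (training, validation) index pairs; each row validates exactly once."""
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and n")
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        training = [i for i in range(n) if i < start or i >= start + size]
        folds.append((training, validation))
        start += size
    return folds


learning, validation = holdout_split(20, 0.7)   # 70% / 30%
folds = k_fold_indices(20, 4)                   # K = 4
```

To select lambda, one would fit the model on each training part for every candidate value, average the validation risk over the K folds, and keep the value with the lowest average.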
MLlib: A library to learn them all...
What is Apache Spark?
I play BIG!
Distributed computing framework
Large Scale Data Processing engine
What is Apache Spark?
With all colors!
Distributed computing framework
Large Scale Data Processing engine
● SQL & DataFrames
● Streaming
● Graph Processing
● Machine Learning
What is Apache Spark?
Let the brain do the work...
Distributed computing framework
Large Scale Data Processing engine
● Optimize memory usage (FAST)
● Optimize computation execution (Complex tasks)
● Easy programming model
What is Apache Spark?
Breed mixin’
Distributed computing framework
Large Scale Data Processing engine
● Interactive
● @ any scale
MLlib: Spark
In principle, it should work.
Intro to Spark… notebook
So we’ve seen…
● Basics of Spark data manipulation
● MLlib data representation
● Linear regression
● Regularization and k-fold cross validation
What else is there?
MLlib: Spark
In principle, it should work.
● Basic statistics
● Classification and regression
● Collaborative filtering
● Clustering
● Dimensionality reduction
● Feature extraction and transformation
● Frequent pattern mining
● Evaluation metrics
● …
http://spark.apache.org/docs/latest/mllib-guide.html
MLlib for Genomics? ADAM + MLlib (mixture K-Means + RF)
Playtime!
Some more examples
Genomics: The data
So… that’s what separates us huh?
1000 genomes: http://www.1000genomes.org/
~1000 samples
~30M Genotypes per sample (features)
Genomics: The data
Please, don’t mind the colors...
1000 genomes: http://www.1000genomes.org/
~1000 samples
Few samples => Machine Learning
Genomics: The data
Woooow, really, you must be kidding me… ahahahahah
1000 genomes: http://www.1000genomes.org/
~1000 samples
~30M Genotypes per sample (features)
Few samples => Machine Learning
Lots of Data => Distributed computing
Genomics: The data
Oh… damned… hum huh
MLlib for Genomics? ADAM + MLlib (mixture K-Means + RF)
Playtime!
Notebook!
What else? Old and new players are now integrating with Spark (and Scala)
Spark ML Pipeline
● Integrated with DataFrame
● Offers an API to create shareable/reusable pipeline constructions (PCA, …)
Higher API
Spark ML Keystone
● Like Pipeline but type safe
● Chainable API (andThen-friendly)
Higher API
H2O
● In-memory implementation of “Map-Reduce”
● Highly optimised structures for the JVM
● Blazing fast convergent models
Higher API
DL4J Spark ML
Higher API
DAAL (Intel)
● Intel Data Analytics Acceleration Library
Higher API
System ML (IBM)
● Declarative large-scale machine learning
● Optimization based on data and cluster characteristics
Higher API
Needle
● Nitro's Extremely Exciting Deep Learning Engine
● MLP, RBM, LSTM and more to come
Higher API
H2O: Sparkling & Deep Learning on genomics
water in fire
Learning structures using the H2O Deep Learning algorithm integrated in Spark, in a Notebook, on an EC2 cluster
http://h2o.ai/product/sparkling-water/
H2O Sparkling: in-memory data exchange
I remember things better when I remember them twice.
Wrap up: what we hope you have learned
Distributed computing: For machine learning
I am ready.
Data is exploding
Distributed Technologies are maturing
Scale up and down, interactivity
Distributed ML on Spark: What is available
What are my options by the way?
Spark MLlib, H2O, DL4J, Needle
EC2, GCE, URIKA-XA
Cloudera, MapR, Hortonworks
HDFS, C*, Kafka
“Create” Cluster
Find sources (context, quality, semantic, …)
Connect to sources (structure, schema/types, …)
Create distributed data pipeline/Model
Tune accuracy
Tune performance
Write results to Sinks
Access Layer
User Access
Shar3 (Data Fellas)
Roles per step (in the order listed on the slide): ops; data; ops, data; sci; sci, ops; sci; ops, data; web, ops, data; web, ops, data, sci
Shar3 (Data Fellas)
Analysis, Production, Distribution, Rendering, Discovery
Catalog
Project Generator
Micro Service / Binary format
Schema for output
Metadata
That’s all folks: Thanks for listening/staying
Poke us on Twitter or via http://data-fellas.guru
@DataFellas @Shar3_Fellas @SparkNotebook @Xtordoir & @Noootsab
Building Distributed Pipelines for Data Science using Kafka, Spark, and Cassandra (form → @DataFellas)
Check also @TypeSafe: http://t.co/o1Bt6dQtgH