Making Machine Learning Scale: Single Machine and Distributed
TRANSCRIPT
![Page 1: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/1.jpg)
Scalable Machine Learning: Single Machine to Distributed
Yucheng Low, Chief Architect
![Page 2: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/2.jpg)
What is ML scalability?
![Page 3: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/3.jpg)
Is this scalability?
[Chart: runtimes of Algorithm Implementation X (1600s, 800s, 400s, 200s) compared with the Best Single Machine Implementation (300s)]
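A small sketch of the comparison this slide is making: speedup measured against the best single-machine runtime rather than against the same implementation's slowest run. The runtimes are the ones on the slide; the function name is illustrative, and the transcript does not say what resources each runtime of Implementation X used.

```python
def speedup_vs_baseline(runtime_s, baseline_s):
    """Speedup of a run relative to a fixed baseline runtime (>1 means faster)."""
    return baseline_s / runtime_s

# Runtimes taken from the slide, in seconds.
implementation_x_runtimes = [1600, 800, 400, 200]
best_single_machine = 300

for runtime in implementation_x_runtimes:
    ratio = speedup_vs_baseline(runtime, best_single_machine)
    print(runtime, "s:", ratio, "x the best single-machine implementation")
```

Only the 200s run is faster than the 300s baseline, even though each step halves the runtime of Implementation X.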
![Page 4: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/4.jpg)
True Scalability
How long does it take to get to a predetermined accuracy?
Not about: how well you can implement Algorithm X
Understand the tradeoffs between different algorithms.
![Page 5: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/5.jpg)
It is not about
Scaling Up Scaling Out
![Page 6: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/6.jpg)
It is about
Scaling Up Scaling Out
Going as fast as you can, on any hardware
![Page 7: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/7.jpg)
The Dato Way
• Assume bounded resources
• Optimize for data scalability
• Scales excellently
• Requires fewer machines to solve in the same runtime as other systems
![Page 8: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/8.jpg)
Single Machine Scalability: Storage Hierarchy
Capacity and Throughput:
• 0.1 TB at ~1-10 GB/s
• 1 TB at ~1GB/s
• 10 TB at ~0.1GB/s
Random access is very slow!
Good External Memory Datastructures For ML
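As a rough illustration of what these numbers imply, the sketch below estimates how long one sequential pass over a full tier would take. It assumes each capacity on the slide pairs with the throughput listed next to it and uses decimal units (1 TB = 1000 GB); the function name is illustrative.

```python
def full_scan_seconds(capacity_tb, throughput_gb_per_s):
    """Seconds to read an entire tier once, sequentially (1 TB = 1000 GB)."""
    return capacity_tb * 1000.0 / throughput_gb_per_s

# (capacity in TB, sequential throughput in GB/s); the pairing is an assumption.
tiers = [(0.1, 10.0), (0.1, 1.0), (1.0, 1.0), (10.0, 0.1)]
for capacity_tb, gb_per_s in tiers:
    seconds = full_scan_seconds(capacity_tb, gb_per_s)
    print(capacity_tb, "TB at", gb_per_s, "GB/s:", seconds, "s per pass")
```

Under these assumptions a full pass over the largest tier takes on the order of a day, so an algorithm that needs many passes, or random access on top of that, is bounded by storage rather than compute.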
![Page 9: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/9.jpg)
SFrame: Scalable Tabular Data Manipulation
[Figure: example tables with column labels User, Com., Title, Body, User, Disc.]
SGraph: Scalable Graph Manipulation
![Page 10: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/10.jpg)
Data is usually rows (user, movie, rating)…
But data engineering is typically column transformations…
![Page 11: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/11.jpg)
Feature engineering is columnar
Normalize the feature x:
sf['rating'] = sf['rating'] / sf['rating'].sum()
Create a new feature:
sf['rating-squared'] = sf['rating'].apply(lambda rating: rating * rating)
Create a new dataset with 2 of the features:
sf2 = sf[['rating', 'rating-squared']]
[Figure: table with columns user, movie, rating, rating squared]
![Page 12: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/12.jpg)
SFrame
• Rich Datatypes
  • Strong schema types: int, double, string, image, ...
  • Weak schema types: list, dictionary (can contain arbitrary JSON)
• Columnar Architecture
  • Easy feature engineering + vectorized feature operations
  • Lazy evaluation
  • Statistics + sketches
  • Type aware compression
Scalable Out-Of-Core Table Representation
Netflix Dataset: 99M rows, 3 columns, ints
1.4GB raw
289MB gzip compressed
160MB
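To give a feel for why a columnar layout with type aware compression can beat generic compression, here is a minimal sketch that delta-encodes one integer column before compressing it. This is not SFrame's actual encoding, which the slides do not describe; delta encoding plus zlib is a stand-in, and all names are illustrative.

```python
import random
import zlib
from array import array

def delta_encode(values):
    """Replace each value with its difference from the previous one."""
    previous = 0
    deltas = []
    for value in values:
        deltas.append(value - previous)
        previous = value
    return deltas

def delta_decode(deltas):
    """Invert delta_encode by taking a running sum."""
    total = 0
    values = []
    for delta in deltas:
        total += delta
        values.append(total)
    return values

def compress_int_column(values):
    # 'q' stores each delta as a signed 64-bit integer before zlib sees it.
    return zlib.compress(array("q", delta_encode(values)).tobytes())

def decompress_int_column(blob):
    deltas = array("q")
    deltas.frombytes(zlib.decompress(blob))
    return delta_decode(deltas)

# Synthetic sorted ID column; real data will behave differently.
rng = random.Random(0)
user_ids = sorted(rng.randrange(10000) for _ in range(5000))
raw_bytes = array("q", user_ids).tobytes()
print("raw:", len(raw_bytes), "bytes")
print("zlib only:", len(zlib.compress(raw_bytes)), "bytes")
print("delta + zlib:", len(compress_int_column(user_ids)), "bytes")
```

Because a column holds values of a single type, the encoder can exploit structure such as sortedness or small ranges that is hidden when rows of mixed fields are compressed as text.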
![Page 13: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/13.jpg)
Out of Core Machine Learning
Rethink all ML Algorithms
Random Access → Sequential Only
Sampling? → Sort/Shuffle
Understanding the statistical/convergence impacts of ML algorithm variations.
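One way to read "Sampling? Sort/Shuffle" is that random-access sampling can be replaced by shuffling the data once and then reading it strictly in order. The sketch below shows that idea in memory; an out-of-core version would replace the in-memory sort with an external sort. The function names are illustrative and are not GraphLab Create APIs.

```python
import random

def shuffle_by_random_key(rows, seed=0):
    """Shuffle by tagging each row with a random key and sorting on it."""
    rng = random.Random(seed)
    # The row index breaks ties so that rows themselves are never compared.
    keyed = [(rng.random(), index, row) for index, row in enumerate(rows)]
    keyed.sort()
    return [row for _, _, row in keyed]

def sequential_batches(rows, batch_size):
    """Yield consecutive batches; each pass touches the data strictly in order."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

rows = list(range(10))
shuffled = shuffle_by_random_key(rows, seed=42)
for batch in sequential_batches(shuffled, batch_size=4):
    print(batch)
```

After the one-time shuffle, every mini-batch behaves like a random sample without any random reads, at the cost of reusing the same order on each pass. That trade-off is the kind of statistical impact the slide asks to be understood.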
![Page 14: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/14.jpg)
Single Machine Scaling
[Chart: runtime (axis 0 to 2500) for GraphLab-Create (1 Node), MLlib 1.3 (5 Node), MLlib 1.3 (1 Node), and Scikit-Learn]
Dataset source: LIBLinear binary classification datasets.
KDD Cup data: 8.4M data points, 20M features, 2.4GB compressed.
Task: predict student performance on math problems based on interactions with a tutoring system.
![Page 15: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/15.jpg)
Single Machine Scaling
[Chart: runtime (axis 0 to 900) for GraphLab-Create (1 Node) and BIDMach (1 GPU Node)]
Criteo Kaggle: Click Prediction
46M rows
34M sparse coefficients
Not a Compute Bound Task
![Page 16: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/16.jpg)
Graphs encode the relationships between: People, Facts, Products, Interests, Ideas
Social Media, Advertising, Science, Web
• Big: trillions of vertices and edges and rich metadata
• Facebook (10/2012): 1B users, 144B friendships
• Twitter (2011): 15B follower edges
![Page 17: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/17.jpg)
SGraph
1. Immutable disk-backed graph representation. (Append only)
2. Vertex / Edge Attributes.
3. Optimized for bulk access, not fine-grained queries.
Get neighborhood of [5 Million Vertices]
Get neighborhood of 1 vertex
![Page 18: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/18.jpg)
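Standard Graph Representations

Edge List: Easy to Insert, Difficult to Query

| src | dest |
|---|---|
| 1 | 102 |
| 132 | 10 |
| 48 | 999 |
| 129 | 192 |
| 998 | 23 |
| 392 | 124 |

Sparse Matrix / Sorted Edge List: Difficult to Insert (random writes), Fast to Query

| src | dest |
|---|---|
| 1 | 10 |
| 1 | 99 |
| 1 | 102 |
| 2 | 5 |
| 2 | 10 |
| 2 | 120 |

Edges being inserted in the figure: 102 103, 349 13, 1 105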
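The trade-off between the two representations can be sketched in a few lines. The edges are the ones from the sorted table on this slide; the helper names are illustrative, and plain Python lists stand in for on-disk structures.

```python
import bisect

def neighbors_scan(edge_list, src):
    """Unsorted edge list: a query must scan every edge."""
    return [dst for s, dst in edge_list if s == src]

def neighbors_sorted(sorted_edges, src):
    """Sorted edge list: binary search finds the block of edges for src."""
    lo = bisect.bisect_left(sorted_edges, (src,))
    hi = bisect.bisect_left(sorted_edges, (src + 1,))
    return [dst for _, dst in sorted_edges[lo:hi]]

sorted_edges = [(1, 10), (1, 99), (1, 102), (2, 5), (2, 10), (2, 120)]
edge_list = list(sorted_edges)

# Insert: appending is cheap, keeping sorted order means shifting later edges.
edge_list.append((1, 105))
bisect.insort(sorted_edges, (1, 105))

print(neighbors_scan(edge_list, 1))
print(neighbors_sorted(sorted_edges, 1))
```

On disk, the shifting done by the sorted insert turns into random writes, which is the cost the slide calls out.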
![Page 19: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/19.jpg)
SGraph Layout
Vertices partitioned into p = 4 SFrames (Vertex SFrames 1 to 4)

| __id | Address | ZipCode |
|---|---|---|
| Alice | … | 98105 |
| Bob | … | 98102 |
![Page 20: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/20.jpg)
SGraph Layout
Vertex SFrames: 1 to 4

| __id | Address | ZipCode |
|---|---|---|
| John | … | 98105 |
| Jack | … | 98102 |

Edges partitioned into p^2 = 16 SFrames
Edge SFrames: a 4 × 4 grid indexed (1,1) through (4,4)

| __src_id | __dst_id | Message |
|---|---|---|
| Alice | Bob | "hello" |
| Bob | Charlie | "world" |
| Charlie | Alice | "moof" |
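A minimal sketch of this layout, under two assumptions the slides do not spell out: vertices are assigned to partitions by hashing their ID, and edge SFrame (i, j) holds the edges whose source is in vertex partition i and whose destination is in vertex partition j. Plain Python lists stand in for SFrames, and all names are illustrative.

```python
import zlib
from collections import defaultdict

def vertex_partition(vertex_id, p):
    """Map a vertex ID to a partition in 1..p with a stable hash (assumption)."""
    return zlib.crc32(str(vertex_id).encode("utf-8")) % p + 1

def partition_graph(vertices, edges, p):
    """Split vertices into p groups and edges into a p x p grid of groups."""
    vertex_parts = defaultdict(list)
    for vertex in vertices:
        vertex_parts[vertex_partition(vertex, p)].append(vertex)
    edge_parts = defaultdict(list)
    for src, dst, attribute in edges:
        block = (vertex_partition(src, p), vertex_partition(dst, p))
        edge_parts[block].append((src, dst, attribute))
    return vertex_parts, edge_parts

vertices = ["Alice", "Bob", "Charlie"]
edges = [("Alice", "Bob", "hello"),
         ("Bob", "Charlie", "world"),
         ("Charlie", "Alice", "moof")]
vertex_parts, edge_parts = partition_graph(vertices, edges, p=4)
for block in sorted(edge_parts):
    print(block, edge_parts[block])
```

With this indexing, processing edge block (i, j) only needs vertex partitions i and j in memory, which is what makes bulk, out-of-core passes over the graph practical.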
![Page 21: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/21.jpg)
![Page 22: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/22.jpg)
![Page 23: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/23.jpg)
Common Crawl Graph
3.5 billion Nodes and 128 billion Edges
Largest available public Graph.
2 TB compressed to 200GB
Compression factor 10:1
12.5 bits per edge
Benefit From SFrame Compression Methods
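The figures on this slide can be cross-checked with a little arithmetic, assuming decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes) and that 2 TB is the uncompressed size:

```python
EDGES = 128 * 10**9              # 128 billion edges
COMPRESSED_BYTES = 200 * 10**9   # 200GB
RAW_BYTES = 2 * 10**12           # 2 TB, assumed to be the uncompressed size

bits_per_edge = COMPRESSED_BYTES * 8 / EDGES
compression_factor = RAW_BYTES / COMPRESSED_BYTES
raw_bytes_per_edge = RAW_BYTES / EDGES

print("bits per edge:", bits_per_edge)
print("compression factor:", compression_factor)
print("raw bytes per edge:", raw_bytes_per_edge)
```

The raw size works out to roughly 16 bytes per edge, which would fit two 8-byte vertex IDs per edge; that reading is an inference, not something the slide states.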
![Page 24: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/24.jpg)
![Page 25: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/25.jpg)
Common Crawl Graph
1x r3.8xlarge using 1x SSD.
3.5 billion Nodes and 128 billion Edges
PageRank: 9 min per iteration.
Connected Components: ~ 1 hr.
There isn’t any general purpose library out there capable of this.
![Page 26: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/26.jpg)
SFrame & SGraph
BSD License (August)
![Page 27: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/27.jpg)
Distributed
![Page 28: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/28.jpg)
Train on bigger datasets Train Faster
Speedup Relative to Best Single Machine Implementation
![Page 29: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/29.jpg)
Extending Single Machine to Distributed
[Figure: one machine reading data X, Y]
Time for 1 pass = 100s
![Page 30: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/30.jpg)
Extending Single Machine to Distributed
[Figure: data X, Y split across parallel disks]
Time for 1 pass = 50s
Good External Memory Datastructures For ML Still Help
![Page 31: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/31.jpg)
Distributed Optimization
Newton, LBFGS, FISTA, etc.
Repeat:
• Parallel sweep over data (X, Y): make sure this is embarrassingly parallel
• Synchronize parameters: talk quickly
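The sweep-then-synchronize loop can be sketched on a single machine by treating each data shard as one worker. Plain gradient descent on a least-squares objective stands in for the Newton, LBFGS, and FISTA solvers named on the slide, and the sum over per-shard gradients stands in for the network synchronization step. All names are illustrative.

```python
def shard_gradient(w, shard):
    """Least-squares gradient over one shard; needs no data from other shards."""
    grad = [0.0] * len(w)
    for x, y in shard:
        error = sum(wi * xi for wi, xi in zip(w, x)) - y
        for j, xj in enumerate(x):
            grad[j] += error * xj
    return grad

def all_reduce_sum(vectors):
    """Stand-in for the synchronization step: element-wise sum across workers."""
    return [sum(column) for column in zip(*vectors)]

def distributed_gd(shards, dim, lr, iterations):
    w = [0.0] * dim
    n = sum(len(shard) for shard in shards)
    for _ in range(iterations):
        local = [shard_gradient(w, shard) for shard in shards]  # parallel sweep
        total = all_reduce_sum(local)                           # synchronize
        w = [wi - lr * g / n for wi, g in zip(w, total)]
    return w

# Synthetic data for y = 2*x + 1; the second feature is a constant bias term.
samples = [((i / 10.0, 1.0), 2.0 * (i / 10.0) + 1.0) for i in range(10)]
shards = [samples[:3], samples[3:7], samples[7:]]
print(distributed_gd(shards, dim=2, lr=0.5, iterations=2000))
```

Each sweep only reads local data, and only a vector the size of the model crosses the network per iteration, so the synchronization cost depends on the number of parameters rather than the number of rows.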
![Page 32: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/32.jpg)
Distributed Optimization
1. Data begins on HDFS
2. Every machine takes part of the data to local disk/SSD
3. Inter-machine communication by fast supercomputer-style primitives
![Page 33: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/33.jpg)
Criteo Terabyte Click Logs
Click Prediction Task: Whether visitor clicked on a link or not.
![Page 34: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/34.jpg)
Criteo Terabyte Click Prediction
4.4 Billion Rows
13 Features
½ TB of data
[Chart: runtime (axis 0 to 4000) vs. #machines (axis 0 to 16), labeled "Linear Speedup", with annotated runtimes 3630s and 225s]
![Page 35: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/35.jpg)
Distributed Graphs
![Page 36: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/36.jpg)
Graph Partitioning: Minimizing Communication
Vertex-Cut: placing edges on machines, and letting vertices span machines
Communication is linear in the number of machines each vertex spans
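Since communication grows with the number of machines each vertex spans, a natural quantity to measure is the average span, often called the replication factor. The sketch below computes it for a random edge placement on a small star graph; the graph, seed, and function names are illustrative.

```python
import random
from collections import defaultdict

def random_vertex_cut(edges, num_machines, seed=0):
    """Place each edge on a machine chosen uniformly at random."""
    rng = random.Random(seed)
    return [rng.randrange(num_machines) for _ in edges]

def replication_factor(edges, placement):
    """Average number of machines spanned by each vertex."""
    spans = defaultdict(set)
    for (src, dst), machine in zip(edges, placement):
        spans[src].add(machine)
        spans[dst].add(machine)
    return sum(len(machines) for machines in spans.values()) / len(spans)

# Star graph: vertex 0 is a hub connected to 100 leaves.
star_edges = [(0, leaf) for leaf in range(1, 101)]
placement = random_vertex_cut(star_edges, num_machines=4, seed=1)
print(replication_factor(star_edges, placement))
```

In a vertex-cut a high-degree hub is replicated on at most as many machines as exist, while each leaf stays on a single machine.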
![Page 37: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/37.jpg)
Graph Partitioning
Communication Minimization
Time to compute a partition
Quality of partition
![Page 38: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/38.jpg)
Graph Partitioning
Since Large Natural Graphs are difficult to partition anyway…
Time to compute a partition
Quality of partition
How good a partition quality can we get while doing almost no work at all?
![Page 39: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/39.jpg)
Random Partitioning
Randomly assign edges to machines (Machine 1, Machine 2, Machine 3)
But this is probably the worst partition you can construct. Can we do better?
![Page 40: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/40.jpg)
SGraph Partitioning
[Figure: the 16 edge SFrames, listed in four groups of four]
• (1,1), (1,2), (2,1), (2,2)
• (3,1), (3,2), (4,1), (4,2)
• (1,3), (1,4), (2,3), (2,4)
• (3,3), (3,4), (4,3), (4,4)
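The grouping on this slide looks like the 4 × 4 grid of edge SFrames cut into four 2 × 2 quadrants, one per machine. The sketch below works through what that would mean for vertex replication, assuming edge SFrame (i, j) holds edges from vertex partition i to vertex partition j. Both the quadrant reading and the function names are assumptions.

```python
def machine_for_block(i, j, p, grid):
    """Map edge block (i, j), 1-indexed, to a machine in a grid x grid layout."""
    blocks_per_side = p // grid
    return ((i - 1) // blocks_per_side, (j - 1) // blocks_per_side)

def machines_spanned(k, p, grid):
    """Machines holding any edge block that touches vertex partition k."""
    spanned = set()
    for other in range(1, p + 1):
        spanned.add(machine_for_block(k, other, p, grid))  # edges out of k
        spanned.add(machine_for_block(other, k, p, grid))  # edges into k
    return spanned

p, grid = 4, 2  # 16 edge blocks on a 2 x 2 arrangement of 4 machines
for k in range(1, p + 1):
    print("vertex partition", k, "spans", sorted(machines_spanned(k, p, grid)))
```

Under these assumptions each vertex partition touches one row and one column of machines, so it spans at most 2 * grid - 1 of the grid * grid machines, with no work spent computing a partition beyond hashing.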
![Page 41: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/41.jpg)
Slides from a couple of years ago
![Page 42: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/42.jpg)
Distributed Graphs
• New Graph Partitioning Ideas
• Mixed in-core / out-of-core computation
![Page 43: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/43.jpg)
Common Crawl Graph
3.5 billion Nodes and 128 billion Edges
[Chart: runtime (axis 0 to 600) vs. #machines (axis 0 to 16)]
16 Machines (c3.8xlarge, 512 vCPUs)
45 sec per iteration
3B edges per second
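The throughput figure follows from the edge count and the per-iteration time. The sketch below also includes the single-machine PageRank number quoted earlier in the deck (9 min per iteration on one r3.8xlarge) for comparison. The two setups use different instance types, so the ratio is only indicative.

```python
EDGES = 128 * 10**9  # 128 billion edges

def edges_per_second(seconds_per_iteration):
    return EDGES / seconds_per_iteration

single_machine = edges_per_second(9 * 60)  # 9 min per iteration
sixteen_machines = edges_per_second(45)    # 45 sec per iteration

print("1 machine:   %.2e edges/s" % single_machine)
print("16 machines: %.2e edges/s" % sixteen_machines)
print("ratio: %.1fx" % (sixteen_machines / single_machine))
```

The 16-machine figure comes to about 2.8 billion edges per second, which the slide rounds to 3B.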
![Page 44: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/44.jpg)
In search of Performance
Understand memory access patterns of algorithms: Single Machine and Distributed
Sequential? Random?
Optimize datastructures for access patterns
![Page 45: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/45.jpg)
It is not merely about speed, or scaling
Doing more with what you already have
![Page 46: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/46.jpg)
![Page 47: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/47.jpg)
Excess Slides
![Page 48: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/48.jpg)
Our Tools Are Easy To Use
import graphlab as gl
train_data = gl.SFrame.read_csv(traindata_path)
train_data['1grams'] = gl.text_analytics.count_ngrams(train_data['text'], 1)
train_data['2grams'] = gl.text_analytics.count_ngrams(train_data['text'], 2)
cls = gl.classifier.create(train_data, target='sentiment')
5 line sentiment analysis
But: you have preexisting code in Numpy, Scipy, Scikit-learn
![Page 49: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/49.jpg)
Automatic Numpy Scaling
Automatic in-memory, type aware compression using SFrame type compression technology.
import graphlab.numpy
Scalable numpy activation successful
Scales all numeric numpy arrays to datasets much larger than memory.
Works with scipy, sklearn.
Demo
![Page 50: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/50.jpg)
Scikit Learn SGDLinearClassifier
[Chart: Airline Delay Dataset; runtime (s, axis 0 to 4000) vs. millions of rows (axis 0 to 400), comparing Numpy and Graphlab + numpy]
Automatic Numpy Scaling
Caveats apply:
- Sequential access highly preferred.
- Scales most memory bound sklearn algorithms by at least 2x, some by more.
![Page 52: Making Machine Learning Scale: Single Machine and Distributed](https://reader030.vdocuments.site/reader030/viewer/2022020314/58eb41ec1a28abd1228b46b3/html5/thumbnails/52.jpg)
Deep Learning Throughput GPU
[Chart: images per second (axis 0 to 30000) for H2O (4 node), H2O (16 node), H2O (63 node), and GraphLab Create GPU]
Dataset source: MNIST, 60K examples, 784 dimensions.
Source(s): H2O Deep Learning Benchmarks using a 4 layer architecture.