making machine learning scale: single machine and distributed

Scalable Machine Learning: Single Machine to Distributed

Yucheng Low, Chief Architect

What is ML scalability?

Is this scalability?

[Chart: runtime of Algorithm Implementation X: 1600s, 800s, 400s, 200s. Best Single Machine Implementation: 300s.]
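The chart's point can be checked with a little arithmetic. The sketch below compares the speedup of Implementation X against itself with its speedup against the best single machine implementation. The machine counts (1, 2, 4, 8) are an assumption for illustration; the slide only gives the runtimes.

```python
# Runtimes of "Algorithm Implementation X" from the slide, in seconds.
# ASSUMPTION: they correspond to 1, 2, 4 and 8 machines; the slide does not say.
impl_x_runtimes = {1: 1600, 2: 800, 4: 400, 8: 200}
best_single_machine = 300  # seconds, from the slide


def speedups(runtimes, baseline):
    """Speedup of each configuration relative to a baseline runtime."""
    return {n: baseline / t for n, t in runtimes.items()}


self_relative = speedups(impl_x_runtimes, impl_x_runtimes[1])
vs_best = speedups(impl_x_runtimes, best_single_machine)

for n in impl_x_runtimes:
    print(f"{n} machines: {self_relative[n]:.2f}x vs itself, "
          f"{vs_best[n]:.2f}x vs best single machine")
```

Measured against itself the implementation looks like it scales perfectly, yet its fastest configuration is only 1.5x faster than the best single machine implementation, and the 400s configuration is slower than it.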

True Scalability

How long does it take to get to a predetermined accuracy?

Not about: how well you can implement Algorithm X

Understand the tradeoffs between different algorithms.

It is not about

Scaling Up Scaling Out

It is about

Scaling Up Scaling Out

Going as fast as you can, on any hardware

• Assume bounded resources
• Optimize for data scalability

The Dato Way

• Scales excellently
• Requires fewer machines to solve in the same runtime as other systems

Single Machine Scalability: Storage Hierarchy

Capacity    Throughput
0.1 TB      ~1-10 GB/s
1 TB        ~1 GB/s
10 TB       ~0.1 GB/s

Random access is very slow!
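To put the capacity and throughput figures in perspective, here is a rough calculation of how long one sequential pass over a full tier would take. It assumes 1 TB = 1000 GB and uses the low end of the "~1-10 GB/s" range; both are assumptions, not statements from the slide.

```python
# (capacity in GB, sequential throughput in GB/s), as read off the slide.
# ASSUMPTION: 1 TB = 1000 GB, and the low end of "~1-10 GB/s" is used.
tiers = [
    (100, 1.0),     # 0.1 TB at ~1-10 GB/s
    (1000, 1.0),    # 1 TB at ~1 GB/s
    (10000, 0.1),   # 10 TB at ~0.1 GB/s
]


def full_scan_seconds(capacity_gb, throughput_gb_s):
    """Time for one sequential pass over a full tier."""
    return capacity_gb / throughput_gb_s


scan_times = [full_scan_seconds(c, t) for c, t in tiers]
for (cap, thr), secs in zip(tiers, scan_times):
    print(f"{cap:>6} GB at {thr} GB/s: {secs:,.0f} s per pass")
```

Even purely sequential passes over the largest tier take on the order of a day, which is why the number of passes an algorithm needs matters so much.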

Good External Memory Datastructures For ML

SFrame: Scalable Tabular Data Manipulation

[Diagram labels: User, Com., Title, Body, User, Disc.]

SGraph: Scalable Graph Manipulation

Data is usually stored as rows… (user, movie, rating)

But data engineering typically consists of column transformations…


Feature engineering is columnar

Normalize the feature x:

    sf['rating'] = sf['rating'] / sf['rating'].sum()

Create a new feature:

    sf['rating-squared'] = sf['rating'].apply(lambda rating: rating * rating)

Create a new dataset with 2 of the features:

    sf2 = sf[['rating', 'rating-squared']]

[Figure: columns user, movie, rating, rating squared]
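The three operations above each touch whole columns and never individual rows. A minimal pure-Python sketch of the same steps follows, using a dict of lists as a hypothetical stand-in for a column store. It is not the SFrame API, and the data values are made up.

```python
# A toy column store: one Python list per column (NOT the SFrame API).
table = {
    "user": ["alice", "bob", "carol"],
    "movie": ["m1", "m2", "m1"],
    "rating": [4.0, 2.0, 2.0],
}

# Normalize a column: only the 'rating' column is read and rewritten.
total = sum(table["rating"])
table["rating"] = [r / total for r in table["rating"]]

# Create a new feature from an existing column.
table["rating-squared"] = [r * r for r in table["rating"]]

# Select two columns: the new table shares the column lists, no rows are copied.
table2 = {name: table[name] for name in ("rating", "rating-squared")}
```

With a row-oriented layout, each of these steps would have to read every field of every row, including the user and movie fields that are never used.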

SFrame

• Rich Datatypes
  • Strong schema types: int, double, string, image, ...
  • Weak schema types: list, dictionary (can contain arbitrary JSON)
• Columnar Architecture
  • Easy feature engineering + vectorized feature operations
  • Lazy evaluation
  • Statistics + sketches
  • Type aware compression

Scalable Out-Of-Core Table Representation

Netflix Dataset: 99M rows, 3 columns, ints

1.4GB raw

289MB gzip compressed

160MB as an SFrame
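The slides do not describe how the type aware compression works. As an illustration of the general idea only, the sketch below shows how knowing that a column holds small integers allows a much denser encoding than text. The 3-bit packing scheme and the function names are hypothetical and are not SFrame's actual codec.

```python
def pack_small_ints(values, bits):
    """Pack non-negative ints, each < 2**bits, into a compact byte string."""
    acc = 0
    for i, v in enumerate(values):
        if not 0 <= v < (1 << bits):
            raise ValueError("value does not fit in the given bit width")
        acc |= v << (i * bits)
    nbytes = (len(values) * bits + 7) // 8
    return acc.to_bytes(nbytes, "little")


def unpack_small_ints(data, bits, count):
    """Inverse of pack_small_ints."""
    acc = int.from_bytes(data, "little")
    mask = (1 << bits) - 1
    return [(acc >> (i * bits)) & mask for i in range(count)]


ratings = [5, 3, 4, 1, 2, 5, 5, 4] * 1000      # toy rating column, values 1..5
as_text = "\n".join(str(r) for r in ratings).encode()
packed = pack_small_ints(ratings, bits=3)

print(len(as_text), "bytes as text,", len(packed), "bytes packed")
```

A generic compressor such as gzip sees only bytes, while an encoder that knows the column type can pick a representation suited to it before any general-purpose compression is applied.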

Out of Core Machine Learning

Rethink all ML Algorithms

Random Access -> Sequential Only

Sampling? -> Sort/Shuffle

Understanding the statistical/convergence impacts of ML algorithm variations.
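One way to replace random-access sampling with sequential passes is a two-pass shuffle. The sketch below is an assumption about how such a shuffle could look, not a description of the implementation behind these slides; Python lists stand in for append-only files on disk.

```python
import random


def external_shuffle(rows, num_buckets, seed=0):
    """Shuffle using only sequential passes.

    Pass 1 streams the input once and appends each row to a random bucket
    (a bucket stands in for an append-only file on disk).
    Pass 2 reads one bucket at a time, shuffles it in memory, and emits it.
    """
    rng = random.Random(seed)
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:                      # sequential read, sequential appends
        buckets[rng.randrange(num_buckets)].append(row)
    for bucket in buckets:                # each bucket is assumed to fit in RAM
        rng.shuffle(bucket)
        yield from bucket


shuffled = list(external_shuffle(range(1000), num_buckets=8))
```

Once the data has been shuffled this way, an algorithm that wants random samples can simply read the shuffled data front to back.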

Single Machine Scaling

[Bar chart: runtime (axis 0 to 2500) for GraphLab-Create (1 Node), MLlib 1.3 (5 Node), MLlib 1.3 (1 Node), Scikit-Learn]

Dataset source: LIBLinear binary classification datasets. KDD Cup data: 8.4M data points, 20M features, 2.4GB compressed. Task: predict student performance on math problems based on interactions with a tutoring system.

Single Machine Scaling

[Bar chart: runtime (axis 0 to 900) for GraphLab-Create (1 Node), BIDMach (1 GPU Node)]

Criteo Kaggle: Click Prediction

46M rows, 34M sparse coefficients

Not a Compute Bound Task

Social Media, Advertising, Science, Web

Graphs encode the relationships between: People, Facts, Products, Interests, Ideas

• Big: trillions of vertices and edges and rich metadata
• Facebook (10/2012): 1B users, 144B friendships
• Twitter (2011): 15B follower edges

SGraph

1. Immutable disk-backed graph representation. (Append only)
2. Vertex / Edge Attributes.
3. Optimized for bulk access, not fine-grained queries.

Get neighborhood of [5 Million Vertices]

Get neighborhood of 1 vertex

Standard Graph Representations

Edge List

src    dest
1      102
132    10
48     999
129    192
998    23
392    124

Easy to Insert. Difficult to Query.

Sparse Matrix / Sorted Edge List

src    dest
1      10
1      99
1      102
2      5
2      10
2      120

Difficult to Insert (random writes). Fast to Query.

[Example edges shown being inserted in the figure: 102 103; 349 13; 1 105]
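The tradeoff between the two representations can be sketched with plain Python lists. The edge values are the ones read off the slide; the helper names are hypothetical.

```python
import bisect

# Unsorted edge list: appends are cheap, queries scan everything.
edge_list = [(1, 102), (132, 10), (48, 999), (129, 192), (998, 23), (392, 124)]
edge_list.append((102, 103))                      # O(1) insert at the end


def neighbors_scan(edges, src):
    """O(E) query: look at every edge."""
    return [d for s, d in edges if s == src]


# Sorted edge list: queries are a binary search, but an insert has to
# shift everything after the insertion point (random writes on disk).
sorted_edges = [(1, 10), (1, 99), (1, 102), (2, 5), (2, 10), (2, 120)]


def neighbors_sorted(edges, src):
    """O(log E) query for integer vertex ids, via two binary searches."""
    lo = bisect.bisect_left(edges, (src,))
    hi = bisect.bisect_left(edges, (src + 1,))
    return [d for _, d in edges[lo:hi]]


bisect.insort(sorted_edges, (1, 105))             # O(E) insert in the middle
```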

SGraph Layout

Vertex SFrames: 1, 2, 3, 4

__id     Address    ZipCode
Alice    …          98105
Bob      …          98102

__id     Address    ZipCode
John     …          98105
Jack     …          98102

Vertices partitioned into p = 4 SFrames

Edges partitioned into p^2 = 16 SFrames
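A minimal sketch of how such a layout could assign data to partitions: each vertex goes to one of p vertex partitions, and each edge goes to the edge partition indexed by the partitions of its two endpoints. The use of CRC32 as the hash is an assumption for illustration; the slides do not say how SGraph assigns vertices. Partition indices here are 0-based, whereas the slide numbers them from 1.

```python
import zlib

P = 4  # number of vertex partitions, as on the slide


def vertex_partition(vertex_id, p=P):
    """Map a vertex id to one of p partitions (hypothetical hash choice)."""
    return zlib.crc32(str(vertex_id).encode()) % p


def edge_partition(src, dst, p=P):
    """An edge lands in the partition indexed by its endpoints' partitions."""
    return (vertex_partition(src, p), vertex_partition(dst, p))


# Toy edges for illustration.
edges = [("Alice", "Bob"), ("Bob", "Charlie"), ("Charlie", "Alice")]
edge_blocks = {}
for src, dst in edges:
    edge_blocks.setdefault(edge_partition(src, dst), []).append((src, dst))

print(edge_blocks)
```

One consequence of a layout like this would be that processing edge partition (i, j) needs only vertex partitions i and j in memory, which suits bulk access better than fine-grained queries.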

SGraph Layout

Vertex SFrames: 1, 2, 3, 4

Edge SFrames:

(1,1) (2,1) (3,1) (4,1)
(1,2) (2,2) (3,2) (4,2)
(1,3) (2,3) (3,3) (4,3)
(1,4) (2,4) (3,4) (4,4)

__src_id    __dst_id    Message
Alice       Bob         "hello"
Bob         Charlie     "world"
Charlie     Alice       "moof"

Common Crawl Graph

3.5 billion nodes and 128 billion edges. Largest available public graph.

Benefit From SFrame Compression Methods

2 TB raw, 200GB compressed

Compression factor 10:1

12.5 bits per edge
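The compression figures are consistent with each other, as a short calculation shows. It assumes decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes).

```python
# Figures from the slide. ASSUMPTION: decimal units (GB = 10**9 bytes).
num_edges = 128 * 10**9
compressed_bytes = 200 * 10**9    # 200GB
raw_bytes = 2 * 10**12            # 2 TB

bits_per_edge = compressed_bytes * 8 / num_edges
raw_bits_per_edge = raw_bytes * 8 / num_edges
compression_factor = raw_bytes / compressed_bytes

print(f"{bits_per_edge} bits per edge compressed")
print(f"{raw_bits_per_edge} bits per edge raw")
print(f"{compression_factor}:1 compression")
```

The raw figure works out to roughly 16 bytes per edge, which is what two 8-byte vertex ids would occupy.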

Common Crawl Graph

1x r3.8xlarge using 1x SSD.


PageRank: 9 min per iteration.

Connected Components: ~1 hr.

There isn’t any general purpose library out there capable of this.

SFrame & SGraph

BSD License (August)

Distributed

Train on bigger datasets. Train faster.

Speedup Relative to Best Single Machine Implementation

Extending Single Machine to Distributed

Data (X, Y): time for 1 pass = 100s

Data (X, Y) split over parallel disks: time for 1 pass = 50s

Good External Memory Datastructures For ML Still Help

Distributed Optimization

Newton, LBFGS, FISTA, etc

Each iteration alternates two steps:

• Parallel sweep over data (X, Y). Make sure this is embarrassingly parallel.
• Synchronize parameters. Talk quickly.
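The sweep-then-synchronize pattern can be sketched on one machine by simulating the shards. The example below uses plain gradient descent on a one-parameter least-squares problem rather than Newton, LBFGS or FISTA, and a Python `sum` stands in for the synchronization primitive; all names, the data, and the step size are assumptions for illustration.

```python
def local_gradient(w, shard):
    """Gradient of 0.5 * sum((w*x - y)^2) over one machine's shard."""
    return sum((w * x - y) * x for x, y in shard)


def allreduce_sum(values):
    """Stand-in for the synchronization step: every machine gets the total."""
    return sum(values)


def distributed_gd(shards, steps=100, lr=0.004):
    w = 0.0
    for _ in range(steps):
        # Parallel sweep: each machine touches only its own shard.
        grads = [local_gradient(w, shard) for shard in shards]
        # Synchronize: one small number per machine is exchanged.
        w -= lr * allreduce_sum(grads)
    return w


data = [(float(x), 2.0 * x) for x in range(1, 9)]   # y = 2x exactly
shards = [data[i::4] for i in range(4)]              # 4 simulated machines
w = distributed_gd(shards)
print(w)
```

The sweep is embarrassingly parallel because the gradient is a sum over data points, so the per-shard gradients add up to the single-machine gradient and only the parameters need to cross the network.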

Distributed Optimization

1. Data (X, Y) begins on HDFS
2. Every machine takes part of the data to local disk/SSD
3. Inter-machine communication by fast supercomputer-style primitives

Criteo Terabyte Click Logs

Click Prediction Task: Whether visitor clicked on a link or not.

Criteo Terabyte Click Prediction

4.4 billion rows, 13 features

½ TB of data

[Chart: runtime (axis 0 to 4000) vs. #machines (axis 0 to 16). Linear speedup, from 3630s down to 225s.]
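A quick check of the "linear speedup" claim from the two runtimes on the chart. The sketch assumes that 3630s is the 1-machine runtime and 225s is the 16-machine runtime; the extracted chart gives the values but not which points they label.

```python
# ASSUMPTION: 3630s is on 1 machine and 225s is on 16 machines.
t_one_machine = 3630.0
t_sixteen_machines = 225.0
machines = 16
rows = 4.4e9

speedup = t_one_machine / t_sixteen_machines
efficiency = speedup / machines
rows_per_second = rows / t_sixteen_machines

print(f"speedup: {speedup:.2f}x on {machines} machines")
print(f"parallel efficiency: {efficiency:.2f}")
print(f"throughput: {rows_per_second / 1e6:.1f}M rows per second")
```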

Distributed Graphs

Graph Partitioning Minimizing Communication

Communication is linear in the number of machines each vertex spans

Vertex-Cut: Placing edges on machines, and letting vertex span machines

Graph Partitioning: Communication Minimization

[Figure axes: time to compute a partition vs. quality of partition]

Since large natural graphs are difficult to partition anyway…

How good a partition quality can we get while doing almost no work at all?

Random Partitioning

Randomly assign edges to machines (Machine 1, Machine 2, Machine 3).

But this is probably the worst partition you can construct. Can we do better?
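Random edge placement and its communication cost can be sketched as follows. The cost measure used here is the average number of machines each vertex spans, following the earlier statement that communication is linear in that number. The star graph and the function names are made up for illustration.

```python
import random


def random_vertex_cut(edges, num_machines, seed=0):
    """Place each edge on a uniformly random machine."""
    rng = random.Random(seed)
    return {edge: rng.randrange(num_machines) for edge in edges}


def replication_factor(placement):
    """Average number of machines each vertex spans."""
    spans = {}
    for (src, dst), machine in placement.items():
        spans.setdefault(src, set()).add(machine)
        spans.setdefault(dst, set()).add(machine)
    return sum(len(m) for m in spans.values()) / len(spans)


# Toy star graph: vertex 0 is a high-degree hub.
edges = [(0, i) for i in range(1, 101)]
placement = random_vertex_cut(edges, num_machines=3)
rf = replication_factor(placement)
print(rf)
```

Random placement needs no computation beyond one random number per edge, but high-degree vertices end up spanning nearly every machine.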

SGraph Partitioning

Edge SFrames, grouped as they appear on the slide:

(1,2) (2,2) (1,1) (2,1)
(3,2) (4,2) (3,1) (4,1)
(1,4) (2,4) (1,3) (2,3)
(3,4) (4,4) (3,3) (4,3)

Slides from a couple of years ago

Distributed Graphs

New graph partitioning ideas. Mixed in-core / out-of-core computation.

Common Crawl Graph

[Chart: runtime (axis 0 to 600) vs. #machines (axis 0 to 16)]

16 machines (c3.8xlarge, 512 vCPUs): 45 sec per iteration

3B edges per second
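The throughput figure follows from the graph size and the iteration time. The sketch below also compares it with the single-machine PageRank time of 9 min per iteration quoted earlier; that comparison is only indicative, since the two runs used different instance types.

```python
num_edges = 128e9                 # Common Crawl graph, from the slides

# 16 machines: 45 sec per iteration.
distributed_seconds = 45.0
distributed_rate = num_edges / distributed_seconds

# 1 machine: 9 min per iteration (different instance type, so indicative only).
single_seconds = 9 * 60.0
single_rate = num_edges / single_seconds

print(f"distributed: {distributed_rate / 1e9:.2f}B edges per second")
print(f"single machine: {single_rate / 1e6:.0f}M edges per second")
print(f"ratio of iteration times: {single_seconds / distributed_seconds:.0f}x")
```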


In search of Performance

Understand memory access patterns of algorithms: single machine and distributed

Sequential? Random?

Optimize data structures for access patterns

It is not merely about speed, or scaling

Doing more with what you already have

Excess Slides

Our Tools Are Easy To Use

    import graphlab as gl
    train_data = gl.SFrame.read_csv(traindata_path)
    train_data['1grams'] = gl.text_analytics.count_ngrams(train_data['text'], 1)
    train_data['2grams'] = gl.text_analytics.count_ngrams(train_data['text'], 2)
    cls = gl.classifier.create(train_data, target='sentiment')

5 line sentiment analysis

But you have preexisting code in Numpy, Scipy, Scikit-learn

Automatic Numpy Scaling

Automatic in-memory, type aware compression using SFrame type compression technology.

    import graphlab.numpy
    Scalable numpy activation successful

Scales all numeric numpy arrays to datasets much larger than memory. Works with scipy, sklearn.

Demo

Scikit-Learn SGD linear classifier, Airline Delay Dataset

[Chart: runtime (s), axis 0 to 4000, vs. millions of rows, axis 0 to 400. Series: Numpy; Graphlab + numpy]

Caveats apply:
 - Sequential access highly preferred.
 - Scales most memory bound sklearn algorithms by at least 2x, some by more.

Deep Learning Throughput GPU

[Bar chart: images per second (axis 0 to 30000) for H2O (4 Node), H2O (16 Node), H2O (63 Node), GraphLab Create GPU]

Dataset source: MNIST, 60K examples, 784 dimensions. Source(s): H2O Deep Learning Benchmarks using a 4 layer architecture.
