MLconf 2013: Metronome and Parallel Iterative Algorithms on YARN
DESCRIPTION
Online learning techniques, such as Stochastic Gradient Descent (SGD), are powerful when applied to risk minimization and convex games on large problems. However, their sequential design prevents them from taking advantage of newer distributed frameworks such as Hadoop/MapReduce. In this session, we look at how we parallelize parameter estimation for linear models on the next-gen YARN framework with IterativeReduce and the parallel machine learning library Metronome. We also look at non-linear modeling with the introduction of parallel neural network training in Metronome.
TRANSCRIPT
Metronome
YARN and Parallel Iterative Algorithms
Josh Patterson
Email:
Twitter:
@jpatanooga
Github:
https://github.com/jpatanooga
Past
Published in IAAI-09:
“TinyTermite: A Secure Routing Algorithm”
Grad work in Meta-heuristics, Ant-algorithms
Tennessee Valley Authority (TVA)
Hadoop and the Smartgrid
Cloudera
Principal Solution Architect
Today: Consultant
Sections
1. Parallel Iterative Algorithms
2. Parallel Neural Networks
3. Future Directions
YARN, IterativeReduce and Hadoop
Parallel Iterative Algorithms
Machine Learning and Optimization
Direct Methods
Normal Equation
Iterative Methods
Newton’s Method
Quasi-Newton
Gradient Descent
Heuristics
AntNet
PSO
Genetic Algorithms
Linear Regression
In linear regression, data is modeled using linear predictor functions
Unknown model parameters are estimated from the data
We use optimization techniques like Stochastic Gradient Descent to find the coefficients in the model
Y = (1*x0) + (c1*x1) + … + (cN*xN)
Stochastic Gradient Descent
Andrew Ng’s Tutorial: https://class.coursera.org/ml/lecture/preview_view/11
Hypothesis about data
Cost function
Update function
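The equation images from this slide do not survive in the transcript; for reference, the standard formulation for linear-regression SGD (following the Ng lecture linked above, with learning rate \alpha — a sketch, not the slide's exact notation):
h_\theta(x) = \sum_j \theta_j x_j  (hypothesis about the data)
J(\theta) = \tfrac{1}{2} (h_\theta(x) - y)^2  (cost for one example)
\theta_j := \theta_j - \alpha (h_\theta(x) - y) x_j  (update function, per example)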
Stochastic Gradient Descent
Training
Simple gradient descent procedure
Loss function needs to be convex (with exceptions)
Linear Regression
Loss Function: squared error of prediction
Prediction: linear combination of coefficients and input variables
[Diagram: training data flows into SGD, which produces the model]
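To make the training loop concrete, here is a minimal serial SGD sketch for linear regression in Java. This is illustrative only (not the Mahout or Metronome implementation); names like trainSgd and learningRate are placeholders for this example.

// Minimal serial SGD for linear regression (squared-error loss).
// Illustrative sketch only; not Metronome's actual trainer.
static double[] trainSgd(double[][] x, double[] y, int epochs, double learningRate) {
    int numFeatures = x[0].length;
    double[] weights = new double[numFeatures]; // model coefficients, start at zero
    for (int epoch = 0; epoch < epochs; epoch++) {
        for (int i = 0; i < x.length; i++) {
            // Prediction: linear combination of coefficients and input variables
            double prediction = 0.0;
            for (int j = 0; j < numFeatures; j++) {
                prediction += weights[j] * x[i][j];
            }
            // Gradient of the squared error for this single example
            double error = prediction - y[i];
            // Update each coefficient a small step against its gradient
            for (int j = 0; j < numFeatures; j++) {
                weights[j] -= learningRate * error * x[i][j];
            }
        }
    }
    return weights;
}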
Mahout’s SGD
Currently Single Process
Multi-threaded parallel, but not cluster parallel
Runs locally, not deployed to the cluster
Tied to logistic regression implementation
Distributed Learning Strategies
McDonald, 2010
Distributed Training Strategies for the Structured Perceptron
Langford, 2007
Vowpal Wabbit
Jeff Dean’s Work on Parallel SGD
Downpour SGD
MapReduce vs. Parallel Iterative
[Diagram: in MapReduce, Input flows through Map tasks and then Reduce tasks to produce Output in a single pass; in the parallel iterative model, a set of processors runs Superstep 1, then Superstep 2, and so on, synchronizing between supersteps]
YARN
Yet Another Resource Negotiator
Framework for scheduling distributed applications
Allows any type of parallel application to run natively on Hadoop
MapReduce (MRv2) is now just another distributed application on YARN
IterativeReduce API
ComputableMaster
Setup()
Compute()
Complete()
ComputableWorker
Setup()
Compute()
[Diagram: in each superstep the Workers call Compute() on their data splits and the Master combines their results; the cycle repeats superstep after superstep]
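As a rough illustration of how these callbacks fit together for parameter averaging, here is a simplified sketch in the shape of the API above. It is pseudocode-in-Java, not the real IterativeReduce interfaces (the actual ComputableMaster/ComputableWorker are generic over update and record types and take more arguments).

// Simplified parameter-averaging sketch in the shape of the IterativeReduce API.
// Not the real interfaces; types and signatures are illustrative.
class AveragingWorker {
    double[] weights;      // local copy of the current global model
    double[][] split;      // this worker's slice of the training data
    double[] labels;
    double learningRate = 0.01;

    void setup(double[] globalWeights) {      // Setup(): receive the current global model
        this.weights = globalWeights.clone();
    }

    double[] compute() {                      // Compute(): one SGD pass over the local split
        for (int i = 0; i < split.length; i++) {
            double prediction = 0.0;
            for (int j = 0; j < weights.length; j++) prediction += weights[j] * split[i][j];
            double error = prediction - labels[i];
            for (int j = 0; j < weights.length; j++) weights[j] -= learningRate * error * split[i][j];
        }
        return weights;                       // partial model sent to the master
    }
}

class AveragingMaster {
    double[] compute(java.util.List<double[]> partialModels) {   // Compute(): average partial models
        double[] global = new double[partialModels.get(0).length];
        for (double[] partial : partialModels) {
            for (int j = 0; j < global.length; j++) global[j] += partial[j] / partialModels.size();
        }
        return global;                        // broadcast back to the workers for the next superstep
    }
}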
SGD: Serial vs Parallel
[Diagram: serial SGD runs one process over all the training data to produce the model; in the parallel version, splits 1..N of the training data feed Workers 1..N, each producing a partial model, and the Master averages the partial models into a global model]
Parallel Iterative Algorithms on YARN
Based directly on work we did with Knitting Boar
Parallel logistic regression
And then added
Parallel linear regression
Parallel Neural Networks
Packaged in a new suite of parallel iterative algorithms called Metronome
100% Java, ASF 2.0 licensed, on GitHub
Linear Regression Results
[Chart: Linear Regression — Parallel vs. Serial; x-axis: total megabytes processed (64–320 MB), y-axis: total processing time; parallel runs compared against serial runs]
Logistic Regression: 20Newsgroups
[Chart: input size vs. processing time on 20Newsgroups (roughly 4–41 MB); OLR (serial online logistic regression) vs. POLR (parallel OLR)]
Convergence Testing
Debugging parallel iterative algorithms during testing is hard
Processes on different hosts are difficult to observe
Using the unit-testing framework IRUnit, we can simulate the IterativeReduce framework
We know the plumbing of message passing works
Allows us to focus on parallel algorithm design/testing while still using standard debugging tools
Let’s Get Non-Linear
Parallel Neural Networks
What are Neural Networks?
Inspired by biological nervous systems
Models layers of neurons in the brain
Can learn non-linear functions
Recently enjoying a surge in popularity
Multi-Layer Perceptron
First layer has input neurons
Last layer has output neurons
Each neuron in a layer is connected to all neurons in the next layer
Each neuron has an activation function, typically sigmoid / logistic
A neuron's input is the weighted sum of its incoming connections (weight * input, summed — see the sketch below)
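A minimal sketch of that forward computation in Java (illustrative only; the layer representation, the method name forward, and the sigmoid choice are just for this example, not Metronome's network classes):

// Forward pass of one fully-connected layer with sigmoid activations.
// Illustrative sketch, not Metronome's implementation.
static double[] forward(double[] inputs, double[][] weights, double[] biases) {
    double[] outputs = new double[weights.length];
    for (int neuron = 0; neuron < weights.length; neuron++) {
        // Neuron input = weighted sum of the connections from the previous layer
        double sum = biases[neuron];
        for (int j = 0; j < inputs.length; j++) {
            sum += weights[neuron][j] * inputs[j];
        }
        // Sigmoid / logistic activation
        outputs[neuron] = 1.0 / (1.0 + Math.exp(-sum));
    }
    return outputs;
}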
Backpropagation Learning
Calculates the gradient of the network's error with respect to the network's modifiable weights
Intuition
Run a forward pass of the example through the network
Compute activations and output
Iterate backwards from the output layer to the input layer
For each neuron in the layer
Compute the node's responsibility for the error
Update the weights on its connections
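As a rough sketch of the weight update for a sigmoid output layer (single example, squared-error loss; hidden layers propagate deltas the same way using the downstream layer's deltas). Again this is illustrative, not Metronome's trainer:

// Backprop weight update for a sigmoid output layer under squared-error loss.
// delta = (output - target) * output * (1 - output)  -- the node's "responsibility" for the error.
// Illustrative sketch only.
static void updateOutputLayer(double[] hiddenActivations, double[] outputs,
                              double[] targets, double[][] outWeights,
                              double learningRate) {
    for (int k = 0; k < outputs.length; k++) {
        double delta = (outputs[k] - targets[k]) * outputs[k] * (1.0 - outputs[k]);
        for (int j = 0; j < hiddenActivations.length; j++) {
            // Move each connection weight against its share of the error gradient
            outWeights[k][j] -= learningRate * delta * hiddenActivations[j];
        }
    }
}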
Parallelizing Neural Networks
Dean et al. (NIPS 2012)
First Steps: Focus on linear convex models, calculating distributed gradient
Model Parallelism must be combined with distributed optimization that leverages data parallelization
simultaneously process distinct training examples in each of the many model replicas
periodically combine their results to optimize our objective function
Single pass frameworks such as MapReduce “ill-suited”
Costs of Neural Network Training
Connection count explodes quickly as neurons and layers increase
Example: {784, 450, 10} network has 357,300 connections
Need fast iterative framework
Example: with a 30-second MapReduce job setup cost and 10,000 epochs, that is 30 s x 10,000 == 300,000 seconds of setup time alone
That's roughly 5,000 minutes, or about 83 hours
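A quick way to check that connection count (hypothetical helper; bias connections are not counted):

// Count connections in a fully-connected network with the given layer sizes.
// Example: {784, 450, 10} -> 784*450 + 450*10 = 357,300 connections.
static long countConnections(int[] layerSizes) {
    long total = 0;
    for (int i = 0; i < layerSizes.length - 1; i++) {
        total += (long) layerSizes[i] * layerSizes[i + 1];
    }
    return total;
}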
3 ways to speed up training
Subdivide the dataset between workers (data parallelism)
Maximize disk transfer rate and use vector caching to maximize data throughput
Minimize inter-epoch setup times with proper iterative framework
Vector In-Memory Caching
Since we make lots of passes over the same dataset
In-memory caching makes sense here
Once a record is vectorized, it is cached in memory on the worker node
Speedup (single pass, “no cache” vs “cached”):
~12x
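A minimal illustration of the idea (hypothetical sketch; vectorize() below is a stand-in for whatever turns a raw record into a feature vector, and Metronome's actual caching layer is more involved):

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: cache each record's vectorized form on the worker so
// later epochs skip re-parsing and re-vectorizing.
class VectorCache {
    private final Map<Long, double[]> cache = new HashMap<>();

    double[] getVector(long recordId, String rawRecord) {
        double[] vec = cache.get(recordId);
        if (vec == null) {
            vec = vectorize(rawRecord);   // expensive: only happens on the first pass
            cache.put(recordId, vec);
        }
        return vec;
    }

    // Placeholder vectorizer: parse a comma-separated line of doubles.
    private static double[] vectorize(String rawRecord) {
        String[] parts = rawRecord.split(",");
        double[] v = new double[parts.length];
        for (int i = 0; i < parts.length; i++) v[i] = Double.parseDouble(parts[i]);
        return v;
    }
}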
Neural Networks Parallelization Speedup
[Chart: training speedup factor (multiple) vs. number of parallel processing units (1–5) for UCI Iris, UCI Lenses, UCI Wine, UCI Dermatology, and a downsampled NIST handwriting dataset]
Going Forward
Future Directions
Lessons Learned
Linear scaling continues to be achieved with parameter-averaging variations
Tuning is critical
Need to be good at selecting a learning rate
Future Directions
Adagrad (SGD adaptive learning rates; see the sketch after this list)
Parallel Quasi-Newton Methods
L-BFGS
Conjugate Gradient
More Neural Network Learning Refinement
Training progressively larger networks
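For reference, the core of the Adagrad idea is a per-coordinate learning rate that shrinks as squared gradients accumulate. This is the standard formulation as a sketch, not Metronome code (epsilon just avoids division by zero):

// Adagrad-style per-coordinate adaptive learning rate for SGD.
// Standard formulation; illustrative only.
static void adagradUpdate(double[] weights, double[] gradient,
                          double[] squaredGradSum, double baseLearningRate) {
    double epsilon = 1e-8;
    for (int j = 0; j < weights.length; j++) {
        squaredGradSum[j] += gradient[j] * gradient[j];            // accumulate gradient history
        double rate = baseLearningRate / (Math.sqrt(squaredGradSum[j]) + epsilon);
        weights[j] -= rate * gradient[j];                          // rarely-updated coordinates get larger steps
    }
}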
Github
IterativeReduce
https://github.com/emsixteeen/IterativeReduce
Metronome
https://github.com/jpatanooga/Metronome
Unit Testing and IRUnit
Simulates the IterativeReduce parallel framework
Uses the same app.properties file that YARN applications do
Examples
https://github.com/jpatanooga/Metronome/blob/master/src/test/java/tv/floe/metronome/linearregression/iterativereduce/TestSimulateLinearRegressionIterativeReduce.java
https://github.com/jpatanooga/KnittingBoar/blob/master/src/test/java/com/cloudera/knittingboar/sgd/iterativereduce/TestKnittingBoar_IRUnitSim.java