Declarative Machine Learning: Bring Your Own Algorithm, Data, Syntax and Infrastructure
TRANSCRIPT
© 2015 IBM Corporation
Shivakumar Vaithyanathan, IBM Fellow
Watson & IBM Research
© 2012 IBM Corporation
Credit Risk Scoring Application at a Large Financial Institution
To execute on one machine with a hypothetical statistical package/engine, 3.6 TB of RAM would be required (an underestimate); even the reduced feature set needs 1.2 TB (also an underestimate).
In practice still more RAM is required, since outputs and intermediates must be stored along with the input.
Prototypical of problems in other industries ranging from automotive to insurance to transportation
Credit Risk Scoring
– Features: payment history, amount owed, length of credit history, new credit, types of credit used, …
– Problem size: 300 million rows, 1,500 features (reduced set: 500 features)
– Data size on disk: 3.6 TB uncompressed (even for the reduced set: 1.2 TB)
– Algorithm of interest: regression
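The quoted memory footprints follow directly from dense storage of the matrix; a quick back-of-the-envelope check (assuming 8 bytes per value, i.e. double precision, which the slide does not state explicitly):

```python
rows = 300_000_000
bytes_per_value = 8  # assumption: dense double-precision storage

full_tb = rows * 1500 * bytes_per_value / 1e12     # all 1,500 features
reduced_tb = rows * 500 * bytes_per_value / 1e12   # reduced set: 500 features

print(full_tb, reduced_tb)  # 3.6 and 1.2, matching the slide's TB figures
```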
Big Data Analytics Use Cases

Insurance (large number of data points and attributes, dense)
– Consumer risk modeling
– Consumer data with ~300M rows and ~500 attributes

DaaS / Retail Finance (large number of models)
– Predict customer monetary loss
– Multi-million observations, 95 features; evaluate several hundred models for the optimal subset of features

Automotive (large number of features)
– Customer satisfaction
– Multi-million cars with few reacquired cars
– Feature expansion from ~250 to ~21,800
A Day in the Life of a Data Scientist …
Working from a data sample and its characteristics drawn from the original data, the data scientist develops a new algorithm or modifies an existing one (Bayesian networks, neural networks, random forests, support vector machines, …), typically expressed in a custom syntax.
Bottleneck: Moving the Algorithm onto Big Data Infrastructure
Before the algorithm can run at scale, the data scientist must hand it off to Hadoop, Spark, or MPI programmers for reimplementation.
What If …
What if a compiler and optimizer stood between the data scientist and the infrastructure, so that the Hadoop, Spark, and MPI programmers are no longer needed for each new algorithm?
Simplified view of what we want to build …
The What (language, tooling):
– High-level language
– Write any algorithm
The How (compiler, optimizer):
– Adapt to different data and program characteristics
– Support different backend architectures and configurations
SystemML: IBM Research Project, Soon to Be in Open Source
• IBM Research project started 6 years ago
• More than 10 papers in major conferences
• In beta for more than a year and used in multiple applications
The What:
• R-like and Python-like syntax, …
• Rich set of statistical functions
• User-defined and external functions
The How:
• Single-node and embeddable, plus Hadoop & Spark backends
• Dense/sparse matrix representations
• Library of more than 15 algorithms
An R parser and a Python parser feed a common compiler that lowers scripts through higher-level operators (HOPs) to lower-level operators (LOPs), targeting in-memory single-node and Hadoop/Spark execution. Writing the Python-syntax parser took less than 2 man-months.
How should the “What” work?
package gnmf;

import java.io.IOException;
import java.net.URISyntaxException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class MatrixGNMF {
    public static void main(String[] args) throws IOException, URISyntaxException {
        if (args.length < 10) {
            System.out.println("missing parameters");
            System.out.println("expected parameters: [directory of v] [directory of w] [directory of h] "
                    + "[k] [num mappers] [num reducers] [replication] [working directory] "
                    + "[final directory of w] [final directory of h]");
            System.exit(1);
        }
        String vDir = args[0];
        String wDir = args[1];
        String hDir = args[2];
        int k = Integer.parseInt(args[3]);
        int numMappers = Integer.parseInt(args[4]);
        int numReducers = Integer.parseInt(args[5]);
        int replication = Integer.parseInt(args[6]);
        String outputDir = args[7];
        String wFinalDir = args[8];
        String hFinalDir = args[9];
        JobConf mainJob = new JobConf(MatrixGNMF.class);
        String vDirectory;
        String wDirectory;
        String hDirectory;
        FileSystem.get(mainJob).delete(new Path(outputDir));
        vDirectory = vDir;
        hDirectory = hDir;
        wDirectory = wDir;
        String workingDirectory;
        String resultDirectoryX;
        String resultDirectoryY;
        long start = System.currentTimeMillis();
        System.gc();
        System.out.println("starting calculation");
        System.out.print("calculating X = WT * V... ");
        workingDirectory = UpdateWHStep1.runJob(numMappers, numReducers, replication,
                UpdateWHStep1.UPDATE_TYPE_H, vDirectory, wDirectory, outputDir, k);
        resultDirectoryX = UpdateWHStep2.runJob(numMappers, numReducers, replication,
                workingDirectory, outputDir);
        FileSystem.get(mainJob).delete(new Path(workingDirectory));
        System.out.println("done");
        System.out.print("calculating Y = WT * W * H... ");
        workingDirectory = UpdateWHStep3.runJob(numMappers, numReducers, replication,
                wDirectory, outputDir);
        resultDirectoryY = UpdateWHStep4.runJob(numMappers, replication, workingDirectory,
                UpdateWHStep4.UPDATE_TYPE_H, hDirectory, outputDir);
        FileSystem.get(mainJob).delete(new Path(workingDirectory));
        System.out.println("done");
        System.out.print("calculating H = H .* X ./ Y... ");
        workingDirectory = UpdateWHStep5.runJob(numMappers, numReducers, replication,
                hDirectory, resultDirectoryX, resultDirectoryY, hFinalDir, k);
        System.out.println("done");
        FileSystem.get(mainJob).delete(new Path(resultDirectoryX));
        FileSystem.get(mainJob).delete(new Path(resultDirectoryY));
        System.out.print("storing back H... ");
        FileSystem.get(mainJob).delete(new Path(hDirectory));
        hDirectory = workingDirectory;
        System.out.println("done");
        System.out.print("calculating X = V * HT... ");
        workingDirectory = UpdateWHStep1.runJob(numMappers, numReducers, replication,
                UpdateWHStep1.UPDATE_TYPE_W, vDirectory, hDirectory, outputDir, k);
        resultDirectoryX = UpdateWHStep2.runJob(numMappers, numReducers, replication,
                workingDirectory, outputDir);
        FileSystem.get(mainJob).delete(new Path(workingDirectory));
        System.out.println("done");
        System.out.print("calculating Y = W * H * HT... ");
        workingDirectory = UpdateWHStep3.runJob(numMappers, numReducers, replication,
                hDirectory, outputDir);
        resultDirectoryY = UpdateWHStep4.runJob(numMappers, replication, workingDirectory,
                UpdateWHStep4.UPDATE_TYPE_W, wDirectory, outputDir);
        FileSystem.get(mainJob).delete(new Path(workingDirectory));
        System.out.println("done");
        System.out.print("calculating W = W .* X ./ Y... ");
        workingDirectory = UpdateWHStep5.runJob(numMappers, numReducers, replication,
                wDirectory, resultDirectoryX, resultDirectoryY, wFinalDir, k);
        System.out.println("done");
        FileSystem.get(mainJob).delete(new Path(resultDirectoryX));
        FileSystem.get(mainJob).delete(new Path(resultDirectoryY));
        System.out.print("storing back W... ");
        FileSystem.get(mainJob).delete(new Path(wDirectory));
        wDirectory = workingDirectory;
        System.out.println("done");
        long requiredTime = System.currentTimeMillis() - start;
        long requiredTimeMilliseconds = requiredTime % 1000;
        requiredTime -= requiredTimeMilliseconds;
        requiredTime /= 1000;
        long requiredTimeSeconds = requiredTime % 60;
        requiredTime -= requiredTimeSeconds;
        requiredTime /= 60;
        long requiredTimeMinutes = requiredTime % 60;
        requiredTime -= requiredTimeMinutes;
        requiredTime /= 60;
        long requiredTimeHours = requiredTime;
    }
}
package gnmf;

import gnmf.io.MatrixObject;
import gnmf.io.MatrixVector;
import gnmf.io.TaggedIndex;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class UpdateWHStep2 {
    static class UpdateWHStep2Mapper extends MapReduceBase
            implements Mapper<TaggedIndex, MatrixVector, TaggedIndex, MatrixVector> {
        @Override
        public void map(TaggedIndex key, MatrixVector value,
                OutputCollector<TaggedIndex, MatrixVector> out, Reporter reporter)
                throws IOException {
            out.collect(key, value);
        }
    }

    static class UpdateWHStep2Reducer extends MapReduceBase
            implements Reducer<TaggedIndex, MatrixVector, TaggedIndex, MatrixObject> {
        @Override
        public void reduce(TaggedIndex key, Iterator<MatrixVector> values,
                OutputCollector<TaggedIndex, MatrixObject> out, Reporter reporter)
                throws IOException {
            MatrixVector result = null;
            while (values.hasNext()) {
                MatrixVector current = values.next();
                if (result == null) {
                    result = current.getCopy();
                } else {
                    result.addVector(current);
                }
            }
            if (result != null) {
                out.collect(new TaggedIndex(key.getIndex(), TaggedIndex.TYPE_VECTOR_X),
                        new MatrixObject(result));
            }
        }
    }

    public static String runJob(int numMappers, int numReducers, int replication,
            String inputDir, String outputDir) throws IOException {
        String workingDirectory = outputDir + System.currentTimeMillis() + "-UpdateWHStep2/";
        JobConf job = new JobConf(UpdateWHStep2.class);
        job.setJobName("MatrixGNMFUpdateWHStep2");
        job.setInputFormat(SequenceFileInputFormat.class);
        FileInputFormat.setInputPaths(job, new Path(inputDir));
        job.setOutputFormat(SequenceFileOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(workingDirectory));
        job.setNumMapTasks(numMappers);
        job.setMapperClass(UpdateWHStep2Mapper.class);
        job.setMapOutputKeyClass(TaggedIndex.class);
        job.setMapOutputValueClass(MatrixVector.class);
        job.setNumReduceTasks(numReducers);
        job.setReducerClass(UpdateWHStep2Reducer.class);
        job.setOutputKeyClass(TaggedIndex.class);
        job.setOutputValueClass(MatrixObject.class);
        JobClient.runJob(job);
        return workingDirectory;
    }
}
package gnmf;

import gnmf.io.MatrixCell;
import gnmf.io.MatrixFormats;
import gnmf.io.MatrixObject;
import gnmf.io.MatrixVector;
import gnmf.io.TaggedIndex;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class UpdateWHStep1 {
    public static final int UPDATE_TYPE_H = 0;
    public static final int UPDATE_TYPE_W = 1;

    static class UpdateWHStep1Mapper extends MapReduceBase
            implements Mapper<TaggedIndex, MatrixObject, TaggedIndex, MatrixObject> {
        private int updateType;

        @Override
        public void map(TaggedIndex key, MatrixObject value,
                OutputCollector<TaggedIndex, MatrixObject> out, Reporter reporter)
                throws IOException {
            if (updateType == UPDATE_TYPE_W && key.getType() == TaggedIndex.TYPE_CELL) {
                MatrixCell current = (MatrixCell) value.getObject();
                out.collect(new TaggedIndex(current.getColumn(), TaggedIndex.TYPE_CELL),
                        new MatrixObject(new MatrixCell(key.getIndex(), current.getValue())));
            } else {
                out.collect(key, value);
            }
        }

        @Override
        public void configure(JobConf job) {
            updateType = job.getInt("gnmf.updateType", 0);
        }
    }

    static class UpdateWHStep1Reducer extends MapReduceBase
            implements Reducer<TaggedIndex, MatrixObject, TaggedIndex, MatrixVector> {
        private double[] baseVector = null;
        private int vectorSizeK;

        @Override
        public void reduce(TaggedIndex key, Iterator<MatrixObject> values,
                OutputCollector<TaggedIndex, MatrixVector> out, Reporter reporter)
                throws IOException {
            if (key.getType() == TaggedIndex.TYPE_VECTOR) {
                if (!values.hasNext())
                    throw new RuntimeException("expected vector");
                MatrixFormats current = values.next().getObject();
                if (!(current instanceof MatrixVector))
                    throw new RuntimeException("expected vector");
                baseVector = ((MatrixVector) current).getValues();
            } else {
                while (values.hasNext()) {
                    MatrixCell current = (MatrixCell) values.next().getObject();
                    if (baseVector == null) {
                        out.collect(new TaggedIndex(current.getColumn(), TaggedIndex.TYPE_VECTOR),
                                new MatrixVector(vectorSizeK));
                    } else {
                        if (baseVector.length == 0)
                            throw new RuntimeException("base vector is corrupted");
                        MatrixVector resultingVector = new MatrixVector(baseVector);
                        resultingVector.multiplyWithScalar(current.getValue());
                        if (resultingVector.getValues().length == 0)
                            throw new RuntimeException("multiplying with scalar failed");
                        out.collect(new TaggedIndex(current.getColumn(), TaggedIndex.TYPE_VECTOR),
                                resultingVector);
                    }
                }
                baseVector = null;
            }
        }

        @Override
        public void configure(JobConf job) {
            vectorSizeK = job.getInt("dml.matrix.gnmf.k", 0);
            if (vectorSizeK == 0)
                throw new RuntimeException("invalid k specified");
        }
    }

    public static String runJob(int numMappers, int numReducers, int replication,
            int updateType, String matrixInputDir, String whInputDir, String outputDir, int k)
            throws IOException {
        …
Java implementation of non-negative matrix factorization for Hadoop: >1,500 lines of code (excerpted above).
The same algorithm in SystemML: 10 lines of code in R syntax, or 10 lines in Python syntax.
Across multiple algorithms, this is a factor of 7–10 advantage in man-months.
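The 10-line scripts themselves are not reproduced on the slide. As an illustration of how compact the algorithm is at this level of abstraction (a numpy sketch, not SystemML's actual DML), the multiplicative updates the Java code above implements, H = H .* (WᵀV) ./ (WᵀWH) and W = W .* (VHᵀ) ./ (WHHᵀ), fit in a few lines:

```python
import numpy as np

def gnmf(V, k, iters=100, eps=1e-9, seed=0):
    """Multiplicative-update NMF: V (m x n) ~= W (m x k) @ H (k x n)."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # H = H .* (W'V) ./ (W'WH)
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # W = W .* (VH') ./ (WHH')
    return W, H

# reconstruction error shrinks as the updates proceed
V = np.random.default_rng(1).random((20, 15))
W, H = gnmf(V, k=5)
print(np.linalg.norm(V - W @ H))
```

The `eps` guard against division by zero is an implementation convenience of this sketch, not something the slide specifies.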
Scalability and Performance – GNMF Example
Depending on data size, the compiled plan ranges from all operations executing on a single machine (0 MR jobs), to hybrid execution with the majority of operations on a single machine (4 MR jobs), to hybrid execution with the majority of operations in map-reduce (6 MR jobs).
What does the “How” do?
For the original data (X: 300M × 500, y: 300M × 1), the optimizer compiles an execution plan with a combined X'X and X'y job followed by a solve.
When the data characteristics change – X with 2 times more rows (600M × 500) or 3 times more columns (300M × 1500) – the optimizer recompiles the plan, splitting the aggregates across jobs (X'y job 1, X'y job 2, X'X job, solve; or X'X job 1, X'X job 2, X'y job, solve).
When the cluster configuration changes (map-task JVM from 2.5 to 7 GB, in-memory master JVM), the optimizer again picks a different plan – 3X faster!
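Whatever physical plan the optimizer picks, the logical computation is the same normal-equations regression. A numpy sketch of the aggregates the jobs produce and the final solve (all names and sizes illustrative, shrunk from the 300M × 500 problem):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1000, 5))      # stand-in for the 300M x 500 feature matrix
true_b = np.arange(1.0, 6.0)
y = X @ true_b                 # noiseless response, so the solve recovers true_b

# the two aggregates the compiled jobs compute, then the final solve step
XtX = X.T @ X                  # X'X job(s)
Xty = X.T @ y                  # X'y job(s)
b = np.linalg.solve(XtX, Xty)  # solve

print(np.allclose(b, true_b))  # True
```

Because X'X is only 500 × 500 (here 5 × 5), the expensive part is the distributed aggregation; the solve itself is cheap and runs on a single node.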
Compilation Chain Overview with Example
Example expression: Q = y * (X %*% (b + sb))
The parse tree combines b + sb, then X %*% (b + sb), then y * (X %*% (b + sb)). It is compiled into a HOPs DAG, then a LOPs DAG, and finally into runtime instructions:
CP: b+sb _mvar1
MR-Job: [map=X%*%_mvar1 _mvar2]
CP: y*_mvar2 _mvar3
If dimensions are unknown at compile time, validation will pass through and additional checks will be made at run time.
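In numpy terms (illustrative shapes; %*% is matrix multiply, * is elementwise), the instruction sequence evaluates the expression step by step through the intermediate variables:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((4, 3))
b = rng.random((3, 1))
sb = rng.random((3, 1))
y = rng.random((4, 1))

# mirror the compiled instruction sequence
_mvar1 = b + sb          # CP: b+sb -> _mvar1 (in-memory scalar/vector op)
_mvar2 = X @ _mvar1      # MR-Job: X %*% _mvar1 -> _mvar2 (map-side multiply)
_mvar3 = y * _mvar2      # CP: y*_mvar2 -> _mvar3 (elementwise)

Q = y * (X @ (b + sb))   # the original expression, evaluated directly
print(np.allclose(Q, _mvar3))  # True
```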
Some Performance Numbers for Spark / Hadoop

In-Memory Data Set (160 GB) – data fits in aggregated memory: SystemML optimizations give ~10X over Hadoop.

ML Program   MR Backend (all ML optims)   Spark Backend (all ML optims)   Spark Backend (limited ML optims)
LinregDS     479s                         342s                            456s
LinregCG     954s                         188s                            243s
L2SVM        1,517s                       237s                            531s
GLM          1,989s                       205s                            318s

Large-Scale Data Set (1.6 TB) – data larger than aggregated memory: SystemML optimizations give ~2X.

ML Program   MR Backend (all ML optims)   Spark Backend (all ML optims)
LinregDS     5,429s                       6,779s
LinregCG     12,469s                      10,014s
L2SVM        24,360s                      12,795s
GLM          32,521s                      17,301s