Declarative Machine Learning: Bring Your Own Algorithm, Data, Syntax and Infrastructure
TRANSCRIPT
© 2015 IBM Corporation
Shivakumar Vaithyanathan, IBM Fellow
Watson & IBM Research
© 2012 IBM Corporation
Credit Risk Scoring Application at a Large Financial Institution
To execute on one machine with a hypothetical statistical package/engine, 3.6 TB of RAM would be required (an underestimate); even the reduced feature set needs 1.2 TB (also an underestimate).
In practice still more RAM is required, since outputs and intermediates must be stored along with the input.
Prototypical of problems in other industries ranging from automotive to insurance to transportation
Credit Risk Scoring
– Features: payment history, amount owed, length of credit history, new credit, types of credit used, …
– Problem size: 300 million rows, 1,500 features (reduced set: 500 features)
– Data size on disk: 3.6 TB uncompressed (even for the reduced set: 1.2 TB)
– Algorithm of interest: regression
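The quoted memory footprints follow directly from dense storage of the matrix; a quick back-of-the-envelope check (assuming 8 bytes per value, i.e. double precision, which the slide does not state explicitly):

```python
rows = 300_000_000
bytes_per_value = 8  # assumption: dense double-precision storage

full_tb = rows * 1500 * bytes_per_value / 1e12     # all 1,500 features
reduced_tb = rows * 500 * bytes_per_value / 1e12   # reduced set: 500 features

print(full_tb, reduced_tb)  # 3.6 and 1.2, matching the slide's TB figures
```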
Big Data Analytics Use Cases

Insurance (large number of data points and attributes, dense)
– Consumer risk modeling
– Consumer data with ~300M rows and ~500 attributes

DaaS / Retail Finance (large number of models)
– Predict customer monetary loss
– Multi-million observations, 95 features; evaluate several hundred models for the optimal subset of features

Automotive (large number of features)
– Customer satisfaction
– Multi-million cars with few reacquired cars
– Feature expansion from ~250 to ~21,800
A Day in the Life of a Data Scientist …
Working from a data sample and its characteristics drawn from the original data, the data scientist develops a new algorithm or modifies an existing one (Bayesian networks, neural networks, random forests, support vector machines, …), typically expressed in a custom syntax.
Bottleneck: Moving the Algorithm onto Big Data Infrastructure
Before the algorithm can run at scale, the data scientist must hand it off to Hadoop, Spark, or MPI programmers for reimplementation.
What If …
What if a compiler and optimizer stood between the data scientist and the infrastructure, so that the Hadoop, Spark, and MPI programmers are no longer needed for each new algorithm?
Simplified view of what we want to build …
The What (language, tooling):
– High-level language
– Write any algorithm
The How (compiler, optimizer):
– Adapt to different data and program characteristics
– Support different backend architectures and configurations
SystemML: IBM Research Project, Soon to Be in Open Source
• IBM Research project started 6 years ago
• More than 10 papers in major conferences
• In beta for more than a year and used in multiple applications
The What:
• R-like and Python-like syntax, …
• Rich set of statistical functions
• User-defined and external functions
The How:
• Single-node and embeddable, plus Hadoop & Spark backends
• Dense/sparse matrix representations
• Library of more than 15 algorithms
An R parser and a Python parser feed a common compiler that lowers scripts through higher-level operators (HOPs) to lower-level operators (LOPs), targeting in-memory single-node and Hadoop/Spark execution. Writing the Python-syntax parser took less than 2 man-months.
How should the “What” work?
package gnmf;

import java.io.IOException;
import java.net.URISyntaxException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class MatrixGNMF {
    public static void main(String[] args) throws IOException, URISyntaxException {
        if (args.length < 10) {
            System.out.println("missing parameters");
            System.out.println("expected parameters: [directory of v] [directory of w] [directory of h] "
                    + "[k] [num mappers] [num reducers] [replication] [working directory] "
                    + "[final directory of w] [final directory of h]");
            System.exit(1);
        }
        String vDir = args[0];
        String wDir = args[1];
        String hDir = args[2];
        int k = Integer.parseInt(args[3]);
        int numMappers = Integer.parseInt(args[4]);
        int numReducers = Integer.parseInt(args[5]);
        int replication = Integer.parseInt(args[6]);
        String outputDir = args[7];
        String wFinalDir = args[8];
        String hFinalDir = args[9];
        JobConf mainJob = new JobConf(MatrixGNMF.class);
        String vDirectory;
        String wDirectory;
        String hDirectory;
        FileSystem.get(mainJob).delete(new Path(outputDir));
        vDirectory = vDir;
        hDirectory = hDir;
        wDirectory = wDir;
        String workingDirectory;
        String resultDirectoryX;
        String resultDirectoryY;
        long start = System.currentTimeMillis();
        System.gc();
        System.out.println("starting calculation");
        System.out.print("calculating X = WT * V... ");
        workingDirectory = UpdateWHStep1.runJob(numMappers, numReducers, replication,
                UpdateWHStep1.UPDATE_TYPE_H, vDirectory, wDirectory, outputDir, k);
        resultDirectoryX = UpdateWHStep2.runJob(numMappers, numReducers, replication,
                workingDirectory, outputDir);
        FileSystem.get(mainJob).delete(new Path(workingDirectory));
        System.out.println("done");
        System.out.print("calculating Y = WT * W * H... ");
        workingDirectory = UpdateWHStep3.runJob(numMappers, numReducers, replication,
                wDirectory, outputDir);
        resultDirectoryY = UpdateWHStep4.runJob(numMappers, replication, workingDirectory,
                UpdateWHStep4.UPDATE_TYPE_H, hDirectory, outputDir);
        FileSystem.get(mainJob).delete(new Path(workingDirectory));
        System.out.println("done");
        System.out.print("calculating H = H .* X ./ Y... ");
        workingDirectory = UpdateWHStep5.runJob(numMappers, numReducers, replication,
                hDirectory, resultDirectoryX, resultDirectoryY, hFinalDir, k);
        System.out.println("done");
        FileSystem.get(mainJob).delete(new Path(resultDirectoryX));
        FileSystem.get(mainJob).delete(new Path(resultDirectoryY));
        System.out.print("storing back H... ");
        FileSystem.get(mainJob).delete(new Path(hDirectory));
        hDirectory = workingDirectory;
        System.out.println("done");
        System.out.print("calculating X = V * HT... ");
        workingDirectory = UpdateWHStep1.runJob(numMappers, numReducers, replication,
                UpdateWHStep1.UPDATE_TYPE_W, vDirectory, hDirectory, outputDir, k);
        resultDirectoryX = UpdateWHStep2.runJob(numMappers, numReducers, replication,
                workingDirectory, outputDir);
        FileSystem.get(mainJob).delete(new Path(workingDirectory));
        System.out.println("done");
        System.out.print("calculating Y = W * H * HT... ");
        workingDirectory = UpdateWHStep3.runJob(numMappers, numReducers, replication,
                hDirectory, outputDir);
        resultDirectoryY = UpdateWHStep4.runJob(numMappers, replication, workingDirectory,
                UpdateWHStep4.UPDATE_TYPE_W, wDirectory, outputDir);
        FileSystem.get(mainJob).delete(new Path(workingDirectory));
        System.out.println("done");
        System.out.print("calculating W = W .* X ./ Y... ");
        workingDirectory = UpdateWHStep5.runJob(numMappers, numReducers, replication,
                wDirectory, resultDirectoryX, resultDirectoryY, wFinalDir, k);
        System.out.println("done");
        FileSystem.get(mainJob).delete(new Path(resultDirectoryX));
        FileSystem.get(mainJob).delete(new Path(resultDirectoryY));
        System.out.print("storing back W... ");
        FileSystem.get(mainJob).delete(new Path(wDirectory));
        wDirectory = workingDirectory;
        System.out.println("done");
        long requiredTime = System.currentTimeMillis() - start;
        long requiredTimeMilliseconds = requiredTime % 1000;
        requiredTime -= requiredTimeMilliseconds;
        requiredTime /= 1000;
        long requiredTimeSeconds = requiredTime % 60;
        requiredTime -= requiredTimeSeconds;
        requiredTime /= 60;
        long requiredTimeMinutes = requiredTime % 60;
        requiredTime -= requiredTimeMinutes;
        requiredTime /= 60;
        long requiredTimeHours = requiredTime;
    }
}
package gnmf;

import gnmf.io.MatrixObject;
import gnmf.io.MatrixVector;
import gnmf.io.TaggedIndex;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class UpdateWHStep2 {
    static class UpdateWHStep2Mapper extends MapReduceBase
            implements Mapper<TaggedIndex, MatrixVector, TaggedIndex, MatrixVector> {
        @Override
        public void map(TaggedIndex key, MatrixVector value,
                OutputCollector<TaggedIndex, MatrixVector> out, Reporter reporter)
                throws IOException {
            out.collect(key, value);
        }
    }

    static class UpdateWHStep2Reducer extends MapReduceBase
            implements Reducer<TaggedIndex, MatrixVector, TaggedIndex, MatrixObject> {
        @Override
        public void reduce(TaggedIndex key, Iterator<MatrixVector> values,
                OutputCollector<TaggedIndex, MatrixObject> out, Reporter reporter)
                throws IOException {
            MatrixVector result = null;
            while (values.hasNext()) {
                MatrixVector current = values.next();
                if (result == null) {
                    result = current.getCopy();
                } else {
                    result.addVector(current);
                }
            }
            if (result != null) {
                out.collect(new TaggedIndex(key.getIndex(), TaggedIndex.TYPE_VECTOR_X),
                        new MatrixObject(result));
            }
        }
    }

    public static String runJob(int numMappers, int numReducers, int replication,
            String inputDir, String outputDir) throws IOException {
        String workingDirectory = outputDir + System.currentTimeMillis() + "-UpdateWHStep2/";
        JobConf job = new JobConf(UpdateWHStep2.class);
        job.setJobName("MatrixGNMFUpdateWHStep2");
        job.setInputFormat(SequenceFileInputFormat.class);
        FileInputFormat.setInputPaths(job, new Path(inputDir));
        job.setOutputFormat(SequenceFileOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(workingDirectory));
        job.setNumMapTasks(numMappers);
        job.setMapperClass(UpdateWHStep2Mapper.class);
        job.setMapOutputKeyClass(TaggedIndex.class);
        job.setMapOutputValueClass(MatrixVector.class);
        job.setNumReduceTasks(numReducers);
        job.setReducerClass(UpdateWHStep2Reducer.class);
        job.setOutputKeyClass(TaggedIndex.class);
        job.setOutputValueClass(MatrixObject.class);
        JobClient.runJob(job);
        return workingDirectory;
    }
}
package gnmf;

import gnmf.io.MatrixCell;
import gnmf.io.MatrixFormats;
import gnmf.io.MatrixObject;
import gnmf.io.MatrixVector;
import gnmf.io.TaggedIndex;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class UpdateWHStep1 {
    public static final int UPDATE_TYPE_H = 0;
    public static final int UPDATE_TYPE_W = 1;

    static class UpdateWHStep1Mapper extends MapReduceBase
            implements Mapper<TaggedIndex, MatrixObject, TaggedIndex, MatrixObject> {
        private int updateType;

        @Override
        public void map(TaggedIndex key, MatrixObject value,
                OutputCollector<TaggedIndex, MatrixObject> out, Reporter reporter)
                throws IOException {
            if (updateType == UPDATE_TYPE_W && key.getType() == TaggedIndex.TYPE_CELL) {
                MatrixCell current = (MatrixCell) value.getObject();
                out.collect(new TaggedIndex(current.getColumn(), TaggedIndex.TYPE_CELL),
                        new MatrixObject(new MatrixCell(key.getIndex(), current.getValue())));
            } else {
                out.collect(key, value);
            }
        }

        @Override
        public void configure(JobConf job) {
            updateType = job.getInt("gnmf.updateType", 0);
        }
    }

    static class UpdateWHStep1Reducer extends MapReduceBase
            implements Reducer<TaggedIndex, MatrixObject, TaggedIndex, MatrixVector> {
        private double[] baseVector = null;
        private int vectorSizeK;

        @Override
        public void reduce(TaggedIndex key, Iterator<MatrixObject> values,
                OutputCollector<TaggedIndex, MatrixVector> out, Reporter reporter)
                throws IOException {
            if (key.getType() == TaggedIndex.TYPE_VECTOR) {
                if (!values.hasNext())
                    throw new RuntimeException("expected vector");
                MatrixFormats current = values.next().getObject();
                if (!(current instanceof MatrixVector))
                    throw new RuntimeException("expected vector");
                baseVector = ((MatrixVector) current).getValues();
            } else {
                while (values.hasNext()) {
                    MatrixCell current = (MatrixCell) values.next().getObject();
                    if (baseVector == null) {
                        out.collect(new TaggedIndex(current.getColumn(), TaggedIndex.TYPE_VECTOR),
                                new MatrixVector(vectorSizeK));
                    } else {
                        if (baseVector.length == 0)
                            throw new RuntimeException("base vector is corrupted");
                        MatrixVector resultingVector = new MatrixVector(baseVector);
                        resultingVector.multiplyWithScalar(current.getValue());
                        if (resultingVector.getValues().length == 0)
                            throw new RuntimeException("multiplying with scalar failed");
                        out.collect(new TaggedIndex(current.getColumn(), TaggedIndex.TYPE_VECTOR),
                                resultingVector);
                    }
                }
                baseVector = null;
            }
        }

        @Override
        public void configure(JobConf job) {
            vectorSizeK = job.getInt("dml.matrix.gnmf.k", 0);
            if (vectorSizeK == 0)
                throw new RuntimeException("invalid k specified");
        }
    }

    public static String runJob(int numMappers, int numReducers, int replication,
            int updateType, String matrixInputDir, String whInputDir, String outputDir, int k)
            throws IOException {
        …
Java implementation of non-negative matrix factorization for Hadoop: >1,500 lines of code (excerpted above).
The same algorithm in SystemML: 10 lines of code in R syntax, or 10 lines in Python syntax.
Across multiple algorithms, this is a factor of 7–10 advantage in man-months.
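The 10-line scripts themselves are not reproduced on the slide. As an illustration of how compact the algorithm is at this level of abstraction (a numpy sketch, not SystemML's actual DML), the multiplicative updates the Java code above implements, H = H .* (WᵀV) ./ (WᵀWH) and W = W .* (VHᵀ) ./ (WHHᵀ), fit in a few lines:

```python
import numpy as np

def gnmf(V, k, iters=100, eps=1e-9, seed=0):
    """Multiplicative-update NMF: V (m x n) ~= W (m x k) @ H (k x n)."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # H = H .* (W'V) ./ (W'WH)
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # W = W .* (VH') ./ (WHH')
    return W, H

# reconstruction error shrinks as the updates proceed
V = np.random.default_rng(1).random((20, 15))
W, H = gnmf(V, k=5)
print(np.linalg.norm(V - W @ H))
```

The `eps` guard against division by zero is an implementation convenience of this sketch, not something the slide specifies.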
Scalability and Performance – GNMF Example
Depending on data size, the compiled plan ranges from all operations executing on a single machine (0 MR jobs), to hybrid execution with the majority of operations on a single machine (4 MR jobs), to hybrid execution with the majority of operations in map-reduce (6 MR jobs).
What does the “How” do?
For the original data (X: 300M × 500, y: 300M × 1), the optimizer compiles an execution plan with a combined X'X and X'y job followed by a solve.
When the data characteristics change – X with 2 times more rows (600M × 500) or 3 times more columns (300M × 1500) – the optimizer recompiles the plan, splitting the aggregates across jobs (X'y job 1, X'y job 2, X'X job, solve; or X'X job 1, X'X job 2, X'y job, solve).
When the cluster configuration changes (map-task JVM from 2.5 to 7 GB, in-memory master JVM), the optimizer again picks a different plan – 3X faster!
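Whatever physical plan the optimizer picks, the logical computation is the same normal-equations regression. A numpy sketch of the aggregates the jobs produce and the final solve (all names and sizes illustrative, shrunk from the 300M × 500 problem):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1000, 5))      # stand-in for the 300M x 500 feature matrix
true_b = np.arange(1.0, 6.0)
y = X @ true_b                 # noiseless response, so the solve recovers true_b

# the two aggregates the compiled jobs compute, then the final solve step
XtX = X.T @ X                  # X'X job(s)
Xty = X.T @ y                  # X'y job(s)
b = np.linalg.solve(XtX, Xty)  # solve

print(np.allclose(b, true_b))  # True
```

Because X'X is only 500 × 500 (here 5 × 5), the expensive part is the distributed aggregation; the solve itself is cheap and runs on a single node.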
Compilation Chain Overview with Example
Example expression: Q = y * (X %*% (b + sb))
The parse tree combines b + sb, then X %*% (b + sb), then y * (X %*% (b + sb)). It is compiled into a HOPs DAG, then a LOPs DAG, and finally into runtime instructions:
CP: b+sb _mvar1
MR-Job: [map=X%*%_mvar1 _mvar2]
CP: y*_mvar2 _mvar3
If dimensions are unknown at compile time, validation will pass through and additional checks will be made at run time.
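In numpy terms (illustrative shapes; %*% is matrix multiply, * is elementwise), the instruction sequence evaluates the expression step by step through the intermediate variables:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((4, 3))
b = rng.random((3, 1))
sb = rng.random((3, 1))
y = rng.random((4, 1))

# mirror the compiled instruction sequence
_mvar1 = b + sb          # CP: b+sb -> _mvar1 (in-memory scalar/vector op)
_mvar2 = X @ _mvar1      # MR-Job: X %*% _mvar1 -> _mvar2 (map-side multiply)
_mvar3 = y * _mvar2      # CP: y*_mvar2 -> _mvar3 (elementwise)

Q = y * (X @ (b + sb))   # the original expression, evaluated directly
print(np.allclose(Q, _mvar3))  # True
```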
Some Performance Numbers for Spark / Hadoop

In-Memory Data Set (160 GB) – data fits in aggregated memory: SystemML optimizations give ~10X over Hadoop.

ML Program   MR Backend (all ML optims)   Spark Backend (all ML optims)   Spark Backend (limited ML optims)
LinregDS     479s                         342s                            456s
LinregCG     954s                         188s                            243s
L2SVM        1,517s                       237s                            531s
GLM          1,989s                       205s                            318s

Large-Scale Data Set (1.6 TB) – data larger than aggregated memory: SystemML optimizations give ~2X.

ML Program   MR Backend (all ML optims)   Spark Backend (all ML optims)
LinregDS     5,429s                       6,779s
LinregCG     12,469s                      10,014s
L2SVM        24,360s                      12,795s
GLM          32,521s                      17,301s