Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation


DESCRIPTION

Slides from Joseph Rickert's presentation at Strata NYC 2013 "Using R and Hadoop for Statistical Computation at Scale" http://strataconf.com/stratany2013/public/schedule/detail/30632

TRANSCRIPT

Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation

Joseph Rickert, Revolution Analytics

Strata and Hadoop World 2013


Model Building with RevoScaleR
Agenda:
– The three realms of data
– What is RevoScaleR?
– RevoScaleR working beside Hadoop
– RevoScaleR running within Hadoop
– Run some code

The 3 Realms of Data

Bridging the gaps between architectures


The 3 Realms of Data

[Figure: architectural complexity grows with the number of rows]
– Data in memory: roughly up to 10^6 rows
– Data in a file, the realm of "chunking": roughly up to 10^11 rows
– Data in multiple files, the realm of massive data: more than 10^12 rows

RevoScaleR


RevoScaleR: an R package that ships exclusively with Revolution R Enterprise

Implements Parallel External Memory Algorithms (PEMAs)

Provides functions to:

– Import, Clean, Explore and Transform Data

– Perform statistical analysis and predictive analytics

– Enable distributed computing

Scales from small local data to huge distributed data

The same code works on small and big data, and on workstation, server, cluster, Hadoop

[Diagram: the Revolution R Enterprise stack: DevelopR, ConnectR, RevoScaleR, DistributedR, DeployR, RevoR, and R+CRAN]

Parallel External Memory Algorithms (PEMAs)
Built on a platform (DistributedR) that efficiently parallelizes a broad class of statistical, data mining, and machine learning algorithms.

Process data a chunk at a time in parallel across cores and nodes (a sketch of the pattern follows the steps below):

1. Initialize

2. Process Chunk

3. Aggregate

4. Finalize
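A minimal sketch of the PEMA pattern in plain R, using a chunk-wise mean as the statistic. The data, chunk size, and sequential loop are purely illustrative; in RevoScaleR the chunks are processed in parallel across cores and nodes.

# Illustrative only: a chunk-wise mean following the four PEMA steps.

# 1. Initialize: set up the intermediate result object
intermediate <- list(sum = 0, n = 0)

# Hypothetical data and chunking scheme
x <- rnorm(1e6)
chunk_size <- 1e5
chunk_starts <- seq(1, length(x), by = chunk_size)

# 2. Process chunk: compute a partial result for each chunk
partials <- lapply(chunk_starts, function(s) {
  chunk <- x[s:min(s + chunk_size - 1, length(x))]
  list(sum = sum(chunk), n = length(chunk))
})

# 3. Aggregate: combine the partial results
for (p in partials) {
  intermediate$sum <- intermediate$sum + p$sum
  intermediate$n   <- intermediate$n + p$n
}

# 4. Finalize: turn the aggregate into the final statistic
chunked_mean <- intermediate$sum / intermediate$n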



RevoScaleR PEMAs

– Statistical modeling: covariance, correlation, sum of squares; multiple linear regression; generalized linear models (all exponential-family distributions, the Tweedie distribution, standard link functions, and user-defined distributions and link functions)
– Predictive models: predictions/scoring for models; residuals for all models
– Data visualization: histogram, line plot, Lorenz curve, ROC curves
– Machine learning / classification: classification and regression trees (decision trees), decision forests
– Cluster analysis: K-means
– Simulation: parallel random number generators for Monte Carlo
– Variable selection: stepwise regression, PCA

GLM comparison using in-memory data: glm() and ScaleR’s rxGlm()
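The comparison on the original slide is a benchmark chart; the code below is a minimal sketch of the two calls being compared, on a small simulated data frame (the data and variable names are hypothetical, not from the slides).

# Simulated in-memory data
set.seed(1)
df <- data.frame(x = rnorm(10000))
df$y <- rbinom(10000, 1, plogis(0.5 * df$x))

# Base R generalized linear model
fit_glm <- glm(y ~ x, family = binomial(), data = df)

# RevoScaleR equivalent on the same in-memory data frame
library(RevoScaleR)
fit_rx <- rxGlm(y ~ x, family = binomial(), data = df)

With in-memory data the two calls fit the same model; the difference is that rxGlm uses the chunked PEMA implementation and can also run unchanged against file-based or distributed data.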


PEMAs: Optimized for Performance
– Arbitrarily large number of rows in a fixed amount of memory
– Scales linearly with the number of rows and with the number of nodes
– Scales well with the number of cores per node and with the number of parameters
– Efficient computational algorithms
– Memory management: minimizes copying
– File format: fast access by row and column
– Heavy use of C++
– Models pre-analyzed to detect and remove duplicate computations and points of failure (singularities)
– Handles categorical variables efficiently


Write Once. Deploy Anywhere.

DESIGNED FOR SCALE, PORTABILITY & PERFORMANCE

– In the cloud: Microsoft Azure Burst, Amazon AWS
– Workstations & servers: Desktop, Server, Linux
– Clustered systems: Platform LSF, Microsoft HPC
– EDW: IBM, Teradata
– Hadoop: Hortonworks, Cloudera

RRE in Hadoop


beside or inside?

Revolution R Enterprise: BESIDE Architecture

Use Hadoop for data storage and data preparation

Use RevoScaleR on a connected server for predictive modeling (sketched below)

Use Hadoop for model deployment
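A minimal sketch of the BESIDE pattern just described, assuming a CSV extract prepared by Hadoop (the file names, path, and variables below are hypothetical):

# BESIDE pattern: Hadoop prepares the data, a connected server
# running RevoScaleR builds the model.
library(RevoScaleR)

# Hypothetical CSV extract produced in Hadoop and copied to the server
csvSource <- RxTextData("airline_extract.csv")

# Import once into the XDF binary format for fast chunked access
airXdf <- rxImport(inData = csvSource, outFile = "airline_extract.xdf", overwrite = TRUE)

# Fit the predictive model locally on the connected server
fit <- rxLogit(ArrDelay > 15 ~ DayOfWeek + UniqueCarrier, data = airXdf)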


A Simple Goal: Hadoop as an R Engine
– Run Revolution R Enterprise code in Hadoop without change
– Provide RevoScaleR pre-parallelized algorithms
– Eliminate:
  – The need to "think in MapReduce"
  – Data movement

Revolution R Enterprise: INSIDE Architecture

Use RevoScaleR inside Hadoop (see the sketch after this list) for:

• Data preparation

• Model building

• Custom small-data parallel programming

• Model deployment

• Late 2013: Big-data predictive models with ScaleR
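A minimal sketch of the data-preparation step inside Hadoop. The compute context, HDFS paths, and derived variable are illustrative assumptions, not from the slides.

# Data preparation with RevoScaleR running inside Hadoop
library(RevoScaleR)

# Hadoop MapReduce compute context; cluster connection details
# (name node, ports, ssh settings) would normally be supplied here
rxSetComputeContext(RxHadoopMR())

hdfs      <- RxHdfsFileSystem()
rawData   <- RxTextData("/user/demo/airline/airline.csv", fileSystem = hdfs)
cleanData <- RxXdfData("/user/demo/airline/airlineClean", fileSystem = hdfs)

# Derive a new variable and drop rows with missing delays
rxDataStep(inData = rawData, outFile = cleanData,
           transforms = list(Late = ArrDelay > 15),
           rowSelection = !is.na(ArrDelay),
           overwrite = TRUE)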

[Diagram: Hadoop cluster architecture: a Name Node and Job Tracker; Data Nodes each running a Task Tracker; the MapReduce layer over HDFS]



RevoScaleR on Hadoop
Each pass through the data is one MapReduce job.
Prediction (scoring), transformation, simulation:
– Map tasks store results in HDFS or return them to the client
Statistics, model building, visualization:
– Map tasks produce "intermediate result objects" that are aggregated by a Reduce task
– The master process decides whether another pass through the data is required
Data can be cached or stored in the XDF binary format for increased speed, especially for iterative algorithms (a sketch follows).
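A minimal sketch of the XDF caching idea: convert the raw text on HDFS to XDF once, then let an iterative algorithm read the binary format on every pass. The paths, variables, and no-argument Hadoop compute context are illustrative assumptions.

# Store the data once in XDF on HDFS, then fit an iterative model
library(RevoScaleR)
rxSetComputeContext(RxHadoopMR())

hdfs  <- RxHdfsFileSystem()
csvIn <- RxTextData("/user/demo/airline/airline.csv", fileSystem = hdfs)
xdfIn <- RxXdfData("/user/demo/airline/airlineXdf", fileSystem = hdfs)

# One-time conversion; later passes read the faster binary XDF format
rxImport(inData = csvIn, outFile = xdfIn, overwrite = TRUE)

# Each iteration of rxLogit becomes one MapReduce job over the XDF data
fit <- rxLogit(ArrDelay > 15 ~ DayOfWeek + UniqueCarrier, data = xdfIn)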

Let’s run some code.

Backup slides

Sample code: logit on workstation

# Specify local data source

airData <- myLocalDataSource

# Specify model formula and parameters

rxLogit(ArrDelay > 15 ~ Origin + Year + Month + DayOfWeek + UniqueCarrier + F(CRSDepTime), data = airData)
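The slide leaves myLocalDataSource unspecified; one plausible definition, assuming an XDF file of airline data on the workstation (the path and file name are hypothetical), would be:

# Hypothetical local data source: an XDF file on the workstation
library(RevoScaleR)
myLocalDataSource <- RxXdfData("C:/data/airline.xdf")

In the formula above, F(CRSDepTime) tells RevoScaleR to treat the scheduled departure time as a factor on the fly.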


Sample code: logit on Hadoop

# Change the "compute context"
rxSetComputeContext(myHadoopCluster)

# Change the data source if necessary

airData <- myHadoopDataSource

# Otherwise, the code is the same

rxLogit(ArrDelay > 15 ~ Origin + Year + Month + DayOfWeek + UniqueCarrier + F(CRSDepTime), data = airData)
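As before, myHadoopCluster and myHadoopDataSource are left unspecified on the slide; plausible definitions, assuming RevoScaleR's Hadoop MapReduce compute context and an XDF data set on HDFS (connection details and paths are hypothetical), would be:

# Hypothetical Hadoop compute context and HDFS data source
library(RevoScaleR)
# Cluster connection details (name node, ports, ssh settings) would
# normally be supplied to RxHadoopMR()
myHadoopCluster    <- RxHadoopMR()
myHadoopDataSource <- RxXdfData("/user/demo/airline/airlineXdf",
                                fileSystem = RxHdfsFileSystem())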


Demo rxLinMod in Hadoop - Launching


Demo rxLinMod in Hadoop - In Progress


Demo rxLinMod in Hadoop - Completed

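The three demo slides above are screenshots that do not survive in the transcript; a minimal sketch of the kind of call being demonstrated, reusing the hypothetical Hadoop compute context and data source defined earlier:

# Sketch of the demo: a linear model fit inside Hadoop
rxSetComputeContext(myHadoopCluster)
fit <- rxLinMod(ArrDelay ~ DayOfWeek + UniqueCarrier, data = myHadoopDataSource)
summary(fit)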
