An Introduction to Revolution Analytics


Analytical Capability

Compute

Data Scale

Users

Price

Ease

Security

Memory Limits

In-Memory vs. Shared Infrastructure

CRAN vs. Parallelization

Desktop vs. Remote

Explicit vs. Automatic Distribution

Locality vs. Movement

Real-Time vs. MapReduce

Traditional Statistics vs. Machine Learning

No Magic Bullet.

Our Vision:

R becomes the de facto standard for enterprise predictive analytics

Our Mission:

Drive enterprise adoption of R by providing enhanced R products tailored to meet enterprise challenges

• Open Source

• Commercial

Traditional Open Source R “Beside” Architecture:

CRAN

Algorithms

rHDFS

rHbase

rHive

rODBC

Replace Open Source R “Beside” Architecture with Revolution R Open

As with Open Source R:

• Still Free.

• Still Memory Based.

• Data Still Moves.

Improvements:

• Accelerates Math with Intel MKL

• Improves R-based packages

Limitations:

• No Effect for non-R Code

CRAN

Algorithms

rHDFS

rHbase

rHive

rODBC

Source: http://blog.revolutionanalytics.com/2014/10/revolution-r-open-mkl.html

Write R Code to Explicitly Parallelize – Deploy Across Several Systems

Can Include CRAN Algorithms “Carefully”

foreach & iterators

• doParallel (PC, server)

• doMPI (cluster)

• RRE rxExec

Example Uses:

• Bootstrapping

• Simulation

• HPC

rHDFS

rHbase

rHive

rODBC

As with Previous:

• Still Free.

• Still Memory Based.

• Data Still Moves.

• Intel MKL with RRO

Improvements:

• Parallelized Execution

Limitations:

• Parallelization Difficulty

• Data Movement

• Platform Specific

Remote

Desktop

R Code
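The bootstrapping use case above splits into independent tasks that are handed to workers explicitly and then collected. A minimal sketch of that pattern, written in Python rather than with foreach; the function names, sample data, and the use of a thread pool as a stand-in for real workers are all assumptions made for illustration:

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor


def bootstrap_mean(data, seed):
    """One bootstrap replicate: resample with replacement, return the mean."""
    rng = random.Random(seed)  # per-task seed keeps replicates reproducible
    resample = [rng.choice(data) for _ in data]
    return statistics.fmean(resample)


def parallel_bootstrap(data, replicates=200, workers=4):
    """Explicitly hand the replicates to a pool of workers, then collect them."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each task is independent of the others, so the work splits cleanly.
        return list(pool.map(lambda s: bootstrap_mean(data, s), range(replicates)))


sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 4.9]  # hypothetical data
means = sorted(parallel_bootstrap(sample))
# Rough 95% percentile interval taken from the sorted replicate means.
low, high = means[int(0.025 * len(means))], means[int(0.975 * len(means)) - 1]
print(f"bootstrap interval for the mean: [{low:.2f}, {high:.2f}]")
```

For CPU-bound work a process pool or a cluster backend would replace the thread pool; the need to pick and configure that backend is where the "Platform Specific" limitation shows up.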

Execute R Code & CRAN Algorithms Inside Hadoop

Example Uses:

• Scoring

• Transformation

• Easily Parallelized Algorithms

Hadoop Streaming

Can Include CRAN Algorithms “Carefully”

As With Previous:

• Still Free.

• Optional Intel MKL in RRO

Improvements:

• Runs R in MapReduce

• No Data Movement

Limitations:

• Manual Parallelization

• Hadoop Specific

rHbase

rHDFS

rMapReduce
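The scoring use case maps naturally onto a map step and a reduce step, and writing those two steps by hand is the "Manual Parallelization" cost. A minimal sketch in Python, with the shuffle/sort simulated locally; the record layout, the coefficients, and the function names are hypothetical:

```python
from itertools import groupby

# Hypothetical model coefficients; in practice these come from a fitted model.
INTERCEPT, SLOPE = 0.5, 2.0


def mapper(lines):
    """Map step: score each 'segment,x' record and emit (segment, score)."""
    for line in lines:
        segment, x = line.strip().split(",")
        yield segment, INTERCEPT + SLOPE * float(x)


def reducer(pairs):
    """Reduce step: average the scores per segment (input sorted by key)."""
    for segment, group in groupby(pairs, key=lambda kv: kv[0]):
        scores = [score for _, score in group]
        yield segment, sum(scores) / len(scores)


# Local stand-in for the shuffle/sort that Hadoop performs between map and reduce.
records = ["a,1.0", "b,2.0", "a,3.0", "b,4.0"]
result = dict(reducer(sorted(mapper(records))))
print(result)
```

On a cluster, each mapper would run next to the data block it reads, which is what removes the data movement.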

Traditional “Beside” Architecture with Optimized Algorithms Available for Windows, Linux

As With Previous:

• Includes Intel MKL in RRO

Advantages:

• Speed: PEMAs (Parallel External Memory Algorithms) Parallelize Across Threads, Cores & Sockets

• Scale: PEMAs “Chunk” Data (No Memory Limits)

• All of CRAN Available

• Portability

• Fully Supported

Limitations:

• Data Movement

• Single Machine

Revolution R Enterprise:

• ScaleR PEMA Algorithms

plus

• All of CRAN (subject to memory limits)

rHDFS

rHbase

rHive

rODBC

Revolution R Enterprise is the only big data, big analytics platform based on open source R

Data Step
• Data import: Delimited, Fixed, SAS, SPSS, ODBC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort, Merge, Split
• Aggregate by category (means, sums)

Descriptive Statistics
• Min / Max, Mean, Median (approx.)
• Quantiles (approx.)
• Standard Deviation
• Variance
• Correlation
• Covariance
• Sum of Squares (cross product matrix for set variables)
• Pairwise Cross tabs
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross Tabulations

Statistical Tests
• Chi Square Test
• Kendall Rank Correlation
• Fisher’s Exact Test
• Student’s t-Test

Sampling
• Subsample (observations & variables)
• Random Sampling

Predictive Models
• Sum of Squares (cross product matrix for set variables)
• Multiple Linear Regression
• Generalized Linear Models (GLM): exponential family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie); standard link functions (cauchit, identity, log, logit, probit); user-defined distributions & link functions
• Covariance & Correlation Matrices
• Logistic Regression
• Classification & Regression Trees
• Predictions/scoring for models
• Residuals for all models

Cluster Analysis
• K-Means

Classification
• Decision Trees
• Decision Forests
• Gradient Boosted Decision Trees

Simulation
• Simulation (e.g. Monte Carlo)
• Parallel Random Number Generation

Variable Selection
• Stepwise Regression

Combination

New in 7.3

PEMA-R API

rxDataStep

rxExec

ScaleR PEMA

Master

Algorithm

Process

Data

Analyze Each Block

• Not Limited to Available Memory

• Unlimited Data Scale

• Ingests Data One Chunk At A Time.

• Adjustable Memory Footprint

• Multi-Thread Execution

Performance

• Highly-Optimized Algorithms

• Algorithm Math Fully Refactored for Parallelism

• Delivered as ScaleR Library in Revolution R Enterprise

Load Block At A Time

Combine Individual Results

Script Calls ScaleR Algorithm

Scripts can call CRAN Open Source Algorithms

Start & Manage Processing
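The load, analyze, and combine flow described on this slide can be illustrated with a small external-memory style computation. This is a sketch of the general idea only, not the ScaleR implementation: each block is reduced to a few sufficient statistics, and those statistics add across blocks. The function names are hypothetical, and an in-memory list stands in for a file read block by block:

```python
def read_chunks(values, chunk_size):
    """Stand-in for a block-wise file reader: yields one chunk at a time."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]


def analyze_chunk(chunk):
    """Per-block step: reduce a chunk to small sufficient statistics."""
    return len(chunk), sum(chunk), sum(v * v for v in chunk)


def combine(a, b):
    """Combine step: sufficient statistics from two blocks simply add."""
    return tuple(x + y for x, y in zip(a, b))


def chunked_mean_variance(values, chunk_size=3):
    total = (0, 0.0, 0.0)
    for chunk in read_chunks(values, chunk_size):  # one chunk processed at a time
        total = combine(total, analyze_chunk(chunk))
    n, s, ss = total
    mean = s / n
    variance = (ss - n * mean * mean) / (n - 1)  # sample variance
    return mean, variance


data = [4.0, 7.0, 1.0, 9.0, 3.0, 6.0, 5.0, 2.0]  # hypothetical data
mean, variance = chunked_mean_variance(data)
print(mean, variance)
```

Because the per-block results are independent, the analyze step could also run on several threads or nodes at once, with only the small statistics tuples travelling to the master process. The chunk size is the knob behind the "Adjustable Memory Footprint" bullet.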

rHDFS

rHbase

rHive

rODBC

Local File System (opt.)

ScaleR + CRAN Algorithms

Fast Single-Server Alternative for Modest Data Scale

Edge Node

Thin Client or Remote Desktop

As With Previous:

• Single Machine Execution

• PEMA Scale & Speed (Single Machine)

• Use ScaleR + CRAN

• Accelerate R with Intel MKL

Improvements:

• Easily Shared via

• No Data Movement

• Develop on Desktop, Run on Edge Node

Limitations:

• “Shorter Trip” for Data

jobtracker

ScaleR

Algorithms

DeployR

Fast Parallelized Analytics on Large Data Sets In Hadoop

As With Previous:

• Speed and Scale of ScaleR PEMA Algorithms

• Use CRAN Where Appropriate

• Accelerate R Math with MKL

• Custom Parallelized Algorithms

Advantages:

• Parallel Computation

• No Data Movement

• ScaleR PEMA Parallelization

• Can Parallelize CRAN “Carefully”

• Portable Coding

Limitations:

• Hadoop Workload Profiles

Web Services

Remote Execution

Desktop & Server Tools and Applications

Test Cluster - 9 Nodes

Task                                                              Processing Time

Importing and Filtering Datasets from HDFS
  14 Million Observations                                         82 sec.
  227 Million Observations                                        310 sec.

Modeling and Estimation
  1.2 M Correlations                                              2771 sec.
  Simple Linear Regression, 227 M Observations                    61 sec.
  Multiple Linear Regression, Three Variables, 227 M Observations 58 sec.
  Multiple Linear Regression, Four Variables, 227 M Observations  58 sec.
  Random Forest, 10 Predictor Variables, 227 M Observations,
    10 Trees with Max Depth of 10 Splits                          2 hr. 3 min.

64GB, 24 cores each

9 Task Nodes

2 Admin Nodes

1 Edge Node

128GB, 24 cores each

128GB, 24 cores each
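The timings in the table can be turned into rough throughput figures for comparison across tasks. The short calculation below uses only numbers from the table; the observation counts are rounded on the slide, so the rates are approximate, and the task labels are shorthand introduced here:

```python
# Timings copied from the benchmark table: (task, observations, seconds).
timings = [
    ("import/filter, 14 M obs", 14_000_000, 82),
    ("import/filter, 227 M obs", 227_000_000, 310),
    ("simple linear regression, 227 M obs", 227_000_000, 61),
    ("multiple linear regression, 227 M obs", 227_000_000, 58),
]

# Observations processed per second, for the whole cluster.
throughput = {task: obs / secs for task, obs, secs in timings}
for task, rate in throughput.items():
    print(f"{task}: {rate:,.0f} observations/sec")
```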

ScaleR

Algorithms

DeployR

Maximized Flexibility, Performance & Workload Handling

As With Previous:

• Speed and Scale of ScaleR PEMA Algorithms

• Use CRAN Where Appropriate

• Accelerate R Math with MKL

• Custom Parallelized Algorithms

Advantages:

• Flexibility for Blended Workloads

• Little or No Data Movement

• Maximize CRAN Capabilities by Sharing Large RAM Edge Nodes

Web Services

Thin Client Development

Remote Execution

Desktop & Server Tools and Applications

RStudio

• Where are the bulk of your skills? SAS? R? Java? Python? SQL?

• Where do you build models today?

• Do you have the skills to parallelize algorithms?

• Can models be built on a big shared server?

• How will you run models?

• Do you have the budget to purchase commercial solutions?

• How will your needs change over time?

• What is your future architecture plan?

• How risk averse is your management team regarding new platforms and open source?

• Revolution Analytics Products: http://www.revolutionanalytics.com/products

http://www.revolutionanalytics.com/big-analytics-hadoop-and-edws

• Whitepaper: “Delivering Value from Big Data with Revolution R Enterprise and Hadoop”

http://www.revolutionanalytics.com/whitepaper/delivering-value-big-data-revolution-r-enterprise-and-hadoop

• Revolution Analytics on Social Media: http://blog.revolutionanalytics.com/

Twitter

Thank you.

www.revolutionanalytics.com

1.855.GET.REVO

Twitter: @RevolutionR