An Introduction to Revolution Analytics
Memory Limits
In-Memory vs. Shared Infrastructure
CRAN vs. Parallelization
Desktop vs. Remote
Explicit vs. Automatic Distribution
Locality vs. Movement
Real-Time vs. MapReduce
Traditional Statistics vs. Machine Learning
Our Vision:
R becomes the de facto standard for enterprise predictive analytics
Our Mission:
Drive enterprise adoption of R by providing enhanced R products tailored to meet enterprise challenges
Replace Open Source R “Beside” Architecture with Revolution R Open
As with Open Source R:
• Still Free.
• Still Memory Based.
• Data Still Moves.
Improvements:
• Accelerates Math with Intel MKL
• Improves R-based packages
Limitations:
• No Effect for non-R Code
CRAN Algorithms
rHDFS, rHbase, rHive
rODBC
Source: http://blog.revolutionanalytics.com/2014/10/revolution-r-open-mkl.html
Write R Code to Explicitly Parallelize – Deploy Across Several Systems
Can Include CRAN Algorithms “Carefully”
foreach & iterators
• doParallel (PC, server)
• doMPI (cluster)
• RRE rxExec
Example Uses:
• Bootstrapping
• Simulation
• HPC
rHDFS, rHbase, rHive
rODBC
As with Previous:
• Still Free.
• Still Memory Based.
• Data Still Moves.
• Intel MKL with RRO
Improvements:
• Parallelized Execution
Limitations:
• Parallelization Difficulty
• Data Movement
• Platform Specific
Remote
Desktop
R Code
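The explicit-parallelization pattern on this slide (independent tasks such as bootstrap replicates handed to workers, then gathered) can be sketched as follows. This is a minimal illustration in Python rather than R, not the foreach API itself; the function names, the per-worker seeding scheme, and the thread pool are assumptions made for the example. In R the same shape would be a foreach loop running on a registered backend.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor


def bootstrap_means(task):
    """Run one worker's share of replicates; each worker owns its own RNG."""
    data, n_reps, seed = task
    rng = random.Random(seed)  # independent, reproducible stream per worker
    n = len(data)
    # Each replicate resamples the data with replacement and records the mean.
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(n_reps)]


def parallel_bootstrap(data, total_reps=400, workers=4, base_seed=2024):
    """Split the replicates across workers, then combine the partial results."""
    base, extra = divmod(total_reps, workers)
    tasks = [
        (data, base + (1 if i < extra else 0), base_seed + i)
        for i in range(workers)
    ]
    # A thread pool keeps the sketch portable; a process pool or a cluster
    # backend would be substituted for real CPU-bound work.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(bootstrap_means, tasks))
    return [m for part in parts for m in part]


if __name__ == "__main__":
    sample = [float(x) for x in range(1, 101)]
    means = parallel_bootstrap(sample)
    print(len(means), "bootstrap means, std. error approx.",
          round(statistics.stdev(means), 3))
```

Note that the data is copied to every worker in this sketch, which is the "Data Movement" limitation listed above: the work parallelizes cleanly only because each replicate is independent.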
Execute R Code & CRAN Algorithms Inside Hadoop
Example Uses:
• Scoring
• Transformation
• Easily Parallelized Algorithms
Hadoop Streaming
Can Include CRAN Algorithms “Carefully”
As With Previous:
• Still Free.
• Optional Intel MKL in RRO
Improvements:
• Runs R in MapReduce
• No Data Movement
Limitations:
• Manual Parallelization
• Hadoop Specific
rHbase
rHDFS
rMapReduce
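The "manual parallelization" this slide refers to means writing the map and reduce steps yourself. A rough sketch of the Hadoop Streaming contract is shown below in Python as a stand-in for the R scripts the slide has in mind: the mapper emits tab-separated key/value lines, the framework sorts them by key, and the reducer processes each key's run of lines. The `region,value` record layout and all function names are assumptions for illustration, and `run_locally` only imitates the sort-and-shuffle step on one machine.

```python
from itertools import groupby


def mapper(lines):
    """Map step: turn each 'region,value' record into a 'key<TAB>value' line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank records
        key, value = line.split(",")
        yield f"{key}\t{float(value)}"


def reducer(sorted_lines):
    """Reduce step: input arrives sorted by key; emit the mean value per key."""
    parsed = (line.split("\t") for line in sorted_lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        values = [float(v) for _, v in group]
        yield f"{key}\t{sum(values) / len(values)}"


def run_locally(lines):
    """Imitate Hadoop's map -> sort -> reduce pipeline in a single process."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    records = ["east,10", "west,4", "east,20", "west,6"]
    for out in run_locally(records):
        print(out)
```

On a cluster the mapper and reducer would each read standard input and write standard output, and Hadoop would run many copies of them next to the data blocks, which is where the "No Data Movement" improvement comes from.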
Traditional “Beside” Architecture with Optimized Algorithms Available for Windows, Linux
As With Previous:
• Includes Intel MKL in RRO
Advantages:
• Speed: PEMAs (Parallel External Memory Algorithms) Parallelize Across Threads, Cores & Sockets
• Scale: PEMAs “Chunk” - no Memory Limits
• All of CRAN Available
• Portability
• Fully Supported
Limitations:
• Data Movement
• Single Machine
Revolution R Enterprise:
• ScaleR PEMA Algorithms
plus
• All of CRAN (subject to memory limits)
rHDFS, rHbase, rHive
rODBC
Data Step
• Data import – Delimited, Fixed, SAS, SPSS, ODBC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort, Merge, Split
• Aggregate by category (means, sums)
Descriptive Statistics
• Min / Max, Mean, Median (approx.)
• Quantiles (approx.)
• Standard Deviation
• Variance
• Correlation
• Covariance
• Sum of Squares (cross product matrix for set variables)
• Pairwise Cross tabs
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross Tabulations
Statistical Tests
• Chi Square Test
• Kendall Rank Correlation
• Fisher’s Exact Test
• Student’s t-Test
Sampling
• Subsample (observations & variables)
• Random Sampling
Predictive Models
• Sum of Squares (cross product matrix for set variables)
• Multiple Linear Regression
• Generalized Linear Models (GLM): exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions: cauchit, identity, log, logit, probit. User defined distributions & link functions.
• Covariance & Correlation Matrices
• Logistic Regression
• Classification & Regression Trees
• Predictions/scoring for models
• Residuals for all models
Cluster Analysis
• K-Means
Classification
• Decision Trees
• Decision Forests
• Gradient Boosted Decision Trees
Variable Selection
• Stepwise Regression
Simulation
• Simulation (e.g. Monte Carlo)
• Parallel Random Number Generation
Combination
New in 7.3
PEMA-R API
rxDataStep
rxExec
ScaleR PEMA
Master Algorithm Process
Data
Analyze Each Block
• Not Limited to Available Memory
• Unlimited Data Scale
• Ingests Data One Chunk At A Time.
• Adjustable Memory Footprint
• Multi-Thread Execution
Performance
• Highly-Optimized Algorithms
• Algorithm Math Fully Refactored for Parallelism
• Delivered as ScaleR Library in Revolution R Enterprise
Load Block At A Time
Combine Individual Results
Script Calls ScaleR Algorithm
Scripts can call CRAN Open Source Algorithms
Start & Manage Processing
rHDFS, rHbase, rHive
rODBC
Local File System (opt.)
ScaleR + CRAN Algorithms
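The load-a-block, analyze-each-block, combine-results flow described on this slide can be illustrated with a small chunked mean and variance computation. This is only a sketch of the general external-memory idea in Python, not the ScaleR implementation; `Partial`, `analyze_block`, `combine`, and `read_blocks` are hypothetical names introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class Partial:
    """Intermediate result for one or more blocks: count, sum, sum of squares."""
    n: int = 0
    total: float = 0.0
    total_sq: float = 0.0


def analyze_block(block):
    """Analyze each block independently, producing a small summary."""
    return Partial(len(block), sum(block), sum(x * x for x in block))


def combine(a, b):
    """Combine individual results; only the summaries are held in memory."""
    return Partial(a.n + b.n, a.total + b.total, a.total_sq + b.total_sq)


def read_blocks(values, block_size):
    """Load one block at a time from any iterable (file, database cursor, ...)."""
    block = []
    for v in values:
        block.append(v)
        if len(block) == block_size:
            yield block
            block = []
    if block:
        yield block


def chunked_mean_variance(values, block_size=1000):
    """Master process: stream the blocks, combine summaries, finish the math."""
    acc = Partial()
    for block in read_blocks(values, block_size):
        acc = combine(acc, analyze_block(block))
    mean = acc.total / acc.n
    # Textbook sum-of-squares formula, kept for brevity; it is numerically
    # naive for large values. Requires at least two observations.
    variance = (acc.total_sq - acc.n * mean * mean) / (acc.n - 1)
    return mean, variance


if __name__ == "__main__":
    print(chunked_mean_variance(range(1, 11), block_size=3))
```

Because `analyze_block` depends only on its own block and `combine` is associative, the blocks could also be analyzed on separate threads or nodes, and the memory footprint is set by `block_size` rather than by the size of the data.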
Fast Single-Server Alternative for Modest Data Scale
Edge Node
Thin Client or Remote Desktop
As With Previous:
• Single Machine Execution
• PEMA Scale & Speed (Single Machine)
• Use ScaleR + CRAN
• Accelerate R with Intel MKL
Improvements:
• Easily Shared via
• No Data Movement
• Develop on Desktop, Run on Edge Node
Limitations:
• “Shorter Trip” for Data
jobtracker
ScaleR Algorithms
DeployR
Fast Parallelized Analytics on Large Data Sets In Hadoop
As With Previous:
• Speed and Scale of ScaleR PEMA Algorithms
• Use CRAN Where Appropriate
• Accelerate R Math with MKL
• Custom Parallelized Algorithms
Advantages:
• Parallel Computation
• No Data Movement
• ScaleR PEMA Parallelization
• Can Parallelize CRAN “Carefully”
• Portable Coding
Limitations:
• Hadoop Workload Profiles
Web Services
Remote Execution
Desktop & Server Tools and Applications
Test Cluster - 9 Nodes

Task                                                                      Processing Time
Importing and Filtering Datasets from HDFS
  14 Million Observations                                                 82 sec.
  227 Million Observations                                                310 sec.
Modeling and Estimation
  1.2 M Correlations                                                      2771 sec.
  Simple Linear Regression, 227 M Observations                            61 sec.
  Multiple Linear Regression, Three Variables, 227 M Observations         58 sec.
  Multiple Linear Regression, Four Variables, 227 M Observations          58 sec.
  Random Forest, 10 Predictor Variables, 227 M Observations,
    10 Trees with Max Depth of 10 Splits                                  2 hr. 3 min.

64GB, 24 cores each
9 Task Nodes
2 Admin Nodes
1 Edge Node
128GB, 24 cores each
128GB, 24 cores each
ScaleR Algorithms
DeployR
Maximized Flexibility, Performance & Workload Handling
As With Previous:
• Speed and Scale of ScaleR PEMA Algorithms
• Use CRAN Where Appropriate
• Accelerate R Math with MKL
• Custom Parallelized Algorithms
Advantages:
• Flexibility for Blended Workloads
• Little or No Data Movement
• Maximize CRAN Capabilities by Sharing Large RAM Edge Nodes
Web Services
Thin Client Development
Remote Execution
Desktop & Server Tools and Applications
RStudio
• Where are the bulk of your skills? SAS? R? Java? Python? SQL?
• Where do you build models today?
• Do you have the skills to parallelize algorithms?
• Can models be built on a big shared server?
• How will you run models?
• Do you have the budget to purchase commercial solutions?
• How will your needs change over time?
• What is your future architecture plan?
• How risk averse is your management team regarding new platforms and open source?
• Revolution Analytics Products: http://www.revolutionanalytics.com/products
http://www.revolutionanalytics.com/big-analytics-hadoop-and-edws
• Whitepaper: “Delivering Value from Big Data with Revolution R Enterprise and Hadoop”
http://www.revolutionanalytics.com/whitepaper/delivering-value-big-data-revolution-r-enterprise-and-hadoop
• Revolution Analytics on Social Media: http://blog.revolutionanalytics.com/