microsoft r server stefan cronjaeger - meetupfiles.meetup.com/3576292/stefan cronjaeger r...
TRANSCRIPT
Microsoft R serverStefan CronjaegerTechnical Solution Specialist Advanced AnalyticsGlobal Blackbelt – [email protected]+49 151 4406 3425
?
DatasizeIn-memory
In-memory In-Memory or Disk Based
Speed of AnalysisSingle threaded Multi-threaded
Multi-threaded, parallel
processing 1:N servers
SupportCommunity Community Community + Commercial
Analytic Breadth
& Depth 7500+ innovative analytic
packages7500+ innovative analytic
packages
7500+ innovative packages +
commercial parallel high-
speed functions
LicenceOpen Source
Open Source
Commercial license.
Supported release with
indemnity
Microsoft
R Open
Microsoft
R Server
R Open Microsoft R Server
DeployRDevelopR
ConnectR•High-speed & direct connectors
Available for:•High-performance XDF
•SAS, SPSS, delimited & fixed format text data files
•Hadoop HDFS (text & XDF)
•Teradata Database & Aster
•ODBC
ScaleR•Ready-to-Use high-performance big data big analytics
• Fully-parallelized analytics DistributedR•Distributed computing framework
•Delivers cross-platform portability
R+CRAN•Open source R interpreter
•R 3.1.2
• Freely-available huge range of R algorithms
•100% Compatible with existing R scripts, functions and packages
RevoR•Performance enhanced R interpreter
•Based on open source R
•Adds high-performance math library to speed up linear algebra functions
Scale R – Parallelized Algorithms & Functions
Data import – Delimited, Fixed, SAS, SPSS,
OBDC
Variable creation & transformation
Recode variables
Factor variables
Missing value handling
Sort, Merge, Split
Aggregate by category (means, sums)
Min / Max, Mean, Median (approx.)
Quantiles (approx.)
Standard Deviation
Variance
Correlation
Covariance
Sum of Squares (cross product matrix for set
variables)
Pairwise Cross tabs
Risk Ratio & Odds Ratio
Cross-Tabulation of Data (standard tables & long
form)
Marginal Summaries of Cross Tabulations
Chi Square Test
Kendall Rank Correlation
Fisher’s Exact Test
Student’s t-Test
Subsample (observations & variables)
Random Sampling
Data Preparation Statistical Tests
Sampling
Descriptive Statistics Sum of Squares (cross product matrix for set
variables)
Multiple Linear Regression
Generalized Linear Models (GLM) exponential
family distributions: binomial, Gaussian, inverse
Gaussian, Poisson, Tweedie. Standard link
functions: cauchit, identity, log, logit, probit. User
defined distributions & link functions.
Covariance & Correlation Matrices
Logistic Regression
Classification & Regression Trees
Predictions/scoring for models
Residuals for all models
Predictive Models K-Means
Decision Trees
Decision Forest
Gradient Boosted Decision Trees
Naïve Bayes
Cluster Analysis
Classification
Simulation
Variable Selection
Stepwise Regression
Simulation (e.g. Monte Carlo)
Parallel Random Number Generation
Combination rxDataStep
rxExec
PEMA API
Algorithm
Master
Analyze
Blocks In
Parallel
Load Block
At A TimeDistribute
Work, Compile
Results
Not every algorithm works in parallel
Often there are several steps involved
Distributed R - How Does Remote Execution Work?
Algorithm
Master
Big
Data
Predictive
Algorithm
Analyze
Blocks In
Parallel
Load Block
At A Time
Distribute Work,
Compile Results
The Results:
• Even Faster Computation
• Larger Data Set Capacity
• Fewer Security Concerns
• No Data Movement, No Copies
“Pack and Ship” Requests
to Remote Environments
Results
Microsoft R Server functions
• A compute context defines remote connection• Microsoft R functions prefixed with rx
• Current compute context determines processing
location
Parallelization or Sequential Execution
Algorithm
Master
Analyze
Blocks In
Parallel
Load Block
At A TimeDistribute
Work, Compile
Results
Parallel:
• Large number of parallel workers
• Fast handling of Big Data
Algorithm
Master
Analyze
Blocks
Sequentially
Load Block
At A TimeDistribute
Work, Compile
Results
Sequential:
• Just one block of data in RAM
• Handling of Big Data with moderate
resources
Microsoft R Server has no data size limits in relation to size of available RAM
US flight data for 20 years
Linear Regression on Arrival Delay
Run on 4 core laptop, 16GB RAM and 500GB SSD
Microsoft R Server
Hadoop
R R R R R
R R R R R
ScaleR Production
RStudio Server Pro
Microsoft R Server
1. Copy
2. Stream
3. Send
Write Once Deploy Anywhere
### ANALYTICAL PROCESSING ###
### Statistical Summary of the data
rxSummary(~ArrDelay+DayOfWeek, data= AirlineDataSet, reportProgress=1)
### CrossTab the data
rxCrossTabs(ArrDelay ~ DayOfWeek, data= AirlineDataSet, means=T)
### Linear Model and plot
hdfsXdfArrLateLinMod <- rxLinMod(ArrDelay ~ DayOfWeek + 0 , data =
AirlineDataSet)
plot(hdfsXdfArrLateLinMod$coefficients)
# SETUP SQLSERVER ENVIRONMENT VARIABLES
mySqlServer <- RxInSqlServer()
# SQL SERVER COMPUTE CONTEXT AND TABLE REF
rxSetComputeContext(mySqlServer)
AirlineDataSet <-
RxSqlServerData(table=“AirlineDemoSmall”)
### SETUP HADOOP ENVIRONMENT VARIABLES
myHadoopCluster <- RxHadoopMR()
### HADOOP COMPUTE CONTEXT USING HDFS
rxSetComputeContext(myHadoopCluster)
### CREATE HDFS, DIRECTORY AND FILE OBJECTS
hdfsFS <- RxHdfsFileSystem()
AirlineDataSet <-
RxXdfData(“AirlineDemoSmall.xdf”,
fileSystem = hdfsFS
Local Parallel – Linux or Windows In – Hadoop
ScaleR functions can run in-Hadoop, in-Spark or in-Database without any
functional R recoding
R script – does not
need to change to
run across different
platforms
# SETUP LINUX ENVIRONMENT VARIABLES
rxSetComputeContext("localpar")
# CREATE LINUX, DIRECTORY AND FILE
OBJECTS
linuxFS <- RxNativeFileSystem()
AirlineDataSet <-
RxXdfData(“AirlineDemoSmall.xdf”,
fileSystem = linuxFS)
SQL Server
Microsoft R Server
SQL Server R Services
Leverage Full Capability of R:
• Rich Statistical, Visualization & Predictive Analytics
• A Large and Growing Skill Base
… including Microsoft R Servers Big Data Capabilities:
• Scalable Computation
• Scalable Data Size
… all Running In-Database:
• Divide Work Between Data Scientists and Data Engineers
• Reduce Data Duplication and Data Movement
… While Protecting Information:
• Eliminate Data Movement & Unnecessary Copying
• Leverage Database Data Protections
• Leverage Database Tools for Backup, Scheduling, …
+SQL Server
2016
Enterprise
Edition
Copyright Microsoft Corporation. All rights reserved.
SQL
In-Database Execution:
Remote Execution
Parallelized Compute SQL
Server
Remote
Execution
Context
Explore and Model:
In Parallel, In-Database
Parallelize distributable R and CRAN
Operationlize:
Score In Parallel
Parallel
Worker
Tasks
Move
BIG
Work to
the
DataLarge Data Sets in Chunks
Parallel
Algorithm
Iterate/ Sequence
Run Parallel Algorithms in Database from an R client
Run R In-Database from TSQL
SQL
Server
2016
In-Database
Execution of
R + CRAN
+ SQL
In-Database Execution of:
R Code
CRAN Packages
Move the
Work to
the Data
Run R
From the
Query
Processor
Retrieve
Models,
Scores,
Transformed
Data,
Plots/Images
Operationalise
scoring/predicti
on in database
for data batches
or real-time
Microsoft R Server
Operationalizing R based Analytics
•
•
•
•
•
•
•
•
:
Copyright Microsoft Corporation. All rights reserved.
Revolution Scale R engine + Open Source R with access to 7000+ packages
R Model Repository
Enterprise Security
R Session Management
Resource Management
Desktop Applications Mobile Apps Web Applications Real-Time Applications
API Client Libraries
R / Statistical
Modeling Expert
Layer
Business User
Layer
DeployR
Web
Services
Layer
Copyright Microsoft Corporation. All rights reserved.
Copyright Microsoft Corporation. All rights reserved.
The creation and management of R runtime resource usage
The monitoring of events on the grid and server