microsoft r server stefan cronjaeger - meetupfiles.meetup.com/3576292/stefan cronjaeger r...
![Page 1: Microsoft R server Stefan Cronjaeger - Meetupfiles.meetup.com/3576292/Stefan Cronjaeger R Server.pdf · •Ready-to-Use high-performance big data big analytics ... framework •Delivers](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd4f95bdf3a53aec2fafdd/html5/thumbnails/1.jpg)
Microsoft R Server
Stefan Cronjaeger
Technical Solution Specialist Advanced Analytics
Global Blackbelt – [email protected]
+49 151 4406 3425
![Page 4: Microsoft R server Stefan Cronjaeger - Meetupfiles.meetup.com/3576292/Stefan Cronjaeger R Server.pdf · •Ready-to-Use high-performance big data big analytics ... framework •Delivers](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd4f95bdf3a53aec2fafdd/html5/thumbnails/4.jpg)
| | | Microsoft R Open | Microsoft R Server |
|---|---|---|---|
| Data size | In-memory | In-memory | In-memory or disk based |
| Speed of analysis | Single threaded | Multi-threaded | Multi-threaded, parallel processing 1:N servers |
| Support | Community | Community | Community + Commercial |
| Analytic breadth & depth | 7500+ innovative analytic packages | 7500+ innovative analytic packages | 7500+ innovative packages + commercial parallel high-speed functions |
| Licence | Open Source | Open Source | Commercial license. Supported release with indemnity |
![Page 5: Microsoft R server Stefan Cronjaeger - Meetupfiles.meetup.com/3576292/Stefan Cronjaeger R Server.pdf · •Ready-to-Use high-performance big data big analytics ... framework •Delivers](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd4f95bdf3a53aec2fafdd/html5/thumbnails/5.jpg)
R Open | Microsoft R Server
DeployR
DevelopR
ConnectR
• High-speed & direct connectors
Available for:
• High-performance XDF
• SAS, SPSS, delimited & fixed format text data files
• Hadoop HDFS (text & XDF)
• Teradata Database & Aster
• ODBC
ScaleR
• Ready-to-use high-performance big data big analytics
• Fully-parallelized analytics
DistributedR
• Distributed computing framework
• Delivers cross-platform portability
R+CRAN
• Open source R interpreter
• R 3.1.2
• Freely-available huge range of R algorithms
• 100% compatible with existing R scripts, functions and packages
RevoR
• Performance enhanced R interpreter
• Based on open source R
• Adds high-performance math library to speed up linear algebra functions
![Page 6: Microsoft R server Stefan Cronjaeger - Meetupfiles.meetup.com/3576292/Stefan Cronjaeger R Server.pdf · •Ready-to-Use high-performance big data big analytics ... framework •Delivers](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd4f95bdf3a53aec2fafdd/html5/thumbnails/6.jpg)
ScaleR – Parallelized Algorithms & Functions
Data Preparation
• Data import – Delimited, Fixed, SAS, SPSS, ODBC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort, Merge, Split
• Aggregate by category (means, sums)
Descriptive Statistics
• Min / Max, Mean, Median (approx.)
• Quantiles (approx.)
• Standard Deviation
• Variance
• Correlation
• Covariance
• Sum of Squares (cross product matrix for set variables)
• Pairwise Cross tabs
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross Tabulations
Statistical Tests
• Chi Square Test
• Kendall Rank Correlation
• Fisher’s Exact Test
• Student’s t-Test
Sampling
• Subsample (observations & variables)
• Random Sampling
Predictive Models
• Sum of Squares (cross product matrix for set variables)
• Multiple Linear Regression
• Generalized Linear Models (GLM) exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions: cauchit, identity, log, logit, probit. User defined distributions & link functions.
• Covariance & Correlation Matrices
• Logistic Regression
• Classification & Regression Trees
• Predictions/scoring for models
• Residuals for all models
Cluster Analysis
• K-Means
Classification
• Decision Trees
• Decision Forest
• Gradient Boosted Decision Trees
• Naïve Bayes
Simulation
• Simulation (e.g. Monte Carlo)
• Parallel Random Number Generation
Variable Selection
• Stepwise Regression
Combination
• rxDataStep
• rxExec
• PEMA API
![Page 7: Microsoft R server Stefan Cronjaeger - Meetupfiles.meetup.com/3576292/Stefan Cronjaeger R Server.pdf · •Ready-to-Use high-performance big data big analytics ... framework •Delivers](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd4f95bdf3a53aec2fafdd/html5/thumbnails/7.jpg)
Algorithm Master: Distribute Work, Compile Results
Load Block At A Time
Analyze Blocks In Parallel
Not every algorithm works in parallel
Often there are several steps involved
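The block-at-a-time pattern on this slide can be illustrated with a minimal base R sketch. It is not ScaleR code: the function names `summarise_block` and `merge_summaries`, the synthetic data and the block size of 1000 are all assumptions made for illustration. Each block is reduced to a small summary, and the summaries are merged afterwards without revisiting the raw data.

```r
# Illustrative sketch in base R (not ScaleR code): summarise a vector block by
# block and merge the per-block results, so only one block is needed at a time.

# Per-block step: reduce one block to a small summary (count, mean, M2),
# where M2 is the sum of squared deviations from the block mean.
summarise_block <- function(x) {
  n <- length(x)
  m <- mean(x)
  list(n = n, mean = m, m2 = sum((x - m)^2))
}

# Combine step: merge two summaries without touching the raw values again.
merge_summaries <- function(a, b) {
  n <- a$n + b$n
  delta <- b$mean - a$mean
  list(
    n = n,
    mean = a$mean + delta * b$n / n,
    m2 = a$m2 + b$m2 + delta^2 * a$n * b$n / n
  )
}

set.seed(1)
x <- rnorm(10000, mean = 5, sd = 2)                # synthetic stand-in for a large data set
blocks <- split(x, ceiling(seq_along(x) / 1000))   # hypothetical block size of 1000 values

partials <- lapply(blocks, summarise_block)        # "analyze blocks"
total <- Reduce(merge_summaries, partials)         # "compile results"

block_mean <- total$mean
block_var <- total$m2 / (total$n - 1)
```

Mean and variance merge cleanly because their per-block summaries can be combined exactly. Statistics such as an exact median do not decompose this way, which is one reason an algorithm may need several passes or an approximation.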
![Page 8: Microsoft R server Stefan Cronjaeger - Meetupfiles.meetup.com/3576292/Stefan Cronjaeger R Server.pdf · •Ready-to-Use high-performance big data big analytics ... framework •Delivers](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd4f95bdf3a53aec2fafdd/html5/thumbnails/8.jpg)
Distributed R - How Does Remote Execution Work?
Big Data
Predictive Algorithm
Algorithm Master: Distribute Work, Compile Results
Load Block At A Time
Analyze Blocks In Parallel
“Pack and Ship” Requests to Remote Environments
Results
The Results:
• Even Faster Computation
• Larger Data Set Capacity
• Fewer Security Concerns
• No Data Movement, No Copies
Microsoft R Server functions
• A compute context defines remote connection
• Microsoft R functions prefixed with rx
• Current compute context determines processing location
![Page 9: Microsoft R server Stefan Cronjaeger - Meetupfiles.meetup.com/3576292/Stefan Cronjaeger R Server.pdf · •Ready-to-Use high-performance big data big analytics ... framework •Delivers](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd4f95bdf3a53aec2fafdd/html5/thumbnails/9.jpg)
Parallelization or Sequential Execution
Parallel:
Algorithm Master: Distribute Work, Compile Results
Load Block At A Time
Analyze Blocks In Parallel
• Large number of parallel workers
• Fast handling of Big Data
Sequential:
Algorithm Master: Distribute Work, Compile Results
Load Block At A Time
Analyze Blocks Sequentially
• Just one block of data in RAM
• Handling of Big Data with moderate resources
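The two execution modes can be compared in a small sketch using base R's `parallel` package rather than ScaleR. The helper names `block_stats` and `combine`, the block layout and the worker count of 2 are assumptions for illustration. The same per-block function runs either one block after another or on worker processes, and the combine step is identical in both cases.

```r
library(parallel)

# Per-block work: count and sum of one block (enough to recover a global mean).
block_stats <- function(block) c(n = length(block), sum = sum(block))

set.seed(42)
x <- runif(8000)                              # synthetic data
blocks <- split(x, rep(1:8, each = 1000))     # 8 blocks of 1000 values

# Sequential: blocks are processed one after another in the current R session.
seq_parts <- lapply(blocks, block_stats)

# Parallel: the same function is sent to separate worker processes.
cl <- makeCluster(2)                          # assumed worker count
par_parts <- parLapply(cl, blocks, block_stats)
stopCluster(cl)

# Combine step shared by both modes: add up the partial results.
combine <- function(parts) {
  totals <- Reduce(`+`, parts)
  unname(totals["sum"] / totals["n"])
}

seq_mean <- combine(seq_parts)
par_mean <- combine(par_parts)
```

Both modes produce the same answer. They differ in how many blocks are worked on at once, and therefore in how much memory and how many cores are used.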
![Page 11: Microsoft R server Stefan Cronjaeger - Meetupfiles.meetup.com/3576292/Stefan Cronjaeger R Server.pdf · •Ready-to-Use high-performance big data big analytics ... framework •Delivers](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd4f95bdf3a53aec2fafdd/html5/thumbnails/11.jpg)
Microsoft R Server has no data size limits in relation to size of available RAM
US flight data for 20 years
Linear Regression on Arrival Delay
Run on 4 core laptop, 16GB RAM and 500GB SSD
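One way a linear regression can be fitted without holding all rows in RAM is to accumulate the cross products X'X and X'y chunk by chunk and solve the normal equations once at the end. The base R sketch below shows that general technique on synthetic data. It is not the flight data set and not a statement of how ScaleR implements rxLinMod; the `DepHour` column and the chunk size of 500 are hypothetical.

```r
# Sketch in base R: least squares fitted from chunks by accumulating X'X and X'y.
set.seed(123)
n <- 5000
flights <- data.frame(                       # synthetic stand-in, not the real flight data
  DayOfWeek = factor(sample(1:7, n, replace = TRUE), levels = 1:7),
  DepHour = runif(n, 0, 24)                  # hypothetical predictor
)
flights$ArrDelay <- 2 + 0.5 * flights$DepHour +
  as.integer(flights$DayOfWeek) + rnorm(n)

form <- ArrDelay ~ DayOfWeek + DepHour
chunk_id <- ceiling(seq_len(n) / 500)        # hypothetical chunk size of 500 rows

xtx <- NULL
xty <- NULL
for (id in unique(chunk_id)) {
  chunk <- flights[chunk_id == id, ]         # in practice: read the next block from disk
  X <- model.matrix(form, chunk)             # fixed factor levels keep columns aligned
  y <- chunk$ArrDelay
  if (is.null(xtx)) {
    xtx <- crossprod(X)                      # X'X for the first chunk
    xty <- crossprod(X, y)                   # X'y for the first chunk
  } else {
    xtx <- xtx + crossprod(X)                # running totals over chunks
    xty <- xty + crossprod(X, y)
  }
}

beta_chunked <- drop(solve(xtx, xty))        # solve the normal equations once
beta_lm <- coef(lm(form, data = flights))    # in-memory reference fit
```

Only the small matrices `xtx` and `xty` persist between chunks, so memory use depends on the number of model terms rather than the number of rows.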
![Page 12: Microsoft R server Stefan Cronjaeger - Meetupfiles.meetup.com/3576292/Stefan Cronjaeger R Server.pdf · •Ready-to-Use high-performance big data big analytics ... framework •Delivers](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd4f95bdf3a53aec2fafdd/html5/thumbnails/12.jpg)
Microsoft R Server
Hadoop
![Page 13: Microsoft R server Stefan Cronjaeger - Meetupfiles.meetup.com/3576292/Stefan Cronjaeger R Server.pdf · •Ready-to-Use high-performance big data big analytics ... framework •Delivers](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd4f95bdf3a53aec2fafdd/html5/thumbnails/13.jpg)
ScaleR Production
RStudio Server Pro
Microsoft R Server
1. Copy
2. Stream
3. Send
![Page 14: Microsoft R server Stefan Cronjaeger - Meetupfiles.meetup.com/3576292/Stefan Cronjaeger R Server.pdf · •Ready-to-Use high-performance big data big analytics ... framework •Delivers](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd4f95bdf3a53aec2fafdd/html5/thumbnails/14.jpg)
Write Once Deploy Anywhere
ScaleR functions can run in-Hadoop, in-Spark or in-Database without any functional R recoding
R script – does not need to change to run across different platforms
Local Parallel – Linux or Windows
# SETUP LINUX ENVIRONMENT VARIABLES
rxSetComputeContext("localpar")
# CREATE LINUX, DIRECTORY AND FILE OBJECTS
linuxFS <- RxNativeFileSystem()
AirlineDataSet <- RxXdfData("AirlineDemoSmall.xdf", fileSystem = linuxFS)
In – Hadoop
### SETUP HADOOP ENVIRONMENT VARIABLES
myHadoopCluster <- RxHadoopMR()
### HADOOP COMPUTE CONTEXT USING HDFS
rxSetComputeContext(myHadoopCluster)
### CREATE HDFS, DIRECTORY AND FILE OBJECTS
hdfsFS <- RxHdfsFileSystem()
AirlineDataSet <- RxXdfData("AirlineDemoSmall.xdf", fileSystem = hdfsFS)
SQL Server
# SETUP SQLSERVER ENVIRONMENT VARIABLES
mySqlServer <- RxInSqlServer()
# SQL SERVER COMPUTE CONTEXT AND TABLE REF
rxSetComputeContext(mySqlServer)
AirlineDataSet <- RxSqlServerData(table = "AirlineDemoSmall")
### ANALYTICAL PROCESSING ###
### Statistical Summary of the data
rxSummary(~ArrDelay+DayOfWeek, data = AirlineDataSet, reportProgress = 1)
### CrossTab the data
rxCrossTabs(ArrDelay ~ DayOfWeek, data = AirlineDataSet, means = TRUE)
### Linear Model and plot
hdfsXdfArrLateLinMod <- rxLinMod(ArrDelay ~ DayOfWeek + 0, data = AirlineDataSet)
plot(hdfsXdfArrLateLinMod$coefficients)
![Page 15: Microsoft R server Stefan Cronjaeger - Meetupfiles.meetup.com/3576292/Stefan Cronjaeger R Server.pdf · •Ready-to-Use high-performance big data big analytics ... framework •Delivers](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd4f95bdf3a53aec2fafdd/html5/thumbnails/15.jpg)
Microsoft R Server
SQL Server R Services
![Page 16: Microsoft R server Stefan Cronjaeger - Meetupfiles.meetup.com/3576292/Stefan Cronjaeger R Server.pdf · •Ready-to-Use high-performance big data big analytics ... framework •Delivers](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd4f95bdf3a53aec2fafdd/html5/thumbnails/16.jpg)
Leverage Full Capability of R:
• Rich Statistical, Visualization & Predictive Analytics
• A Large and Growing Skill Base
… including Microsoft R Server's Big Data Capabilities:
• Scalable Computation
• Scalable Data Size
… all Running In-Database:
• Divide Work Between Data Scientists and Data Engineers
• Reduce Data Duplication and Data Movement
… While Protecting Information:
• Eliminate Data Movement & Unnecessary Copying
• Leverage Database Data Protections
• Leverage Database Tools for Backup, Scheduling, …
+ SQL Server 2016 Enterprise Edition
Copyright Microsoft Corporation. All rights reserved.
![Page 17: Microsoft R server Stefan Cronjaeger - Meetupfiles.meetup.com/3576292/Stefan Cronjaeger R Server.pdf · •Ready-to-Use high-performance big data big analytics ... framework •Delivers](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd4f95bdf3a53aec2fafdd/html5/thumbnails/17.jpg)
Run Parallel Algorithms in Database from an R client
In-Database Execution:
• Remote Execution
• Parallelized Compute
Explore and Model:
• In Parallel, In-Database
• Parallelize distributable R and CRAN
Operationalize:
• Score In Parallel
Diagram labels: SQL Server, Remote Execution Context, Parallel Worker Tasks, Parallel Algorithm, Iterate/Sequence, Large Data Sets in Chunks, Move BIG Work to the Data
![Page 18: Microsoft R server Stefan Cronjaeger - Meetupfiles.meetup.com/3576292/Stefan Cronjaeger R Server.pdf · •Ready-to-Use high-performance big data big analytics ... framework •Delivers](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd4f95bdf3a53aec2fafdd/html5/thumbnails/18.jpg)
Run R In-Database from TSQL
SQL Server 2016: In-Database Execution of R + CRAN + SQL
In-Database Execution of:
• R Code
• CRAN Packages
Move the Work to the Data
Run R From the Query Processor
Retrieve Models, Scores, Transformed Data, Plots/Images
Operationalise scoring/prediction in database for data batches or real-time
![Page 19: Microsoft R server Stefan Cronjaeger - Meetupfiles.meetup.com/3576292/Stefan Cronjaeger R Server.pdf · •Ready-to-Use high-performance big data big analytics ... framework •Delivers](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd4f95bdf3a53aec2fafdd/html5/thumbnails/19.jpg)
Microsoft R Server
Operationalizing R based Analytics
![Page 21: Microsoft R server Stefan Cronjaeger - Meetupfiles.meetup.com/3576292/Stefan Cronjaeger R Server.pdf · •Ready-to-Use high-performance big data big analytics ... framework •Delivers](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd4f95bdf3a53aec2fafdd/html5/thumbnails/21.jpg)
Business User Layer: Desktop Applications, Mobile Apps, Web Applications, Real-Time Applications
API Client Libraries
DeployR Web Services Layer: R Model Repository, Enterprise Security, R Session Management, Resource Management
R / Statistical Modeling Expert Layer: Revolution ScaleR engine + Open Source R with access to 7000+ packages
![Page 23: Microsoft R server Stefan Cronjaeger - Meetupfiles.meetup.com/3576292/Stefan Cronjaeger R Server.pdf · •Ready-to-Use high-performance big data big analytics ... framework •Delivers](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd4f95bdf3a53aec2fafdd/html5/thumbnails/23.jpg)
The creation and management of R runtime resource usage
The monitoring of events on the grid and server