TRANSCRIPT
© 2015 IBM Corporation
Running the R Language on Hadoop: Big R
What is Open Source R? What is CRAN?
R is a powerful programming language and environment for statistical computing and graphics.
R offers a rich analytics ecosystem:
− Full analytics life-cycle
  • Data exploration
  • Statistical analysis
  • Modeling, machine learning, simulations
  • Visualization
− Highly extensible via user-submitted packages
  • Tap into an innovation pipeline contributed to by highly regarded statisticians
  • Currently 4700+ statistical packages in the repository
  • Easily accessible via CRAN, the Comprehensive R Archive Network
− R is one of the fastest-growing data analysis software environments
  • Deeply knowledgeable and supportive analytics community
  • The most popular software in data analysis competitions
  • Gaining ground in corporate, government, and academic settings
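As a taste of that life-cycle, here is a minimal base-R session on the built-in mtcars dataset covering exploration, statistical analysis, and modeling (visualization omitted so the snippet stays text-only):

```r
# Minimal end-to-end R analytics life-cycle using only base R.
data(mtcars)

# Data exploration: structure and a summary of fuel economy.
str(mtcars)
summary(mtcars$mpg)

# Statistical analysis: correlation between car weight and fuel economy.
r <- cor(mtcars$wt, mtcars$mpg)   # strongly negative: heavier cars use more fuel

# Modeling: simple linear regression of mpg on weight.
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)
```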
The Explainer: Data in Hadoop
[Diagram: you, the R user, working with data distributed across a Hadoop cluster]
Data in Hadoop: Open Source R on a single node
[Diagram: you, the R user, running open source R on a single node, apart from the distributed data]
Challenges with Running Large-Scale Analytics
− Traditional approach: analyze small subsets of information; only a fraction of all available information is analyzed
− Big data approach: analyze all information; all available information is analyzed
Various Approaches to integrate R with Hadoop
Key challenges of using R with Hadoop:
− In R, processing is fundamentally memory bound: data frames and matrices are loaded into memory, and all processing happens there.
− It is therefore hard to integrate R with a fundamentally distributed processing paradigm like MapReduce (Hadoop).

Various approaches to integrate R with Hadoop:
− RHIPE: an open source framework that integrates R with Hadoop through MapReduce coding on the client side.
− RHadoop/rmr: RHadoop likewise provides an open source framework that integrates R with Hadoop on the client side.
− Hadoop Streaming: open source, part of the Hadoop framework; invokes an R script in MapReduce through the I/O stream.
− Oracle Enterprise R Connector for Hadoop: a licensed product; essentially an adaptation of Oracle R to work with any Hadoop distribution through MapReduce coding on the client side.
− Big R: a licensed product; a proprietary mechanism to integrate with Hadoop without the need for MapReduce-style R code.
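As an illustration of the Hadoop Streaming approach, here is a sketch of an R mapper script that reads input splits line by line on stdin and emits tab-separated key/value pairs on stdout. The script name and the column positions (1 = city, 3 = minTemp, 6 = maxTemp, matching the RHIPE example on the next slide) are assumptions:

```r
#!/usr/bin/env Rscript
# Sketch of a Hadoop Streaming mapper in R. Hadoop feeds each input split
# line by line on stdin; we write key<TAB>value records to stdout.

# Turn one CSV line into "city<TAB>avg of (minTemp, maxTemp)".
emit_city_avg <- function(line) {
  fields <- strsplit(line, ",")[[1]]
  paste0(fields[1], "\t", mean(c(as.double(fields[3]), as.double(fields[6]))))
}

# Stream loop: exits immediately at end of input.
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  cat(emit_city_avg(line), "\n", sep = "")
}
close(con)
```

A job would then be submitted with the Hadoop Streaming jar, passing this script as the `-mapper` (and a similar one as the `-reducer`).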
Sample Code – R with RHIPE for Hadoop integration
Original R Code
tempData <- read.table("temperature.csv", header = TRUE, sep = ",")
coltypes(tempData) = ifelse(1:10 %in% c(3, 4), "numeric", "character")
maxMin <- tempData[, c("minTemp", "maxTemp")]
tempData$avgTempDay <- rowMeans(maxMin)
avgTempCity <- aggregate(tempData$avgTempDay, by = list(city = tempData$city), FUN = mean)
write.table(avgTempCity, file = "output.csv", sep = ",")
R Code using the RHIPE package

library(Rhipe)
rhinit(TRUE, TRUE)

map <- expression({
  process_line <- function(currentLine) {
    fields <- unlist(strsplit(currentLine, ","))
    maxMin <- c(as.double(fields[3]), as.double(fields[6]))
    rhcollect(fields[1], toString(mean(maxMin)))
  }
  lapply(map.values, process_line)
})

reduce <- expression(
  pre    = { means <- numeric(0) },
  reduce = { means <- c(means, as.numeric(unlist(reduce.values))) },
  post   = { rhcollect(reduce.key, toString(mean(means))) }
)

input_file <- "temperature.csv"
output_dir <- "output.csv"

job <- rhmr(jobname = "TempAvg", map = map, reduce = reduce,
            ifolder = input_file, ofolder = output_dir,
            inout = c("text", "sequence"))
rhex(job)
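The job above computes, per city, the mean over records of mean(minTemp, maxTemp). That logic can be sanity-checked locally in base R on a tiny in-memory sample (the column positions match the map expression; the sample data is made up):

```r
# Tiny in-memory stand-in for temperature.csv; only columns 1, 3 and 6 matter.
lines <- c("Austin,x,10,x,x,30",
           "Austin,x,20,x,x,40",
           "Boston,x,0,x,x,10")

fields <- strsplit(lines, ",")
city   <- sapply(fields, `[`, 1)
# The "map" step: per-record average of minTemp and maxTemp.
rowAvg <- sapply(fields, function(f) mean(as.double(f[c(3, 6)])))
# The "reduce" step: mean of the per-record averages within each city.
avgByCity <- tapply(rowAvg, city, mean)
```

Here Austin's records average (20, 30), giving 25, and Boston's single record gives 5.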
Overview of BigInsights

IBM BigInsights layers its features on the IBM Open Platform with Apache Hadoop* (HDFS, YARN, MapReduce, Ambari, HBase, Hive, Oozie, Parquet, Parquet Format, Pig, Snappy, Solr, Spark, Sqoop, ZooKeeper, Open JDK, Knox, Slider):
− BigInsights Analyst: industry-standard SQL (Big SQL) and a spreadsheet-style tool (BigSheets)
− BigInsights Data Scientist: Big R (R support), machine learning on Big R, and text analytics
− BigInsights Enterprise Management: POSIX distributed filesystem and multi-workload, multi-tenant scheduling

Free Quick Start (non-production):
• IBM Open Platform
• BigInsights Analyst and Data Scientist features
• Community support
…
IBM BigInsights brings efficient integration of R with Big R
− R as a big data query language
  • Outside-in execution: pull data (summaries) to the R client
− R as a statistical language for deep computing
  • Inside-out execution: push R functions right onto the data
  • Partitioning of large data ("divide")
  • Parallel cluster execution of pushed-down R code ("conquer")
  • Almost any R package can run in this environment
− R as the gateway to scalable machine learning
  • A scalable ML engine that provides canned algorithms, and an ability to author new ones, all via R

[Diagram: R clients (with R packages) connect to data sources and a scalable ML engine with embedded R execution; data summaries are pulled to the R client, or R functions are pushed onto the data]
Big R Architecture
[Diagram: Big R architecture: an R user interface on top of (1) scalable data processing, (2) native R functions, and (3) scalable algorithms]
Sample Code – BigR
Code using Big R
library(bigr)

temperatureData <- bigr.frame(dataSource = "DEL",
                              dataPath = "/user/temperature.csv",
                              header = TRUE)
coltypes(temperatureData) = ifelse(1:10 %in% c(3, 6), "numeric", "character")

buildAvgTempFunc <- function(df) {
  maxMin <- df[, c("minTemp", "maxTemp")]
  df$avgTempDay <- rowMeans(maxMin)
  avgTempCity <- aggregate(df$avgTempDay, by = list(city = df$city), FUN = mean)
  return(data.frame(avgTempCity))
}

avgTemperature <- groupApply(temperatureData, temperatureData$city,
                             buildAvgTempFunc,
                             data.frame(city = "city", average_temperature = 1.0))

bigr.persist(avgTemperature, dataSource = "DEL",
             dataPath = "/user/output.csv", header = TRUE, del = ",")
− This code (using Big R) achieves the same result as the original R code, operating on the same dataset in the CSV file in HDFS.
− Note that buildAvgTempFunc contains the same R code snippet as the original R code.
− The groupApply function is specific to the bigr package; other similarly useful functions are rowApply and tableApply.
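Conceptually, groupApply partitions the data by the grouping column, runs the user function on each partition inside Hadoop, and unions the returned data frames. The semantics (not the distributed implementation) can be sketched locally with base R split/lapply; local_groupApply is a made-up stand-in:

```r
# Local emulation of groupApply semantics: one partition per group value,
# apply the user function to each partition, union the resulting data frames.
local_groupApply <- function(df, groupCol, fun) {
  parts <- split(df, df[[groupCol]])
  do.call(rbind, lapply(parts, fun))
}

# Tiny in-memory dataset and the same per-group function as the slide.
tempData <- data.frame(city = c("A", "A", "B"),
                       minTemp = c(10, 20, 0),
                       maxTemp = c(30, 40, 10))
buildAvgTempFunc <- function(df) {
  df$avgTempDay <- rowMeans(df[, c("minTemp", "maxTemp")])
  aggregate(df$avgTempDay, by = list(city = df$city), FUN = mean)
}

out <- local_groupApply(tempData, "city", buildAvgTempFunc)
```

City A's per-day averages are 20 and 30 (mean 25); city B's single day averages 5.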
Original R Code
tempData <- read.table("temperature.csv", header = TRUE, sep = ",")
coltypes(tempData) = ifelse(1:10 %in% c(3, 4), "numeric", "character")
maxMin <- tempData[, c("minTemp", "maxTemp")]
tempData$avgTempDay <- rowMeans(maxMin)
avgTempCity <- aggregate(tempData$avgTempDay, by = list(city = tempData$city), FUN = mean)
write.table(avgTempCity, file = "output.csv", sep = ",")
User Experience for Big R
1. Connect to the BI cluster
2. Create a data frame proxy to the large data file
3. Apply a data transformation step
4. Run scalable linear regression on the cluster
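Put together, the four steps above might look like the following Big R session. This is a sketch, not runnable without a BigInsights cluster: the host, credentials, file path, column names, and the exact bigr.connect and bigr.lm signatures are assumptions; only bigr.frame, coltypes, and bigr.lm appear elsewhere in this deck.

```r
library(bigr)

# 1. Connect to the BI cluster (host and credentials are placeholders,
#    and the bigr.connect argument names are assumed).
bigr.connect(host = "bi-cluster.example.com", user = "biadmin", password = "...")

# 2. Data frame proxy to a large file in HDFS; no data is pulled to the client.
flights <- bigr.frame(dataSource = "DEL",
                      dataPath = "/user/flights.csv", header = TRUE)
coltypes(flights) = ifelse(1:5 %in% c(2, 3), "numeric", "character")

# 3. Data transformation step on the proxy (hypothetical columns).
flights$delayed <- ifelse(flights$arrDelay > 15, 1, 0)

# 4. Run scalable linear regression on the cluster (lm-like call assumed).
model <- bigr.lm(arrDelay ~ distance, data = flights)
```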
3 Key Capabilities in Big R
1. Use of R as a language on big data
   − Scalable data processing
2. Running native R functions in Hadoop
   − Can leverage existing R assets (code and CRAN packages)
3. Running scalable algorithms beyond R in Hadoop
   − Wide class of algorithms, and growing
   − R-like syntax to develop new algorithms and customize existing algorithms

End-to-end integration of R into BigInsights Hadoop
Big R Data Structures: Proxy to entire dataset
data <- bigr.frame(…)
Appears and acts like all of the data is on your laptop
Out-of-box Big R Functions: Seamlessly compile into MapReduce
dataCorrelation <- cor(data)
MapReduce job runs over the entire dataset
Big R Partitioned Execution: Run R functions on partitions of data
[Diagram: the dataset is split into many partitions; each map task stands up its own instance of R]

models <- rowApply(... some R function ...)
Big R Partitioned Execution: How rowApply works (on 4 nodes)
[Diagram: the logical dataset is divided into four partitions of rows; each partition stands up an instance of R on one of the four nodes]
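The rowApply contract sketched above (chunk the rows, hand each chunk to its own R invocation, collect the results) can be emulated locally in base R; local_rowApply is a hypothetical stand-in, not the bigr function:

```r
# Local emulation of rowApply semantics: split the rows into fixed-size
# chunks (one per "map task"), apply the user function to each chunk,
# and collect the per-chunk results in a list.
local_rowApply <- function(df, fun, rowsPerChunk) {
  idx    <- ceiling(seq_len(nrow(df)) / rowsPerChunk)  # partition ids 1, 2, ...
  chunks <- split(df, idx)
  lapply(chunks, fun)
}

df  <- data.frame(x = 1:10)
out <- local_rowApply(df, function(chunk) sum(chunk$x), rowsPerChunk = 4)
```

With 10 rows and chunks of 4, the function runs three times, on rows 1:4, 5:8, and 9:10.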
SystemML from Big R: Statistics and machine learning at scale
model <- bigr.lm(…)
Optimized MapReduce jobs run over the entire dataset
Rich Functionality in Big R
Big R functions:

Connection: connect, disconnect, …
HDFS: listfs, rmfs

Types and functions
• Types: bigr.frame, bigr.vector
• Functions: dim, nrow, colnames, coltypes, head, tail, na.string, na.omit, sort, summary
• Coercion and casting: as.bigr.frame, as.data.frame, as.bigr.vector, as.integer, as.logical, as.numeric

Built-in functions
• Arithmetic: +, -, *, /, ^
• Mathematical: abs, acos, asin, atan, ceiling, floor, exp, …
• String: grepl, substr
• Statistical: cor, cov, mean, sd
• Miscellaneous: attach, pull, random, sample, ifelse
• Visualization: histogram

Apply R functions: groupApply, tableApply, rowApply
Run scalable algorithms: bigr.lm, bigr.svm, bigr. … (see subsequent slide)
Scalable Machine Learning Algorithms in Big R
Category / description / Big R function:

Descriptive statistics
• Univariate: bigr.univariateStats()
• Bivariate: bigr.bivariateStats()
• Stratified bivariate: bigr.bivariateStats()

Classification
• Logistic regression (multinomial): bigr.logistic.regression()
• Multi-class SVM: bigr.svm()
• Naïve Bayes (multinomial): bigr.naive.bayes()

Clustering
• k-Means: bigr.kmeans()

Regression
• Linear regression, system of equations: bigr.lm()
• Linear regression, CG (conjugate gradient descent): bigr.lm()
• Generalized linear models (GLM), with distributions Gaussian, Poisson, Gamma, inverse Gaussian, binomial, and Bernoulli: bigr.glm()
• GLM links for all distributions (identity, log, square root, inverse, 1/µ²): bigr.glm()
• GLM links for binomial/Bernoulli (logit, probit, cloglog, cauchit): bigr.glm()

Predict
• Scoring: bigr.predict()

Transformation
• Dummy coding, binning, scaling, missing value imputation: bigr.transform()
What’s behind running Big R’s Scalable Algorithms?
− A high-level declarative language with R-like syntax shields your algorithm investment from platform progression
− Cost-based compilation of algorithms generates execution plans
  • Compilation and parallelization based on data characteristics
  • Compilation and parallelization based on cluster and machine characteristics
  • In-memory single-node and MapReduce execution
− Algorithm developers stay productive when building additional algorithms (scalability, numeric stability, and optimizations)

[Diagram: generated plans run either in memory on a single node or on the Hadoop cluster]

Declarative analytics:
1) Future-proof algorithm investment
2) Automatic performance tuning
K-Means execution plan, input data X = 10m x 500 (135 GB text file), K = 10

Starting from the input data (X), the steps execute as follows:
1. Compute distance matrix D: 1 MR job (D is 10m x 10, dense: ~800 MB)
2. Minimum distance for each record, minD: in memory (minD is 10m x 1, dense: ~80 MB)
3. Find all closest centroids for each record, P: in memory
4. Compute new centers, C: in memory (C is 10 x 500, dense: ~40 MB)
5. Compute normalized P that accounts for records with multiple closest centroids: 1 MR job
K-Means execution plan, input data X = 300m x 500 (4 TB text file), K = 10

Starting from the input data (X), the steps execute as follows:
1. Compute distance matrix D: 1 MR job (D is 300m x 10, dense: ~24 GB)
2. Minimum distance for each record, minD: 1 MR job (minD is 300m x 1, dense: ~2.4 GB)
3. Find all closest centroids for each record, P: 1 MR job
4. Compute new centers, C: 4 MR jobs (C is 10 x 500, dense: ~40 MB)
5. Compute normalized P that accounts for records with multiple closest centroids: 2 MR jobs
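The intermediate sizes quoted for D and minD on these two slides follow directly from dense double-precision storage at 8 bytes per cell, which is quick to verify:

```r
# Dense matrix footprint in bytes: rows * cols * 8 (one double per cell).
dense_bytes <- function(rows, cols) rows * cols * 8

d_small  <- dense_bytes(10e6, 10)    # 8e8 bytes    = ~800 MB (10m x 10 matrix D)
minD_sm  <- dense_bytes(10e6, 1)     # 8e7 bytes    = ~80 MB  (10m x 1 vector minD)
d_large  <- dense_bytes(300e6, 10)   # 2.4e10 bytes = ~24 GB  (300m x 10 matrix D)
minD_lg  <- dense_bytes(300e6, 1)    # 2.4e9 bytes  = ~2.4 GB (300m x 1 vector minD)
```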
Matrix Factorization Sample: Scalability and Performance

Physical cluster:
• 5 machines, each 2x4 (16 HWT), 64 GB RAM
• 1.5 TB storage, 1 GbE

Hadoop cluster:
• Map capacity: 80
• Reduce capacity: 10
• -Xmx1024m
• SystemML

Execution:
• All operations execute on a single machine: 0 MR jobs
• Hybrid execution (majority of operations execute on a single machine): 4 MR jobs
• Hybrid execution (majority of operations execute in MapReduce): 6 MR jobs