TRANSCRIPT
© 2015 IBM Corporation
Running the R Language on Hadoop: Big R
What is Open Source R? What is CRAN?
R is a powerful programming language and environment for statistical computing and graphics.
R offers a rich analytics ecosystem:
− Full analytics life-cycle
  • Data exploration
  • Statistical analysis
  • Modeling, machine learning, simulations
  • Visualization
− Highly extensible via user-submitted packages
  • Tap into an innovation pipeline contributed to by highly regarded statisticians
  • Currently 4700+ statistical packages in the repository
  • Easily accessible via CRAN, the Comprehensive R Archive Network
− R is one of the fastest-growing data analysis software environments
  • Deeply knowledgeable and supportive analytics community
  • The most popular software in data analysis competitions
  • Gaining ground in corporate, government, and academic settings
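As a taste of that life-cycle, here is a minimal base-R session on the built-in mtcars dataset covering exploration, statistical analysis, and modeling (visualization omitted so the snippet stays text-only):

```r
# Minimal end-to-end R analytics life-cycle using only base R.
data(mtcars)

# Data exploration: structure and a summary of fuel economy.
str(mtcars)
summary(mtcars$mpg)

# Statistical analysis: correlation between car weight and fuel economy.
r <- cor(mtcars$wt, mtcars$mpg)   # strongly negative: heavier cars use more fuel

# Modeling: simple linear regression of mpg on weight.
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)
```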
The Explainer: Data in Hadoop
[Diagram: you, the R user, working with data distributed across a Hadoop cluster]
Data in Hadoop: Open Source R on a single node
[Diagram: you, the R user, running open source R on a single node, apart from the distributed data]
Challenges with Running Large-Scale Analytics
− Traditional approach: analyze small subsets of information; only a fraction of all available information is analyzed
− Big data approach: analyze all information; all available information is analyzed
Various Approaches to integrate R with Hadoop
Key challenges of using R with Hadoop:
− In R, processing is fundamentally memory bound: data frames and matrices are loaded into memory, and all processing happens there.
− It is therefore hard to integrate R with a fundamentally distributed processing paradigm like MapReduce (Hadoop).

Various approaches to integrate R with Hadoop:
− RHIPE: an open source framework that integrates R with Hadoop through MapReduce coding on the client side.
− RHadoop/rmr: RHadoop likewise provides an open source framework that integrates R with Hadoop on the client side.
− Hadoop Streaming: open source, part of the Hadoop framework; invokes an R script in MapReduce through the I/O stream.
− Oracle Enterprise R Connector for Hadoop: a licensed product; essentially an adaptation of Oracle R to work with any Hadoop distribution through MapReduce coding on the client side.
− Big R: a licensed product; a proprietary mechanism to integrate with Hadoop without the need for MapReduce-style R code.
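As an illustration of the Hadoop Streaming approach, here is a sketch of an R mapper script that reads input splits line by line on stdin and emits tab-separated key/value pairs on stdout. The script name and the column positions (1 = city, 3 = minTemp, 6 = maxTemp, matching the RHIPE example on the next slide) are assumptions:

```r
#!/usr/bin/env Rscript
# Sketch of a Hadoop Streaming mapper in R. Hadoop feeds each input split
# line by line on stdin; we write key<TAB>value records to stdout.

# Turn one CSV line into "city<TAB>avg of (minTemp, maxTemp)".
emit_city_avg <- function(line) {
  fields <- strsplit(line, ",")[[1]]
  paste0(fields[1], "\t", mean(c(as.double(fields[3]), as.double(fields[6]))))
}

# Stream loop: exits immediately at end of input.
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  cat(emit_city_avg(line), "\n", sep = "")
}
close(con)
```

A job would then be submitted with the Hadoop Streaming jar, passing this script as the `-mapper` (and a similar one as the `-reducer`).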
Sample Code – R with RHIPE for Hadoop integration
Original R Code
tempData <- read.table("temperature.csv", header = TRUE, sep = ",")
coltypes(tempData) = ifelse(1:10 %in% c(3, 4), "numeric", "character")
maxMin <- tempData[, c("minTemp", "maxTemp")]
tempData$avgTempDay <- rowMeans(maxMin)
avgTempCity <- aggregate(tempData$avgTempDay, by = list(city = tempData$city), FUN = mean)
write.table(avgTempCity, file = "output.csv", sep = ",")
R Code using the RHIPE package

library(Rhipe)
rhinit(TRUE, TRUE)

map <- expression({
  process_line <- function(currentLine) {
    fields <- unlist(strsplit(currentLine, ","))
    maxMin <- c(as.double(fields[3]), as.double(fields[6]))
    rhcollect(fields[1], toString(mean(maxMin)))
  }
  lapply(map.values, process_line)
})

reduce <- expression(
  pre    = { means <- numeric(0) },
  reduce = { means <- c(means, as.numeric(unlist(reduce.values))) },
  post   = { rhcollect(reduce.key, toString(mean(means))) }
)

input_file <- "temperature.csv"
output_dir <- "output.csv"

job <- rhmr(jobname = "TempAvg", map = map, reduce = reduce,
            ifolder = input_file, ofolder = output_dir,
            inout = c("text", "sequence"))
rhex(job)
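The job above computes, per city, the mean over records of mean(minTemp, maxTemp). That logic can be sanity-checked locally in base R on a tiny in-memory sample (the column positions match the map expression; the sample data is made up):

```r
# Tiny in-memory stand-in for temperature.csv; only columns 1, 3 and 6 matter.
lines <- c("Austin,x,10,x,x,30",
           "Austin,x,20,x,x,40",
           "Boston,x,0,x,x,10")

fields <- strsplit(lines, ",")
city   <- sapply(fields, `[`, 1)
# The "map" step: per-record average of minTemp and maxTemp.
rowAvg <- sapply(fields, function(f) mean(as.double(f[c(3, 6)])))
# The "reduce" step: mean of the per-record averages within each city.
avgByCity <- tapply(rowAvg, city, mean)
```

Here Austin's records average (20, 30), giving 25, and Boston's single record gives 5.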
Overview of BigInsights

IBM BigInsights layers its features on the IBM Open Platform with Apache Hadoop* (HDFS, YARN, MapReduce, Ambari, HBase, Hive, Oozie, Parquet, Parquet Format, Pig, Snappy, Solr, Spark, Sqoop, ZooKeeper, Open JDK, Knox, Slider):
− BigInsights Analyst: industry-standard SQL (Big SQL) and a spreadsheet-style tool (BigSheets)
− BigInsights Data Scientist: Big R (R support), machine learning on Big R, and text analytics
− BigInsights Enterprise Management: POSIX distributed filesystem and multi-workload, multi-tenant scheduling

Free Quick Start (non-production):
• IBM Open Platform
• BigInsights Analyst and Data Scientist features
• Community support
…
IBM BigInsights brings efficient integration of R with Big R
− R as a big data query language
  • Outside-in execution: pull data (summaries) to the R client
− R as a statistical language for deep computing
  • Inside-out execution: push R functions right onto the data
  • Partitioning of large data ("divide")
  • Parallel cluster execution of pushed-down R code ("conquer")
  • Almost any R package can run in this environment
− R as the gateway to scalable machine learning
  • A scalable ML engine that provides canned algorithms, and an ability to author new ones, all via R

[Diagram: R clients (with R packages) connect to data sources and a scalable ML engine with embedded R execution; data summaries are pulled to the R client, or R functions are pushed onto the data]
Big R Architecture
[Diagram: Big R architecture: an R user interface on top of (1) scalable data processing, (2) native R functions, and (3) scalable algorithms]
Sample Code – BigR
Code using Big R
library(bigr)

temperatureData <- bigr.frame(dataSource = "DEL",
                              dataPath = "/user/temperature.csv",
                              header = TRUE)
coltypes(temperatureData) = ifelse(1:10 %in% c(3, 6), "numeric", "character")

buildAvgTempFunc <- function(df) {
  maxMin <- df[, c("minTemp", "maxTemp")]
  df$avgTempDay <- rowMeans(maxMin)
  avgTempCity <- aggregate(df$avgTempDay, by = list(city = df$city), FUN = mean)
  return(data.frame(avgTempCity))
}

avgTemperature <- groupApply(temperatureData, temperatureData$city,
                             buildAvgTempFunc,
                             data.frame(city = "city", average_temperature = 1.0))

bigr.persist(avgTemperature, dataSource = "DEL",
             dataPath = "/user/output.csv", header = TRUE, del = ",")
− This code (using Big R) achieves the same result as the original R code, operating on the same dataset in the CSV file in HDFS.
− Note that buildAvgTempFunc contains the same R code snippet as the original R code.
− The groupApply function is specific to the bigr package; other similarly useful functions are rowApply and tableApply.
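Conceptually, groupApply partitions the data by the grouping column, runs the user function on each partition inside Hadoop, and unions the returned data frames. The semantics (not the distributed implementation) can be sketched locally with base R split/lapply; local_groupApply is a made-up stand-in:

```r
# Local emulation of groupApply semantics: one partition per group value,
# apply the user function to each partition, union the resulting data frames.
local_groupApply <- function(df, groupCol, fun) {
  parts <- split(df, df[[groupCol]])
  do.call(rbind, lapply(parts, fun))
}

# Tiny in-memory dataset and the same per-group function as the slide.
tempData <- data.frame(city = c("A", "A", "B"),
                       minTemp = c(10, 20, 0),
                       maxTemp = c(30, 40, 10))
buildAvgTempFunc <- function(df) {
  df$avgTempDay <- rowMeans(df[, c("minTemp", "maxTemp")])
  aggregate(df$avgTempDay, by = list(city = df$city), FUN = mean)
}

out <- local_groupApply(tempData, "city", buildAvgTempFunc)
```

City A's per-day averages are 20 and 30 (mean 25); city B's single day averages 5.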
Original R Code
tempData <- read.table("temperature.csv", header = TRUE, sep = ",")
coltypes(tempData) = ifelse(1:10 %in% c(3, 4), "numeric", "character")
maxMin <- tempData[, c("minTemp", "maxTemp")]
tempData$avgTempDay <- rowMeans(maxMin)
avgTempCity <- aggregate(tempData$avgTempDay, by = list(city = tempData$city), FUN = mean)
write.table(avgTempCity, file = "output.csv", sep = ",")
User Experience for Big R
1. Connect to the BI cluster
2. Create a data frame proxy to the large data file
3. Apply a data transformation step
4. Run scalable linear regression on the cluster
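Put together, the four steps above might look like the following Big R session. This is a sketch, not runnable without a BigInsights cluster: the host, credentials, file path, column names, and the exact bigr.connect and bigr.lm signatures are assumptions; only bigr.frame, coltypes, and bigr.lm appear elsewhere in this deck.

```r
library(bigr)

# 1. Connect to the BI cluster (host and credentials are placeholders,
#    and the bigr.connect argument names are assumed).
bigr.connect(host = "bi-cluster.example.com", user = "biadmin", password = "...")

# 2. Data frame proxy to a large file in HDFS; no data is pulled to the client.
flights <- bigr.frame(dataSource = "DEL",
                      dataPath = "/user/flights.csv", header = TRUE)
coltypes(flights) = ifelse(1:5 %in% c(2, 3), "numeric", "character")

# 3. Data transformation step on the proxy (hypothetical columns).
flights$delayed <- ifelse(flights$arrDelay > 15, 1, 0)

# 4. Run scalable linear regression on the cluster (lm-like call assumed).
model <- bigr.lm(arrDelay ~ distance, data = flights)
```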
3 Key Capabilities in Big R
1. Use of R as a language on big data
   − Scalable data processing
2. Running native R functions in Hadoop
   − Can leverage existing R assets (code and CRAN packages)
3. Running scalable algorithms beyond R in Hadoop
   − Wide class of algorithms, and growing
   − R-like syntax to develop new algorithms and customize existing algorithms

End-to-end integration of R into BigInsights Hadoop
Big R Data Structures: Proxy to entire dataset
data <- bigr.frame(…)
Appears and acts like all of the data is on your laptop
Out-of-box Big R Functions: Seamlessly compile into MapReduce
dataCorrelation <- cor(data)
MapReduce job runs over the entire dataset
Big R Partitioned Execution: Run R functions on partitions of data
[Diagram: the dataset is split into many partitions; each map task stands up its own instance of R]

models <- rowApply(... some R function ...)
Big R Partitioned Execution: How rowApply works (on 4 nodes)
[Diagram: the logical dataset is divided into four partitions of rows; each partition stands up an instance of R on one of the four nodes]
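The rowApply contract sketched above (chunk the rows, hand each chunk to its own R invocation, collect the results) can be emulated locally in base R; local_rowApply is a hypothetical stand-in, not the bigr function:

```r
# Local emulation of rowApply semantics: split the rows into fixed-size
# chunks (one per "map task"), apply the user function to each chunk,
# and collect the per-chunk results in a list.
local_rowApply <- function(df, fun, rowsPerChunk) {
  idx    <- ceiling(seq_len(nrow(df)) / rowsPerChunk)  # partition ids 1, 2, ...
  chunks <- split(df, idx)
  lapply(chunks, fun)
}

df  <- data.frame(x = 1:10)
out <- local_rowApply(df, function(chunk) sum(chunk$x), rowsPerChunk = 4)
```

With 10 rows and chunks of 4, the function runs three times, on rows 1:4, 5:8, and 9:10.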
SystemML from Big R: Statistics and machine learning at scale
model <- bigr.lm(…)
Optimized MapReduce jobs run over the entire dataset
Rich Functionality in Big R
Big R functions:

Connection: connect, disconnect, …
HDFS: listfs, rmfs

Types and functions
• Types: bigr.frame, bigr.vector
• Functions: dim, nrow, colnames, coltypes, head, tail, na.string, na.omit, sort, summary
• Coercion and casting: as.bigr.frame, as.data.frame, as.bigr.vector, as.integer, as.logical, as.numeric

Built-in functions
• Arithmetic: +, -, *, /, ^
• Mathematical: abs, acos, asin, atan, ceiling, floor, exp, …
• String: grepl, substr
• Statistical: cor, cov, mean, sd
• Miscellaneous: attach, pull, random, sample, ifelse
• Visualization: histogram

Apply R functions: groupApply, tableApply, rowApply
Run scalable algorithms: bigr.lm, bigr.svm, bigr. … (see subsequent slide)
Scalable Machine Learning Algorithms in Big R
Category / description / Big R function:

Descriptive statistics
• Univariate: bigr.univariateStats()
• Bivariate: bigr.bivariateStats()
• Stratified bivariate: bigr.bivariateStats()

Classification
• Logistic regression (multinomial): bigr.logistic.regression()
• Multi-class SVM: bigr.svm()
• Naïve Bayes (multinomial): bigr.naive.bayes()

Clustering
• k-Means: bigr.kmeans()

Regression
• Linear regression, system of equations: bigr.lm()
• Linear regression, CG (conjugate gradient descent): bigr.lm()
• Generalized linear models (GLM), with distributions Gaussian, Poisson, Gamma, inverse Gaussian, binomial, and Bernoulli: bigr.glm()
• GLM links for all distributions (identity, log, square root, inverse, 1/µ²): bigr.glm()
• GLM links for binomial/Bernoulli (logit, probit, cloglog, cauchit): bigr.glm()

Predict
• Scoring: bigr.predict()

Transformation
• Dummy coding, binning, scaling, missing value imputation: bigr.transform()
What’s behind running Big R’s Scalable Algorithms?
− A high-level declarative language with R-like syntax shields your algorithm investment from platform progression
− Cost-based compilation of algorithms generates execution plans
  • Compilation and parallelization based on data characteristics
  • Compilation and parallelization based on cluster and machine characteristics
  • In-memory single-node and MapReduce execution
− Algorithm developers stay productive when building additional algorithms (scalability, numeric stability, and optimizations)

[Diagram: generated plans run either in memory on a single node or on the Hadoop cluster]

Declarative analytics:
1) Future-proof algorithm investment
2) Automatic performance tuning
K-Means execution plan, input data X = 10m x 500 (135 GB text file), K = 10

Starting from the input data (X), the steps execute as follows:
1. Compute distance matrix D: 1 MR job (D is 10m x 10, dense: ~800 MB)
2. Minimum distance for each record, minD: in memory (minD is 10m x 1, dense: ~80 MB)
3. Find all closest centroids for each record, P: in memory
4. Compute new centers, C: in memory (C is 10 x 500, dense: ~40 MB)
5. Compute normalized P that accounts for records with multiple closest centroids: 1 MR job
K-Means execution plan, input data X = 300m x 500 (4 TB text file), K = 10

Starting from the input data (X), the steps execute as follows:
1. Compute distance matrix D: 1 MR job (D is 300m x 10, dense: ~24 GB)
2. Minimum distance for each record, minD: 1 MR job (minD is 300m x 1, dense: ~2.4 GB)
3. Find all closest centroids for each record, P: 1 MR job
4. Compute new centers, C: 4 MR jobs (C is 10 x 500, dense: ~40 MB)
5. Compute normalized P that accounts for records with multiple closest centroids: 2 MR jobs
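The intermediate sizes quoted for D and minD on these two slides follow directly from dense double-precision storage at 8 bytes per cell, which is quick to verify:

```r
# Dense matrix footprint in bytes: rows * cols * 8 (one double per cell).
dense_bytes <- function(rows, cols) rows * cols * 8

d_small  <- dense_bytes(10e6, 10)    # 8e8 bytes    = ~800 MB (10m x 10 matrix D)
minD_sm  <- dense_bytes(10e6, 1)     # 8e7 bytes    = ~80 MB  (10m x 1 vector minD)
d_large  <- dense_bytes(300e6, 10)   # 2.4e10 bytes = ~24 GB  (300m x 10 matrix D)
minD_lg  <- dense_bytes(300e6, 1)    # 2.4e9 bytes  = ~2.4 GB (300m x 1 vector minD)
```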
Matrix Factorization Sample: Scalability and Performance

Physical cluster:
• 5 machines, each 2x4 (16 HWT), 64 GB RAM
• 1.5 TB storage, 1 GbE

Hadoop cluster:
• Map capacity: 80
• Reduce capacity: 10
• -Xmx1024m
• SystemML

Execution:
• All operations execute on a single machine: 0 MR jobs
• Hybrid execution (majority of operations execute on a single machine): 4 MR jobs
• Hybrid execution (majority of operations execute in MapReduce): 6 MR jobs