Big Data Analysis With RHadoop


DESCRIPTION

Speaking of big data analysis, what comes to mind is probably HDFS and MapReduce on Hadoop. But to write a MapReduce program, one normally has to learn native Java or go through a Thrift interface. Is it possible to use R, the language most widely adopted by data scientists, to implement MapReduce programs instead? And through the integration of R and Hadoop, can one truly combine R's powerful analytics with parallel computing over big data? These slides introduce, step by step, how to install the RHadoop packages and how to write a MapReduce program in R. More importantly, they discuss whether RHadoop is really a guiding light for big data analysis, or just another way to write MapReduce programs. Please email me if you find any problems with the slides. EMAIL: [email protected]

TRANSCRIPT

Page 1: Big Data Analysis With RHadoop

Big Data Analysis With RHadoop

David Chiu (Yu-Wei, Chiu)

@ML/DM Monday

2014/03/17

Page 2: Big Data Analysis With RHadoop

About Me

Co-Founder of NumerInfo

Ex-Trend Micro Engineer

ywchiu-tw.appspot.com

Page 3: Big Data Analysis With RHadoop

R + Hadoop

http://cdn.overdope.com/wp-content/uploads/2014/03/dragon-ball-z-x-hmn-alns-fusion-1.jpg

Page 4: Big Data Analysis With RHadoop

Scaling: RHadoop enables R to do parallel computing

No need to learn a new language: learning Java takes time

Why Use RHadoop

Page 5: Big Data Analysis With RHadoop

RHadoop Architecture

[Architecture diagram] The R packages rhdfs, rhbase, and rmr2 sit between R and Hadoop: rhdfs talks to HDFS, rhbase talks to HBase through the HBase Thrift Gateway, and rmr2 talks to MapReduce through the Streaming API

Page 6: Big Data Analysis With RHadoop

Enables developers to write the mapper/reducer in any scripting language (R, Python, Perl)

Mapper, reducer, and optional combiner processes are written to read from standard input and write to standard output

A streaming job has the additional overhead of starting a scripting VM

Streaming vs. Native Java
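
To make the streaming model concrete, here is a minimal sketch (not from the slides) of a word-count mapper written as a standalone R streaming script. rmr2 generates wrappers like this behind the Streaming API, so the format below is illustrative only:

# Minimal Hadoop Streaming mapper in R: read raw lines from stdin,
# emit tab-separated "word<TAB>1" pairs on stdout.
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  words <- unlist(strsplit(line, "\\s+"))
  for (w in words[nchar(words) > 0]) cat(sprintf("%s\t1\n", w))
}
close(con)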

Page 7: Big Data Analysis With RHadoop

Writing MapReduce Using R

The mapreduce function: mapreduce(input, output, map, reduce, ...)

Changelog:
rmr 3.0.0 (2014/02/10): 10X faster than rmr 2.3.0
rmr 2.3.0 (2013/10/07): supports plyrmr

rmr2
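
As a minimal illustration of that signature (assuming the HADOOP_CMD and HADOOP_STREAMING environment variables shown later in these slides are already set), the following squares a small vector on the cluster and reads the result back:

library(rmr2)
small.ints <- to.dfs(1:10)                 # write a vector to HDFS
out <- mapreduce(input = small.ints,       # map-only job: square each value
                 map = function(k, v) keyval(v, v^2))
from.dfs(out)                              # read the key/value result back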

Page 8: Big Data Analysis With RHadoop

Access HDFS From R

Exchange data between R data frames and HDFS

rhdfs
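
A minimal sketch of that exchange (the paths are illustrative; HADOOP_CMD must point at your hadoop binary):

Sys.setenv(HADOOP_CMD = "/usr/bin/hadoop")
library(rhdfs)
hdfs.init()
write.csv(mtcars, "mtcars.csv")             # local file from an R data frame
hdfs.put("mtcars.csv", "/tmp/mtcars.csv")   # copy local -> HDFS
hdfs.ls("/tmp")                             # inspect the HDFS directory
hdfs.get("/tmp/mtcars.csv", "mtcars2.csv")  # copy HDFS -> local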

Page 9: Big Data Analysis With RHadoop

Exchange data between R and HBase

Using the Thrift API

rhbase
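
A minimal sketch, assuming an HBase Thrift server is running on its default port; the table, column family, and values below are made up for illustration, and the hb.insert argument shape follows the rhbase examples:

library(rhbase)
hb.init()                                   # connect to the Thrift gateway
hb.new.table("test_table", "cf")            # table with one column family "cf"
hb.insert("test_table", list(list("row1", c("cf:greeting"), list("hello"))))
hb.get("test_table", "row1")                # read the row back
hb.delete.table("test_table")               # clean up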

Page 10: Big Data Analysis With RHadoop

Perform common data manipulation operations, as found in plyr and reshape2

It provides a familiar plyr-like interface while hiding many of the mapreduce details

plyr: Tools for splitting, applying and combining data

NEW! plyrmr

Page 11: Big Data Analysis With RHadoop

RHadoop Installation

Page 12: Big Data Analysis With RHadoop

R and related packages should be installed on each task node of the cluster

A Hadoop cluster: CDH3 and higher, or Apache Hadoop 1.0.2 and higher (limited to MRv1, not MRv2). MRv2 compatibility starts with Apache Hadoop 2.2.0 or HDP 2

Prerequisites

Page 13: Big Data Analysis With RHadoop

Getting Ready (Cloudera VM)

Download: http://www.cloudera.com/content/cloudera-content/cloudera-docs/DemoVMs/Cloudera-QuickStart-VM/cloudera_quickstart_vm.html

This VM runs CentOS 6.2, CDH 4.4, R 3.0.1, and Java 1.6.0_32

Page 14: Big Data Analysis With RHadoop

CDH 4.4

Page 15: Big Data Analysis With RHadoop

Get RHadoop

https://github.com/RevolutionAnalytics/RHadoop/wiki/Downloads

Page 16: Big Data Analysis With RHadoop

Installing rmr2 dependencies

Make sure the packages are installed system-wide

$ sudo R

> install.packages(c("codetools", "Rcpp", "RJSONIO", "bitops", "digest", "functional", "stringr", "plyr", "reshape2", "rJava", "caTools"))

Page 17: Big Data Analysis With RHadoop

Install rmr2

$ wget --no-check-certificate https://raw.github.com/RevolutionAnalytics/rmr2/3.0.0/build/rmr2_3.0.0.tar.gz

$ sudo R CMD INSTALL rmr2_3.0.0.tar.gz

Page 18: Big Data Analysis With RHadoop

Installing…

Page 19: Big Data Analysis With RHadoop

If the rmr2 install fails against the newest Rcpp, fetch an older release from the CRAN archive: http://cran.r-project.org/src/contrib/Archive/Rcpp/

Downgrade Rcpp

Page 20: Big Data Analysis With RHadoop

$ wget --no-check-certificate http://cran.r-project.org/src/contrib/Archive/Rcpp/Rcpp_0.11.0.tar.gz

$ sudo R CMD INSTALL Rcpp_0.11.0.tar.gz

Install Rcpp_0.11.0

Page 21: Big Data Analysis With RHadoop

$ sudo R CMD INSTALL rmr2_3.0.0.tar.gz

Install rmr2 again

Page 22: Big Data Analysis With RHadoop

Install RHDFS

$ wget --no-check-certificate https://raw.github.com/RevolutionAnalytics/rhdfs/master/build/rhdfs_1.0.8.tar.gz

$ sudo HADOOP_CMD=/usr/bin/hadoop R CMD INSTALL rhdfs_1.0.8.tar.gz

Page 23: Big Data Analysis With RHadoop

Enable HDFS

> Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")

> Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.4.0.jar")

> library(rmr2)

> library(rhdfs)

> hdfs.init()

Page 24: Big Data Analysis With RHadoop

Javareconf error

$ sudo R CMD javareconf

Page 25: Big Data Analysis With RHadoop

javareconf with correct JAVA_HOME

$ echo $JAVA_HOME

$ sudo JAVA_HOME=/usr/java/jdk1.6.0_32 R CMD javareconf

Page 26: Big Data Analysis With RHadoop

MapReduce With RHadoop

Page 27: Big Data Analysis With RHadoop

MapReduce

mapreduce(input, output, map, reduce)

Like sapply, lapply, tapply within R

Page 28: Big Data Analysis With RHadoop

Hello World – For Hadoop

http://www.rabidgremlin.com/data20/MapReduceWordCountOverview1.png

Page 29: Big Data Analysis With RHadoop

Move File Into HDFS

# Put data into hdfs

Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.4.0.jar")
library(rmr2)
library(rhdfs)
hdfs.init()
hdfs.mkdir("/user/cloudera/wordcount/data")
hdfs.put("wc_input.txt", "/user/cloudera/wordcount/data")

$ hadoop fs -mkdir /user/cloudera/wordcount/data
$ hadoop fs -put wc_input.txt /user/cloudera/wordcount/data

Page 30: Big Data Analysis With RHadoop

Wordcount Mapper

map <- function(k, lines) {
  words.list <- strsplit(lines, '\\s')
  words <- unlist(words.list)
  return(keyval(words, 1))
}

public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}

#Mapper

Page 31: Big Data Analysis With RHadoop

Wordcount Reducer

reduce <- function(word, counts) {
  keyval(word, sum(counts))
}

public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}

#Reducer

Page 32: Big Data Analysis With RHadoop

Call Wordcount

hdfs.root <- 'wordcount'
hdfs.data <- file.path(hdfs.root, 'data')
hdfs.out <- file.path(hdfs.root, 'out')

wordcount <- function(input, output = NULL) {
  mapreduce(input = input, output = output, input.format = "text", map = map, reduce = reduce)
}
out <- wordcount(hdfs.data, hdfs.out)

Page 33: Big Data Analysis With RHadoop

Read data from HDFS

results <- from.dfs(out)

results$key[order(results$val, decreasing = TRUE)][1:10]

$ hadoop fs -cat /user/cloudera/wordcount/out/part-00000 | sort -k 2 -nr | head -n 10

Page 34: Big Data Analysis With RHadoop

MapReduce Benchmark

> a.time <- proc.time()
> small.ints2 = 1:100000
> result.normal = sapply(small.ints2, function(x) x^2)
> proc.time() - a.time

> b.time <- proc.time()
> small.ints = to.dfs(1:100000)
> result = mapreduce(input = small.ints, map = function(k, v) cbind(v, v^2))
> proc.time() - b.time

vs.

Page 35: Big Data Analysis With RHadoop

sapply

Elapsed 0.982 seconds

Page 36: Big Data Analysis With RHadoop

mapreduce

Elapsed 102.755 seconds

Page 37: Big Data Analysis With RHadoop

HDFS stores your files as data chunks distributed across multiple datanodes

M/R runs multiple programs called mappers on each of the data chunks or blocks. The (key, value) output of these mappers is then combined into the result by reducers.

It takes time for mappers and reducers to be spawned on this distributed system.

Hadoop Latency

Page 38: Big Data Analysis With RHadoop

It is not possible to directly apply a built-in machine learning function, such as kmeans below, in a MapReduce program

kcluster <- kmeans(mydata, 4, iter.max = 10)

Kmeans Clustering

Page 39: Big Data Analysis With RHadoop

kmeans = function(points, ncenters, iterations = 10, distfun = NULL) {
  if (is.null(distfun))
    distfun = function(a, b) norm(as.matrix(a - b), type = 'F')
  newCenters = kmeans.iter(points, distfun, ncenters = ncenters)
  # iteratively choosing new centers
  for (i in 1:iterations) {
    newCenters = kmeans.iter(points, distfun, centers = newCenters)
  }
  newCenters
}

Kmeans in MapReduce Style

Page 40: Big Data Analysis With RHadoop

kmeans.iter = function(points, distfun, ncenters = dim(centers)[1], centers = NULL) {
  from.dfs(
    mapreduce(
      input = points,
      map = if (is.null(centers)) {
        # give a random center as sample
        function(k, v) keyval(sample(1:ncenters, 1), v)
      } else {
        function(k, v) {
          # find the center of minimum distance
          distances = apply(centers, 1, function(c) distfun(c, v))
          keyval(centers[which.min(distances), ], v)
        }
      },
      reduce = function(k, vv) keyval(NULL, apply(do.call(rbind, vv), 2, mean))),
    to.data.frame = TRUE)
}

Kmeans in MapReduce Style
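
A hypothetical invocation of the sketch above (note that this kmeans masks stats::kmeans, and the to.data.frame argument follows the older rmr API used on these slides):

# Illustrative only: 100 random 2-D points written to HDFS, 5 iterations.
points <- matrix(rnorm(200), ncol = 2)
clusters <- kmeans(to.dfs(points), ncenters = 4, iterations = 5)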

Page 41: Big Data Analysis With RHadoop

One More Thing… plyrmr

Page 42: Big Data Analysis With RHadoop

Perform common data manipulation operations, as found in plyr and reshape2

It provides a familiar plyr-like interface while hiding many of the mapreduce details

plyr: Tools for splitting, applying and combining data

NEW! plyrmr

Page 43: Big Data Analysis With RHadoop

Installing plyrmr dependencies

$ sudo yum install libxml2-devel
$ sudo yum install curl-devel
$ sudo R

> install.packages(c("RCurl", "httr"), dependencies = TRUE)
> install.packages("devtools", dependencies = TRUE)
> library(devtools)
> install_github("pryr", "hadley")
> install.packages(c("R.methodsS3", "hydroPSO"), dependencies = TRUE)

Page 44: Big Data Analysis With RHadoop

$ wget --no-check-certificate https://raw.github.com/RevolutionAnalytics/plyrmr/master/build/plyrmr_0.1.0.tar.gz

$ sudo R CMD INSTALL plyrmr_0.1.0.tar.gz

Installing plyrmr

Page 45: Big Data Analysis With RHadoop

> data(mtcars)
> head(mtcars)
> transform(mtcars, carb.per.cyl = carb/cyl)

> library(plyrmr)
> output(input(mtcars), "/tmp/mtcars")
> as.data.frame(transform(input("/tmp/mtcars"), carb.per.cyl = carb/cyl))
> output(transform(input("/tmp/mtcars"), carb.per.cyl = carb/cyl), "/tmp/mtcars.out")

Transform in plyrmr

Page 46: Big Data Analysis With RHadoop

where(
  select(
    mtcars,
    carb.per.cyl = carb/cyl,
    .replace = FALSE),
  carb.per.cyl >= 1)

select and where

https://github.com/RevolutionAnalytics/plyrmr/blob/master/docs/tutorial.md
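
For comparison (not on the slide), the same select-then-filter on a local data frame in plain R:

df <- transform(mtcars, carb.per.cyl = carb / cyl)  # add the derived column
df[df$carb.per.cyl >= 1, ]                          # keep rows where ratio >= 1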

Page 47: Big Data Analysis With RHadoop

as.data.frame(
  select(
    group(
      input("/tmp/mtcars"),
      cyl),
    mean.mpg = mean(mpg)))

Group by

https://github.com/RevolutionAnalytics/plyrmr/blob/master/docs/tutorial.md
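
For comparison (not on the slide), the same group-by on a local data frame in plain R:

aggregate(mpg ~ cyl, data = mtcars, FUN = mean)     # mean mpg per cylinder count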

Page 49: Big Data Analysis With RHadoop

Website: ywchiu-tw.appspot.com

Email: [email protected]

Company: numerinfo.com

Contact

Page 50: Big Data Analysis With RHadoop