Big Data Analytics with R and Hadoop

Chapter 3: Integrating R and Hadoop

Sang-Min Song

2015.04.09

Three ways to link R and Hadoop

RHIPE

RHadoop

Hadoop streaming

Introducing RHIPE

RHIPE stands for R and Hadoop Integrated Programming Environment.

The name means "in a moment" in Greek, and the package is a merger of R and Hadoop.

The RHIPE package uses the Divide and Recombine technique to perform data analytics over Big Data.

RHIPE has mainly been designed to accomplish two goals:

Allowing you to perform in-depth analysis of large as well as small data.

Allowing users to perform the analytics operations within R using a lower-level language.

RHIPE provides a lower-level interface to HDFS and MapReduce operations.
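
A minimal sketch of the Divide and Recombine idea in plain R, without RHIPE or Hadoop. It assumes the data fits in memory and uses the mean as the analytic method; the names x, subsets, partial, and dr_mean are illustrative and not part of RHIPE. On a cluster, the per-subset step would run as MapReduce tasks instead of a local lapply().

```r
# Divide and Recombine, simulated in base R (no Hadoop or RHIPE involved).
set.seed(1)
x <- rnorm(1000, mean = 5)            # stand-in for a large data set

# Divide: split the data into 10 subsets of 100 values each.
subset_id <- rep(1:10, each = 100)
subsets <- split(x, subset_id)

# Apply: run the same analytic method on every subset independently.
# Each subset reports its sum and its size.
partial <- lapply(subsets, function(s) c(sum = sum(s), n = length(s)))

# Recombine: add up the per-subset results and form the overall mean.
totals <- Reduce(`+`, partial)
dr_mean <- unname(totals["sum"] / totals["n"])

print(dr_mean)
print(all.equal(dr_mean, mean(x)))    # compare with the direct computation
```

Carrying both the sum and the count through the recombine step keeps the result exact even if the subsets had unequal sizes; averaging the per-subset means directly would only be correct for equal-sized subsets.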

Install Sequence

1. Installing Hadoop.

2. Installing R.

3. Installing protocol buffers.

4. Setting up environment variables.

5. Installing rJava.

6. Installing RHIPE.

Installing RHIPE

3. Installing protocol buffers

4. Environment variables

~/.bashrc file of hduser (Hadoop user)

R console

5. The rJava package installation

6. Installing RHIPE

Understanding the architecture of RHIPE

Word count

Understanding the RHIPE function reference

The RHIPE methods fall into three categories: initialization, HDFS, and MapReduce operations.

Initialization

rhinit(TRUE,TRUE)

HDFS

rhls(path)
hdfs.getwd()
hdfs.setwd("/RHIPE")
rhput(src,dest) and rhput("/usr/local/hadoop/NOTICE.txt","/RHIPE/")
rhcp('/RHIPE/1/change.txt','/RHIPE/2/change.txt')
rhdel("/RHIPE/1")
rhget("/RHIPE/1/part-r-00000", "/usr/local/")
rhwrite(list(1,2,3),"/tmp/x")

MapReduce

rhwatch(map, reduce, combiner, input, output, mapred, partitioner, jobname)
rhex(job)
rhjoin(job)
rhkill(job)
rhoptions()
rhstatus(job)

Introducing RHadoop

RHadoop is available as three main R packages: rhdfs, rmr, and rhbase.

rhdfs is an R interface that provides HDFS access from the R console.

rmr is an R interface that provides the Hadoop MapReduce facility inside the R environment.

rhbase is an R interface for operating on Hadoop HBase data sources stored on the distributed network, via a Thrift server.

Understanding the architecture of RHadoop

Since Hadoop is highly popular because of HDFS and MapReduce, Revolution Analytics has developed separate R packages, namely, rhdfs, rmr, and rhbase.

Installing RHadoop

Several R packages must be installed first to connect R with Hadoop:

rJava, RJSONIO, itertools, digest, Rcpp, httr, functional, devtools, plyr, reshape2

Setting environment variables

Installing RHadoop [rhdfs, rmr, rhbase]

Word count

Map phase

Reduce phase

Defining the MapReduce job

Executing the MapReduce job

Exploring the wordcount output
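
The phases listed above can be sketched in plain R. This is a local simulation, not the rmr API: input_lines, map_fn, and reduce_fn are hypothetical names, and split() stands in for the shuffle that Hadoop performs between the map and reduce phases.

```r
# Word count as map and reduce phases, simulated in base R.
input_lines <- c("R and Hadoop", "Hadoop streaming", "R and RHadoop")

# Map phase: emit one (word, 1) pair for every word in an input line.
map_fn <- function(line) {
  words <- strsplit(tolower(line), "[^a-z]+")[[1]]
  words <- words[nzchar(words)]
  data.frame(key = words, val = rep(1L, length(words)),
             stringsAsFactors = FALSE)
}

# Reduce phase: sum all values that share one key.
reduce_fn <- function(key, vals) {
  data.frame(key = key, val = sum(vals), stringsAsFactors = FALSE)
}

# Defining and executing the job: map every line, group the pairs by key
# (the shuffle), then reduce each group.
mapped  <- do.call(rbind, lapply(input_lines, map_fn))
grouped <- split(mapped$val, mapped$key)
reduced <- do.call(rbind, Map(reduce_fn, names(grouped), grouped))

# Exploring the output: most frequent words first, ties broken by word.
result <- reduced[order(-reduced$val, reduced$key), ]
rownames(result) <- NULL
print(result)
```

With rmr, the map and reduce functions would be passed to mapreduce() and would return their pairs through keyval(); the grouping step is then handled by Hadoop rather than by split().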

Understanding the RHadoop function reference

The rhdfs package

Initialization

hdfs.init()
hdfs.defaults()

File manipulation

hdfs.put('/usr/local/hadoop/README.txt','/RHadoop/1/')
hdfs.copy('/RHadoop/1/','/RHadoop/2/')
hdfs.move('/RHadoop/1/README.txt','/RHadoop/2/')
hdfs.rename('/RHadoop/README.txt','/RHadoop/README1.txt')
hdfs.delete("/RHadoop")
hdfs.rm("/RHadoop")
hdfs.chmod('/RHadoop', permissions='777')

File read/write

f = hdfs.file("/RHadoop/2/README.txt","r",buffersize=104857600)
hdfs.write(object,con,hsync=FALSE)
hdfs.close(f)
m = hdfs.read(f)

Directory operation

hdfs.mkdir("/RHadoop/2/")
hdfs.rm("/RHadoop/2/")

Utility

hdfs.ls('/')
hdfs.file.info("/RHadoop")

The rmr package

For storing and retrieving data

small.ints = to.dfs(1:10)
from.dfs('/tmp/RtmpRMIXzb/file2bda3fa07850')

For MapReduce

mapreduce(input, output, map, reduce, combine, input.format, output.format, verbose)
keyval(key, val)

Thank you