Big Data, Beating the Skills Gap: Using R with Hadoop (EARL Conference 2014)
TRANSCRIPT
Big Data, beating the Skills Gap
Using R with Hadoop
Using R with Hadoop
There are a number of R packages available that can interact with Hadoop, including:
• hive - Not to be confused with Apache Hive, which is effectively SQL for Hadoop.
• HadoopStreaming - Provides a framework for writing map/reduce scripts for use in Hadoop Streaming.
• RHadoop - A collection of five R packages that allow users to manage and analyze data with Hadoop.
Kate Hanley - R Consultant
What do I need to know about Hadoop to understand this talk?
For the purposes of this talk, we need to know about the following:
• HDFS
• MapReduce
HDFS
What is it?
• This stands for Hadoop Distributed File System
• A way of storing data across multiple machines to allow ease of access
• The data is broken up into chunks and spread across multiple machines
Why bother?
• No massive files which contain terabytes of information
• There are multiple copies of each file dotted around different machines, so if one of your machines goes down or the data is corrupted, you can still perform the analysis
MapReduce
What is it?
• A way of coding your problem so that it can be split up and spread across multiple machines
• Each machine performs its own part of the analysis, and the results are then collected together at the end
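The pattern above can be sketched in plain R with no Hadoop involved at all. This is a toy illustration of the map/shuffle/reduce idea (not the rmr2 API, which appears later in the talk), using a small word count as the running example:

```r
# A minimal base-R sketch of the map/reduce idea (no Hadoop involved).
# "Map" emits (key, value) pairs; "reduce" combines the values per key.
lines <- c("the cat sat", "the cat ran")

# Map: emit one (word, 1) pair per word
mapped <- lapply(unlist(strsplit(lines, " ")),
                 function(w) list(key = w, value = 1))

# Shuffle: group the values by key
keys <- sapply(mapped, `[[`, "key")
grouped <- split(sapply(mapped, `[[`, "value"), keys)

# Reduce: sum the values for each key
reduced <- sapply(grouped, sum)
reduced[["the"]]  # 2
```

On a real cluster, the map calls run on different machines against different chunks of the input, and the framework performs the shuffle before the reduce step runs.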
RHadoop
• We’ll focus on using some of the RHadoop packages, as these seem to be widely used.
• Not on CRAN, but available to download on GitHub.
• RHadoop is divided into a number of packages. We’re going to use the rmr2 package, which allows you to use R to perform MapReduce.
A simple MapReduce task using R
• The task - Perform a word count on the text of James Joyce’s Ulysses.
• Input - A text file containing the plain text version of the novel.
• Output - An R object containing all unique words and a count specifying the number of times each occurred in the novel.
• Simplifications - We removed all punctuation from the file before beginning the analysis. This avoids issues where “hello” and “hello!” are treated as different words.
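The slides don’t show the clean-up code itself, but one way to do it in R is gsub with a character class (a sketch; the sample line here is made up for illustration):

```r
# Sketch of the punctuation clean-up described above (the exact code
# used in the talk isn't shown, so this is one possible approach).
line <- 'Hello! "Stately, plump Buck Mulligan" -- hello.'

# Remove everything that is not a letter, digit, or space,
# and fold to lower case so "Hello" and "hello" match.
clean <- tolower(gsub("[^[:alnum:] ]", "", line))
```

Applied to the whole file before the analysis, this makes “hello” and “hello!” count as the same word.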
A simple MapReduce task using R
Step 1: Load up the rmr2 package
library(rmr2)
A simple MapReduce task using R
Option 2a: You can try out the package without having Hadoop installed
This will allow you to play around with the package, but you won’t be able to execute any Hadoop jobs. To enable the package to work locally, you need to set the following option:
rmr.options(backend="local")
A simple MapReduce task using R
Option 2b: If you have a Hadoop installation available, you need to tell the package where to locate it
The exact paths will vary depending on your Hadoop installation (ask your IT department!), but for example, on our system the options were set as follows:
Sys.setenv(HADOOP_CMD = "/usr/local/hadoop/bin/hadoop")
Sys.setenv(HADOOP_STREAMING = "/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.4.1.jar")
Sys.setenv(JAVA_HOME = "/usr/lib/jvm/java")
A simple MapReduce task using R
Step 3: Tell R where it can set up a temporary directory for storing data
(you will need to have write permission here!)
rmr.options(hdfs.tempdir = "/user/kate")
A simple MapReduce task using R
Step 4: Store the data in the distributed file system
The data is not held in memory, but dfsUly points to its location:
dfsUly <- to.dfs(readLines("pg4300.txt"))
A simple MapReduce task using R
Step 5: Write a map function
• The map function expects to receive a single line of the file at a time and creates key-value pairs for each word in the file
• For this example, the “key” will be the word that we want to include in our total, and the “value” will always be 1
• At the end, we will sum up all the 1’s for each word/key, which will give us the number of times it appeared in the book
A simple MapReduce task using R
Step 5: Write a map function
wc.map <- function(., lines) {
  # Splits the line into a vector of individual words
  vecOfWords <- strsplit(x = lines, split = " ")[[1]]
  # Creates a key-value pair for each word,
  # assigning them the value 1
  keyval(vecOfWords, 1)
}
A simple MapReduce task using R
Step 6: Write a reduce function
wc.reduce <- function(word, counts) {
  # For each word, sum the counts
  keyval(word, sum(counts))
}
A simple MapReduce task using R
Step 7: Run your code and view the results
Use the mapreduce function to run your analysis.

res <- mapreduce(input = dfsUly,
                 map = wc.map,
                 reduce = wc.reduce)

# I like to convert the results into a
# data frame for ease of use
resultsDFS <- from.dfs(res)
results <- as.data.frame(resultsDFS)
The Results!
We can then take a look at the results:
head(results)
##    key val
## 1 hoof   9
## 2 hook  10
## 3 hoop   3
## 4 hoot   1
## 5 hope  64
## 6 hopk   1
The Results!
• Most common word - “the” (14,932 times)
• Most common word with more than 5 letters - “stephen” (505 times)
• Most common word with more than 10 letters - “shakespeare” (39 times)
• Total number of words - 267,175
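Summaries like these can be pulled straight from the results data frame with base R. The snippet below is a sketch using toy data (the real results object requires the full Hadoop run above):

```r
# Sketch of how summaries like the ones above can be pulled from the
# results data frame (toy data here; the real one needs the Hadoop run).
results <- data.frame(key = c("the", "stephen", "shakespeare", "hoof"),
                      val = c(14932, 505, 39, 9),
                      stringsAsFactors = FALSE)

# Most common word overall
topWord <- results$key[which.max(results$val)]      # "the"

# Most common word with more than 10 letters
long <- results[nchar(results$key) > 10, ]
topLong <- long$key[which.max(long$val)]            # "shakespeare"

# Total number of words
totalWords <- sum(results$val)
```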
To Conclude
• R and Hadoop work really well together
• There are plenty of packages out there that allow you to use MapReduce with R
• Unless you’re very tech-savvy, you may well need support (either from your IT team, or externally) to get Hadoop up and running