Big Data, Beating the Skills Gap: Using R with Hadoop (EARL Conference 2014)
TRANSCRIPT
Big Data, beating the Skills Gap
Using R with Hadoop
Using R with Hadoop
There are a number of R packages available that can interact with Hadoop, including:
• hive - Not to be confused with Apache Hive, which is effectively SQL for Hadoop.
• HadoopStreaming - Provides a framework for writing map/reduce scripts for use in Hadoop Streaming.
• RHadoop - A collection of five R packages that allow users to manage and analyze data with Hadoop.
Kate Hanley - R Consultant
What do I need to know about Hadoop to understand this talk?
For the purposes of this talk, we need to know about the following:
• HDFS
• MapReduce
HDFS
What is it?
• This stands for Hadoop Distributed File System
• A way of storing data across multiple machines to allow ease of access
• The data is broken up into chunks and spread across multiple machines
Why bother?
• No massive files which contain terabytes of information
• There are multiple copies of each file dotted around different machines, so if one of your machines goes down or the data is corrupted, you can still perform the analysis
MapReduce
What is it?
• A way of coding your problem so that it can be split up and spread across multiple machines
• Each machine performs its own part of the analysis, and the results are then collected together at the end
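The pattern above can be sketched in plain R with no Hadoop involved at all. This is a toy illustration of the map/shuffle/reduce idea (not the rmr2 API, which appears later in the talk), using a small word count as the running example:

```r
# A minimal base-R sketch of the map/reduce idea (no Hadoop involved).
# "Map" emits (key, value) pairs; "reduce" combines the values per key.
lines <- c("the cat sat", "the cat ran")

# Map: emit one (word, 1) pair per word
mapped <- lapply(unlist(strsplit(lines, " ")),
                 function(w) list(key = w, value = 1))

# Shuffle: group the values by key
keys <- sapply(mapped, `[[`, "key")
grouped <- split(sapply(mapped, `[[`, "value"), keys)

# Reduce: sum the values for each key
reduced <- sapply(grouped, sum)
reduced[["the"]]  # 2
```

On a real cluster, the map calls run on different machines against different chunks of the input, and the framework performs the shuffle before the reduce step runs.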
RHadoop
• We’ll focus on using some of the RHadoop packages, as these seem to be widely used.
• Not on CRAN, but available to download on GitHub.
• RHadoop is divided into a number of packages. We’re going to use the rmr2 package, which allows you to use R to perform MapReduce.
A simple MapReduce task using R
• The task - Perform a word count on the text of James Joyce’s Ulysses.
• Input - A text file containing the plain text version of the novel.
• Output - An R object containing all unique words and a count specifying the number of times each occurred in the novel.
• Simplifications - We removed all punctuation from the file before beginning the analysis. This avoids issues where “hello” and “hello!” are treated as different words.
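The slides don’t show the clean-up code itself, but one way to do it in R is gsub with a character class (a sketch; the sample line here is made up for illustration):

```r
# Sketch of the punctuation clean-up described above (the exact code
# used in the talk isn't shown, so this is one possible approach).
line <- 'Hello! "Stately, plump Buck Mulligan" -- hello.'

# Remove everything that is not a letter, digit, or space,
# and fold to lower case so "Hello" and "hello" match.
clean <- tolower(gsub("[^[:alnum:] ]", "", line))
```

Applied to the whole file before the analysis, this makes “hello” and “hello!” count as the same word.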
A simple MapReduce task using R
Step 1: Load up the rmr2 package
library(rmr2)
A simple MapReduce task using R
Option 2a: You can try out the package without having Hadoop installed
This will allow you to play around with the package, but you won’t be able to execute any Hadoop jobs. To enable the package to work locally, you need to set the following option:
rmr.options(backend="local")
A simple MapReduce task using R
Option 2b: If you have a Hadoop installation available, you need to tell the package where to locate it
The exact paths will vary depending on your Hadoop installation (ask your IT department!), but for example, on our system the options were set as follows:
Sys.setenv(HADOOP_CMD = "/usr/local/hadoop/bin/hadoop")
Sys.setenv(HADOOP_STREAMING = "/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.4.1.jar")
Sys.setenv(JAVA_HOME = "/usr/lib/jvm/java")
A simple MapReduce task using R
Step 3: Tell R where it can set up a temporary directory for storing data
(you will need to have write permission here!)
rmr.options(hdfs.tempdir = "/user/kate")
A simple MapReduce task using R
Step 4: Store the data in the distributed file system
The data is not held in memory, but dfsUly points to its location:
dfsUly <- to.dfs(readLines("pg4300.txt"))
A simple MapReduce task using R
Step 5: Write a map function
• The map function expects to receive a single line of the file at a time and creates key-value pairs for each word in the file
• For this example, the “key” will be the word that we want to include in our total, and the “value” will always be 1
• At the end, we will sum up all the 1’s for each word/key, which will give us the number of times it appeared in the book
A simple MapReduce task using R
Step 5: Write a map function
wc.map <- function(., lines) {
  # Splits the line into a vector of individual words
  vecOfWords <- strsplit(x = lines, split = " ")[[1]]
  # Creates a key-value pair for each word,
  # assigning them the value 1
  keyval(vecOfWords, 1)
}
A simple MapReduce task using R
Step 6: Write a reduce function
wc.reduce <- function(word, counts) {
  # For each word, sum the counts
  keyval(word, sum(counts))
}
A simple MapReduce task using R
Step 7: Run your code and view the results
Use the mapreduce function to run your analysis.

res <- mapreduce(input = dfsUly,
                 map = wc.map,
                 reduce = wc.reduce)

# I like to convert the results into a
# data frame for ease of use
resultsDFS <- from.dfs(res)
results <- as.data.frame(resultsDFS)
The Results!
We can then take a look at the results:
head(results)
##    key val
## 1 hoof   9
## 2 hook  10
## 3 hoop   3
## 4 hoot   1
## 5 hope  64
## 6 hopk   1
The Results!
• Most common word - “the” (14,932 times)
• Most common word with more than 5 letters - “stephen” (505 times)
• Most common word with more than 10 letters - “shakespeare” (39 times)
• Total number of words - 267,175
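Summaries like these can be pulled straight from the results data frame with base R. The snippet below is a sketch using toy data (the real results object requires the full Hadoop run above):

```r
# Sketch of how summaries like the ones above can be pulled from the
# results data frame (toy data here; the real one needs the Hadoop run).
results <- data.frame(key = c("the", "stephen", "shakespeare", "hoof"),
                      val = c(14932, 505, 39, 9),
                      stringsAsFactors = FALSE)

# Most common word overall
topWord <- results$key[which.max(results$val)]      # "the"

# Most common word with more than 10 letters
long <- results[nchar(results$key) > 10, ]
topLong <- long$key[which.max(long$val)]            # "shakespeare"

# Total number of words
totalWords <- sum(results$val)
```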
To Conclude
• R and Hadoop work really well together
• There are plenty of packages out there that allow you to use MapReduce with R
• Unless you’re very tech-savvy, you may well need support (either from your IT team, or externally) to get Hadoop up and running