
Page 1: Big Data, beating the Skills Gap - EARL Conf › 2014 › Presentations › EARL2014... · RHadoop We’ll focus on using some of the RHadoop packages, as these seem to be widely


Big Data, beating the Skills Gap

Using R with Hadoop


Using R with Hadoop

There are a number of R packages available that can interact with Hadoop, including:

• hive - Not to be confused with Apache Hive, which is effectively SQL for Hadoop

• HadoopStreaming - Provides a framework for writing map/reduce scripts for use in Hadoop Streaming

• RHadoop - A collection of five R packages that allow users to manage and analyze data with Hadoop

Kate Hanley - R Consultant

[email protected]


What do I need to know about Hadoop to understand this talk?

For the purposes of this talk, we need to know about the following:

• HDFS
• MapReduce


HDFS

What is it?

• This stands for Hadoop Distributed File System
• A way of storing data across multiple machines to allow ease of access
• The data is broken up into chunks and spread across multiple machines

Why bother?

• No single massive file containing terabytes of information
• There are multiple copies of each file dotted around different machines, so if one of your machines goes down or the data is corrupted, you can still perform the analysis
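The chunk-and-replicate idea can be sketched in plain R (a toy, single-machine illustration of the concept only; this is not how HDFS is actually accessed, and all names here are ours):

```r
# Toy illustration of the HDFS idea: break data into chunks
# and keep copies of each chunk on several "machines".
data <- c("line1", "line2", "line3", "line4", "line5")

# Break the data into chunks of two lines
chunks <- split(data, ceiling(seq_along(data) / 2))

# Store each chunk on three of four "machines" (round-robin),
# so losing any single machine loses no data
machines <- vector("list", 4)
for (i in seq_along(chunks)) {
  for (m in ((i - 1):(i + 1)) %% 4 + 1) {
    machines[[m]][[names(chunks)[i]]] <- chunks[[i]]
  }
}

# Even with machine 1 gone, every chunk is still available elsewhere
surviving <- unique(names(unlist(machines[-1], recursive = FALSE)))
setequal(surviving, names(chunks))
# [1] TRUE
```

Real HDFS does the same thing at the block level, with a default replication factor of three.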


MapReduce

What is it?

• A way of coding your problem so that it can be split up and spread across multiple machines
• Each machine performs its own part of the analysis, and the results are then collected together at the end
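The map, group, and reduce stages can be sketched in plain R without Hadoop (a toy, single-machine illustration of the pattern; the variable names are ours, not part of any package):

```r
# Toy single-machine illustration of the MapReduce pattern
lines <- c("the cat sat", "the cat")

# Map: emit a key for every word on every line
words <- unlist(lapply(lines, function(l) strsplit(l, " ")[[1]]))

# Shuffle: group a value of 1 under each key (the word)
grouped <- split(rep(1, length(words)), words)

# Reduce: sum the values for each key
counts <- sapply(grouped, sum)

counts
# cat sat the
#   2   1   2
```

In a real cluster, the map calls run in parallel on different machines, and the shuffle moves all values for a given key to the machine that reduces it.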



RHadoop

• We’ll focus on using some of the RHadoop packages, as these seem to be widely used.

• Not on CRAN, but available to download on GitHub.

• RHadoop is divided into a number of packages. We’re going to use the rmr2 package, which allows you to use R to perform MapReduce.
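Since rmr2 is not on CRAN, installation is typically from a source tarball downloaded from GitHub (a sketch only; the version number and the dependency list below are assumptions, so check the RHadoop GitHub wiki for the current details):

```r
# rmr2's dependencies are on CRAN (list assumed; may vary by version)
install.packages(c("Rcpp", "rJava", "RJSONIO", "digest",
                   "functional", "stringr", "plyr", "caTools"))

# rmr2 itself is installed from the tarball downloaded from GitHub
# (hypothetical file name; substitute the release you downloaded)
install.packages("rmr2_3.3.1.tar.gz", repos = NULL, type = "source")
```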


A simple MapReduce task using R

• The task - Perform a word count on the text of James Joyce’s Ulysses.

• Input - A text file containing the plain text version of the novel.

• Output - An R object containing all unique words, with a count of the number of times each occurred in the novel.

• Simplifications - We removed all punctuation from the file before beginning the analysis. This avoids issues where “hello” and “hello!” are treated as different words.
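The punctuation clean-up mentioned above can be done in base R before uploading the file (a sketch; the short vector stands in for the real `readLines("pg4300.txt")` call used later, and the lower-casing step is our extra assumption so that “Hello” and “hello” also merge):

```r
# Strip punctuation and normalise case so that
# "hello" and "Hello!" count as the same word
raw <- c("Hello, world!", "hello world")   # stand-in for readLines("pg4300.txt")
clean <- tolower(gsub("[[:punct:]]", "", raw))
clean
# [1] "hello world" "hello world"
```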


A simple MapReduce task using R

Step 1: Load up the rmr2 package

library(rmr2)


A simple MapReduce task using R

Option 2a: You can try out the package without having Hadoop installed.

This will allow you to play around with the package, but you won’t be able to execute any Hadoop jobs. To enable the package to work locally, you need to set the following option:

rmr.options(backend="local")


A simple MapReduce task using R

Option 2b: If you have a Hadoop installation available, you need to tell the package where to locate it.

The exact paths will vary depending on your Hadoop installation (ask your IT department!), but for example, on our system the options were set as follows:

Sys.setenv(HADOOP_CMD = "/usr/local/hadoop/bin/hadoop")

Sys.setenv(HADOOP_STREAMING = "/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.4.1.jar")

Sys.setenv(JAVA_HOME = "/usr/lib/jvm/java")

[email protected]

Page 12: Big Data, beating the Skills Gap - EARL Conf › 2014 › Presentations › EARL2014... · RHadoop We’ll focus on using some of the RHadoop packages, as these seem to be widely

..........

.....

......

.....

.....

.....

......

.....

.....

.....

......

.....

.....

.....

......

.....

......

.....

.....

.

A simple MapReduce task using R

Step 3: Tell R where it can set up a temporary directory for storing data (you will need to have write permission here!)

rmr.options(hdfs.tempdir = "/user/kate")


A simple MapReduce task using R

Step 4: Store the data in the distributed file system.

The data is not held in memory, but dfsUly points to its location.

dfsUly <- to.dfs(readLines("pg4300.txt"))


A simple MapReduce task using R

Step 5: Write a map function

• The map function expects to receive a single line of the file at a time and creates key-value pairs for each word in the file

• For this example, the “key” will be the word that we want to include in our total, and the “value” will always be 1

• At the end, we will sum up all the 1’s for each word/key, which will give us the number of times it appeared in the book


A simple MapReduce task using R

Step 5: Write a map function

wc.map <- function(., lines) {
  # Split the line into a vector of individual words
  vecOfWords <- strsplit(x = lines, split = " ")[[1]]

  # Create a key-value pair for each word,
  # assigning each the value 1
  keyval(vecOfWords, 1)
}


A simple MapReduce task using R

Step 6: Write a reduce function

wc.reduce <- function(word, counts) {
  # For each word, sum the counts
  keyval(word, sum(counts))
}


A simple MapReduce task using R

Step 7: Run your code and view the results

Use the mapreduce function to run your analysis.

res <- mapreduce(
  input = dfsUly,
  map = wc.map,
  reduce = wc.reduce
)

# I like to convert the results into a
# data frame for ease of use
resultsDFS <- from.dfs(res)
results <- as.data.frame(resultsDFS)


The Results!

We can then take a look at the results:

head(results)

##    key val
## 1 hoof   9
## 2 hook  10
## 3 hoop   3
## 4 hoot   1
## 5 hope  64
## 6 hopk   1


The Results!

• Most common word - “the” (14,932 times)

• Most common word with more than 5 letters - “stephen” (505 times)

• Most common word with more than 10 letters - “shakespeare” (39 times)

• Total number of words - 267,175


To Conclude

• R and Hadoop work really well together

• There are plenty of packages out there that allow you to use MapReduce with R

• Unless you’re very tech-savvy, you may well need support (either from your IT team, or externally) to get Hadoop up and running
