Data Hacking with RHadoop
DESCRIPTION
RHadoop is an effective platform for doing exploratory data analysis over big data sets. The convenience of an interactive command-line interpreter and the sheer number of statistical and machine learning routines implemented in R libraries make it a highly effective environment for elementary data science. We'll discuss the basics of RHadoop: what it is, how to install it, and the API fundamentals. Next we'll discuss common use cases for RHadoop. Last, we'll run through an interactive example.

TRANSCRIPT
Using R and Hadoop to do large-scale data science
RHadoop Data Hacking
• Predict X?
  – The outcome of a future event
  – Who is likely to do something
  – Genetic factors leading to disease
• Pre-filter things so humans can accomplish more?
• Do all of this faster and better?
Would You Like to…
This document is company confidential and is intended solely for the use and information of Booz Allen Hamilton
• R is a fantastic platform for data science
  – Has a peer-reviewed community and journal that vets libraries
  – (Mostly) intuitive language
• Hadoop is the de facto platform for parallel processing
• Today, we'll be talking about rmr, but there are two more packages: rhbase and rhdfs
Why R and Hadoop?
• Some of the most effective techniques for data mining are relatively old
  – Modern SVM dates back to '92
  – Logistic regression dates back to '44
  – Important elements of the algorithms date back to Newton
• Accessibility and relevance have changed
  – Accessibility to data
  – Accessibility of computational power
  – Necessity of methods
Nothing Has Changed. Everything Has Changed.
• R docs are written in their own language (using data frames, etc.) that is unfamiliar to computer scientists
• R and CRAN documentation are more like old-school GNU than most Apache projects
  – Get used to Googling and using R's help() function
• R's data management facilities are inconsistent
• Streaming API isn't super fast
• (get over it)
Some Criticisms of R & RHadoop
• SNOW/SNOWFALL
  – Operates over MPI, Sockets, or PVM
  – No tie-in to a DFS (bad for data-intensive computing)
  – Handles matrix multiplication well (perhaps better)
  – Doesn't handle other non-trivial IPC well (basically for parallel linear algebra and simulations)
• Rmpi
  – More code
  – All synchronization constructs are user-built (just like MPI)
Comparison to Other R Parallelism Frameworks
• Others…
  – Only other Hadoop libraries have integration with HDFS/are appropriate for data-intensive computing
  – Only RHadoop supports local and cluster-based backends and has an intuitive interface that duplicates closures in the remote environment
  – Most environments are targeted towards modeling and simulation
Comparison to Other R Parallelism Frameworks
• Install R
  – MacPorts: sudo port install r-framework
  – Ubuntu: sudo apt-get install r-base
  – RHEL: sudo yum install R
• Install R dependencies (inside R)
  – install.packages(c("Rcpp", "RJSONIO", "itertools", "digest"), repos="http://watson.nci.nih.gov/cran_mirror/")
• Install RMR
  – curl http://cloud.github.com/downloads/RevolutionAnalytics/RHadoop/rmr_1.3.1.tar.gz > rmr.tar.gz
  – install.packages("rmr.tar.gz", repos=NULL, type="source") # from inside R, in the same directory
• Configure the local backend each time you run R
  – rmr.options.set(backend="local")
Installation – Local Workstation
• Install R and all packages you plan on using (rmr, e1071, topicmodels, tm, etc.) on each node.
• Use a compatible version of Hadoop 1 (1.0.3+ or CDH3+). Hadoop 2 may or may not work.
• The example on the previous slide installs R packages in your home directory; on a cluster you probably want to install them to the system-wide library.
• Configure environment variables:
  export HADOOP_CMD=/usr/bin/hadoop
  export HADOOP_STREAMING=/usr/lib/hadoop/contrib/streaming/hadoop-streaming-<version>.jar
Installation – Cluster
The Curse of Dimensionality
• The volume of the unit sphere tends towards 0 as the dimensionality of the space increases
• Intuitively, this means that there is more "slop room" for your dividing hyperplane to fall into
• The amount of data we need to train a model rises with the size of the feature space, tending towards infinity, making the problem untenable
• With a small feature space, there is no need for lots of data
• Thus, there is little point in using Hadoop to implement many classic machine learning models
Volume of the Unit Ball vs. Dimensionality
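The chart's claim is easy to check directly: the volume of the n-dimensional unit ball is pi^(n/2) / Gamma(n/2 + 1), which peaks near n = 5 and then collapses towards zero. A quick sketch in R (the function name here is illustrative, not from the slides):

```r
# Volume of the n-dimensional unit ball: pi^(n/2) / Gamma(n/2 + 1)
vball <- function(n) pi^(n / 2) / gamma(n / 2 + 1)

vball(2)   # pi, the area of the unit disk
vball(3)   # 4/3 * pi, the volume of the unit sphere
sapply(c(1, 5, 10, 20, 50), vball)  # rises to a peak near n = 5, then shrinks towards 0
```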
• Join
• Sample
• Model
• Repeat
The Hadoop Data Science Flow
• Put two pieces of data together using a common key
• Scenario:
  – Data is in two flat files in HDFS
  – Turn rows into rows of key-value pairs, where the key is the join key and the value is the rest of the row
Join
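The join pattern above can be sketched without Hadoop at all. In plain R, the map step turns each row into a (join key, rest-of-row) pair tagged with its source, the shuffle groups pairs by key, and the reduce crosses the two sides. This is an illustrative toy with made-up data, not the rmr API:

```r
# Two "flat files": rows of the form key,value
users  <- c("1,alice", "2,bob")
orders <- c("1,book", "1,pen", "2,lamp")

# Map: emit (join key, rest of row), tagged with the source
to_kv <- function(rows, src) lapply(rows, function(r) {
  parts <- strsplit(r, ",")[[1]]
  list(key = parts[1], src = src, val = parts[2])
})
kvs <- c(to_kv(users, "user"), to_kv(orders, "order"))

# "Shuffle": group by key; "Reduce": cross the two sides for each key
keys <- unique(sapply(kvs, `[[`, "key"))
joined <- unlist(lapply(keys, function(k) {
  grp <- Filter(function(x) x$key == k, kvs)
  u <- Filter(function(x) x$src == "user", grp)
  o <- Filter(function(x) x$src == "order", grp)
  unlist(lapply(u, function(a) sapply(o, function(b) paste(k, a$val, b$val))))
}))
joined  # "1 alice book" "1 alice pen" "2 bob lamp"
```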
• Take a sample of your (maybe) joined data
• Most common method is probabilistic sampling
• Numerous other techniques can leverage partitions and randomness of the key hash
• Scenarios (a precursor for):
  – Supervised learning/classification
  – Unsupervised learning/clustering
  – Regression
  – Distribution modeling
Sample
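Probabilistic sampling is exactly what a map-only job would do: each mapper keeps a row with probability p, independently, so no coordination across the cluster is needed. A plain-R sketch of the idea (names are illustrative):

```r
set.seed(42)  # reproducible
rows <- 1:10000

# Map step: keep each row independently with probability p
p <- 0.1
sample_map <- function(v) v[runif(length(v)) < p]
kept <- sample_map(rows)

length(kept) / length(rows)  # close to p, by the law of large numbers
```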
• Supervised learning: I want to predict something and I already know (some of) the answers. Also called classification or, with two classes, binary classification
• Unsupervised learning: I want to find natural groupings in the data that I might not have known about
• Regression, probability modeling: I want to fit a curve to my data
Model
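In base R each of these three model families is a one-liner; a toy illustration on synthetic data (all names and data here are made up for the example):

```r
set.seed(1)
d <- data.frame(x = rnorm(200))
d$y   <- 2 * d$x + rnorm(200, sd = 0.5)          # linear signal plus noise
d$cls <- as.integer(d$x + rnorm(200) > 0)        # noisy binary label

fit_lm  <- lm(y ~ x, data = d)                        # regression: fit a curve
fit_glm <- glm(cls ~ x, data = d, family = binomial)  # supervised: predict known labels
fit_km  <- kmeans(d[, c("x", "y")], centers = 2)      # unsupervised: find groupings

coef(fit_lm)["x"]   # close to the true slope of 2
```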
• Gain insight about the data
• Change your procedure (select only outliers, etc.)
• Gain more insight
Repeat
• Work totally in R
• Execute large, complex joins such as cross joins
RHadoop Impact: Join, Sample
• Most algorithms work perfectly well (or better) over a sample of the data
• Train and cross-validate a large number of models in parallel
• Perform model selection in the reduce phase
RHadoop Impact: Model
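The "train many models in parallel, select in the reduce" idea can be prototyped sequentially in plain R before moving to rmr: the "map" trains one model per parameter setting, and the "reduce" picks the winner by held-out error. Everything below (data, degrees, variable names) is illustrative:

```r
set.seed(7)
d <- data.frame(x = rnorm(300))
d$y <- sin(d$x) + rnorm(300, sd = 0.3)
train <- d[1:200, ]; test <- d[201:300, ]

# "Map": one polynomial-degree model per task
degrees <- 1:6
fits <- lapply(degrees, function(k) lm(y ~ poly(x, k), data = train))

# "Reduce": model selection by held-out mean squared error
mse  <- sapply(fits, function(f) mean((predict(f, test) - test$y)^2))
best <- degrees[which.min(mse)]
best  # a cubic or higher fit should beat degree 1, which cannot bend with the sine
```

In rmr, the lapply over degrees becomes the map function and the which.min becomes the reduce; the code shape barely changes.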
mapreduce(
    input,
    output = NULL,
    map = to.map(identity),
    reduce = NULL,
    combine = NULL,
    reduce.on.data.frame = FALSE,
    input.format = "native",
    output.format = "native",
    vectorized = list(map = FALSE, reduce = FALSE),
    structured = list(map = FALSE, reduce = FALSE),
    backend.parameters = list(),
    verbose = TRUE)

RHadoop API
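To make the mapreduce() contract concrete, here is a tiny pure-R stand-in that mimics its semantics on an in-memory list: map emits (key, value) pairs, values are grouped by key, and reduce folds each group. This is a teaching toy, not the rmr implementation:

```r
# Toy local "mapreduce": map emits list(key=, val=); reduce sees (key, all values)
mini_mapreduce <- function(input, map, reduce) {
  pairs <- unlist(lapply(input, map), recursive = FALSE)  # map phase
  keys  <- sapply(pairs, `[[`, "key")
  lapply(unique(keys), function(k) {                      # shuffle + reduce phase
    vals <- lapply(pairs[keys == k], `[[`, "val")
    reduce(k, vals)
  })
}

# Word count, the canonical example
docs <- list("the cat sat", "the cat ran")
counts <- mini_mapreduce(
  docs,
  map    = function(doc) lapply(strsplit(doc, " ")[[1]],
                                function(w) list(key = w, val = 1)),
  reduce = function(k, vs) list(key = k, val = sum(unlist(vs))))
counts  # "the" -> 2, "cat" -> 2, "sat" -> 1, "ran" -> 1
```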
rmr.options.set(backend = c("hadoop", "local"),
    profile.nodes = NULL, vectorized.nrows = NULL)

to.dfs(object, output = dfs.tempfile(),
    format = "native")

from.dfs(input, format = "native",
    to.data.frame = FALSE, vectorized = FALSE,
    structured = FALSE)

RHadoop API
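Conceptually, to.dfs serializes an R object out to (distributed) storage and from.dfs reads it back; on the local backend this is just files on disk. The analogy in base R, using a temp file instead of HDFS (illustrative only, not what rmr does internally):

```r
# Round-trip an R object through a file, as to.dfs/from.dfs round-trip through HDFS
path <- tempfile(fileext = ".rds")
saveRDS(1:100, path)    # ~ to.dfs(1:100)
back <- readRDS(path)   # ~ from.dfs(...)
identical(back, 1:100)  # TRUE
unlink(path)
```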
• Objects
  – my_car = list(color="green", model="volt")
• Transforming a vector (list), iterating
  – lapply/sapply/tapply – functional programming constructs
• Loops (not preferred)
  – for (i in 1:100) {…}
  – Note this is roughly equivalent to lapply(1:100, function(i){…})
• Other control structures – basically as you would expect
Doing Things the R Way
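The loop/lapply equivalence is worth seeing side by side, because rmr's map and reduce arguments are exactly this functional style:

```r
# Imperative loop, building the result by hand
sq_loop <- numeric(100)
for (i in 1:100) sq_loop[i] <- i^2

# Functional equivalents: lapply returns a list, sapply simplifies to a vector
sq_list <- lapply(1:100, function(i) i^2)
sq_vec  <- sapply(1:100, function(i) i^2)

# tapply: apply within groups (here, sum the squares of odds vs. evens)
grp <- tapply(sq_vec, 1:100 %% 2, sum)

identical(sq_vec, unlist(sq_list))  # TRUE
all(sq_vec == sq_loop)              # TRUE
```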
• R helps you! O_o
• Every object has a mode and a length and hence can be interpreted as some sort of vector – even primitives!
• Even primitives such as strings or integers are stored in a vector of length 1, never free-standing
• There are lots of types of vectors
  – Lists (think linked list)
  – Atomic vectors (think array)
  http://cran.r-project.org/doc/manuals/R-intro.html#The-intrinsic-attributes-mode-and-length
• Type coercion usually works the way you would expect
  – But… you may find yourself using as.list() or as.vector() or doing manual coercion frequently, depending on which libraries you're using, due to mode mismatches
Vectors in R
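A few lines at the R console make the mode/length point concrete:

```r
x <- 3                      # looks like a scalar...
mode(x); length(x)          # "numeric", 1 -- it is a vector of length 1
is.vector(x)                # TRUE

l <- list(a = 1, b = "two") # a list is a (generic) vector
v <- c(1, 2, 3)             # an atomic vector

# Coercion: c() flattens to the common mode; as.list()/as.vector() convert
c(1, "a")                   # both become character: "1" "a"
as.list(v)                  # list(1, 2, 3)
```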
fakedata = data.frame(
    x = c(rnorm(100)*.25, rep(.75,100)+rnorm(100)*.25),
    y = c(rnorm(100), rep(1,100)+rnorm(100)),
    z = c(rep(0,100), rep(1,100)))

plot(fakedata[,"x"], fakedata[,"y"],
    col=sapply(fakedata[,"z"], function(z) ifelse(z>0,"blue","green")))
Example – Fake Data
rmr.options.set(backend="local")

ints = to.dfs(1:100)

squares = mapreduce(ints, map=function(k, v) keyval(NULL, v^2))

from.dfs(squares)

# notice the result will be
# keyvals
Examples – Simple Parallelism
library(e1071)  # provides svm()

kernels = to.dfs(list("linear", "polynomial", "radial", "sigmoid"))

models = from.dfs(mapreduce(kernels, map=function(nothing, kern)
    keyval(NULL, svm(factor(z)~., fakedata, kernel=kern))))

plot(models[[1]][["val"]], fakedata)
Examples – Trying Lots of SVM Kernels
calls = to.dfs(list(
    list("glm", z~., family=binomial("logit"), fakedata),
    list("svm", z~., fakedata)))

models = from.dfs(mapreduce(calls, map=function(nothing, callsig)
    keyval(NULL, do.call(callsig[[1]], callsig[2:length(callsig)]))))

models[[1]][["val"]]
Examples – Different Models