Data Hacking with RHadoop
DESCRIPTION
RHadoop is an effective platform for doing exploratory data analysis over big data sets. The convenience of an interactive command-line interpreter and the sheer number of statistical and machine learning routines implemented in R libraries make it a highly effective environment for elementary data science. We'll discuss the basics of RHadoop: what it is, how to install it, and the API fundamentals. Next we'll discuss common use cases for RHadoop. Last, we'll run through an interactive example.

TRANSCRIPT
Using R and Hadoop to do large-scale data science
RHadoop Data Hacking
• Predict X?
  – The outcome of a future event
  – Who is likely to do something
  – Genetic factors leading to disease
• Pre-filter things so humans can accomplish more?
• Do all of this faster and better?
Would You Like to…
This document is company confidential and is intended solely for the use and information of Booz Allen Hamilton
• R is a fantastic platform for data science
  – Has a peer-reviewed community and journal that vets libraries
  – (Mostly) intuitive language
• Hadoop is the de facto platform for parallel processing
• Today, we'll be talking about rmr, but there are two more packages: rhbase and rhdfs
Why R and Hadoop?
• Some of the most effective techniques for data mining are relatively old
  – Modern SVM dates back to '92
  – Logistic regression dates back to '44
  – Important elements of the algorithms date back to Newton
• Accessibility and relevance have changed
  – Accessibility to data
  – Accessibility of computational power
  – Necessity of methods
Nothing Has Changed. Everything Has Changed.
• R docs are written in their own language (using data frames, etc.) that is unfamiliar to computer scientists
• R and CRAN documentation are more like old-school GNU than most Apache projects
  – Get used to Googling and using R's help() function
• R's data management facilities are inconsistent
• Streaming API isn't super fast
• (get over it)
Some Criticisms of R & RHadoop
• SNOW/SNOWFALL
  – Operates over MPI, Sockets, or PVM
  – No tie-in to a DFS (bad for data-intensive computing)
  – Handles matrix multiplication well (perhaps better)
  – Doesn't handle other non-trivial IPC well (basically for parallel linear algebra and simulations)
• Rmpi
  – More code
  – All synchronization constructs are user-built (just like MPI)
Comparison to Other R Parallelism Frameworks
• Others…
  – Only other Hadoop libraries have integration with HDFS/are appropriate for data-intensive computing
  – Only RHadoop supports local and cluster-based backends and has an intuitive interface that duplicates closures in the remote environment
  – Most environments are targeted towards modeling and simulation
Comparison to Other R Parallelism Frameworks
• Install R
  – MacPorts: sudo port install r-framework
  – Ubuntu: sudo apt-get install r-base
  – RHEL: sudo yum install R
• Install R dependencies (inside R)
  – install.packages(c("Rcpp", "RJSONIO", "itertools", "digest"), repos="http://watson.nci.nih.gov/cran_mirror/")
• Install RMR
  – curl http://cloud.github.com/downloads/RevolutionAnalytics/RHadoop/rmr_1.3.1.tar.gz > rmr.tar.gz
  – install.packages("rmr.tar.gz", repos=NULL, type="source") # from inside R, in the same directory
• Configure the local backend each time you run R
  – rmr.options.set(backend="local")
Installation – Local Workstation
• Install R and all packages you plan on using (rmr, e1071, topicmodels, tm, etc.) on each node.
• Use a compatible version of Hadoop 1 (1.0.3+ or CDH3+). Hadoop 2 may or may not work.
• The example on the previous slide installs R packages in your home directory; on a cluster you probably want to install them to the system-wide library.
• Configure environment variables:
  export HADOOP_CMD=/usr/bin/hadoop
  export HADOOP_STREAMING=/usr/lib/hadoop/contrib/streaming/hadoop-streaming-<version>.jar
Installation – Cluster
The Curse of Dimensionality
• The volume of the unit sphere tends towards 0 as the dimensionality of the space increases
• Intuitively, this means that there is more "slop room" for your dividing hyperplane to fall into
• The amount of data we need to train a model rises with the size of the feature space, tending towards infinity, making the problem untenable
• With a small feature space, there is no need for lots of data
• Thus, there is little point in using Hadoop to implement many classic machine learning models
Volume of the Unit Ball vs. Dimensionality
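The chart's claim is easy to check directly: the volume of the n-dimensional unit ball is pi^(n/2) / Gamma(n/2 + 1), which peaks near n = 5 and then collapses towards zero. A quick sketch in R (the function name here is illustrative, not from the slides):

```r
# Volume of the n-dimensional unit ball: pi^(n/2) / Gamma(n/2 + 1)
vball <- function(n) pi^(n / 2) / gamma(n / 2 + 1)

vball(2)   # pi, the area of the unit disk
vball(3)   # 4/3 * pi, the volume of the unit sphere
sapply(c(1, 5, 10, 20, 50), vball)  # rises to a peak near n = 5, then shrinks towards 0
```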
• Join
• Sample
• Model
• Repeat
The Hadoop Data Science Flow
• Put two pieces of data together using a common key
• Scenario:
  – Data is in two flat files in HDFS
  – Turn rows into rows of key-value pairs, where the key is the join key and the value is the rest of the row
Join
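The join pattern above can be sketched without Hadoop at all. In plain R, the map step turns each row into a (join key, rest-of-row) pair tagged with its source, the shuffle groups pairs by key, and the reduce crosses the two sides. This is an illustrative toy with made-up data, not the rmr API:

```r
# Two "flat files": rows of the form key,value
users  <- c("1,alice", "2,bob")
orders <- c("1,book", "1,pen", "2,lamp")

# Map: emit (join key, rest of row), tagged with the source
to_kv <- function(rows, src) lapply(rows, function(r) {
  parts <- strsplit(r, ",")[[1]]
  list(key = parts[1], src = src, val = parts[2])
})
kvs <- c(to_kv(users, "user"), to_kv(orders, "order"))

# "Shuffle": group by key; "Reduce": cross the two sides for each key
keys <- unique(sapply(kvs, `[[`, "key"))
joined <- unlist(lapply(keys, function(k) {
  grp <- Filter(function(x) x$key == k, kvs)
  u <- Filter(function(x) x$src == "user", grp)
  o <- Filter(function(x) x$src == "order", grp)
  unlist(lapply(u, function(a) sapply(o, function(b) paste(k, a$val, b$val))))
}))
joined  # "1 alice book" "1 alice pen" "2 bob lamp"
```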
• Take a sample of your (maybe) joined data
• Most common method is probabilistic sampling
• Numerous other techniques can leverage partitions and randomness of the key hash
• Scenarios (a precursor for):
  – Supervised learning/classification
  – Unsupervised learning/clustering
  – Regression
  – Distribution modeling
Sample
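Probabilistic sampling is exactly what a map-only job would do: each mapper keeps a row with probability p, independently, so no coordination across the cluster is needed. A plain-R sketch of the idea (names are illustrative):

```r
set.seed(42)  # reproducible
rows <- 1:10000

# Map step: keep each row independently with probability p
p <- 0.1
sample_map <- function(v) v[runif(length(v)) < p]
kept <- sample_map(rows)

length(kept) / length(rows)  # close to p, by the law of large numbers
```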
• Supervised learning: I want to predict something and I already know (some of) the answers. Also called classification or, with two classes, binary classification
• Unsupervised learning: I want to find natural groupings in the data that I might not have known about
• Regression, probability modeling: I want to fit a curve to my data
Model
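In base R each of these three model families is a one-liner; a toy illustration on synthetic data (all names and data here are made up for the example):

```r
set.seed(1)
d <- data.frame(x = rnorm(200))
d$y   <- 2 * d$x + rnorm(200, sd = 0.5)          # linear signal plus noise
d$cls <- as.integer(d$x + rnorm(200) > 0)        # noisy binary label

fit_lm  <- lm(y ~ x, data = d)                        # regression: fit a curve
fit_glm <- glm(cls ~ x, data = d, family = binomial)  # supervised: predict known labels
fit_km  <- kmeans(d[, c("x", "y")], centers = 2)      # unsupervised: find groupings

coef(fit_lm)["x"]   # close to the true slope of 2
```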
• Gain insight about the data
• Change your procedure (select only outliers, etc.)
• Gain more insight
Repeat
• Work totally in R
• Execute large, complex joins such as cross joins
RHadoop Impact: Join, Sample
• Most algorithms work perfectly well (or better) over a sample of the data
• Train and cross-validate a large number of models in parallel
• Perform model selection in the reduce phase
RHadoop Impact: Model
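The "train many models in parallel, select in the reduce" idea can be prototyped sequentially in plain R before moving to rmr: the "map" trains one model per parameter setting, and the "reduce" picks the winner by held-out error. Everything below (data, degrees, variable names) is illustrative:

```r
set.seed(7)
d <- data.frame(x = rnorm(300))
d$y <- sin(d$x) + rnorm(300, sd = 0.3)
train <- d[1:200, ]; test <- d[201:300, ]

# "Map": one polynomial-degree model per task
degrees <- 1:6
fits <- lapply(degrees, function(k) lm(y ~ poly(x, k), data = train))

# "Reduce": model selection by held-out mean squared error
mse  <- sapply(fits, function(f) mean((predict(f, test) - test$y)^2))
best <- degrees[which.min(mse)]
best  # a cubic or higher fit should beat degree 1, which cannot bend with the sine
```

In rmr, the lapply over degrees becomes the map function and the which.min becomes the reduce; the code shape barely changes.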
mapreduce(
    input,
    output = NULL,
    map = to.map(identity),
    reduce = NULL,
    combine = NULL,
    reduce.on.data.frame = FALSE,
    input.format = "native",
    output.format = "native",
    vectorized = list(map = FALSE, reduce = FALSE),
    structured = list(map = FALSE, reduce = FALSE),
    backend.parameters = list(),
    verbose = TRUE)

RHadoop API
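To make the mapreduce() contract concrete, here is a tiny pure-R stand-in that mimics its semantics on an in-memory list: map emits (key, value) pairs, values are grouped by key, and reduce folds each group. This is a teaching toy, not the rmr implementation:

```r
# Toy local "mapreduce": map emits list(key=, val=); reduce sees (key, all values)
mini_mapreduce <- function(input, map, reduce) {
  pairs <- unlist(lapply(input, map), recursive = FALSE)  # map phase
  keys  <- sapply(pairs, `[[`, "key")
  lapply(unique(keys), function(k) {                      # shuffle + reduce phase
    vals <- lapply(pairs[keys == k], `[[`, "val")
    reduce(k, vals)
  })
}

# Word count, the canonical example
docs <- list("the cat sat", "the cat ran")
counts <- mini_mapreduce(
  docs,
  map    = function(doc) lapply(strsplit(doc, " ")[[1]],
                                function(w) list(key = w, val = 1)),
  reduce = function(k, vs) list(key = k, val = sum(unlist(vs))))
counts  # "the" -> 2, "cat" -> 2, "sat" -> 1, "ran" -> 1
```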
rmr.options.set(backend = c("hadoop", "local"),
    profile.nodes = NULL, vectorized.nrows = NULL)

to.dfs(object, output = dfs.tempfile(),
    format = "native")

from.dfs(input, format = "native",
    to.data.frame = FALSE, vectorized = FALSE,
    structured = FALSE)

RHadoop API
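Conceptually, to.dfs serializes an R object out to (distributed) storage and from.dfs reads it back; on the local backend this is just files on disk. The analogy in base R, using a temp file instead of HDFS (illustrative only, not what rmr does internally):

```r
# Round-trip an R object through a file, as to.dfs/from.dfs round-trip through HDFS
path <- tempfile(fileext = ".rds")
saveRDS(1:100, path)    # ~ to.dfs(1:100)
back <- readRDS(path)   # ~ from.dfs(...)
identical(back, 1:100)  # TRUE
unlink(path)
```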
• Objects
  – my_car = list(color="green", model="volt")
• Transforming a vector (list), iterating
  – lapply/sapply/tapply – functional programming constructs
• Loops (not preferred)
  – for (i in 1:100) {…}
  – Note this is roughly equivalent to lapply(1:100, function(i){…})
• Other control structures – basically as you would expect
Doing Things the R Way
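The loop/lapply equivalence is worth seeing side by side, because rmr's map and reduce arguments are exactly this functional style:

```r
# Imperative loop, building the result by hand
sq_loop <- numeric(100)
for (i in 1:100) sq_loop[i] <- i^2

# Functional equivalents: lapply returns a list, sapply simplifies to a vector
sq_list <- lapply(1:100, function(i) i^2)
sq_vec  <- sapply(1:100, function(i) i^2)

# tapply: apply within groups (here, sum the squares of odds vs. evens)
grp <- tapply(sq_vec, 1:100 %% 2, sum)

identical(sq_vec, unlist(sq_list))  # TRUE
all(sq_vec == sq_loop)              # TRUE
```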
• R helps you! O_o
• Every object has a mode and a length and hence can be interpreted as some sort of vector – even primitives!
• Even primitives such as strings or integers are stored in a vector of length 1, never free-standing
• There are lots of types of vectors
  – Lists (think linked list)
  – Atomic vectors (think array)
  http://cran.r-project.org/doc/manuals/R-intro.html#The-intrinsic-attributes-mode-and-length
• Type coercion usually works the way you would expect
  – But… you may find yourself using as.list() or as.vector() or doing manual coercion frequently, depending on which libraries you're using, due to mode mismatches
Vectors in R
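A few lines at the R console make the mode/length point concrete:

```r
x <- 3                      # looks like a scalar...
mode(x); length(x)          # "numeric", 1 -- it is a vector of length 1
is.vector(x)                # TRUE

l <- list(a = 1, b = "two") # a list is a (generic) vector
v <- c(1, 2, 3)             # an atomic vector

# Coercion: c() flattens to the common mode; as.list()/as.vector() convert
c(1, "a")                   # both become character: "1" "a"
as.list(v)                  # list(1, 2, 3)
```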
fakedata = data.frame(
    x = c(rnorm(100)*.25, rep(.75,100)+rnorm(100)*.25),
    y = c(rnorm(100), rep(1,100)+rnorm(100)),
    z = c(rep(0,100), rep(1,100)))

plot(fakedata[,"x"], fakedata[,"y"],
    col=sapply(fakedata[,"z"], function(z) ifelse(z>0,"blue","green")))
Example – Fake Data
rmr.options.set(backend="local")

ints = to.dfs(1:100)

squares = mapreduce(ints, map=function(k, v) keyval(NULL, v^2))

from.dfs(squares)

# notice the result will be
# keyvals
Examples – Simple Parallelism
library(e1071)  # provides svm()

kernels = to.dfs(list("linear", "polynomial", "radial", "sigmoid"))

models = from.dfs(mapreduce(kernels, map=function(nothing, kern)
    keyval(NULL, svm(factor(z)~., fakedata, kernel=kern))))

plot(models[[1]][["val"]], fakedata)
Examples – Trying Lots of SVM Kernels
calls = to.dfs(list(
    list("glm", z~., family=binomial("logit"), fakedata),
    list("svm", z~., fakedata)))

models = from.dfs(mapreduce(calls, map=function(nothing, callsig)
    keyval(NULL, do.call(callsig[[1]], callsig[2:length(callsig)]))))

models[[1]][["val"]]
Examples – Different Models