Big Data Science: H2O and R
DESCRIPTION
Anqi Fu's presentation from the August 20 Meetup on using H2O with R.
TRANSCRIPT
![Page 1: Big datascienceh2oandr](https://reader034.vdocuments.site/reader034/viewer/2022051819/54c661504a79595e038b4583/html5/thumbnails/1.jpg)
H2O – The Open Source Math Engine
Big Data Science with H2O in R
![Page 2: Big datascienceh2oandr](https://reader034.vdocuments.site/reader034/viewer/2022051819/54c661504a79595e038b4583/html5/thumbnails/2.jpg)
4/23/13
H2O – Open Source Math & Machine Learning for Big Data
Anqi Fu, August 2013
![Page 3: Big datascienceh2oandr](https://reader034.vdocuments.site/reader034/viewer/2022051819/54c661504a79595e038b4583/html5/thumbnails/3.jpg)
Universe is sparse. Life is messy. Data is sparse & messy.
- Lao Tzu
![Page 4: Big datascienceh2oandr](https://reader034.vdocuments.site/reader034/viewer/2022051819/54c661504a79595e038b4583/html5/thumbnails/4.jpg)
Introduction to Big Data
• There are about as many bits of information in our digital universe as there are stars in our actual universe.
• The process to decode the human genome took 10 years. It can now be done in a week.
• Big data means more than “lots of data”
![Page 5: Big datascienceh2oandr](https://reader034.vdocuments.site/reader034/viewer/2022051819/54c661504a79595e038b4583/html5/thumbnails/5.jpg)
H2O – The Open Source Math Engine
Better Predictions
Same Interface
![Page 6: Big datascienceh2oandr](https://reader034.vdocuments.site/reader034/viewer/2022051819/54c661504a79595e038b4583/html5/thumbnails/6.jpg)
Installation
1. Install and run H2O
• Command line: java -Xmx2g -jar h2o.jar
• Pull up http://localhost:54321 in browser
2. Install the R package
• install.packages(c("RCurl", "rjson", "bitops"))
• install.packages("Path/To/Package/h2o_1.2.3.tar.gz", repos = NULL, type = "source")
  (Replace Path/To/Package with the location of the package file on your machine!)
3. In R console, type library(h2o)
• demo(package="h2o")
• demo(h2o.glm)
![Page 7: Big datascienceh2oandr](https://reader034.vdocuments.site/reader034/viewer/2022051819/54c661504a79595e038b4583/html5/thumbnails/7.jpg)
Always have H2O running first!
![Page 8: Big datascienceh2oandr](https://reader034.vdocuments.site/reader034/viewer/2022051819/54c661504a79595e038b4583/html5/thumbnails/8.jpg)
Basic R Script
1. Tell R where H2O is running:
localH2O = new("H2OClient", ip="127.0.0.1", port=54321)
2. Check connection:
h2o.checkClient(localH2O)
3. Pass H2OClient as parameter to import:
h2o.importFile(localH2O, path="Path/To/Data", …)
![Page 9: Big datascienceh2oandr](https://reader034.vdocuments.site/reader034/viewer/2022051819/54c661504a79595e038b4583/html5/thumbnails/9.jpg)
Overview of Objects
• H2OClient: ip=character, port=numeric
• H2OParsedData: h2o=H2OClient, key=character
• H2OGLMModel: key=character, data=H2OParsedData, model=list(coefficients, deviance, aic, etc.)
Example: myModel@model$coefficients
Diagram: H2O holds each data set under a key, e.g. key="prostate.hex", key="airlines.hex"
![Page 10: Big datascienceh2oandr](https://reader034.vdocuments.site/reader034/viewer/2022051819/54c661504a79595e038b4583/html5/thumbnails/10.jpg)
Overview of Methods
| Standard R | H2O |
| --- | --- |
| read.csv, read.table, etc. | h2o.importFile, h2o.importURL |
| summary | summary (limited to data only) |
| glm, glmnet | h2o.glm(y, x, data, family, nfolds, alpha, lambda) |
| kmeans | h2o.kmeans(data, centers, cols, iter.max) |
| randomForest, cforest | h2o.randomForest(y, x_ignore, data, ntree, depth, classwt) |
![Page 11: Big datascienceh2oandr](https://reader034.vdocuments.site/reader034/viewer/2022051819/54c661504a79595e038b4583/html5/thumbnails/11.jpg)
Demo 1: Basic GLM in H2O through R
![Page 12: Big datascienceh2oandr](https://reader034.vdocuments.site/reader034/viewer/2022051819/54c661504a79595e038b4583/html5/thumbnails/12.jpg)
Demo 1: Prostate Cancer Data
• Prostate cancer data set from Ohio State University Comprehensive Cancer Center
• N = 380 patients, ages ranging from 43 to 79
• Goal: Predict presence of tumor from baseline exam of patient (age, race, PSA, total Gleason score, etc.)
![Page 13: Big datascienceh2oandr](https://reader034.vdocuments.site/reader034/viewer/2022051819/54c661504a79595e038b4583/html5/thumbnails/13.jpg)
![Page 14: Big datascienceh2oandr](https://reader034.vdocuments.site/reader034/viewer/2022051819/54c661504a79595e038b4583/html5/thumbnails/14.jpg)
Prostate Cancer
Data:
y = CAPSULE
0 = no tumor
1 = tumor
x = PSA (prostate-specific antigen)
![Page 15: Big datascienceh2oandr](https://reader034.vdocuments.site/reader034/viewer/2022051819/54c661504a79595e038b4583/html5/thumbnails/15.jpg)
Prostate Cancer
Logistic Regression Fit
Family: Binomial, Link: Logit
Data:
y = CAPSULE
0 = no tumor
1 = tumor
x = PSA (prostate-specific antigen)
Goal:
Estimate probability CAPSULE = 1
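To make the goal concrete, here is a minimal sketch in base R of how a binomial family with a logit link turns a PSA value into a probability that CAPSULE = 1. The coefficients b0 and b1 and the function name prob_capsule are hypothetical, chosen only for illustration; they are not fitted to the prostate data.

```r
# Hypothetical coefficients, for illustration only (not fitted to the prostate data)
b0 <- -1.0   # intercept
b1 <- 0.05   # slope on PSA

# Inverse logit link: maps the linear predictor b0 + b1 * PSA to a probability
prob_capsule <- function(psa, intercept, slope) {
  1 / (1 + exp(-(intercept + slope * psa)))
}

# Estimated P(CAPSULE = 1) at three PSA values
p <- prob_capsule(c(0, 20, 100), b0, b1)
print(p)
```

With a positive slope the estimated probability rises with PSA; base R's plogis() computes the same inverse logit.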
![Page 16: Big datascienceh2oandr](https://reader034.vdocuments.site/reader034/viewer/2022051819/54c661504a79595e038b4583/html5/thumbnails/16.jpg)
GLM Parameters
• y = response variable
• x = predictor variables (vector)
• family = binomial (default link = logit)
• data = H2OParsedData object
• nfolds = number of cross-validation folds
• lambda = weight on penalty factor
• alpha = elastic net mixing parameter
  • alpha = 0 is ridge penalty (L2 norm)
  • alpha = 1 is lasso penalty (L1 norm)
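The interplay of lambda and alpha can be sketched in base R. This assumes the glmnet-style convention, penalty = lambda * (alpha * L1 + (1 - alpha) / 2 * L2 squared); the slides do not spell out the exact form H2O uses, so treat the formula, the function name elastic_net_penalty, and the example coefficients as assumptions.

```r
# Elastic net penalty, assuming the glmnet-style convention (an assumption,
# not confirmed by the slides)
elastic_net_penalty <- function(beta, lambda, alpha) {
  l1 <- sum(abs(beta))   # lasso part (L1 norm)
  l2 <- sum(beta^2)      # ridge part (squared L2 norm)
  lambda * (alpha * l1 + (1 - alpha) / 2 * l2)
}

beta <- c(0.5, -2, 0)    # hypothetical coefficient vector

ridge_only <- elastic_net_penalty(beta, lambda = 1e-5, alpha = 0)
lasso_only <- elastic_net_penalty(beta, lambda = 1e-5, alpha = 1)
mixed      <- elastic_net_penalty(beta, lambda = 1e-5, alpha = 0.5)
print(c(ridge = ridge_only, lasso = lasso_only, mixed = mixed))
```

Setting alpha to 0 or 1 drops one of the two terms, which is why those values correspond to pure ridge and pure lasso.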
![Page 17: Big datascienceh2oandr](https://reader034.vdocuments.site/reader034/viewer/2022051819/54c661504a79595e038b4583/html5/thumbnails/17.jpg)
Under the Hood: Hacking R for H2O
![Page 18: Big datascienceh2oandr](https://reader034.vdocuments.site/reader034/viewer/2022051819/54c661504a79595e038b4583/html5/thumbnails/18.jpg)
Under the Hood
Diagram: the data scientist, analyst, etc. talks to H2O through its REST API (JSON); H2O imports and parses the data.
![Page 19: Big datascienceh2oandr](https://reader034.vdocuments.site/reader034/viewer/2022051819/54c661504a79595e038b4583/html5/thumbnails/19.jpg)
GLM Code Snippet
• Create an object to represent model
setClass("H2OGLMModel", representation(key="character", data="H2OParsedData", model="list"))
• Declare new method for algorithm
setGeneric("h2o.glm", function(x, y, data, family, nfolds = 10, alpha = 0.5, lambda = 1.0e-5) { standardGeneric("h2o.glm") })
Slide callouts: Name and Slots (on setClass); Parameter and Initial Value (on setGeneric)
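The same S4 pattern (class with named slots, generic, then a method dispatched on argument classes) can be tried without H2O. The sketch below uses hypothetical stand-ins, ToyData, ToyModel and toy.fit, that only mimic the shape of H2OParsedData, H2OGLMModel and h2o.glm; no model is actually fitted.

```r
library(methods)

# Hypothetical stand-ins for H2OParsedData and H2OGLMModel
setClass("ToyData", representation(key = "character"))
setClass("ToyModel", representation(key = "character", data = "ToyData", model = "list"))

# Declare the generic; dispatch happens on the classes of x, y and data
setGeneric("toy.fit", function(x, y, data, ...) { standardGeneric("toy.fit") })

# Method for character column names and a ToyData object
setMethod("toy.fit", signature(x = "character", y = "character", data = "ToyData"),
  function(x, y, data, ...) {
    # A real method would call the server here; this one only records its inputs
    new("ToyModel",
        key = paste0(data@key, ".model"),
        data = data,
        model = list(response = y, predictors = x))
  })

d <- new("ToyData", key = "prostate.hex")
m <- toy.fit(x = "PSA", y = "CAPSULE", data = d)
print(m@model$predictors)
```

Slots are read with @ and list elements with $, which is why the earlier example reads myModel@model$coefficients.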
![Page 20: Big datascienceh2oandr](https://reader034.vdocuments.site/reader034/viewer/2022051819/54c661504a79595e038b4583/html5/thumbnails/20.jpg)
GLM Code Snippet
setMethod("h2o.glm", signature(x="character", y="character", data="H2OParsedData", …), function(x, y, data, …) {
• Send parameters to GLM.json page; this starts the GLM job
res = h2o.__remoteSend(data@h2o, h2o.__PAGE_GLM, key = data@key, y = y, x = paste(x, sep="", collapse=","), …)
• Keep polling and wait until job completed
while(h2o.__poll(data@h2o, res$response$redirect_request_args$job) != -1) { Sys.sleep(1) }
• Query Inspect.json page with GLM model key to get results
res = h2o.__remoteSend(data@h2o, h2o.__PAGE_INSPECT, key=res$destination_key)
http://cran.r-project.org/doc/contrib/Genolini-S4tutorialV0-5en.pdf
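The "keep polling until the job completes" step can be sketched on its own in base R. Here make_fake_poll is a hypothetical stand-in for the status check (it is not part of the h2o package) and returns -1 once the pretend job is done; the max_polls guard is an addition so the loop cannot spin forever.

```r
# Hypothetical stand-in for a job status check: returns a progress fraction
# until the job is done, then -1
make_fake_poll <- function(steps) {
  calls <- 0
  function() {
    calls <<- calls + 1
    if (calls >= steps) -1 else calls / steps
  }
}

# Poll until the job reports -1, sleeping between checks; give up after max_polls
wait_for_job <- function(poll, interval = 0.01, max_polls = 100) {
  n <- 0
  while (poll() != -1) {
    n <- n + 1
    if (n >= max_polls) stop("job did not finish in time")
    Sys.sleep(interval)
  }
  n   # number of times we had to wait
}

poll <- make_fake_poll(steps = 3)
n_waits <- wait_for_job(poll)
print(n_waits)
```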
![Page 21: Big datascienceh2oandr](https://reader034.vdocuments.site/reader034/viewer/2022051819/54c661504a79595e038b4583/html5/thumbnails/21.jpg)
Demo 2: Data Munging and Remote H2O
![Page 22: Big datascienceh2oandr](https://reader034.vdocuments.site/reader034/viewer/2022051819/54c661504a79595e038b4583/html5/thumbnails/22.jpg)
Demo 2: Airlines Data
• Airlines data set 1987-2013 from RITA (25%)
• Goal: Predict if flight’s arrival will be delayed
• Examine slices of data directly
head(airlines.hex, n = 10); tail(airlines.hex)
summary(airlines.hex$DepTime)
• Take a subset of data to play with in R
airlines.small = as.data.frame(airlines.hex[1:1000,])
glm(IsArrDelayed ~ Dest + Origin, family = binomial, data = airlines.small)
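The same inspect, subset, fit sequence can be rehearsed locally without a cluster. The data frame below is synthetic: airlines.toy, the airport codes, and the random values are made up for illustration and only the column names come from the slide, so the fitted coefficients mean nothing.

```r
set.seed(1)
n <- 200

# Synthetic stand-in for the airlines data (values are random, not real flights)
airlines.toy <- data.frame(
  Origin = factor(sample(c("SFO", "ORD", "JFK"), n, replace = TRUE)),
  Dest = factor(sample(c("LAX", "ATL", "DFW"), n, replace = TRUE)),
  DepTime = sample(0:2359, n, replace = TRUE),
  IsArrDelayed = rbinom(n, 1, 0.4)
)

# Examine slices of the data
print(head(airlines.toy, n = 10))
print(summary(airlines.toy$DepTime))

# Take a subset and fit a logistic regression with base R's glm
airlines.small <- airlines.toy[1:100, ]
fit <- glm(IsArrDelayed ~ Dest + Origin, family = binomial, data = airlines.small)
print(coef(fit))
```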
![Page 23: Big datascienceh2oandr](https://reader034.vdocuments.site/reader034/viewer/2022051819/54c661504a79595e038b4583/html5/thumbnails/23.jpg)
![Page 24: Big datascienceh2oandr](https://reader034.vdocuments.site/reader034/viewer/2022051819/54c661504a79595e038b4583/html5/thumbnails/24.jpg)
http://www.transtats.bts.gov/Fields.asp?Table_ID=236
![Page 25: Big datascienceh2oandr](https://reader034.vdocuments.site/reader034/viewer/2022051819/54c661504a79595e038b4583/html5/thumbnails/25.jpg)
Connecting to H2O Remotely
• Your slip of paper contains IP/port of your assigned cluster
• Point R to remote H2O client
remoteH2O = new("H2OClient", ip = "192.168.1.161", port = 54321)
• All data operations occur on cluster
h2o.importFile(remoteH2O, path = "Path/On/Remote/Server/To/Data", …)
• Objects/methods operate just like before!
![Page 26: Big datascienceh2oandr](https://reader034.vdocuments.site/reader034/viewer/2022051819/54c661504a79595e038b4583/html5/thumbnails/26.jpg)
Roadmap
• Long-term Goal: Full H2O/R Integration
• Subset col by name/index: df[,c(1,2)]; df[,"name"]
• Add/Remove cols: df[,-c(1,2)]; df[,3] = df[,2] + 1
• Filter rows: df[df$cName < 5,]
• Combine data frames by row/col: rbind, cbind
• Apply functions: tapply, sapply, lapply
• Support for R libraries (plyr, ggplot2, etc)
• More Algorithms: GBM, PCA, Neural Networks
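For reference, this is what the data-frame operations on the roadmap do today on an ordinary in-memory R data.frame. It is a sketch of the behavior the integration aims to match, not of H2O itself; the data frame df and its values are made up.

```r
# Small made-up data frame
df <- data.frame(a = 1:6, b = c(2, 4, 6, 8, 10, 12), cName = c(3, 7, 1, 9, 4, 5))

# Subset columns by index and by name
first_two <- df[, c(1, 2)]
by_name <- df[, "cName"]

# Remove columns (only one column is left, so R returns a plain vector)
dropped <- df[, -c(1, 2)]

# Overwrite column 3 with column 2 plus 1
df[, 3] <- df[, 2] + 1

# Filter rows
filtered <- df[df$cName < 5, ]

# Combine data frames by row and by column
stacked <- rbind(df, df)
widened <- cbind(df, d = df$a * 2)

# Apply a function to every column
col_means <- sapply(df, mean)
print(col_means)
```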
![Page 27: Big datascienceh2oandr](https://reader034.vdocuments.site/reader034/viewer/2022051819/54c661504a79595e038b4583/html5/thumbnails/27.jpg)
Questions and Suggestions?