SparkR: Enabling Interactive Data Science at Scale
DESCRIPTION
"SparkR" presentation by Shivaram Venkataraman and Zongheng Yang
TRANSCRIPT
![Page 1: SparkR: Enabling Interactive Data Science at Scale](https://reader033.vdocuments.site/reader033/viewer/2022042816/559444811a28abfa2f8b4754/html5/thumbnails/1.jpg)
SparkR: Enabling Interactive Data Science at Scale
Shivaram Venkataraman Zongheng Yang
![Page 2: SparkR: Enabling Interactive Data Science at Scale](https://reader033.vdocuments.site/reader033/viewer/2022042816/559444811a28abfa2f8b4754/html5/thumbnails/2.jpg)
Fast
Scalable
Flexible
![Page 3: SparkR: Enabling Interactive Data Science at Scale](https://reader033.vdocuments.site/reader033/viewer/2022042816/559444811a28abfa2f8b4754/html5/thumbnails/3.jpg)
Statistics
Packages
Plots
![Page 4: SparkR: Enabling Interactive Data Science at Scale](https://reader033.vdocuments.site/reader033/viewer/2022042816/559444811a28abfa2f8b4754/html5/thumbnails/4.jpg)
Fast
Scalable
Flexible
Statistics
Plots
Packages
![Page 5: SparkR: Enabling Interactive Data Science at Scale](https://reader033.vdocuments.site/reader033/viewer/2022042816/559444811a28abfa2f8b4754/html5/thumbnails/5.jpg)
Outline
SparkR API
Live Demo
Design Details
![Page 6: SparkR: Enabling Interactive Data Science at Scale](https://reader033.vdocuments.site/reader033/viewer/2022042816/559444811a28abfa2f8b4754/html5/thumbnails/6.jpg)
RDD
Parallel Collection
Transformations: map, filter, groupBy, …
Actions: count, collect, saveAsTextFile, …
![Page 7: SparkR: Enabling Interactive Data Science at Scale](https://reader033.vdocuments.site/reader033/viewer/2022042816/559444811a28abfa2f8b4754/html5/thumbnails/7.jpg)
R + RDD = R2D2
![Page 8: SparkR: Enabling Interactive Data Science at Scale](https://reader033.vdocuments.site/reader033/viewer/2022042816/559444811a28abfa2f8b4754/html5/thumbnails/8.jpg)
R + RDD = RRDD
lapply lapplyPartition
groupByKey reduceByKey sampleRDD
collect cache filter …
broadcast includePackage
textFile parallelize
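The difference between applying a function per element and per partition can be sketched locally in base R. This is only an illustration of the idea, not SparkR code: `local_parallelize` is a hypothetical helper that splits a vector into chunks standing in for partitions, and nothing here talks to Spark.

```r
# Hypothetical local stand-in for splitting data into partitions.
# This is NOT the SparkR parallelize function.
local_parallelize <- function(x, num_slices) {
  # Assign each position to one of num_slices contiguous chunks.
  chunk_id <- ceiling(seq_along(x) * num_slices / length(x))
  unname(lapply(split(seq_along(x), chunk_id), function(i) x[i]))
}

partitions <- local_parallelize(1:10, 3)

# Element-wise style: the function sees one element at a time.
squares <- lapply(partitions, function(part) lapply(part, function(v) v^2))

# Partition-wise style: the function sees a whole chunk at once,
# so it can return a single partial result per partition.
partial_sums <- lapply(partitions, function(part) sum(part))

# Combining the partial results gives the global answer.
total <- Reduce(`+`, partial_sums)
```

Working per partition is useful when a function has setup cost or can summarize a whole chunk in one step, since only one small value per partition has to be combined at the end.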
![Page 9: SparkR: Enabling Interactive Data Science at Scale](https://reader033.vdocuments.site/reader033/viewer/2022042816/559444811a28abfa2f8b4754/html5/thumbnails/9.jpg)
SparkR – R package for Spark
R
RRDD
Spark
![Page 10: SparkR: Enabling Interactive Data Science at Scale](https://reader033.vdocuments.site/reader033/viewer/2022042816/559444811a28abfa2f8b4754/html5/thumbnails/10.jpg)
Example: word_count.R
library(SparkR)
lines <- textFile(sc, "hdfs://my_text_file")
![Page 11: SparkR: Enabling Interactive Data Science at Scale](https://reader033.vdocuments.site/reader033/viewer/2022042816/559444811a28abfa2f8b4754/html5/thumbnails/11.jpg)
Example: word_count.R
library(SparkR)
lines <- textFile(sc, "hdfs://my_text_file")
words <- flatMap(lines, function(line) { strsplit(line, " ")[[1]] })
wordCount <- lapply(words, function(word) { list(word, 1L) })
![Page 12: SparkR: Enabling Interactive Data Science at Scale](https://reader033.vdocuments.site/reader033/viewer/2022042816/559444811a28abfa2f8b4754/html5/thumbnails/12.jpg)
Example: word_count.R
library(SparkR)
lines <- textFile(sc, "hdfs://my_text_file")
words <- flatMap(lines, function(line) { strsplit(line, " ")[[1]] })
wordCount <- lapply(words, function(word) { list(word, 1L) })
counts <- reduceByKey(wordCount, "+", 2L)
output <- collect(counts)
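To see what each step of the word count computes without a cluster, the same pipeline can be mimicked in base R on a small in-memory vector. This is a sketch under assumptions: `reduce_by_key` is a hypothetical local helper, the input lines are made up, and the `2L` in the slide (presumably a partition count) has no counterpart here.

```r
# Made-up input standing in for the lines of the text file.
lines <- c("the quick brown fox", "the lazy dog", "the fox")

# flatMap step: each line yields several words, flattened into one vector.
words <- unlist(lapply(lines, function(line) strsplit(line, " ")[[1]]))

# lapply step: each word becomes a (key, value) pair.
word_count <- lapply(words, function(word) list(word, 1L))

# Hypothetical local helper: combine the values of pairs that share a key.
reduce_by_key <- function(pairs, combine) {
  acc <- list()
  for (p in pairs) {
    key <- p[[1]]
    acc[[key]] <- if (is.null(acc[[key]])) p[[2]] else combine(acc[[key]], p[[2]])
  }
  acc
}

# reduceByKey step with "+": sum the 1L values per word.
counts <- reduce_by_key(word_count, `+`)
```

Here `counts` is a named list with one total per distinct word, which is the local analogue of what `collect(counts)` would bring back to the R session.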
![Page 13: SparkR: Enabling Interactive Data Science at Scale](https://reader033.vdocuments.site/reader033/viewer/2022042816/559444811a28abfa2f8b4754/html5/thumbnails/13.jpg)
Demo: Digit Classification
![Page 14: SparkR: Enabling Interactive Data Science at Scale](https://reader033.vdocuments.site/reader033/viewer/2022042816/559444811a28abfa2f8b4754/html5/thumbnails/14.jpg)
MNIST
![Page 15: SparkR: Enabling Interactive Data Science at Scale](https://reader033.vdocuments.site/reader033/viewer/2022042816/559444811a28abfa2f8b4754/html5/thumbnails/15.jpg)
A
b
Minimize $\| Ax - b \|_2$
$x = (A^T A)^{-1} A^T b$
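The closed-form least-squares solution lends itself to data-parallel computation because both $A^T A$ and $A^T b$ are sums over blocks of rows: $A^T A = \sum_i A_i^T A_i$ and $A^T b = \sum_i A_i^T b_i$. The demo's own code is not in the transcript, so the following base-R sketch is only an assumption about how such a computation could be organized; the data is synthetic and the row chunks stand in for partitions.

```r
set.seed(1)

# Synthetic problem: n rows, d features, known coefficients plus small noise.
n <- 200
d <- 3
A <- matrix(rnorm(n * d), nrow = n)
x_true <- c(2, -1, 0.5)
b <- as.vector(A %*% x_true) + rnorm(n, sd = 0.01)

# Split the row indices into 4 chunks (stand-ins for partitions).
chunks <- split(seq_len(n), rep(1:4, length.out = n))

# Each chunk contributes a small d x d matrix and a d x 1 vector.
partials <- lapply(chunks, function(rows) {
  Ai <- A[rows, , drop = FALSE]
  list(AtA = crossprod(Ai), Atb = crossprod(Ai, b[rows]))
})

# Summing the per-chunk pieces reconstructs A^T A and A^T b.
AtA <- Reduce(`+`, lapply(partials, `[[`, "AtA"))
Atb <- Reduce(`+`, lapply(partials, `[[`, "Atb"))

# Solve the normal equations (A^T A) x = A^T b.
x_hat <- solve(AtA, Atb)
```

Only the small per-chunk summaries need to be combined, so the amount of data moved is independent of the number of rows. For ill-conditioned `A`, a QR-based solver is numerically safer than forming $A^T A$.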
![Page 16: SparkR: Enabling Interactive Data Science at Scale](https://reader033.vdocuments.site/reader033/viewer/2022042816/559444811a28abfa2f8b4754/html5/thumbnails/16.jpg)
How does this work ?
![Page 17: SparkR: Enabling Interactive Data Science at Scale](https://reader033.vdocuments.site/reader033/viewer/2022042816/559444811a28abfa2f8b4754/html5/thumbnails/17.jpg)
Dataflow
Local
Worker
Worker
![Page 18: SparkR: Enabling Interactive Data Science at Scale](https://reader033.vdocuments.site/reader033/viewer/2022042816/559444811a28abfa2f8b4754/html5/thumbnails/18.jpg)
Dataflow
Local
R
Worker
Worker
![Page 19: SparkR: Enabling Interactive Data Science at Scale](https://reader033.vdocuments.site/reader033/viewer/2022042816/559444811a28abfa2f8b4754/html5/thumbnails/19.jpg)
Dataflow
Local: R Spark Context, JNI, Java Spark Context
Worker
Worker
![Page 20: SparkR: Enabling Interactive Data Science at Scale](https://reader033.vdocuments.site/reader033/viewer/2022042816/559444811a28abfa2f8b4754/html5/thumbnails/20.jpg)
Dataflow
Local: R Spark Context, JNI, Java Spark Context
Worker: Spark Executor, exec, R
Worker: Spark Executor, exec, R
![Page 21: SparkR: Enabling Interactive Data Science at Scale](https://reader033.vdocuments.site/reader033/viewer/2022042816/559444811a28abfa2f8b4754/html5/thumbnails/21.jpg)
![Page 22: SparkR: Enabling Interactive Data Science at Scale](https://reader033.vdocuments.site/reader033/viewer/2022042816/559444811a28abfa2f8b4754/html5/thumbnails/22.jpg)
From http://obeautifulcode.com/R/How-R-Searches-And-Finds-Stuff/
![Page 23: SparkR: Enabling Interactive Data Science at Scale](https://reader033.vdocuments.site/reader033/viewer/2022042816/559444811a28abfa2f8b4754/html5/thumbnails/23.jpg)
![Page 24: SparkR: Enabling Interactive Data Science at Scale](https://reader033.vdocuments.site/reader033/viewer/2022042816/559444811a28abfa2f8b4754/html5/thumbnails/24.jpg)
Dataflow
Local: R Spark Context, JNI, Java Spark Context
Worker: Spark Executor, exec, R
Worker: Spark Executor, exec, R
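For an R function written in the local session to run inside an R process started by an executor, the function has to travel together with the variables it refers to. The transcript does not say how SparkR does this, so the sketch below only illustrates the general idea with base R's `serialize` and `unserialize`; `make_scaler` is a made-up example function.

```r
# A closure: the returned function refers to k from its enclosing environment.
make_scaler <- function(k) {
  force(k)  # evaluate k now so the closure holds a value, not a promise
  function(x) x * k
}
scale_by_3 <- make_scaler(3)

# "Local" side: turn the function, including its enclosing environment,
# into a raw vector that could be sent over the wire.
payload <- serialize(scale_by_3, connection = NULL)

# "Worker" side: rebuild the function and apply it to a partition of data.
restored <- unserialize(payload)
result <- lapply(list(1, 2, 3), restored)
```

One caveat with this approach: `serialize` stores the global environment only as a reference, so a function that reads variables defined at the top level of the session would not carry them along. A real system has to collect such values separately.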
![Page 25: SparkR: Enabling Interactive Data Science at Scale](https://reader033.vdocuments.site/reader033/viewer/2022042816/559444811a28abfa2f8b4754/html5/thumbnails/25.jpg)
Pipelined RDD
words <- flatMap(lines,…)
wordCount <- lapply(words,…)
[Diagram labels: Spark Executor, exec, R; Spark Executor, exec, R]
![Page 26: SparkR: Enabling Interactive Data Science at Scale](https://reader033.vdocuments.site/reader033/viewer/2022042816/559444811a28abfa2f8b4754/html5/thumbnails/26.jpg)
Pipelined RDD
[Diagram labels, row 1: Spark Executor, exec, R; Spark Executor, exec, R]
[Diagram labels, row 2: Spark Executor, exec, R, R, Spark Executor]
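The idea behind pipelining can be sketched as function composition: consecutive per-partition stages are fused into one function, so a partition passes through a single R invocation instead of one per stage. This is a base-R illustration with hypothetical helper names (`compose_stages`, `flat_map_stage`, `map_stage`), not SparkR's implementation.

```r
# Fuse several partition-level stages into one function.
compose_stages <- function(...) {
  stages <- list(...)
  function(partition) {
    Reduce(function(data, stage) stage(data), stages, partition)
  }
}

# Stage builders: each turns an element-level function into a
# function over a whole partition (a list of elements).
flat_map_stage <- function(f) {
  function(part) unlist(lapply(part, f), recursive = FALSE)
}
map_stage <- function(f) {
  function(part) lapply(part, f)
}

# The two stages from the word count example.
split_words <- flat_map_stage(function(line) as.list(strsplit(line, " ")[[1]]))
pair_up <- map_stage(function(word) list(word, 1L))

# One fused function applied once per partition.
pipeline <- compose_stages(split_words, pair_up)

partition <- list("a b", "b c")
pairs <- pipeline(partition)
```

The fused `pipeline` produces the same pairs as running `split_words` and then `pair_up` separately, but the intermediate list of words never has to leave the R process.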
![Page 27: SparkR: Enabling Interactive Data Science at Scale](https://reader033.vdocuments.site/reader033/viewer/2022042816/559444811a28abfa2f8b4754/html5/thumbnails/27.jpg)
Alpha developer release
One line install !
install_github("amplab-extras/SparkR-pkg", subdir="pkg")
![Page 28: SparkR: Enabling Interactive Data Science at Scale](https://reader033.vdocuments.site/reader033/viewer/2022042816/559444811a28abfa2f8b4754/html5/thumbnails/28.jpg)
SparkR Implementation
Very similar to PySpark
Spark is easy to extend
329 lines of Scala code
2079 lines of R code
693 lines of test code in R
![Page 29: SparkR: Enabling Interactive Data Science at Scale](https://reader033.vdocuments.site/reader033/viewer/2022042816/559444811a28abfa2f8b4754/html5/thumbnails/29.jpg)
Also on GitHub
EC2 setup scripts
All Spark examples
MNIST demo
YARN, Windows support
![Page 30: SparkR: Enabling Interactive Data Science at Scale](https://reader033.vdocuments.site/reader033/viewer/2022042816/559444811a28abfa2f8b4754/html5/thumbnails/30.jpg)
Developer Community
13 contributors (10 from outside AMPLab)
Collaboration with Alteryx
![Page 31: SparkR: Enabling Interactive Data Science at Scale](https://reader033.vdocuments.site/reader033/viewer/2022042816/559444811a28abfa2f8b4754/html5/thumbnails/31.jpg)
On the Roadmap
High-level DataFrame API
Integrating Spark's MLlib from R
Merge with Apache Spark
![Page 32: SparkR: Enabling Interactive Data Science at Scale](https://reader033.vdocuments.site/reader033/viewer/2022042816/559444811a28abfa2f8b4754/html5/thumbnails/32.jpg)
SparkR
Combine scalability & utility
RDD → distributed lists
Run R on clusters
Re-use existing packages
![Page 33: SparkR: Enabling Interactive Data Science at Scale](https://reader033.vdocuments.site/reader033/viewer/2022042816/559444811a28abfa2f8b4754/html5/thumbnails/33.jpg)
SparkR https://github.com/amplab-extras/SparkR-pkg
Shivaram Venkataraman, Zongheng Yang
SparkR mailing list