ScootR: Scaling R Dataframes on Dataflow Systems
Andreas Kunft¹, Lukas Stadler², Daniele Bonetta², Cosmin Basca², Jens Meiners¹, Sebastian Breß¹, Tilmann Rabl¹, Juan Fumero³, Volker Markl²
¹ Technische Universität Berlin, ² Oracle Labs, ³ University of Manchester
R gained increased traction
• Dynamically typed, open-source language
• Rich support for analytics & statistics
But
• Standalone R is not well suited for out-of-core data loads
Analytics pipelines often work on large amounts of raw data
• Dataflow engines (DF), e.g., Apache Flink and Spark, scale out
• Provide rich support for user-defined functions (UDFs)
But
• R users are often unfamiliar with DF APIs and concepts
Combine the usability of R with the scalability of dataflow engines
- Goals
- From function calls to an operator graph
- Approaches to execute R UDFs
- Our Approach: ScootR
- Evaluation
GOALS
1. Provide a data.frame API with a natural feel
• df$km <- df$miles * 1.6
• df <- select(df, count = flights, distance)
• df <- apply(df, func)
2. Achieve comparable performance to native dataflow API
From function calls to an operator graph
MAPPING DATA TYPES
• R data.frame(T1,T2,…,TN) as Flink DataSet<TupleN<T1,T2,…,TN>>
• E.g., data.frame(integer, character) as DataSet<Tuple2<Integer, String>>
• N columns map to N tuple fields; the DataSet has a fixed element type, a Tuple with arity N
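The type mapping above can be sketched in plain R. The helper `flink_type` and the lookup table `r_to_java` are hypothetical names introduced for illustration, not part of ScootR; the table covers only a few basic types and is an assumption.

```r
# Hypothetical helper, not ScootR's API: derive the Flink tuple type
# that corresponds to the column types of an R data.frame.
r_to_java <- c(integer = "Integer", character = "String",
               numeric = "Double", logical = "Boolean")  # assumed mapping

flink_type <- function(df) {
  # Class of each column, e.g. c("integer", "character")
  col_classes <- vapply(df, function(col) class(col)[1], character(1))
  java_types <- r_to_java[col_classes]
  if (anyNA(java_types)) stop("unsupported column type")
  # N columns become a TupleN with N fields
  sprintf("DataSet<Tuple%d<%s>>", length(java_types),
          paste(java_types, collapse = ", "))
}

df <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)
writeLines(flink_type(df))
# DataSet<Tuple2<Integer, String>>
```

Because every row of a data.frame has the same column types, one tuple type describes the whole dataset.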
MAPPING R FUNCTIONS TO OPERATORS
• Functions on data.frames lazily build an operator graph
1. Functions w/o UDFs are handled before execution, e.g., a select function is mapped to a project operator:
select(df$id, df$arrival) to ds.project(1, 3)
2. Functions w/ UDFs call R functions during execution
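A toy model may help to picture the lazy construction of the operator graph. All names below (`lazy_df`, `lazy_select`, `lazy_apply`, `describe_plan`) are hypothetical and only record operators in a linear plan; this is a sketch of the idea, not ScootR's implementation.

```r
# Toy model of lazy evaluation: each call only appends an operator
# description to a plan; nothing is computed.
lazy_df <- function(columns) {
  structure(list(columns = columns, plan = list()), class = "lazy_df")
}

# select without a UDF: resolved to a projection before execution
lazy_select <- function(ldf, ...) {
  idx <- match(c(...), ldf$columns)  # column names -> positions
  if (anyNA(idx)) stop("unknown column")
  ldf$plan <- c(ldf$plan, list(list(op = "project", fields = idx)))
  ldf$columns <- ldf$columns[idx]
  ldf
}

# apply with a UDF: the R function is stored and would only be
# called during execution
lazy_apply <- function(ldf, udf) {
  ldf$plan <- c(ldf$plan, list(list(op = "map", udf = udf)))
  ldf
}

# Describe the recorded operators (a linear chain in this sketch)
describe_plan <- function(ldf) {
  vapply(ldf$plan, function(node) {
    if (node$op == "project") {
      sprintf("project(%s)", paste(node$fields, collapse = ", "))
    } else {
      node$op
    }
  }, character(1))
}

df <- lazy_df(c("id", "departure", "arrival"))
df <- lazy_select(df, "id", "arrival")
df <- lazy_apply(df, function(row) row)
print(describe_plan(df))
```

The projection is fully resolved while the plan is built, whereas the map operator keeps a reference to the R function for later.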
Approaches to execute R UDFs
INTER PROCESS COMMUNICATION (IPC)
[Figure: Driver, Client, and two Workers; each Worker runs two Tasks, and each Task is paired with its own R Process]
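1. Communication + serialization (R <-> Java)
2. JVM and R compete for memory
[Figure: inside a Worker, the filter operator in the JVM exchanges data with an R Process (1) that runs the UDF below; JVM and R Process share the Worker's memory (2)]
filter <- function(df) {
  df$language == 'english'
}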
SOURCE-TO-SOURCE TRANSLATION (STS)
• Translate a restricted set of functions to the native dataflow API
• Constant translation overhead, but native execution performance
• E.g., STS translation in SparkR to Spark’s Scala Dataframe API:
R (SparkR): df <- filter(df, df$language == 'english')
Scala: val df = df.filter($"language" === "english")
R (SparkR): df$km <- df$miles * 1.6
Scala: val df = df.withColumn("km", $"miles" * 1.6)
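To illustrate why STS is restricted to a language subset, here is a toy translator written in base R. The function `to_scala` is hypothetical and handles only `df$col`, `==`, `*`, and literals; it is not how SparkR is implemented.

```r
# Toy source-to-source translator for a restricted expression subset.
# Anything outside the subset is rejected.
to_scala <- function(e) {
  if (is.character(e)) return(sprintf("\"%s\"", e))  # string literal
  if (is.numeric(e)) return(format(e))               # numeric literal
  if (is.call(e) && is.symbol(e[[1]])) {
    op <- as.character(e[[1]])
    if (op == "$") {
      # df$col becomes a Scala column reference $"col"
      return(sprintf("$\"%s\"", as.character(e[[3]])))
    }
    if (op == "==") {
      return(paste(to_scala(e[[2]]), "===", to_scala(e[[3]])))
    }
    if (op == "*") {
      return(paste(to_scala(e[[2]]), "*", to_scala(e[[3]])))
    }
  }
  stop("expression outside the supported subset")
}

writeLines(to_scala(quote(df$language == "english")))
writeLines(to_scala(quote(df$miles * 1.6)))
```

Every additional R construct needs its own translation rule, which is the trade-off between a small supported subset and a full compiler.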
Inter Process Communication
+ Execute arbitrary R code
- Data serialization
- Data exchange
- Java and R processes compete for memory
Source-to-source translation
+ Native performance
- Restricted to a language subset, or requires building a full-fledged compiler
A common runtime for R and Java
BACKGROUND: TRUFFLE/GRAAL
[Figure, built up over four slides: HotSpot executes bytecode with its JIT compiler; Graal is added to HotSpot; Truffle is layered on top of Graal, forming GraalVM. Final stack, top to bottom: source code (*.js, *.R, *.java); TruffleJS and TruffleR (fastR) as AST interpreters on Truffle, next to javac; Graal; the HotSpot runtime (interpreter, GC, …).]
Figure based on: Grimmer, Matthias, et al. "High-performance cross-language interoperability in a multi-language runtime." ACM SIGPLAN Notices. Vol. 51. No. 2. ACM, 2015.
SCOOTR: FASTR + FLINK
SCOOTR OVERVIEW
flink.init(SERVER, PORT)
flink.parallelism(DOP)

df <- flink.readdf(SOURCE,
  list("id", "body", …),
  list(character, character, …)
)

words <- function(df) {
  len <- length(strsplit(df$body, " ")[[1]])
  list(df$id, df$body, len)
}

df <- flink.apply(df, words)

flink.writeAsText(df, SINK)
flink.execute()
Efficient data access in R UDFs
function(df) {
  len <- length(strsplit(df$body, " ")[[1]])
  list(df$id, df$body, len)
}
1. Dataframe proxy keeps track of columns and provides efficient access
function(tuple) {
  len <- length(strsplit(tuple[[2]], " ")[[1]])
  list(tuple[[1]], tuple[[2]], len)
}
2. Rewrite to directly instantiate a Flink tuple instead of an R list
function(tuple) {
  len <- length(strsplit(tuple[[2]], " ")[[1]])
  flink.tuple(tuple[[1]], tuple[[2]], len)
}
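Both rewrites can be imitated on an R function body with base R alone. The helper `rewrite` is hypothetical, and `flink.tuple` is replaced by a stand-in so the sketch runs without Flink. As a simplification, it rewrites every call to `list`, whereas a real implementation would presumably only touch the returned value.

```r
# Sketch of the two rewrites as a walk over the function's syntax tree.
# `columns` plays the role of the schema tracked for the dataframe.
rewrite <- function(e, columns) {
  if (!is.call(e)) return(e)  # symbols and constants stay as they are
  if (identical(e[[1]], as.name("$")) && identical(e[[2]], as.name("df"))) {
    idx <- match(as.character(e[[3]]), columns)
    if (is.na(idx)) stop("unknown column")
    # Rewrite 1: df$body -> tuple[[2]]
    return(call("[[", as.name("tuple"), as.numeric(idx)))
  }
  # Rewrite 2: list(...) -> flink.tuple(...)
  if (identical(e[[1]], as.name("list"))) e[[1]] <- as.name("flink.tuple")
  as.call(lapply(as.list(e), rewrite, columns = columns))
}

words <- function(df) {
  len <- length(strsplit(df$body, " ")[[1]])
  list(df$id, df$body, len)
}

new_body <- rewrite(body(words), columns = c("id", "body"))
print(new_body)

# Stand-in so the rewritten UDF can be called here; the real
# flink.tuple would create a Flink tuple instead of an R list.
flink.tuple <- list
tuple_udf <- function(tuple) NULL
body(tuple_udf) <- new_body
str(tuple_udf(list("42", "a b c")))
```

The rewritten function takes a positional tuple and no longer looks up column names for each row.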
IMPACT OF DIRECT TYPE ACCESS
• From list(...) to flink.tuple(...)
• Avoids additional pass over R list to create Flink tuple
• Up to 1.75x performance improvement
[Figure: two plots, for output with arity 2 and output with arity 19. Purple is function execution; pink (hatched) is conversion from list to tuple.]
Evaluation
APPLY FUNCTION MICROBENCHMARK
• Airline On-Time Performance Dataset (2005–2016): CSV, 19 columns, 9.5 GB
• UDF: df$km <- df$miles * 1.6
ScootR and SparkR (STS) achieve near native performance
Both heavily outperform GNU R and fastR
APPLY FUNCTION MICROBENCHMARK: SCALABILITY
MIXED PIPELINE W/ PREPROCESSING AND ML
Pipeline:
- (Distributed) preprocessing of the dataset
- Data is collected locally and a generalized linear model is trained
Majority of the time is spent in preprocessing
ScootR is up to 11x faster than GNU R and fastR
RECAP
• ScootR provides a data.frame API in R for Apache Flink
• R and Flink run within the same runtime
• Avoids serialization and data exchange
• Avoids type conversion
> Achieves near native performance for a rich set of operators