What is Spark? - Stanford University Talks/stanford-seminar.pdf
New Developments in Spark
And Rethinking APIs for Big Data
Matei Zaharia and many others
What is Spark?
Unified computing engine for big data apps
  > Batch, streaming and interactive
Collection of high-level APIs
  > One of the first widely used systems with a functional API
  > Libraries for SQL, ML, graph, …
Diagram: Spark with the Streaming, SQL, MLlib and GraphX libraries
Project Growth
| | June 2013 | January 2016 |
|---|---|---|
| Lines of code | 70,000 | 450,000 |
| Total contributors | 80 | 1000 |
| Monthly contributors | 20 | 140 |
| Largest cluster | 400 nodes | 8000 nodes |
Most active open source project in big data
This Talk
Original Spark vision
How did the vision hold up?
New APIs: DataFrames + Spark SQL
New capabilities under these APIs
Ongoing research
Original Spark Vision
1) Unified engine for big data processing
  > Combines batch, interactive, streaming
2) Concise, language-integrated API
  > Functional programming in Scala/Java/Python
Motivation: Unification
General batch processing: MapReduce
Specialized systems for new workloads: Pregel, Dremel, Presto, Storm, Giraph, Drill, Impala, S4, ...
Hard to manage, tune, deploy
Hard to compose into pipelines
Unified engine?
Motivation: Concise API
Much of data analysis is exploratory / interactive
Answer: Resilient Distributed Datasets (RDDs)
  > Distributed collections with simple functional API
lines = spark.textFile("hdfs://...")
points = lines.map(line => parsePoint(line))
points.filter(p => p.x > 100).count()
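A rough local analogue of the three-line example above, written in plain Python rather than Spark so it runs without a cluster. The `Point` type, the `parse_point` helper and the sample lines are assumptions made for illustration; the slide does not define `parsePoint` or the file contents.

```python
from typing import Iterable, NamedTuple


class Point(NamedTuple):
    x: float
    y: float


def parse_point(line: str) -> Point:
    """Hypothetical parser: assumes each line looks like 'x,y'."""
    x, y = line.split(",")
    return Point(float(x), float(y))


def count_points_right_of(lines: Iterable[str], threshold: float) -> int:
    # map and filter are lazy here, much like RDD transformations;
    # consuming the iterator plays the role of the count() action.
    points = map(parse_point, lines)
    selected = filter(lambda p: p.x > threshold, points)
    return sum(1 for _ in selected)


sample_lines = ["50,1", "150,2", "300,3"]  # stand-in for the HDFS file
print(count_points_right_of(sample_lines, 100))
```

The shape is the same as on the slide: a source, a `map`, a `filter`, and one action that triggers the work.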
How Did the Vision Hold Up?
Mostly well
Users really appreciate unification
Functional API causes some challenges, which we are now tackling
Libraries Built on Spark
Spark Core, with these libraries on top:
Spark Streaming (real-time)
Spark SQL (relational)
MLlib (machine learning)
GraphX (graph)
Largest integrated library for big data
Which Libraries do People Use?
80% of users use more than one component
60% use three or more

| Component | Users |
|---|---|
| Spark SQL | 69% |
| Streaming | 58% |
| MLlib | 54% |
| GraphX | 18% |
Which Languages do People Use?
Charts: 2014 Languages Used (84%, 38%, 38%); 2015 Languages Used (71%, 31%, 58%, 18%)
Main Challenge: Functional API
Looks high-level, but hides many semantics of computation from the engine
  > Functions passed in are arbitrary blocks of code
  > Data stored is arbitrary Java/Python objects
Users can mix APIs in suboptimal ways
Which API Call Causes Most Tickets?
map
filter
groupBy
sort
union
join
leftOuterJoin
rightOuterJoin
reduce
count
fold
reduceByKey
cogroup
cross
zip
sample
take
first
partitionBy
mapWith
pipe
save
...
groupByKey
What People Do
pairs = data.map(word => (word, 1))
groups = pairs.groupByKey()
groups.map { case (k, vs) => (k, vs.sum) }
Materializes all groups as lists of integers:
("the", [1, 1, 1, 1, 1, 1]) ("quick", [1, 1]) ("fox", [1, 1])
Then sums each list:
("the", 6) ("quick", 2) ("fox", 2)
Better code: pairs.reduceByKey(_ + _)
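The difference between the two approaches can be sketched in plain Python. This is a single-machine illustration, not Spark code: the function names and the word list are made up for the example, and in real Spark the benefit of `reduceByKey` also comes from combining values before they are shuffled across the network, which a local sketch cannot show.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, int]


def count_with_group_by_key(pairs: Iterable[Pair]) -> Dict[str, int]:
    # Step 1: materialize every value for a key in a list (the groupByKey pattern).
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Step 2: only now sum each list.
    return {key: sum(values) for key, values in groups.items()}


def count_with_reduce_by_key(pairs: Iterable[Pair]) -> Dict[str, int]:
    # Keep one running total per key; no per-key list is ever built.
    totals: Dict[str, int] = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


words = ["the", "quick", "the", "fox", "the", "quick", "the", "fox", "the", "the"]
pairs = [(word, 1) for word in words]
print(count_with_group_by_key(pairs))
print(count_with_reduce_by_key(pairs))
```

Both functions return the same counts; the first holds a list per key in memory at its peak, the second holds one integer per key.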
Challenge: Data Representation
class User(name: String, friends: Array[Int])
Diagram: a User object holds pointers (0x…) to a String, which wraps a char[] of length 5 ("Bobby"), and to an int[] of length 3 (0, 1, 2)
Object graphs much larger than underlying data
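The same effect is easy to observe outside the JVM. The sketch below is a Python analogue, not a measurement of Java objects: `object_graph_bytes` adds up shallow `sys.getsizeof` values as a rough estimate, and the packed layout in `packed_bytes` is a hypothetical format chosen for the example.

```python
import struct
import sys
from dataclasses import dataclass
from typing import List


@dataclass
class User:
    name: str
    friends: List[int]


def object_graph_bytes(user: User) -> int:
    """Rough size of the object graph: shallow sizes of each object, added up.

    Ignores interpreter details such as shared small integers, so treat the
    result as an estimate rather than an exact figure.
    """
    total = sys.getsizeof(user)
    total += sys.getsizeof(user.name)
    total += sys.getsizeof(user.friends)
    total += sum(sys.getsizeof(friend) for friend in user.friends)
    return total


def packed_bytes(user: User) -> bytes:
    """Hypothetical compact layout: name length, name bytes, friend count, int32 friends."""
    name = user.name.encode("utf-8")
    layout = f"<I{len(name)}sI{len(user.friends)}i"
    return struct.pack(layout, len(name), name, len(user.friends), *user.friends)


bobby = User("Bobby", [0, 1, 2])
print(object_graph_bytes(bobby), len(packed_bytes(bobby)))
```

The packed form stores only lengths and payload, while the object graph pays a header and a pointer for every object it contains.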
DataFrames and Spark SQL
Efficient library for working with structured data
  > Two interfaces: SQL for data analysts + external apps, DataFrames for programmers
Optimized computation and storage
SIGMOD 2015
Spark SQL Architecture
Diagram: SQL / Data Frames → Logical Plan → Optimizer → Physical Plan → Code Generator → RDDs, with the Catalog and the Data Source API alongside
DataFrame API
DataFrames hold rows with a known schema and offer relational operations on them through a DSL
users = sql("select * from users")
ma_users = users[users.state == "MA"]
ma_users.count()
ma_users.groupBy("name").avg("age")
ma_users.map(lambda u: u.name.upper())
Expression AST: users.state == "MA"
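One way a DSL can capture `users.state == "MA"` as an expression tree instead of a boolean is operator overloading. The sketch below is a toy model of that idea, not Spark's implementation: `Column`, `Equals` and `Frame` are hypothetical classes invented for the example.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class Equals:
    """AST node for `column == literal`."""
    column: str
    literal: Any

    def evaluate(self, row: Dict[str, Any]) -> bool:
        return row[self.column] == self.literal


class Column:
    """Hypothetical column handle; comparing it yields an AST node, not a bool."""

    def __init__(self, name: str) -> None:
        self.name = name

    def __eq__(self, other: Any) -> Equals:  # type: ignore[override]
        return Equals(self.name, other)


class Frame:
    """Toy data frame over a list of dict rows."""

    def __init__(self, rows: List[Dict[str, Any]]) -> None:
        self.rows = rows

    def __getattr__(self, name: str) -> Column:
        # Only reached for names that are not real attributes, e.g. users.state.
        if name.startswith("_"):
            raise AttributeError(name)
        return Column(name)

    def __getitem__(self, predicate: Equals) -> "Frame":
        return Frame([row for row in self.rows if predicate.evaluate(row)])

    def count(self) -> int:
        return len(self.rows)


users = Frame([
    {"name": "ann", "state": "MA"},
    {"name": "bob", "state": "CA"},
    {"name": "cy", "state": "MA"},
])
predicate = users.state == "MA"   # an Equals node, inspectable before it runs
ma_users = users[predicate]
print(predicate, ma_users.count())
```

Because the predicate is data rather than an opaque function, an engine can inspect which column it touches before deciding how to execute it.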
API Details
Based on data frame concept in R and Python
  > Spark is first system to make this API declarative
Integrated with the rest of Spark
  > MLlib takes DataFrames as input/output
  > Easily convert RDDs ↔ DataFrames
Google trends for “data frame”
What DataFrames Enable
1. Compact binary representation
  • Columnar format outside Java heap
2. Optimization across operators (join reordering, predicate pushdown, etc)
3. Runtime code generation
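To give a feel for the third point, here is a small Python sketch of runtime code generation: an expression tree is turned into source text once and compiled, instead of being walked for every row. Spark SQL targets the JVM and its generator is far more involved; the tuple-based tree format and the function names below are assumptions made for this illustration.

```python
import operator
from typing import Any, Callable, Dict, Tuple

# Expression trees as nested tuples: ("col", name), ("lit", value), (op, left, right).
Expr = Tuple[Any, ...]
Row = Dict[str, Any]

_OPS = {
    "add": (operator.add, "+"),
    "mul": (operator.mul, "*"),
    "gt": (operator.gt, ">"),
    "eq": (operator.eq, "=="),
}


def interpret(expr: Expr, row: Row) -> Any:
    """Walk the tree for every row (the per-row overhead that code generation removes)."""
    kind = expr[0]
    if kind == "col":
        return row[expr[1]]
    if kind == "lit":
        return expr[1]
    func = _OPS[kind][0]
    return func(interpret(expr[1], row), interpret(expr[2], row))


def to_source(expr: Expr) -> str:
    """Render the tree as a Python expression over a variable named `row`."""
    kind = expr[0]
    if kind == "col":
        return f"row[{expr[1]!r}]"
    if kind == "lit":
        return repr(expr[1])
    symbol = _OPS[kind][1]
    return f"({to_source(expr[1])} {symbol} {to_source(expr[2])})"


def compile_expr(expr: Expr) -> Callable[[Row], Any]:
    # eval is applied only to source built locally from a trusted tree.
    return eval(f"lambda row: {to_source(expr)}")


expr = ("gt", ("mul", ("col", "price"), ("lit", 2)), ("lit", 100))
compiled = compile_expr(expr)
print(to_source(expr), compiled({"price": 60}), compiled({"price": 10}))
```

The compiled function does the tree walk once, at compile time; each row then runs straight-line code.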
Performance
Chart: time for aggregation benchmark (s), axis 0 to 10, with bars for RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, DataFrame R and DataFrame SQL
DataFrames vs SQL
Easier to compose into large programs: organize code into functions, classes, etc
“[DataFrames are] concise and declarative like SQL, but I can name intermediate values”
Spark 1.6 adds static typing over DataFrames (Datasets: tinyurl.com/spark-datasets)
New Capabilities under Spark SQL
Uniform and efficient access to data sources
Rich optimization across libraries
Data Sources
Having a uniform API for structured data lets apps migrate across data sources
  > Hive, MySQL, Cassandra, JSON, …
API semantics allow query pushdown into sources (not possible with old RDD API)
users[users.age > 20]
select id from users
Spark SQL
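Query pushdown can be illustrated with an in-memory SQLite database standing in for the external source. This is a sketch of the idea only, not how Spark's Data Source API is written: the table contents and both function names are invented for the example.

```python
import sqlite3
from typing import List


def make_source() -> sqlite3.Connection:
    """In-memory stand-in for an external data source such as a MySQL table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, age INTEGER)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [(1, 15), (2, 25), (3, 40), (4, 19)],
    )
    return conn


def ids_without_pushdown(conn: sqlite3.Connection, min_age: int) -> List[int]:
    # With an opaque filter function, every row must be fetched and then
    # filtered on the engine side.
    rows = conn.execute("SELECT id, age FROM users ORDER BY id").fetchall()
    return [user_id for user_id, age in rows if age > min_age]


def ids_with_pushdown(conn: sqlite3.Connection, min_age: int) -> List[int]:
    # With a declarative predicate, the filter and the projection travel to
    # the source, so only matching ids come back.
    rows = conn.execute(
        "SELECT id FROM users WHERE age > ? ORDER BY id", (min_age,)
    ).fetchall()
    return [user_id for (user_id,) in rows]


source = make_source()
print(ids_without_pushdown(source, 20), ids_with_pushdown(source, 20))
```

Both calls return the same ids; the second moves less data out of the source, which is the point of the slide.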
Examples
JSON (tweets.json):
{ "text": "hi", "user": { "name": "bob", "id": 15 } }
select user.id, text from tweets
JDBC:
select age from users where lang = "en"
Together:
select t.text, u.age from tweets t, users u where t.user.id = u.id and u.lang = "en"
Diagram: Spark SQL reads {JSON} and sends select id, age from users where lang="en" to the JDBC source
Library Composition
One of our goals was to unify processing types
Problem: optimizing across libraries
  > Big data is expensive to copy & scan (not a problem for small data)
  > Libraries are written in isolation
Spark SQL gives more semantics to do this
Diagram: SQL, Data Frames, ML and Graph libraries share one Logical Plan
Example: ML Pipelines
New API in MLlib that lets users express and optimize end-to-end workflows
  > Feature preparation, training, evaluation
  > Similar to scikit-learn, but declarative
tokenizer = Tokenizer()
tf = HashingTF(features=1000)
lr = LogisticRegression(r=0.1)
p = Pipeline(tokenizer, tf, lr)
p.fit(df)
Diagram: DataFrame → tokenizer → TF → LR → model
  > Fused into one pass over data
  > Filters pushed into data source
CrossValidator.fit(p, df, args)
  > Repeated queries
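The "fused into one pass" idea can be sketched with lazy generators: each stage is declared up front, and the data is traversed only once when the result is consumed. This is a plain-Python toy, not the MLlib Pipeline API; the `Pipeline` class, the stage functions and the use of `zlib.crc32` as the hash are assumptions for illustration, and the training step is left out.

```python
import zlib
from typing import Callable, Iterable, Iterator, List

Stage = Callable[[Iterator], Iterator]


def tokenizer(texts: Iterator[str]) -> Iterator[List[str]]:
    """Split each document into lowercase tokens."""
    for text in texts:
        yield text.lower().split()


def hashing_tf(num_features: int) -> Stage:
    """Build a stage that maps token lists to fixed-size term-count vectors."""
    def stage(token_lists: Iterator[List[str]]) -> Iterator[List[int]]:
        for tokens in token_lists:
            vector = [0] * num_features
            for token in tokens:
                # crc32 gives a hash that is stable across runs.
                vector[zlib.crc32(token.encode("utf-8")) % num_features] += 1
            yield vector
    return stage


class Pipeline:
    """Holds a declared sequence of stages and chains them lazily."""

    def __init__(self, *stages: Stage) -> None:
        self.stages = stages

    def transform(self, rows: Iterable) -> Iterator:
        stream: Iterator = iter(rows)
        for stage in self.stages:
            stream = stage(stream)  # nothing runs yet; generators are lazy
        return stream


pipeline = Pipeline(tokenizer, hashing_tf(8))
vectors = list(pipeline.transform(["the quick fox", "the the"]))
print(vectors)
```

Consuming the final iterator pulls each document through every stage in turn, so no intermediate collection of token lists is ever stored.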
The Problem
Hardware has changed a lot since big data systems were first designed

| | 2010 | 2015 | Change |
|---|---|---|---|
| Storage | 50+MB/s (HDD) | 500+MB/s (SSD) | 10x |
| Network | 1Gbps | 10Gbps | 10x |
| CPU | ~3GHz | ~3GHz | ! |

New bottleneck in Spark, Hadoop, etc.
To Make Matters Worse
In response to the slowdown of Moore’s Law, hardware is becoming more diverse
Diagram: App 1, App 2 and App 3, each targeting CPU, GPU and FPGA
Have to optimize separately for each platform!
Observation
Many common algorithms can be written with “embarrassingly” data-parallel operations
  > See how many run on MapReduce / Spark
Focus on optimizing these as opposed to general programs (e.g. C++)
The Goal
Diagram: machine learning, SQL and graph algorithms compile to a common intermediate language, transformations are applied there, and the result targets CPUs, GPUs, ...
Nested Vector Language (NVL)
Functional-like parallel language
  > Captures SQL, machine learning, and graphs, but very easy to analyze
Closed under composition (nested calls) and common transformations (e.g. loop fusion)
  > Unlike relational algebra, OpenCL, NESL
Example Transformations
def query(products: vec[{dept:int, price:int}]):
  sum = 0
  for p in products:
    if p.dept == 20:
      sum += p.price
row-to-column:
def query(dept: vec[int], price: vec[int]):
  sum = 0
  for i in 0..len(dept):
    if dept[i] == 20:
      sum += price[i]
vectorization:
  for i in 0..len(dept) by 4:
    sum += price[i..i+4] * (dept[i..i+4] == [20,20,20,20])
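The three stages above can be written as runnable Python to check that they compute the same result. This is a structural sketch only: plain Python does not use SIMD instructions, so `query_vectorized` mirrors the shape of the vectorized loop (fixed-width chunks, a 0/1 mask instead of a branch) without its speed. The sample data is made up.

```python
from typing import Dict, List, Sequence


def query_rows(products: Sequence[Dict[str, int]]) -> int:
    """Row-oriented program: one record at a time, with a branch."""
    total = 0
    for p in products:
        if p["dept"] == 20:
            total += p["price"]
    return total


def query_columns(dept: Sequence[int], price: Sequence[int]) -> int:
    """After row-to-column: each field lives in its own contiguous array."""
    total = 0
    for i in range(len(dept)):
        if dept[i] == 20:
            total += price[i]
    return total


def query_vectorized(dept: Sequence[int], price: Sequence[int], width: int = 4) -> int:
    """After vectorization: chunks of `width`, comparison used as a 0/1 mask."""
    total = 0
    for start in range(0, len(dept), width):
        d = dept[start:start + width]
        p = price[start:start + width]
        total += sum(pi * (di == 20) for di, pi in zip(d, p))
    return total


products = [
    {"dept": 20, "price": 5},
    {"dept": 10, "price": 7},
    {"dept": 20, "price": 11},
    {"dept": 30, "price": 1},
    {"dept": 20, "price": 2},
]
dept: List[int] = [p["dept"] for p in products]
price: List[int] = [p["price"] for p in products]
print(query_rows(products), query_columns(dept, price), query_vectorized(dept, price))
```

The last chunk is shorter than `width` here, which slicing handles without special cases.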
Results: TPC-H Q6
| System | Runtime (sec) |
|---|---|
| Python | 0.53 |
| Java | 0.14 |
| C | 0.08 |
| HyPer Database | 0.11 |
| NVL | 0.03 |
Effect of Transformations
| Program version | Runtime (sec) |
|---|---|
| Row-Oriented Program | 0.23 |
| After Row-To-Column | 0.08 |
| After Vectorization | 0.03 |
Transformations usable on any NVL program
Library Composition API
Disjoint libraries can take & return “NVL objects” to build up a combined program
Example: optimize across Spark and NumPy
data = sql("select features from users where age>20")
scores = data.map(lambda vec: scoreMatrix * vec)
mean = scores.mean()
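A toy version of this composition style: two unrelated "libraries" each return a lazy object that records what should happen, and only the final call runs everything in one loop. Nothing here is NVL itself; the `Lazy` class and the `where`, `scale` and `mean` functions are hypothetical names used to show the pattern on scalar values.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Op = Tuple[str, Callable]


@dataclass(frozen=True)
class Lazy:
    """Hypothetical stand-in for an "NVL object": source data plus deferred operations."""
    source: Tuple[float, ...]
    ops: Tuple[Op, ...] = ()

    def then(self, kind: str, fn: Callable) -> "Lazy":
        return Lazy(self.source, self.ops + ((kind, fn),))


# "Library A": relational-style filter.
def where(data: Lazy, predicate: Callable[[float], bool]) -> Lazy:
    return data.then("filter", predicate)


# "Library B": numeric-style elementwise map.
def scale(data: Lazy, factor: float) -> Lazy:
    return data.then("map", lambda x: x * factor)


def mean(data: Lazy) -> float:
    """Run all recorded operations in one fused loop, then average the survivors."""
    total, count = 0.0, 0
    for value in data.source:
        keep = True
        for kind, fn in data.ops:
            if kind == "filter":
                if not fn(value):
                    keep = False
                    break
            else:
                value = fn(value)
        if keep:
            total += value
            count += 1
    return total / count if count else float("nan")


data = Lazy((10.0, 25.0, 40.0))
scores = scale(where(data, lambda x: x > 20), 2.0)
print(mean(scores))
```

Neither `where` nor `scale` touches the data; the single loop in `mean` sees the whole program and could, in a real system, be optimized as a unit.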
Conclusion
Large data volumes + changing hardware pose a formidable challenge for next-generation apps
Spark shows a unified API for data apps is useful
NVL targets a new range of optimizations and environments