query optimization 2 - stanford university...2007:facebook starts hive (later apache hive) for sql...
TRANSCRIPT
![Page 2: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/2.jpg)
Outline
What can we optimize?
Rule-based optimization
Data statistics
Cost models
Cost-based plan selection
Spark SQLCS 245 2
![Page 3: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/3.jpg)
Outline
What can we optimize?
Rule-based optimization
Data statistics
Cost models
Cost-based plan selection
Spark SQLCS 245 3
![Page 4: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/4.jpg)
Recall From Last Time
Cost models attempt to predict a cost metric for each operator (e.g. CPU cycles, I/Os, etc)
Most common metric: # of disk I/Os
CS 245 4
![Page 5: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/5.jpg)
Example: Index vs Table Scan
Our query: σp(R) for some predicate p
s = p’s selectivity (fraction tuples passing)
CS 245 5
Table scan:R has B(R) = T(R)⨯S(R)/bblocks on disk
Cost: B(R) I/Os
Index search:Index lookup for p takes L I/Os
We then have to read part of R;Pr[read block i]
≈ 1 – Pr[no match]records in block
= 1 – (1–s)b / S(R)
Cost: L + (1–(1–s)b/S(R)) B(R)
block size
![Page 6: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/6.jpg)
What If Results Were Clustered?
We’d need to change our estimate of Cindex:
Cindex = L + s B(R)
CS 245 6
Unclustered: records that match p are spread out uniformly
Clustered: records that match p are close together in R’s file
Fraction of R’s blocks read
Less than Cindex for unclustered data
![Page 7: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/7.jpg)
Join Operators
Join orders and algorithms are often the choices that affect performance the most
For a multi-way join R ⨝ S ⨝ T ⨝ …, each join is selective and order matters a lot» Try to eliminate lots of records early
Even for one join R ⨝ S, algorithm matters
CS 245 7
![Page 8: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/8.jpg)
ExampleSELECT order.date, product.price, customer.nameFROM order, product, customerWHERE order.product_id = product.product_idAND order.cust_id = customer.cust_idAND product.type = “car”AND customer.country = “US”
CS 245 8
⨝order product
(type=car)
customer(country=US)
⨝
⨝order customer
(country=US)
product(type=car)
⨝Plan 1: Plan 2:
join conditions
selection predicates
When is each plan better?
![Page 9: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/9.jpg)
Common Join Algorithms
Iteration (nested loops) join
Merge join
Join with index
Hash join
CS 245 9
![Page 10: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/10.jpg)
Iteration Joinfor each rÎR1:
for each sÎR2:if r.C == s.C then output (r, s)
I/Os: one scan of R1 and T(R1) scans of R2, so cost = B(R1) + T(R1) B(R2) reads
Improvement: read M blocks of R1 in RAM at a time then read R2: B(R1) + B(R1) B(R2) / M
CS 245 10
Note: cost of writes is always B(R1 ⨝ R2)
![Page 11: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/11.jpg)
Merge Join
if R1 and R2 not sorted by C then sort themi, j = 1while i £ T(R1) && j £ T(R2):
if R1[i].C = R2[j].C then outputTupleselse if R1[i].C > R2[j].C then j += 1else if R1[i].C < R2[j].C then i += 1
CS 245 11
![Page 12: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/12.jpg)
Merge Join
procedure outputTuples:while R1[i].C == R2[j].C && i £ T(R1):
jj = jwhile R1[i].C == R2[jj].C && jj £ T(R2):
output (R1[i], R2[jj]) jj += 1
i += i+1
CS 245 12
![Page 13: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/13.jpg)
i R1[i].C R2[j].C j1 10 5 12 20 20 23 20 20 34 30 30 45 40 30 5
50 652 7
Example
CS 245 13
![Page 14: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/14.jpg)
Cost of Merge Join
If R1 and R2 already sorted by C, then
cost = B(R1) + B(R2) reads
CS 245 14
(+ write cost of B(R1 ⨝ R2))
![Page 15: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/15.jpg)
Cost of Merge Join
If Ri is not sorted, can sort it in 4 B(Ri) I/Os:» Read runs of tuples into memory, sort» Write each sorted run to disk» Read from all sorted runs to merge» Write out results
CS 245 15
![Page 16: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/16.jpg)
Join with Indexfor each rÎR1:
list = index_lookup(R2, C, r.C)for each sÎlist:
output (r, s)
Read I/Os: 1 scan of R1, T(R1) index lookupson R2, and T(R1) data lookups
cost = B(R1) + T(R1) (Lindex + Ldata)
CS 245 16
Can be less when R1 is sorted/clustered by C!
![Page 17: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/17.jpg)
Hash Join (R2 Fits in RAM)hash = load R2 into RAM and hash by Cfor each rÎR1:
list = hash_lookup(hash, r.C)for each sÎlist:
output (r, s)
Read I/Os: B(R1) + B(R2)
CS 245 17
![Page 18: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/18.jpg)
Hash Join on Disk
Can be done by hashing both tables to a common set of buckets on disk» Similar to merge sort: 4 (B(R1) + B(R2))
Trick: hash only (key, pointer to record) pairs» Can then sort the pointers to records that
match and fetch them near-sequentially
CS 245 18
![Page 19: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/19.jpg)
Other Concerns
Join selectivity may affect how many records we need to fetch from each relation» If very selective, may prefer methods that join
pointers or do index lookups
CS 245 19
![Page 20: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/20.jpg)
Summary
Join algorithms can have different performance in different situations
In general, the following are used:» Index join if an index exists» Merge join if at least one table is sorted» Hash join if both tables unsorted
CS 245 20
![Page 21: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/21.jpg)
Outline
What can we optimize?
Rule-based optimization
Data statistics
Cost models
Cost-based plan selection
Spark SQLCS 245 21
![Page 22: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/22.jpg)
Complete CBO Process
Generate and compare possible query plans
Query
Generate Plans
Prune x x
Estimate Cost Costs
SelectCS 245 22
Pick Min
![Page 23: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/23.jpg)
How to Generate Plans?
Simplest way: recursive search of the options for each planning choice
CS 245 23
Access pathsfor table 1
Access pathsfor table 2
Algorithmsfor join 1
Algorithmsfor join 2⨯ ⨯ ⨯ ⨯ …
![Page 24: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/24.jpg)
How to Generate Plans?
Can limit search space: e.g. many DBMSesonly consider “left-deep” joins
CS 245 24
Often interacts well with conventions for specifying join inputs in asymmetric join algorithms (e.g. assume right argument has index)
![Page 25: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/25.jpg)
How to Generate Plans?
Can prioritize searching through the most impactful decisions first» E.g. join order is one of the most impactful
CS 245 25
![Page 26: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/26.jpg)
How to Prune Plans?
While computing the cost of a plan, throw it away if it is worse than best so far
Start with a greedy algorithm to find an “OK” initial plan that will allow lots of pruning
CS 245 26
![Page 27: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/27.jpg)
Memoization and Dynamic ProgrammingDuring a search through plans, many subplans will appear repeatedly
Remember cost estimates and statistics (T(R), V(R, A), etc) for those: “memoization”
Can pick an order of subproblems to make it easy to reuse results (dynamic programming)
CS 245 27
![Page 28: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/28.jpg)
Resource Cost of CBO
It’s possible for cost-based optimization itself to take longer than running the query!
Need to design optimizer to not take too long» That’s why we have shortcuts in stats, etc
Luckily, a few “big” decisions drive most of the query execution time (e.g. join order)
CS 245 28
![Page 29: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/29.jpg)
Outline
What can we optimize?
Rule-based optimization
Data statistics
Cost models
Cost-based plan selection
Spark SQLCS 245 29
![Page 30: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/30.jpg)
Background
2004: MapReduce published, enables writing large scale data apps on commodity clusters» Cheap but unreliable “consumer” machines,
so system emphasizes fault tolerance» Focus on C++/Java programmers
CS 245 30
![Page 31: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/31.jpg)
Background
2006: Apache Hadoop project formed as an open source MapReduce + distributed FS» Started in Nutch open source search engine» Soon adopted by Yahoo & Facebook
2006: Amazon EC2 service launched as the newest attempt at “utility computing”CS 245 31
![Page 32: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/32.jpg)
Background
2007: Facebook starts Hive (later Apache Hive) for SQL on Hadoop» Other SQL-on-MapReduces existed too» First steps toward “data lake” architecture
CS 245 32
![Page 33: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/33.jpg)
Background
2006-2012: Many other cluster programming frameworks proposed to bring MR’s benefits to other apps
CS 245 33
Pregel Dremel DryadImpala
![Page 34: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/34.jpg)
Background
2010: Spark engine released, built around MapReduce + in-memory computing» Motivation: interactive queries + iterative
algorithms such as graph analytics
Spark then moves to be a general (“unified”) engine, covering existing ones
CS 245 34
![Page 35: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/35.jpg)
Code Size Comparison (2013)
0
20000
40000
60000
80000
100000
120000
140000
HadoopMapReduce
Impala(SQL)
Storm(Streaming)
Giraph(Graph)
Spark
non-test, non-example source lines
Shark
GraphXStreaming
![Page 36: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/36.jpg)
Background
2012: Shark starts as a port of Hive on Spark
2014: Spark SQL starts as a SQL engine built directly on Spark (but interoperable w/ Hive)» Also adds two new features: DataFrames for
integrating relational ops in complex programs and extensible optimizer
CS 245 36
![Page 37: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/37.jpg)
Original Spark API
Resilient Distributed Datasets (RDDs)» Immutable collections of objects that can be
stored in memory or disk across a cluster» Built with parallel transformations (map, filter, …)» Automatically rebuilt on failure
![Page 38: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/38.jpg)
Load error messages from a log into memory, then interactively search for various patterns
lines = spark.textFile(“hdfs://...”)errors = lines.filter(s => s.startswith(“ERROR”))messages = errors.map(s => s.split(‘\t’)(2))messages.cache()
Block 1
Block 2
Block 3
Worker
Worker
Worker
Driver
messages.filter(s => s.contains(“foo”)).count()messages.filter(s => s.contains(“bar”)).count()
. . .
tasks
resultsCache 1
Cache 2
Cache 3
Base RDDTransformed RDD
Action
Result: full-text search of Wikipedia in 1 sec(vs 40 s for on-disk data)
Example: Log Mining
![Page 39: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/39.jpg)
Challenges with Spark’s Functional APILooks high-level, but hides many semantics of computation from engine» Functions passed in are arbitrary blocks of
code» Data stored is arbitrary Java/Python objects
Users can mix APIs in suboptimal ways
![Page 40: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/40.jpg)
Example Problempairs = data.map(word => (word, 1))
groups = pairs.groupByKey()
groups.map((k, vs) => (k, vs.sum))
Materializes all groupsas lists of integers
Then promptlyaggregates them
![Page 41: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/41.jpg)
Challenge: Data RepresentationJava objects often many times larger than data
class User(name: String, friends: Array[Int])User(“Bobby”, Array(1, 2))
User 0x… 0x…
String
3
0
1 2
Bobby
5 0x…
int[]
char[] 5
![Page 42: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/42.jpg)
Spark SQL & DataFramesEfficient library for working with structured data» 2 interfaces: SQL for data analysts and external
apps, DataFrames for complex programs» Optimized computation and storage underneath
![Page 43: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/43.jpg)
Spark SQL Architecture
Logical Plan
Physical Plan
Catalog
OptimizerRDDs
…
DataSource
API
SQL DataFrames
Code
Generator
![Page 44: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/44.jpg)
DataFrame APIDataFrames hold rows with a known schemaand offer relational operations through a DSL
c = HiveContext()users = c.sql(“select * from users”)
ma_users = users[users.state == “MA”]
ma_users.count()
ma_users.groupBy(“name”).avg(“age”)
ma_users.map(lambda row: row.user.toUpper())
Expression AST
![Page 45: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/45.jpg)
API Details
Based on data frame concept in R, Python» Spark is the first to make this declarative
Integrated with the rest of Spark» ML library takes DataFrames as input/output» Easily convert RDDs ↔ DataFrames
Google trends for “data frame”
![Page 46: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/46.jpg)
What DataFrames Enable1. Compact binary representation
• Columnar, compressed cache; rows for processing
2. Optimization across operators (join reordering, predicate pushdown, etc)
3. Runtime code generation
![Page 47: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/47.jpg)
Performance
0 2 4 6 8 10
RDD Scala
RDD Python
DataFrame Scala
DataFrame Python
DataFrame R
DataFrame SQL
Time for aggregation benchmark (s)
![Page 48: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/48.jpg)
Performance
0 2 4 6 8 10
RDD Scala
RDD Python
DataFrame Scala
DataFrame Python
DataFrame R
DataFrame SQL
Time for aggregation benchmark (s)
![Page 49: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/49.jpg)
Data SourcesUniform way to access structured data» Apps can migrate across Hive, Cassandra,
JSON, Parquet, …» Rich semantics allows query pushdown into
data sources
SparkSQL
users[users.age > 20]
select * from users
![Page 50: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/50.jpg)
ExamplesJSON:
JDBC:
Together:
select user.id, text from tweets
{“text”: “hi”,“user”: {“name”: “bob”,“id”: 15 }
}
tweets.jsonselect age from users where lang = “en”
select t.text, u.agefrom tweets t, users uwhere t.user.id = u.idand u.lang = “en” Spark
SQL{JSON}
select id, age fromusers where lang=“en”
![Page 51: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/51.jpg)
Extensible Optimizer
Uses Scala pattern matching (see demo!)
CS 245 51
![Page 52: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/52.jpg)
Which Spark ComponentsDo People Use?
58%
58%
62%
69%
MLlib + GraphX
Spark Streaming
DataFrames
Spark SQL
75% of users use 2 or more components(2015 survey)
CS 245 52
![Page 53: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/53.jpg)
Which Languages Are Used?
84%
38% 38%
71%
31%
58%
18%
2014 Languages Used 2015 Languages Used
CS 245 53
![Page 54: Query Optimization 2 - Stanford University...2007:Facebook starts Hive (later Apache Hive) for SQL on Hadoop »Other SQL-on-MapReducesexisted too »Firststepstoward“data lake”](https://reader033.vdocuments.site/reader033/viewer/2022042302/5ecd8b2782411d72781dce6e/html5/thumbnails/54.jpg)
Extensions to Spark SQL
Tens of data sources using the pushdown API
Interval queries on genomic data
Geospatial package (Magellan)
Approximate queries & other research
CS 245 54