indexes part 2 and query execution - stanford...

84
Indexes Part 2 and Query Execution Instructor: Matei Zaharia cs245.stanford.edu

Upload: others

Post on 14-Oct-2020

31 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Indexes Part 2 and Query Execution

Instructor: Matei Zahariacs245.stanford.edu

Page 2: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

From Last Time: Indexes

Conventional indexes

B-trees

Hash indexes

Multi-key indexing

CS 245 2

Page 3: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Hash Indexes

key h(key)

record / ptr

...

Buckets(block sized)

Buckets can contain records or pointers to file

overflowbucket

CS 245 3

Chaining is used to handle bucket overflow

Page 4: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Hash vs Tree Indexes

+ O(1) instead of O(log N) disk accesses

– Can’t efficiently do range queries

CS 245 4

Page 5: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Challenge: Resizing

Hash tables try to keep occupancy in a fixed range (50-80%) and slow down beyond that» Too much chaining

How to resize the table when this happens?» In memory: just move everything, amortized

cost is pretty low» On disk: moving everything is expensive!

CS 245 5

Page 6: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Extendible Hashing

Tree-like design for hash tables that allows cheap resizing while requiring 2 IOs / access

CS 245 6

Page 7: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Extendible Hashing: 2 Ideas

(a) Use i of b bits output by hash functionb

h(K) ®

i

i will grow over time; the first i bits of each key’s hash are used to map it to a bucket

00110101

CS 245 7

Page 8: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

(b) Use a directory with pointers to buckets

h(K)[0..i] to bucket...

...

Extendible Hashing: 2 Ideas

CS 245 8

Page 9: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Example: 4-bit h(K), 2 keys/bucket

i = 11

1

0001

10011100

CS 245 9

local depthglobal depth

0010

Insert 0010

Page 10: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Example: 4-bit h(K), 2 keys/bucket

i = 11

1

0001

10011100

Insert 1010

CS 245 10

local depthglobal depth

Page 11: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

i = 11

1

0001

10011100

Insert 101011100

1010

Example: 4-bit h(K), 2 keys/bucket

CS 245 11

Page 12: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

i = 11

1

0001

10011100

Insert 101011100

1010

New directory

200

01

10

11

i =

2

2

Example: 4-bit h(K), 2 keys/bucket

CS 245 12

Page 13: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

10001

21001101021100

Insert:

0111

0000

00

01

10

11

2i =

Example

CS 245 13

Page 14: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

10001

21001101021100

Insert:

0111

0000

00

01

10

11

2i =

0111

0000

0111

0001

Example

CS 245 14

Page 15: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

10001

21001101021100

Insert:

0111

0000

00

01

10

11

2i =

0111

0000

0111

0001

2

2Example

CS 245 15

Page 16: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

00

01

10

11

2i =

210011010

21100

20111

200000001

Example

Note: still need chaining if values of h(K) repeat and

fill a bucket

CS 245 16

Page 17: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Some Types of Indexes

Conventional indexes

B-trees

Hash indexes

Multi-key indexing

CS 245 17

Page 18: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Motivation

Example: find records where

DEPT = “Toy” AND SALARY > 50k

CS 245 18

Page 19: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Strategy I:

Use one index, say Dept.

Get all Dept = “Toy” recordsand check their salary

I1

CS 245 19

Page 20: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Strategy II:

Use 2 indexes; intersect lists of pointers

Toy Sal> 50k

CS 245 20

Page 21: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Strategy III:

Multi-key index

One design:

I1

I2

I3

CS 245 21

Page 22: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

h

nb

i a

co

de

g

f

m

l

kj

k-d Trees

CS 245 23

Split dimensions in any order to hold k-dimensional data

Page 23: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

h

nb

i a

co

d

10 20

10 20

e

g

f

m

l

kj

CS 245 24

k-d Trees

Page 24: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

h

nb

i a

co

d

10 20

10 20

e

g

f

m

l

kj25 15 35 20

40

30

20

10

CS 245 25

k-d Trees

Page 25: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

h

nb

i a

co

d

10 20

10 20

e

g

f

m

l

kj25 15 35 20

40

30

20

10

5

15 15

CS 245 26

k-d Trees

Page 26: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

h

nb

i a

co

d

10 20

10 20

e

g

f

m

l

kj25 15 35 20

40

30

20

10

5

15 15

h i a bcd efg

n omlj k

Efficient range queries in both

dimensionsCS 245 27

k-d Trees

Page 27: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Storage System Examples

MySQL: transactional DBMS» Row-oriented storage with 16 KB pages» Variable length records with headers, overflow» Index types:• B-tree•Hash (in memory only)• R-tree (spatial data)• Inverted lists for full text search

» Can compress pages with Lempel-Ziv

CS 245 29

Page 28: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Apache Parquet + Hive: analytical data lake» Column-oriented storage as set of ~1 GB files

(each file has a slice of all columns)» Various compression and encoding schemes

at the level of pages in a file• Special scheme for nested fields (Dremel)

» Header with statistics at the start of each file•Min/max of columns, nulls, Bloom filter

» Files partitioned into directories by one key

CS 245 30

Storage System Examples

Page 29: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Query Execution

Overview

Relational operators

Execution methods

CS 245 31

Page 30: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Query Execution Overview

Recall that one of our key principles in data intensive systems was declarative APIs» Specify what you want to compute, not how

We saw how these can translate into many storage strategies

How to execute queries in a declarative API?

CS 245 32

Page 31: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Query Execution Overview

Query representation (e.g. SQL)

Logical query plan(e.g. relational algebra)

Optimized logical plan

Physical plan(code/operators to run)

Many execution methods: per-record exec, vectorization,

compilation

CS 245 33

Page 32: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Plan Optimization Methods

Rule-based: systematically replace some expressions with other expressions» Replace X OR TRUE with TRUE» Replace M*A + M*B with M*(A+B) for matrices

Cost-based: propose several execution plans and pick best based on a cost model

Adaptive: update execution plan at runtime

CS 245 34

Page 33: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Execution Methods

Interpretation: walk through query plan operators for each record

Vectorization: walk through in batches

Compilation: generate code (like System R)

CS 245 35

Page 34: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

parse

convert

apply rules

estimate result sizes

consider physical plans estimate costs

pick best

execute

{P1, P2, …}

{(P1,C1), (P2,C2), ...}

Pi

result

SQL query

parse tree

logical query plan

“improved” l.q.p

l.q.p. +sizes

statistics

Typical RDBMS Execution

CS 245 36

Page 35: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Query Execution

Overview

Relational operators

Execution methods

CS 245 37

Page 36: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

The Relational Algebra

Collection of operators over tables (relations)» Each table has named attributes (fields)

Codd’s original RA: tables are sets of tuples(unordered and tuples cannot repeat)

SQL’s RA: tables are bags (multisets) of tuples; unordered but each tuple may repeat

CS 245 38

Page 37: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Relational Algebra Operators

Basic set operators:

Intersection: R ∩ S

Union: R ∪ S

Difference: R – S

Cartesian Product: R ⨯ S

for tables with same schema

CS 245 39

{ (r, s) | r ∈ R, s ∈ S }

Page 38: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Relational Algebra Operators

Basic set operators:

Intersection: R ∩ S

Union: R ∪ S

Difference: R – S

Cartesian Product: R ⨯ S

CS 245 40

consider both distinct (set union)and non-distinct (bag union)

Page 39: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Relational Algebra Operators

Special query processing operators:

Selection: σcondition(R)

Projection: Pexpressions(R)

Natural Join: R ⨝ S

{ r ∈ R | condition(r) is true }

{ expressions(r) | r ∈ R }

{ (r, s) ∈ R ⨯ S) | r.key = s.key }where key is the common fields

CS 245 41

Page 40: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Relational Algebra Operators

Special query processing operators:

Aggregation: keysGagg(attr)(R)

Examples: departmentGMax(salary)(Employees)

GMax(salary)(Employees)

SELECT agg(attr)FROM RGROUP BY keys

CS 245 42

Page 41: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Algebraic Properties

Many properties about which combinations of operators are equivalent» That’s why it’s called an algebra!

CS 245 43

Page 42: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Properties: Unions, Products and JoinsR ∪ S = S ∪ RR ∪ (S ∪ T) = (R ∪ S) ∪ T

R ⨯ S = S ⨯ R(R ⨯ S) ⨯ T = R ⨯ (S ⨯ T)

R ⨝ S = S ⨝ R(R ⨝ S) ⨝ T = R ⨝ (S ⨝ T) CS 245 44

Attribute order in a relation doesn’t matter either

Tuple order in a relation doesn’t matter (unordered)

Page 43: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Properties: Selects

σp∧q(R) =

σp∨q(R) =

CS 245 45

Page 44: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Properties: Selects

σp∧q(R) = σp(σq(R))

σp∨q(R) = σp(R) ∪ σq(R)

CS 245 46

careful with repeated elements

Page 45: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Bags vs. Sets

R = {a,a,b,b,b,c}

S = {b,b,c,c,d}

R ∪ S = ?

CS 245 47

Page 46: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Bags vs. Sets

R = {a,a,b,b,b,c}

S = {b,b,c,c,d}

R ∪ S = ?

CS 245 48

• Option 1: SUM of countsR ∪ S = {a,a,b,b,b,b,b,c,c,c,d}

• Option 2: MAX of countsR ∪ S = {a,a,b,b,b,c,c,d}

Page 47: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Executive Decision

Use “SUM” option for bag unions

Some rules that work for set unions cannot be used for bags

CS 245 49

Page 48: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Let: X = set of attributesY = set of attributes

PX∪Y (R) =

Properties: Project

CS 245 50

Page 49: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Let: X = set of attributesY = set of attributes

PX∪Y (R) = PX(PY(R))

Properties: Project

CS 245 51

Page 50: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Let: X = set of attributesY = set of attributes

PX∪Y (R) = PX(PY(R))

Properties: Project

CS 245 52

Page 51: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Properties: σ + ⨝

Let p = predicate with only R attribs

q = predicate with only S attribs

m = predicate with only R, S attribs

σp(R ⨝ S) =

σq(R ⨝ S) = CS 245 53

Page 52: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Properties: σ + ⨝

Let p = predicate with only R attribs

q = predicate with only S attribs

m = predicate with only R, S attribs

σp(R ⨝ S) = σp(R) ⨝ S

σq(R ⨝ S) = R ⨝ σq(S)CS 245 54

Page 53: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Properties: σ + ⨝

Some rules can be derived:

σp∧q(R ⨝ S) =

σp∧q∧m(R ⨝ S) =

σp∨q(R ⨝ S) =

CS 245 55

Page 54: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Properties: σ + ⨝

Some rules can be derived:

σp∧q(R ⨝ S) = σp(R) ⨝ σq(S)

σp∧q∧m(R ⨝ S) = σm(σp(R) ⨝ σq(S))

σp∨q(R ⨝ S) = (σp(R) ⨝ S) ∪ (R ⨝ σq(S))

CS 245 56

Page 55: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Prove One, Others for Practice

σp∧q(R ⨝ S) = σp (σq(R ⨝ S))

= σp (R ⨝ σq(S))

= σp (R) ⨝ σq(S)

CS 245 57

Page 56: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Properties: P + σ

Let x = subset of R attributes

z = attributes in predicate p(subset of R attributes)

Px(σp (R)) =

CS 245 58

Page 57: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Properties: P + σ

Let x = subset of R attributes

z = attributes in predicate p(subset of R attributes)

Px(σp (R)) = σp(Px(R))

CS 245 59

Page 58: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Properties: P + σ

Let x = subset of R attributes

z = attributes in predicate p(subset of R attributes)

Px(σp (R)) = Px(σp(Px∪z(R)))

CS 245 60

Page 59: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Let x = subset of R attributesy = subset of S attributesz = intersection of R,S attributes

Px∪y(R ⨝ S) = Px∪y ((Px∪z (R)) ⨝ (Py∪z (S)))

CS 245 61

Properties: P + ⨝

Page 60: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

parse

convert

apply rules

estimate result sizes

consider physical plans estimate costs

pick best

execute

{P1, P2, …}

{(P1,C1), (P2,C2), ...}

Pi

result

SQL query

parse tree

logical query plan

“improved” l.q.p

l.q.p. +sizes

statistics

Typical RDBMS Execution

CS 245 62

Page 61: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Example SQL Query

SELECT titleFROM StarsInWHERE starName IN (

SELECT nameFROM MovieStarWHERE birthdate LIKE ‘%1960’

);

(Find the movies with stars born in 1960)

CS 245 63

Page 62: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Parse Tree <Query>

<SFW>

SELECT <SelList> FROM <FromList> WHERE <Condition>

<Attribute> <RelName> <Tuple> IN <Query>

title StarsIn <Attribute> ( <Query> )

starName <SFW>

SELECT <SelList> FROM <FromList> WHERE <Condition>

<Attribute> <RelName> <Attribute> LIKE <Pattern>

name MovieStar birthDate ‘%1960’

CS 245 64

Page 63: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Ptitle

sstarName=name

StarsIn Pname

sbirthdate LIKE ‘%1960’

MovieStar

´

Logical Query Plan

CS 245 65

Page 64: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Improved Logical Query Plan

Ptitle

starName=name

StarsIn Pname

sbirthdate LIKE ‘%1960’

MovieStar

Question:Push Ptitleto StarsIn?

CS 245 66

Page 65: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Need expected size

StarsIn

MovieStar

P

s

Estimate Result Sizes

CS 245 67

Page 66: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Parameters: join order,memory size, project attributes, ...Hash join

Seq scan Index scan Parameters:select condition, ...

StarsIn MovieStar

One Physical Plan

H

CS 245 68

Page 67: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Parameters: join order,memory size, project attributes, ...Hash join

Index scan Seq scan Parameters:select condition, ...

StarsIn MovieStar

Another Physical Plan

H

CS 245 69

Page 68: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Sort-merge join

Seq scan Seq scan

StarsIn MovieStar

Another Physical Plan

CS 245 70

Which plan is likely to be better?

Page 69: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Logical plan

P1 P2 … Pn

C1 C2 … Cn

Pick best!

Estimating Plan Costs

Physical plancandidates

CS 245 71

Covered in next few lectures!

Page 70: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Query Execution

Overview

Relational operators

Execution methods

CS 245 72

Page 71: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Now That We Have a Plan, How Do We Run it?

Several different options that trade between complexity, setup time & performance

CS 245 73

Page 72: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Example: Simple Query

SELECT quantity * priceFROM ordersWHERE productId = 75

Pquanity*price (σproductId=75 (orders))

CS 245 74

Page 73: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Method 1: Interpretationinterface Operator {Tuple next();

}

class TableScan: Operator {String tableName;

}

class Select: Operator {Operator parent;Expression condition;

}

class Project: Operator {Operator parent;Expression[] exprs;

}

CS 245 75

interface Expression {Value compute(Tuple in);

}

class Attribute: Expression {String name;

}

class Times: Expression {Expression left, right;

}

class Equals: Expression {Expression left, right;

}

Page 74: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Example Expression Classes

CS 245 76

class Attribute: Expression {String name;

Value compute(Tuple in) {return in.getField(name);

}}

class Times: Expression {Expression left, right;

Value compute(Tuple in) {return left.compute(in) * right.compute(in);

}}

probably better to use anumeric field ID instead

Page 75: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Example Operator Classes

CS 245 77

class TableScan: Operator {String tableName;

Tuple next() {// read & return next record from file

}}

class Project: Operator {Operator parent;Expression[] exprs;

Tuple next() {tuple = parent.next();fields = [expr.compute(tuple) for expr in exprs];return new Tuple(fields);

}}

Page 76: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Running Our Query with Interpretation

CS 245 78

ops = Project(expr = Times(Attr(“quantity”), Attr(“price”)),parent = Select(expr = Equals(Attr(“productId”), Literal(75)),parent = TableScan(“orders”)

));

while(true) {Tuple t = ops.next();if (t != null) {out.write(t);

} else {break;

}}

Pros & cons of this approach?

recursively calls Operator.next()and Expression.compute()

Page 77: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Method 2: Vectorization

Interpreting query plans one record at a time is simple, but it’s too slow» Lots of virtual function calls and branches for

each record (recall Jeff Dean’s numbers)

Keep recursive interpretation, but make Operators and Expressions run on batches

CS 245 79

Page 78: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Implementing Vectorization

CS 245 80

class TupleBatch {// Efficient storage, e.g.// schema + column arrays

}

interface Operator {TupleBatch next();

}

class Select: Operator {Operator parent;Expression condition;

}

...

class ValueBatch {// Efficient storage

}

interface Expression {ValueBatch compute(TupleBatch in);

}

class Times: Expression {Expression left, right;

}

...

Page 79: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Typical Implementation

Values stored in columnar arrays (e.g. int[]) with a separate bit array to mark nulls

Tuple batches fit in L1 or L2 cache

Operators use SIMD instructions to update both values and null fields without branching

CS 245 81

Page 80: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Pros & Cons of Vectorization

+ Faster than record-at-a-time if the queryprocesses many records

+ Relatively simple to implement

– Lots of nulls in batches if query is selective

– Data travels between CPU & cache a lot

CS 245 82

Page 81: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Method 3: Compilation

Turn the query into executable code

CS 245 83

Page 82: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Compilation Example

Pquanity*price (σproductId=75 (orders))

class MyQuery {void run() {

Iterator<OrdersTuple> in = openTable(“orders”);for(OrdersTuple t: in) {

if (t.productId == 75) {out.write(Tuple(t.quantity * t.price));

}}

}}

CS 245 84

generated class with the rightfield types for orders table

Can also theoretically generate vectorized code

Page 83: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

Pros & Cons of Compilation

+ Potential to get fastest possible execution

+ Leverage existing work in compilers

– Complex to implement

– Compilation takes time

– Generated code may not match hand-written

CS 245 85

Page 84: Indexes Part 2 and Query Execution - Stanford Universityweb.stanford.edu/class/cs245/slides/06-Query-Execution.pdf · Indexes Part 2 and Query Execution Instructor: Matei Zaharia

What’s Used Today?

Depends on context & other bottlenecks

Transactional databases (e.g. MySQL):mostly record-at-a-time interpretation

Analytical systems (Vertica, Spark SQL):vectorization, sometimes compilation

ML libs (TensorFlow): mostly vectorization (the records are vectors!), some compilation

CS 245 86