Data Modeling and Databases - Systems Group


Upload: phunglien

Post on 05-Jun-2018


Data Modeling and Databases

Exercise Sheet 10:

Query Processing


Question 1

• Employees table: 64k tuples x 256B each

• Storage format: 16kB pages

• Number of pages in Employees: ________

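• Memory is unlimited and the B-tree resides in memory; initially, no table pages are in memory.

• E[pages] = m ∗ (1 − (1 − 1/m)^k), where m is the number of pages in the table and k is the number of tuples retrieved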

• Three storage devices:

                   HDD         SSD          RDMA
Random access      20 ms       0.16 ms      10 µs
Read bandwidth     128 MB/s    1024 MB/s    4096 MB/s


• 1: When is an index lookup better than a full table scan?


• 2: What changes if we compress the data by 5x?
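The quantities in Question 1 can be explored with a short sketch. The cost model here is an assumption, not the sheet's official solution: a scan pays one random access and then reads all pages sequentially, while an index lookup pays one random access plus one page transfer for each distinct page touched. It also assumes 64k = 64 * 1024 tuples and 1 MB = 2^20 bytes. The function names are hypothetical.

```python
import math

PAGE_BYTES = 16 * 1024      # 16 kB pages
TUPLES = 64 * 1024          # 64k tuples (assuming k = 1024)
TUPLE_BYTES = 256

# device -> (random access latency in seconds, read bandwidth in bytes/second)
DEVICES = {
    "HDD": (20e-3, 128 * 2**20),
    "SSD": (0.16e-3, 1024 * 2**20),
    "RDMA": (10e-6, 4096 * 2**20),
}


def num_pages(tuples=TUPLES, tuple_bytes=TUPLE_BYTES, page_bytes=PAGE_BYTES):
    """Number of pages needed to store the table."""
    return math.ceil(tuples * tuple_bytes / page_bytes)


def expected_pages(m, k):
    """E[pages] = m * (1 - (1 - 1/m)^k), the formula given on the slide."""
    return m * (1 - (1 - 1 / m) ** k)


def scan_time(m, latency, bandwidth):
    # Assumed model: one random access, then a sequential read of all m pages.
    return latency + m * PAGE_BYTES / bandwidth


def index_time(m, k, latency, bandwidth):
    # Assumed model: each distinct page costs a random access plus its transfer.
    return expected_pages(m, k) * (latency + PAGE_BYTES / bandwidth)


def break_even_k(m, latency, bandwidth):
    """Largest k for which the index is still cheaper than the scan (0 if none)."""
    scan = scan_time(m, latency, bandwidth)
    k = 0
    while k < TUPLES and index_time(m, k + 1, latency, bandwidth) < scan:
        k += 1
    return k


if __name__ == "__main__":
    m = num_pages()
    for name, (latency, bandwidth) in DEVICES.items():
        print(name, "pages:", m, "break-even k:", break_even_k(m, latency, bandwidth))
```

Under the same assumed model, the effect of compression could be explored by passing a smaller page count (for example `math.ceil(num_pages() / 5)`) to `break_even_k`; this ignores any decompression cost.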




Question 3

Consider two relations we want to join: R and S, with cardinalities |R| = 100 and |S| = 100000. For both relations, a database page holds 10 tuples, so p(R) = 10 pages and p(S) = 10000 pages. Mark below the statements that are true:

A Nested Loop Join will require the same number of comparisons as the Sort Merge Join to perform the join, namely 10 million.
Answer: False. NLJ needs |R| * |S| = 10 million comparisons. SMJ needs |R| * log|R| + |S| * log|S| + |R| + |S|, which is about 600,300 (< 800000) with base-10 logarithms and about 1.76 million with base-2 logarithms; either way far fewer than 10 million.

A Sort Merge Join will always have a higher I/O cost than a Grace Hash Join, which in this case will access 10010 pages during its execution.
Answer: SMJ I/O = 5(p(R) + p(S)) = 50050 pages; GHJ I/O = 3(p(R) + p(S)) = 30030 pages, which is more than the 10010 pages claimed in the statement.

The Grace Hash Join algorithm will always perform fewer comparisons between the tuples than the Sort Merge Join.
Answer: GHJ comparisons = max(|R|, |S|); SMJ comparisons = |R| * log|R| + |S| * log|S| + |R| + |S|.

The expected number of comparisons performed by the Sort Merge Join is linear in the size of the relations.
Answer: False. The sorting phase contributes the |R| * log|R| and |S| * log|S| terms.

In this particular example a Grace Hash Join would read and write more than 30000 pages in total to execute the join of R and S.
Answer: True. GHJ I/O = 3(p(R) + p(S)) = 30030 > 30000.
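The numbers used in the answers to Question 3 can be reproduced with a few lines of arithmetic. This is only a sketch: the cost formulas (3x and 5x the page count, and the n log n comparison estimate) are taken from the slide, the logarithm base is left as a parameter because the slide does not state it, and the helper names are hypothetical.

```python
import math

R_TUPLES = 100
S_TUPLES = 100_000
TUPLES_PER_PAGE = 10


def pages(tuples, per_page=TUPLES_PER_PAGE):
    """Number of pages needed to hold the given number of tuples."""
    return math.ceil(tuples / per_page)


def smj_comparisons(r, s, log=math.log2):
    """Comparison estimate from the slide: sort both inputs, then merge."""
    return r * log(r) + s * log(s) + r + s


p_r, p_s = pages(R_TUPLES), pages(S_TUPLES)

nlj_comparisons = R_TUPLES * S_TUPLES   # every pair of tuples is compared
ghj_io = 3 * (p_r + p_s)                # I/O formula used on the slide
smj_io = 5 * (p_r + p_s)                # I/O formula used on the slide

if __name__ == "__main__":
    print("p(R), p(S):", p_r, p_s)
    print("NLJ comparisons:", nlj_comparisons)
    print("SMJ comparisons (log2):", round(smj_comparisons(R_TUPLES, S_TUPLES)))
    print("SMJ comparisons (log10):",
          round(smj_comparisons(R_TUPLES, S_TUPLES, math.log10)))
    print("GHJ I/O:", ghj_io, "SMJ I/O:", smj_io)
```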



Question 4

• Assume we have the following relational schema:

Customer(Cid, Name)

Order(Oid, Customer, Volume)

• There are 1000 Customer tuples and 100000 Order tuples. One page holds 500 tuples. Assume we have the following query that asks for the total volume of the orders of a customer called 'Alice':

SELECT SUM(o.Volume)
FROM Customer c, Order o
WHERE c.Cid = o.Customer AND c.Name = 'Alice';


1. Translate this query to relational algebra (you may use a "SUM" operator):

2. Assume that a) the selectivity of the selection on the customer's name is 0.2%, i.e., there are 2 customers named 'Alice', and b) our memory capacity is limited to three pages. Which join operator implementation is best for this query?
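For part 1, one possible translation is sketched below. The notation is an assumption, since the sheet only says that a SUM operator may be used: γ denotes aggregation, σ selection, and ⋈ the join on the stated condition.

```latex
\gamma_{\mathrm{SUM}(\mathit{Volume})}\Bigl(
  \sigma_{\mathit{Name} = \text{'Alice'}}(\mathit{Customer})
  \;\bowtie_{\mathit{Customer.Cid} \,=\, \mathit{Order.Customer}}\;
  \mathit{Order}
\Bigr)
```

Applying the selection before the join keeps only the matching customers on the Customer side of the join.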



Question 4.2

SELECT SUM(o.Volume)
FROM Customer c, Order o
WHERE c.Cid = o.Customer AND c.Name = 'Alice';

The predicate c.Name = 'Alice' matches 2 tuples.

NLJ: If we keep the qualifying customers in memory (1 page), we can stream the Order relation and do the comparisons on the fly. This is possible with two pages of memory.

GHJ: With the GHJ, we first partition the Customer and the Order relations, which requires 2 + 100,000 hashing operations. Ignoring empty partitions, we have to do two hash probes in the comparison phase. In the worst case, the two customers that match the query are on two different pages. Thus, we have at most 3(2 + p(Order)) I/O operations, where p(Order) is the number of pages of the Order relation. GHJ is therefore more costly than NLJ in this specific case.

SMJ: Efficient sorting of a relation should be done with all tuples in memory to avoid disk swapping. This is possible for the Customer relation but not for the Order relation, which incurs a significant overhead during the sorting phase. The SMJ is the most costly algorithm for this input.
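The comparison between NLJ and GHJ above can be made concrete with page counts. The sketch below is an illustration under stated assumptions: the NLJ cost (read Customer once to find the matching tuples, then read Order once) is an assumed model that the slide does not spell out, while the GHJ bound 3(2 + p(Order)) is the one given in the text. Variable names are hypothetical.

```python
import math

CUSTOMER_TUPLES = 1_000
ORDER_TUPLES = 100_000
TUPLES_PER_PAGE = 500


def pages(tuples, per_page=TUPLES_PER_PAGE):
    """Number of pages needed to hold the given number of tuples."""
    return math.ceil(tuples / per_page)


p_customer = pages(CUSTOMER_TUPLES)   # pages of Customer
p_order = pages(ORDER_TUPLES)         # pages of Order

# Assumed NLJ model: scan Customer once to select the 'Alice' tuples,
# keep them in one memory page, then stream Order once.
nlj_io = p_customer + p_order

# Upper bound from the text: the two matching customers lie on at most
# two pages, and GHJ costs three passes over its inputs.
ghj_io_bound = 3 * (2 + p_order)

if __name__ == "__main__":
    print("p(Customer):", p_customer, "p(Order):", p_order)
    print("NLJ I/O (assumed model):", nlj_io)
    print("GHJ I/O (bound from the text):", ghj_io_bound)
```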


Question 5

Assume we have the following query:

SELECT * FROM R, S, T
WHERE R.rid = S.sid AND S.sid = T.tid AND T.tid = R.rid

1. Give three different query plans for this query (including the join method).



2. For each plan in the previous part, specify the size of each table so that each plan would be optimal.


Plan 1: R is small; S and T are big tables, but their join results in a small intermediate table.

Plan 2: R and S are small tables; T is a big table.

Plan 3: T is small; R has an index on the join attribute and can therefore be big; S is big.


3. Take one of the plans from the previous part and assume none of the tables fit in main memory. How do you allocate buffers? What is the page replacement policy?


With plan 1: First we process the join of S and T (both are big tables). We read as many pages into memory as the square root of the total number of pages of the inner table; here the replacement policy does not matter. For the NLJ, we keep 2 pages in memory (one of the outer and one of the inner relation), and the replacement policy is Most-Recently-Used (MRU).
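The preference for Most-Recently-Used in a nested loop join can be illustrated with a small buffer simulation. This is a sketch under assumptions that are not part of the sheet: the inner relation has 5 pages, is scanned 4 times in the same order, and 3 buffer frames are available for it (more than in the answer above, so that the difference between policies becomes visible). The function name is hypothetical.

```python
from collections import OrderedDict


def count_page_faults(accesses, capacity, policy):
    """Count buffer misses for a page access sequence under 'LRU' or 'MRU'."""
    buffer = OrderedDict()  # keys ordered from least to most recently used
    faults = 0
    for page in accesses:
        if page in buffer:
            buffer.move_to_end(page)  # mark as most recently used
            continue
        faults += 1
        if len(buffer) >= capacity:
            # LRU evicts the oldest entry (front), MRU the newest (back).
            buffer.popitem(last=(policy == "MRU"))
        buffer[page] = True
    return faults


# Hypothetical workload: the inner relation is scanned once per outer page.
INNER_PAGES = 5
PASSES = 4
accesses = [page for _ in range(PASSES) for page in range(INNER_PAGES)]

lru_faults = count_page_faults(accesses, capacity=3, policy="LRU")
mru_faults = count_page_faults(accesses, capacity=3, policy="MRU")

if __name__ == "__main__":
    print("LRU faults:", lru_faults)
    print("MRU faults:", mru_faults)
```

With a repeated sequential scan that is larger than the buffer, LRU always evicts the page that will be needed soonest, so every access misses, whereas MRU keeps part of the relation resident across passes.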