BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data ACM EuroSys 2013 (Best Paper Award)

Upload: archibald-blair

Post on 17-Jan-2016

TRANSCRIPT

Page 1: BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data ACM EuroSys 2013 (Best Paper Award)

BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data

ACM EuroSys 2013 (Best Paper Award)

Page 2

Motivation

Support interactive SQL-like aggregate queries over massive sets of data.

Page 3

Feature

Most queries ask for a global summary of the whole table.

blinkdb> SELECT AVG(jobtime)
         FROM very_big_log

Typical aggregates: AVG, COUNT, SUM, STDEV, PERCENTILE, etc.

Page 4

Feature

Filtering and grouping are expressed through a limited set of clauses: filters (WHERE) and GROUP BY.

blinkdb> SELECT AVG(jobtime)
         FROM very_big_log
         WHERE src = 'hadoop'

Page 5

Query Execution on Samples

100 TB on 1000 machines:

Hard disks    ½ - 1 hour
Memory        1 - 5 minutes
?             1 second

Page 6

Query Execution on Samples

ID  City      Buff Ratio
1   NYC       0.78
2   NYC       0.13
3   Berkeley  0.25
4   NYC       0.19
5   NYC       0.11
6   Berkeley  0.09
7   NYC       0.18
8   NYC       0.15
9   Berkeley  0.13
10  Berkeley  0.49
11  NYC       0.19
12  Berkeley  0.10

What is the average buffering ratio in the table?

0.2325

Page 7

Query Execution on Samples

What is the average buffering ratio in the table?

Uniform sample:

ID  City      Buff Ratio  Sampling Rate
2   NYC       0.13        1/4
6   Berkeley  0.25        1/4
8   NYC       0.19        1/4

Estimate from the sample: 0.19 (exact answer: 0.2325)

Page 8

Query Execution on Samples

What is the average buffering ratio in the table?

Uniform sample:

ID  City      Buff Ratio  Sampling Rate
2   NYC       0.13        1/4
6   Berkeley  0.25        1/4
8   NYC       0.19        1/4

Estimate from the sample: 0.19 +/- 0.05 (exact answer: 0.2325)
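
A small sketch of how an estimate and an error bar can be derived from the three sampled rows. The slide does not say how its "+/- 0.05" was obtained, so this uses the textbook standard error of the mean with a finite population correction (an assumption) and is not expected to reproduce the slide's figure exactly.

```python
import math
import statistics

# Buffering ratios of the three sampled rows shown on the slide.
sample = [0.13, 0.25, 0.19]
population_size = 12  # rows in the full table

n = len(sample)
mean = statistics.mean(sample)
# Sample standard deviation (n - 1 in the denominator).
stdev = statistics.stdev(sample)
# Finite population correction: the sample is drawn without replacement
# from a table of only 12 rows.
fpc = math.sqrt((population_size - n) / (population_size - 1))
std_error = stdev / math.sqrt(n) * fpc

print(f"estimate = {mean:.2f}, standard error = {std_error:.3f}")
```

A confidence interval would multiply the standard error by a critical value chosen for the desired confidence level.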

Page 9

Query Execution on Samples

What is the average buffering ratio in the table?

Uniform sample:

ID  City      Buff Ratio  Sampling Rate
2   NYC       0.13        1/2
3   Berkeley  0.25        1/2
5   NYC       0.19        1/2
6   Berkeley  0.09        1/2
8   NYC       0.18        1/2
12  Berkeley  0.49        1/2

Estimate from the sample: 0.22 +/- 0.02 (exact answer: 0.2325; estimate at sampling rate 1/4: 0.19 +/- 0.05)

Page 10

Speed/Accuracy Trade-off

[Figure: error (y-axis) against execution time, i.e. sample size (x-axis). Marked on the time axis: interactive queries at 2 sec; time to execute on the entire dataset at 30 mins.]

Page 11

Speed/Accuracy Trade-off

[Figure: error (y-axis) against execution time, i.e. sample size (x-axis). Marked on the time axis: interactive queries at 2 sec; time to execute on the entire dataset at 30 mins. The pre-existing noise level is also marked.]

Page 12

Sampling Vs. No Sampling

[Figure: query response time (seconds) against fraction of full data.]

Fraction of full data   Query response time (seconds)
1                       1020
10^-1                   103
10^-2                   18
10^-3                   13
10^-4                   10
10^-5                   8

10x faster on a 10^-1 fraction, as response time is dominated by I/O.

Page 13

Sampling Vs. No Sampling

[Figure: query response time (seconds) against fraction of full data, with error bars.]

Fraction of full data   Query response time (seconds)   Error
1                       1020                            -
10^-1                   103                             0.02%
10^-2                   18                              0.07%
10^-3                   13                              1.1%
10^-4                   10                              3.4%
10^-5                   8                               11%

Page 14

What is BlinkDB?

A framework built on Shark and Spark that …

- creates and maintains a variety of uniform and stratified samples from underlying data

- returns fast, approximate answers with error bars by executing queries on samples of data

- verifies the correctness of the error bars that it returns at runtime

Page 15

BlinkDB

• Background

• System Overview

• Sample Creation

• BlinkDB Runtime

• Implementation & Evaluation

Page 16

Background

• One common assumption is that future queries will be similar to historical queries.

• The meaning of “similarity” can differ.

• This choice of model of past workloads is one of the key differences between BlinkDB and prior work.

Page 17

Workload Taxonomy

Page 18
Page 19

System overview

BlinkDB extends the Apache Hive framework by adding two major components to it:

(1) an offline sampling module that creates and maintains samples over time

(2) a run-time sample selection module that creates an Error-Latency Profile (ELP) for queries

Page 20

Supported queries

• Standard SQL aggregate queries involving COUNT, AVG, SUM and QUANTILE. Queries involving these operations can be annotated with either an error bound or a time constraint.

• Nested or joined queries are not supported yet, but this is not a hindrance.

Page 21

It would also be straightforward to extend BlinkDB to deal with foreign-key joins between two sampled tables (or a self-join on one sampled table) where both tables have a stratified sample on the set of columns used for joins.

Page 22

Sample Creation

Why are stratified samples useful?

Samples carry storage costs, so we can only build a limited number of them.

Page 23

Stratified Samples

• When is a uniform sample useful?

• A uniform sample may not contain any members of a rare subset (for example, a small group in a GROUP BY) at all, leading to a missing row in the final output of the query.
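
To make the risk concrete, the following sketch computes the probability that a uniform sample drawn without replacement contains no row of a given group. The function name and the table sizes are hypothetical; the formula is the standard hypergeometric "zero hits" probability.

```python
from math import comb

def prob_group_missing(table_rows: int, group_rows: int, sample_rows: int) -> float:
    """Probability that a uniform sample (without replacement) misses the group entirely."""
    # Ways to choose the sample from non-group rows only, over all ways to choose it.
    return comb(table_rows - group_rows, sample_rows) / comb(table_rows, sample_rows)

# Hypothetical table: 100,000 rows, of which 50 belong to a rare group.
p_miss = prob_group_missing(table_rows=100_000, group_rows=50, sample_rows=1_000)
print(f"P(rare group missing from a 1% uniform sample) = {p_miss:.2f}")
```

With these assumed sizes the rare group is absent from the sample more often than not, which is the missing-row problem described above.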

Page 24

Stratified Samples for a single query

• An error bound e or a time bound t translates into a required sample size n; how to estimate n at runtime will be illustrated later.

• If uniform sampling is used, the expected size of each group's share of the sample is proportional to the group's size in the table.

• If a group is small, its share of the sample can be very small or even zero.

Page 25

Stratified Samples for a single query

This problem has been studied before. Briefly, since error decreases at a decreasing rate as sample size increases, the best choice simply assigns an equal sample size to each group. In addition, the assignment of sample sizes is deterministic.
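
One way to realize "an equal sample size for each group" is to cap every group at the same number of rows and keep smaller groups whole. The sketch below assumes that scheme; `stratified_sample`, `cap` and the toy table are hypothetical names and data, not BlinkDB's code.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, cap, seed=0):
    """Keep at most `cap` rows per distinct value of `key`, chosen uniformly at random.

    Returns (row, sampling_rate) pairs so that later estimates can be reweighted.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    sample = []
    for members in groups.values():
        # Small groups are kept whole; large groups are cut down to `cap` rows.
        chosen = members if len(members) <= cap else rng.sample(members, cap)
        rate = len(chosen) / len(members)
        sample.extend((row, rate) for row in chosen)
    return sample

# Hypothetical skewed table: 1000 NYC rows and 3 Berkeley rows.
table = [{"city": "NYC", "id": i} for i in range(1000)]
table += [{"city": "Berkeley", "id": 1000 + i} for i in range(3)]

sample = stratified_sample(table, key="city", cap=10)
counts = defaultdict(int)
for row, _rate in sample:
    counts[row["city"]] += 1
print(dict(counts))  # {'NYC': 10, 'Berkeley': 3}
```

Recording the per-group sampling rate alongside each row is what later allows aggregates over the whole table to be corrected for the unequal rates.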

Page 26

[16] S. Lohr. Sampling: design and analysis. Thomson, 2009.

K=

Page 27

Optimizing a set of stratified samples for all queries sharing a QCS (query column set)

• n varies from query to query.

Page 28

Column selection optimization

• Sparsity of the data. A stratified sample on ϕ is useful when the original table T contains many small groups under ϕ.

• Workload. A stratified sample is only useful when it is beneficial to actual queries. A query has a QCS qj with some (unknown) probability pj.

• Storage cost. The storage cost (in rows) of building a stratified sample on a set of columns ϕ (for simplicity, K is fixed).

• In practice, we set M = K = 100000.

Page 29

Given an overall storage capacity budget (in rows), our goal is to select β column sets from among the m possible QCSs.

A column set can also be useful by partially covering qj.

Page 30

• The size of this optimization problem increases exponentially with the number of columns in T, which looks worrying. However, it is possible to solve these problems in practice by applying some simple optimizations, like considering only column sets that actually occurred in past queries, or eliminating column sets that are unrealistically large.

Page 31

BlinkDB Runtime

• Selecting the sample candidates

If qj ⊆ ϕi for some stratified sample on ϕi, we simply pick the smallest such ϕi; else, we pick candidates with high selectivity.

• Selecting the right sample/size

Build an Error Profile and a Latency Profile for every candidate, using standard closed-form statistical error estimates [16].
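
The superset rule for choosing candidates can be sketched as follows. This is a minimal illustration that assumes "smallest" means "fewest columns"; the function name and the column sets are hypothetical.

```python
def pick_sample_columns(query_columns, sample_column_sets):
    """Return the smallest stratified column set that covers the query's columns, or None."""
    qcs = frozenset(query_columns)
    covering = [frozenset(s) for s in sample_column_sets if qcs <= frozenset(s)]
    if not covering:
        return None  # no superset exists; another selection strategy is needed
    return min(covering, key=len)

# Hypothetical column sets on which stratified samples were built.
available = [{"city"}, {"city", "os"}, {"city", "os", "browser"}]

print(sorted(pick_sample_columns({"os"}, available)))  # ['city', 'os']
print(pick_sample_columns({"genre"}, available))       # None
```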

Page 32

Predict mainly based on:

1. For all standard SQL aggregates, the variance is proportional to ∼1/n, and thus the standard deviation (or the statistical error) is proportional to ∼1/√n.

2. BlinkDB simply predicts n by assuming that latency scales linearly with input size, as is commonly observed with a majority of I/O-bound queries in parallel distributed execution environments.
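
Under these two scaling assumptions, a measurement on a small pilot sample can be extrapolated to the sample size needed for a given error or time bound. The sketch below only illustrates that arithmetic; the function names and pilot numbers are hypothetical.

```python
import math

def rows_for_error(pilot_rows, pilot_error, target_error):
    """Rows needed to reach target_error, assuming error is proportional to 1/sqrt(n)."""
    return math.ceil(pilot_rows * (pilot_error / target_error) ** 2)

def rows_for_latency(pilot_rows, pilot_seconds, target_seconds):
    """Rows that can be processed in target_seconds, assuming latency is proportional to n."""
    return math.floor(pilot_rows * target_seconds / pilot_seconds)

# Hypothetical pilot run: 10,000 rows gave 4% error in 0.5 seconds.
print(rows_for_error(10_000, 0.04, 0.01))   # 160000
print(rows_for_latency(10_000, 0.5, 2.0))   # 40000
```

Cutting the error by a factor of 4 needs 16 times as many rows, while a 4 times larger time budget only allows 4 times as many rows; this asymmetry is what an error-latency profile has to balance.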

Page 33

Bias correction

Use a stratified sample to simulate an ordinary (uniform) sample by tracking the sampling rate of every group.
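
A minimal sketch of this correction for an average, assuming each sampled row carries the sampling rate of its group and is weighted by the inverse of that rate. The function name and the numbers are hypothetical.

```python
def weighted_average(sample):
    """Estimate the table-wide average from (value, sampling_rate) pairs.

    Each sampled row stands for 1 / sampling_rate rows of the original table.
    """
    total_weight = sum(1.0 / rate for _value, rate in sample)
    weighted_sum = sum(value / rate for value, rate in sample)
    return weighted_sum / total_weight

# Hypothetical stratified sample: a large group sampled at 1/100, a small group kept whole.
sample = [(0.20, 0.01), (0.10, 0.01), (0.90, 1.0), (0.70, 1.0)]

naive = sum(value for value, _rate in sample) / len(sample)
corrected = weighted_average(sample)
print(f"naive = {naive:.3f}, corrected = {corrected:.3f}")
```

The naive average over-represents the small group because it was sampled at a much higher rate; the weighted version restores each group's share of the original table.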

Page 34

Implementation

Enables queries with response time and error bounds.

Creates or updates the set of random and multi-dimensional samples.

Re-writes the query and iteratively assigns it an appropriately sized uniform or stratified sample.

Modifies all pre-existing aggregation functions with statistical closed forms to return error bars and confidence intervals in addition to the result.

Page 35

Sample refresh

• Inaccuracies in analysis based on multiple queries: repeated queries on the same unchanged, biased sample do not help the answers converge.

• BlinkDB periodically (typically daily) resamples from the original data to avoid correlation among the answers to queries that use the same sample.

Page 36

Query Execution on Samples

Uniform sample:

ID  City      Buff Ratio  Sampling Rate
2   NYC       0.13        1/4
6   Berkeley  0.25        1/4
8   NYC       0.19        1/4

Estimate from the sample: 0.19 (exact answer: 0.2325)

Page 37

Time cost for sample

• Uniform samples are generally created in a few hundred seconds.

• Creating stratified samples on a set of columns takes anywhere between 5 and 30 minutes, depending on the number of unique values to stratify on, which decides the number of reducers and the amount of data shuffled.

Page 38

Evaluation: workloads and sample storage cost

• Figures 8(a) and 8(b) show the relative sizes of the set of stratified sample(s) created for 50%, 100% and 200% storage budgets on the Conviva and TPC-H workloads respectively.

• A storage budget of x% indicates that the cumulative size of all the samples will not exceed x/100 times the original data.

The choice of QCSs changes with the storage budget.

Page 39

Response time improvement by sample

Page 40

Error by different samples

Page 41

Error Convergence

Page 42

Time and error bound

Page 43

Scaling Up

• Highly selective queries: queries that operate on only a small fraction of the input data and consist of one or more highly selective WHERE clauses. Example: average among rows with x = 2.

• Bulk queries: queries that are intended to crunch huge amounts of data. Example: average among all the data.

Page 44

Conclusion

• BlinkDB is a parallel, sampling-based approximate query engine that provides support for ad-hoc queries with error and response time constraints.

• Two key ideas: (i) a multi-dimensional sampling strategy that builds and maintains a variety of samples; (ii) a run-time dynamic sample selection strategy that uses parts of a sample to estimate query selectivity and chooses the best samples for satisfying query constraints.

• Answers a “range” of queries within 2 seconds on 17 TB of data with 90-98% accuracy.