blinkdb: queries with bounded errors and bounded response times on very large data acm eurosys 2013...

BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data

ACM EuroSys 2013 (Best Paper Award)

Motivation

Support interactive SQL-like aggregate queries over massive data sets.

Feature

Most queries compute a global aggregate over the whole table.

blinkdb> SELECT AVG(jobtime)
         FROM very_big_log

Aggregates: AVG, COUNT, SUM, STDEV, PERCENTILE, etc.

WHERE and GROUP BY clauses restrict the aggregate to a limited subset of the rows.

blinkdb> SELECT AVG(jobtime)
         FROM very_big_log
         WHERE src = 'hadoop'

FILTERS, GROUP BY clauses

Feature

100 TB on 1000 machines:

Hard disks: ½ - 1 hour

Memory: 1 - 5 minutes

?: 1 second

Query Execution on Samples

ID   City      Buff Ratio
1    NYC       0.78
2    NYC       0.13
3    Berkeley  0.25
4    NYC       0.19
5    NYC       0.11
6    Berkeley  0.09
7    NYC       0.18
8    NYC       0.15
9    Berkeley  0.13
10   Berkeley  0.49
11   NYC       0.19
12   Berkeley  0.10

What is the average buffering ratio in the table?

Exact answer over the full table: 0.2325

Uniform sample (sampling rate 1/4):

ID   City      Buff Ratio   Sampling Rate
2    NYC       0.13         1/4
6    Berkeley  0.25         1/4
8    NYC       0.19         1/4

Estimate from the uniform sample: 0.19 (exact answer: 0.2325)

With an error bar, the uniform sample at rate 1/4 gives 0.19 +/- 0.05 (exact answer: 0.2325).

Query Execution on Samples

What is the average buffering ratio in the table?

Uniform sample (sampling rate 1/2):

ID   City      Buff Ratio   Sampling Rate
2    NYC       0.13         1/2
3    Berkeley  0.25         1/2
5    NYC       0.19         1/2
6    Berkeley  0.09         1/2
8    NYC       0.18         1/2
12   Berkeley  0.49         1/2

Estimate at sampling rate 1/2: 0.22 +/- 0.02

Estimate at sampling rate 1/4: 0.19 +/- 0.05

Exact answer: 0.2325
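The estimates on these slides can be reproduced in spirit with a few lines of Python. This is only a sketch: the function names are hypothetical, and the error term is the plain standard error of the mean (no finite-population correction, no confidence multiplier), which is an assumption and need not match the exact error bars shown on the slides.

```python
import math
import random

# Buffering ratios of the 12-row example table (IDs 1 to 12).
BUFF_RATIO = [0.78, 0.13, 0.25, 0.19, 0.11, 0.09,
              0.18, 0.15, 0.13, 0.49, 0.19, 0.10]

def mean_with_stderr(values):
    """Return (mean, standard error of the mean) for a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    if n < 2:
        return mean, float("nan")
    # Unbiased sample variance, then the standard error of the mean.
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)

def uniform_sample_estimate(values, rate, seed=0):
    """Estimate the average from a uniform random sample drawn at `rate`."""
    rng = random.Random(seed)
    k = max(1, round(len(values) * rate))
    return mean_with_stderr(rng.sample(values, k))

exact, _ = mean_with_stderr(BUFF_RATIO)
print(f"exact average: {exact:.4f}")
for rate in (0.25, 0.5):
    estimate, error = uniform_sample_estimate(BUFF_RATIO, rate)
    print(f"rate {rate}: {estimate:.3f} +/- {error:.3f}")
```

The sample drawn depends on the seed, so individual estimates vary from run to run of the experiment; larger samples tend to give smaller error terms.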

Speed/Accuracy Trade-off

[Figure: error versus execution time (sample size). Marked points: interactive queries (2 sec) and time to execute on the entire dataset (30 mins). A second version of the figure adds a pre-existing noise level.]

Sampling Vs. No Sampling

[Figure: query response time (seconds) versus fraction of full data, with error bars]

Fraction of full data   Response time (s)   Error
1                       1020                -
10^-1                   103                 0.02%
10^-2                   18                  0.07%
10^-3                   13                  1.1%
10^-4                   10                  3.4%
10^-5                   8                   11%

The first 10x reduction in data gives a 10x speedup, as response time is dominated by I/O.

What is BlinkDB?

A framework built on Shark and Spark that …

- creates and maintains a variety of uniform and stratified samples from underlying data

- returns fast, approximate answers with error bars by executing queries on samples of data

- verifies the correctness of the error bars that it returns at runtime

BlinkDB

• Background

• System Overview

• Sample Creation

• BlinkDB Runtime

• Implementation & Evaluation

Background

• One common assumption is that future queries will be similar to historical queries.

• The meaning of “similarity” can differ.

• The choice of model for past workloads is one of the key differences between BlinkDB and prior work.

Workload Taxonomy

System overview

BlinkDB extends the Apache Hive framework by adding two major components to it:

(1) an offline sampling module that creates and maintains samples over time

(2) a run-time sample selection module that creates an Error-Latency Profile (ELP) for queries

Supported queries

• Standard SQL aggregate queries involving COUNT, AVG, SUM and QUANTILE. Queries involving these operations can be annotated with either an error bound or a time constraint.

• Nested or joined queries are not supported yet, but this is not a hindrance.

It would also be straightforward to extend BlinkDB to deal with foreign key joins between two sampled tables (or a self join on one sampled table) where both tables have a stratified sample on the set of columns used for joins.

Sample Creation

Why are stratified samples useful?

Samples carry storage costs, so we can only build a limited number of them.

Stratified Samples

• When is a uniform sample useful?

• A uniform sample may not contain any members of the subset at all, leading to a missing row in the final output of the query.
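How likely a uniform sample is to miss a group entirely can be worked out directly. The sketch below assumes rows are drawn uniformly without replacement, which gives a hypergeometric probability; the function name and the example sizes are illustrative and do not come from the slides.

```python
from math import comb

def prob_group_missing(table_rows, group_rows, sample_rows):
    """Probability that a uniform sample drawn without replacement
    contains no row from a group that has `group_rows` members."""
    if sample_rows > table_rows - group_rows:
        # The sample is too large to avoid the group.
        return 0.0
    return comb(table_rows - group_rows, sample_rows) / comb(table_rows, sample_rows)

# A group with a single row in a 12-row table, sampled at rate 1/4 (3 rows).
print(prob_group_missing(12, 1, 3))

# A 1,000-row group in a 1,000,000-row table, sampled with 100 rows.
print(prob_group_missing(1_000_000, 1_000, 100))
```

In both cases the group is more likely to be absent from the sample than present, which is exactly the situation in which a GROUP BY query loses a row of its output.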

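Stratified Samples for a single query

• An error bound e or a time bound t determines the sample size n; how n is estimated at runtime is illustrated later.

• If uniform sampling is used, the expected size of each group in the sample is proportional to that group's size in the table.

• If a group is small, its share of the sample can be very small or even zero.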

Stratified Samples for a single query

This problem has been studied before [16]. Briefly, since error decreases at a decreasing rate as sample size increases, the best choice simply assigns an equal sample size to each group. In addition, the assignment of sample sizes is deterministic.

[16] S. Lohr. Sampling: design and analysis. Thomson, 2009.
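The equal-allocation idea can be sketched as follows. This is a minimal illustration, not BlinkDB's implementation: the function name, the `cap` parameter (the per-group sample size) and the in-memory row format are assumptions. Groups smaller than the cap are kept whole, and each sampled row remembers the rate at which its group was sampled.

```python
import random
from collections import defaultdict

# (ID, City, Buff Ratio) rows of the example table.
TABLE = [
    (1, "NYC", 0.78), (2, "NYC", 0.13), (3, "Berkeley", 0.25),
    (4, "NYC", 0.19), (5, "NYC", 0.11), (6, "Berkeley", 0.09),
    (7, "NYC", 0.18), (8, "NYC", 0.15), (9, "Berkeley", 0.13),
    (10, "Berkeley", 0.49), (11, "NYC", 0.19), (12, "Berkeley", 0.10),
]

def stratified_sample(rows, key, cap, seed=0):
    """Keep at most `cap` randomly chosen rows from every group.

    Returns a list of (row, sampling_rate) pairs, where sampling_rate is
    the fraction of the row's group that was kept.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    sample = []
    for members in groups.values():
        chosen = members if len(members) <= cap else rng.sample(members, cap)
        rate = len(chosen) / len(members)
        sample.extend((row, rate) for row in chosen)
    return sample

for row, rate in stratified_sample(TABLE, key=lambda r: r[1], cap=2):
    print(row, f"rate={rate:.3f}")
```

Because every group contributes rows, no city can disappear from a GROUP BY result, whatever the seed.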

K=

Optimizing a set of stratified samples for all queries sharing a QCS

• n varies from query to query.

Column selection optimization

• Sparsity of the data. A stratified sample on ϕ is useful when the original table T contains many small groups under ϕ.

• Workload. A stratified sample is only useful when it is beneficial to actual queries. A query has a QCS qj with some (unknown) probability pj.

• Storage cost. This is the cost (in rows) of building a stratified sample on a set of columns ϕ. (For simplicity, K is fixed.)

• In practice, we set M = K = 100000.

The overall storage capacity budget is also expressed in rows. Our goal is to select β column sets from among the m possible QCSs.

A stratified sample on ϕi can also be useful by partially covering qj.

• The size of this optimization problem increases exponentially with the number of columns in T, which looks worrying. However, it is possible to solve these problems in practice by applying some simple optimizations, such as considering only column sets that actually occurred in past queries, or eliminating column sets that are unrealistically large.
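To make the trade-off concrete, here is a deliberately simplified brute-force sketch. It is not the optimization BlinkDB solves: it assumes a sample on a column set only helps queries whose QCS is fully contained in it (partial coverage is ignored), it scores a choice by the total probability of the queries it covers, and it enumerates only the candidate column sets it is given, in the spirit of the "column sets that occurred in past queries" optimization. All names and numbers are hypothetical.

```python
from itertools import combinations

def choose_column_sets(candidates, workload, budget_rows):
    """Pick the affordable set of column sets that covers the most workload.

    candidates: dict mapping frozenset(columns) -> storage cost in rows
    workload:   dict mapping frozenset(columns) (a QCS) -> probability
    Returns (chosen column sets, covered probability mass).
    """
    names = sorted(candidates, key=sorted)  # deterministic enumeration order
    best_choice, best_score = (), 0.0
    for size in range(1, len(names) + 1):
        for choice in combinations(names, size):
            cost = sum(candidates[c] for c in choice)
            if cost > budget_rows:
                continue  # over the storage budget
            # A query is covered if its QCS is a subset of a chosen column set.
            score = sum(p for qcs, p in workload.items()
                        if any(qcs <= c for c in choice))
            if score > best_score:
                best_choice, best_score = choice, score
    return best_choice, best_score

city, os_ = frozenset({"city"}), frozenset({"os"})
both = frozenset({"city", "os"})
candidates = {city: 100, os_: 80, both: 400}
workload = {city: 0.5, os_: 0.3, both: 0.2}

for budget in (200, 400):
    chosen, covered = choose_column_sets(candidates, workload, budget)
    print(budget, [sorted(c) for c in chosen], round(covered, 2))
```

With the smaller budget the two single-column samples win; with the larger one the combined sample becomes affordable and covers all three query classes. The enumeration is exponential in the number of candidates, which is why pruning the candidate list matters.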

BlinkDB Runtime

• Selecting the sample candidates

If qj ⊆ ϕi for some sample, we simply pick the smallest such ϕi.

Otherwise, we pick the samples with high selectivity.

• Selecting the right sample/size

Build an Error Profile and a Latency Profile for every candidate, using standard closed-form statistical error estimates [16].

The prediction is mainly based on:

1. For all standard SQL aggregates, the variance is proportional to ∼1/n, and thus the standard deviation (or the statistical error) is proportional to ∼1/√n.

2. BlinkDB simply predicts n by assuming that latency scales linearly with input size, as is commonly observed with a majority of I/O-bound queries in parallel distributed execution environments.
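The two scaling rules translate into simple extrapolations from a small pilot run. The sketch below is an illustration under exactly those two assumptions (error ∼ 1/√n, latency linear in n); the function names and the idea of a single pilot measurement are hypothetical simplifications, not BlinkDB's actual profiling code.

```python
import math

def rows_for_error_bound(pilot_rows, pilot_error, target_error):
    """Smallest sample size expected to meet `target_error`,
    assuming the error shrinks proportionally to 1/sqrt(n)."""
    return math.ceil(pilot_rows * (pilot_error / target_error) ** 2)

def rows_for_time_bound(pilot_rows, pilot_seconds, target_seconds):
    """Largest sample size expected to finish within `target_seconds`,
    assuming latency grows linearly with the number of rows read."""
    return math.floor(pilot_rows * target_seconds / pilot_seconds)

# Pilot run: 1,000 rows, 0.5 s, error 0.5 (arbitrary units).
print(rows_for_error_bound(1000, 0.5, 0.25))  # halving the error needs 4x the rows
print(rows_for_time_bound(1000, 0.5, 2.0))    # 4x the time allows 4x the rows
```

Halving the error quadruples the required rows, while the time budget grows the affordable sample only linearly, so tight error bounds become expensive quickly.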

Bias correction

Use a stratified sample to simulate an ordinary uniform sample by tracking the sampling rate of every group.
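One common way to do this is to weight every sampled row by the inverse of its group's sampling rate. The sketch below shows that idea for AVG; the function name and the sample values are illustrative assumptions, and the slides do not state that BlinkDB uses this exact estimator.

```python
def weighted_average(sample):
    """Average over (value, sampling_rate) pairs, weighting each value
    by 1 / sampling_rate so that heavily sampled groups do not dominate."""
    total = 0.0
    weight_sum = 0.0
    for value, rate in sample:
        weight = 1.0 / rate  # each sampled row stands for 1/rate original rows
        total += weight * value
        weight_sum += weight
    return total / weight_sum

# Hypothetical stratified sample: 2 of 7 NYC rows and 2 of 5 Berkeley rows.
sample = [(0.13, 2 / 7), (0.19, 2 / 7), (0.25, 2 / 5), (0.09, 2 / 5)]

naive = sum(value for value, _ in sample) / len(sample)
print(f"unweighted: {naive:.4f}")
print(f"bias-corrected: {weighted_average(sample):.4f}")
```

The unweighted mean treats both cities as equally large, whereas the weighted mean restores each city's true share of the table.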

Implementation

Enables queries with response time and error bounds.

Creates or updates the set of random and multi-dimensional samples.

Re-writes the query and iteratively assigns it an appropriately sized uniform or stratified sample.

Modifies all pre-existing aggregation functions with statistical closed forms to return error bars and confidence intervals in addition to their result.

Sample refresh

• Analyses based on multiple queries can be inaccurate: repeating queries on the same unchanged, biased sample does not help the answers converge.

• BlinkDB periodically (typically daily) re-samples from the original data to avoid correlation among the answers to queries that use the same sample.

Time cost for sample

• Uniform samples are generally created in a few hundred seconds.

• Creating stratified samples on a set of columns takes anywhere between 5 and 30 minutes, depending on the number of unique values to stratify on, which decides the number of reducers and the amount of data shuffled.

Evaluation

Workloads and sample storage cost

• 8(a) and 8(b) show the relative sizes of the set of stratified sample(s) created for 50%, 100% and 200% storage budgets on the Conviva and TPC-H workloads respectively.

• A storage budget of x% indicates that the cumulative size of all the samples will not exceed x/100 times the original data.

QCS choices change with the storage budget

Response time improvement by sample

Error by different samples

Error Convergence

Time and error bound

Scaling Up

• Highly selective queries: queries that operate on only a small fraction of the input data and consist of one or more highly selective WHERE clauses. Example: average among rows with x=2.

• Queries that are intended to crunch huge amounts of data. Example: average among all the data.

Conclusion

• BlinkDB is a parallel, sampling-based approximate query engine that provides support for ad-hoc queries with error and response time constraints.

• Two key ideas: (i) a multi-dimensional sampling strategy that builds and maintains a variety of samples; (ii) a run-time dynamic sample selection strategy that uses parts of a sample to estimate query selectivity and chooses the best samples for satisfying query constraints.

• Answers a “range” of queries within 2 seconds on 17 TB of data with 90-98% accuracy.
