

1

CS 361 Lecture 5

Approximate Quantiles and Histograms

9 Oct 2002

Gurmeet Singh Manku([email protected])

2

Frequency Related Problems ...

[Figure: frequency histogram over domain values 1 to 20]

Find all elements with frequency > 0.1%

Top-k most frequent elements

What is the frequency of element 3?

What is the total frequency of elements between 8 and 14?

Find elements that occupy 0.1% of the tail.

Mean + Variance?

Median?

How many elements have non-zero frequency?

3

Types of Histograms ...

• Equi-Depth Histograms
– Idea: Select buckets such that counts per bucket are equal
[Figure: count for bucket vs. domain values 1 to 20]

• V-Optimal Histograms
– Idea: Select buckets to minimize frequency variance within buckets
minimize $\sum_{B} \sum_{v \in B} (f_v - C_B / V_B)^2$
[Figure: count for bucket vs. domain values 1 to 20]
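The two bucketing ideas can be sketched in a few lines. This is a minimal illustration, not the construction from the cited papers: `equi_depth` is a hypothetical greedy splitter over contiguous domain values, and `within_bucket_variance` only evaluates the sum of squared deviations of each frequency from its bucket's average (the quantity a V-optimal histogram minimizes); it does not search for the optimal buckets.

```python
def equi_depth(freqs, num_buckets):
    """Greedily split values 1..len(freqs) into contiguous buckets of
    roughly equal total count. Assumes sum(freqs) > 0.
    Returns (first_value, last_value, count) triples."""
    total = sum(freqs)
    buckets, start, cum, prev_cum = [], 1, 0, 0
    for value, f in enumerate(freqs, start=1):
        cum += f
        # Close the bucket once the next equal-share threshold is reached.
        if cum * num_buckets >= (len(buckets) + 1) * total:
            buckets.append((start, value, cum - prev_cum))
            start, prev_cum = value + 1, cum
    if start <= len(freqs) and buckets:
        # Trailing zero-frequency values join the last bucket.
        first, _, count = buckets.pop()
        buckets.append((first, len(freqs), count))
    return buckets


def within_bucket_variance(freqs, buckets):
    """Sum over buckets of squared deviation of each frequency from the
    bucket's average frequency (count / number of values)."""
    cost = 0.0
    for first, last, count in buckets:
        mean = count / (last - first + 1)
        cost += sum((freqs[v - 1] - mean) ** 2 for v in range(first, last + 1))
    return cost


freqs = [2, 2, 4, 0, 1, 3, 3, 1]
buckets = equi_depth(freqs, 4)
print(buckets)
print(within_bucket_variance(freqs, buckets))
```

Equal counts per bucket do not imply low variance inside a bucket, which is why the two histogram types generally pick different boundaries.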

4

Histograms: Applications

• One Dimensional Data

– Database Query Optimization [Selinger78]

• Selectivity estimation

– Parallel Sorting [DNS91] [NowSort97]

• Jim Gray’s sorting benchmark

– [PIH96] [Poo97] introduced a taxonomy, algorithms, etc.

• Multidimensional Data

– OLTP: not much use (independent attribute assumption)

– OLAP & Mining: yeah

5

Finding The Median ...

• Exact median in main memory: O(n) time [BFPRT 73]

• Exact median in one pass: n/2 memory [Pohl 68]

• Exact median in p passes: O(n^(1/p)) memory [MP 80]

  2 passes: O(sqrt(n)) memory

How about an approximate median?

6

Approximate Medians & Quantiles

$\phi$-Quantile: element with rank $\phi N$, $0 < \phi < 1$ ($\phi = 0.5$ means Median)

$\epsilon$-Approximate $\phi$-quantile: any element with rank $(\phi \pm \epsilon) N$, $0 < \epsilon < 1$. Typical $\epsilon = 0.01$ (1%)

$\epsilon$-approximate median: the case $\phi = 0.5$

Multiple equi-spaced $\epsilon$-approximate quantiles = Equi-depth Histogram
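To make the definition concrete, here is a small checker, written as a sketch under the assumption that ranks are 1-indexed positions in sorted order and that a duplicated value may claim any of the ranks it occupies. The function name `is_eps_approx_quantile` is hypothetical.

```python
import bisect


def is_eps_approx_quantile(data, candidate, phi, eps):
    """True if `candidate` occurs in `data` at some 1-indexed rank r with
    (phi - eps) * N <= r <= (phi + eps) * N."""
    ordered = sorted(data)
    n = len(ordered)
    lowest_rank = bisect.bisect_left(ordered, candidate) + 1
    highest_rank = bisect.bisect_right(ordered, candidate)
    if highest_rank < lowest_rank:
        return False  # candidate does not occur in the data
    return lowest_rank <= (phi + eps) * n and highest_rank >= (phi - eps) * n


data = list(range(1, 101))  # ranks coincide with values here
print(is_eps_approx_quantile(data, 50, phi=0.5, eps=0.01))
print(is_eps_approx_quantile(data, 53, phi=0.5, eps=0.01))
```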

7

Plan for Today ...

Greenwald-Khanna Algorithm for arbitrary length stream

Munro-Paterson Algorithm for fixed N

Sampling-based Algorithms for arbitrary length stream

Randomized Algorithm for fixed N

Randomized Algorithm for arbitrary length stream

Generalization

8

Data distribution assumptions ...

Input sequence of ranks is arbitrary.

e.g., warehouse data

9

Munro-Paterson Algorithm [MP 80]

[Figure: Munro-Paterson [1980] binary tree of buffer collapses, levels 1 to 4, b = 4]

Input: N and $\epsilon$

b buffers, each of size k
Memory = bk

How do we collapse two sorted buffers into one? Merge, then pick alternate elements.

Minimize bk subject to following constraints:
Number of elements in leaves = k 2^b > N
Max relative error in rank = b/2k < $\epsilon$

$b \approx \log(\epsilon N)$, $k \approx \frac{1}{\epsilon} \log(\epsilon N)$

Memory = bk = $O(\frac{1}{\epsilon} \log^2(\epsilon N))$
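The buffer tree and the merge-and-pick-alternate collapse can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact procedure of [MP 80]: the collapse always keeps the elements at even positions of the merged run, a dictionary `levels` (hypothetical) holds at most one waiting buffer per tree level, and any leftover partial leaf is kept with weight 1.

```python
import heapq


def collapse(a, b):
    """Merge two sorted buffers of equal size and keep every second element."""
    merged = list(heapq.merge(a, b))
    return merged[1::2]  # positions 2, 4, ..., 2k (1-indexed)


def munro_paterson(stream, k):
    """One pass over `stream`; returns surviving (weight, sorted_buffer) pairs."""
    levels = {}  # level -> sorted buffer waiting for a partner at that level
    leaf = []

    def push(buf, level):
        # Two buffers at the same level collapse into one at the next level.
        while level in levels:
            buf = collapse(levels.pop(level), buf)
            level += 1
        levels[level] = buf

    for x in stream:
        leaf.append(x)
        if len(leaf) == k:
            push(sorted(leaf), 0)
            leaf = []
    out = [(2 ** level, buf) for level, buf in levels.items()]
    if leaf:
        out.append((1, sorted(leaf)))
    return out


def weighted_quantile(buffers, phi):
    """Element at which the cumulative weight first reaches phi * total."""
    items = sorted((x, w) for w, buf in buffers for x in buf)
    target = phi * sum(w for _, w in items)
    seen = 0
    for x, w in items:
        seen += w
        if seen >= target:
            return x
    return items[-1][0]
```

With N = k 2^h distinct inputs, a single buffer of weight 2^h survives, and the returned element's true rank exceeds the target rank by at most about (h/2) 2^h, which is the b/2k fractional error discussed in the slides.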

10

Error Propagation ...

Top-down analysis

Depth d (x “?” elements):
S S S S S ? ? ? ? L L L L L L

Merge of the two depth d+1 buffers (2x+1 “?” elements):
S S S S S S S S S S ? ? ? ? ? ? ? ? ? L L L L L L L L L L L

Depth d+1 (left):
S S S S S S ? ? ? L L L L L L

Depth d+1 (right):
S S S S ? ? ? ? ? ? L L L L L

Number of “?” elements <= 2x+1

11

Error Propagation at Depth 0 ...

Depth 0:
S S S S S S S M L L L L L L L

Merge of the two depth 1 buffers:
S S S S S S S S S S S S S S S M L L L L L L L L L L L L L L

Depth 1 (left):
S S S S S S S S S S S L L L L

Depth 1 (right):
S S S S M L L L L L L L L L L

12

Error Propagation at Depth 1 ...

Depth 1:
S S S S S S S S S S L L L L L

Merge of the two depth 2 buffers:
S S S S S S S S S S S S S S S S S S S S ? L L L L L L L L L

Depth 2 (left):
S S S S S S S S S S S S L L L

Depth 2 (right):
S S S S S S S S ? L L L L L L

13

Error Propagation at Depth 2 ...

Depth 2:
S S S S S S S S ? L L L L L L

Merge of the two depth 3 buffers:
S S S S S S S S S S S S S S S S ? ? ? L L L L L L L L L L L

Depth 3 (left):
S S S S S S S S ? L L L L L L

Depth 3 (right):
S S S S S S S S ? ? L L L L L

14

Error Propagation ...

Depth d (x “?” elements):
S S S S S ? ? ? ? L L L L L L

Merge of the two depth d+1 buffers (2x+1 “?” elements):
S S S S S S S S S S ? ? ? ? ? ? ? ? ? L L L L L L L L L L L

Depth d+1 (left):
S S S S S S ? ? ? L L L L L L

Depth d+1 (right):
S S S S ? ? ? ? ? ? L L L L L

Number of “?” elements <= 2x+1

15

Error Propagation level by level

[Figure: Munro-Paterson [1980] tree with depths 0 (root) to 3 (leaves), b = 4, depth d = 2 highlighted]

b buffers, each of size k
Memory = bk

Number of elements at depth d = k 2^d

Increase in fractional error in rank is 1/2k per level

Let sum of “?” elements at depth d be X.
Then fraction of “?” elements at depth d: f = X / (k 2^d)

Sum of “?” elements at depth d+1 is at most 2X + 2^d.
Then fraction of “?” elements at depth d+1: f’ <= (2X + 2^d) / (k 2^(d+1)) = f + 1/2k

Fractional error in rank at depth 0 is 0.
Max depth = b
So, total fractional error is <= b/2k

Constraint 2: b/2k < $\epsilon$
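The two constraints, k 2^b > N and b/2k < epsilon, leave a small search space for b, so the memory bk can be minimized by brute force. The helper below is a hypothetical sketch: it uses non-strict inequalities for simplicity and caps b at an assumed `max_b`.

```python
import math


def choose_buffers(n, eps, max_b=64):
    """Return (b, k) minimizing b * k subject to
    k * 2**b >= n (the leaves hold the input) and b / (2k) <= eps (rank error)."""
    best = None
    for b in range(1, max_b + 1):
        k = max(math.ceil(n / 2 ** b), math.ceil(b / (2 * eps)))
        if best is None or b * k < best[0] * best[1]:
            best = (b, k)
    return best


b, k = choose_buffers(10 ** 6, 0.01)
print(b, k, b * k)
```

For small b the capacity constraint dominates (k must be large to hold N elements in 2^b leaves); for large b the error constraint dominates, so the minimum sits where the two bounds on k cross.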

16

Generalized Munro-Paterson [MRL 98]

b = 5

Each buffer has a ‘weight’ associated with it.

How do we collapse buffers with different weights?

17

Generalized Collapse ...

k = 5

Input buffers:
Weight 2: 31 37 6 12 5
Weight 3: 10 35 8 19 13
Weight 1: 28 15 16 25 27

Each element repeated as many times as its buffer's weight:
31 31 37 37 6 6 12 12 5 5
10 10 10 35 35 35 8 8 8 19 19 19 13 13 13
28 15 16 25 27

Sorted:
5 5 6 6 8 8 8 10 10 10 12 12 13 13 13 15 16 19 19 19 25 27 28 31 31 35 35 35 37 37

Output buffer (Weight 6):
6 10 15 27 35
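The example on this slide can be reproduced with a short routine. It is a sketch under one assumption that the slide does not state: the first picked position in the weighted sorted sequence is (W + 2) // 2, where W is the output weight, and later picks are spaced W apart. That choice matches the slide's output for W = 6. The function name and `offset` parameter are hypothetical.

```python
def generalized_collapse(buffers, offset=None):
    """Collapse (weight, elements) buffers of equal size k into one buffer.

    Conceptually each element is repeated `weight` times, the copies are
    sorted, and k elements spaced W apart are kept, where W is the sum of
    the input weights. Returns (W, output_elements).
    """
    k = len(buffers[0][1])
    total_weight = sum(w for w, _ in buffers)
    if offset is None:
        offset = (total_weight + 2) // 2  # assumed first position, 1-indexed
    items = sorted((x, w) for w, buf in buffers for x in buf)
    targets = [j * total_weight + offset for j in range(k)]
    out, seen, t = [], 0, 0
    for x, w in items:
        seen += w  # copies of x occupy positions up to `seen`
        while t < k and targets[t] <= seen:
            out.append(x)
            t += 1
    return total_weight, out


weight, merged = generalized_collapse([
    (2, [31, 37, 6, 12, 5]),
    (3, [10, 35, 8, 19, 13]),
    (1, [28, 15, 16, 25, 27]),
])
print(weight, merged)  # 6 [6, 10, 15, 27, 35]
```

Working with cumulative weights avoids materializing the repeated copies, so the collapse costs a sort of the input elements regardless of how large the weights are.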

18

Analysis of Generalized Munro-Paterson

Munro-Paterson: $O(\frac{1}{\epsilon} \log^2(\epsilon n))$

Generalized Munro-Paterson: $O(\frac{1}{\epsilon} \log^2(\epsilon n))$ - But smaller constant

19

Reservoir Sampling [Vitter 85]

Maintain a uniform sample of size s

[Figure: input sequence of length N reduced to a sample of size s]

Approximate median = median of sample

If s = $O(\frac{1}{\epsilon^2} \log \frac{1}{\delta})$, then with probability at least $1-\delta$, answer is an $\epsilon$-approximate median
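A uniform sample of fixed size over a stream of unknown length can be maintained with the classic replace-at-random rule. The sketch below uses the simple variant usually called Algorithm R rather than the faster skipping variants of [Vitter 85]; the function names are hypothetical.

```python
import random


def reservoir_sample(stream, s, rng=random):
    """Keep a uniform random sample of size s from a stream of unknown length."""
    sample = []
    for i, x in enumerate(stream):
        if i < s:
            sample.append(x)  # the first s elements fill the reservoir
        else:
            j = rng.randrange(i + 1)  # uniform over 0..i
            if j < s:
                sample[j] = x  # element i is kept with probability s / (i + 1)
    return sample


def sample_median(values):
    """Lower median of a non-empty list."""
    ordered = sorted(values)
    return ordered[(len(ordered) - 1) // 2]


random.seed(1)
sample = reservoir_sample(range(100_000), 1000)
print(sample_median(sample))
```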

20

“Non-Reservoir” Sampling

Input: A B D B A B D F A S C D D B A B D F A T X Y D B A X T F A S X Z D B A B D T G H

Choose 1 out of every N/s successive elements (blocks of N/s elements)

At end of stream, sample size is s
Approximate median = median of sample

If s = $O(\frac{1}{\epsilon^2} \log \frac{1}{\delta})$, then with probability at least $1-\delta$, answer is an $\epsilon$-approximate median
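When N is known in advance, the stream can be cut into blocks of N/s successive elements with one element drawn uniformly from each block. A minimal sketch, assuming N is a multiple of s (otherwise the block size is rounded down and the sample is capped at s); the function name is hypothetical.

```python
import random


def one_in_block_sample(stream, n, s, rng=random):
    """Pick one uniformly random element from each block of n // s elements."""
    block = n // s
    sample, pick = [], 0
    for i, x in enumerate(stream):
        pos = i % block
        if pos == 0:
            pick = rng.randrange(block)  # position to keep within this block
        if pos == pick and len(sample) < s:
            sample.append(x)
    return sample


random.seed(2)
sample = one_in_block_sample(range(10_000), n=10_000, s=100)
print(len(sample))
```

Unlike a reservoir, this needs no replacement step, and the sample never holds more than s elements, but it depends on knowing N up front.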

21

Non-uniform Sampling ...

Input: A B D B A B D F A S C D D B A B D F A T X Y D B A X T F A S X Z D B A B D T G H ...

s out of s elements, Weight = 1
s out of 2s elements, Weight = 2
s out of 4s elements, Weight = 4
s out of 8s elements, Weight = 8

At end of stream, sample size is O(s log(N/s))
Approximate median = weighted median of sample

If s = $O(\frac{1}{\epsilon^2} \log \frac{1}{\delta})$, then with probability at least $1-\delta$, answer is an $\epsilon$-approximate median
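The doubling scheme can be sketched as a single pass that halves the sampling rate after each block. This is an illustration under assumptions, not the exact procedure of [MRL 99]: within a block of weight w the stream is cut into groups of w successive elements and one element per group is kept, and a group left incomplete at the end of the stream may go unrepresented. Function names are hypothetical.

```python
import random


def non_uniform_sample(stream, s, rng=random):
    """Return (value, weight) pairs: all of the first s elements with weight 1,
    then s of the next 2s with weight 2, s of the next 4s with weight 4, ..."""
    sample = []
    weight = 1         # one element is kept per `weight` successive elements
    left_in_block = s  # elements of the current block not yet read
    pos, pick = 0, 0
    for x in stream:
        if pos == 0:
            pick = rng.randrange(weight)  # position to keep within this group
        if pos == pick:
            sample.append((x, weight))
        pos = (pos + 1) % weight
        left_in_block -= 1
        if left_in_block == 0:  # block finished: halve the sampling rate
            weight *= 2
            left_in_block = s * weight
            pos = 0
    return sample


def weighted_median(pairs):
    """Value at which the cumulative weight first reaches half the total."""
    ordered = sorted(pairs)
    half = sum(w for _, w in ordered) / 2
    seen = 0
    for x, w in ordered:
        seen += w
        if seen >= half:
            return x


random.seed(3)
pairs = non_uniform_sample(range(1500), s=100)
print(len(pairs), weighted_median(pairs))
```

Each block contributes s samples and block sizes double, so a stream of length N yields about s log(N/s) samples, matching the sample size stated on the slide.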

22

Sampling + Generalized Munro-Paterson [MRL 98]

Output is an $\epsilon$-approximate median with probability at least $1-\delta$.

Stream of unknown length, $\epsilon$ and $\delta$:
Reservoir Sampling: Maintain s = $0.5\,\epsilon^{-2} \log(2\delta^{-1})$ samples.
Compute exact median of samples.
Memory required: $O(\epsilon^{-2} \log \delta^{-1})$

Stream of known length N (advance knowledge of N), $\epsilon$ and $\delta$:
“1-in-N/s” Sampling: Choose s = $0.5\,\epsilon^{-2} \log(2\delta^{-1})$ samples.
Generalized Munro-Paterson: Compute $\epsilon$-approximate median of samples. Memory required = $O(((1-\alpha)\epsilon)^{-1} \log^2((1-\alpha)\epsilon s))$
Memory required: $O(\frac{1}{\epsilon} \log^2(\frac{1}{\epsilon} \log \frac{1}{\delta}))$
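For a feel of the sample sizes involved, the expression 0.5 epsilon^-2 log(2/delta) can be evaluated directly. The sketch assumes the logarithm is natural, as in the Hoeffding bound [Hoeffding63]; the slides do not state the base. The function name is hypothetical.

```python
import math


def sample_size(eps, delta):
    """Samples needed so that the sample median is an eps-approximate median
    with probability at least 1 - delta (natural log assumed)."""
    return math.ceil(0.5 * eps ** -2 * math.log(2 / delta))


print(sample_size(0.1, 0.01))
print(sample_size(0.01, 1e-4))
```

The size depends only on epsilon and delta, not on the stream length N, which is what makes sampling attractive as a front end to the deterministic algorithm.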

23

Unknown-N Algorithm [MRL 99]

Stream of unknown length, $\epsilon$ and $\delta$

Non-uniform Sampling

Modified Deterministic Algorithm for Approximate Medians

Output is an $\epsilon$-approximate median with probability at least $1-\delta$.

Memory required: $O(\frac{1}{\epsilon} \log^2(\frac{1}{\epsilon} \log \frac{1}{\delta}))$

24

Non-uniform Sampling ...

Input: A B D B A B D F A S C D D B A B D F A T X Y D B A X T F A S X Z D B A B D T ...

s out of s elements, Weight = 1 (e.g. A B D E)
s out of 2s elements, Weight = 2
s out of 4s elements, Weight = 4
s out of 8s elements, Weight = 8

At end of stream, sample size is O(s log(N/s))
Approximate median = weighted median of sample

If s = $O(\frac{1}{\epsilon^2} \log \frac{1}{\delta})$, then with probability at least $1-\delta$, answer is an $\epsilon$-approximate median

25

Modified Deterministic Algorithm ...

[Figure: buffer tree with Height axis h, h+1, h+2, h+3, ..., L]

Sample Input:
2s elements with W = 1
s elements with W = 2
s elements with W = 4
s elements with W = 8
...
s elements with W = 2^(L-h)

L = highest level
h = height of tree

b buffers, each of size k

Compute approximate median of weighted samples.

26

Modified Munro-Paterson Algorithm

[Figure: buffer tree with Height axis b, b+1, b+2, b+3, ..., H]

Weighted Samples:
2s elements with W = 1
s elements with W = 2
s elements with W = 4
s elements with W = 8
...
s elements with W = 2^(H-b)

H = highest level
b = height of tree

b buffers, each of size k

Compute approximate median of weighted samples.

27

Error Analysis ...

[Figure: buffer tree with Height axis b, b+1, b+2, b+3, ..., b+h]

Weighted Samples:
2s elements with W = 1
s elements with W = 2
s elements with W = 4
s elements with W = 8
...
s elements with W = 2^(H-b)

b+h = total height
b = height of small tree

b buffers, each of size k

Increase in fractional error in rank is 1/2k per level

Total fractional error <= [sum over heights b, b+1, ..., b+h of the error accumulated by each weight class] <= b/k

28

Error Analysis contd...

Minimize bk subject to following constraints:

Number of elements in leaves = k 2^b > s where s = $0.5\,\epsilon^{-2} \log(2\delta^{-1})$

Max fractional error in rank = b/k < $(1-\alpha)\epsilon$

$b = O(\log(\epsilon s))$, $k = O(\frac{1}{\epsilon} \log(\epsilon s))$

Memory = bk = $O(\frac{1}{\epsilon} \log^2(\frac{1}{\epsilon} \log \frac{1}{\delta}))$

Almost the same as before

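29

Summary of Algorithms ...

• Reservoir Sampling [Vitter 85]
– Probabilistic: $O(\frac{1}{\epsilon^2} \log \frac{1}{\delta})$

• Munro-Paterson [MP 80]
– Deterministic: $O(\frac{1}{\epsilon} \log^2(\epsilon n))$. Requires advance knowledge of n.

• Generalized Munro-Paterson [MRL 98]
– Deterministic: $O(\frac{1}{\epsilon} \log^2(\epsilon n))$. Requires advance knowledge of n.

• Sampling + Generalized MP [MRL98]
– Probabilistic: $O(\frac{1}{\epsilon} \log^2(\frac{1}{\epsilon} \log \frac{1}{\delta}))$. Requires advance knowledge of n.

• Non-uniform Sampling + GMP [MRL 99]
– Probabilistic: $O(\frac{1}{\epsilon} \log^2(\frac{1}{\epsilon} \log \frac{1}{\delta}))$

• Greenwald & Khanna [GK 01]
– Deterministic: $O(\frac{1}{\epsilon} \log(\epsilon n))$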

33

List of papers ...

[Hoeffding63] W Hoeffding, “Probability Inequalities for Sums of Bounded Random Variables”, Journal of the American Statistical Association, pp 13-30, 1963.

[MP80] J I Munro and M S Paterson, “Selection and Sorting in Limited Storage”, Theoretical Computer Science, 12:315-323, 1980.

[Vit85] J S Vitter, “Random Sampling with a Reservoir”, ACM Trans. on Math. Software, 11(1):37-57, 1985.

[MRL98] G S Manku, S Rajagopalan and B G Lindsay, “Approximate Medians and other Quantiles in One Pass and with Limited Memory”, ACM SIGMOD 98, pp 426-435, 1998.

[MRL99] G S Manku, S Rajagopalan and B G Lindsay, “Random Sampling Techniques for Space Efficient Online Computation of Order Statistics of Large Datasets”, ACM SIGMOD 99, pp 251-262, 1999.

[GK01] M Greenwald and S Khanna, “Space-Efficient Online Computation of Quantile Summaries”, ACM SIGMOD 2001, pp 58-66, 2001.