1
Online Computation and Continuous Maintaining of Quantile Summaries
Tian Xia, Database Lab @ CCIS, Northeastern University
April 16, 2004
2
References
M. Greenwald and S. Khanna. Space-Efficient Online Computation of Quantile Summaries. In SIGMOD, pages 58-66, 2001.
X. Lin, H. Lu, J. Xu, and J. X. Yu. Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream. In ICDE, pages 362-373, 2004.
3
Outline of this talk
Quantile Estimation Overview
GK-quantile Summary Algorithm
  Data Structure
  Operations
  Space Complexity Analysis
Sliding Window Model
4
Problem Definitions
φ-Quantile: A φ-quantile (φ ∈ (0,1]) of an ordered sequence of N data elements is the element with rank ⌈φN⌉.
Quantile Query: Given φ, find the data element with rank ⌈φN⌉ among all elements in the stream. Variation: the N most recent elements (sliding window model).
ε-approximate: Find an element whose rank lies within the interval [r − εN, r + εN], where r = ⌈φN⌉.
5
Example of A Quantile Query
The sorted order of the sequence is 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 11, 11, 11, 12.
0.5-quantile returns the element ranked 8, which is 8.
0.25-approximate 0.5-quantile returns one of the elements in {4,5,6,7,8,9,10}.
Data stream (arrival order, t0 to t15): 12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3
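The two answers above can be checked with a brute-force sketch that sorts the whole stream. The function names are illustrative and not from the talk; ranks start at 1 and the φ-quantile is taken at rank ⌈φN⌉.

```python
import math

# Stream from the slide, in arrival order t0..t15.
STREAM = [12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3]

def exact_quantile(data, phi):
    """Element of rank ceil(phi * N) in sorted order (ranks start at 1)."""
    ordered = sorted(data)
    rank = math.ceil(phi * len(ordered))
    return ordered[rank - 1]

def approximate_answers(data, phi, eps):
    """Distinct values whose rank lies in [r - eps*N, r + eps*N]."""
    ordered = sorted(data)
    n = len(ordered)
    r = math.ceil(phi * n)
    lo = max(1, math.ceil(r - eps * n))
    hi = min(n, math.floor(r + eps * n))
    return sorted(set(ordered[lo - 1:hi]))

print(exact_quantile(STREAM, 0.5))             # 8
print(approximate_answers(STREAM, 0.5, 0.25))  # [4, 5, 6, 7, 8, 9, 10]
```

This keeps all N elements in memory, which is exactly what a quantile summary is meant to avoid.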
6
Why Approximation?
Munro and Paterson (Theoretical Computer Science, 1980) showed that any algorithm which exactly computes the φ-quantile of N data elements in p passes requires a space of Ω(N^(1/p)).
Approximate quantile techniques are necessary to achieve sub-linear space efficiency.
7
Quantile Summary
Quantile Summary: A small number of objects from the input data sequence, which could be used (by quantile estimator) to answer quantile queries.
Other summary methods of large data sets include average, standard deviation, histogram, counting sketch (FM-sketch), etc.
8
Properties of A Good Quantile Estimator
Provide tunable and explicit a priori guarantees on the precision of the approximation, e.g. it is ε-approximate.
Data independent.
Use as small a memory footprint as possible, which includes temporary storage.
9
Previous Work
Manku, Rajagopalan, and Lindsay (SIGMOD, 1998) proposed a single-pass algorithm that constructs an ε-approximate quantile summary.
Space complexity: O((1/ε) log²(εN)).
It requires advance knowledge of N, the size of the data set, so it won't work in a data stream environment.
10
Outline of this talk
Quantile Estimation Overview
GK-quantile Summary Algorithm
  Data Structure
  Operations
  Space Complexity Analysis
Sliding Window Model
11
Contributions of GK-algorithm
Dynamically adjusts the quantile summary with the growth of N, the total number of data elements in the data stream.
Space complexity is reduced to O((1/ε) log(εN)).
12
Assumptions
A new data element arrives after each unit of time.
n denotes both the number of elements of the data sequence and the current time.
A data element is represented by its value v.
rmin(v) and rmax(v) denote respectively the lower and upper bounds on the actual rank r of v among the elements seen so far.
13
The Summary Data Structure
GK-algorithm maintains a summary data structure S=S(n) at any point in time n.
S(n) consists of an ordered (non-decreasing) sequence of tuples which corresponds to a subset of the elements seen thus far.
14
The Summary Data Structure
S = {t_0, t_1, …, t_{s-1}}, where t_i = (v_i, g_i, Δ_i).
v_i is the value of one of the elements seen so far.
g_i = rmin(v_i) − rmin(v_{i-1})
Δ_i = rmax(v_i) − rmin(v_i)
v_0 and v_{s-1} always correspond to the minimum and the maximum elements seen so far.
15
The Summary Data Structure
Given g_i = rmin(v_i) − rmin(v_{i-1}) and Δ_i = rmax(v_i) − rmin(v_i):
rmin(v_i) = Σ_{j≤i} g_j
rmax(v_i) = Σ_{j≤i} g_j + Δ_i
g_i + Δ_i − 1 is an upper bound on the total number of elements that may have fallen between v_{i-1} and v_i.
rmin(v_{s-1}) = Σ_j g_j = n.
16
Example of A Quantile Summary
{(1,1,0), (2,1,7), (3,1,7), (4,1,6), (10,6,0), (12,6,0)} is a quantile summary consisting of 6 tuples.
For clarity, rewrite the tuples of the above summary in the form t_i = (v_i, rmin(v_i), rmax(v_i)) as follows: {(1,1,1), (2,2,9), (3,3,10), (4,4,10), (10,10,10), (12,16,16)}.
Data stream (arrival order, t0 to t15): 12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3
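The rewrite from (v, g, Δ) tuples to (v, rmin, rmax) is a running sum over the g values. A minimal sketch, with `rank_bounds` as an illustrative name:

```python
from itertools import accumulate

# Summary from the slide, as (v, g, delta) tuples.
SUMMARY = [(1, 1, 0), (2, 1, 7), (3, 1, 7), (4, 1, 6), (10, 6, 0), (12, 6, 0)]

def rank_bounds(summary):
    """Convert (v, g, delta) tuples into (v, rmin, rmax) tuples."""
    # rmin(v_i) is the prefix sum of g up to i; rmax(v_i) adds delta_i.
    rmins = accumulate(g for _, g, _ in summary)
    return [(v, rmin, rmin + delta)
            for (v, _, delta), rmin in zip(summary, rmins)]

print(rank_bounds(SUMMARY))
# [(1, 1, 1), (2, 2, 9), (3, 3, 10), (4, 4, 10), (10, 10, 10), (12, 16, 16)]
```

The last rmin equals 16, the number of elements seen, as the data structure slide requires.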
17
Error Rate?
PROPOSITION 1: Given a quantile summary S, a φ-quantile can always be identified to within an error of max_i(g_i + Δ_i)/2.
COROLLARY 1: If at any time n, the summary S(n) satisfies the property that max_i(g_i + Δ_i) ≤ 2εn, then we can answer any φ-quantile query to within an εn precision.
18
QUANTILE (φ)
QUANTILE(φ): To compute an ε-approximate φ-quantile from the summary S(n) after n data elements, compute the rank r = ⌈φn⌉. Find i such that both r − rmin(v_i) ≤ εn and rmax(v_i) − r ≤ εn, and return v_i, i.e. r − εn ≤ rmin(v_i) ≤ rmax(v_i) ≤ r + εn.
19
Example of A Quantile Summary
{(1,1,0), (2,1,7), (3,1,7), (4,1,6), (10,6,0), (12,6,0)} is 0.25-approximate with respect to the data stream.
A 0.25-approximate 0.5-quantile query returns the element of tuple (4,1,6) or of tuple (10,6,0), i.e. 4 or 10.
Data stream (arrival order, t0 to t15): 12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3
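A minimal sketch of the QUANTILE lookup on the example summary. It assumes the rank is r = ⌈φn⌉ and that n is the sum of the g values; `candidates` is an extra helper, not part of the talk, that lists every tuple satisfying the condition.

```python
import math

SUMMARY = [(1, 1, 0), (2, 1, 7), (3, 1, 7), (4, 1, 6), (10, 6, 0), (12, 6, 0)]

def candidates(summary, phi, eps):
    """All values v_i with r - rmin(v_i) <= eps*n and rmax(v_i) - r <= eps*n."""
    n = sum(g for _, g, _ in summary)
    r = math.ceil(phi * n)
    found, rmin = [], 0
    for v, g, delta in summary:
        rmin += g                 # running sum of g gives rmin(v_i)
        rmax = rmin + delta
        if r - rmin <= eps * n and rmax - r <= eps * n:
            found.append(v)
    return found

def quantile(summary, phi, eps):
    """First qualifying value, or None if no tuple qualifies."""
    found = candidates(summary, phi, eps)
    return found[0] if found else None

print(candidates(SUMMARY, 0.5, 0.25))  # [4, 10]
print(quantile(SUMMARY, 0.5, 0.25))    # 4
```

Both 4 and 10 are in the set {4,5,6,7,8,9,10} of acceptable 0.25-approximate answers for the 0.5-quantile.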
20
Outline of this talk
Quantile Estimation Overview
GK-quantile Summary Algorithm
  Data Structure
  Operations
  Space Complexity Analysis
Sliding Window Model
21
How does their algorithm work?
Insert a tuple in the summary corresponding to a new incoming element.
Periodically sweep over the summary to “merge” some of the tuples into their neighbors. This keeps the space requirement bounded.
At all times max_i (g_i + Δ_i) ≤ 2εn.
What to merge and how to merge?
22
INSERT (v)
INSERT(v): Find the smallest i such that v_{i-1} ≤ v < v_i, and insert the tuple (v, 1, ⌊2εn⌋) between t_{i-1} and t_i. Increment s. As a special case, if v is the new minimum or the maximum element seen, then insert (v, 1, 0).
23
Example of INSERT
S={(12, 1, 0)}, n=1
S={(6, 1, 0), (12, 1, 0)}, n=2
S={(6, 1, 0), (10, 1, 1), (12, 1, 0)}, n=3
S={(1, 1, 0), (6, 1, 0), (10, 1, 1), (12, 1, 0)}, n=4
Stream elements shown: t0 = 12, t3 = 10, t4 = 1, t8 = 6 (ε = 0.25)
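The four states above can be reproduced with a short sketch of INSERT, assuming ε = 0.25 and inserting 12, 6, 10, 1 in that order. Treating a value equal to the current maximum as a new maximum is an assumption of this sketch.

```python
import math

def insert(summary, n, v, eps):
    """Insert v into summary (sorted list of (v, g, delta)); n counts elements before the insert."""
    if not summary or v < summary[0][0]:
        summary.insert(0, (v, 1, 0))          # new minimum
    elif v >= summary[-1][0]:
        summary.append((v, 1, 0))             # new maximum (ties assumed to count)
    else:
        # Smallest i with v_{i-1} <= v < v_i.
        i = next(k for k, (vk, _, _) in enumerate(summary) if v < vk)
        summary.insert(i, (v, 1, math.floor(2 * eps * n)))
    return n + 1

S, n = [], 0
for v in [12, 6, 10, 1]:
    n = insert(S, n, v, 0.25)
print(S, n)  # [(1, 1, 0), (6, 1, 0), (10, 1, 1), (12, 1, 0)] 4
```

The value 10 arrives when n = 2, so its Δ is ⌊2 · 0.25 · 2⌋ = 1, matching the tuple (10, 1, 1).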
24
Merge
Space will increase with insertions.
Intuitively, two tuples (v_i, g_i, Δ_i) and (v_j, g_j, Δ_j) can be merged into a new tuple (v_k, g_k, Δ_k), as long as g_k + Δ_k ≤ 2εn.
An individual tuple is full if g_k + Δ_k = ⌊2εn⌋.
Capacity and Band are introduced.
25
Capacity and Band
The capacity of a tuple is the maximum number of elements that can be counted by g_i before the tuple becomes full (g_i ≤ 2εn − Δ_i).
The merge phase will free up space by merging tuples with small capacities into tuples with similar or larger capacities.
Bands: Roughly speaking, divide the Δs into bands that lie between elements of (0, ½·2εn, ¾·2εn, …, ((2^i − 1)/2^i)·2εn, …, 2εn − 1, 2εn).
The larger the capacity (with smaller Δ), the larger the band.
26
Example of A Quantile Summary
{(1,1,0), (2,1,7), (3,1,7), (4,1,6), (10,6,0), (12,6,0)} is a quantile summary consisting of 6 tuples.
(2,1,7) and (3,1,7) are in the lowest band. (1,1,0), (10,6,0) and (12,6,0) are in the highest bands.
Data stream (arrival order, t0 to t15): 12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3
27
Band
Strictly: given α from 1 to ⌈log 2εn⌉ and p = ⌊2εn⌋, band_α is the set of all Δ such that p − 2^α − (p mod 2^α) < Δ ≤ p − 2^(α−1) − (p mod 2^(α−1)).
If two Δs are ever in the same band, they never appear in different bands as n increases.
In band_0, Δ = ⌊2εn⌋.
A tree structure is imposed to facilitate merges between bands.
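The band of a given Δ can be computed directly from the inequality. This is a sketch under one assumption: instead of stopping at ⌈log 2εn⌉, the loop keeps increasing α until the interval contains Δ, so that Δ = 0 always receives a band.

```python
def band(delta, p):
    """Band index of delta, where p = floor(2 * eps * n)."""
    if not 0 <= delta <= p:
        raise ValueError("delta must lie in [0, p]")
    if delta == p:
        return 0                              # band 0 holds only delta == p
    alpha = 1
    while True:
        lower = p - 2**alpha - (p % 2**alpha)
        upper = p - 2**(alpha - 1) - (p % 2**(alpha - 1))
        if lower < delta <= upper:
            return alpha
        alpha += 1                            # intervals are adjacent, so this ends

# p = floor(2 * 0.25 * 16) = 8 for the 16-element example summary.
print([band(d, 8) for d in (0, 6, 7)])  # [4, 2, 1]
```

With p = 8 the tuples with Δ = 7 fall in band 1, the tuple with Δ = 6 in band 2, and the tuples with Δ = 0 in the highest band, which agrees with the example summary.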
28
Tree Representation
Given a summary S = {t_0, t_1, …, t_{s-1}}, the tree T associated with S contains a node V_i for each t_i and a special root node R.
The parent of a node V_i is the node V_j such that j is the least index greater than i with band(t_j) > band(t_i). If no such index exists, R is the parent.
29
Tree Representation
PROPOSITION 3: The children of any node in T are always arranged in non-increasing order of band in S.
PROPOSITION 4: For any node V, the set of all its descendants arranged in T forms a contiguous segment in S.
[Figure: tree T with root R over the tuples (1,1,0), (2,1,7), (3,1,7), (4,1,6), (10,6,0), (12,6,0)]
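A small sketch of how the tree could be derived for the example summary. It takes the parent of a tuple to be the nearest tuple to its right with a strictly larger band, as in Greenwald and Khanna's paper. The band values are worked out by hand from the band definition with p = 8, and `None` stands for the root R.

```python
# Bands of (1,1,0), (2,1,7), (3,1,7), (4,1,6), (10,6,0), (12,6,0) for p = 8.
BANDS = [4, 1, 1, 2, 4, 4]

def parents(bands):
    """Index of each node's parent; None means the root R."""
    result = []
    for i, b in enumerate(bands):
        # Least j > i whose band is strictly larger than band(t_i).
        parent = next((j for j in range(i + 1, len(bands)) if bands[j] > b), None)
        result.append(parent)
    return result

print(parents(BANDS))  # [None, 3, 3, 4, None, None]
```

So (2,1,7) and (3,1,7) are children of (4,1,6), which is a child of (10,6,0), while (1,1,0), (10,6,0) and (12,6,0) hang directly off R.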
30
Merge Actually
GK-algorithm will merge together a node and all its descendants into either its parent node or into its right sibling.
The tuple that results after the merge must not be full, i.e. g_i + Δ_i < 2εn.
The operation is called COMPRESS().
31
COMPRESS ( )
The operation COMPRESS tries to merge together a node and all its descendants into either its parent node or its right sibling.
COMPRESS()
  for i from s-2 to 0 do
    if ((BAND(Δ_i, 2εn) ≤ BAND(Δ_{i+1}, 2εn)) && (g_i* + g_{i+1} + Δ_{i+1} < 2εn)) then
      DELETE all descendants of t_i and the tuple t_i itself;
    end if
  end for
end COMPRESS
g_i* denotes the sum of the g-values of the tuple t_i and all its descendants in T.
32
DELETE (vi)
DELETE(v_i): To delete the tuple (v_i, g_i, Δ_i) from S, replace (v_i, g_i, Δ_i) and (v_{i+1}, g_{i+1}, Δ_{i+1}) by the new tuple (v_{i+1}, g_i + g_{i+1}, Δ_{i+1}), and decrement s.
33
Example of COMPRESS and DELETE
S={(1, 1, 0), (10, 1, 0), (10, 1, 1), (10, 1, 2), (11, 1, 1), (12, 1, 0)}, s=6, n=6
Compress tuples (11, 1, 1) and (12, 1, 0) into a new tuple (12, 2, 0).
S={(1, 1, 0), (10, 1, 0), (10, 1, 1), (10, 1, 2), (12, 2, 0)}, s=5, n=6
Data stream (arrival order, t0 to t5): 12, 10, 11, 10, 1, 10 (ε = 0.25)
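The DELETE step in this example is a two-for-one replacement in the tuple list. A minimal sketch, with the summary held as a Python list of (v, g, Δ) tuples:

```python
def delete(summary, i):
    """Merge tuple i into its right neighbour: (v_{i+1}, g_i + g_{i+1}, delta_{i+1})."""
    _, g, _ = summary[i]
    v_next, g_next, delta_next = summary[i + 1]
    summary[i:i + 2] = [(v_next, g + g_next, delta_next)]

S = [(1, 1, 0), (10, 1, 0), (10, 1, 1), (10, 1, 2), (11, 1, 1), (12, 1, 0)]
delete(S, 4)   # remove (11, 1, 1)
print(S)       # [(1, 1, 0), (10, 1, 0), (10, 1, 1), (10, 1, 2), (12, 2, 0)]
```

The sum of the g values stays at 6, so rmin of the last tuple still equals n.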
34
Pseudo-Code for the whole algorithm
Initial State
  S ← ∅; s = 0; n = 0;
Algorithm
To add the (n+1)st element, v, to summary S(n):
  if (n ≡ 0 mod 1/(2ε)) then
    COMPRESS();
  end if
  INSERT(v);
  n = n+1;
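The loop above can be sketched end to end in Python. This is a simplified variant, not the full algorithm: COMPRESS here only merges a tuple into its right neighbour and ignores descendants in the tree, it never deletes the first tuple, and a value equal to the current maximum is treated as a new maximum. All function names are illustrative.

```python
import math

def band(delta, p):
    """Band index of delta for p = floor(2*eps*n); band 0 holds delta == p."""
    assert 0 <= delta <= p
    if delta == p:
        return 0
    alpha = 1
    while True:
        lower = p - 2**alpha - (p % 2**alpha)
        upper = p - 2**(alpha - 1) - (p % 2**(alpha - 1))
        if lower < delta <= upper:
            return alpha
        alpha += 1

def insert(summary, n, v, eps):
    """INSERT(v): new extremes get delta 0, others get floor(2*eps*n)."""
    if not summary or v < summary[0][0]:
        summary.insert(0, (v, 1, 0))
    elif v >= summary[-1][0]:
        summary.append((v, 1, 0))
    else:
        i = next(k for k, (vk, _, _) in enumerate(summary) if v < vk)
        summary.insert(i, (v, 1, math.floor(2 * eps * n)))

def compress(summary, n, eps):
    """Simplified COMPRESS: merge t_i into t_{i+1} when bands and capacity allow."""
    p = math.floor(2 * eps * n)
    for i in range(len(summary) - 2, 0, -1):   # index 0 (the minimum) is kept
        _, g, delta = summary[i]
        v_next, g_next, delta_next = summary[i + 1]
        if (band(delta, p) <= band(delta_next, p)
                and g + g_next + delta_next < 2 * eps * n):
            summary[i:i + 2] = [(v_next, g + g_next, delta_next)]

def build_summary(stream, eps):
    """Feed the stream one element at a time, compressing every 1/(2*eps) arrivals."""
    summary, n = [], 0
    period = max(1, round(1 / (2 * eps)))
    for v in stream:
        if n % period == 0:
            compress(summary, n, eps)
        insert(summary, n, v, eps)
        n += 1
    return summary

STREAM = [12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3]
print(build_summary(STREAM, 0.25))
# [(1, 1, 0), (2, 1, 7), (3, 1, 7), (4, 1, 6), (10, 6, 0), (12, 6, 0)]
```

On the 16-element example stream with ε = 0.25 this sketch ends in the same six-tuple summary used throughout the talk, although intermediate states are not guaranteed to match the full algorithm.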
35
A Complete Example ( )
S={(10, 1, 0), (12, 1, 0)}, n=2
S={(10, 1, 0), (10, 1, 1), (11, 1, 1), (12, 1, 0)}, n=4
S={(1, 1, 0), (10, 1, 0), (10, 1, 1), (10, 1, 2), (11, 1, 1), (12, 1, 0)}, n=6, s=6
Perform compress when t6 comes.
S={(1, 1, 0), (10, 1, 0), (10, 1, 1), (10, 1, 2), (12, 2, 0)}, n=6, s=5
Data stream (arrival order, t0 to t6): 12, 10, 11, 10, 1, 10, 11 (ε = 0.25)
36
A Complete Example ( )
S={(1, 1, 0), (9, 1, 3), (10, 1, 0), (10, 1, 1), (10, 1, 2), (11, 1, 3), (12, 2, 0)}, n=8, s=7
Perform compress when t8 comes.
S={(1, 1, 0), (10, 2, 0), (10, 1, 1), (10, 1, 2), (12, 3, 0)}, n=8, s=5
Data stream (arrival order, t0 to t8): 12, 10, 11, 10, 1, 10, 11, 9, 6 (ε = 0.25)
37
A Complete Example ( )
S={(1, 1, 0), (4, 1, 6), (5, 1, 6), (10, 5, 0), (12, 6, 0)}, n=14, s=5
Perform compress.
S={(1, 1, 0), (4, 1, 6), (10, 6, 0), (12, 6, 0)}, n=14, s=4
Finally S={(1, 1, 0), (2, 1, 7), (3, 1, 7), (4, 1, 6), (10, 6, 0), (12, 6, 0)}, n=16, s=6
Data stream (arrival order, t0 to t15): 12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3 (ε = 0.25)
38
Outline of this talk
Quantile Estimation Overview
GK-quantile Summary Algorithm
  Data Structure
  Operations
  Space Complexity Analysis
Sliding Window Model
39
Band Property
Observe that the number of bands and the number of elements in a band determine the space complexity.
PROPOSITION 2: At any point in time n and for any α ≥ 1, band_α(n) contains either 2^α or 2^(α−1) distinct values of Δ.
Since no more than 1/(2ε) elements with any given Δ are inserted, band_α is a summary of at most 2^α/(2ε) elements in the stream.
40
LEMMAs
LEMMA 3: At any time n and for any given α, there are at most 3/(2ε) nodes in T(n) that have a child with band value of α.
Only a small number of nodes can have a child with band α. See Proposition 3.
41
LEMMAs
A full pair of tuples (t_{i-1}, t_i): band(t_{i-1}) ≤ band(t_i). The tuple t_{i-1} is the left partner and t_i is the right partner in this full pair.
LEMMA 4: At any time n and for any given α, there are at most 4/ε tuples from band_α(n) that are right partners in a full tuple pair.
42
Full Pair Example
{(2,1,7), (3,1,7)} is a full pair.
{(1,1,0), (2,1,7)} is not a full pair. (2,1,7) can only be a left partner!
[Figure: tree T with root R over the tuples (1,1,0), (2,1,7), (3,1,7), (4,1,6), (10,6,0), (12,6,0)]
43
Space Efficiency
Any band_α(n) node either is a right partner of a full pair, or can only be a left partner.
By Proposition 3, a band_α(n) node that can only be a left partner occurs only once for every parent of nodes from band_α(n).
By Lemmas 3 and 4, the number of nodes in any band is bounded by 3/(2ε) + 4/ε = 11/(2ε).
44
Space Efficiency
The number of bands is at most ⌈log(2εn)⌉ + 1.
THEOREM: At any time n, the total number of tuples stored in S(n) is at most (11/(2ε)) log(2εn).
GK-algorithm's space complexity is O((1/ε) log(εN)).
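To get a feel for the theorem, the bound can be evaluated numerically. This sketch assumes the logarithm is base 2, and `gk_tuple_bound` is an illustrative name.

```python
import math

def gk_tuple_bound(eps, n):
    """Worst-case tuple count (11 / (2*eps)) * log2(2*eps*n) from the theorem."""
    return (11 / (2 * eps)) * math.log2(2 * eps * n)

print(gk_tuple_bound(0.25, 16))        # 66.0
print(gk_tuple_bound(0.01, 10**6))     # a few thousand tuples for a million elements
```

For the 16-element example the bound (66) exceeds the stream length, so it says nothing useful there. The bound becomes meaningful for long streams, where it grows only logarithmically in n.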
45
Outline of this talk
Quantile Estimation Overview
GK-quantile Summary Algorithm
  Data Structure
  Operations
  Space Complexity Analysis
Sliding Window Model
46
Sliding Window Model
Under the sliding window model, a summary is maintained for the most recently seen N data elements.
Eliminating exactly the out-dated elements requires a space of O(N).
Lin et al. (ICDE 2004) proposed a space-efficient one-pass summary algorithm for the sliding window model. Their underlying summary algorithm is the GK-algorithm.
47
n-of-N Model
A summary is maintained for the N most recently seen data elements. However, quantile queries can be issued against any n ≤ N. That is, for any φ ∈ (0,1] and any n ≤ N, we can return φ-quantiles among the n most recent elements in a data stream seen so far.
Lin et al. (ICDE 2004) proposed their one-pass summary algorithm, which combines the EH partitioning technique (Datar et al., ACM-SIAM 2002) with the GK-algorithm to solve the n-of-N model.
48
Example of n-of-N model
Assume the sliding window is 16 in an n-of-N model. A quantile query can be answered for any 1 ≤ n ≤ 16.
0.5-quantile returns 6 for n=12 and 3 for n=4.
FYI: The sorted order of the sequence is 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 11, 11, 11, 12.
Data stream (arrival order, t0 to t15): 12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3
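The two n-of-N answers can be checked with an exact reference that keeps the whole window in memory. This is only a brute-force check, not the algorithm of Lin et al., and `recent_quantile` is an illustrative name.

```python
import math

STREAM = [12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3]

def recent_quantile(stream, n, phi):
    """Exact phi-quantile (rank ceil(phi * n)) among the n most recent elements."""
    window = sorted(stream[-n:])
    return window[math.ceil(phi * n) - 1]

print(recent_quantile(STREAM, 12, 0.5))  # 6
print(recent_quantile(STREAM, 4, 0.5))   # 3
```

For n = 12 the window is t4 to t15 and the element of rank 6 is 6. For n = 4 the window is 4, 5, 2, 3 and the element of rank 2 is 3.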
49
Thank you!