![Page 1: Approximate Medians and other Quantiles in One Pass and with Limited Memory Researchers: G. Singh, S.Rajagopalan & B. Lindsey Lecturer: Eitan Ben Amos,](https://reader035.vdocuments.site/reader035/viewer/2022062719/56649eca5503460f94bd90a3/html5/thumbnails/1.jpg)
Approximate Medians and other Quantiles in One Pass and with Limited Memory
Researchers:
G. Singh, S.Rajagopalan & B. Lindsey
Lecturer:
Eitan Ben Amos, 2003
Lecture Structure
Problem Definition
A Deterministic Algorithm
Proof
Complexity analysis
Comparison to other algorithms
A randomized solution
Pros & cons of randomized solution
Problem Definition
Given a large data set of size N, design an algorithm that computes approximate quantiles (φ) in a single pass.
The approximation guarantee (ε) is an input.
The algorithm should be applicable to any distribution of values & any arrival order.
Multiple quantiles should be computed with no extra cost.
Memory requirements should be low.
Quantiles
Given a stream of N values, the φ-quantile, for φ ∈ [0,1], is the value located in position φ*N in the sorted input stream.
When φ=0.5 it is the median.
A value is an ε-approximate φ-quantile if its rank in the sorted input stream is between (φ-ε)*N and (φ+ε)*N.
There can be several values in this range.
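As a concrete reading of this definition, here is a minimal Python sketch. The function name `is_approx_quantile` and the convention of 1-based ranks (with any position of a duplicated value counting) are assumptions made for illustration; the slides do not fix them.

```python
def is_approx_quantile(data, value, phi, eps):
    """Return True if `value` is an eps-approximate phi-quantile of `data`.

    Assumption: ranks are 1-based positions in the sorted data, and when
    `value` occurs several times any of its positions may be used.
    """
    n = len(data)
    ordered = sorted(data)
    # All 1-based positions at which `value` occurs in sorted order.
    ranks = [i + 1 for i, x in enumerate(ordered) if x == value]
    return any((phi - eps) * n <= r <= (phi + eps) * n for r in ranks)


data = list(range(1, 101))  # values 1..100, so the rank of x is x itself
print(is_approx_quantile(data, 53, phi=0.5, eps=0.05))  # rank 53 is close enough to 50
print(is_approx_quantile(data, 60, phi=0.5, eps=0.05))  # rank 60 is too far from 50
```

Sorting the whole data set is exactly what the one-pass algorithm avoids; the sketch only serves as a checker for small examples.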
Database Applications
Used for query optimizations.
Used by parallel DB systems in order to split the inserted data among the servers into approximately equal parts.
Distributed parallel sorting uses quantiles to split the ranges between the machines.
Algorithm Framework
An algorithm is parameterized by 2 integers: b,k.
It will use b buffers, each storing k elements.
Memory usage is b*k + C.
Every buffer (x) is associated with a positive integer weight, w(x).
The weight denotes the number of input elements represented by an element in x.
Algorithm Framework (cont’d)
Buffers are labeled either “Empty” or “Full”.
Initially all buffers are “Empty”.
The values of b,k are calculated so that they enforce the approximation guarantee (ε) and minimize the memory requirement: b*k.
The algorithm must be able to process N elements.
Framework Basic Operations
(1) NEW
Takes an empty buffer as input.
Populates the buffer with the next k elements from the input stream.
Assigns the “Full” buffer a weight of 1.
If there are fewer than k elements left, an equal number of +∞ & -∞ elements are added to fill the buffer.
The input stream with the additional ±∞ elements is called the “augmented stream”.
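The NEW operation can be sketched as follows. This is an illustration only: the name `new_buffer`, the tuple layout `(elements, weight, n_real)`, keeping the buffer sorted, and sending the extra sentinel to +∞ when the shortfall is odd are all assumptions that the slides do not specify.

```python
import itertools
import math


def new_buffer(stream, k):
    """NEW: read up to k items from the iterator `stream` into a weight-1 buffer.

    Returns (sorted_elements, weight, n_real), or None if the stream is empty.
    Assumption: a short final buffer is padded with -inf and +inf in equal
    numbers; if the shortfall is odd, the extra sentinel is +inf.
    """
    items = list(itertools.islice(stream, k))
    if not items:
        return None
    pad = k - len(items)
    low = pad // 2
    padded = [-math.inf] * low + items + [math.inf] * (pad - low)
    # Keeping the buffer sorted makes later merging straightforward.
    return sorted(padded), 1, len(items)


stream = iter([7, 3, 9, 1, 4])
print(new_buffer(stream, 4))  # first buffer: four real elements
print(new_buffer(stream, 4))  # second buffer: one real element plus padding
```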
Quantile in augmented stream
Length of augmented stream is β*N, β>=1.
φ' = (2*φ+β-1)/(2*β)
The φ-quantile in the original stream is the φ'-quantile in the augmented stream.
Proof: (β-1)*N elements were added, ½ of which appear before the φ-quantile in the sorted stream.
φ'*β*N = φ*N+(β-1)*N/2 = (N/2)*(2*φ+β-1)
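A small numeric check of this relation, written as a sketch: the function name `phi_prime` and the example sizes (10 real values padded to 16) are made up for illustration.

```python
def phi_prime(phi, beta):
    """Quantile in the augmented stream of length beta*N (beta >= 1)."""
    return (2 * phi + beta - 1) / (2 * beta)


# Hypothetical example: N = 10 real values padded to 16 elements, median requested.
n, padded_len, phi = 10, 16, 0.5
beta = padded_len / n

# Position of the phi'-quantile in the augmented stream ...
position_augmented = phi_prime(phi, beta) * beta * n
# ... versus the original position phi*N shifted by the (beta-1)*N/2
# copies of -inf that sort in front of the real data.
position_shifted = phi * n + (beta - 1) * n / 2

print(position_augmented, position_shifted)
```

Both expressions describe the same position, which is the content of the proof on the slide.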
Basic Operations (Cont’d)
(2) COLLAPSE
Takes c ≥ 2 “Full” input buffers X1,…,Xc & outputs a buffer Y (all of size k).
All input buffers are marked “Empty”, output buffer Y is marked “Full”.
Weight of Y is the sum of weights of all input buffers: w(Y) = Σ w(Xi)
Collapsing Buffers
Make w(Xi) copies of each element in Xi.
Sort elements from all buffers together.
Elements of Y are k equally-spaced elements from the sorted elements.
w(Y) is odd ⇒ elements are in positions j*w(Y)+(w(Y)+1)/2, j=0,…,k-1
w(Y) is even ⇒ elements are in positions j*w(Y)+w(Y)/2 or j*w(Y)+(w(Y)+2)/2
Collapsing Buffers (Cont’d)
For 2 successive COLLAPSE operations with even w(Y), alternate between the two choices.
Define offset(Y)=(w(Y)+z)/2, z ∈ {0,1,2}.
Y has the elements in positions j*w(Y)+offset(Y).
Collapsing buffers does not require the creation of multiple copies of elements. A single scan of the elements in a manner similar to merge-sort will do.
[Figure: COLLAPSE example]
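The COLLAPSE operation described above can be sketched in Python without materializing the copies, using a single merge pass. The function name `collapse`, the flag `use_high_offset`, and the representation of a buffer as a `(sorted_elements, weight)` pair are assumptions for illustration; the caller is expected to alternate the flag between successive even-weight collapses.

```python
import heapq


def collapse(buffers, use_high_offset=False):
    """COLLAPSE sorted (elements, weight) buffers of equal size k into one.

    `use_high_offset` picks between the two legal offsets when the output
    weight is even: w/2 (False) or (w+2)/2 (True).
    """
    k = len(buffers[0][0])
    w = sum(weight for _, weight in buffers)
    if w % 2 == 1:
        offset = (w + 1) // 2
    else:
        offset = (w + 2) // 2 if use_high_offset else w // 2
    # Merge (value, weight) pairs in sorted order, like one merge-sort pass.
    merged = heapq.merge(*[[(x, weight) for x in elems]
                           for elems, weight in buffers])
    out, position, target = [], 0, offset
    for value, weight in merged:
        position += weight  # last 1-based position covered by copies of value
        while len(out) < k and target <= position:
            out.append(value)
            target += w  # next selected position: j*w + offset
    return out, w


a = ([1, 5, 9], 1)
b = ([2, 6, 10], 2)
# Virtual sorted sequence: 1 2 2 5 6 6 9 10 10; w = 3, offset = 2,
# so positions 2, 5 and 8 are selected.
print(collapse([a, b]))  # ([2, 6, 10], 3)
```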
Lemma 1
C = Number of COLLAPSE operations made by the algorithm.
W = Sum of weights of output buffers produced by these COLLAPSE operations.
Lemma: sum of offsets of all COLLAPSE operations is at least (W+C-1)/2
Proof
C=Codd+Ceven (Number of COLLAPSE operations with w(Y) being odd & even respectively).
Ceven= Ceven1+ Ceven2 (Number of COLLAPSE operations with offset(Y) being w(Y)/2 & (w(Y)+2)/2 respectively).
Sum of all offsets is (W+Codd+2Ceven2)/2
Proof (Cont’d)
Since COLLAPSE alternates between the 2 offset choices for even w(Y):
If Ceven1=Ceven2 ⇒ Ceven=2Ceven2
If Ceven1=Ceven2+1 ⇒ Ceven=Ceven2+1+Ceven2
In any case: Ceven2 ≥ (Ceven-1)/2
Sum-of-offsets ≥ (W+Codd+Ceven-1)/2 = (W+C-1)/2
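A quick randomized sanity check of Lemma 1, as a sketch. The helper name `offsets_for` is hypothetical, and the choice that the first even-weight COLLAPSE uses offset w/2 is an assumption (it is the case in which the bound is tight).

```python
import random


def offsets_for(weights):
    """Offsets chosen by successive COLLAPSE calls with these output weights."""
    use_high = False  # assumption: the first even-weight COLLAPSE takes w/2
    result = []
    for w in weights:
        if w % 2 == 1:
            result.append((w + 1) // 2)
        else:
            result.append((w + 2) // 2 if use_high else w // 2)
            use_high = not use_high  # alternate between the two even choices
    return result


random.seed(0)
for _ in range(1000):
    weights = [random.randint(2, 20) for _ in range(random.randint(1, 30))]
    W, C = sum(weights), len(weights)
    # Lemma 1: sum of offsets >= (W + C - 1) / 2
    assert 2 * sum(offsets_for(weights)) >= W + C - 1
print("Lemma 1 bound held on all random trials")
```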
Basic Operations (Cont’d)
(3) OUTPUT
OUTPUT is performed only once, just before termination.
Takes c ≥ 2 “Full” input buffers X1,…,Xc of size k.
Outputs a single element corresponding to the φ'-quantile of the augmented stream.
OUTPUT (Cont’d)
Makes w(Xi) copies of each element in buffer Xi, sorts all input buffers together.
Outputs the element in position φ'kW where W=w(X1)+…+w(Xc)
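A minimal sketch of OUTPUT that walks the weighted elements instead of copying them. The name `output`, the `(sorted_elements, weight)` buffer layout, 1-based positions, and rounding the target position up are assumptions for illustration.

```python
import math


def output(buffers, phi_prime):
    """OUTPUT: return the element at position ceil(phi_prime * k * W).

    `buffers` is a list of (sorted_elements, weight) pairs of equal size k.
    Assumption: positions are 1-based and the product is rounded up.
    """
    k = len(buffers[0][0])
    total_weight = sum(weight for _, weight in buffers)
    target = max(1, math.ceil(phi_prime * k * total_weight))
    weighted = sorted((x, weight) for elems, weight in buffers for x in elems)
    position = 0
    for value, weight in weighted:
        position += weight  # copies of value end at this position
        if position >= target:
            return value
    return weighted[-1][0]


# Two top buffers with k = 3 and weights 3 and 2, so W = 5.
tops = [([2, 6, 10], 3), ([3, 4, 8], 2)]
# Median of the augmented stream: target position ceil(0.5 * 3 * 5) = 8.
print(output(tops, 0.5))  # 6
```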
COLLAPSE policies
Different COLLAPSE policies mean different criteria for when to use the NEW/COLLAPSE operations.
– Munro & Paterson
– Alsabti, Ranka & Singh
– New Algorithm
Munro & Paterson
If there are empty buffers, invoke NEW. Otherwise, invoke COLLAPSE on 2 buffers having the same weight.
Following is an example of operations for b=6.
[Figure: Munro & Paterson, example of operations for b=6]
Alsabti, Ranka & Singh
Fill b/2 “Empty” buffers by invoking NEW & then invoke COLLAPSE on them.
Repeat this b/2 times.
Invoke OUTPUT on the b/2 resulting buffers.
Following is an example of operations for b=10.
[Figure: Alsabti, Ranka & Singh, example of operations for b=10]
New Algorithm
Associate with every buffer X an integer l(X) denoting its level.
Let l = minimum among all levels of currently “Full” buffers.
If there’s exactly one “Empty” buffer, invoke NEW & assign it level l.
If there are at least 2 “Empty” buffers, invoke NEW on each & assign them level 0.
New Algorithm (Cont’d)
If there are no “Empty” buffers, invoke COLLAPSE on the set of buffers with level l. Assign the output buffer level (l+1).
Following is an example of operations for b=5, h=4.
[Figure: New Algorithm, example of operations for b=5, h=4]
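The three rules of the new policy can be simulated on levels and weights alone, without any data. The sketch below is illustrative: the name `simulate_new_policy`, the stopping rule (all b buffers full at level h-2, just before OUTPUT forms the root of a tree of height h) and the requirement b ≥ 2 are assumptions. The closed forms it is compared against are the ones used in the approximation-bound part of the lecture.

```python
from math import comb


def simulate_new_policy(b, h):
    """Track (level, weight) of each full buffer under the new policy.

    Assumes b >= 2 and h >= 3. Stops when all b buffers are full at level
    h-2. Returns (L, C, W, w_max): leaves, number of COLLAPSE operations,
    sum of their output weights, and the heaviest top buffer.
    """
    full = []  # (level, weight) of every "Full" buffer
    leaves = collapses = weight_sum = 0
    while True:
        empty = b - len(full)
        if empty == 0:
            lowest = min(level for level, _ in full)
            if lowest == h - 2:
                break
            # COLLAPSE all buffers at the lowest level into one at lowest+1.
            group = [w for level, w in full if level == lowest]
            full = [(lv, w) for lv, w in full if lv != lowest]
            full.append((lowest + 1, sum(group)))
            collapses += 1
            weight_sum += sum(group)
        elif empty == 1:
            # Exactly one empty buffer: NEW, assigned the lowest full level.
            lowest = min(level for level, _ in full)
            full.append((lowest, 1))
            leaves += 1
        else:
            # At least two empty buffers: NEW on each, all at level 0.
            full.extend([(0, 1)] * empty)
            leaves += empty
    return leaves, collapses, weight_sum, max(w for _, w in full)


b, h = 5, 4
print(simulate_new_policy(b, h))
print((comb(b + h - 2, h - 1),
       comb(b + h - 3, h - 2) - 1,
       (h - 2) * comb(b + h - 2, h - 1) - comb(b + h - 3, h - 3),
       comb(b + h - 3, h - 2)))
# Both lines print (35, 14, 64, 15).
```

For b=5, h=4 the top buffers end up with weights 15, 10, 6, 3 and 1, which sum to the 35 leaves.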
Tree representation
Sequence of operations can be seen as a tree.
Vertices (except root) represent the set of all logical buffers (initial, intermediate, final).
Leaves correspond to initial buffers which are populated from the input stream by the NEW operation.
Tree representation (Cont’d)
An edge is drawn from every input buffer to its output buffer (created by COLLAPSE).
The root corresponds to the final OUTPUT operation.
The children of the root are the final buffers that are produced (by COLLAPSE operations). Broken edges are drawn toward the children of the root.
Definitions
User Specified:
– N: size of input stream
– φ: quantile to be computed
– ε: approximation guarantee
Others:
– b: number of buffers
– k: size of each buffer
– φ': quantile in the augmented stream
Definitions (Cont’d)
More Others:
– C: number of COLLAPSE operations
– W: sum of weights of the output buffers of all COLLAPSE operations
– wmax: weight of the heaviest COLLAPSE output buffer
– L: number of leaves in tree
– h: height of tree
Approximation Guarantees
We will prove the following:
The difference in rank between the true φ-quantile of the original dataset & the output of the algorithm is at most wmax+(W-C-1)/2
Lemma 2
Lemma: The sum of weights of the top buffers (the children of the root) is L, the number of leaves
Proof
Every buffer that is filled by NEW has a weight of 1.
COLLAPSE of buffers creates a buffer with a weight that is the sum of weights of input buffers.
Looking at the tree of operations, the weight of every node equals the sum of the weights of its children.
Applying this recursively from the leaves up to the top buffers, we can see that the weight of a top buffer is identical to the number of leaves in the sub-tree rooted at it.
Definitely Small/Large
Let Q be the output of the algorithm.
An element in the input stream is DS (DL), i.e. definitely small (large), if it is smaller (larger) than Q.
In order to identify all the DS(DL) elements we will start from the top buffers (children of root) and move towards the leaves.
Mark elements of top buffers as DS(DL) if they are smaller(larger) than Q.
Definitely Small/Large (Cont’d)
When going from a parent to its children, mark as DS(DL) all elements in the child buffers that are smaller(larger) than the DS(DL) elements in their parent.
Next we bound from below how many DS(DL) elements exist.
Weighted DS/DL bound
Weight of element is the weight of the buffer it is in.
Weighted DS(DL) adds w(X) for every element in buffer X that is DS(DL)
Let DStop(DLtop) denote the weighted sum of DS(DL) elements among the top buffers.
Lemma 3
φ'kL - wmax ≤ DStop ≤ φ'kL - 1
Right side: OUTPUT gives the element at position φ'kL of the weighted top buffers, so fewer than φ'kL weighted elements are smaller than it.
Lemma 3 (Cont’d)
Left side: Surrounding Q there are w(Xi)-1 elements that are copies of Q. If we had asked for a slightly different quantile, we would have got a different copy of Q as the output, although it would stand for a different element in the input stream. The error can be as large as w(Xi), which is bounded by wmax. Excluding these copies, all elements before the position of Q are DS for sure.
Lemma 3 (Cont’d)
kL - φ'kL - wmax + 1 ≤ DLtop ≤ kL - φ'kL
Right side: there are a total of kL elements in the augmented stream. Q is in position φ'kL. So there are kL - φ'kL elements after the position of Q, of which some might be copies of Q.
Lemma 3 (Cont’d)
Left side: there are kL - φ'kL elements after the position of Q. Of these, at most (wmax -1) are copies of Q (wmax including Q); all elements after them are DL.
Weighted DS
Consider node Y of the tree corresponding to a COLLAPSE operation.
Let Y have s > 0 DS elements.
Consider the largest element among these DS elements. It appears in position (s-1)*w(Y)+offset(Y) in the sorted sequence of elements of its children, with each element duplicated according to the weight of the buffer it originates from.
Weighted DS (Cont’d)
Therefore, the weighted sum of DS elements among the children of Y is at least (s-1)*w(Y) + offset(Y), which equals s*w(Y)-(w(Y)-offset(Y)).
Weighted DL
Similarly, let Y have l > 0 DL elements.
Consider the smallest element among these DL elements. Counting from the end of the stream towards its beginning, it appears in position (l-1)*w(Y) + [w(Y)-offset(Y)] in the sorted sequence of elements of its children, with each element duplicated according to the weight of the buffer it originates from.
Weighted DL (Cont’d)
The weighted sum of DL elements among the children of Y is at least (l-1)*w(Y) + [w(Y)-offset(Y)], which equals l*w(Y)-offset(Y). The lecture treats this like the DS case, as l*w(Y)-(w(Y)-offset(Y)); the two expressions differ by 2*offset(Y)-w(Y) = z, which is at most 2.
DS/DL Conclusion
The weighted sum of DS(DL) among the children of a node Y is smaller by at most w(Y)-offset(Y) than the weighted sum of DS(DL) elements in Y itself.
So we can count DS(DL) from the top buffers towards the leaves, subtracting w(Y)-offset(Y) for each COLLAPSE on the way.
How many leaves?
Let DSleaves (DLleaves) denote the number of definitely-small(large) elements among the leaf buffers of the operations tree.
Weight of a leaf is 1 ⇒ DSleaves (DLleaves) is, in fact, the number of definitely-small(large) elements in the augmented stream.
Lemma 4
DSleaves ≥ DStop - (W-C+1)/2
DLleaves ≥ DLtop - (W-C+1)/2
Lemma 4 – Proof
Starting at the top buffers, the initial weighted sum of DS(DL) elements is DStop (DLtop).
Each COLLAPSE that creates node Y diminishes the weighted sum by at most w(Y)-offset(Y).
Traveling down to the leaves we do this for all COLLAPSE operations.
Lemma 4 – Proof (Cont’d)
Σ w(Y) over all COLLAPSE operations is W.
Σ offset(Y) over all COLLAPSE operations is at least (W+C-1)/2 by Lemma 1.
So the total loss is at most W-(W+C-1)/2 = (W-C+1)/2; subtracting it from DStop (DLtop) yields Lemma 4.
Lemma 5
The difference in rank between the true φ-quantile of the original input stream & that of the output of the algorithm is at most (W-C-1)/2+wmax.
Lemma 5 - proof
Since there are L leaves each of size k, there are a total of k*L elements in the augmented input stream.
The true φ'-quantile of the augmented stream is in position φ'*k*L.
The output of the algorithm can be any element that is neither DS nor DL.
Lemma 5 – proof (Cont’d)
So the position of the output can be as small as DSleaves+1 or as large as k*L-DLleaves.
The difference between the true φ'-quantile and the output could be as large as φ'kL-DSleaves-1 or kL-DLleaves-φ'kL.
Substituting DSleaves from Lemma 4 we get:
φ'kL-DSleaves-1 ≤ φ'kL-DStop+(W-C+1)/2-1
Lemma 5 – proof (Cont’d)
Substituting φ'kL-DStop ≤ wmax from Lemma 3 we get:
φ'kL-DSleaves-1 ≤ wmax+(W-C+1)/2-1 = wmax+(W-C-1)/2
The same bound can be established for the quantity kL-DLleaves-φ'kL.
Approx. bound: Munro-Paterson
Requires 2 buffers at leaf level & one buffer at every other level, except the root.
Therefore height is at most b.
The original paper assumes there are exactly 2^(b-1) leaves & that the final OUTPUT operation takes 2 buffers of weight 2^(b-2) as inputs.
Approx. bound: Munro-Paterson
W=(b-2)*2^(b-1) since the total weight of the nodes at each level is 2^(b-1) & COLLAPSE counts all levels except leaves & root.
C=2^(b-1)-2 since a tree of height b-1 (ignoring leaves) has 2^(b-1)-1 nodes. Removing the root yields the proper value.
wmax=2^(b-2) since it is the entire tree under a child of the root.
Approx. bound: Munro-Paterson
Plugging these values into Lemma 5 yields:
(W-C-1)/2+wmax=(b-2)*2^(b-2)+1/2
This value has to be at most ε*N for the output to be an ε-approximate quantile.
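The intermediate algebra, spelled out with the Munro-Paterson values W=(b-2)*2^(b-1), C=2^(b-1)-2 and wmax=2^(b-2) stated above (a worked step that uses only those values):

```latex
\begin{aligned}
\frac{W - C - 1}{2} + w_{max}
  &= \frac{(b-2)\,2^{b-1} - \left(2^{b-1} - 2\right) - 1}{2} + 2^{b-2} \\
  &= (b-2)\,2^{b-2} - 2^{b-2} + \tfrac{1}{2} + 2^{b-2} \\
  &= (b-2)\,2^{b-2} + \tfrac{1}{2}
\end{aligned}
```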
Approx. bound: Alsabti-Ranka-Singh
b is assumed to be even (since b/2 is used).
C=b/2
W=(b/2)^2 since there are b/2 COLLAPSE operations, each with b/2 buffers of weight 1.
wmax=b/2 since all COLLAPSE are the same.
L=(b/2)^2 since the root has b/2 children with each having b/2 children.
Approx. bound: Alsabti-Ranka-Singh
Plugging these values into Lemma 5 yields:
(W-C-1)/2+wmax=[(b^2)/4-b/2-1]/2 + b/2 = (b^2)/8+b/4-1/2
This value has to be at most ε*N for the output to be an ε-approximate quantile.
Approx. bound: new Algorithm
The values W, C, wmax are a function of the height of the tree, denoted h, in addition to b.
The height of the tree is not restricted by b, unlike the previous schemes we saw.
Assume h ≥ 3 (so there is at least one level of COLLAPSE nodes besides the leaves & the root).
Approx. bound: new Algorithm
$$L = \binom{b+h-2}{h-1}$$
$$C = \binom{b+h-3}{h-2} - 1$$
$$W = (h-2)\binom{b+h-2}{h-1} - \binom{b+h-3}{h-3}$$
$$w_{max} = \binom{b+h-3}{h-2}$$
Approx. bound: new Algorithm
The bound of Lemma 5, (W-C-1)/2+wmax, has to be at most ε*N for the output to be an ε-approximate quantile. Plugging these values into Lemma 5 yields:
$$(h-2)\binom{b+h-2}{h-1} - \binom{b+h-3}{h-3} + \binom{b+h-3}{h-2} \le 2\varepsilon N$$
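One way to turn this constraint into concrete parameters is a small brute-force search, sketched below. The function name `new_algorithm_params` and the search limits `max_b` and `max_h` are arbitrary assumptions. The second constraint, k*L ≥ N, is not written on this slide; it follows from the requirement that the L leaf buffers of size k must hold all N input elements.

```python
from math import ceil, comb


def new_algorithm_params(n, eps, max_b=30, max_h=40):
    """Search small (b, h) for the pair that minimises b*k.

    Constraints:
      (h-2)*C(b+h-2,h-1) - C(b+h-3,h-3) + C(b+h-3,h-2) <= 2*eps*n
      k * C(b+h-2,h-1) >= n
    Returns (memory, b, h, k), or None if no pair within the limits fits.
    """
    best = None
    for b in range(2, max_b + 1):
        for h in range(3, max_h + 1):
            leaves = comb(b + h - 2, h - 1)
            lhs = ((h - 2) * leaves
                   - comb(b + h - 3, h - 3)
                   + comb(b + h - 3, h - 2))
            if lhs > 2 * eps * n:
                continue  # approximation guarantee would be violated
            k = ceil(n / leaves)  # smallest k with k * L >= n
            if best is None or b * k < best[0]:
                best = (b * k, b, h, k)
    return best


print(new_algorithm_params(10**6, 0.01))
```

The search is exhaustive over a small grid, so it is only meant to show how b, h and k interact, not how the parameters should be chosen in practice.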
Memory Usage Comparison
![Page 63: Approximate Medians and other Quantiles in One Pass and with Limited Memory Researchers: G. Singh, S.Rajagopalan & B. Lindsey Lecturer: Eitan Ben Amos,](https://reader035.vdocuments.site/reader035/viewer/2022062719/56649eca5503460f94bd90a3/html5/thumbnails/63.jpg)
Memory Usage (Cont’d)
Why does the curve of the Munro-Paterson algorithm have these kinks?
We optimize under 2 constraints: (b-2)*2^(b-2) + 1/2 <= ε*N ; k*2^(b-1) >= N
As N increases, k is increased until N reaches a threshold at which constraint 1 allows adding 1 to b. This diminishes k by half, thereby decreasing the memory usage by roughly half.
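The kink can be reproduced with a small search. This is a sketch under the two constraints stated above, (b-2)*2^(b-2) + 1/2 <= ε*N and k*2^(b-1) >= N, minimizing the memory b*k; the function name and the brute-force loop are illustrative assumptions, not the procedure used to draw the original chart.

```python
from math import ceil


def munro_paterson_memory(n: int, eps: float):
    """Return (memory, b, k) minimizing b*k under the two constraints.

    Returns None when no b >= 2 satisfies the first constraint.
    """
    best = None
    b = 2
    # Constraint 1 limits how large b may grow for a given eps * n.
    while (b - 2) * 2 ** (b - 2) + 0.5 <= eps * n:
        # Constraint 2 fixes the smallest admissible buffer size k.
        k = ceil(n / 2 ** (b - 1))
        if best is None or b * k < best[0]:
            best = (b * k, b, k)
        b += 1
    return best


if __name__ == "__main__":
    # Just below and just above the threshold where b can go from 9 to 10.
    print(munro_paterson_memory(204_800, 0.01))
    print(munro_paterson_memory(204_900, 0.01))
```

With ε = 0.01, moving from N = 204,800 to N = 204,900 lets b go from 9 to 10, and the memory drops from 7200 to 4010 elements.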
![Page 64: Approximate Medians and other Quantiles in One Pass and with Limited Memory Researchers: G. Singh, S.Rajagopalan & B. Lindsey Lecturer: Eitan Ben Amos,](https://reader035.vdocuments.site/reader035/viewer/2022062719/56649eca5503460f94bd90a3/html5/thumbnails/64.jpg)
Multiple Quantiles
During the analysis we did not assume that a single quantile is being requested.
Nor did we use the specific quantile until the last operation (OUTPUT) which selected a single element from the top buffers.
Conclusion: any algorithm of this framework can output multiple quantiles with the same cost (of memory) as computing a single quantile.
![Page 65: Approximate Medians and other Quantiles in One Pass and with Limited Memory Researchers: G. Singh, S.Rajagopalan & B. Lindsey Lecturer: Eitan Ben Amos,](https://reader035.vdocuments.site/reader035/viewer/2022062719/56649eca5503460f94bd90a3/html5/thumbnails/65.jpg)
Space Complexity
Best space complexity is achieved for b = h.
The ε-approximation constraint can be relaxed a little to get:
$$b\binom{2b}{b} \le 2\epsilon N$$
This means that b = h = O(log(εN))
![Page 66: Approximate Medians and other Quantiles in One Pass and with Limited Memory Researchers: G. Singh, S.Rajagopalan & B. Lindsey Lecturer: Eitan Ben Amos,](https://reader035.vdocuments.site/reader035/viewer/2022062719/56649eca5503460f94bd90a3/html5/thumbnails/66.jpg)
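Space Complexity (Cont’d)
The second constraint is k*L ≥ N. Replacing L with its value gives:
$$k\binom{b+h-2}{h-1} \ge N$$
For the largest b = h allowed by the first constraint, ε*N is within a constant factor of b*L, so this yields: k = (1/ε)*O(b) = (1/ε)*O(log(εN)) = O((1/ε)*log(εN))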
![Page 67: Approximate Medians and other Quantiles in One Pass and with Limited Memory Researchers: G. Singh, S.Rajagopalan & B. Lindsey Lecturer: Eitan Ben Amos,](https://reader035.vdocuments.site/reader035/viewer/2022062719/56649eca5503460f94bd90a3/html5/thumbnails/67.jpg)
Space Complexity (Cont’d)
The overall space complexity is b*k:
$$b \cdot k = O(\log(\epsilon N)) \cdot O\left(\frac{1}{\epsilon}\log(\epsilon N)\right) = O\left(\frac{1}{\epsilon}\log^2(\epsilon N)\right)$$
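To see concrete numbers behind the b*k figure, the sketch below searches over b and h for the smallest memory b*k such that the Lemma 5 bound for the new algorithm is at most ε*N and k*L >= N, with L = C(b+h-2, h-1). The search limits `max_b` and `max_h` and the function names are assumptions made for illustration; the slides do not prescribe a search procedure.

```python
from math import comb


def rank_error_bound(b: int, h: int) -> float:
    """Lemma 5 bound for the new algorithm (requires h >= 3)."""
    return ((h - 2) * comb(b + h - 2, h - 1)
            - comb(b + h - 3, h - 3)
            + comb(b + h - 3, h - 2)) / 2


def best_parameters(n: int, eps: float, max_b: int = 40, max_h: int = 40):
    """Return (memory, b, h, k) minimizing b*k, or None if nothing is feasible."""
    best = None
    for b in range(2, max_b + 1):
        for h in range(3, max_h + 1):
            if rank_error_bound(b, h) > eps * n:
                continue  # violates the eps-approximation constraint
            leaves = comb(b + h - 2, h - 1)
            k = -(-n // leaves)  # integer ceiling of n / leaves, so k * L >= n
            if best is None or b * k < best[0]:
                best = (b * k, b, h, k)
    return best


if __name__ == "__main__":
    print(best_parameters(10**6, 0.01))
```

The returned tuple always satisfies both constraints, and it can be compared against the logarithmic growth of b and the (1/ε)*log(εN) growth of k by varying N and ε.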
![Page 68: Approximate Medians and other Quantiles in One Pass and with Limited Memory Researchers: G. Singh, S.Rajagopalan & B. Lindsey Lecturer: Eitan Ben Amos,](https://reader035.vdocuments.site/reader035/viewer/2022062719/56649eca5503460f94bd90a3/html5/thumbnails/68.jpg)
Parallel Version
The new algorithm scales very well on parallel machines.
The input stream can be divided among the processors either statically (each one takes T values) or dynamically.
Parallelism is straightforward up to the point where the top buffers (children of the root) are produced; these are the input buffers for the OUTPUT operation.
![Page 69: Approximate Medians and other Quantiles in One Pass and with Limited Memory Researchers: G. Singh, S.Rajagopalan & B. Lindsey Lecturer: Eitan Ben Amos,](https://reader035.vdocuments.site/reader035/viewer/2022062719/56649eca5503460f94bd90a3/html5/thumbnails/69.jpg)
Sampling based Algorithm
The deterministic algorithm presented earlier, coupled with sampling, can reduce the memory requirements dramatically.
Interestingly, we will achieve a space bound that is independent of N.
We add a new input parameter, δ. The probability that the output is correct is required to be at least 1-δ.
![Page 70: Approximate Medians and other Quantiles in One Pass and with Limited Memory Researchers: G. Singh, S.Rajagopalan & B. Lindsey Lecturer: Eitan Ben Amos,](https://reader035.vdocuments.site/reader035/viewer/2022062719/56649eca5503460f94bd90a3/html5/thumbnails/70.jpg)
Hoeffding’s Inequality
Let X1, …, Xn be independent random variables with 0 ≤ Xi ≤ 1 for i = 1, …, n.
Let X = X1 + … + Xn.
Let E(X) denote the expected value of X.
Then, for any λ > 0 the following holds:
$$\Pr[X - E(X) \ge \lambda] \le \exp\left(-\frac{2\lambda^2}{n}\right)$$
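A quick numerical illustration of the inequality, Pr[X - E(X) ≥ λ] ≤ exp(-2λ²/n): the sketch below compares the exact upper tail of a sum of n fair coin tosses with the Hoeffding bound. The choice of fair coins, n = 100 and λ = 10 is an assumption made for illustration, and the function names are hypothetical.

```python
from math import comb, exp


def binomial_upper_tail(n: int, p: float, t: int) -> float:
    """Exact Pr[X >= t] for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(t, n + 1))


def hoeffding_bound(n: int, lam: float) -> float:
    """Upper bound exp(-2 * lam^2 / n) on Pr[X - E(X) >= lam]."""
    return exp(-2 * lam * lam / n)


n, p, lam = 100, 0.5, 10
exact = binomial_upper_tail(n, p, int(n * p + lam))  # Pr[X >= 60]
bound = hoeffding_bound(n, lam)                      # exp(-2)

if __name__ == "__main__":
    print(exact, bound)
```

The exact tail is well below the bound here, which is expected: Hoeffding's inequality holds for any bounded independent variables and is not tight for a specific distribution.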
![Page 71: Approximate Medians and other Quantiles in One Pass and with Limited Memory Researchers: G. Singh, S.Rajagopalan & B. Lindsey Lecturer: Eitan Ben Amos,](https://reader035.vdocuments.site/reader035/viewer/2022062719/56649eca5503460f94bd90a3/html5/thumbnails/71.jpg)
Lemma 7
Let ε = ε1+ε2
A total of
$$S \ge \frac{1}{2\epsilon_2^2}\log(2\delta^{-1})$$
samples drawn from a population of N elements are enough to guarantee, with probability at least 1-δ, that the set of elements between the pair of positions (φ±ε1)*S in the sorted sequence of samples is a subset of the set of elements between the pair of positions (φ±ε)*N in the sorted sequence of the N elements.
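The sample size of Lemma 7, S ≥ log(2/δ) / (2*ε2²), is easy to evaluate. The sketch below assumes the natural logarithm (consistent with the exp in Hoeffding's inequality) and uses a hypothetical function name; note that N does not appear anywhere.

```python
from math import ceil, exp, log


def sample_size(eps2: float, delta: float) -> int:
    """Smallest integer S with S >= log(2/delta) / (2 * eps2^2)."""
    if not (0 < eps2 < 1 and 0 < delta < 1):
        raise ValueError("eps2 and delta must lie strictly between 0 and 1")
    return ceil(log(2 / delta) / (2 * eps2 ** 2))


if __name__ == "__main__":
    s = sample_size(0.005, 1e-4)
    print(s)
    # Each tail probability exp(-2 * eps2^2 * S) is then at most delta / 2.
    print(exp(-2 * 0.005 ** 2 * s) <= 1e-4 / 2)
```

For ε2 = 0.005 and δ = 10^-4 this gives about 198,000 samples, regardless of how large N is.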
![Page 72: Approximate Medians and other Quantiles in One Pass and with Limited Memory Researchers: G. Singh, S.Rajagopalan & B. Lindsey Lecturer: Eitan Ben Amos,](https://reader035.vdocuments.site/reader035/viewer/2022062719/56649eca5503460f94bd90a3/html5/thumbnails/72.jpg)
Proof
We say a sample is “bad” if it does not satisfy the previously mentioned property; otherwise it is called “good”.
Let N- (N+) denote the elements preceding (succeeding) the φ-ε (φ+ε) quantile among the N elements.
A sample of size S is “bad” iff more than (φ-ε1)*S elements are drawn from N- or more than S-(φ+ε1)*S elements are drawn from N+.
![Page 73: Approximate Medians and other Quantiles in One Pass and with Limited Memory Researchers: G. Singh, S.Rajagopalan & B. Lindsey Lecturer: Eitan Ben Amos,](https://reader035.vdocuments.site/reader035/viewer/2022062719/56649eca5503460f94bd90a3/html5/thumbnails/73.jpg)
Proof (Cont’d)
The probability that more than (φ-ε1)*S elements are drawn from N- is bounded as follows.
The drawing of S elements from a population of N can be seen as S independent coin tosses, each succeeding with probability φ-ε.
The expected number of successful tosses is (φ-ε)*S.
![Page 74: Approximate Medians and other Quantiles in One Pass and with Limited Memory Researchers: G. Singh, S.Rajagopalan & B. Lindsey Lecturer: Eitan Ben Amos,](https://reader035.vdocuments.site/reader035/viewer/2022062719/56649eca5503460f94bd90a3/html5/thumbnails/74.jpg)
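Proof (Cont’d)
The probability that this occurs is:
$$\begin{aligned}
\Pr[X \ge (\phi-\epsilon_1)S] &= \Pr[X - E(X) \ge (\phi-\epsilon_1)S - (\phi-\epsilon)S] \\
&= \Pr[X - E(X) \ge (\epsilon-\epsilon_1)S] \\
&= \Pr[X - E(X) \ge \epsilon_2 S] \\
&\le \exp\left(-\frac{2(\epsilon_2 S)^2}{S}\right) = \exp(-2\epsilon_2^2 S) \le \frac{\delta}{2}
\quad\text{for}\quad S \ge \frac{1}{2\epsilon_2^2}\log(2\delta^{-1})
\end{aligned}$$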
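The slides bound only the N- case. A sketch of the remaining step, assuming the symmetric argument for N+ and a union bound over the two cases (the symbol X' for the number of samples drawn from N+ is introduced here for illustration): each draw lands in N+ with probability 1-(φ+ε), so E(X') = (1-φ-ε)*S, and the sample is bad on that side only if X' exceeds (1-φ-ε1)*S, again a deviation of ε2*S. With S ≥ log(2/δ)/(2*ε2²), each side fails with probability at most δ/2.

$$\Pr[\text{bad}] \le \Pr[X \ge (\phi-\epsilon_1)S] + \Pr[X' \ge (1-\phi-\epsilon_1)S] \le 2\exp(-2\epsilon_2^2 S) \le \frac{\delta}{2} + \frac{\delta}{2} = \delta$$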
![Page 75: Approximate Medians and other Quantiles in One Pass and with Limited Memory Researchers: G. Singh, S.Rajagopalan & B. Lindsey Lecturer: Eitan Ben Amos,](https://reader035.vdocuments.site/reader035/viewer/2022062719/56649eca5503460f94bd90a3/html5/thumbnails/75.jpg)
Values of ε1, ε2
When ε1 is close to ε (ε2 close to 0) the number of samples becomes very large.
When ε1 is close to 0 the approximation guarantee required from the deterministic algorithm becomes very tight.
In either case the memory requirement is high.
![Page 76: Approximate Medians and other Quantiles in One Pass and with Limited Memory Researchers: G. Singh, S.Rajagopalan & B. Lindsey Lecturer: Eitan Ben Amos,](https://reader035.vdocuments.site/reader035/viewer/2022062719/56649eca5503460f94bd90a3/html5/thumbnails/76.jpg)
Values of ε1, ε2 (Cont’d)
We need to optimize ε1, ε2 to reduce memory usage.
The theoretical complexity can be determined by setting ε1 = ε2 = 0.5*ε.
Then S becomes:
$$S = O(\epsilon^{-2}\log(\delta^{-1}))$$
The new algorithm’s space complexity is:
$$O(\epsilon^{-1}\log^2(\epsilon N))$$
![Page 77: Approximate Medians and other Quantiles in One Pass and with Limited Memory Researchers: G. Singh, S.Rajagopalan & B. Lindsey Lecturer: Eitan Ben Amos,](https://reader035.vdocuments.site/reader035/viewer/2022062719/56649eca5503460f94bd90a3/html5/thumbnails/77.jpg)
Values of ε1, ε2 (Cont’d)
The space required to run the new algorithm on the samples is:
$$O(\epsilon^{-1}\log^2[\epsilon^{-1}\log(\delta^{-1})])$$
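This bound can be traced with a short substitution, sketched here under the choice ε1 = ε2 = ε/2. The deterministic algorithm sees only the S samples, so in its space bound O((1/ε)*log²(εN)) the stream length N is replaced by S = O(ε⁻² log(δ⁻¹)) and the guarantee ε by ε1:

$$\epsilon_1 S = \frac{\epsilon}{2}\cdot O(\epsilon^{-2}\log(\delta^{-1})) = O(\epsilon^{-1}\log(\delta^{-1}))$$

$$O(\epsilon_1^{-1}\log^2(\epsilon_1 S)) = O(\epsilon^{-1}\log^2[\epsilon^{-1}\log(\delta^{-1})])$$

N no longer appears, which is why the space bound of the sampling based algorithm is independent of the stream length.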
![Page 78: Approximate Medians and other Quantiles in One Pass and with Limited Memory Researchers: G. Singh, S.Rajagopalan & B. Lindsey Lecturer: Eitan Ben Amos,](https://reader035.vdocuments.site/reader035/viewer/2022062719/56649eca5503460f94bd90a3/html5/thumbnails/78.jpg)
Multiple Quantiles
We want p different quantiles, each with error bound ε & confidence of 1-δ.
Let ε = ε1+ε2 and let
$$S \ge \frac{1}{2\epsilon_2^2}\log(2p\delta^{-1})$$
We choose S samples & feed them all to the deterministic algorithm, which is ε1-approximate.
Read p quantiles from the output buffers.
![Page 79: Approximate Medians and other Quantiles in One Pass and with Limited Memory Researchers: G. Singh, S.Rajagopalan & B. Lindsey Lecturer: Eitan Ben Amos,](https://reader035.vdocuments.site/reader035/viewer/2022062719/56649eca5503460f94bd90a3/html5/thumbnails/79.jpg)
Multiple Quantiles (Cont’d)
All quantiles are guaranteed with probability >= 1-δ to be ε-approximate.
Using Lemma 7 & substituting δ with δ’ = δ/p we compute the number of samples.
The probability that one given quantile is not ε-approximate is at most δ’ = δ/p.
The probability that any quantile isn’t ε-approximate is therefore at most p*δ’, which is δ.
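The effect of p on the sample size can be made concrete. The sketch below evaluates S ≥ log(2p/δ) / (2*ε2²), i.e. Lemma 7 with δ replaced by δ/p; the natural logarithm and the function name are assumptions made for illustration.

```python
from math import ceil, log


def sample_size_multi(eps2: float, delta: float, p: int) -> int:
    """Smallest integer S with S >= log(2p/delta) / (2 * eps2^2)."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return ceil(log(2 * p / delta) / (2 * eps2 ** 2))


if __name__ == "__main__":
    for p in (1, 10, 100):
        print(p, sample_size_multi(0.005, 1e-4, p))
```

The sample size grows only logarithmically in p: going from 1 to 100 quantiles with ε2 = 0.005 and δ = 10^-4 adds log(100) / (2*ε2²), roughly 92,000 samples, to the roughly 198,000 needed for a single quantile.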
![Page 80: Approximate Medians and other Quantiles in One Pass and with Limited Memory Researchers: G. Singh, S.Rajagopalan & B. Lindsey Lecturer: Eitan Ben Amos,](https://reader035.vdocuments.site/reader035/viewer/2022062719/56649eca5503460f94bd90a3/html5/thumbnails/80.jpg)
Pros & Cons
(Pros) The randomized algorithm has a space complexity that is independent of N.
(Cons) When computing multiple quantiles, the deterministic algorithm is unchanged. The randomized algorithm, however, does require a larger sample as the number of quantiles increases.