fast, small-space algorithms for approximate histogram maintenance (on a stream)
DESCRIPTION
Fast, Small-Space Algorithms for Approximate Histogram Maintenance (on a Stream). A. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan, M. Strauss. A data stream: data items/updates arrive one at a time; small storage, no random access to data unless stored.
![Page 1: Fast, Small-Space Algorithms for Approximate Histogram Maintenance (on a Stream)](https://reader036.vdocuments.site/reader036/viewer/2022062811/56816008550346895dcf0879/html5/thumbnails/1.jpg)
Fast, Small-Space Algorithms for Approximate Histogram Maintenance (on a Stream)
A. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan, M. Strauss
![Page 2: Fast, Small-Space Algorithms for Approximate Histogram Maintenance (on a Stream)](https://reader036.vdocuments.site/reader036/viewer/2022062811/56816008550346895dcf0879/html5/thumbnails/2.jpg)
A data stream
Data items/updates arrive one at a time
Small storage, no random access to data unless stored
![Page 3: Fast, Small-Space Algorithms for Approximate Histogram Maintenance (on a Stream)](https://reader036.vdocuments.site/reader036/viewer/2022062811/56816008550346895dcf0879/html5/thumbnails/3.jpg)
Dimensionality reduction
Johnson-Lindenstrauss Lemma:
x is an n-dimensional vector
A is a random k × n matrix, each entry drawn independently from e.g. a Gaussian distribution, with $k = O(\log N / \epsilon^2)$
Then with probability 1-1/N:
$\|x\|^2 \le \|Ax\|^2 \le (1+\epsilon)\,\|x\|^2$
A can be pseudo-random
![Page 4: Fast, Small-Space Algorithms for Approximate Histogram Maintenance (on a Stream)](https://reader036.vdocuments.site/reader036/viewer/2022062811/56816008550346895dcf0879/html5/thumbnails/4.jpg)
What it means
Can maintain the sketch Ax of x when the coordinates are incremented:
A(x+b) = Ax + Ab
Can maintain approximate 2-norm of x
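The linearity on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `LinearSketch` and its methods are hypothetical, the matrix is stored explicitly (the slides note it can instead be pseudo-random), and entries are scaled to variance 1/k so that the squared norm of the sketch estimates the squared norm of x.

```python
import math
import random


class LinearSketch:
    """Maintains s = A x for a k-by-n Gaussian matrix A without storing x."""

    def __init__(self, n, k, seed=0):
        rng = random.Random(seed)
        # Assumption: entries are N(0, 1/k), so E[||Ax||^2] = ||x||^2.
        # Storing A explicitly costs O(nk) space; it is done here only for clarity.
        scale = 1.0 / math.sqrt(k)
        self.columns = [[rng.gauss(0.0, 1.0) * scale for _ in range(k)]
                        for _ in range(n)]
        self.sketch = [0.0] * k

    def update(self, i, delta):
        """Apply x[i] += delta via linearity: A(x + delta*e_i) = Ax + delta*A e_i."""
        for j, a in enumerate(self.columns[i]):
            self.sketch[j] += delta * a

    def norm_squared_estimate(self):
        """Return ||Ax||^2, an estimate of ||x||^2."""
        return sum(v * v for v in self.sketch)


# Usage: feed a stream of (index, increment) updates.
sk = LinearSketch(n=50, k=400, seed=1)
sk.update(3, 2.0)
sk.update(7, -1.0)
estimate = sk.norm_squared_estimate()  # close to 2^2 + 1^2 = 5 with high probability
```

Each update touches only one column of A, so the cost per stream item is O(k) regardless of n.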
![Page 5: Fast, Small-Space Algorithms for Approximate Histogram Maintenance (on a Stream)](https://reader036.vdocuments.site/reader036/viewer/2022062811/56816008550346895dcf0879/html5/thumbnails/5.jpg)
Histograms
View x as a function x: [1…n] -> [1…M]
Approximate it using a piecewise constant function h with B pieces (buckets)
![Page 6: Fast, Small-Space Algorithms for Approximate Histogram Maintenance (on a Stream)](https://reader036.vdocuments.site/reader036/viewer/2022062811/56816008550346895dcf0879/html5/thumbnails/6.jpg)
Example app in DB
Find all Indians worth $200K - $300K
Either: 1. Select on country, 2. Select on worth
Or: 1. Select on worth, 2. Select on country
![Page 7: Fast, Small-Space Algorithms for Approximate Histogram Maintenance (on a Stream)](https://reader036.vdocuments.site/reader036/viewer/2022062811/56816008550346895dcf0879/html5/thumbnails/7.jpg)
Example app continued
![Page 8: Fast, Small-Space Algorithms for Approximate Histogram Maintenance (on a Stream)](https://reader036.vdocuments.site/reader036/viewer/2022062811/56816008550346895dcf0879/html5/thumbnails/8.jpg)
Our goal
Want to maintain the best B-bucket representation of x, under changes of x
Measure the error using 2-norm (1-norm also OK)
![Page 9: Fast, Small-Space Algorithms for Approximate Histogram Maintenance (on a Stream)](https://reader036.vdocuments.site/reader036/viewer/2022062811/56816008550346895dcf0879/html5/thumbnails/9.jpg)
Our Approach
Maintain sketches Ax of x
Using Ax, construct a B-histogram h which approximately minimizes ||x-h||
![Page 10: Fast, Small-Space Algorithms for Approximate Histogram Maintenance (on a Stream)](https://reader036.vdocuments.site/reader036/viewer/2022062811/56816008550346895dcf0879/html5/thumbnails/10.jpg)
Our result
Can maintain a B-histogram h which minimizes ||x-h|| up to a factor of $(1+\epsilon)$, using $\mathrm{poly}(\log n, B, 1/\epsilon)$ time/space, with probability 1-1/poly(n)
![Page 11: Fast, Small-Space Algorithms for Approximate Histogram Maintenance (on a Stream)](https://reader036.vdocuments.site/reader036/viewer/2022062811/56816008550346895dcf0879/html5/thumbnails/11.jpg)
Proof: by iterated improvement
B buckets, $> n^B$ construction time
$B \log n$ buckets, $n^3$ construction time
$B \log^2 n$ buckets, $n^2$ construction time
$B \log^2 n$ buckets, $n \cdot \mathrm{poly}(B + \log n)$ time
$B \log^{O(1)} n$ buckets, $\mathrm{poly}(B + \log n)$ time
B buckets, $\mathrm{poly}(B + \log n)$ time
![Page 12: Fast, Small-Space Algorithms for Approximate Histogram Maintenance (on a Stream)](https://reader036.vdocuments.site/reader036/viewer/2022062811/56816008550346895dcf0879/html5/thumbnails/12.jpg)
Exponential time approach
There are at most $(Mn^2)^B$ functions h
By the JL lemma, can reduce the dimension to $O(B \log n)$ and approximately preserve ||x-h|| for all h
To reconstruct h, minimize ||Ax-Ah||
Can be trivially done by enumerating all h's
![Page 13: Fast, Small-Space Algorithms for Approximate Histogram Maintenance (on a Stream)](https://reader036.vdocuments.site/reader036/viewer/2022062811/56816008550346895dcf0879/html5/thumbnails/13.jpg)
Greedy approach
Start from h = 0
Let $\chi_I$ be the characteristic function over interval I
Find c and I minimizing $\|Ax - A(h + c\,\chi_I)\|_2$
Set $h = h + c\,\chi_I$ & repeat
![Page 14: Fast, Small-Space Algorithms for Approximate Histogram Maintenance (on a Stream)](https://reader036.vdocuments.site/reader036/viewer/2022062811/56816008550346895dcf0879/html5/thumbnails/14.jpg)
Details
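The square of $\|Ax - A(h + c\,\chi_I)\|_2$ is a quadratic function of c
Once we compute the parameters of this function, e.g. $E(c) = Ac^2 + Bc + D$ (scalar coefficients here, not the sketch matrix A or the bucket count B),
the minimum is achieved for $c = -B/(2A)$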
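As a worked version of this step, the coefficients can be written out by expanding the squared norm. The residual symbol $r = x - h$ is introduced here for brevity and does not appear on the slide; $\chi_I$ is the characteristic function of the interval I.

$$
E(c) = \|Ar - c\,A\chi_I\|_2^2
     = \|A\chi_I\|_2^2\,c^2 \;-\; 2\,\langle Ar,\, A\chi_I\rangle\,c \;+\; \|Ar\|_2^2
$$

Setting $E'(c) = 0$ gives the minimizer and the resulting error:

$$
c^* = \frac{\langle Ar,\, A\chi_I\rangle}{\|A\chi_I\|_2^2},
\qquad
E(c^*) = \|Ar\|_2^2 - \frac{\langle Ar,\, A\chi_I\rangle^2}{\|A\chi_I\|_2^2}
$$

So the minimizer is minus the linear coefficient divided by twice the quadratic coefficient, and only the two sketches $Ar$ and $A\chi_I$ are needed to evaluate it.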
![Page 15: Fast, Small-Space Algorithms for Approximate Histogram Maintenance (on a Stream)](https://reader036.vdocuments.site/reader036/viewer/2022062811/56816008550346895dcf0879/html5/thumbnails/15.jpg)
Example
![Page 16: Fast, Small-Space Algorithms for Approximate Histogram Maintenance (on a Stream)](https://reader036.vdocuments.site/reader036/viewer/2022062811/56816008550346895dcf0879/html5/thumbnails/16.jpg)
How does it help
$O(n^2)$ intervals
$O(n)$ time to find best c minimizing $\|Ax - A(h + c\,\chi_I)\|_2$
Overall: $O(n^3)$ time, $O(k \log(nM))$ intervals
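The greedy loop over all $O(n^2)$ intervals can be sketched as follows. This is a simplified illustration, not the authors' implementation: it works on x directly (as if the sketch preserved norms exactly) instead of on Ax, it uses 0-based half-open intervals, and it evaluates each interval in O(1) with prefix sums rather than the O(n) quoted on the slide. The function names are hypothetical.

```python
def squared_error(x, h):
    """Return ||x - h||^2."""
    return sum((a - b) ** 2 for a, b in zip(x, h))


def greedy_histogram(x, steps):
    """Run `steps` greedy rounds; each round adds c * chi_I for the best (I, c)."""
    n = len(x)
    h = [0.0] * n
    for _ in range(steps):
        residual = [a - b for a, b in zip(x, h)]
        prefix = [0.0]
        for v in residual:
            prefix.append(prefix[-1] + v)
        best_gain, best_move = 0.0, None
        for lo in range(n):
            for hi in range(lo + 1, n + 1):
                s = prefix[hi] - prefix[lo]
                # Best c on [lo, hi) is the mean residual; the squared error
                # then drops by s^2 / length.
                gain = s * s / (hi - lo)
                if gain > best_gain:
                    best_gain, best_move = gain, (lo, hi, s / (hi - lo))
        if best_move is None:
            break  # residual sums to zero on every interval; nothing to gain
        lo, hi, c = best_move
        for i in range(lo, hi):
            h[i] += c
    return h


x = [2.0, 2.0, 2.0, 7.0, 7.0, 1.0, 1.0, 1.0]
h = greedy_histogram(x, steps=4)
err = squared_error(x, h)  # no larger than the error after fewer steps
```

Because every round picks an interval with positive gain, the squared error never increases from one round to the next.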
![Page 17: Fast, Small-Space Algorithms for Approximate Histogram Maintenance (on a Stream)](https://reader036.vdocuments.site/reader036/viewer/2022062811/56816008550346895dcf0879/html5/thumbnails/17.jpg)
Approximation factor
Assume for simplicity […]
Let h* be the optimal k-histogram
If we replaced the current histogram h by all k intervals of h* (with proper values c), we would reduce the squared error from $\|x-h\|^2$ to $\|x-h^*\|^2$
Thus, there is an interval I of h* (and c) such that
$\|x-h\|^2 - \|x-(h + c\,\chi_I)\|^2 \ge \frac{1}{k}\left(\|x-h\|^2 - \|x-h^*\|^2\right)$
$O(k \log(nM^2))$ intervals are enough to reduce the error to about $\|x-h^*\|^2$
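The interval count follows from the per-step bound by a short calculation, taking that bound as given. The symbols $h_t$ (the histogram after t greedy steps), $E_t$ and $\delta$ are introduced here for the derivation and are not on the slide.

$$
E_t = \|x-h_t\|^2 - \|x-h^*\|^2,
\qquad
E_{t+1} \le \left(1 - \tfrac{1}{k}\right) E_t
$$

Since the greedy procedure starts from $h_0 = 0$ and every entry of x is at most M, we have $E_0 \le \|x\|^2 \le nM^2$, so

$$
E_t \le \left(1 - \tfrac{1}{k}\right)^{t} E_0 \le e^{-t/k}\, nM^2 .
$$

Hence $t \ge k \ln(nM^2/\delta)$ steps suffice to bring the excess error $E_t$ below any target $\delta > 0$, which is $O(k \log(nM^2))$ for constant $\delta$.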
![Page 18: Fast, Small-Space Algorithms for Approximate Histogram Maintenance (on a Stream)](https://reader036.vdocuments.site/reader036/viewer/2022062811/56816008550346895dcf0879/html5/thumbnails/18.jpg)
Dyadic intervals
Each interval can be decomposed into log n dyadic intervals, e.g. [1,1], [2,2], …, [1,2], …, [1,4]
We can assume opt h is defined by B log n dyadic intervals
The number of dyadic intervals is n log n
Reduces the time to $n^2 \log n$
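The decomposition into dyadic intervals can be illustrated with a short routine. This is a sketch under assumptions of its own: it uses 0-based half-open ranges `[lo, hi)` rather than the 1-based closed intervals on the slide, and the function name is hypothetical. A dyadic block here is a range `[j * 2**k, (j + 1) * 2**k)`.

```python
def dyadic_decomposition(lo, hi):
    """Split [lo, hi) into maximal dyadic blocks, left to right."""
    parts = []
    while lo < hi:
        if lo > 0:
            size = lo & -lo  # largest power of two dividing lo
        else:
            size = 1 << (hi - lo).bit_length()  # any power of two is aligned at 0
        while size > hi - lo:
            size >>= 1  # shrink until the block fits in what remains
        parts.append((lo, lo + size))
        lo += size
    return parts


print(dyadic_decomposition(3, 11))  # -> [(3, 4), (4, 8), (8, 10), (10, 11)]
```

The number of blocks grows logarithmically with the range length, which is what lets a B-bucket histogram be expressed with roughly B log n dyadic pieces.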
![Page 19: Fast, Small-Space Algorithms for Approximate Histogram Maintenance (on a Stream)](https://reader036.vdocuments.site/reader036/viewer/2022062811/56816008550346895dcf0879/html5/thumbnails/19.jpg)
Range summability
Recall $\|Ax - A(h + c\,\chi_I)\|_2$
Need to compute $A\chi_I$, i.e., a range sum of random variables
Goal: time polylog n
![Page 20: Fast, Small-Space Algorithms for Approximate Histogram Maintenance (on a Stream)](https://reader036.vdocuments.site/reader036/viewer/2022062811/56816008550346895dcf0879/html5/thumbnails/20.jpg)
Naor & Reingold construction
Method:
Generate sum of a1, a2, …, an
Generate sum of left half, conditioned on the total sum
Recurse
Conditional distributions are explicit
The generation can be simulated by Nisan's PRG
Result: reduces the time to n polylog n
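The top-down method can be sketched for standard Gaussian entries, where the conditional distribution is explicit: if S is the sum of m i.i.d. N(0,1) variables, the sum of the left half given S is N(S/2, m/4). The sketch below is an assumption-laden illustration: it replaces Nisan's PRG with a hash-seeded `random.Random` per tree node (so no independence guarantee is claimed), requires n to be a power of two, and uses hypothetical function names.

```python
import math
import random


def _noise(seed, lo, hi):
    """Deterministic N(0,1) draw attached to the tree node [lo, hi)."""
    return random.Random(f"{seed}:{lo}:{hi}").gauss(0.0, 1.0)


def _descend(seed, lo, hi, s, a, b):
    """Return the part of node [lo, hi), whose total is s, that lies in [a, b)."""
    if b <= lo or hi <= a:
        return 0.0
    if a <= lo and hi <= b:
        return s
    mid = (lo + hi) // 2
    # Left-half sum conditioned on the node total s: N(s/2, (hi - lo)/4).
    left = s / 2.0 + math.sqrt((hi - lo) / 4.0) * _noise(seed, lo, mid)
    return (_descend(seed, lo, mid, left, a, b)
            + _descend(seed, mid, hi, s - left, a, b))


def range_sum(seed, n, a, b):
    """Sum of a_i for i in [a, b), visiting O(log n) tree nodes."""
    if not 0 <= a <= b <= n:
        raise ValueError("need 0 <= a <= b <= n")
    total = math.sqrt(n) * _noise(seed, 0, n)  # sum of all n entries: N(0, n)
    return _descend(seed, 0, n, total, a, b)


s = range_sum(seed=7, n=16, a=3, b=11)  # one entry of A chi_I for I = [3, 11)
```

Because every node's value depends only on the seed and the node, any two queries see the same underlying a_1, …, a_n, so range sums over adjacent ranges add up consistently.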
![Page 21: Fast, Small-Space Algorithms for Approximate Histogram Maintenance (on a Stream)](https://reader036.vdocuments.site/reader036/viewer/2022062811/56816008550346895dcf0879/html5/thumbnails/21.jpg)
Fast selection of good intervals
Find which (dyadic) intervals to add in polylog n time
Consider intervals of length 1
Need to find a "spike" in h-x (if one exists)
Assume only one spike
![Page 22: Fast, Small-Space Algorithms for Approximate Histogram Maintenance (on a Stream)](https://reader036.vdocuments.site/reader036/viewer/2022062811/56816008550346895dcf0879/html5/thumbnails/22.jpg)
Chasing Bits
Non-adaptive binary search
Essentially, we compose the signal with a filter
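One way to picture non-adaptive binary search for a single spike: fix, in advance, one linear measurement per bit of the index (the sum of all entries whose index has that bit set), then read the spike's position off bit by bit. The sketch below is an assumption for illustration only: it measures plain sums of an explicit residual vector, whereas in the streaming setting each measurement would itself be maintained as a sketch, and the function name is hypothetical.

```python
def locate_spike(r):
    """Return the index of the single dominant entry of r (len(r) a power of two)."""
    n = len(r)
    bits = (n - 1).bit_length()
    total = sum(r)
    index = 0
    for j in range(bits):
        # Filtered measurement: keep only entries whose index has bit j set.
        masked = sum(v for i, v in enumerate(r) if (i >> j) & 1)
        # The spike lies on whichever side carries more mass.
        if abs(masked) > abs(total - masked):
            index |= 1 << j
    return index


r = [0.01] * 16
r[11] = 5.0
print(locate_spike(r))  # -> 11
```

The masks do not depend on earlier answers, which is what makes the search non-adaptive: all log n measurements can be maintained side by side as the stream is read.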
![Page 23: Fast, Small-Space Algorithms for Approximate Histogram Maintenance (on a Stream)](https://reader036.vdocuments.site/reader036/viewer/2022062811/56816008550346895dcf0879/html5/thumbnails/23.jpg)
More spikes
There are few large spikes
Permute coordinates using a pair-wise independent permutation
Likely that each interval contains only one spike
Caveat: how does it work with the range summability?
Result: reduces the time to polylog n
![Page 24: Fast, Small-Space Algorithms for Approximate Histogram Maintenance (on a Stream)](https://reader036.vdocuments.site/reader036/viewer/2022062811/56816008550346895dcf0879/html5/thumbnails/24.jpg)
Where are we
We managed to reduce the time to polylog n
However, the number of buckets is B polylog n
Need to reduce the number of buckets to B
![Page 25: Fast, Small-Space Algorithms for Approximate Histogram Maintenance (on a Stream)](https://reader036.vdocuments.site/reader036/viewer/2022062811/56816008550346895dcf0879/html5/thumbnails/25.jpg)
Getting rid of the buckets
B buckets, but O(1)-approximation:
Compute h with B polylog n buckets
Find h' with B buckets closest to h
An off-line problem
Can be done approximately using dynamic programming
Factor O(1) by triangle inequality
Factor $(1+\epsilon)$ is a mess (esp. for 1-norm)
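The off-line step can be illustrated with the textbook dynamic program for the best B-bucket histogram under squared error. This is an exact O(n²B) variant run on an explicit array, shown as an assumption for illustration; the slides only say the problem can be solved approximately by dynamic programming, and the function name is hypothetical. Buckets are 0-based half-open ranges and B must not exceed the array length.

```python
def best_b_histogram(h, B):
    """Return (squared error, buckets) of the best B-bucket fit to the list h."""
    n = len(h)
    p1, p2 = [0.0], [0.0]  # prefix sums of values and of squared values
    for v in h:
        p1.append(p1[-1] + v)
        p2.append(p2[-1] + v * v)

    def cost(lo, hi):
        """Squared error of the best constant (the mean) on [lo, hi)."""
        s = p1[hi] - p1[lo]
        return (p2[hi] - p2[lo]) - s * s / (hi - lo)

    inf = float("inf")
    # err[b][i]: least error covering the first i values with exactly b buckets.
    err = [[inf] * (n + 1) for _ in range(B + 1)]
    cut = [[0] * (n + 1) for _ in range(B + 1)]
    err[0][0] = 0.0
    for b in range(1, B + 1):
        for i in range(b, n + 1):
            for j in range(b - 1, i):
                e = err[b - 1][j] + cost(j, i)
                if e < err[b][i]:
                    err[b][i], cut[b][i] = e, j

    # Walk the stored cut points backwards to recover the buckets.
    buckets, i = [], n
    for b in range(B, 0, -1):
        j = cut[b][i]
        buckets.append((j, i, (p1[i] - p1[j]) / (i - j)))
        i = j
    buckets.reverse()
    return err[B][n], buckets


error, buckets = best_b_histogram([1, 1, 1, 5, 5, 5, 9, 9], B=3)
# buckets -> [(0, 3, 1.0), (3, 6, 5.0), (6, 8, 9.0)], error -> 0.0
```

When the input is itself a histogram with B polylog n buckets, the candidate cut points can be restricted to its bucket boundaries, which keeps this step independent of n.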
![Page 26: Fast, Small-Space Algorithms for Approximate Histogram Maintenance (on a Stream)](https://reader036.vdocuments.site/reader036/viewer/2022062811/56816008550346895dcf0879/html5/thumbnails/26.jpg)
Conclusions
Can efficiently maintain a compact representation of an array of numbers under additive changes
Works well in practice [TGIK'02]