A Comparison of Five Probabilistic View-Size Estimation Techniques in OLAP
DESCRIPTION
A data warehouse cannot materialize all possible views, hence we must estimate quickly, accurately, and reliably the size of views to determine the best candidates for materialization. Many available techniques for view-size estimation make particular statistical assumptions and their error can be large. Comparatively, unassuming probabilistic techniques are slower, but they estimate accurately and reliably very large view sizes using little memory. We compare five unassuming hashing-based view-size estimation techniques, including Stochastic Probabilistic Counting and LogLog Probabilistic Counting. Our experiments show that only Generalized Counting, Gibbons-Tirthapura, and Adaptive Counting provide universally tight estimates irrespective of the size of the view; of those, only Adaptive Counting remains constantly fast as we increase the memory budget.

TRANSCRIPT
A Comparison of Five Probabilistic View-Size Estimation Techniques in OLAP
Kamel Aouiche and Daniel Lemire
Université du Québec à Montréal (UQAM), Canada
November 11, 2008
K. Aouiche and D. Lemire Probabilistic View-Size Estimation.................................1/19
Online Analytical Processing (OLAP)
[Figure: OLAP cube illustration. Source: dwreview.com]
What is OLAP?
multidimensional model with several (hierarchical) dimensions
measures aggregated (SUM, MIN, MAX, AVERAGE, ...)
a set of standard operations: drill-down, roll-up, slice, dice
answers expected in near constant time
The View-Size Materialization problem
Storing the result of a query (a view) is important
You trade storage for (future) speed.
Storage is cheap, faster CPUs are expensive.
Materialized views can be used to compute other views faster.
From sales per (time, store) it is faster to compute sales per store than to go back to the transactions!
For aggressive aggregation (coarse views), materialized views are unbeatable!
The data cube
[Figure: the cuboid lattice on Time, Store, and Product: (Time, Store, Product); (Time, Store), (Time, Product), (Store, Product); (Time), (Store), (Product); none.]
A d-dimensional data cube is made of 2^d − 1 cuboids.
Typical values for d range from 10 to 20.
You also have dimensional hierarchies to handle.
It would take too long to materialize them all, even if you had enough storage and the data never changed.
View Selection Heuristics
given a data warehouse, heuristics can be used to determine which views to materialize (the problem itself is typically NP-hard);
many heuristics assume reliable and accurate estimates of the view sizes;
even if the choice is made by hand, the analyst needs guidance;
finding optimally fast, accurate and reliable estimates is still an open problem;
there has been little experimental work comparing the alternatives!
What is view-size estimation?
Algorithmically?
Views are typically the result of GROUP BYs;
the size of the view is the number of distinct elements in the GROUP BY;
thus, in a simplistic sense, view-size estimation is equivalent to counting the number of distinct elements in a sequence, using little memory.
Example facts: A B C, A B D, A A C, A B E.
Projected on the first two attributes, the distinct rows are A B and A A: the view size is 2.
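A minimal sketch of this equivalence in Python, using the example tuples from the slide (illustrative only: the whole point of the talk is to avoid storing the full set of distinct tuples):

```python
# Exact view-size computation: the size of a GROUP BY view is the
# number of distinct projected tuples.
facts = [("A", "B", "C"), ("A", "B", "D"), ("A", "A", "C"), ("A", "B", "E")]

# Project on the first two attributes (the GROUP BY columns).
view = {(t[0], t[1]) for t in facts}

print(len(view))  # view size: 2 distinct rows, (A, B) and (A, A)
```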
Summary of the methods under review
Sampling Method
Sampling and Multifractal models [Faloutsos et al., 1996]
Probabilistic Methods
Adaptive counting [Cai et al., 2005];
LogLog probabilistic counting [Durand and Flajolet, 2003];
Gibbons-Tirthapura [Gibbons and Tirthapura, 2001]
Generalized counting [Bar-Yossef et al., 2002]
Unassuming Probabilistic Techniques
[Figure: UNIFORM RANDOM HASHING: items scattered uniformly at random over the hashed space.]
What is the big idea?
It is difficult to work with the original data because it has unknown bias.
Instead of learning the distribution, just hash every element and work in the hashed space.
Suddenly the data distribution is known: it is uniform!
k-wise independent hashing
What is it?
hash to [0, 2^L)
uniform hashing: P(h(x) = y) = 1/2^L
pairwise independent hashing: P(h(x) = y ∧ h(x′) = y′) = 1/4^L
3-wise independent hashing: P(h(x) = y ∧ h(x′) = y′ ∧ h(x′′) = y′′) = 1/8^L
pairwise independence implies uniformity.
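A classic pairwise independent family is h(x) = ((a·x + b) mod p) mod 2^L, with p a prime larger than the universe and a, b random. A minimal sketch (the constants are our choices, not the slides'; the final reduction mod 2^L is only approximately uniform when 2^L does not divide p − 1):

```python
import random

L = 16
p = 2_147_483_647  # Mersenne prime 2**31 - 1, larger than the item universe

a = random.randrange(1, p)  # a != 0
b = random.randrange(0, p)

def h(x: int) -> int:
    """Pairwise independent hash of x into [0, 2**L)."""
    return ((a * x + b) % p) % (2 ** L)

print(h(42), h(1337))  # two hashed values in [0, 65536)
```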
Multidimensional Hashing
[Figure: facts along three dimensions.
TIME: March, April, August, March, March, October
STORE: Montreal, New York, Montreal, Toronto, Los Angeles, Miami
PRODUCT: ham, jelly, bread, milk
Hashed values: March → 0111, April → 1010, August → 1110, October → 0101;
ham → 0111, jelly → 1010, bread → 1110, milk → 0101.
Each fact is hashed by XORing the hashed values of its coordinates.]
How to hash facts?
Use a random number generator and generate independent hashed values for each dimension.
XOR the hashed values.
If you have k dimensions, you get k-wise independent hashing.
Scales well if you store the dimension-wise hash functions.
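The recipe above can be sketched as follows (value names are taken from the figure; random lookup tables stand in for the dimension-wise hash functions, and the seed is an arbitrary choice):

```python
import random

L = 4  # bits per hashed value
random.seed(2008)

def make_table(values):
    """Assign each attribute value an independent random L-bit code."""
    return {v: random.getrandbits(L) for v in values}

time_h = make_table(["March", "April", "August", "October"])
product_h = make_table(["ham", "jelly", "bread", "milk"])

def hash_fact(t, prod):
    # XOR the dimension-wise hashed values to hash the whole fact.
    return time_h[t] ^ product_h[prod]

print(hash_fact("March", "milk"))  # an L-bit hash of the fact
```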
Stochastic (LogLog) Probabilistic Counting
[Figure: hashed values in binary. Probabilistic COUNTING rejects outliers; LOGLOG only keeps the maximum number of leading zeroes.]
The counting trick
Hash to [0, 2^L).
Keep track of the number of leading zeroes t; the estimate is ≈ 2^t.
The LogLog variant only keeps the maximum number of leading zeroes (an outlier).
Stochastic: hash x randomly to one of M intervals [0, 2^L), keep track of M lesser values, and take a sort of geometric average of the M estimates.
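A minimal LogLog counting sketch under our own assumptions (MD5 truncated to 64 bits as the hash, and the asymptotic bias-correction constant α ≈ 0.39701; the authors' implementation may differ):

```python
import hashlib

k = 10            # M = 2**k buckets
M = 2 ** k
ALPHA = 0.39701   # asymptotic LogLog bias-correction constant

def rank(w: int, bits: int) -> int:
    """1-indexed position of the leftmost 1-bit of w in a `bits`-bit word."""
    for i in range(bits):
        if w & (1 << (bits - 1 - i)):
            return i + 1
    return bits + 1

def loglog_estimate(items) -> float:
    maxima = [0] * M
    for x in items:
        hv = int.from_bytes(hashlib.md5(str(x).encode()).digest()[:8], "big")
        bucket = hv >> (64 - k)            # first k bits pick the bucket
        rest = hv & ((1 << (64 - k)) - 1)  # remaining 64 - k bits
        maxima[bucket] = max(maxima[bucket], rank(rest, 64 - k))
    # Geometric-mean-style combination: 2 to the arithmetic mean of maxima.
    return ALPHA * M * 2 ** (sum(maxima) / M)

print(loglog_estimate(range(100_000)))  # roughly 100000 (std error ~ 4%)
```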
Adaptive counting
[Cai et al., 2005]
Probabilistic counting schemes require the view size to be very large.
A view that is small compared to the available memory (M) will leave several of the M counters unused.
When more than 5% of the counters are unused, we return a linear counting estimate [Whang et al., 1990] instead of the LogLog estimate.
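The linear counting fallback estimates n as −M·ln(V/M), where V is the number of empty buckets. A minimal sketch (MD5 as the hash is our assumption):

```python
import hashlib
import math

M = 1024  # bitmap size

def linear_count(items) -> float:
    """Linear counting [Whang et al., 1990]: estimate from empty buckets."""
    bitmap = [0] * M
    for x in items:
        hv = int.from_bytes(hashlib.md5(str(x).encode()).digest()[:8], "big")
        bitmap[hv % M] = 1
    empty = M - sum(bitmap)
    return -M * math.log(empty / M)  # undefined when no bucket is empty

print(linear_count(range(200)))  # roughly 200
```

Accurate precisely in the small-cardinality regime where LogLog fails, which is why adaptive counting switches to it.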
Generalized counting
[Bar-Yossef et al., 2002]
The tuples and hashed values are stored in an ordered set M.
When M is small with respect to the view size, most tuples are never inserted since their hashed value is larger than the smallest M hashed values.
The estimate is 2^L × size(M)/max(M), where max(M) returns the element with the largest hashed value.
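A minimal sketch of this scheme, keeping only the M smallest distinct hashed values in a heap (MD5 as the hash and the heap layout are our assumptions, not the authors' implementation):

```python
import hashlib
import heapq

L = 64   # hash width in bits
M = 256  # memory budget: number of kept hashed values

def generalized_count(items) -> float:
    heap = []    # max-heap via negation: holds the M smallest hashes seen
    kept = set()
    for x in items:
        hv = int.from_bytes(hashlib.md5(str(x).encode()).digest()[:8], "big")
        if hv in kept:
            continue
        if len(heap) < M:
            heapq.heappush(heap, -hv)
            kept.add(hv)
        elif hv < -heap[0]:  # smaller than the largest kept hash
            evicted = -heapq.heappushpop(heap, -hv)
            kept.discard(evicted)
            kept.add(hv)
    # Estimate: 2**L * size / max over the kept set.
    return (2 ** L) * len(heap) / -heap[0]

print(generalized_count(range(100_000)))  # roughly 100000
```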
Gibbons-Tirthapura
[Gibbons and Tirthapura, 2001]
Keep track of all items hashed into a 1/2^t fraction of the hashing space.
The estimate is 2^t × m, where m is the number of items tracked.
[Figure: hashed items; only those falling in a 1/2^t fraction of the hash space are tracked.]
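The scheme can be sketched as follows: track items whose hash falls below 2^(L−t), and increment t (halving the tracked fraction) whenever the buffer outgrows the memory budget (MD5 as the hash is our assumption):

```python
import hashlib

L = 64   # hash width in bits
M = 256  # memory budget: maximum number of tracked items

def gt_count(items) -> int:
    t = 0
    buf = set()  # hashed values below 2**(L - t)
    for x in items:
        hv = int.from_bytes(hashlib.md5(str(x).encode()).digest()[:8], "big")
        if hv < 2 ** (L - t):
            buf.add(hv)
            while len(buf) > M:  # shrink the tracked fraction of hash space
                t += 1
                buf = {h for h in buf if h < 2 ** (L - t)}
    # Each tracked item stands for 2**t items of the stream.
    return (2 ** t) * len(buf)

print(gt_count(range(100_000)))  # roughly 100000
```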
Experimental results
Benchmark the accuracy and speed of the five algorithms over:
a synthetic data set (generated by DBGEN)
a real data set (US Census 1990)

                 US Census 1990   DBGEN
# of facts       2 458 285        13 977 981
# of views       20               8
# of attributes  69               16
Data size        360 MiB          1.5 GiB

Table: Characteristics of the data sets.
Experimental results: small memory budgets
[Figure panels (a)-(f): ε, standard error (%) vs. view size (10^2 to 10^7), log-log scale.
(a) Gibbons-Tirthapura, (b) Probabilistic Counting, (c) LogLog: curves for M = 16, 64, 256, 2048, plus theory for M = 2048 in (b) and (c);
(d) Multifractal: sampling rates p = 0.1%, 0.3%, 0.5%, 0.7%;
(e) Generalized Counting, (f) Adaptive Counting: curves for M = 16, 64, 256, 2048.]

Figure: Standard error of estimation as a function of exact view size for increasing values of M (US Census 1990).
Experimental results: large memory budgets
[Figure panels: ε, standard error (%) vs. memory budget (10 to 10^8), log-log scale, for Gibbons-Tirthapura, Generalized Counting, LogLog, Counting, and Adaptive Counting; (a) L = 32, (b) L = 64.]

Figure: Standard error of estimation for a given view (four dimensions and 1.18 × 10^7 distinct tuples) as a function of the memory budget M (synthetic data set).
Experimental results: speed
[Figure: time (seconds, 0 to 250) vs. memory budget (10 to 10^6), for Gibbons-Tirthapura, Generalized Counting, LogLog, Counting, and Adaptive Counting.]

Figure: Estimation time for a given view (four dimensions and 1.18 × 10^7 distinct tuples) as a function of the memory budget M (synthetic data set).
Conclusion
Main Points
Sampling can be quite unreliable, but it is very fast.
The processing time of the probabilistic methods is dominated by hashing.
For view sizes that are small relative to the available memory budget, the accuracy of Probabilistic Counting and LogLog can be very low.
However, as you increase the memory budget, Gibbons-Tirthapura, Generalized Counting and Adaptive Counting improve systematically, but they also become slower.
Adaptive Counting remains constantly fast.
Bar-Yossef, Z., Jayram, T. S., Kumar, R., Sivakumar, D., and Trevisan, L. (2002). Counting distinct elements in a data stream. In RANDOM'02, pages 1–10.
Cai, M., Pan, J., Kwok, Y.-K., and Hwang, K. (2005). Fast and accurate traffic matrix measurement using adaptive cardinality counting. In MineNet'05, pages 205–206.
Durand, M. and Flajolet, P. (2003). Loglog counting of large cardinalities. In ESA'03, volume 2832 of LNCS, pages 605–617.
Faloutsos, C., Matias, Y., and Silberschatz, A. (1996). Modeling skewed distribution using multifractals and the 80-20 law. In VLDB'96, pages 307–317.
Gibbons, P. B. and Tirthapura, S. (2001). Estimating simple functions on the union of data streams. In SPAA'01, pages 281–291.
Whang, K.-Y., Vander-Zanden, B. T., and Taylor, H. M. (1990). A linear-time probabilistic counting algorithm for database applications. ACM Trans. Database Syst., 15(2):208–229.