histogram-aware sorting for enhanced word-aligned compression in bitmap indexes

18

Click here to load reader

Upload: daniel-lemire

Post on 06-May-2015

2.233 views

Category:

Technology


2 download

DESCRIPTION

Bitmap indexes must be compressed to reduce input/output costs and minimize CPU usage. To accelerate logical operations (AND, OR, XOR) over bitmaps, we use techniques based on run-length encoding (RLE), such as Word-Aligned Hybrid (WAH) compression. These techniques are sensitive to the order of the rows: a simple lexicographical sort can divide the index size by 9 and make indexes several times faster. We investigate reordering heuristics based on computed attribute-value histograms. Simply permuting the columns of the table based on these histograms can increase the sorting efficiency by 40%.

TRANSCRIPT

Page 1: Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes

Histogram-Aware Sorting for EnhancedWord-Aligned Compression in Bitmap Indexes

Owen Kaser1, Daniel Lemire2, Kamel Aouiche2

1- University of New Brunswick, Saint John2- Universite du Quebec at Montreal (UQAM)

November 11, 2008

Owen Kaser1, Daniel Lemire2, Kamel Aouiche2 Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes

Page 2: Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes

Bitmap indexes

SELECT * FROMT WHERE x=aAND y=b;

Bitmap indexes have a longhistory. (1972 at IBM.)

Long history with DW & OLAP.(Sybase IQ since mid 1990s).

Above, compute

{r | r is the row id of a row where x = a} ∩{r | r is the row id of a row where y = b}

Owen Kaser1, Daniel Lemire2, Kamel Aouiche2 Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes

Page 3: Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes

Bitmaps and fast AND/OR operations

Computing the union or the intersection of two sets ofintegers between 1 and 64 (eg row ids, trivial table). . .

E.g., {1, 5, 8} ∪ {1, 3, 5}?Can be done in one operation by a CPU:BitwiseOR( 10001001, 10101000)

Extend to sets from 1..N using dN/64e operations.

Owen Kaser1, Daniel Lemire2, Kamel Aouiche2 Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes

Page 4: Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes

Bitmap compression

x1

3

1

2...

n

L

10

1

0

00

...

...

...

x=

1

x=

2

x=

3

10 0

0 01

column index bitmaps

A column with n rows and L distinctvalues ⇒ nL bits

E.g., n = 106, L = 104 → 10 Gbits

Uncompressed bitmaps are oftenimpractical

Moreover, bitmaps often contain longstreams of zeroes. . .

Logical operations over these zeroes is awaste of CPU cycles.

Owen Kaser1, Daniel Lemire2, Kamel Aouiche2 Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes

Page 5: Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes

How to compress bitmaps?

Must handle long streams of zeroes efficiently ⇒Run-length encoding? (RLE)

RLE variants can focus on runs that align with machine-wordboundaries.

Trade compression for speed.

Our EWAH extends Wu et al.’s word-aligned hybrid.

0101000000000000 000. . . 000 000. . . 000 0011111111111100 . . .⇒ dirty word, run of 2 “clean 0” words, dirty word. . .

Owen Kaser1, Daniel Lemire2, Kamel Aouiche2 Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes

Page 6: Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes

Computational and storage bounds

n → number of rows, c → number of 1s per row;

Construction in time O(nc);

Total size of bitmaps is also in O(nc);

Given two bitmaps B1, B2 of compressed size |B1| and |B2|. . .

AND, OR, XOR in time O(|B1|+ |B2|).

Bounds do not depend on the number ofbitmaps. Implementation scales to millions of bitmaps.

Owen Kaser1, Daniel Lemire2, Kamel Aouiche2 Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes

Page 7: Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes

Improving compression by sorting the table

RLE, BBC, WAH, EWAH are order-sensitive:they compress sorted data better;

But finding the best row ordering is NP-hard.

Lexicographic sorting is

fast, even for very large tables.easy: sort is a Unix staple.

Substantial index-size reductions (often 2.5 times)

Owen Kaser1, Daniel Lemire2, Kamel Aouiche2 Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes

Page 8: Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes

k-of-N encoding

1-of-N100000010000001000000100000010100000000001

valuecatdogdishfishcowcatpony

2-of-N1100101010010110010111000011

With L bitmaps, you can represent L valuesby mapping each value to one bitmap;

Alternatively, you can represent(L2

)= L(L− 1)/2 values by mapping each

value to a pair of bitmaps;

More generally, you can represent(Lk

)values

by mapping each value to a k-tuple ofbitmaps;

At query time, you need to load k bitmapsin a look-up for one value;

You trade query-time performance for areduced number of bitmaps;

Often, fewer bitmaps translates into asmaller index, created faster.

Owen Kaser1, Daniel Lemire2, Kamel Aouiche2 Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes

Page 9: Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes

Gray-code order

A binary Gray Code (GC) minimizes bit transitions.

Pinar et al. propose to sort whole index by <GC, Gray-codeordering. Practical?

We contribute an easy/fast way to achieve GC-like resultsusing lexicographic sort.

Paper has details.

Owen Kaser1, Daniel Lemire2, Kamel Aouiche2 Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes

Page 10: Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes

Experimental environment

Mac Pro with 2 dual-core CPUs 2 GiB RAM (no thrashing)

GNU GCC 4.0.2 (C++)—32-bit binaries

Source code under GPL:http://code.google.com/p/lemurbitmapindex/(Linux and MacOS)

Mix of real and synthetic data:1 up to 877 M rows, 22 GB, 4 M attributes.2 experiments using 4–10 columns

Owen Kaser1, Daniel Lemire2, Kamel Aouiche2 Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes

Page 11: Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes

Coding suggestions: choosing k

1 If k > 1, bitmaps are denser and a query processes k of them;

cost can grow with n(k−1)/ki (big jump from 1 to 2).

2 If k > 1, we (usually) get smaller indexes.

Potentially high query-time penalty for k > 1, at best modestspace gains.

Default choice of k

Our index-construction algorithm handles the extremely sparse(k = 1) indexes nicely. k = 1 looks like a good choice.

Owen Kaser1, Daniel Lemire2, Kamel Aouiche2 Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes

Page 12: Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes

Column order matters

The first column(s) gainmore from the sort(column 1 is primary sortkey);

Conceptually, we maywish to reorder columns,eg swap columns 1 & 3.

Column order is crucial!

Finding the best orderingquickly remains open.

Netflix: 24 column orderings

1e+08

1.5e+08

2e+08

2.5e+08

3e+08

3.5e+08

4e+08

4.5e+08

5e+08

5.5e+08

432143124231421341324123342134123241321431423124243124132341231421432134143214231342132412431234

inde

x si

zecolumn permutation

1-of-N encoding4-of-N encoding

Owen Kaser1, Daniel Lemire2, Kamel Aouiche2 Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes

Page 13: Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes

Progress toward choosing column order

Paper models “gain” of putting a given column first.

Idea: order columns greedily (by max gain).

Experimentally, this approach is not promising: the bestorderings don’t seem to depend on gain.

Factors:

skews of columnsnumber of distinct valueskdensity of column’s bitmaps

Owen Kaser1, Daniel Lemire2, Kamel Aouiche2 Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes

Page 14: Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes

What usually works for dimension ordering?: k=1

For 1-of-N bitmaps, a density-based approach was okay:

Ordering rule, k = 1 : “sparse but not too sparse”

Order columns by decreasing

min

(1

ni,

1− 1/ni

4w − 1

), where

50 100 150 200 250 300

distinct values in column

ni → the number of distinct values in column i ,

w → the word size.

See 30–40% size reduction, merely knowing dimension sizes (ni ).See also [Canahuate et al., 2006] for related work.Situation worse for k > 1. Details

Owen Kaser1, Daniel Lemire2, Kamel Aouiche2 Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes

Page 15: Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes

Future directions

Need better mathematical modelling of bitmap compressedsize in sorted tables;

Study the effect of word length (16, 32, 64, 128 bits);

Apply to Column-oriented DBMS [Stonebraker et al., 2005];

Consider encodings that can efficiently support rangequeries [Chan and Ioannidis, 1999].

Owen Kaser1, Daniel Lemire2, Kamel Aouiche2 Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes

Page 16: Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes

Questions?

?

Owen Kaser1, Daniel Lemire2, Kamel Aouiche2 Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes

Page 17: Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes

Canahuate, G., Ferhatosmanoglu, H., and Pinar, A. (2006).Improving bitmap index compression by data reorganization.http://hpcrd.lbl.gov/~apinar/papers/TKDE06.pdf (checked2008-05-30).

Chan, C. Y. and Ioannidis, Y. E. (1999).An efficient bitmap encoding scheme for selection queries.In SIGMOD’99, pages 215–226.

Stonebraker, M., Abadi, D. J., Batkin, A., Chen, X.,Cherniack, M., Ferreira, M., Lau, E., Lin, A., Madden, S.,O’Neil, E., O’Neil, P., Rasin, A., Tran, N., and Zdonik, S.(2005).C-store: a column-oriented dbms.In VLDB ’05, pages 553–564.

Wu, K., Otoo, E. J., and Shoshani, A. (2006).Optimizing bitmap indices with efficient compression.ACM Transactions on Database Systems, 31(1):1–38.

Owen Kaser1, Daniel Lemire2, Kamel Aouiche2 Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes

Page 18: Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes

What usually works for dimension ordering?: k > 1

Density formula (ni → k√

ni ) recommends poorly when k > 1. Ourexperiments on synthetic data give some guidance:

When k > 1, order columns by

1 descending skew

2 descending size

(And do the reverse when k = 1.)

Open issues, k > 1

1 How do we balance skew & size factors?

2 What other properties of the histograms are needed?

Owen Kaser1, Daniel Lemire2, Kamel Aouiche2 Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes