compressing column-oriented indexes

Compressing column-oriented indexes Daniel Lemire http://www.professeurs.uqam.ca/pages/lemire.daniel.htm blog: http://www.daniel-lemire.com/ Joint work with Owen Kaser (UNB) and Kamel Aouiche (post-doc). November 19, 2009 Daniel Lemire Compressing column-oriented indexes

Upload: daniel-lemire

Post on 06-May-2015


DESCRIPTION

Column-oriented databases have become fashionable following the work of Stonebraker et al. In the data warehousing industry, the terms "column oriented" and "column store" have become necessary marketing buzzwords. One of the benefits of column-oriented indexes is good compression through run-length encoding (RLE). This type of compression is particularly beneficial since it simultaneously reduces the volume of data and the necessary computations. However, the efficiency of the compression depends on the order of the rows in the table, and this is even more important with larger tables. Finding the best row ordering is NP-hard. We compare some heuristics for this problem, including variations on the lexicographical order, Gray codes, and Hilbert space-filling curves.

TRANSCRIPT

Page 1: Compressing column-oriented indexes

Compressing column-oriented indexes

Daniel Lemire

http://www.professeurs.uqam.ca/pages/lemire.daniel.htm

blog: http://www.daniel-lemire.com/

Joint work with Owen Kaser (UNB) and Kamel Aouiche (post-doc).

November 19, 2009

Daniel Lemire Compressing column-oriented indexes

Page 2: Compressing column-oriented indexes

Row Stores

name, date, age, sex, salary

name, date, age, sex, salary

name, date, age, sex, salary

name, date, age, sex, salary

name, date, age, sex, salary

Dominant paradigm

Transactional: Quick append and delete


Page 3: Compressing column-oriented indexes

Column Stores

name date age sex salary

Goes back to StatCan in the seventies [Turner et al., 1979]

Made fashionable again in Data Warehousing by Stonebraker [Stonebraker et al., 2005]

New: Oracle Exadata hybrid columnar compression

Favors run-length encoding (compression)


Page 4: Compressing column-oriented indexes

Main column-oriented indexes

(1) Bitmap indexes [O’Neil, 1989]

(2) Projection indexes [O’Neil and Quass, 1997]

Both are compressible.


Page 5: Compressing column-oriented indexes

Bitmap indexes

SELECT * FROM T WHERE x=a AND y=b;

Bitmap indexes have a long history. (1972 at IBM.)

Long history with DW & OLAP. (Sybase IQ since mid 1990s.)

Main competition: B-trees.

Above, compute

{r | r is the row id of a row where x = a} ∩ {r | r is the row id of a row where y = b}
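The intersection above can be sketched in a few lines of Python. This is only an illustration, not the implementation discussed in the talk: arbitrary-precision integers stand in for bitmaps, and the table T, its columns and the helper names are made up for the example.

```python
from collections import defaultdict


def build_bitmap_index(column):
    """Map each distinct value to an int whose bit r is set when row r holds it."""
    index = defaultdict(int)
    for row_id, value in enumerate(column):
        index[value] |= 1 << row_id
    return index


def row_ids(bitmap):
    """List the positions of the set bits, i.e. the matching row ids."""
    return [r for r in range(bitmap.bit_length()) if (bitmap >> r) & 1]


# Hypothetical table T with two columns x and y (made-up values).
x = ["a", "b", "a", "a", "c"]
y = ["b", "b", "c", "b", "b"]
index_x = build_bitmap_index(x)
index_y = build_bitmap_index(y)

# WHERE x = 'a' AND y = 'b' becomes a single bitwise AND of two bitmaps.
matches = row_ids(index_x["a"] & index_y["b"])
```

Rows 0 and 3 are the only ones where both predicates hold, so `matches` lists exactly those two row ids.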


Page 6: Compressing column-oriented indexes

Bitmaps and fast AND/OR operations

Computing the union of two sets of integers between 1 and 64 (e.g., row ids of a trivial table)...

E.g., {1, 5, 8} ∪ {1, 3, 5}?

Can be done in one operation by a CPU: BitwiseOR(10001001, 10101000)

Extend to sets from 1..N using ⌈N/64⌉ operations.

To compute [a0, ..., aN−1] ∨ [b0, b1, ..., bN−1]:

a0, ..., a63 BitwiseOR b0, ..., b63;
a64, ..., a127 BitwiseOR b64, ..., b127;
a128, ..., a191 BitwiseOR b128, ..., b191;
...

It is a form of vectorization.
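A minimal sketch of the word-by-word union, assuming 64-bit words stored in a Python list. The helper names are illustrative, and a real implementation would use native machine words rather than Python integers.

```python
WORD_BITS = 64
WORD_MASK = (1 << WORD_BITS) - 1


def to_words(ids, n):
    """Pack a set of row ids in 1..n into ceil(n / 64) words of 64 bits."""
    words = [0] * ((n + WORD_BITS - 1) // WORD_BITS)
    for r in ids:
        words[(r - 1) // WORD_BITS] |= 1 << ((r - 1) % WORD_BITS)
    return words


def bitwise_or(a_words, b_words):
    """One OR per pair of words: ceil(n / 64) operations in total."""
    return [(a | b) & WORD_MASK for a, b in zip(a_words, b_words)]


def from_words(words):
    """Recover the set of row ids from the packed words."""
    return {
        w * WORD_BITS + b + 1
        for w, word in enumerate(words)
        for b in range(WORD_BITS)
        if (word >> b) & 1
    }


union = from_words(bitwise_or(to_words({1, 5, 8}, 64), to_words({1, 3, 5}, 64)))
```

With row ids up to 64 the whole union is a single OR; with ids up to 130 it takes three.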


Page 7: Compressing column-oriented indexes

Common applications of the bitmaps

The Java language has had a bitmap class since the beginning: java.util.BitSet. (Sun’s implementation is based on 64-bit words.)

Search engines use bitmaps to filter queries, e.g. Apache Lucene


Page 8: Compressing column-oriented indexes

Bitmap compression

[Figure: a column x with n rows, shown next to its index bitmaps (x=1, x=2, x=3, ...); the index has L bitmaps of n bits each.]

A column with n rows and L distinct values ⇒ nL bits

E.g., n = 10^6, L = 10^4 → 10^10 bits (10 Gbits)

Uncompressed bitmaps are often impractical

Moreover, bitmaps often contain long streams of zeroes...

Logical operations over these zeroes are a waste of CPU cycles.


Page 9: Compressing column-oriented indexes

How to compress bitmaps?

Must handle long streams of zeroes efficiently ⇒ Run-length encoding (RLE)?

Bitmap: a run of 0s, a run of 1s, a run of 0s, a run of 1s, ...

So just encode the run lengths, e.g., 0001111100010111 → 3, 5, 3, 1, 1, 3
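The run-length example can be reproduced with a short sketch. It assumes, as the slide does, that the first run is a run of 0s; the leading run of length 0 for bitmaps that start with a 1 is a convention chosen here, not something stated in the talk.

```python
from itertools import groupby


def run_lengths(bits):
    """Lengths of the alternating runs in a string of 0s and 1s.

    The first run is taken to be a run of 0s, so a bitmap that starts
    with 1 gets a leading run of length 0 (an assumed convention).
    """
    lengths = [len(list(group)) for _, group in groupby(bits)]
    if bits.startswith("1"):
        lengths.insert(0, 0)
    return lengths


def expand(lengths):
    """Inverse of run_lengths: rebuild the bit string from the run lengths."""
    return "".join(("0" if k % 2 == 0 else "1") * n for k, n in enumerate(lengths))


encoded = run_lengths("0001111100010111")
```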


Page 10: Compressing column-oriented indexes

Compressing better with delta codes

RLE can make things worse. E.g., use 8-bit counters, then 11 may become 000000101.

How many bits to use for the counters?

Universal codes such as delta codes use no more than c log x bits to represent a value x.

Recall Gamma codes: 0 is 0, 1 is 1, 01 is 2, 001 is 3, 0001 is 4, etc.

Delta codes build on Gamma codes. Write x = 2^N + (x mod 2^N), then encode in two steps:

Write N − 1 as gamma code; write x mod 2^N as an N-bit number.

E.g., 17 = 2^4 + 1 is coded as 0010001 (001 for N − 1 = 3, then 0001 for the remainder).
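For comparison, here is a sketch of the Elias gamma and delta codes as they are usually defined in textbooks. Their exact bit layout differs from the simplified listing on this slide (the textbook delta code of 17 is 001010001), but the idea is the same: a short prefix gives the number of remainder bits, so a value x costs on the order of log x bits. Function names are illustrative.

```python
def elias_gamma(x):
    """Textbook Elias gamma code of an integer x >= 1, as a bit string."""
    if x < 1:
        raise ValueError("gamma codes are defined for x >= 1")
    binary = bin(x)[2:]
    # (len - 1) zeros announce the length, then the binary digits follow.
    return "0" * (len(binary) - 1) + binary


def elias_delta(x):
    """Textbook Elias delta code: gamma code of N + 1, then the N low bits."""
    if x < 1:
        raise ValueError("delta codes are defined for x >= 1")
    binary = bin(x)[2:]      # the leading bit is always 1
    n = len(binary) - 1      # x = 2**n + (x mod 2**n)
    return elias_gamma(n + 1) + binary[1:]


def decode_elias_delta(code):
    """Decode a single delta-coded value from a bit string."""
    zeros = 0
    while code[zeros] == "0":
        zeros += 1
    n = int(code[zeros:2 * zeros + 1], 2) - 1
    start = 2 * zeros + 1
    remainder = code[start:start + n]
    return (1 << n) + (int(remainder, 2) if remainder else 0)
```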


Page 11: Compressing column-oriented indexes

RLE with delta codes is pretty good

In some (weak) sense, RLE compression with delta codes is optimal!

Theorem

A bitmap index over an N-value column of length n, compressed with RLE and delta codes, uses O(n log N) bits.


Page 12: Compressing column-oriented indexes

Byte/Word-aligned RLE

RLE variants can focus on runs that align with machine-word boundaries.

Trade compression for speed.

That is what Oracle is doing.

Variants: BBC (byte aligned), WAH

Our EWAH extends Wu et al.’s (was known to Wu as WBC) word-aligned hybrid.

0101000000000000 000...000 000...000 0011111111111100 ... ⇒ dirty word, run of 2 “clean 0” words, dirty word...
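The dirty/clean classification can be sketched as follows, using 16-bit words as in the example. This is a simplified illustration of the word-aligned idea only; it does not reproduce the actual EWAH or WAH storage layout, and the marker names are made up.

```python
def compress_words(words, word_bits=16):
    """Group words into ('clean0', run), ('clean1', run) and ('dirty', word).

    Simplified illustration: not the real EWAH or WAH format.
    """
    all_ones = (1 << word_bits) - 1
    out = []
    for w in words:
        if w == 0:
            kind = "clean0"
        elif w == all_ones:
            kind = "clean1"
        else:
            kind = "dirty"
        if kind == "dirty":
            out.append((kind, w))              # stored verbatim
        elif out and out[-1][0] == kind:
            out[-1] = (kind, out[-1][1] + 1)   # extend the current clean run
        else:
            out.append((kind, 1))              # start a new clean run
    return out


words = [0b0101000000000000, 0, 0, 0b0011111111111100]
compressed = compress_words(words)
```

Clean runs are counted in whole words, so logical operations can skip them without inspecting individual bits; that is the trade of compression for speed.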


Page 13: Compressing column-oriented indexes

What are bitmap indexes for?

Construction time is proportional to index size. (Data is written sequentially on disk.)

Implementation scales to millions of bitmaps.

Myth: bitmap indexes are for low cardinality columns.

the Bitmap index is the conclusive choice for data warehouse design for columns with high or low cardinality [Zaker et al., 2008].


Page 14: Compressing column-oriented indexes

What about other compression types?

With RLE-like compression we have B1 ∨ B2 or B1 ∧ B2 in time O(|B1| + |B2|).

Hence, with RLE, compression saves both storage and CPU cycles!

Not always true with other techniques such as Huffman, LZ77, Arithmetic Coding, ...
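A sketch of why the OR runs in time proportional to the compressed sizes: the two run lists are merged in a single pass, without ever expanding the bits. The (bit, run length) representation is an assumption made for the example; actual formats such as EWAH work on words rather than single bits.

```python
def rle_or(a, b):
    """OR two bitmaps given as lists of (bit, run_length) pairs.

    Both inputs must cover the same number of bits. Every iteration
    consumes at least one run, so the loop runs at most len(a) + len(b) times.
    """
    out = []
    i = j = 0
    rem_a = a[0][1] if a else 0
    rem_b = b[0][1] if b else 0
    while i < len(a) and j < len(b):
        step = min(rem_a, rem_b)
        bit = a[i][0] | b[j][0]
        if step > 0:
            if out and out[-1][0] == bit:
                out[-1] = (bit, out[-1][1] + step)   # extend the previous run
            else:
                out.append((bit, step))
        rem_a -= step
        rem_b -= step
        if rem_a == 0:
            i += 1
            rem_a = a[i][1] if i < len(a) else 0
        if rem_b == 0:
            j += 1
            rem_b = b[j][1] if j < len(b) else 0
    return out


def expand(runs):
    """Rebuild the bit string, for checking only."""
    return "".join(str(bit) * n for bit, n in runs)


b1 = [(0, 3), (1, 5), (0, 8)]   # 0001111100000000
b2 = [(0, 6), (1, 4), (0, 6)]   # 0000001111000000
merged = rle_or(b1, b2)
```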


Page 15: Compressing column-oriented indexes

What happens when you have many bitmaps?

Consider B1 ∨ B2 ∨ . . . ∨ BN .

First compute the first two: B1 ∨ B2 in time O(|B1| + |B2|).

|B3 ∨ B4| is in O(|B3| + |B4|).

Thus (B1 ∨ B2) ∨ (B3 ∨ B4) takes O(2 ∑_i |Bi|)...

Total is in O(∑_{i=1}^{N} |Bi| log N), can be generalized [Lemire et al., 2009].
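The pairwise merging order can be sketched as below. Python integers stand in for compressed bitmaps, so the sketch only shows the order of the operations, not their cost: each round touches every bitmap once, and there are about log2 N rounds.

```python
def or_many(bitmaps):
    """OR a list of bitmaps by merging neighbours pairwise, round after round."""
    layer = list(bitmaps)
    if not layer:
        return 0
    while len(layer) > 1:
        merged = [layer[k] | layer[k + 1] for k in range(0, len(layer) - 1, 2)]
        if len(layer) % 2 == 1:
            merged.append(layer[-1])   # an odd bitmap moves up unchanged
        layer = merged
    return layer[0]


result = or_many([0b0001, 0b0100, 0b1000, 0b0001, 0b0010])
```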


Page 16: Compressing column-oriented indexes

How do 64-bit words compare to 32-bit words?

We implemented EWAH using 16-bit, 32-bit and 64-bit words;

Only 32-bit and 64-bit are efficient;

64-bit indexes are nearly twice as large;

64-bit indexes are between 5% and 40% faster (despite higher I/O costs).


Page 17: Compressing column-oriented indexes

Open Source Software?

Lemur Bitmap Index C++ Library: http://code.google.com/p/lemurbitmapindex/.

JavaEWAH: A compressed alternative to the Java BitSet class: http://code.google.com/p/javaewah/.


Page 18: Compressing column-oriented indexes

Projection indexes

SELECT sum(number*price) FROM T;

Simply write out the values sequentially.

Ideal for low selectivity queries on few columns.

Compressible with RLE.
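A minimal sketch of a projection index, assuming each column is simply a Python list (the values below are made up). The query scans only the two columns it needs, and a run-length-encoded column can be aggregated run by run.

```python
from itertools import groupby


def rle_encode(values):
    """Run-length encode a column as (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def rle_sum(runs):
    """Sum a run-length-encoded column without expanding it."""
    return sum(value * count for value, count in runs)


# Hypothetical projection indexes for two columns of T.
number = [2, 2, 2, 5, 5, 1]
price = [10, 10, 10, 10, 3, 3]

# SELECT sum(number*price) FROM T; reads these two columns only.
total = sum(n * p for n, p in zip(number, price))

number_runs = rle_encode(number)
```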


Page 19: Compressing column-oriented indexes

Improving compression by sorting the table

RLE schemes are order-sensitive: they compress sorted tables better;

But finding the best row ordering is NP-hard [Lemire et al., 2009].

So we sort:

lexicographically

with Gray codes

Hilbert, ...
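The effect of sorting can be checked on a toy table by counting runs of identical values in each column, before and after a lexicographic sort. The table is made up for the example.

```python
def count_runs(table):
    """Total number of runs of identical values, summed over all columns."""
    runs = 0
    for column in zip(*table):
        runs += sum(1 for k, v in enumerate(column) if k == 0 or v != column[k - 1])
    return runs


# Hypothetical two-column table in arrival order.
table = [("b", 2), ("a", 1), ("b", 1), ("a", 2), ("b", 2), ("a", 1)]

unsorted_runs = count_runs(table)
sorted_runs = count_runs(sorted(table))   # lexicographic row order
```

On this table the run count drops from 10 to 6 after sorting.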


Page 20: Compressing column-oriented indexes

How many ways to sort? (1)

Lexicographic row sorting is

fast, even for very large tables.

easy: sort is a Unix staple.

Substantial index-size reductions (often 2.5 times, benefits grow with table size)


Page 21: Compressing column-oriented indexes

How many ways to sort? (2)

Gray codes are lists of tuples where successive tuples are at a (Hamming) distance of 1 [Knuth, 2005, § 7.2.1.1].

Reflected Gray Code order is

sometimes slightly better than lexicographical...

... but the benefit goes as ≈ 1/N with column cardinality N

poorly supported by existing software.
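As a small illustration, the binary reflected Gray code can be generated with the classic i XOR (i >> 1) formula. This covers only the binary case; ordering table rows with columns of arbitrary cardinality, as in the talk, needs a mixed-radix variant that is not shown here.

```python
def reflected_gray_code(bits):
    """Binary reflected Gray code over `bits` bits, as a list of integers."""
    return [i ^ (i >> 1) for i in range(1 << bits)]


def hamming_distance(a, b):
    """Number of bit positions where a and b differ."""
    return bin(a ^ b).count("1")


codes = reflected_gray_code(3)
distances = [hamming_distance(u, v) for u, v in zip(codes, codes[1:])]
```

Every pair of successive codes differs in exactly one bit, which is the property that tends to extend runs from one row to the next.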


Page 22: Compressing column-oriented indexes

How many ways to sort? (3)

Reflected Gray Code order is not the only Gray code.

Knuth also presents Modular Gray-code.

But alternatives to reflected are never better?


Page 23: Compressing column-oriented indexes

How many ways to sort? (4)

Can also try esoteric orders.

Hilbert Index [Hamilton and Rau-Chaplin, 2007].

Gives very bad results for column-oriented indexes.


Page 24: Compressing column-oriented indexes

Modelling the size of an index

Any formal result?

Tricky: There are many variations on RLE.

Use: number of runs of identical value in a column


Page 25: Compressing column-oriented indexes

Recursive orders

Lexicographical, reflected Gray code and modular Gray code belong to a larger class:

Definition

A recursive order over c-tuples is such that it generates a recursive order over (c − 1)-tuples. All orders over 1-tuples are recursive.

This is a recursive order:

1 0 0
1 0 1
0 1 1

This is not recursive:

1 0 0
0 1 1
1 0 1
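One way to read the definition, offered here as an assumption rather than as the authors' formal statement: dropping the last column must still leave a well-defined order, so rows that share a prefix have to be listed contiguously, for every prefix length. The checker below tests that reading on the two examples.

```python
def is_recursive_order(rows):
    """Check that rows sharing a prefix are contiguous, for every prefix length.

    This encodes one interpretation of the definition (an assumption).
    """
    if not rows:
        return True
    width = len(rows[0])
    for k in range(1, width):
        seen = set()
        previous = None
        for row in rows:
            prefix = row[:k]
            if prefix != previous:
                if prefix in seen:
                    return False   # the prefix reappears after an interruption
                seen.add(prefix)
                previous = prefix
    return True


first = [(1, 0, 0), (1, 0, 1), (0, 1, 1)]
second = [(1, 0, 0), (0, 1, 1), (1, 0, 1)]
```

In the second list the prefix (1, 0) appears, is interrupted by (0, 1), and appears again, so no order over 2-tuples is induced.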


Page 26: Compressing column-oriented indexes

When sorting, column order matters

Question

Given a phone directory, to minimize the number of runs, should we sort by first or last names?


Page 27: Compressing column-oriented indexes

When sorting, column order matters

c columns

any recursive order

in practice, column order is very significant (factor of two or more)

Proposition

The number of column runs varies by a factor of ≈ c under the permutation of the columns.


Page 28: Compressing column-oriented indexes

But column reordering fails to buy optimality

For some tables...

Lemma

No recursive order minimizes the number of runs, even after reordering the columns.

Open problem: how far from optimality?


Page 29: Compressing column-oriented indexes

Best column order?

We almost have this result [Lemire and Kaser, ]:

any recursive order

order the columns by increasing cardinality (small to LARGE)

Proposition

The expected number of runs is minimized.

Truth is complicated.

Assume uniformly distributed tables.
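The small-to-large heuristic can be tried on an idealized uniform table: a full cross product of a 10-value column and a 2-value column, which is an assumption made for the example. The table is sorted lexicographically under both column orders and the runs are counted.

```python
from itertools import product


def count_runs(table):
    """Total number of runs of identical values, summed over all columns."""
    runs = 0
    for column in zip(*table):
        runs += sum(1 for k, v in enumerate(column) if k == 0 or v != column[k - 1])
    return runs


def runs_after_sort(table, column_order):
    """Reorder the columns, sort the rows lexicographically, count the runs."""
    reordered = sorted(tuple(row[c] for c in column_order) for row in table)
    return count_runs(reordered)


# Column 0 has 10 distinct values, column 1 has 2.
table = list(product(range(10), range(2)))

small_first = runs_after_sort(table, (1, 0))   # cardinality 2, then 10
large_first = runs_after_sort(table, (0, 1))   # cardinality 10, then 2
```

Putting the low-cardinality column first gives 22 runs against 30 the other way around. Real tables are skewed and dependent, so the gap can differ in practice.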


Page 30: Compressing column-oriented indexes

What about non-uniform or dependent columns?

Real columns have skewed distributions [Missaoui et al., 2007] and they are statistically dependent.

It can impact column ordering in unpredictable ways.


Page 31: Compressing column-oriented indexes

Take away messages

Column stores are good because of RLE and sorting;

Lexicographical sort with right column order is good;

More exotic sorting (such as Hilbert) might be bad.


Page 32: Compressing column-oriented indexes

Future direction?

Need better mathematical modelling of skewed and dependent columns;

New column-oriented indexes?

Better ways to sort?


Page 33: Compressing column-oriented indexes

Questions?

?


Page 34: Compressing column-oriented indexes

Hamilton, C. H. and Rau-Chaplin, A. (2007). Compact Hilbert indices: Space-filling curves for domains with unequal side lengths. Information Processing Letters, 105(5):155–163.

Knuth, D. E. (2005). The Art of Computer Programming, volume 4, chapter fascicle 2. Addison Wesley.

Lemire, D. and Kaser, O. Reordering columns for smaller indexes. In preparation, available from http://arxiv.org/abs/0909.1346.

Lemire, D., Kaser, O., and Aouiche, K. (2009). Sorting improves word-aligned bitmap indexes. To appear in Data & Knowledge Engineering, preprint available from http://arxiv.org/abs/0901.3751.


Page 35: Compressing column-oriented indexes

Missaoui, R., Goutte, C., Choupo, A. K., and Boujenoui, A. (2007). A probabilistic model for data cube compression and query approximation. In DOLAP, pages 33–40.

O’Neil, P. and Quass, D. (1997). Improved query performance with variant indexes. In SIGMOD ’97, pages 38–49.

O’Neil, P. E. (1989). Model 204 architecture and performance. In 2nd International Workshop on High Performance Transaction Systems, pages 40–59.

Stonebraker, M., Abadi, D. J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., Lau, E., Lin, A., Madden, S., O’Neil, E., O’Neil, P., Rasin, A., Tran, N., and Zdonik, S. (2005). C-store: a column-oriented DBMS. In VLDB’05, pages 553–564.

Page 36: Compressing column-oriented indexes

Turner, M. J., Hammond, R., and Cotton, P. (1979). A DBMS for large statistical databases. In VLDB’79, pages 319–327.

Wu, K., Otoo, E. J., and Shoshani, A. (2006). Optimizing bitmap indices with efficient compression. ACM Transactions on Database Systems, 31(1):1–38.

Zaker, M., Phon-Amnuaisuk, S., and Haw, S. (2008). An adequate design for large data warehouse systems: Bitmap index versus b-tree index. IJCC, 2(2).
