Approximate Encoding for Direct Access and Query Processing over Compressed Bitmaps

Tan Apaydin – The Ohio State University
Guadalupe Canahuate – The Ohio State University
Hakan Ferhatosmanoglu – The Ohio State University
Ali Saman Tosun – University of Texas at San Antonio



Presentation Outline

Motivation
Goal
Approximate Bitmaps (AB) encoding
AB example
Theoretical analysis
Experiments and Results
Conclusion

Motivation

Bitmap indices
  Data warehouses
  Scientific data
  Visualization applications
  Bitwise operations
Bitmap compression
  Run-length encoders
    Word Aligned Hybrid (WAH)
    Byte-aligned Bitmap Code (BBC)

Motivation

In a compressed bitmap, the row numbers no longer correspond to the bit positions in the bitmap
  Queries over a few particular rows are as expensive as queries asking for all the rows
Commonly, users are only interested in a small subset of the dataset at a time. For example:
  A query over the transactions of the last 7 days
  Spatial queries over objects in a specific geographical area

Motivation

Visualization applications
  Millions of different readings ordered by their geographic location
  Users ask range queries over some of the readings for a given area
  The answers are highlighted on the screen
  Several degrees of resolution make approximate answers acceptable

Our Goal

Enable direct access over any subset of the bitmap

Achieve effective compression
Maintain bitwise operations for query execution
Trade-off efficiency vs. accuracy
  No false negatives

The approach

Our solution is inspired by Bloom filters
  A $2^m$-bit array indexed using k independent hash functions
  A data object is inserted by setting the k positions in the array corresponding to the hash values of the object
  False positives can happen, but false negatives cannot
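The Bloom filter behaviour described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name, the use of salted SHA-256 to obtain k hash values, and the sample keys are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over a 2**m-bit array with k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.size = 1 << m            # 2**m bits
        self.k = k
        self.bits = [0] * self.size

    def _positions(self, item: str):
        # Assumption: k hash values are derived by salting SHA-256 with the
        # hash function index; any k independent hash functions would do.
        for idx in range(self.k):
            digest = hashlib.sha256(f"{idx}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        # Insertion sets the k positions given by the hash values.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # True may be a false positive; False is never a false negative.
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=10, k=3)
inserted = ["row1", "row2", "row3"]   # hypothetical keys
for key in inserted:
    bf.add(key)
```

Every inserted key is reported as present; a key that was never inserted is usually, but not always, reported as absent.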

Approximate Bitmaps (AB)

A Bloom filter-like structure
Only the set bits are inserted into the AB
Three levels of encoding: per table, per attribute, per bitmap column
Parameters:
  The hash string mapping function, F
  The k hash functions, {H1(x), …, Hk(x)}
  The size of the AB, $n = \alpha s = 2^m$
Precision in terms of α and k: $\approx 1 - (1 - e^{-k/\alpha})^k$

AB Example

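| Row | A1 (1) | A2 (2) | A3 (3) | B1 (4) | B2 (5) | B3 (6) | C1 (7) | C2 (8) | C3 (9) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 2 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 5 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 6 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 7 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 8 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |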

A bitmap table for a dataset with 8 rows and 3 attributes. Each attribute is divided into 3 categories.

Bitmap table size: 72 bits
Number of set bits: 24
F(i,j) = concatenate(i,j) = x
H1(x) = x mod 32
m = 5
AB size: $2^5$ = 32 bits

AB Example - Insertion

Initially all bits in the AB are zero

AB (positions 0 to 31): 00000000000000000000000000000000

AB Example - Insertion

To insert the set bit in (1,1):
  x = 11
  H(11) = 11 mod 32 = 11
  AB(11) = 1

AB (positions 0 to 31): 00000000000100000000000000000000

AB Example - Insertion

To insert the set bit in (5,4):
  x = 54
  H(54) = 54 mod 32 = 22
  AB(22) = 1

AB (positions 0 to 31): 00000000000100000000001000000000

AB Example - Insertion

After all insertions

AB (positions 0 to 31): 01110100100100101101001001001100
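The insertion walk-through can be reproduced with a short sketch. It assumes the example's parameters (F as decimal concatenation of row and column, a single hash function H(x) = x mod 32); the names `TABLE`, `build_ab`, and `ab_string` are illustrative, not from the slides.

```python
# Bitmap table from the example: one string per row, columns A1..C3.
TABLE = [
    "100001001",  # row 1
    "010010010",  # row 2
    "001100100",  # row 3
    "001001001",  # row 4
    "100100100",  # row 5
    "100010100",  # row 6
    "010010010",  # row 7
    "001001001",  # row 8
]
M = 5
AB_SIZE = 1 << M  # 2**5 = 32 bits


def F(i: int, j: int) -> int:
    """Hash string: decimal concatenation of 1-based row i and column j."""
    return int(f"{i}{j}")


def H(x: int) -> int:
    """The single hash function of the example."""
    return x % AB_SIZE


def build_ab(table):
    """Insert every set bit of the table into an initially all-zero AB."""
    ab = [0] * AB_SIZE
    for i, row in enumerate(table, start=1):
        for j, bit in enumerate(row, start=1):
            if bit == "1":
                ab[H(F(i, j))] = 1
    return ab


ab = build_ab(TABLE)
ab_string = "".join(str(b) for b in ab)
print(ab_string)  # 01110100100100101101001001001100
```

The 24 set bits collapse onto 14 distinct AB positions, which is where the false positives in the analysis come from.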

AB Example - Analysis

The underlined positions are false positives

Only 8 out of the 48 zeros are set in the AB

AB (positions 0 to 31): 01110100100100101101001001001100

Estimated precision:
  α = AB size / set bits = 32/24 = 1.33
  k = 1
  $FP = 1 - e^{-k/\alpha}$
  $P = 1 - FP = 1 - (1 - e^{-1/1.33}) \approx 47\%$
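As a check on this analysis, the sketch below counts the zero cells whose AB position is set and evaluates the estimated precision for the example's parameters. The helper names are illustrative, and the table is restated so the block runs on its own.

```python
import math

TABLE = [
    "100001001", "010010010", "001100100", "001001001",
    "100100100", "100010100", "010010010", "001001001",
]
AB_SIZE = 32


def position(i: int, j: int) -> int:
    """AB position of cell (i, j): concatenate row and column, then mod 32."""
    return int(f"{i}{j}") % AB_SIZE


cells = [(i, j, bit)
         for i, row in enumerate(TABLE, start=1)
         for j, bit in enumerate(row, start=1)]

# Positions set in the AB after inserting every set bit of the table.
set_positions = {position(i, j) for i, j, bit in cells if bit == "1"}

# A zero cell is a false positive if its AB position happens to be set.
zero_cells = [(i, j) for i, j, bit in cells if bit == "0"]
false_positives = [(i, j) for i, j in zero_cells
                   if position(i, j) in set_positions]

set_bits = sum(1 for _, _, bit in cells if bit == "1")
alpha = AB_SIZE / set_bits                     # 32 / 24
k = 1
estimated_fp = (1 - math.exp(-k / alpha)) ** k
estimated_precision = 1 - estimated_fp

print(len(false_positives), "of", len(zero_cells), "zeros are set in the AB")
print(f"estimated precision: {estimated_precision:.2f}")  # 0.47
```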

AB Example - Retrieval

Consider this query, asking for 4 rows. This is a range query over 4 rows, where the third attribute falls into C1 or C2.

AB (positions 0 to 31): 01110100100100101101001001001100

Row 4:
  (4,7): H(47) = 15, AB(15) = 0
  (4,8): H(48) = 16, AB(16) = 1
Row 5:
  (5,7): H(57) = 25, AB(25) = 1, stop

AB Example - Retrieval

Row 6:
  (6,7): H(67) = 3, AB(3) = 1, stop

Approx Query Answer: {1,1,1,0}

Exact Answer: {0,1,1,0}
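The per-row probing shown in the retrieval steps can be sketched as below. The function names are illustrative, and the usage covers only rows 4 to 6, the rows whose probes are spelled out in the transcript. For each row the queried columns are probed in order, and probing stops at the first set AB bit.

```python
TABLE = [
    "100001001", "010010010", "001100100", "001001001",
    "100100100", "100010100", "010010010", "001001001",
]
AB_SIZE = 32


def position(i: int, j: int) -> int:
    """AB position of cell (i, j) under the example's F and H."""
    return int(f"{i}{j}") % AB_SIZE


def build_ab(table):
    ab = [0] * AB_SIZE
    for i, row in enumerate(table, start=1):
        for j, bit in enumerate(row, start=1):
            if bit == "1":
                ab[position(i, j)] = 1
    return ab


def query_rows(ab, rows, columns):
    """Approximate answer: 1 if any queried column of the row hits a set AB bit."""
    answer = []
    for i in rows:
        hit = 0
        for j in columns:
            if ab[position(i, j)]:
                hit = 1
                break          # "stop": one hit is enough for an OR query
        answer.append(hit)
    return answer


def exact_rows(table, rows, columns):
    """Exact answer read directly from the uncompressed bitmap table."""
    return [int(any(table[i - 1][j - 1] == "1" for j in columns)) for i in rows]


ab = build_ab(TABLE)
rows, columns = [4, 5, 6], [7, 8]          # C1 or C2 for rows 4..6
approx = query_rows(ab, rows, columns)
exact = exact_rows(TABLE, rows, columns)
print(approx, exact)  # [1, 1, 1] [0, 1, 1]
```

Row 4 is a false positive caused by position 16, which was set by cell (1,6). No row that truly qualifies is ever missed.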

Approximate Bitmaps (AB) – Mapping Function F

F maps each cell in the bitmap table to a unique string (the hashing string)

For one AB per table and one AB per attribute, the bit in row i, column j is identified by F(i,j) = i << w || j, where w is large enough to accommodate all j

For one AB per column, the bit in row i is identified by F(i,j) = i
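A minimal sketch of the two mapping functions follows. It reads `i << w || j` as "shift i left by w bits and place j in the low w bits", which is an interpretation of the slide's notation; `make_table_mapping` and `column_mapping` are hypothetical names.

```python
def make_table_mapping(num_columns: int):
    """Return F(i, j) = (i << w) | j with w wide enough to hold every j."""
    w = max(1, num_columns.bit_length())   # bits needed for the largest column id

    def F(i: int, j: int) -> int:
        return (i << w) | j

    return F, w


def column_mapping(i: int, j: int) -> int:
    """One AB per column: the row number alone identifies the bit."""
    return i


F, w = make_table_mapping(9)               # 9 columns, as in the AB example
keys = {F(i, j) for i in range(1, 9) for j in range(1, 10)}
print(w, F(1, 1), F(5, 4), len(keys))
```

Because j never overflows its w bits, every cell of the table receives a distinct hash string.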

Approximate Bitmaps (AB) – Hash Functions

Single hash function
  Called once and the result is divided into pieces
  Each piece is considered the value of a different hash function
  Secure Hash Algorithm (SHA), developed by the National Institute of Standards and Technology (NIST)
Multiple hash functions
  Independent hash functions
  For large number, similar performance

| Hash function | H0 | H1 | H2 | ... | H9 |
|---|---|---|---|---|---|
| Bits | 159..144 | 143..128 | 127..112 | ... | 15..0 |
| SHA output | 0100100010001010 | 1000010100100001 | 0111100011100010 | ... | 0000010101110011 |
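The single-hash-function scheme can be illustrated with SHA-1 from Python's standard library: one 160-bit digest is cut into ten 16-bit pieces, H0 taking bits 159..144. How the hash string is serialized (`str(x).encode()`) and how each piece is reduced to an m-bit AB position (keeping the low m bits) are assumptions of this sketch, not details given on the slide.

```python
import hashlib


def sha1_hashes(x: int, k: int, m: int) -> list[int]:
    """Derive k AB positions from a single SHA-1 call on hash string x."""
    if not (1 <= k <= 10 and 1 <= m <= 16):
        raise ValueError("SHA-1 yields at most ten 16-bit pieces")
    digest = hashlib.sha1(str(x).encode()).digest()   # 20 bytes = 160 bits
    value = int.from_bytes(digest, "big")
    positions = []
    for idx in range(k):
        shift = 160 - 16 * (idx + 1)        # H0 -> bits 159..144, H9 -> 15..0
        piece = (value >> shift) & 0xFFFF   # one 16-bit piece
        positions.append(piece & ((1 << m) - 1))   # assumption: keep low m bits
    return positions


positions = sha1_hashes(x=54, k=4, m=10)
print(positions)
```

Calling the hash once and slicing its output avoids evaluating k separate hash functions per probe.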

Approximate Bitmaps (AB) – FP Rate

FP Rate: Probability that all k bits are set by another data object

n is the size of the AB
s is the number of set bits
n = αs, α = n/s

$$FP = \left(1 - \left(1 - \frac{1}{n}\right)^{ks}\right)^{k} \approx \left(1 - e^{-ks/n}\right)^{k} = \left(1 - e^{-k/\alpha}\right)^{k}$$

[Figure: FP rate (log scale, 0.00001 to 1) vs. k (1 to 19) for α = 4, 8, 16, 32]

[Figure: FP rate (0 to 1) vs. α (1 to 19) for k = 1, 2, 3, 4, 5]
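The false-positive rate can be evaluated numerically with a few lines of Python. The sketch uses the standard Bloom filter expression and its exponential approximation in terms of α and k; the function names and the search bound `k_max` are illustrative choices.

```python
import math


def fp_rate_exact(n: int, s: int, k: int) -> float:
    """Probability that all k probed bits were set by other insertions."""
    return (1 - (1 - 1 / n) ** (k * s)) ** k


def fp_rate(alpha: float, k: int) -> float:
    """Approximation in terms of alpha = n / s."""
    return (1 - math.exp(-k / alpha)) ** k


def best_k(alpha: float, k_max: int = 64) -> int:
    """Integer k in 1..k_max that minimizes the approximate FP rate."""
    return min(range(1, k_max + 1), key=lambda k: fp_rate(alpha, k))


for alpha in (4, 8, 16, 32):
    k = best_k(alpha)
    print(f"alpha={alpha:2d}  best k={k:2d}  FP={fp_rate(alpha, k):.6f}")
```

For a fixed α the rate first falls and then rises again as k grows, so a larger k is not always better.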

Approximate Bitmaps (AB) – Size

In terms of α: $n = \alpha s$, $m = \lceil \log_2(\alpha s) \rceil$

One AB per dataset: s = |A|*N

One AB per attribute: s = N

One AB per column: s depends on the data distribution
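The size rules can be turned into a small calculation. The sketch assumes |A| is the number of attributes and N the number of rows, so that each row contributes one set bit per attribute; the values N = 100,000, |A| = 2 and α = 8 are chosen only for illustration.

```python
import math


def ab_size(alpha: float, s: int) -> tuple[int, int]:
    """Return (m, n) with m = ceil(log2(alpha * s)) and n = 2**m bits."""
    m = math.ceil(math.log2(alpha * s))
    return m, 1 << m


N = 100_000            # rows (illustrative)
num_attributes = 2     # |A| (illustrative)
alpha = 8

# One AB per dataset: every row sets one bit per attribute.
m_dataset, n_dataset = ab_size(alpha, num_attributes * N)

# One AB per attribute: every row sets exactly one bit.
m_attr, n_attr = ab_size(alpha, N)

print(m_dataset, n_dataset)   # 21 2097152
print(m_attr, n_attr)         # 20 1048576
```

Because n is rounded up to a power of two, the effective ratio n/s can be larger than the requested α.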

Experimental Setup

Three datasets:

| Dataset | Rows | Attributes | Columns |
|---|---|---|---|
| Uniform | 100,000 | 2 | 100 |
| Landsat | 275,465 | 60 | 900 |
| HEP | 2,173,762 | 6 | 66 |

Query by sampling (randomly selecting the columns queried)

Varying the number of rows queried from 100 to 10K

Experimental Results - Size

Always use the maximum α that produces an AB smaller than or comparable in size to the WAH bitmap

[Figures: bitmap size (K words) vs. α (2, 4, 8, 16) for the Uniform, HEP, and Landsat datasets; series: WAH, AB per dataset, AB per attribute, AB per column]

Experimental Results - Precision

Precision vs. # of Hash Functions

[Figure: precision (0 to 1) vs. number of hash functions k (1 to 10); series: uniform α=16, landsat α=8, hep α=4, hep α=8]

As α increases, the precision increases steadily and is very close to 1 for larger α

Precision increases as k increases up to the optimum point, because a large number of hash functions produces more collisions

Experimental Results – Exec Time

[Figure: execution time (msec) vs. number of rows queried (0 to 10,000); series: WAH and AB on the Uniform, Landsat, and HEP datasets]

Execution time of the AB depends on the number of rows queried, not on the number of rows in the dataset

For queries over fewer than 10% to 15% of the rows, AB execution is up to 3 orders of magnitude faster than WAH

Conclusion

AB encoding approximates the bitmaps using multiple hashing of the set bits

Allows efficient retrieval of any subset of rows and columns

Trade-off between bitmap size and precision
Three levels of encoding
Approximate query answers are given without database access

Questions and Comments

Thank you!
