Data Structures for Big Data: Bloom Filter Vinicius Vielmo Cogo Smalltalks, DI, FC/UL. October 16, 2014.


Data Structures for Big Data:

Bloom Filter

Vinicius Vielmo Cogo

Smalltalks, DI, FC/UL. October 16, 2014.


Big Data

is relative

is not defined by a specific number of TB, PB, EB

is when it becomes big for you

is when your solutions become inefficient/impractical


Data Structures for Big Data

Traditional DSs are subject to the same problems

e.g., lists, trees

(e.g., YARN, NoSQL)

or

(e.g., index, metadata)

We have reached the point of thinking about new DSs for BD


Outline

Bloom Filter

Use Cases

Implementations

Other Filters

Other Data Structures for Big Data


Bloom Filter

Membership testing

Does my collection contain this element?


Bloom Filter

City

Coimbra

Leiria


Bloom Filter

Index i   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
bf[i]     0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

Source: http://billmill.org/bloomfilter-tutorial/

Bloom Filter

Index i   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
bf[i]     0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

City: Coimbra, Leiria

Hash Function: Fnv, Murmur

Bloom Filter

Index i   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
bf[i]     0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

City: Coimbra, Leiria

Hash Function: Fnv, Murmur

Hashing Coimbra: i=4, i=7

Bloom Filter

Index i   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
bf[i]     0  0  0  0  1  0  0  1  0  0  0  0  0  0  0

City: Coimbra, Leiria

Hash Function: Fnv, Murmur

Hashing Coimbra: i=4, i=7 (both bits are set to 1)

Bloom Filter

Index i   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
bf[i]     0  0  0  0  1  0  0  1  0  0  0  0  0  0  0

City: Coimbra, Leiria

Hash Function: Fnv, Murmur

Bloom Filter

Index i   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
bf[i]     0  0  0  0  1  0  0  1  0  0  0  0  0  0  0

City: Coimbra, Leiria

Hash Function: Fnv, Murmur

Hashing Leiria: i=2, i=9

Bloom Filter

Index i   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
bf[i]     0  0  1  0  1  0  0  1  0  1  0  0  0  0  0

City: Coimbra, Leiria

Hash Function: Fnv, Murmur

Hashing Leiria: i=2, i=9 (both bits are set to 1)

Bloom Filter

Index i   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
bf[i]     0  0  1  0  1  0  0  1  0  1  0  0  0  0  0

City: Coimbra, Leiria

Hash Function: Fnv, Murmur
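The insertion steps on the preceding slides can be replayed in a few lines of Python. This is only a sketch: the index pairs are copied from the slides rather than computed, because the exact Fnv and Murmur variants are not stated, and the names `SLIDE_INDICES` and `insert` are made up for illustration.

```python
# Replay of the insertion walkthrough: a 15-bit bitmap, two indices per city.
# The index pairs are copied from the slides (not computed), since the exact
# Fnv/Murmur variants used there are unknown.
M = 15  # bitmap size shown on the slides

SLIDE_INDICES = {  # hypothetical lookup table standing in for the two hashes
    "Coimbra": (4, 7),
    "Leiria": (2, 9),
}


def insert(bf, city):
    """Set the bit at every index the (stand-in) hash functions return."""
    for i in SLIDE_INDICES[city]:
        bf[i] = 1


bf = [0] * M
insert(bf, "Coimbra")
after_coimbra = bf.copy()  # snapshot of the state after the first insertion
insert(bf, "Leiria")

print(after_coimbra)
print(bf)
```

Note that the filter never stores the city names themselves, only the bits they map to.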

Bloom Filter

City

Braga

Guarda

Coimbra

Lisboa


Bloom Filter

Index i   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
bf[i]     0  0  1  0  1  0  0  1  0  1  0  0  0  0  0

City: Braga, Guarda, Coimbra, Lisboa

Hash Function: Fnv, Murmur

Query Braga: i=10, i=14

Result: false

Bloom Filter

Index i   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
bf[i]     0  0  1  0  1  0  0  1  0  1  0  0  0  0  0

City: Braga, Guarda, Coimbra, Lisboa

Hash Function: Fnv, Murmur

Query Guarda: i=2, i=12

Result: false

Bloom Filter

Index i   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
bf[i]     0  0  1  0  1  0  0  1  0  1  0  0  0  0  0

City: Braga, Guarda, Coimbra, Lisboa

Hash Function: Fnv, Murmur

Query Coimbra: i=4, i=7

Result: true

Bloom Filter

Index i   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
bf[i]     0  0  1  0  1  0  0  1  0  1  0  0  0  0  0

City: Braga, Guarda, Coimbra, Lisboa

Hash Function: Fnv, Murmur

Query Lisboa: i=7, i=9

Result: true (but it is a false positive)
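The four queries can be replayed the same way. The sketch below assumes the bitmap state and the index pairs exactly as they appear on the slides; `QUERY_INDICES`, `might_contain` and the `inserted` set are illustrative names, and the set exists only to label the false positive (a real filter keeps no such ground truth).

```python
# Bitmap state after inserting Coimbra and Leiria, as shown on the slides.
bf = [0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0]

# Index pairs copied from the query slides (stand-ins for Fnv and Murmur).
QUERY_INDICES = {
    "Braga": (10, 14),
    "Guarda": (2, 12),
    "Coimbra": (4, 7),
    "Lisboa": (7, 9),
}
inserted = {"Coimbra", "Leiria"}  # ground truth, used only to spot false positives


def might_contain(bf, city):
    """True only if every probed bit is set; a single 0 proves absence."""
    return all(bf[i] == 1 for i in QUERY_INDICES[city])


results = {city: might_contain(bf, city) for city in QUERY_INDICES}
false_positives = [c for c, hit in results.items() if hit and c not in inserted]

print(results)
print(false_positives)
```

Guarda shows why one matching bit is not enough: bit 2 was set by Leiria, but bit 12 is still 0, so the answer is a definite "no". Lisboa hits bit 7 (set by Coimbra) and bit 9 (set by Leiria), which is how a false positive arises.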

Bloom Filter

DS proposed by Burton Howard Bloom in 1970

Design principles

Space-efficient
  Smaller than the original dataset

Time-efficient
  Low latency R/W
  O(k) per insert or lookup, where k (the number of hash functions) is much smaller than the collection size n
  High throughput

Probabilistic
  E.g., myCollection.mightContain(myObject)
  False positives happen (but in a configurable way)
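A minimal sketch of these design principles in Python follows. It is not the implementation used in the talk: the class and method names are hypothetical, and instead of Fnv and Murmur it derives the k indexes from a single SHA-256 digest by double hashing, which keeps the example within the standard library.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: m bits, k probes per element."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # 1 bit per slot, packed into bytes

    def _indexes(self, item):
        # Double hashing: derive k indexes from two 64-bit halves of one digest.
        # This stands in for the slides' Fnv and Murmur functions.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd, hence never zero
        return [(h1 + j * h2) % self.m for j in range(self.k)]

    def add(self, item):
        for i in self._indexes(item):  # O(k) work, independent of n
            self.bits[i // 8] |= 1 << (i % 8)

    def might_contain(self, item):
        return all(self.bits[i // 8] & (1 << (i % 8)) for i in self._indexes(item))


bf = BloomFilter(m=1024, k=3)
for city in ("Coimbra", "Leiria"):
    bf.add(city)

print(bf.might_contain("Coimbra"))  # inserted elements are always reported
print(bf.might_contain("Braga"))    # usually False, but may be a false positive
```

The 1024-slot filter occupies 128 bytes regardless of how long the inserted strings are, which is where the space efficiency comes from.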

Bloom Filter

Important variables

n = Expected collection size
m = Bitmap size
k = Optimal number of hash functions
p = False positive rate (e.g., 0.0001% or 1 in 1M)

City: Coimbra, Leiria

Hash Function: Fnv, Murmur

Index i   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
bf[i]     0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

Bloom Filter

Important variables

[Slide figure: equations relating the four variables]
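For reference, the textbook relations among these variables are given below, with n the expected collection size, m the bitmap size in bits, k the number of hash functions and p the false positive rate. They assume k independent, uniformly distributed hash functions, and the slide's own notation may have differed.

```latex
\begin{align}
p &\approx \left(1 - e^{-kn/m}\right)^{k} \\
m &= -\frac{n \ln p}{(\ln 2)^{2}} \\
k &= \frac{m}{n} \ln 2 = -\log_{2} p
\end{align}
```

The first line gives the false positive rate for any choice of k; the second and third give the smallest bitmap and the matching number of hash functions for a target p.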

Bloom Filter

Users define two of them (normally n and any other)

The other two are calculated with those equations

Interesting relations:

Bigger collection (n) → Larger bitmap (m)

Bigger collection (n) → More false positives (p)

Larger bitmap (m) → Fewer false positives (p)

Larger bitmap (m) → Fewer hash functions (k)

Fewer hash functions (k)
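As an illustration of defining two variables and computing the other two, the sketch below takes n and p and derives m and k from the standard formulas m = -n ln p / (ln 2)^2 and k = (m/n) ln 2. The function names are hypothetical, and rounding k to the nearest integer is a simplifying assumption.

```python
import math


def size_bloom_filter(n, p):
    """Return (m, k) for n expected elements and target false positive rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))  # bitmap size in bits
    k = max(1, round(m / n * math.log(2)))                # number of hash functions
    return m, k


def expected_fp_rate(n, m, k):
    """Approximate false positive rate once n elements have been inserted."""
    return (1 - math.exp(-k * n / m)) ** k


n = 1_000_000
m, k = size_bloom_filter(n, p=1e-6)  # the "1 in 1M" rate from the earlier slide
print(m, k, expected_fp_rate(n, m, k))
```

For one million elements at a rate of 1 in 1M, this works out to roughly 29 bits per element and 20 hash functions, i.e. about 3.6 MB of bitmap.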

Bloom Filter

[Figure: Bloom filter size vs. False positive rate]
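The trade-off between filter size and false positive rate can be tabulated with a short sketch. The collection size and the bits-per-element values below are arbitrary choices for illustration, not the data behind the slide's plot, and `fp_rate` is a hypothetical helper based on the standard approximation.

```python
import math


def fp_rate(n, m):
    """False positive rate for n elements in m bits, using the best integer k."""
    k = max(1, round(m / n * math.log(2)))
    return (1 - math.exp(-k * n / m)) ** k


n = 1_000_000  # hypothetical collection size
rates = [(bits, fp_rate(n, bits * n)) for bits in (4, 8, 16, 32)]

for bits_per_element, p in rates:
    print(f"{bits_per_element:>2} bits/element -> p = {p:.2e}")
```

Each doubling of the bitmap lowers the false positive rate by more than an order of magnitude in this range.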

Use Cases

Reducing unnecessary disk reads

[Figure: a Client queries a BloomFilter held in RAM before reading the Dataset on the Hard Disk]

1? The filter answers negatively: the client gets "No" without any disk read

2? The filter answers positively: read(2) is necessary and returns 2

3? The filter answers positively (false positive): read(3) is unnecessary and returns "No"
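The flow in this use case can be sketched as follows. The filter answers and the disk contents are hard-coded stand-ins taken from the three lookups in the figure; `FILTER_ANSWER`, `DISK`, `read_from_disk` and `lookup` are hypothetical names, and a real system would consult an actual Bloom filter in RAM.

```python
# Stand-ins for the figure: filter answers and disk contents are hard-coded.
FILTER_ANSWER = {1: False, 2: True, 3: True}  # 3 is a false positive
DISK = {2: "2"}                                # only key 2 is really stored

disk_reads = []


def read_from_disk(key):
    disk_reads.append(key)  # record every (slow) disk access
    return DISK.get(key)


def lookup(key):
    if not FILTER_ANSWER[key]:  # definite miss: answered from RAM only
        return None
    return read_from_disk(key)  # maybe present: the disk decides


answers = {key: lookup(key) for key in (1, 2, 3)}

print(answers)
print(disk_reads)
```

Only two of the three lookups reach the disk: the read for key 2 is necessary, while the read for key 3 is the price of the false positive. Key 1 never touches the disk.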

Use Cases

Google BigTable, Apache Cassandra and HBase
  Reducing disk lookups

Google Chrome
  Look up a list of known malicious URLs

Bitcoin
  Get only the transactions relevant to your wallet

Others

In my Ph.D. work
  Look up a list of known privacy-sensitive DNA sequences

Implementations

guava-libraries https://code.google.com/p/guava-libraries/

Orestes-Bloomfilter https://github.com/Baqend/Orestes-Bloomfilter

java-bloomfilter https://github.com/magnuss/java-bloomfilter

java-longfastbloomfilter https://code.google.com/p/java-longfastbloomfilter/


Other Filters

Counting Bloom filters
  Allow deletions (use a 4-bit counter instead of 1 bit)

Buffered Bloom filters
  Sub-filters in SSD with buffered R/W exploiting bit locality

Quotient and Cascade filters
  Use an SSD, instead of the main memory, for scalability
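To illustrate how counters make deletion possible, here is a minimal counting Bloom filter sketch. The class is hypothetical: it caps each counter at 15 to mimic a 4-bit counter but stores plain Python integers for readability, and it uses SHA-256 double hashing as a stand-in for whatever hash functions a real implementation would choose.

```python
import hashlib


class CountingBloomFilter:
    """Sketch of a counting Bloom filter with saturating 4-bit counters."""

    MAX_COUNT = 15  # largest value a 4-bit counter can hold

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.counters = [0] * m  # one int per slot for clarity, not packed

    def _indexes(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd, hence never zero
        return [(h1 + j * h2) % self.m for j in range(self.k)]

    def add(self, item):
        for i in self._indexes(item):
            if self.counters[i] < self.MAX_COUNT:  # saturate, never overflow
                self.counters[i] += 1

    def remove(self, item):
        # Only valid for items that were added; removing others corrupts the filter.
        for i in self._indexes(item):
            if 0 < self.counters[i] < self.MAX_COUNT:  # saturated counters stay put
                self.counters[i] -= 1

    def might_contain(self, item):
        return all(self.counters[i] > 0 for i in self._indexes(item))


cbf = CountingBloomFilter(m=1024, k=3)
cbf.add("Coimbra")
cbf.add("Leiria")
cbf.remove("Coimbra")
print(cbf.might_contain("Leiria"))  # still reported after the other item is removed
```

Decrementing instead of clearing is what keeps Leiria visible: a slot shared by both cities drops from 2 to 1 rather than to 0. The cost is four times the memory of a plain bitmap.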

Other DSs (and techniques) for Big Data

Locality-sensitive hashing (LSH)
  Hashing similar elements into the same bucket with high probability

HyperLogLog for computing cardinality
  Counting the number of distinct elements in a collection

Log-Structured Merge (LSM) trees
  Indexed access to files with high insert volume and background batch synchronization

Thank you!

Vinicius Vielmo Cogo

Smalltalks, DI, FC/UL. October 16, 2014.