probabilistic data structures in real life

PROBABILISTIC DATA STRUCTURES IN REAL LIFE Valentin Bazarevsky

Post on 16-Feb-2017


TRANSCRIPT

WHO ARE THEY?

Bloom Filter
LogLog Family
MinHash

BUSINESS CASE: ESTIMATE YOUR AUDIENCE

SEGMENT BUILDER

15 TB of transactional data
4 h SLA

POSSIBLE SOLUTIONS

Brute force (15 TB of transactional data)
Sampling (1% of users => 1.2 MB / b.o.)
Magic tool (?!)

Estimator
HyperLogLog can estimate the cardinality of sets with more than 1 000 000 000 unique elements with 1% error, and requires only 4 KB of memory

50 000 000 basic operations

OOPS…

Supports only Unions

But we need Intersections, Subtractions, and Not operators

HYPERLOGLOG INTUITION

00101010101010001111010101101 => a[2] = 0
10010101010100101010101001011 => a[9] = 1
00000101010100101010101110101 => a[0] = 1
01010101010100100101010101010 => a[5] = 1

01010000000000000000000000010 => a[5] = 23
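The bit strings on this slide suggest the update rule: the first 4 bits of a hash pick a register, and the register keeps the longest run of leading zeros seen in the remaining bits. Below is a minimal Python sketch of that reading. The names `update`, `merge`, and `INDEX_BITS` are illustrative and not from the talk, and the final cardinality estimate is left out. `merge` shows why unions come for free: two sketches combine by an element-wise maximum, and no comparable operation exists for intersections.

```python
INDEX_BITS = 4  # assumption read off the slide: the first 4 bits select the register


def update(registers, hash_bits):
    """Fold one hash, given as a string of '0'/'1' characters, into the registers."""
    index = int(hash_bits[:INDEX_BITS], 2)
    rest = hash_bits[INDEX_BITS:]
    # Number of leading zeros in the bits that follow the index.
    leading_zeros = len(rest) - len(rest.lstrip("0"))
    # A register only ever grows: it remembers the rarest pattern seen so far.
    registers[index] = max(registers[index], leading_zeros)


def merge(left, right):
    """Sketch of the union of two sets: element-wise maximum of their registers."""
    return [max(a, b) for a, b in zip(left, right)]


registers = [0] * (1 << INDEX_BITS)
slide_hashes = [
    "00101010101010001111010101101",
    "10010101010100101010101001011",
    "00000101010100101010101110101",
    "01010101010100100101010101010",
    "0101" + "0" * 23 + "10",  # the last hash on the slide, written out explicitly
]
for h in slide_hashes:
    update(registers, h)

print(registers[2], registers[9], registers[0], registers[5])
```

Register 5 first receives the value 1 and is then overwritten by 23, which matches the two `a[5]` lines on the slide.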

INCLUSION-EXCLUSION PRINCIPLE
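For two sets, the inclusion-exclusion principle turns an intersection into quantities a union-only estimator can supply: |A ∩ B| = |A| + |B| - |A ∪ B|. The sketch below uses exact Python sets as stand-ins for the estimators; `intersection_size` is a hypothetical helper name, not something from the talk.

```python
def intersection_size(size_a, size_b, size_union):
    """|A ∩ B| = |A| + |B| - |A ∪ B| (inclusion-exclusion for two sets)."""
    return size_a + size_b - size_union


# Exact sets stand in for cardinality estimates here.
a = {1, 2, 3, 4}
b = {3, 4, 5}
estimate = intersection_size(len(a), len(b), len(a | b))
print(estimate == len(a & b))
```

With estimated inputs, the absolute error of the result scales with the size of the union rather than the intersection, so a small intersection of two large sets can be estimated poorly.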

MINHASH

Store only the x (8192) smallest hashes of each set
Jaccard Distance
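A minimal sketch of the "keep the smallest hashes" idea, assuming a bottom-k MinHash variant. The hash function (`blake2b` truncated to 64 bits) and the names `hash64`, `sketch`, and `jaccard` are choices made for this example, not details from the talk. The estimate takes the k smallest hashes of the two sketches combined and measures what fraction of them appear in both.

```python
import hashlib
import heapq

K = 8192  # sketch size quoted on the slide


def hash64(item):
    """Map an item to a 64-bit integer (illustrative choice of hash)."""
    digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(items, k=K):
    """Return the k smallest distinct hash values of the set, in ascending order."""
    return heapq.nsmallest(k, {hash64(x) for x in items})


def jaccard(sketch_a, sketch_b, k=K):
    """Estimate |A ∩ B| / |A ∪ B| from two bottom-k sketches."""
    a, b = set(sketch_a), set(sketch_b)
    # The k smallest hashes of the combined sketches form the sketch of A ∪ B.
    union_sample = heapq.nsmallest(k, a | b)
    if not union_sample:
        return 0.0
    shared = sum(1 for h in union_sample if h in a and h in b)
    return shared / len(union_sample)


users_a = sketch(range(0, 100))
users_b = sketch(range(50, 150))
print(jaccard(users_a, users_b))
```

When both sets are smaller than k, as in this usage example, the sketches hold every hash and the result is exact. One common way to get an intersection size from this is to multiply the Jaccard estimate by an estimate of |A ∪ B|.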

UNION OF INTERSECTIONS

A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
A - B - C = A - (B ∪ C)
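The two rewrite rules on this slide, distributing an intersection over a union and folding chained subtractions into one, can be checked with ordinary Python sets. This is only a sanity check on exact sets; the sample sets and the helper names `distributes` and `folds_subtraction` are made up for the example.

```python
def distributes(a, b, c):
    """Check A ∩ (B ∪ C) == (A ∩ B) ∪ (A ∩ C)."""
    return a & (b | c) == (a & b) | (a & c)


def folds_subtraction(a, b, c):
    """Check A - B - C == A - (B ∪ C)."""
    return a - b - c == a - (b | c)


# Arbitrary sample sets, chosen only for illustration.
A = {1, 2, 3, 4, 5}
B = {4, 5, 6}
C = {1, 5, 7}
print(distributes(A, B, C), folds_subtraction(A, B, C))
```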

NOT OPERATOR

Subtraction

I WANT EVERYONE EXCEPT…

A and not B
Not A and Not B

CORNER CASES

|(A not(B)) C| => |A C|
|A ∪ not(B)| = |Everything| - |B| + |A ∩ B|
|A ∩ not(B)| => |A| - |A ∩ B|
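The two complement identities on this slide express a NOT in terms of cardinalities that need no complement sketch: the size of the universe, |B|, |A|, and |A ∩ B|. A small check with exact sets follows; the function names and the sample universe are hypothetical, chosen only for this example.

```python
def count_a_or_not_b(n_everything, n_b, n_a_and_b):
    """|A ∪ not(B)| = |Everything| - |B| + |A ∩ B|."""
    return n_everything - n_b + n_a_and_b


def count_a_and_not_b(n_a, n_a_and_b):
    """|A ∩ not(B)| = |A| - |A ∩ B|."""
    return n_a - n_a_and_b


# Illustrative universe of 20 users and two segments.
everything = set(range(20))
a = {1, 2, 3, 4, 5, 6}
b = {5, 6, 7, 8}
not_b = everything - b

print(count_a_or_not_b(len(everything), len(b), len(a & b)) == len(a | not_b))
print(count_a_and_not_b(len(a), len(a & b)) == len(a & not_b))
```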

ARCHITECTURE

ERROR RATE

Median = 5%
75th percentile = 8%