owf14 - big data track : abstract algebra for analytics
DESCRIPTION
Sam BESSALAH
Algebird is an abstract algebra library for Scala developed at Twitter and released under the ASL 2.0 license. It supports algebraic structures such as semigroups, monoids, groups, rings and fields, as well as standard functional abstractions like monads. More interesting, though, are the probabilistic data structures and the accompanying monoids that come out of the box. I'll talk a bit about Algebird in general and how it eases building large-scale analytics systems, whether on MapReduce or in a stream processing context.
TRANSCRIPT
Abstract Algebra for Analytics
Sam BESSALAH
@samklr
What do we want?
• We want to build scalable systems.
• Preferably by leveraging distributed computing
• A lot of analytics amount to counting or adding in some sort of way.
• Example : Finding TopK Elements
Read Input
Sort, Filter and take top K records
Write Output
Input : 11, 12, 0, 3, 56, 48
K = 3
Output : 56, 48, 12
Hadoop Map-Reduce
In Scalding
Problems
• Curse of the last reducer
• Network chatter hinders performance
• Inefficient ordering of map and reduce steps
• Multiple jobs, with a sync barrier at the reducer
But in Scalding, « sortWithTake » uses :
Priority Queue
• Can be empty
• Two Priority Queues can be added in any order
• Associative + Commutative
PQ1 : 55, 45, 21, 3
PQ2 : 100, 80, 40, 3
K = 4
PQ1 (+) PQ2 : 100, 80, 55, 45
In a single Pass
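The priority-queue addition from the slide can be sketched in a few lines. This is a minimal illustration in plain Python, not Scalding's or Algebird's actual implementation; the names `pq_zero` and `pq_plus` are hypothetical and exist only for this example.

```python
import heapq
from functools import reduce

def pq_zero():
    """The empty priority queue: the identity element of the monoid."""
    return []

def pq_plus(a, b, k):
    """Add two top-k queues: keep the k largest values of both."""
    return heapq.nlargest(k, a + b)

# The example from the slide, with K = 4.
pq1 = [55, 45, 21, 3]
pq2 = [100, 80, 40, 3]
merged = pq_plus(pq1, pq2, k=4)

# Top-3 of the earlier input, computed chunk by chunk.
# Each chunk stands in for the data seen by one mapper.
chunks = [[11, 12, 0], [3, 56, 48]]
partials = [pq_plus(pq_zero(), chunk, k=3) for chunk in chunks]
top3 = reduce(lambda a, b: pq_plus(a, b, k=3), partials, pq_zero())
```

Because the addition is associative and commutative, the partial queues can be combined in any order and on any machine, which is what lets the whole job run in a single pass.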
Why is it better and faster?
Associativity allows parallelism
Do we have data structures that are intrinsically parallelizable?
Abstract Algebra Redux
• Semigroup
A set with an associative operation (grouping doesn’t matter)
• Monoid
Semigroup with a zero (zeros get ignored)
• Group
Monoid where every element has an inverse
• Abelian Group
Group whose operation is commutative (ordering doesn’t matter)
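To make the definitions concrete, here is a small sketch of a monoid as a zero plus an associative operation, along with a chunked sum that mimics a parallel reduce. It is written in plain Python for illustration; the `Monoid` class and `chunked_sum` helper are assumptions of this example and do not mirror Algebird's Scala type classes.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Callable, Generic, Iterable, TypeVar

T = TypeVar("T")

@dataclass(frozen=True)
class Monoid(Generic[T]):
    zero: T                       # identity element
    plus: Callable[[T, T], T]     # associative binary operation

    def sum(self, xs: Iterable[T]) -> T:
        return reduce(self.plus, xs, self.zero)

int_add = Monoid(0, lambda a, b: a + b)
int_max = Monoid(float("-inf"), max)
set_union = Monoid(frozenset(), lambda a, b: a | b)

def chunked_sum(m, xs, n_chunks):
    """Sum each chunk independently, then sum the partial results.
    Associativity guarantees the same answer as a sequential sum."""
    size = max(1, -(-len(xs) // n_chunks))  # ceiling division
    partials = [m.sum(xs[i:i + size]) for i in range(0, len(xs), size)]
    return m.sum(partials)

data = [11, 12, 0, 3, 56, 48]
total = chunked_sum(int_add, data, 3)
largest = chunked_sum(int_max, data, 4)
```

Each chunk could be summed on a different machine; only the small partial results need to travel over the network.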
Stream mining challenges
• Update predictions after every observation
• Single pass : can’t read old data or replay the stream
• Limited time for computation per observation
• Bounded memory : storing all n observations, i.e. O(n), is not an option
Existing solutions
• Knuth’s Reservoir Sampling works on an evolving stream of data, in fixed memory.
• Stream subsampling
• Adaptive sliding windows : build decision trees on these windows, e.g. Hoeffding Trees
• Use time series analysis methods …
• Etc
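As a reminder of what the sampling approach looks like, here is a minimal sketch of reservoir sampling in its classic form (often called Algorithm R). The function name and the `rng` parameter are choices made for this example.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of
    unknown length, using O(k) memory and a single pass."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the n-th item with probability k / n.
            j = rng.randrange(n)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(10_000), k=5, rng=random.Random(42))
```

The sample stays a fixed size however long the stream runs, but it only ever answers questions about the sampled items, which is the limitation the hashing-based sketches try to avoid.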
Approximate algorithms for stream analytics
Idea : Hash, don’t Sample
Bloom filters
• Approximate data structure for set membership
• Like an approximate set
BloomFilter.contains(x) => Maybe | NO
P(False Positive) > 0
P(False Negative) = 0
• Bit Array of fixed size
add(x) : for each hash function i, set b[h(x,i)] = 1
contains(x) : TRUE if b[h(x,i)] == 1 for all i.
• Bloom Filters
Adding an element uses a boolean OR
Querying uses a boolean AND
Both are Monoids
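The add, contains and merge operations can be sketched as follows. This is an illustrative toy in plain Python, not Algebird's Bloom filter: the class layout, the default sizes and the use of SHA-256 to derive the hash functions are all assumptions of this example.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3, bits=0):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bits  # a Python int used as a fixed-size bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos  # boolean OR

    def contains(self, item):
        # True means "maybe", False means "definitely not".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def merge(self, other):
        """Monoid addition: bitwise OR of two compatible filters."""
        assert (self.num_bits, self.num_hashes) == (other.num_bits, other.num_hashes)
        return BloomFilter(self.num_bits, self.num_hashes, self.bits | other.bits)

left, right = BloomFilter(), BloomFilter()
left.add("alice")
left.add("bob")
right.add("carol")
both = left.merge(right)
```

Merging two filters gives exactly the filter that would have been built from all the elements at once, so each partition of the data can build its own filter independently.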
HyperLogLog
Intuition
• Long runs of trailing 0s in a random bit string are rare
• But the more bit strings you look at, the more likely you are to find a long one
• The longest run of trailing 0-bits seen can be an estimator of the number of unique bit strings observed.
HyperLogLog
• Popular sketch for cardinality estimation
HLL.size = Approx[Number]
We know the distribution on the error.
http://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/
• HyperLogLog
Adding an element uses MAX, which is a monoid (Ordered Semigroup really ...)
Querying uses a harmonic sum : Monoid.
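A simplified HyperLogLog shows where MAX and the harmonic sum appear. This is a rough sketch under several assumptions: it uses SHA-1 as the hash, counts trailing zeros as in the intuition above, applies only the small-range correction, and is not Algebird's implementation. The class and parameter names are made up for this example.

```python
import hashlib
import math

class HyperLogLog:
    def __init__(self, p=10, registers=None):
        self.p = p                # number of bits used to pick a register
        self.m = 1 << p           # number of registers (keep m >= 128)
        self.registers = list(registers) if registers is not None else [0] * self.m

    def add(self, item):
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h & (self.m - 1)    # low p bits choose the register
        w = h >> self.p           # remaining 64 - p bits
        # rank = number of trailing zero bits in w, plus one
        rank = (64 - self.p + 1) if w == 0 else (w & -w).bit_length()
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        """Monoid addition: element-wise MAX of the registers."""
        assert self.p == other.p
        return HyperLogLog(self.p, [max(a, b) for a, b in zip(self.registers, other.registers)])

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)
        # Harmonic sum over the registers.
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)  # small-range correction
        return raw

left, right = HyperLogLog(), HyperLogLog()
for i in range(0, 6000):
    left.add(i)
for i in range(4000, 10000):
    right.add(i)
union = left.merge(right)
approx_distinct = union.estimate()  # should land near 10000
```

The two sketches overlap on 2000 elements, yet the merged sketch is identical to one built over the whole range, so duplicates across partitions are not double counted.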
Min Hash
• Estimates how similar two sets are.
• Essentially amounts to estimating
|A ∩ B| / |A ∪ B|
• Jaccard Similarity
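The idea can be sketched as follows: keep, for each of several hash functions, the minimum hash value over the set, and estimate the Jaccard similarity as the fraction of positions where two signatures agree. This is an illustrative version in plain Python with salted SHA-1 standing in for a family of hash functions; the function names and the choice of 256 hashes are assumptions of this example.

```python
import hashlib

NUM_HASHES = 256

def _hash(i, item):
    digest = hashlib.sha1(f"{i}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def signature(items, num_hashes=NUM_HASHES):
    """MinHash signature of a non-empty collection."""
    return [min(_hash(i, x) for x in items) for i in range(num_hashes)]

def merge(sig_a, sig_b):
    """Signature of the union: element-wise MIN, which is associative."""
    return [min(a, b) for a, b in zip(sig_a, sig_b)]

def similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of |A ∩ B| / |A ∪ B|."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

a = set(range(0, 200))
b = set(range(100, 300))
sig_a, sig_b = signature(a), signature(b)
estimated = similarity(sig_a, sig_b)  # true Jaccard similarity is 100 / 300
```

Since signatures combine with MIN, the signature of a large set can be built from the signatures of its partitions.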
Count min Sketch
Gives an approximation of the number of occurrences of an element in a stream.
• Count min sketch
Adding an element is a numerical addition
Querying uses a MIN function.
Both are associative.
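A small Count-Min Sketch makes the two operations visible. This is a toy version in plain Python, not Algebird's implementation; the class layout, the default width and depth, and the salted SHA-1 hashing are assumptions of this example.

```python
import hashlib

class CountMinSketch:
    def __init__(self, width=272, depth=5, table=None):
        self.width = width
        self.depth = depth
        self.table = table if table is not None else [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One column per row, each chosen by a differently salted hash.
        for row in range(self.depth):
            digest = hashlib.sha1(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count  # numerical addition

    def estimate(self, item):
        # MIN over the rows: never below the true count, possibly above it.
        return min(self.table[row][col] for row, col in self._cells(item))

    def merge(self, other):
        """Monoid addition: cell-by-cell sum of two compatible sketches."""
        assert (self.width, self.depth) == (other.width, other.depth)
        table = [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(self.table, other.table)]
        return CountMinSketch(self.width, self.depth, table)

morning, evening = CountMinSketch(), CountMinSketch()
for word in ["a", "b", "a", "c"]:
    morning.add(word)
for word in ["a", "c", "c"]:
    evening.add(word)
day = morning.merge(evening)
```

Collisions can only inflate a counter, so the estimate is an upper bound on the true count, and merging per-partition sketches yields the same table as sketching the whole stream.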
Anomaly Detection
- Online Summarizer : Approximate data structure to find quantiles in a continuous stream of data.
- Many exist : Q-Tree, Q-Digest, T-Digest
- All of those are associative.
- Another neat thing : types your data uniformly.
Many more sketches and tricks
• FM Counters, KMV
• Histograms
• Ball Sketches : streaming k-means, clustering
• SGD : fit online machine learning algorithms
Algebird
Conclusion
• Hashed data structures can be mapped onto familiar data structures like Set, Map, etc., which are easier for developers to reason about.
• As data size grows, sampling becomes painful; hashing provides a more cost-effective solution.
• Abstract algebra with sketched data is a no-brainer, and guarantees less error and better scalability of analytics systems.
http://speakerdeck.com/samklr
DON’T BE SCARED ANYMORE.
Bibliography
• Great intro to Algebird
http://www.michael-noll.com/blog/2013/12/02/twitter-algebird-monoid-monad-for-large-scala-data-analytics/
• Aggregate Knowledge
http://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/
• Probabilistic data structures for web analytics
http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
• Algebird
github.com/twitter/algebird
• Algebra for analytics
https://speakerdeck.com/johnynek/algebra-for-analytics
• http://infolab.stanford.edu/~ullman/mmds/ch3.pdf