owf14 - big data track : abstract algebra for analytics

Abstract Algebra for Analytics Sam BESSALAH @samklr

Upload: open-world-forum

Post on 03-Jul-2015


DESCRIPTION

Sam BESSALAH. Algebird is an abstract algebra library for Scala, developed at Twitter and released under the Apache License 2.0 (ASL 2.0). It supports algebraic structures such as semigroups, monoids, groups, rings and fields, as well as standard functional abstractions like monads. More interesting, though, are the probabilistic data structures and the accompanying monoids that come out of the box. I'll talk a bit about Algebird in general and how it eases building large-scale analytics systems, whether on MapReduce systems or in a stream-processing context.

TRANSCRIPT

Page 1: OWF14 - Big Data Track : Abstract Algebra for Analytics

Abstract Algebra for Analytics

Sam BESSALAH

@samklr

Page 4: OWF14 - Big Data Track : Abstract Algebra for Analytics

What do we want?

• We want to build scalable systems.

• Preferably by leveraging distributed computing.

• A lot of analytics amounts to counting or adding in one form or another.

Page 5: OWF14 - Big Data Track : Abstract Algebra for Analytics

• Example : Finding TopK Elements

Read Input

Sort, Filter and take top K records

Write Output

Input : 11, 12, 0, 3, 56, 48

K = 3

Output : 56, 48, 12
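The read, sort, take-top-K pipeline on this slide can be sketched in a few lines. This is a plain Python illustration, not the Hadoop or Scalding code from the talk; `top_k` is a hypothetical helper name.

```python
def top_k(values, k):
    """Return the k largest values, largest first, by sorting the whole input."""
    # Sorting everything is the naive approach: simple, but it needs all
    # of the data in one place before any result can be produced.
    return sorted(values, reverse=True)[:k]


print(top_k([11, 12, 0, 3, 56, 48], 3))
```

The full sort is what makes this version awkward to distribute: no partial result is useful until every record has been seen.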

Page 6: OWF14 - Big Data Track : Abstract Algebra for Analytics

Hadoop Map-Reduce

Page 8: OWF14 - Big Data Track : Abstract Algebra for Analytics

In Scalding

Page 10: OWF14 - Big Data Track : Abstract Algebra for Analytics

Problems

• Curse of the last reducer

• Network chatter, which hinders performance

• Inefficient ordering of the map and reduce steps

• Multiple jobs, with a sync barrier at the reducer

Page 13: OWF14 - Big Data Track : Abstract Algebra for Analytics

But in Scalding, « sortWithTake » uses :

• Priority Queue

• Can be empty

• Two Priority Queues can be added in any order

• Associative + Commutative

PQ1 : 55, 45, 21, 3

PQ2 : 100, 80, 40, 3

K = 4

PQ1 (+) PQ2 : 100, 80, 55, 45

In a single Pass
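The priority-queue addition on this slide can be sketched as follows. This is a minimal Python illustration under the slide's assumptions (bounded queues of size K, an empty queue as zero); it is not Scalding's actual `sortWithTake` implementation, and `plus`, `EMPTY` and `K` are illustrative names.

```python
import heapq

K = 4
EMPTY = []  # the zero element: adding it changes nothing


def plus(pq1, pq2, k=K):
    """Add two bounded priority queues: keep the k largest of both."""
    return heapq.nlargest(k, pq1 + pq2)


pq1 = [55, 45, 21, 3]
pq2 = [100, 80, 40, 3]
print(plus(pq1, pq2))
```

Because `plus` is associative and commutative, each mapper can keep its own queue of at most K elements and the partial queues can be combined in any order, which is what allows the single pass.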

Page 14: OWF14 - Big Data Track : Abstract Algebra for Analytics

Why is it better and faster?

Page 15: OWF14 - Big Data Track : Abstract Algebra for Analytics

Associativity allows parallelism
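A small Python sketch of why associativity allows parallelism: the input is split into chunks, each chunk is reduced independently, and the partial results are reduced again. `parallel_reduce` and `chunked` are hypothetical helpers, and a thread pool stands in for the workers of a real cluster.

```python
import operator
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def chunked(xs, size):
    """Split a list into consecutive chunks of at most `size` elements."""
    return [xs[i:i + size] for i in range(0, len(xs), size)]


def parallel_reduce(op, xs, zero, chunk_size=4):
    """Reduce each chunk on its own worker, then combine the partial results.

    Correct only if `op` is associative and `zero` is its identity.
    """
    chunks = chunked(xs, chunk_size)
    with ThreadPoolExecutor() as pool:
        # pool.map preserves chunk order, so commutativity is not required.
        partials = list(pool.map(lambda c: reduce(op, c, zero), chunks))
    return reduce(op, partials, zero)


data = list(range(1, 101))
print(parallel_reduce(operator.add, data, 0) == sum(data))
```

String concatenation is associative but not commutative, and it still gives the right answer here because the chunk order is preserved.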

Page 17: OWF14 - Big Data Track : Abstract Algebra for Analytics

Do we have data structures that are intrinsically parallelizable?

Page 18: OWF14 - Big Data Track : Abstract Algebra for Analytics

Abstract Algebra Redux

• Semigroup

A set with an associative operation (grouping doesn’t matter)

• Monoid

Semigroup with a zero, i.e. an identity element (zeros get ignored)

• Group

Monoid in which every element has an inverse

• Abelian Group

Group whose operation is commutative (ordering doesn’t matter)
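The definitions above can be made concrete with a tiny monoid type. This is a plain Python sketch, not Algebird's Scala API; `Monoid`, `combine_all` and the three instances are illustrative names.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Callable, Generic, Iterable, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Monoid(Generic[T]):
    """An associative `plus` together with its identity element `zero`."""
    zero: T
    plus: Callable[[T, T], T]

    def combine_all(self, xs: Iterable[T]) -> T:
        # Starting from zero means an empty input is well defined.
        return reduce(self.plus, xs, self.zero)


int_add = Monoid(0, lambda a, b: a + b)              # also an abelian group (inverse: -a)
int_max = Monoid(float("-inf"), max)                 # a monoid, but no inverses
set_union = Monoid(frozenset(), lambda a, b: a | b)  # commutative monoid

print(int_add.combine_all([1, 2, 3]))
print(int_max.combine_all([3, 9, 2]))
print(set_union.combine_all([frozenset({1}), frozenset({2})]))
```

Any aggregation written against `Monoid` works unchanged for sums, maxima, set unions, or the sketches later in the talk, as long as the instance really is associative.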

Page 21: OWF14 - Big Data Track : Abstract Algebra for Analytics

Stream mining challenges

• Update predictions after every observation

• Single pass : can’t read old data or replay the stream

• Limited time for computation per observation

• O(n) memory size

Page 22: OWF14 - Big Data Track : Abstract Algebra for Analytics

Existing solutions

• Knuth’s Reservoir Sampling works on an evolving stream of data and in fixed memory.

• Stream subsampling

• Adaptive sliding windows : build decision trees on these windows, e.g. Hoeffding Trees

• Use time series analysis methods …

• Etc.
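As an illustration of the first item in the list above, here is a minimal Python sketch of reservoir sampling (the classic "Algorithm R" variant). `reservoir_sample` is a hypothetical helper name; the optional `rng` parameter only exists to make runs reproducible.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the n-th item with probability k / n by picking a
            # slot in [0, n) and replacing only if it falls inside the reservoir.
            j = rng.randrange(n)
            if j < k:
                reservoir[j] = item
    return reservoir


print(len(reservoir_sample(range(1_000_000), 10)))
```

Memory stays at k items no matter how long the stream runs, and the stream is read exactly once.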

Page 23: OWF14 - Big Data Track : Abstract Algebra for Analytics

Approximate algorithms for stream analytics

Page 24: OWF14 - Big Data Track : Abstract Algebra for Analytics

Idea : Hash, don’t Sample

Page 25: OWF14 - Big Data Track : Abstract Algebra for Analytics

Bloom filters

• Approximate data structure for set membership

• Like an approximate set

BloomFilter.contains(x) => Maybe | NO

P(False Positive) > 0

P(False Negative) = 0

Page 26: OWF14 - Big Data Track : Abstract Algebra for Analytics

• Bit Array of fixed size

add(x) : for each hash function i, set b[h(x,i)] = 1

contains(x) : TRUE if b[h(x,i)] == 1 for all i.
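The `add` and `contains` rules above can be sketched in Python as follows. This is an illustration, not Algebird's Bloom filter; the class name, the default sizes, and the use of salted SHA-256 as the hash family `h(x, i)` are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array plus num_hashes hash functions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as the bit array b

    def _positions(self, item):
        # h(x, i): salt the item with the hash index i (an assumed hash family).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos  # b[h(x, i)] = 1

    def contains(self, item):
        # True means "maybe"; False means "definitely not".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


bf = BloomFilter()
bf.add("alice")
print(bf.contains("alice"))
```

An item that was added always has all of its bits set, which is why false negatives cannot happen; false positives occur when other items happen to have set the same bits.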

Page 29: OWF14 - Big Data Track : Abstract Algebra for Analytics

• Bloom Filters

Adding an element uses a boolean OR

Querying uses a boolean AND

Both are Monoids
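A short Python sketch of the monoid view: each element becomes a one-element filter, and filters are added with a bitwise OR. The names (`singleton`, `union`, `EMPTY`, `might_contain`), the sizes, and the salted SHA-256 hash family are assumptions for the example, not Algebird's API.

```python
import hashlib
from functools import reduce

NUM_BITS = 256
NUM_HASHES = 3
EMPTY = 0  # the zero element: no bits set


def positions(item):
    """Bit positions for an item, one per (assumed) salted hash function."""
    for i in range(NUM_HASHES):
        digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
        yield int.from_bytes(digest[:8], "big") % NUM_BITS


def singleton(item):
    """A filter containing exactly one item."""
    return reduce(lambda bits, pos: bits | (1 << pos), positions(item), EMPTY)


def union(a, b):
    """Monoid plus: boolean OR of the two bit arrays."""
    return a | b


def might_contain(bits, item):
    """Query: boolean AND over the item's bit positions."""
    return all((bits >> pos) & 1 for pos in positions(item))


part1 = reduce(union, map(singleton, ["a", "b"]), EMPTY)
part2 = reduce(union, map(singleton, ["c"]), EMPTY)
merged = union(part1, part2)
print(might_contain(merged, "a"), might_contain(merged, "c"))
```

Filters built on separate partitions of the data can therefore be merged afterwards, in any order, into the same filter a single pass would have produced.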

Page 30: OWF14 - Big Data Track : Abstract Algebra for Analytics

HyperLogLog

Page 31: OWF14 - Big Data Track : Abstract Algebra for Analytics

Intuition

• Long runs of trailing 0s in a random bit string are rare

• But the more bit strings you look at, the more likely you are to find a long one

• The longest run of trailing 0-bits seen can be an estimator of the number of unique bit strings observed.
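The intuition above can be tried out with a deliberately crude estimator: hash every item, track the longest run of trailing zero bits, and return 2 to that power. This Python sketch is not HyperLogLog itself (a single estimator like this has very high variance); the function names and the choice of SHA-256 are assumptions for the example.

```python
import hashlib


def hash_bits(item):
    """Map an item to a pseudo-random 64-bit integer (assumed hash: SHA-256)."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def trailing_zeros(n, width=64):
    """Number of consecutive 0-bits at the low end of n."""
    if n == 0:
        return width
    return (n & -n).bit_length() - 1


def longest_run(items):
    """Longest run of trailing zeros over all hashed items."""
    return max((trailing_zeros(hash_bits(x)) for x in items), default=0)


def crude_estimate(items):
    """Very rough distinct count: 2 ** (longest run of trailing zeros)."""
    return 2 ** longest_run(items)


print(crude_estimate(range(10_000)))
```

Duplicates hash to the same bits and cannot lengthen the run, so the estimate depends only on the distinct items seen.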

Page 32: OWF14 - Big Data Track : Abstract Algebra for Analytics

HyperLogLog

• Popular sketch for cardinality estimation

HLL.size = Approx[Number]

We know the distribution on the error.

Page 35: OWF14 - Big Data Track : Abstract Algebra for Analytics

http://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/

Page 36: OWF14 - Big Data Track : Abstract Algebra for Analytics

• HyperLogLog

Adding an element uses MAX, which is a monoid (an ordered semigroup, really ...)

Querying uses a harmonic sum, which is also a monoid.
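A simplified Python sketch of the two operations on this slide: registers are merged with an element-wise MAX, and the query sums 2^(-register) over all registers. It follows the raw estimator of the original HyperLogLog paper with 16 registers and omits the small- and large-range corrections, so it is an illustration rather than a production sketch. All names are illustrative, and SHA-256 is an assumed hash. The rank here counts from the leading bit, as in the paper, rather than the trailing zeros of the intuition slide.

```python
import hashlib
from functools import reduce

P = 4                # number of index bits
M = 1 << P           # 16 registers
ALPHA = 0.673        # bias-correction constant for m = 16
ZERO = (0,) * M      # the empty sketch


def hash64(item):
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def singleton(item):
    """Sketch of a single item: one register holds the rank, the rest stay 0."""
    h = hash64(item)
    idx = h >> (64 - P)                      # first P bits pick the register
    rest = h & ((1 << (64 - P)) - 1)         # remaining 60 bits
    rank = (64 - P) - rest.bit_length() + 1  # position of the leftmost 1-bit
    regs = [0] * M
    regs[idx] = rank
    return tuple(regs)


def plus(a, b):
    """Monoid plus: element-wise MAX of the registers."""
    return tuple(max(x, y) for x, y in zip(a, b))


def sketch(items):
    return reduce(plus, map(singleton, items), ZERO)


def estimate(regs):
    """Raw HLL estimate from the harmonic sum of 2 ** -register."""
    harmonic_sum = sum(2.0 ** -r for r in regs)
    return ALPHA * M * M / harmonic_sum


left, right = sketch(range(0, 5000)), sketch(range(5000, 10000))
print(round(estimate(plus(left, right))))
```

With only 16 registers the estimate is coarse; real deployments use thousands of registers, but the merge stays the same element-wise MAX.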

Page 37: OWF14 - Big Data Track : Abstract Algebra for Analytics

Min Hash

• Estimates how similar two sets are : the probability that their minimum hashes agree.

• Essentially amounts to

|A ∩ B| / |A ∪ B|

• Jaccard Similarity
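A minimal MinHash sketch in Python, assuming a family of salted SHA-256 hashes: the signature of a set keeps the minimum hash per hash function, and the fraction of positions where two signatures agree estimates the Jaccard similarity. All names and the signature length are illustrative choices, not Algebird's API.

```python
import hashlib

NUM_HASHES = 128


def h(i, item):
    """The i-th hash function (assumed family: SHA-256 salted with i)."""
    digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def signature(items):
    """Per hash function, the minimum hash over a non-empty set."""
    return [min(h(i, x) for x in items) for i in range(NUM_HASHES)]


def plus(sig_a, sig_b):
    """Signature of the union: element-wise MIN (associative and commutative)."""
    return [min(a, b) for a, b in zip(sig_a, sig_b)]


def similarity(sig_a, sig_b):
    """Estimated Jaccard similarity: fraction of agreeing positions."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_HASHES


def jaccard(a, b):
    """Exact Jaccard similarity, for comparison."""
    return len(a & b) / len(a | b)


a = set(range(0, 80))
b = set(range(40, 120))
print(jaccard(a, b), similarity(signature(a), signature(b)))
```

More hash functions give a tighter estimate at the cost of a longer signature.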

Page 39: OWF14 - Big Data Track : Abstract Algebra for Analytics

Count-Min Sketch

Gives an approximation of the number of occurrences of an element in a stream.

Page 40: OWF14 - Big Data Track : Abstract Algebra for Analytics

• Count-Min Sketch

Adding an element is a numerical addition

Querying uses a MIN function.

Both are associative.
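The two operations can be sketched in Python as follows. The table size, the names, and the salted SHA-256 hash family are assumptions for the example; this is not Algebird's implementation.

```python
import hashlib

DEPTH = 4    # number of hash rows
WIDTH = 64   # counters per row


def cell(row, item):
    """Column for an item in a given row (assumed hash: SHA-256 salted with the row)."""
    digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % WIDTH


def empty():
    """The zero sketch: all counters at 0."""
    return [[0] * WIDTH for _ in range(DEPTH)]


def add(table, item, count=1):
    """Adding an element is a numerical addition in one cell per row."""
    for row in range(DEPTH):
        table[row][cell(row, item)] += count


def estimate(table, item):
    """Querying takes the MIN over the rows; it can overcount, never undercount."""
    return min(table[row][cell(row, item)] for row in range(DEPTH))


def plus(a, b):
    """Merge two sketches by adding counters cell by cell."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


t1, t2 = empty(), empty()
for word in ["a", "b", "a"]:
    add(t1, word)
add(t2, "a", 5)
print(estimate(plus(t1, t2), "a"))
```

Collisions can only inflate a counter, so taking the minimum across rows picks the least-contaminated one.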

Page 41: OWF14 - Big Data Track : Abstract Algebra for Analytics

Anomaly Detection

Page 42: OWF14 - Big Data Track : Abstract Algebra for Analytics

- Online Summarizer : Approximate data structure to find quantiles in a continuous stream of data.

- Many exist : Q-Tree, Q-Digest, T-Digest

- All of those are associative.

- Another neat thing : it types your data uniformly.

Page 43: OWF14 - Big Data Track : Abstract Algebra for Analytics

Many more sketches and tricks

• FM Counters, KMV

• Histograms

• Ball Sketches : streaming k-means, clustering

• SGD : fit online machine learning algorithms

Page 45: OWF14 - Big Data Track : Abstract Algebra for Analytics

Algebird

Page 46: OWF14 - Big Data Track : Abstract Algebra for Analytics

Conclusion

• Hashed data structures can be resolved to the usual data structures like Set, Map, etc., which are easier for developers to reason about.

• As data size grows, sampling becomes painful; hashing provides a more cost-effective solution.

• Abstract algebra with sketched data is a no-brainer, and guarantees less error and better scalability of analytics systems.

http://speakerdeck.com/samklr

Page 47: OWF14 - Big Data Track : Abstract Algebra for Analytics

DON’T BE SCARED ANYMORE.

Page 48: OWF14 - Big Data Track : Abstract Algebra for Analytics

Bibliography

• Great intro to Algebird : http://www.michael-noll.com/blog/2013/12/02/twitter-algebird-monoid-monad-for-large-scala-data-analytics/

• Aggregate Knowledge : http://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/

• Probabilistic data structures for web analytics : http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/

• Algebird : github.com/twitter/algebird

• Algebra for analytics : https://speakerdeck.com/johnynek/algebra-for-analytics

• http://infolab.stanford.edu/~ullman/mmds/ch3.pdf