Streaming Algorithms
Joe Kelley, Data Engineer
July 2013

TRANSCRIPT

Page 1: Streaming Algorithms

Streaming Algorithms

Joe Kelley, Data Engineer

July 2013

Page 2: Streaming Algorithms

CONFIDENTIAL | 2

Accelerating Your Time to Value

IMAGINE: Strategy and Roadmap
ILLUMINATE: Training and Education
IMPLEMENT: Hands-On Data Science and Data Engineering

Leading Provider of Data Science & Engineering for Big Analytics

Page 3: Streaming Algorithms

What is a Streaming Algorithm?

• Operates on a continuous stream of data
• Unknown or infinite size
• Only one pass; for each item, the options are:
  • Store it
  • Lose it
  • Store an approximation
• Limited processing time per item
• Limited total memory

[Diagram: the input stream flows through the algorithm, which answers standing and ad-hoc queries from limited memory, with disk as backing storage]

Page 4: Streaming Algorithms

Why use a Streaming Algorithm?

• Compare to the typical “Big Data” approach: store everything, analyze later, scale linearly
• Streaming Pros:
  • Lower latency
  • Lower storage cost
• Streaming Cons:
  • Less flexibility
  • Lower precision (sometimes)
• Answer?
  • Why not both?

[Diagram: the input feeds both a streaming algorithm, whose result is the initial answer, and long-term storage plus a batch algorithm, whose result is the authoritative answer]

Page 5: Streaming Algorithms


General Techniques

1. Tunable Approximation

2. Sampling

• Sliding window

• Fixed number

• Fixed percentage

3. Hashing: useful randomness

Page 6: Streaming Algorithms

Example 1: Sampling device error rates

• Stream of (device_id, event, timestamp)
• Scenario:
  • Not enough space to store everything
  • Storing 1% is good enough for simple queries

Input streams, interleaved into one combined stream:

Device-1: (Device-1, event-1, 10001123), (Device-1, event-3, 10001126), (Device-1, event-1, 10001129), ...
Device-2: (Device-2, event-2, 10001124), (Device-2, ERROR, 10001130), (Device-2, event-4, 10001132), ...
Device-3: (Device-3, event-3, 10001122), (Device-3, event-1, 10001127), (Device-3, ERROR, 10001135), ...

Combined input: (Device-3, event-3, 10001122), (Device-1, event-1, 10001123), (Device-2, event-2, 10001124), (Device-1, event-3, 10001126), (Device-3, event-1, 10001127), (Device-1, event-1, 10001129), (Device-2, ERROR, 10001130), (Device-2, event-4, 10001132), (Device-3, ERROR, 10001135), ...

Page 7: Streaming Algorithms

Example 1: Sampling device error rates

• Stream of (device_id, event, timestamp)
• Scenario:
  • Not enough space to store everything
  • Storing 1% is good enough for simple queries

Algorithm:

    for each element e:
        with probability 0.01:
            store e
        else:
            throw out e

Can lead to some insidious statistical “bugs”…
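The per-element coin flip above can be sketched in Python (a minimal sketch; the stream and storage here are stand-ins):

```python
import random

def sample_stream(stream, rate=0.01):
    """Keep each element independently with the given probability."""
    stored = []
    for e in stream:
        if random.random() < rate:  # coin flip per element
            stored.append(e)        # "store e"
        # else: throw out e
    return stored

# illustrative stream of (device_id, event, timestamp) tuples
events = [("Device-1", "event-1", 10001123 + t) for t in range(100_000)]
kept = sample_stream(events)
# kept holds roughly 1% of the stream
```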

Page 8: Streaming Algorithms

Example 1: Sampling device error rates

• Stream of (device_id, event, timestamp)
• Scenario:
  • Not enough space to store everything
  • Storing 1% is good enough for simple queries

Query: How many errors has the average device encountered?

Answer:

    SELECT AVG(n) FROM (
        SELECT COUNT(*) AS n FROM events
        WHERE event = 'ERROR'
        GROUP BY device_id
    )

Simple… but off by up to 100x: each device had only 1% of its events sampled.

Can we just multiply by 100?
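Multiplying by 100 does not fix it: devices whose sampled events happen to include no errors vanish from the GROUP BY, biasing the average upward. A small simulation (with hypothetical numbers: 1,000 devices, 50 true errors each) makes this visible:

```python
import random

random.seed(42)

TRUE_ERRORS_PER_DEVICE = 50   # hypothetical ground truth
DEVICES = 1000
RATE = 0.01

sampled_error_counts = []
for _ in range(DEVICES):
    # how many of this device's errors survive 1% sampling
    n = sum(1 for _ in range(TRUE_ERRORS_PER_DEVICE) if random.random() < RATE)
    if n > 0:  # devices with no sampled errors produce no row in the GROUP BY
        sampled_error_counts.append(n)

naive_estimate = 100 * sum(sampled_error_counts) / len(sampled_error_counts)
# naive_estimate lands well above the true value of 50, because the
# many zero-count devices are missing from the denominator
```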

Page 9: Streaming Algorithms

Example 1: Sampling device error rates

• Stream of (device_id, event, timestamp)
• Scenario:
  • Not enough space to store everything
  • Storing 1% is good enough for simple queries

Better Algorithm:

    for each element e:
        if (hash(e.device_id) mod 100) == 0:
            store e
        else:
            throw out e

Now we keep every event for 1% of devices, so per-device statistics are unbiased. Choose what to hash on carefully… or keep a separate sample for each field you might query by.
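The device-keyed sample can be sketched as follows (a minimal sketch using `hashlib` for a hash that is stable across runs; the stream is illustrative):

```python
import hashlib

def keep_device(device_id: str, buckets: int = 100) -> bool:
    """Deterministically keep ~1% of devices: all of a device's events, or none."""
    h = int(hashlib.md5(device_id.encode()).hexdigest(), 16)
    return h % buckets == 0

# illustrative stream: 10,000 devices with 3 events each
stream = [(f"Device-{d}", f"event-{i}", 10001123 + d + i)
          for d in range(10_000) for i in range(3)]

stored = [e for e in stream if keep_device(e[0])]
# every kept device keeps all of its events; roughly 1% of devices survive
```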

Page 10: Streaming Algorithms

Example 2: Sampling fixed number

Want to sample a fixed count (k), not a fixed percentage.

Algorithm:

    let arr = array of size k
    for each element e:
        if arr is not yet full:
            add e to arr
        else:
            with probability p:
                replace a random element of arr with e
            else:
                throw out e

Choice of p is crucial:
• p = constant: prefer more recent elements. Higher p = more recent
• p = k/n: sample uniformly from the entire stream (n = number of elements seen so far)
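With p = k/n this is reservoir sampling (Algorithm R); a minimal runnable sketch:

```python
import random

def reservoir_sample(stream, k):
    """Uniform sample of k elements from a stream of unknown length."""
    arr = []
    for n, e in enumerate(stream, start=1):
        if len(arr) < k:
            arr.append(e)                 # fill the reservoir first
        elif random.random() < k / n:     # keep e with probability k/n
            arr[random.randrange(k)] = e  # evict a random current element
    return arr

sample = reservoir_sample(range(100_000), 10)
# sample holds 10 elements drawn uniformly from the whole stream
```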


Page 12: Streaming Algorithms


Example 3: Counting unique users

• Input: stream of (user_id, action, timestamp)

• Want to know how many distinct users are seen over

a time period

• Naïve approach:

• Store all user_id’s in a list/tree/hashtable

• Millions of users = a lot of memory

• Better approach:

• Store all user_id’s in a database

• Good, but maybe it’s not fast enough…

• What if an approximate count is ok?

Page 13: Streaming Algorithms


Example 3: Counting unique users

• Input: stream of (user_id, action, timestamp)

• Want to know how many distinct users are seen over a time period

• Approximate count is ok

• Flajolet-Martin Idea:

• Hash each user_id into a bit string

• Count the trailing zeros

• Remember maximum number of trailing zeros seen

    user_id      H(user_id)   trailing zeros   max(trailing zeros)
    john_doe     0111001001   0                0
    jane_doe     1011011100   2                2
    alan_t       0010111000   3                3
    EWDijkstra   1101011110   1                3
    jane_doe     1011011100   2                3

Page 14: Streaming Algorithms

Example 3: Counting unique users

• Input: stream of (user_id, action, timestamp)
• Want to know how many distinct users are seen over a time period
• Intuition:
  • If we had seen 2 distinct users, we would expect 1 trailing zero
  • If we had seen 4, we would expect 2 trailing zeros
  • If we had seen 2^r, we would expect r trailing zeros
  • In general, if the maximum number of trailing zeros seen is r, then 2^r is a reasonable estimate of the number of distinct users
• Want more precision? Use more independent hash functions, and combine the results:
  • Median = only get powers of two
  • Mean = subject to skew
  • Median of means of groups works well in practice

Page 15: Streaming Algorithms

Example 3: Counting unique users

• Input: stream of (user_id, action, timestamp)
• Want to know how many distinct users are seen over a time period

Flajolet-Martin, all together:

    arr = int[k]        // one max-trailing-zeros counter per hash function
    for each item e:
        for i in 0...k-1:
            z = trailing_zeros(hash_i(e))
            if z > arr[i]:
                arr[i] = z
    means = group_means(arr)
    median = median(means)
    return pow(2, median)
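A runnable sketch of the pseudocode above; the k "independent" hash functions are simulated by salting `hashlib` with the index, and the group size for median-of-means is an illustrative choice:

```python
import hashlib
import statistics

def trailing_zeros(x: int) -> int:
    """Number of trailing zero bits in x."""
    if x == 0:
        return 0
    return (x & -x).bit_length() - 1

def fm_estimate(stream, k=64, group_size=8):
    """Flajolet-Martin distinct-count estimate using k hash functions."""
    arr = [0] * k
    for e in stream:
        for i in range(k):
            # simulate k independent hash functions by salting with i
            h = int(hashlib.md5(f"{i}:{e}".encode()).hexdigest(), 16)
            z = trailing_zeros(h)
            if z > arr[i]:
                arr[i] = z
    # median of group means smooths out skew and power-of-two quantization
    means = [statistics.mean(arr[g:g + group_size])
             for g in range(0, k, group_size)]
    return 2 ** statistics.median(means)

est = fm_estimate(f"user-{u}" for u in range(1000))
# est should land within a small factor of the true count of 1000
```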

Page 16: Streaming Algorithms

Example 3: Counting unique users

Flajolet-Martin in practice

• Devil is in the details
• Tunable precision
  • more hash functions = more precise
  • See the paper for bounds on precision
• Tunable latency
  • more hash functions = higher latency
  • faster hash functions = lower latency
  • faster hash functions = more possibility of correlation = less precision

Remember: streaming algorithm for a quick, imprecise answer; back-end batch algorithm for a slower, exact answer.

Page 17: Streaming Algorithms

Example 4: Counting Individual Item Frequencies

Want to keep track of how many times each item has appeared in the stream

Many applications:
• How popular is each search term?
• How many times has this hashtag been tweeted?
• Which IP addresses are DDoS’ing me?

Again, two obvious approaches:
• In-memory hashmap of item → count
• Database

But can we be more clever?

Page 18: Streaming Algorithms

Example 4: Counting Individual Item Frequencies

Want to keep track of how many times each item has appeared in the stream

Idea:
• Maintain an array of counts
• Hash each item, increment the array at that index
• To check the count of an item, hash it again and read the array at that index
• Over-estimates because of hash “collisions”

Page 19: Streaming Algorithms

Example 4: Counting Individual Item Frequencies

Count-Min Sketch algorithm:
• Maintain a 2-d array of size w x d
• Choose d different hash functions; each row in the array corresponds to one hash function
• Hash each item with every hash function, increment the appropriate position in each row
• To query an item, hash it d times again, take the minimum value from all rows

Page 20: Streaming Algorithms

Example 4: Counting Individual Item Frequencies

Want to keep track of how many times each item has appeared in the stream

Count-Min Sketch, all together:

    arr = int[d][w]
    for each item e:
        for i in 0...d-1:
            j = hash_i(e) mod w
            arr[i][j]++

    def frequency(q):
        min = +infinity
        for i in 0...d-1:
            j = hash_i(q) mod w
            if arr[i][j] < min:
                min = arr[i][j]
        return min
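The same sketch in runnable Python; the d hash functions are simulated by salting `hashlib` with the row number, and w and d are illustrative sizes:

```python
import hashlib

class CountMinSketch:
    def __init__(self, w=1000, d=5):
        self.w, self.d = w, d
        self.arr = [[0] * w for _ in range(d)]

    def _index(self, i, item):
        # simulate d independent hash functions by salting with the row number
        h = int(hashlib.md5(f"{i}:{item}".encode()).hexdigest(), 16)
        return h % self.w

    def add(self, item):
        for i in range(self.d):
            self.arr[i][self._index(i, item)] += 1

    def frequency(self, item):
        # minimum across rows: collisions only ever inflate a cell,
        # so the smallest cell is the tightest over-estimate
        return min(self.arr[i][self._index(i, item)] for i in range(self.d))

cms = CountMinSketch()
for _ in range(42):
    cms.add("#streaming")
cms.add("#batch")
# cms.frequency("#streaming") is at least 42 (exact unless all rows collide)
```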

Page 21: Streaming Algorithms

Example 4: Counting Individual Item Frequencies

Count-Min Sketch in practice

• Devil is in the details
• Tunable precision
  • Bigger array = more precise
  • See the paper for bounds on precision
• Tunable latency
  • more hash functions = higher latency
• Better at estimating more frequent items
• Can subtract out an estimate of collisions

Remember: streaming algorithm for a quick, imprecise answer; back-end batch algorithm for a slower, exact answer.

Page 22: Streaming Algorithms


Questions?

• Feel free to reach out

• www.thinkbiganalytics.com

[email protected]

• www.slideshare.net/jfkelley1

• References:

• http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf

• http://infolab.stanford.edu/~ullman/mmds.html

We’re hiring! Engineers and Data Scientists