Streaming Algorithms
Joe Kelley, Data Engineer
July 2013

TRANSCRIPT

Page 1: Streaming Algorithms

Streaming Algorithms

Joe Kelley, Data Engineer

July 2013

Page 2: Streaming Algorithms

CONFIDENTIAL | 2

Accelerating Your Time to Value

IMAGINE: Strategy and Roadmap
ILLUMINATE: Training and Education
IMPLEMENT: Hands-On Data Science and Data Engineering

Leading Provider of Data Science & Engineering for Big Analytics

Page 3: Streaming Algorithms

What is a Streaming Algorithm?

• Operates on a continuous stream of data
• Unknown or infinite size
• Only one pass; for each item, the options are:
  • Store it
  • Lose it
  • Store an approximation
• Limited processing time per item
• Limited total memory

[Diagram: the input stream flows through the algorithm, which answers standing and ad-hoc queries from limited memory, with disk as backing storage]

Page 4: Streaming Algorithms

Why use a Streaming Algorithm?

• Compare to the typical “Big Data” approach: store everything, analyze later, scale linearly
• Streaming Pros:
  • Lower latency
  • Lower storage cost
• Streaming Cons:
  • Less flexibility
  • Lower precision (sometimes)
• Answer?
  • Why not both?

[Diagram: the input feeds both a streaming algorithm, whose result is the initial answer, and long-term storage plus a batch algorithm, whose result is the authoritative answer]

Page 5: Streaming Algorithms


General Techniques

1. Tunable Approximation

2. Sampling

• Sliding window

• Fixed number

• Fixed percentage

3. Hashing: useful randomness

Page 6: Streaming Algorithms

Example 1: Sampling device error rates

• Stream of (device_id, event, timestamp)
• Scenario:
  • Not enough space to store everything
  • Storing 1% is good enough for simple queries

Input streams, interleaved into one combined stream:

Device-1: (Device-1, event-1, 10001123), (Device-1, event-3, 10001126), (Device-1, event-1, 10001129), ...
Device-2: (Device-2, event-2, 10001124), (Device-2, ERROR, 10001130), (Device-2, event-4, 10001132), ...
Device-3: (Device-3, event-3, 10001122), (Device-3, event-1, 10001127), (Device-3, ERROR, 10001135), ...

Combined input: (Device-3, event-3, 10001122), (Device-1, event-1, 10001123), (Device-2, event-2, 10001124), (Device-1, event-3, 10001126), (Device-3, event-1, 10001127), (Device-1, event-1, 10001129), (Device-2, ERROR, 10001130), (Device-2, event-4, 10001132), (Device-3, ERROR, 10001135), ...

Page 7: Streaming Algorithms

Example 1: Sampling device error rates

• Stream of (device_id, event, timestamp)
• Scenario:
  • Not enough space to store everything
  • Storing 1% is good enough for simple queries

Algorithm:

    for each element e:
        with probability 0.01:
            store e
        else:
            throw out e

Can lead to some insidious statistical “bugs”…
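The per-element coin flip above can be sketched in Python (a minimal sketch; the stream and storage here are stand-ins):

```python
import random

def sample_stream(stream, rate=0.01):
    """Keep each element independently with the given probability."""
    stored = []
    for e in stream:
        if random.random() < rate:  # coin flip per element
            stored.append(e)        # "store e"
        # else: throw out e
    return stored

# illustrative stream of (device_id, event, timestamp) tuples
events = [("Device-1", "event-1", 10001123 + t) for t in range(100_000)]
kept = sample_stream(events)
# kept holds roughly 1% of the stream
```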

Page 8: Streaming Algorithms

Example 1: Sampling device error rates

• Stream of (device_id, event, timestamp)
• Scenario:
  • Not enough space to store everything
  • Storing 1% is good enough for simple queries

Query: How many errors has the average device encountered?

Answer:

    SELECT AVG(n) FROM (
        SELECT COUNT(*) AS n FROM events
        WHERE event = 'ERROR'
        GROUP BY device_id
    )

Simple… but off by up to 100x: each device had only 1% of its events sampled.

Can we just multiply by 100?
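Multiplying by 100 does not fix it: devices whose sampled events happen to include no errors vanish from the GROUP BY, biasing the average upward. A small simulation (with hypothetical numbers: 1,000 devices, 50 true errors each) makes this visible:

```python
import random

random.seed(42)

TRUE_ERRORS_PER_DEVICE = 50   # hypothetical ground truth
DEVICES = 1000
RATE = 0.01

sampled_error_counts = []
for _ in range(DEVICES):
    # how many of this device's errors survive 1% sampling
    n = sum(1 for _ in range(TRUE_ERRORS_PER_DEVICE) if random.random() < RATE)
    if n > 0:  # devices with no sampled errors produce no row in the GROUP BY
        sampled_error_counts.append(n)

naive_estimate = 100 * sum(sampled_error_counts) / len(sampled_error_counts)
# naive_estimate lands well above the true value of 50, because the
# many zero-count devices are missing from the denominator
```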

Page 9: Streaming Algorithms

Example 1: Sampling device error rates

• Stream of (device_id, event, timestamp)
• Scenario:
  • Not enough space to store everything
  • Storing 1% is good enough for simple queries

Better Algorithm:

    for each element e:
        if (hash(e.device_id) mod 100) == 0:
            store e
        else:
            throw out e

Now we keep every event for 1% of devices, so per-device statistics are unbiased. Choose what to hash on carefully… or keep a separate sample for each field you might query by.
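The device-keyed sample can be sketched as follows (a minimal sketch using `hashlib` for a hash that is stable across runs; the stream is illustrative):

```python
import hashlib

def keep_device(device_id: str, buckets: int = 100) -> bool:
    """Deterministically keep ~1% of devices: all of a device's events, or none."""
    h = int(hashlib.md5(device_id.encode()).hexdigest(), 16)
    return h % buckets == 0

# illustrative stream: 10,000 devices with 3 events each
stream = [(f"Device-{d}", f"event-{i}", 10001123 + d + i)
          for d in range(10_000) for i in range(3)]

stored = [e for e in stream if keep_device(e[0])]
# every kept device keeps all of its events; roughly 1% of devices survive
```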

Page 10: Streaming Algorithms

Example 2: Sampling fixed number

Want to sample a fixed count (k), not a fixed percentage.

Algorithm:

    let arr = array of size k
    for each element e:
        if arr is not yet full:
            add e to arr
        else:
            with probability p:
                replace a random element of arr with e
            else:
                throw out e

Choice of p is crucial:
• p = constant: prefer more recent elements. Higher p = more recent
• p = k/n: sample uniformly from the entire stream (n = number of elements seen so far)
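With p = k/n this is reservoir sampling (Algorithm R); a minimal runnable sketch:

```python
import random

def reservoir_sample(stream, k):
    """Uniform sample of k elements from a stream of unknown length."""
    arr = []
    for n, e in enumerate(stream, start=1):
        if len(arr) < k:
            arr.append(e)                 # fill the reservoir first
        elif random.random() < k / n:     # keep e with probability k/n
            arr[random.randrange(k)] = e  # evict a random current element
    return arr

sample = reservoir_sample(range(100_000), 10)
# sample holds 10 elements drawn uniformly from the whole stream
```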


Page 12: Streaming Algorithms


Example 3: Counting unique users

• Input: stream of (user_id, action, timestamp)

• Want to know how many distinct users are seen over

a time period

• Naïve approach:

• Store all user_id’s in a list/tree/hashtable

• Millions of users = a lot of memory

• Better approach:

• Store all user_id’s in a database

• Good, but maybe it’s not fast enough…

• What if an approximate count is ok?

Page 13: Streaming Algorithms


Example 3: Counting unique users

• Input: stream of (user_id, action, timestamp)

• Want to know how many distinct users are seen over a time period

• Approximate count is ok

• Flajolet-Martin Idea:

• Hash each user_id into a bit string

• Count the trailing zeros

• Remember maximum number of trailing zeros seen

    user_id      H(user_id)   trailing zeros   max(trailing zeros)
    john_doe     0111001001   0                0
    jane_doe     1011011100   2                2
    alan_t       0010111000   3                3
    EWDijkstra   1101011110   1                3
    jane_doe     1011011100   2                3

Page 14: Streaming Algorithms

Example 3: Counting unique users

• Input: stream of (user_id, action, timestamp)
• Want to know how many distinct users are seen over a time period
• Intuition:
  • If we had seen 2 distinct users, we would expect 1 trailing zero
  • If we had seen 4, we would expect 2 trailing zeros
  • If we had seen 2^r, we would expect r trailing zeros
  • In general, if the maximum number of trailing zeros seen is r, then 2^r is a reasonable estimate of the number of distinct users
• Want more precision? Use more independent hash functions, and combine the results:
  • Median = only get powers of two
  • Mean = subject to skew
  • Median of means of groups works well in practice

Page 15: Streaming Algorithms

Example 3: Counting unique users

• Input: stream of (user_id, action, timestamp)
• Want to know how many distinct users are seen over a time period

Flajolet-Martin, all together:

    arr = int[k]        // one max-trailing-zeros counter per hash function
    for each item e:
        for i in 0...k-1:
            z = trailing_zeros(hash_i(e))
            if z > arr[i]:
                arr[i] = z
    means = group_means(arr)
    median = median(means)
    return pow(2, median)
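A runnable sketch of the pseudocode above; the k "independent" hash functions are simulated by salting `hashlib` with the index, and the group size for median-of-means is an illustrative choice:

```python
import hashlib
import statistics

def trailing_zeros(x: int) -> int:
    """Number of trailing zero bits in x."""
    if x == 0:
        return 0
    return (x & -x).bit_length() - 1

def fm_estimate(stream, k=64, group_size=8):
    """Flajolet-Martin distinct-count estimate using k hash functions."""
    arr = [0] * k
    for e in stream:
        for i in range(k):
            # simulate k independent hash functions by salting with i
            h = int(hashlib.md5(f"{i}:{e}".encode()).hexdigest(), 16)
            z = trailing_zeros(h)
            if z > arr[i]:
                arr[i] = z
    # median of group means smooths out skew and power-of-two quantization
    means = [statistics.mean(arr[g:g + group_size])
             for g in range(0, k, group_size)]
    return 2 ** statistics.median(means)

est = fm_estimate(f"user-{u}" for u in range(1000))
# est should land within a small factor of the true count of 1000
```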

Page 16: Streaming Algorithms

Example 3: Counting unique users

Flajolet-Martin in practice

• Devil is in the details
• Tunable precision
  • more hash functions = more precise
  • See the paper for bounds on precision
• Tunable latency
  • more hash functions = higher latency
  • faster hash functions = lower latency
  • faster hash functions = more possibility of correlation = less precision

Remember: streaming algorithm for a quick, imprecise answer; back-end batch algorithm for a slower, exact answer.

Page 17: Streaming Algorithms

Example 4: Counting Individual Item Frequencies

Want to keep track of how many times each item has appeared in the stream

Many applications:
• How popular is each search term?
• How many times has this hashtag been tweeted?
• Which IP addresses are DDoS’ing me?

Again, two obvious approaches:
• In-memory hashmap of item → count
• Database

But can we be more clever?

Page 18: Streaming Algorithms

Example 4: Counting Individual Item Frequencies

Want to keep track of how many times each item has appeared in the stream

Idea:
• Maintain an array of counts
• Hash each item, increment the array at that index
• To check the count of an item, hash it again and read the array at that index
• Over-estimates because of hash “collisions”

Page 19: Streaming Algorithms

Example 4: Counting Individual Item Frequencies

Count-Min Sketch algorithm:
• Maintain a 2-d array of size w x d
• Choose d different hash functions; each row in the array corresponds to one hash function
• Hash each item with every hash function, increment the appropriate position in each row
• To query an item, hash it d times again, take the minimum value from all rows

Page 20: Streaming Algorithms

Example 4: Counting Individual Item Frequencies

Want to keep track of how many times each item has appeared in the stream

Count-Min Sketch, all together:

    arr = int[d][w]
    for each item e:
        for i in 0...d-1:
            j = hash_i(e) mod w
            arr[i][j]++

    def frequency(q):
        min = +infinity
        for i in 0...d-1:
            j = hash_i(q) mod w
            if arr[i][j] < min:
                min = arr[i][j]
        return min
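The same sketch in runnable Python; the d hash functions are simulated by salting `hashlib` with the row number, and w and d are illustrative sizes:

```python
import hashlib

class CountMinSketch:
    def __init__(self, w=1000, d=5):
        self.w, self.d = w, d
        self.arr = [[0] * w for _ in range(d)]

    def _index(self, i, item):
        # simulate d independent hash functions by salting with the row number
        h = int(hashlib.md5(f"{i}:{item}".encode()).hexdigest(), 16)
        return h % self.w

    def add(self, item):
        for i in range(self.d):
            self.arr[i][self._index(i, item)] += 1

    def frequency(self, item):
        # minimum across rows: collisions only ever inflate a cell,
        # so the smallest cell is the tightest over-estimate
        return min(self.arr[i][self._index(i, item)] for i in range(self.d))

cms = CountMinSketch()
for _ in range(42):
    cms.add("#streaming")
cms.add("#batch")
# cms.frequency("#streaming") is at least 42 (exact unless all rows collide)
```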

Page 21: Streaming Algorithms

Example 4: Counting Individual Item Frequencies

Count-Min Sketch in practice

• Devil is in the details
• Tunable precision
  • Bigger array = more precise
  • See the paper for bounds on precision
• Tunable latency
  • more hash functions = higher latency
• Better at estimating more frequent items
• Can subtract out an estimate of collisions

Remember: streaming algorithm for a quick, imprecise answer; back-end batch algorithm for a slower, exact answer.

Page 22: Streaming Algorithms


Questions?

• Feel free to reach out

• www.thinkbiganalytics.com

[email protected]

• www.slideshare.net/jfkelley1

• References:

• http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf

• http://infolab.stanford.edu/~ullman/mmds.html

We’re hiring! Engineers and Data Scientists