Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

Post on 19-Jan-2016

Page 1: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

Sampling for Windows on Data Streams

by Vladimir Braverman

vova@cs.ucla.edu

Page 2: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

Data Stream

Sequence of elements D = p1, p2, …, pN, where each pi is drawn from [m].

Objective: calculate a function f(D).

Restrictions: single pass, sub-linear memory, fast processing time (per element).

[Figure: the stream p1, p2, …, pN laid out along a time axis]

Page 3: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

Motivation

Today’s applications: huge amounts of data are whizzing by.

Objective: mining the data, computing statistics, etc.

Restrictions: expensive overload is not allowed.

Useful for many applications: networking, databases, etc.

Page 4: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

Data Stream

Intensive theoretical research

Streaming systems: Stream (Stanford), StreamMill (UCLA), Aurora (Brown), GigaScope (Rutgers), Nile (Purdue), Niagara (Wisconsin), Telegraph (Berkeley), etc.

Page 5: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

Data Stream

The model allows insertions only.

What about deletions?

Turnstile model

Sliding Windows

[Figure: the stream p1, p2, …, pN laid out along a time axis]

Page 6: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

Sliding Windows

[Figure: elements p1 … p8 on a time axis, window size n=5; older elements are marked "expired", the n most recent ones "active"]

SW contains the n most recent elements, which are “active”.

Older elements are “expired”.

Page 7: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

Sliding Windows

[Figure: elements p1 … p8, …, pN-7 … pN on a time axis, window size n=5; older elements are marked "expired", the n most recent ones "active"]

SW contains the n most recent elements, which are “active”.

Older elements are “expired”.

Page 8: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

Sequence-based Windows

[Figure: elements p1 … p8, …, pN-7 … pN on a time axis; n=5 in the picture, but in general n is “huge”; older elements are marked "expired", the n most recent ones "active"]

SW contains the n most recent elements, which are “active”.

Older elements are “expired”.

Page 9: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

Timestamp-based windows

[Figure: elements p1 … p13 placed along a time axis according to their timestamps]

Page 10: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

What is known on sliding windows

[BDM 02] Random sampling

[DGIM 02] Sum, Count, average, Lp, 0<p≤2, weakly additive functions.

[DM 02] Rarity, similarity

[GT 02] Distributed sum, count

[FKZ 02], [CS 04] Diameter

[BDMO 03] Variance, k-medians

[GDDLM 03] Frequent elements

[AM 04] Counts, quantiles

[AGHLRS 04] LIS

[LT 06] Frequent items

[LT 06] Count

[ZG 06] Variance

[CCM 07] Entropy

Page 11: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

Random Sampling

Page 12: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

Random Sampling

Fundamental approximation method

Pick a subset S of D.

Use f(S) to approximate f(D).

[Figure: the stream p1, p2, …, pN on a time axis with the sampled elements marked]

Page 13: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

Types of k-sampling

With replacement: samples x1,…,xk are independent.

Without replacement: repetitions are forbidden, i.e., xi ≠ xj for i ≠ j.

Page 14: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

Properties of Random Sampling

General, simple, first-to-try method.

Stores an element, not an aggregate.

Allows changing f a posteriori.

Can be used for multiple statistics.

Provides effective solutions with worst-case guarantees.

The only known solution for many problems.

Page 15: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

Some Known Methods for Data Streams

Reservoir Sampling [V 85]

Concise Sampling [GM 98]

Inverse Sampling [CMR 05]

Weighted Sampling [CMN 99]

Biased Sampling [A 06]

Priority Sampling [ADLT 05]

Dynamic Sampling [FIS 05]

Chain Sampling [BDM 02]

Page 16: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

Streaming Sampling

Easy if N is fixed: pick a random index I from {1,2,…,N} and output pI.

But: N is not known in advance.

Naïve methods:

Store the whole stream: linear memory.

“Guess” the final value of N: not really uniform.

Page 17: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

Reservoir Sampling (Vitter 85)

Maintains k uniform samples without replacement using Θ(k) space.

Outputs a sample for every prefix.

Intuition: the probability of picking p decreases as N grows; probabilities can be adjusted dynamically.

Page 18: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

Reservoir Sampling (Vitter 85)

Reservoir (array) of k elements, initially empty

Algorithm:

Insert the first k elements into the reservoir.

For i>k, pick pi with probability k/i (for k=1 this is 1/i).

If pi is chosen, pick one of the samples in the reservoir uniformly at random and replace it with pi.
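
The reservoir steps can be sketched in Python as follows. This is a minimal illustration rather than Vitter's optimized variant; it assumes the k-sample form in which p_i is kept with probability k/i (which reduces to 1/i for k=1), and the function name `reservoir_sample` is chosen here for illustration only.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Return k uniform samples without replacement from an iterable of unknown length."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)               # the first k elements fill the reservoir
        elif rng.random() < k / i:               # keep p_i with probability k/i
            reservoir[rng.randrange(k)] = item   # evict a uniformly chosen slot
    return reservoir
```

If the stream has fewer than k elements, the sketch simply returns all of them.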

Page 19: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

Sampling on Sliding Windows:Problem Definition

Maintain uniform random sampling on sliding windows.

Output a sample for every window.

Use provably optimal memory.

Page 20: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

Sampling for Sliding Windows

Can we use previous methods?

No - samples expire.

[Figure: elements p1 … p8 on a time axis, window size n=5]

Page 21: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

Naïve Approach

Store the whole window.

Linear memory => compute f(W) directly.

Page 22: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

Periodic Sampling

[Figure: elements p1 … p8 on a time axis, window size n=5]

Pick a sample pi from the first window.

When pi expires, take the new element.

Continue…
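
A small Python sketch of this scheme shows why it is predictable: once the first index is known, every later sample sits exactly n positions after the previous one. The function name and the 1-based indexing are assumptions made for this illustration.

```python
import random

def periodic_sample_indices(n, num_steps, rng=random):
    """Index of the periodic sample after each arrival N = n, n+1, ..., num_steps."""
    current = rng.randint(1, n)      # uniform pick from the first window p_1..p_n
    out = []
    for N in range(n, num_steps + 1):
        if current <= N - n:         # the sample has left the window p_{N-n+1}..p_N
            current = N              # replace it with the newest element
        out.append(current)
    return out
```

The distinct indices produced are i, i+n, i+2n, and so on, so only the very first choice is random.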

Page 23: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

Periodic Sampling: problems

Vulnerability to malicious behavior: given one sample, it is possible to predict all future samples.

Poor representation of periodic data: if the period of the data “agrees” with the sampling period.

Unacceptable for applications.

Page 24: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

Sampling on Sliding Windows:Problem Definition

Maintain uniform random sampling on sliding windows.

Use provably optimal memory.

Samples on distinct windows are independent.

Page 25: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

Chain and Priority Methods

Babcock, Datar, Motwani, SODA 2002. Maintain uniform random sampling on sliding windows.

Chain Sampling

Sequence-based windows, with replacement.

Uses optimal memory in expectation.

Uses O(k log n) w.h.p.

Samples on distinct windows are weakly dependent.

Priority Sampling

Timestamp-based windows, with replacement.

Uses optimal memory in expectation and w.h.p.

Samples on distinct windows are independent.
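
For orientation, the idea behind priority sampling can be sketched for a single sample: every element receives an independent random priority, the sample is the active element with the highest priority, and an element is discarded as soon as a later element has a higher priority. The class below is an illustrative sketch under these assumptions; the names `PrioritySampler`, `add` and `sample` are not taken from the slides or the paper.

```python
import random
from collections import deque

class PrioritySampler:
    def __init__(self, window, rng=random):
        self.window = window     # window length in time units
        self.rng = rng
        self.chain = deque()     # (timestamp, priority, item), priorities strictly decreasing

    def add(self, timestamp, item):
        priority = self.rng.random()
        # An older element with a lower priority can never become the maximum again.
        while self.chain and self.chain[-1][1] <= priority:
            self.chain.pop()
        self.chain.append((timestamp, priority, item))

    def sample(self, now):
        # Drop expired elements; the front then holds the highest active priority.
        while self.chain and self.chain[0][0] <= now - self.window:
            self.chain.popleft()
        return self.chain[0][2] if self.chain else None
```

Timestamps are assumed to arrive in non-decreasing order, and an element with timestamp t counts as active while t > now - window.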

Page 26: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

S3 Algorithms

Maintain uniform random sampling on sliding windows.

Supports all cases.

Provably optimal.

Samples on distinct windows are independent.

Page 27: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

S3: Recap

Sampling \ Window       Sequence-based    Timestamp-based

With Replacement        O(k)              O(k*log n)

Without Replacement     O(k)              O(k*log n)

Page 28: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

Concepts

Prior algorithms: replacement policy for expired samples.

S3 algorithms:

Divide the stream into buckets.

Sample(s) for each bucket.

Combination rule.

Page 29: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

Sampling With Replacement for Sequence-Based Windows

Page 30: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

Notations

[Figure: stream p1 … pN+3 on a time axis, divided into buckets B1, B2, …, BN/n, BN/n+1; legend: active element, expired element, future element, bucket]

Page 31: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

The Algorithm (for one sample)

Divide D into buckets of size n.

Maintain a random sample for each bucket (reservoir algorithm).

Combine samples of buckets that have active elements; there are at most two such buckets.

[Figure: stream p1 … pN+3 on a time axis, divided into buckets B1, B2, …, BN/n, BN/n+1 with samples R1, R2, …, RN/n, RN/n+1]
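
The bucket idea for a single sample can be sketched in Python as below. The sketch assumes the combination rule "use the sample of the older bucket if it is still active, otherwise the reservoir sample of the bucket being filled"; the class and attribute names are illustrative and not taken from the paper.

```python
import random

class SequenceWindowSampler:
    """One uniform sample from the n most recent elements, storing two elements."""

    def __init__(self, n, rng=random):
        self.n = n
        self.rng = rng
        self.count = 0      # number of elements seen so far (N)
        self.prev = None    # (index, item): final sample R1 of the last complete bucket
        self.cur = None     # (index, item): reservoir sample R2 of the bucket being filled

    def add(self, item):
        self.count += 1
        pos = (self.count - 1) % self.n + 1        # position inside the current bucket, 1..n
        if pos == 1:
            self.prev, self.cur = self.cur, None   # the previous bucket is now complete
        if self.rng.randrange(pos) == 0:           # reservoir step: keep with probability 1/pos
            self.cur = (self.count, item)

    def sample(self):
        # R1 is active iff its index lies in the window (N-n, N]; call after at least one add.
        if self.prev is not None and self.prev[0] > self.count - self.n:
            return self.prev[1]
        return self.cur[1]
```

Only two elements are stored at any time, independent of n, which matches the O(k) bound for k=1.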

Page 32: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

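[Figure: buckets BN/n and BN/n+1 covering pN-6 … pN+3 on a time axis, with bucket samples R1, R2 and the output X]

$X = \begin{cases} R_1 & \text{if } R_1 \text{ is active} \\ R_2 & \text{if } R_1 \text{ is expired} \end{cases}$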

Page 33: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

Case 1 (p is an active element of the older bucket BN/n)

[Figure: buckets BN/n and BN/n+1 covering pN-6 … pN+3 on a time axis, with the output X]

$P(X = p) = P(R_1 = p) = \frac{1}{n}$

Page 34: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

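Case 2 (p belongs to the newer bucket BN/n+1, which currently holds l elements, so l elements of BN/n are expired)

[Figure: buckets BN/n and BN/n+1 covering pN-6 … pN+3 on a time axis, with bucket samples R1, R2, the output X and the length l]

$P(X = p) = P(R_1 \text{ is expired}) \cdot P(R_2 = p) = \frac{l}{n} \cdot \frac{1}{l} = \frac{1}{n}$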

Page 35: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

Sampling Without Replacement for Sequence-Based Windows

Page 36: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

The Algorithm

Divide D into buckets of size n.

Maintain k random samples for each bucket.

Combine samples of buckets that have active elements:

[Figure: stream p1 … pN+3 on a time axis, divided into buckets B1, B2, …, BN/n, BN/n+1; with k=2 each bucket Bi carries the samples Ri,1 and Ri,2]

Page 37: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

[Figure: buckets BN/n and BN/n+1 covering pN-6 … pN+3 on a time axis, with sample sets R1 = {R1,1, R1,2} and R2 = {R2,1, R2,2}; in the example X = {R1,1, R2,2}]

$X = (\text{active part of } R_1) \cup (\text{rest from } R_2)$
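
The combination "active part of R1, rest from R2" can be illustrated offline in Python. This sketch makes several assumptions: the whole stream is given as a list, `random.sample` stands in for the per-bucket reservoirs, and the "rest" is drawn as a uniform random subset of R2. The function name is hypothetical.

```python
import random

def window_sample_without_replacement(stream, n, k, rng=random):
    """Uniform k-subset of the n most recent elements, built from per-bucket samples."""
    N = len(stream)
    assert k <= n <= N
    l = N % n                                  # elements already in the partial bucket
    indexed = list(enumerate(stream))          # 0-based positions; the window is N-n .. N-1
    if l == 0:                                 # the window coincides with the last full bucket
        return [x for _, x in rng.sample(indexed[N - n:], k)]
    b1 = indexed[N - l - n:N - l]              # last complete bucket, partly expired
    b2 = indexed[N - l:]                       # partial bucket, fully active
    r1 = rng.sample(b1, k)                     # stand-in for the k-sample of bucket b1
    r2 = rng.sample(b2, min(k, l))             # stand-in for the k-sample of bucket b2
    active = [(i, x) for i, x in r1 if i >= N - n]
    rest = rng.sample(r2, k - len(active))     # fill up with samples from R2
    return [x for _, x in active + rest]
```

At most l samples of R1 can be expired, so R2 always holds enough elements to fill the gap.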

Page 38: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

Sampling With Replacement for Timestamp-Based Windows

Page 39: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

Timestamp-based window

n is unknown! It can change arbitrarily.

Does our concept work?

How to divide the stream into buckets?

How to combine samples?

Page 40: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

The main idea, revised

[Figure: pN-16 … pN+3 on a time axis, split into adjacent buckets A and B; n=13, a=|A|=5, b=|B|=10, c=|A∩W|=3]

What if we can maintain buckets A, B as before?

Samples from A and B; a=|A|, b=|B|, c=|A∩W|.

If the sample from A is expired, X = sample from B.

If the sample from A is active, X = sample from A with probability a/n; otherwise X = sample from B.

Page 41: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

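Correctness

[Figure: pN-16 … pN+3 on a time axis, split into adjacent buckets A and B; n=13, a=|A|=5, b=|B|=10, c=|A∩W|=3]

For p in B:

$P(X = p) = \frac{1}{b}\left(\left(1 - \frac{c}{a}\right) + \frac{c}{a}\left(1 - \frac{a}{n}\right)\right) = \frac{1}{n}$

For an active p in A:

$P(X = p) = \frac{1}{a} \cdot \frac{a}{n} = \frac{1}{n}$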
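
The two probabilities can be checked with exact arithmetic. The sketch below assumes n = b + c and the combination rule stated for buckets A and B; the function name is illustrative.

```python
from fractions import Fraction

def combination_rule_probabilities(a, b, c):
    """Exact P(X = p) for an active p in A and for p in B under the combination rule."""
    n = b + c
    keep_a = Fraction(a, n)                           # chance of keeping an active sample from A
    p_active_a = Fraction(1, a) * keep_a
    a_expired = 1 - Fraction(c, a)                    # the sample from A is expired
    a_active_rejected = Fraction(c, a) * (1 - keep_a) # active, but the coin says "use B"
    p_b = Fraction(1, b) * (a_expired + a_active_rejected)
    return p_active_a, p_b

print(combination_rule_probabilities(5, 10, 3))  # → (Fraction(1, 13), Fraction(1, 13))
```

With the values from the figure (a=5, b=10, c=3, n=13) both probabilities equal 1/13, so every active element is equally likely.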

Page 42: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

Conclusions

[Figure: pN-16 … pN+3 on a time axis, split into adjacent buckets A and B; n=13, a=|A|=5, b=|B|=10, c=|A∩W|=3]

The combination rule works if:

1. a ≤ n

2. It is possible to generate events w.p. a/n

Page 43: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

The First Problem

How to maintain A, B at any moment?

|A| is less than n

Page 44: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

The solution: ζ-decomposition

List of buckets B1,…,Bs

Contain all active elements

2 samples from each bucket

B1 may contain expired elements as well

[Figure: B1 B2 B3 B4 … Bs-1 Bs]

Define $A = B_1, \quad B = \bigcup_{j > 1} B_j$

Ensure that |A| ≤ |B| and s = O(log n)

Page 45: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

ζ-decomposition : implementation

Similar idea to smooth histograms

Slightly different structure

$|B_i| = 2^{\lfloor \log \sum_{j \ge i} |B_j| \rfloor - 1}$

Page 46: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

The Second Problem

[Figure: pN-16 … pN+3 on a time axis, split into adjacent buckets A and B; n=13, a=|A|=5, b=|B|=10, c=|A∩W|=3]

Assuming a ≤ b ≤ n, how to generate events w.p. a/n?

a, b are known; c is unknown and n=b+c

Page 47: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

Approach

Generate a “biased” sample Y on A, using the sample from A, such that Y is expired w.p. b/n.

Use Y to obtain probability a/n.

The details are in the paper.

Page 48: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

[Figure: pN-16 … pN+3 on a time axis, split into adjacent buckets A and B; n=13, a=|A|=5, b=|B|=10, c=|A∩W|=3]

Lemma 1

Given a random sample from A, it is possible to construct a random variable Y on A such that

$P(Y = p_{N-b-i}) = \frac{b}{(b+i)(b+i+1)}, \quad i = 0, \dots, a-2$

$P(Y = p_{N-b-a+1}) = \frac{b}{b+a-1}$

Page 49: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

Lemma 2

Given Y from Lemma 1, it is possible to construct a 0-1 random variable Z such that P(Z=1) = a/n.

Proof sketch:

Generate an event T that happens w.p. a/b. This is possible since a ≤ b and a, b are known.

$P(Y \text{ is expired}) = 1 - \sum_{i=0}^{c-1} P(Y = p_{N-b-i}) = 1 - \sum_{i=0}^{c-1} \frac{b}{(b+i)(b+i+1)} = \frac{b}{b+c}$

$P(Y \text{ is expired}, T) = \frac{b}{b+c} \cdot \frac{a}{b} = \frac{a}{n}$
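
The distribution of Lemma 1 and the probability in Lemma 2 can be verified numerically with exact fractions. The sketch assumes the reading P(Y = p_{N-b-i}) = b/((b+i)(b+i+1)) for i = 0,…,a-2, with the remaining mass b/(b+a-1) on the oldest element of A, and it assumes c < a so that the oldest element of A is expired. The function names are illustrative.

```python
from fractions import Fraction

def lemma1_distribution(a, b):
    """P(Y = p_{N-b-i}) for i = 0..a-1; i = 0 is the newest element of A."""
    probs = [Fraction(b, (b + i) * (b + i + 1)) for i in range(a - 1)]
    probs.append(Fraction(b, b + a - 1))       # the oldest element of A takes the remainder
    return probs

def prob_expired(a, b, c):
    """P(Y is expired) when the c newest elements of A are active (assumes c < a)."""
    assert 0 <= c < a
    return 1 - sum(lemma1_distribution(a, b)[:c])

def prob_z_is_one(a, b, c):
    """P(Z = 1), where Z = 1 iff Y is expired and the independent event T (P(T) = a/b) occurs."""
    assert a <= b
    return prob_expired(a, b, c) * Fraction(a, b)
```

For a=5, b=10, c=3 this gives P(Y is expired) = 10/13 and P(Z=1) = 5/13 = a/n, without c ever being used to build Y or T.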

Page 50: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

Sampling Without Replacement for Timestamp-Based Windows

Page 51: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

Main idea

Implement a k-sample without replacement using k independent samples.

What can we do if the same point is sampled more than once?

Approach: sample from different domains.

Page 52: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

Cascading lemma

$H_i^j$: j-sample (without replacement) from {1,…,i}

Given $H_i^j$ and $H_{i+1}^1$, we can construct $H_{i+1}^{j+1}$.
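
One standard way to realize such a step is the construction known from Floyd's sampling algorithm: add the new 1-sample if it is not yet present, otherwise add the new largest element i+1. Whether the paper proves the lemma with exactly this construction is an assumption here, and the function names are illustrative.

```python
import random

def cascade(h, x, i):
    """Combine a j-sample h of {1..i} with a 1-sample x of {1..i+1} into a (j+1)-sample of {1..i+1}."""
    return h | {x if x not in h else i + 1}

def k_sample_without_replacement(n, k, rng=random):
    """Build a k-sample of {1..n} from k independent 1-samples on growing domains."""
    sample = set()
    for i in range(n - k, n):        # 1-samples from {1..n-k+1}, ..., {1..n}
        sample = cascade(sample, rng.randint(1, i + 1), i)
    return sample
```

Each step consumes one independent uniform draw from a domain that is one element larger than the previous one, which mirrors the "sample from different domains" approach.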

Page 53: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

Cascading Lemma (Illustration)

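Independent 1-samples: $H_{n-k+1}^1,\ H_{n-k+2}^1,\ H_{n-k+3}^1,\ H_{n-k+4}^1,\ \dots,\ H_{n-1}^1,\ H_n^1$

Cascade: $H_{n-k+1}^1 \to H_{n-k+2}^2 \to H_{n-k+3}^3 \to H_{n-k+4}^4 \to \dots \to H_{n-1}^{k-1} \to H_n^k$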

Page 54: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

Conclusions

Random Sampling

Optimally solved

Gives worst-case solutions for many problems

Page 55: Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

Thank you!