sampling for windows on data streams by vladimir braverman vova@cs.ucla.edu

Post on 19-Jan-2016

Sampling for Windows on Data Streams

by Vladimir Braverman

vova@cs.ucla.edu

Data Stream

Sequence of elements D = p1, p2, …, pN; each pi is drawn from [m].

Objective: calculate a function f(D).
Restrictions: single pass, sub-linear memory, fast processing time (per element).

[Figure: the stream p1, p2, …, pN laid out along a time axis]

Motivation

Today’s applications: huge amounts of data are whizzing by
Objective: mining the data, computing statistics, etc.
Restrictions: expensive overhead is not allowed
Useful for many applications: networking, databases, etc.

Data Stream

Intensive theoretical research
Streaming systems: Stream (Stanford), StreamMill (UCLA), Aurora (Brown), GigaScope (Rutgers), Nile (Purdue), Niagara (Wisconsin), Telegraph (Berkeley), etc.

Data Stream

The model allows insertions only
What about deletions?
  Turnstile model
  Sliding windows

[Figure: the stream p1, p2, …, pN laid out along a time axis]

Sliding Windows

SW contains the n most recent elements, which are “active”.
Older elements are “expired”.

[Figure: elements p1, …, p8 on a time axis with window size n = 5; elements are labeled expired or active]

[Figure: the window of size n = 5 at the end of a long stream p1, …, pN; in practice n is “huge”]

Sequence-based Windows

SW contains the n most recent elements, which are “active”.
Older elements are “expired”.

Timestamp-based windows

[Figure: elements p1, …, p13 placed on a time axis]

What is known on sliding windows

[BDM 02] Random sampling

[DGIM 02] Sum, Count, average, Lp, 0<p≤2, weakly additive functions.

[DM 02] Rarity, similarity

[GT 02] Distributed sum, count

[FKZ 02], [CS 04] Diameter

[BDMO 03] Variance, k-medians

[GDDLM 03] Frequent elements

[AM 04] Counts, quantiles

[AGHLRS 04] LIS

[LT 06] Frequent items

[LT 06] Count

[ZG 06] Variance

[CCM 07] Entropy

Random Sampling

Fundamental approximation method

Pick a subset S of D
Use f(S) to approximate f(D)

[Figure: a subset of the stream elements p1, …, pN is selected]

Types of k-sampling

With replacement: samples x1,…,xk are independent
Without replacement: repetitions are forbidden, i.e., xi ≠ xj for i ≠ j
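The two types of k-sampling can be illustrated on a small in-memory list. This is a minimal sketch using Python's standard `random` module; the function names and the toy data are hypothetical and only serve the illustration.

```python
import random

def k_sample_with_replacement(items, k, rng):
    # k independent uniform draws; the same element may appear more than once
    return rng.choices(items, k=k)

def k_sample_without_replacement(items, k, rng):
    # k distinct positions chosen uniformly; no position is picked twice
    return rng.sample(items, k)

rng = random.Random(7)
data = list(range(1, 11))
with_repl = k_sample_with_replacement(data, 4, rng)
without_repl = k_sample_without_replacement(data, 4, rng)
```

Both calls need the whole list in memory, which is exactly what the streaming setting rules out.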

Properties of Random Sampling

General, simple, first-to-try method
Stores elements, not an aggregate
  Allows f to be changed a posteriori
  Can be used for multiple statistics
Provides effective solutions with worst-case guarantees
The only known solution for many problems

Some Known Methods for Data Streams

Reservoir Sampling [V 85]

Concise Sampling [GM 98]

Inverse Sampling [CMR 05]

Weighted Sampling [CMN 99]

Biased Sampling [A 06]

Priority Sampling [ADLT 05]

Dynamic Sampling [FIS 05]

Chain Sampling [BDM 02]

Streaming Sampling

Easy if N is fixed
  Pick a random index I from {1,2,…,N}
  Output pI
But: N is not known in advance
Naïve methods
  Store the whole stream: linear memory
  “Guess” the final value of N: not really uniform

Reservoir Sampling (Vitter 85)

Maintains k uniform samples without replacement using Θ(k) space
Outputs a sample for every prefix
Intuition: the probability of picking p decreases as N grows, so the probabilities can be adjusted dynamically

Reservoir Sampling (Vitter 85)

Reservoir (array) of k elements, initially empty

Algorithm:
  Insert the first k elements into the reservoir.
  For i > k, pick pi with probability k/i (1/i for a single sample).
  If pi is chosen:
    Pick one of the samples in the reservoir uniformly at random
    Replace it with pi
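The reservoir algorithm can be sketched in a few lines of Python. This is a minimal sketch of the classic scheme, in which element pi is kept with probability k/i (which reduces to 1/i for a single sample); the function name `reservoir_sample` and the optional `rng` argument are illustrative choices, not part of the talk.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Return k uniform samples without replacement from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)              # the first k elements fill the reservoir
        elif rng.random() < k / i:              # keep p_i with probability k/i
            reservoir[rng.randrange(k)] = item  # evict a uniformly chosen slot
    return reservoir
```

For example, `reservoir_sample(range(1000), 5)` returns five distinct values from the range, while a stream shorter than k is returned whole.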

Sampling on Sliding Windows: Problem Definition

Maintain uniform random sampling on sliding windows
  Output a sample for every window
Use provably optimal memory

Sampling for Sliding Windows

Can we use previous methods?
No: samples expire

[Figure: elements p1, …, p8 on a time axis with window size n = 5]

Naïve Approach

Store the whole window
Linear memory => compute f(W) directly

[Figure: elements p1, …, p8 on a time axis with window size n = 5]

Periodic Sampling

Pick a sample pi from the first window

When pi expires, take the new element

Continue…

Periodic Sampling: problems

Vulnerability to malicious behavior
  Given one sample, it is possible to predict all future samples
Poor representation of periodic data
  If the period of the data “agrees” with the sampling period
Unacceptable for applications
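The predictability of periodic sampling is easy to see in code. The following is a minimal sketch for sequence-based windows of size n; the function name and the choice to report 1-based stream positions are assumptions made for the illustration.

```python
import random

def periodic_sample_indices(n, num_elements, rng=random):
    """Stream position of the periodic sample after each arrival t = n, ..., num_elements."""
    current = rng.randint(1, n)      # uniform pick from the first window p_1..p_n
    history = []
    for t in range(n, num_elements + 1):
        if current <= t - n:         # p_current has left the window (t-n+1 .. t)
            current = t              # take the newly arrived element
        history.append(current)
    return history
```

After the first random choice i, every later sample sits at position i + n, i + 2n, and so on, so one observed sample reveals all future ones.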

Sampling on Sliding Windows: Problem Definition

Maintain uniform random sampling on sliding windows
Use provably optimal memory
Samples on distinct windows are independent

Chain and Priority Methods

Babcock, Datar, Motwani, SODA 2002.
Maintain uniform random sampling on sliding windows

Chain Sampling
  Sequence-based windows, with replacement
  Uses optimal memory in expectation
  Uses O(k log n) w.h.p.
  Samples on distinct windows are weakly dependent

Priority Sampling
  Timestamp-based windows, with replacement
  Uses optimal memory in expectation and w.h.p.
  Samples on distinct windows are independent

S3 Algorithms

Maintain uniform random sampling on sliding windows
  Supports all cases
  Provably optimal
  Samples on distinct windows are independent

Memory, by sampling type (rows) and window type (columns):

                       Sequence-based    Timestamp-based
With Replacement       O(k)              O(k*log n)
Without Replacement    O(k)              O(k*log n)

S3: Recap

Concepts

Prior algorithms: replacement policy for expired samples
S3 algorithms:
  Divide the stream into buckets
  Sample(s) for each bucket
  Combination rule

Sampling With Replacement for Sequence-Based Windows

Notations

[Figure: the stream p1, …, pN on a time axis, divided into buckets B1, B2, …, BN/n, BN/n+1; the legend marks expired, active, and future elements]

The Algorithm (for one sample)

Divide D into buckets of size n
Maintain a random sample for each bucket (reservoir algorithm)
Combine samples of buckets that have active elements
  There are at most two such buckets

[Figure: buckets B1, B2, …, BN/n, BN/n+1 with their samples R1, R2, …, RN/n, RN/n+1]

[Figure: the two buckets BN/n and BN/n+1 that overlap the window, with samples R1 and R2]

$$X = \begin{cases} R_1 & \text{if } R_1 \text{ is active} \\ R_2 & \text{if } R_1 \text{ is expired} \end{cases}$$

Case 1: p is an active element of the older bucket

$$P(X = p) = P(R_1 = p) = \frac{1}{n}$$

Case 2: p is one of the l elements seen so far in the newer bucket (so exactly l elements of the older bucket are expired)

$$P(X = p) = P(R_1 \text{ is expired}) \cdot P(R_2 = p) = \frac{l}{n} \cdot \frac{1}{l} = \frac{1}{n}$$
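The bucket scheme for a single sample over a sequence-based window can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the authors' implementation: the class name `SequenceWindowSampler`, its method names, and the choice to store (position, value) pairs are all hypothetical. Only the sample of the last complete bucket (R1) and the size-1 reservoir of the bucket being filled (R2) are kept.

```python
import random

class SequenceWindowSampler:
    """One uniform sample from the last n elements, storing two elements."""

    def __init__(self, n, rng=None):
        self.n = n
        self.rng = rng or random.Random()
        self.count = 0      # number of elements seen so far
        self.prev = None    # (position, value): sample R1 of the last complete bucket
        self.cur = None     # (position, value): sample R2 of the bucket being filled

    def add(self, value):
        self.count += 1
        pos = (self.count - 1) % self.n + 1     # position inside the current bucket, 1..n
        if self.rng.random() < 1.0 / pos:       # size-1 reservoir step (always true for pos == 1)
            self.cur = (self.count, value)
        if pos == self.n:                       # bucket complete: it becomes the older bucket
            self.prev, self.cur = self.cur, None

    def sample(self):
        """Sample from the current window; needs at least n elements."""
        if self.count < self.n:
            raise ValueError("window not yet full")
        position, value = self.prev
        if position > self.count - self.n:      # R1 is active
            return value
        return self.cur[1]                      # R1 is expired: use R2
```

Querying is deterministic given the stored state; all randomness is spent while elements arrive.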

Sampling Without Replacement for Sequence-Based Windows

The Algorithm

Divide D into buckets of size n
Maintain k random samples for each bucket
Combine samples of buckets that have active elements

[Figure: the bucketed stream with k = 2 samples per bucket, e.g. R1,1 and R1,2 for B1, R2,1 and R2,2 for B2]

[Figure: the two buckets BN/n and BN/n+1 that overlap the window, with samples R1 = {R1,1, R1,2} and R2 = {R2,1, R2,2}; in the example X = {R1,1, R2,2}]

$$X = (\text{active part of } R_1) \cup (\text{rest from } R_2)$$
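The combination rule for k samples without replacement can be sketched as below. This is a minimal Python sketch, assuming k ≤ n and a stream of at least n elements; the function names are hypothetical, and the step that takes "the rest from R2" is implemented here as a uniformly random sub-selection of the newer bucket's reservoir, which is one way to keep the result uniform.

```python
import random

def reservoir_step(reservoir, k, pos, item, rng):
    """One step of size-k reservoir sampling; pos is the 1-based position in the bucket."""
    if pos <= k:
        reservoir.append(item)
    elif rng.random() < k / pos:
        reservoir[rng.randrange(k)] = item

def window_sample_without_replacement(stream, n, k, rng):
    """Return k distinct elements sampled uniformly from the last n elements of `stream`."""
    prev, cur = [], []      # samples R1 (older bucket) and R2 (newer bucket)
    count = 0
    for value in stream:
        count += 1
        pos = (count - 1) % n + 1
        reservoir_step(cur, k, pos, (count, value), rng)
        if pos == n:        # bucket complete: it becomes the older bucket
            prev, cur = cur, []
    active = [item for item in prev if item[0] > count - n]   # active part of R1
    rest = rng.sample(cur, k - len(active))                   # the rest comes from R2
    return [value for _, value in active + rest]
```

The number of expired samples in R1 never exceeds the number of elements seen in the newer bucket, so R2 always holds enough elements to fill the gap.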

Sampling With Replacement for Timestamp-Based Windows

Timestamp-based window

n is unknown! It can change arbitrarily
Does our concept work?
  How to divide the stream into buckets?
  How to combine samples?

The main idea, revised

What if we can maintain buckets A, B as before?
  Samples from A and B
  a=|A|, b=|B|, c=|A ∩ W|
  If the sample from A is expired, X = sample from B
  If the sample from A is active,
    X = sample from A with probability a/n
    Otherwise X = sample from B

[Figure: buckets A and B on the stream; n=13, a=|A|=5, b=|B|=10, c=|A ∩ W|=3]

Correctness

For p in B:

$$P(X = p) = \frac{1}{b}\left[\left(1 - \frac{c}{a}\right) + \frac{c}{a}\left(1 - \frac{a}{n}\right)\right] = \frac{1}{n}$$

For an active p in A:

$$P(X = p) = \frac{1}{a} \cdot \frac{a}{n} = \frac{1}{n}$$
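The two probabilities for the timestamp-based combination rule can be checked with exact arithmetic. This is a small sketch using Python's `fractions` module; the function name `combined_distribution` is hypothetical, and the rule encoded is the one stated on the slides (use the sample from A with probability a/n when it is active, otherwise the sample from B), with n = b + c.

```python
from fractions import Fraction

def combined_distribution(a, b, c):
    """Exact P(X = p) for an active element of A and for an element of B.

    a = |A|, b = |B|, c = |A ∩ W|, n = b + c.
    """
    n = b + c
    # The sample from A equals p (prob. 1/a) and is then kept (prob. a/n).
    p_in_a = Fraction(1, a) * Fraction(a, n)
    # B is used if the sample from A is expired, or active but not kept.
    use_b = (1 - Fraction(c, a)) + Fraction(c, a) * (1 - Fraction(a, n))
    p_in_b = Fraction(1, b) * use_b
    return p_in_a, p_in_b
```

With the values from the figure (a = 5, b = 10, c = 3), both probabilities come out as 1/13, that is 1/n.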

Conclusions

The combination rule works if:
1. a ≤ n
2. It is possible to generate events w.p. a/n

The First Problem

How to maintain A, B at any moment?
  |A| is less than n
The solution: ζ-decomposition
  List of buckets B1,…,Bs
  Contain all active elements
  2 samples from each bucket
  B1 may contain expired elements as well

[Figure: buckets B1 B2 B3 B4 … Bs-1 Bs]

Define $A = B_1,\ B = \bigcup_{j>1} B_j$
Ensure that |A| ≤ |B| and s = O(log n)

ζ-decomposition : implementation

Similar idea to smooth histograms
Slightly different structure

[Equation: the size of bucket Bi in terms of the sizes of the later buckets Bj, j > i]

The Second Problem

Assuming a ≤ b ≤ n, how to generate events w.p. a/n?
a, b are known, c is unknown and n=b+c

[Figure: buckets A and B on the stream; n=13, a=|A|=5, b=|B|=10, c=|A ∩ W|=3]

Approach

Generate a “biased” sample Y on A, using the sample from A, such that Y expires w.p. b/n
Use Y to obtain probability a/n
The details are in the paper

Lemma 1

Given a random sample from A, it is possible to construct a random variable Y on A such that

$$P(Y = p_{N-b-i}) = \frac{b}{(b+i)(b+i+1)}, \quad i = 0, \dots, a-2$$

$$P(Y = p_{N-b-a+1}) = \frac{b}{b+a-1}$$

Lemma 2

Given Y from Lemma 1, it is possible to construct a 0-1 random variable Z such that P(Z=1) = a/n

Proof sketch:
Generate an event T that happens w.p. a/b
  It is possible since a ≤ b and a, b are known

$$P(Y \text{ is expired}) = 1 - \sum_{i=0}^{c-1} P(Y = p_{N-b-i}) = 1 - \sum_{i=0}^{c-1} \frac{b}{(b+i)(b+i+1)} = \frac{b}{b+c}$$

$$P(Y \text{ is expired},\ T) = \frac{b}{b+c} \cdot \frac{a}{b} = \frac{a}{n}$$
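The arithmetic behind Lemma 1 and Lemma 2 can be verified exactly. The sketch below does not construct Y from a uniform sample (the talk defers that to the paper); it only tabulates the distribution stated in Lemma 1 and checks that P(Y is expired) · a/b equals a/n. The function names are hypothetical, and the check assumes c < a, so that at least one element of A is expired.

```python
from fractions import Fraction

def y_distribution(a, b):
    """P(Y = p_{N-b-i}) for i = 0..a-1; index a-1 is the oldest element of A."""
    probs = [Fraction(b, (b + i) * (b + i + 1)) for i in range(a - 1)]
    probs.append(Fraction(b, b + a - 1))
    return probs

def z_probability(a, b, c):
    """P(Z = 1) = P(Y is expired) * P(T), where P(T) = a/b."""
    probs = y_distribution(a, b)
    p_expired = 1 - sum(probs[:c])   # the c newest elements of A are the active ones
    return p_expired * Fraction(a, b)
```

For a = 5, b = 10, c = 3 the distribution sums to 1, Y is expired with probability 10/13, and `z_probability` returns 5/13 = a/n, even though c itself is never used to build Y.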

Sampling Without Replacement for Timestamp-Based Windows

Main idea
  Implement a k-sample without replacement using k independent samples
What can we do if the same point is sampled more than once?
Approach: sample from different domains

Cascading lemma

$H_j^i$: a j-sample (without replacement) from {1,…,i}
Given $H_j^i$ and $H_1^{i+1}$, we can construct $H_{j+1}^{i+1}$.

Cascading Lemma (Illustration)

$H_1^{n-k+1},\ H_1^{n-k+2},\ H_1^{n-k+3},\ H_1^{n-k+4},\ \dots,\ H_1^{n-1},\ H_1^{n}$

$H_2^{n-k+2} \to H_3^{n-k+3} \to H_4^{n-k+4} \to \dots \to H_{k-1}^{n-1} \to H_k^{n}$
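One way to realize the cascading step is sketched below. The slides do not spell out the construction, so this sketch assumes the well-known rule (the same as in Floyd's sampling algorithm): if the single sample from {1,…,i+1} is already in the j-sample, add i+1, otherwise add the single sample itself. The paper's own construction may differ; the function names are hypothetical.

```python
import random

def cascade(h_j_i, h_1_next, i):
    """Build a (j+1)-sample of {1..i+1} from a j-sample of {1..i} and a 1-sample of {1..i+1}."""
    if h_1_next in h_j_i:
        return h_j_i | {i + 1}      # collision: take the new domain element instead
    return h_j_i | {h_1_next}

def k_sample_without_replacement(n, k, rng):
    """Chain the step over domains {1..n-k+1}, ..., {1..n} using k independent single samples."""
    sample = set()
    for i in range(n - k, n):       # i is the size of the current domain
        x = rng.randint(1, i + 1)   # independent uniform draw from {1..i+1}
        sample = cascade(sample, x, i)
    return sample
```

Each of the k single draws uses a domain one element larger than the previous one, which mirrors the chain in the illustration.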

Conclusions

Random sampling
  Optimally solved
  Gives worst-case solutions for many problems

Thank you!
