Sampling for Windows on Data Streams, by Vladimir Braverman
TRANSCRIPT
Data Stream
Sequence of elements D = p1, p2, …, pN, where each pi is drawn from [m].
Objective: Calculate a function f(D). Restrictions: single pass, sub-linear
memory, fast processing time (per element).
[Figure: stream elements p1, p2, …, pN laid out along a time axis.]
Motivation
Today’s applications: huge amounts of data are whizzing by.
Objective: mining the data, computing statistics, etc.
Restrictions: expensive overhead is not allowed.
Useful for many applications: networking, databases, etc.
Data Stream
Intensive theoretical research.
Streaming systems: STREAM (Stanford), StreamMill (UCLA), Aurora (Brown), GigaScope (Rutgers), Nile (Purdue), Niagara (Wisconsin), Telegraph (Berkeley), etc.
Data Stream
The model allows insertions only. What about deletions?
Turnstile model
Sliding windows
[Figure: the stream p1, …, pN on a time axis and, below it, a window of size n = 5 over p1, …, p8.]
Sliding Windows
SW contains the n most recent elements, which are “active”.
Older elements are “expired”.
[Figure: stream on a time axis with a window of size n = 5; elements are labeled expired or active.]
The figure uses n = 5, but n is “huge”.
Sequence-based Windows
SW contains the n most recent elements, which are “active”.
Older elements are “expired”.
Timestamp-based windows
[Figure: elements p1 through p13 placed along a time axis.]
What is known on sliding windows
[BDM 02] Random sampling
[DGIM 02] Sum, Count, average, Lp, 0<p≤2, weakly additive functions.
[DM 02] Rarity, similarity
[GT 02] Distributed sum, count
[FKZ 02], [CS 04] Diameter
[BDMO 03] Variance, k-medians
[GDDLM 03] Frequent elements
[AM 04] Counts, quantiles
[AGHLRS 04] LIS
[LT 06] Frequent items
[LT 06] Count
[ZG 06] Variance
[CCM 07] Entropy
Random Sampling
Fundamental approximation method
Pick a subset S of D.
Use f(S) to approximate f(D).
[Figure: a few elements of the stream p1, …, pN are picked as the sample.]
Types of k-sampling
With replacement: samples x1,…,xk are independent.
Without replacement: repetitions are forbidden, i.e., xi ≠ xj for i ≠ j.
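A minimal Python sketch of the two types of k-sampling, assuming the whole data set fits in a list. The function names and the `rng` parameter are illustrative and not taken from the slides.

```python
import random

def sample_with_replacement(data, k, rng=random):
    """Draw k independent uniform samples; the same element may appear twice."""
    return [rng.choice(data) for _ in range(k)]

def sample_without_replacement(data, k, rng=random):
    """Draw k distinct positions uniformly; no position is picked twice."""
    return rng.sample(data, k)
```

Neither function works on a stream as written, since both need random access to all of D.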
Properties of Random Sampling
General, simple, first-to-try method.
Stores an element, not an aggregation.
Allows changing f a posteriori.
Can be used for multiple statistics.
Provides effective solutions with worst-case guarantees.
The only known solution for many problems.
Some Known Methods for Data Streams
Reservoir Sampling [V 85]
Concise Sampling [GM 98]
Inverse Sampling [CMR 05]
Weighted Sampling [CMN 99]
Biased Sampling [A 06]
Priority Sampling [ADLT 05]
Dynamic Sampling [FIS 05]
Chain Sampling [BDM 02]
Streaming Sampling
Easy if N is fixed: pick a random index I from {1,2,…,N} and output pI.
But N is not known in advance.
Naïve methods:
Store the whole stream: linear memory.
“Guess” the final value of N: not really uniform.
Reservoir Sampling (Vitter 85)
Maintains k uniform samples without replacement using Θ(k) space.
Outputs a sample for every prefix.
Intuition: the probability of picking p decreases as N grows, so the probabilities can be adjusted dynamically.
Reservoir Sampling (Vitter 85)
Reservoir (array) of k elements, initially empty.
Algorithm:
Insert the first k elements into the reservoir.
For i > k, pick pi with probability k/i (this is 1/i when k = 1).
If pi is chosen, pick one of the samples in the reservoir uniformly at random and replace it with pi.
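The reservoir algorithm can be sketched in Python as follows. This is a minimal version, assuming the stream is any iterable; the function name and the `rng` parameter are illustrative. For a reservoir of k elements the i-th element is kept with probability k/i, which reduces to 1/i for a single sample.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Return k uniform samples without replacement from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)               # the first k elements fill the reservoir
        elif rng.random() < k / i:               # keep p_i with probability k/i
            reservoir[rng.randrange(k)] = item   # replace a uniformly chosen slot
    return reservoir
```

If the stream has fewer than k elements, the function simply returns all of them.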
Sampling on Sliding Windows: Problem Definition
Maintain uniform random sampling on sliding windows.
Output a sample for every window.
Use provably optimal memory.
Sampling for Sliding Windows
Can we use previous methods? No: samples expire.
[Figure: a window of size n = 5 sliding over p1, …, p8.]
Naïve Approach
Store the whole window: linear memory, so f(W) can be computed directly.
[Figure: a window of size n = 5 over p1, …, p8.]
Periodic Sampling
Pick a sample pi from the first window
When pi expires, take the new element
Continue…
Periodic Sampling: problems
Vulnerability to malicious behavior: given one sample, it is possible to predict all future samples.
Poor representation of periodic data: if the period “agrees” with the sample.
Unacceptable for applications.
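The predictability problem is easy to see in a small sketch. Assuming a sequence-based window of size n and a first sample at stream index i (both names are illustrative), the element that arrives when pi expires is p(i+n), so every later sample index is determined by the first one:

```python
def periodic_sample_indices(first_index, n, num_samples):
    """Stream indices chosen by periodic sampling: i, i + n, i + 2n, ..."""
    return [first_index + j * n for j in range(num_samples)]
```

For example, with n = 5 and a first sample at index 3, the function returns indices 3, 8, 13, 18 for the first four samples.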
Sampling on Sliding Windows: Problem Definition
Maintain uniform random sampling on sliding windows.
Use provably optimal memory.
Samples on distinct windows are independent.
Chain and Priority Methods
Babcock, Datar, Motwani, SODA 2002.
Maintain uniform random sampling on sliding windows.
Chain Sampling
Sequence-based windows, with replacement.
Uses optimal memory in expectation.
Uses O(k log n) w.h.p.
Samples on distinct windows are weakly dependent.
Priority Sampling
Timestamp-based windows, with replacement.
Uses optimal memory in expectation and w.h.p.
Samples on distinct windows are independent.
S3 Algorithms
Maintain uniform random sampling on sliding windows
Supports all cases.
Provably optimal.
Samples on distinct windows are independent.

Memory by sampling type (rows) and window type (columns):
                      Sequence-based   Timestamp-based
With replacement      O(k)             O(k log n)
Without replacement   O(k)             O(k log n)

S3: Recap
Concepts
Prior algorithms: replacement policy for expired samples.
S3 algorithms:
Divide the stream into buckets.
Sample(s) for each bucket.
Combination rule.
Sampling With Replacement for Sequence-Based Windows
Notations
[Figure: the stream on a time axis, divided into buckets B1, B2, …, BN/n, BN/n+1 of size n; the legend marks each element as expired, active, or future.]
The Algorithm (for one sample)
Divide D into buckets of size n.
Maintain a random sample for each bucket (reservoir algorithm).
Combine the samples of the buckets that have active elements; there are at most two such buckets.
[Figure: buckets B1, B2, …, BN/n, BN/n+1 along the time axis with their samples R1, R2, …, RN/n, RN/n+1.]
[Figure: the two buckets BN/n and BN/n+1 that contain active elements, with their samples R1 and R2.]
Combination rule:
$X = R_1$ if $R_1$ is active; $X = R_2$ if $R_1$ is expired.
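A minimal Python sketch of the single-sample scheme for sequence-based windows: buckets of size n, one reservoir sample per bucket, and the rule "output R1 if it is active, otherwise R2". The class and attribute names are illustrative assumptions, not the authors' code. Only the samples of the last complete bucket and of the current partial bucket are stored.

```python
import random

class SequenceWindowSampler:
    """One uniform sample from the n most recent elements, storing two samples."""

    def __init__(self, n, rng=random):
        self.n = n
        self.rng = rng
        self.count = 0      # N: number of elements seen so far
        self.prev = None    # R1: (index, value) sample of the last complete bucket
        self.cur = None     # R2: (index, value) reservoir sample of the partial bucket

    def add(self, value):
        self.count += 1
        pos = (self.count - 1) % self.n + 1      # position inside the current bucket
        if pos == 1:                             # a new bucket starts
            self.prev, self.cur = self.cur, None
        if self.rng.random() < 1.0 / pos:        # reservoir step for one sample
            self.cur = (self.count, value)

    def sample(self):
        """Combination rule: X = R1 if R1 is active, otherwise X = R2."""
        if self.prev is not None and self.prev[0] > self.count - self.n:
            return self.prev[1]
        return self.cur[1]
```

`sample()` assumes at least one element has been added. Before the first n elements have arrived, it returns a uniform sample of everything seen so far.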
Case 1: p is an active element of the older bucket.
$P(X = p) = P(R_1 = p) = \frac{1}{n}$
Case 2: p belongs to the newer bucket, which currently holds l elements.
$P(X = p) = P(R_1 \text{ is expired}) \cdot P(R_2 = p) = \frac{l}{n} \cdot \frac{1}{l} = \frac{1}{n}$
Sampling Without Replacement for Sequence-Based Windows
The Algorithm
Divide D into buckets of size n.
Maintain k random samples for each bucket.
Combine the samples of the buckets that have active elements:
[Figure, k = 2: buckets B1, B2, …, BN/n, BN/n+1 on the time axis, each with two samples Ri,1 and Ri,2.]
[Figure: the two buckets BN/n and BN/n+1 that contain active elements, with their sample sets R1 and R2.]
$X = (\text{active part of } R_1) \cup (\text{rest from } R_2)$
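A hedged Python sketch of this combination step: keep the active part of R1 and fill the remaining slots with a random subset of R2. It assumes R1 is a k-sample without replacement of the full older bucket and R2 holds min(k, l) samples of the newer bucket, where l is the number of elements in it so far. Samples are stored as (index, value) pairs; the function name and arguments are illustrative.

```python
import random

def combine_without_replacement(r1, r2, oldest_active, k, rng=random):
    """Build a k-sample of the window from the bucket samples r1 (older) and r2 (newer).

    oldest_active is the stream index of the oldest active element.
    """
    active = [s for s in r1 if s[0] >= oldest_active]   # active part of R1
    rest = rng.sample(r2, k - len(active))              # the rest comes from R2
    return active + rest
```

The call to `rng.sample` cannot run short under the stated assumptions: at most l elements of the older bucket are expired, so at most min(k, l) slots need to be filled from R2.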
Sampling With Replacement for Timestamp-Based Windows
Timestamp-based window
n is unknown and can change arbitrarily.
Does our concept work?
How to divide the stream into buckets?
How to combine samples?
[Figure: a window with n = 13 covered by two buckets A and B; a = |A| = 5, b = |B| = 10.]
What if we can maintain buckets A and B as before?
Samples from A and B; a = |A|, b = |B|, c = |A ∩ W| (c = 3 in the figure).
If the sample from A is expired, X = sample from B.
If the sample from A is active, X = sample from A with probability a/n; otherwise X = sample from B.
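The rule can be sketched in Python together with an exact check of the resulting probabilities. The sketch assumes, as the slide does at this point, that a coin with bias a/n is available and that n = b + c; all function names are illustrative.

```python
import random
from fractions import Fraction

def combine_timestamp(sample_a, sample_b, a_is_active, a, n, rng=random):
    """X = sample from B if the sample from A is expired;
    otherwise X = sample from A with probability a/n, else sample from B."""
    if not a_is_active:
        return sample_b
    if rng.random() < a / n:
        return sample_a
    return sample_b

def output_probabilities(a, b, c):
    """Exact P(X = p) for an active p in A and for a p in B, with n = b + c."""
    n = b + c
    p_in_a = Fraction(1, a) * Fraction(a, n)
    p_in_b = Fraction(1, b) * ((1 - Fraction(c, a)) + Fraction(c, a) * (1 - Fraction(a, n)))
    return p_in_a, p_in_b
```

With the figure's values a = 5, b = 10, c = 3, both probabilities evaluate to 1/13, so every element of the window is equally likely.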
The main idea, revised
[Figure: the same window with n = 13; a = |A| = 5, b = |B| = 10, c = |A ∩ W| = 3.]
Correctness
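For p in B:
$P(X = p) = \frac{1}{b}\left[\left(1 - \frac{c}{a}\right) + \frac{c}{a}\left(1 - \frac{a}{n}\right)\right] = \frac{1}{n}$
For an active p in A:
$P(X = p) = \frac{1}{a} \cdot \frac{a}{n} = \frac{1}{n}$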
[Figure: the same window with n = 13; a = |A| = 5, b = |B| = 10, c = |A ∩ W| = 3.]
The combination rule works if:
1. a ≤ n
2. It is possible to generate events w.p. a/n
Conclusions
The First Problem
How to maintain A, B at any moment? |A| is less than n.
The solution: ζ-decomposition
List of buckets B1,…,Bs:
They contain all active elements.
2 samples from each bucket.
B1 may contain expired elements as well.
[Figure: buckets B1, B2, B3, B4, …, Bs-1, Bs.]
Define $A = B_1$, $B = \bigcup_{j>1} B_j$.
Ensure that |A| ≤ |B| and s = O(log n).
ζ-decomposition : implementation
Similar idea to smooth histograms.
Slightly different structure.
[Formula garbled in the transcript: a condition on |Bi| involving a logarithm and the sum of |Bj| over j > i.]
[Figure: the same window with n = 13; a = |A| = 5, b = |B| = 10, c = |A ∩ W| = 3.]
Assuming a ≤ b ≤ n, how to generate events w.p. a/n?
a and b are known, c is unknown, and n = b + c.
The Second Problem
Approach
Generate a “biased” sample Y on A such that Y expires w.p. b/n.
Use Y to obtain probability a/n.
The details are in the paper.
[Figure: the same window with n = 13; a = |A| = 5, b = |B| = 10, c = |A ∩ W| = 3.]
Lemma 1
Given a random sample from A, it is possible to construct a random variable Y on A such that
$P(Y = p_{N-b-i}) = \frac{b}{(b+i)(b+i+1)}$, for $i = 0, \dots, a-2$,
$P(Y = p_{N-b-a+1}) = \frac{b}{a+b-1}$.
Lemma 2
Given Y from Lemma 1, it is possible to construct a 0-1 random variable Z such that P(Z=1) = a/n.
Proof sketch:
Generate an event T that happens w.p. a/b. This is possible since a ≤ b and a, b are known.
$P(Y \text{ is expired}) = 1 - \sum_{i=0}^{c-1} P(Y = p_{N-b-i}) = 1 - \sum_{i=0}^{c-1} \frac{b}{(b+i)(b+i+1)} = \frac{b}{b+c}$
$P(Y \text{ is expired}, T) = \frac{b}{b+c} \cdot \frac{a}{b} = \frac{a}{n}$
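The arithmetic behind the two lemmas can be checked exactly with a short Python sketch. It assumes the distribution of Y is the one read from Lemma 1: probability b/((b+i)(b+i+1)) on the i-th newest element of A for i = 0,…,a-2, and b/(a+b-1) on the oldest. It also assumes c < a, so that A contains at least one expired element. The function names are illustrative.

```python
from fractions import Fraction

def lemma1_distribution(a, b):
    """P(Y = p_{N-b-i}) for i = 0..a-1, where i = 0 is the newest element of A."""
    probs = [Fraction(b, (b + i) * (b + i + 1)) for i in range(a - 1)]
    probs.append(Fraction(b, a + b - 1))     # oldest element of A (i = a-1)
    return probs

def prob_expired(a, b, c):
    """Y is expired unless it is one of the c newest elements of A (assumes c < a)."""
    return 1 - sum(lemma1_distribution(a, b)[:c])
```

With a = 5, b = 10, c = 3 the distribution sums to 1, `prob_expired` gives 10/13 = b/n, and multiplying by a/b = 5/10 gives 5/13 = a/n. Note that the distribution of Y depends only on a and b, which are known, while the expiry probability adapts to the unknown c.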
Sampling Without Replacement for Timestamp-Based Windows
Main idea: implement a k-sample without replacement using k independent samples.
What can we do if the same point is sampled more than once?
Approach: sample from different domains.
Cascading lemma
$H_i^j$: a j-sample (without replacement) from {1,…,i}.
Given $H_i^j$ and $H_{i+1}^1$, we can construct $H_{i+1}^{j+1}$.
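One construction that satisfies the lemma, given here as an assumption since the slides do not spell it out: if the single sample from {1,…,i+1} already belongs to the j-sample, add the new element i+1; otherwise add the single sample itself. The Python sketch below implements this step and verifies uniformity exactly by enumerating all inputs. The function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

def cascade(h_ij, h_single, i):
    """Combine a j-subset of {1..i} and one element of {1..i+1} into a (j+1)-subset of {1..i+1}."""
    extra = (i + 1) if h_single in h_ij else h_single
    return frozenset(h_ij) | {extra}

def output_distribution(i, j):
    """Exact output distribution of cascade() over uniform, independent inputs."""
    dist = {}
    subsets = list(combinations(range(1, i + 1), j))
    weight = Fraction(1, len(subsets) * (i + 1))
    for s in subsets:
        for x in range(1, i + 2):
            out = cascade(s, x, i)
            dist[out] = dist.get(out, 0) + weight
    return dist
```

For i = 4 and j = 2, every one of the 10 three-element subsets of {1,…,5} receives probability 1/10. Applying the step k-1 times turns k independent single samples from growing domains into one k-sample without replacement.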
Cascading Lemma (Illustration)
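[Figure: k independent single samples $H_{n-k+1}^1, H_{n-k+2}^1, H_{n-k+3}^1, H_{n-k+4}^1, \dots, H_{n-1}^1, H_n^1$ are cascaded step by step into $H_{n-k+2}^2, H_{n-k+3}^3, H_{n-k+4}^4, \dots, H_{n-1}^{k-1}, H_n^k$.]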
Conclusions
Random Sampling
Optimally solved.
Gives worst-case solutions for many problems.
Thank you!