CS 410/510 Data Streams Lecture 16: Data-Stream Sampling: Basic Techniques and Results
DESCRIPTION
CS 410/510 Data Streams Lecture 16: Data-Stream Sampling: Basic Techniques and Results. Kristin Tufte, David Maier. Data Stream Sampling: sampling provides a synopsis of a data stream; the sample can serve as input for answering queries and for “statistical inference about the contents of the stream.”

TRANSCRIPT

CS 410/510 Data Streams
Lecture 16: Data-Stream Sampling: Basic Techniques and Results
Kristin Tufte, David Maier

Data Stream Sampling
- Sampling provides a synopsis of a data stream
- The sample can serve as input for:
  - Answering queries
  - “Statistical inference about the contents of the stream”
  - A “variety of analytical procedures”
- Focus on: obtaining a sample from the window (sample size « window size)

Windows
- Stationary window: endpoints of the window are fixed (think relation)
- Sliding window: endpoints of the window move
  - What we've been talking about
  - More complex than a stationary window because elements must be removed from the sample when they expire from the window

Simple Random Sampling (SRS)
- What is a “representative” sample?
- SRS for a sample of k elements from a window with n elements:
  - Every possible sample (of size k) is equally likely, that is, has probability $1/\binom{n}{k}$
  - Every element is equally likely to be in the sample
- Stratified Sampling
  - Divide the window into disjoint segments (strata)
  - SRS over each stratum
  - Advantageous when stream elements close together in the stream have similar values

Bernoulli Sampling
- Includes each element in the sample with probability q
- The sample size is not fixed; it is binomially distributed
- Probability that the sample contains k elements: $\binom{n}{k} q^k (1-q)^{n-k}$
- Expected sample size is nq

Binomial Distribution - Example
- Binomial Distribution (n=20, q=0.5); expected sample size = 20 * 0.5 = 10
[Chart: Binomial(n=20, q=0.5) distribution; x-axis: sample size (0-20), y-axis: probability]

Binomial Distribution - Example
- Binomial Distribution (n=20, q=1/3); expected sample size = 20 * 1/3 ≈ 6.667
[Chart: Binomial(n=20, q=1/3) distribution; x-axis: sample size (0-20), y-axis: probability]

Bernoulli Sampling - Implementation
- Naïve: elements inserted with probability q (ignored with probability 1-q)
- Use a sequence of pseudorandom numbers (U1, U2, U3, ...), Ui ∈ [0,1]
- Element ei is included if Ui ≤ q
- Example (q = 0.2): for stream e1, ..., e7 with U1=0.5, U2=0.1, U3=0.9, U4=0.8, U5=0.2, U6=0.3, U7=0.0, only U2, U5, and U7 are ≤ q, so the sample is {e2, e5, e7}

Bernoulli Sampling - Efficient Implementation
- Calculate the number of elements to be skipped after an insertion (Δi): Pr{Δi = j} = q(1-q)^j
  - If you skip zero elements, you must get Ui ≤ q (pr: q)
  - Skip one element: Ui > q, Ui+1 ≤ q (pr: (1-q)q)
  - Skip two elements: Ui > q, Ui+1 > q, Ui+2 ≤ q (pr: (1-q)^2 q)
- Δi has a geometric distribution

Geometric Distribution - Example
- Geometric distribution, q = 0.2
[Chart: Geometric(q=0.2) distribution; x-axis: number of skips (Δi, 0-20), y-axis: probability]

Bernoulli Sampling - Algorithm
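
The slide's algorithm figure is not reproduced in the transcript. Below is a minimal Python sketch of the geometric-skip sampler just described; drawing the skip as ⌊ln U / ln(1-q)⌋ is a standard way to sample Pr{Δ = j} = q(1-q)^j, and the function name is ours.

```python
import math
import random

def bernoulli_sample(stream, q):
    """Bernoulli sampling with geometric skips (sketch); 0 < q < 1.

    Rather than flipping a coin per element, draw the number of
    elements to skip before the next insertion directly:
    Pr{skip = j} = q * (1 - q)**j.
    """
    stream = list(stream)   # a real sampler would consume elements lazily
    sample = []
    i = -1
    while True:
        u = 1.0 - random.random()                      # u in (0, 1]
        i += 1 + int(math.log(u) / math.log(1.0 - q))  # jump past the skips
        if i >= len(stream):
            break
        sample.append(stream[i])
    return sample
```

For example, bernoulli_sample(range(100), 0.2) returns about 20 elements on average, matching the expected sample size nq.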

Bernoulli Sampling
- Straightforward, produces an SRS, easy to implement
- But... the sample size is not fixed!
- Look at algorithms with deterministic sample size:
  - Reservoir Sampling
  - Stratified Sampling
  - Biased Sampling Schemes

Reservoir Sampling
- Produces an SRS of size k from a window of length n (k is specified)
- Initialize a “reservoir” using the first k elements
- For every following element, insert with probability pi (ignore with probability 1-pi)
- pi = k/i for i > k (pi = 1 for i ≤ k); pi changes as i increases
- Remove one element from the reservoir before each insertion
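
A minimal Python sketch of the reservoir algorithm just described (the function name is ours):

```python
import random

def reservoir_sample(stream, k):
    """Reservoir sampling sketch: an SRS of size k from a stream.

    The first k elements fill the reservoir (p_i = 1); element i > k
    is inserted with probability p_i = k/i, replacing a victim chosen
    uniformly at random from the reservoir.
    """
    reservoir = []
    for i, e in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(e)
        elif random.random() < k / i:
            reservoir[random.randrange(k)] = e
    return reservoir
```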

Reservoir Sampling
- Sample size 3 (k=3); recall: pi = 1 for i ≤ k, pi = k/i for i > k
- e1, e2, e3 fill the reservoir (p1 = p2 = p3 = 1), giving {e1, e2, e3}
- p4 = 3/4, U4 = 0.5: insert e4; p5 = 3/5, U5 = 0.1: insert e5
- p6 = 3/6, U6 = 0.9: skip e6; p7 = 3/7, U7 = 0.8: skip e7
- p8 = 3/8, U8 = 0.2: insert e8; final reservoir sample: {e4, e5, e8}

Reservoir Sampling - SRS
- Why set pi = k/i? Want Sj to be an SRS from Uj = {e1, e2, ..., ej} (Sj is the sample from Uj)
- Recall SRS means every sample of size k is equally likely
- Intuition: the probability that ei is included in an SRS from Ui is k/i (k is the sample size, i is the “window” size):
  $k/i = \frac{\#\text{samples containing } e_i}{\#\text{samples of size } k} = \binom{i-1}{k-1} \Big/ \binom{i}{k}$

Reservoir Sampling - Observations
- Insertion probability (pi = k/i for i > k) decreases as i increases
- Also, opportunities for an element in the sample to be removed from the sample decrease as i increases
- These trends offset each other: the probability of being in the final sample is the same for all elements in the window

Other Sampling Schemes
- Stratified Sampling: divide the window into strata, SRS in each stratum
- Deterministic & semi-deterministic schemes, e.g., sample every 10th element
- Biased sampling schemes: bias the sample towards recently-received elements
  - Biased Reservoir Sampling
  - Biased Sampling by Halving

Stratified Sampling
- When elements close to each other in the window have similar values, algorithms such as reservoir sampling can have bad luck
- Alternative: divide the window into strata and do an SRS in each stratum
- If you know there is a correlation between data values (e.g., timestamp) and position in the stream, you may wish to use stratified sampling
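
A sketch of stratified sampling in Python, assuming a materialized window whose length divides evenly into the requested number of strata (both assumptions, and the names, are ours):

```python
import random

def stratified_sample(window, num_strata, k):
    """Stratified sampling sketch: an SRS of size k from each stratum."""
    m = len(window) // num_strata            # stratum size
    sample = []
    for s in range(num_strata):
        stratum = window[s * m:(s + 1) * m]  # s-th disjoint segment
        sample.extend(random.sample(stratum, k))
    return sample
```

In a true streaming setting, one would run a reservoir sampler per stratum instead of calling random.sample on a stored segment.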

Deterministic & Semi-deterministic Schemes
- Produce a sample of size k by inserting every (n/k)-th element into the sample
- Simple, but not random: can't make statistical conclusions about the window from the sample
- Bad if the data is periodic; can be good if the data exhibits a trend
- Ensures sampled elements are spread throughout the window
- Example: n=18, k=6: take every 3rd element of e1, ..., e18

Biased Reservoir Sampling
- Recall: in reservoir sampling, the probability of inclusion decreased as we got further into the window (pi = k/i)
- What if pi were constant? (pi = p)
- Alternative: pi decreases more slowly than k/i
- Either way favors recently-arrived elements: they are more likely to be in the sample than long-ago-arrived elements

Biased Reservoir Sampling
- For reservoir sampling, the probability that ei is included in sample S is:
  $\Pr\{e_i \in S\} = p_i \prod_{j=\max(i,k)+1}^{n} \frac{k - p_j}{k}$
- If pi is fixed, that is, pi = p ∈ (0,1):
  $\Pr\{e_i \in S\} = p \left(\frac{k-p}{k}\right)^{n - \max(i,k)}$
- The probability that ei is in the final sample increases geometrically as i increases
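
A sketch of the fixed-p variant in Python; filling the reservoir with the first k elements (pi = 1) before switching to the constant insertion probability p is our choice, and the slide's formula describes elements arriving after that fill phase.

```python
import random

def biased_reservoir_sample(stream, k, p):
    """Biased reservoir sampling sketch with fixed insertion probability p.

    Later arrivals face fewer eviction opportunities, so the inclusion
    probability grows geometrically with the element index i.
    """
    reservoir = []
    for i, e in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(e)              # fill phase (p_i = 1)
        elif random.random() < p:            # constant p_i = p afterwards
            reservoir[random.randrange(k)] = e
    return reservoir
```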

Biased Reservoir Sampling
- Probability that ei is included in the final sample, for p=0.2, k=10, n=40:
  $0.2 \left(\frac{10-0.2}{10}\right)^{40-\max(i,10)}$
[Chart: inclusion probability vs. element index i (0-40)]

Biased Sampling by Halving
- Break the window into strata (Λ1, Λ2, Λ3, Λ4, ...), each contributing k elements; maintain a sample of size 2k
- Step 1: S = union of unbiased SRS samples of size k from Λ1 and Λ2 (e.g., use reservoir sampling)
- Step 2: sub-sample S to produce a sample of size k, then insert an SRS of size k from Λ3 into S

Sampling from Sliding Windows
- Harder than sampling from a stationary window: elements must be removed from the sample as they expire from the window
- Difficult to maintain a sample of fixed size
- Window types:
  - Sequence-based windows: contain the n most recent elements (row-based windows)
  - Timestamp-based windows: contain all elements that arrived within the past t time units (time-based windows)
- Goal: unbiased sampling from within a window

Sequence-based Windows
- Wj is a window of length n, j ≥ 1: Wj = {ej, ej+1, ..., ej+n-1}
- Want an SRS Sj of k elements from Wj
- Tradeoff between the amount of memory required and the degree of dependence between the Sj's

Complete Resampling
- Example: window size = 5, sample size = 2
- Maintain the full window (Wj); each time the window changes, use reservoir sampling to create Sj from Wj
- Very expensive: memory and CPU are O(n) (n = window size)
- e.g., W1 = {e1, ..., e5} with S1 = {e2, e4}; W2 = {e2, ..., e6} with S2 = {e3, e5}

Passive Algorithm
- Example: window size = 5, sample size = 2
- When an element in the sample expires, insert the newly-arrived element into the sample
- Sj is an SRS from Wj, but the Sj's are highly correlated (if S1 is a bad sample, S2 will be also...)
- Memory is O(k), k = sample size
- e.g., S1 = {e2, e4} from W1 = {e1, ..., e5}; S2 = {e2, e4} (nothing sampled expired); when e2 expires, S3 = {e7, e4}
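
A sketch of the passive algorithm for a sequence-based window (names and the index bookkeeping are ours; the window is stored here only so the final sample can be returned):

```python
import random

def passive_sample(stream, n, k):
    """Passive-algorithm sketch: window of the n most recent elements,
    sample of size k; an expiring sampled element is replaced by the
    newly arrived element."""
    window = []                  # [(absolute index, element), ...]
    sampled = set()              # absolute indices currently in the sample
    for i, e in enumerate(stream):
        window.append((i, e))
        if len(window) > n:
            expired, _ = window.pop(0)
            if expired in sampled:           # swap expired for the newest
                sampled.remove(expired)
                sampled.add(i)
        if i == n - 1:                       # first full window: initial SRS
            sampled = set(random.sample(range(n), k))
    return [e for j, e in window if j in sampled]
```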

Chain Sampling (Babcock, et al.)
- Improved independence properties compared to the passive algorithm
- Expected memory usage: O(k)
- The basic algorithm maintains a sample of size 1; get a sample of size k by running k chain-samplers

Chain Sampling - Issue
- Behaves as a reservoir sampler for the first n elements; inserts additional elements into the sample with probability 1/n
- Example (sample size 1, n = 3): p1 = 1, p2 = 1/2, p3 = 1/3, p4 = 1/3, ...; the sample becomes e1, then e2
- When the sampled element expires from the window, what do we do?

Chain Sampling - Solution
- When ei is selected for inclusion in the sample, select K from {i+1, i+2, ..., i+n}; eK will replace ei if ei expires while part of sample S
- We know eK will be in the window when ei expires
- Example: when e2 is selected, choose K ∈ {3, 4, 5}, say K = 5; when e5 arrives, store it and choose its successor K ∈ {6, 7, 8}, say K = 7; when e2 expires, e5 becomes the sample (and e7 after it)
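
A single-sampler sketch of chain sampling as described on these slides (run k copies for a size-k sample). The chain is kept as a list of already-arrived replacements plus the one index we are still waiting for; all names are ours.

```python
import random

def chain_sample(stream, n):
    """Chain sampling sketch (Babcock et al.): sample of size 1 over a
    window of the n most recent elements."""
    chain = []       # already-arrived chain members: [(index, element), ...]
    awaiting = None  # index of the chosen replacement yet to arrive
    for i, e in enumerate(stream, start=1):
        if i == awaiting:                     # replacement arrives: extend chain
            chain.append((i, e))
            awaiting = random.randint(i + 1, i + n)
        if chain and chain[0][0] <= i - n:    # the sample expired, so
            chain.pop(0)                      # its replacement takes over
        if random.random() < 1.0 / min(i, n): # reservoir-style for i <= n,
            chain = [(i, e)]                  # probability 1/n afterwards
            awaiting = random.randint(i + 1, i + n)
    return chain[0][1] if chain else None     # sample for the final window
```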

Chain Sampling - Summary
- Expected memory consumption: O(k)
- Chain sampling produces an SRS with replacement for each sliding window
- If we use k chain-samplers to get a sample of size k, we may get duplicates in that sample
- Can oversample (use sample size k + α), then sub-sample to get a sample of size k

Stratified Sampling
- Divide the window into strata and do an SRS in each stratum

Stratified Sampling - Sliding Window
- Window size = 12 (n), stratum size = 4 (m), stratum sample size = 2 (k)
- e.g., for W1 = {e1, ..., e12}: ss1 = {e1, e2}, ss2 = {e6, e7}, ss3 = {e9, e11}; after e16 arrives, ss4 = {e14, e16}
- Wj overlaps between 3 and 4 strata (l and l+1 strata), where l = win_size/stratum_size = n/m (= 3 here)
- Paper says the sample size is between k(l-1) and k·l; think it should be between k(l-1) and k(l+1)

Timestamp-Based Windows
- The number of elements in the window changes over time, and multiple elements in the sample can expire at once
- Chain sampling relies on the insertion probability 1/n (n is the window size), which varies here
- Stratified sampling wouldn't be able to bound the sample size

Priority Sampling (Babcock, et al.)
- A priority sampler maintains an SRS of size 1; use k priority samplers to get an SRS of size k
- Assign a random, uniformly-distributed priority in (0,1) to each element
- The current sample is the element in the window with the highest priority
- Keep elements for which there is no other element with both higher priority and higher (later) timestamp
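
A sketch of one priority sampler over a sequence-based window. The "no later, higher-priority element" rule means the stored elements form a deque with strictly decreasing priorities whose front is the current sample; the generator form is our choice.

```python
import random
from collections import deque

def priority_samples(stream, n):
    """Priority sampling sketch (Babcock et al.): a size-1 sample from
    each window of the n most recent elements; run k copies for size k."""
    kept = deque()                            # (index, element, priority)
    for i, e in enumerate(stream, start=1):
        p = random.random()
        while kept and kept[-1][2] < p:       # dominated: later and higher
            kept.pop()
        kept.append((i, e, p))
        while kept[0][0] <= i - n:            # expired from the window
            kept.popleft()
        yield kept[0][1]                      # sample for the current window
```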

Priority Sampling - Example
- Keep elements for which there is no element with both higher priority and later timestamp
[Diagram: stream e1-e15 with uniform priorities (.1, .8, .3, .4, .7, .1, .3, .5, .2, .6, .4, .1, .5, .3, ...) across windows W1-W3; legend: element in sample / element stored in memory / element in window, not stored]

Inference From a Sample
- What do we do with these samples?
- SRS samples can be used to estimate “population sums”
- If each element ei is a sales transaction and v(ei) is the dollar value of the transaction, then $\sum_{e_i \in W} v(e_i)$ = total sales of the transactions in W
- Count: if h(ei) = 1 when v(ei) > $1000 (and 0 otherwise), then $\sum_{e_i \in W} h(e_i)$ = number of transactions in the window over $1000
- Can also do averages

SRS Sampling
- To estimate a population sum from an SRS of size k, use the expansion estimator:
  $\hat{\Theta} = \frac{n}{k} \sum_{e_i \in S} h(e_i)$
- To estimate an average, use the sample average:
  $\hat{\alpha} = \hat{\Theta}/n = \frac{1}{k} \sum_{e_i \in S} h(e_i)$
- Also works for stratified sampling
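
In code, the expansion estimator is a one-liner (h, n, and the SRS are supplied by the caller):

```python
def expansion_estimate(sample, n, h):
    """Expansion estimator sketch: (n/k) * sum of h over an SRS of size k
    estimates the population sum over a window of n elements."""
    return (n / len(sample)) * sum(h(e) for e in sample)

# e.g., estimated count of transactions over $1000 in a window of n elements:
#   expansion_estimate(srs, n, h=lambda v: 1 if v > 1000 else 0)
# Dividing the result by n gives the sample-average estimate.
```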

Estimating Different Results
- SRS sampling is good for estimating population sums and statistics
- But use different algorithms for different results:
  - Heavy Hitters algorithm: find elements (values) that occur commonly in the stream
  - Min-Hash computation: set resemblance

Heavy Hitters
- Goal: find all stream elements that occur in at least a fraction s of all transactions
- For example, find sourceIPs that occur in at least 1% of network flows (sourceIPs from which we are getting a lot of traffic)

Heavy Hitters
- Divide the stream into buckets of width w; the current bucket id is bcurrent = ⌈N/w⌉, where N is the current stream length
- Data structure D: entries (e, f, Δ)
  - e: element; f: estimated frequency; Δ: maximum possible error in f
- If we are looking for common sourceIPs in a network stream, D holds (sourceIP, f, Δ) entries

Heavy Hitters
- For each new element e:
  - If e exists in D, set f = f + 1
  - If not, add a new entry (e, 1, bcurrent - 1)
- At a bucket boundary (when bcurrent changes), delete all entries (e, f, Δ) with f + Δ ≤ bcurrent
  - If there is only one instance of an element in the bucket, its entry is deleted: we are deleting items that occur about once per bucket
- For threshold s, output items with f ≥ (s - ε)N, where w = 1/ε and N is the stream size
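
A sketch of the bucket-based algorithm above, which is essentially Manku and Motwani's Lossy Counting; the dictionary layout and the pruning rule follow the slide.

```python
import math

def heavy_hitters(stream, s, eps):
    """Heavy-hitters sketch: elements occurring in at least a fraction s
    of the stream, with error tolerance eps (bucket width w = 1/eps)."""
    w = math.ceil(1.0 / eps)
    D = {}                                   # e -> [f, delta]
    N = 0
    for e in stream:
        N += 1
        b_current = math.ceil(N / w)
        if e in D:
            D[e][0] += 1                     # f = f + 1
        else:
            D[e] = [1, b_current - 1]        # new entry (e, 1, b_current - 1)
        if N % w == 0:                       # bucket boundary: prune
            for x in [x for x, (f, d) in D.items() if f + d <= b_current]:
                del D[x]
    return [e for e, (f, d) in D.items() if f >= (s - eps) * N]
```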

Min-Hash
- Resemblance ρ of two sets A, B: ρ(A,B) = |A ∩ B| / |A ∪ B|
- A min-hash signature is a representation of a set from which one can estimate the resemblance of two sets
- Let h1, h2, ..., hn be hash functions; si(A) = min{hi(a) | a ∈ A} (the minimum hash value of hi over A)
- Signature of A: S(A) = (s1(A), s2(A), ..., sn(A))

Min-Hash
- Recall: ρ(A,B) = |A ∩ B| / |A ∪ B|; h1, ..., hn are hash functions; si(A) = min{hi(a) | a ∈ A}; S(A) = (s1(A), ..., sn(A))
- Resemblance estimator (count how often the min hash values agree):
  $\hat{\rho}(A,B) = \frac{1}{n} \sum_{i=1}^{n} I(s_i(A), s_i(B)), \quad I(x,y) = 1 \text{ if } x = y, \text{ else } 0$
- Can substitute the N minimum values of one hash function for the minimum values of N hash functions
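
A sketch of min-hash signatures and the resemblance estimator. The n hash functions are simulated by salting Python's built-in hash with per-function seeds, an illustrative choice rather than anything from the slides.

```python
def minhash_signature(values, seeds):
    """Min-hash signature sketch: the minimum hash value per function."""
    return [min(hash((seed, v)) for v in values) for seed in seeds]

def resemblance(sig_a, sig_b):
    """Estimate rho(A,B) = |A ∩ B| / |A ∪ B| as the fraction of
    signature coordinates on which the two sets agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

# Usage:
#   import random
#   seeds = [random.randrange(2**32) for _ in range(128)]
#   rho = resemblance(minhash_signature(A, seeds), minhash_signature(B, seeds))
```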