CS 410/510 Data Streams Lecture 16: Data-Stream Sampling: Basic Techniques and Results
DESCRIPTION
CS 410/510 Data Streams Lecture 16: Data-Stream Sampling: Basic Techniques and Results. Kristin Tufte, David Maier. Data Stream Sampling: sampling provides a synopsis of a data stream; the sample can serve as input for answering queries and for “statistical inference about the contents of the stream.”

TRANSCRIPT

CS 410/510 Data Streams
Lecture 16: Data-Stream Sampling: Basic Techniques and Results
Kristin Tufte, David Maier

Data Stream Sampling
- Sampling provides a synopsis of a data stream
- The sample can serve as input for:
  - Answering queries
  - “Statistical inference about the contents of the stream”
  - A “variety of analytical procedures”
- Focus on: obtaining a sample from the window (sample size « window size)

Windows
- Stationary window: endpoints of the window are fixed (think relation)
- Sliding window: endpoints of the window move
  - What we've been talking about
  - More complex than a stationary window because elements must be removed from the sample when they expire from the window

Simple Random Sampling (SRS)
- What is a “representative” sample?
- SRS for a sample of k elements from a window with n elements:
  - Every possible sample (of size k) is equally likely, that is, has probability $1/\binom{n}{k}$
  - Every element is equally likely to be in the sample
- Stratified Sampling
  - Divide the window into disjoint segments (strata)
  - SRS over each stratum
  - Advantageous when stream elements close together in the stream have similar values

Bernoulli Sampling
- Includes each element in the sample with probability q
- The sample size is not fixed; it is binomially distributed
- Probability that the sample contains k elements: $\binom{n}{k} q^k (1-q)^{n-k}$
- Expected sample size is nq

Binomial Distribution - Example
- Binomial Distribution (n=20, q=0.5); expected sample size = 20 * 0.5 = 10
[Chart: Binomial(n=20, q=0.5) distribution; x-axis: sample size (0-20), y-axis: probability]

Binomial Distribution - Example
- Binomial Distribution (n=20, q=1/3); expected sample size = 20 * 1/3 ≈ 6.667
[Chart: Binomial(n=20, q=1/3) distribution; x-axis: sample size (0-20), y-axis: probability]

Bernoulli Sampling - Implementation
- Naïve: elements inserted with probability q (ignored with probability 1-q)
- Use a sequence of pseudorandom numbers (U1, U2, U3, ...), Ui ∈ [0,1]
- Element ei is included if Ui ≤ q
- Example (q = 0.2): for stream e1, ..., e7 with U1=0.5, U2=0.1, U3=0.9, U4=0.8, U5=0.2, U6=0.3, U7=0.0, only U2, U5, and U7 are ≤ q, so the sample is {e2, e5, e7}

Bernoulli Sampling - Efficient Implementation
- Calculate the number of elements to be skipped after an insertion (Δi): Pr{Δi = j} = q(1-q)^j
  - If you skip zero elements, you must get Ui ≤ q (pr: q)
  - Skip one element: Ui > q, Ui+1 ≤ q (pr: (1-q)q)
  - Skip two elements: Ui > q, Ui+1 > q, Ui+2 ≤ q (pr: (1-q)^2 q)
- Δi has a geometric distribution

Geometric Distribution - Example
- Geometric distribution, q = 0.2
[Chart: Geometric(q=0.2) distribution; x-axis: number of skips (Δi, 0-20), y-axis: probability]

Bernoulli Sampling - Algorithm
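
The slide's algorithm figure is not reproduced in the transcript. Below is a minimal Python sketch of the geometric-skip sampler just described; drawing the skip as ⌊ln U / ln(1-q)⌋ is a standard way to sample Pr{Δ = j} = q(1-q)^j, and the function name is ours.

```python
import math
import random

def bernoulli_sample(stream, q):
    """Bernoulli sampling with geometric skips (sketch); 0 < q < 1.

    Rather than flipping a coin per element, draw the number of
    elements to skip before the next insertion directly:
    Pr{skip = j} = q * (1 - q)**j.
    """
    stream = list(stream)   # a real sampler would consume elements lazily
    sample = []
    i = -1
    while True:
        u = 1.0 - random.random()                      # u in (0, 1]
        i += 1 + int(math.log(u) / math.log(1.0 - q))  # jump past the skips
        if i >= len(stream):
            break
        sample.append(stream[i])
    return sample
```

For example, bernoulli_sample(range(100), 0.2) returns about 20 elements on average, matching the expected sample size nq.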

Bernoulli Sampling
- Straightforward, produces an SRS, easy to implement
- But... the sample size is not fixed!
- Look at algorithms with deterministic sample size:
  - Reservoir Sampling
  - Stratified Sampling
  - Biased Sampling Schemes

Reservoir Sampling
- Produces an SRS of size k from a window of length n (k is specified)
- Initialize a “reservoir” using the first k elements
- For every following element, insert with probability pi (ignore with probability 1-pi)
- pi = k/i for i > k (pi = 1 for i ≤ k); pi changes as i increases
- Remove one element from the reservoir before each insertion
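
A minimal Python sketch of the reservoir algorithm just described (the function name is ours):

```python
import random

def reservoir_sample(stream, k):
    """Reservoir sampling sketch: an SRS of size k from a stream.

    The first k elements fill the reservoir (p_i = 1); element i > k
    is inserted with probability p_i = k/i, replacing a victim chosen
    uniformly at random from the reservoir.
    """
    reservoir = []
    for i, e in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(e)
        elif random.random() < k / i:
            reservoir[random.randrange(k)] = e
    return reservoir
```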

Reservoir Sampling
- Sample size 3 (k=3); recall: pi = 1 for i ≤ k, pi = k/i for i > k
- e1, e2, e3 fill the reservoir (p1 = p2 = p3 = 1), giving {e1, e2, e3}
- p4 = 3/4, U4 = 0.5: insert e4; p5 = 3/5, U5 = 0.1: insert e5
- p6 = 3/6, U6 = 0.9: skip e6; p7 = 3/7, U7 = 0.8: skip e7
- p8 = 3/8, U8 = 0.2: insert e8; final reservoir sample: {e4, e5, e8}

Reservoir Sampling - SRS
- Why set pi = k/i? Want Sj to be an SRS from Uj = {e1, e2, ..., ej} (Sj is the sample from Uj)
- Recall SRS means every sample of size k is equally likely
- Intuition: the probability that ei is included in an SRS from Ui is k/i (k is the sample size, i is the “window” size):
  $k/i = \frac{\#\text{samples containing } e_i}{\#\text{samples of size } k} = \binom{i-1}{k-1} \Big/ \binom{i}{k}$

Reservoir Sampling - Observations
- Insertion probability (pi = k/i for i > k) decreases as i increases
- Also, opportunities for an element in the sample to be removed from the sample decrease as i increases
- These trends offset each other: the probability of being in the final sample is the same for all elements in the window

Other Sampling Schemes
- Stratified Sampling: divide the window into strata, SRS in each stratum
- Deterministic & semi-deterministic schemes, e.g., sample every 10th element
- Biased sampling schemes: bias the sample towards recently-received elements
  - Biased Reservoir Sampling
  - Biased Sampling by Halving

Stratified Sampling
- When elements close to each other in the window have similar values, algorithms such as reservoir sampling can have bad luck
- Alternative: divide the window into strata and do an SRS in each stratum
- If you know there is a correlation between data values (e.g., timestamp) and position in the stream, you may wish to use stratified sampling
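
A sketch of stratified sampling in Python, assuming a materialized window whose length divides evenly into the requested number of strata (both assumptions, and the names, are ours):

```python
import random

def stratified_sample(window, num_strata, k):
    """Stratified sampling sketch: an SRS of size k from each stratum."""
    m = len(window) // num_strata            # stratum size
    sample = []
    for s in range(num_strata):
        stratum = window[s * m:(s + 1) * m]  # s-th disjoint segment
        sample.extend(random.sample(stratum, k))
    return sample
```

In a true streaming setting, one would run a reservoir sampler per stratum instead of calling random.sample on a stored segment.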

Deterministic & Semi-deterministic Schemes
- Produce a sample of size k by inserting every (n/k)-th element into the sample
- Simple, but not random: can't make statistical conclusions about the window from the sample
- Bad if the data is periodic; can be good if the data exhibits a trend
- Ensures sampled elements are spread throughout the window
- Example: n=18, k=6: take every 3rd element of e1, ..., e18

Biased Reservoir Sampling
- Recall: in reservoir sampling, the probability of inclusion decreased as we got further into the window (pi = k/i)
- What if pi were constant? (pi = p)
- Alternative: pi decreases more slowly than k/i
- Either way favors recently-arrived elements: they are more likely to be in the sample than long-ago-arrived elements

Biased Reservoir Sampling
- For reservoir sampling, the probability that ei is included in sample S is:
  $\Pr\{e_i \in S\} = p_i \prod_{j=\max(i,k)+1}^{n} \frac{k - p_j}{k}$
- If pi is fixed, that is, pi = p ∈ (0,1):
  $\Pr\{e_i \in S\} = p \left(\frac{k-p}{k}\right)^{n - \max(i,k)}$
- The probability that ei is in the final sample increases geometrically as i increases
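
A sketch of the fixed-p variant in Python; filling the reservoir with the first k elements (pi = 1) before switching to the constant insertion probability p is our choice, and the slide's formula describes elements arriving after that fill phase.

```python
import random

def biased_reservoir_sample(stream, k, p):
    """Biased reservoir sampling sketch with fixed insertion probability p.

    Later arrivals face fewer eviction opportunities, so the inclusion
    probability grows geometrically with the element index i.
    """
    reservoir = []
    for i, e in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(e)              # fill phase (p_i = 1)
        elif random.random() < p:            # constant p_i = p afterwards
            reservoir[random.randrange(k)] = e
    return reservoir
```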

Biased Reservoir Sampling
- Probability that ei is included in the final sample, for p=0.2, k=10, n=40:
  $0.2 \left(\frac{10-0.2}{10}\right)^{40-\max(i,10)}$
[Chart: inclusion probability vs. element index i (0-40)]

Biased Sampling by Halving
- Break the window into strata (Λ1, Λ2, Λ3, Λ4, ...), each contributing k elements; maintain a sample of size 2k
- Step 1: S = union of unbiased SRS samples of size k from Λ1 and Λ2 (e.g., use reservoir sampling)
- Step 2: sub-sample S to produce a sample of size k, then insert an SRS of size k from Λ3 into S

Sampling from Sliding Windows
- Harder than sampling from a stationary window: elements must be removed from the sample as they expire from the window
- Difficult to maintain a sample of fixed size
- Window types:
  - Sequence-based windows: contain the n most recent elements (row-based windows)
  - Timestamp-based windows: contain all elements that arrived within the past t time units (time-based windows)
- Goal: unbiased sampling from within a window

Sequence-based Windows
- Wj is a window of length n, j ≥ 1: Wj = {ej, ej+1, ..., ej+n-1}
- Want an SRS Sj of k elements from Wj
- Tradeoff between the amount of memory required and the degree of dependence between the Sj's

Complete Resampling
- Example: window size = 5, sample size = 2
- Maintain the full window (Wj); each time the window changes, use reservoir sampling to create Sj from Wj
- Very expensive: memory and CPU are O(n) (n = window size)
- e.g., W1 = {e1, ..., e5} with S1 = {e2, e4}; W2 = {e2, ..., e6} with S2 = {e3, e5}

Passive Algorithm
- Example: window size = 5, sample size = 2
- When an element in the sample expires, insert the newly-arrived element into the sample
- Sj is an SRS from Wj, but the Sj's are highly correlated (if S1 is a bad sample, S2 will be also...)
- Memory is O(k), k = sample size
- e.g., S1 = {e2, e4} from W1 = {e1, ..., e5}; S2 = {e2, e4} (nothing sampled expired); when e2 expires, S3 = {e7, e4}
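
A sketch of the passive algorithm for a sequence-based window (names and the index bookkeeping are ours; the window is stored here only so the final sample can be returned):

```python
import random

def passive_sample(stream, n, k):
    """Passive-algorithm sketch: window of the n most recent elements,
    sample of size k; an expiring sampled element is replaced by the
    newly arrived element."""
    window = []                  # [(absolute index, element), ...]
    sampled = set()              # absolute indices currently in the sample
    for i, e in enumerate(stream):
        window.append((i, e))
        if len(window) > n:
            expired, _ = window.pop(0)
            if expired in sampled:           # swap expired for the newest
                sampled.remove(expired)
                sampled.add(i)
        if i == n - 1:                       # first full window: initial SRS
            sampled = set(random.sample(range(n), k))
    return [e for j, e in window if j in sampled]
```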

Chain Sampling (Babcock, et al.)
- Improved independence properties compared to the passive algorithm
- Expected memory usage: O(k)
- The basic algorithm maintains a sample of size 1; get a sample of size k by running k chain-samplers

Chain Sampling - Issue
- Behaves as a reservoir sampler for the first n elements; inserts additional elements into the sample with probability 1/n
- Example (sample size 1, n = 3): p1 = 1, p2 = 1/2, p3 = 1/3, p4 = 1/3, ...; the sample becomes e1, then e2
- When the sampled element expires from the window, what do we do?

Chain Sampling - Solution
- When ei is selected for inclusion in the sample, select K from {i+1, i+2, ..., i+n}; eK will replace ei if ei expires while part of sample S
- We know eK will be in the window when ei expires
- Example: when e2 is selected, choose K ∈ {3, 4, 5}, say K = 5; when e5 arrives, store it and choose its successor K ∈ {6, 7, 8}, say K = 7; when e2 expires, e5 becomes the sample (and e7 after it)
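
A single-sampler sketch of chain sampling as described on these slides (run k copies for a size-k sample). The chain is kept as a list of already-arrived replacements plus the one index we are still waiting for; all names are ours.

```python
import random

def chain_sample(stream, n):
    """Chain sampling sketch (Babcock et al.): sample of size 1 over a
    window of the n most recent elements."""
    chain = []       # already-arrived chain members: [(index, element), ...]
    awaiting = None  # index of the chosen replacement yet to arrive
    for i, e in enumerate(stream, start=1):
        if i == awaiting:                     # replacement arrives: extend chain
            chain.append((i, e))
            awaiting = random.randint(i + 1, i + n)
        if chain and chain[0][0] <= i - n:    # the sample expired, so
            chain.pop(0)                      # its replacement takes over
        if random.random() < 1.0 / min(i, n): # reservoir-style for i <= n,
            chain = [(i, e)]                  # probability 1/n afterwards
            awaiting = random.randint(i + 1, i + n)
    return chain[0][1] if chain else None     # sample for the final window
```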

Chain Sampling - Summary
- Expected memory consumption: O(k)
- Chain sampling produces an SRS with replacement for each sliding window
- If we use k chain-samplers to get a sample of size k, we may get duplicates in that sample
- Can oversample (use sample size k + α), then sub-sample to get a sample of size k

Stratified Sampling
- Divide the window into strata and do an SRS in each stratum

Stratified Sampling - Sliding Window
- Window size = 12 (n), stratum size = 4 (m), stratum sample size = 2 (k)
- e.g., for W1 = {e1, ..., e12}: ss1 = {e1, e2}, ss2 = {e6, e7}, ss3 = {e9, e11}; after e16 arrives, ss4 = {e14, e16}
- Wj overlaps between 3 and 4 strata (l and l+1 strata), where l = win_size/stratum_size = n/m (= 3 here)
- Paper says the sample size is between k(l-1) and k·l; think it should be between k(l-1) and k(l+1)

Timestamp-Based Windows
- The number of elements in the window changes over time, and multiple elements in the sample can expire at once
- Chain sampling relies on the insertion probability 1/n (n is the window size), which varies here
- Stratified sampling wouldn't be able to bound the sample size

Priority Sampling (Babcock, et al.)
- A priority sampler maintains an SRS of size 1; use k priority samplers to get an SRS of size k
- Assign a random, uniformly-distributed priority in (0,1) to each element
- The current sample is the element in the window with the highest priority
- Keep elements for which there is no other element with both higher priority and higher (later) timestamp
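
A sketch of one priority sampler over a sequence-based window. The "no later, higher-priority element" rule means the stored elements form a deque with strictly decreasing priorities whose front is the current sample; the generator form is our choice.

```python
import random
from collections import deque

def priority_samples(stream, n):
    """Priority sampling sketch (Babcock et al.): a size-1 sample from
    each window of the n most recent elements; run k copies for size k."""
    kept = deque()                            # (index, element, priority)
    for i, e in enumerate(stream, start=1):
        p = random.random()
        while kept and kept[-1][2] < p:       # dominated: later and higher
            kept.pop()
        kept.append((i, e, p))
        while kept[0][0] <= i - n:            # expired from the window
            kept.popleft()
        yield kept[0][1]                      # sample for the current window
```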

Priority Sampling - Example
- Keep elements for which there is no element with both higher priority and later timestamp
[Diagram: stream e1-e15 with uniform priorities (.1, .8, .3, .4, .7, .1, .3, .5, .2, .6, .4, .1, .5, .3, ...) across windows W1-W3; legend: element in sample / element stored in memory / element in window, not stored]

Inference From a Sample
- What do we do with these samples?
- SRS samples can be used to estimate “population sums”
- If each element ei is a sales transaction and v(ei) is the dollar value of the transaction, then $\sum_{e_i \in W} v(e_i)$ = total sales of the transactions in W
- Count: if h(ei) = 1 when v(ei) > $1000 (and 0 otherwise), then $\sum_{e_i \in W} h(e_i)$ = number of transactions in the window over $1000
- Can also do averages

SRS Sampling
- To estimate a population sum from an SRS of size k, use the expansion estimator:
  $\hat{\Theta} = \frac{n}{k} \sum_{e_i \in S} h(e_i)$
- To estimate an average, use the sample average:
  $\hat{\alpha} = \hat{\Theta}/n = \frac{1}{k} \sum_{e_i \in S} h(e_i)$
- Also works for stratified sampling
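
In code, the expansion estimator is a one-liner (h, n, and the SRS are supplied by the caller):

```python
def expansion_estimate(sample, n, h):
    """Expansion estimator sketch: (n/k) * sum of h over an SRS of size k
    estimates the population sum over a window of n elements."""
    return (n / len(sample)) * sum(h(e) for e in sample)

# e.g., estimated count of transactions over $1000 in a window of n elements:
#   expansion_estimate(srs, n, h=lambda v: 1 if v > 1000 else 0)
# Dividing the result by n gives the sample-average estimate.
```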

Estimating Different Results
- SRS sampling is good for estimating population sums and statistics
- But use different algorithms for different results:
  - Heavy Hitters algorithm: find elements (values) that occur commonly in the stream
  - Min-Hash computation: set resemblance

Heavy Hitters
- Goal: find all stream elements that occur in at least a fraction s of all transactions
- For example, find sourceIPs that occur in at least 1% of network flows (sourceIPs from which we are getting a lot of traffic)

Heavy Hitters
- Divide the stream into buckets of width w; the current bucket id is bcurrent = ⌈N/w⌉, where N is the current stream length
- Data structure D: entries (e, f, Δ)
  - e: element; f: estimated frequency; Δ: maximum possible error in f
- If we are looking for common sourceIPs in a network stream, D holds (sourceIP, f, Δ) entries

Heavy Hitters
- For each new element e:
  - If e exists in D, set f = f + 1
  - If not, add a new entry (e, 1, bcurrent - 1)
- At a bucket boundary (when bcurrent changes), delete all entries (e, f, Δ) with f + Δ ≤ bcurrent
  - If there is only one instance of an element in the bucket, its entry is deleted: we are deleting items that occur about once per bucket
- For threshold s, output items with f ≥ (s - ε)N, where w = 1/ε and N is the stream size
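
A sketch of the bucket-based algorithm above, which is essentially Manku and Motwani's Lossy Counting; the dictionary layout and the pruning rule follow the slide.

```python
import math

def heavy_hitters(stream, s, eps):
    """Heavy-hitters sketch: elements occurring in at least a fraction s
    of the stream, with error tolerance eps (bucket width w = 1/eps)."""
    w = math.ceil(1.0 / eps)
    D = {}                                   # e -> [f, delta]
    N = 0
    for e in stream:
        N += 1
        b_current = math.ceil(N / w)
        if e in D:
            D[e][0] += 1                     # f = f + 1
        else:
            D[e] = [1, b_current - 1]        # new entry (e, 1, b_current - 1)
        if N % w == 0:                       # bucket boundary: prune
            for x in [x for x, (f, d) in D.items() if f + d <= b_current]:
                del D[x]
    return [e for e, (f, d) in D.items() if f >= (s - eps) * N]
```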

Min-Hash
- Resemblance ρ of two sets A, B: ρ(A,B) = |A ∩ B| / |A ∪ B|
- A min-hash signature is a representation of a set from which one can estimate the resemblance of two sets
- Let h1, h2, ..., hn be hash functions; si(A) = min{hi(a) | a ∈ A} (the minimum hash value of hi over A)
- Signature of A: S(A) = (s1(A), s2(A), ..., sn(A))

Min-Hash
- Recall: ρ(A,B) = |A ∩ B| / |A ∪ B|; h1, ..., hn are hash functions; si(A) = min{hi(a) | a ∈ A}; S(A) = (s1(A), ..., sn(A))
- Resemblance estimator (count how often the min hash values agree):
  $\hat{\rho}(A,B) = \frac{1}{n} \sum_{i=1}^{n} I(s_i(A), s_i(B)), \quad I(x,y) = 1 \text{ if } x = y, \text{ else } 0$
- Can substitute the N minimum values of one hash function for the minimum values of N hash functions
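
A sketch of min-hash signatures and the resemblance estimator. The n hash functions are simulated by salting Python's built-in hash with per-function seeds, an illustrative choice rather than anything from the slides.

```python
def minhash_signature(values, seeds):
    """Min-hash signature sketch: the minimum hash value per function."""
    return [min(hash((seed, v)) for v in values) for seed in seeds]

def resemblance(sig_a, sig_b):
    """Estimate rho(A,B) = |A ∩ B| / |A ∪ B| as the fraction of
    signature coordinates on which the two sets agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

# Usage:
#   import random
#   seeds = [random.randrange(2**32) for _ in range(128)]
#   rho = resemblance(minhash_signature(A, seeds), minhash_signature(B, seeds))
```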