
Data Stream Methods

Graham [email protected]

S. [email protected]

Plan of attack

• Frequent Items / Heavy Hitters
• Counting Distinct Elements
• Clustering items in Streams

Motivating Distinct Elements

Many network flows between (source, dest) pairs

Want a snapshot at time t of the flows

This defines a (massive) vector, and we ask:

• Summarise the current state

• How does state at time t compare with at t’?

• Which past situation does this most resemble, etc.?

Counting Distinct Values

Application 1: Maintaining the number of distinct values in a relation with inserts and deletes

Important to know the number of values for query optimization, approximate query answering, join size estimation, etc.

Fully dynamic case, with inserts and deletes: sampling from the relation itself has been shown to be inaccurate.

Computing the answer with a scan of the relation will be slow and will consume a lot of memory.

Application to Networks

Application 2: Many questions possible about network streams:

• How many packet flows between distinct pairs of (source, destination)?
• How many flows are losing packets (where packets in on one side do not equal packets out)?
• Denial of service attacks signalled by large numbers of requests (from spoofed IPs), so many distinct sources.

All these can be solved by computing distinct values or extensions thereof.

Exact Algorithm

• Keep an array, a[1..U], initially all zero
• Also keep a counter C
• Every time an item i arrives, look at a[i]
• If it is zero, increment C and set a[i]=1
• Return C as the number of distinct items
• Time: O(1) per update and per query
• But space is O(U)
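The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a tuned implementation; the class name `ExactDistinctCounter` and its method names are assumptions made for the example.

```python
class ExactDistinctCounter:
    """Exact distinct count over the universe 1..U using one flag per item."""

    def __init__(self, universe_size):
        # a[1..U], all zero; slot 0 is unused so items index the array directly.
        self.seen = bytearray(universe_size + 1)
        self.count = 0  # the counter C

    def update(self, item):
        # First arrival of this item: mark it and increment the counter.
        if not self.seen[item]:
            self.seen[item] = 1
            self.count += 1

    def query(self):
        return self.count


counter = ExactDistinctCounter(universe_size=10)
for item in [1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1]:
    counter.update(item)
print(counter.query())  # 4 (the distinct items are 1, 2, 3, 4)
```

Updates and queries are O(1), but the `seen` array has one slot per possible item, which is the O(U) space cost noted above.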

Lower bound

• Use the same trick as last time: take a bitstring B and encode it as a stream: i is in the stream if B[i] = 1, and i is not in the stream otherwise.
• Feed this stream to the algorithm
• To test whether any item is in the bitstring, keep a copy of the memory contents of the algorithm.
• Query the number of distinct items.
• Then send item i, and query again
• If the number of distinct items has increased, then bit i = 0, else bit i = 1
• Roll back the memory contents, and repeat...
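To make the rollback argument concrete, here is a small sketch that reads a bitstring back out of an exact counter. `SetCounter` and `recover_bitstring` are hypothetical names, and `copy.deepcopy` stands in for "keep a copy of the memory contents".

```python
import copy


class SetCounter:
    """Stand-in exact distinct counter (any exact algorithm would do)."""

    def __init__(self):
        self.items = set()

    def update(self, item):
        self.items.add(item)

    def query(self):
        return len(self.items)


def recover_bitstring(algorithm, universe_size):
    """Read B back out of the algorithm's state using rollback queries."""
    bits = []
    for i in range(1, universe_size + 1):
        probe = copy.deepcopy(algorithm)  # snapshot of the memory contents
        before = probe.query()
        probe.update(i)                   # send item i to the copy only
        # Count went up -> i was not in the stream -> B[i] = 0.
        bits.append(0 if probe.query() > before else 1)
    return bits


B = [1, 0, 0, 1, 1, 0, 1, 0]              # B[1..8], stored 0-indexed here
algorithm = SetCounter()
for i, bit in enumerate(B, start=1):
    if bit:
        algorithm.update(i)

print(recover_bitstring(algorithm, len(B)) == B)  # True
```

Because every bit of B can be recovered from the stored state alone, that state must be able to distinguish all 2^U bitstrings.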

Lower bounds contd.

This way we can extract the entire bitstring B, so the memory space must be at least U bits.

This holds even probabilistically: even if the procedure is allowed to be wrong with some probability, it still requires Ω(U) bits, by a reduction from another communication complexity problem, Index.

So can we make any progress on this problem?

Yes, if we approximate: find some approximation d of the true answer D so that (1-ε)D < d < (1+ε)D with probability 1-δ

If we can choose the parameters ε and δ, then this is a very powerful approximation scheme.

Probabilistic Counting

• The approach of Probabilistic Counting, due to Flajolet & Martin, 1982, is a powerful method of approximating the number of distinct elements.

• Detailed analysis was given in the paper of Alon, Matias & Szegedy, 1996

• Fairly simple to implement, has some nice properties.

Probabilistic Counting

• The basic idea:
• Keep an array a[1..log U], initially 0.
• Use a hash function f: 1..U → 0..log U
• Compute f(i) for every item in the stream, and set a[f(i)] to 1
• Somehow extract from this the approximate number of distinct items.

Probabilistic Counting

What kind of hash function to use?

We will use Universal hash functions from last time (remember, these can be represented in small space)

If we apply them directly, then how long before we have covered all cells in the array?

Coupon collector problem again... if the number of distinct items is more than (log U ln log U) then we expect all cells to be covered.

Instead, we’ll do something a bit different.

Probabilistic Counting

Suppose the probability of mapping item i to a[r] is $1/2^r$

Then ½ the items fall in the first cell, ¼ in the second, 1/8 in the third, and so on.

(each item falls in the same cell every time it is encountered, so it is as if only one of each distinct item arrives)

If there are D distinct items, then we might expect log D cells to be occupied...

[Figure: the array a, whose successive cells receive an item with probability ½, ¼, 1/8, 1/16, 1/32, 1/64, ...]

Probabilistic Counting

Let’s do this formally:

Let f be drawn from a family of strongly 2-Universal hash functions mapping onto 0..U-1

Let r(x) be the function that returns the number of trailing zeros in the binary representation of x

Hence $r(12) = r(1100_2) = 2$, $r(257) = 0$

For each item i in the data stream, set a[r(f(i))] = 1

Let R be the maximum j such that a[j]=1. Output $2^R$ as the approximate number of distinct values.
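A minimal Python sketch of this procedure follows. The hash family (a random linear map modulo the Mersenne prime 2^61 - 1) is an assumed stand-in for "strongly 2-universal", it maps onto 0..PRIME-1 rather than 0..U-1, and the function names are illustrative only.

```python
import random

PRIME = (1 << 61) - 1  # Mersenne prime; items are assumed to be smaller than this


def make_hash(rng):
    """Draw f(x) = (a*x + b) mod PRIME, pairwise independent over 0..PRIME-1."""
    a = rng.randrange(PRIME)
    b = rng.randrange(PRIME)
    return lambda x: (a * x + b) % PRIME


def trailing_zeros(x):
    """r(x): trailing zeros in binary, e.g. r(12) = 2 and r(257) = 0."""
    # x & -x isolates the lowest set bit; r(0) is undefined, so return 0 for it.
    return (x & -x).bit_length() - 1 if x else 0


def fm_estimate(stream, f, levels=64):
    a = [0] * levels
    for item in stream:
        a[trailing_zeros(f(item))] = 1      # set a[r(f(i))] = 1
    R = max(j for j in range(levels) if a[j]) if any(a) else 0
    return 2 ** R                            # output 2^R


stream = [1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1]
print(fm_estimate(stream, make_hash(random.Random(7))))
```

Repeated items hash to the same cell, so the output depends only on the set of distinct items. A single run is always a power of two and can be off by a constant factor, which is what the analysis that follows quantifies.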

Example

Suppose the stream is 1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1...

Let f(x) = 3x + 1 mod 5 (with the residue 0 written as 5, so that values lie in 1..5)

So the transformed stream (f applied to each item) is 4, 5, 2, 4, 2, 5, 3, 5, 4, 2, 5, 4

We compute r of each item in the stream: 2, 0, 1, 2, 1, 0, 0, 0, 2, 1, 0, 2

Hence: a[0] = 1, a[1] = 1, a[2] = 1, a[3]=0, a[4]=0

So R = 2. Output $2^2 = 4$.

(We got lucky this time, on a toy example. How will things work out in general? What can we prove about this approach?)

Probabilistic Counting

What is the expectation of the quantity R?

f() distributes uniformly over 0..U-1 so

$\Pr[r(f(i)) \ge j] = 2^{-j} = p_j$

Let $Z_j$ be the number of distinct items i in the stream for which $r(f(i)) \ge j$.

By linearity of expectation, we get

$E(Z_j) = D(1 \cdot p_j + 0 \cdot (1 - p_j)) = D p_j = D/2^j$.

What is the variance? (we will use E(XY)=E(X)E(Y) due to pairwise independence)

Variance

$\mathrm{Var}(X) = E(X^2) - E(X)^2$

$\mathrm{Var}(X+Y) = E(X^2 + Y^2 + 2XY) - E(X+Y)^2$

$= E(X^2) + E(Y^2) + 2E(XY) - (E(X) + E(Y))^2$

$= E(X^2) - E(X)^2 + E(Y^2) - E(Y)^2 + 2E(X)E(Y) - 2E(X)E(Y)$

$= \mathrm{Var}(X) + \mathrm{Var}(Y)$ (for pairwise independence)

Variance for a single value of i is $p_j(1^2) - (1 \cdot p_j)^2 = p_j(1 - p_j) = 2^{-j}(1 - 2^{-j})$

Variance for D of these is $D\,2^{-j}(1 - 2^{-j}) < D/2^j = E(Z_j)$

Probability Bounds

Markov Inequality:

For a random variable Y which takes only non-negative values,

$\Pr[Y \ge k] \le E(Y)/k$

(This will be < 1 only for k > E(Y))

Chebyshev’s Inequality:

For a random variable Y:

$\Pr[|Y - E(Y)| \ge k] \le \mathrm{Var}(Y)/k^2$

Proof: Set $X = (Y - E(Y))^2$

$E(X) = E(Y^2 + E(Y)^2 - 2YE(Y)) = E(Y^2) + E(Y)^2 - 2E(Y)^2 = \mathrm{Var}(Y)$

So: $\Pr[|Y - E(Y)| \ge k] = \Pr[(Y - E(Y))^2 \ge k^2]$. Using Markov:

$\le E((Y - E(Y))^2)/k^2 = \mathrm{Var}(Y)/k^2$

Applying Probability Bounds

What are the chances that the highest entry, R, is too big (so we will overestimate)?

Then some entry a[j] = 1 and $2^j > cD$ for some constant c > 1

Use the Markov inequality for the probability of this event occurring:

$\Pr[Z_j \ge 1] \le E(Z_j)/1 = D/2^j < 1/c$

Applying Probability Bounds

What if the answer R is too small, so we underestimate?

Then some entry a[j]=0 and $2^j < D/c$.

What is the probability that entry j is zero?

$\Pr[Z_j = 0] \le \Pr[|Z_j - E(Z_j)| \ge E(Z_j)] \le \mathrm{Var}(Z_j)/E(Z_j)^2$ (by Chebyshev)

$< E(Z_j)/E(Z_j)^2 = 1/E(Z_j) = 2^j/D < 1/c$

Putting the bounds together

The probability that the answer we get is between D/c and cD is at least 1 - 2/c

Fairly weak bounds, in practice it performs pretty well. Tighter analysis is possible.

Some heuristics:

– Use the average of several different runs (do each run in parallel) using different hash functions

– Use the location of the smallest zero, not the highest one (need to do some scaling of the result)

A Different Approach

Start again with Distinct Items

Represent current state as a vector, and represent the problem as a problem on that vector.

Initially, the vector is zero

Add one to entry i when i arrives in the stream

Subtract one from entry j when j departs in the stream.

This is the exact algorithm we originally came up with.

• Distinct items = number of non-zero entries
• (Frequent items = index of entries with value > n/k)

Vector Reduction

So, we have a vector a, which is being updated.

Formally, we want to approximate $|\{ i : a[i] \ne 0 \}|$

Suppose instead, we computed the $L_p$ norm of a (raised to the power p): $\sum_i |a[i]|^p$

What do we get?

p = 2: Sum of squares

p = 1: Sum of the (absolute) entries

p < 1? As p→0, if a[i]=0, we get 0, but if a[i] ≠ 0, we get 1.

Hamming Norm of a Stream

We call the number of non-zero entries in the vector the Hamming norm of the vector (explain why later).

For a vector generated by a stream, call this the Hamming norm of the stream (= number of distinct items, but more general).

Example, when the stream consists of integer updates:

(5,+3), (2,-1), (3,+2), (7,+9), (5,-2), (6,-1), (6,-3), (2,+1), (4,+2), (3,-2), (7,-5), (5,+2), (6,-2), (4,-3), (5,-1)

Index: 1  2  3  4  5  6  7  8
Value: 0  0  0 -1  2 -6  4  0

Hamming norm of the stream is 4 (4 non-zero entries)
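The example can be replayed directly. This short sketch applies each update (index, change) from the stream above to a dictionary and counts the non-zero entries; the variable names are illustrative.

```python
from collections import defaultdict

updates = [(5, +3), (2, -1), (3, +2), (7, +9), (5, -2), (6, -1), (6, -3),
           (2, +1), (4, +2), (3, -2), (7, -5), (5, +2), (6, -2), (4, -3), (5, -1)]

vector = defaultdict(int)
for index, delta in updates:
    vector[index] += delta          # "add delta to location index"

hamming_norm = sum(1 for value in vector.values() if value != 0)
print(dict(sorted(vector.items())))
print(hamming_norm)  # 4
```

Entries 2 and 3 return to zero after their updates cancel, so they do not count towards the norm even though both items appeared in the stream.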

Zeroing in on Hamming Norm

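We can approximate the Hamming norm by finding the $L_p$ norm to the power p for small enough p, provided we guarantee that the total in any entry is < B

Hamming norm of vector a is $|a|_H = \sum_i |a_i|^0$, where $0^0$ is defined to be 0

$L_p$ norm of a vector is $\|a\|_p = (\sum_i |a_i|^p)^{1/p}$

For integer entries (so every non-zero $|a_i| \ge 1$): $|a|_H = \sum_i |a_i|^0 \le \sum_i |a_i|^p \le \sum_i B^p |a_i|^0 \le B^p |a|_H$

Setting $B^p = (1+\varepsilon)$ means $|a|_H \le \|a\|_p^p \le (1+\varepsilon)\,|a|_H$

This fixes $p = \log(1+\varepsilon)/\log B \approx \varepsilon/\log B$, so we can approximate the Hamming norm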
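A quick numeric check of the sandwich bound, on an illustrative integer vector with an assumed bound B = 8 and ε = 0.1 (both chosen only for this example):

```python
import math


def lp_to_the_p(vector, p):
    """Sum of |a_i|^p over the non-zero entries (0^0 is taken to be 0)."""
    return sum(abs(v) ** p for v in vector if v != 0)


a = [0, 0, 0, -1, 2, -6, 4, 0]
B = 8                                     # assumed bound on any entry's magnitude
epsilon = 0.1
p = math.log(1 + epsilon) / math.log(B)   # chosen so that B**p == 1 + epsilon

hamming = sum(1 for v in a if v != 0)
approx = lp_to_the_p(a, p)
print(hamming, approx, p)
```

With these values p is below 0.05, and the computed sum lies between the Hamming norm and (1 + ε) times it, as the inequality chain predicts.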

Intuition: Sum of squares

If we compute sum of squares of a, $\sum_i a[i]^2$

Would be easy if we received each a[i] together

Suppose we did store a

Compute a vector r by drawing each entry of r from a Gaussian (Normal) distribution.

Compute $s = r \cdot a = \sum_i r[i]\,a[i]$

What is the expectation of s?

Gaussian Distribution

We know the following:

Sum of Gaussians. If X, Y are independent Gaussians then cX + dY is a Gaussian

The expectation is cE(X) + dE(Y)

The variance is $c^2\,\mathrm{Var}(X) + d^2\,\mathrm{Var}(Y)$

So, if each member of r is drawn from Normal(0,1) (mean = 0, variance = 1), then:

E(s) = 0

$\mathrm{Var}(s) = a[1]^2 + a[2]^2 + \dots = \sum_i a[i]^2$

Suppose we output $s^2$. What is $E(s^2)$?

Expected Results

$\mathrm{Var}(X) = E(X^2) - E(X)^2$

So, $E(s^2) = \mathrm{Var}(s) + E(s)^2 = \sum_i a[i]^2 + 0$

Which is what we want. How does this help?

We can compute s incrementally.

Initially s is 0.

When we see i arrive in the stream, we compute s = s + r[i]

After we have seen the whole of the stream, we have computed s = r • a, without explicitly storing a.

... but instead we have explicitly stored r.
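The claim that $E(s^2)$ equals the sum of squares can be checked empirically. The sketch below stores r explicitly, as on this slide, and averages $s^2$ over many independent draws of r; the stream, the trial count and the seed are arbitrary choices for the illustration.

```python
import random

rng = random.Random(2024)
stream = [0, 0, 0, 2, 2, 3, 4, 4, 4, 4]      # item arrivals; a = [3, 0, 2, 1, 4]
n = 5
true_sum_of_squares = 3**2 + 0**2 + 2**2 + 1**2 + 4**2   # 30

trials = 20000
total = 0.0
for _ in range(trials):
    r = [rng.gauss(0.0, 1.0) for _ in range(n)]   # fresh Gaussian vector per trial
    s = 0.0
    for i in stream:                              # s = s + r[i] on each arrival
        s += r[i]
    total += s * s

mean_s2 = total / trials
print(true_sum_of_squares, mean_s2)
```

A single value of $s^2$ is very noisy (its standard deviation is larger than its mean), which is why the average over many trials is what lands near 30.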

Saving Space

We don’t have to store r.

We just need that every time we ask for r[i] then we get the same answer.

Suppose we use a random number generator.

Every time we ask for r[i], seed the rng with i, and extract two “random” numbers in [0..1].

Put these into the Box-Muller transform that outputs a value drawn from N(0,1)

Every time we do this, we will get the same result.

So it works, the space is reduced.
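A sketch of this idea in Python: `r(i)` re-seeds a generator with i and applies Box-Muller, so the same value comes back on every call. Using Python's `random.Random` as the generator is an assumption for illustration; the slides do not name one.

```python
import math
import random


def r(i):
    """Recompute r[i] on demand: the same i always yields the same N(0,1) value."""
    rng = random.Random(i)              # seed the generator with i
    u1 = 1.0 - rng.random()             # in (0, 1], so log(u1) is finite
    u2 = rng.random()
    # Box-Muller transform: two uniforms -> one standard Gaussian.
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)


stream = [0, 0, 0, 2, 2, 3, 4, 4, 4, 4]   # item arrivals
a = [3, 0, 2, 1, 4]                       # the vector the stream defines

s = 0.0
for i in stream:                          # only s is stored, never r or a
    s += r(i)

direct = sum(r(i) * a[i] for i in range(len(a)))
print(math.isclose(s, direct, abs_tol=1e-9))  # True
```

The trade-off is time for space: each arrival costs a re-seed and a transform instead of an array lookup.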

Some Extra Details

Doing this for a single value will be fairly inaccurate.

Possible to improve the accuracy by repeating several times, taking the average.

Analysis skipped in this presentation, need to analyze the variance (we’ll analyze next alg in painful detail)

Bottom line: to get an approximation that is correct within a factor of 1 ± ε with probability 1-δ requires space $O(1/\varepsilon^2 \log 1/\delta)$

We’d like to do the same thing for the sum of $|a[i]|^p$

Recapping our approach

An exact answer is not possible in small space, so we find an approximate answer with probability guarantees.

We will use statistical distributions with provable properties.

• Pairs (i, j) arrive (meaning “add j to location i”)

• The total of values $x_i$ is bounded: $|x_i| < U$

Will create a small summarizing “sketch” for the stream that allows distinct items to be approximated.

Stable Distributions

Let X be a random variable distributed with a stable distribution. Stable distributions have the property that

$a_1X_1 + a_2X_2 + a_3X_3 + \dots + a_nX_n \sim \|(a_1, a_2, a_3, \dots, a_n)\|_p\,X$

if $X_1, \dots, X_n$ are independent and stable with stability parameter p

Gaussian distribution is stable with parameter 2

Stable distributions exist and can be simulated for all parameters 0 < p < 2.

So, let $X = x_{1,1} \dots x_{k,n}$ be a matrix of values drawn from a stable distribution with parameter p...

Computing the Sketch

• Sketch s is $s_1 \dots s_k$ for small k
• Initially 0
• When item i in the stream arrives, we update the sketch:
• $s_j \leftarrow s_j + x_{j,i}$ for j = 1 to k
• The result is $s_j = a \cdot x_j = a_1 x_{j,1} + a_2 x_{j,2} + a_3 x_{j,3} + \dots + a_n x_{j,n}$

So we get what we wanted.

Finding the Hamming Norm

We can use the sketch to extract the number of distinct items

Compute the median of $|s_1|^p, \dots, |s_k|^p$

We know each $s_j$ is distributed as $\|a\|_p X$

$\mathrm{median}_j\,|s_j|$ estimates $\mathrm{median}(\|a\|_p |X|) = \|a\|_p\,\mathrm{median}(|X|)$

• Bound probability of being far from correct answer
• We take the median of $k = 3/\varepsilon^2 \log 1/\delta$ repeats

Probability Calculation

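• Let min be defined by $\Pr[|X| < \mathrm{min}] = \tfrac12 - \varepsilon$ (taking $\|a\|_p = 1$ to keep the notation simple)
• Suppose $\Pr[|X| < \mathrm{median}_j |s_j|] < \tfrac12 - \varepsilon$
• Then $\mathrm{median}_j |s_j| < \mathrm{min}$
• Then at least k/2 of the values $|s_j|$ are smaller than min
• Define $Y_j = 0$ if $|s_j| < \mathrm{min}$, 1 otherwise
• We want to know, what is $\Pr[\sum_j Y_j < k/2]$
• $Y_j$ are independent, $\Pr[Y_j = 1] = \tfrac12 + \varepsilon$
• $E(\sum_j Y_j) = k(\tfrac12 + \varepsilon)$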

Chernoff Bound

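For independent 0/1 trials $X = \sum_i X_i$ and $0 < \rho < 1$: $\Pr[X < (1-\rho)E(X)] < \exp(-E(X)\rho^2/2)$

• Apply this here: want to know $\Pr[\sum_j Y_j < k/2]$, writing $Y = \sum_j Y_j$
• $k/2 = \tfrac{k}{2}(\tfrac12 + \varepsilon)/(\tfrac12 + \varepsilon) = E(Y)\,\tfrac12/(\tfrac12 + \varepsilon) \approx (1 - 2\varepsilon)E(Y)$, so $\rho \approx 2\varepsilon$
• So $\Pr[\sum_j Y_j < k/2] < \exp(-k(\tfrac12 + \varepsilon)(2\varepsilon)^2/2)$
• $= \exp(-6 \log(1/\delta)\,(\tfrac12 + \varepsilon))$, using $k = 3/\varepsilon^2 \log 1/\delta$
• $< \exp(-3 \log 1/\delta) = \delta^3 < \delta/2$ for $\delta < 1/\sqrt{2}$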

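Using the bound

So $\Pr[\,\Pr[|X| < \mathrm{median}_j |s_j|] < \tfrac12 - \varepsilon\,] < \delta/2$

We can make a similar argument to show the same δ/2 bound for the upper side, $\tfrac12 + \varepsilon$

Write F(x) for cumulative distribution function of |X|

$\Pr[F(\mathrm{median}_j |s_j|) \in [\tfrac12 - \varepsilon, \tfrac12 + \varepsilon]] > 1 - \delta$

$\Pr[\mathrm{median}_j |s_j| \in [F^{-1}(\tfrac12 - \varepsilon), F^{-1}(\tfrac12 + \varepsilon)]] > 1 - \delta$

Since the derivative of F is bounded around the median, we conclude

$\Pr[\mathrm{median}_j |s_j| \in [F^{-1}(\tfrac12)(1 - O(\varepsilon)), F^{-1}(\tfrac12)(1 + O(\varepsilon))]] > 1 - \delta$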

Consequences

$\Pr[(1-\varepsilon)\,\mathrm{median}|X| \le \mathrm{median}_j |s_j| \le (1+\varepsilon)\,\mathrm{median}|X|] > 1 - \delta$

• Overall probability we are within (1 ± ε) is ≥ 1 - δ
• Sets k = $O(1/\varepsilon^2 \log 1/\delta)$ repetitions

But… we need to store all $x_{i,j}$: O(kn) storage

… which is more than just storing the vector a

Reducing space needs

• $x_{i,j}$ must be from a stable distribution with parameter p

• $x_{i,j}$ must be the same every time it is used

We will generate values from a stable distribution by transforming from a uniform distribution

Use a random number generator that is good enough so that f(x) appears to be drawn from a uniform distribution.

Then $x_{1,j}$ = stable(f(x)), $x_{2,j}$ = stable(f(f(x))), $x_{3,j}$ = stable(f(f(f(x)))), etc.

Generating Stable Distributions

• Compute $r_1$, $r_2$ as uniform random variables in range [0...1]
• Set $\theta = \pi(r_1 - \tfrac12)$
• Define

$\mathrm{stable}(r_1, r_2, p) = \dfrac{\sin(p\theta)}{\cos^{1/p}\theta} \left( \dfrac{\cos(\theta(1-p))}{-\ln r_2} \right)^{\frac{1-p}{p}}$

• stable($r_1$, $r_2$, p) is distributed with stable distribution with parameter p
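The transform translates almost line for line into Python. This is a sketch under two assumptions that are not on the slide: the uniforms are clamped slightly inside (0, 1) to avoid dividing by zero, and p is not so small that the powers overflow double precision (which can happen for values such as p = 0.02).

```python
import math
import random


def stable(r1, r2, p):
    """Symmetric p-stable value computed from two uniforms r1, r2."""
    # Clamp away from 0 and 1 so cos(theta) and ln(r2) are never exactly zero.
    r1 = min(max(r1, 1e-12), 1.0 - 1e-12)
    r2 = min(max(r2, 1e-12), 1.0 - 1e-12)
    theta = math.pi * (r1 - 0.5)
    left = math.sin(p * theta) / math.cos(theta) ** (1.0 / p)
    right = (math.cos(theta * (1.0 - p)) / -math.log(r2)) ** ((1.0 - p) / p)
    return left * right


rng = random.Random(42)
print([stable(rng.random(), rng.random(), 0.5) for _ in range(5)])
```

Two sanity checks follow from the formula itself: for p = 1 the second factor has exponent 0, leaving tan(θ), which is the Cauchy distribution; and swapping r1 for 1 - r1 flips the sign of the result, so the output is symmetric about zero.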

Complete Algorithm

initialize sk[1…k] = 0.0
for all tuples (i,j) do
    initialize random with i
    for s = 1 to k do
        r1 = random(); r2 = random()
        sk[s] = sk[s] + j*stable(r1,r2,p)

for s = 1 to k do
    sk[s] = absolute(sk[s])^p

return median(sk)*scalefactor(p)

Simple to implement, can run quickly, small space
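The pseudocode can be turned into a runnable sketch as follows. Several choices here are assumptions rather than the authors' implementation: p = 0.1 is used instead of a very small value so that the stable values stay within double precision, the scale factor is estimated by Monte Carlo as 1 / median(|X|^p), and all function names are illustrative.

```python
import math
import random
import statistics


def stable(r1, r2, p):
    # Transform from the "Generating Stable Distributions" slide; inputs are
    # clamped away from 0 and 1 to avoid division by zero.
    r1 = min(max(r1, 1e-12), 1.0 - 1e-12)
    r2 = min(max(r2, 1e-12), 1.0 - 1e-12)
    theta = math.pi * (r1 - 0.5)
    left = math.sin(p * theta) / math.cos(theta) ** (1.0 / p)
    right = (math.cos(theta * (1.0 - p)) / -math.log(r2)) ** ((1.0 - p) / p)
    return left * right


def scale_factor(p, samples=20000, seed=12345):
    """Monte Carlo estimate of 1 / median(|X|^p) for a p-stable X."""
    rng = random.Random(seed)
    values = [abs(stable(rng.random(), rng.random(), p)) ** p for _ in range(samples)]
    return 1.0 / statistics.median(values)


def hamming_sketch(updates, k, p):
    sk = [0.0] * k
    for i, j in updates:
        rng = random.Random(i)              # "initialize random with i"
        for s in range(k):
            r1, r2 = rng.random(), rng.random()
            sk[s] += j * stable(r1, r2, p)  # same x[s][i] every time i appears
    return sk


def hamming_estimate(sk, p):
    return statistics.median(abs(v) ** p for v in sk) * scale_factor(p)


updates = [(5, +3), (2, -1), (3, +2), (7, +9), (5, -2), (6, -1), (6, -3),
           (2, +1), (4, +2), (3, -2), (7, -5), (5, +2), (6, -2), (4, -3), (5, -1)]
p = 0.1
sk = hamming_sketch(updates, k=400, p=p)
print(hamming_estimate(sk, p))              # the true Hamming norm is 4
```

With p = 0.1 the quantity being estimated is the sum of |a_i|^0.1, which slightly exceeds the Hamming norm, so the output should be read as a rough estimate. Heavy-tailed values also make floating-point cancellation imperfect, which a production implementation would need to handle.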

How to measure streams?

The state at any time defines a massive vector

• Hamming norm: $\sum_i (x_i \ne 0)$

Number of non-zero entries of the vector

• Union Size: $\sum_i (x_i + y_i \ne 0)$

• Hamming difference: $\sum_i ((x_i - y_i) \ne 0) = \sum_i (x_i \ne y_i)$

This is the number of places where the vectors differ, a fundamental concept.

Properties

Difference and union of streams is easy to compute:

sk(a + b) = sk(a) + sk(b)
sk(a - b) = sk(a) - sk(b)

by linearity of dot product, so can approximate $|a - b|_H$ and $|a + b|_H$ with the same accuracy.

Space usage is small: the L0 sketch consists of $O(1/\varepsilon^2 \log 1/\delta)$ counters

Time per item is to update each counter, $O(1/\varepsilon^2 \log 1/\delta)$
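Linearity is easy to confirm numerically. The sketch below uses Gaussian rows (the p = 2 case) as a simple stand-in for stable rows, since the property only depends on each sketch entry being a dot product with a fixed row; the names and the seeding scheme are illustrative.

```python
import math
import random


def sketch(vector, k, seed=0):
    """k dot products of `vector` with fixed pseudo-random Gaussian rows."""
    out = []
    for j in range(k):
        rng = random.Random(seed * 1000 + j)   # row j is reproducible
        out.append(sum(rng.gauss(0.0, 1.0) * v for v in vector))
    return out


a = [0, 3, 0, -1, 2]
b = [1, 3, 0, 0, -2]
k = 8

sk_a, sk_b = sketch(a, k), sketch(b, k)
sk_diff = sketch([x - y for x, y in zip(a, b)], k)

print(all(math.isclose(d, x - y, abs_tol=1e-9)
          for d, x, y in zip(sk_diff, sk_a, sk_b)))  # True
```

In practice this means two parties can sketch their streams separately, with the same seeds, and subtract the sketches afterwards to compare the streams.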

Practical Use

So with O(k) space we can create a sketch to allow rapid comparison of huge streaming vectors.

Note k << n, in fact k is almost independent of n.

Implemented and tested in:

[C, Indyk, Koudas, Muthukrishnan02] - On massive tabular data, looking for clusterings using sketch computations to speed up comparisons for L1, L2 and other Lp distances

[C, Datar, Indyk, Muthukrishnan02] - On streaming vectors, to count number of distinct elements, find Hamming norm and Hamming distance.

Experimental Evaluation

Data Sets

• Generated synthetic data from Zipf distributions with a range of parameters

• Took real Netflow data from one of AT&T’s networks

• Each data stream was around 20Mb, working space was around a few Kb.

Parameters: We fixed p = 0.02 (as small as possible), which sets the scale factor, median(|Stable(0.02,0)|) = 1.425

Existing Techniques

Compared against the “probabilistic counting” algorithm of Flajolet and Martin

+ Uses a similar amount of space

+ Operates in the data stream model

+ Fast per-item processing

– Can’t cope with all situations (e.g. negative values)

– Can’t find the difference between two streams

Hamming Norm Tests

• Performance of our algorithm is better than FM85

• Improves with more workspace

• Somewhat slower in practice

• Shows that FM85 can’t cope when values are allowed to be negative, but L0 sketches retain their accuracy.

• Good performance (~7% error), small memory cost

• Performance of finding union of streams (not shown) also good.

Conclusions

We examined techniques for computing numbers of distinct items.

Can approximate the Hamming norm, Number of Distinct Items, Hamming difference with only a few kb of space

Suitable for indexing streams

The “L0 sketch” can be used as a surrogate for the stream in other computations: clustering, searching, querying, all based only on the sketches

Bonus Material: Dominance Norms

The “worst case influence” is important to know

Suppose we are receiving a number of signals.

Stream consists of $(i, a_i)$, meaning signal i took value $a_i$

Take sum of maximum of each signal, $\sum_i \max a_i$ (so not quite the cash register model)

We define this to be the dominance norm of the stream

Can we compute the dominance norm?

Example

Stream consists of:

(5,3), (2,1), (3,2), (7,9), (5,2), (6,1), (6,3), (2,1), (4,2), (3,2), (7,5), (5,2), (6,2), (4,3), (5,1)

Worst case influence is 1+2+3+3+3+9=21

We will use counting distinct elements as a tool to help us answer this question.

Signal:  1  2  3  4  5  6  7  8
Maximum: 0  1  2  3  3  3  9  0

Approximating Dominance Norm

• Want to approximate the dominance norm as before up to a (1 ± ε) factor
• Consider just approximating for a single signal
• We see stream of values for this signal
• Want to take the max of these
• Suppose we represent $a_i$ as $1 + 1 + 2 + 4 + \dots + 2^j$
• Total = $2^{j+1}$, $2^j \le a_i \le 2^{j+1}$

A 2-approximation

We will insert symbols $x_0, x_1, x_2, \dots, x_j$ into separate distinct elements algorithms $D_0, D_1, D_2, \dots, D_j$

If we do this for every $a_i$ encountered, then we count 1 for $x_i$

So we can compute max($a_i$) approx to a factor of 2.

If we do the same for every signal i, then we can compute the dominance norm up to a factor of 2:

Output $D_0 + D_1 + 2D_2 + \dots + 2^j D_j$
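One way to read this construction is sketched below. Exact Python sets stand in for the distinct-elements algorithms, and the level indexing (signal i is inserted at every level l with 2^l at most its value, and level 0 is counted twice to supply the leading "1 + 1") is an interpretation made for this example, not necessarily the slide's exact indexing.

```python
def dominance_2approx(stream, levels=32):
    """Estimate sum_i max(a_i) within a factor of 2 (values are positive integers)."""
    # D[l] stands in for a distinct-elements algorithm; a plain set is exact.
    D = [set() for _ in range(levels)]
    for signal, value in stream:
        level = 0
        while level < levels and (1 << level) <= value:
            D[level].add(signal)        # repeated insertions are counted once
            level += 1
    # A signal whose maximum lies in [2^j, 2^(j+1)) contributes
    # 1 + (1 + 2 + ... + 2^j) = 2^(j+1).
    return len(D[0]) + sum((1 << level) * len(D[level]) for level in range(levels))


stream = [(5, 3), (2, 1), (3, 2), (7, 9), (5, 2), (6, 1), (6, 3), (2, 1),
          (4, 2), (3, 2), (7, 5), (5, 2), (6, 2), (4, 3), (5, 1)]

best = {}
for signal, value in stream:
    best[signal] = max(value, best.get(signal, 0))
exact = sum(best.values())

print(exact, dominance_2approx(stream))  # 21 34
```

The estimate never undercounts and is at most twice the exact dominance norm. Replacing each set with an approximate distinct counter gives the small-space version.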

Generalizing

• Instead of powers of 2, we can use powers of (1+ε)
• This will allow us to make a 1+ε approximation

Analysis

• How much space do we need?
• Suppose B is the maximum value seen
• Then we need j algorithms, $(1+\varepsilon)^j > B$
• $j = \log B / \log(1+\varepsilon) \approx \log B / \varepsilon$
• Space for each algorithm = $O(1/\varepsilon^2 \log 1/\delta)$
• Total space = $O(1/\varepsilon^3 \log B \log 1/\delta)$

Min dominance?

HW: Suppose instead you wish to compute the best-case influence.

That is, the sum of the minimum of each signal, $\sum_i \min(a_i)$

Either: design an efficient algorithm to solve this problem on the stream, or give a lower bound on the space required.

References

N. Alon, Y. Matias, M. Szegedy, “The Space Complexity of Approximating the Frequency Moments”, STOC 1996

G. Cormode, M. Datar, P. Indyk, S. Muthukrishnan, “Comparing Data Streams Using Hamming Norms”, VLDB 2002

G. Cormode, P. Indyk, N. Koudas, S. Muthukrishnan, “Fast Mining of Tabular Data via Approximate Distance Computations”, ICDE 2002

P. Flajolet, N. Martin, “Probabilistic Counting”, FOCS 1983

P. Indyk, “Stable Distributions, Pseudorandom Generators, Embeddings and Data Stream Computations”, FOCS 2000

J. Nolan, “An Introduction to Stable Distributions” (on web)
