streaming algorithms piotr indyk mit. data streams a data stream is a sequence of data that is too...

Streaming Algorithms

Piotr IndykMIT

Data Streams

• A data stream is a sequence of data that is too large to be stored in available memory

• Examples:– Network traffic

– Sensor networks

– Approximate query optimization and answering in large databases

– Scientific data streams– ..And this talk

Outline

• Data stream model, motivations, connections

• Streaming problems and techniques– Estimating number of distinct elements– Other quantities: Norms, moments, heavy

hitters…– Sparse approximations and compressed

sensing

Basic Data Stream Model

• Single pass over the data: i1, i2,…,in– Typically, we assume n is known

• Bounded storage (typically n or logc n)– Units of storage: bits, words, coordinates or

„elements” (e.g., points, nodes/edges)

• Fast processing time per element– Randomness OK (in fact, almost always necessary)

8 2 1 9 1 9 2 4 6 3 9 4 2 3 4 2 3 8 5 2 5 6 ...

Connections

• External memory algorithms– Linear scan works best– Streaming algorithms yield efficient

external memory algorithms

• Communication complexity/sketching– Low-space algorithms yield low-

communication protocols

• Metric embeddings, sub-linear time algorithms, pseudo-randomness, compressive sensing

A B

Counting Distinct Elements

• Stream elements: numbers from {1...m} • Goal: estimate the number of distinct

elements DE in the stream– Up to 1– With probability 1-P

• Simpler goal: for a given T>0, provide an algorithm which, with probability 1-P:– Answers YES, if DE> (1+)T– Answers NO, if DE< (1-)T

• Run, in parallel, the algorithm with T=1, 1+, (1+)2,..., n

– Total space multiplied by log1+n ≈ log(n)/

Vector Interpretation

• Initially, x=0• Insertion of i is interpreted as

xi = xi +1• Want to estimate DE(x) = the number

of non-zero elements of x

Stream: 8 2 1 9 1 9 2 4 4 9 4 2 5 4 2 5 8 5 2 5

Vector X:

1 2 3 4 5 6 7 8 9

Estimating DE(x) (based on[Flajolet-Martin’85])

• Choose a random set S of coordinates– For each i, we have Pr[iS]=1/T

• Maintain SumS(x) = iS xi

• Estimation algorithm I:– YES, if SumS(x)>0– NO, if SumS(x)=0

• Analysis:– Pr=Pr[SumS(x)=0] = (1-1/T)DE – For T large enough: (1-1/T)DE ≈e-DE/T

– Using calculus, for small enough:• If DE> (1+)T, then Pr ≈ e-(1+) < 1/e -

/3• if DE< (1-)T, then Pr ≈ e-( 1-) > 1/e +

/3

Vector X:

1 2 3 4 5 6 7 8 9

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

1 3 5 7 9 11 13 15 17 19

Series1

Set S: + ++ (T=4)

DE

Pr

Estimating DE(x) ctd.• We have an algorithm:

– If DE> (1+)T, then Pr<1/e-/3– if DE< (1-)T, then Pr>1/e+/3

• Need to amplify the probability of correctness:– Select sets S1 … Sk , k=O(log(1/P)/2)

– Let Z = number of SumSj(x) that are equal to 0– By Chernoff bound with probability >1-P

• If DE> (1+)T, then Z<k/e• if DE< (1-)T, then Z>k/e

• Total space: – Decision version: O(log (1/P)/2 ) numbers in range 0…n

– Estimation: : O(log(n)/ * log (1/P)/2 ) numbers in range 0…n

(the probability of error is O(P log(n)/) )• Better algorithms known:

– Theory: O( 1/2 +log n) bits [Kane-Nelson-Woodruff’10]– Practice: need 128 bytes for all works of Shakespeare , ε≈10% [Durand-Flajolet’03]

Chernoff bound

Let X1…Xn be a sequence of i.i.d. 0-1 random variables such that Pr[Xi=1]=p.

Let X=sumi Xi Then for any eps<1 we have

Pr[ |X-pn| >eps pn] <2e-eps^2 pn/3

Comments

• Implementing S:– Choose a uniform hash function h: {1..m} ->

{1..T}– Define S={i: h(i)=1}– We have i S iff h(i)=1

• Implementing h– In practice: to compute h(i) do:

• Seed=i• Return random()

More comments

• The algorithm uses “linear sketches” SumSj(x)=iSj xi

• Can implement decrements xi=xi-1– I.e., the stream can contain deletions of

elements (as long as x≥0)– Other names: dynamic model, turnstile model

Vector X:

1 2 3 4 5 6 7 8 9

• What other functions of a vector x can we maintain in small space ?

• Lp norms: ||x||p = ( ∑i |xi|p )1/p

– We also have ||x|| =maxi |xi|– … and we can define ||x||0 = DE(x), since ||x||p

p =∑i |xi|p→DE(x) as p→0• Alternatively: frequency moments Fp = p-th power of Lp norms (exception: F0 = L0 )• How much space do you need to estimate ||x||p (for const. ) ?• Theorem:

– For p[0,2]: polylog n space suffices– For p>2: n1-2/p polylog n space suffices and is necessary

[Alon-Matias-Szegedy’96, Feigenbaum-Kannan-Strauss-Viswanathan’99, Indyk’00, Coppersmith-Kumar’04, Ganguly’04, Bar-Yossef-Jayram-Kumar-Sivakumar’02’03, Saks-Sun’03, Indyk-Woodruff’05]

More General Problem

Estimating L2 norm: AMS

Alon-Matias-Szegedy’96

• Choose r1 … rm to be i.i.d. r.v., with

Pr[ri=1]=Pr[ri=-1]=1/2 • Maintain

Z=∑i ri xi

under increments/decrements to xi

• Algorithm I: Y=Z2

• “Claim”: Y “approximates” ||x||22 with “good” probability

Analysis

• We will use Chebyshev inequality– Need expectation, variance

• The expectation of Z2 = (∑i ri xi )2 is equal to

E[Z2] = E[∑i,j rixirjxj] = ∑i,j xi x j E[rirj]

• We have– For i≠j, E[rirj] = E[ri] E[rj] =0 – term disappears

– For i=j, E[rirj] =1

• Therefore

E[Z2] = ∑i xi2 =||x||22

(unbiased estimator)

Analysis, ctd.• The second moment of Z2 = (∑i ri xi )2 is equal to the expectation of

Z4 = (∑i ri xi ) (∑i ri xi ) (∑i ri xi ) (∑i ri xi )• This can be decomposed into a sum of

– ∑i (ri xi )4 →expectation= ∑i xi 4

– 6 ∑i<j (ri rj xixj )2 →expectation= 6∑i<j xi2

xj2

– Terms involving single multiplier ri xi (e.g., r1x1r2x2r3x3r4x4) →expectation=0

Total: ∑i xi 4 + 6∑i<j xi

2 xj

2

• The variance of Z2 is equal to E[Z4]-E2[Z2] = ∑i xi

4 + 6∑i<j xi2

xj2 – (∑i xi

2 )2

= ∑i xi 4 + 6∑i<j xi

2 xj

2 – ∑i xi4 -2 ∑i<j xi

2 xj

2

= 4∑i<j xi2

xj2

≤ 2 (∑i xi 2 )2

Analysis, ctd.

• We have an estimator Y=Z2

– E[Y] = ∑i xi2

– σ2 =Var[Y] ≤ 2 (∑i xi 2 )2

• Chebyshev inequality:Pr[ |E[Y]-Y| ≥ cσ ] ≤ 1/c2

• Algorithm II:– Maintain Z1 … Zk (and thus Y1 … Yk ), define Y’ = ∑i Yi /k– E[Y’] = k ∑i xi

2 /k = ∑i x i2

– σ’2 = Var[Y’] ≤ 2k(∑i xi 2 )2 /k2 = 2 (∑i xi

2 )2 /k• Guarantee:

Pr[ |Y’ - ∑i xi2 | ≥c (2/k)1/2 ∑i xi

2 ] ≤ 1/c2

• Setting c to a constant and k=O(1/ε2) gives (1 ε)-approximation with const. probability

Comments

• Only needed 4-wise independence of r1…rm – I.e., for any subset S of {1…m}, |S|=4, and any

b{-1,1}4, we have Pr[rS=b]=1/24

– Can generate such vars from O(log m) random bits

• What we did:– Maintain a “linear sketch” vector Z=[Z1...Zk] = R x

– Estimator for ||x||22 : (Z12 +... + Zk

2)/k = ||Rx||22 /k

– “Dimensionality reduction”: x→ Rx

… but the tail somewhat “heavy”

– Reason: only used second moment of the estimator

streaming algorithms piotr indyk mit. data streams a data stream is a sequence of data that is too...

Documents

t large

n estimation

sequence of data

probability of error

probability of correctness

e 3deprchart10

beststreaming algorithms

number of sumsjx