Algorithms for Distributed Functional Monitoring Ke Yi HKUST Joint work with Graham Cormode (AT&T Labs) S. Muthukrishnan (Google Inc.)


Post on 23-Feb-2016

Page 1: Algorithms for Distributed Functional Monitoring

Algorithms for Distributed Functional Monitoring

Ke Yi
HKUST

Joint work with
Graham Cormode (AT&T Labs)
S. Muthukrishnan (Google Inc.)

Page 2: Algorithms for Distributed Functional Monitoring

The Story Begins with ...

Page 3: Algorithms for Distributed Functional Monitoring

The Model

[Figure: two streams of items arriving over time t, one observed by Alice and one by Bob]

Alice observes A(t) by time t

Bob observes B(t) by time t

A(t), B(t): multisets

Carole tries to compute f(A(t) ∪ B(t)) for all t

All parties have infinite computing power

Goal is to minimize communication

Page 4: Algorithms for Distributed Functional Monitoring

The Model

[Figure: k sites, each observing its own stream of items over time]

Continuous Communication Model / Distributed Streaming Model

Page 5: Algorithms for Distributed Functional Monitoring

Combination of Two Models

[Figure: the communication model (the one-shot model) and the streaming model combine into the Continuous Communication Model, also called the Distributed Streaming Model]

Page 6: Algorithms for Distributed Functional Monitoring

Other Models [Gibbons and Tirthapura, 2001]

[Figure: two streams of items, A and B, arriving over time t]

Carole tries to compute f(A ∪ B) in the end

All parties make one pass using small memory and small communication

Page 7: Algorithms for Distributed Functional Monitoring

Applied Motivation: Distributed Monitoring

Large-scale querying/monitoring: inherently distributed!

Streams physically distributed across remote sites, e.g., stream of UDP packets through routers

Challenge is “holistic” querying/monitoring

Queries over the union of distributed streams Q(S1 ∪ S2 ∪ …)

Streaming data is spread throughout the network

[Figure: remote sites S1 to S6, each observing a stream of bits, report to the query site at the Network Operations Center (NOC), which evaluates the query Q(S1 ∪ S2 ∪ …)]

Slide from the tutorial “Streaming in a connected world: Querying and tracking distributed data streams” at VLDB’06 and SIGMOD’07 [Cormode and Garofalakis]

Page 8: Algorithms for Distributed Functional Monitoring

Applied Motivation: Distributed Monitoring

Traditional approach: “pull” based

Query all nodes once in a while

Expensive communication, most of which is wasted

Inaccurate

Current trend: moving towards a “push” based approach

The remote sites alert the coordinator when something interesting happens

[Figure: remote sites S1 to S6, each observing a stream of bits, report to the query site at the Network Operations Center (NOC), which evaluates the query Q(S1 ∪ S2 ∪ …)]

Page 9: Algorithms for Distributed Functional Monitoring

Theoretical Questions

Upper bounds: Worst-case communication bounds for a given f ?

Lower bounds: Is there a gap in the communication complexity between the one-shot model and the continuous model?

Page 10: Algorithms for Distributed Functional Monitoring

The Frequency Moments

Assume integer domain [n] = {1, …, n}

i appears m_i times

The p-th frequency moment: F_p = Σ_{i=1}^{n} m_i^p

F1 is the cardinality of A

F0 is # unique items in A (define 0^0 = 0)

F2 is Gini’s index of homogeneity in statistics, and the self-join size in databases

Extensively studied since [Alon, Matias, and Szegedy, 1999]
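The definition can be checked on a small multiset. The following is a minimal sketch; the function name `frequency_moment` and the example multiset are chosen here for illustration and are not from the talk.

```python
from collections import Counter

def frequency_moment(items, p):
    """Return F_p = sum of m_i**p over the items i that appear in the multiset.

    Items that never appear have m_i = 0 and are skipped, which matches
    the convention 0^0 = 0 for p = 0.
    """
    counts = Counter(items)  # m_i for every item i that occurs at least once
    return sum(m ** p for m in counts.values())

A = [1, 4, 2, 1, 3, 4, 5]          # example multiset (an assumption for illustration)
print(frequency_moment(A, 0))      # number of distinct items
print(frequency_moment(A, 1))      # total number of items
print(frequency_moment(A, 2))      # self-join size
```

Here the counts are m_1 = m_4 = 2 and m_2 = m_3 = m_5 = 1, so F0 = 5, F1 = 7 and F2 = 4 + 4 + 1 + 1 + 1 = 11.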

Page 11: Algorithms for Distributed Functional Monitoring

Approximate Monitoring

Must trigger alarm when Fp > τ

Cannot trigger alarm when Fp < (1 − ε) τ

Why approximate: Exact monitoring is expensive and unnecessary

Why monitoring: Most applications only need monitoring

Tracking can be simulated by monitoring with τ = 1+ε, (1+ε)^2, (1+ε)^3, …, so at most an O(1/ε) factor away.

[Figure: Fp plotted against time, with the levels (1 − ε) τ and τ marked and the point where the alarm is raised]
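The two requirements leave a band between (1 − ε) τ and τ in which the algorithm is free to raise the alarm or not. The sketch below spells out that contract and lists the geometric thresholds used to simulate tracking; both function names are hypothetical helpers, not part of the talk.

```python
def alarm_requirement(fp, tau, eps):
    """Classify the current value of Fp against the monitoring contract."""
    if fp > tau:
        return "must alarm"
    if fp < (1 - eps) * tau:
        return "must not alarm"
    return "either"  # inside the slack band: both answers are acceptable

def tracking_thresholds(eps, upper):
    """Thresholds (1+eps), (1+eps)^2, ... up to the first one that reaches `upper`."""
    thresholds = []
    t = 1 + eps
    while True:
        thresholds.append(t)
        if t >= upper:
            break
        t *= 1 + eps
    return thresholds

print(alarm_requirement(95, 100, 0.1))
print(tracking_thresholds(1.0, 8))
```

Running one monitoring instance per threshold tells the coordinator which pair of consecutive thresholds Fp currently lies between, which pins Fp down to within a (1+ε) factor.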

Page 12: Algorithms for Distributed Functional Monitoring

Prior Work

Several papers in the database literature

Mostly heuristic based

Bad worst-case bounds, no lower bounds

F1: O(k/ε log(τ/k)) [SIGMOD’06]; new bound: O(k log(1/ε))

F0: Õ(k^2/ε^3) [ICDE’06]; new bound: Õ(k/ε^2)

F2: Õ(k^2/ε^4) [VLDB’05]; new bound: Õ(k^2/ε + k^(3/2)/ε^3)

Õ() suppresses polylog factors

Page 13: Algorithms for Distributed Functional Monitoring

Continuous vs One-Shot

If there is a continuous monitoring algorithm that communicates X bits, then there is a one-shot algorithm that communicates O(X+k) bits

Page 14: Algorithms for Distributed Functional Monitoring

Our Results

Good news: all continuous bounds (except F2) are close to their one-shot counterparts

Bad news: all continuous bounds (except F2) are close to their one-shot counterparts

Page 15: Algorithms for Distributed Functional Monitoring

Talk Outline

Introduction

Deterministic F1 algorithm: O(k log(1/ε))

Randomized F1 algorithm: O(1/ε^2 ∙ log(1/δ))

Randomized F0 algorithm: Õ(k/ε^2)

Randomized F2 algorithm: Õ(k^2/ε + k^(3/2)/ε^3)

Conclusions

Page 16: Algorithms for Distributed Functional Monitoring

Deterministic F1 Algorithm

The first round:

[Figure: k sites around the coordinator; each site sends a signal after every τ/(2k) new items]

Terminates round after receiving k signals

τ/(2k) · k = τ/2 < F1 < τ

Page 17: Algorithms for Distributed Functional Monitoring

Deterministic F1 Algorithm

The second round:

[Figure: each site now sends a signal to the coordinator after every τ/(4k) new items]

Page 18: Algorithms for Distributed Functional Monitoring

Deterministic F1 Algorithm

The second round:

[Figure: each site now sends a signal to the coordinator after every τ/(4k) new items]

Terminates round after receiving k signals

3τ/4 < F1 < τ

Page 19: Algorithms for Distributed Functional Monitoring

Deterministic F1 Algorithm

Each round communicates O(k) bits

Continue until Δ = ετ: O(log(1/ε)) rounds

[Figure: the coordinator and the sites in the last round, labeled Δ = ετ]

After the last round, we have (1 − ε)τ < F1 < τ

Total communication: O(k log(1/ε)); lower bound: Ω(k log(1/(εk)))

One-shot: O(k log(1/ε)); lower bound: Ω(k log(1/(εk)))
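The rounds can be simulated in a few lines. This is a rough sketch under stated assumptions, not the authors' exact protocol: the function `monitor_f1` is hypothetical, thresholds are rounded down to integers, and at each round change the coordinator is assumed to poll the sites for their exact unreported counts (an extra O(k) messages) so that the next threshold is computed from an exact F1.

```python
def monitor_f1(arrivals, k, tau, eps):
    """Simulate round-based F1 threshold monitoring.

    arrivals: sequence of site indices in range(k), one entry per arriving item.
    Returns (items_seen_when_alarm_fires, messages_sent); the first entry is
    None if the stream ends before the alarm fires.
    """
    unsent = [0] * k   # items each site has seen but not yet reported
    known = 0          # coordinator's lower bound on F1
    messages = 0
    signals = 0        # signals received in the current round

    def threshold():
        # Half of the remaining slack tau - known, split evenly over k sites.
        return max(1, (tau - known) // (2 * k))

    theta = threshold()
    for seen, site in enumerate(arrivals, start=1):
        unsent[site] += 1
        if unsent[site] < theta:
            continue
        # The site sends a one-bit signal that stands for theta items.
        unsent[site] -= theta
        known += theta
        signals += 1
        messages += 1
        if signals == k:
            # Assumed round change: poll exact counts, then announce a new threshold.
            known += sum(unsent)
            unsent = [0] * k
            messages += 3 * k   # k poll requests, k replies, k announcements
            theta = threshold()
            signals = 0
        if known >= (1 - eps) * tau:
            return seen, messages
    return None, messages

k, tau, eps = 4, 1000, 0.1
print(monitor_f1([i % k for i in range(2000)], k, tau, eps))
```

With k = 4 and τ = 1000 the thresholds go 125, 62, 31, 16, mirroring the τ/2k, τ/4k, … progression, and the alarm fires while F1 is between (1 − ε)τ and τ after a few dozen messages rather than one message per item.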

Page 20: Algorithms for Distributed Functional Monitoring

Talk Outline

Introduction

Deterministic F1 algorithm: O(k log(1/ε))

Randomized F1 algorithm: O(1/ε^2 ∙ log(1/δ))

Randomized F0 algorithm: Õ(k/ε^2)

Randomized F2 algorithm: Õ(k^2/ε + k^(3/2)/ε^3)

Conclusions

Page 21: Algorithms for Distributed Functional Monitoring

F0: # Distinct Items

Lower bound: Any deterministic (or Las Vegas randomized) algorithm has to communicate Ω(n) bits

Consider the one-shot case first

Use “sketches”: small-space streaming algorithms

“Combine” the sketches from the k sites

FM sketch [Flajolet and Martin 1985; Alon, Matias, and Szegedy, 1999]

Page 22: Algorithms for Distributed Functional Monitoring

FM Sketch

Take a pair-wise independent random hash function h : {1,…,n} → {1,…,2^d}, where 2^d > n

For each incoming element x, compute h(x), e.g., h(5) = 10101100010000

Count how many trailing zeros

Remember the maximum number of trailing zeros in any h(x)

Let Y be the maximum number of trailing zeros

Can show E[2^Y] = # distinct elements
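A single copy of the sketch might look as follows. This is a sketch under assumptions: the class name `FMSketch`, the hash family (a·x + b) mod p reduced to d bits, and the range {0,…,2^d − 1} instead of {1,…,2^d} are choices made here, and reducing modulo 2^d makes the hash only approximately pair-wise independent.

```python
import random

def trailing_zeros(v, width):
    """Number of trailing zero bits of v; an all-zero value counts as `width` zeros."""
    return width if v == 0 else (v & -v).bit_length() - 1

class FMSketch:
    PRIME = (1 << 61) - 1  # Mersenne prime used as the hash modulus (an assumption)

    def __init__(self, seed, d=32):
        rng = random.Random(seed)
        self.a = rng.randrange(1, self.PRIME)
        self.b = rng.randrange(self.PRIME)
        self.d = d
        self.y = 0  # maximum number of trailing zeros seen so far

    def add(self, x):
        h = ((self.a * x + self.b) % self.PRIME) % (1 << self.d)
        self.y = max(self.y, trailing_zeros(h, self.d))

    def estimate(self):
        return 2 ** self.y

print(trailing_zeros(0b10101100010000, 14))  # the example value from the slide

sketch = FMSketch(seed=42)
for x in range(1, 1001):
    sketch.add(x)
print(sketch.estimate())  # a power of two, a coarse estimate of 1000
```

Adding the same element again never changes Y, which is why the sketch counts distinct items rather than occurrences.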

Page 23: Algorithms for Distributed Functional Monitoring

FM Sketch

So 2^Y is an unbiased estimator for # distinct elements

However, it has a large variance

Some recent techniques [Gibbons and Tirthapura, 2001; Bar-Yossef, Jayram, Kumar, Sivakumar, and Trevisan, 2002] produce a good estimator that is within relative error ε with probability 1 − δ

Space increased to Õ(1/ε^2)

FM sketch has linearity: Y1 from A, Y2 from B, then 2^max{Y1, Y2} estimates # distinct items in A ∪ B

A one-shot algorithm with communication Õ(k/ε^2)
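The combining step can be illustrated with two sites that share one hash function. This is a minimal sketch with a single copy of the sketch rather than the Õ(1/ε^2) copies mentioned above; `make_hash`, `fm_value` and the example sets are assumptions made for illustration.

```python
import random

PRIME = (1 << 61) - 1  # modulus for the (a*x + b) mod p hash family (an assumption)

def make_hash(seed, d=32):
    """Hash shared by all sites; the shared seed is what makes the sketches combinable."""
    rng = random.Random(seed)
    a, b = rng.randrange(1, PRIME), rng.randrange(PRIME)
    return lambda x: ((a * x + b) % PRIME) % (1 << d)

def fm_value(items, h, d=32):
    """Y = maximum number of trailing zeros of h(x) over the items (0 if empty)."""
    y = 0
    for x in items:
        v = h(x)
        y = max(y, d if v == 0 else (v & -v).bit_length() - 1)
    return y

h = make_hash(seed=7)
A = range(0, 600)       # items seen at the first site
B = range(400, 1000)    # items seen at the second site, overlapping with A
y1, y2 = fm_value(A, h), fm_value(B, h)
y_union = fm_value(set(A) | set(B), h)
print(max(y1, y2) == y_union)
```

Each site only has to ship its value Y, and the maximum over the sites is exactly the value a single sketch over the union would hold.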

Page 24: Algorithms for Distributed Functional Monitoring

Continuously Monitoring F0

FM sketch is monotone

Yi is non-decreasing, and Yi < log n

Whenever Yi increases, notify the coordinator

The coordinator can always have the up-to-date combined FM sketch

Total communication: Õ(k/ε^2)

Lower bound: Ω(k)
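The monotonicity argument translates directly into a simulation in which a site talks to the coordinator only when its local Y grows. The code is a sketch under assumptions: one copy of the sketch per site, a hypothetical `monitor_f0` function, and the same illustrative hash family as a stand-in for the pair-wise independent hash.

```python
import random

PRIME = (1 << 61) - 1  # modulus for the (a*x + b) mod p hash family (an assumption)

def make_hash(seed, d=32):
    rng = random.Random(seed)
    a, b = rng.randrange(1, PRIME), rng.randrange(PRIME)
    return lambda x: ((a * x + b) % PRIME) % (1 << d)

def trailing_zeros(v, d=32):
    return d if v == 0 else (v & -v).bit_length() - 1

def monitor_f0(arrivals, k, h):
    """arrivals: iterable of (site, item) pairs. Returns (coordinator Y, messages)."""
    local_y = [0] * k      # Y_i kept at each site
    coordinator_y = 0      # max over all Y_i reported so far
    messages = 0
    for site, x in arrivals:
        tz = trailing_zeros(h(x))
        if tz > local_y[site]:
            # Y_i increased: the site notifies the coordinator.
            local_y[site] = tz
            messages += 1
            coordinator_y = max(coordinator_y, tz)
    return coordinator_y, messages

rng = random.Random(1)
k = 5
arrivals = [(rng.randrange(k), rng.randrange(10_000)) for _ in range(20_000)]
h = make_hash(seed=3)
print(monitor_f0(arrivals, k, h))
```

Because each Y_i can only increase and is bounded by the hash width, a site sends at most d messages per copy of the sketch, no matter how long the stream is.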

Page 25: Algorithms for Distributed Functional Monitoring

Talk Outline

Introduction

Deterministic F1 algorithm: O(k log(1/ε))

Randomized F1 algorithm: O(1/ε^2 ∙ log(1/δ))

Randomized F0 algorithm: Õ(k/ε^2)

Randomized F2 algorithm: Õ(k^2/ε + k^(3/2)/ε^3)

Conclusions

Page 26: Algorithms for Distributed Functional Monitoring

F2: The One-Shot Case

Lower bound: Any deterministic (or Las Vegas randomized) algorithm has to communicate Ω(n) bits

Consider the one-shot case first

Use “sketches”: small-space streaming algorithms

“Combine” the sketches from the k sites

AMS sketch [Alon, Matias, and Szegedy, 1999]

Page 27: Algorithms for Distributed Functional Monitoring

AMS Sketch: “Tug-of-War”

Take a 4-wise independent random hash function h : {1,…,n} → {−1,+1}

Compute Y = ∑ h(x) over all x

Y^2 is an unbiased estimator for F2

Use O(1/ε^2 ∙ log(1/δ)) copies to guarantee a good estimator that is within relative error ε with probability 1 − δ

Linearity still holds!

One-shot case can be solved with communication Õ(k/ε^2)
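A small version of the tug-of-war sketch shows both the estimator and its linearity. The following is a sketch under assumptions: the sign hash is a random degree-3 polynomial modulo a prime with the parity of the result mapped to ±1 (a common stand-in that is only approximately unbiased), the copies are combined by a plain average rather than the median-of-means needed for the 1 − δ guarantee, and all names and example streams are hypothetical.

```python
import random

PRIME = (1 << 61) - 1  # modulus for the polynomial hash (an assumption)

def make_sign_hash(rng):
    """Random degree-3 polynomial mod PRIME, mapped to -1/+1 by parity."""
    coeffs = [rng.randrange(PRIME) for _ in range(4)]
    def h(x):
        v = 0
        for c in coeffs:               # Horner evaluation of the polynomial
            v = (v * x + c) % PRIME
        return 1 if v & 1 else -1
    return h

def ams_counters(items, hashes):
    """One counter Y = sum of h(x) over all items, for each hash function h."""
    return [sum(h(x) for x in items) for h in hashes]

def estimate_f2(counters):
    """Average of Y^2 over the independent copies."""
    return sum(y * y for y in counters) / len(counters)

rng = random.Random(0)
hashes = [make_sign_hash(rng) for _ in range(16)]
A = [1, 4, 2, 1, 3, 4, 5]   # example stream at one site
B = [2, 3, 5, 2, 1, 2]      # example stream at another site
ya, yb = ams_counters(A, hashes), ams_counters(B, hashes)
combined = [p + q for p, q in zip(ya, yb)]
print(combined == ams_counters(A + B, hashes))  # linearity: counters add up
print(estimate_f2(combined))                    # estimate of F2 of the union
```

Linearity here means the coordinator can add the counters it receives from the sites and obtain exactly the sketch of the combined stream.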

Page 28: Algorithms for Distributed Functional Monitoring

However…

Y is not monotone!

Can’t afford to send all changes of the local sketch to the coordinator
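A toy trace makes the contrast with the FM sketch concrete. The sign table below is a made-up stand-in for the hash function h, chosen only to show the counter moving in both directions.

```python
def y_trajectory(stream, sign):
    """Running value of Y = sum of sign[x] after each arrival."""
    y, trajectory = 0, []
    for x in stream:
        y += sign[x]
        trajectory.append(y)
    return trajectory

sign = {1: +1, 2: -1, 3: +1}   # hypothetical hash values h(x)
print(y_trajectory([1, 3, 2, 2, 1], sign))  # [1, 2, 1, 0, 1]
```

Every arrival changes Y by ±1, so forwarding each change would cost one message per item, whereas the FM value changes only a bounded number of times.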

Page 29: Algorithms for Distributed Functional Monitoring

F2 Monitoring: Multi-Round Algorithm

Beginning of a round

[Figure: at the beginning of a round, each site sends a sketch of size Õ(1/ε^2) to the coordinator, which computes an estimate for F2]

Page 30: Algorithms for Distributed Functional Monitoring

F2 Monitoring: Multi-Round Algorithm

During a round

[Figure: the coordinator holds its estimate for F2]

Each site sends a signal whenever the F2 of its updates increases by t = (τ − F2)^2/(64 k^2 τ)

Page 31: Algorithms for Distributed Functional Monitoring

F2 Monitoring: Multi-Round Algorithm

End of a round: when k signals are received

[Figure: the coordinator holds its estimate for F2]

old F2 + (τ − old F2) ∙ ε/k < new F2 < τ

# rounds: O(k/ε)

Total cost: Õ(k^2/ε^3)
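The total is the per-round cost multiplied by the number of rounds. A rough accounting follows, taking the per-round guarantee above at face value; the slack notation Δ_j = τ − F2 at the start of round j is introduced here for illustration, and the geometric argument yields an extra log(1/ε) factor in the round count, which Õ absorbs in the total.

```latex
\begin{align*}
\Delta_{j+1} &< \Bigl(1 - \tfrac{\varepsilon}{k}\Bigr)\,\Delta_j
  && \text{(per-round guarantee rewritten in terms of the slack)} \\
\Delta_j &\le \Bigl(1 - \tfrac{\varepsilon}{k}\Bigr)^{j}\,\Delta_0 \le e^{-j\varepsilon/k}\,\tau
  && \text{(so } \Delta_j \le \varepsilon\tau \text{ once } j \ge \tfrac{k}{\varepsilon}\ln\tfrac{1}{\varepsilon}\text{)} \\
\text{cost per round} &= k \cdot \tilde{O}(1/\varepsilon^{2}) + O(k) = \tilde{O}(k/\varepsilon^{2})
  && \text{(one sketch per site, plus } k \text{ signals)} \\
\text{total cost} &= \tilde{O}(k/\varepsilon) \cdot \tilde{O}(k/\varepsilon^{2}) = \tilde{O}(k^{2}/\varepsilon^{3})
\end{align*}
```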

Page 32: Algorithms for Distributed Functional Monitoring

F2: Round / Sub-Round Algorithm

End of a sub-round: when k signals are received

old F2 + (τ − old F2) ∙ ε/k < new F2 < τ

[Figure: each site sends a “rough” sketch of size Õ(1) to the coordinator, which combines the sketches to maintain an upper bound of F2 alongside its estimate for F2]

k

Total cost: Õ(k^2/ε + k^(3/2)/ε^3)

One-shot: Õ(k/ε^2)

Lower bound: Ω(k)

Page 33: Algorithms for Distributed Functional Monitoring

Open Problems

Still no clear separation between the one-shot model and the continuous model; F2 is an interesting case

Many other functions f:
Statistics: entropy, heavy hitters
Geometric measures: diameter, width, …

Variations of the model:
One-way vs two-way communication
Does having a broadcast channel help?
Sliding windows?

“Continuous Communication Complexity”?

Page 34: Algorithms for Distributed Functional Monitoring

Thank you!