Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Page 1: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Memory Requirements of Data Streams

Reynold Cheng

19th July, 2002

Page 2: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Data Stream Model

[Diagram labels: User/Application; Register Query; Stream Query Processor; Results]

Page 3: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Data Stream Model

[Diagram labels: User/Application; Register Query; Stream Query Processor; Results; Scratch Space (Memory and/or Disk); Data Stream Management System (DSMS)]

Page 4: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Impact of Limited Memory

Continuous streams grow unboundedly.

Question: Can a continuous query be evaluated using a finite amount of memory?

Page 5: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Query Model

SPJ (Select-Project-Join) queries: π_L(σ_P(S1 × S2 × … × Sn))

L: list of projected attributes

P: selection predicate

S1, S2, …, Sn: input streams

π: duplicate-eliminating projection; π′: duplicate-preserving projection

Page 6: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Query Model: Predicate P

P: conjunction of atomic predicates

Atomic predicate:
  Si.A Op Sj.B (i = j or i ≠ j)
  Si.A Op k, where k is a constant
  Op ∈ {<, =, >}

Page 7: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Query Model: Attributes

All attributes have discrete, ordered domains

All attributes are of type integer

Page 8: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Query Model: Monotonic and Exact Answers

The queries are monotonic: any tuple that appears in the answer at any point in time continues to do so forever.

The answers produced are exact, i.e., no approximation.

Page 9: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Motivating Examples

When does an SPJ query require only a bounded amount of memory?

Assume 2 streams: S(A, B, C) and T(D, E)

Page 10: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Query 1

π′_A(σ_{A>10}(S))

Is a simple filter on S

Tuple-at-a-time processing

No extra memory for saving stream tuples

Page 11: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Query 1 (again)

π_A(σ_{A>10}(S))

Keep track of each distinct value of A > 10, to eliminate duplicates

Requires unbounded memory
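The contrast between the two versions of Query 1 can be sketched in a few lines. This is only an illustration: the tuple layout `(A, B, C)` and the function names are assumptions made for the example, not something defined on the slides.

```python
from typing import Iterable, Iterator, Tuple

STuple = Tuple[int, int, int]  # hypothetical layout (A, B, C) for stream S


def query1_preserving(stream: Iterable[STuple]) -> Iterator[int]:
    """Duplicate-preserving projection: a stateless, tuple-at-a-time filter."""
    for a, _b, _c in stream:
        if a > 10:
            yield a


def query1_eliminating(stream: Iterable[STuple]) -> Iterator[int]:
    """Duplicate-eliminating projection: remembers every value already output."""
    seen = set()  # grows with the number of distinct A values above 10
    for a, _b, _c in stream:
        if a > 10 and a not in seen:
            seen.add(a)
            yield a
```

The first function keeps no state between tuples, while the `seen` set in the second has no upper bound on its size because A itself has no upper bound.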

Page 12: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Query 2

π_A(σ_{A=D}(S × T))

Save each tuple s in S, since s may join with any tuples in T that arrive in the future

Also need to save each tuple in T

Requires unbounded memory

Page 13: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Query 3

π′_A(σ_{A=D ∧ A>10 ∧ D<20}(S × T))

Can be evaluated with bounded memory!

For each integer v in [11,19], keep:
  current # of tuples in S with A = v (v.S)
  current # of tuples in T with D = v (v.T)

For an incoming tuple from stream S with A = v, output v, v.T times.

For an incoming tuple from stream T with D = v, output v, v.S times.
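The counter scheme for Query 3 can be sketched as follows. The class and method names are hypothetical; only the per-value counts v.S and v.T come from the slide.

```python
from typing import List

LOW, HIGH = 11, 19  # the integer values v with 10 < v < 20


class Query3:
    """Bounded-memory evaluation: two counters per value v in [11, 19]."""

    def __init__(self) -> None:
        self.count_s = {v: 0 for v in range(LOW, HIGH + 1)}  # v.S
        self.count_t = {v: 0 for v in range(LOW, HIGH + 1)}  # v.T

    def on_s(self, a: int) -> List[int]:
        """Handle an S tuple with A = a and return the values to output."""
        if not LOW <= a <= HIGH:
            return []  # the tuple can never satisfy the predicate
        self.count_s[a] += 1
        return [a] * self.count_t[a]

    def on_t(self, d: int) -> List[int]:
        """Handle a T tuple with D = d and return the values to output."""
        if not LOW <= d <= HIGH:
            return []
        self.count_t[d] += 1
        return [d] * self.count_s[d]
```

The state is 18 counters regardless of how many tuples arrive, and each join result is output exactly once, when the later of its two tuples arrives.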

Page 14: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Query 4

π_A(σ_{B<D ∧ A>10 ∧ A<20}(S × T))

Can be evaluated with bounded memory!

For each integer v in [11,19], keep:
  Current min value of B among all tuples in S with A = v (v.B_min)

Also keep the current max value of D among all tuples in T (D_max)

For an incoming tuple from stream S with A = v, check: S.B < D_max.

For an incoming tuple from stream T, check for each v: T.D > v.B_min.
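A minimal sketch of the Query 4 synopsis, under two assumptions that go beyond the slide text: a single running maximum of D is kept for all of T (T(D, E) has no attribute A), and a small set records which values were already output so that the duplicate-eliminating projection emits each value once. All names are hypothetical.

```python
from typing import Dict, List, Optional, Set


class Query4:
    """Bounded-memory evaluation using per-value minima and one maximum."""

    def __init__(self) -> None:
        # v.B_min for each v in [11, 19]; None until an S tuple with A = v arrives
        self.b_min: Dict[int, Optional[int]] = {v: None for v in range(11, 20)}
        self.d_max: Optional[int] = None  # largest D seen so far in T
        self.emitted: Set[int] = set()    # at most 9 entries

    def _emit(self, candidates: List[int]) -> List[int]:
        out = []
        for v in candidates:
            b = self.b_min[v]
            if (v not in self.emitted and b is not None
                    and self.d_max is not None and b < self.d_max):
                out.append(v)
        self.emitted.update(out)
        return out

    def on_s(self, a: int, b: int) -> List[int]:
        if a not in self.b_min:
            return []  # A outside (10, 20)
        cur = self.b_min[a]
        self.b_min[a] = b if cur is None else min(cur, b)
        return self._emit([a])

    def on_t(self, d: int) -> List[int]:
        self.d_max = d if self.d_max is None else max(self.d_max, d)
        return self._emit(sorted(self.b_min))
```

Keeping only the minimum B per value and the maximum D is enough because a value v belongs to the answer as soon as any pair with B < D exists, and the smallest B with the largest D is the pair most likely to satisfy that.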

Page 15: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Our Goals

Some queries involving joins can be answered using a finite amount of memory

Under what conditions can a query be answered using finite memory over all possible instances of streams?

Identify a class of queries that can be evaluated with a bounded amount of memory

Consider both duplicate-preserving and duplicate-eliminating projections

Page 16: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Bounded Memory Computability

An SPJ query is bounded-memory computable if we can find:
  A constant M, and
  An algorithm that evaluates the query using fewer than M units of memory

A unit of memory can store one attribute value or a count

Page 17: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Memory Boundedness Testing (Outline)

Rewrite Q as a union of Locally Totally Ordered Queries (LTO queries), based on the following theorem:

Any query Q can be rewritten as Q1 ∪ Q2 ∪ … ∪ Qm, where each Qi is an LTO query and unions are duplicate-preserving

Page 18: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Memory Boundedness Testing (Outline)

Q is computable in bounded memory if and only if all its decomposed LTO queries are computable in bounded memory.

Key: Develop a theorem for checking the memory boundedness of an LTO query

Page 19: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Definitions

Can write a query Q as Q(P), when only P is important to discussions

Element: constant / attribute of a stream

C(Q): set of constants appearing in Q

S(Q): set of streams appearing in Q

A(S): set of attributes in stream S

E(S) = A(S) ∪ C(Q): set of elements in Q potentially relevant to stream S

Page 20: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Total Ordering

P+: transitive closure of P
  Example: A = 10 and B < A => B < 10

A set of elements E is totally ordered by a set of predicates P if, for any elements e1 and e2 in E, exactly one of e1 < e2, e1 = e2, or e1 > e2 is in P+

Page 21: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Order-Inducing Predicates

A set of predicates P is order-inducing for a set of elements E if E is totally ordered by P.

TO(E): Set of all order-inducing sets of predicates for elements E

Example: E = {A, B, 5}
  Order-inducing sets: {A < B, 5 < A}, {A = B, B < 5}
  These two sets belong to TO(E).
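To see how large TO(E) gets, the orderings of a small element set can be enumerated directly. The sketch below is an illustration under stated assumptions: it treats an ordering as an ordered list of equality groups, writes one representative predicate set per ordering, and assumes E contains at most one constant (with several constants, orderings that contradict their numeric order would have to be discarded). The function names are hypothetical.

```python
from typing import Iterator, List


def weak_orderings(elements: List[str]) -> Iterator[List[List[str]]]:
    """Yield every arrangement of elements into an ordered list of equality groups."""
    if not elements:
        yield []
        return
    first, rest = elements[0], elements[1:]
    for ordering in weak_orderings(rest):
        # `first` is equal to the members of an existing group ...
        for i in range(len(ordering)):
            yield ordering[:i] + [ordering[i] + [first]] + ordering[i + 1:]
        # ... or forms a new group at any position in the order
        for i in range(len(ordering) + 1):
            yield ordering[:i] + [[first]] + ordering[i:]


def to_predicates(ordering: List[List[str]]) -> List[str]:
    """Write one order-inducing predicate set for the given ordering."""
    preds = []
    for group in ordering:
        preds += [f"{x} = {y}" for x, y in zip(group, group[1:])]
    preds += [f"{g[0]} < {h[0]}" for g, h in zip(ordering, ordering[1:])]
    return preds


orderings = list(weak_orderings(["A", "B", "5"]))
```

For E = {A, B, 5} this yields 13 orderings, among them the ordering 5 < A < B that corresponds to the set {A < B, 5 < A} from the example.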

Page 22: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

LTO Query

A query Q(P) is an LTO query if, for every S ∈ S(Q), E(S) is totally ordered by P, where:
  S(Q): set of streams appearing in Q
  E(S) = A(S) ∪ C(Q): set of elements in Q potentially relevant to stream S

Page 23: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Decomposition of a Query into LTO Queries

Q = π′_L(σ_P(S1 × S2 × … × Sn)) can be decomposed into Q1 ∪ Q2 ∪ … ∪ Qm, where Qi is an LTO query.

Let TO(E(Si)) = {Ti^1, Ti^2, …, Ti^mi}

Ti^j is a local total ordering: one possible ordering of the attributes of Si and the query constants.

Page 24: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Decomposition Theorem

An exhaustive union of m1 × m2 × … × mn queries:

Q(P ∪ T1^1 ∪ T2^1 ∪ … ∪ Tn^1) ∪
Q(P ∪ T1^2 ∪ T2^1 ∪ … ∪ Tn^1) ∪
… ∪
Q(P ∪ T1^1 ∪ T2^2 ∪ … ∪ Tn^1) ∪
Q(P ∪ T1^2 ∪ T2^2 ∪ … ∪ Tn^1) ∪
… ∪
Q(P ∪ T1^m1 ∪ T2^m2 ∪ … ∪ Tn^mn)
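The exhaustive union is a Cartesian product over the local orderings of each stream, which a short sketch can make concrete. The predicate strings and the function name `decompose` are hypothetical placeholders chosen for the example; a real implementation would use a structured predicate type.

```python
from itertools import product
from typing import Iterable, List, Set


def decompose(p: Iterable[str], local_orderings: List[List[Set[str]]]) -> List[Set[str]]:
    """Return one predicate set per LTO query in the exhaustive union.

    local_orderings[i] lists the candidate orderings for stream i.
    """
    return [set(p).union(*choice) for choice in product(*local_orderings)]


# Hypothetical query over S(A) and T(D) with the single constant 10.
P = {"S.A = T.D", "S.A > 10"}
orderings_s = [{"S.A < 10"}, {"S.A = 10"}, {"10 < S.A"}]  # m1 = 3
orderings_t = [{"T.D < 10"}, {"T.D = 10"}, {"10 < T.D"}]  # m2 = 3

lto_queries = decompose(P, orderings_s + [] and [orderings_s, orderings_t])
```

Here `lto_queries` holds 3 × 3 = 9 predicate sets, each containing P plus one ordering per stream.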

Page 25: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Boundedness of Attributes

An attribute A is lower-bounded by a set of predicates P if there is an atomic predicate A > k ∈ P+ for some constant k

An attribute A is upper-bounded by a set of predicates P if there is an atomic predicate A < k ∈ P+ for some constant k

An attribute is bounded if it is both upper-bounded and lower-bounded

An attribute is unbounded if it is not bounded
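Boundedness can be checked by graph reachability rather than by materializing P+. The sketch below is one possible approach, not the algorithm from the paper. It assumes integer domains, so that A = k or A ≥ k already implies A > k − 1, which lets `<` and `=` edges be treated alike. Attributes are strings, constants are ints, and all names are hypothetical.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Tuple, Union

Element = Union[str, int]  # attribute name or integer constant


def _reachable(graph, start):
    """Return every element reachable from start, including start itself."""
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in graph[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def classify(predicates: Iterable[Tuple[Element, str, Element]]) -> Dict[str, Tuple[bool, bool]]:
    """Map each attribute to (lower_bounded, upper_bounded)."""
    up = defaultdict(set)    # up[x] contains y when P states x <= y
    down = defaultdict(set)  # down[y] contains x when P states x <= y
    elements = set()
    for lhs, op, rhs in predicates:
        elements.update((lhs, rhs))
        pairs = {"<": [(lhs, rhs)], ">": [(rhs, lhs)],
                 "=": [(lhs, rhs), (rhs, lhs)]}[op]
        for small, large in pairs:
            up[small].add(large)
            down[large].add(small)
    result = {}
    for attr in (e for e in elements if isinstance(e, str)):
        upper = any(isinstance(e, int) for e in _reachable(up, attr))
        lower = any(isinstance(e, int) for e in _reachable(down, attr))
        result[attr] = (lower, upper)
    return result
```

For the predicates A = 10 and B < A, this reports A as bounded and B as upper-bounded only, matching the closure example B < 10.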

Page 26: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

MaxRef and MinRef

MaxRef(Si) is the set of all unbounded attributes A of Si that participate in a join of the form Sj.B < Si.A, i ≠ j

MinRef(Si) is the set of all unbounded attributes A of Si that participate in a join of the form Si.A < Sj.B, i ≠ j

Page 27: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Boundedness Testing of LTO Query (Duplicate-Preserving)

Let Q = π′_L(σ_P(S1 × S2 × … × Sn)) be an LTO query where n > 1. Q is bounded-memory computable iff:

1. Every attribute in L is bounded

2. For every equality join predicate Si.A = Sj.B where i ≠ j, Si.A and Sj.B are both bounded

3. |MaxRef(Si)| = |MinRef(Si)| = 0 for i = 1,…,n

Page 28: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Proof Outline

We create synopses of the n data streams

For each stream Si, partition the tuples that satisfy the total order condition on Si into distinct buckets based on the values of the bounded attributes

Tuples with the same values of all bounded attributes are placed in the same bucket

Tuples that differ on at least one bounded attribute are placed in different buckets

Page 29: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Proof Outline (2)

For each bucket, store:
  The values for the bounded attributes
  Total number of tuples falling into the bucket

The total size of these synopses is bounded by a constant

It can be proved that if the 3 conditions are satisfied, these synopses suffice to evaluate the query.

Page 30: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Proof (If):

Attributes not in project list / join condition can be ignored, because all local selection conditions must be implied by the total order

C1 and C2 guarantee that all attributes in equijoin and project list are bounded

Each synopsis maintains full information about the values of all bounded attributes for every tuple

Equijoin and projections can be handled properly

Page 31: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Proof (If):

C3 asserts: for every Si.A < Sj.B, i ≠ j, one of the following is true:

1. Both attributes are bounded --- Synopses have full information for A & B

2. The total order implies Si.A < c < Sj.B where c is some constant in Q --- No need to store actual attribute values, since all tuples from Si that satisfy the total order on Si join with all tuples from Sj that satisfy the total order on Sj

No relevant information is lost by consolidating tuples into buckets

Page 32: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Proof (Only if):

If one of the conditions C1, C2 and C3 does not hold, then Q cannot always be evaluated in bounded memory.

For each condition, if the condition is violated, then for any constant M one can construct instances and presentations of the input streams that require more than M units of memory to correctly evaluate Q.

Page 33: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Boundedness Testing of LTO Query (Duplicate-Eliminating)

Let Q = π_L(σ_P(S1 × S2 × … × Sn)) be an LTO query where n > 1. Q is bounded-memory computable iff:

1. Every attribute in L is bounded

2. For every equality join predicate Si.A = Sj.B where i ≠ j, Si.A and Sj.B are both bounded

3. |MaxRef(Si)|eq + |MinRef(Si)|eq ≤ 1 for i = 1,…,n

|E|eq denotes the number of P-induced equivalence classes in the element set E.

Page 34: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Comments

Q is computable in bounded memory if and only if all its decomposed LTO queries are computable in bounded memory.

The checking algorithm needs O(|A(Q)|^4) time.

If a query is not bounded-memory computable, approximation algorithms are necessary.

For a bounded-memory computable query, computation is costly in the evaluation of each LTO query: the query must be evaluated on every tuple arrival.

Page 35: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Comments

Only the space requirement, not query speed, is considered.

Can we extend the memory-boundedness checking to general (including aggregate) queries?

It is not always possible to provide exact answers to queries, so approximate query algorithms have to be developed.

For these approximate queries, we need to consider memory boundedness, execution speed and approximation quality.

Page 36: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Outline of this Talk

Memory Requirements of Queries

Approximation Queries

Other Research Issues

Page 37: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Approximate Query Evaluation

Why?
  Handling load: streams coming too fast
  Data stream is archived in an off-site data warehouse, so access to archived data is expensive
  Avoid unbounded storage and computation
  Ad hoc queries need approximate history
  Try to look at the data items only once and in a fixed order

Page 38: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Approximate Query Evaluation

How? Sliding windows, synopses, samples

Major issues?
  Metric for set-valued queries
  Composition of approximate operators
  How is it understood/controlled by user?
  Integrate into query language
  Query planning and interaction with resource allocation
  Accuracy-efficiency-storage tradeoff and global metric

Page 39: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Synopses

Queries may access or aggregate past data

Need bounded-memory history approximation

Synopsis?
  Succinct summary of old stream tuples
  Like indexes/materialized views, but base data is unavailable

Examples:
  Sliding windows
  Samples
  Sketches
  Histograms
  Wavelet representation

Page 40: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Sketching Techniques

Self-Join Size Estimation
  Stream of values from D = {1,2,…,n}
  Let fi = frequency of value i
  Consider S = Σ fi^2, or Gini's index of homogeneity.

Useful in parallel DB applications, error estimation in query result size estimation and access plan costs.

Equivalent query: COUNT(R ⋈_D R)

Page 41: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Evaluating S = Σ fi^2

To update S, keep a counter fi for each value i in the domain D, which takes space linear in n

Has to be kept for each self-join

Question: can S be estimated in sub-linear space (O(log n))?
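As a baseline for comparison, the exact computation with one counter per value can be sketched as follows. The function name is hypothetical.

```python
from collections import Counter
from typing import Iterable


def self_join_size(stream: Iterable[int]) -> int:
    """Exact S = sum of squared frequencies, using one counter per distinct value."""
    freq = Counter(stream)  # memory grows with the number of distinct values
    return sum(f * f for f in freq.values())


# Five values with frequencies 3, 6, 2, 5 and 7.
example = [1] * 3 + [2] * 6 + [3] * 2 + [4] * 5 + [5] * 7
```

For `example`, the result is 9 + 36 + 4 + 25 + 49 = 123.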

Page 42: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Self-Join Size Estimation

AMS Technique (randomized sketches)

Given (f1, f2, …, fN)

Zi = random{-1, 1}

X = Σ fi Zi (X incrementally computable)

Theorem: Exp[X^2] = Σ fi^2
  Cross-terms fi Zi fj Zj have 0 expectation
  Square-terms fi Zi fi Zi = fi^2

Space = log(N + Σ fi)

Independent samples Xk reduce variance
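The expectation claim can be checked exhaustively for a small frequency vector by averaging X^2 over every possible sign assignment. This is a verification sketch with a hypothetical function name, not part of the streaming algorithm, and it assumes the signs are uniform and independent.

```python
from fractions import Fraction
from itertools import product
from typing import Sequence


def expected_x_squared(freqs: Sequence[int]) -> Fraction:
    """Average X^2 over all 2^n sign vectors, where X = sum of f_i * Z_i."""
    n = len(freqs)
    total = 0
    for signs in product((-1, 1), repeat=n):
        x = sum(f * z for f, z in zip(freqs, signs))
        total += x * x
    return Fraction(total, 2 ** n)
```

For frequencies (3, 6, 2, 5, 7) the average is exactly 123, the sum of the squared frequencies: every cross-term appears equally often with each sign and cancels.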

Page 43: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Estimation Quality

How can independent samples Xk improve the quality of estimation?

Keep s1 × s2 samples for Xk

s1 reduces variance, s2 boosts confidence

[Diagram: a grid of atomic sketches, s1 per row and s2 rows; each row i is averaged as Avg(Xij^2), shown for i = 1,…,5, and the sketch estimate is the median of the row averages]
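The averaging and median steps can be sketched as below. For readability the sign tables are stored explicitly, which costs space linear in the domain size and so defeats the purpose in practice; that storage, the seeded generator and all names are assumptions of this illustration. A space-efficient version would derive the signs from a compact hash family instead.

```python
import random
from statistics import mean, median
from typing import Iterable, Sequence


def ams_estimate(stream: Iterable[int], domain: Sequence[int],
                 s1: int, s2: int, seed: int = 0) -> float:
    """Median over s2 groups of the mean of s1 squared atomic sketches."""
    rng = random.Random(seed)
    index = {value: i for i, value in enumerate(domain)}
    # One explicit sign table per atomic sketch (illustration only).
    signs = [[rng.choice((-1, 1)) for _ in domain] for _ in range(s1 * s2)]
    x = [0] * (s1 * s2)
    for value in stream:
        i = index[value]
        for k in range(s1 * s2):
            x[k] += signs[k][i]  # incremental update of X_k
    groups = [x[g * s1:(g + 1) * s1] for g in range(s2)]
    return median(mean(v * v for v in group) for group in groups)
```

Averaging within a group narrows the spread of each group estimate, and taking the median across groups makes a single badly off group unlikely to affect the final answer.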

Page 44: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Sample Run of AMS

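Initial frequencies:
  V  = ( 3,  6,  2,  5,  7)
  Z1 = ( 1,  1, -1,  1, -1)
  Z2 = (-1,  1, -1,  1,  1)
  X1 = 5, X1^2 = 25; X2 = 13, X2^2 = 169
  Est = (25 + 169) / 2 = 97; Σ vi^2 = 123

After one more occurrence of the first value:
  V  = ( 4,  6,  2,  5,  7)
  Z1 = ( 1,  1, -1,  1, -1)
  Z2 = (-1,  1, -1,  1,  1)
  X1 = 6, X1^2 = 36; X2 = 12, X2^2 = 144
  Est = (36 + 144) / 2 = 90; Σ vi^2 = 130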
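The second state of the sample run can be reproduced with two incrementally updated atomic sketches. The assignment of the two sign vectors to the sketches is an assumption read off the run (it is the assignment under which the sketches reach 6 and 12), and the class and function names are hypothetical.

```python
from typing import Sequence


class AtomicSketch:
    """Maintains X = sum of f_i * Z_i under single-item updates."""

    def __init__(self, signs: Sequence[int]) -> None:
        self.signs = list(signs)
        self.x = 0

    def update(self, i: int, delta: int = 1) -> None:
        """The value at index i arrived delta times: X changes by delta * Z_i."""
        self.x += delta * self.signs[i]


def estimate(sketches: Sequence[AtomicSketch]) -> float:
    """Average of the squared atomic sketches."""
    return sum(s.x ** 2 for s in sketches) / len(sketches)


sketches = [AtomicSketch((1, 1, -1, 1, -1)), AtomicSketch((-1, 1, -1, 1, 1))]

# Load the initial frequencies V = (3, 6, 2, 5, 7).
for i, f in enumerate((3, 6, 2, 5, 7)):
    for s in sketches:
        s.update(i, f)

# One more occurrence of the first value arrives, giving V = (4, 6, 2, 5, 7).
for s in sketches:
    s.update(0)
```

After the last update the sketches hold 6 and 12, so `estimate(sketches)` gives (36 + 144) / 2 = 90 against a true value of 130. Only the two running sums are touched when an item arrives.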

Page 45: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Comments on AMS

The self-join size can be computed on-line

Sufficiently small variance (controlled by s1 and s2)

Can this method be extended to answer other queries?

Page 46: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Complex Aggregate Queries

A. Dobra et al. extend the idea of AMS to provide approximate answers to complex aggregate queries.

SELECT AGG FROM R1, R2, …, Rr WHERE E

AGG: COUNT/SUM/AVERAGE

E: conjunction of (Ri.Aj = Rk.Al)

It is proved that the error of these estimates is at most ε with probability 1-δ.

Page 47: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Basic Notions of Approximation

For aggregate queries (e.g., SUM, COUNT), approximation quality can be measured by relative error: (Estimated value – Actual value) / Actual value

Open question: for queries involving more than simple aggregation, how should we define approximation?

Consider S ⋈_B T (S: {A,B}, T: {B,C})

Actual Result

A    B     C
10   20.5  Doctor
8    10.3  Lawyer
3    10.2  Teacher

Approximate Result

A    B     C
8    10.3  Lawyer
3    10.2  Teacher

Page 48: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Basic Notions of Approximation (2)

Can we accept this kind of approximation?

Actual Result

A    B     C
10   20.5  Doctor
8    10.3  Lawyer
3    10.2  Teacher

Approximate Result

A    B     C
11   21.6  Doctor
8    10.3  Student
3    10.2  Teacher

Page 49: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Basic Notions of Approximation (3)

Can we provide useful (semantically correct) but stale results?

Actual Result (at time t)

A    B     C
10   20.5  Doctor
8    10.3  Lawyer
3    10.2  Teacher

Approximate Result (correct result at an earlier time, t minus some delay)

A    B     C
10   20.5  Doctor
8    10.3  Lawyer

Page 50: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Outline of this Talk

An Overview of Streams

Data and Query Model

Approximation Queries

Other Research Issues

Page 51: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Data Mining

High-Speed Stream Data Mining
  Association Rules
  Stream Clustering
  Decision Trees

Single-pass algorithms for inferring interesting patterns on-line (as the data stream arrives)

Useful for mission-critical tasks like telecom fraud detection

Page 52: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Conclusion: Future Work

Query Processing
  Stream Algebra and Query Languages
  Approximations
  Blocking Operators, Constraints, Punctuations

Runtime Management
  Scheduling, Memory Management, Rate Management
  Query Optimization (Adaptive, Multi-Query, Ad-hoc)
  Distributed processing

Synopses and Algorithmic Problems

Systems
  UI, statistics, crash recovery and transaction management
  System development and deployment

Page 53: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

References

B. Babcock, S. Babu, M. Datar, R. Motwani, J. Widom. Models and Issues in Data Stream Systems, PODS ’02. (Paper and Talk Slides)

A. Arasu, B. Babcock, S. Babu, J. McAlister, J. Widom. Characterizing Memory Requirements for Queries over Continuous Data Streams, PODS ’02.

A. Dobra, M. Garofalakis, J. Gehrke, R. Rastogi. Processing Complex Aggregate Queries over Data Streams, SIGMOD ’02.

N. Alon, Y. Matias, M. Szegedy. The Space Complexity of Approximating the Frequency Moments, STOC ’96.

Page 54: Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002

Thank You!