
Data Stream Management Systems

CS240B Notes by Carlo Zaniolo

Data Streams

Continuous, unbounded, rapid, time-varying streams of data elements

Occur in a variety of modern applications:
• Network monitoring and traffic engineering
• Sensor networks, RFID tags
• Telecom call records
• Financial applications
• Web logs and click-streams
• Manufacturing processes

DSMS = Data Stream Management System

Many Research Projects …

• Amazon/Cougar (Cornell): sensors
• Aurora (Brown/MIT): sensor monitoring, dataflow
• Hancock (AT&T): telecom streams
• Niagara (OGI/Wisconsin): Internet DBs & XML
• OpenCQ (Georgia): triggers, view maintenance
• Stream (Stanford): general-purpose DSMS
• Tapestry (Xerox): publish/subscribe filtering
• Telegraph (Berkeley): adaptive engine for sensors
• Tribeca (Bellcore): network monitoring
• Stream Mill (UCLA): power & extensibility
• Gigascope (AT&T Labs): network monitoring

The (Simplified) Big Picture

[Figure: architecture diagram. Labels: input streams; DSMS (server) with scratch store, archive and stored relations; clients, which register queries and receive streamed results.]

Databases vs Data Streams

                   Database Systems              Data Stream Systems
Model              persistent data               transient data
Data               table: set or bag of tuples   infinite sequence of tuples
Updates            all                           append only
Query              transient                     persistent
Query answer       exact                         often approximate
Query evaluation   multi-pass                    one-pass
Operators          blocking OK                   nonblocking only
Query plan         fixed                         adaptive

Research Challenges

• Data models: relational streams first; XML streams are important too
  • Tuple time-stamping
  • Order is important
  • Windows or other synopses
• Query languages: SQL or XQuery + extensions
  • Blocking operators and expressive power
• Query plans: optimized scheduling for response time or memory
• Quality of Service (QoS) & approximation: load shedding, sampling
• Support for advanced applications: data stream mining

Data Models

Relational data streams
• Each data stream consists of relational tuples
• The stream can be modelled as an append-only relation
• But repetitions are allowed and order is very important!
• Order is based on timestamps or on arrival order

Streaming XML data
• A stream of structured SAX elements

Timestamps

• Data streams are (basically) ordered according to their timestamps
• The meaning of windows, unions and joins is based on timestamps

External timestamps
• Injected by the data source
• Model the real-world event represented by the tuple
• Tuples may be out of order, but if they are nearly ordered they can be reordered with small buffers

Internal timestamps
• Introduced as a special field by the DSMS
• Approximate: based on the time the tuples arrived

Missing timestamps (called latent in Stream Mill)
• The system assigns no timestamp to arriving tuples
• But tuples are still processed as ordered sequences by operators whose semantics expects timestamps
• Thus operators might instantiate timestamps as/when needed
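The remark about reordering nearly ordered tuples with a small buffer can be illustrated with a short Python sketch. The function name `reorder` and the parameter `slack` (the maximum number of buffered tuples) are hypothetical names chosen for illustration; this is one simple way to do it, not the method of any particular DSMS.

```python
import heapq


def reorder(tuples, slack):
    """Yield (timestamp, payload) pairs in timestamp order.

    Assumes a tuple is never displaced by more than `slack` positions;
    at most `slack` tuples are held in the buffer (a min-heap) at a time.
    """
    heap = []
    for seq, (ts, payload) in enumerate(tuples):
        # seq breaks ties so payloads are never compared
        heapq.heappush(heap, (ts, seq, payload))
        if len(heap) > slack:
            ts_out, _, payload_out = heapq.heappop(heap)
            yield ts_out, payload_out
    while heap:  # flush the buffer at end of input (finite test data only)
        ts_out, _, payload_out = heapq.heappop(heap)
        yield ts_out, payload_out


arrivals = [(1, "a"), (3, "c"), (2, "b"), (4, "d"), (6, "f"), (5, "e")]
print(list(reorder(arrivals, slack=2)))
```

With `slack=0` the tuples come out in arrival order, which corresponds to trusting the arrival order.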

Data Stream Query Languages

Continuous Queries and Blocking Operators

Query Operators: Sample Stream

Traffic (sourceIP,    % source IP address
         sourcePort,  % port number on source
         destIP,      % destination IP address
         destPort,    % port number on destination
         length,      % length in bytes
         time         % time stamp
        );

Blocking Query Operators

• A blocking operator produces no output until the entire input has been seen, i.e., until the last tuple in the input, which is often detected only when we hit the EOF
• Streams: the input never ends, thus blocking operators cannot be used as such
• Traditional SQL aggregates are blocking
• Many SQL operators have DBMS implementations that are blocking but are not intrinsically blocking: group by, sort join can be implemented in blocking and nonblocking ways
• Other operators are intrinsically blocking
• Can we formally characterize which is which? We will see that nonblocking operators are the monotonic ones

Problematic Operators for Data Streams

• Blocking query operators are those that must see everything in the input before they can return anything in the output
• Nonblocking query operators are those that can return results now, without seeing the rest of the stream
• Selection and projection are nonblocking
• Set difference and traditional aggregates are blocking
• Continuous aggregates are not
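The difference between a traditional (blocking) aggregate and a continuous one can be sketched in Python. The function names are hypothetical; the sketch only assumes that the stream is an iterable of numbers such as packet lengths.

```python
from itertools import count, islice


def blocking_avg(lengths):
    """Traditional aggregate: returns one value, only after the input ends."""
    total = n = 0
    for x in lengths:
        total += x
        n += 1
    return total / n


def continuous_avg(lengths):
    """Continuous aggregate: emits the running average after every tuple."""
    total = n = 0
    for x in lengths:
        total += x
        n += 1
        yield total / n


print(blocking_avg([10, 20, 30]))           # needs a finite input
print(list(continuous_avg([10, 20, 30])))   # one result per input tuple
# The continuous version also works on an unbounded stream:
print(list(islice(continuous_avg(count(1)), 3)))
```

Calling `blocking_avg(count(1))` would never return, which is exactly why blocking operators cannot be used as such on streams.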

Aggregate Invocation: Two Forms

Traditional:

select G, F1 from S where P group by G having F2 op J

G: grouping attributes; F1, F2: aggregate expressions

With windows (SQL:2003 OLAP functions), on the stream traffic(sourceIP, sourcePort, destIP, destPort, length, Time):

select sourceIP, Time, avg(length) over (partition by sourceIP order by Time rows 50 preceding)
from traffic

Cumulative (running) window:

... over (partition by sourceIP order by Time rows unbounded preceding)
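As a rough illustration of what the windowed invocation computes, here is a Python sketch of a per-sourceIP sliding average. It assumes the usual SQL reading in which "50 rows preceding" covers the current row plus the 50 rows before it; `windowed_avg` and its tuple layout `(sourceIP, Time, length)` are hypothetical choices for this example.

```python
from collections import defaultdict, deque


def windowed_avg(traffic, rows_preceding=50):
    """For each (sourceIP, Time, length) tuple, emit the average length over
    the current row and up to `rows_preceding` earlier rows of the same
    sourceIP (the partition)."""
    # One bounded buffer per partition; old rows fall out automatically.
    windows = defaultdict(lambda: deque(maxlen=rows_preceding + 1))
    for source_ip, time, length in traffic:
        window = windows[source_ip]
        window.append(length)
        yield source_ip, time, sum(window) / len(window)


rows = [("a", 1, 10), ("b", 2, 100), ("a", 3, 20), ("a", 4, 40)]
for out in windowed_avg(rows, rows_preceding=1):
    print(out)
```

Recomputing `sum(window)` on every tuple keeps the sketch short; an incremental version would add the new value and subtract the expired one.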

Aggregate Function Properties

1. Distributive: sum, count, min, max
2. Algebraic: AVG
3. Holistic: count-distinct, median
4. On-line aggregates, such as exponentially decaying AVG
5. User-Defined Aggregates (UDAs)

• Sliding window invocation on 1 and 2: efficient computation for memory and CPU
• Sliding window invocation on 3?
• Continuous window on these? Yes, also for 5. UDAs can be similar to any of those
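One common form of an exponentially decaying average is the exponentially weighted moving average; the sketch below assumes that form, which may differ from the exact definition intended in these notes. The name `decaying_avg` and the parameter `alpha` (weight of the newest value) are hypothetical.

```python
def decaying_avg(values, alpha):
    """Emit an exponentially weighted moving average after each value.

    Assumed update rule: avg = alpha * x + (1 - alpha) * avg, with the first
    value used as the initial average. Only one number of state is kept,
    so the aggregate is nonblocking and uses constant memory.
    """
    avg = None
    for x in values:
        avg = x if avg is None else alpha * x + (1 - alpha) * avg
        yield avg


print(list(decaying_avg([10, 20, 20], alpha=0.5)))
```

Older tuples are never stored; their influence simply shrinks by a factor of `1 - alpha` with every new arrival.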

Avoiding Blocking Behavior

• Windows: aggregates on a limited-size window are approximate and nonblocking
• DSMSs support windows of all kinds:
  • Sliding windows (same as OLAP functions)
  • Tumbles: restart with every new window (traditional definition)
  • Panes: the window is broken up into panes
• Punctuation [Tucker, Maier, Sheard, Fegaras]
  • Assertion about future stream contents
  • Unblocks operators, reduces state
• Constructs used for avoiding blocking are also useful for avoiding infinite memory
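The pane idea can be sketched as follows: each pane is summarized once as a (sum, count) pair, and the window result is assembled from the summaries of the most recent panes. The count-based panes and the names `paned_avg`, `pane_size` and `panes_per_window` are assumptions made for this illustration.

```python
from collections import deque


def paned_avg(values, pane_size, panes_per_window):
    """Emit the window average each time a pane of `pane_size` tuples closes.

    The window consists of the last `panes_per_window` panes. Only one
    (sum, count) summary per pane is stored, not the tuples themselves.
    With panes_per_window=1 this behaves like a tumbling window.
    """
    panes = deque(maxlen=panes_per_window)
    pane_sum = pane_count = 0
    for v in values:
        pane_sum += v
        pane_count += 1
        if pane_count == pane_size:          # the current pane is complete
            panes.append((pane_sum, pane_count))
            pane_sum = pane_count = 0
            total = sum(s for s, _ in panes)
            n = sum(c for _, c in panes)
            yield total / n


print(list(paned_avg(range(1, 9), pane_size=2, panes_per_window=2)))
```

This works because AVG is algebraic: it can be rebuilt from the distributive partial results sum and count.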

Joins

• General case is problematic on streams: may need to join arbitrarily far-apart stream tuples
• Equijoin on timestamps is easy to compute, but not very useful
• The majority of work focuses on joins between one stream and a window specified on the other:

select A.sourceIP, B.sourceIP
from Traffic1 as A [window TA], Traffic2 as B
where A.destIP = B.destIP

• The symmetric case is also common: … Traffic2 as B [window TB] …
• Multi-joins are less common but possible

Join of Stream S with a Table T (where T is a DB relation or a Window on a Stream)

• When a new tuple z with timestamp ts(z) arrives in S, join it with all the tuples in T; ts(z) is the timestamp of the tuples so produced
• If T is a window on a stream S’, T must contain all the tuples up to and including ts(z): a cumulative window on S’
• But we do not have infinite memory, so we must approximate T with a synopsis, e.g., the 30 minutes preceding
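A minimal Python sketch of this join, assuming the two streams arrive merged in timestamp order and that T is approximated by a time-based window over S’. The event layout `(ts, stream, key, value)` and the name `stream_window_join` are hypothetical.

```python
from collections import deque


def stream_window_join(events, window):
    """Join each S tuple with the T tuples of the preceding `window` time units.

    `events` is a timestamp-ordered sequence of (ts, stream, key, value),
    where stream is "S" or "T" (T stands for the windowed stream S').
    Output tuples carry ts(z), the timestamp of the arriving S tuple.
    """
    t_window = deque()  # synopsis of S': only recent tuples are kept
    for ts, stream, key, value in events:
        # Expire tuples that fell out of the window.
        while t_window and t_window[0][0] < ts - window:
            t_window.popleft()
        if stream == "T":
            t_window.append((ts, key, value))
        else:
            for _, t_key, t_value in t_window:
                if t_key == key:
                    yield ts, key, value, t_value


events = [(0, "T", "x", "t0"), (10, "T", "y", "t1"), (20, "S", "x", "s0"),
          (45, "S", "x", "s1"), (50, "T", "x", "t2"), (55, "S", "x", "s2")]
print(list(stream_window_join(events, window=30)))
```

The tuple `s1` at time 45 finds no partner because `t0` has already expired: the window trades completeness of the answer for bounded memory.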

Multi-way Sliding Window Joins

Evaluation of n-way sliding window join queries
• n streams with associated sliding windows
• Continuously evaluate the joins of all n windows

Two natural join strategies
• Eager: the join is evaluated each time a new tuple arrives in any of the input streams
• Lazy: the join is evaluated with some pre-specified frequency, e.g., every t time units

Computation is incremental, as in the differential fixpoint of recursive rules.
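The eager strategy can be sketched for the two-stream case with count-based windows: every arriving tuple immediately probes the other stream's window and is then added to its own, so only the new results are computed (the incremental part). The names and the `(stream, key, value)` layout are assumptions of this example, and a real n-way implementation would be more involved.

```python
from collections import deque


def eager_window_join(arrivals, window_rows):
    """Eager symmetric join of streams "A" and "B" on equal keys.

    `arrivals` is a sequence of (stream, key, value). Each stream keeps a
    sliding window of its last `window_rows` tuples. Results are emitted as
    (a_value, b_value) as soon as the second partner arrives.
    """
    windows = {"A": deque(maxlen=window_rows), "B": deque(maxlen=window_rows)}
    for stream, key, value in arrivals:
        other = "B" if stream == "A" else "A"
        # Probe only the other window: results among old tuples already exist.
        for o_key, o_value in windows[other]:
            if o_key == key:
                yield (value, o_value) if stream == "A" else (o_value, value)
        windows[stream].append((key, value))


arrivals = [("A", 1, "a1"), ("B", 1, "b1"), ("B", 2, "b2"),
            ("A", 2, "a2"), ("B", 1, "b3")]
print(list(eager_window_join(arrivals, window_rows=2)))
```

A lazy variant would buffer arrivals and run the same probing step in batches, for example every t time units.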

Query Optimization and Scheduling

• Scheduling to minimize response time or to minimize memory; no real change in CPU time
• Optimization based on sharing, query plans, operators, buffers, …

A Query Plan

[Figure: query plan with inputs Stream1, Stream2 and Stream3, two join (⋈) operators, and outputs Q1 and Q2.]

Scheduler
• Given: query plan and selectivity estimates
• Schedule: tuples through operator chains

Schedulers and QoS Metrics

• Round Robin (RR) is perhaps the most basic: operators in a circular queue are given a fixed time slice. Starvation is avoided, but there is little adaptivity
• FIFO: takes the first tuple in input and moves it through the chain. Minimal latency, poor memory use
• Greedy algorithms:
  • Buffers with most tuples first
  • Tuples that waited longest first
  • Operators that release more memory first

Memory Optimization on a Chain [Babcock, Babu, Datar, Motwani]

[Figure: progress chart for a chain of operators σ1, σ2, σ3, plotting net selectivity against time from input to output. Annotations: selectivity = 0.0, selectivity = 0.6, selectivity = 0.2, best slope, starvation point.]

Main ideas

• Operators are thought of as filters which
  • Operate on a set of tuples
  • Produce s tuples in return for each input tuple
• s is the selectivity of an operator
• If s = 0.2 we can interpret the value in two ways:
  • Out of every 10 tuples, the operator outputs 2 tuples
  • If the input requires 1 unit of memory, the output will require 0.2 units of memory

The Lower Envelope

• Imagine there is a line from the current point to every operator point (ti, si) to its right
• The operator that corresponds to the line with the steepest slope is called the “steepest descent operator point”

The Lower Envelope

• By starting at the first point (t0, s0) and repeatedly calculating the steepest descent operator point, we find the lower envelope P’ for a progress chart P
• Notice that the slopes of the segments are non-increasing in magnitude
• The operators in each segment form a chain
• FIFO within a chain; Greedy across chains
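The construction of the lower envelope can be sketched in Python. The progress chart is given as a list of operator points (ti, si), where ti is the cumulative processing time and si the fraction of the tuple's size that remains; the example values and the tie-breaking rule (prefer the point furthest to the right) are assumptions made for this illustration.

```python
def lower_envelope(points):
    """Return the indices of the progress-chart points on the lower envelope.

    `points` is [(t0, s0), (t1, s1), ...] with strictly increasing t.
    From the current point, the steepest descent operator point is the point
    to the right reached by the line with the most negative slope.
    """
    envelope = [0]
    i = 0
    while i < len(points) - 1:
        t_i, s_i = points[i]
        best = min(
            range(i + 1, len(points)),
            # smallest (most negative) slope; on ties, the furthest point
            key=lambda j: ((points[j][1] - s_i) / (points[j][0] - t_i), -j),
        )
        envelope.append(best)
        i = best
    return envelope


# Hypothetical chart: three operators, each taking one time unit.
chart = [(0, 1.0), (1, 0.9), (2, 0.3), (3, 0.0)]
print(lower_envelope(chart))
```

Each pair of consecutive indices in the result delimits one chain: for the chart above, operators 1 and 2 form the first chain and operator 3 the second.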

Scheduling

• Chain minimizes memory, as may be required in special overload situations
• But it increases response time (latency)
• Typically, though, we want to optimize for response time
• Different scheduling protocols optimize different objectives: latency, inaccuracy, memory use, computation, starvation, …
  • Computational complexity is independent of the scheduler
  • Different policies give significantly different results only for bursty loads
• Research issues:
  • Complex query plans (beyond simple paths)
  • Minimization of response time
  • Adaptive strategies: how do we switch between the two to adapt to load changes?

Optimization by Sharing

In traditional multi-query optimization:
• Sharing (of expressions, results, etc.) among queries can lead to improved performance

Similar issues arise when processing queries on streams. Examples:
• Sharing of query operators and expressions
• Sharing of sliding windows

Multi-query Processing on Streams

Opportunities for optimization when windows are shared, e.g.:

select sum(A.length)
from Traffic1 A [window 1 hour], Traffic2 B [window 1 hour]
where A.destIP = B.destIP

select count(distinct A.sourceIP)
from Traffic1 A [window 1 min], Traffic2 B [window 1 min]
where A.destIP = B.destIP

Strategies for scheduling the evaluation of shared joins:
• Largest window only
• Smallest window first
• Process at any instant the tuple that is likely to benefit the largest number of joins (maximize throughput)

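Shared Predicates [Niagara, Telegraph]

Predicates for R.A: R.A > 1, R.A > 7, R.A > 11, R.A < 3, R.A < 5, R.A = 6, R.A = 8, R.A ≠ 9

[Figure: the predicates are indexed by comparison operator (>, <, =, ≠), with constants 1, 7, 11 under >, constants 3, 5 under <, constants 6, 8 under =, and constant 9 under ≠. Example input tuple: A = 8.]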
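The idea of evaluating many predicates on the same attribute together can be sketched with sorted lists and binary search. This is only an illustration of the principle using the constants listed above; the class `PredicateIndex` is hypothetical and is not the data structure used by Niagara or Telegraph.

```python
from bisect import bisect_left, bisect_right


class PredicateIndex:
    """Index of predicates 'A op c' on one attribute, grouped by operator."""

    def __init__(self, greater, less, equal, not_equal):
        self.greater = sorted(greater)    # constants c of predicates A > c
        self.less = sorted(less)          # constants c of predicates A < c
        self.equal = set(equal)           # constants c of predicates A = c
        self.not_equal = sorted(not_equal)

    def matches(self, a):
        """Return the predicates satisfied by the attribute value `a`."""
        # A > c holds for all constants strictly below a.
        out = [f"A>{c}" for c in self.greater[:bisect_left(self.greater, a)]]
        # A < c holds for all constants strictly above a.
        out += [f"A<{c}" for c in self.less[bisect_right(self.less, a):]]
        if a in self.equal:
            out.append(f"A={a}")
        out += [f"A!={c}" for c in self.not_equal if c != a]
        return out


index = PredicateIndex(greater=[1, 7, 11], less=[3, 5],
                       equal=[6, 8], not_equal=[9])
print(index.matches(8))
```

One binary search per operator group replaces a separate test for every registered predicate, which matters when very many queries share the attribute.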

QoS and Load Shedding

• When the input stream rate exceeds system capacity, a stream manager can shed load (tuples)
• Load shedding affects queries and their answers
• Introducing load shedding in a data stream manager is a challenging problem
• Random and semantic load shedding

DSMS Quality of Service (QoS)

Approximation and Load Shedding

QoS via Synopses and Approximation

• Synopsis: bounded-memory history approximation
  • Succinct summary of old stream tuples
  • Like indexes/materialized views, but the base data is unavailable
• Examples:
  • Sliding windows
  • Samples
  • Sketching techniques
  • Histograms
  • Wavelet representation
• Approximate algorithms: e.g., median, quantiles, …
• Fast and light data mining algorithms
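As one concrete example of a sample synopsis, the sketch below uses reservoir sampling, a standard technique that keeps a uniform random sample of fixed size k over a stream of unknown length. The notes do not name a specific sampling method, so this choice and the function name are assumptions.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of at most k items from `stream`.

    Memory is bounded by k regardless of how many tuples have been seen.
    `rng` is an optional random.Random instance, useful for reproducibility.
    """
    rng = rng or random.Random()
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            sample.append(item)        # fill the reservoir first
        else:
            j = rng.randrange(n)       # uniform in 0 .. n-1
            if j < k:                  # happens with probability k / n
                sample[j] = item
    return sample


print(reservoir_sample(range(1000), k=10, rng=random.Random(0)))
```

Approximate answers, for instance an estimated median, can then be computed from the sample instead of the unavailable base data.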

QoS and Load Shedding

• When the input stream rate exceeds system capacity, a stream manager can shed load (tuples)
• Load shedding affects queries and their answers: drop the tasks and the tuples that will cause the least loss
• Introducing load shedding in a data stream manager is a challenging problem
• Random load shedding or semantic load shedding
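Random load shedding and its effect on an answer can be sketched for a SUM query: each tuple is kept with a fixed probability and the partial sum is scaled up to compensate. The scaling step is a common estimator and an assumption here, as are the names `shed_and_sum` and `keep_probability`.

```python
import random


def shed_and_sum(lengths, keep_probability, rng=None):
    """Estimate sum(lengths) while processing only a random subset of tuples.

    Each tuple is dropped with probability 1 - keep_probability. Dividing
    the sum of the kept tuples by keep_probability gives an unbiased estimate.
    """
    rng = rng or random.Random()
    kept_sum = 0
    for x in lengths:
        if rng.random() < keep_probability:   # keep this tuple
            kept_sum += x
    return kept_sum / keep_probability


stream = [100] * 10000
print(shed_and_sum(stream, keep_probability=0.5, rng=random.Random(0)))
print(sum(stream))  # exact answer, for comparison
```

Semantic load shedding would instead choose which tuples to drop based on their values and on how much each one contributes to the answers.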

XML Data Streams

XML Data Streams: Applications

• An XML data stream is a sequence of tokens

• Data and application integration

• Distributed monitoring of computing systems

• Message-based web services

• Purchase orders, retail transactions

• Personalized content delivery

XML Streams: Data Model

XML data: tree structure

<Purchase_Doc>
  <PR_Number val="50"/>
  <Supp_Name>ABC</Supp_Name>
  <Address>
    <City>Florham Park</City>
    <State>New Jersey</State>
  </Address>
  <Line_Items>
    <Item>
      <Part_Number val="1050"/>
      <Quantity val="20"/>
    </Item>

Data stream: ~ SAX events

[element Purchase_Doc anyType]
[element PR_Number anyType]
[attribute val anySimpleType] [chardata 50] [end-attribute] [end-element]
[element Supp_Name anyType] [text ABC] [end-element] …
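To see how a document turns into a stream of events, the following sketch uses Python's standard `xml.sax` module. The event tuples it produces are a simplified stand-in for the typed events listed above (no type annotations, no separate end-attribute event); `EventCollector` and `sax_events` are hypothetical names.

```python
import xml.sax


class EventCollector(xml.sax.ContentHandler):
    """Record SAX callbacks as a flat list of event tuples."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("element", name))
        for attr_name, attr_value in attrs.items():
            self.events.append(("attribute", attr_name, attr_value))

    def characters(self, content):
        if content.strip():               # skip whitespace-only text
            self.events.append(("text", content))

    def endElement(self, name):
        self.events.append(("end-element", name))


def sax_events(xml_text):
    """Parse a complete XML document and return its event sequence."""
    handler = EventCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.events


doc = ('<Purchase_Doc><PR_Number val="50"/>'
       '<Supp_Name>ABC</Supp_Name></Purchase_Doc>')
for event in sax_events(doc):
    print(event)
```

A streaming query processor consumes such events one at a time, without ever materializing the whole tree.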

XML Query Languages

XML query languages: XQuery, XSLT, XPath
• Declarative matching of structured data and text
• Easy restructuring to meet the needs of data consumers

XML Streams: Research Issues

• Efficient processing of single/multiple queries (e.g., XFilter/YFilter)
• Blocking operators/constructs in XQuery, e.g., XQuery's new function definition mechanisms are blocking
• Integration of relational and XML DSMSs, just as relational and XML DBMSs are now being integrated

Prototype Systems

• Aurora (Brandeis, Brown, MIT) [CCC+02]
• Gigascope (AT&T) [CJSS03]
• Hancock (AT&T) [CFP+00]
• STREAM (Stanford) [MWA+03]
• Telegraph (Berkeley) [CCD+03]
• …
• Stream Mill [UCLA]

Aurora (Brandeis, Brown, MIT)

• Geared towards monitoring applications (streams, triggers, imprecise data, real-time requirements)
• Specified set of operators, connected in a data flow graph
• Optimization of the data flow graph
• Three query modes (continuous, ad-hoc, view)
• Aurora accepts QoS specifications and attempts to optimize QoS for the outputs produced
• Real-time scheduling, introspection and load shedding

AT&T: Hancock and Gigascope

Hancock: a C-based domain-specific language which facilitates signature extraction from transactional data streams
• Signature: characterizes the behavior of customers or services
• Support for efficient and tunable representation of signature collections
• Support for custom scalable persistent data structures
• Elaborate statistics collection from streams

Gigascope: SQL-based DSMS for monitoring of network data

STREAM [Stanford University]

• General-purpose stream data manager
• CQL (continuous query language) for declarative query specification
• Consider query plan generation
• Resource management:
  • Operator scheduling
  • Static and dynamic approximations

Telegraph [UCB]

• Continuous query processing system
• Support for stream-oriented operators
• Support for adaptivity in query processing
• Various aspects of optimized multi-query stream processing

Commercial Systems

• Sybase: publish-subscribe using MQ (Memory Queues)
  • MQs are in-memory tables processed using active rules and stored procedures
• Similar solutions in Oracle and Teradata. But IBM's MQSeries and Microsoft's MSMQ are web-service oriented: Java Message Service (JMS), WebSphere, CORBA.
• Two DSMS startups:
  • CORAL8: http://coral8.com/
  • Streambase: http://www.streambase.com/



More Tutorial Talks

• Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, Jennifer Widom. http://theory.stanford.edu/~rajeev/pods-full-talk.ppt
• Nick Koudas and Divesh Srivastava. Data stream query processing. Tutorial presented at International Conference on Very Large Databases (VLDB), 1149, 2003. PDF: http://www.research.att.com/~divesh/papers/ks2003-streamqp-tutorial.pdf; talk slides (PDF): http://www.research.att.com/~divesh/papers/ks2003-streamqp-tutorial-talk.pdf
• Nick Koudas et al. Matching XML Documents Approximately (with S. Yahia and D. Srivastava). Tutorial delivered at ICDE 2003.
• Nick Koudas et al. Stream Data Management: Research Directions and Opportunities. Invited talk at IDEAS 2002.
• Nick Koudas et al. Mining Data Streams (with S. Guha). Invited tutorial delivered at PAKDD 2003.

Implementation Approaches for Continuous Queries on Streaming XML

Automata-based techniques:
• XFilter [AF00]: finite state machine per path expression
• XTrie [CFGR02]: shares common sub-paths of PC paths
• YFilter [DF03]: single NFA for all path expressions
• [GMOS03]: single DFA, limitations on flexibility
• XPush [GS03]: pushdown automaton for tree patterns

Index-based techniques:
• MatchMaker [LP02]: shared tree patterns
• IndexFilter [BGKS03]: shared path expressions, comparison

XML Stream Processing: Key Ideas

• Obtain bindings of the for clause path expression variables: ordered sequence, no duplicates
• Filter bindings using the where clause path expression predicates: an existential check suffices
• Compute bindings of the return clause path expressions: ordered (possibly null) sequence
• Goal: efficient matching/binding of XML path expressions, for a very large number of path expressions
