
1

Stream-based Data Management

IS698

Min Song

2

Characteristics of Data Streams

Data Streams
• Data streams: continuous, ordered, changing, fast, huge amount
• Traditional DBMS: data stored in finite, persistent data sets

Characteristics
• Huge volumes of continuous data, possibly infinite
• Fast changing and requires fast, real-time response
• Data stream captures nicely our data processing needs of today
• Random access is expensive: single linear scan algorithm (can only have one look)
• Store only the summary of the data seen thus far
• Most stream data are at a fairly low level or multi-dimensional in nature, and need multi-level and multi-dimensional processing
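The "one look" and "store only the summary" constraints can be illustrated with a running mean and variance. This is a minimal sketch using Welford's online update; the class name RunningSummary and the sample readings are assumptions made for illustration and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass
class RunningSummary:
    """Constant-size summary of a numeric stream (hypothetical helper)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the running mean

    def update(self, x: float) -> None:
        # Welford's online update: each element is seen once, memory stays O(1).
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Population variance of the elements seen so far.
        return self.m2 / self.count if self.count else 0.0


# Assumed sample readings; a real stream would never be held in a list.
summary = RunningSummary()
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    summary.update(reading)
```

Only three numbers are kept no matter how long the stream runs, which is the trade the slide describes: the raw elements are gone, but the summary can still answer some queries.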

3

Stream Data Applications

• Telecommunication calling records
• Business: credit card transaction flows
• Network monitoring and traffic engineering
• Financial market: stock exchange
• Engineering & industrial processes: power supply & manufacturing
• Sensor, monitoring & surveillance: video streams
• Security monitoring
• Web logs and Web page click streams
• Massive data sets (even saved but random access is too expensive)

4

Data Streams vs. Data Sets

Data Sets                               | Data Streams
Updates infrequent                      | Data changed constantly (sometimes additions only)
Old data required many times            | Mostly only freshest data used
Example: employees' personal data table | Examples: financial tickers, data feeds from sensors, network monitoring, etc.

5

Using Traditional Database

[Diagram labels: User/Application, Loader, Query, Result (Query and Result repeated)]

6

Data Streams Paradigm

[Diagram labels: User/Application, Register Query, Stream Query Processor, Result]

7

Data Streams Paradigm

[Diagram labels: User/Application, Register Query, Stream Query Processor, Result, Scratch Space (Memory and/or Disk), Data Stream Management System (DSMS)]

8

DBMS versus DSMS

DBMS                   | DSMS
• Persistent relations | • Transient streams (and persistent relations)

9

DBMS versus DSMS

DBMS                   | DSMS
• Persistent relations | • Transient streams (and persistent relations)
• One-time queries     | • Continuous queries

10




DBMS            | DSMS
• Random access | • Sequential access

11





DBMS                                                                | DSMS
• Access plan determined by query processor and physical DB design | • Unpredictable data arrival and characteristics

12





DBMS                                                                | DSMS
• Access plan determined by query processor and physical DB design | • Unpredictable data arrival and characteristics
• "Unbounded" disk store                                            | • Bounded main memory

13

Challenges of Stream Data Processing

• Multiple, continuous, rapid, time-varying, ordered streams
• Main memory computations
• Queries are often continuous
  • Evaluated continuously as stream data arrives
  • Answer updated over time
• Queries are often complex
  • Beyond element-at-a-time processing
  • Beyond relational queries (scientific, data mining, OLAP)
• Multi-level/multi-dimensional processing and data mining
  • Most stream data are at a fairly low level or multi-dimensional in nature

14

Processing Stream Queries

• Query types
  • One-time query vs. continuous query (evaluated continuously as the stream continues to arrive)
  • Predefined query vs. ad-hoc query (issued on-line)
• Unbounded memory requirements
  • For real-time response, main-memory algorithms should be used
  • Memory requirement is unbounded if one will join future tuples
• Approximate query answering
  • With bounded memory, it is not always possible to produce exact answers
  • High-quality approximate answers are desired
  • Data reduction and synopsis construction methods
    • Sketches, random sampling, histograms, wavelets, etc.
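As one concrete instance of the random-sampling synopsis named above, the sketch below keeps a fixed-size uniform sample of a stream of unknown length using reservoir sampling (Algorithm R). The function name reservoir_sample and the integer stream are assumptions for illustration; the slides do not prescribe a particular sampling method.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Return a uniform random sample of up to k items using O(k) memory."""
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Assumed example stream of 10,000 integers, sampled down to 5.
sample = reservoir_sample(range(10_000), k=5, rng=random.Random(42))
```

Approximate answers, such as an estimated average, can then be computed from the sample instead of the full stream, at the cost of sampling error.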

15

Methods for Approximate Query Answering

• Sliding windows
  • Only over sliding windows of recent stream data
  • Approximation but often more desirable in applications
• Batched processing, sampling and synopses
  • Batched if update is fast but computing is slow
    • Compute periodically, not very timely
  • Sampling if update is slow but computing is fast
    • Compute using sample data, but not good for joins, etc.
  • Synopsis data structures
    • Maintain a small synopsis or sketch of data
    • Good for querying historical data
• Blocking operators, e.g., sorting, avg, min, etc.
  • Blocking if unable to produce the first output until seeing the entire input
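The sliding-window idea can be sketched as follows. An average over the whole stream is blocking, but an average over the last few elements can be emitted after every arrival. The function name sliding_average, the count-based window, and the sample values are assumptions for illustration.

```python
from collections import deque
from typing import Deque, Iterable, Iterator


def sliding_average(stream: Iterable[float], window: int) -> Iterator[float]:
    """Yield the average of the most recent `window` elements after each arrival."""
    recent: Deque[float] = deque()
    total = 0.0
    for x in stream:
        recent.append(x)
        total += x
        if len(recent) > window:
            # Evict the oldest element so memory stays bounded by the window size.
            total -= recent.popleft()
        yield total / len(recent)


# Assumed sample values with a window of 3 elements.
averages = list(sliding_average([1.0, 2.0, 3.0, 4.0, 5.0], window=3))
```

A time-based window (for example, the last 5 minutes) would follow the same pattern but evict by timestamp rather than by count.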

16

Projects on DSMS (Data Stream Management System)

Research projects and system prototypes
• STREAM (Stanford): A general-purpose DSMS
• Cougar (Cornell): sensors
• Aurora (Brown/MIT): sensor monitoring, dataflow
• Hancock (AT&T): telecom streams
• Niagara (OGI/Wisconsin): Internet XML databases
• OpenCQ (Georgia Tech): triggers, incr. view maintenance
• Tapestry (Xerox): pub/sub content-based filtering
• Telegraph (Berkeley): adaptive engine for sensors
• Tradebot (www.tradebot.com): stock tickers & streams
• Tribeca (Bellcore): network monitoring
• Streaminer (UIUC): new project for stream data mining

17

Stream Data Mining vs. Stream Querying

• Stream mining: a more challenging task
  • It shares most of the difficulties with stream querying
  • Patterns are hidden and more general than querying
  • It may require exploratory analysis
    • Not necessarily continuous queries
• Stream data mining tasks
  • Multi-dimensional on-line analysis of streams
  • Mining outliers and unusual patterns in stream data
  • Clustering data streams
  • Classification of stream data

18

Challenges for Mining Unusual Patterns in Data Streams

• Most stream data are at a fairly low level or multi-dimensional in nature: needs multi-level/multi-dimensional (ML/MD) processing
• Analysis requirements
  • Multi-dimensional trends and unusual patterns
  • Capturing important changes at multi-dimensions/levels
  • Fast, real-time detection and response

19

Summary

• Stream data analysis: A rich and largely unexplored field
• Current research focus in database community: DSMS system architecture, continuous query processing, supporting mechanisms
• Stream data mining and stream OLAP analysis
  • Powerful tools for finding general and unusual patterns
  • Largely unexplored: current studies only touched the surface
• Lots of exciting issues in further study
  • A promising one: Multi-level, multi-dimensional analysis and mining of stream data

20

What Is a Continuous Query?

A query that is issued once and logically runs continuously.

21

What Is a Continuous Query?

A query that is issued once and runs continuously.

Example: detect abnormalities in network traffic behavior in real time, along with their cause, such as link congestion due to hardware failure.

22

What Is a Continuous Query?

A query that is issued once and runs continuously.

More examples:

Continuous queries are used to support load balancing and online automatic trading at a stock exchange.
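The "issued once, run continuously" idea can be sketched as a processor that stores registered queries and evaluates them against every arriving tuple. The class StreamQueryProcessor, the "utilization" field, and the 0.9 threshold are all hypothetical names and values chosen for illustration; they do not describe any particular DSMS.

```python
from typing import Any, Callable, Dict, List, Tuple

Reading = Dict[str, Any]


class StreamQueryProcessor:
    """Toy processor: queries are registered once, then run on every tuple."""

    def __init__(self) -> None:
        self._queries: List[Tuple[Callable[[Reading], bool],
                                  Callable[[Reading], None]]] = []

    def register(self, predicate: Callable[[Reading], bool],
                 on_result: Callable[[Reading], None]) -> None:
        # The query is stored, not executed: there is no data to run it on yet.
        self._queries.append((predicate, on_result))

    def push(self, item: Reading) -> None:
        # Each arriving tuple is evaluated against all registered queries.
        for predicate, on_result in self._queries:
            if predicate(item):
                on_result(item)


# Assumed example: flag links whose utilization exceeds 90%.
alerts: List[Reading] = []
processor = StreamQueryProcessor()
processor.register(lambda r: r["utilization"] > 0.9, alerts.append)
for reading in [
    {"link": "A", "utilization": 0.35},
    {"link": "B", "utilization": 0.97},
    {"link": "A", "utilization": 0.92},
]:
    processor.push(reading)
```

This inverts the traditional arrangement: the query is persistent and the data is transient, so results accumulate over time instead of being returned once.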

23

Special Challenges

• Timely online answers even for rapid data streams
• Ability of fast access to large portions of data
• Processing of multiple streams simultaneously

24

Making Things Concrete

Outgoing (call_ID, caller, time, event)
Incoming (call_ID, callee, time, event)

event = start or end

[Diagram labels: BOB, ALICE, Central Office (two), DSMS]

25

Making Things Concrete

Database = two streams of mobile call records
• Outgoing(connectionID, caller, start, end)
• Incoming(connectionID, callee, start, end)

Query language = SQL
• FROM clauses can refer to streams and/or relations

26

Query 1 (self-join)

Find all outgoing calls longer than 2 minutes

SELECT O1.call_ID, O1.caller
FROM Outgoing O1, Outgoing O2
WHERE (O2.time - O1.time > 2
       AND O1.call_ID = O2.call_ID
       AND O1.event = 'start'
       AND O2.event = 'end')

• Result requires unbounded storage
• Can provide result as data stream
• Can output after 2 min, without seeing end
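The last observation, that a call can be reported once it has been open for more than 2 minutes, can be sketched as below. The sketch assumes events arrive in time order, time is measured in minutes, and each call_ID is used once; the function name long_calls and the sample events are hypothetical. With no timer, a long call is only reported when some later event arrives.

```python
from collections import deque
from typing import Deque, Dict, Iterable, Iterator, Tuple

Event = Tuple[int, str, float, str]  # (call_ID, caller, time, event)


def long_calls(outgoing: Iterable[Event],
               threshold: float = 2.0) -> Iterator[Tuple[int, str]]:
    """Stream (call_ID, caller) for calls open longer than `threshold` minutes."""
    open_calls: Dict[int, str] = {}               # call_ID -> caller, still open
    pending: Deque[Tuple[float, int]] = deque()   # (start time, call_ID), by start
    for call_id, caller, time, event in outgoing:
        # Calls started more than `threshold` ago and still open already qualify.
        while pending and time - pending[0][0] > threshold:
            _, old_id = pending.popleft()
            if old_id in open_calls:
                yield old_id, open_calls.pop(old_id)
        if event == "start":
            open_calls[call_id] = caller
            pending.append((time, call_id))
        elif event == "end":
            # A short call, or one already reported: just drop its state.
            open_calls.pop(call_id, None)


# Assumed sample events.
events = [
    (1, "bob", 0.0, "start"),
    (2, "carol", 0.5, "start"),
    (2, "carol", 1.0, "end"),
    (3, "dave", 2.5, "start"),
    (1, "bob", 5.0, "end"),
]
result = list(long_calls(events))
```

Under these assumptions the state held is limited to calls that started within the last 2 minutes, rather than growing with every unmatched start tuple.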

27

Query 2 (join)

Pair up callers and callees

SELECT O.caller, I.callee
FROM Outgoing O, Incoming I
WHERE O.call_ID = I.call_ID

• Can still provide result as data stream
• Requires unbounded temporary storage ...
• ... unless streams are near-synchronized
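A streaming join of this kind is often implemented as a symmetric hash join: each arriving tuple probes the tuples waiting from the other stream and is stored if it finds no match. The sketch below is a simplification that assumes exactly one tuple per call_ID on each stream (for example, only the start events), so a matched entry can be discarded. The function name pair_up and the tagged input format are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple

# ("O", call_ID, caller) for Outgoing or ("I", call_ID, callee) for Incoming.
Arrival = Tuple[str, int, str]


def pair_up(arrivals: Iterable[Arrival]) -> Iterator[Tuple[str, str]]:
    """Yield (caller, callee) as soon as both sides of a call_ID have arrived."""
    waiting_callers: Dict[int, str] = {}
    waiting_callees: Dict[int, str] = {}
    for source, call_id, party in arrivals:
        if source == "O":
            if call_id in waiting_callees:
                # Match found: emit and release the stored tuple.
                yield party, waiting_callees.pop(call_id)
            else:
                waiting_callers[call_id] = party
        else:
            if call_id in waiting_callers:
                yield waiting_callers.pop(call_id), party
            else:
                waiting_callees[call_id] = party


# Assumed interleaving of the two streams.
arrivals = [
    ("O", 1, "bob"),
    ("I", 2, "erin"),
    ("I", 1, "alice"),
    ("O", 2, "dave"),
]
pairs = list(pair_up(arrivals))
```

The two dictionaries are the temporary storage the slide refers to. When matching tuples arrive close together they stay small, and when one stream lags arbitrarily behind the other they can grow without bound.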

28

Query 3 (group-by aggregation)

Total connection time for each caller

SELECT O1.caller, sum(O2.time - O1.time)
FROM Outgoing O1, Outgoing O2
WHERE (O1.call_ID = O2.call_ID
       AND O1.event = 'start'
       AND O2.event = 'end')
GROUP BY O1.caller

Cannot provide result in (append-only) stream. Alternatives:

• Output stream with updates
• Provide current value on demand
• Keep answer in memory
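The first alternative, an output stream with updates, can be sketched as follows: every completed call emits a new (caller, total) pair that supersedes the previous value for that caller. The function name connection_time_updates and the sample events are hypothetical, and events are assumed to arrive in time order with one start and one end per call_ID.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[int, str, float, str]  # (call_ID, caller, time, event)


def connection_time_updates(outgoing: Iterable[Event]) -> Iterator[Tuple[str, float]]:
    """Emit an updated running total for a caller each time one of their calls ends."""
    started: Dict[int, Tuple[str, float]] = {}   # call_ID -> (caller, start time)
    totals: Dict[str, float] = defaultdict(float)
    for call_id, caller, time, event in outgoing:
        if event == "start":
            started[call_id] = (caller, time)
        elif event == "end" and call_id in started:
            start_caller, start_time = started.pop(call_id)
            totals[start_caller] += time - start_time
            # This pair replaces any earlier total emitted for the same caller.
            yield start_caller, totals[start_caller]


# Assumed sample events.
events = [
    (1, "bob", 0.0, "start"),
    (2, "carol", 1.0, "start"),
    (1, "bob", 3.0, "end"),
    (3, "bob", 4.0, "start"),
    (2, "carol", 5.0, "end"),
    (3, "bob", 9.0, "end"),
]
updates = list(connection_time_updates(events))
```

The totals dictionary doubles as the "keep answer in memory" alternative: a consumer that prefers to ask for the current value on demand could read it instead of following the update stream.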

29

Conclusions

Conventional DBMS technology is inadequate

We need to reconsider all aspects of data management and processing in the presence of data streams
