Post on 20-Dec-2015
1
Stream-based Data Management
IS698
Min Song
2
Characteristics of Data Streams
Data Streams
• Data streams: continuous, ordered, changing, fast, huge amount
• Traditional DBMS: data stored in finite, persistent data sets
Characteristics
• Huge volumes of continuous data, possibly infinite
• Fast changing and requires fast, real-time response
• Data streams capture nicely our data processing needs of today
• Random access is expensive: single linear scan algorithms (can only have one look)
• Store only the summary of the data seen thus far
• Most stream data are at a pretty low level or multi-dimensional in nature, and need multi-level and multi-dimensional processing
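The "single linear scan" and "store only the summary" points can be illustrated with a small sketch. It is not taken from the slides: the `RunningSummary` class and `summarize` function are hypothetical names, and the choice of Welford's online update for mean and variance is an assumption made for illustration. The summary has constant size no matter how many elements arrive.

```python
from dataclasses import dataclass
import math


@dataclass
class RunningSummary:
    """Constant-size summary of a numeric stream (hypothetical helper)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0              # running sum of squared deviations from the mean
    minimum: float = math.inf
    maximum: float = -math.inf

    def update(self, x: float) -> None:
        # Welford's online update: each element is looked at exactly once.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)

    @property
    def variance(self) -> float:
        # Population variance; 0.0 before any element has arrived.
        return self.m2 / self.count if self.count else 0.0


def summarize(stream):
    """Consume an iterable in one pass and return its summary."""
    summary = RunningSummary()
    for x in stream:
        summary.update(x)
    return summary
```

Because `summarize` only iterates, it works equally well on a generator that never materializes the data, which is the situation the slide describes.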
3
Stream Data Applications
• Telecommunication calling records
• Business: credit card transaction flows
• Network monitoring and traffic engineering
• Financial market: stock exchange
• Engineering & industrial processes: power supply & manufacturing
• Sensor, monitoring & surveillance: video streams
• Security monitoring
• Web logs and Web page click streams
• Massive data sets (even when saved, random access is too expensive)
4
Data Streams vs. Data Sets
Data Sets:
• Updates infrequent
• Old data required many times
• Example: employees' personal data table
Data Streams:
• Data changed constantly (sometimes additions only)
• Mostly only freshest data used
• Examples: financial tickers, data feeds from sensors, network monitoring, etc.
5
Using Traditional Database
[Diagram labels: User/Application, Loader, Query, Result]
6
Data Streams Paradigm
[Diagram labels: User/Application, Register Query, Stream Query Processor, Result, Scratch Space (Memory and/or Disk), Data Stream Management System (DSMS)]
8
DBMS versus DSMS
DBMS:
• Persistent relations
• One-time queries
• Random access
• Access plan determined by query processor and physical DB design
• "Unbounded" disk store
DSMS:
• Transient streams (and persistent relations)
• Continuous queries
• Sequential access
• Unpredictable data arrival and characteristics
• Bounded main memory
13
Challenges of Stream Data Processing
• Multiple, continuous, rapid, time-varying, ordered streams
• Main memory computations
• Queries are often continuous
  • Evaluated continuously as stream data arrives
  • Answer updated over time
• Queries are often complex
  • Beyond element-at-a-time processing
  • Beyond relational queries (scientific, data mining, OLAP)
• Multi-level/multi-dimensional processing and data mining
  • Most stream data are at a pretty low level or multi-dimensional in nature
14
Processing Stream Queries
• Query types
  • One-time query vs. continuous query (evaluated continuously as the stream continues to arrive)
  • Predefined query vs. ad-hoc query (issued on-line)
• Unbounded memory requirements
  • For real-time response, main memory algorithms should be used
  • Memory requirement is unbounded if one will join future tuples
• Approximate query answering
  • With bounded memory, it is not always possible to produce exact answers
  • High-quality approximate answers are desired
  • Data reduction and synopsis construction methods: sketches, random sampling, histograms, wavelets, etc.
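Of the synopsis methods listed above, random sampling is the simplest to sketch. The example below uses reservoir sampling (Algorithm R), which is one common way to keep a uniform sample of fixed size from a stream of unknown length. The slides do not name a specific sampling algorithm, so this choice and the function name `reservoir_sample` are assumptions.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from an iterable.

    Memory use is O(k) regardless of the stream length (Algorithm R).
    """
    if rng is None:
        rng = random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir
```

Approximate answers (for example an estimated average) can then be computed over the sample instead of the full stream, at the cost of sampling error.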
15
Methods for Approximate Query Answering
• Sliding windows
  • Only over sliding windows of recent stream data
  • Approximation but often more desirable in applications
• Batched processing, sampling and synopses
  • Batched if update is fast but computing is slow: compute periodically, not very timely
  • Sampling if update is slow but computing is fast: compute using sample data, but not good for joins, etc.
  • Synopsis data structures: maintain a small synopsis or sketch of data; good for querying historical data
• Blocking operators, e.g., sorting, avg, min, etc.
  • Blocking if unable to produce the first output until seeing the entire input
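A sliding window can be sketched as follows. The example assumes a count-based window (the last `size` elements); the slides do not say whether windows are count-based or time-based, and the class name `SlidingWindowMean` is hypothetical. Restricting an aggregate such as an average to a window lets it emit an answer after every arrival instead of waiting for the whole input.

```python
from collections import deque


class SlidingWindowMean:
    """Mean over the most recent `size` stream elements."""

    def __init__(self, size):
        if size <= 0:
            raise ValueError("window size must be positive")
        self.size = size
        self.window = deque()
        self.total = 0.0

    def update(self, x):
        """Add one element and return the mean of the current window."""
        self.window.append(x)
        self.total += x
        if len(self.window) > self.size:
            # Evict the oldest element so memory stays bounded by `size`.
            self.total -= self.window.popleft()
        return self.total / len(self.window)
```

Feeding the values 1, 2, 3, 4, 5 into a window of size 3 yields one mean per arrival, with older values dropping out once the window is full.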
16
Projects on DSMS (Data Stream Management System)
Research projects and system prototypes
• STREAM (Stanford): a general-purpose DSMS
• Cougar (Cornell): sensors
• Aurora (Brown/MIT): sensor monitoring, dataflow
• Hancock (AT&T): telecom streams
• Niagara (OGI/Wisconsin): Internet XML databases
• OpenCQ (Georgia Tech): triggers, incremental view maintenance
• Tapestry (Xerox): pub/sub content-based filtering
• Telegraph (Berkeley): adaptive engine for sensors
• Tradebot (www.tradebot.com): stock tickers & streams
• Tribeca (Bellcore): network monitoring
• Streaminer (UIUC): new project for stream data mining
17
Stream Data Mining vs. Stream Querying
• Stream mining: a more challenging task
  • It shares most of the difficulties with stream querying
  • Patterns are hidden and more general than querying
  • It may require exploratory analysis (not necessarily continuous queries)
• Stream data mining tasks
  • Multi-dimensional on-line analysis of streams
  • Mining outliers and unusual patterns in stream data
  • Clustering data streams
  • Classification of stream data
18
Challenges for Mining Unusual Patterns in Data Streams
Most stream data are at a pretty low level or multi-dimensional in nature and need multi-level/multi-dimensional (ML/MD) processing
Analysis requirements
Multi-dimensional trends and unusual patterns
Capturing important changes at multi-dimensions/levels
Fast, real-time detection and response
19
Summary
• Stream data analysis: a rich and largely unexplored field
• Current research focus in the database community: DSMS system architecture, continuous query processing, supporting mechanisms
• Stream data mining and stream OLAP analysis
  • Powerful tools for finding general and unusual patterns
  • Largely unexplored: current studies have only touched the surface
• Lots of exciting issues for further study
  • A promising one: multi-level, multi-dimensional analysis and mining of stream data
20
What Is a Continuous Query?
A query that is issued once and then logically runs continuously.
Example: detect abnormalities in network traffic behavior in real time, together with their cause, such as link congestion due to a hardware failure.
More examples: continuous queries are used to support load balancing and online automated trading at a stock exchange.
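The "issued once, run continuously" idea can be sketched as a query that is registered one time and then evaluated against every arriving record. The `TinyDSMS` class, its `register` and `push` methods, and the `utilization` field are all hypothetical and exist only for this illustration; no real DSMS API is implied.

```python
class TinyDSMS:
    """Toy stream processor: queries are registered once, then run on every record."""

    def __init__(self):
        self._queries = []  # list of (predicate, on_result) pairs

    def register(self, predicate, on_result):
        """Register a continuous query; it stays active for all future records."""
        self._queries.append((predicate, on_result))

    def push(self, record):
        """Deliver one arriving record to every registered query."""
        for predicate, on_result in self._queries:
            if predicate(record):
                on_result(record)


# Usage: flag links whose utilization exceeds 90% (threshold chosen for illustration).
alerts = []
dsms = TinyDSMS()
dsms.register(lambda r: r["utilization"] > 0.9, alerts.append)
for rec in [
    {"link": "A", "utilization": 0.40},
    {"link": "B", "utilization": 0.95},
    {"link": "A", "utilization": 0.97},
]:
    dsms.push(rec)
```

The contrast with a one-time query is that the query outlives any single call: results accumulate in `alerts` as records keep arriving.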
23
Special Challenges
• Timely online answers even for rapid data streams
• Ability of fast access to large portions of data
• Processing of multiple streams simultaneously
24
Making Things Concrete
Outgoing (call_ID, caller, time, event)
Incoming (call_ID, callee, time, event)
event = start or end
[Diagram labels: BOB, ALICE, Central Office (two), DSMS]
25
Making Things Concrete
• Database = two streams of mobile call records
  • Outgoing(connectionID, caller, start, end)
  • Incoming(connectionID, callee, start, end)
• Query language = SQL
  • FROM clauses can refer to streams and/or relations
26
Query 1 (self-join)
Find all outgoing calls longer than 2 minutes
SELECT O1.call_ID, O1.caller
FROM Outgoing O1, Outgoing O2
WHERE (O2.time - O1.time > 2
       AND O1.call_ID = O2.call_ID
       AND O1.event = 'start'
       AND O2.event = 'end')
• Result requires unbounded storage
• Can provide result as data stream
• Can output after 2 min, without seeing end
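The remark that a result can be output after 2 minutes, without seeing the end event, can be sketched in Python. This is one possible streaming evaluation, not the method of any particular DSMS. It assumes events are `(call_ID, caller, time, event)` tuples, that `time` is in minutes, and that events arrive in non-decreasing time order; the function name `long_calls` is hypothetical.

```python
def long_calls(events, threshold=2):
    """Yield (call_ID, caller) as soon as a call is known to exceed `threshold` minutes.

    State holds only calls that have started but are neither ended nor reported.
    """
    open_calls = {}  # call_ID -> (caller, start_time)
    for call_id, caller, time, event in events:
        # The current timestamp proves that every open call which started more
        # than `threshold` minutes ago is long, even though its end is unseen.
        overdue = [cid for cid, (_, start) in open_calls.items()
                   if time - start > threshold]
        for cid in overdue:
            yield cid, open_calls.pop(cid)[0]
        if event == "start":
            open_calls[call_id] = (caller, time)
        elif event == "end":
            # Short call (or already reported): drop its state.
            open_calls.pop(call_id, None)
```

State is released when a call is reported or ends, but it still grows without bound if many calls are open at once, which matches the unbounded-storage caveat on the slide.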
27
Query 2 (join)
Pair up callers and callees
SELECT O.caller, I.callee
FROM Outgoing O, Incoming I
WHERE O.call_ID = I.call_ID
• Can still provide result as data stream
• Requires unbounded temporary storage …
• … unless streams are near-synchronized
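A join of two streams can be sketched as a symmetric hash join: each arriving record is checked against the pending records of the other stream. The sketch makes two labeled assumptions that are not in the slides: both streams are merged into one sequence of `(source, call_ID, party)` tuples, and each `call_ID` appears exactly once per stream, so matched entries can be evicted. The name `pair_callers` is hypothetical.

```python
def pair_callers(records):
    """Yield (caller, callee) pairs from a merged stream of call records.

    `source` is "outgoing" (party is the caller) or "incoming" (party is the callee).
    """
    pending_out = {}  # call_ID -> caller, waiting for the Incoming record
    pending_in = {}   # call_ID -> callee, waiting for the Outgoing record
    for source, call_id, party in records:
        if source == "outgoing":
            if call_id in pending_in:
                # Match found: emit and evict (valid under the once-per-stream assumption).
                yield party, pending_in.pop(call_id)
            else:
                pending_out[call_id] = party
        elif source == "incoming":
            if call_id in pending_out:
                yield pending_out.pop(call_id), party
            else:
                pending_in[call_id] = party
```

When the two streams are near-synchronized, matches arrive quickly and the pending tables stay small; if one stream lags arbitrarily, the tables can grow without bound.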
28
Query 3 (group-by aggregation)
Total connection time for each caller
SELECT O1.caller, sum(O2.time - O1.time)
FROM Outgoing O1, Outgoing O2
WHERE (O1.call_ID = O2.call_ID
       AND O1.event = 'start'
       AND O2.event = 'end')
GROUP BY O1.caller
Cannot provide result in (append-only) stream. Alternatives:
• Output stream with updates
• Provide current value on demand
• Keep answer in memory
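The first alternative, an output stream with updates, can be sketched as follows. Each time a call ends, the caller's running total changes and a new `(caller, total)` record is emitted that supersedes the previous one for that caller. The event tuple layout `(call_ID, caller, time, event)` follows the Outgoing schema; the function name `connection_time_updates` and the use of plain numbers for `time` are assumptions.

```python
from collections import defaultdict


def connection_time_updates(events):
    """Yield (caller, running_total) every time one of the caller's calls ends."""
    starts = {}                # call_ID -> (caller, start_time) for calls in progress
    totals = defaultdict(int)  # caller -> total connection time so far
    for call_id, _caller, time, event in events:
        if event == "start":
            starts[call_id] = (_caller, time)
        elif event == "end" and call_id in starts:
            caller, start_time = starts.pop(call_id)
            totals[caller] += time - start_time
            # Later records for the same caller replace earlier ones.
            yield caller, totals[caller]
```

The `totals` table is also the "answer kept in memory": a consumer that prefers the second alternative could read the current value from it on demand instead of following the update stream.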
29
Conclusions
Conventional DBMS technology is inadequate
We need to reconsider all aspects of data management and processing in the presence of data streams