stream data management
TRANSCRIPT
-
7/29/2019 Stream Data Management
1/68
Sangeetha Seshadri
Data Stream Processing: An Overview
CS 4440 Lecture 6
-
Agenda
Data Streams
What are they?
Why now? Applications..
DSMS: Architecture & Issues
Query Processing
-
Data Streams: What and Where?
Continuous, unbounded, rapid, time-varying streams of data elements (tuples).
Occur in a variety of modern applications
Network monitoring and traffic engineering
Sensor networks, RFID tags
Telecom call records
Financial applications
Web logs and click-streams
Manufacturing processes
DSMS = Data Stream Management System
Stanford Stream Data Manager
-
DBMS versus DSMS
DBMS
Persistent relations
One-time queries
Random access
Access plan determined by query processor and physical DB design
DSMS
Transient streams (and persistent relations)
Continuous queries
Sequential access
Unpredictable data characteristics and arrival patterns
-
Continuous Queries
One-time queries: run once to completion over the current data set.
Continuous queries: issued once and then continuously evaluated over the data.
Examples:
Notify me when the temperature drops below X
Tell me when the price of stock Y exceeds 300
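The difference between the two query styles can be sketched in a few lines of Python. This is only an illustration, not how any particular DSMS is built: the class name `ContinuousQueryEngine`, the dictionary-shaped tuples, and the threshold 10 standing in for X are all assumptions made for the example.

```python
class ContinuousQueryEngine:
    """Toy engine: queries are registered once, then checked on every arriving tuple."""

    def __init__(self):
        self.queries = {}  # query name -> predicate over one tuple

    def register(self, name, predicate):
        # Issued once; stays active until removed.
        self.queries[name] = predicate

    def push(self, item):
        # Called for each arriving tuple; returns the names of the queries that fire.
        return [name for name, predicate in self.queries.items() if predicate(item)]


engine = ContinuousQueryEngine()
# X = 10 is a hypothetical threshold chosen for the example.
engine.register("temperature below X",
                lambda t: t.get("temperature", float("inf")) < 10)
engine.register("stock Y above 300",
                lambda t: t.get("symbol") == "Y" and t.get("price", 0) > 300)
```

A one-time query would instead scan a stored data set once and return; here the predicates stay registered and every later `push` is evaluated against them.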
-
The (Simplified) Big Picture
[Figure labels: Input streams; Register Query; DSMS; Scratch Store; Streamed Result; Stored Result; Archive; Stored Relations]
-
(Simplified) Network Monitoring
[Figure labels: Register Monitoring Queries; Network measurements, Packet traces; DSMS; Scratch Store; Intrusion Warnings; Online Performance Metrics; Archive; Lookup Tables]
-
Triggers?
Recall triggers in traditional DBMSs?
Why not use triggers to process continuous queries over data streams?
-
R. Motwani, Models & Issues in Data Streams, PODS 2002
Making Things Concrete
Outgoing (call_ID, caller, time, event)
Incoming (call_ID, callee, time, event)
event = start or end
[Figure labels: ALICE; Central Office; DSMS; Central Office; BOB]
-
R. Motwani, Models & Issues in Data Streams, PODS 2002
Query 1 (self-join)
Find all outgoing calls longer than 2 minutes
SELECT O1.call_ID, O1.caller
FROM Outgoing O1, Outgoing O2
WHERE (O2.time - O1.time > 2
AND O1.call_ID = O2.call_ID
AND O1.event = 'start'
AND O2.event = 'end')
Result requires unbounded storage
Can provide result as data stream
Can output after 2 min, without seeing end
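The remark that a result can be emitted after 2 minutes without waiting for the end event can be sketched as follows. This is a minimal sketch under stated assumptions: tuples arrive as `(call_ID, caller, time, event)` in non-decreasing time order, time is measured in minutes, and the function name `long_calls` is made up for the example.

```python
from collections import OrderedDict


def long_calls(outgoing, threshold=2):
    """Yield (call_ID, caller) as soon as a call is known to exceed `threshold` minutes."""
    open_calls = OrderedDict()  # call_ID -> (caller, start_time), kept in start order
    for call_id, caller, time, event in outgoing:
        # Every arriving tuple advances the clock: any open call that started
        # more than `threshold` ago is already long, whether or not it has ended.
        while open_calls:
            oldest_id, (oldest_caller, started) = next(iter(open_calls.items()))
            if time - started <= threshold:
                break
            del open_calls[oldest_id]
            yield (oldest_id, oldest_caller)
        if event == "start":
            open_calls[call_id] = (caller, time)
        else:
            # The call ended within the threshold (or was already reported).
            open_calls.pop(call_id, None)
```

Under these assumptions the state holds only calls that started within the last `threshold` minutes, instead of every start tuple ever seen.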
-
R. Motwani, Models & Issues in Data Streams, PODS 2002
Query 2 (join)
Pair up callers and callees
SELECT O.caller, I.callee
FROM Outgoing O, Incoming I
WHERE O.call_ID = I.call_ID
Can still provide result as data stream
Requires unbounded temporary storage unless streams are near-synchronized
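One way to picture the storage issue is a symmetric hash join that buffers each unmatched tuple until its partner shows up on the other stream. The sketch below is illustrative only and assumes each `call_ID` appears at most once per stream (for example, only start events are fed in); the function name and the `(stream, call_ID, party)` input shape are hypothetical.

```python
def pair_callers_and_callees(events):
    """Join Outgoing and Incoming on call_ID, emitting (caller, callee) pairs.

    `events` yields (stream_name, call_ID, party) with stream_name either
    "Outgoing" (party is the caller) or "Incoming" (party is the callee).
    """
    waiting = {"Outgoing": {}, "Incoming": {}}  # unmatched tuples per stream
    for stream, call_id, party in events:
        other = "Incoming" if stream == "Outgoing" else "Outgoing"
        if call_id in waiting[other]:
            partner = waiting[other].pop(call_id)
            yield (party, partner) if stream == "Outgoing" else (partner, party)
        else:
            # No partner yet: this tuple must be kept until one arrives.
            waiting[stream][call_id] = party
```

If the two streams are nearly synchronized, the `waiting` tables stay small; if one stream lags arbitrarily, they grow without bound, which is the point made on the slide.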
-
R. Motwani, Models & Issues in Data Streams, PODS 2002
Query 3 (group-by aggregation)
Total connection time for each caller
SELECT O1.caller, sum(O2.time - O1.time)
FROM Outgoing O1, Outgoing O2
WHERE (O1.call_ID = O2.call_ID
AND O1.event = 'start'
AND O2.event = 'end')
GROUP BY O1.caller
Cannot provide result in (append-only) stream
Output updates?
Provide current value on demand?
Memory?
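The "output updates" option can be sketched as a generator that emits a new running total for a caller each time one of that caller's calls ends. This is an assumption-laden illustration: tuples are `(call_ID, caller, time, event)`, every end tuple carries the same caller as its start, and the function name is invented for the example.

```python
from collections import defaultdict


def connection_time_updates(outgoing):
    """Yield (caller, total_connection_time) whenever a caller's total changes."""
    start_times = {}            # call_ID -> start time of calls still in progress
    totals = defaultdict(int)   # caller -> accumulated connection time
    for call_id, caller, time, event in outgoing:
        if event == "start":
            start_times[call_id] = time
        elif call_id in start_times:
            totals[caller] += time - start_times.pop(call_id)
            # An update, not an append: it supersedes the caller's previous total.
            yield (caller, totals[caller])
```

The memory question on the slide shows up directly: `totals` holds one entry per distinct caller, so it grows with the number of callers rather than staying bounded.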
-
DSMS Architecture & Issues
Data streams and stored relations: architectural differences.
Declarative language for registering continuous queries
Flexible query plans and execution strategies
Centralized? Distributed?
-
Agenda
Data Streams What are they?
Why now? Applications..
DSMS: Architecture & Issues
Query Processing
-
DSMS Issues
Relation: Tuple Set or Sequence?
Updates: Modifications or Appends?
Query Answer: Exact or Approximate?
Query Evaluation: One or Multiple Passes?
Query Plan: Fixed or Adaptive?
-
Architectural Issues
DSMS
Resource (memory, per-tuple computation) limited
Reasonably complex, near real-time query processing
Useful to identify what data to populate in database
Query Evaluation: One pass
Query Plan: Adaptive
DBMS
Resource (memory, disk, per-tuple computation) rich
Extremely sophisticated query processing, analysis
Useful to audit query results of data stream systems
Query Evaluation: Arbitrary
Query Plan: Fixed
N.Koudas, D. Srivastava (2003) AT&T Labs-Research
-
STREAM System Challenges
Must cope with:
Stream rates that may be high, variable, bursty
Stream data that may be unpredictable, variable
Continuous query loads that may be high, variable
Overload: need to use resources very carefully.
Changing conditions: adaptive strategy.
-
R. Motwani, Models & Issues in Data Streams, PODS 2002
Query Model
User/Application
Query Registration
Predefined
Ad-hoc
Predefined, inactive until invoked
Answer Availability
One-time
Event/timer based
Multiple-time, periodic
Continuous (stored or streamed)
Stream Access
Arbitrary
Weighted history
Sliding window (special case: size = 1)
DSMS
Query Processor
-
Agenda
Data Streams What are they?
Why now? Applications..
DSMS: Architecture & Issues
Query Processing
Language
Operators
Optimization
Multi-Query Optimization
-
N. Koudas, D. Srivastava (2003), AT&T Labs-Research
Stream Query Language
SQL extension
Queries reference/produce relations or streams
Examples: GSQL [Gigascope], CQL [STREAM]
[Figure: Stream or Finite Relation -> Stream Query Language -> Stream or Finite Relation]
-
Example: Continuous Query Language (CQL)
Start with SQL
Then add
Streams as new data type
Continuous instead of one-time semantics
Windows on streams (derived from SQL-99)
Sampling on streams (basic)
-
R. Motwani, Models & Issues in Data Streams, PODS 2002
Impact of Limited Memory
Continuous streams grow unboundedly
Queries may require unbounded memory
One solution: Approximate query evaluation
-
R. Motwani, Models & Issues in Data Streams, PODS 2002
Approximate Query Evaluation
Why?
Handling load: streams coming too fast
Avoid unbounded storage and computation
Ad hoc queries need approximate history
How?
Sliding windows, synopsis, samples, load-shed
Major Issues?
Metric for set-valued queries
Composition of approximate operators
How is it understood/controlled by user?
Integrate into query language
Query planning and interaction with resource allocation
Accuracy-efficiency-storage tradeoff and global metric
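As one concrete instance of the "samples" technique, the sketch below keeps a fixed-size uniform random sample of an unbounded stream using reservoir sampling. The slides do not name a specific sampling method, so treating reservoir sampling as the example is an assumption; the function name and parameters are illustrative.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample
```

Memory stays at k items no matter how long the stream runs, and an ad hoc query issued later can be answered approximately from the sample instead of the full history.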
-
Windows
Mechanism for extracting a finite relation from an infinite stream
Various window proposals for restricting operator scope.
Windows based on ordering attribute (e.g. time)
Windows based on tuple counts
Windows based on explicit markers (e.g. punctuations)
Variants (e.g., partitioning tuples in a window)
[Figure: Stream -> window specifications -> finite relations manipulated using SQL -> streamify -> Stream]
N.Koudas, D. Srivastava (2003) AT&T Labs-Research
-
Windows
Terminology
[Figure: timelines with points t1 to t5, marking start time and current time, for a sliding window and a tumbling window]
N.Koudas, D. Srivastava (2003) AT&T Labs-Research
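The two window types named above can be contrasted with a small count-based sketch. It is illustrative only: the slides draw time-based windows, whereas this example uses tuple counts for brevity, and the function names are made up.

```python
from collections import deque


def sliding_windows(stream, size):
    """Yield the last `size` items after each arrival (windows overlap)."""
    window = deque(maxlen=size)  # the oldest item is dropped automatically
    for item in stream:
        window.append(item)
        if len(window) == size:
            yield list(window)


def tumbling_windows(stream, size):
    """Yield consecutive, non-overlapping batches of `size` items."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
```

For the input 1, 2, 3, 4 with size 2, the sliding version produces [1, 2], [2, 3], [3, 4], while the tumbling version produces [1, 2], [3, 4].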
-
Query Operators
Selections - Where clause
Projections - Select clause
Joins - From clause
Group-by (Aggregations) - Group-by clause
-
Query Operators
Selections and projections on streams - straightforward
Local per-element operators
Projection may need to include ordering attribute.
Joins - problematic
May need to join tuples that are arbitrarily far apart.
Equijoin on stream ordering attributes may be tractable.
Majority of the work focuses on joins using windows.
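A window-restricted join can be sketched as below: each side keeps only the tuples from the last `window` time units, so a new tuple is matched against a bounded buffer rather than the whole history. This is a simple nested-loop illustration under the assumption that tuples arrive in non-decreasing time order; the function name and the `(side, time, key, value)` input shape are hypothetical.

```python
from collections import deque


def window_join(events, window):
    """Equijoin two streams on `key`, matching only tuples within `window` time units.

    `events` yields (side, time, key, value) with side "L" or "R".
    Emits (left_value, right_value) pairs.
    """
    buffers = {"L": deque(), "R": deque()}  # recent (time, key, value) per side
    for side, time, key, value in events:
        other = "R" if side == "L" else "L"
        # Expire tuples that have fallen out of the window on both sides.
        for buf in buffers.values():
            while buf and time - buf[0][0] > window:
                buf.popleft()
        # Probe the other side's window for matching keys.
        for _, other_key, other_value in buffers[other]:
            if other_key == key:
                yield (value, other_value) if side == "L" else (other_value, value)
        buffers[side].append((time, key, value))
```

Tuples that are "arbitrarily far apart" simply never meet, which is the approximation the window imposes in exchange for bounded state.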
-
R. Motwani, Models & Issues in Data Streams, PODS 2002
Blocking Operators
Blocking: no output until entire input seen
Streams: input never ends
Simple Aggregates: output update stream
Set Output (sort, group-by)
Root could maintain output data structure
Intermediate nodes try non-blocking analogs
Join: apply sliding-window restrictions
-
Optimization in DSMS
Traditionally table based cardinalities used in query optimizer.
Goal of query optimizer: Minimize the size of intermediate results.
Problematic in a streaming environment: all streams are unbounded = infinite size!
Need novel optimization objectives that are relevant when the input sources are streams.
N.Koudas, D. Srivastava (2003) AT&T Labs-Research
-
Query Optimization in DSMS
Novel notions of optimization:
Stream rate based [e.g. NiagaraCQ]
Resource-based [e.g. STREAM]
QoS based [e.g. Aurora]
Continuous adaptive optimization
Possibilities that objectives cannot be met:
Resource constraints
Bursty arrivals under limited processing capabilities.
N.Koudas, D. Srivastava (2003) AT&T Labs-Research
-
R. Motwani, Models & Issues in Data Streams, PODS 2002
Stream Projects
Amazon/Cougar (Cornell): sensors
Aurora (Brown/MIT): sensor monitoring, dataflow
Hancock (AT&T): telecom streams
Niagara (OGI/Wisconsin): Internet XML databases
OpenCQ (Georgia): triggers, incr. view maintenance
Stream (Stanford): general-purpose DSMS
Tapestry (Xerox): pub/sub content-based filtering
Telegraph (Berkeley): adaptive engine for sensors
Tribeca (Bellcore): network monitoring
-
Optimizing Multiple Distributed Stream Queries Using
Hierarchical Network Partitions
Sangeetha Seshadri*
Jointly with: Vibhore Kumar*, Brian F. Cooper, Ling Liu* and Karsten Schwan *
*College of Computing
Georgia Tech
Yahoo! Research
IPDPS '07
March 29th 2007
-
Talk Outline
Motivation
Challenges
Our Approach
Experimental Results
Future Work
-
Distributed Data Stream Systems
[Figure labels: Weather; Local Weather; Web sources; Flight information; Travel Agent; Centralized DB]
What is the status of my flight?
Can low-capacity flights be cancelled?
-
Motivation
Lots of data produced in lots of places
Examples: operational information systems, scientific collaborations, web traffic data, financial applications
Centralized processing does not scale
-
Challenges
Choosing efficient deployments.
Fast and efficient initial deployments.
Utilize reuse opportunities.
Handling dynamic nature of system.
Queries arrive or leave.
Nodes join (recover) or leave (fail).
Network conditions change.
Data conditions (e.g. rate) change.
-
Approach Outline
Typical Approaches: Query Planning | Deployment | Adaptivity
Our Approach: Query Planning & Deployment | Adaptivity
-
Query Planning
SELECT * FROM A ⋈ B ⋈ C
[Figure: sources A, B, C and a sink, with two alternative join plans: (B ⋈ C) ⋈ A and (A ⋈ B) ⋈ C]
-
Query Deployment
[Figure labels: sources A, B, C; nodes N1 to N5; sinks Sink1 to Sink5; operators A ⋈ B and (A ⋈ B) ⋈ C]
-
An Illustrative Example..
SELECT * FROM A ⋈ C
SELECT * FROM A ⋈ B ⋈ C
-
Why an integrated approach?
Integrated approach decreases cost by more than 50%
Setup: 64 node network, 100 queries over 5 stream sources each.
Y-axis represents communication costs.
-
Problem
Massive search space.
Example: 5 stream sources, 64 nodes: approximately 2,880,000,000 plans considered.
Lemma 1: [equation garbled in extraction: size of the exhaustive search space in terms of K and N]
Our Solution: trade some optimality for smaller search space
-
Solution
Organize the nodes into a virtual Network Hierarchy.
Operator reuse through Stream Advertisements
Two approximation based algorithms:
Top-Down
Bottom-Up
-
Optimization Metric
Minimize 'network usage'
Network usage: total amount of data in transit at any point in time.
Encapsulates both bandwidth and latency of links.
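One common way to quantify "data in transit" is to sum, over every link a deployment uses, the data rate on that link multiplied by the link's latency. The slides do not spell out the formula, so the sketch below is an assumed reading of the metric, with a hypothetical function name and made-up units.

```python
def network_usage(links):
    """Total data in transit, assuming usage = sum of (data rate x latency) per link.

    `links` is an iterable of (rate, latency) pairs, e.g. rate in KB per ms
    and latency in ms, so each product is the KB in flight on that link.
    """
    return sum(rate * latency for rate, latency in links)


# Two hypothetical deployments of the same query:
plan_a = [(100, 10), (50, 20)]   # 1000 + 1000 KB in flight
plan_b = [(100, 2), (50, 30)]    # 200 + 1500 KB in flight
```

Under this reading, a plan that sends high-rate streams over short (low-latency) links scores better than one that sends them over long links, which is how a single number can reflect both bandwidth and latency.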
-
Network Hierarchy
Coordinator Nodes
Cluster network nodes based on cost.
User-defined parameter max_cs
-
Stream Advertisements for Reuse
[Figure labels: A; B; A, C and A ⋈ C; B; C; A ⋈ C; Coordinator Nodes]
-
Optimization Algorithms
Top-Down
Bottom-Up
-
Planning algorithms
Top down
[Figure labels, top to bottom: A ⋈ B ⋈ C ⋈ D; C ⋈ D and A ⋈ B; C ⋈ D and A ⋈ B; D, C, B, A]
-
Top-Down Algorithm: Features
Reduced search space: search space reduced by a factor [expression in N, h, max_cs and K; garbled in extraction].
(h = height of hierarchy, N = network size, K = number of sources).
User-defined parameter max_cs allows tuning the trade-off between search space and sub-optimality.
Operators re-used when beneficial through stream advertisements.
-
Planning algorithms
Bottom up
[Figure labels: A ⋈ B ⋈ C ⋈ D; A ⋈ B; A ⋈ B ⋈ C ⋈ D; A ⋈ B; A ⋈ B; D, C, B, A]
-
Bottom-Up Algorithm: Features
Reduced search space.
Deploys only sub-queries within current cluster.
Analytical bounds: search space reduced by factor [expression missing from the transcript].
Operators re-used when beneficial.
But, may choose sub-optimal join-orders.
-
Experiments
Simulation and prototype based experiments.
128-node network: used GT-ITM internetwork topology generator.
Uniformly random workload generator: 10 sources, 100 queries, 2-5 join operators, random sink placements.
-
Cost with Bottom-Up Algorithm
-
Comparison with existing approaches
-
Comparison of Search Space
-
Future Work
We have built a prototype based on IFLOW, a distributed data stream system built at Georgia Tech.
Aggregations
Modifying existing deployments at runtime
Relaxing filter conditions
Modifying join ordering at runtime.
-
Related Work
Distributed query optimization: Distributed INGRES, R*, SDD-1
Stream data processing engines
Centralized: STREAM, Aurora, TelegraphCQ
Distributed: Borealis, Flux
-
Conclusion
Integrated approach to query optimization
Hierarchical clustering of network and stream advertisements.
Approximation based algorithms
Top-Down
Bottom-Up
Design Highlights
Trade some optimality for smaller search space.
Decrease search space while offering bounds on the sub-optimality.
-
For further information
http://www.cc.gatech.edu/~sangeeta Contact: [email protected]
Thank You!
-
Deployment Times
-
Example
Simple use-case for pushing down selections:
Query 1:
SELECT FLIGHTS.Number, FLIGHTS.Status, CARRIER_CODES.Name
FROM FLIGHTS, CARRIER_CODES
WHERE FLIGHTS.Departing = 'ATLANTA'
AND FLIGHTS.Carrier_Code = CARRIER_CODES.Code
AND FLIGHTS.Departure_terminal = 'TERMINAL SOUTH'
Query 2:
SELECT FLIGHTS.Number, FLIGHTS.Status, CARRIER_CODES.Name
FROM FLIGHTS, CARRIER_CODES
WHERE FLIGHTS.Departing = 'ATLANTA'
AND FLIGHTS.Carrier_Code = CARRIER_CODES.Code
AND FLIGHTS.Departure_terminal = 'TERMINAL NORTH'
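The two queries differ only in the terminal predicate, so the common selection on `Departing` can be applied once, before the join, and the joined result routed to whichever query it satisfies. The sketch below illustrates that idea only; it is not the deployment produced by the system described in the slides, and the function name, the dictionary-shaped flight tuples, and the in-memory lookup table are assumptions.

```python
def shared_flight_plan(flights, carrier_codes):
    """Evaluate Query 1 and Query 2 together, sharing the selection and the join.

    `flights` is an iterable of dicts with keys Number, Status, Departing,
    Carrier_Code, Departure_terminal; `carrier_codes` is an iterable of
    (Code, Name) pairs. Returns one result list per terminal.
    """
    names = dict(carrier_codes)  # Code -> Name, used as the join lookup
    results = {"TERMINAL SOUTH": [], "TERMINAL NORTH": []}
    for flight in flights:
        # Shared selection, pushed below the join: most tuples stop here.
        if flight["Departing"] != "ATLANTA":
            continue
        terminal = flight["Departure_terminal"]
        if terminal not in results:
            continue
        # Shared join on the carrier code.
        name = names.get(flight["Carrier_Code"])
        if name is None:
            continue
        results[terminal].append((flight["Number"], flight["Status"], name))
    return results
```

Filtering before the join means flights departing from other cities are never looked up or shipped further, which is the saving that pushing down selections is meant to capture.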
-
The Big Picture
Large number of possibilities
System Model
Stream processing systems (SQL-style queries)
Pub-sub systems
Runtime annotators (keyword-based queries)
Trade-offs
Cost with search space
Reliability
Availability
Adaptivity
Admission control
Moving operators
Dropping data
Migrating plans
-
Real Enterprise Workload
Delta Airlines Operational information system
Q1 (15%): Terminal Overhead Display (Lifetime = 12 hours)
Q2 (80%): Gate Agent Query (Lifetime = 2 hours)
Q3 (5%): Ad-hoc flight status monitoring queries (Lifetime = 6 hours)
-
Real Enterprise Workload
-
Backups
-
Sliding Window Approximation
Why?
Approximation technique for bounded memory
Natural in applications (emphasizes recent data)
Well-specified and deterministic semantics
Issues
Extend relational algebra, SQL, query optimization
Algorithmic work
Timestamps?
[Figure: example bit stream 0 1 1 0 0 0 0 1 1 1 0 0 0 0 0 1 0 1 0 1 0 with a sliding window]
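A bit stream like the one pictured is the usual setting for the question "how many 1s are in the last N bits?". The sketch below answers it exactly by storing the window itself, which costs memory proportional to N; it is meant only as a baseline to show what a bounded-memory approximation would replace. The function name is made up for the example.

```python
from collections import deque


def ones_in_window(bits, size):
    """After each arriving bit, yield the number of 1s among the last `size` bits."""
    window = deque()
    count = 0
    for bit in bits:
        window.append(bit)
        count += bit
        if len(window) > size:
            # The oldest bit slides out of the window.
            count -= window.popleft()
        yield count
```

Only the most recent `size` bits influence the answer, which is the sense in which a sliding window emphasizes recent data and has deterministic semantics.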