Data Stream Management

Authors

Lukasz Golab & M. Tamer Özsu

Supervised by

Dr. Sakti Pramanik

Presented by

AKM Tauhidul Islam

Outline

• Introduction

o Motivation

o Problem Statement

o Definitions

• Data Stream Management System (DSMS)

• Streaming Data Warehouse (SDW)

• Discussion

Introduction

• Stream data - Produced incrementally over time, rather than being available in full before its processing begins

• Examples: transaction data streams and log streams

o Credit card purchases, telecommunications, web accesses

o Climate data, GPS tracking, sensor networks, IP networks

• Applications:

o Sensor Networks - E.g. TinyDB

o Network Traffic Analysis - E.g. traffic statistics and critical condition detection

o Financial Tickers - On-line analysis of stock prices, to discover correlations and identify trends

o Transaction Log Analysis - E.g. Web click streams and telephone calls

Motivation

• Massive data sets:

o Huge numbers of users, e.g.,

• AT&T long-distance: ~ 300M calls/day

• AT&T IP backbone: ~ 10B IP flows/day

o Highly detailed measurements, e.g.,

• NOAA: satellite-based measurements of earth geodetics

o Huge number of measurement points, e.g.,

• Sensor networks with huge number of sensors

• Near real-time analysis

o ISP: controlling service levels

o NOAA: tornado detection using weather radar

o Hospital: Patient monitoring

• Traditional data feeds

o Simple queries (e.g., value lookup) needed in real-time

o Complex queries (e.g., trend analyses) performed off-line

Problem Statement

                      DBMS                   DSMS
Data                  Persistent relations   Streams, time windows
Data Access           Random                 Sequential, one-pass
Updates               Arbitrary              Append only
Update Rates          Relatively low         High, bursty
Processing Model      Query driven           Data driven
Queries               One time               Continuous
Query Plans           Fixed                  Adaptive
Query Optimizations   One query              Multi-query
Query Answers         Exact                  Exact or approximate
Latency               Relatively high        Low

                      Data Warehouse         SDW
Data                  Historical             Recent and historical
Update Frequency      Low                    High
Update Propagation    Synchronous            Asynchronous
ETL Process           Complex                Fast, light-weight

Fig : Comparison of Data Stream Management Systems and Streaming Data Warehouses with traditional database and warehouse systems

Definitions

• Non-blocking Execution : Query operator Q doesn't require its entire input before producing results

• Monotonicity : All previous results are preserved

o Q(τ) ⊆ Q(τ′), for query operator Q, where τ ≤ τ′

o Q is monotonic only if it is non-blocking

• Delta : When the monotonicity property doesn't hold, the operator produces updates to the result at time τ, as negative / positive deltas

• Punctuation : Special tuple containing a predicate that is guaranteed to be satisfied by the remainder of the data stream

• Heartbeat : Punctuations that govern the timestamps of future tuples

• Average slowdown = Tuple response time / shortest processing time



Outline

• Data Stream Management System (DSMS)

o Stream Data Models

o Query Language & Semantics

o Query Processing

o Query Optimization

• Streaming Data Warehouse (SDW)

• Discussion

DSMS

• Input Buffer/Monitor

o Captures streaming inputs

o May collect statistics on streams

o Random sampling

• Working storage

o Stores recent stream data

o Used for query processing

• Local Storage

o Used for metadata

o Foreign key mapping

o Naming translation

• Query Processor

o Converts queries into execution plans

o Changes plans for different workloads / input rates

o Contains buffers, operator queues

o Deploys scheduling methods

• Continuous Query Repository

• Results

o May be sent to users or to other applications

o May be stored in an SDW for further analysis

Fig : i) Abstract reference architecture of a DSMS & ii) A traditional DBMS

Stream Data Models

• Base streams – Produced by sources, append only

• Derived streams – Produced by continuous queries

• Streams have a fixed schema

o <timestamp, source IP Addr, source port, destination IP Addr, destination port, size>

• Data Stream Models

o Describe underlying signals S : [1 ... N] -> R

o Aggregate model – Each item carries the range value for a signal

o Cash Register model – Each item carries a partial, non-negative range value

o Turnstile model – Each item carries a partial range value

o Reset model – Each item carries a range value that resets the previous value of a signal

• Stream Windows – Important from the user and query points of view

o Fixed window

o Sliding window

o Landmark window

o Jumping window – Updates every k ticks or k arrivals

o Tumbling window – Updates every k ticks or k arrivals, where k = window size
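The difference between the data stream models can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `apply_updates`, the model labels, and the representation of a stream item as an `(index, value)` pair are hypothetical and not taken from the slides.

```python
from collections import defaultdict


def apply_updates(model, updates):
    """Rebuild the signal S from a list of (index, value) stream items.

    Hypothetical helper: `model` is one of "aggregate", "cash_register",
    "turnstile" or "reset".
    """
    signal = defaultdict(int)
    for i, v in updates:
        if model == "cash_register":
            # Partial, non-negative values are added to S[i].
            if v < 0:
                raise ValueError("cash register items must be non-negative")
            signal[i] += v
        elif model == "turnstile":
            # Partial values of either sign are added to S[i].
            signal[i] += v
        elif model == "aggregate":
            # Each item carries the full value of S[i], seen once per index.
            if i in signal:
                raise ValueError("aggregate model: one item per index")
            signal[i] = v
        elif model == "reset":
            # Each item replaces the previous value of S[i].
            signal[i] = v
        else:
            raise ValueError(f"unknown model: {model}")
    return dict(signal)


print(apply_updates("cash_register", [(1, 2), (1, 3), (2, 1)]))
print(apply_updates("turnstile", [(1, 5), (1, -2)]))
print(apply_updates("reset", [(1, 5), (1, 2)]))
```

The same item sequence can therefore describe different signals depending on the model the consumer assumes.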

Query Language & Semantics

• Query Algebra

o Stream-to-stream

o Mixed Algebra

• Query Operators – Similar syntax to DBMS, very different semantics

• Relation-like query operators

o Selection, projection, union – stateless operators

o Join – window joins

o Aggregate operators

• DSMS exclusive operators

o Buffered sort operator

o Random sampling operator

o User defined aggregate functions (UDAF)

• Query Languages

o GSQL

o CQL

o ESL

Query Operators

• Selections and (duplicate-preserving) projections are straightforward

o Local, per-element operators

o Duplicate-eliminating projection is like grouping

o Projection needs to include the ordering attribute

o No restriction for position-ordered streams

• Aggregate expressions:

o distributive: sum, count, min, max

o algebraic: average

o holistic: count-distinct, median

Fig: Simple continuous query operators: i) Selection, ii) Count, iii) Negation
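The three aggregate classes differ in what must be kept per partition (or per window pane) to obtain the overall answer. The following Python sketch illustrates this on made-up data; the variable names and the sample values are assumptions for illustration only.

```python
from statistics import median

# Hypothetical partial inputs, e.g. three panes of one window.
parts = [[3, 1, 4], [1, 5], [9, 2, 6]]

# Distributive: the overall result is computed from per-part results.
total = sum(sum(p) for p in parts)
maximum = max(max(p) for p in parts)

# Algebraic: the overall result is computed from a fixed-size summary
# per part, here the pair (sum, count).
pairs = [(sum(p), len(p)) for p in parts]
average = sum(s for s, _ in pairs) / sum(c for _, c in pairs)

# Holistic: no fixed-size summary suffices in general. The median of
# the per-part medians differs from the true median of all values.
all_values = [x for p in parts for x in p]
true_median = median(all_values)
median_of_medians = median(median(p) for p in parts)

print(total, maximum, average)
print(true_median, median_of_medians)
```

This is why holistic aggregates such as median and count-distinct usually need extra state or approximation in a DSMS.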

Query Operators

• Join operators are problematic on streams

o May need to join stream tuples that are arbitrarily far apart

o Operations on implicit / explicit windows

• SELECT * FROM S1, S2

WHERE S1.attr = S2.attr

GROUP BY S1.timestamp/60 AS minute

• SELECT * FROM S1, S2

WHERE S1.attr = S2.attr

AND |S1.timestamp - S2.timestamp| <= w

• SELECT * FROM S1 [RANGE w], S2 [RANGE w]

WHERE S1.attr = S2.attr


Fig: Simple continuous query operators: i) Join, ii) Sliding window join with state
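A sliding window join with state can be sketched as follows. This is a minimal illustration, not the implementation of any particular DSMS: the class name `SlidingWindowJoin` and its method are hypothetical, tuples are reduced to `(timestamp, attr)`, and timestamps are assumed to be non-decreasing across both streams.

```python
from collections import deque


class SlidingWindowJoin:
    """Equi-join of streams S1 and S2 over time-based windows of size w."""

    def __init__(self, w):
        self.w = w
        # One window of (timestamp, attr) pairs per input stream.
        self.windows = {"S1": deque(), "S2": deque()}

    def insert(self, stream, ts, attr):
        """Process one arrival and return the new join results."""
        other = "S2" if stream == "S1" else "S1"
        win = self.windows[other]
        # Expire tuples of the opposite window that are older than ts - w.
        while win and win[0][0] < ts - self.w:
            win.popleft()
        # Probe the opposite window for matching tuples.
        if stream == "S1":
            results = [(ts, t2, attr) for t2, a2 in win if a2 == attr]
        else:
            results = [(t1, ts, attr) for t1, a1 in win if a1 == attr]
        # Remember the new tuple so that later arrivals can join with it.
        self.windows[stream].append((ts, attr))
        return results


join = SlidingWindowJoin(w=10)
print(join.insert("S1", 1, "a"))
print(join.insert("S2", 5, "a"))
print(join.insert("S2", 12, "a"))
print(join.insert("S1", 13, "a"))
```

Each result is a triple (S1 timestamp, S2 timestamp, attr). In this sketch a window is only cleaned when the opposite stream delivers a tuple, which keeps the code short but lets state linger if one stream goes quiet.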

Query Processing

• Declarative queries -> Logical query plan -> Physical plan

o Directed Acyclic Graphs (nodes -> operators, edges -> data flow)

• Queries sharing memory/streams are combined into a single plan

Fig: a) Query plan for two queries: i) a join of streams S1 and S2 with a selection predicate on S1, and ii) an aggregate on S2. b) A continuous query with selection and tumbling window aggregation

• Scheduling

o FIFO, Round Robin – simple, not efficient

o Favoring operators with higher throughput – low latency

o Favoring operators with min processing time & selectivity – smaller queues

• Heartbeats & Punctuations

o Typically issued by sources

o Reduce the amount of state needed by operators

o Prevent operators from doing unnecessary work

o Query plans can also issue heartbeats to avoid pipeline stalls and delayed results

SELECT minute, SUM(size) FROM s WHERE destination_port <= 80 GROUP BY timestamp/60 AS minute
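The query above can be read as a selection followed by a tumbling window aggregation with one-minute windows. A rough Python sketch, assuming tuples of the hypothetical form `(timestamp_in_seconds, destination_port, size)` arriving in timestamp order:

```python
def per_minute_sum(tuples):
    """Yield (minute, total size) for tuples with destination_port <= 80."""
    current = None  # minute of the window that is currently open
    total = 0
    for ts, port, size in tuples:
        if port > 80:
            continue  # selection: the tuple never reaches the aggregate
        minute = ts // 60
        if current is not None and minute != current:
            # A tuple from a later minute closes the open window.
            yield current, total
            total = 0
        current = minute
        total += size
    if current is not None:
        # End of the finite demo input; a real stream has no such point.
        yield current, total


packets = [(0, 80, 100), (30, 443, 50), (59, 22, 10), (61, 80, 7), (130, 80, 1)]
print(list(per_minute_sum(packets)))
```

Note that a window is only emitted when a later qualifying tuple arrives. If no such tuple shows up, the result is delayed, which is the kind of stall that heartbeats issued by sources or by the query plan are meant to avoid.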

Query Processing Cont..

• Queries as views & Negative tuples

o Negative tuples are implemented as signed tuples on explicit windows

o Explicit windows are time-based or count-based

o Generated negative tuples are processed by cascading operators

o Negative tuples on aggregate operators

• Count – easy to compute

• Max/Min – memory intensive

o Twice as many tuples are processed

• Can be avoided for monotonic operators

• Tag tuples with an expiration time

• Operators handled this way are known as weak non-monotonic

Fig: a) Maintaining a view over a sliding window join using negative tuples b) Finding the maximum element in a sliding window
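The contrast between count and max under window expiration can be sketched in Python. Both function names are hypothetical. The first function consumes signed tuples, where sign +1 marks an arrival and -1 a negative tuple. The second uses a monotonic deque, a common technique swapped in here for illustration and not necessarily the method shown in the figure.

```python
from collections import deque


def window_counts(signed_tuples):
    """Maintain COUNT over a window from (sign, value) tuples."""
    count = 0
    history = []
    for sign, _value in signed_tuples:
        count += sign  # a negative tuple simply undoes one arrival
        history.append(count)
    return history


def sliding_max(values, n):
    """MAX over the last n values, keeping only viable candidates."""
    candidates = deque()  # (index, value) pairs with decreasing values
    result = []
    for i, v in enumerate(values):
        # Older values that are not larger than v can never be the max again.
        while candidates and candidates[-1][1] <= v:
            candidates.pop()
        candidates.append((i, v))
        # Drop the front candidate once it has left the window.
        if candidates[0][0] <= i - n:
            candidates.popleft()
        result.append(candidates[0][1])
    return result


print(window_counts([(1, "a"), (1, "b"), (-1, "a"), (1, "c")]))
print(sliding_max([5, 3, 4, 1, 2], 3))
```

Count needs a single counter, whereas max has to remember every value that might still become the maximum after the current one expires.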

Query Optimization

• Finds efficient query plans

• DBMSs focus on minimizing I/O, while DSMSs try to reduce cost per unit time

• Static Analysis and Query Rewriting

o Ensures the query can be evaluated in a non-blocking fashion with limited memory

• S(A,B,C), T(D,E)

• π_A(σ_{A=D ∧ A>10 ∧ D<20}(S × T)) : Yes

• π_A(σ_{A=D}(S × T)) : No

• π_A(σ_{B<D ∧ A>10 ∧ D<20}(S × T)) : Yes, if no duplicates

o Common Rules

• Evaluate inexpensive predicates before complex ones

• Perform selections before joins

o Rules for continuous query operators only

• Selections and explicit time-based windows commute

• Selections and explicit count-based windows don’t commute

o Rewrite based on input(s) constraints

• A join of unbounded streams becomes feasible if matching tuples arrive at most t time units apart

• Multi Query Optimization

Fig : Separate and shared query plans for Q1 and Q2
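The rule that selections commute with time-based windows but not with count-based windows can be checked on a small example. The helper names and the sample stream of `(timestamp, value)` pairs below are made up for illustration.

```python
def time_window(stream, now, w):
    """Tuples with timestamps in (now - w, now]."""
    return [(ts, v) for ts, v in stream if now - w < ts <= now]


def count_window(stream, n):
    """The n most recent tuples."""
    return stream[-n:]


def select(stream, pred):
    """Tuples whose value satisfies the predicate."""
    return [(ts, v) for ts, v in stream if pred(v)]


stream = [(1, 5), (2, 50), (3, 7), (4, 60), (5, 8)]
big = lambda v: v > 10

# Time-based window: both orders give the same result.
time_a = select(time_window(stream, now=5, w=3), big)
time_b = time_window(select(stream, big), now=5, w=3)

# Count-based window: filtering first changes which tuples are "the last 2".
count_a = select(count_window(stream, 2), big)
count_b = count_window(select(stream, big), 2)

print(time_a, time_b)
print(count_a, count_b)
```

With the time-based window, membership depends only on each tuple's own timestamp, so the selection can be pushed below the window. With the count-based window, removing tuples first lets older tuples slide back into the window.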

Operator Optimization

• Join

o Need to remove expired tuples

o Expiration at each time tick is costly

o Periodic removal reduces the expiration cost but increases the join processing cost

o Probe streams with fewer matches first

• Aggregation

o Synopses allow efficient re-computation

o Prefix synopses

• Suitable for subtractable aggregates

• For example: Sum, Count

o Interval synopses

• Suitable for distributive aggregates

• For example: Min, Max

• Need to access log b intervals

• Basic interval synopses require b accesses

o Holistic aggregates require additional info in synopses

o Algebraic aggregates are computed from derived info

• Avg = Sum / Count

Fig : i) Prefix synopses, ii) Interval synopses, iii) Basic interval synopses
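A rough Python sketch of a prefix synopsis for SUM and a basic interval synopsis for MAX follows. The class names, the list-based layout, and the sample values are assumptions for illustration; the logarithmic interval synopsis from the figure is not reproduced here.

```python
from itertools import accumulate


class PrefixSumSynopsis:
    """Stores running sums so any range sum needs two lookups."""

    def __init__(self, values):
        self.prefix = [0] + list(accumulate(values))

    def window_sum(self, start, end):
        """Sum of values[start:end], obtained by subtraction."""
        return self.prefix[end] - self.prefix[start]


class BasicIntervalSynopsis:
    """Stores one maximum per disjoint interval of `size` values."""

    def __init__(self, values, size):
        self.maxima = [
            max(values[i:i + size]) for i in range(0, len(values), size)
        ]

    def window_max(self, first, last):
        """Max over intervals first..last-1; reads one entry per interval."""
        return max(self.maxima[first:last])


values = [3, 1, 4, 1, 5, 9, 2, 6]
sums = PrefixSumSynopsis(values)
maxes = BasicIntervalSynopsis(values, size=2)

print(sums.window_sum(2, 6))
print(maxes.window_max(1, 3))
```

Subtraction works for SUM and COUNT because they can be "undone", but there is no such inverse for MIN or MAX, which is why those aggregates combine per-interval results instead.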

Query Optimization

• Load Shedding & Approximation

o Random sampling

o Semantic load shedding to drop less important tuples

o Objective is to minimize the drop in accuracy

• Challenging for complex query plans with multiple streams and operators

• Load Balancing

o Write part of stream if possible

• Adaptive Query Optimization

o Query cost per unit time may change

o Query plan is dynamically re-ordered based on speed, selectivity and queue length

o Trade-off between the resulting adaptivity and the overhead of dynamic routing

• Distributed Query Optimization

o Parallelizing and distributing the system itself

• Split query plan across nodes

• Partition the streams

o Shifting partial computation to the sources

• In-network processing reduces the communication overhead
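Random load shedding for a SUM query can be sketched as follows: each tuple is kept with some probability and the sum of the kept tuples is scaled up to estimate the true sum. The function name, the `keep_prob` parameter, and the fixed seed are assumptions for illustration.

```python
import random


def shed_and_estimate_sum(values, keep_prob, seed=0):
    """Drop each value with probability 1 - keep_prob and estimate the sum.

    Returns (number of values kept, estimated sum of all values).
    """
    rng = random.Random(seed)  # fixed seed only to make the demo repeatable
    kept = [v for v in values if rng.random() < keep_prob]
    # Scaling by 1 / keep_prob makes the estimate unbiased.
    return len(kept), sum(kept) / keep_prob


values = list(range(100))
print(sum(values))
print(shed_and_estimate_sum(values, keep_prob=1.0))
print(shed_and_estimate_sum(values, keep_prob=0.5))
```

Lower keep probabilities reduce the processing load but increase the variance of the estimate, which is the accuracy trade-off that load shedding tries to control.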



Outline

• Streaming Data Warehouse (SDW)

o Data ETL

o Update Propagation

o Data Expiration

o Update Scheduling

o Query Processing on SDW

• Discussion

SDW

• Data streams/feeds arrive periodically

• ETL process - data cleaning, standardization and so on

• Table types

o Base tables – Sourced directly from raw files

o Derived tables – Materialized views over base or other derived tables

• Update scheduler selects the update order of files

o Based on dependencies and workloads

Fig : Abstract reference architecture of a SDW

ETL

• Simple tasks – un-compression, standardization

• Complex tasks

o Joining new data with relations holding descriptive attributes

• Relations R are disk-based

• Data buffer in main memory

• Mesh Join

o Accesses blocks of R in sequential order

o A tuple is removed from the buffer once it has been joined with all blocks of R

o Loading data into tables

• Tables are partitioned into timestamp ranges

• Updates affect a small number of recent partitions

Fig : Partitioning a table on a timestamp attribute
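Partitioning on a timestamp attribute can be sketched as routing each tuple to the partition that covers its timestamp range. The function name `load_batch`, the fixed partition `width`, and the in-memory dictionary standing in for a table are all assumptions for illustration.

```python
from collections import defaultdict


def load_batch(partitions, batch, width):
    """Append (timestamp, payload) tuples to their timestamp partitions.

    Returns the set of partition keys touched by this batch.
    """
    touched = set()
    for ts, payload in batch:
        key = ts // width  # partition covering [key * width, (key + 1) * width)
        partitions[key].append((ts, payload))
        touched.add(key)
    return touched


table = defaultdict(list)
load_batch(table, [(5, "a"), (120, "b"), (180, "c")], width=100)

# A new batch of recent data touches only the newest partition.
print(load_batch(table, [(205, "d"), (230, "e"), (299, "f")], width=100))

# A late-arriving tuple additionally touches one older partition.
print(load_batch(table, [(195, "g"), (301, "h")], width=100))
```

Because new data mostly carries recent timestamps, only the touched partitions (and whatever is derived from them) need to be updated.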

Update Propagation

• Goals

o Propagate changes across layers of derived tables

o Avoid recomputing an entire derived table

o Efficiently identify partition dependencies

• Partition dependencies may not be obvious from the SQL specification

Fig : Updating a partitioned derived table

Fig : Partition dependency

Data Expiration

• Tuples may have variable lifetimes

• Tables can be partitioned on insertion and expiration timestamps

o Partitions may not have equal size

• One solution is to assign updates in round robin fashion

Fig : Partitioning a table on two attributes: insertion and expiration timestamp

Update Scheduling

• External sources push new data

• There are many data feeds and derived tables

• A scheduler controls resource usage

• Goal: minimize data staleness

• A priority-weighted staleness metric is used to select the table updates that reduce it the most

Fig : plot of the staleness of a SDW table over time
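One possible reading of a priority-weighted staleness scheduler is sketched below. Everything here is an assumption for illustration: the `Table` fields, the definition of staleness as the current time minus the timestamp of the newest loaded data, and the greedy choice of the update with the largest weighted reduction. It also ignores update cost and table dependencies.

```python
from dataclasses import dataclass


@dataclass
class Table:
    name: str
    priority: float
    loaded_up_to: int      # timestamp of the newest data already loaded
    available_up_to: int   # timestamp of the newest data waiting in the feed


def weighted_staleness(tables, now):
    """Sum over tables of priority * (now - newest loaded timestamp)."""
    return sum(t.priority * (now - t.loaded_up_to) for t in tables)


def pick_next(tables):
    """Choose the pending update that reduces weighted staleness the most."""
    pending = [t for t in tables if t.available_up_to > t.loaded_up_to]
    if not pending:
        return None
    return max(
        pending,
        key=lambda t: t.priority * (t.available_up_to - t.loaded_up_to),
    )


tables = [
    Table("A", priority=1, loaded_up_to=0, available_up_to=100),
    Table("B", priority=5, loaded_up_to=70, available_up_to=100),
    Table("C", priority=2, loaded_up_to=100, available_up_to=100),
]

before = weighted_staleness(tables, now=100)
chosen = pick_next(tables)
chosen.loaded_up_to = chosen.available_up_to  # apply the update
after = weighted_staleness(tables, now=100)
print(chosen.name, before, after)
```

Here table B is refreshed first even though table A is further behind, because B's higher priority makes its update worth more under the weighted metric.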

Query Processing

• Overhead of partitioned tables

o Partitions that are too small are difficult to manage

o Partitions that are too big need to be recomputed as new data arrives

o Solution : Bigger partitions as data becomes old

• Data Availability and Concurrency Control

o Tables are updated frequently

o Queries should not be blocked and should output consistent data

o Solution : Multi-version concurrency control at partition level

Discussion

• End-to-end data stream management

• DSMS allows relational-like queries as well as pattern matching and event processing queries

• Query semantics are different from traditional ones

• SDW research problems were introduced recently

• The lecture didn’t cover data mining techniques, fault tolerance and distributed processing

References

1. Data Stream Management, Lukasz Golab & M. Tamer Özsu

2. Data stream management system – introduction, concepts and issues, Morton Lindeberg, University of Oslo
