stream data management

Upload: suratsujit

Post on 14-Apr-2018


TRANSCRIPT

  • 7/29/2019 Stream Data Management

    1/68

    Sangeetha Seshadri

    [email protected]

    Data Stream Processing: An Overview

    CS 4440 Lecture 6

  • 7/29/2019 Stream Data Management

    2/68

    Agenda

    Data Streams

    What are they?

    Why now? Applications..

    DSMS: Architecture & Issues

    Query Processing

  • 7/29/2019 Stream Data Management

    3/68

    Data Streams: What and Where?

    Continuous, unbounded, rapid, time-varying streams of data elements (tuples).

    Occur in a variety of modern applications

    Network monitoring and traffic engineering

    Sensor networks, RFID tags

    Telecom call records

    Financial applications

    Web logs and click-streams

    Manufacturing processes

    DSMS = Data Stream Management System

    stanfordstreamdatamanager

  • 7/29/2019 Stream Data Management

    4/68

    DBMS versus DSMS

    DBMS:

    Persistent relations

    One-time queries

    Random access

    Access plan determined by query processor and physical DB design

    DSMS:

    Transient streams (and persistent relations)

    Continuous queries

    Sequential access

    Unpredictable data characteristics and arrival patterns

  • 7/29/2019 Stream Data Management

    5/68

    Continuous Queries

    One-time queries: run once to completion over the current data set.

    Continuous queries: issued once and then continuously evaluated over the data.

    Example:

    Notify me when the temperature drops below X

    Tell me when prices of stock Y > 300
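
As a minimal sketch of the first example, the continuous query below is registered once and then evaluated on every arriving reading. The function name `notify_when_below`, the `(timestamp, temperature)` tuple layout, and the choice to notify only when the value crosses the threshold downward are illustrative assumptions, not part of the lecture.

```python
def notify_when_below(readings, threshold):
    """Continuous query: report each time the temperature drops below threshold.

    readings is any iterable of (timestamp, temperature) pairs and may be
    unbounded; results are produced incrementally as readings arrive.
    """
    was_below = False  # assumption: the stream starts at or above the threshold
    for timestamp, temperature in readings:
        is_below = temperature < threshold
        if is_below and not was_below:
            # Edge-triggered: notify only on the downward crossing.
            yield timestamp, temperature
        was_below = is_below
```

Because the function is a generator, it never needs the whole input: a caller can iterate over it while the sensor keeps producing readings, which is the "issued once, evaluated continuously" behaviour described above.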

  • 7/29/2019 Stream Data Management

    6/68

    The (Simplified) Big Picture

    [Figure: input streams flow into the DSMS, which has a scratch store; queries are registered with the DSMS; it produces a streamed result and a stored result, and is connected to an archive and to stored relations.]

  • 7/29/2019 Stream Data Management

    7/68

    (Simplified) Network Monitoring

    [Figure: network measurements and packet traces flow into the DSMS, which has a scratch store; monitoring queries are registered with the DSMS; it outputs intrusion warnings and online performance metrics, and is connected to an archive and to lookup tables.]

  • 7/29/2019 Stream Data Management

    8/68

    Triggers?

    Recall triggers in traditional DBMSs? Why not use triggers to process continuous queries over data streams?

  • 7/29/2019 Stream Data Management

    9/68

    R. Motwani, Models & Issues in Data Streams, PODS 2002

    Making Things Concrete

    Outgoing (call_ID, caller, time, event)

    Incoming (call_ID, callee, time, event)

    event = start or end

    [Figure: ALICE calls BOB; each is attached to a central office, and the central offices feed the Outgoing and Incoming streams to the DSMS.]

  • 7/29/2019 Stream Data Management

    10/68

    Query 1 (self-join)

    Find all outgoing calls longer than 2 minutes

    SELECT O1.call_ID, O1.caller

    FROM Outgoing O1, Outgoing O2

    WHERE (O2.time - O1.time > 2

    AND O1.call_ID = O2.call_ID

    AND O1.event = start

    AND O2.event = end)

    Result requires unbounded storage

    Can provide result as data stream

    Can output after 2 min, without seeing end
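
The last point can be sketched as follows: a call is reported as soon as the stream's clock shows it has been open for more than the threshold, so the matching end event is never needed. This is only an illustration. The function name `long_calls`, the tuple layout, and the assumption that events arrive in nondecreasing time order are mine; the output is also driven by later arrivals rather than by a timer.

```python
def long_calls(outgoing, threshold=2):
    """Yield (call_ID, caller) once a call has lasted more than `threshold` minutes.

    outgoing: iterable of (call_ID, caller, time, event) tuples in
    nondecreasing time order, where event is "start" or "end".
    """
    open_calls = {}  # call_ID -> (caller, start time); only calls in progress
    for call_id, caller, time, event in outgoing:
        # Report every open call that has now exceeded the threshold.
        overdue = [cid for cid, (_, started) in open_calls.items()
                   if time - started > threshold]
        for cid in overdue:
            yield cid, open_calls.pop(cid)[0]
        if event == "start":
            open_calls[call_id] = (caller, time)
        elif event == "end":
            # Short call (or already reported): just discard its state.
            open_calls.pop(call_id, None)
```

State is kept only for calls that started within the last `threshold` minutes and have not ended, so memory depends on the call arrival rate rather than on the length of the stream.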

  • 7/29/2019 Stream Data Management

    11/68

    Query 2 (join)

    Pair up callers and callees

    SELECT O.caller, I.callee

    FROM Outgoing O, Incoming I

    WHERE O.call_ID = I.call_ID

    Can still provide result as data stream

    Requires unbounded temporary storage unless streams are near-synchronized
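
A hedged sketch of why synchronization matters: the symmetric hash join below buffers a tuple only until its partner arrives on the other stream. The function name `pair_calls`, the `(stream, call_ID, party)` layout, and the simplifying assumption that each call_ID appears exactly once per stream are illustrative, not taken from the slides.

```python
def pair_calls(events):
    """Symmetric hash join of Outgoing ("O") and Incoming ("I") on call_ID.

    events: iterable of (stream, call_ID, party) tuples, where party is the
    caller for "O" tuples and the callee for "I" tuples.
    Assumes each call_ID occurs once per stream, so matched state can be dropped.
    """
    pending = {"O": {}, "I": {}}  # per-stream buffer of unmatched tuples
    for stream, call_id, party in events:
        other = "I" if stream == "O" else "O"
        if call_id in pending[other]:
            partner = pending[other].pop(call_id)
            caller, callee = (party, partner) if stream == "O" else (partner, party)
            yield caller, callee
        else:
            pending[stream][call_id] = party
```

If the two streams are near-synchronized, each buffer stays small because partners arrive close together; if one stream lags arbitrarily, the other stream's buffer grows without bound.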

  • 7/29/2019 Stream Data Management

    12/68

    Query 3 (group-by aggregation)

    Total connection time for each caller

    SELECT O1.caller, sum(O2.time - O1.time)

    FROM Outgoing O1, Outgoing O2

    WHERE (O1.call_ID = O2.call_ID

    AND O1.event = start

    AND O2.event = end)

    GROUP BY O1.caller

    Cannot provide result in (append-only) stream

    Output updates?

    Provide current value on demand?

    Memory?
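
One way to read the "output updates" option is sketched below: instead of a final answer, the query emits a new (caller, running total) pair each time a call ends. The function name `connection_time_updates` and the tuple layout are assumptions made for illustration.

```python
from collections import defaultdict


def connection_time_updates(outgoing):
    """Emit (caller, total connection time so far) whenever a call ends.

    outgoing: iterable of (call_ID, caller, time, event) tuples,
    where event is "start" or "end".
    """
    started = {}               # call_ID -> (caller, start time)
    totals = defaultdict(int)  # caller -> accumulated connection time
    for call_id, caller, time, event in outgoing:
        if event == "start":
            started[call_id] = (caller, time)
        elif event == "end" and call_id in started:
            who, start_time = started.pop(call_id)
            totals[who] += time - start_time
            # Each output supersedes the previous value for this caller.
            yield who, totals[who]
```

The `totals` table holds one entry per distinct caller, which is the memory question raised above: the state grows with the number of groups, not with the number of tuples.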

  • 7/29/2019 Stream Data Management

    13/68

    DSMS Architecture & Issues

    Data streams and stored relations: architectural differences.

    Declarative language for registering continuous queries

    Flexible query plans and execution strategies

    Centralized? Distributed?

  • 7/29/2019 Stream Data Management

    14/68

    Agenda

    Data Streams What are they?

    Why now? Applications..

    DSMS: Architecture & Issues

    Query Processing

  • 7/29/2019 Stream Data Management

    15/68

    DSMS Issues

    Relation: Tuple Set or Sequence?

    Updates: Modifications or Appends?

    Query Answer: Exact or Approximate?

    Query Evaluation: One or Multiple Passes?

    Query Plan: Fixed or Adaptive?

  • 7/29/2019 Stream Data Management

    16/68

    Architectural Issues

    DSMS:

    Resource (memory, per-tuple computation) limited

    Reasonably complex, near real-time query processing

    Useful to identify what data to populate in database

    Query Evaluation: One pass

    Query Plan: Adaptive

    DBMS:

    Resource (memory, disk, per-tuple computation) rich

    Extremely sophisticated query processing, analysis

    Useful to audit query results of data stream systems

    Query Evaluation: Arbitrary

    Query Plan: Fixed

    N. Koudas, D. Srivastava (2003) AT&T Labs-Research

  • 7/29/2019 Stream Data Management

    19/68

    STREAM System Challenges

    Must cope with:

    Stream rates that may be high, variable, bursty

    Stream data that may be unpredictable, variable

    Continuous query loads that may be high, variable

    Overload: need to use resources very carefully.

    Changing conditions: adaptive strategy.

  • 7/29/2019 Stream Data Management

    20/68

    Query Model

    [Figure: a user/application interacts with the query processor of the DSMS.]

    Query Registration

    Predefined

    Ad-hoc

    Predefined, inactive until invoked

    Answer Availability

    One-time

    Event/timer based

    Multiple-time, periodic

    Continuous (stored or streamed)

    Stream Access

    Arbitrary

    Weighted history

    Sliding window (special case: size = 1)

  • 7/29/2019 Stream Data Management

    21/68

    Agenda

    Data Streams What are they?

    Why now? Applications..

    DSMS: Architecture & Issues

    Query Processing

    Language

    Operators

    Optimization

    Multi-Query Optimization

  • 7/29/2019 Stream Data Management

    22/68

    Stream Query Language

    SQL extension

    Queries reference/produce relations or streams

    Examples: GSQL [Gigascope], CQL [STREAM]

    [Figure: a stream or finite relation is input to the stream query language, which outputs a stream or finite relation.]

  • 7/29/2019 Stream Data Management

    23/68

    Example: Continuous Query Language (CQL)

    Start with SQL

    Then add

    Streams as new data type

    Continuous instead of one-time semantics

    Windows on streams (derived from SQL-99)

    Sampling on streams (basic)

  • 7/29/2019 Stream Data Management

    24/68

    Impact of Limited Memory

    Continuous streams grow unboundedly

    Queries may require unbounded memory

    One solution: Approximate query evaluation

  • 7/29/2019 Stream Data Management

    25/68

    Approximate Query Evaluation

    Why?

    Handling load: streams coming too fast

    Avoid unbounded storage and computation

    Ad hoc queries need approximate history

    How?

    Sliding windows, synopsis, samples, load-shed

    Major Issues?

    Metric for set-valued queries

    Composition of approximate operators

    How is it understood/controlled by user?

    Integrate into query language

    Query planning and interaction with resource allocation

    Accuracy-efficiency-storage tradeoff and global metric
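
To make the "samples" option concrete, here is a sketch of reservoir sampling (Algorithm R), a standard technique that the slides do not name explicitly. It keeps a uniform random sample of fixed size k from a stream of unknown length. The function name `reservoir_sample` and the optional `rng` parameter are illustrative choices.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items using O(k) memory."""
    rng = rng or random.Random()
    sample = []
    for seen, item in enumerate(stream, start=1):
        if seen <= k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / seen by picking a slot
            # uniformly in [0, seen) and replacing only if it falls in the reservoir.
            slot = rng.randrange(seen)
            if slot < k:
                sample[slot] = item
    return sample
```

Approximate answers to ad hoc queries can then be computed over the sample instead of the full history, trading accuracy for bounded storage.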

  • 7/29/2019 Stream Data Management

    26/68

    Windows

    Mechanism for extracting a finite relation from an infinite stream

    Various window proposals for restricting operator scope.

    Windows based on ordering attribute (e.g. time)

    Windows based on tuple counts

    Windows based on explicit markers (e.g. punctuations)

    Variants (e.g., partitioning tuples in a window)

    [Figure: window specifications turn a stream into finite relations, which are manipulated using SQL and then "streamified" back into a stream.]

  • 7/29/2019 Stream Data Management

    27/68

    Windows

    Terminology

    [Figure: a time axis from start time to current time with points t1 to t5, illustrating a sliding window and a tumbling window.]
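
The two window types can be sketched over a stream of `(timestamp, value)` pairs. This is an illustration under stated assumptions: timestamps are integers in nondecreasing order, `width` is positive, the sliding window covers the interval `(t - width, t]`, and tumbling windows are aligned to multiples of `width`. The function names are hypothetical.

```python
from collections import deque


def sliding_window_sums(stream, width):
    """After each arrival at time t, yield (t, sum of values in (t - width, t])."""
    window = deque()
    total = 0
    for ts, value in stream:
        window.append((ts, value))
        total += value
        # Evict tuples that have slid out of the window.
        while window[0][0] <= ts - width:
            _, old_value = window.popleft()
            total -= old_value
        yield ts, total


def tumbling_window_sums(stream, width):
    """Yield (window start, sum) once per non-empty window [k*width, (k+1)*width)."""
    current_start = None
    total = 0
    for ts, value in stream:
        start = (ts // width) * width
        if current_start is not None and start != current_start:
            # The previous window is complete: emit it and start a new one.
            yield current_start, total
            total = 0
        current_start = start
        total += value
    if current_start is not None:
        # Only reached on finite input: flush the last, possibly partial, window.
        yield current_start, total
```

The sliding window produces an updated result on every arrival, while the tumbling window produces one result per disjoint interval.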

  • 7/29/2019 Stream Data Management

    28/68

    Query Operators

    Selections - Where clause

    Projections - Select clause

    Joins - From clause

    Group-by (Aggregations) - Group-by clause

  • 7/29/2019 Stream Data Management

    29/68

    Query Operators

    Selections and projections on streams - straightforward

    Local per-element operators

    Projection may need to include ordering attribute.

    Joins - problematic

    May need to join tuples that are arbitrarily far apart.

    Equijoin on stream ordering attributes may be tractable.

    Majority of the work focuses on joins using windows.
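
A minimal sketch of a window join, assuming a single merged input in nondecreasing timestamp order and a time-based window of `width` time units on each stream. The function name `window_join`, the stream labels "L" and "R", and the tuple layout are illustrative assumptions.

```python
from collections import deque


def window_join(events, width):
    """Equijoin two streams on key, pairing tuples at most `width` apart in time.

    events: iterable of (stream, timestamp, key, payload) tuples with stream
    in {"L", "R"}, merged in nondecreasing timestamp order.
    Yields (left payload, right payload) pairs.
    """
    windows = {"L": deque(), "R": deque()}
    for stream, ts, key, payload in events:
        other = "R" if stream == "L" else "L"
        # Expire tuples older than the window from both buffers.
        for buffered in windows.values():
            while buffered and buffered[0][0] < ts - width:
                buffered.popleft()
        # Probe the other stream's window for matching keys.
        for _, other_key, other_payload in windows[other]:
            if other_key == key:
                if stream == "L":
                    yield payload, other_payload
                else:
                    yield other_payload, payload
        windows[stream].append((ts, key, payload))
```

The window bounds the state that the join must keep, at the price of missing matches between tuples that are further apart than `width`.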

  • 7/29/2019 Stream Data Management

    30/68

    Blocking Operators

    Blocking: no output until entire input seen

    Streams: input never ends

    Simple aggregates: output update stream

    Set output (sort, group-by):

    Root could maintain output data structure

    Intermediate nodes: try non-blocking analogs

    Join: apply sliding-window restrictions

  • 7/29/2019 Stream Data Management

    31/68

    Optimization in DSMS

    Traditionally, table-based cardinalities are used in the query optimizer.

    Goal of query optimizer: minimize the size of intermediate results.

    Problematic in a streaming environment: all streams are unbounded = infinite size!

    Need novel optimization objectives that are relevant when the input sources are streams.

  • 7/29/2019 Stream Data Management

    32/68

    Query Optimization in DSMS

    Novel notions of optimization:

    Stream rate based [e.g. NiagaraCQ]

    Resource-based [e.g. STREAM]

    QoS based [e.g. Aurora]

    Continuous adaptive optimization

    Possibilities that objectives cannot be met:

    Resource constraints

    Bursty arrivals under limited processing capabilities.

  • 7/29/2019 Stream Data Management

    33/68

    Stream Projects

    Amazon/Cougar (Cornell) - sensors

    Aurora (Brown/MIT) - sensor monitoring, dataflow

    Hancock (AT&T) - telecom streams

    Niagara (OGI/Wisconsin) - Internet XML databases

    OpenCQ (Georgia) - triggers, incr. view maintenance

    Stream (Stanford) - general-purpose DSMS

    Tapestry (Xerox) - pub/sub content-based filtering

    Telegraph (Berkeley) - adaptive engine for sensors

    Tribeca (Bellcore) - network monitoring

  • 7/29/2019 Stream Data Management

    34/68

    Optimizing Multiple Distributed Stream Queries Using

    Hierarchical Network Partitions

    Sangeetha Seshadri*

    Jointly with: Vibhore Kumar*, Brian F. Cooper, Ling Liu* and Karsten Schwan *

    *College of Computing

    Georgia Tech

    Yahoo! Research

    IPDPS07

    March 29th 2007

  • 7/29/2019 Stream Data Management

    35/68

    Talk Outline

    Motivation

    Challenges

    Our Approach

    Experimental Results

    Future Work

  • 7/29/2019 Stream Data Management

    36/68

    Distributed Data Stream Systems

    [Figure labels: Weather, Local Weather, Web sources, Flight information, Travel Agent, Centralized DB. Example questions: "What is the status of my flight?" and "Can low-capacity flights be cancelled?"]

  • 7/29/2019 Stream Data Management

    37/68

    Motivation

    Lots of data produced in lots of places

    Examples: operational information systems, scientific collaborations, web traffic data, financial applications

    Centralized processing does not scale

  • 7/29/2019 Stream Data Management

    38/68

    Challenges

    Choosing efficient deployments.

    Fast and efficient initial deployments.

    Utilize reuse opportunities.

    Handling dynamic nature of system.

    Queries arrive or leave.

    Nodes join (recover) or leave (fail).

    Network conditions change.

    Data conditions (e.g. rate) change.

  • 7/29/2019 Stream Data Management

    39/68

    Approach Outline

    Typical Approaches: Query Planning, then Deployment, then Adaptivity

    Our Approach: Query Planning & Deployment, then Adaptivity

  • 7/29/2019 Stream Data Management

    40/68

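    Query Planning

    SELECT * FROM A ⋈ B ⋈ C

    [Figure: two alternative plans delivering results to Sink from sources A, B and C: B ⋈ C followed by (B ⋈ C) ⋈ A, or A ⋈ B followed by (A ⋈ B) ⋈ C.]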

  • 7/29/2019 Stream Data Management

    41/68

    Query Deployment

    [Figure: the plan A ⋈ B followed by (A ⋈ B) ⋈ C is placed on a network of nodes N1 to N5, with sources A, B and C and sinks Sink1 to Sink5.]

  • 7/29/2019 Stream Data Management

    42/68

    An Illustrative Example

    SELECT * FROM A ⋈ C

    SELECT * FROM A ⋈ B ⋈ C

  • 7/29/2019 Stream Data Management

    43/68

    Why an integrated approach?

    Integrated approach decreases cost by > 50 %

    Setup: 64 node network, 100 queries over 5 stream sources each.

    Y-axis represents communication costs.

  • 7/29/2019 Stream Data Management

    44/68

    Problem

    Massive search space.

    Example: 5 stream sources, 64 nodes: approximately 2,880,000,000 plans considered.

    Lemma 1: [Equation: size of the exhaustive search space as a function of K and N]

    Our solution: trade some optimality for smaller search space

  • 7/29/2019 Stream Data Management

    45/68

    Solution

    Organize the nodes into a virtual Network Hierarchy.

    Operator reuse through Stream Advertisements

    Two approximation based algorithms:

    Top-Down

    Bottom-Up

  • 7/29/2019 Stream Data Management

    46/68

    Optimization Metric

    Minimize "network usage"

    Network usage: total amount of data in transit at any point in time.

    Encapsulates both bandwidth and latency of links.
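
One common way to compute such a metric, assumed here rather than taken from the talk, is to sum data rate times latency over every link a deployment uses, since that product is the amount of data in flight on the link. The function name `network_usage` and the units (kilobytes per second, milliseconds) are illustrative.

```python
def network_usage(links):
    """Total data in transit for a deployment, in bytes.

    links: iterable of (rate_kb_per_s, latency_ms) pairs, one per network link
    used by the deployment. 1 KB/s sustained over 1 ms puts 1 byte in flight.
    """
    return sum(rate * latency for rate, latency in links)
```

Under this reading, a deployment that sends 1000 KB/s over a 10 ms link has the same usage as one that sends 200 KB/s over a 50 ms link, which is how a single number can reflect both bandwidth and latency.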

  • 7/29/2019 Stream Data Management

    47/68

    Network Hierarchy

    [Figure: network nodes grouped into clusters, with coordinator nodes forming the next level of the hierarchy.]

    Cluster network nodes based on cost.

    User defined parameter max_cs

  • 7/29/2019 Stream Data Management

    48/68

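    Stream Advertisements for Reuse

    [Figure: coordinator nodes advertise base streams such as A, B and C and derived streams such as A ⋈ C, so that existing operators can be reused.]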

  • 7/29/2019 Stream Data Management

    49/68

    Optimization Algorithms

    Top-Down

    Bottom-Up

  • 7/29/2019 Stream Data Management

    50/68

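    Planning algorithms

    Top down

    [Figure: the query A ⋈ B ⋈ C ⋈ D is split at the top of the hierarchy into A ⋈ B and C ⋈ D, which are pushed down level by level toward the sources A, B, C and D.]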

  • 7/29/2019 Stream Data Management

    51/68

    Top-Down Algorithm: Features

    Reduced search space: search space reduced by a factor [Equation: in terms of max_cs, h, N and K]

    (h = height of hierarchy, N = network size, K = number of sources).

    User defined parameter max_cs allows tuning the trade-off between search space and sub-optimality.

    Operators re-used when beneficial through stream advertisements.

  • 7/29/2019 Stream Data Management

    52/68

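    Planning algorithms

    Bottom up

    [Figure: starting from the sources A, B, C and D, the sub-query A ⋈ B is deployed within a low-level cluster, and the rest of A ⋈ B ⋈ C ⋈ D is deployed at higher levels of the hierarchy.]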

  • 7/29/2019 Stream Data Management

    53/68

    Bottom-Up Algorithm: Features

    Reduced search space.

    Deploys only sub-queries within current cluster.

    Analytical bounds: Search space reduced by factor .

    Operators re-used when beneficial.

    But, may choose sub-optimal join-orders.

  • 7/29/2019 Stream Data Management

    54/68

    Experiments

    Simulation and prototype based experiments.

    128 node network: used GT-ITM internetwork topology generator.

    Uniformly random workload generator: 10 sources, 100 queries, 2-5 join operators, random sink placements.

  • 7/29/2019 Stream Data Management

    55/68

    Cost with Bottom-Up Algorithm

  • 7/29/2019 Stream Data Management

    56/68

    Comparison with existing approaches

  • 7/29/2019 Stream Data Management

    57/68

    Comparison of Search Space

  • 7/29/2019 Stream Data Management

    58/68

    Future Work

    We have built a prototype based on IFLOW, a distributed data stream system built at Georgia Tech.

    Aggregations

    Modifying existing deployments at runtime

    Relaxing filter conditions

    Modifying join ordering at runtime.

  • 7/29/2019 Stream Data Management

    59/68

    Related Work

    Distributed query optimization

    Distributed INGRES, R*, SDD-1

    Stream data processing engines

    Centralized - STREAM, Aurora, TelegraphCQ

    Distributed - Borealis, Flux

  • 7/29/2019 Stream Data Management

    60/68

    Conclusion

    Integrated approach to query optimization

    Hierarchical clustering of network and stream advertisements.

    Approximation based algorithms

    Top-Down

    Bottom-Up

    Design Highlights

    Trade some optimality for smaller search space.

    Decrease search space while offering bounds on the sub-optimality.

  • 7/29/2019 Stream Data Management

    61/68

    For further information

    http://www.cc.gatech.edu/~sangeeta

    Contact: [email protected]

    Thank You!

  • 7/29/2019 Stream Data Management

    62/68

    Deployment Times

  • 7/29/2019 Stream Data Management

    63/68

    Example

    Simple use-case for pushing down selections:

    Query 1:

    SELECT FLIGHTS.Number, FLIGHTS.Status, CARRIER_CODES.Name

    FROM FLIGHTS, CARRIER_CODES

    WHERE FLIGHTS.Departing = 'ATLANTA'

    AND FLIGHTS.Carrier_Code = CARRIER_CODES.Code

    AND FLIGHTS.Departure_terminal = 'TERMINAL SOUTH'

    Query 2:

    SELECT FLIGHTS.Number, FLIGHTS.Status, CARRIER_CODES.Name

    FROM FLIGHTS, CARRIER_CODES

    WHERE FLIGHTS.Departing = 'ATLANTA'

    AND FLIGHTS.Carrier_Code = CARRIER_CODES.Code

    AND FLIGHTS.Departure_terminal = 'TERMINAL NORTH'
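
The two queries differ only in the terminal predicate, so the shared selection and join can be evaluated once and only the last filter split per query. The sketch below illustrates that idea. The function names, the dictionary-based rows, and the representation of CARRIER_CODES as a code-to-name mapping are assumptions made for the example, not the system's actual interfaces.

```python
def shared_atlanta_rows(flights, carrier_codes):
    """Shared sub-plan: select Departing = 'ATLANTA' early, then join with CARRIER_CODES.

    flights: iterable of dicts with the FLIGHTS columns used by the queries.
    carrier_codes: dict mapping CARRIER_CODES.Code to CARRIER_CODES.Name.
    """
    for flight in flights:
        if flight["Departing"] != "ATLANTA":
            continue  # selection pushed down: drop the tuple before the join
        name = carrier_codes.get(flight["Carrier_Code"])
        if name is not None:
            yield {
                "Number": flight["Number"],
                "Status": flight["Status"],
                "Name": name,
                "Departure_terminal": flight["Departure_terminal"],
            }


def answer_by_terminal(rows, terminals):
    """Per-query residual filter: route shared rows to the query for each terminal."""
    results = {terminal: [] for terminal in terminals}
    for row in rows:
        terminal = row["Departure_terminal"]
        if terminal in results:
            results[terminal].append((row["Number"], row["Status"], row["Name"]))
    return results
```

Filtering on `Departing` before the join means that non-Atlanta flights never leave the source, and the join result is computed once and reused by both queries.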

  • 7/29/2019 Stream Data Management

    64/68

    The Big Picture

    Large number of possibilities

    System Model

    Stream processing systems (SQL-style queries)

    Pub-sub systems

    Runtime annotators (keyword-based queries).

    Trade-offs

    Cost with search space

    Reliability

    Availability.

    Adaptivity

    Admission control

    Moving operators

    Dropping data

    Migrating plans.

  • 7/29/2019 Stream Data Management

    65/68

    Real Enterprise Workload

    Delta Airlines operational information system

    Q1 (15%): Terminal Overhead Display (Lifetime = 12 hours)

    Q2 (80%): Gate Agent Query (Lifetime = 2 hours)

    Q3 (5%): Ad-hoc flight status monitoring queries (Lifetime = 6 hours)

  • 7/29/2019 Stream Data Management

    66/68

    Real Enterprise Workload

  • 7/29/2019 Stream Data Management

    67/68

    Backups

  • 7/29/2019 Stream Data Management

    68/68

    Sliding Window Approximation

    Why?

    Approximation technique for bounded memory

    Natural in applications (emphasizes recent data)

    Well-specified and deterministic semantics

    Issues

    Extend relational algebra, SQL, query optimization

    Algorithmic work

    Timestamps?

    0 1 1 0 0 0 0 1 1 1 000 0 0 1 0 1 010