Continuous Analytics Over Discontinuous Streams

Continuous Analytics Over Discontinuous Streams Sailesh Krishnamurthy, Michael Franklin, Jeff Davis, Daniel Farina, Pasha Golovko, Alan Li, Neil Thombre June 10, 2010 SIGMOD, Indianapolis



Page 1: Continuous Analytics Over Discontinuous Streams

Continuous Analytics Over Discontinuous Streams

Sailesh Krishnamurthy, Michael Franklin, Jeff Davis, Daniel Farina, Pasha Golovko, Alan Li, Neil Thombre

June 10, 2010
SIGMOD, Indianapolis

Page 2: Continuous Analytics Over Discontinuous Streams

• Founded in 2005
• Roots in TelegraphCQ project from UC Berkeley
• HQ in Foster City, CA
• Focus on “Continuous Analytics”
• Fortune 100 and web-based Big Data customers

Page 3: Continuous Analytics Over Discontinuous Streams

Stream Query Processing (Traditional View)

[Diagram labels: Source Data; Data Records / “Events”; CQ Processor; Real-Time Analysis; Update Display]

Page 4: Continuous Analytics Over Discontinuous Streams

SQL Execution On Streaming Data

• A stream is an unbounded sequence of records
• A table is a set of records
• Window operators convert streams to tables
• SQL queries apply to tables

Window Operator

• Each window produces a set of records (a table)
• Semantics:
  • Repeatedly apply generic SQL to the results of window operators
  • Results are continuously appended to the output stream
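The window semantics above can be sketched in a few lines of Python. This is only an illustration, not the engine's implementation: the record shape, the function names `tumbling_windows` and `continuous_query`, and the choice of non-overlapping windows covering `(close - width, close]` are all assumptions made for the example, and the stream is assumed to arrive in timestamp order.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Record = Tuple[int, str]  # (timestamp in seconds, payload); hypothetical shape


def tumbling_windows(stream: Iterable[Record], width: int) -> Iterator[Tuple[int, List[Record]]]:
    """Cut an in-order stream into consecutive windows; each window is a table."""
    table: List[Record] = []
    close = width  # the current window covers (close - width, close]
    for ts, payload in stream:
        while ts > close:
            # The stream moved past the window: hand the finished table downstream.
            yield close, table
            table, close = [], close + width
        table.append((ts, payload))
    yield close, table  # last window of a finite sample stream


def continuous_query(stream: Iterable[Record], width: int,
                     query: Callable[[List[Record]], object]) -> Iterator[Tuple[int, object]]:
    """Apply the same query to every window table and append each result to the output."""
    for close, table in tumbling_windows(stream, width):
        yield close, query(table)


events = [(1, "a"), (2, "b"), (4, "c"), (7, "d"), (8, "e")]
output = list(continuous_query(events, 5, len))  # len stands in for COUNT(*)
print(output)
```

Here `len` plays the role of the "generic SQL" applied to each table; any function over a list of records could be substituted.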

Page 5: Continuous Analytics Over Discontinuous Streams

Example: SQL Queries over Streams

SELECT I.Advertiser, SUM(I.price*I.volume)
FROM Impressions I <VISIBLE '5 sec' ADVANCE '3 sec'>, Campaigns C
WHERE I.campaign_id = C.campaign_id and C.type = 'CPM'
GROUP BY I.Advertiser

Window Operator Clause: <VISIBLE '5 sec' ADVANCE '3 sec'>
• VISIBLE '5 sec': “I want to look at 5 seconds worth of impressions”
• ADVANCE '3 sec': “I want results every 3 seconds”

Every 3 seconds, compute the revenue by advertiser based on impression data, over a 5 second “sliding window”

[Figure: a Window sliding over the Impression Data Stream, producing Result(s) at each advance]
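To make the VISIBLE/ADVANCE semantics concrete, here is a small Python sketch that evaluates the same revenue query over an in-memory sample. The sample rows, the function name `revenue_by_advertiser`, and the exact window boundaries (each result at time `close` sees impressions with `close - visible < ts <= close`) are assumptions for illustration; the slides do not specify boundary handling.

```python
from collections import defaultdict

# Hypothetical sample data: (timestamp, advertiser, campaign_id, price, volume)
impressions = [
    (1, "acme", 10, 2.0, 1),
    (2, "zeta", 11, 1.0, 4),
    (4, "acme", 10, 3.0, 1),
    (5, "acme", 12, 9.0, 1),
    (6, "zeta", 11, 1.0, 2),
]
campaigns = {10: "CPM", 11: "CPM", 12: "CPC"}  # campaign_id -> type


def revenue_by_advertiser(impressions, campaigns, visible=5, advance=3):
    """Emit one result table every `advance` seconds over the last `visible` seconds."""
    last_ts = max(row[0] for row in impressions)
    results = []
    for close in range(advance, last_ts + advance, advance):
        revenue = defaultdict(float)
        for ts, advertiser, campaign_id, price, volume in impressions:
            in_window = close - visible < ts <= close
            # Join against Campaigns and keep only CPM campaigns.
            if in_window and campaigns.get(campaign_id) == "CPM":
                revenue[advertiser] += price * volume
        results.append((close, dict(revenue)))
    return results


results = revenue_by_advertiser(impressions, campaigns)
for close, table in results:
    print(close, table)
```

Because the window (5 seconds) is longer than the advance (3 seconds), the impression at timestamp 2 contributes to both results.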

Page 6: Continuous Analytics Over Discontinuous Streams

Assumptions About Streams

• Continuous sequences
• Arriving mostly in order

[Figure: a stream of numbered tuples arriving nearly in order]

Page 7: Continuous Analytics Over Discontinuous Streams

The Reality

• Data arriving minutes, hours, or days late
• Multiple streams out of sync, with gaps, …

[Figure: several streams of numbered tuples arriving out of order, with gaps]

Page 8: Continuous Analytics Over Discontinuous Streams

Traditional (in Order) Solution #1: “Slack”

Tuple #   Time Stamp   3-Second Slack Buffer   OUTPUT
1         1            1
2         2            1,2
3         3            1,2,3
4         2            1,2,2,3
5         6            6                       1,2,2,3
6         5            5,6
7         1            5,6
8         9            9                       5,6
9         8            8,9
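The slack example can be reproduced with a short Python sketch. The rule used here is inferred from the example rather than stated on the slide: a tuple is released once the highest timestamp seen is at least `slack` seconds ahead of it, and an arrival at or below that release point is dropped. The function name `slack_reorder` is hypothetical.

```python
import heapq


def slack_reorder(timestamps, slack):
    """Reorder a slightly out-of-order stream using a slack buffer."""
    buffer = []                    # min-heap of buffered timestamps
    max_ts = float("-inf")         # highest timestamp seen so far
    watermark = float("-inf")      # everything at or below this was already released
    emitted, dropped = [], []
    for ts in timestamps:
        if ts <= watermark:
            dropped.append(ts)     # arrived later than the buffer can absorb
            continue
        heapq.heappush(buffer, ts)
        max_ts = max(max_ts, ts)
        watermark = max(watermark, max_ts - slack)
        batch = []
        while buffer and buffer[0] <= watermark:
            batch.append(heapq.heappop(buffer))
        if batch:
            emitted.append(batch)
    return emitted, dropped, sorted(buffer)


# Arrival order of timestamps taken from the slide's example.
emitted, dropped, remaining = slack_reorder([1, 2, 3, 2, 6, 5, 1, 9, 8], slack=3)
print(emitted, dropped, remaining)
```

With the slide's arrival order, the tuple with timestamp 1 that arrives seventh is the one that is lost.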

Page 9: Continuous Analytics Over Discontinuous Streams

Slack

• Pros
  • Simple
  • Handles “jitter” (slightly out-of-order arrival)
• Cons
  • Introduces delay
  • Permanently drops arrivals later than buffer
  • Unbounded buffer size
  • Permanently drops arrivals if lulls in multiple input streams

Page 10: Continuous Analytics Over Discontinuous Streams

Traditional (in Order) Solution #2: “Drift”

Source 1   Source 2   2-Second Drift Buffer   OUTPUT
(A,1)      (a,2)                              (A,1)
(B,2)      (b,3)                              (a,2), (B,2)
(C,3)      (c,4)                              (b,3), (C,3)
(G,4)      (d,5)                              (c,4), (G,4)
(D,6)                                         (d,5)
(E,7)                 (D,6), (E,7)
(R,8)                 (E,7), (R,8)            (D,6)
(F,9)      (x,5)      (R,8), (F,9)            (E,7)
           (z,10)     (z,10)                  (R,8), (F,9)
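One reading of the drift example that reproduces its output is sketched below in Python. The release rule is an assumption inferred from the example, not something the slides state: a tuple is released once every source has progressed to its timestamp, or once the fastest source is more than `drift` seconds ahead of it. The function name `drift_merge` and the one-arrival-at-a-time ordering of the two sources are also assumptions.

```python
def drift_merge(arrivals, sources, drift):
    """Merge several streams in timestamp order, tolerating a bounded drift."""
    last = {s: None for s in sources}   # latest timestamp seen per source
    buffer = []                         # tuples waiting to be released
    watermark = float("-inf")
    emitted, dropped = [], []
    for source, label, ts in arrivals:
        if ts <= watermark:
            dropped.append((label, ts))  # arrived after the drift window expired
            continue
        buffer.append((label, ts))
        last[source] = ts if last[source] is None else max(last[source], ts)
        seen = [t for t in last.values() if t is not None]
        if len(seen) == len(sources):
            # All sources have reported: safe up to the slowest one.
            watermark = max(watermark, min(seen))
        # Never wait more than `drift` seconds for a lagging source.
        watermark = max(watermark, max(seen) - drift)
        ready = [r for r in buffer if r[1] <= watermark]
        buffer = [r for r in buffer if r[1] > watermark]
        if ready:
            emitted.append(sorted(ready, key=lambda r: r[1]))
    return emitted, dropped, buffer


arrivals = [
    (1, "A", 1), (2, "a", 2), (1, "B", 2), (2, "b", 3), (1, "C", 3),
    (2, "c", 4), (1, "G", 4), (2, "d", 5), (1, "D", 6), (1, "E", 7),
    (1, "R", 8), (1, "F", 9), (2, "x", 5), (2, "z", 10),
]
emitted, dropped, remaining = drift_merge(arrivals, sources=[1, 2], drift=2)
print(emitted, dropped, remaining)
```

While Source 2 is silent, Source 1 tuples wait in the buffer until they fall more than 2 seconds behind; (x,5) then arrives too late and is dropped.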

Page 11: Continuous Analytics Over Discontinuous Streams

Drift

• Pros
  • Simple
  • Handles multiple streams with short “lulls” in arrival
• Cons
  • Doesn’t handle streams with dramatically different arrival rates
  • Permanently drops data that arrives after drift window has expired

Page 12: Continuous Analytics Over Discontinuous Streams

Traditional Solution #3: Order-agnostic Operators

• Slack and Drift aim to order streams before presenting them to order-sensitive operators
• Many operators don’t care about order

SELECT count(*), cq_close(*) ts
FROM S <slices '5 seconds'>

Page 13: Continuous Analytics Over Discontinuous Streams

Out of Order Processing: Count Example

Tuple #   Time Stamp   Heart-Beat   Count State   OUTPUT
1         1                         1
2         3                         2
3         2                         3
4         4                         4
5                      5                          (4,t=5)
6         6                         1
7         2                         1
8         9                         2
9         7                         3
10        3                         3
11                     10                         (3,t=10)
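The count example can be traced with a minimal Python sketch of an order-agnostic operator. It assumes that a tuple with timestamp `ts` belongs to the slice closing at the next multiple of the slice width, that a heartbeat at time `t` closes the slice ending at `t`, and that anything at or before the last heartbeat is discarded. The event encoding and the name `order_agnostic_count` are hypothetical.

```python
import math
from collections import defaultdict


def order_agnostic_count(events, width):
    """Count tuples per slice without ordering them; heartbeats close slices."""
    counts = defaultdict(int)   # slice close time -> running count
    closed = 0                  # timestamp of the last heartbeat
    output, dropped = [], []
    for kind, ts in events:
        if kind == "heartbeat":
            output.append((counts.pop(ts, 0), ts))
            closed = ts
        elif ts <= closed:
            dropped.append(ts)  # its slice has already been emitted
        else:
            counts[math.ceil(ts / width) * width] += 1
    return output, dropped


# Arrival sequence taken from the slide's example.
events = [
    ("data", 1), ("data", 3), ("data", 2), ("data", 4), ("heartbeat", 5),
    ("data", 6), ("data", 2), ("data", 9), ("data", 7), ("data", 3),
    ("heartbeat", 10),
]
output, dropped = order_agnostic_count(events, width=5)
print(output, dropped)
```

Tuples 2, 3, and 4 arrive out of order but are all counted, because only the heartbeat decides when a slice closes. The two tuples that arrive after the first heartbeat with timestamps 2 and 3 are lost.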

Page 14: Continuous Analytics Over Discontinuous Streams

Order-agnostic Operators

• Pros
  • No buffering
  • No extra delays
  • Handles out-of-order tuples that make it before heart-beat
• Cons
  • Some operators do care about order
  • Permanently drops data that arrives after heartbeat
  • Note: Lost data also impacts bigger “roll up queries”, e.g. <slices 15 seconds> with sharing

Page 15: Continuous Analytics Over Discontinuous Streams

So, how to handle very late data and discontinuous streams?


Page 16: Continuous Analytics Over Discontinuous Streams

“Stream-Relational” Architecture [CIDR 09]

[Diagram labels: Integration Framework; Shared Stream Query Processor; Persistent Data Store; SQL Interface; Raw Data; Aggregates; JDBC / JMS; XML; Flat files; ETL tools; SOAP APIs; Data Warehouse; App Logic / UDFs; Other TruCQs]

Page 17: Continuous Analytics Over Discontinuous Streams

Order-Independent Processing: Overview

• Answers that have already been delivered can only be compensated
• Need to preserve all arriving data
• Queries return answers based on all relevant data that has arrived:
  • CQs: Continuous Queries
  • SQs: SQL queries on archived streams & answers
• Approach: Leverage benefits of SQL(!):
  • Data-parallel processing w/ on-demand consolidation
  • Powerful “View” mechanisms
• Basically, create parallel partitions for late data
• Rewrite queries as views over partial results

Page 18: Continuous Analytics Over Discontinuous Streams

Out of Order Processing: Count Example

Tuple #   Data TS   Control   Count State Partitions   OUTPUT
1         1                   1
2         3                   2
3         2                   3
4         4                   4
5         2                   5
6         1                   6
7                   5                                  (6,t=5)
8         6                   1
9         2                   1 1
10        9                   2 1
11        7                   3 1

Page 19: Continuous Analytics Over Discontinuous Streams

Out of Order Processing: Count Example

Tuple #   Data TS   Control   Count State Partitions   OUTPUT
                                                       (6,t=5)
11        7                   3 1
12        3                   3 2
13                  10        2                        (3,t=10)
14        12                  1 2
15        8                   1 1                      (2,t=5)
16        4                   1 1
17        3                   1 2
18        9                   2 2
19                  15        2 2                      (1,t=15)
20                  flush-2   2                        (2,t=10)
21                  flush-3                            (2,t=5)

Page 20: Continuous Analytics Over Discontinuous Streams

Out of Order Processing: Count Example

OUTPUT: (6,t=5) (3,t=10) (2,t=5) (1,t=15) (2,t=10) (2,t=5)

• Treat output as “Partial State Records” (PSRs)
• Rewrite queries using views over PSRs
  • i.e., consolidate On-Demand
  • Paper goes into substantial detail on how rewrites work
• <Slices 5 second>
  • Same answer as Order-Insensitive
• <Slices 15 second> as roll-up
  • Answer contains all data
• Subsequent SQs over archived results and raw data contain all data too!
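The partition example and the on-demand consolidation step can be sketched together in Python. This is a simplified model, not the paper's mechanism, which rewrites queries as SQL views over the partial state records. The routing rule is inferred from the example tables: late tuples go to the first late partition whose current slice is not ahead of them, and a partition emits its partial record when it advances. The class name, method names, and `consolidate` helper are hypothetical.

```python
import math
from collections import Counter


class OrderIndependentCount:
    """Count per slice, keeping late data in extra partitions instead of dropping it."""

    def __init__(self, width):
        self.width = width
        self.closed = 0        # last control timestamp seen by the in-order partition
        self.main = Counter()  # in-order partition: slice close time -> count
        self.late = []         # late partitions, each a [slice_close, count] pair
        self.psrs = []         # emitted partial state records: (count, slice_close)

    def on_data(self, ts):
        window = math.ceil(ts / self.width) * self.width
        if window > self.closed:
            self.main[window] += 1
            return
        for part in self.late:
            if part[0] == window:
                part[1] += 1
                return
            if part[0] < window:
                # The partition advances: emit its partial count, then reuse it.
                self.psrs.append((part[1], part[0]))
                part[0], part[1] = window, 1
                return
        self.late.append([window, 1])  # no partition fits: open a new one

    def on_control(self, ts):
        if ts in self.main:
            self.psrs.append((self.main.pop(ts), ts))
        self.closed = ts

    def flush(self):
        for window, count in self.late:
            self.psrs.append((count, window))
        self.late.clear()


def consolidate(psrs, width):
    """On-demand consolidation: sum partial counts per slice of the given width."""
    totals = Counter()
    for count, window in psrs:
        totals[math.ceil(window / width) * width] += count
    return dict(sorted(totals.items()))


op = OrderIndependentCount(width=5)
for ts in [1, 3, 2, 4, 2, 1]:
    op.on_data(ts)
op.on_control(5)
for ts in [6, 2, 9, 7, 3]:
    op.on_data(ts)
op.on_control(10)
for ts in [12, 8, 4, 3, 9]:
    op.on_data(ts)
op.on_control(15)
op.flush()

print(op.psrs)
print(consolidate(op.psrs, 5))
print(consolidate(op.psrs, 15))
```

Every one of the 16 input tuples is reflected in the consolidated answers, for both the 5-second slices and the 15-second roll-up, even though several arrived after their slice had first been reported.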

Page 21: Continuous Analytics Over Discontinuous Streams

Handles Very Late Data, Plus You Get…

• Parallel Processing – Multicore and Cluster

[Diagram: Clients and processing nodes (U, D) connected by a High-bandwidth Network Interconnect]

D = Distributed Processing Node
U = Unified Processing Node

Page 22: Continuous Analytics Over Discontinuous Streams

Other Details in the Paper

• Beyond late data and parallelism, approach also is key to supporting:
  • Fault Tolerance using replication
  • High-Availability via fast restart
  • “Nostalgic” continuous queries that start in the past and catch up to the present
  • Fast concurrent creation of archives for new CQs
• Algorithmic/Systems details on
  • Integration with overall system architecture
  • Interaction with Transaction Mechanism
  • Need for Background Reducer task
  • Hybrid Plans for non-parallelizable parts of queries

Page 23: Continuous Analytics Over Discontinuous Streams

Conclusions

• Early Stream Processing Systems were based on simplistic assumptions about ordering
• Truviso’s 3.2 engine incorporates a new mechanism so no data is permanently dropped
• Approach leverages strengths of SQL
  • Data-parallel processing models
  • Sophisticated and efficient view functionality
• Key is On-Demand Consolidation
  • Of course, you can only do it if you have an integrated stream-relational system

For more info: [email protected] or [email protected]