
Data Stream Warehousing

Lukasz Golab, University of Waterloo

Theodore Johnson, AT&T Labs - Research

About Us

•  Lukasz Golab
   –  Assistant Professor at U. Waterloo
   –  Previously senior member of research staff at AT&T Labs
   –  Research interests: streaming data, data warehousing, data quality, big data for sustainability

•  Ted Johnson
   –  Lead member of research staff at AT&T Labs
   –  AT&T Fellow
   –  Previously Associate Professor at U. Florida
   –  Research interests: streaming data, data warehousing, data quality

About This Tutorial

•  A shorter version was presented at SIGMOD 2013
•  This version
   –  More examples
   –  More technical details
•  Slides available online
   –  www.engineering.uwaterloo.ca/~lgolab
•  Citations at the end

Outline
•  Introduction (What?)
•  Motivation (Why?)
   –  Applications
   –  Requirements
   –  Why yet another data management system?
   –  Detailed examples
•  Technical discussion (How?)
   –  System architectures
   –  Common elements
   –  Performance optimizations
   –  Data stream quality
•  Conclusions and open problems (What’s next?)

Big Data

•  Every 2 days we create as much information as we did up to 2003 (Eric Schmidt)

•  Becoming easier to produce/collect – Sensors, Web, cheap bandwidth

•  Becoming easier/cheaper to store – Cheap hard disks, commodity hardware

Five ‘V’s of Big Data
•  Volume
•  Velocity
•  Variety
   –  Data integration
•  Verification
   –  Data cleaning
•  Value
   –  Data mining

What is Data Stream Warehousing?

•  Volume and velocity in one box •  Currently handled by separate systems

– Data warehouses or Hadoop/MapReduce for big data

– Data stream management systems for fast data

But first…

•  Quick review of big data management and data streaming

Typical Big Data Workflow •  Wait for data to arrive •  Prepare and load data

–  Into HDFS – or into the data warehouse, then index

•  Compute the result •  Start over

[Figure: new data is loaded into HDFS; Hadoop Map/Reduce analysis jobs run over HDFS]

Traditional Data Warehousing
•  Load a unit of data after it has been collected.
   –  Many data sources
•  Update all materialized views during the load process.
•  Massive tables.
•  Lots of OLAP.
•  Data in the warehouse is consistent.
•  Data loading is infrequent.

[Figure: daily loads (Monday, Tuesday, Wednesday) into base tables, which feed derived data products]

Typical Stream Data Workflow

•  For each item or batch of items
   –  Do some processing
   –  Compute/update results
•  Continue indefinitely

Fast Data Systems
•  Data Stream Management Systems (DSMS)
   –  Borealis, StreamBase, GS Tool, Storm
   –  Simple queries over fast append-only data
   –  Results streamed out, usually not stored
•  Key-value stores have fast transactional response, but analytics are difficult
   –  Put/get interfaces
   –  Scalability via partitioning
•  Analytics require correlation

Traditional Data Streaming
•  Process a record immediately upon arrival.
•  Small number of data sources.
   –  Often, one.
•  Data arrives in (nearly) sorted order.
•  In-memory processing
   –  Store a small state
•  Consistency is easy because the active time range is small.

[Figure: external data source]

Hadoop/MapReduce Streaming
•  E.g., MapUpdate/Muppet
   –  Stream processing front end
   –  Map stream records to “slates” for updates
      •  Stored in a key-value store
   –  Streaming updates to customer signatures.

[Figure: an input stream passes through stream processing into slates held in a key-value store spread across servers; example slate keys: Joe, Mary, Bill, Hillary, Pete, Susan]

In This Tutorial
•  Big Data Management
   –  Focus on scalability and deep analytics, but high latency
•  Fast Data Management
   –  DSMSs have low latency, but limited capability and no persistent storage
•  Can/should we do both?
   –  Combine large scale and deep analytics with low-latency processing
   –  Data Stream Warehousing

Applications
•  Monitoring (IP networks, infrastructure, smart transportation systems and power grids, RFID, system logs, manufacturing processes)
•  Transactions (stock tickers, credit card purchases)
•  User behavior logs (Web, social media)

Example 1: Network Monitoring
•  Historical analysis
   –  Forecasting, planning, what-if analysis
•  Real-time alerting and troubleshooting
   –  High load, too many connections, …
•  Alerts need historical data
   –  High load during otherwise low-usage times
•  Alerts lead to troubleshooting/analytics
   –  Automated troubleshooting needs (current and past) data on which the alert was based, plus related data

Example 2: Smart Grid
•  Generation, transmission & consumption data
•  Historical analysis
   –  Planning, forecasting, what-if analysis
•  Real-time alerting and troubleshooting
   –  Voltage sag, outages, lights left on
•  Alerts need historical data
   –  Typical consumption profiles, line temperature trends
•  Alerts lead to troubleshooting/analytics

Example 3: Manufacturing
•  E.g., semiconductor manufacturing
   –  www.extremetech.com/extreme/155588-applied-materials-designs-tools-to-leverage-big-data-and-build-better-chips
•  Sensor measurements throughout the process
   –  141 Tbyte/year for a 14 nm process
•  Historical analysis
   –  Optimization of manufacturing processes
•  Real-time alerting and troubleshooting
   –  Fault detection

Example 4: Social Media
•  Historical analysis:
   –  Related topics
•  Real-time queries:
   –  Twitter: related query suggestions
•  Real-time queries need up-to-date results of related topic analysis

Requirements
•  Load data from a multitude of streaming sources
   –  Wide variation in data latencies
•  Maintain base and derived data
   –  Complex analytics through materialized views
•  Provide transparent access to both real-time and historical data
•  Gracefully handle late-arriving data
•  Schedule queries and updates in spite of highly variable workloads
   –  Load shedding by dropping data is not an option

Multitude of streaming sources
•  Data become most useful when you can correlate results from many sources
   –  Hundreds to thousands of distinct data feeds
•  Network monitoring
   –  Correlate Twitter feeds, active monitoring streams, and link utilizations to identify trouble spots
•  Smart Grid
   –  Correlate smart meter readings, line temperature measurements, and phasor measurement units to proactively react to overloads and avoid blackouts

[Figure: number of windows (y-axis, 0 to 12) versus time in seconds (x-axis, 0 to 600,000)]

Late-arriving data
•  Late-arriving data is a common problem for streaming systems.
•  DSMS: data arrive minutes late
•  Stream warehouse: data can arrive days late
•  Load all data and propagate their results in spite of lateness.

Derived Data
•  Load raw data into base tables
•  Derived tables are materialized views for complex analytics.

[Figure: raw streams (Twitter feeds, active measurements, link utilization, customer complaints) load into base tables, which feed derived tables (service alerts, sentiment analysis, hourly aggregate, daily aggregate)]

Progress in the Stream

•  Data stream warehousing just tries to make progress in the stream.

•  If there is more source data, compute an update to the stream table.

[Figure: an update flows from source tables to a derived data product]

Why a new system?
•  Could have 2 separate systems, but
   –  Not clear where to divide the systems
   –  Overhead of moving data from one system to the other
   –  Harder to develop applications
      •  Different SQL/NoSQL dialects, etc.
   –  Historical data provides context for real-time data
   –  Even traditional analytics/reporting is becoming more real-time
•  Reduce time from ingest to insight

Detailed Example 1: Darkstar
•  Darkstar project at AT&T Labs - Research
•  Motivating application for the Data Depot stream warehouse system
•  Data collected:
   –  Passive and active probe measurements, route monitoring, system logs, configuration data, customer service tickets and notes, …
•  For:
   –  Networking research, data mining, alerting, troubleshooting, …
•  The network is a large and complex system
   –  Not just IPv4.
•  Argus
   –  He Yan, Zihui Ge
•  Ptolemy
   –  Zihui Ge, Don Caldwell, Bill Beckett

Darkstar: Mining Vast Amounts of Data

[Figure: data sources collected from the network]
•  Data sources: route monitors (OSPFmon, BGPmon); device service monitoring (CIQ, MTANet, STREAM); active service and connectivity monitoring; Syslog; Config; SNMP polling (router, link); Netflow; Deep Packet Inspection (DPI); alarms; tickets; authentication/logging (tacacs); customer feedback (IVR, tickets, MTS)
•  Networks: IP backhaul; enterprise IP, VPNs; Ethernet access; IPTV; layer one; mobility

ARGUS: Detecting Service Issues…
•  Goal: detect and isolate actionable anomaly events using comprehensive end-to-end performance measurements (e.g., GS Tool)
   –  Sophisticated anomaly detection and heuristics
   –  Spatial localization
   –  Accurately accounts for service performance that varies considerably by time of day and location
•  Impact:
   –  Reduced detection time from days to approx. 15 minutes for data service issues
   –  Operational nationwide monitoring of data service performance for 3G and LTE (TCP retransmission, RTT, throughput from GS Tool)

Approach: Mobility Localization Hierarchy

[Figure: collect end-to-end performance data and localize over the mobility hierarchy; node types shown: Market, Sub-Market, SGSN, GGSN, RNC, SITE]

Case Example: Silent CE Overload Condition
•  ARGUS detected event: 2 Columbia 3G Ericsson SGSNs impacting RNCs in West Virginia, Norfolk, and Richmond
•  No other indication of issue
•  Topology highlighted CE used by only impacted SGSNs
•  RCA: “6148 48 port 1gig card is limited to a shared 1 gig bus for each set of 8 gig ports”

[Figure: ARGUS alarm on clmamdorpn2 (TCP retransmissions) alongside CE utilization flattening]

ARGUS As A General Capability…

[Figure: anomaly panels across data sources]
•  Spike in call drop rate on MSC hrndvacxca1
•  RTT anomalies (SGSN level)
•  Timeline: outage start 5:30 GMT; first anomaly 5:40 GMT; CTS ticket created 08:21 GMT
•  Social media (Twitter): NY outage, LA outage
•  Node metrics, active measurements (CBB, IPAG WIPM delay)…
•  Mobility customer tickets (Boston market, PE isolation)

Ptolemy
•  Use network visualization and convenient data exploration to help network operators with network health monitoring and service problem troubleshooting
•  1. At-a-glance view of network topology and state
   –  Visualization to summarize important information on network health
   –  Color-coded
   –  Complementary to the ticketing system: reports issues below “alarming” status
•  http://ptolemy.research.att.com/
•  http://ptolemy.research.att.com/mobility

Example 1: Japan Earthquake, March 11th 2011
•  Assess damage, identify remaining capacity
•  Loss of many links out of Japan. What’s left?

Example 1: Japan Earthquake, March 11th 2011
•  Identify traffic shifts, no congestion
•  Increase in link load as traffic re-routed

[Figure: link load]

Detailed example 2: Smart grid

Credit: US Government Accountability Office, www.gao.gov/new.items/d11117.pdf

Smart grid data sources

Credit: www.sas.com/offices/NA/canada/en/news/preleases/energy-visual-analytics.html

Smart grid data sources

Credit: http://www.ambientcorp.com/prod-nodes/

Smart grid data analytics

Credit: www.thegreenitreview.com/2010/12/smart-grid-data-analytics-market-will.html

Smart meter data mining

•  Consumption profiles

Solar panel anomaly detection

Source: ecinstalls.co.uk/solar-panels-monitor-payback

Recap

•  Need to collect and analyze data from diverse streaming sources

•  Transparent access to real-time and historical data

•  Could have two separate systems, but…

Data Stream Warehouse Architecture
•  Common elements (more on this later)
   –  Temporal data
      •  storage, querying, consistency
   –  Update propagation/workflow
•  Options
   –  DBMS based
   –  DSMS based
   –  Hadoop/MapReduce

DBMS-based Architecture
•  Use the query processing and storage engine of a DBMS
•  Add layers/optimizations for additional services
   –  Temporal partitioning
   –  Update propagation
   –  Refresh scheduling
•  Examples: Data Depot/Daytona, Truviso/Postgres, DBStream/Postgres, DataCell/MonetDB

DSMS-based Architecture

•  Dataflow operators + persistent storage
   –  Enables joins of streaming data and “static” tables
   –  Enables storage of streaming results
•  Examples: Aurora/Borealis, MillWheel

[Figure: a dataflow with an output stream, a “static” data set, and a connection point]

Hadoop/MapReduce-based Architecture

•  1. Need to reduce latency
   –  Incrementally send mapper output to reducers (Hadoop Online)
   –  Hashing instead of sort-merge to group by key (INC-Hadoop)
•  2. Need to enable incremental computation
   –  Save results of previous iterations (Spark, Muppet)
   –  Incremental workflows (Pig/Nova)

Discussion
•  DBMS-based
   –  Leverage SQL, query optimization, etc.
   –  But not quite real-time
•  DSMS-based
   –  Enable real-time processing with some context
   –  Must keep up with live data, so limited analytics and storage capabilities
•  Hadoop/MapReduce-based
   –  Leverage scale-out and fault tolerance
   –  But not quite real-time
   –  And analytics may not be as efficient as in a DBMS



Challenges
•  Tension between consistency and timeliness.
   –  Either wait until all the data has arrived, or load data that isn’t (yet) consistent.
•  Continual data loading
   –  Multi-version Concurrency Control (MVCC)
   –  Update propagation algorithms which avoid the “missing update” problem.
   –  Consistency markers in the tables.
•  All metadata is temporal
   –  E.g., network topology such as NetDB
   –  Can’t use static snapshot tables for metadata.

Common Elements

•  Temporal partitioning
•  Update propagation / workflow
•  Temporal dimension tables
•  Temporal consistency management

Temporal Partitioning

•  The primary partitioning field is the record timestamp
•  Stream data is mostly sorted
•  Most new data loads into a new partition
   –  Avoid rebuilding indices
•  Simplified data expiration: roll off oldest partitions

[Figure: data and index partitions laid out along the time axis; new data arrives in the newest partition]
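
The points above can be sketched as a toy in-memory table. This is only an illustration under stated assumptions: the class `PartitionedTable`, its methods, and the fixed partition width are hypothetical and are not taken from any of the systems named in this tutorial.

```python
class PartitionedTable:
    """Toy table partitioned by record timestamp (hypothetical API)."""

    def __init__(self, partition_seconds, max_partitions):
        self.partition_seconds = partition_seconds
        self.max_partitions = max_partitions
        self.partitions = {}  # partition start time -> list of (timestamp, record)

    def insert(self, timestamp, record):
        # Mostly sorted input means most inserts land in the newest partition.
        start = timestamp - timestamp % self.partition_seconds
        self.partitions.setdefault(start, []).append((timestamp, record))
        self._expire()

    def _expire(self):
        # Data expiration: roll off whole partitions, oldest first.
        while len(self.partitions) > self.max_partitions:
            del self.partitions[min(self.partitions)]

    def scan(self, t_lo, t_hi):
        """Return records with t_lo <= timestamp < t_hi, skipping non-overlapping partitions."""
        out = []
        for start in sorted(self.partitions):
            if start + self.partition_seconds <= t_lo or start >= t_hi:
                continue
            out.extend(r for r in self.partitions[start] if t_lo <= r[0] < t_hi)
        return out


table = PartitionedTable(partition_seconds=60, max_partitions=3)
for ts in (5, 30, 65, 130, 190, 200):
    table.insert(ts, f"rec@{ts}")

remaining = sorted(table.partitions)   # the partition starting at 0 has been rolled off
window = table.scan(60, 185)
```

Because expiration drops an entire partition, no per-record deletes or index rebuilds are needed in this sketch.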

Derived Data
•  Streaming analytics: maintain a system of complex materialized views
•  Load raw data into base tables
•  Derived tables are materialized views for complex analytics.
•  Push new data through base tables to all dependent tables
   –  Create new partitions
   –  Update existing partitions as needed

[Figure: raw data (Twitter feeds, active measurements, link utilization, customer complaints) loads into base tables, which feed derived tables (service alerts, sentiment analysis, hourly aggregate, daily aggregate)]

Update Propagation
•  Basic algorithm: recompute partitions
   –  In general, most of the partition is affected.
   –  Non-SQL views: outputs of analyses
•  Determine the source partitions of a derived partition
•  Recompute if a source changes
•  Eventual consistency

[Figure: an update to source table S propagates to derived table D]
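
A minimal sketch of the basic recompute algorithm, assuming fixed-width time partitions where each derived partition covers a whole number of source partitions. The function names and the `compute` callback are illustrative assumptions, not the algorithm of any particular system.

```python
def source_partitions(derived_start, derived_len, source_len):
    """Start times of the source partitions that one derived partition depends on."""
    return list(range(derived_start, derived_start + derived_len, source_len))


def propagate(source, derived, changed, derived_len, source_len, compute):
    """Recompute every derived partition that has at least one changed source partition."""
    stale = {s - s % derived_len for s in changed}
    for d in sorted(stale):
        inputs = [source[s]
                  for s in source_partitions(d, derived_len, source_len)
                  if s in source]
        derived[d] = compute(inputs)   # full recomputation of the partition
    return sorted(stale)


def total(parts):
    return sum(sum(p) for p in parts)


# 5-minute source partitions (300 s) rolled up into hourly derived partitions (3600 s).
source = {0: [3, 4], 300: [5], 3600: [7]}
derived = {}
first = propagate(source, derived, {0, 300, 3600}, 3600, 300, total)

# Late data lands in an old source partition; only its hour is recomputed.
source[300].append(10)
second = propagate(source, derived, {300}, 3600, 300, total)
```

The derived table is eventually consistent: it matches the sources once every changed source partition has been propagated.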

Temporal Dimension Tables
•  Most streaming data describes events
   –  An event occurs at a point in time, or is a measurement during a well-defined interval
•  Some streaming data defines conditions
   –  Properties of an entity that endure for a time interval
   –  Temporal dimension tables: the timestamp is a valid-time interval.
•  Pervasive use
   –  You can’t evaluate an event without knowing about the environment
   –  Link speeds, cell tower locations, power grid organization
•  Snapshot tables don’t work
   –  Late-arriving data, recomputation, new long-term analyses.

Why don’t snapshot tables work?
•  Temperature(ts, id, degrees)
   –  Event stream
•  Location(id, X, Y, Z)
   –  Store as a snapshot
   –  Update periodically
•  Join Temperature and Location to find server room hot spots
•  Problems
   –  A set of sensors is moved, but Location isn’t refreshed until three hours later
   –  Sensors are moved, Location is updated within 15 minutes, but a networking problem delays reports of sensor data for 2 hours.
   –  Some sensors have been reporting Celsius instead of Fahrenheit for the last 3 days, so the data has to be reloaded.
•  Result
   –  Incorrect diagnosis of hot spots.

Temporal Dimension Table Example

SNMP_BytesTransferred
Ip_address   Timestamp   Bytes_xfered
4.3.2.1      1:05        1,000,000
4.3.2.1      1:10        1,200,000
4.3.2.1      1:15        2,200,000

LinkSpeed
Ip_address   Tlo     Thi    Speed
4.3.2.1      12:15   1:15   1,000,000 B/min
4.3.2.1      1:15    -      2,000,000 B/min

LinkUtilization
Ip_address   Timestamp   Utilization
4.3.2.1      1:05        .2
4.3.2.1      1:10        .24
4.3.2.1      1:15        .22
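
The tables above can be reproduced with a small temporal join. The sketch makes three assumptions that the slide does not state explicitly: each SNMP row covers a 5-minute polling interval, the validity interval is closed at Tlo and open at Thi, and times are on a 12-hour clock so that 12:15 precedes 1:05. The helper names are hypothetical.

```python
POLL_MINUTES = 5  # assumption: each SNMP row reports bytes for a 5-minute interval


def minutes(hhmm):
    """Convert 'h:mm' on a 12-hour clock to minutes, so 12:15 sorts before 1:05."""
    h, m = map(int, hhmm.split(":"))
    return (h % 12) * 60 + m


bytes_transferred = [
    ("4.3.2.1", "1:05", 1_000_000),
    ("4.3.2.1", "1:10", 1_200_000),
    ("4.3.2.1", "1:15", 2_200_000),
]

# Temporal dimension table: speed (B/min) valid on [Tlo, Thi); None means still valid.
link_speed = [
    ("4.3.2.1", "12:15", "1:15", 1_000_000),
    ("4.3.2.1", "1:15", None, 2_000_000),
]


def speed_at(ip, ts):
    """Look up the link speed that was valid for this address at time ts."""
    t = minutes(ts)
    for link_ip, tlo, thi, speed in link_speed:
        if link_ip == ip and minutes(tlo) <= t and (thi is None or t < minutes(thi)):
            return speed
    return None


link_utilization = [
    (ip, ts, round(nbytes / (speed_at(ip, ts) * POLL_MINUTES), 2))
    for ip, ts, nbytes in bytes_transferred
]
```

Joining the 1:15 measurement against the old 1,000,000 B/min speed, as a stale snapshot table would, gives 0.44 instead of 0.22.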

Temporal Dimension Tables
•  Updates
   –  Snapshots of current status, deltas.
•  Snapshot windows in StreamInsight
•  Compute from the stream
   –  Frames: based on a condition of records in a stream
   –  Interval punctuation
•  Maintain temporal dimension tables as a stream
   –  Use partitioning to preserve the locality of updates
   –  Merge partitions to avoid space blowup.

Optimizations

•  Multi-version Concurrency Control
•  Data Layout Optimizations
   –  Data Aging
   –  Partition Restructuring
•  Partition Revisions
•  Temporal Consistency Management
•  Workflow Scheduling
•  Real-time View Maintenance
•  Distributed Data Placement

Multi-version Concurrency Control
•  MVCC allows queries and updates to proceed concurrently
   –  Read isolation
   –  Long analytic queries do not block real-time updates
•  Single-updater MVCC is cheap and easy
   –  Use a directory-swap algorithm
•  Encourages use of cloud-friendly write-once files.

[Figure: directory swap from Version 1 to Version 2]
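
One way to picture the directory swap is the in-memory sketch below. It is an assumption-laden illustration, not the implementation of any system in this tutorial: the class `VersionedTable` is hypothetical, and tuples stand in for write-once partition files.

```python
import threading


class VersionedTable:
    """Single-updater MVCC via directory swap (hypothetical sketch)."""

    def __init__(self):
        self._directory = {}            # partition name -> immutable contents
        self._lock = threading.Lock()   # enforces a single updater at a time

    def snapshot(self):
        # A reader keeps a reference to the directory version it started with.
        return self._directory

    def update(self, changes):
        with self._lock:
            new_directory = dict(self._directory)  # copy the directory, not the data
            new_directory.update(changes)          # point at the newly written files
            self._directory = new_directory        # the swap: a single reference assignment


table = VersionedTable()
table.update({"p1": ("a",)})

reader_view = table.snapshot()                      # long-running query starts here
table.update({"p1": ("a", "b"), "p2": ("c",)})      # update proceeds without blocking it
current_view = table.snapshot()
```

The reader continues to see version 1 because published directories and partition contents are never modified in place; only new versions are written and swapped in.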



Partition Restructuring
•  As data ages, its best representation changes
   –  Most recent data: optimize for fast ingest
   –  Stable data: optimize for queries
   –  Historical data: minimize storage cost
•  Restructure partitions as the data ages
   –  MVCC allows data maintenance to occur as a non-interfering background task
•  Move the data as it ages
   –  Ramdisk → SSD → fast disk storage → archival disk storage

Partition Sizing

•  New partitions should match the update increment
•  Problem: partition explosion
   –  1-minute partitions, 1440 per day, 525,600 per year
•  Merge partitions as they age

[Figure: data and index partitions along the time axis; indexes optional]
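
The partition counts above follow from simple arithmetic, and merging aged partitions keeps the count manageable. The sketch below is illustrative only; `merge_aged` and its parameters are hypothetical names, and times are in minutes.

```python
MINUTES_PER_DAY = 24 * 60
PARTITIONS_PER_YEAR = MINUTES_PER_DAY * 365   # 1440 per day, 525,600 per year


def merge_aged(partitions, now, age_threshold, merged_len):
    """Merge partitions older than age_threshold into buckets of width merged_len."""
    merged = {}
    for start, rows in sorted(partitions.items()):
        if now - start > age_threshold:
            key = start - start % merged_len   # aged: fold into its coarse bucket
        else:
            key = start                        # recent: keep the fine-grained partition
        merged.setdefault(key, []).extend(rows)
    return merged


# Four hours of 1-minute partitions; anything older than two hours is merged hourly.
minute_partitions = {start: [start] for start in range(240)}
merged = merge_aged(minute_partitions, now=240, age_threshold=120, merged_len=60)
```

In this example the 240 one-minute partitions shrink to 122: two hourly partitions plus the 120 most recent one-minute partitions.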

Data Lifecycle Management
•  Write-optimized data
   –  Row-oriented, lightly indexed, uncompressed
•  Read-optimized data
   –  Highly indexed, lightly compressed, column storage if beneficial
•  Transform as a background task when the data becomes stable
   –  Combine with partition merging
•  Aggressive compression for archival data
•  Implementations in SAP HANA and Vertica

Data Lifecycle Management in SAP HANA

•  L1 delta -> L2 delta -> main
•  L1 delta
   –  Raw data, row storage, no compression, completely write-optimized (10k-100k rows)
•  L2 delta
   –  Column storage, bulk updates, dictionary encoding to save space, but unsorted dictionary (up to 10 million rows)
•  Main store
   –  Highest compression, sorted dictionary, completely read-optimized

Data Lifecycle Management in SAP HANA

•  L1 delta -> L2 delta is easy
   –  Incremental merge
   –  Can be done often
•  L2 delta -> main is harder
   –  L2 delta is already read-optimized but has a larger memory footprint than main
   –  Incremental merge is difficult
   –  Not done too often





Partition Revisions

•  Some data always arrives late
•  Problem: need to recompute existing partitions
   –  Disk prefers sequential access
   –  Write-once files: need to recompute the entire partition
•  Solution: chain updates to the partition
   –  Value of the partition is the sum of the primary (anchor) contents plus the updates (revisions).

Partition Revisions

•  Problem: Don’t change old partitions, but what if data arrives out of order?
•  Solution: Overflow chains (Truviso)

[Figure: over time, Packet_Stream feeds the Packets table, whose partitions consist of an anchor plus revisions]

When can we use overflow chains?
•  Whenever there is a simple and fast way to
   –  Compute a delta
   –  Add the delta to the main result.
•  Examples:
   –  Select/project
      •  Select/project on the source delta, union to integrate with the anchor partition.
   –  Aggregation queries (commutative aggregates)
      •  Subaggregation / superaggregation
   –  Outer join on a foreign key
      •  Collect replacement values
•  Works with “raw” and derived/aggregated data

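Data Layout
•  E.g., packet counts:

[Figure: over time, Packet_Stream feeds Packets, which feeds Packet_counts. Packet_counts partitions hold A: 1000, B: 1500, C: 1750; A: 1090, B: 1200, C: 2000; A: 1150, B: 2975, C: 1450; A: 1400, B: 1150, C: 1925. A late batch of 25 packets appears in Packet_counts as the revision B: 25.]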

Propagate Incremental Updates

[Figure: Packets holds an anchor plus Revisions 1, 2, and 3; Packet_counts holds an anchor plus Revisions 1 and 3]

Using Revisions
•  On access, load the anchor and update using the delta partitions
   –  Select/project: scan all partitions
   –  Aggregation: load hash table with delta, add to matching anchor records
•  Merge anchor and revisions as a partition restructuring task.

[Figure: Revisions 1 and 3 load subaggregates into a hash table; a scan of the anchor is combined with the hash table to produce merged records]
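
For a commutative aggregate such as a count, reading a partition with revisions can be sketched as follows. The pairing of the B: 25 revision with the anchor A: 1090, B: 1200, C: 2000 is an assumption made for illustration (the packet-count figure does not say which partition the late packets belong to), and `read_counts` is a hypothetical helper.

```python
from collections import Counter


def read_counts(anchor, revisions):
    """Combine an anchor partition of counts with its chain of revision deltas."""
    delta = Counter()
    for revision in revisions:
        delta.update(revision)        # load the subaggregates into a hash table
    merged = dict(anchor)
    for key, extra in delta.items():  # add each delta to the matching anchor record
        merged[key] = merged.get(key, 0) + extra
    return merged


anchor = {"A": 1090, "B": 1200, "C": 2000}
revisions = [{"B": 25}]               # 25 late packets for key B

merged = read_counts(anchor, revisions)
```

Writing `merged` back as a new anchor and dropping the revisions corresponds to the partition restructuring task mentioned above.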





Temporal Consistency Management

•  Traditional notion of consistency: a snapshot of the system.
•  Doesn’t apply in a stream warehouse
   –  Late-arriving data is common
   –  Different data sources have different time lags and different likelihoods of late data
•  Instead, label data by its degree of completeness

[Figure: number of windows per package; y-axis: number of windows (0 to 12), x-axis: time in seconds (0 to 600,000)]

Query Stability •  How do I know when the data is stable enough to query?

•  What is stable enough? – Data will never change – Data won’t change much. –  I’ll take whatever is there.

Consistency Levels
•  Punctuations on partitions that indicate completeness.
•  Example (simple) collection of consistency levels
   –  Open: the partition should have some data in it.
   –  Closed: the partition will not change.
   –  Complete: the partition will not change, and all data has been received.
•  Closed is a guess
   –  WeaklyClosed, StronglyClosed
•  Infer at base tables, propagate inferences to materialized views.
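
A possible reading of "propagate inferences" is that a derived partition is only as complete as its least complete source. The sketch below encodes that rule; the rule itself, the numeric ordering of the levels, and the function names are assumptions rather than something the slides spell out.

```python
from enum import IntEnum


class Level(IntEnum):
    """Consistency levels ordered from least to most complete."""
    OPEN = 1
    CLOSED = 2
    COMPLETE = 3


def derived_level(source_levels):
    """Assumed rule: a derived partition inherits the weakest level among its sources."""
    return min(source_levels)


def can_query(level, required):
    """A query declares the level it needs; the partition must be at least that stable."""
    return level >= required


# Levels inferred at the base tables for one time partition (hypothetical values).
base_levels = {"Compass_BPS_Base": Level.CLOSED, "NetDB_Speed": Level.OPEN}
view_level = derived_level(base_levels.values())

ok_for_troubleshooting = can_query(view_level, Level.OPEN)
ok_for_daily_rollup = can_query(view_level, Level.CLOSED)
```

Under this rule a troubleshooting query that accepts Open data can run immediately, while a roll-up that requires Closed sources waits.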

Example
•  Troubleshooting
   –  Compass_BPS correlates
      •  Compass_BPS_Base: SNMP measurements of bytes per second over a router interface
      •  NetDB_SPEED: link speed of an interface
   –  to provide link utilizations.
   –  Compass_BPS_RAW: Mostly_Closed

[Figure: Compass_BPS_Base and NetDB_Speed feed Compass_BPS; partitions are marked Open, Mostly_Closed, or Closed, with an “OK for troubleshooting” marker]

Examples

•  Update Propagation
   –  Roll up Compass_BPS into daily aggregates.
   –  All source Compass_BPS data should be Closed to avoid expensive recomputations.





Workflow Scheduling
•  Need to limit resource use to avoid thrashing.
   –  Hundreds of tables to update, limited (CPU, memory, cache, network) resources.
   –  Exclusive resources: non-preemptive scheduling.
•  Ensure that high-priority jobs can execute
   –  Real-time scheduling
•  Measures of lateness:
   –  Staleness: the difference between the current time and the most recent data.
   –  Tardiness: the difference between a task deadline and task completion.

Workflow Scheduling
•  Staleness function: difference between current time and most recent data loaded
•  Hierarchies of views with highly varying execution times.

[Figure: staleness over time (9:30, 9:45, 10:00, 10:15) for the table hierarchy: Twitter feeds, active measurements, link utilization, customer complaints, service alerts, sentiment analysis, hourly aggregate, daily aggregate; updates range from fast and frequent to slow and infrequent]

Bounded Tardiness Scheduling
•  Bound on the maximum tardiness of any task in a task set.
•  If update jobs are scheduled regularly, bounded tardiness => bounded staleness
•  Most real-time scheduling algorithms have bounded tardiness
   –  EDF, minimum slack, etc.
   –  There can be differences in the tardiness bounds
•  Pick a heuristic that works well
   –  E.g., pick the task that provides the largest marginal reduction in staleness.
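
One simple reading of the staleness heuristic is sketched below: among tables that have new data waiting, run the update that removes the most staleness. The `Table` fields and the greedy rule are assumptions for illustration; a real scheduler would also weigh execution time, priorities, and resource limits.

```python
from dataclasses import dataclass


@dataclass
class Table:
    name: str
    loaded_through: int      # timestamp of the most recent data already loaded
    available_through: int   # timestamp of the most recent data waiting at the source


def staleness(table, now):
    """Difference between the current time and the most recent data loaded."""
    return now - table.loaded_through


def pick_next(tables):
    """Greedy heuristic: choose the update with the largest reduction in staleness."""
    candidates = [t for t in tables if t.available_through > t.loaded_through]
    if not candidates:
        return None
    return max(candidates, key=lambda t: t.available_through - t.loaded_through)


tables = [
    Table("link_util", loaded_through=900, available_through=960),
    Table("twitter_feeds", loaded_through=600, available_through=960),
    Table("daily_aggregate", loaded_through=960, available_through=960),
]

chosen = pick_next(tables)
before = staleness(chosen, now=1000)
chosen.loaded_through = chosen.available_through   # simulate running the update
after = staleness(chosen, now=1000)
```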

Track Scheduling
•  Complication: large differences in task execution time
   –  Update a base table with 1 minute of data vs. compute a daily aggregate.
•  Tardiness bounds depend on the largest task execution times.
   –  Long tasks block short critical tasks.
•  Track Scheduling:
   –  Partition tasks by execution time.
   –  Restrict the number of long tasks that can execute concurrently
   –  Reserve resources for short critical tasks

Transient Overload
•  Common source of overload: catch-up processing.
   –  A feed breaks for a day, then is restored.
   –  The source schema changes, requiring a pause in processing to change update procedures.
   –  New tables load a long history
•  Update Chopping
   –  Break a (temporally) long update into short segments.
•  Update period adjustment
   –  Decrease the period of backlogged tables to use up (but not oversubscribe) available resources.
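
Update chopping can be illustrated with a few lines of code. The function name and the half-open interval convention are assumptions; the point is only that a long backlog becomes several short tasks that the scheduler can interleave with other work.

```python
def chop_update(t_lo, t_hi, segment):
    """Split the backlog [t_lo, t_hi) into consecutive segments of at most `segment` time units."""
    return [(s, min(s + segment, t_hi)) for s in range(t_lo, t_hi, segment)]


# A feed was down for 24 hours; catch up in 6-hour segments instead of one long update.
segments = chop_update(0, 24, 6)

# The last segment is shorter when the backlog is not a multiple of the segment length.
uneven = chop_update(0, 10, 4)
```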





DBToaster
•  Maintain complex aggregate views over streaming data.
•  In-memory architecture: all storage is via hash tables.
   –  1 TB main-memory servers are inexpensive
•  Uses a novel recursive-delta technique to accelerate maintenance
   –  Collection of support views that can significantly reduce update time.

[Figure: view hierarchy with Join(R,S,T) at the top, supported by Join(S,T), Join(R,T), and Join(R,S), over base relations T, S, and R]

R-Store
•  Maintain data cubes from OLTP streams using HBase and MapReduce
   –  Base table stored in HBase
   –  MapReduce used to periodically materialize a data cube
   –  Use the multi-versioning feature of HBase
      •  A query is assigned a timestamp of its entry time.
      •  Use it to find records in HBase which arrived since the last data cube materialization but before the query arrival
      •  Compute an incremental update to the data cube
      •  Data cube materialization allows HBase compaction (discard old records).
   –  Use MapReduce to compute the query result from the incrementally updated data cube
•  Tuesday 10:30 – 12:00 session.





Distributed Stream Warehousing

•  Distribute (and replicate) data and queries across multiple servers
   –  Fault tolerance
   –  Parallel processing
   –  But: the data you need may be on different servers
      •  → data communication/shuffling penalty

Data Placement

•  OLTP: distribute data such that (most) transactions don’t have to touch more than one server
   –  i.e., can make local commit decisions
•  Minimize the number of distributed transactions → hypergraph partitioning to minimize the number of cut hyperedges

Hypergraph Partitioning

•  4 queries/transactions
   –  Each is a hyperedge
•  6 tuples, T1 through T6
   –  Each is a node
•  E.g., if T1 and T2 are placed on different servers, the Q3 hyperedge is cut → Q3 is a distributed transaction

[Figure: hypergraph with nodes T1 through T6 and hyperedges Q1 through Q4]
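
The objective can be made concrete by counting cut hyperedges for a given placement. In the sketch below only the fact that Q3 touches both T1 and T2 comes from the slide; the remaining query memberships and both placements are hypothetical, since the figure's exact hyperedges are not recoverable here.

```python
def cut_hyperedges(queries, placement):
    """Return the queries whose tuples span more than one server (distributed transactions)."""
    return sorted(
        q for q, tuples in queries.items()
        if len({placement[t] for t in tuples}) > 1
    )


# Hyperedges: which tuples each query/transaction touches (mostly hypothetical).
queries = {
    "Q1": {"T1", "T5"},
    "Q2": {"T2", "T6"},
    "Q3": {"T1", "T2", "T3"},
    "Q4": {"T3", "T4"},
}

# Placement A separates T1 from T2, so only Q3 is cut.
placement_a = {"T1": "s1", "T5": "s1", "T2": "s2", "T6": "s2", "T3": "s2", "T4": "s2"}
cut_a = cut_hyperedges(queries, placement_a)

# Placement B keeps Q3 and Q4 local but cuts Q1 and Q2.
placement_b = {"T1": "s1", "T2": "s1", "T3": "s1", "T4": "s1", "T5": "s2", "T6": "s2"}
cut_b = cut_hyperedges(queries, placement_b)
```

A partitioner would search over placements for the one with the fewest cut hyperedges, subject to balancing the load across servers.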

Stream Warehouse Setting

•  Don’t have transactions/multiple writers
•  Instead, standing queries/views/workflows over data feeds
   –  Whenever a new batch of data arrives, the same workload runs
•  Which are often very data-intensive
   –  Have to collect data from multiple servers before doing any computation

Stream Warehouse Setting

•  Turns out that
   –  Minimize data transfers → (regular) graph partitioning to minimize the weight of the cut edges
   –  (For a carefully constructed bipartite graph)

Bipartite Graph Partitioning

•  Queries on the left
•  Tables on the right

[Figure: bipartite graph with queries Q1 through Q4 on the left and tables T1 through T6 on the right]

Data Stream Quality

•  New data quality problems
   –  Systematic errors in machine-generated streams
   –  Correlated glitches
      •  Missing values followed by a very large value
   –  Missing/delayed/future/imprecise/duplicate data
      •  CPU utilization = -1
      •  Counter loop-around
      •  Link utilization = 5000%
      •  Failing smart meters -> undercount

Integrity Constraints for Data Quality

•  Functional Dependencies
   –  Postcode -> City
•  Conditional Functional Dependencies
   –  Country, Postcode -> City, but only for Country = Canada
•  Inclusion Dependencies
•  CHECK Constraints

New Integrity Constraints for Streaming Data

•  New semantics
   –  Order
   –  Data arrival frequency
   –  Conservation laws among related streams

Examples

●  Sequential Dependencies
   ●  Golab, Karloff, Korn, Saha, Srivastava, VLDB 2009

●  Conservation Dependencies
   ●  Golab, Karloff, Korn, Saha, Srivastava, ICDE 2012 and TKDE (to appear)

Sequential Dependencies

●  Given an interval g, X →_g Y asserts that the distance between two consecutive Y values, when sorted on X, is within g
●  X →_(0,∞) Y means that Y is strictly increasing with X
   ●  e.g., time →_(0,∞) sequence_number
   ●  Violations indicate out-of-order arrivals
   ●  shipping_date →_(0,∞) delivery_date

Sequential Dependencies

●  More complex examples:
   ●  sequence_number →_[4,5] time
   ●  i.e., time "gaps" between consecutive records (sequence numbers) are between 4 and 5
   ●  Can measure QoS of a data feed expected to arrive with some frequency
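
A sequential dependency can be checked directly from its definition. The sketch below only lists violating pairs; it does not attempt the interval discovery described in the cited VLDB 2009 paper. The function name, the `closed` flag, and the sample data are assumptions for illustration.

```python
import math


def sd_violations(rows, lo, hi, closed=True):
    """Return consecutive (x, y) pairs, sorted on x, whose y-gap falls outside the interval.

    closed=True tests lo <= gap <= hi; closed=False tests lo < gap < hi.
    """
    ordered = sorted(rows)
    bad = []
    for (x1, y1), (x2, y2) in zip(ordered, ordered[1:]):
        gap = y2 - y1
        ok = (lo <= gap <= hi) if closed else (lo < gap < hi)
        if not ok:
            bad.append(((x1, y1), (x2, y2)))
    return bad


# poll_number ->_[9,11] time: polls should arrive roughly every 10 seconds.
polls = [(1, 0), (2, 10), (3, 21), (4, 45), (5, 55)]
late_polls = sd_violations(polls, 9, 11)

# time ->_(0,inf) packet_count: a cumulative counter should strictly increase.
counters = [(1, 100), (2, 150), (3, 20), (4, 60)]
resets = sd_violations(counters, 0, math.inf, closed=False)
```

Here the 24-second gap between polls 3 and 4 violates the first dependency, and the drop from 150 to 20 (a counter reset or loop-around) violates the second.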

Example 1: IP Network Monitoring

●  Network operators collect stats from routers, e.g., number of packets
●  Routers store cumulative counters that can be queried
●  Counters periodically loop around
   ●  and reset after reboot
●  Spurious measurements
   ●  queried the wrong router?

time →_(0,∞) packet_count

[Figure: packet_count versus time]

●  [1,10], [11,15] and [17,20] satisfy the SD exactly
●  [1,11] and [11,20] satisfy the SD approximately with one exception each

time →_(0,∞) packet_count

●  Discovered intervals summarize data semantics and help identify problems
   ●  Disruptions in ordering
   ●  Many short intervals -> premature counter rollover?
●  Note: may not need 100% confidence
   ●  Helps avoid overfitting

Example 2: IP Network Monitoring

●  Auditing the polling frequency
●  Suppose the data collection mechanism is configured to probe every 10 sec
   ●  Too often -> unnecessary overhead
   ●  Not often enough -> not enough info for troubleshooting

poll_number →_[9,11] time

●  [10,90] and [180,240] identify intervals that satisfy the SD (almost exactly)
●  Can also try, say, poll_number →_(20,∞) time to find periods with long gaps
   ●  [30,60] and [120,200]
   ●  Easier to analyze than a long list of individual violations


Conservation Dependencies

●  Often there exist conservation laws between related quantities
   ●  Kirchhoff's law of conservation of electricity
   ●  Also holds in road networks (intersections), telecom networks (routers), smart grid, buildings, etc.
●  Violations of conservation laws
   ●  Temporary violations due to delays
   ●  Permanent violations due to missing or incorrect measurements

Assumption

●  Don't have exact correspondence between “incoming” and “outgoing” events
   –  Just periodic incoming and outgoing counts
      ●  E.g., SNMP counters, highway sensors, door sensors, transactions, etc.
   –  Otherwise easy to track violations/delays
      ●  E.g., order shipment times & delivery times

Conservation Dependencies

[Figure: example with per-node counts: 50 in, 70 out; 65 in, 70 out; 65 in, 25 out; 50 in, 65 out]
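
With only periodic counts available, one simple diagnostic is the running difference between cumulative incoming and cumulative outgoing counts. This is a simplified illustration rather than the confidence measure of the cited ICDE 2012 paper, and the door-sensor numbers are hypothetical.

```python
from itertools import accumulate


def running_imbalance(incoming, outgoing):
    """Cumulative incoming minus cumulative outgoing after each reporting period."""
    return [i - o for i, o in zip(accumulate(incoming), accumulate(outgoing))]


# Hypothetical door-sensor counts per period for one building.
entries = [10, 20, 5, 0]

# All exits monitored: people leave later than they enter, but the gap closes.
exits_all = [0, 15, 15, 5]
temporary = running_imbalance(entries, exits_all)

# One exit unmonitored: the gap never returns to zero.
exits_missing = [0, 10, 10, 5]
persistent = running_imbalance(entries, exits_missing)
```

An imbalance that rises and then returns to zero is consistent with a delay, while one that stays positive suggests missing or incorrect measurements.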

Example

●  Persistent violation: unmonitored side exit
●  Also smaller violations
   ●  Correspond to meetings in the building
   ●  Plus, violations around lunchtime
●  Other examples: unmonitored links in IP network monitoring data streams

Tidalrace
•  Tidalrace is a next-generation stream warehousing system incorporating
   –  Micro-batch updates
      •  Patented update propagation algorithm
   –  Deep levels of derived data products
   –  Multi-version concurrency control
   –  Streaming updates to temporal metadata tables
      •  Patent application pending
   –  Temporal consistency support
      •  Patent application pending
   –  Anchor/revision optimizations for in-the-past updates.
      •  Patent application pending
   –  Data layout optimizations (row vs. column store, …)
   –  Cloud-friendly write-once files

Tidalrace Architecture

[Figure: data loading and update propagation, queries, and maintenance run over the Storage Manager (D3SM); the Tidalrace metadata system is stored in MySQL; data partitions and indices are stored in a file system (local, D3FS, HDFS)]

Tidalrace Status
•  The Tidalrace server is implemented on top of MySQL.
   –  Backing store database can be replaced.
•  Base table loading and update propagation algorithms.
•  SQL partially implemented.
   –  Select, project, aggregation.
   –  Aggregation speed: 500,000 records/sec per thread.
•  Update propagation to materialized views defined using the Tidalrace SQL.
•  Streaming updates to temporal metadata tables.
•  Materialized views defined by outer join to a temporal metadata table.
•  Supports local storage (multiple hosts), D3FS, HDFS versions 1 and 2.
•  More under development.

Open Problems
•  Hybrid system architectures and cross-system optimizations
•  Big and fast analytics as a cloud service
•  Big/fast data mining
•  Data stream quality/profiling
•  Complexity management and administration of a big/fast data management system

Bibliography

Bibliography: Applications
•  Networking
–  L. Golab, T. Johnson, S. Sen, J. Yates: A Sequence-Oriented Stream Warehouse Paradigm for Network Monitoring Applications. PAM 2012: 53-63
–  C. Kalmanek et al.: Darkstar: Using Exploratory Data Mining to Raise the Bar on Network Reliability and Performance. DRCN 2009
–  H. Yan, A. Flavel, Z. Ge, A. Gerber, D. Massey, C. Papadopoulos, H. Shah, J. Yates: Argus: End-to-end service anomaly detection and localization from an ISP's point of view. INFOCOM 2012: 2756-2760
•  Smart Grid
–  S. Ramchurn et al.: Putting the 'smarts' into the smart grid: a grand challenge for artificial intelligence. Commun. ACM 55(4): 86-97 (2012)
•  Semiconductor Manufacturing
–  www.appliedmaterials.com/technologies/library/techedge-prizm
–  www.extremetech.com/extreme/155588-applied-materials-designs-tools-to-leverage-big-data-and-build-better-chips
•  Social media
–  G. Mishne et al.: Fast data in the era of big data: Twitter’s real-time related query suggestion architecture. SIGMOD 2013: 1147-1158

Bibliography: DBMS-based Systems
•  DataCell: E. Liarou, R. Goncalves, S. Idreos: Exploiting the power of relational databases for efficient stream processing. EDBT 2009: 323-334
•  DataDepot: L. Golab, T. Johnson, J. S. Seidel, V. Shkapenyuk: Stream warehousing with DataDepot. SIGMOD Conference 2009: 847-854
•  DBStream: A. Baer et al.: DBStream: an Online Aggregation, Filtering and Processing System for Network Traffic Monitoring. TRAC workshop at IWCMC 2014, to appear
•  Truviso: S. Krishnamurthy, M. J. Franklin, J. Davis, D. Farina, P. Golovko, A. Li, N. Thombre: Continuous analytics over discontinuous streams. SIGMOD 2010: 1081-1092

Bibliography: DSMS-based Systems
•  D. J. Abadi, D. Carney, U. Çetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, S. B. Zdonik: Aurora: a new model and architecture for data stream management. VLDB J. 12(2): 120-139 (2003)
•  T. Akidau, A. Balikov, et al.: MillWheel: Fault-tolerant stream processing at internet scale. VLDB 2013: 734-746
•  M. Balazinska, Y. C. Kwon, N. Kuchta, D. Lee: Moirae: History-Enhanced Monitoring. CIDR 2007: 375-386

Bibliography: Distributed Systems
•  P. Bhatotia, A. Wieder, R. Rodrigues, U. A. Acar, R. Pasquini: Incoop: MapReduce for incremental computations. SoCC 2011: 7
•  T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, R. Sears: MapReduce Online. NSDI 2010: 313-328
•  W. Lam, L. Liu, S. T. S. Prasad, A. Rajaraman, Z. Vacheri, A. Doan: Muppet: MapReduce-Style Processing of Fast Data. PVLDB 5(12): 1814-1825 (2012)
•  B. Li, E. Mazur, Y. Diao, A. McGregor, P. J. Shenoy: SCALLA: A Platform for Scalable One-Pass Analytics Using MapReduce. ACM Trans. Database Syst. 37(4): 27 (2012)
•  C. Olston et al.: Nova: continuous Pig/Hadoop workflows. SIGMOD Conference 2011: 1081-1090
•  M. Zaharia et al.: Discretized streams: fault-tolerant streaming computation at scale. SOSP 2013: 423-438

Bibliography

•  Late Arriving Data
–  S. Krishnamurthy et al.: Continuous analytics over discontinuous streams. SIGMOD 2010: 1081-1092
–  J. Li, K. Tufte, V. Shkapenyuk, V. Papadimos, T. Johnson, D. Maier: Out-of-order processing: a new architecture for high-performance stream systems. PVLDB 1(1): 274-288 (2008)
–  Lukasz Golab, Theodore Johnson: Consistency in a Stream Warehouse. CIDR 2011: 114-122

Bibliography
•  Update Propagation / Workflow
–  T. Johnson, V. Shkapenyuk: Update Propagation in a Streaming Warehouse. SSDBM 2011: 129-149
–  C. Olston et al.: Nova: continuous Pig/Hadoop workflows. SIGMOD Conference 2011: 1081-1090
•  Temporal Dimension Tables
–  M. Li, M. Mani, E. A. Rundensteiner, D. Wang, T. Lin: Interval Event Stream Processing. DEBS 2008
–  David Maier, Michael Grossniklaus, Sharmadha Moorthy, Kristin Tufte: Capturing episodes: may the frame be with you. DEBS 2012: 1-11
–  Snapshot windows: http://msdn.microsoft.com/en-us/library/ff518550.aspx

Bibliography
•  MVCC
–  D. Quass, J. Widom: On-Line Warehouse View Maintenance. SIGMOD Conference 1997: 393-404
–  V. Sikka, F. Färber, W. Lehner, S. K. Cha, T. Peh, C. Bornhövd: Efficient transaction processing in SAP HANA database: the end of a column store myth. SIGMOD Conference 2012: 731-742
•  Data Partition Transformations
–  V. Sikka, F. Färber, W. Lehner, S. K. Cha, T. Peh, C. Bornhövd: Efficient transaction processing in SAP HANA database: the end of a column store myth. SIGMOD Conference 2012: 731-742
–  A. Lamb, M. Fuller, R. Varadarajan, N. Tran, B. Vandier, L. Doshi, C. Bear: The Vertica Analytic Database: C-Store 7 Years Later. PVLDB 5(12): 1790-1801 (2012)

Bibliography
•  DBToaster
–  Y. Ahmad, O. Kennedy, C. Koch, M. Nikolic: DBToaster: Higher-order Delta Processing for Dynamic, Frequently Fresh Views. Proc. VLDB 2012
–  F. Li, M. T. Özsu, G. Chen, B. C. Ooi: R-Store: A Scalable Distributed System for Supporting Real-time Analytics. Proc. ICDE 2014
•  Partition Revisions
–  S. Krishnamurthy, M. J. Franklin, J. Davis, D. Farina, P. Golovko, A. Li, N. Thombre: Continuous analytics over discontinuous streams. SIGMOD 2010: 1081-1092
•  Temporal Consistency Management
–  Lukasz Golab, Theodore Johnson: Consistency in a Stream Warehouse. CIDR 2011: 114-122
•  Bounded Tardiness Scheduling
–  H. Leontyev, J. H. Anderson: Generalized tardiness bounds for global multiprocessor scheduling. Real-Time Systems 44(1-3): 26-71 (2010)

Bibliography
•  Stream Warehouse Scheduling
–  Lukasz Golab, Theodore Johnson, Vladislav Shkapenyuk: Scalable Scheduling of Updates in Streaming Data Warehouses. IEEE Trans. Knowl. Data Eng. 24(6): 1092-1105 (2012)
–  S. Guirguis, M. A. Sharaf, P. K. Chrysanthis, A. Labrinidis, K. Pruhs: Adaptive Scheduling of Web Transactions. Proc. 2009 Intl. Conf. on Data Engineering

Bibliography
•  Distributed Stream Warehousing
–  C. Curino, E. Jones, Y. Zhang, S. Madden: Schism: a workload-driven approach to database replication and partitioning. PVLDB 3(1-2): 48-57 (2010)
–  L. Golab, M. Hadjieleftheriou, H. Karloff, B. Saha: Distributed data placement to minimize communication costs via graph partitioning. CoRR abs/1312.0285

Bibliography
•  Data stream quality
–  Lukasz Golab, Howard J. Karloff, Flip Korn, Avishek Saha, Divesh Srivastava: Sequential Dependencies. PVLDB 2(1): 574-585 (2009)
–  Lukasz Golab, Howard J. Karloff, Flip Korn, Barna Saha, Divesh Srivastava: Discovering Conservation Rules. ICDE 2012: 738-749
–  Tamraparni Dasu, Ji Meng Loh: Statistical Distortion: Consequences of Data Cleaning. PVLDB 5(11): 1674-1683 (2012)
–  Lukasz Golab: Data Warehouse Quality: Summary and Outlook. In: S. Sadiq (ed.), Handbook of Data Quality - Research and Practice, Springer-Verlag Berlin Heidelberg 2013
