lukaszgolab$ - university of waterloolgolab/sigmod13_tutorial.pdf ·...
Data Stream Warehousing
Lukasz Golab, University of Waterloo
Theodore Johnson, AT&T Labs - Research

Big Data
• Every 2 days we create as much information as we did up to 2003 (Eric Schmidt)
• Becoming easier to produce/collect
  – Sensors, Web, cheap bandwidth
• Becoming easier/cheaper to store
  – Cheap hard disks, commodity hardware

Traditional Big Data Workflow
• Wait for data to arrive
• Prepare and load data
  – Into HDFS, key-value store, …
  – or into a database, then index
• Compute result
• Start over

But
• Many interesting data sets are “streaming”
  – Monitoring (IP networks, infrastructure, smart transportation systems and power grids, RFID, system logs, manufacturing)
  – Transactions (stock tickers, credit card purchases)
  – User behaviour logs (Web, social media)

Stream Data Workflow
• For each item or batch of items
  – Do some processing
  – Compute/update results
• Now feasible due to cheap RAM, multi-cores, etc.
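
The per-batch loop above can be sketched as follows. This is only an illustration: the class `RunningStats`, the choice of a running mean as the "result", and the sample keys are assumptions, not part of the tutorial.

```python
from collections import defaultdict


class RunningStats:
    """Per-key count and sum, updated incrementally as items arrive."""

    def __init__(self):
        self.count = defaultdict(int)
        self.total = defaultdict(float)

    def process_batch(self, batch):
        for key, value in batch:
            # "Do some processing": here, just skip items with no value.
            if value is None:
                continue
            # "Compute/update results": adjust stored state, no rescan of old data.
            self.count[key] += 1
            self.total[key] += value

    def mean(self, key):
        return self.total[key] / self.count[key]


stats = RunningStats()
stats.process_batch([("link1", 10.0), ("link2", 4.0)])
stats.process_batch([("link1", 20.0), ("link2", None)])
```

Each batch touches only the state of the keys it contains, which is why the result stays current without reloading earlier data.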

“Fast Data” Systems
• Data Stream Management Systems (DSMS)
  – Borealis, StreamBase, Gigascope
  – Simple queries over fast append-only data
  – Results streamed out, usually not stored
• Key-value stores have fast transactional response, but analytics are difficult
  – Put/get interface makes correlation difficult
  – Analytics are inefficient on distributed stores

In This Tutorial
• Big Data Management
  – Focus on scalability and deep analytics, but high latency
• Fast Data Management
  – Low latency, but limited capability and no persistent storage
• Can we do both?
  – Data Stream Warehousing

Five “V”s of Big Data
• Volume
• Velocity
• Variety
  – Data integration
• Verification
  – Data cleaning
• Value
  – Data mining

Outline
• Why?
• What?
• Detailed example
• How?
  – Common elements
  – System architectures
  – Performance optimizations
  – Data stream quality
• Open Problems

Why?
• Could have 2 separate systems, but
  – Not clear where to divide the systems
  – Overhead of moving data from one system to the other
  – Harder to develop applications
    • Different SQL dialects, etc.
  – Historical data provides context for real-time data
  – Even traditional analytics/reporting is becoming more real-time
• Reduce time from ingest to insight

What?
• Load data from a multitude of streaming sources
  – Wide variation in data latencies
• Provide transparent access to both real-time and historical data
• Gracefully handle late-arriving data
• Schedule queries and updates to materialized views in spite of highly variable workloads
  – Load shedding by dropping data is not an option

Multitude of streaming sources
• Data becomes most useful when you can correlate results from many sources
  – Hundreds to thousands of distinct data feeds
• Network monitoring
  – Correlate Twitter feeds, active monitoring streams, and link utilizations to identify trouble spots
• Smart Grid
  – Correlate smart meter readings, line temperature measurements, and phasor measurement units to proactively react to overloads and avoid blackouts

Late-arriving data
[Figure: Number of Windows (y-axis, 0 to 12) versus Time in seconds (x-axis, 0 to 600,000)]
• Late-arriving data is a common problem for streaming systems.
• DSMS: data arrives minutes late
• Stream warehouse: data can arrive days late
• Load all data and propagate their results in spite of lateness.

Transparent Access
• Alerting, troubleshooting, real-time data mining all depend on access to real-time and historical data
• Hard to draw a boundary between new and old

Scheduling
• Ensure that the most time-critical applications/views get priority service.
• Ensure that no application is starved
• In spite of temporary overload

Network Monitoring
• Darkstar project at AT&T Labs
• Motivating application for the Data Depot stream warehouse system
• Data collected:
  – Passive and active probe measurements, route monitoring, system logs, configuration data, customer service tickets and notes
• For:
  – Networking research, data mining, alerting, troubleshooting

Darkstar: Mining Vast Amounts of Data
[Diagram of the network and the data feeds collected from it. Labels:]
• Route monitors (OSPFmon, BGPmon)
• Device service monitoring (CIQ, MTANet, STREAM)
• Active service and connectivity monitoring
• Syslog
• Config
• SNMP Polling (router, link)
• Netflow
• Deep Packet Inspection (DPI)
• Alarms
• Tickets
• Authentication/logging (tacacs)
• Customer feedback – IVR, tickets, MTS
• Network: IP Backhaul; Enterprise IP, VPNs; Ethernet Access; IPTV; Layer one; Mobility

ARGUS: Detecting Service Issues…
• Goal: detect and isolate actionable anomaly events using comprehensive end-to-end performance measurements (e.g. GS tool)
  – Sophisticated anomaly detection and heuristics
  – Spatial localization
  – Accurately accounts for service performance that varies considerably by time-of-day and location
• Impact:
  – Reduced detection time from days to approx. 15 mins for detecting data service issues
  – Operational nation-wide monitoring of data service performance for 3G and LTE (TCP retransmission, RTT, throughput from GS Tool)

Approach: Mobility Localization Hierarchy
[Diagram: localization hierarchy with levels Market, Sub-Market, SGSN, RNC, and SITE, plus GGSN nodes; annotated “Collect end-to-end Performance Data”]

Case Example: Silent CE Overload Condition
• ARGUS detected event: 2 Columbia 3G Ericsson SGSNs impacting RNCs in West Virginia, Norfolk, and Richmond
• No other indication of issue
• Topology highlighted CE used by only impacted SGSNs
• RCA: “6148 48 port 1gig card is limited to a shared 1 gig bus for each set of 8 gig ports”
[Figure panels: “ARGUS alarm: clmamdorpn2 (TCP retransmissions)” and “CE utilization flattening”]

ARGUS As a General Capability…
[Figure panels and annotations:]
• Spike in call drop rate on MSC hrndvacxca1
• RTT anomalies (SGSN level)
• Outage start 5:30 GMT
• First anomaly 5:40 GMT
• CTS ticket created 08:21 GMT
• Social media (Twitter): NY outage, LA outage
• Node metrics, active measurements (CBB, IPAG WIPM delay)…
• Mobility customer tickets (Boston market – PE isolation)

Ptolemy
Use network visualization and convenient data exploration to help network operators with network health monitoring and service problem troubleshooting
• 1. At-a-glance view of network topology and state
• Visualization to summarize important information on network health
• Color-coded
• Complementary to ticketing system – reporting issues below “alarming” status
http://ptolemy.research.att.com/
http://ptolemy.research.att.com/mobility

Example 1: Japan Earthquake, March 11th 2011
Assess damage, identify remaining capacity
[Figure annotation: Loss of many links out of Japan. What’s left?]

Example 1: Japan Earthquake, March 11th 2011
Identify traffic shifts, no congestion
[Figure: link load; annotation: Increase in link load as traffic re-routed]

Recap
• Load data from multiple diverse sources
• Transparent access to real-time and historical data
• Schedule queries/updates
  – And materialized views
• Handle late/out-of-order data
• Could have two separate systems, but …

Architectures
• DSMS-based
• DBMS-based
• Hadoop-based

DSMS-based
• Add ability to store data (e.g., Aurora/Borealis)
[Diagram labels: Output stream; “Static” data set; Connection point]

DSMS-based
• Example 2: Moirae: History-enhanced monitoring
[Diagram: Borealis and Postgres; labels: SQL query, Data export]

DSMS-based
• Example 3: Dejavu: pattern matching over live and historical streams
  – Actually DBMS-based (MySQL)
[Diagram: Pattern matching engine and MySQL; labels: Pattern match query, Data export]

DSMS-based
• Pros
  – Enables real-time processing with context
• Cons
  – Does not enable complex analytics
    • Must keep up with live data
  – Stores limited history

DBMS-based
• Use the query processing and storage engine of a DBMS
• Add layers for additional services
  – Fast data load
  – Temporal partitioning
  – Update propagation
  – Scheduling
• Add stream warehouse-specific features and optimizations

DBMS-based
• Design decisions:
  – Row store (Data Depot/Daytona, Truviso/Postgres) vs. column store (DataCell/MonetDB, SAP HANA, Vertica)
  – Disk (Data Depot, Truviso) vs. main memory (DBToaster, SAP HANA)

DBMS-based
• Pros:
  – Leverage SQL, query optimization, data storage
• Cons:
  – Not quite real-time

Hadoop / Map-Reduce based
• HOP (Hadoop Online Prototype)
• Idea: instead of waiting for all mappers to finish, send output incrementally from mappers to reducers
  – Periodically invoke reducers on the available data

Hadoop / Map-Reduce based
• MapUpdate/Muppet (Walmart Labs), similar ideas in: Incoop, SCALLA
  – Reduce: for each key, process all values and return a single output value
  – Update: given a new (k,v) pair, return an updated output value using the new pair and state of k
    • And update the state
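
To make the contrast between Reduce and Update concrete, here is a small sketch using a per-key sum. The function names and the in-memory `state` dictionary are illustrative assumptions; this is not the Muppet API.

```python
def reduce_sum(key, values):
    """Reduce: sees all values for a key at once and returns one output."""
    return sum(values)


def update_sum(key, value, state):
    """Update: sees one new (key, value) pair plus the stored state of key.

    Returns (output, new_state); for a sum, both are the running total.
    """
    new_state = state + value
    return new_state, new_state


def run_updates(pairs):
    """Feed pairs one at a time, keeping per-key state between calls."""
    state = {}
    outputs = []
    for key, value in pairs:
        out, state[key] = update_sum(key, value, state.get(key, 0))
        outputs.append((key, out))
    return outputs, state
```

Update emits a fresh output after every pair, whereas Reduce can only run once all values for the key are available.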

Hadoop / Map-Reduce based
• Nova (Yahoo)
  – “Pipelining” between jobs in a workflow (in large batches)
  – Pass a “delta” to the next job in a workflow

Hadoop / Map-Reduce based
• Pros:
  – Leverage scale-out and fault tolerance
• Cons:
  – Again, not quite real-time

How?
• Common elements in a stream warehouse
  – Temporal partitioning
  – Update propagation / workflow
  – Temporal dimension tables
  – Temporal consistency management

Temporal Partitioning
• The primary partitioning field is the record timestamp
• Stream data is mostly sorted
• Most new data loads into a new partition
  – Avoid rebuilding indices
• Simplified data expiration – roll off oldest partitions
[Diagram: partitions along a time axis, each with data and index; new data arrives at the newest partition]
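
A minimal sketch of timestamp-based partitioning with roll-off. The 5-minute partition width, the retention count, and the class name `PartitionedTable` are assumptions chosen for illustration.

```python
from collections import defaultdict

PARTITION_SECONDS = 300  # assumed partition width: 5 minutes


class PartitionedTable:
    def __init__(self, retention_partitions):
        self.partitions = defaultdict(list)  # partition id -> records
        self.retention = retention_partitions

    def insert(self, timestamp, record):
        # The record timestamp alone decides the partition, so mostly
        # sorted input lands in the newest partition.
        pid = timestamp // PARTITION_SECONDS
        self.partitions[pid].append((timestamp, record))
        return pid

    def expire(self, now):
        # Roll off whole partitions that fall outside the retention window.
        oldest_kept = now // PARTITION_SECONDS - self.retention + 1
        for pid in [p for p in self.partitions if oldest_kept > p]:
            del self.partitions[pid]
```

Expiration drops entire partitions, so no per-record deletes or index rebuilds are needed for the data that remains.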

Update Propagation / Workflow
• Streaming analytics – maintain a system of complex materialized views
• Push new data through base tables to all dependent tables
  – Create new partitions
  – Update existing partitions as needed
[Diagram nodes: Twitter feeds; Active measure; Link util; Customer complaint; Service alerts; Sentiment analysis; Hourly aggregate; Daily aggregate]
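
Pushing new data through dependent tables amounts to walking a dependency graph in topological order. The sketch below assumes Python 3.9+ (for `graphlib`). The edges in `DEPENDS_ON` are hypothetical: the table names come from the diagram, but its arrows did not survive extraction.

```python
from graphlib import TopologicalSorter

# Hypothetical dependencies: derived table -> tables it is computed from.
DEPENDS_ON = {
    "sentiment_analysis": {"twitter_feeds"},
    "hourly_aggregate": {"active_measure", "link_util"},
    "daily_aggregate": {"hourly_aggregate"},
    "service_alerts": {"sentiment_analysis", "hourly_aggregate",
                       "customer_complaint"},
}


def propagation_order(changed_base_tables):
    """Return the derived tables to refresh, sources before dependents."""
    dirty = set(changed_base_tables)
    order = []
    for table in TopologicalSorter(DEPENDS_ON).static_order():
        sources = DEPENDS_ON.get(table, set())
        if sources & dirty:
            dirty.add(table)  # its new partitions must flow further down
            order.append(table)
    return order
```

Only tables downstream of a changed base table are refreshed; for example, new Twitter data would not trigger the hourly or daily aggregates under these assumed edges.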

Temporal Dimension Tables
• Most streaming data describes events
  – Occurs at a point in time, or is a measurement during a well-defined interval
• Some streaming data defines conditions
  – Properties of an entity that endure for a time interval
  – Temporal dimension tables – the timestamp is a valid-time interval
• Pervasive use
  – You can’t evaluate an event without knowing about the environment
  – Link speeds, cell tower locations, power grid organization
• Snapshot tables don’t work
  – Late-arriving data, recomputation, new long-term analyses

Temporal Dimension Table Example

SNMP_BytesTransferred
Ip_address   Timestamp   Bytes_xfered
4.3.2.1      1:05        1,000,000
4.3.2.1      1:10        1,200,000
4.3.2.1      1:15        2,200,000

LinkSpeed
Ip_address   Tlo     Thi    Speed
4.3.2.1      12:15   1:15   1,000,000 B/min
4.3.2.1      1:15    -      2,000,000 B/min

LinkUtilization
Ip_address   Timestamp   Utilization
4.3.2.1      1:05        .2
4.3.2.1      1:10        .24
4.3.2.1      1:15        .22
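
The utilization values can be reproduced by joining each measurement to the link speed that was valid at the measurement time. Two things in this sketch are inferred from the numbers rather than stated on the slide: a 5-minute measurement interval, and half-open validity intervals [Tlo, Thi), so that the 1:15 reading uses the new speed. Times are encoded as minutes after 12:00.

```python
INTERVAL_MIN = 5  # assumed polling interval between SNMP readings

# Times as minutes after 12:00, so 12:15 -> 15 and 1:05 -> 65.
bytes_transferred = [("4.3.2.1", 65, 1_000_000),
                     ("4.3.2.1", 70, 1_200_000),
                     ("4.3.2.1", 75, 2_200_000)]
link_speed = [("4.3.2.1", 15, 75, 1_000_000),    # (ip, Tlo, Thi, B/min)
              ("4.3.2.1", 75, None, 2_000_000)]  # Thi None = still valid


def speed_at(ip, ts):
    """Look up the speed whose valid-time interval [Tlo, Thi) contains ts."""
    for row_ip, tlo, thi, speed in link_speed:
        if row_ip == ip and ts >= tlo and (thi is None or thi > ts):
            return speed
    raise LookupError(f"no link speed for {ip} at {ts}")


link_utilization = [
    (ip, ts, nbytes / (INTERVAL_MIN * speed_at(ip, ts)))
    for ip, ts, nbytes in bytes_transferred
]
```

With a snapshot table holding only the current speed, the first two rows would be computed against 2,000,000 B/min and come out wrong, which is the point of keeping validity intervals.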

Temporal Dimension Tables
• Updates
  – Snapshots of current status, deltas
    • Snapshot windows in StreamInsight
• Compute from the stream
  – Frames – based on a condition of records in a stream
  – Interval punctuation

Optimizations
• DB-toaster
• Multi-version Concurrency Control
• Partition Restructuring
• Partition Revisions
• Temporal Consistency Management
• Scheduling

DB-toaster
• Maintain complex aggregate views over streaming data.
• In-memory architecture: all storage is via hash table.
  – 1TB main memory servers are inexpensive
• Uses novel recursive-delta technique to accelerate maintenance
  – Collection of support views that can significantly reduce update time.
[Diagram: view hierarchy with Join(R,S,T) at the top; Join(S,T), Join(R,T), and Join(R,S) below it; base relations T, S, R at the bottom]
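
The idea of support views can be illustrated on a much smaller case than the three-way join in the diagram. The sketch below maintains COUNT(*) of a two-relation equi-join under inserts, using per-key counts held in hash tables as the support views. It is a simplified illustration of delta maintenance, not DBToaster's generated code, and the class and method names are assumptions.

```python
from collections import defaultdict


class JoinCount:
    """Maintains COUNT(*) of R JOIN S ON R.k = S.k under inserts."""

    def __init__(self):
        self.r_by_key = defaultdict(int)  # support view: rows of R per key
        self.s_by_key = defaultdict(int)  # support view: rows of S per key
        self.count = 0                    # the maintained aggregate view

    def insert_r(self, key):
        # Delta of the join for one new R row = number of matching S rows.
        self.count += self.s_by_key[key]
        self.r_by_key[key] += 1

    def insert_s(self, key):
        # Symmetric delta for a new S row.
        self.count += self.r_by_key[key]
        self.s_by_key[key] += 1
```

Each insert costs one hash lookup and two additions, instead of re-evaluating the join.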

Multi-version Concurrency Control
• MVCC allows queries and updates to proceed concurrently
  – Read isolation
  – Long analytic queries do not block real-time updates
• Single-updater MVCC is cheap and easy
  – Use a directory-swap algorithm
• Encourages use of cloud-friendly write-once files.
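
One way to picture a directory swap is an in-memory analogue: a directory maps partition ids to write-once file names, readers hold a reference to the version they started with, and the single updater publishes a new version by replacing the reference. The class `PartitionDirectory` and the file names are hypothetical; a real system would persist the directory.

```python
import threading


class PartitionDirectory:
    def __init__(self):
        self._current = {}  # partition id -> write-once file name
        self._write_lock = threading.Lock()  # enforces a single updater

    def snapshot(self):
        # Readers take the current directory object and keep using it;
        # a published directory is never modified afterwards.
        return self._current

    def publish(self, changes):
        with self._write_lock:
            new_dir = dict(self._current)  # copy the old version
            new_dir.update(changes)        # point at the newly written files
            self._current = new_dir        # the swap: one reference assignment
```

A long analytic query that took its snapshot before `publish` keeps reading the old files, so it neither blocks nor is disturbed by the update.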

Partition Restructuring
• As data ages, its best representation changes
  – Most recent data: optimize for fast ingest
  – Stable data: optimize for queries
  – Historical data: minimize storage cost
• Restructure partitions as the data ages
  – MVCC allows data maintenance to occur as a non-interfering background task

Partition Size
• New partitions should match the update increment
• Problem: partition explosion
  – 1-minute partitions, 1440 per day, 525,600 per year
• Merge partitions as they age
[Diagram: partitions along a time axis, each with data and index; indexes optional]
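
The effect of merging can be checked with a little arithmetic. The tiered policy below (1-minute partitions for the newest hour, hourly partitions up to one day, daily partitions after that) is an assumed example, not a policy given in the tutorial.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

# Assumed merge policy: (maximum age in minutes, partition width in minutes).
TIERS = [(60, 1), (24 * 60, 60), (math.inf, 24 * 60)]


def count_partitions(history_minutes, tiers):
    """Number of partitions needed to cover the given amount of history."""
    total, covered = 0, 0
    for max_age, width in tiers:
        upper = min(max_age, history_minutes)
        if upper > covered:
            total += math.ceil((upper - covered) / width)
            covered = upper
    return total
```

Without merging, `count_partitions(MINUTES_PER_YEAR, [(math.inf, 1)])` gives the 525,600 partitions from the slide; under the assumed tiers a year of history needs 60 + 23 + 364 = 447.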

Data Layout
• Write-optimized data
  – Row-oriented, lightly indexed, uncompressed
• Read-optimized data
  – Highly indexed, lightly compressed, column storage if beneficial
• Transform as a background task when the data becomes stable
  – Combine with partition merging
• Aggressive compression for archival data
• Implementations in SAP HANA and Vertica

Partition Revisions
• Some data always arrives late
• Problem: need to recompute existing partitions
  – Disk prefers sequential access
  – Write-once files: need to recompute the entire partition
• Solution: chain updates to the partition
  – Value of the partition is the sum of the primary (anchor) contents plus the updates (revisions).
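
A minimal sketch of an anchor with chained revisions, for a partition that stores a single aggregate such as a count. The class name, the `compact` step, and the numbers in the usage note are illustrative assumptions.

```python
class RevisedPartition:
    """An aggregate partition stored as a write-once anchor plus revisions."""

    def __init__(self, anchor):
        self.anchor = anchor
        self.revisions = []

    def add_late_data(self, delta):
        # Append a small revision instead of rewriting the anchor.
        self.revisions.append(delta)

    def value(self):
        # Partition value = anchor contents plus all chained revisions.
        return self.anchor + sum(self.revisions)

    def compact(self):
        # Background task: fold the revisions into a fresh anchor.
        self.anchor, self.revisions = self.value(), []
```

For example, a partition created with an anchor of 1000 that later receives a late delta of 25 reports 1025, both before and after compaction.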

Partition Revisions
• Problem: Don’t change old partitions, but what if data arrives out-of-order?
• Solution: Overflow chains (Truviso)
[Diagram: Packet_Stream feeding the Packets table along a time axis; partitions shown as an anchor with chained revisions]

Data Layout
• Works with “raw” and derived/aggregated data
• E.g., packet counts:
[Diagram: Packet_Stream, Packets, and Packet_counts along a time axis; Packet_counts values 1000, 1200, 1150, 1400, plus a separate value of 25]

Temporal Consistency Management
• Traditional notion of consistency: a snapshot of the system.
• Doesn’t apply in a stream warehouse
  – Late-arriving data is common
  – Different data sources have different time lags and different likelihoods of late data
• Instead, label data by its degree of completeness

[Figure: Number of windows per package; y-axis: Number of Windows (0 to 12); x-axis: Time in seconds (0 to 600,000)]

Query Stability
• How do I know when the data is stable enough to query?
• What is stable enough?
  – Data will never change.
  – Data won’t change much.
  – I’ll take whatever is there.

Consistency Levels
• Punctuations on partitions that indicate completeness.
• Example (simple) collection of consistency levels
  – Open: The partition should have some data in it.
  – Closed: The partition will not change.
  – Complete: The partition will not change, and all data has been received.
• Closed is a guess
  – WeaklyClosed, StronglyClosed
• Infer at base tables, propagate inferences to materialized views.
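
One possible encoding of these levels is an ordered enumeration, so that "at least Closed" becomes a comparison. Two parts of this sketch are assumptions: the slide names WeaklyClosed and StronglyClosed without defining them, and the propagation rule (a view partition takes the weakest level among its source partitions) is only one plausible choice.

```python
from enum import IntEnum


class Consistency(IntEnum):
    OPEN = 1             # partition should have some data in it
    WEAKLY_CLOSED = 2    # assumed meaning: a low-confidence guess at Closed
    STRONGLY_CLOSED = 3  # assumed meaning: a high-confidence guess at Closed
    COMPLETE = 4         # will not change, and all data has been received


def derived_level(source_levels):
    """Assumed rule: a view partition is only as complete as its weakest source."""
    return min(source_levels)


def queryable(level, required):
    """True if a partition meets the stability an application asks for."""
    return level >= required
```

An alerting application might accept `OPEN` partitions, while a billing report would wait for `COMPLETE`.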

Workflow Scheduling
• Need to limit resource use to avoid thrashing.
  – Hundreds of tables to update, limited (CPU, memory, cache, network) resources.
  – Exclusive resources: non-preemptive scheduling.
• Ensure that high-priority jobs can execute
  – Real-time scheduling
• Measures of lateness:
  – Staleness: difference between current time and most recent data.
  – Tardiness: the difference between a task deadline and task completion.
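
The two lateness measures can be written as small functions. Clamping tardiness at zero for tasks that finish early follows the usual real-time scheduling convention and is an assumption here; the slide only says "difference".

```python
def staleness(now, most_recent_data_time):
    """Current time minus the timestamp of the newest data in the table."""
    return now - most_recent_data_time


def tardiness(deadline, completion_time):
    """How far past its deadline a task finished; 0 if it finished on time."""
    return max(0, completion_time - deadline)
```

Staleness is a property of a table at a moment in time, while tardiness is a property of one completed update task.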

Workflow Scheduling
• Staleness function: difference between current time and most recent data loaded
• Hierarchies of views with highly varying execution times.
[Figure: time axis marked 9:30, 9:45, 10:00, 10:15]
[Diagram nodes: Twitter feeds; Active measure; Link util; Customer complaint; Service alerts; Sentiment analysis; Hourly aggregate; Daily aggregate; annotations: “fast, frequent” and “slow, infrequent”]
![Page 58: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$](https://reader031.vdocuments.site/reader031/viewer/2022022514/5af10ccf7f8b9a8c308dff76/html5/thumbnails/58.jpg)
Bounded Tardiness Scheduling
• Bound on the maximum tardiness of any task in a task set.
• If update jobs are scheduled regularly, bounded tardiness => bounded staleness
• Most real-time scheduling algorithms have bounded tardiness
  – EDF, minimum slack, etc.
  – There can be differences in the tardiness bounds
• Pick a heuristic that works well
  – E.g. pick the task that provides the largest marginal reduction in staleness.
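One way to read the "largest marginal reduction in staleness" heuristic is as a greedy choice among pending update jobs. The sketch below assumes that "marginal" means staleness removed per unit of execution time; that interpretation, the `UpdateJob` fields, and the example numbers are assumptions made for illustration rather than the tutorial's definition.

```python
from dataclasses import dataclass


@dataclass
class UpdateJob:
    table: str
    staleness_reduction: float  # minutes of staleness removed if the job runs
    exec_time: float            # estimated run time in minutes (must be > 0)


def pick_next(jobs):
    """Return the pending job with the largest staleness reduction per minute of work."""
    return max(jobs, key=lambda j: j.staleness_reduction / j.exec_time, default=None)


pending = [
    UpdateJob("link_util", staleness_reduction=5.0, exec_time=0.5),
    UpdateJob("daily_aggregate", staleness_reduction=60.0, exec_time=30.0),
]
chosen = pick_next(pending)
```

Here the short `link_util` update wins because it removes 10 minutes of staleness per minute of work, against 2 for the daily aggregate.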
![Page 59: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$](https://reader031.vdocuments.site/reader031/viewer/2022022514/5af10ccf7f8b9a8c308dff76/html5/thumbnails/59.jpg)
Track Scheduling
• Complication: large differences in task execution time
  – Update a base table with 1 minute of data vs. compute a daily aggregate.
• Tardiness bounds depend on the largest task execution times.
  – Long tasks block short critical tasks.
• Track Scheduling:
  – Partition tasks by execution time.
  – Restrict the number of long tasks that can execute concurrently
  – Reserve resources for short critical tasks
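The three track-scheduling rules above can be sketched as a simple admission check. Everything concrete here is an assumption: a two-track split with a 5-minute `threshold`, `total_slots` worker slots, and a cap of `max_long` concurrent long tasks. A real scheduler would also order the pending tasks by priority or deadline.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    exec_time: float  # estimated run time in minutes


def assign_track(task, threshold=5.0):
    """Partition tasks by execution time into a 'short' and a 'long' track."""
    return "long" if task.exec_time > threshold else "short"


def admit(pending, running, total_slots=4, max_long=1, threshold=5.0):
    """Choose pending tasks to start now without exceeding the slot limits."""
    free = total_slots - len(running)
    long_running = sum(1 for t in running if assign_track(t, threshold) == "long")
    started = []
    for task in pending:
        if free <= 0:
            break
        if assign_track(task, threshold) == "long":
            if long_running >= max_long:
                continue  # long track is full; keep the slot for short tasks
            long_running += 1
        started.append(task)
        free -= 1
    return started


pending = [Task("daily_aggregate", 30), Task("hourly_aggregate", 10),
           Task("link_util", 0.5), Task("service_alerts", 0.2)]
started = [t.name for t in admit(pending, running=[])]
```

With these settings only one of the two long aggregates is started, so the remaining slots stay available to the short updates.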
![Page 60: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$](https://reader031.vdocuments.site/reader031/viewer/2022022514/5af10ccf7f8b9a8c308dff76/html5/thumbnails/60.jpg)
Transient Overload
• Common source of overload: catch-up processing.
  – A feed breaks for a day, then is restored.
  – The source schema changes, requiring a pause in processing to change update procedures.
  – New tables load a long history
• Update Chopping
  – Break a (temporally) long update into short segments.
• Update period adjustment
  – Decrease the period of backlogged tables to use up (but not oversubscribe) available resources.
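Update chopping amounts to splitting one long time range into consecutive short ranges that can be scheduled separately. A minimal sketch, assuming the backlog is described by a half-open interval `[start, end)` and a fixed segment length; the function name and the example dates are hypothetical.

```python
from datetime import datetime, timedelta


def chop_update(start, end, segment):
    """Split the half-open interval [start, end) into consecutive segments."""
    if segment <= timedelta(0):
        raise ValueError("segment must be positive")
    pieces = []
    cursor = start
    while cursor < end:
        nxt = min(cursor + segment, end)  # the last piece may be shorter
        pieces.append((cursor, nxt))
        cursor = nxt
    return pieces


# A feed that was broken for a day is caught up in 6-hour segments.
backlog = chop_update(datetime(2013, 6, 24), datetime(2013, 6, 25), timedelta(hours=6))
```

Each returned pair can then be submitted as its own update task, so a day of catch-up work does not occupy a worker as one long job.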
![Page 61: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$](https://reader031.vdocuments.site/reader031/viewer/2022022514/5af10ccf7f8b9a8c308dff76/html5/thumbnails/61.jpg)
Data Stream Quality
• New data quality problems
  – Systematic errors in machine-generated streams
  – Correlated glitches
  – Missing/delayed data
• New semantics
  – Relational data: keys, FDs, CFDs
  – Streaming/temporal data: order, arrival frequency (sequential dependencies), conservation laws
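As an illustration of an arrival-frequency rule, the sketch below checks that the gaps between consecutive measurement times fall within an expected range, in the spirit of a sequential dependency. It is a simplified stand-in rather than the formal definition from the literature; the function name and the 50 to 70 second bounds are assumptions.

```python
def gap_violations(timestamps, min_gap, max_gap):
    """Return consecutive timestamp pairs whose gap lies outside [min_gap, max_gap]."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:])
            if not (min_gap <= b - a <= max_gap)]


# Polls are expected roughly every 60 seconds; the jump from 120 to 300
# suggests missing or delayed data.
violations = gap_violations([0, 60, 120, 300, 360], min_gap=50, max_gap=70)
```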
![Page 62: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$](https://reader031.vdocuments.site/reader031/viewer/2022022514/5af10ccf7f8b9a8c308dff76/html5/thumbnails/62.jpg)
Open Problems
• Hybrid system architectures and cross-system optimizations
• Big and fast analytics as a cloud service
• Big/fast data mining
• Data stream quality/profiling
• Complexity management and administration of a big/fast data management system
![Page 63: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$](https://reader031.vdocuments.site/reader031/viewer/2022022514/5af10ccf7f8b9a8c308dff76/html5/thumbnails/63.jpg)
Bibliography
• Applications
  – Smart Grid
    • http://energy.gov/oe/technology-development/smart-grid
  – Semiconductor Manufacturing
    • http://www.appliedmaterials.com/technologies/library/techedge-prizm
    • http://www.extremetech.com/extreme/155588-applied-materials-designs-tools-to-leverage-big-data-and-build-better-chips
• Networking Applications
  – C. Kalmanek et al.: Darkstar: Using Exploratory Data Mining to Raise the Bar on Network Reliability and Performance. DRCN 2009
  – H. Yan, A. Flavel, Z. Ge, A. Gerber, D. Massey, C. Papadopoulos, H. Shah, J. Yates: Argus: End-to-end service anomaly detection and localization from an ISP's point of view. INFOCOM 2012: 2756-2760
![Page 64: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$](https://reader031.vdocuments.site/reader031/viewer/2022022514/5af10ccf7f8b9a8c308dff76/html5/thumbnails/64.jpg)
Bibliography
• DSMS-based systems
  – D. J. Abadi, D. Carney, U. Çetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, S. B. Zdonik: Aurora: a new model and architecture for data stream management. VLDB J. 12(2): 120-139 (2003)
  – M. Balazinska, Y. C. Kwon, N. Kuchta, D. Lee: Moirae: History-Enhanced Monitoring. CIDR 2007: 375-386
  – N. Dindar, P. M. Fischer, M. Soner, N. Tatbul: Efficiently correlating complex events over live and archived data streams. DEBS 2011: 243-254
![Page 65: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$](https://reader031.vdocuments.site/reader031/viewer/2022022514/5af10ccf7f8b9a8c308dff76/html5/thumbnails/65.jpg)
Bibliography
• DBMS-based systems
  – Truviso: S. Krishnamurthy, M. J. Franklin, J. Davis, D. Farina, P. Golovko, A. Li, N. Thombre: Continuous analytics over discontinuous streams. SIGMOD 2010: 1081-1092
  – DataCell: E. Liarou, R. Goncalves, S. Idreos: Exploiting the power of relational databases for efficient stream processing. EDBT 2009: 323-334
  – DataDepot: L. Golab, T. Johnson, J. S. Seidel, V. Shkapenyuk: Stream warehousing with DataDepot. SIGMOD Conference 2009: 847-854
![Page 66: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$](https://reader031.vdocuments.site/reader031/viewer/2022022514/5af10ccf7f8b9a8c308dff76/html5/thumbnails/66.jpg)
Bibliography
• Hadoop / Map-Reduce Based Systems
  – T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, R. Sears: MapReduce Online. NSDI 2010: 313-328
  – W. Lam, L. Liu, S. T. S. Prasad, A. Rajaraman, Z. Vacheri, A. Doan: Muppet: MapReduce-Style Processing of Fast Data. PVLDB 5(12): 1814-1825 (2012)
  – C. Olston, G. Chiou, L. Chitnis, F. Liu, Y. Han, M. Larsson, A. Neumann, V. B. N. Rao, V. Sankarasubramanian, S. Seth, C. Tian, T. ZiCornell, X. Wang: Nova: continuous Pig/Hadoop workflows. SIGMOD Conference 2011: 1081-1090
  – B. Li, E. Mazur, Y. Diao, A. McGregor, P. J. Shenoy: SCALLA: A Platform for Scalable One-Pass Analytics Using MapReduce. ACM Trans. Database Syst. 37(4): 27 (2012)
  – P. Bhatotia, A. Wieder, R. Rodrigues, U. A. Acar, R. Pasquin: Incoop: MapReduce for incremental computations. SoCC 2011: 7
![Page 67: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$](https://reader031.vdocuments.site/reader031/viewer/2022022514/5af10ccf7f8b9a8c308dff76/html5/thumbnails/67.jpg)
Bibliography
• Late Arriving Data
  – S. Krishnamurthy et al.: Continuous analytics over discontinuous streams. SIGMOD 2010: 1081-1092
  – J. Li, K. Tufte, V. Shkapenyuk, V. Papadimos, T. Johnson, D. Maier: Out-of-order processing: a new architecture for high-performance stream systems. PVLDB 1(1): 274-288 (2008)
  – Lukasz Golab, Theodore Johnson: Consistency in a Stream Warehouse. CIDR 2011: 114-122
![Page 68: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$](https://reader031.vdocuments.site/reader031/viewer/2022022514/5af10ccf7f8b9a8c308dff76/html5/thumbnails/68.jpg)
Bibliography
• Update Propagation / Workflow
  – T. Johnson, V. Shkapenyuk: Update Propagation in a Streaming Warehouse. SSDBM 2011: 129-149
  – C. Olston et al.: Nova: continuous Pig/Hadoop workflows. SIGMOD Conference 2011: 1081-1090
• Temporal Dimension Tables
  – M. Li, M. Mani, E. A. Rundensteiner, D. Wang, T. Lin: Interval Event Stream Processing. DEBS 2008
  – David Maier, Michael Grossniklaus, Sharmadha Moorthy, Kristin Tufte: Capturing episodes: may the frame be with you. DEBS 2012: 1-11
  – Snapshot windows: http://msdn.microsoft.com/en-us/library/ff518550.aspx
![Page 69: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$](https://reader031.vdocuments.site/reader031/viewer/2022022514/5af10ccf7f8b9a8c308dff76/html5/thumbnails/69.jpg)
Bibliography
• Multi-Version Concurrency Control
  – D. Quass, J. Widom: On-Line Warehouse View Maintenance. SIGMOD Conference 1997: 393-404
  – V. Sikka, F. Färber, W. Lehner, S. K. Cha, T. Peh, Christof B.: Efficient transaction processing in SAP HANA database: the end of a column store myth. SIGMOD Conference 2012: 731-742
• Data Partition Transformations
  – V. Sikka, F. Färber, W. Lehner, S. K. Cha, T. Peh, Christof B.: Efficient transaction processing in SAP HANA database: the end of a column store myth. SIGMOD Conference 2012: 731-742
  – A. Lamb, M. Fuller, R. Varadarajan, N. Tran, B. Vandiver, L. Doshi, C. Bear: The Vertica Analytic Database: C-Store 7 Years Later. PVLDB 5(12): 1790-1801 (2012)
![Page 70: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$](https://reader031.vdocuments.site/reader031/viewer/2022022514/5af10ccf7f8b9a8c308dff76/html5/thumbnails/70.jpg)
Bibliography
• DB Toaster
  – Y. Ahmad, O. Kennedy, C. Koch, M. Nikolic: DBToaster: Higher-order Delta Processing for Dynamic, Frequently Fresh Views. Proc. VLDB 2012
• Partition Revisions
  – S. Krishnamurthy, M. J. Franklin, J. Davis, D. Farina, P. Golovko, A. Li, N. Thombre: Continuous analytics over discontinuous streams. SIGMOD 2010: 1081-1092
• Temporal Consistency Management
  – Lukasz Golab, Theodore Johnson: Consistency in a Stream Warehouse. CIDR 2011: 114-122
• Bounded Tardiness Scheduling
  – H. Leontyev, J. H. Anderson: Generalized tardiness bounds for global multiprocessor scheduling. Real-Time Systems 44(1-3): 26-71 (2010)
![Page 71: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$](https://reader031.vdocuments.site/reader031/viewer/2022022514/5af10ccf7f8b9a8c308dff76/html5/thumbnails/71.jpg)
Bibliography
• Stream Warehouse Scheduling
  – Lukasz Golab, Theodore Johnson, Vladislav Shkapenyuk: Scalable Scheduling of Updates in Streaming Data Warehouses. IEEE Trans. Knowl. Data Eng. 24(6): 1092-1105 (2012)
  – S. Guirguis, M. A. Sharaf, P. K. Chrysanthis, A. Labrinidis, K. Pruhs: Adaptive Scheduling of Web Transactions. Proc. 2009 Intl. Conf. on Data Engineering
![Page 72: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$](https://reader031.vdocuments.site/reader031/viewer/2022022514/5af10ccf7f8b9a8c308dff76/html5/thumbnails/72.jpg)
Bibliography
• Data stream quality
  – Lukasz Golab, Howard J. Karloff, Flip Korn, Avishek Saha, Divesh Srivastava: Sequential Dependencies. PVLDB 2(1): 574-585 (2009)
  – Lukasz Golab, Howard J. Karloff, Flip Korn, Barna Saha, Divesh Srivastava: Discovering Conservation Rules. ICDE 2012: 738-749
  – Tamraparni Dasu, Ji Meng Loh: Statistical Distortion: Consequences of Data Cleaning. PVLDB 5(11): 1674-1683 (2012)
  – Lukasz Golab: Data Warehouse Quality: Summary and Outlook. In: S. Sadiq (ed.), Handbook of Data Quality - Research and Practice, Springer-Verlag Berlin Heidelberg 2013