Dancing with Stream Processing
07/10/16 MSc. Distributed Systems 2
● Motivation
● Event Stream Processing
– Pub/sub, CEP, “Buzzwords”
– Stream processing engines
● Spark Streaming, Storm, etc.
● Graph Stream Processing
– Theory... {“sketching”, “spanners”, “sparsifiers”}
– Challenges
● Discussion !!
Lightning Talk
Motivation
The Streaming Era
● Today, most data is continuously produced
- user activity logs, web logs, sensors, database transactions, …
● The common approach to analyzing such data so far
- Record the data stream to stable storage (DBMS, HDFS, …)
- Periodically analyze the data with a batch processing engine (DBMS, MapReduce, …)
● Stream processing engines analyze data while it arrives
The Streaming Era (Contd.)
● Decreases the overall latency to obtain results
- No need to persist data in stable storage
- No periodic batch analysis jobs
● Simplifies the data infrastructure
- Fewer moving parts to be maintained and coordinated
● Makes the time dimension of data explicit
- Each event has a timestamp
- Data can be processed based on timestamps
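To make "processed based on timestamps" concrete, here is a minimal Python sketch of a tumbling event-time window. The function name, the `(timestamp, key)` event shape, and the 10-second window are assumptions chosen for illustration and are not tied to any particular engine.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key), using each event's own timestamp."""
    counts = defaultdict(int)
    for timestamp, key in events:
        # The window is derived from the event's timestamp, not its arrival order.
        window_start = timestamp - (timestamp % window_seconds)
        counts[(window_start, key)] += 1
    return dict(counts)

# The event with timestamp 7 arrives late but still lands in window 0.
events = [(0, "click"), (3, "view"), (12, "click"), (7, "click")]
result = tumbling_window_counts(events, 10)
```

Because the window is computed from the timestamp carried by the event, an out-of-order arrival is still assigned to the window it belongs to.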
Event Streams [Immutable]
Web Page Event
Wikipedia Page Update Event
LinkedIn User Update Event
Middleware
Point-to-point communication
● Direct coupling, strict identity, time coupling
● Not good for volatile environments
● Not a good way to communicate with several participants
Indirect communication
● Space uncoupling, anonymity, time uncoupling
● Independent lifetimes between parties
● Through a persistent communication channel
Taxonomy
Indirect Communication
● Communication based
– Group communication
– Message Queues
– Publish/subscribe
● State based
– Tuple spaces
– Distributed Shared Memory
Pub/Sub Messaging Pattern
Topic-based
- Each event belongs to a number of topics (e.g. “music”, “sport”)
- Users subscribe to topics and receive all relevant events
Content-based
- Users subscribe to the actual content of the events or a structured summary of it
- More expressive
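The difference between the two subscription styles can be sketched in a few lines of Python. The event fields (`topics`, `artist`, `price`) and both helper names are hypothetical; the point is only that a topic subscription is a set lookup while a content subscription is an arbitrary predicate over the event.

```python
def topic_match(subscribed_topics, event):
    """Topic-based: match if the event carries any subscribed topic."""
    return bool(subscribed_topics & set(event["topics"]))

def content_match(predicate, event):
    """Content-based: match if a predicate over the event's content holds."""
    return predicate(event)

event = {"topics": ["music"], "artist": "Miles Davis", "price": 12.0}

by_topic = topic_match({"music", "sport"}, event)
cheap_only = content_match(lambda e: e["price"] < 10, event)
jazz_deal = content_match(
    lambda e: e["artist"] == "Miles Davis" and e["price"] < 15, event
)
```

The predicate form is what makes content-based subscriptions more expressive, at the cost of being harder to index than a plain topic name.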
Pub/Sub Activities
● Subscription processing: indexing and storing subscriptions.
● Event Stream Processing (ESP), pub/sub approach: upon arrival of events, access the subscription index and identify all matched subscriptions.
● Event delivery: deliver each event to the clients with matched subscriptions.
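The three activities can be sketched as a tiny in-memory, topic-based broker in Python. The `Broker` class, its method names, and the per-subscriber inbox lists are assumptions made for illustration; a real system would add persistence, networking, and content-based indexes.

```python
from collections import defaultdict

class Broker:
    def __init__(self):
        self.index = defaultdict(set)     # subscription index: topic -> subscribers
        self.inboxes = defaultdict(list)  # delivered events per subscriber

    def subscribe(self, subscriber, topic):
        """Subscription processing: store the subscription in the index."""
        self.index[topic].add(subscriber)

    def publish(self, topics, payload):
        """Match the event against the index, then deliver it."""
        matched = set()
        for topic in topics:
            matched |= self.index.get(topic, set())
        for subscriber in matched:
            self.inboxes[subscriber].append(payload)
        return matched

broker = Broker()
broker.subscribe("alice", "music")
broker.subscribe("bob", "sport")
matched = broker.publish(["music"], "new album")
```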
Event Stream Processing (ESP)
Wikipedia
Today's world...
Pub/sub ≈ ESP ≈
“Buzzwords”
https://martin.kleppmann.com/2015/01/29/stream-processing-event-sourcing-reactive-cep.html
Complex Event Processing (CEP)
● A set of event processing principles
● Match patterns of events
– Comparable to SQL queries
– High-level query language
● Cloud of causally related events
– POSET (Partially Ordered Set of Events)
Complex Event Processing (CEP)
● Some CEP examples:
– When 2 transactions happen on an account from radically different geographic locations within a certain time window, then report as potential fraud.
– When a gold customer's trouble ticket is not resolved within 1 hour, then escalate.
– When a team meeting request overlaps with my lunch break, then deny the team meeting and demote the meeting organizer.
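The first example above can be sketched as a hand-written pattern matcher in Python. This is only an illustration under stated assumptions: transactions arrive in timestamp order as `(timestamp, account, location)` tuples, and "radically different" is simplified to "a different location string". A CEP engine would express the same rule in its query language instead.

```python
from collections import defaultdict, deque

def detect_fraud(transactions, window_seconds):
    """Flag a transaction if the same account was used from another
    location within the last window_seconds."""
    recent = defaultdict(deque)  # account -> recent (timestamp, location) pairs
    alerts = []
    for timestamp, account, location in transactions:
        history = recent[account]
        # Drop transactions that have fallen out of the time window.
        while history and timestamp - history[0][0] > window_seconds:
            history.popleft()
        if any(previous != location for _, previous in history):
            alerts.append((account, timestamp))
        history.append((timestamp, location))
    return alerts

transactions = [
    (0, "acc1", "Lisbon"),
    (30, "acc1", "Tokyo"),
    (40, "acc2", "Paris"),
    (5000, "acc1", "Lisbon"),
]
alerts = detect_fraud(transactions, 60)
```

Only the second transaction on `acc1` is reported; the last one falls outside the 60-second window, so no earlier location remains to compare against.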
ESP and CEP [Timeline]
● Before 2000 (1989 - 1995): Rapide
● 2002: Aurora
● 2003: Medusa
● 2005: Borealis
● Also on the timeline, through 2016: STREAM, TelegraphCQ, Esper, Apama, StreamBase, SQLStream, WSO2 CEP
ESP vs. CEP
http://www.slideshare.net/TimBassCEP/mythbusters-event-stream-processing-v-complex-event-processing-presentation
Today's world...
ESP ≈ CEP ≈
Laundry of “Buzzwords”
● Actor Frameworks
– Better mechanism to handle concurrency
– E.g. Akka, Orleans and Erlang OTP
● “Reactive”
– Language semantics for bringing event streams to the user interface
– Responsive, Resilient, Elastic and Message Driven
– E.g. Data flow languages, Functional reactive programming
● Event Sourcing
● Change Data Capture (CDC)
Analytics ≈ Stream Transformations
https://martin.kleppmann.com/2015/01/29/stream-processing-event-sourcing-reactive-cep.html
Target
● Better Scalability
● High Throughput
● Low latency
● Powerful semantics
● Easy integration
via Low Level Stream Processing Frameworks !!
Spark Streaming
● General purpose computing engine to run batch, interactive and streaming jobs
● Based on Resilient Distributed Datasets (RDD)
– Restricted form of distributed shared memory
– Immutable
– Can only be built through deterministic transformations
● Efficient fault recovery using lineage graph
– Recompute lost partitions on failure
– No cost if nothing fails
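The lineage idea can be illustrated with a toy Python class. This is not Spark's API: `LineageDataset`, `lose`, and the single-partition simplification are assumptions made for the sketch. Each dataset remembers only its parent and the deterministic transformation that built it, so lost results can be recomputed on demand.

```python
class LineageDataset:
    """Toy single-partition dataset that remembers how it was derived."""

    def __init__(self, source=None, parent=None, transform=None):
        self._source = source        # raw input, only set on the root dataset
        self._parent = parent        # lineage: the dataset this one derives from
        self._transform = transform  # deterministic function over the parent's rows
        self._cache = None           # materialized rows, may be lost

    def map(self, fn):
        return LineageDataset(parent=self, transform=lambda rows: [fn(r) for r in rows])

    def filter(self, predicate):
        return LineageDataset(parent=self, transform=lambda rows: [r for r in rows if predicate(r)])

    def collect(self):
        if self._cache is None:  # missing or lost: rebuild from lineage
            if self._parent is None:
                self._cache = list(self._source)
            else:
                self._cache = self._transform(self._parent.collect())
        return list(self._cache)

    def lose(self):
        """Simulate losing the materialized data on a failed node."""
        self._cache = None

base = LineageDataset(source=[1, 2, 3, 4])
big_squares = base.map(lambda x: x * x).filter(lambda x: x > 4)
before = big_squares.collect()
big_squares.lose()
after = big_squares.collect()  # recomputed from the recorded transformations
```

Nothing is replicated or checkpointed up front, which is the sense in which recovery costs nothing when no failure occurs.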
Spark Streaming (Contd.) [Key concepts]
● DStream – sequence of RDDs representing a stream of data
– HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets
● Transformations – modify data from one DStream to another
– Standard RDD operations – map, countByValue, reduce, join, …
– Stateful operations – window, countByValueAndWindow, …
● Output Operations – send data to external entity
– saveAsHadoopFiles – saves to HDFS
– foreach – do anything with each batch of results
Spark Streaming (Contd.)
● Run a streaming computation as a series of very small, deterministic batch jobs
– Chop up the live stream into batches of X seconds
– Spark treats each batch of data as RDDs and processes them using RDD operations
– Finally, the processed results of the RDD operations are returned in batches
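The micro-batch model described above can be mimicked in plain Python. This sketch does not use Spark; the function names are hypothetical, and the timestamp attached to each event stands in for its arrival time. The stream is chopped into fixed intervals and an ordinary batch computation (a count by value here) runs on each interval.

```python
from collections import Counter

def micro_batches(events, batch_seconds):
    """Chop (timestamp, value) events into consecutive batches of batch_seconds."""
    batches = {}
    for timestamp, value in events:
        batches.setdefault(int(timestamp // batch_seconds), []).append(value)
    if not batches:
        return []
    # Emit empty batches too, so every interval produces a result.
    return [batches.get(i, []) for i in range(min(batches), max(batches) + 1)]

def process_stream(events, batch_seconds):
    """Run the same deterministic batch job on every micro-batch."""
    return [Counter(batch) for batch in micro_batches(events, batch_seconds)]

events = [(0.2, "a"), (0.9, "b"), (1.1, "a"), (3.5, "a")]
results = process_stream(events, 1)
```

Each entry in `results` is the output of one small batch job, so results are produced once per interval rather than once per event.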
Berkeley Data Stack
Spark 2.0 is coming !!
Apache Storm [Key concepts]
● Tuple
– Core Unit of Data
– Immutable Set of Key/Value Pairs
● Spouts
– Source of Streams
– Wraps a streaming data source and emits Tuples
● Bolts
– Core functions of a streaming computation
– Receive tuples and do stuff
– Optionally emit additional tuples
Apache Storm [Key concepts]
● Topology
– DAG of Spouts and Bolts
– Data Flow Representation
– Streaming Computation
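The spout, bolt, and topology concepts can be sketched with Python generators. This is not the Storm API; all function names are hypothetical, tuples are modeled as small dicts, and the "topology" is a simple linear chain (a degenerate DAG) running in one process without parallelism.

```python
def sentence_spout(sentences):
    """Spout: wraps a data source and emits tuples."""
    for sentence in sentences:
        yield {"sentence": sentence}

def split_bolt(tuples):
    """Bolt: receives sentence tuples and emits one tuple per word."""
    for item in tuples:
        for word in item["sentence"].split():
            yield {"word": word}

def count_bolt(tuples):
    """Bolt: keeps running counts and emits the updated count for each word."""
    counts = {}
    for item in tuples:
        word = item["word"]
        counts[word] = counts.get(word, 0) + 1
        yield {"word": word, "count": counts[word]}

def run_topology(sentences):
    """Topology: wire spout -> split bolt -> count bolt."""
    return list(count_bolt(split_bolt(sentence_spout(sentences))))

output = run_topology(["to be", "or not to be"])
```

Each stage consumes tuples as they are produced, so the data flows through the chain one tuple at a time rather than batch by batch.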
Apache Storm [Physical View]
Twitter introduces Heron !!
[Storm's successor]
Stream Processing Engines
Many More !!!
Hidden computation paradigm
via pipelining !!
Pipelining ≈ Task Execution
https://martin.kleppmann.com/unix
Let's build the concept again...
Linux pipelining in modern middleware...
https://martin.kleppmann.com/unix
Spark, Storm, Samza, Flink Etc.
https://martin.kleppmann.com/unix
Pub/sub pitch
https://martin.kleppmann.com/unix
Streaming Machine Learning
● By using a programming abstraction for distributed streaming
– Apache SAMOA
Graph Stream Processing
Referred Author: Vasia Kalavri, KTH
https://berlinbuzzwords.de/sites/berlinbuzzwords.de/files/media/documents/buzzwords-kalavri.pdf
Static Graph Processing
● Load: read the graph from disk and partition it in memory
● Compute: read and mutate the graph state
● Store: write the final graph state back to disk
Static Graph Processing [Drawbacks]
● It is slow
– wait until the computation is over before you see any result
– pre-processing and partitioning
● It is expensive
– lots of memory and CPU required in order to scale
● It requires re-computation for graph changes
– no efficient way to deal with updates
Streaming Graph Processing
We consume events in real-time
● Get results faster
– No need to wait for the job to finish
– Sometimes, early approximations are better than late exact answers
● Get results continuously
– Process unbounded number of events
Real-world scenarios
● Targeted Advertisement
– Finding Strongly Connected Components in a social network graph
– Targeted chain of advertisement on detected communities
[Figure: example social graph. Jane knows Joe; Jane likes #Tesla; Joe posts #Tesla; Peter checks in at Taphouse; John subscribes; ad nodes "Self driving cars" and "Dinner Offer" are attached to the detected communities.]
Streaming Graph Processing [Challenges]
● Maintain the graph structure
– How to apply state updates efficiently?
● Result updates
– Re-run the analysis for each event?
– Design an incremental algorithm?
– Run separate instances on multiple snapshots?
● How to preserve graph properties?
– Natural behavior?
Streaming Graph Processing [Current Research]
Each event is an edge addition
● Jane knows Joe
● Jane likes #Tesla
● Joe posts #Tesla
● Peter checks-in TapHouse
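Treating each event as an edge addition means the graph state is updated incrementally rather than reloaded. A minimal Python sketch, assuming an undirected in-memory adjacency structure (the `StreamingGraph` class and its methods are hypothetical), fed with the four events from the slide:

```python
from collections import defaultdict

class StreamingGraph:
    """Graph state that is updated one edge event at a time."""

    def __init__(self):
        self.neighbors = defaultdict(set)  # vertex -> adjacent vertices
        self.labels = {}                   # (source, target) -> edge label

    def add_edge(self, source, label, target):
        # Each event touches only the two endpoints; nothing is recomputed.
        self.neighbors[source].add(target)
        self.neighbors[target].add(source)
        self.labels[(source, target)] = label

    def degree(self, vertex):
        return len(self.neighbors.get(vertex, ()))

graph = StreamingGraph()
edge_stream = [
    ("Jane", "knows", "Joe"),
    ("Jane", "likes", "#Tesla"),
    ("Joe", "posts", "#Tesla"),
    ("Peter", "checks-in", "TapHouse"),
]
for source, label, target in edge_stream:
    graph.add_edge(source, label, target)
```

A query such as `graph.degree("Jane")` can be answered at any point in the stream, using only the edges seen so far.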
Dynamic Graph Processing
● Instead of analyzing the whole graph
– Analyze its properties by preserving them continuously
● Connectivity or Distance (spanners)
● Graph cut estimation (sparsifiers)
● Neighborhood or homomorphic properties (sketches)
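As one illustration of preserving a property instead of the whole graph, the following Python sketch builds a spanner from an edge stream using a simple greedy rule: an edge is kept only if its endpoints are not already within `stretch` hops in the graph kept so far. This is a textbook-style construction for unweighted, undirected graphs, chosen here for illustration; it is not claimed to be the method used in the referenced work, and the function names are hypothetical.

```python
from collections import defaultdict, deque

def within_distance(adjacency, start, goal, limit):
    """Breadth-first search: is goal reachable from start in at most limit hops?"""
    if start == goal:
        return True
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        vertex, distance = frontier.popleft()
        if distance == limit:
            continue  # neighbors of this vertex would exceed the hop limit
        for neighbor in adjacency.get(vertex, ()):
            if neighbor == goal:
                return True
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, distance + 1))
    return False

def stream_spanner(edges, stretch):
    """Keep an edge only if its endpoints are farther apart than stretch."""
    adjacency = defaultdict(set)
    kept = []
    for u, v in edges:
        if not within_distance(adjacency, u, v, stretch):
            adjacency[u].add(v)
            adjacency[v].add(u)
            kept.append((u, v))
    return kept

edges = [(1, 2), (2, 3), (3, 1), (3, 4)]
spanner = stream_spanner(edges, 2)
```

The edge `(3, 1)` is dropped because 3 and 1 are already two hops apart, so any distance in the original graph grows by at most a factor of `stretch` in the kept subgraph, while connectivity is preserved.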
Dynamic Graph Processing (Contd.)
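[Figure: the social graph from the targeted advertisement example (Jane, Joe, #Tesla, Peter, Taphouse, John, "Self driving cars" and "Dinner Offer" ads), extended with new "loves" edges, including Peter loves Jane, and an additional "Self driving cars" ad.]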
Stream Connected Components
● State: a disjoint set data structure for the components
● Computation: For each edge
– if seen for the 1st time, create a component with ID the min of the vertex IDs
– if in different components, merge them and update the component ID to the min of the component IDs
– if only one of the endpoints belongs to a component, add the other one to the same component
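The per-edge rules above can be sketched with a small union-find structure in Python. The class and method names are assumptions made for this sketch. An unseen vertex starts as its own singleton component, so all three cases reduce to "find both roots, then link the larger root under the smaller one", which keeps each component's ID equal to its minimum vertex ID.

```python
class StreamingComponents:
    """Disjoint-set state updated once per incoming edge."""

    def __init__(self):
        self.parent = {}  # vertex -> parent; a root is its own parent

    def find(self, vertex):
        if vertex not in self.parent:
            self.parent[vertex] = vertex  # first time seen: singleton component
        root = vertex
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every vertex on the path directly at the root.
        while self.parent[vertex] != root:
            next_vertex = self.parent[vertex]
            self.parent[vertex] = root
            vertex = next_vertex
        return root

    def add_edge(self, u, v):
        root_u, root_v = self.find(u), self.find(v)
        if root_u != root_v:
            # Merge and keep the smaller ID as the component ID.
            self.parent[max(root_u, root_v)] = min(root_u, root_v)

    def components(self):
        groups = {}
        for vertex in list(self.parent):
            groups.setdefault(self.find(vertex), set()).add(vertex)
        return groups

state = StreamingComponents()
for u, v in [(1, 2), (3, 4), (6, 7), (2, 3)]:
    state.add_edge(u, v)
groups = state.components()
```

The state holds one entry per vertex rather than per edge, and `components()` can be called at any time to read the current result without reprocessing earlier edges.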
Distributed Stream Connected Components
Streaming Graph Processing [Current Work]
● We're working with Gelly-Streams on
– Preserving natural properties in large scale real-world evolving graphs
– Joining multiple streams to detect graph causality/bipartiteness
– Efficient graph partitioning mechanisms to integrate with popular data stores like Cassandra, HDFS
– Producing a platform to benchmark NP-complete (NPC) problems in real-world graphs
Discussion !!