Dancing with Stream Processing
Y.S. Horawalavithana

Post on 14-Apr-2017

Page 1: Dancing with Stream Processing

07/10/16 MSc. Distributed Systems 2

● Motivation
● Event Stream Processing
  – Pub/sub, CEP, “Buzzwords”
  – Stream processing engines
    ● Spark Streaming, Storm, etc.
● Graph Stream Processing
  – Theory... {“sketching”, “spanners”, “sparsifiers”}
  – Challenges
● Discussion !!

Lightning Talk

Motivation

The Streaming Era

● Today, most data is continuously produced
  - user activity logs, web logs, sensors, database transactions, …
● The common approach to analyzing such data so far
  - Record the data stream to stable storage (DBMS, HDFS, …)
  - Periodically analyze the data with a batch processing engine (DBMS, MapReduce, …)
● Stream processing engines analyze data as it arrives

The Streaming Era (Contd.)

● Decreases the overall latency to obtain results
  - No need to persist data in stable storage
  - No periodic batch analysis jobs
● Simplifies the data infrastructure
  - Fewer moving parts to be maintained and coordinated
● Makes time dimension of data explicit
  - Each event has a timestamp
  - Data can be processed based on timestamps

Event Streams [Immutable]

Web Page Event

Wikipedia Page Update Event

LinkedIn User Update Event

Middleware

Point-to-point communication
● Direct coupling, strict identity, time coupling
● Not good for volatile environments
● Not a good way to communicate with several participants

Indirect communication
● Space uncoupling, anonymity, time uncoupling
● Independent lifetimes between parties
● Through a persistent communication channel

Taxonomy

Indirect Communication
● Communication based
  – Group communication
  – Message Queues
  – Publish/subscribe
● State based
  – Tuple spaces
  – Distributed Shared Memory

Pub/Sub Messaging Pattern

Topic-based
- Each event belongs to a number of topics (e.g. “music”, “sport”)
- Users subscribe to topics and receive all relevant events

Content-based
- Users subscribe to the actual content of the events, or a structured summary of it
- More expressive
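
The difference between the two subscription styles can be sketched in a few lines of Python. This is only an illustration: the event fields (`topics`, `price`, and so on) and the helper names `topic_match` and `content_match` are made up for the example and do not come from any particular pub/sub system.

```python
from typing import Callable, Dict

Event = Dict[str, object]

def topic_match(event: Event, subscribed_topics: set) -> bool:
    """Topic-based: deliver if the event carries any subscribed topic."""
    return bool(set(event["topics"]) & subscribed_topics)

def content_match(event: Event, predicate: Callable[[Event], bool]) -> bool:
    """Content-based: deliver if a predicate over the event's fields holds."""
    return predicate(event)

# Hypothetical events, each tagged with topics and some content fields.
events = [
    {"topics": ["music"], "artist": "Adele", "price": 30},
    {"topics": ["sport"], "team": "Lions", "price": 80},
    {"topics": ["music", "sport"], "artist": "Halftime Band", "price": 120},
]

# A topic subscriber asks for "music"; a content subscriber asks for price < 100.
by_topic = [e for e in events if topic_match(e, {"music"})]
by_content = [e for e in events if content_match(e, lambda e: e["price"] < 100)]
```

The content-based predicate can cut across topics (it selects one music and one sport event here), which is the sense in which it is more expressive.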

Pub/Sub Activities

Subscription processing
- Indexing and storing subscriptions.

Event Stream Processing (ESP)
- Pub/sub approach: upon arrival of events, access the subscription index and identify all matched subscriptions.

Event delivery
- Deliver the event to clients with matched subscriptions.
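
The three activities above can be sketched as one small in-memory broker. This is a toy under stated assumptions: the `Broker` class, its method names and the per-subscriber inbox are hypothetical, and a real system would add persistence, networking and content-based indexes.

```python
from collections import defaultdict

class Broker:
    def __init__(self):
        self.index = defaultdict(set)     # subscription index: topic -> subscriber ids
        self.inboxes = defaultdict(list)  # subscriber id -> delivered events

    def subscribe(self, subscriber, topic):
        # Subscription processing: store the subscription in the index.
        self.index[topic].add(subscriber)

    def publish(self, event):
        # Event processing: look up the index to find all matched subscriptions.
        matched = set()
        for topic in event["topics"]:
            matched |= self.index.get(topic, set())
        # Event delivery: hand the event to every matched subscriber.
        for subscriber in matched:
            self.inboxes[subscriber].append(event)
        return matched

broker = Broker()
broker.subscribe("alice", "music")
broker.subscribe("bob", "sport")
broker.subscribe("carol", "music")
matched = broker.publish({"topics": ["music"], "body": "new album"})
```

Indexing by topic keeps matching proportional to the number of topics on the event rather than the number of subscribers.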

Event Stream Processing (ESP)

Wikipedia

Today's world...

Pub/sub ≈ ESP ≈

“Buzzwords”

https://martin.kleppmann.com/2015/01/29/stream-processing-event-sourcing-reactive-cep.html

Complex Event Processing (CEP)

● A set of event processing principles
● Match patterns of events
  – Comparable to SQL queries
  – High-level query language
● Cloud of causally related events
  – POSET (Partially Ordered Set of Events)

Complex Event Processing (CEP)

● Some CEP Examples:
  – When 2 transactions happen on an account from radically different geographic locations within a certain time window, then report as potential fraud.
  – When a gold customer's trouble ticket is not resolved within 1 hour, then escalate.
  – When a team meeting request overlaps with my lunch break, then deny the team meeting and demote the meeting organizer.
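
As a rough illustration of the first rule, the sketch below keeps a per-account sliding window and reports pairs of transactions that fall inside it but come from different places. Several things are assumptions made for the example: "radically different locations" is reduced to a differing `country` field, the window is 600 seconds, events arrive in timestamp order, and `detect_fraud` is a hypothetical name rather than any CEP engine's API.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 600  # assumed window length

def detect_fraud(transactions, window=WINDOW_SECONDS):
    """Yield (earlier, later) pairs on one account, from different countries,
    at most `window` seconds apart. Assumes input ordered by timestamp."""
    recent = defaultdict(deque)  # account -> transactions still inside the window
    for tx in transactions:
        history = recent[tx["account"]]
        # Evict transactions that have fallen out of the time window.
        while history and tx["ts"] - history[0]["ts"] > window:
            history.popleft()
        for earlier in history:
            if earlier["country"] != tx["country"]:
                yield earlier, tx
        history.append(tx)

stream = [
    {"account": "A", "ts": 0, "country": "LK"},
    {"account": "B", "ts": 10, "country": "US"},
    {"account": "A", "ts": 120, "country": "BR"},
    {"account": "A", "ts": 5000, "country": "LK"},
]
alerts = list(detect_fraud(stream))
```

A CEP engine would express the same pattern declaratively in its query language; the loop above only shows the state such a query has to maintain.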

ESP and CEP [Timeline]

<2000 (1989 - 1995): Rapide
2002: Aurora
2003: Medusa
2005: Borealis
STREAM, TelegraphCQ
Esper, Apama, StreamBase, SQLStream, WSO2 CEP
2016

ESP vs. CEP

http://www.slideshare.net/TimBassCEP/mythbusters-event-stream-processing-v-complex-event-processing-presentation

Today's world...

ESP ≈ CEP ≈

Laundry of “Buzzwords”

● Actor Frameworks
  – Better mechanism to handle concurrency
  – E.g. Akka, Orleans and Erlang OTP
● “Reactive”
  – Language semantics for bringing event streams to the user interface
  – Responsive, Resilient, Elastic and Message Driven
  – E.g. Data flow languages, Functional reactive programming
● Event Sourcing
● Change Data Capture (CDC)

Analytics ≈ Stream Transformations

https://martin.kleppmann.com/2015/01/29/stream-processing-event-sourcing-reactive-cep.html

Target

● Better Scalability
● High Throughput
● Low latency
● Powerful semantics
● Easy integration

via Low Level Stream Processing Frameworks !!

Spark Streaming

● General purpose computing engine to run batch, interactive and streaming jobs
● Based on Resilient Distributed Datasets (RDD)
  – Restricted form of distributed shared memory
  – Immutable
  – Can only be built through deterministic transformations
● Efficient fault recovery using lineage graph
  – Recompute lost partitions on failure
  – No cost if nothing fails
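
The lineage idea can be illustrated without Spark. The toy class below is not the RDD API: `LineageDataset`, `lose` and `partition` are hypothetical names, and the "failure" is simulated by blanking a cached partition. It only shows that a dataset which remembers its parent and its deterministic transformation can rebuild a single lost partition on demand.

```python
class LineageDataset:
    """Toy stand-in for an RDD: cached partitions plus the recipe that built them."""

    def __init__(self, partitions, parent=None, fn=None):
        self._partitions = list(partitions)  # None marks a lost partition
        self.parent = parent
        self.fn = fn

    def map(self, fn):
        # Deterministic transformation: the child remembers its parent and `fn`.
        child_parts = [[fn(x) for x in part] for part in self._partitions]
        return LineageDataset(child_parts, parent=self, fn=fn)

    def lose(self, i):
        self._partitions[i] = None  # simulate a failed node

    def partition(self, i):
        if self._partitions[i] is None:
            if self.parent is None:
                raise RuntimeError("source partition lost; re-read it from storage")
            # Recompute only the missing partition by replaying the lineage.
            self._partitions[i] = [self.fn(x) for x in self.parent.partition(i)]
        return self._partitions[i]

source = LineageDataset([[1, 2], [3, 4]])
squares = source.map(lambda x: x * x)
plus_one = squares.map(lambda x: x + 1)

squares.lose(1)
plus_one.lose(1)
recovered = plus_one.partition(1)  # replays source -> squares -> plus_one for partition 1
```

Nothing extra is stored or computed while all partitions are healthy, which matches the "no cost if nothing fails" point.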

Spark Streaming (Contd.) [Key concepts]

● DStream – sequence of RDDs representing a stream of data
  – Sources: HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets
● Transformations – modify data from one DStream to another
  – Standard RDD operations – map, countByValue, reduce, join, …
  – Stateful operations – window, countByValueAndWindow, …
● Output Operations – send data to external entity
  – saveAsHadoopFiles – saves to HDFS
  – foreach – do anything with each batch of results
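
To give a feel for what a stateful, windowed operation computes, here is a plain-Python imitation of a windowed count. It is not Spark code: the function name merely echoes countByValueAndWindow, and the window is measured in number of batches (an assumption for simplicity) rather than in a time duration with a slide interval.

```python
from collections import Counter, deque

def count_by_value_and_window(batches, window_length):
    """For each incoming batch, count values over the last `window_length` batches."""
    window = deque(maxlen=window_length)  # oldest batch drops out automatically
    for batch in batches:
        window.append(Counter(batch))     # per-batch count, like countByValue
        total = Counter()
        for counts in window:
            total += counts
        yield dict(total)

batches = [["a", "b"], ["a"], ["c", "a"], ["b"]]
results = list(count_by_value_and_window(batches, window_length=2))
```

Each result covers the current batch and the one before it, so the counts for "a" rise and fall as batches enter and leave the window.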

Spark Streaming (Contd.)

● Run a streaming computation as a series of very small, deterministic batch jobs
  – Chop up the live stream into batches of X seconds
  – Spark treats each batch of data as RDDs and processes them using RDD operations
  – Finally, the processed results of the RDD operations are returned in batches
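
The micro-batch model described above can be mimicked in plain Python. This sketch does not use Spark; `micro_batches` and `word_count` are illustrative names, the events are assumed to be (timestamp, text) pairs already sorted by timestamp, and intervals with no events simply produce no batch.

```python
from itertools import groupby

def micro_batches(events, batch_seconds):
    """Group (timestamp, value) pairs, sorted by timestamp, into consecutive batches."""
    for batch_id, group in groupby(events, key=lambda e: int(e[0] // batch_seconds)):
        yield batch_id, [value for _, value in group]

def word_count(batch):
    """The deterministic batch job applied to every micro-batch."""
    counts = {}
    for line in batch:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

events = [(0.2, "to be"), (0.9, "or not"), (1.1, "to be"), (3.5, "be")]
results = [(bid, word_count(batch)) for bid, batch in micro_batches(events, batch_seconds=1)]
```

Because every batch is processed by the same deterministic function, a failed batch can be rerun and give the same answer.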

Berkeley Data Stack

Spark 2.0 is coming !!

Apache Storm [Key concepts]

● Tuple
  – Core Unit of Data
  – Immutable Set of Key/Value Pairs
● Spouts
  – Source of Streams
  – Wraps a streaming data source and emits Tuples
● Bolts
  – Core functions of a streaming computation
  – Receive tuples and do stuff
  – Optionally emit additional tuples

Apache Storm [Key concepts]

● Topology
  – DAG of Spouts and Bolts
  – Data Flow Representation
  – Streaming Computation
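
The spout/bolt/topology vocabulary maps naturally onto chained Python generators. The sketch below is an analogy only, not the Storm API: the function names are hypothetical, tuples are modelled as small dicts, and the "topology" is a single linear chain running in one process.

```python
def sentence_spout():
    """Spout: wraps a data source and emits tuples (dicts here)."""
    for sentence in ["the cow jumped", "the moon"]:
        yield {"sentence": sentence}

def split_bolt(tuples):
    """Bolt: receives sentence tuples and emits one tuple per word."""
    for t in tuples:
        for word in t["sentence"].split():
            yield {"word": word}

def count_bolt(tuples):
    """Bolt: keeps running counts and emits an updated (word, count) tuple."""
    counts = {}
    for t in tuples:
        counts[t["word"]] = counts.get(t["word"], 0) + 1
        yield {"word": t["word"], "count": counts[t["word"]]}

# Topology: spout -> split -> count (a linear DAG).
emitted = list(count_bolt(split_bolt(sentence_spout())))
final_counts = {t["word"]: t["count"] for t in emitted}
```

In a real deployment each spout and bolt would run as many parallel tasks across machines; the data flow graph stays the same.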

Apache Storm [Physical View]

Twitter introduces Heron !!

[Storm's successor]

Stream Processing Engines

Many More !!!

Hidden computation paradigm

via pipelining !!

Pipelining ≈ Task Execution

https://martin.kleppmann.com/unix

Let's build the concept again...

Linux pipelining in modern middleware...

https://martin.kleppmann.com/unix

Spark, Storm, Samza, Flink Etc.

https://martin.kleppmann.com/unix

Pub/sub pitch

https://martin.kleppmann.com/unix

Streaming Machine Learning

● By using a programming abstraction for distributed streaming
  – Apache SAMOA

Graph Stream Processing
Referred Author: Vasia Kalavri, KTH
https://berlinbuzzwords.de/sites/berlinbuzzwords.de/files/media/documents/buzzwords-kalavri.pdf

Static Graph Processing

● Load: read the graph from disk and partition it in memory

● Compute: read and mutate the graph state

● Store: write the final graph state back to disk

Static Graph Processing [Drawbacks]

● It is slow
  – wait until the computation is over before you see any result
  – pre-processing and partitioning
● It is expensive
  – lots of memory and CPU required in order to scale
● It requires re-computation for graph changes
  – no efficient way to deal with updates

Streaming Graph Processing

We consume events in real-time
● Get results faster
  – No need to wait for the job to finish
  – Sometimes, early approximations are better than late exact answers
● Get results continuously
  – Process unbounded number of events

Real-world scenarios

● Targeted Advertisement
  – Finding Strongly Connected Components in a social network graph
  – Targeted chain of advertisement on detected communities

[Figure: example social graph with nodes Jane, Joe, Peter, John, #Tesla and Taphouse; edge labels knows, posts, likes, checks-in, subscribes; attached ads: Self driving cars, Dinner Offer]

Streaming Graph Processing [Challenges]

● Maintain the graph structure
  – How to apply state updates efficiently?
● Result updates
  – Re-run the analysis for each event?
  – Design an incremental algorithm?
  – Run separate instances on multiple snapshots?
● How to preserve graph properties?
  – Natural behavior?

Streaming Graph Processing [Current Research]

Each event is an edge addition

Jane knows Joe
Jane likes #Tesla
Joe posts #Tesla
Peter checks-in TapHouse

Dynamic Graph Processing

● Instead of analyzing the whole graph
  – Analyze its properties by preserving them continuously
    ● Connectivity or Distance (spanners)
    ● Graph cut estimation (sparsifiers)
    ● Neighborhood or homomorphic properties (sketches)
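
As one concrete example of preserving a property instead of the whole graph, the sketch below builds a spanner from an insert-only edge stream using the classic greedy rule: an edge is kept only if its endpoints are currently more than `stretch` hops apart in the kept subgraph. This is a textbook technique chosen for illustration, not necessarily what the referenced work uses; the function names are hypothetical and the graph is assumed unweighted and undirected.

```python
from collections import defaultdict, deque

def within_distance(adj, src, dst, limit):
    """Bounded BFS: is dst reachable from src in at most `limit` hops?"""
    if src == dst:
        return True
    seen = {src}
    frontier = deque([(src, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == limit:
            continue  # do not expand beyond the hop limit
        for nxt in adj[node]:
            if nxt == dst:
                return True
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return False

def stream_spanner(edges, stretch):
    """Keep an edge only if its endpoints are more than `stretch` hops apart so far."""
    adj = defaultdict(set)
    kept = []
    for u, v in edges:
        if not within_distance(adj, u, v, stretch):
            adj[u].add(v)
            adj[v].add(u)
            kept.append((u, v))
    return kept

edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 1), (5, 6)]
kept = stream_spanner(edges, stretch=3)
```

Every dropped edge already had a path of at most `stretch` hops between its endpoints, so distances in the kept subgraph are at most `stretch` times the true distances while fewer edges are stored.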

Dynamic Graph Processing (Contd.)

[Figure: the example social graph (Jane, Joe, Peter, John, #Tesla, Taphouse) extended with "loves" edges involving Peter and Jane and an additional "Self driving cars" ad]

Stream Connected Components

● State: a disjoint set data structure for the components
● Computation: for each edge
  – if both endpoints are seen for the 1st time, create a component with ID the min of the vertex IDs
  – if the endpoints are in different components, merge them and update the component ID to the min of the component IDs
  – if only one of the endpoints belongs to a component, add the other one to the same component
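
A minimal single-machine sketch of this state and update rule, using a union-find (disjoint set) structure, might look as follows. The class and method names are illustrative, vertex IDs are assumed to be comparable (integers here), and the three cases from the slide collapse into one "find both roots, link the larger under the smaller" step because an unseen vertex starts as its own singleton component.

```python
class StreamingComponents:
    """Disjoint-set state updated one edge at a time.
    A component's ID is the minimum vertex ID it contains."""

    def __init__(self):
        self.parent = {}

    def find(self, v):
        self.parent.setdefault(v, v)  # a new vertex starts as its own component
        root = v
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[v] != root:  # path compression
            self.parent[v], v = root, self.parent[v]
        return root

    def add_edge(self, u, v):
        ru, rv = self.find(u), self.find(v)
        if ru != rv:
            lo, hi = min(ru, rv), max(ru, rv)
            self.parent[hi] = lo  # merge; the smaller ID stays the root
        return min(ru, rv)

cc = StreamingComponents()
for u, v in [(1, 2), (4, 5), (2, 5), (7, 8)]:
    cc.add_edge(u, v)
labels = {v: cc.find(v) for v in sorted(cc.parent)}
```

The state grows with the number of vertices, not the number of edges, and the current component labels can be read out at any point in the stream.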

Distributed Stream Connected Components

Streaming Graph Processing [Current Work]

● We're working with Gelly-Streams on
  – Preserving natural properties in large-scale, real-world evolving graphs
  – Joining multiple streams to detect graph causality / bipartiteness
  – Efficient graph partitioning mechanisms that integrate with popular data stores like Cassandra and HDFS
  – Producing a platform to benchmark NP-complete (NPC) problems on real-world graphs

Discussion !!

Thank you !!
