
Page 1: Introduction to Data Stream Processing - GitHub Pages · 2019-12-03 · Introduction to Data Stream Processing Amir H. Payberah payberah@kth.se 27/09/2018. The Course Web Page 1/102

Introduction to Data Stream Processing

Amir H. Payberah payberah@kth.se

27/09/2018


The Course Web Page

https://id2221kth.github.io


Where Are We?


Stream Processing (1/4)

I Stream processing is the act of continuously incorporating new data to compute aresult.


Stream Processing (2/4)

I The input data is unbounded.
• A series of events, no predetermined beginning or end.
• E.g., credit card transactions, clicks on a website, or sensor readings from IoT devices.


Stream Processing (3/4)

I User applications can then compute various queries over this stream of events.
• E.g., tracking a running count of each type of event, or aggregating them into hourly windows.


Stream Processing (4/4)

I Database Management Systems (DBMS): data-at-rest analytics
• Store and index data before processing it.
• Process data only when explicitly asked by the users.

I Stream Processing Systems (SPS): data-in-motion analytics
• Process information as it flows, without storing it persistently.


Stream Processing Systems Stack


Data Stream Storage


The Problem

I We need to disseminate streams of events from various producers to various consumers.


Example

I Suppose you have a website, and every time someone loads a page, you send a viewed-page event to consumers.

I The consumers may do any of the following:
• Store the message in HDFS for future analysis
• Count page views and update a dashboard
• Trigger an alert if a page view fails
• Send an email notification to another user


Possible Solutions

I Messaging systems

I Partitioned logs


What is a Messaging System?

I A messaging system is an approach to notify consumers about new events.

I Messaging systems:
• Direct messaging
• Message brokers


Direct Messaging (1/2)

I Necessary in latency-critical applications (e.g., remote surgery).

I Both consumers and producers have to be online at the same time.

I A producer sends a message containing the event, which is pushed to consumers.


Direct Messaging (2/2)

I What happens if a consumer crashes or temporarily goes offline? (not durable)

I What happens if producers send messages faster than the consumers can process them?
• Dropping messages
• Backpressure

I We need message brokers that can log events to process at a later time.
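The two overload-handling options above can be illustrated with a bounded in-memory buffer. This is a minimal sketch, not broker code; the `BackpressureSketch` object, its method names, and the capacity of 2 are all made up for this example:

```scala
import java.util.concurrent.ArrayBlockingQueue

object BackpressureSketch {
  // Hypothetical bounded buffer between a fast producer and a slow consumer.
  val buffer = new ArrayBlockingQueue[String](2)

  // Dropping: offer() returns false when the buffer is full, and the message is lost.
  // Backpressure: put() would instead block the producer until space frees up.
  def tryProduce(msg: String): Boolean = buffer.offer(msg)

  // Consumer side: take the oldest buffered message.
  def consume(): String = buffer.take()
}
```

Once the buffer fills, `tryProduce` starts returning false (the "dropping" strategy); switching it to `buffer.put(msg)` would give backpressure instead, stalling the producer until the consumer catches up.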


Message Broker (1/2)

I A message broker decouples the producer-consumer interaction.

I It runs as a server, with producers and consumers connecting to it as clients.

I Producers write messages to the broker, and consumers receive them by reading them from the broker.

I Consumers are generally asynchronous.


Message Broker (2/2)

I When multiple consumers read messages from the same topic:

I Load balancing: each message is delivered to one of the consumers.

I Fan-out: each message is delivered to all of the consumers.
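The two delivery modes can be sketched in a few lines of Scala. This is an illustration only; the `DeliveryModes` object and its round-robin choice for load balancing are assumptions made for the example:

```scala
object DeliveryModes {
  // Load balancing: each message is delivered to exactly one consumer
  // (here simply by taking turns).
  def loadBalance(msgs: Seq[String], consumers: Seq[String]): Map[String, Seq[String]] =
    msgs.zipWithIndex
      .groupBy { case (_, i) => consumers(i % consumers.size) }
      .map { case (c, ms) => c -> ms.map(_._1) }

  // Fan-out: every consumer receives every message.
  def fanOut(msgs: Seq[String], consumers: Seq[String]): Map[String, Seq[String]] =
    consumers.map(c => c -> msgs).toMap
}
```

With two messages and two consumers, load balancing hands each consumer one message, while fan-out hands both consumers both messages.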



Partitioned Logs (1/2)

I Log-based message brokers combine the durable storage approach with low-latency notification facilities.

I A log is an append-only sequence of records on disk.

I A producer sends a message by appending it to the end of the log.

I A consumer receives messages by reading the log sequentially.
• If it reaches the end of the log, it waits for a notification.
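The log abstraction described above can be sketched as an in-memory structure (a real broker persists the records to disk). The `AppendOnlyLog` object and its method names are hypothetical, invented for this illustration:

```scala
import scala.collection.mutable.ArrayBuffer

object AppendOnlyLog {
  private val records = ArrayBuffer.empty[String]

  // Producer side: append to the end of the log; the offset is the position.
  def append(msg: String): Int = { records += msg; records.size - 1 }

  // Consumer side: read sequentially by offset. None means the consumer has
  // reached the end of the log and must wait for a notification.
  def read(offset: Int): Option[String] =
    if (offset >= 0 && offset < records.size) Some(records(offset)) else None
}
```

A consumer simply keeps a counter, calls `read(offset)`, and increments the counter on each successful read.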


Partitioned Logs (2/2)

I To scale up the system, logs can be partitioned and hosted on different machines.

I A topic is a group of partitions that all carry messages of the same type.

I Within each partition, the broker assigns a monotonically increasing sequence number (offset) to every message.


Kafka - A Log-Based Message Broker


Kafka

I Kafka is a distributed, topic-oriented, partitioned, replicated commit log service.


Logs, Topics and Partition (1/5)

I Kafka is about logs.

I Topics are queues: a stream of messages of a particular type.


Logs, Topics and Partition (2/5)

I Each message is assigned a sequential id called an offset.


Logs, Topics and Partition (3/5)

I Topics are logical collections of partitions (the physical files).
• Ordered
• Append-only
• Immutable


Logs, Topics and Partition (4/5)

I Ordering is only guaranteed within a partition for a topic.

I Messages sent by a producer to a particular topic partition will be appended in the order they are sent.

I A consumer instance sees messages in the order they are stored in the log.


Logs, Topics and Partition (5/5)

I Partitions of a topic are replicated for fault tolerance.

I A broker contains some of the partitions for a topic.

I One broker is the leader of a partition: all writes and reads must go to the leader.


Kafka Architecture


Producers

I Producers publish data to the topics of their choice.

I Producers are responsible for choosing which message to assign to which partition within the topic:
• Round-robin
• Key-based
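The two assignment strategies can be sketched as pure functions. This is an illustration under assumptions (the `PartitionerSketch` object, the partition count of 3, and the plain `hashCode` are invented here; Kafka's real partitioner uses a different hash):

```scala
object PartitionerSketch {
  val numPartitions = 3

  // Key-based: hash the key, so the same key always lands in the same
  // partition — this is what preserves per-key ordering.
  def keyBased(key: String): Int = math.abs(key.hashCode % numPartitions)

  // Round-robin: keyless messages are spread evenly over the partitions.
  def roundRobin(messageIndex: Int): Int = messageIndex % numPartitions
}
```

Because `keyBased` is deterministic, all messages for one key go to one partition; `roundRobin` instead cycles 0, 1, 2, 0, 1, 2, … for maximum balance.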


Consumers and Consumer Groups (1/2)

I Consumers pull a range of messages from brokers.

I Multiple consumers can read from the same topic at their own pace.

I Consumers maintain the message offset.


Consumers and Consumer Groups (2/2)

I Consumers can be organized into consumer groups.

I Each message is delivered to only one of the consumers within the group.

I All messages from one partition are consumed only by a single consumer within each consumer group.
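The group property above — every partition owned by exactly one consumer in the group — can be sketched as an assignment function. The `GroupAssignment` object and its round-robin scheme are assumptions for this example; Kafka's actual assignors (range, round-robin, sticky) are more elaborate:

```scala
object GroupAssignment {
  // Assign each partition to exactly one consumer in the group,
  // dealing partitions out in turn.
  def assign(partitions: Seq[Int], consumers: Seq[String]): Map[String, Seq[Int]] =
    partitions.zipWithIndex
      .groupBy { case (_, i) => consumers(i % consumers.size) }
      .map { case (c, ps) => c -> ps.map(_._1) }
}
```

Note that every partition appears in exactly one consumer's list, so within the group each message is processed once, while a second group would get its own independent assignment.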


Brokers

I The published messages are stored at a set of servers called brokers.

I Brokers are stateless.

I Messages are kept in the log for a predefined period of time.


Coordination

I Kafka uses ZooKeeper for the following tasks:
• Detecting the addition and the removal of brokers and consumers.
• Triggering a rebalance process in each consumer.
• Keeping track of the consumed offset of each partition.


Delivery Guarantees

I Kafka guarantees that messages from a single partition are delivered to a consumer in order.

I There is no guarantee on the ordering of messages coming from different partitions.

I Kafka only guarantees at-least-once delivery.

I No exactly-once delivery: that would require two-phase commits.


Start and Work With Kafka

# Start ZooKeeper
zookeeper-server-start.sh config/zookeeper.properties

# Start the Kafka server
kafka-server-start.sh config/server.properties

# Create a topic called "avg"
kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 \
  --partitions 1 --topic avg

# Print the list of topics
kafka-topics.sh --list --zookeeper localhost:2181

# Produce messages and send them to the topic "avg"
kafka-console-producer.sh --broker-list localhost:9092 --topic avg

# Consume the messages sent to the topic "avg"
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic avg --from-beginning


Programming Kafka in Scala - Producer

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object ScalaProducerExample extends App {
  def getRandomVal: String = { ... }

  val brokers = "localhost:9092"
  val topic = "avg"

  val props = new Properties()
  props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
  // Key/value serializers are required, or the producer will not start.
  props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
  props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)

  val producer = new KafkaProducer[String, String](props)

  while (true) {
    // No key (null): messages are spread over the topic's partitions round-robin.
    val data = new ProducerRecord[String, String](topic, null, getRandomVal)
    producer.send(data)
  }

  producer.close()
}


Programming Kafka in Scala - Consumer

import java.util.{Collections, Properties}
import java.util.concurrent.Executors
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import org.apache.kafka.common.serialization.StringDeserializer
import scala.collection.JavaConverters._

object ScalaConsumerExample extends App {
  val brokers = "localhost:9092"
  val groupId = "group1"
  val topic = "avg"

  val props = new Properties()
  props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
  props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId)
  // Key/value deserializers are required, or the consumer will not start.
  props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
  props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)

  val consumer = new KafkaConsumer[String, String](props)
  consumer.subscribe(Collections.singletonList(topic))

  Executors.newSingleThreadExecutor.execute(new Runnable {
    override def run(): Unit = {
      while (true) {
        // Poll for up to one second, then print each record's key, value, and offset.
        val records = consumer.poll(1000)
        for (record <- records.asScala) {
          println(record.key() + ", " + record.value() + ", " + record.offset())
        }
      }
    }
  })
}


Data Stream Processing


Streaming Data

I A data stream is unbounded data, which is broken into a sequence of individual tuples.

I A data tuple is the atomic data item in a data stream.

I Tuples can be structured, semi-structured, or unstructured.


Streaming Data Processing Design Points

I Event time vs. processing time

I Continuous vs. micro-batch processing

I Record-at-a-Time vs. declarative APIs


Event Time vs. Processing Time (1/2)

I Event time: the time at which events actually occurred.
• Timestamps inserted into each record at the source.

I Processing time: the time when the record is received at the streaming application.
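The difference matters as soon as records are grouped into time windows. Below is a minimal sketch (the `TimeSemantics` object and one-hour windows are assumptions for this example): an event occurring at 10:59 but received at 11:01 falls in the 10:00 window by event time, yet in the 11:00 window by processing time.

```scala
object TimeSemantics {
  val hourMs = 3600000L

  // Truncate a timestamp (in ms) to the start of its one-hour window.
  def hourWindow(ts: Long): Long = ts - (ts % hourMs)
}
```

Windowing by event time therefore requires handling late records; windowing by processing time never does, but it reflects arrival order, not reality.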


Event Time vs. Processing Time (2/2)

I Ideally, event time and processing time should be equal.

I Skew between event time and processing time.

[https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101]


Windowing (1/3)

I Window: a buffer associated with an input port to retain previously received tuples.

I Four different windowing management policies.

• Count-based policy: the maximum number of tuples a window buffer can hold.
• Time-based policy: a wall-clock time period.
• Delta-based policy: a delta threshold in a tuple attribute.
• Punctuation-based policy: a punctuation is received.


Windowing (2/3)

I Two types of windows: tumbling and sliding

I Tumbling window: supports batch operations.
• When the buffer fills up, all the tuples are evicted.

I Sliding window: supports incremental operations.
• When the buffer fills up, older tuples are evicted.
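The two eviction behaviours can be sketched for a count-based policy (the `WindowSketch` object and its function names are invented for this illustration):

```scala
object WindowSketch {
  // Count-based tumbling window: when the buffer holds `size` tuples,
  // process them as one batch and evict all of them.
  def tumbling(stream: Seq[Int], size: Int): Seq[Seq[Int]] =
    stream.grouped(size).filter(_.size == size).toSeq

  // Count-based sliding window: once full, every new tuple produces a window
  // and evicts only the oldest tuple.
  def sliding(stream: Seq[Int], size: Int): Seq[Seq[Int]] =
    stream.sliding(size).filter(_.size == size).toSeq
}
```

On the stream 1, 2, 3, 4 with size 3, tumbling emits only [1, 2, 3] (and waits for three more tuples), while sliding emits [1, 2, 3] and then [2, 3, 4] — the incremental behaviour the slide refers to.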


Windowing (3/3)

I In summary:
• Fixed windows
• Sliding windows
• Sessions

[https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101]


Streaming Data Processing Patterns

I Micro-batch systems
• Batch engines
• Slice up the unbounded data into sets of bounded data, then process each batch.

I Continuous processing-based systems
• Each node in the system continually listens to messages from other nodes and outputs new updates to its child nodes.


Processing Patterns - Micro-Batch Processing (1/2)

I Fixed windows

I Windowing input data into fixed-sized windows, then processing each window as a bounded data source.

[https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101]
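The micro-batch idea above can be sketched in a few lines: slice timestamped events into fixed windows, then run an ordinary batch aggregation over each window. The `MicroBatchSketch` object, the 5-second window, and the sum aggregation are all assumptions made for this example:

```scala
object MicroBatchSketch {
  // Slice (timestamp, value) events into fixed windows of `windowMs` and run
  // a batch aggregation (here a sum) over each completed window.
  def sumPerWindow(events: Seq[(Long, Int)], windowMs: Long): Map[Long, Int] =
    events.groupBy { case (ts, _) => ts - (ts % windowMs) }
      .map { case (w, es) => w -> es.map(_._2).sum }
}
```

Each key in the result is a window's start time; each value is the batch result for that bounded slice of the stream.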


Processing Patterns - Micro-Batch Processing (2/2)

I Session

I Periods of activity (e.g., for a specific user) terminated by a gap of inactivity.

[https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101]
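Sessionization by inactivity gap can be sketched as a fold over a user's sorted event timestamps (the `SessionSketch` object and the 30 ms gap in the test are invented for this illustration):

```scala
object SessionSketch {
  // Group sorted event timestamps into sessions: a gap of more than `gapMs`
  // of inactivity closes the current session and opens a new one.
  def sessions(times: Seq[Long], gapMs: Long): Seq[Seq[Long]] =
    times.foldLeft(Vector.empty[Vector[Long]]) {
      case (acc, t) if acc.nonEmpty && t - acc.last.last <= gapMs =>
        acc.init :+ (acc.last :+ t)   // close enough: extend the current session
      case (acc, t) => acc :+ Vector(t)  // gap exceeded: start a new session
    }
}
```

Events at 0, 10, 20 with the next event at 100 split into two sessions when the gap threshold is 30, matching the "periods of activity terminated by a gap of inactivity" definition.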


Processing Patterns - Continuous Processing (1/4)

I Time-agnostic

I Time is essentially irrelevant, i.e., all relevant logic is data driven.

I E.g., filtering, inner-join, ...

[https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101]


Processing Patterns - Continuous Processing (2/4)

I Approximation algorithms

I These algorithms typically have some element of time in their design.

I E.g., approximate Top-N, streaming K-means, ...

[https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101]


Processing Patterns - Continuous Processing (3/4)

I Windowing by processing time

I The system buffers up incoming data into windows until some amount of processing time has passed.

I E.g., five-minute fixed windows

[https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101]

57 / 102


Processing Patterns - Continuous Processing (4/4)

I Windowing by event time

I This model is what we use when we need to observe a data source in finite chunks that reflect the times at which those events actually happened.

[https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101]

58 / 102
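The contrast with processing-time windowing can be made concrete with a small sketch (hypothetical record layout): each record carries its event time and its arrival time, and only the event time decides its window, so late-arriving data still lands in the right chunk.

```python
from collections import defaultdict

def event_time_counts(records, width):
    """Count events per fixed event-time window, regardless of the
    order in which records arrive at the system.

    Each record is (event_time, arrival_time, payload); only
    event_time determines the window assignment."""
    counts = defaultdict(int)
    for event_time, _arrival_time, _payload in records:
        counts[event_time // width * width] += 1
    return dict(counts)

# A late record (event time 2, arriving at time 13) still lands in window [0, 5).
records = [(1, 1, "x"), (6, 7, "y"), (2, 13, "late"), (8, 14, "z")]
print(event_time_counts(records, 5))  # {0: 2, 5: 2}
```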


Streaming Data Processing Design Points

I Event time vs. processing time

I Continuous vs. micro-batch processing

I Record-at-a-Time vs. declarative APIs

59 / 102


Record-at-a-Time vs. Declarative APIs

I Record-at-a-time API (e.g., Storm)
• Low-level API.
• Passes each event to the application and lets it react.
• Useful when applications need full control over the processing of data.
• Complicated factors, such as maintaining state, are governed by the application.

I Declarative API (e.g., Spark Streaming, Flink, Google Dataflow)
• Applications specify what to compute, not how to compute it in response to each new event.

60 / 102


Streaming Data Processing Model

61 / 102


Streaming Data Processing (1/2)

I The tuples are processed by the application’s operators, or processing elements (PEs).

I A PE is the basic functional unit in an application.
• A PE processes input tuples, applies a function, and outputs tuples.
• An application is a set of PEs and stream connections, organized into a data flow graph.

62 / 102


Streaming Data Processing (2/2)

I Data flow programming

I Flow composition: techniques for creating the topology associated with the flow graph for an application.

I Flow manipulation: the use of PEs to perform transformations on data flows.

63 / 102
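Flow composition and flow manipulation can be illustrated with Python generators standing in for PEs (hypothetical names; a real SPS would wire these through stream connections): each PE consumes a stream of tuples, applies a function, and emits tuples downstream.

```python
def edge_adapter(raw_lines):
    """Edge adaptation PE: convert raw text lines into word tuples."""
    for line in raw_lines:
        for word in line.split():
            yield word

def filter_pe(stream, min_len):
    """Flow-manipulation PE: drop words shorter than min_len."""
    for word in stream:
        if len(word) >= min_len:
            yield word

def count_pe(stream):
    """Aggregation (sink) PE: count occurrences of each word."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Flow composition: chain the PEs into a linear data flow graph.
result = count_pe(filter_pe(edge_adapter(["to be or not to be"]), 2))
print(result)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```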


Data Flow Composition

I Data flow composition patterns:
• Static composition
• Dynamic composition

64 / 102


PEs Tasks (1/2)

I Edge adaptation: converting data from external sources into tuples that can be consumed by downstream PEs.

I Aggregation: collecting and summarizing a subset of tuples from one or more streams.

I Splitting: partitioning a stream into multiple streams.

I Merging: combining multiple input streams.

65 / 102



PEs Tasks (2/2)

I Logical and mathematical operations: applying different logical, relational, and mathematical processing to tuple attributes.

I Sequence manipulation: reordering, delaying, or altering the temporal properties of a stream.

I Custom data manipulations: applying data mining, machine learning, ...

66 / 102



PEs States (1/3)

I A PE can either maintain internal state across tuples while processing them, or process tuples independently of each other.

I Stateful vs. stateless tasks

67 / 102

Page 91: Introduction to Data Stream Processing - GitHub Pages · 2019-12-03 · Introduction to Data Stream Processing Amir H. Payberah payberah@kth.se 27/09/2018. The Course Web Page 1/102

PEs States (2/3)

I Stateless tasks: do not maintain state and process each tuple independently of prior history, or even of the order in which tuples arrive.

I Easily parallelized.

I No synchronization.

I Restart upon failures without the need for any recovery procedure.

68 / 102



PEs States (3/3)

I Stateful tasks: involve maintaining information across different tuples to detect complex patterns.

I A PE’s state is usually a synopsis of the tuples received so far.

I E.g., a subset of recent tuples kept in a window buffer.

69 / 102
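A stateful PE can be sketched as a class whose state is exactly such a synopsis (the class and method names are illustrative): a bounded buffer of recent tuples, updated on every input and used to emit a moving average.

```python
class MovingAveragePE:
    """A stateful PE: its state is a synopsis (window buffer) of the
    most recent tuples, used to emit a moving average per input."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.buffer = []  # the synopsis: recent tuples only

    def process(self, value):
        self.buffer.append(value)
        if len(self.buffer) > self.window_size:
            self.buffer.pop(0)  # evict the oldest tuple
        return sum(self.buffer) / len(self.buffer)

pe = MovingAveragePE(window_size=3)
print([pe.process(v) for v in [3, 6, 9, 12]])  # [3.0, 4.5, 6.0, 9.0]
```

A stateless PE, by contrast, would compute its output from the current tuple alone, with no buffer to maintain or recover.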



Runtime Systems

70 / 102


Job and Job Management

I At runtime, an application is represented by one or more jobs.

I Jobs are deployed as a collection of PEs.

I The job management component must identify and track individual PEs and the jobs they belong to, and associate them with the user that instantiated them.

71 / 102


Logical Plan vs. Physical Plan (1/3)

I Logical plan: a data flow graph where the vertices correspond to PEs, and the edges to stream connections.

I Physical plan: a data flow graph where the vertices correspond to OS processes, and the edges to transport connections.

72 / 102


Logical Plan vs. Physical Plan (2/3)

Logical plan

Different physical plans

73 / 102


Logical Plan vs. Physical Plan (3/3)

I How to map a network of PEs onto the physical network of nodes?

• Parallelization

• Fault tolerance

• Optimization

74 / 102


Parallelization

75 / 102


Parallelization

I How to scale with an increasing number of queries and a growing rate of incoming events?

I Three forms of parallelism:
• Pipelined parallelism
• Task parallelism
• Data parallelism

76 / 102


Pipelined Parallelism

I Sequential stages of a computation execute concurrently for different data items.

77 / 102


Task Parallelism

I Independent processing stages of a larger computation are executed concurrently on the same or distinct data items.

78 / 102


Data Parallelism (1/2)

I The same computation takes place concurrently on different data items.

79 / 102


Data Parallelism (2/2)

I How to allocate data items to each computation instance?

80 / 102
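One common answer is key-based partitioning: hash an attribute of each tuple and route it to one of the parallel instances, so that all tuples with the same key reach the same instance and its local state. A sketch (the hash function here is deliberately simple and illustrative):

```python
def route(key, num_instances):
    """Allocate a tuple to one of num_instances parallel instances by
    hashing its key; tuples sharing a key always reach the same
    instance, keeping per-key state local."""
    h = 0
    for ch in str(key):
        h = (h * 31 + ord(ch)) % (2**32)  # simple polynomial hash
    return h % num_instances

stream = [("user1", 5), ("user2", 7), ("user1", 3)]
assignments = [(key, route(key, 4)) for key, _ in stream]
# Both "user1" tuples map to the same instance.
print(assignments)
```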


Fault Tolerance

81 / 102


Recovery Methods (1/2)

I The recovery methods of streaming frameworks must take into account:

• Correctness, e.g., avoiding data loss and duplicates

• Performance, e.g., low latency

82 / 102


Recovery Methods (2/2)

I GAP recovery

I Rollback recovery

I Precise recovery

83 / 102


GAP Recovery (Cold Restart)

I The weakest recovery guarantee

I A new task takes over the operations of the failed task.

I The new task starts from an empty state.

I Tuples can be lost during the recovery phase.

84 / 102


Rollback Recovery

I Information loss is avoided, but the output may contain duplicate tuples.

I Three types of rollback recovery:
• Active backup
• Passive backup
• Upstream backup

85 / 102


Rollback Recovery - Active Backup

I Each processing node has an associated backup node.

I Both primary and backup nodes are given the same input.

I The output tuples of the backup node are logged at the output queues and they are not sent downstream.

I If the primary fails, the backup takes over by sending the logged tuples to all downstream neighbors and then continuing its processing.

86 / 102


Rollback Recovery - Passive Backup

I The primary periodically checkpoints its processing state to shared storage.

I The backup node takes over from the latest checkpoint when the primary fails.

I The backup node is always equal to or behind the primary.

87 / 102


Rollback Recovery - Upstream Backup

I Upstream nodes store the tuples until the downstream nodes acknowledge them.

I If a node fails, an empty node rebuilds the latest state of the failed primary from the logs kept at the upstream nodes.

I There is no backup node in this model.

88 / 102
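The core mechanism, logging on send and trimming on acknowledgment, can be sketched as follows (class and method names are illustrative):

```python
class UpstreamNode:
    """Upstream backup sketch: the upstream node logs every tuple it
    sends and trims the log only on acknowledgment, so a restarted
    downstream node can replay the unacknowledged tuples."""

    def __init__(self):
        self.log = []       # (sequence number, tuple) pairs not yet acked
        self.next_seq = 0

    def send(self, tup):
        self.log.append((self.next_seq, tup))
        self.next_seq += 1
        return self.log[-1]

    def ack(self, seq):
        # Downstream acknowledges everything up to and including seq.
        self.log = [(s, t) for s, t in self.log if s > seq]

    def replay(self):
        # Tuples to resend after a downstream failure.
        return [t for _, t in self.log]

up = UpstreamNode()
for t in ["a", "b", "c", "d"]:
    up.send(t)
up.ack(1)            # "a" and "b" were processed downstream
print(up.replay())   # ['c', 'd']
```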


Precise Recovery

I Post-failure output is exactly the same as the output without failure.

I Can be achieved by modifying the algorithms for rollback recovery.
• For example, in passive backup, after a failure occurs the backup node can ask the downstream nodes for the latest tuples they received and trim its output queues accordingly to prevent duplicates.

89 / 102
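That trimming step is easy to sketch (hypothetical function name), assuming each output tuple carries a sequence number and downstream reports the last one it received:

```python
def trim_duplicates(output_queue, last_seen_seq):
    """Precise-recovery sketch: before resending, drop from the output
    queue every (seq, tuple) pair the downstream node already received,
    so the post-failure output matches the failure-free output."""
    return [(seq, tup) for seq, tup in output_queue if seq > last_seen_seq]

queue = [(0, "a"), (1, "b"), (2, "c")]
print(trim_duplicates(queue, 1))  # [(2, 'c')]
```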


Optimization

90 / 102


Optimization - Early Data Reduction

I Reducing the data volume as early as possible.
• Sampling, filtering, quantization, projection, and aggregation.

91 / 102


Optimization - Reordering

I Operator reordering
• Executing the computationally cheaper operator and/or the more selective operator earlier reduces the overall cost.

92 / 102
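A toy cost model makes the benefit concrete (all numbers are illustrative): each operator is charged per input tuple, and its selectivity scales the rate seen by the next operator, so running the cheap, selective operator first shrinks the expensive operator's input.

```python
def pipeline_cost(rate, ops):
    """Total per-unit-time cost of applying operators in order.
    Each op is (selectivity, per_tuple_cost); selectivity scales
    the tuple rate seen by the next operator."""
    total = 0.0
    for selectivity, cost in ops:
        total += rate * cost       # this operator processes `rate` tuples
        rate *= selectivity        # only a fraction flows downstream
    return total

cheap_selective = (0.1, 1.0)   # drops 90% of tuples, cheap to run
expensive = (0.9, 10.0)        # keeps most tuples, expensive to run

print(pipeline_cost(1000, [cheap_selective, expensive]))  # 2000.0
print(pipeline_cost(1000, [expensive, cheap_selective]))  # 10900.0
```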


Optimization - Redundancy Elimination

I Removing the redundant segments from a data flow graph.

93 / 102


Optimization - Operator Fusion

I It changes only the physical layout.

I If the two operators at the ends of a stream connection are placed on different hosts, there is a non-negligible network cost.

I Fusion is effective if the per-tuple processing cost of the operators being fused is lower than the cost of transferring tuples across the stream connection.

94 / 102



Optimization - Tuple Batching

I Processing a group of tuples in every iteration of an operator’s internal algorithm.

I Can increase the throughput at the expense of higher latency.

95 / 102
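The grouping itself is a one-screen sketch (illustrative function name): tuples are accumulated until a batch fills, so the operator's inner loop runs once per batch rather than once per tuple, trading latency for throughput.

```python
def batches(stream, batch_size):
    """Group tuples into fixed-size batches; the final partial
    batch is flushed at end of stream."""
    batch = []
    for tup in stream:
        batch.append(tup)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush whatever remains
        yield batch

print(list(batches(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```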


Optimization - Load Balancing

I Flow partitioning to distribute the workload, e.g., data or task parallelism.

I Distributing the load evenly across the different subflows.

96 / 102


Optimization - Load Shedding

I Used by an operator to reduce the amount of computational resources it uses.
• Decreases the operator latency and improves the throughput.

I Different techniques: dropping incoming tuples, data reduction techniques (e.g., sampling), ...

97 / 102
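The simplest shedding policy, random sampling, can be sketched as a generator PE (names are illustrative) that forwards each tuple with a fixed probability and silently drops the rest:

```python
import random

def shed(stream, keep_fraction, rng=random.Random(42)):
    """Load shedding by random sampling: forward each incoming tuple
    with probability keep_fraction, drop it otherwise."""
    for tup in stream:
        if rng.random() < keep_fraction:
            yield tup

kept = list(shed(range(10_000), 0.1))
# Roughly 10% of the tuples survive; the operator behind this PE
# now sees about a tenth of the original load.
print(len(kept))
```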


Summary

98 / 102


Summary

I Messaging system and partitioned logs

I Decoupling producers and consumers

I Kafka: distributed, topic oriented, partitioned, replicated log service

I Logs, topics, partitions

I Kafka architecture: producer, consumer (groups), broker, coordinator

99 / 102


Summary

I SPS vs. DBMS

I Data stream, unbounded data, tuples

I Event-time vs. processing time

I Micro-batch vs. continuous processing (windowing)

I PEs and dataflow

I Stateless vs. stateful PEs

I SPS runtime: parallelization, fault-tolerance, optimization

100 / 102


References

I J. Kreps et al., “Kafka: A distributed messaging system for log processing”, NetDB 2011

I M. Zaharia et al., “Spark: The Definitive Guide”, O’Reilly Media, 2018 - Chapter 20

I H. Andrade et al., “Fundamentals of stream processing: application design, systems, and analytics”, Cambridge University Press, 2014 - Chapters 1-5, 7, 9

I J. Hwang et al., “High-availability algorithms for distributed stream processing”, ICDE 2005

I T. Akidau, “The world beyond batch: Streaming 101”, https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101

101 / 102


Questions?

102 / 102