Introduction to Kafka Streams


Kafka Streams: Stream Processing Made Simple with Kafka


Guozhang Wang, Hadoop Summit, June 28, 2016

What is NOT Stream Processing?

Stream Processing isn't (necessarily)

• Transient, approximate, lossy…
• … something for which you must have batch processing as a safety net


Stream Processing

• A different programming paradigm
• … that brings computation to unbounded data
• … with trade-offs between latency, cost, and correctness

Why Kafka in Stream Processing?

• Persistent Buffering
• Logical Ordering
• Scalable "source-of-truth"

Kafka: Real-time Platforms

Stream Processing with Kafka

• Option I: Do It Yourself!

while (isRunning) {
    // read some messages from Kafka
    inputMessages = consumer.poll();

    // do some processing…

    // send output messages back to Kafka
    producer.send(outputMessages);
}
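For concreteness, here is a minimal, self-contained version of that loop using the plain Kafka consumer and producer clients. This is a sketch, not code from the deck: the topic names, broker address, and the trivial uppercase "processing" step are assumptions, and offset management and error handling are left out, which is exactly the kind of thing that makes the DIY route hard (see the list below).

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DiyStreamProcessor {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");   // assumed broker address
        props.put("group.id", "diy-processor");           // assumed group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            consumer.subscribe(Collections.singletonList("input-topic"));  // assumed topic
            while (true) {
                // read some messages from Kafka
                ConsumerRecords<String, String> inputMessages = consumer.poll(100);
                for (ConsumerRecord<String, String> msg : inputMessages) {
                    // do some processing… (here: trivially uppercase the value)
                    String processed = msg.value().toUpperCase();
                    // send output messages back to Kafka
                    producer.send(new ProducerRecord<>("output-topic", msg.key(), processed));
                }
            }
        }
    }
}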

DIY Stream Processing is Hard

• Ordering
• Partitioning & Scalability
• Fault tolerance
• State Management
• Time, Window & Out-of-order Data
• Re-processing

Stream Processing with Kafka

• Option I: Do It Yourself!
• Option II: full-fledged stream processing system
  • Storm, Spark, Flink, Samza, …

MapReduce Heritage?

• Config Management
• Resource Management
• Deployment
• etc.

Can I just use my own?!

Stream Processing with Kafka

• Option I: Do It Yourself!
• Option II: full-fledged stream processing system
• Option III: lightweight stream processing library

Kafka Streams

• In Apache Kafka since v0.10, May 2016
• Powerful yet easy-to-use stream processing library
• Event-at-a-time, stateful
• Windowing with out-of-order handling
• Highly scalable, distributed, fault tolerant
• and more…

Anywhere, anytime

<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-streams</artifactId>
    <version>0.10.0.0</version>
</dependency>

Anywhere, anytime

[Figure: deployment options on a spectrum from "very uncool" to "very cool": War File, Rsync, Puppet/Chef, YARN, Mesos, Docker, Kubernetes.]

Simple is Beautiful

Kafka Streams DSL

public static void main(String[] args) {
    // specify the processing topology by first reading in a stream from a topic
    KStream<String, String> words = builder.stream("topic1");

    // count the words in this stream as an aggregated table
    KTable<String, Long> counts = words.countByKey("Counts");

    // write the result table to a new topic
    counts.to("topic2");

    // create a stream processing instance and start running it
    KafkaStreams streams = new KafkaStreams(builder, config);
    streams.start();
}

Native Kafka Integration

Properties cfg = new Properties();
cfg.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");
cfg.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
cfg.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
cfg.put(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, "SASL_SSL");
cfg.put(KafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, "registry:8081");

StreamsConfig config = new StreamsConfig(cfg);
KafkaStreams streams = new KafkaStreams(builder, config);
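Putting the DSL and config snippets together, a complete 0.10.0-era application might look like the following. This is a sketch under assumptions, not the deck's own code: the class name, the builder creation, the default-serde configuration, and the explicit serdes on the output are filled in around what the slides show.

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KStreamBuilder;
import org.apache.kafka.streams.kstream.KTable;

public class MyStreamsApp {
    public static void main(String[] args) {
        Properties cfg = new Properties();
        cfg.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");
        cfg.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        // default serdes so builder.stream("topic1") can be read without per-call serdes
        cfg.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
        cfg.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());

        // the builder that the DSL snippets above assume already exists
        KStreamBuilder builder = new KStreamBuilder();
        KStream<String, String> words = builder.stream("topic1");
        KTable<String, Long> counts = words.countByKey("Counts");
        // the Long values need an explicit serde on the way out
        counts.to(Serdes.String(), Serdes.Long(), "topic2");

        KafkaStreams streams = new KafkaStreams(builder, new StreamsConfig(cfg));
        streams.start();
    }
}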

• API, coding
• "Full stack" evaluation
• Operations, debugging, …

Simple is Beautiful

Key Idea: Outsource hard problems to Kafka!

Kafka Concepts: the Log

[Figure: a topic is an append-only log of messages. Producers write to the tail (offsets 3, 4, 5, …); consumers read independently at their own positions, e.g. consumer 1 at offset 7 and consumer 2 at offset 10. Topics are split into partitions, which are spread across brokers between producers and consumers.]

Kafka Streams: Key Concepts

Streams and Records

[Figure: a stream is an unbounded, ordered sequence of key-value records.]

Processor Topology

[Figure: a processor topology is a graph of stream processors connected by streams.]

KStream<..> stream1 = builder.stream("topic1");
KStream<..> stream2 = builder.stream("topic2");
KStream<..> joined = stream1.leftJoin(stream2, ...);
KTable<..> aggregated = joined.aggregateByKey(...);
aggregated.to("topic3");

The processors created by builder.stream(...) are source processors (they read from topics); the one created by aggregated.to(...) is a sink processor (it writes back to a topic).

The same topology in the lower-level Processor API:

builder.addSource("Source1", "topic1")
       .addSource("Source2", "topic2")
       .addProcessor("Join", MyJoin::new, "Source1", "Source2")
       .addProcessor("Aggregate", MyAggregate::new, "Join")
       .addStateStore(Stores.persistent().build(), "Aggregate")
       .addSink("Sink", "topic3", "Aggregate");

[Figure: the topology runs inside the Kafka Streams application, reading from and writing back to Kafka.]
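The deck references MyJoin and MyAggregate without showing them. As a sketch of what such a custom processor could look like in the 0.10-era Processor API (the class body, the store name "Counts", and the per-key counting logic are assumptions for illustration, not the presenter's code):

import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;

public class MyAggregate implements Processor<String, String> {
    private ProcessorContext context;
    private KeyValueStore<String, Long> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        // the state store attached to this processor via addStateStore(...)
        this.store = (KeyValueStore<String, Long>) context.getStateStore("Counts");
    }

    @Override
    public void process(String key, String value) {
        // stateful step: count records per key and forward the running total downstream
        Long count = store.get(key);
        count = (count == null) ? 1L : count + 1;
        store.put(key, count);
        context.forward(key, count);
    }

    @Override
    public void punctuate(long timestamp) { /* no periodic work in this sketch */ }

    @Override
    public void close() { /* nothing to clean up */ }
}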

Processor Topology

sink1.to("topic1");
source1 = builder.table("topic1");
source2 = sink1.through("topic2");

Writing a stream out with to() and reading it back with table() or through() splits the topology into sub-topologies at those topic boundaries.

[Figures: the sub-topologies run inside Kafka Streams and exchange data through Kafka topics.]

Stream Partitions and Tasks

[Figures: the partitions of the input topics (Kafka topic A: P1, P2; Kafka topic B: P1, P2) are grouped and assigned to stream tasks, e.g. Task1 and Task2; each task runs its own copy of the processor topology over its group of partitions.]

Stream Threads

[Figures: tasks are distributed across the threads of one or more identical application instances (MyApp.1, MyApp.2, MyApp.3). Starting or stopping instances rebalances tasks among the survivors automatically, and a single thread may run several tasks, e.g. Thread1 with Task1 and Task2, Thread2 with Task3 and Task4.]
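In code, scaling out is a matter of configuration plus launching more instances. A sketch, continuing the cfg object from the configuration section (the thread count of 2 and the idea of starting identical JVMs elsewhere are assumptions about deployment, not something the deck prescribes):

// run two stream threads inside this instance; tasks are split among them
cfg.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 2);
// then start the same application on more machines (MyApp.2, MyApp.3, ...);
// Kafka's group protocol rebalances tasks across all running instances.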

Stream Processing Hard Parts

• Ordering
• Partitioning & Scalability
• Fault tolerance
• State Management
• Time, Window & Out-of-order Data
• Re-processing

States in Stream Processing

• Stateless: filter, map, …
• Stateful: join, aggregate, …

States in Stream Processing

KStream<..> stream1 = builder.stream("topic1");
KStream<..> stream2 = builder.stream("topic2");
KStream<..> joined = stream1.leftJoin(stream2, ...);
KTable<..> aggregated = joined.aggregateByKey(...);   // needs state
aggregated.to("topic3");

builder.addSource("Source1", "topic1")
       .addSource("Source2", "topic2")
       .addProcessor("Join", MyJoin::new, "Source1", "Source2")
       .addProcessor("Aggregate", MyAggregate::new, "Join")
       .addStateStore(Stores.persistent().build(), "Aggregate")   // needs state
       .addSink("Sink", "topic3", "Aggregate");

[Figure: each task keeps its own local state store alongside the partitions it processes.]

It's all about Time

• Event-time (when an event is created)
• Processing-time (when an event is processed)

Example, the Star Wars saga:

Event-time (episode):      1     2     3     4     5     6     7
Processing-time (release): 1999  2002  2005  1977  1980  1983  2015

[Figure: the films as a stream: The Phantom Menace, Attack of the Clones, Revenge of the Sith, A New Hope, The Empire Strikes Back, Return of the Jedi, The Force Awakens arrived out of episode order.]

Out-of-Order

Timestamp Extractor

public long extract(ConsumerRecord<Object, Object> record) {
    return System.currentTimeMillis();   // processing-time
}

public long extract(ConsumerRecord<Object, Object> record) {
    return record.timestamp();   // event-time
}

public long extract(ConsumerRecord<Object, Object> record) {
    return ((JsonNode) record.value()).get("timestamp").longValue();   // event-time, embedded in the payload
}
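To use one of these, you implement the TimestampExtractor interface and register it through the configuration. A sketch of the event-time variant (the class name is an assumption):

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

public class EventTimeExtractor implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord<Object, Object> record) {
        // event-time: the timestamp stamped into the record when it was produced
        return record.timestamp();
    }
}

// registered via configuration:
// cfg.put(StreamsConfig.TIMESTAMP_EXTRACTOR_CLASS_CONFIG, EventTimeExtractor.class);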

Windowing

[Figures: records are grouped into windows along the time axis t, and windows advance as time progresses.]
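In the DSL, windowing appears as a parameter on the aggregation. A sketch of a windowed count, with signatures following the 0.10.0-era API shown in the deck (the window name "Counts" and the one-minute size are assumptions; in later releases this surface became groupByKey().count(...)):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;

// inside the topology definition:
KStream<String, String> words = builder.stream("topic1");
// one key-count per key per one-minute window
KTable<Windowed<String>, Long> windowedCounts =
    words.countByKey(TimeWindows.of("Counts", 60 * 1000L), Serdes.String());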

Stream Processing Hard Parts

• Ordering
• Partitioning & Scalability
• Fault tolerance
• State Management
• Time, Window & Out-of-order Data
• Re-processing

Stream v.s. Table?

KStream<..> stream1 = builder.stream("topic1");
KStream<..> stream2 = builder.stream("topic2");
KStream<..> joined = stream1.leftJoin(stream2, ...);
KTable<..> aggregated = joined.aggregateByKey(...);   // why a table and not a stream?
aggregated.to("topic3");

Tables ≈ Streams


The Stream-Table Duality

• A stream is a changelog of a table
• A table is a materialized view, at a point in time, of a stream
• Example: change data capture (CDC) of databases

KStream = interprets data as record stream
~ think: "append-only"

KTable = interprets data as changelog stream
~ think: continuously updated materialized view
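The duality is visible directly in the API: the same kind of topic can be read either way. A small sketch (the topic names are assumptions for illustration):

// read a topic as an append-only record stream
KStream<String, String> purchases = builder.stream("user-purchases");

// read a topic as a changelog: the table keeps only the latest value per key
KTable<String, String> profiles = builder.table("user-profiles");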

[Figure: two example topics over time.]

KStream (user purchase history):   (alice, eggs)  (bob, lettuce)  (alice, milk)
KTable (user employment profile):  (alice, lnkd)  (bob, googl)   (alice, msft)

After the first alice record: "Alice bought eggs." and "Alice is now at LinkedIn."
After the second alice record: "Alice bought eggs and milk." but "Alice is now at Microsoft." The KStream appends; the KTable updates in place.

Given the records (alice, 2), (bob, 10), (alice, 3) in time order:

KStream.aggregate(): after (alice, 2) the result is (key: alice, value: 2); after (alice, 3) it is (key: alice, value: 2+3). New records add to the aggregate.

KTable.aggregate(): after (alice, 2) the result is (key: alice, value: 2); after (alice, 3) it is (key: alice, value: 3). New records replace the previous value for that key.

[Figure: the KStream/KTable operator surface: both support map(), filter(), join(), …; reduce() and aggregate() turn a KStream into a KTable, and toStream() turns a KTable back into a KStream.]

Updates Propagation in KTable

[Figures: stream1 and stream2 feed the join, whose output feeds the stateful aggregate; when a new record updates a KTable, the change propagates through the join and into the aggregate's state, updating the downstream result.]

Stream Processing Hard Parts

• Ordering
• Partitioning & Scalability
• Fault tolerance
• State Management
• Time, Window & Out-of-order Data
• Re-processing

Remember?

Fault Tolerance

[Figures: each task's local state is continuously backed up to a changelog topic in Kafka. The processing threads coordinate through Kafka's rebalance protocol; when an instance fails, its tasks are reassigned and their state is restored from the changelog on the surviving instances.]

Stream Processing Hard Parts

• Ordering
• Partitioning & Scalability
• Fault tolerance
• State Management
• Time, Window & Out-of-order Data
• Re-processing

Simple is Beautiful

Ongoing Work (0.10+)

• Beyond Java APIs
  • SQL support, Python client, etc.
• End-to-End Semantics (exactly-once)
• Queryable States
• … and more

Queryable States

Real-time Analytics

select Count(*), Sum(*)
from "MyAgg"
where windowId > now() - 10;

But how to get data in / out of Kafka?

Take-aways

• Stream Processing: a new programming paradigm
• Kafka Streams: stream processing made easy

THANKS!

Guozhang Wang | guozhang@confluent.io | @guozhangwang

Visit Confluent at the Syncsort Booth (#1303), live demos @ 29th

Download Kafka Streams: www.confluent.io/product

We are Hiring!