Introduction to Kafka Streams


Kafka Streams: Stream Processing Made Simple with Kafka


Guozhang Wang, Hadoop Summit, June 28, 2016

What is NOT Stream Processing?

Stream Processing isn't (necessarily)

• Transient, approximate, lossy…
• … something for which you must have batch processing as a safety net


Stream Processing

• A different programming paradigm
• … that brings computation to unbounded data
• … with trade-offs between latency, cost, and correctness

Why Kafka in Stream Processing?

• Persistent Buffering
• Logical Ordering
• Scalable "source-of-truth"

Kafka: Real-time Platforms

Stream Processing with Kafka

• Option I: Do It Yourself!

while (isRunning) {
    // read some messages from Kafka
    inputMessages = consumer.poll();

    // do some processing…

    // send output messages back to Kafka
    producer.send(outputMessages);
}
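For concreteness, here is a minimal, self-contained version of that loop using the plain Kafka consumer and producer clients. This is a sketch, not code from the deck: the topic names, broker address, and the trivial uppercase "processing" step are assumptions, and offset management and error handling are left out, which is exactly the kind of thing that makes the DIY route hard (see the list below).

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DiyStreamProcessor {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");   // assumed broker address
        props.put("group.id", "diy-processor");           // assumed group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            consumer.subscribe(Collections.singletonList("input-topic"));  // assumed topic
            while (true) {
                // read some messages from Kafka
                ConsumerRecords<String, String> inputMessages = consumer.poll(100);
                for (ConsumerRecord<String, String> msg : inputMessages) {
                    // do some processing… (here: trivially uppercase the value)
                    String processed = msg.value().toUpperCase();
                    // send output messages back to Kafka
                    producer.send(new ProducerRecord<>("output-topic", msg.key(), processed));
                }
            }
        }
    }
}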

DIY Stream Processing is Hard

• Ordering
• Partitioning & Scalability
• Fault tolerance
• State Management
• Time, Window & Out-of-order Data
• Re-processing

Stream Processing with Kafka

• Option I: Do It Yourself!
• Option II: full-fledged stream processing system
  • Storm, Spark, Flink, Samza, …

MapReduce Heritage?

• Config Management
• Resource Management
• Deployment
• etc.

Can I just use my own?!

Stream Processing with Kafka

• Option I: Do It Yourself!
• Option II: full-fledged stream processing system
• Option III: lightweight stream processing library

Kafka Streams

• In Apache Kafka since v0.10, May 2016
• Powerful yet easy-to-use stream processing library
• Event-at-a-time, stateful
• Windowing with out-of-order handling
• Highly scalable, distributed, fault tolerant
• and more…

Anywhere, anytime

<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-streams</artifactId>
    <version>0.10.0.0</version>
</dependency>

Anywhere, anytime

[Figure: deployment options on a spectrum from "very uncool" to "very cool": War File, Rsync, Puppet/Chef, YARN, Mesos, Docker, Kubernetes.]

Simple is Beautiful

Kafka Streams DSL

public static void main(String[] args) {
    // specify the processing topology by first reading in a stream from a topic
    KStream<String, String> words = builder.stream("topic1");

    // count the words in this stream as an aggregated table
    KTable<String, Long> counts = words.countByKey("Counts");

    // write the result table to a new topic
    counts.to("topic2");

    // create a stream processing instance and start running it
    KafkaStreams streams = new KafkaStreams(builder, config);
    streams.start();
}

Native Kafka Integration

Properties cfg = new Properties();
cfg.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");
cfg.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
cfg.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
cfg.put(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, "SASL_SSL");
cfg.put(KafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, "registry:8081");

StreamsConfig config = new StreamsConfig(cfg);
KafkaStreams streams = new KafkaStreams(builder, config);
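Putting the DSL and config snippets together, a complete 0.10.0-era application might look like the following. This is a sketch under assumptions, not the deck's own code: the class name, the builder creation, the default-serde configuration, and the explicit serdes on the output are filled in around what the slides show.

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KStreamBuilder;
import org.apache.kafka.streams.kstream.KTable;

public class MyStreamsApp {
    public static void main(String[] args) {
        Properties cfg = new Properties();
        cfg.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");
        cfg.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        // default serdes so builder.stream("topic1") can be read without per-call serdes
        cfg.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
        cfg.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());

        // the builder that the DSL snippets above assume already exists
        KStreamBuilder builder = new KStreamBuilder();
        KStream<String, String> words = builder.stream("topic1");
        KTable<String, Long> counts = words.countByKey("Counts");
        // the Long values need an explicit serde on the way out
        counts.to(Serdes.String(), Serdes.Long(), "topic2");

        KafkaStreams streams = new KafkaStreams(builder, new StreamsConfig(cfg));
        streams.start();
    }
}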

• API, coding
• "Full stack" evaluation
• Operations, debugging, …

Simple is Beautiful

Key Idea: Outsource hard problems to Kafka!

Kafka Concepts: the Log

[Figure: a topic is an append-only log of messages. Producers write to the tail (offsets 3, 4, 5, …); consumers read independently at their own positions, e.g. consumer 1 at offset 7 and consumer 2 at offset 10. Topics are split into partitions, which are spread across brokers between producers and consumers.]

Kafka Streams: Key Concepts

Streams and Records

[Figure: a stream is an unbounded, ordered sequence of key-value records.]

Processor Topology

[Figure: a processor topology is a graph of stream processors connected by streams.]

KStream<..> stream1 = builder.stream("topic1");
KStream<..> stream2 = builder.stream("topic2");
KStream<..> joined = stream1.leftJoin(stream2, ...);
KTable<..> aggregated = joined.aggregateByKey(...);
aggregated.to("topic3");

The processors created by builder.stream(...) are source processors (they read from topics); the one created by aggregated.to(...) is a sink processor (it writes back to a topic).

The same topology in the lower-level Processor API:

builder.addSource("Source1", "topic1")
       .addSource("Source2", "topic2")
       .addProcessor("Join", MyJoin::new, "Source1", "Source2")
       .addProcessor("Aggregate", MyAggregate::new, "Join")
       .addStateStore(Stores.persistent().build(), "Aggregate")
       .addSink("Sink", "topic3", "Aggregate");

[Figure: the topology runs inside the Kafka Streams application, reading from and writing back to Kafka.]
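The deck references MyJoin and MyAggregate without showing them. As a sketch of what such a custom processor could look like in the 0.10-era Processor API (the class body, the store name "Counts", and the per-key counting logic are assumptions for illustration, not the presenter's code):

import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;

public class MyAggregate implements Processor<String, String> {
    private ProcessorContext context;
    private KeyValueStore<String, Long> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        // the state store attached to this processor via addStateStore(...)
        this.store = (KeyValueStore<String, Long>) context.getStateStore("Counts");
    }

    @Override
    public void process(String key, String value) {
        // stateful step: count records per key and forward the running total downstream
        Long count = store.get(key);
        count = (count == null) ? 1L : count + 1;
        store.put(key, count);
        context.forward(key, count);
    }

    @Override
    public void punctuate(long timestamp) { /* no periodic work in this sketch */ }

    @Override
    public void close() { /* nothing to clean up */ }
}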

Processor Topology

sink1.to("topic1");
source1 = builder.table("topic1");
source2 = sink1.through("topic2");

Writing a stream out with to() and reading it back with table() or through() splits the topology into sub-topologies at those topic boundaries.

[Figures: the sub-topologies run inside Kafka Streams and exchange data through Kafka topics.]

Stream Partitions and Tasks

[Figures: the partitions of the input topics (Kafka topic A: P1, P2; Kafka topic B: P1, P2) are grouped and assigned to stream tasks, e.g. Task1 and Task2; each task runs its own copy of the processor topology over its group of partitions.]

Stream Threads

[Figures: tasks are distributed across the threads of one or more identical application instances (MyApp.1, MyApp.2, MyApp.3). Starting or stopping instances rebalances tasks among the survivors automatically, and a single thread may run several tasks, e.g. Thread1 with Task1 and Task2, Thread2 with Task3 and Task4.]
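In code, scaling out is a matter of configuration plus launching more instances. A sketch, continuing the cfg object from the configuration section (the thread count of 2 and the idea of starting identical JVMs elsewhere are assumptions about deployment, not something the deck prescribes):

// run two stream threads inside this instance; tasks are split among them
cfg.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 2);
// then start the same application on more machines (MyApp.2, MyApp.3, ...);
// Kafka's group protocol rebalances tasks across all running instances.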

Stream Processing Hard Parts

• Ordering
• Partitioning & Scalability
• Fault tolerance
• State Management
• Time, Window & Out-of-order Data
• Re-processing

States in Stream Processing

• Stateless: filter, map, …
• Stateful: join, aggregate, …

States in Stream Processing

KStream<..> stream1 = builder.stream("topic1");
KStream<..> stream2 = builder.stream("topic2");
KStream<..> joined = stream1.leftJoin(stream2, ...);
KTable<..> aggregated = joined.aggregateByKey(...);   // needs state
aggregated.to("topic3");

builder.addSource("Source1", "topic1")
       .addSource("Source2", "topic2")
       .addProcessor("Join", MyJoin::new, "Source1", "Source2")
       .addProcessor("Aggregate", MyAggregate::new, "Join")
       .addStateStore(Stores.persistent().build(), "Aggregate")   // needs state
       .addSink("Sink", "topic3", "Aggregate");

[Figure: each task keeps its own local state store alongside the partitions it processes.]

It's all about Time

• Event-time (when an event is created)
• Processing-time (when an event is processed)

Example, the Star Wars saga:

Event-time (episode):      1     2     3     4     5     6     7
Processing-time (release): 1999  2002  2005  1977  1980  1983  2015

[Figure: the films as a stream: The Phantom Menace, Attack of the Clones, Revenge of the Sith, A New Hope, The Empire Strikes Back, Return of the Jedi, The Force Awakens arrived out of episode order.]

Out-of-Order

Timestamp Extractor

public long extract(ConsumerRecord<Object, Object> record) {
    return System.currentTimeMillis();   // processing-time
}

public long extract(ConsumerRecord<Object, Object> record) {
    return record.timestamp();   // event-time
}

public long extract(ConsumerRecord<Object, Object> record) {
    return ((JsonNode) record.value()).get("timestamp").longValue();   // event-time, embedded in the payload
}
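To use one of these, you implement the TimestampExtractor interface and register it through the configuration. A sketch of the event-time variant (the class name is an assumption):

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

public class EventTimeExtractor implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord<Object, Object> record) {
        // event-time: the timestamp stamped into the record when it was produced
        return record.timestamp();
    }
}

// registered via configuration:
// cfg.put(StreamsConfig.TIMESTAMP_EXTRACTOR_CLASS_CONFIG, EventTimeExtractor.class);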

Windowing

[Figures: records are grouped into windows along the time axis t, and windows advance as time progresses.]
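In the DSL, windowing appears as a parameter on the aggregation. A sketch of a windowed count, with signatures following the 0.10.0-era API shown in the deck (the window name "Counts" and the one-minute size are assumptions; in later releases this surface became groupByKey().count(...)):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;

// inside the topology definition:
KStream<String, String> words = builder.stream("topic1");
// one key-count per key per one-minute window
KTable<Windowed<String>, Long> windowedCounts =
    words.countByKey(TimeWindows.of("Counts", 60 * 1000L), Serdes.String());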

Stream Processing Hard Parts

• Ordering
• Partitioning & Scalability
• Fault tolerance
• State Management
• Time, Window & Out-of-order Data
• Re-processing

Stream v.s. Table?

KStream<..> stream1 = builder.stream("topic1");
KStream<..> stream2 = builder.stream("topic2");
KStream<..> joined = stream1.leftJoin(stream2, ...);
KTable<..> aggregated = joined.aggregateByKey(...);   // why a table and not a stream?
aggregated.to("topic3");

Tables ≈ Streams


The Stream-Table Duality

• A stream is a changelog of a table
• A table is a materialized view, at a point in time, of a stream
• Example: change data capture (CDC) of databases

KStream = interprets data as record stream
~ think: "append-only"

KTable = interprets data as changelog stream
~ think: continuously updated materialized view
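The duality is visible directly in the API: the same kind of topic can be read either way. A small sketch (the topic names are assumptions for illustration):

// read a topic as an append-only record stream
KStream<String, String> purchases = builder.stream("user-purchases");

// read a topic as a changelog: the table keeps only the latest value per key
KTable<String, String> profiles = builder.table("user-profiles");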

[Figure: two example topics over time.]

KStream (user purchase history):   (alice, eggs)  (bob, lettuce)  (alice, milk)
KTable (user employment profile):  (alice, lnkd)  (bob, googl)   (alice, msft)

After the first alice record: "Alice bought eggs." and "Alice is now at LinkedIn."
After the second alice record: "Alice bought eggs and milk." but "Alice is now at Microsoft." The KStream appends; the KTable updates in place.

Given the records (alice, 2), (bob, 10), (alice, 3) in time order:

KStream.aggregate(): after (alice, 2) the result is (key: alice, value: 2); after (alice, 3) it is (key: alice, value: 2+3). New records add to the aggregate.

KTable.aggregate(): after (alice, 2) the result is (key: alice, value: 2); after (alice, 3) it is (key: alice, value: 3). New records replace the previous value for that key.

[Figure: the KStream/KTable operator surface: both support map(), filter(), join(), …; reduce() and aggregate() turn a KStream into a KTable, and toStream() turns a KTable back into a KStream.]

Updates Propagation in KTable

[Figures: stream1 and stream2 feed the join, whose output feeds the stateful aggregate; when a new record updates a KTable, the change propagates through the join and into the aggregate's state, updating the downstream result.]

Stream Processing Hard Parts

• Ordering
• Partitioning & Scalability
• Fault tolerance
• State Management
• Time, Window & Out-of-order Data
• Re-processing

Remember?

Fault Tolerance

[Figures: each task's local state is continuously backed up to a changelog topic in Kafka. The processing threads coordinate through Kafka's rebalance protocol; when an instance fails, its tasks are reassigned and their state is restored from the changelog on the surviving instances.]

Stream Processing Hard Parts

• Ordering
• Partitioning & Scalability
• Fault tolerance
• State Management
• Time, Window & Out-of-order Data
• Re-processing

Simple is Beautiful

Ongoing Work (0.10+)

• Beyond Java APIs
  • SQL support, Python client, etc.
• End-to-End Semantics (exactly-once)
• Queryable States
• … and more

Queryable States

Real-time Analytics

select Count(*), Sum(*)
from "MyAgg"
where windowId > now() - 10;

But how to get data in / out of Kafka?

Take-aways

• Stream Processing: a new programming paradigm
• Kafka Streams: stream processing made easy

THANKS!

Guozhang Wang | guozhang@confluent.io | @guozhangwang

Visit Confluent at the Syncsort Booth (#1303), live demos @ 29th

Download Kafka Streams: www.confluent.io/product

We are Hiring!