Stream Processing Made Simple with Kafka
Post on 16-Apr-2017
TRANSCRIPT
Kafka Streams: Stream Processing Made Simple with Kafka
1
Guozhang Wang Hadoop Summit, June 28, 2016
2
What is NOT Stream Processing?
3
Stream Processing isn’t (necessarily)
• Transient, approximate, lossy…
• .. that you must have batch processing as safety net
4
5
6
7
8
Stream Processing
• A different programming paradigm
• .. that brings computation to unbounded data
• .. with tradeoffs between latency / cost / correctness
9
Why Kafka in Stream Processing?
10
• Persistent Buffering
• Logical Ordering
• Highly Scalable “source-of-truth”
Kafka: Real-time Platforms
11
Stream Processing with Kafka
12
• Option I: Do It Yourself !
Stream Processing with Kafka
13
• Option I: Do It Yourself !
Stream Processing with Kafka
while (isRunning) {
    // read some messages from Kafka
    inputMessages = consumer.poll();

    // do some processing…

    // send output messages back to Kafka
    producer.send(outputMessages);
}
14
• Ordering
• Partitioning &
Scalability
• Fault tolerance
DIY Stream Processing is Hard
• State Management
• Time, Window &
Out-of-order Data
• Re-processing
15
• Option I: Do It Yourself !
• Option II: full-fledged stream processing system
• Storm, Spark, Flink, Samza, ..
Stream Processing with Kafka
16
MapReduce Heritage?
• Config Management
• Resource Management
• Deployment
• etc..
Can I just use my own?!
19
• Option I: Do It Yourself !
• Option II: full-fledged stream processing system
• Option III: lightweight stream processing library
Stream Processing with Kafka
Kafka Streams
• In Apache Kafka since v0.10, May 2016
• Powerful yet easy-to-use stream processing library
• Event-at-a-time, Stateful
• Windowing with out-of-order handling
• Highly scalable, distributed, fault tolerant
• and more..
20
21
Anywhere, anytime
Ok. Ok. Ok. Ok.
22
Anywhere, anytime
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-streams</artifactId>
    <version>0.10.0.0</version>
</dependency>
23
Anywhere, anytime
(Diagram: deployment options — WAR file, rsync, Puppet/Chef, YARN, Mesos, Docker, Kubernetes — laid out on a spectrum from "Very Uncool" to "Very Cool".)
24
Simple is Beautiful
Kafka Streams DSL
25
public static void main(String[] args) {
    // specify the processing topology by first reading in a stream from a topic
    KStream<String, String> words = builder.stream("topic1");

    // count the words in this stream as an aggregated table
    KTable<String, Long> counts = words.countByKey("Counts");

    // write the result table to a new topic
    counts.to("topic2");

    // create a stream processing instance and start running it
    KafkaStreams streams = new KafkaStreams(builder, config);
    streams.start();
}
31
Native Kafka Integration

Properties cfg = new Properties();
cfg.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");
cfg.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
cfg.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
cfg.put(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, "SASL_SSL");
cfg.put(KafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, "registry:8081");
StreamsConfig config = new StreamsConfig(cfg);
…
KafkaStreams streams = new KafkaStreams(builder, config);
33
API, coding
“Full stack” evaluation
Operations, debugging, …
34
Simple is Beautiful
35
Key Idea:
Outsource hard problems to Kafka!
Kafka Concepts: the Log
(Diagram: a partition is an append-only log of messages with increasing offsets; producers write to the end while Consumer1 reads at offset 7 and Consumer2 reads at offset 10. Topics are divided into partitions spread across brokers, serving many producers and consumers.)
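The log abstraction above can be sketched in a few lines of plain Java. This is a toy model (the class name `ToyLog` is made up, not a Kafka API) showing why independent consumer positions fall out naturally from an append-only log:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of one Kafka partition: an append-only log where each record
// gets a monotonically increasing offset, and every consumer tracks its
// own read position independently.
class ToyLog {
    private final List<String> records = new ArrayList<>();

    // Producers only ever append to the end; the offset is the index.
    long append(String record) {
        records.add(record);
        return records.size() - 1;
    }

    // Reading does not remove anything, so two consumers can sit at
    // different offsets over the same data.
    String read(long offset) {
        return records.get((int) offset);
    }

    long endOffset() {
        return records.size();
    }
}
```

Because reads are just positional lookups, replaying ("logical ordering") and buffering ("persistent buffering") come for free from the same structure.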
38
Kafka Streams: Key Concepts
Stream and Records
39
(Diagram: a stream is an ordered sequence of records, each a key-value pair.)
Processor Topology
40
(Diagram: a stream flowing through a topology of stream processor nodes.)
41
Processor Topology
42
KStream<..> stream1 = builder.stream("topic1");
KStream<..> stream2 = builder.stream("topic2");
KStream<..> joined = stream1.leftJoin(stream2, ...);
KTable<..> aggregated = joined.aggregateByKey(...);
aggregated.to("topic3");

Source Processors: builder.stream("topic1"), builder.stream("topic2")
Sink Processor: aggregated.to("topic3")
Data Parallelism
48
(Diagram: Kafka Streams instances MyApp.1 and MyApp.2 run Task1 and Task2 in parallel, consuming partitions of Kafka Topic A and producing to Kafka Topic B.)
49
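A rough sketch of how this parallelism falls out of partitioning: a keyed record is routed to a partition by hashing its key, and each partition is owned by exactly one task. The helper names below (`ToyPartitioner`, `partitionFor`, `taskFor`) are hypothetical; the real partitioner and task assignor are more involved.

```java
// Toy sketch: records spread over partitions by key hash, partitions
// spread over tasks, so records with the same key always land on the
// same task (which is what makes per-key state possible).
class ToyPartitioner {
    static int partitionFor(String key, int numPartitions) {
        // Math.floorMod keeps the result non-negative even when
        // hashCode() is negative.
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    static int taskFor(int partition, int numTasks) {
        // Simple round-robin partition -> task assignment.
        return partition % numTasks;
    }
}
```

Scaling out is then just moving partition ownership between instances; no record for a given key ever needs to be seen by two tasks at once.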
50
• Ordering
• Partitioning &
Scalability
• Fault tolerance
Stream Processing Hard Parts
• State Management
• Time, Window &
Out-of-order Data
• Re-processing
States in Stream Processing
51
• Stateless: filter, map
• Stateful: join, aggregate
52
States in Stream Processing
53
KStream<..> stream1 = builder.stream("topic1");
KStream<..> stream2 = builder.stream("topic2");
KStream<..> joined = stream1.leftJoin(stream2, ...);
KTable<..> aggregated = joined.aggregateByKey(...);
aggregated.to("topic3");

(Diagram: Task1 and Task2 consume partitions of Kafka Topic A, each maintaining its own local State for the join and aggregation, and write results to Kafka Topic B.)
It’s all about Time
• Event-time (when an event is created)
• Processing-time (when an event is processed)
55
Event-time:      1    2    3    4    5    6    7
Processing-time: 1999 2002 2005 1977 1980 1983 2015
56
(Movie posters, in event-time order: The Phantom Menace, Attack of the Clones, Revenge of the Sith, A New Hope, The Empire Strikes Back, Return of the Jedi, The Force Awakens.)
Out-of-Order
Timestamp Extractor
57

// processing-time
public long extract(ConsumerRecord<Object, Object> record) {
    return System.currentTimeMillis();
}

// event-time
public long extract(ConsumerRecord<Object, Object> record) {
    return record.timestamp();
}
Windowing
60
(Diagrams: fixed-size windows of records advancing along the time axis t…)
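For tumbling windows the assignment step reduces to simple arithmetic: each record belongs to exactly one window, identified by that window's start time. A minimal sketch (the class `ToyWindows` is illustrative, not the Streams `TimeWindows` API; it assumes non-negative timestamps):

```java
// Toy tumbling-window assignment: the window containing a timestamp
// starts at the largest multiple of the window size not exceeding it.
class ToyWindows {
    static long windowStart(long timestampMs, long windowSizeMs) {
        return timestampMs - (timestampMs % windowSizeMs);
    }
}
```

This is also why out-of-order handling works: a late record is placed by its event-time timestamp, so it still lands in the window it logically belongs to, and that window's aggregate is simply updated again.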
67
• Ordering
• Partitioning &
Scalability
• Fault tolerance
Stream Processing Hard Parts
• State Management
• Time, Window &
Out-of-order Data
• Re-processing
Stream vs. Table?
68
KStream<..> stream1 = builder.stream("topic1");
KStream<..> stream2 = builder.stream("topic2");
KStream<..> joined = stream1.leftJoin(stream2, ...);
KTable<..> aggregated = joined.aggregateByKey(...);
aggregated.to("topic3");

(Diagram: the aggregation step is backed by a local State store.)
69
Tables ≈ Streams
70
71
72
The Stream-Table Duality
• A stream is a changelog of a table
• A table is a materialized view, at a point in time, of a stream
• Example: change data capture (CDC) of databases
73
KStream = interprets data as a record stream
~ think: “append-only”
KTable = interprets data as a changelog stream
~ a continuously updated materialized view
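The duality can be made concrete with a toy materialization loop (plain Java, not the KTable API): replaying the changelog stream, record by record, rebuilds the table, and every table update is itself the next changelog record.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy materialized view: a table is what you get by applying a
// changelog stream of (key, value) updates in order.
class ToyTable {
    private final Map<String, String> view = new LinkedHashMap<>();

    // Applying one changelog record updates the view; emitting this
    // same (key, value) pair downstream is the table's changelog.
    void apply(String key, String value) {
        view.put(key, value);
    }

    String get(String key) {
        return view.get(key);
    }
}
```

This is exactly the change data capture (CDC) picture from the slide: a database table and its update log carry the same information.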
74
75
KStream (user purchase history):  (alice, eggs) (bob, lettuce) (alice, milk)
KTable (user employment profile): (alice, lnkd) (bob, googl) (alice, msft)

After the first record:
“Alice bought eggs.” (KStream)
“Alice is now at LinkedIn.” (KTable)

After all three records:
“Alice bought eggs and milk.” (KStream)
“Alice is now at Microsoft.” (KTable)
78
Records over time: (alice, 2) (bob, 10) (alice, 3)

After (alice, 2):
KStream.aggregate() → (key: alice, value: 2)
KTable.aggregate() → (key: alice, value: 2)

After (alice, 3):
KStream.aggregate() → (key: alice, value: 2 + 3)
KTable.aggregate() → (key: alice, value: 3, replacing the earlier 2)
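The two aggregation semantics above can be contrasted in a few lines of toy Java (not the Streams API; records are given as parallel key/value arrays for brevity): a record stream accumulates every record, while a changelog stream lets a newer value replace the older one for the same key.

```java
import java.util.HashMap;
import java.util.Map;

// Toy contrast of stream-style vs. table-style aggregation.
class ToyAggregate {
    // KStream-style: every record is an independent event, so values
    // for the same key add up (alice: 2 + 3 = 5).
    static Map<String, Integer> streamSum(String[] keys, int[] values) {
        Map<String, Integer> agg = new HashMap<>();
        for (int i = 0; i < keys.length; i++) {
            agg.merge(keys[i], values[i], Integer::sum);
        }
        return agg;
    }

    // KTable-style: a later record for a key is an update that
    // replaces the earlier value (alice: 2 is overwritten by 3).
    static Map<String, Integer> tableLatest(String[] keys, int[] values) {
        Map<String, Integer> agg = new HashMap<>();
        for (int i = 0; i < keys.length; i++) {
            agg.put(keys[i], values[i]);
        }
        return agg;
    }
}
```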
80
KStream → KStream: map(), filter(), join(), …
KTable → KTable: map(), filter(), join(), …
KStream → KTable: reduce(), aggregate(), …
KTable → KStream: toStream()
81
Updates Propagation in KTable
(Diagram: KStream stream1 and KStream stream2 are joined into KStream joined, which is aggregated into KTable aggregated, backed by a local State store; each new record propagates an update down the topology.)
85
• Ordering
• Partitioning &
Scalability
• Fault tolerance
Stream Processing Hard Parts
• State Management
• Time, Window &
Out-of-order Data
• Re-processing
86
Remember?
87
Fault Tolerance
(Diagrams: tasks, each a Process with local State, run on Kafka Streams instances consuming from Kafka; every state change is backed up to a Kafka changelog topic. When an instance fails, the group protocol reassigns its tasks to another instance, which restores the State by replaying the changelog.)
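The changelog-based recovery can be sketched as a toy (class `ToyStatefulTask` is invented for illustration; here the "changelog topic" is just a shared list of key/value entries): every local state write is also appended to the changelog, so a fresh task rebuilds identical state by replaying it.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch of changelog-backed state: writes go to local state AND
// to a durable changelog; recovery replays the changelog from scratch.
class ToyStatefulTask {
    final Map<String, Integer> state = new HashMap<>();
    final List<String[]> changelog;  // shared "topic": [key, value] entries

    ToyStatefulTask(List<String[]> changelog) {
        this.changelog = changelog;
    }

    void put(String key, int value) {
        state.put(key, value);
        changelog.add(new String[] {key, Integer.toString(value)});
    }

    // A restarted task restores its state by replaying the changelog.
    static ToyStatefulTask restore(List<String[]> changelog) {
        ToyStatefulTask task = new ToyStatefulTask(changelog);
        for (String[] entry : changelog) {
            task.state.put(entry[0], Integer.parseInt(entry[1]));
        }
        return task;
    }
}
```

Since the changelog lives in Kafka rather than on the failed machine, this is again "outsourcing the hard problem to Kafka".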
90
91
92
93
94
• Ordering
• Partitioning &
Scalability
• Fault tolerance
Stream Processing Hard Parts
• State Management
• Time, Window &
Out-of-order Data
• Re-processing
95
Simple is Beautiful
96
But how to get data in / out of Kafka?
97
98
99
100
Take-aways
• Stream Processing: a new programming paradigm
• Kafka Streams: stream processing made easy
103
THANKS!
Guozhang Wang | guozhang@confluent.io | @guozhangwang
Visit Confluent at the Syncsort Booth (#1303), live demos @ 29th
Download Kafka Streams: www.confluent.io/product
104
We are Hiring!