Event Stream Processing with Kafka and Samza

Zach Cox - @zcox - [email protected]
Iowa Code Camp - 1 Nov 2014




Why?

Businesses generate and process events
Unified event log promotes data integration
Process event streams to take actions quickly


Event

Something happened
Record that fact so we can process it

Event

Describes what happened
  Who did it?
  What did they do?
  What was the result?
Provides context
  When did it happen?
  Where did it happen?
  How did they do it?
  Why did they do it?
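
The questions above map naturally onto a record type. The following is a minimal sketch, not something shown in the talk: the `Event` case class and all of its field names are assumptions chosen for illustration, and the sample values are taken from the Pageview example in this deck.

```scala
import java.time.OffsetDateTime

// Hypothetical event model: field names are illustrative, not from the talk.
final case class Event(
  who: String,                  // who did it
  what: String,                 // what they did
  result: Option[String],       // what the result was, if any
  when: OffsetDateTime,         // when it happened
  context: Map[String, String]  // where, how, why
)

object EventSketch {
  // A pageview event using the ID and time from the Pageview example slide.
  val pageview: Event = Event(
    who = "a2be9031-9465-4ecb-9302-9b962fa854ac",
    what = "pageview",
    result = None,
    when = OffsetDateTime.parse("2014-10-14T10:49:24.438-05:00"),
    context = Map("browser" -> "Chrome/38.0.2125.101")
  )
}
```

Keeping the context in an open-ended map is one possible design; a production schema would more likely use typed fields per event type.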

Event Example: Pageview

User viewed web page

User
  ID: a2be9031-9465-4ecb-9302-9b962fa854ac
  IP:
  User agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.101 Safari/537.36
Web Page
  URL:
Context
  Time: 2014-10-14T10:49:24.438-05:00


Event Example: Clickthrough

User clicked link

User
  ID: a2be9031-9465-4ecb-9302-9b962fa854ac
  IP:
  User agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.101 Safari/537.36
Link
  URL:
  Referer:
Context
  Time: 2014-10-14T10:49:24.438-05:00


Event Example: User Update

User changed first name

User
  ID: 161fa4bf-6ae9-4f4e-b72e-01c40e7783e5
  First name: Zach
Context
  Time: 2014-10-14T10:59:56.481-05:00
  IP:

Event Example: User Update

User uploaded a new profile image

User
  ID: 161fa4bf-6ae9-4f4e-b72e-01c40e7783e5
Profile Image
  URL:
Context
  Time: 2014-10-14T10:59:56.481-05:00
  IP: webcam


Event Example: Tweet

User posted a tweet

User
  ID:
  Username: @zcox
  Name: Zach Cox
  Bio: Developer @BannoHQ | @iascala organizer | co-founded @Pongr
Tweet
  ID: 527152511568719872
  URL:
  Text: Going to talk about processing event streams using @apachekafka and @samzastream this Saturday @iowacodecamp
  Mentions: @apachekafka, @samzastream, @iowacodecamp
  URLs:
Context
  Time: 2014-10-14T10:59:56.481-05:00
  Using: Twitter for Android
  Location: 41.7146365,-93.5914038




Event Example: HTTP Request Latency

Some measured code took some time to execute

Code
  production.my-app.some-server.http.get-user-profile
Time to execute
  Min: 20 msec
  Max: 950 msec
  Average: 190 msec
  Median: 110 msec
  50%: 100 msec
  75%: 120 msec
  95%: 150 msec
  99%: 500 msec
Context
  Time: 2014-10-14T11:17:01.597-05:00

Event Example: Runtime Exception

Some code threw a runtime exception

Some code
  Stack trace: [...]
Exception
  Message: HBase read timed out
Context
  Time: 2014-10-14T11:21:23.749-05:00
  Application: my-app
  Machine: some-server.my-company.com

Event Example: Application Logging

Some code logged some information

    [INFO] [2014-10-14 11:25:44,750] [sentry-akka.actor.default-dispatcher-2] a.e.s.Slf4jEventHandler: Slf4jEventHandler started

Message: Slf4jEventHandler started
Level: INFO
Time: 2014-10-14 11:25:44,750
Thread: sentry-akka.actor.default-dispatcher-2
Logger: akka.event.slf4j.Slf4jEventHandler

Why?

Businesses generate and process events
Unified event log promotes data integration
Process event streams to take actions quickly

Unified Log

Events need to be sent somewhere
Events should be accessible to any program
Log provides a place for events to be sent and accessed
Kafka is a great log service

Data Integration

Log

Sequence of records
Append-only
Ordered by time
Each record assigned unique sequential number
Records stored persistently on disk
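
The properties listed above can be sketched in a few lines of plain Scala. This is a toy, in-memory illustration only: `InMemoryLog` and its method names are made up for this example, and unlike a real log service it does not persist anything to disk.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical in-memory stand-in for a log; a real log stores records on disk.
final class InMemoryLog[A] {
  private val records = ArrayBuffer.empty[A]

  // Append-only: the new record gets the next sequential number (its offset).
  def append(record: A): Long = {
    records += record
    records.size - 1L
  }

  // Read every record at or after the given offset, in append order.
  def readFrom(offset: Long): Vector[A] = {
    require(offset >= 0, "offset must be non-negative")
    records.drop(offset.toInt).toVector
  }
}
```

There is deliberately no update or delete operation: existing records never change, which is what lets readers rely on the sequential numbers.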

Log Service

Logs in Distributed Databases

Traditional Cache

Cache misses
Cache invalidation

Infrastructure as Distributed Database

Cache is now replicated from DB
Cache can be in-process with web app

Log for Event Streams

Simple to send events to
Broadcasts events to all consumers
Buffers events on disk: producers and consumers decoupled
Consumers can start reading at any offset
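
A rough sketch of the last two points, assuming a log that is just an in-memory `Vector`: each consumer tracks its own offset, so consumers read independently of each other and of the producer. `OffsetConsumer` and `BroadcastSketch` are hypothetical names and are not part of Kafka's client API.

```scala
// Hypothetical consumer that remembers its own position in a shared log.
final class OffsetConsumer[A](log: () => Vector[A], start: Int) {
  private var offset = start

  // Return everything appended since the last poll and advance the offset.
  def poll(): Vector[A] = {
    val batch = log().drop(offset)
    offset += batch.size
    batch
  }
}

object BroadcastSketch {
  def run(): (Vector[String], Vector[String], Vector[String]) = {
    var log = Vector("e0", "e1", "e2")
    val fromStart = new OffsetConsumer[String](() => log, start = 0)
    val fromLatest = new OffsetConsumer[String](() => log, start = 2)

    val first = fromStart.poll()   // sees every event in the log
    val second = fromLatest.poll() // started at offset 2, sees only e2
    log = log :+ "e3"              // the producer appends without knowing who reads
    val third = fromLatest.poll()  // picks up only the new event
    (first, second, third)
  }
}
```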

Kafka

Apache OSS, mainly from LinkedIn
Handles all the logs/event streams
High-throughput: millions of events/sec
High-volume: TBs - PBs of events
Low-latency: single-digit msec from producer to consumer
Scalable: topics are partitioned across cluster
Durable: topics are replicated across cluster
Available: auto failover

Twitter Example

Twitter Streaming API
Receive messages via long-lived HTTP connection as JSON
Write messages to a Kafka topic

Twitter Example

Twitter rate-limits clients
  <1% sample, ~50-100 tweets/sec
  400 keywords, ? tweets/sec
1 weird trick to get more tweets: multiple clients, same Kafka topic!

Why?

Businesses generate and process events
Unified event log promotes data integration
Process event streams to take actions quickly

Event Stream Processing

Turn events into valuable, actionable information
Process events as they happen, not later (batch)
Do all of this reliably, at scale

Event Stream Processor

Event Stream Processor: Input

Event Stream Processor: Output

Samza

Event stream processing framework
Apache OSS, mainly from LinkedIn
Simple Java API
Scalable: runs jobs in parallel across cluster
Reliable: fault-tolerance and durability built-in
Tools for stateful stream processing

Samza Job

1) Class that extends StreamTask:

    import org.apache.samza.system.IncomingMessageEnvelope
    import org.apache.samza.task.{MessageCollector, StreamTask, TaskCoordinator}

    class MyTask extends StreamTask {
      override def process(
          envelope: IncomingMessageEnvelope,
          collector: MessageCollector,
          coordinator: TaskCoordinator): Unit = {
        // process message in envelope
      }
    }

2) my-task.properties config file:

    job.factory.class=org.apache.samza.job.local.ThreadJobFactory
    job.name=my-task


Stateless Processing

One event at a time
Take action using only that event

    SELECT * FROM raw_messages WHERE message_type = 'status';

Samza Job: Separate Message Types

Many message types from Twitter
Samza job to separate into type-specific streams
Other jobs process specific message types
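
The routing decision in such a job could look roughly like the sketch below, written as plain Scala without the Samza API. The `RawMessage` type, the message type strings other than `status`, and the output stream names are all assumptions made for illustration.

```scala
// Hypothetical, simplified message; the real Twitter stream carries JSON.
final case class RawMessage(messageType: String, body: String)

object SeparateTypesSketch {
  // Stateless: the output stream depends only on the message being processed.
  def outputStream(msg: RawMessage): String = msg.messageType match {
    case "status" => "statuses"       // assumed stream name
    case "delete" => "deletes"        // assumed stream name
    case _        => "other-messages" // assumed stream name
  }

  // Apply the routing to a batch, keeping arrival order within each stream.
  def route(messages: Seq[RawMessage]): Map[String, Seq[RawMessage]] =
    messages.groupBy(outputStream)
}
```

In an actual Samza job, `outputStream` would presumably be called from `process`, with the message sent on to the chosen stream through the collector.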

Stateful Stream Processing

One event at a time
Take action using that event and state
State = data built up from past events
  Aggregation
  Grouping
  Joins

Aggregation

State = aggregated values (e.g. count)
Incorporate each new event into that aggregation
Output aggregated values as events to new stream
What happens if job stops?
  Crash, deploy, ...
  Can't lose state!
  Samza handles this all for you

    SELECT COUNT(*) FROM statuses;

Samza Job: Total Status Count

Increment a counter on every status (tweet)
Periodically output current count
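
The two steps can be sketched in plain Scala as follows. `TotalStatusCounter` is a hypothetical class that does not use the Samza API; in a real job the two methods would roughly correspond to `StreamTask.process` and `WindowableTask.window`. The count here lives only in memory, so this sketch would lose it on a crash, which is exactly the problem Samza's state support is meant to address.

```scala
// Hypothetical counter; state is in memory only and is lost if the process stops.
final class TotalStatusCounter {
  private var count = 0L

  // Called once per status: fold the new event into the aggregate.
  def process(status: String): Unit = count += 1

  // Called periodically: report the current total for output to a new stream.
  def window(): Long = count
}
```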

Grouping

State = some data per group
Two Samza jobs:
  Output statuses by user (map)
  Count statuses per user (reduce)
Output: (user, count)
Could use as input to job that sorts by count (most active users)

    SELECT user_id, COUNT(user_id) FROM statuses GROUP BY user_id;

    SELECT user_id, COUNT(user_id) FROM statuses GROUP BY user_id ORDER BY COUNT(user_id) DESC LIMIT 5;
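
As a rough in-memory equivalent of the two queries, the sketch below folds statuses into per-user counts and then takes the most active users. The `Status` type and the function names are assumptions for illustration; a real job would keep the counts in Samza-managed state rather than in an immutable map passed around by hand.

```scala
// Hypothetical, simplified status (tweet) with only the fields needed here.
final case class Status(userId: String, text: String)

object GroupingSketch {
  // The "reduce" step: incorporate each status into its user's count.
  def countByUser(statuses: Seq[Status]): Map[String, Long] =
    statuses.foldLeft(Map.empty[String, Long]) { (counts, s) =>
      counts.updated(s.userId, counts.getOrElse(s.userId, 0L) + 1L)
    }

  // Most active users: highest count first, ties broken by user id.
  def topUsers(counts: Map[String, Long], n: Int): Seq[(String, Long)] =
    counts.toSeq.sortBy { case (user, count) => (-count, user) }.take(n)
}
```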

Joins

Samza job has multiple input streams
Stream-Stream join: ad impressions + ad clicks
Stream-Table join: page views + user zip code
Table-Table join: user data + user settings
Joins involving tables need DB changelog

    SELECT u.username, s.text FROM statuses s JOIN users u ON u.id = s.user_id;
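
A stream-table join matching the query above might be sketched like this: the job consumes a changelog of user updates to maintain a local copy of the users table, and looks up each status against it. All type and method names are hypothetical, and dropping statuses whose user is not yet known is just one possible policy.

```scala
// Hypothetical input records for the two streams.
final case class UserChange(id: String, username: String)
final case class StatusEvent(userId: String, text: String)

final class StatusUserJoin {
  // Local copy of the users table, built up from the DB changelog stream.
  private var users = Map.empty[String, String]

  // Changelog input: the latest change for a user id wins.
  def onUserChange(change: UserChange): Unit =
    users = users.updated(change.id, change.username)

  // Status input: emit (username, text) if the user is known, otherwise nothing.
  def onStatus(status: StatusEvent): Option[(String, String)] =
    users.get(status.userId).map(username => (username, status.text))
}
```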

What else can we compute?

Tweets per sec/min/hour (recent, not for-all-time)
Enrich tweets with weather at current location
Most active users, locations, etc
Emojis: % of tweets that contain, top emojis
Hashtags: % of tweets that contain, top #hashtags
URLs: % of tweets that contain, top domains
Photo URLs: % of tweets that contain, top domains
Text analysis: sentiment, spam
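
As one example from this list, tweets per minute can be approximated by bucketing each tweet's timestamp into the minute it falls in. This is a minimal batch-style sketch with a made-up function name; a streaming job would update the bucket counts incrementally and discard old buckets to keep only recent ones.

```scala
object TweetsPerMinuteSketch {
  private val MillisPerMinute = 60000L

  // Map from minute number (epoch millis / 60000) to tweets seen in that minute.
  def perMinute(timestampsMs: Seq[Long]): Map[Long, Int] =
    timestampsMs
      .groupBy(_ / MillisPerMinute)
      .map { case (minute, ts) => minute -> ts.size }
}
```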

Druid

Send it events
  Druid reads from Kafka topic
  That Kafka topic is a Samza output stream
Super fast time-series queries: aggregations, filters, top-n, etc


Why?

Businesses generate and process events
Unified event log promotes data integration
Process event streams to take actions quickly

Let's chat!

Zach Cox
@[email protected] is hiring!