samza la hug

Apache Samza* Reliable Stream Processing atop Apache Kafka and Yarn Sriram Subramanian Me on Linkedin Me on twitter - @sriramsub1 * Incubating

Upload: sriramsub

Post on 27-Jan-2015


DESCRIPTION

Apache Samza talk at LA HUG

TRANSCRIPT

Page 1: Samza la hug

Apache Samza*

Reliable Stream Processing atop Apache Kafka and YARN

Sriram Subramanian

Me on LinkedIn

Me on Twitter - @sriramsub1

* Incubating

Page 2: Samza la hug
Page 3: Samza la hug

Agenda

• Why Stream Processing?

• What is Samza’s Design?

• How is Samza’s Design Implemented?

• How can you use Samza?

• Example usage at LinkedIn

Page 4: Samza la hug

Why Stream Processing?

Page 8: Samza la hug

Response latency

0 ms → Later. Possibly much later.

• RPC: synchronous

• Samza: milliseconds to minutes

Page 9: Samza la hug

Newsfeed

Ad Relevance

Page 10: Samza la hug

Search Index

Metrics and Monitoring

Page 11: Samza la hug

What is Samza’s Design ?

Page 12: Samza la hug

Stream A

JOB

Stream B

Stream C

Page 13: Samza la hug

Stream A

JOB 1

Stream B

Stream C

Stream D

JOB 2

Stream E

Stream F

JOB 3

Stream G

Page 14: Samza la hug

Streams

Partition 0 Partition 1 Partition 2

Page 15: Samza la hug

Streams

Partition 0 Partition 1 Partition 2

123456

12345

1234567

Page 20: Samza la hug

Streams

Partition 0: 1 2 3 4 5 6

Partition 1: 1 2 3 4 5

Partition 2: 1 2 3 4 5 6 7

(next append goes at the end of a partition)
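
The picture above can be read as follows: a stream is split into partitions, each partition is an ordered, append-only sequence, and every new message gets the next position (offset) in its partition. A minimal in-memory sketch of that model is below. The class and method names (PartitionedStream, append, nextOffset) are illustrative assumptions, not Samza or Kafka API, and choosing the partition by key hash is also an assumption. Offsets here start at 0, while the slide numbers messages from 1.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative in-memory model of a partitioned stream; not Samza or Kafka API.
final class PartitionedStream {
    private final List<List<String>> partitions = new ArrayList<>();

    PartitionedStream(int partitionCount) {
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new ArrayList<>());
        }
    }

    // Assumption: messages with the same key always land in the same partition.
    int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    // Appends at the end of the chosen partition and returns the new message's offset.
    long append(String key, String message) {
        List<String> log = partitions.get(partitionFor(key));
        log.add(message);
        return log.size() - 1;
    }

    // Reads the message stored at a given offset of a partition.
    String read(int partition, long offset) {
        return partitions.get(partition).get((int) offset);
    }

    // Offset that the next append to this partition will receive.
    long nextOffset(int partition) {
        return partitions.get(partition).size();
    }

    public static void main(String[] args) {
        PartitionedStream stream = new PartitionedStream(3);
        long first = stream.append("user-1", "view A");
        long second = stream.append("user-1", "view B");
        int p = stream.partitionFor("user-1");
        System.out.println("partition " + p + ", offsets " + first + " and " + second);
        System.out.println("next append goes to offset " + stream.nextOffset(p));
    }
}
```

Because messages are only ever appended, a reader can describe its position in a partition with a single number, which is what the checkpoint slides later rely on.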

Page 21: Samza la hug

Jobs

Stream A Stream B

Task 1 Task 2 Task 3

Stream C

Page 22: Samza la hug

Jobs

AdViews AdClicks

Task 1 Task 2 Task 3

AdClickThroughRate

Page 23: Samza la hug

Tasks

AdViewsCounterTask

Partition 0 Partition 1

Ad Views - Partition 0

1234

Output Count Stream

Page 31: Samza la hug

Tasks

AdViewsCounterTask

Partition 0 Partition 1

1234

2

Partition 1

Checkpoint Stream

Ad Views - Partition 0

Output Count Stream
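
The checkpoint in this picture records how far a task has read in its input partition, so that a restarted task can resume from there instead of from the beginning. Below is a sketch in plain Java under stated assumptions: CheckpointStore stands in for the checkpoint stream, CheckpointedTask and the commit interval are made-up names and values, and none of this is Samza's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-in for the checkpoint stream: remembers the next offset to read.
final class CheckpointStore {
    private long nextOffset = 0;

    void write(long offset) { nextOffset = offset; }

    long read() { return nextOffset; }
}

// Hypothetical task that reads one input partition and checkpoints periodically.
final class CheckpointedTask {
    private final List<String> input;
    private final CheckpointStore checkpoints;
    private final int commitEvery;
    final List<String> processed = new ArrayList<>();

    CheckpointedTask(List<String> input, CheckpointStore checkpoints, int commitEvery) {
        this.input = input;
        this.checkpoints = checkpoints;
        this.commitEvery = commitEvery;
    }

    // Processes messages from the last checkpoint onward; stopping before
    // stopAt simulates a crash part-way through the partition.
    void runUntil(long stopAt) {
        long offset = checkpoints.read();
        while (offset < stopAt && offset < input.size()) {
            processed.add(input.get((int) offset));
            offset++;
            if (offset % commitEvery == 0) {
                checkpoints.write(offset);
            }
        }
    }

    public static void main(String[] args) {
        List<String> adViews = Arrays.asList("v0", "v1", "v2", "v3", "v4");
        CheckpointStore store = new CheckpointStore();

        CheckpointedTask first = new CheckpointedTask(adViews, store, 2);
        first.runUntil(3); // "crashes" after v2; last checkpoint is offset 2

        CheckpointedTask restarted = new CheckpointedTask(adViews, store, 2);
        restarted.runUntil(adViews.size());
        System.out.println(first.processed + " then " + restarted.processed);
    }
}
```

In this run the restarted task sees v2 a second time, because the crash happened after v2 was processed but before the next checkpoint was written. Messages between the last checkpoint and the failure are replayed rather than lost.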

Page 40: Samza la hug

Dataflow

Stream A Stream B Stream C

Stream E

Stream B

Job 1 Job 2

Stream D

Job 3

Page 42: Samza la hug

Stateful Processing

• Windowed Aggregation – Counting the number of page views for each user per hour

• Stream Stream Join – Join a stream of ad clicks to a stream of ad views to identify the view that led to the click

• Stream Table Join – Join user region info to a stream of page views to create an augmented stream
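
To make the first case concrete, here is a small sketch of an hourly page-view count per user. The class name, the "userId@windowStart" key format, and the use of event timestamps in milliseconds are assumptions for illustration. The map is the state the task has to keep between messages.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical windowed aggregation: page views per user per hour.
final class HourlyPageViewCounter {
    private static final long HOUR_MILLIS = 60L * 60L * 1000L;

    // State: "userId@windowStartMillis" -> number of views in that hour.
    private final Map<String, Integer> counts = new HashMap<>();

    // Start of the one-hour window that contains the given timestamp.
    static long windowStart(long timestampMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, HOUR_MILLIS);
    }

    // Records one page view and returns the updated count for that user and hour.
    int record(String userId, long timestampMillis) {
        String key = userId + "@" + windowStart(timestampMillis);
        return counts.merge(key, 1, Integer::sum);
    }

    // Current count for a user in the hour containing the timestamp (0 if none).
    int count(String userId, long timestampMillis) {
        return counts.getOrDefault(userId + "@" + windowStart(timestampMillis), 0);
    }

    public static void main(String[] args) {
        HourlyPageViewCounter counter = new HourlyPageViewCounter();
        counter.record("member-1", 1_000L);
        counter.record("member-1", 2_000L);
        counter.record("member-1", HOUR_MILLIS + 1_000L); // falls into the next hour
        System.out.println(counter.count("member-1", 0L));
        System.out.println(counter.count("member-1", HOUR_MILLIS));
    }
}
```

Nothing in this sketch ever evicts old windows, so the map keeps growing with the number of users and hours. How to keep such state without losing it on failure is the question the following slides take up.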

Page 43: Samza la hug

How do people do this?

• In-memory state with checkpointing

– Periodically save out the task’s in-memory data

– As state grows, this becomes very expensive

– Some implementations checkpoint diffs, but that adds complexity

Page 44: Samza la hug

How do people do this?

• Using an external store

– Push state to an external store

– Performance suffers because of remote queries

– Lack of isolation

– Limited query capabilities

Page 45: Samza la hug

Stateful Tasks

Stream A

Task 1 Task 2 Task 3

Stream B

Page 47: Samza la hug

Stateful Tasks

Stream A

Task 1 Task 2 Task 3

Stream B Changelog Stream
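
One way to read this picture: each task keeps its state locally and also writes every change to a changelog stream. If the task is restarted, it can rebuild its local state by replaying that stream. The sketch below models the idea with an in-memory list as the changelog. ChangelogBackedStore and Change are hypothetical names and this is not Samza's storage code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical local store whose writes are mirrored to a changelog.
final class ChangelogBackedStore {
    // One changelog entry; a null value marks a delete.
    static final class Change {
        final String key;
        final String value;

        Change(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    private final Map<String, String> local = new HashMap<>();
    private final List<Change> changelog;

    // Restores local state by replaying whatever is already in the changelog.
    ChangelogBackedStore(List<Change> changelog) {
        this.changelog = changelog;
        for (Change change : changelog) {
            apply(change);
        }
    }

    private void apply(Change change) {
        if (change.value == null) {
            local.remove(change.key);
        } else {
            local.put(change.key, change.value);
        }
    }

    void put(String key, String value) {
        Change change = new Change(key, value);
        changelog.add(change); // record the change first
        apply(change);         // then update the fast local copy
    }

    void delete(String key) {
        Change change = new Change(key, null);
        changelog.add(change);
        apply(change);
    }

    // Reads are served from local state only; no remote call is involved.
    String get(String key) {
        return local.get(key);
    }

    public static void main(String[] args) {
        List<Change> changelog = new ArrayList<>();
        ChangelogBackedStore before = new ChangelogBackedStore(changelog);
        before.put("member-1", "Alice");
        before.put("member-2", "Bob");
        before.delete("member-2");

        // Simulated restart on another machine: only the changelog survives.
        ChangelogBackedStore after = new ChangelogBackedStore(changelog);
        System.out.println(after.get("member-1") + ", " + after.get("member-2"));
    }
}
```

Compared with the two approaches on the previous slides, reads stay local, and recovery cost depends on the length of the changelog rather than on periodic full snapshots.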

Page 59: Samza la hug

Key-Value Store

• put(table_name, key, value)

• get(table_name, key)

• delete(table_name, key)

• range(table_name, key1, key2)
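
The four operations can be sketched with a sorted map per table. This is an in-memory illustration only: TableKeyValueStore is a made-up name, String keys and values are an assumption, and so is the choice that range includes key1 and excludes key2. The slide does not specify these details.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory version of the four operations listed on the slide.
final class TableKeyValueStore {
    // Each table keeps its keys sorted so that range queries are cheap.
    private final Map<String, TreeMap<String, String>> tables = new HashMap<>();

    private TreeMap<String, String> table(String tableName) {
        return tables.computeIfAbsent(tableName, name -> new TreeMap<>());
    }

    void put(String tableName, String key, String value) {
        table(tableName).put(key, value);
    }

    String get(String tableName, String key) {
        return table(tableName).get(key);
    }

    void delete(String tableName, String key) {
        table(tableName).remove(key);
    }

    // Assumption: returns values for keys in [key1, key2), in key order.
    // TreeMap.subMap throws IllegalArgumentException if key1 > key2.
    List<String> range(String tableName, String key1, String key2) {
        return new ArrayList<>(table(tableName).subMap(key1, key2).values());
    }

    public static void main(String[] args) {
        TableKeyValueStore store = new TableKeyValueStore();
        store.put("regions", "member-1", "US");
        store.put("regions", "member-2", "IN");
        store.put("regions", "member-3", "DE");
        System.out.println(store.range("regions", "member-1", "member-3"));
        store.delete("regions", "member-2");
        System.out.println(store.get("regions", "member-2"));
    }
}
```

A sorted structure is what makes range possible at all; a plain hash map would support only put, get, and delete.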

Page 60: Samza la hug

How is Samza’s Design Implemented?

Page 61: Samza la hug

Apache Kafka

• Persistent, reliable, distributed message queue

Page 62: Samza la hug

At LinkedIn

• 10+ billion writes per day

• 172k messages per second (average)

• 60+ billion messages per day to real-time consumers

Page 63: Samza la hug

Apache Kafka

• Models streams as topics

• Each topic is partitioned and each partition is replicated

• Producer sends messages to a topic

• Messages are stored in brokers

• Consumers consume from a topic (pull from broker)
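
The pull model in the last bullet means the broker only stores messages, and each consumer asks for data starting at its own position. A toy sketch of that interaction follows. ToyBroker and ToyConsumer are hypothetical classes with a single partition per topic and no replication; they are not the Kafka client API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy broker: stores messages per topic (one partition per topic, no replication).
final class ToyBroker {
    private final Map<String, List<String>> topics = new HashMap<>();

    // Producer side: append a message to a topic.
    void send(String topic, String message) {
        topics.computeIfAbsent(topic, t -> new ArrayList<>()).add(message);
    }

    // Consumer side: return up to maxMessages starting at the given offset.
    List<String> fetch(String topic, int offset, int maxMessages) {
        List<String> log = topics.getOrDefault(topic, new ArrayList<>());
        int from = Math.min(offset, log.size());
        int to = Math.min(from + maxMessages, log.size());
        return new ArrayList<>(log.subList(from, to));
    }
}

// Toy consumer: pulls from the broker and tracks its own position.
final class ToyConsumer {
    private final ToyBroker broker;
    private final String topic;
    private int position = 0;

    ToyConsumer(ToyBroker broker, String topic) {
        this.broker = broker;
        this.topic = topic;
    }

    List<String> poll(int maxMessages) {
        List<String> batch = broker.fetch(topic, position, maxMessages);
        position += batch.size();
        return batch;
    }

    public static void main(String[] args) {
        ToyBroker broker = new ToyBroker();
        broker.send("AdViews", "view-1");
        broker.send("AdViews", "view-2");
        broker.send("AdViews", "view-3");

        ToyConsumer fast = new ToyConsumer(broker, "AdViews");
        ToyConsumer slow = new ToyConsumer(broker, "AdViews");
        System.out.println(fast.poll(10)); // reads everything available
        System.out.println(slow.poll(1));  // reads at its own pace
    }
}
```

Since the position lives with the consumer, a slow consumer does not hold back a fast one, and the same messages can be read by several independent consumers.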

Page 64: Samza la hug

YARN – Yet Another Resource Negotiator

• Framework to run your code on a grid of machines

• Distributes our tasks across multiple machines

• Notifies our framework when a task has died

• Isolates our tasks from each other

Page 65: Samza la hug

Jobs

Stream A

Task 1 Task 2 Task 3

Stream B

Page 66: Samza la hug

Containers

Task 1 Task 2 Task 3

Stream B

Stream A

Page 67: Samza la hug

Containers

Stream B

Stream A

Samza Container 1 Samza Container 2

Page 68: Samza la hug

Containers

Samza Container 1 Samza Container 2

Page 69: Samza la hug

YARN

Samza Container 1 Samza Container 2

Host 1 Host 2

Page 70: Samza la hug

YARN

Samza Container 1 Samza Container 2

NodeManager NodeManager

Host 1 Host 2

Page 71: Samza la hug

YARN

Samza Container 1 Samza Container 2

NodeManager NodeManager

Samza YARN AM

Host 1 Host 2

Page 72: Samza la hug

YARN

Samza Container 1 Samza Container 2

NodeManager

Kafka Broker

NodeManager

Samza YARN AM

Kafka Broker

Host 1 Host 2

Page 73: Samza la hug

YARN

MapReduce Container

MapReduce Container

NodeManager

HDFS

NodeManager

MapReduce YARN AM

HDFS

Host 1 Host 2

Page 74: Samza la hug

YARN

Samza Container 1

NodeManager

Kafka Broker

Host 1

Stream C

Stream A

Samza Container 1 Samza Container 2

Page 79: Samza la hug

CGroups

Samza Container 1 Samza Container 2

NodeManager

Kafka Broker

NodeManager

Samza YARN AM

Kafka Broker

Host 1 Host 2

Page 80: Samza la hug

How can you use Samza ?

Page 81: Samza la hug

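Tasks

Partition 0

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import org.apache.avro.generic.GenericRecord;
import org.apache.samza.system.*;
import org.apache.samza.task.*;

// Counts views per page key and emits the running count for each message.
class PageKeyViewsCounterTask implements StreamTask {
  // Assumed output stream; the slide leaves countStream undefined.
  private static final SystemStream countStream = new SystemStream("kafka", "page-key-counts");
  private final Map<String, AtomicInteger> pageKeyViews = new HashMap<>();

  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    GenericRecord record = (GenericRecord) envelope.getMessage();
    String pageKey = record.get("page-key").toString();
    // Create the counter on first sight of a page key, then increment it.
    int newCount = pageKeyViews.computeIfAbsent(pageKey, k -> new AtomicInteger()).incrementAndGet();
    collector.send(new OutgoingMessageEnvelope(countStream, pageKey, newCount));
  }
}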

Page 91: Samza la hug

Stateful Stream Task

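import org.apache.avro.generic.GenericRecord;
import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.*;

// Remembers the latest name seen for each member in a local key-value store.
public class SimpleStatefulTask implements StreamTask, InitableTask {
  private KeyValueStore<String, String> store;

  @SuppressWarnings("unchecked")
  public void init(Config config, TaskContext context) {
    // Cast needed where getStore returns Object (as in early Samza releases).
    this.store = (KeyValueStore<String, String>) context.getStore("mystore");
  }

  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    GenericRecord record = (GenericRecord) envelope.getMessage();
    // GenericRecord.get returns Object, so convert to String explicitly.
    String memberId = record.get("member_id").toString();
    String name = record.get("name").toString();
    System.out.println("old name: " + store.get(memberId));
    store.put(memberId, name);
  }
}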

Page 95: Samza la hug

Example usage at LinkedIn

Page 96: Samza la hug

Call graph assembly

get_unread_msg_count()

get_PYMK()

get_Pulse_news()

get_relevant_ads()

get_news_updates()

Page 97: Samza la hug

Lots of calls == lots of machines, logs

get_unread_msg_count()

get_PYMK()

get_Pulse_news()

get_relevant_ads()

get_news_updates()

unread_msg_service_call

get_PYMK_service_call

pulse_news_service_call

ad_relevance_service_call

news_update_service_call

Page 98: Samza la hug

TreeID: Unique identifier

page_view_event (123456)

unread_msg_service_call (123456)

another_service_call (123456)

silly_service_call (123456)

get_PYMK_service_call (123456)

counter_service_call (123456)

unread_msg_service_call (123456)

count_invites_service_call (123456)

count_msgs_service_call (123456)

Page 99: Samza la hug

OK, now lots of streams with TreeIDs…

*_service_call

→ Samza job: Repartition-By-TreeID

→ all_service_calls (partitioned by TreeID)

→ Samza job: Assemble Call Graph

→ service_call_graphs

• Near real-time holistic view of how we’re actually serving data

• Compare day-over-day, cost, changes, outages
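
The two jobs on this slide can be sketched in plain Java. The first step picks a partition from the TreeID so that all calls for one page view end up together. The second step collects the calls that share a TreeID. The class and field names are hypothetical, and the assembly is deliberately simplified: it groups call names per TreeID, whereas building an actual tree would need caller/callee information that the slides do not show.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the repartition and assemble steps.
final class CallGraphAssembler {
    // Minimal service-call event: only the fields visible on the slides.
    static final class ServiceCall {
        final String treeId;
        final String name;

        ServiceCall(String treeId, String name) {
            this.treeId = treeId;
            this.name = name;
        }
    }

    // Step 1 (Repartition-By-TreeID): same TreeID always maps to the same partition.
    static int partitionFor(String treeId, int partitionCount) {
        return Math.floorMod(treeId.hashCode(), partitionCount);
    }

    // Step 2 (Assemble Call Graph), simplified: group call names by TreeID,
    // keeping arrival order within each group.
    static Map<String, List<String>> assemble(List<ServiceCall> calls) {
        Map<String, List<String>> byTree = new LinkedHashMap<>();
        for (ServiceCall call : calls) {
            byTree.computeIfAbsent(call.treeId, id -> new ArrayList<>()).add(call.name);
        }
        return byTree;
    }

    public static void main(String[] args) {
        List<ServiceCall> calls = Arrays.asList(
            new ServiceCall("123456", "page_view_event"),
            new ServiceCall("123456", "unread_msg_service_call"),
            new ServiceCall("123456", "get_PYMK_service_call"));
        System.out.println(assemble(calls));
    }
}
```

Repartitioning first is what lets the second job work with local state only: every task sees all events for the TreeIDs it owns and never has to consult another task.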

Page 100: Samza la hug

Thank you

• Quick start: bit.ly/hello-samza

• Project homepage: samza.incubator.apache.org

• Newbie issues: bit.ly/samza_newbie_issues

• Detailed Samza and YARN talk: bit.ly/samza_and_yarn

• A must-read: http://bit.ly/jay_on_logs

• Twitter: @samzastream

• Me on Twitter: @sriramsub1