Essential Ingredients of Stream Processing @ Scale Kartik Paramasivam

Post on 12-Apr-2017

TRANSCRIPT

Page 1: Essential Ingredients of Realtime Stream Processing @ Scale

Essential Ingredients of Stream Processing @ Scale

Kartik Paramasivam

Page 2: Essential Ingredients of Realtime Stream Processing @ Scale

About Me

• ‘Streams Infrastructure’ at LinkedIn
  – Pub-sub messaging: Apache Kafka
  – Change Capture from various data systems: Databus
  – Stream Processing platform: Apache Samza

• Previous
  – Microsoft Cloud/IoT Messaging (EventHub) and Enterprise Messaging (Queues/Topics)
  – .NET WebServices and Workflow stack
  – BizTalk Server

Page 3: Essential Ingredients of Realtime Stream Processing @ Scale

Agenda

• What is Stream Processing?
• Scenarios
• Canonical Architecture
• Essential Ingredients of Stream Processing
• Close

Page 4: Essential Ingredients of Realtime Stream Processing @ Scale

Response latency (figure: a spectrum starting at 0 ms)

• RPC: synchronous
• Stream processing: milliseconds to minutes
• Later. Possibly much later.

Page 5: Essential Ingredients of Realtime Stream Processing @ Scale

Agenda

• Stream processing Intro
• Scenarios
• Canonical Architecture
• Essential Ingredients of Stream Processing
• Close

Page 6: Essential Ingredients of Realtime Stream Processing @ Scale

Newsfeed

Page 7: Essential Ingredients of Realtime Stream Processing @ Scale

Cyber-security

Page 8: Essential Ingredients of Realtime Stream Processing @ Scale

Internet of Things

Page 9: Essential Ingredients of Realtime Stream Processing @ Scale

Agenda

• Stream processing Intro
• Scenarios
• Canonical Architecture
• Essential Ingredients of Stream Processing
• Close

Page 10: Essential Ingredients of Realtime Stream Processing @ Scale

CANONICAL ARCHITECTURE

(Diagram; components grouped by stage)

• Clients (browser, devices, sensors ...)
• Services Tier
• Ingestion: Kafka, Data-Bus (from Espresso)
• Processing: Real Time Processing (Samza), Batch Processing (Hadoop/Spark)
• Serving: e.g. Voldemort R/O (bulk upload), Espresso

Page 11: Essential Ingredients of Realtime Stream Processing @ Scale

Agenda

• Stream processing Intro
• Scenarios
• Canonical Architecture
• Essential Ingredients of Stream Processing
• Close

Page 12: Essential Ingredients of Realtime Stream Processing @ Scale

Essential Ingredients to Stream Processing

1. Scale
2. Reprocessing
3. Accuracy of results
4. Easy to program

Page 13: Essential Ingredients of Realtime Stream Processing @ Scale

SCALE.. but not at any cost

Page 14: Essential Ingredients of Realtime Stream Processing @ Scale

Basics : Scaling Ingestion

- Streams are partitioned
- Messages sent to partitions based on PartitionKey
- Time based message retention

Stream A

producers

Pkey=10

consumerA(machine1)

consumerA(machine2)

Pkey=25 Pkey=45

e.g. Kafka, AWS Kinesis, Azure EventHub
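
The partitioning idea on this slide can be sketched in a few lines. This is a minimal illustration, not how Kafka, Kinesis or EventHub actually hash keys: CRC32 is a stand-in hash, and the names `partition_for` and `assign_partitions` are hypothetical helpers introduced only for this example.

```python
import zlib


def partition_for(partition_key: str, num_partitions: int) -> int:
    """Map a partition key to a partition index.

    CRC32 is an assumed stand-in; real systems use their own hash function.
    """
    return zlib.crc32(partition_key.encode("utf-8")) % num_partitions


def assign_partitions(num_partitions: int, machines: list[str]) -> dict[str, list[int]]:
    """Spread partitions over consumer machines round-robin (illustrative only)."""
    assignment: dict[str, list[int]] = {machine: [] for machine in machines}
    for partition in range(num_partitions):
        assignment[machines[partition % len(machines)]].append(partition)
    return assignment


if __name__ == "__main__":
    for pkey in ("10", "25", "45"):
        print(f"Pkey={pkey} -> partition {partition_for(pkey, 4)}")
    print(assign_partitions(4, ["machine1", "machine2"]))
```

Because the same key always hashes to the same partition, all messages for one key are seen by the same consumer instance, which is what makes per-key local state possible later in the talk.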

Page 15: Essential Ingredients of Realtime Stream Processing @ Scale

Scaling Processing, e.g. Samza

Stream A

Task 1 Task 2 Task 3

Stream B

Samza Job

Page 16: Essential Ingredients of Realtime Stream Processing @ Scale

Samza – Streaming Dataflow

Stream A

Stream C

Stream D

Job 1

Job 2

Stream B

Page 17: Essential Ingredients of Realtime Stream Processing @ Scale

Horizontal Scaling is great ! But..

• But more machines means more $$
• Need to do more with less.
• So what’s the key bottleneck during Event/Stream Processing?

Page 18: Essential Ingredients of Realtime Stream Processing @ Scale

Key Bottleneck: “Accessing Data”

• Big impact on CPU, Network, Disk

• Types of Data Access
  1. Adjunct data – Read-only data
  2. Scratchpad/derived data – Read-Write data

Page 19: Essential Ingredients of Realtime Stream Processing @ Scale

Adjunct Data – typical access

(Diagram: a Processing Job consumes AdClicks from Kafka, reads member info from the Member Database, and writes AdQuality updates to Kafka.)

Concerns
1. Latency
2. CPU
3. Network
4. DDOS

Page 20: Essential Ingredients of Realtime Stream Processing @ Scale

Scratch pad/Derived Data – typical access

(Diagram: a Processing Job consumes Sensor Data from Kafka, reads and updates per-device info in the DeviceState Database, and writes Alerts to Kafka.)

Concerns
1. Latency
2. CPU
3. Network
4. DDOS

Page 21: Essential Ingredients of Realtime Stream Processing @ Scale

Adjunct Data – with Samza

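(Diagram: a Processing Job with Task1, Task2 and Task3 consumes AdClicks from Kafka and writes its output to Kafka. Member Updates from the Member Database (Espresso) reach the tasks through Databus, and the tasks keep them locally in RocksDB.)

Kafka, Databus, Database, Samza Job are all partitioned by MemberId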
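
A rough sketch of what this slide describes: each task keeps a local copy of the member data, fed by a change-capture stream, and joins ad clicks against it without a remote call. This is not the Samza API. The class `AdClickTask`, its method names, and the record fields (`member_id`, `ad_id`, `industry`) are assumptions made for illustration, and a plain dict stands in for RocksDB.

```python
class AdClickTask:
    """One task of a partitioned job, holding local state for its partition."""

    def __init__(self):
        # Stand-in for a local RocksDB store: member_id -> profile.
        self.members = {}

    def on_member_update(self, member_id, profile):
        """Apply a change-capture record (e.g. from Databus) to the local store."""
        self.members[member_id] = profile

    def on_ad_click(self, click):
        """Join a click with member info using only a local lookup."""
        profile = self.members.get(click["member_id"])
        if profile is None:
            return None  # member not (yet) known in this partition
        return {
            "ad_id": click["ad_id"],
            "member_id": click["member_id"],
            "industry": profile["industry"],
        }


if __name__ == "__main__":
    task = AdClickTask()
    task.on_member_update(42, {"industry": "software"})
    print(task.on_ad_click({"member_id": 42, "ad_id": "ad-7"}))
```

The local lookup only works because the click stream and the member updates are partitioned by the same key (MemberId), so a task sees both the clicks and the profile changes for its members.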

Page 22: Essential Ingredients of Realtime Stream Processing @ Scale

Fault Tolerance in a stateful Samza job

(Diagram: partitions P0 to P3, Task-0 to Task-3, hosts Host-A, Host-B and Host-C, and a Changelog Stream with partitions P0 to P3.)

Stable State

Page 23: Essential Ingredients of Realtime Stream Processing @ Scale

Fault Tolerance in a stateful Samza job

(Diagram: partitions P0 to P3, Task-0 to Task-3, hosts Host-A, Host-B and Host-C, and a Changelog Stream with partitions P0 to P3.)

Host A dies/fails

Page 24: Essential Ingredients of Realtime Stream Processing @ Scale

Fault Tolerance in a stateful Samza job

(Diagram: partitions P0 to P3, Task-0 to Task-3, hosts Host-E, Host-B and Host-C, and a Changelog Stream with partitions P0 to P3.)

YARN allocates the tasks to a container on a different host!

Page 25: Essential Ingredients of Realtime Stream Processing @ Scale

Fault Tolerance in a stateful Samza job

(Diagram: partitions P0 to P3, Task-0 to Task-3, hosts Host-E, Host-B and Host-C, and a Changelog Stream with partitions P0 to P3.)

Restore local state by reading from the ChangeLog

Page 26: Essential Ingredients of Realtime Stream Processing @ Scale

Fault Tolerance in a stateful Samza job

(Diagram: partitions P0 to P3, Task-0 to Task-3, hosts Host-E, Host-B and Host-C, and a Changelog Stream with partitions P0 to P3.)

Back to Stable State
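
The recovery sequence on the last five slides can be sketched as follows. This is a simplified model, not Samza code: a Python list stands in for one changelog partition, a dict stands in for the local store, and `ChangeloggedStore` and `compact` are hypothetical names.

```python
class ChangeloggedStore:
    """Local key-value state whose every write is also appended to a changelog."""

    def __init__(self, changelog):
        self.changelog = changelog  # stand-in for one changelog stream partition
        self.state = {}             # stand-in for the local store

    def put(self, key, value):
        self.changelog.append((key, value))  # durable copy first
        self.state[key] = value

    @classmethod
    def restore(cls, changelog):
        """Rebuild local state on a new host by replaying the changelog."""
        store = cls(changelog)
        for key, value in changelog:
            store.state[key] = value  # later records overwrite earlier ones
        return store


def compact(changelog):
    """Keep only the latest value per key, roughly what log compaction achieves.

    Record order may differ from a real compacted log; the restored state does not.
    """
    latest = {}
    for key, value in changelog:
        latest[key] = value
    return list(latest.items())


if __name__ == "__main__":
    log = []
    on_host_a = ChangeloggedStore(log)
    on_host_a.put("device1", "ok")
    on_host_a.put("device1", "alert")
    on_host_e = ChangeloggedStore.restore(log)  # Host A died; restore elsewhere
    print(on_host_e.state)
```

Restore time grows with the length of the changelog, which is why keeping only the latest value per key matters for large state.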

Page 27: Essential Ingredients of Realtime Stream Processing @ Scale

Performance Numbers with Samza

Hardware Spec: 24 cores, 1 Gig NIC, SSD

• (Baseline) Simple pass-through job with no local state – 1.2 million msg/sec

• Samza job with local state – 400k msg/sec

• Samza job with local state with Kafka backup – 300k msg/sec

Page 28: Essential Ingredients of Realtime Stream Processing @ Scale

Local State - Summary

• Great for both read-only data and read-write data

• Secret sauce to make local state work
  1. Change Capture System: Databus/DynamoDB streams
  2. Durable backup with Kafka Log Compacted topics

Page 29: Essential Ingredients of Realtime Stream Processing @ Scale

Essential Ingredients to Stream Processing

1. Scale
2. Reprocessing
3. Accuracy of results
4. Easy to program

Page 30: Essential Ingredients of Realtime Stream Processing @ Scale

REPROCESSING

Page 31: Essential Ingredients of Realtime Stream Processing @ Scale

Why do we need it ?

• Software upgrades.. Yes, bugs are a reality
• Business logic changes
• First time job deployment

Page 32: Essential Ingredients of Realtime Stream Processing @ Scale

Reprocessing Data – with Samza

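(Diagram: Member Updates flow from the Member Database (Espresso) through Databus into the Company/Title/Location Standardization Job, which uses a Machine Learning model and writes its output to Kafka; one path is labeled bootstrap.)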

Page 33: Essential Ingredients of Realtime Stream Processing @ Scale

Reprocessing- Caveats

• Stream processors are fast.. They can DOS the system if you reprocess
  – Control max-concurrency of your job
  – Quotas for Kafka, Databases
  – Async load into databases (Project Venice)

• Capacity
  – Reprocessing a 100 TB source?

• Doesn’t reprocessing mean you are no longer being real-time?
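
One common way to keep a reprocessing job from overwhelming downstream systems is a rate limiter in front of each external call. The slide names max-concurrency and quotas; the token bucket below is a different but related technique, swapped in here because it is easy to show. `TokenBucket` and its parameters are hypothetical, and time is passed in explicitly so the behavior is deterministic.

```python
class TokenBucket:
    """Allow at most `rate_per_sec` calls per second, with bursts up to `capacity`."""

    def __init__(self, rate_per_sec, capacity, now=0.0):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)  # start with a full bucket
        self.last = now

    def try_acquire(self, now):
        """Return True if a call may proceed at time `now` (in seconds)."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate_per_sec=2, capacity=2)
    for t in (0.0, 0.0, 0.0, 0.5, 0.5):
        print(t, bucket.try_acquire(t))
```

A job would call `try_acquire` before each database write and wait or back off when it returns False, trading reprocessing speed for stability of the systems it writes to.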

Page 34: Essential Ingredients of Realtime Stream Processing @ Scale

Essential Ingredients to Stream Processing

1. Scale, but not at any cost
2. Reprocessing
3. Accuracy of results
4. Easy to Program

Page 35: Essential Ingredients of Realtime Stream Processing @ Scale

ACCURACY OF RESULTS

Page 36: Essential Ingredients of Realtime Stream Processing @ Scale

Querying over an infinite stream

(Diagram: for User1, an Ad View Event at 1:00 pm and an Ad Click Event at 1:01 pm arrive at the AdQuality Processor.)

Did the user click the Ad within 2 minutes of seeing the Ad?
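
In the simplest case, where both events are already in hand, the question on this slide is a comparison of two event timestamps. A minimal sketch, with `clicked_within` as a hypothetical helper name and the dates chosen arbitrarily:

```python
from datetime import datetime, timedelta


def clicked_within(view_time, click_time, window=timedelta(minutes=2)):
    """True if the click happened at or after the view and within `window` of it."""
    return timedelta(0) <= click_time - view_time <= window


if __name__ == "__main__":
    view = datetime(2017, 4, 12, 13, 0)   # 1:00 pm
    click = datetime(2017, 4, 12, 13, 1)  # 1:01 pm
    print(clicked_within(view, click))
```

The hard part, covered on the following slides, is that on an infinite stream the two events may arrive late or in the wrong order, so the comparison has to use event time rather than arrival time.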

Page 37: Essential Ingredients of Realtime Stream Processing @ Scale

DELAYS – AN EXAMPLE

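(Diagram: two datacenters, DATACENTER 1 and DATACENTER 2, each with a Services Tier, Kafka, and an Ad Quality Processor (Samza), with Kafka mirrored between them. A load balancer (LB) routes the AdViewEvent from user kartik to one of the datacenters.)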

Page 38: Essential Ingredients of Realtime Stream Processing @ Scale

DELAYS – AN EXAMPLE

(Diagram: two datacenters, DATACENTER 1 and DATACENTER 2, each with a Services Tier, Kafka, and Real Time Processing (Samza), with Kafka mirrored between them. A load balancer (LB) routes the AdClick Event from user kartik to one of the datacenters.)

Page 39: Essential Ingredients of Realtime Stream Processing @ Scale

What do we need to do to get accurate results?

Deal with

• Late Arrivals
  – E.g. AdClick event showed up 5 minutes late.
• Out of order arrival
  – E.g. AdClick event showed up before AdView event

• Influenced by “Google MillWheel”

Page 40: Essential Ingredients of Realtime Stream Processing @ Scale

Solution

(Diagram: a Processing Job with Task1, Task2 and Task3 consumes AdClicks and AdView events from Kafka and writes its output to Kafka. Each task has a local Message Store.)

1. All events are stored locally
2. Find impacted ‘window/s’ for late arrivals
3. Recompute result
4. Choose strategy for emitting results (absolute or relative value)
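
The four steps above can be sketched as a small in-memory task. This is an illustration under stated assumptions, not Samza's windowing implementation: windows are assumed to be 5-minute tumbling windows keyed on the view's event time, timestamps are plain seconds, dicts stand in for the local message store, and `AdQualityTask` and its methods are hypothetical names.

```python
from collections import defaultdict

WINDOW_SECS = 300          # assumed tumbling window size
CLICK_DEADLINE_SECS = 120  # "within 2 minutes of seeing the Ad"


class AdQualityTask:
    def __init__(self):
        self.views = defaultdict(list)   # (user, ad) -> view event times
        self.clicks = defaultdict(list)  # (user, ad) -> click event times
        self.emitted = defaultdict(int)  # window start -> last emitted count

    @staticmethod
    def _window_start(event_time):
        return event_time - event_time % WINDOW_SECS

    def _count(self, window_start):
        """Recompute: views in this window that were clicked within the deadline."""
        count = 0
        for key, view_times in self.views.items():
            clicks = self.clicks.get(key, ())
            for v in view_times:
                if self._window_start(v) == window_start and any(
                    0 <= c - v <= CLICK_DEADLINE_SECS for c in clicks
                ):
                    count += 1
        return count

    def on_event(self, kind, key, event_time, relative=False):
        # 1. Store every event locally, whatever order it arrives in.
        (self.views if kind == "view" else self.clicks)[key].append(event_time)
        # 2. Find the windows this (possibly late) event can affect.
        if kind == "view":
            impacted = {self._window_start(event_time)}
        else:
            impacted = {
                self._window_start(v)
                for v in self.views.get(key, ())
                if 0 <= event_time - v <= CLICK_DEADLINE_SECS
            }
        # 3. Recompute, and 4. emit absolute totals or relative deltas.
        results = {}
        for window in sorted(impacted):
            total = self._count(window)
            results[window] = total - self.emitted[window] if relative else total
            self.emitted[window] = total
        return results


if __name__ == "__main__":
    task = AdQualityTask()
    key = ("User1", "ad7")
    print(task.on_event("click", key, 60))  # click arrives before its view
    print(task.on_event("view", key, 0))    # the view shows up late
```

Emitting absolute values lets a downstream consumer simply overwrite its previous result, while relative values suit consumers that accumulate; which one fits depends on what reads the output.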

Page 41: Essential Ingredients of Realtime Stream Processing @ Scale

Myth: This isn’t a problem with Lambda Architecture..

• Theory: Since the processing happens 1 hour or several hours later, delays are not a problem.

• Ok.. But what about the “edges”?
  – Some “sessions” start before the cut-off time for processing.. and end after the cut-off time.
  – Delays and out-of-order processing make things worse on the edges

Page 42: Essential Ingredients of Realtime Stream Processing @ Scale

Essential Ingredients to Stream Processing

1. Scale, but not at any cost
2. Reprocessing
3. Accuracy of results
4. Easy Programmability

Page 43: Essential Ingredients of Realtime Stream Processing @ Scale

Easy Programmability

• Support for “accurate” Windowing/Joins (Google Cloud Dataflow)

• Ability to express workflows/DAGs in config and DSL (e.g. Storm)

• SQL support for querying over streams
  – Azure Stream Insight

• Apache Samza – working on the above

Page 44: Essential Ingredients of Realtime Stream Processing @ Scale

Agenda

• Stream processing Intro
• Scenarios
• Canonical Architecture
• Essential Ingredients of Stream Processing
• Close

Page 45: Essential Ingredients of Realtime Stream Processing @ Scale

Some scale numbers at LinkedIn

• 1.3 Trillion Messages get ingested into Kafka per day
  – Each message gets consumed 4-5 times

• Database change capture:
  – More than 2 Trillion Messages get consumed per week

• Samza jobs in production which process more than 1 Million messages/sec

Note: These numbers are not reflective of LinkedIn Site traffic

Page 47: Essential Ingredients of Realtime Stream Processing @ Scale

Thank You!