Post on 10-Jan-2017

@helenaedelson #kafkasummit 1

Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign Management for Globally Distributed Data Flows

Helena Edelson @helenaedelson Kafka Summit 2016

VP of Engineering, Tuplejump

Previously: Sr Cloud / Big Data / Analytics Engineer: DataStax, CrowdStrike, VMware, SpringSource...

Event-Driven systems, Analytics, Machine Learning, Scala

Committer: Kafka Connect Cassandra, Spark Cassandra Connector

Contributor: Akka, previously: Spring Integration

Speaker: Kafka Summit, Spark Summit, Strata, QCon, Scala Days, Scala World, Philly ETE

twitter.com/helenaedelson github.com/helena

slideshare.net/helenaedelson

The Real Topic

http://www.slideshare.net/palvaro/ricon-keynote-outwards-from-the-middle-of-the-maze/42

Chaos Of Distribution

One of the more fascinating problems is that of solving the chaos of distributed systems, regardless of the domain.

Approaching this within the use case of:

High-Level Landscape

Platform & Infrastructure

Strategies and Patterns

Four-Letter Acronyms

Can't Touch This

Architecture

The Landscape

The Digital Ad Industry

An RTB Drive-By

Real-time auction for ad spaces, on all devices

High throughput, low latency (similar to FinTech, but not quite)

OpenRTB API spec - but not everyone uses it

OpenRTB: Open protocol for automated trading of digital media across platforms, devices, and advertising solutions

In A Nutshell

1. User hits a Publisher's page

2. Advertisers send Bid Requests

3. Highest Bid Accepted

4. Ad Delivered to User

RTB Auction for Impressions

User Devices

Site: Ad-supported content

Publisher: Seller has ad spaces to sell to highest bidders

Real Time Exchange & Auction (SSP): OpenRTB server used to bid

Bidder Service (DSP): OpenRTB client

Advertiser: Buyer wants ad impressions. Uses bidders to bid on its behalf

Flow labels in the diagram: ad request, bid request, bid response, win notice & settlement price, insert orders, winning ad

Time Is Money

RTB: Maximum response latency of 100 ms

Assume some network latency!
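To make the 100 ms ceiling concrete, here is a minimal sketch of a latency budget. Only the 100 ms limit comes from the slide; the round-trip and safety-margin inputs, and the names `LatencyBudget` and `computeBudgetMs`, are illustrative assumptions.

```scala
object LatencyBudget {
  // Maximum RTB response latency, taken from the slide.
  val MaxResponseMs: Int = 100

  // Time left for the bidder's own work once the assumed network
  // round trip and a safety margin are subtracted. Never negative.
  def computeBudgetMs(networkRoundTripMs: Int, safetyMarginMs: Int): Int =
    math.max(0, MaxResponseMs - networkRoundTripMs - safetyMarginMs)
}
```

With an assumed 40 ms round trip and a 10 ms margin, `LatencyBudget.computeBudgetMs(40, 10)` leaves 50 ms for computing the bid.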

Sampling of RTB Events

Ad Request

Bid Request - JSON 100 bytes

Compute optimal bid for advertiser

Bid Response - JSON 1000 bytes (may include ad metadata)

Win Notification (may or may not exist) with settlement price

Ad Impression - when the ad is viewed

Ad Click

Ad Conversion

Event Streams

Auctions: auction data + bid requests

Ad Impressions: which ad ids were shown

Ad Clicks: which auction ids resulted in a click

Ad Conversions: streams joined on auction id

Analytics Aggregations & ML to derive hundreds of metrics and dimensions
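A join on auction id can be sketched on an in-memory batch as a stand-in for a streaming join. The event shapes and names below (`Impression`, `Click`, `Joined`, `StreamJoin`) are hypothetical, and treating "impression with a matching click" as the join result is a simplification for illustration only.

```scala
object StreamJoin {
  // Hypothetical, simplified event shapes; real RTB events carry many more fields.
  final case class Impression(auctionId: String, adId: String)
  final case class Click(auctionId: String)
  final case class Joined(auctionId: String, adId: String)

  // Inner join of impressions and clicks on auction id: an impression is
  // emitted only if a click exists for the same auction.
  def join(impressions: Seq[Impression], clicks: Seq[Click]): Seq[Joined] = {
    val clickedAuctions: Set[String] = clicks.map(_.auctionId).toSet
    impressions.collect {
      case Impression(auctionId, adId) if clickedAuctions.contains(auctionId) =>
        Joined(auctionId, adId)
    }
  }
}
```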

Real Time: Just means event-driven, or processing events as they arrive. Does not automatically equal sub-second latency requirements.

Seen / Ingestion Time: When an event is ingested into the system.

Event Time: When an event is created, e.g. on a device.
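The difference between the two timestamps can be sketched as follows. The `Event` shape, the millisecond units, and the helper names are assumptions made for illustration, not part of the talk.

```scala
object EventClock {
  // Hypothetical event shape: both timestamps in epoch milliseconds.
  final case class Event(id: String, eventTimeMs: Long, seenTimeMs: Long)

  // Ingestion lag: how long after creation the system first saw the event.
  def lagMs(e: Event): Long = e.seenTimeMs - e.eventTimeMs

  // Assign an event to a fixed-size window by event time, so late
  // arrivals still land in the window in which they were created.
  def eventTimeWindowStart(e: Event, windowMs: Long): Long =
    e.eventTimeMs - (e.eventTimeMs % windowMs)
}
```

Windowing by seen time instead would place a late event in whichever window happened to be open when it arrived.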

The Platform

Platform Requirements

24 / 7 Uptime

Brokerage model: DSPs only make $ on successful ad deliveries, so uptime is critical

Security

Enable service across the globe

Handle thousands of concurrent requests per second

Scale to traffic of 700TB per day

Manage 700TB per day of data

Derive Metrics

Business Requirements

Support SLAs for bid transactions

Legal constraints - user data crossing borders

The critical path must be fast to win

No data loss on ingestion path

Bid & Campaign Optimization

Frequency Capping

Management UI for Publishers & Advertisers

Questions To Answer

% Writes on ingestion, analytics pre-aggregation, etc.

% Reads of raw data by analytics, aggregated views by customer management UI

How much to keep in memory on RTB app nodes?

Dimensions of data in analytics queries

Optimization algorithms

What needs real-time feedback loops, and what does not

Which data flows are low-latency / high-frequency, and which are not

Where the potential bottlenecks are

Constraints

Resources - I need to build highly functioning teams that are psyched about the work and working together

Budget

Cloud Resources

JDK Version (What?!)

Existing infrastructure & technologies that will be replaced later but you have to deal with now :(

Pro Tip: Pay well, allow people to grow & be creative

Strategies To Avoid

Beware of the C word

Consistency?

Convergence?

http://www.slideshare.net/palvaro/ricon-keynote-outwards-from-the-middle-of-the-maze/39

he went there

@palvaro

Complexity

Can't Ops your way out of that

Occam's razor: Simpler theories are preferable to more complex ones

Strategies

Approaches

Eventual / Tunable consistency

Time & Clocks in globally-distributed systems

Location Transparency

Asynchrony

Pub-Sub

Design for scale

Design for Failure

Kafka as Platform Fabric

From MVP to Scalable with Kafka

Separate The Monolith (Scalpel...)

Microservices

Does One Thing, Knows One Thing

Separate low-latency hot path

Separate deploy artifacts

Separate data management clusters by concern: analytics, timeseries, etc.

CQRS: Separate read and write paths

Immutable events stream to Kafka, partitioned by event type, time, etc.

Subscribers & Publishers

RTB microservices - receives raw, receives

Analytics cluster - receives raw, publishes aggregates

Management / Reporting nodes

Services communicate indirectly via Kafka
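Partitioning events by a key such as event type can be sketched like this. It is an illustration only: Kafka's own default partitioner hashes the serialized key differently, and the `Event` shape and the names `TopicRouting`, `partitionFor`, and `route` are assumptions.

```scala
object TopicRouting {
  // Hypothetical event: the type doubles as the partitioning key.
  final case class Event(eventType: String, payload: String)

  // Map a key to a partition. hashCode is used here only for illustration;
  // floorMod keeps the result in [0, numPartitions) for negative hashes.
  def partitionFor(key: String, numPartitions: Int): Int =
    java.lang.Math.floorMod(key.hashCode, numPartitions)

  // Group a batch of events by the partition they would be routed to.
  def route(events: Seq[Event], numPartitions: Int): Map[Int, Seq[Event]] =
    events.groupBy(e => partitionFor(e.eventType, numPartitions))
}
```

Events with the same key always land in the same partition, which is what preserves per-key ordering for subscribers.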

CQRS: Command Query Responsibility Segregation

Decouple Write streams from Read streams

Different schemas / data structures

Writers (Publishers) publish without knowing who needs to receive the data or how to reach them (location, protocol...)

Readers (Subscribers) should be able to subscribe and asynchronously receive from topics of interest
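A minimal sketch of the idea, assuming a hypothetical write-side event (`BidEvent`) and a differently shaped read-side view (`CampaignView`); neither schema comes from the talk.

```scala
object CqrsSketch {
  // Write side: immutable events appended as they happen (hypothetical schema).
  final case class BidEvent(campaignId: String, won: Boolean, priceMicros: Long)

  // Read side: a differently shaped, pre-aggregated view for queries.
  final case class CampaignView(bids: Int, wins: Int, spendMicros: Long)

  // Fold the write-side event log into the read-side view, per campaign.
  def project(log: Seq[BidEvent]): Map[String, CampaignView] =
    log.foldLeft(Map.empty[String, CampaignView]) { (views, e) =>
      val v = views.getOrElse(e.campaignId, CampaignView(0, 0, 0L))
      val updated = CampaignView(
        bids = v.bids + 1,
        wins = v.wins + (if (e.won) 1 else 0),
        spendMicros = v.spendMicros + (if (e.won) e.priceMicros else 0L))
      views.updated(e.campaignId, updated)
    }
}
```

The writer only appends `BidEvent`s; the reader only sees `CampaignView`s, so each side can change its schema or storage independently.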

Eventually Consistent Across DCs

[Architecture diagram: two regions, US-East-1 and EU-west-1, with a Kafka cluster per region (each with ZK) linked by MirrorMaker. In each region, RTB microservices and Mgmt microservices act as Publishers and Subscribers. The Compute Layer and Query Layer sit on an Analytics & ML Cluster and a Timeseries Cluster, running Spark Streaming & ML with Cassandra (C*), using topology-aware Cross DC Replication.]

Kafka Cross Datacenter Mirroring

bin/kafka-run-class.sh kafka.tools.MirrorMaker \
  --consumer.config config/consumer_source_cluster.properties \
  --producer.config config/producer_target_cluster.properties \
  --whitelist bidrequests \
  --num.producers 2 \
  --num.streams 4

Publish messages from various datacenters around the world

Cassandra Cross DC Replication

It's out of the box. Multi-region live backups for free: NetworkTopologyStrategy

Users in the US and UK connect to DCs in their geo region for lower latency

Both DCs are part of the same cluster for X-DC Replication

Configure LB policies to prefer the local DC

LOCAL_QUORUM reads

Data is available cluster-wide for backup, analytics, and to account for user travel across regions

Cassandra Cross DC Replication: Keep EU User Data in the EU

CREATE KEYSPACE rtb WITH REPLICATION = {
  'class': 'NetworkTopologyStrategy',
  'eu-east-dc': '3',
  'eu-west-dc': '3'
};

Cassandra Time Windowed Buckets with TTL

CREATE TABLE rtb.fu_events (
  id int,
  seen_time timeuuid,
  event_time timestamp,
  -- the clustering column must be event_time to match CLUSTERING ORDER BY
  PRIMARY KEY (id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC)
  AND compaction = {
    'class': 'com.jeffjirsa.cassandra.db.compaction.TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',
    'compaction_window_size': '3'
  }
  AND compression = {
    'crc_check_chance': '0.5',
    'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'
  }
  AND bloom_filter_fp_chance = 0.01
  AND caching = '{"keys":"ALL", "rows_per_partition":"100"}'
  AND dclocal_read_repair_chance = 0.0
  AND default_time_to_live = 60
  AND gc_grace_seconds = 0
  AND max_index_interval = 2048
  AND memtable_flush_period_in_ms = 0
  AND min_index_interval = 128
  AND read_repair_chance = 0.0
  AND speculative_retry = '99.0PERCENTILE';

3 DAY buckets: larger SSTables on disk minimize bootstrapping issues when adding nodes to a cluster

3 MINUTE buckets, 1 HOUR buckets, 1 DAY buckets

MICROSECOND resolution

My Nerdy Chart v2.0

Want | Can Or Currently Use | Status | But

Kafka Security | Kafka Security: TLS, Kerberos, SASL, Auth, Encryption, Authentication | v0.9.0, Thanks Jun! |

Integrated Streaming | Kafka Streams: processing inside Kafka, no alternate cluster setup or ops | v0.10, Thanks Guozhang! | It's Java :(

Cassandra CDC | Cassandra CDC. Triggers? Triggers are a pre-commit hook :( | The Epic JIRA: https://issues.apache.org/jira/browse/CASSANDRA-8844 | no comment

And... | Kafka Streams & Kafka Connect Integration | ..wait for it.. | no comment

Always on, X-DC Replication, Flexible Topologies | Kafka, Cassandra | OOTB |

Fault Tolerance | Kafka, Spark, Mesos, Cassandra, Akka | Baked In |

Location Transparency | Kafka, Cassandra, Akka | Check! |

Asynchrony | Kafka, Cassandra, Akka | Check! |

Decoupling | Kafka, Akka | Check! |

Pub-Sub | Kafka, Cassandra, Akka | Check! |

Immutability | Kafka, Akka, Scala | Check! |

Kafka Streams in v0.10

// Word-count-style sketch as shown on the slide. `ser` / `des` are placeholder
// serializers / deserializers; the signatures follow the slide and may not
// match the released API.
val builder = new KStreamBuilder()

val stream: KStream[K, V] = builder.stream(des, des, "raw.data.topic")
  .flatMapValues(value => Arrays.asList(value.toLowerCase.split(" "): _*))
  .map((k, v) => new KeyValue(k, v))
  .countByKey(ser, ser, des, des, "kTable")
  .toStream

stream.to("results.topic", ...)

val streams = new KafkaStreams(builder, props)
streams.start()
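The split-then-count logic that the topology above expresses can be tried on an in-memory batch without any Kafka dependency. This is a stand-in for illustration, not the Kafka Streams API; `WordCount` and `count` are hypothetical names.

```scala
object WordCount {
  // Lower-case and split each value into words, then count per word.
  def count(values: Seq[String]): Map[String, Long] =
    values
      .flatMap(_.toLowerCase.split(" "))
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size.toLong }
}
```

In a streaming setting the same counts would be maintained incrementally per key rather than recomputed over a batch.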

Kafka Streams & Kafka Connect? YES

// Integration as wished for on the slide: using CassandraConnect directly as a
// stream source is hypothetical.
val builder = new KStreamBuilder()

val stream: KStream[K, V] = builder.stream(new CassandraConnect(configs))
  .flatMapValues(..)
  .map((k, v) => new KeyValue(k, v))
  .countByKey(ser, ser, des, des, "kTable")
  .toStream

stream.to("results.topic", ...)

val streams = new KafkaStreams(builder, props)
streams.start()

https://github.com/tuplejump/kafka-connect-cassandra

/** Writes records from Kafka to Cassandra asynchronously and non-blocking. */
override def put(records: JCollection[SinkRecord]): Unit

/** Returns a list of records when available by polling for new records. */
override def poll: JList[SourceRecord]

Frequency Capping

1. Count the number of times user X has seen ad Y from Advertiser A's Campaign C

2. Limit the max number of impressions of an ad within T1...T2

Use Case:

Continuously count impressions grouped by campaign across DCs

low-latency reads & writes

Must scale

Cross DC Counters

Translation: Distributed Counters
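The two steps of frequency capping can be sketched on a single node as follows. The `Impression` shape and all names are hypothetical, and the sketch deliberately ignores the hard part named above, which is keeping these counts consistent across DCs.

```scala
object FrequencyCap {
  // Hypothetical impression record; the timestamp is epoch milliseconds.
  final case class Impression(userId: String, adId: String, campaignId: String, atMs: Long)

  // Step 1: count how often the user saw this ad for this campaign in [t1, t2].
  def seen(log: Seq[Impression], userId: String, adId: String,
           campaignId: String, t1: Long, t2: Long): Int =
    log.count(i =>
      i.userId == userId && i.adId == adId && i.campaignId == campaignId &&
        i.atMs >= t1 && i.atMs <= t2)

  // Step 2: allow another impression only while the count is below the cap.
  def mayShow(log: Seq[Impression], userId: String, adId: String,
              campaignId: String, t1: Long, t2: Long, maxImpressions: Int): Boolean =
    seen(log, userId, adId, campaignId, t1, t2) < maxImpressions
}
```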

Redis? Broke under the load

Aerospike? Great candidate

Eventuate? Interesting, much lighter

Kafka streams when it's out? Interesting, already in the infra

Flink? Very interesting but...

Cassandra Counters - not applicable for this

Frequency Capping

Aerospike

As a distributed counting microservice

As a key-value store for in-memory caching

Fast reads - very read heavy

99% of reads are < 1 ms latency (sweet)

30,000 writes per second

350,000 reads per second on 7 nodes

Replication factor 2

Cross datacenter replication (XDC), SSD-backed

A few excellent posts by Dag, Tapad's CTO, on in-memory infrastructure + Ad Tech (see resources slide)

CRDT: Conflict-Free Replicated Data Type

State-based: objects require only eventual communication between pairs of replicas

Operation-based: replication requires reliable broadcast communication with delivery in a well-defined delivery order

Both are guaranteed to converge towards a common, correct state

Keeping replicas available for writes during a network partition requires resolution of conflicting writes when the partition heals
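As a small illustration of the state-based flavor, here is a grow-only counter (G-Counter). It is a textbook sketch, not code from the talk or from any library; `GCounter` and the replica ids are illustrative names.

```scala
object GCounterSketch {
  // State-based grow-only counter: one monotonically increasing slot per replica.
  final case class GCounter(slots: Map[String, Long] = Map.empty) {
    // A replica only ever increments its own slot.
    def increment(replicaId: String, delta: Long): GCounter = {
      require(delta >= 0, "grow-only counter: delta must be non-negative")
      GCounter(slots.updated(replicaId, slots.getOrElse(replicaId, 0L) + delta))
    }

    // The counter value is the sum over all replicas' slots.
    def value: Long = slots.values.sum

    // Merge takes the per-replica maximum. It is commutative, associative
    // and idempotent, which is what lets replicas converge.
    def merge(other: GCounter): GCounter =
      GCounter((slots.keySet ++ other.slots.keySet).map { id =>
        id -> math.max(slots.getOrElse(id, 0L), other.slots.getOrElse(id, 0L))
      }.toMap)
  }
}
```

Two replicas that counted independently during a partition can exchange state in either order, any number of times, and still end up with the same total.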

Eventuate

A toolkit for building distributed, HA & partition-tolerant event-sourced applications. Developed by Martin Krasser (@mrt1nz) for Red Bull Media (open source)

Interactive, automated conflict resolution (via op-based CRDTs)

Separates command side of an app from its query side (CQRS)

Primary Goals: preserving causality, idempotency & event ordering guarantees even under chaotic conditions

AP of CAP - conflicts cannot be prevented & must be resolved.

Causality - tracked with Vector Clocks

Adapters provide connectivity to other stream processing solutions

Can currently choose Cassandra if desired

Kafka coming soon!

Replication of application state through async event replication across locations

Locations consume replicated events to re-construct application state locally

Multiple locations concurrently update as multi-master

Eventuate as Distributed CRDT Microservice

Applications can continue writing to a local replica during a network partition

Pass To Pipeline:

-> To Cassandra

-> To Kafka (soon)

import scala.concurrent.Future
import akka.actor.{ActorRef, ActorSystem}
import com.rbmhtechnology.eventuate.crdt.{CRDTServiceOps, Counter, CounterService}

class CappingService(val id: String, override val log: ActorRef)
    (implicit val system: ActorSystem,
     val integral: Integral[Int],
     override val ops: CRDTServiceOps[Counter[Int], Int])
  extends CounterService[Int](id, log) {

  // Execution context for the Future combinators below
  import system.dispatcher

  /** Increment-only op: adds `delta` to the counter identified by `id`
    * and returns the updated counter value. Non-positive deltas are ignored. */
  def increment(id: String, delta: Int): Future[Int] = value(id) flatMap {
    case v if v >= 0 && delta > 0 => update(id, delta)
    case v => Future.successful(v)
  }

  start()
}

val a = new CappingService(id1, eventLog)
a.increment(id1, 3)   // Future(3): 3 impressions
a.value(id1)          // Future(3): 3 impressions
a.increment(id1, -2)  // increments only: the negative delta is ignored

val b = new CappingService(id2, eventLog)
b.value(id1)          // Future(a.value(id1))

Knows the same count over n-instances, all geo-locations, for the same id

class CounterService[A : Integral](val replicaId: String, val log: ActorRef) {
  def value(id: String): Future[A] = { ... }
  def update(id: String, delta: A): Future[A] = { ... }
}

Eventuate

Eventuate Takeaway

It's just a jar!

OOTB async internal component messaging and fault tolerance

Integrate with relevant microservices

No store/cache cluster to deploy, just keep monitoring your apps

Written in Scala

Built on Akka - a toolkit for building highly concurrent, distributed, and resilient event-driven applications on the JVM

Analytics & ML

Refresher: Sampling of RTB Events

Ad Request

Bid Request - JSON 100 bytes

Compute optimal bid for advertiser

Bid Response - JSON 1000 bytes (may include ad metadata)

Win Notification (may or may not exist) with settlement price

Ad Impression - when the ad is viewed

Ad Click

Ad Conversion

OpenRTB: objects in the Bid Request model

Streaming Analytics

Top-K highest-performing campaigns

Number of views served in the last 7 days, by country, by city

What determined successful ad conversions

Age distribution per campaign
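Two of these queries can be sketched on an in-memory batch. The `View` record is hypothetical, and view count is used as a stand-in for "performance", which the slide does not define.

```scala
object StreamingAnalyticsSketch {
  // Hypothetical view record; field names are illustrative.
  final case class View(campaignId: String, country: String, city: String)

  // Number of views per (country, city).
  def viewsByLocation(views: Seq[View]): Map[(String, String), Int] =
    views.groupBy(v => (v.country, v.city)).map { case (loc, vs) => loc -> vs.size }

  // Top-K campaigns by view count; ties are broken by campaign id
  // so the result is deterministic.
  def topK(views: Seq[View], k: Int): Seq[(String, Int)] =
    views.groupBy(_.campaignId)
      .map { case (id, vs) => id -> vs.size }
      .toSeq
      .sortBy { case (id, n) => (-n, id) }
      .take(k)
}
```

A streaming job would keep these aggregates updated per micro-batch or per event instead of recomputing them from scratch.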

Spark Streaming Kafka

Can write to Cassandra, FiloDB...

class KafkaStreamingActor(ssc: StreamingContext) extends MyAggregationActor {

  val stream = KafkaUtils.createDirectStream(...).map(RawData(_))

  // Persist the raw events to FiloDB
  stream.foreachRDD(_.toDF.write.format("filodb.spark")
    .option("dataset", "rawdata")
    .save())

  // Pre-aggregate data in the stream for fast querying and aggregation later
  stream.map(hour => (hour.wsid, hour.year, hour.month, hour.day, hour.oneHourPrecip))
    .saveToCassandra(timeseriesKeyspace, dailyPrecipTable)
}

Machine Learning

Train on 1+ week of data for

Recommendations

Bid Optimization

Campaign Optimization

Consumer Profiling

...and much more

Machine Learning

The probability of an ad, from a specific ISP, OS, website, demographic, etc. resulting in a conversion

Which attributes of impressions are good predictors of better ad performance?

Bid Optimization & Predictive Models

Which impressions should an Advertiser bid for?

Per campaign, per country it may run in..?

What is the best bid for each impression

Machine Learning

Train the model: train on every bid request attribute

Score bid requests: determine value of bid request

Compute optimal bid price: based on campaign objectives, against budget

Send bid decision to bidder
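One simple way to turn a predicted conversion probability into a bid is expected-value bidding, capped by campaign limits. This rule is an assumption chosen for illustration; the talk does not say how the bid price is computed, and `BidSketch` and its parameters are hypothetical.

```scala
object BidSketch {
  // Expected-value bidding, assumed here for illustration: bid what a
  // conversion is worth times the predicted chance of getting one,
  // never exceeding the campaign's max bid or its remaining budget.
  def bidPrice(pConversion: Double, valuePerConversion: Double,
               maxBid: Double, remainingBudget: Double): Double = {
    require(pConversion >= 0.0 && pConversion <= 1.0, "probability must be in [0, 1]")
    val expectedValue = pConversion * valuePerConversion
    math.max(0.0, math.min(expectedValue, math.min(maxBid, remainingBudget)))
  }
}
```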

Spark Streaming, MLlib & FiloDB

val ssc = new StreamingContext(sparkConf, Seconds(5))

val dataStream = KafkaUtils.createDirectStream[..](..)
  .map(transformFunc)
  .map(LabeledPoint.parse)

// Persist the training data to FiloDB
dataStream.foreachRDD(_.toDF.write.format("filodb.spark")
  .option("dataset", "training").save())

val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.dense(weights))

// trainOn returns Unit, so it cannot be chained into the model assignment
model.trainOn(dataStream.join(historicalEvents))

model.predictOnValues(dataStream.map(lp => (lp.label, lp.features)))
  .insertIntoFilo("predictions")

700 Queries Per Second: Spark Streaming & FiloDB

Even for datasets with 15 million rows! Using FiloDB's InMemoryColumnStore

Single host / MBP

5GB RAM

SQL to DataFrame caching

https://github.com/tuplejump/FiloDB

Evan Chan's (@velvia) blog post: NoLambda: A new architecture combining streaming, ad hoc, machine-learning, and batch analytics

Eventually Consistent Across DCs

[Architecture diagram, repeated from earlier in the deck]

Self-Healing Systems

Massive event spikes & bursty traffic

Fast producers / slow consumers

Network partitioning & out of sync systems

DC down

Not DDoS'ing ourselves from fast streams

No data loss when auto-scaling down

Byzantine Fault Tolerance?

Looks like I'll miss standup

Everything fails, all the time

Monitor Everything

Resources

Non-Monotonic Snapshot Isolation: scalable and strong consistency for geo-replicated transactional systems

Conflict-free Replicated Data Types

Implementing operation-based CRDTs

http://codebetter.com/gregyoung/2010/02/16/cqrs-task-based-uis-event-sourcing-agh

http://martinfowler.com/bliki/CQRS.html

http://github.com/openrtb/OpenRTB

http://akka.io

http://rbmhtechnology.github.io/eventuate

https://github.com/RBMHTechnology/eventuate

http://rbmhtechnology.github.io/eventuate/user-guide.html#commutative-replicated-data-types

http://www.planetcassandra.org/data-replication-in-nosql-databases-explained

http://wikibon.org/wiki/v/Optimizing_Infrastructure_for_Analytics-Driven_Real-Time_Decision_Making

twitter.com/helenaedelson

github.com/helena

slideshare.net/helenaedelson

Thanks!
