Laying down the SMACK on your data pipelines

Patrick McFadin (@PatrickMcFadin), Chief Evangelist for Apache Cassandra

Uploaded by patrick-mcfadin on 13-Apr-2017


TRANSCRIPT

Page 1: Laying down the smack on your data pipelines

@PatrickMcFadin

Patrick McFadin, Chief Evangelist for Apache Cassandra

Laying down the SMACK on your data pipelines

Page 2: Laying down the smack on your data pipelines

The problem

Your Magical App

Page 3: Laying down the smack on your data pipelines

Sad solutions

Page 4: Laying down the smack on your data pipelines

SMACK

Page 5: Laying down the smack on your data pipelines

Spark, Mesos, Akka, Cassandra, Kafka

Page 6: Laying down the smack on your data pipelines

Kafka, Akka, Spark, Cassandra: Organize, Process, Store

Mesos, running multiple instances of each: Kafka, Spark, Akka, Cassandra

Page 7: Laying down the smack on your data pipelines

Kafka, Akka, Spark, Cassandra: Organize, Process, Store

Page 8: Laying down the smack on your data pipelines

Managing Weather Data

Windsor, California: 67.3°F, rainfall total: 1.2 cm

Today: High 73.4°F, Low 51.4°F
Yesterday: High 75.2°F, Low 52.3°F

Our Magical App: reactive and immediate, plus batch

Page 9: Laying down the smack on your data pipelines

KillrWeather

Windsor, California: 67.3°F, rainfall total: 1.2 cm

Today: High 73.4°F, Low 51.4°F
Yesterday: High 75.2°F, Low 52.3°F

Page 10: Laying down the smack on your data pipelines

https://github.com/killrweather/killrweather

Page 11: Laying down the smack on your data pipelines

Spark, Mesos, Akka, Cassandra, Kafka

Page 12: Laying down the smack on your data pipelines

Kafka

Page 13: Laying down the smack on your data pipelines

Kafka decouples data pipelines

Page 14: Laying down the smack on your data pipelines

The problem

Kitchen

Hamburger please

Meat disk on bread please

Page 15: Laying down the smack on your data pipelines

The problem

Kitchen

Page 16: Laying down the smack on your data pipelines

The problem

Kitchen

Order Queue

Hamburger please

Order

Page 17: Laying down the smack on your data pipelines

The problem

Kitchen

Order Queue

Page 18: Laying down the smack on your data pipelines

The problem

Kitchen

Order Queue

Meat disk on bread please

You mean a Hamburger?

Uh yeah. That.

Order

Page 19: Laying down the smack on your data pipelines

Order from chaos

Producer → Topic = Food → Consumer

(Pages 19-33 are animation frames of this one diagram: the producer appends Order 1 through Order 5 to the Food topic, and the consumer reads them in that same order.)

Page 34: Laying down the smack on your data pipelines

Scale

Producer → Topic = Hamburgers (Order 1-5) → Consumer
Producer → Topic = Pizza (Order 1-5) → Consumer

(The single Food topic scales out into separate Hamburgers and Pizza topics.)

Page 35: Laying down the smack on your data pipelines

Kafka

Broker
Producer: Collection API
Topic = Temperature (Temp 1-5) → Consumer: Temperature Processor
Topic = Precipitation (Precip 1-5) → Consumer: Precipitation Processor

Page 36: Laying down the smack on your data pipelines

Kafka

Broker
Producer: Collection API
Topic = Temperature, Partition 0 (Temp 1-5) → Consumer: Temperature Processor
Topic = Precipitation, Partition 0 (Precip 1-5) → Consumer: Precipitation Processor

Page 37: Laying down the smack on your data pipelines

Kafka

Broker
Producer: Collection API
Topic = Temperature, Partition 0 (Temp 1-5) → Temperature Processor
Topic = Temperature, Partition 1 (Temp 1-5) → Temperature Processor
Topic = Precipitation, Partition 0 (Precip 1-5) → Precipitation Processor

Page 38: Laying down the smack on your data pipelines

Kafka

Two Brokers, each holding Topic = Temperature (Partitions 0 and 1, Temp 1-5) and Topic = Precipitation (Partition 0, Precip 1-5)

Topic Temperature: Replication Factor = 2
Topic Precipitation: Replication Factor = 2

Page 39: Laying down the smack on your data pipelines

Kafka

Producer: Collection API → two Brokers, each with Topic = Temperature (Partitions 0 and 1) and Topic = Precipitation (Partition 0)
Consumers: Temperature Processors and the Precipitation Processor read from both Brokers

Topic Temperature: Replication Factor = 2
Topic Precipitation: Replication Factor = 2

Page 40: Laying down the smack on your data pipelines

Guarantees

Order
• Messages are ordered as they are sent by the producer
• Consumers see messages in the order they were inserted by the producer

Durability
• Messages are delivered at least once
• With a Replication Factor of N, up to N-1 server failures can be tolerated without losing committed messages
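The ordering and at-least-once guarantees above can be sketched with a toy, single-partition log (illustrative Python, not the Kafka API):

```python
# A toy sketch of the guarantees above: a single-partition log preserves
# producer order, and an uncommitted consumer offset means messages can be
# redelivered, i.e. at-least-once delivery.

class PartitionLog:
    def __init__(self):
        self.messages = []                # append-only; arrival order is kept

    def produce(self, msg):
        self.messages.append(msg)
        return len(self.messages) - 1     # the offset assigned to this message

class Consumer:
    def __init__(self, log):
        self.log = log
        self.committed = 0                # offset of the next unread message

    def poll(self):
        return self.log.messages[self.committed:]

    def commit(self, count):
        self.committed += count

log = PartitionLog()
for n in range(1, 4):
    log.produce(f"Order {n}")

consumer = Consumer(log)
first = consumer.poll()                   # reads orders in producer order
# A crash before commit means the next poll re-reads the same messages:
redelivered = consumer.poll()
consumer.commit(len(redelivered))
```

A real consumer commits offsets back to Kafka; here the uncommitted offset is what makes the redelivery visible.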

Page 41: Laying down the smack on your data pipelines

Spark, Mesos, Akka, Cassandra, Kafka

Page 42: Laying down the smack on your data pipelines

Akka

Page 43: Laying down the smack on your data pipelines

Akka in a nutshell
• Highly concurrent
• Reactive
• Fully distributed
• Completely elastic and resilient

(Diagram: four Actors, each with its own Mailbox)
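The actor model in the diagram can be sketched in a few lines (illustrative Python, not the Akka API): each actor owns a private mailbox and handles one message at a time, which is why actor state needs no locks.

```python
from collections import deque

class Actor:
    def __init__(self):
        self.mailbox = deque()
        self.processed = []

    def tell(self, msg):
        # Asynchronous send: just enqueue; the sender never blocks.
        self.mailbox.append(msg)

    def run(self):
        # Drain the mailbox sequentially, preserving send order.
        while self.mailbox:
            self.receive(self.mailbox.popleft())

    def receive(self, msg):
        self.processed.append(msg)

actor = Actor()
actor.tell("GetDailyTemperature")
actor.tell("DailyTemperature")
actor.run()
```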

Page 44: Laying down the smack on your data pipelines

KafkaStreamingActor
• Pulls from the Kafka queue
• Immediately saves to a Cassandra counter

kafkaStream.map { weather =>
  (weather.wsid, weather.year, weather.month, weather.day, weather.oneHourPrecip)
}.saveToCassandra(CassandraKeyspace, CassandraTableDailyPrecip)

Page 45: Laying down the smack on your data pipelines

Temperature High/Low Stream

Weather Stations → Receive API (Producer) → TemperatureActors (Consumer)

Page 46: Laying down the smack on your data pipelines

TemperatureActor

class TemperatureActor(sc: SparkContext, settings: WeatherSettings)
  extends WeatherActor with ActorLogging {

  def receive: Actor.Receive = {
    case e: GetDailyTemperature        => daily(e.day, sender)
    case e: DailyTemperature           => store(e)
    case e: GetMonthlyHiLowTemperature => highLow(e, sender)
  }

Page 47: Laying down the smack on your data pipelines

TemperatureActor

/** Computes and sends the daily aggregation to the `requester` actor.
  * We aggregate this data on demand rather than in the stream.
  *
  * For the given day of the year, aggregates the 0-23 hourly temp values into
  * statistics (high, low, mean, stdev, etc.) and persists them to the Cassandra
  * daily temperature table by weather station. The Cassandra schema keeps rows
  * sorted most-recent-first, so no sort is needed in Spark.
  *
  * Because the government data is keyed by specific date/time rather than by
  * interval (window/slide), we look for historic data for hours 0-23 that may
  * or may not exist yet, and compute stats on whatever exists at request time.
  */
def daily(day: Day, requester: ActorRef): Unit =
  (for {
    aggregate <- sc.cassandraTable[Double](keyspace, rawtable)
      .select("temperature")
      .where("wsid = ? AND year = ? AND month = ? AND day = ?",
        day.wsid, day.year, day.month, day.day)
      .collectAsync()
  } yield forDay(day, aggregate)) pipeTo requester

Page 48: Laying down the smack on your data pipelines

TemperatureActor

/** Handles only the 0-23 hourly values for a day, or fewer. */
private def forDay(key: Day, temps: Seq[Double]): WeatherAggregate =
  if (temps.nonEmpty) {
    val stats = StatCounter(temps)
    val data = DailyTemperature(
      key.wsid, key.year, key.month, key.day,
      high = stats.max, low = stats.min, mean = stats.mean,
      variance = stats.variance, stdev = stats.stdev)

    self ! data
    data
  } else NoDataAvailable(key.wsid, key.year, classOf[DailyTemperature])

Page 49: Laying down the smack on your data pipelines

TemperatureActor

class TemperatureActor(sc: SparkContext, settings: WeatherSettings)
  extends WeatherActor with ActorLogging {

  def receive: Actor.Receive = {
    case e: GetDailyTemperature        => daily(e.day, sender)
    case e: DailyTemperature           => store(e)
    case e: GetMonthlyHiLowTemperature => highLow(e, sender)
  }

Page 50: Laying down the smack on your data pipelines

TemperatureActor

/** Stores the daily temperature aggregates asynchronously to the daily
  * temperature aggregation table; triggered by on-demand requests via
  * `self ! data` in `forDay`. */
private def store(e: DailyTemperature): Unit =
  sc.parallelize(Seq(e)).saveToCassandra(keyspace, dailytable)

Page 51: Laying down the smack on your data pipelines

Spark, Mesos, Akka, Cassandra, Kafka

Page 52: Laying down the smack on your data pipelines

Cassandra

Page 53: Laying down the smack on your data pipelines

Node (one server)

Page 54: Laying down the smack on your data pipelines

Token

• Consistent hash of the partition key, between -2^63 and 2^63-1
• Each node owns a range of those values
• The token is the beginning of that range, up to the next node's token value
• Virtual Nodes break these ranges down further

(Diagram: a server mapping Data → Token Range, starting at 0)

Page 55: Laying down the smack on your data pipelines

Cluster

Server  Token  Range
        0      0-100

Page 56: Laying down the smack on your data pipelines

Cluster

Server  Token  Range
        0      0-50
        51     51-100

Page 57: Laying down the smack on your data pipelines

Cluster

Server  Token  Range
        0      0-25
        26     26-50
        51     51-75
        76     76-100
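The ring above can be sketched as a sorted list of node tokens (illustrative Python, not Cassandra's partitioner; the 10.0.0.x names match the Replication slides later in the deck):

```python
import bisect

# Each node's token starts the range it owns, which runs up to the next
# node's token (numbers follow the 0-100 example above).
node_tokens = [0, 26, 51, 76]
nodes = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]

def owner(token):
    # The owning node is the one with the largest token <= this value.
    i = bisect.bisect_right(node_tokens, token) - 1
    return nodes[i]
```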

Page 58: Laying down the smack on your data pipelines

Table

CREATE TABLE weather_station (
  id text,
  name text,
  country_code text,
  state_code text,
  call_sign text,
  lat double,
  long double,
  elevation double,
  PRIMARY KEY (id)
);

(Callouts: table name; column name and column CQL type; primary key designation; partition key)

Page 59: Laying down the smack on your data pipelines

Queries supported

CREATE TABLE raw_weather_data (
  wsid text,
  year int,
  month int,
  day int,
  hour int,
  temperature double,
  dewpoint double,
  pressure double,
  wind_direction int,
  wind_speed double,
  sky_condition int,
  sky_condition_text text,
  one_hour_precip double,
  six_hour_precip double,
  PRIMARY KEY ((wsid), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);

Get weather data given:
• Weather Station ID
• Weather Station ID and Time
• Weather Station ID and Range of Time
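What the CLUSTERING ORDER BY clause buys you can be sketched by emulating the on-disk order of one wsid partition (illustrative Python; the wsid-1 rows are made up):

```python
# Within one wsid partition, rows are kept sorted by the clustering columns
# (year, month, day, hour) descending, so "most recent first" reads need no
# client-side sort. Cassandra maintains this order on disk; we emulate it.
rows = [
    ("wsid-1", 2014, 6, 2, 5),
    ("wsid-1", 2015, 1, 1, 0),
    ("wsid-1", 2014, 6, 2, 23),
]

stored_order = sorted(rows, key=lambda r: r[1:], reverse=True)
```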

Page 60: Laying down the smack on your data pipelines

Replication

DC1: RF=1

Node      Primary
10.0.0.1  00-25
10.0.0.2  26-50
10.0.0.3  51-75
10.0.0.4  76-100

Page 61: Laying down the smack on your data pipelines

Replication

DC1: RF=2

Node      Primary  Replica
10.0.0.1  00-25    76-100
10.0.0.2  26-50    00-25
10.0.0.3  51-75    26-50
10.0.0.4  76-100   51-75

Page 62: Laying down the smack on your data pipelines

Replication

DC1: RF=3

Node      Primary  Replica  Replica
10.0.0.1  00-25    76-100   51-75
10.0.0.2  26-50    00-25    76-100
10.0.0.3  51-75    26-50    00-25
10.0.0.4  76-100   51-75    26-50
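The replica columns above follow a simple rule: the primary owner plus the next RF-1 nodes clockwise around the ring each hold a copy. A SimpleStrategy-style sketch (illustrative Python):

```python
import bisect

node_tokens = [0, 26, 51, 76]
nodes = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]

def replicas(token, rf):
    # Primary owner, then the next rf-1 nodes clockwise around the ring.
    i = bisect.bisect_right(node_tokens, token) - 1
    return [nodes[(i + k) % len(nodes)] for k in range(rf)]
```

With rf=3, a token in the 00-25 range lands on 10.0.0.1, 10.0.0.2, and 10.0.0.3, matching the table.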

Page 63: Laying down the smack on your data pipelines

Consistency

DC1: RF=3 (node/replica layout as on the Replication slide)

Client: Write to partition 15

Page 64: Laying down the smack on your data pipelines

Consistency level

Consistency Level  Nodes Acknowledged
One                One (read repair triggered)
Local One          One in the local DC (read repair in the local DC)
Quorum             51%
Local Quorum       51% in the local DC
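The 51% rows are just "a majority of replicas"; the usual formula is floor(RF / 2) + 1:

```python
def quorum(rf):
    # A majority of the rf replicas must acknowledge.
    return rf // 2 + 1
```

So with RF=3, a Quorum write needs 2 of the 3 replicas to acknowledge.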

Page 65: Laying down the smack on your data pipelines

Consistency

DC1: RF=3 (layout as above)

Client: Write to partition 15, CL = One


Page 67: Laying down the smack on your data pipelines

Consistency

DC1: RF=3 (layout as above)

Client: Write to partition 15, CL = Quorum

Page 68: Laying down the smack on your data pipelines

Multi-datacenter

DC1: RF=3 and DC2: RF=3, each with the same node/primary/replica layout (nodes 10.0.0.1-4 in DC1, 10.1.0.1-4 in DC2)

Client: Write to partition 15



Page 71: Laying down the smack on your data pipelines

Spark, Mesos, Akka, Cassandra, Kafka

Page 72: Laying down the smack on your data pipelines

Spark

Page 73: Laying down the smack on your data pipelines

Great combo

Store a ton of data. Analyze a ton of data.

Page 74: Laying down the smack on your data pipelines

Great combo

Spark Streaming: Near Real-time
Spark SQL: Structured Data
MLlib: Machine Learning
GraphX: Graph Analysis

Page 75: Laying down the smack on your data pipelines

Great combo

Spark Streaming (Near Real-time), Spark SQL (Structured Data), MLlib (Machine Learning), GraphX (Graph Analysis), connected via the Spark Connector to:

CREATE TABLE raw_weather_data (
  wsid text,
  year int,
  month int,
  day int,
  hour int,
  temperature double,
  dewpoint double,
  pressure double,
  wind_direction int,
  wind_speed double,
  sky_condition int,
  sky_condition_text text,
  one_hour_precip double,
  six_hour_precip double,
  PRIMARY KEY ((wsid), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);

Page 76: Laying down the smack on your data pipelines

Master

Worker
  Executor
  Executor
  Executor

Server

Page 77: Laying down the smack on your data pipelines

Master

Four Workers, Token Ranges 0-100:
0-24, 25-49, 50-74, 75-99

"I will only analyze 25% of the data."
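The 25% claim can be sketched as token-range pushdown: with a worker co-located on each node, a full-table scan becomes four range-restricted queries like the one shown on a later slide (illustrative Python; the worker names are made up):

```python
# Each worker scans only its node's token range, so the full-table scan
# is split four ways and every read stays local to its node.
ranges = {
    "worker-1": (0, 24),
    "worker-2": (25, 49),
    "worker-3": (50, 74),
    "worker-4": (75, 99),
}

def local_scan(worker, table="keyspace.table"):
    lo, hi = ranges[worker]
    return (f"SELECT * FROM {table} "
            f"WHERE token(pk) > {lo} AND token(pk) <= {hi}")
```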

Page 78: Laying down the smack on your data pipelines

Master

Four Workers, one per token range (0-24, 25-49, 50-74, 75-99), each co-located with the node that owns that range

Analytics + Transactional

Page 79: Laying down the smack on your data pipelines

Master

Worker (token range 75-99)
  Executor
  Executor
  Executor

The Spark Connector issues:
SELECT * FROM keyspace.table WHERE token(pk) > 75 AND token(pk) <= 99

Spark RDD → Spark Partitions

Page 80: Laying down the smack on your data pipelines

Master

Worker (token range 75-99)
  Executor
  Executor
  Executor

Spark RDD → Spark Partitions

Page 81: Laying down the smack on your data pipelines

Simple example

/** keyspace & table */
val tableRDD = sc.cassandraTable("isd_weather_data", "raw_weather_data")

/** get a simple count of all the rows in the raw_weather_data table */
val rowCount = tableRDD.count()
println(s"Total Rows in Raw Weather Table: $rowCount")
sc.stop()

(The Spark Connector runs SELECT * FROM isd_weather_data.raw_weather_data on the executors, feeding the Spark Partitions of the RDD.)

Page 82: Laying down the smack on your data pipelines

Saving back the weather data

val cc = new CassandraSQLContext(sc)
cc.setKeyspace("isd_weather_data")
cc.sql("""
    SELECT wsid, year, month, day,
           max(temperature) high, min(temperature) low
    FROM raw_weather_data
    WHERE month = 6 AND temperature != 0.0
    GROUP BY wsid, year, month, day;
  """)
  .map { row =>
    (row.getString(0), row.getInt(1), row.getInt(2), row.getInt(3),
     row.getDouble(4), row.getDouble(5))
  }
  .saveToCassandra("isd_weather_data", "daily_aggregate_temperature")

Page 83: Laying down the smack on your data pipelines

Spark Streaming - Micro Batching

© 2015. All Rights Reserved.

Page 84: Laying down the smack on your data pipelines

DStream


Page 85: Laying down the smack on your data pipelines

Sliding Windows


Page 86: Laying down the smack on your data pipelines

Spark, Mesos, Akka, Cassandra, Kafka

Page 87: Laying down the smack on your data pipelines

Mesos

Page 88: Laying down the smack on your data pipelines

(Diagram: many Kafka, Spark, Akka, and Cassandra instances)

"I need CPU!!" "I need memory!!"
Mesos: "Got you covered."

Page 89: Laying down the smack on your data pipelines

Kafka, Akka, Akka, Akka
Kafka, Spark, Spark

Page 90: Laying down the smack on your data pipelines

Kafka, Akka, Akka, Akka
Kafka, Spark, Spark

Page 91: Laying down the smack on your data pipelines
Page 92: Laying down the smack on your data pipelines

Kafka on Mesos example

Scheduler
• Provides the operational automation for a Kafka cluster
• Manages changes to the brokers' configuration
• Exposes a REST API for the CLI, or any other client, to use
• Runs on Marathon for high availability

Executor
• Interacts with the Kafka broker as an intermediary for the scheduler
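Mesos' two-level scheduling, which the Kafka scheduler above plugs into, can be sketched as accepting resource offers (toy Python, not the Mesos API; node names and sizes are made up):

```python
# The master offers each agent's spare resources; a framework scheduler
# (like the Kafka scheduler) accepts offers that fit and launches tasks.
offers = [
    {"agent": "node-1", "cpus": 4, "mem_gb": 16},
    {"agent": "node-2", "cpus": 2, "mem_gb": 64},
]

def accept_offers(offers, need_cpus, need_mem_gb):
    launched = []
    for offer in offers:
        if offer["cpus"] >= need_cpus and offer["mem_gb"] >= need_mem_gb:
            launched.append({"task": "kafka-broker", "agent": offer["agent"]})
    return launched

brokers = accept_offers(offers, need_cpus=4, need_mem_gb=8)
```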

Page 93: Laying down the smack on your data pipelines

Kafka, Akka, Spark, Cassandra: Organize, Process, Store

Mesos

Page 94: Laying down the smack on your data pipelines

Go get your SMACK on

Thank you!

Follow me on Twitter: @PatrickMcFadin