Laying down the SMACK on your data pipelines

Patrick McFadin (@PatrickMcFadin), Chief Evangelist for Apache Cassandra

Uploaded by patrick-mcfadin on 13-Apr-2017


TRANSCRIPT

Page 1: Laying down the smack on your data pipelines

@PatrickMcFadin

Patrick McFadin, Chief Evangelist for Apache Cassandra

Laying down the SMACK on your data pipelines

Page 2: Laying down the smack on your data pipelines

The problem

Your Magical App

Page 3: Laying down the smack on your data pipelines

Sad solutions

Page 4: Laying down the smack on your data pipelines

SMACK

Page 5: Laying down the smack on your data pipelines

Spark, Mesos, Akka, Cassandra, Kafka

Page 6: Laying down the smack on your data pipelines

Kafka, Akka, Spark, Cassandra: Organize, Process, Store

Mesos, running multiple instances of each: Kafka, Spark, Akka, Cassandra

Page 7: Laying down the smack on your data pipelines

Kafka, Akka, Spark, Cassandra: Organize, Process, Store

Page 8: Laying down the smack on your data pipelines

Managing Weather Data

Windsor, California: 67.3°F, rainfall total: 1.2 cm

Today: High 73.4°F, Low 51.4°F
Yesterday: High 75.2°F, Low 52.3°F

Our Magical App: reactive and immediate, plus batch

Page 9: Laying down the smack on your data pipelines

KillrWeather

Windsor, California: 67.3°F, rainfall total: 1.2 cm

Today: High 73.4°F, Low 51.4°F
Yesterday: High 75.2°F, Low 52.3°F

Page 10: Laying down the smack on your data pipelines

https://github.com/killrweather/killrweather

Page 11: Laying down the smack on your data pipelines

Spark, Mesos, Akka, Cassandra, Kafka

Page 12: Laying down the smack on your data pipelines

Kafka

Page 13: Laying down the smack on your data pipelines

Kafka decouples data pipelines

Page 14: Laying down the smack on your data pipelines

The problem

Kitchen

Hamburger please

Meat disk on bread please

Page 15: Laying down the smack on your data pipelines

The problem

Kitchen

Page 16: Laying down the smack on your data pipelines

The problem

Kitchen

Order Queue

Hamburger please

Order

Page 17: Laying down the smack on your data pipelines

The problem

Kitchen

Order Queue

Page 18: Laying down the smack on your data pipelines

The problem

Kitchen

Order Queue

Meat disk on bread please

You mean a Hamburger?

Uh yeah. That.

Order

Page 19: Laying down the smack on your data pipelines

Order from chaos

Producer → Topic = Food → Consumer

(Pages 19-33 are animation frames of this one diagram: the producer appends Order 1 through Order 5 to the Food topic, and the consumer reads them in that same order.)

Page 34: Laying down the smack on your data pipelines

Scale

Producer → Topic = Hamburgers (Order 1-5) → Consumer
Producer → Topic = Pizza (Order 1-5) → Consumer

(The single Food topic scales out into separate Hamburgers and Pizza topics.)

Page 35: Laying down the smack on your data pipelines

Kafka

Broker
Producer: Collection API
Topic = Temperature (Temp 1-5) → Consumer: Temperature Processor
Topic = Precipitation (Precip 1-5) → Consumer: Precipitation Processor

Page 36: Laying down the smack on your data pipelines

Kafka

Broker
Producer: Collection API
Topic = Temperature, Partition 0 (Temp 1-5) → Consumer: Temperature Processor
Topic = Precipitation, Partition 0 (Precip 1-5) → Consumer: Precipitation Processor

Page 37: Laying down the smack on your data pipelines

Kafka

Broker
Producer: Collection API
Topic = Temperature, Partition 0 (Temp 1-5) → Temperature Processor
Topic = Temperature, Partition 1 (Temp 1-5) → Temperature Processor
Topic = Precipitation, Partition 0 (Precip 1-5) → Precipitation Processor

Page 38: Laying down the smack on your data pipelines

Kafka

Two Brokers, each holding Topic = Temperature (Partitions 0 and 1, Temp 1-5) and Topic = Precipitation (Partition 0, Precip 1-5)

Topic Temperature: Replication Factor = 2
Topic Precipitation: Replication Factor = 2

Page 39: Laying down the smack on your data pipelines

Kafka

Producer: Collection API → two Brokers, each with Topic = Temperature (Partitions 0 and 1) and Topic = Precipitation (Partition 0)
Consumers: Temperature Processors and the Precipitation Processor read from both Brokers

Topic Temperature: Replication Factor = 2
Topic Precipitation: Replication Factor = 2

Page 40: Laying down the smack on your data pipelines

Guarantees

Order
• Messages are ordered as they are sent by the producer
• Consumers see messages in the order they were inserted by the producer

Durability
• Messages are delivered at least once
• With a Replication Factor of N, up to N-1 server failures can be tolerated without losing committed messages
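The ordering and at-least-once guarantees above can be sketched with a toy, single-partition log (illustrative Python, not the Kafka API):

```python
# A toy sketch of the guarantees above: a single-partition log preserves
# producer order, and an uncommitted consumer offset means messages can be
# redelivered, i.e. at-least-once delivery.

class PartitionLog:
    def __init__(self):
        self.messages = []                # append-only; arrival order is kept

    def produce(self, msg):
        self.messages.append(msg)
        return len(self.messages) - 1     # the offset assigned to this message

class Consumer:
    def __init__(self, log):
        self.log = log
        self.committed = 0                # offset of the next unread message

    def poll(self):
        return self.log.messages[self.committed:]

    def commit(self, count):
        self.committed += count

log = PartitionLog()
for n in range(1, 4):
    log.produce(f"Order {n}")

consumer = Consumer(log)
first = consumer.poll()                   # reads orders in producer order
# A crash before commit means the next poll re-reads the same messages:
redelivered = consumer.poll()
consumer.commit(len(redelivered))
```

A real consumer commits offsets back to Kafka; here the uncommitted offset is what makes the redelivery visible.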

Page 41: Laying down the smack on your data pipelines

Spark, Mesos, Akka, Cassandra, Kafka

Page 42: Laying down the smack on your data pipelines

Akka

Page 43: Laying down the smack on your data pipelines

Akka in a nutshell
• Highly concurrent
• Reactive
• Fully distributed
• Completely elastic and resilient

(Diagram: four Actors, each with its own Mailbox)
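The actor model in the diagram can be sketched in a few lines (illustrative Python, not the Akka API): each actor owns a private mailbox and handles one message at a time, which is why actor state needs no locks.

```python
from collections import deque

class Actor:
    def __init__(self):
        self.mailbox = deque()
        self.processed = []

    def tell(self, msg):
        # Asynchronous send: just enqueue; the sender never blocks.
        self.mailbox.append(msg)

    def run(self):
        # Drain the mailbox sequentially, preserving send order.
        while self.mailbox:
            self.receive(self.mailbox.popleft())

    def receive(self, msg):
        self.processed.append(msg)

actor = Actor()
actor.tell("GetDailyTemperature")
actor.tell("DailyTemperature")
actor.run()
```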

Page 44: Laying down the smack on your data pipelines

KafkaStreamingActor
• Pulls from the Kafka queue
• Immediately saves to a Cassandra counter

kafkaStream.map { weather =>
  (weather.wsid, weather.year, weather.month, weather.day, weather.oneHourPrecip)
}.saveToCassandra(CassandraKeyspace, CassandraTableDailyPrecip)

Page 45: Laying down the smack on your data pipelines

Temperature High/Low Stream

Weather Stations → Receive API (Producer) → TemperatureActors (Consumer)

Page 46: Laying down the smack on your data pipelines

TemperatureActor

class TemperatureActor(sc: SparkContext, settings: WeatherSettings)
  extends WeatherActor with ActorLogging {

  def receive: Actor.Receive = {
    case e: GetDailyTemperature        => daily(e.day, sender)
    case e: DailyTemperature           => store(e)
    case e: GetMonthlyHiLowTemperature => highLow(e, sender)
  }

Page 47: Laying down the smack on your data pipelines

TemperatureActor

/** Computes and sends the daily aggregation to the `requester` actor.
  * We aggregate this data on demand rather than in the stream.
  *
  * For the given day of the year, aggregates the 0-23 hourly temp values into
  * statistics (high, low, mean, stdev, etc.) and persists them to the Cassandra
  * daily temperature table by weather station. The Cassandra schema keeps rows
  * sorted most-recent-first, so no sort is needed in Spark.
  *
  * Because the government data is keyed by specific date/time rather than by
  * interval (window/slide), we look for historic data for hours 0-23 that may
  * or may not exist yet, and compute stats on whatever exists at request time.
  */
def daily(day: Day, requester: ActorRef): Unit =
  (for {
    aggregate <- sc.cassandraTable[Double](keyspace, rawtable)
      .select("temperature")
      .where("wsid = ? AND year = ? AND month = ? AND day = ?",
        day.wsid, day.year, day.month, day.day)
      .collectAsync()
  } yield forDay(day, aggregate)) pipeTo requester

Page 48: Laying down the smack on your data pipelines

TemperatureActor

/** Handles only the 0-23 hourly values for a day, or fewer. */
private def forDay(key: Day, temps: Seq[Double]): WeatherAggregate =
  if (temps.nonEmpty) {
    val stats = StatCounter(temps)
    val data = DailyTemperature(
      key.wsid, key.year, key.month, key.day,
      high = stats.max, low = stats.min, mean = stats.mean,
      variance = stats.variance, stdev = stats.stdev)

    self ! data
    data
  } else NoDataAvailable(key.wsid, key.year, classOf[DailyTemperature])

Page 49: Laying down the smack on your data pipelines

TemperatureActor

class TemperatureActor(sc: SparkContext, settings: WeatherSettings)
  extends WeatherActor with ActorLogging {

  def receive: Actor.Receive = {
    case e: GetDailyTemperature        => daily(e.day, sender)
    case e: DailyTemperature           => store(e)
    case e: GetMonthlyHiLowTemperature => highLow(e, sender)
  }

Page 50: Laying down the smack on your data pipelines

TemperatureActor

/** Stores the daily temperature aggregates asynchronously to the daily
  * temperature aggregation table; triggered by on-demand requests via
  * `self ! data` in `forDay`. */
private def store(e: DailyTemperature): Unit =
  sc.parallelize(Seq(e)).saveToCassandra(keyspace, dailytable)

Page 51: Laying down the smack on your data pipelines

Spark, Mesos, Akka, Cassandra, Kafka

Page 52: Laying down the smack on your data pipelines

Cassandra

Page 53: Laying down the smack on your data pipelines

Node (one server)

Page 54: Laying down the smack on your data pipelines

Token

• Consistent hash of the partition key, between -2^63 and 2^63-1
• Each node owns a range of those values
• The token is the beginning of that range, up to the next node's token value
• Virtual Nodes break these ranges down further

(Diagram: a server mapping Data → Token Range, starting at 0)

Page 55: Laying down the smack on your data pipelines

Cluster

Server  Token  Range
        0      0-100

Page 56: Laying down the smack on your data pipelines

Cluster

Server  Token  Range
        0      0-50
        51     51-100

Page 57: Laying down the smack on your data pipelines

Cluster

Server  Token  Range
        0      0-25
        26     26-50
        51     51-75
        76     76-100
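The ring above can be sketched as a sorted list of node tokens (illustrative Python, not Cassandra's partitioner; the 10.0.0.x names match the Replication slides later in the deck):

```python
import bisect

# Each node's token starts the range it owns, which runs up to the next
# node's token (numbers follow the 0-100 example above).
node_tokens = [0, 26, 51, 76]
nodes = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]

def owner(token):
    # The owning node is the one with the largest token <= this value.
    i = bisect.bisect_right(node_tokens, token) - 1
    return nodes[i]
```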

Page 58: Laying down the smack on your data pipelines

Table

CREATE TABLE weather_station (
  id text,
  name text,
  country_code text,
  state_code text,
  call_sign text,
  lat double,
  long double,
  elevation double,
  PRIMARY KEY (id)
);

(Callouts: table name; column name and column CQL type; primary key designation; partition key)

Page 59: Laying down the smack on your data pipelines

Queries supported

CREATE TABLE raw_weather_data (
  wsid text,
  year int,
  month int,
  day int,
  hour int,
  temperature double,
  dewpoint double,
  pressure double,
  wind_direction int,
  wind_speed double,
  sky_condition int,
  sky_condition_text text,
  one_hour_precip double,
  six_hour_precip double,
  PRIMARY KEY ((wsid), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);

Get weather data given:
• Weather Station ID
• Weather Station ID and Time
• Weather Station ID and Range of Time
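What the CLUSTERING ORDER BY clause buys you can be sketched by emulating the on-disk order of one wsid partition (illustrative Python; the wsid-1 rows are made up):

```python
# Within one wsid partition, rows are kept sorted by the clustering columns
# (year, month, day, hour) descending, so "most recent first" reads need no
# client-side sort. Cassandra maintains this order on disk; we emulate it.
rows = [
    ("wsid-1", 2014, 6, 2, 5),
    ("wsid-1", 2015, 1, 1, 0),
    ("wsid-1", 2014, 6, 2, 23),
]

stored_order = sorted(rows, key=lambda r: r[1:], reverse=True)
```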

Page 60: Laying down the smack on your data pipelines

Replication

DC1: RF=1

Node      Primary
10.0.0.1  00-25
10.0.0.2  26-50
10.0.0.3  51-75
10.0.0.4  76-100

Page 61: Laying down the smack on your data pipelines

Replication

DC1: RF=2

Node      Primary  Replica
10.0.0.1  00-25    76-100
10.0.0.2  26-50    00-25
10.0.0.3  51-75    26-50
10.0.0.4  76-100   51-75

Page 62: Laying down the smack on your data pipelines

Replication

DC1: RF=3

Node      Primary  Replica  Replica
10.0.0.1  00-25    76-100   51-75
10.0.0.2  26-50    00-25    76-100
10.0.0.3  51-75    26-50    00-25
10.0.0.4  76-100   51-75    26-50
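The replica columns above follow a simple rule: the primary owner plus the next RF-1 nodes clockwise around the ring each hold a copy. A SimpleStrategy-style sketch (illustrative Python):

```python
import bisect

node_tokens = [0, 26, 51, 76]
nodes = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]

def replicas(token, rf):
    # Primary owner, then the next rf-1 nodes clockwise around the ring.
    i = bisect.bisect_right(node_tokens, token) - 1
    return [nodes[(i + k) % len(nodes)] for k in range(rf)]
```

With rf=3, a token in the 00-25 range lands on 10.0.0.1, 10.0.0.2, and 10.0.0.3, matching the table.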

Page 63: Laying down the smack on your data pipelines

Consistency

DC1: RF=3 (node/replica layout as on the Replication slide)

Client: Write to partition 15

Page 64: Laying down the smack on your data pipelines

Consistency level

Consistency Level  Nodes Acknowledged
One                One (read repair triggered)
Local One          One in the local DC (read repair in the local DC)
Quorum             51%
Local Quorum       51% in the local DC
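The 51% rows are just "a majority of replicas"; the usual formula is floor(RF / 2) + 1:

```python
def quorum(rf):
    # A majority of the rf replicas must acknowledge.
    return rf // 2 + 1
```

So with RF=3, a Quorum write needs 2 of the 3 replicas to acknowledge.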

Page 65: Laying down the smack on your data pipelines

Consistency

DC1: RF=3 (layout as above)

Client: Write to partition 15, CL = One


Page 67: Laying down the smack on your data pipelines

Consistency

DC1: RF=3 (layout as above)

Client: Write to partition 15, CL = Quorum

Page 68: Laying down the smack on your data pipelines

Multi-datacenter

DC1: RF=3 and DC2: RF=3, each with the same node/primary/replica layout (nodes 10.0.0.1-4 in DC1, 10.1.0.1-4 in DC2)

Client: Write to partition 15



Page 71: Laying down the smack on your data pipelines

Spark, Mesos, Akka, Cassandra, Kafka

Page 72: Laying down the smack on your data pipelines

Spark

Page 73: Laying down the smack on your data pipelines

Great combo

Store a ton of data. Analyze a ton of data.

Page 74: Laying down the smack on your data pipelines

Great combo

Spark Streaming: Near Real-time
Spark SQL: Structured Data
MLlib: Machine Learning
GraphX: Graph Analysis

Page 75: Laying down the smack on your data pipelines

Great combo

Spark Streaming (Near Real-time), Spark SQL (Structured Data), MLlib (Machine Learning), GraphX (Graph Analysis), connected via the Spark Connector to:

CREATE TABLE raw_weather_data (
  wsid text,
  year int,
  month int,
  day int,
  hour int,
  temperature double,
  dewpoint double,
  pressure double,
  wind_direction int,
  wind_speed double,
  sky_condition int,
  sky_condition_text text,
  one_hour_precip double,
  six_hour_precip double,
  PRIMARY KEY ((wsid), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);

Page 76: Laying down the smack on your data pipelines

Master

Worker
  Executor
  Executor
  Executor

Server

Page 77: Laying down the smack on your data pipelines

Master

Four Workers, Token Ranges 0-100:
0-24, 25-49, 50-74, 75-99

"I will only analyze 25% of the data."
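The 25% claim can be sketched as token-range pushdown: with a worker co-located on each node, a full-table scan becomes four range-restricted queries like the one shown on a later slide (illustrative Python; the worker names are made up):

```python
# Each worker scans only its node's token range, so the full-table scan
# is split four ways and every read stays local to its node.
ranges = {
    "worker-1": (0, 24),
    "worker-2": (25, 49),
    "worker-3": (50, 74),
    "worker-4": (75, 99),
}

def local_scan(worker, table="keyspace.table"):
    lo, hi = ranges[worker]
    return (f"SELECT * FROM {table} "
            f"WHERE token(pk) > {lo} AND token(pk) <= {hi}")
```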

Page 78: Laying down the smack on your data pipelines

Master

Four Workers, one per token range (0-24, 25-49, 50-74, 75-99), each co-located with the node that owns that range

Analytics + Transactional

Page 79: Laying down the smack on your data pipelines

Master

Worker (token range 75-99)
  Executor
  Executor
  Executor

The Spark Connector issues:
SELECT * FROM keyspace.table WHERE token(pk) > 75 AND token(pk) <= 99

Spark RDD → Spark Partitions

Page 80: Laying down the smack on your data pipelines

Master

Worker (token range 75-99)
  Executor
  Executor
  Executor

Spark RDD → Spark Partitions

Page 81: Laying down the smack on your data pipelines

Simple example

/** keyspace & table */
val tableRDD = sc.cassandraTable("isd_weather_data", "raw_weather_data")

/** get a simple count of all the rows in the raw_weather_data table */
val rowCount = tableRDD.count()
println(s"Total Rows in Raw Weather Table: $rowCount")
sc.stop()

(The Spark Connector runs SELECT * FROM isd_weather_data.raw_weather_data on the executors, feeding the Spark Partitions of the RDD.)

Page 82: Laying down the smack on your data pipelines

Saving back the weather data

val cc = new CassandraSQLContext(sc)
cc.setKeyspace("isd_weather_data")
cc.sql("""
    SELECT wsid, year, month, day,
           max(temperature) high, min(temperature) low
    FROM raw_weather_data
    WHERE month = 6 AND temperature != 0.0
    GROUP BY wsid, year, month, day;
  """)
  .map { row =>
    (row.getString(0), row.getInt(1), row.getInt(2), row.getInt(3),
     row.getDouble(4), row.getDouble(5))
  }
  .saveToCassandra("isd_weather_data", "daily_aggregate_temperature")

Page 83: Laying down the smack on your data pipelines

Spark Streaming - Micro Batching

© 2015. All Rights Reserved.

Page 84: Laying down the smack on your data pipelines

DStream


Page 85: Laying down the smack on your data pipelines

Sliding Windows


Page 86: Laying down the smack on your data pipelines

Spark, Mesos, Akka, Cassandra, Kafka

Page 87: Laying down the smack on your data pipelines

Mesos

Page 88: Laying down the smack on your data pipelines

(Diagram: many Kafka, Spark, Akka, and Cassandra instances)

"I need CPU!!" "I need memory!!"
Mesos: "Got you covered."

Page 89: Laying down the smack on your data pipelines

Kafka, Akka, Akka, Akka
Kafka, Spark, Spark

Page 90: Laying down the smack on your data pipelines

Kafka, Akka, Akka, Akka
Kafka, Spark, Spark

Page 91: Laying down the smack on your data pipelines
Page 92: Laying down the smack on your data pipelines

Kafka on Mesos example

Scheduler
• Provides the operational automation for a Kafka cluster
• Manages changes to the brokers' configuration
• Exposes a REST API for the CLI, or any other client, to use
• Runs on Marathon for high availability

Executor
• Interacts with the Kafka broker as an intermediary for the scheduler
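Mesos' two-level scheduling, which the Kafka scheduler above plugs into, can be sketched as accepting resource offers (toy Python, not the Mesos API; node names and sizes are made up):

```python
# The master offers each agent's spare resources; a framework scheduler
# (like the Kafka scheduler) accepts offers that fit and launches tasks.
offers = [
    {"agent": "node-1", "cpus": 4, "mem_gb": 16},
    {"agent": "node-2", "cpus": 2, "mem_gb": 64},
]

def accept_offers(offers, need_cpus, need_mem_gb):
    launched = []
    for offer in offers:
        if offer["cpus"] >= need_cpus and offer["mem_gb"] >= need_mem_gb:
            launched.append({"task": "kafka-broker", "agent": offer["agent"]})
    return launched

brokers = accept_offers(offers, need_cpus=4, need_mem_gb=8)
```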

Page 93: Laying down the smack on your data pipelines

Kafka, Akka, Spark, Cassandra: Organize, Process, Store

Mesos

Page 94: Laying down the smack on your data pipelines

Go get your SMACK on

Thank you!

Follow me on Twitter: @PatrickMcFadin