Laying down the SMACK on your data pipelines


Patrick McFadin, Chief Evangelist for Apache Cassandra (@PatrickMcFadin)


The problem

Your Magical App

Sad solutions

SMACK

Spark, Mesos, Akka, Cassandra, Kafka

[Diagram: Kafka, Akka, Cassandra, Spark - Organize, Process, Store]

Mesos

[Diagram: Mesos running multiple Kafka, Spark, Akka, and Cassandra instances; Kafka, Akka, Cassandra, Spark - Organize, Process, Store]

Managing Weather Data

Windsor, California: 67.3 F, rainfall total 1.2 cm
Today: High 73.4 F, Low 51.4 F
Yesterday: High 75.2 F, Low 52.3 F

Our Magical App: reactive and immediate

Batch

KillrWeather

https://github.com/killrweather/killrweather

Spark, Mesos, Akka, Cassandra, Kafka

Kafka

Kafka decouples data pipelines

The problem

[Diagram: customers shouting orders at the kitchen]
"Hamburger please!" "Meat disk on bread please!"

Put an Order Queue in front of the kitchen. "Hamburger please" goes in as an Order. "Meat disk on bread please" gets clarified ("You mean a Hamburger?" "Uh, yeah. That.") and goes in as an Order too.

Order from chaos

[Diagram: Producer appends to Topic = Food; Consumer reads from it]
The producer writes Order 1, Order 2, Order 3, Order 4, Order 5 to the topic, and the consumer reads them back in exactly that sequence.

Scale

[Diagram: one Topic = Food split into Topic = Hamburgers and Topic = Pizza, each holding Order 1 through Order 5, with the producer and consumer unchanged]

Kafka

[Diagram: the Collection API (Producer) writes to the Broker. Topic = Temperature holds Temp 1-5 and feeds the Temperature Processor (Consumer); Topic = Precipitation holds Precip 1-5 and feeds the Precipitation Processor]

Partitions

[Diagram: within the Broker, each topic is split into partitions. Topic = Temperature has Partition 0 and Partition 1, each holding its own Temp 1-5 and each feeding its own Temperature Processor; Topic = Precipitation has Partition 0 holding Precip 1-5, feeding the Precipitation Processor]

Replication

Topic Temperature: Replication Factor = 2
Topic Precipitation: Replication Factor = 2

[Diagram: with RF = 2, a second broker holds a copy of each partition of both topics, so the Temperature and Precipitation Processors can keep consuming if a broker fails]

Guarantees

Order
• Messages are ordered as they are sent by the producer (within a partition)
• Consumers see messages in the order they were inserted by the producer

Durability
• Messages are delivered at least once
• With a replication factor of N, up to N-1 server failures can be tolerated without losing committed messages
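The ordering guarantee is easiest to see from the producer side. Below is a minimal sketch of a producer for the Temperature topic from the earlier slides, assuming the standard kafka-clients API and a broker at localhost:9092; the topic name and station id are illustrative.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object TemperatureProducer extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092") // illustrative broker address
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  val producer = new KafkaProducer[String, String](props)
  // Keying by weather station id routes every reading from one station
  // to the same partition, which is what preserves per-station ordering.
  producer.send(new ProducerRecord("temperature", "724940:23234", "67.3"))
  producer.close()
}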

Spark, Mesos, Akka, Cassandra, Kafka

Akka

Akka in a nutshell
• Highly concurrent
• Reactive
• Fully distributed
• Completely elastic and resilient

[Diagram: actors, each with its own mailbox]
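A minimal sketch of the actor-and-mailbox model, assuming classic Akka actors; the Reading message type is hypothetical:

import akka.actor.{Actor, ActorSystem, Props}

// Hypothetical message type for this example.
case class Reading(wsid: String, temperature: Double)

class StationActor extends Actor {
  // Each actor has a mailbox: messages queue up and are processed
  // one at a time, so the actor's state needs no locks.
  def receive: Receive = {
    case Reading(wsid, temp) => println(s"$wsid reported $temp F")
  }
}

object AkkaExample extends App {
  val system = ActorSystem("weather")
  val station = system.actorOf(Props[StationActor], "station")
  station ! Reading("724940:23234", 67.3) // fire-and-forget: enqueue to the mailbox
}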

KafkaStreamingActor
• Pulls from the Kafka queue
• Immediately saves to a Cassandra counter

kafkaStream.map { weather =>
  (weather.wsid, weather.year, weather.month, weather.day, weather.oneHourPrecip)
}.saveToCassandra(CassandraKeyspace, CassandraTableDailyPrecip)
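For context, a sketch of how the kafkaStream above might be created, assuming the receiver-based spark-streaming-kafka API of that era; the ZooKeeper address, group, and topic names are illustrative, and the real KillrWeather code parses each line into a weather case class:

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(5)) // assumes an existing SparkContext sc

// Receiver-based Kafka stream of (key, message) pairs from the raw_weather topic.
val kafkaStream = KafkaUtils.createStream(
    ssc, "localhost:2181", "killrweather.group", Map("raw_weather" -> 1))
  .map(_._2) // keep the raw CSV line; KillrWeather parses it into a case class

ssc.start()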

Temperature High/Low Stream

[Diagram: Weather Stations feed the Receive API (Producer); the Consumer hands messages to a pool of TemperatureActor instances]

TemperatureActor

class TemperatureActor(sc: SparkContext, settings: WeatherSettings)
  extends WeatherActor with ActorLogging {

  def receive: Actor.Receive = {
    case e: GetDailyTemperature        => daily(e.day, sender)
    case e: DailyTemperature           => store(e)
    case e: GetMonthlyHiLowTemperature => highLow(e, sender)
  }

TemperatureActor

/** Computes and sends the daily aggregation to the `requester` actor.
 *  We aggregate this data on-demand versus in the stream.
 *
 *  For the given day of the year, aggregates 0-23 temp values to statistics:
 *  high, low, mean, std, etc., and persists to the Cassandra daily temperature
 *  table by weather station. Rows are automatically sorted by most recent,
 *  due to our Cassandra schema, so you don't need to sort in Spark.
 *
 *  Because the gov. data is not by interval (window/slide) but by specific
 *  date/time, we look for historic data for hours 0-23 that may or may not
 *  exist yet and create stats on what does exist at the time of request. */
def daily(day: Day, requester: ActorRef): Unit =
  (for {
    aggregate <- sc.cassandraTable[Double](keyspace, rawtable)
      .select("temperature")
      .where("wsid = ? AND year = ? AND month = ? AND day = ?",
        day.wsid, day.year, day.month, day.day)
      .collectAsync()
  } yield forDay(day, aggregate)) pipeTo requester

TemperatureActor

/** Handles only 0-23 small items or fewer. */
private def forDay(key: Day, temps: Seq[Double]): WeatherAggregate =
  if (temps.nonEmpty) {
    val stats = StatCounter(temps)
    val data = DailyTemperature(
      key.wsid, key.year, key.month, key.day,
      high = stats.max, low = stats.min, mean = stats.mean,
      variance = stats.variance, stdev = stats.stdev)

    self ! data
    data
  } else NoDataAvailable(key.wsid, key.year, classOf[DailyTemperature])


TemperatureActor

/** Stores the daily temperature aggregates asynchronously to the daily
 *  temperature aggregation table; triggered by the `self ! data` call
 *  in the `forDay` function during on-demand requests. */
private def store(e: DailyTemperature): Unit =
  sc.parallelize(Seq(e)).saveToCassandra(keyspace, dailytable)

Spark, Mesos, Akka, Cassandra, Kafka

Cassandra

Node

[Diagram: a node is one server]

Token
• Consistent hash between -2^63 and 2^63 - 1
• Each node owns a range of those values
• The token is the beginning of that range to the next node's token value
• Virtual Nodes break these down further

[Diagram: data is hashed to a token that falls into some node's range; simplified 0-100 token values are used in the following examples]

Cluster of one server:
Token  Range
0      0-100

Cluster of two servers:
Token  Range
0      0-50
51     51-100

Cluster of four servers:
Token  Range
0      0-25
26     26-50
51     51-75
76     76-100
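A toy sketch of token ownership using the deck's simplified 0-100 token values; this is an illustration of the idea, not Cassandra's actual Murmur3 partitioner:

// Each server owns the range starting at its token, up to the next token.
val ring = Seq(0 -> "10.0.0.1", 26 -> "10.0.0.2", 51 -> "10.0.0.3", 76 -> "10.0.0.4")

def ownerOf(token: Int): String =
  ring.reverse.collectFirst { case (start, node) if token >= start => node }.get

println(ownerOf(15)) // 10.0.0.1 owns 0-25
println(ownerOf(60)) // 10.0.0.3 owns 51-75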

Table

CREATE TABLE weather_station (
  id text,
  name text,
  country_code text,
  state_code text,
  call_sign text,
  lat double,
  long double,
  elevation double,
  PRIMARY KEY (id)
);

[Callouts: table name, column names, column CQL types, primary key designation; here the primary key is also the partition key]

Queries supported

CREATE TABLE raw_weather_data (
  wsid text,
  year int,
  month int,
  day int,
  hour int,
  temperature double,
  dewpoint double,
  pressure double,
  wind_direction int,
  wind_speed double,
  sky_condition int,
  sky_condition_text text,
  one_hour_precip double,
  six_hour_precip double,
  PRIMARY KEY ((wsid), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);

Get weather data given:
• Weather Station ID
• Weather Station ID and Time
• Weather Station ID and Range of Time
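A sketch of those three access patterns, assuming the DataStax Java driver and a local node; the station id and dates are illustrative:

import com.datastax.driver.core.Cluster

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("isd_weather_data")

// Weather Station ID
session.execute("SELECT * FROM raw_weather_data WHERE wsid = '724940:23234'")

// Weather Station ID and Time
session.execute(
  "SELECT * FROM raw_weather_data " +
  "WHERE wsid = '724940:23234' AND year = 2015 AND month = 6 AND day = 1")

// Weather Station ID and Range of Time (the clustering columns allow the range)
session.execute(
  "SELECT * FROM raw_weather_data " +
  "WHERE wsid = '724940:23234' AND year = 2015 AND month = 6 " +
  "AND day >= 1 AND day <= 7")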

Replication

DC1: RF=1
Node      Primary
10.0.0.1  00-25
10.0.0.2  26-50
10.0.0.3  51-75
10.0.0.4  76-100

Replication

DC1: RF=2
Node      Primary  Replica
10.0.0.1  00-25    76-100
10.0.0.2  26-50    00-25
10.0.0.3  51-75    26-50
10.0.0.4  76-100   51-75

Replication

DC1: RF=3
Node      Primary  Replica  Replica
10.0.0.1  00-25    76-100   51-75
10.0.0.2  26-50    00-25    76-100
10.0.0.3  51-75    26-50    00-25
10.0.0.4  76-100   51-75    26-50

Consistency

DC1: RF=3 (ring as above); a client writes to partition 15.

Consistency Level  Nodes Acknowledged
One                One; read repair triggered
Local One          One; read repair in local DC
Quorum             51%
Local Quorum       51% in local DC
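The consistency level is chosen by the client per request. A sketch with the DataStax Java driver; the contact point and the statement itself are illustrative:

import com.datastax.driver.core.{Cluster, ConsistencyLevel, SimpleStatement}

val cluster = Cluster.builder().addContactPoint("10.0.0.1").build()
val session = cluster.connect("isd_weather_data")

val write = new SimpleStatement(
  "INSERT INTO weather_station (id, name) VALUES ('724940:23234', 'Windsor')")
write.setConsistencyLevel(ConsistencyLevel.QUORUM) // wait for 2 of 3 replicas at RF=3
session.execute(write)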

Consistency, DC1: RF=3

Client: write to partition 15 at CL = One. A single replica's acknowledgment completes the write.

Client: write to partition 15 at CL = Quorum. A majority (2 of the 3 replicas) must acknowledge.

Multi-datacenter

DC1: RF=3
Node      Primary  Replica  Replica
10.0.0.1  00-25    76-100   51-75
10.0.0.2  26-50    00-25    76-100
10.0.0.3  51-75    26-50    00-25
10.0.0.4  76-100   51-75    26-50

DC2: RF=3
Node      Primary  Replica  Replica
10.1.0.1  00-25    76-100   51-75
10.1.0.2  26-50    00-25    76-100
10.1.0.3  51-75    26-50    00-25
10.1.0.4  76-100   51-75    26-50

Client: a write to partition 15 goes to the replicas in DC1 and is also replicated to the corresponding replicas in DC2.

Spark, Mesos, Akka, Cassandra, Kafka

Spark

Great combo

Store a ton of data. Analyze a ton of data.

Great combo

Spark Streaming: near real-time
Spark SQL: structured data
MLlib: machine learning
GraphX: graph analysis


Spark Connector

[Diagram: a Spark Master with Workers, each Worker running an Executor co-located with a Cassandra server. The full token range 0-100 is split across the workers: 0-24, 25-49, 50-74, 75-99]

"I will only analyze 25% of the data."

[Diagram: each Spark Worker reads only the token range of its local Cassandra node: 0-24, 25-49, 50-74, 75-99]

Analytics + Transactional

[Diagram: the Executors on the Worker owning token range 75-99 issue:]

SELECT * FROM keyspace.table
WHERE token(pk) > 75 AND token(pk) <= 99

The results become a Spark RDD, split into Spark Partitions.


Simple example

/** keyspace & table */
val tableRDD = sc.cassandraTable("isd_weather_data", "raw_weather_data")

/** get a simple count of all the rows in the raw_weather_data table */
val rowCount = tableRDD.count()
println(s"Total Rows in Raw Weather Table: $rowCount")
sc.stop()

Under the hood each Executor runs
SELECT * FROM isd_weather_data.raw_weather_data
over its token ranges, and the Spark Connector turns the results into the Spark Partitions of the RDD.

Saving back the weather data

val cc = new CassandraSQLContext(sc)
cc.setKeyspace("isd_weather_data")
cc.sql("""
    SELECT wsid, year, month, day, max(temperature) high, min(temperature) low
    FROM raw_weather_data
    WHERE month = 6 AND temperature != 0.0
    GROUP BY wsid, year, month, day;
  """)
  .map { row => (row.getString(0), row.getInt(1), row.getInt(2), row.getInt(3),
    row.getDouble(4), row.getDouble(5)) }
  .saveToCassandra("isd_weather_data", "daily_aggregate_temperature")

Spark Streaming - Micro Batching

DStream

Sliding Windows
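A minimal sliding-window sketch, assuming (wsid, temperature) pairs arriving over a socket for brevity; in KillrWeather they would come from the Kafka stream shown earlier, and the interval lengths are illustrative:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowedHighs extends App {
  val conf = new SparkConf().setMaster("local[2]").setAppName("windows")
  val ssc = new StreamingContext(conf, Seconds(5)) // 5-second micro-batches

  val pairs = ssc.socketTextStream("localhost", 9999).map { line =>
    val Array(wsid, temp) = line.split(",") // lines like "724940:23234,67.3"
    (wsid, temp.toDouble)
  }

  // Per-station high over the last minute, recomputed every ten seconds.
  val highs = pairs.reduceByKeyAndWindow(
    (a: Double, b: Double) => math.max(a, b), Seconds(60), Seconds(10))
  highs.print()

  ssc.start()
  ssc.awaitTermination()
}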

Spark, Mesos, Akka, Cassandra, Kafka

Mesos

[Diagram: a Mesos cluster running many Kafka, Spark, Akka, and Cassandra instances]

"I need CPU!!" "I need memory!!" "Got you covered." Mesos allocates CPU and memory across all the Kafka, Akka, Spark, and Cassandra instances in the cluster.

Kafka on Mesos example

Scheduler
• Provides the operational automation for a Kafka cluster
• Manages changes to the brokers' configuration
• Exposes a REST API for the CLI or any other client to use
• Runs on Marathon for high availability

Executor
• Interacts with the Kafka broker as an intermediary to the scheduler

[Diagram: Kafka, Akka, Cassandra, Spark (Organize, Process, Store) all running on Mesos]

Go get your SMACK on

Thank you!

Follow me on Twitter: @PatrickMcFadin
