
Page 1: Streaming Big Data & Analytics For Scale

STREAMING BIG DATA & ANALYTICS FOR SCALE

Helena Edelson1

@helenaedelson

Page 2: Streaming Big Data & Analytics For Scale

@helenaedelson

Who Is This Person?

• VP of Product Engineering @Tuplejump
• Big Data, Analytics, Cloud Engineering, Cyber Security
• Committer / Contributor to FiloDB, Spark Cassandra Connector, Akka, Spring Integration

2

• @helenaedelson

• github.com/helena

• linkedin.com/in/helenaedelson

• slideshare.net/helenaedelson

Page 3: Streaming Big Data & Analytics For Scale

@helenaedelson

Page 4: Streaming Big Data & Analytics For Scale

@helenaedelson

Topics

• The Problem Domain - What needs to be solved
• The Stack (Scala, Akka, Spark Streaming, Kafka, Cassandra)
• Simplifying the architecture
• The Pipeline - integration

4

Page 5: Streaming Big Data & Analytics For Scale

@helenaedelson

THE PROBLEM DOMAIN
Delivering Meaning From A Flood Of Data

5

@helenaedelson

Page 6: Streaming Big Data & Analytics For Scale

@helenaedelson

The Problem Domain

Need to build scalable, fault tolerant, distributed data processing systems that can handle massive amounts of data from disparate sources, with different data structures.

6

Page 7: Streaming Big Data & Analytics For Scale

@helenaedelson

Delivering Meaning

• Deliver meaning in sec/sub-sec latency
• Disparate data sources & schemas
• Billions of events per second
• High-latency batch processing
• Low-latency stream processing
• Aggregation of historical data from the stream

7

Page 8: Streaming Big Data & Analytics For Scale

I need fast access to historical data on the fly for predictive modeling with real time data from the stream

Page 9: Streaming Big Data & Analytics For Scale

@helenaedelson

It's Not A Stream It's A Flood

• Netflix
  • 50 - 100 billion events per day
  • 1 - 2 million events per second at peak

• LinkedIn
  • 500 billion write events per day
  • 2.5 trillion read events per day
  • 4.5 million events per second at peak with Kafka
  • 1 PB of stream data

9

Page 10: Streaming Big Data & Analytics For Scale

@helenaedelson

Reality Check

• Massive event spikes & bursty traffic
• Fast producers / slow consumers
• Network partitioning & out of sync systems
• DC down
• Wait, we've DDOS'd ourselves from fast streams?
• Autoscale issues

– When we scale down VMs how do we not lose data?

10

Page 11: Streaming Big Data & Analytics For Scale

@helenaedelson

And stay within our cloud hosting budget

11

Page 12: Streaming Big Data & Analytics For Scale

@helenaedelson

Oh, and don't lose data

12

Page 13: Streaming Big Data & Analytics For Scale

@helenaedelson

THE STACK

13

Page 14: Streaming Big Data & Analytics For Scale

@helenaedelson

About The Stack

An ensemble of technologies enabling a data pipeline, storage and analytics.

14

Page 15: Streaming Big Data & Analytics For Scale

@helenaedelson

15

Page 16: Streaming Big Data & Analytics For Scale

@helenaedelson

Scala and Spark

Scala is now becoming the glue that connects the dots in big data. The emergence of Spark owes directly to its simplicity and the high level of abstraction of Scala.

• Distributed collections that are functional by default
• Apply transformations to them
• Spark takes that exact idea and puts it on a cluster (see the sketch below)
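A minimal sketch of that idea (an assumption for illustration, not from the deck): the same functional transformation expressed first on a plain Scala collection, then on a distributed Spark RDD.

import org.apache.spark.{SparkConf, SparkContext}

object CollectionsToCluster extends App {
  // The same functional transformation on a local Scala collection...
  val localCounts = List("kafka", "spark", "kafka")
    .map(word => (word, 1))
    .groupBy(_._1)
    .mapValues(_.map(_._2).sum)

  // ...and distributed across a cluster as a Spark RDD.
  val sc = new SparkContext(new SparkConf().setAppName("demo").setMaster("local[*]"))
  val distributedCounts = sc.parallelize(Seq("kafka", "spark", "kafka"))
    .map(word => (word, 1))
    .reduceByKey(_ + _)

  distributedCounts.collect().foreach(println)
  sc.stop()
}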

16

Page 17: Streaming Big Data & Analytics For Scale

@helenaedelson

SIMPLIFYING ARCHITECTURE

17

Page 18: Streaming Big Data & Analytics For Scale

@helenaedelson

Pick Technologies Wisely

Based on your requirements:

• Latency
  • Real time / Sub-Second: < 100 ms
  • Near real time (low): > 100 ms or a few seconds - a few hours
• Consistency
• Highly Scalable
• Topology-Aware & Multi-Datacenter support
• Partitioning
• Collaboration - do they play together well

18

Page 19: Streaming Big Data & Analytics For Scale

@helenaedelson

Strategies

19

• Partition For Scale & Data Locality
• Replicate For Resiliency
• Share Nothing
• Fault Tolerance
• Asynchrony
• Async Message Passing
• Memory Management
• Data lineage and reprocessing in runtime
• Parallelism
• Elastically Scale
• Isolation
• Location Transparency

Page 20: Streaming Big Data & Analytics For Scale

@helenaedelson

My Nerdy Chart

Strategy: Technologies

• Scalable Infrastructure / Elastic: Spark, Cassandra, Kafka
• Partition For Scale, Network Topology Aware: Cassandra, Spark, Kafka, Akka Cluster
• Replicate For Resiliency: Spark, Cassandra, Akka Cluster (all hash the node ring)
• Share Nothing, Masterless: Cassandra, Akka Cluster (both Dynamo style)
• Fault Tolerance / No Single Point of Failure: Spark, Cassandra, Kafka
• Replay From Any Point Of Failure: Spark, Cassandra, Kafka, Akka + Akka Persistence
• Failure Detection: Cassandra, Spark, Akka, Kafka
• Consensus & Gossip: Cassandra & Akka Cluster
• Parallelism: Spark, Cassandra, Kafka, Akka
• Asynchronous Data Passing: Kafka, Akka, Spark
• Fast, Low Latency, Data Locality: Cassandra, Spark, Kafka
• Location Transparency: Akka, Spark, Cassandra, Kafka

20

Page 21: Streaming Big Data & Analytics For Scale

@helenaedelson

21

A few years in Silicon Valley, on a Cloud Engineering team

@helenaedelson

Page 22: Streaming Big Data & Analytics For Scale

@helenaedelson

22

Batch analytics data flow from several years ago looked like...

Page 23: Streaming Big Data & Analytics For Scale

@helenaedelson

[Diagram: batch analytics flow - Analytic, Analytic, Search components; WordCount in Java on Hadoop]

23

Painful just to look at

Page 24: Streaming Big Data & Analytics For Scale

@helenaedelson

24

Batch analytics data flow from several years ago looked like...

Page 25: Streaming Big Data & Analytics For Scale

@helenaedelson

[Diagram: batch analytics flow - Analytic, Analytic, Search components; Scalding: WordCount]

25

Page 26: Streaming Big Data & Analytics For Scale

@helenaedelson

26

Transforming data multiple times, multiple ways

Page 27: Streaming Big Data & Analytics For Scale

@helenaedelson

27

Sweet, let's triple the code we have to update and regression test every time our analytics logic changes

Page 28: Streaming Big Data & Analytics For Scale

@helenaedelson

AND THEN WE GREEKED OUT

28

Lambda

Page 29: Streaming Big Data & Analytics For Scale

@helenaedelson

Lambda Architecture

A data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream processing methods.

29

• Or, "How to beat the CAP theorem"
• An approach coined by Nathan Marz
• This was a huge stride forward

Page 30: Streaming Big Data & Analytics For Scale

@helenaedelson

https://www.mapr.com/developercentral/lambda-architecture

30

Page 31: Streaming Big Data & Analytics For Scale

@helenaedelson

Implementing Is Hard

• Real-time pipeline backed by KV store for updates
• Many moving parts - KV store, real time, batch
• Running similar code in two places
• Still ingesting data to Parquet/HDFS
• Reconcile queries against two different places

31

Page 32: Streaming Big Data & Analytics For Scale

@helenaedelson

Performance Tuning & Monitoring on so many disparate systems

32

Also Hard

Page 33: Streaming Big Data & Analytics For Scale

@helenaedelson

33

λ: Streaming & Batch Flows

Evolution Or Just Addition? Or Just Technical Debt?

Page 34: Streaming Big Data & Analytics For Scale

@helenaedelson

Lambda Architecture

Ingest: an immutable sequence of records is captured and fed into
• a batch system
• and a stream processing system
in parallel.

34

Page 35: Streaming Big Data & Analytics For Scale

@helenaedelson

WAIT, DUAL SYSTEMS?

35

Challenge Assumptions

Page 36: Streaming Big Data & Analytics For Scale

@helenaedelson

Which Translates To

• Performing analytical computations & queries in dual systems
• Duplicate Code
• Untyped Code - Strings
• Spaghetti Architecture for Data Flows
• One Busy Network

36

Page 37: Streaming Big Data & Analytics For Scale

@helenaedelson

Architectyr?

"This is a giant mess" - Going Real-time - Data Collection and Stream Processing with Apache Kafka, Jay Kreps 37

Page 38: Streaming Big Data & Analytics For Scale

@helenaedelson

38

These are not the solutions you're looking for

Page 39: Streaming Big Data & Analytics For Scale

@helenaedelson

Escape from Hadoop?

Hadoop

• MapReduce - very powerful, no longer enough
• It's Batch
• Stale data
• Slow, everything written to disk
• Huge overhead
• Inefficient with respect to memory use
• Inflexible vs Dynamic

39

Page 40: Streaming Big Data & Analytics For Scale

@helenaedelson

One Pipeline

40

• A unified system for streaming and batch
• Real-time processing and reprocessing
  • Code changes
  • Fault tolerance

http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html - Jay Kreps

Page 41: Streaming Big Data & Analytics For Scale

@helenaedelson

THE PIPELINE

41

Stream Integration, IoT, Timeseries and Data Locality

Page 42: Streaming Big Data & Analytics For Scale

@helenaedelson

42

KillrWeather
http://github.com/killrweather/killrweather

A reference application showing how to easily integrate streaming and batch data processing with Apache Spark Streaming, Apache Cassandra, Apache Kafka and Akka for fast, streaming computations on time series data in asynchronous event-driven environments.

http://github.com/databricks/reference-apps/tree/master/timeseries/scala/timeseries-weather/src/main/scala/com/databricks/apps/weather

Page 43: Streaming Big Data & Analytics For Scale

@helenaedelson

val context = new StreamingContext(conf, Seconds(1))

val stream = KafkaUtils.createDirectStream[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](
  context, kafkaParams, kafkaTopics)

stream.flatMap(func1).saveToCassandra(ks1, table1)
stream.map(func2).saveToCassandra(ks1, table1)

context.start()

43

Kafka, Spark Streaming and Cassandra

Page 44: Streaming Big Data & Analytics For Scale

@helenaedelson

class KafkaProducerActor[K, V](producerConfig: ProducerConfig) extends Actor {

  override val supervisorStrategy =
    OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1.minute) {
      case _: ActorInitializationException   => Stop
      case _: FailedToSendMessageException   => Restart
      case _: ProducerClosedException        => Restart
      case _: NoBrokersForPartitionException => Escalate
      case _: KafkaException                 => Escalate
      case _: Exception                      => Escalate
    }

  private val producer = new KafkaProducer[K, V](producerConfig)

  override def postStop(): Unit = producer.close()

  def receive = {
    case e: KafkaMessageEnvelope[K, V] => producer.send(e)
  }
}

44

Kafka, Spark Streaming, Cassandra & Akka

Page 45: Streaming Big Data & Analytics For Scale

[Diagram: ML pipeline - Your Data, Extract Data To Analyze, Feature Extraction, Training Data, Test Data, Model Training, Model Testing; train your model to predict]

Spark Streaming ML, Kafka & Cassandra

Page 46: Streaming Big Data & Analytics For Scale

@helenaedelson

Spark Streaming, ML, Kafka & C*

val ssc = new StreamingContext(new SparkConf()…, Seconds(5))

val testData = ssc.cassandraTable[String](keyspace, table).map(LabeledPoint.parse)

val trainingStream = KafkaUtils.createStream[K, V, KDecoder, VDecoder](
    ssc, kafkaParams, topicMap, StorageLevel.MEMORY_ONLY)
  .map(_._2).map(LabeledPoint.parse)

trainingStream.saveToCassandra("ml_keyspace", "raw_training_data")

val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.dense(weights))
model.trainOn(trainingStream)

// Making predictions on testData
model
  .predictOnValues(testData.map(lp => (lp.label, lp.features)))
  .saveToCassandra("ml_keyspace", "predictions")

46

Page 47: Streaming Big Data & Analytics For Scale

@helenaedelson

47

Page 48: Streaming Big Data & Analytics For Scale

@helenaedelson

48

class KafkaStreamingActor(params: Map[String, String], ssc: StreamingContext, settings: Settings)
  extends AggregationActor {

  import settings._

  val stream = KafkaUtils.createStream(
      ssc, params, Map(KafkaTopicRaw -> 1), StorageLevel.DISK_ONLY_2)
    .map(_._2.split(","))
    .map(RawWeatherData(_))

  stream.saveToCassandra(CassandraKeyspace, CassandraTableRaw)

  stream
    .map(hour => (hour.wsid, hour.year, hour.month, hour.day, hour.oneHourPrecip))
    .saveToCassandra(CassandraKeyspace, CassandraTableDailyPrecip)
}

Kafka, Spark Streaming, Cassandra & Akka

Page 49: Streaming Big Data & Analytics For Scale

@helenaedelson

class KafkaStreamingActor(params: Map[String, String], ssc: StreamingContext, settings: Settings)
  extends AggregationActor {

  import settings._

  val stream = KafkaUtils.createStream(
      ssc, params, Map(KafkaTopicRaw -> 1), StorageLevel.DISK_ONLY_2)
    .map(_._2.split(","))
    .map(RawWeatherData(_))

  stream.saveToCassandra(CassandraKeyspace, CassandraTableRaw)

  stream
    .map(hour => (hour.wsid, hour.year, hour.month, hour.day, hour.oneHourPrecip))
    .saveToCassandra(CassandraKeyspace, CassandraTableDailyPrecip)
}

49

Now we can replay:
• On failure
• Reprocessing on code changes
• Future computation...
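A hedged sketch of the reprocessing case, assuming the same `ssc`, case class and table names as the streaming actor above and that `com.datastax.spark.connector._` is imported: after an analytics-logic change, re-derive the daily aggregate from the already-persisted raw table instead of replaying the live stream.

import com.datastax.spark.connector._

// Sketch only: recompute the daily precipitation aggregate from the raw table.
val reprocessed = ssc.sparkContext
  .cassandraTable[RawWeatherData](CassandraKeyspace, CassandraTableRaw)
  .map(raw => (raw.wsid, raw.year, raw.month, raw.day, raw.oneHourPrecip))

reprocessed.saveToCassandra(CassandraKeyspace, CassandraTableDailyPrecip)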

Page 50: Streaming Big Data & Analytics For Scale

@helenaedelson

50

Here we are pre-aggregating to a table for fast querying later - by other secondary stream aggregation computations and by scheduled computations

class KafkaStreamingActor(params: Map[String, String], ssc: StreamingContext, settings: Settings)
  extends AggregationActor {

  import settings._

  val stream = KafkaUtils.createStream(
      ssc, params, Map(KafkaTopicRaw -> 1), StorageLevel.DISK_ONLY_2)
    .map(_._2.split(","))
    .map(RawWeatherData(_))

  stream.saveToCassandra(CassandraKeyspace, CassandraTableRaw)

  stream
    .map(hour => (hour.wsid, hour.year, hour.month, hour.day, hour.oneHourPrecip))
    .saveToCassandra(CassandraKeyspace, CassandraTableDailyPrecip)
}

Page 51: Streaming Big Data & Analytics For Scale

@helenaedelson

CREATE TABLE weather.raw_data (
  wsid text, year int, month int, day int, hour int,
  temperature double, dewpoint double, pressure double,
  wind_direction int, wind_speed double, one_hour_precip double,
  PRIMARY KEY ((wsid), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);

CREATE TABLE daily_aggregate_precip (
  wsid text, year int, month int, day int,
  precipitation counter,
  PRIMARY KEY ((wsid), year, month, day)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC);

Data Model (simplified)

51

Page 52: Streaming Big Data & Analytics For Scale

@helenaedelson

52

class KafkaStreamingActor(params: Map[String, String], ssc: StreamingContext, settings: Settings)
  extends AggregationActor {

  import settings._

  val stream = KafkaUtils.createStream(
      ssc, params, Map(KafkaTopicRaw -> 1), StorageLevel.DISK_ONLY_2)
    .map(_._2.split(","))
    .map(RawWeatherData(_))

  stream.saveToCassandra(CassandraKeyspace, CassandraTableRaw)

  stream
    .map(hour => (hour.wsid, hour.year, hour.month, hour.day, hour.oneHourPrecip))
    .saveToCassandra(CassandraKeyspace, CassandraTableDailyPrecip)
}

Gets the partition key: Data Locality. The Spark C* Connector feeds this to Spark.

Cassandra counter column in our schema: no expensive `reduceByKey` needed. Simply let C* do it - it is inexpensive and fast.
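For contrast, a hedged sketch (not from the deck) of what the Spark-side aggregation would look like without the counter column, using the same `stream` as above - a shuffle per micro-batch that the counter write avoids:

// Without the counter column we would have to shuffle and sum inside Spark before every write:
stream
  .map(hour => ((hour.wsid, hour.year, hour.month, hour.day), hour.oneHourPrecip))
  .reduceByKey(_ + _)
// With the counter column, each raw increment is written as-is and Cassandra
// accumulates the daily total server-side.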

Page 53: Streaming Big Data & Analytics For Scale

@helenaedelson

CREATE TABLE weather.raw_data (
  wsid text, year int, month int, day int, hour int,
  temperature double, dewpoint double, pressure double,
  wind_direction int, wind_speed double, one_hour_precip double,
  PRIMARY KEY ((wsid), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);

C* Clustering Columns
Writes by most recent. Reads return most recent first.

Timeseries Data

53

Cassandra will automatically sort by most recent for both write and read

Page 54: Streaming Big Data & Analytics For Scale

@helenaedelson

54

val multipleStreams = (1 to numStreams).map { i =>
  streamingContext.receiverStream[HttpRequest](new HttpReceiver(port))
}

streamingContext.union(multipleStreams)
  .map { httpRequest => TimelineRequestEvent(httpRequest) }
  .saveToCassandra("requests_ks", "timeline")

CREATE TABLE IF NOT EXISTS requests_ks.timeline (
  timesegment bigint, url text, t_uuid timeuuid, method text,
  headers map<text, text>, body text,
  PRIMARY KEY ((url, timesegment), t_uuid)
);

Record Every Event In The Order In Which It Happened, Per URL

timesegment protects from writing unbounded partitions.

timeuuid protects from simultaneous events over-writing one another.
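A hedged sketch of how those two columns might be derived on the write path; the hourly bucket size and the DataStax Java driver's UUIDs helper are assumptions, not from the deck.

import java.util.concurrent.TimeUnit
import com.datastax.driver.core.utils.UUIDs  // assumes the DataStax Java driver is on the classpath

// Bucket events into hourly segments so no single (url, timesegment) partition grows unbounded.
def timesegment(epochMillis: Long): Long = TimeUnit.MILLISECONDS.toHours(epochMillis)

// A time-based UUID gives each event a unique clustering key, so two events arriving
// in the same millisecond do not overwrite one another.
val tUuid = UUIDs.timeBased()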

Page 55: Streaming Big Data & Analytics For Scale

@helenaedelson

val stream = KafkaUtils.createDirectStream(...)
  .map(_._2.split(","))
  .map(RawWeatherData(_))

stream.saveToCassandra(CassandraKeyspace, CassandraTableRaw)

stream
  .map(hour => (hour.id, hour.year, hour.month, hour.day, hour.oneHourPrecip))
  .saveToCassandra(CassandraKeyspace, CassandraTableDailyPrecip)

55

Replay and Reprocess - Any Time
Data is on the nodes doing the querying - Spark C* Connector - Partitions

• Timeseries data with Data Locality
• Co-located Spark + Cassandra nodes
• S3 does not give you data locality

Cassandra & Spark Streaming: Data Locality For Free®

Page 56: Streaming Big Data & Analytics For Scale

@helenaedelson

class PrecipitationActor(ssc: StreamingContext, settings: Settings) extends AggregationActor {
  import akka.pattern.pipe

  def receive: Actor.Receive = {
    case GetTopKPrecipitation(wsid, year, k) => topK(wsid, year, k, sender)
  }

  /** Returns the `k` highest precipitation values for the station in the `year`. */
  def topK(wsid: String, year: Int, k: Int, requester: ActorRef): Unit = {
    val toTopK = (aggregate: Seq[Double]) =>
      TopKPrecipitation(wsid, year, ssc.sparkContext.parallelize(aggregate).top(k).toSeq)

    ssc.cassandraTable[Double](keyspace, dailytable)
      .select("precipitation")
      .where("wsid = ? AND year = ?", wsid, year)
      .collectAsync().map(toTopK) pipeTo requester
  }
}

56

Queries pre-aggregated tables from the stream

Compute Isolation: Actor

Page 57: Streaming Big Data & Analytics For Scale

@helenaedelson

57

class TemperatureActor(sc: SparkContext, settings: Settings) extends AggregationActor {
  import akka.pattern.pipe

  def receive: Actor.Receive = {
    case e: GetMonthlyHiLowTemperature => highLow(e, sender)
  }

  def highLow(e: GetMonthlyHiLowTemperature, requester: ActorRef): Unit =
    sc.cassandraTable[DailyTemperature](keyspace, daily_temperature_aggr)
      .where("wsid = ? AND year = ? AND month = ?", e.wsid, e.year, e.month)
      .collectAsync()
      .map(MonthlyTemperature(_, e.wsid, e.year, e.month)) pipeTo requester
}

C* data is automatically sorted by most recent - due to our data model. Additional Spark or collection sort not needed.

Efficient Batch Analysis

Page 58: Streaming Big Data & Analytics For Scale

@helenaedelson

TCO: Cassandra, Hadoop

• Compactions vs file management and de-duplication jobs
• Storage cost, cloud providers / hardware, storage format, query speeds, cost of maintaining
• Hiring talent to manage Cassandra vs Hadoop
• Hadoop has some advantages if used with AWS EMR or other hosted solutions (What about HCP?)
• HDFS vs Cassandra is not a fair comparison
  – You have to decide first if you want file vs DB
  – Where in the stack
  – Then you can compare
• See http://velvia.github.io/presentations/2015-breakthrough-olap-cass-spark/index.html#/15/3

58

Page 60: Streaming Big Data & Analytics For Scale

@helenaedelson

Page 61: Streaming Big Data & Analytics For Scale

@helenaedelson

THE STACK

61

Appendix

Page 62: Streaming Big Data & Analytics For Scale

[Diagram: Analytic, Analytic, Search]

Apache Spark
• Fast, distributed, scalable and fault tolerant cluster compute system
• Enables low latency with complex analytics
• Developed in 2009 at UC Berkeley AMPLab, open sourced in 2010, and became a top-level Apache project in February 2014

Page 63: Streaming Big Data & Analytics For Scale

@helenaedelson

Spark Streaming

• One runtime for streaming and batch processing
• Join streaming and static data sets (see the sketch below)
• No code duplication
• Easy, flexible data ingestion from disparate sources to disparate sinks
• Easy to reconcile queries against multiple sources
• Easy integration of KV durable storage
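A hedged sketch of the "join streaming and static data sets" point; the `eventStream` DStream of (stationId, reading) pairs and the lookup contents are illustrative assumptions, not from the deck.

// Load a static reference data set once, then join it against every micro-batch.
val stationNames = ssc.sparkContext.parallelize(
  Seq("wsid-1" -> "Station One", "wsid-2" -> "Station Two"))

// eventStream: DStream[(String, Double)] - the same code path serves stream and batch.
val enriched = eventStream.transform { rdd => rdd.join(stationNames) }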

63

Page 64: Streaming Big Data & Analytics For Scale

@helenaedelson

[Diagram: ML pipeline - Your Data, Extract Data To Analyze, Feature Extraction, Training Data, Test Data, Model Training, Model Testing; train your model to predict]

64

val context = new StreamingContext(conf, Milliseconds(500))
val model = KMeans.train(dataset, ...) // learn offline

val stream = KafkaUtils
  .createStream(ssc, zkQuorum, group, ..)
  .map(event => model.predict(event.feature))

Page 65: Streaming Big Data & Analytics For Scale

@helenaedelson

Apache Cassandra

• Extremely Fast
• Extremely Scalable
• Multi-Region / Multi-Datacenter
• Always On
  • No single point of failure
  • Survive regional outages
• Easy to operate
• Automatic & configurable replication
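A hedged example of the "automatic & configurable replication" point, using the Spark Cassandra Connector's session access; the keyspace name, data center names, and an existing SparkContext `sc` are assumptions for illustration.

import com.datastax.spark.connector.cql.CassandraConnector

// Replication is declared per keyspace; NetworkTopologyStrategy places replicas per data center.
CassandraConnector(sc.getConf).withSessionDo { session =>
  session.execute(
    """CREATE KEYSPACE IF NOT EXISTS weather
      |WITH replication = {'class': 'NetworkTopologyStrategy', 'us_east': 3, 'eu_west': 3}""".stripMargin)
}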

65

Page 66: Streaming Big Data & Analytics For Scale

@helenaedelson

66

The one thing in your infrastructure you can always rely on

Page 67: Streaming Big Data & Analytics For Scale

• Massively Scalable
• High Performance
• Always On
• Masterless

Page 68: Streaming Big Data & Analytics For Scale
Page 69: Streaming Big Data & Analytics For Scale

IoT

Page 70: Streaming Big Data & Analytics For Scale

Security, Machine Learning

Page 71: Streaming Big Data & Analytics For Scale

Science Physics: Astro Physics / Particle Physics..

Page 72: Streaming Big Data & Analytics For Scale

Genetics / Biological Computations

Page 73: Streaming Big Data & Analytics For Scale

• High Throughput Distributed Messaging
• Decouples Data Pipelines
• Handles Massive Data Load
• Supports Massive Number of Consumers
• Distribution & partitioning across cluster nodes
• Automatic recovery from broker failures

Page 74: Streaming Big Data & Analytics For Scale
Page 75: Streaming Big Data & Analytics For Scale

@helenaedelson

75

High performance concurrency framework for Scala and Java
• Fault Tolerance
• Asynchronous messaging and data processing
• Parallelization
• Location Transparency
• Local / Remote Routing
• Akka: Cluster / Persistence / Streams

Page 76: Streaming Big Data & Analytics For Scale

@helenaedelson

Akka Actors

76

A distribution and concurrency abstraction
• Compute Isolation
• Behavioral Context Switching
• No Exposed Internal State
• Event-based messaging
• Easy parallelism
• Configurable fault tolerance
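A minimal, illustrative actor (names are assumptions, not from the deck) showing event-based messaging, no exposed internal state, and behavioral context switching via `context.become`.

import akka.actor.{Actor, ActorSystem, Props}

class Aggregator extends Actor {
  def receive: Receive = idle

  // Initial behavior: wait for a "start" message.
  def idle: Receive = {
    case "start" => context.become(collecting(Nil))
  }

  // Switched-in behavior: accumulate values without exposing mutable state.
  def collecting(acc: List[Double]): Receive = {
    case value: Double => context.become(collecting(value :: acc))
    case "report"      => sender() ! acc.sum; context.become(idle)
  }
}

// Usage sketch
val system = ActorSystem("demo")
val aggregator = system.actorOf(Props[Aggregator], "aggregator")
aggregator ! "start"
aggregator ! 1.5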

Page 77: Streaming Big Data & Analytics For Scale

@helenaedelson

Some Resources

• http://www.planetcassandra.org/blog/equinix-evaluates-hbase-and-cassandra-for-the-real-time-billing-and-fraud-detection-of-hundreds-of-data-centers-worldwide
• Building a scalable platform for streaming updates and analytics
• @SAPdevs tweet: Watch us take a dive deeper and explore the benefits of using Akka and Scala https://t.co/M9arKYy02y
• https://mesosphere.com/blog/2015/07/24/learn-everything-you-need-to-know-about-scala-and-big-data-in-oakland/
• https://www.youtube.com/watch?v=buszgwRc8hQ
• https://www.youtube.com/watch?v=G7N6YcfatWY
• http://noetl.org

77