Transcript
Page 1: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

The New Analytics Toolbox

Going beyond Hadoop

Page 2: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

whoami

Robbie StricklandSoftware Dev Managerlinkedin.com/in/[email protected]

Page 3: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Hadoop works fine, right?

Page 4: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Hadoop works fine, right?

● It’s 11 years old (!!)

Page 5: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Hadoop works fine, right?

● It’s 11 years old (!!)● It’s relatively slow and inefficient

Page 6: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Hadoop works fine, right?

● It’s 11 years old (!!)● It’s relatively slow and inefficient● … because everything gets written to disk

Page 7: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Hadoop works fine, right?

● It’s 11 years old (!!)● It’s relatively slow and inefficient● … because everything gets written to disk● Writing mapreduce code sucks

Page 8: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Hadoop works fine, right?

● It’s 11 years old (!!)● It’s relatively slow and inefficient● … because everything gets written to disk● Writing mapreduce code sucks● Lots of code for even simple tasks

Page 9: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Hadoop works fine, right?

● It’s 11 years old (!!)● It’s relatively slow and inefficient● … because everything gets written to disk● Writing mapreduce code sucks● Lots of code for even simple tasks● Boilerplate

Page 10: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Hadoop works fine, right?

● It’s 11 years old (!!)● It’s relatively slow and inefficient● … because everything gets written to disk● Writing mapreduce code sucks● Lots of code for even simple tasks● Boilerplate● Configuration isn’t for the faint of heart

Page 11: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Landscape has changed ...

Lots of new tools available:● Cloudera Impala● Apache Drill● Proprietary (Splunk, keen.io, etc.)● Spark / Spark SQL● Shark

Page 12: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Landscape has changed ...

Lots of new tools available:● Cloudera Impala● Apache Drill● Proprietary (Splunk, keen.io, etc.)● Spark / Spark SQL● Shark

MPP queries for Hadoop data

Page 13: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Landscape has changed ...

Lots of new tools available:● Cloudera Impala● Apache Drill● Proprietary (Splunk, keen.io, etc.)● Spark / Spark SQL● Shark

Varies by vendor

Page 14: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Landscape has changed ...

Lots of new tools available:● Cloudera Impala● Apache Drill● Proprietary (Splunk, keen.io, etc.)● Spark / Spark SQL● Shark

Generic in-memory analysis

Page 15: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Landscape has changed ...

Lots of new tools available:● Cloudera Impala● Apache Drill● Proprietary (Splunk, keen.io, etc.)● Spark / Spark SQL● Shark Hive queries on Spark, replaced by Spark SQL

Page 16: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Spark wins?

Page 17: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Spark wins?

● Can be a Hadoop replacement

Page 18: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Spark wins?

● Can be a Hadoop replacement● Works with any existing InputFormat

Page 19: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Spark wins?

● Can be a Hadoop replacement● Works with any existing InputFormat● Doesn’t require Hadoop

Page 20: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Spark wins?

● Can be a Hadoop replacement● Works with any existing InputFormat● Doesn’t require Hadoop● Supports batch & streaming analysis

Page 21: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Spark wins?

● Can be a Hadoop replacement● Works with any existing InputFormat● Doesn’t require Hadoop● Supports batch & streaming analysis● Functional programming model

Page 22: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Spark wins?

● Can be a Hadoop replacement● Works with any existing InputFormat● Doesn’t require Hadoop● Supports batch & streaming analysis● Functional programming model● Direct Cassandra integration

Page 23: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Spark vs. Hadoop

Page 24: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

MapReduce:

Page 25: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

MapReduce:

Page 26: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

MapReduce:

Page 27: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

MapReduce:

Page 28: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

MapReduce:

Page 29: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

MapReduce:

Page 30: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Spark (angels sing!):

Page 31: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

What is Spark?

Page 32: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

What is Spark?

● In-memory cluster computing

Page 33: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

What is Spark?

● In-memory cluster computing● 10-100x faster than MapReduce

Page 34: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

What is Spark?

● In-memory cluster computing● 10-100x faster than MapReduce● Collection API over large datasets

Page 35: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

What is Spark?

● In-memory cluster computing● 10-100x faster than MapReduce● Collection API over large datasets● Scala, Python, Java

Page 36: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

What is Spark?

● In-memory cluster computing● 10-100x faster than MapReduce● Collection API over large datasets● Scala, Python, Java● Stream processing

Page 37: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

What is Spark?

● In-memory cluster computing● 10-100x faster than MapReduce● Collection API over large datasets● Scala, Python, Java● Stream processing● Supports any existing Hadoop input / output

format

Page 38: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

What is Spark?

● Native graph processing via GraphX

Page 39: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

What is Spark?

● Native graph processing via GraphX● Native machine learning via MLlib

Page 40: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

What is Spark?

● Native graph processing via GraphX● Native machine learning via MLlib● SQL queries via SparkSQL

Page 41: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

What is Spark?

● Native graph processing via GraphX● Native machine learning via MLlib● SQL queries via SparkSQL● Works out of the box on EMR

Page 42: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

What is Spark?

● Native graph processing via GraphX● Native machine learning via MLlib● SQL queries via SparkSQL● Works out of the box on EMR● Easily join datasets from disparate sources

Page 43: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Spark on Cassandra

Page 44: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Spark on Cassandra

● Direct integration via DataStax driver - cassandra-driver-spark on github

Page 45: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Spark on Cassandra

● Direct integration via DataStax driver - cassandra-driver-spark on github

● No job config cruft

Page 46: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Spark on Cassandra

● Direct integration via DataStax driver - cassandra-driver-spark on github

● No job config cruft● Supports server-side filters (where clauses)

Page 47: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Spark on Cassandra

● Direct integration via DataStax driver - cassandra-driver-spark on github

● No job config cruft● Supports server-side filters (where clauses)● Data locality aware

Page 48: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Spark on Cassandra

● Direct integration via DataStax driver - cassandra-driver-spark on github

● No job config cruft● Supports server-side filters (where clauses)● Data locality aware● Uses HDFS, CassandraFS, or other

distributed FS for checkpointing

Page 49: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Cassandra with Hadoop

CassandraDataNode

TaskTracker

CassandraDataNode

TaskTracker

CassandraDataNode

TaskTracker

NameNode2ndaryNN

JobTracker

Page 50: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Cassandra with Spark (using HDFS)

CassandraSpark Worker

DataNode

CassandraSpark Worker

DataNode

CassandraSpark Worker

DataNode

Spark MasterNameNode2ndaryNN

Page 51: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

A Typical Spark Application

Page 52: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

A Typical Spark Application● SparkContext + SparkConf

Page 53: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

A Typical Spark Application● SparkContext + SparkConf

● Data Source to RDD[T]

Page 54: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

A Typical Spark Application● SparkContext + SparkConf

● Data Source to RDD[T]

● Transformations/Actions

Page 55: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

A Typical Spark Application● SparkContext + SparkConf

● Data Source to RDD[T]

● Transformations/Actions

● Saving/Displaying

Page 56: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Resilient Distributed Dataset (RDD)

Page 57: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Resilient Distributed Dataset (RDD)● A distributed collection of items

Page 58: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Resilient Distributed Dataset (RDD)● A distributed collection of items

● Transformations

○ Similar to those found in scala collections

○ Lazily processed

Page 59: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Resilient Distributed Dataset (RDD)● A distributed collection of items

● Transformations

○ Similar to those found in scala collections

○ Lazily processed

● Can recalculate from any point of failure

Page 60: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

RDD Transformations vs Actions

Transformations: Produce new RDDs

Actions: Require the materialization of the records to produce a value

Page 61: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

RDD Transformations/Actions

Transformations: filter, map, flatMap, collect(λ):RDD[T], distinct, groupBy, subtract, union, zip, reduceByKey ...

Actions: collect:Array[T], count, fold, reduce ...

Page 62: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Resilient Distributed Dataset (RDD)

val numOfAdults = persons.filter(_.age > 17).count()

Transformation Action

Page 63: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Example

Page 64: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Demo

Page 65: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Spark SQL Example

Page 66: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Spark Streaming

Page 67: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Spark Streaming

● Creates RDDs from stream source on a defined interval

Page 68: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Spark Streaming

● Creates RDDs from stream source on a defined interval

● Same ops as “normal” RDDs

Page 69: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Spark Streaming

● Creates RDDs from stream source on a defined interval

● Same ops as “normal” RDDs● Supports a variety of sources

○ Queues○ Twitter○ Flume○ etc.

Page 70: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Spark Streaming

Page 71: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

The Output

Page 72: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Real World Task Distribution

Page 73: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Real World Task Distribution

Transformations and Actions:

Similar to Scala but …

Your choice of transformations need be made with task distribution and memory in mind

Page 74: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Real World Task Distribution

How do we do that?

● Partition the data appropriately for number of cores (at least 2 x number of cores)

● Filter early and often (Queries, Filters, Distinct...)● Use pipelines

Page 75: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Real World Task Distribution

How do we do that?

● Partial Aggregation● Create an algorithm to be as simple/efficient as it can be

to appropriately answer the question

Page 76: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Real World Task Distribution

Page 77: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Real World Task Distribution

Some Common Costly Transformations:

● sorting● groupByKey● reduceByKey● ...

Page 78: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Thank you!

Robbie [email protected]://github.com/rstrickland/sparkdemo


Top Related