New Analytics Toolbox
DESCRIPTION
The state of analytics has changed dramatically over the last few years. Hadoop is now commonplace, and the ecosystem has evolved to include new tools such as Spark, Shark, and Drill that live alongside the old MapReduce-based standards. It can be difficult to keep up with the pace of change, and newcomers are left with a dizzying variety of seemingly similar choices. This is compounded by the number of possible deployment permutations, which can cause all but the most determined to simply stick with the tried and true. Learn the serious advantages of the new tools, and get an analysis of the current state, including pros and cons as well as what's needed to bootstrap and operate the various options.
TRANSCRIPT
The New Analytics Toolbox
Going beyond Hadoop

Hadoop works fine, right?
● It’s 11 years old (!!)
● It’s relatively slow and inefficient
● … because everything gets written to disk
● Writing MapReduce code sucks
● Lots of code for even simple tasks
● Boilerplate
● Configuration isn’t for the faint of heart

Landscape has changed ...
Lots of new tools available:
● Cloudera Impala: MPP queries for Hadoop data
● Apache Drill: MPP queries for Hadoop data
● Proprietary (Splunk, keen.io, etc.): varies by vendor
● Spark / Spark SQL: generic in-memory analysis
● Shark: Hive queries on Spark, replaced by Spark SQL

Spark wins?
● Can be a Hadoop replacement
● Works with any existing InputFormat
● Doesn’t require Hadoop
● Supports batch & streaming analysis
● Functional programming model
● Direct Cassandra integration

Spark vs. Hadoop
MapReduce: [code slides; contents not captured in this transcript]
Spark (angels sing!): [code slide; contents not captured in this transcript]

What is Spark?
● In-memory cluster computing
● 10-100x faster than MapReduce
● Collection API over large datasets
● Scala, Python, Java
● Stream processing
● Supports any existing Hadoop input / output format
● Native graph processing via GraphX
● Native machine learning via MLlib
● SQL queries via Spark SQL
● Works out of the box on EMR
● Easily join datasets from disparate sources

Spark on Cassandra
● Direct integration via DataStax driver - cassandra-driver-spark on GitHub
● No job config cruft
● Supports server-side filters (where clauses)
● Data locality aware
● Uses HDFS, CassandraFS, or other distributed FS for checkpointing

Cassandra with Hadoop
● Each node (three shown): Cassandra, DataNode, TaskTracker
● Master services: NameNode, Secondary NameNode, JobTracker

Cassandra with Spark (using HDFS)
● Each node (three shown): Cassandra, Spark Worker, DataNode
● Master services: Spark Master, NameNode, Secondary NameNode

A Typical Spark Application
● SparkContext + SparkConf
● Data Source to RDD[T]
● Transformations/Actions
● Saving/Displaying
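The four steps above can be sketched as a small local-mode program. This is only a minimal sketch, assuming Spark core is on the classpath; the `Person` type, the sample data, the app name, and the `local[*]` master are illustrative assumptions, not taken from the talk. For the last step, a real job would more likely call `saveAsTextFile` or write to a store than print.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical record type, used only for this illustration.
final case class Person(name: String, age: Int)

object TypicalSparkApp {
  // Steps 2 and 3: data source -> RDD[Person] -> transformation -> action.
  def countAdults(sc: SparkContext, people: Seq[Person]): Long = {
    val persons = sc.parallelize(people) // data source to RDD[Person]
    persons
      .filter(_.age > 17)                // transformation (lazy)
      .count()                           // action (runs the job)
  }

  def main(args: Array[String]): Unit = {
    // Step 1: SparkConf + SparkContext (local master assumed for the sketch).
    val conf = new SparkConf().setAppName("typical-spark-app").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val sample = Seq(Person("Ann", 34), Person("Bob", 12), Person("Cy", 18))
      // Step 4: display the result.
      println(s"adults: ${countAdults(sc, sample)}")
    } finally sc.stop()
  }
}
```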
Resilient Distributed Dataset (RDD)
● A distributed collection of items
● Transformations
○ Similar to those found in Scala collections
○ Lazily processed
● Can recalculate from any point of failure

RDD Transformations vs Actions
Transformations: Produce new RDDs
Actions: Require the materialization of the records to produce a value

RDD Transformations/Actions
Transformations: filter, map, flatMap, collect(λ):RDD[T], distinct, groupBy, subtract, union, zip, reduceByKey ...
Actions: collect:Array[T], count, fold, reduce ...
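Because the slides compare RDD transformations to Scala collections, the lazy behaviour can be illustrated without Spark. The sketch below is an analogy only: a plain Scala `Iterator` stands in for an RDD, `collect { case ... }` plays the role of the `collect(λ)` transformation, and `toList` plays the role of the `collect` action. The object and method names are made up for the example.

```scala
object LazyTransformations {
  // Returns (elements evaluated before the "action",
  //          elements evaluated after it,
  //          the materialized result).
  def demo(input: Seq[Int]): (Int, Int, List[Int]) = {
    var evaluated = 0
    // "Transformations": building the pipeline runs nothing yet.
    val pipeline = input.iterator
      .map { x => evaluated += 1; x * 2 }
      .filter(_ % 4 == 0)
      .collect { case x if x > 4 => x + 1 } // collect with a function: still lazy
    val before = evaluated
    // "Action": materializing the records forces the whole pipeline to run.
    val result = pipeline.toList
    (before, evaluated, result)
  }

  def main(args: Array[String]): Unit =
    println(demo(Seq(1, 2, 3, 4)))
}
```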
Resilient Distributed Dataset (RDD)
val numOfAdults = persons.filter(_.age > 17).count()
Here filter is the transformation and count() is the action.

Example

Demo

Spark SQL Example

Spark Streaming
● Creates RDDs from stream source on a defined interval
● Same ops as “normal” RDDs
● Supports a variety of sources
○ Queues
○ Twitter
○ Flume
○ etc.
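A queue-backed stream is probably the simplest way to see these points in action. The following is a hedged sketch rather than the example from the talk: it assumes the `spark-streaming` module is on the classpath, and the batch interval, sample words, timeout, and names are all assumptions. The per-batch logic is a plain RDD function, which is what "same ops as normal RDDs" suggests.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

object QueueStreamSketch {
  // Ordinary RDD logic, usable in a batch job or on each streaming batch.
  def countWords(words: RDD[String]): RDD[(String, Int)] =
    words.map(w => (w, 1)).reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("queue-stream-sketch").setMaster("local[2]")
    // One RDD is produced per batch interval (1 second assumed here).
    val ssc = new StreamingContext(conf, Seconds(1))

    // Queue source: each queued RDD becomes one batch of the stream.
    val queue = mutable.Queue.empty[RDD[String]]
    val stream = ssc.queueStream(queue)

    // Apply the RDD function to every batch and print a few results.
    stream.transform((rdd: RDD[String]) => countWords(rdd)).print()

    ssc.start()
    queue.synchronized {
      queue += ssc.sparkContext.parallelize(Seq("spark", "hadoop", "spark"))
    }
    ssc.awaitTerminationOrTimeout(3000L) // let a few batches run, then stop
    ssc.stop(stopSparkContext = true, stopGracefully = false)
  }
}
```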
Spark Streaming: [example code slide; contents not captured in this transcript]

The Output

Real World Task Distribution
Transformations and Actions:
Similar to Scala but …
Your choice of transformations needs to be made with task distribution and memory in mind

How do we do that?
● Partition the data appropriately for the number of cores (at least 2 x number of cores)
● Filter early and often (Queries, Filters, Distinct...)
● Use pipelines
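The first two tips can be sketched as follows. This is an illustration under assumptions, not the speakers' code: it reads "number of cores" as `sc.defaultParallelism` (which defaults to the core count in local mode), and it reads "use pipelines" as chaining narrow steps such as `filter` and `map` so they run in a single stage. The names and data are made up.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionAndFilter {
  // Returns (number of partitions used, sum of squares of the even inputs).
  def sumEvenSquares(sc: SparkContext, data: Seq[Int]): (Int, Long) = {
    // Rule of thumb from the slide: at least 2 partitions per core.
    val numPartitions = sc.defaultParallelism * 2
    val numbers = sc.parallelize(data, numPartitions)
    val total = numbers
      .filter(_ % 2 == 0)     // filter early: drop records before further work
      .map(n => n.toLong * n) // narrow step, pipelined with the filter
      .fold(0L)(_ + _)        // action: triggers the job
    (numbers.partitions.length, total)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("partition-and-filter").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(sumEvenSquares(sc, 1 to 10))
    finally sc.stop()
  }
}
```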
How do we do that? (continued)
● Partial Aggregation
● Keep the algorithm as simple and efficient as it can be while still answering the question
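One common reading of "partial aggregation" in Spark is preferring `reduceByKey` over `groupByKey` for per-key sums: `reduceByKey` combines values inside each partition before the shuffle, while `groupByKey` ships every value across the network first. The sketch below assumes that reading; the object name and sample data are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PartialAggregation {
  // Partial aggregation: values are summed within each partition first,
  // so at most one record per key per partition is shuffled.
  def sumWithReduceByKey(pairs: RDD[(String, Int)]): Map[String, Int] =
    pairs.reduceByKey(_ + _).collect().toMap

  // No partial aggregation: every value is shuffled, then summed per key.
  def sumWithGroupByKey(pairs: RDD[(String, Int)]): Map[String, Int] =
    pairs.groupByKey().mapValues(_.sum).collect().toMap

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("partial-aggregation").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)
      // Same answer either way; the difference is how much data is shuffled.
      println(sumWithReduceByKey(pairs))
      println(sumWithGroupByKey(pairs))
    } finally sc.stop()
  }
}
```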
Some Common Costly Transformations:
● sorting
● groupByKey
● reduceByKey
● ...
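What these transformations have in common is that they usually move data between nodes (a shuffle). One way to check a given pipeline is to walk the RDD lineage and look for a shuffle dependency. The helper below is a hypothetical sketch (`needsShuffle` is not part of Spark), built on the public `dependencies` field and the `ShuffleDependency` class.

```scala
import org.apache.spark.{ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ShuffleCheck {
  // True if any step in the RDD's lineage requires a shuffle.
  def needsShuffle(rdd: RDD[_]): Boolean =
    rdd.dependencies.exists { dep =>
      dep.isInstanceOf[ShuffleDependency[_, _, _]] || needsShuffle(dep.rdd)
    }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("shuffle-check").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)
      println(s"filter:      ${needsShuffle(pairs.filter(_._2 > 1))}")
      println(s"sortByKey:   ${needsShuffle(pairs.sortByKey())}")
      println(s"groupByKey:  ${needsShuffle(pairs.groupByKey())}")
      println(s"reduceByKey: ${needsShuffle(pairs.reduceByKey(_ + _))}")
    } finally sc.stop()
  }
}
```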