![Page 1: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/1.jpg)
Apache Spark: killer or savior of Apache Hadoop?
Roman Shaposhnik, Director of Open Source @Pivotal
(Twitter: @rhatr)
![Page 2: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/2.jpg)
Who’s this guy?
• Director of Open Source (building a team of OS contributors)
• Apache Software Foundation guy (Member, VP of Apache Incubator, committer on Hadoop, Giraph, Sqoop, etc.)
• Used to be root@Cloudera
• Used to be PHB@Yahoo! (original Hadoop team)
• Used to be a hacker at Sun Microsystems (Sun Studio compilers and tools)
![Page 3: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/3.jpg)
Shameless plug
http://manning.com/martella
![Page 4: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/4.jpg)
Dearly beloved…
![Page 5: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/5.jpg)
40 minutes to figure out
Hadoop vs. Spark
![Page 6: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/6.jpg)
40 minutes to figure out
Hadoop++ == Spark
![Page 7: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/7.jpg)
40 minutes to figure out
Hadoop + Spark
![Page 8: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/8.jpg)
40 minutes to figure out
![Page 9: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/9.jpg)
Long, long time ago…
HDFS
ASF Projects FLOSS Projects Pivotal Products
MapReduce
![Page 10: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/10.jpg)
In a blink of an eye
HDFS
Pig
Sqoop Flume
Coordination and workflow management
Zookeeper
Command Center
ASF Projects FLOSS Projects Pivotal Products
GemFire XD
Oozie
MapReduce
Hive
Tez
Giraph
Hadoop UI
Hue
SolrCloud
Phoenix
HBase
Crunch Mahout
Spark
Shark
Streaming
MLlib
GraphX
Impala
HAWQ
SpringXD
MADlib
Hamster
PivotalR
YARN
Tachyon
![Page 11: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/11.jpg)
A Spark view?
HDFS
Sqoop Flume
Coordination and workflow management
Zookeeper
Command Center
ASF Projects FLOSS Projects Pivotal Products
GemFire XD
Oozie
Hadoop UI
Hue
SolrCloud
Phoenix
HBase Spark
Shark
Streaming
MLlib
GraphX
SpringXD
YARN
Tachyon
![Page 12: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/12.jpg)
BDAS
![Page 13: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/13.jpg)
Principle #1
HDFS is the datalake
![Page 14: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/14.jpg)
Your datacenter
…
server 1
server N
![Page 15: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/15.jpg)
Hadoop’s view
MapReduce
server 1
server N
HDFS
![Page 16: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/16.jpg)
HDFS: decoupled storage
… MR
HDFS
MR
![Page 17: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/17.jpg)
![Page 18: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/18.jpg)
Anatomy of MapReduce
Input (HDFS): a b c | a c a
Mappers emit: a 1 b 1 c 1 | a 1 c 1 a 1
Reducers receive: a 1 1 1 | b 1 | c 1 1
Output (HDFS): a 3 b 1 c 2
HDFS → mappers → reducers → HDFS
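The map, shuffle, and reduce flow on this slide can be sketched on plain Scala collections. This is a teaching model under stated assumptions, not Hadoop's API: the object and method names (`MapReduceAnatomy`, `mapPhase`, `shuffle`, `reducePhase`) are made up for illustration, and the two input lines are chosen so that the counts match the slide's result (a 3, b 1, c 2).

```scala
// Word count as explicit map, shuffle, and reduce phases on local collections.
// A teaching model only: real MapReduce runs these phases across many machines
// and reads from / writes to HDFS.
object MapReduceAnatomy {
  // Map phase: one input line in, a (word, 1) pair out for every word.
  def mapPhase(line: String): Seq[(String, Int)] =
    line.split(" ").toSeq.filter(_.nonEmpty).map(word => (word, 1))

  // Shuffle phase: bring together all values emitted for the same key.
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, kvs) => word -> kvs.map(_._2) }

  // Reduce phase: collapse each key's list of values into a single count.
  def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (word, ones) => word -> ones.sum }

  // The whole job: map every line, shuffle by word, reduce per word.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    reducePhase(shuffle(lines.flatMap(mapPhase)))
}
```

Calling `MapReduceAnatomy.wordCount(Seq("a b c", "a c a"))` yields a count of 3 for `a`, 1 for `b`, and 2 for `c`.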
![Page 19: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/19.jpg)
Principle #2
MR is assembly language
![Page 20: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/20.jpg)
MapReduce 1.0
Job Tracker
Task Tracker (HDFS)
Task Tracker (HDFS)
task1 task1 task1 task1 task1
task1 task1 task1 task1 taskN
![Page 21: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/21.jpg)
YARN (AKA MR2.0)
Resource Manager
Job Tracker
task1 task1 task1 task1 task1 Task Tracker
![Page 22: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/22.jpg)
YARN (AKA MR2.0)
Resource Manager
Job Tracker
task1 task1 task1 task1 task1 Task Tracker
![Page 23: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/23.jpg)
Principle #3
MR: YARN + library
![Page 24: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/24.jpg)
What’s wrong with MR?
Source: UC Berkeley Spark project (just the image)
![Page 25: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/25.jpg)
Principle #4
$ grep -R | awk | sort …
![Page 26: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/26.jpg)
Spark philosophy
• Make life easy for Data Scientists
• Provide well documented and expressive APIs
• Powerful Domain Specific Libraries
• Easy integration with storage systems
• Caching to avoid data movement
• Well defined releases, stable API
![Page 27: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/27.jpg)
Spark innovations
• Resilient Distributed Datasets (RDDs)
• Distributed on a cluster
• Manipulated via parallel operators (map, etc.)
• Automatically rebuilt on failure
• A parallel ecosystem
• A solution to iterative and multi-stage apps
![Page 28: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/28.jpg)
RDDs
warnings = textFile(…).filter(_.contains("warning")).map(_.split(' ')(1))
HadoopRDD path = hdfs://
FilteredRDD contains…
MappedRDD split…
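The chain HadoopRDD → FilteredRDD → MappedRDD is a lineage: each dataset remembers its parent and the function applied to it, which is what allows lost data to be rebuilt. A minimal sketch of that idea in plain Scala follows. It is a toy model, not Spark's implementation: `Dataset`, `Source`, `Filtered`, `Mapped`, and `compute` are hypothetical names, and everything runs on one machine.

```scala
// Toy lineage model: a dataset stores how to derive itself, not its contents.
// Hypothetical names throughout; this is not Spark's RDD class hierarchy.
object LineageSketch {
  sealed trait Dataset[A] {
    // Replay the recorded lineage, starting from the original source.
    def compute(): Seq[A]
    def filter(p: A => Boolean): Dataset[A] = Filtered(this, p)
    def map[B](f: A => B): Dataset[B] = Mapped(this, f)
  }

  // Root of the lineage: knows how to (re)load the raw records.
  final case class Source[A](load: () => Seq[A]) extends Dataset[A] {
    def compute(): Seq[A] = load()
  }

  // Records only the parent and the predicate.
  final case class Filtered[A](parent: Dataset[A], p: A => Boolean) extends Dataset[A] {
    def compute(): Seq[A] = parent.compute().filter(p)
  }

  // Records only the parent and the transformation.
  final case class Mapped[A, B](parent: Dataset[A], f: A => B) extends Dataset[B] {
    def compute(): Seq[B] = parent.compute().map(f)
  }
}
```

With `import LineageSketch._`, the slide's pipeline becomes `Source(() => lines).filter(_.contains("warning")).map(_.split(' ')(1))`. Nothing is evaluated until `compute()` is called, and calling it again after a loss simply replays the same steps.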
![Page 29: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/29.jpg)
Parallel operators
• map, reduce
• sample, filter
• groupBy, reduceByKey
• join, leftOuterJoin, rightOuterJoin
• union, cross
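To make the key-based operators concrete, here is a sketch of what `reduceByKey`, `join`, and `leftOuterJoin` compute, written against local Scala sequences. These are stand-ins that mirror the result shapes of the Spark operators; the `PairOps` object is hypothetical, and nothing here is distributed.

```scala
// Local stand-ins for three key-based operators; hypothetical helper object,
// meant only to show what each operator returns.
object PairOps {
  // reduceByKey: merge all values that share a key using `f`.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

  // join: inner join, one output pair per matching (left, right) combination.
  def join[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, (A, B))] =
    left.flatMap { case (k, a) =>
      right.collect { case (k2, b) if k2 == k => (k, (a, b)) }
    }

  // leftOuterJoin: keep every left record; the right side becomes an Option.
  def leftOuterJoin[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, (A, Option[B]))] =
    left.flatMap { case (k, a) =>
      val matches = right.collect { case (k2, b) if k2 == k => b }
      if (matches.isEmpty) Seq((k, (a, Option.empty[B])))
      else matches.map(b => (k, (a, Some(b): Option[B])))
    }
}
```

For example, `PairOps.reduceByKey(Seq("a" -> 1, "b" -> 1, "a" -> 2))(_ + _)` gives `a -> 3` and `b -> 1`.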
![Page 30: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/30.jpg)
How do I use it?
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
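The snippet above assumes a ready-made SparkContext named `spark`. A fuller sketch of the same word count as a standalone program might look like this. Assumptions: Spark 1.3 or later is on the classpath (older releases also need `import org.apache.spark.SparkContext._` for `reduceByKey`), the object name `WordCount` and the two command-line paths are illustrative, and a local master is used only when none is configured.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Standalone version of the slide's word count (illustrative names).
object WordCount {
  // Same three steps as the slide: split into words, pair with 1, sum per word.
  def countWords(lines: RDD[String]): RDD[(String, Int)] =
    lines.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    require(args.length == 2, "usage: WordCount <input-path> <output-path>")
    val conf = new SparkConf()
      .setAppName("WordCount")
      .setIfMissing("spark.master", "local[*]") // fall back to local mode (assumption)
    val sc = new SparkContext(conf)
    try {
      countWords(sc.textFile(args(0))).saveAsTextFile(args(1))
    } finally {
      sc.stop() // release the context even if the job fails
    }
  }
}
```

Keeping the transformation in `countWords` separate from the I/O makes it possible to try it on a small in-memory RDD created with `sc.parallelize`.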
![Page 31: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/31.jpg)
Principle #5
Memory is the new disk
![Page 32: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/32.jpg)
RDDs are the foundation
• SQL
• Graph
• ML
• Streaming
![Page 33: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/33.jpg)
Spark SQL
• Library in Spark Core that models RDDs as relations
• SchemaRDD
• Replaces Shark
• Lightweight with no code from Hive
• Import/Export into different storage formats
• Columnar storage (as in Shark)
![Page 34: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/34.jpg)
Spark Streaming
• Extend Spark to do large scale stream processing
• Simple, batch-like API with RDDs
• Single semantics for both real time and high latency
![Page 35: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/35.jpg)
D-Streams
![Page 36: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/36.jpg)
Streaming from Twitter
TwitterUtils.createStream(...)
.filter(_.getText.contains("Spark"))
.countByWindow(Seconds(5))
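The D-Stream idea behind this snippet is that a stream is treated as a sequence of small batches, so windowed operations reduce to ordinary batch operations over a few consecutive batches. A toy model in plain Scala, with no Spark dependency: `DStreamSketch` and its methods are hypothetical, and the window is measured in batches rather than seconds as a simplification.

```scala
// Toy model of a discretized stream: a finite sequence of micro-batches.
// Hypothetical names; the window is counted in batches, not in time.
object DStreamSketch {
  type Batch[A] = Seq[A]

  // Filtering a stream means filtering every batch independently.
  def filterStream[A](batches: Seq[Batch[A]])(p: A => Boolean): Seq[Batch[A]] =
    batches.map(_.filter(p))

  // Count the elements in each window of `windowBatches` consecutive batches,
  // sliding forward one batch at a time. A stream shorter than the window
  // produces a single, partial window.
  def countByWindow[A](batches: Seq[Batch[A]], windowBatches: Int): Seq[Long] =
    batches.sliding(windowBatches).map(w => w.map(_.size.toLong).sum).toList
}
```

Filtering tweets for "Spark" and then counting over a two-batch window is `countByWindow(filterStream(batches)(_.contains("Spark")), 2)`.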
![Page 37: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/37.jpg)
Spark GraphX
• Pregel (BSP) (formerly known as Bagel)
• Graph-centric modeling
• Unification of processing
• No more MR trickery
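For readers new to the Pregel/BSP model, the sketch below shows the superstep loop on a single machine: active vertices send messages, every vertex folds the messages it received, and the computation stops when no vertex changes. It labels connected components by propagating the smallest vertex id. This is not the GraphX API; `BspSketch` is a hypothetical name, and the graph is assumed to be undirected with every neighbour also present as a key in the adjacency map.

```scala
// Single-machine sketch of a Pregel-style (BSP) computation.
// Assumption: undirected graph, every neighbour is also a key of `adj`.
object BspSketch {
  def connectedComponents(adj: Map[Int, Set[Int]]): Map[Int, Int] = {
    // Superstep 0: every vertex is labelled with its own id and is active.
    var labels: Map[Int, Int] = adj.keys.map(v => v -> v).toMap
    var active: Set[Int] = adj.keySet
    while (active.nonEmpty) {
      // Message phase: each active vertex sends its label to its neighbours;
      // messages for the same destination are combined by taking the minimum.
      val messages: Map[Int, Int] = active.toSeq
        .flatMap(v => adj(v).toSeq.map(n => (n, labels(v))))
        .groupBy(_._1)
        .map { case (dst, msgs) => dst -> msgs.map(_._2).min }
      // Compute phase: a vertex changes, and stays active, only if it
      // received a label smaller than the one it already holds.
      val updates = messages.filter { case (v, m) => m < labels(v) }
      labels = labels ++ updates
      active = updates.keySet
    }
    labels
  }
}
```

On a graph with edges 1-2, 2-3, and 4-5, vertices 1, 2, and 3 end up labelled 1, while 4 and 5 end up labelled 4.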
![Page 38: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/38.jpg)
You killed Apache Giraph?
![Page 39: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/39.jpg)
MLbase
• Machine Learning toolset
• MATLAB for scale-out computing
• Built on Spark MLlib
• Classification, Regression, Collaborative Filtering, etc.
![Page 40: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/40.jpg)
What is really happening?
HDFS
Pig
Sqoop Flume
Coordination and workflow management
Zookeeper
Command Center
ASF Projects FLOSS Projects Pivotal Products
GemFire XD
Oozie
MapReduce
Hive
Tez
Giraph
Hadoop UI
Hue
SolrCloud
Phoenix
HBase
Crunch Mahout
Spark
Shark
Streaming
MLlib
GraphX
Impala
HAWQ
SpringXD
MADlib
Hamster
PivotalR
YARN
Tachyon
![Page 41: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/41.jpg)
Principle #6
Spark: the ecosystem
![Page 42: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/42.jpg)
Maybe it's not so bad
server 1
server N
![Page 43: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/43.jpg)
But HDFS/YARN are safe?
HDFS, Ceph, S3, NAS, etc.
New HDFS
New YARN
![Page 44: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/44.jpg)
What is *really* going on?
• 2009 Research at UCB, written in Scala
• 2010 Open Sourced
• 2013 Accepted into Apache Incubator
• 2013 Databricks formed ($14M funding)
• 2014 Becomes TLP with ASF
• 2014 Spark 1.0 is out
• 2014 Databricks gets an extra $33M
![Page 45: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/45.jpg)
Bigdata: brought to U by ASF
• >50% ML traffic
• 100-200 contributors across 25-35 companies
• More active than Hadoop
• Cross-pollination with other TLPs
![Page 46: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/46.jpg)
Principle #7
Where Hadoop was in ’09
![Page 47: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/47.jpg)
This is how hardening looks
![Page 48: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/48.jpg)
What is Hadoop?
Hadoop != MR + HDFS
![Page 49: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/49.jpg)
The ecosystem
• Apache HBase
• Apache Crunch, Pig, Hive and Phoenix
• Apache Giraph
• Apache Oozie
• Apache Mahout
• Apache Sqoop and Flume
![Page 50: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/50.jpg)
Principle #8
Spark: an alternative backend
![Page 51: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/51.jpg)
Spark is best for cloud
![Page 52: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/52.jpg)
Principle #9
Memory is expensive
![Page 53: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/53.jpg)
What’s new?
• True elasticity
• Resource partitioning
• Security
• Data marketplace
• Multi datacenter deployments
![Page 54: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/54.jpg)
Hadoop Maturity
ETL Offload: Accommodate massive data growth with existing EDW investments
Data Lakes: Unify Unstructured and Structured Data Access
Big Data Apps: Build analytic-led applications impacting top line revenue
Data-Driven Enterprise
App Dev and Operational Management on HDFS
Data Architecture
![Page 55: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/55.jpg)
Pivotal HD on Pivotal CF
• Enterprise PaaS Management System
• Flexible multi-language ‘buildpack’ architecture
• Deployed applications enjoy built-in services
• On-Premise Hadoop as a Service
• Single cluster deployment of Pivotal HD
• Developers instantly bind to shared Hadoop Clusters
• Speeds up time-to-value
![Page 56: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/56.jpg)
Pivotal’s view
Data Science Platform
Tachyon/Gem
Cluster Manager
MR
Application
Stream Server
MPP SQL
Data Lake / HDFS / Virtual Storage
GemFireXD
...ETC
Hadoop HDFS Isilon
App Dev / Ops
MLbase Streaming
Legacy Systems
Legacy
Data Scientists Data Sources End Users
SparkSQL
![Page 57: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/57.jpg)
Principle #10
The rumors of my death…
![Page 58: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/58.jpg)
It will be called Hadoop
HDFS
Pig
Sqoop Flume
Coordination and workflow management
Zookeeper
Command Center
ASF Projects FLOSS Projects Pivotal Products
GemFire with Tachyon
Oozie
MapReduce
Hive
Tez
Giraph
Hadoop UI
Hue
SolrCloud
Phoenix
HBase
Crunch Mahout
Spark
Shark
Streaming
MLlib
GraphX
Impala
HAWQ
SpringXD
MADlib
Hamster
PivotalR
YARN
![Page 59: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/59.jpg)
Spark recap
• Is it “Big Data”? (Yes)
• Is it “Hadoop”? (No)
• It’s one of those “in memory” things, right? (Yes)
• JVM, Java, Scala? (All)
• Is it real or just another shiny technology with a long, but ultimately small tail? (Yes and ?)
![Page 60: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/60.jpg)
A NEW PLATFORM FOR A NEW ERA
![Page 61: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/61.jpg)
Credits
• Wikipedia and Dilbert.com
• Apache Software Foundation
• Scott Deeg
• Milind Bhandarkar
• Susheel Kaushik
• Mak Gokhale
![Page 62: Apache Spark: killer or savior of Apache Hadoop?](https://reader034.vdocuments.site/reader034/viewer/2022050805/554f66e6b4c905c8088b4e2c/html5/thumbnails/62.jpg)
Questions?