Apache Spark: killer or savior of Apache Hadoop?
DESCRIPTION
The Big Boss(tm) has just OKed the first Hadoop cluster in the company. You are the guy in charge of analyzing petabytes of your company's valuable data using a combination of custom MapReduce jobs and SQL-on-Hadoop solutions. All of a sudden, the web is full of articles telling you that Hadoop is dead, Spark has won, and you should quit while you're still ahead. But should you?
TRANSCRIPT
Apache Spark: killer or savior of Apache Hadoop?
Roman Shaposhnik, Director of Open Source @Pivotal
(Twitter: @rhatr)
Who’s this guy?
• Director of Open Source (building a team of OS contributors)
• Apache Software Foundation guy (Member, VP of Apache Incubator, committer on Hadoop, Giraph, Sqoop, etc.)
• Used to be root@Cloudera
• Used to be PHB@Yahoo! (original Hadoop team)
• Used to be a hacker at Sun Microsystems (Sun Studio compilers and tools)
Shameless plug
http://manning.com/martella
Dearly beloved…
40 minutes to figure out
• Hadoop vs. Spark
• Hadoop++ == Spark
• Hadoop + Spark
Long, long time ago…
HDFS
ASF Projects FLOSS Projects Pivotal Products
MapReduce
In a blink of an eye
Diagram (legend: ASF Projects, FLOSS Projects, Pivotal Products; group labels: Coordination and workflow management, Hadoop UI):
HDFS, Pig, Sqoop, Flume, ZooKeeper, Command Center, GemFire XD, Oozie, MapReduce, Hive, Tez, Giraph, Hue, SolrCloud, Phoenix, HBase, Crunch, Mahout, Spark, Shark, Streaming, MLlib, GraphX, Impala, HAWQ, SpringXD, MADlib, Hamster, PivotalR, YARN, Tachyon
A Spark view?
Diagram (legend: ASF Projects, FLOSS Projects, Pivotal Products; group labels: Coordination and workflow management, Hadoop UI, BDAS):
HDFS, Sqoop, Flume, ZooKeeper, Command Center, GemFire XD, Oozie, Hue, SolrCloud, Phoenix, HBase, Spark, Shark, Streaming, MLlib, GraphX, SpringXD, YARN, Tachyon
Principle #1
HDFS is the datalake
Your datacenter
…
server 1
server N
Hadoop’s view
MapReduce
server 1
server N
HDFS
HDFS: decoupled storage
… MR
HDFS
MR
Anatomy of MapReduce
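Diagram: word count flowing through HDFS, mappers, reducers, and back to HDFS
• Input (HDFS): d a c | a b c
• Mapper output: a 1 b 1 c 1 | a 1 c 1 a 1
• Reducer input: a 1 1 1 | b 1 | c 1 1
• Output (HDFS): a 3 b 1 c 2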
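The map, shuffle, and reduce stages of the word count above can be sketched on plain Scala collections. This is only an illustration of the data flow, not Hadoop's MapReduce API: the object name `MapReduceSketch`, its function names, and the two input lines used in the usage note are assumptions made for the example.

```scala
object MapReduceSketch {
  // Map phase: each input line becomes a list of (word, 1) pairs.
  def mapper(line: String): Seq[(String, Int)] =
    line.split(" ").toSeq.filter(_.nonEmpty).map(word => (word, 1))

  // Shuffle phase: group the emitted values by key, which the framework
  // does between mappers and reducers.
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2)) }

  // Reduce phase: collapse the values of one key into a single count.
  def reducer(word: String, ones: Seq[Int]): (String, Int) = (word, ones.sum)

  // The whole job: map every line, shuffle by word, reduce every group.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    shuffle(lines.flatMap(mapper)).map { case (word, ones) => reducer(word, ones) }
}
```

Calling `MapReduceSketch.wordCount(Seq("a b c", "a c a"))` passes through the same intermediate pairs as the slide and ends with a 3, b 1, c 2.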
Principle #2
MR is assembly language
MapReduce 1.0
Job Tracker
Task Tracker (HDFS)
Task Tracker (HDFS)
task1 task1 task1 task1 task1
task1 task1 task1 task1 taskN
YARN (AKA MR2.0)
Resource Manager
Job Tracker
task1 task1 task1 task1 task1 Task Tracker
Principle #3
MR: YARN + library
What’s wrong with MR?
Source: UC Berkeley Spark project (just the image)
Principle #4
$ grep -R | awk | sort …
Spark philosophy
• Make life easy for Data Scientists
• Provide well documented and expressive APIs
• Powerful Domain Specific Libraries
• Easy integration with storage systems
• Caching to avoid data movement
• Well defined releases, stable API
Spark innovations
• Resilient Distributed Datasets (RDDs)
• Distributed on a cluster
• Manipulated via parallel operators (map, etc.)
• Automatically rebuilt on failure
• A parallel ecosystem
• A solution to iterative and multi-stage apps
RDDs
val warnings = textFile(…).filter(_.contains("warning")).map(_.split(' ')(1))
HadoopRDD: path = hdfs://
FilteredRDD: contains…
MappedRDD: split…
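The chain of HadoopRDD, FilteredRDD, and MappedRDD is a lineage: each dataset remembers its parent and the function that derives it, so lost data can be recomputed rather than restored from a replica. The sketch below imitates that idea in plain Scala. `MiniRDD`, `Source`, `Filtered`, and `Mapped` are hypothetical names invented for this example and are not Spark classes.

```scala
object LineageSketch {
  // Hypothetical stand-in for an RDD: nothing is materialized up front,
  // every node only knows how to derive itself from its parent.
  sealed trait MiniRDD[A] {
    def compute(): Seq[A]
    def lineage: List[String]
    def filter(p: A => Boolean): MiniRDD[A] = Filtered(this, p)
    def map[B](f: A => B): MiniRDD[B] = Mapped(this, f)
  }

  // Root of a lineage: reads from stable storage (here, any loader function).
  final case class Source[A](name: String, load: () => Seq[A]) extends MiniRDD[A] {
    def compute(): Seq[A] = load()
    def lineage: List[String] = List(s"Source($name)")
  }

  final case class Filtered[A](parent: MiniRDD[A], p: A => Boolean) extends MiniRDD[A] {
    def compute(): Seq[A] = parent.compute().filter(p)
    def lineage: List[String] = parent.lineage :+ "Filtered"
  }

  final case class Mapped[A, B](parent: MiniRDD[A], f: A => B) extends MiniRDD[B] {
    def compute(): Seq[B] = parent.compute().map(f)
    def lineage: List[String] = parent.lineage :+ "Mapped"
  }
}
```

Building `Source("log", () => lines).filter(_.contains("warning")).map(_.split(' ')(1))` records three lineage steps and does no work until `compute()` is called. Calling `compute()` again re-derives the result from the source, which is the property that makes rebuilding after a failure possible.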
Parallel operators
• map, reduce
• sample, filter
• groupBy, reduceByKey
• join, leftOuterJoin, rightOuterJoin
• union, cross
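To make the key-based operators in this list concrete, here is a sketch of what reduceByKey, join, and leftOuterJoin compute, written against local Scala sequences instead of a cluster. The object `PairOpsSketch` and its signatures are assumptions for illustration only. Spark's versions are methods on RDDs of pairs and run in parallel across partitions.

```scala
object PairOpsSketch {
  // reduceByKey analogue: merge all values sharing a key with the function f.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2).reduce(f)) }

  // join analogue: inner join, one output row per matching (left, right) pair.
  def join[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
    for {
      (k, v) <- left
      (k2, w) <- right
      if k == k2
    } yield (k, (v, w))

  // leftOuterJoin analogue: every left row survives; a missing right side is None.
  def leftOuterJoin[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, Option[W]))] =
    left.flatMap { case (k, v) =>
      val matches = right.collect { case (k2, w) if k2 == k => w }
      if (matches.isEmpty) Seq((k, (v, Option.empty[W])))
      else matches.map(w => (k, (v, Option(w))))
    }
}
```

For example, `PairOpsSketch.reduceByKey(Seq(("a", 1), ("b", 2), ("a", 3)))(_ + _)` sums the values per key, giving 4 for "a" and 2 for "b".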
How do I use it?
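// spark is assumed to be the SparkContext; the HDFS paths are elided on the slide
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))  // split each line into words
                 .map(word => (word, 1))            // pair every word with a count of 1
                 .reduceByKey(_ + _)                // sum the counts per word
counts.saveAsTextFile("hdfs://...")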
Principle #5
Memory is the new disk
RDDs are the foundation
• SQL
• Graph
• ML
• Streaming
Spark SQL
• Library in Spark Core that models RDDs as relations
• SchemaRDD
• Replaces Shark
• Lightweight with no code from Hive
• Import/Export into different storage formats
• Columnar storage (as in Shark)
Spark Streaming
• Extend Spark to do large scale stream processing
• Simple, batch like API with RDDs
• Single semantics for both real time and high latency
D-Streams
Streaming from Twitter
TwitterUtils.createStream(...)
  .filter(_.getText.contains("Spark"))
  .countByWindow(Seconds(5))
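A D-Stream can be pictured as a sequence of small batches, one per batch interval, each processed with ordinary batch operators. The sketch below models that with nested Scala sequences and a sliding-window count. `DStreamSketch` is a hypothetical name, and the window is measured in number of batches here, whereas Spark Streaming expresses it as a duration such as `Seconds(5)`.

```scala
object DStreamSketch {
  // A discretized stream modelled as micro-batches, one inner Seq per batch interval.
  type Batches[A] = Seq[Seq[A]]

  // The same predicate a batch job would use, applied to every micro-batch.
  def filter[A](stream: Batches[A])(p: A => Boolean): Batches[A] =
    stream.map(_.filter(p))

  // Sliding-window count: for each batch interval, count the elements in the
  // most recent `windowBatches` batches (fewer at the start of the stream).
  def countByWindow[A](stream: Batches[A], windowBatches: Int): Seq[Long] =
    stream.indices.map { i =>
      val from = math.max(0, i - windowBatches + 1)
      stream.slice(from, i + 1).map(_.size.toLong).sum
    }
}
```

With three batches of tweets, filtering on "Spark" and then calling `countByWindow(filtered, 2)` produces one count per interval, each covering the current batch and the one before it.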
Spark GraphX
• Pregel (BSP) (formerly known as Bagel)
• Graph-centric modeling
• Unification of processing
• No more MR trickery
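The Pregel (BSP) model mentioned above runs a graph algorithm as a series of supersteps: active vertices send messages, a barrier delivers them, and each vertex updates its own state. Below is a minimal single-machine sketch of that loop. It is not the GraphX API. `PregelSketch`, the adjacency-map graph type, and the choice of connected components as the example algorithm are all assumptions made for illustration.

```scala
object PregelSketch {
  // Undirected graph as an adjacency map. Assumption: every vertex appears as a
  // key, and each edge is listed in both directions.
  type Graph = Map[Int, Seq[Int]]

  // Label every vertex with the smallest vertex id in its connected component.
  def connectedComponents(graph: Graph): Map[Int, Int] = {
    // Superstep 0: each vertex starts with its own id as label and is active.
    var labels: Map[Int, Int] = graph.keys.map(v => v -> v).toMap
    var active: Set[Int] = graph.keySet

    while (active.nonEmpty) {
      // Active vertices send their current label to all neighbours.
      val messages: Seq[(Int, Int)] =
        active.toSeq.flatMap(v => graph(v).map(n => (n, labels(v))))

      // Barrier: combine the messages per destination before any vertex updates.
      val inbox: Map[Int, Int] =
        messages.groupBy(_._1).map { case (dst, ms) => (dst, ms.map(_._2).min) }

      // Vertex program: adopt a smaller label if one arrived.
      // Only vertices whose label changed stay active in the next superstep.
      val updates = inbox.filter { case (v, label) => label < labels(v) }
      labels = labels ++ updates
      active = updates.keySet
    }
    labels
  }
}
```

The loop stops when no vertex changes its label, which is the usual BSP termination condition: every vertex has voted to halt and no messages are in flight.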
You killed Apache Giraph?
MLbase
• Machine Learning toolset
• MATLAB for scale-out computing
• Built on Spark MLlib
• Classification, Regression, Collaborative Filtering, etc.
What is really happening?
Diagram (legend: ASF Projects, FLOSS Projects, Pivotal Products; group labels: Coordination and workflow management, Hadoop UI):
HDFS, Pig, Sqoop, Flume, ZooKeeper, Command Center, GemFire XD, Oozie, MapReduce, Hive, Tez, Giraph, Hue, SolrCloud, Phoenix, HBase, Crunch, Mahout, Spark, Shark, Streaming, MLlib, GraphX, Impala, HAWQ, SpringXD, MADlib, Hamster, PivotalR, YARN, Tachyon
Principle #6
Spark: the ecosystem
Maybe it's not so bad
server 1
server N
But HDFS/YARN are safe?
HDFS, Ceph, S3, NAS, etc.
New HDFS
New YARN
What is *really* going on?
• 2009 Research at UCB, written in Scala
• 2010 Open Sourced
• 2013 Accepted into Apache Incubator
• 2013 Databricks formed ($14M funding)
• 2014 Becomes TLP with ASF
• 2014 Spark 1.0 is out
• 2014 Databricks gets an extra $33M
Bigdata: brought to U by ASF
• >50% ML traffic
• 100-200 contributors across 25-35 companies
• More active than Hadoop
• Cross-pollination with other TLPs
Principle #7
Where Hadoop was in ’09
This is how hardening looks
What is Hadoop?
Hadoop != MR + HDFS
The ecosystem
• Apache HBase
• Apache Crunch, Pig, Hive and Phoenix
• Apache Giraph
• Apache Oozie
• Apache Mahout
• Apache Sqoop and Flume
Principle #8
Spark: an alternative backend
Spark is best for cloud
Principle #9
Memory is expensive
What’s new?
• True elasticity
• Resource partitioning
• Security
• Data marketplace
• Multi datacenter deployments
Hadoop Maturity
ETL Offload: Accommodate massive data growth with existing EDW investments
Data Lakes: Unify Unstructured and Structured Data Access
Big Data Apps: Build analytic-led applications impacting top line revenue
Data-Driven Enterprise: App Dev and Operational Management on HDFS
Data Architecture
Pivotal HD on Pivotal CF
• Enterprise PaaS Management System
• Flexible multi-language ‘buildpack’ architecture
• Deployed applications enjoy built-in services
• On-Premise Hadoop as a Service
• Single cluster deployment of Pivotal HD
• Developers instantly bind to shared Hadoop Clusters
• Speeds up time-to-value
Pivotal’s view
Data Science Platform
Tachyon/Gem
Cluster Manager
MR
Application
Stream Server
MPP SQL
Data Lake / HDFS / Virtual Storage
GemFireXD
...ETC
Hadoop HDFS Isilon
App Dev / Ops
MLbase Streaming
Legacy Systems
Legacy
Data Scientists Data Sources End Users
SparkSQL
Principle #10
The rumors of my death…
It will be called Hadoop
Diagram (legend: ASF Projects, FLOSS Projects, Pivotal Products; group labels: Coordination and workflow management, Hadoop UI):
HDFS, Pig, Sqoop, Flume, ZooKeeper, Command Center, GemFire with Tachyon, Oozie, MapReduce, Hive, Tez, Giraph, Hue, SolrCloud, Phoenix, HBase, Crunch, Mahout, Spark, Shark, Streaming, MLlib, GraphX, Impala, HAWQ, SpringXD, MADlib, Hamster, PivotalR, YARN
Spark recap
• Is it “Big Data”? (Yes)
• Is it “Hadoop”? (No)
• It’s one of those “in memory” things, right? (Yes)
• JVM, Java, Scala? (All)
• Is it real, or just another shiny technology with a long but ultimately small tail? (Yes and ?)
A NEW PLATFORM FOR A NEW ERA
Credits
• Wikipedia and Dilbert.com
• Apache Software Foundation
• Scott Deeg
• Milind Bhandarkar
• Susheel Kaushik
• Mak Gokhale
Questions?