Hadoop - Past, Present and Future - v1.1
04/07/2023
Prepared for:
Presented by: “Big Data Joe” Rossi, @bigdatajoerossi
Hadoop: Past, Present and Future
Roadmap
~45mins
Q&A
1- What Makes Up Hadoop 1.x?
2- What’s New In Hadoop 2.x?
3- The Future Of Hadoop …
What Makes Up Hadoop 1.x?
Hadoop 1.0: HDFS + MapReduce
Diagram: a Client, a NameNode and a JobTracker, with four DataNode / TaskTracker worker nodes. Blocks labeled 1-1 through 4-3 are distributed across the worker nodes, and Map and Reduce tasks run on those same nodes.
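The Map and Reduce stages of a Hadoop 1.0 job can be illustrated with a minimal in-process sketch. This is not the Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the word-count task are assumptions chosen only to show how per-block Map output is grouped by key and then reduced.

```python
from collections import defaultdict


def map_phase(block):
    """Map: emit a (word, 1) pair for every word in one input block."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def run_job(blocks):
    """Run Map over every block, then shuffle and reduce the combined output."""
    pairs = []
    for block in blocks:
        pairs.extend(map_phase(block))
    return reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    # Hypothetical input: each string stands in for one HDFS block.
    print(run_job(["big data big cluster", "data node data"]))
```

On a real cluster each Map call would run on the TaskTracker that holds the block, which is why DataNode and TaskTracker share a node.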
MapReduce v1 Limitations
Scalability: Maximum cluster size is 4,000 nodes and maximum concurrent tasks is 40,000
Availability: JobTracker failure kills all queued and running jobs
Resources Partitioned into Map and Reduce: Hard partitioning of Map and Reduce slots led to low resource utilization
No Support for Alternate Paradigms / Services: Only MapReduce batch jobs, nothing else
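The utilization cost of hard-partitioned slots can be shown with a small arithmetic sketch. The function names and the 6 Map / 4 Reduce split are hypothetical; the point is only that idle Reduce slots cannot be lent to waiting Map tasks, while a shared pool can serve either kind.

```python
def slots_used_partitioned(map_slots, reduce_slots, pending_maps, pending_reduces):
    """MRv1-style: Map tasks may only use Map slots, Reduce tasks only Reduce slots."""
    return min(map_slots, pending_maps) + min(reduce_slots, pending_reduces)


def slots_used_pooled(total_slots, pending_maps, pending_reduces):
    """Pooled: any free slot can run any pending task."""
    return min(total_slots, pending_maps + pending_reduces)


def utilization(used, total):
    """Fraction of the node's capacity that is doing work."""
    return used / total


if __name__ == "__main__":
    # Hypothetical node: 10 slots, split 6 Map / 4 Reduce, during a Map-heavy phase.
    partitioned = slots_used_partitioned(6, 4, pending_maps=10, pending_reduces=0)
    pooled = slots_used_pooled(10, pending_maps=10, pending_reduces=0)
    print(utilization(partitioned, 10), utilization(pooled, 10))
```

With ten Map tasks waiting and no Reduce work, the partitioned node leaves its four Reduce slots idle.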
Apache Hadoop 1.0: Single Use System
Single Use System: Batch Apps
Diagram: Pig and Hive on top of MapReduce (cluster resource management and data processing), on top of HDFS (redundant, reliable storage)
What’s New In Hadoop 2.x?
YARN Replaces MapReduce
Yet Another Resource Negotiator
YARN
YARN will be the de-facto distributed operating system for Big Data
YARN: Taking Hadoop Beyond Batch
Store DATA in one place
Interact with that data in MULTIPLE WAYS with Predictable Performance and Quality of Service
Applications Run Natively IN Hadoop
Diagram: BATCH (MapReduce), INTERACTIVE (Tez), ONLINE (HBase), STREAMING (DataTorrent) and GRAPH (Giraph) on top of YARN (cluster resource management), on top of HDFS2 (redundant, reliable storage)
Running all on the same Hadoop cluster to give applications access to all the same source data!
YARN: Applications
MapReduce v2
Stream Processing
Master-Worker
Online
In-Memory
Apache Storm
YARN: Moving Quickly
Timeline axis: 2010, 2011, 2012, 2013, 2014, Today
Conceived at Yahoo!
Alpha Releases – 2.0
Beta Releases – 2.1
GA Released – 2.2
100,000+ nodes, 400,000+ jobs daily
10 million+ hours of compute daily
Version 2.3
Version 2.4
YARN: Dr. Evil Approved
YARN: What Has Changed?
YARN                                           MRv1
RM ResourceManager, AM ApplicationMaster       JT JobTracker
Scheduler                                      Scheduler
NM NodeManager                                 TT TaskTracker
Container                                      Map, Reduce
Diagram: in YARN, a ResourceManager with its Scheduler sits above NodeManagers that host an ApplicationMaster and Containers; in MRv1, a JobTracker with its Scheduler sits above TaskTrackers that run Map and Reduce tasks.
6 Benefits of YARN
1- Scale
2- New programming models and services
3- Improved cluster utilization
4- Agility
5- Backwards compatible with MapReduce v1
6- Mixed workloads on the same source of data
The Future of Hadoop
Projects and Roadmap
Stinger: Interactive Query for Hive
Speed: Deliver interactive query through 100x performance increases as compared to Hive 0.10.
SQL: Support the broadest array of SQL semantics for analytic applications running against Hadoop.
Scale: The only SQL interface to Hadoop designed for queries that scale from Terabytes to Petabytes.
HOYA: HBase (NoSQL) on YARN
Dynamic Scaling: On-demand cluster size. Increase and decrease the size with load.
Easier Deployment: APIs to create, start, stop and delete HBase clusters.
Availability: Recover from Region Server loss with a new container.
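Dynamic scaling and recovery from Region Server loss both reduce to comparing a desired server count with the live count. The sketch below is not the HOYA API; the function names, the per-server capacity and the min/max bounds are hypothetical values picked for illustration.

```python
import math


def desired_region_servers(requests_per_sec, capacity_per_server,
                           min_servers=1, max_servers=10):
    """Size the cluster to the load, clamped to assumed lower and upper bounds."""
    needed = math.ceil(requests_per_sec / capacity_per_server)
    return max(min_servers, min(max_servers, needed))


def containers_to_request(live_servers, desired):
    """Positive: ask YARN for that many new containers. Negative: release that many."""
    return desired - live_servers


if __name__ == "__main__":
    desired = desired_region_servers(4500, capacity_per_server=1000)
    # One of five Region Servers was lost, so four are live.
    print(desired, containers_to_request(live_servers=4, desired=desired))
```

A lost Region Server simply lowers the live count, so the same comparison that handles load changes also requests its replacement container.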
Microsoft REEF
Retainable Evaluator Execution Framework
Machine Learning: Framework well suited for building machine learning jobs.
Scalable / Fault Tolerant: Makes it easy to implement scalable, fault-tolerant runtime environments for a range of computational models.
Maintain State: Users can build jobs that utilize data from where it’s needed and also maintain state after jobs are done.
Heterogeneous Storages in HDFS
Diagram: a NameNode over a single Storage layer, compared with a NameNode over SATA, SSD and Fusion IO storage types.
Hadoop Roadmap
Apache Hadoop 2.4 (RELEASED, EARLY Q2 2014)
ResourceManager HA / Auto Failover
HDFS Rolling Upgrades
Apache Hadoop 2.5 (MID Q2 2014)
NodeManager Restart w/o disruption
Dynamic Resource Configuration
I Know You Have Questions …
No such thing as a stupid question.
SD Big Data Meetup
One Last Thing …
meetup.com/sdbigdata
2nd Wednesday Of The Month
Next: July 9th @ 5:45P
Thank You!
Big Data Joe Rossi
http://bigdatajoe.io/
@bigdatajoerossi
Supporting Slides
Slides with information that may be asked about
YARN: How It Works
Diagram: a Client talks to the ResourceManager and its Scheduler; four NodeManagers host one ApplicationMaster and three Containers.
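The YARN submission flow can be sketched as a toy simulation: a client submits to the ResourceManager, whose scheduler first places an ApplicationMaster and then places the task containers that the ApplicationMaster asks for. The classes and the least-loaded placement rule below are assumptions for illustration, not the YARN API or its real schedulers.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    name: str
    capacity: int                                   # containers this node can host
    containers: list = field(default_factory=list)

    def has_room(self):
        return len(self.containers) < self.capacity

    def launch(self, label):
        self.containers.append(label)


class ResourceManager:
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, label):
        """Toy scheduler: place one container on the least-loaded node with room."""
        candidates = [n for n in self.nodes if n.has_room()]
        if not candidates:
            return None
        node = min(candidates, key=lambda n: len(n.containers))
        node.launch(label)
        return node.name

    def submit(self, app, num_tasks):
        """Client entry point: start the ApplicationMaster, then its task containers."""
        placements = {}
        am_node = self.allocate(f"{app}-AM")
        if am_node is None:
            return placements
        placements[f"{app}-AM"] = am_node
        for i in range(num_tasks):
            label = f"{app}-task{i}"
            node = self.allocate(label)
            if node is not None:
                placements[label] = node
        return placements


if __name__ == "__main__":
    rm = ResourceManager([NodeManager(f"nm{i}", capacity=1) for i in range(1, 5)])
    print(rm.submit("app", num_tasks=3))
```

With four single-container nodes, the run mirrors the diagram: one NodeManager hosts the ApplicationMaster and the other three each host a Container.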
YARN: Example App Deployment
Diagram: a HOYA Client talks to the ResourceManager and its Scheduler; four NodeManagers host the HOYA / HBase Master and three Region Servers.
Storm Vs. DataTorrent
Solution Matrix            DataTorrent    Apache Storm
Atomic Micro-batch         1              3
Events per Second          Billions       Thousands
Automated Parallelism                     3
Dynamic Runtime Changes                   3
Linear Scalability                        3
State Checkpointing                       3
Apache Spark + Shark
Diagram layers: HDFS2 (redundant, reliable storage); YARN (cluster resource management); Apache Spark; Shark; Hive (sql)
Hadoop 2.x – YARN + HDFS
Diagram: a NameNode and a Standby NameNode / ResourceManager, with four DataNode / NodeManager worker nodes hosting Containers.
YARN: Key Take-Aways
Backwards Compatible: YARN is Backwards Compatible for your existing MapReduce applications. You can get value from it right away.
Resource Management: YARN enables Fine Grained Resource Management for better cluster utilization.
One Source of Data: YARN allows you to interact with One Source of Data in multiple ways while maintaining Predictable Performance and Quality of Service.
Enabling Smart People: YARN is a flexible framework that enables smart people and companies to do amazing things with data.
YARN will be the de-facto distributed operating system for Big Data
Storm Vs. DataTorrent - Detailed
Solution Matrix                      DataTorrent           Apache Storm
Proprietary / Open Source            O                     O
Support for Hadoop 1.x               1                     1
Support for Hadoop 2.x               1                     1
Native YARN                          1                     3
Dashboard                            1                     3
Extensible via Modules               1                     1
Technical Support                    1                     1
Atomic Micro-batch                   1                     3
Events per Second                    Billions              Thousands
Automated Parallelism                1                     3
Dynamic Runtime Changes              1                     3
High Availability                    1                     2
Prog. Languages Supported            Java, Python, etc.    Java, Python, etc.
Log Analysis                         1                     3
Site Operations                      1                     3
MapReduce Diagnostics                1                     3
Open Source Operators Library        1                     2
Open Source Application Templates    1                     3
Complex Computations (DAG)           1                     3
Linear Scalability                   1                     3
Security                             1                     3
CLI and Macros                       1                     3
Configuration Based Specification    1                     3
State Checkpointing                  1                     3
The 1st Generation Of Hadoop
Users forced to create data system silos for managing mixed workloads
Developers forced to abuse the very specific MapReduce model to fit their use cases
Diagram labels: Hadoop, HBase
Apache Spark
Diagram layers: HDFS2 (redundant, reliable storage); YARN (cluster resource management); Apache Spark; Shark; Hive (sql); Spark Streaming; MLlib (machine learning)
Project Management Committee Members
Bar chart by company: Hortonworks, Others, Cloudera, Yahoo! (axis 0 to 16; extracted bar values, not matched to labels: 7, 6, 3, 15, 11)
Project Committers
Bar chart by company: Hortonworks, Others, Cloudera, Yahoo! (axis 0 to 30; extracted bar values, not matched to labels: 24, 24, 11, 11, 5)
YARN: Why The De-Facto Distributed OS
Technology Adoption: 100,000+ nodes, 400,000 jobs and 10m compute hours daily
Enables Innovation: Lets smart people and companies do amazing things with data
Financial Backing: 568m+ invested in Hadoop contributing companies, nearly 400m in 2013 alone
Apache Storm Topology
Diagram: two Spouts each read a Stream (Data Source) and feed Bolts for Filter, Calculation, RDBMS Writes and HDFS Writes; the write Bolts send results to an RDBMS and to HDFS.
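A spout-and-bolt topology can be approximated with chained Python generators. This is a sketch, not the Storm API: the function names, the `value` field, and the exact wiring (filter, then calculation, then both writers) are assumptions, since the slide does not spell out how the bolts connect.

```python
def spout(events):
    """Spout: turn a data source into a stream of tuples."""
    for event in events:
        yield event


def filter_bolt(stream):
    """Bolt (Filter): drop tuples whose value is not positive."""
    for event in stream:
        if event["value"] > 0:
            yield event


def calculation_bolt(stream):
    """Bolt (Calculation): attach a running total to each tuple."""
    total = 0
    for event in stream:
        total += event["value"]
        yield {**event, "running_total": total}


def run_topology(events, rdbms_sink, hdfs_sink):
    """Wire the spout through the bolts and fan out to both writers."""
    for event in calculation_bolt(filter_bolt(spout(events))):
        rdbms_sink.append(event)   # stands in for Bolt (RDBMS Writes)
        hdfs_sink.append(event)    # stands in for Bolt (HDFS Writes)


if __name__ == "__main__":
    rdbms, hdfs = [], []
    run_topology([{"value": 2}, {"value": -1}, {"value": 3}], rdbms, hdfs)
    print(rdbms)
```

Each tuple flows through the chain as soon as it is emitted, which is the streaming behaviour that separates this model from batch MapReduce.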
HDFS Write Data Flow
Diagram: a Client, the NameNode and three DataNodes (A, B, C), with steps numbered 1 to 7. Block Bytes pass from the Client along the DataNodes, Acks are returned, and the sequence ends with Block Write Complete.
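The HDFS write pipeline can be sketched as a chain of DataNodes that store a block, forward it downstream and then acknowledge upstream. The classes and method names here are hypothetical, and the sketch simplifies real HDFS, which streams a block in small packets instead of handing over the whole block at once.

```python
class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, block_id, data, downstream, log):
        """Store the block, forward it down the pipeline, then ack upstream."""
        self.blocks[block_id] = data
        log.append(f"{self.name} stored {block_id}")
        if downstream:
            downstream[0].write(block_id, data, downstream[1:], log)
        log.append(f"{self.name} ack")


class NameNode:
    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes
        self.replication = replication
        self.block_map = {}

    def allocate_block(self, block_id):
        """Pick the DataNodes that will hold the replicas (toy rule: first N)."""
        return self.datanodes[: self.replication]

    def block_complete(self, block_id, pipeline):
        """Record where the finished block lives."""
        self.block_map[block_id] = [dn.name for dn in pipeline]


def client_write(namenode, block_id, data):
    """Ask the NameNode for targets, push the block, then report completion."""
    log = []
    pipeline = namenode.allocate_block(block_id)
    pipeline[0].write(block_id, data, pipeline[1:], log)
    namenode.block_complete(block_id, pipeline)
    return log


if __name__ == "__main__":
    nodes = [DataNode("A"), DataNode("B"), DataNode("C")]
    print(client_write(NameNode(nodes), "b1", b"block bytes"))
```

The log shows the bytes travelling A, B, C and the acks returning C, B, A, so the client only contacts the first DataNode directly.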