Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition
DESCRIPTION
Presentation given at SQLSaturday #326 Tampa, FL BA Edition https://www.sqlsaturday.com/326/schedule.aspx
TRANSCRIPT
![Page 1: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition](https://reader033.vdocuments.site/reader033/viewer/2022060120/55934a291a28ab16568b471b/html5/thumbnails/1.jpg)
© 2014 Trace3, All rights reserved.
BIG DATA INTELLIGENCE PRACTICE
HADOOP: PAST, PRESENT AND FUTURE
![Page 2: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition](https://reader033.vdocuments.site/reader033/viewer/2022060120/55934a291a28ab16568b471b/html5/thumbnails/2.jpg)
Roadmap
~1 hour
1. What Makes Up Hadoop 1.x?
2. What’s New In Hadoop 2.x?
3. The Future Of Hadoop …
![Page 3: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition](https://reader033.vdocuments.site/reader033/viewer/2022060120/55934a291a28ab16568b471b/html5/thumbnails/3.jpg)
WHAT MAKES UP HADOOP 1.0?
![Page 4: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition](https://reader033.vdocuments.site/reader033/viewer/2022060120/55934a291a28ab16568b471b/html5/thumbnails/4.jpg)
What’s a “Node”?
Node aka Server
Compute
Storage
Processes / Daemons / Services
Memory
Operating System
![Page 5: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition](https://reader033.vdocuments.site/reader033/viewer/2022060120/55934a291a28ab16568b471b/html5/thumbnails/5.jpg)
Hadoop 1.0: HDFS + MapReduce
[Diagram: a Client, a NameNode, a JobTracker and four worker nodes, each running DataNode / TaskTracker; items labeled 1-1, 1-2 and 1-3 appear next to the Client]
![Page 6: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition](https://reader033.vdocuments.site/reader033/viewer/2022060120/55934a291a28ab16568b471b/html5/thumbnails/6.jpg)
Hadoop 1.0: HDFS + MapReduce
[Diagram: the same cluster (Client, NameNode, JobTracker and four DataNode / TaskTracker nodes), now with items 1-1 through 4-3 spread across the worker nodes and Map and Reduce tasks running on them]
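As a rough illustration of the Map and Reduce phases on this slide, here is a minimal single-process sketch in Python. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample blocks are hypothetical, and on a real cluster the map tasks would run on the TaskTrackers rather than in one process.

```python
from collections import defaultdict

def map_phase(block):
    """Map: emit a (word, 1) pair for every word in one input block."""
    return [(word.lower(), 1) for word in block.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key, between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}

def word_count(blocks):
    # Each block is mapped independently of the others, which is what
    # allows map tasks to run in parallel on different nodes.
    pairs = [pair for block in blocks for pair in map_phase(block)]
    return reduce_phase(shuffle(pairs))

# Hypothetical input: one small file split into three blocks.
blocks = ["big data big", "data on hadoop", "hadoop hadoop"]
counts = word_count(blocks)
print(counts)
```

The three functions correspond to the three stages a MapReduce job goes through; only the map and reduce logic is written by the job author, while the framework handles the shuffle.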
![Page 7: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition](https://reader033.vdocuments.site/reader033/viewer/2022060120/55934a291a28ab16568b471b/html5/thumbnails/7.jpg)
MapReduce v1 Limitations
Scalability: Maximum cluster size is 4,000 nodes and maximum concurrent tasks is 40,000
Availability: JobTracker failure kills all queued and running jobs
Resources Partitioned into Map and Reduce: Hard partitioning of Map and Reduce slots led to low resource utilization
No Support for Alternate Paradigms / Services: Only MapReduce batch jobs, nothing else
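The low utilization caused by hard partitioning of slots can be shown with a little arithmetic. The sketch below is illustrative only: the slot counts and the workload are made-up numbers, and `hard_partitioned` and `shared_pool` are hypothetical helpers, not Hadoop APIs.

```python
def hard_partitioned(map_slots, reduce_slots, map_tasks, reduce_tasks):
    """MRv1 style: map tasks may only use map slots, reduce tasks only reduce slots."""
    running = min(map_slots, map_tasks) + min(reduce_slots, reduce_tasks)
    return running / (map_slots + reduce_slots)

def shared_pool(total_slots, map_tasks, reduce_tasks):
    """Generic pool: any free slot can run any pending task."""
    running = min(total_slots, map_tasks + reduce_tasks)
    return running / total_slots

# Assumed node: 8 map slots and 4 reduce slots. Assumed workload: a job in
# its map-heavy phase, with 12 map tasks waiting and no reduce tasks yet.
partitioned = hard_partitioned(8, 4, map_tasks=12, reduce_tasks=0)
pooled = shared_pool(12, map_tasks=12, reduce_tasks=0)
print(f"hard partitioning: {partitioned:.0%} busy, shared pool: {pooled:.0%} busy")
```

With these assumed numbers the four reduce slots sit idle while map tasks queue, so a third of the node goes unused even though work is waiting.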
![Page 8: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition](https://reader033.vdocuments.site/reader033/viewer/2022060120/55934a291a28ab16568b471b/html5/thumbnails/8.jpg)
Hadoop 1.0: Single Use System
HADOOP 1.0
Single Use System: Batch Apps
HDFS (redundant, reliable storage)
MapReduce (cluster resource management and data processing)
Pig Hive
![Page 9: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition](https://reader033.vdocuments.site/reader033/viewer/2022060120/55934a291a28ab16568b471b/html5/thumbnails/9.jpg)
WHAT’S NEW IN HADOOP 2.0?
![Page 10: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition](https://reader033.vdocuments.site/reader033/viewer/2022060120/55934a291a28ab16568b471b/html5/thumbnails/10.jpg)
YARN
YARN Replaces MapReduce as the Cluster Resource Manager
Yet Another Resource Negotiator
YARN will be the de facto distributed operating system for Big Data
![Page 11: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition](https://reader033.vdocuments.site/reader033/viewer/2022060120/55934a291a28ab16568b471b/html5/thumbnails/11.jpg)
YARN = BIG DATA
![Page 12: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition](https://reader033.vdocuments.site/reader033/viewer/2022060120/55934a291a28ab16568b471b/html5/thumbnails/12.jpg)
YARN: No Longer Just Batch Apps
Store DATA in one place. Interact with that data in MULTIPLE WAYS, with Predictable Performance and Quality of Service.
Applications Run Natively IN Hadoop
HDFS2 (redundant, reliable storage)
YARN (cluster resource management)
BATCH (MapReduce)
INTERACTIVE (Tez)
ONLINE (HBase)
STREAMING (DataTorrent)
GRAPH (Giraph)
![Page 13: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition](https://reader033.vdocuments.site/reader033/viewer/2022060120/55934a291a28ab16568b471b/html5/thumbnails/13.jpg)
YARN: Applications
All running on the same Hadoop cluster, giving applications access to the same source data!
MapReduce v2
Real-Time Stream Processing
Master-Worker Online
In-Memory
Apache Storm
![Page 14: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition](https://reader033.vdocuments.site/reader033/viewer/2022060120/55934a291a28ab16568b471b/html5/thumbnails/14.jpg)
YARN: Quickly Maturing
Timeline: 2010, 2011, 2012, 2013, 2014, Today
Conceived at Yahoo!
Alpha Releases - 2.0
Beta Releases - 2.1
GA Released - 2.2
Version 2.3
Version 2.4
Version 2.5
200,000+ nodes, 800,000+ jobs daily, 10 million+ hours of compute daily
![Page 15: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition](https://reader033.vdocuments.site/reader033/viewer/2022060120/55934a291a28ab16568b471b/html5/thumbnails/15.jpg)
YARN: What Has Changed?

| YARN | MRv1 |
| --- | --- |
| RM (ResourceManager) and AM (ApplicationMaster) | JT (JobTracker) |
| Scheduler | Scheduler |
| NM (NodeManager) | TT (TaskTracker) |
| Container | Map & Reduce Slot |

[Diagram: on the YARN side, a ResourceManager with its Scheduler and NodeManagers hosting an ApplicationMaster and Containers; on the MRv1 side, a JobTracker with its Scheduler and TaskTrackers hosting Map and Reduce slots]
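To make the container model concrete, here is a toy Python sketch of a ResourceManager handing out containers on NodeManagers. Everything in it is an assumption for illustration: the class names mirror the YARN roles on this slide, but the methods, the memory-only resource model and the first-fit placement are hypothetical and far simpler than the real scheduler. The ApplicationMaster, which would normally issue these requests on behalf of each application, is left out.

```python
class NodeManager:
    """Tracks free memory on one node (memory is the only resource modeled)."""
    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb

class ResourceManager:
    """Grants containers on the first node with enough free memory (first fit)."""
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app, memory_mb):
        for node in self.nodes:
            if node.free_mb >= memory_mb:
                node.free_mb -= memory_mb
                return {"app": app, "node": node.name, "memory_mb": memory_mb}
        return None  # no capacity left: a real cluster would queue the request

rm = ResourceManager([NodeManager("node1", 4096), NodeManager("node2", 4096)])

# Unlike fixed Map and Reduce slots, a container is just a sized grant,
# so different kinds of applications can share the same nodes.
grants = [
    rm.allocate("mapreduce-job", 3072),
    rm.allocate("hbase", 2048),
    rm.allocate("storm-topology", 1024),
    rm.allocate("mapreduce-job", 4096),
]
for grant in grants:
    print(grant)
```

The last request returns `None` because neither assumed node has 4096 MB free any more, which is the point at which a real scheduler would make the application wait.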
![Page 16: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition](https://reader033.vdocuments.site/reader033/viewer/2022060120/55934a291a28ab16568b471b/html5/thumbnails/16.jpg)
The 6 Benefits Of YARN
• Scale
• New programming models and services
• Improved cluster utilization
• Agility
• Backwards compatible with MapReduce v1
• Mixed workloads on the same source of data
![Page 17: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition](https://reader033.vdocuments.site/reader033/viewer/2022060120/55934a291a28ab16568b471b/html5/thumbnails/17.jpg)
THE FUTURE OF HADOOP
![Page 18: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition](https://reader033.vdocuments.site/reader033/viewer/2022060120/55934a291a28ab16568b471b/html5/thumbnails/18.jpg)
SQL on Hadoop
Speed: Deliver interactive query performance.
SQL: Support an array of SQL semantics for analytic applications running against Hadoop.
Scale: SQL interface to Hadoop designed for queries that scale from terabytes to petabytes.
![Page 19: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition](https://reader033.vdocuments.site/reader033/viewer/2022060120/55934a291a28ab16568b471b/html5/thumbnails/19.jpg)
SQL on Hadoop
Hive on Apache Tez (Hortonworks HDP2)
Hive on Apache Spark (Cloudera CDH5)
Apache Drill (MapR M7)
Cloudera Impala (Cloudera CDH5)
Pivotal HAWQ (Pivotal Big Data Suite)
![Page 20: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition](https://reader033.vdocuments.site/reader033/viewer/2022060120/55934a291a28ab16568b471b/html5/thumbnails/20.jpg)
Apache Spark
HDFS2 (redundant, reliable storage)
YARN (cluster resource management)
Apache Spark (Databricks)
Programming Languages: Java, Scala, Python, R*
Interactive Shell: Ability to write code and get output.
Faster by up to ~100x: Due to how it handles data in memory.
![Page 21: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition](https://reader033.vdocuments.site/reader033/viewer/2022060120/55934a291a28ab16568b471b/html5/thumbnails/21.jpg)
Apache Spark – Wordcount
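A word count in Spark is typically a short chain of RDD transformations. The following is a minimal PySpark sketch, not necessarily the code shown in the talk. It assumes PySpark is installed and that `sc` is an existing `SparkContext`; the `tokenize` and `word_count` names and any input path are hypothetical.

```python
from operator import add

def tokenize(line):
    """Split one line of text into words on whitespace."""
    return line.split()

def word_count(sc, path):
    """Count words in a text file.

    `sc` is assumed to be a pyspark.SparkContext and `path` a file
    readable by the cluster; neither is created here.
    """
    return (
        sc.textFile(path)                 # one RDD element per line
        .flatMap(tokenize)                # one element per word
        .map(lambda word: (word, 1))      # pair each word with a count of 1
        .reduceByKey(add)                 # sum the counts for each word
        .collect()                        # bring the results back to the driver
    )
```

In the `pyspark` interactive shell, where `sc` is predefined, calling `word_count(sc, "hdfs:///path/to/input.txt")` would return a list of `(word, count)` tuples; the path is a placeholder.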
![Page 22: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition](https://reader033.vdocuments.site/reader033/viewer/2022060120/55934a291a28ab16568b471b/html5/thumbnails/22.jpg)
HOYA: HBase (NoSQL) on YARN
Dynamic Scaling: On-demand cluster size. Increase and decrease the size with load.
Easier Deployment: APIs to create, start, stop and delete HBase clusters.
Availability: Recover from Region Server loss with a new container.
![Page 23: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition](https://reader033.vdocuments.site/reader033/viewer/2022060120/55934a291a28ab16568b471b/html5/thumbnails/23.jpg)
Apache REEF
Retainable Evaluator Execution Framework
Machine Learning: Framework well suited for building machine learning jobs.
Scalable / Fault Tolerant: Makes it easy to implement scalable, fault-tolerant runtime environments for a range of computational models.
Maintain State: Users can build jobs that utilize data from where it’s needed and also maintain state after jobs are done.
![Page 24: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition](https://reader033.vdocuments.site/reader033/viewer/2022060120/55934a291a28ab16568b471b/html5/thumbnails/24.jpg)
Real-Time Stream Processing
Apache Storm
Streaming
![Page 25: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition](https://reader033.vdocuments.site/reader033/viewer/2022060120/55934a291a28ab16568b471b/html5/thumbnails/25.jpg)
Heterogeneous Storage
THEN: NameNode with a single Storage type
NOW: NameNode with SATA, SSD and Fusion IO storage
![Page 26: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition](https://reader033.vdocuments.site/reader033/viewer/2022060120/55934a291a28ab16568b471b/html5/thumbnails/26.jpg)
Hadoop Roadmap
• Apache Hadoop 2.5 (Q3 2014)
  - NodeManager Restart w/o disruption
• Apache Hadoop 2.6 (Q4 2014)
  - Memory As Storage Tier
  - Dynamic Resource Configuration
  - Support For Docker Containers
![Page 27: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition](https://reader033.vdocuments.site/reader033/viewer/2022060120/55934a291a28ab16568b471b/html5/thumbnails/27.jpg)
I KNOW YOU HAVE QUESTIONS
![Page 28: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition](https://reader033.vdocuments.site/reader033/viewer/2022060120/55934a291a28ab16568b471b/html5/thumbnails/28.jpg)
THANK YOU!
http://bigdatajoe.io/
http://bigdatacentric.com/
@bigdatajoerossi