© Copyright 2014, Neudesic. All rights reserved.
Understanding Apache Hadoop
Presented By: Orion Gebremedhin
BI/Big Data Solution Partner, Neudesic LLC.
Data Platform VTSP, Microsoft Corp.
@OrionGM
Topics Covered
• Fundamentals of Hadoop
• Major Hadoop Distributions
Big Data = Hadoop?
* A Modern Data Architecture with Apache Hadoop, Hortonworks Inc. 2014
The Fundamentals of Hadoop
• Hadoop evolved directly from commodity scientific supercomputing clusters developed in the 1990s
• Hadoop consists of:
  • MapReduce
  • Hadoop Distributed File System (HDFS)
What’s New?
Characteristics of HDFS
• Very high fault tolerance
• Files cannot be updated in place, but corrections can be appended
• File blocks are replicated multiple times
Three Types of Nodes:
1. Name Node (Directory)
2. Backup Node (Checkpoint)
3. Data Node (Actual Data)
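The replication idea above can be illustrated with a toy model. This is only a sketch under stated assumptions, not how HDFS is implemented: the names `store_file`, `read_file`, `BLOCK_SIZE`, and `REPLICATION`, the 4-byte block size, and the round-robin placement are all hypothetical choices made for the demo. The `directory` dictionary plays the role of the Name Node and the `data_nodes` dictionary plays the role of the Data Nodes.

```python
BLOCK_SIZE = 4    # bytes per block; tiny on purpose (demo assumption)
REPLICATION = 3   # copies kept of every block (demo assumption)


def store_file(name, data, data_nodes, directory):
    """Split data into blocks and copy each block to REPLICATION data nodes."""
    node_ids = sorted(data_nodes)
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    directory[name] = []
    for index, block in enumerate(blocks):
        block_id = f"{name}#{index}"
        # Round-robin placement (hypothetical policy): each block starts
        # at a different node, so replicas are spread across the cluster.
        holders = [node_ids[(index + k) % len(node_ids)]
                   for k in range(REPLICATION)]
        for node in holders:
            data_nodes[node][block_id] = block
        # The directory records where the block lives, not the data itself.
        directory[name].append((block_id, holders))


def read_file(name, data_nodes, directory, failed=()):
    """Rebuild a file from surviving replicas, skipping failed data nodes."""
    parts = []
    for block_id, holders in directory[name]:
        alive = [node for node in holders if node not in failed]
        if not alive:
            raise IOError(f"all replicas of {block_id} are lost")
        parts.append(data_nodes[alive[0]][block_id])
    return b"".join(parts)


data_nodes = {"dn1": {}, "dn2": {}, "dn3": {}, "dn4": {}}
directory = {}
store_file("bills.txt", b"hello hadoop world", data_nodes, directory)

# Two of the four data nodes fail, yet every block still has a live replica.
recovered = read_file("bills.txt", data_nodes, directory, failed={"dn1", "dn2"})
print(recovered)
```

With three replicas spread over four nodes, any two nodes can fail and the file can still be read, which is the sense in which replication gives fault tolerance.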
Characteristics of MapReduce
A programming framework with a library and a runtime (just like .NET)
• Map Function - Take a task and break it down into small tasks
• Reduce Function - Combine the partial answers into the combined list
Master (Job Tracker)
• When a query is submitted, the Job Tracker manages the Task Trackers, which trigger the actual Map or Reduce tasks
Workers (Task Trackers)
• The “doers.” Each worker node in the cluster has a Data Node and a Task Tracker
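The Map and Reduce functions described above can be sketched with the classic word-count example. This is a single-process illustration in plain Python, not the Hadoop Java API: the names `map_function`, `shuffle`, `reduce_function`, and `run_job` are hypothetical, and the grouping step that a real cluster performs between the two phases is simulated with a dictionary.

```python
from collections import defaultdict


def map_function(line):
    """Map: break one input line into small (word, 1) pieces."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values that share a key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_function(word, counts):
    """Reduce: combine the partial answers for one word."""
    return word, sum(counts)


def run_job(lines):
    """Run map over every line, group by word, then reduce each group."""
    mapped = [pair for line in lines for pair in map_function(line)]
    grouped = shuffle(mapped)
    return dict(reduce_function(word, counts) for word, counts in grouped.items())


result = run_job(["big data big cluster", "data node"])
print(result)  # {'big': 2, 'data': 2, 'cluster': 1, 'node': 1}
```

In a real cluster each call to `map_function` could run on a different worker, which is why the map step must not depend on any other line.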
HDFS and MapReduce
The Main Node: runs the Job Tracker and the Name Node, which controls the files.
Each worker node runs two processes: Task Tracker and Data Node
Basics of MapReduce
1 counter: 400 bills at 1 bill/sec = 400 seconds
Basics of MapReduce
2 counters: 200 bills each at 1 bill/sec = 200 seconds each
Total = 200 seconds
Basics of MapReduce
4 counters: 100 bills each = 100 seconds each
Total = 100 seconds
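The arithmetic behind the bill-counting slides can be written as a small function. This is a simplified sketch: the name `counting_time` is hypothetical, and it assumes the bills are split as evenly as possible, every counter works at the same rate, and the time to add up the partial totals at the end is ignored.

```python
import math


def counting_time(total_bills, counters, bills_per_second=1):
    """Seconds until the busiest counter finishes its share of the bills."""
    largest_share = math.ceil(total_bills / counters)
    return largest_share / bills_per_second


for counters in (1, 2, 4):
    print(counters, "counter(s):", counting_time(400, counters), "seconds")
# 1 counter(s): 400.0 seconds
# 2 counter(s): 200.0 seconds
# 4 counter(s): 100.0 seconds
```

The total time is set by the slowest counter, so doubling the number of counters halves the time only as long as the work divides evenly.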
Basics of MapReduce
[Diagram: a Query goes to the Name Node/Job Tracker, which passes it to the Data Nodes/Task Trackers; the Result is returned]
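The query flow in this diagram can be mimicked on one machine. In the sketch below, threads stand in for worker nodes, which is an assumption made purely for illustration; the function names `job_tracker` and `task_tracker` are hypothetical and are not Hadoop APIs. The master splits the bills into chunks, each worker totals its own chunk, and the master combines the partial totals into the result.

```python
from concurrent.futures import ThreadPoolExecutor


def task_tracker(chunk):
    """Worker: total the bills held locally (the map-side work)."""
    return sum(chunk)


def job_tracker(bills, workers):
    """Master: split the work, dispatch one chunk per worker, combine results."""
    size = -(-len(bills) // workers)  # ceiling division
    chunks = [bills[i:i + size] for i in range(0, len(bills), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_totals = list(pool.map(task_tracker, chunks))
    # Reduce step: combine the partial answers into the final result.
    return sum(partial_totals)


bills = [20] * 400  # 400 bills worth 20 each (made-up demo data)
total = job_tracker(bills, workers=4)
print(total)  # 8000
```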
MapReduce, Pig and Hive
MapReduce
• Java
• Write many lines of code
Pig
• Mostly used by Yahoo
• Mostly used for data processing
• Shares some constructs with SQL
• Is more verbose
• Needs a lot of training for users with limited procedural programming background
• Offers control over the flow of data
Hive
• Mostly used by Facebook for analytic purposes
• Relatively easier for developers with SQL experience
• Less control over optimization of data flows compared to Pig
Pig and Hive
• Not as efficient as MapReduce
• Higher productivity for data scientists and developers
Major Versions of Apache Hadoop
Apache Foundation
Hortonworks
• 2011: $23 million funding from Yahoo! and Benchmark Capital; positioned as an independent company
• Horton the Elephant - Horton Hears a Who!
• Employs contributors to project Apache Hadoop
• October 2011: partnered with Microsoft on Azure and Windows Server
• Cloudera founded in October 2008…started the effort to be Microsoft Azure Certified in October 2014.
HDP User Interface
HUE Architecture
Questions?
Tom Marek
Twitter: @twmarek
Orion Gebremedhin
Twitter: @OrionGM