Introduction to Hadoop and HDFS
![Page 1: Introduction to Hadoop and HDFS](https://reader035.vdocuments.site/reader035/viewer/2022081420/56815ec1550346895dcd48ea/html5/thumbnails/1.jpg)
Introduction to Hadoop and HDFS
![Page 2: Introduction to Hadoop and HDFS](https://reader035.vdocuments.site/reader035/viewer/2022081420/56815ec1550346895dcd48ea/html5/thumbnails/2.jpg)
Table of Contents
Hadoop – Overview
Hadoop Cluster
HDFS
![Page 3: Introduction to Hadoop and HDFS](https://reader035.vdocuments.site/reader035/viewer/2022081420/56815ec1550346895dcd48ea/html5/thumbnails/3.jpg)
Hadoop Overview
![Page 4: Introduction to Hadoop and HDFS](https://reader035.vdocuments.site/reader035/viewer/2022081420/56815ec1550346895dcd48ea/html5/thumbnails/4.jpg)
What is Hadoop?
• Hadoop is an open source framework for writing and running distributed applications that process large amounts of data.
• Hadoop’s accessibility and simplicity give it an edge over writing and running large distributed programs by hand.
• At the same time, its robustness and scalability make it suitable for even the most demanding jobs at Yahoo and Facebook.
• A Hadoop cluster is a set of commodity machines networked together in one location.
![Page 5: Introduction to Hadoop and HDFS](https://reader035.vdocuments.site/reader035/viewer/2022081420/56815ec1550346895dcd48ea/html5/thumbnails/5.jpg)
Key distinctions of Hadoop
• Accessible - Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon’s Elastic Compute Cloud (EC2).
• Robust - Because it is intended to run on commodity hardware, Hadoop is architected with the assumption of frequent hardware malfunctions. It can gracefully handle most such failures.
• Scalable - Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
• Simple - Hadoop allows users to quickly write efficient parallel code.
![Page 6: Introduction to Hadoop and HDFS](https://reader035.vdocuments.site/reader035/viewer/2022081420/56815ec1550346895dcd48ea/html5/thumbnails/6.jpg)
Comparing SQL databases and Hadoop
• SCALE-OUT INSTEAD OF SCALE-UP
• KEY/VALUE PAIRS INSTEAD OF RELATIONAL TABLES
• FUNCTIONAL PROGRAMMING (MAPREDUCE) INSTEAD OF DECLARATIVE QUERIES (SQL)
• OFFLINE BATCH PROCESSING INSTEAD OF ONLINE TRANSACTIONS
![Page 7: Introduction to Hadoop and HDFS](https://reader035.vdocuments.site/reader035/viewer/2022081420/56815ec1550346895dcd48ea/html5/thumbnails/7.jpg)
Hadoop Ecosystem
• HDFS - A distributed file system that runs on large clusters of commodity machines.
• MapReduce - A distributed data processing model and execution environment that runs on large clusters of commodity machines.
• Pig - A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
![Page 8: Introduction to Hadoop and HDFS](https://reader035.vdocuments.site/reader035/viewer/2022081420/56815ec1550346895dcd48ea/html5/thumbnails/8.jpg)
Hadoop Ecosystem
• Hive - A distributed data warehouse. Hive manages data stored in HDFS and provides a SQL-based query language, which the runtime engine translates into MapReduce jobs.
• HBase - A distributed, column-oriented database. HBase uses HDFS for its underlying storage and supports both batch-style computations using MapReduce and point queries (random reads).
![Page 9: Introduction to Hadoop and HDFS](https://reader035.vdocuments.site/reader035/viewer/2022081420/56815ec1550346895dcd48ea/html5/thumbnails/9.jpg)
Hadoop Cluster
![Page 10: Introduction to Hadoop and HDFS](https://reader035.vdocuments.site/reader035/viewer/2022081420/56815ec1550346895dcd48ea/html5/thumbnails/10.jpg)
Detailed Hadoop Architecture
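[Diagram: a Client connects to the NameNode (NN) and JobTracker (JT); below them are three TaskTrackers (TT), each running a task, and three DataNodes (DN).]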
![Page 11: Introduction to Hadoop and HDFS](https://reader035.vdocuments.site/reader035/viewer/2022081420/56815ec1550346895dcd48ea/html5/thumbnails/11.jpg)
Hadoop Framework
[Diagram: a MapReduce job running on top of the HDFS file system; each layer is shown with structured, unstructured, and semi-structured data.]
![Page 12: Introduction to Hadoop and HDFS](https://reader035.vdocuments.site/reader035/viewer/2022081420/56815ec1550346895dcd48ea/html5/thumbnails/12.jpg)
Typical Workflow
• Load data into the cluster (HDFS writes)
• Analyze data (MapReduce job)
• Store results in the cluster (HDFS writes)
• Read results from the cluster (HDFS reads)
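As a rough illustration of the four workflow steps, the sketch below only builds the command lines one might run; it does not execute them. `hadoop fs -put`, `hadoop jar`, and `hadoop fs -cat` are standard Hadoop commands, but every path, the jar name, the class name, and the convention that the job takes an input and an output directory as arguments are hypothetical placeholders.

```python
# Sketch only: builds (does not run) the commands for the typical workflow.
# All paths, the jar name, and the main class are hypothetical placeholders.

def workflow_commands(local_input, hdfs_input, hdfs_output, job_jar, main_class):
    """Return argv lists for: load, analyze + store, read."""
    return [
        # 1. Load data into the cluster (HDFS write).
        ["hadoop", "fs", "-put", local_input, hdfs_input],
        # 2 and 3. Run the MapReduce job. Assumption: the job reads from
        # hdfs_input and writes its results to hdfs_output (HDFS write).
        ["hadoop", "jar", job_jar, main_class, hdfs_input, hdfs_output],
        # 4. Read results from the cluster (HDFS read).
        ["hadoop", "fs", "-cat", hdfs_output + "/part-*"],
    ]


if __name__ == "__main__":
    for argv in workflow_commands(
        "access.log", "/data/in", "/data/out", "analysis.jar", "LogAnalysis"
    ):
        print(" ".join(argv))
```

Storing the results is not a separate command here because the job itself writes its output files into the output directory.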
![Page 13: Introduction to Hadoop and HDFS](https://reader035.vdocuments.site/reader035/viewer/2022081420/56815ec1550346895dcd48ea/html5/thumbnails/13.jpg)
Example
![Page 14: Introduction to Hadoop and HDFS](https://reader035.vdocuments.site/reader035/viewer/2022081420/56815ec1550346895dcd48ea/html5/thumbnails/14.jpg)
Hadoop Distributed File System (HDFS)
![Page 15: Introduction to Hadoop and HDFS](https://reader035.vdocuments.site/reader035/viewer/2022081420/56815ec1550346895dcd48ea/html5/thumbnails/15.jpg)
Hadoop Distributed File System
• Shared multi-petabyte file system for the entire cluster. It is managed by a single NameNode. Files can be written, read, renamed, and deleted, but are append-only. It is optimized for streaming reads of large files.
• Files are broken into uniform-sized blocks. Blocks are typically 128 MB (64 MB default) and are replicated to several DataNodes for reliability.
• Data is distributed to many nodes. Bandwidth scales linearly with the number of disks, and there is no single path to all data.
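To make the block arithmetic concrete, here is a minimal sketch that computes how many blocks a file occupies and how much raw disk space its replicas use. The function name is made up for this example, and the replication factor of 3 is an assumption (it is the usual HDFS default; the slide only says "several").

```python
MB = 1024 * 1024


def block_layout(file_size_bytes, block_size_bytes=64 * MB, replication=3):
    """Return (number of blocks, total block replicas, raw bytes stored).

    Assumption: replication=3, the common HDFS default.
    """
    # Ceiling division: a partial final block still counts as a block.
    num_blocks = (file_size_bytes + block_size_bytes - 1) // block_size_bytes
    total_replicas = num_blocks * replication
    # The final block is not padded to full size, so raw usage is simply
    # the file size multiplied by the replication factor.
    raw_bytes = file_size_bytes * replication
    return num_blocks, total_replicas, raw_bytes


if __name__ == "__main__":
    blocks, replicas, raw = block_layout(1024 * MB)  # a 1 GB file
    print(blocks, replicas, raw // MB)               # 16 48 3072
    print(block_layout(1024 * MB, 128 * MB)[0])      # 8
```

Doubling the block size halves the number of blocks the NameNode has to track, which is one reason larger block sizes are often configured.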
![Page 16: Introduction to Hadoop and HDFS](https://reader035.vdocuments.site/reader035/viewer/2022081420/56815ec1550346895dcd48ea/html5/thumbnails/16.jpg)
Job Assignment
• Move the map task to where the data is.
• The JobTracker assigns tasks based on the location of the data.
• Most of a job's task computation is done on the servers that contain the data.
• The JobTracker handles recovery from task failures.
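The idea of moving the task to the data can be sketched as a simple preference order: a node that holds the block, then a node in the same rack, then any free node. This is a simplified illustration, not the JobTracker's actual scheduling code; the function name, the rack fallback order, and the data structures are assumptions made for the example.

```python
def pick_tracker(block_replica_nodes, free_trackers, rack_of):
    """Choose a TaskTracker for a map task, preferring data locality.

    block_replica_nodes: nodes holding a replica of the task's input block
    free_trackers: nodes that currently have a free map slot
    rack_of: dict mapping node name -> rack name
    Returns (node, locality label).
    """
    replicas = set(block_replica_nodes)

    # 1. Node-local: the task runs on a node that already stores the block.
    for node in free_trackers:
        if node in replicas:
            return node, "node-local"

    # 2. Rack-local: the block is read from a neighbor in the same rack.
    replica_racks = {rack_of[n] for n in replicas}
    for node in free_trackers:
        if rack_of[node] in replica_racks:
            return node, "rack-local"

    # 3. Off-rack: any free node; the block crosses racks over the network.
    if free_trackers:
        return free_trackers[0], "off-rack"
    return None, "no free slot"


if __name__ == "__main__":
    racks = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}
    print(pick_tracker(["n1", "n3"], ["n2", "n3"], racks))
```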
[Diagram: the JobTracker assigning tasks to two nodes, each with a TaskTracker (TT), a task, and a DataNode (DN).]
![Page 17: Introduction to Hadoop and HDFS](https://reader035.vdocuments.site/reader035/viewer/2022081420/56815ec1550346895dcd48ea/html5/thumbnails/17.jpg)
HDFS Daemons on Nodes
[Diagram: one NameNode and four DataNodes.]
The Hadoop Distributed File System (HDFS) supports storage of massive amounts of data on commodity hardware.
![Page 18: Introduction to Hadoop and HDFS](https://reader035.vdocuments.site/reader035/viewer/2022081420/56815ec1550346895dcd48ea/html5/thumbnails/18.jpg)
Inside a DataNode
• Each DataNode can hold thousands of blocks of data.
• Blocks are 64 MB each by default and are often set to 128 MB.
[Diagram: a DataNode containing many blocks.]
![Page 19: Introduction to Hadoop and HDFS](https://reader035.vdocuments.site/reader035/viewer/2022081420/56815ec1550346895dcd48ea/html5/thumbnails/19.jpg)
Writing data to HDFS
• Blocks of data are replicated.
• Replication allows computation to be brought close to the data.
• Replication increases the chances of data locality. Tasks are assigned to the local node when possible, and otherwise to the local rack.
• Replication also supports reliability (node failure).
• A job is decomposed into tasks that scan the data.
[Diagram: blocks A, B, and C replicated across four nodes.]
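A toy placement routine can show why replication helps with both locality and reliability. The sketch below spreads each block's replicas over distinct nodes in round-robin order and then checks what survives a node failure. Round-robin is an assumption chosen for simplicity, not HDFS's real placement policy (which is rack-aware), and the function names are invented for this example.

```python
def place_replicas(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (round-robin sketch)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block in enumerate(blocks):
        # Consecutive node indices (mod the node count) are always distinct
        # as long as replication <= len(nodes).
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement


def surviving_replicas(placement, failed_node):
    """Return the replica locations that remain after one node fails."""
    return {
        block: [n for n in holders if n != failed_node]
        for block, holders in placement.items()
    }


if __name__ == "__main__":
    placement = place_replicas(["A", "B", "C"], ["n1", "n2", "n3", "n4"])
    print(placement)
    print(surviving_replicas(placement, "n3"))
```

With three blocks on four nodes, every block still has at least two replicas after any single node fails, and most nodes can run a task on a local copy of several blocks.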
![Page 20: Introduction to Hadoop and HDFS](https://reader035.vdocuments.site/reader035/viewer/2022081420/56815ec1550346895dcd48ea/html5/thumbnails/20.jpg)
Inside a Task Tracker Node
• The administrator assigns slots for running maps and reduces. A given node may have 4 map slots and 8 reduce slots. The particular number is site dependent and varies with workload and machine configuration.
• Slots are designated as either map slots or reduce slots.
• Each node may be individually configured.
• Each slot runs a JVM that executes a mapper or reducer.
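Slot counts translate directly into how many tasks can run at once. The following back-of-the-envelope sketch estimates the number of "waves" needed to run all map tasks of a job. It assumes every node has the same number of map slots and that tasks take roughly equal time; both are simplifications, and the function name is hypothetical.

```python
def map_waves(num_map_tasks, num_nodes, map_slots_per_node=4):
    """Return (total map slots, number of waves to run all map tasks).

    Assumption: every node is configured with the same slot count.
    """
    total_slots = num_nodes * map_slots_per_node
    if total_slots <= 0:
        raise ValueError("cluster has no map slots")
    # Ceiling division: a partially filled last wave still takes a wave.
    waves = -(-num_map_tasks // total_slots)
    return total_slots, waves


if __name__ == "__main__":
    # 1000 map tasks on 10 nodes with 4 map slots each.
    print(map_waves(1000, 10))  # (40, 25)
```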
![Page 21: Introduction to Hadoop and HDFS](https://reader035.vdocuments.site/reader035/viewer/2022081420/56815ec1550346895dcd48ea/html5/thumbnails/21.jpg)
MapReduce Architecture
[Diagram: MapReduce on top of HDFS, spanning a map node and a reduce node. Stages in order: input, map code, partitioner, sort, reduce code, output.]
![Page 22: Introduction to Hadoop and HDFS](https://reader035.vdocuments.site/reader035/viewer/2022081420/56815ec1550346895dcd48ea/html5/thumbnails/22.jpg)
MapReduce Overview
• MapReduce works on <Key, Value> pairs
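To show what working on <Key, Value> pairs looks like, here is a small single-process word count that mimics the map, partition, sort, and reduce stages. It is a teaching sketch in plain Python, not the Hadoop API: the function names are invented, the line index stands in for the byte-offset key a real input format would supply, and `zlib.crc32` stands in for the key's hash in the partitioner.

```python
import zlib
from collections import defaultdict


def map_fn(_key, line):
    """Mapper: emit a <word, 1> pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def partition(key, num_reducers):
    """Pick the reducer for a key (crc32 is a stand-in for the key's hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reduce_fn(key, values):
    """Reducer: sum all the counts collected for one word."""
    yield key, sum(values)


def run_job(lines, num_reducers=2):
    # Map phase: input pairs are <line index, line text>; each output pair
    # is routed to one reducer's bucket by the partitioner.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for index, line in enumerate(lines):
        for key, value in map_fn(index, line):
            buckets[partition(key, num_reducers)][key].append(value)

    # Sort + reduce phase: each reducer sees its keys in sorted order,
    # with all values for a key grouped together.
    output = {}
    for bucket in buckets:
        for key in sorted(bucket):
            for out_key, out_value in reduce_fn(key, bucket[key]):
                output[out_key] = out_value
    return output


if __name__ == "__main__":
    print(run_job(["the cat", "the dog the"]))
```

Because the partitioner sends every occurrence of a key to the same reducer, each word's total is computed in exactly one place regardless of how many reducers are used.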
![Page 23: Introduction to Hadoop and HDFS](https://reader035.vdocuments.site/reader035/viewer/2022081420/56815ec1550346895dcd48ea/html5/thumbnails/23.jpg)
Thank You