Hadoop Distributed File System (HDFS)

Upload: vaibhav-jain

Post on 15-Feb-2017


Page 1: Hadoop Distributed File System

Hadoop Distributed File System (HDFS)

Page 2: Hadoop Distributed File System

Big Data Concepts

• Volume
  – No more GBs of data
  – TB, PB, EB, ZB
• Velocity
  – High-frequency data, as in stock markets
• Variety
  – Structured and unstructured data

Page 3: Hadoop Distributed File System

Challenges In Big Data

• Complexity
  – No proper understanding of the underlying data
• Storage
  – How to accommodate a large amount of data on a single physical machine
• Performance
  – How to process a large amount of data efficiently and effectively so as to increase performance

Page 4: Hadoop Distributed File System

Challenges in Traditional Application

• Network
  – Limited bandwidth
• Data
  – Growth of data can't be controlled
• Efficiency & Performance
  – How fast data can be read
• Processing capacity of machine
  – Processor and RAM are a bottleneck

Page 5: Hadoop Distributed File System

Statistics

Application size (MB)   Data size          Total round-trip time (sec)
10                      10 MB              1 + 1 = 2
10                      100 MB             10 + 10 = 20
10                      1000 MB = 1 GB     100 + 100 = 200 (~3.3 min)
10                      1000 GB = 1 TB     100000 + 100000 = 200000 (~55 hours)

• Calculation is done under ideal conditions
• No processing time is taken into consideration
• Assuming network bandwidth is 10 MB/sec

• How is data read?
  – Line-by-line reading
  – Depends on seek rate and disk latency
• Average data transfer rate = 75 MB/sec
  – Total time to read 100 GB = 22 min
  – Total time to read 1 TB = about 3.7 hours

How much time does it take to sort 1 TB of data? Enough time to watch a movie while the data is being read.
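The figures on this slide follow from simple division. A minimal sketch in Python that reproduces them, assuming decimal units (1 GB = 1000 MB, as the table does), a 10 MB/sec link and a 75 MB/sec disk. The function names are illustrative and not part of Hadoop. Under these assumptions the 1 TB read works out to roughly 3.7 hours.

```python
def round_trip_seconds(data_mb: float, bandwidth_mb_per_s: float = 10.0) -> float:
    """Seconds to move the data to the application and back (no processing time)."""
    one_way = data_mb / bandwidth_mb_per_s
    return 2 * one_way


def read_seconds(data_mb: float, transfer_rate_mb_per_s: float = 75.0) -> float:
    """Seconds to read the data sequentially at the average transfer rate."""
    return data_mb / transfer_rate_mb_per_s


# Data sizes from the table, in MB (decimal units, as on the slide).
for label, size_mb in [("10 MB", 10), ("100 MB", 100), ("1 GB", 1_000), ("1 TB", 1_000_000)]:
    seconds = round_trip_seconds(size_mb)
    print(f"{label:>6}: round trip {seconds:,.0f} s = {seconds / 3600:.2f} h")

print(f"Read 100 GB: {read_seconds(100_000) / 60:.0f} min")
print(f"Read 1 TB:   {read_seconds(1_000_000) / 3600:.1f} h")
```

Changing the two default rates shows how quickly the totals shrink when the data does not have to cross the network at all.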

Page 6: Hadoop Distributed File System

Statistics(Contd.)

Observation

• A large amount of data takes a lot of time to read
• Data is moved back and forth over the low-bandwidth network to where the application is running
  – 90% of the time is consumed in data transfer
• Application size is constant

Conclusion

• Achieving data localization
  – Move application close to data
  or
  – Move data close to application

Page 7: Hadoop Distributed File System

Summary

• Storage is a problem
  – Cannot store a large amount of data
  – Upgrading the hard disk will also not solve the problem (hardware limitation)
• Performance degradation
  – Upgrading RAM will not solve the problem (hardware limitation)
• Reading
  – Larger data requires more time to read

Page 8: Hadoop Distributed File System

Solution Approach

• Distributed framework
  – Storing the data across several machines
  – Performing computation in parallel across several machines
• Should support
  – Partial failures
  – Recoverability
  – Data availability
  – Consistency
  – Data reliability
  – Upgrading

Page 9: Hadoop Distributed File System

Introducing Hadoop

Distributed framework that provides scaling in:
• Storage
• Performance
• IO bandwidth

Page 10: Hadoop Distributed File System

What makes Hadoop special?

• No high-end or expensive systems are required
• Can run on Linux, Mac OS X, Windows, Solaris
• Fault-tolerant system
  – Execution of the job continues even if nodes are failing
• Highly reliable and efficient storage system
• Built-in intelligence to speed up the application
  – Speculative execution
• Fit for a lot of applications:
  – Web log processing
  – Page indexing, page ranking
  – Complex event processing

Page 11: Hadoop Distributed File System

Features of Hadoop

• Partitions, replicates and distributes the data
  – Data availability, consistency
• Performs computation closer to the data
  – Data localization
• Performs computation across several hosts
  – MapReduce framework
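Data localization can be pictured as a scheduling preference: run the task on a machine that already holds the block it needs. The sketch below only illustrates that idea and is not Hadoop's scheduler; `pick_node`, the block IDs and the node names are hypothetical.

```python
from typing import Dict, List, Optional, Set


def pick_node(
    block_id: str,
    block_locations: Dict[str, Set[str]],
    free_nodes: List[str],
) -> Optional[str]:
    """Prefer a free node that already stores the block; otherwise use any free node."""
    holders = block_locations.get(block_id, set())
    for node in free_nodes:
        if node in holders:
            return node  # computation moves to the data
    # No free node holds the block, so the data would have to move instead.
    return free_nodes[0] if free_nodes else None


# Hypothetical cluster state: which nodes hold a replica of each block.
locations = {"blk_1": {"node-a", "node-c"}, "blk_2": {"node-b"}}
print(pick_node("blk_1", locations, ["node-b", "node-c"]))
print(pick_node("blk_2", locations, ["node-a", "node-c"]))
```

The fallback branch is the expensive case from the earlier slides, where the block has to travel over the network to reach the application.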

Page 12: Hadoop Distributed File System

Hadoop Components

• Hadoop is bundled with two independent components
  – HDFS (Hadoop Distributed File System)
    • Designed for scaling in terms of storage and IO bandwidth
  – MR framework (MapReduce)
    • Designed for scaling in terms of performance

Page 13: Hadoop Distributed File System

Understanding file structure

• Example: a 1 GB file
• The file is split into blocks
• Each block is typically 64 MB
• Each block is stored as two files: one holding the data, the second holding the metadata and checksum
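The block arithmetic can be sketched in a few lines of Python. This is an illustration only: it assumes 1 GB = 1024 MB for this example, and it uses a CRC32 over the whole block as a stand-in for the checksum, since the slides do not describe the real on-disk format. The function names are hypothetical.

```python
import zlib
from typing import List

BLOCK_SIZE_MB = 64  # typical block size quoted on the slide


def split_into_blocks(file_size_mb: int, block_size_mb: int = BLOCK_SIZE_MB) -> List[int]:
    """Return the size in MB of each block that a file of the given size occupies."""
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("sizes must be non-negative and block size positive")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block only takes the space it needs
    return sizes


def block_checksum(data: bytes) -> int:
    """Illustrative checksum for a block's data (CRC32 is an assumption here)."""
    return zlib.crc32(data)


print(len(split_into_blocks(1024)))   # number of blocks for a 1 GB file
print(split_into_blocks(100))         # a 100 MB file: one full block plus a partial one
print(block_checksum(b"example block contents"))
```

The partial last block reflects the point that a block which is not full does not consume a full block's worth of disk space.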

Page 14: Hadoop Distributed File System

Hadoop Processes

• Processes running on Hadoop
  – NameNode
  – DataNode
  – Secondary NameNode
  – Task Tracker
  – Job Tracker

Page 15: Hadoop Distributed File System

NameNode

• Single point of contact
• HDFS master
• Holds meta information
  – List of files and directories
  – Location of blocks
• Single node per cluster
  – Cluster can have thousands of DataNodes and tens of thousands of HDFS clients

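The two kinds of meta information listed above can be modelled as two lookup tables. The following is a toy sketch, not the NameNode's actual data structures; the class, method names, file path and block IDs are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class NameNodeMetadata:
    """Toy model: file path -> ordered block IDs, block ID -> DataNode storage IDs."""

    files: Dict[str, List[str]] = field(default_factory=dict)
    block_locations: Dict[str, Set[str]] = field(default_factory=dict)

    def add_file(self, path: str, block_ids: List[str]) -> None:
        """Record a file and the ordered list of blocks it is made of."""
        self.files[path] = list(block_ids)
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set())

    def report_block(self, block_id: str, storage_id: str) -> None:
        """Record that the DataNode with this storage ID holds a replica of the block."""
        self.block_locations.setdefault(block_id, set()).add(storage_id)

    def locate(self, path: str) -> List[Set[str]]:
        """For each block of the file, in order, return the storage IDs holding it."""
        return [self.block_locations[block_id] for block_id in self.files[path]]


meta = NameNodeMetadata()
meta.add_file("/logs/web.log", ["blk_1", "blk_2"])
meta.report_block("blk_1", "XYZ001")
meta.report_block("blk_1", "XYZ002")
meta.report_block("blk_2", "XYZ003")
print(meta.locate("/logs/web.log"))
```

A client asking where a file lives only needs the answer from `locate`; the block contents themselves never pass through this structure.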

Page 16: Hadoop Distributed File System

DataNode

• Can execute multiple tasks concurrently
• Holds actual data blocks, checksum and generation stamp
• If a block is half full, it needs only half of the space of a full block
• At start-up, connects to the NameNode and performs a handshake
• No binding to IP address or port; uses Storage ID
• Sends heartbeat to NameNode

DataNode (Storage ID: XYZ001)

Page 17: Hadoop Distributed File System

Communication

Heartbeat (DataNode to NameNode), every 3 seconds: "I AM ALIVE"
• Total storage capacity
• Fraction of storage in use
• Number of data transfers currently in progress

Reply (NameNode to DataNode) instructs the DataNode to:
• Replicate block to other node
• Remove local block replica
• Send immediate block report
• Shut down the node

No heartbeat for 10 minutes: the NameNode treats the DataNode as out of service.

DataNodes in the diagram: Storage ID XYZ001, XYZ002, XYZ003
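The timing on this slide (a heartbeat every 3 seconds, a 10-minute silence threshold) can be sketched as a small bookkeeping class. This is a simplified illustration, not the NameNode's implementation: the class and field names are hypothetical, and time is passed in explicitly so the example stays deterministic.

```python
from dataclasses import dataclass
from typing import Dict, List

HEARTBEAT_INTERVAL_S = 3   # "Every 3 seconds" on the slide
DEAD_AFTER_S = 10 * 60     # "No heartbeat for 10 minutes" on the slide


@dataclass
class Heartbeat:
    """What a DataNode reports with each heartbeat, per the slide."""
    storage_id: str
    capacity_bytes: int
    fraction_used: float
    transfers_in_progress: int


class HeartbeatMonitor:
    """Remembers when each storage ID was last heard from."""

    def __init__(self) -> None:
        self.last_seen: Dict[str, float] = {}

    def receive(self, beat: Heartbeat, now: float) -> None:
        self.last_seen[beat.storage_id] = now

    def silent_nodes(self, now: float) -> List[str]:
        """Storage IDs that have sent nothing for at least DEAD_AFTER_S seconds."""
        return sorted(
            sid for sid, seen in self.last_seen.items() if now - seen >= DEAD_AFTER_S
        )


monitor = HeartbeatMonitor()
# XYZ002 reports once at t=0 and then goes quiet.
monitor.receive(Heartbeat("XYZ002", 10**12, 0.7, 1), now=0)
# XYZ001 keeps reporting every 3 seconds for 10 minutes.
for t in range(0, DEAD_AFTER_S + 1, HEARTBEAT_INTERVAL_S):
    monitor.receive(Heartbeat("XYZ001", 10**12, 0.4, 0), now=t)
print(monitor.silent_nodes(now=DEAD_AFTER_S))
```

Keying the table by storage ID rather than by address matches the earlier point that a DataNode is not bound to an IP address or port.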

Page 18: Hadoop Distributed File System

Overview of HDFS

Page 19: Hadoop Distributed File System

HDFS Client