hadoop distributed file system

Hadoop Distributed File System (HDFS)

Big Data Concepts

• Volume
– No longer GBs of data, but TB, PB, EB, ZB
• Velocity
– High-frequency data, as in stocks
• Variety
– Structured and unstructured data

Challenges In Big Data

• Complex
– No proper understanding of the underlying data
• Storage
– How to accommodate a large amount of data on a single physical machine
• Performance
– How to process a large amount of data efficiently and effectively

Challenges in Traditional Application

• Network
– Limited bandwidth
• Data
– Growth of data can't be controlled
• Efficiency & Performance
– How fast data can be read
• Processing capacity of machine
– Processor and RAM are bottlenecks

Statistics

Application Size (MB) | Data Size       | Total Round-Trip Time (sec)
10                    | 10 MB           | 1 + 1 = 2
10                    | 100 MB          | 10 + 10 = 20
10                    | 1000 MB = 1 GB  | 100 + 100 = 200 (~3.3 min)
10                    | 1000 GB = 1 TB  | 100000 + 100000 = 200000 (~55 hours)

• Calculation is done under ideal conditions
• No processing time is taken into consideration
• Assuming network bandwidth is 10 MB/s
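The round-trip figures in the table can be reproduced with a few lines of Python. This is only a sketch under the table's own assumptions (10 MB/s link, ideal conditions, no processing time, decimal units); the function name `round_trip_seconds` is made up for illustration.

```python
# Assumption taken from the table: a 10 MB/s network link and no processing time.
BANDWIDTH_MB_PER_SEC = 10

def round_trip_seconds(data_mb, bandwidth=BANDWIDTH_MB_PER_SEC):
    """Time to move the data to the application and back again."""
    one_way = data_mb / bandwidth
    return one_way + one_way

# Data sizes from the table, expressed in MB (1 GB = 1000 MB, 1 TB = 1,000,000 MB).
for label, size_mb in [("10 MB", 10), ("100 MB", 100), ("1 GB", 1000), ("1 TB", 1_000_000)]:
    seconds = round_trip_seconds(size_mb)
    print(f"{label:>6}: {seconds:>8.0f} s  ({seconds / 3600:.2f} h)")
```

The time grows linearly with the data size while the application itself stays at 10 MB, which is the imbalance the slides go on to address.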

Statistics (Contd.)

• How is data read?
– Line-by-line reading
– Depends on seek rate and disk latency
• Average data transfer rate = 75 MB/sec
– Total time to read 100 GB ≈ 22 min
– Total time to read 1 TB ≈ 3.7 hours
• How much time would it take to sort 1 TB of data?
– Enough time to watch a movie while the data is being read
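As a rough check on the read times, the sketch below divides the data size by the quoted average transfer rate of 75 MB/sec. It assumes decimal units (1 GB = 1000 MB) and a purely sequential read with no seek overhead; `read_minutes` is an illustrative name.

```python
TRANSFER_RATE_MB_PER_SEC = 75  # average disk transfer rate quoted on the slide

def read_minutes(size_mb, rate=TRANSFER_RATE_MB_PER_SEC):
    """Minutes needed to read size_mb sequentially at the given rate."""
    return size_mb / rate / 60

print(f"100 GB: {read_minutes(100_000):.0f} min")
print(f"  1 TB: {read_minutes(1_000_000) / 60:.1f} h")
```

Real disks would do worse than this, since seek time and latency are ignored here.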

Observation

• A large amount of data takes a long time to read
• Data is moved back and forth over a low-bandwidth network to the machine where the application is running
– 90% of the time is consumed in data transfer
• Application size is constant

Conclusion

• Achieving data localization
– Move the application close to the data
or
– Move the data close to the application

Summary

• Storage is a problem
– Cannot store a large amount of data
– Upgrading the hard disk will not solve the problem (hardware limitation)
• Performance degradation
– Upgrading RAM will not solve the problem (hardware limitation)
• Reading
– Larger data requires more time to read

Solution Approach

• Distributed framework
– Storing the data across several machines
– Performing computation in parallel across several machines
• Should support
– Partial failures
– Recoverability
– Data availability
– Consistency
– Data reliability
– Upgrading

Introducing Hadoop

Distributed framework that provides scaling in:
• Storage
• Performance
• IO bandwidth

What makes Hadoop special?

• No high-end or expensive systems are required
• Can run on Linux, Mac OS X, Windows, Solaris
• Fault-tolerant system
– Execution of the job continues even if nodes are failing
• Highly reliable and efficient storage system
• Built-in intelligence to speed up the application
– Speculative execution
• Fit for a lot of applications:
– Web log processing
– Page indexing, page ranking
– Complex event processing

Features of Hadoop

• Partitions, replicates, and distributes the data
– Data availability, consistency
• Performs computation closer to the data
– Data localization
• Performs computation across several hosts
– MapReduce framework
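The idea of computing over partitioned data and then combining the results can be illustrated with a single-process toy. This is not the Hadoop MapReduce API; it is a plain-Python sketch of the map, shuffle, and reduce steps, and all function names are hypothetical.

```python
from collections import defaultdict

def map_phase(partition):
    """Runs against one partition of the data: emit a (word, 1) pair per word."""
    return [(word.lower(), 1) for line in partition for word in line.split()]

def shuffle(pairs):
    """Group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Combine the values for each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}

def word_count(partitions):
    pairs = []
    # In a cluster each partition would be mapped on the host that stores it;
    # here the partitions are simply processed one after another.
    for partition in partitions:
        pairs.extend(map_phase(partition))
    return reduce_phase(shuffle(pairs))

partitions = [["big data big"], ["data node"]]
print(word_count(partitions))
```

Only the small (word, count) pairs would need to cross the network, while the bulk of the data stays where it is stored.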

Hadoop Components

• Hadoop is bundled with two independent components
– HDFS (Hadoop Distributed File System)
  • Designed for scaling in terms of storage and IO bandwidth
– MR framework (MapReduce)
  • Designed for scaling in terms of performance

Understanding file structure

• Example: a 1 GB file
• The file is split into blocks
• Each block is typically 64 MB
• Each block is stored as two files: one holding the data and a second holding metadata and checksum
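The block arithmetic can be sketched as follows. The example assumes 1 GB = 1024 MB and the typical 64 MB block size mentioned above; `split_into_blocks` is an illustrative helper, not an HDFS function.

```python
BLOCK_SIZE_MB = 64  # typical block size quoted on the slide

def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size in MB of each block a file of this size would occupy."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    blocks = [block_size_mb] * full_blocks
    if remainder:
        # The last block holds only the leftover data, not a full 64 MB.
        blocks.append(remainder)
    return blocks

print(len(split_into_blocks(1024)))  # number of blocks for a 1 GB file
print(split_into_blocks(100))        # a file that does not fill its last block
```

The second call shows a partially filled final block, which takes up only the space its data needs.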

Hadoop Processes

• Processes running on Hadoop
– NameNode
– DataNode
– Secondary NameNode
– Task Tracker
– Job Tracker

NameNode

• Single point of contact
• HDFS master
• Holds meta information
– List of files and directories
– Location of blocks
• Single node per cluster
– A cluster can have thousands of DataNodes and tens of thousands of HDFS clients
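One way to picture the meta information the NameNode holds is as two lookup tables: files to blocks, and blocks to the DataNodes storing them. The sketch below is a hypothetical in-memory model for illustration only; the paths, block IDs, and layout are invented and do not reflect the actual NameNode data structures.

```python
# Hypothetical metadata, for illustration only.
namespace = {
    # file path -> ordered list of block IDs
    "/logs/web.log": ["blk_1", "blk_2"],
}
block_locations = {
    # block ID -> identifiers of the DataNodes holding a replica
    "blk_1": {"XYZ001", "XYZ002"},
    "blk_2": {"XYZ002", "XYZ003"},
}

def locate(path):
    """Answer a client's question: which DataNodes hold each block of this file?"""
    return [(block, sorted(block_locations[block])) for block in namespace[path]]

print(locate("/logs/web.log"))
```

The NameNode in this picture answers only the lookup; the block contents themselves live on the DataNodes.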


DataNode

• Can execute multiple tasks concurrently
• Holds the actual data blocks, checksum, and generation stamp
• If a block is half full, it needs only half the space of a full block
• At start-up, connects to the NameNode and performs a handshake
• No binding to IP address or port; uses a Storage ID
• Sends heartbeats to the NameNode

Heartbeat

• Each DataNode sends a heartbeat to the NameNode every 3 seconds: "I AM ALIVE"
• Communication carried by the heartbeat
– Total storage capacity
– Fraction of storage in use
– Number of data transfers currently in progress
• The NameNode's reply instructs the DataNode to
– Replicate a block to another node
– Remove a local block replica
– Send an immediate block report
– Shut down the node
• No heartbeat for 10 minutes: the NameNode considers the DataNode out of service

[Figure: NameNode exchanging heartbeats and replies with DataNodes, Storage IDs XYZ001, XYZ002, XYZ003]
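The 10-minute rule can be sketched as simple bookkeeping on the NameNode side. This is a hypothetical model, not Hadoop code: the class `HeartbeatMonitor` and its methods are invented for illustration, and time is passed in as plain seconds so the example stays deterministic.

```python
DEAD_AFTER_SEC = 10 * 60  # no heartbeat for 10 minutes

class HeartbeatMonitor:
    """Hypothetical NameNode-side record of the last heartbeat per Storage ID."""

    def __init__(self):
        self.last_seen = {}

    def receive(self, storage_id, now):
        """Record that a heartbeat arrived from this DataNode at time `now` (seconds)."""
        self.last_seen[storage_id] = now

    def silent_nodes(self, now):
        """Storage IDs that have not sent a heartbeat within the timeout."""
        return sorted(sid for sid, seen in self.last_seen.items()
                      if now - seen >= DEAD_AFTER_SEC)

monitor = HeartbeatMonitor()
monitor.receive("XYZ001", now=0)
monitor.receive("XYZ002", now=0)
monitor.receive("XYZ001", now=597)   # XYZ001 keeps reporting; XYZ002 has gone quiet
print(monitor.silent_nodes(now=600))
```

A fuller model would also re-replicate the blocks that lived on the silent node, which is where the "replicate block to other node" instruction comes in.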

Overview of HDFS

[Figure: HDFS overview, showing the HDFS Client]
