
HDFS (Hadoop Distributed File System)

2011-10-10
Taejoong Chung, MMLAB

Contents
• Introduction
  – Hadoop Distributed File System?
  – Assumption & Goals
• Mechanism
  – Structure
  – Data Management
  – Maintenance
• Pros and Cons

HDFS
• Hadoop Distributed File System
  – Started from ‘Nutch’ (open-source search engine project) in 2005
  – Java based, Apache top-level project
  – Stores massive data at low cost
• Characteristics
  – User-level distributed file system
  – Fault-tolerant
  – Can be deployed on low-cost hardware

Assumption & Goals
1) Protection against Failure
• Detection of faults and quick, automatic recovery
• Considers both hardware & software failure
2) Streaming Data Access
• Batch processing rather than interactive use
• High throughput of data access rather than low latency of data access

Assumption & Goals - contd
3) Large Data Set
• A typical file in HDFS is gigabytes to terabytes in size
• High aggregate data bandwidth, scaling to hundreds of nodes
4) Simple Coherency Model
• Write-once-read-many access
• A file, once created, is not allowed to be modified

Assumption & Goals - contd
5) Migrating Computation to the Data
• Provides an interface for applications to move themselves closer to where the data is located
6) Portability
• Easily portable from one platform to another
• Java based
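The idea behind goal 5 can be sketched in Python as follows. This is a minimal illustration, not an HDFS or Hadoop API: `pick_worker`, `block_locations`, and `free_workers` are hypothetical names, and the fallback rule (take the alphabetically first free worker) is an arbitrary assumption.

```python
def pick_worker(block_locations, block_id, free_workers):
    """Prefer a free worker that already stores block_id; else any free one.

    block_locations: block ID -> list of node names holding a replica.
    free_workers:    set of node names that can run a task right now.
    Returns (node, is_data_local).
    """
    for node in block_locations.get(block_id, []):
        if node in free_workers:
            return node, True              # computation moves to the data
    if not free_workers:
        raise RuntimeError("no free worker available")
    # Assumed fallback: the block must be read over the network.
    return sorted(free_workers)[0], False


if __name__ == "__main__":
    locations = {"b1": ["dn2", "dn3"]}
    print(pick_worker(locations, "b1", {"dn1", "dn3"}))
    print(pick_worker(locations, "b1", {"dn1"}))
```

Running the task on a node that already holds the block avoids shipping gigabytes of data across the network, which is the point of this goal.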

Structure
• Master / Slave architecture
• NameNode (Master)
  – Manages the file system namespace
  – Regulates access to files by clients
  – Does not contain any data files
  – Unique (one per cluster)
• DataNode (Slave)
  – Actual repository
  – Multiple nodes are required

[Figure: Conceptual Diagram]
• Namespace (headquarters): directory service
• A DataNode contains multiple blocks of data
• Block: a piece of data
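The split between metadata and data can be modeled roughly as below. This is a toy in-memory sketch under stated assumptions: the class and method names (`register`, `lookup`, `read_path`) are hypothetical and do not mirror the real Java classes, and the reader always picks the first listed replica.

```python
class DataNode:
    """Actual repository: stores block contents."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}                  # block ID -> bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data


class NameNode:
    """Directory service: holds metadata only, never file contents."""

    def __init__(self):
        self.namespace = {}               # path -> ordered list of block IDs
        self.locations = {}               # block ID -> list of DataNode names

    def register(self, path, block_id, datanode):
        blocks = self.namespace.setdefault(path, [])
        if block_id not in blocks:
            blocks.append(block_id)
        self.locations.setdefault(block_id, []).append(datanode.name)

    def lookup(self, path):
        return [(b, list(self.locations[b])) for b in self.namespace[path]]


def read_path(namenode, datanodes, path):
    """Ask the NameNode where the blocks live, then read from DataNodes."""
    parts = []
    for block_id, holders in namenode.lookup(path):
        parts.append(datanodes[holders[0]].blocks[block_id])
    return b"".join(parts)
```

Note that the client only consults the NameNode for locations; the bytes themselves come from the DataNodes.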

Operation
• A file is stored as multiple blocks, and each block is replicated over the DataNodes
  – A file is cut into multiple blocks whose size is 64MB (default)
  – Each block is replicated over the DataNodes (# of replicas: 3, default)
• Scheme
  – Placement aims to maximize the ‘tolerance’
  – Local Tolerance
    • Inside of rack
  – Global Tolerance
    • Outside of rack
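As a rough illustration of the defaults on this slide, the Python sketch below cuts a file of a given size into 64 MB blocks and counts the raw storage used with 3 replicas. The function names `split_into_blocks` and `raw_storage` are hypothetical helpers, not part of any HDFS API; only the two default values come from the slide.

```python
BLOCK_SIZE = 64 * 1024 * 1024   # default block size from the slide (64 MB)
REPLICATION = 3                 # default number of replicas from the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes (in bytes) of the blocks a file is cut into."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    full, rest = divmod(file_size, block_size)
    sizes = [block_size] * full
    if rest:
        sizes.append(rest)      # the last block may be shorter
    return sizes


def raw_storage(file_size, replication=REPLICATION):
    """Bytes stored across the cluster when every block is replicated."""
    return file_size * replication


if __name__ == "__main__":
    mb = 1024 * 1024
    blocks = split_into_blocks(200 * mb)
    print(len(blocks), "blocks, last one", blocks[-1] // mb, "MB")
    print("raw storage:", raw_storage(200 * mb) // mb, "MB")
```

For a 200 MB file this gives three full 64 MB blocks plus one 8 MB block, and 600 MB of raw storage across the cluster.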

Example
[Figure: Rack Awareness. The NameNode issues the command to save files; the DataNodes are grouped into Rack 1, Rack 2, and Rack 3.]
• Local tolerance: in the same rack
• Global tolerance: outside of the rack
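One way to picture local and global tolerance is the simplified placement rule sketched below: two replicas in one rack and the third in a different rack. This is an assumption made for illustration and should not be read as the exact policy HDFS implements; `place_replicas` and the rack/node names are hypothetical.

```python
import random


def place_replicas(racks, local_rack, rng=random):
    """Pick 3 DataNodes: two in local_rack, one in a different rack.

    racks maps a rack name to the list of DataNode names in that rack.
    """
    local_nodes = racks[local_rack]
    if len(local_nodes) < 2:
        raise ValueError("local rack needs at least two DataNodes")
    remote_racks = [r for r in racks if r != local_rack and racks[r]]
    if not remote_racks:
        raise ValueError("need at least one other non-empty rack")

    in_rack = rng.sample(local_nodes, 2)         # local tolerance
    remote_rack = rng.choice(remote_racks)
    off_rack = rng.choice(racks[remote_rack])    # global tolerance
    return in_rack + [off_rack]


if __name__ == "__main__":
    cluster = {
        "rack1": ["dn1", "dn2", "dn3"],
        "rack2": ["dn4", "dn5"],
        "rack3": ["dn6"],
    }
    print(place_replicas(cluster, "rack1"))
```

With this layout a single failed node leaves a copy in the same rack, and the loss of a whole rack still leaves at least one copy elsewhere.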

Data Maintenance
• Each DataNode periodically sends ‘Heartbeat’ and ‘Blockreport’ messages to the NameNode
  – Blockreport
    • A list of all blocks on a DataNode
  – Heartbeat
    • A kind of ‘ping’ (I’m alive!)
    • Receipt of a Heartbeat implies that the DataNode is functioning properly
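Liveness tracking from Heartbeats can be sketched as follows. `HeartbeatMonitor` is a hypothetical helper, and the timeout is an arbitrary parameter chosen by the caller, not a value taken from HDFS.

```python
import time


class HeartbeatMonitor:
    """Tracks the time of the last Heartbeat received from each DataNode."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}     # DataNode name -> time of last Heartbeat

    def heartbeat(self, node, now=None):
        """Record a Heartbeat; `now` can be injected for testing."""
        self.last_seen[node] = time.monotonic() if now is None else now

    def dead_nodes(self, now=None):
        """Nodes whose last Heartbeat is older than the timeout."""
        now = time.monotonic() if now is None else now
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )
```

A node that stops sending Heartbeats is simply one whose timestamp stops advancing, so no explicit failure message is needed.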

Data Management
• NameNode manages all metadata
  – EditLog
    • Every transaction (change to file system metadata) is recorded by the NameNode
  – FsImage (File System Image)
    • Records the file system namespace, including the mapping of blocks to files
    • Key metadata is kept in memory
    • Information reported by the DataNodes (Heartbeats, Blockreports) is tracked by the NameNode in memory
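The relationship between the two structures can be pictured with a toy model: the image is a snapshot of the namespace, and the log is a list of transactions replayed on top of it. The tuple format of the log entries and the names `apply_edit` and `checkpoint` are assumptions for illustration; the real on-disk formats are different.

```python
def apply_edit(image, edit):
    """Apply one logged transaction to an in-memory namespace image."""
    op, path = edit[0], edit[1]
    if op == "create":
        image[path] = list(edit[2])       # path -> list of block IDs
    elif op == "delete":
        image.pop(path, None)
    else:
        raise ValueError("unknown operation: " + op)


def checkpoint(fs_image, edit_log):
    """Return a new image with every logged transaction replayed on top."""
    merged = {path: list(blocks) for path, blocks in fs_image.items()}
    for edit in edit_log:
        apply_edit(merged, edit)
    return merged


if __name__ == "__main__":
    image = {"/a": ["b1"]}
    log = [("create", "/b", ["b2", "b3"]), ("delete", "/a")]
    print(checkpoint(image, log))
```

Keeping the log separate means each change is a cheap append, while the full image only has to be rewritten when the log is merged.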

Data Integrity (1)
• Safemode
  – On startup, the NameNode receives Heartbeat and Blockreport messages from the DataNodes
  – Each block has a specified minimum number of replicas
    • Blocks under this threshold are re-replicated after Safemode ends
  – Replication of data blocks does not occur during this period
  – This happens regularly
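The bookkeeping behind Safemode can be sketched as below. The replica counts would come from Blockreports; the exit rule used here (a caller-supplied fraction of blocks meeting the minimum) is an assumption for illustration, and `under_replicated` and `can_leave_safemode` are hypothetical names.

```python
def under_replicated(block_replicas, min_replicas):
    """Return block IDs whose reported replica count is below the minimum."""
    return sorted(
        block for block, count in block_replicas.items()
        if count < min_replicas
    )


def can_leave_safemode(block_replicas, min_replicas, threshold):
    """True when the fraction of sufficiently replicated blocks >= threshold."""
    if not block_replicas:
        return True
    safe = sum(1 for count in block_replicas.values() if count >= min_replicas)
    return safe / len(block_replicas) >= threshold


if __name__ == "__main__":
    reported = {"b1": 3, "b2": 1, "b3": 0, "b4": 2}
    print(under_replicated(reported, 2))
    print(can_leave_safemode(reported, 2, 0.5))
```

The list returned by `under_replicated` is what would be handed to re-replication once the waiting period is over.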

Data Integrity (2)
• Data fetched from a DataNode could be corrupted
  – Checksum algorithms are implemented
• Operation
  ① When a client creates an HDFS file, it also computes and stores a checksum
  ② When a client receives a file, it also downloads the checksum
  ③ By comparing the downloaded checksum with one calculated from the received file, the client can verify the content
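The three steps can be sketched in Python with CRC32 from the standard `zlib` module. CRC32 is used here only as a stand-in, since the slide does not say which checksum algorithm is applied; the dictionary `store` and the functions `write_file` and `read_file` are hypothetical.

```python
import zlib


def checksum(data):
    """CRC32 of a bytes object (stand-in checksum for this sketch)."""
    return zlib.crc32(data) & 0xFFFFFFFF


def write_file(store, name, data):
    """Step 1: store the data together with a checksum computed at creation."""
    store[name] = (data, checksum(data))


def read_file(store, name):
    """Steps 2 and 3: fetch data and checksum, recompute, and compare."""
    data, stored = store[name]
    if checksum(data) != stored:
        raise IOError("checksum mismatch for " + name)
    return data


if __name__ == "__main__":
    store = {}
    write_file(store, "report.txt", b"hello")
    print(read_file(store, "report.txt"))
```

If the check fails, a client in a replicated system can fall back to another replica of the same data instead of returning corrupted content.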

Robustness
• Data disk failure, heartbeats and re-replication
  – From Heartbeat messages, the NameNode can check the liveness of each DataNode
• Cluster rebalancing
  – If a DataNode holds much more data than the others, blocks are redistributed
• Data integrity
  – Checksum
• Metadata disk failure
  – Multiple copies of the FsImage and EditLog are kept
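Cluster rebalancing can be illustrated with a greedy sketch that repeatedly moves one block from the fullest DataNode to the emptiest one. This is a simplification chosen for clarity, not the algorithm HDFS uses; `rebalance` and its `tolerance` parameter (the largest allowed difference in block counts) are hypothetical.

```python
def rebalance(usage, tolerance=1):
    """Plan block moves until node loads differ by at most `tolerance`.

    usage maps a DataNode name to its number of stored blocks.
    Returns a list of (source, destination) pairs, one block per move.
    """
    if tolerance < 1:
        raise ValueError("tolerance must be at least 1")
    usage = dict(usage)         # work on a copy, leave the input untouched
    moves = []
    if not usage:
        return moves
    while True:
        fullest = max(usage, key=usage.get)
        emptiest = min(usage, key=usage.get)
        if usage[fullest] - usage[emptiest] <= tolerance:
            return moves
        usage[fullest] -= 1
        usage[emptiest] += 1
        moves.append((fullest, emptiest))


if __name__ == "__main__":
    plan = rebalance({"dn1": 10, "dn2": 2, "dn3": 3})
    print(len(plan), "moves:", plan)
```

A real rebalancer would also have to respect replica placement, so that moving a block never puts two copies of it on the same node.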

Pros and Cons
• Pros
  – Powerful mechanism for fault tolerance
  – Easy to deploy
  – Free
• Cons
  – Single point of failure: NameNode
  – Not an optimized solution
    • Same number of replicas for every block
  – Not that fast

Download & More Information
• Official site
  – http://hadoop.apache.org/
  – Last build: March 2011
• Korean Dev.
  – http://www.hadoop.co.kr/
  – Last uploaded materials: Oct 2011
