HDFS (Hadoop Distributed File System)
2011-10-10
Taejoong Chung, MMLAB
Contents
• Introduction
  – Hadoop Distributed File System?
  – Assumption & Goals
• Mechanism
  – Structure
  – Data Management
  – Maintenance
• Pros and Cons
HDFS
• Hadoop Distributed File System
  – Started from ‘Nutch’ (open-source search engine project) in 2005
  – Java based, Apache top-level project
  – Stores massive data at low cost
• Characteristics
  – User-level distributed file system
  – Fault-tolerant
  – Can be deployed on low-cost hardware
Assumption & Goals
1) Protection against Failure
• Detection of faults and quick, automatic recovery
• Considers both hardware and software failure
2) Streaming Data Access
• Batch processing rather than interactive use
• High throughput of data access rather than low latency of data access
Assumption & Goals - contd
3) Large Data Set
• A typical file in HDFS is gigabytes to terabytes in size
• High aggregate data bandwidth, scaling to hundreds of nodes
4) Simple Coherency Model
• Write-once-read-many access
• Once created, a file cannot be modified
Assumption & Goals - contd
5) Moving Computation to the Data
• Provides an interface for applications to move themselves closer to where the data is located
6) Portability
• Easily portable from one platform to another
• Java based
Structure
• Master / Slave architecture
• NameNode (Master)
  – Manages the file system namespace
  – Regulates access to files by clients
  – Does not store any file data
  – Unique (only one per cluster)
• DataNode (Slave)
  – Actual repository of the data
  – Multiple nodes are required
Conceptual Diagram
• Namespace (headquarters): directory service
• DataNode: contains multiple blocks of data
• Block: a piece of data
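The split between namespace, DataNodes, and blocks can be pictured with a small sketch. This is not HDFS code: the class names, fields, the `lookup` method, and the block IDs are hypothetical, chosen only to show that the master holds metadata while the slaves hold the bytes.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    """Slave: the actual repository, holding block contents by block ID."""
    name: str
    blocks: dict[str, bytes] = field(default_factory=dict)


@dataclass
class NameNode:
    """Master: holds only metadata, never file contents."""
    # file path -> ordered list of block IDs (the namespace)
    namespace: dict[str, list[str]] = field(default_factory=dict)
    # block ID -> names of the DataNodes that hold a replica
    locations: dict[str, list[str]] = field(default_factory=dict)

    def lookup(self, path: str) -> list[tuple[str, list[str]]]:
        """Directory service: report where each block of a file lives."""
        return [(b, self.locations[b]) for b in self.namespace[path]]


# Hypothetical two-node cluster storing one file made of two blocks.
dn1, dn2 = DataNode("dn1"), DataNode("dn2")
dn1.blocks["blk_1"] = b"first piece"
dn2.blocks["blk_1"] = b"first piece"    # replica of the same block
dn2.blocks["blk_2"] = b"second piece"

nn = NameNode()
nn.namespace["/logs/a.txt"] = ["blk_1", "blk_2"]
nn.locations = {"blk_1": ["dn1", "dn2"], "blk_2": ["dn2"]}

plan = nn.lookup("/logs/a.txt")
```

A client would read `plan` and then fetch each block directly from one of the listed DataNodes.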
Operation
• A file is stored as multiple blocks, each replicated across the DataNodes
  – A file is cut into multiple blocks whose size is 64 MB (default)
  – Each block is replicated over the DataNodes (number of replicas: 3, default)
• Scheme
  – Replicas are placed so as to maximize the ‘tolerance’
  – Local Tolerance
    • Inside of rack
  – Global Tolerance
    • Outside of rack
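The arithmetic behind the defaults above can be sketched as follows. The function names are hypothetical; only the 64 MB block size and the replica count of 3 come from the slide.

```python
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, the default block size from the slide
REPLICATION = 3                 # default number of replicas from the slide


def block_count(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks needed for a file of file_size bytes (ceiling division)."""
    return (file_size + block_size - 1) // block_size


def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut data into consecutive blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


one_gb = 1024 ** 3
blocks_for_1gb = block_count(one_gb)            # 1 GB / 64 MB = 16 blocks
stored_copies = blocks_for_1gb * REPLICATION    # 16 * 3 = 48 block replicas

# Tiny block size so the split is visible.
small = split_into_blocks(b"abcdefghij", block_size=4)
```

So a 1 GB file occupies 16 blocks and, with three replicas each, 48 stored block copies across the cluster.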
Example: Rack Awareness
(Diagram: the NameNode issues the command to save files to DataNodes in Rack 1, Rack 2, and Rack 3)
• Local tolerance: in same rack
• Global tolerance: outside of rack
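One way to combine local and global tolerance is sketched below. The policy (two replicas on different nodes of one rack, a third in another rack) is a simplified assumption for illustration, not HDFS's actual placement code, and `place_replicas` and the rack and node names are hypothetical.

```python
def place_replicas(racks: dict[str, list[str]], local_rack: str) -> list[str]:
    """Pick three DataNodes: two in the local rack, one in another rack.

    Simplified, assumed policy for illustration only.
    """
    local_nodes = racks[local_rack]
    if len(local_nodes) < 2:
        raise ValueError("local rack needs at least two DataNodes")
    remote_racks = [r for r in sorted(racks) if r != local_rack and racks[r]]
    if not remote_racks:
        raise ValueError("need at least one other non-empty rack")
    # Local tolerance: two replicas on different nodes of the same rack.
    chosen = local_nodes[:2]
    # Global tolerance: one replica outside that rack.
    chosen.append(racks[remote_racks[0]][0])
    return chosen


cluster = {
    "rack1": ["dn1", "dn2"],
    "rack2": ["dn3", "dn4"],
    "rack3": ["dn5", "dn6"],
}
targets = place_replicas(cluster, local_rack="rack1")
```

With this layout the block survives the loss of a single node (another copy is in the same rack) and the loss of the whole of rack1 (one copy is in rack2).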
Data Maintenance
• Each DataNode periodically sends ‘Heartbeat’ and ‘Blockreport’ messages to the NameNode
  – Blockreport
    • A list of all blocks on a DataNode
  – Heartbeat
    • A kind of ‘ping’ (I’m alive!)
    • Receipt of a Heartbeat implies that the DataNode is functioning properly
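The NameNode's side of this exchange can be sketched as a table of last-seen times. Everything here is hypothetical (the `HeartbeatMonitor` class, the 30-second timeout, the explicit `now` arguments used to keep the example deterministic); it only illustrates how silence turns into a "dead node" decision.

```python
class HeartbeatMonitor:
    """Tracks the last Heartbeat and latest Blockreport per DataNode."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s                      # assumed timeout value
        self.last_seen: dict[str, float] = {}
        self.block_reports: dict[str, set[str]] = {}

    def heartbeat(self, node: str, now: float) -> None:
        """Record an 'I'm alive!' message."""
        self.last_seen[node] = now

    def block_report(self, node: str, block_ids: set[str]) -> None:
        """Record the list of all blocks held by a DataNode."""
        self.block_reports[node] = set(block_ids)

    def dead_nodes(self, now: float) -> list[str]:
        """Nodes whose last Heartbeat is older than the timeout."""
        return sorted(
            n for n, t in self.last_seen.items() if now - t > self.timeout_s
        )


monitor = HeartbeatMonitor(timeout_s=30.0)
monitor.heartbeat("dn1", now=0.0)
monitor.heartbeat("dn2", now=0.0)
monitor.block_report("dn1", {"blk_1", "blk_2"})
monitor.heartbeat("dn1", now=25.0)      # dn2 stays silent
dead = monitor.dead_nodes(now=40.0)
```

At time 40, dn1 was last heard 15 seconds ago and counts as alive, while dn2 has been silent for 40 seconds and is reported as dead.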
Data Management
• NameNode manages all metadata
  – EditLog
    • Every transaction (change to the file system metadata) is recorded by the NameNode
  – FsImage (File System Image)
    • Holds the file system namespace, including the mapping of files to data blocks
    • Key metadata is stored in memory
    • Information from DataNode Heartbeat messages is also kept in memory
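How an image plus a transaction log can restore the namespace is sketched below. This is a toy model under stated assumptions: the `NameNodeMetadata` class, its method names, and the tuple format of the log entries are hypothetical and do not reflect the on-disk formats HDFS uses.

```python
import copy


class NameNodeMetadata:
    """Toy model of FsImage (checkpoint) plus EditLog (transactions since)."""

    def __init__(self):
        self.fs_image: dict[str, list[str]] = {}    # last checkpoint: path -> block IDs
        self.edit_log: list[tuple] = []             # transactions since the checkpoint
        self.in_memory: dict[str, list[str]] = {}   # live namespace

    def create_file(self, path: str, block_ids: list[str]) -> None:
        self.edit_log.append(("create", path, list(block_ids)))
        self.in_memory[path] = list(block_ids)

    def delete_file(self, path: str) -> None:
        self.edit_log.append(("delete", path))
        self.in_memory.pop(path, None)

    def checkpoint(self) -> None:
        """Fold the current state into the image and start a fresh log."""
        self.fs_image = copy.deepcopy(self.in_memory)
        self.edit_log.clear()

    def recover(self) -> dict[str, list[str]]:
        """Rebuild the namespace: load the image, then replay the log."""
        state = copy.deepcopy(self.fs_image)
        for op, path, *args in self.edit_log:
            if op == "create":
                state[path] = list(args[0])
            elif op == "delete":
                state.pop(path, None)
        return state


meta = NameNodeMetadata()
meta.create_file("/a", ["blk_1"])
meta.checkpoint()                            # image now contains /a
meta.create_file("/b", ["blk_2", "blk_3"])   # logged after the checkpoint
meta.delete_file("/a")                       # logged after the checkpoint
recovered = meta.recover()
```

Replaying the two logged transactions on top of the checkpoint gives the same namespace as the live in-memory state.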
Data Integrity (1)
• Safemode
  – On startup, the NameNode receives Heartbeat and Blockreport messages from the DataNodes
  – Each block has a specified minimum number of replicas
    • Blocks under this threshold are re-replicated once Safemode ends
  – No replication of data blocks occurs during this period
  – This happens regularly
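The threshold check can be sketched by counting replicas across the received Blockreports. The function names, the block IDs, and the minimum of 2 used in the example are assumptions for illustration only.

```python
from collections import Counter


def replica_counts(block_reports: dict[str, set[str]]) -> Counter:
    """Count how many DataNodes reported each block."""
    counts: Counter = Counter()
    for blocks in block_reports.values():
        counts.update(blocks)
    return counts


def under_replicated(
    known_blocks: set[str],
    block_reports: dict[str, set[str]],
    min_replicas: int,
) -> list[str]:
    """Blocks with fewer reported replicas than the required minimum."""
    counts = replica_counts(block_reports)
    return sorted(b for b in known_blocks if counts[b] < min_replicas)


# Hypothetical Blockreports received during startup.
reports = {
    "dn1": {"blk_1", "blk_2"},
    "dn2": {"blk_1"},
    "dn3": {"blk_1", "blk_3"},
}
todo = under_replicated({"blk_1", "blk_2", "blk_3"}, reports, min_replicas=2)
```

Here blk_1 has three reported replicas, while blk_2 and blk_3 have one each, so those two are the candidates for re-replication.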
Data Integrity (2)
• Data fetched from a DataNode could be corrupted
  – Checksum algorithms are implemented
• Operation
  ① When a client creates an HDFS file, it also computes and stores a checksum of the content
  ② When a client retrieves a file, it also downloads the checksum
  ③ By comparing the downloaded checksum with a checksum computed from the received file, the client can verify the content
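The three steps above can be sketched with a CRC32 checksum from Python's standard `zlib` module. CRC32 is an illustrative choice here, and `write_file` and `verify` are hypothetical helpers rather than HDFS client calls.

```python
import zlib


def write_file(data: bytes) -> tuple[bytes, int]:
    """Step 1: on creation, compute a checksum to store alongside the data."""
    return data, zlib.crc32(data)


def verify(data: bytes, stored_checksum: int) -> bool:
    """Steps 2 and 3: recompute the checksum of what was received and compare."""
    return zlib.crc32(data) == stored_checksum


payload, checksum = write_file(b"hello hdfs")

ok = verify(payload, checksum)               # content arrived intact
corrupted = verify(b"hellO hdfs", checksum)  # one byte changed in transit
```

If the comparison fails, the client knows the copy it received is corrupt and can ask another DataNode for a replica of the same block.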
Robustness
• Data disk failure, heartbeats and re-replication
  – From Heartbeat messages, the NameNode can check the liveness of each DataNode
• Cluster rebalancing
  – If a DataNode holds much more data than the others, its blocks are redistributed
• Data integrity
  – Checksum
• Metadata disk failure
  – Multiple copies of the FsImage and EditLog are kept
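Cluster rebalancing can be pictured as moving blocks from the fullest node to the emptiest one until the nodes are roughly even. The greedy loop below is a simplified assumption, not the algorithm HDFS uses; `rebalance`, the `tolerance` parameter, and the block counts are hypothetical.

```python
def rebalance(usage: dict[str, int], tolerance: int = 1) -> list[tuple[str, str]]:
    """Return (source, target) moves, one block each, until the spread
    between the fullest and emptiest node is within tolerance."""
    if tolerance < 1:
        raise ValueError("tolerance must be at least 1 block")
    usage = dict(usage)                 # work on a copy
    moves: list[tuple[str, str]] = []
    while True:
        names = sorted(usage)           # deterministic tie-breaking by name
        fullest = max(names, key=usage.get)
        emptiest = min(names, key=usage.get)
        if usage[fullest] - usage[emptiest] <= tolerance:
            return moves
        usage[fullest] -= 1
        usage[emptiest] += 1
        moves.append((fullest, emptiest))


# Hypothetical block counts per DataNode.
moves = rebalance({"dn1": 6, "dn2": 1, "dn3": 2})
```

Starting from 6, 1, and 2 blocks, three moves out of dn1 leave every node with 3 blocks.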
Pros and Cons
• Pros
  – Powerful mechanism for fault tolerance
  – Easy to deploy
  – Free
• Cons
  – Single point of failure: the NameNode
  – Not an optimized solution
    • Same degree of replication for every block
  – Not that fast
Download & More Information
• Official site
  – http://hadoop.apache.org/
  – Last build: March 2011
• Korean Dev.
  – http://www.hadoop.co.kr/
  – Latest materials uploaded: Oct. 2011
QnA