HDFS (Hadoop Distributed File System) 2011-10-10 Taejoong Chung, MMLAB




Contents
• Introduction
  – Hadoop Distributed File System?
  – Assumptions & Goals
• Mechanism
  – Structure
  – Data Management
  – Maintenance
• Pros and Cons


HDFS
• Hadoop Distributed File System
  – Started from ‘Nutch’ (open-source search engine project) in 2005
  – Java based, Apache top-level project
  – Designed to store massive amounts of data at low cost
• Characteristics
  – User-level distributed file system
  – Fault-tolerant
  – Can be deployed on low-cost hardware


Assumptions & Goals
1) Protection against Failure
  • Detection of faults and quick, automatic recovery
  • Considers both hardware and software failures
2) Streaming Data Access
  • Batch processing rather than interactive use
  • High throughput of data access rather than low latency of data access


Assumptions & Goals (cont'd)
3) Large Data Sets
  • A typical file in HDFS is gigabytes to terabytes in size
  • High aggregate data bandwidth, scaling to hundreds of nodes
4) Simple Coherency Model
  • Write-once-read-many access
  • A file, once created, is not allowed to be modified


Assumptions & Goals (cont'd)
5) Migrating Computation to the Data
  • Provides interfaces for applications to move themselves closer to where the data is located
6) Portability
  • Easily portable from one platform to another
  • Java based


Structure
• Master / Slave architecture
• NameNode (Master)
  – Manages the file system namespace
  – Regulates access to files by clients
  – Does not contain any data files
  – Unique (one per cluster)
• DataNode (Slave)
  – Actual data repository
  – Multiple nodes are required


Conceptual Diagram
• Namespace (headquarters): directory service
• DataNode: contains multiple blocks of data
• Block: a piece of data


Operation
• A file is split into multiple blocks, and each block is replicated across the DataNodes
  – A file is cut into blocks of 64 MB each (default)
  – Each block is replicated over the DataNodes (number of replicas: 3, default)
• Scheme
  – Replicas are placed so as to maximize ‘tolerance’
  – Local tolerance
    • Inside the rack
  – Global tolerance
    • Outside the rack
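The block arithmetic on this slide can be illustrated with a small Python sketch. It only uses the two defaults quoted above (64 MB blocks, 3 replicas). The function names and the 200 MB example file are hypothetical, and this is not how Hadoop itself is implemented.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB default block size, as quoted on the slide
REPLICATION = 3                # default number of replicas, as quoted on the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the byte length of each block for a file of file_size bytes."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    full, rest = divmod(file_size, block_size)
    sizes = [block_size] * full
    if rest:
        sizes.append(rest)  # the last block holds only the remaining bytes
    return sizes


def raw_storage(file_size, replication=REPLICATION):
    """Total bytes stored in the cluster once every block is replicated."""
    return file_size * replication


MB = 1024 * 1024
blocks = split_into_blocks(200 * MB)   # hypothetical 200 MB file
print(len(blocks), blocks[-1] // MB)   # 4 blocks, the last one is 8 MB
print(raw_storage(200 * MB) // MB)     # 600 MB of raw storage
```

Note that the last block is smaller than 64 MB, so the file is not padded up to a multiple of the block size in this model.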


Rack Awareness
[Figure: example in which the NameNode issues a command to save files to DataNodes in Rack 1, Rack 2, and Rack 3]
• Local tolerance: in the same rack
• Global tolerance: outside of the rack
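One way to picture local and global tolerance is the simplified placement rule sketched below: keep two replicas in the writer's rack and put the third in a different rack. This is an assumption made for illustration, not Hadoop's actual placement policy. The `topology` dictionary, the node names, and `place_replicas` are all hypothetical.

```python
def place_replicas(topology, writer_node):
    """Pick three DataNodes: two in the writer's rack, one in another rack.

    topology maps a rack name to the list of DataNode names in that rack.
    """
    local_rack = next(
        (rack for rack, nodes in topology.items() if writer_node in nodes), None
    )
    if local_rack is None:
        raise ValueError("writer_node is not part of the topology")
    local_peers = [n for n in topology[local_rack] if n != writer_node]
    remote_nodes = [
        n for rack, nodes in sorted(topology.items())
        if rack != local_rack for n in nodes
    ]
    if not local_peers or not remote_nodes:
        raise ValueError("need a second local node and at least one other rack")
    # Taking the first candidate keeps the sketch deterministic; a real
    # scheduler would also weigh load and free space.
    return [writer_node, local_peers[0], remote_nodes[0]]


topology = {
    "rack1": ["dn1", "dn2"],
    "rack2": ["dn3", "dn4"],
    "rack3": ["dn5", "dn6"],
}
print(place_replicas(topology, "dn1"))  # ['dn1', 'dn2', 'dn3']
```

With this layout the block survives the loss of a single node (local tolerance) and the loss of all of rack 1 (global tolerance).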


Data Maintenance
• Each DataNode sends ‘Heartbeat’ and ‘Blockreport’ messages to the NameNode
  – Blockreport
    • A list of all blocks on a DataNode
  – Heartbeat
    • A kind of ‘ping’ (I’m alive!)
    • Receipt of a Heartbeat implies that the DataNode is functioning properly
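The bookkeeping described on this slide can be sketched roughly as follows. `HeartbeatMonitor`, its method names, and the 30-second timeout are hypothetical choices for the example and are not Hadoop's API or defaults. Time is passed in explicitly so the behaviour is easy to follow.

```python
class HeartbeatMonitor:
    """Toy NameNode-side view of DataNode liveness and block lists."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {}  # DataNode name -> time of last Heartbeat
        self.blocks = {}     # DataNode name -> set of block ids it reported

    def heartbeat(self, datanode, now):
        """Record that a DataNode said 'I'm alive' at time now."""
        self.last_seen[datanode] = now

    def block_report(self, datanode, block_ids):
        """Replace the known block list of a DataNode with its latest report."""
        self.blocks[datanode] = set(block_ids)

    def dead_nodes(self, now):
        """DataNodes whose last Heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


monitor = HeartbeatMonitor(timeout_seconds=30)
monitor.heartbeat("dn1", now=0)
monitor.heartbeat("dn2", now=0)
monitor.block_report("dn2", ["blk_1", "blk_2"])
monitor.heartbeat("dn1", now=25)
print(monitor.dead_nodes(now=40))  # ['dn2']
```

In this sketch, the blocks last reported by a dead node (here `blk_1` and `blk_2`) are the ones that would need new replicas elsewhere.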


Data Management
• NameNode manages all metadata
  – EditLog
    • Every transaction (change to file system metadata) is recorded here by the NameNode
  – FsImage (File System Image)
    • Stores the file system namespace, including the mapping of files to blocks
    • Key metadata is kept in memory
    • Which DataNodes hold each block is learned from the Blockreports that the DataNodes send
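The relationship between the EditLog and the FsImage can be sketched as a checkpoint plus a replayable log. `TinyNameNode` and its methods are hypothetical, and the real on-disk formats are not modelled. The sketch only shows why replaying the log on top of the last image restores the current namespace.

```python
import copy


class TinyNameNode:
    """Toy model: FsImage is a checkpoint, EditLog holds changes since then."""

    def __init__(self):
        self.fsimage = {}    # last checkpoint: path -> list of block ids
        self.editlog = []    # transactions recorded since the checkpoint
        self.namespace = {}  # live metadata held in memory

    @staticmethod
    def _apply(namespace, op):
        kind, path, blocks = op
        if kind == "create":
            namespace[path] = list(blocks)
        elif kind == "delete":
            namespace.pop(path, None)

    def create(self, path, blocks):
        op = ("create", path, tuple(blocks))
        self.editlog.append(op)          # record the transaction first
        self._apply(self.namespace, op)  # then update the in-memory state

    def delete(self, path):
        op = ("delete", path, ())
        self.editlog.append(op)
        self._apply(self.namespace, op)

    def checkpoint(self):
        """Fold the current state into a new FsImage and truncate the log."""
        self.fsimage = copy.deepcopy(self.namespace)
        self.editlog = []

    def recover(self):
        """Rebuild the namespace from the FsImage plus the EditLog."""
        namespace = copy.deepcopy(self.fsimage)
        for op in self.editlog:
            self._apply(namespace, op)
        return namespace


nn = TinyNameNode()
nn.create("/a", ["blk_1", "blk_2"])
nn.checkpoint()
nn.create("/b", ["blk_3"])
nn.delete("/a")
print(nn.recover())  # {'/b': ['blk_3']}
```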


Data Integrity (1)
• Safemode
  – On startup, the NameNode receives Heartbeat and Blockreport messages from the DataNodes
  – Each block has a specified minimum number of replicas
    • Blocks under this threshold are re-replicated once Safemode ends
  – Replication of data blocks does not occur during this period
  – This happens regularly
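The Safemode check can be sketched as counting how many blocks have reached their minimum replica count in the Blockreports received so far. The function names, the 0.9 exit threshold, and the block counts below are hypothetical values chosen for the example, not Hadoop settings.

```python
def safely_replicated_fraction(reported, min_replicas):
    """reported maps block id -> number of replicas seen in Blockreports."""
    if not reported:
        return 1.0
    safe = sum(1 for count in reported.values() if count >= min_replicas)
    return safe / len(reported)


def can_leave_safemode(reported, min_replicas=1, threshold=0.9):
    """True once enough blocks have reached the minimum replica count."""
    return safely_replicated_fraction(reported, min_replicas) >= threshold


def under_replicated(reported, target_replicas=3):
    """Blocks that need re-replication after Safemode ends."""
    return sorted(b for b, count in reported.items() if count < target_replicas)


reported = {"blk_1": 3, "blk_2": 2, "blk_3": 0, "blk_4": 3}
print(can_leave_safemode(reported))  # False: only 3 of 4 blocks are safe
reported["blk_3"] = 1                # a late Blockreport arrives
print(can_leave_safemode(reported))  # True
print(under_replicated(reported))    # ['blk_2', 'blk_3']
```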


Data Integrity (2)
• Data fetched from a DataNode could be corrupted
  – Checksum algorithms are implemented
• Operation
  ① When a client creates an HDFS file, it also calculates and stores a checksum
  ② When a client receives a file, it also downloads the checksum
  ③ By comparing the downloaded checksum with a checksum calculated from the received file, the client can verify the content
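The three steps above can be sketched with CRC32 from Python's standard `zlib` module. The choice of CRC32, the 512-byte chunk size, and the function names are assumptions for this example. The slide does not say which checksum algorithm or granularity is used.

```python
import zlib

CHUNK = 512  # bytes covered by each checksum (assumed value for this sketch)


def checksums(data, chunk=CHUNK):
    """Step 1: compute one CRC32 per chunk when the file is created."""
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]


def verify(data, expected, chunk=CHUNK):
    """Step 3: recompute the checksums of received data and compare."""
    return checksums(data, chunk) == expected


original = b"hdfs block payload " * 100
stored = checksums(original)      # saved alongside the data at write time

received = bytearray(original)    # step 2: the client downloads data + checksums
received[700] ^= 0xFF             # simulate one corrupted byte in transit
print(verify(original, stored))          # True
print(verify(bytes(received), stored))   # False
```

Keeping one checksum per chunk, rather than one per file, also tells the reader which part of the data is damaged.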


Robustness
• Data disk failure, heartbeats, and re-replication
  – From Heartbeat messages, the NameNode can check the liveness of each DataNode
• Cluster rebalancing
  – If a DataNode holds much more data than the others, blocks are redistributed
• Data integrity
  – Checksum
• Metadata disk failure
  – Multiple copies of the FsImage and EditLog are kept
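The cluster rebalancing bullet above can be pictured with a greedy sketch that moves one block at a time from the fullest DataNode to the emptiest. `rebalance`, the block counts, and the tolerance of one block are hypothetical. A real balancer would also have to respect replica placement, which this sketch ignores.

```python
def rebalance(usage, tolerance=1):
    """Even out block counts; returns the list of (source, destination) moves.

    usage maps DataNode name -> number of blocks stored and is updated in place.
    """
    if tolerance < 1:
        raise ValueError("tolerance must be at least 1, or the loop may not end")
    moves = []
    if not usage:
        return moves
    while True:
        names = sorted(usage)  # sorted so that ties are broken predictably
        fullest = max(names, key=usage.get)
        emptiest = min(names, key=usage.get)
        if usage[fullest] - usage[emptiest] <= tolerance:
            return moves
        usage[fullest] -= 1
        usage[emptiest] += 1
        moves.append((fullest, emptiest))


usage = {"dn1": 10, "dn2": 2, "dn3": 3}
moves = rebalance(usage)
print(usage)       # {'dn1': 5, 'dn2': 5, 'dn3': 5}
print(len(moves))  # 5
```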


Pros and Cons
• Pros
  – Powerful fault-tolerance mechanism
  – Easy to deploy
  – Free
• Cons
  – Single point of failure: the NameNode
  – Not an optimized solution
    • Same degree of replication for each block
  – Not that fast


Download & More Information
• Official site
  – http://hadoop.apache.org/
  – Last build: March 2011
• Korean Dev.
  – http://www.hadoop.co.kr/
  – Last materials uploaded: October 2011


QnA