HDFS introduction


Page 1: HDFS introduction

++Hadoop Fundamentals Course

Page 2: HDFS introduction

++Overview

[diagram: Hadoop ecosystem overview: HDFS, MapReduce, Impala, Cascading, Hive]

Page 3: HDFS introduction

++Big Data for What?

Service

CAP theorem, fast response, scale-out, schema-free ...

Distributed RDBMS

NoSQL

MongoDB, HBase, CouchDB ...

Analysis

Hadoop <--- today’s topic!!!

Page 4: HDFS introduction

++What’s Hadoop

Consists of:

HDFS (Hadoop Distributed File System)

MapReduce

Page 5: HDFS introduction

++HDFS Architecture

master

namenode

slaves

a bunch of datanodes

[diagram: NameNode (master) above a row of DataNodes (slaves)]

Page 6: HDFS introduction

++single master

Strong points

simple architecture

the master has global knowledge:

file and block namespace (memory and disk)

mapping from files to blocks (memory and disk)

location of each block's replicas (memory only)

the master can make sophisticated decisions.
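
As an illustration of that global view, the stock admin CLI can dump what the namenode knows about the whole cluster (a sketch, assuming a configured hdfs client with admin rights):

hdfs dfsadmin -report   # capacity, block counts, and the status of every datanode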

Page 7: HDFS introduction

++single master

Weak points

SPOF (single point of failure)

bottleneck

minimizing the master's involvement is important

Page 8: HDFS introduction

++Fast Recovery for NameNode

Secondary NameNode

periodically reads the namenode's operation log

maintains a copy of the namenode's metadata

[diagram: NameNode and Secondary NameNode above a row of DataNodes]
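
Checkpoint frequency is plain configuration; a small sketch for inspecting it (property names as in Hadoop 2.x, so treat them as an assumption for other versions):

hdfs getconf -confKey dfs.namenode.checkpoint.period   # seconds between checkpoints
hdfs getconf -confKey dfs.namenode.checkpoint.txns     # or checkpoint after this many edit-log transactions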

Page 9: HDFS introduction

++HA for NameNode

active namenode

performs the normal namenode operations

standby namenode

maintains a copy of the namenode's metadata

ready to take over as the active namenode

[diagram: NameNode (active) and NameNode (standby) above a row of DataNodes]
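
With HA configured, the haadmin tool shows and switches roles; a minimal sketch (nn1 and nn2 are hypothetical NameNode IDs taken from hdfs-site.xml):

hdfs haadmin -getServiceState nn1   # prints "active" or "standby"
hdfs haadmin -failover nn1 nn2      # hand the active role over from nn1 to nn2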

Page 10: HDFS introduction

++block

each file consists of blocks

size: default 64 MB

replication (default 3)
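
Both values are ordinary configuration and can be checked or overridden per file; a sketch, assuming Hadoop 2.x property names and a hypothetical local file big.log:

hdfs getconf -confKey dfs.blocksize                     # default block size in bytes
hdfs getconf -confKey dfs.replication                   # default replication factor
hdfs dfs -D dfs.blocksize=134217728 -put big.log /foo   # write one file with 128 MB blocks
hdfs dfs -setrep -w 2 /foo/big.log                      # change an existing file's replication to 2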

Page 11: HDFS introduction

++write operation

the client sends a 'write' request to the namenode

the namenode locks the file and selects the datanodes to be written to

the namenode responds to the client with the list of datanodes

the client sends the file content to the first datanode

the datanode stores the data and relays it to the other datanodes

finally, the client sends a 'close' request to the namenode

the namenode releases the write lock

[diagram: client, NameNode, and a pipeline of DataNodes; the namenode takes the write lock and allocates datanodes]
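
From the client's point of view, this whole flow hides behind a single shell command (a sketch, assuming a hypothetical local file local.txt):

hdfs dfs -put local.txt /foo/bar.txt   # asks the namenode for datanodes, then streams the blocks through the pipeline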

Page 12: HDFS introduction

++read operation

the client sends a 'read' request to the namenode

the namenode locks the file and selects the datanodes to be read from

the namenode responds to the client with the list of datanodes

the client sends a read request to a datanode

the datanode sends the content to the client

finally, the client sends a 'close' request to the namenode

the namenode releases the read lock

[diagram: client, NameNode, and DataNodes; the namenode takes the read lock]
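
The read path is likewise hidden behind one command; a sketch, reusing the file written above:

hdfs dfs -get /foo/bar.txt ./bar.txt   # fetches block locations from the namenode, then reads from the datanodes
hdfs dfs -cat /foo/bar.txt             # same flow, but streams to stdout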

Page 13: HDFS introduction

++block (again)

reasons to use big blocks

reduce the client's need to interact with the namenode

reduce the size of the metadata stored on the namenode
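
For a rough sense of scale: a 1 TB file in 64 MB blocks gives the namenode 16,384 block entries to track, while the same file in 4 KB blocks (a typical local filesystem block size) would mean roughly 268 million entries.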

Page 14: HDFS introduction

++namenode's operations

namespace management and locking

replica placement

creation, re-replication, rebalancing

garbage collection

stale replica detection

Page 15: HDFS introduction

++namespace management and locking

goal: ensure proper serialization

uses read locks and write locks

Page 16: HDFS introduction

++block replica placement

goal

maximize data reliability and availability

maximize network bandwidth utilization

default strategy is ...

one on the same datanode as the writer.

one on another datanode in the same rack.

one on a datanode in another rack.
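
Rack placement depends on the topology the namenode has been given; to see which rack it thinks each datanode is in (a sketch, assuming admin rights):

hdfs dfsadmin -printTopology   # lists each rack and the datanodes assigned to it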

Page 17: HDFS introduction

++creation, re-replication, rebalancing

creation

a client creates a new file

the namenode considers

disk space utilization

number of recent creations

spreading replicas

re-replication

happens when the number of available replicas falls below the target

datanode down, replica corruption ...

rebalancing

moves replicas for better disk space usage and load balancing
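
Rebalancing is usually kicked off explicitly with the balancer tool, for example (the threshold is the allowed utilization difference in percent):

hdfs balancer -threshold 10   # move replicas until every datanode is within 10% of the cluster average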

Page 18: HDFS introduction

++garbage collection

what’s garbage?

blocks not in the namenode's metadata.

mechanism

when exchanging heartbeats with the namenode, a datanode reports a subset of the blocks it has.

the namenode replies with which of those blocks are garbage.

the datanode deletes the garbage blocks.

Page 19: HDFS introduction

++stale replica detection

mechanism: blocks are stored with a generation timestamp.

when restarting, a datanode reports its set of blocks along with their generation timestamps

Page 20: HDFS introduction

++Datanode’s operation

check data integrity: the datanode uses checksumming to detect corruption.
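
Checksums are also exposed to clients; a small sketch, reusing the file from the earlier examples (the -checksum option exists in recent Hadoop releases):

hdfs dfs -checksum /foo/bar.txt   # asks the datanodes for block checksums and prints a combined file checksum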

Page 21: HDFS introduction

++filesystem api

HDFS provides basic Linux-style file utilities, e.g.:

hdfs dfs -mkdir -p /foo

hdfs dfs -ls /foo

hdfs dfs -cat /foo/bar.txt

hdfs dfs -rm -r /foo

Page 22: HDFS introduction

++etc

RAID?

native library?

Page 23: HDFS introduction

++end

thanks ....