Design for a Distributed NameNode
DESCRIPTION
A proposed design for a distributed HDFS NameNode.
TRANSCRIPT
Reaching 10,000
Aaron Cordova | Booz Allen Hamilton | Hadoop Meetup DC | Sep 7, 2010
Lots of Applications Require Scalability
Intelligence
Bio-Metrics
Bio-Informatics
Defense
Video
Images
Text
Structured Data
Graph Analytics
Machine Learning
Network Security
Hadoop Scales
[Chart: Cost vs. Data Size, comparing Shared Nothing and Shared Disk architectures]
Linear Scalability
Massive Parallelism
MapReduce
Simplified Distributed Programming Model
Fault Tolerant
Designed to Scale to Thousands of Servers
Many Algorithms Easily Expressed as Map and Reduce
HDFS
Distributed File System
Optimized for High-Throughput
Fault Tolerant Through Replication, Checksumming
Designed to Scale to 10,000 servers
Hadoop is a Platform
MapReduce
HDFS
HBase
Mahout
Hive
Pig
Flume
Cascading
Nutch
HBase
Scalable Structured store
Fast Lookups
Durable, Consistent Writes
Automatic Partitioning
Mahout
Scalable Machine Learning Algorithms
Clustering
Classification
Fuzzy Table
Low-Latency Parallel Search
Generalized Fuzzy Matching
Images, Biometrics, Audio
One Major Problem
HDFS Single NameNode
Single NameSpace - easy to serialize operations
NameSpace stored entirely in memory
Changes written to transaction log first
Single Point of Failure
Performance Bottleneck?
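The "transaction log first, then memory" rule above can be sketched in a few lines of Python (a toy illustration, not actual NameNode code; all names are invented):

```python
# Toy sketch of the NameNode write path: every namespace mutation is
# appended to an edit log before the in-memory map changes, so the
# namespace can be rebuilt by replaying the log after a crash.
class TinyNameNode:
    def __init__(self):
        self.edit_log = []    # stands in for the on-disk transaction log
        self.namespace = {}   # path -> block list, held entirely in memory

    def mkdir(self, path):
        self.edit_log.append(("mkdir", path))  # durable record first ...
        self.namespace[path] = []              # ... then the in-memory change

    @classmethod
    def replay(cls, log):
        # crash recovery: rebuild the in-memory namespace from the log
        nn = cls()
        for op, path in log:
            if op == "mkdir":
                nn.namespace[path] = []
        return nn
```

Because every operation funnels through one such server, operations are easy to serialize, which is exactly why the single NameNode is both simple and a bottleneck.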
NameNode Scalability
By software evolution standards Hadoop is a young project. In 2005, inspired by two Google papers, Doug Cutting and Mike Cafarella implemented the core of Hadoop. Its wide acceptance and growth started in 2006 when Yahoo! began investing in its development and committed to use Hadoop as its internal distributed platform. During the past several years Hadoop installations have grown from a handful of nodes to thousands. It is now used in many organizations around the world.
In 2006, when the buzzword for storage was Exabyte, the Hadoop group at Yahoo! formulated long-term target requirements [7] for the Hadoop Distributed File System and outlined a list of projects intended to bring the requirements to life. What was clear then has now become a reality: the need for large distributed storage systems backed by distributed computational frameworks like Hadoop MapReduce is imminent.
Today, when we are on the verge of the Zettabyte Era, it is time to take a retrospective view of the targets and analyze what has been achieved, how aggressive our views on the evolution and needs of the storage world have been, how the achievements compare to competing systems, and what our limits to growth may be.
The main four-dimensional scale requirement targets for HDFS were formulated [7] as follows:
10PB capacity x 10,000 nodes x 100,000,000 files x 100,000 clients
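A quick back-of-envelope on what those four targets imply (my arithmetic, not the paper's):

```python
# What the four-dimensional HDFS targets imply per node and per file.
PB = 10**15
capacity = 10 * PB       # 10 PB total capacity
nodes = 10_000           # 10,000 nodes
files = 100_000_000      # 100,000,000 files

per_node_space = capacity // nodes   # 1 TB of raw space per node
avg_file_size = capacity // files    # 100 MB average file size
```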
The biggest Hadoop clusters [8, 5], such as the one recently used at Yahoo! to set sorting records, consist of 4000 nodes and have a total space capacity …
“100,000 HDFS clients on a 10,000-node HDFS cluster will exceed the throughput capacity of a single name-node.
... any solution intended for single namespace server optimization lacks scalability.
... the most promising solutions seem to be based on distributing the namespace server ...”
Konstantin Shvachko, ;login:, Apr 2010
Goal
[Chart: writes/second (thousands), 0 to 50, Single NN vs. Target]
HDFS Single NameNode
Server grade machine
Lots of memory
Reliable components
RAID
Hot-Failover
Needs Parallelism
Scaling NameNode
Grow memory
Read-only Replicas of NameNode
Multiple static namespace partitions
Distributed name server, partition namespace dynamically
Distributed NameNode Features
Fast Lookups
Durable, Consistent writes
Automatic Partitioning
Can we use HBase?
Mappings as HBase Tables
NameSpace: filename : blocks
DataNodes: node : blocks
Blocks: block : nodes
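A sketch of the three mappings as plain key-to-value tables (Python dicts stand in for the HBase tables here; the data is made up):

```python
# The three NameNode mappings, modeled as simple key -> value tables.
namespace = {"/dir1/file": ["blk_1", "blk_2"]}                  # filename : blocks
datanodes = {"dn-01": ["blk_1"], "dn-02": ["blk_1", "blk_2"]}   # node : blocks
blocks    = {"blk_1": ["dn-01", "dn-02"], "blk_2": ["dn-02"]}   # block : nodes

def locate(path):
    """Resolve a path to (block, replica locations) pairs: two table lookups."""
    return [(b, blocks[b]) for b in namespace[path]]
```

Each mapping is a natural fit for a sorted key-value store, which is what motivates the "Can we use HBase?" question.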
How to order namespace?
Depth First Search Order
/
/dir1
/dir1/subdir
/dir1/subdir/file
/dir2/file1
/dir2/file2
Depth First Operations
Delete (Recursive)
Move / Rename
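The depth-first ordering is just a lexicographic sort of full paths, which keeps every directory's subtree contiguous; that is what turns recursive delete and move/rename into range operations. A Python sketch (illustrative only, not the talk's code):

```python
def dfs_order(paths):
    # plain lexicographic sort of full paths yields depth-first order:
    # a directory's entire subtree occupies one contiguous key range
    return sorted(paths)

def subtree(paths, d):
    # everything under directory d shares d as a path prefix, so a
    # recursive delete or rename is a single range scan over these keys
    return [p for p in dfs_order(paths) if p == d or p.startswith(d + "/")]
```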
Breadth First Search Order
0/
1/dir1
2/dir1/subdir
2/dir2/file1
2/dir2/file2
3/dir1/subdir/file
Breadth First Operations
List
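The breadth-first layout prefixes each key with its depth, so all children of a directory sort next to each other and listing a directory becomes one prefix scan. A Python sketch (my illustration; it handles single-digit depths only, so real keys would need a fixed-width depth field):

```python
def bfs_key(path):
    # prefix each key with its depth; lexicographic order then groups
    # all entries at the same depth together (breadth-first order)
    depth = 0 if path == "/" else path.count("/")
    return f"{depth}{path}"

def list_dir(keys, parent):
    # listing a directory is one prefix scan at depth(parent) + 1
    depth = 1 if parent == "/" else parent.count("/") + 1
    prefix = f"{depth}/" if parent == "/" else f"{depth}{parent}/"
    return sorted(k for k in keys if k.startswith(prefix))
```

This is the trade-off between the two orderings: depth-first keys make recursive operations cheap, while depth-prefixed keys make List cheap.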
Current Architecture
[Diagram: DFSClients and DataNodes all communicating with a single NameNode]
Proposed Architecture
[Diagram: each DFSClient and DataNode embeds a DNNProxy, which talks to a set of RServers hosting the partitioned namespace]
100k clients -> 41k writes/s
Anticipated Performance
[Chart: writes/second (thousands, 0 to 50) vs. number of machines hosting the namespace (100 to 250), comparing Single NN, Distributed NN, and Target]
Issues
Synchronization: multiple writers making concurrent changes
Name distribution hotspots
Current Status
Working code exists that uses HBase, with slightly modified DFSClient and DataNode, supporting create, write, close, open, read, mkdirs, and delete.
New component: a HealthServer that monitors DataNodes and performs garbage collection. It acts more like the BigTable master: it can die and restart without affecting clients.
Code
Will be at http://code.google.com/p/hdfs-dnn
Available under the Apache license (whichever version is compatible with Hadoop)
Doesn’t HBase run on HDFS?
Self-Hosted HBase
May be possible to have HBase use the same HDFS instance it’s supporting
Some recursion and self-reference already exists: the HBase metadata table is itself a table stored in HBase
Have to work out bootstrapping and failure recovery to resolve any potential circular dependencies