
Page 1: Giraffa - November 2014

Giraffa

A highly available, scalable, distributed file system

PLAMEN JELIAZKOV & MILAN DESAI

Page 2: Giraffa - November 2014

Quick Introduction

• Giraffa is a new file system.

• Distributes its namespace by utilizing features of HDFS and HBase.

• Open-source project in an experimental stage.

Page 3: Giraffa - November 2014

Design Principles

• Linear scalability – more nodes can do more work within the same time. Scale data size and compute resources.

• Reliability and availability – each drive has roughly a 1/1000 probability of failing today; on a large cluster with thousands of drives there can be several failures.

• Move computation to data – minimize expensive data transfers.

• Sequential data processing – avoid random reads. [Use HBase for random access.]

Page 4: Giraffa - November 2014

Scalability Limits

• Single-master architecture: a constraining resource.

• A single NameNode limits linear performance growth – a few bad clients / jobs can saturate the NameNode.

• Single point of failure – takes the entire file system out of service.

• NameNode space limit:

  -- 100 million files and 200 million blocks with 64 GB RAM

  -- Restricts storage capacity to about 20 PB

  -- Small file problem: the block-to-file ratio is shrinking as people store more small files in HDFS.

These are Konstantin's own findings as published in "HDFS Scalability: The Limits to Growth", USENIX ;login:, 2010.

Page 5: Giraffa - November 2014

The Goals for Giraffa

• Support millions of concurrent clients

  - More servers -> more concurrent connections can be accepted.

• Store hundreds of billions of objects

  - More servers -> more total memory.

• Maintain exabyte total storage capacity

  - More servers -> host more slaves -> more total storage.

Sharding the namespace achieves all three goals.

Page 6: Giraffa - November 2014

What About Federation?

1. HDFS Federation allows independent NameNodes to share a common pool of DataNodes.

2. In Federation, a user sees NameNodes as volumes, or as isolated file systems.

Federation is a static approach to Namespace partitioning. We call it static because sub-trees are statically assigned to disjoint volumes. Relocating sub-trees to a new volume requires copying between file systems.

A dynamic Namespace partitioning could move sub-trees automatically based on utilization or load-balancing requirements. In some cases, sub-trees could be relocated without copying data blocks.

Page 7: Giraffa - November 2014

[Diagram slide: "VS"]

Page 8: Giraffa - November 2014

Giraffa Requirements

Availability – the primary goal

  - Region splitting leads to load balancing of metadata traffic.

  - Same data streaming speed to / from DataNodes.

  - No SPOF. Continuous availability.

Scalability

  - Each RegionServer stores a part of the namespace.

Cluster operability

  - The cost of running a larger cluster is the same as for a smaller one.

  - But running multiple clusters is more expensive.

Page 9: Giraffa - November 2014

The Big Picture

1. Use HBase to store HDFS Namespace metadata.

2. DataNodes continue to store HDFS blocks.

3. Introduce coprocessors to act as the communication layer between HBase, HDFS, and the file system.

4. Store files and directories as rows in HBase.

A Giraffa "shard" consists of:

  HBase RegionServer

  HDFS NameNode – to be replaced with the Giraffa BlockManager

  HDFS DataNode(s)

  *HBase Master

  *ZooKeeper(s)

* = not required per shard, but necessary within the network.

Page 10: Giraffa - November 2014

[No transcript text for this slide]

Page 11: Giraffa - November 2014

Giraffa File System

• fs.defaultFS = grfa:///

• fs.grfa.impl = org.apache.giraffa.GiraffaFileSystem (see the configuration sketch below)

• Namespace is cached in RegionServer RAM.

• Regions lead to dynamic Namespace partitioning.

• Block management is handled by a specialized RegionObserver coprocessor that communicates with DataNodes -> performs block allocation, replication, deletion, heartbeats, and block reports.

• Namespace manipulation is handled by a specialized coprocessor -> performs all NameNode RPC server calls.
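A minimal sketch of pointing a Hadoop client at Giraffa through these two properties, assuming the Giraffa jars are on the classpath and a running cluster; the path created here is purely illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GiraffaClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The two properties from the slide; normally they live in core-site.xml.
    conf.set("fs.defaultFS", "grfa:///");
    conf.set("fs.grfa.impl", "org.apache.giraffa.GiraffaFileSystem");

    // FileSystem.get() resolves the grfa:// scheme to GiraffaFileSystem.
    FileSystem fs = FileSystem.get(conf);
    fs.mkdirs(new Path("/user/demo"));   // illustrative path
    fs.close();
  }
}
```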

Page 12: Giraffa - November 2014

NamespaceAgent

A quick run-through of this class:

1. Implements ClientProtocol. Not a coprocessor.

2. Replaces the NameNode RPC channel for GiraffaClient (which extends DFSClient and is the client used by the GiraffaFileSystem class).

3. Has an HBaseClient member that communicates RPC requests to the NamespaceProcessor coprocessor of a RegionServer.

Page 13: Giraffa - November 2014

Namespace Table

A single HBase table called "Namespace" stores the following (a hypothetical row-construction sketch follows the list):

1. A RowKey: the bytes that identify the row and therefore the file / directory.

2. File attributes: name, owner, group, permissions, access time, modification time, block size, replication, length.

3. The list of blocks for the file.

4. The list of block locations.

5. The state of the file: under construction or closed.
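For illustration only, a rough sketch of what writing one such row could look like through the plain HBase client API. The column family and qualifier names ("namespace", "owner", "state", and so on) are hypothetical rather than Giraffa's actual schema, and in Giraffa these rows are managed by the coprocessors, not written by clients directly:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class NamespaceRowSketch {
  // Builds a Put for a single file row; family/qualifier names are hypothetical.
  static Put fileRow(String path) {
    byte[] cf = Bytes.toBytes("namespace");            // hypothetical column family
    Put put = new Put(Bytes.toBytes(path));            // row key = full path (default scheme)
    put.addColumn(cf, Bytes.toBytes("owner"), Bytes.toBytes("demo"));
    put.addColumn(cf, Bytes.toBytes("group"), Bytes.toBytes("users"));
    put.addColumn(cf, Bytes.toBytes("blockSize"), Bytes.toBytes(128L * 1024 * 1024));
    put.addColumn(cf, Bytes.toBytes("replication"), Bytes.toBytes((short) 3));
    put.addColumn(cf, Bytes.toBytes("length"), Bytes.toBytes(0L));
    put.addColumn(cf, Bytes.toBytes("state"), Bytes.toBytes("UNDER_CONSTRUCTION"));
    return put;
  }

  // One row per file or directory in the "Namespace" table.
  static void write(Table namespaceTable, String path) throws IOException {
    namespaceTable.put(fileRow(path));
  }
}
```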

Page 14: Giraffa - November 2014

Row Keys

• Files and directories are stored as rows in HBase.

• The key bytes of a row determine its sorting in the Namespace table.

• Different RowKey definitions change the locality of files and directories within the HBase region.

• FullPathRowKey is the default implementation. The key bytes of the row are the full source path to the file or directory (see the sketch after this list).

  -- Problem: renames may cause a row to move to another Region.

• Another idea is NumberedRowKey. The key bytes are some assigned number.

  -- Problem: you lose locality within the HBase Namespace table.
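A minimal sketch of the two schemes, written against a made-up interface rather than Giraffa's actual RowKey classes; the counter-based id in the second variant is only an assumption about how a numbered key might be generated:

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicLong;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical interface for illustration; not Giraffa's API.
interface RowKeySketch {
  byte[] getKeyBytes(String srcPath);
}

// Full-path scheme: rows sort by path, so entries in one directory stay adjacent,
// but a rename changes the key and may move the row to another Region.
class FullPathRowKeySketch implements RowKeySketch {
  public byte[] getKeyBytes(String srcPath) {
    return srcPath.getBytes(StandardCharsets.UTF_8);
  }
}

// Numbered scheme: keys are stable under rename, but directory locality is lost.
class NumberedRowKeySketch implements RowKeySketch {
  private static final AtomicLong nextId = new AtomicLong();
  public byte[] getKeyBytes(String srcPath) {
    return Bytes.toBytes(nextId.incrementAndGet());    // e.g. a generated file id
  }
}
```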

Page 15: Giraffa - November 2014

Locality of Reference

• The traditional tree-structured namespace is flattened into a linear array.

• The ordered list of files is self-partitioned into regions.

• RowKey implementations define the sorting of files and directories in the table.

• Files in the same directory will belong to the same region (most of the time).

  -- This leads to an efficient “ls” implementation by purely scanning across a Region (a scan sketch follows below).
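A rough illustration of that “ls” idea using the plain HBase client API, assuming the default full-path row keys; the table handle and directory path are placeholders, and this is not Giraffa's actual listing code:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ListingSketch {
  // Lists rows whose keys start with the directory path; with full-path keys
  // these rows are contiguous, so the scan usually stays within one Region.
  static void ls(Table namespaceTable, String dir) throws IOException {
    String prefix = dir.endsWith("/") ? dir : dir + "/";
    Scan scan = new Scan();
    scan.setRowPrefixFilter(Bytes.toBytes(prefix));
    try (ResultScanner scanner = namespaceTable.getScanner(scan)) {
      for (Result row : scanner) {
        System.out.println(Bytes.toString(row.getRow()));
      }
    }
  }
}
```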

Page 16: Giraffa - November 2014

Giraffa Today

A lot of work has been done by the current team; the newest items to date are:

• Introduction of a custom Giraffa WebUI.

• Atomic in-place rename, non-atomic moves, and non-atomic move failure recovery.

• Serializing exceptions over RPC.

• Support for YARN.

• (Coming soon) Introduction of lease management.

Page 17: Giraffa - November 2014

Neat Futures

• Full Hadoop compatibility / HDFS replacement. We are 96% compliant with the hadoop/hdfs shell today, as shown by passing the bulk of TestHDFSCLI. Missing dfsadmin commands today.

• Since file system metadata lives in the same pool as regular data, it is possible to deploy analytics and obtain detailed analysis of your own file system.

• A snapshot implementation becomes a matter of increasing the number of versions of a row allowed in HBase (a sketch follows this list).

• An extended-attributes implementation just means adding a new column to the file row.
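A hypothetical sketch of the versioning knob the snapshot idea relies on, using the standard HBase admin API; the table name "Namespace" comes from the earlier slide, while the column family name and the version count are placeholders:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class VersioningSketch {
  // Raise the number of retained cell versions on a (hypothetical) namespace
  // column family, so older versions of a file row could back snapshot reads.
  static void enableRowVersions(Admin admin) throws IOException {
    admin.modifyColumnFamily(
        TableName.valueOf("Namespace"),
        ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("namespace"))
            .setMaxVersions(10)        // keep up to 10 historical versions per cell
            .build());
  }
}
```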

Page 18: Giraffa - November 2014

History

2009 – Study on scalability limits.

2010 – Konstantin Shvachko works on the design with Michael Stack; presentation at the HDFS contributors meeting.

2011 – Plamen Jeliazkov implements the first POC.

2012 – Presented at Hadoop Summit. Open sourced as an Apache Extras project.

2013 – Milan Desai and Konstantin Pelykh added as committers; Konstantin Boudnik as a contributor.

2014 – Giraffa scalability tested: ~46,300 mkdirs / second with 64 RegionServer nodes and 64 client nodes.

Page 19: Giraffa - November 2014

Questions?

Page 20: Giraffa - November 2014

LINKS TO PROJECT WEBSITE BELOW

http://apache-extras.org/p/giraffa/

https://code.google.com/a/apache-extras.org/p/giraffa/

DEMO TIME!