topic 11: google filesystem



DESCRIPTION

Cloud Computing Workshop 2013, ITU

TRANSCRIPT

Page 1: Topic 11: Google Filesystem

11: Google Filesystem

Zubair Nabi

[email protected]

April 20, 2013

Zubair Nabi 11: Google Filesystem April 20, 2013 1 / 29

Page 2: Topic 11: Google Filesystem

Outline

1 Introduction

2 Google Filesystem

3 Hadoop Distributed Filesystem


Page 4: Topic 11: Google Filesystem

Filesystem

The purpose of a filesystem is to:

1 Organize and store data

2 Support sharing of data among users and applications

3 Ensure persistence of data after a reboot

Examples include FAT, NTFS, ext3, ext4, etc.


Page 8: Topic 11: Google Filesystem

Distributed filesystem

Self-explanatory: the filesystem is distributed across many machines

The DFS provides a common abstraction to the dispersed files

Each DFS has an associated API that provides a service to clients: normal file operations, such as create, read, write, etc.

Maintains a namespace which maps logical names to physical names

- Simplifies replication and migration

Examples include the Network Filesystem (NFS), the Andrew Filesystem (AFS), etc.
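The logical-to-physical namespace mapping can be sketched as follows. This is a minimal illustration, not any real DFS's API; the class and field names are hypothetical:

```python
# Hypothetical sketch of a DFS namespace: logical pathnames map to
# physical locations, so a file can be replicated or migrated simply
# by updating the mapping, while clients keep using the logical name.
class Namespace:
    def __init__(self):
        self.table = {}  # logical name -> list of (host, local_path)

    def register(self, logical, replicas):
        self.table[logical] = list(replicas)

    def resolve(self, logical):
        return self.table[logical]

    def migrate(self, logical, old, new):
        # Move one replica; the logical name is untouched.
        locs = self.table[logical]
        locs[locs.index(old)] = new

ns = Namespace()
ns.register("/data/log", [("hostA", "/disk1/f1"), ("hostB", "/disk2/f1")])
ns.migrate("/data/log", ("hostA", "/disk1/f1"), ("hostC", "/disk3/f1"))
```

Because clients only ever see `/data/log`, the migration above is invisible to them, which is exactly why the indirection simplifies replication and migration.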


Page 15: Topic 11: Google Filesystem

Introduction

Designed by Google to meet its massive storage needs

Shares many goals with previous distributed filesystems, such as performance, scalability, reliability, and availability

At the same time, its design is driven by key observations of Google's workload and infrastructure, both current and future


Page 18: Topic 11: Google Filesystem

Design Goals

1 Failure is the norm rather than the exception: GFS must constantly introspect and automatically recover from failure

2 The system stores a fair number of large files: Optimize for large files, on the order of GBs, but still support small files

3 Applications prefer to do large streaming reads of contiguous regions: Optimize for this case


Page 21: Topic 11: Google Filesystem

Design Goals (2)

4 Most applications perform large, sequential writes that are mostly append operations: Support small writes but do not optimize for them

5 Most operations are producer-consumer queues or many-way merging: Support concurrent reads or writes by hundreds of clients simultaneously

6 Applications process data in bulk at a high rate: Favour throughput over latency


Page 24: Topic 11: Google Filesystem

Interface

The interface is similar to traditional filesystems, but there is no support for a standard POSIX-like API

Files are organized hierarchically into directories with pathnames

Support for create, delete, open, close, read, and write operations


Page 27: Topic 11: Google Filesystem

Architecture

Consists of a single master and multiple chunkservers

The system can be accessed by multiple clients

Both the master and chunkservers run as user-space server processes on commodity Linux machines


Page 30: Topic 11: Google Filesystem

Files

Files are sliced into fixed-size chunks

Each chunk is identifiable by an immutable and globally unique 64-bit handle

Chunks are stored by chunkservers as local Linux files

Reads and writes to a chunk are specified by a handle and a byte range

Each chunk is replicated on multiple chunkservers

- 3 by default
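The chunk abstraction means a client can turn a file byte offset into a chunk index with simple arithmetic before asking the master for the corresponding handle. A minimal sketch, assuming the 64 MB chunk size from the GFS paper (the slides do not state the size):

```python
# Sketch: translate a file byte offset into (chunk index, offset
# within chunk), as a client library would do before requesting the
# chunk's handle from the master. 64 MB is the chunk size reported in
# the GFS paper; it is an assumption here, not stated on the slides.
CHUNK_SIZE = 64 * 1024 * 1024

def locate(offset):
    return offset // CHUNK_SIZE, offset % CHUNK_SIZE

def chunks_for_range(start, length):
    # A read of [start, start + length) may span several chunks.
    first, _ = locate(start)
    last, _ = locate(start + length - 1)
    return list(range(first, last + 1))
```

A read that straddles a chunk boundary simply becomes requests for two consecutive chunk indices.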


Page 36: Topic 11: Google Filesystem

Master

In charge of all filesystem metadata

- Namespace, access control information, mapping between files and chunks, and current locations of chunks

- Holds this information in memory and regularly syncs it with a log file

Also in charge of chunk leasing, garbage collection, and chunk migration

Periodically sends each chunkserver a heartbeat signal to check its state and send it instructions

Clients interact with it to access metadata, but all data-bearing communication goes directly to the relevant chunkservers

- As a result, the master does not become a performance bottleneck
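The three kinds of in-memory metadata can be sketched as plain tables; all identifiers below are hypothetical. Note that only the namespace and file-to-chunk mappings are persisted to the log, while chunk locations are rebuilt from chunkserver heartbeats:

```python
# Hypothetical sketch of the master's in-memory metadata: namespace,
# file -> chunk handles, and chunk handle -> current replica locations.
master = {
    "namespace": {"/web/crawl-0": "file-17"},
    "file_chunks": {"file-17": ["handle-a1", "handle-a2"]},
    "chunk_locations": {                      # refreshed via heartbeats
        "handle-a1": ["cs3", "cs7", "cs9"],
        "handle-a2": ["cs1", "cs3", "cs8"],
    },
}

def replicas_for(path, chunk_index):
    # The only master interaction a client needs before talking
    # directly to chunkservers for the data itself.
    file_id = master["namespace"][path]
    handle = master["file_chunks"][file_id][chunk_index]
    return handle, master["chunk_locations"][handle]
```

After this single metadata lookup the client goes straight to the chunkservers, which is why the master stays off the data path.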


Page 44: Topic 11: Google Filesystem

Consistency Model: Master

All namespace mutations (such as file creation) are atomic, as they are exclusively handled by the master

Namespace locking guarantees atomicity and correctness

The operation log maintained by the master defines a global total order of these operations


Page 47: Topic 11: Google Filesystem

Consistency Model: Data

The state after a mutation depends on:

- Mutation type: write or append

- Whether it succeeds or fails

- Whether there are other concurrent mutations

A file region is consistent if all clients see the same data, regardless of the replica

A region is defined after a mutation if it is still consistent and clients see the mutation in its entirety


Page 52: Topic 11: Google Filesystem

Consistency Model: Data (2)

If there are no other concurrent writers, the region is defined and consistent

Concurrent and successful mutations leave the region undefined but consistent

- Mingled fragments from multiple mutations

A failed mutation makes the region both inconsistent and undefined
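The three cases above reduce to a small decision function over two inputs, which makes the state space easy to check:

```python
# Sketch: region state after a mutation as a function of whether it
# succeeded and whether other writers were concurrent, per the rules
# on this slide.
def region_state(success, concurrent):
    if not success:
        return ("inconsistent", "undefined")
    if concurrent:
        # Mingled fragments: every client sees the same bytes, but no
        # single mutation is visible in its entirety.
        return ("consistent", "undefined")
    return ("consistent", "defined")
```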


Page 56: Topic 11: Google Filesystem

Mutation Operations

Each chunk has many replicas

The primary replica holds a lease from the master

It decides the order of all mutations for all replicas

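The primary's role can be sketched as assigning serial numbers to incoming mutations; every replica applies them in that order, so all replicas converge on the same content. The class below is an illustration, not the real implementation:

```python
# Hypothetical sketch: the lease-holding primary replica imposes a
# single mutation order that all replicas follow.
class Primary:
    def __init__(self):
        self.next_serial = 0

    def order(self, mutation):
        serial = self.next_serial
        self.next_serial += 1
        return serial, mutation

p = Primary()
ordered = [p.order(m) for m in ["write-x", "append-y", "write-z"]]
```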


Page 59: Topic 11: Google Filesystem

Write Operation

The client obtains the location of the replicas and the identity of the primary replica from the master

It then pushes the data to all replica nodes

The client issues an update request to the primary

The primary forwards the write request to all replicas

It waits for a reply from all replicas before returning to the client
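The steps above can be sketched end to end with stub objects; everything here (`FakeMaster`, `Node`, the method names) is hypothetical scaffolding for illustration:

```python
# Sketch of the write path: lookup at the master, data pushed to every
# replica, then a commit request to the primary, which forwards it and
# waits for all replicas before acknowledging the client.
class Node:
    def __init__(self):
        self.data = None      # data pushed but not yet applied
        self.applied = []     # committed mutations, in order

    def buffer(self, d):
        self.data = d

    def commit(self, secondaries):
        self.applied.append(self.data)
        # The primary forwards the request and waits for every
        # secondary to apply it before returning.
        for s in secondaries:
            s.applied.append(s.data)
        return True

class FakeMaster:
    def __init__(self, nodes):
        self.nodes = nodes

    def replicas(self, handle):
        return self.nodes[0], self.nodes[1:]   # primary, secondaries

def write(client_data, master, handle):
    primary, secondaries = master.replicas(handle)   # step 1: lookup
    for node in [primary] + secondaries:             # step 2: push data
        node.buffer(client_data)
    return primary.commit(secondaries)               # steps 3-5

nodes = [Node(), Node(), Node()]
ok = write(b"record", FakeMaster(nodes), "handle-a1")
```

The key design point visible even in this sketch: data flow (the push) is decoupled from control flow (the commit through the primary).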


Page 64: Topic 11: Google Filesystem

Record Append Operation

Performed atomically

The append location is chosen by GFS and communicated to the client

The primary forwards the write request to all replicas

It waits for a reply from all replicas before returning to the client

1 If the record fits in the current chunk, it is written and communicated to the client

2 If it does not, the chunk is padded and the client is told to try the next chunk
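The fit-or-pad decision at the primary can be sketched directly; the tiny chunk size below is for illustration only:

```python
# Sketch of the record-append decision: write at an offset the primary
# chooses if the record fits, otherwise pad the chunk and make the
# client retry on the next chunk. CHUNK_SIZE is deliberately tiny here.
CHUNK_SIZE = 100

def record_append(chunk_used, record_len):
    if chunk_used + record_len <= CHUNK_SIZE:
        offset = chunk_used
        return ("written", offset, chunk_used + record_len)
    # Pad the rest of the chunk; the caller retries on a fresh chunk.
    return ("retry_next_chunk", None, CHUNK_SIZE)

status, offset, used = record_append(30, 40)   # fits in the chunk
status2, _, used2 = record_append(90, 40)      # does not fit -> pad
```

Padding wastes a little space but keeps each record wholly inside one chunk, which is what makes the append atomic.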


Page 71: Topic 11: Google Filesystem

Application Safeguards

Use record append rather than write

Insert checksums in record headers to detect fragments

Insert sequence numbers to detect duplicates

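These safeguards can be sketched as a reader that filters out padding/fragments via checksums and retry duplicates via sequence numbers. The record layout is hypothetical, chosen only to illustrate the idea:

```python
# Sketch of the application-level safeguards: each record carries a
# checksum (detects padding and fragments) and a sequence number
# (detects duplicates introduced by append retries).
import zlib

def pack(seq, payload):
    return (seq, zlib.crc32(payload), payload)

def valid_records(records):
    seen = set()
    out = []
    for seq, crc, payload in records:
        if zlib.crc32(payload) != crc:   # fragment or padding: skip
            continue
        if seq in seen:                  # duplicate from a retry: skip
            continue
        seen.add(seq)
        out.append(payload)
    return out

recs = [pack(1, b"a"), pack(1, b"a"), (2, 0, b"garbled"), pack(3, b"b")]
```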


Page 74: Topic 11: Google Filesystem

Chunk Placement

Put on chunkservers with below average disk space usage

Limit the number of “recent” creations on a chunkserver, to ensure that it does not experience a traffic spike due to its fresh data

For reliability, replicas are spread across racks
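The three heuristics combine into a simple placement function; the server records and thresholds below are hypothetical:

```python
# Sketch of chunk placement: prefer chunkservers with below-average
# disk usage, cap "recent" creations, and spread replicas across racks.
def place(servers, replicas=3, recent_cap=5):
    avg = sum(s["used"] for s in servers) / len(servers)
    candidates = [s for s in servers
                  if s["used"] <= avg and s["recent"] < recent_cap]
    candidates.sort(key=lambda s: s["used"])
    chosen, racks = [], set()
    for s in candidates:                 # cover distinct racks first
        if s["rack"] not in racks:
            chosen.append(s)
            racks.add(s["rack"])
        if len(chosen) == replicas:
            break
    for s in candidates:                 # then fill remaining slots
        if len(chosen) == replicas:
            break
        if s not in chosen:
            chosen.append(s)
    return [s["name"] for s in chosen]

servers = [
    {"name": "cs1", "rack": "r1", "used": 10, "recent": 0},
    {"name": "cs2", "rack": "r1", "used": 20, "recent": 0},
    {"name": "cs3", "rack": "r2", "used": 30, "recent": 0},
    {"name": "cs4", "rack": "r2", "used": 90, "recent": 0},
]
chosen = place(servers)
```

Here `cs4` is excluded for above-average usage, and the first two picks land on different racks so a rack failure cannot take out every replica.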


Page 77: Topic 11: Google Filesystem

Garbage Collection

Chunks become garbage when they are orphaned

A lazy reclamation strategy is used by not reclaiming chunks at deletetime

Each chunkserver communicates the subset of its current chunks tothe master in the heartbeat signal

Master pinpoints chunks which have been orphaned

The chunkserver finally reclaims that space

Zubair Nabi 11: Google Filesystem April 20, 2013 24 / 29
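A minimal sketch of lazy reclamation, with invented class and method names: deletion only drops the file-to-chunk mapping on the master; orphans are found later by comparing a heartbeat's chunk list against the live metadata.

```python
class Master:
    def __init__(self):
        self.files = {}  # filename -> list of chunk ids

    def create(self, name, chunks):
        self.files[name] = chunks

    def delete(self, name):
        # Lazy: remove the metadata only; no chunkserver is contacted.
        del self.files[name]

    def handle_heartbeat(self, reported_chunks):
        """Return the reported chunks no file references, i.e. orphans
        the chunkserver is free to reclaim."""
        live = {c for chunks in self.files.values() for c in chunks}
        return sorted(set(reported_chunks) - live)

master = Master()
master.create("a.log", ["c1", "c2"])
master.create("b.log", ["c3"])
master.delete("a.log")
print(master.handle_heartbeat(["c1", "c2", "c3"]))  # ['c1', 'c2']
```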

Page 82: Topic 11: Google Filesystem

Stale Replica Detection

Each chunk is assigned a version number

Each time a new lease is granted, the version number is incremented

Stale replicas will have outdated version numbers

They are simply garbage collected

Zubair Nabi 11: Google Filesystem April 20, 2013 25 / 29
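The version-number scheme can be sketched as follows (names are invented for illustration): the master bumps a chunk's version whenever it grants a new lease, so any replica still reporting an older version missed a mutation and is flagged as stale.

```python
class ChunkVersions:
    def __init__(self):
        self.version = {}  # chunk id -> current version on the master

    def grant_lease(self, chunk):
        # Granting a new lease increments the chunk's version number.
        self.version[chunk] = self.version.get(chunk, 0) + 1
        return self.version[chunk]

    def stale_replicas(self, chunk, replica_versions):
        """replica_versions: server id -> version that server reports."""
        current = self.version.get(chunk, 0)
        return sorted(s for s, v in replica_versions.items() if v < current)

vs = ChunkVersions()
vs.grant_lease("c1")  # version 1; all replicas apply the mutation
vs.grant_lease("c1")  # version 2; a server that was down stays at 1
print(vs.stale_replicas("c1", {"s1": 2, "s2": 2, "s3": 1}))  # ['s3']
```

Replicas returned here would simply be garbage collected, as the slide notes.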

Page 86: Topic 11: Google Filesystem

Outline

1 Introduction

2 Google Filesystem

3 Hadoop Distributed Filesystem

Zubair Nabi 11: Google Filesystem April 20, 2013 26 / 29

Page 87: Topic 11: Google Filesystem

Introduction

Open-source clone of GFS

Comes packaged with Hadoop

Master is called the NameNode and chunkservers are called DataNodes

Chunks are known as blocks

Exposes a Java API and a command-line interface

Zubair Nabi 11: Google Filesystem April 20, 2013 27 / 29

Page 92: Topic 11: Google Filesystem

Command-line API

Accessible through: bin/hdfs dfs -command args

Useful commands: cat, copyFromLocal, copyToLocal, cp, ls, mkdir, moveFromLocal, moveToLocal, mv, rm, etc.1

1http://hadoop.apache.org/docs/r1.0.4/file_system_shell.html

Zubair Nabi 11: Google Filesystem April 20, 2013 28 / 29

Page 94: Topic 11: Google Filesystem

References

1 Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. 2003. The Google file system. In Proceedings of the nineteenth ACM symposium on Operating systems principles (SOSP ’03). ACM, New York, NY, USA, 29-43.

Zubair Nabi 11: Google Filesystem April 20, 2013 29 / 29