topic 11: google filesystem



DESCRIPTION

Cloud Computing Workshop 2013, ITU

TRANSCRIPT

Page 1: Topic 11: Google Filesystem

11: Google Filesystem

Zubair Nabi

[email protected]

April 20, 2013

Zubair Nabi 11: Google Filesystem April 20, 2013 1 / 29

Page 2: Topic 11: Google Filesystem

Outline

1 Introduction

2 Google Filesystem

3 Hadoop Distributed Filesystem


Page 4: Topic 11: Google Filesystem

Filesystem

The purpose of a filesystem is to:

1 Organize and store data

2 Support sharing of data among users and applications

3 Ensure persistence of data after a reboot

Examples include FAT, NTFS, ext3, ext4, etc.


Page 8: Topic 11: Google Filesystem

Distributed filesystem

Self-explanatory: the filesystem is distributed across many machines

The DFS provides a common abstraction to the dispersed files

Each DFS has an associated API that provides a service to clients: normal file operations, such as create, read, write, etc.

Maintains a namespace which maps logical names to physical names

- Simplifies replication and migration

Examples include the Network Filesystem (NFS), the Andrew Filesystem (AFS), etc.
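The logical-to-physical namespace mapping can be sketched as follows. This is a minimal illustration, not any real DFS's API; the class and field names are hypothetical:

```python
# Hypothetical sketch of a DFS namespace: logical pathnames map to
# physical locations, so a file can be replicated or migrated simply
# by updating the mapping, while clients keep using the logical name.
class Namespace:
    def __init__(self):
        self.table = {}  # logical name -> list of (host, local_path)

    def register(self, logical, replicas):
        self.table[logical] = list(replicas)

    def resolve(self, logical):
        return self.table[logical]

    def migrate(self, logical, old, new):
        # Move one replica; the logical name is untouched.
        locs = self.table[logical]
        locs[locs.index(old)] = new

ns = Namespace()
ns.register("/data/log", [("hostA", "/disk1/f1"), ("hostB", "/disk2/f1")])
ns.migrate("/data/log", ("hostA", "/disk1/f1"), ("hostC", "/disk3/f1"))
```

Because clients only ever see `/data/log`, the migration above is invisible to them, which is exactly why the indirection simplifies replication and migration.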


Page 15: Topic 11: Google Filesystem

Introduction

Designed by Google to meet its massive storage needs

Shares many goals with previous distributed filesystems, such as performance, scalability, reliability, and availability

At the same time, its design is driven by key observations of Google's workload and infrastructure, both current and future


Page 18: Topic 11: Google Filesystem

Design Goals

1 Failure is the norm rather than the exception: GFS must constantly introspect and automatically recover from failure

2 The system stores a fair number of large files: Optimize for large files, on the order of GBs, but still support small files

3 Applications prefer to do large streaming reads of contiguous regions: Optimize for this case


Page 21: Topic 11: Google Filesystem

Design Goals (2)

4 Most applications perform large, sequential writes that are mostly append operations: Support small writes but do not optimize for them

5 Most operations are producer-consumer queues or many-way merging: Support concurrent reads or writes by hundreds of clients simultaneously

6 Applications process data in bulk at a high rate: Favour throughput over latency


Page 24: Topic 11: Google Filesystem

Interface

The interface is similar to traditional filesystems, but there is no support for a standard POSIX-like API

Files are organized hierarchically into directories with pathnames

Support for create, delete, open, close, read, and write operations


Page 27: Topic 11: Google Filesystem

Architecture

Consists of a single master and multiple chunkservers

The system can be accessed by multiple clients

Both the master and chunkservers run as user-space server processes on commodity Linux machines


Page 30: Topic 11: Google Filesystem

Files

Files are sliced into fixed-size chunks

Each chunk is identifiable by an immutable and globally unique 64-bit handle

Chunks are stored by chunkservers as local Linux files

Reads and writes to a chunk are specified by a handle and a byte range

Each chunk is replicated on multiple chunkservers

- 3 by default
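The chunk abstraction means a client can turn a file byte offset into a chunk index with simple arithmetic before asking the master for the corresponding handle. A minimal sketch, assuming the 64 MB chunk size from the GFS paper (the slides do not state the size):

```python
# Sketch: translate a file byte offset into (chunk index, offset
# within chunk), as a client library would do before requesting the
# chunk's handle from the master. 64 MB is the chunk size reported in
# the GFS paper; it is an assumption here, not stated on the slides.
CHUNK_SIZE = 64 * 1024 * 1024

def locate(offset):
    return offset // CHUNK_SIZE, offset % CHUNK_SIZE

def chunks_for_range(start, length):
    # A read of [start, start + length) may span several chunks.
    first, _ = locate(start)
    last, _ = locate(start + length - 1)
    return list(range(first, last + 1))
```

A read that straddles a chunk boundary simply becomes requests for two consecutive chunk indices.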


Page 36: Topic 11: Google Filesystem

Master

In charge of all filesystem metadata

- Namespace, access control information, mapping between files and chunks, and current locations of chunks

- Holds this information in memory and regularly syncs it with a log file

Also in charge of chunk leasing, garbage collection, and chunk migration

Periodically sends each chunkserver a heartbeat signal to check its state and send it instructions

Clients interact with it to access metadata, but all data-bearing communication goes directly to the relevant chunkservers

- As a result, the master does not become a performance bottleneck
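The three kinds of in-memory metadata can be sketched as plain tables; all identifiers below are hypothetical. Note that only the namespace and file-to-chunk mappings are persisted to the log, while chunk locations are rebuilt from chunkserver heartbeats:

```python
# Hypothetical sketch of the master's in-memory metadata: namespace,
# file -> chunk handles, and chunk handle -> current replica locations.
master = {
    "namespace": {"/web/crawl-0": "file-17"},
    "file_chunks": {"file-17": ["handle-a1", "handle-a2"]},
    "chunk_locations": {                      # refreshed via heartbeats
        "handle-a1": ["cs3", "cs7", "cs9"],
        "handle-a2": ["cs1", "cs3", "cs8"],
    },
}

def replicas_for(path, chunk_index):
    # The only master interaction a client needs before talking
    # directly to chunkservers for the data itself.
    file_id = master["namespace"][path]
    handle = master["file_chunks"][file_id][chunk_index]
    return handle, master["chunk_locations"][handle]
```

After this single metadata lookup the client goes straight to the chunkservers, which is why the master stays off the data path.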


Page 44: Topic 11: Google Filesystem

Consistency Model: Master

All namespace mutations (such as file creation) are atomic, as they are exclusively handled by the master

Namespace locking guarantees atomicity and correctness

The operation log maintained by the master defines a global total order of these operations


Page 47: Topic 11: Google Filesystem

Consistency Model: Data

The state after a mutation depends on:

- Mutation type: write or append

- Whether it succeeds or fails

- Whether there are other concurrent mutations

A file region is consistent if all clients see the same data, regardless of the replica

A region is defined after a mutation if it is still consistent and clients see the mutation in its entirety


Page 52: Topic 11: Google Filesystem

Consistency Model: Data (2)

If there are no other concurrent writers, the region is defined and consistent

Concurrent and successful mutations leave the region undefined but consistent

- Mingled fragments from multiple mutations

A failed mutation makes the region both inconsistent and undefined
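The three cases above reduce to a small decision function over two inputs, which makes the state space easy to check:

```python
# Sketch: region state after a mutation as a function of whether it
# succeeded and whether other writers were concurrent, per the rules
# on this slide.
def region_state(success, concurrent):
    if not success:
        return ("inconsistent", "undefined")
    if concurrent:
        # Mingled fragments: every client sees the same bytes, but no
        # single mutation is visible in its entirety.
        return ("consistent", "undefined")
    return ("consistent", "defined")
```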


Page 56: Topic 11: Google Filesystem

Mutation Operations

Each chunk has many replicas

The primary replica holds a lease from the master

It decides the order of all mutations for all replicas

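The primary's role can be sketched as assigning serial numbers to incoming mutations; every replica applies them in that order, so all replicas converge on the same content. The class below is an illustration, not the real implementation:

```python
# Hypothetical sketch: the lease-holding primary replica imposes a
# single mutation order that all replicas follow.
class Primary:
    def __init__(self):
        self.next_serial = 0

    def order(self, mutation):
        serial = self.next_serial
        self.next_serial += 1
        return serial, mutation

p = Primary()
ordered = [p.order(m) for m in ["write-x", "append-y", "write-z"]]
```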


Page 59: Topic 11: Google Filesystem

Write Operation

The client obtains the location of the replicas and the identity of the primary replica from the master

It then pushes the data to all replica nodes

The client issues an update request to the primary

The primary forwards the write request to all replicas

It waits for a reply from all replicas before returning to the client
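The steps above can be sketched end to end with stub objects; everything here (`FakeMaster`, `Node`, the method names) is hypothetical scaffolding for illustration:

```python
# Sketch of the write path: lookup at the master, data pushed to every
# replica, then a commit request to the primary, which forwards it and
# waits for all replicas before acknowledging the client.
class Node:
    def __init__(self):
        self.data = None      # data pushed but not yet applied
        self.applied = []     # committed mutations, in order

    def buffer(self, d):
        self.data = d

    def commit(self, secondaries):
        self.applied.append(self.data)
        # The primary forwards the request and waits for every
        # secondary to apply it before returning.
        for s in secondaries:
            s.applied.append(s.data)
        return True

class FakeMaster:
    def __init__(self, nodes):
        self.nodes = nodes

    def replicas(self, handle):
        return self.nodes[0], self.nodes[1:]   # primary, secondaries

def write(client_data, master, handle):
    primary, secondaries = master.replicas(handle)   # step 1: lookup
    for node in [primary] + secondaries:             # step 2: push data
        node.buffer(client_data)
    return primary.commit(secondaries)               # steps 3-5

nodes = [Node(), Node(), Node()]
ok = write(b"record", FakeMaster(nodes), "handle-a1")
```

The key design point visible even in this sketch: data flow (the push) is decoupled from control flow (the commit through the primary).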


Page 64: Topic 11: Google Filesystem

Record Append Operation

Performed atomically

The append location is chosen by GFS and communicated to the client

The primary forwards the write request to all replicas

It waits for a reply from all replicas before returning to the client

1 If the record fits in the current chunk, it is written and communicated to the client

2 If it does not, the chunk is padded and the client is told to try the next chunk
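The fit-or-pad decision at the primary can be sketched directly; the tiny chunk size below is for illustration only:

```python
# Sketch of the record-append decision: write at an offset the primary
# chooses if the record fits, otherwise pad the chunk and make the
# client retry on the next chunk. CHUNK_SIZE is deliberately tiny here.
CHUNK_SIZE = 100

def record_append(chunk_used, record_len):
    if chunk_used + record_len <= CHUNK_SIZE:
        offset = chunk_used
        return ("written", offset, chunk_used + record_len)
    # Pad the rest of the chunk; the caller retries on a fresh chunk.
    return ("retry_next_chunk", None, CHUNK_SIZE)

status, offset, used = record_append(30, 40)   # fits in the chunk
status2, _, used2 = record_append(90, 40)      # does not fit -> pad
```

Padding wastes a little space but keeps each record wholly inside one chunk, which is what makes the append atomic.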


Page 71: Topic 11: Google Filesystem

Application Safeguards

Use record append rather than write

Insert checksums in record headers to detect fragments

Insert sequence numbers to detect duplicates

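These safeguards can be sketched as a reader that filters out padding/fragments via checksums and retry duplicates via sequence numbers. The record layout is hypothetical, chosen only to illustrate the idea:

```python
# Sketch of the application-level safeguards: each record carries a
# checksum (detects padding and fragments) and a sequence number
# (detects duplicates introduced by append retries).
import zlib

def pack(seq, payload):
    return (seq, zlib.crc32(payload), payload)

def valid_records(records):
    seen = set()
    out = []
    for seq, crc, payload in records:
        if zlib.crc32(payload) != crc:   # fragment or padding: skip
            continue
        if seq in seen:                  # duplicate from a retry: skip
            continue
        seen.add(seq)
        out.append(payload)
    return out

recs = [pack(1, b"a"), pack(1, b"a"), (2, 0, b"garbled"), pack(3, b"b")]
```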


Page 74: Topic 11: Google Filesystem

Chunk Placement

Put on chunkservers with below average disk space usage

Limit the number of “recent” creations on a chunkserver, to ensure that it does not experience a traffic spike due to its fresh data

For reliability, replicas are spread across racks
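The three heuristics combine into a simple placement function; the server records and thresholds below are hypothetical:

```python
# Sketch of chunk placement: prefer chunkservers with below-average
# disk usage, cap "recent" creations, and spread replicas across racks.
def place(servers, replicas=3, recent_cap=5):
    avg = sum(s["used"] for s in servers) / len(servers)
    candidates = [s for s in servers
                  if s["used"] <= avg and s["recent"] < recent_cap]
    candidates.sort(key=lambda s: s["used"])
    chosen, racks = [], set()
    for s in candidates:                 # cover distinct racks first
        if s["rack"] not in racks:
            chosen.append(s)
            racks.add(s["rack"])
        if len(chosen) == replicas:
            break
    for s in candidates:                 # then fill remaining slots
        if len(chosen) == replicas:
            break
        if s not in chosen:
            chosen.append(s)
    return [s["name"] for s in chosen]

servers = [
    {"name": "cs1", "rack": "r1", "used": 10, "recent": 0},
    {"name": "cs2", "rack": "r1", "used": 20, "recent": 0},
    {"name": "cs3", "rack": "r2", "used": 30, "recent": 0},
    {"name": "cs4", "rack": "r2", "used": 90, "recent": 0},
]
chosen = place(servers)
```

Here `cs4` is excluded for above-average usage, and the first two picks land on different racks so a rack failure cannot take out every replica.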


Page 77: Topic 11: Google Filesystem

Garbage Collection

Chunks become garbage when they are orphaned

A lazy reclamation strategy is used by not reclaiming chunks at deletetime

Each chunkserver communicates the subset of its current chunks tothe master in the heartbeat signal

Master pinpoints chunks which have been orphaned

The chunkserver finally reclaims that space

Zubair Nabi 11: Google Filesystem April 20, 2013 24 / 29
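A minimal sketch of lazy reclamation, with invented class and method names: deletion only drops the file-to-chunk mapping on the master; orphans are found later by comparing a heartbeat's chunk list against the live metadata.

```python
class Master:
    def __init__(self):
        self.files = {}  # filename -> list of chunk ids

    def create(self, name, chunks):
        self.files[name] = chunks

    def delete(self, name):
        # Lazy: remove the metadata only; no chunkserver is contacted.
        del self.files[name]

    def handle_heartbeat(self, reported_chunks):
        """Return the reported chunks no file references, i.e. orphans
        the chunkserver is free to reclaim."""
        live = {c for chunks in self.files.values() for c in chunks}
        return sorted(set(reported_chunks) - live)

master = Master()
master.create("a.log", ["c1", "c2"])
master.create("b.log", ["c3"])
master.delete("a.log")
print(master.handle_heartbeat(["c1", "c2", "c3"]))  # ['c1', 'c2']
```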

Page 82: Topic 11: Google Filesystem

Stale Replica Detection

Each chunk is assigned a version number

Each time a new lease is granted, the version number is incremented

Stale replicas will have outdated version numbers

They are simply garbage collected

Zubair Nabi 11: Google Filesystem April 20, 2013 25 / 29
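The version-number scheme can be sketched as follows (names are invented for illustration): the master bumps a chunk's version whenever it grants a new lease, so any replica still reporting an older version missed a mutation and is flagged as stale.

```python
class ChunkVersions:
    def __init__(self):
        self.version = {}  # chunk id -> current version on the master

    def grant_lease(self, chunk):
        # Granting a new lease increments the chunk's version number.
        self.version[chunk] = self.version.get(chunk, 0) + 1
        return self.version[chunk]

    def stale_replicas(self, chunk, replica_versions):
        """replica_versions: server id -> version that server reports."""
        current = self.version.get(chunk, 0)
        return sorted(s for s, v in replica_versions.items() if v < current)

vs = ChunkVersions()
vs.grant_lease("c1")  # version 1; all replicas apply the mutation
vs.grant_lease("c1")  # version 2; a server that was down stays at 1
print(vs.stale_replicas("c1", {"s1": 2, "s2": 2, "s3": 1}))  # ['s3']
```

Replicas returned here would simply be garbage collected, as the slide notes.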

Page 86: Topic 11: Google Filesystem

Outline

1 Introduction

2 Google Filesystem

3 Hadoop Distributed Filesystem

Zubair Nabi 11: Google Filesystem April 20, 2013 26 / 29

Page 87: Topic 11: Google Filesystem

Introduction

Open-source clone of GFS

Comes packaged with Hadoop

Master is called the NameNode and chunkservers are called DataNodes

Chunks are known as blocks

Exposes a Java API and a command-line interface

Zubair Nabi 11: Google Filesystem April 20, 2013 27 / 29

Page 92: Topic 11: Google Filesystem

Command-line API

Accessible through: bin/hdfs dfs -command args

Useful commands: cat, copyFromLocal, copyToLocal, cp, ls, mkdir, moveFromLocal, moveToLocal, mv, rm, etc.1

1http://hadoop.apache.org/docs/r1.0.4/file_system_shell.html

Zubair Nabi 11: Google Filesystem April 20, 2013 28 / 29

Page 94: Topic 11: Google Filesystem

References

1 Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. 2003. The Google file system. In Proceedings of the nineteenth ACM symposium on Operating systems principles (SOSP ’03). ACM, New York, NY, USA, 29-43.

Zubair Nabi 11: Google Filesystem April 20, 2013 29 / 29