
AN IMPLEMENTATION OF A LOG-STRUCTURED FILE SYSTEM FOR UNIX

Margo Seltzer, Harvard U.
Keith Bostic, U. C. Berkeley
Marshall Kirk McKusick, U. C. Berkeley
Carl Staelin, HP Labs

Overview

• Paper presents a redesign and implementation of the Sprite LFS
• BSD-LFS is
  – Faster than the conventional UNIX FFS (the “fast” file system of the early 80’s)
  – Not as fast as an enhanced version of FFS with read and write clustering

Historical Perspective

• Early UNIX FS used small block sizes and did not try to optimize block placement
• The UNIX FFS
  – Increased block sizes
  – Added cylinder groups
  – Incorporated rotational disk positioning to reduce delays when accessing sequential blocks

Limitations of FFS (I)

• Synchronous file deletion and creation
  – Makes file system recoverable after a crash
  – Same result can be achieved through NVRAM hardware or logging software

Limitations of FFS (II)

• Seek times between I/O requests for different files
  – Has most impact on performance when the vast majority of files are small
  – FFS does not address the problem

Log-Structured File Systems

• Attempt to address both limitations of FFS
• Store all data in a single, continuous log
• Optimized for
  – All writes
  – Reading files written in their entirety over a short period of time
  – Accessing files that were created or modified at the same time

General Organization

• Disk is partitioned into segments
  – Writes are always sequential within a segment
• Segment cleaner maintains a pool of empty (“clean”) segments through disk compaction
  – “Live” data existing in a set of segments is regrouped into a smaller subset of segments

Overview

Task                                        FFS               LFS
Allocate disk address                       Block creation    Segment write
Allocate i-node                             Fixed location    Appended to log
Map an i-node number into a disk address    Static address    Lookup in i-node map
Maintain free space                         Bitmap            Cleaner + segment usage table

LFS Data Structures

• Superblock:
  – Same function as the one used by FFS
• I-node map:
  – Maps i-node numbers into disk addresses
• Segment usage table:
  – Shows number of live bytes in a segment and last modification time
• Checkpoints:
  – Created every time the system does a sync()
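
The relationships among these structures can be rendered as a short C sketch; the field names and layout below are illustrative assumptions, not the actual BSD-LFS declarations.

    /* Illustrative C rendering of the LFS data structures named above;
     * field names and types are assumptions, not the BSD-LFS headers. */
    #include <stdint.h>
    #include <time.h>

    typedef uint32_t daddr32_t;          /* disk address (illustrative type) */

    /* I-node map entry: where an i-node currently lives on disk; unused
     * entries are chained into the free i-node list. */
    struct imap_entry {
        daddr32_t if_daddr;              /* current disk address of the i-node */
        uint32_t  if_nextfree;           /* next free i-node number, if unused */
    };

    /* Segment usage table entry: what the cleaner consults. */
    struct segusage {
        uint32_t  su_nbytes;             /* live bytes remaining in the segment */
        time_t    su_lastmod;            /* last modification time */
    };

    /* Checkpoint, written every time the system does a sync(). */
    struct checkpoint {
        daddr32_t cp_imap_daddr;         /* where to find the i-node map */
        daddr32_t cp_segusage_daddr;     /* where to find the segment usage table */
        uint64_t  cp_serial;             /* recovery picks the newer checkpoint copy */
    };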

Limitations of Sprite LFS

• Recovery does not verify the consistency of the file system directory structure
• LFS consumes “excessive amounts” of main memory [by 1993 standards]
• Write requests are successful even if there is insufficient disk space
• Segment validation is hardware dependent
• All file systems use a single cleaner and a single cleaning policy
• No measure of the cleaner overhead

Recovery (I)

• Two major aspects
  – Bringing the file system to a physically consistent state
  – Verifying the logical structure of the file system
• FFS achieves both goals through fsck
  – Rebuilds the whole file system
  – Verifies the directory structure and all block pointers

Recovery (II)

• Sprite LFS uses a two-step recovery process:
  – First initializes all the file structures from the most recent checkpoint
  – Then “rolls forward” to incorporate all subsequent modifications
    • Done by reading each segment written after the last checkpoint, in time order
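
A toy, self-contained model of the roll-forward step (hypothetical data and logic, not the Sprite LFS code): segments written after the checkpoint are replayed in write order until one fails validation.

    #include <stdio.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define NSEGS 8

    struct segment {
        uint64_t serial;   /* write order; 0 means never written */
        bool     valid;    /* did the segment summary verify? */
    };

    int main(void)
    {
        /* Checkpoint taken after serial 3; serials 4 and 5 were written
         * later, serial 6 was torn by the crash and fails validation. */
        uint64_t checkpoint_serial = 3;
        struct segment log[NSEGS] = {
            {1, true}, {2, true}, {3, true}, {4, true}, {5, true}, {6, false},
        };

        /* Step 1 would reload the i-node map and segment usage table from
         * the checkpoint; the loop below is step 2, the roll forward,
         * assuming the array is already in time order. */
        for (int i = 0; i < NSEGS; i++) {
            if (log[i].serial <= checkpoint_serial)
                continue;             /* already reflected in the checkpoint */
            if (!log[i].valid)
                break;                /* torn write: end of the usable log */
            printf("replaying segment %d (serial %llu)\n",
                   i, (unsigned long long)log[i].serial);
        }
        return 0;
    }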

Recovery (III)

• Standard LFS recovery does not verify the directory structure
  – Weakness to be addressed in BSD-LFS

Memory Consumption

• Sprite LFS reserves “large amounts” of main memory, including four half-megabyte segments and many buffers
• BSD-LFS:
  – Does not use special staging buffers
  – Does not reserve two read-only segments that can be reclaimed without any I/O
  – Implements the cleaner as a user-level process

Block Accounting

• Sprite LFS maintained a count of disk blocks available for physical writing
  – Blocks written to the cache but not yet written to disk do not affect that count
• What if a block is “successfully” written to the cache but the disk becomes full before the blocks are actually written?
• BSD-LFS keeps a separate count of disk blocks not yet committed to any dirty block in the cache
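
A small model of the two-counter idea (a simplified sketch, not the BSD-LFS code): one counter tracks blocks physically free on disk, a second tracks blocks already promised to dirty cache blocks, so a write can be refused with ENOSPC before the cache accepts data that could never reach the disk.

    #include <errno.h>
    #include <stdio.h>

    /* Simplified free-space accounting: 'avail' counts blocks physically
     * free on disk; 'uncommitted' counts blocks already promised to dirty
     * cache blocks that have not reached the disk yet. */
    static long avail = 100;
    static long uncommitted = 0;

    /* Accept a write into the cache only if the block can eventually be
     * written to disk; otherwise report the error at write() time. */
    int reserve_block(void)
    {
        if (avail - uncommitted <= 0)
            return -ENOSPC;
        uncommitted++;
        return 0;
    }

    /* Called when the segment writer actually pushes the block to disk. */
    void commit_block(void)
    {
        uncommitted--;
        avail--;
    }

    int main(void)
    {
        for (int i = 0; i < 105; i++)
            if (reserve_block() != 0) {
                printf("write %d refused: out of disk space\n", i);
                break;
            }
        return 0;
    }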

Segment Structure (I)

• Sprite LFS places segment summary blocks at the end of the segment
  – The write containing the segment summary validates the whole segment
• Makes two incorrect assumptions
  1. The controller will not reorder write requests
  2. The disk will always write the contents of a buffer in the order presented

Segment Structure (II)

• BSD-LFS does not make these assumptions
  – Segment blocks can be written in any order
  – The segment summary is in front of each partial segment and contains a checksum of four bytes of every block in the partial segment
• Partial segments constitute the atomic recovery units of BSD-LFS
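
A hedged sketch of the per-block checksum idea (the layout and checksum below are illustrative, not the actual BSD-LFS format): the summary samples four bytes of every block in the partial segment, so recovery can detect a partial segment whose blocks did not all reach the disk, regardless of write order.

    #include <stdint.h>
    #include <string.h>
    #include <stdbool.h>

    #define BLKSZ   4096
    #define MAXBLKS 64      /* callers are assumed to pass nblocks <= MAXBLKS */

    /* Simplified partial-segment summary: four bytes sampled from each data
     * block, folded into a single checksum (the on-disk format differs). */
    struct partial_summary {
        uint32_t ps_nblocks;
        uint32_t ps_checksum;
    };

    static uint32_t fold(const uint32_t *samples, uint32_t n)
    {
        uint32_t sum = 0;
        for (uint32_t i = 0; i < n; i++)
            sum ^= samples[i];        /* toy checksum; real code can do better */
        return sum;
    }

    /* Build the summary while assembling the partial segment in memory. */
    void summarize(struct partial_summary *ps,
                   uint8_t blocks[][BLKSZ], uint32_t nblocks)
    {
        uint32_t samples[MAXBLKS];
        for (uint32_t i = 0; i < nblocks; i++)
            memcpy(&samples[i], blocks[i], sizeof(uint32_t));  /* first 4 bytes */
        ps->ps_nblocks = nblocks;
        ps->ps_checksum = fold(samples, nblocks);
    }

    /* At recovery, recompute the checksum from what is on disk: if any block
     * of the partial segment never made it out, the checksum will not match. */
    bool validate(const struct partial_summary *ps,
                  uint8_t blocks[][BLKSZ], uint32_t nblocks)
    {
        uint32_t samples[MAXBLKS];
        for (uint32_t i = 0; i < nblocks; i++)
            memcpy(&samples[i], blocks[i], sizeof(uint32_t));
        return nblocks == ps->ps_nblocks &&
               fold(samples, nblocks) == ps->ps_checksum;
    }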

File System Verification

• BSD-LFS offers two recovery strategies
  – Quick roll forward from the last checkpoint
  – Complete consistency check of the file system
    • Recovers lost or corrupted data
    • Same functionality as FFS fsck()
    • Takes a long time to run
    • Can be run in the background

The Cleaner

• BSD-LFS makes it possible to implement the cleaner as a user process
  – Allows for multiple cleaning policies
  – Makes it easier to experiment with new policies
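
Because the cleaner is an ordinary user process reading the segment usage table, a cleaning policy can be as simple as the greedy sketch below (structures and policy are illustrative; a production policy would also weigh segment age).

    #include <stdint.h>
    #include <time.h>
    #include <stdio.h>

    /* One segment usage table entry, as the user-level cleaner might see it. */
    struct segusage {
        uint32_t su_nbytes;     /* live bytes still referenced in this segment */
        time_t   su_lastmod;    /* last modification time */
    };

    /* Greedy policy: clean the segment with the fewest live bytes first;
     * the fewer live bytes, the less data must be copied to make it clean. */
    int pick_segment_to_clean(const struct segusage *tab, int nsegs)
    {
        int best = -1;
        for (int i = 0; i < nsegs; i++)
            if (best < 0 || tab[i].su_nbytes < tab[best].su_nbytes)
                best = i;
        return best;
    }

    int main(void)
    {
        struct segusage tab[] = { {40000, 0}, {1200, 0}, {512000, 0}, {0, 0} };
        printf("clean segment %d first\n",
               pick_segment_to_clean(tab, 4));   /* segment 3: already empty */
        return 0;
    }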

Implementation Issues

• BSD-LFS uses on-disk data structures that are nearly identical to those used by FFS
  – Existing performance tools can continue to function with only minor modification
  – Makes the system easier to implement and maintain
• Two types of operations
  – vfs operations affect the whole file system
  – vnode operations affect individual files

More Implementation Issues

• BSD-LFS does not implement block fragments
  – Less needed because block sizes could be smaller
  – Still want large blocks to keep the metadata-to-data ratio low
  – BSD-LFS should (but does not yet) allocate progressively larger blocks

The Buffer Cache (I)

• Had to modify the FFS buffer cache
  – Cannot assume that cache blocks can be flushed one at a time
    • Would destroy any performance advantage of LFS
  – LFS may need extra memory to write modified metadata and partial segment summary blocks

The Buffer Cache (II)

  – Cache blocks do not have a disk address until they are written to the disk
    • Violates the assumption that all blocks have disk addresses
    • Cannot use a disk address to access indirect blocks
• BSD-LFS incorporates metadata block numbering (negative values)
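
A hedged sketch of the negative-numbering idea (the constants and exact mapping are illustrative, not the actual BSD-LFS encoding): data blocks keep their ordinary non-negative logical numbers, while each indirect block gets a stable negative logical number derived from the range of data blocks it covers, so the buffer cache can index it before it has any disk address.

    #include <stdio.h>

    #define NDADDR  12      /* direct block pointers in the i-node (illustrative) */
    #define NINDIR  1024    /* block pointers per indirect block (illustrative) */

    /* Return the (negative) logical number used to name the single-indirect
     * block covering logical data block 'lbn', or 0 if 'lbn' is a direct
     * block.  The exact encoding in BSD-LFS differs; the point is only that
     * metadata blocks get stable negative names so the buffer cache can
     * index them before they have been assigned a disk address. */
    long indirect_lbn(long lbn)
    {
        if (lbn < NDADDR)
            return 0;                              /* covered by the i-node itself */
        return -((lbn - NDADDR) / NINDIR + 1);     /* -1, -2, ... per indirect block */
    }

    int main(void)
    {
        printf("data block 5    -> indirect %ld\n", indirect_lbn(5));     /* 0 */
        printf("data block 100  -> indirect %ld\n", indirect_lbn(100));   /* -1 */
        printf("data block 5000 -> indirect %ld\n", indirect_lbn(5000));  /* -5 */
        return 0;
    }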

The IFILE

• Sprite LFS maintained the i-node map and segment usage table as kernel data structures written to disk at checkpoint time
• BSD-LFS places both data structures in a read-only file visible in the file system
  – Allows an unlimited number of i-nodes
  – Cleaner can be migrated into user space
• The i-node map also contains a list of free i-nodes
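
Since the ifile is an ordinary read-only file, a user-level cleaner can read it with normal file I/O; the sketch below assumes a hypothetical mount point, file name, and header layout rather than the real BSD-LFS ifile format.

    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>

    /* Hypothetical ifile layout: a small header, then the segment usage
     * table, then the i-node map.  The real BSD-LFS layout differs. */
    struct ifile_header {
        uint32_t nsegs;          /* number of segment usage entries */
        uint32_t ninodes;        /* number of i-node map entries */
        uint32_t free_head;      /* head of the free i-node list */
    };

    int main(void)
    {
        /* Hypothetical path; BSD-LFS exposes the ifile somewhere inside
         * the mounted file system. */
        FILE *f = fopen("/mnt/lfs/ifile", "rb");
        if (f == NULL) {
            perror("open ifile");
            return EXIT_FAILURE;
        }

        struct ifile_header hdr;
        if (fread(&hdr, sizeof hdr, 1, f) != 1) {
            perror("read ifile header");
            fclose(f);
            return EXIT_FAILURE;
        }
        printf("%u segments, %u i-nodes, free list starts at %u\n",
               hdr.nsegs, hdr.ninodes, hdr.free_head);
        fclose(f);
        return 0;
    }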

Directory Operations (I)

• BSD-LFS does not retain the synchronous behavior of directory operations (create, link, mkdir, …)
• Sprite LFS maintains the ordering of directory operations by keeping a directory operation log inside the file system log
  – Before any directory updates are written to disk, it writes a log entry describing that operation
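
A hedged sketch of what a directory-operation log record might carry (field names and format are illustrative, not the Sprite LFS on-disk record): enough information to redo or undo the operation if only part of it reached the disk.

    #include <stdint.h>

    /* Illustrative directory-operation log record, written to the log before
     * the directory block and i-nodes it describes. */
    enum dirop { DIROP_CREATE, DIROP_LINK, DIROP_UNLINK, DIROP_RENAME, DIROP_MKDIR };

    struct dirop_log_entry {
        uint32_t  op;            /* which operation (enum dirop) */
        uint32_t  dir_ino;       /* i-node number of the directory being changed */
        uint32_t  file_ino;      /* i-node number of the file being linked/unlinked */
        uint32_t  link_count;    /* file's link count after the operation */
        char      name[256];     /* component name within the directory */
    };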

Directory Operations (II)

• BSD-LFS has a unit of atomicity: the partial segment
• It does not have a mechanism that guarantees that all i-nodes involved in a directory operation will fit into a single partial segment
• BSD-LFS allows operations to span partial segments

Directory Operations (III)

• Introduces a new recovery restriction
  – Cannot roll forward a partial segment that has an unfinished directory operation if the partial segment that completes the directory operation did not make it to disk (segment batching)

COMPARISON

• BSD-LFS was found to perform
  – Better than the 4BSD FFS in a variety of benchmarks
  – Not significantly worse than FFS in any test
• EFS, a version of FFS with read and write clustering, was found to provide comparable and sometimes superior performance to BSD-LFS

EFS

• Extended version of FFS
• Provides extent-based file system behavior
• Parameter maxcontig specifies how many logically sequential disk blocks should be allocated contiguously
  – A large maxcontig is the same as track allocation
• EFS accumulates sequential dirty buffers in the cache before writing them as a cluster (see the sketch below)
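
A toy model of the clustering idea behind maxcontig (a simplified sketch, not the EFS code): dirty buffers for logically sequential blocks are held back until maxcontig of them accumulate or the run breaks, then issued as one large write.

    #include <stdio.h>

    #define MAXCONTIG 8     /* illustrative; EFS takes this as a tunable parameter */

    /* Flush 'count' buffered blocks starting at 'start' as one large I/O. */
    static void cluster_write(long start, int count)
    {
        printf("write %d block(s) starting at block %ld\n", count, start);
    }

    int main(void)
    {
        /* Logical block numbers being dirtied by an application, in order. */
        long dirtied[] = {10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 40, 41};
        int n = sizeof dirtied / sizeof dirtied[0];

        long run_start = dirtied[0];
        int run_len = 1;
        for (int i = 1; i < n; i++) {
            /* Extend the run while blocks stay sequential and under maxcontig. */
            if (dirtied[i] == run_start + run_len && run_len < MAXCONTIG) {
                run_len++;
                continue;
            }
            cluster_write(run_start, run_len);   /* run broken or full: flush it */
            run_start = dirtied[i];
            run_len = 1;
        }
        cluster_write(run_start, run_len);       /* flush the final run */
        return 0;
    }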

Multi-user Andrew Benchmark

• We measure execution times
• LFS performs well in phases 1 and 2 (mostly writes) and poorly in phase 5 (random I/O)

CONCLUSIONS

• An LFS operates best when it can write out many dirty buffers at once
  – Requires more buffer space in main memory
• The delayed allocation of BSD-LFS complicates the accounting of available free space
  – Issue was not correctly handled by Sprite LFS
• The cleaner might sometimes consume more disk space than it frees
  – Must reserve additional disk space