M. Rosenblum and J.K. Ousterhout, "The design and implementation of a log-structured file system", Proceedings of the 13th ACM Symposium on Operating Systems Principles, December 1991

Lanfranco Muzi, PSU – May 26th, 2005


Page 1:

M. Rosenblum and J.K. Ousterhout

The design and implementation of a log-structured file system

Proceedings of the 13th ACM Symposium on Operating Systems Principles, December 1991

Lanfranco Muzi

PSU – May 26th, 2005

Page 2:

Presentation outline

•Motivation

•Basic functioning of Sprite LFS

•Design issues and choices

•Performance of Sprite LFS

•Conclusions

Page 3:

Motivation: “new” facts…

•CPU speeds were dramatically increasing (1991 - and continued to do so…)

•Memories become cheaper and larger

•Disks have larger capacities, but performance does not keep up with other components: access time is dominated by seek and rotational latency (mechanical issues)

Consequences…

Page 4:

Motivation…consequences

•Applications become more disk-bound

•Size of cache increases

•Most read requests hit in cache

•All writes must eventually go to disk (safety)

•Higher write traffic fraction

• But a file system optimized for reads pays a high cost during writes (to achieve logical locality)

Page 5:

Problems with conventional file systems - 1

Information is “spread around” on disk.
E.g. creating a new file in FFS requires 5 disk I/Os:

• 2 for the file i-node
• 1 for the file data
• 2 for the directory i-node and data

Seeking takes much longer than writing the data in the case of small files, which are the focus of this study

Page 6:

Problems with conventional file systems - 2

Tendency to write synchronously

E.g. Unix FFS:
• Data blocks written asynchronously…
• …but metadata (i-nodes, directories) written synchronously

•Synchronous writes slave application performance (and CPU usage) to the disk
•Again, seeking for metadata updates dominates write performance for small files

Page 7:

The Sprite LFS

•Write asynchronously: buffer a series of writes in memory
•Periodically copy the buffer to disk:
  - in a single write
  - on a single contiguous segment (data blocks, attributes, directories…)
  - rewriting instead of updating in place

•All info on disk is in a single sequential structure: the log
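The buffering scheme above can be sketched in a few lines of Python (a hypothetical illustration, not Sprite code; the segment size and block names are made up):

```python
# Writes accumulate in a memory buffer and are flushed to disk as one
# contiguous segment appended to the log, never updating in place.
SEGMENT_SIZE = 4  # blocks per segment (tiny, for illustration)

log = []     # the on-disk log: a list of flushed segments
buffer = []  # the in-memory write buffer

def write_block(block):
    buffer.append(block)
    if len(buffer) == SEGMENT_SIZE:   # segment full: one sequential write
        log.append(list(buffer))
        buffer.clear()

for b in ["inode0", "data0", "dir0", "data1", "inode1"]:
    write_block(b)
# One full segment has been flushed; "inode1" is still buffered.
```

Note that data blocks, i-nodes, and directory data all land in the same segment, which is what turns many small scattered writes into one large sequential one.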

Page 8:

Sprite LFS – main issues

1) How to retrieve information from the log?

2) How to make large extents of free space always available for writing contiguous segments?

Page 9:

File location and reading

Basic data structures analogous to Unix FFS:

One inode per file: attributes, addresses of the first 10 blocks or indirect blocks

…But inodes are in the log, i.e. NOT at fixed locations on disk…

New data structure: the inode map
• Located in the log (no free list)
• A fixed checkpoint region on disk holds the addresses of all map blocks
• Indexed by file identifying number, gives the location of the corresponding inode
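The resulting lookup chain (checkpoint region → inode map → inode → data blocks) can be sketched as follows; the dicts stand in for on-disk structures and all names and addresses are illustrative:

```python
# Fixed disk locations of the inode map blocks, found via the checkpoint region.
checkpoint_region = {"inode_map_blocks": [1040, 2210]}

inode_map = {7: 5120}  # file number -> disk address of the inode, in the log
log = {
    5120: {"blocks": [5121, 5122]},  # inode: addresses of the data blocks
    5121: b"hello ",
    5122: b"world",
}

def read_file(file_no: int) -> bytes:
    """Follow the inode map to the inode, then to the data blocks."""
    inode_addr = inode_map[file_no]
    inode = log[inode_addr]
    return b"".join(log[addr] for addr in inode["blocks"])
```

Once the inode is found, reading proceeds exactly as in FFS; only the extra inode-map indirection is new.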

Page 10:

Checkpoint regions

•Contain the addresses of all blocks in the inode map and segment usage table, the current time, and a pointer to the last segment written

•Two (for safety)

•Located at fixed positions on disk

•Used for crash recovery
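A minimal sketch of why two checkpoint regions suffice for recovery (hypothetical structures; field names are made up): recovery reads both fixed regions and trusts the one with the newer timestamp, so a crash mid-update of one region is harmless.

```python
# Two checkpoint regions at fixed disk locations; alternate updates mean
# at least one is always complete and consistent.
checkpoints = [
    {"time": 100, "inode_map_blocks": [64, 65], "last_segment": 12},
    {"time": 250, "inode_map_blocks": [64, 66], "last_segment": 17},
]

def recover(checkpoints):
    """Pick the newest checkpoint; roll-forward from its last segment
    can then replay log writes made after the checkpoint was taken."""
    return max(checkpoints, key=lambda c: c["time"])

state = recover(checkpoints)
```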

Page 11:

Free space management - I

GOAL: keep large free extents to write new data

Disk divided into fixed-length segments (512 kB or 1 MB)

A segment is always written sequentially from beginning to end

Segments are written in succession until the end of the disk space (older segments get fragmented meanwhile)

…and then?

Page 12:

Free space management - II

Segment cleaning – copying live data out of a segment

• Read a number of segments into memory
• Identify the live data
• Write the live data only, back to a smaller number of clean segments
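The copy-and-compact step can be sketched like this (an illustrative toy, not the Sprite cleaner; the 4-block segment size and liveness test are made up):

```python
def clean(segments, is_live, blocks_per_segment=4):
    """Read segments into memory, keep only the live blocks, and
    repack them into as few clean segments as possible."""
    live = [b for seg in segments for b in seg if is_live(b)]
    return [live[i:i + blocks_per_segment]
            for i in range(0, len(live), blocks_per_segment)]

# Three fragmented segments become two (the last only partially filled),
# and the three originals can then be marked free.
dirty = [["a", "dead1", "b"], ["dead2", "c"], ["d", "e", "dead3"]]
compacted = clean(dirty, is_live=lambda b: not b.startswith("dead"))
```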

Page 13:

Free space management - example

[Diagram] Cleaner thread: copy the segments to be cleaned (behind the old log end) into the writing memory buffer; a free segment awaits the output.

Page 14:

Free space management - example

[Diagram] Cleaner thread: identify the live blocks in the writing memory buffer (the log has meanwhile advanced from the old to the new log end).

Page 15:

Free space management - example

[Diagram] Cleaner thread: queue the compacted live data in the writing memory buffer for writing.

Page 16:

Free space management - example

[Diagram] Writer thread: write the compacted and new data to segments, then mark the old segments as free.

Page 17:

Free space management - implementation

Segment summary block – identifies each piece of information in the segment

E.g.: for a file, each data block is identified by version number + inode number (= unique identifier, UID) and block number

The version number is incremented in the inode map when the file is deleted

If the UID of a block does not equal that in the inode map when scanned, the block is discarded
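The version check can be sketched as follows (illustrative structures; the real summary also records block numbers, which are elided here):

```python
# inode map side: file number -> current version number.
# Deleting a file bumps its version, instantly invalidating old blocks.
inode_map_versions = {7: 2}

def block_is_live(summary_entry, inode_map_versions):
    """A block is live only if the (version, inode number) recorded in
    the segment summary still matches the inode map."""
    file_no = summary_entry["inode"]
    return inode_map_versions.get(file_no) == summary_entry["version"]

stale = {"inode": 7, "version": 1, "block": 0}  # written before a delete
fresh = {"inode": 7, "version": 2, "block": 0}
```

The point of the scheme is that deletion requires no writes to the dead blocks themselves: the cleaner discards them lazily, whenever it next scans their segment.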

Page 18:

Free space management – cleaning policies

1) Which segments to clean?

2) How should live blocks be grouped when they are written out?

Page 19:

Free space management – cleaning policies

Cleaning policies can be compared in terms of the write cost:

Write cost = (total bytes read and written) / (new data written)
           = (N + N·u + N·(1−u)) / (N·(1−u)) = 2 / (1−u)

N = number of segments read
u = fraction of live data in the read segments (0 ≤ u < 1)

•Average amount of time the disk is busy per byte of new data written (seek and rotational latency are negligible in LFS)
•Note: includes cleaning overhead
•Note the dependence on u
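The formula can be checked numerically (a hypothetical sketch, not code from the paper):

```python
def write_cost(u: float) -> float:
    """Write cost when cleaning segments whose live fraction is u.

    Per segment cleaned, the disk reads 1 segment, writes back u of
    live data, and writes 1-u of new data in the space freed:
        (1 + u + (1 - u)) / (1 - u) = 2 / (1 - u)
    (The formula pessimistically assumes the segment is read even
    when it is nearly empty.)
    """
    assert 0 <= u < 1, "u is the live fraction of the read segments"
    return 2 / (1 - u)

# Cost grows sharply with u: an empty segment costs 2, a 75%-live one 8.
print(write_cost(0.0))   # 2.0
print(write_cost(0.75))  # 8.0
```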

Page 20:

Free space mgmt – cleaning policies

•Note: an underutilized disk gives low write cost, but high storage cost!
•…But u is defined only for the read segments (not the disk overall)
•Achieve a bimodal distribution: keep most segments nearly full, but a few nearly empty (and have the cleaner work on these)

Low u = low write cost

Page 21:

How to achieve a bimodal distribution?

•First attempt: the cleaner always chooses the lowest-u segments and sorts by age before writing – FAILURE!
•Free space in “cold” (i.e. more stable) segments is more “valuable” (it will last longer)
•Assumption: the stability of a segment is proportional to the age of its youngest block (i.e. older = colder)
•Replace the greedy policy with a cost-benefit criterion:

benefit / cost = (free space generated × age) / cost = (1−u)·age / (1+u)

•Clean segments with the higher ratio
•Still group by age before rewriting
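The selection criterion can be sketched numerically (an illustrative sketch; the age values are arbitrary units):

```python
def cost_benefit(u: float, age: float) -> float:
    """Cost-benefit ratio for cleaning a segment.

    Cleaning reads the whole segment (cost 1) and writes back its live
    fraction u (cost u); it frees 1-u of a segment, weighted by the age
    of the segment's youngest block (older = colder = more valuable).
    """
    return (1 - u) * age / (1 + u)

# A cold, fairly full segment can outrank a hot, much emptier one,
# which is exactly what the greedy lowest-u policy gets wrong:
cold = cost_benefit(u=0.75, age=100.0)  # cold data, 75% live
hot = cost_benefit(u=0.15, age=1.0)     # hot data, 15% live
```

This matches the thresholds reported on the next slide: under cost-benefit, cold segments are cleaned at about u = 75% while hot ones are left until u drops to about 15%.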

Page 22:

Cost-benefit - Results

•Left: bimodal distribution achieved – cold segments cleaned at u=75%, hot at u=15%
•Right: cost-benefit performs better than greedy, especially at disk utilization > 60%

Page 23:

Support for Cost-benefit

•Segment usage table: records the number of live bytes and the most recent modification time (used by the cleaner to choose segments: u and age)

•Segment summary: record age of youngest block (used by writer to sort live blocks)

Page 24:

Performance – micro-benchmarks I

•SunOS is based on Unix FFS
•NB: best case for Sprite LFS: no cleaning overhead
•Sprite keeps the disk 17% busy (85% for SunOS) and the CPU saturated: performance will improve with CPU speed (right)

Small file performance

Page 25:

Performance – micro-benchmarks II

•Traditional FS: logical locality – pay an additional cost during writes to organize the disk layout, assuming read patterns
•LFS: temporal locality – group information created at the same time – not optimal for reading randomly written files

Large file performance

Single 100MB file

Page 26:

Performance – cleaning overhead

•This experiment: statistics over several months of real usage

•Previous results did not include cleaning

•Write cost ranged from 1.2 to 1.6 – more than half of the cleaned segments were empty

•Cleaning overhead limits write performance: about 70% of bandwidth for writing

•Improvement: cleaning could be performed at night or during idle periods

Page 27:

Conclusions

• Prototype log-structured file system implemented and tested

• Due to cleaning overhead, segment cleaning policies are crucial - tested in simulations before implementation

• Results in tests (without cleaning overhead)

Higher performance than FFS in writes, for both small and large files
Comparable read performance (except one case)

• Results in real usage (with cleaning)

Simulation results confirmed
70% of bandwidth can be used for writing

Page 28:

References

• M. Rosenblum and J.K. Ousterhout, “The design and implementation of a log-structured file system”, Proceedings of the 13th ACM Symposium on Operating Systems Principles, December 1991

• M.K. McKusick, W.N. Joy, S.J. Leffler, and R.S. Fabry, “A fast file system for UNIX”, ACM Transactions on Computer Systems, 2(3), August 1984, pp. 181–197

• A.S. Tanenbaum, “Modern Operating Systems”, 2nd ed. (Ch. 4, “File systems”), Prentice Hall