State of the Art: Where we are with the Ext3 filesystem

Mingming Cao, Theodore Y. Ts’o, Badari Pulavarty, Suparna Bhattacharya
IBM Linux Technology Center

{cmm, theotso, pbadari}@us.ibm.com, [email protected]

Andreas Dilger, Alex Tomas
Cluster Filesystem Inc.

[email protected], [email protected]

Abstract

The ext2 and ext3 filesystems on Linux® are used by a very large number of users. This is due to their reputation for dependability, robustness, and backwards and forwards compatibility, rather than for being the state of the art in filesystem technology. Over the last few years, however, there has been a significant amount of development effort towards making ext3 an outstanding filesystem, while retaining these crucial advantages. In this paper, we discuss those features that have been accepted in the mainline Linux 2.6 kernel, including directory indexing, block reservation, and online resizing. We also discuss those features that have been implemented but are yet to be incorporated into the mainline kernel: extent maps, delayed allocation, and multiple block allocation. We then examine the performance improvements from the Linux 2.4 ext3 filesystem to the Linux 2.6 ext3 filesystem using industry-standard benchmarks. Finally, we touch upon some potential future work which is still under discussion by the ext2/3 developers.

1 Introduction

Although the ext2 filesystem [4] was not the first filesystem used by Linux, and while other filesystems have attempted to lay claim to being the native Linux filesystem (for example, when Frank Xia attempted to rename xiafs to linuxfs), most would nevertheless consider the ext2/3 filesystem as most deserving of this distinction. Why is this? Why have so many system administrators and users put their trust in the ext2/3 filesystem?

There are many possible explanations, including the fact that the filesystem has a large and diverse developer community. However, in our opinion, robustness (even in the face of hardware-induced corruption) and backwards compatibility are among the most important reasons why the ext2/3 filesystem has a large and loyal user community. Many filesystems have the unfortunate attribute of being fragile. That is, the corruption of a single, unlucky block can be magnified to cause a loss of far larger amounts of data than might be expected. A fundamental design principle of the ext2/3 filesystem is to avoid fragile data structures by limiting the damage that could be caused by the loss of a single critical block.

This has sometimes led to the ext2/3 filesystem’s reputation of being a little boring, and perhaps not the fastest or the most scalable filesystem on the block, but one of the most dependable. Part of this reputation can be attributed to the extremely conservative design of the ext2 filesystem [4], which had been extended to add journaling support in 1998, but which otherwise had very few other modern filesystem features. Despite its age, ext3 is actually growing in popularity among enterprise users and vendors because of its robustness, good recoverability, and expansion characteristics. The fact that e2fsck is able to recover from very severe data corruption scenarios is also very important to ext3’s success.

However, in the last few years, the ext2/3 development community has been working hard to demolish the first part of this common wisdom. The initial outline of plans to “modernize” the ext2/3 filesystem was documented in a 2002 Freenix paper [15]. Three years later, it is time to revisit those plans, see what has been accomplished, what still remains to be done, and what further extensions are now under consideration by the ext2/3 development community.

This paper is organized into the following sections. First, in Section 2, we describe those features which have already been implemented and integrated into the mainline kernel. Second, in Section 3 and Section 4, we discuss those features which have been implemented, but which have not yet been integrated into mainline. Next, in Section 5, we examine the performance improvements of the ext3 filesystem during the last few years. Finally, we discuss some potential future work in Section 6.

2 Features found in Linux 2.6

The past three years have seen many discussions of ext2/3 development. Some of the planned features [15] have been implemented and integrated into the mainline kernel during these three years, including directory indexing, reservation based block allocation, online resizing, extended attributes, large inode support, and extended attributes in the large inode. In this section, we will give an overview of the design and the implementation of each feature.

2.1 Directory indexing

Historically, ext2/3 directories have used a simple linked list, much like the BSD Fast Filesystem. While it might be expected that the O(n) lookup times would be a significant performance issue, the Linux VFS-level directory cache mitigated the O(n) lookup times for many common workloads. However, ext2’s linear directory structure did cause significant performance problems for certain applications, such as web caches and mail systems using the Maildir format.

To address this problem, various ext2 developers, including Daniel Phillips, Theodore Ts’o, and Stephen Tweedie, discussed using a B-tree data structure for directories. However, standard B-trees had numerous characteristics that were at odds with the ext2 design philosophy of simplicity and robustness. For example, XFS’s B-tree implementation was larger than all of ext2 or ext3’s source files combined. In addition, users of other filesystems using B-trees had reported significantly increased potential for data loss caused by the corruption of a high-level node in the filesystem’s B-tree.

To address these concerns, we designed a radically simplified tree structure that was specifically optimized for filesystem directories [10].

This is in contrast to the approach used by many other filesystems, including JFS, Reiserfs, XFS, and HFS, which use a general-purpose B-tree. Ext2’s scheme, which we dubbed “HTree,” uses 32-bit hashes for keys, where each hash key references a range of entries stored in a leaf block. Since internal nodes are only 8 bytes, HTrees have a very high fanout factor (over 500 blocks can be referenced using a 4 KB index block), so two levels of index nodes are sufficient to support over 16 million 52-character filenames. To further simplify the implementation, HTrees are constant depth (either one or two levels). The combination of the high fanout factor and the use of a hash of the filename (plus a filesystem-specific secret) to serve as the search key for the HTree avoids the need for the implementation to do balancing operations.
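As a concrete illustration of the 8-byte internal node entries, the following is a minimal sketch in C of an HTree index entry, patterned after the kernel’s dx_entry structure; treat the exact field names and layout as illustrative rather than authoritative.

#include <linux/types.h>

/*
 * Sketch of an HTree internal index entry (8 bytes).  Each entry maps
 * a 32-bit filename hash to the directory-relative block containing
 * entries whose hashes are greater than or equal to this value.
 */
struct dx_entry {
	__le32 hash;	/* hash of the filename plus the per-filesystem secret */
	__le32 block;	/* logical block number within the directory file */
};

At 8 bytes per entry, a 4 KB index block holds roughly 500 entries after subtracting header space, which is the fanout factor quoted above.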

We maintain forwards compatibility in old kernels by clearing the EXT3_INDEX_FL whenever we modify a directory entry. In order to preserve backwards compatibility, leaf blocks in HTree are identical to old-style linear directory blocks, and index blocks are prefixed with an 8-byte data structure that makes them appear to non-HTree kernels as deleted directory entries. An additional advantage of this extremely aggressive attention towards backwards compatibility is that HTree directories are extremely robust. If any of the index nodes are corrupted, the kernel or the filesystem consistency checker can find all of the directory entries using the traditional linear directory data structures.

Daniel Phillips created an initial implementation for the Linux 2.4 kernel, and Theodore Ts’o significantly cleaned up the implementation and merged it into the mainline kernel during the Linux 2.5 development cycle, as well as implementing e2fsck support for the HTree data structures. This feature was extremely well received, since for very large directories, performance improvements were often better by a factor of 50–100 or more.

While the HTree algorithm significantly improved lookup times, it could cause some performance regressions for workloads that used readdir() to perform some operation on all of the files in a large directory. This is caused by readdir() returning filenames in a hash-sorted order, so that reads from the inode table would be done in a random order. This performance regression can be easily fixed by modifying applications to sort the directory entries returned by readdir() by inode number. Alternatively, an LD_PRELOAD library can be used, which intercepts calls to readdir() and returns the directory entries in sorted order.
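As an illustration of the application-level workaround, the following hedged sketch sorts a directory’s entries by inode number using scandir() with a custom comparator, so that a subsequent stat() pass walks the inode table in roughly sequential order. This is an example pattern, not code from the ext3 tree.

#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>

/* Order directory entries by inode number. */
static int by_inode(const struct dirent **a, const struct dirent **b)
{
	if ((*a)->d_ino < (*b)->d_ino)
		return -1;
	return (*a)->d_ino > (*b)->d_ino;
}

int process_dir(const char *path)
{
	struct dirent **names;
	int i, n = scandir(path, &names, NULL, by_inode);

	if (n < 0)
		return -1;
	for (i = 0; i < n; i++) {
		/* A stat() here would touch inodes in roughly table order. */
		puts(names[i]->d_name);
		free(names[i]);
	}
	free(names);
	return 0;
}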

One potential solution to mitigate this performance issue, which has been suggested by Daniel Phillips and Andreas Dilger, but not yet implemented, involves the kernel choosing free inodes whose inode numbers meet a property that groups the inodes by their filename hash. Daniel and Andreas suggest allocating the inode from a range of inodes based on the size of the directory, and then choosing a free inode from that range based on the filename hash. This should in theory reduce the amount of thrashing that results when accessing the inodes referenced in the directory in readdir order. It is not clear that this strategy will result in a speedup, however; in fact it could increase the total number of inode blocks that might have to be referenced, and thus make the performance of readdir() + stat() workloads worse. Clearly, some experimentation and further analysis is still needed.

2.2 Improving ext3 scalability

The scalability improvements in the block layer and other portions of the kernel during 2.5 development uncovered a scaling problem for ext3/JBD under parallel I/O load. To address this issue, Alex Tomas and Andrew Morton worked to remove a per-filesystem superblock lock (lock_super()) from ext3 block allocations [13].

This was done by deferring the filesystem’s accounting of the number of free inodes and blocks, only updating these counts when they are needed by the statfs() or umount() system calls. This lazy update strategy was enabled by keeping authoritative counters of the free inodes and blocks at the per-block-group level, and enabled the replacement of the filesystem-wide lock_super() with fine-grained locks. Since a spin lock for every block group would consume too much memory, a hashed spin lock array was used to protect accesses to the block group summary information. In addition, the need to use these spin locks was reduced further by using atomic bit operations to modify the bitmaps, thus allowing concurrent allocations within the same group.
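The hashed lock array can be sketched as follows; this is a minimal illustration in the spirit of the kernel’s block group locking code, with sizes and names of our own choosing.

#include <linux/spinlock.h>

#define BG_LOCK_SLOTS 128	/* illustrative: a small power of two */

struct bg_lock_table {
	spinlock_t slots[BG_LOCK_SLOTS];	/* initialized with spin_lock_init() */
};

/* Map a block group number to one of the shared spin locks. */
static inline spinlock_t *bg_lock(struct bg_lock_table *t, unsigned int group)
{
	return &t->slots[group & (BG_LOCK_SLOTS - 1)];
}

Updating a block group’s summary information then becomes spin_lock(bg_lock(&table, group)), update, spin_unlock(...), with many groups hashing onto a bounded number of locks.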

After addressing the scalability problems in the ext3 code proper, the focus moved to the journal (JBD) routines, which made extensive use of the big kernel lock (BKL). Alex Tomas and Andrew Morton worked together to reorganize the locking of the journaling layer in order to allow as much concurrency as possible, by using a fine-grained locking scheme instead of using the BKL and the per-filesystem journal lock. This fine-grained locking scheme uses a new per-bufferhead lock (BH_JournalHead), a new per-transaction lock (t_handle_lock), and several new per-journal locks (j_state_lock, j_list_lock, and j_revoke_lock, the last protecting the list of revoked blocks). The locking hierarchy (to prevent deadlocks) for these new locks is documented in the include/linux/jbd.h header file.

The final scalability change that was needed was to remove the use of sleep_on() (which is only safe when called from within code running under the BKL) and replace it with the new wait_event() facility.

These combined efforts served to improve multiple-writer performance on ext3 noticeably: ext3 throughput improved by a factor of 10 on the SDET benchmark, and context switches dropped significantly [2, 13].

2.3 Reservation based block allocator

Since disk latency is the key factor that affects filesystem performance, modern filesystems always attempt to lay out files on a filesystem contiguously. This is to reduce disk head movement as much as possible. However, if the filesystem allocates blocks on demand, then when two files located in the same directory are being written simultaneously, the block allocations for the two files may end up getting interleaved. To address this problem, some filesystems use the technique of preallocation, by anticipating which files will likely need to allocate blocks and allocating them in advance.

2.3.1 Preallocation background

In the ext2 filesystem, preallocation is performed on the actual disk bitmap. When a new disk data block is allocated, the filesystem internally preallocates a few disk data blocks adjacent to the block just allocated. To avoid filling up filesystem space with preallocated blocks too quickly, each inode is allowed at most seven preallocated blocks at a time. Unfortunately, this scheme had to be disabled when journaling was added to ext3, since it is incompatible with journaling. If the system were to crash before the unused preallocated blocks could be reclaimed, then during system recovery, the ext3 journal would replay the block bitmap update change. At that point the inode’s block mapping could end up being inconsistent with the disk block bitmap. Due to the lack of a full forced fsck for ext3 to return the preallocated blocks to the free list, preallocation was disabled when the ext3 filesystem was integrated into the 2.4 Linux kernel.

Disabling preallocation means that if multiple processes attempt to allocate blocks to two files in the same directory, the blocks will be interleaved. This was a known disadvantage of ext3, but this shortcoming becomes even more important with extents (see Section 3.1) since extents are far more efficient when the file on disk is contiguous. Andrew Morton, Mingming Cao, Theodore Ts’o, and Badari Pulavarty explored various possible ways to add preallocation to ext3, including the method that had been used for preallocation in the ext2 filesystem. The method that was finally settled upon was a reservation-based design.

2.3.2 Reservation design overview

The core idea of the reservation based allocator is that for every inode that needs blocks, the allocator reserves a range of blocks for that inode, called a reservation window. Blocks for that inode are allocated from that range, instead of from the whole filesystem, and no other inode is allowed to allocate blocks in the reservation window. This reduces the amount of fragmentation when multiple files are written in the same directory simultaneously. The key difference between reservation and preallocation is that the blocks are only reserved in memory, rather than on disk. Thus, if the system crashes while there are reserved blocks, there is no inconsistency in the block group bitmaps.

The first time an inode needs a new block, a block allocation structure, which describes the reservation window information and other block allocation related information, is allocated and linked to the inode. The block allocator searches for a region of blocks that fulfills three criteria. First, the region must be near the ideal “goal” block, based on ext2/3’s existing block placement algorithms. Secondly, the region must not overlap with any other inode’s reservation windows. Finally, the region must have at least one free block. As an inode keeps growing, free blocks inside its reservation window will eventually be exhausted. At that point, a new window will be created for that inode, preferably right after the old one, guided by the “goal” block.

All of the reservation windows are indexed via a per-filesystem red-black tree so the block allocator can quickly determine whether a particular block or region is already reserved by a particular inode. All operations on that tree are protected by a per-filesystem global spin lock.

Initially, the default reservation window size for an inode is set to eight blocks. If the reservation allocator detects the inode’s block allocation pattern to be sequential, it dynamically increases the window size for that inode. An application that knows the file size ahead of the file creation can employ an ioctl command to set the window size to be equal to the anticipated file size in order to attempt to reserve the blocks immediately.
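For example, an application about to write a file of known size could request a matching window immediately after creating the file. The sketch below assumes the EXT3_IOC_SETRSVSZ ioctl exposed by the reservation patches; check your kernel headers for the exact name and argument type.

#include <sys/ioctl.h>
#include <linux/ext3_fs.h>	/* assumed to define EXT3_IOC_SETRSVSZ */

/* Hint that fd will grow to roughly 'blocks' filesystem blocks. */
static int set_rsv_window(int fd, int blocks)
{
	return ioctl(fd, EXT3_IOC_SETRSVSZ, &blocks);
}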

Mingming Cao implemented this reservation based block allocator, with help from Stephen Tweedie in converting the per-filesystem reservation tree from a sorted linked list to a red-black tree. In Linux kernel versions 2.6.10 and later, the default block allocator for ext3 has been replaced by this reservation based block allocator. Some benchmarks, such as tiobench and dbench, have shown significant improvements on sequential writes and subsequent sequential reads with this reservation-based block allocator, especially when a large number of processes are allocating blocks concurrently.

2.3.3 Future work

Currently, the reservation window only lasts until the last process writing to that file closes it. At that time, the reservation window is released and those blocks are available for reservation or allocation by any other inode. This is necessary so that the blocks that were reserved can be released for use by other files, and to avoid fragmentation of the free space in the filesystem.

However, some files, such as log files and UNIX® mailbox files, have a slow growth pattern. That is, they grow slowly over time, by processes appending a small amount of data, and then closing the file, over and over again. For these files, in order to avoid fragmentation, it is necessary that the reservation window be preserved even after the file has been closed.

The question is how to determine which files should be allowed to retain their reservation window after the last close. One possible solution is to tag the files or directories with an attribute indicating that they contain files that have a slow growth pattern. Another possibility is to implement heuristics that allow the filesystem to automatically determine which files seem to have a slow growth pattern, and automatically preserve the reservation window after the file is closed.

If reservation windows can be preserved in this fashion, it will be important to also implement a way for preserved reservation windows to be reclaimed when the filesystem is fully reserved. This prevents an inode that fails to find a new reservation from falling back to no-reservation mode too soon.

2.4 Online resizing

The online resizing feature was originally developed by Andreas Dilger in July of 1999 for the 2.0.36 kernel. The availability of a Logical Volume Manager (LVM) motivated the desire for online resizing, so that when a logical volume was dynamically resized, the filesystem could take advantage of the new space. This ability to dynamically resize volumes and filesystems is very useful in server environments, where taking downtime for unmounting a filesystem is not desirable. After missing the code freeze for the 2.4 kernel, the ext2online code was finally included in the 2.6.10 kernel and e2fsprogs 1.36 with the assistance of Stephen Tweedie and Theodore Ts’o.

2.4.1 The online resizing mechanism

The online resizing mechanism, despite its seemingly complex task, is actually rather simple in its implementation. In order to avoid a large amount of complexity, it is only possible to increase the size of a filesystem while it is mounted. This addresses the primary requirement that a filesystem that is (nearly) full can have space added to it without interrupting the use of that system. The online resizing code depends on the underlying block device to handle all aspects of its own resizing prior to the start of filesystem resizing, and does nothing itself to manipulate the partition tables of LVM/MD block devices.

The ext2/3 filesystem is divided into one or more block allocation groups of a fixed size, with possibly a partial block group at the end of the filesystem [4]. The layout of each block group (where the inode and block allocation bitmaps and the inode table are stored) is kept in the group descriptor table. This table is stored at the start of the first block group, and consists of one or more filesystem blocks, depending on the size of the filesystem. Backup copies of the group descriptor table are kept in more groups if the filesystem is large enough.

There are three primary phases by which a filesystem is grown. The first, and simplest, is to expand the last partial block group (if any) to be a full block group. The second phase is to add a new block group to an existing block in the group descriptor table. The third phase is to add a new block to the group descriptor table and add a new group to that block. All filesystem resizes are done incrementally, going through one or more of the phases to add free space to the end of the filesystem until the desired size is reached.

2.4.2 Resizing within a group

For the first phase of growth, the online resizing code starts by briefly locking the superblock and increasing the total number of filesystem blocks to the end of the last group. All of the blocks beyond the end of the filesystem are already marked as “in use” by the block bitmap for that group, so they must be cleared. This is accomplished by the same mechanism that is used when deleting a file, ext3_free_blocks(), and can be done without locking the whole filesystem. The online resizer simply pretends that it is deleting a file that had allocated all of the blocks at the end of the filesystem, and ext3_free_blocks() handles all of the bitmap and free block count updates properly.

2.4.3 Adding a new group

For the second phase of growth, the online resizer initializes the next group beyond the end of the filesystem. This is easily done because this area is currently unused and unknown to the filesystem itself. The block bitmap for that group is initialized as empty, the superblock and group descriptor backups (if any) are copied from the primary versions, and the inode bitmap and inode table are initialized. Once this has completed successfully, the online resizing code briefly locks the superblock to increase the total and free block and inode counts for the filesystem, add a new group to the end of the group descriptor table, and increase the total number of groups in the filesystem by one. Once this is completed, the backup superblock and group descriptors are updated in case of corruption of the primary copies. If there is a problem at this stage, the next e2fsck will also update the backups.

The second phase of growth will be repeated until the filesystem has fully grown, or the last group descriptor block is full. If a partial group is being added at the end of the filesystem, the blocks are marked as “in use” before the group is added. Both the first and second phases of growth can be done on any ext3 filesystem with a supported kernel and suitable block device.

2.4.4 Adding a group descriptor block

The third phase of growth is needed periodically to grow a filesystem over group descriptor block boundaries (at multiples of 16 GB for filesystems with a 4 KB blocksize). When the last group descriptor block is full, a new block must be added to the end of the table. However, because the table is contiguous at the start of the first group and is normally followed immediately by the block and inode bitmaps and the inode table, the online resize code needs a bit of assistance while the filesystem is unmounted (offline) in order to maintain compatibility with older kernels. Either at mke2fs time, or for existing filesystems with the assistance of the ext2prepare command, a small number of blocks at the end of the group descriptor table are reserved for online growth. The total amount of reserved blocks is a tiny fraction of the total filesystem size, requiring only a few tens to hundreds of kilobytes to grow the filesystem 1024-fold.

For the third phase, the resizer first gets the next reserved group descriptor block and initializes a new group and group descriptor beyond the end of the filesystem, as is done in the second phase of growth. Once this is successful, the superblock is locked while reallocating the array that indexes all of the group descriptor blocks to add another entry for the new block. Finally, the superblock totals are updated, the number of groups is increased by one, and the backup superblock and group descriptors are updated.

The online resizing code takes advantage of the journaling features in ext3 to ensure that there is no risk of filesystem corruption if the resize is unexpectedly interrupted. The ext3 journal ensures strict ordering and atomicity of filesystem changes in the event of a crash: either the entire resize phase is committed or none of it is. Because the journal has no rollback mechanism (except by crashing), the resize code is careful to verify all possible failure conditions prior to modifying any part of the filesystem. This ensures that the filesystem remains valid, though slightly smaller, in the event of an error during growth.

2.4.5 Future work

Future development work in this area involves removing the need to do offline filesystem manipulation to reserve blocks before doing third phase growth. The use of Meta Block Groups [15] allows new groups to be added to the filesystem without the need to allocate contiguous blocks for the group descriptor table. Instead, the group descriptor block is kept in the first group that it describes, and a backup is kept in the second and last group for that block. The Meta Block Group support was first introduced in the 2.4.25 kernel (Feb. 2004), so it is reasonable to expect that a majority of existing systems will be able to mount a filesystem that starts using this feature when it is introduced.

A more complete description of the online growth is available in [6].

2.5 Extended attributes

2.5.1 Extended attributes overview

Many new operating system features (such as access control lists, mandatory access controls, Posix Capabilities, and hierarchical storage management) require filesystems to be able to associate a small amount of custom metadata with files or directories. In order to implement support for access control lists, Andreas Gruenbacher added support for extended attributes to the ext2 filesystem [7].

Extended attributes as implemented by Andreas Gruenbacher are stored in a single EA block. Since a large number of files will often use the same access control list (as inherited from the directory’s default ACL), as an optimization the EA block may be shared by inodes that have identical extended attributes.

While the extended attribute implementation was originally optimized for storing ACLs, the primary users of extended attributes to date have been the NSA’s SELinux system, Samba 4 for storing extended attributes from Windows clients, and the Lustre filesystem.

In order to store larger EAs than a single filesystem block, work is underway to store large EAs in another EA inode referenced from the original inode. This allows many arbitrary-sized EAs to be attached to a single file, within the limitations of the EA interface and what can be done inside a single journal transaction. These EAs could also be accessed as additional file forks/streams, if such an API were added to the Linux kernel.

2.5.2 Large inode support and EA-in-inode

Alex Tomas and Andreas Dilger implemented support for storing the extended attribute in an expanded ext2 inode, in preference to using a separate filesystem block. In order to do this, the filesystem must be created using an inode size larger than the default 128 bytes. Inode sizes must be a power of two and must be no larger than the filesystem block size, so for a filesystem with a 4 KB blocksize, inode sizes of 256, 512, 1024, 2048, or 4096 bytes are valid. The 2-byte field starting at offset 128 (i_extra_size) of each inode specifies the starting offset for the portion of the inode that can be used for storing EAs. Since the starting offset must be a multiple of 4, and we have not extended the fixed portion of the inode beyond i_extra_size, currently i_extra_size is 4 for all filesystems with expanded inodes. Currently, all of the inode past the initial 132 bytes can be used for storing EAs. If the user attempts to store more EAs than can fit in the expanded inode, the additional EAs will be stored in an external filesystem block.
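The resulting layout can be summarized with a small sketch; the macro and helper names here are ours, not the exact kernel ones.

#define GOOD_OLD_INODE_SIZE 128	/* original fixed ext2 inode size */

/*
 * In an expanded inode, i_extra_size (the 2-byte field at offset 128)
 * gives the length of the extended fixed portion; in-inode EA storage
 * begins immediately after it.  With i_extra_size == 4, EAs start at
 * byte 132 and extend to the end of the on-disk inode.
 */
static inline char *ea_in_inode_start(char *raw_inode,
                                      unsigned short i_extra_size)
{
	return raw_inode + GOOD_OLD_INODE_SIZE + i_extra_size;
}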

Using the EA-in-inode, a very large (sevenfold) improvement was found in some Samba 4 benchmarks, taking ext3 from last place when compared to XFS, JFS, and Reiserfs3, to being clearly superior to all of the other filesystems for use in Samba 4 [5]. The in-inode EA patch started by Alex Tomas and Andreas Dilger was reworked by Andreas Gruenbacher, and the fact that this feature was such a major speedup for Samba 4 motivated its quick integration into the mainline 2.6.11 kernel.

3 Extents, delayed allocation and extent allocation

This section and the next (Section 4) will discuss features that are currently under development, and (as of this writing) have not been merged into the mainline kernel. In most cases patches exist, but they are still being polished, and discussion within the ext2/3 development community is still in progress.

Currently, the ext2/ext3 filesystem, like other traditional UNIX filesystems, uses direct, indirect, double indirect, and triple indirect blocks to map file offsets to on-disk blocks. This scheme, sometimes simply called an indirect block mapping scheme, is not efficient for large files, especially for large file deletion. In order to address this problem, many modern filesystems (including XFS and JFS on Linux) use some form of extent maps instead of the traditional indirect block mapping scheme.

Since most filesystems try to allocate blocks in a contiguous fashion, extent maps are a more efficient way to represent the mapping between logical and physical blocks for large files. An extent is a single descriptor for a range of contiguous blocks, instead of using, say, hundreds of entries to describe each block individually.

Over the years, there have been many discussions about moving ext3 from the traditional indirect block mapping scheme to an extent map based scheme. Unfortunately, due to the complications involved with making an incompatible format change, progress on an actual implementation of these ideas had been slow.

Alex Tomas, with help from Andreas Dilger, designed and implemented extents for ext3. He posted the initial version of his extents patch in August 2003. The initial results on file creation and file deletion tests inspired a round of discussion in the Linux community about adding extents to ext3. However, given the concerns that the format changes were ones that all of the ext3 developers would have to support on a long-term basis, and the fact that it was very late in the 2.5 development cycle, it was not integrated into the mainline kernel sources at that time.

Later, in April of 2004, Alex Tomas posted an updated extents patch, as well as additional patches that implemented delayed allocation and multiple block allocation, to the ext2-devel mailing list. These patches were reposted in February 2005, and this re-ignited interest in adding extents to ext3, especially when it was shown that the combination of these three features resulted in significant throughput improvements on some sequential write tests.

In the next three sections, we will discuss how these three features are designed, followed by a discussion of the performance evaluation of the combination of the three patches.

3.1 Extent maps

This implementation of extents was originally motivated by the problem of long truncate times observed for huge files.[1] As noted above, besides speeding up truncates, extents help improve the performance of sequential file writes since extents are a significantly smaller amount of metadata to be written to describe contiguous blocks, thus reducing the filesystem overhead.

[1] One option to address the issue is performing asynchronous truncates; however, while this makes the CPU cycles needed to perform the truncate less visible, excess CPU time will still be consumed by the truncate operations.

Most files need only a few extents to describe their logical-to-physical block mapping, which can be accommodated within the inode or a single extent map block. However, some extreme cases, such as sparse files with random allocation patterns, or a very badly fragmented filesystem, are not efficiently represented using extent maps. In addition, allocating blocks in a random access pattern may require inserting an extent map entry in the middle of a potentially very large data representation.

One solution to this problem is to use a tree data structure to store the extent map, either a B-tree, B+ tree, or some simplified tree structure as was used for the HTree feature. Alex Tomas’s implementation takes the latter approach, using a constant-depth tree structure. In this implementation, the extents are expressed using a 12-byte structure, which includes a 32-bit logical block number, a 48-bit physical block number, and a 16-bit extent length. With a 4 KB blocksize, a filesystem can address up to 1024 petabytes, and a maximum file size of 16 terabytes. A single extent can cover up to 2^16 blocks, or 256 MB.[2]

[2] Currently, the maximum block group size given a 4 KB blocksize is 128 MB, and this will limit the maximum size for a single extent.
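A hedged sketch of the 12-byte on-disk extent descriptor follows; it matches the layout described above (32-bit logical block, 48-bit physical block, 16-bit length), with field names patterned after the patch but best treated as illustrative.

#include <linux/types.h>

struct ext3_extent {
	__le32 ee_block;	/* first logical block this extent covers */
	__le16 ee_len;		/* number of blocks covered, up to 2^16 */
	__le16 ee_start_hi;	/* high 16 bits of the 48-bit physical block */
	__le32 ee_start;	/* low 32 bits of the physical block number */
};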

The extent tree information can be stored in the inode’s i_data array, which is 60 bytes long. An attribute flag in the inode’s i_flags word indicates whether the inode’s i_data array should be interpreted using the traditional indirect block mapping scheme, or as an extent data structure. If the entire extent information can be stored in the i_data field, then it will be treated as a single leaf node of the extent tree; otherwise, it will be treated as the root node of the inode’s extent tree, and additional filesystem blocks serve as intermediate or leaf nodes in the extent tree.

At the beginning of each node, the ext3_ext_header data structure is 12 bytes long, and contains a 16-bit magic number, two 16-bit integers containing the number of valid entries in the node and the maximum number of entries that can be stored in the node, a 16-bit integer containing the depth of the tree, and a 32-bit tree generation number. If the depth of the tree is 0, then the root node contains leaf node information, and the 12-byte entries contain the extent information described in the previous paragraph. Otherwise, the root node will contain 12-byte intermediate entries, which consist of a 32-bit logical block and a 48-bit physical block (with 16 bits unused) of the next index or leaf block.
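Again as a hedged sketch, the 12-byte node header and intermediate index entry described here look roughly like the following; the field names follow the patch’s style but should be treated as illustrative.

#include <linux/types.h>

struct ext3_extent_header {
	__le16 eh_magic;	/* magic number identifying an extent node */
	__le16 eh_entries;	/* number of valid entries in this node */
	__le16 eh_max;		/* capacity of this node, in entries */
	__le16 eh_depth;	/* 0 for a leaf node, > 0 for an index node */
	__le32 eh_generation;	/* tree generation number */
};

struct ext3_extent_idx {
	__le32 ei_block;	/* this index covers logical blocks from here on */
	__le32 ei_leaf;		/* low 32 bits of the next-level block's number */
	__le16 ei_leaf_hi;	/* high 16 bits of the 48-bit block number */
	__le16 ei_unused;
};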

3.1.1 Code organization

The implementation is divided into two parts: generic extents support that implements initialize/lookup/insert/remove functions for the extents tree, and VFS support that allows methods and callbacks like ext3_get_block(), ext3_truncate(), and ext3_new_block() to use extents.

In order to use the generic extents layer, the user of that layer must declare its tree via an ext3_extents_tree structure. The structure describes where the root of the tree is stored, and specifies the helper routines used to operate on it. This way one can root a tree not only in i_data as described above, but also in a separate block or in EA (Extended Attributes) storage. The helper routines described by struct ext3_extents_helpers can be used to control the block allocation needed for tree growth, journaling of metadata, using different criteria of extent mergability, removing extents, and so on.

3.1.2 Future work

Alex Tomas’s extents implementation is still a work-in-progress. Some of the work that needs to be done is to make the implementation independent of byte-order, improving the error handling, and shrinking the depth of the tree when truncating the file. In addition, the extent scheme is less efficient than the traditional indirect block mapping scheme if the file is highly fragmented. It may be useful to develop some heuristics to determine whether or not a file should use extents automatically. It may also be desirable to allow block-mapped leaf blocks in an extent-mapped file for cases where there is not enough contiguous space in the filesystem to allocate the extents efficiently.

The last change would necessarily change the on-disk format of the extents, but it is not the only extent format change that has been discussed. For example, the extent format does not support logical block numbers that are greater than 32 bits, and a more efficient, variable-length format would allow more extents to be stored in the inode before spilling out to an external tree structure.

Since deployment of the extent data structure is disruptive because it involves a non-backwards-compatible change to the filesystem format, it is important that the ext3 developers are comfortable that the extent format is flexible and powerful enough for present and future needs, in order to avoid the need for additional incompatible format changes.

3.2 Delayed allocation

3.2.1 Why delayed allocation is needed

Procrastination has its virtues in the ways of an operating system. Deferring certain tasks until an appropriate time often improves the overall efficiency of the system by enabling optimal deployment of resources. Filesystem I/O writes are no exception.

Typically, when a filesystem write() system call returns success, it has only copied the data to be written into the page cache, mapped required blocks in the filesystem, and marked the pages as needing write out. The actual write out of data to disk happens at a later point of time, usually when writeback operations are clustered together by a background kernel thread in accordance with system policies, or when the user requests file data to be synced to disk. Such an approach ensures improved I/O ordering and clustering for the system, resulting in more effective utilization of I/O devices, with applications spending less time in the write() system call and using the cycles thus saved to perform other work.

Delayed allocation takes this a step further, by deferring the allocation of new filesystem blocks on disk until writeback time [12]. This helps in three ways:

• Reduces fragmentation in the filesystem by improving the chances of creating contiguous blocks on disk for a file. Although preallocation techniques can help avoid fragmentation, they do not address fragmentation caused by multiple threads writing to the file at different offsets simultaneously, or files which are written in a non-contiguous order. (For example, the libfd library, which is used by the GNU C compiler, will create object files that are written out of order.)

• Reduces CPU cycles spent in repeated get_block() calls, by clustering allocation for multiple blocks together. Both of the above would be more effective when combined with a good multi-block allocator.

• For short lived files that can be buffered in memory, delayed allocation may avoid the need for disk updates for metadata creation altogether, which in turn reduces impact on fragmentation [12].

Delayed allocation is also useful for the Active Block I/O Scheduling System (ABISS) [1], which provides guaranteed read/write bit rates for applications that require guaranteed real-time I/O streams. Without delayed allocation, the synchronous code path for write() has to read, modify, update, and journal changes to the block allocation bitmap, which could disrupt the guaranteed read/write rates that ABISS is trying to deliver.

Since block allocation is deferred until background writeback, when it is too late to return an error to the caller of write(), the write() operation requires a way to ensure that the allocation will indeed succeed. This can be accomplished by carving out, or reserving, a claim on the expected number of blocks on disk (for example, by subtracting this number from the total number of available blocks, an operation that can be performed without having to go through actual allocation of specific disk blocks).
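A minimal sketch of such a claim, assuming an illustrative per-filesystem counter of available blocks (the names are ours, not those of the actual patches):

#include <linux/spinlock.h>
#include <linux/errno.h>

struct fs_counters {
	spinlock_t lock;
	unsigned long free_blocks;	/* blocks neither allocated nor claimed */
};

/*
 * Claim 'count' blocks at write() time so that later writeback cannot
 * fail with ENOSPC; no specific physical blocks are chosen yet.
 */
static int claim_blocks(struct fs_counters *fc, unsigned long count)
{
	int ret = 0;

	spin_lock(&fc->lock);
	if (fc->free_blocks >= count)
		fc->free_blocks -= count;	/* claim succeeds */
	else
		ret = -ENOSPC;			/* reported to the write() caller */
	spin_unlock(&fc->lock);
	return ret;
}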

Repeated invocation of ext3_get_block()/ext3_new_block() is not efficient for mapping consecutive blocks, especially for an extent based inode, where it is natural to process a chunk of contiguous blocks all together. For this reason, Alex Tomas implemented an extents based multiple block allocation and used it as a basis for extents based delayed allocation. We will discuss the extents based multiple block allocation in Section 3.3.

3.2.2 Extents based delayed allocation implementation

If the delayed allocation feature is enabled for an ext3 filesystem and a file uses extent maps, then the address space operations for its inode are initialized to a set of ext3 specific routines that implement the write operations a little differently. The implementation defers allocation of blocks from prepare_write() and employs extent walking, together with the multiple block allocation feature (described in the next section), for clustering block allocations maximally into contiguous blocks.

Instead of allocating the disk block in prepare_write(), the page is marked as needing block reservation. The commit_write() function calculates the required number of blocks, and reserves them to make sure that there are enough free blocks in the filesystem to satisfy the write. When the pages get flushed to disk by writepage() or writepages(), these functions will walk all the dirty pages in the specified inode, cluster the logically contiguous ones, and submit the page or pages to the bio layer. After the block allocation is complete, the reservation is dropped. A single block I/O request (or BIO) is submitted for write out of pages processed whenever a newly allocated extent (or the next mapped extent if already allocated) on the disk is not adjacent to the previous one, or when writepages() completes. In this manner the delayed allocation code is tightly integrated with other features to provide best performance.

3.3 Buddy based extent allocation

One of the shortcomings of the current ext3 block allocation algorithm, which allocates one block at a time, is that it is not efficient enough for high speed sequential writes. In one experiment utilizing direct I/O on a dual Opteron workstation with fast enough buses, fiber channel, and a large, fast RAID array, the CPU limited the I/O throughput to 315 MB/s. While this would not be an issue on most machines (since the maximum bandwidth of a PCI bus is 127 MB/s), for newer or enterprise-class servers, the amount of data per second that can be written continuously to the filesystem is no longer limited by the I/O subsystem, but by the amount of CPU time consumed by ext3’s block allocator.

To address this problem, Alex Tomas designed and implemented a multiple block allocator, called mballoc, which uses a classic buddy data structure on disk to store chunks of free or used blocks for each block group. This buddy data is an array of metadata, where each entry describes the status of a cluster of 2^n blocks, classified as free or in use.

Since block buddy data is not suitable for determining a specific block’s status and locating a free block close to the allocation goal, the traditional block bitmap is still required in order to quickly test whether a specific block is available or not.

In order to find a contiguous extent of blocks to allocate, mballoc checks whether the goal block is available in the block bitmap. If it is available, mballoc looks up the buddy data to find the free extent length starting from the goal block. To find the real free extent length, mballoc continues by checking whether the physical block right next to the end block of the previously found free extent is available or not. If that block is available in the block bitmap, mballoc could quickly find the length of the next free extent from the buddy data and add it up to the total length of the free extent from the goal block.

For example, if block M is the goal block and is claimed to be available in the bitmap, and block M is marked as free in buddy data of order n, then initially the free chunk size from block M is known to be 2^n. Next, mballoc checks the bitmap to see if block M + 2^n is available or not. If so, mballoc checks the buddy data again, and finds that the free extent starting at block M + 2^n has order k, i.e., length 2^k. Now, the free chunk length from goal block M is known to be 2^n + 2^k. This process continues until at some point the boundary block is not available. In this manner, instead of testing dozens, hundreds, or even thousands of blocks’ availability status in the bitmap to determine the free blocks chunk size, it can be enough to just test a few bits in buddy data and the block bitmap to learn the real length of the free blocks extent.
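In code form, the scan just described might look like the following hedged sketch, where buddy_order() is an assumed helper returning the order of the free buddy chunk beginning at a given block.

#include <linux/bitops.h>

struct group_data {
	unsigned long blocks_in_group;
	unsigned long *bitmap;	/* traditional on-disk block bitmap */
};

/* Assumed helper: order n of the free buddy chunk starting at 'block'. */
extern int buddy_order(struct group_data *gd, unsigned long block);

/*
 * Length of the free extent starting at block m, computed by
 * alternating one bitmap test per chunk with a buddy-order lookup,
 * instead of testing every block's bit individually.
 */
static unsigned long free_extent_len(struct group_data *gd, unsigned long m)
{
	unsigned long len = 0;

	while (m + len < gd->blocks_in_group &&
	       !test_bit(m + len, gd->bitmap))		/* boundary block free? */
		len += 1UL << buddy_order(gd, m + len);	/* skip the whole chunk */
	return len;
}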

If the found free chunk size is greater than the requested size, then the search is considered successful and mballoc allocates the found free blocks. Otherwise, depending on the allocation criteria, mballoc decides whether to accept the result of the last search in order to preserve the goal block locality, or to continue searching for the next free chunk in case the length of contiguous blocks is a more important factor than where they are located. In the latter case, mballoc scans the bitmap to find the next available block, starts from there, and determines the related free extent size.

If mballoc fails to find a free extent that satisfies the requested size after rejecting a predefined number (currently 200) of free chunks, it stops the search and returns the best (largest) free chunk found so far. In order to speed up the scanning process, mballoc maintains the total number of available blocks and the first available block of each block group.

3.3.1 Future plans

Since in ext3 blocks are divided into block groups, the block allocator first selects a block group before it searches for free blocks. The policy employed in mballoc is quite simple: to try the block group where the goal block is located first. If allocation from that group fails, then scan the subsequent groups. However, this implies that on a large filesystem, especially when free blocks are not evenly distributed, CPU cycles could be wasted on scanning lots of almost full block groups before finding a block group with the desired free blocks criteria. Thus, a smarter mechanism to select the right block group to start the search should improve the multiple block allocator’s efficiency. There are a few proposals:

1. Sort all the block groups by the total number of free blocks.

2. Sort all the groups by the group fragmentation factor.

3. Lazily sort all the block groups by the total number of free blocks, re-sorting only when the free block count of a group changes significantly.

4. Put extents into buckets based on extent size and/or extent location in order to quickly find extents of the correct size and goal location.

Currently the four options are under evaluation, though probably the first one is a little more interesting than the others.

3.4 Evaluating the extents patch set

The initial evaluation of the three patches (extents, delayed allocation and extent allocation) shows significant throughput improvements, especially under sequential tests. The tests show that the extents patch significantly reduces the time for large file creation and removal, as well as file rewrite. With extents and extent allocation, the throughput of direct I/O on the aforementioned Opteron-based workstation is significantly improved, from 315 MB/s to 500 MB/s, and CPU usage dropped significantly from 100% to 50%. In addition, extensive testing on various benchmarks, including dbench, tiobench, FFSB [11] and sqlbench [16], has been done with and without this set of patches. Some initial analysis indicates that the multiple block allocation, when combined with delayed allocation, is a key factor resulting in this improvement. More testing results can be obtained from http://www.bullopensource.org/ext4.

4 Improving ext3 without changing disk format

Replacing the traditional indirect block mapping scheme with an extent mapping scheme has many benefits, as we have discussed in the previous section. However, changes to the on-disk format that are not backwards compatible are often slow to be adopted by users, for two reasons. First of all, robust e2fsck support sometimes lags the kernel implementation. Secondly, it is generally not possible to mount the filesystem with an older kernel once the filesystem has been converted to use these new features, preventing rollback in case of problems.

Fortunately, there are a number of improvements that can be made to the ext2/3 filesystem without making these sorts of incompatible changes to the on-disk format.

In this section, we will discuss a few features that are implemented based on the current ext3 filesystem. Section 4.1 describes the effort to reduce the usage of the bufferhead structure in ext3; Section 4.2 describes the effort to add delayed allocation without requiring the use of extents; Section 4.3 discusses the work to add multiple block allocation; Section 4.4 describes asynchronous file unlink and truncate; Section 4.5 describes a feature to allow more than 32000 subdirectories; and Section 4.6 describes a feature to allow multiple threads to concurrently create/rename/link/unlink files in a single directory.

4.1 Reducing the use of bufferheads in ext3

Bufferheads continue to be heavily used in the Linux I/O and filesystem subsystems, even though closer integration of the buffer cache with the page cache since 2.4 and the new block I/O subsystem introduced in Linux 2.6 have in some sense superseded part of the traditional Linux buffer cache functionality.

There are a number of reasons for this. First of all, the buffer cache is still used as a metadata cache. All filesystem metadata (superblock, inode data, indirect blocks, etc.) are typically read into the buffer cache for quick reference. Bufferheads provide a way to read/write/access this data. Second, bufferheads link a page to disk blocks and cache the block mapping information. In addition, the design of bufferheads supports filesystem block sizes that do not match the system page size. Bufferheads provide a convenient way to map multiple blocks to a single page. Hence, even the generic multi-page read-write routines sometimes fall back to using bufferheads for fine-graining or handling of complicated corner cases.

Ext3 is no exception to the above. Besides the above reasons, ext3 also makes use of bufferheads to enable it to provide ordering guarantees in case of a transaction commit. Ext3’s ordered mode guarantees that file data gets written to the disk before the corresponding metadata gets committed to the journal. In order to provide this guarantee, bufferheads are used as the mechanism to associate the data pages belonging to a transaction. When the transaction is committed to the journal, ext3 uses the bufferheads attached to the transaction to make sure that all the associated data pages have been written out to the disk.

However, bufferheads have the following disadvantages:

• All bufferheads are allocated from the “buffer_head” slab cache, thus they consume low memory[3] on 32-bit architectures. Since there is one bufferhead (or more, depending on the block size) for each filesystem page cache page, the bufferhead slab can grow really quickly and consumes a lot of low memory space.

• When bufferheads get attached to a page, they take a reference on the page. The reference is dropped only when the VM tries to release the page. Typically, once a page gets flushed to disk it is safe to release its bufferheads. But dropping the bufferhead right at the time of I/O completion is not easy, since being in interrupt handler context restricts the kind of operations feasible. Hence, bufferheads are left attached to the page, and released later as and when the VM decides to re-use the page. So, it is typical to have a large number of bufferheads floating around in the system.

• The extra memory references to bufferheads can impact the performance of memory caches, the Translation Lookaside Buffer (TLB) and the Segment Lookaside Buffer[4] (SLB). We have observed that when running a large NFS workload, while the ext3 journaling thread kjournald() is referencing all the transactions, all the journal heads, and all the bufferheads looking for data to flush/clean, it suffers a large number of SLB misses with the associated performance penalty. The best solution for these performance problems appears to be to eliminate the use of bufferheads as much as possible, which reduces the number of memory references required by kjournald().

[3] Low memory is memory that can be directly mapped into the kernel virtual address space, i.e., 896 MB in the case of IA32.

[4] The SLB is found on the 64-bit PowerPC.

To address the above concerns, Badari Pulavarty has been working on removing bufferhead usage from the major impact areas in ext3, while retaining bufferheads for uncommon usage scenarios. The focus was on eliminating bufferhead usage for user data pages, while retaining bufferheads primarily for metadata caching.

Under the writeback journaling mode, since there are no ordering requirements between when metadata and data get flushed to disk, eliminating the need for bufferheads is relatively straightforward because ext3 can use the most recent generic VFS helpers for writeback. This change is already available in the latest Linux 2.6 kernels.

For ext3 ordered journaling mode, however, since bufferheads are used as the linkage between pages and transactions in order to provide flushing order guarantees, removing the use of bufferheads gets complicated. To address this issue, Andrew Morton proposed a new ext3 journaling mode, which works without bufferheads and provides semantics that are somewhat close to those provided by ordered mode [9]. The idea is that whenever there is a transaction commit, we go through all the dirty inodes and dirty pages in that filesystem and flush every one of them. This way metadata and user data are flushed at the same time. The complexity of this proposal is currently under evaluation.

4.2 Delayed allocation without extents

As we have discussed in Section 3.2, delayed allocation is a powerful technique that can result in significant performance gains, and Alex Tomas's implementation shows some very interesting and promising results. However, Alex's implementation only provides delayed allocation when the ext3 filesystem is using extents, which requires an incompatible change to the on-disk format. In addition, like past implementations of delayed allocation in other filesystems, such as XFS, Alex's changes implement delayed allocation in filesystem-specific versions of prepare_write(), commit_write(), writepage(), and writepages(), instead of using the filesystem-independent routines provided by the Linux kernel.

This motivated Suparna Bhattacharya, Badari Pulavarty and Mingming Cao to implement delayed allocation and multiple block allocation support to improve the performance of ext3 to the extent possible without requiring any on-disk format changes.

Interestingly, the work to remove the use of bufferheads in ext3 implemented most of the changes required for delayed allocation when bufferheads are not required. The nobh_commit_write() function delegates the task of writing data to the writepage() and writepages() functions, by simply marking the page as dirty. Since the writepage() function already has to handle the case of writing a page which is mapped to a sparse memory-mapped file, the writepage() function already handles block allocation by calling the filesystem-specific get_block() function. Hence, if the nobh_prepare_write function were to omit calling get_block(), the physical block would not be allocated until the page is actually written out via the writepage() or writepages() function.

Badari Pulavarty implemented a relatively small patch as a proof-of-concept, which demonstrates that this approach works well. The work is still in progress, with a few limitations to address. The first limitation is that in the current proof-of-concept patch, data could be dropped if the filesystem was full, without the write() system call returning -ENOSPC.5 In order to address this problem, the nobh_prepare_write function must note that the page currently does not have a physical block assigned, and request that the filesystem reserve a block for the page. So while the filesystem will not have assigned a specific physical block as a result of nobh_prepare_write(), it must guarantee that when writepage() calls the block allocator, the allocation will succeed.
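A minimal sketch of this scheme, under the assumptions above (the function name, the ext3_reserve_block() reservation hook, and the page flag below are all hypothetical, not the actual proof-of-concept patch):

    /* Sketch: defer block allocation from prepare_write time to
     * writepage()/writepages() time, while still reserving space so
     * that a full filesystem is reported as -ENOSPC at write() time. */
    static int nobh_prepare_write_delalloc(struct inode *inode,
                                           struct page *page,
                                           unsigned from, unsigned to)
    {
            /* Do NOT call get_block() here; just reserve one block so
             * that the later allocation in writepage() cannot fail. */
            if (ext3_reserve_block(inode) != 0)
                    return -ENOSPC;    /* hypothetical reservation hook */

            SetPageDelalloc(page);     /* hypothetical "no block yet" flag */
            return 0;
    }

When writepage() later sees the flag, it performs the real allocation through the filesystem's get_block() callback, consuming the reservation made here.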

The other major limitation is that, at present, it only works when bufferheads are not needed. The nobh code path as currently present in the 2.6.11 kernel tree only supports ext3 when it is journaling in writeback mode (not in ordered journaling mode), and when the blocksize is the same as the VM pagesize. Extending the nobh code paths to support sub-pagesize blocksizes is likely not very difficult, and is probably the appropriate way of addressing the first part of this shortcoming.

5The same shortcoming exists today if a sparse file is memory-mapped and the filesystem is full when writepage() tries to write a newly allocated page to the filesystem. This can potentially happen after the user process which wrote to the file via mmap() has exited, when there is no program left to receive an error report.


However, supporting delayed allocation for ext3 ordered journaling using this approach is going to be much more challenging. While metadata journaling alone is sufficient in writeback mode, ordered mode needs to track I/O submissions for purposes of waiting for completion of data writeback to disk as well, so that it can ensure that metadata updates hit the disk only after the corresponding data blocks are on disk. This avoids potential exposures and inconsistencies without requiring full data journaling [14].

However, in the current design of the generic multi-page writeback routines, block I/O submissions are issued directly by the generic routines and are transparent to the filesystem-specific code. In earlier situations where bufferheads were used for I/O, filesystem-specific wrappers around generic code could track I/O through the bufferheads associated with a page and link them with the transaction. With the recent changes, where I/O requests are built directly as multi-page bio requests with no link from the page to the bio, this no longer applies.

A couple of solution approaches are under consideration, as of the writing of this paper:

• Introducing yet another filesystem-specific callback to be invoked by the generic multi-page write routines to actually issue the I/O. Ext3 could then track the number of in-flight I/O requests associated with the transaction, and wait for this count to fall to zero at journal commit time. Implementing this option is complicated because the multi-page write logic occasionally falls back to the older bufferhead-based logic in some scenarios. Perhaps ext3 ordered mode writeback would need to provide both the callback and the page bufferhead tracking logic if this approach is employed.

• Find a way for the ext3 journal commit to effectively reuse a part of the fsync/O_SYNC implementation that waits for writeback to complete on the pages of the relevant inodes, using a radix-tree walk. Since the journal layer is designed to be unaware of filesystems [14], this could perhaps be accomplished by associating a (filesystem-specific) callback with journal commit, as recently suggested by Andrew Morton [9].

It remains to be seen which approach works out to be the best, as development progresses. It is clear that, since ordered mode is the default journaling mode, any delayed allocation implementation must be able to support it.

4.3 Efficiently allocating multiple blocks

As with Alex Tomas's delayed allocation patch, Alex's multiple block allocator patch relies on an incompatible on-disk format change of the ext3 filesystem to support extent maps. In addition, the extent-based mballoc patch also required a format change in order to store data for the buddy allocator which it utilized. Since oprofile measurements of Alex's patch indicated that the multiple block allocator seemed to be responsible for reducing CPU usage, and since it seemed to improve throughput in some workloads, we decided to investigate whether it was possible to obtain most of the benefits of a multiple block allocator using the current ext3 filesystem format. This seemed to be a reasonable approach, since many of the advantages of Alex's mballoc patch seemed to derive from collapsing a large number of calls to ext3_get_block() into many fewer calls to ext3_get_blocks(), thus avoiding excess calls into the journaling layer to record changes to the block allocation bitmap.

In order to implement a multiple-block allocator based on the existing block allocation bitmap, Mingming Cao first changed ext3_new_block() to accept a new argument specifying how many contiguous blocks the function should attempt to allocate, on a best-efforts basis. The function now allocates the first block in the existing way, and then continues allocating up to the requested number of adjacent physical blocks at the same time if they are available.

The modified ext3_new_block() function was then used to implement ext3's get_blocks() method, the standardized filesystem interface to translate a file offset and a length into a set of on-disk blocks. It does this by starting at the first file offset and translating it into a logical block number, and then taking that logical block number and mapping it to a physical block number. If the logical block has already been mapped, then it will continue mapping the next logical block until the requisite number of physical blocks have been returned, or an unallocated block is found.

If some blocks need to be allocated, ext3_get_blocks() will first look ahead to see how many adjacent blocks are needed, and then pass this allocation request to ext3_new_blocks(), which searches for the requested free blocks, marks them as used, and returns them to ext3_get_blocks(). Next, ext3_get_blocks() will update the inode's direct blocks, or a single indirect block, to point at the allocated blocks.
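In outline, the logic just described might look like the following sketch (the helper names logical_block_is_mapped() and logical_to_physical(), and the signatures used here, are illustrative rather than taken from the actual patch):

    /* Sketch of ext3_get_blocks(): map up to max_blocks logical
     * blocks starting at iblock, allocating any unmapped tail with a
     * single call into the multiple-block allocator. */
    static int ext3_get_blocks_sketch(struct inode *inode, sector_t iblock,
                                      unsigned long max_blocks,
                                      sector_t *phys, int create)
    {
            unsigned long mapped = 0;

            /* First, walk the blocks that are already mapped. */
            while (mapped < max_blocks &&
                   logical_block_is_mapped(inode, iblock + mapped)) {
                    phys[mapped] = logical_to_physical(inode, iblock + mapped);
                    mapped++;
            }

            /* Then allocate the remaining adjacent blocks in one request,
             * and record them in the inode's (in)direct block pointers. */
            if (mapped < max_blocks && create)
                    mapped += ext3_new_blocks(inode, iblock + mapped,
                                              max_blocks - mapped,
                                              &phys[mapped]);
            return mapped;
    }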

Currently, this ext3_get_blocks() implementation does not allocate blocks across an indirect block boundary. There are two reasons for this. First, the JBD journaling layer requires the filesystem to reserve the maximum number of blocks that will require journaling when a new transaction handle is requested via ext3_journal_start(). If we were to allow a multiple block allocation request to span an indirect block boundary, it would be difficult to predict how many metadata blocks might get dirtied and thus require journaling. Secondly, it would be difficult to place any newly allocated indirect blocks so that they are appropriately interleaved with the data blocks.

Currently, only the Direct I/O code path uses the get_blocks() interfaces; the mpage_writepages() function calls mpage_writepage(), which in turn calls get_block(). Since only a few workloads (mainly databases) use Direct I/O, Suparna Bhattacharya has written a patch to change mpage_writepages() to use get_blocks() instead. This change should be generically helpful for any filesystem which implements an efficient get_blocks() function.

Draft patches have already been posted to the ext2-devel mailing list. As of this writing, we are trying to integrate Mingming's ext3_get_blocks() patch, Suparna Bhattacharya's mpage_writepage() patch, and Badari Pulavarty's generic delayed allocation patch (discussed in Section 4.2), in order to evaluate these three patches together using benchmarks.

4.4 Asynchronous file unlink/truncate

With block-mapped files and ext3, truncation of a large file can take a considerable amount of time (on the order of tens to hundreds of seconds if there is a lot of other filesystem activity occurring concurrently). There are several reasons for this:

• There are limits to the size of a single journal transaction (1/4 of the journal size). When truncating a large fragmented file, it may require modifying so many block bitmaps and group descriptors that it forces a journal transaction to close out, stalling the unlink operation.


• Because of this per-transaction limit, truncate needs to zero the [d,t]indirect blocks starting from the end of the file, in case it needs to start a new transaction in the middle of the truncate (ext3 guarantees that a partially-completed truncate will be consistent/completed after a crash).

• The read/write of the file's [d,t]indirect blocks from the end of the file to the beginning can take a lot of time, as it does this in single-block chunks and the blocks are not contiguous.

In order to reduce the latency associated with large file truncates and unlinks on the Lustre® filesystem (which is commonly used by scientific computing applications handling very large files), the ability for ext3 to perform asynchronous unlink/truncate was implemented by Andreas Dilger in early 2003.

The delete thread is a kernel thread that services a queue of inode unlink or truncate-to-zero requests that are intercepted from normal ext3_delete_inode() and ext3_truncate() calls. If the inode to be unlinked/truncated is small enough, or if there is any error in trying to defer the operation, it is handled immediately; otherwise, it is put into the delete thread queue. In the unlink case, the inode is simply put into the queue and the delete thread is woken up, before returning to the caller. For the truncate-to-zero case, a free inode is allocated and the blocks are moved over to the new inode before waking the thread and returning to the caller. When the delete thread is woken up, it does a normal truncate of all the blocks on each inode in the list, and then frees the inode.
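The interception decision might be sketched as follows (all names and the threshold below are hypothetical; the actual 2.4 patch differs in detail):

    /* Sketch of the deferred-delete decision described above. */
    static void ext3_delete_inode_async(struct inode *inode)
    {
            /* Small inodes are cheap to truncate inline, and any
             * failure to defer falls back to the synchronous path. */
            if (inode->i_blocks < ASYNC_DELETE_THRESHOLD ||
                ext3_queue_delete(inode) != 0) {
                    ext3_delete_inode(inode);  /* normal synchronous path */
                    return;
            }
            /* ext3_queue_delete() has put the inode on the delete
             * thread's queue (and the on-disk orphan list) and woken
             * the thread; we return to the caller immediately. */
    }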

In order to handle these deferred delete/truncate requests in a crash-safe manner, the inodes to be unlinked/truncated are added to the ext3 orphan list. This is an already existing mechanism by which ext3 handles file unlinks/truncates that might be interrupted by a crash. A persistent singly-linked list of inode numbers is linked from the superblock and, if this list is not empty at filesystem mount time, the ext3 code will first walk the list and delete/truncate all of the files on it before the mount is completed.

The delete thread was written for 2.4 kernels, but is currently only in use by Lustre. The patch has not yet been ported to 2.6, but the amount of effort needed to do so is expected to be relatively small, as the ext3 code has changed relatively little in this area.

For extent-mapped files, the need for asynchronous unlink/truncate is much less, because the number of metadata blocks is greatly reduced for a given file size (unless the file is very fragmented). An alternative to the delete thread (for files using either extent maps or indirect blocks) would be to walk the inode and pre-compute the number of bitmaps and group descriptors that would be modified by the operation, and try to start a single transaction of that size. If this transaction can be started, then all of the indirect, double indirect, and triple indirect blocks (also referred to as [d,t]indirect blocks) no longer have to be zeroed out, and we only have to update the block bitmaps and their group summaries, reducing the amount of I/O considerably for files using indirect blocks. Also, the walking of the file metadata blocks can be done in forward order and asynchronous readahead can be started for indirect blocks to make more efficient use of the disk. As an added benefit, we would regain the ability to undelete files in ext3 because we no longer have to zero out all of the metadata blocks.

4.5 Increased nlinks support

The use of a 16-bit value for an inode's link count (i_nlink) limits the number of hard links on an inode to 65535. A directory starts with a link count of 2 (one for "." and one for "..") and each subdirectory has a hard link to its parent, so the number of subdirectories is similarly limited.

The ext3 implementation further reduced this limit to 32000 to avoid signed-int problems. Before indexed directories were implemented, the practical limit for files/subdirectories was about 10000 in a single directory.

A patch was implemented to overcome this subdirectory limit by not counting the subdirectory links after the counter overflows (at 65000 links, actually); instead, a link count of one is stored in the inode. The ext3 code already ignores the link count when determining if a directory is full or empty, and a link count of one is otherwise not possible for a directory.

Using a link count of one is also required because userspace tools like find optimize their directory walking by only checking a number of subdirectories equal to the link count minus two. Having a directory link count of one disables that heuristic.
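The overflow rule described above can be sketched as follows (the helper names are illustrative; only the 65000 threshold and the pin-at-one behavior come from the patch description):

    #define EXT3_DIR_LINK_OVERFLOW 65000  /* threshold from the text above */

    static inline void ext3_inc_dir_count(struct inode *dir)
    {
            /* Once pinned at 1 ("too many to count"), stay there. */
            if (dir->i_nlink == 1 ||
                dir->i_nlink >= EXT3_DIR_LINK_OVERFLOW)
                    dir->i_nlink = 1;
            else
                    dir->i_nlink++;
    }

    static inline void ext3_dec_dir_count(struct inode *dir)
    {
            /* A pinned count is no longer maintained on unlink. */
            if (dir->i_nlink > 1)
                    dir->i_nlink--;
    }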

4.6 Parallel directory operations

The Lustre filesystem (which is built on top of the ext3 filesystem) has to meet very high goals for concurrent file creation in a single directory (5000 creates/second for 10 million files) for some of its implementations. In order to meet this goal, and to allow this rate to scale with the number of CPUs in a server, the implementation of parallel directory operations (pdirops) was done by Alex Tomas in mid-2003. This patch allows multiple threads to concurrently create, unlink, and rename files within a single directory.

There are two components in the pdirops patches: one in the VFS, to lock individual entries in a directory (based on filesystem preference) instead of using the directory inode semaphore to provide exclusive access to the directory; the second patch is in ext3, to implement proper locking based on the filename.

In the VFS, the directory inode semaphore actually protects two separate things. It protects the filesystem from concurrent modification of a single directory, and it also protects the dcache from races in creating the same dentry multiple times for concurrent lookups. The pdirops VFS patch adds the ability to lock individual dentries (based on the dentry hash value) within a directory to prevent concurrent dcache creation. All of the places in the VFS that would take i_sem on a directory instead call lock_dir() and unlock_dir(), which determine what type of locking is desired by the filesystem.

In ext3, the locking is done on a per-directory-leaf-block basis. This is well suited to the directory-indexing scheme, which has a tree with leaf blocks and index blocks that very rarely change. In the rare case that adding an entry to the leaf block requires an index block to be locked, the code restarts at the top of the tree and keeps the lock(s) on the index block(s) that need to be modified. At about 100,000 entries, there are 2-level index blocks that further reduce the chance of lock collisions on index blocks. By not locking index blocks initially, the common case, where no change needs to be made to the index block, is improved.
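Conceptually, the VFS side replaces the single directory semaphore with a lock chosen by the name hash; the fragment below is only an illustration of that idea (a fixed global lock table with hypothetical names), not the pdirops patch itself:

    /* Illustration: serialize operations on a single name by hashing
     * the dentry into one of N locks, instead of taking the
     * directory-wide i_sem. The real patch lets the filesystem choose
     * the locking granularity per directory. */
    #define PDIROPS_NLOCKS 64
    static struct semaphore pdirops_lock[PDIROPS_NLOCKS];

    static void lock_dir_entry(struct dentry *dentry)
    {
            down(&pdirops_lock[dentry->d_name.hash % PDIROPS_NLOCKS]);
    }

    static void unlock_dir_entry(struct dentry *dentry)
    {
            up(&pdirops_lock[dentry->d_name.hash % PDIROPS_NLOCKS]);
    }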

The use of the pdirops VFS patch was also shown to improve the performance of the tmpfs filesystem, which needs no locking other than the dentry locks.

5 Performance comparison

In this section, we will discuss some performance comparisons between the ext3 filesystem found in the 2.4 kernel and in the 2.6 kernel. The goal is to evaluate the progress ext3 has made over the last few years. Of course, many improvements other than the ext3-specific features (for example, VM changes and the block I/O layer rewrite) have been added to the Linux 2.6 kernel, and these could affect the overall performance results. However, we believe the comparison is still worthwhile, for the purpose of illustrating the improvements made to ext3 on some workloads now, compared with a few years ago.

We selected the Linux 2.4.29 kernel as the baseline, and compared it with the Linux 2.6.10 kernel. Linux 2.6.10 contains all the features discussed in Section 2, except the EA-in-inode feature, which is not relevant for the benchmarks we had chosen. We also performed the same benchmarks using a Linux 2.6.10 kernel patched with Alex Tomas's extents patch set, which implements extents, delayed allocation, and extents-based multiple block allocation. We plan to run the same benchmarks against a Linux 2.6.10 kernel with some of the patches described in Section 4 in the future.

In this study we chose two benchmarks. One is tiobench, a benchmark testing filesystem sequential and random I/O performance with multiple running threads. The other benchmark we used is filemark, a modified postmark [8] benchmark which simulates I/O activity on a mail server with multiple threads. Filemark was used by Ray Bryant when he conducted a filesystem performance study on the Linux 2.4.17 kernel three years ago [3].

All the tests were done on the same 8-CPU 700 MHz Pentium III system with 1 GB RAM. All the tests were run with ext3's writeback journaling mode enabled. When running tests with the extents patch set, the filesystem was mounted with the appropriate mount options to enable the extents, multiple block allocation, and delayed allocation features. These test runs are shown as "2.6.10_writeback_emd" in the graphs.

[Figure 1: tiobench sequential write throughput results comparison. Throughput (MB/sec) vs. number of threads (1 to 64) for 2.4.29_writeback, 2.6.10_writeback, and 2.6.10_writeback_emd.]

[Figure 2: tiobench sequential read throughput results comparison. Throughput (MB/sec) vs. number of threads (1 to 64) for the same three kernels.]

5.1 Tiobench comparison

Although there have been a huge number of changes between the Linux 2.4.29 kernel and the Linux 2.6.10 kernel that could affect overall performance (both inside and outside of the ext3 filesystem), we expect that two ext3 features, the removal of the BKL from ext3 (as described in Section 2.2) and reservation-based block allocation (as described in Section 2.3), are likely to significantly impact the throughput of the tiobench benchmark. In this sequential write test, multiple threads are sequentially writing/allocating blocks in the same directory. Allowing allocations to proceed concurrently in this case most likely reduces CPU usage and improves throughput. Also, with reservation-based block allocation, files created by multiple threads in this test could be more contiguous on disk, likely reducing the latency both while writing and during subsequent sequential reads.

Figure 1 and Figure 2 show the sequential write and sequential read test results of the tiobench benchmark on the three selected kernels, with thread counts ranging from 1 to 64. The total file size used in this test is 4 GB and the blocksize is 16384 bytes. The test was done on a single 18G SCSI disk. The graphs indicate significant throughput improvement from the 2.4.29 kernel to the Linux 2.6.10 kernel on this particular workload. Figure 2 shows that sequential read throughput has been significantly improved from Linux 2.4.29 to Linux 2.6.10 on ext3 as well.

When we applied the extents patch set, we saw an additional 7-10% throughput improvement in the tiobench sequential write test. We suspect the improvement comes from the combination of the delayed allocation and multiple block allocation patches. As we noted earlier, having both features could help lay out files more contiguously on disk, as well as reduce the number of metadata updates, which are quite expensive and happen quite frequently with the current ext3 single block allocation mode. Future testing is needed to find out which feature among the three patches (extents, delayed allocation, and multiple block allocation) is the key contributor to this improvement.

5.2 Filemark comparison

A Filemark execution includes three phases: creation, transaction, and delete. The transaction phase includes file read and append operations, and some file creation and removal operations. The configuration we used in this test is the so-called "medium system" mentioned in Bryant's Linux filesystem performance study [3]. Here we ran filemark with 4 target directories, each on a different disk, 2000 subdirectories per target directory, and 100,000 total files. The file sizes ranged from 4 KB to 16 KB and the I/O size was 4 KB. Figure 3 shows the average number of transactions per second during the transaction phase, when running Filemark with 1, 8, 64, and 128 threads on the three kernels.

[Figure 3: Filemark benchmark transaction rate comparison. Transactions per second vs. number of threads (1 to 128) for 2.4.29_writeback, 2.6.10_writeback, and 2.6.10_writeback_emd.]

This benchmark uses a varying number of threads. We therefore expected that the scalability improvements to the ext3 filesystem in the 2.6 kernel would improve Linux 2.6's performance on this benchmark. In addition, during the transaction phase, some files are deleted soon after the benchmark creates or appends data to those files. Delayed allocation could avoid the need for disk updates of metadata changes entirely in these cases. So we expected Alex's delayed allocation to improve the throughput of this benchmark as well.

The results are shown in Figure 3. At 128 threads, we see that the 2.4.29 kernel had significant scalability problems, which were addressed in the 2.6.10 kernel. At up to 64 threads, there is approximately a 10% to 15% improvement in the transaction rate between Linux 2.4.29 and Linux 2.6.10. With the extents patch set applied to Linux 2.6.10, the transaction rate increases another 10% at 64 threads. In the future, we plan to do further work to determine how much of the additional 10% improvement can be ascribed to the different components of the extents patch set.

More performance results, both of the benchmark tests described above and of additional benchmark tests expected to be done before the 2005 OLS conference, can be found at http://ext2.sourceforge.net/ols05-testing.

6 Future Work

This section will discuss some features that are still on the drawing board.

6.1 64 bit block devices

For a long time the Linux block layer limited the size of a single filesystem to 2 TB (2^32 × 512-byte sectors), and in some cases the SCSI drivers further limited this to 1 TB because of signed/unsigned integer bugs. In the 2.6 kernels there is now the ability to have larger block devices and, with the growing capacity and decreasing cost of disks, the desire to have larger ext3 filesystems is increasing. Recent vendor kernel releases have supported ext3 filesystems up to 8 TB, and ext3 can theoretically be as large as 16 TB before it hits the 2^32 filesystem block limit (for 4 KB blocks and the 4 KB PAGE_SIZE limit on i386 systems). There is also a page cache limit of 2^32 pages in an address space, which are used for buffered block devices. This limit affects both ext3's internal metadata blocks, and the use of buffered block devices when running e2fsprogs on a device to create the filesystem in the first place. So this imposes yet another 16 TB limit on the filesystem size, but only on 32-bit architectures.

However, the demand for larger filesystems is already here. Large NFS servers are in the tens of terabytes, and distributed filesystems are also this large. Lustre uses ext3 as the back-end storage for filesystems in the hundreds of terabytes range by combining dozens to hundreds of individual block devices and smaller ext3 filesystems at the VFS layer; having larger ext3 filesystems would avoid the need to artificially fragment the storage to fit within the block and filesystem size limits.

Extremely large filesystems introduce a number of scalability issues. One such concern is the overhead of allocating space in very large volumes, as described in Section 3.3. Another such concern is the time required to back up and perform filesystem consistency checks on very large filesystems. However, the premier issue with filesystems larger than 2^32 filesystem blocks is that the traditional indirect block mapping scheme only supports 32-bit block numbers. The additional fact that filling such a large filesystem would take many millions of indirect blocks (over 1% of the whole filesystem, at least 160 GB of just indirect blocks) makes the use of the indirect block mapping scheme in such large filesystems undesirable.

Assuming a 4 KB blocksize, a 32-bit block number limits the maximum size of the filesystem to 16 TB. However, because the superblock format currently stores the number of block groups as a 16-bit integer, and because (again on a 4 KB blocksize filesystem) the maximum number of blocks in a block group is 32,768 (the number of bits in a single 4 KB block, for the block allocation bitmap), a combination of these constraints limits the maximum size of the filesystem to 8 TB.
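Spelling out the arithmetic behind these two limits (4 KB blocksize assumed, as above):

    2^32 blocks × 4 KB/block                       = 2^44 bytes = 16 TB
    2^16 groups × 32,768 blocks/group × 4 KB/block = 2^43 bytes =  8 TB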

One of the plans for growing beyond the 8/16 TB boundary was to use larger filesystem blocks (8 KB up to 64 KB), which increases filesystem limits such as group size, filesystem size, and maximum file size, and makes block allocation more efficient for a given amount of space. Unfortunately, the kernel currently limits the size of a page/buffer to the virtual memory page size, which is 4 KB on i386 processors. A few years ago, it was thought that the advent of 64-bit processors like the Alpha, PPC64, and IA64 would break this limit, and that when they became commodity parts everyone would be able to take advantage of them. The unfortunate news is that the commodity 64-bit processor architecture, x86_64, also has a 4 KB page size in order to maintain compatibility with its i386 ancestors. Therefore, unless this particular limitation in the Linux VM can be lifted, most Linux users will not be able to take advantage of a larger filesystem block size for some time.

These factors point to a possible paradigm shift for block allocations beyond the 8 TB boundary. One possibility is to use only larger extent-based allocations beyond the 8 TB boundary. The current extent layout described in Section 3.1 already has support for physical block numbers up to 2^48 blocks, though with only 2^32 blocks (16 TB) for a single file. If, at some time in the future, larger VM page sizes become common, or the kernel is changed to allow buffers larger than the VM page size, then this will allow filesystem growth up to 2^64 bytes and files up to 2^48 bytes (assuming a 64 KB blocksize). The design of the extent structures also allows for additional extent formats, such as full 64-bit physical and logical block numbers, if that is necessary for 4 KB PAGE_SIZE systems, though these would have to be 64-bit in order for the VM to address files and storage devices this large.

It may also make sense to restrict inodes to the first 8 TB of disk, and, in conjunction with the extensible inode table discussed in Section 6.2, use space within that region to allocate all inodes. This leaves the space beyond 8 TB free for efficient extent allocations.

6.2 Extensible Inode Table

Adding a dynamically extensible inode table is something that has been discussed extensively by ext2/3 developers, and the issues that make adding this feature difficult have been discussed before in [15]. Quickly summarized, the problem is a number of conflicting requirements:

• We must maintain enough backup metadata about the dynamic inodes to allow us to preserve ext3's robustness in the presence of lost disk blocks as far as possible.

• We must not renumber existing inodes, since this would require searching and updating all directory entries in the filesystem.

• Given the inode number, the block allocation algorithms must be able to determine the block group where the inode is located.

• The number of block groups may change, since ext3 filesystems may be resized.

Most obvious solutions will violate one or more of the above requirements. There is a clever solution that can solve the problem, however, by using the space counting backwards from 2^31 − 1, or "negative" inode space. Since the number of block groups is limited by 2^32/(8 × blocksize), since the maximum number of inodes per block group is the same as the maximum number of blocks per block group, (8 × blocksize), and since inode numbers and block numbers are both 32-bit integers, the number of inodes per block group in the "negative" inode space is simply (8 × blocksize) − normal-inodes-per-blockgroup. The locations of the inode blocks in the negative inode space are stored in a reserved inode.
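For example, with a 4 KB blocksize, each block group spans 8 × 4096 = 32,768 blocks. If mke2fs had created, say, 8,192 regular inodes per group (this figure is purely illustrative; the real per-group inode count is chosen at format time), the "negative" inode space could hold up to 32,768 − 8,192 = 24,576 additional inodes per group.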

This particular scheme is not perfect, however, since it is not extensible to support 64-bit block numbers unless inode numbers are also extended to 64 bits. Unfortunately, this is not so easy, since on 32-bit platforms the Linux kernel's internal inode number is 32 bits. Worse yet, the ino_t type in the stat structure is also 32 bits. Still, for filesystems that are utilizing traditional 32-bit block numbers, this is doable.

Is it worth it to make the inode table extensible? Well, there are a number of reasons why an extensible inode table is interesting. Historically, administrators and the mke2fs program have always over-allocated the number of inodes, since the number of inodes cannot be increased after the filesystem has been formatted, and if all of the inodes have been exhausted, no additional files can be created even if there is plenty of free space in the filesystem. As inodes get larger in order to accommodate the EA-in-inode feature, the overhead of over-allocating inodes becomes significant. Therefore, being able to initially allocate a smaller number of inodes and add more inodes later as needed is less wasteful of disk space. A smaller number of initial inodes also makes the initial mke2fs take less time, as well as speeding up the e2fsck time.

On the other hand, there are a number of disadvantages of an extensible inode table. First, the "negative" inode space introduces quite a bit of complexity to the inode allocation and read/write functions. Second, as mentioned earlier, it is not easily extensible to filesystems that implement the proposed 64-bit block number extension. Finally, the filesystem becomes more fragile, since if the reserved inode that describes the location of the "negative" inode space is corrupted, the location of all of the extended inodes could be lost.

So will extensible inode tables ultimately be implemented? Ultimately, this will depend on whether an ext2/3 developer believes that it is worth implementing; that is, whether someone considers the extensible inode table an "itch that they wish to scratch." The authors believe that the benefits of this feature only slightly outweigh the costs, but perhaps not by enough to be worth implementing. Still, this view is not unanimously held, and only time will tell.

7 Conclusion

As we have seen in this paper, there has been a tremendous amount of work that has gone into the ext2/3 filesystem, and this work is continuing. What was once essentially a simplified BSD FFS descendant has turned into an enterprise-ready filesystem that can keep up with the latest in storage technologies.

What has been the key to the ext2/3 filesystem's success? One reason is the forethought of the initial ext2 developers to add compatibility feature flags. These flags have made ext2 easily extensible in a variety of ways, without sacrificing compatibility in many cases.

Another reason can be found by looking at the company affiliations of various current and past ext2 developers: Cluster File Systems, Digeo, IBM, OSDL, Red Hat, SuSE, VMWare, and others. Different companies have different priorities, and have supported the growth of ext2/3 capabilities in different ways. Thus, this diverse and varied set of developers has allowed the ext2/3 filesystem to flourish.


The authors have no doubt that the ext2/3 filesystem will continue to mature and come to be suitable for a greater and greater number of workloads. As the old Frank Sinatra song stated, "The best is yet to come."

Patch Availability

The patches discussed in this paper can be found at http://ext2.sourceforge.net/ols05-patches.

Acknowledgments

The authors would like to thank all the ext2/3 developers who have made ext2/3 better. We are especially grateful to Andrew Morton, Stephen Tweedie, Daniel Phillips, and Andreas Gruenbacher for many enlightening discussions and inputs.

We also owe thanks to Ram Pai, Sonny Rao, Laurent Vivier and Avantika Mathur for their help with performance testing and analysis, and to Paul McKenney, Werner Almesberger, and David L. Stevens for reviewing and refining the paper. And lastly, thanks to Gerrit Huizenga, who encouraged us to finally get around to writing and submitting this paper in the first place. :-)

References

[1] ALMESBERGER, W., AND VAN DEN BRINK, B. Active block I/O scheduling system (ABISS). In Linux Kongress (2004).

[2] BLIGH, M. Re: 2.5.70-mm1, May 2003. http://marc.theaimsgroup.com/?l=linux-mm&m=105418949116972&w=2.

[3] BRYANT, R., FORESTER, R., AND HAWKES, J. Filesystem performance and scalability in Linux 2.4.17. In USENIX Annual Technical Conference (2002).

[4] CARD, R., TWEEDIE, S., AND TS'O, T. Design and implementation of the second extended filesystem. In First Dutch International Symposium on Linux (1994).

[5] CORBET, J. Which filesystem for Samba4? http://lwn.net/Articles/112566/.

[6] DILGER, A. E. Online resizing with ext2 and ext3. In Ottawa Linux Symposium (2002), pp. 117–129.

[7] GRUENBACHER, A. POSIX access control lists on Linux. In USENIX Annual Technical Conference (2003), pp. 259–272.

[8] KATCHER, J. Postmark: a new filesystem benchmark. Tech. rep., Network Appliance, 2002.

[9] MORTON, A. Re: [ext2-devel] [RFC] Adding 'delayed allocation' support to ext3 writeback, April 2005. http://marc.theaimsgroup.com/?l=ext2-devel&m=111286563813799&w=2.

[10] PHILLIPS, D. A directory index for ext2. In 5th Annual Linux Showcase and Conference (2001), pp. 173–182.

[11] RAO, S. Re: [ext2-devel] Re: Latest ext3 patches (extents, mballoc, delayed allocation), February 2005. http://marc.theaimsgroup.com/?l=ext2-devel&m=110865997805872&w=2.

[12] SWEENY, A. Scalability in the XFS file system. In USENIX Annual Technical Conference (1996).

[13] TOMAS, A. Speeding up ext2, March 2003. http://lwn.net/Articles/25363/.

[14] TWEEDIE, S. Ext3 journalling filesystem. In Ottawa Linux Symposium (2000).

[15] TWEEDIE, S., AND TS'O, T. Y. Planned extensions to the Linux ext2/3 filesystem. In USENIX Annual Technical Conference (2002), pp. 235–244.

[16] VIVIER, L. Filesystems comparison using sysbench and MySQL, April 2005. http://marc.theaimsgroup.com/?l=ext2-devel&m=111505444514651&w=2.

Legal Statement

Copyright © 2005 IBM.

This work represents the view of the authors and does not necessarily represent the view of IBM.

IBM and the IBM logo are trademarks or registered trademarks of International Business Machines Corporation in the United States and/or other countries.

Lustre is a trademark of Cluster File Systems, Inc.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.

Other company, product, and service names may be trademarks or service marks of others.

References in this publication to IBM products or services do not imply that IBM intends to make them available in all countries in which IBM operates.

This document is provided "AS IS," with no express or implied warranties. Use the information in this document at your own risk.
