
Flash File Systems
Hackers Hut / Linux Kernel

Guido R. Kok

Faculty of Electrical Engineering, Mathematics and Computer Science

University of Twente

August 2008

Abstract

Using flash memory in mass storage devices is an upcoming trend. With its high performance, low energy consumption and shock resistance, flash memory is an appealing alternative to large magnetic disk drives. However, flash memory has properties that warrant special treatment in comparison to magnetic disks. Instead of the slow seek times of magnetic hard drives, flash memory has disadvantages of its own that need to be coped with, such as slow erase times, large erase blocks and blocks that wear out. Several flash file systems have been developed to deal with these shortcomings. An overview and comparison of flash file systems is presented.

1 Introduction

Flash memory is growing to be one of the main means of mass data storage. Compared to the traditional and very common magnetic hard disks, flash memory offers faster access times and better kinetic shock resistance. These two characteristics explain the popularity of flash memory in portable devices, such as PDAs, laptops, mobile phones, digital cameras and digital audio players.

Two types of flash will be considered in this paper: NOR (Negative OR) and NAND (Negative AND) flash. These two types of memory differ in details, but they share the same principles as any flash memory does.

Flash differentiates between write and erase. An erased flash is completely filled with ones, while a flash write flips bits from 1 to 0. The only way to flip bits back from 0 to 1 is to erase.
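This write/erase asymmetry can be modeled in a few lines of C. The sketch below is purely illustrative (it models a single byte, not a real device or driver API): programming a flash cell can only clear bits, so the result of a write is the bitwise AND of the old and new contents, and only an erase restores the all-ones state.

```c
#include <stdint.h>

/* Illustrative model of flash semantics: a program operation can only
 * flip bits from 1 to 0, so the resulting cell contents are the
 * bitwise AND of the old and the newly written data. */
static uint8_t flash_program(uint8_t old_val, uint8_t new_val) {
    return old_val & new_val;   /* 1 -> 0 is possible, 0 -> 1 is not */
}

/* Erasing is the only way back to all ones. */
static uint8_t flash_erase_byte(void) {
    return 0xFF;
}
```

Writing 0xA5 over an erased (0xFF) byte works as expected, but attempting to "overwrite" it with 0xFF afterwards changes nothing, since no bit can go back to 1 without an erase.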

A big difference between flash memory and magnetic hard disks is that erase operations on flash only operate on large blocks. Erases on flash happen at coarse granularities that are powers of two, ranging from 32 kB to 256 kB blocks. Writes can occur at much smaller granularities, from individual bits for NOR flash to 256, 512 or 2048 bytes for NAND flash. More details on NOR and NAND flash will be given in Section 2.

Hardware manufacturers try to use flash memory as a replacement for magnetic hard disks. Before flash memory can threaten the dominant market position of magnetic hard disks as mass storage media, there are several limitations of flash that need to be handled by the filesystems running on it.

1.1 Flash limitations

1. The lifetime of flash memory is finite. It is measured in the number of write-erase cycles on an erase block before the block begins to fail. Most hardware manufacturers guarantee 100,000 cycles on each erase block in their chips. Magnetic hard disks do not have such a limitation.

2. Flash requires out-of-place updates of data, meaning that a new version of the data has to be written to a new block instead of overwriting the old data. Updating in place would require erasing the target erase block before the new data could be written; if an unclean unmount occurred between the erase and the write, data would be lost, as neither the old nor the new version could be retrieved.

3. Erase blocks are far larger than magnetic hard disk sectors or blocks. One erase block on flash is therefore shared by multiple filesystem blocks. Because of the out-of-place updates, erase blocks become partially obsolete. Once free space runs low, a technique called Garbage Collection starts to collect valid filesystem blocks to free up space. For more details, see Section 3.3.


It can be seen that traditional file systems that were designed for magnetic hard disks are not suitable to cope with the flash memory limitations mentioned above. This paper will give an overview of techniques that cope with these problems, and of flash file systems that implement these techniques in their own way.

1.2 Paper layout

In Section 2 a more detailed explanation of NOR and the newer NAND flash is given, pointing out their differences and limitations. Section 3 presents the general block mapping techniques and garbage collection in flash filesystems. Building on these general mapping and garbage collection techniques, Section 4 gives a survey of several flash file systems, including FFS, JFFS, YAFFS, LogFS and UBIFS. Section 5 compares the flash filesystems and concludes the paper.

2 Background

NOR and NAND flash are non-volatile memories that can be electrically erased and reprogrammed. They are successors of EEPROM (Electrically Erasable Programmable Read-Only Memory), which was considered too slow to write to. NOR flash was the first to appear in 1984, while the first NAND principles were presented in 1989.

2.1 NOR Flash

Read operations on NOR flash are the same as reads from RAM, as NOR flash has external address buses that allow NOR memory to be read bit by bit. This direct read access allows NOR flash to be used as eXecute-In-Place (XIP) memory, meaning that programs stored in NOR flash can be executed directly without first copying them into RAM. XIP reduces the need for RAM, but with the disadvantage that compression cannot be used.

Unlocking, erasing and writing NOR memory operate on a block-by-block basis, with blocks typically 64, 128 or 256 kB in size.

Due to the integration of external address buses that allow direct access, the density of NOR memory is low compared to NAND flash. In 2001, a 16 MB NOR array would be considered large, while a 128 MB NAND was already available as a single chip.

2.2 NAND Flash

All operations on NAND flash (read, write, unlock, erase) operate on a block-by-block basis. For NAND flash there are two kinds of blocks. Write blocks, also known as pages, are typically 512 to 4096 bytes in size. Associated with each page are some Out-Of-Band (OOB) bytes, which are used to store ECC codes and other header information. The other kind of NAND block is the erase block. Typical erase block sizes vary from 32 pages of 512 bytes each, for an erase block size of 16 kB, up to 128 pages of 4096 bytes each, for a total erase block size of 512 kB. Programming and reading can be performed on a page basis, while erasing can only be performed on erase blocks. Pages are buffered before a write operation is executed, because each erase block can be written only up to four times before information starts to leak and the block must be erased.
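The geometry arithmetic above is easy to get wrong when reasoning about flash layouts, so here is a minimal worked helper. The sizes are the hypothetical example values from the text, not those of any specific chip:

```c
#include <stddef.h>

/* How many write pages fit in one erase block.
 * Purely arithmetic; values below mirror the examples in the text. */
static size_t pages_per_erase_block(size_t erase_block_bytes,
                                    size_t page_bytes) {
    return erase_block_bytes / page_bytes;
}
```

For the two extremes mentioned in the text: a 16 kB erase block of 512-byte pages holds 32 pages, and a 512 kB erase block of 4096-byte pages holds 128 pages.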

NAND memory has higher capacity and lower cost for two reasons:

• The external buses of NOR flash have been removed, placing the memory cells in series rather than in parallel to the bit lines. This saves space, enhancing the density of memory cells, but at the cost of losing direct access.

• The memory cells inside NAND are not guaranteed error-free when shipped. Bad block management needs to be implemented in the file system in order to handle bad blocks. Because the manufacturers dropped the requirement that each cell must be error-free (except in the first physical block), far higher yields can be achieved, lowering manufacturing costs.


3 Block mapping techniques

The earliest approach to using flash memory is to treat it the same way older filesystems, like FAT, do. FAT treats flash memory as a block device that allows data blocks to be read and written. This is the approach typically used on magnetic hard disks. However, this linear mapping of blocks onto flash addresses gives rise to several problems.

First of all, by rewriting new data to the old location, frequently used data translates to a block that is written frequently. This is no problem on magnetic hard disks, but as mentioned in Section 1.1, flash memory blocks wear out, causing failure of a memory block that has been written too often.

Secondly, data blocks can be much smaller than the physical erase blocks. Writing a small data block to its old location means that a big erase block has to be read into RAM, the appropriate data block is overwritten, the erase block in flash is erased, and the whole erase block in RAM is written back to flash. Clearly this approach is very inefficient.

Finally, if an unclean unmount, such as a power outage, occurs during the above operation, not only the small data block is lost; the entire contents of the erase block can be lost as well.

These problems are solved with more sophisticated block-to-flash mappings and wear leveling.

3.1 Block-mapping in flash

Instead of using a direct mapping of blocks onto physical addresses, the idea of dynamic mapping was developed. Blocks presented by higher layers, such as an operating system, are identified by a virtual block number. This virtual block number is mapped to a physical flash address called a sector (some authors use a different terminology [GT05]). This mapping is stored in RAM and can be updated, giving a virtual block a new physical location. This idea is commonly used in wear leveling techniques as follows: when a virtual block is written to flash, it is not written to its old location but to a new one, and the virtual-block-to-sector mapping is updated with the new physical location of the block. This dynamic mapping approach serves multiple purposes:

1. A frequently updated block is written to a different sector each time it is modified, evening out the wear of different erase blocks.

2. An updated block is written to a new location without the need to erase and rewrite an entire erase block.

3. When power is lost during a write operation, the dynamic mapping makes the update behave as an atomic operation, meaning that no data is lost.

The atomicity mentioned in item 3 is achieved by information stored in the header associated with each data sector. When a block needs to be written to flash, the software searches for a free, erased sector. In this sector and its associated header, all bits are set to 1. To achieve atomicity, three bits in the header are used. Before the block is written, the used bit is cleared (made 0) to indicate that the sector is no longer free. Then the virtual block number is written to the header and the new data is written to the sector. The used bit can be combined with the virtual block number, under the requirement that a virtual block number consisting of all ones is not a valid block number.

Once the data has been written, the so-called valid bit is cleared to indicate that the data in the sector is ready to be read. The last bit, called the obsolete bit, is cleared in the header of the old sector once that sector no longer contains the latest version of the virtual block.

When power is lost during a write operation, the system can be in one of two states. The first occurs when power is lost before the valid bit is cleared. In this case the sector contains data that is not valid, and when the flash is used again this sector is marked obsolete to ready it for erasure. The second inconsistent state occurs after the new sector is marked valid, but before the old sector is marked obsolete. In this case there are two valid data sectors for the same virtual block and the system can choose which one to use. If it is important to pick the most recent version, a two-bit version number can be added to the sector header; of its 4 possible values, version number 0 is considered more recent than 3, as the numbers wrap around.

For more information on these header fields the reader is referred to [GT05] and [Man02].
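The write sequence above can be sketched in C. This is an illustrative model, not a real driver: the flag names, the 64-byte payload and the in-memory "sectors" are all assumptions. Note that the flags follow flash semantics, so every bit starts at 1 in the erased state and a cleared (0) bit means the flag has been set.

```c
#include <stdint.h>
#include <string.h>

/* Header flag bits; a CLEARED bit means the flag is set (flash
 * semantics: bits can only go from 1 to 0 until the next erase). */
#define FLAG_USED     0x01  /* cleared first: sector is claimed       */
#define FLAG_VALID    0x02  /* cleared when the data is fully written */
#define FLAG_OBSOLETE 0x04  /* cleared when a newer copy exists       */

struct sector {
    uint8_t  flags;          /* 0xFF in the erased state              */
    uint16_t virtual_block;  /* 0xFFFF (all ones) means "no block"    */
    uint8_t  data[64];
};

static void sector_erase(struct sector *s) {
    memset(s, 0xFF, sizeof *s);              /* everything back to 1s */
}

/* The atomic update sequence from the text: claim the new sector,
 * write header and data, mark it valid, then obsolete the old copy.
 * A crash between any two steps leaves exactly one of the two
 * recoverable states described above. */
static void write_block(struct sector *new_s, struct sector *old_s,
                        uint16_t vblock, const uint8_t *data, int len) {
    new_s->flags &= (uint8_t)~FLAG_USED;     /* 1: sector no longer free */
    new_s->virtual_block = vblock;           /* 2: header, then payload  */
    memcpy(new_s->data, data, (size_t)len);
    new_s->flags &= (uint8_t)~FLAG_VALID;    /* 3: data ready to read    */
    if (old_s)
        old_s->flags &= (uint8_t)~FLAG_OBSOLETE; /* 4: retire old copy   */
}

/* A sector is the live copy if it is claimed, fully written, and not
 * yet superseded (obsolete bit still at 1). */
static int sector_is_live(const struct sector *s) {
    return !(s->flags & FLAG_USED) &&
           !(s->flags & FLAG_VALID) &&
            (s->flags & FLAG_OBSOLETE);
}
```

After a second write of the same virtual block, the first sector's obsolete bit is cleared and only the new sector reads as live, which is exactly the invariant the recovery procedure relies on.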


3.2 Block mapping data structures

In flash memory, the mapping cannot be stored at a fixed sector due to wear leveling. To find the sector that contains a given block, two kinds of maps have been developed. Direct maps contain the current location of a given block, while inverse maps store the identity of the block held by a given sector. In other words, direct maps allow efficient mapping of blocks to sectors, while inverse maps do the inverse by mapping sectors to blocks.

Inverse maps are stored in the flash itself. The virtual block number in the header of a sector indicates which block the sector belongs to. The virtual block numbers can also be stored in a sector full of block numbers, as long as they are stored within the same erase unit, so that on erase all data and associated block numbers are erased together. The main purpose of these inverse maps is to reconstruct the direct map upon device initialization, such as a mount.

Direct maps are at least partially, if not totally, stored in RAM, which supports fast lookups. When a block is updated and written to a new location, its mapping is updated in RAM. Updating the mapping in flash would not be possible, because flash does not support in-place modification.

To summarize, inverse maps ensure that sectors can be linked back to the blocks they contain, while direct maps, being stored in RAM, allow the system to look up the physical whereabouts of blocks quickly. Direct and inverse maps, as well as the atomicity property, are illustrated in Figure 1.
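The mount-time reconstruction of the direct map from the inverse map can be sketched as follows. This is a toy model under stated assumptions: 8 sectors, 8 virtual blocks, 0xFFFF marking both a free sector header and an unmapped block, and no version bits (so when two sectors claim the same block, the later one simply wins here; a real system would consult the version number described above).

```c
#include <stdint.h>

#define NSECTORS 8
#define NBLOCKS  8
#define NONE     0xFFFF   /* all-ones: free sector / unmapped block */

/* headers[s] is the inverse map: the virtual block number stored in
 * the header of sector s. direct[b] becomes the sector holding block
 * b, i.e. the RAM direct map rebuilt at mount time. */
static void rebuild_direct_map(const uint16_t headers[NSECTORS],
                               uint16_t direct[NBLOCKS]) {
    for (int b = 0; b < NBLOCKS; b++)
        direct[b] = NONE;
    for (int s = 0; s < NSECTORS; s++)
        if (headers[s] != NONE)
            direct[headers[s]] = (uint16_t)s;  /* invert the mapping */
}
```

One linear scan over the sector headers is all that is needed, which is why flash filesystems can afford to keep the direct map in volatile RAM.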

3.3 Garbage Collection

Garbage collection for flash filesystems is based on the principle of "segment cleaning" designed by Rosenblum et al. [RO92]. Data that is no longer needed is not deleted but obsoleted. Obsolete data is still in the flash and occupies space. Obsolete data cannot be deleted at once, as there may be valid data remaining in the same erase block. The implementation, efficiency and activation of garbage collection depend on which file system is used, but it can generally be described in four stages:

1. One or more erase blocks are selected for garbage collection.

2. The valid sectors in each erase block are copied to free sectors in newly allocated erase blocks.

3. The data structures used in the mapping process are updated to reflect the new location of the valid sectors.

4. The erase block is erased and its sectors are added to a free-sector pool. This step might include writing an erase-block header to record details such as an erase counter.
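The four stages above can be sketched on a toy model. Everything here is illustrative: an erase block is just an array of per-sector states, the map update of stage 3 is deliberately omitted, and "erasing" means resetting states to FREE.

```c
/* Toy model: each erase block holds SECS sectors that are FREE,
 * VALID or OBSOLETE. */
#define SECS 4
enum sec_state { FREE, VALID, OBSOLETE };

struct eblock { enum sec_state s[SECS]; };

/* Stages 2-4 of garbage collection on one victim block: relocate
 * valid sectors into a fresh block, then erase the victim. Returns
 * the number of obsolete sectors reclaimed (the GC "efficiency").
 * Stage 3, updating the direct map, is omitted in this sketch. */
static int collect(struct eblock *victim, struct eblock *fresh) {
    int reclaimed = 0, out = 0;
    for (int i = 0; i < SECS; i++) {
        if (victim->s[i] == VALID)
            fresh->s[out++] = VALID;   /* stage 2: copy valid data   */
        else if (victim->s[i] == OBSOLETE)
            reclaimed++;               /* space about to be freed    */
        victim->s[i] = FREE;           /* stage 4: erase the victim  */
    }
    return reclaimed;
}
```

The return value makes the efficiency notion of the next paragraph concrete: a victim with many obsolete sectors reclaims a lot of space per copy performed, while an all-valid victim reclaims nothing.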

The choice of which erase blocks to reclaim, and where to move the valid sectors to, affects the file system in three ways. The first is the efficiency of the garbage collection, measured in how many obsolete sectors are reclaimed in each erased block: the more obsolete sectors in comparison to valid sectors, the higher the efficiency. The second effect is wear leveling, where the choice of the target free block to copy the valid sectors to is influenced by how many times that block has already been used. The last effect is the way these choices affect the mapping data structures, as some garbage collections require only one simple update of a direct map while others may require complex updates.

Wear leveling and garbage collection efficiency often pull in opposite directions. The best example is an erase block filled with static data, i.e. data that is never or rarely updated. In terms of efficiency it would be a terrible mistake to select this block for garbage collection, as there are no obsolete sectors to be freed. However, garbage collecting this block has its use in terms of wear leveling, as such a static block has very low wear. Moving the static data to a block that already has high wear prevents further wear on that block, while making the low-wear erase block available for dynamic data.
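One way to resolve this tension is to fold both concerns into a single victim-selection score. The scoring function below is a hypothetical heuristic of my own for illustration, not one taken from any of the surveyed filesystems: it rewards obsolete sectors (efficiency) and penalizes blocks that have already been erased many times (wear leveling).

```c
/* Hypothetical victim-selection score: higher is a better GC victim.
 * obsolete_sectors rewards reclaimable space; erase_count penalizes
 * already-worn blocks. The weights are arbitrary illustration. */
static int gc_score(int obsolete_sectors, int erase_count,
                    int max_erase_count) {
    return obsolete_sectors * 10
         - (erase_count * 10) / (max_erase_count + 1);
}
```

With such a score, a lightly worn block full of obsolete data is preferred over a heavily worn one with the same amount of garbage, which is the trade-off the paragraph above describes.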

4 File Systems

This section will give an overview of file systems used on flash memory. The first approach is to use a Flash Translation Layer (FTL) and run a normal filesystem, such as FAT, on top of it. The combination of FTL and FAT is used a lot in removable flash devices, as portability is a main requirement. After FTL, background is given on the principles on which dedicated flash file systems work: journaling and logging. Then several flash filesystems are discussed: old file systems such as Microsoft's FFS, currently used filesystems such as JFFS and YAFFS, and promising filesystems still in development, such as LogFS.

Figure 1: Block mapping in a flash device. The gray array on the right is the direct map, which resides in RAM. Each sector contains a header and data. The header contains the virtual block number, an erase counter, a valid and an obsolete bit, as well as an ECC code for error checking and a version number. The virtual block numbers in used sectors constitute the inverse map, from which the direct map can be constructed. The erase counter is used in wear leveling, while the valid and obsolete bits and the version number support the atomicity and consistency of write operations. The ECC code helps to detect errors in failing blocks. Courtesy of Gal et al. [GT05]

4.1 Flash Translation Layer

The first approach for using file systems on flash memory is to emulate a virtual block device, which can then be used by a regular filesystem such as FAT. A Flash Translation Layer (FTL) provides this functionality and takes care of the drawbacks mentioned in Section 1.1. A write of a block to the virtual block device handled by the FTL causes the FTL to do the following:

• The content of the data block is written to flash

• The location of the old data is marked obsolete

• Garbage collection may be activated to free up spacefor later use

• Optionally, depending on the concrete FTL used, more writes to flash may be necessary to update block-to-sector mappings, increment erase and free-block counters, and so on
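The write path listed above can be sketched as a small state machine. This is an illustrative toy FTL under stated assumptions (flat arrays instead of real flash, one byte per sector, hypothetical names), not the patented FTL design:

```c
#include <stdint.h>

#define NSEC 8
enum { FREE_SEC, LIVE_SEC, DEAD_SEC };

struct ftl {
    int     sec_state[NSEC];
    uint8_t sec_data[NSEC];
    int     map[NSEC];       /* virtual block -> sector, -1 if unmapped */
};

/* One virtual-block write: find a free sector, write the content,
 * obsolete the old copy, update the mapping. Returns the sector used,
 * or -1 when no free sector is left (where garbage collection would
 * have to run in a real system). */
static int ftl_write(struct ftl *f, int vblock, uint8_t byte) {
    for (int s = 0; s < NSEC; s++) {
        if (f->sec_state[s] != FREE_SEC)
            continue;
        f->sec_data[s]  = byte;                       /* write content  */
        f->sec_state[s] = LIVE_SEC;
        if (f->map[vblock] >= 0)
            f->sec_state[f->map[vblock]] = DEAD_SEC;  /* obsolete old   */
        f->map[vblock] = s;                           /* update mapping */
        return s;
    }
    return -1;
}
```

Rewriting the same virtual block twice lands in two different sectors, with the first marked dead, which is precisely the out-of-place behavior the bullets describe.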

An FTL keeps track of the current location of each sector in the emulated block device, which makes it a sort of journaling file system. Journaling file systems are handled in the next section. The idea of the FTL stems from a patent owned by A. Ban [Ban95]; this patent was adopted as a PCMCIA standard in 1998 [Cor98b]. FTL was created for NOR memory only, but in 1999 a NAND version of FTL was developed, called NFTL [Ban99]. NFTL is incorporated in the DiskOnChip device.

Because the use of an extra layer between a regular file system and the flash device is inefficient, work started on filesystems made specifically for flash memory. The fact that the use of FTL and NFTL was heavily restricted by all the patents involved further fueled the need for flash-specific filesystems.


4.2 Background: Journaling and Log-structured file systems

Most of the flash-specific file systems are based on the principle of log-structured file systems. This principle is the successor of journaling file systems, which we will explain first.

In journaling file systems, modifications of metadata are stored in a journal before the modifications are made to the data blocks themselves. When a crash has occurred, the recovery process examines the tail of the journal and rolls back or completes each metadata operation, depending on the moment at which the crash occurred. Journaling is used in many current file systems for magnetic hard disks, such as ext3 [Rob] and ReiserFS [Mas] on Linux systems.

The principle of log-structured filesystems was designed for magnetic disks, on which it is currently not used much. However, the idea is very useful in the context of flash memory. Log-structured filesystems are a rather extreme version of journaling filesystems, in the sense that the journal/log is the filesystem. The disk is organized as one long log, which consists of fixed-size segments of contiguous disk fragments, chained together as a linked list. When data and metadata are written, they are appended to the tail of the log, never rewritten in place.

When (meta)data is appended to the end of the log, two problems arise: how to find the new data, and how garbage collection can work properly. In order to find new data, pointers to that data must be updated, and these new pointers are normally also appended to the end of the log. This recursive updating of data and pointers can lead to a snowball effect, so Rosenblum et al. [RO92] came up with the idea of implementing inodes in log-structured filesystems.

Inodes are data structures containing file attributes, such as type, owner and permissions, as well as the physical addresses of the first ten blocks of the file. If the file data consists of more than 10 blocks, the inode points to indirect blocks, which point further to data blocks or lower-level indirect blocks, see Figure 2. The mapping from inodes to the physical locations of the related files is stored in a table kept in RAM. This table is periodically flushed to the log. When a crash occurs the table needs to be reconstructed, so the filesystem searches for the latest copy of this table in the log and scans the remaining part of the log to find files whose position changed after the table was flushed.
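The direct/indirect addressing just described can be sketched as a lookup function. The sizes are assumptions for illustration (ten direct pointers as in the text, an arbitrary indirect-block capacity), and only a single level of indirection is modeled:

```c
#define NDIRECT 10

/* Simplified inode, as in Figure 2: ten direct block addresses plus
 * one single-indirect block (an array of further addresses). */
struct toy_inode {
    int direct[NDIRECT];   /* physical addresses of blocks 0..9 */
    const int *indirect;   /* addresses of blocks 10, 11, ...   */
};

/* Translate a file-relative block index to a physical block address. */
static int block_addr(const struct toy_inode *ino, int file_block) {
    if (file_block < NDIRECT)
        return ino->direct[file_block];
    return ino->indirect[file_block - NDIRECT];
}
```

Small files are resolved in one step through the direct array; only larger files pay for the extra indirect-block lookup, which is the usual inode design trade-off.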

The main advantage of log-structured file systems is the favorable write speed: since writes occur at the end of the log, there are rarely seek or rotational delays. The downside of log-structured filesystems is read performance. Reading can be very slow, as the blocks of a file may be scattered around, especially if blocks were modified at different times. On magnetic hard disks this results in many seeks and rotational delays, which is the main reason log-structured filesystems are rarely used on them.

However, log-structured filesystems are an excellent choice for flash devices: old data cannot be overwritten anyway, so a new version must be written to a new location in any case. Furthermore, read performance does not suffer on flash devices, as flash has uniformly low random-access times. Kawaguchi, Nishioka and Motoda were the first to point out that log-structured filesystems would be very suitable for flash memory [KNM95].

4.3 Microsoft’s Flash File System

In the mid 1990s Microsoft developed a filesystem for removable flash memories, called FFS2. Documentation of the supposedly earlier version FFS1 cannot be found. The first patent used in the development of FFS2 [BL95] describes a system for NOR flash that consists of one large erase unit. This results in a write-once device, with the exception that bits that have not been cleared yet can still be cleared later. FFS2 uses linked lists to keep track of each file, its attributes and its data. When a file is extended, a new record is appended to the end of the linked list, followed by clearing the next field of the last record of the current list (which was all ones before). As can be seen in Figure 3, each record consists of 4 fields: a raw data pointer, which points to the start of a data block; a data size field stating the length of the data used; a replacement pointer, used when updating data; and a next pointer, used when appending data to the file.

Figure 2: An inode at the top of the figure can point directly to data at level 0, or via indirect blocks, which support fragmented and/or big data files. Courtesy of Engel et al. [EBM07]

Figure 3: The data structure of the Microsoft Flash File System. The figure shows a linked-list element pointing to a block of 20 raw bytes in a file. Courtesy of Gal et al. [GT05]

Updates within the data of a file are a more difficult problem. Because records point to raw data and data is written only once, the replacement pointer is used to indicate that the data pointer and next pointer are no longer valid. The replacement pointer points to a new record that uses part of the old data, while a second record points to the updated data. Figure 4 shows this lengthy and cumbersome approach.

As can be seen, a big drawback of FFS2 is that for dynamic data which changes frequently, a long linked list has to be traversed in order to access the data. Suppose we have a file whose first 5 bytes are updated 10 times. When we try to access the file, a chain of 10 invalid records needs to be traversed before the record is reached that points to the most recent data. This drawback is caused by the design decision to keep objects at static addresses, meaning that each file starts at the same physical address, no matter which version it is. This design makes it easy to find things in the filesystem, but requires long and inefficient traversals of invalid data chains to find the current data. The log-structured approach makes it more difficult to locate objects, as they are moved around on updates, but once an object is found, sectors pointing to invalid data do not need to be traversed.
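The cost of those invalid-record chains can be made concrete with a small sketch. The struct is a loose simplification of the record of Figure 3 (only the replacement pointer and a stand-in data address are modeled; field names are illustrative, not Microsoft's actual layout):

```c
#include <stddef.h>

/* Simplified FFS2-style record: once superseded, the replacement
 * pointer leads toward the newer version of the data. */
struct record {
    const struct record *replacement;  /* non-NULL once superseded */
    int data_addr;                     /* stand-in for the raw data */
};

/* Every read must walk the replacement chain to the live record;
 * *hops counts the invalid records traversed on the way. */
static const struct record *find_current(const struct record *r,
                                         int *hops) {
    *hops = 0;
    while (r->replacement) {
        r = r->replacement;
        (*hops)++;
    }
    return r;
}
```

Each update lengthens the chain by at least one node, so a field updated n times costs on the order of n pointer dereferences per access, which matches the 10-update example above.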

Figure 4: Updating in FFS2. The data structure is modified to accommodate an update of 5 of the 20 bytes. The data and next-in-list pointers of the original node are invalidated. The replacement pointer, which was originally free (all 1s; marked in gray in the figure), is set to point to a chain of 3 new nodes, two of which point to still-valid data within the existing block, and one of which points to a new block of raw data. The last node in the new chain points back to the tail of the original list. Courtesy of Gal et al. [GT05]

Douglis et al. reported very poor write performance for FFS2 in 1994 [DCK+94], which is presumed to be the main reason why it failed and was declared obsolete by Intel in 1998 [Cor98a].

4.4 JFFS

4.4.1 JFFS1

The Journaling Flash File System (JFFS1) was developed by Axis Communications AB [AC04], designed to be used in Linux embedded systems. JFFS1 was designed for NOR flash memory only. JFFS1 is a purely log-structured filesystem. Nodes containing metadata and possibly data are stored on the NOR flash in a circular log fashion. In JFFS1 there is only one type of node, the struct jffs_raw_inode, which is associated with a single inode by the inode number in its header. Next to the inode number, the header holds a version number of the node and filesystem metadata. The node may also carry a variable amount of data.

When the flash device is mounted, a scan is made of the entire medium. With the information found in the nodes, a complete direct map is reconstructed and stored in RAM. When a node is superseded by a newer node, the older node is marked obsolete. When storage space runs low, garbage collection kicks in. Garbage collection examines the head of the circular log, moves valid nodes to the tail of the log and marks the nodes at the head of the log obsolete. Once a complete erase block is rendered obsolete, it is erased and made available for reuse by the tail of the log.

JFFS1 has several drawbacks:

• At mount time, the entire device must be scanned to construct the direct map. This scanning process can be very slow, and the space occupied in RAM by the direct map can be quite large, proportional to the number of files in the file system.

• Due to the circular log design, the erase block at the head of the log is always reclaimed, even if it contains only valid nodes, which must then all be copied to the tail. This is not only inefficient, it is also bad for wear leveling.

• Compression is not supported.

• Hard links are not supported.

• JFFS1 does not support NAND flash.


4.4.2 JFFS2

David Woodhouse of Red Hat enhanced JFFS1 into JFFS2 [Woo01]. Compression using zlib, rubin or rtime is available, and hard links and NAND flash memory are now supported. Instead of one type of node, JFFS2 uses three types of nodes:

• inodes: just like the struct jffs_raw_inode in JFFS1, but without the file name or parent inode number. An inode is removed once the last directory entry referring to it has been unlinked.

• dirent nodes: directory entries, holding a name and an inode number. Hard links are maintained with different names but the same inode number. A link is removed by writing a dirent node with a higher version number, having the same name but with target inode number 0.

• cleanmarker nodes: such a node is written into an erased block to signal that the block has been properly erased, for use by the scan at mount time.

Like in JFFS1, nodes with a lower version number than the most recent one are considered obsolete. Instead of the circular log of JFFS1, the filesystem deals in blocks, which correspond to physical erase blocks in the flash device. A block containing only valid nodes is called clean, blocks having at least one obsolete node are called dirty, and a free block contains only the cleanmarker node.

When a JFFS2 system is mounted, the system scans all nodes in the flash device and constructs two data structures, called struct jffs2_inode_cache and struct jffs2_raw_node_ref. The first is a direct map from each inode number to the start of a linked list of the physical nodes which belong to that inode. The second structure represents each valid node on the flash and contains two linked-list pointers, one pointing to the next node in the same physical block, and the other pointing to the next node belonging to the same inode. Figure 5 shows how these two data structures interconnect.


Figure 5: Two data structures in JFFS2. Each inode is represented by a struct jffs2_inode_cache, which points to the start of the chain of nodes representing the file. To indicate the end of the chain, the last node points back to the inode. Courtesy of Gal et al. [GT05]



Garbage collection frees up dirty blocks, turning them into free blocks. To provide wear leveling on semi-static data, JFFS2 picks a clean block instead of a dirty one once every 100 selections. The big drawback of JFFS2 remains its mount time: as JFFS2 also supports NAND and thus bigger flash devices, the time to scan the whole device becomes a serious problem.
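The 1-in-100 clean-block rule can be sketched as below. This is an illustrative model of the stated policy, not the actual JFFS2 selection code: ordinarily garbage collection picks a dirty block, but every hundredth selection it picks a clean one, so blocks holding long-lived data rejoin the wear-leveling rotation.

```c
#include <assert.h>

/* Which kind of block the garbage collector should pick next. */
enum { PICK_DIRTY = 0, PICK_CLEAN = 1 };

/* Every 100th selection targets a clean block, so semi-static data
 * gets rewritten occasionally and its erase blocks accumulate wear
 * at a rate comparable to the rest of the device. */
static int jffs2_pick_kind(unsigned selection_count)
{
    return (selection_count % 100 == 99) ? PICK_CLEAN : PICK_DIRTY;
}
```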

4.5 YAFFS

Yet Another Flash File System was developed by Charles Manning of Aleph One [Man02]. YAFFS is the first NAND-only flash file system. YAFFS was made for NAND flash with 512-byte chunks and 16-byte headers (see Section 2.2), while YAFFS2 supports bigger NAND chips with 1KB or 2KB pages and respectively 30- and 42-byte headers. Because the earliest NAND flash memory with 512-byte chunks allowed up to three writes to the same area before an erasure was needed, YAFFS1 marked chunks as obsolete by rewriting a field in the header of each chunk. YAFFS2 required a more complex arrangement to obsolete chunks, as newer flash supports only a single write before erasure is needed. In YAFFS2 every header contains not only a file ID and the position within the file, but also a sequence number. When multiple chunks with the same file ID and position within the file are encountered, the chunk with the highest sequence number counts and the others are considered obsolete. When the system boots, a scan is performed to create a direct map that maps files to chunks using a tree-like structure, see Figure 6. To speed up the scan, YAFFS incorporates checkpointing, saving the RAM map to flash before a clean unmount. When the system boots, it reads the flash device from the end to the beginning, encountering the checkpoint fairly fast. Any write to the filesystem after the creation of a checkpoint renders the checkpoint invalid.
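The YAFFS2 obsoletion rule can be sketched as follows. The header layout here is a hypothetical simplification carrying only the three fields the text mentions; the real on-NAND spare-area format differs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical YAFFS2-style chunk header: just the fields needed
 * for the scan-time matching rule described in the text. */
struct chunk_hdr {
    uint32_t obj_id;     /* file ID */
    uint32_t chunk_pos;  /* position of the chunk within the file */
    uint32_t seq;        /* sequence number */
};

/* When two chunks claim the same file position during the boot scan,
 * the one with the higher sequence number is current; the other is
 * obsolete.  Chunks for different positions are unrelated. */
static const struct chunk_hdr *pick_current(const struct chunk_hdr *a,
                                            const struct chunk_hdr *b)
{
    if (a->obj_id != b->obj_id || a->chunk_pos != b->chunk_pos)
        return 0;  /* not the same logical chunk */
    return (a->seq > b->seq) ? a : b;
}
```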


Figure 6: yaffs Tnode tree of data chunks in a file. If the file grows in size, the number of levels increases. Each Tnode is 32 bytes big. Level 0 (i.e. the lowest level) has 16 2-byte pointers to data chunks. Higher-level Tnodes comprise 8 4-byte pointers to other Tnodes lower in the tree.
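From the figure's numbers one can work out how the tree height relates to file size: a level-0 Tnode covers 16 chunks, and every extra level multiplies the coverage by 8. The arithmetic below is a sketch derived from those figures, not code from YAFFS itself.

```c
#include <assert.h>
#include <stdint.h>

/* Number of data chunks addressable by a Tnode tree of the given
 * height: 16 chunk pointers at level 0, fan-out of 8 above that. */
static uint64_t tnode_capacity(unsigned levels)
{
    uint64_t chunks = 16;
    while (levels--)
        chunks *= 8;
    return chunks;
}

/* Smallest tree height whose capacity covers a file of n chunks;
 * this is why the levels increase as the file grows. */
static unsigned levels_needed(uint64_t n_chunks)
{
    unsigned lv = 0;
    while (tnode_capacity(lv) < n_chunks)
        lv++;
    return lv;
}
```

For 512-byte chunks, a one-level tree already covers 16 × 8 = 128 chunks, i.e. a 64KB file.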

Garbage collection comes in a deterministic mode and an aggressive mode. The former is the normal mode, activated when a write occurs. When a write has been completed and there is a block that is completely filled with discarded chunks, it is garbage collected. The aggressive mode is activated once free space is running low, collecting blocks that contain valid chunks, copying the valid chunks to a free block and erasing the old erase block. Wear leveling is not of a high priority, as the authors argue that NAND devices already ship with bad blocks, so the filesystem needs to take care of bad blocks anyway. Uneven wear will only lead to loss of storage capacity, not to errors, as bad blocks are handled by the filesystem. YAFFS does not support compression.
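The two collection modes reduce to a simple eligibility rule, sketched here under the assumptions of the text (the actual YAFFS heuristics involve more state than this):

```c
#include <assert.h>
#include <stdbool.h>

/* Per-erase-block bookkeeping: how many chunks are still valid
 * and how many have been discarded (obsoleted). */
struct yblock {
    int valid_chunks;
    int discarded_chunks;
};

/* Deterministic mode (aggressive == false) only reclaims blocks that
 * are entirely discarded.  Aggressive mode, triggered by low free
 * space, also takes blocks with valid chunks, which are then copied
 * to a free block before the old block is erased. */
static bool collectable(const struct yblock *b, bool aggressive)
{
    if (b->discarded_chunks == 0)
        return false;               /* nothing to reclaim */
    if (aggressive)
        return true;                /* accept the copy cost */
    return b->valid_chunks == 0;    /* passive: fully discarded only */
}
```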

4.6 LogFS

LogFS is a creation of Engel et al. [EM] [EBM07] as a response to user comments that JFFS2 and YAFFS have high RAM usage and long mount times. The flash medium is split into segments; each segment consists of multiple erase blocks. LogFS structures the device into three storage areas:

• Superblock (1 segment)

• Journal (2-8 segments)

• Object store


The superblock contains global information such as the file system type. The journal will be discussed later. The object store consists of segments in which all but the last erase block are normal data blocks. The last block of each segment contains a summary listing, for each data block, its inode number, its logical position in a file and the physical offset of the block. Next to these block-specific fields, the summary maintains segment-global information such as erase count, write time, etc. When an update of data occurs, the data block is rewritten out of place, so the pointer referring to the data must be updated (also out of place), which in turn requires an update of each parent node of that node. So basically each change at the bottom of the tree propagates upward all the way to the root. This method of updating the tree bottom-up is known as the wandering tree algorithm. A crash before the root node has been rewritten only causes the loss of the last operation, as the root node still points to the previous data and structure.
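The crash-safety property of the wandering tree can be shown with a toy model. This is a deliberately minimal sketch, not LogFS code: each level of the update path has an "old" and a "new" copy, and readers only see the new copies once the root is republished.

```c
#include <assert.h>

/* Toy wandering-tree: the file data is visible only through whichever
 * root is currently published.  An update writes the new data (and,
 * in the real tree, every new ancestor) out of place first; only the
 * final root switch makes the change visible. */
struct wtree {
    int published_root;  /* 0 = old copies, 1 = new copies */
    int value[2];        /* data reachable via old/new root */
};

/* Write the new data block and parents out of place; the published
 * tree is untouched, so a crash at this point loses only this update. */
static void begin_update(struct wtree *t, int new_value)
{
    t->value[1 - t->published_root] = new_value;
}

/* Rewriting the root atomically switches readers to the new tree. */
static void commit_root(struct wtree *t)
{
    t->published_root = 1 - t->published_root;
}

/* What a reader sees: always a consistent tree. */
static int read_value(const struct wtree *t)
{
    return t->value[t->published_root];
}
```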

Because inodes do not have reserved areas on flash devices, LogFS stores the inodes in an inode file (ifile). The root inode of this ifile is stored in the journal. This design of treating the ifile like normal files not only simplifies the code (file writes and inode writes are identical now), it also makes it possible to use hard links. Figure 7 shows the setup of the ifile and normal files. All data, inodes, indirect blocks and the ifile inode are stored in flash, with the ifile inode stored in the journal. The journal is a circular log, but much smaller than the log used in JFFS2, in which the log is the filesystem. In LogFS the small journal is filled with ifile inodes, being tuples of a version number and an offset. This offset points to the root node of the ifile tree. Upon mounting, the system does not perform a full scan but only scans the superblock and the journal to find the most recent version of the root node. This approach improves the mount time by a big factor (Jorn Engel states that an OLPC system mount goes from 3.3 seconds under JFFS2 to 60 ms under LogFS [Cor07]). As each updated and new data block would indirectly lead to a new version of the root node, the erase blocks containing the journal would wear at a rapid pace. Two solutions counter this aggressive wearing: write buffering and journal replacing. Write buffering stores updates in a buffer before applying them to the flash device. This not only increases write speed but also decreases the number of inode updates, as some data updates may share the same direct or indirect parent inode.

Journal replacing is activated when the journal's erase blocks are worn too much. A clean segment is designated as the new journal, and the first entry in the first journal points to this new journal. As LogFS is still under development, wear leveling is not optimized yet. As of January 2007, the segment picked for writing is the first empty segment encountered when scanning some segments ahead. Future development would optimize the wear leveling on the basis of age and/or erase count. A free-space counter is maintained in the journal, and when space is running out, garbage collection comes into play. Due to the wandering tree algorithm and the fact that the tree is stored in flash itself, garbage collection in LogFS is complex. When garbage collection is needed on a segment containing valid nodes, one free segment is needed for each level of the tree, because blocks on different levels should be written to different segments and blocks on the same level should be written to the same segment. Because of this, LogFS becomes slow when the device is getting full. The author states that LogFS is designed for big flash devices, ranging from gigabytes upwards. For smaller flash devices, the author recommends using JFFS2. LogFS was planned to be included in the 2.6.25 Linux kernel. LogFS has a code size of around 8 KLOC.

4.7 UBI and UBIFS

4.7.1 Unsorted Block Images - UBI

UBI is a flash management layer with almost the same functionality as the Logical Volume Manager (LVM) on hard drives, but with additional functions. It is designed by IBM [TG06]. UBI runs on top of a flash device, and UBIFS runs on top of UBI (see Figure 8). UBI has the following relevant functionalities:

• Bad block management

• Wear leveling across all physical flash

• Logical to physical block mapping


Figure 7: LogFS. Combination of inode file and normal file tree structure. Directory entries are inodes with no pointer to data. Courtesy of Engel et al. [EBM07]

As we see, UBI hides these functions from higher-layered filesystems. UBI provides a UBI volume to higher layers, consisting of logical erase blocks. Higher-layer filesystems may rewrite a logical erase block over and over without danger of wearing, because UBI transparently changes the mapping to another physical erase block when it is time. UBI is not an FTL, as it was designed for bare flashes and not for flash devices such as MMC/SD cards, USB sticks, CompactFlash and so on. As such, neither ext2 nor other "traditional" file systems can run on top of a UBI device. UBI weighs around 11 KLOC. For more information on UBI, the reader is referred to [TG06].
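The transparent remapping can be sketched as a table from logical erase blocks (LEBs) to physical erase blocks (PEBs). This is an illustrative model only; the selection policy (least-worn free PEB) is an assumption here, and real UBI also persists the mapping in on-flash headers.

```c
#include <assert.h>

enum { NPEB = 4 };  /* tiny device for illustration */

struct ubi {
    int map[NPEB];            /* LEB -> PEB, -1 = unmapped */
    unsigned erase_cnt[NPEB]; /* per-PEB erase counter */
};

/* Rewriting a LEB: pick the least-worn currently unmapped PEB,
 * erase it, and move the LEB's mapping there.  The previously
 * mapped PEB becomes free for a later remap, so wear spreads
 * across the whole device even if one LEB is rewritten forever. */
static int ubi_remap(struct ubi *u, int leb)
{
    int used[NPEB] = {0};
    for (int l = 0; l < NPEB; l++)
        if (u->map[l] >= 0)
            used[u->map[l]] = 1;

    int best = -1;
    for (int p = 0; p < NPEB; p++)
        if (!used[p] && (best < 0 || u->erase_cnt[p] < u->erase_cnt[best]))
            best = p;
    if (best < 0)
        return -1;             /* no free PEB */

    u->erase_cnt[best]++;      /* erase before reuse */
    u->map[leb] = best;
    return best;
}
```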

4.7.2 UBI Filesystem - UBIFS

UBIFS is developed by Nokia engineers with the help of the University of Szeged [Hun08]. UBIFS is designed to work on top of UBI volumes; it cannot operate directly on top of MTD devices or FTLs [Hun08]. Basically, the whole setup of UBI, UBIFS and MTD is as follows (see also Figure 8):

Figure 8: Layered structure of UBI, UBIFS and the flash device.


• MTD subsystem, providing a uniform interface to access raw flash

• UBI subsystem, the volume manager providing wear leveling, bad block management and logical-to-physical erase block mapping

• UBIFS filesystem, providing all other functionality filesystems should provide

In contrast, FFS, JFFS2, YAFFS and LogFS work directly on top of raw MTD devices. As UBIFS runs on top of a UBI volume, it is not provided with physical erase blocks but with logical erase blocks (LEBs). As such, UBIFS does not need to take care of wear leveling, as that is handled by the UBI layer. Just like LogFS, UBIFS uses a wandering tree, similar to the one pictured in Figure 7.

There are six areas in UBIFS whose positions are fixed at filesystem creation. The first area is the superblock, using one LEB. The second area consists of two LEBs filled with master nodes, which store the positions of all on-flash data structures that do not have fixed logical positions. To prevent data corruption, two LEBs are used instead of one.

The third fixed area is the log of UBIFS, designed to reduce the frequency of updates to the on-flash tree, as updated nodes can share the same parent. The log is part of the journal. Nodes that are updated are placed in the journal, and the index tree in memory (called the TNC) is updated. Once the journal is full, it is committed. The commit process consists of writing the new version of the on-flash tree and the corresponding master node. This process is based on two special types of nodes stored in the log: the commit start node, which records that a commit has begun, and the reference nodes, which record the LEB numbers of the LEBs in the rest of the journal. Those LEBs are called buds, so the journal consists of the log and the buds. The start of a commit is recorded by the commit start node, while the end of a commit is defined by the master node having been written. After that the reference nodes are obsolete and can be deleted.
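The point of the journal is amortization: many node updates share one expensive rewrite of the on-flash index. The toy model below illustrates that trade-off; the journal capacity and the counters are illustrative assumptions, not UBIFS parameters.

```c
#include <assert.h>

enum { JOURNAL_CAP = 8 };  /* illustrative journal size, in nodes */

struct ubifs_sim {
    int journal_fill;   /* nodes appended since the last commit */
    int index_writes;   /* expensive on-flash tree (index) rewrites */
    int node_writes;    /* cheap journal appends */
};

/* Each update is a cheap journal append (plus an in-memory TNC
 * update, not modeled).  Only when the journal fills does a commit
 * rewrite the on-flash index, so the index cost is shared across
 * JOURNAL_CAP updates. */
static void write_node(struct ubifs_sim *s)
{
    s->node_writes++;
    if (++s->journal_fill == JOURNAL_CAP) {
        s->index_writes++;   /* commit: write index + master node */
        s->journal_fill = 0;
    }
}
```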

The fourth area is the LEB properties tree (LPT), a tree in which each leaf node holds information on a LEB in the main area (discussed below): its free space, its dirty space and whether the LEB is an index erase block or not. Index nodes (being part of the on-flash tree) and non-index nodes are kept separate, meaning erase blocks contain either only index nodes or only non-index nodes. The free space can be used for new writes and the dirty-space counter is used in garbage collection. The LPT is updated only during a commit. The on-flash tree and the LPT represent the filesystem just after the last commit. The difference between these two and the actual state of the filesystem is represented by the nodes in the journal.

The fifth area is called the orphan area, holding the inode numbers of inodes whose link count is zero. After a commit these inodes appear in the tree as leaves with no parent. This can happen when an unclean unmount occurs while an open file has been unlinked and committed. To delete these orphan inodes after an unclean unmount, either the entire on-flash tree must be scanned for unlinked leaf nodes, or a list of orphans must be kept somewhere. UBIFS incorporates the latter with the orphan area. When the link count of an inode drops to zero, the inode number is added to the orphan area as a leaf of the orphan tree. These inode numbers are deleted when the corresponding inode is deleted.

The sixth and last area is the main area, containing the data nodes and the on-flash tree (also called the index). As described earlier, main-area LEBs are filled with either index nodes or non-index nodes. When a UBIFS filesystem is mounted, the LPT and the on-flash tree are scanned, after which the journal is replayed to obtain the correct state of the filesystem. The UBIFS code size is around 30 KLOC.

5 Comparison and Conclusion

Flash memory is growing rapidly in speed, capacity and popularity. Newer flash devices with bigger storage and higher speeds appear constantly, often at the expense of ease of use. This trend requires constant development of software techniques for these newer flash devices.

Several approaches have been discussed, from the inefficient and potentially dangerous Flash Translation Layer and Microsoft's early, abandoned Flash FS, to more advanced dedicated flash filesystems like JFFS, YAFFS, LogFS and UBIFS. The first filesystems were designed for NOR flash, while nowadays NAND flash is commonly used in flash devices.

FTL is commonly used on removable devices like USB sticks, because so far the only filesystem supported by every system is FAT. This approach of FTL with a traditional filesystem works, but it is inefficient and potentially dangerous, as FTL does not treat the properties of flash memory properly, even at the cost of potential data loss in case of a crash.

Microsoft's FFS had very poor performance and was abandoned early, but it gave other developers ideas for further development. JFFS was the first dedicated flash file system that brought good performance. JFFS1 focused on NOR memory; JFFS2 was released later with several serious improvements, including NAND and hard-link support. JFFS scans the whole device at mount time, and with the introduction of NAND flash and its enormous size potential, the mount time became the major disadvantage of JFFS. The full structure of JFFS remains in memory, laying a heavy burden on RAM capacity.

YAFFS was developed for NAND flash only, to cope with the long scan time and high RAM usage of JFFS2. YAFFS maintains a smaller tree structure in RAM and supports checkpointing, decreasing the mount scan time dramatically if and only if the device is unmounted properly. In case of a crash the whole device needs to be scanned again. The author of YAFFS states it is better to use JFFS2 on devices smaller than 64MB and YAFFS on bigger devices.

LogFS solves the mount-time problem and high RAM usage by maintaining the tree structure in flash itself, rather than reconstructing it with a scan and keeping it in RAM only. LogFS is created for large NAND devices of 1 GB and bigger, and performance drops when the device is almost full with valid data. The LogFS author states it is better to use JFFS2 for smaller devices.

UBI and UBIFS introduce a new, layered approach for flash filesystems, providing transparency and simplicity to the higher-layered file system. This approach seems very promising, as other filesystems can be adapted to work on top of UBI, as a patched JFFS2 already demonstrates. UBIFS also maintains on-flash trees to minimize mount times. UBIFS and LogFS were developed around the same time, and changes were being implemented at the moment of writing. The codebase of the UBI/UBIFS combination is quite large in comparison to LogFS: 11/30 versus 8 KLOC respectively.

References

[AC04] Axis Communications, Lund, Sweden. JFFS homepage. http://developer.axis.com/software/jffs/, 2004.

[Ban95] A. Ban. Flash file system. US Patent 5,404,485. Filed March 8, 1993; issued April 4, 1995; assigned to M-Systems. 1995.

[Ban99] A. Ban. Flash file system optimized for page-mode flash technologies. US Patent 5,937,425. Filed October 16, 1997; issued August 10, 1999; assigned to M-Systems. 1999.

[BL95] S. D. Barrett, P. L. Quinn, and R. A. Lipe. System for updating data stored on a flash-erasable, programmable, read-only memory (FEPROM) based upon predetermined bit value of indicating pointers. US Patent 5,392,427. Filed May 18, 1993; issued February 21, 1995; assigned to Microsoft. 1995.

[Cor98a] Intel Corporation. Flash file system selection guide. Application Note 686, 1998.

[Cor98b] Intel Corporation. Understanding the flash translation layer (FTL) specification. Application Note 648, 1998.

[Cor07] J. Corbet. LogFS. http://lwn.net/Articles/234441/, 2007.

[DCK+94] F. Douglis, R. Caceres, M. F. Kaashoek, K. Li, B. Marsh, and J. A. Tauber. Storage alternatives for mobile computers. In Proceedings of the First USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 25–37, Monterey, California, 1994. ACM.

[EBM07] Jorn Engel, Dirk Bolte, and Robert Mertens. Garbage collection in LogFS. http://www.logfs.org/logfs/, 2007.

[EM] Jorn Engel and Robert Mertens. LogFS - finally a scalable flash file system. http://lazybastard.org/~joern/logfs1.pdf.

[GT05] Eran Gal and Sivan Toledo. Algorithms and data structures for flash memories. ACM Computing Surveys, 37(2):138–163, 2005.

[Hun08] A. Hunter. A brief introduction to the design of UBIFS. http://www.linux-mtd.infradead.org/doc/ubifs_whitepaper.pdf, 2008.

[KNM95] Atsuo Kawaguchi, Shingo Nishioka, and Hiroshi Motoda. A flash-memory based file system. In USENIX Winter, pages 155–164, 1995.

[Man02] Charles Manning. YAFFS: Yet another flash filing system. http://www.yaffs.net, September 2002.

[Mas] Chris Mason. Journaling with ReiserFS. http://www.linuxjournal.com/article/4466.

[RO92] Mendel Rosenblum and John K. Ousterhout. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems, 10(1):26–52, 1992.

[Rob] Daniel Robbins. Introducing ext3. http://www-128.ibm.com/developerworks/linux/library/l-fs7.html.

[TG06] T. Gleixner, F. Haverkamp, and A. Bityutskiy. UBI - unsorted block images. http://www.linux-mtd.infradead.org/doc/ubidesign/ubidesign.pdf, 2006.

[Woo01] David Woodhouse. JFFS: The journaling flash file system. Presented at the Ottawa Linux Symposium, July 2001 (no proceedings); a 12-page article is available online at http://sources.redhat.com/jffs2/jffs2.pdf, 2001.
