using crash hoare logic for certifying the fscq file...

20
Using Crash Hoare Logic for Certifying the FSCQ File System Haogang Chen, Daniel Ziegler, Tej Chajed, Adam Chlipala, M. Frans Kaashoek, and Nickolai Zeldovich MIT CSAIL Abstract FSCQ is the first file system with a machine-checkable proof (using the Coq proof assistant) that its implementation meets its specification and whose specification includes crashes. FSCQ provably avoids bugs that have plagued previous file systems, such as performing disk writes without sucient barriers or forgetting to zero out directory blocks. If a crash happens at an inopportune time, these bugs can lead to data loss. FSCQ’s theorems prove that, under any sequence of crashes followed by reboots, FSCQ will recover the file sys- tem correctly without losing data. To state FSCQ’s theorems, this paper introduces the Crash Hoare logic (CHL), which extends traditional Hoare logic with a crash condition, a recovery procedure, and logical address spaces for specifying disk states at dierent abstraction levels. CHL also reduces the proof eort for developers through proof automation. Using CHL, we developed, specified, and proved the correctness of the FSCQ file system. Although FSCQ’s design is relatively simple, experiments with FSCQ running as a user-level file system show that it is sucient to run Unix applications with usable performance. FSCQ’s specifications and proofs required significantly more work than the implementation, but the work was manageable even for a small team of a few researchers. 1 Introduction This paper describes Crash Hoare logic (CHL), which allows developers to write specifications for crash-safe storage sys- tems and also prove them correct. “Correct” means that, if a computer crashes due to a power failure or other fail-stop fault and subsequently reboots, the storage system will recover to a state consistent with its specification (e.g., POSIX [34]). For example, after recovery, either all disk writes from a file system call will be on disk, or none will be. Using CHL we build the FSCQ certified file system, which comes with a machine-checkable proof that its implementation is correct. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author. Copyright is held by the owner/author(s). SOSP’15, Oct. 4–7, 2015, Monterey, California, USA. ACM 978-1-4503-3834-9/15/10. http://dx.doi.org/10.1145/2815400.2815402 Proving that a file system is crash-safe is important, because it is otherwise hard for the file-system developer to ensure that the code correctly handles all possible points where a crash could occur, both while a file-system call is running and during the execution of recovery code. Often, a system may work correctly in many cases, but if a crash happens at a particular point between two specific disk writes, then a problem arises [55, 70]. Current approaches to building crash-safe file systems fall roughly into three categories (see §2 for more details): testing, program analysis, and model checking. Although they are ef- fective at finding bugs in practice, none of them can guarantee the absence of crash-safety bugs in actual implementations. This paper focuses precisely on this issue: helping developers build file systems with machine-checkable proofs that they correctly recover from crashes at any point. Researchers have used theorem provers for certifying real- world systems such as compilers [45], small kernels [43], ker- nel extensions [61], and simple remote servers [30], but none of these systems are capable of reasoning about file-system crashes. Reasoning about crash-free executions typically in- volves considering the states before and after some operation. Reasoning about crashes is more complicated because crashes can expose intermediate states. Challenges and contributions. Building an infrastructure for reasoning about file-system crashes poses several chal- lenges. Foremost among those challenges is the need for a specification framework that allows the file-system developer to state the system behavior under crashes. Second, it is im- portant that the specification framework allows for proofs to be automated, so that one can make changes to a specifica- tion and its implementation without having to redo all of the proofs manually. Third, the specification framework must be able to capture important performance optimizations, such as asynchronous disk writes, so that the implementation of a file system has acceptable performance. Finally, the specifica- tion framework must allow modular development: developers should be able to specify and verify each component in iso- lation and then compose verified components. For instance, once a logging layer has been implemented, file-system devel- opers should be able to prove end-to-end crash safety in the inode layer by simply relying on the fact that logging ensures atomicity; they should not need to consider every possible crash point in the inode code. To meet these challenges, this paper makes the following contributions: 18

Upload: others

Post on 18-Aug-2021

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Using Crash Hoare Logic for Certifying the FSCQ File Systemhuang/cs718/spring18/readings/fscq.pdfHoare logic (CHL), which extends traditional Hoare logic with a crash condition, a

Using Crash Hoare Logic for Certifying the FSCQ File System

Haogang Chen, Daniel Ziegler, Tej Chajed, Adam Chlipala, M. Frans Kaashoek, and Nickolai ZeldovichMIT CSAIL

AbstractFSCQ is the first file system with a machine-checkable proof(using the Coq proof assistant) that its implementation meetsits specification and whose specification includes crashes.FSCQ provably avoids bugs that have plagued previous filesystems, such as performing disk writes without sufficientbarriers or forgetting to zero out directory blocks. If a crashhappens at an inopportune time, these bugs can lead to dataloss. FSCQ’s theorems prove that, under any sequence ofcrashes followed by reboots, FSCQ will recover the file sys-tem correctly without losing data.

To state FSCQ’s theorems, this paper introduces the CrashHoare logic (CHL), which extends traditional Hoare logic witha crash condition, a recovery procedure, and logical addressspaces for specifying disk states at different abstraction levels.CHL also reduces the proof effort for developers throughproof automation. Using CHL, we developed, specified, andproved the correctness of the FSCQ file system. AlthoughFSCQ’s design is relatively simple, experiments with FSCQrunning as a user-level file system show that it is sufficientto run Unix applications with usable performance. FSCQ’sspecifications and proofs required significantly more workthan the implementation, but the work was manageable evenfor a small team of a few researchers.

1 IntroductionThis paper describes Crash Hoare logic (CHL), which allowsdevelopers to write specifications for crash-safe storage sys-tems and also prove them correct. “Correct” means that, ifa computer crashes due to a power failure or other fail-stopfault and subsequently reboots, the storage system will recoverto a state consistent with its specification (e.g., POSIX [34]).For example, after recovery, either all disk writes from a filesystem call will be on disk, or none will be. Using CHL webuild the FSCQ certified file system, which comes with amachine-checkable proof that its implementation is correct.

Permission to make digital or hard copies of part or all of this work forpersonal or classroom use is granted without fee provided that copies are notmade or distributed for profit or commercial advantage and that copies bearthis notice and the full citation on the first page. Copyrights for third-partycomponents of this work must be honored. For all other uses, contact theowner/author.

Copyright is held by the owner/author(s).SOSP’15, Oct. 4–7, 2015, Monterey, California, USA.ACM 978-1-4503-3834-9/15/10.http://dx.doi.org/10.1145/2815400.2815402

Proving that a file system is crash-safe is important, becauseit is otherwise hard for the file-system developer to ensurethat the code correctly handles all possible points where acrash could occur, both while a file-system call is runningand during the execution of recovery code. Often, a systemmay work correctly in many cases, but if a crash happens ata particular point between two specific disk writes, then aproblem arises [55, 70].

Current approaches to building crash-safe file systems fallroughly into three categories (see §2 for more details): testing,program analysis, and model checking. Although they are ef-fective at finding bugs in practice, none of them can guaranteethe absence of crash-safety bugs in actual implementations.This paper focuses precisely on this issue: helping developersbuild file systems with machine-checkable proofs that theycorrectly recover from crashes at any point.

Researchers have used theorem provers for certifying real-world systems such as compilers [45], small kernels [43], ker-nel extensions [61], and simple remote servers [30], but noneof these systems are capable of reasoning about file-systemcrashes. Reasoning about crash-free executions typically in-volves considering the states before and after some operation.Reasoning about crashes is more complicated because crashescan expose intermediate states.

Challenges and contributions. Building an infrastructurefor reasoning about file-system crashes poses several chal-lenges. Foremost among those challenges is the need for aspecification framework that allows the file-system developerto state the system behavior under crashes. Second, it is im-portant that the specification framework allows for proofs tobe automated, so that one can make changes to a specifica-tion and its implementation without having to redo all of theproofs manually. Third, the specification framework must beable to capture important performance optimizations, suchas asynchronous disk writes, so that the implementation of afile system has acceptable performance. Finally, the specifica-tion framework must allow modular development: developersshould be able to specify and verify each component in iso-lation and then compose verified components. For instance,once a logging layer has been implemented, file-system devel-opers should be able to prove end-to-end crash safety in theinode layer by simply relying on the fact that logging ensuresatomicity; they should not need to consider every possiblecrash point in the inode code.

To meet these challenges, this paper makes the followingcontributions:

18

Page 2: Using Crash Hoare Logic for Certifying the FSCQ File Systemhuang/cs718/spring18/readings/fscq.pdfHoare logic (CHL), which extends traditional Hoare logic with a crash condition, a

• The Crash Hoare logic (CHL), which allows programmersto specify what invariants hold in case of crashes and whichincorporates the notion of a recovery procedure that runsafter a crash. CHL supports the construction of modularsystems through a notion of logical address spaces. Finally,CHL allows for a high degree of proof automation.

• A model for asynchronous disk writes, specified usingCHL, that captures the notion of multiple outstandingwrites (which is important for performance). The modelallows file-system developers to reason about all possibledisk states that might result when some subset of thesewrites is applied after a crash, and is compatible with proofautomation.

• A write-ahead log, FscqLog, certified with CHL, that pro-vides all-or-nothing transactions on top of asynchronousdisk writes, and which provides a simple synchronous diskabstraction to other layers in the file system.

• The FSCQ file system, built on top of FscqLog, whichis the first file system to be certified for crash safety. Itembeds design patterns that work well for constructingmodular certified file systems: the use of logical addressspaces to reason easily about inode numbers, directoryentries, file contents, and so on; a certified generic objectallocator that can be instantiated for disk blocks and inodes;and a certified library for laying out data structures on disk.

• A specification of a subset of the POSIX file-system APIthat captures its semantics under crashes. Recent workhas shown that many application developers misunderstandcrash properties of file systems [55]; our specification canhelp application developers build applications on top of thePOSIX API and reason precisely about crash safety.

• An evaluation that shows that a certified file system canachieve usable performance and run a wide range of un-modified Unix applications.

• A case study of code evolution in FSCQ, demonstratingthat CHL combined with FSCQ’s design allows for incre-mental changes to both the proof and the implementation.This suggests that the FSCQ file system is amenable toincremental improvements.

System overview. We have implemented the CHL speci-fication framework with the widely used Coq proof assis-tant [15], which provides a single programming language forboth proving and implementing. The source code is availableat https://github.com/mit-pdos/fscq-impl. Figure 1shows the components involved in the implementation. CHLis a small specification language embedded in Coq that allowsa file-system developer to write specifications that include thenotion of crash conditions and a recovery procedure, and toprove that their implementations meet these specifications. Wehave stated the semantics of CHL and proven it sound in Coq.

We implemented and certified FSCQ using CHL. That is,we wrote specifications for a subset of the POSIX system calls

mv a b

Crash Hoare logic (CHL)

Crash modelProof automation

. . .

FSCQ

Definition rename := ...Theorem rename_ok: spec.Proof. . . .Qed.

Coq proof checker

OK?

Coq extraction

Haskell FUSE driverand libraries

Haskell code for FSCQ

Haskell compiler

FSCQ's FUSE file server

Linux kernel

rename()FUSEupcall

Disk reads,writes, and syncs

Disk

Figure 1: Overview of FSCQ’s implementation. Rectangular boxes denotesource code; rounded boxes denote processes. Shaded boxes denote sourcecode written by hand. The dashed line denotes the Haskell compiler producingan executable binary for FSCQ’s FUSE file server.

using CHL, implemented those calls inside of Coq, and provedthat the implementation of each call meets its specification.CHL reduces the proof burden because it automates the chain-ing of pre- and postconditions. Despite the automation, writingspecifications and proofs still took a significant amount of time,compared to the time spent writing the implementation.

As a target for FSCQ’s completeness, we aimed for thesame features as the xv6 file system [16], a teaching operatingsystem that implements the Unix v6 file system with write-ahead logging. FSCQ supports fewer features than today’sUnix file systems; for example, it lacks support for multipro-cessors and deferred durability (i.e., fsync). But, it supportsthe core POSIX file-system calls, including support for largefiles using indirect blocks, nested directories, and rename.

Using Coq’s extraction feature, we extract a Haskell imple-mentation of FSCQ. We run this implementation combinedwith a small (uncertified) Haskell driver as a FUSE [24] user-level file server. This implementation strategy allows us to rununmodified Unix applications but pulls in Haskell, our Haskelldriver, and the Haskell FUSE library as trusted components.

We compare the performance of the extracted FSCQ filesystem with xv6 and ext4. FSCQ’s performance is close tothat of xv6, and FSCQ is about 2× slower than ext4 withsynchronous data journaling. For example, building the xv6source code on the FSCQ and xv6 file systems takes 3.7s onan SSD, while on ext4 with data journaling it takes 2.2s.

19

Page 3: Using Crash Hoare Logic for Certifying the FSCQ File Systemhuang/cs718/spring18/readings/fscq.pdfHoare logic (CHL), which extends traditional Hoare logic with a crash condition, a

Roadmap. The rest of the paper is organized as follows. §2relates FSCQ to previous work. §3 introduces the CHL speci-fication framework. §4 describes how proofs of CHL specifi-cations can be partially automated. §5 describes FscqLog, theFSCQ file system, and the design patterns that we used to buildand certify FSCQ. §6 summarizes FSCQ’s implementationand how we produce a running file system. §7 evaluates thisimplementation. §8 summarizes alternative approaches thatwe tried before settling on CHL. §9 concludes.

2 Related WorkWe were motivated to design and build FSCQ by several linesof prior work: (1) research that has found and fixed bugs infile systems, (2) formal reasoning about file systems, (3) therecent successes in verification of complete systems, and (4)formalization of failure models, which we discuss in turn.

Finding and fixing bugs in file systems. Previous papershave studied bugs in file systems [47] and in applicationsthat make inconsistent assumptions about the underlying filesystems [55, 70]. One recent example is the 2013 Bitcoinbounty for tracking down a serious bug that corrupted financialtransactions stored in a LevelDB database [21].

Model-checking [67–69] and fuzz-testing [35] techniquesare effective at detecting file-system bugs. They enumeratepossible user inputs and disk states, inject crashes, and lookfor cases where the file system encounters a bug or violatesits invariants. These techniques find real bugs, and we usesome of them in §7.3 to do an end-to-end check on FSCQand its specifications. However, these techniques often cannotcheck all possible execution paths and thus cannot guaranteebug-free systems.

When faced with file-system inconsistencies, system admin-istrators run tools such as fsck [4: §42] and SQCK [29] todetect and repair corruption [17]. By construction, certifiedfile systems avoid inconsistencies caused by software bugs.

Formal reasoning about file systems. Building a correct filesystem has been an attractive goal for verification [23, 36].There is a rich literature of formalizing file systems usingmany specification languages, including ACL2 [5], Alloy [37],Athena [3], Isabelle/HOL [62], PVS [32], SSL [25], Z [6],KIV [19], and combinations of them [22]. Most of thesespecifications do not model crashes. The ones that do, suchas the work by Kang and Jackson [37], do not connect thespecification to an executable implementation.

The closest effort in this area is work in progress by Schell-horn, Pfähler, and others to verify a flash file system calledFlashix [20, 54, 58]. They aim to produce a verified file sys-tem for raw flash, to support a POSIX-like interface, and tohandle crashes. One difference from our work is that Flashixspecifications are abstract state machines; in contrast, CHLspecifications are written in a Hoare-logic style with pre- andpostconditions. One downside of CHL is that all procedures(including internal layers inside the file system) must have

explicit pre- and postconditions. However, CHL’s approachhas two advantages. First, developers can easily write CHLspecifications that capture asynchronous I/O, such as for write-back disk caches. Second, it allows for proof automation.Since Flashix is not available yet, it is difficult for us to do amore detailed comparison (e.g., how much proof effort Flashixrequires, what is captured in the Flashix specification, or howFlashix performs).

In concurrent work, Ntzik et al. [52] extend Views [18] withfault conditions, which are similar to CHL’s crash conditions.Because Views deals with shared-memory concurrency, theirlogic models both volatile state and durable state. CHL modelsonly durable state, and relies on its shallow embedding in Coqfor volatile state. Their paper focuses on the design of the logic,illustrating it with a logging system modeled on the ARIESrecovery algorithm [50]. Their aim isn’t to build a completeverified system, and their logic lacks, for example, logicaladdress spaces, which help proof automation and certifyingFSCQ in a modular fashion.

Certified systems software. The last decade has seen tremen-dous progress in certifying systems software, which inspiredus to work on FSCQ. The CompCert compiler [45] is formallyspecified and verified in Coq. As a compiler, CompCert doesnot deal with crashes, but we adopt CompCert’s validationapproach for proving FSCQ’s cache-replacement algorithm.

The seL4 project [43] developed a formally verified mi-crokernel using the Isabelle proof assistant. Since seL4 is amicrokernel, its file system is not part of the seL4 kernel. seL4makes no guarantees about the correctness of the file system.seL4 itself has no persistent state, so its specification does notmake statements about crashes.

A recent paper [42] argues that file systems deserve verifi-cation too, and describes work-in-progress on BilbyFS, whichuses layered domain-specific languages, but appears not tohandle crashes [41]. Two other position papers [1, 9] alsoargue for verifying storage systems. One of them [9] sum-marizes our initial thinking about different ways of writingsystem specifications, including Hoare-style ones.

Verve [66], Bedrock [12, 13], Ironclad [30], and Certi-KOS [28] have shown that proof automation can considerablyreduce the burden of proving. Of these, Bedrock, Verve, andIronclad are the most related to this work, because they sup-port proof automation for Hoare logic. We also adopt a fewCoq libraries from Bedrock. Our main contribution here is toextend Hoare logic with crash predicates, recovery procedures,and logical address spaces, while keeping a high degree ofproof automation.

Reasoning about failures. Failures are a core concern in dis-tributed systems, and TLA [44] has been used to prove thatdistributed-system protocols adhere to some specification inthe presence of node and network failures, and to specifyfault-tolerant replicated storage systems [26]. However, TLAreasons purely about designs and not about executable code.

20

Page 4: Using Crash Hoare Logic for Certifying the FSCQ File Systemhuang/cs718/spring18/readings/fscq.pdfHoare logic (CHL), which extends traditional Hoare logic with a crash condition, a

Verdi [63] reasons about distributed systems written in Coqand can extract the implementation into executable code. How-ever, Verdi’s node-failure model is a high-level description ofwhat state is preserved across reboots, which is assumed to becorrect. Extracted code must use other software (such as a filesystem) to preserve state across crashes, and Verdi provides noproof that this is done correctly. FSCQ and CHL address thiscomplementary problem: reasoning about crash safety startingfrom a small set of assumptions (e.g., atomic sector writes).

Project Zap [53, 60] and Rely [8] explore using type sys-tems to mitigate transient faults (e.g., due to charged particlesrandomly flipping bits in a CPU) in cases when the systemkeeps executing after a fault occurs. These models considerpossible faults at each step of a computation. The type systemenables the programmer to reason about the results of runningthe program in the presence of faults, or to ensure that a pro-gram will correctly repeat its computation enough times todetect or mask faults. In contrast, CHL’s model is fail-stop:every fault causes the system to crash and reboot.

Andronick [2] verified anti-tearing properties for smart-cardsoftware, which involves being prepared for the interruptionof code at any point. This verification proceeds by instrument-ing C programs to call, between any two atomic statements,a function that may nondeterministically choose to raise anuncatchable exception. In comparison, CHL handles the ad-ditional challenges of asynchronous disk writes and layeredabstractions of on-disk data structures.

3 Crash Hoare LogicOur goal is to allow developers to certify the correctness ofa storage system formally—that is, to prove that it functionscorrectly during normal operation and that it recovers prop-erly from any possible crashes. As mentioned in the abstract,a file system might forget to zero out the contents of newlyallocated directory or indirect blocks, leading to corruptionduring normal operation, or it might perform disk writes with-out sufficient barriers, leading to disk contents that might beunrecoverable. Prior work has shown that even mature filesystems in the Linux kernel have such bugs during normaloperation [47] and in crash recovery [69].

To prove that an implementation meets its specification, wemust have a way for the developer to state what correct behav-ior is under crashes. To do so, we extend Hoare logic [33],where specifications are of the form {P} procedure {Q}. Here,procedure could be a sequence of disk operations (e.g., readand write), interspersed with computation, that manipulatesthe persistent state on disk, such as the implementation of therename system call or a lower-level operation such as allo-cating a disk block. P corresponds to the precondition thatshould hold before procedure is run, and Q is the postcondi-tion. To prove that a specification is correct, we must provethat procedure establishes Q, assuming P holds before invok-ing procedure. In our rename system call example, P mightrequire that the file system be represented by some tree t, and

Q might promise that the resulting file system is representedby a modified tree t′ reflecting the rename operation.

Hoare logic is insufficient to reason about crashes, becausea crash may cause procedure to stop at any point in its execu-tion and may leave the disk in a state where Q does not hold(e.g., in the rename example, the new file name has been cre-ated already, but the old file name has not yet been removed).Furthermore, if the computer reboots, it often runs a recoveryprocedure (such as fsck) before resuming normal operation.Hoare logic does not provide a notion that at any point duringprocedure’s execution, a recovery procedure may run.

CHL extends Hoare logic with crash conditions, logicaladdress spaces, and recovery execution semantics. Thesethree extensions together allow developers to write concisespecifications for storage systems, including specifying thecorrect behavior in the presence of crashes. As we will showin §5, these features allow us to state precise specifications(e.g., in the case of rename, that once recovery succeeds, eitherthe entire rename operation took effect, or none of it did) andprove that implementations meet them. However, for the restof this section, we will consider much simpler examples toexplain the basics of CHL.

3.1 ExampleMany file-system operations must update two or more diskblocks as an atomic operation; for example, when creating afile, the file system must both mark an inode as allocated aswell as update the directory in which the file is created (torecord the file name with the allocated inode number). Toensure correct behavior under crashes, a common approach isto run the operation in a transaction. The transaction systemguarantees that, in the case of a crash, either all disk writessucceed or none do. Using transactions, a file system can avoidthe undesirable intermediate state where the inode is allocatedbut not recorded in the directory, effectively losing an inode.Many file systems, including widely used file systems such asLinux’s ext4 [59], use transactions exactly for this reason.

def atomic_two_write(a1, v1, a2, v2):log_begin()log_write(a1, v1)log_write(a2, v2)log_commit()

Figure 2: Pseudocode of atomic_two_write

The simple procedure shown in Figure 2 captures theessence of file-system calls that must update two or moreblocks. The procedure performs two disk writes inside ofa transaction using a write-ahead log, which supplies thelog_begin, log_commit, and log_write APIs. The procedurelog_write appends a block’s content to an in-memory log,instead of updating the disk block in place. The procedurelog_commit writes the log to disk, writes a commit record,and then copies the block contents from the log to the blocks’locations on disk. If this procedure crashes and the systemreboots, the recovery procedure of the transaction system runs.

21

Page 5: Using Crash Hoare Logic for Certifying the FSCQ File Systemhuang/cs718/spring18/readings/fscq.pdfHoare logic (CHL), which extends traditional Hoare logic with a crash condition, a

The recovery procedure looks for the commit record. If thereis a commit record, it completes the transaction by copying theblock contents from the log into the proper locations and thencleans the log. If there is no commit record, then the recoveryprocedure just cleans the log.

If there is a crash during recovery, then after reboot therecovery procedure runs again. In principle, this may happenseveral times. If the recovery finishes, however, then eitherboth blocks have been updated or neither have. Thus, in theatomic_two_write procedure from Figure 2, the transactionsystem guarantees that either both writes happen or none do,no matter when and how many crashes happen.

CHL makes it possible to write specifications for proce-dures such as atomic_two_write and the write-ahead loggingsystem, as we will explain in the rest of the section.

3.2 Crash conditionsCHL needs a way for developers to write down predicatesabout disk states, such as a description of the possible interme-diate states where a crash could occur. For modularity, CHLshould allow reasoning about just one part of the disk, ratherthan having to specify the contents of the entire disk at alltimes. For example, we want to specify what happens withthe two blocks that atomic_two_write updates and not haveto say anything about the rest of the disk.

To do this, CHL employs separation logic [56], which is away of combining predicates on disjoint parts of a store (inour case, the disk). The basic predicate in separation logic is apoints-to relation, written as a 7→ v, which means that addressa has value v. Given two predicates x and y, separation logicallows CHL to produce a combined predicate x ⋆ y. The ⋆operator means that the disk can be split into two disjoint parts,where one satisfies the x predicate, and the other satisfies y.

To reason about the behavior of a procedure in the pres-ence of crashes, CHL allows a developer to capture both thestate at the end of the procedure’s crash-free execution andthe intermediate states during the procedure’s execution inwhich a crash could occur. For example, Figure 3 shows theCHL specification for FSCQ’s disk_write. (In our imple-mentation of CHL, these specifications are written in Coqcode; we show here an easier-to-read version.) We will de-scribe the precise notation shortly, but for now, note thatthe specification has four parts: the procedure about whichwe are reasoning, disk_write(a, v); the precondition, disk:a 7→ ⟨v0, vs⟩ ⋆ other_blocks; the postcondition if there areno crashes, disk: a 7→ ⟨v, [v0]⊕ vs⟩ ⋆ other_blocks; and thecrash condition, disk: a 7→ ⟨v0, vs⟩ ⋆ other_blocks ∨ a 7→⟨v, [v0]⊕ vs⟩ ⋆ other_blocks. Moreover, note that the crashcondition specifies that disk_write could crash in two possi-ble states (either before making the write or after).

The specification in Figure 3 captures asynchronous writes.To do so, CHL models the disk as a (partial) function from ablock number to a tuple, ⟨v,vs⟩, consisting of the last-writtenvalue v and a set of previous values vs, one of which could

SPEC disk_write(a, v)PRE disk: a 7→ ⟨v0, vs⟩ ⋆ other_blocksPOST disk: a 7→ ⟨v, [v0]⊕ vs⟩ ⋆ other_blocksCRASH disk: a 7→ ⟨v0, vs⟩ ⋆ other_blocks ∨

a 7→ ⟨v, [v0]⊕ vs⟩) ⋆ other_blocks

Figure 3: Specification for disk_write

appear on disk after a crash. Block numbers greater than thesize of the disk do not map to anything. Reading from a blockreturns the last-written value, since even if there are previousvalues that might appear after a crash, in the absence of acrash a read should return the last write. Writing to a blockmakes the new value the last-written value and adds the oldlast-written value to the set of previous values. Reading orwriting a block number that does not exist causes the systemto “fail” (as opposed to finishing or crashing). Finally, CHL’sdisk model supports a sync operation, which waits until thelast-written value is on disk and discards previous values.

Returning to Figure 3, the disk_write specification assertsthrough the precondition that the address being written, a,must be valid (i.e., within the disk’s size), by stating thataddress a points to some value ⟨v0, vs⟩ on disk. The spec-ification’s postcondition asserts that the block being modi-fied will contain the new value ⟨v, [v0]⊕ vs⟩; that is, the newlast-written value is v, and v0 is added to the set of previousvalues. The specification also asserts through the crash condi-tion that disk_write could crash in a state that satisfies a 7→⟨v0, vs⟩ ⋆ other_blocks ∨ a 7→ ⟨v, [v0]⊕vs⟩ ⋆ other_blocks,i.e., either the write did not happen (a still has ⟨v0, vs⟩), or itdid (a has ⟨v, [v0]⊕ vs⟩). Finally, the specification asserts thatthe rest of the disk is unaffected: if other disk blocks satisfiedsome predicate other_blocks before disk_write, they will stillsatisfy the same predicate afterwards.

One subtlety of CHL’s crash conditions is that they describethe state of the disk just before the crash occurs, rather thanjust after. Right after a crash, CHL’s disk model specifiesthat each block nondeterministically chooses one value fromthe set of possible values before the crash. For instance, thefirst line of Figure 3’s crash condition says that the disk still“contains” all previous writes, represented by ⟨v0,vs⟩, ratherthan a specific value that persisted across the crash, chosenout of [v0]⊕ vs. This choice of representing the state beforethe crash rather than after the crash allows the crash conditionto be similar to the pre- and postconditions. For example, inFigure 3, the state of other sectors just before a crash matchesthe other_blocks predicate, as in the pre- and postconditions.However, describing the state after the crash would requirea more complex predicate (e.g., if other_blocks contains un-synced disk writes, the state after the crash must choose oneof the possible values). Making crash conditions similar topre- and postconditions is good for proof automation (as wedescribe in §4).

The specification of disk_write captures two importantbehaviors of real disks—that I/O can happen asynchronouslyand that writes can be reordered—in order to achieve good

22

Page 6: Using Crash Hoare Logic for Certifying the FSCQ File Systemhuang/cs718/spring18/readings/fscq.pdfHoare logic (CHL), which extends traditional Hoare logic with a crash condition, a

performance. CHL could model a simpler synchronous diskby specifying that a points to a single value (instead of a setof values) and changing the crash condition to say that eithera points to the new value (a 7→ v) or a points to the old value(a 7→ v0). This change would simplify proofs, but this modelof a disk would be accurate only if the disk were running insynchronous mode with no write buffering, which achieveslower performance.

The disk_write specification states that blocks are writtenatomically; that is, after a crash, a block must contain eitherthe last-written value or one of the previous values, and partialblock writes are not allowed. This is a common assumptionmade by file systems, as long as each block is exactly onesector, and we believe it matches the behavior of many disksin practice (modern disks often have 4 KB sectors). UsingCHL, we could capture the notion of partial sector writes byspecifying a more complicated crash condition, but the speci-fication shown here matches the common assumption aboutatomic sector writes. We leave to future work the question ofhow to build a certified file system without that assumption.

Much like other Hoare-logic-based approaches, CHL re-quires developers to write complete specifications for everyprocedure, including internal ones (e.g., allocating an objectfrom a free bitmap). This requires stating precise precondi-tions and postconditions. In CHL, developers must also writecrash conditions for every procedure. In practice, we havefound that the crash conditions are often simpler to state thanthe pre- and postconditions. For example, in FSCQ, most crashconditions in layers above the log simply state that there is anactive (uncommitted) transaction; only the top-level system-call code begins and commits transactions.

3.3 Logical address spacesThe above example illustrates how CHL can specify predicatesabout disk contents, but file systems often need to express sim-ilar predicates at other levels of abstraction as well. Considerthe Unix pwrite system call. Its specification should be sim-ilar to disk_write, except that it should describe offsets andvalues within the file’s contents, rather than block numbersand block values on disk. Expressing this specification directlyin terms of disk contents is tedious. For example, describingpwrite might require saying that we allocated a new blockfrom the bitmap allocator, grew the inode, perhaps allocatedan indirect block, and modified some disk block that happensto correspond to the correct offset within the file. Writing suchcomplex specifications is also error-prone, which can result insignificant wasted effort in trying to prove an incorrect spec.

To capture such high-level abstractions in a concise manner,we observe that many of these abstractions deal with logicaladdress spaces. For example, the disk is an address space fromdisk-block numbers to disk-block contents; the inode layeris an address space from inode numbers to inode structures;each file is a logical address space from offsets to data withinthat file; and a directory is a logical address space from file

names to inode numbers. Building on this observation, CHLgeneralizes the separation logic for reasoning about the diskto similarly reason about higher-level address spaces like files,directories, or the logical disk contents in a logging system.

SPEC atomic_two_write(a1, v1, a2, v2)PRE disk: log_rep(NoTxn, start_state)

start_state: a1 7→ vx ⋆ a2 7→ vy ⋆ othersPOST disk: log_rep(NoTxn, new_state)

new_state: a1 7→ v1 ⋆ a2 7→ v2 ⋆ othersCRASH disk: log_intact(start_state, new_state)

Figure 4: Specification for atomic_two_write

As an example of address spaces, consider the specifica-tion of atomic_two_write, shown in Figure 4. Rather thandescribe how atomic_two_write modifies the on-disk datastructures, the specification introduces new address spaces,start_state and new_state, which correspond to the contentsof the logical disk provided by the logging system. Logicaladdress spaces allow the developer of the logging system tostate a clean specification, which provides the abstraction of asimple, synchronous disk to higher layers in the file system.Developers of higher layers can then largely ignore the detailsof the underlying asynchronous disk.

Specifically, in the precondition, a1 7→ vx applies to the ad-dress space representing the starting contents of the logicaldisk, and in the postcondition, a1 7→ v1 applies to the new con-tents of the logical disk. Like the physical disk, these addressspaces are partial functions from addresses to values (in thiscase, mapping 64-bit block numbers to 4 KB block values).Unlike the physical disk, the logical disk address space pro-vided by the logging system associates a single value witheach block, rather than a set of values, because the transactionsystem exports a sound synchronous interface, proven correcton top of the asynchronous interface below.

To make this specification precise, we must describe whatit means for the transaction’s logical disk to have value v1 ataddress a1. We do this by connecting the transaction’s logicaladdress spaces, start_state and new_state, to their physicalrepresentation on disk. For instance, we specify where thestarting state is stored on disk and how the new state is log-ically constructed (e.g., by applying the log contents to thestarting state). We specify this connection using a represen-tation invariant; in this example, the representation invariantlog_rep is a predicate describing the physical disk contents.log_rep takes logical address spaces as arguments and

specifies how those logical address spaces are representedon disk. Several states of the logging system are possible;log_rep(NoTxn, start_state) means the disk has no activetransaction and is in state start_state. In Figure 4, the log_repinvariant shows up twice. It is first applied to the disk ad-dress space, representing the physical disk contents, in theprecondition. Here, it relates the starting disk contents to thestart_state logical address space. log_rep also shows up inthe postcondition, where it connects the physical disk state

23

Page 7: Using Crash Hoare Logic for Certifying the FSCQ File Systemhuang/cs718/spring18/readings/fscq.pdfHoare logic (CHL), which extends traditional Hoare logic with a crash condition, a

after atomic_two_write returns and the logical address spacenew_state. Syntactically, we use the notation “las: predicate”to say that the logical address space las matches a particularpredicate; this is used both to apply a representation invari-ant to an address space (such as log_rep in the disk addressspace) as well as to write other predicates about an addressspace (such as a1 7→ v1 in new_state).

Representation invariants can be thought of as macros,which boil down to a set of “points-to” relationships. Forinstance, Figure 5 shows part of the log_rep definition for aninteresting case, namely an active transaction. It says that, inorder for a transaction to be in an ActiveTxn state, the com-mit block must contain zero, all of the blocks in start_statemust be on disk, and cur_state is the result of replaying thein-memory log starting from start_state. The arrow notation,→, denotes implication; that is, if start_state[a] = v, then itmust be that a 7→ ⟨v,∅⟩. replay is the one part of log_repthat does not boil down to points-to predicates: it is simplya function that takes one logical name space and producesanother logical name space (by applying log entries). Notethat, by using ∅ as the previous value sets, log_rep requiresthat all blocks must have been synced.

log_rep(ActiveTxn, start_state, cur_state) :=COMMITBLOCK 7→ ⟨0,∅⟩⋆ (∀a, start_state[a] = v → a 7→ ⟨v,∅⟩)∧ replay(start_state, inMemoryLog) = cur_state

Figure 5: Part of log_rep representation invariant

The crash condition of atomic_two_write, from Figure 4,describes all of the states in which atomic_two_write couldcrash using log_intact(d1, d2), which stands for all possi-ble log_rep states that recover to transaction states d1 or d2.Using log_intact allows us to concisely capture all possiblecrash states during atomic_two_write, including crashes deepinside any procedure that atomic_two_write might call (e.g.,crashes inside log_commit).

3.4 Recovery execution semanticsCrash conditions and address spaces allow us to specify thepossible states in which the computer might crash in the middleof a procedure’s execution. But we also need a way to reasonabout recovery, including crashes during recovery.

For example, we want to argue that a transaction providesall-or-nothing atomicity: if atomic_two_write crashes priorto invoking log_commit, none of the calls to log_writewill beapplied; after log_commit returns, all of them will be applied;and if atomic_two_write crashes during log_commit, eitherall or none of them will take effect. To achieve this property,the transaction system must run log_recover after every crashto roll forward any committed transaction, including aftercrashes during log_recover itself.

The specification of the log_recover procedure is shownin Figure 6. It says that, starting from any state matchinglog_intact(last_state, committed_state), log_recoverwill either roll back the transaction to last_state or will roll

forward a committed transaction to committed_state. Fur-thermore, the specification is idempotent, since the crash condi-tion implies the precondition; this will allow for log_recoverto crash and restart multiple times.

SPEC log_recover()PRE disk: log_intact(last_state, committed_state)POST disk: log_rep(NoTxn, last_state) ∨

log_rep(NoTxn, committed_state)CRASH disk: log_intact(last_state, committed_state)

Figure 6: Specification of log_recover

To state that log_recover must run after a crash, CHL pro-vides a recovery execution semantics. In contrast to CHL’sregular execution semantics, which talks about a procedureproducing either a failure (accessing an invalid disk block), acrash, or a finished state, the recovery semantics talks abouttwo procedures executing (a normal procedure and a recoveryprocedure) and producing either a failure, a completed state(after finishing the normal procedure), or a recovered state(after finishing the recovery procedure). This regime modelsthe notion that the normal procedure tries to execute and reacha completed state, but if the system crashes, it starts runningthe recovery procedure (perhaps multiple times if there arecrashes during recovery), which produces a recovered state.

SPEC atomic_two_write(a1, v1, a2, v2)≫ log_recoverPRE disk: log_rep(NoTxn, start_state)

start_state: a1 7→ vx ⋆ a2 7→ vy ⋆ othersPOST disk: log_rep(NoTxn, new_state) ∨

(ret = recovered ∧ log_rep(NoTxn, start_state))new_state: a1 7→ v1 ⋆ a2 7→ v2 ⋆ others

Figure 7: Specification for atomic_two_write with recovery. The≫ operatorindicates the combination of a regular procedure and a recovery procedure.

Figure 7 shows how to extend the atomic_two_write spec-ification to include recovery execution using the≫ notation.The postcondition indicates that, if atomic_two_write finisheswithout crashing, both blocks were updated, and if one ormore crashes occurred, with log_recover running after eachcrash, either both blocks were updated or neither was. Thespecial ret variable indicates whether the system reached acompleted or a recovered state and in this case enables callersof atomic_two_write to conclude that, if atomic_two_writecompleted without crashes, it updated both blocks (i.e., updat-ing none of the blocks is allowed only if the system crashedand recovered).

Note that distinguishing the completed and recovered statesallows the specification to state stronger properties for com-pletion than recovery. Also note that the recovery-executionversion of atomic_two_write does not have a crash condition,because the execution always succeeds, perhaps after runninglog_recover many times.

In this example, the recovery procedure is just log_recover,but the recovery procedure of a system built on top of thetransaction system may be composed of several recovery pro-

24

Page 8: Using Crash Hoare Logic for Certifying the FSCQ File Systemhuang/cs718/spring18/readings/fscq.pdfHoare logic (CHL), which extends traditional Hoare logic with a crash condition, a

log_recover

PRE

POST

RECOVER

if bnum >= NDIRECT: indirect = log_read(inode.blocks[NDIRECT]) return indirect[bnum - NDIRECT]else: return inode.blocks[bnum]

if

log_read

return

return

Figure 8: Example control flow of a CHL procedure that looks up the address of a block in an inode, with support for indirect blocks. (The actual code in FSCQchecks for some additional error cases.) Gray boxes represent the specifications of procedures. The dark red box represents the recovery procedure. Green andpink boxes represent preconditions and crash conditions, respectively. Blue boxes represent postconditions. Dashed arrows represent logical implication.

cedures. For example, recovery in a file system consists offirst reading the superblock from disk to locate the log andthen running log_recover.

4 Proving specificationsThe preceding section explains how to write specificationsusing CHL. This section describes how to prove that an im-plementation meets its specification. A key challenge in thedesign of CHL was to reduce the proof burden on developers.In developing FSCQ, we often refactored specifications andimplementations, and each time we did so we had to redo thecorresponding proofs. To reduce the burden of proving, wedesigned CHL so that it allows for stylized proofs. As a resultof this design, several proof steps can be done automatically,as we describe in this section.

Even with this automation, a significant amount of manualeffort is still required for proving. First, CHL itself must beproven to be sound, which we have done as part of imple-menting CHL in Coq; developers using CHL need not redothis proof. Second, each application that uses CHL typicallyrequires a significant amount of effort to develop its specifica-tions and proofs, because there are many aspects that cannotbe fully automated. We examine the amount of work requiredto build the FSCQ file system in more detail in §7.

4.1 OverviewTo get some intuition for how CHL can help automate proofs,consider the control flow of the example procedure in Figure 8.The outer box corresponds to the top-level specification ofa procedure; in this example, it is a procedure that returnsthe address of the bnumth block from an inode, with recovery-execution semantics. It has a precondition, a postcondition,and a recovery condition.

The arrows correspond to the procedure’s control flow, andsmaller boxes correspond to procedures that the top-level pro-cedure invokes (e.g., log_read). Each of these procedures hasa precondition, a postcondition, and a crash condition. In thefigure, after calling the if statement, the procedure can followtwo different paths: it may call log_read and then return, or

it may immediately return a different value. More compli-cated procedures may have more complicated control flows,including loops.

The top-level procedure also has a recovery procedure (e.g.,log_recover). The recovery procedure has a precondition,postcondition, and recovery condition. The recovery proce-dure may be invoked at any point after a crash. To capturethis, the control flow can jump from the crash condition of aprocedure to the recovery procedure. The recovery procedurecan itself crash, so there is also an arrow from the recoveryprocedure’s crash condition to its own precondition.

Proving the correctness of the top-level procedure p entailsproving that, if p is executed and the precondition held beforep started running, either 1) its postcondition holds; or 2) therecovery condition holds after recovery finishes. For the firstcase, we must show that the precondition of the top-level pro-cedure implies the precondition of the first procedure invoked,that the postcondition of the first procedure called implies theprecondition of the next procedure invoked in the control flow,and so on. Similarly for the second case, we must prove thatthe crash condition of each procedure implies the preconditionof the recovery procedure, and so on.

In both cases, the logical implications follow exactly thecontrol flow of the procedure, which allows for a high degreeof automation. Our implementation of CHL automaticallychains the pre- and postconditions based on the control flowof the procedure. If a precondition is trivially implied by apreceding postcondition in the control flow, then the developerdoes not have to prove anything. In practice this is often thecase, and the developer must prove only the representationfunctions (e.g., log_rep and log_intact). The rest of this sec-tion describes this automation in more detail, while explicitlynoting what must be proven by hand by developers. The basicstrategy is inspired by the Bedrock system [12] but extendsthe approach to handle crashes and address spaces.

4.2 Proving without recoveryCHL’s proof strategy consists of the following steps:

25

Page 9: Using Crash Hoare Logic for Certifying the FSCQ File Systemhuang/cs718/spring18/readings/fscq.pdfHoare logic (CHL), which extends traditional Hoare logic with a crash condition, a

Phase 1: Procedure steps. The first phase of CHL’s proofstrategy is to turn the theorem about p’s specification into aseries of proof obligations that will be proven in the next phase.Specifically, CHL considers every step in p (e.g., a procedurecall) and reasons about what the state of the system is beforeand after that step. CHL assumes that each step already hasa proven specification. The base primitives (e.g., disk_readand disk_write) of CHL have proven specifications providedby the implementation of CHL.

CHL starts by assuming that the initial state matches theprecondition, and, for every step in p, generates two proofobligations: (1) that the current condition (either p’s precon-dition or the previous step’s postcondition) implies the step’sprecondition, and (2) that the step’s crash condition implies p’scrash condition. At the end of p, CHL generates a final proofobligation that the final condition implies p’s postcondition.

SPEC log_begin()PRE disk: log_rep(NoTxn, start_state)POST disk: log_rep(ActiveTxn, start_state, start_state)CRASH disk: log_intact(start_state, start_state)

Figure 9: Specification for log_begin

For example, consider the atomic_two_write procedurefrom Figure 2, whose specification is shown in Figure 4. Asthe first step, CHL considers the call to log_begin and, us-ing the specification shown in Figure 9, generates two proofobligations: that atomic_two_write’s precondition matchesthe precondition of log_begin, and that log_begin’s crashcondition implies atomic_two_write’s crash condition.

SPEC log_write(a, v)PRE disk: log_rep(ActiveTxn, start_state, old_state)

old_state: a 7→ v0 ⋆ other_blocksPOST disk: log_rep(ActiveTxn, start_state, new_state)

new_state: a 7→ v ⋆ other_blocksCRASH disk: log_rep(ActiveTxn, start_state, any_state)

Figure 10: Specification for FscqLog’s write

When specifications involve multiple address spaces, CHLrecursively matches up the address spaces starting from thebuilt-in disk address space. For instance, the next step inatomic_two_write is the call to log_write, whose specifica-tion appears in Figure 10. By matching up the disk addressspaces in log_begin’s postcondition and log_write’s precon-dition, CHL concludes that the address space called start_statein atomic_two_write is the same as the old_state addressspace in log_write. CHL then generates another proof obli-gation that the predicate for start_state in atomic_two_writeimplies the predicate for old_state in log_write.

Phase 2: Predicate implications. Some obligations gener-ated in phase 1 are trivial, such as that the precondition ofatomic_two_write implies the precondition of log_begin;since the two are identical, CHL immediately proves the im-plication between them.

For more complicated cases, CHL relies on separation logicto prove the obligations and to help carry information from pre-condition to postcondition. Continuing with our example, con-sider the proof obligation generated at atomic_two_write’sfirst call to log_write, which requires us to prove that a1 7→

vx ⋆ a2 7→ vy ⋆ others implies a1 7→ v0 ⋆ other_blocks.Because ⋆ applies only to disjoint predicates, CHL matchesup a1 7→ vx with a1 7→ v0 (thereby setting v0 to vx) and “can-cels out” these terms from both sides of the implication obli-gation. CHL then sets the arbitrary other_blocks predicatefrom log_write’s precondition to a2 7→ vy ⋆ others. Thishas two effects: first, it proves this particular obligation, andsecond, it carries over information about a2 7→ vy into sub-sequent proof obligations that mention other_blocks fromlog_write’s postcondition (such as atomic_two_write’s nextcall to log_write).

Some implication obligations cannot be proven by CHLautomatically and require developer input. This usually oc-curs when the developer is working with multiple levels ofabstraction, where one predicate mentions a higher-level repre-sentation invariant (e.g., a directory represented by a functionfrom file names to inode numbers) but the other predicate talksabout lower-level state (e.g., a directory represented by a setof directory entries in a file).

4.3 Proving recovery specificationsSo far, we described how CHL proves specifications about pro-cedures without recovery. Proofs of specifications that involvea recovery procedure, such as Figure 7, are also automated inCHL: if the recovery procedure is idempotent (i.e., its crashcondition implies its precondition), CHL can automaticallyprove the specification of procedure p with recovery basedon the spec of p without recovery. For instance, since thelog_recover procedure is idempotent (see Figure 6), CHLcan automatically prove the specification shown in Figure 7based on the spec from Figure 4.

5 Building a file systemThis section describes FSCQ, a simple file system that wespecified and certified using CHL. FSCQ’s design closelyfollows the xv6 file system. The key differences are the lackof multiprocessor support and the use of a separate bitmapfor allocating inodes (instead of using a particular inode typeto represent a free state). The rest of this section describesFSCQ, the challenges we encountered in proving FSCQ, andthe design patterns that we came up with for addressing them.

5.1 OverviewFigure 11 shows the overall components that make up FSCQ.FscqLog provides a simple write-ahead log, which FSCQ usesto update several disk blocks atomically. The other compo-nents provide simple implementations of standard file-systemabstractions. The Cache module provides a buffer cache. Bal-loc implements a bitmap allocator, used for both block and

26

Page 10: Using Crash Hoare Logic for Certifying the FSCQ File Systemhuang/cs718/spring18/readings/fscq.pdfHoare logic (CHL), which extends traditional Hoare logic with a crash condition, a

inode allocation. Inode implements an inode layer; the mostinteresting logic here is combining the direct and indirectblocks together into a single list of block addresses. Inodeinvokes Balloc to allocate indirect blocks. BFile implementsa block-level file interface, exposing to higher levels an inter-face where each file is a list of blocks. BFile invokes Balloc toallocate file data blocks. Dir implements directories on top ofblock-level files. ByteFile implements a byte-level interfaceto files, where each file is an array of bytes. DirTree combinesdirectories and byte-level files into a hierarchical directory-tree structure; it invokes Balloc to allocate/deallocate inodeswhen creating/deleting files or subdirectories. Finally, FSimplements complete system calls in transactions.

Cache FscqLog

Inode Balloc

BFile

Dir ByteFile

FSDirTree

Figure 11: FSCQ components. Arrows represent procedure calls.

Figure 12 shows FSCQ’s disk layout. The layout blockcontains information about where all other parts of the filesystem are located on disk and is initialized by mkfs.

Filedata

Blockbitmaps

Inodes Inodebitmaps

Loglength

Logheader

Logdata

Layoutblock

Figure 12: FSCQ on-disk layout

5.2 Disk and crash modelFSCQ builds on the specification for disk_write shown in Fig-ure 3, which allows FSCQ to perform asynchronous disk oper-ations and captures the fact that, after a crash, not every issuedwrite will be reflected in the disk state. The other two disk op-erations that CHL models are reading a disk block (disk_read)and waiting until all write operations have reached nonvolatilememory (disk_sync).disk_sync takes a block address as an extra argument to

indicate which block must be synced. The per-address vari-ant of disk_sync discards all previous values for that block,making the last-written value the only possible value thatcan appear after a crash. All other blocks remain unchanged,which enables separation-logic-based local reasoning aboutsync operations. At execution time, consecutive invocationsof per-address disk_syncs are collapsed into a single globaldisk_sync operation, achieving the desired performance.

5.3 POSIX specificationFSCQ provides a POSIX-like interface at the top level; themain differences from POSIX are (i) that FSCQ does notsupport hard links, and (ii) that FSCQ does not implement filedescriptors and instead requires naming open files by inodenumber. FSCQ relies on the FUSE driver to maintain themapping between open file descriptors and inode numbers.

One challenge is specifying the behavior of POSIX callsunder crashes. The POSIX standard is vague about whatthe correct behavior is after a crash. File systems in prac-tice implement different semantics, which cause problems forapplications that need to implement application-level crashconsistency on top of a file system [55]. We decided to specifyall-or-nothing atomicity with respect to crashes and immediatedurability (i.e., when a system call returns, its effects are ondisk). In future work, we would like to allow applicationsto defer durability by specifying and supporting system callssuch as fsync.

SPEC rename(cwd_ino, oldpath, newpath)≫ fs_recoverPRE disk: log_rep(NoTxn, start_state)

start_state: tree_rep(old_tree) ∧find_subtree(old_tree, cwd) = cwd_tree ∧tree_inum(cwd_tree) = cwd_ino

POST disk: ((ret = (completed, NoErr) ∨ ret = recovered) ∧log_rep(NoTxn, new_state)) ∨

((ret = (completed, Error) ∨ ret = recovered) ∧log_rep(NoTxn, start_state))

new_state: tree_rep(new_tree) ∧mover = find_subtree(cwd_tree, oldpath) ∧pruned = tree_prune(cwd_tree, oldpath) ∧grafted = tree_graft(pruned, newpath, mover) ∧new_tree = update_subtree(old_tree, cwd, grafted)

Figure 13: Specification for rename with recovery

Figure 13 shows FSCQ’s specification for its most com-plicated system call, rename, in combination with FSCQ’srecovery procedure fs_recover. rename’s precondition re-quires that the directory tree is in a consistent state, matchingthe tree_rep invariant, and that the caller’s current workingdirectory inode, cwd_ino, corresponds to some valid pathname in the tree. The postcondition asserts that rename willeither return an error, with the tree unchanged, or succeed,with the new tree being logically described by the functionstree_prune, tree_graft, etc. These functions operate on alogical representation of the directory tree structure, ratherthan on low-level disk representations, and are defined in afew lines of code each. In case of a crash, the state will eitherhave no effects of rename or will be as if rename had finished.

5.4 FscqLogTo run system calls as transactions, FSCQ uses FscqLog, asimple write-ahead logging system. Figure 14 shows thepseudocode for FscqLog’s implementation. The specifica-tions of log_begin (Figure 9), log_write (Figure 10), and

27

Page 11: Using Crash Hoare Logic for Certifying the FSCQ File Systemhuang/cs718/spring18/readings/fscq.pdfHoare logic (CHL), which extends traditional Hoare logic with a crash condition, a

log_recover (Figure 6) are the specifications for the corre-sponding procedures in FscqLog’s API. FscqLog does notaim to achieve the high performance of sophisticated loggingsystems but captures the core idea of write-ahead logging.

inMemoryLog = {}

def log_begin(): pass

def log_write(a, v): inMemoryLog[a] = v

def log_read(a):if inMemoryLog.has_key(a): return inMemoryLog[a]else: return disk_read(a)

def log_flush():i = 0for (a, v) in inMemoryLog.iteritems():

if i >= LOGSIZE: return Nonedisk_write(LOGSTART + i, v)logHeader.sector[i] = ai = i + 1

logHeader.len = len(inMemoryLog)return logHeader

def log_apply():for (a, v) in inMemoryLog.iteritems():

disk_write(a, v)

def log_commit():logHeader = log_flush()if logHeader is None: return Falsedisk_sync()disk_write(COMMITBLOCK, logHeader)disk_sync()log_apply()disk_sync()logHeader.len = 0disk_write(COMMITBLOCK, logHeader)disk_sync()inMemoryLog = {}return True

def log_abort(): inMemoryLog = {}

def log_recover():logHeader = disk_read(COMMITBLOCK)for i in range(0, logHeader.len):

v = disk_read(LOGSTART + i)disk_write(logHeader.sector[i], v)

disk_sync()logHeader.len = 0disk_write(COMMITBLOCK, logHeader)disk_sync()

Figure 14: Pseudocode for FscqLog

FscqLog allows only one transaction at a time and assumesthat the memory is large enough to hold an in-memory logfor the current transaction. More sophisticated logging sys-tems remove such restrictions, but even for this simple designthe logging protocol is complex because disk_write is asyn-chronous. After a crash, the logging system can be in one ofseven states: NoTxn (no active transaction), ActiveTxn (a trans-action has started), LogFlushed (log_flush has happened),CommittedUnsync (commit block has been written), Commit-

tedSync (commit block has been synced), AppliedUnsync (loghas been applied), and AppliedSync (log has been applied andsynced). We proved that FscqLog is always in one of thesestates, which correspond to natural phases in the FscqLogimplementation. Different invariants might hold for differentstates. The publicly exported specs of the FscqLog methodsonly mention states NoTxn and ActiveTxn, since recovery codehides the others.

After a crash, FSCQ’s recovery procedure fs_recover readsthe layout block to determine where the log is located, andinvokes FscqLog’s log_recover to bring the disk to a consis-tent state. We have proven that log_recover always correctlyrecovers from each of the seven states after a crash.

By using FscqLog, FSCQ can factor out crash recovery.FSCQ updates the disk only through log_write and wrapsthose writes into transactions at the system-call granularity toachieve crash safety. For example, FSCQ wraps each systemcall like open, unlink, etc, in an FscqLog transaction, similarto atomic_two_write in Figure 2. This allows us to prove thatthe entire system call is atomic. That is, we can prove that themodifications a system call makes to the disk (e.g., allocatinga block to grow a file, then writing that block, and so on) allhappen or none happen, even if the system call fails due to acrash after issuing some log_writes.

Furthermore, although FscqLogmust deal with the complex-ity of asynchronous writes, it presents to higher-level softwarea simpler synchronous interface, because transactions hide theasynchrony by providing all-or-nothing atomicity. We wereable to do this because the transaction API exposes a logicaladdress space that maps each block to unique block contents,even though the physical disk maps each block to a tuple ofa last-written value and a set of previous values. As a result,software written on top of FscqLog does not have to worryabout asynchronous writes.

5.5 Using address spacesSince transactions take care of crashes, the remaining chal-lenge lies in specifying the behavior of a file system and prov-ing that the implementation meets its specification on a reliabledisk. As mentioned in §3.3, CHL’s address spaces help ex-press predicates about address spaces at different levels ofabstraction. For example, consider the specification shownin Figure 15 for file_bytes_write, which overwrites exist-ing bytes. This specification uses separation logic in fourdifferent address spaces: the bare disk (which implementsasynchronous writes and matches the log_rep predicate); theabstract disks inside the transaction, old_state and new_state(which have synchronous writes and match the files_reppredicate); the address spaces of files indexed by inode num-ber, old_files and new_files; and finally the address spaces offile bytes indexed by offset, old_f.data and new_f.data. The useof separation logic within each address space allows us to con-cisely specify the behavior of file_bytes_write at all theselevels of abstraction. Furthermore, CHL applies its proof au-

28

Page 12: Using Crash Hoare Logic for Certifying the FSCQ File Systemhuang/cs718/spring18/readings/fscq.pdfHoare logic (CHL), which extends traditional Hoare logic with a crash condition, a

tomation machinery to separation logic in every address space.This helps developers construct short proofs about higher-levelabstractions.

SPEC file_bytes_write(inum, off, len, bytes)PRE disk: log_rep(ActiveTxn, start_state, old_state)

old_state: files_rep(old_files) ⋆ other_stateold_files: inum 7→ old_f ⋆ other_filesold_f.data: [off . . .off + len) 7→ old_bytes ⋆ other_bytes

POST disk: log_rep(ActiveTxn, start_state, new_state)new_state: files_rep(new_files) ⋆ other_statenew_files: inum 7→ new_f ⋆ other_files ∧

new_f.attr = old_f.attrnew_f.data: [off . . .off + len) 7→ bytes ⋆ other_bytes

CRASH disk: log_rep(ActiveTxn, start_state, any_state)

Figure 15: Specification for writing to a file

5.6 Resource allocationFile systems must implement resource allocation at multiplelevels of abstraction—in particular, allocating disk blocks andallocating inodes. We built and proved correct a commonallocator in FSCQ. It works by storing a bitmap spanningseveral contiguous blocks, with bit i corresponding to whetherobject i is available. FSCQ instantiates this allocator for bothdisk-block and inode allocation, each with a separate bitmap.

Writing a naïve specification of the allocator is straightfor-ward: freeing an object adds it to a set of free objects, andallocating returns one of these objects. The allocator’s repre-sentation invariant asserts that the free set is correctly encodedusing “one” bits in the on-disk bitmap. However, the callerof the allocator must prove more complex statements—forexample, that any object obtained from the allocator is notalready in use elsewhere. Reproving this property from firstprinciples each time the allocator is used is labor-intensive.

To address this problem, FSCQ’s allocator provides afree_objects_pred(obj_set) predicate that can be applied tothe address space whose resources are being allocated. Thispredicate is defined as a set of (∃v, i 7→ v) predicates for each iin obj_set, combined using the ⋆ operator. obj_set is typicallythe allocator’s set of free object IDs, so this predicate statesthat every free object ID points to some value.

Using free_objects_pred simplifies reasoning about re-source allocation, because it can be combined with other pred-icates about the objects that are currently in use (e.g., diskblocks used by files), to give a complete description of the ad-dress space in question. The disjoint nature of the ⋆ operatorprecisely capture the idea that all objects are either available(and managed by the allocator) or are in use (and match someother predicate about the in-use objects).

For example, Figure 16 shows the representation invariantfor FSCQ’s file layer, which is typically applied to FscqLog’sabstract disk address space, as shown in Figure 15. The ab-stract disk, according to Figure 16, is split up into four disjointparts: the allocation bitmap (represented by allocator_rep),

files_rep(files) := ∃ free_blocks, ∃ inodes,allocator_rep(free_blocks) ⋆inode_rep(inodes) ⋆files_inuse_rep(inodes, files) ⋆free_objects_pred(free_blocks)

Figure 16: Representation invariant for FSCQ’s file layer

the inode area (represented by inode_rep), file data blocks(represented by files_inuse_rep), and free blocks (describedby free_objects_pred). The allocator’s representation invari-ant (allocator_rep) connects the on-disk bitmap to the set ofavailable blocks (free_blocks). The file_inuse_rep functioncombines the inode state in inodes (containing a list of blockaddresses for each inode) and the logical file state files to pro-duce a predicate describing the blocks currently used by allfiles. Finally, free_objects_pred asserts that the free blocksare disjoint from blocks used by the other three predicates.

The same pattern applies to allocating inodes as well. Theonly difference is that, in files_rep, the predicate describingthe actual bitmap, allocator_rep, and the predicate describ-ing the available objects, free_objects_pred, were both ap-plied to the same address space (the abstract disk). In the caseof inodes, the two predicates are applied to different addressspaces: the bitmap predicate is applied to the abstract disk, butfree_objects_pred is applied to the inode address space.

5.7 On-disk data structuresAnother common task in a file system is to lay out data struc-tures in disk blocks. For example, this shows up when storingseveral inodes in a block; storing directory entries in a file;storing addresses in the indirect block; and even storing indi-vidual bits in the allocator bitmap blocks. To factor out thispattern, we built the Rec library for packing and unpackingdata structures into bit-level representations. We often use thislibrary to pack multiple fields of a data structure into a singlebit vector (e.g., the bit-level representation of an inode) andthen to pack several of these bit-vectors into one disk block.

Definition inode_type : Rec.type := Rec.RecF ([("len", Rec.WordF 64); (* number of blocks *)("attr", iattr_type); (* file attrs *)("iptr", Rec.WordF 64); (* indirect ptr *)("blks", Rec.ArrayF 5 (Rec.WordF 64))]).

Figure 17: FSCQ’s on-disk inode layout

For example, Figure 17 shows FSCQ’s on-disk inode struc-ture, in Coq syntax. The first field is len, storing the numberof blocks in the inode, as a 64-bit integer (Rec.WordF indicatesa word field). The other fields are the file’s attributes (such asthe modification time), the indirect block pointer iptr, and alist of 5 direct block addresses, blks.

The library proves basic theorems, such as the fact thataccesses to different fields are commutative, that reading afield returns the last write, and that packing and unpacking areinverses of each other. As a result, code using these recordsdoes not have to prove low-level facts about layout in general.

29

Page 13: Using Crash Hoare Logic for Certifying the FSCQ File Systemhuang/cs718/spring18/readings/fscq.pdfHoare logic (CHL), which extends traditional Hoare logic with a crash condition, a

5.8 Buffer cacheThe design of our buffer cache has one interesting aspect: howit implements replacement policies. We wanted the flexibilityto use different replacement algorithms, but proving the cor-rectness of each algorithm posed a nontrivial burden. Instead,we borrowed the validation approach from CompCert [45]:rather than proving that the replacement algorithm alwaysworks, FSCQ checks if the result is safe (i.e., is a currentlycached block) before evicting that block. If the replacementalgorithm malfunctions, FSCQ evicts the first block in thebuffer cache. This allows FSCQ to implement replacementalgorithms in unverified code while still guaranteeing overallcorrectness.

6 Prototype implementationThe implementation follows the organization shown in Fig-ure 1 in §1. FSCQ and CHL are implemented using Coq,which provides a single programming language for implemen-tation, stating specifications, and proving them. Figure 18breaks down the source code of FSCQ and CHL. BecauseCoq provides a single language, proofs are interleaved withsource code and are difficult to separate. The developmenteffort took several researchers about a year and a half; mostof it was spent on proofs and specifications. Checking theproofs takes 11 hours on an Intel i7-3667U 2.00 GHz CPUwith 8 GB DRAM. The proofs are complete; we used Coq’sPrint Assumptions command to verify that FSCQ did notintroduce any unproven axioms or assumptions.

Component Lines of code

Fixed-width words 2,691CHL infrastructure 5,835Proof automation 2,304On-disk data structures 7,572Buffer cache 660FscqLog 3,163Bitmap allocator 441Inodes and files 2,930Directories 4,489FSCQ’s top-level API 1,312

Total 31,397

Figure 18: Combined lines of code and proof for FSCQ components

CHL. CHL is implemented as a domain-specific language in-side of Coq, much like a macro language (i.e., using a shallowembedding). We specified the semantics of this language andproved that it is sound. For example, we proved the standardHoare-logic specifications for the for and if combinators.We also proved the specifications of disk_read, disk_write(whose spec is in Figure 3 in §3), and disk_sync manually,starting from CHL’s execution and crash model. Much of theautomation (e.g., the chaining of pre- and postconditions) isimplemented using Ltac, Coq’s domain-specific language forproof search.

FSCQ. We implemented FSCQ also inside of Coq, writingthe specifications using CHL. We proved that the implemen-tation obeys the specifications, starting from the basic opera-tions in CHL. FscqLog simplified FSCQ’s specification andimplementation tremendously, because much of the detailedreasoning about crashes is localized in FscqLog.

FSCQ file server. We produced running code by using Coq’sextraction mechanism to generate equivalent Haskell codefrom our Coq implementation. We wrote a driver programin Haskell (400 lines of code) along with an efficient Haskellreimplementation of fixed-size words, disk-block operations,and a buffer-cache replacement policy (350 more lines ofHaskell). The extracted code, together with this driver andword library, allows us to efficiently execute our certifiedimplementation.

To allow applications to use FSCQ, we exported FSCQ as aFUSE file system, using the Haskell FUSE bindings [7] in ourHaskell FSCQ driver. We mount this FUSE FSCQ file systemon Linux, allowing Linux applications to use FSCQ withoutany modifications. Compiling the Coq and Haskell code toproduce the FUSE executable, without checking proofs, takesa little under two minutes.

Limitations. Although extraction to Haskell simplifies theprocess of generating executable code from our Coq implemen-tation, it adds the Haskell compiler and runtime into FSCQ’strusted computing base. In other words, a bug in the Haskellcompiler or runtime could subvert any of the guarantees thatwe prove about FSCQ. We believe this is a reasonable trade-off, since our goal is to certify higher-level properties of thefile system, and other projects have shown that it is possible toextend certification all the way to assembly [12, 30, 43].

Another limitation of the FSCQ prototype lies in dealingwith in-memory state in Coq, which is a functional language.CHL’s execution model provides a mutable disk but givesno primitives for accessing mutable memory. We addressthis by explicitly passing an in-memory state variable throughall FSCQ functions. This contains the current buffer cachestate (a map from address to cached block value), as well asthe current transaction state, if present (an in-memory log ofblocks written in the current transaction). In the future, wewant to support multiprocessors where several threads share amutable buffer cache, and we will address this limitation.

A limitation of FscqLog’s implementation is that it does notguarantee how much log space is available to commit a trans-action; if a transaction performs too many writes, log_commitcan return an error. Some file systems deal with this by rea-soning about how many writes each transaction can generateand ensuring that the log has sufficient space before startinga transaction. We have not done this in FSCQ yet, althoughit should be possible to expose the number of available logentries in FscqLog’s representation invariant. Instead, we al-low log_commit to return an error, in which case the entiretransaction (e.g., system call) aborts and returns an error.

30

Page 14: Using Crash Hoare Logic for Certifying the FSCQ File Systemhuang/cs718/spring18/readings/fscq.pdfHoare logic (CHL), which extends traditional Hoare logic with a crash condition, a

0

5

10

15

20

25

git clone make xv6 make lfs largefile smallfile cleanup mailbench

Runnin

g t

ime (

seco

nd

s) fscqxv6ext4-fuseext4ext4-journal-asyncext4-orderedext4-async

Figure 19: Running time for each phase of the application benchmark suite, and for delivering 200 messages with mailbench

7 EvaluationTo evaluate FSCQ, this section answers several questions:

• Is FSCQ complete enough for realistic applications, andcan it achieve reasonable performance? (§7.1)

• What kinds of bugs do FSCQ’s theorems preclude? (§7.2)

• Does FSCQ recover from crashes? (§7.3)

• How difficult is it to build and evolve the code and proofsfor FSCQ? (§7.4)

7.1 Application performanceFSCQ is complete enough that we can use FSCQ for softwaredevelopment, running a mail server, etc. For example, we haveused FSCQ with the GNU coreutils (ls, grep, etc.), editors(vim and emacs), software development tools (git, gcc, make,and so on), and running a qmail-like mail server. Applicationsthat, for instance, use extended attributes or create very largefiles do not work on FSCQ, but there is no fundamental reasonwhy they could not be made to work.

Experimental setup. We used a set of applications represent-ing typical software development: cloning a git repository,compiling the sources of the xv6 file system and the LFSbenchmark [57] using make, running the LFS benchmark, anddeleting all of the files to clean up at the end. We also runmailbench, a qmail-like mail server from the sv6 operatingsystem [14]. This models a real mail server, where usingFSCQ would ensure email is not lost even in case of crashes.

We compare FSCQ’s performance to two other file sys-tems: the Linux ext4 file system and the file system from thexv6 operating system (chosen because its design is similar toFSCQ’s). We modified xv6 to use asynchronous disk writesand ported the xv6 file system to FUSE so that we can run itin the same way as FSCQ. Finally, to evaluate the overhead ofFUSE, we also run the experiments on top of ext4 mountedvia FUSE.

To make a meaningful comparison, we run the file systemsin synchronous mode; i.e., every system call commits to diskbefore returning. (Disk writes within a system call can be asyn-chronous, as long as they are synced at the end.) For FSCQ

and xv6, this is the standard mode of operation. For ext4, weuse the data=journal and sync options. Although this is notthe default mode of operation for ext4, the focus of this evalua-tion is on whether FSCQ can achieve good performance for itsdesign, not whether its simple design can match that of a so-phisticated file system like ext4. To give a sense of how muchperformance can be obtained through further optimizationsor spec changes, we measure ext4 in three additional configu-rations: the journal_async_commit mode, which uses check-sums to commit in one disk sync instead of two (“ext4-journal-async” in our experiments); the data=ordered mode, which isincompatible with journal_async_commit (“ext4-ordered”);and the default data=ordered and async mode, which doesnot sync to disk on every system call (“ext4-async”).

We ran all of these experiments on a quad-core Intel i7-3667U 2.00 GHz CPU with 8 GB DRAM running Linux 3.19.The file system was stored on a separate partition on an IntelSSDSCMMW180A3L flash SSD. Running the experimentson an SSD ensures that potential file-system CPU bottlenecksare not masked by a slow rotational disk. We compiled FSCQ’sHaskell code using GHC 7.10.2.

Results. The results of running our experiments are shown inFigure 19. The first conclusion is that FSCQ’s performanceis close to that of the xv6 file system. The small gap betweenFSCQ and xv6 is due to the fact that FSCQ’s Haskell imple-mentation uses about 4× more CPU time than xv6’s. Thiscan be reduced by generating C or assembly code instead ofHaskell. Second, FUSE imposes little overhead, judging bythe difference between ext4 and ext4-fuse. Third, both FSCQand xv6 lag behind ext4. This is due to the fact that our bench-marks are bottlenecked by syncs to the SSD, and that ext4has a more efficient logging design that defers applying thelog contents until the log fills up, instead of at each commit.This allows ext4 to commit a transaction with two disk syncs,compared to four disk syncs for FSCQ and xv6. For exam-ple, mailbench requires 10 transactions per message, and theSSD can perform a sync in about 2.8 msec. This matchesthe observed performance of ext4 (64 msec per message) and

31

Page 15: Using Crash Hoare Logic for Certifying the FSCQ File Systemhuang/cs718/spring18/readings/fscq.pdfHoare logic (CHL), which extends traditional Hoare logic with a crash condition, a

xv6 and FSCQ (103 and 118 msec per message respectively).FSCQ is slightly slower than xv6 due to CPU overhead.

Finally, there is room for even further optimizations: ext4’sjournal_async_commit commits with one disk sync insteadof two, achieving almost twice the throughput in some cases;data=ordered avoids writing file data twice, achieving almosttwice the throughput in other cases; and asynchronous modeachieves much higher throughput by avoiding disk syncs alto-gether (at the cost of not persisting data right away).

7.2 Bug discussionTo understand whether FSCQ eliminates real problems thatarise in current file systems, we consider broad categories ofbugs that have been found in real-world file systems [47, 69]and discuss whether FSCQ’s theorems eliminate similar bugs:

1. Violating file or directory invariants, such as all link countsadding up [64] or the absence of directory cycles [49].

2. Improper handling of unexpected corner cases, such asrunning out of blocks during rename [27].

3. Mistakes in logging and recovery logic, such as not issuingdisk writes and syncs in the right order [39].

4. Misusing the logging API, such as freeing an indirect blockand clearing the pointer to it in different transactions [38].

5. Low-level programming errors, such as integer over-flows [40] or double frees [11].

6. Resource allocation bugs, such as losing disk blocks [65]or returning ENOSPC when there is available space [51].

7. Returning incorrect error codes [10].

8. Bugs due to concurrent execution of system calls, such asraces between two threads allocating blocks [48].

Some categories of bugs (#1–5) are eliminated by FSCQ’stheorems and proofs. For example, FSCQ’s representationinvariant for the entire file system enforces a tree shape forit, and the specification guarantees that it will remain a tree(without cycles) after every system call.

With regards to resource allocation (#6), FSCQ guaranteesresources are never lost, but our prototype’s specification doesnot require that the system be out of resources in order to returnan out-of-resource error. Strengthening the specification (andproving it) would eliminate this class of bugs.

Incorrect error codes (#7) can be a problem for our FSCQprototype in cases where we did not specify what exact code(e.g., EINVAL or ENOTDIR) should be returned. Extending thespecification to include specific error codes could avoid thesebugs, at the cost of more complex specifications and proofs.On the other hand, FSCQ can never have a bug where anoperation fails without an error code being returned.

Multi-processor bugs (#8) are out of scope for our FSCQprototype, because it does not support multi-threading.

7.3 Crash recoveryWe proved that FSCQ implements its specification, but inprinciple it is possible that the specification is incomplete orthat our unproven code (e.g., the FUSE driver) has bugs. Todo an end-to-end check, we ran two experiments. First, we ranfsstress from the Linux Test Project, which issues randomfile-system operations; this did not uncover any problems.Second, we experimentally induced crashes and verified thateach resulting disk image after recovery is consistent.

In particular, we created an empty file system using mkfs,mounted it using FSCQ’s FUSE interface, and then ran a work-load on the file system. The workload creates two files, writesdata to the files, creates a directory and a subdirectory underit, moves a file into each directory, moves the subdirectory tothe root directory, appends more data to one of the files, andthen deletes the other file. During the workload, we recordedall disk writes and syncs. After the workload completed, weunmounted the file system and constructed all possible crashstates. We did this by taking a prefix of the writes up to somesync, combined with every possible subset of writes from thatsync to the next sync. For the workload described above, thisproduced 320 distinct crash states.

For each crash state, we remounted the file system (whichruns the recovery procedure) and then ran a script to examinethe state of the file system, looking at directory structure, filecontents, and the number of free blocks and inodes. For theabove workload, this generates just 10 distinct logical states(i.e., distinct outputs from the examination script). Based on amanual inspection of each of these states, we conclude that allof them are consistent with what a POSIX application shouldexpect. This suggests that FSCQ’s specifications, as well asthe unverified components, are likely to be correct.

7.4 Development effortThe final question is, how much effort is involved in devel-oping FSCQ? One metric is the size of the FSCQ code base,reported in Figure 18; FSCQ consists of about 30,000 lines ofcode. In comparison, the xv6 file system is about 3,000 linesof C code, so FSCQ is about 10× larger, but this includes asignificant amount of CHL infrastructure, including librariesand proof machinery, which is not FSCQ-specific.

A more interesting question is how much effort is requiredto modify FSCQ, after an initial version has been developedand certified. Does adding a new feature to FSCQ requirereproving everything, or is the work commensurate with thescale of the modifications required to support the new feature?To answer this question, the rest of this section presents severalcase studies, where we had to add a significant feature to FSCQafter the initial design was already complete.

Asynchronous disk writes. We initially developed FSCQand FscqLog to operate with synchronous disk writes. Imple-menting asynchronous disk writes required changing about1,000 lines of code in the CHL infrastructure and changingover half of the implementations and proofs for FscqLog.

32

Page 16: Using Crash Hoare Logic for Certifying the FSCQ File Systemhuang/cs718/spring18/readings/fscq.pdfHoare logic (CHL), which extends traditional Hoare logic with a crash condition, a

However, layers above FscqLog did not require any changes,since FscqLog provided the same synchronous disk abstrac-tion in both cases.

Indirect blocks. Initially, FSCQ supported only directblocks. Adding indirect blocks required changing about 1,500lines of code and proof in the Inode layer, including infras-tructure changes for reasoning about on-disk objects that spanmultiple disk blocks (the inode and its indirect block). Wemade almost no changes to code above the Inode layer; theonly exception was BFile, in which we had to fix about 50lines of proof due to a hard-coded constant bound for themaximum number of blocks per file.

Buffer cache. We added a buffer cache to FSCQ after wehad already built FscqLog and several layers above it. SinceCoq is a pure functional language, keeping buffer-cache staterequired passing the current buffer-cache object to and fromall functions. Incorporating the buffer cache required changingabout 300 lines of code and proof in FscqLog, to pass aroundthe buffer-cache state, to access disk via the buffer cache andto reason about disk state in terms of buffer-cache invariants.We also had to make similar straightforward changes to about600 lines of code and proof for components above FscqLog.

Optimizing log layout. FscqLog’s initial design used onedisk block to store the length of the on-disk log and anotherblock to store a commit bit, indicating whether log recov-ery should replay the log contents after a crash. Once weintroduced asynchronous writes, storing these fields separatelynecessitated an additional disk sync between writing the lengthfield and writing the commit bit. To avoid this sync, we mod-ified the logging protocol slightly: the length field was nowalso the commit bit, and the log is applied on recovery iffthe length is nonzero. Implementing this change requiredmodifying about 50 lines of code and about 100 lines of proof.

7.5 Evaluation summaryAlthough FSCQ is not as complete and high-performance astoday’s high-end file systems, our results demonstrate thatthis is largely due to FSCQ’s simple design. Furthermore,the case studies in §7.4 indicate that one can improve FSCQincrementally. In future work we hope to improve FSCQ’slogging to batch transactions and to log only metadata; weexpect this will bring FSCQ’s performance closer to that ofext4’s logging. The one exception to incremental improvementis multiprocessor support, which may require global changesand is an interesting direction for future research.

8 DiscussionIn earlier stages of this project, we considered (and imple-mented) several different verification approaches. Here webriefly summarize a few of these alternatives and the problemsthat caused us to abandon them.

Our first approach, influenced by CompCert [46], organizedthe system as a number of abstraction layers (such as blocks,

inodes, directories, and pathnames), each with its own domain-specific language. Specifications took the form of an abstractmachine for each domain-specific language. Implementationscompiled higher-level languages into lower-level ones, andproofs showed simulation relations between the concrete ma-chine and the specification’s abstract machine. We abandonedthis approach for two reasons. First, at the time, we had notyet come up with the idea of crash conditions; without them,it was difficult to prove the idempotence of log recovery. Sec-ond, we had no notion of separation logic for logical addressspaces; without it, we statically partitioned disk blocks be-tween higher layers (such as file data blocks vs. directoryblocks), which made it difficult to reuse blocks for differentpurposes. Switching from this compiler-oriented approachto a Hoare-logic-based approach also allowed us to improveproof automation.

Another approach we explored is to reason about execu-tion traces [31], an approach commonly used for reasoningabout concurrent events. Under this approach, the developerwould have to show that every possible execution trace of diskread, write, and sync operations, intermingled with crashesand recovery, obeys a specification (e.g., rename is atomic).Although execution traces allow reasoning about ordering ofregular execution, crashes, and recovery, it is cumbersome towrite file-system invariants for execution traces at the levelof disk operations. Many file-system invariants also do notinvolve ordering (e.g., every block should be allocated onlyonce).

9 ConclusionThis paper’s contributions are CHL and a case study of ap-plying CHL to build FSCQ, the first certified crash-safe filesystem. CHL allowed us to concisely and precisely specify theexpected behavior of FSCQ. Because of CHL’s proof automa-tion, the burden of proving that FSCQ meets its specificationwas manageable. The benefit of the verification approach isthat FSCQ has a machine-checked proof that FSCQ avoidsbugs that have a long history of causing data loss in previousfile systems. For critical infrastructure such as a file system,the cost of proving seems a reasonable price to pay. We hopethat others will find CHL useful for constructing crash-safestorage systems.

AcknowledgmentsThanks to Nathan Beckmann, Butler Lampson, Robert Morris,and the IronClad team for insightful discussions and feedback.Thanks also to the anonymous reviewers for their comments,and to our shepherd, Herbert Bos. This research was supportedin part by NSF awards CNS-1053143 and CCF-1253229.

References[1] R. Alagappan, V. Chidambaram, T. S. Pillai, A. C.

Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Beyondstorage APIs: Provable semantics for storage stacks. In

33

Page 17: Using Crash Hoare Logic for Certifying the FSCQ File Systemhuang/cs718/spring18/readings/fscq.pdfHoare logic (CHL), which extends traditional Hoare logic with a crash condition, a

Proceedings of the 15th Workshop on Hot Topics in Op-erating Systems (HotOS), Kartause Ittingen, Switzerland,May 2015.

[2] J. Andronick. Formally proved anti-tearing propertiesof embedded C code. In Proceedings of the 2nd IEEEInternational Symposium on Leveraging Applicationsof Formal Methods, Verification and Validation, pages129–136, Paphos, Cyprus, Nov. 2006.

[3] K. Arkoudas, K. Zee, V. Kuncak, and M. Rinard. Veri-fying a file system implementation. In Proceedings ofthe 6th International Conference on Formal EngineeringMethods, Seattle, WA, Nov. 2004.

[4] R. H. Arpaci-Dusseau and A. C. Arpaci-Dusseau. Op-erating Systems: Three Easy Pieces. Arpaci-DusseauBooks, May 2014.

[5] W. R. Bevier and R. M. Cohen. An executable model ofthe Synergy file system. Technical Report 121, Compu-tational Logic, Inc., Oct. 1996.

[6] W. R. Bevier, R. M. Cohen, and J. Turner. A specifica-tion for the Synergy file system. Technical Report 120,Computational Logic, Inc., Sept. 1995.

[7] J. Bobbio et al. Haskell bindings for the FUSE library,2014. https://github.com/m15k/hfuse.

[8] M. Carbin, S. Misailovic, and M. C. Rinard. Verifyingquantitative reliability for programs that execute on un-reliable hardware. In Proceedings of the 2013 AnnualACM Conference on Object-Oriented Programming, Sys-tems, Languages, and Applications, Indianapolis, IN,Oct. 2013.

[9] H. Chen, D. Ziegler, A. Chlipala, M. F. Kaashoek,E. Kohler, and N. Zeldovich. Specifying crash safety forstorage systems. In Proceedings of the 15th Workshopon Hot Topics in Operating Systems (HotOS), KartauseIttingen, Switzerland, May 2015.

[10] D. Chinner. xfs: xfs_dir_fsync() re-turns positive errno, May 2014. https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=43ec1460a2189fbee87980dd3d3e64cba2f11e1f.

[11] D. Chinner. xfs: fix double free inxlog_recover_commit_trans, Sept. 2014.http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=88b863db97a18a04c90ebd57d84e1b7863114dcb.

[12] A. Chlipala. Mostly-automated verification of low-levelprograms in computational separation logic. In Pro-ceedings of the 2011 ACM SIGPLAN Conference on Pro-gramming Language Design and Implementation (PLDI),pages 234–245, San Jose, CA, June 2011.

[13] A. Chlipala. The Bedrock structured programmingsystem: Combining generative metaprogramming andHoare logic in an extensible program verifier. In Pro-ceedings of the 18th ACM SIGPLAN International Con-ference on Functional Programming (ICFP), pages 391–402, Boston, MA, Sept. 2013.

[14] A. T. Clements, M. F. Kaashoek, N. Zeldovich, R. T.Morris, and E. Kohler. The scalable commutativity rule:Designing scalable software for multicore processors. InProceedings of the 24th ACM Symposium on OperatingSystems Principles (SOSP), pages 1–17, Farmington, PA,Nov. 2013.

[15] Coq development team. Coq Reference Manual, Version8.4pl5. INRIA, Oct. 2014. http://coq.inria.fr/distrib/current/refman/.

[16] R. Cox, M. F. Kaashoek, and R. T. Morris. Xv6, asimple Unix-like teaching operating system, 2014. http://pdos.csail.mit.edu/6.828/2014/xv6.html.

[17] B. Demsky and M. Rinard. Automatic detection andrepair of errors in data structures. In Proceedings ofthe 18th Annual ACM Conference on Object-OrientedProgramming, Systems, Languages, and Applications,pages 78–95, Vancouver, Canada, Oct. 2003.

[18] T. Dinsdale-Young, L. Birkedal, P. Gardner, M. Parkin-son, and H. Yang. Views: Compositional reasoningfor concurrent programs. In Proceedings of the 40thACM Symposium on Principles of Programming Lan-guages (POPL), pages 287–300, Rome, Italy, Jan. 2013.

[19] G. Ernst, G. Schellhorn, D. Haneberg, J. Pfähler, andW. Reif. Verification of a virtual filesystem switch. InProceedings of the 5th Working Conference on VerifiedSoftware: Theories, Tools and Experiments, Menlo Park,CA, May 2013.

[20] G. Ernst, J. Pfähler, G. Schellhorn, and W. Reif. Inside averified flash file system: Transactions & garbage collec-tion. In Proceedings of the 7th Working Conference onVerified Software: Theories, Tools and Experiments, SanFrancisco, CA, July 2015.

[21] R. Escriva. Claiming Bitcoin’s bug bounty, Nov.2013. http://hackingdistributed.com/2013/11/27/bitcoin-leveldb/.

34

Page 18: Using Crash Hoare Logic for Certifying the FSCQ File Systemhuang/cs718/spring18/readings/fscq.pdfHoare logic (CHL), which extends traditional Hoare logic with a crash condition, a

[22] M. A. Ferreira and J. N. Oliveira. An integrated formalmethods tool-chain and its application to verifying a filesystem model. In Proceedings of the 12th BrazilianSymposium on Formal Methods, Aug. 2009.

[23] L. Freitas, J. Woodcock, and A. Butterfield. POSIXand the verification grand challenge: A roadmap. InProceedings of 13th IEEE International Conference onEngineering of Complex Computer Systems, pages 153–162, Mar.–Apr. 2008.

[24] FUSE. FUSE: Filesystem in userspace, 2013. http://fuse.sourceforge.net/.

[25] P. Gardner, G. Ntzik, and A. Wright. Local reasoningfor the POSIX file system. In Proceedings of the 23rdEuropean Symposium on Programming, pages 169–188,Grenoble, France, 2014.

[26] R. Geambasu, A. Birrell, and J. MacCormick. Experi-ences with formal specification of fault-tolerant storagesystems. In Proceedings of the 38th Annual IEEE/IFIPInternational Conference on Dependable Systems andNetworks (DSN), Anchorage, AK, June 2008.

[27] A. Goldstein. ext4: handle errors inext4_rename, Mar. 2011. https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=ef6078930263bfcdcfe4dddb2cd85254b4cf4f5c.

[28] R. Gu, J. Koenig, T. Ramananandro, Z. Shao, X. Wu,S.-C. Weng, H. Zhang, and Y. Guo. Deep specificationsand certified abstraction layers. In Proceedings of the42nd ACM Symposium on Principles of ProgrammingLanguages (POPL), Mumbai, India, Jan. 2015.

[29] H. S. Gunawi, A. Rajimwale, A. C. Arpaci-Dusseau,and R. H. Arpaci-Dusseau. SQCK: A declarative filesystem checker. In Proceedings of the 8th Symposium onOperating Systems Design and Implementation (OSDI),pages 131–146, San Diego, CA, Dec. 2008.

[30] C. Hawblitzel, J. Howell, J. R. Lorch, A. Narayan,B. Parno, D. Zhang, and B. Zill. Ironclad Apps: End-to-end security via automated full-system verification.In Proceedings of the 11th Symposium on OperatingSystems Design and Implementation (OSDI), pages 165–181, Broomfield, CO, Oct. 2014.

[31] M. P. Herlihy and J. M. Wing. Linearizability: a correct-ness condition for concurrent objects. ACM Transactionson Programming Languages Systems, 12(3):463–492,1990.

[32] W. H. Hesselink and M. Lali. Formalizing a hierarchi-cal file system. In Proceedings of the 14th BCS-FACSRefinement Workshop, pages 67–85, Dec. 2009.

[33] C. A. R. Hoare. An axiomatic basis for computer pro-gramming. Communications of the ACM, 12(10):576–580, Oct. 1969.

[34] IEEE (The Institute of Electrical and Electronics En-gineers) and The Open Group. The Open Group basespecifications issue 7, 2013 edition (POSIX.1-2008/Cor1-2013), Apr. 2013.

[35] D. Jones. Trinity: A Linux system call fuzz tester,2014. http://codemonkey.org.uk/projects/trinity/.

[36] R. Joshi and G. J. Holzmann. A mini challenge: Build averifiable filesystem. Formal Aspects of Computing, 19(2):269–272, June 2007.

[37] E. Kang and D. Jackson. Formal modeling and analysisof a Flash filesystem in Alloy. In Proceedings of the 1stInt’l Conference of Abstract State Machines, B and Z,pages 294–308, London, UK, Sept. 2008.

[38] J. Kara. ext3: Avoid filesystem corruption af-ter a crash under heavy delete load, July 2010.https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=f25f624263445785b94f39739a6339ba9ed3275d.

[39] J. Kara. jbd2: issue cache flush after check-pointing even with internal journal, Mar. 2012.http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=79feb521a44705262d15cc819a4117a447b11ea7.

[40] J. Kara. ext4: fix overflow when updatingsuperblock backups after resize, Oct. 2014.http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=9378c6768e4fca48971e7b6a9075bc006eda981d.

[41] G. Keller. Trustworthy file systems, 2014.http://www.ssrg.nicta.com.au/projects/TS/filesystems.pml.

[42] G. Keller, T. Murray, S. Amani, L. O’Connor, Z. Chen,L. Ryzhyk, G. Klein, and G. Heiser. File systems deserveverification too. In Proceedings of the 7th Workshop onProgramming Languages and Operating Systems, Farm-ington, PA, Nov. 2013.

[43] G. Klein, K. Elphinstone, G. Heiser, J. Andronick,D. Cock, P. Derrin, D. Elkaduwe, K. Engelhardt, M. Nor-rish, R. Kolanski, T. Sewell, H. Tuch, and S. Winwood.seL4: Formal verification of an OS kernel. In Proceed-ings of the 22nd ACM Symposium on Operating SystemsPrinciples (SOSP), pages 207–220, Big Sky, MT, Oct.2009.

35

Page 19: Using Crash Hoare Logic for Certifying the FSCQ File Systemhuang/cs718/spring18/readings/fscq.pdfHoare logic (CHL), which extends traditional Hoare logic with a crash condition, a

[44] L. Lamport. The temporal logic of actions. ACM Trans-actions on Programming Languages and Systems, 16(3):872–923, May 1994.

[45] X. Leroy. Formal verification of a realistic compiler.Communications of the ACM, 52(7):107–115, July 2009.

[46] X. Leroy. A formally verified compiler back-end. Jour-nal of Automated Reasoning, 43(4):363–446, Dec. 2009.

[47] L. Lu, A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau,and S. Lu. A study of Linux file system evolution. InProceedings of the 11th USENIX Conference on File andStorage Technologies (FAST), pages 31–44, San Jose,CA, Feb. 2013.

[48] F. Manana. Btrfs: fix race between writ-ing free space cache and trimming, Dec. 2014.http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=55507ce3612365a5173dfb080a4baf45d1ef8cd1.

[49] C. Mason. Btrfs: prevent loops in the direc-tory tree when creating snapshots, Nov. 2008.http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=ea9e8b11bd1252dcbc23afefcf1a52ec6aa3c113.

[50] C. Mohan, D. Haderle, B. Lindsay, H. Pirahesh, andP. Schwarz. ARIES: A transaction recovery method sup-porting fine-granularity locking and partial rollbacks us-ing write-ahead logging. ACM Transactions on DatabaseSystems, 17(1):94–162, Mar. 1992.

[51] A. Morton. [PATCH] ext2/ext3 -ENOSPC bug, Mar.2004. https://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/?id=5e9087ad3928c9d80cc62b583c3034f864b6d315.

[52] G. Ntzik, P. da Rocha Pinto, and P. Gardner. Fault-tolerant resource reasoning. In Proceedings of the 13thAsian Symposium on Programming Languages and Sys-tems (APLAS), Pohang, South Korea, Nov.–Dec. 2015.

[53] F. Perry, L. Mackey, G. A. Reis, J. Ligatti, D. I. Au-gust, and D. Walker. Fault-tolerant typed assembly lan-guage. In Proceedings of the 2007 ACM SIGPLAN Con-ference on Programming Language Design and Imple-mentation (PLDI), San Diego, CA, June 2007.

[54] J. Pfähler, G. Ernst, G. Schellhorn, D. Haneberg, andW. Reif. Crash-safe refinement for a verified flash filesystem. Technical Report 2014-02, University of Augs-burg, 2014.

[55] T. S. Pillai, V. Chidambaram, R. Alagappan, S. Al-Kiswany, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. All file systems are not created equal: On

the complexity of crafting crash-consistent applications.In Proceedings of the 11th Symposium on OperatingSystems Design and Implementation (OSDI), pages 433–448, Broomfield, CO, Oct. 2014.

[56] J. C. Reynolds. Separation logic: A logic for shared mu-table data structures. In Proceedings of the 17th AnnualIEEE Symposium on Logic in Computer Science, pages55–74, Copenhagen, Denmark, July 2002.

[57] M. Rosenblum and J. Ousterhout. The design and im-plementation of a log-structured file system. In Proceed-ings of the 13th ACM Symposium on Operating SystemsPrinciples (SOSP), pages 1–15, Pacific Grove, CA, Oct.1991.

[58] G. Schellhorn, G. Ernst, J. Pfähler, D. Haneberg, andW. Reif. Development of a verified flash file system. InProceedings of the ABZ Conference, June 2014.

[59] S. C. Tweedie. Journaling the Linux ext2fs filesystem.In Proceedings of the 4th Annual LinuxExpo, Durham,NC, May 1998.

[60] D. Walker, L. Mackey, J. Ligatti, G. A. Reis, and D. I.August. Static typing for a faulty Lambda calculus. InProceedings of the 11th ACM SIGPLAN InternationalConference on Functional Programming (ICFP), Port-land, OR, Sept. 2006.

[61] X. Wang, D. Lazar, N. Zeldovich, A. Chlipala, and Z. Tat-lock. Jitk: A trustworthy in-kernel interpreter infrastruc-ture. In Proceedings of the 11th Symposium on Operat-ing Systems Design and Implementation (OSDI), pages33–47, Broomfield, CO, Oct. 2014.

[62] M. Wenzel. Some aspects of Unix file-system se-curity, Aug. 2014. http://isabelle.in.tum.de/library/HOL/HOL-Unix/Unix.html.

[63] J. R. Wilcox, D. Woos, P. Panchekha, Z. Tatlock,X. Wang, M. D. Ernst, and T. Anderson. Verdi: Aframework for implementing and formally verifying dis-tributed systems. In Proceedings of the 2015 ACM SIG-PLAN Conference on Programming Language Designand Implementation (PLDI), pages 357–368, Portland,OR, June 2015.

[64] D. J. Wong. ext4: fix same-dir rename wheninline data directory overflows, Aug. 2014.https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=d80d448c6c5bdd32605b78a60fe8081d82d4da0f.

[65] M. Xie. Btrfs: fix broken free space cache af-ter the system crashed, June 2014. https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=e570fd27f2c5d7eac3876bccf99e9838d7f911a3.

36

Page 20: Using Crash Hoare Logic for Certifying the FSCQ File Systemhuang/cs718/spring18/readings/fscq.pdfHoare logic (CHL), which extends traditional Hoare logic with a crash condition, a

[66] J. Yang and C. Hawblitzel. Safe to the last instruc-tion: Automated verification of a type-safe operatingsystem. In Proceedings of the 2010 ACM SIGPLANConference on Programming Language Design and Im-plementation (PLDI), pages 99–110, Toronto, Canada,June 2010.

[67] J. Yang, P. Twohey, D. Engler, and M. Musuvathi. Usingmodel checking to find serious file system errors. InProceedings of the 6th Symposium on Operating SystemsDesign and Implementation (OSDI), pages 273–287, SanFrancisco, CA, Dec. 2004.

[68] J. Yang, C. Sar, P. Twohey, C. Cadar, and D. Engler.Automatically generating malicious disks using symbolicexecution. In Proceedings of the 27th IEEE Symposiumon Security and Privacy, pages 243–257, Oakland, CA,May 2006.

[69] J. Yang, P. Twohey, D. Engler, and M. Musuvathi. eX-plode: A lightweight, general system for finding seriousstorage system errors. In Proceedings of the 7th Sym-posium on Operating Systems Design and Implementa-tion (OSDI), pages 131–146, Seattle, WA, Nov. 2006.

[70] M. Zheng, J. Tucek, D. Huang, F. Qin, M. Lillibridge,E. S. Yang, B. W. Zhao, and S. Singh. Torturingdatabases for fun and profit. In Proceedings of the 11thSymposium on Operating Systems Design and Implemen-tation (OSDI), pages 449–464, Broomfield, CO, Oct.2014.

37