5-oct-2001

ext3 Journaling File System
absolute consistency of the filesystem in every respect after a reboot, with no loss of existing functionality
chadd williams
SHRUG 10/05/2001

Contents
- Design Goals
- File System reliability
- What is JFS?
- Why JFS?
- How?
- Details of ext3
- Details of JFS

Design Goals
- No performance loss
  - Should have a performance gain
- Backwards compatible
- Reliable!

File System reliability
- Preservation
  - Stable data is not affected by a crash
- Predictability
  - Known failure modes
- Atomicity
  - Each operation either fully completes or is fully undone after recovery

What is a JFS?
- Atomically updated
- Old and new versions of data held on disk until the update commits
- Undo logging:
  - Copy old data to the log
  - Write new data to disk
  - If you crash during update, copy old data from log
- Redo logging:
  - Write new data to the log
  - Old data remains on disk
  - If you crash, copy new data from the log
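The difference between the two logging styles can be sketched in a few lines of Python. This is a toy model, not ext3 code: the disk is a dict of block name to contents, the log lives only inside each function, and the names `undo_log_update` and `redo_log_update` are made up for illustration.

```python
def undo_log_update(disk, block, new_data, crash=False):
    """Undo logging: save the old contents first, then overwrite in place."""
    log = {block: disk[block]}      # copy old data to the log
    disk[block] = new_data          # write new data to disk
    if crash:                       # simulated crash during the update:
        disk[block] = log[block]    # recovery copies the old data back from the log
    return disk


def redo_log_update(disk, block, new_data, committed=True):
    """Redo logging: new contents go to the log; the disk copy stays old."""
    log = {block: new_data}         # write new data to the log
    if committed:                   # only a committed update is copied from the log
        disk[block] = log[block]
    return disk
```

With undo logging the log is what restores the old state; with redo logging the log is what delivers the new state, and an uncommitted entry is simply ignored.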

Why JFS?
- Speed up recovery after a crash
  - fsck on a large disk can be very slow
  - The goal is to eliminate enormously long filesystem recovery times after a crash
- With JFS you just reread the journal after a crash, from the last checkpoint

Journal
- Contains three types of data blocks
  - Metadata: the entire contents of a single block of filesystem metadata as updated by the transaction
  - Descriptor: describes other journal metadata blocks (where they really live on disk)
  - Header: contains the head and tail of the journal and a sequence number
- The journal is a circular structure
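A circular journal with head, tail, and sequence number can be modelled roughly as below. This is a minimal sketch under assumptions: the class name `CircularJournal`, its fields, and the in-memory slot list are all hypothetical and do not reflect the on-disk layout used by ext3.

```python
class CircularJournal:
    """Toy circular journal: head is the next slot to write, tail the oldest live slot."""

    def __init__(self, size):
        self.size = size
        self.slots = [None] * size
        self.head = 0        # next slot to be written
        self.tail = 0        # oldest slot that has not been released yet
        self.used = 0        # number of live slots
        self.sequence = 0    # bumped once per appended block (assumption)

    def append(self, kind, payload):
        """Store one block; kind is 'metadata', 'descriptor' or 'header'."""
        if self.used == self.size:
            raise RuntimeError("journal full: space must be released first")
        self.slots[self.head] = (kind, payload)
        self.head = (self.head + 1) % self.size   # wrap around at the end
        self.used += 1
        self.sequence += 1

    def release(self, count):
        """Advance the tail, making the oldest `count` slots reusable."""
        count = min(count, self.used)
        self.tail = (self.tail + count) % self.size
        self.used -= count
```

Because the head wraps around, a full journal cannot accept new blocks until the tail has moved forward.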

Versus log-structured file system
- A log structured file system ONLY contains a log, everything is written to the end of this log
- LSFS dictates how the data is stored on disk
- JFS does not dictate how the data is stored on disk

How does it work?
- Each disk update is a Transaction (atomic update)
  - Write new data to the disk (journal)
  - The update is not final until a commit
- Only after the commit block is written is the update final
  - The commit block is a single block of data on the disk
  - Not necessarily flushed to disk yet!

How does data get out of the journal?
- After a commit the new data is in the journal; it still needs to be written back to its home location on the disk
- Cannot reclaim that journal space until we resync the data to disk

To finish a Commit (checkpoint)
- Close the transaction
  - All subsequent filesystem operations will go into another transaction
- Flush transaction to disk (journal), pin the buffers
- After everything is flushed to the journal, update journal header blocks
- Unpin the buffers in the journal only after they have been synced to the disk
- Release space in the journal
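The sequence above can be walked through with a small Python model. Everything here is an assumption made for illustration: the `Transaction` class, the `commit` and `checkpoint` functions, and a journal represented as a plain list that holds only one transaction at a time.

```python
class Transaction:
    def __init__(self):
        self.updates = {}        # block name -> new contents
        self.state = "running"


def commit(txn, journal):
    """Close the transaction and flush it, plus a commit record, to the journal."""
    txn.state = "closed"                         # later operations need a new transaction
    for block, data in txn.updates.items():
        journal.append(("data", block, data))    # buffers stay pinned in this model
    journal.append(("commit",))                  # the update is final only from here on
    txn.state = "committed"


def checkpoint(txn, journal, disk):
    """Write committed data to its home location, then release journal space."""
    assert txn.state == "committed"
    for block, data in txn.updates.items():
        disk[block] = data                       # sync to the home location, then unpin
    journal.clear()                              # single-transaction journal (assumption)
    txn.state = "checkpointed"
```

Note that after `commit` the home copy on disk is still the old one; only `checkpoint` changes it and frees the journal.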

How does this help crash recovery?
- Only completed updates have been committed
- During reboot, the recovery mechanism reapplies the committed transactions in the journal
- The old and updated data are each stored separately, until the commit block is written
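Recovery by replaying only committed transactions might look like the sketch below. The record format (tuples tagged `"data"` or `"commit"` with a transaction id) and the function name `recover` are hypothetical; the real journal format is not shown in these slides.

```python
def recover(journal, disk):
    """Replay the journal in order, applying only transactions that have a commit record."""
    pending = {}                                  # transaction id -> buffered updates
    for record in journal:
        kind, tid = record[0], record[1]
        if kind == "data":
            _, _, block, data = record
            pending.setdefault(tid, []).append((block, data))
        elif kind == "commit":
            for block, data in pending.pop(tid, []):
                disk[block] = data                # reapply the committed update
    # anything still in `pending` never committed and is discarded
    return disk
```

A transaction that was interrupted before its commit record reached the journal leaves the old data on disk untouched.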

ext3 and JFS
- Two separate layers
  - /fs/ext3: just the filesystem with transactions
  - /fs/jbd: just the journaling stuff (JFS)
- ext3 calls JFS as needed
  - Start/stop transaction
  - Ask for a journal recovery after unclean reboot
- Actually do compound transactions
  - Transactions with multiple updates

ext3 details
- This grew out of ext2
- Exact same code base
- Completely backwards compatible (if you have a clean reboot)
  - Set some option bits in the superblock, preventing ext2 mount
  - Unset during a clean unmount

JFS details
- Abstract API based on handles, not transactions
- Handle: represents one single operation that marks a consistent set of updates on the disk
  - Compound transactions

JFS details
- API provides start()/stop() pairing to define the operations within a handle
- Allows for nesting of transactions
- There must be enough space in the journal for each update before it starts
  - Lack of space can lead to deadlock
  - start/stop used to make reservations in journal
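The reservation idea behind start()/stop() can be sketched as follows. This is not the kernel API: the `Journal` class, the dict used as a handle, and the block counts are invented for illustration, and a real implementation would wait for space rather than raise.

```python
class Journal:
    """Toy journal that hands out space reservations through start()/stop()."""

    def __init__(self, free_blocks):
        self.free_blocks = free_blocks

    def start(self, nblocks):
        """Reserve worst-case space for one operation before it begins."""
        if nblocks > self.free_blocks:
            # a real journal would block here until a checkpoint frees space
            raise RuntimeError("not enough journal space for this handle")
        self.free_blocks -= nblocks
        return {"reserved": nblocks, "used": 0}

    def stop(self, handle):
        """End the operation and give back whatever part of the reservation was unused."""
        self.free_blocks += handle["reserved"] - handle["used"]
        handle["reserved"] = handle["used"]
```

Reserving up front means an operation that has started can always finish, which is what avoids running out of journal space halfway through an update.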

JFS details
- Also need to make VM reservations
  - Cannot free memory from an in-progress transaction
  - No transaction aborts: the transaction must complete
  - If you run out of VM you can deadlock
- Provide callbacks to the VM so the VM can say, "you are using too much memory"

Guarantees
- Write ordering within a transaction
  - All the updates in a transaction will hit the disk after the commit
  - Updates are still write-behind, so you don't know when the commit will be done
- Write ordering between transactions
  - No formal guarantees
  - Sequential implementation happens to preserve write ordering

Internals
- ext3 & JFS do redo logging
  - New data written to log
  - Old data left on disk
  - After commit new data is moved to disk
- Commit
  - Consists of writing one 512-byte sector to disk
  - Disks can guarantee the write of 1 sector

Internals
- Checkpointing
  - Writing the data from the log out to disk
  - Allows reuse of the log
  - Currently everything is written to disk twice
  - Uses zero-copy for the second write: the new I/O request points to the old data buffer

Internals
- What if we need to update data in a pending transaction?
  - Copy-on-write
    - Give a copy of the data buffer to the new transaction
  - Both transactions still get committed in full
    - In order
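The copy-on-write step can be illustrated with a short Python sketch. The function name `write_block`, the dicts standing in for the pending and new transactions, and the 8-byte block size are assumptions chosen to keep the example small.

```python
BLOCK_SIZE = 8  # hypothetical block size, tiny on purpose


def write_block(pending, new, block, offset, data):
    """Modify `block` in the new transaction without touching the pending one's buffer."""
    if block not in new:
        source = pending.get(block, bytes(BLOCK_SIZE))
        new[block] = bytearray(source)            # private copy for the new transaction
    new[block][offset:offset + len(data)] = data
    return new[block]
```

The pending transaction still commits the buffer exactly as it was, and the new transaction commits its modified copy afterwards, so both are written in full and in order.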
