Transactions and Reliability
Sarah Diesburg
Operating Systems
COP 4610
Motivation
File systems have lots of metadata: free blocks, directories, file headers, indirect blocks
Metadata is heavily cached for performance
Problem
System crashes
OS needs to ensure that the file system does not reach an inconsistent state
Example: move a file between directories
Remove the file from the old directory
Add the file to the new directory
What happens when a crash occurs in the middle?
UNIX File System (Ad Hoc Failure-Recovery)
Metadata handling: uses a synchronous write-through caching policy
A call to update metadata does not return until the changes are propagated to disk
Updates are ordered
When crashes occur, run fsck to repair in-progress operations
Some Examples of Metadata Handling
Undo effects not yet visible to users
If a new file is created but not yet added to the directory, delete the file
Continue effects that are visible to users
If file blocks are already allocated but not recorded in the bitmap, update the bitmap
UFS User Data Handling
Uses a write-back policy
Modified blocks are written to disk at 30-second intervals, unless a user issues the sync system call
Data updates are not ordered
In many cases, consistent metadata is good enough
Example: Vi
Vi saves changes by doing the following:
1. Writes the new version to a temp file
Now we have old_file and new_temp
2. Moves the old version to a different temp file
Now we have new_temp and old_temp
3. Moves the new version into the real file
Now we have new_file and old_temp
4. Removes the old version
Now we have new_file
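Vi's four-step save can be sketched in Python. The temp-file names and the fsync call below are illustrative assumptions, not Vi's actual implementation; the point is that a crash between any two steps leaves either the old or the new version intact on disk.

```python
import os

def vi_style_save(path, new_contents):
    """Save new_contents to path using the Vi-style temp-file dance."""
    new_temp = path + ".new_temp"   # hypothetical naming scheme
    old_temp = path + ".old_temp"

    # 1. Write the new version to a temp file
    with open(new_temp, "w") as f:
        f.write(new_contents)
        f.flush()
        os.fsync(f.fileno())        # force the data to disk before renaming

    # 2. Move the old version to a different temp file
    if os.path.exists(path):
        os.rename(path, old_temp)

    # 3. Move the new version into the real file
    os.rename(new_temp, path)

    # 4. Remove the old version
    if os.path.exists(old_temp):
        os.remove(old_temp)
```

A crash-recovery pass would then inspect which temp files are left over and roll forward or back, as described below.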
Example: Vi
When crashes occur:
Looks for the leftover files
Moves forward or backward depending on the integrity of the files
Transaction Approach
A transaction groups operations as a unit, with the following characteristics:
Atomic: all operations either happen or they do not (no partial operations)
Serializable: transactions appear to happen one after the other
Durable: once a transaction happens, it is recoverable and can survive crashes
More on Transactions
A transaction is not done until it is committed
Once committed, a transaction is durable
If a transaction fails to complete, it must roll back as if it did not happen at all
Critical sections are atomic and serializable, but not durable
Transaction Implementation (One Thread)
Example: money transfer
Begin transaction
x = x – 1;
y = y + 1;
Commit
Transaction Implementation (One Thread)
Common implementations involve the use of a log, a journal that is never erased
A file system uses a write-ahead log to track all transactions
Transaction Implementation (One Thread)
Once the old and new values of x and y are in the log, the log is committed to disk in a single write
The actual changes to those accounts are applied later
Transaction Illustrated
x = 1;y = 1;
x = 0;y = 2;
begin transaction
old x: 1
old y: 1
new x: 0
new y: 2
commit
Commit the log to disk before updating the actual values on disk
Transaction Steps
Mark the beginning of the transaction
Log the changes in account x
Log the changes in account y
Commit
Modify account x on disk
Modify account y on disk
Scenarios of Crashes
If a crash occurs after the commit, replay the log to update the accounts
If a crash occurs before or during the commit, roll back and discard the transaction
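The commit-then-apply discipline and the two crash scenarios can be sketched with a toy in-memory log. The record formats and function names here are illustrative assumptions, not any real file system's journal format.

```python
def run_transaction(accounts, updates, log, crash_before_commit=False):
    """Apply updates ({key: new_value}) to accounts via a write-ahead log."""
    log.append(("begin",))
    for key, new in updates.items():
        log.append(("change", key, accounts[key], new))  # old and new values
    if crash_before_commit:
        return                        # simulate a crash: no commit record
    log.append(("commit",))
    for key, new in updates.items():  # actual changes happen after the commit
        accounts[key] = new

def recover(accounts, log):
    """After a crash, replay committed transactions; discard the rest."""
    committed = any(rec[0] == "commit" for rec in log)
    if committed:
        for rec in log:               # replay the log to update the accounts
            if rec[0] == "change":
                _, key, old, new = rec
                accounts[key] = new
    # uncommitted: nothing to undo, since the data was never modified
    log.clear()
```

This sketch handles one transaction at a time; a real journal interleaves many and tags records with transaction IDs.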
Two-Phase Locking (Multiple Threads)
Logging alone is not enough to prevent multiple transactions from trashing one another (not serializable)
Solution: two-phase locking
1. Acquire all locks
2. Perform updates and release all locks
Thread A cannot see thread B’s changes until thread B commits and releases its locks
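The two phases can be sketched with Python threads; the account names and the sorted-order lock acquisition (a common deadlock-avoidance trick) are assumptions added for illustration.

```python
import threading

def transfer(locks, balances, src, dst, amount):
    """Move amount from src to dst under two-phase locking."""
    # Phase 1: acquire all locks (in a fixed global order to avoid deadlock)
    for name in sorted((src, dst)):
        locks[name].acquire()
    try:
        # Phase 2: perform the updates while holding every lock...
        balances[src] -= amount
        balances[dst] += amount
    finally:
        # ...then release all locks, making the changes visible to others
        for name in sorted((src, dst)):
            locks[name].release()
```

Because no lock is released until all updates are done, another thread can never observe a half-finished transfer.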
Transactions in File Systems
Almost all file systems built since 1985 use write-ahead logging
NTFS, HFS+, ext3, ext4, …
+ Eliminates running fsck after a crash
+ Write-ahead logging provides reliability
- All modifications need to be written twice
Log-Structured File System (LFS)
If logging is so great, why don’t we treat everything as log entries?
Log-structured file system:
Everything is a log entry (file headers, directories, data blocks)
Write the log only once
Use version stamps to distinguish between old and new entries
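A minimal sketch of reading back such a log: scan all entries and keep only the highest version stamp for each block. The entry format here is an assumption for illustration, not LFS's actual on-disk layout.

```python
def latest_entries(log):
    """Scan an append-only log and keep the newest version of each block.

    Each entry is a (block_id, version, data) tuple; for a given
    block_id, the entry with the highest version stamp wins.
    """
    current = {}
    for block_id, version, data in log:
        if block_id not in current or version > current[block_id][0]:
            current[block_id] = (version, data)
    return {block_id: data for block_id, (version, data) in current.items()}
```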
More on LFS
New log entries are always appended to the end of the existing log
All writes are sequential
Seeks occur only during reads
Not so bad due to temporal locality and caching
Problem: need to create more contiguous space all the time
RAID and Reliability
So far, we have assumed a single disk
What if we have multiple disks?
With more disks, the chance that some disk fails increases
RAID: redundant array of independent disks
Standard way of organizing disks and classifying the reliability of multi-disk systems
General methods: data duplication, parity, and error-correcting codes (ECC)
RAID 0
No redundancy
Uses block-level striping across disks
i.e., 1st block stored on disk 1, 2nd block stored on disk 2
Failure causes data loss
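The striping rule above is just modular arithmetic; a sketch (using 0-based disk and block numbers, unlike the 1-based example above):

```python
def stripe_location(block_number, num_disks):
    """Map a logical block to (disk, offset) under RAID 0 block striping.

    Block 0 goes to disk 0, block 1 to disk 1, and so on, wrapping
    around; the offset is how many full stripes precede the block.
    """
    return block_number % num_disks, block_number // num_disks
```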
Non-Redundant Disk Array Diagram (RAID Level 0)
[Diagram: the file system issues open(foo), read(bar), and write(zoo) requests to the disk array]
Mirrored Disks (RAID Level 1)
Each disk has a second disk that mirrors its contents
Writes go to both disks
+ Reliability is doubled
+ Read access faster
- Write access slower
- Expensive and inefficient
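A toy sketch of the mirrored read/write paths, modeling each disk as a dict (the failure-handling interface is an assumption for illustration):

```python
def mirrored_write(disks, block, data):
    """Write the same block to both members of a RAID 1 pair."""
    for disk in disks:                 # writes go to both disks
        disk[block] = data

def mirrored_read(disks, block, failed=None):
    """Read the block from any surviving mirror."""
    for i, disk in enumerate(disks):
        if i != failed:                # skip a disk known to have failed
            return disk[block]
    raise IOError("both mirrors failed")
```

Reads can go to whichever copy is faster (hence the read speedup), while every write costs two disk operations.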
Mirrored Disk Diagram (RAID Level 1)
[Diagram: the file system issues open(foo), read(bar), and write(zoo) requests to the disk array]
Memory-Style ECC (RAID Level 2)
Some disks in the array are used to hold ECC
A byte to detect errors, extra bits for error correcting
+ More efficient than mirroring
+ Can correct, not just detect, errors
- Still fairly inefficient
e.g., 4 data disks require 3 ECC disks
Memory-Style ECC Diagram (RAID Level 2)
[Diagram: the file system issues open(foo), read(bar), and write(zoo) requests to the disk array]
Bit-Interleaved Parity (RAID Level 3)
Uses bit-level striping across disks
i.e., 1st byte stored on disk 1, 2nd byte stored on disk 2
One disk in the array stores parity for the other disks
No detection bits needed; relies on the disk controller to detect errors
+ More efficient than Levels 1 and 2
- Parity disk doesn’t add bandwidth
Parity Method
Disk 1: 1001
Disk 2: 0101
Disk 3: 1000
Parity: 0100 = 1001 xor 0101 xor 1000
To recover disk 2:
Disk 2: 0101 = 1001 xor 1000 xor 0100
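The worked example above is plain XOR, so it fits in a few lines of code; the function names are ours:

```python
def xor_parity(blocks):
    """Compute the parity block as the XOR of all data blocks."""
    parity = 0
    for block in blocks:
        parity ^= block
    return parity

def recover_block(surviving_blocks, parity):
    """Rebuild a lost block: XOR the surviving blocks with the parity."""
    return xor_parity(surviving_blocks) ^ parity
```

Because XOR is its own inverse, any single missing disk (including the parity disk itself) can be reconstructed from the others.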
Bit-Interleaved RAID Diagram (Level 3)
[Diagram: the file system issues open(foo), read(bar), and write(zoo) requests to the disk array]
Block-Interleaved Parity (RAID Level 4)
Like bit-interleaved, but data is interleaved in blocks
+ More efficient data access than level 3
- Parity disk can be a bottleneck
- Small writes require 4 I/Os:
Read the old block
Read the old parity
Write the new block
Write the new parity
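Why only 4 I/Os and not a read of every data disk? Because new parity = old parity xor old data xor new data. A sketch of that update (function name and return shape are ours):

```python
def small_write(old_data, old_parity, new_data):
    """Compute what a RAID 4/5 small write puts on disk.

    After reading the old block and old parity (2 I/Os), the new
    parity can be computed locally, and only the new block and new
    parity need to be written back (2 more I/Os).
    """
    new_parity = old_parity ^ old_data ^ new_data
    return new_data, new_parity
```

XOR-ing out the old data and XOR-ing in the new data leaves the contributions of all other disks in the parity untouched.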
Block-Interleaved Parity Diagram (RAID Level 4)
[Diagram: the file system issues open(foo), read(bar), and write(zoo) requests to the disk array]
Block-Interleaved Distributed-Parity (RAID Level 5)
Sort of the most general level of RAID
Spreads the parity out over all disks
+ No parity disk bottleneck
+ All disks contribute read bandwidth
– Requires 4 I/Os for small writes
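Spreading the parity out can be as simple as rotating it by stripe number. This is one simple rotation scheme for illustration; real RAID 5 layouts (e.g., left-symmetric) differ in detail.

```python
def parity_disk(stripe_number, num_disks):
    """Which disk holds the parity block for a given stripe.

    Rotating by stripe number spreads parity evenly, so no single
    disk becomes the parity bottleneck of RAID level 4.
    """
    return stripe_number % num_disks
```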
Block-Interleaved Distributed-Parity Diagram (RAID Level 5)
[Diagram: the file system issues open(foo), read(bar), and write(zoo) requests to the disk array]