1
Improving File System Reliability with I/O Shepherding
Haryadi S. Gunawi, Vijayan Prabhakaran+, Swetha Krishnan,
Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
University of Wisconsin - Madison
2
Storage Reality
Complex storage subsystem
– Mechanical/electrical failures, buggy drivers
Complex failures
– Intermittent faults, latent sector errors, corruption, lost writes, misdirected writes, etc.
FS reliability is important
– Managing disk and individual block failures
[Figure: the storage stack below the file system (device driver, transport, firmware, media, mechanical and electrical components), each layer a potential source of failure]
3
File System Reality
Good news:
– Rich literature
  • Checksum, parity, mirroring
  • Versioning, physical/logical identity
– Important for single-disk and multiple-disk settings
Bad news:
– File system reliability is broken [SOSP'05]
  • Unlike other components (performance, consistency)
  • Reliability approaches are hard to understand and evolve
4
Broken FS Reliability
Lack of a good reliability strategy
– No remapping, checksumming, redundancy
– Existing strategy is coarse-grained
  • Mount read-only, panic, retry
Inconsistent policies
– Different techniques in similar failure scenarios
Bugs
– Ignored write failures
Let's fix them! With the current framework? Not so easy…
5
No Reliability Framework
Diffused
– Each fault is handled at each I/O location
– Different developers might increase diffusion
Inflexible
– Fixed policies, hard to change
– But no single policy fits all diverse settings
  • Less reliable vs. more reliable drives
  • Desktop workloads vs. web-server apps
The need for a new framework
– Reliability as a first-class file system concern
[Figure: reliability policy code diffused throughout the file system, between the file system and the disk subsystem]
6
Localized
I/O Shepherd
– Localized policies, …
  • More correct, fewer bugs, simpler reliability management
[Figure: the I/O Shepherd layer interposed between the file system and the disk subsystem]
7
Flexible
I/O Shepherd
– Localized, flexible policies, …
[Figure: the shepherd runs different policies for different settings: add mirroring for archival/scientific data, checksums for networked storage, more retries and more protection for a less reliable ATA drive, less protection for a more reliable SCSI drive]
8
Powerful
I/O Shepherd
– Localized, flexible, and powerful policies
[Figure: as above, plus a custom drive whose policy composes several mechanisms (add mirror, checksum, more retries, more protection); policies are composable]
9
Outline
Introduction
I/O Shepherd Architecture
Implementation
Evaluation
Conclusion
10
Architecture
Building a reliability framework
– How to specify reliability policies?
– How to make powerful policies?
– How to simplify reliability management?
I/O Shepherd layer
Four important components (see the sketch after the figure below)
– Policy table
– Policy code
– Policy primitives
– Policy metadata
[Figure: the I/O Shepherd layer sits between the file system and the disk subsystem and contains:
– Policy Table: Data → Mirror(), Inode → …, Super → …
– Policy Code, e.g.:
    DynMirrorWrite(DiskAddr D, MemAddr A)
      DiskAddr copyAddr;
      IOS_MapLookup(MMap, D, &copyAddr);
      if (copyAddr == NULL)
        PickMirrorLoc(MMap, D, &copyAddr);
        IOS_MapAllocate(MMap, D, copyAddr);
      return (IOS_Write(D, A, copyAddr, A));
– Policy Primitives: Checksum, Lookup, Read, SanityCheck, OnlineFsck, Write, Location, …
– Policy Metadata: Mirror-Map, Remap-Map, Checksum-Map]
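To make the dispatch concrete, here is a minimal user-space sketch of the core idea: the shepherd intercepts each I/O, looks up the policy registered for the block's type in the policy table, and runs that policy code. All type and function names below are hypothetical stand-ins, not CrookFS's actual interfaces.

  #include <stdio.h>

  typedef unsigned long DiskAddr;
  typedef void *MemAddr;
  typedef int (*WritePolicy)(DiskAddr d, MemAddr a);

  enum BlockType { BT_DATA, BT_INODE, BT_SUPER, BT_MAX };

  /* Toy policies; real policy code would be composed from shepherd
     primitives such as the IOS_* calls shown in the figure above. */
  static int PlainWrite(DiskAddr d, MemAddr a)     { (void)a; printf("write %lu\n", d); return 0; }
  static int DynMirrorWrite(DiskAddr d, MemAddr a) { (void)a; printf("mirrored write %lu\n", d); return 0; }

  /* Policy table: one write policy per block type. */
  static WritePolicy policy_table[BT_MAX] = {
      [BT_DATA]  = DynMirrorWrite,
      [BT_INODE] = PlainWrite,
      [BT_SUPER] = PlainWrite,
  };

  /* The shepherd interposes on every write the file system issues. */
  static int shepherd_write(enum BlockType t, DiskAddr d, MemAddr a)
  {
      return policy_table[t](d, a);
  }

  int main(void) { return shepherd_write(BT_DATA, 1001, NULL); }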
11
Policy Table
How to specify reliability policies?
– Different block types have different levels of importance
– Different volumes need different reliability levels
– Need fine-grained policy
Policy table
– Different policies across different block types
– Different policy tables across different volumes
Policy Table:
Block Type    | Write Policy      | Read Policy
…             | …                 | …
Superblock    | TripleMirror()    | …
Inode         | ChecksumParity()  | …
Inode Bitmap  | ChecksumParity()  | …
Data          | WriteRetry1sec()  | …
…             | …                 | …

[Figure: one shepherd below the file system, with different policy tables for different volumes (/lib, /tmp, /boot, /archive), ranging from no protection to high-level reliability]
12
Policy Metadata
What support is needed to make powerful policies?
– Remapping: track bad-block remapping
– Mirroring: allocate new blocks
– Sanity check: need on-disk structure specification
Integration with the file system
– Runtime allocation
– Detailed knowledge of on-disk structures
I/O Shepherd Maps (see the sketch below the figure)
– Managed by the shepherd
– Commonly used maps:
  • Mirror-map
  • Checksum-map
  • Remap-map

[Figure: example maps kept by the I/O Shepherd below the file system:
  Mirror-Map: 1001 → 2001, 1002 → 2002, 1003 → null, …
  Csum-Map:   1001 → 1010, 1002 → 1010, 1003 → 1010, …
  Remap-Map:  1001 → null, 1002 → null, 1003 → 3003, …]
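As a rough illustration of the map abstraction, here is a toy in-memory sketch with hypothetical names; the shepherd's real maps are persistent on-disk structures kept consistent by the chained transactions described later.

  #include <stdio.h>

  typedef unsigned long DiskAddr;
  #define MAP_NONE ((DiskAddr)0)   /* stands in for the "null" entries above */

  /* Toy map: a fixed-size array of (original, copy) pairs. */
  struct MapEntry    { DiskAddr orig, copy; };
  struct ShepherdMap { struct MapEntry e[1024]; int n; };

  /* Return the copy location recorded for d, or MAP_NONE if none. */
  static DiskAddr map_lookup(const struct ShepherdMap *m, DiskAddr d)
  {
      for (int i = 0; i < m->n; i++)
          if (m->e[i].orig == d)
              return m->e[i].copy;
      return MAP_NONE;
  }

  /* Record that block d has a copy (mirror or remap target) at c. */
  static void map_allocate(struct ShepherdMap *m, DiskAddr d, DiskAddr c)
  {
      m->e[m->n].orig = d;
      m->e[m->n].copy = c;
      m->n++;
  }

  int main(void)
  {
      struct ShepherdMap mirror_map = { .n = 0 };
      map_allocate(&mirror_map, 1001, 2001);            /* 1001 -> 2001 */
      printf("%lu\n", map_lookup(&mirror_map, 1001));   /* prints 2001  */
      return 0;
  }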
13
Policy Primitives and Code
How to make reliability management simple?
I/O Shepherd Primitives
– Rich set, reusable
– Complexities are hidden
Policy writer simply composes primitives into policy code

[Figure: policy primitives grouped by category:
  Maps: Map Lookup, Map Update
  Computation: Checksum, Parity
  FS-Level: Sanity Check, Stop FS
  Layout: Allocate Near, Allocate Far]

Example policy code composed from primitives:
  MirrorData(Addr D)
    Addr M;
    MapLookup(MMap, D, M);
    if (M == NULL)
      M = PickMirrorLoc(D);
      MapAllocate(MMap, D, M);
    Copy(D, M);
    Write(D, M);
14
[Figure: MirrorData() in action. The policy table maps Data → MirrorData() (Inode → …, Super → …). When the file system writes data block D, the shepherd consults the Mirror-Map (initially D → NULL), picks a mirror location R, records D → R in the Mirror-Map, and writes both D and R to the disk subsystem.]

MirrorData(Addr D)
  Addr R;
  R = MapLookup(MMap, D);
  if (R == NULL)
    R = PickMirrorLoc(D);
    MapAllocate(MMap, D, R);
  Copy(D, R);
  Write(D, R);
15
Summary
Interposition simplifies reliability management
– Localized policies
– Simple and extensible policies
Challenge: keeping new data and metadata consistent
16
Outline
Introduction
I/O Shepherd Architecture
Implementation: Consistency Management
Evaluation
Conclusion
17
Implementation
CrookFS
– Named for the hooked staff of a shepherd
– An ext3 variant with I/O shepherding capabilities
Implementation
– Changes in core OS
  • Semantic information, layout and allocation interface, allocation during recovery
  • Consistency management (data journaling mode)
  • ~900 LOC (non-intrusive)
– Shepherd infrastructure
  • Shepherd primitives, thread support, maps management, etc.
  • ~3500 LOC (reusable for other file systems)
Well-integrated with the file system
– Small overhead
18
Data Journaling Mode
[Figure: data-journaling timeline. In memory: data block D and inode I. Sync (intent is logged): TB, D, I, TC are written to the journal. Checkpoint (intent is realized): D, I, and bitmap Bm are written to their fixed locations. Then the transaction is released.]
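The timeline in the figure can be read as the following steps; this is a toy sketch with hypothetical names, not ext3's real API.

  #include <stdio.h>

  static void journal(const char *rec)    { printf("journal:    %s\n", rec); }
  static void checkpoint(const char *blk) { printf("checkpoint: %s\n", blk); }

  int main(void)
  {
      /* Sync: the intent is logged to the journal. */
      journal("TB");   /* transaction begin  */
      journal("D");    /* data block         */
      journal("I");    /* inode block        */
      journal("TC");   /* transaction commit */

      /* Checkpoint: the intent is realized at the fixed locations. */
      checkpoint("D");
      checkpoint("I");
      checkpoint("Bm");   /* bitmap */

      /* Tx release: the journal space for this transaction is reclaimed. */
      printf("release transaction\n");
      return 0;
  }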
19
Reliability Policy + Journaling
When to run policies?
– Policies (e.g. mirroring) are executed during checkpoint
Is the current journaling approach adequate to support reliability policies?
– Could we run remapping/mirroring during checkpoint?
No
– Problem of failed intentions
– Cannot react to checkpoint failures
20
Failed Intentions
[Figure: example policy: remapping. Memory holds D, I, and the remap-map (initially D → 0). The journal contains TB, D, I, TC. During checkpoint the write of D fails, so the policy remaps D to a new location R and updates the remap-map to D → R; I and R reach their fixed locations and the checkpoint completes, but releasing the transaction safely is impossible: if a crash occurs before the updated remap-map is made persistent, journal replay leaves inconsistencies: (1) the inode's pointer to D is invalid, (2) there is no reference to R.]
21
Journaling Flaw
Journal: log the intent to the journal
– If a journal write failure occurs? Simply abort the transaction
Checkpoint: the intent is realized at its final location
– If a checkpoint failure occurs? No solution!
  • ext3, IBM JFS: ignore
  • ReiserFS: stop the FS (coarse-grained recovery)
Flaw in the current journaling approach
– No consistency for any checkpoint recovery that changes state
  • Too late: the transaction has already committed
  • A crash could occur at any time
– Hopes checkpoint writes always succeed (wrong!)
Consistent reliability + current journaling = impossible
22
Chained Transactions
A chained transaction contains all recent changes (e.g. the shepherd's modified metadata)
It is "chained" to the previous transaction
Rule: only after the chained transaction commits can the previous transaction be released
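A minimal sketch of the rule, with hypothetical names and a toy journal: when a checkpoint write fails and a policy changes shepherd metadata (here, the remap-map), that change is logged in a chained transaction, and the original transaction is released only after the chained transaction commits.

  #include <stdio.h>

  static void journal(const char *rec) { printf("journal:    %s\n", rec); }
  static int  checkpoint(const char *blk, int fails)
  {
      printf("checkpoint: %s%s\n", blk, fails ? " (failed)" : "");
      return fails ? -1 : 0;
  }

  int main(void)
  {
      /* Original transaction commits as usual. */
      journal("TB"); journal("D"); journal("I"); journal("TC");

      /* Checkpoint: the write of D fails; the policy remaps D to R
         and thereby modifies the shepherd's remap-map. */
      if (checkpoint("D", 1) != 0) {
          checkpoint("R", 0);                  /* remapped copy */
          printf("remap-map updated: D -> R\n");
      }
      checkpoint("I", 0);

      /* Chained transaction: log the changed shepherd metadata.
         Only after it commits may the original transaction be released. */
      journal("TB"); journal("remap-map (D -> R)"); journal("TC");
      printf("release original transaction\n");
      return 0;
  }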
23
Chained Transactions
[Figure: example policy: remapping, replayed with chained transactions. TB, D, I, TC are logged as before; during checkpoint the write of D fails and D is remapped to R, updating the remap-map to D → R. Old behavior: release the transaction as soon as the checkpoint completes. New behavior: first commit a chained transaction (TB … TC) containing the updated remap-map; only after it commits is the previous transaction released.]
24
Summary
Chained transactions
– Handle failed intentions
– Work for all policies
– Minimal changes in the journaling layer
Repeatable across crashes
– Idempotent policies
  • An important property for consistency across multiple crashes
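One way to see the idempotence requirement: after a crash the transaction is replayed and the policy runs again, so a second run must reach the same end state as the first. A mirroring policy achieves this by consulting its map before allocating. A toy sketch with hypothetical names follows; it is not CrookFS's implementation.

  #include <stdio.h>

  typedef unsigned long DiskAddr;
  #define MAP_NONE ((DiskAddr)0)

  /* Toy stand-ins for shepherd primitives (hypothetical). */
  static DiskAddr mirror_of = MAP_NONE;                /* one-entry "mirror-map" */
  static DiskAddr map_lookup(DiskAddr d)               { (void)d; return mirror_of; }
  static void     map_allocate(DiskAddr d, DiskAddr r) { (void)d; mirror_of = r; }
  static DiskAddr pick_mirror_loc(DiskAddr d)          { return d + 1000; }
  static void     write_block(DiskAddr b)              { printf("write %lu\n", b); }

  /* Idempotent mirroring: a replay after a crash reuses the mirror
     location already recorded in the map instead of allocating another. */
  static void mirror_data(DiskAddr d)
  {
      DiskAddr r = map_lookup(d);
      if (r == MAP_NONE) {                 /* only the first run allocates */
          r = pick_mirror_loc(d);
          map_allocate(d, r);
      }
      write_block(d);
      write_block(r);
  }

  int main(void)
  {
      mirror_data(1001);   /* first run                                   */
      mirror_data(1001);   /* "replay": same mirror location, same result */
      return 0;
  }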
25
Outline
Introduction
I/O Shepherd Architecture
Implementation
Evaluation
Conclusion
26
Evaluation
Flexible
– Change ext3 to all-stop or retry-more policies
Fine-grained
– Implement gracefully degrading RAID [TOS'05]
Composable
– Perform multiple lines of defense
Simple
– Craft 8 policies in a simple manner
27
Flexibility
Modify ext3's inconsistent read-recovery policies
[Figure: matrix of ext3 read-recovery behavior, failed block type vs. workload; cells show the observed policy: stop, propagate, retry, ignore failure, no recovery, or not applicable. Example: failed block = indirect block, workload = path traversal (cd /mnt/fs2/test/a/b/); observed policy: detect the failure and propagate it to the application. Similar failures receive different policies.]
28
Flexibility
Modify ext3 policies to all-stop policies:

  AllStopRead(Block B)
    if (Read(B) == OK) return OK;
    else Stop();

Policy Table: Any Block Type → AllStopRead()
[Figure: the ext3 policy matrix next to the all-stop matrix; with AllStopRead() installed for every block type, every read failure maps to Stop.]
29
Flexibility
Modify ext3 policies to retry-more policies:

  RetryMoreRead(Block B)
    for (int i = 0; i < RETRY_MAX; i++)
      if (Read(B) == SUCCESS) return SUCCESS;
    return FAILURE;

Policy Table: Any Block Type → RetryMoreRead()
[Figure: the ext3 policy matrix next to the retry-more matrix; with RetryMoreRead() installed for every block type, every read failure maps to Retry.]
30
Fine-Granularity
RAID problem: extreme unavailability
– Partially available data
– Unavailable root directory
DGRAID [TOS'05]: degrade gracefully
– Fault-isolate a file to a disk
– Highly replicate metadata
[Figure: a file system on RAID-0 stripes every file (file1.pdf, /root, f1.pdf, f2.pdf) across all disks, so one failed disk breaks many files; with Shepherd + DGRAID each file is isolated to a disk and metadata is highly replicated.]
31
Fine-Granularity
DGRAID Policy Table:
Block Type                                                  | Policy
Superblock, Group Desc, Bitmaps, Directory, Inode, Indirect | MirrorXway()   (X = 1, 5, 10)
Data                                                        | IsolateAFileToADisk()

[Figure: file availability (A) as a function of the number of disk failures (F), comparing 10-way metadata replication with a linear layout; sample points shown: F = 1: A = 90%, F = 2: A = 80%, F = 3: A ≈ 40%.]
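As a rough idea of what an X-way mirroring policy could look like when composed from the primitives, here is a hypothetical sketch; it is not CrookFS's MirrorXway implementation, and the helper names are invented for illustration.

  #include <stdio.h>

  typedef unsigned long DiskAddr;

  /* Toy stand-ins for shepherd primitives (hypothetical). */
  static DiskAddr pick_mirror_loc(DiskAddr d, int i) { return d + 1000 * (i + 1); }
  static void     write_block(DiskAddr b)            { printf("write %lu\n", b); }

  /* Mirror a metadata block X ways: write the original plus X copies; a
     real policy would also record each copy in the shepherd's mirror-map. */
  static void mirror_x_way(DiskAddr d, int x)
  {
      write_block(d);
      for (int i = 0; i < x; i++)
          write_block(pick_mirror_loc(d, i));
  }

  int main(void) { mirror_x_way(42, 5); return 0; }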
32
Composability
Multiple lines of defense
– Assemble both low-level and high-level recovery mechanisms

  ReadInode(Block B) {
    C = Lookup(Ch-Map, B);
    Read(B, C);
    if (CompareChecksum(B, C) == OK) return OK;
    M = Lookup(M-Map, B);
    Read(M);
    if (CompareChecksum(M, C) == OK) { B = M; return OK; }
    if (SanityCheck(B) == OK) return OK;
    if (SanityCheck(M) == OK) { B = M; return OK; }
    RunOnlineFsck();
    return ReadInode(B);
  }

[Figure: recovery time (ms) depending on which line of defense handles the fault.]
33
Simplicity
Writing reliability policies is simple
– Implemented 8 policies
  • Using reusable primitives
– Most complex one < 80 LOC

Policy                     | LOC
Propagate                  | 8
Sanity Check               | 10
Reboot                     | 15
Retry                      | 15
Mirroring                  | 18
Parity                     | 28
Multiple Lines of Defense  | 39
D-GRAID                    | 79
34
Conclusion
Modern storage failures are complex
– Not only fail-stop; disks also exhibit individual block failures
An FS reliability framework does not exist
– Scattered policy code: can't expect much reliability
– Journaling + block failures → failed intentions (a flaw)
I/O Shepherding
– Powerful
  • Deploy disk-level, RAID-level, and FS-level policies
– Flexible
  • Reliability as a function of workload and environment
– Consistent
  • Chained transactions
35
ADvanced Systems Laboratory
www.cs.wisc.edu/adsl
Scholarship sponsor:
Research sponsor:
Thanks to:
– I/O Shepherd's shepherd: Frans Kaashoek
36
Extra Slides
37
[Figure: RemapMirrorData() in action. The policy table maps Data → RemapMirrorData(). The Mirror-Map starts at D → NULL, is updated to D → R, and, after the write to R fails, is updated again to D → Q.]

RemapMirrorData(Addr D)
  Addr R, Q;
  MapLookup(MMap, D, R);
  if (R == NULL)
    R = PickMirrorLoc(D);
    MapAllocate(MMap, D, R);
  Copy(D, R);
  Write(D, R);
  if (Fail(R))
    Deallocate(R);
    Q = PickMirrorLoc(D);
    MapAllocate(MMap, D, Q);
    Write(Q);
38
Chained Transactions (2)
[Figure: example policy: RemapMirrorData. As in the remapping example, TB, D, I, TC are logged; during checkpoint the policy writes copies R1 and then R2, updating the shepherd's map from D → 0 to D → R1 and then D → R2; after the checkpoint completes, a chained transaction (TB … TC) logs the final map state (D → R2) before the previous transaction is released.]
39
Existing Solutions Enough?
Is the machinery in high-end systems enough (e.g. disk scrubbing, redundancy, end-to-end checksums)?
– Not pervasive in home environments (which store photos, tax returns)
– New trend: commodity storage clusters (Google, EMC Centera)
Is RAID enough?
– Requires more than one disk
– Does not protect against faults above the disk system
– Focuses on whole-disk failure
– Does not enable fine-grained policies