TRANSCRIPT
Reducing File System Tail Latencies with Chopper
Jun He, Duy Nguyen+, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
Department of Computer Sciences, +Department of Statistics, University of Wisconsin–Madison
Uncommon tail latencies become common at scale
"Temporary high-latency episodes (unimportant in moderate-size systems) may come to dominate overall service performance at large scale."
It is important to avoid long latencies at every node in the data center.
Local file systems contribute to tail latency
[Figure: a service spread across many nodes, each running its own local FS]
It is important to avoid long latencies in local file systems.
Chopper discovers high-latency operations in the local FS
[Figure: a local FS with its block allocator highlighted]
Goal: find problematic corner cases in the file system block allocator.
Challenge: the file system input space is huge.
Chopper explores file systems by statistical techniques
• We provide an overall analysis of file system block allocations (XFS, ext4)
• We have analyzed unexpected behaviors in detail
• We have found and fixed four allocation issues in ext4 and significantly improved layout quality
We quantify and qualify everything
Input: a workload run against the file system. Output: layout quality, measured as d-span (unit: bytes).
d-span is the distance between the first and last physical blocks occupied by a file.
[Figure: a good layout keeps a file's first and last blocks close together (small d-span); a bad layout scatters them far apart (large d-span)]
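As a minimal sketch of the metric (assuming d-span is the distance from a file's first to its last physical block, as the slide describes; the extent-list input is a stand-in for what an extent-mapping query such as FIEMAP would report):

```python
def d_span(extents):
    """d-span: distance in bytes between the first and the last
    physical block occupied by a file's data.

    `extents` is a list of (start_byte, length_bytes) physical extents.
    """
    if not extents:
        return 0
    first = min(start for start, _ in extents)
    last = max(start + length for start, length in extents)
    return last - first

# A file packed into one contiguous extent has a small d-span...
print(d_span([(4096, 8192)]))  # 8192: good layout
# ...while scattered extents inflate it, even for a tiny file.
print(d_span([(4096, 4096), (10 * 2**30, 4096)]))  # ~10 GB: bad layout
```

The second file holds only 8 KB of data, yet its d-span is about 10 GB, which is exactly the kind of tail case Chopper hunts for.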
What values to pick for the factors?
File system factors: Disk Size, Used Ratio, Fragmentation, CPU Count
Workload factors: File Size, Chunk Count, Internal Density, Chunk Order, Fsync, Sync, File Count, Directory Span
Example levels include 1, 2, 4, … 64 GB and 8, 16, … 256 KB.
Even after refining the levels, exploring all combinations would take 250 years.
We use Latin Hypercube Sampling to search efficiently
[Figure: two sampling schemes over a File Size (8–32 KB) × Disk Size (1–16 GB) grid. Random sampling can leave whole rows and columns unexplored; Latin Hypercube Sampling places exactly one sample in each row and each column.]
• Explores the space evenly
• Aids visualization
• Explores interactions between factors well
16384 samples, 30 mins with 32 machines
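The sampling idea can be sketched as follows (a minimal Latin Hypercube sampler over discrete factor levels; `latin_hypercube` and the two example factors are illustrative assumptions, not Chopper's actual plan generator):

```python
import random

def latin_hypercube(levels_per_factor, n_samples, rng=random):
    """Draw n_samples points such that, along every factor, each of
    n_samples equal strata of that factor's range is hit exactly once
    (the Latin Hypercube property)."""
    columns = []
    for levels in levels_per_factor:
        # One shuffled stratum index per sample, then map strata to levels.
        strata = list(range(n_samples))
        rng.shuffle(strata)
        columns.append([levels[s * len(levels) // n_samples] for s in strata])
    return list(zip(*columns))

# Factor levels from the talk's example grid.
file_sizes = [8, 16, 24, 32]   # KB
disk_sizes = [1, 2, 8, 16]     # GB
plan = latin_hypercube([file_sizes, disk_sizes], n_samples=4)
for file_kb, disk_gb in plan:
    print(f"run: FileSize={file_kb}KB DiskSize={disk_gb}GB")
```

With 4 samples over a 4×4 grid, every file size and every disk size appears exactly once, whereas 4 purely random samples might cluster in one corner.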
d-span is a signal of block allocation problems
Alternative 1, average block distance: complex.
Alternative 2, end-to-end performance: confounded.
We use d-span: simple and informative.
How does Chopper work?
An experimental plan sets the factors; each run applies the workload to the file system and records the resulting d-span.
The file system is created in a tmp file on a loop device backed by a RAM FS.
• All operations are in user space
• No kernel modification needed
10% of tests on ext4 have d-span > 10 GB
[Figure: cumulative density of d-span (0 to 64 GB) for vanilla XFS and vanilla ext4; the XFS curve reaches 1.0 quickly, while the ext4 curve has a long tail.]
ext4's block allocator has a large performance tail.
Factor prioritization shows which factors influence layout the most in ext4
[Figure: variance contribution (0.0 to 0.6) of each factor to d-span, covering DiskSize, FileSize, Sync, ChunkOrder, Fsync, FreeSpaceLayout, DirectorySpan, UsedRatio, InternalDensity, FileCount, and CPUCount; DiskSize contributes the most variance, CPUCount and FileCount the least.]
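A crude version of such a prioritization can be sketched as a main-effect variance decomposition: the fraction of d-span variance explained by grouping runs on each factor's level. This is an illustrative stand-in (the paper's sensitivity analysis is more careful, and the runs below are made up):

```python
from collections import defaultdict
from statistics import mean, pvariance

def variance_contribution(runs, factor, response="d_span"):
    """Fraction of total response variance explained by the factor's
    main effect: Var(E[y | level]) / Var(y)."""
    ys = [r[response] for r in runs]
    total = pvariance(ys)
    if total == 0:
        return 0.0
    groups = defaultdict(list)
    for r in runs:
        groups[r[factor]].append(r[response])
    grand = mean(ys)
    # Variance of per-level means, weighted by group size.
    between = sum(len(g) * (mean(g) - grand) ** 2
                  for g in groups.values()) / len(ys)
    return between / total

# Made-up runs in which d-span tracks DiskSize strongly, Fsync weakly.
runs = [
    {"DiskSize": 1,  "Fsync": 0, "d_span": 1},
    {"DiskSize": 1,  "Fsync": 1, "d_span": 2},
    {"DiskSize": 64, "Fsync": 0, "d_span": 40},
    {"DiskSize": 64, "Fsync": 1, "d_span": 44},
]
print(variance_contribution(runs, "DiskSize"))  # close to 1.0
print(variance_contribution(runs, "Fsync"))     # close to 0.0
```

Sorting factors by this ratio yields a bar chart like the one in the slide.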
Many factors influence data layout unexpectedly
Some combinations always produce good layouts
[Figure: a grid of ChunkOrder (all 24 orderings of chunks 0–3) versus Fsync (all 16 bit patterns); each cell is marked either "sometimes bad" or "always good". A contiguous "good region" stands out.]
Factors interact unexpectedly in ext4.
Summary of unexpected behaviors in ext4 (more in the paper)
• Linux ext4 may spread files wider on larger disks
• File size influences d-span more than expected
• Sequential writes can be harmful
• Patterns of sync() and fsync() change layout
• Fragmentation and used ratio of the disk don't matter
• Factors interact when determining data layout
The unexpected behaviors in ext4 can be explained by four design issues
• Special End Policy (in this talk)
• Scheduler Dependency (in this talk)
• File Size Dependency (in paper)
• Normalization Bug (in paper)
Special End Policy
[Figure: the ChunkOrder × Fsync grid again; the "always good" region corresponds to workloads that write the file end first and call fsync() after the first write.]
Writing and syncing the file end first could avoid poor layout.
Why is the layout bad? The allocator treats the ending data extent of a file differently.
Condition 1: the extent is at the end of the file. Condition 2: the file is closed.
[Figure: a file divided into non-ending extents and a final ending extent]
Why do ChunkOrder=3120 and Fsync=1100 always have good layout?
The chunks are written in the order 3, 1, 2, 0, with fsync() after each of the first two writes (Fsync=1100). When chunk 3 (the file end) is flushed, Condition 1 (is it the end?) holds but Condition 2 (is the file closed?) fails; for the remaining chunks, Condition 1 fails. The two conditions never hold at the same time, so the Special End Policy is never triggered.
Special End Policy
Fix: treat non-ending and ending extents equally.
Lesson learned: policies for different circumstances should be harmonious with one another.
6% of d-spans are different in two repeated experiments
Across the 16384 tests, the difference is up to 44 GB on a 64 GB disk.
Data layout can be random.
Data layouts of small files depend on the OS scheduler
[Figure: the flushing thread happens to run on CPU0, so small files are allocated from the disk space reserved for CPU0; which CPU runs the thread is up to the scheduler.]
Scheduler Dependency fix: choose locations based on inode number, instead of CPU id.
Lesson learned: policies should not depend on environmental factors that may change and are outside the control of the file system.
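The intent of the fix can be sketched as follows (a deliberately simplified model; `pick_group_by_cpu` and `pick_group_by_inode` are hypothetical names, not ext4 functions):

```python
def pick_group_by_cpu(cpu_id, n_groups):
    # Before the fix (simplified): placement follows whichever CPU the
    # flushing thread happens to run on, so it is scheduler-dependent.
    return cpu_id % n_groups

def pick_group_by_inode(inode, n_groups):
    # After the fix (simplified): placement is a pure function of the
    # inode number, so repeated runs yield the same layout.
    return inode % n_groups

# The same file, flushed from different CPUs:
n_groups = 16
inode = 5021
print({pick_group_by_cpu(cpu, n_groups) for cpu in (0, 3, 7)})    # varies per run
print({pick_group_by_inode(inode, n_groups) for _ in (0, 3, 7)})  # one group
```

Keying placement on a property of the file itself makes layout deterministic regardless of scheduling.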
Issues found and fixed
• Special End Policy (just covered)
• Scheduler Dependency (just covered)
• File Size Dependency: target locations depend on dynamic file size (in paper)
• Normalization Bug: block allocation requests are not correctly adjusted (in paper)
Removing the issues significantly cuts the tail of the d-span distribution
[Figure: cumulative density of d-span (0 to 64 GB) for vanilla ext4 (before) and ext4 with all fixes (after); the fixed curve rises much faster.]
Our fixes improve ext4's data layout.
But do our fixes reduce latencies?
[Figure: write time and update time (ms, 0 to 1500) versus number of creating threads (2, 4, 8, 16), before and after the fixes; the worst case before the fixes reaches 1.4 sec.]
Our fixes reduce tail latencies.
Conclusions
• Statistical techniques are practical
• Found and fixed four allocation issues in ext4
• Our fixes ▶ better layouts ▶ lower latency at a node ▶ lower latency at scale
• Lessons learned:
  • Policies should be harmonious
  • Policies should not depend on environmental factors
Rigorous statistics will help to reduce unexpected issues caused by intuitive but unreliable design decisions.
Thanks!
Source code and data
http://research.cs.wisc.edu/adsl/Software/chopper/