reducing file system tail latencies with - usenix · reducing file system tail latencies with...

103
Reducing File System Tail Latencies with Chopper Jun He, Duy Nguyen + , Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Department of Computer Sciences, + Department of Statistics University of Wisconsin, Madison

Upload: hahanh

Post on 06-Jul-2018

221 views

Category:

Documents


0 download

TRANSCRIPT

Reducing File System Tail Latencies with

ChopperJun He, Duy Nguyen+,

Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau

Department of Computer Sciences, +Department of StatisticsUniversity of Wisconsin, Madison

Uncommon tail latencies become common at scale

Uncommon tail latencies become common at scale

“Temporary high-latency episodes (unimportant in moderate-size systems) may come to dominate overall

service performance at large scale.”

Uncommon tail latencies become common at scale

Uncommon tail latencies become common at scale

Uncommon tail latencies become common at scale

Important to avoid long latencies at every node in the data center

Local file systems contribute to tail latency

Local FS

Local FS

Local FS

Local FS

Local FS

Local file systems contribute to tail latency

Local FS

Local FS

Local FS

Local FS

Local FS

Important to avoid long latencies in local file systems

Chopper discovers high-latency operations in local FS

Local FS

BlockAllocator

GoalFind problematic corner cases in

file system block allocator

ChallengeFile system input space is huge

Chopper explores file systems by statistical techniques

Chopper explores file systems by statistical techniques

We provide an overall analysis of file system block allocations (XFS, ext4)

Chopper explores file systems by statistical techniques

We provide an overall analysis of file system block allocations (XFS, ext4)

We have analyzed unexpected behaviors in detail

Chopper explores file systems by statistical techniques

We provide an overall analysis of file system block allocations (XFS, ext4)

We have analyzed unexpected behaviors in detail

We have found and fixed four allocation issues in ext4 and significantly improved layout quality

OutlinePart 1

Collect Data

Part 2Analyze Data

Part 3Understand File System

OutlinePart 1

Collect Data

Part 2Analyze Data

Part 3Understand File System

We quantify and qualify everything

Layout quality

WorkloadFile system

INPUT

OUTPUT

{

We quantify and qualify everything

Layout quality

WorkloadFile system

d-span(unit: byte)

…file: First Last

INPUT

OUTPUT

{

We quantify and qualify everything

Layout quality

WorkloadFile system

d-span(unit: byte)

…file: First Last

First Last

First Last

Good:

Bad: …

INPUT

OUTPUT

{

What values to pick for the factors?

• Disk Size• Used Ratio• Fragmentation• CPU Count

• File Size• Chunk Count• Internal Density• Chunk Order• Fsync• Sync• File Count• Directory Span

File System Workload

What values to pick for the factors?

• Disk Size• Used Ratio• Fragmentation• CPU Count

• File Size• Chunk Count• Internal Density• Chunk Order• Fsync• Sync• File Count• Directory Span

1,2,4,..64GB

File System Workload

What values to pick for the factors?

• Disk Size• Used Ratio• Fragmentation• CPU Count

• File Size• Chunk Count• Internal Density• Chunk Order• Fsync• Sync• File Count• Directory Span

1,2,4,..64GB 8,16,..256KB

File System Workload

What values to pick for the factors?

• Disk Size• Used Ratio• Fragmentation• CPU Count

• File Size• Chunk Count• Internal Density• Chunk Order• Fsync• Sync• File Count• Directory Span

1,2,4,..64GB 8,16,..256KB

After refining, 250 years to explore all combinations

File System Workload

We use Latin Hypercube Sampling to search efficiently

We use Latin Hypercube Sampling to search efficiently

Random Sampling

XX X

X

File Size

Dis

k Si

ze

8KB 16KB 24KB 32KB

1GB

2G

B 8

GB

16G

B

We use Latin Hypercube Sampling to search efficiently

Random Sampling

XX X

X

File Size

Dis

k Si

ze

8KB 16KB 24KB 32KB

1GB

2G

B 8

GB

16G

B

We use Latin Hypercube Sampling to search efficiently

Random Sampling

XX X

X

File Size

Dis

k Si

ze

8KB 16KB 24KB 32KB

1GB

2G

B 8

GB

16G

B

We use Latin Hypercube Sampling to search efficiently

Random Sampling

XX X

X

File Size

Dis

k Si

ze

8KB 16KB 24KB 32KB

1GB

2G

B 8

GB

16G

B

We use Latin Hypercube Sampling to search efficiently

File Size

Dis

k Si

zeLatin Hypercube Sampling

XX

XX

8KB 16KB 24KB 32KB

1GB

2G

B 8

GB

16G

BRandom Sampling

XX X

X

File Size

Dis

k Si

ze

8KB 16KB 24KB 32KB

1GB

2G

B 8

GB

16G

B

We use Latin Hypercube Sampling to search efficiently

File Size

Dis

k Si

zeLatin Hypercube Sampling

XX

XX

8KB 16KB 24KB 32KB

1GB

2G

B 8

GB

16G

BRandom Sampling

XX X

X

File Size

Dis

k Si

ze

8KB 16KB 24KB 32KB

1GB

2G

B 8

GB

16G

B

We use Latin Hypercube Sampling to search efficiently

File Size

Dis

k Si

zeLatin Hypercube Sampling

XX

XX

8KB 16KB 24KB 32KB

1GB

2G

B 8

GB

16G

BRandom Sampling

XX X

X

File Size

Dis

k Si

ze

8KB 16KB 24KB 32KB

1GB

2G

B 8

GB

16G

B

• Explores space evenly• Aids visualization• Explores interactions between factors well

We use Latin Hypercube Sampling to search efficiently

File Size

Dis

k Si

zeLatin Hypercube Sampling

XX

XX

8KB 16KB 24KB 32KB

1GB

2G

B 8

GB

16G

BRandom Sampling

XX X

X

File Size

Dis

k Si

ze

8KB 16KB 24KB 32KB

1GB

2G

B 8

GB

16G

B

• Explores space evenly• Aids visualization• Explores interactions between factors well

16384 samples, 30 mins with 32 machines

d-span is a signal of block allocation problems

We used-span

A

B

d-span is a signal of block allocation problems

We used-span

A

B

Alternative 1Average

block distance

d-span is a signal of block allocation problems

We used-span

A

B

Complex

Alternative 1Average

block distance

d-span is a signal of block allocation problems

App

Alternative 2End-to-end

performance

We used-span

A

B

Complex

Alternative 1Average

block distance

d-span is a signal of block allocation problems

App

Alternative 2End-to-end

performance

We used-span

A

B

Complex Confounded

Alternative 1Average

block distance

d-span is a signal of block allocation problems

App

Alternative 2End-to-end

performance

We used-span

A

B

Simple &

Informative Complex Confounded

Alternative 1Average

block distance

How Chopper works?

experimental plan

How Chopper works?

experimental plan

RAM FS

How Chopper works?

experimental plan

RAM FS

tmp file

How Chopper works?

experimental plan

RAM FS

looping device

tmp file

How Chopper works?

experimental plan

File System

RAM FS

looping device

tmp file

How Chopper works?

experimental plan

Workload

File System

RAM FS

looping device

tmp file

Factors d-span

How Chopper works?

experimental plan

Workload

File System

RAM FS

looping device

tmp file

Factors d-span

How Chopper works?

experimental plan

Workload

File System

RAM FS

looping device

tmp file

Factors d-span

How Chopper works?

experimental plan

Workload

File System

RAM FS

looping device

tmp file

• All operations are in user space • No kernel modification needed

Part 1Collect Data

Part 2Analyze Data

Part 3Understand File System

Outline

0.00.10.20.30.40.50.60.70.80.91.0

0 8 16 24 32 40 48 56 64d−span (GB)

Cum

ulat

ive D

ensi

ty

xfs−vanillaext4−vanilla

10% of tests on ext4 have d-span > 10GB

ext4

XFS

0.00.10.20.30.40.50.60.70.80.91.0

0 8 16 24 32 40 48 56 64d−span (GB)

Cum

ulat

ive D

ensi

ty

xfs−vanillaext4−vanilla

10% of tests on ext4 have d-span > 10GB

ext4

XFS

ext4’s block allocator has a large performance tail

Factor prioritization shows which factors influence layout the most

Factor A

d-sp

an

1 2

Factor Bd-

span

1 2

More importantLess important

Factor prioritization shows which factors influence layout the most in ext4

CPUCountFileCount

InternalDensityUsedRatio

DirectorySpanFreeSpaceLayout

FsyncChunkOrder

SyncFileSize

DiskSize

0.0 0.2 0.4 0.6Variance Contribution

Most important

Least importantImportance (Variance Contribution)

Factor prioritization shows which factors influence layout the most in ext4

CPUCountFileCount

InternalDensityUsedRatio

DirectorySpanFreeSpaceLayout

FsyncChunkOrder

SyncFileSize

DiskSize

0.0 0.2 0.4 0.6Variance Contribution

Most important

Least importantImportance (Variance Contribution)

Factor prioritization shows which factors influence layout the most in ext4

CPUCountFileCount

InternalDensityUsedRatio

DirectorySpanFreeSpaceLayout

FsyncChunkOrder

SyncFileSize

DiskSize

0.0 0.2 0.4 0.6Variance Contribution

Most important

Least importantImportance (Variance Contribution)

Factor prioritization shows which factors influence layout the most in ext4

CPUCountFileCount

InternalDensityUsedRatio

DirectorySpanFreeSpaceLayout

FsyncChunkOrder

SyncFileSize

DiskSize

0.0 0.2 0.4 0.6Variance Contribution

Most important

Least importantImportance (Variance Contribution)

Factor prioritization shows which factors influence layout the most in ext4

CPUCountFileCount

InternalDensityUsedRatio

DirectorySpanFreeSpaceLayout

FsyncChunkOrder

SyncFileSize

DiskSize

0.0 0.2 0.4 0.6Variance Contribution

Most important

Least importantImportance (Variance Contribution)

Factor prioritization shows which factors influence layout the most in ext4

CPUCountFileCount

InternalDensityUsedRatio

DirectorySpanFreeSpaceLayout

FsyncChunkOrder

SyncFileSize

DiskSize

0.0 0.2 0.4 0.6Variance Contribution

Most important

Least importantImportance (Variance Contribution)

Factor prioritization shows which factors influence layout the most in ext4

CPUCountFileCount

InternalDensityUsedRatio

DirectorySpanFreeSpaceLayout

FsyncChunkOrder

SyncFileSize

DiskSize

0.0 0.2 0.4 0.6Variance Contribution

Most important

Least importantImportance (Variance Contribution)

Many factors influence data layout unexpectedly

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

0000000100100011010001010110011110001001101010111100110111101111

0123

0132

0213

0231

0312

0321

1023

1032

1203

1230

1302

1320

2013

2031

2103

2130

2301

2310

3012

3021

3102

3120

3201

3210

ChunkOrder

Fsyn

c

● ●Sometimes bad Always good

Some combinations always produce good layouts

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

0000000100100011010001010110011110001001101010111100110111101111

0123

0132

0213

0231

0312

0321

1023

1032

1203

1230

1302

1320

2013

2031

2103

2130

2301

2310

3012

3021

3102

3120

3201

3210

ChunkOrder

Fsyn

c

● ●Sometimes bad Always good

Some combinations always produce good layouts

GoodRegion

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

0000000100100011010001010110011110001001101010111100110111101111

0123

0132

0213

0231

0312

0321

1023

1032

1203

1230

1302

1320

2013

2031

2103

2130

2301

2310

3012

3021

3102

3120

3201

3210

ChunkOrder

Fsyn

c

● ●Sometimes bad Always good

Some combinations always produce good layouts

Factors interact unexpectedly in ext4

GoodRegion

Summary of unexpected behaviors in ext4

Summary of unexpected behaviors in ext4

Linux ext4 may spread files wider on larger disks

Summary of unexpected behaviors in ext4

Linux ext4 may spread files wider on larger disks

File size influences d-span more than expected

Summary of unexpected behaviors in ext4

Linux ext4 may spread files wider on larger disks

File size influences d-span more than expected

Sequential writes can be harmful

Summary of unexpected behaviors in ext4

Linux ext4 may spread files wider on larger disks

File size influences d-span more than expected

Sequential writes can be harmful

Patterns of sync() and fsync() change layout

Summary of unexpected behaviors in ext4

Linux ext4 may spread files wider on larger disks

File size influences d-span more than expected

Sequential writes can be harmful

Patterns of sync() and fsync() change layout

Fragmentations and used ratio of disk don’t matter

Summary of unexpected behaviors in ext4

Linux ext4 may spread files wider on larger disks

File size influences d-span more than expected

Sequential writes can be harmful

Patterns of sync() and fsync() change layout

Fragmentations and used ratio of disk don’t matter

Factors interact when determining data layout

Summary of unexpected behaviors in ext4

More in the paper

Linux ext4 may spread files wider on larger disks

File size influences d-span more than expected

Sequential writes can be harmful

Patterns of sync() and fsync() change layout

Fragmentations and used ratio of disk don’t matter

Factors interact when determining data layout

Part 1Collect Data

Part 2Analyze Data

Part 3Understand File System

Outline

The unexpected behaviors in ext4 can be explained by four design issues

Special End Policy

Scheduler Dependency

File Size Dependency

Normalization Bug

in this talk

in this talk

in paper

in paper

Special End Policy

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

0000000100100011010001010110011110001001101010111100110111101111

0123

0132

0213

0231

0312

0321

1023

1032

1203

1230

1302

1320

2013

2031

2103

2130

2301

2310

3012

3021

3102

3120

3201

3210

ChunkOrder

Fsyn

c

● ●Sometimes bad Always good

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

0000000100100011010001010110011110001001101010111100110111101111

0123

0132

0213

0231

0312

0321

1023

1032

1203

1230

1302

1320

2013

2031

2103

2130

2301

2310

3012

3021

3102

3120

3201

3210

ChunkOrder

Fsyn

c● ●Sometimes bad Always good

Writing and syncing file end first could avoid poor layout

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

0000000100100011010001010110011110001001101010111100110111101111

0123

0132

0213

0231

0312

0321

1023

1032

1203

1230

1302

1320

2013

2031

2103

2130

2301

2310

3012

3021

3102

3120

3201

3210

ChunkOrder

Fsyn

c● ●Sometimes bad Always good

Writing and syncing file end first could avoid poor layout

Write file end first

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

0000000100100011010001010110011110001001101010111100110111101111

0123

0132

0213

0231

0312

0321

1023

1032

1203

1230

1302

1320

2013

2031

2103

2130

2301

2310

3012

3021

3102

3120

3201

3210

ChunkOrder

Fsyn

c● ●Sometimes bad Always good

Writing and syncing file end first could avoid poor layout

Callfsync()after firstwrite

Write file end first

ending

Why layout is bad? The allocator treats the ending data extent of a file differently

Condition 1: the extent is at the end of the fileCondition 2: the file is closed

A file: non-ending

ending

Why layout is bad? The allocator treats the ending data extent of a file differently

Condition 1: the extent is at the end of the fileCondition 2: the file is closed

A file:

non-ending

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

0000000100100011010001010110011110001001101010111100110111101111

0123

0132

0213

0231

0312

0321

1023

1032

1203

1230

1302

1320

2013

2031

2103

2130

2301

2310

3012

3021

3102

3120

3201

3210

ChunkOrder

Fsyn

c● ●Sometimes bad Always good

Why ChunkOrder=3120 and Fsync=1100 always have good layout?

3 1 2 0Time

fsync()1

fsync()1 0 0

Cond 1 (is end?): Cond 2 (is closed?):

Special End?

Why ChunkOrder=3120 and Fsync=1100 always have good layout?

3 1 2 0Time

fsync()1

fsync()1 0 0

✔Cond 1 (is end?): Cond 2 (is closed?):

Special End?

Why ChunkOrder=3120 and Fsync=1100 always have good layout?

3 1 2 0Time

fsync()1

fsync()1 0 0

✖✔Cond 1 (is end?):

Cond 2 (is closed?):Special End?

Why ChunkOrder=3120 and Fsync=1100 always have good layout?

3 1 2 0Time

fsync()1

fsync()1 0 0

✖✔Cond 1 (is end?):

Cond 2 (is closed?):

✖Special End?

Why ChunkOrder=3120 and Fsync=1100 always have good layout?

3 1 2 0Time

fsync()1

fsync()1 0 0

✖✔Cond 1 (is end?):

Cond 2 (is closed?):

✖ ✖ ✖

Special End?

Why ChunkOrder=3120 and Fsync=1100 always have good layout?

3 1 2 0Time

fsync()1

fsync()1 0 0

✖✔Cond 1 (is end?):

Cond 2 (is closed?):

✖ ✖ ✖

Special End? ✖ ✖ ✖✱ ✱ ✱

Special End Policy is never triggered

Why ChunkOrder=3120 and Fsync=1100 always have good layout?

3 1 2 0Time

fsync()1

fsync()1 0 0

✖✔Cond 1 (is end?):

Cond 2 (is closed?):

✖ ✖ ✖

Special End? ✖ ✖ ✖✱ ✱ ✱

Special End Policy

FixTreat non-ending and ending extents equally

Lesson Learned

Policies for different circumstances should be harmonious with one another

Scheduler Dependency

6% of d-spans are different in two repeated experiments

16384 tests

6%

Up to 44GB difference on a 64GB disk

16384 tests

6%

6% of d-spans are different in two repeated experiments

16384 tests

6%

Up to 44GB difference on a 64GB disk

16384 tests

6%

Data layout can be random

Data layouts of small files depend on OS scheduler

CPU0

Reserved Space for CPU0

Disk

flushing thread

Data layouts of small files depend on OS scheduler

CPU0

Reserved Space for CPU0

Disk

flushing thread

CPU0

Reserved Space for CPU0

Disk

CPU1

Reserved Space for CPU1

flushing thread

CPU0

Reserved Space for CPU0

Disk

CPU1

Reserved Space for CPU1

flushing thread

Scheduler DependencyFix

Choose locations based on inode number, instead of CPU id

Lesson Learned

Policies should not depend on environmental factors that may change and are outside the

control of the file system

Special End Policy

Scheduler Dependency

File Size Dependency

Target locations depend on dynamic file size

Normalization Bug

Block allocation request are not correctly adjusted

Issues found and fixed

just covered

in paper

in paper

just covered

0.00.10.20.30.40.50.60.70.80.91.0

0 8 16 24 32 40 48 56 64d−span (GB)

Cum

ulat

ive D

ensi

ty

FinalVanilla

Removing the issues significantly cuts tail size of d-span distribution

Before

After

0.00.10.20.30.40.50.60.70.80.91.0

0 8 16 24 32 40 48 56 64d−span (GB)

Cum

ulat

ive D

ensi

ty

FinalVanilla

Removing the issues significantly cuts tail size of d-span distribution

Before

After

Our fixes improve ext4’s data layout

Write

● ● ● ●●

0

500

1000

1500

2 4 8 16Number of Creating Threads

Writ

e Ti

me

(ms)

But, do our fixes reduce latencies?

Before

After

Upd

ate

Tim

e (m

s)

Write

● ● ● ●●

0

500

1000

1500

2 4 8 16Number of Creating Threads

Writ

e Ti

me

(ms)

But, do our fixes reduce latencies?

Before

After

Upd

ate

Tim

e (m

s)

Write

● ● ● ●●

0

500

1000

1500

2 4 8 16Number of Creating Threads

Writ

e Ti

me

(ms)

But, do our fixes reduce latencies?

1.4 secBefore

After

Upd

ate

Tim

e (m

s)

Write

● ● ● ●●

0

500

1000

1500

2 4 8 16Number of Creating Threads

Writ

e Ti

me

(ms)

But, do our fixes reduce latencies?

1.4 secBefore

After

Our fixes reduce tail latencies

Upd

ate

Tim

e (m

s)

Conclusions• Statistical techniques are practical • Found and fixed four allocation issues in ext4• Our fixes ▶ better layouts ▶ lower latency at a node

▶ lower latency at scale• Lessons learned

• Policies should be harmonious• Policies should not depend on environmental factors

Conclusions• Statistical techniques are practical • Found and fixed four allocation issues in ext4• Our fixes ▶ better layouts ▶ lower latency at a node

▶ lower latency at scale• Lessons learned

• Policies should be harmonious• Policies should not depend on environmental factors

Rigorous statistics will help to reduce unexpectedissues caused by intuitive but unreliable design decisions

Thanks!

Source code and data

http://research.cs.wisc.edu/adsl/Software/chopper/