systematically testing file-system crash consistency€¦ · systematically testing file-system...

53
Systematically Testing File-System Crash Consistency Jayashree Mohan , Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju, Vijay Chidambaram [Published at OSDI 2018]

Upload: others

Post on 05-Aug-2020

22 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

Systematically Testing File-System Crash Consistency

Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju, Vijay Chidambaram

[Published at OSDI 2018]

Page 2: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

Crashes

This is very

important…File saved!

I crashed ☹

2

Image source : https://www.fotolia.com

Page 3: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

I wish filesystems

were crash-consistent!

3

Page 4: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

Rename atomicity bug in btrfs

Memory

Storage

4

Page 5: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

Rename atomicity bug in btrfs

A

Memory

Storage

mkdir (A)

5

Page 6: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

Rename atomicity bug in btrfs

A

bar

Memory

Storage

mkdir (A)

touch (A/bar)

6

Page 7: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

Rename atomicity bug in btrfs

A

bar

Memory

Storage

A

bar

mkdir (A)

touch (A/bar)

fsync (A/bar)

7

Page 8: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

Rename atomicity bug in btrfs

A

bar

B

Memory

Storage

A

bar

mkdir (A)

touch (A/bar)

fsync (A/bar)

mkdir (B)

8

Page 9: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

Rename atomicity bug in btrfs

A

bar

B

barMemory

Storage

A

bar

mkdir (A)

touch (A/bar)

fsync (A/bar)

mkdir (B)

touch (B/bar)

9

Page 10: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

Rename atomicity bug in btrfs

A

bar

B

Memory

Storage

A

bar

mkdir (A)

touch (A/bar)

fsync (A/bar)

mkdir (B)

touch (B/bar)

rename (B/bar, A/bar)

10

Page 11: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

Rename atomicity bug in btrfs

A

bar

foo

B

Memory

Storage

A

bar

mkdir (A)

touch (A/bar)

fsync (A/bar)

mkdir (B)

touch (B/bar)

rename (B/bar, A/bar)

touch (A/foo)

11

Page 12: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

Rename atomicity bug in btrfs

A

bar

foo

B

Memory

Storage

A

bar

foo

mkdir (A)

touch (A/bar)

fsync (A/bar)

mkdir (B)

touch (B/bar)

rename (B/bar, A/bar)

touch (A/foo)

fsync (A/foo)

12

Page 13: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

Rename atomicity bug in btrfs

A

bar

foo

B

Memory

Storage

A

bar

foo

Expected

mkdir (A)

touch (A/bar)

fsync (A/bar)

mkdir (B)

touch (B/bar)

rename (B/bar, A/bar)

touch (A/foo)

fsync (A/foo)

CRASH!

13

Page 14: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

Rename atomicity bug in btrfs

A

bar

foo

B

Memory

Storage

A

bar

foo

Expected

A

foo

Actual

Persisted file A/bar

missing

mkdir (A)

touch (A/bar)

fsync (A/bar)

mkdir (B)

touch (B/bar)

rename (B/bar, A/bar)

touch (A/foo)

fsync (A/foo)

CRASH!

14

Page 15: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

Rename atomicity bug in btrfs

A

bar

foo

B

Memory

Storage

A

bar

foo

Expected

A

foo

Actual

Persisted file A/bar

missing

mkdir (A)

touch (A/bar)

fsync (A/bar)

mkdir (B)

touch (B/bar)

rename (B/bar, A/bar)

touch (A/foo)

fsync (A/foo)

CRASH!

15

Exists in the kernel since 2014!

Found by ACE and CrashMonkey

Page 16: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

Testing Crash Consistency Today

• State of the Art : xfstest suite • Collection of 482 regression tests

Only 5% of tests in xfstest check for file system crash consistency16

Annotate filesystems

Hard to do for existing FS

Verified

Filesystems• Build FS from scratch Model

Checking

Page 17: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

Challenges with systematic testing

17

Infinite workload

space

ChallengesLack of

automated

infrastructure

Our work addresses both these issues, to provide a

systematic testing framework

Systematically generate workloads

Page 18: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

Bounded Black-Box Crash Testing (B3)

• Focus on reproducible bugs resulting in metadata corruption, data loss.

• Found 10 new bugs across btrfs and F2FS;

• Found 1 bug in FSCQ (verified file system)

New approach to testing file-system crash consistency

18www.github.com/utsaslab/crashmonkey

Target Filesystem

Output:

Bug report with workload, expected state, actual state

CrashMonkey

Workload 1 Workload n…

Bounds:

(length, operations, args)

Automatic Crash Explorer(ACE)

Page 19: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

CrashMonkey and Ace : Features

19

Fully automatedCompletely

black-box

File system

agnosticFinds real bugs

Page 20: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

CrashMonkey and Ace : Features

20

Fully automatedCompletely black-

box

File system

agnosticFinds real bugs

No manual

checkers and

workloads

Page 21: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

CrashMonkey and Ace : Features

21

Fully automatedCompletely

black-box

File system

agnosticFinds real bugs

No annotations

Doesn’t need to

look at file

system code

Page 22: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

CrashMonkey and Ace : Features

22

Fully automatedCompletely black-

box

File system

agnosticFinds real bugs

Works on any

POSIX file

system,

including verified

FS

Page 23: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

CrashMonkey and Ace : Features

23

Fully automatedCompletely black-

box

File system

agnosticFinds real bugs

Acknowledged

and patched by

kernel

developers

Page 24: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

Outline

• CrashMonkey

• Bounded Black Box Crash Testing

• Automatic Crash Explorer (ACE)

• Results

24

Page 25: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

Challenges with systematic testing

25

Infinite workload

space

ChallengesLack of

automated

infrastructure

Page 26: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

CrashMonkey

26

• Efficient infrastructure to record and replay block level IO requests

• Simulate crash at different points in the workload

• Automatically test for consistency after crash.

• Copy-on-write RAM block device

Page 27: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

CrashMonkey in Action

27

Final FS stateInitial FS state

Workload

IO due to workload

Persistence point

Page 28: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

28

CrashMonkey in Action

Initial FS state

Workload

IO due to workload

Persistence point

Page 29: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

29

Phase 1 : Record IO

Initial FS state Oracle

Record IO up to persistence point

Safely unmount

Workload

IO due to workload

Persistence point

IO forced by unmount

Page 30: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

30

Phase 2 : Replay IO

Initial FS state Oracle

Initial FS state Crash State

Record IO up to persistence point

Safely unmount

Replay IO up to persistence point

Workload

IO due to workload

Persistence point

IO forced by unmount

Page 31: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

31

Phase 3 : Test for consistency

Initial FS state Oracle

Initial FS state Crash State

Auto

Checker

Record IO up to persistence point

Safely unmount

Replay IO up to persistence point

Workload

IO due to workload

Persistence point

IO forced by unmount

After recovery

Page 32: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

32

Initial FS state Oracle

Initial FS state Crash State

Auto

Checker

Bug

Report

Record IO up to persistence point

Safely unmount

Replay IO up to persistence point

Phase 3 : Test for consistency

Workload

IO due to workload

Persistence point

IO forced by unmount

Page 33: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

Challenges with Systematic Testing

33

ChallengesLack of

automated

infrastructure

Infinite

workload

space

So Far…

• Given a workload compliant to POSIX API, we saw how CrashMonkey generates crash states and automatically tests for consistency

CrashMonkey

Page 34: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

Challenges with Systematic Testing

34

So Far…

• Given a workload compliant to POSIX API, we saw how CrashMonkey generates crash states and automatically tests for consistency

• Next question : How to automatically generate workloads in an the infinite workload space?

ChallengesLack of

automated

infrastructure

Infinite

workload

space

CrashMonkey

Page 35: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

Exploring the infinite workload space

Challenges:

• Infinite length of workloads

• Large set of filesystem operations

• Infinite parameter options (file/directory names, depth)

• Infinite options for initial filesystem state

• When in the workload to simulate a crash?

35

Page 36: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

Outline

• CrashMonkey

• Bounded Black Box Crash Testing

• Automatic Crash Explorer (ACE)

• Results

36

Page 37: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

B3 : Bounded Black Box Crash Testing

37

Length of workloads

Initial FS state

Arguments to system calls

Page 38: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

B3 : Bounded Black Box Crash Testing

38

Length of workloads

Initial FS state

Arguments to system calls

Page 39: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

B3 : Bounded Black Box Crash Testing

39

Length of workloads

Initial FS state

Arguments to system calls

Image source: https://en.wikipedia.org/wiki/Cube

Page 40: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

B3 : Bounded Black Box Crash Testing

40

Length of workloads

Initial FS state

Arguments to system calls

Page 41: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

B3 : Bounded Black Box Crash Testing

41

Length of workloads

Initial FS state

Arguments to system calls

Page 42: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

B3 : Bounded Black Box Crash Testing

42

Length of workloads

Initial FS state

Arguments to system calls

Page 43: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

B3 : Bounded Black Box Crash Testing

Choice of crash point

• Only after fsync(), fdatasync() or sync()

• Not in the middle of system call

43

mkdir (A)

touch (A/bar)

fsync (A/bar)

mkdir (B)

touch (B/bar)

rename (B/bar, A/bar)

touch (A/foo)

fsync (A/foo)

Crash Point 1

Crash Point 2

• Developers are motivated to patch

bugs that break semantics of

persistence operations

• Crashing in the middle of system

calls leads to exponentially large

crash-states.

Page 44: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

Limitations of B3

• No guarantee of finding all crash-consistency bugs in a filesystem

• Assumes the correct working of crash-consistency mechanism like journaling or CoW• Does not crash in the middle of system calls

• Can only reveal if a bug has occurred, not the reason or origin of bug.

• Needs larger compute to test higher sequence lengths

44

Page 45: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

Outline

• CrashMonkey

• Bounded Black Box Crash Testing

• Automatic Crash Explorer (ACE)

• Results

45

Page 46: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

Bounds chosen by ACE

46

Length of workloads

Initial FS state

Arguments to system calls

Bounds picked based on

insights from the study of

crash-consistency bugs

reported on Linux file

systems over the last 5

years

Page 47: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

Bounds chosen by ACE

47

Length of workloads

Initial FS state

Arguments to system calls

Maximum # core ops is 3

Page 48: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

Bounds chosen by ACE

48

Length of workloads

Initial FS state

Arguments to system calls

Maximum # core ops is 3

RootA

B

(foo, bar)

(foo, bar)

Overwrites to start, middle,

end of a file and append

Page 49: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

Bounds chosen by ACE

49

Length of workloads

Initial FS state

Arguments to system calls

RootA

B

(foo, bar)

(foo, bar)

Overwrites to start,

middle, end and append

Maximum # core ops is 3

New, 100MB FS

Page 50: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

Challenges with Systematic Testing

50

ChallengesLack of

automated

infrastructure

Infinite

workload

space

CrashMonkey ACE

Bounded Black-Box Testing

Page 51: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

Outline

• CrashMonkey

• Bounded Black Box Crash Testing

• Automatic Crash Explorer (ACE)

• Results

51

Page 52: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

Results

• Reproduced 24/26 known bugs across ext4, btrfs and F2FS

• Found 10 new bugs across btrfs and F2FS

• Found 1 bug in a verified file system, FSCQ

52

Sequence Length # workloads # Bugs Reproduced # Bugs found

1, 2, 3 3.37M 24 10

Page 53: Systematically Testing File-System Crash Consistency€¦ · Systematically Testing File-System Crash Consistency Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju,

Bounded Black-Box Crash Testing

Try our tools : https://github.com/utsaslab/crashmonkey53

• B3 makes exhaustive testing feasible using informed bound selection

• Easily generalizable to test larger workloads if more compute is available

• Found 10 new bugs across btrfs and F2FS, most of which existed since 2014

• Found 1 bug in FSCQ

Thanks!Questions?