Understanding Manycore Scalability of File Systems
Changwoo Min, Sanidhya Kashyap, Steffen Maass, Woonhak Kang, and Taesoo Kim


Page 1

Understanding Manycore Scalability of File Systems

Changwoo Min, Sanidhya Kashyap, Steffen Maass, Woonhak Kang, and Taesoo Kim

Page 2

Applications must parallelize I/O operations

● Death of single-core CPU scaling
  – CPU clock frequency: 3 ~ 3.8 GHz
  – # of physical cores: up to 24 (Xeon E7 v4)

● From mechanical HDD to flash SSD
  – IOPS of a commodity SSD: 900K
  – Non-volatile memory (e.g., 3D XPoint): 1,000x ↑

But file systems become a scalability bottleneck

Page 3

Problem: lack of understanding of internal scalability behavior

[Figure: Exim mail server on RAMDISK — messages/sec (0k–14k) vs. #core (0–80) for btrfs, ext4, F2FS, and XFS. Although Exim is an embarrassingly parallel application, throughput either (1) saturates, (2) collapses, or (3) never scales.]

● Intel 80-core machine: 8-socket, 10-core Xeon E7-8870
● RAM: 512GB, 1TB SSD, 7200 RPM HDD

Page 4

Even on slower storage media, the file system becomes a bottleneck

[Figure: Exim email server at 80 cores — messages/sec (0k–12k) for btrfs, ext4, F2FS, and XFS on RAMDISK, SSD, and HDD.]

Page 5

Outline

● Background
● FxMark design
  – A file system benchmark suite for manycore scalability
● Analysis of five Linux file systems
● Pilot solution
● Related work
● Summary

Page 6


Research questions

● What file system operations are not scalable?

● Why are they not scalable?

● Is it a problem of implementation or of design?

Page 7


Technical challenges

● Applications are usually stuck at a few bottlenecks

→ cannot see the next level of bottlenecks before resolving the current ones

→ difficult to understand overall scalability behavior

● How to systematically stress file systems to understand their scalability behavior?

Page 8


FxMark: evaluate & analyze manycore scalability of file systems

● FxMark: 19 micro-benchmarks, 3 applications
  – # cores: 1, 2, 4, 10, 20, 30, 40, 50, 60, 70, 80
● File systems: tmpfs (memory FS), ext4 with and without journaling (ext4/ext4NJ), XFS (journaling FS), btrfs (CoW FS), F2FS (log FS)
● Storage medium: RAMDISK, SSD, HDD

Page 9

FxMark: evaluate & analyze manycore scalability of file systems

(Same matrix as above: >4,700 experiments in total.)

Page 10


Microbenchmark: unveil hidden scalability bottlenecks

● Data block read
[Diagram: a read operation (R) performed at Low, Medium, and High sharing levels — sharing of a file, of a block, and across processes.]

Page 11

Stress different components with various sharing levels

Page 12


Evaluation

● Data block read, Low sharing level

[Figure: Mops/sec (0–250) vs. #core (0–80) for btrfs, ext4, ext4NJ, F2FS, tmpfs, and XFS — all six file systems show linear scalability.]

Page 13

Outline

● Background
● FxMark design
● Analysis of five Linux file systems
  – What are the scalability bottlenecks?
● Pilot solution
● Related work
● Summary

Page 14

Summary of results: file systems are not scalable

[Figure: results of all 19 microbenchmarks — DRBL, DRBM, DRBH, DWOL, DWOM, DWAL, DWTL, DWSL, MRPL, MRPM, MRPH, MRDL, MRDM, MWCL, MWCM, MWUL, MWUM, MWRL, MWRM, plus the O_DIRECT variants of DRBL, DRBM, DWOL, and DWOM — and 3 applications (Exim, RocksDB, DBENCH) for btrfs, ext4, ext4NJ, F2FS, tmpfs, and XFS, from 1 to 80 cores.]


Page 18

Data block read

● Low (DRBL): all file systems scale linearly.
● Medium (DRBM): XFS shows performance collapse.
● High (DRBH): all file systems show performance collapse.

[Figure: Mops/sec vs. #core for DRBL (0–250), DRBM (0–250), and DRBH (0–10).]

Page 19

Page cache is maintained for efficient access of file data

[Diagram — read-miss path in the OS kernel: 1. read a file block → 2. look up the page cache → 3. cache miss → 4. read the page from disk → 5. copy the page to the reader.]

Page 20

Page cache hit

[Diagram — read-hit path: 1. read a file block → 2. look up the page cache → 3. cache hit → 4. copy the page to the reader.]

Page 21

Page cache can be evicted to secure free memory

Page 22

… only when not being accessed

Reference counting is used to track the number of accessing tasks:

access_a_page(...) {
    atomic_inc(&page->_count);
    ...
    atomic_dec(&page->_count);
}
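The pin/unpin pattern above can be mimicked in user space with C11 atomics in place of the kernel's atomic_inc/atomic_dec. A rough sketch with hypothetical names (page_t, run_refcount_demo) — not the kernel implementation:

```c
#include <stdatomic.h>
#include <pthread.h>
#include <stddef.h>

/* Stand-in for the kernel's page->_count: every task that touches the page
 * pins it with an atomic increment and unpins it with a decrement, so the
 * eviction path can tell whether the page is in use. */
typedef struct { atomic_int _count; } page_t;

static page_t shared_page = { 0 };

static void *reader(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        atomic_fetch_add(&shared_page._count, 1);  /* pin: page in use */
        /* ... look up and copy the page ... */
        atomic_fetch_sub(&shared_page._count, 1);  /* unpin            */
    }
    return NULL;
}

/* Run nthreads concurrent readers hammering one page's counter. The result
 * is correct (every pin is matched by an unpin, so the count ends at 0),
 * but each iteration bounces the counter's cache line between cores. */
static int run_refcount_demo(int nthreads) {
    pthread_t tid[64];
    if (nthreads > 64)
        nthreads = 64;
    for (int i = 0; i < nthreads; i++)
        pthread_create(&tid[i], NULL, reader, NULL);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    return atomic_load(&shared_page._count);
}
```

The correctness of the scheme is exactly what makes it slow at scale: every reader must write the same cache line.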

Page 23

Reference counting becomes a scalability bottleneck

access_a_page(...) {
    atomic_inc(&page->_count);
    ...
    atomic_dec(&page->_count);
}

Under contention, the atomic updates of page->_count run at ~100 CPI (cycles-per-instruction), versus ~20 CPI otherwise.

[Figure: DRBH — Mops/sec vs. #core collapses as concurrent readers (R) pile onto a single page's reference counter.]

Page 24

Reference counting becomes a scalability bottleneck (cont.)

High contention on a page reference counter → huge memory stall

Many more shared objects behave the same way: the directory entry cache, the XFS inode, etc.

Page 25

Lessons learned

● High locality can cause performance collapse.
● Cache hits should be scalable → when cache hits are dominant, the scalability of the cache-hit path is what matters.

Page 26

Data block overwrite

● Low (DWOL): ext4, F2FS, and btrfs show performance collapse.
● Medium (DWOM): all file systems degrade gradually.

[Figure: Mops/sec vs. #core for DWOL (0–160) and DWOM (0–2), writes (W) at low and medium sharing levels.]

Page 27

Btrfs is a copy-on-write (CoW) file system

● Directs a write to a block to a new copy of the block
  → never overwrites the block in place
  → maintains multiple versions of a file system image

[Diagram: a write (W) at time T produces a new copy of the block, yielding the time T+1 version alongside the time T version.]
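The CoW rule can be sketched minimally: a write never modifies the old block; it allocates a fresh one, copies, applies the update, and the caller repoints its index, so the old block survives as the time-T snapshot. blocks, next_free, and cow_write are hypothetical names, not btrfs APIs.

```c
#include <string.h>

#define BLK_SZ  64
#define NBLOCKS 16

static char blocks[NBLOCKS][BLK_SZ];
static int next_free = 1;   /* block 0 holds the time-T version */

/* Returns the newly allocated block number. Note that every overwrite pays
 * for a block allocation — which is why allocation becomes the bottleneck. */
static int cow_write(int old_blk, const char *data, size_t len) {
    int new_blk = next_free++;                        /* allocate          */
    memcpy(blocks[new_blk], blocks[old_blk], BLK_SZ); /* copy old contents */
    memcpy(blocks[new_blk], data, len);               /* apply the update  */
    return new_blk;
}
```

After a write, both versions are reachable: the old block for the time-T snapshot, the new block for time T+1.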

Page 28

CoW triggers a disk block allocation for every write

[Diagram: each write (W) from time T to T+1 performs a block allocation.]

Disk block allocation becomes a bottleneck
→ similarly: ext4 → journaling, F2FS → checkpointing

Page 29

Lessons learned

● Overwriting can be as expensive as appending
  → critical in log-structured FS (F2FS) and CoW FS (btrfs)
● Consistency guarantee mechanisms should be scalable
  → scalable journaling
  → scalable CoW index structures
  → parallel log-structured writing


Page 31

Entire file is locked regardless of update range

● All tested file systems hold an inode mutex for write operations
  – Range-based locking is not implemented

***_file_write_iter(...) {
    mutex_lock(&inode->i_mutex);
    ...
    mutex_unlock(&inode->i_mutex);
}

Page 32

Lessons learned

● A file cannot be concurrently updated
  – Critical for VMs and DBMSs, which manage large files
● Need to consider techniques used in parallel file systems
  → e.g., range-based locking
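The overlap test at the heart of range-based locking can be sketched in a few lines: writers to [a_off, a_off+a_len) and [b_off, b_off+b_len) need to serialize only when the two byte ranges intersect, instead of all writers queueing on a single inode->i_mutex. ranges_conflict is a hypothetical helper, not a kernel API.

```c
#include <stdbool.h>

typedef unsigned long long u64;

/* Two half-open byte ranges conflict iff each one starts before the other
 * ends. Disjoint writers could then proceed in parallel. */
static bool ranges_conflict(u64 a_off, u64 a_len, u64 b_off, u64 b_len) {
    return a_off < b_off + b_len && b_off < a_off + a_len;
}
```

A real range-lock manager would keep the granted ranges in an interval tree and block a writer only while a conflicting range is held.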

Page 33

Summary of findings

● High locality can cause performance collapse
● Overwriting can be as expensive as appending
● A file cannot be concurrently updated
● All directory operations are sequential
● Renaming is system-wide sequential
● Metadata changes are not scalable
● Non-scalability often means wasting CPU cycles
● Scalability is not portable

See our paper for details.

Page 34

Summary of findings (cont.)

Many of these findings are unexpected and counter-intuitive
→ contention arises at the file system level to maintain data dependencies.

Page 35

Outline

● Background
● FxMark design
● Analysis of five Linux file systems
● Pilot solution
  – If we remove contention in a file system, does it become scalable?
● Related work
● Summary

Page 36

RocksDB on a 60-partitioned RAMDISK scales better

[Figure: ops/sec (0–700) vs. #core (0–80). Left: a single-partitioned RAMDISK (btrfs, ext4, F2FS, tmpfs, XFS). Right: a 60-partitioned RAMDISK — up to 2.1x higher throughput.]

** Tested workload: DB_BENCH overwrite **

Page 37

RocksDB on a 60-partitioned RAMDISK scales better (cont.)

Reduced contention on file systems helps improve performance and scalability.
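One way an application could exploit such partitioning is to hash each file onto one of the 60 independent file-system instances, so writers contend on 60 separate journals and locks rather than one. This is only an illustrative sketch; partition_for, shard_path, and the "/mnt/partN" layout are hypothetical, not what the experiment above necessarily used.

```c
#include <stdio.h>

#define NPARTS 60   /* one file system per RAMDISK partition */

/* Pick a partition by hashing the file name (djb2 string hash). */
static unsigned partition_for(const char *name) {
    unsigned h = 5381;
    while (*name)
        h = h * 33u + (unsigned char)*name++;
    return h % NPARTS;
}

/* Build the per-partition path for a file, e.g. a RocksDB SSTable. */
static void shard_path(char *out, size_t outsz, const char *name) {
    snprintf(out, outsz, "/mnt/part%u/%s", partition_for(name), name);
}
```

The mapping is deterministic, so a file is always looked up on the same partition.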

Page 38

But partitioning makes performance worse on HDD

[Figure: ops/sec (0–500) vs. #core (0–20). A single-partitioned HDD (btrfs, ext4, F2FS, XFS) vs. a 60-partitioned HDD — up to 2.7x lower throughput.]

** Tested workload: DB_BENCH overwrite **

Page 39

But partitioning makes performance worse on HDD (cont.)

Reduced spatial locality degrades performance
→ medium-specific characteristics (e.g., spatial locality) should be considered.

Page 40

Related work

● Scaling operating systems
  – Mostly use a memory file system to factor out the effect of I/O operations
● Scaling file systems
  – Scalable file system journaling: ScaleFS [MIT:MSThesis'14], SpanFS [ATC'15]
  – Parallel log-structured writing on NVRAM: NOVA [FAST'16]

Page 41

Summary

● Comprehensive analysis of the manycore scalability of five widely-used file systems using FxMark
● Manycore scalability should be of utmost importance in file system design
● New challenges in scalable file system design
  – Minimizing contention, scalable consistency guarantees, spatial locality, etc.
● FxMark is open source: https://github.com/sslab-gatech/fxmark