
Page 1: Concurrent Write Sharing: Overcoming the Bane of File Systems

Concurrent Write Sharing: Overcoming the Bane of File Systems

Garth Gibson
Professor, School of Computer Science, Carnegie Mellon University
Co-Founder and Chief Scientist, Panasas Inc.

June 19, 2013
www.pdl.cmu.edu

Page 2: Concurrent Write Sharing: Overcoming the Bane of File Systems

Agenda
•  Simple file systems don't do write sharing
•  HPC checkpointing: N-1 versus N-N concurrent write
   •  N-1 has usability advantages & performance challenges
•  PLFS: Parallel Log-structured File System
   •  Library represents a file as many logs of written values
   •  In production at Los Alamos, showing good benefits for important apps, brilliant benefits for benchmarks
   •  Eliminates write size & alignment problems
   •  Read performance doesn't suffer as one might expect
      •  Index importing needs a parallel implementation
   •  Backend abstracted to enable use of unusual storage, such as HPC on Hadoop HDFS


Page 3: Concurrent Write Sharing: Overcoming the Bane of File Systems

[Figure: N-N file I/O (one file per process) versus N-1 file I/O (all processes share one file)]

Page 4: Concurrent Write Sharing: Overcoming the Bane of File Systems

File Systems Full of Locks for Consistency

[Figure: a parallel file striped across RAID Groups 1-3; Processes 1-4 each write an interleaved run of blocks (e.g. Process 1 writes blocks 11-14) into the shared file, so neighboring writes from different processes land in the same RAID groups]
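
To make the lock contention concrete, here is a minimal sketch of the strided N-1 pattern in the figure (the file name, block size, and counts are made up for illustration); run for real across concurrent processes on a parallel file system, these interleaved writes serialize on the distributed locks the title refers to:

```c
/* n1_stride.c -- hypothetical sketch of the N-1 strided write pattern.
 * Each "process" (here simulated by a rank loop) owns every NPROCS-th
 * block of the shared file, so adjacent ranks write into the same stripes. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLOCK   47001   /* deliberately unaligned block size */
#define NBLOCKS 4
#define NPROCS  4

static void checkpoint_n1(int fd, int rank)
{
    char buf[BLOCK];
    memset(buf, 'A' + rank, sizeof buf);
    for (int b = 0; b < NBLOCKS; b++) {
        /* block b of rank r lands at a rank-interleaved offset; on a
         * real parallel FS these writes cross stripe and lock
         * boundaries and serialize on distributed lock traffic */
        off_t off = ((off_t)b * NPROCS + rank) * BLOCK;
        pwrite(fd, buf, BLOCK, off);
    }
}

int main(void)
{
    int fd = open("ckpt.n1", O_CREAT | O_WRONLY, 0644);
    if (fd < 0) { perror("open"); return 1; }
    for (int r = 0; r < NPROCS; r++)   /* ranks run concurrently in real life */
        checkpoint_n1(fd, r);
    close(fd);
    return 0;
}
```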

Page 5: Concurrent Write Sharing: Overcoming the Bane of File Systems

N-1 Concurrent Write Often Not Scalable

(Cross-graph comparisons are not meaningful.)

                N-N write BW    N-1 write BW    N-N / N-1
   LANL GPFS    4.5 GB/s        100 MB/s        45X
   PSC Lustre   3.3 GB/s        10 MB/s         330X
   LANL PanFS   1.8 GB/s        20 MB/s         90X

Page 6: Concurrent Write Sharing: Overcoming the Bane of File Systems

N-1 versus N-N Checkpointing
•  N-N writing is easier for lock-happy file systems
•  But many users prefer N-1 checkpoints
   •  Prefer to manage 1 file rather than thousands or more
   •  Can't avoid the mapping anyway because of N-M restarts
   •  >50% of LANL cycles use N-1 checkpointing
   •  2 of 8 open-science apps written for Roadrunner
   •  At least 8 of 23 parallel I/O benchmarks, including BTIO, FLASH IO, Chombo, QCD
•  Some programmers switch, but many don't
   •  One app wrote 10K lines of code (bulkio) to "fix" it


Page 7: Concurrent Write Sharing: Overcoming the Bane of File Systems

N-1 Write-Optimization via Log-Structuring
•  The 1991 LFS paper "write optimized" seeks during writes (instead of during reads)
•  Multiple projects emphasized eliminating seeks for checkpoint capture
   •  PSC Zest: a "write where the head is" checkpoint file system
   •  ADIOS file-formatting library uses delayed write
   •  PVFS experiments embedded log ordering
•  In retrospect, log-structured writing is not as important as decoupling file system locks


Page 8: Concurrent Write Sharing: Overcoming the Bane of File Systems

Parallel Log-structured File System
•  Open-source library for FUSE and MPI-IO
   •  http://github.com/plfs (released v2.4 this week)
•  Part of the FAST FORWARD plan for Exascale HPC
•  Big team centered on Los Alamos National Laboratory
   •  Gary Grider, Aaron Torres, Brett Kettering, Alfred Torrez, David Shrader, David Bonnie, John Bent, Sorin Faibish, Percy Tzelnic, Uday Gupta, William Tucker, Jun He, Carlos Maltzahn, Chuck Cranor
•  This talk draws heavily on LA-UR-11-11964, SC09, PDSW09, Cluster12, CMU-PDL-12-115
   •  Jun He in HPDC13 today (index management)
   •  Other papers in PDSW12, DISCS12, 2x MSST12


Page 9: Concurrent Write Sharing: Overcoming the Bane of File Systems

PLFS Virtual Layer

[Figure: processes on host1, host2, and host3 write the logical file /foo; PLFS stores it as a container directory on the physical underlying parallel file system, with one subdirectory per host holding a data log per writer (IDs 131, 132, 279, 281, 132, 148) plus an index:]

/foo/
   host1/: data.131, data.132, index
   host2/: data.279, data.281, index
   host3/: data.132, data.148, index

PLFS Decouples Logical from Physical
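
The decoupling can be sketched in a few lines (my own illustration with hypothetical names, not PLFS's actual API): each writer appends to a private data log and records, in a private index, where each logical extent landed, so no two writers ever share a physical file or a lock:

```c
/* plfs_sketch.c -- hypothetical per-writer log + index, illustrating how
 * one logical N-1 file decouples into private append-only logs. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

struct index_entry {        /* one fixed-size record per logical write */
    uint64_t logical_off;   /* offset in the logical shared file */
    uint64_t length;        /* number of bytes */
    uint64_t log_off;       /* where the bytes landed in this writer's log */
};

struct writer {
    int data_fd, index_fd;  /* private data log and its index */
    uint64_t log_tail;      /* next append position in the data log */
};

static int writer_open(struct writer *w, const char *container, int id)
{
    char path[256];
    snprintf(path, sizeof path, "%s/data.%d", container, id);
    w->data_fd = open(path, O_CREAT | O_WRONLY, 0644);
    snprintf(path, sizeof path, "%s/index.%d", container, id);
    w->index_fd = open(path, O_CREAT | O_WRONLY | O_APPEND, 0644);
    w->log_tail = 0;
    return (w->data_fd < 0 || w->index_fd < 0) ? -1 : 0;
}

/* Any logical offset becomes a sequential append plus an index record,
 * so writers never contend for the same physical blocks or locks. */
static ssize_t log_write(struct writer *w, const void *buf,
                         uint64_t len, uint64_t logical_off)
{
    struct index_entry e = { logical_off, len, w->log_tail };
    ssize_t n = pwrite(w->data_fd, buf, len, w->log_tail);
    if (n < 0) return -1;
    w->log_tail += (uint64_t)n;
    write(w->index_fd, &e, sizeof e);   /* index is append-only too */
    return n;
}

int main(void)
{
    struct writer w;
    if (writer_open(&w, ".", 131) < 0) { perror("open"); return 1; }
    /* strided logical writes, as in an N-1 checkpoint */
    log_write(&w, "AAAA", 4, 0);
    log_write(&w, "BBBB", 4, 16);
    close(w.data_fd); close(w.index_fd);
    return 0;
}
```

Reconstructing the logical file is then a matter of consulting the indexes, which is what the read path later in the talk does.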

Page 10: Concurrent Write Sharing: Overcoming the Bane of File Systems

N-1 @ N-N BW in a Simple Test

[Figure: with PLFS, N-1 write bandwidth reaches N-N levels on GPFS, PSC Lustre, and LANL PanFS, with speedups of 8X, 25X, and 100X, up to 3000X at scale, and a peak of 31 GB/s; note the CU boundaries and I/O-node limits]

Page 11: Concurrent Write Sharing: Overcoming the Bane of File Systems

FLASH IO on Roadrunner

[Figure: PLFS speeds up FLASH IO on Roadrunner by 150X]

Page 12: Concurrent Write Sharing: Overcoming the Bane of File Systems

LANL1: 2X Better than a Hand-Tuned Library

[Figure: 2X and 4X speedups]

Page 13: Concurrent Write Sharing: Overcoming the Bane of File Systems

LANL3: More Success in Production

[Figure: production checkpoint bandwidth improves from 300 MB/s to 6.0 GB/s (20X) and 8.5 GB/s (28X?)]

Page 14: Concurrent Write Sharing: Overcoming the Bane of File Systems

PLFS Checkpoint BW Summary

[Figure: PLFS checkpoint-bandwidth speedups across applications and benchmarks: 2X, 5X, 7X, 23X, 28X, 83X, and 150X]

Page 15: Concurrent Write Sharing: Overcoming the Bane of File Systems

Does Not Suffer "Alignment" Preferences

[Figure: write bandwidth with and without PLFS for stripe-aligned, block-aligned, and unaligned writes; without PLFS, bandwidth depends strongly on alignment, while with PLFS it does not]

Page 16: Concurrent Write Sharing: Overcoming the Bane of File Systems

Distribute over Federated Backends

[Figure: the same /foo container as before, but the host subdirectories with their data logs and indexes are now spread across federated backend volumes /Vol1, /Vol2, and /Vol3 of the physical underlying parallel file system]
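
One plausible way to implement the distribution in the figure (a sketch under my own assumptions; the volume names follow the figure, the hash choice is mine): hash each log's path to pick a backend volume, so data and metadata load spread across all backends:

```c
/* backend_hash.c -- hypothetical placement of per-writer logs over
 * federated backend volumes by hashing the log name. */
#include <stdint.h>
#include <stdio.h>

static const char *backends[] = { "/Vol1", "/Vol2", "/Vol3" };
#define NBACKENDS (sizeof backends / sizeof backends[0])

/* FNV-1a: a simple, well-known string hash */
static uint64_t fnv1a(const char *s)
{
    uint64_t h = 1469598103934665603ULL;
    for (; *s; s++) { h ^= (unsigned char)*s; h *= 1099511628211ULL; }
    return h;
}

/* Choose the volume that stores one writer's data log; with many logs,
 * the load spreads over every backend's data and metadata servers. */
static void log_path(char *out, size_t n, const char *file, int writer_id)
{
    char name[128];
    snprintf(name, sizeof name, "%s/data.%d", file, writer_id);
    snprintf(out, n, "%s/%s", backends[fnv1a(name) % NBACKENDS], name);
}

int main(void)
{
    char p[256];
    for (int id = 131; id <= 134; id++) {
        log_path(p, sizeof p, "foo", id);
        printf("%s\n", p);
    }
    return 0;
}
```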

Page 17: Concurrent Write Sharing: Overcoming the Bane of File Systems

Read Bandwidth

[Figure: read bandwidth for uniform-restart, non-uniform-restart, and archiving workloads]

•  Writers wrote increasing addresses
•  Readers read increasing addresses
•  The result is a mergesort with deep prefetch (one stream per log)
•  So write-optimized is also read-optimized!
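
The mergesort the slide describes can be sketched as follows (hypothetical in-memory data; a real reader would stream index entries from the container): each log's index is already sorted by logical offset, so a sequential read repeatedly takes the stream with the smallest next entry, one prefetch stream per log:

```c
/* index_merge.c -- hypothetical k-way merge of per-log index entries,
 * as a sequential reader of a PLFS-style file would perform it. */
#include <stdint.h>
#include <stdio.h>

struct entry { uint64_t logical_off, length, log_off; };

/* Each writer's index is already sorted by logical offset because that
 * writer itself wrote increasing addresses. */
static struct entry log1[] = { {0, 4, 0}, {16, 4, 4} };
static struct entry log2[] = { {4, 4, 0}, {20, 4, 4} };
static struct entry log3[] = { {8, 8, 0} };

struct stream { struct entry *e; int n, pos; };

int main(void)
{
    struct stream s[3] = { {log1, 2, 0}, {log2, 2, 0}, {log3, 1, 0} };
    for (;;) {
        int min = -1;
        /* pick the stream whose next entry has the lowest logical offset;
         * a heap would do this in O(log k) time for many logs */
        for (int i = 0; i < 3; i++)
            if (s[i].pos < s[i].n &&
                (min < 0 || s[i].e[s[i].pos].logical_off <
                            s[min].e[s[min].pos].logical_off))
                min = i;
        if (min < 0) break;   /* all streams drained */
        struct entry *e = &s[min].e[s[min].pos++];
        printf("read log %d at %llu for %llu bytes -> logical %llu\n",
               min + 1, (unsigned long long)e->log_off,
               (unsigned long long)e->length,
               (unsigned long long)e->logical_off);
    }
    return 0;
}
```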

Page 18: Concurrent Write Sharing: Overcoming the Bane of File Systems

Open for Read Does N-Squared Index Reading (naively, every one of N opening processes reads all N index files)

Page 22: Concurrent Write Sharing: Overcoming the Bane of File Systems

Read optimizations
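
The slides don't spell out the parallel index construction here (see Jun He's HPDC13 work cited earlier); as one hedged illustration of the idea, the readers can divide the index files round-robin so each of P readers parses only N/P of the N indexes, then exchange the parsed entries:

```c
/* index_divide.c -- hypothetical round-robin division of index reading.
 * Naively, each of N opening processes reads all N index files, O(N^2)
 * reads in total.  If reader r instead reads files r, r+P, r+2P, ... and
 * the readers then exchange entries, total reads drop to O(N). */
#include <stdio.h>

#define NLOGS 8   /* index files in the container */

static void my_share(int rank, int nreaders)
{
    for (int f = rank; f < NLOGS; f += nreaders)
        printf("reader %d parses index.%d\n", rank, f);
    /* in an MPI job the parsed entries would then be exchanged, e.g.
     * with an allgather, so every reader sees the full index */
}

int main(void)
{
    for (int r = 0; r < 4; r++)   /* simulate 4 readers */
        my_share(r, 4);
    return 0;
}
```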

Page 23: Concurrent Write Sharing: Overcoming the Bane of File Systems

Abstracted Backend Storage
•  PLFS uses abstracted backend storage
•  E.g., an HPC app with PLFS can run on a cloud with non-POSIX HDFS as the native file system (see the sketch below)
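
A minimal sketch of what such an abstraction can look like (hypothetical names, far simpler than PLFS's real backend interface): PLFS-style code calls through an operations table, a POSIX backend wraps the usual system calls, and an HDFS backend could supply the same functions via libhdfs:

```c
/* iostore_sketch.c -- hypothetical backend abstraction: file-system code
 * calls through an operations table instead of POSIX directly, so a
 * non-POSIX store such as HDFS can be slotted in behind the same calls. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

struct iostore {
    int  (*open_)(const char *path, int flags, int mode);
    long (*write_)(int h, const void *buf, unsigned long n);
    int  (*close_)(int h);
};

/* POSIX implementation: thin wrappers over the usual system calls */
static int  posix_open(const char *p, int f, int m) { return open(p, f, m); }
static long posix_write(int h, const void *b, unsigned long n)
{ return (long)write(h, b, n); }
static int  posix_close(int h) { return close(h); }

static const struct iostore posix_store = {
    posix_open, posix_write, posix_close
};
/* an HDFS backend would supply the same three functions implemented
 * with libhdfs (hdfsOpenFile, hdfsWrite, hdfsCloseFile) */

int main(void)
{
    const struct iostore *io = &posix_store;  /* chosen by configuration */
    int h = io->open_("backend.demo", O_CREAT | O_WRONLY, 0644);
    if (h < 0) { perror("open"); return 1; }
    io->write_(h, "hello\n", 6);
    io->close_(h);
    return 0;
}
```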


Page 24: Concurrent Write Sharing: Overcoming the Bane of File Systems

PLFS Summary and Futures
•  N-1 checkpoints feature intensive concurrent writing
•  File systems choke on consistency-preserving locks
•  PLFS decouples concurrency with per-processor logs
   •  Order-of-magnitude and larger wins for taking checkpoints
   •  Insensitive to ideal write sizes or alignments
   •  Typical reading is also faster because it mergesorts the logs
   •  Index construction on read can be parallelized
•  Ongoing work with PLFS
   •  N-N checkpointing benefits from hashing logs over federated backend storage systems, an easy way to scalable metadata throughput
   •  Burst buffers using NAND flash for faster checkpoints can be managed by PLFS
   •  In-burst-buffer local processing needs exposed structure
