Tuning HDF5 Subfiling Performance on Parallel File Systems
Suren Byna, CRD, LBNL; Mohamad Chaarawi, Intel; Quincey Koziol, NERSC, LBNL; John Mainzer and Frank Willmore, The HDF Group


Page 1: Tuning HDF5 Subfiling Performance on Parallel File Systems

Suren Byna, CRD, LBNL
Mohamad Chaarawi, Intel
Quincey Koziol, NERSC, LBNL
John Mainzer and Frank Willmore, The HDF Group

Page 2: Tuning HDF5 Subfiling Performance on Parallel File Systems

Outline of the talk
•  Background to parallel I/O
•  HDF5 basics
•  HDF5 subfiling implementation
•  Evaluation of HDF5 subfiling:
   –  Edison to the /scratch3 file system
   –  Edison to the cscratch file system on Cori
   –  Cori Haswell to the cscratch file system on Cori
   –  Cori Haswell to the DataWarp burst buffer on Cori

Page 3: Tuning HDF5 Subfiling Performance on Parallel File Systems

Scientific applications and massive data
§  Simulations
   –  Multi-physics (FLASH) – 10 PB
   –  Cosmology (NyX) – 10 PB
   –  Plasma physics (VPIC) – 1 PB
§  Experimental and observational data
   –  High energy physics (LHC) – 100 PB
   –  Cosmology (LSST) – 60 PB
   –  Genomics – 100 TB to 1 PB
§  Scientific applications rely on efficient access to data
   –  Storage and I/O are critical requirements of HPC
[Images: FLASH, NyX, VPIC, LHC, LSST, Genomics]

Page 4: Tuning HDF5 Subfiling Performance on Parallel File Systems

Trends – Storage system transformation
[Diagram of three storage hierarchies:
  Conventional: Memory – (I/O gap) – Parallel file system (Lustre, GPFS) – Archival storage (HPSS tape)
  Current, e.g. Cori @ NERSC: Memory – (I/O gap) – Shared burst buffer – Parallel file system (Lustre, GPFS) – Archival storage (HPSS tape)
  Upcoming, e.g. Aurora @ ALCF: Memory – Node-local storage – Shared burst buffer – Parallel file system – Campaign storage – Archival storage (HPSS tape)]
•  The I/O performance gap in HPC storage is a significant bottleneck because of slow disk-based storage
•  SSDs and new memory technologies are trying to fill the gap, but they increase the depth of the storage hierarchy

Page 5: Tuning HDF5 Subfiling Performance on Parallel File Systems

Parallel I/O stack
[Diagram: Applications → High-level I/O library (HDF5, NetCDF, ADIOS) → I/O middleware (MPI-IO) → I/O forwarding → Parallel file system (Lustre, GPFS, ...) → I/O hardware]
•  I/O libraries
   –  HDF5, ADIOS, PnetCDF, NetCDF-4
•  Middleware
   –  POSIX-IO, MPI-IO
   –  I/O forwarding
•  File systems
   –  Lustre
   –  GPFS
   –  Cray DataWarp (for burst buffers)
•  I/O hardware
   –  Disk-based
   –  SSD-based

Page 6: Tuning HDF5 Subfiling Performance on Parallel File Systems

Parallel I/O – Application view
Types of parallel I/O (a minimal collective-write sketch follows this slide):
•  1 writer/reader, 1 file
•  N writers/readers, N files (file-per-process)
•  N writers/readers, 1 file
•  M writers/readers, 1 file
   –  Aggregators
   –  Two-phase I/O
•  M aggregators, M files (file-per-aggregator)
   –  Variations of this mode
[Diagrams: processes P0 ... Pn writing to one shared file, to n files (file.0 ... file.n), or to m files (file.0 ... file.m) for each mode]
Source: http://www.erdc.hpc.mil/docs/Tips/garnet-lustre-adios.pdf
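For the "N writers/readers, 1 file" mode, a minimal MPI-IO sketch is shown below. It is an illustration, not code from the talk; the file name and block size are assumptions. Each rank writes its own contiguous block into one shared file with a collective call, which lets the MPI-IO layer apply two-phase aggregation:

    /* Minimal sketch: N writers, 1 shared file, via MPI-IO collective writes. */
    #include <mpi.h>
    #include <stdlib.h>

    #define ELEMS_PER_RANK 1024              /* illustrative block size */

    int main (int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init (&argc, &argv);
        MPI_Comm_rank (MPI_COMM_WORLD, &rank);
        MPI_Comm_size (MPI_COMM_WORLD, &nprocs);

        /* Fill a local buffer with rank-tagged values. */
        int *buf = malloc (ELEMS_PER_RANK * sizeof (int));
        for (int i = 0; i < ELEMS_PER_RANK; i++)
            buf[i] = rank;

        /* All ranks open the same shared file. */
        MPI_File fh;
        MPI_File_open (MPI_COMM_WORLD, "shared_file.dat",
                       MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Collective write: rank r writes at offset r * block size, so the
         * MPI-IO layer can aggregate requests onto a few aggregator ranks.  */
        MPI_Offset offset = (MPI_Offset) rank * ELEMS_PER_RANK * sizeof (int);
        MPI_File_write_at_all (fh, offset, buf, ELEMS_PER_RANK, MPI_INT,
                               MPI_STATUS_IGNORE);

        MPI_File_close (&fh);
        free (buf);
        MPI_Finalize ();
        return 0;
    }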

Page 7: Tuning HDF5 Subfiling Performance on Parallel File Systems

Parallel I/O – System view
•  Parallel file systems
   –  Lustre and GPFS
•  Typical building blocks of parallel file systems
   –  Storage hardware – HDD or SSD RAID
   –  Storage servers
   –  Metadata servers
   –  Client-side processes and interfaces
•  Management
   –  Stripe files for parallelism (see the striping sketch after this slide)
   –  Tolerate failures
[Diagram: logical view of a file vs. physical view on a parallel file system, striped across OST 0 – OST 3 over the communication network]
Sources:
https://www.nersc.gov/users/data-and-file-systems/i-o-resources-for-scientific-applications/
https://fs.hlrs.de/projects/craydoc/docs/books/S-2490-40/html-S-2490-40/chapter-4m6xftd3-oswald-basicsoflustreparallelio.html
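As an aside, striping can be requested from application code by passing Lustre hints to the MPI-IO layer when an HDF5 file is created. The sketch below is an illustration, not code from the talk; whether the "striping_factor" and "striping_unit" hints are honored depends on the MPI-IO implementation (e.g., Cray MPICH/ROMIO on Lustre), and striping only takes effect when the file is first created. The values shown (64 stripes of 32 MB) mirror settings explored later in the talk:

    /* Hedged sketch: request a Lustre stripe count and stripe size through
     * MPI-IO hints when creating an HDF5 file with the MPI-IO driver.        */
    #include <hdf5.h>
    #include <mpi.h>

    hid_t create_striped_file (const char *filename)
    {
        MPI_Info info;
        MPI_Info_create (&info);
        MPI_Info_set (info, "striping_factor", "64");        /* stripe count  */
        MPI_Info_set (info, "striping_unit",   "33554432");  /* 32 MB stripes */

        /* File access property list using the MPI-IO VFD with the hints above. */
        hid_t fapl_id = H5Pcreate (H5P_FILE_ACCESS);
        H5Pset_fapl_mpio (fapl_id, MPI_COMM_WORLD, info);

        hid_t fid = H5Fcreate (filename, H5F_ACC_TRUNC, H5P_DEFAULT, fapl_id);

        H5Pclose (fapl_id);
        MPI_Info_free (&info);
        return fid;     /* caller closes with H5Fclose() */
    }

An equivalent approach is to set striping on the target directory beforehand with the lfs setstripe command.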

Page 8: Tuning HDF5 Subfiling Performance on Parallel File Systems

Hierarchical Data Format v5 (HDF5)
(Slide courtesy of John Shalf)
HDF5 is a data model, file format, and I/O library.
[Diagram: "/" (root) group containing "Dataset0" and "Dataset1" (each with a datatype and dataspace), a "subgrp" group holding "Dataset0.1" and "Dataset0.2", and the attributes "time"=0.2345, "validity"=None, "author"=JohnDoe]
•  Groups
   –  Arranged in a directory hierarchy
   –  The root group is always "/"
•  Datasets
   –  Dataspace
   –  Datatype
•  Attributes
   –  Bind to groups and datasets
•  References
•  Flexibility to design and implement data models
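To make these objects concrete, here is a minimal serial HDF5 sketch (an illustration with assumed names and sizes, not from the talk) that creates a dataset, binds an attribute to it, and adds a subgroup under the root group:

    /* Hedged sketch: the HDF5 objects from the diagram above, expressed in code. */
    #include <hdf5.h>

    int main (void)
    {
        hid_t fid = H5Fcreate ("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

        /* Dataspace: a 2-D 4x6 array; datatype: native int. */
        hsize_t dims[2] = {4, 6};
        hid_t sid = H5Screate_simple (2, dims, NULL);
        hid_t did = H5Dcreate (fid, "/Dataset0", H5T_NATIVE_INT, sid,
                               H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        /* Attribute "time" bound to the dataset. */
        double time = 0.2345;
        hid_t asid = H5Screate (H5S_SCALAR);
        hid_t aid  = H5Acreate (did, "time", H5T_NATIVE_DOUBLE, asid,
                                H5P_DEFAULT, H5P_DEFAULT);
        H5Awrite (aid, H5T_NATIVE_DOUBLE, &time);

        /* Subgroup under the root group. */
        hid_t gid = H5Gcreate (fid, "/subgrp", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        H5Gclose (gid);
        H5Aclose (aid);
        H5Sclose (asid);
        H5Dclose (did);
        H5Sclose (sid);
        H5Fclose (fid);
        return 0;
    }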

Page 9: Tuning HDF5 Subfiling Performance on Parallel File Systems

Heavily used at supercomputing centers
[Bar chart, log scale from 1 to 10,000,000: number of linking instances for gcc, atp, mpich, darshan, intel, libsci, HDF5, mkl, boost, netcdf, openmpi, and petsc]
Library usage on Edison, Allocation Year 2015 (collected by Zhengji Zhao, NERSC)

Page 10: Tuning HDF5 Subfiling Performance on Parallel File Systems

Top self-describing I/O library
[Pie chart 1: HDF5-parallel 38%, HDF5 21%, NetCDF-HDF5 Parallel 14%, NetCDF 14%, PnetCDF 11%, ADIOS 2%]
[Pie chart 2: HDF5 90%, NetCDF-HDF5 Parallel 7%, remainder split among NetCDF, PnetCDF, ADIOS, SILO, and MATIO]

Page 11: Tuning HDF5 Subfiling Performance on Parallel File Systems

Subfiling
•  Writing to a single shared file may be slow due to:
   •  Locking contention
   •  Complications in moving large files
•  A solution: subfiling
   •  Multiple small files
   •  A metadata file stitching the small files together
•  Benefits
   •  Better use of the parallel I/O subsystem
   •  Reduced locking and contention issues improve performance
•  Related work
   •  File system: PLFS
   •  I/O middleware: PnetCDF, ADIOS
   •  Application libraries: BoxLib
[Diagram: clients writing to multiple subfiles spread across OSTs]

Page 12: Tuning HDF5 Subfiling Performance on Parallel File Systems

Subfiling in HDF5
•  Virtual Datasets
   §  Introduced in HDF5 1.10.0
   §  Allow pieces of a dataset to be stored in separate files while being viewed as a single dataset from a master file
•  Creating subfiles
   §  Split the MPI_COMM_WORLD communicator into multiple subcommunicators
   §  One subfile per subcommunicator
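For context, the sketch below shows how the HDF5 1.10 Virtual Dataset calls can stitch per-subfile datasets into one logical dataset in a master file. It is an illustration, not the prototype's actual metadata layout; the file names, the dataset name "/data", the subfile count, and the 1-D layout are assumptions:

    /* Hedged sketch: a master file exposing one virtual dataset backed by
     * per-subfile source datasets, using H5Pset_virtual (HDF5 >= 1.10).      */
    #include <hdf5.h>
    #include <stdio.h>

    #define N_SUBFILES    4
    #define ELEMS_PER_SUB 1024

    int main (void)
    {
        hsize_t full_dims[1] = {N_SUBFILES * ELEMS_PER_SUB};
        hsize_t sub_dims[1]  = {ELEMS_PER_SUB};

        hid_t vspace = H5Screate_simple (1, full_dims, NULL);  /* full (virtual) extent */
        hid_t sspace = H5Screate_simple (1, sub_dims, NULL);   /* extent inside a subfile */
        hid_t dcpl   = H5Pcreate (H5P_DATASET_CREATE);

        /* Map each subfile's dataset onto its slice of the virtual dataset. */
        for (int i = 0; i < N_SUBFILES; i++) {
            char subfile_name[64];
            snprintf (subfile_name, sizeof (subfile_name), "Subfile_%d.h5", i);

            hsize_t start[1] = {(hsize_t) i * ELEMS_PER_SUB};
            hsize_t count[1] = {ELEMS_PER_SUB};
            H5Sselect_hyperslab (vspace, H5S_SELECT_SET, start, NULL, count, NULL);

            H5Pset_virtual (dcpl, vspace, subfile_name, "/data", sspace);
        }

        /* The master file exposes one dataset stitched together from the subfiles. */
        hid_t fid = H5Fcreate ("master.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        hid_t did = H5Dcreate (fid, "/data", H5T_NATIVE_INT, vspace,
                               H5P_DEFAULT, dcpl, H5P_DEFAULT);

        H5Dclose (did);
        H5Fclose (fid);
        H5Pclose (dcpl);
        H5Sclose (sspace);
        H5Sclose (vspace);
        return 0;
    }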

Page 13: Tuning HDF5 Subfiling Performance on Parallel File Systems

Using HDF5 subfiling
§  Split the MPI_COMM_WORLD communicator into multiple subcommunicators:

    int color = mpi_rank % subfile;
    if (n_nodes > subfile)
        color = (mpi_rank % n_nodes) % subfile;
    MPI_Comm_split (..., &subfile_comm);

§  Writing subfiles:

    sprintf (subfile_name, "Subfile_%d.h5", mpi_rank);
    H5Pset_subfiling_access (fapl_id, subfile_name, MPI_COMM_SELF, MPI_INFO_NULL);
    fid = H5Fcreate (filename, ..., fapl_id);
    H5Sselect_hyperslab (sid, H5S_SELECT_SET, start, stride, count, block);
    dapl_id = H5Pcreate (H5P_DATASET_ACCESS);
    H5Pset_subfiling_selection (dapl_id, sid);
    did = H5Dcreate (fid, DATASET, ..., sid, ..., dapl_id);
    H5Dwrite (did, H5T_NATIVE_INT, mem_sid, sid, H5P_DEFAULT, wbuf);
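The slide elides the arguments to MPI_Comm_split; one plausible completion (an assumption, not the prototype's exact code) uses the computed color to group ranks and the rank as the ordering key:

    /* Hedged sketch: one subcommunicator per subfile, derived from the slide's
     * color computation. `subfile` and `n_nodes` are assumed to be provided
     * by the application, as in the slide's snippet.                          */
    #include <mpi.h>

    MPI_Comm make_subfile_comm (int subfile, int n_nodes)
    {
        int mpi_rank;
        MPI_Comm_rank (MPI_COMM_WORLD, &mpi_rank);

        int color = mpi_rank % subfile;
        if (n_nodes > subfile)
            color = (mpi_rank % n_nodes) % subfile;

        /* Ranks with the same color end up in the same subcommunicator. */
        MPI_Comm subfile_comm;
        MPI_Comm_split (MPI_COMM_WORLD, color, mpi_rank, &subfile_comm);
        return subfile_comm;
    }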

Page 14: Tuning HDF5 Subfiling Performance on Parallel File Systems

Using HDF5 subfiling
§  Split the MPI_COMM_WORLD communicator into multiple subcommunicators:

    int color = mpi_rank % subfile;
    if (n_nodes > subfile)
        color = (mpi_rank % n_nodes) % subfile;
    MPI_Comm_split (..., &subfile_comm);

§  Reading subfiles:

    sprintf (subfile_name, "Subfile_%d.h5", mpi_rank);
    H5Pset_subfiling_access (fapl_id, subfile_name, MPI_COMM_SELF, MPI_INFO_NULL);
    H5Sselect_hyperslab (sid, H5S_SELECT_SET, start, stride, count, block);
    H5Dread (did, H5T_NATIVE_INT, mem_sid, sid, H5P_DEFAULT, rbuf);

Page 15: Tuning HDF5 Subfiling Performance on Parallel File Systems

Experimental setup
•  Systems
   –  Cori
      •  Haswell partition: 2,388 nodes, 32-core Intel Xeon E5-2698 CPUs
      •  File systems: cscratch (Lustre, 248 OSSs, 248 OSTs), SSD-based burst buffer
   –  Edison
      •  5,586 compute nodes, 24-core Ivy Bridge processors
      •  File systems: scratch3 (Lustre, 36 OSTs), cscratch (Lustre, 248 OSSs, 248 OSTs)
•  Benchmarks
   –  VPIC-IO
      •  I/O kernel from a plasma physics simulation
      •  8 million particles per MPI process, 8 variables per particle
   –  BD-CATS-IO
      •  I/O kernel from a big-data clustering application that runs DBSCAN
      •  Reads VPIC-IO data
•  I/O measurements
   –  I/O time and I/O rate
   –  Each job was run at least three times

Page 16: Tuning HDF5 Subfiling Performance on Parallel File Systems

Scalability tests – Edison scratch3 – I/O time
[Bar chart: write time in seconds vs. number of MPI processes (1K, 2K, 4K) for Original and Subfiling]
§  Subfiling is 4X better at 2K and 2X better at 4K processes
Configuration
§  /scratch3 Lustre file system on Edison
§  36 OSTs, 32 MB stripe size
§  72 GB/s peak bandwidth
§  Subfiling factor: 32
§  Number of subfiles: 1K → 32, 2K → 64, 4K → 128

Page 17: Tuning HDF5 Subfiling Performance on Parallel File Systems

Scalability tests – Edison scratch3 – I/O rate
[Bar chart: I/O rate in GB/s vs. number of MPI processes (1K, 2K, 4K) for Original and Subfiling]
§  Subfiling reaches 67% of the peak bandwidth
Configuration
§  /scratch3 Lustre file system on Edison
§  36 OSTs, 32 MB stripe size
§  72 GB/s peak bandwidth
§  Subfiling factor: 32
§  Number of subfiles: 1K → 32, 2K → 64, 4K → 128

Page 18: Tuning HDF5 Subfiling Performance on Parallel File Systems

Scalability tests – cscratch from Edison
[Bar chart: write rate in GB/s vs. number of MPI processes (1K, 2K, 4K) for Original and Subfiling]
§  Up to 6.5X better performance at 4K cores
Configuration
§  cscratch Lustre file system on Cori
§  128 OSTs out of 248, 32 MB stripe size
§  >700 GB/s peak bandwidth
§  Subfiling factor: 32
§  Number of subfiles: 1K → 32, 2K → 64, 4K → 128

Page 19: Tuning HDF5 Subfiling Performance on Parallel File Systems

Scalability tests – cscratch from Cori
[Bar chart: write rate in GB/s vs. number of MPI processes (1K, 2K, 4K, 8K, 16K) for Original and Subfiling]
§  Up to 60% better performance at 16K cores
§  195 GB/s I/O rate at 16K processes
Configuration
§  cscratch Lustre file system on Cori
§  128 OSTs out of 248, 32 MB stripe size
§  >700 GB/s peak bandwidth
§  Subfiling factor: 32
§  Number of subfiles: 1K → 32, 2K → 64, 4K → 128, 8K → 256, 16K → 256 (subfiling factor 64)

Page 20: Tuning HDF5 Subfiling Performance on Parallel File Systems

Scalability tests – Burst buffer from Cori
[Bar chart: write rate in GB/s vs. number of MPI processes (1K, 2K, 4K, 8K, 16K) for Original and Subfiling]
§  Up to 80% better performance at 16K cores
§  410 GB/s I/O rate at 16K processes
Configuration
§  Burst buffer on Cori
§  1.8 TB/s peak bandwidth
§  Subfiling factor: 32
§  Number of subfiles: 1K → 32, 2K → 64, 4K → 128, 8K → 256, 16K → 256 (subfiling factor 64)

Page 21: Tuning HDF5 Subfiling Performance on Parallel File Systems

Reading HDF5 subfile data – cscratch from Edison
[Bar chart: read rate in GB/s vs. number of MPI processes (1K, 2K, 4K) for the "10%" and "Full" read cases]

Page 22: Tuning HDF5 Subfiling Performance on Parallel File Systems

Tuning subfiling factor – Edison scratch3
[Bar chart: write rate in GB/s vs. subfiling factor (number of MPI processes writing to a file), from 1 to 4096, with the average I/O rate marked]
§  Subfiling factors of 8 to 32 resulted in good performance
§  A subfiling factor of 64 consistently resulted in poor performance, but 128 to 1024 were above average
Configuration
§  /scratch3 Lustre file system on Edison
§  36 OSTs, 32 MB stripe size
§  72 GB/s peak bandwidth
§  4K cores
§  1 TB data
§  Varied the number of subfiles

Page 23: Tuning HDF5 Subfiling Performance on Parallel File Systems

Tuning subfiling factor – cscratch from Edison
[Bar chart: I/O rate in GB/s vs. subfiling factor (number of MPI processes writing to a file), from 1 to 4096]
§  Subfiling factors of 8 to 32 resulted in good performance
§  A subfiling factor of 64 consistently resulted in poor performance, but 128 to 1024 were above average
Configuration
§  cscratch Lustre file system on Cori
§  128 OSTs, 32 MB stripe size
§  >700 GB/s peak bandwidth
§  4K cores
§  1 TB data
§  Varied the number of subfiles

Page 24: Tuning HDF5 Subfiling Performance on Parallel File Systems

Tuning subfiling factor – cscratch from Cori
[Bar chart: write rate in GB/s vs. subfiling factor (number of MPI processes writing to a file), from 1 to 4096]
§  Subfiling factors of 4 and 8 resulted in good performance
§  Subfiling factors between 16 and 512 showed above-average I/O rates
Configuration
§  cscratch Lustre file system on Cori
§  128 OSTs, 32 MB stripe size
§  >700 GB/s peak bandwidth
§  4K cores
§  1 TB data
§  Varied the number of subfiles

Page 25: Tuning HDF5 Subfiling Performance on Parallel File Systems

Tuning subfiling factor – Burst buffer on Cori
[Bar chart: write rate in GB/s vs. subfiling factor (number of MPI processes writing to a file), from 1 to 4096]
§  Subfiling factors of 1 to 32 resulted in good performance
§  Performance degraded with subfiling factors beyond 32
Configuration
§  Burst buffer on Cori
§  1.8 TB/s peak bandwidth
§  4K cores
§  1 TB data
§  Varied the number of subfiles

Page 26: Tuning HDF5 Subfiling Performance on Parallel File Systems

Tuning Lustre striping – cscratch from Edison
[Bar chart: write rate in GB/s vs. Lustre stripe count (Default, 1, 2, 4, 8, 16, 32, 64, 128, 248)]
§  Using 8 to 64 stripes resulted in above-average performance
§  70% better performance than the default stripe settings
Configuration
§  cscratch Lustre file system on Cori
§  >700 GB/s peak bandwidth
§  4K cores
§  1 TB data
§  Subfiling factor of 128
§  Varied the number of OSTs
§  Stripe size: 32 MB
§  Default: 1 OST, 1 MB stripe size

Page 27: Tuning HDF5 Subfiling Performance on Parallel File Systems

Tuning Lustre striping – cscratch from Cori – 4K
[Bar chart: write rate in GB/s vs. Lustre stripe count (Default, 1, 2, 4, 8, 16, 32, 64, 128, 248), with the average I/O rate marked]
§  Using 8 to 248 stripes resulted in good I/O performance
§  3.5X faster than the default stripe settings at 16 OSTs
Configuration
§  cscratch Lustre file system on Cori
§  >700 GB/s peak bandwidth
§  4K cores
§  1 TB data
§  Subfiling factor of 128
§  Varied the number of OSTs
§  Stripe size: 32 MB
§  Default: 1 OST, 1 MB stripe size

Page 28: Tuning HDF5 Subfiling Performance on Parallel File Systems

Tuning Lustre striping – cscratch from Cori – 16K
[Bar chart: write rate in GB/s vs. Lustre stripe count (Default, 4, 8, 16, 32, 64, 128, 248), with the average I/O rate marked]
§  Using 64 stripes resulted in the best performance
§  4.8X faster than the default stripe settings at 64 OSTs
Configuration
§  cscratch Lustre file system on Cori
§  >700 GB/s peak bandwidth
§  16K cores
§  1 TB data
§  Subfiling factor of 64
§  Varied the number of OSTs
§  Stripe size: 32 MB
§  Default: 1 OST, 1 MB stripe size

Page 29: Tuning HDF5 Subfiling Performance on Parallel File Systems

Conclusions
•  Recommendations for obtaining a good I/O rate
   §  A subfiling factor of 8 to 64 is reasonable
   §  A stripe count of 16 at smaller scales and 64 at larger scales
•  Limitations
   §  Using subfiling at 32K MPI processes failed
   §  Failures were observed for region sizes > 2 GB (probably an MPI limitation)
   §  The number of readers has to be equal to the number of writers
•  Subfiling shows better performance than writing to a single shared file
   §  Up to a 6.5X performance advantage
•  Reading with an arbitrary number of MPI processes, without matching the number of writers, would be a useful extension

Page 30: Tuning HDF5 Subfiling Performance on Parallel File Systems

Thanks!
Thanks to the Advanced Scientific Computing Research (ASCR) program for funding the ExaHDF5 project. Program Manager: Dr. Lucy Nowell