March 9, 2009 10th International LCI Conference - HDF5 Tutorial 1
Tutorial II: HDF5 and NetCDF-4
10th International LCI Conference
Albert Cheng, Neil Fortner
The HDF Group
Ed Hartnett
Unidata/UCAR
Outline
8:30 – 9:30   Introduction to HDF5 data, programming models and tools
9:30 – 10:00  Advanced features of the HDF5 library
10:30 – 11:30 Advanced features of the HDF5 library (continued)
11:30 – 12:00 Introduction to Parallel HDF5
1:00 – 2:30   Introduction to Parallel HDF5 (continued) and Parallel I/O Performance Study
3:00 – 4:30   NetCDF-4
Introduction to HDF5 Data, Programming Models and Tools
What is HDF?
HDF is…
• HDF stands for Hierarchical Data Format
• A file format for managing any kind of data
• A software system to manage data in the format
• Designed for high-volume or complex data
• Designed for every size and type of system
• Open format and software library, tools
• There are two HDFs: HDF4 and HDF5
• Today we focus on HDF5
Brief History of HDF
1987         At NCSA (University of Illinois), a task force formed to create an architecture-independent format and library: AEHOO (All Encompassing Hierarchical Object Oriented format), which became HDF.
Early 1990s  NASA adopted HDF for the Earth Observing System project.
1996         DOE’s ASC (Advanced Simulation and Computing) project began collaborating with the HDF group (NCSA) to create “Big HDF” (the increase in computing power of DOE systems at LLNL, LANL and Sandia National Labs required bigger, more complex data files). “Big HDF” became HDF5.
1998         HDF5 was released with support from National Labs, NASA, and NCSA.
2006         The HDF Group spun off from the University of Illinois as a non-profit corporation.
Why HDF5?
In one sentence ...
Answering big questions …
• Matter and the universe
• Weather and climate
• Life and nature
[Figure: Total Column Ozone (Dobson), August 24, 2001 vs. August 24, 2002]
… involves big data …
… varied data …
(Thanks to Mark Miller, LLNL)
… and complex relationships …
[Figure: genome assembly browser showing contigs, reads, contig summaries and qualities, discrepancies, coverage depth, read quality, aligned bases, percent match, trace, and SNP score]
… on big computers …
… and small computers …
How do we…
• Describe our data?
• Read it? Store it? Find it? Share it? Mine it?
• Move it into, out of, and between computers and repositories?
• Achieve storage and I/O efficiency?
• Give applications and tools easy access to our data?
Solution: HDF5!
• Can store all kinds of data in a variety of ways
• Runs on most systems
• Lots of tools to access data
• Emphasis on standards (HDF-EOS, CGNS)
• Library and format emphasis on I/O efficiency and storage
HDF5 Philosophy
A single platform with multiple uses:
• One general format
• One library, with
  • Options to adapt I/O and storage to data needs
  • Layers on top and below
• Ability to interact well with other technologies
• Attention to past, present, and future compatibility
Who uses HDF5?
• Applications that deal with big or complex data
• Over 200 different types of apps
• 2+ million product users world-wide
• Academia, government agencies, industry
NASA EOS remote sense data
• HDF format is the standard file format for storing data from NASA's Earth Observing System (EOS) mission.
• Petabytes of data stored in HDF and HDF5 to support the Global Climate Change Research Program.
Structure of HDF5 Library
• Applications
• Object API (C, F90, C++, Java)
• Library internals
• Virtual file I/O
• File or other “storage”
HDF Tools
- HDFView and Java Products
- Command-line utilities (h5dump, h5ls, h5cc, h5diff, h5repack)
HDF5 Applications & Domains
[Diagram: communities and applications (simulation, visualization, remote sensing; thermonuclear simulations, product modeling, data mining tools, visualization tools, climate models; HDF-EOS, CGNS, ASC) sit on the HDF5 data model & API; the virtual file layer (I/O drivers: Stdio, MPI I/O, Split Files, Custom) maps the HDF5 format onto storage: a file on a parallel file system, a single file, split metadata and raw-data files, or a user-defined device.]
HDF5: The Format
An HDF5 “file” is a container…
  lat | lon | temp
  ----|-----|-----
   12 |  23 | 3.1
   15 |  24 | 4.2
   17 |  21 | 3.6
  [raster image with palette]
…into which you can put your data objects
Structures to organize objects
[Diagram: the root group “/” and a subgroup “/foo” are “Groups”; the “Datasets” they organize include raster images (with palette), a 3-D array, a 2-D array, and a table (lat | lon | temp).]
HDF5 model
• Groups – provide structure among objects
• Datasets – where the primary data goes
  • Data arrays
  • Rich set of datatype options
  • Flexible, efficient storage and I/O
• Attributes, for metadata

Everything else is built essentially from these parts.
HDF5: The Software
HDF5 Software
• Tools, Applications, Libraries
• HDF5 I/O Library
• HDF5 File
Users of HDF5 Software
• Tools & Applications – most data consumers are here: scientific/engineering applications, plus domain-specific libraries/APIs and tools
• HDF5 Application Programming Interface – applications, tools, and power users call this API to create, read, write, query, etc.
• “Virtual file layer” (VFL) – modules that adapt I/O to specific features of a system, or do I/O in some special way
• “HDF5 File” – could be on a parallel system, in memory, a collection of files, etc.
• File system, MPI-IO, SAN, other layers
HDF5 Data Model
HDF5 model (recap)
• Groups – provide structure among objects
• Datasets – where the primary data goes
  • Data arrays
  • Rich set of datatype options
  • Flexible, efficient storage and I/O
• Attributes, for metadata
• Other objects
  • Links (point to data in a file or in another HDF5 file)
  • Datatypes (can be stored for complex structures and reused by multiple datasets)
HDF5 Dataset
A dataset has two parts:
• Data – the stored array itself
• Metadata
  • Dataspace – rank 3; dimensions Dim_1 = 4, Dim_2 = 5, Dim_3 = 7
  • Datatype – IEEE 32-bit float
  • Attributes – Time = 32.4, Pressure = 987, Temp = 56
  • Storage info – chunked, compressed
HDF5 Dataspace
• Two roles:
  • Dataspace contains spatial info about a dataset stored in a file
    • Rank and dimensions
    • Permanent part of the dataset definition
  • Dataspace describes the application’s data buffer and the data elements participating in I/O
• Example: file dataspace of rank = 2, dimensions = 4x6; memory dataspace of rank = 1, dimensions = 12
HDF5 Datatype
• Datatype – how to interpret a data element
• Permanent part of the dataset definition
• Two classes: atomic and compound
• Can be stored in a file as an HDF5 object (HDF5 committed datatype)
• Can be shared among different datasets
HDF5 Datatype
• HDF5 atomic types include
  • normal integer & float
  • user-definable (e.g., 13-bit integer)
  • variable-length types (e.g., strings)
  • references to objects/dataset regions
  • enumeration – names mapped to integers
  • array
• HDF5 compound types
  • Comparable to C structs (“records”)
  • Members can be atomic or compound types
HDF5 dataset: array of records
• Dimensionality: 5 x 3
• Datatype (one record): int8, int4, int16, and a 2x3x2 array of float32
Special storage options for dataset
• Chunked – better subsetting access time; compressible; extendable
• Compressed – improves storage efficiency, transmission speed
• Extendable – arrays can be extended in any direction
• External – metadata stays in the HDF5 file, raw data goes to a separate binary file (e.g., metadata for dataset “Fred” in file A, data for “Fred” in file B)
HDF5 Attribute
• Attribute – data of the form “name = value”, attached to an object by the application
• Operations are similar to dataset operations, but…
  • Not extendible
  • No compression or partial I/O
• Can be overwritten, deleted, or added during the “life” of a dataset
HDF5 Group
• A mechanism for organizing collections of related objects
• Every file starts with a root group, “/”
• Similar to UNIX directories
• Can have attributes
Path to HDF5 object in a file
[Diagram: the root group “/” contains X and Y; Y holds a dataset “temp” and a subgroup “bar”, which holds its own “temp”.]
• / (root)
• /X
• /Y
• /Y/temp
• /Y/bar/temp
Shared HDF5 objects
• /A/P
• /B/R
• /C/R
[Diagram: the root group “/” contains groups A, B, and C; A holds dataset P, while B and C share a single dataset R, reachable as both /B/R and /C/R.]
HDF5 Data Model: Example
ENSIGHT
Automotive crash simulation
Solid modeling
HDF5 mesh
Mesh Example, in HDFView
HDF5 Software
HDF5 software stack
• Tools & Applications
• HDF I/O Library
• HDF File
Structure of HDF5 Library
• Object API (C, Fortran 90, Java, C++)
  • Specify objects and transformation properties
  • Invoke data movement operations and data transformations
• Library internals
  • Perform data transformations and other prep for I/O
  • Configurable transformations (compression, etc.)
• Virtual file I/O (C only)
  • Perform byte-stream I/O operations (open/close, read/write, seek)
  • User-implementable I/O (stdio, network, memory, etc.)
Write – from memory to disk
[Diagram: a data array in memory is written to the dataset on disk.]
Partial I/O
• Move just part of a dataset
• (a) Hyperslab from a 2D array to the corner of a smaller 2D array
• (b) Regular series of blocks from a 2D array to a contiguous sequence at a certain offset in a 1D array
Partial I/O (continued)
• Move just part of a dataset
• (c) A sequence of points from a 2D array to a sequence of points in a 3D array
• (d) Union of hyperslabs in file to union of hyperslabs in memory
Layers – parallel example
• I/O flows through many layers from application to disk:
  • Application
  • I/O library (HDF5)
  • Parallel I/O library (MPI-I/O)
  • Parallel file system (GPFS)
  • Disk architecture & layout of data on disk
• Parallel computing system (Linux cluster): compute nodes connected through a switch network and I/O servers to the disks
Virtual I/O layer
• Object API (C, Fortran 90, Java, C++)
• Library internals
• Virtual file I/O (C only)
Virtual file I/O layer
• A public API for writing I/O drivers
• Allows HDF5 to interface to disk, memory, or a user-defined device
• Example virtual file I/O drivers: File, File Family, Stdio, MPI I/O, Core (memory), Custom, …
• Each driver maps the abstract “storage” onto a real file, a family of files, memory, or another device
Applications & Domains
[Diagram: domain-specific APIs (UDM – LANL; SAF – LLNL, SNL; H5Part – Grids COTS; HDF-EOS – NASA; IDL) and common domain-specific data models sit on the HDF5 data model & API; HDF5 serial & parallel I/O passes through the HDF5 virtual file layer (I/O drivers: MPI I/O, Multi, Stdio, Custom, Core) to write the HDF5 format onto storage: a file on a parallel file system, a single file, split metadata and raw-data files, system memory, or a user-defined device.]
Portability & Robustness
• Runs almost anywhere
  • Linux and UNIX workstations
  • Windows, Mac OS X
  • Big ASC machines, Crays, VMS systems
  • TeraGrid and other clusters
  • Source and binaries available from http://www.hdfgroup.org/HDF5/release/index.html
• QA
  • Daily regression tests on key platforms
  • Meets NASA’s highest technology readiness level
Other Software
• The HDF Group
  • HDFView
  • Java tools
  • Command-line utilities
  • Web browser plug-in
  • Regression and performance testing software
  • Parallel h5diff
• 3rd party (IDL, MATLAB, Mathematica, PyTables, HDF Explorer, LabVIEW)
• Communities (EOS, ASC, CGNS)
• Integration with other software (iRODS, OPeNDAP)
Creating an HDF5 File with HDFView
Example: Create this HDF5 File
[Diagram: root group “/” containing “A” (a 4x6 array of integers), group “B”, and “Storm”.]
Demo
• Demonstrate the use of HDFView to create the HDF5 file
• Use h5dump to see the contents of the HDF5 file
• Use h5import to add data to the HDF5 file
• Use h5repack to change properties of the stored objects
• Use h5diff to compare two files
Introduction to HDF5 Programming Model and APIs
Structure of HDF5 Library (recap)
• Object API (C, Fortran 90, Java, C++)
  • Specify objects and transformation properties
  • Invoke data movement operations and data transformations
• Library internals
  • Perform data transformations and other prep for I/O
  • Configurable transformations (compression, etc.)
• Virtual file I/O API (C only)
  • Perform byte-stream I/O operations (open/close, read/write, seek)
  • User-implementable I/O (stdio, mpi-io, memory, etc.)
Goals of HDF5 Library
• Provide flexible API to support a wide range of operations on data.
• Support high performance access in serial and parallel computing environments.
• Be compatible with common data models and programming languages.
• Because of these goals, the HDF5 API is rich and large
Operations Supported by the API
• Create groups, datasets, attributes, linkages
• Create complex data types
• Assign storage and I/O properties to objects
• Perform complex subsetting during read/write
• Use a variety of I/O “devices” (parallel, remote, etc.)
• Transform data during I/O
• Query about file structure and properties
• Query about object structure, content, properties
Characteristics of the HDF5 API
• For flexibility, the API is extensive: 300+ functions
• This can be daunting… but there is hope
  • A few functions can do a lot
  • Start simple
  • Build up knowledge as more features are needed
• Library functions are categorized by object type
• The “H5Lite” API supports basic capabilities
[Image: Victorinox Swiss Army CyberTool 34]
The General HDF5 API
• Currently C, Fortran 90, Java, and C++ bindings
• C routines begin with the prefix H5?, where ? is a character corresponding to the type of object the function acts on
• Example APIs:
  • H5D: Dataset interface, e.g., H5Dread
  • H5F: File interface, e.g., H5Fopen
  • H5S: dataSpace interface, e.g., H5Sclose
Compiling HDF5 Applications
• h5cc – HDF5 C compiler command (similar to mpicc)
• h5fc – HDF5 F90 compiler command (similar to mpif90)
• h5c++ – HDF5 C++ compiler command

To compile:
  % h5cc h5prog.c
  % h5fc h5prog.f90
Compile option: -show
-show: displays the compiler commands and options without executing them
% h5cc -show Sample_c.c
gcc -I/home/packages/hdf5_1.6.6/Linux_2.6/include -UH5_DEBUG_API
    -DNDEBUG -I/home/packages/szip/static/encoder/Linux2.6-gcc/include
    -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64
    -D_POSIX_SOURCE -D_BSD_SOURCE -std=c99 -Wno-long-long -O
    -fomit-frame-pointer -finline-functions -c Sample_c.c
gcc -std=c99 -Wno-long-long -O -fomit-frame-pointer -finline-functions
    -L/home/packages/szip/static/encoder/Linux2.6-gcc/lib Sample_c.o
    -L/home/packages/hdf5_1.6.6/Linux_2.6/lib
    /home/packages/hdf5_1.6.6/Linux_2.6/lib/libhdf5_hl.a
    /home/packages/hdf5_1.6.6/Linux_2.6/lib/libhdf5.a
    -lsz -lz -lm -Wl,-rpath -Wl,/home/packages/hdf5_1.6.6/Linux_2.6/lib
General Programming Paradigm
• Properties of an object are optionally defined
  • Creation property lists
  • Access property lists
  • Default values are used if none are defined
• Object is opened or created
• Object is accessed, possibly many times
• Object is closed
Order of Operations
• An order is imposed on operations by argument dependencies
For Example:
A file must be opened before a dataset, because the dataset open call requires a file handle as an argument.
• Objects can be closed in any order.
HDF5 Defined Types
• For portability, the HDF5 library defines its own types:
  • hid_t: object identifiers (native integer)
  • hsize_t: size used for dimensions (unsigned long or unsigned long long)
  • hssize_t: for specifying coordinates, and sometimes for dimensions (signed long or signed long long)
  • herr_t: function return value
  • hvl_t: variable-length datatype
• For C, include hdf5.h in your HDF5 application.
Example: Create this HDF5 File
[Diagram: root group “/” with “A” (a 4x6 array of integers) and group “B”.]
Example: Step by Step
[Diagram: start from the root group “/”; first create dataset “A” (a 4x6 array of integers), then group “B”.]
Example: Create a File
[Diagram: a new file containing only the root group “/”.]
Steps to Create a File
1. Decide any special properties the file should have
   • Creation properties, like size of the user block
   • Access properties, such as metadata cache size
2. Create property lists, if necessary
3. Create the file
4. Close the file and the property lists, as needed
Code: Create a File
hid_t file_id;

file_id = H5Fcreate ("file.h5", H5F_ACC_TRUNC,
                     H5P_DEFAULT, H5P_DEFAULT);

• H5F_ACC_TRUNC flag – removes an existing file
• H5P_DEFAULT flags – create a regular UNIX file and access it with the HDF5 SEC2 I/O file driver
Example: Add a Dataset
[Diagram: dataset “A” (a 4x6 array of integers) added under the root group “/”.]
Dataset Components
• Data – the stored array itself
• Metadata
  • Dataspace – rank 3; dimensions Dim_1 = 4, Dim_2 = 5, Dim_3 = 7
  • Datatype – IEEE 32-bit float
  • Attributes – Time = 32.4, Pressure = 987, Temp = 56
  • Storage info – chunked, compressed
Dataset Creation Property List
• Dataset creation property list: information on how to store data in a file, e.g., chunked, or chunked & compressed
Steps to Create a Dataset
1. Define dataset characteristics
   • Dataspace – 4x6; Datatype – integer
   • Properties (if needed)
2. Decide where to put it – “root group”
   • Obtain location identifier
3. Decide link or path – “A”
4. Create link and dataset in file
5. (Eventually) Close everything
Code: Create a Dataset

hid_t   file_id, dataset_id, dataspace_id;
hsize_t dims[2];
herr_t  status;

file_id = H5Fcreate ("file.h5", H5F_ACC_TRUNC,
                     H5P_DEFAULT, H5P_DEFAULT);

/* Create a dataspace: rank 2, current dims 4x6 */
dims[0] = 4;
dims[1] = 6;
dataspace_id = H5Screate_simple (2, dims, NULL);

/* Create a dataset: pathname "A", datatype, dataspace,
   property list (default) */
dataset_id = H5Dcreate (file_id, "A", H5T_STD_I32BE,
                        dataspace_id, H5P_DEFAULT);

/* Terminate access to dataset, dataspace, file */
status = H5Dclose (dataset_id);
status = H5Sclose (dataspace_id);
status = H5Fclose (file_id);
Example: Create a Group
[Diagram: file.h5 – root group “/” with dataset “A” (a 4x6 array of integers) and new group “B”.]
Steps to Create a Group
1. Decide where to put it – “root group”
   • Obtain location identifier
2. Decide link or path – “B”
3. Create link and group in file
   • Specify the number of bytes to store names of objects to be added to the group (as a hint) – or use the default.
4. (Eventually) Close the group.
Code: Create a Group
hid_t file_id, group_id;
...
/* Open "file.h5" */
file_id = H5Fopen ("file.h5", H5F_ACC_RDWR, H5P_DEFAULT);

/* Create group "/B" in the file. (The HDF5 1.8 H5Gcreate takes
   link-creation, group-creation, and group-access property lists.) */
group_id = H5Gcreate (file_id, "/B", H5P_DEFAULT,
                      H5P_DEFAULT, H5P_DEFAULT);

/* Close group and file. */
status = H5Gclose (group_id);
status = H5Fclose (file_id);
HDF5 Information
HDF Information Center: http://www.hdfgroup.org
HDF Help email address: [email protected]
HDF users mailing list: [email protected]
Questions?
Introduction to HDF5 Command-line Tools
HDF5 Command-line Tools
• Readers
  • h5dump, h5diff, h5ls
  • h5stat, h5check (new in release 1.8)
• Writers
  • h5import, h5repack, h5repart, h5jam/h5unjam
  • h5copy, h5mkgrp (new in release 1.8)
• Converters
  • h4toh5, h5toh4, gif2h5, h52gif
h5dump
h5dump: exports (dumps) the contents of an HDF5 file
• Multiple output types: ASCII, binary, XML
• Complete or selected file content
  • Object header information (the structure)
  • Attributes (the metadata)
  • Datasets (the data): all dataset values, or subsets of dataset values
  • Properties (filters, storage layout, fill value)
  • Specific objects: groups / datasets / attributes / named datatypes / soft links
• h5dump -h, --help lists all option flags
Example: h5dump
No options: “all” contents go to standard out.

% h5dump Sample.h5
HDF5 "Sample.h5" {
GROUP "/" {
   GROUP "Floats" {
      DATASET "FloatArray" {
         DATATYPE  H5T_IEEE_F32LE
         DATASPACE  SIMPLE { ( 4, 3 ) / ( 4, 3 ) }
         DATA {
         (0,0): 0.01, 0.02, 0.03,
         (1,0): 0.1, 0.2, 0.3,
         (2,0): 1, 2, 3,
         (3,0): 10, 20, 30
         }
      }
   }
   DATASET "IntArray" {
      DATATYPE  H5T_STD_I32LE
      DATASPACE  SIMPLE { ( 5, 6 ) / ( 5, 6 ) }
      DATA {
      (0,0): 0, 1, 2, 3, 4, 5,
      (1,0): 10, 11, 12, 13, 14, 15,
      (2,0): 20, 21, 22, 23, 24, 25,
      (3,0): 30, 31, 32, 33, 34, 35,
      (4,0): 40, 41, 42, 43, 44, 45
      }
   }
}
}
h5dump - object header information
HDF5 "Sample.h5" {
GROUP "/" {
GROUP "Floats" {
DATASET "FloatArray" {
DATATYPE H5T_IEEE_F32LE
DATASPACE SIMPLE { ( 4, 3 ) / ( 4, 3 ) }
}
}
DATASET "IntArray" {
DATATYPE H5T_STD_I32LE
DATASPACE SIMPLE { ( 5, 6 ) / ( 5, 6 ) }
}
}
}
-H option: Object header information only
% h5dump -H Sample.h5
h5dump – specific dataset
HDF5 "Sample.h5" {
DATASET "/Floats/FloatArray" {
DATATYPE H5T_IEEE_F32LE
DATASPACE SIMPLE { ( 4, 3 ) / ( 4, 3 ) }
DATA {
(0,0): 0.01, 0.02, 0.03,
(1,0): 0.1, 0.2, 0.3,
(2,0): 1, 2, 3,
(3,0): 10, 20, 30
}
}
}
-d dataset option: Specific dataset only
% h5dump -d /Floats/FloatArray Sample.h5
h5dump – dataset values to file
HDF5 "Sample.h5" {
DATASET "/IntArray" {
DATATYPE H5T_STD_I32LE
DATASPACE SIMPLE { ( 5, 6 ) / ( 5, 6 ) }
DATA {
}
}
}
-o file option: Dataset values output to file
% h5dump -o Ofile -d /IntArray Sample.h5
% cat Ofile
   (0,0): 0, 1, 2, 3, 4, 5,
   (1,0): 10, 11, 12, 13, 14, 15,
   (2,0): 20, 21, 22, 23, 24, 25,
   (3,0): 30, 31, 32, 33, 34, 35,
   (4,0): 40, 41, 42, 43, 44, 45

-y option: Do not output array indices with data values
h5dump – binary output
-b FORMAT option: Binary output, where FORMAT can be:
  MEMORY – data exported with datatypes matching memory on the system where h5dump is run
  FILE – data exported with datatypes matching those in the HDF5 file being dumped
  LE – data exported with a pre-defined little-endian datatype
  BE – data exported with a pre-defined big-endian datatype
• Typically used with the -d dataset and -o outputFile options
• Allows data values to be exported for use with other applications
• When -b and -d are used together, array indices are not output
h5dump – binary output
% h5dump -b BE -d /IntArray -o OBE Sample.h5
% od -b OBE | head -2
0000000 000 000 000 000 000 000 000 001 000 000 000 002 000 000 000 003
0000020 000 000 000 004 000 000 000 005 000 000 000 012 000 000 000 013

% h5dump -b LE -d /IntArray -o OLE Sample.h5
% od -b OLE | head -2
0000000 000 000 000 000 001 000 000 000 002 000 000 000 003 000 000 000
0000020 004 000 000 000 005 000 000 000 012 000 000 000 013 000 000 000

% h5dump -b MEMORY -d /IntArray -o OME Sample.h5
% od -b OME | head -2
0000000 000 000 000 000 001 000 000 000 002 000 000 000 003 000 000 000
0000020 004 000 000 000 005 000 000 000 012 000 000 000 013 000 000 000
h5dump – properties information
HDF5 "Sample.h5" {GROUP "/" { GROUP "Floats" { DATASET "FloatArray" { DATATYPE H5T_IEEE_F32LE DATASPACE SIMPLE { ( 4, 3 ) / ( 4, 3 ) } STORAGE_LAYOUT { CONTIGUOUS SIZE 48 OFFSET 3696 } FILTERS { NONE } FILLVALUE { FILL_TIME H5D_FILL_TIME_IFSET VALUE 0 } ALLOCATION_TIME { H5D_ALLOC_TIME_LATE } …
-p option: Print dataset filters, storage layout, fill value
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 99
• % h5dump –p –H Sample.h5
h5import
h5import: loads data into an existing or new HDF5 file
• Data loaded from ASCII or binary files
• Each input file corresponds to the data values for one dataset
• Integer (signed or unsigned) and float data can be loaded
• Per-dataset settable properties include:
  • datatype (int or float; size; architecture; byte order)
  • storage (compression, chunking, external file, maximum dimensions)
• Properties set via
  • command line: % h5import in in_opts [in2 in2_opts] -o out
  • configuration file: % h5import in -c conf1 [in2 -c conf2] -o out
Example: h5import
Create Sample2.h5 based on Sample.h5

% cat config.FloatArray
PATH /Floats/FloatArray
INPUT-CLASS TEXTFP
RANK 2
DIMENSION-SIZES 4 3

% cat in.FloatArray
0.01 0.02 0.03
0.1 0.2 0.3
1 2 3
10 20 30

% h5dump -d /Floats/FloatArray -y Sample.h5
HDF5 "Sample.h5" {
DATASET "/Floats/FloatArray" {
   DATATYPE  H5T_IEEE_F32LE
   DATASPACE  SIMPLE { ( 4, 3 ) / ( 4, 3 ) }
   DATA {
      0.01, 0.02, 0.03,
      0.1, 0.2, 0.3,
      1, 2, 3,
      10, 20, 30
   }
}
}
Example: h5import (continued)

% cat config.IntArray
PATH /IntArray
INPUT-CLASS TEXTIN
RANK 2
DIMENSION-SIZES 5 6

% cat in.IntArray
0 1 2 3 4 5
10 11 12 13 14 15
20 21 22 23 24 25
30 31 32 38 34 35
40 41 42 43 44 45

Input and configuration files ready; issue the command:
% h5import in.FloatArray -c config.FloatArray \
           in.IntArray -c config.IntArray -o Sample2.h5
h5mkgrp
h5mkgrp: makes groups in an HDF5 file.
Usage: h5mkgrp [OPTIONS] FILE GROUP...
OPTIONS
-h, --help Print a usage message and exit
-l, --latest Use latest version of file format to create groups
-p, --parents No error if existing, make parent groups as needed
-v, --verbose Print information about OBJECTS and OPTIONS
-V, --version Print version number and exit
Example:
% h5mkgrp Sample2.h5 /EmptyGroup
Introduced in HDF5 release 1.8.0.
h5diff
h5diff: compares HDF5 files and reports differences
• compare two HDF5 files: % h5diff file1 file2
• compare the same object in two files: % h5diff file1 file2 object
• compare different objects in two files: % h5diff file1 file2 object1 object2
Option flags:
  (none): report the number of differences found in objects and where they occurred
  -r: in addition, report the differences
  -v: in addition, print a list of objects and warnings; typically used when comparing two files without specifying objects
Example: h5diff

% h5diff -v Sample.h5 Sample2.h5

file1     file2
---------------------------------------
  x         x    /
            x    /EmptyGroup
  x         x    /Floats
  x         x    /Floats/FloatArray
  x         x    /IntArray

group  : </> and </>
0 differences found
group  : </Floats> and </Floats>
0 differences found
dataset: </Floats/FloatArray> and </Floats/FloatArray>
0 differences found
dataset: </IntArray> and </IntArray>
size:  [5x6]  [5x6]
position    IntArray    IntArray    difference
----------------------------------------------
[ 3 3 ]     33          38          5
h5repack
h5repack: copies an HDF5 file to a new file with a specified filter and storage layout

• Removes unused space introduced when
  • objects were deleted
  • compressed datasets were updated and no longer fit in the original space
  • full space allocated for variable-length data was not used
• Optionally applies a filter to datasets
  • gzip, szip, shuffle, checksum
• Optionally applies a storage layout to datasets
  • contiguous, chunked, compact
h5repack: filters
• -f FILTER option: apply a filter; FILTER can be:
  • GZIP to apply GZIP compression
  • SZIP to apply SZIP compression
  • SHUF to apply the HDF5 shuffle filter
  • FLET to apply the HDF5 checksum (Fletcher32) filter
  • NBIT to apply NBIT compression
  • SOFF to apply the HDF5 scale/offset filter
  • NONE to remove all filters

Compression is not performed if the data is smaller than 1 KB, unless the -m flag is used.
h5repack: storage layout
• -l LAYOUT option: apply a storage layout; LAYOUT can be:
  • CHUNK to apply chunked layout
  • COMPA to apply compact layout
  • CONTI to apply contiguous layout
Example: h5repack (filter)

• Tropospheric Emission Spectrometer (TES) on Aura, the third of NASA's Earth Observing System spacecraft.
• Makes global 3-D measurements of ozone and of other chemical species involved in its formation and destruction.

% h5repack -f SHUF -f GZIP=1 TES-Aura.he5 TES-rp.he5
% ls -sk TES-Aura.he5 TES-rp.he5
75608 TES-Aura.he5
56808 TES-rp.he5

About 25% reduction in file size (the original is 33% larger than the repacked file).
Example: h5repack (layout)
% h5repack -m 1 -l Floats/FloatArray:CHUNK=4x1 \
           Sample.h5 Sample-rp.h5
% h5dump -p -H Sample-rp.h5
HDF5 "Sample-rp.h5" {
GROUP "/" {
   GROUP "Floats" {
      DATASET "FloatArray" {
         DATATYPE H5T_IEEE_F32LE
         DATASPACE SIMPLE { ( 4, 3 ) / ( 4, 3 ) }
         STORAGE_LAYOUT {
            CHUNKED ( 4, 1 )
            SIZE 48
         }
         FILTERS { NONE }
         FILLVALUE {
            FILL_TIME H5D_FILL_TIME_IFSET
            VALUE 0
         }
         ALLOCATION_TIME { H5D_ALLOC_TIME_INCR }
         …
Performance Tuning & Troubleshooting
• HDF5 tools can assist with performance tuning and troubleshooting:
  • Discover objects and their properties in HDF5 files
      h5dump -p
  • Get file size overhead information
      h5stat
  • Find locations of objects in a file
      h5ls
  • Discover differences
      h5diff, h5ls
  • Find the location of raw data
      h5ls -var
  • Check whether a file conforms to the HDF5 File Format Specification
      h5check
h5stat
h5stat: Prints statistics about HDF5 files
• Reports two types of statistics:
  • High-level information about objects:
    • Number of objects of each type (groups, datasets, datatypes)
    • Number of unique datatypes
    • Size of raw data
  • Information about objects' structural metadata:
    • Size of structural metadata (total/free)
      • Object headers, local and global heaps
    • Size of B-trees
    • Object header fragmentation
h5stat
• Helps to…
  • troubleshoot size overhead in HDF5 files
  • choose appropriate properties and storage strategies
• Usage:
    % h5stat --help
    % h5stat file.h5
• Full specification at: http://www.hdfgroup.uiuc.edu/RFC/HDF5/h5stat/
Introduced in HDF5 release 1.8.0.
h5check
• Verifies that a file is encoded according to the HDF5 File Format Specification
    http://www.hdfgroup.org/HDF5/doc/H5.format.html
• Does not use the HDF5 library
• Used to confirm that files written by the HDF5 library are compliant with the specification
• The tool is not part of the HDF5 source code distribution
    ftp://ftp.hdfgroup.org/HDF5/special_tools/h5check/
Questions?
HDF5 Advanced Topics
Outline
• Part I
  • Overview of HDF5 datatypes
• Part II
  • Partial I/O in HDF5
    • Hyperslab selection
    • Dataset region references
    • Chunking and compression
• Part III
  • Performance issues (how to do it right)
• Part IV
  • Performance benefits of HDF5 version 1.8
Part I: HDF5 Datatypes

Quick overview of the most difficult topics
HDF5 Datatypes
• HDF5 has a rich set of pre-defined datatypes and supports the creation of an unlimited variety of complex user-defined datatypes.
• Datatype definitions are stored in the HDF5 file with the data.
• Datatype definitions include information such as byte order (endianness), size, and floating-point representation to fully describe how the data is stored and to ensure portability across platforms.
• Datatype definitions can be shared among objects in an HDF file, providing a powerful and efficient mechanism for describing data.
Example

• Array of integers on an IA32 platform: the native integer is little-endian, 4 bytes (H5T_NATIVE_INT). H5Dwrite stores it in the file as H5T_STD_I32LE.
• Array of integers on a SPARC64 platform: the native integer is big-endian, 8 bytes (H5T_NATIVE_INT). H5Dread converts the stored little-endian 4-byte integers to that native type automatically.
• The same conversion machinery handles other layouts, e.g. writing a little-endian 4-byte integer to a VAX G-floating value with H5Dwrite.
Storing Variable Length Data in HDF5
(Figure: data elements arriving over time; some streams produce the same amount of data at each time step, others produce varying amounts, motivating fixed-length and variable-length array storage.)
HDF5 Fixed and Variable Length Array Storage
Storing Strings in HDF5
• Array of characters (array datatype or extra dimension in the dataset)
  • Quick access to each character
  • Extra work to access and interpret each string
• Fixed length
      string_id = H5Tcopy(H5T_C_S1);
      H5Tset_size(string_id, size);
  • Wasted space in shorter strings
  • Can be compressed
• Variable length
      string_id = H5Tcopy(H5T_C_S1);
      H5Tset_size(string_id, H5T_VARIABLE);
  • Overhead, as for all VL datatypes
  • Compression is not applied to the actual data
Storing Variable Length Data in HDF5
• Each element is represented by the C structure
      typedef struct {
          size_t len;  /* Length of VL data (in base-type units) */
          void   *p;   /* Pointer to the VL data */
      } hvl_t;
• The base type can be any HDF5 type
      H5Tvlen_create(base_type)
Example

hvl_t data[LENGTH];

for(i=0; i<LENGTH; i++) {
    data[i].p = malloc((i+1)*sizeof(unsigned int));
    data[i].len = i+1;
}
tvl = H5Tvlen_create(H5T_NATIVE_UINT);

(Here data[0].p points to a 1-element buffer, and data[4].len is 5.)
Reading HDF5 Variable Length Array
hvl_t rdata[LENGTH];
/* Create the memory vlen type */
tvl = H5Tvlen_create(H5T_NATIVE_UINT);
ret = H5Dread(dataset, tvl, H5S_ALL, H5S_ALL,
              H5P_DEFAULT, rdata);
/* Reclaim the read VL data */
H5Dvlen_reclaim(tvl, H5S_ALL, H5P_DEFAULT, rdata);

• On read, the HDF5 library allocates memory for the data; the application only needs to allocate the array of hvl_t elements (pointers and lengths).
Storing Tables in an HDF5 File
Example

a_name (integer)   b_name (float)   c_name (double)
0                  0.               1.0000
1                  1.               0.5000
2                  4.               0.3333
3                  9.               0.2500
4                  16.              0.2000
5                  25.              0.1667
6                  36.              0.1429
7                  49.              0.1250
8                  64.              0.1111
9                  81.              0.1000

Multiple ways to store a table:
  • Dataset for each field
  • Dataset with compound datatype
  • If all fields have the same type:
    • 2-dim array
    • 1-dim array of an array datatype

Choose to achieve your goal:
  • How much overhead will each type of storage create?
  • Do I always read all fields?
  • Do I need to read some fields more often?
  • Do I want to use compression?
  • Do I want to access some records only?
HDF5 Compound Datatypes
• Compound types
  • Comparable to C structs
  • Members can be atomic or compound types
  • Members can be multidimensional
  • Can be written/read by a field or a set of fields
  • Not all data filters can be applied (e.g. shuffle, SZIP)
HDF5 Compound Datatypes
• Which APIs to use?
  • H5TB APIs
    • Create, read, get info, and merge tables
    • Add, delete, and append records
    • Insert and delete fields
    • Limited control over table properties (only GZIP compression at level 6, default allocation time for the table, extendible, etc.)
  • PyTables http://www.pytables.org
    • Based on H5TB
    • Python interface
    • Indexing capabilities
  • HDF5 APIs
    • H5Tcreate(H5T_COMPOUND) and H5Tinsert calls to create a compound datatype
    • H5Dcreate, etc.
    • See the H5Tget_member* functions for discovering properties of an HDF5 compound datatype
Creating and Writing Compound Dataset
• h5_compound.c example
typedef struct s1_t {
    int    a;
    float  b;
    double c;
} s1_t;

s1_t s1[LENGTH];
Creating and Writing Compound Dataset
/* Create datatype in memory. */
s1_tid = H5Tcreate(H5T_COMPOUND, sizeof(s1_t));
H5Tinsert(s1_tid, "a_name", HOFFSET(s1_t, a), H5T_NATIVE_INT);
H5Tinsert(s1_tid, "c_name", HOFFSET(s1_t, c), H5T_NATIVE_DOUBLE);
H5Tinsert(s1_tid, "b_name", HOFFSET(s1_t, b), H5T_NATIVE_FLOAT);

Note:
• Use the HOFFSET macro instead of calculating offsets by hand.
• The order of the H5Tinsert calls is not important when HOFFSET is used.
Creating and Writing Compound Dataset
/* Create dataset and write data */
dataset = H5Dcreate(file, DATASETNAME, s1_tid, space,
                    H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
status = H5Dwrite(dataset, s1_tid, H5S_ALL, H5S_ALL,
                  H5P_DEFAULT, s1);

Note:
• In this example the memory and file datatypes are the same.
• The type is not packed.
• Use H5Tpack to save space in the file:
      status  = H5Tpack(s1_tid);
      dataset = H5Dcreate(file, DATASETNAME, s1_tid, space,
                          H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
File Content with h5dump
HDF5 "SDScompound.h5" {
GROUP "/" {
   DATASET "ArrayOfStructures" {
      DATATYPE {
         H5T_STD_I32BE "a_name";
         H5T_IEEE_F32BE "b_name";
         H5T_IEEE_F64BE "c_name";
      }
      DATASPACE { SIMPLE ( 10 ) / ( 10 ) }
      DATA {
         {
            [ 0 ],
            [ 0 ],
            [ 1 ]
         },
         {
            [ 1 ],
         …
Reading Compound Dataset
/* Open the dataset, discover its datatype, and read the data. */
dataset = H5Dopen(file, DATASETNAME, H5P_DEFAULT);
s2_tid  = H5Dget_type(dataset);
mem_tid = H5Tget_native_type(s2_tid, H5T_DIR_ASCEND);
s1 = malloc(H5Tget_size(mem_tid)*number_of_elements);
status = H5Dread(dataset, mem_tid, H5S_ALL, H5S_ALL,
                 H5P_DEFAULT, s1);

Note:
• We could construct the memory type as we did in the writing example.
• A general application needs to discover the type in the file, find the corresponding memory type, allocate space, and then read.
Reading Compound Dataset by Fields
typedef struct s2_t {
    double c;
    int    a;
} s2_t;
s2_t s2[LENGTH];
…
s2_tid = H5Tcreate(H5T_COMPOUND, sizeof(s2_t));
H5Tinsert(s2_tid, "c_name", HOFFSET(s2_t, c), H5T_NATIVE_DOUBLE);
H5Tinsert(s2_tid, "a_name", HOFFSET(s2_t, a), H5T_NATIVE_INT);
…
status = H5Dread(dataset, s2_tid, H5S_ALL, H5S_ALL,
                 H5P_DEFAULT, s2);
New Way of Creating Datatypes
• Another way to create a compound datatype
#include "H5LTpublic.h"
…
s2_tid = H5LTtext_to_dtype(
             "H5T_COMPOUND "
             "{H5T_NATIVE_DOUBLE \"c_name\"; "
             " H5T_NATIVE_INT \"a_name\";}",
             H5LT_DDL);
Need Help with Datatypes?
• Check our support web pages
• http://www.hdfgroup.uiuc.edu/UserSupport/examples-by-api/api18-c.html
• http://www.hdfgroup.uiuc.edu/UserSupport/examples-by-api/api16-c.html
Part II: Working with Subsets
Collect data one way…
• Array of images (3D)

Display data another way…
• Stitched image (2D array)

Data is too big to read…
• Need to select and access the same elements of a dataset

Refer to a region…
HDF5 Library Features
• The HDF5 library provides capabilities to
  • Describe subsets of data and perform write/read operations on subsets
    • Hyperslab selections and partial I/O
  • Store descriptions of data subsets in a file
    • Object references
    • Region references
  • Use efficient storage mechanisms to achieve good performance while writing/reading subsets of data
    • Chunking, compression
Partial I/O in HDF5
How to Describe a Subset in HDF5?
• Before writing or reading a subset of data, one has to describe it to the HDF5 library.
• HDF5 APIs and documentation refer to a subset as a "selection" or "hyperslab selection".
• If a selection is specified, the HDF5 library performs I/O on that selection only, not on all elements of the dataset.
Types of Selections in HDF5
• Two types of selections
  • Hyperslab selection
    • Regular hyperslab
    • Simple hyperslab
    • Result of set operations on hyperslabs (union, difference, …)
  • Point selection
• Hyperslab selection is especially important for doing parallel I/O in HDF5 (see the Parallel HDF5 tutorial)
Regular Hyperslab

• Collection of regularly spaced, equal-sized blocks
Simple Hyperslab

• Contiguous subset or sub-array
Hyperslab Selection

• Result of a union operation on three simple hyperslabs
Hyperslab Description
• Start - starting location of a hyperslab (1,1)
• Stride - number of elements that separate each block (3,2)
• Count - number of blocks (2,6)
• Block - block size (2,1)
• Everything is "measured" in number of elements
Simple Hyperslab Description
• Two ways to describe a simple hyperslab
  • As several blocks
    • Stride - (2,1)
    • Count - (2,6)
    • Block - (2,1)
  • As one block
    • Stride - (1,1)
    • Count - (1,1)
    • Block - (4,6)
• No performance penalty for one way or the other
H5Sselect_hyperslab Function
space_id   Identifier of the dataspace
op         Selection operator: H5S_SELECT_SET or H5S_SELECT_OR
start      Array with the starting coordinates of the hyperslab
stride     Array specifying the number of positions to move along each dimension between selected blocks
count      Array specifying how many blocks to select from the dataspace, in each dimension
block      Array specifying the size of the element block (NULL indicates a block of a single element in each dimension)
Reading/Writing Selections
Programming model for reading from a dataset in a file:

1. Open the dataset.
2. Get the file dataspace handle of the dataset and specify the subset to read from.
   a. H5Dget_space returns the file dataspace handle; the file dataspace describes the array stored in the file (number of dimensions and their sizes).
   b. H5Sselect_hyperslab selects the elements of the array that participate in the I/O operation.
3. Allocate a data buffer of an appropriate shape and size.
Reading/Writing Selections
Programming model (continued):

4. Create a memory dataspace and specify the subset to write to.
   a. The memory dataspace describes the data buffer (its rank and dimension sizes).
   b. Use the H5Screate_simple function to create the memory dataspace.
   c. Use H5Sselect_hyperslab to select the elements of the data buffer that participate in the I/O operation.
5. Issue H5Dread or H5Dwrite to move the data between the file and the memory buffer.
6. Close the file dataspace and memory dataspace when done.
Example : Reading Two Rows
1 2 3 4 5 6
7 8 9 10 11 12
13 14 15 16 17 18
19 20 21 22 23 24
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
• Data in a file• 4x6 matrix
• Buffer in memory• 1-dim array of length 14
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 157
Example: Reading Two Rows

File selection: rows 2 and 3 of the matrix.

• start  = {1,0}
• stride = {1,1}
• count  = {2,6}
• block  = {1,1}

filespace = H5Dget_space(dataset);
H5Sselect_hyperslab(filespace, H5S_SELECT_SET,
                    start, NULL, count, NULL);
Example: Reading Two Rows

Memory selection: elements 1 through 12 of the buffer.

• start[1] = {1}
• count[1] = {12}
• dim[1]   = {14}

memspace = H5Screate_simple(1, dim, NULL);
H5Sselect_hyperslab(memspace, H5S_SELECT_SET,
                    start, NULL, count, NULL);
Example: Reading Two Rows

H5Dread(…, …, memspace, filespace, …, …);

Result in the memory buffer:

  -1  7  8  9 10 11 12 13 14 15 16 17 18 -1
Things to Remember
• The number of elements selected in the file and in the memory buffer must be the same
  • H5Sget_select_npoints returns the number of selected elements in a hyperslab selection
• HDF5 partial I/O is tuned to move data between selections that have the same dimensionality; avoid choosing subsets that have different ranks (as in the example above)
• Allocate a buffer of an appropriate size when reading data; use H5Tget_native_type and H5Tget_size to get the correct size of a data element in memory
HDF5 Region References and Selections

• Need to select and access the same elements of a dataset

Saving Selected Region in a File
Reference Datatype
• Reference to an HDF5 object
  • Pointer to a group or a dataset in a file
  • The predefined datatype H5T_STD_REF_OBJ describes object references
• Reference to a dataset region (i.e., to a selection)
  • Pointer to a dataspace selection
  • The predefined datatype H5T_STD_REF_DSETREG describes region references
Reference to Dataset Region
• File REF_REG.h5: the root group holds a dataset "Matrix" and a dataset of region references into it

  Matrix:
  1 1 2 3 3 4 5 5 6
  1 2 2 3 4 4 5 6 6
Reference to Dataset Region
Example
dsetr_id = H5Dcreate(file_id, "REGION_REFERENCES",
                     H5T_STD_REF_DSETREG, …);
…
H5Sselect_hyperslab(space_id, H5S_SELECT_SET, start, NULL, …);
H5Rcreate(&ref[0], file_id, "MATRIX", H5R_DATASET_REGION, space_id);
…
H5Dwrite(dsetr_id, H5T_STD_REF_DSETREG, H5S_ALL, H5S_ALL,
         H5P_DEFAULT, ref);
Reference to Dataset Region

HDF5 "REF_REG.h5" {
GROUP "/" {
   DATASET "MATRIX" {
   ……
   }
   DATASET "REGION_REFERENCES" {
      DATATYPE  H5T_REFERENCE
      DATASPACE SIMPLE { ( 2 ) / ( 2 ) }
      DATA {
         (0): DATASET /MATRIX {(0,3)-(1,5)},
         (1): DATASET /MATRIX {(0,0), (1,6), (0,8)}
      }
   }
}
}
Chunking in HDF5
HDF5 Chunking
• Dataset data is divided into equally sized blocks (chunks).
• Each chunk is stored separately as a contiguous block in the HDF5 file.

(Figure: in application memory, the metadata cache holds the dataset header — datatype, dataspace, attributes — and a chunk index; in the file, the header and chunk index locate chunks A, B, C, and D, which may be scattered in any order.)
HDF5 Chunking
• Chunking is needed for
  • Enabling compression and other filters
  • Extendible datasets
HDF5 Chunking
• If used appropriately, chunking improves partial I/O for big datasets
• Only two chunks are involved in I/O
HDF5 Chunking
• A chunk has the same rank as the dataset
• Chunk dimensions do not need to be factors of the dataset's dimensions
Creating Chunked Dataset
1. Create a dataset creation property list.
2. Set the property list to use chunked storage layout.
3. Create the dataset with the above property list.

dcpl_id = H5Pcreate(H5P_DATASET_CREATE);
rank = 2;
ch_dims[0] = 100;
ch_dims[1] = 100;
H5Pset_chunk(dcpl_id, rank, ch_dims);
dset_id = H5Dcreate(…, dcpl_id);
H5Pclose(dcpl_id);
Writing or Reading Chunked Dataset
1. The chunking mechanism is transparent to the application.
2. Use the same set of operations as for a contiguous dataset, for example:
     H5Dopen(…);
     H5Sselect_hyperslab(…);
     H5Dread(…);
3. Selections do not need to coincide precisely with chunk boundaries.
HDF5 Filters
• HDF5 filters modify data during I/O operations
• Available filters:
  1. Checksum (H5Pset_fletcher32)
  2. Shuffle filter (H5Pset_shuffle)
  3. Data transformation (in 1.8.*)
  4. Compression
     • Scale + offset (in 1.8.*)
     • N-bit (in 1.8.*)
     • GZIP (deflate), SZIP (H5Pset_deflate, H5Pset_szip)
     • User-defined filters (e.g. BZIP2)
• An example of a user-defined compression filter can be found at http://www.hdfgroup.uiuc.edu/papers/papers/bzip2/
Creating Compressed Dataset
1. Create a dataset creation property list.
2. Set the property list to use chunked storage layout.
3. Set the property list to use filters.
4. Create the dataset with the above property list.

crp_id = H5Pcreate(H5P_DATASET_CREATE);
rank = 2;
ch_dims[0] = 100;
ch_dims[1] = 100;
H5Pset_chunk(crp_id, rank, ch_dims);
H5Pset_deflate(crp_id, 9);
dset_id = H5Dcreate(…, crp_id);
H5Pclose(crp_id);
Writing Compressed Dataset
(Figure: chunks of a chunked dataset in application memory pass through a per-dataset chunk cache and the filter pipeline on their way to the file.)

• The default chunk cache size is 1 MB.
• Filters, including compression, are applied when a chunk is evicted from the cache.
• Chunks in the file may have different sizes.
Chunking Basics to Remember
• Chunking creates storage overhead in the file.
• Performance is affected by
  • Chunking and compression parameters
  • Chunk cache size (H5Pset_cache call)
• Some hints for getting better performance
  • Use a chunk size not smaller than the file system block size (e.g. 4 KB).
  • Use a compression method appropriate for your data.
  • Avoid using selections that do not coincide with chunk boundaries.
Example
Creates a compressed 1000x20 integer dataset in a file:

% h5dump -p -H zip.h5
HDF5 "zip.h5" {
GROUP "/" {
   GROUP "Data" {
      DATASET "Compressed_Data" {
         DATATYPE H5T_STD_I32BE
         DATASPACE SIMPLE { ( 1000, 20 ) ………
         STORAGE_LAYOUT {
            CHUNKED ( 20, 20 )
            SIZE 5316
         }
Example (continued)
         FILTERS {
            COMPRESSION DEFLATE { LEVEL 6 }
         }
         FILLVALUE {
            FILL_TIME H5D_FILL_TIME_IFSET
            VALUE 0
         }
         ALLOCATION_TIME { H5D_ALLOC_TIME_INCR }
      }
   }
}
}
Example (bigger chunk)
Creates a compressed 1000x20 integer dataset in a file; a better compression ratio is achieved with larger chunks.

% h5dump -p -H zip.h5
HDF5 "zip.h5" {
GROUP "/" {
   GROUP "Data" {
      DATASET "Compressed_Data" {
         DATATYPE H5T_STD_I32BE
         DATASPACE SIMPLE { ( 1000, 20 ) ………
         STORAGE_LAYOUT {
            CHUNKED ( 200, 20 )
            SIZE 2936
         }
Part III: Performance Issues (How to Do It Right)
Performance of Serial I/O Operations
• The next slides show the performance effects of using different access patterns and storage layouts.
• We use three test cases, each consisting of writing selections to an array of characters.
• Data is stored in row-major order.
• Tests were executed on a THG Linux x86_64 box using h5perf_serial and HDF5 version 1.8.0.
Serial Benchmarking Tool
• The benchmarking tool h5perf_serial was publicly released with HDF5 1.8.1
• Features include:
  • Support for POSIX and HDF5 I/O calls
  • Support for datasets and buffers with multiple dimensions
  • Entire dataset access using a single I/O operation or several
  • Selection of contiguous or chunked storage for HDF5 operations
Contiguous Storage (Case 1)
• Rectangular dataset of size 48K x 48K, with write selections of 512 x 48K.
• HDF5 storage layout is contiguous.
• Good I/O pattern for POSIX and HDF5, because each selection is contiguous.
• POSIX: 5.19 MB/s
• HDF5: 5.36 MB/s
Contiguous Storage (Case 2)
• Rectangular dataset of 48K x 48K, with write selections of 48K x 512.
• HDF5 storage layout is contiguous.
• Bad I/O pattern for POSIX and HDF5, because each selection is noncontiguous.
• POSIX: 1.24 MB/s
• HDF5: 0.05 MB/s
Chunked Storage
• Rectangular dataset of 48K x 48K, with write selections of 48K x 512.
• HDF5 storage layout is chunked; chunk and selection sizes are equal.
• Bad I/O case for POSIX, because selections are noncontiguous.
• Good I/O case for HDF5, since selections are contiguous thanks to the chunked layout.
• POSIX: 1.51 MB/s
• HDF5: 5.58 MB/s
Conclusions
• Access patterns with small I/O operations incur high latency and overhead costs many times over.
• Chunked storage may improve I/O performance by making data selections contiguous in the file.
Writing Chunked Dataset
• 1000x100x100 dataset
  • 4-byte integers
  • Random values 0-99
• 50x100x100 chunks (20 total)
  • Chunk size: 2 MB
• Write the entire dataset using 1x100x100 slices
  • Slices are written sequentially
Test Setup
• 20 chunks
• 1000 slices
• Chunk size is 2 MB
Test Setup (continued)
• Tests performed with 1 MB and 5 MB chunk cache sizes
• Cache size set with the H5Pset_cache function:

    H5Pget_cache(fapl, NULL, &rdcc_nelmts,
                 &rdcc_nbytes, &rdcc_w0);
    H5Pset_cache(fapl, 0, rdcc_nelmts,
                 5*1024*1024, rdcc_w0);

• Tests performed with no compression and with gzip (deflate) compression
Effect of Chunk Cache Size on Write
• No compression

Cache size      I/O operations  Total data written           File size
1 MB (default)  1002            75.54 MB                     38.15 MB
5 MB            22              38.16 MB                     38.15 MB

• Gzip compression

Cache size      I/O operations  Total data written           File size
1 MB (default)  1982            335.42 MB (322.34 MB read)   13.08 MB
5 MB            22              13.08 MB                     13.08 MB
Effect of Chunk Cache Size on Write
• With the 1 MB cache size, a chunk will not fit into the cache
  • All writes to the dataset must be immediately written to disk
  • With compression, the entire chunk must be read and rewritten every time a part of the chunk is written to
    • Data must also be decompressed and recompressed each time
    • Non-sequential writes could result in a larger file
  • Without compression, the entire chunk must be written when it is first written to the file
    • If the selection were not contiguous on disk, it could require as much as one I/O operation for each element
Effect of Chunk Cache Size on Write
• With the 5 MB cache size, a chunk is written only after it is full
  • Drastically reduces the number of I/O operations
  • Reduces the amount of data that must be written (and read)
  • Reduces processing time, especially with the compression filter
Conclusion
• It is important to make sure that a chunk will fit into the raw data chunk cache
• If you will be writing to multiple chunks at once, increase the cache size even more
  • Try to design chunk dimensions to minimize the number of chunks you will be writing to at once
Reading Chunked Dataset
• Read the same dataset, again by slices, but with slices that cross through all the chunks
• Two orientations for the read plane
  • Plane includes the fastest changing dimension
  • Plane does not include the fastest changing dimension
• Measure total read operations and total size read
• Chunk sizes of 50x100x100 and 10x100x100
• 1 MB cache
Test Setup

• Chunks
• Read slices: vertical and horizontal
Results
• Read slice includes the fastest changing dimension

Chunk size  Compression  I/O operations  Total data read
50          Yes          2010            1307 MB
10          Yes          10012           1308 MB
50          No           100010          38 MB
10          No           10012           3814 MB
Results (continued)
• Read slice does not include the fastest changing dimension

Chunk size  Compression  I/O operations  Total data read
50          Yes          2010            1307 MB
10          Yes          10012           1308 MB
50          No           10000010        38 MB
10          No           10012           3814 MB
Effect of Cache Size on Read
• When compression is enabled, the library must always read each entire chunk once for each call to H5Dread.
• When compression is disabled, the library’s behavior depends on the cache size relative to the chunk size.• If the chunk fits in cache, the library reads each
entire chunk once for each call to H5Dread• If the chunk does not fit in cache, the library reads
only the data that is selected• More read operations, especially if the read plane
does not include the fastest changing dimension• Less total data read
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 199
Conclusion
• In this case cache size does not matter when reading if compression is enabled.
• Without compression, a larger cache may not be beneficial, unless the cache is large enough to hold all of the chunks
• The optimum cache size depends on the exact shape of the data, as well as the hardware
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 200
Hints for Chunk Settings
• Chunk dimensions should align as closely as possible with the hyperslab dimensions used for read/write
• The chunk cache size (rdcc_nbytes) should be large enough to hold all the chunks in the selection
  • If this is not possible, it may be best to disable chunk caching altogether (set rdcc_nbytes to 0)
• rdcc_nelmts should be a prime number at least 10 to 100 times the number of chunks that can fit into rdcc_nbytes
• rdcc_w0 should be set to 1 if chunks that have been fully read/written will never be read/written again
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 201
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 202
Part IV: Performance Benefits of HDF5 version 1.8
What Did We Do in HDF5 1.8?
• Extended the file format specification
• Reviewed group implementations
• Introduced a new link object
• Revamped the metadata cache implementation
• Improved handling of datasets and datatypes
• Introduced shared object header messages
• Extended error handling
• Enhanced backward/forward API and file format compatibility
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 203
What Did We Do in HDF5 1.8?
And much more good stuff to make HDF5
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 204
• Better and Faster
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 205
HDF5 File Format Extension
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 206
HDF5 File Format Extension
• Why:
  • Address deficiencies of the original file format
  • Reduce space overhead in an HDF5 file
  • Enable new features
• What:
  • A new routine that instructs the HDF5 library to create all objects using the latest version of the HDF5 file format (compare with the earliest version in which the object became available, for example, the array datatype)
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 207
HDF5 File Format Extension
Example
/* Use the latest version of the file format for each object created in a file */
fapl_id = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_libver_bounds(fapl_id, H5F_LIBVER_LATEST, H5F_LIBVER_LATEST);
fid = H5Fcreate(…, …, …, fapl_id);
or
fid = H5Fopen(…, …, fapl_id);
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 208
Group Revisions
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 209
Better Large Group Storage
• Why:
  • Faster, more scalable storage and access for large groups
• What:
  • A new format and method for storing groups with many links
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 210
Informal Benchmark
• Create a file and a group in the file
• Create up to 10^6 groups, with one dataset in each group
• Compare file sizes and performance of HDF5 1.8.1 using the latest group format against HDF5 1.8.1 (default, old format) and HDF5 1.6.7
• Note: default 1.8.1 and 1.6.7 became very slow after 700,000 groups
Time to Open and Read a Dataset
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 211
[Chart: time (milliseconds, 0.1 to 1000, log scale) to open and read a dataset vs. number of groups (10,000 to 1,000,000), for HDF5 1.6, 1.8 (old groups), and 1.8 (new groups)]
File Size
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 212
[Chart: file size (kilobytes, 0 to 1,000,000) vs. number of groups (0 to 800,000), for 1.8 (old groups) and 1.8 (new groups)]
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 213
Questions?
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 214
Data Storage and I/O in HDF5
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 215
Software stack
• Life cycle: what happens to data when it is transferred from the application buffer to an HDF5 file, and from an HDF5 file to the application buffer?
[Diagram: application (data buffer) → H5Dwrite → object API → library internals → virtual file I/O → unbuffered I/O → data in a file or other "storage"]
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 216
Goals
• Understanding what happens to data inside the HDF5 library helps in writing efficient applications
• Goals of this talk:
  • Describe some basic operations and data structures, and explain how they affect performance and storage sizes
  • Give some "recipes" for improving performance
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 217
Topics
• Dataset metadata and array data storage layouts
• Types of dataset storage layouts
• Factors affecting I/O performance
• I/O with compact datasets
• I/O with contiguous datasets
• I/O with chunked datasets
• Variable length data and I/O
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 218
HDF5 dataset metadata and array data storage layouts
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 219
HDF5 Dataset
• Data array
  • Ordered collection of identically typed data items distinguished by their indices
• Metadata
  • Dataspace: rank and dimensions of the dataset array
  • Datatype: information on how to interpret the data
  • Storage properties: how the array is organized on disk
  • Attributes: user-defined metadata (optional)
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 220
HDF5 Dataset
[Diagram: a dataset consists of data and metadata. Metadata: dataspace (rank = 3, dimensions Dim_1 = 4, Dim_2 = 5, Dim_3 = 7), datatype (IEEE 32-bit float), storage info (chunked, compressed), attributes (Time = 32.4, Pressure = 987, Temp = 56)]
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 221
Metadata cache and dataset data
• Dataset data is typically kept in application memory
• The dataset header is kept in a separate space, the metadata cache
[Diagram: dataset data in application memory; the dataset header (datatype, dataspace, attributes, …) in the metadata cache; both are written to the file]
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 222
Metadata and metadata cache
• HDF5 metadata
  • Information about HDF5 objects used by the library
  • Examples: object headers, B-tree nodes for groups, B-tree nodes for chunks, heaps, superblock, etc.
  • Usually small compared to raw data sizes (KB vs. MB-GB)
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 223
Metadata and metadata cache
• Metadata cache
  • Space allocated to hold pieces of HDF5 metadata
  • Allocated by the HDF5 library in the application's memory space
  • Cache behavior affects overall performance
  • The metadata cache implementation prior to HDF5 1.6.5 could cause performance degradation for some applications
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 224
Types of data storage layouts
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 225
HDF5 dataset storage layouts
• Contiguous
• Chunked
• Compact
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 226
Contiguous storage layout
• Metadata header stored separately from the dataset data
• Data stored as one contiguous block in the HDF5 file
[Diagram: the dataset header (datatype, dataspace, attributes, …) in the metadata cache; the dataset data as one contiguous block in the file]
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 227
Chunked storage
• Chunking: a storage layout in which a dataset is partitioned into fixed-size multi-dimensional tiles (chunks)
• Used for extendible datasets and for datasets with filters applied (checksum, compression)
• The HDF5 library treats each chunk as an atomic object
• Greatly affects performance and file sizes
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 228
Chunked storage layout
• Dataset data divided into equal-sized blocks (chunks)
• Each chunk stored separately as a contiguous block in the HDF5 file
[Diagram: the dataset header and chunk index in the metadata cache; chunks A, B, C, D stored as separate contiguous blocks in the file and located via the chunk index]
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 229
Compact storage layout
• Dataset data and metadata stored together in the object header
[Diagram: the dataset header (datatype, dataspace, attributes, dataset data) goes from the metadata cache to the file]
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 230
Factors affecting I/O performance
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 231
What goes on inside the library?
• Operations on data inside the library
  • Copying to/from internal buffers
  • Datatype conversion
  • Scattering/gathering
  • Data transformation (filters, compression)
• Data structures used
  • B-trees (groups, dataset chunks)
  • Hash tables
  • Local and global heaps (variable length data: link names, strings, etc.)
• Other concepts
  • HDF5 metadata, metadata cache
  • Chunking, chunk cache
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 232
Operations on data inside the library
• Copying to/from internal buffers
• Datatype conversion, such as
  • float to integer
  • little-endian to big-endian
  • 64-bit integer to 16-bit integer
• Scattering/gathering
  • Data is scattered/gathered from/to application buffers into internal buffers for datatype conversion and partial I/O
• Data transformation (filters, compression)
  • Checksum on raw data and metadata (in 1.8.0)
  • Algebraic transform
  • GZIP and SZIP compression
  • User-defined filters
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 233
I/O performance
• I/O performance depends on
  • Storage layouts
  • Dataset storage properties
  • Chunking strategy
  • Metadata cache performance
  • Datatype conversion performance
  • Other filters, such as compression
  • Access patterns
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 234
I/O with different storage layouts
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 235
Writing a compact dataset
[Diagram: the dataset header (datatype, dataspace, attributes, …) and the dataset data are assembled together in the metadata cache and written to the file]
• One write stores both the header and the dataset data
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 236
Writing contiguous dataset – no conversion
[Diagram: data moves directly from the application buffer to the dataset data block in the file; the dataset header is written from the metadata cache]
• No subsetting in memory or in the file is performed
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 237
Writing a contiguous dataset with datatype conversion
[Diagram: data passes from the application buffer through a 1 MB conversion buffer before being written to the dataset data block in the file; the dataset header is written from the metadata cache]
• No subsetting in memory or in the file is performed
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 238
Partial I/O with contiguous datasets
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 239
Writing whole dataset – contiguous rows
[Figure: an M x N application array in memory written to the file as one contiguous block]
• Data is contiguous in the file
• One I/O operation
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 240
Sub-setting of contiguous dataset: series of adjacent rows
[Figure: a subset of adjacent rows of an M x N array; the entire dataset is contiguous in the file, and so is the subset]
• Subset is contiguous in the file
• One I/O operation
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 241
Sub-setting of contiguous dataset: adjacent, partial rows
[Figure: N elements selected from each of M adjacent rows]
• Data is scattered in the file in M contiguous blocks
• Several small I/O operations
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 242
Sub-setting of contiguous dataset: extreme case, writing a column
[Figure: one element selected from each row of an M x N array]
• Subset data is scattered in the file in M different locations
• Several small I/O operations
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 243
Sub-setting of contiguous dataset: data sieve buffer
[Figure: scattered one-element selections are gathered by memcpy into a 64 KB sieve buffer in memory before being written; the data is scattered in the file]
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 244
Performance tuning for contiguous dataset
• Datatype conversion
  • Avoid it for better performance
  • Use the H5Pset_buffer function to customize the conversion buffer size
• Partial I/O
  • Write/read in big contiguous blocks
  • Use H5Pset_sieve_buf_size to improve performance for complex subsetting
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 245
I/O with Chunking
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 246
Chunked storage layout
• Raw data divided into equal-sized blocks (chunks)
• Each chunk stored separately as a contiguous block in the file
[Diagram: the dataset header and chunk index in the metadata cache; chunks A, B, C, D stored as separate contiguous blocks in the file and located via the chunk index]
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 247
Information about chunking
• The HDF5 library treats each chunk as an atomic object
  • Compression and other filters are applied to each chunk
  • Datatype conversion is performed on each chunk
• Chunk size greatly affects performance
  • Chunk overhead adds to the file size
  • Chunk processing involves many steps
• Chunk cache
  • Caches chunks for better performance
  • The size of the chunk cache is set per file (default size 1 MB)
  • Each chunked dataset has its own chunk cache
  • A chunk may be too big to fit into the cache
  • Memory use may grow if the application keeps opening datasets
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 248
Chunk cache
[Diagram: application memory holds the metadata cache (dataset headers Dataset_1 … Dataset_N, chunking B-tree nodes) and a chunk cache per dataset; the default chunk cache size is 1 MB]
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 249
Writing chunked dataset
• Filters, including compression, are applied when a chunk is evicted from the cache
[Diagram: chunks A, B, C pass from the chunk cache through the filter pipeline to the chunked dataset in the file]
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 250
Partial I/O with Chunking
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 251
Partial I/O for chunked dataset
Example: write the green subset of the dataset, converting the data.
The dataset is stored as six chunks in the file. The subset spans four chunks, numbered 1-4 in the figure, so four chunks must be written to the file. But first, the four chunks must be read from the file, to preserve the parts of each chunk that are not to be overwritten.
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 252
Partial I/O for chunked dataset
• For each of the four chunks, on writing:
  • Read the chunk from the file into the chunk cache, unless it is already there
  • Determine which part of the chunk will be replaced by the selection
  • Move those elements from the application buffer to the conversion buffer
  • Perform the conversion
  • Replace that part of the chunk in the cache with the corresponding elements from the conversion buffer
• Apply filters (compression) when the chunk is flushed from the chunk cache
• For each element, 3 (or more) memcpy operations are performed
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 253
Partial I/O for chunked dataset
[Diagram: elements participating in I/O are gathered from the application buffer, pass through the conversion buffer, and land in the corresponding chunk (chunk 3) in the chunk cache]
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 254
Partial I/O for chunked dataset
[Diagram: filters are applied to the chunk (chunk 3) in the chunk cache, and the chunk is written to the file]
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 255
Variable length data and I/O
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 256
Examples of variable length data
• Each element is a string of variable length:
  A[0] "the first string we want to write"
  …
  A[N-1] "the N-th string we want to write"
• Each element is a record of variable length:
  A[0] (1,1,0,0,0,5,6,7,8,9) [length = 10]
  A[1] (0,0,110,2005) [length = 4]
  …
  A[N] (1,2,3,4,5,6,7,8,9,10,11,12,…,M) [length = M]
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 257
Variable length data in HDF5
• Variable length data is described in an HDF5 application by:
  typedef struct {
      size_t length;
      void   *p;
  } hvl_t;
• Base type can be any HDF5 type: H5Tvlen_create(base_type)
• ~20 bytes of overhead for each element
• Data cannot be compressed
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 258
Variable length data storage in HDF5
[Diagram: in the file, the dataset header and a dataset whose elements are pointers into a global heap; the actual variable length data lives in the global heap]
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 259
Variable length datasets and I/O
• When writing variable length data, elements in the application buffer always go through the conversion buffer and are copied to global heaps in the metadata cache before ending up in the file
[Diagram: application buffer → conversion buffer → raw VL data in a global heap in the metadata cache]
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 260
There may be more than one global heap
• On a write request, VL data goes through conversion and is written to a global heap; elements of the same dataset may be written to different heaps
[Diagram: raw VL data from the application buffer passes through the conversion buffer into two different global heaps in the metadata cache]
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 261
Variable length datasets and I/O
[Diagram: raw VL data passes from the application buffer through the conversion buffer into global heaps in the metadata cache, and from there to the file]
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 262
VL chunked dataset in a file
[Diagram: the file contains the dataset header, the chunk B-tree, the dataset chunks, and the heaps with the VL data]
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 263
Writing chunked VL datasets
[Diagram: hvl_t pointers in application buffers reference the VL data; the VL data passes through the conversion buffer into a global heap (raw data), while chunks 1-4 pass through the chunk cache and the filter pipeline to the file; the metadata cache holds dataset headers and B-tree nodes]
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 264
Hints for variable length data I/O
• Avoid closing/opening a file while writing VL datasets
  • Global heap information is lost
  • Global heaps may have unused space
• Avoid alternately writing to different VL datasets
  • Data from different datasets will go into the same heap
• If the maximum length of a record is known, consider using fixed-length records and compression
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 265
Questions?
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 266
Parallel HDF5 Tutorial
Albert Cheng
The HDF Group
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 267
Parallel HDF5: Introductory Tutorial
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 268
Outline
• Overview of Parallel HDF5 design
• Setting up the parallel environment
• Programming model for:
  • Creating and accessing a file
  • Creating and accessing a dataset
  • Writing and reading hyperslabs
• Parallel tutorial available at http://www.hdfgroup.org/HDF5/Tutor/
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 269
Overview of Parallel HDF5 Design
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 270
PHDF5 Requirements
• Support the MPI programming model
• PHDF5 files compatible with serial HDF5 files
  • Shareable between different serial and parallel platforms
• Single file image presented to all processes
  • A one-file-per-process design is undesirable
    • Expensive post processing
    • Not usable by a different number of processes
• Standard parallel I/O interface
  • Must be portable to different platforms
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 271
PHDF5 Implementation Layers
[Diagram: application → I/O library (HDF5) → parallel I/O library (MPI-I/O) → parallel file system (GPFS), running on a parallel computing system (Linux cluster) with compute nodes, a switch network/I/O servers, and the disk architecture & layout of data on disk]
PHDF5 is built on top of the standard MPI-IO API
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 272
Parallel Environment Requirements
• MPI with MPI-IO, e.g.:
  • MPICH2 ROMIO
  • A vendor's MPI-IO
• A POSIX compliant parallel file system, e.g.:
  • GPFS
  • Lustre
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 273
MPI-IO vs. HDF5
• MPI-IO is an input/output API
  • It treats the data file as a "linear byte stream", and each MPI application must provide its own file view and data representations to interpret those bytes
• All data stored is machine dependent except for the "external32" representation
  • External32 is defined in big-endianness
    • Little-endian machines have to do data conversion in both read and write operations
    • 64-bit sized data types may lose information
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 274
MPI-IO vs. HDF5 Cont.
• HDF5 is data management software
  • It stores data and metadata according to the HDF5 data format definition
  • An HDF5 file is self-describing
  • Each machine can store data in its own native representation for efficient I/O without loss of data precision
  • Any necessary data representation conversion is done automatically by the HDF5 library
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 275
How to Compile PHDF5 Applications
• h5pcc – HDF5 C compiler command
  • Similar to mpicc
• h5pfc – HDF5 F90 compiler command
  • Similar to mpif90
• To compile:
  % h5pcc h5prog.c
  % h5pfc h5prog.f90
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 276
h5pcc/h5pfc -show option
• -show displays the compiler commands and options without executing them (a dry run)

% h5pcc -show Sample_mpio.c
mpicc -I/home/packages/phdf5/include \
  -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE \
  -D_FILE_OFFSET_BITS=64 -D_POSIX_SOURCE \
  -D_BSD_SOURCE -std=c99 -c Sample_mpio.c
mpicc -std=c99 Sample_mpio.o \
  -L/home/packages/phdf5/lib \
  /home/packages/phdf5/lib/libhdf5_hl.a \
  /home/packages/phdf5/lib/libhdf5.a -lz -lm -Wl,-rpath \
  -Wl,/home/packages/phdf5/lib
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 277
Collective vs. Independent Calls
• MPI definition of a collective call
  • All processes of the communicator must participate in the right order. E.g.:
    Process1: call A(); call B();    Process2: call A(); call B();    **right**
    Process1: call A(); call B();    Process2: call B(); call A();    **wrong**
• Independent means not collective
• Collective is not necessarily synchronous
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 278
Programming Restrictions
• Most PHDF5 APIs are collective
• PHDF5 opens a parallel file with a communicator
  • Returns a file handle
  • Future access to the file is via the file handle
  • All processes must participate in collective PHDF5 APIs
  • Different files can be opened via different communicators
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 279
Examples of PHDF5 API
• Examples of PHDF5 collective APIs
  • File operations: H5Fcreate, H5Fopen, H5Fclose
  • Object creation: H5Dcreate, H5Dopen, H5Dclose
  • Object structure: H5Dextend (increase dimension sizes)
• Array data transfer can be collective or independent
  • Dataset operations: H5Dwrite, H5Dread
  • Collectiveness is indicated by function parameters, not by function names as in the MPI API
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 280
What Does PHDF5 Support ?
• After a file is opened by the processes of a communicator:
  • All parts of the file are accessible by all processes
  • All objects in the file are accessible by all processes
  • Multiple processes may write to the same data array
  • Each process may write to an individual data array
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 281
PHDF5 API Languages
• C and F90 language interfaces
• Platforms supported:
  • Most platforms with MPI-IO support, e.g. IBM SP, Linux clusters, SGI Altix, Cray XT3, …
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 282
Programming model for creating and accessing a file
• HDF5 uses an access template object (property list) to control the file access mechanism
• General model to access an HDF5 file in parallel:
  • Set up the MPI-IO access template (access property list)
  • Open the file
  • Access data
  • Close the file
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 283
Setup MPI-IO access template
Each process of the MPI communicator creates an access template and sets it up with MPI parallel access information.

C:
  herr_t H5Pset_fapl_mpio(hid_t plist_id, MPI_Comm comm, MPI_Info info);

F90:
  h5pset_fapl_mpio_f(plist_id, comm, info)
    integer(hid_t) :: plist_id
    integer        :: comm, info
plist_id is a file access property list identifier
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 284
C Example Parallel File Create
  23   comm = MPI_COMM_WORLD;
  24   info = MPI_INFO_NULL;
  26   /*
  27    * Initialize MPI
  28    */
  29   MPI_Init(&argc, &argv);
  30   /*
  34    * Set up file access property list for MPI-IO access
  35    */
->36   plist_id = H5Pcreate(H5P_FILE_ACCESS);
->37   H5Pset_fapl_mpio(plist_id, comm, info);
  38
->42   file_id = H5Fcreate(H5FILE_NAME, H5F_ACC_TRUNC, H5P_DEFAULT, plist_id);
  49   /*
  50    * Close the file.
  51    */
  52   H5Fclose(file_id);
  54   MPI_Finalize();
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 285
F90 Example Parallel File Create
  23   comm = MPI_COMM_WORLD
  24   info = MPI_INFO_NULL
  26   CALL MPI_INIT(mpierror)
  29   !
  30   ! Initialize FORTRAN predefined datatypes
  32   CALL h5open_f(error)
  34   !
  35   ! Setup file access property list for MPI-IO access.
->37   CALL h5pcreate_f(H5P_FILE_ACCESS_F, plist_id, error)
->38   CALL h5pset_fapl_mpio_f(plist_id, comm, info, error)
  40   !
  41   ! Create the file collectively.
->43   CALL h5fcreate_f(filename, H5F_ACC_TRUNC_F, file_id, error, access_prp = plist_id)
  45   !
  46   ! Close the file.
  49   CALL h5fclose_f(file_id, error)
  51   !
  52   ! Close FORTRAN interface
  54   CALL h5close_f(error)
  56   CALL MPI_FINALIZE(mpierror)
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 286
Creating and Opening Dataset
• All processes of the communicator open/close a dataset with a collective call
  C:   H5Dcreate or H5Dopen; H5Dclose
  F90: h5dcreate_f or h5dopen_f; h5dclose_f
• All processes of the communicator must extend an unlimited-dimension dataset before writing to it
  C:   H5Dextend
  F90: h5dextend_f
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 287
C Example: Create Dataset
  56   file_id = H5Fcreate(…);
  57   /*
  58    * Create the dataspace for the dataset.
  59    */
  60   dimsf[0] = NX;
  61   dimsf[1] = NY;
  62   filespace = H5Screate_simple(RANK, dimsf, NULL);
  63
  64   /*
  65    * Create the dataset with default properties, collectively.
  66    */
->67   dset_id = H5Dcreate(file_id, "dataset1", H5T_NATIVE_INT,
  68                       filespace, H5P_DEFAULT);
  70   H5Dclose(dset_id);
  71   /*
  72    * Close the file.
  73    */
  74   H5Fclose(file_id);
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 288
F90 Example: Create Dataset
  43   CALL h5fcreate_f(filename, H5F_ACC_TRUNC_F, file_id, error, access_prp = plist_id)
  73   CALL h5screate_simple_f(rank, dimsf, filespace, error)
  76   !
  77   ! Create the dataset with default properties.
  78   !
->79   CALL h5dcreate_f(file_id, "dataset1", H5T_NATIVE_INTEGER, filespace, dset_id, error)
  90   !
  91   ! Close the dataset.
  92   CALL h5dclose_f(dset_id, error)
  93   !
  94   ! Close the file.
  95   CALL h5fclose_f(file_id, error)
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 289
Accessing a Dataset
• All processes that have opened the dataset may do collective I/O
• Each process may do an independent and arbitrary number of data I/O access calls
  • C: H5Dwrite and H5Dread
  • F90: h5dwrite_f and h5dread_f
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 290
Programming model for dataset access
• Create and set a dataset transfer property list
  • C: H5Pset_dxpl_mpio
    • H5FD_MPIO_COLLECTIVE
    • H5FD_MPIO_INDEPENDENT (default)
  • F90: h5pset_dxpl_mpio_f
    • H5FD_MPIO_COLLECTIVE_F
    • H5FD_MPIO_INDEPENDENT_F (default)
• Access the dataset with the defined transfer property
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 291
C Example: Collective write
  95   /*
  96    * Create property list for collective dataset write.
  97    */
  98   plist_id = H5Pcreate(H5P_DATASET_XFER);
->99   H5Pset_dxpl_mpio(plist_id, H5FD_MPIO_COLLECTIVE);
 100
 101   status = H5Dwrite(dset_id, H5T_NATIVE_INT,
 102                     memspace, filespace, plist_id, data);
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 292
F90 Example: Collective write
  88   ! Create property list for collective dataset write
  89   !
  90   CALL h5pcreate_f(H5P_DATASET_XFER_F, plist_id, error)
->91   CALL h5pset_dxpl_mpio_f(plist_id, &
                              H5FD_MPIO_COLLECTIVE_F, error)
  92
  93   !
  94   ! Write the dataset collectively.
  95   !
  96   CALL h5dwrite_f(dset_id, H5T_NATIVE_INTEGER, data, &
                      error, &
                      file_space_id = filespace, &
                      mem_space_id = memspace, &
                      xfer_prp = plist_id)
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 293
Writing and Reading Hyperslabs
• Distributed memory model: data is split among processes
• PHDF5 uses the HDF5 hyperslab model
  • Each process defines its memory and file hyperslabs
  • Each process executes a partial write/read call
    • Collective calls
    • Independent calls
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 294
Set up the Hyperslab for Read/Write
H5Sselect_hyperslab(filespace, H5S_SELECT_SET,
                    offset, stride, count, block)
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 295
Example 1: Writing dataset by rows
[Figure: the file is divided into four blocks of rows, written by P0, P1, P2, and P3]
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 296
Writing by rows: Output of h5dump
HDF5 "SDS_row.h5" {
GROUP "/" {
   DATASET "IntArray" {
      DATATYPE H5T_STD_I32BE
      DATASPACE SIMPLE { ( 8, 5 ) / ( 8, 5 ) }
      DATA {
         10, 10, 10, 10, 10,
         10, 10, 10, 10, 10,
         11, 11, 11, 11, 11,
         11, 11, 11, 11, 11,
         12, 12, 12, 12, 12,
         12, 12, 12, 12, 12,
         13, 13, 13, 13, 13,
         13, 13, 13, 13, 13
      }
   }
}
}
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 297
Example 1: Writing dataset by rows

count[0]  = dimsf[0]/mpi_size;
count[1]  = dimsf[1];
offset[0] = mpi_rank * count[0]; /* = 2 */
offset[1] = 0;

[Figure: process 1's block in memory maps to rows offset[0] through offset[0] + count[0] - 1 of the file]
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 298
Example 1: Writing dataset by rows
  71   /*
  72    * Each process defines a dataset in memory and writes it to the hyperslab
  73    * in the file.
  74    */
  75   count[0] = dimsf[0]/mpi_size;
  76   count[1] = dimsf[1];
  77   offset[0] = mpi_rank * count[0];
  78   offset[1] = 0;
  79   memspace = H5Screate_simple(RANK, count, NULL);
  80
  81   /*
  82    * Select hyperslab in the file.
  83    */
  84   filespace = H5Dget_space(dset_id);
  85   H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL, count, NULL);
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 299
Example 2: Writing dataset by columns
[Figure: P0 and P1 each write interleaved columns of the file]
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 300
Writing by columns: Output of h5dump
HDF5 "SDS_col.h5" {
GROUP "/" {
   DATASET "IntArray" {
      DATATYPE H5T_STD_I32BE
      DATASPACE SIMPLE { ( 8, 6 ) / ( 8, 6 ) }
      DATA {
         1, 2, 10, 20, 100, 200,
         1, 2, 10, 20, 100, 200,
         1, 2, 10, 20, 100, 200,
         1, 2, 10, 20, 100, 200,
         1, 2, 10, 20, 100, 200,
         1, 2, 10, 20, 100, 200,
         1, 2, 10, 20, 100, 200,
         1, 2, 10, 20, 100, 200
      }
   }
}
}
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 301
Example 2: Writing dataset by column
[Figure: each process's memory block (dimsm[0] x dimsm[1]) maps to every other column of the file, with block dimensions block[0] x block[1]; P0 starts at offset[1] = 0, P1 at offset[1] = 1, and stride[1] = 2]
March 9, 2009 10th International LCI Conference - HDF5 Tutorial 302
Example 2: Writing dataset by column
/*
 * Each process defines a hyperslab in the file.
 */
count[0] = 1;
count[1] = dimsm[1];
offset[0] = 0;
offset[1] = mpi_rank;
stride[0] = 1;
stride[1] = 2;
block[0] = dimsf[0];
block[1] = 1;

/*
 * Each process selects a hyperslab.
 */
filespace = H5Dget_space(dset_id);
H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, stride, count, block);
Example 3: Writing dataset by pattern
Writing by Pattern: Output of h5dump
HDF5 "SDS_pat.h5" {
GROUP "/" {
   DATASET "IntArray" {
      DATATYPE  H5T_STD_I32BE
      DATASPACE SIMPLE { ( 8, 4 ) / ( 8, 4 ) }
      DATA {
         1, 3, 1, 3,
         2, 4, 2, 4,
         1, 3, 1, 3,
         2, 4, 2, 4,
         1, 3, 1, 3,
         2, 4, 2, 4,
         1, 3, 1, 3,
         2, 4, 2, 4
      }
   }
}
}
Example 3: Writing dataset by pattern
offset[0] = 0;
offset[1] = 1;
count[0] = 4;
count[1] = 2;
stride[0] = 2;
stride[1] = 2;
Example 3: Writing by pattern
/* Each process defines a dataset in memory and
 * writes it to the hyperslab in the file.
 */
count[0] = 4;
count[1] = 2;
stride[0] = 2;
stride[1] = 2;
if (mpi_rank == 0) { offset[0] = 0; offset[1] = 0; }
if (mpi_rank == 1) { offset[0] = 1; offset[1] = 0; }
if (mpi_rank == 2) { offset[0] = 0; offset[1] = 1; }
if (mpi_rank == 3) { offset[0] = 1; offset[1] = 1; }
Example 4: Writing dataset by chunks
Writing by Chunks: Output of h5dump
HDF5 "SDS_chnk.h5" {
GROUP "/" {
   DATASET "IntArray" {
      DATATYPE  H5T_STD_I32BE
      DATASPACE SIMPLE { ( 8, 4 ) / ( 8, 4 ) }
      DATA {
         1, 1, 2, 2,
         1, 1, 2, 2,
         1, 1, 2, 2,
         1, 1, 2, 2,
         3, 3, 4, 4,
         3, 3, 4, 4,
         3, 3, 4, 4,
         3, 3, 4, 4
      }
   }
}
}
Example 4: Writing dataset by chunks
block[0]  = chunk_dims[0];
block[1]  = chunk_dims[1];
offset[0] = chunk_dims[0];
offset[1] = 0;
Example 4: Writing by chunks
count[0]  = 1;
count[1]  = 1;
stride[0] = 1;
stride[1] = 1;
block[0]  = chunk_dims[0];
block[1]  = chunk_dims[1];
if (mpi_rank == 0) { offset[0] = 0;             offset[1] = 0; }
if (mpi_rank == 1) { offset[0] = 0;             offset[1] = chunk_dims[1]; }
if (mpi_rank == 2) { offset[0] = chunk_dims[0]; offset[1] = 0; }
if (mpi_rank == 3) { offset[0] = chunk_dims[0]; offset[1] = chunk_dims[1]; }
Parallel HDF5: Intermediate Tutorial
Outline
• Performance
• Parallel tools
My PHDF5 Application I/O is slow
• If my application I/O performance is slow, what can I do?
  • Use larger I/O data sizes
  • Independent vs. collective I/O
  • Specific I/O system hints
  • Increase parallel file system capacity
Write Speed vs. Block Size
TFLOPS: HDF5 Write vs. MPIO Write (file size 3200 MB, 8 nodes)
[Chart: throughput in MB/s for block sizes of 1, 2, 4, 8, 16, and 32 MB; series: HDF5 Write, MPIO Write]
Independent vs. Collective Access
• A user reported that the independent data transfer mode was much slower than the collective data transfer mode
• The data array was tall and thin: 230,000 rows by 6 columns
Debug Slow Parallel I/O Speed (1)
• Writing to one dataset
• Using 4 processes == 4 columns
• Data type is 8-byte doubles
• 4 processes, 1000 rows == 4 x 1000 x 8 = 32,000 bytes
• % mpirun -np 4 ./a.out i t 1000
  • Execution time: 1.783798 s.
• % mpirun -np 4 ./a.out i t 2000
  • Execution time: 3.838858 s.
• # Difference of 2 seconds for 1000 more rows = 32,000 bytes
• # A speed of 16 KB/sec!!! Way too slow.
Debug Slow Parallel I/O Speed (2)
• Build a version of PHDF5 with
  • ./configure --enable-debug --enable-parallel …
• This allows tracing of MPI-IO calls in the HDF5 library.
• E.g., to trace MPI_File_read_xx and MPI_File_write_xx calls:
  • % setenv H5FD_mpio_Debug "rw"
Debug Slow Parallel I/O Speed (3)
% setenv H5FD_mpio_Debug 'rw'
% mpirun -np 4 ./a.out i t 1000    # Indep.; contiguous.
in H5FD_mpio_write mpi_off=0    size_i=96
in H5FD_mpio_write mpi_off=0    size_i=96
in H5FD_mpio_write mpi_off=0    size_i=96
in H5FD_mpio_write mpi_off=0    size_i=96
in H5FD_mpio_write mpi_off=2056 size_i=8
in H5FD_mpio_write mpi_off=2048 size_i=8
in H5FD_mpio_write mpi_off=2072 size_i=8
in H5FD_mpio_write mpi_off=2064 size_i=8
in H5FD_mpio_write mpi_off=2088 size_i=8
in H5FD_mpio_write mpi_off=2080 size_i=8
…
# total of 4000 of these little 8-byte writes == 32,000 bytes
Independent calls are many and small
• Each process writes one element of one row, skips to the next row, writes one element, and so on.
• Each process issues 230,000 writes of 8 bytes each.
• Not good == just like many independent cars driving to work: wasted gas, wasted time, and a total traffic jam.
Debug Slow Parallel I/O Speed (4)
% setenv H5FD_mpio_Debug 'rw'
% mpirun -np 4 ./a.out i h 1000    # Indep., chunked.
in H5FD_mpio_write mpi_off=0     size_i=96
in H5FD_mpio_write mpi_off=0     size_i=96
in H5FD_mpio_write mpi_off=0     size_i=96
in H5FD_mpio_write mpi_off=0     size_i=96
in H5FD_mpio_write mpi_off=3688  size_i=8000
in H5FD_mpio_write mpi_off=11688 size_i=8000
in H5FD_mpio_write mpi_off=27688 size_i=8000
in H5FD_mpio_write mpi_off=19688 size_i=8000
in H5FD_mpio_write mpi_off=96    size_i=40
in H5FD_mpio_write mpi_off=136   size_i=544
in H5FD_mpio_write mpi_off=680   size_i=120
in H5FD_mpio_write mpi_off=800   size_i=272
…
Execution time: 0.011599 s.
Use Collective Mode or Chunked Storage
• Collective mode combines many small independent calls into a few bigger calls == like people commuting to work together by train.
• Chunks of columns speed things up too == like people living and working in the suburbs to reduce overlapping traffic.
Independent vs. Collective write
(6 processes, IBM p-690, AIX, GPFS)

# of Rows   Data Size (MB)   Independent (sec.)   Collective (sec.)
16384       0.25             8.26                 1.72
32768       0.50             65.12                1.80
65536       1.00             108.20               2.68
122918      1.88             276.57               3.11
150000      2.29             528.15               3.63
180300      2.75             881.39               4.12
Independent vs. Collective write (cont.)
Performance (non-contiguous)
[Chart: time in seconds (0 to 1000) vs. data space size in MB (0 to 3), plotting the Independent and Collective columns of the table above]
Effects of I/O Hints: IBM_largeblock_io
• GPFS at LLNL ASCI Blue machine
• 4 nodes, 16 tasks
• Total data size 1024 MB
• I/O buffer size 1 MB

                    IBM_largeblock_io=false    IBM_largeblock_io=true
Tasks               MPI-IO     PHDF5           MPI-IO     PHDF5
16 write (MB/s)     60         48              354        294
16 read (MB/s)      44         39              256        248
Parallel Tools
• ph5diff
  • Parallel version of the h5diff tool
• h5perf
  • Performance measuring tool showing I/O performance for different I/O APIs
ph5diff
• A parallel version of the h5diff tool
• Supports all features of h5diff
• An MPI parallel tool
• Manager process (proc 0)
  • coordinates the remaining processes (workers) to "diff" one dataset at a time;
  • collects the output from each worker and prints it out.
• Works best if there are many datasets in the two files with few differences.
• Available in v1.8.
h5perf
• An I/O performance measurement tool
• Tests 3 file I/O APIs:
  • POSIX I/O (open/write/read/close…)
  • MPI-IO (MPI_File_{open,write,read,close})
  • PHDF5
    • H5Pset_fapl_mpio (using MPI-IO)
    • H5Pset_fapl_mpiposix (using POSIX I/O)
• Gives an indication of I/O speed upper limits
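The two PHDF5 modes correspond to different file-access property list settings, roughly as follows (a sketch, not h5perf's actual code; the file name is an assumption):

```c
/* Sketch: choosing the PHDF5 file driver (HDF5 1.8).
 * "perftest.h5" is an illustrative file name. */
hid_t fapl_id = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(fapl_id, MPI_COMM_WORLD, MPI_INFO_NULL);  /* MPI-IO   */
/* or: H5Pset_fapl_mpiposix(fapl_id, MPI_COMM_WORLD, 0); */ /* POSIX   */
hid_t file_id = H5Fopen("perftest.h5", H5F_ACC_RDWR, fapl_id);
```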
h5perf: Some features
• Check (-c): verify data correctness
• 2-D chunk patterns added in v1.8
• -h shows the help page
h5perf: example output 1/3

% mpirun -np 4 h5perf          # Ran on a Linux system
Number of processors = 4
Transfer Buffer Size: 131072 bytes, File size: 1.00 MBs
  # of files: 1, # of datasets: 1, dataset size: 1.00 MBs
     IO API = POSIX
        Write (1 iteration(s)):
           Maximum Throughput:   18.75 MB/s
           Average Throughput:   18.75 MB/s
           Minimum Throughput:   18.75 MB/s
        Write Open-Close (1 iteration(s)):
           Maximum Throughput:   10.79 MB/s
           Average Throughput:   10.79 MB/s
           Minimum Throughput:   10.79 MB/s
        Read (1 iteration(s)):
           Maximum Throughput: 2241.74 MB/s
           Average Throughput: 2241.74 MB/s
           Minimum Throughput: 2241.74 MB/s
        Read Open-Close (1 iteration(s)):
           Maximum Throughput:  756.41 MB/s
           Average Throughput:  756.41 MB/s
           Minimum Throughput:  756.41 MB/s
h5perf: example output 2/3

% mpirun -np 4 h5perf
…
     IO API = MPIO
        Write (1 iteration(s)):
           Maximum Throughput:  611.95 MB/s
           Average Throughput:  611.95 MB/s
           Minimum Throughput:  611.95 MB/s
        Write Open-Close (1 iteration(s)):
           Maximum Throughput:   16.89 MB/s
           Average Throughput:   16.89 MB/s
           Minimum Throughput:   16.89 MB/s
        Read (1 iteration(s)):
           Maximum Throughput:  421.75 MB/s
           Average Throughput:  421.75 MB/s
           Minimum Throughput:  421.75 MB/s
        Read Open-Close (1 iteration(s)):
           Maximum Throughput:  109.22 MB/s
           Average Throughput:  109.22 MB/s
           Minimum Throughput:  109.22 MB/s
h5perf: example output 3/3

% mpirun -np 4 h5perf
…
     IO API = PHDF5 (w/MPI-I/O driver)
        Write (1 iteration(s)):
           Maximum Throughput:  304.40 MB/s
           Average Throughput:  304.40 MB/s
           Minimum Throughput:  304.40 MB/s
        Write Open-Close (1 iteration(s)):
           Maximum Throughput:   15.14 MB/s
           Average Throughput:   15.14 MB/s
           Minimum Throughput:   15.14 MB/s
        Read (1 iteration(s)):
           Maximum Throughput: 1718.27 MB/s
           Average Throughput: 1718.27 MB/s
           Minimum Throughput: 1718.27 MB/s
        Read Open-Close (1 iteration(s)):
           Maximum Throughput:   78.06 MB/s
           Average Throughput:   78.06 MB/s
           Minimum Throughput:   78.06 MB/s
Transfer Buffer Size: 262144 bytes, File size: 1.00 MBs
  # of files: 1, # of datasets: 1, dataset size: 1.00 MBs
Useful Parallel HDF Links
• Parallel HDF information site
  http://www.hdfgroup.org/HDF5/PHDF5/
• Parallel HDF5 tutorial available at
  http://www.hdfgroup.org/HDF5/Tutor/
• HDF Help email address
  [email protected]
Questions?
Parallel I/O Performance Study
(preliminary results)
Albert Cheng
The HDF Group
Introduction
• Parallel performance is affected by the I/O access pattern, the file system, and the MPI communication mode.
• Determining how these elements interact provides hints for improving performance.
• This study presents four test cases using h5perf and h5perf_serial.
  • h5perf has been extended to support parallel testing of 2D datasets.
  • h5perf_serial, based on h5perf, allows serial testing of n-dimensional datasets and various file drivers.
• Testing includes various combinations of MPI communication modes and HDF5 storage layouts.
• Finally, we make recommendations that can improve I/O performance for specific patterns.
Testing Systems and Configuration
System    Architecture                   File System   MPI Implementation
abe       Linux cluster with Intel 64    Lustre        MVAPICH2 1.0.2p1 with Intel compiler
cobalt    ccNUMA with Itanium 2          CXFS          SGI Message Passing Toolkit 1.16
mercury   Linux cluster with Itanium 2   GPFS          MPICH Myrinet 1.2.5.10, GM 2.0.8, Intel 8.0

Processors       4
Dataset Size     64K x 64K (4 GB)
I/O Selection    64 MB per processor (shape depends on test case)
API              HDF5 v1.8.1 (default build options)
Iterations       3
MPI-IO Type      Collective / Independent
Storage Layout   Contiguous / Chunked (chunk size depends on test case)
HDF5 Storage Layouts
• Contiguous
  • HDF5 assigns a static contiguous region of storage for raw data.
HDF5 Storage Layouts
• Chunked
  • HDF5 defines separate regions of storage for raw data, named chunks, which are pre-allocated in row-major order when a file is created in parallel.
  • This layout is only valid when a file is created and the chunks are pre-allocated; further modification of the file may cause the chunks to be arranged differently.
[Figure: chunks C0-C3 of the dataset and their row-major order in file storage]
Test Cases
• Case A• The transfer selections extend over the entire columns
with a size of 64K×1K. If the storage is chunked, the size of the chunks is 1K×1K. The selections are interleaved horizontally with respect to the processors.
Test Cases
• Case B• The transfer selection only spans half the columns with a size of
32K×2K. If the storage is chunked, the size of the chunks is
2K×2K. The selections are interleaved horizontally with respect
to the processors.
Test Cases
• Case C
  • The transfer selections only span half the rows, with a size of 2K x 32K. If the storage is chunked, the chunk size is 2K x 2K. The lower dimension (columns) is evenly divided among the processors.
Test Cases
• Case D
  • The transfer selection extends over entire rows, with a size of 1K x 64K. If the storage is chunked, the chunk size is 1K x 1K. The lower dimension (columns) is evenly divided among the processors.
Access Patterns
• Contiguous
  • Each processor retrieves a separate region of contiguous storage. An example of this pattern is case D using contiguous storage.
• Non-contiguous
  • Separate regions are still assigned to each processor, but such regions contain gaps. Examples of this pattern include case C using contiguous storage, and collective cases C-D using chunked storage.
Access Patterns
• Interleaved (or overlapped)
  • Each processor writes into many portions that are interleaved with respect to the other processors. For example, using contiguous storage with cases A-B generates this pattern.
  • Another instance results from using chunked storage with collective cases A-B.
Performance Results and Analysis
• The results correspond to the maximum throughput of Write Open-Close operations over 3 iterations.
• Serial throughput is the performance baseline, since our objective is to determine how parallel access can improve performance.
• Unlike GPFS and CXFS, Lustre does not stripe files by default. To enable parallel access, the directory/file must be striped using the lfs command.
I/O Performance in Lustre
                      NON-STRIPED                          STRIPED
COLLECTIVE    Case A   Case B   Case C   Case D    Case A   Case B   Case C   Case D
Contiguous    11.66    23.68    46.12    36.67     25.35    50.26    42.67    119.26
Chunked       179.85   117.31   124.88   106.95    180.33   224.28   86.88    93.45

INDEPENDENT   Case A   Case B   Case C   Case D    Case A   Case B   Case C   Case D
Contiguous    5.92     8.17     20.98    304.06    6.7      10.81    73.45    298.09
Chunked       219.15   328.04   12.15    8.16      158.9    133.27   12.94    10.51
I/O Performance in Lustre
• Striping partitions the file space into stripes and assigns them to several Object Storage Targets (OSTs) in round-robin fashion.
• Since each OST stores portions of the file that differ from those on the other OSTs, they can all access the file in parallel.
• The default configuration on abe uses a stripe size of 4 MB and a stripe count of 16.
• Striping improves performance when the I/O request of each processor spans several stripes (and OSTs) after MPI aggregation, if any.
• When the processors make small independent I/O requests that are practically contiguous, as in cases A-B using chunked storage, a single OST can provide better performance due to asynchronous operations.
I/O Performance
[Chart: abe, maximum throughput in MB/s (log scale, 1 to 1000) for cases A-D; series: serial/cont, serial/chk, ind/cont, ind/chk, coll/cont, coll/chk]
I/O Performance
[Chart: cobalt, maximum throughput in MB/s (log scale, 1 to 1000) for cases A-D; series: serial/cont, serial/chk, ind/cont, ind/chk, coll/cont, coll/chk]
I/O Performance
[Chart: mercury, maximum throughput in MB/s (log scale, 0.1 to 1000) for cases A-D; series: serial/cont, serial/chk, ind/cont, ind/chk, coll/cont, coll/chk]
Performance of Serial I/O
• Access using contiguous storage has the steepest performance trend as the cases change from A to D.
• When using chunked storage, the throughput remains almost constant at the upper bound.
• The allocation of chunks at the time they are written causes the access pattern to be virtually contiguous regardless of the test cases.
Performance of Independent I/O
• Processors perform their I/O requests independently from each other.
• For contiguous storage, performance improves as the tests move from A to D.
• For chunked storage, throughput is high for interleaved cases A-B, since the written blocks (chunks) are larger and caching is exploited. For cases C-D, the many write requests (one per chunk) multiply the overhead due to unnecessary locking and caching in Lustre and CXFS.
• Unlike these file systems, GPFS has shown better scalability [1,2].
Performance of Collective I/O
• The participating processors coordinate and combine their many requests into fewer I/O operations, reducing latency.
• Since the file space is evenly divided among the processors, there is no need for locking, which reduces overhead [3].
• For contiguous storage, performance is overall high but there is still an increasing trend as the cases change from A to D.
• For chunked storage, the performance is even higher with minor variations among the tests cases because several chunks can be written with a single I/O operation.
Conclusion
• It is important to determine the access pattern by analyzing the I/O requirements of the application and the storage implementation.
• For contiguous access patterns, independent access is preferable because it avoids the unnecessary overhead of collective calls.
• For non-contiguous patterns, there is little difference between independent and collective access. However, writing many chunks in independent mode may be expensive in Lustre and CXFS if caching is not exploited.
• For interleaved access patterns, collective mode is usually faster.
• For all access patterns, collective mode and chunked storage provide the combination that yields the highest average performance.
References
1. J. Borrill, L. Oliker, J. Shalf, and H. Shan. Investigation of Leading HPC I/O Performance Using A Scientific-Application Derived Benchmark. In Proceedings of SC’07: High Performance Networking and Computing, Reno, NV, November 2007.
2. W. Liao, A. Ching, K. Coloma, A. Choudhary, and L. Ward. An Implementation and Evaluation of Client-Side File Caching for MPI-IO. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS 2007), pages 1-10, March 2007.
3. R. Thakur, W. Gropp, and E. Lusk. Data Sieving and Collective I/O in ROMIO. In Proceedings of the 7th Symposium on the Frontiers of Massively Parallel Computation. IEEE Computer Society Press, February 1999.
Questions?