1
Parallel and Grid I/O Infrastructure
Rob Ross, Argonne National Lab
Parallel Disk Access and Grid I/O (P4)
SDM All Hands Meeting, March 26, 2002
2
Participants
Argonne National Laboratory
- Bill Gropp, Rob Ross, Rajeev Thakur, Rob Latham, Anthony Chan
Northwestern University
- Alok Choudhary, Wei-keng Liao, Avery Ching, Kenin Coloma, Jianwei Li
Collaborators
- Lawrence Livermore National Laboratory
  - Ghaleb Abdulla, Tina Eliassi-Rad, Terence Critchlow
- Application groups
3
Focus Areas in Project
Parallel I/O on clusters
- Parallel Virtual File System (PVFS)
- ROMIO MPI-IO implementation
MPI-IO hints
Grid I/O
- Linking PVFS and ROMIO with Grid I/O components
Application interfaces
- NetCDF and HDF5
Everything is interconnected!
Wei-keng Liao will drill down into specific tasks
4
Parallel Virtual File System
Lead developer R. Ross (ANL)
- R. Latham (ANL), developer
- A. Ching, K. Coloma (NWU), collaborators
Open source, scalable parallel file system
- Project began in mid ’90s at Clemson University
- Now a collaboration between Clemson and ANL
Successes
- In use on large Linux clusters (OSC, Utah, Clemson, ANL, Phillips Petroleum, …)
- 100+ unique downloads/month
- 160+ users on mailing list, 90+ on developers list
- Multiple Gigabyte/second performance shown
5
Keeping PVFS Relevant: PVFS2
Scaling to thousands of clients and hundreds of servers requires some design changes
- Distributed metadata
- New storage formats
- Improved fault tolerance
New technology, new features
- High-performance networking (e.g. InfiniBand, VIA)
- Application metadata
New design and implementation warranted (PVFS2)
6
PVFS1, PVFS2, and SDM
Maintaining PVFS1 as a resource to community
- Providing support, bug fixes
- Encouraging use by application groups
- Adding functionality to improve performance (e.g. tiled display)
Implementing next-generation parallel file system
- Basic infrastructure for future PFS work
- New physical distributions (e.g. chunking)
- Application metadata storage
Ensuring that a working parallel file system will continue to be available on clusters as they scale
7
Data Staging for Tiled Display
Contact: Joe Insley (ANL)
Commodity components
- projectors, PCs
- Provide very high resolution visualization
Staging application preprocesses “frames” into a tile stream for each “visualization node”
- Uses MPI-IO to access data from PVFS file system
- Streams of tiles are merged into movie files on visualization nodes
- End goal is to display frames directly from PVFS
- Enhancing PVFS and ROMIO to improve performance
8
Example Tile Layout
3x2 display, 6 readers
Frame size is 2532x1408 pixels
Tile size is 1024x768 pixels (overlapped)
Movies broken into frames, with each frame stored in its own file in PVFS
Readers pull data from PVFS and send to display
9
Tested access patterns
Subtile
- Each reader grabs a piece of a tile
- Small noncontiguous accesses
- Lots of accesses for a frame
Tile
- Each reader grabs a whole tile
- Larger noncontiguous accesses
- Six accesses for a frame
Reading individual pieces is simply too slow
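As a rough sketch of how a reader can describe a whole tile to MPI-IO in one request (the function shape, file name, and 3-byte RGB pixel assumption are illustrative; the frame/tile geometry matches the previous slide):

    #include <mpi.h>

    void read_tile(const char *fname, int row0, int col0, unsigned char *buf)
    {
        /* Frame and tile geometry from the previous slide; RGB pixels
         * assumed to be 3 bytes each. */
        int sizes[2]    = {1408, 2532 * 3};   /* whole frame (rows, bytes/row) */
        int subsizes[2] = { 768, 1024 * 3};   /* one tile                      */
        int starts[2]   = {row0, col0 * 3};   /* this reader's tile origin     */
        MPI_Datatype tile;
        MPI_File fh;

        MPI_Type_create_subarray(2, sizes, subsizes, starts,
                                 MPI_ORDER_C, MPI_BYTE, &tile);
        MPI_Type_commit(&tile);

        MPI_File_open(MPI_COMM_WORLD, (char *)fname, MPI_MODE_RDONLY,
                      MPI_INFO_NULL, &fh);
        /* The file view makes the noncontiguous tile appear contiguous,
         * so one collective call reads the whole tile. */
        MPI_File_set_view(fh, 0, MPI_BYTE, tile, "native", MPI_INFO_NULL);
        MPI_File_read_all(fh, buf, 768 * 1024 * 3, MPI_BYTE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Type_free(&tile);
    }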
10
Noncontiguous Access in ROMIO
ROMIO performs “data sieving” to cut down the number of I/O operations
Uses large reads, each of which grabs multiple noncontiguous pieces
Example: reading tile 1 with one large read covers all of the tile’s pieces, at the cost of also reading the unneeded gaps between them
11
Noncontiguous Access in PVFS
ROMIO data sieving
- Works for all file systems (just uses contiguous read)
- Reads extra data (three times desired amount)
Noncontiguous access primitive allows requesting just desired bytes (A. Ching, NWU)
Support in ROMIO allows transparent use of new optimization (K. Coloma, NWU)
PVFS and ROMIO support implemented
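The list-I/O idea, sketched below with a hypothetical signature (pvfs_read_list() and the request-building helper are illustrative stand-ins, not the actual PVFS interface):

    #define FRAME_ROW_BYTES (2532 * 3)   /* bytes per frame row (RGB) */
    #define TILE_ROW_BYTES  (1024 * 3)   /* bytes per tile row        */
    #define TILE_ROWS        768

    /* Hypothetical list-I/O primitive; a stand-in for the real PVFS call. */
    int pvfs_read_list(int fd, int mcount, char **moff, int *mlen,
                       int fcount, long long *foff, int *flen);

    /* Build one (offset, length) pair per tile row so the file system
     * moves only the desired bytes -- no extra data, unlike data sieving. */
    void build_tile_request(int row0, int col0,
                            long long foff[TILE_ROWS], int flen[TILE_ROWS])
    {
        int i;
        for (i = 0; i < TILE_ROWS; i++) {
            foff[i] = (long long)(row0 + i) * FRAME_ROW_BYTES
                      + (long long)col0 * 3;
            flen[i] = TILE_ROW_BYTES;
        }
    }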
[Chart: Normalized Read Performance (0 to 1) for Subtile and Tile patterns, List I/O vs. Data Sieving]
12
Metadata in File Systems
Associative arrays of information related to a file
Seen in other file systems (MacOS, BeOS, ReiserFS)
Some potential uses:
- Ancillary data (from applications)
  - Derived values
  - Thumbnail images
  - Execution parameters
- I/O library metadata
  - Block layout information
  - Attributes on variables
  - Attributes of dataset as a whole
- Headers
  - Keeps header out of data stream
  - Eliminates need for alignment in libraries
13
Metadata and PVFS2 Status
Prototype metadata storage for PVFS2 implemented
- R. Ross (ANL)
- Uses Berkeley DB for storage of keyword/value pairs (see the sketch below)
- Need to investigate how to interface to MPI-IO
Other components of PVFS2 coming along
- Networking in testing (P. Carns, Clemson)
- Client-side API under development (Clemson)
PVFS2 beta early fourth quarter?
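A minimal sketch of keyword/value storage with Berkeley DB (the database name and one-DB-per-object layout are assumptions; the open() call follows the Berkeley DB 4.1-era C API):

    #include <string.h>
    #include <db.h>

    /* Store one keyword/value pair for a file object. */
    void set_file_attr(const char *keyword, void *value, size_t len)
    {
        DB *dbp;
        DBT key, data;

        db_create(&dbp, NULL, 0);
        dbp->open(dbp, NULL, "object_metadata.db", NULL,
                  DB_BTREE, DB_CREATE, 0644);

        memset(&key, 0, sizeof(key));
        memset(&data, 0, sizeof(data));
        key.data  = (void *)keyword;
        key.size  = strlen(keyword) + 1;
        data.data = value;
        data.size = len;

        dbp->put(dbp, NULL, &key, &data, 0);  /* keyword -> value */
        dbp->close(dbp, 0);
    }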
14
ROMIO MPI-IO Implementation
Written by R. Thakur (ANL)
- R. Ross and R. Latham (ANL), developers
- K. Coloma (NWU), collaborator
Implementation of MPI-2 I/O specification
- Operates on wide variety of platforms
- Abstract Device Interface for I/O (ADIO) aids in porting to new file systems
Successes
- Adopted by industry (e.g. Compaq, HP, SGI)
- Used at ASCI sites (e.g. LANL Blue Mountain)
15
ROMIO Current Directions
Support for PVFS noncontiguous requests
- K. Coloma (NWU)
Hints: key to efficient use of HW & SW components
- Collective I/O
  - Aggregation (synergy)
- Performance portability
  - Controlling ROMIO optimizations
- Access patterns
- Grid I/O
Scalability
- Parallel I/O benchmarking
16
ROMIO Aggregation Hints
Part of ASCI Software Pathforward project
- Contact: Gary Grider (LANL)
Implementation by R. Ross, R. Latham (ANL)
Hints control which processes do I/O in collectives
Examples:
- All processes on same node as attached storage
- One process per host
Additionally limit the number of processes that open the file
- Good for systems w/out shared FS (e.g. O2K clusters)
- More scalable
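For example, a hedged sketch of opening a file with aggregation hints (host and file names are made up; "cb_nodes" and "cb_config_list" are ROMIO's collective-buffering hint keys, though support varies by version):

    #include <mpi.h>

    /* Steer collective I/O to a single aggregator process. */
    MPI_File open_with_aggregation(MPI_Comm comm, const char *path)
    {
        MPI_Info info;
        MPI_File fh;

        MPI_Info_create(&info);
        MPI_Info_set(info, "cb_nodes", "1");              /* one aggregator   */
        MPI_Info_set(info, "cb_config_list", "ionode:1"); /* on host "ionode" */

        MPI_File_open(comm, (char *)path,
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
        MPI_Info_free(&info);
        return fh;
    }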
17
Aggregation Example
Cluster of SMPs
Only one SMP box has connection to disks
Data is aggregated to processes on single box
Processes on that box perform I/O on behalf of the others
18
Optimization Hints
MPI-IO calls should be chosen to best describe the I/O taking place
- Use of file views
- Collective calls for inherently collective operations
Unfortunately, sometimes choosing the “right” calls can result in lower performance
Allow application programmers to tune ROMIO with hints rather than using different MPI-IO calls
Avoid the misapplication of optimizations (aggregation, data sieving)
19
Optimization Problems
ROMIO checks for applicability of two-phase optimization when collective I/O is used
With tiled display application using subtile access, this optimization is never used
Checking for applicability requires communication between processes
Results in 33% drop in throughput (on test system)
A hint that tells ROMIO not to apply the optimization can avoid this without changes to the rest of the application
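A sketch of what such a hint might look like ("romio_cb_read" and "romio_ds_read" are ROMIO hint keys; availability depends on the ROMIO version):

    #include <mpi.h>

    /* Build an info object that turns off two-phase collective reads
     * and data sieving reads for patterns where they only add cost. */
    MPI_Info no_collective_opt_hints(void)
    {
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "romio_cb_read", "disable"); /* skip two-phase */
        MPI_Info_set(info, "romio_ds_read", "disable"); /* skip sieving   */
        return info;  /* pass to MPI_File_open or MPI_File_set_view */
    }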
20
Access Pattern Hints
Collaboration between ANL and LLNL (and growing)
Examining how access pattern information can be passed to MPI-IO interface, through to underlying file system
Used as input to optimizations in MPI-IO layer
Used as input to optimizations in FS layer as well
- Prefetching
- Caching
- Writeback
21
Status of Hints
Aggregation control finished
Optimization hints
- Collectives, data sieving read finished
- Data sieving write control in progress
- PVFS noncontiguous I/O control in progress
Access pattern hints
- Exchanging log files, formats
- Getting up to speed on respective tools
22
Parallel I/O Benchmarking
No common parallel I/O benchmarks
New effort (consortium) to:
- Define some terminology
- Define test methodology
- Collect tests
Goal: provide a meaningful test suite with consistent measurement techniques
Interested parties at numerous sites (and growing)
- LLNL, Sandia, UIUC, ANL, UCAR, Clemson
In infancy…
23
Grid I/O
Looking at ways to connect our I/O work with components and APIs used in the Grid
- New ways of getting data in and out of PVFS
- Using MPI-IO to access data in the Grid
- Alternative mechanisms for transporting data across the Grid (synergy)
Working towards more seamless integration of the tools used in the Grid and those used on clusters and in parallel applications (specifically MPI applications)
Facilitate moving between Grid and Cluster worlds
24
Local Access to GridFTP Data
Grid I/O contact: B. Allcock (ANL)
GridFTP striped server provides high-throughput mechanism for moving data across Grid
Relies on proprietary storage format on striped servers
- Must manage metadata on stripe location
- Data stored on servers must be read back from servers
- No alternative/more direct way to access local data
- Next version assumes shared file system underneath
25
GridFTP Striped Servers
Remote applications connect to multiple striped servers to quickly transfer data over Grid
Multiple TCP streams better utilize WAN network
Local processes would need to use same mechanism to get to data on striped servers
26
PVFS under GridFTP
With PVFS underneath, GridFTP servers would store data on PVFS I/O servers
Stripe information stored on PVFS metadata server
27
Local Data Access
Application tasks that are part of a local parallel job could access data directly off PVFS file system
Output from application could be retrieved remotely via GridFTP
28
MPI-IO Access to GridFTP
Applications such as the tiled display reader want remote access to GridFTP data
Access through MPI-IO would allow this with no code changes
ROMIO ADIO interface provides the infrastructure necessary to do this
MPI-IO hints provide means for specifying number of stripes, transfer sizes, etc.
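A hedged sketch of what this could look like from the application side (the "gridftp:" prefix and every hint key below are hypothetical, shown only to illustrate the idea; only the MPI calls themselves are real):

    #include <mpi.h>

    /* Hypothetical: open remote GridFTP data through an ADIO driver. */
    MPI_File open_remote_frames(MPI_Comm comm)
    {
        MPI_Info info;
        MPI_File fh;

        MPI_Info_create(&info);
        MPI_Info_set(info, "striped_ftp_num_stripes", "4");        /* hypothetical */
        MPI_Info_set(info, "striped_ftp_transfer_size", "1048576"); /* hypothetical */

        MPI_File_open(comm, "gridftp:server.site.gov/frames.dat",
                      MPI_MODE_RDONLY, info, &fh);
        MPI_Info_free(&info);
        return fh;
    }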
29
WAN File Transfer Mechanism
B. Gropp (ANL), P. Dickens (IIT)
Applications
- PPM and COMMAS (Paul Woodward, UMN)
Alternative mechanism for moving data across Grid using UDP
Focuses on requirements for file movement
- All data must arrive at destination
- Ordering doesn’t matter
- Lost blocks can be retransmitted when detected, but need not stop the remainder of the transfer
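A toy sender loop illustrating these requirements (the packet format, ack scheme, and all names are invented for illustration; a real implementation needs timeouts and a final catch-up phase):

    #include <stdint.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    #define BLOCK_SIZE 1024
    #define ACK_WINDOW 100      /* packets sent per acknowledgement round */

    struct packet {
        uint32_t block;         /* block number: arrival order is irrelevant */
        char     data[BLOCK_SIZE];
    };

    static void send_block(int sock, const char *buf, uint32_t b)
    {
        struct packet p;
        p.block = b;
        memcpy(p.data, buf + (size_t)b * BLOCK_SIZE, BLOCK_SIZE);
        send(sock, &p, sizeof(p), 0);
    }

    /* Send nblocks blocks over a connected UDP socket.  After every
     * ACK_WINDOW packets the receiver replies with the numbers of blocks
     * still missing; those are resent without stalling new blocks. */
    void send_file(int sock, const char *buf, uint32_t nblocks)
    {
        uint32_t b, i, missing[ACK_WINDOW];
        ssize_t n;

        for (b = 0; b < nblocks; b++) {
            send_block(sock, buf, b);
            if ((b + 1) % ACK_WINDOW == 0) {
                n = recv(sock, missing, sizeof(missing), 0);
                for (i = 0; n > 0 && i < n / sizeof(uint32_t); i++)
                    send_block(sock, buf, missing[i]);  /* retransmit */
            }
        }
        /* final catch-up rounds for stragglers omitted */
    }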
30
WAN File Transfer Performance
[Chart: Performance of User-Level Protocol on Short and Long Haul Networks. Percentage of maximum bandwidth obtained vs. number of packets received before sending an acknowledgement packet (50 to 2500); separate curves for long haul and short haul networks.]
[Chart: TCP over Short and Long Haul. Percentage of maximum available bandwidth vs. chunk size (5K bytes to 40 Meg).]
Comparing TCP utilization to WAN FT technique
See 10-12% utilization with single TCP stream (8 streams to approach max. utilization)
With WAN FT obtain near 90% utilization, more uniform performance
31
Grid I/O Status
Planning with Grid I/O group
- Matching up components
- Identifying useful hints
Globus FTP client library is available
2nd-generation striped server being implemented
XIO interface prototyped
- Hooks for alternative local file systems
- Obvious match for PVFS under GridFTP
32
NetCDF
Applications in climate and fusion
- PCM: John Drake (ORNL)
- Weather Research and Forecast Model (WRF): John Michalakes (NCAR)
- Center for Extended Magnetohydrodynamic Modeling: Steve Jardin (PPPL)
- Plasma Microturbulence Project: Bill Nevins (LLNL)
Maintained by Unidata Program Center
API and file format for storing multidimensional datasets and associated metadata (in a single file)
33
NetCDF Interface
Strong points:
- It’s a standard!
- I/O routines allow for subarray and strided access with single calls
- Access is clearly split into two modes
  - Defining the datasets (define mode)
  - Accessing and/or modifying the datasets (data mode)
Weakness: no parallel writes, limited parallel read capability
This forces applications to ship data to a single node for writing, severely limiting usability in I/O intensive applications
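For reference, the serial interface already shows the define/data split and the subarray access that the parallel work builds on (file and variable names below are made up):

    #include <netcdf.h>

    void write_temp(const float *data)
    {
        int ncid, dimids[2], varid;
        size_t start[2] = {0, 0}, count[2] = {16, 16}; /* a 16x16 subarray */

        nc_create("climate.nc", NC_CLOBBER, &ncid);     /* define mode */
        nc_def_dim(ncid, "lat", 64, &dimids[0]);
        nc_def_dim(ncid, "lon", 128, &dimids[1]);
        nc_def_var(ncid, "temp", NC_FLOAT, 2, dimids, &varid);
        nc_enddef(ncid);                                /* -> data mode */

        /* One call writes a whole subarray region of the variable. */
        nc_put_vara_float(ncid, varid, start, count, data);
        nc_close(ncid);
    }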
34
Parallel NetCDF
Rich I/O routines and explicit define/data modes provide a good foundation
- Existing applications are already describing noncontiguous regions
- Modes allow for a synchronization point when file layout changes
Missing:
- Semantics for parallel access
- Collective routines
- Option for using MPI datatypes
Implement in terms of MPI-IO operations
Retain file format for interoperability
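A hedged sketch of how the parallel calls could look once layered on MPI-IO (a communicator and hints at create time, collective "_all" data-mode calls; the ncmpi_* names are illustrative of the design direction, not a finished API):

    #include <mpi.h>
    #include <netcdf.h>   /* for NC_FLOAT; everything ncmpi_* is hypothetical */

    void write_checkpoint(void)
    {
        int ncid, varid, dimids[2], rank;
        MPI_Offset start[2], count[2];
        static float mydata[16][128];

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        ncmpi_create(MPI_COMM_WORLD, "climate.nc", 0, MPI_INFO_NULL, &ncid);
        ncmpi_def_dim(ncid, "lat", 64, &dimids[0]);
        ncmpi_def_dim(ncid, "lon", 128, &dimids[1]);
        ncmpi_def_var(ncid, "temp", NC_FLOAT, 2, dimids, &varid);
        ncmpi_enddef(ncid);                 /* collective switch to data mode */

        start[0] = rank * 16;  start[1] = 0;    /* each process owns 16 rows */
        count[0] = 16;         count[1] = 128;
        ncmpi_put_vara_float_all(ncid, varid, start, count, &mydata[0][0]);
        ncmpi_close(ncid);
    }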
35
Parallel NetCDF Status
Design document created
- B. Gropp, R. Ross, and R. Thakur (ANL)
Prototype in progress
- J. Li (NWU)
Focus is on write functions first
- Biggest bottleneck for checkpointing applications
Read functions follow
Investigate alternative file formats in future
- Address differences in access modes between writing and reading
36
FLASH Astrophysics Code
Developed at ASCI Center at University of Chicago
- Contact: Mike Zingale
Adaptive mesh (AMR) code for simulating astrophysical thermonuclear flashes
Written in Fortran90, uses MPI for communication, HDF5 for checkpointing and visualization data
Scales to thousands of processors, runs for weeks, needs to checkpoint
At the time, I/O was a bottleneck (½ of runtime on 1024 processors)
37
HDF5 Overhead Analysis
Instrumented FLASH I/O to log calls to H5Dwrite
[Timeline: H5Dwrite calls and the underlying MPI_File_write_at operations they generate]
38
HDF5 Hyperslab Operations
White region is hyperslab “gather” (from memory)
Cyan is “scatter” (to file)
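The operation being profiled looks roughly like this (names, sizes, and guard-cell geometry are assumptions; H5Dcreate follows the HDF5 1.4-era signature):

    #include <hdf5.h>

    /* One H5Dwrite with a hyperslab selection: the library "gathers" the
     * selected interior of the in-memory block (skipping guard cells) and
     * "scatters" it into the file dataset. */
    static double block[72][72];            /* 64x64 interior + 4-cell guards */

    void checkpoint(void)
    {
        hsize_t fdims[2] = {64, 64}, mdims[2] = {72, 72};
        hsize_t start[2] = {4, 4},  count[2] = {64, 64};

        hid_t file = H5Fcreate("flash.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        hid_t fspc = H5Screate_simple(2, fdims, NULL);
        hid_t mspc = H5Screate_simple(2, mdims, NULL);
        H5Sselect_hyperslab(mspc, H5S_SELECT_SET, start, NULL, count, NULL);

        hid_t dset = H5Dcreate(file, "density", H5T_NATIVE_DOUBLE, fspc, H5P_DEFAULT);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspc, fspc, H5P_DEFAULT, block);

        H5Dclose(dset); H5Sclose(mspc); H5Sclose(fspc); H5Fclose(file);
    }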
39
Hand-Coded Packing
Packing time is in black regions between bars
Nearly order of magnitude improvement
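For comparison, a sketch of the hand-coded packing approach, reusing the assumed geometry and the block array from the previous sketch (the copy loop is illustrative, not FLASH's actual code):

    #include <string.h>

    /* Copy the interior into a contiguous scratch buffer, then hand
     * HDF5 a simple contiguous memory space for the write. */
    static double packed[64][64];

    void pack_interior(void)
    {
        int i;
        for (i = 0; i < 64; i++)
            memcpy(packed[i], &block[i + 4][4], 64 * sizeof(double));
        /* then: H5Dwrite(dset, H5T_NATIVE_DOUBLE, contiguous_mspc, fspc,
         *        H5P_DEFAULT, packed); */
    }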
40
Wrap Up
Progress being made on multiple fronts
- ANL/NWU collaboration is strong
- Collaborations with other groups maturing
Balance of immediate payoff and medium-term infrastructure improvements
- Providing expertise to application groups
- Adding functionality targeted at specific applications
- Building core infrastructure to scale, ensure availability
Synergy with other projects
On to Wei-keng!