Status of CDI-pio development
Thomas Jahns <[email protected]>IS-ENES workshop on Scalable I/O in
climate models, Hamburg 2013-10-28
Outline
• History
• Recent work
• Issues (big time)
• Outlook
• Summary
What is CDI-pio?
• CDI (Climate Data Interface) is the I/O backend of CDO (Climate Data Operators), both by Uwe Schulzweida (a minimal serial example follows below)
– CDI abstracts away the differences between several file formats relevant in climate science (GRIB1, GRIB2, netCDF, SRV, EXT, IEG)
• CDI-pio is the MPI parallelization and I/O client/server infrastructure initially built by Deike Kleberg (supported GRIB1/GRIB2, regular decomposition)
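To make that abstraction concrete, a serial CDI write could look roughly like the sketch below. This is illustrative only, not code from CDI or CDI-pio: the function and constant names (gridCreate, streamOpenWrite, FILETYPE_NC, …) follow the serial CDI C API as documented by MPI-M and may differ between CDI versions.

/* Illustrative sketch of serial CDI output (not CDI-pio); names follow the
 * CDI C API as documented by MPI-M and may vary between versions. */
#include <cdi.h>

void write_field(const double *data, int nlon, int nlat)
{
  /* horizontal grid and a surface level */
  int gridID = gridCreate(GRID_LONLAT, nlon * nlat);
  gridDefXsize(gridID, nlon);
  gridDefYsize(gridID, nlat);
  int zaxisID = zaxisCreate(ZAXIS_SURFACE, 1);

  /* variable list with one time-dependent variable */
  int vlistID = vlistCreate();
  int varID = vlistDefVar(vlistID, gridID, zaxisID, TIME_VARIABLE);
  int taxisID = taxisCreate(TAXIS_ABSOLUTE);
  vlistDefTaxis(vlistID, taxisID);

  /* Switching FILETYPE_NC to e.g. FILETYPE_GRB changes the on-disk format
   * without touching the rest of the code; that is the abstraction CDI provides. */
  int streamID = streamOpenWrite("out.nc", FILETYPE_NC);
  streamDefVlist(streamID, vlistID);
  streamDefTimestep(streamID, 0);
  streamWriteVar(streamID, varID, data, 0);
  streamClose(streamID);

  vlistDestroy(vlistID);
  taxisDestroy(taxisID);
  zaxisDestroy(zaxisID);
  gridDestroy(gridID);
}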
What is CDI-pio? (part 2)
• Extensions of CDI-pio to
– address netCDF 4.x collective output
– improve robustness
– make the decomposition flexible
were performed by Thomas Jahns for IS-ENES during 2012/early 2013
• Irina Fast ported MPI-ESM1 to the new API
– She provided the performance results presented later
The big picture
Focal points of work 2012/2013
• Add netCDF 4.x collective output and
– fall back to a file-per-process mapping if the netCDF library doesn't support collective I/O (sketched below).
• Fix edge cases in reliably writing records/files of any size.
• Allow flexible decomposition specification (more on that soon) and make use of it in ECHAM.
• Track down a deadlock in MPI_File_iwrite_shared in combination with MPI RMA on IBM PE.
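The fallback mentioned in the first bullet could, in simplified form, look like the sketch below. It is not the CDI-pio implementation: the helper open_output and the per-rank filename scheme are invented for illustration, and a real build would normally detect missing parallel netCDF support at configure time rather than at run time.

/* Hypothetical sketch: open a netCDF-4 file for collective parallel output
 * and fall back to one file per MPI rank if parallel creation fails. */
#include <mpi.h>
#include <netcdf.h>
#include <netcdf_par.h>
#include <stdio.h>

/* Returns a netCDF error code; *per_process is set to 1 if the fallback
 * to a file-per-process mapping was taken. */
static int open_output(const char *path, MPI_Comm comm, int *ncid, int *per_process)
{
  int rank, status;
  MPI_Comm_rank(comm, &rank);
  *per_process = 0;
  /* Try collective (parallel) creation first. */
  status = nc_create_par(path, NC_NETCDF4 | NC_MPIIO, comm, MPI_INFO_NULL, ncid);
  if (status == NC_NOERR)
    return NC_NOERR;
  /* Fallback: every process writes its own file. */
  char fname[1024];
  snprintf(fname, sizeof(fname), "%s.%06d", path, rank);
  *per_process = 1;
  return nc_create(fname, NC_NETCDF4 | NC_CLOBBER, ncid);
}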
On flexible decompositions
• The previous version assumed a regular 1D decomposition
– Collective activity of the model processes was needed to match it
– Costly
• Need a description of the data decomposition on the client side
– The YAXT library provides just such a descriptor
What's YAXT?
• Library on top of MPI
• Inspired by the Fortran prototype Unitrans by Mathias Pütz in the ScalES project, therefore Yet Another Exchange Tool
• Built and maintained by Moritz Hanke, Jörg Behrens, Thomas Jahns
• Implemented in C ⇒ type-agnostic code (Fortran is supposed to get this once Fortran 2015 is implemented)
• Fully featured Fortran interface (requires C interoperability)
• Supported by DKRZ
Central motivation for YAXT
Recurring problem: data must be rearranged across process boundaries (halo exchange, transposition, gather/scatter, load balancing).
Consequence: replace error-prone hand-coded MPI calls with a simpler interface.
What does YAXT provide?
• Easily understandable descriptors of data as relating to the global context.
• Programmatic discovery and persistence of all necessary communication partners
1. on the sender side and
2. on the receiver side
• Persistence for the above and for the corresponding memory access pattern
1. thus moving what was previously code into data
2. as flexible as MPI datatypes
• Hiding of the message passing mechanism
– Can be replaced with e.g. 1-sided communication
– Can easily aggregate multiple communications
What does YAXT provide? (in other words)
• Several classes of index lists
• Computation of an exchange map (Xt_xmap), i.e. what to send to / receive from which other processes, given
1. a source index list and
2. a target index list
• Computation of a data access pattern (Xt_redist), given
1. an exchange map and
2. the MPI datatype of an element, or data access pattern(s)
• Data transfer via MPI (see the sketch below), given
– a data access pattern and
– the arrays holding sources/targets
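A minimal sketch of that sequence using the YAXT C API (xt_idxvec_new, xt_xmap_all2all_new, xt_redist_p2p_new, xt_redist_s_exchange1). The index values, array sizes and the two-rank setup are invented for illustration and are not taken from CDI-pio.

/* Minimal YAXT sketch: swap two blocks of 5 doubles between two processes.
 * Intended to run on exactly 2 MPI ranks; values are purely illustrative. */
#include <mpi.h>
#include <yaxt.h>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  xt_initialize(MPI_COMM_WORLD);

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  enum { N = 5 };
  /* Global integer "names" of the elements each process has and wants. */
  Xt_int have[N], want[N];
  for (int i = 0; i < N; ++i) {
    have[i] = (Xt_int)(rank * N + i);              /* block owned locally   */
    want[i] = (Xt_int)(((rank + 1) % 2) * N + i);  /* the other rank's block */
  }

  /* 1. index lists describing source (have) and target (want) data */
  Xt_idxlist src_list = xt_idxvec_new(have, N);
  Xt_idxlist dst_list = xt_idxvec_new(want, N);

  /* 2. exchange map: who sends what to whom */
  Xt_xmap xmap = xt_xmap_all2all_new(src_list, dst_list, MPI_COMM_WORLD);

  /* 3. redistribution object: exchange map plus element datatype */
  Xt_redist redist = xt_redist_p2p_new(xmap, MPI_DOUBLE);

  /* 4. perform the transfer */
  double src_data[N], dst_data[N];
  for (int i = 0; i < N; ++i) src_data[i] = (double)have[i];
  xt_redist_s_exchange1(redist, src_data, dst_data);

  xt_redist_delete(redist);
  xt_xmap_delete(xmap);
  xt_idxlist_delete(dst_list);
  xt_idxlist_delete(src_list);
  xt_finalize();
  MPI_Finalize();
  return 0;
}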
How is YAXT used?
• For every data element:
– assign an integer "name" and
– describe it via an MPI datatype
• Declare element "have" and "want" lists
• Provide the addresses of the arrays holding the "have" elements and (to hold) the "want" elements
• YAXT computes all the MPI actions needed to make the redistribution happen
Decomposition description via YAXT
• Where a user typically defines both sides of a YAXT redistribution, CDI defines the variable targets to use indices 0..n-1 consecutively (see the sketch below).
• The YAXT descriptor is packed together with the data; this allows changing descriptors upon re-balancing.
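The consecutive 0..n-1 target indexing can be written compactly with a YAXT index stripe. The sketch below is illustrative only and not CDI-pio source: the helper make_var_redist and its parameters are invented, and it assumes struct Xt_stripe has the fields start, stride and nstrides and that xt_idxstripes_new and xt_idxempty_new are available in the installed YAXT version.

/* Illustrative only: build a redistribution from an arbitrarily decomposed
 * source (model side) to a consecutive chunk of indices 0..n-1 (collector side). */
#include <mpi.h>
#include <yaxt.h>

Xt_redist make_var_redist(const Xt_int *my_indices, int nlocal,
                          Xt_int tgt_start, int tgt_count, MPI_Comm comm)
{
  /* source: the global indices this model process actually owns */
  Xt_idxlist src = xt_idxvec_new(my_indices, nlocal);
  /* target: a consecutive slice of 0..n-1; pure model processes pass tgt_count = 0 */
  Xt_idxlist dst;
  if (tgt_count > 0) {
    struct Xt_stripe stripe = { .start = tgt_start, .stride = 1, .nstrides = tgt_count };
    dst = xt_idxstripes_new(&stripe, 1);
  } else {
    dst = xt_idxempty_new();
  }
  Xt_xmap xmap = xt_xmap_all2all_new(src, dst, comm);
  Xt_redist redist = xt_redist_p2p_new(xmap, MPI_DOUBLE);
  xt_xmap_delete(xmap);
  xt_idxlist_delete(dst);
  xt_idxlist_delete(src);
  return redist;
}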
Issues with actual writing
• The GRIB data format assumes no ordering of records (but tools only support this within the same time step)
• No call in MPI can take advantage of that
• The performance of MPI_File_(i)write_shared (sketched below) never surpasses serial writing (various methods tried)
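For context on the call named above: appending whole GRIB records through MPI's shared file pointer would look roughly like this simplified sketch (not the CDI-pio code). It is legal precisely because GRIB imposes no record order, yet, as stated above, its measured performance never beat serial writing.

/* Simplified sketch of shared-file-pointer output of complete records.
 * Buffer handling and GRIB encoding are omitted. */
#include <mpi.h>

/* Append one fully encoded GRIB record to a file opened on comm. */
static void write_record(MPI_File fh, const void *record, int nbytes)
{
  MPI_Status status;
  /* The shared file pointer serializes appends across all processes;
   * record boundaries stay intact, but their order is unspecified. */
  MPI_File_write_shared(fh, record, nbytes, MPI_BYTE, &status);
}

/* usage:
   MPI_File fh;
   MPI_File_open(comm, "out.grb",
                 MPI_MODE_CREATE | MPI_MODE_WRONLY | MPI_MODE_APPEND,
                 MPI_INFO_NULL, &fh);
   write_record(fh, buf, len);
   MPI_File_close(&fh);                                                   */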
Results for RMA
• RMA works well with RDMA characteristics on MVAPICH2, but needs MV2_DEFAULT_PUT_GET_LIST_SIZE=ntasks for more than 192 non-I/O-server tasks
• RMA is functionally working with IBM PE, but performance is lower than rank-0 I/O with gather
• Open MPI improves just as well as MVAPICH2, but its baseline performance is so much lower that even with CDI-pio, Open MPI is slower than MVAPICH2 without it
Portable RMA = hurt
• Better to implement a scalable message passing scheme from the outset and invest in RMA where it is beneficial
Results with MVAPICH2 - scaling
Results with MVAPICH2 - profile
HD(CP)2
• For scientists: cloud- and precipitation-resolving simulations
• For DKRZ: huge (I/O) challenges to the power of 2
• Works with ICON: the icosahedral non-hydrostatic GCM
http://www.mpimet.mpg.de/en/science/models/icon.html
• Goals of ICON (MPI-Met and DWD):
– scalable dynamical core and better physics
ICON grid
The ICON grid can be scaled really well:
After scaling the grid, even finer refinements can be nested:
Better to stop the illustrations here, while there are still enough pixels on an HD screen.
ICON HD(CP)2 input
• Grid size: 100,000,000 cells horizontally
• Reading on a single machine no longer works: memory on 64 GB nodes is exhausted early on
• But there is no prior knowledge of which process needs which data, because of the irregular grid
ICON HD(CP)2 output
• Sub-second time resolution
• 3D data also means: the already prohibitive 2D grid data volume is dwarfed by the volume data
• The current meta-data approach of CDI-pio is unsustainable
Future work
• Make collectors act collectively
– Tighter coupling of I/O servers, but that is inevitable anyway
– More even distribution of data
– Will accommodate 2-sided as well as 1-sided communication
• Decompose meta-data, particularly for HD(CP)2
• Further split the I/O servers into collectors and distinct writers to conserve per-process memory
• Introduce OpenMP parallelization
• Input
Summary
• Working concept, but many unexpected problems in practice
• Invested work not yet recouped in production
• Very useful tool: YAXT
• The problems keep growing and invalidate previous assumptions
Thanks to…
• Jörg Behrens (DKRZ)
• Irina Fast (DKRZ)
• Moritz Hanke (DKRZ)
• Deike Kleberg (MPI-M)
• Luis Kornblueh (MPI-M)
• Uwe Schulzweida (MPI-M)
• Günther Zängl (DWD)
• Joachim Biercamp (DKRZ)
• Panagiotis Adamidis (DKRZ)