

Page 1: Status of CDI-pio development

Thomas Jahns <[email protected]>
IS-ENES workshop on Scalable I/O in climate models, Hamburg 2013-10-28

Page 2: Outline

• History
• Recent work
• Issues (big time)
• Outlook
• Summary


Page 3: What is CDI-pio?

• CDI (Climate Data Interface) is the I/O backend of CDO (Climate Data Operators), both by Uwe Schulzweida
  – CDI abstracts away the differences between several file formats relevant in climate science (GRIB1, GRIB2, netCDF, SRV, EXT, IEG) (see the sketch below)
• CDI-pio is the MPI parallelization and I/O client/server infrastructure initially built by Deike Kleberg (supported GRIB1/GRIB2, regular decomposition)

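To make the abstraction concrete, the following minimal sketch shows serial CDI usage, modeled on the example programs that ship with CDI. The file name, variable name, and grid sizes are invented for illustration, and some constant names (e.g. FILETYPE_NC, TIME_VARIABLE) were renamed with a CDI_ prefix in later releases, so check the CDI version in use. Switching the file-type constant passed to streamOpenWrite is essentially all it takes to target one of the other formats.

    #include "cdi.h"

    int main(void)
    {
      enum { nlon = 12, nlat = 6 };
      double lons[nlon], lats[nlat], field[nlon * nlat];
      for (int i = 0; i < nlon; ++i) lons[i] = i * 30.0;
      for (int j = 0; j < nlat; ++j) lats[j] = -75.0 + j * 30.0;
      for (int k = 0; k < nlon * nlat; ++k) field[k] = 1.0;

      /* horizontal grid and a surface level */
      int gridID = gridCreate(GRID_LONLAT, nlon * nlat);
      gridDefXsize(gridID, nlon);
      gridDefYsize(gridID, nlat);
      gridDefXvals(gridID, lons);
      gridDefYvals(gridID, lats);
      int zaxisID = zaxisCreate(ZAXIS_SURFACE, 1);

      /* variable list: one time-dependent variable on that grid */
      int vlistID = vlistCreate();
      int varID = vlistDefVar(vlistID, gridID, zaxisID, TIME_VARIABLE);
      vlistDefVarName(vlistID, varID, "var1");
      int taxisID = taxisCreate(TAXIS_ABSOLUTE);
      vlistDefTaxis(vlistID, taxisID);

      /* only this constant changes to target GRIB1/GRIB2/SRV/EXT/IEG instead */
      int streamID = streamOpenWrite("example.nc", FILETYPE_NC);
      streamDefVlist(streamID, vlistID);

      taxisDefVdate(taxisID, 20131028);
      taxisDefVtime(taxisID, 0);
      streamDefTimestep(streamID, 0);            /* first (and only) time step */
      streamWriteVar(streamID, varID, field, 0); /* 0 = no missing values */

      streamClose(streamID);
      vlistDestroy(vlistID);
      taxisDestroy(taxisID);
      zaxisDestroy(zaxisID);
      gridDestroy(gridID);
      return 0;
    }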

Page 4: What is CDI-pio? (part 2)

• Extensions of CDI-pio to
  – address netCDF 4.x collective output
  – improve robustness
  – make the decomposition flexible
  were performed by Thomas Jahns for IS-ENES during 2012/early 2013
• Irina Fast ported MPI-ESM1 to the new API
  – She provided the performance results presented later


Page 5: The big picture


Page 6: Focal points of work 2012/2013

• Add netCDF 4.x collective output and
  – fall back to a file-per-process mapping if the netCDF library doesn't support collective I/O
• Fix edge cases in reliably writing records/files of any size
• Allow flexible decomposition specification (more on that soon) and make use of it in ECHAM
• Track down a deadlock in MPI_File_iwrite_shared in combination with MPI RMA on IBM PE


Page 7: On flexible decompositions

• The previous version assumed a regular 1D decomposition
  – Collective activity of the model processes was needed to match it
  – Costly
• Need to describe the decomposition of the data on the client side
  – The YAXT library provides just such a descriptor


Page 8: What's YAXT?

• Library on top of MPI
• Inspired by the Fortran prototype Unitrans by Mathias Pütz in the ScalES project, therefore Yet Another Exchange Tool
• Built and maintained by Moritz Hanke, Jörg Behrens, Thomas Jahns
• Implemented in C ⇒ type-agnostic code (Fortran is supposed to have this whenever Fortran 2015 gets implemented)
• Fully featured Fortran interface (requires C interop)
• Supported by DKRZ


Page 9: Central motivation for YAXT


Recurring problem: data must be rearranged across process boundaries (halo exchange, transposition, gather/scatter, load balancing).

Consequence: replace error-prone hand-coded MPI calls with a simpler interface.

Page 10: What does YAXT provide?

• Easily understandable descriptors of data as relating to the global context
• Programmatic discovery and persistence of all necessary communication partners
  1. sender side and
  2. receiver side
• Persistence for the above and the corresponding memory access pattern
  1. thus moving what was previously code to data
  2. as flexible as MPI datatypes
• Hiding of the message passing mechanism
  – can be replaced with e.g. 1-sided communication
  – can easily aggregate multiple communications


Page 11: What does YAXT provide? (in other words)

• Several classes of index lists
• Computation of an exchange map (Xt_xmap), i.e. what to send to/receive from which other processes, given
  1. a source index list and
  2. a target index list
• Computation of a data access pattern (Xt_redist), given
  1. an exchange map and
  2. the MPI datatype of an element, or data access pattern(s)
• Data transfer via MPI, given
  – a data access pattern and
  – the arrays holding sources/targets


Page 12: How is YAXT used?

• For every data element:
  – assign an integer “name” and
  – describe it via an MPI datatype
• Declare element “have”- and “want”-lists
• Provide the array addresses holding the “have” elements and to hold the “want” elements
• YAXT computes all MPI action needed to make the redistribution happen (see the sketch below)

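As an illustration of these steps, here is a minimal C sketch using the YAXT interface (xt_idxvec_new, xt_xmap_all2all_new, xt_redist_p2p_new, xt_redist_s_exchange1). The cyclic-shift pattern is invented purely for illustration; it is not how CDI-pio or ECHAM set up their exchanges.

    #include <mpi.h>
    #include "yaxt.h"

    /* Each rank "has" a block of global indices and "wants" the block of the
       next rank, i.e. a cyclic rotation of the data. */
    int main(int argc, char **argv)
    {
      MPI_Init(&argc, &argv);
      xt_initialize(MPI_COMM_WORLD);

      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      enum { nlocal = 4 };
      Xt_int have_idx[nlocal], want_idx[nlocal];
      double have_data[nlocal], want_data[nlocal];

      for (int i = 0; i < nlocal; ++i) {
        have_idx[i] = (Xt_int)(rank * nlocal + i);                /* what we own  */
        want_idx[i] = (Xt_int)(((rank + 1) % size) * nlocal + i); /* what we need */
        have_data[i] = (double)have_idx[i];
      }

      /* 1. integer "names" of the elements on both sides */
      Xt_idxlist have_list = xt_idxvec_new(have_idx, nlocal);
      Xt_idxlist want_list = xt_idxvec_new(want_idx, nlocal);

      /* 2. exchange map: who sends what to whom */
      Xt_xmap xmap = xt_xmap_all2all_new(have_list, want_list, MPI_COMM_WORLD);

      /* 3. redistribution object: adds the per-element MPI datatype */
      Xt_redist redist = xt_redist_p2p_new(xmap, MPI_DOUBLE);

      /* 4. the actual data transfer */
      xt_redist_s_exchange1(redist, have_data, want_data);

      xt_redist_delete(redist);
      xt_xmap_delete(xmap);
      xt_idxlist_delete(want_list);
      xt_idxlist_delete(have_list);
      xt_finalize();
      MPI_Finalize();
      return 0;
    }

Steps 2 and 3 produce persistent objects, so the discovery of communication partners is paid once and the exchange can be repeated every time step; this is the "moving what was previously code to data" from the previous slides.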

Page 13: Status of CDI-pio development

• Where user typically defines both sides of YAXT redistribution, CDI defines variable targets to useindices 0..n-1 consecutively.

• YAXT descriptor is packed together with data, this allows for changing descriptors upon re-balancing.

Decomposition description via YAXT

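To connect this to the YAXT sketch above: a hypothetical helper (illustration only, not CDI-pio source code) could build the exchange map for one output variable as follows. Model ranks pass whatever global indices they own, while collector ranks pass a consecutive slice of the 0..n-1 target indexing.

    #include <stdlib.h>
    #include <mpi.h>
    #include "yaxt.h"

    /* Hypothetical helper, for illustration only: build the exchange map for one
       output variable.  Model ranks describe the (possibly irregular) set of
       global indices they own; collector ranks claim a consecutive block of the
       0..n-1 target indexing.  A rank passes an empty list for the side it does
       not participate in. */
    Xt_xmap output_xmap(const Xt_int *owned_idx, int num_owned,
                        Xt_int target_start, int target_count,
                        MPI_Comm comm)
    {
      Xt_idxlist src = xt_idxvec_new(owned_idx, num_owned);

      Xt_int *tgt_idx =
        malloc(sizeof *tgt_idx * (size_t)(target_count > 0 ? target_count : 1));
      for (int i = 0; i < target_count; ++i)
        tgt_idx[i] = target_start + (Xt_int)i;     /* consecutive 0..n-1 slice */
      Xt_idxlist dst = xt_idxvec_new(tgt_idx, target_count);

      Xt_xmap xmap = xt_xmap_all2all_new(src, dst, comm);

      xt_idxlist_delete(dst);
      xt_idxlist_delete(src);
      free(tgt_idx);
      return xmap;
    }

Since YAXT offers several classes of index lists (slide 11), the explicit index vector here could be replaced by a more compact descriptor; and because the descriptor travels with the data, it can simply be rebuilt from a new index list after re-balancing, which is what the second bullet above alludes to.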

Page 14: Issues with actual writing

• The GRIB data format assumes no ordering of records (but tools only support this within the same time step)
• No call in MPI can take advantage of that
• Performance of MPI_File_(i)write_shared never surpasses serial writing (various methods tried)


Page 15: Status of CDI-pio development

• RMA working well with RDMA characteristics on MVAPICH2,but needs MV2_DEFAULT_PUT_GET_LIST_SIZE=ntasks for more than 192 non-I/O server tasks

• RMA functionally working with IBM PE,but performance is lower than rank 0 I/O with gather.

• OpenMPI improves just as well as MVAPICH2,but baseline performance is so much lower that even withCDI-pio OpenMPI is slower than MVAPICH2 without

Results for RMA


Page 16: Portable RMA = hurt

• Better to implement a scalable message passing scheme from the outset and invest in RMA where beneficial


Page 17: Results with MVAPICH2 - scaling


Page 18: Results with MVAPICH2 - profile


Page 19: HD(CP)²

• For scientists: cloud- and precipitation-resolving simulations
• For DKRZ: huge (I/O) challenges to the power of 2
• Works with ICON: icosahedral non-hydrostatic GCM
  http://www.mpimet.mpg.de/en/science/models/icon.html
• Goals of ICON (MPI-Met and DWD):
  – scalable dynamical core and better physics


Page 20: ICON grid


The ICON grid can be scaled real well:

Page 21: Status of CDI-pio development


After scaling the grid, even finer refinements can be nested:

Page 22: Status of CDI-pio development


Better stop illustrations here, while there's enough pixels on an HD screen.

Page 23: ICON HD(CP)² input

• Grid size: 100,000,000 cells horizontally
• Reading on a single machine no longer works: memory on 64 GB nodes is exhausted early on
• But no prior knowledge of which process needs which data, because of the irregular grid


Page 24: ICON HD(CP)² output

• Sub-second resolution of time
• 3D data also means: the already prohibitive 2D grid data volume is dwarfed by the volume data
• The current meta-data approach of CDI-pio is unsustainable


Page 25: Future work

• Make collectors act collectively
  – Tighter coupling of I/O servers, but that is inevitable anyway
  – More even distribution of data
  – Will accommodate 2-sided as well as 1-sided communications
• Decompose meta-data, particularly for HD(CP)²
• Further split the I/O servers into collectors and distinct writers to conserve per-process memory
• Introduce OpenMP parallelization
• Input


Page 26: Summary

• Working concept, but many unexpected problems in practice
• Invested work not recouped in production
• Very useful tool: YAXT
• Problems only growing, which invalidates previous assumptions


Page 27: Thanks to…

• Jörg Behrens (DKRZ)
• Irina Fast (DKRZ)
• Moritz Hanke (DKRZ)
• Deike Kleberg (MPI-M)
• Luis Kornblueh (MPI-M)
• Uwe Schulzweida (MPI-M)
• Günther Zängl (DWD)


• Joachim Biercamp (DKRZ)
• Panagiotis Adamidis (DKRZ)