NATIONAL ENERGY RESEARCH SCIENTIFIC COMPUTING CENTER
I/O Strategies for the T3E
Jonathan Carter
NERSC User Services
T3E Overview
• T3E is a set of Processing Elements (PE) connected by a fast 3D torus.
• PEs do not have local disk
• All PEs access all filesystems equivalently
• Path for I/O generally looks like:
  – user buffer space
  – system buffer space
  – I/O device buffer space
Filesystems
• /usr/tmp
  – fast
  – subject to 14 day purge, not backed up
  – check quota with quota -s /usr/tmp (usually 75 GB and 6000 inodes)
• $TMPDIR
  – fast
  – purged at end of job or session
  – shares quota with /usr/tmp
• $HOME
  – slower
  – permanent, backed up
  – check quota with quota (usually 2 GB and 3500 inodes)
Types of I/O
• Language I/O: Fortran or C (ANSI or POSIX)
• Cray FFIO library (can be used from Fortran or C)
• MPI I/O
• Cray extensions to Fortran and C I/O (mostly for compatibility with PVP systems)
I/O Strategies - Exclusive access files
• Each PE reads and writes to a separate file
  – Language I/O
  – MPI I/O
  – Increase language I/O performance with FFIO library (C must use POSIX style calls)
I/O Strategies - Communication and I/O PE
• One PE coordinates reading and writing and communicates data back and forth between other PEs via message passing
  – Language I/O
  – MPI I/O
  – Increase language I/O performance with FFIO library
I/O Strategies - Shared files
• All PEs read and write the same file simultaneously
  – Language I/O with FFIO library global layer
  – MPI I/O
  – Language I/O with FFIO library global layer and Cray extensions for additional flexibility
Cray FFIO library
• FFIO is a set of I/O layers tuned for different I/O characteristics
• Buffering of data (configurable size)
• Caching of data (configurable size)
• Available to regular Fortran I/O without reprogramming
• Available for C through POSIX-like calls, e.g. ffopen, ffwrite
The assign command
• The assign command controls
  – which FFIO layer is active
  – striping across multiple partitions
  – lots more
• Scope of assign
  – File name
  – Fortran unit number
  – File type (e.g. all sequential unformatted files)
assign Examples
• read and write to file restart.file from all PEs by using the FFIO library global layer
  assign -F global:128:2 f:restart.file
• use the FFIO library bufa layer to improve performance for file opened on Fortran unit 10
  assign -F bufa:128:2 u:10
• use the FFIO library bufa layer to improve performance for all unformatted sequential Fortran files
  assign -F bufa:128:2 g:su
assign Examples
• To see all active assigns
  assign -V
• To remove all active assigns
  assign -R
bufa FFIO layer
• bufa is an asynchronous buffering layer
• performs read-ahead, write-behind
• specify buffer size with -F bufa:bs:nbufs where bs is the buffer size in units of 4 Kbyte blocks, and nbufs is the number of buffers
• buffer space increases your application's memory requirements
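To make the memory cost concrete, the buffer space implied by a bs:nbufs pair can be estimated with simple arithmetic. The sketch below uses Python purely as a calculator; the function name is hypothetical, and it assumes that a 4 Kbyte block is 4096 bytes and that the layer allocates exactly bs × nbufs blocks with no additional overhead.

```python
BLOCK_BYTES = 4096  # assumption: one "4 Kbyte block" is 4096 bytes


def ffio_buffer_bytes(bs, nbufs):
    """Estimate the buffer memory implied by an FFIO spec like bufa:bs:nbufs.

    bs    -- buffer size in 4 Kbyte blocks
    nbufs -- number of buffers
    """
    if bs <= 0 or nbufs <= 0:
        raise ValueError("bs and nbufs must be positive")
    return bs * BLOCK_BYTES * nbufs


# Corresponds to: assign -F bufa:128:2 u:10
print(ffio_buffer_bytes(128, 2))  # 1048576 bytes (1 MiB)
```

Under these assumptions the bufa:128:2 setting used in the assign examples costs about 1 MiB of application memory for each file it applies to.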
global FFIO layer
• global is a caching and buffering layer which enables multiple PEs to read and write to the same file
• if one PE has already read the data, an additional read request from another PE will result in a remote memory copy
• file open is a synchronizing event
• By default, all PEs must open a global file; this can be changed by calling GLIO_GROUP_MPI(comm)
• specify buffer size with -F global:bs:nbufs where bs is the buffer size in units of 4 Kbyte blocks, and nbufs is the number of buffers per PE
File positioning with the global FFIO layer
• Positioning of a read or write is your responsibility
• File pointers are private
• Fortran
  – Use a direct access file, and read/write(rec=num)
  – Use Cray extensions setpos and getpos to position file pointer (not portable)
• C
  – Use ffseek
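Because each PE positions its own reads and writes, the record numbers must be chosen so that PEs never overlap. The sketch below is one possible scheme, not the only one: it assumes a block layout in which every PE owns the same number of consecutive fixed-length records, and that the record length is given in bytes. The function names are hypothetical. The record number is what would go in read/write(rec=num); the byte offset is what a seek-style call would need.

```python
def record_number(me, k, recs_per_pe):
    """1-based record number of the k-th record (1-based) owned by PE `me` (0-based).

    Assumes a block layout: PE 0 owns records 1..recs_per_pe, PE 1 the next
    recs_per_pe records, and so on.
    """
    if not 1 <= k <= recs_per_pe:
        raise ValueError("k must be between 1 and recs_per_pe")
    return me * recs_per_pe + k


def byte_offset(rec, recl_bytes):
    """Byte offset of the start of 1-based record `rec` in an unblocked file."""
    return (rec - 1) * recl_bytes


# PE 2 writing its first record, with 4 records per PE and 8192-byte records
rec = record_number(2, 1, 4)
print(rec, byte_offset(rec, 8192))  # 9 65536
```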
FFIO considerations
• Examples above use an unblocked file structure; normal Fortran files are blocked. To read the file without the global or bufa layers you must use
  assign -s unblocked f:filename
• bufa and global do not allow backspace, or skipping over a partially read record. You can allow this behavior by using the cos layer in addition to bufa or global, but then setpos doesn’t work.
  assign -s cos:128,bufa:128:2 f:filename
More on FFIO
• There are many other FFIO layers, some pretty obscure
  – cache and cachea layers, good for random access files
• man intro_ffio for a terse description
• Cray Publication - Application Programmer’s I/O Guide
More on assign
• Many text processing options
• Switch between Fortran 77 and Fortran 90 namelist
• File pre-allocation
• File striping
Further Information
• I/O on the T3E Tutorial by Richard Gerber at http://home.nersc.gov/training/tutorials
• Cray Publication - Application Programmer’s I/O Guide
• Cray Publication - Cray T3E Fortran Optimization Guide
• man assign
MPI I/O
• Part of MPI-2
• Interface for High Performance Parallel I/O
  – data partitioning
  – collective I/O
  – asynchronous I/O
  – portability and interoperability
MPI I/O Definitions
• An MPI file is an ordered collection of MPI types.
• A file may be opened individually or collectively by a group of processes
• The fileview defines a template for accessing the file and is used to partition the file amongst processes
Fileviews
• A fileview is composed of three pieces:
  – a displacement (in bytes) from the beginning of the file
  – an elementary datatype (etype), which is the unit of data access and positioning within the file
  – a filetype, which defines a template for accessing the file. A filetype can contain etypes or holes of the same extent as etypes.
Fileviews (cont.)
• The filetype pattern is repeated, “tiling” the file
• Only the non-empty slots are available to read or write
Fileviews (cont.)
• Each process can have a different filetype
[Figure: filetypes for Process 0, Process 1, and Process 2]
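The way a displacement, an etype, and a tiled filetype combine can be modelled with a little arithmetic. The sketch below is a Python model only, not MPI code: the function name is hypothetical, a filetype is represented as a list of booleans (True for an etype slot, False for a hole), and the three complementary filetypes are an assumed layout chosen for illustration.

```python
def view_offset(k, disp, etype_bytes, filetype):
    """Byte offset in the file of the k-th etype (0-based) visible through a fileview.

    disp        -- displacement in bytes from the beginning of the file
    etype_bytes -- size of one etype in bytes
    filetype    -- list of booleans, one per etype-sized slot:
                   True for a data slot, False for a hole
    """
    slots = [i for i, is_data in enumerate(filetype) if is_data]
    if not slots:
        raise ValueError("filetype must contain at least one data slot")
    # The filetype repeats ("tiles" the file); find which tile and which slot.
    tile, within = divmod(k, len(slots))
    return disp + (tile * len(filetype) + slots[within]) * etype_bytes


# Assumed layout: three processes, each seeing two of every six 4-byte slots
p0 = [True, True, False, False, False, False]
p1 = [False, False, True, True, False, False]
p2 = [False, False, False, False, True, True]

print([view_offset(k, 0, 4, p1) for k in range(4)])  # [8, 12, 32, 36]
```

With complementary filetypes like these, every slot in the file belongs to exactly one process, which is how a fileview partitions a shared file.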
MPI_File_set_view
• Called after MPI_File_open to set fileview
• MPI_File_set_view(fh, disp, etype, filetype, datarep, info)
  – fh is a file handle
  – disp, etype, and filetype define the fileview
  – datarep is one of “native”, “internal”, or “external32”
  – info is a set of hints to optimize performance
MPI Info object
• An info object bundles up a set of parameters
  integer finfo
  call MPI_Info_create(finfo, ierr)
  call MPI_Info_set(finfo, 'access_style', 'write_mostly', ierr)
• MPI I/O defines a set of parameters used to help optimize I/O performance
• MPI_INFO_NULL can be used instead of an info object
Open and Close
• MPI_File_open(comm, filename, amode, info, fh)
  – comm, open is collective over this communicator
  – filename, string or character variable
  – amode, file access mode: MPI_MODE_RDONLY, MPI_MODE_RDWR etc.
  – info, info object, used to pass hints to open
  – fh, file handle
• MPI_File_close(fh)
Utility routines
• MPI_File_delete
• MPI_File_set_size
• MPI_File_preallocate
• MPI_File_set_info
Query routines
• MPI_File_get_size
• MPI_File_get_group
• MPI_File_get_amode
• MPI_File_get_info
• MPI_File_get_view
Data access routines
• Positioning
  – Explicit, each call has an offset
  – Individual, each PE maintains an individual file pointer
  – Shared, the file pointer is maintained globally
• Synchronism
  – Blocking, routine returns when complete
  – Non-blocking, must call a termination routine to ensure completion
• Coordination
  – Non-collective
  – Collective
Summary of access routines
Positioning   Synchronism    Non-collective        Collective
Explicit      Blocking       READ_AT               READ_AT_ALL
              Non-blocking   IREAD_AT, WAIT        READ_AT_ALL_BEGIN, READ_AT_ALL_END
Individual    Blocking       READ                  READ_ALL
              Non-blocking   IREAD, WAIT           READ_ALL_BEGIN, READ_ALL_END
Shared        Blocking       READ_SHARED           READ_ORDERED
              Non-blocking   IREAD_SHARED, WAIT    READ_ORDERED_BEGIN, READ_ORDERED_END
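The routine names in the table follow a regular pattern, which a small lookup makes explicit. The sketch below is a Python helper whose function name and argument spellings are hypothetical; it returns the full MPI-2 read routine names, including the MPI_FILE_ prefix that the table omits, for one cell of the table. WAIT is taken to mean MPI_WAIT on the request returned by the non-blocking call.

```python
def read_routines(positioning, blocking, collective):
    """Return the MPI read routine name(s) for one cell of the access-routine table.

    positioning -- "explicit", "individual", or "shared"
    blocking    -- True for blocking, False for non-blocking
    collective  -- True for collective, False for non-collective
    """
    if not collective:
        suffix = {"explicit": "_AT", "individual": "", "shared": "_SHARED"}[positioning]
        name = "MPI_FILE_" + ("" if blocking else "I") + "READ" + suffix
        # A non-blocking call must be completed by a termination routine.
        return [name] if blocking else [name, "MPI_WAIT"]

    base = {
        "explicit": "READ_AT_ALL",
        "individual": "READ_ALL",
        "shared": "READ_ORDERED",
    }[positioning]
    if blocking:
        return ["MPI_FILE_" + base]
    # Non-blocking collectives are "split collective": a BEGIN/END pair.
    return ["MPI_FILE_" + base + "_BEGIN", "MPI_FILE_" + base + "_END"]


print(read_routines("explicit", False, True))
# ['MPI_FILE_READ_AT_ALL_BEGIN', 'MPI_FILE_READ_AT_ALL_END']
```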
Summary of access routines (cont.)
• MPI_File_seek
• MPI_File_get_position
• MPI_File_get_byte_offset
• MPI_File_seek_shared (collective)
• MPI_File_get_position_shared
T3E Implementation
• No shared file pointers
• No non-blocking collective (split collective)
• SPR filed on non-blocking read
• Work in progress
Examples
• All the program fragments are available as working programs on the T3E
• Do “module load training”, then look in $EXAMPLES/mpi_io
• All examples are of a distributed dot product
  – initialize data with random numbers
  – compute dot product of whole vector
  – write out data into a shared file
  – read back in and check dot product
[Figure: PE 0, PE 1, PE 2]
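The four steps of the example can be walked through without MPI. The sketch below is a single-process Python stand-in, not one of the programs in $EXAMPLES/mpi_io: the function name is hypothetical, a bytearray plays the role of the shared file, the PEs are simulated by a loop, and it assumes 8-byte reals with PE `me` owning elements m*me to m*me+m-1.

```python
import random
import struct

REAL_BYTES = 8  # assumption: 8-byte reals


def run_example(nprocs=3, m=4, seed=0):
    """Simulate the distributed dot product example in one process."""
    rng = random.Random(seed)
    fmt = "<%dd" % m

    # 1. Each simulated PE initializes its own slice with random numbers.
    slices = [[rng.random() for _ in range(m)] for _ in range(nprocs)]

    # 2. Local dot products are summed, as a reduction would do.
    dot_before = sum(sum(x * x for x in s) for s in slices)

    # 3. Each PE writes its slice into the shared "file" at its own offset.
    shared = bytearray(REAL_BYTES * m * nprocs)
    for me, s in enumerate(slices):
        struct.pack_into(fmt, shared, REAL_BYTES * m * me, *s)

    # 4. Each PE reads its slice back and the dot product is recomputed.
    back = [struct.unpack_from(fmt, shared, REAL_BYTES * m * me)
            for me in range(nprocs)]
    dot_after = sum(sum(x * x for x in s) for s in back)
    return dot_before, dot_after


before, after = run_example()
print(abs(before - after) < 1e-12)  # True
```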
Naming convention
• First letter is positioning: explicit, individual, or shared
• Second letter is synchronism: blocking or non-blocking
• Third letter is coordination: non-collective or collective
• ebn.f90 is the explicit, blocking, non-collective example
• There are several “ibn” examples dealing with different fileviews
Filetype Example
[Figure: filetypes for Process 0, Process 1, and Process 2]
Filetype Example
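! open the shared file, passing an access hint
filemode = MPI_MODE_RDWR + MPI_MODE_CREATE
call MPI_INFO_CREATE(finfo, ierr)
call MPI_INFO_SET(finfo, 'access_style', 'write_mostly', ierr)
call MPI_FILE_OPEN(MPI_COMM_WORLD, 'vector', filemode, &
     finfo, fhv, ierr)
! filetype: this PE's slice of m elements, starting at m*me,
! out of a vector of m*nprocs elements
call MPI_TYPE_CREATE_SUBARRAY(1, (/ m*nprocs /), (/ m /), (/ m*me /), &
     MPI_ORDER_FORTRAN, MPI_REAL, mpi_fileslice, ierr)
! a datatype must be committed before it is used as a filetype
call MPI_TYPE_COMMIT(mpi_fileslice, ierr)
! disp must be an integer of kind MPI_OFFSET_KIND
disp=0
call MPI_FILE_SET_VIEW(fhv, disp, MPI_REAL, mpi_fileslice, &
     'native', MPI_INFO_NULL, ierr)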
Individual, blocking, non-collective
call MPI_FILE_WRITE(fhv, b, m, MPI_REAL, status, ierr)
lresult=sdot(m, b, 1, b, 1)
call MPI_REDUCE(lresult, result, 1, MPI_REAL, MPI_SUM, 0, &
     MPI_COMM_WORLD, ierr)
if (me.eq.0) then
   write(6,*) 'dot product: ', result
end if
! zero vector and read it back in
b=0.0
disp=0
call MPI_FILE_SEEK(fhv, disp, MPI_SEEK_SET, ierr)
call MPI_FILE_READ(fhv, b, m, MPI_REAL, status, ierr)
Further Information on MPI I/O
• MPI: The Complete Reference
  – Volume 1, The MPI Core
  – Volume 2, The MPI Extensions