I/O Strategies for the T3E

Jonathan Carter, NERSC User Services
NATIONAL ENERGY RESEARCH SCIENTIFIC COMPUTING CENTER


Post on 21-Mar-2016



T3E Overview

• T3E is a set of Processing Elements (PEs) connected by a fast 3D torus
• PEs do not have local disk
• All PEs access all filesystems equivalently
• Path for I/O generally looks like:
  – user buffer space
  – system buffer space
  – I/O device buffer space


Filesystems

• /usr/tmp
  – fast
  – subject to 14 day purge, not backed up
  – check quota with quota -s /usr/tmp (usually 75 GB and 6000 inodes)
• $TMPDIR
  – fast
  – purged at end of job or session
  – shares quota with /usr/tmp
• $HOME
  – slower
  – permanent, backed up
  – check quota with quota (usually 2 GB and 3500 inodes)


Types of I/O

• Language I/O: Fortran or C (ANSI or POSIX)
• Cray FFIO library (can be used from Fortran or C)
• MPI I/O
• Cray extensions to Fortran and C I/O (mostly for compatibility with PVP systems)


I/O Strategies - Exclusive access files

• Each PE reads and writes to a separate file
  – Language I/O
  – MPI I/O
  – Increase language I/O performance with FFIO library (C must use POSIX style calls)
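The exclusive-access pattern can be sketched as follows. This is a minimal model in Python, chosen only so the example is self-contained; on the T3E the same idea would be written with Fortran or C language I/O. The `pe` and `npes` arguments and the `base.NNNN` naming scheme are assumptions made for illustration and are not prescribed by the slides.

```python
import os
import tempfile


def pe_filename(base, pe, npes):
    """Return a per-PE file name such as 'restart.0003' (hypothetical scheme)."""
    if not 0 <= pe < npes:
        raise ValueError("pe must be in range [0, npes)")
    width = max(4, len(str(npes - 1)))  # zero-pad so names sort in PE order
    return f"{base}.{pe:0{width}d}"


def write_exclusive(directory, base, pe, npes, payload):
    """Write this PE's bytes to its own file; no other PE touches it."""
    path = os.path.join(directory, pe_filename(base, pe, npes))
    with open(path, "wb") as f:
        f.write(payload)
    return path


if __name__ == "__main__":
    # Emulate 4 PEs in one process; on a real machine each PE runs this once.
    with tempfile.TemporaryDirectory() as tmp:
        for pe in range(4):
            print(write_exclusive(tmp, "restart", pe, 4, bytes([pe]) * 8))
```

Because no two PEs share a file, no coordination is needed, at the cost of one file (and one inode) per PE.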


I/O Strategies - Communication and I/O PE

• One PE coordinates reading and writing and communicates data back and forth between other PEs via message passing
  – Language I/O
  – MPI I/O
  – Increase language I/O performance with FFIO library


I/O Strategies - Shared files

• All PEs read and write the same file simultaneously
  – Language I/O with FFIO library global layer
  – MPI I/O
  – Language I/O with FFIO library global layer and Cray extensions for additional flexibility


Cray FFIO library

• FFIO is a set of I/O layers tuned for different I/O characteristics
• Buffering of data (configurable size)
• Caching of data (configurable size)
• Available to regular Fortran I/O without reprogramming
• Available for C through POSIX-like calls, e.g. ffopen, ffwrite


The assign command

• The assign command controls
  – which FFIO layer is active
  – striping across multiple partitions
  – lots more
• Scope of assign
  – File name
  – Fortran unit number
  – File type (e.g. all sequential unformatted files)


assign Examples

• Read and write file restart.file from all PEs by using the FFIO library global layer
    assign -F global:128:2 f:restart.file
• Use the FFIO library bufa layer to improve performance for the file opened on Fortran unit 10
    assign -F bufa:128:2 u:10
• Use the FFIO library bufa layer to improve performance for all unformatted sequential Fortran files
    assign -F bufa:128:2 g:su


assign Examples (cont.)

• To see all active assigns
    assign -V
• To remove all active assigns
    assign -R


bufa FFIO layer

• bufa is an asynchronous buffering layer
• Performs read-ahead, write-behind
• Specify buffer size with -F bufa:bs:nbufs, where bs is the buffer size in units of 4 Kbyte blocks and nbufs is the number of buffers
• Buffer space increases your application's memory requirements
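To make the memory cost concrete, the arithmetic behind a `layer:bs:nbufs` specification can be sketched as below. This is an illustrative Python calculation, not part of the Cray tools; it assumes "4 Kbyte" means 4096 bytes, and the helper names are hypothetical.

```python
BLOCK_BYTES = 4096  # assumption: one "4 Kbyte" block is 4096 bytes


def parse_layer_spec(spec):
    """Split a spec such as 'bufa:128:2' into (layer, bs, nbufs)."""
    layer, bs, nbufs = spec.split(":")
    return layer, int(bs), int(nbufs)


def buffer_bytes(spec):
    """Total buffer space implied by a layer:bs:nbufs spec, in bytes."""
    _, bs, nbufs = parse_layer_spec(spec)
    return bs * BLOCK_BYTES * nbufs


if __name__ == "__main__":
    total = buffer_bytes("bufa:128:2")
    print(f"bufa:128:2 -> {total} bytes ({total / 2**20:.1f} MiB)")
```

Under that assumption, the `bufa:128:2` setting used in the assign examples asks for 128 x 4096 x 2 = 1,048,576 bytes of buffer space per file.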


global FFIO layer

• global is a caching and buffering layer which enables multiple PEs to read and write to the same file
• If one PE has already read the data, an additional read request from another PE will result in a remote memory copy
• File open is a synchronizing event
• By default, all PEs must open a global file; this can be changed by calling GLIO_GROUP_MPI(comm)
• Specify buffer size with -F global:bs:nbufs, where bs is the buffer size in units of 4 Kbyte blocks and nbufs is the number of buffers per PE


File positioning with the global FFIO layer

• Positioning of a read or write is your responsibility
• File pointers are private
• Fortran
  – Use a direct access file, and read/write(rec=num)
  – Use Cray extensions setpos and getpos to position file pointer (not portable)
• C
  – Use ffseek
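The direct-access approach amounts to each PE computing its own byte offset from a record number before every transfer. The sketch below models that in Python with ordinary file I/O; it does not use FFIO, and the fixed record length `recl` and the "record number = PE number + 1" assignment are assumptions for illustration.

```python
import os
import tempfile


def record_offset(rec, recl):
    """Byte offset of 1-based record number rec with record length recl bytes."""
    if rec < 1:
        raise ValueError("record numbers start at 1")
    return (rec - 1) * recl


def write_record(path, rec, recl, payload):
    """Position explicitly, then write exactly one fixed-length record."""
    if len(payload) != recl:
        raise ValueError("payload must be exactly one record long")
    # "r+b" keeps existing contents; the file must already exist.
    with open(path, "r+b") as f:
        f.seek(record_offset(rec, recl))
        f.write(payload)


def read_record(path, rec, recl):
    """Position explicitly, then read one fixed-length record."""
    with open(path, "rb") as f:
        f.seek(record_offset(rec, recl))
        return f.read(recl)


if __name__ == "__main__":
    recl = 16
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "shared.dat")
        open(path, "wb").close()  # create the empty shared file
        # Emulated PEs write out of order; each uses record number pe + 1.
        for pe in (2, 0, 3, 1):
            write_record(path, pe + 1, recl, bytes([65 + pe]) * recl)
        print(read_record(path, 3, recl))
```

Since every write carries its own position, the order in which PEs reach the file does not affect the result.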


FFIO considerations

• Examples above use an unblocked file structure; normal Fortran files are blocked. To read the file without the global or bufa layers you must use
    assign -s unblocked f:filename
• bufa and global do not allow backspace, or skipping over a partially read record. You can allow this behavior by using the cos layer in addition to bufa or global, but then setpos doesn’t work.
    assign -s cos:128,bufa:128:2 f:filename


More on FFIO

• There are many other FFIO layers, some pretty obscure
  – cache and cachea layers, good for random access files
• man intro_ffio for a terse description
• Cray Publication - Application Programmer’s I/O Guide


More on assign

• Many text processing options
• Switch between Fortran 77 and Fortran 90 namelist
• File pre-allocation
• File striping


Further Information

• I/O on the T3E Tutorial by Richard Gerber at http://home.nersc.gov/training/tutorials
• Cray Publication - Application Programmer’s I/O Guide
• Cray Publication - Cray T3E Fortran Optimization Guide
• man assign


MPI I/O

• Part of MPI-2
• Interface for High Performance Parallel I/O
  – data partitioning
  – collective I/O
  – asynchronous I/O
  – portability and interoperability


MPI I/O Definitions

• An MPI file is an ordered collection of MPI types
• A file may be opened individually or collectively by a group of processes
• The fileview defines a template for accessing the file and is used to partition the file amongst processes


Fileviews

• A fileview is composed of three pieces:
  – a displacement (in bytes) from the beginning of the file
  – an elementary datatype (etype), which is the unit of data access and positioning within the file
  – a filetype, which defines a template for accessing the file. A filetype can contain etypes or holes of the same extent as etypes.


Fileviews (cont.)

• The filetype pattern is repeated, “tiling” the file
• Only the non-empty slots are available to read or write


Fileviews (cont.)

• Each process can have a different filetype

[Figure: Process 0, Process 1, Process 2]
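The tiling rule can be worked through numerically. The following Python sketch is a model of the fileview definition only and makes no MPI calls; it represents a filetype as a list of booleans (data slot or hole), which is a simplification assumed here for illustration, and the 8-byte etype size in the demo is likewise an assumption.

```python
def visible_offsets(disp, etype_bytes, filetype, count):
    """Return byte offsets of the first `count` etypes visible through a view.

    disp        -- displacement in bytes from the start of the file
    etype_bytes -- size of one etype in bytes
    filetype    -- sequence of booleans, one per etype-sized slot:
                   True = data slot, False = hole
    """
    if not any(filetype):
        raise ValueError("filetype must contain at least one data slot")
    offsets = []
    tile = 0
    while len(offsets) < count:
        # Walk one copy of the filetype, then repeat it ("tiling").
        for slot, is_data in enumerate(filetype):
            if is_data and len(offsets) < count:
                index = tile * len(filetype) + slot
                offsets.append(disp + index * etype_bytes)
        tile += 1
    return offsets


if __name__ == "__main__":
    # Three processes interleave 8-byte elements: each owns one slot in three.
    for rank in range(3):
        filetype = [slot == rank for slot in range(3)]
        print(rank, visible_offsets(0, 8, filetype, 4))
```

With complementary filetypes like these, the three views cover the file without overlapping, which is how a fileview partitions a file amongst processes.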


MPI_File_set_view

• Called after MPI_File_open to set fileview
• MPI_File_set_view(fh, disp, etype, filetype, datarep, info)
  – fh is a file handle
  – disp, etype, and filetype define the fileview
  – datarep is one of “native”, “internal”, or “external32”
  – info is a set of hints to optimize performance


MPI Info object

• An info object bundles up a set of parameters
    integer finfo
    call MPI_Info_create(finfo, ierr)
    call MPI_Info_set(finfo, 'access_style', 'write_mostly', ierr)
• MPI I/O defines a set of parameters used to help optimize I/O performance
• MPI_INFO_NULL can be used instead of an info object


Open and Close

• MPI_File_open(comm, filename, amode, info, fh)
  – comm: open is collective over this communicator
  – filename: string or character variable
  – amode: file access mode, e.g. MPI_MODE_RDONLY, MPI_MODE_RDWR, etc.
  – info: info object, used to pass hints to open
  – fh: file handle
• MPI_File_close(fh)


Utility routines

• MPI_File_delete
• MPI_File_set_size
• MPI_File_preallocate
• MPI_File_set_info


Query routines

• MPI_File_get_size
• MPI_File_get_group
• MPI_File_get_amode
• MPI_File_get_info
• MPI_File_get_view


Data access routines

• Positioning
  – Explicit: each call has an offset
  – Individual: each PE maintains an individual file pointer
  – Shared: the file pointer is maintained globally
• Synchronism
  – Blocking: routine returns when complete
  – Non-blocking: must call a termination routine to ensure completion
• Coordination
  – Non-collective
  – Collective


Summary of access routines

                               Coordination
Positioning   Synchronism    Non-collective   Collective

Explicit      Blocking       READ_AT          READ_AT_ALL
              Non-blocking   IREAD_AT         READ_AT_ALL_BEGIN
                             WAIT             READ_AT_ALL_END

Individual    Blocking       READ             READ_ALL
              Non-blocking   IREAD            READ_ALL_BEGIN
                             WAIT             READ_ALL_END

Shared        Blocking       READ_SHARED      READ_ORDERED
              Non-blocking   IREAD_SHARED     READ_ORDERED_BEGIN
                             WAIT             READ_ORDERED_END
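The same classification can be written as a lookup table, which makes the naming pattern easier to see. This Python sketch only maps the three choices to MPI-2 routine names and performs no I/O; the dictionary layout and the helper function names are illustrative assumptions.

```python
# Keys are (positioning, synchronism, coordination); values are the MPI-2
# routine names in call order. Non-blocking entries need two calls.
READ_ROUTINES = {
    ("explicit", "blocking", "non-collective"): ("MPI_File_read_at",),
    ("explicit", "blocking", "collective"): ("MPI_File_read_at_all",),
    ("explicit", "non-blocking", "non-collective"): ("MPI_File_iread_at", "MPI_Wait"),
    ("explicit", "non-blocking", "collective"):
        ("MPI_File_read_at_all_begin", "MPI_File_read_at_all_end"),
    ("individual", "blocking", "non-collective"): ("MPI_File_read",),
    ("individual", "blocking", "collective"): ("MPI_File_read_all",),
    ("individual", "non-blocking", "non-collective"): ("MPI_File_iread", "MPI_Wait"),
    ("individual", "non-blocking", "collective"):
        ("MPI_File_read_all_begin", "MPI_File_read_all_end"),
    ("shared", "blocking", "non-collective"): ("MPI_File_read_shared",),
    ("shared", "blocking", "collective"): ("MPI_File_read_ordered",),
    ("shared", "non-blocking", "non-collective"): ("MPI_File_iread_shared", "MPI_Wait"),
    ("shared", "non-blocking", "collective"):
        ("MPI_File_read_ordered_begin", "MPI_File_read_ordered_end"),
}


def read_routines(positioning, synchronism, coordination):
    """Look up the read routine(s) for one cell of the table."""
    return READ_ROUTINES[(positioning, synchronism, coordination)]


def write_routines(positioning, synchronism, coordination):
    """The write variants follow the same pattern with 'write' for 'read'."""
    return tuple(name.replace("read", "write")
                 for name in read_routines(positioning, synchronism, coordination))


if __name__ == "__main__":
    print(read_routines("explicit", "non-blocking", "collective"))
    print(write_routines("individual", "blocking", "non-collective"))
```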


Summary of access routines (cont.)

• MPI_File_seek
• MPI_File_get_position
• MPI_File_get_byte_offset
• MPI_File_seek_shared (collective)
• MPI_File_get_position_shared


T3E Implementation

• No shared file pointers
• No non-blocking collective (split collective)
• SPR filed on non-blocking read
• Work in progress


Examples

• All the program fragments are available as working programs on the T3E
• Do “module load training”, then look in $EXAMPLES/mpi_io
• All examples are of a distributed dot product
  – initialize data with random numbers
  – compute dot product of whole vector
  – write out data into a shared file
  – read back in and check dot product

[Figure: PE 0, PE 1, PE 2]


Naming convention

• First letter is positioning: explicit, individual, or shared
• Second letter is synchronism: blocking or non-blocking
• Third letter is coordination: non-collective or collective
• ebn.f90 is the explicit, blocking, non-collective example
• There are several “ibn” examples dealing with different fileviews


Filetype Example

[Figure: Process 0, Process 1, Process 2]


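Filetype Example (cont.)

    filemode = MPI_MODE_RDWR + MPI_MODE_CREATE

    call MPI_INFO_CREATE(finfo, ierr)
    call MPI_INFO_SET(finfo, 'access_style', 'write_mostly', ierr)

    call MPI_FILE_OPEN(MPI_COMM_WORLD, 'vector', filemode, &
                       finfo, fhv, ierr)

    ! each PE owns m contiguous elements, starting at element m*me
    call MPI_TYPE_CREATE_SUBARRAY(1, m*nprocs, m, m*me, &
                                  MPI_ORDER_FORTRAN, MPI_REAL, mpi_fileslice, ierr)
    ! a derived datatype must be committed before it is used in a fileview
    call MPI_TYPE_COMMIT(mpi_fileslice, ierr)

    ! disp must be declared integer(kind=MPI_OFFSET_KIND)
    disp=0
    call MPI_FILE_SET_VIEW(fhv, disp, MPI_REAL, mpi_fileslice, &
                           'native', MPI_INFO_NULL, ierr)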


Individual, blocking, non-collective

    call MPI_FILE_WRITE(fhv, b, m, MPI_REAL, status, ierr)

    lresult=sdot(m, b, 1, b, 1)
    call MPI_REDUCE(lresult, result, 1, MPI_REAL, MPI_SUM, 0, &
                    MPI_COMM_WORLD, ierr)

    if (me.eq.0) then
       write(6,*) 'dot product: ', result
    end if

    ! zero vector and read it back in

    b=0.0

    disp=0
    call MPI_FILE_SEEK(fhv, disp, MPI_SEEK_SET, ierr)
    call MPI_FILE_READ(fhv, b, m, MPI_REAL, status, ierr)
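The write, reduce, read-back cycle in this fragment can be modelled end to end without MPI. The Python sketch below emulates the PEs in a single process: each slice of `m` elements goes to element offset `m*me` in one shared file, the partial dot products are summed (standing in for MPI_REDUCE), and the file is read back to recompute the result. The function name, the use of 8-byte floats, and the loop-based emulation are assumptions for illustration, not the code of the T3E examples.

```python
import array
import os
import random
import tempfile


def dot(x, y):
    """Plain dot product, standing in for sdot."""
    return sum(a * b for a, b in zip(x, y))


def run_check(nprocs=3, m=4, seed=0):
    """Return the global dot product before writing and after reading back."""
    rng = random.Random(seed)
    slices = [array.array("d", (rng.random() for _ in range(m)))
              for _ in range(nprocs)]
    itemsize = slices[0].itemsize
    # Global result = sum of per-PE partial results (the reduction step).
    before = sum(dot(s, s) for s in slices)

    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "vector")
        with open(path, "wb") as f:
            # Emulated PEs write in reverse order; offsets keep slices apart.
            for me in reversed(range(nprocs)):
                f.seek(m * me * itemsize)
                f.write(slices[me].tobytes())
        after = 0.0
        with open(path, "rb") as f:
            for me in range(nprocs):
                f.seek(m * me * itemsize)
                b = array.array("d")
                b.frombytes(f.read(m * itemsize))
                after += dot(b, b)
    return before, after


if __name__ == "__main__":
    before, after = run_check()
    print("dot product before:", before)
    print("dot product after: ", after)
```

If the two values agree, every slice landed in, and was read from, the region of the file that its PE owns.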


Further Information on MPI I/O

• MPI - The Complete Reference
  – Volume 1, The MPI Core
  – Volume 2, The MPI Extensions