TRANSCRIPT
COSC 6374
Parallel Computation
Parallel I/O (II) – Access patterns
Edgar Gabriel
Spring 2008
COSC 6374 – Parallel Computation
Edgar Gabriel
Summary of the last lecture (I)
• In a clustered environment accessing files is a problem
– processes don’t share disks
• Distributed filesystems (e.g. NFS)
– client-server concept
– server is the bottleneck if queried by many processes
– consistency model:
• strong consistency through file or block locking
• NFS: session semantics -> modifications of a file might
initially be visible only to the modifying process
Summary of the last lecture (II)
• Parallel filesystems
– goal: many processes access the same file
– basic idea: disk striping
– example for a parallel filesystem: xFS
– Problem of simple disk striping:
• A single disk-failure can destroy the data-file
• Widely used parallel filesystems today:
– Parallel Virtual Filesystem (PVFS), PVFS2
– Lustre
– GPFS
Redundant arrays of independent disks (RAID)
• Central idea: replicate data over several disks such that no data is lost if a disk fails
• Several RAID levels defined
• RAID 0: disk striping without redundant storage
(sometimes loosely called “JBOD” = just a bunch of disks,
although strictly JBOD means concatenation without striping)
– No fault tolerance
– Good for high transfer rates
– Good for high request rates
• RAID 1: mirroring
– All data is replicated on two or more disks
– Does not improve write performance, and improves read performance only moderately
RAID level 2
• RAID 2: Hamming codes
– Each group of data bits has several check bits appended to it
forming Hamming code words
– Each bit of a Hamming code word is stored on a separate disk
– Very high additional costs: e.g. up to 50% additional capacity
required
• Hardly used today, since parity-based codes are faster and easier to implement
RAID level 3
• Parity-based protection:
– Based on the exclusive OR (XOR)
– Reversible
– Example:
      01101010   (data byte 1)
  XOR 11001001   (data byte 2)
  ----------------------------
      10100011   (parity byte)
– Recovery:
      11001001   (data byte 2)
  XOR 10100011   (parity byte)
  ----------------------------
      01101010   (recovered data byte 1)
RAID level 3 (cont.)
• Data divided evenly into N subblocks
(N = number of disks, typically 4 or 5)
• Computing parity bytes generates an additional subblock
• Subblocks written in parallel on N+1 disks
• For best performance data should be of size (N * sector size)
• Problems with RAID level 3:
– All disks are always participating in every operation => contention for applications with high access rates
– If data size is less than N*sector size, system has to read old subblocks to calculate the parity bytes
• RAID level 3 good for high transfer rates
RAID level 4
• Parity bytes for N disks are calculated and stored on a separate parity disk
• Files are not necessarily distributed over all N disks
• For read operations:
– Determine disks for the requested blocks
– Read data from these disks
• For write operations
– Retrieve the old data from the sector being overwritten
– Retrieve parity block from the parity disk
– Extract old data from the parity block using XOR operations
– Add the new data to the parity block using XOR
– Store new data
– Store new parity block
• Bottleneck: parity disk is involved in every operation
RAID level 5
• Same as RAID 4, but parity blocks are distributed on
different disks
Block 1   Block 2   Block 3   Block 4      P(1,2,3,4)
Block 5   Block 6   Block 7   P(5,6,7,8)   Block 8
RAID level 6
• Tolerates the loss of more than one disk
• Collection of several techniques
• E.g. P+Q parity: store parity bytes using two different algorithms
and store the two parity blocks on different disks
• E.g. two-dimensional parity: data blocks arranged in a logical grid, with parity disks per row and per column
RAID level 10
• Combines RAID level 1 and RAID level 0: a striped set (RAID 0) whose members are mirrored (RAID 1)
• Also available: RAID 53 (RAID 0 + RAID 3)
Comparing RAID levels
RAID level | Protection    | Space usage         | Good at..      | Poor at..
0          | None          | N                   | Performance    | Data protect.
1          | Mirroring     | 2N                  | Data protect.  | Space effic.
2          | Hamming codes | ~1.5N               | Transfer rate  | Request rate
3          | Parity        | N+1                 | Transfer rate  | Request rate
4          | Parity        | N+1                 | Read req. rate | Write perf.
5          | Parity        | N+1                 | Request rate   | Transfer rate
6          | P+Q or 2-D    | (N+2) or (MN+M+N)   | Data protect.  | Write perf.
10         | Mirroring     | 2N                  | Performance    | Space effic.
53         | Parity        | N + striping factor | Performance    | Space effic.
Sequential file access in parallel
applications
• E.g. one process does all file access operations and
distributes/collects data from other processes
– Easy to manage
– Produces contiguous access patterns => good for
performance
– Might cause memory problems
– Load imbalance in parallel applications
Sequential I/O in a parallel application
(figure: a single process performs all file accesses and distributes/collects the data to/from the other processes)
Multiple file access in parallel
applications
• E.g. every process reads/writes its own file
– Bandwidth scales if each file is placed on a local filesystem
– Result is not a single file
• problems handling the output in subsequent sequential
or parallel applications
• Post-processing step might be necessary
I/O in scientific applications
• Three basic classes of I/O operations
– Required I/O: reading input data and writing final results
– Checkpointing: data written periodically as insurance
against hardware failures
– Data staging: support for applications whose data does
not fit in memory (out-of-core computations)
Vector I/O pattern
• Long access to large files
– Applications tend to access contiguous chunks
• Cyclic and bursty access of files
– Periods of little I/O activity alternate with periods of
intense I/O activities
• Caching and buffering are ineffective
Discontiguous access
(figure: an 8×8 matrix whose elements 1–64 are stored row by row in the file; a process reads the 4×2 subblock consisting of elements 21–22, 29–30, 37–38, and 45–46)
fseek(fh, offset=21, SEEK_SET);
read(fh, length=2)
fseek(fh, offset=29, SEEK_SET);
read(fh, length=2)
fseek(fh, offset=37, SEEK_SET);
read(fh, length=2)
fseek(fh, offset=45, SEEK_SET);
read(fh, length=2)
e.g. reading a subblock of a two-dimensional matrix
• produces a series of discontiguous requests of
small amounts of data
Handling discontiguous access
• Merge small requests into a single operation
– Single data transfer operation
– Enables prefetching of blocks
• Possible I/O interfaces
– Algorithmic description: compact interface for regular
access patterns with constant strides
– List I/O: detailed interface for irregular access patterns
Algorithmic description
• Contiguous in memory, discontiguous on disk:
read_strided (file, buffer, file_stride, segment_size);
• Discontiguous in memory, discontiguous on disk:
read_strided2 (file, buffer, file_stride, mem_stride, segment_size);
(figure: with file_stride=3 and segment_size=1, every third segment on disk is packed contiguously into the buffer; with mem_stride=2 in addition, the segments are also placed with a stride of 2 in the buffer)
Algorithmic description in MPI
• Derived Datatypes
MPI_Type_vector(…);
MPI_Type_contiguous(…)
MPI_Type_subarray (…)
List I/O interfaces
• Contiguous in memory, discontiguous on disk:
read_list ( file, buffer, count, offsets[], length[] );
• Discontiguous in memory, contiguous on disk:
readv ( file, const struct iovec *vec, int count );
struct iovec {
    void*  iov_base;  /* starting address */
    size_t iov_len;   /* length in bytes  */
};
• Discontiguous in memory, discontiguous on disk:
read_list2 (…)
– Is in fact a gather/scatter interface: gathers data from disk and scatters it in memory
List I/O Interfaces in MPI
• Derived Datatypes
MPI_Type_indexed(…);
MPI_Type_struct (…);
Data sieving
• Ignore the gaps when reading from disk
– One large contiguous access instead of many small
requests
• Works well if gaps are small
• Overhead can be dominating for large gaps
(figure: a single read() pulls a contiguous region from disk into a temporary buffer; the requested pieces are then copied from the temporary buffer into the user buffer, and the gaps are discarded)
Collective I/O
• Merges separate I/O requests across multiple processes
• Collective read: retrieve large chunks from disk and distribute to multiple processes
• Collective write: gather data from multiple processes before writing to disk
– Eliminates false sharing!
• Two classes of collective I/O techniques
– Client-based collective I/O
– Server-based collective I/O
Client-based collective I/O
• Uses the message-passing network to rearrange data
(shuffle) before sending contiguous chunks to the
I/O node
(figure: processes 0–3 each hold a discontiguous part of the data; intermediary processes rearrange it into the contiguous layout of the I/O nodes, and the resulting chunks are written to I/O node 0 and I/O node 1)
Client-based collective I/O (continued)
• Consists of two steps (=> often called two-phase I/O)
– Shuffle
– I/O operation
• Problems to worry about
– Number of intermediary processes: either number of application processes or number of I/O nodes
– Additional buffer space: segmenting of data might be required
– Schedule for accessing I/O nodes: avoid that all intermediary processes send first to I/O node 0, then to I/O node 1, etc.
• Client-based I/O introduces additional copy and data-transfer operations; it improves performance only if these costs are smaller than the gain from the improved I/O performance.
Server-based I/O
• Collect and merge requests on the server
(figure: processes 0–3 each hold their own data layout; I/O node 0 and I/O node 1 gather data from the processes to fill complete blocks, and write previous blocks to disk while continuing to gather data)
Server-based I/O continued
• Steps for a write operation
– Compute processes send a description of the planned
data transfer (without data)
– Each I/O node determines which file blocks are under
its control
– Each I/O node determines which processes hold data
for each block
– For each block, I/O nodes request the data from the
compute nodes
Server-based I/O (continued)
• Eliminates the need for extra buffer space on compute
nodes
• Data travels only once over the network
• Many server-based I/O techniques are designed to
handle only a few blocks at a time
– minimizes buffer space requirements on the I/O nodes
– might require multiple messages between compute
process and I/O node for large read/write operations
Hints
• Performance of any I/O technique depends on
– Machine parameters
– Application parameters
– Implementation of the I/O library
⇒ The I/O library cannot determine the best/fastest method to handle I/O operations for a wide range of application scenarios
⇒ Applications have to give hints to the I/O library about their I/O characteristics
Hints and optimization possibilities
Hint                     | Possible optimization
Read-only                | Aggressive prefetching
Write-only               | Turn off prefetching
Consecutive access       | Prefetch file blocks in sequence for read accesses
Strided access           | Prefetch according to the strided pattern; delay writing if another process will fill in data
Random access            | Turn off prefetching; use the largest possible cache and buffer; delay writing as long as possible
Large consecutive access | Turn off caching and buffering
No overlapping access    | Turn off concurrency control