TRANSCRIPT
COSC 6374
Parallel Computation
Parallel I/O (II) – Access patterns
Edgar Gabriel
Spring 2008
COSC 6374 – Parallel Computation
Edgar Gabriel
Summary of the last lecture (I)
• In a clustered environment accessing files is a problem
– processes don’t share disks
• Distributed filesystems (e.g. NFS)
– client-server concept
– server is the bottleneck if queried by many processes
– consistency model:
• strong consistency through file or block locking
• NFS: session semantics -> modifications of a file might
initially be visible only to the modifying process
Summary of the last lecture (II)
• Parallel filesystems
– goal: many processes access the same file
– basic idea: disk striping
– example for a parallel filesystem: xFS
– Problem of simple disk striping:
• A single disk-failure can destroy the data-file
• Widely used parallel filesystems today:
– Parallel Virtual Filesystem (PVFS), PVFS2
– Lustre
– GPFS
Redundant arrays of independent disks (RAID)
• Central idea: replicate data over several disks such that no data is lost if a disk fails
• Several RAID levels defined
• RAID 0: disk striping without redundant storage
(sometimes loosely called “JBOD” = just a bunch of disks,
although strictly JBOD means concatenation without striping)
– No fault tolerance
– Good for high transfer rates
– Good for high request rates
• RAID 1: mirroring
– All data is replicated on two or more disks
– Does not improve write performance, and improves read performance only moderately
RAID level 2
• RAID 2: Hamming codes
– Each group of data bits has several check bits appended to it
forming Hamming code words
– Each bit of a Hamming code word is stored on a separate disk
– Very high additional costs: e.g. up to 50% additional capacity
required
• Hardly used today, since parity-based codes are faster and easier to implement
RAID level 3
• Parity-based protection:
– Based on the exclusive OR (XOR)
– Reversible
– Example:
      01101010   (data byte 1)
  XOR 11001001   (data byte 2)
  ----------------------------
      10100011   (parity byte)
– Recovery:
      11001001   (data byte 2)
  XOR 10100011   (parity byte)
  ----------------------------
      01101010   (recovered data byte 1)
RAID level 3 (cont.)
• Data divided evenly into N subblocks
(N = number of disks, typically 4 or 5)
• Computing parity bytes generates an additional subblock
• Subblocks written in parallel on N+1 disks
• For best performance data should be of size (N * sector size)
• Problems with RAID level 3:
– All disks are always participating in every operation => contention for applications with high access rates
– If data size is less than N*sector size, system has to read old subblocks to calculate the parity bytes
• RAID level 3 good for high transfer rates
RAID level 4
• Parity bytes for N disks are calculated and stored on a separate parity disk
• Files are not necessarily distributed over all N disks
• For read operations:
– Determine disks for the requested blocks
– Read data from these disks
• For write operations
– Retrieve the old data from the sector being overwritten
– Retrieve parity block from the parity disk
– Extract old data from the parity block using XOR operations
– Add the new data to the parity block using XOR
– Store new data
– Store new parity block
• Bottleneck: parity disk is involved in every operation
RAID level 5
• Same as RAID 4, but parity blocks are distributed on
different disks
Block 1   Block 2   Block 3   Block 4      P(1,2,3,4)
Block 5   Block 6   Block 7   P(5,6,7,8)   Block 8
RAID level 6
• Tolerates the loss of more than one disk
• Collection of several techniques
• E.g. P+Q parity: store parity bytes using two different algorithms
and store the two parity blocks on different disks
• E.g. two-dimensional parity: data blocks arranged in a logical grid, with parity disks per row and per column
RAID level 10
• Combines RAID level 1 and RAID level 0: a striped set (RAID 0) whose members are mirrored (RAID 1)
• Also available: RAID 53 (RAID 0 + RAID 3)
Comparing RAID levels
RAID level | Protection    | Space usage         | Good at..      | Poor at..
0          | None          | N                   | Performance    | Data protect.
1          | Mirroring     | 2N                  | Data protect.  | Space effic.
2          | Hamming codes | ~1.5N               | Transfer rate  | Request rate
3          | Parity        | N+1                 | Transfer rate  | Request rate
4          | Parity        | N+1                 | Read req. rate | Write perf.
5          | Parity        | N+1                 | Request rate   | Transfer rate
6          | P+Q or 2-D    | (N+2) or (MN+M+N)   | Data protect.  | Write perf.
10         | Mirroring     | 2N                  | Performance    | Space effic.
53         | Parity        | N + striping factor | Performance    | Space effic.
Sequential file access in parallel
applications
• E.g. one process does all file access operations and
distributes/collects data from other processes
– Easy to manage
– Produces contiguous access patterns => good for
performance
– Might cause memory problems
– Load imbalance in parallel applications
Sequential I/O in a parallel application
(figure: a single process performs all file accesses and distributes/collects the data to/from the other processes)
Multiple file access in parallel
applications
• E.g. every process reads/writes its own file
– Bandwidth scales if each file is placed on a local filesystem
– Result is not a single file
• problems handling the output in subsequent sequential
or parallel applications
• Post-processing step might be necessary
I/O in scientific applications
• Three basic classes of I/O operations
– Required I/O: reading input data and writing final results
– Checkpointing: data written periodically as insurance
against hardware failures
– Data staging: support for applications whose data does
not fit in memory (out-of-core computations)
Vector I/O pattern
• Long access to large files
– Applications tend to access contiguous chunks
• Cyclic and bursty access of files
– Periods of little I/O activity alternate with periods of
intense I/O activities
• Caching and buffering are ineffective
Discontiguous access
(figure: an 8×8 matrix whose elements 1–64 are stored row by row in the file; a process reads the 4×2 subblock consisting of elements 21–22, 29–30, 37–38, and 45–46)
fseek(fh, offset=21, SEEK_SET);
read(fh, length=2)
fseek(fh, offset=29, SEEK_SET);
read(fh, length=2)
fseek(fh, offset=37, SEEK_SET);
read(fh, length=2)
fseek(fh, offset=45, SEEK_SET);
read(fh, length=2)
e.g. reading a subblock of a two-dimensional matrix
• produces a series of discontiguous requests of
small amounts of data
Handling discontiguous access
• Merge small requests into a single operation
– Single data transfer operation
– Enables prefetching of blocks
• Possible I/O interfaces
– Algorithmic description: compact interface for regular
access patterns with constant strides
– List I/O: detailed interface for irregular access patterns
Algorithmic description
• Contiguous in memory, discontiguous on disk:
read_strided (file, buffer, file_stride, segment_size);
• Discontiguous in memory, discontiguous on disk:
read_strided2 (file, buffer, file_stride, mem_stride, segment_size);
(figure: with file_stride=3 and segment_size=1, every third segment on disk is packed contiguously into the buffer; with mem_stride=2 in addition, the segments are also placed with a stride of 2 in the buffer)
Algorithmic description in MPI
• Derived Datatypes
MPI_Type_vector(…);
MPI_Type_contiguous(…)
MPI_Type_subarray (…)
List I/O interfaces
• Contiguous in memory, discontiguous on disk:
read_list ( file, buffer, count, offsets[], length[] );
• Discontiguous in memory, contiguous on disk:
readv ( file, const struct iovec *vec, int count );
struct iovec {
    void*  iov_base;  /* starting address */
    size_t iov_len;   /* length in bytes  */
};
• Discontiguous in memory, discontiguous on disk:
read_list2 (…)
– Is in fact a gather/scatter interface: gathers data from disk and scatters it in memory
List I/O Interfaces in MPI
• Derived Datatypes
MPI_Type_indexed(…);
MPI_Type_struct (…);
Data sieving
• Ignore the gaps when reading from disk
– One large contiguous access instead of many small
requests
• Works well if gaps are small
• Overhead can be dominating for large gaps
(figure: a single read() pulls a contiguous region from disk into a temporary buffer; the requested pieces are then copied from the temporary buffer into the user buffer, and the gaps are discarded)
Collective I/O
• Merges separate I/O requests across multiple processes
• Collective read: retrieve large chunks from disk and distribute to multiple processes
• Collective write: gather data from multiple processes before writing to disk
– Eliminates false sharing!
• Two classes of collective I/O techniques
– Client-based collective I/O
– Server-based collective I/O
Client-based collective I/O
• Uses the message-passing network to rearrange data
(shuffle) before sending contiguous chunks to the
I/O node
(figure: processes 0–3 each hold a discontiguous part of the data; intermediary processes rearrange it into the contiguous layout of the I/O nodes, and the resulting chunks are written to I/O node 0 and I/O node 1)
Client-based collective I/O (continued)
• Consists of two steps (=> often called two-phase I/O)
– Shuffle
– I/O operation
• Problems to worry about
– Number of intermediary processes: either number of application processes or number of I/O nodes
– Additional buffer space: segmenting of data might be required
– Schedule for accessing I/O nodes: avoid that all intermediary processes send first to I/O node 0, then to I/O node 1, etc.
• Client-based I/O introduces additional copy and data-transfer operations; it improves performance only if these costs are smaller than the gain from the improved I/O performance.
Server-based I/O
• Collect and merge requests on the server
(figure: processes 0–3 each hold their own data layout; I/O node 0 and I/O node 1 gather data from the processes to fill complete blocks, and write previous blocks to disk while continuing to gather data)
Server-based I/O continued
• Steps for a write operation
– Compute processes send a description of the planned
data transfer (without data)
– Each I/O node determines which file blocks are under
its control
– Each I/O node determines which processes hold data
for each block
– For each block, I/O nodes request the data from the
compute nodes
Server-based I/O (continued)
• Eliminates the need for extra buffer space on compute
nodes
• Data travels only once over the network
• Many server-based I/O techniques are designed to
handle only a few blocks at a time
– minimizes buffer space requirements on the I/O nodes
– might require multiple messages between compute
process and I/O node for large read/write operations
Hints
• Performance of any I/O technique depends on
– Machine parameters
– Application parameters
– Implementation of the I/O library
⇒ The I/O library cannot determine the best/fastest method to handle I/O operations for a wide range of application scenarios
⇒ Applications have to give hints to the I/O library about their I/O characteristics
Hints and optimization possibilities
Hint                     | Possible optimization
Read-only                | Aggressive prefetching
Write-only               | Turn off prefetching
Consecutive access       | Prefetch file blocks in sequence for read accesses
Strided access           | Prefetch according to the strided pattern; delay writing if another process will fill in data
Random access            | Turn off prefetching; use the largest possible cache and buffer; delay writing as long as possible
Large consecutive access | Turn off caching and buffering
No overlapping access    | Turn off concurrency control