
Page 1

Parallel Programming With MPI

Self Test

Page 2

Message Passing Fundamentals

Page 3

Self Test

1. A shared memory computer has access to:

a) the memory of other nodes via a proprietary high-speed communications network

b) a directives-based data-parallel language

c) a global memory space

d) communication time

Page 4

Self Test

2. A domain decomposition strategy turns out not to be the most efficient algorithm for a parallel program when:

a) data can be divided into pieces of approximately the same size.

b) the pieces of data assigned to the different processes require greatly different lengths of time to process.

c) one needs the advantage of maintaining a single flow of control.

d) one must parallelize a finite differencing scheme.

Page 5

Self Test

3. In the message passing approach:

a) serial code is made parallel by adding directives that tell the compiler how to distribute data and work across the processors.

b) details of how data distribution, computation, and communications are to be done are left to the compiler.

c) it is not very flexible.

d) it is left up to the programmer to explicitly divide data and work across the processors as well as manage the communications among them.

Page 6

Self Test

4. Total execution time does not involve:

a) computation time.

b) compiling time.

c) communications time.

d) idle time.

Page 7

Self Test

5. One can minimize idle time by:

a) occupying a process with one or more new tasks while it waits for communication to finish so it can proceed on another task.

b) always using nonblocking communications.

c) never using nonblocking communications.

d) frequent use of barriers.

Page 8

Matching Question

1. Message passing

2. Domain decomposition

3. Idle time

4. Load balancing

5. Directives-based data parallel language

6. Distributed memory

7. Shared memory

8. Computation time

9. Functional decomposition

10. Communication time

a) When each node has rapid access to its own local memory and access to the memory of other nodes via some sort of communications network.

b) When multiple processor units share access to a global memory space via a high speed memory bus.

c) Data are divided into pieces of approximately the same size, and then mapped to different processors.

d) The problem is decomposed into a large number of smaller tasks and then the tasks are assigned to the processors as they become available.

e) Serial code is made parallel by adding directives that tell the compiler how to distribute data and work across the processors.

f) The programmer explicitly divides data and work across the processors as well as managing the communications among them.

g) Dividing the work equally among the available processes.

h) The time spent performing computations on the data.

i) The time a process spends waiting for data from other processors.

j) The time for processes to send and receive messages

Page 9

Course Problem

• The initial problem implements a parallel search of an extremely large (several thousand elements) integer array. The program finds all occurrences of a certain integer, called the target, and writes all the array indices where the target was found to an output file. In addition, the program reads both the target value and all the array elements from an input file.

• Using these concepts of parallel programming, write a description of a parallel approach to solving the problem described above. (No coding is required for this exercise.)

Page 10

Getting Started with MPI

Page 11

Self Test

• Is a blocking send necessarily also synchronous?
– Yes
– No

Page 12

Self Test

• Consider the following fragment of MPI pseudo-code:

...
x = fun(y)
MPI_SOME_SEND(the value of x to some other processor)
x = fun(z)
...

• where MPI_SOME_SEND is a generic send routine. In this case, it would be best to use …
– A blocking send
– A nonblocking send

Page 13

Self Test

• Which of the following is true for all send routines?
– It is always safe to overwrite the sent variable(s) on the sending processor after the send returns.
– Completion implies that the message has been received at its destination.
– It is always safe to overwrite the sent variable(s) on the sending processor after the send is complete.
– All of the above.
– None of the above.

Page 14

Matching Question

1. Point-to-point communication

2. Collective communication

3. Communication mode

4. Blocking send

5. Synchronous send

6. Broadcast

7. Scatter

8. Gather

a) A send routine that does not return until it is complete

b) Communication involving one or more groups of processes

c) A send routine that is not complete until receipt of the message at its destination has been acknowledged

d) An operation in which one process sends the same data to several others

e) Communication involving a single pair of processes

f) An operation in which one process distributes different elements of a local array to several others

g) An operation in which one process collects data from several others and assembles them in a local array

h) Specification of the method of operation and completion criteria for a communication routine

Page 15

Course Problem

• The initial problem implements a parallel search of an extremely large (several thousand elements) integer array. The program finds all occurrences of a certain integer, called the target, and writes all the array indices where the target was found to an output file. In addition, the program reads both the target value and all the array elements from an input file.

• Before writing a parallel version of a program, first write a serial version (that is, a version that runs on one processor). That is the task for this chapter. You can use C/C++ and should confirm that the program works by using a test input array.
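A minimal serial sketch of this task in C is shown here for orientation. The file names ("b.data", "found.data"), the file layout (target value first, then the array elements), and the size bound are illustrative assumptions, not part of the problem statement.

/* Hedged sketch: serial search for all occurrences of a target value.
   File names and layout are assumptions, not given in the problem. */
#include <stdio.h>

#define MAX_ELEMENTS 300   /* assumed upper bound on the array size */

int main(void)
{
    int b[MAX_ELEMENTS], target, n = 0, i;
    FILE *in  = fopen("b.data", "r");      /* assumed input file name  */
    FILE *out = fopen("found.data", "w");  /* assumed output file name */

    if (in == NULL || out == NULL) return 1;

    fscanf(in, "%d", &target);             /* first value: the target  */
    while (n < MAX_ELEMENTS && fscanf(in, "%d", &b[n]) == 1)
        n++;                               /* remaining values: array  */

    for (i = 0; i < n; i++)                /* record every index where */
        if (b[i] == target)                /* the target occurs        */
            fprintf(out, "%d\n", i);

    fclose(in);
    fclose(out);
    return 0;
}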

Page 16

MPI Program Structure

Page 17

Self Test

1. How would you modify "Hello World" so that only even-numbered processors print the greeting message?
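One possible answer, sketched in C under the assumption that the original "Hello World" uses the standard MPI C bindings: test the parity of the rank before printing.

/* Sketch: only even-numbered ranks print the greeting. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank % 2 == 0)                     /* even-numbered processors only */
        printf("Hello World from process %d\n", rank);

    MPI_Finalize();
    return 0;
}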

Page 18

Self Test

• Consider the following MPI pseudo-code, which sends a piece of data from processor 1 to processor 2:

MPI_INIT()
MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
if (myrank = 1)
    MPI_SEND (some data to processor 2 in MPI_COMM_WORLD)
else {
    MPI_RECV (data from processor 1 in MPI_COMM_WORLD)
    print "Message received!"
}
MPI_FINALIZE()

– where MPI_SEND and MPI_RECV are blocking send and receive routines. Thus, for example, a process encountering the MPI_RECV statement will block while waiting for a message from processor 1.
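If you want to try questions 2 and 3 below on a real machine, the following is one literal C translation of the pseudo-code above. The integer payload, the count of 1, and the tag value are assumptions; the rank test and message routing follow the pseudo-code exactly.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int myrank, data = 42;                 /* assumed integer payload */
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    if (myrank == 1) {
        /* send some data to processor 2 */
        MPI_Send(&data, 1, MPI_INT, 2, 0, MPI_COMM_WORLD);
    } else {
        /* receive data from processor 1 */
        MPI_Recv(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
        printf("Message received!\n");
    }

    MPI_Finalize();
    return 0;
}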

Page 19

Self Test

2. If this code is run on a single processor, what do you expect to happen?

a) The code will print "Message received!" and then terminate.

b) The code will terminate normally with no output.

c) The code will hang with no output.

d) An error condition will result.

Page 20

Self Test

3. If the code is run on three processors, what do you expect?

a) The code will terminate after printing "Message received!".

b) The code will hang with no output.

c) The code will hang after printing "Message received!".

d) The code will give an error message and exit (possibly leaving a core file).

Page 21

Self Test

4. Consider an MPI code running on four processors, denoted A, B, C, and D. In the default communicator MPI_COMM_WORLD their ranks are 0-3, respectively. Assume that we have defined another communicator, called USER_COMM, consisting of processors B and D. Which one of the following statements about USER_COMM is always true?

a) Processors B and D have ranks 1 and 3, respectively.

b) Processors B and D have ranks 0 and 1, respectively.

c) Processors B and D have ranks 1 and 3, but which has which is in general undefined.

d) Processors B and D have ranks 0 and 1, but which has which is in general undefined.

Page 22

Course Problem

• Description

– The initial problem implements a parallel search of an extremely large (several thousand elements) integer array. The program finds all occurrences of a certain integer, called the target, and writes all the array indices where the target was found to an output file. In addition, the program reads both the target value and all the array elements from an input file.

• Exercise

– You now have enough knowledge to write pseudo-code for the parallel search algorithm introduced in Chapter 1. In the pseudo-code, you should correctly initialize MPI, have each processor determine and use its rank, and terminate MPI. By tradition, the Master processor has rank 0. Assume in your pseudo-code that the real code will be run on 4 processors.
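For orientation only, the required structure looks roughly like the C skeleton below; the search logic itself is deliberately omitted, since writing it is the exercise. The 4-process assumption comes from the exercise statement.

/* Skeleton only: the search logic is left as the exercise. */
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;

    MPI_Init(&argc, &argv);                   /* initialize MPI        */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* determine own rank    */

    if (rank == 0) {
        /* master (rank 0): read input, distribute work, collect results */
    } else {
        /* slaves (ranks 1-3): receive a piece of the array and search it */
    }

    MPI_Finalize();                           /* terminate MPI         */
    return 0;
}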

Page 23

Point-to-Point Communication

Page 24

Self Test

1. MPI_SEND is used to send an array of 10 4-byte integers. At the time MPI_SEND is called, MPI has over 50 Kbytes of internal message buffer free on the sending process. Choose the best answer.

a) This is a blocking send. Most MPI implementations will copy the message into MPI internal message buffer and return.

b) This is a blocking send. Most MPI implementations will block the sending process until the destination process has received the message.

c) This is a non-blocking send. Most MPI implementations will copy the message into MPI internal message buffer and return.

Page 25

Self Test

2. MPI_SEND is used to send an array of 100,000 8-byte reals. At the time MPI_SEND is called, MPI has less than 50 Kbytes of internal message buffer free on the sending process. Choose the best answer.

a) This is a blocking send. Most MPI implementations will block the calling process until enough message buffer becomes available.

b) This is a blocking send. Most MPI implementations will block the sending process until the destination process has received the message.

Page 26

Self Test

3. MPI_SEND is used to send a large array. When MPI_SEND returns, the programmer may safely assume

a) The destination process has received the message.

b) The array has been copied into MPI internal message buffer.

c) Either the destination process has received the message or the array has been copied into MPI internal message buffer.

d) None of the above.

Page 27

Self Test

4. MPI_ISEND is used to send an array of 10 4-byte integers. At the time MPI_ISEND is called, MPI has over 50 Kbytes of internal message buffer free on the sending process. Choose the best answer.

a) This is a non-blocking send. MPI will generate a request id and then return.

b) This is a non-blocking send. Most MPI implementations will copy the message into MPI internal message buffer and return.

Page 28

Self Test

5. MPI_ISEND is used to send an array of 10 4-byte integers. At the time MPI_ISEND is called, MPI has over 50 Kbytes of internal message buffer free on the sending process. After calling MPI_ISEND, the sending process calls MPI_WAIT to wait for completion of the send operation. Choose the best answer.

a) MPI_Wait will not return until the destination process has received the message.

b) MPI_WAIT may return before the destination process has received the message.

Page 29

Self Test

6. MPI_ISEND is used to send an array of 100,000 8-byte reals. At the time MPI_ISEND is called, MPI has less than 50 Kbytes of internal message buffer free on the sending process. Choose the best answer.

a) This is a non-blocking send. MPI will generate a request id and return.

b) This is a blocking send. Most MPI implementations will block the sending process until the destination process has received the message.

Page 30

Self Test

7. MPI_ISEND is used to send an array of 100,000 8-byte reals. At the time MPI_ISEND is called, MPI has less than 50 Kbytes of internal message buffer free on the sending process. After calling MPI_ISEND, the sending process calls MPI_WAIT to wait for completion of the send operation. Choose the best answer.

a) This is a blocking send. In most implementations, MPI_WAIT will not return until the destination process has received the message.

b) This is a non-blocking send. In most implementations, MPI_Wait will not return until the destination process has received the message.

c) This is a non-blocking send. In most implementations, MPI_WAIT will return before the destination process has received the message.

Page 31

Self Test

8. Assume the only communicator used in this problem is MPI_COMM_WORLD. After calling MPI_INIT, process 1 immediately sends two messages to process 0. The first message sent has tag 100, and the second message sent has tag 200. After calling MPI_INIT and verifying there are at least 2 processes in MPI_COMM_WORLD, process 0 calls MPI_RECV with the source argument set to 1 and the tag argument set to MPI_ANY_TAG. Choose the best answer.

a) The tag of the first message received by process 0 is 100.

b) The tag of the first message received by process 0 is 200.

c) The tag of the first message received by process 0 is either 100 or 200.

d) Based on the information given, one cannot safely assume any of the above.

Page 32

Self Test

9. Assume the only communicator used in this problem is MPI_COMM_WORLD. After calling MPI_INIT, process 1 immediately sends two messages to process 0. The first message sent has tag 100, and the second message sent has tag 200. After calling MPI_INIT and verifying there are at least 2 processes in MPI_COMM_WORLD, process 0 calls MPI_RECV with the source argument set to 1 and the tag argument set to 200. Choose the best answer.

a) Process 0 is deadlocked, since it attempted to receive the second message before receiving the first.

b) Process 0 receives the second message sent by process 1, even though the first message has not yet been received.

c) None of the above.

Page 33

Course Problem

• Description

– The initial problem implements a parallel search of an extremely large (several thousand elements) integer array. The program finds all occurrences of a certain integer, called the target, and writes all the array indices where the target was found to an output file. In addition, the program reads both the target value and all the array elements from an input file.

• Exercise

– Go ahead and write the real parallel code for the search problem! Using the pseudo-code from the previous chapter as a guide, fill in all the sends and receives with calls to the actual MPI send and receive routines. For this task, use only the blocking routines. If you have access to a parallel computer with the MPI library installed, run your parallel code using 4 processors. See if you get the same results as those obtained with the serial version of Chapter 2. Of course, you should.
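The sketch below is not the course solution; it only illustrates the blocking send/receive pattern the exercise asks for. The array size, the stand-in data, the hard-coded target, and the decomposition into three equal pieces for three slaves are all assumptions, and the file I/O and result collection are omitted.

/* Hedged sketch of the blocking send/receive pattern; assumes 4 processes. */
#include <stdio.h>
#include <mpi.h>

#define N     300                  /* assumed global array size          */
#define CHUNK (N / 3)              /* assumed: 3 slaves, master rank 0   */

int main(int argc, char *argv[])
{
    int b[N], chunk[CHUNK], target = 7, rank, i;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (i = 0; i < N; i++) b[i] = i % 10;           /* stand-in data  */
        for (i = 1; i <= 3; i++)                         /* blocking sends */
            MPI_Send(&b[(i - 1) * CHUNK], CHUNK, MPI_INT, i, 0,
                     MPI_COMM_WORLD);
    } else {
        MPI_Recv(chunk, CHUNK, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        for (i = 0; i < CHUNK; i++)                      /* local search   */
            if (chunk[i] == target)
                printf("rank %d: target at global index %d\n",
                       rank, (rank - 1) * CHUNK + i);
    }

    MPI_Finalize();
    return 0;
}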

Page 34

Derived Datatypes and Related Features

Page 35

Self Test

1. You are writing a parallel program to be run on 100 processors. Each processor is working with only one section of a skeleton outline of a 3-D model of a house. In the course of constructing the model house each processor often has to send the three cartesian coordinates (x,y,z) of nodes that are to be used to make the boundaries between the house sections. Each coordinate will be a real value.

Why would it be advantageous for you to define a new data type called Point which contained the three coordinates?

Page 36

Self Test

a) My program will be more readable and self commenting.

b) Since many x,y,z values will be used in MPI communication routines they can be sent as a single Point type entity instead of packing and unpacking three reals each time.

c) Since all three values are real, there is no purpose in making a derived data type.

d) It would be impossible to use MPI_Pack and MPI_Unpack to send the three real values.

Page 37

Self Test

2. What is the simplest MPI derived datatype creation function to make the Point datatype described in problem 1? (Three of the answers given can actually be used to construct the type, but one is the simplest.)

a) MPI_TYPE_CONTIGUOUS

b) MPI_TYPE_VECTOR

c) MPI_TYPE_STRUCT

d) MPI_TYPE_COMMIT

Page 38

Self Test

3. The C syntax for the MPI_TYPE_CONTIGUOUS subroutine is

MPI_Type_contiguous (count, oldtype, newtype)

The argument names should be fairly self-explanatory, but if you want their exact definition you can look them up at the MPI home page. For the derived datatype we have been discussing in the previous problems, what would be the values for the count, oldtype, and newtype arguments respectively?

a) 2, MPI_REAL, Point

b) 3, REAL, Point

c) 3, MPI_INTEGER, Coord

d) 3, MPI_REAL, Point
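For reference, a sketch of how the Point type might be built and committed in C follows. Using MPI_FLOAT as the C equivalent of the real coordinates is an assumption.

/* Sketch: a contiguous derived type for a 3-coordinate point. */
#include <mpi.h>

int main(int argc, char *argv[])
{
    float coords[3] = { 1.0f, 2.0f, 3.0f };    /* x, y, z (assumed values)  */
    MPI_Datatype Point;

    MPI_Init(&argc, &argv);

    MPI_Type_contiguous(3, MPI_FLOAT, &Point); /* three reals, back to back */
    MPI_Type_commit(&Point);                   /* must commit before use    */

    /* coords can now travel as a single Point entity, e.g.:
       MPI_Send(coords, 1, Point, dest, tag, MPI_COMM_WORLD);               */

    MPI_Type_free(&Point);
    MPI_Finalize();
    return 0;
}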

Page 39

Self Test

4. In Section “Using MPI Derived Types for User-Defined Types”, the code for creating the derived data type MPI_SparseElt is shown. Using the MPI_Type_extent function, determine the size (in bytes) of a variable of type MPI_SparseElt. You should probably modify the code found in that section, then compile and run it to get the answer.

Page 40

Course Problem

• Description

– The new problem still implements a parallel search of an integer array.

– The program should find all occurrences of a certain integer which will be called the target.

– It should then calculate the average of the target value and its index.

– Both the target location and the average should be written to an output file.

– In addition, the program should read both the target value and all the array elements from an input file.

Page 41

Course Problem

• Exercise

– Modify your code from Chapter 4 to create a program that solves the new Course Problem.

– Use the techniques/routines of this chapter to make a new derived type called MPI_PAIR that will contain both the target location and the average (one possible construction is sketched below).

– All of the slave sends and the master receives must use the MPI_PAIR type.
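One possible construction of MPI_PAIR follows. The member types (an int location and a float average) are assumptions, since the slides do not fix them; MPI_Type_create_struct is the modern C name for the MPI_TYPE_STRUCT routine covered in this chapter.

#include <stddef.h>     /* offsetof */
#include <mpi.h>

/* Assumed layout of the pair: an integer location and a real average. */
struct pair { int location; float average; };

int main(int argc, char *argv[])
{
    struct pair  p = { 0, 0.0f };
    MPI_Datatype MPI_PAIR;
    int          blocklens[2] = { 1, 1 };
    MPI_Aint     displs[2]    = { offsetof(struct pair, location),
                                  offsetof(struct pair, average) };
    MPI_Datatype types[2]     = { MPI_INT, MPI_FLOAT };

    MPI_Init(&argc, &argv);

    MPI_Type_create_struct(2, blocklens, displs, types, &MPI_PAIR);
    MPI_Type_commit(&MPI_PAIR);

    /* a slave would now send its result with something like:
       MPI_Send(&p, 1, MPI_PAIR, 0, tag, MPI_COMM_WORLD);       */

    MPI_Type_free(&MPI_PAIR);
    MPI_Finalize();
    return 0;
}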

Page 42

Collective Communications

Page 43

Self Test

1. We want to do a simple broadcast of variable abc[7] in processor 0 to the same location in all other processors of the communicator. What is the correct syntax of the call to do this broadcast?

a) MPI_Bcast( &abc[7], 1, MPI_REAL, 0, comm)

b) MPI_Bcast( &abc, 7, MPI_REAL, 0, comm)

c) MPI_Broadcast(&abc[7],1,MPI_REAL,0,comm)

Page 44

Self Test Self Test

2. Each processor has a local array a with 50 elements. Each local array is a slice of a larger global array. We wish to compute the average (mean) of all elements in the global array. Our preferred approach is to add all of the data element-by-element onto the root processor, sum elements of the resulting array, divide, and broadcast the result to all processes. Which sequence of calls will accomplish this? Assume variables are typed and initialized appropriately.

Page 45

Self Test

a) start = 0; final = 49; count = final - start + 1;
   mysum = 0;
   for ( i=start; i<=final; ++i ) mysum += a[i];
   MPI_Reduce ( &mysum, &sum, 1, MPI_REAL, MPI_SUM, root, comm );
   total_count = nprocs * count;
   if ( my_rank==root ) average = sum / total_count;
   MPI_Bcast ( &average, 1, MPI_REAL, root, comm );

Page 46

Self Test

b) start = 0; final = 49; count = final - start + 1;
   MPI_Reduce ( a, sum_array, count, MPI_REAL, MPI_SUM, root, comm );
   sum = 0;
   for ( i=start; i<=final; ++i ) sum += sum_array[i];
   total_count = nprocs * count;
   if ( my_rank==root ) average = sum / total_count;
   MPI_Bcast ( &average, 1, MPI_REAL, root, comm );

Page 47

Self Test

c) start = 0; final = 49; count = final - start + 1;
   mysum = 0;
   for ( i=start; i<=final; ++i ) mysum += a[i];
   my_average = mysum / count;
   MPI_Reduce ( &my_average, &sum, 1, MPI_REAL, MPI_SUM, root, comm );
   if ( my_rank==root ) average = sum / nprocs;
   MPI_Bcast ( &average, 1, MPI_REAL, root, comm );

Page 48

Self Test

3. Consider a communicator with 4 processes. How many total MPI_Send()'s and MPI_Recv()'s would be required to accomplish the following: MPI_Allreduce ( &a, &x, 1, MPI_REAL, MPI_SUM, comm );

a) 3

b) 4

c) 12

d) 16

Page 49

Course Problem

• Description

– The new problem still implements a parallel search of an integer array. The program should find all occurrences of a certain integer which will be called the target. It should then calculate the average of the target value and its index. Both the target location and the average should be written to an output file. In addition, the program should read both the target value and all the array elements from an input file.

Page 50

Course Problem

• Exercise

– Modify your code from Chapter 5 to change how the master first sends out the target and subarray data to the slaves.

– Use the MPI broadcast routines to give each slave the target.

– Use the MPI scatter routine to give all processors a section of the array b it will search (a sketch of this step is given below).

– When you use the standard MPI scatter routine you will see that the global array b is now split up into four parts and the master process now has the first fourth of the array to search.

– So you should add a search loop (similar to the slaves') in the master section of code to search for the target and calculate the average and then write the result to the output file.

– This is actually an improvement in performance since all the processors perform part of the search in parallel.
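A sketch of the broadcast/scatter distribution step follows. The array size, the use of MPI_INT, and the variable names are assumptions, and the search and averaging code is omitted.

/* Sketch of the distribution step only; assumes 4 processes. */
#include <mpi.h>

#define N     400                   /* assumed global array size          */
#define CHUNK (N / 4)               /* assumed 4 processes                */

int main(int argc, char *argv[])
{
    int b[N], myb[CHUNK], target = 0, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* master reads target and the global array b from the input file */
    }

    /* every process, master included, gets the target ...                */
    MPI_Bcast(&target, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* ... and a CHUNK-sized slice of b; rank 0 keeps the first fourth    */
    MPI_Scatter(b, CHUNK, MPI_INT, myb, CHUNK, MPI_INT, 0, MPI_COMM_WORLD);

    /* all ranks now search myb for the target (loop omitted)             */

    MPI_Finalize();
    return 0;
}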

Page 51

Communicators

Page 52

Self Test

1. MPI_Comm_group may be used to:

a) create a new group.

b) determine group handle of a communicator.

c) create a new communicator.

Page 53

Self Test

2. MPI_Group_incl is used to select processes of old_group to form new_group. If the selection includes process(es) not in old_group, it would cause the MPI job:

a) to print error messages and abort the job.

b) to print error messages but execution continues.

c) to continue with warning messages.

Page 54

Self Test

3. Assuming that a calling process belongs to the new_group of MPI_Group_excl(old_group, count, nonmembers, new_group), if nonmembers' order were altered,

a) the corresponding rank of the calling process in the new_group would change.

b) the corresponding rank of the calling process in the new_group would remain unchanged.

c) the corresponding rank of the calling process in the new_group might or might not change.

Page 55

Self Test

4. In MPI_Group_excl(old_group, count, nonmembers, new_group), if count = 0, then

a) new_group is identical to old_group.

b) new_group has no members.

c) error results.

Page 56

Self Test

5. In MPI_Group_excl(old_group, count, nonmembers, new_group) if the nonmembers array is not unique ( i.e., one or more entries of nonmembers point to the same rank in old_group), then

a) MPI ignores the repetition.

b) error results.

c) it returns MPI_GROUP_EMPTY.

Page 57

Self Test

6. MPI_Group_rank is used to query the calling process' rank number in group. If the calling process does not belong to the group, then

a) error results.

b) the returned group rank has a value of -1, indicating that the calling process is not a member of the group.

c) the returned group rank is MPI_UNDEFINED.

Page 58

Self Test

7. In MPI_Comm_split, if two processes of the same color are assigned the same key, then

a) error results.

b) their rank numbers in the new communicator are ordered according to their relative rank order in the old communicator.

c) they both share the same rank in the new communicator.

Page 59

Self Test

8. MPI_Comm_split(old_comm, color, key, new_comm) is equivalent to MPI_Comm_create(old_comm, group, new_comm) when

a) color=Iam, key=0; calling process Iam belongs to group; ELSE color=MPI_UNDEFINED for all other processes in old_comm.

b) color=0, key=Iam; calling process Iam belongs to group; ELSE color=MPI_UNDEFINED for all other processes in old_comm.

c) color=0, key=0

Page 60

Course Problem

• For this chapter, a new version of the Course Problem is presented in which the average value each processor calculates when a target location is found is computed in a different manner.

• Specifically, the average will be calculated from the "neighbor" values of the target.

• This is a classic style of programming (called calculations with a stencil) used at important array locations. Stencil calculations are used in many applications, including numerical solutions of differential equations and image processing, to name two.

• This new Course Problem will also entail more message passing between the searching processors because in order to calculate the average they will have to get values of the global array they do not have in their subarray.

Page 61

Description

• Our new problem still implements a parallel search of an integer array. The program should find all occurrences of a certain integer which will be called the target. When a processor of a certain rank finds a target location, it should then calculate the average of
– The target value
– An element from the processor with rank one higher (the "right" processor). The right processor should send the first element from its local array.
– An element from the processor with rank one less (the "left" processor). The left processor should send the first element from its local array.

Page 62

Description

• For example, if processor 1 finds the target at index 33 in its local array, it should get from processors 0 (left) and 2 (right) the first element of their local arrays. These three numbers should then be averaged.

• In terms of right and left neighbors, you should visualize the four processors connected in a ring. That is, the left neighbor for P0 should be P3, and the right neighbor for P3 should be P0.

• Both the target location and the average should be written to an output file. As usual, the program should read both the target value and all the array elements from an input file.

Page 63

Exercise

• Solve this new version of the Course Problem by modifying your code from Chapter 6. Specifically, change the code to perform the new method of calculating the average value at each target location.
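One simple design, sketched below, is for every process to exchange its first element with both ring neighbors up front, so the values are already available whenever a target is found; this is an assumption about the approach, not the only way to solve the exercise. The modular rank arithmetic implements the ring, and MPI_Sendrecv avoids the deadlock a naive ordering of blocking sends and receives could cause.

/* Sketch of the neighbor exchange only; assumes an integer local array b
   and 4 processes. The search and I/O code is omitted. */
#include <mpi.h>

int main(int argc, char *argv[])
{
    int b[100] = { 0 };              /* stand-in local array               */
    int rank, nprocs, left, right, from_left, from_right;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    left  = (rank - 1 + nprocs) % nprocs;   /* ring: P0's left is P3       */
    right = (rank + 1) % nprocs;            /* ring: P3's right is P0      */

    /* each process sends its first element to both neighbors and
       receives theirs in return                                           */
    MPI_Sendrecv(&b[0], 1, MPI_INT, right, 0,
                 &from_left, 1, MPI_INT, left, 0, MPI_COMM_WORLD, &status);
    MPI_Sendrecv(&b[0], 1, MPI_INT, left, 1,
                 &from_right, 1, MPI_INT, right, 1, MPI_COMM_WORLD, &status);

    /* when a target is found at b[i]:
       average = (b[i] + from_left + from_right) / 3.0;                     */

    MPI_Finalize();
    return 0;
}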

Page 64

Virtual Topologies

Page 65

Self Test

1. When using MPI_Cart_create, if the cartesian grid size is smaller than processes available in old_comm, then:

a) error results.

b) new_comm returns MPI_COMM_NULL for calling processes not used for grid.

c) new_comm returns MPI_UNDEFINED for calling processes not used for grid.

Page 66

Self Test

2. When using MPI_Cart_create, if the cartesian grid size is larger than processes available in old_comm, then:

a) error results.

b) the cartesian grid is automatically reduced to match processes available in old_comm.

c) more processes are added to match the requested cartesian grid size if possible; otherwise error results.

Page 67

Self Test

3. After using MPI_Cart_create to generate a cartesian grid with grid size smaller than the number of processes available in old_comm, a call to MPI_Cart_coords or MPI_Cart_rank made unconditionally (i.e., without regard to whether it is appropriate to call) ends in error because:

a) calling processes not belonging to group have been assigned the communicator MPI_UNDEFINED, which is not a valid communicator for MPI_Cart_coords or MPI_Cart_rank.

b) calling processes not belonging to group have been assigned the communicator MPI_COMM_NULL, which is not a valid communicator for MPI_Cart_coords or MPI_Cart_rank.

c) grid size does not match what is in old_comm.

Page 68

Self Test

4. When using MPI_Cart_rank to translate cartesian coordinates into equivalent rank, if some or all of the indices of the coordinates are outside of the defined range, then

a) error results.

b) error results unless periodicity is imposed in all dimensions.

c) error results unless each of the out-of-range indices is periodic.

Page 69

Self Test

5. With MPI_Cart_shift(comm, direction, displ, source, dest), if the calling process is the first or the last entry along the shift direction and displ is greater than 0, then

a) error results.

b) MPI_Cart_shift returns source and dest if periodicity is imposed along the shift direction. Otherwise, source and/or dest return MPI_UNDEFINED.

c) error results unless periodicity is imposed along the shift direction.

Page 70

Self Test

6. MPI_Cart_sub can be used to subdivide a cartesian grid into subgrids of lower dimensions. These subgrids

a) have dimensions one lower than the original grid.

b) attributes such as periodicity must be reimposed.

c) possess appropriate attributes of the original cartesian grid.

Page 71

Course Problem

• Description

– The new problem still implements a parallel search of an integer array. The program should find all occurrences of a certain integer which will be called the target. When a processor of a certain rank finds a target location, it should then calculate the average of

• The target value

• An element from the processor with rank one higher (the "right" processor). The right processor should send the first element from its local array.

• An element from the processor with rank one less (the "left" processor). The left processor should send the first element from its local array.

Page 72

Course Problem

• For example, if processor 1 finds the target at index 33 in its local array, it should get from processors 0 (left) and 2 (right) the first element of their local arrays. These three numbers should then be averaged.

• In terms of right and left neighbors, you should visualize the four processors connected in a ring. That is, the left neighbor for P0 should be P3, and the right neighbor for P3 should be P0.

• Both the target location and the average should be written to an output file. As usual, the program should read both the target value and all the array elements from an input file.

Page 73

Course Problem

• Exercise

– Modify your code from Chapter 7 to solve this latest version of the Course Problem using a virtual topology. First, create the topology (which should be called MPI_RING) in which the four processors are connected in a ring. Then, use the utility routines to determine which neighbors a given processor has.
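A sketch of the topology creation and neighbor lookup is given below; it assumes exactly 4 processes and simply prints the neighbors it finds. The periodic flag is what turns the 1-D grid into a ring.

/* Sketch: a 1-D periodic cartesian topology ("ring") and its neighbors.
   The communicator name MPI_RING follows the exercise; 4 processes assumed. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int dims[1]    = { 4 };       /* 4 processes in one dimension           */
    int periods[1] = { 1 };       /* periodic, so the ends wrap into a ring */
    int reorder = 1, rank, left, right;
    MPI_Comm MPI_RING;

    MPI_Init(&argc, &argv);

    MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods, reorder, &MPI_RING);
    MPI_Comm_rank(MPI_RING, &rank);

    /* shift by +1 along dimension 0: source is the left neighbor,
       dest is the right neighbor                                           */
    MPI_Cart_shift(MPI_RING, 0, 1, &left, &right);

    printf("rank %d: left neighbor %d, right neighbor %d\n",
           rank, left, right);

    MPI_Finalize();
    return 0;
}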

Page 74

MPI Program Performance

Page 75

Matching

1. Amdahl's Law

2. Profiles

3. Relative efficiency

4. Load imbalances

5. Timers

6. Asymptotic analysis

7. Execution time

8. Cache effects

9. Event traces

10. Absolute speedup

a) The time elapsed from when the first processor starts executing a problem to when the last processor completes execution.

b) T1/(P*Tp), where T1 is the execution time on one processor and Tp is the execution time on P processors.

c) The execution time on one processor of the fastest sequential program divided by the execution time on P processors.

d) When the sequential component of an algorithm accounts for 1/s of the program's execution time, then the maximum possible speedup that can be achieved on a parallel computer is s.

e) Characterizing performance in a large limit.

f) When an algorithm suffers from computation or communication imbalances among processors.

g) When the fast memory on a processor gets used more often in a parallel implementation, causing an unexpected decrease in the computation time.

h) A performance tool that shows the amount of time a program spends on different program components.

i) A performance tool that determines the length of time spent executing a particular piece of code.

j) The most detailed performance tool that generates a file which records the significant events in the running of a program.

Page 76

Self Test

2. The following is not a performance metric:

a) speedup

b) efficiency

c) problem size

Page 77

Self Test

3. A good question to ask in scalability analysis is:

a) How can one overlap computation and communications tasks in an efficient manner?

b) How can a single performance measure give an accurate picture of an algorithm's overall performance?

c) How does efficiency vary with increasing problem size?

d) In what parameter regime can I apply Amdahl's law?

Page 78

Self Test

4. If an implementation has unaccounted-for overhead, a possible reason is:

a) an algorithm may suffer from computation or communication imbalances among processors.

b) the cache, or fast memory, on a processor may get used more often in a parallel implementation causing an unexpected decrease in the computation time.

c) you failed to employ a domain decomposition.

d) there is not enough communication between processors.

Page 79

Self Test

5. Which one of the following is not a data collection technique used to gather performance data:

a) counters

b) profiles

c) abstraction

d) event traces

Page 80

Course Problem

• In this chapter, the broad subject of parallel code performance is discussed both in terms of theoretical concepts and some specific tools for measuring performance metrics that work on certain parallel machines. Put in its simplest terms, improving code performance boils down to speeding up your parallel code and/or improving how your code uses memory.

Page 81

Course Problem

• As you have learned new features of MPI in this course, you have also improved the performance of the code. Here is a list of performance improvements so far:

– Using Derived Datatypes instead of sending and receiving the separate pieces of data

– Using Collective Communication routines instead of repeating/looping individual sends and receives

– Using a Virtual Topology and its utility routines to avoid extraneous calculations

– Changing the original master-slave algorithm so that the master also searches part of the global array (The slave rebellion: Spartacus!)

– Using "true" parallel I/O so that all processors write to the output file simultaneously instead of just one (the master)

Page 82

Course Problem

• But more remains to be done, especially in terms of how the program uses memory. That is the last exercise for this course. The problem description is the same as the one given in Chapter 9, but you will modify the code you wrote using what you learned in this chapter.

Page 83

Course Problem

• Description

– The initial problem implements a parallel search of an extremely large (several thousand elements) integer array. The program finds all occurrences of a certain integer, called the target, and writes all the array indices where the target was found to an output file. In addition, the program reads both the target value and all the array elements from an input file.

• Exercise

– Modify your code from Chapter 9 so that it uses dynamic memory allocation to use only the amount of memory it needs and only for as long as it needs it. Make both the arrays a and b ALLOCATED DYNAMICALLY and connect them to memory properly. You may also assume that the input data file "b.data" now has on its first line the number of elements in the global array b. The second line now has the target value. The remaining lines are the contents of the global array b.
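A sketch of the dynamic-allocation step in C (malloc/free being the C analogue of ALLOCATE) is given below. Only array b is shown; array a from your Chapter 9 code would be handled the same way, and the distribution and search code is omitted.

/* Sketch of the dynamic-allocation step only: the master reads the element
   count from the first line of "b.data", allocates b to exactly that size,
   and frees it as soon as it is done with it. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int n, target, i;
    int *b;
    FILE *in = fopen("b.data", "r");
    if (in == NULL) return 1;

    fscanf(in, "%d", &n);                  /* line 1: number of elements    */
    fscanf(in, "%d", &target);             /* line 2: target value          */

    b = malloc(n * sizeof *b);             /* only as much memory as needed */
    if (b == NULL) return 1;

    for (i = 0; i < n; i++)                /* remaining lines: the array    */
        fscanf(in, "%d", &b[i]);
    fclose(in);

    /* ... distribute, search, and gather as in the Chapter 9 code ...      */

    free(b);                               /* release as soon as finished   */
    return 0;
}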