07_ParallelProgramming
![Page 1: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/1.jpg)
Challenge the future
Delft University of Technology
Parallel Programming
Jan Thorbecke
![Page 2: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/2.jpg)
2
Contents
• Introduction
• Programming
  • OpenMP
  • MPI
• Issues with parallel computing
• Running programs
• Examples
![Page 3: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/3.jpg)
Amdahl’s Law

• Describes the relation between the parallel portion of your code and the expected speedup:

  speedup = 1 / ((1 − P) + P/N)

• P = parallel portion
• N = number of processors used in the parallel part
• P/N is the ideal parallel speed-up; the achieved speed-up will always be less

3
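As a quick sanity check of the formula, a minimal C sketch (the function name and example values are illustrative, not from the slides):

#include <stdio.h>

/* Amdahl's Law: speedup = 1 / ((1 - P) + P/N) */
static double amdahl_speedup(double P, int N)
{
    return 1.0 / ((1.0 - P) + P / N);
}

int main(void)
{
    /* e.g. a 95% parallel code on 8 processors only gives ~5.9x */
    printf("P=0.95, N=8  -> speedup = %.2f\n", amdahl_speedup(0.95, 8));
    printf("P=0.99, N=64 -> speedup = %.2f\n", amdahl_speedup(0.99, 64));
    return 0;
}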
![Page 4: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/4.jpg)
Amdahl’s Law

[Figure: speedup versus number of processors (both axes 0-60) for parallel fractions P = 0.98, 0.99 and 1.0]

4
![Page 5: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/5.jpg)
Amdahl’s Law

[Figure: maximum speedup versus sequential portion in % (1-90%); the attainable speedup drops sharply as the sequential fraction grows]

5
![Page 6: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/6.jpg)
Super Linear Speed-up

[Figure: speedup versus number of processors for parallel fractions 0.96, 0.99 and 1.0, plus a super-linear curve lying above the ideal line]

6
![Page 7: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/7.jpg)
Concurrency and Parallelism
• Concurrency and parallelism are often used synonymously.

Concurrency: parts of an algorithm are independent of each other.
Parallelism (also parallel execution): two or more parts of a program are executed at the same moment in time.

Concurrency is a necessary prerequisite for parallel execution, but parallel execution is only one possible consequence of concurrency.
7
![Page 8: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/8.jpg)
Concurrency vs. Parallelism
• Concurrency: two or more threads are in progress at the same time (e.g. interleaved on a single core).

• Parallelism: two or more threads are executing at the same time; multiple cores are needed.

[Figure: timelines of Thread 1 and Thread 2, interleaved (concurrency) versus truly simultaneous (parallelism)]
![Page 9: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/9.jpg)
Learning
Classical CPUs are sequential.

There is an enormous amount of sequential programming knowledge built into compilers and known by most programmers. Parallel programming requires new skills and new tools.

Start by parallelising simple problems and keep learning along the way towards complex real-world problems.
9
![Page 10: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/10.jpg)
Recognizing Sequential Processes
• Time is inherently sequential
• Dynamic, real-time, event-driven applications are often difficult to parallelize effectively
  • Many games fall into this category
• Iterative processes
  • The results of an iteration depend on the preceding iteration
  • Audio encoders fall into this category
![Page 11: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/11.jpg)
Parallel Programming Models
• Parallel programming models exist as an abstraction above hardware and memory architectures.
• Which model to use is often a combination of what is available and personal choice. There is no "best" model, although there certainly are better implementations of some models over others.
![Page 12: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/12.jpg)
Parallel Programming Models
• Shared Memory
  • tasks share a common address space, which they read and write asynchronously
• Threads (functional)
  • a single process can have multiple, concurrent execution paths. Example implementations: POSIX threads & OpenMP
• Message Passing
  • tasks exchange data by sending and receiving messages. Example: MPI-2 specification
• Data Parallel languages
  • tasks perform the same operation on their partition of work. Examples: Co-array Fortran (CAF), Unified Parallel C (UPC), Chapel
• Hybrid
![Page 13: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/13.jpg)
Programming Models
13
[Diagram: layered view from the hardware layer (shared or distributed memory, interconnect) through system software (operating system, compilers) up to the programming model (shared memory, message passing, hybrid) and the user, from newcomers to experienced parallel programmers]
![Page 14: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/14.jpg)
Parallel Programming Concepts
• Work distribution
  • Which parallel task is doing what?
  • Which data is used by which task?
• Synchronization
  • Do the different parallel tasks have to meet?
• Communication
  • Is communication between parallel parts needed?
• Load Balance
  • Does every task have the same amount of work?
  • Are all processors of the same speed?
14
![Page 15: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/15.jpg)
Distributing Work and/or Data

15

• Work decomposition
  • based on the loop counter, e.g.

    do i=1,100    →    1: i=1,25   2: i=26,50   3: i=51,75   4: i=76,100

• Data decomposition
  • all work for a local portion of the data is done by the local processor, e.g.

    A(1:10,1:25)   A(1:10,26:50)   A(11:20,1:25)   A(11:20,26:50)

• Domain decomposition
  • decomposition of work and data
![Page 16: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/16.jpg)
Synchronization
• Synchronization
  • causes overhead (executing the synchronization primitive)
  • idle time, when not all tasks are finished at the same time

16

Example: each of the two loops is split over 4 processors (i=1..25 | 26..50 | 51..75 | 76..100); a BARRIER synchronization is needed between them because the second loop reads a() in reverse order:

Do i=1,100
   a(i) = b(i)+c(i)
Enddo
! BARRIER synchronization
Do i=1,100
   d(i) = 2*a(101-i)
Enddo
![Page 17: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/17.jpg)
Communication
• communication is necessary on the boundaries of a domain decomposition

17

do i=2,99
   b(i) = b(i) + h*(a(i-1)-2*a(i)+a(i+1))
end do

e.g. b(26) = b(26) + h*(a(25)-2*a(26)+a(27))

with the data distributed as A(1:25), A(26:50), A(51:75), A(76:100)

Replication versus Communication

• Normally replicate the values
  – Consider how many calculations you can execute while only sending 1 bit from one process to another (6 µs at 1.0 Gflop/s ≈ 6000 operations)
  – Sending 16 kByte (20x20x5 doubles) with 300 MB/s bandwidth takes 53.3 µs ≈ 53,300 operations
  – very often blocks have to wait for their neighbours
  – but extra work limits parallel efficiency
• Communication should only be used if one is quite sure that this is the best solution

[Figure: 2-dimensional domain decomposition with two halo cells — mesh partitioning, one subdomain per process]
![Page 18: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/18.jpg)
Unstructured Grids

• Mesh partitioning with special load balancing libraries
  – Metis (George Karypis, University of Minnesota)
  – ParMetis (internally parallel version of Metis)
    • http://www.cs.umn.edu/~karypis/metis/metis.html
    • http://www.hlrs.de/organization/par/services/tools/loadbalancer/metis.html
  – Jostle (Chris Walshaw, University of Greenwich)
    • http://www.gre.ac.uk/jostle
    • http://www.hlrs.de/organization/par/services/tools/loadbalancer/jostle.html
  – Goals:
    • Same work load in each sub-domain
    • Minimizing the maximal number of neighbour connections between sub-domains

[Figure: an unstructured mesh partitioned into numbered sub-domains]

Halo

• Stencil:
  – To calculate a new grid point, old data from the stencil grid points are needed
  – E.g., a 9-point stencil
• Halo
  – To calculate the new grid points of a sub-domain, additional grid points from other sub-domains are needed
  – They are stored in halos (ghost cells, shadows)
  – The halo depends on the form of the stencil

Load Imbalance

• Load imbalance is the time that some processors in the system are idle due to:
  • less parallelism than processors
  • unequal sized tasks together with too little parallelism
  • unequal processors
18
![Page 19: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/19.jpg)
Examples of work distribution
• Domain Decomposition
• Master Slave
• Task Decomposition
19
![Page 20: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/20.jpg)
Domain Decomposition
• First, decide how data elements should be divided among processors
• Second, decide which tasks each processor should be doing

Example:
‣ Vector addition, summation, maximum, ...
‣ Large data sets whose elements can be computed independently
  - Divide data and associated computation among threads
![Page 21: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/21.jpg)
Domain Decomposition
5 13 1 9 26 13 6 49 34 50 22 12 68 12 98 16 78 31 83 51 94 27 74 64
Find the largest element of an array
![Page 22: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/22.jpg)
5 13 1 9 26 13 6 49 34 50 22 12 68 12 98 16 78 31 83 51 94 27 74 64
Domain Decomposition
Find the largest element of an array
Core 0 Core 1 Core 2 Core 3
![Page 23: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/23.jpg)
Domain Decomposition
Find the largest element of an array
5 6 68 83
Core 0 Core 1 Core 2 Core 3
5 13 1 9 26 13 6 49 34 50 22 12 68 12 98 16 78 31 83 51 94 27 74 64
5 6 68 83
![Page 24: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/24.jpg)
5 13 1 9 26 13 6 49 34 50 22 12 68 12 98 16 78 31 83 51 94 27 74 64
Domain Decomposition
Find the largest element of an array
Core 0 Core 1 Core 2 Core 3
13 49 12 51
13 49 68 83
![Page 25: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/25.jpg)
Domain Decomposition
Find the largest element of an array
5 13 1 9 26 13 6 49 34 50 22 12 68 12 98 16 78 31 83 51 94 27 74 64
Core 0 Core 1 Core 2 Core 3
1 34 98 94
13 49 98 94
![Page 26: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/26.jpg)
Domain Decomposition
Find the largest element of an array
5 13 1 9 26 13 6 49 34 50 22 12 68 12 98 16 78 31 83 51 94 27 74 64
Core 0 Core 1 Core 2 Core 3
9 50 16 27
13 50 98 94
![Page 27: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/27.jpg)
Domain Decomposition
Find the largest element of an array
5 13 1 9 26 13 6 49 34 50 22 12 68 12 98 16 78 31 83 51 94 27 74 64
Core 0 Core 1 Core 2 Core 3
26 22 78 74
26 50 98 94
![Page 28: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/28.jpg)
5 13 1 9 26 13 6 49 34 50 22 12 68 12 98 16 78 31 83 51 94 27 74 64
Domain Decomposition
Find the largest element of an array
Core 0 Core 1 Core 2 Core 3
26 50 98 94
13 12 31 64
![Page 29: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/29.jpg)
Domain Decomposition
Find the largest element of an array
26
5 13 1 9 26 13 6 49 34 50 22 12 68 12 98 16 78 31 83 51 94 27 74 64
Core 0 Core 1 Core 2 Core 3
26 50 98 94
![Page 30: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/30.jpg)
Domain Decomposition
Find the largest element of an array
5 13 1 9 26 13 6 49 34 50 22 12 68 12 98 16 78 31 83 51 94 27 74 64
Core 0 Core 1 Core 2 Core 3
26 50 98 94
50
![Page 31: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/31.jpg)
Domain Decomposition
Find the largest element of an array
5 13 1 9 26 13 6 49 34 50 22 12 68 12 98 16 78 31 83 51 94 27 74 64
Core 0 Core 1 Core 2 Core 3
26 50 98 94
98
![Page 32: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/32.jpg)
Domain Decomposition
Find the largest element of an array
5 13 1 9 26 13 6 49 34 50 22 12 68 12 98 16 78 31 83 51 94 27 74 64
Core 0 Core 1 Core 2 Core 3
26 50 98 94
98
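A minimal OpenMP sketch of this per-core strategy (the function name and the use of a critical section to combine the partial maxima are assumptions; the slides only show the idea):

#include <omp.h>

/* Each thread scans its chunk of the array, keeps a private maximum,
 * and the partial results are combined at the end.
 * (OpenMP >= 3.1 could instead use reduction(max:largest).) */
int find_largest(const int *a, int n)
{
    int largest = a[0];
    #pragma omp parallel
    {
        int local_max = a[0];           /* private partial maximum */
        #pragma omp for nowait
        for (int i = 1; i < n; i++)
            if (a[i] > local_max)
                local_max = a[i];
        #pragma omp critical            /* combine the partial maxima */
        {
            if (local_max > largest)
                largest = local_max;
        }
    }
    return largest;
}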
![Page 33: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/33.jpg)
Master Slave

33

Get a heap of work done

[Figure: Core 0 (master) distributes tasks (T) from the heap of work to Core 1, Core 2 and Core 3]
![Page 34: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/34.jpg)
Task Decomposition
f()
s()
r()q()h()
g()
![Page 35: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/35.jpg)
Task Decomposition
f()
s()
r()q()h()
g()
Core 0
Core 2
Core 1
![Page 36: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/36.jpg)
Task Decomposition
f()
s()
r()q()h()
g()
Core 0
Core 2
Core 1
![Page 37: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/37.jpg)
Task Decomposition
f()
s()
r()q()h()
g()
Core 0
Core 2
Core 1
![Page 38: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/38.jpg)
Task Decomposition
f()
s()
r()q()h()
g()
Core 0
Core 2
Core 1
![Page 39: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/39.jpg)
Task Decomposition
f()
s()
r()q()h()
g()
Core 0
Core 2
Core 1
![Page 40: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/40.jpg)
Evaluating Parallel Models
• How do processes share information?

• How do processes synchronize?

• In the shared-memory model, both are accomplished through shared variables
  ‣ Communication: buffer
  ‣ Synchronization: semaphore
![Page 41: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/41.jpg)
Shared versus Private Variables
[Figure: two threads, each with its own private variables, both accessing a common set of shared variables]
![Page 42: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/42.jpg)
Parallel threads can “race” against each other to update resources.

Race conditions occur when execution order is assumed but not guaranteed.

Example: un-synchronised access to a bank account
Race Conditions
Deposits $100 into account
Withdraws $100 from account
Initial balance = $1000 Final balance = ?
![Page 43: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/43.jpg)
Race Conditions
Deposits $100 into account
Withdraws $100 from account
Initial balance = $1000 Final balance = ?
| Time | Withdrawal              | Deposit                  |
|------|-------------------------|--------------------------|
| T    | Load (balance = $1000)  |                          |
| T    | Subtract $100           | Load (balance = $1000)   |
| T    | Store (balance = $900)  | Add $100                 |
| T    |                         | Store (balance = $1100)  |
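A minimal OpenMP sketch of the bank-account race (hypothetical code, not from the slides; a single update rarely loses money in practice, but the interleaving shown in the table is exactly what the critical sections below rule out):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int balance = 1000;

    /* Two threads update the same shared variable.  Without protection the
     * load/modify/store sequences can interleave and one update is lost. */
    #pragma omp parallel num_threads(2) shared(balance)
    {
        if (omp_get_thread_num() == 0) {
            #pragma omp critical        /* remove this to allow the race */
            balance += 100;             /* deposit */
        } else {
            #pragma omp critical
            balance -= 100;             /* withdrawal */
        }
    }
    printf("final balance = %d\n", balance);   /* 1000 with the critical */
    return 0;
}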
![Page 44: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/44.jpg)
Code Example in OpenMP
for (i=0; i<NMAX; i++) {
   a[i] = 1;
   b[i] = 2;
}
#pragma omp parallel for shared(a,b)
for (i=0; i<12; i++) {
   a[i+1] = a[i]+b[i];
}

1: a= 1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0, 17.0, 19.0, 21.0, 23.0
4: a= 1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 3.0, 5.0, 7.0, 9.0, 11.0
4: a= 1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0, 17.0, 19.0, 21.0, 23.0
4: a= 1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0, 17.0, 19.0, 21.0, 23.0
44
![Page 45: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/45.jpg)
Code Example in OpenMP
thread   computation
0        a[1]  = a[0] + b[0]
0        a[2]  = a[1] + b[1]
0        a[3]  = a[2] + b[2]    <--| Problem
1        a[4]  = a[3] + b[3]    <--| Problem
1        a[5]  = a[4] + b[4]
1        a[6]  = a[5] + b[5]    <--| Problem
2        a[7]  = a[6] + b[6]    <--| Problem
2        a[8]  = a[7] + b[7]
2        a[9]  = a[8] + b[8]    <--| Problem
3        a[10] = a[9] + b[9]    <--| Problem
3        a[11] = a[10] + b[10]
45
![Page 46: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/46.jpg)
How to Avoid Data Races
• Scope variables to be local to threads
  • Variables declared within threaded functions
  • Allocate on the thread’s stack
  • TLS (Thread Local Storage)
• Control shared access with critical regions
  • Mutual exclusion and synchronization
  • Lock, semaphore, event, critical section, mutex…
![Page 47: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/47.jpg)
Examples variables
47
![Page 48: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/48.jpg)
Domain Decomposition
Sequential Code:

int a[1000], i;
for (i = 0; i < 1000; i++) a[i] = foo(i);
![Page 49: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/49.jpg)
Domain Decomposition
Sequential Code:

int a[1000], i;
for (i = 0; i < 1000; i++) a[i] = foo(i);

Thread 0:
for (i = 0; i < 500; i++) a[i] = foo(i);

Thread 1:
for (i = 500; i < 1000; i++) a[i] = foo(i);
![Page 50: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/50.jpg)
Domain Decomposition
Sequential Code:

int a[1000], i;
for (i = 0; i < 1000; i++) a[i] = foo(i);

Thread 0:
for (i = 0; i < 500; i++) a[i] = foo(i);

Thread 1:
for (i = 500; i < 1000; i++) a[i] = foo(i);

(the array a[] is Shared; the loop index i is Private)
![Page 51: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/51.jpg)
Task Decomposition

int e;

main () {
  int x[10], j, k, m;
  j = f(k);
  m = g(k);
  ...
}

int f(int k) {
  int a;
  a = e * x[k] * x[k];
  return a;
}

int g(int k) {
  int a;
  k = k-1;
  a = e / x[k];
  return a;
}
![Page 52: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/52.jpg)
Task Decompositionint e; !main () { int x[10], j, k, m; j = f(k); m = g(k); } !int f(int k) { int a; a = e * x[k] * x[k]; return a; } !int g(int k) { int a; k = k-1; a = e / x[k]; return a; }
Thread 0
Thread 1
![Page 53: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/53.jpg)
Task Decompositionint e; !main () { int x[10], j, k, m; j = f(k); m = g(k); } !int f(int k) { int a; a = e * x[k] * x[k]; return a; } !int g(int k) { int a; k = k-1; a = e / x[k]; return a; }
Thread 0
Thread 1
Static variable: Shared
![Page 54: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/54.jpg)
Task Decomposition

int e;

main () {
  int x[10], j, k, m;
  j = f(x, k);
  m = g(x, k);
}

int f(int *x, int k) {
  int a;
  a = e * x[k] * x[k];
  return a;
}

int g(int *x, int k) {
  int a;
  k = k-1;
  a = e / x[k];
  return a;
}
Thread 0
Thread 1
Heap variable: Shared
![Page 55: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/55.jpg)
Task Decompositionint e; !main () { int x[10], j, k, m; j = f(k); m = g(k); } !int f(int k) { int a; a = e * x[k] * x[k]; return a; } !int g(int k) { int a; k = k-1; a = e / x[k]; return a; }
Thread 0
Thread 1
Function’s local variables: Private
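One possible way to run f() and g() on different threads is the OpenMP sections construct; the slides do not show a pragma here, so the following self-contained version is only an illustrative sketch:

#include <stdio.h>

int e = 2;                                   /* static variable: shared   */

int f(int *x, int k) { return e * x[k] * x[k]; }
int g(int *x, int k) { k = k - 1; return e / x[k]; }

int main(void)
{
    int x[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    int j, k = 3, m;

    /* each section is handed to a different thread in the team */
    #pragma omp parallel sections shared(x, j, m) firstprivate(k)
    {
        #pragma omp section
        j = f(x, k);                         /* one thread computes f()   */
        #pragma omp section
        m = g(x, k);                         /* another thread computes g() */
    }
    printf("j = %d, m = %d\n", j, m);
    return 0;
}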
![Page 56: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/56.jpg)
Shared and Private Variables
• Shared variables
  • Static variables
  • Heap variables
  • Contents of run-time stack at time of call

• Private variables
  • Loop index variables
  • Run-time stack of functions invoked by thread
![Page 57: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/57.jpg)
57
![Page 58: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/58.jpg)
3
What Is OpenMP?
• Compiler directives for multithreaded programming

• Easy to create threaded Fortran and C/C++ codes

• Supports the data parallelism model

• Portable and standard

• Incremental parallelism
  ‣ Combines serial and parallel code in a single source
![Page 59: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/59.jpg)
OpenMP is not ...

Not automatic parallelization
– The user explicitly specifies parallel execution
– The compiler does not ignore user directives, even if they are wrong

Not just loop-level parallelism
– Functionality to enable general parallelism

Not a new language
– Structured as extensions to the base language
– Minimal functionality with opportunities for extension
59
![Page 60: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/60.jpg)
Directive based
• Directives are special comments in the language
  – Fortran fixed form: !$OMP, C$OMP, *$OMP
  – Fortran free form: !$OMP

Special comments are interpreted by OpenMP compilers:

      w = 1.0/n
      sum = 0.0
!$OMP PARALLEL DO PRIVATE(x) REDUCTION(+:sum)
      do I=1,n
        x = w*(I-0.5)
        sum = sum + f(x)
      end do
      pi = w*sum
      print *,pi
      end
60
Comment in Fortran but interpreted by OpenMP compilers
![Page 61: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/61.jpg)
C example
#pragma omp directives in C
– Ignored by non-OpenMP compilers

  w = 1.0/n;
  sum = 0.0;
  #pragma omp parallel for private(x) reduction(+:sum)
  for (i=0; i<n; i++) {
    x = w*((double)i+0.5);
    sum += f(x);
  }
  pi = w*sum;
  printf("pi=%g\n", pi);
}
61
![Page 62: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/62.jpg)
Architecture of OpenMP

62

Directives, Pragmas
• Control structures
• Work sharing
• Synchronization
• Data scope attributes
  • private
  • shared
  • reduction
• Orphaning

Runtime library routines
• Control & query routines
  • number of threads
  • throughput mode
  • nested parallelism
• Lock API

Environment variables
• Control runtime
  • schedule type
  • max threads
  • nested parallelism
  • throughput mode
![Page 63: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/63.jpg)
Programming Model
• Fork-join parallelism: ‣ Master thread spawns a team of threads as needed
‣ Parallelism is added incrementally: the sequential program evolves into a parallel program
Parallel Regions
Master Thread
63
![Page 64: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/64.jpg)
Work-sharing Construct

Threads are assigned an independent set of iterations.
Threads must wait at the end of a work-sharing construct (implicit barrier).

[Figure: #pragma omp parallel creates a team of threads; #pragma omp for distributes iterations i = 0 … 11 over them, with an implicit barrier at the end]

#pragma omp parallel
#pragma omp for
for(i = 0; i < 12; i++)
   c[i] = a[i] + b[i]
64
![Page 65: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/65.jpg)
Combining pragmas
These two code segments are equivalent
#pragma omp parallel
{
  #pragma omp for
  for (i=0; i< MAX; i++) {
    res[i] = huge();
  }
}

#pragma omp parallel for
for (i=0; i< MAX; i++) {
  res[i] = huge();
}
65
![Page 66: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/66.jpg)
Matrix-vector example (from “An Overview of OpenMP 3.0”, IWOMP 2009, TU Dresden)

Example - Matrix times vector: the rows of the product a = B*c are distributed over the threads.

TID = 0: for (i=0,1,2,3,4)            TID = 1: for (i=5,6,7,8,9)
  i = 0                                 i = 5
  sum = Σj b[0][j]*c[j]                 sum = Σj b[5][j]*c[j]
  a[0] = sum                            a[5] = sum
  i = 1                                 i = 6
  sum = Σj b[1][j]*c[j]                 sum = Σj b[6][j]*c[j]
  a[1] = sum                            a[6] = sum
  ... etc ...
66
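The source code on this slide did not survive extraction; a minimal sketch of such a matrix-vector product, with the row loop shared among the threads, could look like this (row-major storage is an assumption):

/* a = B*c, with B an m x n matrix stored row-major: b[i*n + j] */
void matvec(int m, int n, const double *b, const double *c, double *a)
{
    #pragma omp parallel for
    for (int i = 0; i < m; i++) {      /* rows distributed over threads */
        double sum = 0.0;              /* private to each iteration     */
        for (int j = 0; j < n; j++)
            sum += b[i*n + j] * c[j];
        a[i] = sum;
    }
}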
![Page 67: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/67.jpg)
OpenMP Performance Example (from “An Overview of OpenMP 3.0”, IWOMP 2009, TU Dresden)

[Figure: performance (Mflop/s) versus memory footprint (KByte) of the matrix-vector product; for small matrices the parallel version is slower (“matrix too small” *), for larger matrices it scales]

*) With the IF-clause in OpenMP this performance degradation can be avoided

Performance is matrix size dependent
67
![Page 68: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/68.jpg)
OpenMP parallelization
• OpenMP Team := Master + Workers
• A Parallel Region is a block of code executed by all threads simultaneously
• The master thread always has thread ID 0
• Thread adjustment (if enabled) is only done before entering a parallel region
• Parallel regions can be nested, but support for this is implementation dependent
• An "if" clause can be used to guard the parallel region; in case the condition evaluates to "false", the code is executed serially
• A work-sharing construct divides the execution of the enclosed code region among the members of the team; in other words: they split the work
68
![Page 69: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/69.jpg)
Data Environment
• OpenMP uses a shared-memory programming model
• Most variables are shared by default.
• Global variables are shared among threads
  • C/C++: file scope variables, static
• Not everything is shared, there is often a need for “local” data as well
69
![Page 70: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/70.jpg)
... not everything is shared ...

• Stack variables in functions called from parallel regions are PRIVATE
• Automatic variables within a statement block are PRIVATE
• Loop index variables are private (with exceptions)
  • C/C++: the first loop index variable in nested loops following a #pragma omp for
Data Environment
70
![Page 71: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/71.jpg)
About Variables in SMP!
• Shared variables: can be accessed by every thread. Independent read/write operations can take place.

• Private variables: every thread has its own copy, created/destroyed upon entering/leaving the procedure. They are not visible to other threads.
71
| serial code | parallel code |
|-------------|---------------|
| global      | shared        |
| auto local  | local         |
| static      | use with care |
| dynamic     | use with care |
![Page 72: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/72.jpg)
attribute clauses
• default(shared)
• shared(varname,…)
• private(varname,…)
Data Scope clauses
72
![Page 73: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/73.jpg)
The Private Clause
Reproduces the variable for each thread
• Variables are un-initialised; a C++ object is default constructed
• Any value external to the parallel region is undefined

void* work(float* c, int N) {
  float x, y;
  int i;
  #pragma omp parallel for private(x,y)
  for(i=0; i<N; i++) {
    x = a[i];
    y = b[i];
    c[i] = x + y;
  }
}
73
![Page 74: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/74.jpg)
Synchronization
• Barriers

  #pragma omp barrier

• Critical sections

  #pragma omp critical [(name)]

• Lock library routines

  omp_set_lock(omp_lock_t *lock)
  omp_unset_lock(omp_lock_t *lock)
  ....

74
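A small sketch using the lock routines listed above (the even-counting example is purely illustrative):

#include <omp.h>

/* Count the even elements of v[0..n-1]; the shared counter is protected
 * by an explicit OpenMP lock. */
int count_even(const int *v, int n)
{
    int count = 0;
    omp_lock_t lock;
    omp_init_lock(&lock);

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        if (v[i] % 2 == 0) {
            omp_set_lock(&lock);      /* enter the protected region */
            count++;
            omp_unset_lock(&lock);    /* leave it again             */
        }
    }
    omp_destroy_lock(&lock);
    return count;
}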
![Page 75: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/75.jpg)
#pragma omp critical [(lock_name)]

Defines a critical region on a structured block
OpenMP Critical Construct
float R1, R2;
#pragma omp parallel
{
  float A, B;
  #pragma omp for
  for(int i=0; i<niters; i++){
    B = big_job(i);
    #pragma omp critical (R1_lock)
    consum (B, &R1);
    A = bigger_job(i);
    #pragma omp critical (R2_lock)
    consum (A, &R2);
  }
}

All threads execute the code, but only one at a time calls consum(), thereby protecting R1 and R2 from race conditions.

Naming the critical constructs is optional, but may increase performance.
75
![Page 76: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/76.jpg)
OpenMP Critical
76
OpenMP critical directive: Explicit Synchronization

• Race conditions can be avoided by controlling access to shared variables, allowing threads to have exclusive access to them
• Exclusive access to a shared variable allows a thread to atomically perform read, modify and update operations on it
• Mutual exclusion synchronization is provided by the critical directive of OpenMP
• The code block within the critical region defined by the critical / end critical directives can be executed by only one thread at a time
• Other threads in the group must wait until the current thread exits the critical region; thus only one thread at a time can manipulate values in the critical region

[Figure: fork/join diagram with a critical region executed by one thread at a time]

int x;
x = 0;
#pragma omp parallel shared(x)
{
  #pragma omp critical
  x = 2*x + 1;
} /* omp end parallel */
All threads execute the code, but only one at a time. Other threads in the group must wait until the current thread exits the critical region. Thus only one thread can manipulate values in the critical region.
![Page 77: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/77.jpg)
Critical Example

Simple Example: critical

cnt = 0;
f = 7;
#pragma omp parallel
{
  #pragma omp for
  for (i=0;i<20;i++){
    if(b[i] == 0){
      #pragma omp critical
      cnt++;
    } /* end if */
    a[i]=b[i]+f*(i+1);
  } /* end for */
} /* omp end parallel */

[Figure: the 20 iterations are split over 4 threads (i=0..4, 5..9, 10..14, 15..19); each thread may enter the critical section to update cnt, while the updates of a[i] proceed in parallel]
77
![Page 78: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/78.jpg)
OpenMP Single Construct
• Only one thread in the team executes the enclosed code

• The format is:

  #pragma omp single [nowait][clause, ..]
  {
     "block"
  }

• The supported clauses on the single directive are:

  private (list)
  firstprivate (list)

NOWAIT: the other threads will not wait at the end of the single directive

78
![Page 79: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/79.jpg)
OpenMP Master directive
#pragma omp master
{
   "code"
}

• All threads but the master skip the enclosed section of code and continue
• There is no implicit barrier on entry or exit

#pragma omp barrier

• Each thread waits until all others in the team have reached this point.

79
![Page 80: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/80.jpg)
#pragma omp parallel
{
   ....
#pragma omp single [nowait]
   {
   ....
   }
#pragma omp master
   {
   ....
   }
   ....
#pragma omp barrier
}
Work Sharing: Single Master
80
![Page 81: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/81.jpg)
Work Sharing: Orphaning
• Worksharing constructs may be outside lexical scope of the parallel region
81
#pragma omp parallel
{
   ....
   dowork( )
   ....
}
....

void dowork( )
{
   #pragma omp for
   for (i=0; i<n; i++) {
      ....
   }
}
![Page 82: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/82.jpg)
Scheduling the work
• schedule ( static | dynamic | guided | auto [, chunk] )
• schedule ( runtime )

static [, chunk]
• Distribute iterations in blocks of size "chunk" over the threads in a round-robin fashion
• In absence of "chunk", each thread executes approx. N/P chunks for a loop of length N and P threads
82
Example static schedule (loop of length 16, 4 threads):

| Schedule   | Thread 0   | Thread 1   | Thread 2   | Thread 3   |
|------------|------------|------------|------------|------------|
| static *)  | 1-4        | 5-8        | 9-12       | 13-16      |
| static, 2  | 1-2, 9-10  | 3-4, 11-12 | 5-6, 13-14 | 7-8, 15-16 |

*) The precise distribution is implementation defined
![Page 83: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/83.jpg)
83
dynamic [, chunk]
• Fixed portions of work; the size is controlled by the value of chunk
• When a thread finishes, it starts on the next portion of work

guided [, chunk]
• Same dynamic behavior as "dynamic", but the size of the portion of work decreases exponentially

runtime
• The iteration scheduling scheme is set at runtime through the environment variable OMP_SCHEDULE
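A minimal usage sketch of the schedule clause (the loop body and the chunk size 5 are illustrative; 5 matches the experiment on the next slide):

#include <math.h>

void scale(double *a, int n)
{
    /* iterations handed out in chunks of 5, on demand */
    #pragma omp parallel for schedule(dynamic, 5)
    for (int i = 0; i < n; i++)
        a[i] = sqrt((double)i) * a[i];

    /* schedule(runtime) would defer the choice to OMP_SCHEDULE,
     * e.g.  export OMP_SCHEDULE="guided,5"  */
}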
![Page 84: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/84.jpg)
Example scheduling: the experiment

[Figure: thread ID (0-3) versus iteration number for 500 iterations on 4 threads, comparing static, "dynamic,5" and "guided,5" schedules]
84
![Page 85: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/85.jpg)
Exercise: OpenMP scheduling
• How OpenMP can help parallelize a loop, and how you as a user can help make it better ;-)
• Download code from: http://www.xs4all.nl/~janth/HPCourse/OMP_schedule.tar !
• Unpack tar file and check the README for instructions. !
• This exercise requires already some knowledge about OpenMP. The OpenMP F-Summary.pdf from the website can be helpful.
85
![Page 86: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/86.jpg)
13
OpenMP Reduction Clause
reduction (op : list)

The variables in "list" must be shared in the enclosing parallel region.

Inside a parallel or work-sharing construct:
‣ A PRIVATE copy of each list variable is created and initialized depending on the "op"
‣ These copies are updated locally by the threads
‣ At the end of the construct, the local copies are combined through "op" into a single value and combined with the value in the original SHARED variable
![Page 87: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/87.jpg)
OpenMP: Reduction

• Performs a reduction on the shared variables in "list" based on the operator provided.
• For C/C++ the operator can be any one of: +, *, -, ^, |, ||, & or &&
• At the end of a reduction, the shared variable contains the result obtained upon combination of the list of variables processed using the operator specified.

sum = 0.0
#pragma omp parallel for reduction(+:sum)
for (i=0; i < 20; i++)
  sum = sum + (a[i] * b[i]);

[Figure: each thread handles 5 iterations (i=0..4, 5..9, 10..14, 15..19) with a local copy of sum; all local copies are added together and stored in the "global" variable]

Reduction Example

#pragma omp parallel for reduction(+:sum)
for(i=0; i<N; i++) {
  sum += a[i] * b[i];
}
87
![Page 88: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/88.jpg)
C/C++ Reduction Operations

A range of associative and commutative operators can be used with reduction; the initial values are the ones that make sense.

| Operator | Initial Value |
|----------|---------------|
| +        | 0             |
| *        | 1             |
| -        | 0             |
| ^        | 0             |
| &        | ~0            |
| \|       | 0             |
| &&       | 1             |
| \|\|     | 0             |

FORTRAN: the intrinsic is one of MAX, MIN, IAND, IOR, IEOR; the operator is one of +, *, -, .AND., .OR., .EQV., .NEQV.
88
![Page 89: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/89.jpg)
Numerical Integration Example
[Figure: plot of 4/(1+x²) on [0,1]; the area under the curve, approximated by summing rectangles, equals π]

∫₀¹ 4/(1 + x²) dx = π

static long num_steps=100000;
double step, pi;

void main()
{
  int i;
  double x, sum = 0.0;

  step = 1.0/(double) num_steps;
  for (i=0; i<num_steps; i++){
    x = (i+0.5)*step;
    sum = sum + 4.0/(1.0 + x*x);
  }
  pi = step * sum;
  printf("Pi = %f\n",pi);
}
89
![Page 90: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/90.jpg)
Numerical Integration to Compute Pi

static long num_steps=100000;
double step, pi;

void main()
{
  int i;
  double x, sum = 0.0;

  step = 1.0/(double) num_steps;
  for (i=0; i<num_steps; i++){
    x = (i+0.5)*step;
    sum = sum + 4.0/(1.0 + x*x);
  }
  pi = step * sum;
  printf("Pi = %f\n",pi);
}

Parallelize the numerical integration code using OpenMP.
What variables can be shared?                       step, num_steps
What variables need to be private?                  x, i
What variables should be set up for reductions?     sum
90
![Page 91: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/91.jpg)
Solution to Computing Pi
static long num_steps=100000;
double step, pi;

void main()
{
  int i;
  double x, sum = 0.0;
  step = 1.0/(double) num_steps;
  #pragma omp parallel for private(x) reduction(+:sum)
  for (i=0; i<num_steps; i++){
    x = (i+0.5)*step;
    sum = sum + 4.0/(1.0 + x*x);
  }
  pi = step * sum;
  printf("Pi = %f\n",pi);
}
91
![Page 92: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/92.jpg)
Let’s try it out
• Go to example MPI_pi.tar and work with openmp_pi2.c
92
![Page 93: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/93.jpg)
Exercise: PI with MPI and OpenMP
93
| cores | OpenMP runtime |
|-------|----------------|
| 1     | 9.617728       |
| 2     | 4.874539       |
| 4     | 2.455036       |
| 6     | 1.627149       |
| 8     | 1.214713       |
| 12    | 0.820746       |
| 16    | 0.616482       |
![Page 94: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/94.jpg)
Exercise: PI with MPI and OpenMP

[Figure: Pi scaling — speed-up versus number of CPUs (1-16), comparing the measured OpenMP results on a 2.2 GHz Barcelona with the Amdahl 1.0 ideal curve]
94
![Page 95: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/95.jpg)
Race condition code example
• Take the MPI_pi.tar example and replace

  #pragma omp parallel for private(x) reduction(+:sum)

• with

  #pragma omp parallel for private(x) shared(sum)

• ‘solve’ with

  #pragma omp critical before the sum update
95
![Page 96: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/96.jpg)
CUDA: Computing PI
96
![Page 97: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/97.jpg)
Environment Variables
• The names of the OpenMP environment variables must be UPPERCASE
• The values assigned to them are case insensitive
97
OMP_NUM_THREADS

OMP_SCHEDULE "schedule [, chunk]"

OMP_NESTED { TRUE | FALSE }
![Page 98: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/98.jpg)
Exercise: Shared Cache Thrashing
• Let’s do the exercise.
98
![Page 99: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/99.jpg)
About local and shared data
• Consider the following example:

  for (i=0; i<10; i++){
     a[i] = b[i] + c[i];
  }

• Let’s assume we run this on 2 processors:
  • processor 1: i = 0,2,4,6,8
  • processor 2: i = 1,3,5,7,9

99
![Page 100: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/100.jpg)
About local and shared data

100

Processor 1:                        Processor 2:
for (i1=0,2,4,6,8){                 for (i2=1,3,5,7,9){
   a[i1] = b[i1] + c[i1];              a[i2] = b[i2] + c[i2];
}                                   }

[Figure: the loop indices i1 and i2 live in each processor’s private area; the arrays A, B and C live in the shared area]
![Page 101: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/101.jpg)
About local and shared data
processor 1: i = 0,2,4,6,8      processor 2: i = 1,3,5,7,9

• This is not an efficient way to do this!
Why?
101
![Page 102: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/102.jpg)
Doing it the bad way
• Because of cache line usage:

  for (i=0; i<10; i++){
     a[i] = b[i] + c[i];
  }

• b[] and c[]: we use only half of the data in each cache line
• a[]: false sharing

102
![Page 103: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/103.jpg)
False sharing and scalability
• The Cause: updates on independent data elements that happen to be part of the same cache line.

• The Impact: non-scalable parallel applications.

• The Remedy: false sharing is often quite simple to solve.
103
![Page 104: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/104.jpg)
Poor cache line utilization
104
[Figure: Processor 1 and Processor 2 both read the same cache lines holding B(0)-B(9), but each uses only every other element, so half of each line is unused data. The same holds for array C]
![Page 105: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/105.jpg)
False Sharing
105
[Timeline, time running downwards:]

Processor 1: a[0] = b[0] + c[0];
  Writes into the line containing a[0]; this marks the cache line containing a[0] as ‘dirty’
Processor 2: a[1] = b[1] + c[1];
  Detects that the line with a[0] is ‘dirty’, gets a fresh copy (from processor 1), writes into the line containing a[1]; this marks the cache line containing a[1] as ‘dirty’
Processor 1: a[2] = b[2] + c[2];
  Detects that the line with a[2] is ‘dirty’, gets a fresh copy (from processor 2), writes into the line containing a[2]; this marks the cache line containing a[2] as ‘dirty’
Processor 2: a[3] = b[3] + c[3];
  Detects that the line with a[3] is ‘dirty’, ...
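One common remedy, sketched below under the assumption of 64-byte cache lines: give each thread its own padded partial result so the per-thread counters never share a line (in this particular case a reduction clause would be the simpler fix):

#include <omp.h>

#define CACHE_LINE 64
#define MAX_THREADS 64

struct padded_sum {
    double value;
    char pad[CACHE_LINE - sizeof(double)];   /* keep neighbours on other lines */
};

double sum_array(const double *a, int n, int nthreads)
{
    struct padded_sum partial[MAX_THREADS];  /* assume nthreads <= MAX_THREADS */
    for (int t = 0; t < nthreads; t++)
        partial[t].value = 0.0;

    #pragma omp parallel num_threads(nthreads)
    {
        int tid = omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < n; i++)
            partial[tid].value += a[i];      /* each thread writes its own line */
    }

    double total = 0.0;
    for (int t = 0; t < nthreads; t++)
        total += partial[t].value;
    return total;
}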
![Page 106: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/106.jpg)
Summary
• First tune single-processor performance

• Tuning parallel programs
  • Has the program been properly parallelized?
    • Is enough of the program parallelized (Amdahl’s law)?
    • Is the load well-balanced?
  • Location of memory
    • Cache friendly programs: no special placement needed
    • Non-cache friendly programs
    • False sharing?
  • Use of OpenMP
    • try to avoid synchronization (barrier, critical, single, ordered)
106
![Page 107: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/107.jpg)
19
Plenty of Other OpenMP Stuff
Scheduling clauses
Atomic
Barrier
Master & Single
Sections
Tasks (OpenMP 3.0)
API routines
![Page 108: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/108.jpg)
Compiling and running OpenMP
• Compile with the -openmp flag (Intel compiler) or -fopenmp (GNU)

• Run the program with the environment variable set:

  export OMP_NUM_THREADS=4
108
![Page 109: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/109.jpg)
OpenACC
• New set of directives to support accelerators
  • GPUs
  • Intel’s MIC
  • AMD Fusion processors
109
![Page 110: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/110.jpg)
OpenACC example

void convolution_SM_N(typeToUse A[M][N], typeToUse B[M][N]) {
  int i, j, k;
  int m=M, n=N;
  // OpenACC kernel region
  // Define a region of the program to be compiled into a sequence of kernels
  // for execution on the accelerator device
  #pragma acc kernels pcopyin(A[0:m]) pcopy(B[0:m])
  {
    typeToUse c11, c12, c13, c21, c22, c23, c31, c32, c33;
    c11 = +2.0f; c21 = +5.0f; c31 = -8.0f;
    c12 = -3.0f; c22 = +6.0f; c32 = -9.0f;
    c13 = +4.0f; c23 = +7.0f; c33 = +10.0f;
    // The OpenACC loop gang clause tells the compiler that the iterations of the loops
    // are to be executed in parallel across the gangs.
    // The argument specifies how many gangs to use to execute the iterations of this loop.
    #pragma acc loop gang(64)
    for (int i = 1; i < M - 1; ++i) {
      // The OpenACC loop worker clause specifies that the iterations of the associated loop
      // are to be executed in parallel across the workers within the gangs created.
      // The argument specifies how many workers to use to execute the iterations of this loop.
      #pragma acc loop worker(128)
      for (int j = 1; j < N - 1; ++j) {
        B[i][j] = c11 * A[i - 1][j - 1] + c12 * A[i + 0][j - 1] + c13 * A[i + 1][j - 1]
                + c21 * A[i - 1][j + 0] + c22 * A[i + 0][j + 0] + c23 * A[i + 1][j + 0]
                + c31 * A[i - 1][j + 1] + c32 * A[i + 0][j + 1] + c33 * A[i + 1][j + 1];
      }
    }
  } //kernels region
}
110
![Page 111: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/111.jpg)
MPI
111
![Page 112: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/112.jpg)
Message Passing
• Point-to-Point
  • Requires explicit commands in the program: Send, Receive
  • Must be synchronized among the different processors
    • Sends and Receives must match
    • Avoid deadlock: all processors waiting, none able to communicate

• Multi-processor communications
  • e.g. broadcast, reduce
![Page 113: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/113.jpg)
113
MPI advantages
• Mature and well understood
  • Backed by a widely-supported formal standard (1992)
  • Porting is “easy”
• Efficiently matches the hardware
  • Vendor and public implementations available
• User interface:
  • Efficient and simple
  • Buffer handling
  • Allows high-level abstractions
• Performance
![Page 114: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/114.jpg)
114
MPI disadvantages
• MPI 2.0 includes many features beyond message passing
• Learning curve
• Execution control environment depends on the implementation
![Page 115: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/115.jpg)
Programming Model

• Explicit parallelism:
  ‣ All processes start at the same time at the same point in the code
  ‣ Full parallelism: there is no sequential part in the program

[Figure: all processes execute the whole program as one parallel region]
![Page 116: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/116.jpg)
Work Distribution
• All processors run the same executable.

• Parallel work distribution must be explicitly done by the programmer:
  • domain decomposition
  • master slave
116
![Page 117: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/117.jpg)
117
A Minimal MPI Program (C)
#include "mpi.h"!#include <stdio.h>!!int main( int argc, char *argv[] )!{! MPI_Init( &argc, &argv );! printf( "Hello, world!\n" );! MPI_Finalize();! return 0;!}
![Page 118: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/118.jpg)
118
A Minimal MPI Program (Fortran 90)
program main
use MPI
integer ierr

call MPI_INIT( ierr )
print *, 'Hello, world!'
call MPI_FINALIZE( ierr )
end
![Page 119: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/119.jpg)
Starting the MPI Environment
• MPI_INIT ( )
  Initializes the MPI environment. This function must be called and must be the first MPI function called in a program (exception: MPI_INITIALIZED)

Syntax:

  int MPI_Init ( int *argc, char ***argv )

  MPI_INIT ( IERROR )
  INTEGER IERROR
NOTE: Both C and Fortran return error codes for all calls.
119
![Page 120: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/120.jpg)
Exiting the MPI Environment
• MPI_FINALIZE ( )
  Cleans up all MPI state. Once this routine has been called, no MPI routine (not even MPI_INIT) may be called.

Syntax:

  int MPI_Finalize ( );

  MPI_FINALIZE ( IERROR )
  INTEGER IERROR

You MUST call MPI_FINALIZE when you exit from an MPI program.
120
![Page 121: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/121.jpg)
C and Fortran Language Considerations
• Bindings
– C
• All MPI names have an MPI_ prefix
• Defined constants are in all capital letters
• Defined types and functions have one capital letter after the prefix; the remaining letters are lowercase
– Fortran
• All MPI names have an MPI_ prefix
• No capitalization rules apply
• The last argument is a returned error value
121
![Page 122: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/122.jpg)
122
Finding Out About the Environment
• Two important questions that arise early in a parallel program are:
  • How many processes are participating in this computation?
  • Which one am I?

• MPI provides functions to answer these questions:
  – MPI_Comm_size reports the number of processes
  – MPI_Comm_rank reports the rank, a number between 0 and size-1, identifying the calling process
![Page 123: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/123.jpg)
123
Better Hello (C)
#include "mpi.h"!#include <stdio.h>!!int main( int argc, char *argv[] )!{! int rank, size;! MPI_Init( &argc, &argv );! MPI_Comm_rank( MPI_COMM_WORLD, &rank );! MPI_Comm_size( MPI_COMM_WORLD, &size );! printf( "I am %d of %d\n", rank, size );! MPI_Finalize();! return 0;!}
![Page 124: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/124.jpg)
124
Better Hello (Fortran)
program main
use MPI
integer ierr, rank, size

call MPI_INIT( ierr )
call MPI_COMM_RANK( MPI_COMM_WORLD, rank, ierr )
call MPI_COMM_SIZE( MPI_COMM_WORLD, size, ierr )
print *, 'I am ', rank, ' of ', size
call MPI_FINALIZE( ierr )
end
![Page 125: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/125.jpg)
125
Some Basic Concepts
• Processes can be collected into groups.
• Each message is sent in a context, and must be received in the same context.
• A group and context together form a communicator.
• A process is identified by its rank in the group associated with a communicator.
• There is a default communicator whose group contains all initial processes, called MPI_COMM_WORLD.

MPI Communicators

• A communicator is an internal object
• MPI programs are made up of communicating processes
• Each process has its own address space containing its own attributes such as rank, size (and argc, argv, etc.)
• MPI provides functions to interact with it
• The default communicator is MPI_COMM_WORLD
  – All processes are its members
  – It has a size (the number of processes)
  – Each process has a rank within it
  – One can think of it as an ordered list of processes
• Additional communicator(s) can co-exist
• A process can belong to more than one communicator
• Within a communicator, each process has a unique rank

[Figure: MPI_COMM_WORLD containing processes with ranks 0-7]
![Page 126: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/126.jpg)
Communicator
• Communication in MPI takes place with respect to communicators
• MPI_COMM_WORLD is one such predefined communicator (something of type “MPI_COMM”) and contains group and context information
• MPI_COMM_RANK and MPI_COMM_SIZE return information based on the communicator passed in as the first argument
• Processes may belong to many different communicators
[Figure: MPI_COMM_WORLD with ranks 0 through 7]
126
![Page 127: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/127.jpg)
MPI Basic Send/Receive

• Basic message passing process: send data from one process to another

• Questions
  – To whom is the data sent?
  – Where is the data?
  – What type of data is sent?
  – How much data is sent?
  – How does the receiver identify it?

[Figure: Process 0 sends buffer A, Process 1 receives it into buffer B]
127
![Page 128: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/128.jpg)
128
MPI Basic Send/Receive
• Data transfer plus synchronization
• Requires co-operation of sender and receiver
• Co-operation is not always apparent in the code
• Communication and synchronization are combined

[Figure: over time, Process 0 asks "May I Send?", Process 1 answers "Yes", and the data is then transferred]
![Page 129: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/129.jpg)
Message Envelope
• Communication across processes is performed using messages.
• Each message consists of a fixed number of fields used to distinguish messages, called the Message Envelope:
  – Envelope comprises source, destination, tag, communicator
  – Message comprises Envelope + data
• Communicator refers to the namespace associated with the group of related processes
[Figure: a message sent from rank 0 to rank 1 within MPI_COMM_WORLD]
Example envelope — Source: process 0, Destination: process 1, Tag: 1234, Communicator: MPI_COMM_WORLD
Message Organization in MPI
• Message is divided into data and envelope
• data
  – buffer
  – count
  – datatype
• envelope
  – process identifier (source/destination)
  – message tag
  – communicator
129
![Page 130: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/130.jpg)
MPI Basic Send/Receive

• Thus the basic (blocking) send has become:
    MPI_Send( start, count, datatype, dest, tag, comm )
  – Blocking means the function does not return until it is safe to reuse the data in the buffer. The message may not have been received by the target process.

• And the receive has become:
    MPI_Recv( start, count, datatype, source, tag, comm, status )
  – The source, tag, and the count of the message actually received can be retrieved from status.
130
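As a minimal illustration of these argument lists (a sketch, not part of the original slides; the partner rank 1, tag 99, and buffer size 10 are arbitrary choices), rank 0 sends 10 doubles to rank 1, and rank 1 inspects the status of the received message:

    #include "mpi.h"
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, count;
        double buf[10];
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            for (int i = 0; i < 10; i++) buf[i] = i;            /* fill the send buffer */
            MPI_Send(buf, 10, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(buf, 10, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD, &status);
            MPI_Get_count(&status, MPI_DOUBLE, &count);          /* how many doubles arrived */
            printf("rank 1 received %d doubles from rank %d with tag %d\n",
                   count, status.MPI_SOURCE, status.MPI_TAG);
        }
        MPI_Finalize();
        return 0;
    }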
![Page 131: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/131.jpg)
MPI C Datatypes
131
![Page 132: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/132.jpg)
MPI Fortran Datatypes
132
![Page 133: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/133.jpg)
133
MPI Tags
• Messages are sent with an accompanying user-defined integer tag, to assist the receiving process in identifying the message. !
• Messages can be screened at the receiving end by specifying a specific tag, or not screened by specifying MPI_ANY_TAG as the tag in a receive.
![Page 134: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/134.jpg)
Process Naming and Message Tags

• Naming a process
  – destination is specified by ( rank, group )
  – Processes are named according to their rank in the group
  – Groups are defined by their distinct “communicator”
  – MPI_ANY_SOURCE wildcard rank permitted in a receive
• Tags are integer variables or constants used to uniquely identify individual messages
• Tags allow programmers to deal with the arrival of messages in an orderly manner
• MPI tags are guaranteed to range from 0 to 32767 by MPI-1
  – Vendors are free to increase the range in their implementations
• MPI_ANY_TAG can be used as a wildcard value
134
![Page 135: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/135.jpg)
135
Retrieving Further Information

• Status is a data structure allocated in the user’s program.
• In C:
    int recvd_tag, recvd_from, recvd_count;
    MPI_Status status;
    MPI_Recv(..., MPI_ANY_SOURCE, MPI_ANY_TAG, ..., &status);
    recvd_tag  = status.MPI_TAG;
    recvd_from = status.MPI_SOURCE;
    MPI_Get_count( &status, datatype, &recvd_count );
• In Fortran:
    integer recvd_tag, recvd_from, recvd_count
    integer status(MPI_STATUS_SIZE)
    call MPI_RECV(..., MPI_ANY_SOURCE, MPI_ANY_TAG, ..., status, ierr)
    recvd_tag  = status(MPI_TAG)
    recvd_from = status(MPI_SOURCE)
    call MPI_GET_COUNT(status, datatype, recvd_count, ierr)
![Page 136: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/136.jpg)
Getting Information About a Message (Fortran)
• The following functions can be used to get information about a message:
    INTEGER status(MPI_STATUS_SIZE)
    call MPI_Recv( . . . , status, ierr )
    tag_of_received_message = status(MPI_TAG)
    src_of_received_message = status(MPI_SOURCE)
    call MPI_Get_count(status, datatype, count, ierr)
• MPI_TAG and MPI_SOURCE are primarily of use when MPI_ANY_TAG and/or MPI_ANY_SOURCE is used in the receive
• The function MPI_GET_COUNT may be used to determine how much data of a particular type was received
136
![Page 137: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/137.jpg)
137
Simple Fortran Example - 1/3

      program main
      use MPI

      integer rank, size, to, from, tag, count, i, ierr
      integer src, dest
      integer st_source, st_tag, st_count
      integer status(MPI_STATUS_SIZE)
      double precision data(10)

      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, rank, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, size, ierr )
      print *, 'Process ', rank, ' of ', size, ' is alive'
      dest = size - 1
      src = 0
![Page 138: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/138.jpg)
138
      if (rank .eq. 0) then
         do i=1,10
            data(i) = i
         enddo
         call MPI_SEND( data, 10, MPI_DOUBLE_PRECISION,
     +        dest, 2001, MPI_COMM_WORLD, ierr)

      else if (rank .eq. dest) then
         tag = MPI_ANY_TAG
         src = MPI_ANY_SOURCE
         call MPI_RECV( data, 10, MPI_DOUBLE_PRECISION,
     +        src, tag, MPI_COMM_WORLD,
     +        status, ierr)
Simple Fortran Example - 2/3
![Page 139: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/139.jpg)
139
         call MPI_GET_COUNT( status, MPI_DOUBLE_PRECISION,
     +        st_count, ierr )
         st_source = status( MPI_SOURCE )
         st_tag = status( MPI_TAG )
         print *, 'status info: source = ', st_source,
     +        ' tag = ', st_tag, ' count = ', st_count
      endif

      call MPI_FINALIZE( ierr )
      end
Simple Fortran Example - 3/3
![Page 140: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/140.jpg)
Is MPI Large or Small?
• Is MPI large (128 functions) or small (6 functions)?
  – MPI’s extensive functionality requires many functions
  – Number of functions is not necessarily a measure of complexity
  – Many programs can be written with just 6 basic functions:
      MPI_INIT       MPI_COMM_SIZE   MPI_SEND
      MPI_FINALIZE   MPI_COMM_RANK   MPI_RECV

• MPI is just right
  – A small number of concepts
  – Large number of functions provides flexibility, robustness, efficiency, modularity, and convenience
  – One need not master all parts of MPI to use it
140
![Page 141: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/141.jpg)
#include "mpi.h"#include <math.h>int main(int argc, char *argv[]){ int done = 0, n, myid, numprocs, i, rc; double PI25DT = 3.141592653589793238462643; double mypi, pi, h, sum, x, a; MPI_Init(&argc,&argv); MPI_Comm_size(MPI_COMM_WORLD,&numprocs); MPI_Comm_rank(MPI_COMM_WORLD,&myid); while (!done) { if (myid == 0) { printf("Enter the number of intervals: (0 quits) "); scanf("%d",&n); } MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD); if (n == 0) break; h = 1.0 / (double) n; sum = 0.0; for (i = myid + 1; i <= n; i += numprocs) { x = h * ((double)i - 0.5); sum += 4.0 / (1.0 + x*x); } mypi = h * sum; MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0,MPI_COMM_WORLD); if (myid == 0) printf("pi is approximately %.16f, Error is %.16f\n",pi, fabs(pi - PI25DT)); } MPI_Finalize(); return 0; }
141
Example: PI in Fortran 90 and C
work distribution
![Page 142: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/142.jpg)
work distribution
      program main
      use MPI
      double precision PI25DT
      parameter (PI25DT = 3.141592653589793238462643d0)
      double precision mypi, pi, h, sum, x, f, a
      integer n, myid, numprocs, i, ierr
c     function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)

      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )

 10   if ( myid .eq. 0 ) then
         write(6,98)
 98      format('Enter the number of intervals: (0 quits)')
         read(5,'(i10)') n
      endif
      call MPI_BCAST( n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
      if ( n .le. 0 ) goto 30

      h = 1.0d0/n
      sum = 0.0d0
      do 20 i = myid+1, n, numprocs
         x = h * (dble(i) - 0.5d0)
         sum = sum + f(x)
 20   continue
      mypi = h * sum

      call MPI_REDUCE( mypi, pi, 1, MPI_DOUBLE_PRECISION,
     +     MPI_SUM, 0, MPI_COMM_WORLD, ierr)
      if (myid .eq. 0) then
         write(6, 97) pi, abs(pi - PI25DT)
 97      format(' pi is approximately: ', F18.16,
     +        ' Error is: ', F18.16)
      endif
      goto 10
 30   call MPI_FINALIZE(ierr)
      end

142
![Page 143: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/143.jpg)
Exercise: PI with MPI and OpenMP
• Compile and run for computing PI parallel !
• Download code from: http://www.xs4all.nl/~janth/HPCourse/MPI_pi.tar !
• Unpack tar file and check the README for instructions. !
• It is assumed that you use MPICH1 and have PBS installed as a job scheduler. If you have MPICH2, let me know.
• Use qsub to submit the mpi job (an example script is provided) to a queue.
143
![Page 144: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/144.jpg)
Exercise: PI with MPI and OpenMP
144
cores OpenMP marken linux
1 9.617728 14.10798 22.15252
2 4.874539 7.071287 9.661745
4 2.455036 3.532871 5.730912
6 1.627149 2.356928 3.547961
8 1.214713 1.832055 2.804715
12 0.820746 1.184123
16 0.616482 0.955704
![Page 145: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/145.jpg)
[Plot: Pi scaling — speedup versus number of CPUs (1–16) for the Amdahl 1.0 curve, OpenMP on Barcelona 2.2 GHz, MPI on marken Xeon 3.1 GHz, and MPI on linux Xeon 2.4 GHz]
Exercise: PI with MPI and OpenMP
145
![Page 146: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/146.jpg)
[Plots: Pi scaling MPI — speedup versus number of cores (4–32) for the Amdahl 1.0 curve and MPI on Core i7-950 3.06 GHz; Pi scaling OpenMP — speedup versus number of cores (1–16) for the Amdahl 1.0 curve and OpenMP on Xeon E5504 2.0 GHz]
Exercise: PI with MPI and OpenMP
146
![Page 147: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/147.jpg)
147
![Page 148: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/148.jpg)
Blocking Communication

• So far we have discussed blocking communication
  – MPI_SEND does not complete until the buffer is empty (available for reuse)
  – MPI_RECV does not complete until the buffer is full (available for use)
• A process sending data will be blocked until the data in the send buffer is emptied
• A process receiving data will be blocked until the receive buffer is filled
• Completion of communication generally depends on the message size and the system buffer size
• Blocking communication is simple to use but can be prone to deadlocks

    if (my_proc .eq. 0) then
       call mpi_send(...)
       call mpi_recv(...)       ! usually deadlocks --
    else                        ! unless you reverse send/recv in one branch
       call mpi_send(...)
       call mpi_recv(...)
    endif
148
![Page 149: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/149.jpg)
149
• Send a large message from process 0 to process 1
• If there is insufficient storage at the destination, the send must wait for the user to provide the memory space (through a receive)
• What happens with

Sources of Deadlocks

    Process 0          Process 1
    Send(1)            Send(0)
    Recv(1)            Recv(0)

• This is called “unsafe” because it depends on the availability of system buffers.
![Page 150: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/150.jpg)
150
Some Solutions to the “unsafe” Problem

• Order the operations more carefully:

    Process 0          Process 1
    Send(1)            Recv(0)
    Recv(1)            Send(0)

• Use non-blocking operations (see the sketch below):

    Process 0          Process 1
    Isend(1)           Isend(0)
    Irecv(1)           Irecv(0)
    Waitall            Waitall
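A C sketch of the non-blocking variant (not from the slides; rank is assumed to be known and the fixed message length of 100 doubles is an arbitrary choice): both processes post their sends and receives first and only then wait, so the exchange completes regardless of system buffering.

    /* exchange 100 doubles between ranks 0 and 1 without deadlock */
    double sendbuf[100], recvbuf[100];
    MPI_Request req[2];
    MPI_Status  stat[2];
    int other = (rank == 0) ? 1 : 0;      /* partner rank */

    MPI_Isend(sendbuf, 100, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(recvbuf, 100, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &req[1]);

    /* computation that does not touch sendbuf/recvbuf can overlap here */

    MPI_Waitall(2, req, stat);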
![Page 151: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/151.jpg)
Blocking Send-Receive Diagram (Receive before Send)

[Timeline, send side and receive side:
 T0: MPI_Recv is called — once the receive is called, the receive buffer is unavailable to the user
 T1: MPI_Send is called
 T2: sender returns — the send buffer can be reused
 T3: transfer complete
 T4: MPI_Recv returns — the receive buffer is filled; internal completion is soon followed by the return of MPI_Recv]

It is important to receive before sending, for highest performance.
151
![Page 152: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/152.jpg)
Non-Blocking Send-Receive Diagram

[Timeline, send side and receive side:
 T0: MPI_Irecv is called — the receive buffer is unavailable to the user
 T1: MPI_Irecv returns
 T2: MPI_Isend is called
 T3: sender returns — the send buffer is still unavailable
 T4: MPI_Wait called on the send side
 T5: send completes — the send buffer is available after MPI_Wait
 T6: MPI_Wait called on the receive side
 T7: transfer finishes
 T8: MPI_Wait returns on the receive side — the receive buffer is filled; internal completion is soon followed by the return of MPI_Wait
 T9: Wait returns on the send side]

High-performance implementations offer low overhead for non-blocking calls.
152
![Page 153: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/153.jpg)
Non-Blocking Communication

• Non-blocking (asynchronous) operations return (immediately) “request handles” that can be waited on and queried:
    MPI_ISEND( start, count, datatype, dest, tag, comm, request )
    MPI_IRECV( start, count, datatype, src, tag, comm, request )
    MPI_WAIT( request, status )
• Non-blocking operations allow overlapping computation and communication.
• One can also test without waiting using MPI_TEST:
    MPI_TEST( request, flag, status )
• Anywhere you use MPI_Send or MPI_Recv, you can use the pair MPI_Isend/MPI_Wait or MPI_Irecv/MPI_Wait
• Combinations of blocking and non-blocking sends/receives can be used to synchronize execution instead of barriers
153
![Page 154: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/154.jpg)
Multiple Completions

• It is often desirable to wait on multiple requests
• An example is a worker/manager program, where the manager waits for one or more workers to send it a message (a sketch follows below):
    MPI_WAITALL( count, array_of_requests, array_of_statuses )
    MPI_WAITANY( count, array_of_requests, index, status )
    MPI_WAITSOME( incount, array_of_requests, outcount, array_of_indices, array_of_statuses )
• There are corresponding versions of test for each of these
154
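A possible manager-side sketch with MPI_Waitany (not from the slides; nworkers, result[] and the assumption that the workers are ranks 1..nworkers are hypothetical):

    /* post one receive per worker, then serve them in completion order */
    MPI_Request req[64];          /* assumes nworkers <= 64 */
    MPI_Status  status;
    int idx;

    for (int w = 0; w < nworkers; w++)
        MPI_Irecv(&result[w], 1, MPI_DOUBLE, w + 1, 0, MPI_COMM_WORLD, &req[w]);

    for (int done = 0; done < nworkers; done++) {
        MPI_Waitany(nworkers, req, &idx, &status);
        /* result[idx] is now valid; hand out new work to rank idx+1 here */
    }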
![Page 155: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/155.jpg)
Probing the Network for Messages
• MPI_PROBE and MPI_IPROBE allow the user to check for incoming messages without actually receiving them !
• MPI_IPROBE returns “flag == TRUE” if there is a matching message available. MPI_PROBE will not return until there is a matching message available:
    MPI_IPROBE( source, tag, communicator, flag, status )
    MPI_PROBE( source, tag, communicator, status )
155
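A common pattern, sketched here under the assumption that the incoming message consists of MPI_INT elements (not from the slides), is to probe first, query the size with MPI_Get_count, and only then allocate and receive:

    #include <stdlib.h>

    MPI_Status status;
    int count;

    /* block until any message is available, without receiving it yet */
    MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
    MPI_Get_count(&status, MPI_INT, &count);

    int *buf = (int *) malloc(count * sizeof(int));
    MPI_Recv(buf, count, MPI_INT, status.MPI_SOURCE, status.MPI_TAG,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* ... use buf ... */
    free(buf);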
![Page 156: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/156.jpg)
Message Completion and Buffering

• A send has completed when the user-supplied buffer can be reused
• The send mode used (standard, ready, synchronous, buffered) may provide additional information
• Just because the send completes does not mean that the receive has completed
  – Message may be buffered by the system
  – Message may still be in transit

    *buf = 3;
    MPI_Send( buf, 1, MPI_INT, ... );
    *buf = 4;   /* OK, receiver will always receive 3 */

    *buf = 3;
    MPI_Isend( buf, 1, MPI_INT, ... );
    *buf = 4;   /* Undefined whether the receiver will get 3 or 4 */
    MPI_Wait( ... );
156
![Page 157: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/157.jpg)
Collective Communications in MPI

• Communication is co-ordinated among a group of processes, as specified by the communicator, not on all processes
• All collective operations are blocking and no message tags are used (in MPI-1)
• All processes in the communicator group must call the collective operation
• Collective and point-to-point messaging are separated by different “contexts”
• Three classes of collective operations
  – Data movement
  – Collective computation
  – Synchronization
157
![Page 158: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/158.jpg)
Collective calls
158
• MPI has several collective communication calls; the most frequently used are:
  • Synchronization
    • Barrier
  • Communication
    • Broadcast
    • Gather / Scatter
    • All Gather
  • Reduction
    • Reduce
    • All Reduce (a minimal sketch follows below)
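As a small illustration of the reduction calls (a sketch, not from the slides; the local value is a stand-in for whatever each rank computes), MPI_Allreduce makes a global sum available on every rank in a single call:

    /* every rank contributes a local partial sum; all ranks get the total */
    double local_sum = rank + 1.0;     /* stand-in for a locally computed value */
    double global_sum;

    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    /* global_sum now holds the same value on all ranks */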
![Page 159: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/159.jpg)
MPI Collective Calls: Barrier

Function: MPI_Barrier()

    int MPI_Barrier ( MPI_Comm comm )

Description: Creates barrier synchronization in a communicator group comm. Each process, when reaching the MPI_Barrier call, blocks until all the processes in the group reach the same MPI_Barrier call.

http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Barrier.html

[Figure: processes P0–P3 reach MPI_Barrier() at different times, wait, and all continue together once the last one has arrived]

159
![Page 160: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/160.jpg)
MPI Basic Collective Operations

• Two simple collective operations:
    MPI_BCAST( start, count, datatype, root, comm )
    MPI_REDUCE( start, result, count, datatype, operation, root, comm )
• The routine MPI_BCAST sends data from one process to all others
• The routine MPI_REDUCE combines data from all processes, using a specified operation, and returns the result to a single process
• In many numerical algorithms, SEND/RECEIVE can be replaced by BCAST/REDUCE, improving both simplicity and efficiency.
160
![Page 161: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/161.jpg)
Broadcast and Reduce

[Figure: Bcast (root=0) — the send buffer A of rank 0 is copied to the send buffers of all ranks 0–3.
 Reduce (root=0, op) — the send buffers A, B, C, D of ranks 0–3 are combined into X = A op B op C op D in the receive buffer of rank 0.]
161
![Page 162: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/162.jpg)
Scatter and Gather

[Figure: Scatter (root=0) — the send buffer A B C D of rank 0 is split so that rank i receives element i into its receive buffer.
 Gather (root=0) — each rank i contributes the element in its send buffer, collected in order into the receive buffer A B C D of rank 0.]
162
![Page 163: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/163.jpg)
MPI Collective Routines

• Several routines:
    MPI_ALLGATHER       MPI_ALLGATHERV   MPI_BCAST
    MPI_ALLTOALL        MPI_ALLTOALLV    MPI_REDUCE
    MPI_GATHER          MPI_GATHERV      MPI_SCATTER
    MPI_REDUCE_SCATTER  MPI_SCAN         MPI_SCATTERV
    MPI_ALLREDUCE
• “All” versions deliver results to all participating processes
• “V” versions allow the chunks to have different sizes
• MPI_ALLREDUCE, MPI_REDUCE, MPI_REDUCE_SCATTER, and MPI_SCAN take both built-in and user-defined combination functions
(a small Scatter/Gather sketch follows below)
163
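A small sketch of Scatter/Gather usage (not from the slides; rank and size are assumed to be known and MAXPROCS is a hypothetical upper bound on the number of processes):

    /* root distributes one integer to each rank; ranks modify it; root collects */
    int sendbuf[MAXPROCS], recv, gathered[MAXPROCS];

    if (rank == 0)
        for (int i = 0; i < size; i++) sendbuf[i] = i * 10;

    MPI_Scatter(sendbuf, 1, MPI_INT, &recv, 1, MPI_INT, 0, MPI_COMM_WORLD);
    recv += rank;                         /* local work on the received element */
    MPI_Gather(&recv, 1, MPI_INT, gathered, 1, MPI_INT, 0, MPI_COMM_WORLD);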
![Page 164: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/164.jpg)
Built-In Collective Computation Operations
164
![Page 165: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/165.jpg)
Timing MPI Programs
• MPI_WTIME returns a floating-point number of seconds, representing elapsed wall-clock time since some time in the past:
    double MPI_Wtime( void )
    DOUBLE PRECISION MPI_WTIME( )
• MPI_WTICK returns the resolution of MPI_WTIME in seconds. It returns, as a double-precision value, the number of seconds between successive clock ticks:
    double MPI_Wtick( void )
    DOUBLE PRECISION MPI_WTICK( )
165
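A minimal timing sketch (not from the slides; rank is assumed to be known, and the barrier is optional but makes the measurement comparable across ranks):

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    /* ... work to be timed ... */

    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("elapsed: %f s (clock resolution %g s)\n", t1 - t0, MPI_Wtick());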
![Page 166: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/166.jpg)
166
Extending the Message-Passing Interface
• Dynamic Process Management • Dynamic process startup • Dynamic establishment of connections !
• One-sided communication • Put/get • Other operations !
• Parallel I/O !
• Other MPI-2 features • Generalized requests • Bindings for C++/ Fortran-90; inter-language issues
![Page 167: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/167.jpg)
167
When to use MPI
• Portability and Performance !
• Irregular Data Structures !
• Building Tools for Others • Libraries !
• Need to Manage memory on a per processor basis
![Page 168: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/168.jpg)
168
When not to use MPI
• Number of cores is limited and OpenMP is doing well on that number of cores
• typically 16-32 cores in SMP !
![Page 169: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/169.jpg)
Domain Decomposition
!Break up the domain into blocks.
Assign blocks to MPI-processes one-to-one.
Provide a "map" of neighbors to each process.
Write or modify your code so it only updates a single block.
Insert communication subroutine calls where needed.
Adjust the boundary conditions code.
Use "guard cells".
169
for example: wave equation in a 2D domain.
![Page 170: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/170.jpg)
Heat: How to parallelize with MPI
(Rolf Rabenseifner, “A Heat-Transfer Example with MPI”, Höchstleistungsrechenzentrum Stuttgart)

• Block data decomposition of arrays phi, phin
  – array sizes are variable and depend on the processor number
• Halo (shadow, ghost cells, ...)
  – to store values of the neighbors
  – width of the halo := needs of the numerical formula (heat: 1)
• Communication:
  – fill the halo with values of the neighbors
  – after each iteration (dt)
• Global evaluation of the abort criterion
  – collective operation

Block data decomposition — in which dimensions?

Splitting in
• one dimension:    communication = n^2 * 2w * 1
• two dimensions:   communication = n^2 * 2w * 2 / p^(1/2)
• three dimensions: communication = n^2 * 2w * 3 / p^(2/3)

w = width of halo, n^3 = size of matrix, p = number of processors;
cyclic boundary —> two neighbors in each direction;
3D splitting is optimal for p > 11

-1- Break up domain in blocks

170
![Page 171: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/171.jpg)
-2- Assign blocks to MPI processes
171
[Figure: an nx × nz domain split into 2 × 2 blocks of size nx_block × nz_block:
 block [0, 0] -> MPI 0,  [1, 0] -> MPI 1,  [0, 1] -> MPI 2,  [1, 1] -> MPI 3]

nx_block = nx/2
nz_block = nz/2
![Page 172: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/172.jpg)
-3- Make a map of neighbors
172
lef=10 rig=40 bot=26 top=24
MPI 40
MPI 26 MPI 41
MPI 10
MPI 11
MPI 24
![Page 173: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/173.jpg)
-4- Write code for single MPI block
173
for (ix=0; ix<nx_block; ix++) {
    for (iz=0; iz<nz_block; iz++) {
        vx[ix][iz] -= rox[ix][iz]*(
            c1*(p[ix][iz]   - p[ix-1][iz]) +
            c2*(p[ix+1][iz] - p[ix-2][iz]));
    }
}
![Page 174: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/174.jpg)
-5- Communication calls
• Examine the algorithm • Does it need data from its neighbors? • Yes? => Insert communication calls to get this data !
• Need p[-1][iz] and p[-2][iz] to update vx[0][iz] !
• p[-1][iz] corresponds to p[nx_block-1][iz] from left neighbor • p[-2][iz] corresponds to p[nx_block-2][iz] from left neighbor !
• Get this data from the neighbor with MPI.
174
![Page 175: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/175.jpg)
communication
175
for (ix=0; ix<nx_block; ix++) {
    for (iz=0; iz<nz_block; iz++) {
        vx[ix][iz] -= rox[ix][iz]*(
            c1*(p[ix][iz]   - p[ix-1][iz]) +
            c2*(p[ix+1][iz] - p[ix-2][iz]));
    }
}

[Figure: block of MPI 10 (columns up to i=nx_block-1) next to block of MPI 12 (columns i=0, 1, ...); updating column 0 of MPI 12 needs p[nx_block-1][iz] and p[nx_block-2][iz] from MPI 10]

vx[0][iz] -= rox[0][iz]*(
    c1*(p[0][iz] - p[-1][iz]) +
    c2*(p[1][iz] - p[-2][iz]));
![Page 176: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/176.jpg)
pseudocode
if (my_rank==12) {
    call MPI_Receive(p[-2:-1][iz], .., 10, ..);

    vx[ix][iz] -= rox[ix][iz]*(
        c1*(p[ix][iz]   - p[ix-1][iz]) +
        c2*(p[ix+1][iz] - p[ix-2][iz]));
}

if (my_rank==10) {
    call MPI_Send(p[nx_block-2:nx_block-1][iz], .., 12, ..);
}
176
![Page 177: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/177.jpg)
Update of all blocks [i][j]
177
[Figure: neighboring blocks [i-1][j], [i][j], [i+1][j], [i+2][j]; each block sends its columns nx-1 and nx-2 to its right neighbor and receives them into its guard cells -1 and -2]

mpi_sendrecv(
    p(nx-1:nx-2), 2, .., my_neighbor_right, ...,
    p(-2:-1),     2, .., my_neighbor_left,  ...)
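A C sketch of the same exchange with MPI_Sendrecv (not from the slides; it assumes a halo width of 2, a contiguous p[ix][iz] array laid out so that the guard-cell indices -1 and -2 are addressable, and neighbor ranks my_neighbor_left / my_neighbor_right that are MPI_PROC_NULL at domain edges):

    /* send my two rightmost columns to the right neighbor and
       receive the left neighbor's rightmost columns into my guard cells */
    MPI_Status status;

    MPI_Sendrecv(&p[nx_block-2][0], 2*nz_block, MPI_FLOAT, my_neighbor_right, 0,
                 &p[-2][0],         2*nz_block, MPI_FLOAT, my_neighbor_left,  0,
                 MPI_COMM_WORLD, &status);

Because MPI_Sendrecv pairs the send and the receive internally, the same call works on every block without risking the send/send deadlock discussed earlier.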
![Page 178: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/178.jpg)
-6- Domain boundaries
• What if my block is at a physical boundary? !
• Periodic boundary conditions !
• Non-periodic boundary conditions
178
![Page 179: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/179.jpg)
Periodic Boundary Conditions
179
MPI 25 MPI 38MPI 47 left=38 right=0
MPI 10MPI 0 left=47 right=10
![Page 180: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/180.jpg)
Non-Periodic Boundary Conditions
180
[Figure: row of blocks MPI 25, MPI 38, MPI 47, MPI 0, MPI 10, ...
 MPI 47: left=38, right=MPI_PROC_NULL
 MPI 0:  left=MPI_PROC_NULL, right=10]

1. Avoid sending or receiving data from a non-existent block: MPI_PROC_NULL

    if (my_rank == 12) &
       my_neighbor_right = MPI_PROC_NULL
    .
    call mpi_send(......my_neighbor_right...)
    ! no data is sent by proc. 12

2. Apply boundary conditions.

    if (my_neighbor_right == MPI_PROC_NULL) &
       [ APPLY APPROPRIATE BOUNDARY CONDITIONS ]
![Page 181: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/181.jpg)
Heat: How to parallelize with MPI — Block Data Decomposition
(Rolf Rabenseifner, “A Heat-Transfer Example with MPI”, Höchstleistungsrechenzentrum Stuttgart)

• Block data decomposition
• Cartesian process topology
  – two-dimensional: size = idim * kdim
  – process coordinates: icoord = 0 .. idim-1, kcoord = 0 .. kdim-1
  – see heat-mpi-big.f, lines 18-42
• Static array declaration –> dynamic array allocation
  – phi(0:80,0:80) –> phi(is:ie,ks:ke)
  – additional subroutine “algorithm”
  – see heat-mpi-big.f, lines 186+192
• Computation of is, ie, ks, ke
  – see heat-mpi-big.f, lines 44-98

[Figure: local array phi(is:ie,ks:ke) on the 0..80 grid with halo width b1=1 — an inner area whose values are computed in each iteration, an outer area, and halo values that are read in each iteration]

Heat: Communication Method — Non-blocking

• Communicate the halo in each iteration
• Ranks via MPI_CART_SHIFT:
    call MPI_CART_SHIFT(comm, 0, 1, left, right, ierror)
    call MPI_CART_SHIFT(comm, 1, 1, lower, upper, ierror)
• Non-blocking point-to-point
• Caution: corners must not be sent simultaneously in two directions
• Vertical boundaries:
  – non-contiguous data
  – MPI derived datatypes used
    call MPI_TYPE_VECTOR( kinner, b1, iouter, MPI_DOUBLE_PRECISION, vertical_border, ierror)
    call MPI_TYPE_COMMIT( vertical_border, ierror)
    call MPI_IRECV(phi(is,ks+b1),    1, vertical_border, left,  MPI_ANY_TAG, comm, req(1), ierror)
    call MPI_ISEND(phi(ie-b1,ks+b1), 1, vertical_border, right, 0,           comm, req(3), ierror)
  (analogous calls handle the transfers from left to right, right to left, upper to lower, and lower to upper)

-7- guard cells

• Allocate the local block including overlapping (halo) areas to store values received from neighbor communication.
• This allows tidy coding of your loops

[Figure: guard cells at indices -2 and -1 in front of the local block p[i][iz], i = 0, 1, ...]

181
![Page 182: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/182.jpg)
Exercise: MPI_Cart
• Do I have to make a map of MPI processes myself? !
• You can use MPI_Cart to create a domain decomposition for you. !
• Exercise sets up 2D domain with one • periodic and • non-periodic boundary condition !
• Download code from: http://www.xs4all.nl/~janth/HPCourse/code/MPI_Cart.tar !
• Unpack tar file and check the README for instructions
182
![Page 183: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/183.jpg)
!PE- 6: x=1, y=2: Sum = 32 neighb0rs: left=2 right=10 top=5 bottom=7!
PE- 4: x=1, y=0: Sum = 24 neighbors: left=0 right=8 top=-1 bottom=5!
PE- 2: x=0, y=2: Sum = 32 neighbors: left=14 right=6 top=1 bottom=3!
PE- 7: x=1, y=3: Sum = 36 neighbors: left=3 right=11 top=6 bottom=-1!
PE- 0: x=0, y=0: Sum = 24 neighbors: left=12 right=4 top=-1 bottom=1!
PE- 5: x=1, y=1: Sum = 28 neighbors: left=1 right=9 top=4 bottom=6!
PE- 1: x=0, y=1: Sum = 28 neighbors: left=13 right=5 top=0 bottom=2!
PE- 3: x=0, y=3: Sum = 36 neighbors: left=15 right=7 top=2 bottom=-1!
PE- 14: x=3, y=2: Sum = 32 neighbors: left=10 right=2 top=13 bottom=15!
PE- 9: x=2, y=1: Sum = 28 neighbors: left=5 right=13 top=8 bottom=10!
PE- 8: x=2, y=0: Sum = 24 neighbors: left=4 right=12 top=-1 bottom=9!
PE- 11: x=2, y=3: Sum = 36 neighbors: left=7 right=15 top=10 bottom=-1!
PE- 15: x=3, y=3: Sum = 36 neighbors: left=11 right=3 top=14 bottom=-1!
PE- 10: x=2, y=2: Sum = 32 neighbors: left=6 right=14 top=9 bottom=11!
PE- 12: x=3, y=0: Sum = 24 neighbors: left=8 right=0 top=-1 bottom=13!
PE- 13: x=3, y=1: Sum = 28 neighbors: left=9 right=1 top=12 bottom=14!
answer 16 MPI tasks
183
![Page 184: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/184.jpg)
184
0 4 8 12
1 5 9 13
2 6 10 14
3 7 11 15
![Page 185: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/185.jpg)
Example: 3D-Finite Difference
• 3D domain decomposition to distribute the work • MPI_Cart_create to set up cartesian domain !
• Hide communication of halo-areas with computational work !
• MPI-IO for writing snapshots to disk
• use local and global derived datatypes to exclude the halo areas when writing into the global snapshot file
• Tutorial about domain decomposition: http://www.nccs.nasa.gov/tutorials/mpi_tutorial2/
185
![Page 186: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/186.jpg)
MPI_Comm_size(MPI_COMM_WORLD, &size);

/* divide 3D cube among available processors */
dims[0]=0; dims[1]=0; dims[2]=0;
MPI_Dims_create(size, 3, dims);

npz = 2*halo+nz/dims[0];
npx = 2*halo+nx/dims[1];
npy = 2*halo+ny/dims[2];

/* create cartesian domain decomposition */
MPI_Cart_create(MPI_COMM_WORLD, 3, dims, period, reorder, &COMM_CART);

/* find out neighbors of the rank, MPI_PROC_NULL is a hard boundary of the model */
MPI_Cart_shift(COMM_CART, 1, 1, &left, &right);
MPI_Cart_shift(COMM_CART, 2, 1, &top, &bottom);
MPI_Cart_shift(COMM_CART, 0, 1, &front, &back);

fprintf(stderr, "Rank %d has LR neighbours %d %d FB %d %d TB %d %d\n",
        my_rank, left, right, front, back, top, bottom);
186
![Page 187: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/187.jpg)
187
[Figure: global nx × ny array distributed over px × py processes; each process holds a local lnx × lny block — subarrays for IO, local and global]
![Page 188: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/188.jpg)
188
/* create subarrays (excluding halo areas) to write to file with MPI-IO */
/* data in the local array */
sizes[0]=npz; sizes[1]=npx; sizes[2]=npy;
subsizes[0]=sizes[0]-2*halo;
subsizes[1]=sizes[1]-2*halo;
subsizes[2]=sizes[2]-2*halo;
starts[0]=halo; starts[1]=halo; starts[2]=halo;
MPI_Type_create_subarray(3, sizes, subsizes, starts, MPI_ORDER_C,
                         MPI_FLOAT, &local_array);
MPI_Type_commit(&local_array);

/* data in the global array */
gsizes[0]=nz; gsizes[1]=nx; gsizes[2]=ny;
gstarts[0]=subsizes[0]*coord[0];
gstarts[1]=subsizes[1]*coord[1];
gstarts[2]=subsizes[2]*coord[2];
MPI_Type_create_subarray(3, gsizes, subsizes, gstarts, MPI_ORDER_C,
                         MPI_FLOAT, &global_array);
MPI_Type_commit(&global_array);
![Page 189: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/189.jpg)
189
for (it=0; it<nt; it++) {

    /* copy of halo areas */
    copyHalo(vx, vz, p, npx, npy, npz, halo, leftSend, rightSend,
             frontSend, backSend, topSend, bottomSend);

    /* start a-synchronous communication of halo areas to neighbours */
    exchangeHalo(leftRecv, leftSend, rightRecv, rightSend, 3*halosizex,
                 left, right, &reqRecv[0], &reqSend[0], 0);
    exchangeHalo(frontRecv, frontSend, backRecv, backSend, 3*halosizey,
                 front, back, &reqRecv[2], &reqSend[2], 4);
    exchangeHalo(topRecv, topSend, bottomRecv, bottomSend, 3*halosizez,
                 top, bottom, &reqRecv[4], &reqSend[4], 8);

    /* update of grid values excluding halo areas */
    updateVelocities(vx, vy, vz, p, ro, ixs, ixe, iys, iye, izs, ize, npx, npy, npz);
    updatePressure(vx, vy, vz, p, l2m, ixs, ixe, iys, iye, izs, ize, npx, npy, npz);
![Page 190: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/190.jpg)
190
    /* wait for halo communications to be finished */
    waitForHalo(&reqRecv[0], &reqSend[0]);
    t1 = wallclock_time();
    tcomm += t1-t2;

    newHalo(vx, vz, p, npx, npy, npz, halo, leftRecv, rightRecv,
            frontRecv, backRecv, topRecv, bottomRecv);
    t2 = wallclock_time();
    tcopy += t2-t1;

    /* compute field on the grid of the halo areas */
    updateVelocities(vx, vy, vz, p, ro, ixs, ixe, iys, iye, izs, ize, npx, npy, npz);
    updatePressure(vx, vy, vz, p, l2m, ixs, ixe, iys, iye, izs, ize, npx, npy, npz);
![Page 191: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/191.jpg)
191
    /* write snapshots to file */
    if (time == snapshottime) {

        MPI_Info_create(&fileinfo);
        MPI_Info_set(fileinfo, "striping_factor", STRIPE_COUNT);
        MPI_Info_set(fileinfo, "striping_unit", STRIPE_SIZE);
        MPI_File_delete(filename, MPI_INFO_NULL);

        rc = MPI_File_open(MPI_COMM_WORLD, filename,
                           MPI_MODE_RDWR|MPI_MODE_CREATE, fileinfo, &fh);

        disp = 0;
        rc = MPI_File_set_view(fh, disp, MPI_FLOAT, global_array, "native", fileinfo);

        rc = MPI_File_write_all(fh, p, 1, local_array, status);

        MPI_File_close(&fh);
    }

} /* end of time loop */
![Page 192: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/192.jpg)
Summary
192
![Page 193: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/193.jpg)
MPI Summary!
• MPI Standard widely accepted by vendors and programmers
• MPI implementations available on most modern platforms • Several MPI applications deployed • Several tools exist to trace and tune MPI applications !
• Simple applications use point-to-point and collective communication operations !
• Advanced applications use point-to-point, collective, communicators, datatypes, one-sided, and topology operations
193
![Page 194: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/194.jpg)
Hybrid systems programming hierarchy
Hybrid System
• Pure MPI
• Hybrid MPI/OpenMP — OpenMP within the SMP nodes, MPI across the interconnection network
  – No overlapping computation/communication: MPI only outside of OpenMP code; only the master thread does MPI communication
  – Overlapping computation/communication: MPI inside OpenMP code
• Pure OpenMP — supported by software distributed shared memory
![Page 195: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/195.jpg)
Hybrid OpenMP/MPI
• Natural paradigm for clusters of SMPs
• May offer considerable advantages when application mapping and load balancing is tough
• Benefits with slower interconnection networks (overlapping computation/communication)

‣ Requires work and code analysis to change pure MPI codes
‣ Start with auto-parallelization?
‣ Link shared memory libraries … check various thread/MPI process combinations
‣ Study carefully the underlying architecture

• What is the future of this model? Could it be consolidated in new languages?
• Connection with many-core?
195
![Page 196: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/196.jpg)
196
Overlapping computation/communication: Example (C. Bekas)

[Figure: a 2D discretization mesh distributed by domain decomposition over processes P0–P3, each running many threads; interior points are marked green, boundary points blue]

Suppose we wish to solve a PDE using the Jacobi method: the value of u at each discretization point is given by a certain average among its neighbors, until convergence. Distributing the mesh over SMP clusters by domain decomposition, it is clear that the green (interior) nodes can proceed without any communication, while the blue (boundary) nodes have to communicate first and calculate later.
![Page 197: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/197.jpg)
MPI/OpenMP: Overlapping computation/communication

197

Not only the master but also other threads communicate. Call MPI primitives inside OpenMP code regions.

if (my_thread_id < # ) {
    MPI_... (communicate needed data)
} else {
    /* Perform computations that do not need communication */
    .
    .
}
/* All threads execute code that requires communication */
    .
    .
![Page 198: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/198.jpg)
198
for (k=0; k < MAXITER; k++){
    /* Start parallel region here */
    #pragma omp parallel private(...)
    {
        my_id = omp_get_thread_num();

        if (my_id is given "halo points")
            MPI_SendRecv("From neighboring MPI process");
        else {
            for (i=0; i < /* # allocated points */; i++)
                newval[i] = avg(oldval[i]);
        }

        if (there are still points I need to do)
            for (i=0; i < /* # remaining points */; i++)
                newval[i] = avg(oldval[i]);

        for (i=0; i < /* all_my_points */; i++)
            oldval[i] = newval[i];
    }
    MPI_Barrier(); /* Synchronize all MPI processes here */
}
![Page 199: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/199.jpg)
PGAS
• What is PGAS? !
• How to make use of PGAS as a programmer?
199
![Page 200: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/200.jpg)
Partitioned Global Address Space
(PRACE Winter School 2009, Montse Farreras)

• Explicitly parallel, shared-memory-like programming model
• Globally addressable space
  – Allows programmers to declare and “directly” access data distributed across the machine
• Partitioned address space
  – Memory is logically partitioned between local and remote (a two-level hierarchy)
  – Forces the programmer to pay attention to data locality, by exposing the inherent NUMA-ness of current architectures
• Single Program Multiple Data (SPMD) execution model
  – All threads of control execute the same program
  – Number of threads fixed at startup
  – Newer languages such as X10 escape this model, allowing fine-grain threading
• Different language implementations:
  – UPC (C-based), CoArray Fortran (Fortran-based), Titanium and X10 (Java-based)

200
![Page 201: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/201.jpg)
Partitioned Global Address Space
(PRACE Winter School 2009, Montse Farreras)

• Computation is performed in multiple places.
• A place contains data that can be operated on remotely.
• Data lives in the place it was created, for its lifetime.
• A datum in one place may point to a datum in another place.
• Data-structures (e.g. arrays) may be distributed across many places.
• A place expresses locality.

[Figure: address-space models — shared memory (OpenMP), PGAS (UPC, CAF, X10), and message passing (MPI), each relating processes/threads to memory]

201
![Page 202: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/202.jpg)
Shared Memory (OpenMP)
• Multiple threads share global memory • Most common variant: OpenMP !
• Program loop iterations distributed to threads, more recent task features
• Each thread has a means to refer to private objects within a parallel context
• Terminology • Thread, thread team !
• Implementation • Threads map to user threads running on one SMP node • Extensions to multiple servers not so successful
202
![Page 203: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/203.jpg)
OpenMP
203
memory
threads
![Page 204: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/204.jpg)
OpenMP: work distribution
204
memory
threads
!$OMP PARALLEL DO
do i=1,32
   a(i)=a(i)*2
end do

[threads handle iterations 1-8, 9-16, 17-24, 25-32]
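The same work distribution written as a C sketch (not from the slides):

    #include <omp.h>

    void scale(double *a)
    {
        /* the 32 iterations are divided among the threads of the team,
           e.g. 4 threads -> 8 iterations each */
        #pragma omp parallel for
        for (int i = 0; i < 32; i++)
            a[i] = a[i] * 2.0;
    }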
![Page 205: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/205.jpg)
OpenMP: implementation
205
memory
threads
cpus
process
![Page 206: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/206.jpg)
Message Passing (MPI)
• Participating processes communicate using a message-passing API !
• Remote data can only be communicated (sent or received) via the API. !
• MPI (the Message Passing Interface) is the standard !
• Implementation: • MPI processes map to processes within one SMP node or across
multiple networked nodes • API provides process numbering, point-to-point and collective
messaging operations
206
![Page 207: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/207.jpg)
MPI
207
memory
cpu
processes
memory
cpu
memory
cpu
![Page 208: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/208.jpg)
MPI
208
memory
cpu
process 0
memory
cpu
MPI_Send(a,...,1,…)
process 1
MPI_Recv(a,...,0,…)
![Page 209: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/209.jpg)
Partitioned Global Address Space
• Shortened to PGAS !
• Participating processes/threads have access to local memory via standard program mechanisms !
• Access to remote memory is directly supported by the PGAS language
209
![Page 210: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/210.jpg)
PGAS
210
memory
cpu
process
memory
cpu
memory
cpu
process process
![Page 211: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/211.jpg)
Partitioned Global Address Space (PGAS):
• Global address space – any process can address memory on any processor
• Partitioned GAS – retain information about locality
• Core idea – the hardest part of writing parallel code is managing data distribution and communication; make that simple and explicit
• PGAS languages try to simplify parallel programming (increase programmer productivity).

PGAS languages
![Page 212: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/212.jpg)
Data Parallel Languages
• Unified Parallel C (UPC) is an extension of the C programming language designed for high performance computing on large-scale parallel machines. http://upc.lbl.gov/ !
• Co-array Fortran (CAF) is part of Fortran 2008 standard. It is a simple, explicit notation for data decomposition, such as that often used in message-passing models, expressed in a natural Fortran-like syntax. http://www.co-array.org !
• both need a global address space (which is not equal to SMP)
212
![Page 213: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/213.jpg)
UPC

• Unified Parallel C:
  – An extension of C99
  – An evolution of AC, PCP, and Split-C
• Features
  – SPMD parallelism via replication of threads
  – Shared and private address spaces
  – Multiple memory consistency models
• Benefits
  – Global view of data
  – One-sided communication
![Page 214: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/214.jpg)
Co-Array Fortran

• Co-Array Fortran:
  – An extension of Fortran 95 and part of “Fortran 2008”
  – The language formerly known as F--
• Features
  – SPMD parallelism via replication of images
  – Co-arrays for distributed shared data
• Benefits
  – Syntactically transparent communication
  – One-sided communication
  – Multi-dimensional arrays
  – Array operations
![Page 215: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/215.jpg)
Basic execution model co-array F--
• Program executes as if replicated to multiple copies with each copy executing asynchronously (SPMD) !
• Each copy (called an image) executes as a normal Fortran application !
• New object indexing with [] can be used to access objects on other images. !
• New features to inquire about image index, number of images and to synchronize
215
![Page 216: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/216.jpg)
CAF (F--)
REAL, DIMENSION(N)[*] :: X,Y
X(:) = Y(:)[Q]

Array indices in parentheses follow the normal Fortran rules within one memory image.
Array indices in square brackets provide an equally convenient notation for accessing objects across images and follow similar rules.
216
![Page 217: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/217.jpg)
Coarray execution model
217
memory
cpu
Image 1
memory
cpu
memory
cpu
Image 2 Image 3
Remote access with square bracket indexing: a(:)[2]
coarrays
![Page 218: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/218.jpg)
Basic coarray declaration and usage

218

integer :: b
integer :: a(4)[*]   ! coarray

[Figure: initial state — image 1: a = (1 8 1 5), b = 1; image 2: a = (1 7 9 9), b = 3; image 3: a = (1 7 9 4), b = 6]

b = a(2)
![Page 219: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/219.jpg)
Basic coarray declaration and usage

219

integer :: b
integer :: a(4)[*]   ! coarray

b = a(2)

[Figure: each image reads its own a(2) — image 1: a = (1 8 1 5), b = 8; image 2: a = (1 7 9 9), b = 7; image 3: a = (1 7 9 4), b = 7]
![Page 220: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/220.jpg)
Basic coarray declaration and usage

220

integer :: b
integer :: a(4)[*]   ! coarray

b = a(4)[3]

[Figure: before the assignment — image 1: a = (1 8 1 5), b = 1; image 2: a = (1 7 9 9), b = 3; image 3: a = (1 7 9 4), b = 6]
![Page 221: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/221.jpg)
Basic coarray declaration and usage

221

integer :: b
integer :: a(4)[*]   ! coarray

b = a(4)[3]

[Figure: after the assignment every image has read a(4) from image 3 — image 1: b = 4; image 2: b = 4; image 3: b = 4]
![Page 222: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/222.jpg)
222
![Page 223: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/223.jpg)
Performance Tuning

[Figure: speedup S versus number of processors p for a first attempt, a second attempt, and the ideal curve]

Performance tuning of parallel applications is an iterative process.
223
![Page 224: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/224.jpg)
Compiling MPI Programs
224
![Page 225: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/225.jpg)
Compiling and Starting MPI Jobs

• Compiling:
  • Need to link with the appropriate MPI and communication subsystem libraries and set the path to the MPI include files
  • Most vendors provide scripts or wrappers for this (mpxlf, mpif77, mpicc, etc.)

• Starting jobs:
  • Most implementations use a special loader named mpirun
      mpirun -np <no_of_processors> <prog_name>
  • In MPI-2 it is recommended to use
      mpiexec -n <no_of_processors> <prog_name>
225
![Page 226: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/226.jpg)
226
MPICH: a Portable MPI Environment
• MPICH is a high-performance portable implementation of MPI (both 1 and 2). !
• It runs on MPP's, clusters, and heterogeneous networks of workstations. !
• The CH comes from Chameleon, the portability layer used in the original MPICH to provide portability to the existing message-passing systems. !
• In a wide variety of environments, one can do:
    mpicc -mpitrace myprog.c
    mpirun -np 10 myprog
    upshot myprog.log

  to build, compile, run, and analyze performance.
![Page 227: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/227.jpg)
MPICH2
MPICH2 is an all-new implementation of the MPI Standard, designed to implement all of the MPI-2 additions to MPI. !
‣ separation between process management and communications ‣ use daemons (mpd) on nodes ‣ dynamic process management, ‣ one-sided operations, ‣ parallel I/O, and others !
!!
• http://www.mcs.anl.gov/research/projects/mpich2/
227
![Page 228: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/228.jpg)
228
Compiling MPI programs
• From a command line:
    mpicc -o prog prog.c

• Use profiling options (specific to mpich)
  • -mpilog     Generate log files of MPI calls
  • -mpitrace   Trace execution of MPI calls
  • -mpianim    Real-time animation of MPI (not available on all systems)
  • --help      Find list of available options

• The environment variables MPICH_CC, MPICH_CXX, MPICH_F77, and MPICH_F90 may be used to specify alternate C, C++, Fortran 77, and Fortran 90 compilers, respectively.
![Page 229: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/229.jpg)
229
example hello.c
#include "mpi.h"!#include <stdio.h>!int main(int argc ,char *argv[])!{!! int myrank;!! MPI_Init(&argc, &argv);!! MPI_Comm_rank(MPI_COMM_WORLD, &myrank);!! fprintf(stdout, "Hello World, I am process %d
\n", myrank);!! MPI_Finalize();!! return 0;!}
![Page 230: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/230.jpg)
230
Example: using MPICH1
-bash-3.1$ mpicc -o hello hello.c
-bash-3.1$ mpirun -np 4 hello
Hello World, I am process 0
Hello World, I am process 2
Hello World, I am process 3
Hello World, I am process 1
![Page 231: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/231.jpg)
231
Example: details
• If using frontend and compute nodes in the machines file use
    mpirun -np 2 -machinefile machines hello

• If using only compute nodes in the machines file use
    mpirun -nolocal -np 2 -machinefile machines hello

• -nolocal – don’t start the job on the frontend
• -np 2 – start the job on 2 nodes
• -machinefile machines – nodes are specified in the machines file
• hello – start program hello
![Page 232: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/232.jpg)
232
Notes on clusters
• Make sure you have access to the compute nodes (ssh keys are generated with ssh-keygen) and ask your system administrator.

• Check which mpicc you are using:
  • $ which mpicc

• Command line arguments are not always passed on by mpirun/mpiexec (depending on your version). In that case make a script which calls your program with all its arguments.
![Page 233: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/233.jpg)
MPICH2 daemons

• mpdtrace: output a list of nodes on which you can run MPI programs (nodes running mpd daemons).
  ‣ The -l option lists full hostnames and the port where the mpd is listening.
• mpd starts an mpd daemon.
• mpdboot starts a set of mpd’s on a list of machines.
• mpdlistjobs lists the jobs that the mpd’s are running.
• mpdkilljob kills a job specified by the name returned by mpdlistjobs.
• mpdsigjob delivers a signal to the named job.
233
![Page 234: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/234.jpg)
MPI - Message Passing Interface
• MPI or MPI-1 is a library specification for message-passing. !
• MPI-2: Adds in Parallel I/O, Dynamic Process management, Remote Memory Operation, C++ & F90 extension … !
• MPI Standard: http://www-unix.mcs.anl.gov/mpi/standard.html
!• MPI Standard 1.1 Index:
http://www.mpi-forum.org/docs/mpi-11-html/node182.html !
• MPI-2 Standard Index: http://www-unix.mcs.anl.gov/mpi/mpi-standard/mpi-report-2.0/node306.htm !
• MPI Forum Home Page: http://www.mpi-forum.org/index.html
![Page 235: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/235.jpg)
MPI tutorials
• http://www.nccs.nasa.gov/tutorials/mpi_tutorial2/mpi_II_tutorial.html !
• https://fs.hlrs.de/projects/par/par_prog_ws/
235
![Page 236: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/236.jpg)
236
MPI Sources
• The Standard (3.0) itself:
  • at http://www.mpi-forum.org
  • all MPI official releases, in both PDF and HTML
• Books:
  – Using MPI: Portable Parallel Programming with the Message-Passing Interface, by Gropp, Lusk, and Skjellum, MIT Press, 1994.
  – Parallel Programming with MPI, by Peter Pacheco, Morgan-Kaufmann, 1997.
• Other information on the Web:
  • at http://www.mcs.anl.gov/mpi
  • pointers to lots of material, including other talks and tutorials, a FAQ, and other MPI pages
  • http://mpi.deino.net/mpi_functions/index.htm
![Page 237: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/237.jpg)
Job Management and queuing
237
![Page 238: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/238.jpg)
Job Management and queuing
• On a large system many users are running jobs simultaneously.
• What to do when:
  • the system is full and you want to run your 512-CPU job?
  • you want to run 16 jobs; should others wait for those?
  • should bigger jobs have priority over smaller jobs?
  • should longer jobs have lower/higher priority?
• The job manager and queue system take care of this.
238
![Page 239: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/239.jpg)
MPICH1 in PBS

#!/bin/bash
#PBS -N Flank
#PBS -V
#PBS -l nodes=11:ppn=2
#PBS -j eo

cd $PBS_O_WORKDIR
export nb=`wc -w < $PBS_NODEFILE`
echo $nb

mpirun -np $nb -machinefile $PBS_NODEFILE ~/bin/migr_mpi \
    file_vel=$PBS_O_WORKDIR/grad_salt_rot.su \
    file_shot=$PBS_O_WORKDIR/cfp2_sx.su
239
![Page 240: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/240.jpg)
Job Management and queuing
• PBS (Maui scheduler):
  http://www.clusterresources.com/pages/products/maui-cluster-scheduler.php
• Torque resource manager:
  http://www.clusterresources.com/pages/products/torque-resource-manager.php
• Sun Grid Engine:
  http://gridengine.sunsource.net/
• SLURM:
  http://slurm.schedmd.com
• Luckily their interfaces are very similar (qsub, qstat, ...).
240
![Page 241: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/241.jpg)
queuing commands
• qsub
• qstat
• qdel
• xpbsmon
![Page 242: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/242.jpg)
qsub
#PBS -l nodes=10:ppn=1
#PBS -l mem=20mb
#PBS -l walltime=1:00:00
#PBS -j eo

submit:
qsub -q normal job.scr

output:
jobname.ejobid
jobname.ojobid
![Page 243: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/243.jpg)
qstat
• available queues and resources:
  qstat -q
• queued and running jobs:
  qstat (-a)
243
![Page 244: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/244.jpg)
qdel
• deletes a job from the queue and stops all running executables:
  qdel jobid
244
![Page 245: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/245.jpg)
Submitting many jobs in one script

#!/bin/bash -f
#
export xsrc1=93
export xsrc2=1599
export dxsrc=3
xsrc=$xsrc1

while (( xsrc <= xsrc2 ))
do
echo ' modeling shot at x=' $xsrc

cat << EOF > jobs/pbs_${xsrc}.job
#!/bin/bash
#
#PBS -N fahud_${xsrc}
#PBS -q verylong
#PBS -l nodes=1:ppn=1
#PBS -V

program with arguments

EOF

qsub jobs/pbs_${xsrc}.job
(( xsrc = $xsrc + $dxsrc ))
done
245
Be careful!
![Page 246: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/246.jpg)
246
![Page 247: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/247.jpg)
Exercise: PI with MPI and OpenMP
• Compile and run the code for computing PI in parallel.
• Download the code from: http://www.xs4all.nl/~janth/HPCourse/code/MPI_pi.tar
• Unpack the tar file and check the README for instructions.
• It is assumed that you use MPICH1 and have PBS installed as a job scheduler. If you have MPICH2, let me know.
• Use qsub to submit the MPI job to a queue (an example script is provided).
247
![Page 248: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/248.jpg)
Exercise: PI with MPI and OpenMP
248
| cores | OpenMP | marken (MPI) | linux (MPI) |
|------:|-------:|-------------:|------------:|
| 1  | 9.617728 | 14.10798 | 22.15252 |
| 2  | 4.874539 | 7.071287 | 9.661745 |
| 4  | 2.455036 | 3.532871 | 5.730912 |
| 6  | 1.627149 | 2.356928 | 3.547961 |
| 8  | 1.214713 | 1.832055 | 2.804715 |
| 12 | 0.820746 | 1.184123 | |
| 16 | 0.616482 | 0.955704 | |
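(The numbers are apparently run times, likely in seconds; the speedups in the figure below would then follow as T(1 core) / T(N cores), e.g. OpenMP on 16 cores: 9.617728 / 0.616482 ≈ 15.6.)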
![Page 249: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/249.jpg)
[Figure: "Pi Scaling" plot of speedup versus number of CPUs (1 to 16), with curves for Amdahl 1.0, OpenMP (Barcelona 2.2 GHz), MPI marken (Xeon 3.1 GHz), and MPI linux (Xeon 2.4 GHz).]
Exercise: PI with MPI and OpenMP
249
![Page 250: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/250.jpg)
Exercise: OpenMP Max
• Compile and run the code for computing PI in parallel.
• Download the code from: http://www.xs4all.nl/~janth/HPCourse/code/MPI_pi.tar
• Unpack the tar file and check the README for instructions.
• The exercise focuses on using the available number of cores in an efficient way.
• Also inspect the code and see how the reductions are done; is there another way of doing the reductions? (See the sketch below.)
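As a hint for the last question: a reduction can be written with OpenMP's reduction clause, or by merging per-thread partial results in a critical section. A small sketch of both variants for a maximum, unrelated to the actual exercise code (the max operator in reduction requires OpenMP 3.1 or later):

```c
#include <omp.h>

/* variant 1: let OpenMP's reduction clause combine the per-thread maxima */
double max_reduction(const double *a, int n)
{
    double m = a[0];
    #pragma omp parallel for reduction(max:m)
    for (int i = 1; i < n; i++)
        if (a[i] > m) m = a[i];
    return m;
}

/* variant 2: each thread keeps a local maximum, merged in a critical section */
double max_critical(const double *a, int n)
{
    double m = a[0];
    #pragma omp parallel
    {
        double local = a[0];
        #pragma omp for nowait
        for (int i = 1; i < n; i++)
            if (a[i] > local) local = a[i];
        #pragma omp critical
        if (local > m) m = local;
    }
    return m;
}
```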
250
![Page 251: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/251.jpg)
Exercise: OpenMP details
• More details on using OpenMP and shared-memory parallelisation.
• Download the code from: http://www.xs4all.nl/~janth/HPCourse/code/OpenMP.tar
• Unpack the tar file and check the README for instructions.
• These are ‘old’ exercises from SGI and give insight into problems you can encounter when using OpenMP.
• They already require some knowledge of OpenMP; the OpenMP F-Summary.pdf from the website can be helpful.
251
![Page 252: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/252.jpg)
252
Extend the Exercises
• Modify the pi program to use send/receive instead of bcast/reduce.
• Write a program that sends a message around a ring. That is, process 0 reads a line from the terminal and sends it to process 1, who sends it to process 2, etc. The last process sends it back to process 0, who prints it.
• Time programs with MPI_WTIME().
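For the last point: in C the function is MPI_Wtime(), which returns wall-clock time in seconds as a double (MPI_WTIME is the Fortran spelling). A minimal usage sketch:

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    t0 = MPI_Wtime();
    /* ... the work you want to time ... */
    t1 = MPI_Wtime();

    printf("rank %d: elapsed %f seconds\n", rank, t1 - t0);
    MPI_Finalize();
    return 0;
}
```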
![Page 253: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/253.jpg)
END
253
![Page 254: 07_ParallelProgramming](https://reader036.vdocuments.site/reader036/viewer/2022062421/55cf8f77550346703b9ca454/html5/thumbnails/254.jpg)
254