TRANSCRIPT
[Source: nkl.cc.u-tokyo.ac.jp/16w/04-pFEM/pFEM3D-OMP.pdf]
3D Parallel FEM (IV): (OpenMP + MPI) Hybrid Parallel Programming Model
Kengo Nakajima
Information Technology Center
Technical & Scientific Computing II (4820-1028)
Seminar on Computer Science II (4810-1205)
Hybrid Distributed Parallel Computing (3747-111)
Hybrid Parallel Programming Model
• Message Passing (e.g. MPI) + Multi-Threading (e.g. OpenMP, CUDA, OpenCL, OpenACC, etc.)
• On the K computer and FX10, hybrid parallel programming is recommended
  – MPI + automatic parallelization by Fujitsu's compiler
    • Personally, I do not like to call this "hybrid" !!!
• Expectations for Hybrid
  – Number of MPI processes (and sub-domains) to be reduced
  – O(10^8)–O(10^9)-way MPI might not scale on exascale systems
  – Easily extended to heterogeneous architectures
    • CPU+GPU, CPU+manycores (e.g. Intel MIC/Xeon Phi)
    • MPI+X: OpenMP, OpenACC, CUDA, OpenCL
Flat MPI vs. Hybrid
• Hybrid: hierarchical structure
• Flat MPI: each core -> independent
[Figure: cores attached to memories; in Hybrid, the cores sharing a memory are driven by threads inside one MPI process, while in Flat MPI every core runs its own MPI process]
Background
• Multicore/manycore processors
  – Low power consumption, various types of programming models
• OpenMP
  – Directive-based, (seems to be) easy
  – Many books
• Data dependency (S1/S2 semester)
  – Conflicts of reading from / writing to memory
  – Appropriate reordering of data is needed for "consistent" parallel computing
  – No detailed information in OpenMP books: very complicated
• OpenMP/MPI hybrid parallel programming model for multicore/manycore clusters
SMP
[Figure: multiple CPUs connected to a single shared MEMORY]
• SMP
  – Symmetric Multi-Processors
  – Multiple CPUs (cores) share a single memory space
What is OpenMP?  http://www.openmp.org
• An API for multi-platform shared-memory parallel programming in C/C++ and Fortran
  – Current version: 4.0
• Background
  – Merger of Cray and SGI in 1996
  – ASCI project (DOE) started
• The C/C++ and Fortran versions were developed separately until ver. 2.5.
• Fork-join parallel execution model
• Users have to specify everything by directives.
  – Nothing happens if there are no directives.
Fork-Join Parallel Execution Model
[Figure: a master thread forks a team of threads at each PARALLEL (fork), the team runs the region, and the threads join back into the master at END PARALLEL (join); the fork/join cycle repeats for each successive parallel region]
Number of Threads
• OMP_NUM_THREADS
  – How to change?
    • bash (.bashrc):  export OMP_NUM_THREADS=8
    • csh (.cshrc):    setenv OMP_NUM_THREADS 8
Information about OpenMP
• OpenMP Architecture Review Board (ARB)
  – http://www.openmp.org
• References
  – Chandra, R. et al., "Parallel Programming in OpenMP" (Morgan Kaufmann)
  – Quinn, M.J., "Parallel Programming in C with MPI and OpenMP" (McGraw-Hill)
  – Mattson, T.G. et al., "Patterns for Parallel Programming" (Addison-Wesley)
  – Ushijima, "Parallel Programming and Numerical Computation with OpenMP" (Maruzen, in Japanese)
  – Chapman, B. et al., "Using OpenMP" (MIT Press)
• Japanese version of the OpenMP 3.0 spec. (Fujitsu etc.)
  – http://www.openmp.org/mp-documents/OpenMP30spec-ja.pdf
Features of OpenMP
• Directives
  – The loop right after a directive is parallelized.
  – If the compiler does not support OpenMP, directives are treated as just comments.
OpenMP Directives: Array Operations

Simple substitution:

!$omp parallel do
do i= 1, NP
  W(i,1)= 0.d0
  W(i,2)= 0.d0
enddo
!$omp end parallel do

Dot products:

!$omp parallel do private(iS,iE,i)
!$omp& reduction(+:RHO)
do ip= 1, PEsmpTOT
  iS= STACKmcG(ip-1) + 1
  iE= STACKmcG(ip  )
  do i= iS, iE
    RHO= RHO + W(i,R)*W(i,Z)
  enddo
enddo
!$omp end parallel do

DAXPY:

!$omp parallel do
do i= 1, NP
  Y(i)= ALPHA*X(i) + Y(i)
enddo
!$omp end parallel do
OpenMP Directives: Matrix-Vector Products

!$omp parallel do private(ip,iS,iE,i,j)
do ip= 1, PEsmpTOT
  iS= STACKmcG(ip-1) + 1
  iE= STACKmcG(ip  )
  do i= iS, iE
    W(i,Q)= D(i)*W(i,P)
    do j= 1, INL(i)
      W(i,Q)= W(i,Q) + W(IAL(j,i),P)
    enddo
    do j= 1, INU(i)
      W(i,Q)= W(i,Q) + W(IAU(j,i),P)
    enddo
  enddo
enddo
!$omp end parallel do
Features of OpenMP
• Directives
  – The loop right after a directive is parallelized.
  – If the compiler does not support OpenMP, directives are treated as just comments.
• Nothing happens without explicit directives
  – Different from "automatic parallelization/vectorization"
  – Improper usage may produce wrong results
  – Data configuration, ordering etc. are the user's responsibility
• "Threads" are created according to the number of cores on the node
  – Thread: corresponds to "process" in MPI
  – Generally, "# threads = # cores"; Xeon Phi supports 4 threads per core (hardware multithreading)
Memory Contention
[Figure: multiple CPUs sharing a single MEMORY]
• During a complicated process, multiple threads may simultaneously try to update data at the same memory address.
  – e.g. multiple cores update a single component of an array.
  – This situation is possible.
  – Answers may change compared to serial cases with a single core (thread).
Memory Contention (cont.)
[Figure: multiple CPUs sharing a single MEMORY]
• In this lecture, such cases are avoided by reordering etc.
  – In OpenMP, users are responsible for such issues (e.g. proper data configuration, reordering etc.)
• Generally speaking, performance per core decreases as the number of cores used (thread count) increases.
  – Memory access performance: STREAM benchmark
Features of OpenMP (cont.)
• "!$omp parallel do" – "!$omp end parallel do"
• Global (shared) variables, private variables
  – Default: global (shared)
  – Dot products: reduction

W(:,:), R, Z, PEsmpTOT: global (shared)

!$omp parallel do private(iS,iE,i)
!$omp& reduction(+:RHO)
do ip= 1, PEsmpTOT
  iS= STACKmcG(ip-1) + 1
  iE= STACKmcG(ip  )
  do i= iS, iE
    RHO= RHO + W(i,R)*W(i,Z)
  enddo
enddo
!$omp end parallel do
FORTRAN & C

C:

#include <omp.h>
...
{
#pragma omp parallel for default(none) shared(n,x,y) private(i)
  for (i=0; i<n; i++)
    x[i] += y[i];
}

Fortran:

use omp_lib
...
!$omp parallel do shared(n,x,y) private(i)
do i= 1, n
  x(i)= x(i) + y(i)
enddo
!$omp end parallel do
In this class ...
• OpenMP has many capabilities.
• In this class, only the few functions needed for parallelizing the parallel FEM code are shown.
First things to be done (after OpenMP 3.0)
• use omp_lib        (Fortran)
• #include <omp.h>   (C)
OpenMP Directives (Fortran)
• No distinction between upper and lower case.
• sentinel
  – Fortran: !$OMP, C$OMP, *$OMP (!$OMP only for free format)
  – Continuation lines (same rule as the Fortran compiler is applied)

sentinel directive_name [clause[[,] clause]…]

• Example for !$OMP PARALLEL DO SHARED(A,B,C)

!$OMP PARALLEL DO
!$OMP+SHARED (A,B,C)

!$OMP PARALLEL DO &
!$OMP SHARED (A,B,C)
OpenMP Directives (C)
• "\" for continuation lines
• Lower case only (except names of variables)

#pragma omp directive_name [clause[[,] clause]…]

#pragma omp parallel for shared (a,b,c)
PARALLEL DO
• Parallelizes DO/for loops
• Examples of "clause"
  – PRIVATE(list)
  – SHARED(list)
  – DEFAULT(PRIVATE|SHARED|NONE)
  – REDUCTION({operation|intrinsic}: list)

!$OMP PARALLEL DO [clause[[,] clause] … ]
(do_loop)
!$OMP END PARALLEL DO

#pragma omp parallel for [clause[[,] clause] … ]
(for_loop)
REDUCTION
• Similar to "MPI_Reduce"
• Operators
  – +, *, -, .AND., .OR., .EQV., .NEQV.
• Intrinsics
  – MAX, MIN, IAND, IOR, IEOR

REDUCTION ({operator|intrinsic}: list)
reduction ({operator|intrinsic}: list)
Example-1: A Simple Loop

!$OMP PARALLEL DO
do i= 1, N
  B(i)= (A(i) + B(i)) * 0.50
enddo
!$OMP END PARALLEL DO

• The default status of a loop variable ("i" in this case) is private, so an explicit declaration is not needed.
• "END PARALLEL DO" is not required.
  – In C, there is no "end parallel for".
Example-1: REDUCTION

!$OMP PARALLEL DO DEFAULT(PRIVATE) REDUCTION(+:A,B)
do i= 1, N
  call WORK (Alocal, Blocal)
  A= A + Alocal
  B= B + Blocal
enddo
!$OMP END PARALLEL DO

• "END PARALLEL DO" is not required.
Functions which can be used with OpenMP

Name                                          Function
int omp_get_num_threads (void)                Total thread #
int omp_get_thread_num (void)                 Thread ID
double omp_get_wtime (void)                   = MPI_Wtime
void omp_set_num_threads (int num_threads)    Setting thread #
call omp_set_num_threads (num_threads)
OpenMP for Dot Products

VAL= 0.d0
do i= 1, N
  VAL= VAL + W(i,R) * W(i,Z)
enddo
OpenMP for Dot Products

VAL= 0.d0
do i= 1, N
  VAL= VAL + W(i,R) * W(i,Z)
enddo

VAL= 0.d0
!$OMP PARALLEL DO PRIVATE(i) REDUCTION(+:VAL)
do i= 1, N
  VAL= VAL + W(i,R) * W(i,Z)
enddo
!$OMP END PARALLEL DO

Directives are just inserted.
OpenMP for Dot Products

VAL= 0.d0
do i= 1, N
  VAL= VAL + W(i,R) * W(i,Z)
enddo

VAL= 0.d0
!$OMP PARALLEL DO PRIVATE(i) REDUCTION(+:VAL)
do i= 1, N
  VAL= VAL + W(i,R) * W(i,Z)
enddo
!$OMP END PARALLEL DO

Directives are just inserted.

VAL= 0.d0
!$OMP PARALLEL DO PRIVATE(ip,i) REDUCTION(+:VAL)
do ip= 1, PEsmpTOT
  do i= index(ip-1)+1, index(ip)
    VAL= VAL + W(i,R) * W(i,Z)
  enddo
enddo
!$OMP END PARALLEL DO

Multiple loop; PEsmpTOT: number of threads.
An additional array INDEX(:) is needed. Efficiency is not necessarily good, but users can assign each component of the data to a specific thread.
OpenMP for Dot Products

VAL= 0.d0
!$OMP PARALLEL DO PRIVATE(ip,i) REDUCTION(+:VAL)
do ip= 1, PEsmpTOT
  do i= index(ip-1)+1, index(ip)
    VAL= VAL + W(i,R) * W(i,Z)
  enddo
enddo
!$OMP END PARALLEL DO

e.g. N=100, PEsmpTOT=4:
INDEX(0)= 0
INDEX(1)= 25
INDEX(2)= 50
INDEX(3)= 75
INDEX(4)= 100

Multiple loop; PEsmpTOT: number of threads.
An additional array INDEX(:) is needed. Efficiency is not necessarily good, but users can assign each component of the data to a specific thread.
NOT good for GPUs.
Matrix-Vector Multiply

do i = 1, N
  VAL= D(i)*W(i,P)
  do k= indexL(i-1)+1, indexL(i)
    VAL= VAL + AL(k)*W(itemL(k),P)
  enddo
  do k= indexU(i-1)+1, indexU(i)
    VAL= VAL + AU(k)*W(itemU(k),P)
  enddo
  W(i,Q)= VAL
enddo
Matrix-Vector Multiply

!$omp parallel do private(ip,i,VAL,k)
do ip= 1, PEsmpTOT
  do i = INDEX(ip-1)+1, INDEX(ip)
    VAL= D(i)*W(i,P)
    do k= indexL(i-1)+1, indexL(i)
      VAL= VAL + AL(k)*W(itemL(k),P)
    enddo
    do k= indexU(i-1)+1, indexU(i)
      VAL= VAL + AU(k)*W(itemU(k),P)
    enddo
    W(i,Q)= VAL
  enddo
enddo
!$omp end parallel do
Matrix-Vector Multiply: Other Approach
This is rather better for GPU and (very) many-core architectures: simpler loop structure.

!$omp parallel do private(i,VAL,k)
do i = 1, N
  VAL= D(i)*W(i,P)
  do k= indexL(i-1)+1, indexL(i)
    VAL= VAL + AL(k)*W(itemL(k),P)
  enddo
  do k= indexU(i-1)+1, indexU(i)
    VAL= VAL + AU(k)*W(itemU(k),P)
  enddo
  W(i,Q)= VAL
enddo
!$omp end parallel do
omp parallel (do)
• Each "omp parallel" – "omp end parallel" pair starts & stops threads: fork-join
• If you have many loops, these thread operations can become an overhead
• omp parallel + omp do / omp for

C:

#pragma omp parallel
{
  ...
  #pragma omp for
  ...
  #pragma omp for
  ...
}

Fortran:

!$omp parallel
...
!$omp do
do i= 1, N
...
!$omp do
do i= 1, N
...
!$omp end parallel   (required)
Exercise !!
• Apply multi-threading by OpenMP to the parallel FEM code using MPI
  – CG solver (solver_CG, solver_SR)
  – Matrix assembly (mat_ass_main, mat_ass_bc)
• Hybrid parallel programming model
• Evaluate the effects of
  – problem size, parallel programming model, thread #
OpenMP (Only Solver) (F/C)

>$ cd <$O-TOP>/pfem3d/src1
>$ make
>$ cd ../run
>$ ls sol1
sol1

>$ cd ../pmesh
<Parallel Mesh Generation>

>$ cd ../run
<modify go1.sh>
>$ pjsub go1.sh
Makefile (Fortran)

F90 = mpiifort
F90LINKER = $(F90)
LIB_DIR =
INC_DIR =
OPTFLAGS = -O3 -xCORE-AVX2 -align array32byte -qopenmp
FFLAGS = $(OPTFLAGS)
FLIBS =
F90LFLAGS =
#
TARGET = ../run/sol1
default: $(TARGET)
OBJS = \
pfem_util.o …

$(TARGET): $(OBJS)
	$(F90LINKER) $(OPTFLAGS) -o $(TARGET) $(OBJS) $(F90LFLAGS)

clean:
	/bin/rm -f *.o $(TARGET) *~ *.mod

.f.o:
	$(F90) $(FFLAGS) $(INC_DIR) -c $*.f

.f90.o:
	$(F90) $(FFLAGS) $(INC_DIR) -c $*.f90

.SUFFIXES: .f90 .f
Makefile (C)

CC = mpiicc
LIB_DIR =
INC_DIR =
OPTFLAGS = -O3 -xCORE-AVX2 -align -qopenmp
LIBS =
LFLAGS =
#
TARGET = ../run/sol1
default: $(TARGET)
OBJS = \
test1.o \
...

$(TARGET): $(OBJS)
	$(CC) $(OPTFLAGS) -o $@ $(OBJS) $(LFLAGS)

.c.o:
	$(CC) $(OPTFLAGS) -c $*.c

clean:
	/bin/rm -f *.o $(TARGET) *~ *.mod
HB M x N
• M: number of OpenMP threads per single MPI process
• N: number of MPI processes per single "socket"
[Figure: Socket #0, Socket #1]
4 nodes / 8 sockets: 128 MPI processes
Flat MPI, 32 MPI processes/node
[Figure: Node#0–Node#3, each with Socket #0 / Socket #1; 16 of 18 cores used per socket]

mesh.inp:
256 128 64
16 8 1
pcube

inp_kmetis:
cube.02
128
pcube

select=4:mpiprocs=32
I_MPI_PERHOST=32

inp_mg:
256 128 64
Flat MPI: 16 MPI Processes/Socket

#!/bin/sh
#PBS -q u-lecture
#PBS -N hybrid
#PBS -l select=4:mpiprocs=32    # node #, MPI proc. #/node
#PBS -Wgroup_list=gt16
#PBS -l walltime=00:05:00
#PBS -e err
#PBS -o test.lst

cd $PBS_O_WORKDIR
. /etc/profile.d/modules.sh

export I_MPI_PIN_DOMAIN=socket
export I_MPI_PERHOST=32         # MPI proc. #/node

mpirun ./impimap.sh ./sol
4 nodes: 16 threads x 8 MPI processes
HB 16x1, 2 MPI processes/node
[Figure: Node#0–Node#3, each with Socket #0 / Socket #1; 16 of 18 cores used per socket]

mesh.inp:
256 128 64
4 2 1
pcube

inp_kmetis:
cube.02
8
pcube

select=4:mpiprocs=2
I_MPI_PERHOST=2
OMP_NUM_THREADS=16

inp_mg:
256 128 64
HB 16x1

#!/bin/sh
#PBS -q u-lecture
#PBS -N hybrid
#PBS -l select=4:mpiprocs=2     # node #, MPI proc. #/node
#PBS -Wgroup_list=gt16
#PBS -l walltime=00:05:00
#PBS -e err
#PBS -o test.lst

cd $PBS_O_WORKDIR
. /etc/profile.d/modules.sh

export OMP_NUM_THREADS=16       # thread #/MPI process
export I_MPI_PIN_DOMAIN=socket
export I_MPI_PERHOST=2          # MPI proc. #/node

mpirun ./impimap.sh ./sol1
4 nodes: 8 threads x 16 MPI processes
HB 8x2, 4 MPI processes/node
[Figure: Node#0–Node#3, each with Socket #0 / Socket #1; 16 of 18 cores used per socket]

mesh.inp:
256 128 64
4 4 1
pcube

inp_kmetis:
cube.02
16
pcube

select=4:mpiprocs=4
I_MPI_PERHOST=4
OMP_NUM_THREADS=8

inp_mg:
256 128 64
HB 8x2

#!/bin/sh
#PBS -q u-lecture
#PBS -N hybrid
#PBS -l select=4:mpiprocs=4     # node #, MPI proc. #/node
#PBS -Wgroup_list=gt16
#PBS -l walltime=00:05:00
#PBS -e err
#PBS -o test.lst

cd $PBS_O_WORKDIR
. /etc/profile.d/modules.sh

export OMP_NUM_THREADS=8        # thread #/MPI process
export I_MPI_PIN_DOMAIN=socket
export I_MPI_PERHOST=4          # MPI proc. #/node

mpirun ./impimap.sh ./sol1
4 nodes: 4 threads x 32 MPI processes
HB 4x4, 8 MPI processes/node
[Figure: Node#0–Node#3, each with Socket #0 / Socket #1; 16 of 18 cores used per socket]

mesh.inp:
256 128 64
8 4 1
pcube

inp_kmetis:
cube.02
32
pcube

select=4:mpiprocs=8
I_MPI_PERHOST=8
OMP_NUM_THREADS=4

inp_mg:
256 128 64
HB 4x4

#!/bin/sh
#PBS -q u-lecture
#PBS -N hybrid
#PBS -l select=4:mpiprocs=8     # node #, MPI proc. #/node
#PBS -Wgroup_list=gt16
#PBS -l walltime=00:05:00
#PBS -e err
#PBS -o test.lst

cd $PBS_O_WORKDIR
. /etc/profile.d/modules.sh

export OMP_NUM_THREADS=4        # thread #/MPI process
export I_MPI_PIN_DOMAIN=socket
export I_MPI_PERHOST=8          # MPI proc. #/node

mpirun ./impimap.sh ./sol1
4 nodes: 2 threads x 64 MPI processes
HB 2x8, 16 MPI processes/node
[Figure: Node#0–Node#3, each with Socket #0 / Socket #1; 16 of 18 cores used per socket]

mesh.inp:
256 128 64
8 8 1
pcube

inp_kmetis:
cube.02
64
pcube

select=4:mpiprocs=16
I_MPI_PERHOST=16
OMP_NUM_THREADS=2

inp_mg:
256 128 64
HB 2x8

#!/bin/sh
#PBS -q u-lecture
#PBS -N hybrid
#PBS -l select=4:mpiprocs=16    # node #, MPI proc. #/node
#PBS -Wgroup_list=gt16
#PBS -l walltime=00:05:00
#PBS -e err
#PBS -o test.lst

cd $PBS_O_WORKDIR
. /etc/profile.d/modules.sh

export OMP_NUM_THREADS=2        # thread #/MPI process
export I_MPI_PIN_DOMAIN=socket
export I_MPI_PERHOST=16         # MPI proc. #/node

mpirun ./impimap.sh ./sol1
4 nodes: 18 threads x 8 MPI processes
HB 18x1, 2 MPI processes/node
[Figure: Node#0–Node#3, each with Socket #0 / Socket #1; 18 of 18 cores used per socket]

mesh.inp:
256 128 64
4 2 1
pcube

inp_kmetis:
cube.02
8
pcube

select=4:mpiprocs=2
I_MPI_PERHOST=2
OMP_NUM_THREADS=18

inp_mg:
256 128 64
HB 18x1

#!/bin/sh
#PBS -q u-lecture
#PBS -N hybrid
#PBS -l select=4:mpiprocs=2     # node #, MPI proc. #/node
#PBS -Wgroup_list=gt16
#PBS -l walltime=00:05:00
#PBS -e err
#PBS -o test.lst

cd $PBS_O_WORKDIR
. /etc/profile.d/modules.sh

export OMP_NUM_THREADS=18       # thread #/MPI process
export I_MPI_PIN_DOMAIN=socket
export I_MPI_PERHOST=2          # MPI proc. #/node

mpirun ./impimap.sh ./sol1
4 nodes / 8 sockets: 144 MPI processes
Flat MPI, 36 MPI processes/node
[Figure: Node#0–Node#3, each with Socket #0 / Socket #1; 18 of 18 cores used per socket]

inp_kmetis:
cube.02
144
pcube

select=4:mpiprocs=36
I_MPI_PERHOST=36

inp_mg:
256 128 64
Flat MPI: 18 MPI Processes/Socket

#!/bin/sh
#PBS -q u-lecture
#PBS -N hybrid
#PBS -l select=4:mpiprocs=36
#PBS -Wgroup_list=gt16
#PBS -l walltime=00:05:00
#PBS -e err
#PBS -o test.lst

cd $PBS_O_WORKDIR
. /etc/profile.d/modules.sh

export I_MPI_PIN_DOMAIN=socket
export I_MPI_PERHOST=36

mpirun ./impimap.sh ./sol

(select=4:mpiprocs=36: node count and MPI processes/node; I_MPI_PERHOST: MPI processes/node)
54
How to apply multi-threading
• CG Solver
  – Just insert OpenMP directives
  – ILU/IC preconditioning is much more difficult
• MAT_ASS (mat_ass_main, mat_ass_bc)
  – Data dependency
  – Avoid accumulating the contributions of multiple elements to a single node simultaneously (in parallel)
    • results may change
    • deadlock may occur
  – Coloring
    • Elements of the same color do not share any node
    • Elements within each color can be processed in parallel
    • In this case, only 8 colors are needed for 3D problems (4 colors for 2D problems)
    • The coloring step itself is very expensive and difficult to parallelize
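The 8-color claim can be checked directly: on a structured hexahedral mesh, coloring element (i,j,k) by the parities of its three indices gives 8 colors in which no two same-color elements share a node. A minimal sketch in C (the function name parity_coloring_ok and the node numbering are our own, not part of the original code):

```c
#include <stdlib.h>
#include <string.h>

/* Node ID of grid point (ii,jj,kk) on an (nx+1) x (ny+1) x (nz+1)
   structured grid of tri-linear hexahedra (our own numbering). */
static int node_id(int ii, int jj, int kk, int nx, int ny)
{
    return ii + (nx + 1) * (jj + (ny + 1) * kk);
}

/* Color element (i,j,k) with (i%2) + 2*(j%2) + 4*(k%2), i.e. 8 colors.
   Returns 1 if no two elements of the same color share a node,
   0 if a conflict is found. */
int parity_coloring_ok(int nx, int ny, int nz)
{
    int nnode = (nx + 1) * (ny + 1) * (nz + 1);
    char *touched = malloc(nnode);
    int ok = 1;

    for (int color = 0; color < 8 && ok; color++) {
        memset(touched, 0, nnode);   /* reset node flags for each color */
        for (int k = 0; k < nz; k++)
        for (int j = 0; j < ny; j++)
        for (int i = 0; i < nx; i++) {
            if ((i % 2) + 2 * (j % 2) + 4 * (k % 2) != color) continue;
            /* visit the 8 corner nodes of element (i,j,k) */
            for (int dk = 0; dk < 2; dk++)
            for (int dj = 0; dj < 2; dj++)
            for (int di = 0; di < 2; di++) {
                int n = node_id(i + di, j + dj, k + dk, nx, ny);
                if (touched[n]) ok = 0;   /* node already used: conflict */
                touched[n] = 1;
            }
        }
    }
    free(touched);
    return ok;
}
```

Two same-color elements differ by an even amount (at least 2) in some index, so their corner-node ranges cannot overlap; the check returns 1 for any grid size.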
55
FORTRAN (solver_CG)

!$omp parallel do private(i)
      do i= 1, N
        X (i)  = X (i)   + ALPHA * WW(i,P)
        WW(i,R)= WW(i,R) - ALPHA * WW(i,Q)
      enddo

      DNRM20= 0.d0
!$omp parallel do private(i) reduction(+:DNRM20)
      do i= 1, N
        DNRM20= DNRM20 + WW(i,R)**2
      enddo

!$omp parallel do private(j,k,i,WVAL)
      do j= 1, N
        WVAL= D(j)*WW(j,P)
        do k= index(j-1)+1, index(j)
          i= item(k)
          WVAL= WVAL + AMAT(k)*WW(i,P)
        enddo
        WW(j,Q)= WVAL
      enddo
56
C (solver_CG)

#pragma omp parallel for private (i)
for(i=0; i<N; i++){
  X [i]    +=  ALPHA * WW[P][i];
  WW[R][i] += -ALPHA * WW[Q][i];
}

DNRM20= 0.e0;
#pragma omp parallel for private (i) reduction (+:DNRM20)
for(i=0; i<N; i++){
  DNRM20 += WW[R][i]*WW[R][i];
}

#pragma omp parallel for private (j,i,k,WVAL)
for(j=0; j<N; j++){
  WVAL= D[j] * WW[P][j];
  for(k=indexLU[j]; k<indexLU[j+1]; k++){
    i= itemLU[k];
    WVAL += AMAT[k] * WW[P][i];
  }
  WW[Q][j]= WVAL;
}
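The matrix-vector kernel of the CG loop can be isolated into a self-contained routine. This is a sketch under the same CRS layout (diagonal stored separately in D, off-diagonals in indexLU/itemLU/AMAT); the function name spmv_crs is ours:

```c
/* y = A*x for an N x N matrix in CRS format, as in solver_CG:
   D[j]        diagonal entry of row j
   indexLU[]   off-diagonal row pointers, indexLU[0] = 0
   itemLU[]    column indices of off-diagonal entries (0-based)
   AMAT[]      off-diagonal values */
void spmv_crs(int N, const double *D,
              const int *indexLU, const int *itemLU,
              const double *AMAT, const double *x, double *y)
{
    int j;
#pragma omp parallel for private(j)
    for (j = 0; j < N; j++) {
        double wval = D[j] * x[j];           /* diagonal contribution */
        for (int k = indexLU[j]; k < indexLU[j + 1]; k++)
            wval += AMAT[k] * x[itemLU[k]];  /* off-diagonal entries of row j */
        y[j] = wval;
    }
}
```

Each row j is written only by one thread (y[j]), so the loop parallelizes with a single directive. For the 3x3 tridiagonal matrix with 2 on the diagonal and -1 off it, spmv_crs maps x = (1,1,1) to y = (1,0,1).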
57
solver_SR (send)
for( neib=1; neib<=NEIBPETOT; neib++){
  istart= EXPORT_INDEX[neib-1];
  inum  = EXPORT_INDEX[neib] - istart;
#pragma omp parallel for private (k,ii)
  for( k=istart; k<istart+inum; k++){
    ii= EXPORT_ITEM[k];
    WS[k]= X[ii-1];
  }
  MPI_Isend(&WS[istart], inum, MPI_DOUBLE,
            NEIBPE[neib-1], 0, MPI_COMM_WORLD, &req1[neib-1]);
}
do neib= 1, NEIBPETOT
  istart= EXPORT_INDEX(neib-1)
  inum  = EXPORT_INDEX(neib  ) - istart
!$omp parallel do private(k,ii)
  do k= istart+1, istart+inum
    ii   = EXPORT_ITEM(k)
    WS(k)= X(ii)
  enddo

  call MPI_Isend (WS(istart+1), inum, MPI_DOUBLE_PRECISION,        &
 &                NEIBPE(neib), 0, MPI_COMM_WORLD, req1(neib),     &
 &                ierr)
enddo
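The thread-parallel packing step of solver_SR (gathering X at the 1-based EXPORT_ITEM indices into the contiguous send buffer WS) can be exercised without MPI. A sketch, with pack_export being our own name:

```c
/* Gather boundary values into the send buffer WS, as in solver_SR.
   EXPORT_ITEM holds 1-based node IDs, hence the X[ii-1]. */
void pack_export(int istart, int inum,
                 const int *EXPORT_ITEM, const double *X, double *WS)
{
    int k;
#pragma omp parallel for private(k)
    for (k = istart; k < istart + inum; k++) {
        int ii = EXPORT_ITEM[k];     /* 1-based node ID */
        WS[k] = X[ii - 1];           /* 0-based array access */
    }
}
```

Each WS[k] is written by exactly one thread, so the gather is safe to parallelize; only the MPI_Isend that follows must stay outside the threaded region.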
pFEM3D-2
58
Example: Strong Scaling: Fortran
• 256×128×128 nodes
  – 4,194,304 nodes, 4,112,895 elements
• 32~864 cores, HB 16x1, HB 18x1, Flat MPI
• Linear Solver
256 128 128
2 1 1
pcube

256 128 128
2 1 2
pcube

256 128 128
4 2 2
pcube
select=1:mpiprocs=2
select=2:mpiprocs=4
select=8:mpiprocs=16
[Figure: Speed-up vs. CORE# (0-1000); curves: HB-pmesh/16, HB-pmesh/18, HB-kmetis/16, Flat-pmesh/16, Flat-pmetis/16, Ideal. Performance of Flat-pmesh/16 with 32 cores = 32.0]
pFEM3D-2
59
Example: Strong Scaling: Fortran
• 256×128×128 nodes
  – 4,194,304 nodes, 4,112,895 elements
• 32~864 cores, HB 16x1, HB 18x1, Flat MPI
• Linear Solver
[Figure: Speed-up vs. CORE#, full range (0-1000 cores) and zoom (512-768 cores); curves: HB-pmesh/16, HB-pmesh/18, HB-kmetis/16, Flat-pmesh/16, Flat-pmetis/16, Ideal. Performance of Flat-pmesh/16 with 32 cores = 32.0]
60
Computation Time using 16 nodes
• kmetis
• Linear Solver
[Bar chart: computation time of the linear solver (sec., 0.00-0.40) for each Thread# x MPI Process# layout: Flat MPI/16, HB 2x8, HB 4x4, HB 8x2, HB 16x1, HB 18x1]
61
Flat MPI vs. Hybrid
• Depends on the application, problem size, hardware, etc.
• Flat MPI is generally better for sparse linear solvers if the number of compute nodes is not very large.
  – Memory contention
• Hybrid becomes better as the number of compute nodes grows.
  – Fewer MPI processes
• 1 MPI process/node is possible: NUMA (A1/A2)
62
How to apply multi-threading
• CG Solver
  – Just insert OpenMP directives
  – ILU/IC preconditioning is much more difficult
• MAT_ASS (mat_ass_main, mat_ass_bc)
  – Data dependency
  – Avoid accumulating the contributions of multiple elements to a single node simultaneously (in parallel)
    • results may change
    • deadlock may occur
  – Coloring
    • Elements of the same color do not share any node
    • Elements within each color can be processed in parallel
    • In this case, only 8 colors are needed for 3D problems (4 colors for 2D problems)
    • The coloring step itself is very expensive and difficult to parallelize
63
Multi-Threading: Mat_Ass
Parallel operations are possible for elements of the same color (they are independent).
64
Coloring (1/2)

allocate (ELMCOLORindex(0:NP))     ! number of elements in each color
allocate (ELMCOLORitem (ICELTOT))  ! element IDs renumbered according to "color"
if (allocated (IWKX)) deallocate (IWKX)
allocate (IWKX(0:NP,3))

IWKX= 0
icou= 0
do icol= 1, NP
  do i= 1, NP
    IWKX(i,1)= 0
  enddo
  do icel= 1, ICELTOT
    if (IWKX(icel,2).eq.0) then
      in1= ICELNOD(icel,1)
      in2= ICELNOD(icel,2)
      in3= ICELNOD(icel,3)
      in4= ICELNOD(icel,4)
      in5= ICELNOD(icel,5)
      in6= ICELNOD(icel,6)
      in7= ICELNOD(icel,7)
      in8= ICELNOD(icel,8)

      ip1= IWKX(in1,1)
      ip2= IWKX(in2,1)
      ip3= IWKX(in3,1)
      ip4= IWKX(in4,1)
      ip5= IWKX(in5,1)
      ip6= IWKX(in6,1)
      ip7= IWKX(in7,1)
      ip8= IWKX(in8,1)
65
Coloring (2/2)

      isum= ip1 + ip2 + ip3 + ip4 + ip5 + ip6 + ip7 + ip8
      if (isum.eq.0) then            ! none of the nodes is accessed in this color
        icou= icou + 1
        IWKX(icol,3)= icou           ! (current) number of elements in each color
        IWKX(icel,2)= icol
        ELMCOLORitem(icou)= icel     ! ID of the icou-th element = icel
        IWKX(in1,1)= 1               ! nodes of this element can not be
        IWKX(in2,1)= 1               ! accessed again in the same color
        IWKX(in3,1)= 1
        IWKX(in4,1)= 1
        IWKX(in5,1)= 1
        IWKX(in6,1)= 1
        IWKX(in7,1)= 1
        IWKX(in8,1)= 1
        if (icou.eq.ICELTOT) goto 100   ! until all elements are colored
      endif
    endif
  enddo
enddo

100 continue
ELMCOLORtot= icol                  ! number of colors
IWKX(0          ,3)= 0
IWKX(ELMCOLORtot,3)= ICELTOT
do icol= 0, ELMCOLORtot
  ELMCOLORindex(icol)= IWKX(icol,3)
enddo
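The same greedy scheme can be sketched compactly in C: per color, sweep all uncolored elements and accept an element only if none of its nodes has been used by that color yet. Names such as greedy_color and the 0-based flattened connectivity icelnod are ours, not from the original code:

```c
#include <stdlib.h>
#include <string.h>

/* Greedy element coloring, following the Fortran version.
   icelnod[e*nvert + v] : 0-based node IDs of element e
   color[e]             : output, 0-based color of element e
   Returns the number of colors; same-color elements share no node. */
int greedy_color(int nelem, int nnode, int nvert,
                 const int *icelnod, int *color)
{
    char *node_used = calloc(nnode, 1);
    int colored = 0, icol = 0;

    for (int e = 0; e < nelem; e++) color[e] = -1;   /* mark uncolored */

    while (colored < nelem) {
        memset(node_used, 0, nnode);       /* fresh node flags per color */
        for (int e = 0; e < nelem; e++) {
            if (color[e] >= 0) continue;   /* already colored */
            int conflict = 0;
            for (int v = 0; v < nvert; v++)
                if (node_used[icelnod[e * nvert + v]]) { conflict = 1; break; }
            if (conflict) continue;        /* a node is taken in this color */
            color[e] = icol;
            colored++;
            for (int v = 0; v < nvert; v++)
                node_used[icelnod[e * nvert + v]] = 1;
        }
        icol++;
    }
    free(node_used);
    return icol;   /* total number of colors */
}
```

For a 1D chain of three 2-node elements (nodes 0-1, 1-2, 2-3) this yields two colors: elements 0 and 2 get color 0, element 1 gets color 1.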
66
Multi-Threaded Matrix Assembling Procedure
do icol= 1, ELMCOLORtot
!$omp parallel do private (icel0,icel)                                &
!$omp&            private (in1,in2,in3,in4,in5,in6,in7,in8)           &
!$omp&            private (nodLOCAL,ie,je,ip,jp,kk,iiS,iiE,k)         &
!$omp&            private (DETJ,PNX,PNY,PNZ,QVC,QV0,COEFij,coef,SHi)  &
!$omp&            private (PNXi,PNYi,PNZi,PNXj,PNYj,PNZj,ipn,jpn,kpn) &
!$omp&            private (X1,X2,X3,X4,X5,X6,X7,X8)                   &
!$omp&            private (Y1,Y2,Y3,Y4,Y5,Y6,Y7,Y8)                   &
!$omp&            private (Z1,Z2,Z3,Z4,Z5,Z6,Z7,Z8,COND0)
  do icel0= ELMCOLORindex(icol-1)+1, ELMCOLORindex(icol)
    icel= ELMCOLORitem(icel0)
    in1= ICELNOD(icel,1)
    in2= ICELNOD(icel,2)
    in3= ICELNOD(icel,3)
    in4= ICELNOD(icel,4)
    in5= ICELNOD(icel,5)
    in6= ICELNOD(icel,6)
    in7= ICELNOD(icel,7)
    in8= ICELNOD(icel,8)
...