ppOpen-HPC: Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT)
Kengo Nakajima, Information Technology Center, The University of Tokyo
Lessons Learned in the 20th Century
• Methods for scientific computing (e.g. FEM, FDM, BEM) consist of typical data structures and typical procedures.
• Optimization of each procedure is possible and effective.
• A well-defined data structure can "hide" MPI communication processes from the code developer.
• Code developers do not have to care about communications.
• Halo exchange for parallel FEM (a minimal sketch follows below).
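To make the last point concrete, the sketch below shows how a halo (overlap) update can be driven entirely by the neighbor lists and import/export index lists stored in the distributed-mesh data structure, so the application code never issues MPI calls itself. All names here (local_mesh, halo_update, the import/export components) are hypothetical illustrations, not the actual ppOpen-HPC interface.

```fortran
! Minimal sketch (hypothetical names, not the actual ppOpen-APPL interface):
! a halo exchange driven purely by the neighbor / import / export lists kept
! in the distributed-mesh data structure, so user code never touches MPI.
module halo_sketch
  use mpi
  implicit none
  type local_mesh
    integer              :: n_neighbor        ! number of neighboring ranks
    integer, allocatable :: neighbor(:)       ! their MPI ranks
    integer, allocatable :: import_index(:)   ! 0:n_neighbor, offsets per neighbor
    integer, allocatable :: import_item(:)    ! local ids of halo (ghost) nodes
    integer, allocatable :: export_index(:)   ! 0:n_neighbor
    integer, allocatable :: export_item(:)    ! local ids of owned boundary nodes
  end type local_mesh
contains
  subroutine halo_update(mesh, val)
    type(local_mesh), intent(in)    :: mesh
    real(8),          intent(inout) :: val(:)         ! nodal values incl. halo entries
    real(8), allocatable :: sbuf(:), rbuf(:)
    integer :: n, ierr
    integer :: req(2*mesh%n_neighbor), sta(MPI_STATUS_SIZE, 2*mesh%n_neighbor)

    allocate(sbuf(mesh%export_index(mesh%n_neighbor)))
    allocate(rbuf(mesh%import_index(mesh%n_neighbor)))
    sbuf = val(mesh%export_item)                      ! pack boundary values

    do n = 1, mesh%n_neighbor                         ! one Irecv/Isend pair per neighbor
      call MPI_Irecv(rbuf(mesh%import_index(n-1)+1),                          &
           mesh%import_index(n)-mesh%import_index(n-1), MPI_DOUBLE_PRECISION, &
           mesh%neighbor(n), 0, MPI_COMM_WORLD, req(n), ierr)
      call MPI_Isend(sbuf(mesh%export_index(n-1)+1),                          &
           mesh%export_index(n)-mesh%export_index(n-1), MPI_DOUBLE_PRECISION, &
           mesh%neighbor(n), 0, MPI_COMM_WORLD, req(mesh%n_neighbor+n), ierr)
    end do
    call MPI_Waitall(2*mesh%n_neighbor, req, sta, ierr)
    val(mesh%import_item) = rbuf                      ! unpack halo values
  end subroutine halo_update
end module halo_sketch
```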
[Figure: a 5×5 global mesh (nodes 1-25) partitioned among four processes PE#0-PE#3; each process's local node numbering includes halo nodes imported from the neighboring partitions.]
ppOpen-HPC: Overview
• Open source infrastructure for development and execution of large-scale scientific applications on post-peta-scale supercomputers with automatic tuning (AT); "pp" stands for post-peta-scale.
• Five-year project (FY.2011-2015, since April 2011); P.I.: Kengo Nakajima (ITC, The University of Tokyo).
• Part of "Development of System Software Technologies for Post-Peta Scale High Performance Computing", funded by JST/CREST (Japan Science and Technology Agency, Core Research for Evolutional Science and Technology).
• Team of 7 institutes, >30 people (5 PDs) from various fields: co-design.
– ITC/U.Tokyo, AORI/U.Tokyo, ERI/U.Tokyo, FS/U.Tokyo
– Hokkaido U., Kyoto U., JAMSTEC
[Figure: ppOpen-HPC architecture. A user's program is built on ppOpen-APPL (FEM, FDM, FVM, BEM, DEM) as the application-development framework, ppOpen-MATH (MG, GRAPH, VIS, MP) math libraries, ppOpen-AT (STATIC, DYNAMIC) automatic tuning, and ppOpen-SYS (COMM, FT) system software, yielding an optimized application with optimized ppOpen-APPL and ppOpen-MATH.]
• Group Leaders
– Masaki Satoh (AORI/U.Tokyo)
– Takashi Furumura (ERI/U.Tokyo)
– Hiroshi Okuda (GSFS/U.Tokyo)
– Takeshi Iwashita (Kyoto U., ITC/Hokkaido U.)
– Hide Sakaguchi (IFREE/JAMSTEC)
• Main Members
– Takahiro Katagiri (ITC/U.Tokyo)
– Masaharu Matsumoto (ITC/U.Tokyo)
– Hideyuki Jitsumoto (ITC/U.Tokyo)
– Satoshi Ohshima (ITC/U.Tokyo)
– Hiroyasu Hasumi (AORI/U.Tokyo)
– Takashi Arakawa (RIST)
– Futoshi Mori (ERI/U.Tokyo)
– Takeshi Kitayama (GSFS/U.Tokyo)
– Akihiro Ida (ACCMS/Kyoto U.)
– Miki Yamamoto (IFREE/JAMSTEC)
– Daisuke Nishiura (IFREE/JAMSTEC)
ppOpen-HPC: ppOpen-APPL
• ppOpen-HPC consists of various types of optimized libraries, which cover various types of procedures for scientific computing.
– ppOpen-APPL/FEM, FDM, FVM, BEM, DEM
– Linear solvers, matrix assembly, AMR, visualization, etc.
– Written in Fortran 2003 (a C interface will be available soon).
• Source code developed on a PC with a single processor is linked with these libraries, and the generated parallel code is optimized for post-peta-scale systems.
• Users do not have to worry about optimization, tuning, parallelization, etc.
– Part of MPI, OpenMP, (OpenACC)
ppOpen-HPC covers … [figure]
FEM Code on ppOpen-HPC
Optimization/parallelization can be hidden from application developers:

    program My_pFEM
      use ppOpenFEM_util
      use ppOpenFEM_solver

      call ppOpenFEM_init
      call ppOpenFEM_cntl
      call ppOpenFEM_mesh
      call ppOpenFEM_mat_init

      do
        call Users_FEM_mat_ass
        call Users_FEM_mat_bc
        call ppOpenFEM_solve
        call ppOpenFEM_vis
        Time = Time + DT
      enddo

      call ppOpenFEM_finalize
      stop
    end
ppOpen-HPC: AT & Post T2K
• Automatic tuning (AT) enables development of optimized codes and libraries on emerging architectures.
– Directive-based special language for AT
– Optimization of memory access
• The target system is the Post T2K system.
– 20-30 PFLOPS, FY.2015-2016
– JCAHPC: U. Tsukuba & U. Tokyo
– Many-core based (e.g. Intel MIC/Xeon Phi)
– ppOpen-HPC helps the smooth transition of users (> 2,000) to the new system.
Supercomputers in U.Tokyo: 2 big systems, 6-year cycle (FY2005-FY2019)
• Hitachi SR11000/J2: 18.8 TFLOPS, 16.4 TB (fat nodes with large memory; (flat) MPI, good communication performance)
• Hitachi HA8000 (T2K): 140 TFLOPS, 31.3 TB (turning point to the hybrid parallel programming model)
• Hitachi SR16000/M1, based on IBM Power-7: 54.9 TFLOPS, 11.2 TB (our last SMP, to be switched to MPP)
• Fujitsu PRIMEHPC FX10, based on SPARC64 IXfx: 1.13 PFLOPS, 150 TB
• Post T2K: 20-30 PFLOPS
(京 (=K) computer: peta-scale)
[Figure: ppOpen-HPC architecture diagram, shown again (see above).]
Weak-Coupled Simulation by the ppOpen-HPC Libraries
Two kinds of applications, Seism3D+ (based on FDM, using ppOpen-APPL/FDM) and FrontISTR++ (based on FEM, using ppOpen-APPL/FEM), are connected by the ppOpen-MATH/MP coupler (exchanged fields: velocity, displacement, …).
Principal functions of ppOpen-MATH/MP:
• Make a mapping table
• Convert physical variables
• Choose the timing of data transmission
ppOpen-AT on Seism3D (FDM), Xeon Phi, 8 nodes
[Figure: speedup [%] obtained by automatic tuning for the kernels update_stress, update_vel, update_stress_sponge, diff_*, passing_*, and the whole code.]
Example of directives for ppOpen-AT (loop splitting/fusion):

    !oat$ install LoopFusionSplit region start
    !$omp parallel do private(k,j,i,STMP1,STMP2,STMP3,STMP4,RL,RM,RM2,RMAXY,RMAXZ,RMAYZ,RLTHETA,QG)
    DO K = 1, NZ
    DO J = 1, NY
    DO I = 1, NX
      RL = LAM(I,J,K); RM = RIG(I,J,K); RM2 = RM + RM
      RLTHETA = (DXVX(I,J,K)+DYVY(I,J,K)+DZVZ(I,J,K))*RL
    !oat$ SplitPointCopyDef region start
      QG = ABSX(I)*ABSY(J)*ABSZ(K)*Q(I,J,K)
    !oat$ SplitPointCopyDef region end
      SXX(I,J,K) = ( SXX(I,J,K) + (RLTHETA + RM2*DXVX(I,J,K))*DT )*QG
      SYY(I,J,K) = ( SYY(I,J,K) + (RLTHETA + RM2*DYVY(I,J,K))*DT )*QG
      SZZ(I,J,K) = ( SZZ(I,J,K) + (RLTHETA + RM2*DZVZ(I,J,K))*DT )*QG
    !oat$ SplitPoint (K, J, I)
      STMP1 = 1.0/RIG(I,J,K); STMP2 = 1.0/RIG(I+1,J,K); STMP4 = 1.0/RIG(I,J,K+1)
      STMP3 = STMP1 + STMP2
      RMAXY = 4.0/(STMP3 + 1.0/RIG(I,J+1,K) + 1.0/RIG(I+1,J+1,K))
      RMAXZ = 4.0/(STMP3 + STMP4 + 1.0/RIG(I+1,J,K+1))
      RMAYZ = 4.0/(STMP3 + STMP4 + 1.0/RIG(I,J+1,K+1))
    !oat$ SplitPointCopyInsert
      SXY(I,J,K) = ( SXY(I,J,K) + (RMAXY*(DXVY(I,J,K)+DYVX(I,J,K)))*DT )*QG
      SXZ(I,J,K) = ( SXZ(I,J,K) + (RMAXZ*(DXVZ(I,J,K)+DZVX(I,J,K)))*DT )*QG
      SYZ(I,J,K) = ( SYZ(I,J,K) + (RMAYZ*(DYVZ(I,J,K)+DZVY(I,J,K)))*DT )*QG
    END DO; END DO; END DO
    !$omp end parallel do
    !oat$ install LoopFusionSplit region end
Schedule of Public Release (with English Documents)
http://ppopenhpc.cc.u-tokyo.ac.jp/
• Released at SC-XY (or can be downloaded)
• Multicore/manycore cluster version (flat MPI, OpenMP/MPI hybrid) with documents in English
• We are now focusing on MIC/Xeon Phi
• Collaborations with scientists are welcome
History
• SC12, Nov. 2012 (Ver.0.1.0)
• SC13, Nov. 2013 (Ver.0.2.0)
• SC14, Nov. 2014 (Ver.0.3.0)
New Features in Ver.0.3.0
http://ppopenhpc.cc.u-tokyo.ac.jp/
• ppOpen-APPL/AMR-FDM: an AMR framework with a dynamic load-balancing method for various FDM applications
• HACApK library for H-matrix computations in ppOpen-APPL/BEM
[Figure: distribution of small submatrices of an H-matrix over processes P0-P7.]
• Utilities for pre-processing in ppOpen-APPL/DEM
• Booth #713
Collaborations, Outreaching
• Collaborations
– International collaborations: Lawrence Berkeley National Lab., National Taiwan University, IPCC (Intel Parallel Computing Center)
• Outreaching, Applications
– Large-scale simulations: geologic CO2 storage, astrophysics, earthquake simulations, etc., using ppOpen-AT, ppOpen-MATH/VIS, ppOpen-MATH/MP, and linear solvers
– International workshops (2012, 2013)
– Tutorials, classes
From Post-Peta to Exascale
• Currently, we are focusing on the Post T2K system with many-core architectures (Intel Xeon Phi).
• The outline of the exascale systems is much clearer than it was in 2011 (when this project started).
– Frameworks like ppOpen-HPC are really needed.
– More complex and huge systems
– More difficult to extract application performance
– A smooth transition from post-peta to exa will be possible through continuous development and improvement of ppOpen-HPC (we need funding for that!).
• Research topics in the exascale era
– Power-aware algorithms/AT
– Communication/synchronization-reducing algorithms
• pK-Open-HPC
• Ill-Conditioned Problems
• SPPEXA Proposal
SIAM PP14: 16th SIAM Conference on Parallel Processing for Scientific Computing, Feb. 18-21, 2014
http://www.siam.org/meetings/pp14/
• Communication/synchronization avoiding/reducing algorithms
– Direct/iterative solvers, preconditioning, s-step methods
– Coarse grid solvers in parallel multigrid
• Preconditioning methods on many-core architectures
– SPAI, polynomial, ILU with multicoloring/RCM (no CM-RCM), GMG
– Asynchronous ILU on GPU (E. Chow, Georgia Tech)?
• Preconditioning methods for ill-conditioned problems
– Low-rank approximation
• Block-structured AMR
– SFC: space-filling curves
Communications are expensive…
(Ph.D. thesis by Mark Hoemmen (SNL) at UC Berkeley on CA-KSP solvers, 2010)
• Serial communications: data transfer through the hierarchical memory system
• Parallel communications: message passing through the network
• Efficiency of memory access -> exa-feasible applications
Future of CAE Applications
• FEM with fully-unstructured meshes is infeasible for exascale supercomputer systems.
• Block-structured AMR, voxel-type FEM
[Figure: block-structured AMR with refinement levels 0, 1, and 2.]
• Serial comm.: fixed inner loops (e.g. sliced ELL)
• Parallel comm.: Krylov iterative solvers (preconditioning, dot products, halo communications)
ELL: Fixed Loop Length, Nice for Prefetching
[Figure: a small sparse matrix stored in (a) CRS and (b) ELL format; in ELL each row is padded with zeros up to the maximum row length, so the inner loop length is fixed.]
An illustrative sketch of both kernels follows.
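As a hedged illustration of the point above (generic sketches with assumed array names, not ppOpen-HPC code), the CRS inner loop has a variable length per row, while the ELL loop length is fixed at the padded maximum:

```fortran
! Generic SpMV sketches: (a) CRS with variable row length, (b) ELL with a
! fixed (padded) row length NMAX.  Array names are assumptions for this sketch.
module spmv_sketch
  implicit none
contains
  subroutine spmv_crs(N, idx, item, val, x, y)
    integer, intent(in)  :: N, idx(0:N), item(:)
    real(8), intent(in)  :: val(:), x(:)
    real(8), intent(out) :: y(N)
    integer :: i, j
    do i = 1, N
      y(i) = 0.0d0
      do j = idx(i-1)+1, idx(i)          ! loop length differs from row to row
        y(i) = y(i) + val(j)*x(item(j))
      end do
    end do
  end subroutine spmv_crs

  subroutine spmv_ell(N, NMAX, itemE, valE, x, y)
    integer, intent(in)  :: N, NMAX, itemE(NMAX,N)
    real(8), intent(in)  :: valE(NMAX,N), x(0:)   ! x(0)=0 absorbs padded (zero) column ids
    real(8), intent(out) :: y(N)
    integer :: i, j
    do i = 1, N
      y(i) = 0.0d0
      do j = 1, NMAX                     ! fixed loop length: friendly to prefetch/SIMD
        y(i) = y(i) + valE(j,i)*x(itemE(j,i))
      end do
    end do
  end subroutine spmv_ell
end module spmv_sketch
```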
Special Treatment for "Boundary" Meshes Connected to the "Halo"
• Distribution of lower/upper non-zero off-diagonal components
• If we adopt RCM (or CM) reordering:
– Pure internal meshes: L ≈ 3, U ≈ 3
– Internal meshes on the boundary: L ≈ 3, U ≈ 6
[Figure: pure internal meshes, internal meshes on the boundary, and external (halo) meshes; markers distinguish internal (lower), internal (upper), and external (upper) connections.]
Original ELL: Backward Substitution
Cache is not well utilized: IAUnew(6,N), AUnew(6,N)
• Pure internal cells: AUnew(6,N)
• Boundary cells: AUnew(6,N)
Improved ELL: Backward Substitution
Cache is well utilized; the arrays are separated: AUnew3/AUnew6 (sliced ELL [Monakov et al. 2010], originally proposed for SpMV on GPU).
• Pure internal cells: AUnew3(3,N)
• Boundary cells: AUnew6(6,N)
(A sketch of the split loops follows.)
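The split can be sketched as below. This is an illustrative reconstruction, not the ppOpen-HPC source: the ordering assumption (pure internal cells numbered 1..Nin before the boundary cells) and the use of D as the inverted diagonal from the factorization are ours.

```fortran
! Illustrative sketch of backward substitution with the "improved ELL" layout:
! rows are grouped so that pure internal cells (at most 3 upper off-diagonals,
! AUnew3) and boundary cells (up to 6, AUnew6) use separate, tightly packed
! arrays.  Ordering assumption: internal cells 1..Nin, boundary cells Nin+1..N.
subroutine backward_subst(Nin, N, D, AUnew3, IAUnew3, AUnew6, IAUnew6, x)
  implicit none
  integer, intent(in)    :: Nin, N
  real(8), intent(in)    :: D(N)                    ! inverted diagonal of the factorization
  real(8), intent(in)    :: AUnew3(3,Nin), AUnew6(6,Nin+1:N)
  integer, intent(in)    :: IAUnew3(3,Nin), IAUnew6(6,Nin+1:N)
  real(8), intent(inout) :: x(0:N)                  ! x(0)=0 absorbs padded (zero) column ids
  integer :: i, j
  real(8) :: w

  do i = N, Nin+1, -1                               ! boundary cells: up to 6 upper entries
    w = x(i)
    do j = 1, 6
      w = w - AUnew6(j,i)*x(IAUnew6(j,i))
    end do
    x(i) = w*D(i)
  end do
  do i = Nin, 1, -1                                 ! pure internal cells: only 3 entries,
    w = x(i)                                        ! so cache lines of AUnew3 are fully used
    do j = 1, 3
      w = w - AUnew3(j,i)*x(IAUnew3(j,i))
    end do
    x(i) = w*D(i)
  end do
end subroutine backward_subst
```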
Analysis by the detailed profiler of Fujitsu FX10: single node, flat MPI, RCM, multigrid part, 64^3 cells/core

| | Instructions | L1D miss | L2 miss | SIMD op. ratio | GFLOPS |
|---|---|---|---|---|---|
| CRS | 1.53×10^9 | 2.32×10^7 | 1.67×10^7 | 30.14% | 6.05 |
| Original ELL | 4.91×10^8 | 1.67×10^7 | 1.27×10^7 | 93.88% | 6.99 |
| Improved ELL | 4.91×10^8 | 1.67×10^7 | 9.14×10^6 | 93.88% | 8.56 |
Hierarchical CGA (hCGA): Communication-Reducing Multigrid
Reduced number of MPI processes at coarser levels [KN 2013]
[Figure: multigrid hierarchy from fine (level 1) to coarse (level m-2); below a certain level the problem is gathered so that the coarse grid solver runs on a single MPI process (multi-threaded, further multigrid).]
Weak Scaling up to ~4,096 nodes: up to 17,179,869,184 meshes (64^3 meshes/core); elapsed time in sec. (down is good)

| Case | Matrix | Coarse grid solver |
|---|---|---|
| C0 | CRS | single core |
| C1 | ELL (original) | single core |
| C2 | ELL (original) | CGA |
| C3 | ELL (sliced) | CGA |
| C4 | ELL (sliced) | hCGA |

[Figure: elapsed time (sec.) vs. core count (100 to ~100,000 cores). Left: HB 8x2 on FX10 for cases C0-C3, showing the effect of sliced ELL. Right: Flat MPI (C3, C4), HB 4x4 (C4), HB 8x2 (C3), and HB 16x1 (C3), showing the effect of hCGA.]
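As a quick check of the headline mesh count (FX10 has 16 cores per node):

$$4{,}096 \text{ nodes} \times 16 \tfrac{\text{cores}}{\text{node}} = 65{,}536 \text{ cores}, \qquad 65{,}536 \times 64^{3} = 65{,}536 \times 262{,}144 = 17{,}179{,}869{,}184 \text{ meshes}.$$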
Framework for Exa-Feasible Applications: pK-Open-HPC
• Should be based on voxel-type meshes
• Robust/scalable geometric/algebraic multigrid preconditioning
– for various types of applications and PDEs
– Robust smoothers
– Low-rank approximations
– Overhead of coarse grid solvers (hCGA)
– Dot products
– Limited or fixed number of non-zero off-diagonals of the coefficient matrices at every level (generally speaking, AMG has many non-zero off-diagonals at coarse levels)
• Preprocessors
AMR: Octree (8/27 children), Self-Similar Blocks
• Block and cell: in a self-similar block, the cell size becomes half while the number of cells stays the same.
Framework (① time loop, ② block loop, ③ uniform-cell program):

    program amr_framework
      type(oct_Block), allocatable :: block(:)
      real :: A(nx,ny,nz), B(nx,ny,nz), C(nx,ny,nz)
      ! Initial parameter setup
      do istep = 1, Nstep                        ! (1) time loop
        ! Treatment of cell refinement
        do index = 1, Nall                       ! (2) block loop
          ! Main calculation: copy block data to uniform arrays
          A(:,:,:) = block(index)%F(1,:,:,:)
          B(:,:,:) = block(index)%F(2,:,:,:)
          C(:,:,:) = block(index)%F(3,:,:,:)
          call advance_field                     ! insert uniform-cell program
          ! Data copy back to block arrays
          block(index)%F(1,:,:,:) = A(:,:,:)
          block(index)%F(2,:,:,:) = B(:,:,:)
          block(index)%F(3,:,:,:) = C(:,:,:)
        enddo
        ! Outer boundary treatment
      enddo
      stop
    end program

    subroutine advance_field                     ! (3) uniform-cell program
      real :: A(nx,ny,nz), B(nx,ny,nz), C(nx,ny,nz)
      do iz = 1, nz
        do iy = 1, ny
          do ix = 1, nx
            A(ix,iy,iz) = ...
            B(ix,iy,iz) = ...
            C(ix,iy,iz) = ...
          enddo
        enddo
      enddo
      return
    end subroutine advance_field
• pK-Open-HPC
• Ill-Conditioned Problems
• SPPEXA Proposal
Extension of the Depth of Overlapping
[Figure: partitioned mesh for PE#0-PE#3 with internal nodes, external nodes, and overlapped elements; the overlapped (halo) region is extended by one additional layer.]
Cost for computation and communication may increase.
HID: Hierarchical Interface Decomposition [Henon & Saad 2007]
• Multilevel domain decomposition: an extension of nested dissection
• Non-overlapping at each level: connectors, separators
• Suitable for parallel preconditioning methods
[Figure: 2D example with four domains; level-1 nodes belong to a single domain (0, 1, 2, 3), level-2 nodes to connectors shared by two domains (0,1 / 0,2 / 1,3 / 2,3), and level-4 nodes to the separator shared by all four (0,1,2,3).]
Parallel ILU in HID for Each Connector at Each Level
• The unknowns are reordered according to their level numbers, from lowest to highest.
• The block structure of the reordered matrix leads to natural parallelism when ILU/IC decompositions or forward/backward substitutions are applied.
[Figure: block structure of the reordered matrix with level-1 blocks (0, 1, 2, 3), level-2 blocks (0,1 / 0,2 / 2,3 / 1,3), and the level-4 block (0,1,2,3).]
Results: 64 cores, contact problems, BILU(p)-(depth of overlapping), 3,090,903 DOF, GPBiCG
[Figure: elapsed time (sec.) and iteration counts for BILU(1), BILU(1+), and BILU(2), each with overlapping depth (0) (block Jacobi), (1), (1+), and HID.]
Hetero 3D (1/2)
• Parallel FEM code (flat MPI)
– 3D linear elasticity problems in cube geometries with heterogeneity
– SPD matrices
– Young's modulus: 10^-6 to 10^+6; (Emin, Emax) controls the condition number
• Preconditioned iterative solvers
– GPBi-CG [Zhang 1997]
– BILUT(p,d,t)
• Domain decomposition
– Localized block-Jacobi with extended overlapping (LBJ)
– HID / extended HID
[Figure: cube of (Nx-1)×(Ny-1)×(Nz-1) elements (Nx×Ny×Nz nodes); boundary conditions Ux=0 at x=Xmin, Uy=0 at y=Ymin, Uz=0 at z=Zmin, and a uniform distributed force in the z-direction at z=Zmax.]
Hetero 3D (2/2)
• Based on the parallel FEM procedures of GeoFEM
– Benchmark developed in the FP3C project under Japan-France collaboration
• Parallel mesh generation
– Fully parallel: each process generates its local mesh and assembles its local matrices.
– Total number of vertices in each direction: (Nx, Ny, Nz)
– Number of partitions in each direction: (Px, Py, Pz)
– Total number of MPI processes equals PxPyPz.
– Each MPI process has (Nx/Px)(Ny/Py)(Nz/Pz) vertices.
– The spatial distribution of Young's modulus is given by an external file, which contains heterogeneity information for a 128^3 cube; if Nx (or Ny or Nz) is larger than 128, the 128^3 pattern is repeated periodically in each direction (see the sketch below).
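The sizing and periodic-lookup logic described above can be sketched as follows. The partitioning (Px, Py, Pz) and all variable names are illustrative assumptions, not the benchmark's actual code.

```fortran
! Illustrative sketch (not the actual Hetero 3D code): each MPI rank owns an
! (Nx/Px) x (Ny/Py) x (Nz/Pz) block of vertices, and Young's modulus at a
! global vertex (ix,iy,iz) comes from a 128^3 table repeated periodically.
program hetero_sketch
  implicit none
  integer, parameter :: Nx=420, Ny=320, Nz=240   ! global vertex counts (example case)
  integer, parameter :: Px=10,  Py=16, Pz=24     ! one possible partitioning: 3,840 ranks
  real(8), allocatable :: young(:,:,:)           ! heterogeneity table (from external file)
  integer :: nxl, nyl, nzl, ix, iy, iz
  real(8) :: E

  nxl = Nx/Px; nyl = Ny/Py; nzl = Nz/Pz          ! local vertices per rank: 42 x 20 x 10
  print *, 'local block:', nxl, 'x', nyl, 'x', nzl, '=', nxl*nyl*nzl, 'vertices/rank'

  allocate(young(128,128,128))
  young = 1.0d0                                  ! placeholder; actually read from the file
  ix = 300; iy = 200; iz = 150                   ! some global vertex index
  E = young(mod(ix-1,128)+1, mod(iy-1,128)+1, mod(iz-1,128)+1)   ! periodic repetition
  print *, "Young's modulus at", ix, iy, iz, '=', E
end program hetero_sketch
```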
BILUT(p,d,t)
• Incomplete LU factorization with thresholds (ILUT)
• ILUT(p,d,t) [KN 2010]
– p: maximum fill level, specified before factorization
– d, t: dropping tolerances applied before/after factorization
• Flow: (a) components of the initial matrix A with |Aij| < d are dropped, giving the dropped matrix A'; (b) ILU(p) factorization of A' gives (ILU)'; (c) components of (ILU)' with |Aij| < t are dropped, giving (ILUT)' = ILUT(p,d,t).
• Step (b) can be replaced by other factorization methods or by more powerful direct linear solvers such as MUMPS, SuperLU, etc. (The sketch below illustrates the threshold-dropping step.)
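A minimal sketch of the dropping step, assuming a CRS-stored matrix with assumed array names; the actual ILUT(p,d,t) also applies a location criterion, which is omitted here.

```fortran
! Minimal sketch of threshold dropping for a CRS-stored matrix: off-diagonal
! entries with |a_ij| < tol are removed in place (diagonals are always kept).
subroutine drop_small(N, tol, idx, item, val)
  implicit none
  integer, intent(in)    :: N
  real(8), intent(in)    :: tol
  integer, intent(inout) :: idx(0:N), item(*)
  real(8), intent(inout) :: val(*)
  integer :: i, j, k, row_start_old, row_end_old

  k = 0                                   ! number of entries kept so far
  row_start_old = idx(0)                  ! = 0
  do i = 1, N
    row_end_old = idx(i)                  ! original end of row i
    do j = row_start_old+1, row_end_old
      if (item(j) == i .or. abs(val(j)) >= tol) then   ! keep diagonal + large entries
        k = k + 1
        item(k) = item(j)
        val(k)  = val(j)
      end if
    end do
    row_start_old = row_end_old           ! original start of the next row
    idx(i) = k                            ! new end of row i after compression
  end do
end subroutine drop_small
```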
Preliminary Results
• Hardware: 16-240 nodes (256-3,840 cores) of Fujitsu PRIMEHPC FX10 (Oakleaf-FX), University of Tokyo
• Problem setting
– 420×320×240 vertices (3.194×10^7 elements, 9.677×10^7 DOF)
– Strong scaling
– Effect of the thickness of the overlapped zones: BILUT(p,d,t)-LBJ-X (X = 1, 2, 3)
– RCM-entire renumbering for LBJ
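As a quick check of the problem-size figures above (3 DOF per vertex for 3D elasticity, one fewer element than vertices in each direction):

$$420 \times 320 \times 240 = 32{,}256{,}000 \text{ vertices} \;\Rightarrow\; 3 \times 32{,}256{,}000 = 96{,}768{,}000 \approx 9.677\times 10^{7} \text{ DOF},$$
$$419 \times 319 \times 239 = 31{,}944{,}979 \approx 3.194\times 10^{7} \text{ elements}.$$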
Effect of t on Performance
BILUT(2,0,t)-GPBi-CG with 240 nodes (3,840 cores), Emin = 10^-6, Emax = 10^+6
[Figure: ratios of [NNZ], iterations, and solver time as functions of t (0 to 3×10^-2) for BILUT(2,0,t)-HID and BILUT(2,0,t)-LBJ-2, normalized by the results of BILUT(2,0,0)-LBJ-2.]
BILUT(p,0,0) at 3,840 cores, no dropping: effect of fill-in
([NNZ] of [A]: 7.174×10^9)

| Preconditioner | NNZ of [M] | Set-up (sec.) | Solver (sec.) | Total (sec.) | Iterations |
|---|---|---|---|---|---|
| BILUT(1,0,0)-LBJ-1 | 1.920×10^10 | 1.35 | 65.2 | 66.5 | 1916 |
| BILUT(1,0,0)-LBJ-2 | 2.519×10^10 | 2.03 | 61.8 | 63.9 | 1288 |
| BILUT(1,0,0)-LBJ-3 | 3.197×10^10 | 2.79 | 74.0 | 76.8 | 1367 |
| BILUT(2,0,0)-LBJ-1 | 3.351×10^10 | 3.09 | 71.8 | 74.9 | 1339 |
| BILUT(2,0,0)-LBJ-2 | 4.394×10^10 | 4.39 | 65.2 | 69.6 | 939 |
| BILUT(2,0,0)-LBJ-3 | 5.631×10^10 | 5.95 | 83.6 | 89.6 | 1006 |
| BILUT(3,0,0)-LBJ-1 | 6.468×10^10 | 9.34 | 105.2 | 114.6 | 1192 |
| BILUT(3,0,0)-LBJ-2 | 8.523×10^10 | 12.7 | 98.4 | 111.1 | 823 |
| BILUT(3,0,0)-LBJ-3 | 1.101×10^11 | 17.3 | 101.6 | 118.9 | 722 |
| BILUT(1,0,0)-HID | 1.636×10^10 | 2.24 | 60.7 | 62.9 | 1472 |
| BILUT(2,0,0)-HID | 2.980×10^10 | 5.04 | 66.2 | 71.7 | 1096 |
BILUT(p,0,0) at 3,840 cores, no dropping: effect of overlapping
(Same table as above; [NNZ] of [A]: 7.174×10^9.)
BILUT(p,0,t) at 3,840 cores: optimum value of t
([NNZ] of [A]: 7.174×10^9)

| Preconditioner | NNZ of [M] | Set-up (sec.) | Solver (sec.) | Total (sec.) | Iterations |
|---|---|---|---|---|---|
| BILUT(1,0,2.75×10^-2)-LBJ-1 | 7.755×10^9 | 1.36 | 45.0 | 46.3 | 1916 |
| BILUT(1,0,2.75×10^-2)-LBJ-2 | 1.019×10^10 | 2.05 | 42.0 | 44.1 | 1383 |
| BILUT(1,0,2.75×10^-2)-LBJ-3 | 1.285×10^10 | 2.81 | 54.2 | 57.0 | 1492 |
| BILUT(2,0,1.00×10^-2)-LBJ-1 | 1.118×10^10 | 3.11 | 39.1 | 42.2 | 1422 |
| BILUT(2,0,1.00×10^-2)-LBJ-2 | 1.487×10^10 | 4.41 | 37.1 | 41.5 | 1029 |
| BILUT(2,0,1.00×10^-2)-LBJ-3 | 1.893×10^10 | 5.99 | 37.1 | 43.1 | 915 |
| BILUT(3,0,2.50×10^-2)-LBJ-1 | 8.072×10^9 | 9.35 | 38.4 | 47.7 | 1526 |
| BILUT(3,0,2.50×10^-2)-LBJ-2 | 1.063×10^10 | 12.7 | 35.5 | 48.3 | 1149 |
| BILUT(3,0,2.50×10^-2)-LBJ-3 | 1.342×10^10 | 17.3 | 40.9 | 58.2 | 1180 |
| BILUT(1,0,2.50×10^-2)-HID | 6.850×10^9 | 2.25 | 38.5 | 40.7 | 1313 |
| BILUT(2,0,1.00×10^-2)-HID | 1.030×10^10 | 5.04 | 36.1 | 41.1 | 1064 |
Strong Scaling up to 3,840 cores
Speed-up is computed from the elapsed computation time (set-up + solver), relative to BILUT(1,0,2.5×10^-2)-HID with 256 cores.
[Figure: speed-up and parallel performance (%) vs. core count for BILUT(1,0,2.50e-2)-HID, BILUT(2,0,1.00e-2)-HID, BILUT(1,0,2.75e-2)-LBJ-2, BILUT(2,0,1.00e-2)-LBJ-2, and BILUT(3,0,2.50e-2)-LBJ-2, compared with ideal scaling.]
MUMPS: Low-Rank-Approx.
• pK-Open-HPC
• Ill-Conditioned Problems
• SPPEXA Proposal
Developments for SPPEXA (1/2)
• Geometric/algebraic multigrid solvers
• Robust preconditioning methods with special partitioning methods
• pK-Open-HPC
– Robust and scalable
– Communication reducing/avoiding algorithms
– Extensions of sliced ELL for future architectures: Post K (SPARC), Xeon Phi, NVIDIA Volta
– U.Tokyo's systems: FY.2016, KNL-based system (20-30 PF); FY.2018-2019, post-FX10 system (SPARC?, NVIDIA?, 50 PF)
– OpenMP/MPI hybrid
– Automatic tuning (AT): parameter selection, stencil construction
Developments for SPPEXA (2/2)
• Deployment of pK-Open-HPC will be supported by JST/CREST (until 2016.3) and by the RIKEN-U.Tokyo collaboration (2014.12-2020), though without much funding.
– FY.2016, FY.2017: supported by JST/CREST if the proposal is accepted
– FY.2018 (until 2019.3): RIKEN-U.Tokyo collaboration; other funding needed
• Members
– University of Tokyo: K. Nakajima, T. Katagiri
– Hokkaido University: T. Iwashita (AMG)
– Kyoto University: A. Ida (AMG, low-rank)
– French partners?: Michel Dayde (Toulouse)
– Users, application groups
Multigrid Solvers (1/2)
• Geometric, algebraic; adaptive mesh refinement (AMR)
• Robust for various types of problems
– Target applications: Poisson, solid mechanics, ME
– Smoothers: ILU/IC etc.
• Carefully designed strategy of stencil construction for feasibility on post-peta/exascale systems
– Modified/extended sliced ELL
– Generally speaking, AMG matrices become dense at coarse levels, which is critical for load balancing on massively parallel systems; ELL is difficult there, so some strategy for stencil construction is needed.
Multigrid Solvers (2/2)
• MPI/OpenMP hybrid
– Reordering for extraction of parallelism
– hCGA-type strategy for communication reducing
• Automatic tuning
– Stencil construction
– Various parameters
Robust Preconditioners
• ILU/BILU(p,d,t)
– Strategy for domain decomposition
– Flexible overlapping, local reordering
– Extended HID
– AT for selection of optimum parameters
• Low-rank approximation
– H-matrix solver
– MUMPS
(Rough) Schedule (1/2)
• 1st year
– Multigrid: development & test
– ILU/BILU: development & test
– Low-rank approximation: development & test
• 2nd year
– Multigrid: development & test (cont.), optimization, AT
– ILU/BILU: development & test (cont.), optimization, AT
– Low-rank approximation: development & test (cont.), optimization
(Rough) Schedule (2/2)
• 3rd year
– Evaluations through real applications
– Multigrid: development & test, optimization, AT (cont.)
– ILU/BILU: development & test, optimization, AT (cont.)
– Low-rank approximation: development & test, optimization, AT (cont.)
• Proposal
– Deadline: Jan. 31, 2015
– KN attends a conference in Barcelona Jan. 29-30, and can visit Germany on Jan. 27 (T).