Robust & Efficient Parallel Preconditioning Methods in “Multi-Core Era”
October 2008
Integrated Predictive Simulation System for Earthquake and Tsunami Disaster
CREST/Japan Science and Technology Agency (JST)
http://www-sold.eps.s.u-tokyo.ac.jp/crest
Introduction
In this work, parallel preconditioning methods based on “Hierarchical Interface Decomposition (HID)” and hybrid parallel programming models were applied to finite-element simulations of linear elasticity problems in media with heterogeneous material properties. Reverse Cuthill-McKee reordering with cyclic multicoloring (CM-RCM) was applied to extract parallelism for OpenMP. The developed code was tested on the “T2K Open Super Computer (Todai Combined Cluster)” using up to 512 cores. Preconditioners based on HID provide scalable performance and robustness in comparison to conventional localized block Jacobi preconditioners. The performance of the Hybrid 4x4 parallel programming model is competitive with that of Flat MPI.
Parallel Programming Models on Multi-core Clusters
To minimize parallelization overheads on SMP (symmetric multiprocessor) and multi-core clusters, a multi-level hybrid parallel programming model is often employed. In this model, coarse-grained parallelism is achieved through domain decomposition with message passing among nodes, while fine-grained parallelism is obtained via loop-level parallelism inside each node using compiler-based thread parallelization techniques such as OpenMP. Another frequently used programming model is the single-level flat MPI model, in which a separate single-threaded MPI process is executed on each core. In general, the efficiency of each programming model depends on hardware performance, application features, and problem size.
Fig.1 Hybrid and Flat MPI Parallel Programming Models
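The difference between the two models can be made concrete by counting processes and threads for a fixed number of cores. The following is a minimal sketch, not part of the developed code. It assumes 16 cores per node (four quad-core sockets, as on T2K/Tokyo) and reads "Hybrid AxB" as A OpenMP threads per MPI process and B MPI processes per node; that reading is an assumption, chosen because it is consistent with Hybrid 4x4 assigning one MPI process to each socket. The function name `job_layout` is hypothetical.

```python
CORES_PER_NODE = 16  # assumption: 4 sockets x 4 cores per node


def job_layout(threads_per_process, nodes, cores_per_node=CORES_PER_NODE):
    """Return (MPI processes per node, total MPI processes, total threads)."""
    if cores_per_node % threads_per_process != 0:
        raise ValueError("threads per process must divide the cores per node")
    per_node = cores_per_node // threads_per_process
    processes = per_node * nodes
    return per_node, processes, processes * threads_per_process


# Threads per MPI process for each model (assumed naming convention).
MODELS = {"Flat MPI": 1, "Hybrid 4x4": 4, "Hybrid 8x2": 8}

if __name__ == "__main__":
    for name, threads in MODELS.items():
        per_node, procs, total = job_layout(threads, nodes=32)
        print(f"{name}: {per_node} processes/node, "
              f"{procs} processes x {threads} threads = {total} cores")
```

On 32 nodes every model occupies the same 512 cores, but the number of MPI processes, and therefore the number of subdomains and messages, shrinks as the thread count per process grows.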
Target Application
In the present work, linear elasticity problems in simple cube geometries of media with heterogeneous material properties are solved using a parallel finite-element method (FEM). Poisson’s ratio is set to 0.25 for all elements, while a heterogeneous distribution of Young’s modulus in each element is calculated by a sequential Gauss algorithm, which is widely used in the area of geostatistics. The minimum and maximum values of Young’s modulus are $10^{-3}$ and $10^{3}$, respectively, with an average value of 1.0. The GPBi-CG (Generalized Product-type methods based on Bi-CG) solver with the SGS (Symmetric Gauss-Seidel) preconditioner (SGS/GPBi-CG) was applied. The code is based on the GeoFEM framework for parallel FEM procedures, and GeoFEM’s local data structure is used.
Fig.2 Heterogeneous Distribution of Material Property
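To see how strongly the element stiffness varies under these settings, the Lamé parameters can be computed from Young’s modulus and Poisson’s ratio with the standard relations for isotropic linear elasticity. This is a small illustrative sketch; the function name `lame_parameters` is hypothetical and the snippet is not taken from GeoFEM.

```python
def lame_parameters(young, poisson=0.25):
    """Return (lambda, mu) for an isotropic linear elastic material."""
    if not -1.0 < poisson < 0.5:
        raise ValueError("Poisson's ratio must lie in (-1, 0.5)")
    lam = young * poisson / ((1.0 + poisson) * (1.0 - 2.0 * poisson))
    mu = young / (2.0 * (1.0 + poisson))
    return lam, mu


if __name__ == "__main__":
    # Minimum, average and maximum Young's modulus quoted in the text.
    for young in (1.0e-3, 1.0, 1.0e3):
        lam, mu = lame_parameters(young)
        print(f"E = {young:g}: lambda = {lam:g}, mu = {mu:g}")
```

With Poisson’s ratio fixed at 0.25, both parameters equal 0.4 times Young’s modulus, so the coefficients of neighboring elements can differ by up to six orders of magnitude. That contrast is what makes the resulting matrices ill-conditioned.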
HID (Hierarchical Interface Decomposition)
Localized block Jacobi ILU/IC preconditioners are widely used for parallel iterative solvers. They provide excellent parallel performance for well-defined problems, although the number of iterations required for convergence gradually increases with the number of processors. However, this preconditioning technique is not robust for ill-conditioned problems with many processors, because it ignores the global effect of external nodes in other domains. The Parallel Hierarchical Interface Decomposition Algorithm (PHIDAL) [Henon & Saad 2007] provides robustness and scalability for parallel ILU/IC preconditioners. PHIDAL is based on defining a “hierarchical interface decomposition (HID)”. The HID process starts with a partitioning of the graph, with one layer of overlap. The “levels” are defined from this partitioning, with each level consisting of a set of vertex groups. Each vertex group of a given level is a separator for vertex groups of a lower level.
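The statement that localized block Jacobi ignores couplings to external nodes can be illustrated by counting how many matrix couplings are cut by the partition. The sketch below is an illustration under simplifying assumptions, not the poster’s 3D FEM setting: it uses a 2D n x n grid with a 5-point stencil and a regular px x py block partition. The name `dropped_couplings` is hypothetical.

```python
def dropped_couplings(n, px, py):
    """Count grid edges cut by a px x py partition of an n x n grid.

    Returns (cut, total): the couplings a localized block Jacobi
    preconditioner would ignore, and the total number of couplings.
    """
    def domain(r, c):
        # Regular block partition: py blocks of rows, px blocks of columns.
        return (r * py // n) * px + (c * px // n)

    total = cut = 0
    for r in range(n):
        for c in range(n):
            # Visit each edge once via the "down" and "right" neighbors.
            for rr, cc in ((r + 1, c), (r, c + 1)):
                if rr < n and cc < n:
                    total += 1
                    if domain(r, c) != domain(rr, cc):
                        cut += 1
    return cut, total


if __name__ == "__main__":
    for p in (1, 2, 4):
        cut, total = dropped_couplings(8, p, p)
        print(f"{p * p:2d} domains: {cut} of {total} couplings ignored")
```

The share of ignored couplings grows with the number of domains, which matches the observation that the iteration count increases with the number of processors.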
HID (Hierarchical Interface Decomposition) (cont.)
If the unknowns are reordered according to their level numbers, from the lowest to the highest, the reordered matrix has the block structure shown in Fig.3. This block structure leads to natural parallelism when ILU/IC decompositions or forward/backward substitution processes are applied.
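The level-based reordering can be sketched on a small example. The code below is a simplified illustration, not PHIDAL itself: it assumes a 7 x 7 grid with a 5-point stencil, split into four subdomains that overlap by one layer along the middle row and column. Each vertex is labeled by the subdomains that contain it, and its level is taken as the number of those subdomains. The function names are hypothetical.

```python
def group_of(r, c, n=7):
    """Return the sorted tuple of subdomains (0..3) that contain vertex (r, c)."""
    mid = n // 2
    rows = (0,) if r < mid else (1,) if r > mid else (0, 1)
    cols = (0,) if c < mid else (1,) if c > mid else (0, 1)
    return tuple(sorted(2 * jr + jc for jr in rows for jc in cols))


def hid_ordering(n=7):
    """Order vertices by level (number of owning subdomains), then by group."""
    verts = [(r, c) for r in range(n) for c in range(n)]
    return sorted(verts, key=lambda v: (len(group_of(*v, n)), group_of(*v, n), v))


def same_level_couplings(n=7):
    """List edges joining two different vertex groups of the same level."""
    bad = []
    for r in range(n):
        for c in range(n):
            for rr, cc in ((r + 1, c), (r, c + 1)):
                if rr < n and cc < n:
                    g, h = group_of(r, c, n), group_of(rr, cc, n)
                    if g != h and len(g) == len(h):
                        bad.append(((r, c), (rr, cc)))
    return bad


if __name__ == "__main__":
    order = hid_ordering()
    print("first vertex:", order[0], "group", group_of(*order[0]))
    print("last vertex:", order[-1], "group", group_of(*order[-1]))
    print("couplings between groups of one level:", same_level_couplings())
```

In this example no edge joins two different groups of the same level, so all groups of one level can be factorized or substituted independently. With denser stencils this property does not hold automatically, which is one reason the sketch should not be read as a full HID construction.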
(a) Domain Decomposition: four subdomains 0, 1, 2, 3. Level-1 vertex groups: 0, 1, 2, 3 (subdomain interiors). Level-2 vertex groups: 0,1 / 0,2 / 2,3 / 1,3 (interfaces shared by two subdomains). Level-4 vertex group: 0,1,2,3 (cross point shared by all four subdomains).
do lev= 1, LEVELtot                      ! HID levels, from lowest to highest
  do ic= 1, COLORtot(lev)                ! colors are processed one after another
!$omp parallel do private(ip,i,SW1,SW2,SW3,isL,ieL,j,k,X1,X2,X3)
    do ip= 1, PEsmpTOT                   ! chunks of rows within one color
      do i = STACKmc(ip-1,ic,lev)+1, STACKmc(ip,ic,lev)
        SW1= WW(3*i-2,R); SW2= WW(3*i-1,R); SW3= WW(3*i  ,R)
        isL= INL(i-1)+1; ieL= INL(i)
        do j= isL, ieL                   ! subtract lower-triangular 3x3 blocks
          k= IAL(j)
          X1= WW(3*k-2,R); X2= WW(3*k-1,R); X3= WW(3*k  ,R)
          SW1= SW1 - AL(9*j-8)*X1 - AL(9*j-7)*X2 - AL(9*j-6)*X3
          SW2= SW2 - AL(9*j-5)*X1 - AL(9*j-4)*X2 - AL(9*j-3)*X3
          SW3= SW3 - AL(9*j-2)*X1 - AL(9*j-1)*X2 - AL(9*j  )*X3
        enddo
        ! apply the factorized 3x3 diagonal block stored in ALU
        X1= SW1; X2= SW2; X3= SW3
        X2= X2 - ALU(9*i-5)*X1
        X3= X3 - ALU(9*i-2)*X1 - ALU(9*i-1)*X2
        X3= ALU(9*i  )* X3
        X2= ALU(9*i-4)*( X2 - ALU(9*i-3)*X3 )
        X1= ALU(9*i-8)*( X1 - ALU(9*i-6)*X3 - ALU(9*i-7)*X2)
        WW(3*i-2,R)= X1; WW(3*i-1,R)= X2; WW(3*i  ,R)= X3
      enddo
    enddo
!$omp end parallel do
  enddo
  call SOLVER_SEND_RECV_3_LEV(lev,…)     ! communications using hierarchical comm. tables
enddo
Fig.3 Domain/block decomposition of the matrix according to the HID reordering: (a) Domain Decomposition, (b) Matrix Block, (c) Forward Substitution Process of SGS
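In the forward substitution of Fig.3(c), colors are processed sequentially while the rows of one color are handled in parallel by OpenMP. This only works if rows of the same color do not depend on each other. The following sketch shows the idea on a 2D grid with a 5-point stencil: breadth-first level sets from a corner (in the spirit of Cuthill-McKee) are assigned to colors cyclically. It is a simplified stand-in for CM-RCM, not the implementation used in the developed code, and all function names are hypothetical.

```python
from collections import deque


def grid_neighbors(n):
    """Adjacency lists of an n x n grid with a 5-point stencil."""
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    return {
        (r, c): [(r + dr, c + dc) for dr, dc in steps
                 if 0 <= r + dr < n and 0 <= c + dc < n]
        for r in range(n) for c in range(n)
    }


def bfs_levels(adj, start):
    """Breadth-first level sets, as used by Cuthill-McKee type orderings."""
    level = {start: 0}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in level:
                level[w] = level[v] + 1
                queue.append(w)
    return level


def cyclic_colors(level, ncolors):
    """Assign level sets to colors cyclically: level L gets color L mod ncolors."""
    return {v: lev % ncolors for v, lev in level.items()}


def is_independent(adj, color):
    """True if no two coupled vertices share a color."""
    return all(color[v] != color[w] for v in adj for w in adj[v])


if __name__ == "__main__":
    adj = grid_neighbors(6)
    color = cyclic_colors(bfs_levels(adj, (0, 0)), ncolors=3)
    print("rows of one color are independent:", is_independent(adj, color))
```

On this grid, coupled vertices always lie in adjacent level sets, so any cyclic coloring with at least two colors yields independent rows within each color. On general meshes vertices of one level set can be coupled, and a real multicoloring scheme has to handle that case.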
T2K Open Super Computer (Todai Combined Cluster) (T2K/Tokyo)
The developed code was tested on the “T2K Open Super Computer (Todai Combined Cluster) (T2K/Tokyo)” at the University of Tokyo. T2K/Tokyo was developed by Hitachi under the “T2K Open Supercomputer Alliance”. It is an AMD Quad-core Opteron-based combined cluster system with 952 nodes, 15,232 cores and 31 TB of memory.
Total peak performance is 140.1 TFLOPS. Each node includes four “sockets” of AMD Quad-core Opteron processors (2.3 GHz), as shown in Fig.4. The nodes are connected via a Myrinet-10G network. In the present work, 32 nodes of the system were used. Because T2K/Tokyo is based on a cc-NUMA architecture, careful design of software and data structures is required for efficient access to local memory.
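The quoted system figures can be cross-checked with a few lines of arithmetic. The node count, sockets per node, cores per socket and clock rate come from the text; the value of four double-precision floating-point operations per core per cycle is an assumption, chosen because it reproduces the stated peak. The names below are hypothetical.

```python
NODES = 952
SOCKETS_PER_NODE = 4
CORES_PER_SOCKET = 4
CLOCK_GHZ = 2.3
FLOPS_PER_CYCLE = 4  # assumption: double-precision flops per core per cycle


def total_cores(nodes=NODES):
    """Number of cores in the given number of nodes."""
    return nodes * SOCKETS_PER_NODE * CORES_PER_SOCKET


def peak_tflops(nodes=NODES, flops_per_cycle=FLOPS_PER_CYCLE):
    """Theoretical peak performance in TFLOPS."""
    return total_cores(nodes) * CLOCK_GHZ * flops_per_cycle / 1000.0


if __name__ == "__main__":
    print(f"full system: {total_cores()} cores, {peak_tflops():.1f} TFLOPS")
    print(f"32 nodes:    {total_cores(32)} cores, {peak_tflops(32):.2f} TFLOPS")
```

The 32 nodes used in this work correspond to 512 cores, a small fraction of the full system.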
Fig.4 Overview of T2K/Tokyo (Entire System and Each Node)
http://www.open-supercomputer.org/
Results & Future Works
Preliminary tests of the developed code were conducted on T2K/Tokyo using up to 512 cores. Preconditioners based on HID provide scalable performance and robustness in comparison to conventional localized block Jacobi preconditioners. The performance of the Hybrid 4x4 parallel programming model is competitive with that of Flat MPI. HID-based preconditioning with the hybrid parallel programming model is expected to be a good choice for scalable performance and robustness on clusters of multi-core processors with more than $10^{4}$ cores, if one MPI process is assigned to each socket, as in Hybrid 4x4 in this work.
In the present work, fill-ins have not been considered in the HID procedures. The gain from HID over conventional localized block Jacobi preconditioning has therefore not been very significant. More robust preconditioning methods based on HID may be developed by considering fill-ins inside and between connectors for realistic applications with ill-conditioned matrices. Moreover, the developed methods may be further evaluated on various types of clusters with more cores.
[Plot: Relative Performance (axis range 0.80 to 1.80) versus core# (32, 64, 128, 192, 256, 384, 512) for Flat MPI, Hybrid 4x4 and Hybrid 8x2]
Fig.5 Strong Scaling Test up to 512 cores of T2K/Tokyo
[Bar chart: Relative Performance (axis range 0.00 to 1.25) of Flat MPI, Hybrid 4x4 and Hybrid 8x2 at 128 cores and 512 cores]
Relative Performance of HID normalized by results of Localized Block Jacobi
Relative Performance of three parallel programming models (Flat MPI, Hybrid 4x4 and Hybrid 8x2), normalized by results of Flat MPI