Robust & Efficient Parallel Preconditioning Methods in “Multi-Core Era”

October 2008

Integrated Predictive Simulation System for Earthquake and Tsunami Disaster

CREST/Japan Science and Technology Agency (JST)
http://www-sold.eps.s.u-tokyo.ac.jp/crest

Introduction

In this work, parallel preconditioning methods based on “Hierarchical Interface Decomposition (HID)” and hybrid parallel programming models were applied to finite-element-based simulations of linear elasticity problems in media with heterogeneous material properties. Reverse Cuthill-McKee reordering with cyclic multicoloring (CM-RCM) was applied to extract parallelism for OpenMP. The developed code was tested on the “T2K Open Super Computer (Todai Combined Cluster)” using up to 512 cores. Preconditioners based on HID provide scalable performance and robustness in comparison to conventional localized block Jacobi preconditioners. The performance of the Hybrid 4x4 parallel programming model is competitive with that of Flat MPI.

Parallel Programming Models on Multi-core Clusters

In order to achieve minimal parallelization overheads on SMP (symmetric multiprocessor) and multi-core clusters, a multi-level hybrid parallel programming model is often employed. In this method, coarse-grained parallelism is achieved through domain decomposition by message passing among nodes, while fine-grained parallelism is obtained via loop-level parallelism inside each node using compiler-based thread parallelization techniques, such as OpenMP. Another often-used programming model is the single-level flat MPI model, in which separate single-threaded MPI processes are executed on each core. Generally, the efficiency of each programming model depends on hardware performance, application features, and problem size.
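As a rough illustration of how the two models divide the same hardware, the following Python sketch counts MPI processes and threads for the 32 nodes of 16 cores each used in this work. The function name `layout` is hypothetical. Reading “Hybrid 4x4” as 4 threads in each of 4 processes per node follows the statement that one MPI process is assigned to each socket; reading “Hybrid 8x2” as 8 threads in each of 2 processes per node is an assumption.

```python
def layout(nodes, cores_per_node, threads_per_process):
    """Return (total MPI processes, processes per node, threads per process).

    Hypothetical helper: Flat MPI corresponds to threads_per_process = 1,
    a hybrid model to threads_per_process > 1.
    """
    if cores_per_node % threads_per_process != 0:
        raise ValueError("threads per process must divide cores per node")
    procs_per_node = cores_per_node // threads_per_process
    return nodes * procs_per_node, procs_per_node, threads_per_process


# 32 nodes with 16 cores each (4 sockets x 4 cores), as used on T2K/Tokyo.
models = {
    "Flat MPI": 1,    # one single-threaded MPI process per core
    "Hybrid 4x4": 4,  # 4 threads per process, one process per socket
    "Hybrid 8x2": 8,  # assumed: 8 threads per process, 2 processes per node
}
for name, threads in models.items():
    total, per_node, _ = layout(32, 16, threads)
    print(f"{name}: {total} MPI processes, {per_node} per node, {threads} threads each")
```

In every case the product of processes and threads is the same 512 cores; only the boundary between message passing and shared-memory threading moves.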

Fig.1 Hybrid and Flat MPI Parallel Programming Models

Target Application

In the present work, linear elasticity problems in simple cube geometries of media with heterogeneous material properties are solved using a parallel finite-element method (FEM). Poisson’s ratio is set to 0.25 for all elements, while a heterogeneous distribution of Young’s modulus in each element is calculated by a sequential Gauss algorithm, which is widely used in the area of geostatistics. The minimum and maximum values of Young’s modulus are 10^-3 and 10^3, respectively, with an average value of 1.0. The GPBi-CG (Generalized Product-type methods based on Bi-CG) solver with an SGS (Symmetric Gauss-Seidel) preconditioner (SGS/GPBi-CG) was applied. The code is based on the GeoFEM framework for parallel FEM procedures, and GeoFEM’s local data structure is used.
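To make the SGS preconditioner concrete, the sketch below applies it to a residual vector in plain Python. It assumes the standard symmetric Gauss-Seidel form M = (D + L) D^-1 (D + U), where D, L and U are the diagonal, strictly lower and strictly upper parts of A, and it uses a small dense matrix for readability. The function name `sgs_apply` is hypothetical, and this is not the poster's code, which works on 3x3 blocks in sparse storage.

```python
def sgs_apply(A, r):
    """Solve M z = r with M = (D + L) D^-1 (D + U) for a dense matrix A.

    Illustrative sketch only: A is a list of rows with nonzero diagonal.
    """
    n = len(A)
    # Forward substitution: (D + L) y = r
    y = [0.0] * n
    for i in range(n):
        s = r[i] - sum(A[i][j] * y[j] for j in range(i))
        y[i] = s / A[i][i]
    # Backward substitution: (D + U) z = D y
    z = [0.0] * n
    for i in reversed(range(n)):
        s = A[i][i] * y[i] - sum(A[i][j] * z[j] for j in range(i + 1, n))
        z[i] = s / A[i][i]
    return z


A = [[4.0, 1.0, 0.0],
     [1.0, 4.0, 1.0],
     [0.0, 1.0, 4.0]]
z = sgs_apply(A, [1.0, 2.0, 3.0])
print(z)
```

The forward sweep is the part that must respect data dependencies between unknowns, which is why the ordering of unknowns matters for parallel execution.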

Fig.2 Heterogeneous Distribution of Material Property

HID (Hierarchical Interface Decomposition)

Localized block Jacobi ILU/IC preconditioners are widely used for parallel iterative solvers. They provide excellent parallel performance for well-defined problems, although the number of iterations required for convergence gradually increases with the number of processors. However, this preconditioning technique is not robust for ill-conditioned problems with many processors, because it ignores the global effect of external nodes in other domains. The Parallel Hierarchical Interface Decomposition Algorithm (PHIDAL) [Henon & Saad 2007] provides robustness and scalability for parallel ILU/IC preconditioners. PHIDAL is based on defining a “hierarchical interface decomposition (HID)”. The HID process starts with a partitioning of the graph, with one layer of overlap. The “levels” are defined from this partitioning, with each level consisting of a set of vertex groups. Each vertex group of a given level is a separator for vertex groups of a lower level.
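The grouping can be sketched in Python for the four-domain layout of Fig.3(a). The sketch assumes a 6x6 grid of elements split into four quadrant domains, takes the domains of a vertex to be those of the elements touching it, and takes the level of a vertex group to be the number of domains that share it. These are illustrative assumptions, and `hid_groups` is a hypothetical name; this is not the PHIDAL implementation.

```python
from collections import defaultdict


def hid_groups(n_elem=6):
    """Group the vertices of an n_elem x n_elem element grid by domain set.

    Returns {tuple of domain ids: list of (vx, vy) vertices}.
    Domains are quadrants: 0 and 1 in the lower half, 2 and 3 in the upper half.
    """
    half = n_elem // 2
    groups = defaultdict(list)
    for vy in range(n_elem + 1):
        for vx in range(n_elem + 1):
            owners = set()
            # A vertex touches up to four surrounding elements.
            for ey in (vy - 1, vy):
                for ex in (vx - 1, vx):
                    if 0 <= ex < n_elem and 0 <= ey < n_elem:
                        owners.add((1 if ex >= half else 0) + 2 * (1 if ey >= half else 0))
            groups[tuple(sorted(owners))].append((vx, vy))
    return dict(groups)


groups = hid_groups(6)
for label in sorted(groups, key=lambda g: (len(g), g)):
    print(f"level {len(label)}: domains {label}, {len(groups[label])} vertices")
```

With this layout the vertex groups fall into levels 1, 2 and 4: interior vertices of one domain, vertices on a line shared by two domains, and the single centre vertex shared by all four.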


HID (Hierarchical Interface Decomposition) (cont.)

If the unknowns are reordered according to their level numbers, from the lowest to the highest, the block structure of the reordered matrix is as shown in Fig.3. This block structure leads to natural parallelism when ILU/IC decompositions or forward/backward substitution processes are applied.
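A minimal Python sketch of this reordering, under stated assumptions: a 1D chain of five unknowns with a tridiagonal matrix, two domains, and one separator vertex shared by both. The names `reorder_by_level` and `permute` are hypothetical. After sorting by level, the two Level-1 blocks have no coupling to each other, so they can be processed in parallel before the separator.

```python
def reorder_by_level(labels):
    """Return the old indices sorted by (level, domain set).

    labels[v] is the tuple of domains that vertex v belongs to;
    its length is taken as the level of the vertex.
    """
    return sorted(range(len(labels)), key=lambda v: (len(labels[v]), labels[v]))


def permute(A, order):
    """Symmetric permutation of a dense matrix given as a list of rows."""
    return [[A[i][j] for j in order] for i in order]


# Chain 0-1-2-3-4: vertices 0,1 in domain 0, vertices 3,4 in domain 1,
# vertex 2 is the separator shared by both domains.
labels = [(0,), (0,), (0, 1), (1,), (1,)]
A = [[2, -1, 0, 0, 0],
     [-1, 2, -1, 0, 0],
     [0, -1, 2, -1, 0],
     [0, 0, -1, 2, -1],
     [0, 0, 0, -1, 2]]

order = reorder_by_level(labels)
P = permute(A, order)
for row in P:
    print(row)
```

In the permuted matrix the leading 4x4 part is block diagonal (two 2x2 blocks), and all coupling between the domains passes through the last row and column, which belong to the separator.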

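(a) Domain Decomposition (vertices labeled with the domains they belong to)

  2    2    2    2,3      3    3    3
  2    2    2    2,3      3    3    3
  2    2    2    2,3      3    3    3
  0,2  0,2  0,2  0,1,2,3  1,3  1,3  1,3
  0    0    0    0,1      1    1    1
  0    0    0    0,1      1    1    1
  0    0    0    0,1      1    1    1

(b) Matrix Block (diagonal blocks after reordering)

  Level-1: 0 | 1 | 2 | 3
  Level-2: 0,1 | 0,2 | 2,3 | 1,3
  Level-4: 0,1,2,3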

(c) Forward Substitution Proc. of SGS

      do lev= 1, LEVELtot
        do ic= 1, COLORtot(lev)
!$omp parallel do private(ip,i,SW1,SW2,SW3,isL,ieL,j,k,X1,X2,X3)
          do ip= 1, PEsmpTOT
            do i= STACKmc(ip-1,ic,lev)+1, STACKmc(ip,ic,lev)
              ! right-hand side of node i (3 components per node)
              SW1= WW(3*i-2,R); SW2= WW(3*i-1,R); SW3= WW(3*i  ,R)
              isL= INL(i-1)+1; ieL= INL(i)
              ! subtract contributions of the lower-triangular 3x3 blocks
              do j= isL, ieL
                k= IAL(j)
                X1= WW(3*k-2,R); X2= WW(3*k-1,R); X3= WW(3*k  ,R)
                SW1= SW1 - AL(9*j-8)*X1 - AL(9*j-7)*X2 - AL(9*j-6)*X3
                SW2= SW2 - AL(9*j-5)*X1 - AL(9*j-4)*X2 - AL(9*j-3)*X3
                SW3= SW3 - AL(9*j-2)*X1 - AL(9*j-1)*X2 - AL(9*j  )*X3
              enddo
              ! solve with the 3x3 diagonal block stored in ALU
              X1= SW1; X2= SW2; X3= SW3
              X2= X2 - ALU(9*i-5)*X1
              X3= X3 - ALU(9*i-2)*X1 - ALU(9*i-1)*X2
              X3= ALU(9*i  )* X3
              X2= ALU(9*i-4)*( X2 - ALU(9*i-3)*X3 )
              X1= ALU(9*i-8)*( X1 - ALU(9*i-6)*X3 - ALU(9*i-7)*X2)
              WW(3*i-2,R)= X1; WW(3*i-1,R)= X2; WW(3*i  ,R)= X3
            enddo
          enddo
!$omp end parallel do
        enddo
        ! communications using hierarchical comm. tables
        call SOLVER_SEND_RECV_3_LEV(lev,…)
      enddo

Fig.3 Domain/block decomposition of the matrix according to the HID reordering: (a) Domain Decomposition, (b) Matrix Block, (c) Forward Substitution Proc. of SGS
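The color loop (`ic`) in Fig.3(c) relies on unknowns of the same color being independent, which is what the CM-RCM reordering mentioned in the Introduction is meant to provide. The Python sketch below is a simplified stand-in, not the poster's implementation: it uses breadth-first level sets from one root vertex in place of the RCM level sets, assigns colors cyclically (`level % ncolors`), and checks independence on a 2D grid with a 5-point stencil. All function names are hypothetical, and the graph is assumed to be connected.

```python
from collections import deque


def bfs_levels(adj, root=0):
    """Breadth-first level of every vertex reachable from root."""
    level = {root: 0}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in level:
                level[w] = level[v] + 1
                queue.append(w)
    return level


def cyclic_multicolor(adj, ncolors, root=0):
    """Assign color = level mod ncolors (cyclic renumbering of level sets)."""
    level = bfs_levels(adj, root)
    return {v: level[v] % ncolors for v in adj}


def is_independent(adj, color):
    """True if no two neighbouring vertices share a color."""
    return all(color[v] != color[w] for v in adj for w in adj[v])


def grid_graph(nx, ny):
    """Adjacency lists of an nx x ny grid with a 5-point stencil."""
    adj = {}
    for j in range(ny):
        for i in range(nx):
            v = j * nx + i
            nbrs = []
            if i > 0:
                nbrs.append(v - 1)
            if i < nx - 1:
                nbrs.append(v + 1)
            if j > 0:
                nbrs.append(v - nx)
            if j < ny - 1:
                nbrs.append(v + nx)
            adj[v] = nbrs
    return adj


adj = grid_graph(4, 4)
color = cyclic_multicolor(adj, ncolors=3)
print(is_independent(adj, color))
```

If the independence check fails for a given graph, the number of colors has to be increased; wider stencils, such as those of 3D finite elements, generally need more colors than this 5-point example.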

T2K Open Super Computer (Todai Combined Cluster) (T2K/Tokyo)

The developed code has been tested on the “T2K Open Super Computer (Todai Combined Cluster) (T2K/Tokyo)” at the University of Tokyo. T2K/Tokyo was developed by Hitachi under the “T2K Open Supercomputer Alliance”. It is an AMD Quad-core Opteron-based combined cluster system with 952 nodes, 15,232 cores and 31 TB of memory.


The total peak performance is 140.1 TFLOPS. Each node includes four “sockets” of AMD Quad-core Opteron processors (2.3 GHz), as shown in Fig.4. The nodes are connected via a Myrinet-10G network. In the present work, 32 nodes of the system have been used. Because T2K/Tokyo is based on a cache-coherent NUMA (cc-NUMA) architecture, careful design of software and data structures is required for efficient access to local memory.
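The quoted figures are mutually consistent, as the following back-of-the-envelope Python check suggests. The node count, sockets per node, cores per socket and clock rate are taken from the text; the value of 4 double-precision floating-point operations per core per clock cycle is an assumption that is not stated on the poster, and `peak_tflops` is a hypothetical helper.

```python
def total_cores(nodes, sockets_per_node, cores_per_socket):
    """Total number of cores in the system."""
    return nodes * sockets_per_node * cores_per_socket


def peak_tflops(cores, clock_hz, flops_per_cycle):
    """Theoretical peak in TFLOPS: cores x clock x flops per cycle."""
    return cores * clock_hz * flops_per_cycle / 1.0e12


cores = total_cores(nodes=952, sockets_per_node=4, cores_per_socket=4)
# flops_per_cycle=4 is an assumed value, not taken from the poster.
peak = peak_tflops(cores, clock_hz=2.3e9, flops_per_cycle=4)
print(cores, round(peak, 1))

# The 32 nodes used in this work correspond to 512 cores.
print(total_cores(nodes=32, sockets_per_node=4, cores_per_socket=4))
```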

Fig.4 Overview of T2K/Tokyo (Entire System and Each Node)
http://www.open-supercomputer.org/

Results & Future Works

Preliminary tests of the developed code have been conducted using up to 512 cores of the T2K/Tokyo system. Preconditioners based on HID provide scalable performance and robustness in comparison to conventional localized block Jacobi preconditioners. The performance of the Hybrid 4x4 parallel programming model is competitive with that of Flat MPI.

HID-based preconditioning with the hybrid parallel programming model is expected to be a good choice for scalable performance and robustness on clusters of multi-core processors with more than 10^4 cores, if one MPI process is assigned to each socket, as in Hybrid 4x4 in this work.

In the present work, no fill-in has been considered in the HID procedures. The advantage of HID over conventional localized block Jacobi preconditioning has therefore not been very significant. More robust preconditioning methods based on HID may be developed by considering fill-ins inside and between connectors for realistic applications with ill-conditioned matrices. Moreover, the developed methods may be further evaluated on various types of clusters with more cores.

Fig.5 Strong Scaling Test up to 512 cores of T2K/Tokyo

(Line chart: relative performance versus core# at 32, 64, 128, 192, 256, 384 and 512 cores for Flat MPI, Hybrid 4x4 and Hybrid 8x2. Bar chart: relative performance of Flat MPI, Hybrid 4x4 and Hybrid 8x2 at 128 cores and 512 cores.)

Relative Performance of HID normalized by results of Localized Block Jacobi

Relative Performance of three parallel programming models (Flat MPI, Hybrid 4x4 and Hybrid 8x2), normalized by results of Flat MPI
