
Takeshi Iwashita (Kyoto Univ.)
Hiroshi Nakashima (Kyoto Univ.)
Yasuhito Takahashi (Doshisha Univ.)

IPDPS 2012 (Main track), pp. 474-483

Main issue
◦ We propose an algebraic block multi-color ordering method for the parallel ICCG method for unstructured mesh problems.

Overview
◦ Brief introduction of the ICCG method and the objective of the present research
◦ Previous method: explanation of the block multi-color ordering method for the finite difference method
◦ Proposed method: explanation of the algebraic block multi-color ordering method
◦ Numerical results
◦ Conclusions

Incomplete Cholesky Conjugate Gradient (ICCG) Method

The most popular linear solver for a linear system of equations with a sparse symmetric positive-definite coefficient matrix
◦ Utilized as a linear solver in small- or middle-scale simulations
◦ Utilized as a building block in large-scale simulations on supercomputers, for example, as an IC smoother in the multigrid method and as a subdomain solver in the domain decomposition method

The main issue is the parallelization of the forward and backward substitutions (see the sketch after this list).
◦ The other CG kernels (sparse matrix-vector multiplication, inner products, vector updates) can be straightforwardly parallelized.
◦ The same substitution procedure is used in the IC (ILU) smoother, the Gauss-Seidel method (smoother), and the SOR method.
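To make the dependency concrete, the following sketch (ours, not the paper's code; CRS arrays row_ptr, diag_ptr, col_idx, val and vectors r, y are assumed names) shows a forward substitution L*y = r: computing y(i) reads y-entries of preceding unknowns, so the outer loop cannot be parallelized as-is.

    ! Forward substitution L*y = r in CRS form (minimal sketch, assumed
    ! array names). y(i) reads previously computed entries of y, so the
    ! i-loop is inherently sequential without reordering.
    do i = 1, n
       t = r(i)
       do j = row_ptr(i), diag_ptr(i) - 1   ! strictly lower part of row i
          t = t - val(j) * y(col_idx(j))    ! depends on earlier y values
       end do
       y(i) = t / val(diag_ptr(i))          ! divide by the diagonal entry
    end do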

Significance of multi-thread parallel processing in large-scale analysis
◦ SMP clusters are the major configuration of supercomputers.
◦ When the number of nodes is increased, hybrid (multi-process / multi-thread) parallelization becomes more advantageous.
◦ Multi-process parallelization is utilized by the domain decomposition method, which has an advantage in communication.
◦ Multi-thread parallelization for subdomain solvers and building blocks is therefore important.

Additive Schwarz type methods (Localized ICCG, Block ICCG)

Reordering (parallel ordering)
◦ Unknowns or grid points are reordered into an order appropriate for parallel processing.
◦ Parallel node ordering (finite difference analysis): multi-color ordering, domain decomposition ordering, dissection ordering, ...; block multi-color ordering, block red-black ordering

T. Iwashita and M. Shimasaki, Int. J. Parallel Programming, 31 (2003), pp. 55-75.
T. Iwashita, Y. Nakanishi and M. Shimasaki, SIAM J. Sci. Comput., 26 (2005), pp. 1234-1260.

◦ Parallel ordering for linear systems with general sparse coefficient matrices: extensions of parallel node ordering

Extension of multi-color ordering
◦ Jones and Plassmann's method
◦ Algebraic multi-color ordering method

Extension of domain decomposition type ordering
◦ HID method by Henon and Saad

Extension of block red-black ordering
◦ ABRB method
◦ Effective for small degrees of parallelism (Np < 8)

We propose algebraic block multi-color ordering, which is considered a generalization of block multi-color ordering.

[Figure: blocking of nodes; multi-color ordering (4 colors) vs. block multi-color ordering (block size 4, 3 colors).]

◦ Nodes with an identical color are independent of one another; blocks with an identical color can be handled in parallel.
◦ Substitutions are parallelized within each color.
◦ Blocking improves the cache hit ratio and the convergence.
◦ Problem with node-wise multi-color ordering: low cache hit ratio when the number of colors is increased.

Extension of block multi-color ordering to the general sparse coefficient matrix case (a black-box type solver)

Blocking and coloring of unknowns are performed using information only from the coefficient matrix.

An n-dimensional linear system: Ax = b
◦ Although we deal with symmetric coefficient matrices in our numerical tests, we here explain our method including the non-symmetric coefficient matrix case.

Blocking: the n unknowns are divided into multiple blocks of size s. The number of blocks nb is given by n/s.

Coloring: the nb blocks are assigned to nc colors (ABMC method).
◦ Unknowns in any two blocks having the same color never have a data relationship.

Data relationship between the unknowns
◦ Unknowns i and j have a relationship ⇔ a(i,j) ≠ 0 or a(j,i) ≠ 0 (see the sketch below).
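As a concrete illustration, the sketch below (our own helper, with assumed CRS arrays row_ptr/col_idx; not code from the talk) builds the undirected adjacency structure of the graph used later for blocking, symmetrizing the nonzero pattern so that i and j become neighbours whenever a(i,j) ≠ 0 or a(j,i) ≠ 0.

    ! Build the adjacency lists of G(V, E) from the nonzero pattern of A.
    ! Duplicate edges (both a(i,j) and a(j,i) nonzero) are kept; they are
    ! harmless for the blocking and coloring sketches that follow.
    subroutine build_graph(n, row_ptr, col_idx, adj_ptr, adj_idx)
      integer, intent(in)               :: n, row_ptr(n+1), col_idx(*)
      integer, intent(out)              :: adj_ptr(n+1)
      integer, allocatable, intent(out) :: adj_idx(:)
      integer :: i, j, k, deg(n)
      deg = 0
      do i = 1, n                        ! count contributions of A and A^T
         do k = row_ptr(i), row_ptr(i+1) - 1
            j = col_idx(k)
            if (j /= i) then             ! skip diagonal entries
               deg(i) = deg(i) + 1
               deg(j) = deg(j) + 1
            end if
         end do
      end do
      adj_ptr(1) = 1
      do i = 1, n
         adj_ptr(i+1) = adj_ptr(i) + deg(i)
      end do
      allocate(adj_idx(adj_ptr(n+1) - 1))
      deg = 0                            ! reuse as per-node fill counters
      do i = 1, n
         do k = row_ptr(i), row_ptr(i+1) - 1
            j = col_idx(k)
            if (j /= i) then
               adj_idx(adj_ptr(i) + deg(i)) = j; deg(i) = deg(i) + 1
               adj_idx(adj_ptr(j) + deg(j)) = i; deg(j) = deg(j) + 1
            end if
         end do
      end do
    end subroutine build_graph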

Blocking
◦ We assign more related unknowns to an identical block, to improve the cache hit ratio.
◦ The sizes of all blocks are the same: unnecessary for parallel computation, but it makes load balancing easy.

Coloring
◦ We used two multi-color ordering methods applicable to a general sparse matrix.

◦ The greedy algorithm
  Y. Saad, Iterative Methods for Sparse Linear Systems, 2nd ed., SIAM, 2003.
  M.T. Jones and P.E. Plassmann, "The efficient parallel iterative solution of large sparse linear systems," in Graph Theory and Sparse Matrix Computations, IMA 56, 1994, pp. 229-245.

◦ Algebraic multi-color ordering (AMC) method
  T. Iwashita and M. Shimasaki, "Algebraic multi-color ordering for parallelized ICCG solver in finite element analyses," IEEE Trans. Magn., 38 (2002), pp. 429-432.
  Users can give the number of colors as an input parameter. The method cyclically assigns a color to each unknown while keeping the independence between unknowns with the same color. The number of blocks assigned to each color becomes about the same.

Using an undirected graph G(V, E) derived from the coefficient matrix, the blocking is written as
$V = V_1 \cup V_2 \cup \cdots \cup V_{n_b}$, $|V_1| = |V_2| = \cdots = |V_{n_b}| = s$, $V_k \cap V_{k'} = \emptyset$ $(k \neq k')$.

Procedure for assigning unknowns to the k-th block $V_k$, using a subgraph $G'(V', E')$ of $G$:

Step 1 (Initialization): $C = \emptyset$, $V_k = \emptyset$ ($C$: a set of candidates).
Step 2: If $C = \emptyset$, choose a node $v$ from $V'$ according to the seed selection policy; otherwise choose a node $v$ from $C$ based on the selection policy. Then update
$V_k \leftarrow V_k \cup \{v\}$, $G'(V', E') \leftarrow G'(V' \setminus \{v\}, E')$, $C \leftarrow C \cup \mathrm{adj}(v)$.
Step 3: If $|V_k| = s$, finish the blocking process for $V_k$; otherwise return to Step 2.

By repeating the above procedure for k = 1 to k = nb with the initial setting G'(V', E') ← G(V, E) for k = 1, all the unknowns are assigned to blocks; a sketch of this loop follows.
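One possible realization of this loop is sketched below (our own, under assumptions: adjacency arrays adj_ptr/adj_idx as built earlier, FIFO candidate handling, i.e. Policy 1, and the smallest remaining index as the seed policy; the last block may be smaller when s does not divide n).

    ! Assign each unknown to a block of (at most) size s.
    subroutine block_unknowns(n, s, adj_ptr, adj_idx, block_id)
      integer, intent(in)  :: n, s, adj_ptr(n+1), adj_idx(*)
      integer, intent(out) :: block_id(n)      ! 0 = not yet assigned
      integer :: queue(n), head, tail, k, cnt, seed, v, w, j
      logical :: queued(n)
      block_id = 0; queued = .false.
      k = 1; cnt = 0; head = 1; tail = 0; seed = 1
      do
         if (head > tail) then                 ! candidate set C is empty:
            do while (seed <= n)               ! seed policy: smallest index
               if (block_id(seed) == 0) exit
               seed = seed + 1
            end do
            if (seed > n) exit                 ! all unknowns assigned
            v = seed
         else
            v = queue(head); head = head + 1   ! Policy 1: FIFO pick from C
         end if
         block_id(v) = k; cnt = cnt + 1        ! Step 2: add v to V_k
         do j = adj_ptr(v), adj_ptr(v+1) - 1   ! C <- C + adj(v), unassigned only
            w = adj_idx(j)
            if (block_id(w) == 0 .and. .not. queued(w)) then
               queued(w) = .true.
               tail = tail + 1; queue(tail) = w
            end if
         end do
         if (cnt == s) then                    ! Step 3: block V_k is full
            k = k + 1; cnt = 0
            head = 1; tail = 0; queued = .false.   ! reset the candidate set
         end if
      end do
    end subroutine block_unknowns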

[Figure: assignment of unknowns to the k-th block (block size 3); blocks V_{k-1}, V_k, V_{k+1}, V_{k+2} are shown, with marked nodes indicating unknowns in the candidate set C.]

Seed selection policy
◦ Simple method: the node with the smallest order number is selected as the seed.

Node selection policy from the candidate set
◦ Policy 1: FIFO policy.
◦ Policy 2: Select the node with the maximum number of edges to the nodes already assigned to $V_k$.
◦ Policy 3: Select the node $v$ with the highest score $s_v$, calculated by $s_v = \sum_{v' \in V_k} (a_{v v'} + a_{v' v}) / 2$ (a scoring sketch follows this list).

Our expectation:
(1) The difference in the node selection policy may not significantly affect the solver performance.
(2) But we generally expect an increase in data locality using Policy 2 and an improvement in convergence with Policy 3, compared to Policy 1.
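A minimal sketch of the Policy 3 score (our own, hypothetical helper; for the structurally symmetric test matrices, $s_v = \sum_{v' \in V_k}(a_{vv'} + a_{v'v})/2$ reduces to one pass over row v of the CRS arrays; in_block is an assumed flag marking unknowns already placed in $V_k$):

    ! Policy 3 score of a candidate node v (sketch).
    function policy3_score(v, row_ptr, col_idx, val, in_block) result(score)
      integer, intent(in)          :: v, row_ptr(*), col_idx(*)
      double precision, intent(in) :: val(*)
      logical, intent(in)          :: in_block(*)   ! .true. if in V_k
      double precision :: score
      integer :: k
      score = 0.0d0
      do k = row_ptr(v), row_ptr(v+1) - 1
         if (in_block(col_idx(k))) score = score + val(k)  ! a(v,v') terms
      end do
    end function policy3_score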

Generate an nb-by-nb matrix T which represents the data relationship between blocks.
◦ The I-th and J-th blocks have a relationship → T(I,J) = 1.
◦ No relationship → T(I,J) = 0.
◦ The matrix T is always symmetric.

A multi-color ordering technique is applied to the blocks whose relationships are given by T (see the sketch below).
◦ Any multi-color ordering technique can be used.
◦ This guarantees the independence of the blocks with an identical color.
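A sketch of this step under our assumptions (T held as a dense logical matrix for brevity, whereas a sparse neighbour structure would be used for large nb), using the classic greedy colouring, i.e. each block gets the smallest colour not used by its neighbours:

    ! Greedy coloring of the block graph T (nb x nb; T(I,J) = relationship).
    subroutine color_blocks(nb, T, color, ncolors)
      integer, intent(in)  :: nb
      logical, intent(in)  :: T(nb, nb)
      integer, intent(out) :: color(nb), ncolors
      logical :: used(nb)
      integer :: I, J, c
      color = 0; ncolors = 0
      do I = 1, nb
         used = .false.
         do J = 1, nb                    ! forbid colors of adjacent blocks
            if (T(I, J) .and. color(J) > 0) used(color(J)) = .true.
         end do
         c = 1                           ! smallest color not forbidden
         do while (used(c))
            c = c + 1
         end do
         color(I) = c
         ncolors = max(ncolors, c)
      end do
    end subroutine color_blocks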

After all unknowns are assigned to a block and a color, they are reordered following the order of colors. The resulting form of the coefficient matrix is as follows:

[Figure: reordered coefficient matrix; within each color the diagonal part is block diagonal, so the substitutions of one color are carried out in parallel on CPU1, CPU2, CPU3, ...]

◦ Degree of parallelism = # blocks in one color; it does not depend on # cores.
◦ # synchronizations = # colors.
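The resulting substitution kernel then looks like the sketch below (ours, in the talk's Fortran/OpenMP setting; color_ptr and block_ptr are assumed index arrays over the blocks of each color and the unknowns of each block, all in the new ordering). Colors are processed serially, giving one thread synchronization per color, while the independent blocks of one color are distributed over the threads.

    ! Parallel forward substitution after ABMC reordering (sketch).
    do c = 1, ncolors                            ! serial over colors
    !$omp parallel do private(b, i, j, t) schedule(static)
       do b = color_ptr(c), color_ptr(c+1) - 1   ! independent blocks
          do i = block_ptr(b), block_ptr(b+1) - 1
             t = r(i)
             do j = row_ptr(i), diag_ptr(i) - 1
                ! references only unknowns of earlier colors or earlier
                ! rows of this same block: no cross-thread dependency
                t = t - val(j) * y(col_idx(j))
             end do
             y(i) = t / val(diag_ptr(i))
          end do
       end do
    !$omp end parallel do                        ! one synchronization per color
    end do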

Test problems:
◦ A linear system from static magnetic field analysis (0.8 million DOFs)
◦ Four linear systems from the UF Sparse Matrix Collection: we mainly selected problems arising in unstructured mesh analyses with relatively large DOFs.

Computer: T2K Open Supercomputer @ Kyoto Univ., Fujitsu HX600
◦ 4 quad-core Opteron processors per node

Convergence criterion of the ICCG method: relative residual norm less than 10^-7.

Fortran & OpenMP
◦ Compile option: -Kfast, Opteron

Notation:
◦ Algebraic multi-color ordering with m colors: AMC(m)
◦ Greedy algorithm (multi-color ordering): MC
◦ ABMC with Policy x and block size s: ABMC-Px(s)
  The greedy algorithm is used for the coloring process.

Sequential ICCG (T: total elapsed time; Ti: computational time per iteration)

  T (sec)   # Ite.   Ti (sec)
  232       577      0.403

AMC(30)

  # cores   T (sec)   # Ite.   Speedup   Ti (sec)
  2         501       1366     0.463     0.367
  8         140       1372     1.66      0.102
  16        92.7      1356     2.51      0.0683

AMC(500): convergence is improved, but the computational time is not reduced (Ti is increased).

  # cores   T (sec)   # Ite.   Speedup   Ti (sec)
  2         501       1044     0.463     0.480
  8         155       1051     1.49      0.148
  16        129       1046     1.80      0.123

ABMC-P1(64) with 38 colors; values in parentheses: AMC(30)

  # cores   T (sec)       # ite.        Speedup         Ti (sec)
  2         340 (501)     1330 (1366)   0.683 (0.463)   0.256 (0.367)
  8         112 (140)     1329 (1372)   2.07 (1.66)     0.0843 (0.102)
  16        69.4 (92.7)   1326 (1356)   3.57 (2.51)     0.0547 (0.0683)

ABMC-P1(512) with 31 colors; values in parentheses: AMC(30)

  # cores   T (sec)       # ite.       Speedup        Ti (sec)
  2         227 (501)     976 (1366)   1.08 (0.463)   0.233 (0.367)
  8         73.9 (140)    980 (1372)   3.14 (1.66)    0.0754 (0.102)
  16        47.5 (92.7)   983 (1356)   4.89 (2.51)    0.0483 (0.0683)

Comparison of cache hit ratios (measured by a profiler)

Sequential computation

  Method         Secondary cache miss ratio (%)
  AMC(30)        2.59
  AMC(500)       6.18
  ABMC-P1(64)    1.90
  ABMC-P1(512)   1.66

Parallel computation on 16 cores

  Method         Secondary cache miss ratio (%)
  AMC(30)        1.54
  AMC(500)       1.67
  ABMC-P1(64)    1.16
  ABMC-P1(512)   0.92

Numerical results on 16 cores

  Method         T (sec)   # ite.   Speedup   Ti (sec)
  MC             80.2      2486     2.89      0.054
  ABMC-P1(64)    69.4      1326     3.57      0.0547
  ABMC-P1(512)   47.5      983      4.89      0.0483
  ABMC-P2(64)    66.7      1252     3.48      0.0533
  ABMC-P2(512)   47.5      934      4.89      0.0508
  ABMC-P3(64)    74.8      1256     3.1       0.0596
  ABMC-P3(256)   59.9      985      3.88      0.0608

Effect of block size: a larger block size usually results in better performance. (But too large a block size needs more colors and reduces the degree of parallelism.)

The effect of the selection policy on convergence is not significant in this problem: while Policies 2 and 3 exhibit slightly faster convergence than Policy 1, Policy 2 achieves the best solver performance of the three.

Although the proposed method obtains a 70% improvement in parallel speedup over MC, the speedup ratio remains about 5.
◦ The main kernels of the ICCG method are memory intensive.
◦ In the HX600 node, the parallel speedup in the STREAM (Triad) benchmark is only 7.2 for 16 cores, because each quad-core Opteron processor has two memory channels.

The proposed ABMC method achieves better parallel performance in the computational time of one iteration than MC.

Algebraic multi-color ordering (AMC) method
◦ When the number of colors is increased to improve convergence, the performance does not increase, due to the decline in granularity and cache hit ratio.

Algebraic block multi-color ordering (ABMC) method (proposed method)
◦ Blocking unknowns leads to an improvement in cache hit ratio and a reduction in the computational time of one iteration.
◦ Blocking improves the convergence rate without increasing the number of thread synchronizations.
◦ About 5-fold speedup on 16 cores (compared with a 6-7 fold speedup on 16 cores in simple sparse matrix-vector multiplication).

We used 4 test problems derived from the collection. G3_circuit is a structured mesh problem, while the other three problems arise from unstructured meshes.

  Data set name   Problem area             DOF         # nonzeros
  parabolic_fem   CFD                      525,825     3,674,625
  thermal2        Static thermal problem   1,228,045   8,580,313
  audikw_1        Structural problem       943,605     77,651,847
  G3_circuit      Circuit simulation       1,585,478   7,660,826

Since the three unstructured problems show a similar tendency in the numerical results, only the result of audikw_1 is introduced in this slide.

Results for audikw_1:

Sequential ICCG

  T (sec)   # Ite.   Ti (sec)
  1776      1739     0.979

Parallel ICCG (16 cores)

  Method         T (sec)   # ite.   Ti (sec)
  MC             681       1728     0.394
  ABMC-P1(64)    279       1773     0.157
  ABMC-P1(128)   263       1753     0.150
  ABMC-P2(64)    274       1790     0.153
  ABMC-P2(128)   322       2157     0.149
  ABMC-P3(64)    287       1788     0.160
  ABMC-P3(128)   281       1808     0.155

◦ The ABMC method attains 2.6 times better performance than MC.
◦ In this test model (and the other unstructured problems as well), the reordering process does not much affect the convergence. The advantage of the ABMC is the reduced computational time in one iteration.

Results for G3_circuit:

Sequential ICCG

  T (sec)   # Ite.   Ti (sec)
  156       757      0.206

Parallel ICCG (16 cores)

  Method         T (sec)   # ite.   Ti (sec)
  MC             62.2      1521     0.0409
  ABMC-P1(64)    52.4      1138     0.0460
  ABMC-P1(128)   51.1      1174     0.0435
  ABMC-P2(64)    47.0      1059     0.0444
  ABMC-P2(128)   42.6      1037     0.0411
  ABMC-P3(64)    42.7      991      0.0431
  ABMC-P3(128)   41.4      980      0.0422

◦ The ABMC method also has better performance than MC.
◦ In this test model, the superiority of the ABMC over MC is mainly due to the improvement in convergence.
◦ Policy 3 exhibits better performance.

Ordering does not much affect convergence in the three unstructured mesh problems.
◦ Consequently, the total elapsed time depends on the computational time in one iteration. The ABMC shows better performance than multi-color ordering because of its advantage in efficient cache utilization.

In G3_circuit (a structured problem), the advantage in convergence of the ABMC results in better performance than the MC.

It is not easy to give a general way of choosing the optimal blocking policy and block size for ABMC.
◦ The characteristics of an individual problem sometimes have a significant impact on convergence.

But we can say that the ABMC method mostly achieves better solver performance than the conventional multi-color ordering, even if the simplest blocking policy (Policy 1) and a relatively small block size, for example 8 or 16, are used.

Computational time (sec) on 16 cores

  Method        parabolic_fem   audikw   thermal2   G3_circuit
  MC            29.0            681      117        62.2
  ABMC-P1(8)    21.5            363      76.1       64.1
  ABMC-P1(16)   21.6            351      73.8       58.6

We propose algebraic block multi-color ordering, which is a generalization of the block multi-color ordering method for finite difference analysis.
◦ The proposed method can be applied to a general sparse linear system as a black-box type solver.
◦ In this method, the blocks having an identical color are independent of one another. Substitutions are parallelized in a block-wise manner within each color.
◦ We present a blocking strategy to improve the cache hit ratio, in which more related unknowns are assigned to an identical block.

The proposed technique is examined in five numerical tests. It is shown that the method attains better solver performance than the conventional (greedy) multi-color ordering.

The introduced algebraic block multi-color ordering method can be directly used for the parallelization of ILU preconditioning, the Gauss-Seidel method, and the SOR method.