TRANSCRIPT
Takeshi Iwashita (Kyoto Univ.), Hiroshi Nakashima (Kyoto Univ.), Yasuhito Takahashi (Doshisha Univ.)
IPDPS 2012 (Main track), pp. 474-483
Main issue
◦ We propose an algebraic block multi-color ordering method for the parallel ICCG method for unstructured mesh problems.
Overview
◦ Brief introduction of the ICCG method and the objective of the present research
◦ Previous method: block multi-color ordering for the finite difference method
◦ Proposed method: algebraic block multi-color ordering
◦ Numerical results
◦ Conclusions
Incomplete Cholesky Conjugate Gradient (ICCG) Method
Most popular linear solver for a linear system of equations with a sparse symmetric positive-definite coefficient matrix
◦ Utilized as a linear solver in small- or middle-scale simulations
◦ Utilized as a building block in large-scale simulations on supercomputers, for example, as the IC smoother of a multigrid method and a subdomain solver for the domain decomposition method
Main issue is the parallelization of the forward and backward substitutions
◦ The CG solver itself (sparse matrix-vector multiplication, inner products, vector updates) can be straightforwardly parallelized.
◦ The same substitution procedure appears in the IC (ILU) smoother, the Gauss-Seidel method (smoother), and the SOR method.
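To illustrate why the substitutions are the bottleneck, the following sketch (a dense, sequential Python stand-in for the paper's Fortran/OpenMP sparse solver; names are illustrative) shows the loop-carried dependency in forward substitution:

```python
import numpy as np

def forward_substitution(L, b):
    """Solve L y = b for lower-triangular L.

    Each y[i] depends on previously computed y[j] (j < i with
    L[i, j] != 0), so the i-loop cannot be parallelized naively --
    this is the dependency that parallel orderings must break.
    """
    n = len(b)
    y = np.zeros(n)
    for i in range(n):
        # loop-carried dependency: uses y[0..i-1]
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    return y
```

By contrast, the matrix-vector product and vector updates touch each row independently, which is why only the substitutions need a reordering.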
Significance of multi-thread parallel processing in large-scale analysis
◦ SMP clusters are the major configuration of supercomputers.
◦ When the number of nodes is increased, hybrid (multi-process / multi-thread) parallelization becomes more advantageous.
◦ Multi-process parallelization is handled by the domain decomposition method, which has an advantage in communication.
◦ Multi-thread parallelization of subdomain solvers and building blocks is therefore important.
Additive Schwarz type methods (Localized ICCG, Block ICCG)
Reordering (parallel ordering)
◦ Unknowns or grid points are reordered into an order appropriate for parallel processing.
◦ Parallel node orderings (finite difference analysis): multi-color ordering, domain decomposition ordering, dissection ordering, ...; block multi-color ordering, block red-black ordering
T. Iwashita and M. Shimasaki, Int. J. Para. Pro., 31, (2003), pp. 55-75.
T. Iwashita, Y. Nakanishi and M. Shimasaki, SIAM J. SC, 26, (2005), pp. 1234-1260.
◦ Parallel ordering for linear systems with general sparse coefficient matrices: extensions of parallel node orderings
Extension of multi-color ordering
◦ Jones and Plassmann's method
◦ Algebraic multi-color ordering (AMC) method
Extension of domain decomposition type ordering
◦ HID method by Henon and Saad
Extension of block red-black ordering
◦ ABRB method
◦ Effective for small degrees of parallelism (Np < 8)
We propose algebraic block multi-color ordering, which is considered a generalization of block multi-color ordering.
[Figure] Multi-color ordering (4 colors) vs. block multi-color ordering (block size: 4, 3 colors)
◦ Nodes with an identical color are independent of one another; blocks with an identical color can be handled in parallel.
◦ Substitutions are parallelized within each color.
◦ Problem with multi-color ordering: low cache hit ratio when the number of colors is increased.
◦ Blocking improves cache hit ratio and convergence.
Extension of block multi-color ordering to the general sparse coefficient matrix case (a black-box type solver)
Blocking and coloring of the unknowns are performed using information from the coefficient matrix only.
An n-dimensional linear system: Ax = b
◦ Although we deal with symmetric coefficient matrices in our numerical tests, we explain the method here including the non-symmetric case.
Blocking: the n unknowns are divided into multiple blocks of size s. The number of blocks nb is given by n/s.
Coloring: the nb blocks are assigned to nc colors.
ABMC method
◦ Unknowns in any two blocks having the same color never have a data relationship.
Data relationship between unknowns
◦ Unknowns i and j have a relationship ⇔ a(i,j) ≠ 0 or a(j,i) ≠ 0
Blocking
◦ We assign more closely related unknowns to an identical block, to improve the cache hit ratio.
◦ The sizes of all blocks are the same. This is unnecessary for parallel computation, but it makes load balancing easy.
Coloring
◦ We used two multi-color ordering methods applicable to a general sparse matrix.
◦ The greedy algorithm
 Y. Saad, Iterative Methods for Sparse Linear Systems, Second ed., SIAM, 2003.
 M.T. Jones and P.E. Plassmann, "The efficient parallel iterative solution of large sparse linear systems," in Graph Theory and Sparse Matrix Computations, IMA 56, 1994, pp. 229-245.
◦ Algebraic multi-color ordering (AMC) method
 T. Iwashita and M. Shimasaki, "Algebraic multi-color ordering for parallelized ICCG solver in finite element analyses," IEEE Trans. Magn., 38, (2002), pp. 429-432.
 Users can give the number of colors as an input parameter. The method cyclically assigns a color to each unknown while keeping unknowns with the same color independent. The number of blocks assigned to each color becomes about the same.
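The cyclic assignment described above can be sketched as follows. This is an illustrative Python reading of the AMC idea, not the authors' implementation; the fallback to the next non-conflicting color is an assumption.

```python
def algebraic_multicolor(adj, nc):
    """Cyclically assign nc colors to unknowns so that no two
    related unknowns (a(i,j) != 0 or a(j,i) != 0) share a color.

    adj: list of neighbor sets derived from the coefficient matrix.
    Illustrative sketch: the cyclic candidate color is advanced
    until it conflicts with no already-colored neighbor.
    """
    n = len(adj)
    color = [-1] * n
    next_color = 0
    for i in range(n):
        c = next_color
        for _ in range(nc):
            if all(color[j] != c for j in adj[i]):
                break
            c = (c + 1) % nc
        else:
            raise ValueError("nc colors are not enough for this graph")
        color[i] = c
        next_color = (c + 1) % nc  # cycle to balance the color sizes
    return color
```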
Using a non-directed graph G(V, E) derived from the coefficient matrix, the blocking procedure is written as follows. The node set is partitioned as
V = V1 ∪ V2 ∪ ... ∪ Vnb, |Vk| = s (k = 1, ..., nb), Vk ∩ Vk' = ∅ (k ≠ k').
Procedure for assigning unknowns to the k-th block Vk, using a subgraph G'(V', E') of G:
Step 1 (Initialization): C = ∅, Vk = ∅ (C: a set of candidates).
Step 2: If C = ∅, choose a node v from V' according to the seed selection policy; otherwise choose a node v from C based on the node selection policy. Then set Vk ← Vk ∪ {v}, G'(V', E') ← G'(V' \ {v}, E'), C ← (C ∪ adj(v)) \ {v}.
Step 3: If |Vk| = s, finish the blocking process for Vk; otherwise return to Step 2.
By repeating the above procedure for k = 1 to k = nb with the initial setting G'(V', E') ← G(V, E) for k = 1, all the unknowns are assigned to blocks.
[Figure] Assignment of unknowns to the k-th block (block size: 3); the figure shows blocks Vk-1, Vk, Vk+1, Vk+2, with marked nodes indicating unknowns in the candidate set.
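Steps 1-3 above can be sketched in Python as follows, using the smallest remaining index as the seed and the FIFO candidate policy (policy 1 on the next slide). Function and variable names are illustrative; the last block may be smaller than s when n is not divisible by s.

```python
from collections import OrderedDict

def block_unknowns(adj, s, n):
    """Greedy blocking of n unknowns into blocks of size s.

    adj: list mapping node -> iterable of neighbors in G(V, E).
    Returns a list of blocks (lists of node indices).
    """
    unassigned = set(range(n))
    blocks = []
    while unassigned:
        Vk = []
        C = OrderedDict()  # Step 1: candidate set, kept in FIFO order
        while len(Vk) < s and unassigned:
            if C:
                v, _ = C.popitem(last=False)  # policy 1: oldest candidate
            else:
                v = min(unassigned)  # seed: smallest order number
            Vk.append(v)             # Step 2: Vk <- Vk + {v}
            unassigned.discard(v)
            for w in adj[v]:         # C <- C + adj(v), minus assigned nodes
                if w in unassigned and w not in C:
                    C[w] = None
        blocks.append(Vk)            # Step 3: |Vk| = s, block finished
    return blocks
```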
Seed selection policy
◦ Simple method: the node with the smallest order number is selected as the seed.
Node selection policy from the candidate set
◦ Policy 1: FIFO policy
◦ Policy 2: Select the node with the maximum number of edges to the nodes already assigned to Vk.
◦ Policy 3: Select the node v with the highest score sv, calculated by sv = Σ_{v'∈Vk} (a_{vv'} + a_{v'v}) / 2.
Our expectation:
(1) The difference in the node selection policy may not significantly affect the solver performance.
(2) But we generally expect an increase in data locality using policy 2 and an improvement in convergence with policy 3 compared to policy 1.
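The scores used by policies 2 and 3 can be sketched directly from their definitions; the helper names below and the use of a dense matrix A are illustrative assumptions.

```python
def policy2_count(adj, v, Vk):
    """Policy 2: number of edges from candidate v to nodes already in Vk."""
    return sum(1 for vp in Vk if vp in adj[v])

def policy3_score(A, v, Vk):
    """Policy 3: s_v = sum over v' in Vk of (a[v][v'] + a[v'][v]) / 2.

    A is a dense matrix here for illustration only.
    """
    return sum((A[v][vp] + A[vp][v]) / 2.0 for vp in Vk)
```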
Generate an nb-by-nb matrix T which represents the data relationship between blocks.
◦ The I-th and J-th blocks have a relationship → T(I,J) = 1
◦ No relationship → T(I,J) = 0
The matrix T is always symmetric.
A multi-color ordering technique is applied to the blocks, whose relationships are given by T.
◦ Any multi-color ordering technique can be used.
◦ This guarantees the independence of the blocks with an identical color.
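Building T can be sketched as follows (illustrative names; adj encodes a(i,j) ≠ 0 or a(j,i) ≠ 0). Any multi-color ordering, e.g. the greedy algorithm or AMC, can then be applied to T's graph.

```python
def block_relation_matrix(blocks, adj, nb):
    """Build the nb-by-nb symmetric 0/1 matrix T: T[I][J] = 1 iff some
    unknown in block I is related to some unknown in block J (I != J).
    """
    block_of = {}
    for I, blk in enumerate(blocks):
        for v in blk:
            block_of[v] = I
    T = [[0] * nb for _ in range(nb)]
    for I, blk in enumerate(blocks):
        for v in blk:
            for w in adj[v]:
                J = block_of[w]
                if I != J:
                    # symmetric by construction
                    T[I][J] = T[J][I] = 1
    return T
```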
After all unknowns are assigned to a block and a color, they are reordered following the order of colors. The form of the coefficient matrix is then given as follows:
[Figure] Form of the reordered coefficient matrix: substitutions within each color are executed in parallel by CPU1, CPU2, CPU3.
◦ Degree of parallelism = # blocks in one color; it does not depend on # cores.
◦ # synchronizations = # colors
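The parallel structure of the substitutions in the ABMC ordering can be sketched as below. This is a sequential Python stand-in for the Fortran/OpenMP implementation (dense L, illustrative names): blocks sharing a color are mutually independent, so the loop over blocks of one color is the parallelizable region, with one barrier per color.

```python
import numpy as np

def forward_substitution_by_color(L, b, colors_blocks):
    """Forward substitution in an ABMC-ordered system.

    colors_blocks: list (one entry per color, in color order) of lists
    of index blocks in the new ordering.
    """
    y = np.zeros(len(b))
    for blocks in colors_blocks:   # colors are processed in sequence
        # blocks of the same color are independent: in the real solver
        # this loop is a parallel region (e.g. one thread per block)
        for blk in blocks:
            for i in blk:          # sequential within a block
                y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
        # a thread barrier would follow here, once per color
    return y
```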
A linear system from static magnetic field analysis (0.8 million DOFs)
Four linear systems from the UF Matrix Collection
◦ We mainly selected problems arising in unstructured mesh analyses with relatively large DOFs.
Computer: T2K open supercomputer @ Kyoto Univ., Fujitsu HX600
◦ 4 quad-core Opteron processors
Convergence criterion of the ICCG method: relative residual norm less than 10^-7
Fortran & OpenMP
◦ Compile options: -Kfast, Opteron
Algebraic multi-color ordering with m colors: AMC(m)
Greedy algorithm (multi-color ordering): MC
ABMC with policy x and block size s: ABMC-Px(s)
◦ The greedy algorithm is used for the coloring process.
Sequential ICCG (T: total elapsed time, Ti: computational time per iteration)
T (sec) | # Ite. | Ti (sec)
232 | 577 | 0.403

Parallel ICCG with AMC(30)
# cores | T (sec) | # Ite. | Speedup | Ti (sec)
2 | 501 | 1366 | 0.463 | 0.367
8 | 140 | 1372 | 1.66 | 0.102
16 | 92.7 | 1356 | 2.51 | 0.0683

Parallel ICCG with AMC(500): convergence is improved, but the computation time is not reduced because the time per iteration increases.
# cores | T (sec) | # Ite. | Speedup | Ti (sec)
2 | 501 | 1044 | 0.463 | 0.480
8 | 155 | 1051 | 1.49 | 0.148
16 | 129 | 1046 | 1.80 | 0.123
Parallel ICCG with ABMC-P1(64), 38 colors (values in parentheses: AMC(30))
# cores | T (sec) | # ite. | Speedup | Ti (sec)
2 | 340 (501) | 1330 (1366) | 0.683 (0.463) | 0.256 (0.367)
8 | 112 (140) | 1329 (1372) | 2.07 (1.66) | 0.0843 (0.102)
16 | 69.4 (92.7) | 1326 (1356) | 3.57 (2.51) | 0.0547 (0.0683)
Parallel ICCG with ABMC-P1(512), 31 colors (values in parentheses: AMC(30))
# cores | T (sec) | # ite. | Speedup | Ti (sec)
2 | 227 (501) | 976 (1366) | 1.08 (0.463) | 0.233 (0.367)
8 | 73.9 (140) | 980 (1372) | 3.14 (1.66) | 0.0754 (0.102)
16 | 47.5 (92.7) | 983 (1356) | 4.89 (2.51) | 0.0483 (0.0683)
Comparison of cache hit ratios measured by a profiler

Sequential computation
Method | Secondary cache miss ratio (%)
AMC(30) | 2.59
AMC(500) | 6.18
ABMC-P1(64) | 1.90
ABMC-P1(512) | 1.66

Parallel computation on 16 cores
Method | Secondary cache miss ratio (%)
AMC(30) | 1.54
AMC(500) | 1.67
ABMC-P1(64) | 1.16
ABMC-P1(512) | 0.92
Numerical results on 16 cores
Method | T (sec) | # ite. | Speedup | Ti (sec)
MC | 80.2 | 2486 | 2.89 | 0.054
ABMC-P1(64) | 69.4 | 1326 | 3.57 | 0.0547
ABMC-P1(512) | 47.5 | 983 | 4.89 | 0.0483
ABMC-P2(64) | 66.7 | 1252 | 3.48 | 0.0533
ABMC-P2(512) | 47.5 | 934 | 4.89 | 0.0508
ABMC-P3(64) | 74.8 | 1256 | 3.1 | 0.0596
ABMC-P3(256) | 59.9 | 985 | 3.88 | 0.0608
Effect of block size: a larger block size usually results in better performance. (But too large a block size requires more colors and reduces the degree of parallelism.)
The effect of the selection policy on convergence is not significant in this problem. While policies 2 and 3 exhibit slightly faster convergence than policy 1, policy 2 achieves the best solver performance of the three policies.
Although the proposed method obtains a 70% improvement in parallel speedup ratio over MC, the speedup ratio remains about 5.
• The main kernels of the ICCG method are memory intensive.
• In the HX600 node, the parallel speedup in the STREAM (Triad) benchmark is only 7.2 for 16 cores, because each quad-core Opteron processor has two memory channels.
The proposed ABMC method achieves better parallel performance than MC in the calculation time per iteration.
Algebraic multi-color ordering (AMC) method
◦ When the number of colors is increased to improve convergence, performance does not improve, due to the decline in granularity and cache hit ratio.
Algebraic block multi-color ordering (ABMC) method (proposed method)
◦ Blocking unknowns leads to an improvement in cache hit ratio and a reduction in the computational time per iteration.
◦ Blocking improves the convergence rate without increasing the number of thread synchronizations.
◦ About 5-fold speedup on 16 cores (compared with a 6-7 fold speedup on 16 cores for simple sparse matrix-vector multiplication)
We used 4 test problems derived from the collection. G3_circuit is a structured mesh problem, while the other three arise from unstructured mesh problems.
Data set name | Problem area | DOF | # nonzeros
parabolic_fem | CFD | 525,825 | 3,674,625
thermal2 | Static thermal problem | 1,228,045 | 8,580,313
audikw_1 | Structural problem | 943,605 | 77,651,847
G3_circuit | Circuit simulation | 1,585,478 | 7,660,826
Since the three unstructured problems show a similar tendency in the numerical results, only the result of audikw_1 is presented in this slide.
Sequential ICCG
T (sec) | # Ite. | Ti (sec)
1776 | 1739 | 0.979

Parallel ICCG (16 cores)
Method | T (sec) | # ite. | Ti (sec)
MC | 681 | 1728 | 0.394
ABMC-P1(64) | 279 | 1773 | 0.157
ABMC-P1(128) | 263 | 1753 | 0.150
ABMC-P2(64) | 274 | 1790 | 0.153
ABMC-P2(128) | 322 | 2157 | 0.149
ABMC-P3(64) | 287 | 1788 | 0.160
ABMC-P3(128) | 281 | 1808 | 0.155

• The ABMC method attains 2.6 times better performance than MC.
• In this test model (and in the other unstructured problems as well), the reordering process does not much affect the convergence. The advantage of the ABMC is the reduced computational time per iteration.
Sequential ICCG
T (sec) | # Ite. | Ti (sec)
156 | 757 | 0.206

Parallel ICCG (16 cores)
Method | T (sec) | # ite. | Ti (sec)
MC | 62.2 | 1521 | 0.0409
ABMC-P1(64) | 52.4 | 1138 | 0.0460
ABMC-P1(128) | 51.1 | 1174 | 0.0435
ABMC-P2(64) | 47.0 | 1059 | 0.0444
ABMC-P2(128) | 42.6 | 1037 | 0.0411
ABMC-P3(64) | 42.7 | 991 | 0.0431
ABMC-P3(128) | 41.4 | 980 | 0.0422

• The ABMC method also attains better performance than MC.
• In this test model, the superiority of the ABMC over MC is mainly due to the improvement in convergence.
• Policy 3 exhibits better performance.
Ordering does not much affect convergence in the three unstructured mesh problems.
◦ Consequently, the total elapsed time depends on the computational time per iteration. The ABMC shows better performance than multi-color ordering because of its advantage in efficient cache utilization.
In G3_circuit (a structured problem), the advantage in convergence of the ABMC results in better performance than the MC.
It is not easy to give a general way of choosing the optimal blocking policy and block size for ABMC.
◦ The characteristics of an individual problem sometimes have a significant impact on convergence.
But we can say that the ABMC method mostly achieves better solver performance than the conventional multi-color ordering, even when the simplest blocking policy (policy 1) and a relatively small block size, for example 8 or 16, are used.
Computational time (sec) on 16 cores
Method | parabolic_fem | audikw | thermal2 | G3_circuit
MC | 29.0 | 681 | 117 | 62.2
ABMC-P1(8) | 21.5 | 363 | 76.1 | 64.1
ABMC-P1(16) | 21.6 | 351 | 73.8 | 58.6
We propose algebraic block multi-color ordering, which is a generalization of the block multi-color ordering method for finite difference analysis.
◦ The proposed method can be applied to a general sparse linear system as a black-box type solver.
◦ In this method, the blocks having an identical color are independent of one another. Substitutions are parallelized in a block-wise manner within each color.
◦ We present a blocking strategy to improve the cache hit ratio, in which more closely related unknowns are assigned to an identical block.
The proposed technique is examined in five numerical tests. It is shown that the method attains better solver performance than the conventional (greedy) multi-color ordering.
The introduced algebraic block multi-color ordering method can be directly used for parallelization of ILU preconditioning, the Gauss-Seidel method, and the SOR method.