Flat MPI and Hybrid Parallel Programming Models for FEM Applications on SMP Cluster Architectures Kengo Nakajima The 21st Century Earth Science COE Program Department of Earth & Planetary Science The University of Tokyo First International Workshop on OpenMP (IWOMP 05) June 1-4, 2005, Eugene, Oregon, USA.


  • Flat MPI and Hybrid Parallel Programming Models for FEM Applications on SMP Cluster Architectures

    Kengo Nakajima, The 21st Century Earth Science COE Program, Department of Earth & Planetary Science, The University of Tokyo

    First International Workshop on OpenMP (IWOMP 05) June 1-4, 2005, Eugene, Oregon, USA.

  • IWOMP, JUN05

    2

    Overview

    • Background
      – GeoFEM Project & Earth Simulator
      – Preconditioned Iterative Linear Solvers
    • Optimization Strategy on the Earth Simulator
      – BIC(0)-CG for Simple 3D Linear Elastic Applications
        • Effect of reordering on various types of architectures
      – Contact Problems
      – PGA (Pin-Grid Array)
      – Multigrid
    • Summary & Future Works

  • IWOMP, JUN05

    3

    GeoFEM: FY.1998-2002, http://geofem.tokyo.rist.or.jp/

    • Parallel FEM platform for solid earth simulation.
      – parallel I/O, parallel linear solvers, parallel visualization
      – solid earth: earthquake, plate deformation, mantle/core convection, etc.
    • Part of a national project by STA/MEXT for large-scale earth science simulations using the Earth Simulator.
    • Strong collaboration between the HPC and natural science (solid earth) communities.
    • Current Activity
      – Research Consortium for Solid Earth Simulations on the Earth Simulator (9 sub-groups, >80 members)

  • IWOMP, JUN05

    4

    System Configuration of GeoFEM

    [Diagram: a one-domain mesh is partitioned by the Partitioner into partitioned meshes distributed to the PEs. The GeoFEM platform provides parallel I/O, equation solvers and a visualizer behind solver, communication and visualization interfaces (Solver I/F, Comm. I/F, Vis. I/F). Pluggable analysis modules cover structural analysis (static linear, dynamic linear, contact), fluid and wave problems; visualization data are handled by GPPView, with utilities supporting the workflow.]

  • IWOMP, JUN05

    5

    Results on Solid Earth Simulation

    Magnetic Field of the Earth: MHD code
    Complicated Plate Model around Japan Islands
    Simulation of Earthquake Generation Cycle in Southwestern Japan
    TSUNAMI !!
    Transportation by Groundwater Flow through Heterogeneous Porous Media

    [Figure panels: grids with Δh=5.00 and Δh=1.25; snapshots at T=100, 200, 300, 400, 500.]

  • IWOMP, JUN05

    6

    Results by GeoFEM

  • IWOMP, JUN05

    7

    Motivations

    • GeoFEM Project (FY.1998-2002)
    • Performance evaluation of FEM-type applications with complicated unstructured grids (not LINPACK, FDM, ...) on the Earth Simulator (ES)
      – Implicit Linear Solvers: Preconditioned Iterative Solvers
      – Hybrid vs. Flat MPI Parallel Programming Models

  • IWOMP, JUN05

    8

    Earth Simulator (ES): http://www.es.jamstec.go.jp/

    • 640×8 = 5,120 Vector Processors
      – SMP Cluster-Type Architecture
      – 8 GFLOPS/PE, 64 GFLOPS/Node, 40 TFLOPS/ES
    • 16 GB Memory/Node, 10 TB/ES
    • 640×640 Crossbar Network
      – 12.3 GB/sec × 2
    • Memory Bandwidth: 32 GB/sec/Node
    • 35.6 TFLOPS for LINPACK (March 2002)
    • 26 TFLOPS for AFES (Climate Simulation)

  • IWOMP, JUN05

    9

    Flat MPI vs. Hybrid

    [Diagram: Hybrid (hierarchical structure) runs one MPI process per SMP node, with the 8 PEs sharing one memory; Flat-MPI runs an independent MPI process on each PE.]

  • IWOMP, JUN05

    10

    Hybrid vs. Flat-MPI

    • "Initial" Motivation for Hybrid
      – Block Jacobi-type Localized Parallel Preconditioning
      – Relatively lower convergence rate for many domains (or processors).
      – "Hybrid" may provide a better convergence rate (i.e. faster convergence) because the number of domains is 1/8 of that of the Flat-MPI programming model (on the ES).

  • IWOMP, JUN05

    11

    Local Data Structure: Node-based Partitioning
    internal nodes - elements - external nodes

    [Figure: a 5×5 node mesh partitioned into four domains (PE#0-PE#3); each domain holds its internal nodes, the elements containing them, and external nodes imported from neighboring domains.]

  • IWOMP, JUN05

    12

    Block-Jacobi Type Localized ILU(0) Preconditioning [Nakajima et al. 1999]

    • Ignore the effects of elements which belong to a different partition/processor.
    • Not as strong as the original ILU.
      – How bad is it in many-PE cases?
    • The global data dependency in the ILU preconditioner is not suitable for parallel computing.

    Forward/backward substitution within each local block:
    \[ y_k = b_k - \sum_{j=1}^{k-1} l_{kj}\, y_j \qquad (k = 2, \dots, N) \]
    \[ x_k = d_k \Bigl( \tilde{y}_k - \sum_{j=k+1}^{N} u_{kj}\, x_j \Bigr) \qquad (k = N, N-1, \dots, 1) \]

  • IWOMP, JUN05

    13

    Block-Jacobi Type Localized ILU(0) Preconditioning [Nakajima et al. 1999]

    Matrix components whose column numbers lie outside the processor are ignored (set equal to 0) during backward/forward substitution (sketched below).

    [Figure: block rows of matrix A assigned to PE#1-PE#4; within each block row the coefficients coupling to nodes on other processors are dropped from the preconditioner.]
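    The idea can be written down directly. The fragment below is an editorial sketch, not the GeoFEM source: it assumes a local lower-triangular structure in CRS-like arrays INL/IAL/AL with hypothetical names, where the Nin internal nodes come first and any column index pointing to an external node (> Nin) is skipped.

!     Localized block-Jacobi forward substitution: off-processor
!     couplings are simply ignored (set to zero), as on this slide.
      do i= 1, Nin
        WVAL= R(i)
        do k= INL(i-1)+1, INL(i)
          j= IAL(k)
          if (j.le.Nin) then
            WVAL= WVAL - AL(k) * Z(j)      ! node on this processor
          endif                            ! j > Nin (external): ignored
        enddo
        Z(i)= WVAL / D(i)
      enddo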

  • IWOMP, JUN05

    14

    Overlapped Additive Schwarz Domain Decomposition Method

    Effect of additive Schwarz domain decomposition for a solid mechanics example with 3×44³ DOF on Hitachi SR2201; number of ASDD cycles/iteration = 1, ε = 10⁻⁸.

    PE#   NO Additive Schwarz             WITH Additive Schwarz
          Iter.#   sec.    Speed-Up       Iter.#   sec.    Speed-Up
      1    204    233.7      -             144    325.6      -
      2    253    143.6     1.63           144    163.1     1.99
      4    259     74.3     3.15           145     82.4     3.95
      8    264     36.8     6.36           146     39.7     8.21
     16    262     17.4    13.52           144     18.7    17.33
     32    268      9.6    24.24           147     10.2    31.80
     64    274      6.6    35.68           150      6.5    50.07


  • IWOMP, JUN05

    16

    Overlapped Additive Schwarz Domain Decomposition Method
    for Stabilizing Localized Preconditioning

    Global operation on Ω:        M z = r
    Local operations on Ω1, Ω2:   M_{Ω1} z_{Ω1} = r_{Ω1},   M_{Ω2} z_{Ω2} = r_{Ω2}

    Global nesting correction over the overlapped regions Γ2→1, Γ1→2 (n-th cycle):
    \[ z_{\Omega_1}^{n} \leftarrow z_{\Omega_1}^{n-1} + M_{\Omega_1}^{-1}\bigl( r_{\Omega_1} - M_{\Omega_1} z_{\Omega_1}^{n-1} - M_{\Gamma_{1\to 2}}\, z_{\Gamma_{2\to 1}}^{n-1} \bigr) \]
    \[ z_{\Omega_2}^{n} \leftarrow z_{\Omega_2}^{n-1} + M_{\Omega_2}^{-1}\bigl( r_{\Omega_2} - M_{\Omega_2} z_{\Omega_2}^{n-1} - M_{\Gamma_{2\to 1}}\, z_{\Gamma_{1\to 2}}^{n-1} \bigr) \]

  • IWOMP, JUN05

    17

    Overlapped Additive Schwarz Domain Decomposition Method

    Effect of additive Schwarz domain decomposition for a solid mechanics example with 3×44³ DOF on Hitachi SR2201; number of ASDD cycles/iteration = 1, ε = 10⁻⁸.

    PE#   NO Additive Schwarz             WITH Additive Schwarz
          Iter.#   sec.    Speed-Up       Iter.#   sec.    Speed-Up
      1    204    233.7      -             144    325.6      -
      2    253    143.6     1.63           144    163.1     1.99
      4    259     74.3     3.15           145     82.4     3.95
      8    264     36.8     6.36           146     39.7     8.21
     16    262     17.4    13.52           144     18.7    17.33
     32    268      9.6    24.24           147     10.2    31.80
     64    274      6.6    35.68           150      6.5    50.07

  • IWOMP, JUN05

    18

    Hybrid vs. Flat-MPI

    • "Initial" Motivation for Hybrid
      – Block Jacobi-type Localized Parallel Preconditioning
      – Relatively lower convergence rate for many domains (or processors).
      – "Hybrid" may provide a better convergence rate (i.e. faster convergence) because the number of domains is 1/8 of that of the Flat-MPI programming model (on the ES).
    • Factors for Performance
      – Type of Application, Problem Size
      – Balance of H/W Parameters

    R. Rabenseifner: Communication Bandwidth of Parallel Programming Models on Hybrid Architectures, International Workshop on OpenMP: Experiences and Implementations (WOMPEI 2002), Lecture Notes in Computer Science 2327, pp. 437-448, 2002.

  • IWOMP, JUN05

    19

    • Background
      – GeoFEM Project & Earth Simulator
      – Preconditioned Iterative Linear Solvers
    • Optimization Strategy on the Earth Simulator
      – BIC(0)-CG for Simple 3D Linear Elastic Applications
        • Effect of reordering on various types of architectures
      – PGA (Pin-Grid Array)
      – Contact Problems
      – Multigrid
    • Summary & Future Works

  • IWOMP, JUN05

    20

    Block IC(0)-CG Solver on the Earth Simulator

    • 3D Linear Elastic Problems (SPD)
    • Parallel Iterative Linear Solver
      – Node-based Local Data Structure
      – Conjugate-Gradient Method (CG): SPD (see the sketch below)
      – Localized Block IC(0) Preconditioning (Block Jacobi)
      – Additive Schwarz Domain Decomposition (ASDD)
      – http://geofem.tokyo.rist.or.jp/
    • Hybrid Parallel Programming Model
      – OpenMP + MPI
      – Re-Ordering for Vector/Parallel Performance
      – Comparison with Flat MPI
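    For orientation, a generic preconditioned CG iteration is sketched below. This is an editorial illustration, not the GeoFEM solver: matvec (parallel A*p with halo exchange), precond (the Block IC(0)/ASDD step above) and dot_global (dot product with an MPI_ALLREDUCE inside) are assumed helper routines.

!     Minimal preconditioned CG skeleton (assumed helpers: matvec,
!     precond, dot_global; convergence test on the residual norm).
      call matvec (N, X, Q)
      R(1:N)= B(1:N) - Q(1:N)
      do iter= 1, maxit
        call precond (N, R, Z)                 ! z = M^{-1} r
        RHO= dot_global (N, R, Z)
        if (iter.eq.1) then
          P(1:N)= Z(1:N)
        else
          BETA= RHO / RHO1
          P(1:N)= Z(1:N) + BETA*P(1:N)
        endif
        call matvec (N, P, Q)                  ! q = A p
        ALPHA= RHO / dot_global (N, P, Q)
        X(1:N)= X(1:N) + ALPHA*P(1:N)
        R(1:N)= R(1:N) - ALPHA*Q(1:N)
        if (sqrt(dot_global(N,R,R)).lt.EPS) exit
        RHO1= RHO
      enddo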

  • IWOMP, JUN05

    21

    Flat MPI vs. Hybrid

    [Diagram (repeated): Hybrid (hierarchical structure) runs one MPI process per SMP node, with the 8 PEs sharing one memory; Flat-MPI runs an independent MPI process on each PE.]

  • IWOMP, JUN05

    22

    Local Data Structure: Node-based Partitioning
    internal nodes - elements - external nodes

    [Figure (repeated): a 5×5 node mesh partitioned into four domains (PE#0-PE#3) with internal nodes, elements and external nodes.]

  • IWOMP, JUN05

    23

    1 SMP node => 1 domain for the Hybrid Programming Model
    MPI communication among domains

    [Diagram: four SMP nodes (Node-0 to Node-3), each with 8 PEs sharing one memory; in the Hybrid model each node forms a single domain and MPI messages are exchanged only between nodes.]

  • IWOMP, JUN05

    24

    Basic Strategy for Parallel Programming on the Earth Simulator

    • Hypothesis
      – Explicit ordering is required for unstructured grids in order to achieve higher performance in factorization processes on SMP nodes and vector processors.

    M = (L+D) D^{-1} (D+U); the preconditioning step solves M z = r:

      Forward substitution:   (L+D) z1 = r             :  z1 = D^{-1} (r - L z1)
      Backward substitution:  (I + D^{-1} U) z_new = z1 :  z  = z - D^{-1} U z

  • IWOMP, JUN05

    25

    Basic Strategy for Parallel Programming on the Earth Simulator

    • Hypothesis
      – Explicit ordering is required for unstructured grids in order to achieve higher performance in factorization processes on SMP nodes and vector processors.

    M = (L+D) D^{-1} (D+U), with L, D, U taken from A.

    Forward substitution   (L+D) z = r :  z = D^{-1} (r - L z)

      do i= 1, N
        WVAL= R(i)
        do j= 1, INL(i)
          WVAL= WVAL - AL(i,j) * Z(IAL(i,j))   ! lower-triangular sweep
        enddo
        Z(i)= WVAL / D(i)
      enddo

    Backward substitution  (I + D^{-1} U) z_new = z_old :  z = z - D^{-1} U z

      do i= N, 1, -1
        SW= 0.0d0
        do j= 1, INU(i)
          SW= SW + AU(i,j) * Z(IAU(i,j))       ! upper-triangular sweep
        enddo
        Z(i)= Z(i) - SW / D(i)
      enddo

    Dependency: the most recent value of z at connected nodes is needed, so vectorization/parallelization is difficult.
    Reordering: directly connected nodes do not appear on the RHS.

  • IWOMP, JUN05

    26

    Basic Strategy for Parallel Programming on the Earth Simulator

    • Hypothesis
      – Explicit ordering is required for unstructured grids in order to achieve higher performance in factorization processes on SMP nodes and vector processors.
    • Re-Ordering for Highly Parallel/Vector Performance
      – Local operation and no global dependency
      – Continuous memory access
      – Sufficiently long loops for vectorization

  • IWOMP, JUN05

    27

    Basic Strategy for Parallel Programming on the Earth Simulator

    • Hypothesis
      – Explicit ordering is required for unstructured grids in order to achieve higher performance in factorization processes on SMP nodes and vector processors.
    • Re-Ordering for Highly Parallel/Vector Performance
      – Local operation and no global dependency
      – Continuous memory access
      – Sufficiently long loops for vectorization
    • 3-Way Parallelism for the Hybrid Parallel Programming Model (see the sketch below)
      – Inter-Node : MPI
      – Intra-Node : OpenMP
      – Individual PE : Vectorization
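    A sketch of where each of the three levels attaches is shown below. It reuses the loop structure and array names of the Cyclic DJDS kernel shown a few slides later, with the matrix-vector update from that later slide filled in; the halo-exchange routine name is a placeholder.

      call SOLVER_SEND_RECV (NP, X)                  ! (1) inter-node: MPI halo exchange
      do iv= 1, NCOLORS                              !     colors processed in sequence
!$omp parallel do private (iv0,j,iS,iE,i,k,kk)
        do ip= 1, PEsmpTOT                           ! (2) intra-node: OpenMP
          iv0= STACKmc(PEsmpTOT*(iv-1)+ip-1)
          do j= 1, NLhyp(iv)
            iS= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip-1)
            iE= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip  )
!CDIR NODEP
            do i= iv0+1, iv0+iE-iS                   ! (3) individual PE: vectorized
              k = i+iS-iv0
              kk= IAL(k)
              Y(i)= Y(i) + AL(k)*X(kk)
            enddo
          enddo
        enddo
      enddo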

  • IWOMP, JUN05

    28

    Re-Ordering Technique for Vector/Parallel Architectures
    Cyclic DJDS (RCM+CMC) Re-Ordering (Doi, Washio, Osoda and Maruyama (NEC))

    1. RCM (Reverse Cuthill-McKee)          2. CMC (Cyclic Multicolor)
    3. DJDS re-ordering                     4. Cyclic DJDS for SMP units

    These steps can be substituted by traditional multicoloring (MC).
    [Figure: the reordered matrix is laid out so that vectorization applies within each color and SMP parallelism across the cyclic blocks.]

  • IWOMP, JUN05

    29

    Re-Ordering Technique for Vector/Parallel Architectures
    Cyclic DJDS (RCM+CMC) Re-Ordering (Doi, Washio, Osoda and Maruyama (NEC))

    K. Nakajima, "Parallel Iterative Solvers of GeoFEM with Selective Blocking Preconditioning for Nonlinear Contact Problems on the Earth Simulator," SC2003 Technical Session, Phoenix, Arizona, 2003.
    http://www.sc-conference.org/sc2003/paperpdfs/pap155.pdf
    http://www.sc-conference.org/sc2003/inter_cal/inter_cal_detail.php?eventid=10703#1

  • IWOMP, JUN05

    30

    Reordering = Coloring

    • COLOR: unit of independent sets.
    • Elements grouped in the same "color" are independent of each other, so parallel/vector operation is possible.
    • Many colors provide faster convergence, but shorter vector length: a trade-off!
    • A greedy multicoloring sketch follows below.

    Red-Black (2 colors)     4 colors     Multicolor
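    As a concrete (editorial) illustration of the coloring idea, a plain greedy multicolor pass is sketched below; it is not the RCM/CMC procedure described on the previous slides, and the adjacency arrays INDEX/ITEM and work array USED are hypothetical. Each node receives the smallest color not already used by a colored neighbor, so nodes of equal color form an independent set.

!     Greedy multicoloring sketch (graph adjacency in CRS form).
      COLOR(1:N)= 0
      NCOLORS   = 0
      do i= 1, N
        USED(1:N)= .false.
        do k= INDEX(i-1)+1, INDEX(i)         ! colors of already-colored neighbors
          j= ITEM(k)
          if (COLOR(j).gt.0) USED(COLOR(j))= .true.
        enddo
        ic= 1
        do while (USED(ic))                  ! smallest unused color
          ic= ic + 1
        enddo
        COLOR(i)= ic
        NCOLORS = max(NCOLORS, ic)
      enddo
!     rows sharing a COLOR value can be processed in one parallel/vector loop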

  • IWOMP, JUN05

    31

    Colors & Iterations: Incompatible Nodes
    Doi, S. (NEC) et al.

    [Figure: 5×5 grid with natural (lexicographic) ordering, nodes numbered 1-25.]

    "Incompatible nodes" are not affected by other nodes. A system with fewer incompatible nodes provides faster convergence.

  • IWOMP, JUN05

    32

    RCM Ordering

    [Figure: the same 5×5 grid renumbered by Reverse Cuthill-McKee; nodes are ordered along anti-diagonal level sets.]

  • IWOMP, JUN05

    33

    Red-Black Ordering
    Highly Parallelized: Large Loop Length
    Many Incompatible Nodes: Slow Convergence

    [Figure: the 5×5 grid in red-black (2-color) ordering.]

  • IWOMP, JUN05

    34

    Four-Color Ordering
    Fairly Parallelized: Shorter Loop Length
    Fewer Incompatible Nodes: Faster Convergence

    [Figure: the 5×5 grid in four-color ordering.]

  • IWOMP, JUN05

    35

    Cyclic DJDS (RCM+CMC) for Forward/Backward Substitution in BILU Factorization

      do iv= 1, NCOLORS                                     ! colors: sequential
!$omp parallel do private (iv0,j,iS,iE,i,k,kk etc.)
        do ip= 1, PEsmpTOT                                  ! SMP parallel
          iv0= STACKmc(PEsmpTOT*(iv-1)+ip-1)
          do j= 1, NLhyp(iv)                                ! jagged diagonals
            iS= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip-1)
            iE= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip  )
!CDIR NODEP
            do i= iv0+1, iv0+iE-iS                          ! vectorized
              k = i+iS-iv0
              kk= IAL(k)
              (Important Computations)
            enddo
          enddo
        enddo
      enddo

  • IWOMP, JUN05

    36

    Simple 3D Cubic Model

    [Figure: (Nx-1)×(Ny-1)×(Nz-1) element cube; boundary conditions
     Ux=0 @ x=Xmin, Uy=0 @ y=Ymin, Uz=0 @ z=Zmin;
     uniform distributed force in the z-direction @ z=Zmin.]

  • IWOMP, JUN05

    37

    Effect of Ordering

  • IWOMP, JUN05

    38

    Effect of Re-Ordering: Matrix Storage Formats

    ● PDJDS/CM-RCM (proposed method)            : long loops, continuous memory access
    ■ PDCRS/CM-RCM (short innermost loop)       : short loops, continuous memory access
    ▲ CRS, no re-ordering                       : short loops, irregular memory access

  • IWOMP, JUN05

    39

    Effect of Re-Ordering: Results on 1 SMP node
    Color #: 99 (fixed). Re-Ordering is REALLY required!

    [Plot: GFLOPS vs. DOF (1.E+04 to 1.E+07) for ● PDJDS/CM-RCM, ■ PDCRS/CM-RCM (short innermost loop), ▲ CRS without re-ordering. The vector-length effect gives roughly a 10x gain and re-ordering a further 100x; PDJDS/CM-RCM reaches 22 GFLOPS, 34% of peak (ideal single-CPU performance: 40%-45%).]

  • IWOMP, JUN05

    40

    Effect of Re-Ordering: Results on 1 SMP node
    Color #: 99 (fixed). Re-Ordering is REALLY required!

    [Plot: GFLOPS vs. DOF for ● PDJDS/CM-RCM, ■ PDCRS/CM-RCM (short innermost loop), ▲ CRS without re-ordering.]

    80x80x80 case (1.5M DOF):
      ● 212 iterations, 11.2 sec.
      ■ 212 iterations, 143.6 sec.
      ▲ 203 iterations, 674.2 sec.

  • IWOMP, JUN05

    41

    Ideal Performance: 48.6% of Peak

  • IWOMP, JUN05

    42

    SMP node # > 10, up to 176 nodes (1,408 PEs)

    Problem size for each SMP node is fixed. PDJDS-CMC/RCM, Color #: 99

  • IWOMP, JUN05

    43

    3D Elastic Model (Large Case)
    256x128x128/SMP node, up to 2,214,592,512 DOF
    ● Flat MPI, ○ Hybrid

    [Plots: GFLOPS rate and parallel work ratio (%) vs. number of SMP nodes (0-192).]

    3.8 TFLOPS for 2.2G DOF on 176 nodes (33.8% of peak)

  • IWOMP, JUN05

    44

    3D Elastic Model (Small Case)
    64x64x64/SMP node, up to 125,829,120 DOF
    ● Flat MPI, ○ Hybrid

    [Plots: GFLOPS rate and parallel work ratio (%) vs. number of SMP nodes (0-192).]

  • IWOMP, JUN05

    45

    3D Elastic Model: Problem Size and Parallel Performance
    8-176 SMP nodes of the Earth Simulator
    ー: Flat MPI, ---: Hybrid

    [Plots: peak performance ratio (%) and GFLOPS vs. total DOF (1.E+06 to 1.E+10), for problem sizes per SMP node of 3x64^3, 3x80^3, 3x100^3, 3x128^3 and 3x256x128x128; node counts from 8 up to 176 SMP nodes.]

  • IWOMP, JUN05

    46

    3D Elastic Model: Problem Size and Parallel Performance
    8-176 SMP nodes of the Earth Simulator
    ー: Flat MPI, ---: Hybrid

    [Plots: iterations vs. total DOF, and the Hybrid/Flat MPI time ratio (%) vs. node count for problem sizes per SMP node of 3x64^3, 3x100^3 and 3x256x128x128; a ratio below 100% means Hybrid is faster, above 100% means Flat MPI is faster.]

  • IWOMP, JUN05

    47

    Hybrid outperforms Flat-MPI

    • when ...
      – the number of SMP nodes (PEs) is large.
      – the problem size/node is small.
    • because Flat-MPI has ...
      – 8 times as many communicating processes
      – twice as large a communication/computation ratio (see the estimate below)
    • The effect of communication becomes significant when the number of SMP nodes (or PEs) is large.
    • Performance estimation by D. Kerbyson (LANL), LA-UR-02-5222
      – relatively large communication latency of the ES

    [Figure: one subdomain of edge N per SMP node in Hybrid vs. subdomains of edge N/2 per PE in Flat MPI.]
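    A rough surface-to-volume estimate (an editorial illustration consistent with the figure above, assuming cubic subdomains of edge N per SMP node in Hybrid and edge N/2 per process in Flat MPI):

      Hybrid:   communication/computation ∝ 6 N^2 / N^3         = 6/N
      Flat MPI: communication/computation ∝ 6 (N/2)^2 / (N/2)^3 = 12/N

    so Flat MPI carries roughly twice the communication/computation ratio, and with 8 MPI processes per SMP node it also issues about 8 times as many messages.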

  • IWOMP, JUN05

    48

    Comm. latency is relatively large on the ES
    D.J. Kerbyson et al., LA-UR-02-5222

                               ES                       ASCI-Q (HP ES45)
    Proc. speed                500 MHz                  1.25 GHz
    SMP nodes                  8 x 640 = 5,120          4 x 3,072 = 12,288
    Peak speed                 8 x 5,120 = 40 TFLOPS    2.5 x 12,288 = 30 TFLOPS
    Memory                     2 x 5,120 = 10 TB        4 x 12,288 = 48 TB
    Peak memory BWTH           32 GB/s                  2 GB/s (L1/L2: 20/30 GB/s)
    Inter-node MPI comm.       Latency: 5.6 μsec        Latency: 5 μsec
                               BWTH: 11.8 GB/s          BWTH: 300 MB/s

  • IWOMP, JUN05

    49

    Performance Estimation for a Finite-Element Type Application on ES
    using a performance model (not real computations)
    D.J. Kerbyson et al., LA-UR-02-5222 (LANL)

    [Plots: breakdown of time (%) into computation/memory, communication latency and communication bandwidth as the processor count grows from 1 to 5,120.]

  • IWOMP, JUN05

    50

    Hybrid outperforms Flat-MPI (cont.)

    • The difference in iteration counts for convergence between Hybrid and Flat MPI is not so significant, thanks to the additive Schwarz domain decomposition.

    ー: Flat MPI, ---: Hybrid
    [Plot: iterations vs. total DOF; the two curves nearly coincide as the SMP node count grows.]

  • IWOMP, JUN05

    51

    Various Platforms

    • Earth Simulator (ES)
      – Earth Simulator Center, JAMSTEC
      – Vector processors
    • Hitachi SR8000
      – The University of Tokyo
      – PowerPC-based architecture
      – Pseudo-vector capability with preload/prefetch
    • IBM SP-3
      – NERSC / Lawrence Berkeley National Laboratory: Seaborg
      – Power3
      – 8 MB L2 cache for each PE

  • IWOMP, JUN05

    52

    Platforms

                              ES                 IBM SP3 "Seaborg"   Hitachi SR8000
                              Earth Simulator    NERSC/LBNL          Univ. Tokyo / Hitachi Ltd.
    PE#/node                   8                  16                  8
    Peak GF/PE                 8.0                1.5                 1.8
    Mem. BW GB/s/node          256                16                  32
    MPI latency μsec           5.6                16.3*               6-20**
    Network BW GB/s/node       12.3               1.00                1.60

    *) Oliker et al.   **) HLRS

  • IWOMP, JUN05 53

    Reordering for the Earth Simulator: Matrix Storage for ILU Factorization

    ● PDJDS/CM-RCM                              : long loops, continuous memory access
    ■ PDCRS/CM-RCM (short innermost loop)       : short loops, continuous memory access
    ▲ CRS, no re-ordering                       : short loops, irregular memory access

  • IWOMP, JUN05 54

    3D Elastic Simulation: Problem Size vs. GFLOPS
    Earth Simulator, 1 SMP node (8 PEs)

    [Plots: GFLOPS vs. DOF (1.0E+04 to 1.0E+07) for ● PDJDS/CM-RCM, ■ PDCRS/CM-RCM (short innermost loop), ▲ CRS without re-ordering; Flat-MPI and OpenMP panels.]

    Flat-MPI: 23.4 GFLOPS, 36.6% of peak
    OpenMP (Hybrid): 21.9 GFLOPS, 34.3% of peak

  • IWOMP, JUN05 55

    3D Elastic Simulation: Problem Size vs. GFLOPS
    Hitachi SR8000-MPP with pseudo-vectorization, 1 SMP node (8 PEs)

    [Plots: GFLOPS vs. DOF for ● PDJDS/CM-RCM, ■ PDCRS/CM-RCM (short innermost loop), ▲ CRS without re-ordering; Flat-MPI and OpenMP panels.]

    Flat-MPI: 2.17 GFLOPS, 15.0% of peak
    OpenMP (Hybrid): 2.68 GFLOPS, 18.6% of peak

  • IWOMP, JUN05 56

    3D Elastic Simulation: Problem Size vs. GFLOPS
    IBM SP3 (NERSC), 1 SMP node (8 PEs)

    [Plots: GFLOPS vs. DOF for ● PDJDS/CM-RCM, ■ PDCRS/CM-RCM (short innermost loop), ▲ CRS without re-ordering; Flat-MPI and OpenMP panels.]

    Cache is well utilized in Flat-MPI.

  • IWOMP, JUN05 57

    3D Elastic Simulation: Single PE Performance
    GFLOPS, 104 colors (DJDS ●, CRS ■)

                                     98,304 DOF       331,776 DOF      786,432 DOF
                                     (=3x32^3)        (=3x48^3)        (=3x64^3)
                                     DJDS    CRS      DJDS    CRS      DJDS    CRS
    Earth Simulator    Hybrid        1.67     -       2.06     -       3.01     -
                       Flat MPI      2.75     -       3.02     -       3.19     -
    Hitachi SR8000     Hybrid        0.404   0.127    0.437   0.140    0.423   0.146
                       Flat MPI      0.366   0.128    0.343   0.138    0.322   0.146
    IBM SP3            Hybrid        0.145   0.167    0.142   0.165    0.137   0.145
                       Flat MPI      0.138   0.165    0.135   0.159    0.126   0.141

  • IWOMP, JUN05 58

    Performance of Intra-Node MPI
    8 PEs: measurement using the SEND/RECV subroutines in GeoFEM

    [Plots for ES, Hitachi SR8k and IBM SP3: elapsed time for 1,000 iterations (sec.), elapsed time without the estimated latency (sec.), and estimated communication bandwidth (GB/sec) vs. message size (1.E+02 to 1.E+05).]

  • IWOMP, JUN05 59

    Performance of Intra-Node MPI
    8 PEs: measurement using the inter-domain SEND/RECV subroutines in GeoFEM

                                    ES        SP3       SR8k
    PE#/node                         8         16         8
    Peak GF/PE                       8.0        1.5       1.8
    Mem. BW GB/s/node              256         16        32
    MPI latency μsec                 5.6       16.3*      6-20**
    Network BW GB/s/node            12.3        1.00      1.60
    Intra-node MPI latency μsec      5.29       8.32      9.17
    Intra-node MPI BW MB/s/PE     1018.        52.5      95.8

    Intra-node communication performance seems reasonable given the memory bandwidth and the inter-node communication performance. (A minimal measurement sketch follows below.)
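    A minimal sketch of the kind of measurement behind these numbers is a simple MPI_SEND/MPI_RECV ping-pong between two ranks placed on the same SMP node, timed with MPI_WTIME. This is an editorial illustration, not the GeoFEM SEND/RECV subroutines themselves.

!     Ping-pong between ranks 0 and 1 inside one SMP node (free-form Fortran).
      program pingpong
      implicit none
      include 'mpif.h'
      integer, parameter :: NWORD= 16384, NITER= 1000
      real(kind=8)       :: BUF(NWORD), T0, T1
      integer            :: ierr, myrank, istat(MPI_STATUS_SIZE), i
      call MPI_INIT (ierr)
      call MPI_COMM_RANK (MPI_COMM_WORLD, myrank, ierr)
      BUF= 0.d0
      T0= MPI_WTIME()
      do i= 1, NITER
        if (myrank.eq.0) then
          call MPI_SEND (BUF, NWORD, MPI_DOUBLE_PRECISION, 1, 0, MPI_COMM_WORLD, ierr)
          call MPI_RECV (BUF, NWORD, MPI_DOUBLE_PRECISION, 1, 0, MPI_COMM_WORLD, istat, ierr)
        else if (myrank.eq.1) then
          call MPI_RECV (BUF, NWORD, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, istat, ierr)
          call MPI_SEND (BUF, NWORD, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, ierr)
        endif
      enddo
      T1= MPI_WTIME()
      if (myrank.eq.0) write (*,*) 'average round-trip time (sec.):', (T1-T0)/NITER
      call MPI_FINALIZE (ierr)
      end program pingpong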

  • IWOMP, JUN05

    60

    Summary: Single-Node Performance, Flat MPI vs. Hybrid

    • Earth Simulator (ES)
      – Flat MPI is slightly better.
    • Hitachi SR8000
      – Similar features to the ES.
      – Hybrid is better for PDJDS/CM-RCM ordering.
      – The difference in single-PE performance between Hybrid and Flat MPI for PDJDS/CM-RCM is significant. Compiler?
    • IBM SP-3
      – The effect of cache (8 MB/PE) is significant.
      – Cache is well utilized in Flat MPI.
      – Performance on larger problems is similar.

  • IWOMP, JUN05

    61

    Speed-Up for Fixed Problem Size/PE
    ● Hybrid, ● Flat MPI (lines: ideal speed-up extrapolated from 1-node performance)

    Earth Simulator, 12.6M DOF/node:
      [Plot: GFLOPS vs. SMP node # (8 PEs/node).]
      2.2G DOF: 3.80 TFLOPS, 33.7% of peak (176 nodes)

    IBM SP-3 (Seaborg at NERSC), 3.0M DOF/node:
      [Plot: GFLOPS vs. SMP node # (8 PEs/node).]
      0.38G DOF: 110 GFLOPS, 7.16% of peak (128 nodes, 8 PEs/node)

  • IWOMP, JUN05

    62

    Speed-Up for Fixed Problem Size/PE
    ● Hybrid, ● Flat MPI (lines: ideal speed-up extrapolated from 1-node performance)

    Earth Simulator and IBM SP-3 (Seaborg at NERSC), 786.4K DOF/node:
      [Plots: GFLOPS vs. SMP node # (8 PEs/node) for each machine.]
      IBM SP-3: 0.10G DOF, 115 GFLOPS, 7.51% of peak (128 nodes, 8 PEs/node)

  • IWOMP, JUN05

    63

    Speed-Up for Fixed Problem Size/PE
    ● Hybrid, ● Flat MPI (lines: ideal speed-up extrapolated from 1-node performance)
    Hitachi SR8000

    [Plots: GFLOPS vs. SMP node # for 6.29M DOF/node and 786.4K DOF/node.]

    6.29M DOF/node — 8.05G DOF, 128 nodes: ● 335 GFLOPS, 18.2% of peak; ● 257 GFLOPS, 14.7%
    786.4K DOF/node — 0.10G DOF, 128 nodes: ● 271 GFLOPS, 14.7% of peak; ● 276 GFLOPS, 15.0%

  • IWOMP, JUN05

    64

    Platforms

                              ES                 IBM SP3 "Seaborg"   Hitachi SR8000
                              Earth Simulator    NERSC/LBNL          Univ. Tokyo / Hitachi Ltd.
    PE#/node                   8                  16                  8
    Peak GF/PE                 8.0                1.5                 1.8
    Mem. BW GB/s/node          256                16                  32
    MPI latency μsec           5.6                16.3*               6-20**
    Network BW GB/s/node       12.3               1.00                1.60

    Ratios noted on the slide (ES : SP3 : SR8k):  12 : 1 : 1.6  and  2.9 : 1 : 1.2-2.7

    *) Oliker et al.   **) HLRS

  • IWOMP, JUN05

    65

    Summary: Performance for Many Nodes, Flat MPI vs. Hybrid

    • Earth Simulator (ES)
      – The effect of MPI latency is significant for:
        • Flat MPI
        • many nodes
        • small problem size/node
    • Hitachi SR8000, IBM SP-3
      – The effect of MPI latency is hidden by their relatively low communication bandwidth.
      – Performance with 128 SMP nodes is identical to the extrapolation of single-node performance, especially if the problem size/node is sufficiently large.

  • IWOMP, JUN05

    66

    • Background
      – GeoFEM Project & Earth Simulator
      – Preconditioned Iterative Linear Solvers
    • Optimization Strategy on the Earth Simulator
      – BIC(0)-CG for Simple 3D Linear Elastic Applications
        • Effect of reordering on various types of architectures
      – PGA (Pin-Grid Array)
      – Contact Problems
      – Multigrid
    • Summary & Future Works

  • IWOMP, JUN05 67

    Elastic Problems with Cubic Model

  • IWOMP, JUN05 68

    Effect of Loop Length (Color #)
    100^3 nodes = 3x10^6 DOF, PDCRS/CM-RCM Re-Ordering
    Iterations for Convergence

    [Plot: iterations (200-400) vs. number of colors (10-1000) for Flat MPI and Hybrid.]

  • IWOMP, JUN05 69

    Effect of Loop Length (Color #)
    100^3 nodes = 3x10^6 DOF, PDJDS/CM-RCM Re-Ordering
    Earth Simulator (single node)

    [Plots: GFLOPS (0-30) and elapsed time (sec., 0-100) vs. number of colors (10-1000) for Flat MPI and Hybrid.]

  • IWOMP, JUN05 70

    Effect of Loop Length (Color #)
    100^3 nodes = 3x10^6 DOF, PDJDS/CM-RCM Re-Ordering
    Earth Simulator (single node)

    [Plots: GFLOPS and elapsed time vs. number of colors for Flat MPI and Hybrid; performance drops at large color counts: smaller vector length.]

  • IWOMP, JUN05 71

    Effect of Loop Length (Color #)
    100^3 nodes = 3x10^6 DOF, PDJDS/CM-RCM Re-Ordering
    Earth Simulator (single node)

    [Plots: GFLOPS and elapsed time vs. number of colors for Flat MPI and Hybrid; the Hybrid curve degrades more strongly: smaller vector length + OpenMP overhead.]

  • IWOMP, JUN05 72

    Many Colors ... a Trade-Off

    • Fast convergence in terms of iteration count (S. Doi et al.)
    • Smaller vector length: the FLOPS rate decreases.
    • Hybrid is much more sensitive to the color number, especially on the Earth Simulator.

  • IWOMP, JUN05 73

    Hybrid is much more sensitive to color numbers!

      do iv= 1, NCOLORS                                     ! colors: sequential
!$omp parallel do private (iv0,j,iS,iE,i,k,kk etc.)
        do ip= 1, PEsmpTOT                                  ! SMP parallel
          iv0= STACKmc(PEsmpTOT*(iv-1)+ip-1)
          do j= 1, NLhyp(iv)
            iS= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip-1)
            iE= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip  )
!CDIR NODEP
            do i= iv0+1, iv0+iE-iS                          ! vectorized
              k = i+iS-iv0
              kk= IAL(k)
              y(i)= y(i) + A(k)*X(kk)                       ! the important computation
            enddo
          enddo
        enddo
      enddo

  • IWOMP, JUN05 74

    Effect of Loop Length (Color #)
    100^3 nodes = 3x10^6 DOF, PDJDS/CM-RCM Re-Ordering
    Hitachi SR8k (single node)

    [Plots: GFLOPS (0-3) and elapsed time (sec., 0-500) vs. number of colors for Flat MPI and Hybrid.]

    Performance is not as sensitive to the color # as on the Earth Simulator. In Flat MPI, performance is actually improved.

  • IWOMP, JUN05 75

    Effect of Loop Length (Color #)
    100^3 nodes = 3x10^6 DOF, PDCRS/CM-RCM Re-Ordering
    IBM SP3 (single node, 8 PEs)

    [Plots: GFLOPS (0-1.5) and elapsed time (sec., 0-1000) vs. number of colors for Flat MPI and Hybrid.]

    Performance is not as sensitive to the color # as on the Earth Simulator.

  • IWOMP, JUN05 76

    Effect of Loop Length (Color #)
    100^3 nodes = 3x10^6 DOF
    IBM SP3 (single node, 8 PEs)

    [Plots: GFLOPS (0-1.5) vs. number of colors for Flat MPI and Hybrid.]

  • IWOMP, JUN05 77

    South-West Japan Model: Contact Problems
    Selective Blocking Preconditioning

  • IWOMP, JUN05 78

    South-West Japan

    [Figure: plate configuration around southwestern Japan: Eurasia, Philippine and Pacific plates.]

  • IWOMP, JUN05 79

    South-West Japan Model
    800K nodes, 2.4M DOF, 1 SMP node (8 PEs)
    ● Flat MPI, ○ Hybrid (OpenMP)

    [Plot: iterations (400-650) vs. number of colors (1-1000).]

  • IWOMP, JUN05 80

    South-West Japan Model
    800K nodes, 2.4M DOF, ES, 1 SMP node (8 PEs)
    ● Flat MPI, ○ Hybrid (OpenMP)

    [Plots: iterations, elapsed time (sec.) and GFLOPS vs. number of colors; at large color counts Flat MPI suffers from the smaller vector length, and Hybrid additionally from OpenMP overhead.]

  • IWOMP, JUN05 81

    South-West Japan Model
    800K nodes, 2.4M DOF, Hitachi SR8000, 1 SMP node (8 PEs)
    ● Flat MPI, ○ Hybrid (OpenMP)

    [Plots: iterations, elapsed time (sec.) and GFLOPS vs. number of colors.]

  • IWOMP, JUN05 82

    South-West Japan Model
    800K nodes, 2.4M DOF, Hitachi SR8000, 1 SMP node (8 PEs)
    ● Flat MPI, ○ Hybrid (OpenMP)

    [Plots (repeated): iterations, elapsed time (sec.) and GFLOPS vs. number of colors.]

  • IWOMP, JUN05 83

    South-West Japan Model
    800K nodes, 2.4M DOF, Hitachi SR8000, 1 SMP node (8 PEs)
    ● Flat MPI, ○ Hybrid (OpenMP)

    [Plots: iterations, elapsed time (sec.) and GFLOPS vs. number of colors; at large color counts: smaller vector length + OpenMP overhead.]

  • IWOMP, JUN05 84

    South-West Japan Model
    800K nodes, 2.4M DOF, Hitachi SR8000, 1 SMP node (8 PEs)
    ● Flat MPI, ○ Hybrid (OpenMP)

    [Plots: iterations, elapsed time (sec.) and GFLOPS vs. number of colors.]  WHY?

  • IWOMP, JUN05 85

    South-West Japan Model
    800K nodes, 2.4M DOF, IBM SP3, 1 SMP node (8 PEs)
    ● Flat MPI, ○ Hybrid (OpenMP)

    [Plots: iterations, elapsed time (sec., 0-1000) and GFLOPS (0-1) vs. number of colors.]

  • IWOMP, JUN05 86

    South-West Japan Model
    800K nodes, 2.4M DOF, IBM SP3, 1 SMP node (8 PEs)
    ● Flat MPI, ○ Hybrid (OpenMP)

    [Plots (repeated): iterations, elapsed time (sec.) and GFLOPS vs. number of colors.]

  • IWOMP, JUN05 87

    South-West Japan Model
    7,767,002 nodes, 23,301,006 DOF, ES, 10 SMP nodes (peak = 640 GFLOPS)
    ● Flat MPI, ○ Hybrid

    [Plot: iterations (3000-3500) vs. number of colors (10-1000).]

  • IWOMP, JUN05 88

    South-West Japan Model
    7,767,002 nodes, 23,301,006 DOF, ES, 10 SMP nodes (peak = 640 GFLOPS)
    ● Flat MPI, ○ Hybrid

    [Plots: GFLOPS (75-200) and elapsed time (sec., 100-250) vs. number of colors.]

  • Complicated Geometries: PGA for Mobile Pentium III

  • IWOMP, JUN05 90

    PGA model: 61 pins, 956,128 elements, 1,012,354 nodes (3,037,062 DOF)

  • IWOMP, JUN05 91

    PGA model: 61 pins, 956,128 elements, 1,012,354 nodes (3,037,062 DOF), 1 SMP node
    PDJDS/MC Re-Ordering
    ● Flat MPI, ○ OpenMP

    [Plot: iterations (400-500) vs. number of colors (10-1000).]

  • IWOMP, JUN05 92

    PGA model: 61 pins, 956,128 elements, 1,012,354 nodes (3,037,062 DOF), 1 SMP node
    PDJDS/MC Re-Ordering, Earth Simulator
    ● Flat MPI, ○ OpenMP

    [Plots: elapsed time (sec., 0-150) and GFLOPS (0-30) vs. number of colors; at large color counts Flat MPI suffers from the smaller vector length, and Hybrid additionally from OpenMP overhead.]

  • IWOMP, JUN05 93

    PGA model: 61 pins, 956,128 elements, 1,012,354 nodes (3,037,062 DOF), 1 SMP node
    PDJDS/MC Re-Ordering, Hitachi SR8000
    ● Flat MPI, ○ OpenMP

    [Plots: elapsed time (sec., 0-800) and GFLOPS (0-3) vs. number of colors.]  WHY?

  • IWOMP, JUN05 94

    PGA model: 61 pins, 956,128 elements, 1,012,354 nodes (3,037,062 DOF), 1 SMP node
    PDJDS/MC Re-Ordering, IBM SP3
    ● Flat MPI, ○ OpenMP

    [Plots: elapsed time (sec., 0-3000) and GFLOPS (0-1) vs. number of colors.]

  • IWOMP, JUN05 95

    PGA model: 61 pins, 956,128 elements, 1,012,354 nodes (3,037,062 DOF), 1 SMP node
    Hitachi SR8000
    ●■ Flat MPI, ○□ OpenMP

    [Plots: GFLOPS (0-3) vs. number of colors for PDJDS/MC and PDCRS/MC.]  WHY?

  • IWOMP, JUN05 96

    Single PE test for the simple cubic model
    786,432 DOF = 3x64^3 nodes, Hitachi SR8000
    ● Flat MPI (PDJDS), ■ Flat MPI (PDCRS), ○ OpenMP (PDJDS), □ OpenMP (PDCRS)

    [Plot: GFLOPS (0-0.5) vs. number of colors.]

    The behavior of Flat MPI with PDJDS looks like that of a scalar processor, while OpenMP/PDJDS looks like a vector processor. A feature (or problem) of the compiler? The pseudo-vector option does not work well in Flat MPI.

  • IWOMP, JUN05 97

    PGA model: 61 pins, 956,128 elements, 1,012,354 nodes (3,037,062 DOF), 1 SMP node
    IBM SP3
    ●■ Flat MPI, ○□ OpenMP

    [Plots: GFLOPS (0-1.5) vs. number of colors for PDJDS/MC and PDCRS/MC.]

  • IWOMP, JUN05 98

    PDCRS is more suitable for scalar processors

    The reduction-type loop of CRS is more suitable for a cache-based scalar processor because of its localized operation. PDJDS provides data locality if the color number increases, so on scalar processors performance may improve as the color number increases.

    PDJDS (diagonal-wise, long vector loop over rows):

      do j= 1, NCON
        do i= 1, NN(j)
          k = index(j-1) + i
          kk= item(k)
          Y(i)= Y(i) + A(k)*X(kk)
        enddo
      enddo

    PDCRS (row-wise reduction into Y(i), localized access):

      do i= 1, N
        k1= index(i-1)+1
        k2= index(i)
        do k= k1, k2
          kk= item(k)
          Y(i)= Y(i) + A(k)*X(kk)
        enddo
      enddo

  • IWOMP, JUN05 99

    PGA model: 61 pins, 956,128 elements, 1,012,354 nodes (3,037,062 DOF), 1 SMP node
    IBM SP3
    ●■ Flat MPI, ○□ OpenMP

    [Plots: GFLOPS (0-1.5) vs. number of colors for PDJDS/MC and PDCRS/MC; performance is slightly improved due to data locality.]

  • IWOMP, JUN05 100

    PGA model: 61 pins, 956,128 elements, 1,012,354 nodes (3,037,062 DOF), 1 SMP node
    IBM SP3
    ●■ Flat MPI, ○□ OpenMP

    [Plots (repeated): GFLOPS vs. number of colors for PDJDS/MC and PDCRS/MC; performance is slightly improved due to data locality.]

  • IWOMP, JUN05 101

    PGA model: 61 pins, 956,128 elements, 1,012,354 nodes (3,037,062 DOF), 1 SMP node
    IBM SP3
    ●■ Flat MPI, ○□ OpenMP

    [Plots: GFLOPS vs. number of colors for PDJDS/MC and PDCRS/MC; the gap between Flat MPI and OpenMP reflects the OpenMP overhead.]

  • IWOMP, JUN05 102

    PGA model: 61 pins, 956,128 elements, 1,012,354 nodes (3,037,062 DOF), 1 SMP node
    Itanium 2 (SGI Altix), 8 PEs
    ● Flat MPI (PDJDS), ■ Flat MPI (PDCRS)

    [Plots: GFLOPS (0-3) and elapsed time (sec., 0-1000) vs. number of colors.]

  • IWOMP, JUN05 103

    PGA model: 61 pins, 956,128 elements, 1,012,354 nodes (3,037,062 DOF), 1 SMP node
    AMD Opteron, 8 PEs
    ● Flat MPI (PDJDS), ■ Flat MPI (PDCRS)

    [Plots: GFLOPS (0-3) and elapsed time (sec., 0-1000) vs. number of colors.]

  • IWOMP, JUN05 104

    Multigrids

  • IWOMP, JUN05

    105

    Multigrid is scalable!

    [Plot (based on LLNL data): time to solution vs. number of processors (problem size scaled with processor count, 10^0 to 10^3); MGCG stays flat (scalable) while ICCG and J2CG grow (unscalable).]

  • IWOMP, JUN05

    106

    Target Application: Thermal Convection between Two Spherical Surfaces

    • Typical geometry in earth science simulations
      – Mantle, core, etc.
      – Also applicable to external flow
    • Boundary conditions
      – r = Rin (= 0.50): u=v=w=0, T=1
      – r = Rout (= 1.00): u=v=w=0, T=0
      – Heat generation

  • IWOMP, JUN05

    107

    Semi-Implicit Pressure Correction Scheme

    • Governing equations
      – Momentum (Navier-Stokes)
      – Continuity
      – Energy
    • Poisson equation for pressure correction derived from the continuity constraint.
    • Parallel MGCG iterative method for the Poisson equations (a V-cycle sketch follows below)
      – V-cycle, geometric multigrid
      – Gauss-Seidel smoothing
      – Semi-coarsening
      – Multilevel communication tables
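    For reference, a minimal recursive V-cycle of the kind used as the preconditioner inside MGCG might look like the sketch below. This is an editorial illustration, not the GeoFEM code: smooth, residual, restrict, prolongate, coarse_solve and the module mg_data (providing LEVmax and the coarse-grid sizes NCOARSE) are assumed names.

!     One V-cycle: improves Z so that A Z ≈ R on level "lev".
      recursive subroutine Vcycle (lev, N, R, Z)
      use mg_data, only: LEVmax, NCOARSE       ! hypothetical grid metadata
      implicit none
      integer, intent(in)         :: lev, N
      real(kind=8), intent(in)    :: R(N)
      real(kind=8), intent(inout) :: Z(N)
      real(kind=8), allocatable   :: RES(:), RC(:), ZC(:)
      integer                     :: NC
      if (lev.eq.LEVmax) then
        call coarse_solve (N, R, Z)            ! solve on the coarsest grid
        return
      endif
      call smooth (lev, N, R, Z)               ! pre-smoothing (Gauss-Seidel)
      allocate (RES(N))
      call residual (lev, N, R, Z, RES)        ! res = r - A z
      NC= NCOARSE(lev+1)
      allocate (RC(NC), ZC(NC))
      call restrict (lev, RES, RC)             ! fine -> coarse (semi-coarsening)
      ZC= 0.d0
      call Vcycle (lev+1, NC, RC, ZC)          ! recurse on the coarse grid
      call prolongate (lev, ZC, Z)             ! add coarse-grid correction to z
      call smooth (lev, N, R, Z)               ! post-smoothing
      deallocate (RES, RC, ZC)
      end subroutine Vcycle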

  • IWOMP, JUN05

    108

    Start from an Icosahedron
    Surface Grid by Adaptive Mesh Refinement (AMR)

    Level 0: 12 nodes, 20 triangles
    Level 1: 42 nodes, 80 triangles
    Level 2: 162 nodes, 320 triangles
    Level 3: 642 nodes, 1,280 triangles
    Level 4: 2,562 nodes, 5,120 triangles

  • IWOMP, JUN05

    109

    Semi-Unstructured Prismatic Grids

    • generated from unstructured surface triangles
    • structured in the normal-to-surface direction
    • flexible
    • suitable for near-wall boundary-layer computation

  • IWOMP, JUN05

    110

    Results on Hitachi SR2201
    320x900 = 288,000 cells/PE, up to 37M DOF on 128 PEs

    [Plots: elapsed time (sec.) and parallel performance (%) vs. PE # (0-128) for ICCG, MGCG/FGS and MGCG/ILU.]

  • IWOMP, JUN05

    111

    How about on the ES?

    • 1 CPU
      – 768,000 elements (1280×600)
      – ICCG : 29 sec. (Xeon 2.8 GHz) ⇒ 169 sec. (ES)
      – MGCG: 58 sec. (Xeon 2.8 GHz) ⇒ 345 sec. (ES)
    • Why?
      – CRS: the loop length is small.
    • Remedy
      – Reordering as done in BILU(0)
        • Gauss-Seidel smoothing: dependency
      – Traditional multicoloring

  • IWOMP, JUN05

    112

    Critical Issue of Multigrid for Vector Computers

    • Shorter Loop-Length for Coarse Grid Level.

  • IWOMP, JUN05

    113

    Vectorization: Re-Ordering, Earth Simulator
    • RCM (Reverse Cuthill-McKee), Multicolor

    768,000 elements, ES, 1 PE
    ○ MGCG(MC), △ MGCG(RCM), ● ICCG(MC), ▲ ICCG(RCM)
    MGCG (original ordering): 345 sec.;  ICCG (original ordering): 169 sec.

    [Plot: elapsed time (sec., 0-6) vs. color # (0-1500); figure of a 5×5 grid colored by the orderings.]

  • IWOMP, JUN05

    114

    GFLOPS and Iterations: 768,000 elements, ES, 1 PE

    [Plots: iterations (10–1000) and GFLOPS (1.0–2.5) vs. number of colors (0–1500) for MGCG(MC), MGCG(RCM), ICCG(MC), ICCG(RCM)]

    • Many colors: fewer iterations for convergence, but lower performance due to shorter loops (especially in multigrid) — see the quick check below.
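    As a back-of-the-envelope check (assuming, purely for illustration, that the unknowns are split evenly over the colors and ignoring the further split over hyperplanes and threads), the average innermost loop length for the 768,000-element case shrinks quickly as the color count grows:

      program loop_length
        implicit none
        integer, parameter :: n = 768000               ! elements in the 1-PE case above
        integer :: ncolor(4) = (/ 10, 100, 500, 1500 /)
        integer :: i
        do i = 1, 4
          print '(a,i5,a,i7)', 'colors =', ncolor(i), '   average loop length =', n/ncolor(i)
        end do
      end program loop_length

    At 1500 colors the average length is only about 512, which is consistent with the drop in GFLOPS even though the iteration count improves.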

  • IWOMP, JUN05

    115

    Scalability, Flat MPI: 768,000 elements/PE, up to 80 PEs

    [Plot: elapsed time (sec., 0–50) vs. DOF (up to 6×10^7) for MGCG(RCM) and ICCG(MC=1500)]

  • IWOMP, JUN05

    116

    Next step: Hybrid

    Many colors are required for reasonably fast convergence, but each color adds OpenMP synchronization overhead.

    Loop structure of the re-ordered kernel (the loop over ip is SMP-parallel; the innermost loop is vectorized):

      do iv= 1, NCOLORS                                    ! loop over colors
    !$omp parallel do private (iv0,j,iS,iE,i,k,kk etc.)
        do ip= 1, PEsmpTOT                                 ! SMP-parallel loop over threads
          iv0= STACKmc(PEsmpTOT*(iv-1)+ip-1)
          do j= 1, NLhyp(iv)                               ! hyperplanes within this color
            iS= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip-1)
            iE= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip  )
    !CDIR NODEP
            do i= iv0+1, iv0+iE-iS                         ! vectorized innermost loop
              k = i+iS-iv0
              kk= IAL(k)
              (Important Computations)
            enddo
          enddo
        enddo
      enddo


  • IWOMP, JUN05

    117

    Hybrid vs. Flat MPI: 8 PEs on the ES (1 SMP node)

    [Plots: sec. per 10 iterations (MGCG) and sec. per 100 iterations (ICCG) vs. number of colors (10–10000), Hybrid vs. Flat MPI]

  • IWOMP, JUN05

    118

    MGCG on the ES

    • Excellent performance obtained by re-ordering: more than 20% of peak even for MGCG (about 75% of the ICCG rate) — attractive for Poisson solvers.
    • Excellent scalability for Flat MPI.
    • Hybrid: low performance when many colors are used, which is significant for MGCG; since many colors are needed for reasonably fast convergence, Hybrid is not recommended here.

  • IWOMP, JUN05

    119

    Hybrid vs. Flat MPI: 8 PEs on the ES (1 SMP node), 1280×600×8 = 6,144,000 elements

    [Plots: sec. per 10 iterations (MGCG) and sec. per 100 iterations (ICCG) vs. number of colors (10–10000), Hybrid vs. Flat MPI]

  • IWOMP, JUN05

    120

    Hybrid vs. Flat MPI: 8 PEs on Hitachi SR8000 (1 SMP node), 1280×600×8 = 6,144,000 elements

    [Plots: sec. per 10 iterations (MGCG) and sec. per 100 iterations (ICCG) vs. number of colors (10–10000), Hybrid vs. Flat MPI]

  • IWOMP, JUN05

    121

    Hybrid vs. Flat MPI: 8 PEs on IBM SP-3 (1 SMP node), 1280×600×8 = 6,144,000 elements

    [Plots: sec. per 10 iterations (MGCG) and sec. per 100 iterations (ICCG) vs. number of colors (10–10000), Hybrid vs. Flat MPI]

  • IWOMP, JUN05

    122

    • Background
      – GeoFEM Project & Earth Simulator
      – Preconditioned Iterative Linear Solvers
    • Optimization Strategy on the Earth Simulator
      – BIC(0)-CG for Simple 3D Linear Elastic Applications
        • Effect of reordering on various types of architectures
      – PGA (Pin-Grid Array)
      – Contact Problems
      – Multigrid
    • Summary & Future Works

  • IWOMP, JUN05 123

    Summary (Earth Simulator)
    Hybrid parallel programming model on an SMP cluster architecture with vector processors, for unstructured grids:

    • Good parallel performance both within and across SMP nodes on the ES: 3.8 TFLOPS for 2.2G DOF on 176 nodes (33.8% of peak).
    • Re-ordering is really required.
    • Simple multicoloring is better for complicated geometries.

  • IWOMP, JUN05 124

    Summary (cont.) (Earth Simulator): Hybrid vs. Flat MPI

    • Flat MPI is better for a small number of SMP nodes.
    • Hybrid is better for a large number of SMP nodes, especially when the problem size is rather small: Flat MPI is bounded by communication, Hybrid by memory; the balance depends on the application, problem size, etc.
    • Hybrid is much more sensitive to the number of colors than Flat MPI, due to the synchronization overhead of OpenMP, especially on the Earth Simulator. This is not so significant on the Hitachi SR8000 and IBM SP-3.

  • IWOMP, JUN05 125

    Summary (cont.) (General): Hybrid vs. Flat MPI

    • On the IBM SP-3, the difference between Flat MPI and Hybrid is not so significant, although Flat MPI is slightly better.
    • On the Hitachi SR8000, pseudo-vectorization works better for Hybrid cases.
    • Effect of colors: on scalar processors (especially for Flat MPI), performance improves as the number of colors increases, due to more efficient utilization of cache.

  • IWOMP, JUN05 126

    Hybrid vs. Flat MPI
    Generally speaking, they are competitive.

    • Flat MPI is better for multigrid-type applications with multicolor ordering.
    • In ill-conditioned problems, Flat MPI may require more iterations.
    • Results also depend on the compiler implementation.

  • IWOMP, JUN05 127

    Hybrid vs. Flat MPI (cont.)

    • Earth Simulator: Flat MPI is better; Hybrid is better for a large PE count with small problem size.
    • IBM SP-3: Flat MPI is better for small problem sizes because the cache is utilized more efficiently.
    • Hitachi SR8000: Hybrid is better (mainly because of compiler features).
    • Careful treatment of the number of colors is required for the Hybrid programming model.

  • IWOMP, JUN05 128

    My current feeling is … Flat MPI is slightly better on the Earth Simulator for FEM-type applications with multicoloring.

  • IWOMP, JUN05 129

    Can Hybrid survive? SAI (Sparse Approximate Inverse) preconditioning

    • Suitable for parallel and vector computation, especially for the Hybrid parallel programming model:
      – only matrix-vector products are required;
      – re-ordering to avoid dependencies is not needed (corresponds to a single color) — a minimal sketch follows.
    • Preconditioning for contact problems:
      – J. Zhang, K. Nakajima et al. (2003) for scalar processors
      – K. Nakajima (2005) for the ES, at WOMPEI 2005
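    To make the "only matrix-vector products" point concrete, here is a minimal sketch with toy tridiagonal data; the program and variable names (sai_apply_demo, irow, jcol, val) are invented for the example and are not the GeoFEM kernels. With SAI, the preconditioning step z = M·r inside the CG iteration is an ordinary sparse matrix-vector product, so every row is independent and the loop is OpenMP-parallel and vectorizable without any coloring:

      program sai_apply_demo
        implicit none
        integer, parameter :: n = 5
        integer :: irow(n+1), jcol(3*n), i, k, nnz
        real(8) :: val(3*n), r(n), z(n)

        ! Toy tridiagonal matrix standing in for the approximate inverse M (CRS format).
        nnz = 0
        do i = 1, n
          irow(i) = nnz + 1
          if (i > 1) then
            nnz = nnz + 1; jcol(nnz) = i-1; val(nnz) = 0.25d0
          end if
          nnz = nnz + 1; jcol(nnz) = i; val(nnz) = 0.50d0
          if (i < n) then
            nnz = nnz + 1; jcol(nnz) = i+1; val(nnz) = 0.25d0
          end if
        end do
        irow(n+1) = nnz + 1

        r = 1.0d0

        ! Preconditioning step of CG with SAI: z = M * r, a plain mat-vec (no coloring needed).
    !$omp parallel do private(k)
        do i = 1, n
          z(i) = 0.0d0
          do k = irow(i), irow(i+1)-1
            z(i) = z(i) + val(k)*r(jcol(k))
          end do
        end do
    !$omp end parallel do

        print *, 'z =', z
      end program sai_apply_demo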

  • IWOMP, JUN05 130

    Matrix-Vector Product ONLY

    !$OMP PARALLEL DO private(iv0,j,iS,iE,i)
      do ip= 1, PEsmpTOT                 ! SMP-parallel loop over threads
        iv0= STACKmc(ip)
        do j= 1, NONdiagTOT              ! loop over off-diagonal bands
          iS= indexS(ip, j)
          iE= indexE(ip, j)
    !CDIR NODEP
          do i= iv0+1, iv0+iE-iS         ! vectorized innermost loop
            (computations)
          enddo
        enddo
      enddo
    !$OMP END PARALLEL DO

  • IWOMP, JUN05 131

    Earth Simulator, Mat-Vec.: 1 SMP node (64 GFLOPS peak), 500 iterations. Hybrid is rather faster!

    [Plot: GFLOPS (10–30) vs. DOF (10^4–10^7), Flat MPI vs. Hybrid]

  • IWOMP, JUN05

    132

    Results: ES (1/2)

    [Plots: elapsed time (sec., set-up and solver) vs. thread count (1–8) for the Large problem (2.47M DOF); speed-up vs. thread count for the Large (2.47M DOF) and Small (0.35M DOF) problems]

    Nakajima, K. (2005), "Sparse Approximate Inverse Preconditioner for Contact Problems on the Earth Simulator using OpenMP", International Workshop on OpenMP: Experiences and Implementations (WOMPEI 2005), Tsukuba, Japan.

  • IWOMP, JUN05

    133

    Results: ES (2/2)

    [Plots: elapsed time (sec.) and GFLOPS vs. number of colors (1–1000) for SB-BIC(0) and SAI, Large (2.47M DOF) and Small (0.35M DOF) problems]

    Nakajima, K. (2005), "Sparse Approximate Inverse Preconditioner for Contact Problems on the Earth Simulator using OpenMP", International Workshop on OpenMP: Experiences and Implementations (WOMPEI 2005), Tsukuba, Japan.


  • IWOMP, JUN05

    135

    Results: Hitachi SR8000 (1/2)

    [Plots: elapsed time (sec., set-up and solver) vs. thread count (1–8) for the Large problem (2.47M DOF); speed-up vs. thread count for the Large (2.47M DOF) and Small (0.35M DOF) problems]

    Nakajima, K. (2005), "Sparse Approximate Inverse Preconditioner for Contact Problems on the Earth Simulator using OpenMP", International Workshop on OpenMP: Experiences and Implementations (WOMPEI 2005), Tsukuba, Japan.

  • IWOMP, JUN05

    136

    Comparison: ES vs. Hitachi SR8k

    [Plots: elapsed time (sec., set-up and solver) and speed-up vs. thread count (1–8) for the ES and the SR8k]

  • IWOMP, JUN05

    137

    Results: Hitachi SR8000 (2/2)

    [Plots: elapsed time (sec.) and GFLOPS vs. number of colors (1–1000) for SB-BIC(0) and SAI, Large (2.47M DOF) and Small (0.35M DOF) problems]

    Nakajima, K. (2005), "Sparse Approximate Inverse Preconditioner for Contact Problems on the Earth Simulator using OpenMP", International Workshop on OpenMP: Experiences and Implementations (WOMPEI 2005), Tsukuba, Japan.

  • IWOMP, JUN05

    138

    Results: IBM SP-3 (1/2)

    [Plots: elapsed time (sec., set-up and solver) vs. thread count (1–16) for the Large problem (2.47M DOF); speed-up vs. thread count for the Large (2.47M DOF) and Small (0.35M DOF) problems]

    Nakajima, K. (2005), "Sparse Approximate Inverse Preconditioner for Contact Problems on the Earth Simulator using OpenMP", International Workshop on OpenMP: Experiences and Implementations (WOMPEI 2005), Tsukuba, Japan.

  • IWOMP, JUN05 139

    Acknowledgements

    • Sponsor: MEXT, Japan.
    • Computational resources:
      – Earth Simulator Center, Japan.
      – NERSC / Lawrence Berkeley National Laboratory, USA (Dr. Horst D. Simon, Dr. Esmond G. Ng).
      – Information Technology Center, The University of Tokyo, Japan.
    • Colleagues of the GeoFEM Project.
    • IWOMP Committee.
