Flat MPI and Hybrid Parallel Programming Models for FEM Applications on SMP Cluster Architectures Kengo Nakajima The 21st Century Earth Science COE Program Department of Earth & Planetary Science The University of Tokyo First International Workshop on OpenMP (IWOMP 05) June 1-4, 2005, Eugene, Oregon, USA.


  • Flat MPI and Hybrid Parallel Programming Models for FEM Applications on SMP Cluster Architectures

    Kengo Nakajima, The 21st Century Earth Science COE Program, Department of Earth & Planetary Science, The University of Tokyo

    First International Workshop on OpenMP (IWOMP 05) June 1-4, 2005, Eugene, Oregon, USA.

  • IWOMP, JUN05

    2

    Overview

    • Background
      – GeoFEM Project & Earth Simulator
      – Preconditioned Iterative Linear Solvers
    • Optimization Strategy on the Earth Simulator
      – BIC(0)-CG for Simple 3D Linear Elastic Applications
        • Effect of reordering on various types of architectures
      – Contact Problems
      – PGA (Pin-Grid Array)
      – Multigrid
    • Summary & Future Works

  • IWOMP, JUN05

    3

    GeoFEM: FY.1998-2002, http://geofem.tokyo.rist.or.jp/

    • Parallel FEM platform for solid earth simulation.
      – parallel I/O, parallel linear solvers, parallel visualization
      – solid earth: earthquake, plate deformation, mantle/core convection, etc.
    • Part of a national project by STA/MEXT for large-scale earth science simulations using the Earth Simulator.
    • Strong collaboration between the HPC and natural science (solid earth) communities.
    • Current Activity
      – Research Consortium for Solid Earth Simulations on the Earth Simulator (9 sub-groups, >80 members)

  • IWOMP, JUN05

    4

    System Configuration of GeoFEM

    [Diagram: a one-domain mesh is partitioned by the Partitioner into partitioned meshes distributed to the PEs. The GeoFEM platform provides parallel I/O, equation solvers and a visualizer behind solver, communication and visualization interfaces (Solver I/F, Comm. I/F, Vis. I/F). Pluggable analysis modules cover structural analysis (static linear, dynamic linear, contact), fluid and wave problems; visualization data are handled by GPPView, with utilities supporting the workflow.]

  • IWOMP, JUN05

    5

    Results on Solid Earth Simulation

    Magnetic Field of the Earth: MHD code
    Complicated Plate Model around Japan Islands
    Simulation of Earthquake Generation Cycle in Southwestern Japan
    TSUNAMI !!
    Transportation by Groundwater Flow through Heterogeneous Porous Media

    [Figure panels: grids with Δh=5.00 and Δh=1.25; snapshots at T=100, 200, 300, 400, 500.]

  • IWOMP, JUN05

    6

    Results by GeoFEM

  • IWOMP, JUN05

    7

    Motivations

    • GeoFEM Project (FY.1998-2002)
    • Performance evaluation of FEM-type applications with complicated unstructured grids (not LINPACK, FDM, ...) on the Earth Simulator (ES)
      – Implicit Linear Solvers: Preconditioned Iterative Solvers
      – Hybrid vs. Flat MPI Parallel Programming Models

  • IWOMP, JUN05

    8

    Earth Simulator (ES): http://www.es.jamstec.go.jp/

    • 640×8 = 5,120 Vector Processors
      – SMP Cluster-Type Architecture
      – 8 GFLOPS/PE, 64 GFLOPS/Node, 40 TFLOPS/ES
    • 16 GB Memory/Node, 10 TB/ES
    • 640×640 Crossbar Network
      – 12.3 GB/sec × 2
    • Memory Bandwidth: 32 GB/sec/Node
    • 35.6 TFLOPS for LINPACK (March 2002)
    • 26 TFLOPS for AFES (Climate Simulation)

  • IWOMP, JUN05

    9

    Flat MPI vs. Hybrid

    [Diagram: Hybrid (hierarchical structure) runs one MPI process per SMP node, with the 8 PEs sharing one memory; Flat-MPI runs an independent MPI process on each PE.]

  • IWOMP, JUN05

    10

    Hybrid vs. Flat-MPI

    • "Initial" Motivation for Hybrid
      – Block Jacobi-type Localized Parallel Preconditioning
      – Relatively lower convergence rate for many domains (or processors).
      – "Hybrid" may provide a better convergence rate (i.e. faster convergence) because the number of domains is 1/8 of that of the Flat-MPI programming model (on the ES).

  • IWOMP, JUN05

    11

    Local Data Structure: Node-based Partitioning
    internal nodes - elements - external nodes

    [Figure: a 5×5 node mesh partitioned into four domains (PE#0-PE#3); each domain holds its internal nodes, the elements containing them, and external nodes imported from neighboring domains.]

  • IWOMP, JUN05

    12

    Block-Jacobi Type Localized ILU(0) Preconditioning [Nakajima et al. 1999]

    • Ignore the effects of elements which belong to a different partition/processor.
    • Not as strong as the original ILU.
      – How bad is it in many-PE cases?
    • The global data dependency in the ILU preconditioner is not suitable for parallel computing.

    Forward/backward substitution within each local block:
    \[ y_k = b_k - \sum_{j=1}^{k-1} l_{kj}\, y_j \qquad (k = 2, \dots, N) \]
    \[ x_k = d_k \Bigl( \tilde{y}_k - \sum_{j=k+1}^{N} u_{kj}\, x_j \Bigr) \qquad (k = N, N-1, \dots, 1) \]

  • IWOMP, JUN05

    13

    Block-Jacobi Type Localized ILU(0) Preconditioning [Nakajima et al. 1999]

    Matrix components whose column numbers lie outside the processor are ignored (set equal to 0) during backward/forward substitution (sketched below).

    [Figure: block rows of matrix A assigned to PE#1-PE#4; within each block row the coefficients coupling to nodes on other processors are dropped from the preconditioner.]
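    The idea can be written down directly. The fragment below is an editorial sketch, not the GeoFEM source: it assumes a local lower-triangular structure in CRS-like arrays INL/IAL/AL with hypothetical names, where the Nin internal nodes come first and any column index pointing to an external node (> Nin) is skipped.

!     Localized block-Jacobi forward substitution: off-processor
!     couplings are simply ignored (set to zero), as on this slide.
      do i= 1, Nin
        WVAL= R(i)
        do k= INL(i-1)+1, INL(i)
          j= IAL(k)
          if (j.le.Nin) then
            WVAL= WVAL - AL(k) * Z(j)      ! node on this processor
          endif                            ! j > Nin (external): ignored
        enddo
        Z(i)= WVAL / D(i)
      enddo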

  • IWOMP, JUN05

    14

    Overlapped Additive Schwarz Domain Decomposition Method

    Effect of additive Schwarz domain decomposition for a solid mechanics example with 3×44³ DOF on Hitachi SR2201; number of ASDD cycles/iteration = 1, ε = 10⁻⁸.

    PE#   NO Additive Schwarz             WITH Additive Schwarz
          Iter.#   sec.    Speed-Up       Iter.#   sec.    Speed-Up
      1    204    233.7      -             144    325.6      -
      2    253    143.6     1.63           144    163.1     1.99
      4    259     74.3     3.15           145     82.4     3.95
      8    264     36.8     6.36           146     39.7     8.21
     16    262     17.4    13.52           144     18.7    17.33
     32    268      9.6    24.24           147     10.2    31.80
     64    274      6.6    35.68           150      6.5    50.07


  • IWOMP, JUN05

    16

    Overlapped Additive Schwarz Domain Decomposition Method
    for Stabilizing Localized Preconditioning

    Global operation on Ω:        M z = r
    Local operations on Ω1, Ω2:   M_{Ω1} z_{Ω1} = r_{Ω1},   M_{Ω2} z_{Ω2} = r_{Ω2}

    Global nesting correction over the overlapped regions Γ2→1, Γ1→2 (n-th cycle):
    \[ z_{\Omega_1}^{n} \leftarrow z_{\Omega_1}^{n-1} + M_{\Omega_1}^{-1}\bigl( r_{\Omega_1} - M_{\Omega_1} z_{\Omega_1}^{n-1} - M_{\Gamma_{1\to 2}}\, z_{\Gamma_{2\to 1}}^{n-1} \bigr) \]
    \[ z_{\Omega_2}^{n} \leftarrow z_{\Omega_2}^{n-1} + M_{\Omega_2}^{-1}\bigl( r_{\Omega_2} - M_{\Omega_2} z_{\Omega_2}^{n-1} - M_{\Gamma_{2\to 1}}\, z_{\Gamma_{1\to 2}}^{n-1} \bigr) \]

  • IWOMP, JUN05

    17

    Overlapped Additive Schwarz Domain Decomposition Method

    Effect of additive Schwarz domain decomposition for a solid mechanics example with 3×44³ DOF on Hitachi SR2201; number of ASDD cycles/iteration = 1, ε = 10⁻⁸.

    PE#   NO Additive Schwarz             WITH Additive Schwarz
          Iter.#   sec.    Speed-Up       Iter.#   sec.    Speed-Up
      1    204    233.7      -             144    325.6      -
      2    253    143.6     1.63           144    163.1     1.99
      4    259     74.3     3.15           145     82.4     3.95
      8    264     36.8     6.36           146     39.7     8.21
     16    262     17.4    13.52           144     18.7    17.33
     32    268      9.6    24.24           147     10.2    31.80
     64    274      6.6    35.68           150      6.5    50.07

  • IWOMP, JUN05

    18

    Hybrid vs. Flat-MPI

    • "Initial" Motivation for Hybrid
      – Block Jacobi-type Localized Parallel Preconditioning
      – Relatively lower convergence rate for many domains (or processors).
      – "Hybrid" may provide a better convergence rate (i.e. faster convergence) because the number of domains is 1/8 of that of the Flat-MPI programming model (on the ES).
    • Factors for Performance
      – Type of Application, Problem Size
      – Balance of H/W Parameters

    R. Rabenseifner: Communication Bandwidth of Parallel Programming Models on Hybrid Architectures, International Workshop on OpenMP: Experiences and Implementations (WOMPEI 2002), Lecture Notes in Computer Science 2327, pp. 437-448, 2002.

  • IWOMP, JUN05

    19

    • Background
      – GeoFEM Project & Earth Simulator
      – Preconditioned Iterative Linear Solvers
    • Optimization Strategy on the Earth Simulator
      – BIC(0)-CG for Simple 3D Linear Elastic Applications
        • Effect of reordering on various types of architectures
      – PGA (Pin-Grid Array)
      – Contact Problems
      – Multigrid
    • Summary & Future Works

  • IWOMP, JUN05

    20

    Block IC(0)-CG Solver on the Earth Simulator

    • 3D Linear Elastic Problems (SPD)
    • Parallel Iterative Linear Solver
      – Node-based Local Data Structure
      – Conjugate-Gradient Method (CG): SPD (see the sketch below)
      – Localized Block IC(0) Preconditioning (Block Jacobi)
      – Additive Schwarz Domain Decomposition (ASDD)
      – http://geofem.tokyo.rist.or.jp/
    • Hybrid Parallel Programming Model
      – OpenMP + MPI
      – Re-Ordering for Vector/Parallel Performance
      – Comparison with Flat MPI
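    For orientation, a generic preconditioned CG iteration is sketched below. This is an editorial illustration, not the GeoFEM solver: matvec (parallel A*p with halo exchange), precond (the Block IC(0)/ASDD step above) and dot_global (dot product with an MPI_ALLREDUCE inside) are assumed helper routines.

!     Minimal preconditioned CG skeleton (assumed helpers: matvec,
!     precond, dot_global; convergence test on the residual norm).
      call matvec (N, X, Q)
      R(1:N)= B(1:N) - Q(1:N)
      do iter= 1, maxit
        call precond (N, R, Z)                 ! z = M^{-1} r
        RHO= dot_global (N, R, Z)
        if (iter.eq.1) then
          P(1:N)= Z(1:N)
        else
          BETA= RHO / RHO1
          P(1:N)= Z(1:N) + BETA*P(1:N)
        endif
        call matvec (N, P, Q)                  ! q = A p
        ALPHA= RHO / dot_global (N, P, Q)
        X(1:N)= X(1:N) + ALPHA*P(1:N)
        R(1:N)= R(1:N) - ALPHA*Q(1:N)
        if (sqrt(dot_global(N,R,R)).lt.EPS) exit
        RHO1= RHO
      enddo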

  • IWOMP, JUN05

    21

    Flat MPI vs. Hybrid

    [Diagram (repeated): Hybrid (hierarchical structure) runs one MPI process per SMP node, with the 8 PEs sharing one memory; Flat-MPI runs an independent MPI process on each PE.]

  • IWOMP, JUN05

    22

    Local Data Structure: Node-based Partitioning
    internal nodes - elements - external nodes

    [Figure (repeated): a 5×5 node mesh partitioned into four domains (PE#0-PE#3) with internal nodes, elements and external nodes.]

  • IWOMP, JUN05

    23

    1 SMP node => 1 domain for the Hybrid Programming Model
    MPI communication among domains

    [Diagram: four SMP nodes (Node-0 to Node-3), each with 8 PEs sharing one memory; in the Hybrid model each node forms a single domain and MPI messages are exchanged only between nodes.]

  • IWOMP, JUN05

    24

    Basic Strategy for Parallel Programming on the Earth Simulator

    • Hypothesis
      – Explicit ordering is required for unstructured grids in order to achieve higher performance in factorization processes on SMP nodes and vector processors.

    M = (L+D) D^{-1} (D+U); the preconditioning step solves M z = r:

      Forward substitution:   (L+D) z1 = r             :  z1 = D^{-1} (r - L z1)
      Backward substitution:  (I + D^{-1} U) z_new = z1 :  z  = z - D^{-1} U z

  • IWOMP, JUN05

    25

    Basic Strategy for Parallel Programming on the Earth Simulator

    • Hypothesis
      – Explicit ordering is required for unstructured grids in order to achieve higher performance in factorization processes on SMP nodes and vector processors.

    M = (L+D) D^{-1} (D+U), with L, D, U taken from A.

    Forward substitution   (L+D) z = r :  z = D^{-1} (r - L z)

      do i= 1, N
        WVAL= R(i)
        do j= 1, INL(i)
          WVAL= WVAL - AL(i,j) * Z(IAL(i,j))   ! lower-triangular sweep
        enddo
        Z(i)= WVAL / D(i)
      enddo

    Backward substitution  (I + D^{-1} U) z_new = z_old :  z = z - D^{-1} U z

      do i= N, 1, -1
        SW= 0.0d0
        do j= 1, INU(i)
          SW= SW + AU(i,j) * Z(IAU(i,j))       ! upper-triangular sweep
        enddo
        Z(i)= Z(i) - SW / D(i)
      enddo

    Dependency: the most recent value of z at connected nodes is needed, so vectorization/parallelization is difficult.
    Reordering: directly connected nodes do not appear on the RHS.

  • IWOMP, JUN05

    26

    Basic Strategy for Parallel Programming on the Earth Simulator

    • Hypothesis
      – Explicit ordering is required for unstructured grids in order to achieve higher performance in factorization processes on SMP nodes and vector processors.
    • Re-Ordering for Highly Parallel/Vector Performance
      – Local operation and no global dependency
      – Continuous memory access
      – Sufficiently long loops for vectorization

  • IWOMP, JUN05

    27

    Basic Strategy for Parallel Programming on the Earth Simulator

    • Hypothesis
      – Explicit ordering is required for unstructured grids in order to achieve higher performance in factorization processes on SMP nodes and vector processors.
    • Re-Ordering for Highly Parallel/Vector Performance
      – Local operation and no global dependency
      – Continuous memory access
      – Sufficiently long loops for vectorization
    • 3-Way Parallelism for the Hybrid Parallel Programming Model (see the sketch below)
      – Inter-Node : MPI
      – Intra-Node : OpenMP
      – Individual PE : Vectorization
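    A sketch of where each of the three levels attaches is shown below. It reuses the loop structure and array names of the Cyclic DJDS kernel shown a few slides later, with the matrix-vector update from that later slide filled in; the halo-exchange routine name is a placeholder.

      call SOLVER_SEND_RECV (NP, X)                  ! (1) inter-node: MPI halo exchange
      do iv= 1, NCOLORS                              !     colors processed in sequence
!$omp parallel do private (iv0,j,iS,iE,i,k,kk)
        do ip= 1, PEsmpTOT                           ! (2) intra-node: OpenMP
          iv0= STACKmc(PEsmpTOT*(iv-1)+ip-1)
          do j= 1, NLhyp(iv)
            iS= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip-1)
            iE= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip  )
!CDIR NODEP
            do i= iv0+1, iv0+iE-iS                   ! (3) individual PE: vectorized
              k = i+iS-iv0
              kk= IAL(k)
              Y(i)= Y(i) + AL(k)*X(kk)
            enddo
          enddo
        enddo
      enddo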

  • IWOMP, JUN05

    28

    Re-Ordering Technique for Vector/Parallel Architectures
    Cyclic DJDS (RCM+CMC) Re-Ordering (Doi, Washio, Osoda and Maruyama (NEC))

    1. RCM (Reverse Cuthill-McKee)          2. CMC (Cyclic Multicolor)
    3. DJDS re-ordering                     4. Cyclic DJDS for SMP units

    These steps can be substituted by traditional multicoloring (MC).
    [Figure: the reordered matrix is laid out so that vectorization applies within each color and SMP parallelism across the cyclic blocks.]

  • IWOMP, JUN05

    29

    Re-Ordering Technique for Vector/Parallel Architectures
    Cyclic DJDS (RCM+CMC) Re-Ordering (Doi, Washio, Osoda and Maruyama (NEC))

    K. Nakajima, "Parallel Iterative Solvers of GeoFEM with Selective Blocking Preconditioning for Nonlinear Contact Problems on the Earth Simulator," SC2003 Technical Session, Phoenix, Arizona, 2003.
    http://www.sc-conference.org/sc2003/paperpdfs/pap155.pdf
    http://www.sc-conference.org/sc2003/inter_cal/inter_cal_detail.php?eventid=10703#1

  • IWOMP, JUN05

    30

    Reordering = Coloring

    • COLOR: unit of independent sets.
    • Elements grouped in the same "color" are independent of each other, so parallel/vector operation is possible.
    • Many colors provide faster convergence, but shorter vector length: a trade-off!
    • A greedy multicoloring sketch follows below.

    Red-Black (2 colors)     4 colors     Multicolor
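    As a concrete (editorial) illustration of the coloring idea, a plain greedy multicolor pass is sketched below; it is not the RCM/CMC procedure described on the previous slides, and the adjacency arrays INDEX/ITEM and work array USED are hypothetical. Each node receives the smallest color not already used by a colored neighbor, so nodes of equal color form an independent set.

!     Greedy multicoloring sketch (graph adjacency in CRS form).
      COLOR(1:N)= 0
      NCOLORS   = 0
      do i= 1, N
        USED(1:N)= .false.
        do k= INDEX(i-1)+1, INDEX(i)         ! colors of already-colored neighbors
          j= ITEM(k)
          if (COLOR(j).gt.0) USED(COLOR(j))= .true.
        enddo
        ic= 1
        do while (USED(ic))                  ! smallest unused color
          ic= ic + 1
        enddo
        COLOR(i)= ic
        NCOLORS = max(NCOLORS, ic)
      enddo
!     rows sharing a COLOR value can be processed in one parallel/vector loop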

  • IWOMP, JUN05

    31

    Colors & Iterations: Incompatible Nodes
    Doi, S. (NEC) et al.

    [Figure: 5×5 grid with natural (lexicographic) ordering, nodes numbered 1-25.]

    "Incompatible nodes" are not affected by other nodes. A system with fewer incompatible nodes provides faster convergence.

  • IWOMP, JUN05

    32

    RCM Ordering

    [Figure: the same 5×5 grid renumbered by Reverse Cuthill-McKee; nodes are ordered along anti-diagonal level sets.]

  • IWOMP, JUN05

    33

    Red-Black Ordering
    Highly Parallelized: Large Loop Length
    Many Incompatible Nodes: Slow Convergence

    [Figure: the 5×5 grid in red-black (2-color) ordering.]

  • IWOMP, JUN05

    34

    Four-Color Ordering
    Fairly Parallelized: Shorter Loop Length
    Fewer Incompatible Nodes: Faster Convergence

    [Figure: the 5×5 grid in four-color ordering.]

  • IWOMP, JUN05

    35

    Cyclic DJDS (RCM+CMC) for Forward/Backward Substitution in BILU Factorization

      do iv= 1, NCOLORS                                     ! colors: sequential
!$omp parallel do private (iv0,j,iS,iE,i,k,kk etc.)
        do ip= 1, PEsmpTOT                                  ! SMP parallel
          iv0= STACKmc(PEsmpTOT*(iv-1)+ip-1)
          do j= 1, NLhyp(iv)                                ! jagged diagonals
            iS= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip-1)
            iE= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip  )
!CDIR NODEP
            do i= iv0+1, iv0+iE-iS                          ! vectorized
              k = i+iS-iv0
              kk= IAL(k)
              (Important Computations)
            enddo
          enddo
        enddo
      enddo

  • IWOMP, JUN05

    36

    Simple 3D Cubic Model

    [Figure: (Nx-1)×(Ny-1)×(Nz-1) element cube; boundary conditions
     Ux=0 @ x=Xmin, Uy=0 @ y=Ymin, Uz=0 @ z=Zmin;
     uniform distributed force in the z-direction @ z=Zmin.]

  • IWOMP, JUN05

    37

    Effect of Ordering

  • IWOMP, JUN05

    38

    Effect of Re-Ordering: Matrix Storage Formats

    ● PDJDS/CM-RCM (proposed method)            : long loops, continuous memory access
    ■ PDCRS/CM-RCM (short innermost loop)       : short loops, continuous memory access
    ▲ CRS, no re-ordering                       : short loops, irregular memory access

  • IWOMP, JUN05

    39

    Effect of Re-Ordering: Results on 1 SMP node
    Color #: 99 (fixed). Re-Ordering is REALLY required!

    [Plot: GFLOPS vs. DOF (1.E+04 to 1.E+07) for ● PDJDS/CM-RCM, ■ PDCRS/CM-RCM (short innermost loop), ▲ CRS without re-ordering. The vector-length effect gives roughly a 10x gain and re-ordering a further 100x; PDJDS/CM-RCM reaches 22 GFLOPS, 34% of peak (ideal single-CPU performance: 40%-45%).]

  • IWOMP, JUN05

    40

    Effect of Re-Ordering: Results on 1 SMP node
    Color #: 99 (fixed). Re-Ordering is REALLY required!

    [Plot: GFLOPS vs. DOF for ● PDJDS/CM-RCM, ■ PDCRS/CM-RCM (short innermost loop), ▲ CRS without re-ordering.]

    80x80x80 case (1.5M DOF):
      ● 212 iterations, 11.2 sec.
      ■ 212 iterations, 143.6 sec.
      ▲ 203 iterations, 674.2 sec.

  • IWOMP, JUN05

    41

    Ideal Performance: 48.6% of Peak

  • IWOMP, JUN05

    42

    SMP node # > 10, up to 176 nodes (1,408 PEs)

    Problem size for each SMP node is fixed. PDJDS-CMC/RCM, Color #: 99

  • IWOMP, JUN05

    43

    3D Elastic Model (Large Case)
    256x128x128/SMP node, up to 2,214,592,512 DOF
    ● Flat MPI, ○ Hybrid

    [Plots: GFLOPS rate and parallel work ratio (%) vs. number of SMP nodes (0-192).]

    3.8 TFLOPS for 2.2G DOF on 176 nodes (33.8% of peak)

  • IWOMP, JUN05

    44

    3D Elastic Model (Small Case)
    64x64x64/SMP node, up to 125,829,120 DOF
    ● Flat MPI, ○ Hybrid

    [Plots: GFLOPS rate and parallel work ratio (%) vs. number of SMP nodes (0-192).]

  • IWOMP, JUN05

    45

    3D Elastic Model: Problem Size and Parallel Performance
    8-176 SMP nodes of the Earth Simulator
    ー: Flat MPI, ---: Hybrid

    [Plots: peak performance ratio (%) and GFLOPS vs. total DOF (1.E+06 to 1.E+10), for problem sizes per SMP node of 3x64^3, 3x80^3, 3x100^3, 3x128^3 and 3x256x128x128; node counts from 8 up to 176 SMP nodes.]

  • IWOMP, JUN05

    46

    3D Elastic Model: Problem Size and Parallel Performance
    8-176 SMP nodes of the Earth Simulator
    ー: Flat MPI, ---: Hybrid

    [Plots: iterations vs. total DOF, and the Hybrid/Flat MPI time ratio (%) vs. node count for problem sizes per SMP node of 3x64^3, 3x100^3 and 3x256x128x128; a ratio below 100% means Hybrid is faster, above 100% means Flat MPI is faster.]

  • IWOMP, JUN05

    47

    Hybrid outperforms Flat-MPI

    • when ...
      – the number of SMP nodes (PEs) is large.
      – the problem size/node is small.
    • because Flat-MPI has ...
      – 8 times as many communicating processes
      – twice as large a communication/computation ratio (see the estimate below)
    • The effect of communication becomes significant when the number of SMP nodes (or PEs) is large.
    • Performance estimation by D. Kerbyson (LANL), LA-UR-02-5222
      – relatively large communication latency of the ES

    [Figure: one subdomain of edge N per SMP node in Hybrid vs. subdomains of edge N/2 per PE in Flat MPI.]
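    A rough surface-to-volume estimate (an editorial illustration consistent with the figure above, assuming cubic subdomains of edge N per SMP node in Hybrid and edge N/2 per process in Flat MPI):

      Hybrid:   communication/computation ∝ 6 N^2 / N^3         = 6/N
      Flat MPI: communication/computation ∝ 6 (N/2)^2 / (N/2)^3 = 12/N

    so Flat MPI carries roughly twice the communication/computation ratio, and with 8 MPI processes per SMP node it also issues about 8 times as many messages.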

  • IWOMP, JUN05

    48

    Comm. latency is relatively large on the ES
    D.J. Kerbyson et al., LA-UR-02-5222

                               ES                       ASCI-Q (HP ES45)
    Proc. speed                500 MHz                  1.25 GHz
    SMP nodes                  8 x 640 = 5,120          4 x 3,072 = 12,288
    Peak speed                 8 x 5,120 = 40 TFLOPS    2.5 x 12,288 = 30 TFLOPS
    Memory                     2 x 5,120 = 10 TB        4 x 12,288 = 48 TB
    Peak memory BWTH           32 GB/s                  2 GB/s (L1/L2: 20/30 GB/s)
    Inter-node MPI comm.       Latency: 5.6 μsec        Latency: 5 μsec
                               BWTH: 11.8 GB/s          BWTH: 300 MB/s

  • IWOMP, JUN05

    49

    Performance Estimation for a Finite-Element Type Application on ES
    using a performance model (not real computations)
    D.J. Kerbyson et al., LA-UR-02-5222 (LANL)

    [Plots: breakdown of time (%) into computation/memory, communication latency and communication bandwidth as the processor count grows from 1 to 5,120.]

  • IWOMP, JUN05

    50

    Hybrid outperforms Flat-MPI (cont.)

    • The difference in iteration counts for convergence between Hybrid and Flat MPI is not so significant, thanks to the additive Schwarz domain decomposition.

    ー: Flat MPI, ---: Hybrid
    [Plot: iterations vs. total DOF; the two curves nearly coincide as the SMP node count grows.]

  • IWOMP, JUN05

    51

    Various Platforms

    • Earth Simulator (ES)
      – Earth Simulator Center, JAMSTEC
      – Vector processors
    • Hitachi SR8000
      – The University of Tokyo
      – PowerPC-based architecture
      – Pseudo-vector capability with preload/prefetch
    • IBM SP-3
      – NERSC / Lawrence Berkeley National Laboratory: Seaborg
      – Power3
      – 8 MB L2 cache for each PE

  • IWOMP, JUN05

    52

    Platforms

                              ES                 IBM SP3 "Seaborg"   Hitachi SR8000
                              Earth Simulator    NERSC/LBNL          Univ. Tokyo / Hitachi Ltd.
    PE#/node                   8                  16                  8
    Peak GF/PE                 8.0                1.5                 1.8
    Mem. BW GB/s/node          256                16                  32
    MPI latency μsec           5.6                16.3*               6-20**
    Network BW GB/s/node       12.3               1.00                1.60

    *) Oliker et al.   **) HLRS

  • IWOMP, JUN05 53

    Reordering for the Earth Simulator: Matrix Storage for ILU Factorization

    ● PDJDS/CM-RCM                              : long loops, continuous memory access
    ■ PDCRS/CM-RCM (short innermost loop)       : short loops, continuous memory access
    ▲ CRS, no re-ordering                       : short loops, irregular memory access

  • IWOMP, JUN05 54

    3D Elastic Simulation: Problem Size vs. GFLOPS
    Earth Simulator, 1 SMP node (8 PEs)

    [Plots: GFLOPS vs. DOF (1.0E+04 to 1.0E+07) for ● PDJDS/CM-RCM, ■ PDCRS/CM-RCM (short innermost loop), ▲ CRS without re-ordering; Flat-MPI and OpenMP panels.]

    Flat-MPI: 23.4 GFLOPS, 36.6% of peak
    OpenMP (Hybrid): 21.9 GFLOPS, 34.3% of peak

  • IWOMP, JUN05 55

    3D Elastic Simulation: Problem Size vs. GFLOPS
    Hitachi SR8000-MPP with pseudo-vectorization, 1 SMP node (8 PEs)

    [Plots: GFLOPS vs. DOF for ● PDJDS/CM-RCM, ■ PDCRS/CM-RCM (short innermost loop), ▲ CRS without re-ordering; Flat-MPI and OpenMP panels.]

    Flat-MPI: 2.17 GFLOPS, 15.0% of peak
    OpenMP (Hybrid): 2.68 GFLOPS, 18.6% of peak

  • IWOMP, JUN05 56

    3D Elastic Simulation: Problem Size vs. GFLOPS
    IBM SP3 (NERSC), 1 SMP node (8 PEs)

    [Plots: GFLOPS vs. DOF for ● PDJDS/CM-RCM, ■ PDCRS/CM-RCM (short innermost loop), ▲ CRS without re-ordering; Flat-MPI and OpenMP panels.]

    Cache is well utilized in Flat-MPI.

  • IWOMP, JUN05 57

    3D Elastic Simulation: Single PE Performance
    GFLOPS, 104 colors (DJDS ●, CRS ■)

                                     98,304 DOF       331,776 DOF      786,432 DOF
                                     (=3x32^3)        (=3x48^3)        (=3x64^3)
                                     DJDS    CRS      DJDS    CRS      DJDS    CRS
    Earth Simulator    Hybrid        1.67     -       2.06     -       3.01     -
                       Flat MPI      2.75     -       3.02     -       3.19     -
    Hitachi SR8000     Hybrid        0.404   0.127    0.437   0.140    0.423   0.146
                       Flat MPI      0.366   0.128    0.343   0.138    0.322   0.146
    IBM SP3            Hybrid        0.145   0.167    0.142   0.165    0.137   0.145
                       Flat MPI      0.138   0.165    0.135   0.159    0.126   0.141

  • IWOMP, JUN05 58

    Performance of Intra-Node MPI
    8 PEs: measurement using the SEND/RECV subroutines in GeoFEM

    [Plots for ES, Hitachi SR8k and IBM SP3: elapsed time for 1,000 iterations (sec.), elapsed time without the estimated latency (sec.), and estimated communication bandwidth (GB/sec) vs. message size (1.E+02 to 1.E+05).]

  • IWOMP, JUN05 59

    Performance of Intra-Node MPI
    8 PEs: measurement using the inter-domain SEND/RECV subroutines in GeoFEM

                                    ES        SP3       SR8k
    PE#/node                         8         16         8
    Peak GF/PE                       8.0        1.5       1.8
    Mem. BW GB/s/node              256         16        32
    MPI latency μsec                 5.6       16.3*      6-20**
    Network BW GB/s/node            12.3        1.00      1.60
    Intra-node MPI latency μsec      5.29       8.32      9.17
    Intra-node MPI BW MB/s/PE     1018.        52.5      95.8

    Intra-node communication performance seems reasonable given the memory bandwidth and the inter-node communication performance. (A minimal measurement sketch follows below.)
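    A minimal sketch of the kind of measurement behind these numbers is a simple MPI_SEND/MPI_RECV ping-pong between two ranks placed on the same SMP node, timed with MPI_WTIME. This is an editorial illustration, not the GeoFEM SEND/RECV subroutines themselves.

!     Ping-pong between ranks 0 and 1 inside one SMP node (free-form Fortran).
      program pingpong
      implicit none
      include 'mpif.h'
      integer, parameter :: NWORD= 16384, NITER= 1000
      real(kind=8)       :: BUF(NWORD), T0, T1
      integer            :: ierr, myrank, istat(MPI_STATUS_SIZE), i
      call MPI_INIT (ierr)
      call MPI_COMM_RANK (MPI_COMM_WORLD, myrank, ierr)
      BUF= 0.d0
      T0= MPI_WTIME()
      do i= 1, NITER
        if (myrank.eq.0) then
          call MPI_SEND (BUF, NWORD, MPI_DOUBLE_PRECISION, 1, 0, MPI_COMM_WORLD, ierr)
          call MPI_RECV (BUF, NWORD, MPI_DOUBLE_PRECISION, 1, 0, MPI_COMM_WORLD, istat, ierr)
        else if (myrank.eq.1) then
          call MPI_RECV (BUF, NWORD, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, istat, ierr)
          call MPI_SEND (BUF, NWORD, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, ierr)
        endif
      enddo
      T1= MPI_WTIME()
      if (myrank.eq.0) write (*,*) 'average round-trip time (sec.):', (T1-T0)/NITER
      call MPI_FINALIZE (ierr)
      end program pingpong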

  • IWOMP, JUN05

    60

    Summary: Single-Node Performance, Flat MPI vs. Hybrid

    • Earth Simulator (ES)
      – Flat MPI is slightly better.
    • Hitachi SR8000
      – Similar features to the ES.
      – Hybrid is better for PDJDS/CM-RCM ordering.
      – The difference in single-PE performance between Hybrid and Flat MPI for PDJDS/CM-RCM is significant. Compiler?
    • IBM SP-3
      – The effect of cache (8 MB/PE) is significant.
      – Cache is well utilized in Flat MPI.
      – Performance on larger problems is similar.

  • IWOMP, JUN05

    61

    Speed-Up for Fixed Problem Size/PE
    ● Hybrid, ● Flat MPI (lines: ideal speed-up extrapolated from 1-node performance)

    Earth Simulator, 12.6M DOF/node:
      [Plot: GFLOPS vs. SMP node # (8 PEs/node).]
      2.2G DOF: 3.80 TFLOPS, 33.7% of peak (176 nodes)

    IBM SP-3 (Seaborg at NERSC), 3.0M DOF/node:
      [Plot: GFLOPS vs. SMP node # (8 PEs/node).]
      0.38G DOF: 110 GFLOPS, 7.16% of peak (128 nodes, 8 PEs/node)

  • IWOMP, JUN05

    62

    Speed-Up for Fixed Problem Size/PE
    ● Hybrid, ● Flat MPI (lines: ideal speed-up extrapolated from 1-node performance)

    Earth Simulator and IBM SP-3 (Seaborg at NERSC), 786.4K DOF/node:
      [Plots: GFLOPS vs. SMP node # (8 PEs/node) for each machine.]
      IBM SP-3: 0.10G DOF, 115 GFLOPS, 7.51% of peak (128 nodes, 8 PEs/node)

  • IWOMP, JUN05

    63

    Speed-Up for Fixed Problem Size/PE
    ● Hybrid, ● Flat MPI (lines: ideal speed-up extrapolated from 1-node performance)
    Hitachi SR8000

    [Plots: GFLOPS vs. SMP node # for 6.29M DOF/node and 786.4K DOF/node.]

    6.29M DOF/node — 8.05G DOF, 128 nodes: ● 335 GFLOPS, 18.2% of peak; ● 257 GFLOPS, 14.7%
    786.4K DOF/node — 0.10G DOF, 128 nodes: ● 271 GFLOPS, 14.7% of peak; ● 276 GFLOPS, 15.0%

  • IWOMP, JUN05

    64

    Platforms

                              ES                 IBM SP3 "Seaborg"   Hitachi SR8000
                              Earth Simulator    NERSC/LBNL          Univ. Tokyo / Hitachi Ltd.
    PE#/node                   8                  16                  8
    Peak GF/PE                 8.0                1.5                 1.8
    Mem. BW GB/s/node          256                16                  32
    MPI latency μsec           5.6                16.3*               6-20**
    Network BW GB/s/node       12.3               1.00                1.60

    Ratios noted on the slide (ES : SP3 : SR8k):  12 : 1 : 1.6  and  2.9 : 1 : 1.2-2.7

    *) Oliker et al.   **) HLRS

  • IWOMP, JUN05

    65

    Summary: Performance for Many Nodes, Flat MPI vs. Hybrid

    • Earth Simulator (ES)
      – The effect of MPI latency is significant for:
        • Flat MPI
        • many nodes
        • small problem size/node
    • Hitachi SR8000, IBM SP-3
      – The effect of MPI latency is hidden by their relatively low communication bandwidth.
      – Performance with 128 SMP nodes is identical to the extrapolation of single-node performance, especially if the problem size/node is sufficiently large.

  • IWOMP, JUN05

    66

    • Background
      – GeoFEM Project & Earth Simulator
      – Preconditioned Iterative Linear Solvers
    • Optimization Strategy on the Earth Simulator
      – BIC(0)-CG for Simple 3D Linear Elastic Applications
        • Effect of reordering on various types of architectures
      – PGA (Pin-Grid Array)
      – Contact Problems
      – Multigrid
    • Summary & Future Works

  • IWOMP, JUN05 67

    Elastic Problems with Cubic Model

  • IWOMP, JUN05 68

    Effect of Loop Length (Color #)
    100^3 nodes = 3x10^6 DOF, PDCRS/CM-RCM Re-Ordering
    Iterations for Convergence

    [Plot: iterations (200-400) vs. number of colors (10-1000) for Flat MPI and Hybrid.]

  • IWOMP, JUN05 69

    Effect of Loop Length (Color #)
    100^3 nodes = 3x10^6 DOF, PDJDS/CM-RCM Re-Ordering
    Earth Simulator (single node)

    [Plots: GFLOPS (0-30) and elapsed time (sec., 0-100) vs. number of colors (10-1000) for Flat MPI and Hybrid.]

  • IWOMP, JUN05 70

    Effect of Loop Length (Color #)
    100^3 nodes = 3x10^6 DOF, PDJDS/CM-RCM Re-Ordering
    Earth Simulator (single node)

    [Plots: GFLOPS and elapsed time vs. number of colors for Flat MPI and Hybrid; performance drops at large color counts: smaller vector length.]

  • IWOMP, JUN05 71

    Effect of Loop Length (Color #)
    100^3 nodes = 3x10^6 DOF, PDJDS/CM-RCM Re-Ordering
    Earth Simulator (single node)

    [Plots: GFLOPS and elapsed time vs. number of colors for Flat MPI and Hybrid; the Hybrid curve degrades more strongly: smaller vector length + OpenMP overhead.]

  • IWOMP, JUN05 72

    Many Colors ... a Trade-Off

    • Fast convergence in terms of iteration count (S. Doi et al.)
    • Smaller vector length: the FLOPS rate decreases.
    • Hybrid is much more sensitive to the color number, especially on the Earth Simulator.

  • IWOMP, JUN05 73

    Hybrid is much more sensitive to color numbers!

      do iv= 1, NCOLORS                                     ! colors: sequential
!$omp parallel do private (iv0,j,iS,iE,i,k,kk etc.)
        do ip= 1, PEsmpTOT                                  ! SMP parallel
          iv0= STACKmc(PEsmpTOT*(iv-1)+ip-1)
          do j= 1, NLhyp(iv)
            iS= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip-1)
            iE= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip  )
!CDIR NODEP
            do i= iv0+1, iv0+iE-iS                          ! vectorized
              k = i+iS-iv0
              kk= IAL(k)
              y(i)= y(i) + A(k)*X(kk)                       ! the important computation
            enddo
          enddo
        enddo
      enddo

  • IWOMP, JUN05 74

    Effect of Loop Length (Color #)
    100^3 nodes = 3x10^6 DOF, PDJDS/CM-RCM Re-Ordering
    Hitachi SR8k (single node)

    [Plots: GFLOPS (0-3) and elapsed time (sec., 0-500) vs. number of colors for Flat MPI and Hybrid.]

    Performance is not as sensitive to the color # as on the Earth Simulator. In Flat MPI, performance is actually improved.

  • IWOMP, JUN05 75

    Effect of Loop Length (Color #)
    100^3 nodes = 3x10^6 DOF, PDCRS/CM-RCM Re-Ordering
    IBM SP3 (single node, 8 PEs)

    [Plots: GFLOPS (0-1.5) and elapsed time (sec., 0-1000) vs. number of colors for Flat MPI and Hybrid.]

    Performance is not as sensitive to the color # as on the Earth Simulator.

  • IWOMP, JUN05 76

    Effect of Loop Length (Color #)
    100^3 nodes = 3x10^6 DOF
    IBM SP3 (single node, 8 PEs)

    [Plots: GFLOPS (0-1.5) vs. number of colors for Flat MPI and Hybrid.]

  • IWOMP, JUN05 77

    South-West Japan Model: Contact Problems
    Selective Blocking Preconditioning

  • IWOMP, JUN05 78

    South-West Japan

    [Figure: plate configuration around southwestern Japan: Eurasia, Philippine and Pacific plates.]

  • IWOMP, JUN05 79

    South-West Japan Model
    800K nodes, 2.4M DOF, 1 SMP node (8 PEs)
    ● Flat MPI, ○ Hybrid (OpenMP)

    [Plot: iterations (400-650) vs. number of colors (1-1000).]

  • IWOMP, JUN05 80

    South-West Japan Model
    800K nodes, 2.4M DOF, ES, 1 SMP node (8 PEs)
    ● Flat MPI, ○ Hybrid (OpenMP)

    [Plots: iterations, elapsed time (sec.) and GFLOPS vs. number of colors; at large color counts Flat MPI suffers from the smaller vector length, and Hybrid additionally from OpenMP overhead.]

  • IWOMP, JUN05 81

    South-West Japan Model
    800K nodes, 2.4M DOF, Hitachi SR8000, 1 SMP node (8 PEs)
    ● Flat MPI, ○ Hybrid (OpenMP)

    [Plots: iterations, elapsed time (sec.) and GFLOPS vs. number of colors.]

  • IWOMP, JUN05 82

    South-West Japan Model
    800K nodes, 2.4M DOF, Hitachi SR8000, 1 SMP node (8 PEs)
    ● Flat MPI, ○ Hybrid (OpenMP)

    [Plots (repeated): iterations, elapsed time (sec.) and GFLOPS vs. number of colors.]

  • IWOMP, JUN05 83

    South-West Japan Model
    800K nodes, 2.4M DOF, Hitachi SR8000, 1 SMP node (8 PEs)
    ● Flat MPI, ○ Hybrid (OpenMP)

    [Plots: iterations, elapsed time (sec.) and GFLOPS vs. number of colors; at large color counts: smaller vector length + OpenMP overhead.]

  • IWOMP, JUN05 84

    South-West Japan Model
    800K nodes, 2.4M DOF, Hitachi SR8000, 1 SMP node (8 PEs)
    ● Flat MPI, ○ Hybrid (OpenMP)

    [Plots: iterations, elapsed time (sec.) and GFLOPS vs. number of colors.]  WHY?

  • IWOMP, JUN05 85

    South-West Japan Model
    800K nodes, 2.4M DOF, IBM SP3, 1 SMP node (8 PEs)
    ● Flat MPI, ○ Hybrid (OpenMP)

    [Plots: iterations, elapsed time (sec., 0-1000) and GFLOPS (0-1) vs. number of colors.]

  • IWOMP, JUN05 86

    South-West Japan Model
    800K nodes, 2.4M DOF, IBM SP3, 1 SMP node (8 PEs)
    ● Flat MPI, ○ Hybrid (OpenMP)

    [Plots (repeated): iterations, elapsed time (sec.) and GFLOPS vs. number of colors.]

  • IWOMP, JUN05 87

    South-West Japan Model
    7,767,002 nodes, 23,301,006 DOF, ES, 10 SMP nodes (peak = 640 GFLOPS)
    ● Flat MPI, ○ Hybrid

    [Plot: iterations (3000-3500) vs. number of colors (10-1000).]

  • IWOMP, JUN05 88

    South-West Japan Model
    7,767,002 nodes, 23,301,006 DOF, ES, 10 SMP nodes (peak = 640 GFLOPS)
    ● Flat MPI, ○ Hybrid

    [Plots: GFLOPS (75-200) and elapsed time (sec., 100-250) vs. number of colors.]

  • Complicated Geometries: PGA for Mobile Pentium III

  • IWOMP, JUN05 90

    PGA model: 61 pins, 956,128 elements, 1,012,354 nodes (3,037,062 DOF)

  • IWOMP, JUN05 91

    PGA model: 61 pins, 956,128 elements, 1,012,354 nodes (3,037,062 DOF), 1 SMP node
    PDJDS/MC Re-Ordering
    ● Flat MPI, ○ OpenMP

    [Plot: iterations (400-500) vs. number of colors (10-1000).]

  • IWOMP, JUN05 92

    PGA model: 61 pins, 956,128 elements, 1,012,354 nodes (3,037,062 DOF), 1 SMP node
    PDJDS/MC Re-Ordering, Earth Simulator
    ● Flat MPI, ○ OpenMP

    [Plots: elapsed time (sec., 0-150) and GFLOPS (0-30) vs. number of colors; at large color counts Flat MPI suffers from the smaller vector length, and Hybrid additionally from OpenMP overhead.]

  • IWOMP, JUN05 93

    PGA model: 61 pins, 956,128 elements, 1,012,354 nodes (3,037,062 DOF), 1 SMP node
    PDJDS/MC Re-Ordering, Hitachi SR8000
    ● Flat MPI, ○ OpenMP

    [Plots: elapsed time (sec., 0-800) and GFLOPS (0-3) vs. number of colors.]  WHY?

  • IWOMP, JUN05 94

    PGA model: 61 pins, 956,128 elements, 1,012,354 nodes (3,037,062 DOF), 1 SMP node
    PDJDS/MC Re-Ordering, IBM SP3
    ● Flat MPI, ○ OpenMP

    [Plots: elapsed time (sec., 0-3000) and GFLOPS (0-1) vs. number of colors.]

  • IWOMP, JUN05 95

    PGA model: 61 pins, 956,128 elements, 1,012,354 nodes (3,037,062 DOF), 1 SMP node
    Hitachi SR8000
    ●■ Flat MPI, ○□ OpenMP

    [Plots: GFLOPS (0-3) vs. number of colors for PDJDS/MC and PDCRS/MC.]  WHY?

  • IWOMP, JUN05 96

    Single PE test for the simple cubic model
    786,432 DOF = 3x64^3 nodes, Hitachi SR8000
    ● Flat MPI (PDJDS), ■ Flat MPI (PDCRS), ○ OpenMP (PDJDS), □ OpenMP (PDCRS)

    [Plot: GFLOPS (0-0.5) vs. number of colors.]

    The behavior of Flat MPI with PDJDS looks like that of a scalar processor, while OpenMP/PDJDS looks like a vector processor. A feature (or problem) of the compiler? The pseudo-vector option does not work well in Flat MPI.

  • IWOMP, JUN05 97

    PGA model: 61 pins, 956,128 elements, 1,012,354 nodes (3,037,062 DOF), 1 SMP node
    IBM SP3
    ●■ Flat MPI, ○□ OpenMP

    [Plots: GFLOPS (0-1.5) vs. number of colors for PDJDS/MC and PDCRS/MC.]

  • IWOMP, JUN05 98

    PDCRS is more suitable for scalar processors

    The reduction-type loop of CRS is more suitable for a cache-based scalar processor because of its localized operation. PDJDS provides data locality if the color number increases, so on scalar processors performance may improve as the color number increases.

    PDJDS (diagonal-wise, long vector loop over rows):

      do j= 1, NCON
        do i= 1, NN(j)
          k = index(j-1) + i
          kk= item(k)
          Y(i)= Y(i) + A(k)*X(kk)
        enddo
      enddo

    PDCRS (row-wise reduction into Y(i), localized access):

      do i= 1, N
        k1= index(i-1)+1
        k2= index(i)
        do k= k1, k2
          kk= item(k)
          Y(i)= Y(i) + A(k)*X(kk)
        enddo
      enddo

  • IWOMP, JUN05 99

    PGA model: 61 pins, 956,128 elements, 1,012,354 nodes (3,037,062 DOF), 1 SMP node
    IBM SP3
    ●■ Flat MPI, ○□ OpenMP

    [Plots: GFLOPS (0-1.5) vs. number of colors for PDJDS/MC and PDCRS/MC; performance is slightly improved due to data locality.]

  • IWOMP, JUN05 100

    PGA model: 61 pins, 956,128 elements, 1,012,354 nodes (3,037,062 DOF), 1 SMP node
    IBM SP3
    ●■ Flat MPI, ○□ OpenMP

    [Plots (repeated): GFLOPS vs. number of colors for PDJDS/MC and PDCRS/MC; performance is slightly improved due to data locality.]

  • IWOMP, JUN05 101

    PGA model: 61 pins, 956,128 elements, 1,012,354 nodes (3,037,062 DOF), 1 SMP node
    IBM SP3
    ●■ Flat MPI, ○□ OpenMP

    [Plots: GFLOPS vs. number of colors for PDJDS/MC and PDCRS/MC; the gap between Flat MPI and OpenMP reflects the OpenMP overhead.]

  • IWOMP, JUN05 102

    PGA model: 61 pins, 956,128 elements, 1,012,354 nodes (3,037,062 DOF), 1 SMP node
    Itanium 2 (SGI Altix), 8 PEs
    ● Flat MPI (PDJDS), ■ Flat MPI (PDCRS)

    [Plots: GFLOPS (0-3) and elapsed time (sec., 0-1000) vs. number of colors.]

  • IWOMP, JUN05 103

    PGA model: 61 pins, 956,128 elements, 1,012,354 nodes (3,037,062 DOF), 1 SMP node
    AMD Opteron, 8 PEs
    ● Flat MPI (PDJDS), ■ Flat MPI (PDCRS)

    [Plots: GFLOPS (0-3) and elapsed time (sec., 0-1000) vs. number of colors.]

  • IWOMP, JUN05 104

    Multigrids

  • IWOMP, JUN05

    105

    Multigrid is scalable!

    [Plot (based on LLNL data): time to solution vs. number of processors (problem size scaled with processor count, 10^0 to 10^3); MGCG stays flat (scalable) while ICCG and J2CG grow (unscalable).]

  • IWOMP, JUN05

    106

    Target Application: Thermal Convection between Two Spherical Surfaces

    • Typical geometry in earth science simulations
      – Mantle, core, etc.
      – Also applicable to external flow
    • Boundary conditions
      – r = Rin (= 0.50): u=v=w=0, T=1
      – r = Rout (= 1.00): u=v=w=0, T=0
      – Heat generation

  • IWOMP, JUN05

    107

    Semi-Implicit Pressure Correction Scheme

    • Governing equations
      – Momentum (Navier-Stokes)
      – Continuity
      – Energy
    • Poisson equation for pressure correction derived from the continuity constraint.
    • Parallel MGCG iterative method for the Poisson equations (a V-cycle sketch follows below)
      – V-cycle, geometric multigrid
      – Gauss-Seidel smoothing
      – Semi-coarsening
      – Multilevel communication tables
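    For reference, a minimal recursive V-cycle of the kind used as the preconditioner inside MGCG might look like the sketch below. This is an editorial illustration, not the GeoFEM code: smooth, residual, restrict, prolongate, coarse_solve and the module mg_data (providing LEVmax and the coarse-grid sizes NCOARSE) are assumed names.

!     One V-cycle: improves Z so that A Z ≈ R on level "lev".
      recursive subroutine Vcycle (lev, N, R, Z)
      use mg_data, only: LEVmax, NCOARSE       ! hypothetical grid metadata
      implicit none
      integer, intent(in)         :: lev, N
      real(kind=8), intent(in)    :: R(N)
      real(kind=8), intent(inout) :: Z(N)
      real(kind=8), allocatable   :: RES(:), RC(:), ZC(:)
      integer                     :: NC
      if (lev.eq.LEVmax) then
        call coarse_solve (N, R, Z)            ! solve on the coarsest grid
        return
      endif
      call smooth (lev, N, R, Z)               ! pre-smoothing (Gauss-Seidel)
      allocate (RES(N))
      call residual (lev, N, R, Z, RES)        ! res = r - A z
      NC= NCOARSE(lev+1)
      allocate (RC(NC), ZC(NC))
      call restrict (lev, RES, RC)             ! fine -> coarse (semi-coarsening)
      ZC= 0.d0
      call Vcycle (lev+1, NC, RC, ZC)          ! recurse on the coarse grid
      call prolongate (lev, ZC, Z)             ! add coarse-grid correction to z
      call smooth (lev, N, R, Z)               ! post-smoothing
      deallocate (RES, RC, ZC)
      end subroutine Vcycle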

  • IWOMP, JUN05

    108

    Start from an Icosahedron
    Surface Grid by Adaptive Mesh Refinement (AMR)

    Level 0: 12 nodes, 20 triangles
    Level 1: 42 nodes, 80 triangles
    Level 2: 162 nodes, 320 triangles
    Level 3: 642 nodes, 1,280 triangles
    Level 4: 2,562 nodes, 5,120 triangles

  • IWOMP, JUN05

    109

    Semi-Unstructured Prismatic Grids

    • generated from unstructured surface triangles
    • structured in the normal-to-surface direction
    • flexible
    • suitable for near-wall boundary-layer computation

  • IWOMP, JUN05

    110

    Results on Hitachi SR2201
    320x900 = 288,000 cells/PE, up to 37M DOF on 128 PEs

    [Plots: elapsed time (sec.) and parallel performance (%) vs. PE # (0-128) for ICCG, MGCG/FGS and MGCG/ILU.]

  • IWOMP, JUN05

    111

    How about on the ES?

    • 1 CPU
      – 768,000 elements (1280×600)
      – ICCG : 29 sec. (Xeon 2.8 GHz) ⇒ 169 sec. (ES)
      – MGCG: 58 sec. (Xeon 2.8 GHz) ⇒ 345 sec. (ES)
    • Why?
      – CRS: the loop length is small.
    • Remedy
      – Reordering as done in BILU(0)
        • Gauss-Seidel smoothing: dependency
      – Traditional multicoloring

  • IWOMP, JUN05

    112

    Critical Issue of Multigrid for Vector Computers

    • Shorter Loop-Length for Coarse Grid Level.

  • IWOMP, JUN05

    113

    Vectorization: Re-Ordering, Earth Simulator
    • RCM (Reverse Cuthill-McKee), Multicolor

    768,000 elements, ES, 1 PE
    ○ MGCG(MC), △ MGCG(RCM), ● ICCG(MC), ▲ ICCG(RCM)
    MGCG (original ordering): 345 sec.;  ICCG (original ordering): 169 sec.

    [Plot: elapsed time (sec., 0-6) vs. color # (0-1500); figure of a 5×5 grid colored by the orderings.]

  • IWOMP, JUN05

    114

    GFLOPS and Iterations: 768,000 elements, ES, 1 PE

    [Plots: iterations (10–1000) and GFLOPS (1.0–2.5) vs. number of colors (0–1500) for MGCG(MC), MGCG(RCM), ICCG(MC), ICCG(RCM)]

    • Many colors: fewer iterations for convergence, but lower performance due to shorter loops (especially in multigrid) — see the quick check below.
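    As a back-of-the-envelope check (assuming, purely for illustration, that the unknowns are split evenly over the colors and ignoring the further split over hyperplanes and threads), the average innermost loop length for the 768,000-element case shrinks quickly as the color count grows:

      program loop_length
        implicit none
        integer, parameter :: n = 768000               ! elements in the 1-PE case above
        integer :: ncolor(4) = (/ 10, 100, 500, 1500 /)
        integer :: i
        do i = 1, 4
          print '(a,i5,a,i7)', 'colors =', ncolor(i), '   average loop length =', n/ncolor(i)
        end do
      end program loop_length

    At 1500 colors the average length is only about 512, which is consistent with the drop in GFLOPS even though the iteration count improves.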

  • IWOMP, JUN05

    115

    Scalability, Flat MPI: 768,000 elements/PE, up to 80 PEs

    [Plot: elapsed time (sec., 0–50) vs. DOF (up to 6×10^7) for MGCG(RCM) and ICCG(MC=1500)]

  • IWOMP, JUN05

    116

    Next step: Hybrid

    Many colors are required for reasonably fast convergence, but each color adds OpenMP synchronization overhead.

    Loop structure of the re-ordered kernel (the loop over ip is SMP-parallel; the innermost loop is vectorized):

      do iv= 1, NCOLORS                                    ! loop over colors
    !$omp parallel do private (iv0,j,iS,iE,i,k,kk etc.)
        do ip= 1, PEsmpTOT                                 ! SMP-parallel loop over threads
          iv0= STACKmc(PEsmpTOT*(iv-1)+ip-1)
          do j= 1, NLhyp(iv)                               ! hyperplanes within this color
            iS= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip-1)
            iE= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip  )
    !CDIR NODEP
            do i= iv0+1, iv0+iE-iS                         ! vectorized innermost loop
              k = i+iS-iv0
              kk= IAL(k)
              (Important Computations)
            enddo
          enddo
        enddo
      enddo


  • IWOMP, JUN05

    117

    Hybrid vs. Flat MPI: 8 PEs on the ES (1 SMP node)

    [Plots: sec. per 10 iterations (MGCG) and sec. per 100 iterations (ICCG) vs. number of colors (10–10000), Hybrid vs. Flat MPI]

  • IWOMP, JUN05

    118

    MGCG on the ES

    • Excellent performance obtained by re-ordering: more than 20% of peak even for MGCG (about 75% of the ICCG rate) — attractive for Poisson solvers.
    • Excellent scalability for Flat MPI.
    • Hybrid: low performance when many colors are used, which is significant for MGCG; since many colors are needed for reasonably fast convergence, Hybrid is not recommended here.

  • IWOMP, JUN05

    119

    Hybrid vs. Flat MPI: 8 PEs on the ES (1 SMP node), 1280×600×8 = 6,144,000 elements

    [Plots: sec. per 10 iterations (MGCG) and sec. per 100 iterations (ICCG) vs. number of colors (10–10000), Hybrid vs. Flat MPI]

  • IWOMP, JUN05

    120

    Hybrid vs. Flat MPI: 8 PEs on Hitachi SR8000 (1 SMP node), 1280×600×8 = 6,144,000 elements

    [Plots: sec. per 10 iterations (MGCG) and sec. per 100 iterations (ICCG) vs. number of colors (10–10000), Hybrid vs. Flat MPI]

  • IWOMP, JUN05

    121

    Hybrid vs. Flat MPI: 8 PEs on IBM SP-3 (1 SMP node), 1280×600×8 = 6,144,000 elements

    [Plots: sec. per 10 iterations (MGCG) and sec. per 100 iterations (ICCG) vs. number of colors (10–10000), Hybrid vs. Flat MPI]

  • IWOMP, JUN05

    122

    • Background
      – GeoFEM Project & Earth Simulator
      – Preconditioned Iterative Linear Solvers
    • Optimization Strategy on the Earth Simulator
      – BIC(0)-CG for Simple 3D Linear Elastic Applications
        • Effect of reordering on various types of architectures
      – PGA (Pin-Grid Array)
      – Contact Problems
      – Multigrid
    • Summary & Future Works

  • IWOMP, JUN05 123

    Summary (Earth Simulator)
    Hybrid parallel programming model on an SMP cluster architecture with vector processors, for unstructured grids:

    • Good parallel performance both within and across SMP nodes on the ES: 3.8 TFLOPS for 2.2G DOF on 176 nodes (33.8% of peak).
    • Re-ordering is really required.
    • Simple multicoloring is better for complicated geometries.

  • IWOMP, JUN05 124

    Summary (cont.) (Earth Simulator): Hybrid vs. Flat MPI

    • Flat MPI is better for a small number of SMP nodes.
    • Hybrid is better for a large number of SMP nodes, especially when the problem size is rather small: Flat MPI is bounded by communication, Hybrid by memory; the balance depends on the application, problem size, etc.
    • Hybrid is much more sensitive to the number of colors than Flat MPI, due to the synchronization overhead of OpenMP, especially on the Earth Simulator. This is not so significant on the Hitachi SR8000 and IBM SP-3.

  • IWOMP, JUN05 125

    Summary (cont.) (General): Hybrid vs. Flat MPI

    • On the IBM SP-3, the difference between Flat MPI and Hybrid is not so significant, although Flat MPI is slightly better.
    • On the Hitachi SR8000, pseudo-vectorization works better for Hybrid cases.
    • Effect of colors: on scalar processors (especially for Flat MPI), performance improves as the number of colors increases, due to more efficient utilization of cache.

  • IWOMP, JUN05 126

    Hybrid vs. Flat MPI
    Generally speaking, they are competitive.

    • Flat MPI is better for multigrid-type applications with multicolor ordering.
    • In ill-conditioned problems, Flat MPI may require more iterations.
    • Results also depend on the compiler implementation.

  • IWOMP, JUN05 127

    Hybrid vs. Flat MPI (cont.)

    • Earth Simulator: Flat MPI is better; Hybrid is better for a large PE count with small problem size.
    • IBM SP-3: Flat MPI is better for small problem sizes because the cache is utilized more efficiently.
    • Hitachi SR8000: Hybrid is better (mainly because of compiler features).
    • Careful treatment of the number of colors is required for the Hybrid programming model.

  • IWOMP, JUN05 128

    My current feeling is … Flat MPI is slightly better on the Earth Simulator for FEM-type applications with multicoloring.

  • IWOMP, JUN05 129

    Can Hybrid survive? SAI (Sparse Approximate Inverse) preconditioning

    • Suitable for parallel and vector computation, especially for the Hybrid parallel programming model:
      – only matrix-vector products are required;
      – re-ordering to avoid dependencies is not needed (corresponds to a single color) — a minimal sketch follows.
    • Preconditioning for contact problems:
      – J. Zhang, K. Nakajima et al. (2003) for scalar processors
      – K. Nakajima (2005) for the ES, at WOMPEI 2005
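    To make the "only matrix-vector products" point concrete, here is a minimal sketch with toy tridiagonal data; the program and variable names (sai_apply_demo, irow, jcol, val) are invented for the example and are not the GeoFEM kernels. With SAI, the preconditioning step z = M·r inside the CG iteration is an ordinary sparse matrix-vector product, so every row is independent and the loop is OpenMP-parallel and vectorizable without any coloring:

      program sai_apply_demo
        implicit none
        integer, parameter :: n = 5
        integer :: irow(n+1), jcol(3*n), i, k, nnz
        real(8) :: val(3*n), r(n), z(n)

        ! Toy tridiagonal matrix standing in for the approximate inverse M (CRS format).
        nnz = 0
        do i = 1, n
          irow(i) = nnz + 1
          if (i > 1) then
            nnz = nnz + 1; jcol(nnz) = i-1; val(nnz) = 0.25d0
          end if
          nnz = nnz + 1; jcol(nnz) = i; val(nnz) = 0.50d0
          if (i < n) then
            nnz = nnz + 1; jcol(nnz) = i+1; val(nnz) = 0.25d0
          end if
        end do
        irow(n+1) = nnz + 1

        r = 1.0d0

        ! Preconditioning step of CG with SAI: z = M * r, a plain mat-vec (no coloring needed).
    !$omp parallel do private(k)
        do i = 1, n
          z(i) = 0.0d0
          do k = irow(i), irow(i+1)-1
            z(i) = z(i) + val(k)*r(jcol(k))
          end do
        end do
    !$omp end parallel do

        print *, 'z =', z
      end program sai_apply_demo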

  • IWOMP, JUN05 130

    Matrix-Vector Product ONLY

    !$OMP PARALLEL DO private(iv0,j,iS,iE,i)
      do ip= 1, PEsmpTOT                 ! SMP-parallel loop over threads
        iv0= STACKmc(ip)
        do j= 1, NONdiagTOT              ! loop over off-diagonal bands
          iS= indexS(ip, j)
          iE= indexE(ip, j)
    !CDIR NODEP
          do i= iv0+1, iv0+iE-iS         ! vectorized innermost loop
            (computations)
          enddo
        enddo
      enddo
    !$OMP END PARALLEL DO

  • IWOMP, JUN05 131

    Earth Simulator, Mat-Vec.: 1 SMP node (64 GFLOPS peak), 500 iterations. Hybrid is rather faster!

    [Plot: GFLOPS (10–30) vs. DOF (10^4–10^7), Flat MPI vs. Hybrid]

  • IWOMP, JUN05

    132

    Results: ES (1/2)

    [Plots: elapsed time (sec., set-up and solver) vs. thread count (1–8) for the Large problem (2.47M DOF); speed-up vs. thread count for the Large (2.47M DOF) and Small (0.35M DOF) problems]

    Nakajima, K. (2005), "Sparse Approximate Inverse Preconditioner for Contact Problems on the Earth Simulator using OpenMP", International Workshop on OpenMP: Experiences and Implementations (WOMPEI 2005), Tsukuba, Japan.

  • IWOMP, JUN05

    133

    Results: ES (2/2)

    [Plots: elapsed time (sec.) and GFLOPS vs. number of colors (1–1000) for SB-BIC(0) and SAI, Large (2.47M DOF) and Small (0.35M DOF) problems]

    Nakajima, K. (2005), "Sparse Approximate Inverse Preconditioner for Contact Problems on the Earth Simulator using OpenMP", International Workshop on OpenMP: Experiences and Implementations (WOMPEI 2005), Tsukuba, Japan.


  • IWOMP, JUN05

    135

    Results: Hitachi SR8000 (1/2)

    [Plots: elapsed time (sec., set-up and solver) vs. thread count (1–8) for the Large problem (2.47M DOF); speed-up vs. thread count for the Large (2.47M DOF) and Small (0.35M DOF) problems]

    Nakajima, K. (2005), "Sparse Approximate Inverse Preconditioner for Contact Problems on the Earth Simulator using OpenMP", International Workshop on OpenMP: Experiences and Implementations (WOMPEI 2005), Tsukuba, Japan.

  • IWOMP, JUN05

    136

    Comparison: ES vs. Hitachi SR8k

    [Plots: elapsed time (sec., set-up and solver) and speed-up vs. thread count (1–8) for the ES and the SR8k]

  • IWOMP, JUN05

    137

    Results: Hitachi SR8000 (2/2)

    [Plots: elapsed time (sec.) and GFLOPS vs. number of colors (1–1000) for SB-BIC(0) and SAI, Large (2.47M DOF) and Small (0.35M DOF) problems]

    Nakajima, K. (2005), "Sparse Approximate Inverse Preconditioner for Contact Problems on the Earth Simulator using OpenMP", International Workshop on OpenMP: Experiences and Implementations (WOMPEI 2005), Tsukuba, Japan.

  • IWOMP, JUN05

    138

    Results: IBM SP-3 (1/2)

    [Plots: elapsed time (sec., set-up and solver) vs. thread count (1–16) for the Large problem (2.47M DOF); speed-up vs. thread count for the Large (2.47M DOF) and Small (0.35M DOF) problems]

    Nakajima, K. (2005), "Sparse Approximate Inverse Preconditioner for Contact Problems on the Earth Simulator using OpenMP", International Workshop on OpenMP: Experiences and Implementations (WOMPEI 2005), Tsukuba, Japan.

  • IWOMP, JUN05 139

    Acknowledgements

    • Sponsor: MEXT, Japan.
    • Computational resources:
      – Earth Simulator Center, Japan.
      – NERSC / Lawrence Berkeley National Laboratory, USA (Dr. Horst D. Simon, Dr. Esmond G. Ng).
      – Information Technology Center, The University of Tokyo, Japan.
    • Colleagues of the GeoFEM Project.
    • IWOMP Committee.
