TRANSCRIPT
PERI Tiger Teams FY07 Report
Performance Engineering Research Institute, October 30, 2007
Contact: Bronis R. de Supinski, [email protected]
Tiger Team Process and Milestones
Process from Section 4.3 of Proposal:
Select one or two applications per year
Consult with Office of Science Program Managers
Consist of three to four PERI researchers
Milestones:
Q1: Identify applications and teams for current year
Q1: Report on prior year's teams
Q3: Report progress; reassign as per DOE needs
FY07 Selection Process
Delayed to allow completion of application survey
Received guidance to focus on 3 JOULE metric codes: S3D, GTC, and Chimera
Initial discussions w/ JOULE metric coordinator K. Roche at the SciDAC PI meeting in Atlanta in January 2007
Strong interest from both S3D and GTC
Chimera expressed concerns over their staffing and needs
Narrowed to focus on S3D and GTC in early March
FY07 Tiger Team Formation
Solicited interest in participating on teams
Assignments made by PERI management based on:
Perceived code team needs
Prior engagement activities
Balance of expertise
Participants from six of nine PERI institutions
Also strong participation in both teams by Univ. of Oregon
Coordination: team-specific mailing lists and regular telecons
S3D Tiger Team
Team Lead: Bronis de Supinski (LLNL)
PERI Team Members: John Mellor-Crummey, Mike Fagan (Rice); Nick Wright, Allan Snavely (SDSC); David Bailey (LBNL); Rich Vuduc (LLNL)
Affiliate Team Members: Sameer Shende, Alan Morris, Allen Malony, Kevin Huck (Oregon); Jeff Larkin (Cray/ORNL)
Application Team Participants: Jackie Chen, David Lignell (SNL)
Facilitators: Kenny Roche, Pat Worley (ORNL)
S3D: Direct numerical simulation (DNS) of turbulent combustion
State-of-the-art code developed at CRF/Sandia
2007 INCITE award: 6M hours on XT3/4 at NCCS
Tier 1 pioneering application for the 250TF system
Why DNS? Study micro-physics of turbulent reacting flows
Full access to time-resolved fields
Physical insight into chemistry-turbulence interactions
Develop & validate reduced model descriptions used in macro-scale simulations of engineering-level systems
[Figure: DNS informs physical models, which feed engineering CFD codes (RANS, LES)]
Text and figures courtesy of S3D PI, Jacqueline H. Chen, SNL
S3D - DNS Solver
Solves compressible reacting Navier-Stokes equations
High-fidelity numerical methods:
8th-order finite difference
4th-order explicit RK integrator
Hierarchy of molecular transport models
Detailed chemistry
Multiphysics (sprays, radiation & soot) from SciDAC-TSTC (Terascale Simulation of Combustion)
S3D Parallelization
Fortran90 + MPI
3D domain decomposition: each MPI process manages part of the domain
All processes have the same number of grid points and the same computational load
Inter-processor communication only between nearest neighbors in the 3D mesh: large messages; non-blocking sends & receives (sketch below)
All-to-all communication only required for monitoring & synchronization ahead of I/O
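For illustration, a minimal sketch of this nearest-neighbor exchange pattern, assuming a 3D Cartesian communicator; the buffer names, face packing, and layout are hypothetical and not taken from S3D:

! Sketch only: 3D nearest-neighbor halo exchange with non-blocking
! sends/receives. cart_comm is assumed to be a 3D Cartesian communicator;
! each column of sendbuf/recvbuf holds one packed face.
subroutine halo_exchange(cart_comm, sendbuf, recvbuf, nwords)
  use mpi
  implicit none
  integer, intent(in)  :: cart_comm, nwords
  real(8), intent(in)  :: sendbuf(nwords, 6)
  real(8), intent(out) :: recvbuf(nwords, 6)
  integer :: dim, lo, hi, ierr, nreq
  integer :: reqs(12), stats(MPI_STATUS_SIZE, 12)

  nreq = 0
  do dim = 0, 2
     call MPI_Cart_shift(cart_comm, dim, 1, lo, hi, ierr)
     ! post receives from both neighbors in this dimension
     nreq = nreq + 1
     call MPI_Irecv(recvbuf(:, 2*dim+1), nwords, MPI_DOUBLE_PRECISION, &
                    lo, dim, cart_comm, reqs(nreq), ierr)
     nreq = nreq + 1
     call MPI_Irecv(recvbuf(:, 2*dim+2), nwords, MPI_DOUBLE_PRECISION, &
                    hi, dim, cart_comm, reqs(nreq), ierr)
  end do
  do dim = 0, 2
     call MPI_Cart_shift(cart_comm, dim, 1, lo, hi, ierr)
     ! non-blocking sends of the two packed faces in this dimension
     nreq = nreq + 1
     call MPI_Isend(sendbuf(:, 2*dim+1), nwords, MPI_DOUBLE_PRECISION, &
                    hi, dim, cart_comm, reqs(nreq), ierr)
     nreq = nreq + 1
     call MPI_Isend(sendbuf(:, 2*dim+2), nwords, MPI_DOUBLE_PRECISION, &
                    lo, dim, cart_comm, reqs(nreq), ierr)
  end do
  call MPI_Waitall(nreq, reqs, stats, ierr)
end subroutine halo_exchange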
[Figure: S3D logical topology; per-process communication scales as kN² (subdomain surface) while computation scales as kN³ (volume), so the communication-to-computation ratio falls as 1/N]
Text courtesy of S3D PI, Jacqueline H. Chen, SNL
A Performance Mystery in S3D on PWR4 (SDSC)
The following line of code (and many similar others) has an ~70% L1 hit rate:
diffFlux(:,:,:,n,m) = - Ds_mixavg(:,:,:,n) * ( grad_Ys(:,:,:,n,m) + Ys(:,:,:,n) * grad_mixMW(:,:,:,m) )
Total L2 data cache accesses: 9784.594 M
% accesses from L2 per cycle: 5.112 %
L2 traffic: 1194408.401 MBytes
L2 bandwidth per processor: 9183.869 MBytes/sec
Total load and store operations: 33073.374 M
Number of loads per load miss: 30.527
Number of stores per store miss: 1.014
Number of load/stores per D1 miss: 3.380
L1 cache hit rate: 70.415 %
Performance model provides expectation of 90%...
Discrepancy Understood, Performance Optimized
diffFlux is defined as a pointer: “diffFlux => grad_Ys”
Compiler unrolls the loop suboptimally: it loops over the 2nd index instead of the 1st, i.e., it accesses memory in "nx-size" strides
Alias analysis is not sufficient to allow the "obvious" optimization
Simple fix on IBM systems: use the "-qalias=noaryovrlp" compiler flag
Runtime on 8 PWR4+ 1.5 GHz CPUs, 200 timesteps: 2949 s (before), 2728 s (after)
7.5% improvement, and L1 hit rates are what they should be
Same loops show the expected ~93% L1 hit rate on XT3/4
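For illustration, a sketch of the aliasing issue with simplified shapes and hypothetical sizes (not the actual S3D declarations):

! Sketch of the aliasing issue only; shapes and names simplified.
module flux_alias_demo
  implicit none
  real(8), allocatable, target :: grad_Ys(:,:,:,:,:)
  real(8), pointer             :: diffFlux(:,:,:,:,:)
contains
  subroutine setup(nx, ny, nz, nspec)
    integer, intent(in) :: nx, ny, nz, nspec
    allocate(grad_Ys(nx, ny, nz, nspec, 3))
    ! diffFlux is not separate storage; it aliases grad_Ys, so the
    ! compiler must assume the left- and right-hand sides of the flux
    ! assignment may overlap and generates conservative, strided code.
    ! -qalias=noaryovrlp asserts array assignments have no such overlap.
    diffFlux => grad_Ys
  end subroutine setup
end module flux_alias_demo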
Power4+ profiles showed substantial time in exp in the getrates routine; code examination revealed the calls were not vectorizable
A Perl script transformed getrates from 0% to 50% vectorized, yielding a substantial Power4+ performance improvement:
30% for the getrates routine; approximately 10% overall
Smaller performance improvement on the XT4: approximately 10% for the getrates routine, approximately 1.5% overall; subject of a continuing tuning effort (D. Bailey)
Vectorizing exp for S3D (SDSC)
CPU cycles per call    PWR4+ / 1.5 GHz    Opteron / 2.6 GHz
exp                    160                49
Vectorized exp         8                  31
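For illustration, a sketch of the kind of rewrite the script performs; the variable names are hypothetical, not the actual getrates source:

! Sketch: replace per-iteration scalar exp calls with one array-valued
! exp, which compilers can map onto a vectorized math library.
subroutine rate_constants(nreact, a, e, t, ru, rf)
  implicit none
  integer, intent(in)  :: nreact
  real(8), intent(in)  :: a(nreact), e(nreact), t, ru
  real(8), intent(out) :: rf(nreact)
  real(8) :: arg(nreact)

  ! Before (not vectorized as written):
  !   do i = 1, nreact
  !      rf(i) = a(i) * exp(-e(i) / (ru * t))
  !   end do

  ! After: build the argument array once, then apply exp to the array.
  arg = -e / (ru * t)
  rf  = a * exp(arg)
end subroutine rate_constants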
S3D Performance at the Loop Level (Rice)
Wasted opportunity = ((maximum FLOP rate * cycles) - actual FLOPs) / total waste
The highlighted loop accounts for 11.4% of total program waste
Overall performance (15% of peak): 2.05 x 10^11 FLOPs / 6.73 x 10^11 cycles = 0.305 FLOPs/cycle
S3D: What Opportunities Exist?
[Figure: a 5D loop nest (2D explicit loops over 3D F90 vector syntax) that initializes, updates, and reuses data; the performance problem is that data streams in and out of memory between uses, limiting reuse]
Apply LoopTool to S3D Diffusive Flux Loop
!dir$ uj 3
do m = 1, 3                              ! DIRECTION
!dir$ uj 2
   do n = 1, n_spec-1                    ! SPECIES
!dir$ unswitch 2
      if (baro_switch) then
         ! driving force includes gradient in mole fraction and baro-diffusion:
!dir$ fuse 1 1 1
         diffFlux(:,:,:,n,m) = - Ds_mixavg(:,:,:,n) * ( grad_Ys(:,:,:,n,m)        &
                               + Ys(:,:,:,n) * ( grad_mixMW(:,:,:,m)              &
                               + (1 - molwt(n)*avmolwt) * grad_P(:,:,:,m)/Press ) )
      else
         ! driving force is just the gradient in mole fraction:
!dir$ fuse 1 1 1
         diffFlux(:,:,:,n,m) = - Ds_mixavg(:,:,:,n) * ( grad_Ys(:,:,:,n,m)        &
                               + Ys(:,:,:,n) * grad_mixMW(:,:,:,m) )
      endif

      ! Add thermal diffusion:
!dir$ unswitch 2
      if (thermDiff_switch) then
!dir$ fuse 1 1 1
         diffFlux(:,:,:,n,m) = diffFlux(:,:,:,n,m) - Ds_mixavg(:,:,:,n)           &
                               * Rs_therm_diff(:,:,:,n) * molwt(n) * avmolwt      &
                               * grad_T(:,:,:,m) / Temp
      endif

      ! compute contribution to nth species diffusive flux;
      ! this will ensure that the sum of the diffusive fluxes is zero.
!dir$ fuse 1 1 1
      diffFlux(:,:,:,n_spec,m) = diffFlux(:,:,:,n_spec,m) - diffFlux(:,:,:,n,m)

   enddo    ! SPECIES
enddo       ! DIRECTION
[Figure: original loop nest (m = 1,3 over n = 1,n_spec-1, with branches on baro_switch and thermDiff_switch) annotated with unswitching, controlled-fusion, and unroll-and-jam directives and fed to LoopTool]
Optimization of S3D Diffusive Flux Loop
Transformation Log:
Scalarization (4 stmts)
Loop unswitching (2 conditions)
Fusion (loops within 4 outer nests)
Unroll-and-jam (2 loops)
Peeling excess iterations (4 nests)
2.94x faster than original (6.7% total savings)
[Figure: transformed code structure; the baro_switch/thermDiff_switch tests are unswitched into four branches, each containing fused loops unrolled by two (n = 1, nspec-2, 2); the source grows from 35 lines to 445 lines]
S3D: An Unexpected Bottleneck
An implicit loop that copies a non-contiguous 4D slice of 5D data to contiguous storage accounted for 5.4% of total time
Adjusting routine interfaces to avoid the copy made this code 100% faster
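For illustration, a sketch of the kind of interface change involved; the array names and shapes are hypothetical, not the S3D source:

! Sketch only.  Passing a non-contiguous section such as q(:,:,:,n,:) to
! a routine with an explicit-shape dummy argument forces the compiler to
! copy the slice to contiguous scratch storage (and back):
!   call old_interface( q(:,:,:,n,:) )     ! implicit copy in and out
! Passing the whole array plus the index avoids the copy:
subroutine new_interface(q, nx, ny, nz, nspec, n)
  implicit none
  integer, intent(in)    :: nx, ny, nz, nspec, n
  real(8), intent(inout) :: q(nx, ny, nz, nspec, 3)
  integer :: m
  do m = 1, 3
     ! operate directly on q(:,:,:,n,m) in place; no temporary copy
  end do
end subroutine new_interface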
S3D Node Performance Tuning Summary
More opportunities remain:
Register reuse and tiling of stencil computations
Inlining + fusion + array contraction of temporary variables
Further improvements require more changes; lots of potential smaller improvements
Enabling technologies contributions:
HPCToolkit enabled identifying and assessing bottlenecks
LoopTool helped automate tedious code transformations
Achieved ~12.7% overall improvement:
Node performance increased from 15% of peak to 17.4%
Estimated savings for a 2M CPU-hour run: 254K CPU hours
S3D Scaling Performance (App Team)
[Figure: S3D performance on the XT at NCCS; cost per grid point per time step (in microseconds) vs. number of cores (1 to 100,000); series: XT3, XT4, XT7]
S3D Scaling Study (Oregon)
Harness test case
Platform: Jaguar, the combined Cray XT3/XT4 at ORNL
Several runs to identify scaling trends; focus on 6400p
Evaluate impact of combined XT3/XT4 nodes
Performance evaluation of MPI_Wait
Study mapping of MPI ranks to nodes
[Figure: total runtime breakdown by event (time); MPI_Wait and WRITE_SAVEFILE* dominate at scale. *Recent analysis indicates WRITE_SAVEFILE is not a scaling issue]
TAU ParaProf profile: MPI_Wait times exhibit two equivalence classes
The same equivalence classes are also seen in memory-bandwidth-intensive computation routines
S3D Scaling Study Conclusion
Determined that XT3 nodes slowed certain S3D routines:
Consistent across all XT3 nodes
Memory-bandwidth-limited routines
Suggested load-balancing optimization: reduce grid size in one dimension for XT3 nodes
Not yet implemented due to concerns over long-term relevance
Provided estimate of benefit for combined XT3/XT4 runs
Many scaling and single-node results appear in the S3D IOP paper
Projected Heterogeneous Scaling (LLNL)
[Figure: projected cost per grid point per time step (in microseconds) vs. proportion of XT4 nodes (0 to 1)]
S3D Modeling Results & Future Directions
PMaC predictions for S3D on XT3 and XT4:
Currently within 15% for an 8-CPU run
Extending to larger CPU counts
Working on improving accuracy
What is the expected performance of S3D on ORNL’s 250 TFLOP machine?
Will our optimizations benefit a quad-core system? It has a different cache structure:
L2: 1 MB → 512 KB
L3: none → 2 MB shared
What architecture will S3D perform best on?
GTC Tiger Team
Team Lead: Shirley Moore (UTK)
PERI Team Members: Haihang You (UTK); John Mellor-Crummey, Gabriel Marin, Guohua Jin (Rice); Hongzhang Shan (LBNL)
Affiliate Team Members: Kevin Huck (UOregon); Ed D'Azevedo (ORNL); Lenny Oliker (LBNL)
Application Team Participants: Stephane Ethier, Weixing Wang, Wei-li Lee (PPPL); Scott Klasky (ORNL)
Facilitators: Kenny Roche, Pat Worley (ORNL); Bronis de Supinski (LLNL)
GTC: Gyrokinetic Toroidal Code from PPPL
Particle in Cell (PIC) code with gyrokinetic simulation
GTC-S: "shaped" code that more realistically represents experimentally relevant geometry
GTC-P: a new "petascale" version that partitions the poloidal plane into radial shells
Fortran 90 and MPI; PETSc used for Poisson solves
Currently no OpenMP in GTC-S or GTC-P; OpenMP may be considered for multicore
Code team science goals:
Impact of turbulent transport in burning plasma fusion devices
Integrated simulations of ITER plasmas for a range of temporal and spatial scales
The Gyrokinetic Toroidal Code
3D particle-in-cell code to study microturbulence in magnetically confined fusion plasmas
Solves the gyro-averaged Vlasov equation
Gyrokinetic Poisson equation solved in real space
Low-noise δf method
Global code (full torus as opposed to only a flux tube)
Massively parallel: typical runs use 1024+ processors
Electrostatic (for now…)
Nonlinear and fully self-consistent
Written in Fortran 90/95
Originally optimized for superscalar processors
Particle-in-Cell (PIC) Method
Particles sample distribution function.
The particles interact via a grid, on which the potential is calculated from deposited charges.
The PIC steps (a trivialized sketch follows below):
• "SCATTER", or deposit, charges on the grid (nearest neighbors)
• Solve the Poisson equation
• "GATHER" forces on each particle from the potential
• Move particles (PUSH)
• Repeat…
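For illustration, a deliberately trivialized, self-contained sketch of this cycle (1D, nearest-grid-point deposition, placeholder field solve); it is not GTC's algorithm or data layout:

! Schematic PIC cycle: scatter, solve, gather, push.  Everything here is
! simplified for illustration only.
program pic_sketch
  implicit none
  integer, parameter :: ng = 64, np = 1024, nsteps = 10
  real(8), parameter :: dt = 0.1d0, dx = 1.0d0
  real(8) :: x(np), v(np), rho(0:ng-1), e(0:ng-1)
  integer :: istep, ip, ig

  call random_number(x)
  x = x * ng * dx
  v = 0.0d0
  do istep = 1, nsteps
     ! SCATTER: deposit charge on the nearest grid point
     rho = 0.0d0
     do ip = 1, np
        ig = modulo(int(x(ip)/dx), ng)
        rho(ig) = rho(ig) + 1.0d0
     end do
     ! SOLVE: placeholder only; a real code solves the Poisson equation here
     e = rho - sum(rho)/ng
     ! GATHER + PUSH: interpolate field to particles and advance them
     do ip = 1, np
        ig = modulo(int(x(ip)/dx), ng)
        v(ip) = v(ip) + dt * e(ig)
        x(ip) = modulo(x(ip) + dt * v(ip), ng*dx)
     end do
  end do
end program pic_sketch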
Charge Deposition for Charged Rings: 4-Point Average Method
[Figure: the charge deposition step (SCATTER operation) in classic PIC vs. the 4-point average gyrokinetic method (W.W. Lee) used in GTC]
Point-charge particles are replaced by charged rings due to gyro-averaging
Application Team’s Flagship Code: The Gyrokinetic Toroidal Code (GTC)
Fully global 3D particle-in-cell (PIC) code in toroidal geometry
Developed by Prof. Zhihong Lin (now at UC Irvine)
Used for nonlinear gyrokinetic simulations of plasma microturbulence
Fully self-consistent
Uses magnetic field-line-following coordinates [Boozer, 1981]
Guiding center Hamiltonian [White and Chance, 1984]
Non-spectral Poisson solver [Lin and Lee, 1995]
Low numerical noise algorithm (δf method)
Full torus (global) simulation
Scales to a very large number of processors
Excellent theoretical tool!
GTC Mesh and Geometry
Field-line-following coordinates (α = θ − ζ/q) save a factor of about 100 in CPU time
[Figure: poloidal plane (cross-section) unstructured mesh, partitioned among Processors 0-3]
New GTC Codes Use a New Parallel Model: Domain Decomposition + Particle Splitting
1D domain decomposition: several MPI processes can now share a section of the torus
Particle splitting method: the particles in a toroidal section are equally divided between several MPI processes (see the sketch below)
Particles randomly distributed between processors within a toroidal domain
Pure MPI version, but OpenMP is still there… for multicore?
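For illustration, a sketch of how such a decomposition might be set up with MPI_Comm_split; the domain count and rank-to-domain mapping are hypothetical, not GTC's actual scheme:

! Sketch only: split MPI_COMM_WORLD so that several processes share one
! toroidal domain, with particles divided among them.
program split_sketch
  use mpi
  implicit none
  integer :: ierr, world_rank, world_size
  integer :: ntoroidal, my_domain
  integer :: domain_comm, domain_rank, domain_size

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, world_rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, world_size, ierr)

  ntoroidal = 64                           ! assumed example value
  my_domain = mod(world_rank, ntoroidal)   ! hypothetical 1D toroidal mapping

  ! All processes with the same color share one section of the torus;
  ! within that communicator, each rank owns an equal share of particles.
  call MPI_Comm_split(MPI_COMM_WORLD, my_domain, world_rank, domain_comm, ierr)
  call MPI_Comm_rank(domain_comm, domain_rank, ierr)
  call MPI_Comm_size(domain_comm, domain_size, ierr)
  ! e.g. particles_per_rank = particles_in_domain / domain_size

  call MPI_Finalize(ierr)
end program split_sketch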
New Version (GTC-S) Inputs: Experimental Equilibrium and Profiles
Original GTC has flat temperature and density to set the scale for the gyroradius and the grid, and an analytical gradient for the turbulence drive
GTC-S uses experimental profiles and plasma boundary extracted from the experimental database by using the widely-used TRANSP tool (http://w3.pppl.gov/transp/)
The magnetic equilibrium is calculated from the profiles and boundary by using ESC or JSOLVER
Spline coefficients are calculated for the equilibrium and profiles to allow interpolations at the particle positions
New Grid Follows Change in Gyro-radius with Temperature Profile
Local gyro-radius proportional to temperature
Evenly spaced radial grid in a new coordinate χ, with dχ ∝ dr / ρ_i(T(r)), so that the mesh spacing follows the local gyro-radius
[Figure: original GTC circular grid with flat temperature vs. new GTC-S grid that follows T(r)]
Poloidal Component of B Field Taken into Account for Gyro-orbit
For large-aspect ratio circular concentric cross-section, the difference between a poloidal plane and a gyro-plane is neglected.
A more accurate treatment is used here for general geometry.
Projection of gyro-plane on poloidal plane results in elliptic orbit.
4-point average method uses ellipse
GTC Performance Issues
Three basic operation types govern PIC performance:
Grid work (i.e., the Poisson solve)
Particle processing (e.g., position and velocity updates)
Interpolation between the two (i.e., charge deposition and field calculation in particle pushing)
Main GTC performance bottleneck is the charge deposition, or scatter, operation
True of most PIC codes
More complex in GTC due to fast gyrating particles: motion described by charged rings tracked by their guiding center
More GTC Performance Issues
Some scaling issues with GTC-P relative to expectations: time doubles when it should stay flat
Load imbalance in particle push routine, apparently due to variation in TLB misses
179% speedup going from single- to dual-core mode; main computational kernels are not memory-bandwidth bound
Warning: as the number of cores increases, other routines that show slowdown on dual core may start to dominate
Status of GTC Tiger Team Effort
PERI Application Survey completed
Several conference calls w/application team participants
GTC and GTC-S code versions released to Tiger Team and Performance Database WG members on request
Awaiting release of GTC-P code to investigate poor scaling and load imbalance issues
Profiling of GTC-S carried out on Jaguar using TAU; data accessible in password-protected PerfDMF database
Optimization of charge deposition by UTK
Detailed modeling, analysis, and optimization of GTC-S by Rice; brief summary follows, details in a submitted paper
TAU Profile Showing Weak Scaling of GTC-S on Jaguar
Hand optimization of Charge Deposition (UTK)
Hand-tuning techniques: common subexpression elimination, code movement, loop unrolling, cache blocking (illustrative sketch below)
Improved performance of chargei by ~10%
Changes incorporated into GTC-S code
Written up as success story for Fred Johnson
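For illustration, a generic sketch of the first two techniques (common subexpression elimination plus a 2-way unroll) on a deposition-style loop; the names are hypothetical and this is not the actual chargei source:

! Generic sketch of the hand-tuning techniques above; assumes n is even.
subroutine deposit_sketch(n, w, psi, dpsi, rho)
  implicit none
  integer, intent(in)    :: n
  real(8), intent(in)    :: w(n), psi(n), dpsi
  real(8), intent(inout) :: rho(0:*)
  integer :: i, j
  real(8) :: inv_dpsi, frac

  inv_dpsi = 1.0d0 / dpsi          ! common subexpression hoisted out of the loop
  do i = 1, n - 1, 2               ! loop unrolled by two
     j    = int(psi(i)   * inv_dpsi)
     frac = psi(i)*inv_dpsi - j
     rho(j)   = rho(j)   + w(i)*(1.0d0 - frac)
     rho(j+1) = rho(j+1) + w(i)*frac
     j    = int(psi(i+1) * inv_dpsi)
     frac = psi(i+1)*inv_dpsi - j
     rho(j)   = rho(j)   + w(i+1)*(1.0d0 - frac)
     rho(j+1) = rho(j+1) + w(i+1)*frac
  end do
end subroutine deposit_sketch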
Modeling, Analysis, and Optimization of GTC at Rice
Detailed modeling of computation and memory hierarchy performance of GTC-S using Rice modeling toolkit
Identified opportunities for data and loop transformations
Transformations improved program node performance by 33% on Itanium2 and 13% on Opteron 275
Changes sent to Stephane Ethier; awaiting response
GTC-S Memory Hierarchy Performance - I
• Total L3 miss count
• L3 cache misses due to fragmentation of data in cache lines: 14.4% of total
GTC-S suffers from poor spatial locality due to data layout
Model L3 cache miss counts for individual arrays at the loop level
particle_array is an alias to array zion used in gcmotion
Fragmentation of arrays zion (AKA particle_array) and zion0 accounts for:
– 95% of all L3 fragmentation misses
– 48% of all misses to the zion arrays
– 13.7% of total L3 cache misses
Solution: transpose particle arrays zion and zion0, transforming arrays of structures into structures of arrays (see the sketch below)
(values predicted for 64 radial grid points and 15 particles/cell)
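For illustration, a sketch of the layout change; the attribute count and the loop body are hypothetical, not GTC's actual zion usage:

! Sketch of the array-of-structures -> structure-of-arrays transpose.
! In Fortran's column-major layout the first index varies fastest, so
!   zion(nattr, mi)  keeps the nattr attributes of one particle adjacent
!   zion(mi, nattr)  keeps one attribute contiguous across all particles
subroutine push_soa(mi, zion, dt)
  implicit none
  integer, intent(in)    :: mi
  real(8), intent(inout) :: zion(mi, 7)   ! transposed layout; 7 attributes assumed
  real(8), intent(in)    :: dt
  integer :: i
  ! Sweeping all particles for a few attributes is now unit stride;
  ! with zion(7, mi) the same loop strides through memory by 7.
  do i = 1, mi
     zion(i, 1) = zion(i, 1) + dt * zion(i, 4)
  end do
end subroutine push_soa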
GTC-S Memory Hierarchy Performance - II
Understanding spatial and temporal data reuse patterns in GTC-S
Figure below: program scopes carrying > 2% of L3 cache misses
Carried misses are non-compulsory misses (capacity + conflict misses)
The carrying scope is the innermost dynamic scope in which the data is reused
Two loops in main carry 40% of all L3 carried misses; misses cannot be removed.
21.4% of misses are carried by the iterative loop of the Poisson solver. A recurrence in the solver prevents transformations.
Focus on routines chargei and pushi:
Fuse the two main loops in chargei
Apply tiling and fusion over several loop nests in pushi
(values predicted for 64 radial grid points and 15 particles/cell)
GTC-S Memory Hierarchy Performance - III
Pinpointing and reducing TLB misses
do kz = 1, mzbig
   wz = real(kz) / real(mzbig)
   zdum = zetamin + deltaz * (real(k-1) + wz)
   do i = idiag1, idiag2
      ii = igrid(i)
      do j = 1, mtdiag
         phiflux(kz + (k-1)*mzbig, j, i) = ...
      enddo
   enddo
enddo
(values predicted for 64 radial grid points and 15 particles/cell)
The outer loop kz iterates over the inner dimension of phiflux
Fix: interchange loop kz to the innermost position
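For illustration, the same fragment with kz moved innermost (the elided right-hand side is left elided); consecutive inner iterations now touch consecutive elements of phiflux's first dimension:

do i = idiag1, idiag2
   ii = igrid(i)
   do j = 1, mtdiag
      do kz = 1, mzbig
         wz   = real(kz) / real(mzbig)
         zdum = zetamin + deltaz * (real(k-1) + wz)
         phiflux(kz + (k-1)*mzbig, j, i) = ...   ! unit stride in kz
      enddo
   enddo
enddo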
Additional transformations:
Apply unroll & jam to increase ILP in routine spcpft
Transform arrays used in the Poisson solver to improve spatial locality
GTC-S Performance Improvements on Itanium2
Percentages represent incremental improvements for each transformation
Results for 10 and 100 particles/cell
Transformation       L2 misses (%)    L3 misses (%)    TLB misses (%)    Execution time (%)
+zion transpose      -27 / -37.4      -30.9 / -39.1    -10.6 / -81.6     -6.9 / -11.7
+chargei fusion      -3.9 / -6.2      -4.3 / -6.8      -1.6 / -3.1       -11.4 / -18.2
+spcpft U&J          0 / 0            +0.1 / +0.4      -0.1 / -0.4       -11.3 / -1.9
+poisson transf.     -6.6 / -1        -6.4 / -1.3      -1.4 / +2         -7.4 / -1.4
+smooth LI           -3 / -0.4        -2.4 / -0        -63.9 / -3.6      -0.7 / 0
+pushi tile/fuse     -8.9 / -13.3     -10.9 / -16      -3.4 / -9.4       +0.6 / +0.8
Total                -49.4 / -58.3    -54.8 / -63      -81.1 / -96.2     -37.3 / -32.4
(values shown for 10 / 100 particles per cell)
Itanium2 has 16KB dedicated instruction cache. Improvements in data locality negated by increase in instruction cache misses. Bigger impact expected with larger instruction cache, e.g. Montecito.
Side effect: big reduction in unnecessary data prefetches inserted by Intel compiler
GTC-S Performance Improvements on Opteron
Issues:
Hardware prefetcher crucial for performance on Opteron; prefetcher tracks up to 20 parallel data streams
Zion transpose increases the number of parallel streams in key loops, reducing the effectiveness of the hardware prefetcher
Data reuse improvements are negated by a higher number of non-prefetched memory accesses
Approach:
Reorganize five arrays in pushi as one array (see the sketch below)
Reorganize fourteen arrays in gcmotion as four arrays
Result: improves execution time on Opteron by 13%; reduces cache and TLB misses by > 50%
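For illustration, a sketch of the pushi reorganization; the names and the loop body are hypothetical:

! Sketch: five separate per-particle arrays (wp1..wp5) become one
! interleaved array, so the loop presents one sequential stream to the
! hardware prefetcher instead of five.
subroutine pushi_streams(mi, wp)
  implicit none
  integer, intent(in)    :: mi
  real(8), intent(inout) :: wp(5, mi)   ! wp(1:5, i): five values of particle i
  integer :: i
  do i = 1, mi
     ! all five values for particle i are adjacent in memory
     wp(5, i) = wp(1, i) + wp(2, i) + wp(3, i) + wp(4, i)
  end do
end subroutine pushi_streams

This interleaved layout trades per-attribute unit stride for fewer streams, the balance between prefetcher friendliness and the zion transpose described above.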
Exploring Run-time Data Reordering at Rice
Issue: performance degrades during GTC execution as particles become disordered w.r.t. the underlying tokamak grid
Preliminary study: particle reordering improves temporal locality during charge deposition and particle pushing
Currently developing on-line feedback and control mechanism for particle reordering
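For illustration, a sketch of one way to reorder particles by grid cell (a stable counting sort); the data layout is hypothetical and only a single attribute is shown:

! Sketch only: counting-sort particles by grid cell index so consecutive
! particles touch nearby grid points during deposition and pushing.
subroutine reorder_particles(mi, ncell, cell, attr, attr_sorted)
  implicit none
  integer, intent(in)  :: mi, ncell
  integer, intent(in)  :: cell(mi)        ! grid cell of each particle (1..ncell)
  real(8), intent(in)  :: attr(mi)        ! one particle attribute, for brevity
  real(8), intent(out) :: attr_sorted(mi)
  integer :: offset(ncell+1), pos(ncell), i, c

  ! histogram of particles per cell
  offset = 0
  do i = 1, mi
     offset(cell(i)+1) = offset(cell(i)+1) + 1
  end do
  ! prefix sum: offset(c) = number of particles in cells below c
  do c = 2, ncell+1
     offset(c) = offset(c) + offset(c-1)
  end do
  ! scatter particles into cell order
  pos = offset(1:ncell)
  do i = 1, mi
     c = cell(i)
     pos(c) = pos(c) + 1
     attr_sorted(pos(c)) = attr(i)
  end do
end subroutine reorder_particles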
What Worked
Close interactions with multiple members of app teams
Tiger Team-specific mailing list for S3D:
Generated team-wide comments, tapping more expertise
Not used very much for GTC
Large distributed teams worked (somewhat surprising):
Avoided duplication of measurement effort
University of Oregon participation as affiliate was exemplary
Rice participation was also exemplary
Publications and publicity:
S3D science-focused IOP paper
SciDAC Review paper & SciDAC conf. presentation (Mellor-Crummey)
GTC success story for Fred Johnson
What Didn’t
Timing of application selection:
Not finalized until halfway through the fiscal year; delayed by survey
OK for first year; future implications?
Long Jaguar down time soon after teams formed
Initial understanding of code distribution:
Provided through the JOULE process, NOT directly from the application teams
An appropriate distribution mechanism, but unsettling to application teams
Frequent, ongoing concern of application teams
Will always start with the application team in future, regardless of reason for selection or appropriateness of distribution
Mechanism for providing improvements back to application team:
Slow and cumbersome; no CVS access
May not be solvable due to application team need for internal control
Addressed by repeated direct interactions
FY08 Tiger Team Issues and Proposed Solutions
Which applications will be the focus of FY08 Tiger Teams?
Guidance from HQ requested; recommend one XT4-focused team, one BG/P-focused team
Is the JOULE precedent to continue? Expect timing to be similar to FY07 (January/March), if maybe a little sooner
Plan to continue work with S3D and GTC during the FY08 selection process
Solves the late Q2 decision, one of FY07's biggest issues
Suggest elimination of the Q3 reassessment milestone in light of timing
What happens to teams from the previous year?
Application tuning does not respect fiscal year boundaries
Good relationships established; don't want to lose them
Are liaison activities sufficient to maintain them?
Plan to slowly devolve FY07 teams into very active liaison activities
Use different participants for FY08 teams in order to balance staffing requirements
How do we ensure that the results are publicized?
Initial S3D paper is good; potential for more
GTC success story is good; plan a similar one for S3D
Continued interactions will support solving this question